Dr. Chaochun Wei at Shanghai Jiao Tong University

NeSSM: a Next-generation Sequencing Simulator for Metagenomics

I. Introduction
    NeSSM is a tool that generates Next-Generation Sequencing (NGS) reads according to parameters set by the user. Its goal is to generate metagenome sequencing reads that closely resemble real data. Currently, the 454 and Illumina sequencing platforms are supported. NeSSM can help in developing methods or systems for metagenomics analysis.

II. System requirements
    Linux operating system with 1 GB of memory or more; Perl 5.8.5 or later and gcc 4.1.2 or later. To run the GPU version of NeSSM, CUDA 4.0 or later is required (**).

    Download required tools or drivers:
CUDA driver, which can be downloaded from the NVIDIA website: http://www.nvidia.com
BWA, which can be downloaded from the BWA website: http://bio-bwa.sourceforge.net/
Samtools, which can be downloaded from the Samtools website: http://samtools.sourceforge.net/
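
A quick way to see which of the tools above are already installed is sketched below. It only checks that each command is on the PATH and does not verify versions. The helper name "have" is made up for this sketch, and "nvcc" (the compiler shipped with the CUDA toolkit) is an assumed indicator for CUDA that only matters for the GPU version.

```shell
# Report which of the required tools are on the PATH.
# "have" is a hypothetical helper; nvcc is only needed for the GPU version.
have() {
    command -v "$1" >/dev/null 2>&1
}

for tool in perl gcc make bwa samtools nvcc; do
    if have "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

The versions of the tools that are found can then be compared by hand with "perl -v" and "gcc --version".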

III. Files and Directories
    This system is packaged in one file, "NeSSM.tarz", which is available for download. The important files inside this tar file are listed below.

|-NeSSM_CPU.c	The CPU version of NeSSM.
|-NeSSM_GPU.cu	The GPU version of NeSSM.
|-composition-table.pl	A Perl script to analyze the composition of a real metagenome dataset.
|-simulation.config	A configuration file for NeSSM.
|-mk_index.pl	A Perl script to make the index file.
|-complete_update_step.pl	A Perl script to download the genome sequences from NCBI.
|-NCBI_PowerScripting.pm	A Perl module for the NCBI database connection.
|-quality-value.pl	A Perl script to obtain the distribution of quality values at every base from FASTQ files.
|-error-type.pl	A Perl script to obtain the error model from BWA result files.
|-coverage-bias.pl	A Perl script to estimate the sequencing coverage bias from a BWA result file.
|-startcuda.sh	A shell script to initialize CUDA.

IV. Install NeSSM
After you download the tarball, you can install NeSSM as follows.
1. Unpack NeSSM.tarz

tar -xzf NeSSM.tarz

2.1. If you want to use the CPU version:

cd NeSSM/NeSSM_CPU/

make

2.2. If you want to use the GPU version:

cd NeSSM/NeSSM_GPU/

make

V. Run sequencing simulation

1. Download the NCBI genome database:

If you do not have the NCBI genome database, use "complete_update_step.pl" to download it. If you already have the genome database, you can skip this step.

cd NeSSM/scripts/

perl complete_update_step.pl string1

string1: the directory to store the genome database

for example: perl complete_update_step.pl NeSSM/example/data/

2. Generate the index file:

The index file records each genome's name, length, path and so on. It is used both in simulation and in analyzing the composition of a metagenome.

cd NeSSM/scripts/

perl mk_index.pl string1 string2

string1: the directory of the whole database (generated in step 1)

string2: the directory to store the index file generated by this script

for example: perl mk_index.pl NeSSM/example/data/ NeSSM/example/

The "index" file is generated in the directory NeSSM/example/

ATTENTION: the database directory given in string1 must be an absolute path!
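
Since string1 must be absolute, a relative directory can be resolved first. The following is a minimal sketch; "abs_dir" is a hypothetical helper name, and the trailing slash simply mirrors the example above.

```shell
# Resolve a directory to its absolute path (the directory must exist).
# "abs_dir" is a hypothetical helper, not part of NeSSM.
abs_dir() {
    (cd "$1" && pwd)
}

# Possible usage, assuming the database was downloaded to NeSSM/example/data/:
# perl mk_index.pl "$(abs_dir NeSSM/example/data)/" NeSSM/example/
```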

3. Create a composition structure table:

The composition structure table lists the genomes in a metagenome by name together with their abundances. Here the abundance can be the proportion of an organism (based on its read number). There are two ways to obtain the composition structure table.
3.1. Supply the composition structure table directly. If the abundance is the proportion of reads, make sure that the sum of all abundances is one. If the abundance is the proportion of organisms, use "adjust.pl" to adjust the table.

cd NeSSM/scripts/

perl adjust.pl string1 string2

string1: the composition structure table supplied by the user

string2: the index file generated in step 2

for example: perl adjust.pl NeSSM/example/percentage.txt NeSSM/example/index

The "new-percentage.txt" file is generated in the directory NeSSM/scripts/
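
For a table whose abundances are already read proportions, the sum-to-one condition from 3.1 can be checked before simulation. The sketch below assumes one genome per line with the abundance in the last whitespace-separated field; the real table layout is not described here and may differ. The function name "check_abundance_sum" and the tolerance of 1e-6 are also choices made for this sketch.

```shell
# Succeed when the last field of every non-empty line sums to 1 (within 1e-6).
# Assumption: the abundance is the last whitespace-separated field of each line.
check_abundance_sum() {
    awk '
        NF > 0 { sum += $NF }
        END {
            diff = sum - 1
            if (diff < 0) diff = -diff
            printf "sum of abundances: %.6f\n", sum
            if (diff > 1e-6) exit 1
        }
    ' "$1"
}

# Possible usage:
# check_abundance_sum NeSSM/example/percentage.txt || echo "abundances do not sum to one"
```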

3.2. Supply a metagenome dataset. The "composition-table.pl" script generates a composition structure table from the metagenome data.

First, use BWA to map the metagenome data. For reads shorter than 200 bp, the recommended BWA algorithm is "is" (the recommended parameters are "-I -N" in the "aln" step and "-n 100" in the "samse/sampe" step). For reads longer than 200 bp, the "bwasw" algorithm is better.

Then use "composition-table.pl" to analyze the BWA result.

cd NeSSM/scripts/

perl composition-table.pl string1 string2

string1: the index file generated in step 2

string2: the BWA result

for example: perl composition-table.pl NeSSM/example/index NeSSM/example/example.sam

The "percentage.txt" file is generated in the directory NeSSM/scripts/
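
The BWA mapping step in 3.2 might look like the commands printed by the sketch below for single-end reads. The function only prints commands and runs nothing; "bwa_plan" and all file names are placeholders. Treating reads of exactly 200 bp as long reads is a choice made here, not something stated above. Note that "-I" tells "bwa aln" that the qualities use the Illumina 1.3+ encoding, so it may need to be dropped for data in another encoding.

```shell
# Print the BWA commands for the mapping step, chosen by read length.
# Usage: bwa_plan REF.fa READS.fq READ_LENGTH OUT_PREFIX
# Nothing is executed; the file names are placeholders.
bwa_plan() {
    ref=$1
    reads=$2
    read_len=$3
    out=$4
    if [ "$read_len" -lt 200 ]; then
        # Short reads: "is" index, then aln and samse with the parameters above.
        echo "bwa index -a is $ref"
        echo "bwa aln -I -N $ref $reads > $out.sai"
        echo "bwa samse -n 100 $ref $out.sai $reads > $out.sam"
    else
        # Long reads: bwasw with default parameters.
        echo "bwa index $ref"
        echo "bwa bwasw $ref $reads > $out.sam"
    fi
}

bwa_plan ref.fa reads.fq 100 example
```

The resulting SAM file is what "composition-table.pl" takes as string2.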

4. Run NeSSM:

There are two versions of the NeSSM program. The CPU version is under the directory NeSSM/NeSSM_CPU/ and the GPU version is under the directory NeSSM/NeSSM_GPU/.

The usage below takes the CPU version as an example.

cd NeSSM/NeSSM_CPU/

./NeSSM -list string1 -index string2 -m string3 -o string4

string1: the composition structure table generated in step 3

string2: the index file generated in step 2

string3: the platform to simulate, 454 or illumina

string4: output file

for example: ./NeSSM -list NeSSM/scripts/percentage.txt -index NeSSM/scripts/index -m illumina -o NeSSM/example/simulation

The "simulation.fq" file is generated in the directory NeSSM/example/

The four parameters -list, -index, -m and -o are required to run NeSSM. The following optional parameters are available in both the CPU and GPU versions:

-r < int > : number of reads to simulate, default is 1000

-l < int > : length of the reads to simulate, default is 50 (bp)

-e < int > : simulate single reads or paired-end reads, 0 means single reads and 1 means paired-end reads, default is 0

-w < int > : the length of the gap when simulating paired-end reads, default is 200 (bp)

-c < string > : the configuration file used for the simulation, default is "simulation.config"

-exact < int > : 0 means the read length is set by the parameter "-l", 1 means the read length follows the length distribution of a real dataset, default is 0

-b < string > : the sequencing coverage bias file

Two parameters are used only in the GPU version:

-block < int > : the number of blocks used on the GPU, default is 100

-thread < int > : the number of threads used on the GPU, default is 200
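
As an illustration of how the optional parameters combine with the four required ones, the sketch below prints a paired-end Illumina command line. It uses only flags listed above; the function name "nessm_paired_cmd", the output prefix and the chosen values are hypothetical, and the command is printed rather than run.

```shell
# Print a NeSSM command line for paired-end Illumina reads.
# Usage: nessm_paired_cmd LIST INDEX OUT_PREFIX N_READS READ_LENGTH GAP
# Only flags documented above are used; nothing is executed.
nessm_paired_cmd() {
    echo "./NeSSM -list $1 -index $2 -m illumina -o $3 -r $4 -l $5 -e 1 -w $6"
}

# 100000 paired-end reads of 100 bp with a 300 bp gap (values chosen for illustration).
nessm_paired_cmd NeSSM/scripts/percentage.txt NeSSM/scripts/index NeSSM/example/paired 100000 100 300
```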

VI. Error model estimation
Users can estimate error models from a FASTQ file with two Perl scripts.

1. Estimate the distribution of quality values at every base with "quality-value.pl"

cd NeSSM/scripts/

perl quality-value.pl string1 string2

string1: the FASTQ file supplied by the user

string2: the platform of the FASTQ file, 454 or illumina

for example: perl quality-value.pl NeSSM/example/test.fq 454

The "quality.txt" file is generated in the directory NeSSM/scripts/

2. Map the reads in the FASTQ file to the reference genomes with BWA.

If the reads in the FASTQ file are shorter than 200 bp, use the "is" option with default parameters.
If the reads in the FASTQ file are longer than 200 bp, use the "bwasw" option with default parameters. After BWA mapping, use Samtools to adjust the BWA result.

cd directory of Samtools

./samtools calmd -S string1 string2 > string3

string1: BWA result

string2: the reference genomes used in BWA; this file must be in FASTA format

string3: output file

3. Estimate the error type with "error-type.pl".

cd NeSSM/scripts/

perl error-type.pl string1 string2

string1: the BWA result generated in step 2

string2: the platform of the FASTQ file, 454 or illumina

for example: perl error-type.pl NeSSM/example/test.sam 454

The "self-simulation.config" file is generated in the directory NeSSM/scripts/
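
The three steps of this section can be strung together as sketched below for a 454 FASTQ file with reads longer than 200 bp. The function only prints the commands. It assumes that "bwa" and "samtools" are on the PATH, that the Perl scripts are run from NeSSM/scripts/, and that "error-type.pl" takes the calmd output as its input; "error_model_plan" and the intermediate file names "mapped.sam" and "mapped.calmd.sam" are made up for this sketch.

```shell
# Print the commands of section VI in order for a long-read FASTQ file.
# Usage: error_model_plan READS.fq REF.fa PLATFORM
# Nothing is executed; intermediate file names are placeholders.
error_model_plan() {
    fastq=$1
    ref=$2
    platform=$3
    echo "perl quality-value.pl $fastq $platform"
    echo "bwa index $ref"
    echo "bwa bwasw $ref $fastq > mapped.sam"
    echo "samtools calmd -S mapped.sam $ref > mapped.calmd.sam"
    echo "perl error-type.pl mapped.calmd.sam $platform"
}

error_model_plan NeSSM/example/test.fq ref.fa 454
```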

VII. Sequencing coverage bias estimation
Users can estimate the sequencing coverage bias from a real metagenome dataset. First, map the reads back with BWA using the parameters above. Then estimate the sequencing coverage bias with "coverage-bias.pl".

cd NeSSM/scripts/

perl coverage-bias.pl string1 string2 string3

string1: the index file generated in step V-2

string2: the composition structure table generated in step V-3

string3: the BWA result

for example: perl coverage-bias.pl NeSSM/scripts/index NeSSM/scripts/percentage.txt NeSSM/example/example.sam

The "coverage.txt" file is generated in the directory NeSSM/scripts/

VIII. Datasets in the paper
All datasets mentioned in the paper are provided here, except those larger than 2Gb.

LC-100 dataset (in Table 2)
LC-250 dataset (in Table 2)
MC-100 dataset (in Table 2)
MC-250 dataset (in Table 2)
HC-100 dataset (in Table 2)
HC-250 dataset (in Table 2)
LC dataset used in assembly (in Table 7)

(**) ATTENTION:
1: If this is the first time you run CUDA, run "startcuda.sh" with root permissions to initialize CUDA.
The startcuda.sh file is under NeSSM/NeSSM_GPU/ and its usage is: ./startcuda.sh start
2: If your CUDA version is above 4.0, run the command "nvidia-smi" with root permissions.
3: You can generate the simulation datasets used in the paper "NeSSM: a Next-generation Sequencing Simulator for Metagenomics" by following the commands above.

Citation
Please cite the following paper if you use NeSSM (can be considered as NeSSM1.0).

"NeSSM: a Next-generation Sequencing Simulator for Metagenomics", 2013, PLoS ONE, 8(10):e75448.

If you need sequencing simulation for the third generation sequencing platforms, please cite the paper below (can be considered as NeSSM2.0).

"PaSS: a sequencing simulator for PacBio sequencing", 2019, BMC Bioinformatics 20:352.

Contact:
If you have any questions, feel free to contact us.
< chenmodexiaoxi@126.com >
< ccwei@sjtu.edu.cn >

Please send your comments or bug reports to Dr. Wei.

Dr. Chaochun Wei, Department of Bioinformatics and Biostatistics
