Dr. Chaochun Wei, Department of Bioinformatics and Biostatistics

NeSSM: a Next-generation Sequencing Simulator for Metagenomics


I. Introduction
    NeSSM is a tool that generates Next-Generation Sequencing (NGS) reads according to user-set parameters. Its goal is to generate metagenome sequencing reads that are close to real data. Currently, the 454 and Illumina sequencing platforms are supported. NeSSM can help in developing methods or systems for metagenomics analysis.

II. System requirements
    Linux operating system with 1G of memory or more; Perl 5.8.5 or later and gcc version 4.1.2 or later. If you want to run the GPU version of NeSSM, CUDA 4.0 or later is required (**).

    Download required tools or drivers:
CUDA driver, which can be downloaded from the NVIDIA website: http://www.nvidia.com.
BWA, which can be downloaded from the BWA website: http://bio-bwa.sourceforge.net/.
Samtools, which can be downloaded from the Samtools website: http://samtools.sourceforge.net/.

III. Files and Directories
    This system is packed in one file, "NeSSM.tarz", which can be downloaded from here. The important files inside this tar file are listed below.
|-NeSSM_CPU.c               The CPU version of NeSSM.
|-NeSSM_GPU.cu              The GPU version of NeSSM.
|-composition-table.pl      A Perl script to analyse the composition of a real metagenome dataset.
|-simulation.config         A configuration file of NeSSM.
|-mk_index.pl               A Perl script to make the index file.
|-complete_update_step.pl   A Perl script to download the genome sequences from NCBI.
|-NCBI_PowerScripting.pm    A Perl module for NCBI database connection.
|-quality-value.pl          A Perl script to obtain the distribution of quality values at every base from FASTQ files.
|-error-type.pl             A Perl script to obtain the error model from BWA result files.
|-coverage-bias.pl          A Perl script to estimate sequencing coverage bias from a BWA result file.
|-startcuda.sh              A shell script to initialize CUDA.

IV. Install NeSSM
After you download the tarball, you can install NeSSM as follows.
1. Unpack NeSSM.tarz:
tar   -xzf   NeSSM.tarz
2.1. If you want to use the CPU version:
cd   NeSSM/NeSSM_CPU/
make
2.2. If you want to use the GPU version:
cd   NeSSM/NeSSM_GPU/
make

V. Run sequencing simulation
1. Download the NCBI genome database:
  If you don't have the NCBI genome database, use "complete_update_step.pl" to download it. If you already have the genome database, you can skip this step.
cd    NeSSM/scripts/
perl   complete_update_step.pl   string1
string1: the directory to store the genome database
for example: perl   complete_update_step.pl   NeSSM/example/data/

2. Generate the index file:
  The index file contains the name, length, path and other information for each genome. It can be used in simulation or in analyzing the composition of a metagenome.
cd   NeSSM/scripts/
perl   mk_index.pl   string1   string2
string1: the directory of the whole database (generated in step 1)
string2: the directory to store the index file generated by this script
for example: perl   mk_index.pl   NeSSM/example/data/   NeSSM/example/
The "index" file is generated under the directory NeSSM/example/
  ATTENTION: the directory of the whole database in string1 must be given as an absolute path!
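
To illustrate the kind of information the index is said to hold (name, length, path), the sketch below collects those three fields for each sequence in a FASTA file. It is only an assumption about what such a record could look like: the actual file format written by mk_index.pl is not described here, and the function name fasta_records is hypothetical.

```python
import os
import tempfile


def fasta_records(path):
    """Yield (name, length, absolute_path) for every sequence in a FASTA file.

    Hypothetical helper: it mirrors the fields the index is said to contain,
    not the real output format of mk_index.pl.
    """
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length, os.path.abspath(path)
                # Use the first word of the header line as the sequence name.
                parts = line[1:].split()
                name, length = (parts[0] if parts else ""), 0
            elif line:
                length += len(line)
    if name is not None:
        yield name, length, os.path.abspath(path)


if __name__ == "__main__":
    # Tiny made-up FASTA file, used only to show the call.
    with tempfile.NamedTemporaryFile("w", suffix=".fna", delete=False) as tmp:
        tmp.write(">genomeA example\nACGT\nAC\n>genomeB\nGGG\n")
    for record in fasta_records(tmp.name):
        print(record)
    os.remove(tmp.name)
```

Returning the absolute path follows the note above that the database directory should be given as an absolute path.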

3. Create a composition structure table:
  The composition structure table contains the names of the genomes in a metagenome and their abundances. Here the abundance can be the percentage of an organism (based on its read number). There are two ways to obtain the composition structure table.
3.1. Provide the composition structure table directly. If the abundance is the percentage of read numbers, make sure that all abundances sum to one. If the abundance is the percentage of organism numbers, use "adjust.pl" to adjust the table.
cd   NeSSM/scripts/
perl   adjust.pl   string1   string2
string1: the composition structure table provided by the user
string2: the index file generated in step 2
for example: perl adjust.pl   NeSSM/example/percentage.txt   NeSSM/example/index
The "new-percentage.txt" file is generated under the directory NeSSM/scripts/
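
One plausible way to turn organism-number percentages into read-number percentages is to weight each organism by its genome length and renormalize, since at equal organism counts a longer genome contributes proportionally more reads. The sketch below shows only that arithmetic. It is an assumption about the idea behind such an adjustment, not the actual code of adjust.pl, and the genome names and lengths are made up.

```python
def organism_to_read_fractions(organism_fractions, genome_lengths):
    """Convert organism-number fractions to read-number fractions.

    Assumption (not taken from adjust.pl): the expected share of reads from
    an organism is proportional to (organism fraction) x (genome length).
    """
    weighted = {
        name: fraction * genome_lengths[name]
        for name, fraction in organism_fractions.items()
    }
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("abundances and genome lengths must be positive")
    # Renormalize so that the read fractions sum to one.
    return {name: weight / total for name, weight in weighted.items()}


if __name__ == "__main__":
    # Two made-up genomes present in equal numbers, one three times longer.
    fractions = organism_to_read_fractions(
        {"genomeA": 0.5, "genomeB": 0.5},
        {"genomeA": 1_000_000, "genomeB": 3_000_000},
    )
    print(fractions)
```

With equal organism numbers, the genome that is three times longer receives three quarters of the reads, and the resulting fractions sum to one as required for a table based on read numbers.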

3.2. Derive the table from a metagenome dataset. The "composition-table.pl" script can generate a composition structure table from metagenome data.
First, use BWA to map the metagenome reads. If the reads are shorter than 200 bp, the recommended algorithm in BWA is "is" (the recommended parameters are "-I -N" in the "aln" step and "-n 100" in the "samse/sampe" step); the "bwasw" algorithm is better for reads longer than 200 bp.
Then, use "composition-table.pl" to analyze the BWA result.
cd   NeSSM/scripts/
perl   composition-table.pl   string1   string2
string1: the index file generated in step 2
string2: the BWA result
for example: perl   composition-table.pl   NeSSM/example/index   NeSSM/example/example.sam
The "percentage.txt" file is generated under the directory NeSSM/scripts/
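
As a rough illustration of how read-based abundances can be derived from a mapping result, the sketch below counts mapped reads per reference sequence in SAM text and divides by the total. It relies only on the standard SAM columns (FLAG in column 2, RNAME in column 3, bit 0x4 meaning unmapped). It is not the logic of composition-table.pl, which also uses the index file, and the function name is hypothetical.

```python
from collections import Counter


def read_fractions_from_sam(sam_lines):
    """Return {reference_name: fraction of mapped reads} from SAM lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag, reference = int(fields[1]), fields[2]
        # Bit 0x4 of FLAG marks an unmapped read; RNAME is "*" in that case.
        if flag & 4 or reference == "*":
            continue
        counts[reference] += 1
    total = sum(counts.values())
    return {ref: n / total for ref, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Made-up SAM records, used only to show the call.
    example = [
        "@SQ\tSN:genomeA\tLN:100",
        "r1\t0\tgenomeA\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tgenomeA\t9\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t0\tgenomeB\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(read_fractions_from_sam(example))
```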

4. Run NeSSM:
  There are two versions of the NeSSM program. One is the CPU version, under the directory NeSSM/NeSSM_CPU/. The other is the GPU version, under the directory NeSSM/NeSSM_GPU/.
The usage below takes the CPU version as an example.
cd   NeSSM/NeSSM_CPU/
./NeSSM   -list   string1   -index   string2   -m   string3   -o   string4
string1: the composition structure table generated in step 3
string2: the index file generated in step 2
string3: the platform used to simulate, 454 or illumina
string4: output file
for example: ./NeSSM   -list   NeSSM/scripts/percentage.txt   -index   NeSSM/scripts/index   -m   illumina   -o   NeSSM/example/simulation
The "simulation.fq" file is generated under the directory NeSSM/example/

  The four parameters -list, -index, -m and -o are required to run NeSSM. Several other parameters are available in both the CPU version and the GPU version:
-r <int> : number of reads to simulate, default is 1000
-l <int> : length of the reads to simulate, default is 50 (bp)
-e <int> : simulate single reads or paired-end reads; 0 means single reads and 1 means paired-end reads, default is 0
-w <int> : the length of the gap when simulating paired-end reads, default is 200 (bp)
-c <string> : the configuration file used for the simulation, default is "simulation.config"
-exact <int> : 0 means the read length is set by the parameter "-l"; 1 means the read length follows the length distribution of a real dataset; default is 0
-b <string> : the file of sequencing coverage bias

  There are two parameters used only in the GPU version:
-block <int> : the number of blocks used on the GPU, default is 100
-thread <int> : the number of threads used on the GPU, default is 200
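
When NeSSM is driven from a script, the command line can be assembled from the parameters listed above. The sketch below builds the argument list without running anything. It uses only the flags documented here; the function name, its keyword names, and the default executable path "./NeSSM" are assumptions chosen for illustration.

```python
def nessm_command(list_file, index_file, platform, output,
                  reads=None, read_length=None, paired=None, gap=None,
                  blocks=None, threads=None, executable="./NeSSM"):
    """Build a NeSSM argument list from the documented parameters.

    Hypothetical helper: only -list, -index, -m and -o are always emitted;
    optional flags are added when a value is given, so NeSSM's own defaults
    apply otherwise. -block and -thread are for the GPU version only.
    """
    if platform not in ("454", "illumina"):
        raise ValueError("platform must be '454' or 'illumina'")
    argv = [executable, "-list", list_file, "-index", index_file,
            "-m", platform, "-o", output]
    optional = [
        ("-r", reads),
        ("-l", read_length),
        ("-e", None if paired is None else int(paired)),  # 0 single, 1 paired
        ("-w", gap),
        ("-block", blocks),
        ("-thread", threads),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    command = nessm_command("percentage.txt", "index", "illumina",
                            "simulation", reads=5000, paired=True)
    print(" ".join(command))
```

The returned list can be passed to subprocess.run from the directory that contains the NeSSM executable.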

VI. Error model estimation
Users can estimate error models from a FASTQ file with two Perl scripts.
1. Estimate the distribution of quality values at every base position with "quality-value.pl"
cd   NeSSM/scripts/
perl   quality-value.pl   string1   string2
string1: the FASTQ file provided by the user
string2: the platform of the FASTQ file, 454 or illumina
for example: perl   quality-value.pl   NeSSM/example/test.fq   454
The "quality.txt" file is generated under the directory NeSSM/scripts/
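
The idea of a per-position quality distribution can be sketched as follows: for each base position in the reads, count how often each quality value occurs. The code assumes four-line FASTQ records and a Phred offset of 33 (older Illumina files may use 64, hence the offset parameter). It is an illustration of the concept only and does not reproduce the output format of quality-value.pl.

```python
from collections import Counter, defaultdict


def quality_distribution(fastq_lines, offset=33):
    """Return {position: Counter({quality_value: count})}.

    Assumes four lines per FASTQ record; the quality string is the fourth.
    """
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    distribution = defaultdict(Counter)
    for quality_string in lines[3::4]:
        for position, char in enumerate(quality_string):
            # Decode the ASCII character into a numeric quality value.
            distribution[position][ord(char) - offset] += 1
    return distribution


if __name__ == "__main__":
    # Two made-up reads of different lengths.
    example = ["@r1", "ACG", "+", "II5", "@r2", "AC", "+", "I#"]
    for position, counts in sorted(quality_distribution(example).items()):
        print(position, dict(counts))
```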

2. Map the reads in the FASTQ file to the reference genomes with BWA.
  If the reads in the FASTQ file are shorter than 200 bp, use the "is" option with default parameters.
  If the reads in the FASTQ file are longer than 200 bp, use the "bwasw" option with default parameters.
  After BWA mapping, use Samtools to adjust the BWA result.
cd   directory of Samtools
./samtools   calmd   -S   string1   string2   >   string3
string1: BWA result
string2: the reference genomes used in BWA; this file must be in FASTA format
string3: output file
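
The mapping step can be written out as a short command sequence. The sketch below only assembles the shell commands as strings, choosing the short-read or long-read route by read length. It reads the "is" option as the BWA index construction algorithm (bwa index -a is), which is an assumption, and it treats reads of exactly 200 bp as long reads, which the text above does not specify. The file names and the function name are hypothetical.

```python
import shlex


def mapping_commands(reference, fastq, read_length, prefix):
    """Return the shell commands for BWA mapping followed by samtools calmd."""
    ref, reads = shlex.quote(reference), shlex.quote(fastq)
    sai = shlex.quote(prefix + ".sai")
    sam = shlex.quote(prefix + ".sam")
    out = shlex.quote(prefix + ".calmd.sam")
    if read_length < 200:
        # Short reads: "is" index, then aln and samse with default parameters.
        commands = [
            f"bwa index -a is {ref}",
            f"bwa aln {ref} {reads} > {sai}",
            f"bwa samse {ref} {sai} {reads} > {sam}",
        ]
    else:
        # Long reads: bwasw with default parameters.
        commands = [
            f"bwa index {ref}",
            f"bwa bwasw {ref} {reads} > {sam}",
        ]
    # Adjust the BWA result with Samtools, as in the usage line above.
    commands.append(f"samtools calmd -S {sam} {ref} > {out}")
    return commands


if __name__ == "__main__":
    for command in mapping_commands("reference.fa", "test.fq", 100, "test"):
        print(command)
```

Printing the commands first makes it easy to check them before running them in a shell.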

3. Estimate the error type with "error-type.pl".
cd   NeSSM/scripts/
perl   error-type.pl   string1   string2
string1: the BWA result generated in step 2
string2: the platform of the FASTQ file, 454 or illumina
for example: perl   error-type.pl   NeSSM/example/test.sam   454
The "self-simulation.config" file is generated under the directory NeSSM/scripts/

VII. Sequencing coverage bias estimation
Users can estimate sequencing coverage bias from a real metagenome dataset. First, map the reads back with BWA using the parameters described above. Then estimate the sequencing coverage bias with "coverage-bias.pl".
cd   NeSSM/scripts/
perl   coverage-bias.pl   string1   string2   string3
string1: the index file generated in step V-2
string2: the composition structure table generated in step V-3
string3: the BWA result
for example: perl   coverage-bias.pl   NeSSM/scripts/index   NeSSM/scripts/percentage.txt    NeSSM/example/example.sam
The "coverage.txt" file is generated under the directory NeSSM/scripts/

VIII. Datasets in the paper
All datasets mentioned in the paper are provided here, except those with sizes above 2Gb.

(**) ATTENTION:
1. If this is your first time running CUDA, run "startcuda.sh" with root permissions to initialize CUDA.
The startcuda.sh file is under NeSSM/NeSSM_GPU/ and its usage is: ./startcuda.sh   start
2. If your CUDA version is above 4.0, run the command "nvidia-smi" with root permissions.
3. You can generate the simulation datasets used in the paper "NeSSM: a Next-generation Sequencing Simulator for Metagenomics" according to the commands.

Citation
Please cite the following paper if you use NeSSM (it can be considered NeSSM1.0).
If you need sequencing simulation for third-generation sequencing platforms, please cite the paper below (it can be considered NeSSM2.0).
Contact:
If you have any questions, feel free to contact us.
< chenmodexiaoxi@126.com >
< ccwei@sjtu.edu.cn >


Please send your comments or bug reports to Dr. Wei.

 

©2010 Chaochun Wei