PaSS is a fast, high-fidelity simulator for PacBio sequencing. It facilitates the evaluation and development of new analysis tools for PacBio sequencing data.

System requirements

Linux operating system with at least 1 GB of memory; Perl and gcc are required.
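The requirements above can be checked before installing. The following is a minimal sketch, not part of PaSS: `missing_tools` is a hypothetical helper name, and the memory check assumes a Linux system that exposes /proc/meminfo.

```shell
# Hypothetical helper (not part of PaSS): print each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report missing prerequisites; prints nothing when perl and gcc are both installed.
missing_tools perl gcc

# Total memory in kB as reported by the Linux kernel (about 1000000 kB is 1 GB).
if [ -r /proc/meminfo ]; then
  awk '/^MemTotal:/ {print $2 " kB total memory"}' /proc/meminfo
fi
```

If the first command prints a tool name, install that tool before compiling.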

Installation

  1. Download the tarball here.

  2. Uncompress PaSS.tar.gz:
       tar xzvf PaSS.tar.gz
  3. Compile the source code:
       gcc PaSS.c -o PaSS -lm -lpthread

Simulate PacBio multi-pass sequencing reads

  1. Generate the index file of the target genome:
       perl pacbio_mkindex.pl E.coli/ecoli_ref.fa ./

     This step writes two files, percentage.txt and index (which hold information about the target genome), to the current directory. Both are used in the simulation step.

  2. Run the simulation:
       ./PaSS -list percentage.txt -index index -m pacbio_RS -c sim.config -r 1000 -t 4 -o out

     Parameters:
     -list      percentage.txt, generated in step 1
     -index     index file, generated in step 1
     -m         sequencer to simulate: pacbio_RS or pacbio_sequel
     -c         the profile generated in the error model estimation stage.
                sim.config is the profile of the example dataset.
                Three profiles are provided, for E. coli, C. elegans and A. thaliana.
     -r         number of reads to simulate
     -t         number of threads to use (default: 1)
     -o         name of the output file
     -d         optional; if set, the ground truth of the simulation is output as well
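When the simulation is launched from scripts, it can help to validate the sequencer name before PaSS starts. The following is a minimal sketch under that assumption: `pass_cmd` is a hypothetical helper, not part of PaSS. It only prints the command line, and it uses only the options listed above.

```shell
# Hypothetical helper (not part of PaSS): print a PaSS command line built from
# the documented options. Arguments: sequencer, profile, reads, threads, output.
pass_cmd() {
  model=$1 profile=$2 reads=$3 threads=${4:-1} out=${5:-out}
  case $model in
    pacbio_RS|pacbio_sequel) ;;   # the two sequencers accepted by -m
    *) echo "unknown sequencer: $model" >&2; return 1 ;;
  esac
  echo "./PaSS -list percentage.txt -index index -m $model -c $profile -r $reads -t $threads -o $out"
}

# Print the example command from this section.
pass_cmd pacbio_RS sim.config 1000 4 out
```

Piping the printed line to sh runs it. The thread count falls back to 1, matching the documented default of -t, and the output name falls back to out.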

Estimate the error model from the real PacBio sequencing data

  1. Align the sequencing reads to the reference genome with blasr:
       blasr real.fastq reference.fasta --allowAdjacentIndels --hitPolicy randombest --out real.blasr -m 0
  2. Generate the profile with run.pl:
       perl run.pl example/example.fq example/example.blasr RS

     parameter1: the real PacBio sequencing data.

     parameter2: the alignment results of the real data.

     parameter3: the sequencer version, RS or sequel. If the sequencer is RS, the distribution of quality values is included in the model.

     The output is sim.config.
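The two commands of this stage share the reads file and the alignment file, so they can be generated together. The following is a minimal sketch: `error_model_cmds` is a hypothetical helper, not part of PaSS, and naming the alignment file after the reads file with a .blasr suffix is an assumption made for illustration.

```shell
# Hypothetical helper (not part of PaSS): print the two commands of the
# error-model stage. Arguments: reads file, reference file, sequencer version.
error_model_cmds() {
  reads=$1 ref=$2 platform=$3
  case $platform in
    RS|sequel) ;;                 # the two values accepted as parameter3
    *) echo "sequencer must be RS or sequel" >&2; return 1 ;;
  esac
  aln="${reads%.*}.blasr"         # assumed naming: real.fastq -> real.blasr
  echo "blasr $reads $ref --allowAdjacentIndels --hitPolicy randombest --out $aln -m 0"
  echo "perl run.pl $reads $aln $platform"
}

# Print the commands for the file names used in step 1.
error_model_cmds real.fastq reference.fasta RS
```

The helper only prints the commands; pipe its output to sh to execute them in order.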

Datasets

Real PacBio sequencing datasets and their alignment results can be downloaded here.

Use & Citation

Please cite the following paper if you use PaSS (which can be considered NeSSM 2.0).

Contact Information

Wenmin Zhang: Melody091835@163.com
Chaochun Wei: ccwei@sjtu.edu.cn