EUPAN-LSF

eupanLSF tool list:
        qualSta         Run fastqc on a large number of files
        mergeQualSta    View the overall sequencing quality by combining fastqc outputs
        trim            Trim or filter low-quality reads in parallel
        align           Map reads to a reference in parallel
        sam2bam         Convert alignments (.sam) to sorted .bam files
        bamSta          Statistics of parallel mapping
        assemble        Assemble reads in parallel
        assemSta        Statistics of parallel assembly
        mergeAssemSta   Merge statistics of all individuals to a single file
        getUnalnCtg     Extract the unaligned contigs from nucmer alignment (processed by quast)
        mergeUnalnCtg   Merge the unaligned contigs into a single file
        rmRedundant     Remove redundant contigs of a fasta file
        pTpG            Get the longest transcripts to represent genes
        geneCov         Calculate gene body coverage and CDS coverage
        mergeGeneCov    Merge gene body coverage and CDS coverage of each sample into two summary files
        geneExist       Determine gene presence-absence based on gene body coverage and CDS coverage
        subSample       Select subset of samples from gene PAV profile
        gFamExist       Determine gene family presence-absence based on gene presence-absence
        bam2bed         Calculate genome region presence-absence from .bam
        fastaSta        Calculate statistics of fasta file
        sim             Simulation and plot of the pan-genome and the core genome

  1. qualSta

    The "qualSta" tool checks the quality of .fastq/.fastq.gz files on a large scale and plots the quality statistics automatically.

    Usage: eupanLSF qualSta [options] <data_directory> <output_directory>

    The script calls the fastqc program, so make sure fastqc is in your PATH, or use the -f option to tell the script where it is located.

    Necessary input description:

      data_directory   <string>      This directory should contain many sub-directories
                                     named by sample names, such as CX101, B152, etc.
                                     Each sub-directory should contain several 
                                     sequencing files ending in .fastq or .fastq.gz.
    
      output_directory <string>      Both final output files and intermediate results 
                                     are written to this directory. To avoid 
                                     overwriting existing files, the output_directory
                                     must not already exist; it will be created by
                                     the script itself.
    
    Options:
         -h                          Print this usage page.
    
         -f            <string>      The directory where the fastqc executable is 
                                     located. If this option isn't given, fastqc is
                                     assumed to be in your PATH.
         
         -t            <int>         The number of files that can be processed
                                     simultaneously. This parameter is passed to the 
                                     fastqc program. It is recommended to set it to the
                                     number of files within each sample. Note that the
                                     machine must have at least this many threads.
                                     Default: 1 
    
         -q            <string>      The queue name for job submission. 
                                     Default: default queue

    Example command:

    eupanLSF qualSta -f /path/to/Fastqc -t 4 data/ preview_quality/

  2. mergeQualSta

    Usage: eupanLSF mergeQualSta <directory>

    directory    The exact output directory of "eupanLSF qualSta".

  3. trim

    The "trim" tool trims or filters raw sequencing data to generate high-quality paired-end fastq data.

    Usage: eupanLSF trim [options] <fastq_data_directory> <output_directory> <Trimmomatic_directory>

    The tool calls the Trimmomatic program and also needs the parameter files within the Trimmomatic directory, so the directory where Trimmomatic is located must be given to the script as a required input.

    Necessary input description:

      fastq_data_directory   <string>    This directory should contain many sub-directories
                                         named by sample names, such as CX101, B152, etc.
                                         Each sub-directory should contain several sequencing
                                         files ending in .fastq (or .fq) or .fastq.gz (or .fq.gz).
    
      output_directory       <string>    High-quality reads will be output to this directory. 
                                         To avoid overwriting existing files, the output_directory
                                         must not already exist; it will be created by the 
                                         script itself.
    
      Trimmomatic_directory  <string>    Directory where the Trimmomatic program is located.
    
    Options:
         -h                              Print this usage page.
    
         -t                   <int>      Thread number.
    
         -a                   <string>   Adaptor file in fasta format used by the Trimmomatic program.
                                         Default: Trimmomatic_directory/adapters/TruSeq3-PE-2.fa
         
         -s                   <string>   Suffix of the fastq_file. Check your sequencing data and
                                         change it if needed.
                                         Default: ".fq.gz"
    
         -k                   <string>   Linker for the paired-end identifier. Paired-end fastq files
                                         should end with *1suffix or *2suffix, where suffix is
                                         ".fq.gz" (or ".fastq", etc. See the -s option) and * is the
                                         linker, such as "_". For example, a file should 
                                         be named like CX123_1.fq.gz (linker is "_", suffix is ".fq.gz")
                                         or BX125_R1.fastq (linker is "_R", suffix is ".fastq").
                                         Default: "_"
         -p                   <33 or 64> Quality score version. 
                                         Default: 33 (phred+33)
    
    
         -i                   <int>      Parameter passed to Trimmomatic (LEADING or TRAILING). 
                                         Specifies the minimum  quality required to keep a base at the 
                                         head or tail of a read.
                                         Default: 20. 
    
         -w                   <int>      Parameter passed to Trimmomatic (SLIDINGWINDOW). Specifies the 
                                         number of bases to average across.
                                         Default: 4.
         
         -n                   <int>      Parameter passed to Trimmomatic (quality for SLIDINGWINDOW).
                                         Default: 20.
    
         -m                   <int>      Parameter passed to Trimmomatic (MINLEN). Specifies the minimum 
                                         length of reads to be kept.
                                         Default: 35.
    
         -j                   <int>      Parameter passed to Trimmomatic (HEADCROP). The number of bases 
                                         to remove from the start of the read. If it is set to 0, no base
                                         will be removed from the head.
                                         Default: 0.
         -q                   <string>   The queue name for job submission. 
                                         Default: default queue

    Example commands:

    eupanLSF trim -p 64 -t 4 data/ trim/ /path/to/Trimmomatic/
    eupanLSF trim -p 33 -t 4 -w 83 -m 83 data/ filter/ /path/to/Trimmomatic/
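
    The -s (suffix) and -k (linker) options together decide which files are treated as the two mates of a pair. The naming rule can be sketched in Python as follows. This is only an illustration of the rule described above, assuming a file is matched purely by its ending; the helper name split_pair_name is hypothetical and is not part of eupanLSF.

```python
from typing import Optional, Tuple


def split_pair_name(filename: str, linker: str = "_",
                    suffix: str = ".fq.gz") -> Optional[Tuple[str, int]]:
    """Return (prefix, mate_number) if filename ends with <linker>1<suffix>
    or <linker>2<suffix>; return None otherwise.

    Hypothetical helper: it mirrors the documented naming rule only.
    """
    for mate in (1, 2):
        ending = f"{linker}{mate}{suffix}"
        # Require a non-empty prefix in front of the ending.
        if filename.endswith(ending) and len(filename) > len(ending):
            return filename[: -len(ending)], mate
    return None


print(split_pair_name("CX123_1.fq.gz"))                   # ('CX123', 1)
print(split_pair_name("BX125_R1.fastq", "_R", ".fastq"))  # ('BX125', 1)
print(split_pair_name("BX125_R2.fastq", "_", ".fastq"))   # None: wrong linker
```

    If a file is not picked up, checking its name against the chosen -k and -s values in this way is usually the quickest diagnosis.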

  4. alignRead

    The "alignRead" tool maps reads to the genome on a large scale.

    Usage: eupanLSF align [options] <fastq_data_directory> <output_directory> <Mapping_tool_directory> <Alignment_index>

    The script calls a mapping program (bwa mem or bowtie2), so the directory where the mapping tool is located is required.

    Necessary input description:

      fastq_data_directory    <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain several 
                                          sequencing files ending in .fq(.gz) or .fastq(.gz).
    
      output_directory        <string>    Alignment results will be output to this directory.
                                          To avoid overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      Mapping_tool_directory  <string>    Directory where the bwa or bowtie2 program is located.
    
      Alignment_index         <string>    bowtie2 index built with the bowtie2-build program,
                                          or bwa index built with bwa index.   
    
    Options:
         -h                              Print this usage page.
    
         -f                   <string>   Select a mapping tool. Can be "bwa" or "bowtie2".
                                         Default: bwa
    
         -t                   <int>      Threads used.
                                         Default: 1
    
         -s                   <string>   Suffix of the fastq_file. Check your sequencing data and
                                         change it if needed.
                                         Default: ".fq.gz"
    
         -k                   <string>   Linker for the paired-end identifier. Paired-end fastq files
                                         should end with *1suffix or *2suffix, where suffix is
                                         ".fq.gz" (or ".fastq", etc. See the -s option) and * is the
                                         linker, such as "_". For example, a file should 
                                         be named like CX123_1.fq.gz (linker is "_", suffix is ".fq.gz")
                                         or BX125_R1.fastq (linker is "_R", suffix is ".fastq").
                                         Default: "_"
    
         -m                   <int>      Minimum insert length, for bowtie2 only.
                                         Default: 0
       
         -n                   <int>      Maximum insert length, for bowtie2 only.
                                         Default: 1000
    
         -q                   <string>   The queue name for job submission. 
                                         Default: default queue

    Examples:

    eupanLSF alignRead -f bwa -t 4 trim/data/ map2pan/ /path/to/bwa/ pan/pan.fa
    eupanLSF alignRead -f bowtie2 -t 4 trim/data/ map2pan/ /path/to/bowtie2/ pan/pan

  5. sam2bam

    Usage: eupanLSF sam2bam [options]    

    eupanLSF sam2bam is used to post-process mapping results, including:

    1. converting sam to bam
    2. sorting bam
    3. merging bam
    4. indexing bam

    The script calls the samtools program, so the directory where samtools is located is required.

    Necessary input description:

      mapping_directory       <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain one or more
                                          mapping results (*.sam).
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      samtools_directory      <string>    Directory where the samtools executable is located.
    
    Options:
         -h                               Print this usage page.
    
         -t                   <int>       Threads used.
                                          Default: 1
    
         -q                   <string>    The queue name for job submission. 
                                          Default: default queue

    Example command:

    eupanLSF sam2bam -t 4 map2pan/data/ panBam/ /path/to/samtools-1.3/

  6. bamSta

    The "bamSta" tool checks statistics of .bam files.

    Usage: eupanLSF bamSta [commands] ...
    Commands:
            basic               calculate basic statistics
            cov                 calculate genome coverage
            mergeBasicSta       merge basic statistics of each individual
            mergeCovSta         merge genome coverages of each individual

    6.1 basic

    eupanLSF bamSta basic is used to check the basic statistics of mapping.

    Usage: eupanLSF bamSta basic [options]    

    The script calls bam_stats (in BamUtil), so the directory where bamUtil is located is required.

    Necessary input description:

      bam_directory           <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the mapping result,
                                          a sorted .bam file.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      bamUtil_directory       <string>    bamUtil directory where bin/bam is located.
    
    Options:
    
         -q            <string>      The queue name for job submission. 
                                     Default: default queue

    6.2 cov

    eupanLSF bamSta cov is used to check genome coverage.

    Usage: eupanLSF bamSta cov [options]    

    The script calls the qualimap software, so the directory where qualimap is located is required.

    Necessary input description:

      bam_directory           <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the mapping result,
                                          a sorted .bam file.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      qualimap_directory      <string>    Directory where the qualimap executable is located.
    
    Options:
         -h                               Print this usage page.
    
         -m                   <string>    Maximum memory size for Java.
                                          Default: 12G
    
         -t                   <int>       Thread number.
                                          Default: 4
    
         -q                   <string>    LSF queue name.
                                          Default: default queue    

    6.3 mergeBasicSta

    "bamSta mergeBasicSta" is used to collect and merge bam basic statistics.

    Usage: eupanLSF bamSta mergeBasicSta [options]  >output

    Necessary input description:

      directory        <string>      Directories of bamUtil results, i.e. the data/
                                     directory of the "bamSta basic" output.
                                     Each directory contains sub-directories named
                                     by sample names.

    6.4 mergeCovSta

    "bamSta mergeCovSta" is used to collect and merge bam coverage statistics.

    Usage: eupanLSF bamSta mergeCovSta [options]  >output

    Necessary input description:

      directory        <string>      Directories of qualimap results. 
                                     Each directory contains sub-directories named
                                     by sample names.

  7. assemble

    "assemble" is used to assemble short reads. It contains 2 sub-programs:

    1) "soapdenovo": plain soapdenovo2 assembly with a fixed Kmer

    2) "linearK": iterative use of soapdenovo2 with a flexible Kmer

    Usage: eupanLSF assemble [commands] ...
    Commands:
            soapdenovo    Assembly with SOAPdenovo 2.
            linearK       Assembly with an iterative use of SOAPdenovo 2 (Recommended).

    7.1 soapdenovo

    Usage: eupanLSF assemble soapdenovo [options]   

    eupanLSF assemble soapdenovo is used to assemble high-quality reads on a large scale.

    Necessary input description:

      fastq_data_directory    <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain several 
                                          sequencing files ending in .fastq or .fastq.gz.
    
      output_directory        <string>    Assembly results will be output to this directory.
                                          To avoid overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      soapdenovo_directory    <string>    Directory where the soapdenovo2 executable files are located.
    
    Options:
         -h                              Print this usage page.
    
         -t                   <int>      Threads used.
                                         Default: 1
    
         -s                   <string>    Suffix of files within data_directory.
                                          Default: .fq.gz 
    
         -k                   <int>      Kmer.
                                         Default: 35
    
         -c                   <string>    Parameters of the soapdenovo2 config file: 8 parameters joined by commas
                                            1)maximal read length
                                            2)average insert size
                                            3)if sequence needs to be reversed
                                            4)in which part(s) the reads are used
                                            5)use only first N bps of each read
                                            6)in which order the reads are used while scaffolding
                                            7)cutoff of pair number for a reliable connection (at least 3 for 
                                              short insert size)
                                            8)minimum aligned length to contigs for a reliable read location 
                                              (at least 32 for short insert size)
                                          Default: 80,460,0,3,80,1,3,32
    
         -g                               enable gapcloser 
    
         -q                  <string>     The queue name for job submission. 
                                          Default: default queue

    7.2 linearK

    Usage: eupanLSF assemble linearK [options]   

    Necessary input description:

      fastq_data_directory    <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain several 
                                          sequencing files ending in .fastq or .fastq.gz.
    
      output_directory        <string>    Assembly results will be output to this directory.
                                          To avoid overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      soapdenovo_directory    <string>    Directory where the soapdenovo2 executable files are located.
    
    Options:
         -h                              Print this usage page.
    
         -t                   <int>      Threads used.
                                         Default: 1
    
         -g                   <int>       Genome size. Used to infer sequencing depth. 
                                          Default: 380000000 (380M)
         
         -s                   <string>    Suffix of files within data_directory.
                                          Default: .fq.gz 
    
         -r                   <string>    Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1. 
                                          The parameter should be input as "a,b".
                                          Default: 0.76,20
    
         -w                   <int>       Step-length of Kmer change.
                                          Default: 2
    
         -u                   <int>       Upper limit on the number of Kmer changes. This parameter is set to
                                          reduce redundant computation.
                                          Default: 10
    
         -c                   <string>    Parameters of the soapdenovo2 config file: 8 parameters joined by commas
                                            1)maximal read length
                                            2)average insert size
                                            3)if sequence needs to be reversed
                                            4)in which part(s) the reads are used
                                            5)use only first N bps of each read
                                            6)in which order the reads are used while scaffolding
                                            7)cutoff of pair number for a reliable connection (at least 3 for 
                                              short insert size)
                                            8)minimum aligned length to contigs for a reliable read location 
                                              (at least 32 for short insert size)
                                          Default: 80,460,0,3,80,1,3,32
    
         -n                   <int>       The minimum length of contigs. Contigs shorter than this length will
                                          NOT be used when calculating N50.
                                          Default: 100
    
         -k                   <string>    Available Kmer range. Give comma-separated lower bound and upper bound.
                                          Default: 15,127
    
         -m                   <int>       The number of consecutive Ns at which a scaffold is broken into contigs.
                                          This is used when breaking gap-closed scaffolds into contigs.
                                          Default: 10.
         -q                   <string>    The queue name for job submission. 
                                          Default: default queue
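
    The linear function given for -r can be worked through in a few lines of Python. This is a minimal sketch, not the tool's code: it assumes sequencing depth is total sequenced bases divided by the genome size (-g), and that the result is clamped to the range given by -k. The function name starting_kmer is hypothetical.

```python
def starting_kmer(total_bases: int,
                  genome_size: int = 380_000_000,
                  a: float = 0.76, b: float = 20.0,
                  k_min: int = 15, k_max: int = 127) -> int:
    """Kmer = 2*int(0.5*(a*Depth+b))+1, as given for the -r option.

    Assumptions (not confirmed by the documentation):
      * Depth = total_bases / genome_size
      * the result is clamped to [k_min, k_max] from the -k option
    """
    depth = total_bases / genome_size
    kmer = 2 * int(0.5 * (a * depth + b)) + 1  # always odd
    return max(k_min, min(k_max, kmer))


# 11.4 Gb of reads on a 380 Mb genome is 30x depth:
# 0.76*30 + 20 = 42.8, int(21.4) = 21, so Kmer = 2*21 + 1 = 43
print(starting_kmer(11_400_000_000))  # 43
```

    The 2*int(...)+1 form guarantees an odd Kmer, which de Bruijn graph assemblers such as SOAPdenovo2 require.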

  8. assemSta

    The "assemSta" tool maps assembled contigs to the reference and checks the statistics of assembled contigs (or scaffolds).

    Usage: eupanLSF assemSta [options]     

    The script calls the QUAST program, so the directory where quast.py is located is required.

    Necessary input description:

      assembly_directory      <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the assembly results,
                                          including files *_gc.scafSeq and *_gc.contig.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      QUAST_directory         <string>    QUAST directory where quast.py is located.
    
      reference.fa            <string>    Reference sequence file (.fa or .fa.gz).
    
    Options:
         -h                               Print this usage page.
    
         -t                   <int>       Threads used.
                                          Default: 1
    
         -m                   <int>       Minimum contig length used for assessment.
                                          Default: 500
    
         -g                               Check the statistics of gap-closed assemblies if -g is 
                                          enabled. In the assembly directory of each sample, 
                                          *_gc.scafSeq and *_gc.contig should exist.
                                          Default: check statistics of raw assemblies
    
         -s                               Check the statistics of assembled scaffolds if -s is enabled.
                                          Default: check statistics of assembled contigs

  9. mergeAssemSta

    "mergeAssemSta" is used to collect assembly statistics from QUAST.

    Usage: eupanLSF mergeAssemSta  > output_statistics

    Necessary input description:

      QUAST_output_directory_list  <string>    One or more QUAST output directories.
    
      unaligned_contig_list   <string>    File including a list of names of unaligned contigs. 
                                          In each directory, there should be sub-directories
                                          named by the sample names.

  10. getUnalnCtg

    eupanLSF getUnalnCtg is used to collect unaligned contigs.

    Usage: eupanLSF getUnalnCtg [options]    

    Necessary input description:

      assembly_directory      <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the assembly results,
                                          including file *_gc.contig.
    
      QUAST_assess_directory  <string>    This directory should contain many sub-directories 
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the QUAST assessment,
                                          including the contigs_reports directory.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
    Options:
         -h                               Print this usage page.
         -m                               Use gap-closed contigs instead of raw contigs

  11. mergeUnalnCtg

    "mergeUnalnCtg" is used to merge the unaligned contigs of each individual into a single file.

    Usage: eupanLSF mergeUnalnCtg [options]     

  12. rmRedundant

    "rmRedundant" is used to cluster sequences and extract the representative ones.

    Usage: eupanLSF rmRedundant [commands]
    Available commands:
          cdhitCluster      Clustering with CDHIT; fast, but only accepts identity > 0.8
          blastCluster      Clustering with Blastn

    12.1 cdhitCluster

    Usage: eupanLSF rmRedundant cdhitCluster [options]   

    eupanLSF rmRedundant cdhitCluster is used to cluster contigs and remove the redundant ones.

    Necessary input description:

      input_fasta_file        <string>    Contig sequences to be clustered.
    
      output_directory        <string>    Output directory.
    
      cdhit_directory         <string>    Directory where cdhit-est is located. 
    
    Options:
         -h                               Print this usage page.
    
         -t                   <int>       Threads used.
                                          Default: 1
    
         -c                   <float>     Sequence identity threshold
                                          Default: 0.9
         -q                   <string>    The queue name for job submission. 
                                          Default: default queue

    12.2 blastCluster

    Usage: eupanLSF rmRedundant blastCluster [options]   

    eupanLSF rmRedundant blastCluster is used to cluster contigs and remove the redundant ones.

    Necessary input description:

      input_fasta_file        <string>    Contig sequences to be clustered.
    
      output_directory        <string>    Output directory.
    
      blast_directory         <string>    Directory where blastn and makeblastdb are located. 
    
    Options:
         -h                               Print this usage page.
    
         -t                   <int>       Threads used.
                                          Default: 1
    
         -c                   <float>     Sequence identity threshold
                                          Default: 0.5
    
         -q                   <string>    The queue name for job submission. 
                                          Default: default queue

  13. pTpG

    The "pTpG" tool obtains the longest transcript of each gene.

    Usage: eupanLSF perTranPerGene  
    Option:
       -e    Check "exon" length instead of "CDS" length.
             Note that "exon" or "CDS" must appear in the 3rd column of the input file.
             Note "exon" or "CDS" should exist in the 3rd column of the input file

  14. geneCov

    The "geneCov" tool calculates the gene body coverage and CDS coverage of each gene.

    Usage: eupanLSF geneCov [options]     

    eupanLSF geneCov is used to calculate gene coverages of each gene.

    The script will call samtools and ccov.

    Necessary input description:

      bam_directory           <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the mapping result,
                                          a sorted .bam file.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
      genome_sequence         <string>    Genome sequences in a single fasta file.
    
      gene_annotation         <string>    Gene annotations in a single gtf file.
    
    Options:
         -h                               Print this usage page.
    
         -t                   <int>       Thread number.
                                          Default: 1
    
         -q                   <string>    LSF queue name
                                          Default: default queue   

  15. mergeGeneCov

    Usage: eupanLSF mergeGeneCov   

  16. geneExist

    The "geneExist" tool determines gene presence-absence.

    Usage: eupanLSF geneExist gene_file cds_file min_gene_cov min_cds_cov >output
    Inputs:
    gene_file     <string>   gene body coverage file
    cds_file      <string>   CDS coverage file
    min_gene_cov  <float>    minimum gene body coverage
    min_cds_cov   <float>    minimum CDS coverage
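
    How the two thresholds could combine into a presence call can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: a gene is called present only if both its gene body coverage and its CDS coverage reach their minimums (a ">=" comparison is assumed), and coverage is held as a gene -> sample -> fraction mapping. The function name and the threshold values 0.8 and 0.95 are hypothetical.

```python
from typing import Dict

Coverage = Dict[str, Dict[str, float]]  # gene -> sample -> covered fraction


def call_gene_pav(gene_cov: Coverage, cds_cov: Coverage,
                  min_gene_cov: float, min_cds_cov: float) -> Dict[str, Dict[str, int]]:
    """Return 1 (present) or 0 (absent) for every gene in every sample.

    Assumption: both thresholds must be met for a gene to count as present.
    """
    pav: Dict[str, Dict[str, int]] = {}
    for gene, per_sample in gene_cov.items():
        pav[gene] = {
            sample: int(cov >= min_gene_cov
                        and cds_cov[gene][sample] >= min_cds_cov)
            for sample, cov in per_sample.items()
        }
    return pav


gene_cov = {"g1": {"CX101": 0.98, "B152": 0.40},
            "g2": {"CX101": 0.90, "B152": 0.85}}
cds_cov = {"g1": {"CX101": 0.99, "B152": 0.10},
           "g2": {"CX101": 0.60, "B152": 0.97}}

# Illustrative thresholds only.
print(call_gene_pav(gene_cov, cds_cov, 0.8, 0.95))
# {'g1': {'CX101': 1, 'B152': 0}, 'g2': {'CX101': 0, 'B152': 1}}
```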

  17. subSample

    The "subSample" tool selects a subset of samples from a PAV profile.

    Usage: eupanLSF subSample gene_existence_matrix sample_list >output
    Inputs:
        gene_existence_matrix      gene presence/absence matrix
        sample_list                a sample list file

  18. gFamExist

    "gFamExist" is used to determine gene family presence-absence from gene presence-absence.

    Usage: eupanLSF gFamExist   > geneFamExist.txt
    Inputs:
    
        geneExist.txt         gene PAV matrix
        geneFam.info          gene family annotation
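
    One plausible way to derive family-level calls from gene-level calls is sketched below. It assumes a family is present in a sample when at least one of its member genes is present; that rule, the in-memory data layout, and the function name family_pav are assumptions made for illustration and are not taken from the tool.

```python
from typing import Dict, List


def family_pav(gene_pav: Dict[str, List[int]],
               gene_family: Dict[str, str]) -> Dict[str, List[int]]:
    """Collapse a gene PAV matrix (gene -> 0/1 per sample) to families.

    Assumption: a family is present in a sample if any member gene is.
    Genes without a family annotation are skipped.
    """
    families: Dict[str, List[int]] = {}
    for gene, calls in gene_pav.items():
        family = gene_family.get(gene)
        if family is None:
            continue
        if family not in families:
            families[family] = list(calls)
        else:
            # Element-wise OR across member genes.
            families[family] = [max(a, b) for a, b in zip(families[family], calls)]
    return families


gene_pav = {"g1": [1, 0, 0], "g2": [0, 0, 1], "g3": [0, 1, 0]}
gene_family = {"g1": "F1", "g2": "F1", "g3": "F2"}
print(family_pav(gene_pav, gene_family))
# {'F1': [1, 0, 1], 'F2': [0, 1, 0]}
```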

  19. bam2bed

    The "bam2bed" tool calculates the covered regions of the genome.

    Usage: eupanLSF bam2bed [options]    

    The outputs are covered fragments without overlap in 3-column .bed format.

    Necessary input description:

      bam_directory           <string>    This directory should contain many sub-directories
                                          named by sample names, such as CX101, B152, etc.
                                          Each sub-directory should contain the mapping result,
                                          a sorted .bam file.
    
      output_directory        <string>    Results will be output to this directory. To avoid 
                                          overwriting existing files, the output_directory
                                          must not already exist; it will be created by the 
                                          script itself.
    
    Options:
         -q            <string>      The queue name for job submission. 
                                     default: default queue
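
    "Covered fragments without overlap" means that overlapping alignments are collapsed into single intervals. The merging step can be sketched in Python as below. This is not the tool's implementation: it assumes alignments are already available as (chromosome, start, end) tuples in BED-style half-open coordinates, and it also merges intervals that touch end to start. The function name merge_covered is hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def merge_covered(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals per chromosome."""
    merged: List[List] = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the current fragment
        else:
            merged.append([chrom, start, end])
    return [(c, s, e) for c, s, e in merged]


alignments = [("chr1", 50, 150), ("chr1", 0, 100),
              ("chr1", 200, 300), ("chr2", 10, 20)]
for chrom, start, end in merge_covered(alignments):
    print(f"{chrom}\t{start}\t{end}")  # 3-column .bed lines
```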

  20. fastaSta

    The "fastaSta" tool checks basic statistics of a fasta file.

    Usage: eupanLSF fastaSta <fasta>

    Inputs:

    fasta   <string>     fasta file  

  21. sim

    The "sim" tool performs random sampling for pan-genome simulation. In each iteration of the simulation, samples are drawn at random one at a time, and the core genome size and pan-genome size are calculated after each draw.

    Usage: eupanLSF sim -n <sim_num> <gene_existence file> <out_dir>

    Inputs:

    sim_num              <int>   number of random simulations (default=100)
    gene_existence file  <file>  gene PAV matrix
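
    A single iteration of such a simulation can be sketched as follows. This is a minimal illustration, not the tool's code: it assumes the pan-genome is the set of genes present in at least one sampled individual, the core genome is the set present in all sampled individuals, and the PAV matrix is held as a gene -> list of 0/1 mapping. The function name simulate_once is hypothetical.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def simulate_once(pav: Dict[str, List[int]], n_samples: int,
                  rng: random.Random) -> List[Tuple[int, int, int]]:
    """Add samples in random order; return (n_added, pan_size, core_size)."""
    order = list(range(n_samples))
    rng.shuffle(order)
    pan: Set[str] = set()
    core: Optional[Set[str]] = None
    curve = []
    for n_added, idx in enumerate(order, start=1):
        present = {gene for gene, calls in pav.items() if calls[idx]}
        pan |= present                                       # seen in any sample
        core = present if core is None else core & present   # seen in all samples
        curve.append((n_added, len(pan), len(core)))
    return curve


pav = {"g1": [1, 1, 1], "g2": [1, 0, 1], "g3": [0, 0, 1], "g4": [0, 0, 0]}
rng = random.Random(0)
# Repeat sim_num times in practice and average the curves.
for point in simulate_once(pav, 3, rng):
    print(point)
```

    Whatever the sampling order, the pan-genome size can only grow and the core genome size can only shrink as samples are added, so the last point is always the same for a given matrix.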