eupan tool list:
qualSta View the overall sequencing quality of a large number of files
trim Trim or filter low-quality reads in parallel
alignRead Map reads to a reference in parallel
sam2bam Convert alignments (.sam) to sorted .bam files
bamSta Statistics of parallel mapping
assemble Assemble reads in parallel
assemSta Statistics of parallel assembly
getUnalnCtg Extract the unaligned contigs from nucmer alignment (processed by quast)
rmRedundant Remove redundant contigs of a fasta file
pTpG Get the longest transcripts to represent genes
geneCov Calculate gene body coverage and CDS coverage
geneExist Determine gene presence-absence based on gene body coverage and CDS coverage
subSample Select subset of samples from gene PAV profile
gFamExist Determine gene family presence-absence based on gene presence-absence
bam2bed Calculate genome region presence-absence from .bam
fastaSta Calculate statistics of fasta file
sim Simulation and plot of the pan-genome and the core genome
"qualSta" tool is used to check qualities of .fastq/.fastq.gz files on a large scale. It will plot the quality statistics automatically.
Usage: eupan qualSta [options]
The script calls the fastqc program, so please make sure fastqc is in your
PATH, or use the -f option to tell the program where fastqc is located.
Necessary input description:
data_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, there should be several
sequencing files ending in .fastq or .fastq.gz.
output_directory <string> Both final output files and intermediate results
will be found in this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
Options:
-h Print this usage page.
-f <string> The directory where the executable file (fastqc)
is located. If this option isn't given, we assume
that it is in your PATH.
-t <int> Specifies the number of files which can be processed
simultaneously. This parameter is passed to the fastqc
program. It is recommended to set it to the number of
files within each sample. Make sure the machine has
at least this many threads available.
default: 1
Example command:
eupan qualSta -f /path/to/Fastqc -t 4 data/ preview_quality/
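The input layout described above (one sub-directory per sample, an output directory that does not yet exist) can be checked before launching a run. The following is a minimal Python sketch and is not part of eupan; the function name check_layout and the shape of its return value are hypothetical.

```python
import os
import tempfile

def check_layout(data_dir, out_dir, suffixes=(".fastq", ".fastq.gz")):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    # The output directory must not exist yet: the tool creates it itself.
    if os.path.exists(out_dir):
        problems.append(f"output directory already exists: {out_dir}")
    samples = sorted(d for d in os.listdir(data_dir)
                     if os.path.isdir(os.path.join(data_dir, d)))
    if not samples:
        problems.append(f"no sample sub-directories in {data_dir}")
    # Every sample sub-directory should hold at least one sequencing file.
    for sample in samples:
        files = os.listdir(os.path.join(data_dir, sample))
        if not any(f.endswith(suffixes) for f in files):
            problems.append(f"no sequencing files in sample {sample}")
    return problems

# Demo on a throw-away directory tree (sample names taken from the text above).
with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data")
    os.makedirs(os.path.join(data, "CX101"))
    os.makedirs(os.path.join(data, "B152"))
    open(os.path.join(data, "CX101", "CX101_1.fastq.gz"), "w").close()
    demo_problems = check_layout(data, os.path.join(tmp, "preview_quality"))
    print(demo_problems)
```

The same check applies to the other tools that take a per-sample data directory; only the suffixes differ.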
"trim" tool is used to trim or filter raw sequencing data to generate high-quality paired-end fastq data.
Usage: eupan trim [options]
The script calls the trimmomatic program and also needs the parameter files within the trimmomatic directory, so the directory where trimmomatic is located must be given to the script as a necessary input.
Necessary input description:
fastq_data_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, there should be several
sequencing files ending in .fastq (or .fq) or .fastq.gz (or .fq.gz).
output_directory <string> High-quality reads will be output to this directory.
To avoid overwriting existing files, the
output_directory must not already exist; it
will be created by the
script itself.
Trimmomatic_directory <string> Directory where the trimmomatic program is located.
Options:
-h Print this usage page.
-t <int> Thread number.
-a <string> Adaptor file in fasta format used by the trimmomatic program.
Default: Trimmomatic_directory/adapters/TruSeq3-PE-2.fa
-s <string> Suffix of the fastq_file. Check your sequencing data and
change it if needed.
Default: ".fq.gz"
-k <string> Linker for the paired-end identifier. Paired-end fastq files
should end with *1suffix or *2suffix, where suffix is
".fq.gz" (or ".fastq", etc. See the -s option) and * is the
linker, such as "_". As an example, the file should
be like CX123_1.fq.gz (linker is "_", suffix is ".fq.gz")
or BX125_R1.fastq (linker is "_R", suffix is ".fastq").
Default: "_"
-p <33 or 64> Quality score version.
Default: 33 (phred+33)
-i <int> Parameter passed to Trimmomatic (LEADING or TRAILING).
Specifies the minimum quality required to keep a base at the
head or tail of a read.
Default: 20.
-w <int> Parameter passed to Trimmomatic (SLIDINGWINDOW). Specifies the
number of bases to average across.
Default: 4.
-n <int> Parameter passed to Trimmomatic (quality for SLIDINGWINDOW).
Default: 20.
-m <int> Parameter passed to Trimmomatic (MINLEN). Specifies the minimum
length of reads to be kept.
Default: 35.
-j <int> Parameter passed to Trimmomatic (HEADCROP). The number of bases
to remove from the start of the read. If it is set to 0, no base
will be removed from the head.
Default: 0.
Example commands:
eupan trim -p 64 -t 4 data/ trim/ /path/to/Trimmomatic/
eupan trim -p 33 -t 4 -w 83 -m 83 data/ filter/ /path/to/Trimmomatic/
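To make the option-to-Trimmomatic mapping concrete, the Python sketch below turns the -i, -w, -n, -m and -j values into Trimmomatic step strings. The step names (LEADING, TRAILING, SLIDINGWINDOW, MINLEN, HEADCROP) are Trimmomatic's own; the function name, the order of the steps, and the decision to skip HEADCROP when -j is 0 are assumptions for illustration, not taken from the eupan source. The adaptor-clipping step is left out because its thresholds are not listed above.

```python
def trimmomatic_steps(i=20, w=4, n=20, m=35, j=0):
    """Build Trimmomatic step strings from the eupan trim option values."""
    steps = []
    # Assumption: -j 0 means "remove nothing", so HEADCROP is omitted entirely.
    if j > 0:
        steps.append(f"HEADCROP:{j}")
    steps += [
        f"LEADING:{i}",            # -i: minimum quality at the head of a read
        f"TRAILING:{i}",           # -i: minimum quality at the tail of a read
        f"SLIDINGWINDOW:{w}:{n}",  # -w: window size, -n: required average quality
        f"MINLEN:{m}",             # -m: minimum read length to keep
    ]
    return steps

print(" ".join(trimmomatic_steps()))
print(" ".join(trimmomatic_steps(w=83, m=83)))
```

With the defaults this prints the steps LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:35.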
"alignRead" tool is used to map reads to the genome on a large scale.
Usage: eupan alignRead [options]
The script calls a mapping program (bwa mem or bowtie2), so the directory where the mapping tool is located is needed.
Necessary input description:
fastq_data_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, there should be several
sequencing files ending in .fq(.gz) or .fastq(.gz).
output_directory <string> Alignment results will be output to this directory.
To avoid overwriting existing files, the
output_directory must not already exist; it
will be created by the
script itself.
Mapping_tool_directory <string> Directory where the bwa or bowtie2 program is located.
Alignment_index <string> bowtie2 index built with the bowtie2-build program,
or bwa index built with bwa index.
Options:
-h Print this usage page.
-f <string> Select a mapping tool. Can be "bwa" or "bowtie2".
Default: bwa
-t <int> Threads used.
Default: 1
-s <string> Suffix of the fastq_file. Check your sequencing data and
change it if needed.
Default: ".fq.gz"
-k <string> Linker for the paired-end identifier. Paired-end fastq files
should end with *1suffix or *2suffix, where suffix is
".fq.gz" (or ".fastq", etc. See the -s option) and * is the
linker, such as "_". As an example, the file should
be like CX123_1.fq.gz (linker is "_", suffix is ".fq.gz")
or BX125_R1.fastq (linker is "_R", suffix is ".fastq").
Default: "_"
-m <int> Minimum insert length, for bowtie2 only.
Default: 0
-n <int> Maximum insert length, for bowtie2 only.
Default: 1000
Examples:
eupan alignRead -f bwa -t 4 trim/data/ map2pan/ /path/to/bwa/ pan/pan.fa
eupan alignRead -f bowtie2 -t 4 trim/data/ map2pan/ /path/to/bowtie2/ pan/pan
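The -k (linker) and -s (suffix) options together decide which files are treated as mates. The Python sketch below shows one way such pairing could work, using the file names from the option description; the function name pair_fastq and the regular expression are hypothetical and are not taken from the eupan source.

```python
import re
from collections import defaultdict

def pair_fastq(filenames, linker="_", suffix=".fq.gz"):
    """Group files named <prefix><linker>1<suffix> / <prefix><linker>2<suffix>."""
    pattern = re.compile(
        r"^(.+)" + re.escape(linker) + r"([12])" + re.escape(suffix) + r"$"
    )
    mates = defaultdict(dict)
    for name in filenames:
        match = pattern.match(name)
        if match:  # files that do not follow the naming scheme are ignored
            mates[match.group(1)][match.group(2)] = name
    # Keep only prefixes for which both mates were found.
    return {prefix: (m["1"], m["2"])
            for prefix, m in sorted(mates.items())
            if "1" in m and "2" in m}

print(pair_fastq(["CX123_1.fq.gz", "CX123_2.fq.gz", "notes.txt"]))
print(pair_fastq(["BX125_R1.fastq", "BX125_R2.fastq"], linker="_R", suffix=".fastq"))
```

If no pairs are found for a sample, the usual cause is a linker or suffix that does not match the actual file names.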
"sam2bam" tool is used to adjust mapping results, including converting alignments (.sam) to sorted .bam files.
Usage: eupan sam2bam [options]
The script calls the samtools program, so the directory where samtools is located is needed.
Necessary input description:
mapping_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, one or more mapping results
(*.sam) should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
samtools_directory <string> Directory where the samtools executable is located.
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
Example command:
eupan sam2bam -t 4 map2pan/data/ panBam/ /path/to/samtools-1.3/
"bamSta" tool is used to check statistics of .bam files.
Usage: eupan bamSta [commands] ...
Commands:
basic calculate basic statistics
cov calculate genome coverage
"bamSta" contains two sub-programs: 1)"basic" to provide basic alignment statistics and 2)"cov" to provide the coverage of the genome.
Usage: eupan bamSta basic [options]
eupan bamSta basic is used to check the basic statistics of mapping.
The script calls bam_stats (in bamUtil), so the directory where bamUtil is located is needed.
Necessary input description:
bam_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, the mapping result, a sorted .bam
file, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
bamUtil_directory <string> bamUtil directory where bin/bam is located.
Example command:
eupan bamSta basic refBam/data/ refBamSta/basic/ /path/to/bamUtil/
Usage: eupan bamSta cov [options]
eupan bamSta cov is used to check the coverages of the genome.
The script calls the qualimap software, so the directory where qualimap is located is needed.
Necessary input description:
bam_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, the mapping result, a sorted .bam
file, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
qualimap_directory <string> Directory where the qualimap executable is located.
Options:
-h Print this usage page.
-m <string> Maximum memory size for Java to use.
Default: 12G
-t <int> Thread number.
Default: 4
Example command:
eupan bamSta cov refBam/data/ refBamSta/cov/ /path/to/qualimap/
"assemble" is used to assemble short reads. It contains 2 sub-programs:
1) "soapdenovo" raw soapdenovo2 assembly with fixed Kmer
2) "linearK" iterative use of soapdenovo2 with flexible Kmer
Usage: eupan assemble [commands] ...
Commands:
soapdenovo Assembly with SOAPdenovo 2.
linearK Assembly with an iterative use of SOAPdenovo 2 (Recommended).
Usage: eupan assemble soapdenovo [options]
Necessary input description:
fastq_data_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, there should be several
sequencing files ending in .fastq or .fastq.gz.
output_directory <string> Assembly results will be output to this directory.
To avoid overwriting existing files, the
output_directory must not already exist; it
will be created by the
script itself.
soapdenovo_directory <string> Directory where the soapdenovo2 executable files exist.
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
-s <string> Suffix of files within data_directory.
Default: .fq.gz
-k <int> Kmer.
Default: 35
-c <string> Parameters of the soapdenovo2 config file: 8 parameters joined by commas.
1)maximal read length
2)average insert size
3)if sequence needs to be reversed
4)in which part(s) the reads are used
5)use only first N bps of each read
6)in which order the reads are used while scaffolding
7)cutoff of pair number for a reliable connection (at least 3 for
short insert size)
8)minimum aligned length to contigs for a reliable read location
(at least 32 for short insert size)
Default: 80,460,0,3,80,1,3,32
-g enable gapcloser
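The eight comma-joined values of -c correspond, in order, to the standard keys of a SOAPdenovo2 config file. The Python sketch below expands the default string into such a config. The key names are SOAPdenovo2's own; the function name soap_config and the read file names are hypothetical placeholders, and how eupan itself writes the file is not shown in this page.

```python
# SOAPdenovo2 config keys, in the same order as the 8 values of -c.
CONFIG_KEYS = ["max_rd_len", "avg_ins", "reverse_seq", "asm_flags",
               "rd_len_cutoff", "rank", "pair_num_cutoff", "map_len"]

def soap_config(c="80,460,0,3,80,1,3,32",
                q1="sample_1.fq.gz", q2="sample_2.fq.gz"):
    """Expand a -c string into SOAPdenovo2 config text (q1/q2 are placeholders)."""
    values = c.split(",")
    if len(values) != len(CONFIG_KEYS):
        raise ValueError(f"expected {len(CONFIG_KEYS)} values, got {len(values)}")
    params = dict(zip(CONFIG_KEYS, values))
    # max_rd_len is a global setting; the remaining keys belong to the library.
    lines = [f"max_rd_len={params['max_rd_len']}", "[LIB]"]
    lines += [f"{key}={params[key]}" for key in CONFIG_KEYS[1:]]
    lines += [f"q1={q1}", f"q2={q2}"]
    return "\n".join(lines)

print(soap_config())
```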
Usage: eupan assemble linearK [options]
eupan assemble linearK is used to assemble high-quality reads on large scale.
Necessary input description:
fastq_data_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, there should be several
sequencing files ending in .fastq or .fastq.gz.
output_directory <string> Assembly results will be output to this directory.
To avoid overwriting existing files, the
output_directory must not already exist; it
will be created by the
script itself.
soapdenovo_directory <string> Directory where the soapdenovo2 executable files exist.
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
-g <int> Genome size. Used to infer sequencing depth.
Default: 380000000 (380M)
-s <string> Suffix of files within data_directory.
Default: .fq.gz
-r <string> Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1.
The parameter should be input as "a,b".
Default: 0.76,20
-w <int> Step-length of Kmer change.
Default: 2
-u <int> Upper limit on the number of Kmer changes. This parameter is set to reduce
redundant computation.
Default: 10
-c <string> Parameters of the soapdenovo2 config file: 8 parameters joined by commas.
1)maximal read length
2)average insert size
3)if sequence needs to be reversed
4)in which part(s) the reads are used
5)use only first N bps of each read
6)in which order the reads are used while scaffolding
7)cutoff of pair number for a reliable connection (at least 3 for
short insert size)
8)minimum aligned length to contigs for a reliable read location
(at least 32 for short insert size)
Default: 80,460,0,3,80,1,3,32
-n <int> The minimum length of contigs. Contigs shorter than this length will
NOT be used when calculating N50.
Default: 100
-k <string> Available Kmer range. Give comma-separated lower bound and upper bound.
Default: 15,127
-m <int> The number of consecutive Ns at which scaffolds are broken into contigs. This is used
when breaking gap-closed scaffolds into contigs.
Default: 10.
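The linear function of the -r option can be worked through in a few lines of Python. The formula and the default values (a=0.76, b=20, genome size 380M, Kmer range 15 to 127) come from the options above. Computing depth as total bases divided by genome size, and clamping the result to the -k range, are assumptions about how the values are combined, and the function names are hypothetical.

```python
def sequencing_depth(total_bases, genome_size=380_000_000):
    """Assumed definition: sequenced bases divided by the genome size (-g)."""
    return total_bases / genome_size

def linear_kmer(depth, a=0.76, b=20, k_range=(15, 127)):
    """Kmer = 2*int(0.5*(a*Depth+b))+1, which is always odd."""
    kmer = 2 * int(0.5 * (a * depth + b)) + 1
    low, high = k_range
    # Assumption: values outside the available range (-k) are clamped to it.
    return max(low, min(high, kmer))

depth = sequencing_depth(11_400_000_000)  # 11.4 Gb of reads on a 380 Mb genome
print(depth, linear_kmer(depth))          # 30.0 43
```

For a depth of 30, a*Depth+b = 42.8, half of that truncates to 21, and 2*21+1 gives a starting Kmer of 43.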
"assemSta" tool is used to map assembled contigs to reference and to check the statistics of assembled contigs (or scaffolds).
Usage: eupan assemSta [options]
The script calls the QUAST program, so the directory where quast.py is located is needed.
Necessary input description:
assembly_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, assembly results, including
files *.scafSeq and *.contig, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
QUAST_directory <string> QUAST directory where quast.py is located.
reference.fa <string> Reference sequence file (.fa or .fa.gz).
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
-m <int> Minimum contig length used for assessment.
Default: 500
-g Check the statistics of gap-closed assemblies if -g is
enabled. In the assembly directory of each sample,
*_gc.scafSeq and *_gc.contig should exist.
Default: check statistics of raw assemblies
-s Check the statistics of assembled scaffolds if -s is enabled.
Default: check statistics of assembled contigs
eupan getUnalnCtg is used to collect unaligned contigs.
Usage: eupan getUnalnCtg [options]
Necessary input description:
assembly_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, assembly results, including
the file *.contig, should exist.
QUAST_assess_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, the quast assessment, including
the directory contigs_reports, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
Options:
-h Print this usage page.
-m Use gap-closed contigs instead of raw contigs
"rmRedundant" is used to cluster sequences and extract the representative ones.
eupan rmRedundant [commands]
Available commands:
cdhitCluster Clustering with CDHIT; fast, but only accepts identity > 0.8
blastCluster Clustering with Blastn
Usage: eupan rmRedundant cdhitCluster [options]
eupan rmRedundant cdhitCluster is used to cluster contigs and remove the redundant ones.
Necessary input description:
input_fasta_file <string> Contig sequences to be clustered.
output_directory <string> Output directory.
cdhit_directory <string> Directory where cdhit-est is located.
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
-c <float> Sequence identity threshold
Default: 0.9
Usage: eupan rmRedundant blastCluster [options]
eupan rmRedundant blastCluster is used to cluster contigs and remove the redundant ones.
Necessary input description:
input_fasta_file <string> Contig sequences to be clustered.
output_directory <string> Output directory.
blast_directory <string> directory where blastn and makeblastdb locate.
Options:
-h Print this usage page.
-t <int> Threads used.
Default: 1
-c <float> Sequence identity threshold
Default: 0.5
"pTpG" tool is used to obtain the longest transcript of each gene.
Usage: eupan perTranPerGene
"geneCov" tool is used to calculate gene body coverage and CDS coverage of each gene.
Usage: eupan geneCov [options]
This script will call samtools and ccov.
Necessary input description:
bam_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, the mapping result, a sorted .bam
file, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
genome_sequence <string> genome sequences in a single fasta
gene_annotation <string> gene annotations in a single gtf file
Options:
-h Print this usage page.
-t <int> Thread number.
Default: 1
"geneExist" tool is used to determine gene presence-absence.
eupan geneExist gene_file cds_file min_gene_cov min_cds_cov >output
Inputs:
gene_file <string> gene body coverage file
cds_file <string> CDS coverage file
min_gene_cov <float> minimum gene body coverage
min_cds_cov <float> minimum CDS coverage
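The two thresholds can be read as a simple decision rule per gene and sample. The Python sketch below assumes that a gene is called present only when both its gene body coverage and its CDS coverage reach their minimums, and that the comparison is inclusive; neither detail is stated above. The coverage values and the thresholds 0.8 and 0.95 are made up for illustration.

```python
def gene_present(gene_cov, cds_cov, min_gene_cov, min_cds_cov):
    """Assumed rule: both coverages must reach their thresholds (inclusive)."""
    return gene_cov >= min_gene_cov and cds_cov >= min_cds_cov

# Hypothetical coverages for one sample: gene -> (gene body cov, CDS cov).
coverages = {
    "geneA": (0.99, 1.00),
    "geneB": (0.85, 0.60),   # gene body is covered, CDS is not
    "geneC": (0.10, 0.05),
}
calls = {gene: int(gene_present(g, c, 0.8, 0.95))
         for gene, (g, c) in coverages.items()}
print(calls)  # 1 = present, 0 = absent
```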
"subSample" tool is used to select a subset of samples of a PAV profile.
Usage: eupan subSample gene_existence_matrix sample_list >output
Inputs:
gene_existence_matrix gene presence/absence matrix
sample_list a sample list file
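Selecting a subset of samples amounts to keeping a subset of columns. The Python sketch below assumes a tab-delimited matrix with a header row of sample names and gene names in the first column; the actual file layout used by eupan is not described here, so treat the format and the function name sub_sample as hypothetical.

```python
def sub_sample(matrix_lines, wanted_samples):
    """Keep the gene column plus the columns whose header is in wanted_samples."""
    header = matrix_lines[0].rstrip("\n").split("\t")
    keep = [0] + [i for i, name in enumerate(header)
                  if i > 0 and name in wanted_samples]
    result = []
    for line in matrix_lines:
        fields = line.rstrip("\n").split("\t")
        result.append("\t".join(fields[i] for i in keep))
    return result

matrix = [
    "Gene\tCX101\tB152\tCX123",
    "geneA\t1\t1\t1",
    "geneB\t0\t1\t0",
]
for row in sub_sample(matrix, {"CX101", "CX123"}):
    print(row)
```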
"gFamExist" is used to determine gene family presence-absence from gene presence-absence.
Usage: eupan gFamExist geneExist.txt geneFam.info > geneFamExist.txt
Inputs:
geneExist.txt gene PAV matrix
geneFam.info gene family annotation
"bam2bed" tool is used to calculate the covered region of the genome.
Usage: eupan bam2bed [options]
The outputs are covered fragments without overlap in 3-column .bed format.
Necessary input description:
bam_directory <string> This directory should contain many sub-directories
named by sample name, such as CX101, B152, etc.
In each sub-directory, the mapping result, a sorted .bam
file, should exist.
output_directory <string> Results will be output to this directory. To avoid
overwriting existing files, the output_directory
must not already exist; it will be created by the
script itself.
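Producing covered fragments without overlap is an interval-merging problem. The Python sketch below merges overlapping or directly adjacent intervals into 3-column BED-style records (chromosome, 0-based start, end). It illustrates the general technique only; how bam2bed derives the covered intervals from the .bam file is not shown, and the input intervals are made up.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous fragment: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(fragment) for fragment in merged]

covered = [("chr1", 50, 150), ("chr1", 0, 100), ("chr1", 150, 200),
           ("chr1", 300, 400), ("chr2", 10, 20)]
for chrom, start, end in merge_intervals(covered):
    print(f"{chrom}\t{start}\t{end}")
```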
"fastaSta" tool is used to check basic statistics of a fasta file.
Usage: eupan fastaSta <fasta>
Inputs:
fasta <string> fasta file
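Which statistics fastaSta reports is not listed here. As an illustration of typical fasta statistics, the Python sketch below computes sequence lengths and N50, with an optional minimum length like the contig cutoff mentioned for the assembly tools. The function names are hypothetical and the sequences are made up.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in fasta-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new sequence
        elif line and lengths:
            lengths[-1] += len(line)   # sequence may span several lines
    return lengths

def n50(lengths, min_len=0):
    """Length L such that sequences of length >= L hold at least half the bases."""
    kept = sorted((l for l in lengths if l >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length
    return 0

lengths = fasta_lengths([">a", "ACGT", "AC", ">b", "GG"])
print(lengths, sum(lengths), n50(lengths))
print(n50([10, 8, 2, 2, 2]), n50([10, 8, 2, 2, 2], min_len=5))
```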
"sim" tool is used to do random sampling for pan-genome simulation. In each iteration of the simulation, samples are added one by one in random order, and the core genome size and pan-genome size are calculated after each addition.
Usage: eupan sim -n <sim_num> <gene_existence file> <out_dir>
inputs:
sim_num <int> number of random simulations (default=100)
gene_existence file <file> gene PAV matrix
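The sampling procedure can be sketched in Python as follows. Each simulation shuffles the sample order, adds samples one at a time, and records the pan-genome size (genes seen in any sample so far) and the core genome size (genes seen in every sample so far). The in-memory representation of the PAV matrix, the function names, and the fixed seed are assumptions for illustration; the plotting step is not covered.

```python
import random

def simulate_once(pav, rng):
    """pav maps sample name -> set of genes present in that sample."""
    order = list(pav)
    rng.shuffle(order)
    pan, core = set(), None
    curve = []
    for sample in order:
        genes = pav[sample]
        pan |= genes                                    # union grows the pan-genome
        core = set(genes) if core is None else core & genes  # intersection shrinks the core
        curve.append((len(pan), len(core)))
    return curve

def simulate(pav, sim_num=100, seed=0):
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    return [simulate_once(pav, rng) for _ in range(sim_num)]

pav = {"S1": {"g1", "g2"}, "S2": {"g1", "g3"}, "S3": {"g1"}}
curves = simulate(pav, sim_num=5)
print(curves[0])
```

Whatever the sampling order, every curve ends at the same point: here 3 pan-genome genes and 1 core gene once all three samples are included.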