The purpose of this SOP is to help users to carry out pan-genome studies with EUPAN strategy. The EUPAN SOP is divided into 3 parts:
The example data used in this SOP can be downloaded here. The example includes sequencing data of 3 individual. For each individual, there are 3 Illumina sequencing runs. Each run contains 500,000 paired-end reads. The read length is 83bp and the quality values follow phred+64 rule. Note these are only simple examples to help users understand the input data type and data structure. The real data may be much larger and more complex.
After downloading the data, follow the command:
tar zxvf eupanExample.tar.gz
and you will find two directories within the eupanExample/ directory:
data/ this directory includes the sequencing data with a per-sample-per-directory structure ref/ref.fa example of a reference sequence (chr12 of the IRGSP-1.0 rice genome) ref/ref.gtf gene annotation of the reference sequence unaln.gtf a blank file
The main analysis procedures include:
We'll introduce how to process your data step by step using EUPAN toolkit and other required tools.
Requirements
EUPAN incorporates FastQC tool to view the sequencing quality and Trimmomatic tool to filter and trim low-quality reads.
A recommended quality control procedure includes:
We provide the qualitySta
tool to overview the overall qualities. -t
indicates #cores used to run Fastqc parallelly. -v
indicates pair-end or single-end data. Follow the commands:
cd eupanExample eupan qualSta -f /path/to/Fastqc -t 4 -v PE data/ preview_quality/
Results can be found in the preview_quality/
directory:
fastqc_output detailed outputs of each sample file.summary combined quality statistics of all files sample.summary combined quality statistics of all samples (A sample's "PASS" requires "PASS" of all files of the sample) file.summary_heatmap.pdf heatmap of file statistics from file.summary sample.summary_heatmap.pdf heatmap of sample statistics from sample.summary
In all output figures, green represents "PASS"; yellow represents "WARNING"; and red represents "FAIL".
The summary files together with the plots will provide a landscape of the overall sequencing qualities. We also recommends the users go through some detailed sample results to gain a better understanding of the data and data qualities.
We provide the trim tool to trim or filter low-quality reads. After preview the qualities, the users should have a basic understanding of the sequencing data including the quality score version (phred33 or phred64), read length, etc. To trim the reads, follow the command:
eupan trim -p 64 -t 4 data/ trim/ /path/to/Trimmomatic/
To filter low-quality reads, follow the comand:
eupan trim -p 64 -t 4 -w 83 -m 83 data/ filter/ /path/to/Trimmomatic/
Results can be found in the output directory (trim/ or filter/):
data/ directory of the high-quality reads of each sample err/ logs and standard errors
After trimming or filtration of reads, the quality should be scanned again. Follow the command:
eupan qualSta -f /path/to/Fastqc -t 4 trim/data/ review_quality/
Similar to step 1, check the qualities manually and decide if the trimming or filtration step is good enough for subsequent analyses.
If the trimming results do not match the user's expection, new parameters should be selected and step 2 and step 3 should be conducted for several times. We highly recommend the users to run step 2 with a set of parameters on a small subset of their data, evaluate their performance and apply the selected parameter on the whole data.
The output results for this step can be found here.
Requirements
To gain the comprehensive sequences of all samples, we need first to assemble the individual genomes from short reads. We provide two different strategies in this step: 1) direct assembly and 2) iterative assembly to select the optimal Kmer (recommended). See linearK usage for the detailed strategies for the iterative assembly method. Please note that assembly requires relatively large memory. Pay attention to the memory usage when running the following steps.
1) To assemble all samples directly with SOAPdenovo 2 and Gapcloser under a fixed Kmer (e.g. K=23), use command
eupan assemble soapdenovo -t 4 -k 23 -g data/ assembly_soap/ /path/to/SOAPdenovo2/
The assemblies of individuals can be found in the assembly_soap/
directory. To enable GapCloser, add -g
option.
2) To assemble all samples directly with iterative use of SOAPdenovo, run with command
eupan assemble linearK data assembly_linearK/ /path/to/SOAPdenovo2/
(More information about this optimized strategy)
The assemblies of individuals can be found in the assembly_linearK/
directory.
This program utilizes a linear model of sequencing depth to estimate the initialized Kmer. Therefore when applying to user's real data, the estimated genome size (reference genome size is OK) should be provided with -g
option. We recommend the user to re-estimate the linear model on a subset of samples and provide it to the program with -r
option when processing all samples. According to our experience, this method is about 2.5~3.5 times lower than the direct use of SOAPdenovo2.
The output results for this step can be found here.
In order to have an overview of the assemblies, we evaluate the major assembly statistics with QUAST software including assembly size, N50, genome fraction (how much of the reference genome is covered), unaligned size, etc. In this process, the contigs from each individual are aligned to the reference genome with nucmer tool within Mummer package, enabling the direct extraction of unaligned sequences in the subsequent analysis. Also the users can extract the unaligned sequences with their own parameters.
QUAST installation:
cd ../tools/ tar -zxvf quast-3.2.tar.gz cd quast-3.2 python quast.py --test cd ../../eupanExample/
To evaluate the assemblies of the contigs, run with command
eupan assemSta -t 4 assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa
or
eupan assemSta -t 4 assembly_linearK/linearKdata/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa
To evaluate the assemblies of the scaffolds, run with command
eupan assemSta -t 4 -s assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa
or
eupan assemSta -t 4 -s assembly_linearK/data/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa
Note that the assembly step could take a long time. Here for our example data, it would take ~2h with 4 cores(-t 4). The assembly statistics of each sample can be found in the assembly_soap_sta/data/
or assembly_linearK_sta/data/
directory and a summary can be found in assembly_soap_sta/total_statistics.txt
or assembly_linearK_sta/total_statistics.txt
file. Note the sequencing data used in this example is of small size in order to limit the running time, therefore the assembly size is also very small and the statistics is incomplete. It will work well on the user's large data. The assembly of each example sample using SOAPdenovo directly takes about 65 minutes on 16 cpus of 1064 MHz. Totally running assembly with SOAPdenovo directly takes about 2 hours on a single machine. It takes much longer to assemble with the optimized strategy. Therefore we recomend the users who want to go over this SOP to run the first command instead of the second command.
The output results for this step can be found here.
There are several strategies to construct the pan-genome sequences.
If there is a reference genome of the studied organism, we reccommend the first strategy to benifit from the high-quality assembly of the reference together with the high-quality annotation. If there is no such reference, you may use the second strategy. Or still follow the first strategy by building a reference first. To do this, there should be a multi-library sequencing for the reference sample with high-sequencing depth and multiple insertion sizes.
In this SOP, we only describe how to build the comprehensive pan-genome sequences based on a reference.
Requirements
Construction of the pan-genome sequences based on a reference genome involves the following steps:
Actually this has been done in the evaluation of assemblies. Users can also carry out this step with Nucmer tool within Mummer package.
To do this, run command
eupan getUnalnCtg assembly_soap/data/ assembly_soap_sta/data/ unalnCtgs
or
eupan getUnalnCtg assembly_linearK/data/ assembly_linearK_sta/data/ unalnCtgs
The unaligned contigs can be found in unalnCtgs/total_unaln.fa file
Check the basic statistics of contigs (base number and contig number) with fastaSta command
eupan fastaSta unalnCtgs/total_unaln.fa
To remove redundant contigs, run command
eupan rmRedundant cdhitCluster unalnCtgs/total_unaln.fa unalnCtgs/nrUnaln /path/to/cd-hit/ eupan fastaSta unalnCtgs/nrUnaln/non-redundant.fa
Non-redundant contigs (identity<0.9) and the clustering information can be found in unalnCtgs/nrUnaln/
directory.
This is a very important step in such pan-genome analysis involving large number of individuals. The non-redundant contigs are the enrichment of sequences missed in the reference, and are also the enrichment of contaminates. Users should remove contaminates according to their studied species. For our rice study, we removed contaminates from bacteria, virus, archaea and even animals. Actually, we remove all contigs whose best alignment to nt database is not green plants. There is no gold standard in this step, therefore we provide no tool for this part. Users should remove contaminates by themselves.
The non-redundant novel sequences we obtained above is at identity of 0.9. Sometimes we want to know the size under a much lower identity (say 0.5).Here we provide a blast-based method to cluster sequences and remove the non-redundant contigs. To do this, run command
eupan rmRedundant blastCluster -c 0.5 unalnCtgs/total_unaln.fa unalnCtgs/nrUnalnBlast/ /path/to/ncbi-blast/bin/
The outputs are similar to those of "eupan rmRedundant cdhitCluster" command. The results can be found in unalnCtgs/nrUnalnBlast/ directory.
We recommend to use non-redundant sequences derived from step 4) instead of sequences derived at much lower identity for the subsequent analyses. We will map raw reads to the comprehensive sequences to determine gene presence-absence. Using low-identity contigs will result in loss of mapping of reads and further result in miss annotation of a gene existence or a gene family existence. To merge the reference and non-redundant contigs, simply run
mkdir pan cat ref/ref.fa unalnCtgs/nrUnaln/non-redundant.fa >pan/pan.fa
The output results for this step can be found here.
Annotation of the pan-genome sequences can be divided into two parts:
Gene annotation of the pan-genome sequences can also be divided into 2 parts: i)gene annotation of the reference genome; and ii)gene prediction of the novel sequences.
As described above, one of the advantages to build pan-genome sequences is that we can use the high-quality annotation of the reference genome. The annotation of the reference genome is in ref/ref.gtf file. To be consistent with the annotation of unaligned sequences and also to be convenient for subsequent analyses, the annotation should be re-arranged following per-transcript-per-gene pattern. For genes with multiple transcripts, only the one with the longest coding sequences are selected as represents. To do this run command
eupan pTpG ref/ref.gtf ref/ref-ptpg.gtf
Gene prediction of eukaryotes is a challenging and complex task. And the selection of gene prediction tools or pipelines depends on the target organism. In rice study, we used MAKER-P pipeline to integrate ab initio predictions, transcript evidence and protein homologies. For animals or other organisms, you may want MAKER or PASA. That's why we provide no tools for this step. Users need to carry out the gene prediction step carefully.
The reccomended pipelines are:
Here we use a blank file (./unaln.gtf
) to subtitute the gene annotation of unaligned sequences.
After these two steps, we should combine the annotation of the reference and the annotation of unaligned sequences. Simply run
cat ref/ref-ptpg.gtf unaln.gtf >pan/pan.gtf
The output results for this step can be found here.
Pan-genome analyses are often carried out on gene families. Here we recommend cluster genes into gene families with OrthoMCL. OrthoMCL requires specific configuration, therefore we didn't integrate it into the toolkit. Users need to carry out this step by themselves.
Requirements
We utilize a "map-to-pan" strategy to determine gene presence-absence and gene family presence-absence. Therefore we need to map the raw reads to the comprehensive sequences first. This can be done using "bwa mem" (recommended) or bowtie2.
To use bwa mem, run the following commands
cd pan/ /path/to/bwa/bwa index -a bwtsw pan.fa cd .. eupan alignRead -f bwa -t 4 trim/data/ map2pan/ /path/to/bwa/ pan/pan.fa
To use bowtie2, run the following commands
cd pan/ /path/to/bowtie2/bowtie2-build pan.fa pan cd .. eupan alignRead -f bowtie2 -t 4 trim/data/ map2pan/ /path/to/bowtie2/ pan/pan
The results can be found in map2pan/
directory.
Next, the .sam
output should be converted to .bam
and then sorted, merged and indexed. The .bam
and .bai
files can be used to view alignments in genome browser. And .sam
files can be further removed to save space. To do this, run command
eupan sam2bam -t 4 map2pan/data/ panBam/ /path/to/samtools-1.3/ rm -r map2pan/data/
The results can be found in panBam/
directory. There are two files for each sample: .bam
and .bai
. We also provide a bam2bed tool to calculate the covered region of the genome. The output non-overlapped fragments are organized in 3-column .bed
files which can be utilized for visualization in pan-genome browser and also matained for further use. See our Rice Pan-genome Browser (RPAN) for example.
eupan bam2bed panBam/data/ covRegion/
The .bed
files can be found in covRegion/data/
directory.
What's more, the users can also use this tool to align reads to the reference genome alone and check the statistics. We show how to do this with bwa as an example. Similar as mentioned above, do the alignment with commands
cd ref/ /path/to/bwa/bwa index -a bwtsw ref.fa cd .. eupan alignRead -f bwa -t 4 trim/data/ map2ref/ /path/to/bwa/ ref/ref.fa
Convert .sam
to .bam
and remove .sam
with commands
eupan sam2bam -t 4 map2ref/data/ refBam/ /path/to/samtools-1.3/ rm -r map2ref/data/
Check the basic statistics from .bam
files with command
mkdir refBamSta eupan bamSta basic refBam/data/ refBamSta/basic/ /path/to/bamUtil/
The basic statistics can be found in refBamSta/basic/summary.txt
Check how much of the genome can be covered from .bam
files with command
eupan bamSta cov refBam/data/ refBamSta/cov/ /path/to/qualimap/
The coverages can be found in refBamSta/cov/summary.txt
After mapping the reads to the pan-genome sequence, we can calculate the gene body coverage and CDS coverage of each gene. Run command
eupan geneCov -t 4 panBam/data/ geneCov/ pan/pan.gtf
The gene body coverages can be found in geneCov/summary_gene.cov
and the cds coverages can be found in geneCov/summary_cds.cov
.
Next we can determine gene presence-absence considering genebody coverages and cds coverages. If we define genes with gene body coverage >0.8 and cds coverage >0.95 ere considered as present in the genome, gene PAV profile can be calculated by
mkdir geneExist eupan geneExist geneCov/summary_gene.cov geneCov/summary_cds.cov 0.8 0.95 >geneExist/gene.exist
The output file are organized as
gene sampleA sampleB sampleC ... geneA 1 1 1 ... geneB 1 0 1 ... geneC 1 1 1 ... geneD 1 0 1 ... ...
where 1 represent presents and 0 represents absence. The gene PAV profile of example data can be acquired here. Also, The output results for this step can be found here.
Obviously the coverages are influenced by the sequencing depth. Insufficient sequencing will result in mis-annotation of presented genes. The example sequencing files are from rice and has very low sequencing deth (below 1x). We recommend the users to add a sample selection step before the pan-genome analyses. The users can select a subset of individuals based on sequencing depth and mapping depth (based on reference) for the subsequent analyses. In our rice study, we required the sequencing depth is higher than 20 and the mapping depth is higher than 15.
Users can put the selected sample name in a list as
sampleA sampleB sampleC ...
With the subSample command, users can select a subset of samples from the PAV profile
eupan subSample geneExist/gene.exist list >geneExist/SampleFilteredGene.exist
The new PAV profile for the selected sample can be found in geneExist/SampleFilteredGene.exist
file.
Next we can further convert gene PAV profile into gene family PASV profile with a gene family annotation. An example of gene family annotation file can be found here. Gene family PAVs can be derived from gene PAVs with gFamExist command
eupan gFamExist geneExist/gene.exist geneFam.txt >geneFam.exist
geneFam.txt contains the annotation of gene families. Each line represents a gene family and genes included in this gene family (tab-seperated gene family name and gene names. Gene names are further seperated with spaces). An example of the gene family annotation file can be found at File 5 of Hu, Z. et al. Figshare
The geneFam.exist
file is organized as the same pattern of gene PAV profile. Therefore the subsequent pan-genome analyses can be carried out for both gene PAV profile and gene family PAV profile.
Next we will do random sampling for pan-genome simulation. For each iteration of simulation, we will randomly sample one by one, and calculate the core genome size and pan genome size.
eupan sim -n 100 geneExist/gene.exist simulation_pav
Here, -n
indicates the number of random simulations (default=100). The result can be found in simulation_pav/sim_out.txt_PAV_plot.pdf
Traditional pan-genome analyses can be carried out based on the PAV profiles:
In addition, classifications of samples can be derived from the phylogenetic analysis if there's no such information before you do this analysis. Based on the grouping information of samples, you can carry out many comparisons among groups:
What's more, by integrating other knowledge, many customized analyses can be carried out. We give some examples here.
.......
.......
We recommend to use tools with the exact version we mentioned. If you got errors, please first read the error files and try the recommended version. There are logs and error files for each step. Bugs should be sent to HZQ. Pan-genome analysis using "map-to-pan" strategy is somewhat complex. We welcome the users' suggestions and discussions on any point involved in the pipeline.