EUPAN: Eukaryotic PAN-genome analysis toolkit

The purpose of this SOP is to help users to carry out pan-genome studies with EUPAN strategy. The EUPAN SOP is divided into 3 parts:

eupan SOP for single machine;
eupanLSF SOP for LSF system;
eupanSLURM SOP for SLURM system.

EUPAN

Installation

Example data

The example data used in this SOP can be downloaded here. The example includes sequencing data of 3 individual. For each individual, there are 3 Illumina sequencing runs. Each run contains 500,000 paired-end reads. The read length is 83bp and the quality values follow phred+64 rule. Note these are only simple examples to help users understand the input data type and data structure. The real data may be much larger and more complex.

After downloading the data, follow the command:

tar zxvf eupanExample.tar.gz

and you will find two directories within the eupanExample/ directory:

data/                 this directory includes the sequencing data with a per-sample-per-directory structure
ref/ref.fa            example of a reference sequence (chr12 of the IRGSP-1.0 rice genome) 
ref/ref.gtf           gene annotation of the reference sequence
unaln.gtf             a blank file

Main analysis procedures

The main analysis procedures include:

the parallel quality control;
de novo assembly of individual genomes;
construction of pan-genome sequences;
gene annotation and gene family annotation;
determination of gene (or gene-family) presence-absence variations (PAVs);
PAV-based pan-genome analyses.

We'll introduce how to process your data step by step using EUPAN toolkit and other required tools.

3.1 the parallel quality control

Requirements

EUPAN incorporates FastQC tool to view the sequencing quality and Trimmomatic tool to filter and trim low-quality reads.

A recommended quality control procedure includes:

Preview the overall qualities
Trim or filter reads
Review the overall qualities
Re-trim with selected parameters by repeating step 2 and step 3

Preview the overall qualities

We provide the qualitySta tool to overview the overall qualities. -t indicates #cores used to run Fastqc parallelly. -v indicates pair-end or single-end data. Follow the commands:

cd eupanExample
eupan qualSta -f /path/to/Fastqc -t 4 -v PE data/ preview_quality/

Results can be found in the preview_quality/ directory:

fastqc_output       detailed outputs of each sample
file.summary        combined quality statistics of all files
sample.summary      combined quality statistics of all samples (A sample's "PASS" requires "PASS" of all files of the sample)
file.summary_heatmap.pdf    heatmap of file statistics from file.summary
sample.summary_heatmap.pdf  heatmap of sample statistics from sample.summary

In all output figures, green represents "PASS"; yellow represents "WARNING"; and red represents "FAIL".

The summary files together with the plots will provide a landscape of the overall sequencing qualities. We also recommends the users go through some detailed sample results to gain a better understanding of the data and data qualities.

Trim or filter reads

We provide the trim tool to trim or filter low-quality reads. After preview the qualities, the users should have a basic understanding of the sequencing data including the quality score version (phred33 or phred64), read length, etc. To trim the reads, follow the command:

eupan trim -p 64 -t 4 data/ trim/ /path/to/Trimmomatic/

To filter low-quality reads, follow the comand:

eupan trim -p 64 -t 4 -w 83 -m 83 data/ filter/ /path/to/Trimmomatic/

Results can be found in the output directory (trim/ or filter/):

data/       directory of the high-quality reads of each sample
err/        logs and standard errors

Review the overall qualities

After trimming or filtration of reads, the quality should be scanned again. Follow the command:

eupan qualSta -f /path/to/Fastqc -t 4 trim/data/ review_quality/

Similar to step 1, check the qualities manually and decide if the trimming or filtration step is good enough for subsequent analyses.

Re-trim with selected parameters by repeating step 2 and step 3

If the trimming results do not match the user's expection, new parameters should be selected and step 2 and step 3 should be conducted for several times. We highly recommend the users to run step 2 with a set of parameters on a small subset of their data, evaluate their performance and apply the selected parameter on the whole data.

The output results for this step can be found here.

3.2 de novo assembly of individual genomes

Requirements

de novo assembly of individual genomes

To gain the comprehensive sequences of all samples, we need first to assemble the individual genomes from short reads. We provide two different strategies in this step: 1) direct assembly and 2) iterative assembly to select the optimal Kmer (recommended). See linearK usage for the detailed strategies for the iterative assembly method. Please note that assembly requires relatively large memory. Pay attention to the memory usage when running the following steps.

1) To assemble all samples directly with SOAPdenovo 2 and Gapcloser under a fixed Kmer (e.g. K=23), use command

eupan assemble soapdenovo -t 4 -k 23 -g data/ assembly_soap/ /path/to/SOAPdenovo2/

The assemblies of individuals can be found in the assembly_soap/ directory. To enable GapCloser, add -g option.

2) To assemble all samples directly with iterative use of SOAPdenovo, run with command

eupan assemble linearK data assembly_linearK/ /path/to/SOAPdenovo2/

(More information about this optimized strategy)

The assemblies of individuals can be found in the assembly_linearK/ directory. This program utilizes a linear model of sequencing depth to estimate the initialized Kmer. Therefore when applying to user's real data, the estimated genome size (reference genome size is OK) should be provided with -g option. We recommend the user to re-estimate the linear model on a subset of samples and provide it to the program with -r option when processing all samples. According to our experience, this method is about 2.5~3.5 times lower than the direct use of SOAPdenovo2.

The output results for this step can be found here.

evaluation of the assemblies

In order to have an overview of the assemblies, we evaluate the major assembly statistics with QUAST software including assembly size, N50, genome fraction (how much of the reference genome is covered), unaligned size, etc. In this process, the contigs from each individual are aligned to the reference genome with nucmer tool within Mummer package, enabling the direct extraction of unaligned sequences in the subsequent analysis. Also the users can extract the unaligned sequences with their own parameters.

QUAST installation:

cd ../tools/
tar -zxvf quast-3.2.tar.gz
cd quast-3.2
python quast.py --test
cd ../../eupanExample/

To evaluate the assemblies of the contigs, run with command

eupan assemSta -t 4 assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa

eupan assemSta -t 4 assembly_linearK/linearKdata/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa

To evaluate the assemblies of the scaffolds, run with command

eupan assemSta -t 4 -s assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa

eupan assemSta -t 4 -s assembly_linearK/data/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa

Note that the assembly step could take a long time. Here for our example data, it would take ~2h with 4 cores(-t 4). The assembly statistics of each sample can be found in the assembly_soap_sta/data/ or assembly_linearK_sta/data/ directory and a summary can be found in assembly_soap_sta/total_statistics.txt or assembly_linearK_sta/total_statistics.txt file. Note the sequencing data used in this example is of small size in order to limit the running time, therefore the assembly size is also very small and the statistics is incomplete. It will work well on the user's large data. The assembly of each example sample using SOAPdenovo directly takes about 65 minutes on 16 cpus of 1064 MHz. Totally running assembly with SOAPdenovo directly takes about 2 hours on a single machine. It takes much longer to assemble with the optimized strategy. Therefore we recomend the users who want to go over this SOP to run the first command instead of the second command.

The output results for this step can be found here.

3.3 construction of pan-genome sequences

There are several strategies to construct the pan-genome sequences.

Construct the pan-genome sequences based on a reference genome
Construct the pan-genome sequences by re-assembling the individual assemblies

If there is a reference genome of the studied organism, we reccommend the first strategy to benifit from the high-quality assembly of the reference together with the high-quality annotation. If there is no such reference, you may use the second strategy. Or still follow the first strategy by building a reference first. To do this, there should be a multi-library sequencing for the reference sample with high-sequencing depth and multiple insertion sizes.

In this SOP, we only describe how to build the comprehensive pan-genome sequences based on a reference.

Requirements

Construction of the pan-genome sequences based on a reference genome involves the following steps:

Align contigs to the reference genome to retrieve unaligned contigs

Actually this has been done in the evaluation of assemblies. Users can also carry out this step with Nucmer tool within Mummer package.

Merge unaligned contigs from individuals to one file

To do this, run command

eupan getUnalnCtg assembly_soap/data/ assembly_soap_sta/data/ unalnCtgs

eupan getUnalnCtg assembly_linearK/data/ assembly_linearK_sta/data/ unalnCtgs

The unaligned contigs can be found in unalnCtgs/total_unaln.fa file

Check the basic statistics of contigs (base number and contig number) with fastaSta command

eupan fastaSta unalnCtgs/total_unaln.fa

Get non-redundant novel contigs

To remove redundant contigs, run command

eupan rmRedundant cdhitCluster unalnCtgs/total_unaln.fa unalnCtgs/nrUnaln /path/to/cd-hit/
eupan fastaSta unalnCtgs/nrUnaln/non-redundant.fa

Non-redundant contigs (identity<0.9) and the clustering information can be found in unalnCtgs/nrUnaln/ directory.

Remove contaminates

This is a very important step in such pan-genome analysis involving large number of individuals. The non-redundant contigs are the enrichment of sequences missed in the reference, and are also the enrichment of contaminates. Users should remove contaminates according to their studied species. For our rice study, we removed contaminates from bacteria, virus, archaea and even animals. Actually, we remove all contigs whose best alignment to nt database is not green plants. There is no gold standard in this step, therefore we provide no tool for this part. Users should remove contaminates by themselves.

Check the non-redundant novel sequences at a given identity

The non-redundant novel sequences we obtained above is at identity of 0.9. Sometimes we want to know the size under a much lower identity (say 0.5).Here we provide a blast-based method to cluster sequences and remove the non-redundant contigs. To do this, run command

eupan rmRedundant blastCluster -c 0.5 unalnCtgs/total_unaln.fa unalnCtgs/nrUnalnBlast/ /path/to/ncbi-blast/bin/

The outputs are similar to those of "eupan rmRedundant cdhitCluster" command. The results can be found in unalnCtgs/nrUnalnBlast/ directory.

Merge reference sequences and non-redundant unalined sequences

We recommend to use non-redundant sequences derived from step 4) instead of sequences derived at much lower identity for the subsequent analyses. We will map raw reads to the comprehensive sequences to determine gene presence-absence. Using low-identity contigs will result in loss of mapping of reads and further result in miss annotation of a gene existence or a gene family existence. To merge the reference and non-redundant contigs, simply run

mkdir pan
cat ref/ref.fa unalnCtgs/nrUnaln/non-redundant.fa >pan/pan.fa

The output results for this step can be found here.

3.4 Annotation of the pan-genome sequences

Annotation of the pan-genome sequences can be divided into two parts:

gene annotation of the pan-genome sequences
gene family annotation of the pan-genome sequences

gene annotation of the pan-genome sequences

Gene annotation of the pan-genome sequences can also be divided into 2 parts: i)gene annotation of the reference genome; and ii)gene prediction of the novel sequences.

i) gene annotation of the reference genome

As described above, one of the advantages to build pan-genome sequences is that we can use the high-quality annotation of the reference genome. The annotation of the reference genome is in ref/ref.gtf file. To be consistent with the annotation of unaligned sequences and also to be convenient for subsequent analyses, the annotation should be re-arranged following per-transcript-per-gene pattern. For genes with multiple transcripts, only the one with the longest coding sequences are selected as represents. To do this run command

eupan pTpG ref/ref.gtf ref/ref-ptpg.gtf

ii) gene prediction of the novel sequences

Gene prediction of eukaryotes is a challenging and complex task. And the selection of gene prediction tools or pipelines depends on the target organism. In rice study, we used MAKER-P pipeline to integrate ab initio predictions, transcript evidence and protein homologies. For animals or other organisms, you may want MAKER or PASA. That's why we provide no tools for this step. Users need to carry out the gene prediction step carefully.

The reccomended pipelines are:

PASA2
MAKER

Here we use a blank file (./unaln.gtf) to subtitute the gene annotation of unaligned sequences.

After these two steps, we should combine the annotation of the reference and the annotation of unaligned sequences. Simply run

cat ref/ref-ptpg.gtf unaln.gtf >pan/pan.gtf

The output results for this step can be found here.

gene family annotation of the pan-genome sequences

Pan-genome analyses are often carried out on gene families. Here we recommend cluster genes into gene families with OrthoMCL. OrthoMCL requires specific configuration, therefore we didn't integrate it into the toolkit. Users need to carry out this step by themselves.

3.5 Determination of gene (or gene-family) presence-absence variations (PAVs)

Requirements

bwa or bowtie2

bwa v0.7.10

bowtie2 v2.24

We utilize a "map-to-pan" strategy to determine gene presence-absence and gene family presence-absence. Therefore we need to map the raw reads to the comprehensive sequences first. This can be done using "bwa mem" (recommended) or bowtie2.

To use bwa mem, run the following commands

cd pan/
/path/to/bwa/bwa index -a bwtsw pan.fa
cd ..
eupan alignRead -f bwa -t 4 trim/data/ map2pan/ /path/to/bwa/ pan/pan.fa

To use bowtie2, run the following commands

cd pan/
/path/to/bowtie2/bowtie2-build pan.fa pan
cd ..
eupan alignRead -f bowtie2 -t 4 trim/data/ map2pan/ /path/to/bowtie2/ pan/pan

The results can be found in map2pan/ directory.

Next, the .sam output should be converted to .bam and then sorted, merged and indexed. The .bam and .bai files can be used to view alignments in genome browser. And .sam files can be further removed to save space. To do this, run command

eupan sam2bam -t 4 map2pan/data/ panBam/ /path/to/samtools-1.3/
rm -r map2pan/data/

The results can be found in panBam/ directory. There are two files for each sample: .bam and .bai. We also provide a bam2bed tool to calculate the covered region of the genome. The output non-overlapped fragments are organized in 3-column .bed files which can be utilized for visualization in pan-genome browser and also matained for further use. See our Rice Pan-genome Browser (RPAN) for example.

eupan bam2bed panBam/data/ covRegion/

The .bed files can be found in covRegion/data/ directory.

What's more, the users can also use this tool to align reads to the reference genome alone and check the statistics. We show how to do this with bwa as an example. Similar as mentioned above, do the alignment with commands

cd ref/
/path/to/bwa/bwa index -a bwtsw ref.fa
cd ..
eupan alignRead -f bwa -t 4 trim/data/ map2ref/ /path/to/bwa/ ref/ref.fa

Convert .sam to .bam and remove .sam with commands

eupan sam2bam -t 4 map2ref/data/ refBam/ /path/to/samtools-1.3/
rm -r map2ref/data/

Check the basic statistics from .bam files with command

mkdir refBamSta
eupan bamSta basic refBam/data/ refBamSta/basic/ /path/to/bamUtil/

The basic statistics can be found in refBamSta/basic/summary.txt

Check how much of the genome can be covered from .bam files with command

eupan bamSta cov refBam/data/ refBamSta/cov/ /path/to/qualimap/

The coverages can be found in refBamSta/cov/summary.txt

After mapping the reads to the pan-genome sequence, we can calculate the gene body coverage and CDS coverage of each gene. Run command

eupan geneCov -t 4 panBam/data/ geneCov/ pan/pan.gtf

The gene body coverages can be found in geneCov/summary_gene.cov and the cds coverages can be found in geneCov/summary_cds.cov.

Next we can determine gene presence-absence considering genebody coverages and cds coverages. If we define genes with gene body coverage >0.8 and cds coverage >0.95 ere considered as present in the genome, gene PAV profile can be calculated by

mkdir geneExist
eupan geneExist geneCov/summary_gene.cov geneCov/summary_cds.cov 0.8 0.95 >geneExist/gene.exist

The output file are organized as

 gene    sampleA    sampleB   sampleC     ...
geneA      1          1        1         ...
geneB      1          0        1         ...
geneC      1          1        1         ...
geneD      1          0        1         ...
...

where 1 represent presents and 0 represents absence. The gene PAV profile of example data can be acquired here. Also, The output results for this step can be found here.

Obviously the coverages are influenced by the sequencing depth. Insufficient sequencing will result in mis-annotation of presented genes. The example sequencing files are from rice and has very low sequencing deth (below 1x). We recommend the users to add a sample selection step before the pan-genome analyses. The users can select a subset of individuals based on sequencing depth and mapping depth (based on reference) for the subsequent analyses. In our rice study, we required the sequencing depth is higher than 20 and the mapping depth is higher than 15.

Users can put the selected sample name in a list as

sampleA
sampleB
sampleC
...

With the subSample command, users can select a subset of samples from the PAV profile

eupan subSample geneExist/gene.exist list >geneExist/SampleFilteredGene.exist

The new PAV profile for the selected sample can be found in geneExist/SampleFilteredGene.exist file.

Next we can further convert gene PAV profile into gene family PASV profile with a gene family annotation. An example of gene family annotation file can be found here. Gene family PAVs can be derived from gene PAVs with gFamExist command

eupan gFamExist geneExist/gene.exist geneFam.txt >geneFam.exist

geneFam.txt contains the annotation of gene families. Each line represents a gene family and genes included in this gene family (tab-seperated gene family name and gene names. Gene names are further seperated with spaces). An example of the gene family annotation file can be found at File 5 of Hu, Z. et al. Figshare

The geneFam.exist file is organized as the same pattern of gene PAV profile. Therefore the subsequent pan-genome analyses can be carried out for both gene PAV profile and gene family PAV profile.

Next we will do random sampling for pan-genome simulation. For each iteration of simulation, we will randomly sample one by one, and calculate the core genome size and pan genome size.

 eupan sim -n 100 geneExist/gene.exist simulation_pav

Here, -n indicates the number of random simulations (default=100). The result can be found in simulation_pav/sim_out.txt_PAV_plot.pdf

3.6 PAV-based pan-genome analyses

Traditional pan-genome analyses can be carried out based on the PAV profiles:

Classification of core genes or distributed genes
Study of the functions of core genes and distributed genes
Simulation of the pan-genome size and the core-genome size
Phylogenetic analysis to reveal relationship of the samples

In addition, classifications of samples can be derived from the phylogenetic analysis if there's no such information before you do this analysis. Based on the grouping information of samples, you can carry out many comparisons among groups:

Comparison of genome size (gene number and gene family number)
Comparison of sizes of pan-genomes and core-genomes among the groups (or subspecies)
Study of group-dominant and group-specific genes

What's more, by integrating other knowledge, many customized analyses can be carried out. We give some examples here.

Comparison of gene ages between core and distributed genes.
Comparison of mutation rates between core and distributed genes.
Integrating phenotypes to carry out GWAS based on gene PAVs

.......

Bugs or Suggestions

We recommend to use tools with the exact version we mentioned. If you got errors, please first read the error files and try the recommended version. There are logs and error files for each step. Bugs should be sent to HZQ. Pan-genome analysis using "map-to-pan" strategy is somewhat complex. We welcome the users' suggestions and discussions on any point involved in the pipeline.