logo

The purpose of this SOP is to help users to carry out pan-genome studies with EUPAN strategy. The EUPAN SOP is divided into 3 parts:

  1. eupan SOP for single machine;
  2. eupanLSF SOP for LSF system;
  3. eupanSLURM SOP for SLURM system.

EUPAN

  1. Installation
  2. see Installation.

  3. Example data
  4. The example data used in this SOP can be downloaded here. The example includes sequencing data of 3 individual. For each individual, there are 3 Illumina sequencing runs. Each run contains 500,000 paired-end reads. The read length is 83bp and the quality values follow phred+64 rule. Note these are only simple examples to help users understand the input data type and data structure. The real data may be much larger and more complex.

    After downloading the data, follow the command:

    tar zxvf eupanExample.tar.gz

    and you will find two directories within the eupanExample/ directory:

    data/                 this directory includes the sequencing data with a per-sample-per-directory structure
    ref/ref.fa            example of a reference sequence (chr12 of the IRGSP-1.0 rice genome) 
    ref/ref.gtf           gene annotation of the reference sequence
    unaln.gtf             a blank file

  5. Main analysis procedures
  6. The main analysis procedures include:

    1. the parallel quality control;
    2. de novo assembly of individual genomes;
    3. construction of pan-genome sequences;
    4. gene annotation and gene family annotation;
    5. determination of gene (or gene-family) presence-absence variations (PAVs);
    6. PAV-based pan-genome analyses.

    We'll introduce how to process your data step by step using EUPAN toolkit and other required tools.

    3.1 the parallel quality control

    Requirements

    1. Fastqc
    2. Trimmomatic

    EUPAN incorporates FastQC tool to view the sequencing quality and Trimmomatic tool to filter and trim low-quality reads.

    A recommended quality control procedure includes:

    1. Preview the overall qualities
    2. Trim or filter reads
    3. Review the overall qualities
    4. Re-trim with selected parameters by repeating step 2 and step 3
    1. Preview the overall qualities
    2. We provide the qualitySta tool to overview the overall qualities. -t indicates #cores used to run Fastqc parallelly. -v indicates pair-end or single-end data. Follow the commands:

      cd eupanExample
      eupan qualSta -f /path/to/Fastqc -t 4 -v PE data/ preview_quality/

      Results can be found in the preview_quality/ directory:

      fastqc_output       detailed outputs of each sample
      file.summary        combined quality statistics of all files
      sample.summary      combined quality statistics of all samples (A sample's "PASS" requires "PASS" of all files of the sample)
      file.summary_heatmap.pdf    heatmap of file statistics from file.summary
      sample.summary_heatmap.pdf  heatmap of sample statistics from sample.summary

      In all output figures, green represents "PASS"; yellow represents "WARNING"; and red represents "FAIL".

      The summary files together with the plots will provide a landscape of the overall sequencing qualities. We also recommends the users go through some detailed sample results to gain a better understanding of the data and data qualities.

    3. Trim or filter reads
    4. We provide the trim tool to trim or filter low-quality reads. After preview the qualities, the users should have a basic understanding of the sequencing data including the quality score version (phred33 or phred64), read length, etc. To trim the reads, follow the command:

      eupan trim -p 64 -t 4 data/ trim/ /path/to/Trimmomatic/ 

      To filter low-quality reads, follow the comand:

      eupan trim -p 64 -t 4 -w 83 -m 83 data/ filter/ /path/to/Trimmomatic/ 

      Results can be found in the output directory (trim/ or filter/):

      data/       directory of the high-quality reads of each sample
      err/        logs and standard errors
    5. Review the overall qualities
    6. After trimming or filtration of reads, the quality should be scanned again. Follow the command:

      eupan qualSta -f /path/to/Fastqc -t 4 trim/data/ review_quality/

      Similar to step 1, check the qualities manually and decide if the trimming or filtration step is good enough for subsequent analyses.

    7. Re-trim with selected parameters by repeating step 2 and step 3
    8. If the trimming results do not match the user's expection, new parameters should be selected and step 2 and step 3 should be conducted for several times. We highly recommend the users to run step 2 with a set of parameters on a small subset of their data, evaluate their performance and apply the selected parameter on the whole data.

    The output results for this step can be found here.

    3.2 de novo assembly of individual genomes

    Requirements

    1. SOAPdenovo2 r240 and GapCloser
    2. QUAST

    1. de novo assembly of individual genomes
    2. To gain the comprehensive sequences of all samples, we need first to assemble the individual genomes from short reads. We provide two different strategies in this step: 1) direct assembly and 2) iterative assembly to select the optimal Kmer (recommended). See linearK usage for the detailed strategies for the iterative assembly method. Please note that assembly requires relatively large memory. Pay attention to the memory usage when running the following steps.

      1) To assemble all samples directly with SOAPdenovo 2 and Gapcloser under a fixed Kmer (e.g. K=23), use command

      eupan assemble soapdenovo -t 4 -k 23 -g data/ assembly_soap/ /path/to/SOAPdenovo2/

      The assemblies of individuals can be found in the assembly_soap/ directory. To enable GapCloser, add -g option.

      2) To assemble all samples directly with iterative use of SOAPdenovo, run with command

      eupan assemble linearK data assembly_linearK/ /path/to/SOAPdenovo2/

      (More information about this optimized strategy)

      The assemblies of individuals can be found in the assembly_linearK/ directory. This program utilizes a linear model of sequencing depth to estimate the initialized Kmer. Therefore when applying to user's real data, the estimated genome size (reference genome size is OK) should be provided with -g option. We recommend the user to re-estimate the linear model on a subset of samples and provide it to the program with -r option when processing all samples. According to our experience, this method is about 2.5~3.5 times lower than the direct use of SOAPdenovo2.

      The output results for this step can be found here.

    3. evaluation of the assemblies
    4. In order to have an overview of the assemblies, we evaluate the major assembly statistics with QUAST software including assembly size, N50, genome fraction (how much of the reference genome is covered), unaligned size, etc. In this process, the contigs from each individual are aligned to the reference genome with nucmer tool within Mummer package, enabling the direct extraction of unaligned sequences in the subsequent analysis. Also the users can extract the unaligned sequences with their own parameters.

      QUAST installation:

      cd ../tools/
      tar -zxvf quast-3.2.tar.gz
      cd quast-3.2
      python quast.py --test
      cd ../../eupanExample/

      To evaluate the assemblies of the contigs, run with command

      eupan assemSta -t 4 assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa

      or

      eupan assemSta -t 4 assembly_linearK/linearKdata/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa

      To evaluate the assemblies of the scaffolds, run with command

      eupan assemSta -t 4 -s assembly_soap/data/ assembly_soap_sta/ /path/to/quast/ ref/ref.fa

      or

      eupan assemSta -t 4 -s assembly_linearK/data/ assembly_linearK_sta/ /path/to/quast/ ref/ref.fa

      Note that the assembly step could take a long time. Here for our example data, it would take ~2h with 4 cores(-t 4). The assembly statistics of each sample can be found in the assembly_soap_sta/data/ or assembly_linearK_sta/data/ directory and a summary can be found in assembly_soap_sta/total_statistics.txt or assembly_linearK_sta/total_statistics.txt file. Note the sequencing data used in this example is of small size in order to limit the running time, therefore the assembly size is also very small and the statistics is incomplete. It will work well on the user's large data. The assembly of each example sample using SOAPdenovo directly takes about 65 minutes on 16 cpus of 1064 MHz. Totally running assembly with SOAPdenovo directly takes about 2 hours on a single machine. It takes much longer to assemble with the optimized strategy. Therefore we recomend the users who want to go over this SOP to run the first command instead of the second command.

    The output results for this step can be found here.

    3.3 construction of pan-genome sequences

    There are several strategies to construct the pan-genome sequences.

    • Construct the pan-genome sequences based on a reference genome
    • Construct the pan-genome sequences by re-assembling the individual assemblies

    If there is a reference genome of the studied organism, we reccommend the first strategy to benifit from the high-quality assembly of the reference together with the high-quality annotation. If there is no such reference, you may use the second strategy. Or still follow the first strategy by building a reference first. To do this, there should be a multi-library sequencing for the reference sample with high-sequencing depth and multiple insertion sizes.

    In this SOP, we only describe how to build the comprehensive pan-genome sequences based on a reference.

    Requirements

    1. Mummer
    2. NCBI-Blast
    3. CDHIT

    Construction of the pan-genome sequences based on a reference genome involves the following steps:

    1. Align contigs to the reference genome to retrieve unaligned contigs
    2. Actually this has been done in the evaluation of assemblies. Users can also carry out this step with Nucmer tool within Mummer package.

    3. Merge unaligned contigs from individuals to one file
    4. To do this, run command

      eupan getUnalnCtg assembly_soap/data/ assembly_soap_sta/data/ unalnCtgs

      or

      eupan getUnalnCtg assembly_linearK/data/ assembly_linearK_sta/data/ unalnCtgs

      The unaligned contigs can be found in unalnCtgs/total_unaln.fa file

      Check the basic statistics of contigs (base number and contig number) with fastaSta command

      eupan fastaSta unalnCtgs/total_unaln.fa
    5. Get non-redundant novel contigs
    6. To remove redundant contigs, run command

      eupan rmRedundant cdhitCluster unalnCtgs/total_unaln.fa unalnCtgs/nrUnaln /path/to/cd-hit/
      eupan fastaSta unalnCtgs/nrUnaln/non-redundant.fa

      Non-redundant contigs (identity<0.9) and the clustering information can be found in unalnCtgs/nrUnaln/ directory.

    7. Remove contaminates
    8. This is a very important step in such pan-genome analysis involving large number of individuals. The non-redundant contigs are the enrichment of sequences missed in the reference, and are also the enrichment of contaminates. Users should remove contaminates according to their studied species. For our rice study, we removed contaminates from bacteria, virus, archaea and even animals. Actually, we remove all contigs whose best alignment to nt database is not green plants. There is no gold standard in this step, therefore we provide no tool for this part. Users should remove contaminates by themselves.

    9. Check the non-redundant novel sequences at a given identity
    10. The non-redundant novel sequences we obtained above is at identity of 0.9. Sometimes we want to know the size under a much lower identity (say 0.5).Here we provide a blast-based method to cluster sequences and remove the non-redundant contigs. To do this, run command

      eupan rmRedundant blastCluster -c 0.5 unalnCtgs/total_unaln.fa unalnCtgs/nrUnalnBlast/ /path/to/ncbi-blast/bin/

      The outputs are similar to those of "eupan rmRedundant cdhitCluster" command. The results can be found in unalnCtgs/nrUnalnBlast/ directory.

    11. Merge reference sequences and non-redundant unalined sequences
    12. We recommend to use non-redundant sequences derived from step 4) instead of sequences derived at much lower identity for the subsequent analyses. We will map raw reads to the comprehensive sequences to determine gene presence-absence. Using low-identity contigs will result in loss of mapping of reads and further result in miss annotation of a gene existence or a gene family existence. To merge the reference and non-redundant contigs, simply run

      mkdir pan
      cat ref/ref.fa unalnCtgs/nrUnaln/non-redundant.fa >pan/pan.fa

    The output results for this step can be found here.

    3.4 Annotation of the pan-genome sequences

    Annotation of the pan-genome sequences can be divided into two parts:

    1. gene annotation of the pan-genome sequences
    2. gene family annotation of the pan-genome sequences

    1. gene annotation of the pan-genome sequences
    2. Gene annotation of the pan-genome sequences can also be divided into 2 parts: i)gene annotation of the reference genome; and ii)gene prediction of the novel sequences.

      i) gene annotation of the reference genome

      As described above, one of the advantages to build pan-genome sequences is that we can use the high-quality annotation of the reference genome. The annotation of the reference genome is in ref/ref.gtf file. To be consistent with the annotation of unaligned sequences and also to be convenient for subsequent analyses, the annotation should be re-arranged following per-transcript-per-gene pattern. For genes with multiple transcripts, only the one with the longest coding sequences are selected as represents. To do this run command

      eupan pTpG ref/ref.gtf ref/ref-ptpg.gtf
      ii) gene prediction of the novel sequences

      Gene prediction of eukaryotes is a challenging and complex task. And the selection of gene prediction tools or pipelines depends on the target organism. In rice study, we used MAKER-P pipeline to integrate ab initio predictions, transcript evidence and protein homologies. For animals or other organisms, you may want MAKER or PASA. That's why we provide no tools for this step. Users need to carry out the gene prediction step carefully.

      The reccomended pipelines are:

      Here we use a blank file (./unaln.gtf) to subtitute the gene annotation of unaligned sequences.

      After these two steps, we should combine the annotation of the reference and the annotation of unaligned sequences. Simply run

      cat ref/ref-ptpg.gtf unaln.gtf >pan/pan.gtf

      The output results for this step can be found here.

    3. gene family annotation of the pan-genome sequences
    4. Pan-genome analyses are often carried out on gene families. Here we recommend cluster genes into gene families with OrthoMCL. OrthoMCL requires specific configuration, therefore we didn't integrate it into the toolkit. Users need to carry out this step by themselves.

    3.5 Determination of gene (or gene-family) presence-absence variations (PAVs)

    Requirements

    1. bwa or bowtie2
    2. bwa v0.7.10
      bowtie2 v2.24
    3. samtools v1.3
    4. qualimap v0.8
    5. bamUtil v1.0.12

    We utilize a "map-to-pan" strategy to determine gene presence-absence and gene family presence-absence. Therefore we need to map the raw reads to the comprehensive sequences first. This can be done using "bwa mem" (recommended) or bowtie2.

    To use bwa mem, run the following commands

    cd pan/
    /path/to/bwa/bwa index -a bwtsw pan.fa
    cd ..
    eupan alignRead -f bwa -t 4 trim/data/ map2pan/ /path/to/bwa/ pan/pan.fa

    To use bowtie2, run the following commands

    cd pan/
    /path/to/bowtie2/bowtie2-build pan.fa pan
    cd ..
    eupan alignRead -f bowtie2 -t 4 trim/data/ map2pan/ /path/to/bowtie2/ pan/pan

    The results can be found in map2pan/ directory.

    Next, the .sam output should be converted to .bam and then sorted, merged and indexed. The .bam and .bai files can be used to view alignments in genome browser. And .sam files can be further removed to save space. To do this, run command

    eupan sam2bam -t 4 map2pan/data/ panBam/ /path/to/samtools-1.3/
    rm -r map2pan/data/

    The results can be found in panBam/ directory. There are two files for each sample: .bam and .bai. We also provide a bam2bed tool to calculate the covered region of the genome. The output non-overlapped fragments are organized in 3-column .bed files which can be utilized for visualization in pan-genome browser and also matained for further use. See our Rice Pan-genome Browser (RPAN) for example.

    eupan bam2bed panBam/data/ covRegion/

    The .bed files can be found in covRegion/data/ directory.

    What's more, the users can also use this tool to align reads to the reference genome alone and check the statistics. We show how to do this with bwa as an example. Similar as mentioned above, do the alignment with commands

    cd ref/
    /path/to/bwa/bwa index -a bwtsw ref.fa
    cd ..
    eupan alignRead -f bwa -t 4 trim/data/ map2ref/ /path/to/bwa/ ref/ref.fa

    Convert .sam to .bam and remove .sam with commands

    eupan sam2bam -t 4 map2ref/data/ refBam/ /path/to/samtools-1.3/
    rm -r map2ref/data/

    Check the basic statistics from .bam files with command

    mkdir refBamSta
    eupan bamSta basic refBam/data/ refBamSta/basic/ /path/to/bamUtil/

    The basic statistics can be found in refBamSta/basic/summary.txt

    Check how much of the genome can be covered from .bam files with command

    eupan bamSta cov refBam/data/ refBamSta/cov/ /path/to/qualimap/

    The coverages can be found in refBamSta/cov/summary.txt

    After mapping the reads to the pan-genome sequence, we can calculate the gene body coverage and CDS coverage of each gene. Run command

    eupan geneCov -t 4 panBam/data/ geneCov/ pan/pan.gtf

    The gene body coverages can be found in geneCov/summary_gene.cov and the cds coverages can be found in geneCov/summary_cds.cov.

    Next we can determine gene presence-absence considering genebody coverages and cds coverages. If we define genes with gene body coverage >0.8 and cds coverage >0.95 ere considered as present in the genome, gene PAV profile can be calculated by

    mkdir geneExist
    eupan geneExist geneCov/summary_gene.cov geneCov/summary_cds.cov 0.8 0.95 >geneExist/gene.exist

    The output file are organized as

     gene    sampleA    sampleB   sampleC     ...
    geneA      1          1        1         ...
    geneB      1          0        1         ...
    geneC      1          1        1         ...
    geneD      1          0        1         ...
    ...

    where 1 represent presents and 0 represents absence. The gene PAV profile of example data can be acquired here. Also, The output results for this step can be found here.

    Obviously the coverages are influenced by the sequencing depth. Insufficient sequencing will result in mis-annotation of presented genes. The example sequencing files are from rice and has very low sequencing deth (below 1x). We recommend the users to add a sample selection step before the pan-genome analyses. The users can select a subset of individuals based on sequencing depth and mapping depth (based on reference) for the subsequent analyses. In our rice study, we required the sequencing depth is higher than 20 and the mapping depth is higher than 15.

    Users can put the selected sample name in a list as

    sampleA
    sampleB
    sampleC
    ...

    With the subSample command, users can select a subset of samples from the PAV profile

    eupan subSample geneExist/gene.exist list >geneExist/SampleFilteredGene.exist

    The new PAV profile for the selected sample can be found in geneExist/SampleFilteredGene.exist file.

    Next we can further convert gene PAV profile into gene family PASV profile with a gene family annotation. An example of gene family annotation file can be found here. Gene family PAVs can be derived from gene PAVs with gFamExist command

    eupan gFamExist geneExist/gene.exist geneFam.txt >geneFam.exist

    geneFam.txt contains the annotation of gene families. Each line represents a gene family and genes included in this gene family (tab-seperated gene family name and gene names. Gene names are further seperated with spaces). An example of the gene family annotation file can be found at File 5 of Hu, Z. et al. Figshare

    The geneFam.exist file is organized as the same pattern of gene PAV profile. Therefore the subsequent pan-genome analyses can be carried out for both gene PAV profile and gene family PAV profile.

    Next we will do random sampling for pan-genome simulation. For each iteration of simulation, we will randomly sample one by one, and calculate the core genome size and pan genome size.

     eupan sim -n 100 geneExist/gene.exist simulation_pav

    Here, -n indicates the number of random simulations (default=100). The result can be found in simulation_pav/sim_out.txt_PAV_plot.pdf

    3.6 PAV-based pan-genome analyses

      Traditional pan-genome analyses can be carried out based on the PAV profiles:

    1. Classification of core genes or distributed genes
    2. Study of the functions of core genes and distributed genes
    3. Simulation of the pan-genome size and the core-genome size
    4. Phylogenetic analysis to reveal relationship of the samples

    5. In addition, classifications of samples can be derived from the phylogenetic analysis if there's no such information before you do this analysis. Based on the grouping information of samples, you can carry out many comparisons among groups:

    6. Comparison of genome size (gene number and gene family number)
    7. Comparison of sizes of pan-genomes and core-genomes among the groups (or subspecies)
    8. Study of group-dominant and group-specific genes

    9. What's more, by integrating other knowledge, many customized analyses can be carried out. We give some examples here.

    10. Comparison of gene ages between core and distributed genes.
    11. Comparison of mutation rates between core and distributed genes.
    12. Integrating phenotypes to carry out GWAS based on gene PAVs
    13. .......

      .......

  7. Bugs or Suggestions
  8. We recommend to use tools with the exact version we mentioned. If you got errors, please first read the error files and try the recommended version. There are logs and error files for each step. Bugs should be sent to HZQ. Pan-genome analysis using "map-to-pan" strategy is somewhat complex. We welcome the users' suggestions and discussions on any point involved in the pipeline.