HUPAN manual

HUPAN is an advanced version of EUPAN, and most of its commands are similar to those in EUPAN. In addition, several commands have been developed specifically for efficient pan-genome analysis of the human genome. Here, we describe the main steps of a human pan-genome analysis using an example data set.

We provide three versions of the toolkit for different computing platforms:

  1. Single machine version;
  2. LSF version (works on a supercomputer managed by the LSF system, where "bsub" is used to submit jobs);
  3. SLURM version (works on a supercomputer managed by the SLURM system, where "sbatch" is used to submit jobs).

Because individual human genomes are large, a pan-genome analysis of hundreds of individuals can hardly be finished on a single machine. We strongly suggest that users run the entire analysis on a supercomputer running the LSF or SLURM system. The commands of hupanLSF and hupanSLURM are identical except for the way jobs are submitted. In the following, all example commands are given for the SLURM system; if you work on a supercomputer managed by LSF, simply replace "hupanSLURM" with "hupanLSF".
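Each hupanSLURM/hupanLSF step submits its jobs to the scheduler, so make sure the jobs of one step have finished before starting the next. The standard scheduler commands (not part of HUPAN) can be used to check:

squeue -u $USER     # SLURM: list your pending and running jobs
bjobs               # LSF: list your pending and running jobs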

(1) Download code and tools

Download the latest version of the HUPAN toolkit and the tools required by HUPAN (see Download), and install HUPAN according to Installation.

(2) Example data

The example data set includes sequencing data of three samples derived from NA12878. Each sample contains 6,000,000 paired-end reads that map to chromosome 22. Note that this is only a simple example intended to illustrate the input data types and directory structure and to guide you through the pipeline; real data will be much larger and more complex. Please download it here and decompress it:

tar zxvf hupanExample.tar.gz && cd hupanExample

After decompression you will find two directories:

data/   The sequencing data with a per-sample-per-directory structure;
ref/ The reference sequence and annotation information (chr22 of GRCh38).
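
For reference, the per-sample-per-directory structure means that each sample has its own subdirectory containing its paired-end FASTQ files. A hypothetical layout (the actual file names in the example data may differ) looks like:

data/
  sample1/
    sample1_1.fastq.gz
    sample1_2.fastq.gz
  sample2/
    sample2_1.fastq.gz
    sample2_2.fastq.gz
  sample3/
    sample3_1.fastq.gz
    sample3_2.fastq.gz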

(3) Parallel quality control

The tools FastQC and Trimmomatic are used to assess sequencing quality and to trim low-quality reads.

i. The command qualSta is used to get an overview of sequencing quality:

hupanSLURM  qualSta -f /path/to/Fastqc -t 16 -v PE data/ preview_quality/

Results can be found in the preview_quality/ directory.
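
FastQC writes one HTML report per FASTQ file. Assuming qualSta keeps FastQC's standard naming under preview_quality/, the reports can be listed with:

find preview_quality/ -name "*_fastqc.html"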

ii. If read quality is unsatisfactory, low-quality reads can be trimmed or filtered with the command trim:

hupanSLURM trim data/ trim/ /path/to/Trimmomatic
hupanSLURM trim -w 100 -m 100 data/ filter/ /path/to/Trimmomatic

Results can be found in the trim/ or filter/ directory.

iii. After trimming or filtering, the sequencing quality should be evaluated again with qualSta. If the results are still not good enough for subsequent analyses, adjust the parameters and repeat the above steps until the quality is acceptable (an example of re-running qualSta on the trimmed reads is given below).
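
For example, assuming the trimmed reads keep the per-sample layout under trim/data/ (adjust the path to your actual output), the same qualSta command as in step i can be reused:

hupanSLURM qualSta -f /path/to/Fastqc -t 16 -v PE trim/data/ trim_quality/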

(4) De novo assembly of individual genomes

To obtain non-reference sequences from each individual genome, we first need to perform de novo assembly on the raw reads. We provide three distinct strategies:

i. Direct assembly with SOAPDenovo2:

hupanSLURM assemble soapdenovo -t 16 -k 91 data/ assembly_soap/ /path/to/SOAPDenovo2/

Please note that this strategy requires a huge amount of memory to assemble an individual human genome (in our tests, assembling a human genome with 30-fold sequencing data needs more than 500 GB of memory). We strongly suggest not using this command unless the analysis is run on a supercomputer with multiple large-memory nodes.

ii. Assembly by iterative use of SOAPDenovo2 (not recommended):

hupanSLURM linearK data assembly_linearK/ /path/to/SOAPDenovo2 

iii. Assembly with SGA:

hupanSLURM assemble sga -t 16 data/ assembly_result /path/to/sga/

We recommend that users perform de novo assembly with this command. In our experience with 185 newly sequenced genomes, the maximum memory consumption when assembling a human genome with 30-fold sequencing data is about 60 GB.

(5) Extract non-reference sequences from assembled contigs

i. To obtain the non-reference sequences from each individual genome, the assembled contigs are aligned to the reference genome with the nucmer tool from the MUMmer package:

hupanSLURM alignContig assembly_result/data/ aligned_result  /path/to/MUMmer/ /path/to/reference.fa

ii. Contigs that are highly similar to the reference genome are then discarded, and the remaining contigs are considered candidate non-reference sequences:

hupanSLURM extractSeq assembly_result/data/ candidate aligned_result

iii. All candidate non-reference sequences are assessed with QUAST to obtain the final non-reference sequences:

hupanSLURM assemSta candidate/data/ quast_result /path/to/quast-4.5/ /path/to/reference.fa

iv. Two types of non-reference sequences, fully unaligned and partially unaligned sequences, are collected for each individual:

hupanSLURM getUnalnCtg -s .contig candidate/data/ quast_result/data/ Unalign_result

v. Non-reference sequences from multiple individuals are merged:

hupanSLURM mergeUnalnCtg Unalign_result/data/ mergeUnalnCtg_result

(6) Remove redundant and potentially contaminating sequences

After obtaining non-reference sequences from multiple individuals, redundant sequences shared between individuals should be excluded, and potentially contaminating sequences from non-human species should also be removed before further analysis.

i. Redundant sequences are removed with CD-HIT, for the fully unaligned and partially unaligned sequences separately:

hupanSLURM rmRedundant cdhitCluster mergeUnalnCtg_result/total.fully.fa rmRedundant.fully.unaligned /path/to/cdhit/
hupanSLURM rmRedundant cdhitCluster mergeUnalnCtg_result/total.partially.fa rmRedundant.partially.unaligned /path/to/cdhit/

ii. The non-redundant sequences are then aligned to NCBI's non-redundant nucleotide (nt) database with BLAST:

mkdir nt && cd nt
wget https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz && gunzip nt.gz && cd ..
hupanSLURM blastAlign mkblastdb nt nt_index /path/to/blast
mkdir rmRedundant && mv rmRedundant.fully.unaligned rmRedundant/fully && mv rmRedundant.partially.unaligned rmRedundant/partially
hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt_index /path/to/blast

iii. Based on the alignment results, the taxonomic classification of each sequence (if any) can be obtained:

hupanSLURM getTaxClass rmRedundant_blast/data/fully/fully.non-redundant.blast info/ TaxClass_fully/
hupanSLURM getTaxClass rmRedundant_blast/data/partially/partially.non-redundant.blast info/ TaxClass_partially/
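
The info/ directory holds the NCBI taxonomy information used to map BLAST hits to taxa. As an assumption about which files are needed (please check the getTaxClass documentation for the exact list), they can be downloaded from NCBI, e.g.:

mkdir info && cd info
wget https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz && gunzip nucl_gb.accession2taxid.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz && tar zxvf taxdump.tar.gz
cd ..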

iv. Sequences classified as microbial or non-primate eukaryotic are considered non-human and removed from further consideration:

hupanSLURM rmCtm -i 60 rmRedundant/fully/fully.non-redundant.fa rmRedundant_blast/data/fully/fully.non-redundant.blast TaxClass_fully/data/accession.name rmCtm_fully
hupanSLURM rmCtm -i 60 rmRedundant/partially/partially.non-redundant.fa rmRedundant_blast/data/partially/partially.non-redundant.blast TaxClass_partially/data/accession.name rmCtm_partially

(7) Construction and annotation of pan-genome

i. The non-redundant fully unaligned and partially unaligned sequences are merged and clustered again to remove any remaining redundancy:

mkdir Nonreference
cat rmCtm_fully/data/novel_sequence.fa rmCtm_partially/data/novel_sequence.fa > Nonreference/nonreference.before.fa
hupanSLURM rmRedundant cdhitCluster Nonreference/nonreference.before.fa NonredundantNonreference /path/to/cdhit/

ii. The resulting sequences, together with the human reference genome, constitute the pan-genome sequences. The annotation of the reference genome can be downloaded directly from GENCODE or another commonly used annotation resource, while the annotation of the non-reference sequences is predicted with MAKER. In general, the non-reference sequences are large and gene prediction is slow, so we recommend splitting the non-reference sequence file into multiple small files and predicting novel genes in parallel:

hupanSLURM splitSeq NonredundantNonreference/non-redundant.fa GenePre_input
hupanSLURM genePre GenePre_input GenePre /path/to/maker/config_file /path/to/maker
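
The MAKER configuration file referenced above is presumably MAKER's standard control file (e.g. maker_opts.ctl). If you do not already have one, MAKER itself can generate template control files to edit (this is standard MAKER behavior, not a HUPAN command):

/path/to/maker/bin/maker -CTL    # writes maker_opts.ctl, maker_bopts.ctl and maker_exe.ctl to the current directory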

iii. After all jobs have finished, the results are merged:

hupanSLURM mergeNovGene GenePre GenePre_merge /path/to/maker

iv. The newly predicted genes may be highly similar to genes located in the reference genome, so an additional filtering step should be performed to ensure the novelty of the predicted genes:

hupanSLURM filterNovGene GenePre_merge GenePre_filter /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask

v. The annotation of the pan-genome sequences is obtained simply by combining the two annotation files:

hupanSLURM pTpG ref/ref.gff ref/ref-ptpg.gff
mkdir pan
cat ref/ref-ptpg.gff non-reference.gff > pan/pan.gff
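
The pan-genome FASTA used in the next section (pan/pan.fa) consists of the reference genome together with the non-redundant non-reference sequences (see step ii above). A minimal sketch, assuming /path/to/reference.fa is the reference FASTA and the non-reference sequences are those produced in step i:

cat /path/to/reference.fa NonredundantNonreference/non-redundant.fa > pan/pan.fa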

(8) PAV analysis

The "map-to-pan" strategy is used to determine gene presence or absence.

i. The raw reads are mapped to pan-genome sequences by Bowtie2:

cd pan && /path/to/bowtie2/bowtie2-build pan.fa pan && cd ..
hupanSLURM alignRead -f bowtie2 data/ map2pan /path/to/bowtie2 pan/pan
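
bowtie2-build writes six index files next to the FASTA, so a quick sanity check before submitting the mapping jobs is:

ls pan/pan*.bt2     # expect pan.1.bt2 ... pan.4.bt2, pan.rev.1.bt2 and pan.rev.2.bt2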

ii. The resulting .sam files are converted to .bam, then sorted and indexed with SAMtools:

hupanSLURM sam2bam map2pan/data panBam /path/to/samtools

iii. The gene body coverage and CDS coverage of each gene are then calculated:

hupanSLURM geneCov panBam/data geneCov/ pan/pan.gff

iv. Finally, gene presence or absence is determined using a CDS coverage threshold of 95%:

mkdir geneExist && hupanSLURM geneExist geneCov/summary_gene.cov geneCov/summary_cds.cov 0 0.95 >geneExist/gene.exist
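
As an illustration only, assuming gene.exist is a tab-delimited gene-by-sample matrix of 0/1 presence calls with a header row (please check the actual output format of geneExist), the number of core genes present in every sample could be counted with:

awk 'NR>1 {present=1; for(i=2;i<=NF;i++) if($i==0) present=0; core+=present} END {print core, "core genes"}' geneExist/gene.exist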

(9) Bugs or suggestions

For any bugs or suggestions, please contact the authors.