Pan-genome analysis of gastric cancer

Introduction

Gastric cancer is one of the most common digestive cancers worldwide, especially in Asia. To date, several whole genome sequencing (WGS) studies have been published, and multiple genetic variations have been proposed to support the molecular classification of gastric cancer. Moreover, pan-cancer analyses of WGS data have advanced a comprehensive understanding of molecular mechanisms across multiple cancer types. However, all of those studies were based on the human reference genome, which is now known to lack tens to hundreds of millions of bases of individual-specific or population-specific sequence. These non-reference sequences may play important roles in pathogenesis or carcinogenesis. To address this problem, we conducted a pan-genome analysis of gastric cancer on 185 genomes to detect non-reference sequences, novel protein-coding genes, and gene presence-absence variation (PAV).

The gastric cancer pan-genome analysis uses HUPAN to construct the cancer pan-genome, compares it with other pan-genomes, and tests the association between gene presence-absence variation (PAV) and cancer phenotypes. The steps are listed below.

Preprocessing to control the sequencing quality and trim low-quality reads if necessary;
De novo assembly of individual genomes;
Extracting non-reference sequences from assembled contigs;
Detecting placed novel sequences from partially unaligned contigs;
Removing redundancy and potential contamination among multiple genomes;
Construction and annotation of the pan-genome;
Gene presence-absence variation (PAV) analysis;
Comparative analysis of the gastric cancer pan-genome with pan-genomes from other populations, such as Korean gastric cancer patients and healthy individuals from SGDP;
Association of gene PAV profiles and phenotypes.

In addition, this analysis can be extended to other types of cancer. Please see the Standard Operating Procedure (SOP) below if you want to conduct pan-genome analysis of cancer genomes on your own data.

Datasets of gastric cancer pan-genome analysis results

Whole genome sequencing (WGS) data and assembled contigs of 185 individuals diagnosed with gastric cancer are available at NODE under the accessions OEP000301 for matched gastric mucosa and OEP000482 for primary tumor tissues. Please see the details of Request for Restricted Data if you want to access this dataset.
Placed novel sequences
- Two-end placed novel sequences (1.41 Mb, 827 sequences): Two-endPlacedNovelSequences.fa.gz (md5: b03ccfe8304fd1886816923bca27497a).
- One-end placed novel sequences (4.91 Mb, 1,778 sequences): One-endPlacedNovelSequences.fa.gz (md5: ae7736fe1bb95a3d62e89c2f96b94154).
Final non-reference sequences (80.88 Mb, 35,488 sequences): Nonreference.final.fa.gz (md5: d7a236fdd3009f45e6c359c2dc251404).
Annotation of non-reference sequences
- gff file: Novelgenes.gff3.
- gene sequences: Novelgenes.gene.fasta.
- transcript sequences: Novelgenes.transcript.fasta.
- protein sequences: Novelgenes.protein.fasta.
PAV profile of distributed genes:
- Normal.Reference.Distributed.186genes.PAV.xlsx.
- Normal.Novel.Predicted.Distributed.64genes.PAV.xlsx.
- Normal.Novel.Predicted.Distributed.9genes.PAV.xlsx.
- Tumor.Reference.Distributed.186genes.PAV.xlsx.
- Tumor.Novel.Predicted.Distributed.64genes.PAV.xlsx.
- Tumor.Novel.Predicted.Distributed.9genes.PAV.xlsx.
- SGDP.Reference.Distributed.186genes.PAV.xlsx.
- SGDP.Novel.Predicted.Distributed.64genes.PAV.xlsx.
- SGDP.Novel.Predicted.Distributed.9genes.PAV.xlsx.
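
After downloading, it may be worth checking each archive against the md5 value listed above. The following is a minimal sketch, assuming GNU md5sum is installed and the files keep the names shown in the list; verify_md5 is a hypothetical helper and not part of HUPAN.

```shell
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper (not part of HUPAN): returns 0 when the md5 of FILE
# equals EXPECTED_MD5 and 1 otherwise. Assumes GNU md5sum is available.
verify_md5() {
    local file="$1" expected="$2" actual
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    # md5sum prints "<hash>  <file>"; keep only the hash.
    actual="$(md5sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

For example, `verify_md5 Nonreference.final.fa.gz d7a236fdd3009f45e6c359c2dc251404` would check the final non-reference sequence file.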

SOP of pan-genome analysis on cancer genomes

This SOP helps users carry out pan-genome analysis on cancer genomes with HUPAN. Because an individual human genome is large, pan-genome analysis of hundreds of individuals can hardly be completed on a single machine. We strongly suggest that users run all analyses on a supercomputer with the LSF or SLURM system. The commands of hupanLSF and hupanSLURM are the same except for the way jobs are submitted. In the following, all example commands are based on the SLURM system. If you work on an LSF-based supercomputer, please replace "hupanSLURM" with "hupanLSF".
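
One way to avoid editing every command when switching schedulers is to store the front-end name in a variable. This is only a sketch: hupan_frontend and the HUPAN variable are hypothetical names introduced here for illustration.

```shell
# hupan_frontend SCHEDULER
# Hypothetical helper: maps a scheduler name to the matching HUPAN front end.
# Only SLURM and LSF are handled, since those are the two systems named above.
hupan_frontend() {
    case "$1" in
        SLURM|slurm) echo "hupanSLURM" ;;
        LSF|lsf)     echo "hupanLSF" ;;
        *) echo "unknown scheduler: $1" >&2; return 1 ;;
    esac
}

# Resolve the command name once; later steps could then be written as
#   "$HUPAN" qualSta ...
HUPAN="$(hupan_frontend SLURM)"
echo "$HUPAN"
```

The commands in the rest of this SOP are written out with hupanSLURM directly, so the variable is purely a convenience.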

Please download the HUPAN (HUman Pan-genome ANalysis) tool from GitHub or the HUPAN homepage and install it according to the manual.
We provide an example dataset to help users test the pipeline. Note that it is only a simple example meant to illustrate the input data types and structure and to guide a run of the pipeline. Real data may be much larger and more complex. Please download it here and decompress it:
```
tar zxvf cpanExampleData.tar.gz && cd cpanExampleData
```
You will find two directories: data and ref. The data directory contains two sub-directories, normal and tumor, each holding three samples with two fastq files per sample. The ref directory contains the genome sequence (chr22.fa) and gene annotation (chr22.gff) of chr22 in GRCh38.

The quality of sequencing reads is assessed with hupanSLURM qualSta:
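
Before running the pipeline on your own data, a quick layout check can catch samples with missing read files. The sketch below assumes one sub-directory per sample and fastq files ending in .fastq, .fq, .fastq.gz or .fq.gz; both the naming scheme and the check_pairs helper are assumptions, not part of HUPAN.

```shell
# check_pairs GROUP_DIR
# Hypothetical helper: for every sample directory under GROUP_DIR, count the
# fastq files (assumed suffixes: .fastq, .fq, .fastq.gz, .fq.gz) and report
# samples that do not contain exactly two. Returns 1 if any sample is off.
check_pairs() {
    local group="$1" sample n status=0
    for sample in "$group"/*/; do
        [ -d "$sample" ] || continue
        n="$(find "$sample" -maxdepth 1 -type f \
                \( -name '*.fastq' -o -name '*.fq' \
                   -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
             | wc -l | tr -d ' ')"
        if [ "$n" -ne 2 ]; then
            echo "unexpected file count in $sample: $n"
            status=1
        fi
    done
    return "$status"
}
```

With the example data, `check_pairs data/normal` and `check_pairs data/tumor` would be the two calls to make.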

#normal samples
hupanSLURM  qualSta -f /path/to/Fastqc -t 16 -v PE data/normal normal_preview_quality/
#tumor samples
hupanSLURM  qualSta -f /path/to/Fastqc -t 16 -v PE data/tumor tumor_preview_quality/

If the read quality is not high enough, users can trim or filter low-quality reads with the command hupanSLURM trim:

#normal samples
hupanSLURM trim data/normal normal_trim/ /path/to/Trimmomatic
hupanSLURM trim -w 100 -m 100 data/normal normal_filter/ /path/to/Trimmomatic
#tumor samples
hupanSLURM trim data/tumor tumor_trim/ /path/to/Trimmomatic
hupanSLURM trim -w 100 -m 100 data/tumor tumor_filter/ /path/to/Trimmomatic

The raw sequencing reads (or trimmed reads) are assembled by SGA for each sample.

#normal samples
hupanSLURM assemble sga -t 16 data/normal normal_assembly_result /path/to/sga/
#tumor samples
hupanSLURM assemble sga -d 60 -t 16 data/tumor tumor_assembly_result /path/to/sga/

Non-reference sequences of each sample are extracted as follows:

#normal samples
hupanSLURM alignContig normal_assembly_result/data/ normal_aligned_result/  /path/to/MUMmer/ /path/to/reference.fa
hupanSLURM extractSeq normal_assembly_result/data/ normal_candidate/ normal_aligned_result/
hupanSLURM assemSta normal_candidate/data/ normal_quast_result/ /path/to/quast-4.5/ /path/to/reference.fa
hupanSLURM getUnalnCtg -p .contig normal_candidate/data/ normal_quast_result/data/ normal_Unalign_result/
#tumor samples
hupanSLURM alignContig tumor_assembly_result/data/ tumor_aligned_result/  /path/to/MUMmer/ /path/to/reference.fa
hupanSLURM extractSeq tumor_assembly_result/data/ tumor_candidate/ tumor_aligned_result/
hupanSLURM assemSta tumor_candidate/data/ tumor_quast_result/ /path/to/quast-4.5/ /path/to/reference.fa
hupanSLURM getUnalnCtg -p .contig tumor_candidate/data/ tumor_quast_result/data/ tumor_Unalign_result/
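
Because the normal and tumor commands differ only in their directory prefix, they are easy to mix up when copied by hand. The sketch below derives every directory name from a single group name. It prints the commands rather than running them, since each step submits cluster jobs whose output the next step reads. print_extract_cmds is a hypothetical helper, and the tool paths are the same placeholders used above.

```shell
# print_extract_cmds GROUP
# Hypothetical helper: prints (does not run) the four extraction commands for
# one sample group so that all directory names come from GROUP.
print_extract_cmds() {
    local g="$1"
    cat <<EOF
hupanSLURM alignContig ${g}_assembly_result/data/ ${g}_aligned_result/ /path/to/MUMmer/ /path/to/reference.fa
hupanSLURM extractSeq ${g}_assembly_result/data/ ${g}_candidate/ ${g}_aligned_result/
hupanSLURM assemSta ${g}_candidate/data/ ${g}_quast_result/ /path/to/quast-4.5/ /path/to/reference.fa
hupanSLURM getUnalnCtg -p .contig ${g}_candidate/data/ ${g}_quast_result/data/ ${g}_Unalign_result/
EOF
}

# Print the commands for both groups for review.
for group in normal tumor; do
    print_extract_cmds "$group"
done
```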

Non-reference sequences from multiple samples are merged by hupanSLURM mergeUnalnCtg:

#normal samples
hupanSLURM mergeUnalnCtg normal_Unalign_result/data/ normal_mergeUnalnCtg_result
#tumor samples
hupanSLURM mergeUnalnCtg tumor_Unalign_result/data/ tumor_mergeUnalnCtg_result

Redundant sequences among multiple samples are removed by hupanSLURM rmRedundant:

#normal samples
##fully unaligned contigs
hupanSLURM rmRedundant cdhitCluster normal_mergeUnalnCtg_result/total.fully.fa normal_rmRedundant_fully_unaligned/ /path/to/cdhit/
##partially unaligned contigs
hupanSLURM rmRedundant cdhitCluster normal_mergeUnalnCtg_result/total.partially.fa normal_rmRedundant_partially_unaligned/ /path/to/cd-hit/
#tumor samples
##fully unaligned contigs
hupanSLURM rmRedundant cdhitCluster tumor_mergeUnalnCtg_result/total.fully.fa tumor_rmRedundant_fully_unaligned/ /path/to/cdhit/
##partially unaligned contigs
hupanSLURM rmRedundant cdhitCluster tumor_mergeUnalnCtg_result/total.partially.fa tumor_rmRedundant_partially_unaligned/ /path/to/cd-hit/

Potential contaminant sequences are excluded by aligning to the non-redundant nt database:

mkdir nt && cd nt
wget https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz && gunzip nt.gz && cd ..
hupanSLURM blastAlign mkblastdb nt nt_index /path/to/blast

mkdir rmRedundant
mv normal_rmRedundant_fully_unaligned rmRedundant
mv normal_rmRedundant_partially_unaligned rmRedundant
mv tumor_rmRedundant_fully_unaligned rmRedundant
mv tumor_rmRedundant_partially_unaligned rmRedundant
hupanSLURM blastAlign blast rmRedundant/ rmRedundant_blast /path/to/nt_index /path/to/blast

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz && tar -zvxf new_taxdump.tar.gz
mkdir info && mv nucl_gb.accession2taxid info && mv new_taxdump/rankedlineage.dmp info
hupanSLURM getTaxClass rmRedundant_blast/data/normal_rmRedundant_fully_unaligned/non-redundant.blast info/ normal_fully_TaxClass
hupanSLURM getTaxClass rmRedundant_blast/data/normal_rmRedundant_partially_unaligned/non-redundant.blast info/ normal_partially_TaxClass
hupanSLURM getTaxClass rmRedundant_blast/data/tumor_rmRedundant_fully_unaligned/non-redundant.blast info/ tumor_fully_TaxClass
hupanSLURM getTaxClass rmRedundant_blast/data/tumor_rmRedundant_partially_unaligned/non-redundant.blast info/ tumor_partially_TaxClass

hupanSLURM rmCtm -i 60 rmRedundant/normal_rmRedundant_fully_unaligned/non-redundant.fa rmRedundant_blast/data/normal_rmRedundant_fully_unaligned/non-redundant.blast normal_fully_TaxClass/data/accession.name normal_fully_rmCtm
hupanSLURM rmCtm -i 60 rmRedundant/normal_rmRedundant_partially_unaligned/non-redundant.fa rmRedundant_blast/data/normal_rmRedundant_partially_unaligned/non-redundant.blast normal_partially_TaxClass/data/accession.name normal_partially_rmCtm
hupanSLURM rmCtm -i 60 rmRedundant/tumor_rmRedundant_fully_unaligned/non-redundant.fa rmRedundant_blast/data/tumor_rmRedundant_fully_unaligned/non-redundant.blast tumor_fully_TaxClass/data/accession.name tumor_fully_rmCtm
hupanSLURM rmCtm -i 60 rmRedundant/tumor_rmRedundant_partially_unaligned/non-redundant.fa rmRedundant_blast/data/tumor_rmRedundant_partially_unaligned/non-redundant.blast tumor_partially_TaxClass/data/accession.name tumor_partially_rmCtm

Redundant sequences between fully and partially unaligned contigs, as well as between normal and tumor samples, are further removed:

#normal samples
mkdir normal_Nonreference
cat normal_fully_rmCtm/data/novel_sequence.fa normal_partially_rmCtm/data/novel_sequence.fa > normal_Nonreference/nonrefernce.fa
hupanSLURM rmRedundant cdhitCluster normal_Nonreference/nonrefernce.fa normal_Nonredundant_Nonreference /path/to/cdhit/

#tumor samples
mkdir tumor_Nonreference
cat tumor_fully_rmCtm/data/novel_sequence.fa tumor_partially_rmCtm/data/novel_sequence.fa > tumor_Nonreference/nonrefernce.fa
hupanSLURM rmRedundant cdhitCluster tumor_Nonreference/nonrefernce.fa tumor_Nonredundant_Nonreference /path/to/cdhit/

#combine normal and tumor samples
mkdir combined_Nonreference
cat normal_Nonredundant_Nonreference/non-redundant.fa tumor_Nonredundant_Nonreference/non-redundant.fa >combined_Nonreference/nonrefernce.fa
hupanSLURM rmRedundant cdhitCluster combined_Nonreference/nonrefernce.fa final_Nonreference/ /path/to/cdhit/
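
To see how much each redundancy-removal round shrinks the sequence set, the sequence count and total length can be compared before and after. The following is a minimal sketch for uncompressed FASTA files; fasta_stats is a hypothetical helper and not part of HUPAN.

```shell
# fasta_stats FILE
# Hypothetical helper: prints "<number of sequences> <total bases>" for a
# plain (uncompressed) FASTA file. Header lines start with ">"; whitespace
# and carriage returns inside sequence lines are not counted as bases.
fasta_stats() {
    awk '/^>/ { n++; next }
         { gsub(/[ \t\r]/, ""); bases += length($0) }
         END { printf "%d %d\n", n, bases }' "$1"
}
```

For example, running `fasta_stats` on normal_Nonredundant_Nonreference/non-redundant.fa, tumor_Nonredundant_Nonreference/non-redundant.fa and final_Nonreference/non-redundant.fa would show how many sequences are shared between the normal and tumor sets.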

The annotation information of non-reference sequences is predicted by MAKER:

hupanSLURM genePre final_Nonreference/ gene_Prediction/ /path/to/maker/config_file /path/to/maker
hupanSLURM filterNovGen gene_Prediction gene_Prediction_Filter/ /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask

Construction and annotation of pan-genome

mkdir pan && cat /path/to/reference.fa final_Nonreference/non-redundant.fa > pan/pan.fa
cat /path/to/reference.gtf non-reference.gtf >pan/pan.gtf
hupanSLURM pTpG pan/pan.gtf pan/pan.pTpG.gtf

Gene presence-absence variation (PAV) analysis

cd pan && /path/to/bowtie2/bowtie2-build pan.fa pan && cd ..
hupanSLURM alignRead -f bowtie2 data/normal/ normal_map2pan/ /path/to/bowtie2 pan/pan
hupanSLURM alignRead -f bowtie2 data/tumor/ tumor_map2pan/ /path/to/bowtie2 pan/pan

hupanSLURM geneCov normal_map2pan/ tumor_map2pan/ sample_list.txt pan/pan.pTpG.gtf gene_Coverage/
hupanSLURM mergeGeneCov gene_Coverage/ merge_Gene_Coverage/
hupanSLURM geneExist merge_Gene_Coverage/ gene_exist/ 0 0.8
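
The two trailing numbers passed to geneExist (0 and 0.8) appear to act as minimum coverage cut-offs; the HUPAN manual is the authority on their exact meaning. To illustrate the general idea of turning coverage into presence/absence calls, the sketch below applies one cut-off to a small table. The table layout (tab-separated, header row, gene name in column 1, one coverage fraction per sample) and the call_pav helper are assumptions made for this example and do not reflect the files written by hupanSLURM mergeGeneCov.

```shell
# call_pav MIN_COV < coverage.tsv
# Hypothetical helper: replaces each coverage fraction with 1 (present) when
# it is at least MIN_COV and with 0 (absent) otherwise. The header row and
# the gene-name column are passed through unchanged.
call_pav() {
    awk -v min="$1" 'BEGIN { FS = OFS = "\t" }
        NR == 1 { print; next }
        {
            for (i = 2; i <= NF; i++)
                $i = ($i + 0 >= min + 0) ? 1 : 0
            print
        }'
}
```

A call such as `call_pav 0.8 < coverage.tsv` would then give a 0/1 matrix with the same shape as the input.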

Detect the placed novel sequences from partially unaligned contigs in each genome

#normal samples
hupanSLURM extractNovSeq normal_Unalign_result/ normal_Placed_Novel
hupanSLURM mergeNovSeq normal_Placed_Novel normal_Merge_Placed_Novel
hupanSLURM clusterNovSeq normal_Merge_Placed_Novel normal_Cluster_Placed_Novel
#tumor samples
hupanSLURM extractNovSeq tumor_Unalign_result/ tumor_Placed_Novel
hupanSLURM mergeNovSeq tumor_Placed_Novel tumor_Merge_Placed_Novel
hupanSLURM clusterNovSeq tumor_Merge_Placed_Novel tumor_Cluster_Placed_Novel

Contact Information

Zhongqu Duan
Chaochun Wei