Pan-genome analysis of gastric cancer

Introduction

Gastric cancer is one of the most common digestive cancers worldwide, especially in Asia. To date, several whole genome sequencing (WGS) studies have been published, and multiple genetic variations have been proposed to support the molecular classification of gastric cancer. Moreover, pan-cancer analysis of WGS data has promoted a comprehensive understanding of molecular mechanisms across multiple cancer types. However, all of those studies were based on the human reference genome, which is now known to miss tens to hundreds of millions of bases of individual-specific or population-specific genomic regions. These non-reference sequences may play important roles in pathogenesis or carcinogenesis. To address this problem, we conducted a pan-genome analysis of gastric cancer on 185 genomes to detect non-reference sequences, novel protein-coding genes and gene presence-absence variation (PAV).

The gastric cancer pan-genome analysis uses HUPAN to construct the cancer pan-genome, compares the cancer pan-genome with other pan-genomes, and tests the association between gene presence-absence variation (PAV) and cancer phenotypes. The steps are listed below.

  • Preprocessing to control the sequencing quality and trim low-quality reads if necessary;
  • De novo assembly of individual genomes;
  • Extracting non-reference sequences from assembled contigs;
  • Detecting placed novel sequences from partially unaligned contigs;
  • Removing redundancy and potential contamination among multiple genomes;
  • Construction and annotation of pan-genome;
  • Gene presence-absence variation (PAV) analysis;
  • Comparative analysis of the gastric cancer pan-genome with pan-genomes from other populations, such as Korean gastric cancer patients and healthy individuals from SGDP;
  • Association of gene PAV profile and phenotypes.

In addition, this analysis can be extended to other types of cancer. Please see the Standard Operating Procedure (SOP) below if you want to conduct pan-genome analysis of cancer genomes on your own data.

Datasets of gastric cancer pan-genome analysis results

  • Whole genome sequencing (WGS) data and assembled contigs of 185 individuals diagnosed with gastric cancer are available at NODE under the accessions OEP000301 for matched gastric mucosa and OEP000482 for primary tumor tissues. Please see the details of Request for Restricted Data if you want to access this dataset.
  • Placed novel sequences
  • Final non-reference sequences (80.88 Mb, 35,488 sequences): Nonreference.final.fa.gz (md5: d7a236fdd3009f45e6c359c2dc251404).
  • Annotation of non-reference sequences
  • PAV profile of distributed genes:
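
To confirm that Nonreference.final.fa.gz was downloaded intact, its MD5 digest can be compared with the value listed above. The sketch below assumes GNU md5sum is available; the helper name verify_md5 is ours and is not part of HUPAN.

```shell
# verify_md5 FILE EXPECTED_MD5
# Succeeds (exit 0) only if the MD5 digest of FILE equals EXPECTED_MD5.
verify_md5() (
  file=$1
  expected=$2
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(md5sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
)

# Usage (after downloading the file):
#   verify_md5 Nonreference.final.fa.gz d7a236fdd3009f45e6c359c2dc251404 && echo "checksum OK"
```

A non-zero exit status means the file is missing or differs from the published one and should be downloaded again.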

SOP of pan-genome analysis on cancer genomes

    This SOP helps users carry out pan-genome analysis on cancer genomes with HUPAN. Because of the large size of an individual human genome, pan-genome analysis on hundreds of individuals can hardly finish on a single machine. We strongly suggest that users conduct all the analysis on a supercomputer running the LSF or SLURM system. The commands of hupanLSF and hupanSLURM are the same except for the way jobs are submitted. In the following, all example commands are based on the SLURM system. If you work on a supercomputer based on the LSF system, please replace “hupanSLURM” with “hupanLSF”.
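
Since hupanSLURM and hupanLSF accept the same commands, a script can choose between them by checking which job-submission command is installed. This is a minimal sketch, assuming sbatch (SLURM) or bsub (LSF) is on PATH on the cluster; pick_hupan is a hypothetical helper name, not a HUPAN command.

```shell
# pick_hupan: print "hupanSLURM" if sbatch is on PATH, "hupanLSF" if bsub is,
# and fail with a message on stderr if neither is found.
pick_hupan() {
  if command -v sbatch >/dev/null 2>&1; then
    echo hupanSLURM
  elif command -v bsub >/dev/null 2>&1; then
    echo hupanLSF
  else
    echo "no SLURM (sbatch) or LSF (bsub) submit command found" >&2
    return 1
  fi
}

# Usage:
#   HUPAN=$(pick_hupan) || exit 1
#   "$HUPAN" qualSta -f /path/to/Fastqc -t 16 -v PE data/normal normal_preview_quality/
```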

    • Please download the HUPAN (HUman Pan-genome ANalysis) tool from GitHub or the HUPAN homepage and install it according to the manual.
    • We provide an example dataset to help users test the pipeline. Note that it is only a simple example to help users understand the input data types and data structure and to guide them through running the pipeline. Real data may be much larger and more complex. Please download it here and decompress it:
      tar zxvf cpanExampleData.tar.gz && cd cpanExampleData
      
      You will find two directories: data and ref. The data directory contains two sub-directories, normal and tumor, and each of them has three samples with two fastq files each. The ref directory contains the genome sequence (chr22.fa) and gene annotation (chr22.gff) of chr22 in GRCh38.
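
Before submitting jobs on your own data, it can help to confirm that it follows the same layout, with one directory per sample holding two FASTQ files. The sketch below is an illustration only: check_paired_samples is a hypothetical helper, and the file suffixes it looks for (.fastq, .fq, optionally gzipped) are an assumption about how the reads are named.

```shell
# check_paired_samples GROUP_DIR
# Prints one line per sample directory that does not contain exactly two
# FASTQ files, and returns non-zero if any such sample is found.
check_paired_samples() (
  status=0
  for sample in "$1"/*/; do
    [ -d "$sample" ] || continue
    # Count files whose names look like FASTQ (assumed suffixes).
    n=$(find "$sample" -maxdepth 1 -type f \
          \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' ')
    if [ "$n" -ne 2 ]; then
      echo "${sample%/}: expected 2 FASTQ files, found $n"
      status=1
    fi
  done
  exit "$status"
)

# Usage:
#   check_paired_samples data/normal && check_paired_samples data/tumor
```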
    • The quality of the sequencing reads is assessed by hupanSLURM qualSta:
      #normal samples
      hupanSLURM qualSta -f /path/to/Fastqc -t 16 -v PE data/normal normal_preview_quality/
      #tumor samples
      hupanSLURM qualSta -f /path/to/Fastqc -t 16 -v PE data/tumor tumor_preview_quality/
      
    • If the read quality is not high enough, users can trim or filter low-quality reads with the command hupanSLURM trim:
      #normal samples
      hupanSLURM trim data/normal normal_trim/ /path/to/Trimmomatic
      hupanSLURM trim -w 100 -m 100 data/normal normal_filter/ /path/to/Trimmomatic
      #tumor samples
      hupanSLURM trim data/tumor tumor_trim/ /path/to/Trimmomatic
      hupanSLURM trim -w 100 -m 100 data/tumor tumor_filter/ /path/to/Trimmomatic
      
    • The raw sequencing reads (or trimmed reads) are assembled by SGA for each sample.
      #normal samples
      hupanSLURM assemble sga -t 16 data/normal normal_assembly_result /path/to/sga/
      #tumor samples
      hupanSLURM assemble sga -d 60 -t 16 data/tumor tumor_assembly_result /path/to/sga/
      
    • Non-reference sequences of each sample are extracted as follows:
      #normal samples
      hupanSLURM alignContig normal_assembly_result/data/ normal_aligned_result/ /path/to/MUMmer/ /path/to/reference.fa
      hupanSLURM extractSeq normal_assembly_result/data/ normal_candidate/ normal_aligned_result/
      hupanSLURM assemSta normal_candidate/data/ normal_quast_result/ /path/to/quast-4.5/ /path/to/reference.fa
      hupanSLURM getUnalnCtg -p .contig normal_candidate/data/ normal_quast_result/data/ normal_Unalign_result/
      #tumor samples
      hupanSLURM alignContig tumor_assembly_result/data/ tumor_aligned_result/ /path/to/MUMmer/ /path/to/reference.fa
      hupanSLURM extractSeq tumor_assembly_result/data/ tumor_candidate/ tumor_aligned_result/
      hupanSLURM assemSta tumor_candidate/data/ tumor_quast_result/ /path/to/quast-4.5/ /path/to/reference.fa
      hupanSLURM getUnalnCtg -p .contig tumor_candidate/data/ tumor_quast_result/data/ tumor_Unalign_result/
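
Because each command above submits jobs to the scheduler, a sample whose job failed can go unnoticed until a later step. The sketch below lists samples that are present in one step's output but missing from the next. It assumes that every step writes one sub-directory per sample under its data/ directory, which is our reading of the paths used in this SOP; missing_samples is a hypothetical helper.

```shell
# missing_samples IN_DIR OUT_DIR
# Prints the name of every sample sub-directory found in IN_DIR that has no
# sub-directory of the same name in OUT_DIR.
missing_samples() (
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    [ -d "$2/$name" ] || echo "$name"
  done
)

# Usage (run once the submitted jobs have finished):
#   missing_samples normal_assembly_result/data normal_Unalign_result/data
#   missing_samples tumor_assembly_result/data tumor_Unalign_result/data
```

An empty output means every assembled sample has a matching result directory.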
      
    • Non-reference sequences from multiple samples are merged by hupanSLURM mergeUnalnCtg:
      #normal samples
      hupanSLURM mergeUnalnCtg normal_Unalign_result/data/ normal_mergeUnalnCtg_result
      #tumor samples
      hupanSLURM mergeUnalnCtg tumor_Unalign_result/data/ tumor_mergeUnalnCtg_result
      
    • Redundant sequences among multiple samples are removed by hupanSLURM rmRedundant:
      #normal samples
      ##fully unaligned contigs
      hupanSLURM rmRedundant cdhitCluster normal_mergeUnalnCtg_result/total.fully.fa normal_rmRedundant_fully_unaligned/ /path/to/cdhit/
      ##partially unaligned contigs
      hupanSLURM rmRedundant cdhitCluster normal_mergeUnalnCtg_result/total.partially.fa normal_rmRedundant_partially_unaligned/ /path/to/cdhit/
      #tumor samples
      ##fully unaligned contigs
      hupanSLURM rmRedundant cdhitCluster tumor_mergeUnalnCtg_result/total.fully.fa tumor_rmRedundant_fully_unaligned/ /path/to/cdhit/
      ##partially unaligned contigs
      hupanSLURM rmRedundant cdhitCluster tumor_mergeUnalnCtg_result/total.partially.fa tumor_rmRedundant_partially_unaligned/ /path/to/cdhit/
      
    • Potential contaminating sequences are excluded by aligning to the non-redundant nt database:

      mkdir nt && cd nt
      wget https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz && gunzip nt.gz && cd ..
      hupanSLURM blastAlign mkblastdb nt nt_index /path/to/blast
      
      mkdir rmRedundant
      mv normal_rmRedundant_fully_unaligned rmRedundant
      mv normal_rmRedundant_partially_unaligned rmRedundant
      mv tumor_rmRedundant_fully_unaligned rmRedundant
      mv tumor_rmRedundant_partially_unaligned rmRedundant
      hupanSLURM blastAlign blast rmRedundant/ rmRedundant_blast /path/to/nt_index /path/to/blast
      
      wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid
      wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz && tar -zvxf new_taxdump.tar.gz
      mkdir info && mv nucl_gb.accession2taxid info && mv new_taxdump/rankedlineage.dmp info
      hupanSLURM getTaxClass rmRedundant_blast/data/normal_rmRedundant_fully_unaligned/non-redundant.blast info/ normal_fully_TaxClass
      hupanSLURM getTaxClass rmRedundant_blast/data/normal_rmRedundant_partially_unaligned/non-redundant.blast info/ normal_partially_TaxClass
      hupanSLURM getTaxClass rmRedundant_blast/data/tumor_rmRedundant_fully_unaligned/non-redundant.blast info/ tumor_fully_TaxClass
      hupanSLURM getTaxClass rmRedundant_blast/data/tumor_rmRedundant_partially_unaligned/non-redundant.blast info/ tumor_partially_TaxClass
      
      hupanSLURM rmCtm -i 60 rmRedundant/normal_rmRedundant_fully_unaligned/non-redundant.fa rmRedundant_blast/data/normal_rmRedundant_fully_unaligned/non-redundant.blast normal_fully_TaxClass/data/accession.name normal_fully_rmCtm
      hupanSLURM rmCtm -i 60 rmRedundant/normal_rmRedundant_partially_unaligned/non-redundant.fa rmRedundant_blast/data/normal_rmRedundant_partially_unaligned/non-redundant.blast normal_partially_TaxClass/data/accession.name normal_partially_rmCtm
      hupanSLURM rmCtm -i 60 rmRedundant/tumor_rmRedundant_fully_unaligned/non-redundant.fa rmRedundant_blast/data/tumor_rmRedundant_fully_unaligned/non-redundant.blast tumor_fully_TaxClass/data/accession.name tumor_fully_rmCtm
      hupanSLURM rmCtm -i 60 rmRedundant/tumor_rmRedundant_partially_unaligned/non-redundant.fa rmRedundant_blast/data/tumor_rmRedundant_partially_unaligned/non-redundant.blast tumor_partially_TaxClass/data/accession.name tumor_partially_rmCtm
      
    • Further remove redundant sequences between fully and partially unaligned contigs as well as between normal and tumor samples:

      #normal samples
      mkdir normal_Nonreference
      cat normal_fully_rmCtm/data/novel_sequence.fa normal_partially_rmCtm/data/novel_sequence.fa > normal_Nonreference/nonreference.fa
      hupanSLURM rmRedundant cdhitCluster normal_Nonreference/nonreference.fa normal_Nonredundant_Nonreference /path/to/cdhit/
      
      #tumor samples
      mkdir tumor_Nonreference
      cat tumor_fully_rmCtm/data/novel_sequence.fa tumor_partially_rmCtm/data/novel_sequence.fa > tumor_Nonreference/nonreference.fa
      hupanSLURM rmRedundant cdhitCluster tumor_Nonreference/nonreference.fa tumor_Nonredundant_Nonreference /path/to/cdhit/
      
      #combine normal and tumor samples
      mkdir combined_Nonreference
      cat normal_Nonredundant_Nonreference/non-redundant.fa tumor_Nonredundant_Nonreference/non-redundant.fa > combined_Nonreference/nonreference.fa
      hupanSLURM rmRedundant cdhitCluster combined_Nonreference/nonreference.fa final_Nonreference/ /path/to/cdhit/
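
Once the final non-redundant set is written, its size can be summarized for comparison with the figures reported for the gastric cancer dataset (80.88 Mb in 35,488 sequences); values for other data will differ. The helper below is a small sketch using awk on an uncompressed FASTA file; fasta_stats is a hypothetical name, not a HUPAN command.

```shell
# fasta_stats FASTA
# Prints "<number of sequences> <total bases>" for a plain-text FASTA file.
fasta_stats() {
  awk '
    /^>/ { nseq++; next }                 # header line: count one sequence
    { gsub(/[ \t\r]/, "")                 # sequence line: drop stray whitespace
      bases += length($0) }               # and add its length to the total
    END { printf "%d %d\n", nseq, bases }
  ' "$1"
}

# Usage:
#   fasta_stats final_Nonreference/non-redundant.fa
```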
      
    • The annotation information of non-reference sequences is predicted by MAKER:
      hupanSLURM genePre final_Nonreference/ gene_Prediction/ /path/to/maker/config_file /path/to/maker
      hupanSLURM filterNovGen gene_Prediction gene_Prediction_Filter/ /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask
      
    • Construction and annotation of pan-genome
      mkdir pan && cat /path/to/reference.fa final_Nonreference/non-redundant.fa > pan/pan.fa
      cat /path/to/reference.gtf non-reference.gtf >pan/pan.gtf
      hupanSLURM pTpG pan/pan.gtf pan/pan.pTpG.gtf
      
    • Gene presence-absence variation (PAV) analysis

      cd pan && /path/to/bowtie2/bowtie2-build pan.fa pan && cd ..
      hupanSLURM alignRead -f bowtie2 data/normal/ normal_map2pan/ /path/to/bowtie2 pan/pan
      hupanSLURM alignRead -f bowtie2 data/tumor/ tumor_map2pan/ /path/to/bowtie2 pan/pan
      
      hupanSLURM geneCov normal_map2pan/ tumor_map2pan/ sample_list.txt pan/pan.pTpG.gtf gene_Coverage/
      hupanSLURM mergeGeneCov gene_Coverage/ merge_Gene_Coverage/
      hupanSLURM geneExist merge_Gene_Coverage/ gene_exist/ 0 0.8
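
To illustrate the idea behind a coverage-based presence/absence call, the sketch below converts a table of per-sample coverage fractions into a 0/1 PAV table using a single cutoff. It is not HUPAN's implementation. It assumes that the trailing numbers of geneExist (0 and 0.8) are minimum coverage thresholds, with 0.8 applied here as the cutoff; the HUPAN manual is authoritative on their exact meaning. The tab-separated input layout and the name cov_to_pav are hypothetical.

```shell
# cov_to_pav MIN_COV < coverage.tsv
# Input (hypothetical layout): a header line, then one row per gene holding the
# gene name followed by one coverage fraction (0..1) per sample, tab-separated.
# Output: the same table with each fraction replaced by 1 (present) or 0 (absent).
cov_to_pav() {
  awk -v min="$1" 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }               # copy the header unchanged
    { for (i = 2; i <= NF; i++)           # threshold every sample column
        $i = ($i >= min) ? 1 : 0
      print }'
}

# Usage with a hypothetical coverage table:
#   cov_to_pav 0.8 < coverage.tsv > pav.tsv
```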
    • Detect the placed novel sequences from partially unaligned contigs in each genome
      #normal samples
      hupanSLURM extractNovSeq normal_Unalign_result/ normal_Placed_Novel
      hupanSLURM mergeNovSeq normal_Placed_Novel normal_Merge_Placed_Novel
      hupanSLURM clusterNovSeq normal_Merge_Placed_Novel normal_Cluster_Placed_Novel
      #tumor samples
      hupanSLURM extractNovSeq tumor_Unalign_result/ tumor_Placed_Novel
      hupanSLURM mergeNovSeq tumor_Placed_Novel tumor_Merge_Placed_Novel
      hupanSLURM clusterNovSeq tumor_Merge_Placed_Novel tumor_Cluster_Placed_Novel

    Contact Information

    Zhongqu Duan
    Chaochun Wei