PPanG

PPanG: a precise pangenome browser combining linear and graph-based pan-genomes

Graph-based pangenomes are gradually gaining popularity over linear pangenomes because the graph model stores more comprehensive information about variations, including their locations and structures. Graph-based browsers have proven helpful and effective for visualizing pangenomes. However, traditional linear genome browsers have their own advantages, especially the large body of resources accumulated over time. In addition, demand for precise annotation of each individual in a pangenome is growing. Here we report a new pangenome browser, PPanG, a precise pangenome browser that combines linear and graph-based pangenomes. We use the rice pangenome as an example. Nine rice genomes with high-quality sequences and annotations are provided by default as potential reference genomes, and any individual genome can be selected as the reference. Gene annotations for all individuals can be displayed simultaneously on the graph-based pangenome. In this way, PPanG presents the differences between genomes, including sequences and different levels of annotation.

The graph model of PPanG is a sequence graph composed of nodes and edges:

(The source sequences are presented on the right and variations are underlined.)
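As a rough illustration of what "a sequence graph composed of nodes and edges" means, the sketch below builds a tiny graph in which each node holds a stretch of sequence, each edge links two consecutive nodes, and each genome is a walk through the graph. The class name, node IDs, and sample names are hypothetical and are not taken from PPanG; real pangenome graphs are also bidirected (nodes can be traversed on either strand), which this toy version ignores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes carry sequence, edges link node IDs, paths spell genomes."""
    nodes: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)
    paths: Dict[str, List[int]] = field(default_factory=dict)

    def add_path(self, name: str, node_ids: List[int]) -> None:
        """Register a genome as a walk and add every edge that the walk uses."""
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = node_ids

    def spell(self, name: str) -> str:
        """Reconstruct the source sequence of one path by concatenating its nodes."""
        return "".join(self.nodes[n] for n in self.paths[name])


# Hypothetical example: three samples that differ at one site.
graph = SequenceGraph(nodes={1: "ACG", 2: "T", 3: "C", 4: "GGA"})
graph.add_path("sample_A", [1, 2, 4])  # carries T at the variant site
graph.add_path("sample_B", [1, 3, 4])  # carries C at the variant site
graph.add_path("sample_C", [1, 4])     # deletion: skips the variant node

for name in graph.paths:
    print(name, graph.spell(name))
```

Shared nodes (1 and 4 here) are stored once, while the variant shows up as alternative routes between them.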

By drawing features on the graph model, gene annotations for all individuals are clearly visualized:

(Gray bars represent exons, and colored bars between exons represent introns.)
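Gene annotations usually list exons explicitly, so the intron bars have to be derived from the gaps between consecutive exons. The following is a minimal sketch of that derivation, assuming 1-based inclusive coordinates as in GFF3; the function name and the coordinates are illustrative only and do not describe PPanG's own code.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) lying between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when there is a real gap between the two exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical transcript with three exons.
exons = [(100, 200), (301, 400), (551, 700)]
print(introns_from_exons(exons))
```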

The linear model of PPanG is implemented by JBrowse2:

(Only the IRGSP-1.0 reference track is visible by default; other accessions can be added by double-clicking the corresponding track in the graph view.)

The linear views are aligned according to the start coordinate, so gene structures from different accessions can be compared intuitively. (This is simpler but more ambiguous than the graph model.)
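One way to picture alignment by start coordinate is to shift every accession's features so that the gene start becomes position 0. The sketch below does this for two hypothetical accessions; the function name, accession labels, and coordinates are assumptions for illustration, not PPanG's implementation. It also hints at the ambiguity: an insertion in the first intron of one accession shifts every downstream exon, even though those exons are otherwise identical.

```python
def to_relative(features, start):
    """Shift (start, end) intervals so that `start` maps to position 0."""
    return [(s - start, e - start) for s, e in features]


# Hypothetical exon coordinates of the same gene in two accessions.
acc_1 = {"start": 1000, "exons": [(1000, 1100), (1300, 1400)]}
# Same gene elsewhere on the chromosome, with a 50 bp insertion in the intron.
acc_2 = {"start": 5000, "exons": [(5000, 5100), (5350, 5450)]}

rel_1 = to_relative(acc_1["exons"], acc_1["start"])
rel_2 = to_relative(acc_2["exons"], acc_2["start"])
print(rel_1)
print(rel_2)
```

After the shift, the first exons line up exactly, while the second exons are offset by the length of the insertion.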

The graph model and linear model of PPanG correspond to each other:

The pangenome data used in PPanG come from the rice pangenome, which is composed of 105 Oryza sativa (OS) samples and 6 Oryza rufipogon (OR) samples, with raw data reaching 3 TB.

Genome sequences (contigs and scaffolds) of newly sequenced rice accessions:

Sample	Polished contigs	Chromosome-level scaffolds
QUAN	QUAN_pilon.fa.gz	QUAN.fixed.fa.gz
TG1	TG1_pilon.fa.gz	TG1.fixed.fa.gz
TG10	TG10_pilon.fa.gz	TG10.fixed.fa.gz
TG11	TG11_pilon.fa.gz	TG11.fixed.fa.gz
TG12	TG12_pilon.fa.gz	TG12.fixed.fa.gz
TG13	TG13_pilon.fa.gz	TG13.fixed.fa.gz
TG14	TG14_pilon.fa.gz	TG14.fixed.fa.gz
TG15	TG15_pilon.fa.gz	TG15.fixed.fa.gz
TG16	TG16_pilon.fa.gz	TG16.fixed.fa.gz
TG17	TG17_pilon.fa.gz	TG17.fixed.fa.gz

Other related data can be accessed at https://cgm.sjtu.edu.cn/TGSrice/.

Twelve rice pangenome graphs, one per chromosome, were built separately with each of two pangenome graph builders: Minigraph-Cactus/v2.2.2 and PGGB/v0.5.3-19-g507fc04.

GitHub link for Cactus: https://github.com/ComparativeGenomicsToolkit/cactus.

GitHub link for PGGB: https://github.com/pangenome/pggb.

Chromosome	Minigraph-Cactus	PGGB
Chromosome 1	chr01_mc.xg	chr01_pggb.xg
Chromosome 2	chr02_mc.xg	chr02_pggb.xg
Chromosome 3	chr03_mc.xg	chr03_pggb.xg
Chromosome 4	chr04_mc.xg	chr04_pggb.xg
Chromosome 5	chr05_mc.xg	chr05_pggb.xg
Chromosome 6	chr06_mc.xg	chr06_pggb.xg
Chromosome 7	chr07_mc.xg	chr07_pggb.xg
Chromosome 8	chr08_mc.xg	chr08_pggb.xg
Chromosome 9	chr09_mc.xg	chr09_pggb.xg
Chromosome 10	chr10_mc.xg	chr10_pggb.xg
Chromosome 11	chr11_mc.xg	chr11_pggb.xg
Chromosome 12	chr12_mc.xg	chr12_pggb.xg

We used MAKER to precisely annotate 113 samples of the rice pan-genome and the reference genome Nipponbare.

Here are the annotation results for the 114 samples (names of the 9 representative genomes are in bold):

Sample	Annotation	Transcripts	Proteins
ARC10497	ARC10497_maker.gff	ARC10497.all.maker.transcripts.fasta	ARC10497.all.maker.proteins.fasta
Azucena	Azucena_maker.gff	Azucena.all.maker.transcripts.fasta	Azucena.all.maker.proteins.fasta
Basmati	Basmati_maker.gff	Basmati.all.maker.transcripts.fasta	Basmati.all.maker.proteins.fasta
CGS	CGS_maker.gff	CGS.all.maker.transcripts.fasta	CGS.all.maker.proteins.fasta
CHAOMEO	CHAOMEO_maker.gff	CHAOMEO.all.maker.transcripts.fasta	CHAOMEO.all.maker.proteins.fasta
GOBOLSAIL	GOBOLSAIL_maker.gff	GOBOLSAIL.all.maker.transcripts.fasta	GOBOLSAIL.all.maker.proteins.fasta
H7L1	H7L1_maker.gff	H7L1.all.maker.transcripts.fasta	H7L1.all.maker.proteins.fasta
H7L26	H7L26_maker.gff	H7L26.all.maker.transcripts.fasta	H7L26.all.maker.proteins.fasta
H7L27	H7L27_maker.gff	H7L27.all.maker.transcripts.fasta	H7L27.all.maker.proteins.fasta
H7L28	H7L28_maker.gff	H7L28.all.maker.transcripts.fasta	H7L28.all.maker.proteins.fasta

In the precise annotation step, EST evidence and protein homology evidence were each clustered at 90% identity to build a non-redundant evidence set.

EST evidences: OS_MIR90_ts.fa (76,071, 125Mb)
Protein homology evidences: O_MIR90_cds.pep (80,552, 36Mb)

Here is the detailed MAKER configuration in `maker_opts.ctl`:

 #-----Genome (these are always required)
 genome= #genome sequence (fasta file or fasta embedded in GFF3 file)
 organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic 
 #-----Re-annotation Using MAKER Derived GFF3
 maker_gff= #MAKER derived GFF3 file
 est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
 altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
 protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
 rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
 model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
 pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
 other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no
 #-----EST Evidence (for best results provide a file for at least one)
 est=OS_MIR90_ts.fa #set of ESTs or assembled mRNA-seq in fasta format
 altest= #EST/cDNA sequence file in fasta format from an alternate organism
 est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
 altest_gff= #aligned ESTs from a closely related species in GFF3 format
 #-----Protein Homology Evidence (for best results provide a file for at least one)
 protein=O_MIR90_cds.pep #protein sequence file in fasta format (i.e. from multiple organisms)
 protein_gff= #aligned protein homology evidence from an external GFF3 file 
 #-----Repeat Masking (leave values blank to skip repeat masking)
 model_org=rice #select a model organism for RepBase masking in RepeatMasker
 rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
 repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
 rm_gff= #pre-identified repeat elements from an external GFF3 file
 prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
 softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering) 
 #-----Gene Prediction
 snaphmm= #SNAP HMM file
 gmhmm= #GeneMark HMM file
 augustus_species=rice #Augustus gene prediction species model
 fgenesh_par_file= #FGENESH parameter file
 pred_gff= #ab-initio predictions from an external GFF3 file
 model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
 est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
 protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
 trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
 snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
 unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no 
 #-----Other Annotation Feature Types (features MAKER doesn't recognize)
 other_gff= #extra features to pass-through to final MAKER generated GFF3 file
 #-----External Application Behavior Options
 alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
 cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI) 
 #-----MAKER Behavior Options
 max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
 min_contig=1 #skip genome contigs below this length (under 10kb are often useless) 
 pred_flank=200 #flank for extending evidence clusters sent to gene predictors
 pred_stats=0 #report AED and QI statistics for all predictions as well as models
 AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
 min_protein=0 #require at least this many amino acids in predicted proteins
 alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
 always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
 map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
 keep_preds=1 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1) 
 split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
 single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
 single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
 correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes 
 tries=3 #number of times to try a contig if there is a failure for some reason
 clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
 clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
 TMP= #specify a directory other than the system default temporary directory for temporary files
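Most options in the listing above are left empty or at 0, so it can be useful to extract only the settings that carry a value. The following is a minimal sketch of a reader for the `key=value #comment` layout shown above; the function name is hypothetical, and it assumes that option values never contain a `#` character.

```python
def parse_maker_ctl(text):
    """Parse `key=value #comment` lines into a dict, skipping section headers and blanks."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or "#-----Section" header
        setting = line.split("#", 1)[0]  # drop the trailing comment
        if "=" not in setting:
            continue
        key, value = setting.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# A few lines copied from the configuration above.
sample = """\
 #-----EST Evidence (for best results provide a file for at least one)
 est=OS_MIR90_ts.fa #set of ESTs or assembled mRNA-seq in fasta format
 altest= #EST/cDNA sequence file in fasta format from an alternate organism
 augustus_species=rice #Augustus gene prediction species model
 est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
"""

opts = parse_maker_ctl(sample)
set_options = {k: v for k, v in opts.items() if v}  # keep only non-empty values
print(set_options)
```

Applied to the full file, the same filter gives a compact view of the evidence files (`est`, `protein`) and the species models (`model_org`, `augustus_species`) used in this run.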
