Pan-genome analysis was carried out for the 3,010 rice accessions. First, we built a comprehensive dataset of rice sequences by combining the IRGSP reference and de novo assembled contigs from 3,010 deeply sequenced rice genomes. The dataset showed that rice as a species contains almost twice as much genome sequence as an individual rice genome. A total of 15,362 genes were predicted on these sequences. The presence/absence variation of each gene was detected for 453 rice accessions with sequencing coverage higher than 20x. A phylogenetic study based on these variations was carried out. All rice accessions were grouped based on the phylogenetic study, and between-group variations were further studied. The distributed genes not included in the reference genome have important functions, such as response to freezing and cold acclimation. The pan-genome analysis of the 3K rice genomes revealed the variation among different rice accessions.
If you use RPAN in a project that you publish, please cite the most recent RPAN paper, which was published in Nucleic Acids Research.
Cultivated rice, Oryza sativa L., is one of the major staple foods of the world and a model organism in plant biology. The 3,000 (3K) Rice Genome Project gives us an opportunity to gain insight into the genome diversity within the O. sativa gene pool. Comprehensive analyses of 3,010 rice genomes revealed the population organization of the genome variation in the rice pan-genome. RPAN presents analysis results from the 3K rice genome data, focusing on gene presence/absence variation (PAV), which provides a new perspective for rice researchers and breeding experts.
RPAN includes the following data:
RPAN also provides the following analysis tools:
The reference pan-genome was constructed from the IRGSP genome and the non-redundant unaligned contigs. All these contigs were divided into 12 groups according to the classification of their corresponding rice accessions. These groups include five subgroups (IG1, IG2, IG3, IG4, IG5) of subspecies Indica, AUSG6, four subgroups (JG7, JG8, JG9, JG10) of subspecies Japonica, AROG11 and admixtures (Adm). The contigs from the same group were concatenated into a pseudo-chromosome, with 100 consecutive Ns as delimiters. Finally, the IRGSP genome and these pseudo-chromosomes were merged as the reference pan-genome. All the contents in RPAN are based on this reference.
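The concatenation step can be illustrated with a minimal Python sketch. It is not the pipeline used to build RPAN: the function names and the input layout (pairs of group name and contig sequence) are assumptions made for illustration; only the 100-N delimiter comes from the description above.

```python
from collections import defaultdict

# 100 consecutive Ns between contigs, as described in the text.
DELIMITER = "N" * 100


def build_pseudo_chromosomes(contigs):
    """Concatenate the contigs of each group into one pseudo-chromosome.

    `contigs` is an iterable of (group_name, sequence) pairs. This input
    layout is hypothetical and chosen only for the example.
    """
    by_group = defaultdict(list)
    for group, sequence in contigs:
        by_group[group].append(sequence)
    # Contigs are joined in input order with the N delimiter between them.
    return {group: DELIMITER.join(seqs) for group, seqs in by_group.items()}


def contig_offsets(sequences):
    """Return the 0-based start of each contig inside its pseudo-chromosome."""
    offsets, position = [], 0
    for seq in sequences:
        offsets.append(position)
        position += len(seq) + len(DELIMITER)
    return offsets
```

Keeping the offsets makes it possible to map a coordinate on a pseudo-chromosome back to the contig it came from.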
A total of 453 high-quality genomes with sequencing depths >20x and mapping depths >15x were chosen for detailed pan-genome analyses.
To get the list of high-quality accessions, visit the rice table page and check "Yes" in the "High quality accessions" option.
Figure 1. High quality accession criterion.
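The two depth thresholds can be written as a simple filter. The sketch below only restates the criterion in Python; the `Accession` record, its field names and the function names are hypothetical, not RPAN's own data model.

```python
from typing import NamedTuple


class Accession(NamedTuple):
    # Hypothetical record layout used only for this example.
    code: str
    sequencing_depth: float
    mapping_depth: float


MIN_SEQUENCING_DEPTH = 20.0  # sequencing depth must be > 20x
MIN_MAPPING_DEPTH = 15.0     # mapping depth must be > 15x


def is_high_quality(accession):
    """Apply both thresholds; an accession must pass each of them."""
    return (accession.sequencing_depth > MIN_SEQUENCING_DEPTH
            and accession.mapping_depth > MIN_MAPPING_DEPTH)


def select_high_quality(accessions):
    """Return the accessions that satisfy the high-quality criterion."""
    return [acc for acc in accessions if is_high_quality(acc)]
```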
Core genes | Genes which exist in all high-quality rice accessions |
Distributed genes | Genes which exist in significantly fewer than 99% of accessions (binomial test, p-value < 0.05; null hypothesis: “loss rate < 1%”) |
Candidate core genes | Genes which exist in > 99% (but not all) of high-quality rice accessions (binomial test, FDR < 0.05) |
Subspecies-unbalanced genes | Distributed genes whose frequency in one or more subspecies is significantly higher than that in other subspecies (Fisher's test, FDR < 0.05) |
Indica-dominant genes | Subspecies-unbalanced genes whose frequency in Indica is 5% greater than their frequency in Japonica |
Japonica-dominant genes | Subspecies-unbalanced genes whose frequency in Japonica is 5% greater than their frequency in Indica |
Subspecies-specific genes | Distributed genes which exist in one subspecies but are absent from all other subspecies |
Indica-specific genes | Genes which exist only in Indica |
Japonica-specific genes | Genes which exist only in Japonica |
AUS-specific genes | Genes which exist only in AUS |
ARO-specific genes | Genes which exist only in ARO |
Subgroup-unbalanced genes | Distributed genes whose frequency in one or more sub-groups of a subspecies is significantly higher than the frequencies in other sub-groups in this subspecies. |
Indica-subgroup-unbalanced genes | Distributed genes which are abundant (or have significantly higher frequencies) in specific Indica subgroup(s) but have low frequencies in the other Indica subgroup(s) (Fisher's test, FDR < 0.05) |
Japonica-subgroup-unbalanced genes | Distributed genes which are abundant (or have significantly higher frequencies) in specific Japonica subgroup(s) but have low frequencies in the other Japonica subgroup(s) (Fisher's test, FDR < 0.05) |
Random genes | Distributed genes which show no difference among groups and subgroups (genes that are not core, candidate core, subspecies-unbalanced or subgroup-unbalanced) |
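The core / candidate core / distributed split in the table above can be sketched with a one-sided binomial test. This is a simplified illustration under stated assumptions, not the RPAN implementation: the function names are hypothetical, the test is run at the boundary loss rate of 1%, raw p-values are used (the FDR correction mentioned for candidate core genes is omitted), and every non-core gene that is not called distributed is labeled candidate core.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def classify_gene(present, total, max_loss_rate=0.01, alpha=0.05):
    """Classify a gene from its presence count among high-quality accessions.

    present: number of accessions that carry the gene
    total:   number of accessions examined (453 in the text)
    """
    losses = total - present
    if losses == 0:
        return "core"
    # Null hypothesis: loss rate < 1%, tested at the boundary value.
    # A small p-value means the gene is lost more often than that.
    p_value = binomial_upper_tail(losses, total, max_loss_rate)
    return "distributed" if p_value < alpha else "candidate core"
```

For example, `classify_gene(453, 453)` returns `"core"`, while a gene missing from 100 of 453 accessions is classified as `"distributed"`.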
The phylogenetic tree was constructed based on PAVs among the 453 high-quality accessions.
In the gene distribution tree shown in the results of a single-gene search, users can see directly which accessions on the phylogenetic tree carry the gene. Users can also compare this tree with the tree below, which is labeled with classification and geographical distribution.
Figure 2. Phylogenetic tree of 453 high quality accessions.
Users can type a gene ID (e.g. Os02g0561500) in the search box. After clicking the "Search" button, a new page will display search results. The results consist of seven parts: basic gene information, gene categorization, gene distribution, gene presence frequency, gene ontology, CDS and protein sequence.
Figure 3. Example of search by a gene ID.
Basic gene information includes:
Gene categorization
Gene presence frequency
Gene ontology
CDS
Protein sequence
Users can also search the rice pan-genome directly with genomic sequences. One or more sequences in FASTA format can be submitted. Each alignment can be examined on a detail page by clicking the "Genome Browser" button in its record line, and visualized in the pan-genome browser.
Figure 4. Example of search by DNA sequences.
Users can type an accession code (e.g. B001) into the search box. After clicking the “Search” button, a new page will display search results. The results consist of two parts: basic rice accession information and statistics of gene categorizations in this rice accession.
Figure 5. Example of search by a rice accession.
Basic rice information includes:
Gene statistics
Users can input multiple accession codes in the search box or upload a file containing accession codes. The minimum number of rice accessions sharing a specific gene is an optional parameter. If this number is set to 1, the search result will be all genes existing in at least one of the input accessions; if it is set to the number of input accessions, the result will be the core genes of the input accessions, that is, the genes present in every one of them. The basic information of these accessions and the resulting genes can then be downloaded, and statistics tables and charts for these genes are also provided.
Figure 6. Example of search by rice list.
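The effect of the minimum-sharing parameter can be sketched as follows. The presence table (a mapping from accession code to the set of gene IDs present in it) and the function name are assumptions for illustration, not RPAN's API.

```python
from collections import Counter


def genes_shared_by(pav, accessions, min_accessions=1):
    """Return genes present in at least `min_accessions` of the accessions.

    pav: dict mapping accession code -> set of gene IDs present in it
         (a hypothetical in-memory layout).
    """
    counts = Counter()
    for accession in accessions:
        # Each accession contributes at most one count per gene.
        counts.update(pav[accession])
    return {gene for gene, n in counts.items() if n >= min_accessions}
```

With a threshold of 1 the result is the union of the gene sets; with a threshold equal to the number of input accessions it is their intersection, i.e. the core genes of the input list.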
Users can type multiple gene IDs into the search box or upload a file containing gene IDs. The basic information of the rice accessions in which all of these genes are present, together with the input genes, can then be downloaded, and statistics tables and charts for these genes are also provided.
Figure 7. Example of search by gene ID list.
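This query is the reverse lookup of the accession-list search: it keeps the accessions whose gene set contains every input gene. A minimal sketch, again assuming a hypothetical mapping from accession code to the set of gene IDs present:

```python
def accessions_with_all_genes(pav, gene_ids):
    """Return the accessions in which every gene in `gene_ids` is present.

    pav: dict mapping accession code -> set of gene IDs present in it
         (hypothetical layout used only for this example).
    """
    wanted = set(gene_ids)
    # `wanted <= genes` is a subset test: all wanted genes must be present.
    return sorted(acc for acc, genes in pav.items() if wanted <= genes)
```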
All information in the pan-genome browser is stored in tables that can be downloaded. These tables include the rice accession information table, the genome annotation table and the gene expression profile table.
In the rice accession information table, users can filter the results by selecting browse options such as categories, geographical regions and sequencing depth status (high/low). Note that the result is the intersection of all selected options. A summary table can be generated for the filtered results.
Figure 8. Usage of rice accession table.
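Because the selected options are combined as an intersection, a row is kept only if it matches every option. The sketch below shows this AND logic; the row layout and the key names (`category`, `region`, `high_quality`) are made up for the example.

```python
def filter_accessions(rows, **criteria):
    """Keep the rows that match every given criterion (logical AND).

    rows:     list of dicts, one per accession (hypothetical layout)
    criteria: column name -> required value
    """
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]
```

Calling `filter_accessions(rows, category="IG1", high_quality=True)` would return only the rows that are both in group IG1 and flagged as high quality; adding more options can only narrow the result.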
For visualization, please refer to the visualization section.
The summary works in the same way as the search by multiple rice accessions.
The gene information table contains 50,995 full-length genes. It lists basic gene information, including chromosome position on the IRGSP-1.0 reference genome, strand, CDS length and exon number. Detailed gene information, such as gene categorization (core/distributed), gene presence frequency, gene ontology, coding sequence and protein sequence, as well as visualization, can be accessed by clicking the related links. A genomic region can also be searched in the format “chromosome ID: start coordinate-end coordinate”.
Figure 9. Usage of gene table.
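A region query in the “chromosome ID: start coordinate-end coordinate” format can be parsed as in the sketch below. The chromosome name `chr01` in the usage note is a placeholder, and tolerating spaces and thousands separators is a choice of this example, not documented RPAN behavior.

```python
import re

# chromosome ID, a colon, then start and end separated by a hyphen.
REGION_PATTERN = re.compile(r"^\s*(\S+?)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_region(text):
    """Parse 'chromosome:start-end' into (chromosome, start, end)."""
    match = REGION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chromosome = match.group(1)
    # Commas are accepted as thousands separators and removed here.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"start is greater than end in region: {text!r}")
    return chromosome, start, end
```

For instance, `parse_region("chr01:1000-2000")` returns `("chr01", 1000, 2000)`.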
A total of 226 runs of RNA-seq data from diverse rice tissues were collected. Detailed gene expression profiles can be accessed and visualized in the genome browser.
Figure 10. Usage of expression profile table.
The visualization page contains two parts: a dynamic tree browser in the left panel and a genome browser in the right panel. The tree was constructed from the SNP data. Users can select multiple nodes (including leaf nodes and internal nodes) and click the “Submit” button to visualize these rice accessions in the genome browser. The tree browser also supports a search function to speed up the selection of target genomes. The pan-genome reference sequence, gene annotation and overall presence frequency in high-quality accessions are the three basic tracks. There are 3,010 rice genome tracks and 226 RNA-seq tracks. Users can select any number of accessions or expression datasets through the hidden “Select tracks” panel or the tree browser. For performance reasons, we recommend selecting fewer than 300 tracks at a time.
Figure 11. Usage of browsers.
The tree browser is composed of a tree viewer and four toolbars, one of which lies at the bottom of the browser. The top toolbar locates terminal (leaf) nodes by accession code.
The next two toolbars change the behavior of internal nodes and leaf nodes respectively. When "fold" is chosen, clicking an internal node hides its child nodes. When "select" is chosen, clicking a node selects its child nodes (or the node itself if it is a leaf). A selected leaf node will be shown in the genome browser when the "submit" button is clicked. When "preserve" is chosen, clicking a node preserves its child nodes (or the node itself if it is a leaf). A preserved leaf node is not hidden when its ancestors are folded. In every mode, clicking a node a second time reverses the action.
The last toolbar lies at the bottom and provides functions that act on the selection. Clicking the "Next Selected" button scrolls the tree browser down to the next selected leaf node. Clicking the "clear" button deselects all selected accessions. Clicking the "submit" button shows the selected accessions in the genome browser. Clicking the "Help" button shows a description of each button. Clicking the "Hide All" button hides all internal nodes.
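The "select" mode can be modeled with a small tree structure. This is only a sketch of the behavior described above, not the code behind the tree browser: the `Node` class and the function names are hypothetical, and the rule used for a second click (deselect when every leaf under the node is already selected, otherwise select) is an assumption.

```python
class Node:
    """A tree node; a node without children is a leaf (one accession)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.selected = False

    def leaves(self):
        """Return the leaf nodes under this node (or the node itself)."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def click_select(node):
    """Simulate one click on `node` while the 'select' mode is active."""
    leaves = node.leaves()
    # Assumed toggle rule: a click on a fully selected subtree deselects it.
    new_state = not all(leaf.selected for leaf in leaves)
    for leaf in leaves:
        leaf.selected = new_state


def selected_leaves(root):
    """Names of the leaves that 'submit' would send to the genome browser."""
    return [leaf.name for leaf in root.leaves() if leaf.selected]
```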
The genome browser is based on JBrowse. Detailed usage instructions are available on the official JBrowse site.
There are four buttons at the top of this panel.
There are five types of tracks: “reference sequence”, “gene”, “presence frequency”, accession and RNA-seq. The first three are default tracks.
The reference pan-genome sequence and annotation are available on the download page.
Database update
Database update
Database update
Add expression data
Add detailed information for genes
Database update
Database update