A: Users can adjust the key parameters that we consider important through EUPAN. Please see the manual page for details. If you think other parameters are also important, please contact the authors.
A: The “pan-genome” concept itself has no limitations. For example, researchers have even studied the pan-genome of all bacteria or eubacteria with available genome data. Such pan-genome studies resemble ortholog analysis, and they examine gene families rather than individual genes. The same holds for eukaryotes. The “map-to-pan” strategy, however, does have limitations: it can only be used for closely related individuals, not for all eukaryotes. This limitation comes from both the use of representative sequences and the limits of short-read mapping. We therefore recommend that users study individuals within the same species or very closely related species (like rice and wild rice) with EUPAN.
A: EUPAN can be used for bacteria, but we recommend that users study strains within the same species or very closely related species. On the homepage we describe the most widely used pan-genome strategies for bacteria and their limitations for eukaryotes. Almost all pan-genome analysis pipelines for bacteria depend on complete genome sequences of the individuals, which are difficult and very expensive to obtain for eukaryotes. For bacterial pan-genome analysis, you may also want to try the following tools:
A: “Resemble the reference” means that if a gene A on a given genome/strain X is very similar to, but different from, a reference gene B (for example, 99% of bases are identical), EUPAN will report that gene B rather than gene A exists in strain X. In our opinion, this is an unavoidable consequence of eliminating redundant sequences, similar to sequence processing in the NT/NR databases. Keeping all sequences of the same gene or of a group of similar genes would require successfully assembling every sequence of this gene group in all the individuals. This is the traditional method shown in the figure on the homepage. In the “map-to-pan” strategy, the first step is to construct a pan-genome, that is, to reduce sequence redundancy. The key question in this step is which sequence is chosen as the representative of a group of very similar sequences. The current strategy in EUPAN is to combine the reference genome with non-redundant novel sequences from de novo assemblies. This strategy has two advantages: 1) the gene PAV results can be used directly by other scientists, because the information (id, sequence, function and other annotation) of the reference genes is widely used; and 2) the pan-genome sequence is not very “fragmented” and is easier to integrate into a genome browser. The disadvantage is that reference genes are selected as the representative for all gene groups. Notably, in most cases the PAV of a gene or gene group is not affected by which sequence is chosen as the representative. This follows from the parameters (mapping coverage of the gene) used when determining presence/absence by read mapping. In other words, the choice has little effect on the subsequent analysis.
Moreover, EUPAN is a flexible toolkit: users can build the pan-genome sequence with other strategies and then feed the resulting pan-genome sequences back into the pipeline. For example, one can build pan-genome sequences by re-assembling the de novo assembled contigs, which introduces no bias toward the reference. The limitations of this strategy are that (1) there will be chimeric sequences and (2) the pan-genome is very “fragmented”. At this stage, we prefer the first strategy.
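The role of mapping coverage in the presence/absence decision can be illustrated with a minimal sketch. This is not EUPAN's actual implementation: the function names, the per-base depth input, and the thresholds `min_depth=1` and `min_coverage=0.8` are assumptions chosen for illustration only. Please see the manual for the real parameters.

```python
from typing import Sequence


def gene_coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of gene positions covered by at least `min_depth` reads.

    `depths` is a hypothetical list of per-base read depths over one gene
    of the pan-genome (one integer per position).
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def is_present(depths: Sequence[int],
               min_depth: int = 1,
               min_coverage: float = 0.8) -> bool:
    """Call a gene 'present' when enough of its length is covered.

    Both thresholds are illustrative assumptions, not EUPAN defaults.
    """
    return gene_coverage(depths, min_depth) >= min_coverage


# Reads from a near-identical gene A still map across most of the
# representative gene B, so B is called present in that individual.
mostly_covered = [3] * 9 + [0]   # 9 of 10 positions covered
uncovered = [0] * 10             # no reads mapped

print(is_present(mostly_covered))
print(is_present(uncovered))
```

Because the call depends on the covered fraction rather than on exact sequence identity, a 99%-identical gene would pass the same test whichever member of the group served as the representative.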
A: For most steps, EUPAN divides the whole task into many jobs, and each job should be run on a single node with multiple CPUs. Users can select the number of CPUs for a single job via the -t option. You should know the basic configuration of the computer cluster you use, for example the number of CPUs and the memory size of each node. Please choose a suitable number of CPUs for each job. For example, if each node has 16 CPUs, you may assign 2, 4, 8 or 16 CPUs to each job rather than 5, 7 or 9. We recommend using 4 CPUs for most steps.
Also, please take memory into consideration, especially for de novo assemblies. Running more jobs at once (with fewer CPUs per job) means that you need more memory. 32 GB of memory is enough for most steps, except the assembly step. The memory needed to assemble one genome depends on both the genome size and the sequencing depth, so you should test the memory use on one sample first. For example, a 400 Mb genome with a sequencing depth of 15X needs 50 GB of memory.
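The CPU and memory advice above can be turned into a quick planning check. The sketch below is only an illustration: the function names and the node sizes in the example (16 CPUs with 128 GB or 64 GB of memory) are hypothetical, and it is not part of EUPAN.

```python
from typing import List


def sensible_cpu_choices(cpus_per_node: int) -> List[int]:
    """CPU counts per job that divide the node evenly (no idle CPUs)."""
    return [n for n in range(1, cpus_per_node + 1) if cpus_per_node % n == 0]


def jobs_per_node(cpus_per_node: int, mem_per_node_gb: float,
                  cpus_per_job: int, mem_per_job_gb: float) -> int:
    """Number of jobs that fit on one node, limited by CPUs and by memory."""
    by_cpu = cpus_per_node // cpus_per_job
    by_mem = int(mem_per_node_gb // mem_per_job_gb)
    return min(by_cpu, by_mem)


# A 16-CPU node: 5, 7 or 9 CPUs per job would leave CPUs idle.
print(sensible_cpu_choices(16))

# Most steps: 4 CPUs and 32 GB per job on a hypothetical 128 GB node.
print(jobs_per_node(16, 128, 4, 32))

# Assembly step: 50 GB per job on a hypothetical 64 GB node is memory-bound.
print(jobs_per_node(16, 64, 4, 50))
```

In the last case memory, not CPU count, limits the node to a single job, which is why the memory use of one sample should be measured before launching many assemblies.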
A: The accuracy of EUPAN can be evaluated at the level of gene presence-absence detection. To give users an intuitive picture, we compared rice gene PAVs detected by EUPAN with a recent rice pan-genome study that used 3 representative rice accessions, Nipponbare, DJ123 and IR64 (Schatz, M. C. et al. Genome Biol 2014; Traditional method 2 in the following table). That study assembled the genomes from deep sequencing data with multiple sequencing libraries, reaching a total sequencing depth of ~110x. The assemblies covered 81.3% to 82.5% (81.8% for Nipponbare) of each genome when only non-N bases are counted, and 88.5% to 91.4% (91.4% for Nipponbare) when N bases are included. The study predicted 39,083 genes for the Nipponbare genome. The total gene number would therefore be adjusted to 42,852 to 47,779 if the whole genome were assembled and the gene density remained the same (the remaining sequence probably contains fewer genes, but this serves as a rough estimate). Our gold standard annotation of the Nipponbare genome contains 35,633 annotated genes. Assuming that all genes on the assembled sequences are correctly predicted (we also used the reference annotations directly in our analyses), the sensitivity can be estimated as the assembled fraction of the genome, about 81.3% or 91.4%. If all the reference genes are within their predictions, the specificity should be 35,633/(>39,083/0.914) = 83.1%.
|Item||EUPAN||Traditional method 1||Traditional method 2|
|Reference||Sun, C. et al. NAR 2017||Sun, C. et al. NAR 2017||Schatz, M. C. et al. Genome Biol 2014|
|Total sequencing depth||19.8x||19.8x||110x|
|Insert size (bp)||450||450||180 & 2,000 & 5,000|
|Specificity||84.7%||similar to Traditional method 2||83.1%|
|Type of false positives||non-random||random||random|
|Source of false positives||gene prediction & gene loss by a small number of mutations||gene prediction||gene prediction|
We then evaluated the accuracy of our method based on the EUPAN result for a Nipponbare accession with a sequencing depth of 19.7X. De novo assembly of this set of sequencing data recovered less than 57% of the whole genome (Traditional method 1 in the table). With EUPAN analysis on >3000 accessions, we detected 41,039 genes present in CX140, including 34,759 Nipponbare reference genes and 6,280 novel genes. The sensitivity is estimated to be 34,759/35,633 = 97.5%. The mapping coverage of the CX140 genome is 98.4%, so we believe we can capture almost all genes for which mapping evidence exists. The specificity can be estimated as 34,759/41,039 = 84.7%, better than traditional pan-genome methods.
However, there is still a significant number of falsely discovered genes. In-depth study suggests that all of these gene regions show high similarity (>90%) to the reference genome, which indicates that the corresponding regions in the reference genome contain no annotated genes while we predict genes on the similar sequences. These may be wrong calls introduced by gene prediction, as in the previous work that predicted 39,083 genes on incomplete sequences. Alternatively, they may be partially attributed to gene loss in the reference due to SNVs and small indels, as illustrated in the following Figure, in which we predicted a gene on a novel sequence with high local similarity to a reference sequence segment, and the SNVs between these two sequences disrupt the gene model. In other words, the Nipponbare genome might have experienced several point mutations that resulted in the loss of the gene, while some other accessions still hold this gene in their genomes. This might be a shortcoming of mapping-based pan-genome studies. Nevertheless, these false positives are still based on gene sequences and are not random, and the sequences of these genes are indeed present in the genome. We therefore conclude that our mapping-based method has relatively good accuracy, with very high sensitivity and reasonable specificity.
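The EUPAN sensitivity and specificity estimates quoted above can be reproduced from the gene counts in the text. The following is a small sketch of that arithmetic only; the variable and function names are ours, and specificity is computed here the same way as in the text (detected reference genes divided by all detected genes).

```python
def ratio_percent(numerator: int, denominator: int) -> float:
    """Ratio expressed as a percentage, rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# Gene counts taken from the text above.
reference_genes_total = 35_633      # annotated Nipponbare reference genes
reference_genes_detected = 34_759   # reference genes detected in CX140
novel_genes_detected = 6_280        # novel genes detected in CX140

genes_detected = reference_genes_detected + novel_genes_detected

# Sensitivity: share of annotated reference genes that were detected.
sensitivity = ratio_percent(reference_genes_detected, reference_genes_total)

# Specificity (as defined in the text): share of detected genes that are
# annotated reference genes.
specificity = ratio_percent(reference_genes_detected, genes_detected)

print(f"genes detected: {genes_detected}")
print(f"sensitivity: {sensitivity}%")
print(f"specificity: {specificity}%")
```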