FAQ

Frequently Asked Questions

EUPAN integrates many independent tools, each of which usually has many user-set parameters. Can these parameters be adjusted/configured by the user through EUPAN?

A: Users can adjust through EUPAN the key parameters that we, the authors, consider important. Please visit the manual page for details. If you think other parameters are also important, please contact the authors.

Can EUPAN be used to construct a pan-genome of all eukaryotes, or of a group of distantly related species?

A: The “pan-genome” concept itself has no such limitation. For example, researchers have even studied the pan-genome of all bacteria or eubacteria with available genome data. Such studies resemble ortholog analysis: they study gene families rather than individual genes. The same holds for eukaryotes. However, the “map-to-pan” strategy does have limitations: it can only be used for closely related individuals, not for all eukaryotes. This limitation comes from both the use of representative sequences and the limits of short-read mapping. We therefore recommend that users study individuals within the same species or very closely related species (like rice and wild rice) with EUPAN.

Can EUPAN be used for bacterial pan-genome analysis?

A: EUPAN can be used for bacteria, but we recommend that users study strains within the same species or very closely related species. The most widely used pan-genome strategies for bacteria, and their limitations for eukaryotes, are described on the homepage. Almost all pan-genome analysis pipelines for bacteria depend on complete genome sequences of the individuals, which are difficult and very expensive to obtain for eukaryotes. For bacterial pan-genome analysis, you may also want to try the following tools:

Tool	Type	Website	Year
EDGAR	Web-based	https://edgar.computational.bio.uni-giessen	2009
Panseq	Web-based	http://76.70.11.198/pans	2010
PGAT	Web-based	http://nwrce.org/pgat	2011
PGAP	Stand-alone	http://pgap.sf.net	2012
PanFunPro	Stand-alone	http://zenodo.org/record/7583#.VxhwxEc0h-U	2013
PanGP	Stand-alone	http://PanGP.big.ac.cn	2014
ITEP	Stand-alone	https://price.systemsbiology.net/itep	2014
BPGA	Stand-alone	http://www.iicb.res.in/bpga/index.html	2016

The “map-to-pan” strategy is effective but may force the genome of a strain to resemble the reference. What is the effect on subsequent analyses?

A: “Resemble the reference” actually means the following: if gene A in a given genome/strain X is very similar to, but different from, a reference gene B (for example, 99% of bases identical), EUPAN will report that gene B, rather than gene A, exists in strain X. In our opinion, this is an unavoidable consequence of eliminating redundant sequences, similar to the sequence processing in the NT/NR databases. Keeping all sequences of the same gene, or of a group of similar genes, would require assembling every sequence of this gene group successfully in all individuals. This is the traditional method presented in the figure on the homepage.

In the “map-to-pan” strategy, the first step is to construct a pan-genome, that is, to reduce sequence redundancy. The key question in this step is which sequence to choose as the representative of a group of very similar sequences. The current strategy in EUPAN is to combine the reference genome with non-redundant novel sequences from de novo assemblies. This strategy has two advantages: 1) the gene PAV results can be used directly by other scientists, because the information on reference genes (id, sequence, function and other annotation) is widely used; and 2) the pan-genome sequence is not very “fragmented” and is easier to integrate into a genome browser. The disadvantage is that reference genes are always selected as the representatives of their gene groups.

Importantly, in most cases the PAV of a gene or gene group is not affected by which sequence is chosen as the representative. This follows from the parameter (mapping coverage of the gene) used to determine presence/absence by read mapping. In other words, the choice has little effect on subsequent analyses.

Moreover, EUPAN is a flexible toolkit: users can build the pan-genome sequence with other strategies and then feed the resulting pan-genome sequences back into the pipeline. For example, one can build pan-genome sequences by re-assembling the de novo assembled contigs, which introduces no bias towards the reference. The limitations of this strategy are that (1) there will be chimeric sequences and (2) the pan-genome is very “fragmented”. At this stage, we prefer the first strategy.
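To illustrate why the choice of representative rarely changes a PAV call, here is a minimal sketch of a coverage-based presence test. It is not EUPAN's actual code: the function names, the 0.95 coverage cutoff and the depth cutoff of 1 are hypothetical values chosen only for illustration. Reads from a gene that is 99% identical to its representative will still map to and cover most of that representative, so the call stays "present".

```python
def coverage_fraction(depths, min_depth=1):
    """Fraction of gene positions whose read depth is at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def call_presence(depths, min_coverage=0.95, min_depth=1):
    """Call a gene present when enough of its length is covered by mapped reads.

    min_coverage and min_depth are illustrative assumptions, not EUPAN defaults.
    """
    return coverage_fraction(depths, min_depth) >= min_coverage


# Hypothetical per-base depths over a 1,000-bp representative gene.
near_identical_gene = [12] * 990 + [0] * 10   # reads cover 99% of the gene
mostly_missing_gene = [0] * 900 + [3] * 100   # reads cover only 10% of the gene

print(call_presence(near_identical_gene))  # True
print(call_presence(mostly_missing_gene))  # False
```

With a cutoff of this kind, small sequence differences between a strain's gene and the representative only matter if they prevent reads from mapping at all.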

It is recommended to run EUPAN on a computer cluster. What is the CPU/memory usage?

A: For most steps, EUPAN divides the whole task into many jobs, and each job should be run on a single node with multiple CPUs. Users can select the number of CPUs for a single job via the -t option. You should know the basic specifications of the computer cluster you use, for example the number of CPUs and the memory size of each node. Please choose a CPU number per job that divides the CPU number of a node evenly. For example, if each node has 16 CPUs, assign each job 2, 4, 8 or 16 CPUs rather than 5, 7 or 9 CPUs. We recommend using 4 CPUs for most steps.
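The rule of thumb above (pick a CPU number that divides the node's CPU count evenly) can be sketched as follows. The function name is hypothetical and the snippet is only a helper for choosing a value to pass to -t; it is not part of EUPAN.

```python
def balanced_thread_counts(cpus_per_node):
    """Return the per-job CPU counts that fill a node with no idle CPUs left over."""
    return [t for t in range(1, cpus_per_node + 1) if cpus_per_node % t == 0]


# For a 16-CPU node, 5, 7 or 9 CPUs per job would leave CPUs unused.
print(balanced_thread_counts(16))  # [1, 2, 4, 8, 16]
```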

Also, please take memory into account, especially for de novo assemblies. Running more jobs at once (with fewer CPUs per job) means that you need more memory. 32 GB of memory is enough for most steps, except the assembly step. The memory needed to assemble one genome depends on both the genome size and the sequencing depth, so you should test the memory use on one sample first. For example, a 400 Mb genome sequenced at 15X depth needs about 50 GB of memory.
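As a rough sketch of this trade-off, the number of jobs that can run on one node at the same time is limited by whichever runs out first, CPUs or memory. The function name and the 128 GB node size are assumptions made for illustration; the 50 GB and 32 GB per-job figures are the ones quoted above.

```python
def max_concurrent_jobs(cpus_per_node, mem_per_node_gb, threads_per_job, mem_per_job_gb):
    """Jobs that fit on one node at once, limited by both CPUs and memory."""
    by_cpu = cpus_per_node // threads_per_job
    by_mem = int(mem_per_node_gb // mem_per_job_gb)
    return min(by_cpu, by_mem)


# Hypothetical node: 16 CPUs and 128 GB of memory, 4 CPUs per job.
print(max_concurrent_jobs(16, 128, 4, 50))  # 2 (assembly jobs are memory-bound)
print(max_concurrent_jobs(16, 128, 4, 32))  # 4 (other steps are CPU-bound)
```

In this example, using fewer CPUs per assembly job would not allow more assemblies per node, because memory is already the limiting resource.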

EUPAN is the only tool enabling gene PAV-based pan-genome analysis for a large number of eukaryotic individuals, but how accurate is it?

A: The accuracy of EUPAN can be evaluated at the level of gene presence-absence detection. To give users an intuitive picture, we compared the rice gene PAVs detected by EUPAN with a recent rice pan-genome study that used three representative rice accessions, Nipponbare, DJ123 and IR64 (Schatz, M. C. et al. Genome Biol 2014; Traditional method 2 in the following table). In that study, the genomes were assembled from deep sequencing data from multiple sequencing libraries, with a total sequencing depth of ~110x. The assemblies covered 81.3%~82.5% of each genome in non-N bases (81.8% for Nipponbare), or 88.5%~91.4% (91.4% for Nipponbare) if N bases are counted. The study predicted 39,083 genes for the Nipponbare genome. Scaled up, the total gene number would be 42,852~47,779 if the whole genome were assembled and the gene density stayed the same (the remaining sequence probably contains fewer genes, but this serves as a rough estimate). Our gold-standard annotation of the Nipponbare genome contains 35,633 genes. Assuming that all genes on the assembled sequences are correctly predicted (we also used the reference annotations directly in our analyses), the sensitivity can be estimated as the assembled fraction of the genome, about 81.3%~91.4%. If all the reference genes are among their predictions, the specificity is at most 35,633/42,852=83.1%, taking the lower estimate of the adjusted total gene number.

	EUPAN	Traditional method 1	Traditional method 2
Reference	Sun, C. et al. NAR 2017	Sun, C. et al. NAR 2017	Schatz, M. C. et al. Genome Biol 2014
Total sequencing depth	19.8x	19.8x	110x
Insert size	450 bp	450 bp	180 bp, 2 kb & 5 kb
Sensitivity	97.5%	57%	81.3~91.4%
Specificity	84.7%	similar to Traditional method 2	83.1%
Type of false positives	non-random	random	random
Source of false positives	gene prediction & gene loss by a small number of mutations	gene prediction	gene prediction

We then evaluated the accuracy of our method based on the EUPAN result for a Nipponbare accession (CX140) with a sequencing depth of 19.7X. De novo assembly of this sequencing data recovered less than 57% of the whole genome (Traditional method 1 in the table). With EUPAN analysis of >3000 accessions, we detected 41,039 genes present in CX140, including 34,759 Nipponbare reference genes and 6,280 novel genes. The sensitivity is therefore estimated as 34,759/35,633=97.5%. The mapping coverage of the CX140 genome is 98.4%, so we believe we capture almost all genes for which mapping evidence exists. The specificity can be estimated as 34,759/41,039=84.7%, better than that of traditional pan-genome methods.

However, there is still a significant number of falsely discovered genes. In-depth analysis suggests that all of these gene regions show high similarity (>90%) to the reference genome: the corresponding regions in the reference genome are annotated with no genes, but we predicted genes on the similar novel sequences. Some of these may be errors introduced by gene prediction, just as the previous work predicted 39,083 genes on incomplete sequences. Others may be partially attributed to gene loss in the reference due to SNVs and small indels, as illustrated in the following Figure, in which we predicted a gene on a novel sequence with high local similarity to a reference sequence segment, and the SNVs between the two sequences disrupt the gene model. That is, the Nipponbare genome might have experienced several point mutations that resulted in the loss of the gene, while some other accessions still hold this gene in their genomes. This may be a shortcoming of mapping-based pan-genome studies. Nevertheless, these false positives are still based on gene sequences and are not random, and the sequences of these genes are indeed present in the genome. We therefore conclude that our mapping-based method has relatively good accuracy, with very high sensitivity and reasonable specificity.
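The sensitivity and specificity estimates for EUPAN quoted above can be reproduced with a few lines of arithmetic. The variable names are ours; the gene counts are the ones given in the text, and "specificity" follows the definition used here (the fraction of detected genes that are reference genes).

```python
# Gene counts quoted in the text for the Nipponbare accession CX140.
reference_genes_total = 35_633     # annotated genes in the Nipponbare reference
reference_genes_detected = 34_759  # reference genes called present by EUPAN
novel_genes_detected = 6_280       # novel genes called present by EUPAN

total_detected = reference_genes_detected + novel_genes_detected

# Sensitivity: share of known reference genes that were detected.
sensitivity = reference_genes_detected / reference_genes_total
# Specificity as used in this FAQ: share of detected genes that are reference genes.
specificity = reference_genes_detected / total_detected

print(total_detected)        # 41039
print(f"{sensitivity:.1%}")  # 97.5%
print(f"{specificity:.1%}")  # 84.7%
```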