Long-read sequencing of 111 rice genomes reveals significantly larger pan-genomes
Introduction
Rice is one of the most important crops for humans. Third-generation sequencing (TGS), also called long-read sequencing (LRS), helps us assemble more high-quality genomes and construct more complete pan-genomes. Here we provide the code and result data for the article "Long-read sequencing of 111 rice genomes reveals significantly larger pan-genomes".
Code
The main pipelines and custom scripts have been uploaded to GitHub.
Link: https://github.com/SJTU-CGM/TGSRICEPAN
Result data
Genome sequences (contigs and scaffolds) of newly sequenced rice accessions
- Polished contigs: The trimmed long reads were corrected and assembled using NextDenovo. The contigs were first polished with long reads using Racon and Medaka, and then polished for one round with short reads using Pilon.
- Chromosome-level scaffolds: Guided by the Nipponbare reference genome (NipRG), misassemblies in the contigs were corrected using “ragtag.py correct”, and chromosome-level scaffolds were then obtained.
Genome gap-filling
- Format of the orignial/updated_scaff_infos data (results from TGS-GapCloser):
  - column 1: ctg_id (chromosomes are split into contigs at "N" gaps)
  - column 2: strand
  - column 3: fill_len (fill length)
  - column 4: ctg_len (contig length)
  - column 5: chr_start (chromosome start of this contig)
  - column 6: ctg_chr_id (contig chr_id in the chromosome)
  - column 7: chr_id (chromosome id)
  - column 8: fill_sequences (optional; missing in orignial_scaff_infos)
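The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the released pipeline: it assumes the tables are tab-separated with no header line and that fill_len, ctg_len and chr_start are integers. The names `ScaffInfo`, `read_scaff_infos` and `total_filled_bases` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class ScaffInfo(NamedTuple):
    """One row of a scaff_infos table (columns 1-8 as listed above)."""
    ctg_id: str
    strand: str
    fill_len: int
    ctg_len: int
    chr_start: int
    ctg_chr_id: str
    chr_id: str
    fill_sequences: Optional[str]  # column 8 is optional


def read_scaff_infos(lines: Iterable[str]) -> Iterator[ScaffInfo]:
    """Parse rows; assumes tab-separated text without a header (an assumption)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Column 8 may be absent, so fall back to None.
        fill_seq = cols[7] if len(cols) > 7 and cols[7] else None
        yield ScaffInfo(cols[0], cols[1], int(cols[2]), int(cols[3]),
                        int(cols[4]), cols[5], cols[6], fill_seq)


def total_filled_bases(records: Iterable[ScaffInfo]) -> Dict[str, int]:
    """Sum fill_len per chromosome (chr_id)."""
    totals: Dict[str, int] = {}
    for rec in records:
        totals[rec.chr_id] = totals.get(rec.chr_id, 0) + rec.fill_len
    return totals
```

Passing an open text file handle to `read_scaff_infos` and the result to `total_filled_bases` gives the number of filled bases per chromosome.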
SV
Merged structural variations (SVs) across all samples (VCF format): merged_SURVIVOR_111samples0.vcf.gz
Merged SVs across all samples (BEDPE format): merged_SURVIVOR_111samples0.bedpe.gz
- We used minimap2 for mapping and Sniffles for detecting SVs. The SVs were then filtered and merged with SURVIVOR.
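As a quick check of the merged VCF, the SVs can be tallied by type. The sketch below uses only the Python standard library and assumes each record carries an `SVTYPE` key in its INFO column, as the VCF specification defines for structural variants. Records without it are counted as "UNKNOWN". The function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_sv_types(lines: Iterable[str]) -> Counter:
    """Count VCF records per SVTYPE value found in the INFO column."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
        info = dict(item.split("=", 1)
                    for item in fields[7].split(";") if "=" in item)
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


def count_sv_types_in_file(path: str) -> Counter:
    """Open a gzip-compressed VCF as text and count its SV types."""
    with gzip.open(path, "rt") as handle:
        return count_sv_types(handle)
```

For example, `count_sv_types_in_file("merged_SURVIVOR_111samples0.vcf.gz")` would return one count per SV type present in the file.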
Pan-genome
Sequences and annotations of pan-genome (Nipponbare + Novel)
- Pan-genome sequences of 111 rice accessions (Nipponbare + elongated novel sequences, elongated to make novel genes complete): CW113.fa.gz
- Pan-genome annotations of 111 rice accessions (Nipponbare + elongated novel sequences, elongated to make novel genes complete): CW113.gff.gz
- Pan-genome CDS sequences of 111 rice accessions (MSU7 + novel): CW113_cds.fa.gz
- Pan-genome nonredundant protein sequences of 111 rice accessions (MSU7 + novel): pan_nonredundant.pep.gz
- Pan-genome redundant protein sequences of 111 rice accessions (MSU7 + all predicted): pan_redundant.pep.gz
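After downloading, the sequence files can be summarized by counting records and total residues. This is a minimal sketch that assumes the files are gzip-compressed FASTA, which the file names suggest but this page does not state for the .pep.gz files. `fasta_stats` and `fasta_stats_from_gz` are hypothetical helper names.

```python
import gzip
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence length) for FASTA text."""
    n_records = 0
    total_len = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1          # header line starts a new record
        else:
            total_len += len(line)  # sequence line
    return n_records, total_len


def fasta_stats_from_gz(path: str) -> Tuple[int, int]:
    """Same summary for a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return fasta_stats(handle)
```

Calling `fasta_stats_from_gz("CW113.fa.gz")` would report the number of sequences and the total size of the pan-genome in bases.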
Novel sequences of pan-genome
- Raw novel sequences
  - Novel sequence information: idtable.txt.gz
  - Novel sequences (raw, blocks from assemblies): novelctg_rename.fa.gz
- Elongated novel sequences
  - Elongation information for novel sequences: elongation.txt.gz
  - Novel sequences (elongated): elongated.replace.fa.gz
- Shortened novel sequences
  - Shortening information for novel sequences: shorten.txt.gz
  - Novel sequences (shortened): shortened.fa.smallletter.gz
- Novel protein-coding transcripts and proteins
  - Novel gene protein IDs: novel_pep.id
  - Novel gene CDS sequences: novel_cds.fa.gz
  - Novel gene protein sequences: novel_pep.fa.gz
- Gene family information: familygroups.txt.gz
PAV
Presence-absence variation (PAV) matrices for genes and gene families
- Gene PAVs
  - Gene PAVs in Nanopore (ONT) samples: ont_pav.tsv.gz
  - Gene PAVs in PacBio (PB) samples: pb_pav.tsv.gz
  - Gene PAVs in Illumina (NGS) samples: ngs_pav.tsv.gz
- Gene family PAVs
  - Gene family PAVs in Nanopore (ONT) samples: ont_Fampav.tsv.gz
  - Gene family PAVs in PacBio (PB) samples: pb_Fampav.tsv.gz
  - Gene family PAVs in Illumina (NGS) samples: ngs_Fampav.tsv.gz
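A PAV matrix can be used to separate genes present in every sample from those missing in at least one. The sketch below is only an illustration and rests on assumptions that this page does not confirm: the .tsv.gz files are tab-separated, the first row holds sample names, the first column holds the gene (or gene family) ID, and presence is coded as "1". `classify_pav` is a hypothetical helper name.

```python
import gzip
from typing import Dict, Iterable, List


def classify_pav(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Split IDs into 'core' (present in all samples) and 'distributed'.

    Assumed layout: header row of sample names, then one row per gene
    with the ID in column 1 and "1"/"0" presence calls per sample.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    n_samples = len(header) - 1
    result: Dict[str, List[str]] = {"core": [], "distributed": []}
    for line in rows:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        gene_id, calls = fields[0], fields[1:]
        n_present = sum(1 for call in calls if call == "1")
        key = "core" if n_present == n_samples else "distributed"
        result[key].append(gene_id)
    return result


def classify_pav_file(path: str) -> Dict[str, List[str]]:
    """Apply classify_pav to a gzip-compressed TSV file."""
    with gzip.open(path, "rt") as handle:
        return classify_pav(handle)
```

If the actual matrices use a different layout or presence coding, the header handling and the `call == "1"` test need to be adjusted accordingly.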
Contact Information
Hongzhang Xue: xuehzh95@sjtu.edu.cn
Chaochun Wei: ccwei@sjtu.edu.cn
Copyright © 2022 The laboratory of computational genomics and metagenomics in Shanghai Jiao Tong University. All Rights Reserved.