Long-read sequencing of 111 rice genomes reveals significantly larger pan-genomes

Introduction

Rice is one of the most important crops for humans. Third-generation sequencing (TGS), also called long-read sequencing (LRS), makes it possible to assemble more high-quality genomes and construct more complete pan-genomes. This page provides the code and result data for the article "Long-read sequencing of 111 rice genomes reveals significantly larger pan-genomes".

Code

The main pipelines and custom scripts have been uploaded to GitHub.

Link: https://github.com/SJTU-CGM/TGSRICEPAN

Result data

Genome sequences (contigs and scaffolds) of newly sequenced rice accessions

Sample Polished contigs Chromosome-level scaffolds
QUAN QUAN_pilon.fa.gz QUAN.fixed.fa.gz
TG1 TG1_pilon.fa.gz TG1.fixed.fa.gz
TG10 TG10_pilon.fa.gz TG10.fixed.fa.gz
TG11 TG11_pilon.fa.gz TG11.fixed.fa.gz
TG12 TG12_pilon.fa.gz TG12.fixed.fa.gz
TG13 TG13_pilon.fa.gz TG13.fixed.fa.gz
TG14 TG14_pilon.fa.gz TG14.fixed.fa.gz
TG15 TG15_pilon.fa.gz TG15.fixed.fa.gz
TG16 TG16_pilon.fa.gz TG16.fixed.fa.gz
TG17 TG17_pilon.fa.gz TG17.fixed.fa.gz
TG18 TG18_pilon.fa.gz TG18.fixed.fa.gz
TG19 TG19_pilon.fa.gz TG19.fixed.fa.gz
TG2 TG2_pilon.fa.gz TG2.fixed.fa.gz
TG21 TG21_pilon.fa.gz TG21.fixed.fa.gz
TG22 TG22_pilon.fa.gz TG22.fixed.fa.gz
TG24 TG24_pilon.fa.gz TG24.fixed.fa.gz
TG27 TG27_pilon.fa.gz TG27.fixed.fa.gz
TG28 TG28_pilon.fa.gz TG28.fixed.fa.gz
TG29 TG29_pilon.fa.gz TG29.fixed.fa.gz
TG3 TG3_pilon.fa.gz TG3.fixed.fa.gz
TG30 TG30_pilon.fa.gz TG30.fixed.fa.gz
TG31 TG31_pilon.fa.gz TG31.fixed.fa.gz
TG32 TG32_pilon.fa.gz TG32.fixed.fa.gz
TG33 TG33_pilon.fa.gz TG33.fixed.fa.gz
TG34 TG34_pilon.fa.gz TG34.fixed.fa.gz
TG4 TG4_pilon.fa.gz TG4.fixed.fa.gz
TG42 TG42_pilon.fa.gz TG42.fixed.fa.gz
TG43 TG43_pilon.fa.gz TG43.fixed.fa.gz
TG45 TG45_pilon.fa.gz TG45.fixed.fa.gz
TG46 TG46_pilon.fa.gz TG46.fixed.fa.gz
TG49 TG49_pilon.fa.gz TG49.fixed.fa.gz
TG5 TG5_pilon.fa.gz TG5.fixed.fa.gz
TG50 TG50_pilon.fa.gz TG50.fixed.fa.gz
TG51 TG51_pilon.fa.gz TG51.fixed.fa.gz
TG52 TG52_pilon.fa.gz TG52.fixed.fa.gz
TG53 TG53_pilon.fa.gz TG53.fixed.fa.gz
TG54 TG54_pilon.fa.gz TG54.fixed.fa.gz
TG55 TG55_pilon.fa.gz TG55.fixed.fa.gz
TG56 TG56_pilon.fa.gz TG56.fixed.fa.gz
TG58 TG58_pilon.fa.gz TG58.fixed.fa.gz
TG59 TG59_pilon.fa.gz TG59.fixed.fa.gz
TG6 TG6_pilon.fa.gz TG6.fixed.fa.gz
TG60 TG60_pilon.fa.gz TG60.fixed.fa.gz
TG61 TG61_pilon.fa.gz TG61.fixed.fa.gz
TG62 TG62_pilon.fa.gz TG62.fixed.fa.gz
TG63 TG63_pilon.fa.gz TG63.fixed.fa.gz
TG64 TG64_pilon.fa.gz TG64.fixed.fa.gz
TG65 TG65_pilon.fa.gz TG65.fixed.fa.gz
TG68 TG68_pilon.fa.gz TG68.fixed.fa.gz
TG7 TG7_pilon.fa.gz TG7.fixed.fa.gz
TG70 TG70_pilon.fa.gz TG70.fixed.fa.gz
TG75 TG75_pilon.fa.gz TG75.fixed.fa.gz
TG76 TG76_pilon.fa.gz TG76.fixed.fa.gz
TG77 TG77_pilon.fa.gz TG77.fixed.fa.gz
TG78 TG78_pilon.fa.gz TG78.fixed.fa.gz
TG8 TG8_pilon.fa.gz TG8.fixed.fa.gz
TG80 TG80_pilon.fa.gz TG80.fixed.fa.gz
TG81 TG81_pilon.fa.gz TG81.fixed.fa.gz
TG82 TG82_pilon.fa.gz TG82.fixed.fa.gz
TG83 TG83_pilon.fa.gz TG83.fixed.fa.gz
TG84 TG84_pilon.fa.gz TG84.fixed.fa.gz
TG85 TG85_pilon.fa.gz TG85.fixed.fa.gz
TG86 TG86_pilon.fa.gz TG86.fixed.fa.gz
TG87 TG87_pilon.fa.gz TG87.fixed.fa.gz
TG88 TG88_pilon.fa.gz TG88.fixed.fa.gz
TG9 TG9_pilon.fa.gz TG9.fixed.fa.gz
TG90 TG90_pilon.fa.gz TG90.fixed.fa.gz
WSSM WSSM_pilon.fa.gz WSSM.fixed.fa.gz
WW8 WW8_pilon.fa.gz WW8.fixed.fa.gz
wild111 wild111_pilon.fa.gz wild111.fixed.fa.gz
wild12 wild12_pilon.fa.gz wild12.fixed.fa.gz
wild131 wild131_pilon.fa.gz wild131.fixed.fa.gz
wild219 wild219_pilon.fa.gz wild219.fixed.fa.gz
wild273 wild273_pilon.fa.gz wild273.fixed.fa.gz
wild65 wild65_pilon.fa.gz wild65.fixed.fa.gz
SE-3 SE-3_pilon.fa.gz SE-3.fixed.fa.gz
SE-19 SE-19_pilon.fa.gz SE-19.fixed.fa.gz
SE-33 SE-33_pilon.fa.gz SE-33.fixed.fa.gz
SE-134 SE-134_pilon.fa.gz SE-134.fixed.fa.gz
H7L1 H7L1_pilon.fa.gz H7L1.fixed.fa.gz
H7L26 H7L26_pilon.fa.gz H7L26.fixed.fa.gz
H7L27 H7L27_pilon.fa.gz H7L27.fixed.fa.gz
H7L28 H7L28_pilon.fa.gz H7L28.fixed.fa.gz
H7L29 H7L29_pilon.fa.gz H7L29.fixed.fa.gz
H7L30 H7L30_pilon.fa.gz H7L30.fixed.fa.gz
H7L31 H7L31_pilon.fa.gz H7L31.fixed.fa.gz
H7L32 H7L32_pilon.fa.gz H7L32.fixed.fa.gz
H7L33 H7L33_pilon.fa.gz H7L33.fixed.fa.gz
  • Polished contigs: The trimmed long reads were corrected and assembled using NextDenovo. The contigs were first polished with the long reads using Racon and Medaka, then polished for one round with short reads using Pilon.
  • Chromosome-level scaffolds: Guided by the Nipponbare reference genome (NipRG), misassemblies in the contigs were corrected using “ragtag.py correct”, and chromosome-level scaffolds were obtained.
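
As a quick sanity check after downloading one of the files above, the following minimal Python sketch reads a FASTA file (plain or gzip-compressed) and reports the number of sequences, total length, and N50. It is not part of the published pipeline; the function names are illustrative, and the only assumption about the data is that the files are standard FASTA.

```python
import gzip
from typing import Dict, Iterable


def read_fasta_lengths(path: str) -> Dict[str, int]:
    """Return {sequence_id: length} for a FASTA file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths: Dict[str, int] = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the sequence ID.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Example usage (the file name is taken from the table above):
# lengths = read_fasta_lengths("TG1_pilon.fa.gz")
# print(len(lengths), sum(lengths.values()), n50(lengths.values()))
```

Running the same check on the polished contigs and on the chromosome-level scaffolds of one sample gives a simple way to see the effect of scaffolding on contiguity.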

Genome gap-filling

Sample Subpopulation N_location_info Split_N_info Raw_fasta Filled_fasta Corrected_reads_filled_info Polished_contigs_filled_info
NATELBORO cA cA_NATELBORO.chrs.N.bed cA_NATELBORO.N_infos NATELBORO.chrs.fa.gz NATELBORO.chrs.min.fa.gz - cA_NATELBORO.updated_infos
ARC10497 cB cB_ARC10497.chrs.N.bed cB_ARC10497.N_infos ARC10497.chrs.fa.gz ARC10497.chrs.min.fa.gz cB_ARC10497.corrected_read_infos cB_ARC10497.updated_infos
Nipponbare (MSU7) GJ GJ_Nipponbare.chrs.N.bed GJ_Nipponbare.N_infos Nipponbare.chrs.fa.gz Nipponbare.chrs.min.fa.gz GJ_Nipponbare.corrected_read_infos GJ_Nipponbare.updated_infos
CHAOMEO GJsbtrp GJsbtrp_CHAOMEO.chrs.N.bed GJsbtrp_CHAOMEO.N_infos CHAOMEO.chrs.fa.gz CHAOMEO.chrs.min.fa.gz - GJsbtrp_CHAOMEO.updated_infos
TG22 GJtmp GJtmp_TG22.chrs.N.bed GJtmp_TG22.N_infos TG22.chrs.fa.gz TG22.chrs.min.fa.gz GJtmp_TG22.corrected_read_infos GJtmp_TG22.updated_infos
KETANNANGKA GJtrp GJtrp_KETANNANGKA.chrs.N.bed GJtrp_KETANNANGKA.N_infos KETANNANGKA.chrs.fa.gz KETANNANGKA.chrs.min.fa.gz - GJtrp_KETANNANGKA.updated_infos
PR106 XI1B XI1B_PR106.chrs.N.bed XI1B_PR106.N_infos PR106.chrs.fa.gz PR106.chrs.min.fa.gz - XI1B_PR106.updated_infos
LARHAMUGAD XI2 XI2_LARHAMUGAD.chrs.N.bed XI2_LARHAMUGAD.N_infos LARHAMUGAD.chrs.fa.gz LARHAMUGAD.chrs.min.fa.gz - XI2_LARHAMUGAD.updated_infos
LIMA XI3 XI3_LIMA.chrs.N.bed XI3_LIMA.N_infos LIMA.chrs.fa.gz LIMA.chrs.min.fa.gz - XI3_LIMA.updated_infos
  • Format of the orignial/updated_scaff_infos files (results from TGS-GapCloser):
  1. column 1: ctg_id (chromosomes are split into contigs at "N" gaps)
  2. column 2: strand
  3. column 3: fill_len (fill length)
  4. column 4: ctg_len (contig length)
  5. column 5: chr_start (chromosome start of this contig)
  6. column 6: ctg_chr_id (contig chr_id in the chromosome)
  7. column 7: chr_id (chromosome id)
  8. column 8: fill_sequences (optional, missing in orignial_scaff_infos)
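
The column layout above can be read with a short Python sketch like the one below. It assumes that each record is one whitespace-delimited line with 7 or 8 fields and that fill_len, ctg_len, and chr_start are integers; neither assumption is stated on this page, so check a few lines of the real files first. The names ContigRecord, parse_info_line, and filled_length_by_chromosome are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class ContigRecord(NamedTuple):
    ctg_id: str
    strand: str
    fill_len: int
    ctg_len: int
    chr_start: int
    ctg_chr_id: str
    chr_id: str
    fill_sequence: Optional[str]  # column 8 is optional


def parse_info_line(line: str) -> ContigRecord:
    """Parse one record; assumes whitespace-delimited columns (an assumption)."""
    fields = line.split()
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 columns, got {len(fields)}")
    return ContigRecord(
        ctg_id=fields[0],
        strand=fields[1],
        fill_len=int(fields[2]),
        ctg_len=int(fields[3]),
        chr_start=int(fields[4]),
        ctg_chr_id=fields[5],
        chr_id=fields[6],
        fill_sequence=fields[7] if len(fields) == 8 else None,
    )


def filled_length_by_chromosome(lines: Iterable[str]) -> Dict[str, int]:
    """Sum fill_len per chr_id, skipping blank lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        if line.strip():
            record = parse_info_line(line)
            totals[record.chr_id] += record.fill_len
    return dict(totals)


# Example usage (file name taken from the table above):
# with open("GJtmp_TG22.updated_infos") as handle:
#     print(filled_length_by_chromosome(handle))
```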

SV

Merged structural variations (SVs) found in all samples (VCF format): merged_SURVIVOR_111samples0.vcf.gz

Merged SVs found in all samples (BEDPE format): merged_SURVIVOR_111samples0.bedpe.gz

  • We used minimap2 for mapping and Sniffles for detecting SVs. The SVs were filtered and merged with SURVIVOR.
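
To get a first overview of the merged VCF, a minimal Python sketch such as the following counts records by SV type. It relies only on the standard VCF layout (tab-delimited, INFO in the eighth column) and on the reserved SVTYPE INFO key; records without that key are counted as "unknown". The function name count_sv_types is illustrative and not part of the published scripts.

```python
import gzip
from collections import Counter


def count_sv_types(path: str) -> Counter:
    """Count VCF records by the SVTYPE value in the INFO column."""
    opener = gzip.open if path.endswith(".gz") else open
    counts: Counter = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            sv_type = "unknown"
            for entry in fields[7].split(";"):
                if entry.startswith("SVTYPE="):
                    sv_type = entry[len("SVTYPE="):]
                    break
            counts[sv_type] += 1
    return counts


# Example usage:
# print(count_sv_types("merged_SURVIVOR_111samples0.vcf.gz"))
```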

Pan-genome

Sequences and annotations of pan-genome (Nipponbare + Novel)

Novel sequences of pan-genome

  1. Raw novel sequences
  2. Elongated novel sequences
  3. Shortened novel sequences
  4. Novel protein-coding transcripts and proteins

PAV

Gene and gene family presence-absence variations (PAVs) matrix

  1. Gene PAVs
  2. Gene family PAVs
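
A PAV matrix is typically summarized by separating entries present in every sample from those absent in at least one. The sketch below shows one way to do this in Python. The file layout is an assumption, since this page does not describe it: a tab-delimited table whose header row holds the sample names, whose first column holds gene (or gene family) IDs, and whose cells are 1 (present) or 0 (absent). The function names are illustrative.

```python
import csv
from typing import Dict, List, Tuple


def load_pav_matrix(path: str) -> Tuple[List[str], Dict[str, List[int]]]:
    """Load a tab-delimited 0/1 matrix (assumed layout: header row + ID column)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        matrix = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return samples, matrix


def split_by_presence(matrix: Dict[str, List[int]]) -> Tuple[List[str], List[str]]:
    """Return (IDs present in every sample, IDs absent in at least one sample)."""
    in_all = sorted(gene for gene, calls in matrix.items() if all(calls))
    variable = sorted(gene for gene, calls in matrix.items() if not all(calls))
    return in_all, variable


# Example usage (hypothetical file name):
# samples, matrix = load_pav_matrix("gene_pav.tsv")
# in_all, variable = split_by_presence(matrix)
# print(len(samples), len(in_all), len(variable))
```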

Contact Information

Hongzhang Xue: xuehzh95@sjtu.edu.cn
Chaochun Wei: ccwei@sjtu.edu.cn


Copyright © 2022 The laboratory of computational genomics and metagenomics in Shanghai Jiao Tong University. All Rights Reserved.