Creating Reference-Grade Human Genome Assemblies
Tina Graves Lindsay
GRC Workshop at ASHG
Oct 18, 2016
The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
UGT2B17 – Conflicting Alleles
[Figure: chr4 tiling paths and the UGT2B17 alternate locus (Xue Y et al, 2008); a GAP label appears in the figure.]
• NCBI36 NC_000004.10 (chr4) Tiling Path. Components: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7. Genes: TMPRSS11E, TMPRSS11E2
• GRCh37 NC_000004.11 (chr4) Tiling Path. Components: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7. Gene: TMPRSS11E
• GRCh37: NT_167250.1 (UGT2B17 alternate locus). Components: AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7. Gene: TMPRSS11E2
Samples to be Sequenced
Sequencing Plan
Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions
CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)
• CHORI-17 BAC library (P. deJong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5/C3 PacBio data
  • Latest PacBio assembly produced using ~60X of P6/C4 PacBio data
Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-assembly alignments will be generated between each PB assembly and GRCh38
• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
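The heterozygous-call check for haploid genomes can be illustrated with a small sketch. It assumes the calls have already been reduced to (contig, position) pairs; the function name `flag_collapse_windows`, the 10 kb window, the threshold of 5 calls, and the toy positions are all hypothetical choices for illustration, not values or tools from the workshop.

```python
from collections import Counter

def flag_collapse_windows(het_calls, window=10_000, min_calls=5):
    """Return sorted (contig, window_start) pairs holding >= min_calls het calls.

    In a haploid assembly, a cluster of heterozygous calls suggests that two
    or more similar sequences were collapsed into one copy.
    """
    counts = Counter(
        (contig, (pos // window) * window) for contig, pos in het_calls
    )
    return sorted(key for key, n in counts.items() if n >= min_calls)

# Hypothetical heterozygous call positions (contig name, 0-based position).
calls = [
    ("ctg1", 1_200), ("ctg1", 3_400), ("ctg1", 5_600),
    ("ctg1", 7_800), ("ctg1", 9_100),   # five calls in the first 10 kb window
    ("ctg1", 45_000),                   # isolated call
    ("ctg2", 12_000),                   # isolated call
]
flagged = flag_collapse_windows(calls)
print(flagged)
```

Fixed, non-overlapping windows keep the sketch short; a real assessment would likely need sliding windows and a threshold tuned to the read depth and caller.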
Hybrid Scaffolds – PacBio and BioNano

                                                         Seq Assem                                       BN Hybrid
Assembly                 Map                             # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaff N50 (Mb)  Total Size (Gb)
CHM1 (P6) GCA_001297185  MGI CHM1 map (Jason’s version)  3641          26.9             2.99             161             47.6            2.84
CHM1 (P6) GCA_001307025  MGI CHM1 map (Adam’s version)   4850          20.6             2.94             221             40.04           2.82

[Figure: hybrid scaffolds shown with their PacBio contigs and BioNano contigs.]
1q21 Region – GRCh38 vs GCA_001297185
[Figure: GRCh38 and GCA_001297185 with a Seg Dup track; scale bar 1 Megabase.]
1q21 Region – GRCh38 vs GCA_001297185
[Figure: GRCh38 and GCA_001297185 with a Seg Dup track; alignments labeled 99.9+% identity and 99.1% identity.]
CHM1 – Next Steps
• Move forward with improving GCA_001297185
• Based on alignment of BioNano data as well as comparisons to GRCh38, make additional breaks where possible
• Incorporate all finished BACs
• Final alignment to GRCh38 in order to produce chromosome AGPs and submit
Genome Status

Data Source  Origin           Level of Coverage  Status
CHM1         NA               Platinum           Assembly Improvement
CHM13        NA               Platinum           Assembly Assessment
NA19240      Yoruban          Gold               Assessing New Assembly
HG00733      Puerto Rican     Gold               Assembly Assessment
HG00514      Han Chinese      Gold               Assembly Assessment**
NA12878      European         Gold               Assembly Assessment
HG01352      Colombian        Gold               Assembly Assessment
HG02818      Gambian          Gold               Data Generation Underway
HG02059      Kinh Vietnamese  Gold               Data Generation Completed
NA19434      Luhya            Gold               Data Generation Underway

**100x coverage was generated for the Han Chinese sample
First Gold Genome - NA19240
• NA19240 – Yoruban sample
• Generated >70X raw P6/C4 RSII PacBio data

                     Initial Assembly Stats  Latest Assembly Stats
# Seq Contigs        3569                    3242
Max Contig Length    20,393,869 bp           75,769,079 bp
Total Assembly Size  2,745,634,789 bp        2,878,123,324 bp
N50                  6,003,115 bp            23,422,217 bp
N90                  848,151 bp              2,559,914 bp
N95                  345,457 bp              710,070 bp
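For readers less familiar with the N50, N90, and N95 figures above, the following is a minimal sketch of the standard Nx definition: the length of the contig at which the running total, taken from longest to shortest, first reaches x% of the assembly size. The function name `nx` and the toy contig lengths are illustrative assumptions and are not the NA19240 data.

```python
def nx(lengths, percent):
    """Return the Nx contig length for a list of contig lengths.

    Contigs are sorted from longest to shortest; Nx is the length of the
    contig at which the cumulative sum first reaches percent% of the total.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length
    raise ValueError("lengths must be non-empty")

# Toy contig lengths (hypothetical), total = 150.
toy = [10, 50, 30, 20, 40]
print(nx(toy, 50), nx(toy, 90), nx(toy, 95))
```

Because N90 and N95 depend on the shorter contigs, they tend to show whether an improvement in N50 also reaches the tail of the assembly.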
NA19240 BioNano Hybrid and SV Stats

                            Seq Assem                                       BN Hybrid
Assembly                    # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)  Conflicts WGS  Conflicts BN
NA19240 – Initial Assembly  3569          6.01             2.75             421             14.78              2.74             49             60

                Potential mis-assemblies  Breaks made
Conflicts       28                        22
Ends            13                        5
Insertions      5                         2
Translocations  74                        14

Initial curated assembly accession = GCA_001524155.1
Finished BACs Resolve This Region
[Figure: tracks shown are GRCh38, PB Assembly, BAC Alignments, and Seg Dup.]
Assembly Stats

Genome   Total Size    # Contigs  N50 Contig   New Contig N50
NA19240  2.75 Gb       3569       6.0 Mb       23.98 Mb (Jason Chin)
HG00733  2.84 Gb       3715       7.6 Mb
NA12878  2.80 Gb       4412       4.49 Mb      15.17 Mb (at MGI)
HG01352  2.85 Gb       4080       8.22 Mb
HG02818  2.82 Gb       3300       7.24 Mb
HG00514  2.81-2.85 Gb  2808-3669  6.1-10.0 Mb
NA19240 MHC Region
[Figure: NA19240 shown against the Reference and Alts.]
Large Inversion in HG00514
[Figure: inversion in HG00514.]
Han Chinese Data Coverage Comparison
• 100x total coverage generated
• Assemblies were performed at 60X, 80X, and 100X
• Comparisons now underway to determine the optimal coverage needed for a PacBio assembly
[Figure: panels for the 60X, 80X, and 100X assemblies.]
Gap Comparisons
Spanning Reference Gaps
• HG00514 80X assembly
  • Initial assessment had 75 potential gap-spanning contigs
  • On closer inspection, only 32 are real gap-spanning contigs, which span 40 total gaps
• HG00514 all assemblies
  • 37 gaps spanned by all 3 assemblies (60X, 80X, and 100X)
  • 2 additional gaps spanned by 80X and 100X
  • 1 gap spanned by 60X and 80X
  • 1 gap spanned by 60X and 100X
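The overlap counts for the three HG00514 assemblies can be tallied per assembly with a short sketch. The four counts come from the slide; the dictionary layout and the function name `gaps_spanned` are illustrative, and the sketch assumes that no gap was spanned by only a single assembly, since the slide lists none.

```python
# Number of reference gaps spanned by each combination of assemblies
# (counts from the slide; gaps spanned by only one assembly are assumed absent).
shared = {
    frozenset({"60X", "80X", "100X"}): 37,
    frozenset({"80X", "100X"}): 2,
    frozenset({"60X", "80X"}): 1,
    frozenset({"60X", "100X"}): 1,
}

def gaps_spanned(assembly):
    """Total gaps spanned by one assembly, summed over every combination it is in."""
    return sum(n for combo, n in shared.items() if assembly in combo)

per_assembly = {name: gaps_spanned(name) for name in ("60X", "80X", "100X")}
total_distinct = sum(shared.values())
print(per_assembly, total_distinct)
```

Under that assumption the 80X total comes to 37 + 2 + 1 = 40, which agrees with the 40 gaps reported for the 80X assembly.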
True Gap Spanner vs False Gap Spanner
[Figure: an HG00514 contig aligned to GRCh38 with a Seg Dup track, contrasting a true alignment with a false alignment; size labels 7 kb, 3 kb, and 10 kb.]
Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon assemblies for all samples
• Improve those assemblies
  • Identifying misassemblies
  • Making the breaks where needed
  • Scaffolding the assemblies
  • Incorporating BACs as they are finished
• Submit to GenBank
Acknowledgements
• The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
• University of Washington: Evan Eichler
• NCBI: Valerie Schneider
• University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
• BioNano Genomics: Palak Sheth, Alex Hastie
• Pacific Biosciences: Jason Chin, Nick Sisneros
• UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
• NHGRI: Adam Phillippy, Sergey Koren
• 10X Genomics: Deanna Church
• Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath