
Creating Reference-Grade Human Genome Assemblies
Tina Graves Lindsay
GRC Workshop at ASHG
Oct 18, 2016

The Human Reference is a Work in Progress!

• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.

• GRCh38 is composed of DNA from several individual humans.

• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.

• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.

• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.

UGT2B17 – Conflicting Alleles

[Figure: chr4 tiling paths across the UGT2B17 region (Xue Y et al, 2008); a GAP is marked in the figure]

• NCBI36 NC_000004.10 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes shown: TMPRSS11E, TMPRSS11E2)
• GRCh37 NC_000004.11 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene shown: TMPRSS11E)
• GRCh37: NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene shown: TMPRSS11E2)

Samples to be Sequenced

Sequencing Plan

Definitions of Genome Level

• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
  • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
  • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions

CHM1: A Key Resource for Improving the Reference

• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)
• CHORI-17 BAC library (P. deJong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5/C3 PacBio data
  • Latest PacBio assembly produced using ~60X of P6/C4 PacBio data

Assembly Assessment Methods

• Assemblies will be run through the NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
• Assembly-assembly alignments will be generated between each PB assembly and GRCh38
• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
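The last check above can be illustrated with a small sketch. In a haploid source such as CHM1, reads aligned back to a correct assembly should yield essentially no heterozygous calls, so a dense cluster of them hints that two similar sequences were collapsed into one. The function name, the input shape (contig, position pairs), and both thresholds below are assumptions made for illustration; the talk does not specify how such clusters were identified.

```python
from collections import defaultdict

def flag_collapse_candidates(het_calls, window=10_000, min_calls=5):
    """Return (contig, start, end) intervals dense in heterozygous calls.

    het_calls: iterable of (contig, position) pairs, one per heterozygous call.
    window and min_calls are illustrative thresholds, not values from the talk.
    """
    by_contig = defaultdict(list)
    for contig, pos in het_calls:
        by_contig[contig].append(pos)

    intervals = []
    for contig, positions in sorted(by_contig.items()):
        positions.sort()
        left = 0
        for right, pos in enumerate(positions):
            # Advance the left edge until the window spans at most `window` bases.
            while pos - positions[left] > window:
                left += 1
            if right - left + 1 >= min_calls:
                start, end = positions[left], pos
                # Extend the previous interval on this contig if they overlap.
                if intervals and intervals[-1][0] == contig and start <= intervals[-1][2]:
                    intervals[-1] = (contig, intervals[-1][1], end)
                else:
                    intervals.append((contig, start, end))
    return intervals
```

Isolated heterozygous calls are ignored by design, since a single call is more likely a sequencing or alignment artifact than evidence of a collapse.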

Hybrid Scaffolds – PacBio and BioNano

                                                         Seq Assem                                        BN Hybrid
Assembly                                                 # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaff N50 (Mb)  Total Size (Gb)
CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version)  3641          26.9             2.99             161             47.6            2.84
CHM1 (P6) GCA_001307025, MGI CHM1 map (Adam’s version)   4850          20.6             2.94             221             40.04           2.82

[Figure: hybrid scaffold view with tracks labeled Hybrid Scaffold, PacBio Contigs, and BioNano Contigs]

1q21 Region – GRCh38 vs GCA_001297185

[Figure: GRCh38 and GCA_001297185 compared across the 1q21 region, with a Seg Dup track; scale bar 1 Megabase]

[Figure: second view of the same comparison with the Seg Dup track; alignments annotated 99.9+% identity and 99.1% identity]

CHM1 – Next Steps

• Move forward with improving GCA_001297185

• Based on alignment of BioNano data as well as comparisons to GRCh38, make additional breaks where possible

• Incorporate all finished BACs

• Final alignment to GRCh38 in order to produce chromosome AGPs and submit

Genome Status

Data Source  Origin           Level of Coverage  Status
CHM1         NA               Platinum           Assembly Improvement
CHM13        NA               Platinum           Assembly Assessment
NA19240      Yoruban          Gold               Assessing New Assembly
HG00733      Puerto Rican     Gold               Assembly Assessment
HG00514      Han Chinese      Gold               Assembly Assessment**
NA12878      European         Gold               Assembly Assessment
HG01352      Colombian        Gold               Assembly Assessment
HG02818      Gambian          Gold               Data Generation Underway
HG02059      Kinh Vietnamese  Gold               Data Generation Completed
NA19434      Luhya            Gold               Data Generation Underway

**100x coverage was generated for the Han Chinese sample

First Gold Genome - NA19240

• NA19240 – Yoruban sample
• Generated >70X raw P6/C4 RSII PacBio data

                     Initial Assembly Stats  Latest Assembly Stats
# Seq Contigs        3569                    3242
Max Contig Length    20,393,869 bp           75,769,079 bp
Total Assembly Size  2,745,634,789 bp        2,878,123,324 bp
N50                  6,003,115 bp            23,422,217 bp
N90                  848,151 bp              2,559,914 bp
N95                  345,457 bp              710,070 bp
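For readers less familiar with the contiguity metrics in these stats, the sketch below uses the standard definition: Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total assembly size. The function name `nx` and the toy contig lengths are illustrative only; this is not the pipeline that produced the numbers on the slide.

```python
def nx(lengths, percent=50):
    """Return the Nx contig length (N50 when percent=50).

    lengths: iterable of contig lengths in bp.
    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings coverage to percent% of the total.
        if running * 100 >= total * percent:
            return length

# Toy example: seven contigs totalling 300 bp.
toy = [80, 70, 50, 40, 30, 20, 10]
n50, n90, n95 = nx(toy, 50), nx(toy, 90), nx(toy, 95)
```

Because N90 and N95 reach further into the tail of short contigs, they are always less than or equal to N50 for the same assembly, which matches the ordering seen in the stats.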

NA19240 BioNano Hybrid and SV Stats

                            Seq Assem                                        BN Hybrid
Assembly                    # of Contigs  Contig N50 (Mb)  Total Size (Gb)  # of Scaffolds  Scaffold N50 (Mb)  Total Size (Gb)  Conflicts WGS  Conflicts BN
NA19240 – Initial Assembly  3569          6.01             2.75             421             14.78              2.74             49             60

                Potential mis-assemblies  Breaks made
Conflicts       28                        22
Ends            13                        5
Insertions      5                         2
Translocations  74                        14

Initial curated assembly accession = GCA_001524155.1

Finished BACs Resolve This Region

[Figure: tracks labeled GRCh38, PB Assembly, BAC Alignments, and Seg Dup]

Assembly Stats

Genome   Total Size    # Contigs  N50 Contig   New Contig N50
NA19240  2.75 Gb       3569       6.0 Mb       23.98 Mb (Jason Chin)
HG00733  2.84 Gb       3715       7.6 Mb
NA12878  2.80 Gb       4412       4.49 Mb      15.17 Mb (at MGI)
HG01352  2.85 Gb       4080       8.22 Mb
HG02818  2.82 Gb       3300       7.24 Mb
HG00514  2.81-2.85 Gb  2808-3669  6.1-10.0 Mb

NA19240 MHC Region

[Figure: NA19240 MHC region, with tracks labeled NA19240, Reference, and Alts]

Large Inversion in HG00514

[Figure: inversion in HG00514]

Han Chinese Data Coverage Comparison

• 100x total coverage generated

• Assemblies were performed at 60X, 80X, and 100X

• Comparisons now underway to determine the optimal coverage needed for a PacBio assembly

[Figure: panels labeled 60X, 80X, and 100X]

Gap Comparisons

Spanning Reference Gaps

• HG00514 80X assembly
  • Initial assessment had 75 potential gap-spanning contigs
  • On closer inspection, only 32 are real gap-spanning contigs, which span 40 total gaps
• HG00514 all assemblies
  • 37 gaps spanned by all 3 assemblies (60X, 80X, and 100X)
  • 2 additional gaps spanned by 80X and 100X
  • 1 gap spanned by 60X and 80X
  • 1 gap spanned by 60X and 100X
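The overlap counts above can be cross-checked with a little arithmetic. The sketch below assumes the four categories are disjoint (each gap is counted once, under exactly the set of assemblies that span it) and that no gap is spanned by a single assembly only; the slide does not state either point explicitly. The names `spanned_by` and `gaps_per_assembly` are illustrative.

```python
# Gaps grouped by exactly which HG00514 assemblies span them (counts from the slide).
spanned_by = {
    frozenset({"60X", "80X", "100X"}): 37,
    frozenset({"80X", "100X"}): 2,
    frozenset({"60X", "80X"}): 1,
    frozenset({"60X", "100X"}): 1,
}

def gaps_per_assembly(groups):
    """Sum, for each assembly, the gaps in every group that includes it."""
    totals = {}
    for assemblies, count in groups.items():
        for name in assemblies:
            totals[name] = totals.get(name, 0) + count
    return totals

per_assembly = gaps_per_assembly(spanned_by)   # gaps spanned by each coverage level
total_distinct = sum(spanned_by.values())      # distinct gaps spanned by any assembly
```

Under these assumptions the 80X assembly spans 37 + 2 + 1 = 40 gaps, which agrees with the 40 gaps reported for the 80X assembly; the 100X assembly also spans 40, the 60X assembly spans 39, and 41 distinct gaps are spanned by at least one assembly.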

True Gap Spanner

[Figure: HG00514 contig aligned to GRCh38]

False Gap Spanner

[Figure: alignment with labels False Alignment, Seg Dup, and True Alignment; scale labels 7 kb, 3 kb, and 10 kb]


Future Plans

• Lots of assemblies to analyze!

• Generate the latest Falcon assemblies for all samples

• Improve those assemblies
  • Identifying misassemblies
  • Making the breaks where needed
  • Scaffolding the assemblies
  • Incorporating BACs as they are finished

• Submit to GenBank

Acknowledgements

The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg

University of Washington: Evan Eichler

NCBI: Valerie Schneider

University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti

BioNano Genomics: Palak Sheth, Alex Hastie

Pacific Biosciences: Jason Chin, Nick Sisneros

UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu

NHGRI: Adam Phillippy, Sergey Koren

10X Genomics: Deanna Church

Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath