Getting the Most from the Reference Assembly Valerie Schneider, Ph.D. NCBI 6 October 2015 http://genomereference.o

Ashg2015 schneider final


Page 1: Ashg2015 schneider final

Getting the Most from the Reference Assembly

Valerie Schneider, Ph.D.
NCBI

6 October 2015

http://genomereference.org

Page 2: Ashg2015 schneider final

http://genomereference.org

Twitter: @[email protected]

Page 3: Ashg2015 schneider final

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 4: Ashg2015 schneider final

Reference Assembly Basics

Page 5: Ashg2015 schneider final
Page 6: Ashg2015 schneider final

Reference Assembly Basics

[Figure: sequencing coverage shown at 30x and 1x. Sims et al. (2014) Nat Rev Genet. 15(2):121-32]

Fraction of the genome sequenced at a given coverage (Lander and Waterman (1988) Genomics):

Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
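The coverage figures quoted from Lander and Waterman follow from a Poisson model, in which the expected fraction of bases hit by zero reads at fold coverage c is e^(-c). The sketch below is a minimal illustration under that assumption; the function name is hypothetical, and the computed values differ slightly from the slide's rounded figures (for example about 0.67% rather than 0.6% at 5X).

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases covered by zero reads (Poisson model)."""
    return math.exp(-coverage)


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    # Report both the unsequenced and the sequenced fraction as percentages.
    print(f"{c}X: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

Even at 10X a small fraction of bases is expected to remain unsequenced, which is consistent with the point on the following slides that deeper coverage alone rarely closes every gap.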

Page 7: Ashg2015 schneider final

Reference Assembly Basics

FINISHED?

Page 8: Ashg2015 schneider final

Reference Assembly Basics

Clone based assemblies

BAC insert
BAC vector
Shotgun sequence
Assemble

[Figure label: Fold sequence]

Gaps
• Deeper sequence coverage rarely resolves all gaps
• “Finishers” go in to manually fill the gaps, often by PCR

Minimal Tiling Path

Page 9: Ashg2015 schneider final

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Reference Assembly Basics

Oct. 2014: 13 assemblies

Oct. 2015: 25 assemblies

[Figure labels: YRI, CEU, CEU, CHB]

Page 10: Ashg2015 schneider final

Reference Assembly Basics

Reads:     Sanger    Sanger    Illumina    Illumina    PacBio (older)    PacBio (newer)
Method:    clone     WGS       WGS         WGS         WGS               WGS

N50: Measure of continuity. Contigs of this length or greater contain half of the assembled bases.

Why all this matters:
• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Improved annotation
• More robust mapping target
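As a minimal sketch of the N50 statistic, the function below uses the standard base-weighted definition: sort contigs from longest to shortest and return the length at which the running total first reaches half of all assembled bases. The contig lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the length L such that contigs >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths
print(n50(contigs))
```

Because the statistic is weighted by bases rather than by contig count, a few long contigs can raise N50 substantially even when many short contigs remain.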

Page 11: Ashg2015 schneider final

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 12: Ashg2015 schneider final

Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

Current Assembly model: represent both haplotypes

GRC Assembly Model

many

Page 13: Ashg2015 schneider final

GRC Assembly Model

Assembly (e.g. GRCh38)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

Page 14: Ashg2015 schneider final

GRC Assembly Model

Alt loci alignments are an integral part of the assembly model
Alignment to chr + scaffold sequence = Alt
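One way to picture this model is as a small data structure in which each alt locus carries both its own scaffold sequence length and a chromosome placement derived from its alignment. The sketch below is illustrative only: the class names, field names, region name, and coordinates are all hypothetical and are not taken from GRC data or software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AltLocus:
    """An alt scaffold plus the chromosome placement given by its alignment."""
    name: str
    length: int        # length of the alt scaffold sequence itself
    chrom: str         # chromosome the scaffold aligns to
    chrom_start: int   # placement start on the chromosome (inclusive)
    chrom_end: int     # placement end on the chromosome (inclusive)


@dataclass
class GenomicRegion:
    """A chromosome interval that can hold several alternate representations."""
    name: str
    chrom: str
    start: int
    end: int
    alts: List[AltLocus] = field(default_factory=list)

    def add_alt(self, alt: AltLocus) -> None:
        # An alt belongs to a region only if its placement overlaps the region.
        overlaps = (
            alt.chrom == self.chrom
            and alt.chrom_start <= self.end
            and alt.chrom_end >= self.start
        )
        if not overlaps:
            raise ValueError(f"{alt.name} is not placed within {self.name}")
        self.alts.append(alt)


# Hypothetical region and alts; coordinates are made up for illustration.
region = GenomicRegion("REGION_A", "chr6", 1_000, 9_000)
region.add_alt(AltLocus("ALT_A1", 4_500, "chr6", 2_000, 6_000))
region.add_alt(AltLocus("ALT_A2", 5_200, "chr6", 1_500, 7_000))
print(region.name, [alt.name for alt in region.alts])
```

Keeping the placement alongside the scaffold reflects the slide's point that an alt is only interpretable together with its alignment to the chromosome.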

Page 15: Ashg2015 schneider final

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb

GRCh38

Page 16: Ashg2015 schneider final

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 17: Ashg2015 schneider final

GRCh38: Alt Loci

Alignment Legend: no alignment, mismatch, deletion

Page 18: Ashg2015 schneider final

GRCh38: Alt Loci

[Figure labels: chromosome; alt/patch; reads; on-target alignment; off-target alignments (n=122,922)]

Page 19: Ashg2015 schneider final

GRCh38: Assembly Stats

http://genomereference.org

GRCh38 vs. GRCh37

Page 20: Ashg2015 schneider final

GRCh38: Annotation Stats

Page 21: Ashg2015 schneider final

GRCh38 Base Updates

Targeted PCR/WGS: n=91

Page 22: Ashg2015 schneider final

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

Page 23: Ashg2015 schneider final

GRCh38 Novel Sequence

Page 24: Ashg2015 schneider final

GRCh38 Novel Sequence

Page 25: Ashg2015 schneider final

Assembly Updates

Assembly (e.g. GRCh38.p1)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• Patches
  • Genomic Region (ABO)
  • Genomic Region (FOXO6)
  • Genomic Region (FCGBP)

Patches

                                                   FIX                NOVEL
SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE     -- (integrated)    ALT LOCI
Treat as:                                          Preferred          Allelic
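The distinction between the two patch types can be written as a simple lookup. This is a sketch of the rules as summarized on this slide, with FIX patches integrated at the next major release and treated as the preferred sequence, and NOVEL patches becoming alt loci and treated as allelic. The enum, dictionary, and function names are hypothetical.

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# Rules as summarized on the slide; the key names are illustrative.
PATCH_RULES = {
    PatchType.FIX: {"next_release": "integrated", "treat_as": "preferred"},
    PatchType.NOVEL: {"next_release": "alt locus", "treat_as": "allelic"},
}


def describe_patch(patch_type: PatchType) -> str:
    """Summarize how a patch scaffold of the given type should be handled."""
    rule = PATCH_RULES[patch_type]
    return (
        f"{patch_type.value} patch: {rule['next_release']} at the next major "
        f"release; treat as {rule['treat_as']}"
    )


for patch_type in PatchType:
    print(describe_patch(patch_type))
```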

Page 26: Ashg2015 schneider final

Assembly Updates

GRCh38.p4
• 55 Patches: >400 kb novel sequence
• 37 FIX
• 18 NOVEL

Learn more about assembly updates at the GRC poster: 1834W (6-7 pm)
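Patch counts like these can be tallied from a tab-delimited sequence report. The sketch below assumes a format modeled on NCBI assembly report files, with a `#`-prefixed header row and a `Sequence-Role` column holding values such as `fix-patch` and `novel-patch`. The sample rows, sequence names, and reduced column set are hypothetical, so check the layout against a real report before relying on it.

```python
from collections import Counter

# Hypothetical, simplified report; real reports carry more columns.
SAMPLE_REPORT = """\
# Assembly name:  EXAMPLE
# Sequence-Name\tSequence-Role\tAssigned-Molecule\tSequence-Length
1\tassembled-molecule\t1\t1000
EXAMPLE_ALT_1\talt-scaffold\t6\t400
EXAMPLE_FIX_1\tfix-patch\t1\t50
EXAMPLE_FIX_2\tfix-patch\t9\t60
EXAMPLE_NOVEL_1\tnovel-patch\t19\t70
"""


def count_roles(report_text: str) -> Counter:
    """Count sequences per Sequence-Role in a tab-delimited report."""
    header = None
    counts = Counter()
    for line in report_text.splitlines():
        if line.startswith("#"):
            # The column header is the comment line naming Sequence-Role.
            if "Sequence-Role" in line:
                header = line.lstrip("# ").split("\t")
            continue
        if not line.strip():
            continue
        if header is None:
            raise ValueError("no header line with a Sequence-Role column")
        row = dict(zip(header, line.split("\t")))
        counts[row["Sequence-Role"]] += 1
    return counts


roles = count_roles(SAMPLE_REPORT)
print(roles["fix-patch"], roles["novel-patch"])
```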

Page 27: Ashg2015 schneider final

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 28: Ashg2015 schneider final

Accessing the Data

http://genomereference.org

Page 29: Ashg2015 schneider final

Accessing the Data

Page 30: Ashg2015 schneider final

Accessing the Data

Page 31: Ashg2015 schneider final

Accessing the Data

Page 32: Ashg2015 schneider final

Accessing the Data

Page 33: Ashg2015 schneider final

Accessing the Data

Page 34: Ashg2015 schneider final

GRC Assembly Management

Page 35: Ashg2015 schneider final

Accessing the Data

http://www.ensembl.org/

Page 36: Ashg2015 schneider final

Accessing the Data

Page 37: Ashg2015 schneider final

Accessing the Data

ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt

Page 38: Ashg2015 schneider final

Accessing the Data

Page 39: Ashg2015 schneider final

Accessing the Data

NCBI Variation Viewer
http://www.ncbi.nlm.nih.gov/variation/view/

Learn more about viewing GRCh38 at NCBI: 1748T (12-1 pm)

Page 41: Ashg2015 schneider final

Outline

• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data

Page 42: Ashg2015 schneider final

GRC Credits
http://genomereference.org

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs