Getting the Most from the Human Genome: Understanding Updates and Making Use of Improvements in the Reference Assembly ASHG 18 Oct 2014 Valerie Schneider (NCBI) Tina Graves-Lindsay (TGI-WashU) Deanna Church (Personalis) Laura Clarke (EMBL-EBI)


Assembly background slides from Valerie Schneider


Page 1: Ashg2014 grc workshop_schneider

Getting the Most from the Human Genome: Understanding Updates and Making Use of Improvements in the Reference Assembly

ASHG

18 Oct 2014

Valerie Schneider (NCBI)

Tina Graves-Lindsay (TGI-WashU)

Deanna Church (Personalis)

Laura Clarke (EMBL-EBI)

Page 2: Ashg2014 grc workshop_schneider

GRC Credits

Collaborators

• NCBI RefSeq and gpipe annotation team

• Havana annotators

• Karen Miga

• David Schwartz

• Steve Goldstein

• Mario Caceres

• Giulio Genovese

• Jeff Kidd

• Peter Lansdorp

• Mark Hills

• David Page

• Jim Knight

• Stephan Schuster

• 1000 Genomes

GRC SAB

• Rick Myers

• Granger Sutton

• Evan Eichler

• Jim Kent

• Roderic Guigo

• Carol Bult

• Derek Stemple

• Matthew Hurles

• Richard Gibbs

Page 3: Ashg2014 grc workshop_schneider

Outline

Reference Assembly Basics

GRC: Assembly management and model

GRCh38

Accessing the assembly and data

http://genomereference.org

Page 4: Ashg2014 grc workshop_schneider

What is the Reference Assembly?

Reference Assembly Basics

Page 5: Ashg2014 grc workshop_schneider
Page 6: Ashg2014 grc workshop_schneider
Page 7: Ashg2014 grc workshop_schneider

Dilthey et al.; Paten et al.

Reference Assembly Basics

Page 8: Ashg2014 grc workshop_schneider
Page 9: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Reference ≠ Error-Free

• Highest quality mammalian genome, but it’s not perfect

• Errors can influence data that relies upon the reference

• Annotation

• Variant calling and interpretation

• Guided genome assemblies

What factors influence the production and quality of genome assemblies?

Page 10: Ashg2014 grc workshop_schneider

Lander and Waterman (1988) Genomics

Assumptions:

• Reads are randomly distributed

• Overlap between reads does not vary

Variables:

G = haploid genome length in bp

L = sequence read length in bp

N = number of reads sequenced

T = amount of overlap needed for detection in bp

C = coverage (C = LN/G)

Poisson distribution: $P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$

y = number of events in an interval

λ = mean number of events in an interval

Reference Assembly Basics

For sequence calculations, coverage can be viewed as λ.

Using this equation, you can calculate the probability that a base has been sequenced y times.

By manipulating this formula, you can estimate the number of gaps for any given level of coverage.
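
The calculation described above can be sketched in Python. This is a minimal illustration, not code from the workshop: the function names are invented here, coverage C is used as the Poisson mean as the slide suggests, and the contig-count expression N * exp(-C * (1 - T/L)) is the standard Lander-Waterman result, included on the assumption that it is what "manipulating this formula" refers to.

```python
import math


def prob_depth(y: int, coverage: float) -> float:
    """Poisson probability that a base is covered by exactly y reads."""
    return coverage ** y * math.exp(-coverage) / math.factorial(y)


def fraction_unsequenced(coverage: float) -> float:
    """P(Y = 0): expected fraction of bases covered by no read at all."""
    return prob_depth(0, coverage)


def expected_contigs(n_reads: int, read_len: int, genome_len: int,
                     min_overlap: int) -> float:
    """Expected number of contigs: N * exp(-C * (1 - T/L)), with C = L*N/G."""
    coverage = read_len * n_reads / genome_len
    return n_reads * math.exp(-coverage * (1 - min_overlap / read_len))


if __name__ == "__main__":
    for c in (1, 5, 10):
        print(f"{c}X coverage: {fraction_unsequenced(c):.4%} of bases not sequenced")
```

For a linear sequence, the expected number of gaps is roughly the expected number of contigs minus one, so raising coverage shrinks the gap count exponentially but never to exactly zero.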

Page 11: Ashg2014 grc workshop_schneider

Coverage   Not sequenced   Sequenced
1X         37%             63%
5X         0.6%            99.4%
10X        0.005%          99.995%

Reference Assembly Basics

Page 12: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Sims et al. (2014) Nat Rev Genet. 15(2):121-32

Page 13: Ashg2014 grc workshop_schneider
Page 14: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Even if you sequence to an “appropriate” coverage, you’re still likely to have missing sequence in your assembly.

Complicating Factors:

• Library construction

• Cloning bias

• Sequencing Limitations

• Assembly method

• Underlying biology

Page 15: Ashg2014 grc workshop_schneider

Biology

Repetitive sequence (interspersed repeats, segmental duplications)

Variation (regions of high diversity, structural variation)

Kidd et al., 2008

Reference Assembly Basics

Page 16: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Eugene Yaschenko, NCBI GRCh37

Page 17: Ashg2014 grc workshop_schneider

Technology

Read length: long reads vs. short reads

Mate lengths: distribution of insert sizes

Read accuracy: error model for your technology

Read depth: coverage at each base

Genome distribution: reads covering entire genome equally

Ajay et al., 2011

Page 18: Ashg2014 grc workshop_schneider

Genome Research, May, 1997

Reference Assembly Basics

Page 19: Ashg2014 grc workshop_schneider

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

WGS: Sanger Reads

Scaffold

Reference Assembly Basics
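
The "find sequence overlaps" step on this slide can be illustrated with a toy suffix-prefix comparison. This is a sketch only: the function names, the exact-match requirement, and the `min_overlap` threshold are assumptions made for illustration, and real assemblers tolerate sequencing errors and use indexed data structures rather than this brute-force scan.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap by at least `min_overlap` bases."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    # Made-up reads for illustration only.
    read_a = "ATTTTCCCTTCTGAAATG"
    read_b = "GAAATGATGAAAGAGTC"
    print(overlap_length(read_a, read_b))
    print(merge_reads(read_a, read_b))
```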

Page 20: Ashg2014 grc workshop_schneider

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.

Scaffold: a sequence constructed from smaller sequences, which may contain gaps.

Genome Vocabulary

Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ

Typically built from sequences in GenBank/EMBL/DDBJ
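
To make the contig/scaffold distinction concrete, the sketch below recovers the gap-free contigs from a scaffold string. It assumes gaps are written as runs of `N`, which is a common FASTA convention but is not stated on the slide; the function name is invented for this example.

```python
import re
from typing import List


def split_scaffold(scaffold: str) -> List[str]:
    """Return the gap-free contigs of a scaffold, treating runs of N as gaps."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


if __name__ == "__main__":
    # Toy scaffold: three contigs separated by two gaps.
    print(split_scaffold("ACGTNNNNGGCCNNTT"))
```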

Reference Assembly Basics

Page 21: Ashg2014 grc workshop_schneider

Schatz et al, 2010

Reference Assembly Basics

Page 22: Ashg2014 grc workshop_schneider

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Reference Assembly Basics

Page 23: Ashg2014 grc workshop_schneider

BAC insert, BAC vector

Shotgun sequence

Assemble

[Plot axes: Fold sequence, Gaps]

Deeper sequence coverage rarely resolves all gaps

GAPS

“Finishers” go in to manually fill the gaps, often by PCR

Clone based assemblies

Reference Assembly Basics

Page 24: Ashg2014 grc workshop_schneider

Ideally…

[Figure: segments labeled A to O compared against a Non-sequence based Map; one segment is marked (flip)]

Reference Assembly Basics

Page 25: Ashg2014 grc workshop_schneider

More like…

[Figure: segments labeled A to O alongside additional segments V, W, X, Y, Z; one arrangement is marked ?]

Reference Assembly Basics

Page 26: Ashg2014 grc workshop_schneider

Sequence vs. Non-sequence based maps

Mmu7

WI Genetic

WI/MRC RH

Page 27: Ashg2014 grc workshop_schneider

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Reference Assembly Basics

Page 28: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Page 29: Ashg2014 grc workshop_schneider

Reference Assembly Basics

N50: Measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.
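
N50 is conventionally computed over bases rather than contig counts: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembled bases. The sketch below follows that convention; the function name and the toy lengths are illustrative, not taken from the slides.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Toy contig lengths in bp: total 290, so the half-way mark is 145.
    print(n50([80, 70, 50, 40, 30, 20]))
```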

Page 30: Ashg2014 grc workshop_schneider

Reference Assembly Basics

Fragmented genomes tend to have more partial models

Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

Page 31: Ashg2014 grc workshop_schneider

Outline

Reference Assembly Basics

GRC: Assembly management and model

GRCh38

Accessing the assembly and data

http://genomereference.org

Page 32: Ashg2014 grc workshop_schneider

http://genomereference.org

Page 33: Ashg2014 grc workshop_schneider

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Assembly Management

Human Genome Project (HGP)

Page 34: Ashg2014 grc workshop_schneider

AECOM BCM Beijing CGM CHGC CMGWCH CSHL GBF GS GTC IIGB-CNR

IMB JGI JST Keio MPIMG RIKEN SC SDSTDC SHGC TIGR Tokai

UOKNOR UTSW UUGC UWGC UWMSC WIBR WUGSC YMGC unknown

GRC Assembly Management

Page 35: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 36: Ashg2014 grc workshop_schneider

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

GRC Assembly Management

Page 37: Ashg2014 grc workshop_schneider

Issue tracking system (based on JIRA)

GRC Assembly Management

http://genomereference.org

Page 38: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 39: Ashg2014 grc workshop_schneider

GRC Assembly Management

http://genomereference.org

Page 40: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 41: Ashg2014 grc workshop_schneider

Tiling Path File (TPF)

ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
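
A small sketch of how rows like these could be read into contig groupings. It assumes the simplified whitespace-separated layout shown on the slide (accession, clone name, contig, with `GAP` rows in between); the full TPF specification has additional fields and rules, and the function name here is hypothetical.

```python
from typing import Dict, List, Tuple

# Rows copied from the slide's example; the header line is omitted.
TPF_TEXT = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""


def parse_tpf(text: str) -> Tuple[Dict[str, List[Tuple[str, str]]], List[Tuple[str, ...]]]:
    """Group (accession, clone) pairs by contig and collect GAP rows separately."""
    contigs: Dict[str, List[Tuple[str, str]]] = {}
    gaps: List[Tuple[str, ...]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "GAP":
            gaps.append(tuple(fields[1:]))
        else:
            accession, clone, contig = fields
            contigs.setdefault(contig, []).append((accession, clone))
    return contigs, gaps


if __name__ == "__main__":
    contigs, gaps = parse_tpf(TPF_TEXT)
    for name, members in contigs.items():
        print(name, [accession for accession, _ in members])
    print("gaps:", gaps)
```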

GRC Assembly Management

Page 42: Ashg2014 grc workshop_schneider

Full Dovetail

Half-dovetail

Contained

Short/Blunt

GRC Assembly Management

Page 43: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 44: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 45: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 46: Ashg2014 grc workshop_schneider

GRC Assembly Management

Page 47: Ashg2014 grc workshop_schneider

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Representative chromosome sequence

GRC Assembly Management

Page 48: Ashg2014 grc workshop_schneider

AGP: A Golden Path

Provides instructions for building a sequence

• Defines component sequences used to build scaffolds/chromosomes

• Switch points

• Defines gaps and types

GRC Produces

• AGP

• FASTA

GRC Assembly Management
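
The way an AGP file acts as build instructions can be sketched as follows. The column layout follows the published AGP format (object, object start, object end, part number, component type, then either component ID/start/end/orientation or gap length/type/linkage/evidence); the component names, sequences, and function names below are made up for illustration, and this is not the GRC's own software.

```python
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_from_agp(agp_lines: Iterable[str], components: Dict[str, str]) -> str:
    """Assemble one object's sequence from tab-delimited AGP rows."""
    pieces = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[4] in ("N", "U"):
            # Gap row: column 6 is the gap length.
            pieces.append("N" * int(cols[5]))
        else:
            # Component row: the 1-based begin/end mark where the path
            # switches into and out of this component.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            seq = components[comp_id][beg - 1:end]
            pieces.append(reverse_complement(seq) if orient == "-" else seq)
    return "".join(pieces)


# Hypothetical components and AGP rows.
COMPONENTS = {"compA": "ACGTACGTAC", "compB": "GGGTTTCCCA"}
AGP_ROWS = [
    "chrT\t1\t6\t1\tF\tcompA\t1\t6\t+",
    "chrT\t7\t11\t2\tN\t5\tscaffold\tyes\tpaired-ends",
    "chrT\t12\t15\t3\tF\tcompB\t3\t6\t-",
]

if __name__ == "__main__":
    print(build_from_agp(AGP_ROWS, COMPONENTS))
```

Because the AGP carries only coordinates and identifiers, the same component records can be reused across assembly versions while the path through them changes.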

Page 49: Ashg2014 grc workshop_schneider

Distributed data

Old Assembly Model

Centralized Data

Updated Assembly Model

GRC Assembly Management

Genome not in INSDC Database

Page 50: Ashg2014 grc workshop_schneider

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

GRC Assembly Management

Page 51: Ashg2014 grc workshop_schneider

[Diagram labels]

Assembly (e.g. GRCh38)

• Primary Assembly Unit

• Non-nuclear assembly unit (e.g. MT)

• PAR

• Genomic Region (MHC)

• Genomic Region (UGT2B17)

• Genomic Region (MAPT)

• ALT 1 to ALT 7

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

The human reference genome assembly is not a haploid model

Alternate loci are not synonymous with haplotypes

GRC Assembly Management

Page 52: Ashg2014 grc workshop_schneider

[Diagram labels]

Assembly (e.g. GRCh38.p1)

• Primary Assembly Unit

• Non-nuclear assembly unit (e.g. MT)

• ALT 1 to ALT 7

• PAR

• Genomic Region (MHC)

• Genomic Region (UGT2B17)

• Genomic Region (MAPT)

• Patches

• Genomic Region (ABO)

• Genomic Region (FOXO6)

• Genomic Region (FCGBP)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

GRC Assembly Management

Page 53: Ashg2014 grc workshop_schneider

Patches

                                                 FIX               NOVEL
SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE   -- (integrated)   ALT LOCI

Page 54: Ashg2014 grc workshop_schneider

1q32 1q21 1p21

Dennis et al., 2012

Fix patches are different than novel patches

GRC Assembly Management

Page 55: Ashg2014 grc workshop_schneider

The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly

Page 56: Ashg2014 grc workshop_schneider

Anatomy of an alt

Alignment Legend

no alignment, mismatch, deletion

Page 57: Ashg2014 grc workshop_schneider

Anatomy of an alt

AC012314.8

CU151838.1

ALT LOCI

AC012314.8

AC245052.3 CHR. 19

Alternate loci contain some sequence that is redundant to the primary assembly unit

Page 58: Ashg2014 grc workshop_schneider

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds

Simulated Reads

GRCh38: Alt Loci

Page 59: Ashg2014 grc workshop_schneider

GRC Assembly Management

GRCh38.p1

• 192 Regions

• 261 ALT LOCI (61.9 Mb, 3.6 Mb unique to alts)

• PATCHES (94 kb unique to patches)

• 13 FIX

• 3 NOVEL

Page 60: Ashg2014 grc workshop_schneider

GRCh38: Alt Loci

Page 61: Ashg2014 grc workshop_schneider

GRCh38: Alt Loci

GRCh38 alt loci alignment

GRCh37 chr. 7

Page 62: Ashg2014 grc workshop_schneider

chromosome

alt/patch

reads

On-target alignment

Off-target alignments

(n=122,922)

GRCh38: Alt Loci

Page 63: Ashg2014 grc workshop_schneider

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

Updated Assembly Model

Genome in INSDC Database

Genome not in INSDC Database

GRC Assembly Management

Page 64: Ashg2014 grc workshop_schneider

Outline

Reference Assembly Basics

GRC: Assembly management and model

GRCh38

Accessing the assembly and data

http://genomereference.org

Page 65: Ashg2014 grc workshop_schneider

GRCh38: Assembly Stats

http://genomereference.org

Page 66: Ashg2014 grc workshop_schneider

GRCh38: Annotation Stats

Page 67: Ashg2014 grc workshop_schneider

GRCh38 Base Updates

Targeted PCR/WGS: n=91

Page 68: Ashg2014 grc workshop_schneider

GRCh38 Sequence Updates

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

79% of these bases are heterozygous in RP11 WGS

n=10489

Page 69: Ashg2014 grc workshop_schneider

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

Page 70: Ashg2014 grc workshop_schneider

GRCh38 Model Centromeres

Page 71: Ashg2014 grc workshop_schneider

GRCh38 Impact

NOVEL GENES!

Page 72: Ashg2014 grc workshop_schneider

GRCh38 Impact

Sudmant et al., 2010

Page 73: Ashg2014 grc workshop_schneider

GRCh38: Novel Sequence

Page 74: Ashg2014 grc workshop_schneider

GRCh38 Impact

Page 75: Ashg2014 grc workshop_schneider

Outline

Reference Assembly Basics

GRC: Assembly management and model

GRCh38

Accessing the assembly and data

http://genomereference.org

Page 76: Ashg2014 grc workshop_schneider

Accessing the Data

http://genomereference.org

Page 77: Ashg2014 grc workshop_schneider

Accessing the Data

Page 78: Ashg2014 grc workshop_schneider

Accessing the Data

Page 79: Ashg2014 grc workshop_schneider

Accessing the Data

Page 80: Ashg2014 grc workshop_schneider

Accessing the Data

Page 81: Ashg2014 grc workshop_schneider

Accessing the Data

Page 82: Ashg2014 grc workshop_schneider

Accessing the Data

Page 83: Ashg2014 grc workshop_schneider

http://twitter.com/[email protected]

Accessing the Data

Page 84: Ashg2014 grc workshop_schneider

http://genomeref.blogspot.com/

Accessing the Data

Page 85: Ashg2014 grc workshop_schneider

Accessing the Data

Page 86: Ashg2014 grc workshop_schneider

Accessing the Data

Page 87: Ashg2014 grc workshop_schneider

Search

Gene and exon navigator

Variant Filter

Variant Table

Sequence Viewer

Slide: Peter Cooper, NCBI

http://www.ncbi.nlm.nih.gov/variation/view/

NCBI Variation Viewer

Accessing the Data

Page 88: Ashg2014 grc workshop_schneider

Accessing the Data

Page 89: Ashg2014 grc workshop_schneider

http://www.ensembl.org/

Accessing the Data

Page 90: Ashg2014 grc workshop_schneider

Accessing the Data

Page 91: Ashg2014 grc workshop_schneider

http://www.ncbi.nlm.nih.gov/genome/tools/remap

Accessing the Data

Page 92: Ashg2014 grc workshop_schneider