Ashg2014 grc workshop_schneider

DESCRIPTION

Assembly background slides from Valerie Schneider

Getting the Most from the Human Genome: Understanding Updates and Making Use of Improvements in the Reference Assembly

ASHG

18 Oct 2014

Valerie Schneider (NCBI)

Tina Graves-Lindsay (TGI-WashU)

Deanna Church (Personalis)

Laura Clarke (EMBL-EBI)

Collaborators

• NCBI RefSeq and gpipe annotation team

• Havana annotators

• Karen Miga

• David Schwartz

• Steve Goldstein

• Mario Caceres

• Giulio Genovese

• Jeff Kidd

• Peter Lansdorp

• Mark Hills

• David Page

• Jim Knight

• Stephan Schuster

• 1000 Genomes

GRC SAB

• Rick Myers

• Granger Sutton

• Evan Eichler

• Jim Kent

• Roderic Guigo

• Carol Bult

• Derek Stemple

• Matthew Hurles

• Richard Gibbs

GRC Credits

Outline

Reference Assembly Basics

GRC: Assembly management and model

GRCh38

Accessing the assembly and data

http://genomereference.org

What is the Reference Assembly?

Reference Assembly Basics

Dilthey et al.

Paten et al.

Reference Assembly Basics

Reference ≠ Error-Free

• Highest quality mammalian genome, but it’s not perfect

• Errors can influence data that relies upon the reference

• Annotation

• Variant calling and interpretation

• Guided genome assemblies

What factors influence the production and quality of genome assemblies?

Lander and Waterman (1988) Genomics

Assumptions:

• Reads are randomly distributed

• Overlap between reads does not vary

Variables:

G = haploid genome length in bp

L = sequence read length in bp

N = number of reads sequenced

T = amount of overlap needed for detection in bp

C = coverage (C = LN/G)

Poisson distribution: $P(Y = y) = \lambda^{y} e^{-\lambda} / y!$

y = number of events in an interval

$\lambda$ = mean number of events in an interval

For sequence calculations, coverage can be viewed as $\lambda$.

Using this equation, you can calculate the probability that a base has been sequenced y times.

By manipulating this formula, you can estimate the number of gaps for any given level of coverage.

Coverage        Not sequenced    Sequenced

1X Coverage     37%              63%

5X Coverage     0.6%             99.4%

10X Coverage    0.005%           99.995%
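The coverage figures above follow from the Poisson model with the mean set to the coverage: the fraction of bases never sequenced is the probability of zero events. The sketch below reproduces that calculation. The function names are illustrative only, and `expected_contigs` uses the island-count result from the Lander and Waterman (1988) paper, N·e^(−Cσ) with σ = 1 − T/L, which is not written out on the slides and is included here as an assumption about what "manipulating this formula" refers to.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a Poisson distribution with mean lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


def fraction_not_sequenced(coverage: float) -> float:
    """Probability that a base is covered by zero reads (y = 0)."""
    return poisson_pmf(0, coverage)


def expected_contigs(n_reads: int, read_len: int,
                     genome_len: int, min_overlap: int) -> float:
    """Expected number of contigs (islands) under Lander-Waterman.

    Assumption: uses N * exp(-C * sigma), sigma = 1 - T/L, from the
    1988 paper; the slides do not state this formula explicitly.
    """
    coverage = n_reads * read_len / genome_len   # C = LN/G
    sigma = 1 - min_overlap / read_len           # usable fraction of a read
    return n_reads * math.exp(-coverage * sigma)


if __name__ == "__main__":
    for c in (1, 5, 10):
        missed = fraction_not_sequenced(c)
        print(f"{c}X: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

At 1X the model leaves about 37% of bases unsequenced, since e^(−1) ≈ 0.37. Real projects deviate from these numbers because the randomness assumption rarely holds exactly.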

Reference Assembly Basics

Sims et al. (2014) Nat Rev Genet. 15(2):121-32

Reference Assembly Basics

Even if you sequence to an “appropriate” coverage, you’re still likely to have missing sequence in your assembly.

Complicating Factors:

• Library construction

• Cloning bias

• Sequencing Limitations

• Assembly method

• Underlying biology

Biology

• Repetitive sequence (interspersed repeats, segmental duplications)

• Variation (regions of high diversity, structural variation)

Kidd et al., 2008

Reference Assembly Basics

Eugene Yaschenko, NCBI GRCh37

Technology

• Read length: long reads vs. short reads

• Mate lengths: distribution of insert sizes

• Read accuracy: error model for your technology

• Read depth: coverage at each base

• Genome distribution: reads covering entire genome equally

Ajay et al., 2011

Genome Research, May, 1997

Reference Assembly Basics

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

WGS: Sanger Reads

Scaffold

Reference Assembly Basics

Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.

Scaffold: a sequence constructed from smaller sequences, which may contain gaps.

Genome Vocabulary

Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ

Typically built from sequences in GenBank/EMBL/DDBJ
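The contig/scaffold distinction can be illustrated with a short sketch that recovers the gap-free contigs from a scaffold string. It assumes the common FASTA convention that gaps are written as runs of `N`; the function name and the `min_gap` parameter are hypothetical and chosen only for this example.

```python
import re
from typing import List


def split_scaffold(scaffold: str, min_gap: int = 1) -> List[str]:
    """Split a scaffold into contigs at runs of N of at least min_gap bases.

    Assumption: gaps are encoded as N/n characters. Runs shorter than
    min_gap are treated as ambiguous bases and stay inside the contig.
    """
    gap_pattern = r"[Nn]{%d,}" % min_gap
    # Drop empty strings produced by leading or trailing gaps.
    return [piece for piece in re.split(gap_pattern, scaffold) if piece]


if __name__ == "__main__":
    scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA"
    print(split_scaffold(scaffold))
```

A scaffold with no `N` runs yields a single contig, which matches the definitions above: every contig is also a (gap-free) scaffold, but not the reverse.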

Reference Assembly Basics

Schatz et al, 2010

Reference Assembly Basics

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Reference Assembly Basics

Clone based assemblies

BAC insert

BAC vector

Shotgun sequence

Assemble

[Figure: gaps plotted against fold sequence]

Deeper sequence coverage rarely resolves all gaps

GAPS

“Finishers” go in to manually fill the gaps, often by PCR

Reference Assembly Basics

Ideally…

[Figure: components A through O placed against a non-sequence based map; one segment is marked "(flip)"]

Reference Assembly Basics

More like…

[Figure: components A through O compared against several maps that disagree in order and content (some include segments labeled V, W, X, Y, Z); the resulting placement is marked "?"]

Reference Assembly Basics

Sequence vs. Non-sequence based maps

Mmu7

WI Genetic

WI/MRC RH

Human assemblies available in the NCBI assembly database

http://www.ncbi.nlm.nih.gov/assembly

Reference Assembly Basics

N50: Measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.
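As a worked illustration, N50 is usually computed by sorting contig lengths from longest to shortest and walking down the list until the running total reaches half of the assembled bases; the length at which that happens is the N50. The sketch below follows that common definition. The function name is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop once the largest contigs cover at least half of all bases.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6]   # total = 20 bases
    print(n50(contigs))         # 6 + 5 = 11 >= 10, so N50 is 5
```

Note that the result is weighted by bases rather than by contig count: in this toy example the median contig length is 4, while the N50 is 5.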

Reference Assembly Basics

Fragmented genomes tend to have more partial models

Fragmented genomes have fewer frameshifts

Alexander Souvorov, NCBI

Distributed data

Genome not in INSDC Database

Old Assembly Model

GRC Assembly Management

Human Genome Project (HGP)

AECOM BCM Beijing CGM CHGC CMGWCH CSHL GBF GS GTC IIGB-CNR

IMB JGI JST Keio MPIMG RIKEN SC SDSTDC SHGC TIGR Tokai

UOKNOR UTSW UUGC UWGC UWMSC WIBR WUGSC YMGC unknown

GRC Assembly Management

Distributed data

Genome not in INSDC Database

Old Assembly Model

Centralized Data

GRC Assembly Management

Issue tracking system (based on JIRA)

GRC Assembly Management

Tiling Path File (TPF)

ACCESSION   NAME          CONTIG

GAP         Telomere      10000

AP006221    XX-190A2      Hschr1_ctg1

AL627309    RP11-34P13    Hschr1_ctg1

GAP         type-3

AC114498    RP5-857K21    Hschr1_ctg3

AL669831    RP11-206L10   Hschr1_ctg3

AL645608    RP11-54O7     Hschr1_ctg3
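To make the TPF layout concrete, the sketch below reads the rows shown on the slide and groups component accessions by contig. It assumes only what is visible in the excerpt: whitespace-separated columns (accession, clone name, contig) and gap rows that start with `GAP`. The real TPF specification may carry additional fields, so treat `parse_tpf` as a hypothetical helper, not a reference parser.

```python
from typing import Dict, List, Tuple

SAMPLE_TPF = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""


def parse_tpf(text: str) -> Tuple[Dict[str, List[str]], List[List[str]]]:
    """Return (contig -> ordered accessions, list of gap rows)."""
    contigs: Dict[str, List[str]] = {}
    gaps: List[List[str]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0] == "GAP":
            gaps.append(fields[1:])       # gap type and optional length
            continue
        if len(fields) != 3:
            raise ValueError(f"unexpected TPF row: {line!r}")
        accession, _clone_name, contig = fields
        contigs.setdefault(contig, []).append(accession)
    return contigs, gaps


if __name__ == "__main__":
    contigs, gaps = parse_tpf(SAMPLE_TPF)
    for name, accessions in contigs.items():
        print(name, accessions)
    print("gaps:", gaps)
```

Keeping the accessions in file order matters because the tiling path is an ordered list: the position of each component in the file defines its position in the contig.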

GRC Assembly Management

Full Dovetail

Half-dovetail

Contained

Short/Blunt

GRC Assembly Management

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Representative chromosome sequence

GRC Assembly Management

AGP: A Golden Path

Provides instructions for building a sequence

• Defines component sequences used to build scaffolds/chromosomes

• Switch points

• Defines gaps and types
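How an AGP file acts as build instructions can be sketched as follows. The example assumes the tab-separated AGP 2.0 column layout (object, object start, object end, part number, component type, then either component ID, component start, component end, orientation, or gap length, gap type, linkage, evidence). The component IDs and sequences are made up for illustration, and the function is a minimal sketch that ignores validation of the object coordinates.

```python
from typing import Dict, Iterable

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_lines: Iterable[str], components: Dict[str, str]) -> str:
    """Assemble an object sequence from AGP rows and component sequences."""
    pieces = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue                              # skip blanks and comments
        fields = line.rstrip("\n").split("\t")
        component_type = fields[4]
        if component_type in ("N", "U"):          # gap rows
            pieces.append("N" * int(fields[5]))
            continue
        comp_id, start, end, orientation = (
            fields[5], int(fields[6]), int(fields[7]), fields[8])
        # AGP coordinates are 1-based and inclusive; the end of the used
        # span is where the path switches to the next component.
        piece = components[comp_id][start - 1:end]
        if orientation == "-":
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical components and AGP rows, not real accessions.
    components = {"A.1": "AAAACCCC", "B.1": "GGGGTTTT"}
    agp = [
        "chrT\t1\t6\t1\tF\tA.1\t1\t6\t+",
        "chrT\t7\t11\t2\tN\t5\tscaffold\tyes\tpaired-ends",
        "chrT\t12\t15\t3\tF\tB.1\t5\t8\t-",
    ]
    print(build_from_agp(agp, components))
```

Only bases 1 to 6 of the first component are used, so the overlapping remainder never enters the object; this is the practical effect of a switch point.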

GRC Produces

• AGP

• FASTA

Distributed data

Old Assembly Model

Centralized Data

Updated Assembly Model

GRC Assembly Management

Genome not in INSDC Database

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

GRC Assembly Management

Assembly (e.g. GRCh38)

Primary Assembly Unit

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

The human reference genome assembly is not a haploid model

Alternate loci are not synonymous with haplotypes

GRC Assembly Management

Assembly (e.g. GRCh38.p1)

Primary Assembly Unit

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7

Patches

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Genomic Region (ABO)

Genomic Region (FOXO6)

Genomic Region (FCGBP)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
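The assembly model in the diagram can be thought of as a container of assembly units. The sketch below is one possible way to represent that structure in code; the class names, field names, and unit labels are hypothetical and are not taken from any NCBI or GRC schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssemblyUnit:
    """A named collection of scaffolds or chromosomes."""
    name: str
    sequences: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    """Assembly = primary unit + non-nuclear unit + ALT units (+ patches)."""
    name: str
    primary: AssemblyUnit
    non_nuclear: AssemblyUnit
    alt_units: List[AssemblyUnit] = field(default_factory=list)
    patches: Optional[AssemblyUnit] = None   # present only in patch releases

    def unit_names(self) -> List[str]:
        units = [self.primary, self.non_nuclear, *self.alt_units]
        if self.patches is not None:
            units.append(self.patches)
        return [unit.name for unit in units]


if __name__ == "__main__":
    # Labels below are illustrative placeholders.
    grch38_p1 = Assembly(
        name="GRCh38.p1",
        primary=AssemblyUnit("Primary Assembly"),
        non_nuclear=AssemblyUnit("Non-nuclear (MT)"),
        alt_units=[AssemblyUnit(f"ALT {i}") for i in range(1, 8)],
        patches=AssemblyUnit("Patches"),
    )
    print(grch38_p1.unit_names())
```

Modeling patches as an optional unit mirrors the diagram: a major release such as GRCh38 has no patch unit, while a patch release such as GRCh38.p1 adds one without changing the primary unit.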

GRC Assembly Management

Patches

PATCH TYPE    SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE

FIX           -- (integrated)

NOVEL         ALT LOCI

1q32 1q21 1p21

Dennis et al., 2012

Fix patches are different than novel patches

GRC Assembly Management

The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly

Anatomy of an alt

Alignment Legend: no alignment, mismatch, deletion

Anatomy of an alt

AC012314.8

CU151838.1

ALT LOCI

AC012314.8

AC245052.3 CHR. 19

Alternate loci contain some sequence that is redundant to the primary assembly unit

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts.

Mask2: mask only on scaffolds.

Simulated Reads

GRCh38: Alt Loci

GRC Assembly Management

GRCh38.p1

• 192 Regions

• 261 ALT LOCI (61.9 Mb, 3.6 Mb unique to alts)

• PATCHES (94 kb unique to patches)

• 13 FIX

• 3 NOVEL

GRCh38: Alt Loci

GRCh38 alt loci alignment

GRCh37 chr. 7

chromosome

alt/patch

reads On-target alignment

Off-target alignments

(n=122,922)

GRCh38: Alt Loci

Old Assembly Model: Distributed data; Genome not in INSDC Database

Updated Assembly Model: Centralized Data; Genome in INSDC Database

GRC Assembly Management

GRCh38: Assembly Stats

GRCh38: Annotation Stats

GRCh38 Base Updates

Targeted PCR/WGS: n=91

GRCh38 Sequence Updates

Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

79% of these bases are heterozygous in RP11 WGS

n=10489

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

GRCh38 Model Centromeres

GRCh38 Impact

NOVEL GENES!

GRCh38 Impact

Sudmant et al., 2010

GRCh38: Novel Sequence

GRCh38 Impact

Accessing the Data

http://twitter.com/GenomeRef

grc-announce@ncbi.nlm.nih.gov

Accessing the Data

http://genomeref.blogspot.com/

Accessing the Data

Search

Gene and exon navigator

Variant Filter

Variant Table

Sequence Viewer

Slide: Peter Cooper, NCBI

http://www.ncbi.nlm.nih.gov/variation/view/

NCBI Variation Viewer

Accessing the Data

http://www.ensembl.org/

Accessing the Data

http://www.ncbi.nlm.nih.gov/genome/tools/remap

Accessing the Data
