48
262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear- time algorithm

Fragment Assembly

  • Upload
    borka

  • View
    45

  • Download
    0

Embed Size (px)

DESCRIPTION

Fragment Assembly. Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm. Steps to Assemble a Genome. Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment - PowerPoint PPT Presentation

Citation preview

Page 1: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Fragment Assembly

Given N reads…Where N ~ 30

million…

We need to use a linear-time algorithm

Page 2: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Steps to Assemble a Genome

1. Find overlapping reads

4. Derive consensus sequence ..ACGATTACAATAGGTT..

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

Some Terminology

read a 500-900 long word that comes out of sequencer

mate pair a pair of reads from two endsof the same insert fragment

contig a contiguous sequence formed by several overlapping readswith no gaps

supercontig an ordered and oriented set(scaffold) of contigs, usually by mate

pairs

consensus sequence derived from thesequene multiple alignment of reads

in a contig

Page 3: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

1. Find Overlapping Reads

• Find pairs of reads sharing a k-mer, k ~ 24• Extend to full alignment – throw away if not >98% similar

TAGATTACACAGATTAC

TAGATTACACAGATTAC|||||||||||||||||

T GA

TAGA| ||

TACA

TAGT||

• Caveat: repeats A k-mer that occurs N times, causes O(N2) read/read comparisons ALU k-mers could cause up to 1,000,0002 comparisons

• Solution: Discard all k-mers that occur “too often”

• Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available

Page 4: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

1. Find Overlapping Reads

• Correct errors using multiple alignment

TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAGATTACACAGATTATTGATAGATTACACAGATTACTGATAG-TTACACAGATTACTGA

TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG-TTACACAGATTATTGATAGATTACACAGATTACTGATAG-TTACACAGATTATTGA

insert Areplace T with C

correlated errors—probably caused by repeats disentangle overlaps

TAGATTACACAGATTACTGATAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAG-TTACACAGATTATTGAIn practice, error correction removes up to 98% of the errors

Page 5: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

2. Merge Reads into Contigs

• Overlap graph: Nodes: reads r1…..rn

Edges: overlaps (ri, rj, shift, orientation, score)

Note:of course, we don’tknow the “color” ofthese nodes

Reads that comefrom two regions ofthe genome (blueand red) that containthe same repeat

Page 6: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

2. Merge Reads into Contigs

Page 7: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Overlap graph after forming contigs

Unitigs:Gene Myers, 95

Page 8: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Repeats, errors, and contig lengths

• Repeats shorter than read length are easily resolved Read that spans across a repeat disambiguates order of flanking regions

• Repeats with more base pair diffs than sequencing error rate are OK We throw overlaps between two reads in different copies of the repeat

• To make the genome appear less repetitive, try to:

Increase read length Decrease sequencing error rate

Role of error correction:Discards up to 98% of single-letter sequencing errors

decreases error rate decreases effective repeat content increases contig length

Page 9: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

3. Link Contigs into Supercontigs

Too dense Overcollapsed

Inconsistent links Overcollapsed?

Normal density

Page 10: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Find all links between unique contigs

3. Link Contigs into Supercontigs

Connect contigs incrementally, if 2 links

supercontig(aka scaffold)

Page 11: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Fill gaps in supercontigs with paths of repeat contigs

3. Link Contigs into Supercontigs

Page 12: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)

Page 13: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Some Assemblers

• PHRAP• Early assembler, widely used, good model of read errors• Overlap O(n2) layout (no mate pairs) consensus

• Celera• First assembler to handle large genomes (fly, human, mouse)• Overlap layout consensus

• Arachne• Public assembler (mouse, several fungi)• Overlap layout consensus

• Phusion• Overlap clustering PHRAP assemblage consensus

• Euler• Indexing Euler graph layout by picking paths consensus

Page 14: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Quality of assemblies

Celera’s assemblies of human and mouse

Page 15: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Quality of assemblies—mouse

Page 16: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Quality of assemblies—mouse

Terminology: N50 contig lengthIf we sort contigs from largest to smallest, and startCovering the genome in that order, N50 is the lengthOf the contig that just covers the 50th percentile.

Page 17: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Quality of assemblies—rat

Page 18: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

History of WGA

• 1982: -virus, 48,502 bp

• 1995: h-influenzae, 1 Mbp

• 2000: fly, 100 Mbp

• 2001 – present human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee,

several fungal genomes

Gene Myers

Let’s sequence the human

genome with the shotgun

strategy

That is impossible, and

a bad idea anyway

Phil Green

1997

Page 19: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Genomes Sequenced

• http://www.genome.gov/10002154

Page 20: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Multiple Sequence Alignments

Page 21: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

Mutation

SEQUENCE EDITS

REARRANGEMENTS

Deletion

InversionTranslocationDuplication

Page 22: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

Page 23: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Orthology and Paralogy

HB Human

WB Worm

HA1 Human

HA2 Human

Yeast

WA Worm

Orthologs:Derived by speciation

Paralogs:Everything else

Page 24: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Orthology, Paralogy, Inparalogs, Outparalogs

Page 26: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology

• The patterns of conservation can help us tell function of the element

Page 27: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignmentA pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Page 28: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

Human

Mouse

Chicken

• Weighted SOP:

S(m) = k<l wkl s(mk, ml)

Duck

Page 29: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

A Profile Representation

• Given a multiple alignment M = m1…mn Replace each column mi with profile entry pi

• Frequency of each letter in • # gaps• Optional: # gap openings, extensions, closings

Can think of this as a “likelihood” of each letter in each position

- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G

A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4

Page 30: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Multiple Sequence Alignments

Algorithms

Page 31: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Multidimensional DP

Generalization of Needleman-Wunsh:

S(m) = i S(mi)

(sum of column scores)

F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN)

F(i1,i2,…,iN)= max(all neighbors of cube)(F(nbr)+S(nbr))

Page 32: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

• Example: in 3D (three sequences):

• 7 neighbors/cell

F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),

F(i – 1, j – 1, k ) + S(xi, xj, - ),F(i – 1, j , k – 1) + S(xi, -, xk),F(i – 1, j , k ) + S(xi, -, - ),F(i , j – 1, k – 1) + S( -, xj, xk),F(i , j – 1, k ) + S( -, xj, - ),F(i , j , k – 1) + S( -, -, xk) }

Multidimensional DP

Page 33: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Running Time:

1. Size of matrix: LN;

Where L = length of each sequence N = number of sequences

2. Neighbors/cell: 2N – 1

Therefore………………………… O(2N LN)

Multidimensional DP

Page 34: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Running Time:

1. Size of matrix: LN;

Where L = length of each sequence N = number of sequences

2. Neighbors/cell: 2N – 1

Therefore………………………… O(2N LN)

Multidimensional DP

• How do gap states generalize?

• VERY badly! Require 2N – 1 states, one per combination of

gapped/ungapped sequences Running time: O(2N 2N LN) = O(4N LN)

XY XYZ Z

Y YZ

X XZ

Page 35: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new

alignment with associated profile presult

Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles

x

w

y

z

pxy

pzwpxyzw

Page 36: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new

alignment with associated profile presult

Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles

x

w

y

z

Example

Profile: (A, C, G, T, -)px = (0.8, 0.2, 0, 0, 0)py = (0.6, 0, 0, 0, 0.4)

s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)

Result: pxy = (0.7, 0.1, 0, 0, 0.2)

s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)

Result: px- = (0.4, 0.1, 0, 0, 0.5)

Page 37: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Progressive Alignment

• When evolutionary tree is unknown:

Perform all pairwise alignments Define distance matrix D, where D(x, y) is a measure of evolutionary

distance, based on pairwise alignment Construct a tree (UPGMA / Neighbor Joining / Other methods) Align on the tree

x

w

y

z?

Page 38: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Heuristics to improve alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Page 39: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Iterative Refinement

One problem of progressive alignment:• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTTy: GAC-TT

z: GAACTGw: GTACTG

Frozen!

Now clear correct y = GA-CTT

Page 40: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Iterative Refinement

Algorithm (Barton-Stenberg):

1. For j = 1 to N,Remove xj, and realign to x1…

xj-1xj+1…xN

2. Repeat 4 until convergence

x

y

z

x,z fixed projection

allow y to vary

Page 41: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA

After realigning y:

x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA

Page 42: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Iterative Refinement

Example not handled well:

x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTA

z: GAACTGAw: GTACTGA

Realigning any single yi changes nothing

Page 43: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Consistency

z

x

y

xi

yj yj’

zk

Page 44: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Consistency

Basic method for applying consistency

• Compute all pairs of alignments xy, xz, yz, …

• When aligning x, y during progressive alignment,

For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz) Align x and y with DP using the modified s(.,.) function

z

x

y

xi

yj yj’

zk

Page 45: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Real-world protein aligners

• MUSCLE High throughput One of the best in accuracy

• ProbCons High accuracy Reasonable speed

Page 46: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

MUSCLE at a glance

1. Fast measurement of all pairwise distances between sequences • DDRAFT(x, y) defined in terms of # common k-mers (k~3) – O(N2 L logL) time

2. Build tree TDRAFT based on those distances, with UPGMA

3. Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT• Only perform alignment steps for the parts of the tree that have changed

4. Measure new Kimura-based distances D(x, y) based on MDRAFT

5. Build tree T based on D

6. Progressive alignment over T, to build M

7. Iterative refinement; for many rounds, do:• Tree Partitioning: Split M on one branch and realign the two resulting profiles• If new alignment M’ has better sum-of-pairs score than previous one, accept

Page 47: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

PROBCONS at a glance

1. Computation of all posterior matrices Mxy : Mxy(i, j) = Prob(xi ~ yj), using a HMM

2. Re-estimation of posterior matrices M’xy with probabilistic consistency• M’xy(i, j) = 1/N sequence z k Mxz(i, k) Myz (j, k); M’xy = Avgz(MxzMzy)

3. Compute for every pair x, y, the maximum expected accuracy alignment• Axy: alignment that maximizes aligned (i, j) in A M’xy(i, j)• Define E(x, y) = aligned (i, j) in Axy M’xy(i, j)

4. Build tree T with hierarchical clustering using similarity measure E(x, y)

5. Progressive alignment on T to maximize E(.,.)

6. Iterative refinement; for many rounds, do:• Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each

sequence and realign the two resulting profiles

Page 48: Fragment Assembly

CS262 Lecture 9, Win07, Batzoglou

Some Resources

Genome Resources

Annotation and alignment genome browser at UCSChttp://genome.ucsc.edu/cgi-bin/hgGateway

Specialized VISTA alignment browser at LBNLhttp://pipeline.lbl.gov/cgi-bin/gateway2

ABC—Nice Stanford tool for browsing alignmentshttp://encode.stanford.edu/~asimenos/ABC/

Protein Multiple Aligners

http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable

http://probcons.stanford.edu/ PROBCONS – most accurate