Assembly - University of California, San Diego

Assembly

Assembling with Repeats

Mate Pairs

Whole genome shotgun

• Input:
  • Shotgun sequence fragments (reads)
  • Mate pairs
• Output:
  • A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate pairs (Phrap, CAP, ..)
• Second generation: CA, Arachne, Euler
• We will discuss Arachne, a freely available sequence assembler (2nd generation)

Arachne: Details

• Initial processing
• Alignment module

Alignment Module

• Input: Collection of DNA sequences of arbitrary length
• Output: Pairwise alignments between them

Overlap detection

• Option 1: Compute an alignment between every pair.
  • G = 150 Mb, L = 500
  • Coverage LN/G = 10
  • N = 10 * 150 * 10^6 / 500 = 3 * 10^6
• Not good! (Only a small fraction are true overlaps)
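
To make "not good" concrete, the arithmetic above can be checked with a few lines of Python. The function names are illustrative only; the numbers are the ones on the slide.

```python
def read_count(genome_len, read_len, coverage):
    # coverage = L * N / G, so N = coverage * G / L
    return coverage * genome_len // read_len


def all_pairs(n):
    # number of unordered read pairs an all-vs-all comparison would align
    return n * (n - 1) // 2


N = read_count(genome_len=150_000_000, read_len=500, coverage=10)
print(N)             # 3000000
print(all_pairs(N))  # 4499998500000, i.e. about 4.5 * 10^12 alignments
```

With roughly 4.5 * 10^12 candidate pairs, a full alignment of every pair is impractical, which motivates the k-mer filter.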

K-mer based overlap

• A 25-bp sequence appears at most once in the genome!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!
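
A rough back-of-the-envelope check of the uniqueness claim, under the assumption (not stated on the slide) that the genome is a uniformly random string over A, C, G, T. Real genomes contain repeats, so this is an idealization.

```python
K = 25
GENOME_LEN = 150_000_000  # G from the overlap-detection slide

# Number of distinct k-mers over a 4-letter alphabet.
distinct_kmers = 4 ** K
print(distinct_kmers)  # 1125899906842624

# Expected number of chance occurrences of one fixed 25-mer in a random
# genome of length G (ignoring end effects and the reverse strand).
expected_chance_hits = GENOME_LEN / distinct_kmers
print(expected_chance_hits < 1e-6)  # True
```

Under this model a shared 25-mer between two reads is very unlikely to be a coincidence, so it is strong evidence of a true overlap or of a repeat.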

Sorting k-mers

• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries:
  • K-mer
  • Sequence number
  • Position in the sequence
  • Reverse complementation flag
• Sort a vector of these according to k-mer
• If the number of records for a k-mer exceeds a threshold, discard them (why?)
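
The steps above can be sketched as follows. This is a minimal illustration, not Arachne's implementation: the names `KmerRecord`, `kmer_records`, `candidate_pairs` and `max_occurrences` are made up for this example, and positions on the reverse strand are measured within the reverse-complemented read.

```python
from collections import namedtuple
from itertools import groupby

# The four-entry record from the slide.
KmerRecord = namedtuple("KmerRecord", "kmer seq_id pos is_rc")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """One record per k-mer of each read and of its reverse complement."""
    records = []
    for seq_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append(KmerRecord(seq[pos:pos + k], seq_id, pos, is_rc))
    records.sort()  # tuples sort by k-mer first, so equal k-mers become adjacent
    return records


def candidate_pairs(records, max_occurrences):
    """Pairs of reads sharing a k-mer, skipping over-represented k-mers."""
    pairs = set()
    for _, group in groupby(records, key=lambda r: r.kmer):
        group = list(group)
        if len(group) > max_occurrences:
            continue  # too frequent: probably a repeat, not a true overlap
        ids = sorted({r.seq_id for r in group})
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs


reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
records = kmer_records(reads, k=5)
print(candidate_pairs(records, max_occurrences=3))  # {(0, 1)}
```

One plausible answer to the "(why?)": a k-mer with very many records most likely comes from a repeat, and keeping it would generate a quadratic number of spurious candidate pairs.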

Phases 2-4 of the Alignment module

• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine alignment via banded DP
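
As an illustration of the first step only (coalescing k-mer hits), here is a small sketch. It assumes each hit is an exact k-mer match given as a pair of start positions `(pos_a, pos_b)` in two reads, and it merges hits that lie on the same diagonal `pos_a - pos_b` and overlap or abut. The function name and the hit representation are assumptions for this example; the graph construction and banded DP are not shown.

```python
from collections import defaultdict


def coalesce_hits(hits, k):
    """Merge k-mer hits on the same diagonal into gap-free segments.

    Returns sorted (start_a, start_b, length) tuples.
    """
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for s in starts[1:]:
            if s <= seg_end:
                # hit overlaps or abuts the current segment: extend it
                seg_end = max(seg_end, s + k)
            else:
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = s, s + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return sorted(segments)


hits = [(0, 10), (1, 11), (2, 12), (8, 18), (20, 5)]
print(coalesce_hits(hits, k=4))  # [(0, 10, 6), (8, 18, 4), (20, 5, 4)]
```

The three consecutive hits on diagonal -10 collapse into one 6-bp segment, while the hit at (8, 18) stays separate because it does not touch that segment.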

Detecting Chimeric reads

• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b), and G(b,c)
• Find a point x across which only transitive overlaps occur. x is a point of chimerism
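
One way to read the last bullet, from the point of view of a single read: a point x is suspicious if no good overlap with another read extends across it, so the two sides of x are connected only through the read itself. The sketch below follows that reading. The interval representation, the `margin` parameter and the function name are assumptions made for this example, not details taken from Arachne.

```python
def chimeric_points(read_len, good_overlaps, margin=10):
    """Positions x of a read that no good overlap spans.

    good_overlaps: list of (start, end) half-open intervals on this read,
    one per good overlap with another read.
    A position x counts as spanned only if some interval covers
    [x - margin, x + margin].
    """
    points = []
    for x in range(margin, read_len - margin + 1):
        spanned = any(s <= x - margin and x + margin <= e for s, e in good_overlaps)
        if not spanned:
            points.append(x)
    return points


# Overlaps cover the left half and the right half, but nothing crosses the middle.
suspects = chimeric_points(100, [(0, 50), (50, 100)])
print(suspects[0], suspects[-1])  # 41 59

# A third read that crosses the middle removes the suspicion.
print(chimeric_points(100, [(0, 50), (50, 100), (30, 70)]))  # []
```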

Repeats

Contig assembly

• Reads are merged into contigs up to repeat boundaries.
• If (a,b) & (a,c) overlap, (b,c) should overlap as well. Also,
  • shift(a,c) = shift(a,b) + shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some Repeat boundary.
• Some contigs might be entirely within repeats. These must be detected
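
The shift condition above can be checked mechanically. In this sketch `shift[(a, b)]` is assumed to be the offset of read b's start relative to read a's start, as reported by the pairwise alignment; the dictionary layout, the `tolerance` parameter and the function name are illustrative assumptions.

```python
from itertools import permutations


def inconsistent_triples(shift, tolerance=2):
    """Triples (a, b, c) where shift(a,c) != shift(a,b) + shift(b,c)."""
    reads = sorted({r for pair in shift for r in pair})
    bad = []
    for a, b, c in permutations(reads, 3):
        if (a, b) in shift and (b, c) in shift and (a, c) in shift:
            if abs(shift[a, c] - (shift[a, b] + shift[b, c])) > tolerance:
                bad.append((a, b, c))
    return bad


consistent = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 250}
print(inconsistent_triples(consistent))  # []

conflicting = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 400}
print(inconsistent_triples(conflicting))  # [('a', 'b', 'c')]
```

A violated triple suggests that at least one of the three overlaps is induced by a repeat, which is where merging should stop.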

Detecting Repeat Contigs 1: Read Density

• Compute the log-odds ratio of two hypotheses:
  • H1: The contig is from a unique region of the genome.
  • H2: The contig is from a region that is repeated at least twice
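
The slide does not give the formula, so the following is only a sketch under stated assumptions: read starts are modeled as a Poisson process with rate N/G per base, and H2 is simplified to "exactly two copies", which doubles the expected read count. Under these assumptions the log-odds of H1 against H2 for a contig of length `contig_len` containing `k` reads reduces to `lam - k * ln 2`, with `lam = contig_len * N / G`.

```python
import math


def unique_vs_repeat_log_odds(contig_len, reads_in_contig, total_reads, genome_len):
    """ln P(k | unique) - ln P(k | two copies) under a Poisson read model."""
    lam = contig_len * total_reads / genome_len  # expected reads if unique
    # Poisson(k; lam) / Poisson(k; 2 * lam) = exp(lam) / 2**k
    return lam - reads_in_contig * math.log(2)


# N = 3 * 10^6 reads on a 150 Mb genome gives 0.02 read starts per base.
# A 10 kb contig is then expected to hold about 200 reads if it is unique.
print(unique_vs_repeat_log_odds(10_000, 200, 3_000_000, 150_000_000) > 0)  # True
print(unique_vs_repeat_log_odds(10_000, 400, 3_000_000, 150_000_000) < 0)  # True
```

A clearly positive value favors a unique region; a clearly negative value (too many reads for the contig's length) favors a collapsed repeat.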

Creating Super Contigs

Supercontig assembly

• Supercontigs are built incrementally
  • Initially, each contig is a supercontig.
  • In each round, a pair of supercontigs is merged, until no more merges can be performed.
  • Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
  • Score has two terms:
    • A reward for multiple mate-pair links
    • A penalty for distance between the links.

Supercontig merging

• Remove the top scoring pair (S1,S2) from the priority queue.
  • Merge (S1,S2) to form supercontig T.
  • Remove all pairs in Q containing S1 or S2
• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
• Detect Repeated Supercontigs and remove them
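
The merging loop can be sketched with Python's `heapq`. Everything specific here is an assumption for illustration: the score (`reward` per link minus `penalty` times the mean implied gap), the `min_links` cutoff for "mergeable", string ids for contigs, and the simplification that the links of T to W are just the links of S1 and S2 to W pooled together. Stale queue entries are skipped when popped rather than removed eagerly, and repeat detection is left out.

```python
import heapq


def score(gaps, reward=1.0, penalty=0.001):
    # reward for the number of links, penalty for the mean implied gap
    return reward * len(gaps) - penalty * sum(abs(g) for g in gaps) / len(gaps)


def build_supercontigs(contigs, links, min_links=2):
    """contigs: string ids; links: {(a, b): [gap estimate per mate-pair link]}."""
    members = {c: [c] for c in contigs}  # supercontig id -> its contigs
    links = {frozenset(pair): list(gaps) for pair, gaps in links.items()}
    heap = []

    def push(a, b):
        gaps = links[frozenset((a, b))]
        if len(gaps) >= min_links:  # only 'mergeable' pairs enter the queue
            heapq.heappush(heap, (-score(gaps), a, b))  # max-heap via negation

    for pair in links:
        push(*sorted(pair))

    merged_count = 0
    while heap:
        _, s1, s2 = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue  # stale entry: one side was already merged away
        t = f"T{merged_count}"
        merged_count += 1
        members[t] = members.pop(s1) + members.pop(s2)
        del links[frozenset((s1, s2))]
        pooled = {}
        for pair in list(links):
            if s1 in pair or s2 in pair:
                (w,) = pair - {s1, s2}
                pooled.setdefault(w, []).extend(links.pop(pair))
        for w, gaps in pooled.items():
            links[frozenset((t, w))] = gaps
            push(t, w)
    return sorted(sorted(m) for m in members.values())


links = {("A", "B"): [100, 120, 110], ("B", "C"): [500, 520], ("C", "D"): [50]}
print(build_supercontigs(["A", "B", "C", "D"], links))  # [['A', 'B', 'C'], ['D']]
```

In the example, A and B merge first (three close links), the result then merges with C, and D stays alone because a single link is below the assumed `min_links` cutoff.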

Repeat Supercontigs

• If the distance between two supercontigs is not correct, they are marked as Repeated
• If transitivity is not maintained, then there is a Repeat

Filling gaps in Supercontigs

Consensus Derivation

• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
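
A deliberately simplified picture of the last step: once reads are laid out in a multiple alignment, each column votes on the consensus base. The sketch below assumes a gap-free layout in which each read is given with its start offset and the reads cover the region without holes; handling of gaps and quality scores, which a real assembler needs, is omitted. The function name is made up for this example.

```python
from collections import Counter


def consensus(placed_reads):
    """placed_reads: list of (offset, read). Majority vote per column."""
    columns = {}
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns.setdefault(offset + i, Counter())[base] += 1
    return "".join(columns[p].most_common(1)[0][0] for p in sorted(columns))


layout = [
    (0, "ACGTAC"),
    (2, "GTTCGG"),  # carries a sequencing error at column 4 (T instead of A)
    (4, "ACGGTT"),
]
print(consensus(layout))  # ACGTACGGTT
```

The erroneous T in the second read is outvoted by the two reads that agree on A at that column.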

Summary

• Whole genome shotgun is now routine:
  • Human, Mouse, Rat, Dog, Chimpanzee..
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function.
  • Comparative genomics offers low hanging fruit

The central dogma again

[Figure labels: Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly]

Much other analysis is possible

[Figure labels: Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly, ncRNA, Genomic Analysis/Pop. Genetics]

A static picture of the cell is insufficient

• Each cell is continuously active:
  • Genes are being transcribed into RNA
  • RNA is translated into proteins
  • Proteins are post-translationally (PT) modified and transported
  • Proteins perform various cellular functions
• Can we probe the cell dynamically?