Whole genome shotgun
• Input:
  • Shotgun sequence fragments (reads)
  • Mate pairs
• Output:
  • A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate-pairs (Phrap, CAP, ...)
• Second generation: CA, Arachne, Euler
• We will discuss Arachne, a freely available sequence assembler (2nd generation)
Arachne: Details
• Initial processing
• Alignment module
Alignment Module
• Input: Collection of DNA sequences of arbitrary length
• Output: Pairwise alignments between them.
Overlap detection
• Option 1: Compute an alignment between every pair.
  • G = 150 Mb, L = 500
  • Coverage LN/G = 10
  • N = 10 * 150 * 10^6 / 500 = 3 * 10^6
  • Not good! (Only a small fraction are true overlaps)
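The arithmetic on this slide can be checked with a short sketch. The variable names are illustrative, and the count of unordered read pairs, N(N-1)/2, is an added step that the slide does not state; it only shows why aligning every pair is impractical.

```python
# Values taken from the slide: genome size G, read length L, coverage L*N/G.
G = 150 * 10**6   # genome length in bp (150 Mb)
L = 500           # read length in bp
coverage = 10     # target coverage L*N/G

# Number of reads needed to reach the target coverage.
N = coverage * G // L

# Number of unordered read pairs an all-against-all alignment would examine.
pairs = N * (N - 1) // 2

print(N)      # 3000000
print(pairs)  # 4499998500000
```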
K-mer based overlap
• A 25-bp sequence appears at most once in the genome!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!
Sorting k-mers
• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries:
  • K-mer
  • Sequence number
  • Position in the sequence
  • Reverse complementation flag
• Sort a vector of these according to k-mer
• If the number of records for a k-mer exceeds a threshold, discard them (why?)
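The record-and-sort idea above might be sketched as follows. This is a minimal illustration, not Arachne's implementation: the function names, the `max_records` parameter, and the use of Python tuples for the four-entry records are all assumptions made for the example.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """Build (kmer, read_index, position, is_reverse) records for every read
    and for its reverse complement."""
    records = []
    for idx, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], idx, pos, is_rc))
    return records

def candidate_pairs(reads, k, max_records):
    """Sort the records by k-mer and report pairs of reads sharing a k-mer.

    k-mers with more than max_records records (hypothetical threshold)
    are discarded before pairing.
    """
    records = sorted(kmer_records(reads, k))
    pairs = set()
    for _, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue
        read_ids = sorted({r[1] for r in group})
        for i, a in enumerate(read_ids):
            for b in read_ids[i + 1:]:
                pairs.add((a, b))
    return pairs

reads = ["ACGTTGCAAG", "TGCAAGGCTT", "CCCCCTTTTT"]
print(candidate_pairs(reads, k=5, max_records=10))
```

Sorting brings identical k-mers next to each other, so candidate overlaps are found in one pass over the sorted vector instead of by comparing every pair of reads.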
Phase 2-4 of Alignment module
• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine alignment via banded DP
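The first step, coalescing k-mer hits into gap-free partial alignments, could look roughly like this. The sketch assumes each hit is a pair of start positions `(pos_a, pos_b)` of an exact shared k-mer in two reads; hits on the same diagonal (`pos_a - pos_b`) that overlap or abut are merged. The function name and the output format are hypothetical, and the graph and banded-DP phases are not shown.

```python
from collections import defaultdict

def coalesce_hits(hits, k):
    """Merge exact k-mer hits (pos_a, pos_b) into gap-free segments.

    Returns a sorted list of (start_a, start_b, length) tuples.
    """
    # Hits on the same diagonal belong to the same gap-free alignment.
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for s in starts[1:]:
            if s <= seg_end:
                # This hit overlaps or abuts the current segment: extend it.
                seg_end = max(seg_end, s + k)
            else:
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = s, s + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return sorted(segments)

hits = [(0, 3), (1, 4), (2, 5), (10, 13), (7, 2)]
print(coalesce_hits(hits, k=4))  # [(0, 3, 6), (7, 2, 4), (10, 13, 4)]
```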
Detecting Chimeric reads
• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b), and G(b,c)
• Find a point x across which only transitive overlaps occur. x is a point of chimerism
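One possible reading of this rule, sketched below under stated assumptions: within a single read, each good overlap with another read covers an interval of positions, and x is a candidate point of chimerism when good overlaps exist on both sides of x but none of them spans x. The interval representation and the function name are assumptions for illustration; the slide does not define them.

```python
def chimeric_points(read_length, overlaps):
    """Return candidate points of chimerism in one read.

    overlaps: (start, end) half-open intervals of the read covered by good
    overlaps with other reads. A position x qualifies when some overlap ends
    at or before x, some overlap starts at or after x, and no overlap spans x.
    """
    points = []
    for x in range(1, read_length):
        left = any(end <= x for _, end in overlaps)
        right = any(start >= x for start, _ in overlaps)
        spanned = any(start < x < end for start, end in overlaps)
        if left and right and not spanned:
            points.append(x)
    return points

print(chimeric_points(10, [(0, 5), (5, 10)]))  # [5]
print(chimeric_points(10, [(0, 6), (4, 10)]))  # []
```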
Contig assembly
• Reads are merged into contigs up to repeat boundaries.
• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,
  • shift(a,c) = shift(a,b) + shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
• Some contigs might be entirely within repeats. These must be detected
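The shift-consistency condition above can be checked mechanically. In this sketch, `shifts` is assumed to map an ordered pair of reads `(a, b)` to the offset of b's start relative to a's start; the `tolerance` parameter is hypothetical and only allows for small alignment slack.

```python
from itertools import combinations

def inconsistent_triples(shifts, tolerance=0):
    """Return triples (a, b, c) where shift(a,c) != shift(a,b) + shift(b,c)."""
    reads = sorted({r for pair in shifts for r in pair})
    bad = []
    for a, b, c in combinations(reads, 3):
        if (a, b) in shifts and (b, c) in shifts and (a, c) in shifts:
            expected = shifts[(a, b)] + shifts[(b, c)]
            if abs(shifts[(a, c)] - expected) > tolerance:
                bad.append((a, b, c))
    return bad

good = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 250}
bad = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 400}
print(inconsistent_triples(good))  # []
print(inconsistent_triples(bad))   # [('a', 'b', 'c')]
```

A violated triple suggests that at least one of the three overlaps is a false overlap, for example one induced by a repeat.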
Detecting Repeat Contigs 1: Read Density
• Compute the log-odds ratio of two hypotheses:
  • H1: The contig is from a unique region of the genome.
  • H2: The contig is from a region that is repeated at least twice
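A minimal sketch of such a log-odds score follows. It assumes, beyond what the slide says, that read start positions follow a Poisson distribution and that the repeat hypothesis means exactly two copies. Under those assumptions the expected number of reads in a unique contig of length l is lam = N*l/G, and the log-odds reduces to lam - k*ln(2) for k observed reads.

```python
import math

def read_density_log_odds(num_reads, contig_length, total_reads, genome_length):
    """Natural-log odds of 'unique' versus 'two-copy repeat' for a contig.

    Assumes a Poisson model: log P(k | lam) - log P(k | 2*lam) = lam - k*ln(2),
    where lam is the expected read count if the contig is unique.
    """
    expected_unique = total_reads * contig_length / genome_length
    return expected_unique - num_reads * math.log(2)

# With N = 3 * 10^6 reads and G = 150 Mb, a 10 kb unique contig expects 200 reads.
print(read_density_log_odds(200, 10_000, 3_000_000, 150_000_000) > 0)  # True
print(read_density_log_odds(400, 10_000, 3_000_000, 150_000_000) > 0)  # False
```

A positive score favors the unique hypothesis; a strongly negative score flags a contig with roughly double the expected read density.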
Supercontig assembly
• Supercontigs are built incrementally
• Initially, each contig is a supercontig.
• In each round, a pair of supercontigs is merged, until no more merges can be performed.
• Create a Priority Queue with a score for every pair of 'mergeable supercontigs'.
  • Score has two terms:
    • A reward for multiple mate-pair links
    • A penalty for distance between the links.
Supercontig merging
• Remove the top scoring pair (S1,S2) from the priority queue.
• Merge (S1,S2) to form supercontig T.
• Remove all pairs in Q containing S1 or S2
• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
• Detect Repeated Supercontigs and remove
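The merging loop above can be sketched with Python's `heapq`. Everything specific here is an assumption for illustration: the linear `score` function and its weights, the `min_links` cutoff for what counts as mergeable, the way link counts and distances are combined after a merge, and the generated names `T0`, `T1`, ... (contig names are assumed not to collide with them). Pairs containing S1 or S2 are removed lazily, by skipping stale heap entries. Repeat detection, contig order, and orientation are not modeled.

```python
import heapq
from itertools import count

def score(num_links, distance, reward=1.0, penalty=0.001):
    """Hypothetical score: reward for mate-pair links, penalty for distance."""
    return reward * num_links - penalty * distance

def merge_supercontigs(contigs, links, min_links=2):
    """Greedy merging driven by a priority queue.

    links maps frozenset({x, y}) -> (num_links, distance).
    Returns a sorted list of supercontigs, each a sorted tuple of contig names.
    """
    members = {c: (c,) for c in contigs}   # active supercontig -> its contigs
    links = dict(links)
    tie = count()                          # tie-breaker for equal scores
    heap = []

    def push(pair):
        n, d = links[pair]
        if n >= min_links:                 # only mergeable pairs enter the queue
            heapq.heappush(heap, (-score(n, d), next(tie), tuple(sorted(pair))))

    for pair in links:
        push(pair)

    new_id = count()
    while heap:
        _, _, (s1, s2) = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue                       # stale pair containing a merged supercontig
        t = f"T{next(new_id)}"
        merged = members.pop(s1) + members.pop(s2)
        for w in list(members):            # combine links from S1 and S2 to each W
            n_total, dists = 0, []
            for s in (s1, s2):
                key = frozenset((s, w))
                if key in links:
                    n, d = links.pop(key)
                    n_total += n
                    dists.append(d)
            if n_total:
                links[frozenset((t, w))] = (n_total, min(dists))
                push(frozenset((t, w)))
        members[t] = merged
    return sorted(tuple(sorted(m)) for m in members.values())

links = {
    frozenset(("A", "B")): (5, 1000),
    frozenset(("B", "C")): (3, 2000),
    frozenset(("A", "C")): (2, 4000),
    frozenset(("C", "D")): (1, 500),
}
print(merge_supercontigs(["A", "B", "C", "D"], links))  # [('A', 'B', 'C'), ('D',)]
```

Here D stays separate because a single mate-pair link falls below the assumed `min_links` cutoff.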
Repeat Supercontigs
• If the distance between two supercontigs is not correct, they are marked as Repeated
• If transitivity is not maintained, then there is a Repeat
Consensus Derivation
• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
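As a simplified illustration of deriving a consensus from a multiple-read alignment, the sketch below takes a majority vote per column. It assumes a gap-free layout in which each read is given with its offset in the contig; real assemblers also handle gaps and base quality, which this example ignores. The function name and input format are hypothetical.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority-vote consensus over a gap-free layout.

    placed_reads: list of (offset, sequence) pairs giving each read's start
    position in the contig. Columns with no coverage are reported as 'N'.
    """
    length = max(off + len(seq) for off, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for off, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[off + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

# The second read carries one sequencing error (T instead of A at contig position 4).
layout = [(0, "ACGTAC"), (2, "GTTCGG"), (4, "ACGGTT")]
print(consensus(layout))  # ACGTACGGTT
```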
Summary
• Whole genome shotgun is now routine:
  • Human, Mouse, Rat, Dog, Chimpanzee, ...
  • Many Prokaryotes (One can be sequenced in a day)
  • Plant genomes: Arabidopsis, Rice
  • Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function.
  • Comparative genomics offers low hanging fruit
The central dogma again
[Figure: the central dogma, labeled with Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly]
Much other analysis is possible
[Figure: the central dogma, labeled with Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly, ncRNA, Genomic Analysis/Pop. Genetics]
A static picture of the cell is insufficient
• Each cell is continuously active:
  • Genes are being transcribed into RNA
  • RNA is translated into proteins
  • Proteins are post-translationally modified and transported
  • Proteins perform various cellular functions
• Can we probe the cell dynamically?