A review of the 3 transcript reconstruction modes available in the Trinity RNA-Seq package.
Transcript reconstruction algorithms available in the Trinity RNA-Seq package
Daniel Standage
Brendel Group, Indiana University
4 Mar 2014
Daniel Standage (Brendel Group @ IU) Trinity Assembly 4 Mar 2014 1 / 24
Introduction RNA-Seq
RNA-Seq
Examination of transcriptomes
deep
effective
affordable
RNA-Seq
High throughput comes at the expense of contiguity... well, at least for now.
Introduction Assembly with Trinity
Transcriptome assembly
In the absence of full-length transcript sequences, reconstruct full-length sequences from fragments.
Trinity RNA-Seq
Now with 3 transcript reconstruction modes!
Butterfly (default)
--PasaFly
--CuffFly
Review outline
Trinity algorithm
PASA algorithm
Cufflinks algorithm
Discussion
Trinity Inchworm
Step 1: Inchworm
Assemble unique contigs representing transcript subsequences.
Often produces dominant isoform in full length, and then just unique portions of alternative isoforms.
Inchworm procedure
1 Create dictionary of k-mers (k = 25)
2 Remove k-mers containing probable errors (based on coverage?)
3 Select the highest-occurring k-mer as the seed
4 Build contig by extending k-mer (find highest-occurring k-mer with k − 1 bp overlap, extend 1 bp), remove k-mer from dictionary
5 Repeat previous step until the contig cannot be extended further, report contig
6 Repeat steps 3-5 until all k-mers are exhausted
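The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not Trinity's implementation: the function name `inchworm_contigs` and the `min_count` threshold (standing in for the error-removal step) are assumptions, reverse complements are ignored, and the contig is extended in both directions from the seed.

```python
from collections import Counter

def inchworm_contigs(reads, k=25, min_count=1):
    """Greedy k-mer extension, loosely following the Inchworm steps."""
    # Step 1: dictionary of k-mers with their occurrence counts
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    # Step 2 (assumed rule): drop k-mers seen fewer than min_count times
    kmers = {km: c for km, c in counts.items() if c >= min_count}

    def best(candidates):
        # Highest count wins; ties are broken alphabetically for determinism
        return max(candidates, key=lambda km: (kmers[km], km))

    contigs = []
    while kmers:                                   # Step 6
        seed = best(kmers)                         # Step 3
        del kmers[seed]
        contig = seed
        while True:                                # Steps 4-5, to the right
            suffix = contig[-(k - 1):]
            options = [suffix + b for b in "ACGT" if suffix + b in kmers]
            if not options:
                break
            chosen = best(options)
            del kmers[chosen]
            contig += chosen[-1]
        while True:                                # Steps 4-5, to the left
            prefix = contig[:k - 1]
            options = [b + prefix for b in "ACGT" if b + prefix in kmers]
            if not options:
                break
            chosen = best(options)
            del kmers[chosen]
            contig = chosen[0] + contig
        contigs.append(contig)
    return contigs

reads = ["ATGGCGTGCA", "ATGGCGT", "GCGTGCA"]
print(inchworm_contigs(reads, k=4))
```

With these toy reads and k = 4, the seed is the k-mer covered by all three reads, and extension in both directions recovers the full sequence as a single contig.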
Trinity Chrysalis
Step 2: Chrysalis
Group Inchworm contigs, construct de Bruijn graph for each cluster.
Each connected component of the graph corresponds to one or more genes with shared sequence.
Chrysalis procedure
1 Group contigs if they share perfect overlap of k − 1 bp (with reads supporting the overlap)
2 Build de Bruijn graph with k − 1 word size for nodes, k for edges; edges weighted by supporting reads
3 Assign each read to component with which it shares the largest number of k-mers
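A rough Python sketch of the three Chrysalis steps follows. All function names are hypothetical, and the grouping rule is simplified: contigs are joined whenever they share any (k − 1)-mer, without checking that reads support the overlap.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def cluster_contigs(contigs, k):
    """Step 1 (simplified): union contigs that share a (k-1)-mer."""
    parent = list(range(len(contigs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}                      # (k-1)-mer -> first contig containing it
    for idx, contig in enumerate(contigs):
        for word in kmers(contig, k - 1):
            if word in owner:
                parent[find(idx)] = find(owner[word])
            else:
                owner[word] = idx
    groups = defaultdict(list)
    for idx in range(len(contigs)):
        groups[find(idx)].append(idx)
    return list(groups.values())

def de_bruijn(contigs, reads, k):
    """Step 2: (k-1)-mer nodes, k-mer edges weighted by read occurrences."""
    support = Counter(km for r in reads for km in kmers(r, k))
    return {(km[:-1], km[1:]): support[km]
            for c in contigs for km in kmers(c, k)}

def assign_reads(reads, components, k):
    """Step 3: give each read to the component sharing the most k-mers."""
    comp_kmers = [{km for c in comp for km in kmers(c, k)} for comp in components]
    assignment = {}
    for r in reads:
        shared = [len(set(kmers(r, k)) & ck) for ck in comp_kmers]
        best = max(range(len(components)), key=shared.__getitem__)
        if shared[best] > 0:        # reads sharing nothing stay unassigned
            assignment[r] = best
    return assignment

contigs = ["ATGGCGTGCA", "GTGCATTT", "CCCCAAAA"]
reads = ["GGCGTGC", "CCCAAAA"]
groups = cluster_contigs(contigs, k=4)
components = [[contigs[i] for i in g] for g in groups]
print(groups, assign_reads(reads, components, k=4))
```

Here the first two contigs share the 3-mer GTG and end up in one component, while the third forms its own.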
Trinity Butterfly
Step 3: Butterfly
Traverse read-supported paths in each subgraph, enumerate plausible sequences.
Butterfly procedure
1 Graph simplification: merge consecutive nodes in linear paths, pruning minor deviations
2 Plausible path scoring: identify paths in graph with read support
Initialize DP table with source nodes (no incoming edges)
Fill in table by extending path prefixes by one node
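The two Butterfly steps above can be illustrated with a toy sketch. It makes strong simplifying assumptions that are not part of Trinity: the graph is acyclic, read support is reduced to a single weight per edge with a hypothetical `min_support` threshold, and pruning of minor deviations is left out. The function names are made up for this example.

```python
from collections import defaultdict

def compact_chains(edges):
    """Merge consecutive nodes on unbranched paths of an acyclic graph.

    edges maps (u, v) to a read-support weight. The result has the same
    shape, but each node is a tuple of the original nodes it absorbed.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    nodes = set(succ) | set(pred)

    def absorbed(v):
        # v folds into its predecessor when the edge between them is unbranched
        return len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1

    chain_of = {}
    for head in nodes:
        if absorbed(head):
            continue
        chain = [head]
        while len(succ[chain[-1]]) == 1 and absorbed(succ[chain[-1]][0]):
            chain.append(succ[chain[-1]][0])
        for n in chain:
            chain_of[n] = tuple(chain)
    return {(chain_of[u], chain_of[v]): w
            for (u, v), w in edges.items() if chain_of[u] != chain_of[v]}

def supported_paths(edges, min_support=2):
    """Start from source nodes and extend path prefixes one node at a time,
    following only edges whose weight reaches min_support."""
    succ, targets = defaultdict(list), set()
    for (u, v), w in edges.items():
        targets.add(v)
        if w >= min_support:
            succ[u].append(v)
    sources = sorted({u for u, _ in edges} - targets)
    table = [[s] for s in sources]          # path prefixes still to extend
    finished = []
    while table:
        prefix = table.pop()
        nxt = succ.get(prefix[-1], [])
        if not nxt:                          # nothing supported to add
            finished.append(prefix)
        for v in nxt:
            table.append(prefix + [v])
    return finished

edges = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 4,
         ("C", "E"): 1, ("D", "F"): 4, ("E", "F"): 1}
compact = compact_chains(edges)
print(supported_paths(compact, min_support=2))
```

In this toy graph the chain A, B, C collapses into one node, and with a threshold of 2 only the path through D survives; lowering the threshold to 1 also reports the weakly supported path through E.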
PASA
PASA
Program to Assemble Spliced Alignments
designed for ESTs and FL-cDNAs (pre-NGS era)
works on sequence alignments
computes consensus spliced alignments
PASA algorithm
Input: a set of spliced cDNA alignments A
Output: for each alignment a ∈ A, the largest assembly containing a
1 Sort alignments
2 Test overlapping alignments for compatibility
3 Build DP table, backtrace to find maximal assembly A∗
4 If ∃ a′ ∉ A∗, build reciprocal DP table, trace to enumerate additional assemblies
PASA algorithm
Recurrences
$L_a = \max_b \{ C_a,\ L_b + C_{a/b} \}$
$R_a = \max_b \{ C_a,\ R_b + C_{a/b} \}$
$L_a$, $R_a$: maximum number of cDNAs in an assembly that contains alignment $a$, starting from left and right (respectively)
$C_a$: number of $a$-compatible alignments in the span of $a$
$C_{a/b}$: number of $a$-compatible alignments in the span of $a$ but not in the span of $b$
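The left-to-right recurrence can be worked through on a toy input. The sketch below is only an illustration under stated assumptions, not PASA's code: alignments are reduced to (start, end) spans already sorted by start, compatibility is supplied by the caller as a hypothetical `compatible(i, j)` predicate, and b ranges over compatible alignments that start before a, overlap it, and end before it does.

```python
def left_scores(spans, compatible):
    """Compute L_a for every alignment a, given spans sorted by start."""
    n = len(spans)

    def inside(x, a):
        # True when alignment x lies within the span of alignment a
        return spans[a][0] <= spans[x][0] and spans[x][1] <= spans[a][1]

    def count(a, exclude=None):
        # C_a when exclude is None, otherwise C_{a/b} with b = exclude
        return sum(
            1 for x in range(n)
            if inside(x, a)
            and (x == a or compatible(x, a))
            and (exclude is None or not inside(x, exclude))
        )

    L = []
    for a in range(n):
        best = count(a)                      # the C_a term
        for b in range(a):                   # candidates to the left of a
            overlaps = spans[b][1] >= spans[a][0]
            a_extends_b = spans[b][1] < spans[a][1]
            if overlaps and a_extends_b and compatible(a, b):
                best = max(best, L[b] + count(a, exclude=b))
        L.append(best)
    return L

spans = [(0, 10), (5, 15), (12, 20)]
print(left_scores(spans, lambda i, j: True))
print(left_scores(spans, lambda i, j: {i, j} != {1, 2}))
```

When all three alignments are compatible they chain into one assembly of size 3; declaring the last two incompatible breaks the chain, so the third alignment scores only itself. The R scores follow from the same routine run from the right.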
Cufflinks
Cufflinks
designed for short transcript reads (NGS era)
works on read alignments (mappings)
identifies fewest number of transcripts that “explain” the read mappings
Cufflinks algorithm
Input: overlap graph G′ of mapped reads
Output: a minimal path cover of G′, with each path corresponding to a single assembled transcript
1 Alignments divided into non-overlapping loci
2 Erroneous read alignments removed
3 Compute G, the transitive reduction of G′
4 Construct bipartite graph G∗ from the transitive closure of G, with edges weighted by coverage to “phase” distant exons
5 Compute minimum-cost maximum matching in G∗, which corresponds to a minimum path cover of G′
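The link between bipartite matching and path cover in the last step can be shown with a small sketch. It is deliberately simplified and is not how Cufflinks is implemented: the matching is unweighted (the coverage-based costs are ignored), the function name `min_path_cover` is made up, and the edge list is assumed to be the transitive closure of an acyclic graph already.

```python
def min_path_cover(n, edges):
    """Minimum path cover of a DAG on nodes 0..n-1 via bipartite matching.

    Each node appears once on the left (as a predecessor) and once on the
    right (as a successor); every matched edge glues two nodes into the
    same path, so the cover has n minus (matching size) paths.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    pred_of = [-1] * n              # pred_of[v] = node matched in front of v

    def augment(u, seen):
        # Standard augmenting-path search (Kuhn's algorithm)
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if pred_of[v] == -1 or augment(pred_of[v], seen):
                pred_of[v] = u
                return True
        return False

    for u in range(n):
        augment(u, set())

    # Follow matched edges from every node without a predecessor
    succ_of = {u: v for v, u in enumerate(pred_of) if u != -1}
    paths = []
    for start in range(n):
        if pred_of[start] != -1:
            continue
        path = [start]
        while path[-1] in succ_of:
            path.append(succ_of[path[-1]])
        paths.append(path)
    return paths

# Two alternative middle fragments (1 and 2) between shared ends 0 and 3;
# the edge (0, 3) comes from the transitive closure.
edges = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
print(min_path_cover(4, edges))
```

Nodes 1 and 2 cannot sit on the same path, so at least two paths are needed, and the matching finds exactly two. Choosing among equally small covers is where the coverage weights would come in.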
Discussion
Three different construction approaches
Butterfly: enumerate all plausible transcripts with minimal read support
PASA: for each alignment, find largest assembly (transcript) containing the alignment
Cufflinks: find minimal assembl(y|ies) that explain the data, using read coverage to “phase” distant exons
Next time: comparison of 8 Trinity assemblies
Four assembly settings
Butterfly
--PasaFly
--CuffFly
Butterfly, --min_kmer_cov 2
Two input data sets
Groomed data
Groomed data with digital normalization
Next time: comparison of 8 Trinity assemblies
Hypotheses (transcripts per assembly)
Butterfly > PasaFly > CuffFly
Diginorm > No diginorm
Thank you!