
Brendel Group Presentation: 4 Mar 2013


DESCRIPTION

A review of the 3 transcript reconstruction modes available in the Trinity RNA-Seq package.


Transcript reconstruction algorithms available in the Trinity RNA-Seq package

Daniel Standage

Brendel Group, Indiana University

4 Mar 2014

Daniel Standage (Brendel Group @ IU) Trinity Assembly 4 Mar 2014 1 / 24


Introduction RNA-Seq

RNA-Seq

Examination of transcriptomes

deep

effective

affordable


Introduction RNA-Seq

RNA-Seq

High throughput comes at the expense of contiguity... well, at least for now.


Introduction Assembly with Trinity

Transcriptome assembly

In the absence of full-length transcript sequences, reconstruct full-length sequences from fragments.


Introduction Assembly with Trinity

Trinity RNA-Seq

Now with 3 transcript reconstruction modes!

Butterfly (default)

--PasaFly

--CuffFly


Introduction Assembly with Trinity

Review outline

Trinity algorithm

PASA algorithm

Cufflinks algorithm

Discussion


Trinity Inchworm

Step 1: Inchworm

Assemble unique contigs representing transcript subsequences.

Often produces dominant isoform in full length, and then just unique portions of alternative isoforms.


Trinity Inchworm

Inchworm procedure

1 Create dictionary of k-mers (k = 25)

2 Remove k-mers containing probable errors (based on coverage?)

3 Select the most abundant k-mer

4 Build contig by extending k-mer (find the most abundant k-mer with k − 1 bp overlap, extend 1 bp), remove k-mer from dictionary

5 Repeat previous step until the contig cannot be extended further, report contig

6 Repeat steps 3-5 until all k-mers are exhausted
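
The greedy extension above can be sketched in a few lines of Python. This is a minimal illustration, not Inchworm's actual implementation: the function names, the `min_count` threshold (a stand-in for the error-removal step), and the tie-breaking rule are assumptions, and reverse complements are ignored.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Step 1: dictionary of k-mers with their occurrence counts."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def inchworm_contigs(reads, k=25, min_count=1):
    """Greedy contig construction (requires k >= 2)."""
    counts = kmer_counts(reads, k)
    # Step 2 (assumed stand-in): drop k-mers seen fewer than min_count times.
    counts = Counter({km: c for km, c in counts.items() if c >= min_count})
    contigs = []
    while counts:                                   # Step 6
        # Step 3: most abundant k-mer; ties broken by sequence (assumption).
        seed, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        del counts[seed]
        contig = seed
        # Steps 4-5: extend to the right one base at a time.
        while True:
            suffix = contig[-(k - 1):]
            cands = [(counts[suffix + b], b) for b in "ACGT" if suffix + b in counts]
            if not cands:
                break
            _, base = max(cands)
            del counts[suffix + base]               # each k-mer is used once
            contig += base
        # Steps 4-5: extend to the left in the same way.
        while True:
            prefix = contig[:k - 1]
            cands = [(counts[b + prefix], b) for b in "ACGT" if b + prefix in counts]
            if not cands:
                break
            _, base = max(cands)
            del counts[base + prefix]
            contig = base + contig
        contigs.append(contig)
    return contigs
```

For example, `inchworm_contigs(["ATGGCGTG", "GCGTGCA"], k=4)` merges the two overlapping reads into the single contig `ATGGCGTGCA`.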


Trinity Chrysalis

Step 2: Chrysalis

Group Inchworm contigs, construct de Bruijn graph for each cluster.

Each connected component of the graph corresponds to one or more genes with shared sequence.


Trinity Chrysalis

Chrysalis procedure

1 Group contigs if they share perfect overlap of k − 1 bp (with reads supporting the overlap)

2 Build de Bruijn graph with k − 1 word size for nodes, k for edges; edges weighted by supporting reads

3 Assign each read to component with which it shares the largest number of k-mers
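
A rough Python sketch of these three steps follows. It is a simplification under stated assumptions: contigs are grouped whenever they share any (k − 1)-mer (the read-support check on the overlap is omitted), and all function names are hypothetical rather than taken from Chrysalis.

```python
from collections import Counter, defaultdict

def words(seq, size):
    """Set of all substrings of the given length."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}

def cluster_contigs(contigs, k):
    """Step 1 (simplified): union-find grouping of contigs sharing a (k-1)-mer."""
    parent = list(range(len(contigs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    first_seen = {}                         # (k-1)-mer -> first contig containing it
    for idx, contig in enumerate(contigs):
        for w in words(contig, k - 1):
            if w in first_seen:
                parent[find(idx)] = find(first_seen[w])
            else:
                first_seen[w] = idx
    clusters = defaultdict(list)
    for idx in range(len(contigs)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

def de_bruijn_edges(contigs, reads, k):
    """Step 2: nodes are (k-1)-mers, edges are contig k-mers weighted by read count."""
    read_counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {(km[:-1], km[1:]): read_counts[km]
            for c in contigs for km in words(c, k)}

def assign_read(read, contigs, clusters, k):
    """Step 3: index of the cluster sharing the most k-mers with the read, or None."""
    read_kmers = words(read, k)
    best, best_shared = None, 0
    for cid, members in enumerate(clusters):
        cluster_kmers = set().union(*(words(contigs[i], k) for i in members))
        shared = len(read_kmers & cluster_kmers)
        if shared > best_shared:
            best, best_shared = cid, shared
    return best
```

With `contigs = ["ATGGCGTGCA", "GCATTC", "CCCCAAAA"]` and `k = 4`, the first two contigs share the 3-mer `GCA` and land in one cluster, while the third forms its own.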


Trinity Butterfly

Step 3: Butterfly

Traverse read-supported paths in each subgraph, enumerate plausible sequences.


Trinity Butterfly

Butterfly procedure

1 Graph simplification: merge consecutive nodes in linear paths, pruning minor deviations

2 Plausible path scoring: identify paths in graph with read support

Initialize DP table with source nodes (no incoming edges)

Fill in table by extending path prefixes by one node
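
The prefix-extension idea in step 2 can be illustrated with a small Python sketch. It assumes an acyclic graph and reduces "read support" to a per-edge read count compared against a hypothetical `min_support` threshold; Butterfly's real scoring, which uses reads spanning several nodes, is richer than this.

```python
from collections import defaultdict

def enumerate_paths(edges, min_support=2):
    """edges maps (u, v) to the number of reads supporting that edge.

    Returns every path that starts at a source node and cannot be
    extended further across a supported edge. Assumes the graph is a DAG.
    """
    succ = defaultdict(list)
    nodes, has_incoming = set(), set()
    for (u, v), weight in edges.items():
        nodes.update((u, v))
        has_incoming.add(v)            # sources are judged on the full graph
        if weight >= min_support:
            succ[u].append(v)          # only supported edges can be traversed

    # Initialize the table with source nodes (no incoming edges).
    table = [(s,) for s in sorted(nodes - has_incoming)]
    complete = []
    while table:
        next_table = []
        for prefix in table:
            extensions = succ.get(prefix[-1], [])
            if not extensions:
                complete.append(prefix)        # nothing left to extend
            for v in sorted(extensions):
                next_table.append(prefix + (v,))   # extend prefix by one node
        table = next_table
    return sorted(complete)
```

For a bubble `A→B→{C,D}→E` where the `D` branch is supported by a single read, `min_support=2` yields only `A, B, C, E`, while `min_support=1` yields both alternatives.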


PASA

PASA

Program to Assemble Spliced Alignments

designed for ESTs and FL-cDNAs (pre-NGS era)

works on sequence alignments

computes consensus spliced alignments


PASA

PASA algorithm

Input: a set of spliced cDNA alignments A

Output: for each alignment a ∈ A, the largest assembly containing a

1 Sort alignments

2 Test overlapping alignments for compatibility

3 Build DP table, backtrace to find maximal assembly A∗

4 If ∃ a′ ∉ A∗, build reciprocal DP table, trace to enumerate additional assemblies
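
The compatibility test in step 2 can be sketched as follows. This is one plausible reading, not PASA's code: an alignment is modeled as a sorted list of inclusive exon intervals, two alignments are called compatible when they agree on every intron touching their shared region, strand is ignored, and non-overlapping alignments are reported as not compatible by convention. All names are hypothetical.

```python
def introns(exons):
    """Gaps between consecutive exons, as inclusive (start, end) pairs."""
    return {(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])}

def compatible(a, b):
    """True if overlapping alignments a and b imply the same splicing.

    a and b are sorted lists of inclusive (start, end) exon intervals.
    """
    lo = max(a[0][0], b[0][0])       # shared region of the two spans
    hi = min(a[-1][1], b[-1][1])
    if lo > hi:
        return False                 # spans do not overlap: nothing to test

    def in_overlap(exons):
        # Introns that touch the shared region.
        return {(s, e) for s, e in introns(exons) if s <= hi and e >= lo}

    # An exon of one alignment sitting inside an intron of the other
    # shows up here as an intron present in only one of the two sets.
    return in_overlap(a) == in_overlap(b)
```

For instance, `[(100, 200), (300, 400), (500, 600)]` is compatible with `[(150, 200), (300, 400)]` (same intron in the shared region) but not with `[(150, 250), (300, 400)]` (different splice site).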


PASA

PASA algorithm

Recurrences

$L_a = \max_b \{ C_a,\ L_b + C_{a/b} \}$

$R_a = \max_b \{ C_a,\ R_b + C_{a/b} \}$

$L_a$, $R_a$: maximum number of cDNAs in an assembly that contains alignment $a$, starting from left and right (respectively)

$C_a$: number of $a$-compatible alignments in the span of $a$

$C_{a/b}$: number of $a$-compatible alignments in the span of $a$ but not in the span of $b$


Cufflinks

Cufflinks

designed for short transcript reads (NGS era)

works on read alignments (mappings)

identifies the smallest number of transcripts that “explain” the read mappings


Cufflinks

Cufflinks algorithm

Input: overlap graph G′ of mapped reads

Output: a minimum path cover of G′, with each path corresponding to a single assembled transcript

1 Alignments divided into non-overlapping loci

2 Erroneous read alignments removed

3 Compute transitive reduction of G′, G

4 Construct bipartite graph G∗ from transitive closure of G, with edges weighted by coverage to “phase” distant exons by their coverage

5 Compute minimum-cost maximum matching in G∗, which corresponds to minimum path cover of G′
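
The link between matching and path cover in steps 4 and 5 can be shown with an unweighted Python sketch. It is an assumption-laden simplification: the coverage-based edge costs are left out, so it finds a maximum matching rather than a minimum-cost one, and the function names are hypothetical. Because the bipartite graph comes from the transitive closure, a path may skip over intermediate nodes.

```python
def reachable(graph):
    """Transitive closure of a DAG given as {node: [successors]}."""
    closure = {}

    def visit(u):
        if u not in closure:
            closure[u] = set()
            for v in graph.get(u, []):
                closure[u] |= {v} | visit(v)
        return closure[u]

    for u in graph:
        visit(u)
    return closure

def min_path_cover(graph):
    """Fewest paths (in the transitive closure) covering every node once."""
    closure = reachable(graph)
    nodes = sorted(closure)
    match_right = {}                 # right-side node -> matched left-side node

    def augment(u, seen):
        # Kuhn's augmenting-path search from left-side node u.
        for v in sorted(closure[u]):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in nodes:
        augment(u, set())

    # Each matched pair (u, v) means "v follows u on the same path";
    # unmatched right-side nodes are the path starts.
    successor = {u: v for v, u in match_right.items()}
    paths = []
    for start in (u for u in nodes if u not in match_right):
        path = [start]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths
```

The number of paths equals the number of nodes minus the size of the matching. For `{"a": ["b"], "b": ["c", "d"], "c": [], "d": []}` the result is the two paths `a, b, c` and `d`.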


Discussion

Three different construction approaches

Butterfly: enumerate all plausible transcripts with minimal read support

PASA: for each alignment, find largest assembly (transcript) containing the alignment

Cufflinks: find minimal assembl(y|ies) that explain the data, using read coverage to “phase” distant exons


Discussion

Next time: comparison of 8 Trinity assemblies

Four assembly settings

Butterfly

--PasaFly

--CuffFly

Butterfly, --min_kmer_cov 2

Two input data sets

Groomed data

Groomed data with digital normalization


Discussion

Next time: comparison of 8 Trinity assemblies

Hypotheses (transcripts per assembly)

Butterfly > PasaFly > CuffFly

Diginorm > No diginorm


Discussion

Thank you!
