CRC Project on Robust Transcript Discovery and Quantification from Sequencing Data Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi GSU: Alex Zelikovsky, Serghei Mangul, Adrian Caciula Lifetech PI: Dumitru Brinza


Page 1: CRC Project on  Robust Transcript Discovery and  Quantification  from Sequencing Data

CRC Project on Robust Transcript Discovery and Quantification from Sequencing Data

Dec. 22, 2011 live call

UCONN: Ion Mandoiu, Sahar Al Seesi

GSU: Alex Zelikovsky, Serghei Mangul, Adrian Caciula

Lifetech PI: Dumitru Brinza

Page 2

Outline

1. SNV calling from RNA-Seq reads
2. Transcriptome reconstruction update

Page 3

SNV Calling from RNA-Seq Reads

• RNA-Seq is typically used for gene expression analysis
• SNV calling from RNA-Seq data?
• Much less expensive than genome sequencing
• Motivated by a project in personalized genomic-guided immunotherapy, where we only need expressed variants

Page 4

Hybrid Approach Based on Merging Alignments

[Flow diagram: mRNA reads are mapped both to the transcript library (giving transcript mapped reads) and to the genome (giving genome mapped reads); read merging combines the two sets into the final mapped reads.]

Page 5

Converting Transcriptome Alignments to Genome Coordinates

[Figure: transcriptome alignments are converted to genome coordinates.]
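The conversion step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes forward-strand transcripts, 0-based end-exclusive coordinates, and ungapped alignments, and the name `transcript_to_genome` is hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (genomic start, genomic end), 0-based, end-exclusive


def transcript_to_genome(exons: List[Exon], t_start: int, length: int) -> List[Exon]:
    """Map the transcript interval [t_start, t_start + length) to genomic blocks.

    `exons` lists the transcript's exons in transcript order (forward strand
    assumed). One block is returned per exon the alignment touches, so an
    alignment that spans an exon junction becomes a spliced genomic alignment.
    """
    blocks = []
    t_end = t_start + length
    offset = 0  # transcript coordinate of the current exon's first base
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        lo = max(t_start, offset)
        hi = min(t_end, offset + exon_len)
        if lo < hi:  # the alignment overlaps this exon
            blocks.append((g_start + lo - offset, g_start + hi - offset))
        offset += exon_len
    if t_end > offset:
        raise ValueError("alignment extends past the end of the transcript")
    return blocks
```

For example, with exons (100, 150) and (200, 260), a 20-base alignment starting at transcript position 40 covers the last 10 bases of the first exon and the first 10 bases of the second.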

Page 6

Merging Rules for Short Reads

Genome       Transcripts   Agree?   Hard Merge
Unique       Unique        Yes      Keep
Unique       Unique        No       Throw
Unique       Multiple      No       Throw
Unique       Not mapped    No       Keep
Multiple     Unique        No       Throw
Multiple     Multiple      No       Throw
Multiple     Not mapped    No       Throw
Not mapped   Unique        No       Keep
Not mapped   Multiple      No       Throw
Not mapped   Not mapped    Yes      Throw
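The merging rules for short reads reduce to a small decision function. The sketch below is illustrative only: the `Location` tuple and the name `hard_merge` are hypothetical, and both hit lists are assumed to be already expressed in genome coordinates.

```python
from typing import Optional, Sequence, Tuple

Location = Tuple[str, int, str]  # hypothetical (chromosome, position, strand)


def hard_merge(genome_hits: Sequence[Location],
               transcript_hits: Sequence[Location]) -> Optional[Location]:
    """Return the alignment to keep, or None if the read is thrown away."""
    g, t = len(genome_hits), len(transcript_hits)
    if g > 1 or t > 1:
        return None  # multiple hits in either mapping: throw
    if g == 1 and t == 1:
        # unique in both mappings: keep only if they agree
        return genome_hits[0] if genome_hits[0] == transcript_hits[0] else None
    if g == 1:
        return genome_hits[0]      # unique on genome, not mapped on transcripts
    if t == 1:
        return transcript_hits[0]  # not mapped on genome, unique on transcripts
    return None                    # not mapped in either: throw
```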

Page 7

Merging Local Alignments of ION Reads: HardMerge at Base-Level

• Input: SAM files with alignments from genome and transcriptome mapping

• The following alignments are filtered out
– Any local alignment of length <= 15 bases
– All alignments of a read that has alignments on different chromosomes or different strands

• Key idea: a read base mapped to multiple locations is discarded

• Output alignments are generated from contiguous stretches of non-ambiguously mapped bases, based on the unique genomic location of these bases
– Subject to the above filtering criteria
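A base-level merge along these lines might look like the sketch below. Several details are assumptions rather than the actual HardMerge implementation: alignments are represented by a hypothetical `LocalAlignment` record listing one genomic coordinate per aligned read base, the length filter is applied before the chromosome/strand check, and a stretch is broken only where a read base is ambiguous or unmapped.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple

MIN_LEN = 15  # alignments of length <= 15 bases are filtered out


class LocalAlignment(NamedTuple):
    chrom: str
    strand: str
    read_start: int        # index of the first aligned read base
    genome_pos: List[int]  # genomic coordinate of each aligned read base


def hard_merge_bases(alignments: List[LocalAlignment]) -> List[List[Tuple[int, int]]]:
    """Return stretches of (read position, genome position) pairs."""
    # Filter 1: drop short local alignments.
    kept = [a for a in alignments if len(a.genome_pos) > MIN_LEN]
    # Filter 2: drop the read if it hits several chromosomes or strands.
    if len({(a.chrom, a.strand) for a in kept}) > 1:
        return []
    # Collect every genomic location proposed for each read base.
    locations: Dict[int, Set[int]] = defaultdict(set)
    for a in kept:
        for i, g in enumerate(a.genome_pos):
            locations[a.read_start + i].add(g)
    # Key idea: a read base mapped to multiple locations is discarded.
    unique = {r: next(iter(gs)) for r, gs in locations.items() if len(gs) == 1}
    # Emit contiguous stretches of unambiguously mapped read bases.
    stretches, current = [], []
    for r in sorted(unique):
        if current and r != current[-1][0] + 1:
            stretches.append(current)
            current = []
        current.append((r, unique[r]))
    if current:
        stretches.append(current)
    # The output is subject to the same length filter as the input.
    return [s for s in stretches if len(s) > MIN_LEN]
```

Bases on which the genome and transcriptome alignments agree have a single genomic location and survive; bases on which they disagree split the read into separate output stretches.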

Page 8

HardMerge Example

Input alignments in genome coordinates:

Filter multiple local alignments/sub-alignments

Output alignment:

Page 9

SNV Detection and Genotyping

Reference:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Reads aligned to the reference:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
  CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
   GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
      GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
          GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
               CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
                      CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
                         CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                         CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                            GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Ri : Set of reads covering locus i
r(i) : Base call of read r at locus i
εr(i) : Probability of error reading base call r(i)
Gi : Genotype at locus i

Page 10

SNV Detection and Genotyping

• Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
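Written out with Ri as the reads covering locus i and Gi as the genotype at locus i, the rule takes the following form. The prior P(Gi = g) is left generic here, since the transcript does not show how it is chosen.

```latex
P(G_i = g \mid R_i)
  = \frac{P(R_i \mid G_i = g)\, P(G_i = g)}
         {\sum_{g'} P(R_i \mid G_i = g')\, P(G_i = g')},
\qquad
\hat{G}_i = \arg\max_{g} P(G_i = g \mid R_i)
```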

Page 11

SNVQ Model

• Calculate conditional probabilities by multiplying contributions of individual reads
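The per-read product can be sketched as follows. The slide's formula did not survive in this transcript, so the per-read terms below are an assumption: a common diploid model in which a read matches a homozygous allele with probability 1 - ε, matches either allele of a heterozygous genotype with probability 1/2 - ε/3, and shows any other base with probability ε/3. The uniform prior and the function names are also illustrative, and error probabilities are assumed to be strictly positive.

```python
from itertools import combinations_with_replacement
from math import prod
from typing import Dict, List, Tuple

BASES = "ACGT"
Genotype = Tuple[str, str]


def read_likelihood(base: str, err: float, genotype: Genotype) -> float:
    """P(r(i) = base | G_i = genotype) for one read with error probability err."""
    h1, h2 = genotype
    if h1 == h2:  # homozygous genotype
        return 1.0 - err if base == h1 else err / 3.0
    # heterozygous genotype: the read samples either allele with probability 1/2
    return 0.5 - err / 3.0 if base in genotype else err / 3.0


def genotype_posteriors(calls: List[Tuple[str, float]]) -> Dict[Genotype, float]:
    """Posterior over the 10 unordered diploid genotypes under a uniform prior.

    `calls` holds one (base call, error probability) pair per read at the locus.
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    # Multiply the contributions of the individual reads.
    joint = {g: prod(read_likelihood(b, e, g) for b, e in calls) for g in genotypes}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}
```

The called genotype is then the key with the largest posterior, for example `max(post, key=post.get)`.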

Page 12

ERCC SNV Simulation

• Random SNVs were inserted into the ERCC reference with probability 0.005

• The modified ERCC sequences were appended to the reference genome

• For each ERCC, a single-exon transcript annotation was added to the Ensembl64 transcript library (GTF).

• tmap indices were built for the reference genome and transcriptome, including the ERCCs with the simulated SNVs
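The SNV insertion step could be simulated along the following lines. This is a sketch under assumptions: the probability 0.005 is applied independently per base, the alternative base is drawn uniformly from the three other nucleotides, and `insert_random_snvs` is a hypothetical name.

```python
import random
from typing import List, Optional, Tuple


def insert_random_snvs(seq: str, rate: float = 0.005,
                       rng: Optional[random.Random] = None
                       ) -> Tuple[str, List[Tuple[int, str, str]]]:
    """Return the mutated sequence and the list of (position, ref, alt) SNVs."""
    rng = rng or random.Random()
    out, snvs = [], []
    for pos, ref in enumerate(seq):
        # Only unambiguous bases are mutated; anything else is copied as is.
        if ref in "ACGT" and rng.random() < rate:
            alt = rng.choice([b for b in "ACGT" if b != ref])
            snvs.append((pos, ref, alt))
            out.append(alt)
        else:
            out.append(ref)
    return "".join(out), snvs
```

Keeping the returned SNV list gives the ground truth against which true positives, false positives, and false negatives can be counted.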

Page 13

HBR Sample Statistics

Page 14

UHR Sample Statistics

Page 15

[Bar chart: HBR - 5 datasets average. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage, min alternative allele coverage. Groups: ION and SNVQ (min alternative allele coverage 1 and 2) for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 16

[Bar chart: HBR - 5 datasets, combined. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage, min alternative allele coverage. Groups: ION and SNVQ (min alternative allele coverage 1 and 2) for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 17

[Bar chart: UHR - 5 datasets average. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage, min alternative allele coverage. Groups: ION and SNVQ (min alternative allele coverage 1 and 2) for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 18

[Bar chart: UHR - 5 datasets, combined. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage, min alternative allele coverage. Groups: ION and SNVQ (min alternative allele coverage 1 and 2) for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 19

Comparing SNVQ & Samtools on HardMerge Alignments

[Bar chart: HBR - 5 datasets, combined. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage/min alternative allele coverage. Groups: SNVQ and HM/sam for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 20

Comparing SNVQ & Samtools on HardMerge Alignments

[Bar chart: UHR - 5 datasets, combined. Series: TP, FN, FP (y-axis ticks 0 to 120). X-axis: method, ERCC average coverage/min alternative allele coverage. Groups: SNVQ and HM/sam for ERCC average coverage bins [0,1], (1,5], (5,10], (10,50], (50,inf).]

Page 21

Whole Transcriptome Comparison on NA12878 Illumina Reads

[Stacked bar chart: SOAPsnp, Maq, and SNVQ compared within six expression bins: RPKM < 1, 1 < RPKM < 5, 5 < RPKM < 10, 10 < RPKM < 50, 50 < RPKM < 100, RPKM > 100. Y-axis: 0% to 100%. Series: TPHomoVar, TPHetero, FP, FNHomoVar, FNHetero.]

Page 22

Plugin Interface

Page 23

Plugin Output

Page 24

Outline

1. SNV calling from RNA-Seq reads
2. Transcriptome reconstruction update

Page 25

Challenges and Solutions

• Challenge: read lengths are currently much shorter than transcript lengths
– Phasing “free” exons (no direct evidence from reads) during assembly is challenging
• Solution: statistical reconstruction method
– Fragment length distribution

Page 26

Exons: 1 2 3 4 5 6 7

Candidate Transcripts:

t1 : 1 2 3 4 5 6 7
t2 : 1 3 4 5 6 7
t3 : 1 2 3 4 5 7
t4 : 1 3 4 5 7

Exons 2 and 6 are “free” exons: no direct evidence from reads

Page 27

ILP based Transcriptome Reconstruction from PE reads

SE (from PE)
• Splicing graph: candidate transcripts

PE
• ILP-based filtering of candidate transcripts

Page 28

Splicing Graph

Genome Research (2004): The Multiassembly Problem: Reconstructing Multiple Transcript Isoforms From EST

Page 29

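Naive ILP formulation

Variables:

y(t) = 1 iff candidate transcript t is selected, 0 otherwise

x(p) = 1 iff the PE read p is mapped within 1 std. dev., 0 otherwise

Objective:

$\min \sum_{t \in T} y(t)$

Constraints:

(1) $\sum_{t \in T_j} y(t) \ge x(p_j)$ for each read $p_j$: at least one transcript is selected, where $T_j$ is taken to be the set of candidate transcripts on which $p_j$ maps within 1 std. dev.

(2) $N(s) - \epsilon \le \sum_{p} x(p) \le N(s) + \epsilon$, where $N(s)$ is the number of reads mapped within 1 std. dev. (~68%); the tolerance $\epsilon$ and the two-sided form are assumed here, since only the terms $N(s)$, $x(p)$, $N(s)$ are legible in the extracted slide.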
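For a handful of candidates, a formulation of this kind can be checked by brute force instead of an ILP solver. The sketch below is not the project's method: it assumes the objective is to minimize the number of selected transcripts, that a read can count as explained only when a selected transcript places it within 1 std. dev., and that the read-count requirement is a lower bound `min_reads` (a hypothetical parameter, for example about 68% of the reads). Because x(p) may be set to 0 for a read that is explained, an upper bound on the count never excludes a solution in this simplified setting.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Set


def select_transcripts(candidates: List[str],
                       compatible: Dict[str, Set[str]],
                       min_reads: int) -> Optional[FrozenSet[str]]:
    """Smallest set of transcripts explaining at least `min_reads` reads.

    compatible[p] is the set of candidate transcripts on which paired-end
    read p maps with a fragment length within 1 std. dev. of the mean.
    """
    for k in range(len(candidates) + 1):  # try the smallest sets first
        for chosen in combinations(candidates, k):
            picked = set(chosen)
            # A read is explained if any compatible transcript is selected.
            explained = sum(1 for ts in compatible.values() if ts & picked)
            if explained >= min_reads:
                return frozenset(picked)
    return None  # infeasible: the candidates cannot explain enough reads
```

The search is exponential in the number of candidates, so it is only suitable for sanity-checking small genes.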

Page 30

Sophisticated ILP Formulation

• Consider reads mapped within >1 std. dev.
• Integrate reads with different fragment lengths
– Prepare libraries with different insert sizes
– Reduce the number of “free” exons

Page 31

Preliminary results

     Spec   PPV
2    0.73   0.95
3    0.66   0.92

Note: results are on ~20% of UCSC genes