Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Outline
• DGE/SAGE-Seq protocol
• EM algorithm
• Experimental results
• Conclusions
RNA-Seq Protocol
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Inference problems: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)
[Figure: reads mapped to segments A B C D E, with isoforms A B C, A C, and D E]
DGE Protocol
• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
• Infer Gene Expression (GE)
[Figure: mRNA with poly-A tail (AAAAA), AE site CATG, and TE primer TCCRAC attached next to the AE site]
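The protocol steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `dge_tags`, the default tag length, and the choice to keep the AE recognition site as the start of each tag are assumptions made for the example.

```python
def dge_tags(mrna, ae_site="CATG", tag_len=21):
    """Return (site_index, tag) pairs for one transcript.

    Sites are numbered 1..k starting from the 3' end of the mRNA.
    Assumption: a tag starts at the AE site and extends tag_len bases 3'-ward.
    """
    # Locate every occurrence of the anchoring-enzyme recognition site.
    starts = []
    pos = mrna.find(ae_site)
    while pos != -1:
        starts.append(pos)
        pos = mrna.find(ae_site, pos + 1)

    tags = []
    # Walk from the 3' end so that j = 1 is the 3'-most site.
    for j, start in enumerate(reversed(starts), start=1):
        tag = mrna[start:start + tag_len]
        if len(tag) == tag_len:  # skip tags that would run off the transcript
            tags.append((j, tag))
    return tags


if __name__ == "__main__":
    toy = "AACATGAAACCCGGGTTTCATGTTTGGGCCCAAATT"  # hypothetical transcript
    for j, tag in dge_tags(toy, "CATG", 10):
        print(j, tag)
```

With complete digestion only the tag at site 1 would be observed for each transcript; tags from the other sites appear when digestion is partial.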
Our Approach
Previous methods
• Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
• Heuristics to rescue some ambiguous tags [Wu et al. 10]
New DGE-EM algorithm
• Uses all tags, including all ambiguous ones
• Uses quality scores
• Takes into account partial digest and gene isoforms
Tag Formation Probability
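• AE sites on the mRNA are numbered $1, 2, \ldots, k$ from the 3' end toward the 5' end
• Each AE site is cut with probability $p$; a tag forms at site $j$ when site $j$ is cut and the $j-1$ sites closer to the 3' end are not
• Tag formation probability for site $j$: $p(1-p)^{j-1}$, i.e. $p,\ p(1-p),\ \ldots,\ p(1-p)^{k-1}$ for sites $1, \ldots, k$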
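As a small numeric illustration of the tag formation probabilities, the sketch below evaluates $p(1-p)^{j-1}$ for every site of a transcript. The function names are hypothetical and chosen for this example only.

```python
def tag_formation_probs(k, p):
    """Probability that the tag comes from site j, for j = 1..k.

    Sites are numbered from the 3' end; p is the per-site cutting probability.
    """
    return [p * (1 - p) ** (j - 1) for j in range(1, k + 1)]


def prob_any_tag(k, p):
    """Probability that at least one of the k sites is cut: 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k


if __name__ == "__main__":
    probs = tag_formation_probs(3, 0.5)
    print(probs, sum(probs), prob_any_tag(3, 0.5))
```

The per-site probabilities sum to $1-(1-p)^k$ rather than 1, because with probability $(1-p)^k$ no site is cut and the transcript yields no tag.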
Tag-Isoform Compatibility
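$w_{t,i,j} = Q_a \, p \, (1-p)^{j-1}$
where $w_{t,i,j}$ is the weight of tag $t$ at AE site $j$ of isoform $i$, and $Q_a$ is the quality-score-based probability of the corresponding alignment $a$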
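The weight can be computed as in the sketch below. How $Q_a$ is derived from quality scores is not spelled out on the slide; the example assumes Phred-scaled base qualities and an error model in which a miscalled base is equally likely to be any of the other three bases. Both function names are hypothetical.

```python
def alignment_quality(tag, target, phred):
    """Assumed model for Q_a: product over bases of the probability of
    observing the tag base given the target base and the Phred score."""
    q = 1.0
    for base, ref, score in zip(tag, target, phred):
        err = 10 ** (-score / 10)  # Phred score -> base-call error probability
        q *= (1 - err) if base == ref else err / 3
    return q


def compatibility_weight(q_a, p, j):
    """w_{t,i,j} = Q_a * p * (1 - p)^(j - 1) for a tag aligned at site j."""
    return q_a * p * (1 - p) ** (j - 1)


if __name__ == "__main__":
    q = alignment_quality("CATGAC", "CATGAC", [30] * 6)
    print(compatibility_weight(q, p=0.5, j=2))
```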
DGE-EM Algorithm
assign random values to all f(i)
while not converged
    E-step
        init all n(i,j) to 0
        for each tag t
            $s = \sum_{(i,j,w) \in t} w \, f(i)$
            for (i,j,w) in t
                $n(i,j) \mathrel{+}= w \, f(i) / s$
    M-step
        for each isoform i
            $N_i = \sum_{j=1}^{sites(i)} n(i,j)$
            $f(i) = N_i / (1 - (1-p)^{sites(i)})$
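The pseudocode can be turned into a short runnable sketch. This is an illustration under stated assumptions, not the released DGE-EM code: the function name `dge_em`, the input layout (each tag as a list of `(i, j, w)` triples), the convergence test, and the final rescaling of `f` to sum to 1 are choices made here and are not shown on the slide.

```python
import random


def dge_em(tags, sites, p, max_iter=1000, tol=1e-9, seed=0):
    """Estimate isoform frequencies f(i) from DGE tags.

    tags  : list of tags; each tag is a list of (i, j, w) triples
    sites : dict mapping isoform i to its number of AE sites
    p     : cutting probability of the anchoring enzyme, 0 < p <= 1
    """
    rng = random.Random(seed)
    # Assign random positive starting values to all f(i).
    f = {i: rng.random() + 1e-6 for i in sites}
    for _ in range(max_iter):
        # E-step: distribute each tag over its compatible (isoform, site) pairs.
        n = {}
        for t in tags:
            s = sum(w * f[i] for i, j, w in t)
            if s == 0.0:
                continue  # tag has no support under the current estimate
            for i, j, w in t:
                n[(i, j)] = n.get((i, j), 0.0) + w * f[i] / s
        # M-step: correct tag counts for transcripts that were never cut.
        new_f = {}
        for i, k in sites.items():
            n_i = sum(n.get((i, j), 0.0) for j in range(1, k + 1))
            new_f[i] = n_i / (1 - (1 - p) ** k) if k > 0 else 0.0
        # Assumption: rescale so frequencies sum to 1 (not shown on the slide).
        total = sum(new_f.values())
        if total > 0:
            new_f = {i: v / total for i, v in new_f.items()}
        delta = max(abs(new_f[i] - f[i]) for i in sites)
        f = new_f
        if delta < tol:
            break
    return f


if __name__ == "__main__":
    # Toy input: 3 tags unique to A, 1 unique to B, 1 compatible with both.
    tags = [[("A", 1, 1.0)]] * 3 + [[("B", 1, 1.0)]] + [[("A", 1, 1.0), ("B", 1, 1.0)]]
    print(dge_em(tags, {"A": 1, "B": 1}, p=1.0))
```

In the toy input the ambiguous tag is split in proportion to the current estimates, so the frequencies should approach 0.75 for A and 0.25 for B. The rescaling step does not change the E-step, since the ratio `w * f[i] / s` is unaffected by a common scale factor.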
MAQC Data (UHRR, HBRR)
DGE
• 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
• Anchoring enzyme DpnII (GATC)
RNA-Seq
• 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
qPCR
• Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]
Compared Algorithms
DGE
• Uniq [Asmann et al. 09, Zaretzki et al. 10]
• DGE-EM
RNA-Seq
• IsoEM [Nicolae et al. 10]
• Cufflinks [Trapnell et al. 10]
DGE-EM vs. Uniq on HBRR Library 4
[Figure: Median Percent Error (y-axis, 65 to 85) against an x-axis running from 0 to 60,000,000. Series: Uniq and DGE-EM, each with 0, 1, and 2 mismatches]
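The accuracy measure used in these plots is Median Percent Error (MPE). The slides do not give its formula; the sketch below assumes the usual definition, the median over genes of the absolute relative error against the reference (for example qPCR) value, expressed as a percentage. The function name and the choice to skip genes with a non-positive reference value are assumptions.

```python
from statistics import median


def median_percent_error(estimates, references):
    """Median of |estimate - reference| / reference * 100 over all genes.

    Assumption: genes with a non-positive reference value are skipped.
    """
    errors = [
        abs(est - ref) / ref * 100
        for est, ref in zip(estimates, references)
        if ref > 0
    ]
    return median(errors)


if __name__ == "__main__":
    print(median_percent_error([110, 80, 100], [100, 100, 100]))
```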
DGE vs. RNA-Seq
[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases. Series: IsoEM on RNA-Seq runs HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5; DGE-EM on DGE libraries HBRR 1 to HBRR 8 and UHRR 1]
DGE vs. RNA-Seq
[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases. Series: IsoEM and Cufflinks on RNA-Seq runs HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5; DGE-EM and Uniq on DGE libraries HBRR 1 to HBRR 8 and UHRR 1]
DGE vs. RNA-Seq
[Figure: $R^2$ (y-axis, 0.35 to 0.85) vs. Million Mapped Bases. Series: IsoEM and Cufflinks on RNA-Seq runs HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5; DGE-EM and Uniq on DGE libraries HBRR 1 to HBRR 8 and UHRR 1]
Synthetic Data
• 1-30M tags, lengths 14-26bp
• UCSC hg19 genome and known isoforms
• Simulated expression levels
  – Gene expression for 5 tissues from the GNFAtlas2
  – Geometric expression for the isoforms of each gene
• Anchoring enzymes from REBASE
  – DpnII (GATC) [Asmann et al. 09]
  – NlaIII (CATG) [Wu et al. 10]
  – CviJI (RGCY, R=G or A, Y=C or T)
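Degenerate recognition sites such as RGCY match several concrete sequences, so locating them needs pattern matching rather than plain substring search. The sketch below is one way to do this with standard IUPAC codes; the table covers only the codes that appear in these slides (R, Y, S), and the names `IUPAC`, `site_regex`, and `find_sites` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes, restricted to those used in the slides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",  # purine
    "Y": "[CT]",  # pyrimidine
    "S": "[CG]",  # strong
}


def site_regex(site):
    """Compile a recognition site into a regex; the lookahead lets
    overlapping occurrences be reported as well."""
    pattern = "".join(IUPAC[code] for code in site)
    return re.compile("(?=(" + pattern + "))")


def find_sites(seq, site):
    """Return the 0-based start positions of every occurrence of the site."""
    return [m.start() for m in site_regex(site).finditer(seq)]


if __name__ == "__main__":
    toy = "TTGGCCAAAGCTT"  # hypothetical sequence
    print(find_sites(toy, "RGCY"), find_sites(toy, "GATC"))
```

A degenerate site tends to occur more often per transcript than a fixed 4-base site, which is why such enzymes are of interest for increasing the fraction of genes that are cut.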
MPE for 30M 21bp tags
RNA-Seq: 8.3 MPE
[Figure: Median Percent Error (y-axis, 0 to 30) for anchoring enzyme recognition sites GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY. Series: Uniq p=1.0, Uniq p=0.5, DGE-EM p=1.0, DGE-EM p=0.5]
Conclusions
Introduced new DGE-EM algorithm
• Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
• Source code freely available at http://www.dna.engr.uconn.edu/software/DGE-EM
First direct comparison of RNA-Seq and DGE protocols
• Best inference algorithms yield comparable cost-normalized accuracy on MAQC data
Simulations suggest possible DGE protocol improvements
• Enzymes with degenerate recognition sites (e.g. CviJI)
• Optimizing cutting probability
Questions?
ACKNOWLEDGEMENTS
Work supported in part by NSF awards IIS-0546457 and IIS-0916948
Anchoring Enzyme Statistics
[Figure: percentages (y-axis, 75 to 100) for anchoring enzyme recognition sites GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY. Series: % Genes Cut, % Unique Tags (p=1.0), % Unique Tags (p=0.5)]
RNA-Seq
[Figure: MPE (axis 0 to 25) as a function of #Reads (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000) and Read Length (14, 18, 21, 26, 36, 50, 75, 100)]
DGE enzyme GATC p=1.0
[Figure: MPE (axis 0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]
DGE enzyme CATG p=1.0
[Figure: MPE (axis 0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]
DGE enzyme RGCY p=1.0
[Figure: MPE (axis 0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]
DGE enzyme GATC p=0.5
[Figure: MPE (axis 0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]
DGE enzyme CATG p=0.5
[Figure: MPE (axis 0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]
DGE enzyme RGCY p=0.5
[Figure: MPE (axis 0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]