ERCC 2.0 Workshop Day 1 July 10, 2014 Sarah Munro and Marc Salit


Page 1: 20140710 1 day1_nist_ercc2.0workshop

ERCC 2.0 Workshop

Day 1, July 10, 2014

Sarah Munro and Marc Salit

Page 2: 20140710 1 day1_nist_ercc2.0workshop

Welcome  

Page 3: 20140710 1 day1_nist_ercc2.0workshop

Agenda

• Meeting Introduction
  – ERCC 1.0 Recap & Applications
  – Charge to Workshop
  – ERCC 2.0 Scope & Process Discussion
• Participant Presentations
  – Bob Setterquist
  – Lukas Paul
• Break: 3:30 – 4:00pm
• Participant Presentations
  – Anne Bergstrom Lucas
  – Karol Thompson
  – Christopher Mason
• Working Group Formation and Scoping Discussion
• Dinner: 6:00pm

Page 4: 20140710 1 day1_nist_ercc2.0workshop

ERCC  1.0  RECAP  

Page 5: 20140710 1 day1_nist_ercc2.0workshop

Gene  Expression  Measurements  

[Slide image: first page of Tan et al., “Evaluation of gene expression measurements from commercial microarray platforms,” Nucleic Acids Research, 2003 (nar.oxfordjournals.org).]

• Transcript ratios across samples
• Lack of confidence in gene expression experiments
  – Same pair of samples, different platforms, different ratio results!
• Critical applications
  – Cancer Biology
  – Drug Discovery
  – Tissue engineering
  – Stem Cell Biology

Page 6: 20140710 1 day1_nist_ercc2.0workshop

External RNA Control Consortium (ERCC)

• Industry-initiated, NIST-hosted, stakeholder coupled
  – grew out of NIST workshop in 2003
    • initiated by Janet Warrington, VP Clinical Genomics at Affymetrix
  – all major microarray technology developers
  – other gene expression assay developers
• Open to all interested parties
• Voluntary
• More than 90 participants
  – Private, Public, Academic

[Figure label: Spike-ins]

Page 7: 20140710 1 day1_nist_ercc2.0workshop

ERCC Collaborative Study

• Developed sequence library from submissions by ERCC members, as well as synthesis
  – evaluated performance of RNA controls on a variety of platforms
  – selected 96 well-performing sequences in collaborative study
• Array manufacturers modified products to include ERCC control sequences

[Figure labels: sequence counts 176, 144, 106, 96]

Page 8: 20140710 1 day1_nist_ercc2.0workshop

SRM 2374 – DNA Sequence Library for External RNA Controls

• NIST Standard Reference Material (SRM 2374)
• Contains 96 unique control sequences inserted in common plasmid DNA
  – engineered to be readily in vitro transcribed to make RNA controls
  – RNA controls intended to mimic mammalian mRNA
• http://www.nist.gov/srm/

Page 9: 20140710 1 day1_nist_ercc2.0workshop

Library  of  96  Controls  in  SRM  2374  

[Figure: two histograms for the 96 controls. Sequence Lengths: count versus sequence length (x-axis ticks 500 to 2000). GC Content: count versus GC content (x-axis ticks 0.35 to 0.50).]

Page 10: 20140710 1 day1_nist_ercc2.0workshop

Creating Spike-in Mixtures from SRM 2374

SRM 2374 plasmid DNA library -> in vitro transcription -> RNA transcripts -> pooling -> mixtures with known abundance ratios
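The pooling step can be illustrated with a minimal sketch. It assumes each control is assigned a designed mix 1 to mix 2 abundance ratio; the control names, concentrations, fold changes, and function names below are hypothetical and are not taken from SRM 2374 or any commercial mixture.

```python
import math

# Hypothetical sketch: pool transcripts into two mixtures with designed ratios.
# Control names, concentrations, and fold changes are made-up examples.

def design_mixtures(base_conc, fold_change):
    """Return per-control concentrations for mix 1 and mix 2.

    base_conc   -- dict: control name -> concentration in mix 2 (arbitrary units)
    fold_change -- dict: control name -> designed mix1/mix2 abundance ratio
    """
    mix1 = {name: base_conc[name] * fold_change[name] for name in base_conc}
    mix2 = dict(base_conc)
    return mix1, mix2

def expected_log2_ratios(mix1, mix2):
    """Designed log2(mix1/mix2) ratio for every control."""
    return {name: math.log2(mix1[name] / mix2[name]) for name in mix1}

base = {"ctrl_a": 100.0, "ctrl_b": 10.0, "ctrl_c": 1.0}
folds = {"ctrl_a": 4.0, "ctrl_b": 1.0, "ctrl_c": 0.5}

mix1, mix2 = design_mixtures(base, folds)
ratios = expected_log2_ratios(mix1, mix2)
for name in sorted(ratios):
    print(name, mix1[name], mix2[name], ratios[name])
```

Because the ratios are fixed by design, a measurement of the two mixtures can later be compared against these expected values.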

Page 11: 20140710 1 day1_nist_ercc2.0workshop

Feature     A_1   A_2   A_3   B_1   B_2   B_3
T1            1     5     4     0     2     3
T2          200   204   199   101    97   103
T3          142   153   147   149   130   155
ERCC-0001     5     8    10    20    23    19
…

Method Validation with erccdashboard R package

erccdashboard Package Vignette

Sarah A. Munro

May 4, 2014

This vignette describes the use of the erccdashboard R package to analyze External RNA Control Consortium (ERCC) spike-in control ratio mixtures in gene expression experiments. If you use this package for method validation of your gene expression experiments, please cite our publication. This is a placeholder citation, because our manuscript is still under review.

Munro SA, Lund S, Pine PS, Binder H, Clevert D, Conesa A, Dopazo J, Fasold M, Hochreiter S, Hong H, Jafari N, Kreil DP, Łabaj PP, Li S, Liao Y, Lin S, Meehan J, Mason CE, Santoyo J, Setterquist RA, Shi L, Shi W, Smyth GK, Stralis-Pavese N, Su Z, Tong W, Wang C, Wang J, Xu J, Ye Z, Yang Y, Yu Y, Salit M (Under Review, 2014). Assessing Technical Performance in Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures.

A BibTeX entry for LaTeX users is

@Article{,
  title = {Assessing Technical Performance in Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures},
  author = {Munro SA and Lund S and Pine PS and Binder H and Clevert D and Conesa A and Dopazo J and Fasold M and Hochreiter S and Hong H and Jafari N and Kreil DP and {\L}abaj PP and Li S and Liao Y and Lin S and Meehan J and Mason CE and Santoyo J and Setterquist RA and Shi L and Shi W and Smyth GK and Stralis-Pavese N and Su Z and Tong W and Wang C and Wang J and Xu J and Ye Z and Yang Y and Yu Y and Salit M},
  journal = {Under Review},
  volume = {0},
  pages = {0},
  year = {2014},
}

Analysis is shown for two types of samples spiked with ERCC control ratio mixtures from the SEQC project:

• Rat toxicogenomics treatment and control samples for different drug treatments

• Human reference RNA samples from the MAQC I project, Universal Human Reference RNA (UHRR) and Human Brain Reference RNA (HBRR)

• Open-source R package – erccdashboard
• Assess technical performance of a gene expression experiment
• Compare results
  – Within a single laboratory
  – Between laboratories
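As a rough illustration of the kind of input shown in the count table on this slide, the sketch below computes an observed log2 ratio between sample B and sample A for each feature. It is written in Python for illustration only and does not reproduce the erccdashboard package or its API. The function name is hypothetical, and the calculation deliberately skips normalization and any handling of zero means.

```python
import math

# Count table from the slide: three replicates each for samples A and B.
counts = {
    "T1":        {"A": [1, 5, 4],       "B": [0, 2, 3]},
    "T2":        {"A": [200, 204, 199], "B": [101, 97, 103]},
    "T3":        {"A": [142, 153, 147], "B": [149, 130, 155]},
    "ERCC-0001": {"A": [5, 8, 10],      "B": [20, 23, 19]},
}

def observed_log2_ratio(a_reps, b_reps):
    """log2 of mean(B) / mean(A); assumes both means are non-zero."""
    mean_a = sum(a_reps) / len(a_reps)
    mean_b = sum(b_reps) / len(b_reps)
    return math.log2(mean_b / mean_a)

log2_ratios = {
    feature: observed_log2_ratio(reps["A"], reps["B"])
    for feature, reps in counts.items()
}

for feature, value in log2_ratios.items():
    print(f"{feature}: {value:.2f}")
```

For a spike-in control such as ERCC-0001, the observed value would be compared with the ratio designed into the mixtures; that designed ratio is not given on the slide, so it is left out here.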


Page 13: 20140710 1 day1_nist_ercc2.0workshop

APPLICATIONS  OF  ERCC  1.0  

Page 14: 20140710 1 day1_nist_ercc2.0workshop

Product  and  Method  Development  

• Certified Sequences
• Known concentrations
  – Validation and method testing
  – Product development and evaluation

Page 15: 20140710 1 day1_nist_ercc2.0workshop

Measurement  Analysis  Quality  Control  

• Limit of Detection
• Dynamic Range
• Noise models

Page 16: 20140710 1 day1_nist_ercc2.0workshop

Sample Normalization

• Key to comparing transcriptomes:
  – Immunology
  – Agriculture
  – Virology
  – Cancer

Page 17: 20140710 1 day1_nist_ercc2.0workshop

Single-Cell Measurements

• Normalization
• Noise modeling
• RT Efficiency
• Limit of Detection

Page 18: 20140710 1 day1_nist_ercc2.0workshop

Others…

Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Justin M. Zook1*, Daniel Samarov2, Jennifer McDaniel1, Shurjo K. Sen3, Marc Salit1

1 Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 2 Statistical Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 3 Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
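To make the "Phred-scaled quality score units" in the abstract concrete, here is a minimal sketch of how an empirical quality could be estimated from reads mapped to a spike-in of known sequence. It is not the GATK recalibration algorithm. The function names, the add-one smoothing, and the example counts are assumptions chosen for illustration.

```python
import math

def phred_from_error_prob(p):
    """Phred-scaled quality: Q = -10 * log10(p), for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)

def empirical_quality(mismatches, total_bases):
    """Empirical quality from mismatches against a known spike-in sequence.

    Uses (mismatches + 1) / (total_bases + 2) so that zero observed
    mismatches does not give an infinite quality (an assumed smoothing).
    """
    p = (mismatches + 1) / (total_bases + 2)
    return phred_from_error_prob(p)

# Hypothetical bin of bases that the sequencer reported as Q30.
reported_q = 30
emp_q = empirical_quality(mismatches=49, total_bases=9998)  # p = 50 / 10000
overestimate = reported_q - emp_q

print(f"empirical Q = {emp_q:.2f}, reported Q overestimates by {overestimate:.2f}")
```

A positive difference means the reported score overstates accuracy for that bin, which is the kind of discrepancy the abstract reports for certain dinucleotides.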

Citation: Zook JM, Samarov D, McDaniel J, Sen SK, Salit M (2012) Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing. PLoS ONE 7(7): e41356. doi:10.1371/journal.pone.0041356

Editor: Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Germany

Received February 28, 2012; Accepted June 20, 2012; Published July 31, 2012

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

As sequencing costs drop, it is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant. However, systematic sequencing errors (SSEs) and biases remain problematic even at high sequencing depths, so recent research has started to focus on understanding these SSEs and biases [1,2]. In this work, we focus on SSEs rather than coverage biases, where SSEs are systematic errors in sample preparation and sequencing processes that cause base call errors to accumulate preferentially at certain base positions in the genome, and coverage biases are biases in the number of reads covering certain genomic regions such as GC-bias [3–5]. Examples of SSEs, as well as random errors, are portrayed in Figure 1(a). Compensating for these SSEs is critical for applications in which a variant might be expected to be in only a small fraction of the reads, such as samples containing RNA-editing [6,7], cancer tissues and circulating tumor cells [8–11], fetal DNA in mother’s blood [12], mixtures of bacterial strains [13], mitochondrial heteroplasmy [14], mosaic disorders [15], and pooled samples [16,17]. Since the causes of many SSEs are not well understood and may vary due to batch effects in a run-specific manner, compensating for them requires training data sets. The two previously proposed approaches either use a separate data set with special characteristics (e.g., SysCall uses overlapping paired-end reads [1]) or use the data set itself excluding regions known to have variants (e.g., Genome Analysis Toolkit, or GATK, base quality score recalibration [2]). Here, we combine the advantages of these approaches by using DNA or RNA spike-in standards without homology to almost all biological organisms.

The first approach, SysCall, used a methyl-Seq dataset that had overlapping paired-end reads to detect SSEs depending on sequencing direction for the Illumina sequencer [1]. The region in which the reads overlap can be used to find systematic errors that preferentially occur on one DNA strand compared to the other strand. To improve variant calls, the SysCall method uses a separate dataset with overlapping reads to train a logistic regression model that accounts for SSEs correlated with several covariates: (1) the 2 preceding bases + the base in question (each base independently), (2) directionality bias of the errors, the proportion of non-reference reads, and (3) a comparison of the quality scores of the error base to the next base. Most sequencing runs do not contain overlapping paired reads, so SysCall assumes the SSEs in a training data set are the same as the SSEs in other

PLoS ONE | www.plosone.org 1 July 2012 | Volume 7 | Issue 7 | e41356

Page 19: 20140710 1 day1_nist_ercc2.0workshop

Experience from Expression Analysis
Thanks to Wendell Jones and Erik Aronesty

I like…
• 1000s of RNA-Seq samples
  – Ambion Mix 1
  – “did the library reactions work appropriately and consistently?”
  – “did our lab degrade samples or were the samples already degraded?”
  – “effects (or lack of) between lane or flowcell”

I wish…
• “Construct Ext RNA Controls that emulate a variety of splice variation (some that may be challenging) and have them at different magnitudes”
  – “examine not only the chemistry but also the bioinformatic pipeline to ensure it has basic fitness.”
• “Suggest a protocol for adding Ext RNA Controls for FFPE.”
  – “While we spike in ERCC controls at a fixed amount for FFPE samples, we get out a wild range of sequence coming out that aligns to the ERCC controls.”
  – Hypothesis: “much of the target RNA is so damaged that it doesn't ligate to adapters correctly, but ERCC controls do (ligate); as a result they are (much) preferentially amplified.”

Page 20: 20140710 1 day1_nist_ercc2.0workshop

ERCC  1.0  Shortcomings  

• Poly A Selection is broken
• Too short
• No isoforms
• No good mimics for variants
  – SNPs, cancer fusions
• Bimodal GC distribution

[Figure: ScaledEffect (scale −10 to 5) by Feature (ERCC controls ERCC-00002 through ERCC-00171) and Lab (Lab1 through Lab9).]

Page 21: 20140710 1 day1_nist_ercc2.0workshop

Opportunities for ERCC 2.0

• New technologies
  – RNA-Seq
  – Long reads
    • PacBio, Moleculo
  – Digital counting
    • Cellular Research, digital PCR
• Method improvements
  – Library preparation
  – Bioinformatics
• New discoveries

Counting individual DNA molecules by the stochastic attachment of diverse labels

Glenn K. Fu, Jing Hu, Pei-Hua Wang, and Stephen P. A. Fodor1

Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA 95051

Edited* by Ronald W. Davis, Stanford Genome Technology Center, Palo Alto, CA, and approved March 22, 2011 (received for review November 27, 2010)

We implement a unique strategy for single molecule counting termed stochastic labeling, where random attachment of a diverse set of labels converts a population of identical DNA molecules into a population of distinct DNA molecules suitable for threshold detection. The conceptual framework for stochastic labeling is developed and experimentally demonstrated by determining the absolute and relative number of selected genes after stochastically labeling approximately 360,000 different fragments of the human genome. The approach does not require the physical separation of molecules and takes advantage of highly parallel methods such as microarray and sequencing technologies to simultaneously count absolute numbers of multiple targets. Stochastic labeling should be particularly useful for determining the absolute numbers of RNA or DNA molecules in single cells.

absolute counting ∣ digital PCR ∣ next-generation sequencing ∣ single molecule detection

Determining small numbers of biological molecules and their changes is essential when unraveling mechanisms of cellular response, differentiation or signal transduction, and in performing a wide variety of clinical measurements. Although many analytical methods have been developed to measure the relative abundance of different molecules through sampling (e.g., microarrays and sequencing), the only practical method available to determine the absolute number of molecules in a sample is digital PCR (1–3), a powerful analytical technique typically limited to examining only a few different molecules at a time.

In 2003, a theoretical approach to measure the number of molecules of a single mRNA species in a complex mRNA preparation was proposed (4). To our knowledge no experimental demonstration of this idea has been published. We have generalized this idea and have expanded it to a highly parallel method capable of absolute counting of many different molecules simultaneously. The concept is illustrated in Fig. 1. Each copy of a molecule randomly captures a label by choosing from a large, nondepleting reservoir of diverse labels. The subsequent diversity of the labeled molecules is governed by the statistics of random choice, and depends on the number of copies of identical molecules in the collection compared to the number of kinds of labels. Once the molecules are labeled, they can be amplified so that simple present/absent threshold detection methods can be used for each. Counting the number of distinctly labeled targets reveals the original number of molecules of each species.

We can generalize the stochastic labeling process as follows. Consider a given set of copies of a single target sequence $T = \{t_1, t_2, \ldots, t_n\}$, where $n$ is the number of copies of $T$. A set of labels is defined as $L = \{l_1, l_2, \ldots, l_m\}$, where $m$ is the number of different labels. $T$ reacts stochastically with $L$, such that each $t$ becomes attached to one $l$. If the $l$s are in nondepleting excess, each $t$ will choose one $l$ randomly, and will take on a new identity $l_i t_j$, where $l_i$ is chosen from $L$ and $j$ is the $j$th copy from the set of $n$ molecules. We identify each new molecule $l_i t_j$ by its label subscript and drop the subscript for the copies of $T$ because they are identical. The new collection of molecules becomes $T^* = \{l_1 t, l_2 t, \ldots, l_i t\}$, where $l_i$ is the $i$th choice from the set of $m$ labels. At this point, the subscripts of $l$ refer only to the $i$th choice and provide no information about the identity of each $l$. In fact, $l_1$ and $l_2$ will have some probability of being identical, depending upon the diversity $m$ of the set of labels. Overall, $T^*$ will contain a set of $k$ unique labels resulting from $n$ targets choosing from the nondepleting reservoir of $m$ labels. Or, $T^*(m,n) = \{l_k t\}$, where $k$ represents the number of unique labels that have been captured. In all cases, $k$ will be smaller than $m$, approaching $m$ only when $n$ becomes very large. We can define the stochastic attachment of the set of labels on a target using a stochastic operator $S$ with $m$ members, acting upon a target population of $n$, such that $S(m)T(n) = T^*(m,n)$, generating the set $\{l_k t\}$. Furthermore, because $S$ operates on all molecules independently, it can act on many different targets. Hence, by combining the information of target sequence and label, we can simultaneously count copies of multiple target sequences. The probability of the number of labels generated by the number of trials $n$, from a diversity of $m$, can be approximated by the Poisson equation, $P_x = \left[(n/m)^x / x!\right] e^{-(n/m)}$. Then $P_0$ is the probability that a label will not be chosen in $n$ trials; therefore, $1 - P_0$ is the probability that a label will occur at least once. It follows that the expected number of unique labels captured is given by:

$$k = m\left[1 - e^{-(n/m)}\right]$$

[Figure: identical DNA target molecules {t1, t2, …, tn} (t1 to t4 shown) are randomly labeled from a pool of labels {l1, l2, …, lm}, giving for example t1l20, t2l107, t3l477, t4l9, followed by amplification and detection of k distinctly labeled molecules.]

Fig. 1. A schematic representation of the labeling process. An example showing four identical target molecules in solution. Each DNA molecule randomly captures and joins with a label by choosing from a large, nondepleting reservoir of m labels. Each resulting labeled DNA molecule takes on a new identity and is amplified to detect the number of k distinct labels.
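The Poisson argument in the excerpt gives the expected number of distinct labels as k = m(1 − P0), with P0 = e^(−n/m). The sketch below evaluates that expression and, as an assumption of this sketch rather than something stated in the excerpt, inverts it to estimate n from an observed k. The function names and the label pool size of 960 are hypothetical.

```python
import math

def expected_unique_labels(n, m):
    """Expected distinct labels k when n molecules each pick one of m labels.

    Poisson approximation: P0 = exp(-n/m) is the chance a given label is
    never chosen, so k = m * (1 - P0).
    """
    p0 = math.exp(-n / m)
    return m * (1.0 - p0)

def estimate_molecules(k, m):
    """Invert k = m * (1 - exp(-n/m)) to estimate n; requires 0 <= k < m."""
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    return -m * math.log(1.0 - k / m)

m = 960  # hypothetical number of distinct labels in the pool
for n in (10, 100, 1000):
    k = expected_unique_labels(n, m)
    print(f"n={n:5d}  expected k={k:8.2f}  back-estimated n={estimate_molecules(k, m):8.2f}")
```

When n is small relative to m, nearly every molecule gets a distinct label and k is close to n. As n grows, labels are reused and k saturates toward m, which is why the inversion becomes less precise near k = m.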

Author contributions: G.K.F. and S.P.A.F. designed research; G.K.F. and P.-H.W. performed research; G.K.F., J.H., and S.P.A.F. analyzed data; and G.K.F., J.H., and S.P.A.F. wrote the paper.

Conflict of interest statement: The authors are employees of Affymetrix, Inc. and the subject matter of this article may be a future commercial product.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.

1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1017621108/-/DCSupplemental.

9026–9031 ∣ PNAS ∣ May 31, 2011 ∣ vol. 108 ∣ no. 22 www.pnas.org/cgi/doi/10.1073/pnas.1017621108

Fu  et  al.  PNAS  2011    

Page 22: 20140710 1 day1_nist_ercc2.0workshop

CHARGE  TO  THE  WORKSHOP  

Page 23: 20140710 1 day1_nist_ercc2.0workshop

Charge to the Workshop

• Develop consensus on…
  – Concept
    • Shared interests
  – Portfolio
    • Controls
    • Analysis
    • Documentary Standards
• Develop consortium structure…
  – Working groups
  – Steering committee

Page 24: 20140710 1 day1_nist_ercc2.0workshop

Principles of Operation

• Pre-competitive
• Consensus decision-making
• Data-driven
• Technology independent
• Leadership
  – Working Groups
  – Steering Committee
• NIST-hosted
• “You get out of it what you put into it.”

Page 25: 20140710 1 day1_nist_ercc2.0workshop

“A rising tide floats all boats”

ERCC operates by consensus


Page 26: 20140710 1 day1_nist_ercc2.0workshop

VISION  OF  SCOPE  &  LIFESPAN  OF  ERCC  2.0  

Page 27: 20140710 1 day1_nist_ercc2.0workshop

Scope of ERCC 2.0

• ERCC 2.0 is convened to develop standard controls for RNA measurements
• Three working groups are proposed
  1. Design
     – Types of controls & sequence design
  2. Development
     – Building controls, developing & testing control mixtures
  3. Analysis
     – Standard performance metrics
     – Tools as needed to support design & development

Page 28: 20140710 1 day1_nist_ercc2.0workshop

The Arc of ERCC 2.0

• Products
  – Sequences representing different types of RNA
    • Transcript isoforms
    • miRNA
    • New mRNA mimics
    • …
  – Documentary standards for using controls
  – Performance metrics
• Logistics
  – Workshops
    • Number, frequency
  – Telecons, Mailing list, Wiki
• Development Schedules
  – Sequence selection
  – Control synthesis
  – Control testing and analysis
  – Reference material development, characterization, release
  – Analytical methods & tools
  – Documentary standards
  – …
  – Finished.
• Dissemination
  – Steering committee to address business models

 

Page 29: 20140710 1 day1_nist_ercc2.0workshop

ERCC  2.0  PROCESS  DISCUSSION  

Page 30: 20140710 1 day1_nist_ercc2.0workshop

What will we do together?

• NIST is committed to...
  – Hosting the consortium
  – Supporting product development
• Portfolio possibilities
  – Reference materials
  – Reference data
  – Analysis methods
  – Analysis tools
  – Documentary standards
  – …
• Define consortium mission
  – Purpose of ERCC 2.0 products
    • Providing infrastructure to discern signal from artifact
    • Confidence in RNA measurement results
    • …

Page 31: 20140710 1 day1_nist_ercc2.0workshop

How can we work together?

• How do we make decisions?
• How do we operate?
  – Working groups
  – Semi-annual meetings
  – Conference calls
  – Email list, wiki?
• Why a consortium?
  – We can make better standards together
• Things the consortium can do as an entity:
  – Integrate controls from the membership
  – Conduct validation studies
  – Make recommendations, guidelines, develop standards
    • Documentary standards to support regulated applications

Page 32: 20140710 1 day1_nist_ercc2.0workshop

PARTICIPANT  PRESENTATIONS  

Page 33: 20140710 1 day1_nist_ercc2.0workshop

WORKING  GROUP  AND  SCOPE  DISCUSSION    

Page 34: 20140710 1 day1_nist_ercc2.0workshop

Design Working Group

• Types of Controls
  – Transcript isoforms
  – miRNA
  – Small RNAs – pre-miRNA, noncoding
  – Cancer-fusion transcripts
  – Microbial RNAs
  – Polysome-associated RNA spike-ins
  – Long noncoding RNAs
  – Epitranscriptome standards
  – Refined mRNA mimics
  – …
• Design considerations
  – Sequence source
  – GC content
  – Length
  – Complexity
  – Poly-adenylation
  – Secondary structure
  – Non-cognate sequences (“alien”)
  – Modifications
  – …

Page 35: 20140710 1 day1_nist_ercc2.0workshop

Development Working Group

• Control synthesis
  – DNA templates, RNA molecules
  – Special modifications
• QC of DNA, RNA controls
  – Purity
  – Homogeneity
  – Stability
• Control Mixture Design
  – Dynamic range
  – Ratios
  – …
• Plan and conduct interlaboratory studies to evaluate controls
  – Validation of controls
  – Validation of concepts and analysis
  – Use multiple measurement technologies

Page 36: 20140710 1 day1_nist_ercc2.0workshop

Analysis Working Group

• Develop standard performance metrics
  – Develop reference implementation as example
• Applications
  – Process control
  – Quantitative benchmarking
  – Normalization
  – Optimization
• Tools as needed to support design & development
  – Validation study analysis
  – Mixture design tools


Page 38: 20140710 1 day1_nist_ercc2.0workshop

Design  Working  Group  

I  like…   I  wish…  

Page 39: 20140710 1 day1_nist_ercc2.0workshop

Closing Comments Day 1

• 9:00 am start tomorrow (there will be coffee)
• More presentations tomorrow morning
• Open Pitch session is also available tomorrow
  – Let us know if you want to speak tomorrow, but you can also just get up and pitch
• Working groups will reconvene tomorrow to develop summaries
• Please join us now for dinner