
ERCC 2.0 Workshop

Day 1 July 10, 2014

Sarah Munro and Marc Salit

Welcome

Agenda

•  Meeting Introduction
   –  ERCC 1.0 Recap & Applications
   –  Charge to Workshop
   –  ERCC 2.0 Scope & Process Discussion
•  Participant Presentations
   –  Bob Setterquist
   –  Lukas Paul
•  Break: 3:30 – 4:00pm
•  Participant Presentations
   –  Anne Bergstrom Lucas
   –  Karol Thompson
   –  Christopher Mason
•  Working Group Formation and Scoping Discussion
•  Dinner: 6:00pm

ERCC 1.0 RECAP

Gene Expression Measurements

[Slide image: first page of Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research 31(19): 5676–5684. DOI: 10.1093/nar/gkg763. The abstract reports that identical RNA preparations measured on three commercially available microarray platforms showed considerable divergence in gene expression levels and in significant expression changes across a subset of 2009 common genes, suggesting the need for industrial manufacturing standards and further independent and thorough validation of the technology.]

•  Transcript ratios across samples
•  Lack of confidence in gene expression experiments
   –  Same pair of samples, different platforms, different ratio results!
•  Critical applications
   –  Cancer Biology
   –  Drug Discovery
   –  Tissue engineering
   –  Stem Cell Biology

External RNA Control Consortium (ERCC)

•  Industry-initiated, NIST-hosted, stakeholder coupled
   –  grew out of NIST workshop in 2003
      •  initiated by Janet Warrington, VP Clinical Genomics at Affymetrix
   –  all major microarray technology developers
   –  other gene expression assay developers
•  Open to all interested parties
•  Voluntary
•  More than 90 participants
   –  Private, Public, Academic

Spike-ins

ERCC Collaborative Study

•  Developed sequence library from submission by ERCC members, as well as synthesis
   –  evaluated performance of RNA controls on variety of platforms
   –  selected 96 well-performing sequences in collaborative study
•  Array manufacturers modified products to include ERCC control sequences

[Figure: sequence counts shown as 176, 144, 106, 96]

SRM 2374 – DNA Sequence Library for External RNA Controls

•  NIST Standard Reference Material (SRM 2374)

•  Contains 96 unique control sequences inserted in common plasmid DNA
   –  engineered to be readily in vitro transcribed to make RNA controls
   –  RNA controls intended to mimic mammalian mRNA
•  http://www.nist.gov/srm/

Library of 96 Controls in SRM 2374

[Figure: two histograms of the 96 controls. Sequence Lengths: count (0 to 25) against sequence length (500 to 2000). GC Content: count (0 to 14) against GC content (0.35 to 0.50).]
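Sequence length and GC content are simple per-sequence statistics. As a minimal sketch, assuming each control is available as a plain nucleotide string, they could be computed as follows. The toy sequences and the names `gc_content` and `summarize` are hypothetical and are not taken from SRM 2374.

```python
from typing import Dict, Tuple


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize(controls: Dict[str, str]) -> Dict[str, Tuple[int, float]]:
    """Map each control name to (length, GC fraction)."""
    return {name: (len(seq), gc_content(seq)) for name, seq in controls.items()}


# Hypothetical toy sequences, for illustration only (not real ERCC controls).
toy_controls = {
    "toy-1": "ATGCGCATTA",
    "toy-2": "GGGCCCATAT",
}

for name, (length, gc) in summarize(toy_controls).items():
    print(f"{name}: length={length}, GC={gc:.2f}")
```

Binning these two values across all controls would give histograms of the kind shown on the slide.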

Creating Spike-in Mixtures from SRM 2374

SRM 2374 Plasmid DNA Library → in vitro transcription → RNA transcripts → Pooling → Mixtures with known abundance ratios

Feature     A_1   A_2   A_3   B_1   B_2   B_3
…
T1            1     5     4     0     2     3
T2          200   204   199   101    97   103
T3          142   153   147   149   130   155
ERCC-0001     5     8    10    20    23    19
…
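Because the mixtures are built with known abundance ratios, the counts for a spike-in row can be compared against the ratio that was designed in. Here is a minimal sketch using the ERCC-0001 row from the table above. The nominal ratio of 4 is an assumed value for illustration only (the slide does not state the designed ratio), and the function name is hypothetical.

```python
import math
from statistics import mean

# Counts for ERCC-0001 from the example table (three replicates per mixture).
counts_a = [5, 8, 10]    # A_1, A_2, A_3
counts_b = [20, 23, 19]  # B_1, B_2, B_3

# Assumed nominal B/A abundance ratio, for illustration only.
NOMINAL_RATIO_B_OVER_A = 4.0


def observed_log2_ratio(numerator, denominator):
    """log2 of the ratio of mean counts between two sets of replicates."""
    return math.log2(mean(numerator) / mean(denominator))


obs = observed_log2_ratio(counts_b, counts_a)
expected = math.log2(NOMINAL_RATIO_B_OVER_A)
print(f"observed log2(B/A) = {obs:.2f}, nominal = {expected:.2f}, "
      f"bias = {obs - expected:+.2f}")
```

A real analysis would also normalize for sequencing depth between libraries before comparing ratios; this sketch skips that step.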

Method Validation with erccdashboard R package

erccdashboard Package Vignette

Sarah A. Munro

May 4, 2014

This vignette describes the use of the erccdashboard R package to analyze External RNA Control Consortium (ERCC) spike-in control ratio mixtures in gene expression experiments. If you use this package for method validation of your gene expression experiments please cite our publication:

Please cite our paper when you use the erccdashboard package for analysis. This is a placeholder citation, because our manuscript is still under review.

Munro SA, Lund S, Pine PS, Binder H, Clevert D, Conesa A, Dopazo J, Fasold M, Hochreiter S, Hong H, Jafari N, Kreil DP, Łabaj PP, Li S, Liao Y, Lin S, Meehan J, Mason CE, Santoyo J, Setterquist RA, Shi L, Shi W, Smyth GK, Stralis-Pavese N, Su Z, Tong W, Wang C, Wang J, Xu J, Ye Z, Yang Y, Yu Y, Salit M (Under Review, 2014). Assessing Technical Performance in Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures.

A BibTeX entry for LaTeX users is

@Article{,
  title = {Assessing Technical Performance in Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures},
  author = {Munro SA and Lund S and Pine PS and Binder H and Clevert D and Conesa A and Dopazo J and Fasold M and Hochreiter S and Hong H and Jafari N and Kreil DP and {\L}abaj PP and Li S and Liao Y and Lin S and Meehan J and Mason CE and Santoyo J and Setterquist RA and Shi L and Shi W and Smyth GK and Stralis-Pavese N and Su Z and Tong W and Wang C and Wang J and Xu J and Ye Z and Yang Y and Yu Y and Salit M},
  journal = {Under Review},
  volume = {0},
  pages = {0},
  year = {2014},
}

Analysis is shown for two types of samples spiked with ERCC control ratio mixtures from the SEQC project:

• Rat toxicogenomics treatment and control samples for different drug treatments

• Human reference RNA samples from the MAQC I project, Universal Human Reference RNA (UHRR) and Human Brain Reference RNA (HBRR)

•  Open-source R package
   –  erccdashboard
•  Assess technical performance of a gene expression experiment
•  Compare results
   –  Within a single laboratory
   –  Between laboratories


APPLICATIONS OF ERCC 1.0

Product and Method Development

•  Certified Sequences
•  Known concentrations
   –  Validation and method testing
   –  Product development and evaluation

Measurement Analysis Quality Control

•  Limit of Detection
•  Dynamic Range
•  Noise models

Sample Normalization

•  Key to comparing transcriptomes:
   –  Immunology
   –  Agriculture
   –  Virology
   –  Cancer

Single-Cell Measurements

•  Normalization
•  Noise modeling
•  RT Efficiency
•  Limit of Detection

Others…

Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Justin M. Zook1*, Daniel Samarov2, Jennifer McDaniel1, Shurjo K. Sen3, Marc Salit1

1 Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 2 Statistical Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 3 Genetic Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
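For readers less familiar with the units in the abstract, a Phred-scaled quality score Q relates to the base-call error probability p by Q = −10·log10(p). The small sketch below shows that conversion and what a shift of 5 units means in terms of error probability. The function names are illustrative and are not part of GATK.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# A 5-unit difference in Q corresponds to a factor of 10**0.5 (about 3.16)
# in the implied error probability, regardless of the starting score.
fold_change = phred_to_error_prob(25) / phred_to_error_prob(30)
print(f"Q25 vs Q30 error probability ratio: {fold_change:.2f}")
```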

Citation: Zook JM, Samarov D, McDaniel J, Sen SK, Salit M (2012) Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing. PLoS ONE 7(7): e41356. doi:10.1371/journal.pone.0041356

Editor: Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Germany

Received February 28, 2012; Accepted June 20, 2012; Published July 31, 2012

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

As sequencing costs drop, it is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant. However, systematic sequencing errors (SSEs) and biases remain problematic even at high sequencing depths, so recent research has started to focus on understanding these SSEs and biases [1,2]. In this work, we focus on SSEs rather than coverage biases, where SSEs are systematic errors in sample preparation and sequencing processes that cause base call errors to accumulate preferentially at certain base positions in the genome, and coverage biases are biases in the number of reads covering certain genomic regions such as GC-bias [3–5]. Examples of SSEs, as well as random errors, are portrayed in Figure 1(a). Compensating for these SSEs is critical for applications in which a variant might be expected to be in only a small fraction of the reads, such as samples containing RNA-editing [6,7], cancer tissues and circulating tumor cells [8–11], fetal DNA in mother’s blood [12], mixtures of bacterial strains [13], mitochondrial heteroplasmy [14], mosaic disorders [15], and pooled samples [16,17]. Since the causes of many SSEs are not well understood and may vary due to batch effects in a run-specific manner, compensating for them requires training data sets. The two previously proposed approaches either use a separate data set with special characteristics (e.g., SysCall uses overlapping paired-end reads [1]) or use the data set itself excluding regions known to have variants (e.g., Genome Analysis Toolkit, or GATK, base quality score recalibration [2]). Here, we combine the advantages of these approaches by using DNA or RNA spike-in standards without homology to almost all biological organisms.

The first approach, SysCall, used a methyl-Seq dataset that had overlapping paired-end reads to detect SSEs depending on sequencing direction for the Illumina sequencer [1]. The region in which the reads overlap can be used to find systematic errors that preferentially occur on one DNA strand compared to the other strand. To improve variant calls, the SysCall method uses a separate dataset with overlapping reads to train a logistic regression model that accounts for SSEs correlated with several covariates: (1) the 2 preceding bases + the base in question (each base independently), (2) directionality bias of the errors, the proportion of non-reference reads, and (3) a comparison of the quality scores of the error base to the next base. Most sequencing runs do not contain overlapping paired reads, so SysCall assumes the SSEs in a training data set are the same as the SSEs in other


Experience from Expression Analysis
Thanks to Wendell Jones and Erik Aronesty

I like…
•  1000s of RNA-Seq samples
   –  Ambion Mix 1
   –  “did the library reactions work appropriately and consistently?”
   –  “did our lab degrade samples or were the samples already degraded?”
   –  “effects (or lack of) between lane or flowcell”

I wish…
•  “Construct Ext RNA Controls that emulate a variety of splice variation (some that may be challenging) and have them at different magnitudes”
   –  “examine not only the chemistry but also the bioinformatic pipeline to ensure it has basic fitness.”
•  “Suggest a protocol for adding Ext RNA Controls for FFPE.”
   –  “While we spike in ERCC controls at a fixed amount for FFPE samples, we get out a wild range of sequence coming out that aligns to the ERCC controls.”
   –  Hypothesis: “much of the target RNA is so damaged that it doesn't ligate to adapters correctly, but ERCC controls do (ligate); as a result they are (much) preferentially amplified.”

ERCC 1.0 Shortcomings

•  Poly A Selection is broken
•  Too short
•  No isoforms
•  No good mimics for variants
   –  SNPs, cancer fusions
•  Bimodal GC distribution

[Figure: ScaledEffect (scale −10 to 5) by Feature (ERCC controls ERCC-00002 through ERCC-00171) and Lab (Lab1 to Lab9)]

Opportunities for ERCC 2.0

•  New technologies
   –  RNA-Seq
   –  Long reads
      •  PacBio, Moleculo
   –  Digital counting
      •  Cellular Research, digital PCR
•  Method improvements
   –  Library preparation
   –  Bioinformatics
•  New discoveries

Counting individual DNA molecules by the stochastic attachment of diverse labels

Glenn K. Fu, Jing Hu, Pei-Hua Wang, and Stephen P. A. Fodor1

Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA 95051

Edited* by Ronald W. Davis, Stanford Genome Technology Center, Palo Alto, CA, and approved March 22, 2011 (received for review November 27, 2010)

We implement a unique strategy for single molecule counting termed stochastic labeling, where random attachment of a diverse set of labels converts a population of identical DNA molecules into a population of distinct DNA molecules suitable for threshold detection. The conceptual framework for stochastic labeling is developed and experimentally demonstrated by determining the absolute and relative number of selected genes after stochastically labeling approximately 360,000 different fragments of the human genome. The approach does not require the physical separation of molecules and takes advantage of highly parallel methods such as microarray and sequencing technologies to simultaneously count absolute numbers of multiple targets. Stochastic labeling should be particularly useful for determining the absolute numbers of RNA or DNA molecules in single cells.

absolute counting ∣ digital PCR ∣ next-generation sequencing ∣ single molecule detection

Determining small numbers of biological molecules and their changes is essential when unraveling mechanisms of cellular response, differentiation or signal transduction, and in performing a wide variety of clinical measurements. Although many analytical methods have been developed to measure the relative abundance of different molecules through sampling (e.g., microarrays and sequencing), the only practical method available to determine the absolute number of molecules in a sample is digital PCR (1–3), a powerful analytical technique typically limited to examining only a few different molecules at a time.

In 2003, a theoretical approach to measure the number of molecules of a single mRNA species in a complex mRNA preparation was proposed (4). To our knowledge no experimental demonstration of this idea has been published. We have generalized this idea and have expanded it to a highly parallel method capable of absolute counting of many different molecules simultaneously. The concept is illustrated in Fig. 1. Each copy of a molecule randomly captures a label by choosing from a large, nondepleting reservoir of diverse labels. The subsequent diversity of the labeled molecules is governed by the statistics of random choice, and depends on the number of copies of identical molecules in the collection compared to the number of kinds of labels. Once the molecules are labeled, they can be amplified so that simple present/absent threshold detection methods can be used for each. Counting the number of distinctly labeled targets reveals the original number of molecules of each species.

We can generalize the stochastic labeling process as follows. Consider a given set of copies of a single target sequence $T = \{t_1, t_2, \ldots, t_n\}$, where $n$ is the number of copies of $T$. A set of labels is defined as $L = \{l_1, l_2, \ldots, l_m\}$, where $m$ is the number of different labels. $T$ reacts stochastically with $L$, such that each $t$ becomes attached to one $l$. If the $l$s are in nondepleting excess, each $t$ will choose one $l$ randomly, and will take on a new identity $l_i t_j$, where $l_i$ is chosen from $L$ and $j$ is the $j$th copy from the set of $n$ molecules. We identify each new molecule $l_i t_j$ by its label subscript and drop the subscript for the copies of $T$ because they are identical. The new collection of molecules becomes $T^* = \{l_1 t, l_2 t, \ldots, l_i t\}$, where $l_i$ is the $i$th choice from the set of $m$ labels. At this point, the subscripts of $l$ refer only to the $i$th choice and provide no information about the identity of each $l$. In fact, $l_1$ and $l_2$ will have some probability of being identical, depending upon the diversity $m$ of the set of labels. Overall, $T^*$ will contain a set of $k$ unique labels resulting from $n$ targets choosing from the nondepleting reservoir of $m$ labels. Or, $T^*(m,n) = \{l_k t\}$, where $k$ represents the number of unique labels that have been captured. In all cases, $k$ will be smaller than $m$, approaching $m$ only when $n$ becomes very large. We can define the stochastic attachment of the set of labels on a target using a stochastic operator $S$ with $m$ members, acting upon a target population of $n$, such that $S(m)T(n) = T^*(m,n)$, generating the set $\{l_k t\}$. Furthermore, because $S$ operates on all molecules independently, it can act on many different targets. Hence, by combining the information of target sequence and label, we can simultaneously count copies of multiple target sequences. The probability of the number of labels generated by the number of trials $n$, from a diversity of $m$, can be approximated by the Poisson equation, $P_x = [(n/m)^x / x!]\, e^{-(n/m)}$. Then $P_0$ is the probability that a label will not be chosen in $n$ trials, therefore, $1 - P_0$ is the probability that a label will occur at least once. It follows that the expected number of unique labels captured is given by:
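What follows is a reconstruction from the Poisson approximation above, not a quotation of the paper's equation. Since $P_0 = e^{-n/m}$, each of the $m$ labels is captured at least once with probability $1 - e^{-n/m}$, so the expected number of unique labels is $k = m(1 - e^{-n/m})$. The Python sketch below evaluates that expression, inverts it to estimate $n$ from an observed $k$, and checks it against a small simulation. The function names and the values of m and n are illustrative assumptions.

```python
import math
import random


def expected_unique_labels(m: int, n: int) -> float:
    """Expected number of distinct labels captured, k = m * (1 - exp(-n/m))."""
    return m * (1.0 - math.exp(-n / m))


def estimate_copies(m: int, k: float) -> float:
    """Invert the relationship to estimate n from an observed k (requires k < m)."""
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    return -m * math.log(1.0 - k / m)


def simulate_unique_labels(m: int, n: int, seed: int = 0) -> int:
    """Let n molecules each pick one of m labels uniformly at random; count distinct labels."""
    rng = random.Random(seed)
    return len({rng.randrange(m) for _ in range(n)})


# Illustrative values only; these m and n are assumptions for the example.
m, n = 1000, 500
print(f"expected k  = {expected_unique_labels(m, n):.1f}")
print(f"simulated k = {simulate_unique_labels(m, n)}")
```

The inverse function reflects the counting idea in the text: once $k$ distinct labels are detected for a target, the original copy number can be estimated, provided $k$ stays well below $m$.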

[Figure: identical DNA target molecules {t1, t2, …, tn} (t1 to t4 shown) are randomly labeled from a pool of labels {l1, l2, …, lm}, giving t1l20, t2l107, t3l477 and t4l9, followed by amplification and detection of k distinctly labeled molecules.]

Fig. 1. A schematic representation of the labeling process. An example showing four identical target molecules in solution. Each DNA molecule randomly captures and joins with a label by choosing from a large, nondepleting reservoir of m labels. Each resulting labeled DNA molecule takes on a new identity and is amplified to detect the number of k distinct labels.

Author contributions: G.K.F. and S.P.A.F. designed research; G.K.F. and P.-H.W. performed research; G.K.F., J.H., and S.P.A.F. analyzed data; and G.K.F., J.H., and S.P.A.F. wrote the paper.

Conflict of interest statement: The authors are employees of Affymetrix, Inc. and the subject matter of this article may be a future commercial product.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.

1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1017621108/-/DCSupplemental.

9026–9031 ∣ PNAS ∣ May 31, 2011 ∣ vol. 108 ∣ no. 22 www.pnas.org/cgi/doi/10.1073/pnas.1017621108

Fu et al. PNAS 2011

CHARGE TO THE WORKSHOP

Charge to the Workshop

•  Develop consensus on…
   –  Concept
      •  Shared interests
   –  Portfolio
      •  Controls
      •  Analysis
      •  Documentary Standards
•  Develop consortium structure…
   –  Working groups
   –  Steering committee

Principles of Operation

•  Pre-competitive
•  Consensus decision-making
•  Data-driven
•  Technology independent
•  Leadership
   –  Working Groups
   –  Steering Committee
•  NIST-hosted
•  “You get out of it what you put into it.”

“A rising tide floats all boats”

ERCC operates by consensus

“A rising tide floats all boats…”

VISION OF SCOPE & LIFESPAN OF ERCC 2.0

Scope of ERCC 2.0

•  ERCC 2.0 is convened to develop standard controls for RNA measurements
•  Three working groups are proposed
   1.  Design
       –  Types of controls & sequence design
   2.  Development
       –  Building controls, developing & testing control mixtures
   3.  Analysis
       –  Standard performance metrics
       –  Tools as needed to support design & development

The Arc of ERCC 2.0

•  Products
   –  Sequences representing different types of RNA
      •  Transcript isoforms
      •  miRNA
      •  New mRNA mimics
      •  …
   –  Documentary standards for using controls
   –  Performance metrics
•  Logistics
   –  Workshops
      •  Number, frequency
   –  Telecons, Mailing list, Wiki
•  Development Schedules
   –  Sequence selection
   –  Control synthesis
   –  Control testing and analysis
   –  Reference material development, characterization, release
   –  Analytical methods & tools
   –  Documentary standards
   –  …
   –  Finished.
•  Dissemination
   –  Steering committee to address business models

ERCC 2.0 PROCESS DISCUSSION

What will we do together?

•  NIST is committed to...
   –  Hosting the consortium
   –  Supporting product development
•  Portfolio possibilities
   –  Reference materials
   –  Reference data
   –  Analysis methods
   –  Analysis tools
   –  Documentary standards
   –  …
•  Define consortium mission
   –  Purpose of ERCC 2.0 products
      •  Providing infrastructure to discern signal from artifact
      •  Confidence in RNA measurement results
      •  …

How can we work together?

•  How do we make decisions?
•  How do we operate?
   –  Working groups
   –  Semi-annual meetings
   –  Conference calls
   –  Email list, wiki?
•  Why a consortium?
   –  We can make better standards together
•  Things the consortium can do as an entity:
   –  Integrate controls from the membership
   –  Conduct validation studies
   –  Make recommendations, guidelines, develop standards
      •  Documentary standards to support regulated applications

PARTICIPANT PRESENTATIONS

WORKING GROUP AND SCOPE DISCUSSION

Design Working Group

•  Types of Controls
   –  Transcript isoforms
   –  miRNA
   –  Small RNAs – pre-miRNA, noncoding
   –  Cancer-fusion transcripts
   –  Microbial RNAs
   –  Polysome-associated RNA spike-ins
   –  Long noncoding RNAs
   –  Epitranscriptome standards
   –  Refined mRNA mimics
   –  …
•  Design considerations
   –  Sequence source
   –  GC content
   –  Length
   –  Complexity
   –  Poly-adenylation
   –  Secondary structure
   –  Non-cognate sequences (“alien”)
   –  Modifications
   –  …

Development Working Group

•  Control synthesis
   –  DNA templates, RNA molecules
   –  Special modifications
•  QC of DNA, RNA controls
   –  Purity
   –  Homogeneity
   –  Stability
•  Control Mixture Design
   –  Dynamic range
   –  Ratios
   –  …
•  Plan and conduct interlaboratory studies to evaluate controls
   –  Validation of controls
   –  Validation of concepts and analysis
   –  Use multiple measurement technologies

Analysis Working Group

•  Develop standard performance metrics
   –  Develop reference implementation as example
•  Applications
   –  Process control
   –  Quantitative benchmarking
   –  Normalization
   –  Optimization
•  Tools as needed to support design & development
   –  Validation study analysis
   –  Mixture design tools


Design Working Group

I like… I wish…

Closing Comments Day 1

•  9:00 am start tomorrow (there will be coffee)

•  More presentations tomorrow morning

•  Open Pitch session is also available tomorrow
   –  Let us know if you want to speak tomorrow, but you can also just get up and pitch

•  Working groups will reconvene tomorrow to develop summaries

•  Please join us now for dinner