Mapping Sites of Transcription Across the Drosophila Genome Using High Resolution Tiling Microarrays LBNL, Berkeley CA August 20, 2007 A. Willingham Affymetrix, Inc


Mapping Sites of Transcription Across the Drosophila Genome

Using High Resolution Tiling Microarrays LBNL, Berkeley CA

August 20, 2007

A. Willingham Affymetrix, Inc

Presentation Outline

I. Affymetrix’s Contribution to Specific Aims and Milestones

II. Previous Studies
•Manak et al. analysis of developmental transcriptome

III. Initial Results for Aim I
•sample preparation & data processing
•first look at cell line data on 35bp arrays
•pilot analysis of brand-new 7bp arrays

IV. RACE-array
•example of ENCODE extension analysis of genes on Chr21 & 22

V. Summary and Steps for Moving Forward

Specific Aim 1

• 480 samples on 35-bp genome tiling arrays
• 24 samples on 7-bp genome tiling array sets
• 160 RACE-fragment pools (16,000 prod’s)

Specific Aim 2

• RNAi of 120 RNA binding proteins on arrays

Specific Aim 3

• Northern blotting of ncRNA models

RNA Samples and Genome Tiling Arrays

35 bp Arrays

                                 total   A+     nuclear   polysomal   total #    total # arrays
                                 RNA     RNA    RNA       RNA         samples    (triplicates)
Life Cycle    Embryos            12      12     12        12          48         144
              Larvae             10      10     -         -           20         60
              Pupae              5       5      -         -           10         30
              Adults (male)      8       8      -         -           16         48
              Adults (female)    8       8      -         -           16         48
Cell Lines    Survey             100     -      -         -           100        300
              Selected           -       30     30        30          90         270
Dissected tissues                90      90     -         -           180        540
TOTAL                                                                 480        1440

7 bp Arrays

                                 A+      small RNAs           total # array sets
                                 RNA     (strand-specific)    (triplicates)*
Life Cycle    Embryos            8       8                    72
              Larvae             2       2                    18
              Pupae              2       2                    18
              Adults (male)      2       2                    18
              Adults (female)    2       2                    18
Cell Lines    Survey             -       -                    -
              Selected           2       2                    18
Dissected tissues                6       6                    54
TOTAL                            24      24                   216

Milestones

Aim        Description                  Group                             Total
Aim 1.1    RNA sample preparation       Bloomington (Cherbas)             480 RNAs in triplicate
Aim 1.2    35-bp tiling arrays          Affymetrix (Gingeras)             1,440 35-bp arrays
Aim 1.3    5-bp tiling arrays           Affymetrix (Gingeras)             504 5-bp arrays (72 sets)**
Aim 1.4    pooled RT-PCR products       Affymetrix (Gingeras)             160 pools of 100 products
Aim 2.1    data synthesis               Wash U (Brent)                    data synthesis
Aim 2.2a   RT-PCR sequencing            Wash U (Brent)                    20,000 RT-PCR to validate exons
Aim 2.2b   RACE sequencing              LBNL (Celniker)                   20,000 RACE to map TSSs
Aim 2.2c   cDNA screening               LBNL (Celniker)                   6,000 cDNA screens for new transcripts
Aim 2.3a   ncRNA sequencing             LBNL (Celniker)                   16 454 Life Sciences runs
Aim 2.3b   ncRNA Northern blots         Affymetrix (Gingeras)             1,000 Northern blots
Aim 2.4    RNA splicing                 UCHC (Graveley); UCB (Brenner)    120 RNAi knock-downs of RNA binding proteins
Aim 3.1    RNAi of ncRNAs               Harvard (Perrimon)                100 ncRNA transcripts
Aim 3.2    overexpression of ncRNAs     Harvard (Perrimon)                100 ncRNA transcripts

Aim        Year 1            Year 2            Year 3               Year 4
           Life Cycle        Cell Lines        Dissected Tissues    Advanced Dissections
Aim 1.1    120 RNAs x 3      120 RNAs x 3      120 RNAs x 3         120 RNAs x 3
Aim 1.2    360 arrays        360 arrays        360 arrays           360 arrays
Aim 1.3    18 array sets     18 array sets     18 array sets        18 array sets
Aim 1.4    40 pools          40 pools          40 pools             40 pools
Aim 2.1    data synthesis    data synthesis    data synthesis       data synthesis
Aim 2.2a   5,000 RT-PCR      5,000 RT-PCR      5,000 RT-PCR         5,000 RT-PCR
Aim 2.2b   5,000 RACE        5,000 RACE        5,000 RACE           5,000 RACE
Aim 2.2c   1,500 cDNAs       1,500 cDNAs       1,500 cDNAs          1,500 cDNAs
Aim 2.3a   4 runs            4 runs            4 runs               4 runs
Aim 2.3b   250 blots         250 blots         250 blots            250 blots
Aim 2.4    30 knock-downs    30 knock-downs    30 knock-downs       30 knock-downs
Aim 3.1    25 ncRNAs         25 ncRNAs         25 ncRNAs            25 ncRNAs
Aim 3.2    25 ncRNAs         25 ncRNAs         25 ncRNAs            25 ncRNAs

•stepwise nature of individual aims & responsibilities
•involvement & interdependencies of each step
•propose shifting milestones to more of a “ramp-up” model

Timeline for Milestones

months                     0-12     12-24    24-36    36-48

current
35bp arrays                120x3    120x3    120x3    120x3
7bp array sets             18       18       18       18
RACE pools (100)           40       40       40       40
Northerns                  250      250      250      250
RNA-binder K.D.            30       30       30       30

proposed
35bp arrays                70x3     110x3    140x3    160
7bp array sets             10       20       20       22
RACE pools (100)           20       40       50       50
Northerns                  100      200      400      400
RNA-binder K.D.            15       30       30       45
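The ramp-up proposal can be checked against the four-year totals by summing each row of the timeline. The sketch below uses only the per-year figures from the slide; the dictionary keys are illustrative labels, and the final 35bp entry is read as 160 samples on the assumption that it follows the same sample-count convention as the other years.

```python
# Per-year figures transcribed from the "Timeline for Milestones" slide.
# Row labels are illustrative; 35bp entries count RNA samples (each run in triplicate).
current = {
    "35bp samples": [120, 120, 120, 120],
    "7bp array sets": [18, 18, 18, 18],
    "RACE pools": [40, 40, 40, 40],
    "Northern blots": [250, 250, 250, 250],
    "RNA-binder knock-downs": [30, 30, 30, 30],
}
proposed = {
    "35bp samples": [70, 110, 140, 160],  # last value assumed to be samples
    "7bp array sets": [10, 20, 20, 22],
    "RACE pools": [20, 40, 50, 50],
    "Northern blots": [100, 200, 400, 400],
    "RNA-binder knock-downs": [15, 30, 30, 45],
}


def totals(schedule):
    """Sum each milestone row over the four project years."""
    return {name: sum(per_year) for name, per_year in schedule.items()}


current_totals = totals(current)
proposed_totals = totals(proposed)
for name in current:
    print(f"{name}: current {current_totals[name]}, proposed {proposed_totals[name]}")
```

With the figures as transcribed, the proposed schedule reaches the same four-year totals as the current one for every row except Northern blots, which sum to 1,100 rather than 1,000.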

Previous Studies

Manak et al. Nature Genetics, v38 Sep 2006

Transcription Analysis of Early (0-24 hr) Drosophila Embryogenesis

•70% Annotated
•30% Unannotated

Differential expression in Drosophila embryogenesis (~40kb region of Chromosome 3R)

[Figure: signal tracks for twelve 2-hr embryonic time windows, 0-2 hr through 22-24 hr; figure labels: 5’ TSS, 19Kb]

Maternally Expressed Genes (Restarted in two patterns)

Unannotated transcription updates known gene annotations

Manak et al. Nature Genetics, v38 Sep 2006

Drosophila:
•5’ sites predicted by txn co-reg.
•~1500 genes
•avg 1st intron size = ~20kb
•avg 1st annotated intron = ~1.7kb

Initial Results of Aim I

Affymetrix sample preparation & data generation pipeline

sample treatment & QC
•DNase-treat
•BioAnalyzer

1st-strand cDNA synth.
•random primed
•Superscript-II

2nd-strand cDNA synth.
•DNA Pol-I
•save aliquot for downstream QC

label & hybridize to arrays
•TdT-based end labeling
•CEL file generation

signal graph generation
•median-scaling
•q-norm bioreps
•select bandwidth

transfrag generation
•select min-run
•select max-gap

data distribution
•tomeweb hosting
•FTP to servers?
•deliver to DCC, GEO, etc

quality control
•overlap w/ RACE
•Northern blots
•QPCR of cDNA

This example shows the method for generating RNA maps; the pipeline is similar for other applications:
•RNA maps of long and short RNAs
•RACE-array maps
•RNAi knockdown experiments
•chromatin-immunoprecipitation
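The "median-scaling" and "q-norm bioreps" steps of signal graph generation can be illustrated with a small sketch. This is a generic version of the two techniques, not the TAS implementation: the function names and toy intensities are made up for illustration, and the target of 200 is borrowed from the "Norm target = 200" setting used in the 7bp pilot.

```python
from statistics import median


def median_scale(values, target=200.0):
    """Rescale one array's intensities so that their median equals `target`."""
    factor = target / median(values)
    return [v * factor for v in values]


def quantile_normalize(replicates):
    """Force replicate arrays onto one shared intensity distribution.

    `replicates` is a list of equal-length lists, one per array.
    Ties are broken by probe order, which is a simplification.
    """
    n = len(replicates[0])
    # Probe indices of each array, ordered from lowest to highest intensity.
    orders = [sorted(range(n), key=rep.__getitem__) for rep in replicates]
    # Reference distribution: mean intensity across arrays at each rank.
    reference = [
        sum(rep[order[rank]] for rep, order in zip(replicates, orders))
        / len(replicates)
        for rank in range(n)
    ]
    normalized = [[0.0] * n for _ in replicates]
    for out, order in zip(normalized, orders):
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
    return normalized


# Toy intensities for two biological replicates (hypothetical values).
rep_a = [100.0, 400.0, 250.0, 50.0]
rep_b = [120.0, 300.0, 500.0, 80.0]

scaled = [median_scale(rep) for rep in (rep_a, rep_b)]
norm_a, norm_b = quantile_normalize(scaled)
print(norm_a)
print(norm_b)
```

After both steps the replicates share the same set of intensity values, and each probe keeps its within-array rank.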

Current Sample Prep
(5 cell line samples completed in triplicate)
(for 3 other cell lines, several samples failed)

All columns after "sample" are Affymetrix (AFFX) processing records.

RNA   Biological                DNase       RNA ug yield    Agilent       cDNA yield (ug)   ug amt hyb   array     hyb
ID    sample ID   sample        treat       (post-DNase)    BioAnalyzer   (15ug RNA)        to array     type      date        comment
14    BS10        S2-DRSC       8/7/2007    69.3            8/8/2007      15.3              8ug          35bp-AS   8/13/2007
16    BS11        S2-DRSC       8/7/2007    72.8            8/8/2007      16.5              8ug          35bp-AS   8/13/2007
18    BS12        S2-DRSC       8/7/2007    73.0            8/8/2007      4.4                                                  repeat cDNA synth
86    BS28        CME L1        8/7/2007    35.3            8/8/2007      14.4              8ug          35bp-AS   8/13/2007
88    BS29        CME L1        8/7/2007    60.1            8/8/2007      12.9              8ug          35bp-AS   8/13/2007
90    BS30        CME L1        8/7/2007    49.0            8/8/2007      14.5              8ug          35bp-AS   8/13/2007
134   BS13        ML-DmD16-c3   8/7/2007    22.8            8/8/2007      11.5              8ug          35bp-AS   8/13/2007
136   BS14        ML-DmD16-c3   8/7/2007    35.4            8/8/2007                                                           RNA sample degraded
138   BS15        ML-DmD16-c3   8/7/2007    48.0            8/8/2007      10.7              8ug          35bp-AS   8/13/2007
170   BS1         ML-DmD11      7/25/2007   38.4            7/25/2007     17.3              8ug          35bp-AS   8/13/2007
172   BS2         ML-DmD11      7/25/2007   36.4            7/25/2007     18.9              8ug          35bp-AS   8/13/2007
174   BS3         ML-DmD11      7/25/2007   28.8            7/25/2007     16.8              8ug          35bp-AS   8/13/2007
182   BS4         ML-DmD20-c2   7/25/2007   31.7            7/25/2007     15.3              8ug          35bp-AS   8/13/2007
184   BS5         ML-DmD20-c2   7/25/2007   27.6            7/25/2007     13.4              8ug          35bp-AS   8/13/2007
186   BS6         ML-DmD20-c2   7/25/2007   33.5            7/25/2007     14.4              8ug          35bp-AS   8/13/2007
188   BS16        ML-DmD20-c5   8/7/2007    62.8            8/8/2007      14.1              8ug          35bp-AS   8/13/2007
190   BS17        ML-DmD20-c5   8/7/2007    64.8            8/8/2007      16.2              8ug          35bp-AS   8/13/2007
192   BS18        ML-DmD20-c5   8/7/2007    32.3            8/8/2007      16.3              8ug          35bp-AS   8/13/2007
194   BS7         ML-DmD17-c3   7/25/2007   23.3            7/25/2007     16.1              8ug          35bp-AS   8/13/2007
196   BS8         ML-DmD17-c3   7/25/2007   5.14            7/25/2007                                                          sample lost in EtOH precip
198   BS9         ML-DmD17-c3   7/25/2007   58.1            7/25/2007     14.6              8ug          35bp-AS   8/13/2007
200   BS34        Sg4           8/7/2007    30.4            8/8/2007      16.0              8ug          35bp-AS   8/13/2007
202   BS35        Sg4           8/7/2007    49.8            8/8/2007      12.2              8ug          35bp-AS   8/13/2007
204   BS36        Sg4           8/7/2007    44.6            8/8/2007      14.4              8ug          35bp-AS   8/13/2007

•Hosted at http://transcriptome.affymetrix.com/download/modENCODE/

RNA QC by Agilent BioAnalyzer

sample name               conc (mg/ml)   comments
#170_BS1_ML-DmD11         0.384          date: 07-25-07
#86_BS28_CME L1           0.70613        date: 08-08-07
#136_BS14_ML-DmD16-c3     0.70777        date: 08-08-07; RNA sample degraded, obtain more from Cherbas group

[Gel and electropherogram images for each sample; electropherogram axes: fluorescence [FU] vs. time [s]]

Chr2L: Transcription Expression Maps Across ~50 Kb

[Figure: signal tracks for ML-DmD4-c1, ML-DmBG3-c2, Kc167, CME-W1-Cl8]

Chr2L: Transcription Expression Maps Across ~25 Kb

[Figure: signal tracks for ML-DmD4-c1, ML-DmBG3-c2, Kc167, CME-W1-Cl8]

transcription in 4 Drosophila cell lines: overlapping transcription

[Chart: union of transcription in 4 cell lines. y-axis: base-pair coverage (0.0E+00 to 3.5E+07); x-axis: Kc167; Kc167+DmBG3; Kc167+DmBG3+Cl8; Kc167+DmBG3+Cl8+DmD4]

transcription in 4 Drosophila cell lines: overlapping annotation

[Chart: union of transcription in 4 cell lines. y-axis: % overlap (relative to blanket transfrags), 0% to 70%; x-axis: Kc167; Kc167+DmBG3; Kc167+DmBG3+Cl8; Kc167+DmBG3+Cl8+DmD4; series: exons, 500bp promoter, 5’ exons, introns, intergenic regions]

RNA Samples and Genome Tiling Arrays
7 nt resolution arrays

•new 7bp design
  •5 arrays, total of ~14.4 million probes
  •by comparison, 35bp array has ~3.1 million probes
  •5bp design required 7 arrays… 40% more chips required
    •1512 arrays instead of 1080
    •replicates & strand not calculated in original budget
  •updated genome version (release 5) used for design
  •repeats can be masked or unmasked
    •virtual probes

•existing 35-bp design
  •1 array, total of ~3.1 million probes
  •Affy commercial group will produce an “updated” 2.0 design
    •39bp resolution, release5-based design
    •however, we will continue using the current design

•35bp resolution more optimal for RNA maps
•7bp arrays have better coverage & newer design
•question of $ cost per array?

•comparison of nucleotide coverage (dm3, release5)
  •35bp array = 111,117,940 nt
  •7bp array masked = 107,355,171 nt
  •7bp array unmasked = 118,523,115 nt
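The chip counts quoted above follow from simple arithmetic. A quick sketch, assuming (this link is an inference, not stated on the slide) that both 1512 and 1080 come from the 216 array sets in the 7 bp sample table:

```python
# Assumption: 216 array sets, the TOTAL from the 7 bp sample table.
ARRAY_SETS = 216
CHIPS_PER_SET_5BP = 7  # the earlier 5bp design
CHIPS_PER_SET_7BP = 5  # the new 7bp design

chips_5bp = ARRAY_SETS * CHIPS_PER_SET_5BP
chips_7bp = ARRAY_SETS * CHIPS_PER_SET_7BP
extra_pct = 100 * (chips_5bp / chips_7bp - 1)

# Probe counts as quoted on the slide (approximate).
probe_ratio = 14.4e6 / 3.1e6

print(f"5bp design: {chips_5bp} chips; 7bp design: {chips_7bp} chips")
print(f"5bp design needs {extra_pct:.0f}% more chips")
print(f"7bp set carries about {probe_ratio:.1f}x the probes of the 35bp array")
```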

•Cherbas total RNA samples from 2 cell lines (KC & clone8)

•Same labeled reactions hyb’d to 35bp and 7bp arrays

•Signal graphs generated in TAS: the 2 technical replicates for each sample were quantile-normalized together

•Bandwidth = 30 (7bp) or 50 (35bp), Norm target = 200

•Transfrags generated in TAS using a 5% false-positive rate threshold derived from bacterial negative controls

•7bp arrays: min-run 50, max-gap 10

•35bp arrays: min-run 50, max-gap 90

•Intersections of 7bp vs 35bp and overlap with FlyBase annotations performed in Galaxy

•Hosted at: http://transcriptome.affymetrix.com/download/modENCODE/pilot_studies/Dros-7bp-pilot/

•Share with modENCODE DCC & ArrayExpress to determine whole-chromosome vs whole-chip data hosting
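The min-run and max-gap settings listed above control how positive probes are joined into transfrags. The sketch below is a simplified illustration of that idea, not the TAS algorithm: the function name, the threshold of 100, and the toy signal are all hypothetical, and probe length is ignored (positions are treated as single coordinates).

```python
def call_transfrags(probes, threshold, min_run, max_gap):
    """Join positive probes into transfrags.

    `probes` is a list of (position, signal) pairs sorted by position.
    Consecutive positive probes are merged when they lie within `max_gap`
    bp of each other; merged runs shorter than `min_run` bp are dropped.
    """
    positives = [pos for pos, signal in probes if signal >= threshold]
    transfrags = []
    start = end = None
    for pos in positives:
        if start is None:
            start = end = pos
        elif pos - end <= max_gap:
            end = pos  # extend the current run
        else:
            if end - start >= min_run:
                transfrags.append((start, end))
            start = end = pos  # begin a new run
    if start is not None and end - start >= min_run:
        transfrags.append((start, end))
    return transfrags


# Toy 7bp-spaced probes: a strong block at 0-70 and a short block at 105-119.
probes = [
    (pos, 300.0 if pos <= 70 or 105 <= pos <= 119 else 20.0)
    for pos in range(0, 141, 7)
]

tight = call_transfrags(probes, threshold=100.0, min_run=50, max_gap=10)
loose = call_transfrags(probes, threshold=100.0, min_run=50, max_gap=90)
print(tight)
print(loose)
```

On this toy signal the 7bp-style rules (min-run 50, max-gap 10) keep the two blocks separate and drop the short one, while the 35bp-style max-gap of 90 bridges the gap and reports a single longer transfrag.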

New 7-bp 5-chip array compared to 35-bp 1-chip array

Improved exon discrimination by transfrags from 7bp arrays

Pseudo-ROC curves comparing base-pair coverage & overlap with annotated exons

[Chart: probe FPR thresholds: 7bp vs 35bp arrays. x-axis: % false bacterial transfrags (0.0% to 2.5%); y-axis: % base-pair coverage (0% to 60%); series: 35bp exon overlap, 35bp total txn, 7bp exon overlap, 7bp total txn]

•five different thresholds for the calculated probe false-positive rate were used
  •1%, 3%, 5%, 7%, 10% (7% and 10% not shown for 35bp array)

•7bp arrays have a markedly lower false-positive rate for forming transfrags from bacterial negative regions
  •~4-5 fold lower than 35bp arrays
  •attributable to higher probe density and different min-run & max-gap rules

•35bp and 7bp arrays have similar amounts of bp coverage in transfrags
  •BUT 7bp arrays have 50-65% more transfrags
  •7bp transfrags are more “fragmented” and do a better job of delineating exons with small introns
  •7bp array has better “resolution” of small exons

•intersection with annotations shows that 35bp and 7bp arrays detect similar amounts of transcription as measured by bp coverage
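Base-pair coverage and overlap with annotation, as used in these comparisons, reduce to interval arithmetic. The pilot analysis performed its intersections in Galaxy; the sketch below is an independent, minimal illustration of the same kind of calculation, with hypothetical coordinates and function names.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_bp(intervals):
    """Total number of base pairs covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap_bp(a, b):
    """Number of base pairs covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical transfrags and annotated exons on one chromosome.
transfrags = [(100, 300), (250, 400), (900, 1000)]
exons = [(150, 350), (950, 1200)]

shared = overlap_bp(transfrags, exons)
print(f"transfrag coverage: {covered_bp(transfrags)} bp")
print(f"exon bp detected: {shared} of {covered_bp(exons)}")
```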

sample                    bp coverage    # of transfrags    bp coverage    # of transfrags
CME-W1-Cl8 35bp array     19,419,814     75,586             12,939,004     43,798
CME-W1-Cl8 7bp array      19,205,875     113,791            13,686,476     61,929
KC167 35bp array          21,604,108     78,879             13,772,604     44,243
KC167 7bp array           21,744,331     130,819            14,412,649     64,075

*transfrags were generated using bacterial controls false-positive rate of 5%

total array detected

sample        % false bact transfrags    % bp coverage (exon overlap)    % bp coverage (total)
35bp_5%fpr    2.3%                       46.4%                           17.2%
7bp_5%fpr     0.5%                       50.4%                           18.0%
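The "50-65% more transfrags" and "~4-5 fold lower" figures can be reproduced from the tabulated values. A short check, assuming the first pair of columns in the per-sample table gives the total transfrag counts (the variable names are illustrative):

```python
# Transfrag counts from the first "# of transfrags" column of the sample table.
transfrag_counts = {
    "CME-W1-Cl8": {"35bp": 75_586, "7bp": 113_791},
    "KC167": {"35bp": 78_879, "7bp": 130_819},
}

pct_more = {
    sample: 100 * (counts["7bp"] / counts["35bp"] - 1)
    for sample, counts in transfrag_counts.items()
}

# % false bacterial transfrags at the 5% probe FPR threshold.
false_35bp, false_7bp = 2.3, 0.5
fold_lower = false_35bp / false_7bp

for sample, pct in pct_more.items():
    print(f"{sample}: {pct:.1f}% more transfrags on the 7bp array")
print(f"false bacterial transfrags: {fold_lower:.1f}-fold lower on the 7bp array")
```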

Summary: 7bp arrays

Improved exon discrimination by transfrags from 7bp arrays

modENCODE RACE array methodology

•5` RACE for 16,000 Drosophila genes

•choice of tissues?

•hybridize products (in pools of 100) to 35bp arrays

•1Mb separation between genes

•confirm presence of transfrags

•identify new, “rare” transfrags made detectable by PCR amplification

•human ENCODE project has done a similar study on the genes present on chromosomes 21 & 22
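One way to read the pooling constraints above (pools of 100 products, 1Mb separation between genes) is as a greedy assignment problem. The sketch below is a hypothetical illustration under that reading, not the project's actual pooling procedure; gene names and coordinates are made up.

```python
def assign_pools(genes, pool_size=100, min_sep=1_000_000):
    """Greedily assign genes to pools.

    `genes` is a list of (name, chromosome, position) tuples. A gene joins
    the first pool that has room and contains no gene on the same
    chromosome within `min_sep` bp; otherwise a new pool is opened.
    """
    pools = []
    for name, chrom, pos in sorted(genes, key=lambda g: (g[1], g[2])):
        for pool in pools:
            has_room = len(pool) < pool_size
            well_separated = all(
                c != chrom or abs(p - pos) >= min_sep for _, c, p in pool
            )
            if has_room and well_separated:
                pool.append((name, chrom, pos))
                break
        else:
            pools.append([(name, chrom, pos)])
    return pools


# Hypothetical genes: g1 and g2 are only 500 kb apart on 2L.
genes = [
    ("g1", "2L", 100_000),
    ("g2", "2L", 600_000),
    ("g3", "2L", 1_200_000),
    ("g4", "3R", 150_000),
]

pools = assign_pools(genes)
pool_names = [[name for name, _, _ in pool] for pool in pools]
print(pool_names)
```

Keeping pooled genes far apart means a transfrag detected on the array can be attributed to a single RACE product in that pool.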

RACE Analysis of Coding Genes

DiGeorge Critical Region 14 gene

Kapranov, et al. Genome Res. (2005)

Conclusions

•array types & applications
  •pilot analysis of 7bp arrays
  •updated for dm3-release5 genome annotation: bpmaps & IGB

•sample processing pipeline & data generation
  •multiple applications require different types of graphs & transfrags
  •bandwidth0 versus smoothing (e.g. bandwidth50)

•RACE array
  •lessons learned by ENCODE

•QC and validation
  •some of the specific aims (Northerns, RACE) will address these
  •additional analysis such as RT-PCR and QPCR validation of novel transcripts

•data hosted at affy-transcriptome website:
  •http://transcriptome.affymetrix.com/download/modENCODE/
  •sharing pilot data with DCC (Nicole Washington) to facilitate the process

Steps Moving Forward

•adjusting milestones?
•changes in samples? (usage of 7bp versus 35bp)
•shifting focus in favor of more analysis of small RNAs?
•data hosting and transfer issues?

Acknowledgements

Computation
S. Ghosh, H. Tammana, N. Garg, S. Dike, J. Cheng

Molecular Biology
I. Bell, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, P. Kapranov, A. Willingham, J. Manak

AFFX Transcriptome Group
Tom Gingeras

supplemental slides

Kapranov et al. Science, v316 Jun 2007

same intronic expression seen by all arrays

value-of-probe-density

value-of-smoothing

value-of-unmasked

masking-issue-in-exons

unmasked regions are frequently higher