Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays

J. Tobler, M. Molla, J. Shavlik (University of Wisconsin-Madison)
M. Molla, E. Nuwaysir, R. Green (NimbleGen Systems Inc.)

Oligonucleotide Microarrays

[Figure: probes attached to the chip surface]

Specific probes synthesized at known spots on chip’s surface

Probes complementary to RNA of genes to be measured

Typical gene (1kb+) MUCH longer than typical probe (24 bases)


Probes: Good vs. Bad

[Figure: a good probe and a bad probe. Blue = Probe, Red = Sample.]


Probe-Picking Method Needed

Hybridization characteristics differ between probes

Probe set represents very small subset of gene

Accurate measurement of expression requires good probe set


Related Work

Use known hybridization characteristics: Lockhart et al. 1996

Melting point (Tm) predictions: Kurata and Suyama 1999; Li and Stormo 2001

Stable secondary structure: Kurata and Suyama 1999


Our Approach

Apply established machine-learning algorithms: train on categorized examples, test on examples with category hidden

Choose features to represent probes

Categorize probes as good or bad


The Features

Feature Name: Description

fracA, fracC, fracG, fracT: The fraction of A, C, G, or T in the 24-mer

fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: The fraction of each of these dimers in the 24-mer

n1, n2, …, n24: The particular nucleotide (A, C, G, or T) at the specified position in the 24-mer

d1, d2, …, d23: The particular dimer (AA, AC, …, TT) at the specified position in the 24-mer
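A minimal sketch of how the features listed above might be computed for a single probe. The helper name `probe_features` is hypothetical, and dividing dimer counts by the number of overlapping dimer positions (23 for a 24-mer) is an assumption; the slides do not say how the dimer fractions were normalized.

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT


def probe_features(probe):
    """Return a dict of composition and position features (hypothetical helper)."""
    n = len(probe)
    dimers = [probe[i:i + 2] for i in range(n - 1)]  # overlapping dimers
    features = {}
    # fracA .. fracT: fraction of each base in the probe
    for b in BASES:
        features["frac" + b] = probe.count(b) / n
    # fracAA .. fracTT: fraction of each dimer (assumed: out of n - 1 positions)
    for d in DIMERS:
        features["frac" + d] = dimers.count(d) / (n - 1)
    # n1 .. n24: nucleotide at each position (1-based, as on the slide)
    for i, b in enumerate(probe, start=1):
        features["n%d" % i] = b
    # d1 .. d23: dimer starting at each position
    for i, d in enumerate(dimers, start=1):
        features["d%d" % i] = d
    return features


features = probe_features("CATCGATCGTAATCGTACCGGTCA")  # a 24-mer
```

For a 24-mer this yields 4 + 16 + 24 + 23 = 67 features per probe.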


The Data

Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
Complement: CATCGATCGTAATCGTACCGGTCAGTAC…

Probe 1: CATCGATCGTAATCGTACCGGTCA

Probe 2: ATCGATCGTAATCGTACCGGTCAG

Probe 3: TCGATCGTAATCGTACCGGTCAGT

… …

Tilings of 8 genes (from E. coli & B. subtilis): every possible probe (~10,000 probes); genes known to be expressed in sample
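The tiling shown on this slide can be sketched as a sliding window over the complement. The function names are hypothetical, and the sketch follows the slide in taking the base-by-base complement in the same left-to-right order (not the reverse complement).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Base-by-base complement, in the same left-to-right order as the slide."""
    return "".join(COMPLEMENT[b] for b in seq)


def tile_probes(gene, probe_len=24):
    """Every probe of length probe_len, one per starting position."""
    comp = complement(gene)
    return [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]


gene = "GTAGCTAGCATTAGCATGGCCAGTCATG"  # visible prefix of the slide's example
probes = tile_probes(gene)
```

A gene of length L gives L - 23 probes of 24 bases, so the first three entries of `probes` match Probe 1, Probe 2, and Probe 3 above.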


Our Microarray

Defining our Categories

[Figure: histogram of Frequency vs. Normalized Probe Intensity, with x-axis marks at 0, .05, .15, and 1.0]

Low Intensity = BAD Probes (45%)

Mid-Intensity = Not Used in Training Set (23%)

High Intensity = GOOD Probes (32%)
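A small sketch of the labeling step, assuming the x-axis marks at .05 and .15 are the cut-offs between the three intensity bands. That reading of the figure, the handling of values exactly on a boundary, and the function name are all assumptions.

```python
LOW_CUTOFF = 0.05   # assumed from the figure's x-axis marks
HIGH_CUTOFF = 0.15  # assumed from the figure's x-axis marks


def categorize(normalized_intensity):
    """Return 'bad', 'good', or None (mid-intensity, left out of training)."""
    if normalized_intensity < LOW_CUTOFF:
        return "bad"
    if normalized_intensity > HIGH_CUTOFF:
        return "good"
    return None


labels = [categorize(x) for x in (0.01, 0.10, 0.60)]
```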


The Machine Learning Techniques

Naïve Bayes (Mitchell 1997)

Neural Networks (Rumelhart et al. 1995)

Decision Trees (Quinlan 1996)

Can interpret predictions of each learner probabilistically


Naïve Bayes

Assumes conditional independence between features

Makes judgments about test-set examples based on conditional probability estimates made on the training set


Naïve Bayes

For each example in the test set, evaluate the following:

$$P(\text{high}) \prod_i P(\text{feature}_i = \text{value}_i \mid \text{high})$$

$$P(\text{low}) \prod_i P(\text{feature}_i = \text{value}_i \mid \text{low})$$
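A minimal sketch of this comparison for discrete feature values, working in log space to avoid underflow. The function names, the add-one (Laplace) smoothing, and the treatment of every feature as discrete are assumptions; the slides do not say how probabilities were estimated or how the fraction features were handled.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, labels):
    """Count label and (feature, value, label) frequencies from the training set."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature, label) -> Counter of values
    values_seen = defaultdict(set)        # feature -> set of observed values
    for feats, label in zip(examples, labels):
        for name, value in feats.items():
            value_counts[(name, label)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def log_score(model, feats, label):
    """log P(label) + sum_i log P(feature_i = value_i | label), add-one smoothed."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    score = math.log(label_counts[label] / total)
    for name, value in feats.items():
        count = value_counts[(name, label)][value]
        # Add-one (Laplace) smoothing is an assumption, not from the slides.
        k = len(values_seen[name] | {value})
        score += math.log((count + 1) / (label_counts[label] + k))
    return score


def predict(model, feats):
    """Pick the category whose product of probabilities is larger."""
    return max(("high", "low"), key=lambda label: log_score(model, feats, label))


# Toy training set (illustrative data, not from the slides)
train_x = [{"n1": "C"}, {"n1": "C"}, {"n1": "A"}, {"n1": "A"}]
train_y = ["high", "high", "low", "low"]
model = train_naive_bayes(train_x, train_y)
```

Normalizing the two scores against each other would give the probabilistic reading of the prediction mentioned earlier in the deck.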


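Neural Network (1-of-n encoding with probe length = 3)

Example probe sequence: “CAG”

[Figure: network with one input unit per base and position (A1, C1, G1, T1, A2, C2, G2, T2, A3, C3, G3, T3), connected through weights to an output of Good or Bad, with labels for Weights, Activation, and Error]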
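The 1-of-n encoding for the example probe can be sketched as follows. The function name and the A, C, G, T ordering of the inputs within each position are assumptions.

```python
BASES = "ACGT"


def one_of_n(probe):
    """Encode a probe as one input per (base, position): A1, C1, G1, T1, A2, ..."""
    inputs = []
    for base in probe:
        inputs.extend(1 if base == b else 0 for b in BASES)
    return inputs


encoded = one_of_n("CAG")
# encoded == [0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 1, 0]
```

With this encoding a 24-mer becomes 96 binary inputs, exactly one of which is on for each position.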


Decision Tree

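Automatically builds a tree of rules

[Figure: example decision tree. A node testing n14 branches on the nucleotide (A, C, G, T); other nodes test features such as fracAC, fracT, fracTC, fracC, and fracG, with Low and High branches; leaves are Good Probe or Bad Probe]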


Decision Tree

The information gain of a feature, F, is:

$$\text{InformationGain}(S, F) = \text{Entropy}(S) - \sum_{v \in \text{Values}(F)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$
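A short sketch of this calculation, where S is the set of training examples and S_v is the subset in which feature F takes value v. The function names and the toy data are illustrative, and entropy is taken in bits (base-2 logarithm), which is the usual convention but is not stated on the slide.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    subsets = defaultdict(list)
    for v, label in zip(values, labels):
        subsets[v].append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["good", "good", "bad", "bad"]
gain_informative = information_gain(["C", "C", "A", "A"], labels)    # splits perfectly
gain_uninformative = information_gain(["C", "A", "C", "A"], labels)  # tells us nothing
```

A feature that separates good from bad probes perfectly recovers the full entropy of the set, while a feature unrelated to the category has zero gain.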


Information Gain per Feature

[Figure: two charts of Normalized Information Gain (y-axis, 0.0 to 1.0). Probe Composition Features: one value per base and dimer fraction (A, C, G, T, AA, AC, …, TT). Base Position Features: one value per Base Position (1 to 24) and per Dimer Position (1 to 23).]


Cross-Validation

Leave-one-out testing: For each gene (of the 8)

Train on all but this gene; test on this gene; record result; forget what was learned

Average results across 8 test genes
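The loop above can be sketched generically. The `train` and `evaluate` callables are a hypothetical interface standing in for any of the three learners, and the toy data and majority-class learner are illustrative only.

```python
def leave_one_gene_out(data_by_gene, train, evaluate):
    """data_by_gene maps a gene name to its list of (features, label) examples.

    `train` and `evaluate` are caller-supplied callables (hypothetical
    interface): train(examples) -> model, evaluate(model, examples) -> score.
    """
    results = []
    for held_out in data_by_gene:
        # Train on all but this gene
        training = [ex for gene, exs in data_by_gene.items()
                    if gene != held_out for ex in exs]
        model = train(training)  # a fresh model each time: nothing carries over
        # Test on this gene and record the result
        results.append(evaluate(model, data_by_gene[held_out]))
    # Average results across the test genes
    return sum(results) / len(results)


# Toy usage with a majority-class "learner" (illustrative data, not from the slides)
def train_majority(examples):
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)


def accuracy(model, examples):
    return sum(label == model for _, label in examples) / len(examples)


toy = {
    "geneA": [({}, "good"), ({}, "good")],
    "geneB": [({}, "good"), ({}, "bad")],
    "geneC": [({}, "good"), ({}, "good")],
}
mean_accuracy = leave_one_gene_out(toy, train_majority, accuracy)
```

Holding out whole genes, instead of individual probes, keeps overlapping probes from the same gene from appearing in both the training and test sets.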


Typical Probe-Intensity Prediction Across Short Region

[Figure: Normalized Probe Intensity (y-axis, 0 to 1) vs. Starting Nucleotide Position for 24-mer Probe (x-axis, 650 to 700). Series: Actual.]


Typical Probe-Intensity Prediction Across Short Region

[Figure: Normalized Probe Intensity (y-axis, 0 to 1) vs. Starting Nucleotide Position for 24-mer Probe (x-axis, 650 to 700). Series: Naïve Bayes, Decision Tree, Neural Network, Actual.]


Probe-Picking Results

[Figure: Number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) vs. Number of probes selected (x-axis, 0 to 20). Series: Perfect Selector.]


Probe-Picking Results

[Figure: Number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) vs. Number of probes selected (x-axis, 0 to 20). Series: Naïve Bayes, Neural Network, Decision Tree, Primer Melting Point, Perfect Selector.]
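The quantity plotted on these axes can be sketched as follows: rank probes by a selector's score, take the top n, and count how many have actual intensity at or above the 90th percentile. The function name, the nearest-rank percentile definition, and the toy values are assumptions.

```python
def hits_in_top_n(predicted, actual, n, percentile=90):
    """Of the n probes ranked highest by `predicted`, count those whose actual
    intensity is at or above the given percentile of all actual intensities."""
    ranked = sorted(actual)
    # Nearest-rank percentile (integer ceiling); this definition is an assumption.
    rank = max(1, -(-percentile * len(ranked) // 100))
    threshold = ranked[rank - 1]
    chosen = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)[:n]
    return sum(actual[i] >= threshold for i in chosen)


actual = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # illustrative values
perfect = actual                         # a perfect selector ranks by the truth
reversed_scores = [-a for a in actual]   # a worst-case selector
```

Evaluating `hits_in_top_n` for n = 1, 2, …, 20 traces one curve of the kind shown in the figure; the perfect selector gives the upper bound.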


Current and Future Directions

Consider more features: folding patterns, melting point

Feature selection

Evaluate specificity along with sensitivity (i.e., consider false positives)

Evaluate probe selection + gene calling

Try more ML techniques: SVMs, ensembles, …


Take-Home Message

Machine learning does a good job on this part of the probe-selection problem

Easy to collect a large number of training examples

Easily measured features work well

Intelligent probe selection can increase microarray accuracy and efficiency


Acknowledgements

NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array. Darryl Roy for helping in creating the training data. Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.


Thanks