Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays

J. Tobler, M. Molla, J. Shavlik (University of Wisconsin-Madison)

M. Molla, E. Nuwaysir, R. Green (NimbleGen Systems Inc.)

Oligonucleotide Microarrays

Specific probes synthesized at known spots on chip’s surface

Probes complementary to RNA of genes to be measured

Typical gene (1kb+) MUCH longer than typical probe (24 bases)

Probes: Good vs. Bad

[Figure: a good probe and a bad probe. Blue = Probe, Red = Sample]

Probe-Picking Method Needed

Hybridization characteristics differ between probes

Probe set represents very small subset of gene

Accurate measurement of expression requires good probe set

Related Work

Use known hybridization characteristics: Lockhart et al. 1996

Melting point (Tm) predictions: Kurata and Suyama 1999; Li and Stormo 2001

Stable secondary structure: Kurata and Suyama 1999

Our Approach

Apply established machine-learning algorithms

Train on categorized examples

Test on examples with category hidden

Choose features to represent probes

Categorize probes as good or bad

The Features

Feature Name: Description

fracA, fracC, fracG, fracT: The fraction of A, C, G, or T in the 24-mer

fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: The fraction of each of these dimers in the 24-mer

n1, n2, …, n24: The particular nucleotide (A, C, G, or T) at the specified position in the 24-mer

d1, d2, …, d23: The particular dimer (AA, AC, … TT) at the specified position in the 24-mer
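As a rough illustration of the four feature groups, the sketch below computes them for one probe. The function name `probe_features` is hypothetical, and dividing dimer counts by the number of overlapping dimers (23 for a 24-mer) is an assumption; the slides do not state the denominator.

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(pair) for pair in product(BASES, repeat=2)]  # AA, AC, ..., TT


def probe_features(probe):
    """Return a dict with the frac*, n* and d* features for one probe."""
    n = len(probe)
    # Overlapping dimers: positions 1..n-1 (23 dimers for a 24-mer).
    dimers = [probe[i:i + 2] for i in range(n - 1)]
    feats = {}
    for base in BASES:
        feats["frac" + base] = probe.count(base) / n
    for dimer in DIMERS:
        # Assumption: fraction is taken over the overlapping dimers.
        feats["frac" + dimer] = dimers.count(dimer) / len(dimers)
    for i, base in enumerate(probe, start=1):
        feats[f"n{i}"] = base
    for i, dimer in enumerate(dimers, start=1):
        feats[f"d{i}"] = dimer
    return feats


features = probe_features("CATCGATCGTAATCGTACCGGTCA")
```

For a 24-mer this yields 4 + 16 + 24 + 23 = 67 features: the fractions are numeric, the positional features are symbolic.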

The Data

Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…

Complement: CATCGATCGTAATCGTACCGGTCAGTAC…

Probe 1: CATCGATCGTAATCGTACCGGTCA

Probe 2: ATCGATCGTAATCGTACCGGTCAG

Probe 3: TCGATCGTAATCGTACCGGTCAGT

… …

Tilings of 8 genes (from E. coli & B. subtilis)

Every possible probe (~10,000 probes)

Genes known to be expressed in sample
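The tiling shown on this slide can be reproduced with a short sketch: take the base-by-base complement of the gene and slide a 24-base window along it one position at a time. The helper name `tile_probes` is hypothetical; the sequence is the truncated example from the slide, so only the first few probes are recoverable.

```python
# Base-by-base complement table (A<->T, C<->G), as in the slide's example.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tile_probes(gene, probe_len=24):
    """Return every probe of length probe_len from the complement of gene."""
    comp = gene.translate(COMPLEMENT)
    return [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]


gene = "GTAGCTAGCATTAGCATGGCCAGTCATG"  # visible prefix of the slide's gene
probes = tile_probes(gene)
for number, probe in enumerate(probes[:3], start=1):
    print(f"Probe {number}: {probe}")
```

A gene of length L yields L - 23 such probes, which is how tiling 8 genes can give roughly 10,000 probes.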

Our Microarray

[Figure: image of the microarray, labeled 0 to 99]

Defining our Categories

[Figure: histogram of Normalized Probe Intensity (x-axis, ticks at 0, .05, .15, and 1.0) against Frequency (y-axis)]

Low Intensity = BAD Probes (45%)

High Intensity = GOOD Probes (32%)

Mid-Intensity = Not Used in Training Set (23%)
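A minimal sketch of this labeling step follows. It assumes the category boundaries are the .05 and .15 ticks on the intensity axis, and it guesses at how values exactly on a boundary are treated; neither detail is confirmed by the slide. The name `categorize` is hypothetical.

```python
def categorize(intensity, low=0.05, high=0.15):
    """Label a probe by normalized intensity.

    Assumed cutoffs: below `low` is bad, above `high` is good,
    and anything in between is left out of the training set.
    """
    if intensity < low:
        return "bad"
    if intensity > high:
        return "good"
    return None  # mid-intensity: not used for training


intensities = [0.01, 0.10, 0.50]
labels = [categorize(x) for x in intensities]
```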

The Machine Learning Techniques

Naïve Bayes (Mitchell 1997)

Neural Networks (Rumelhart et al. 1995)

Decision Trees (Quinlan 1996)

Can interpret predictions of each learner probabilistically

Naïve Bayes

Assumes conditional independence between features

Makes judgments about test-set examples based on conditional probability estimates made on the training set

Naïve Bayes

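For each example in the test set, evaluate the following:

$$\frac{P(\text{high}) \prod_i P(\text{feature}_i = \text{value}_i \mid \text{high})}{P(\text{low}) \prod_i P(\text{feature}_i = \text{value}_i \mid \text{low})}$$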
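The ratio of class-conditional products can be sketched in a few lines. This is an illustrative implementation, not the authors' code: it treats every feature value as discrete (numeric fractions would first need binning), it adds Laplace smoothing so unseen values do not zero out a product, and the class name `NaiveBayes` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class NaiveBayes:
    def fit(self, examples, labels):
        """examples: list of {feature: value} dicts; labels: 'high' or 'low'."""
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for feats, label in zip(examples, labels):
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def _score(self, feats, label):
        """P(label) * prod_i P(feature_i = value_i | label), Laplace-smoothed."""
        score = self.class_counts[label] / sum(self.class_counts.values())
        for name, value in feats.items():
            count = self.value_counts[(label, name)][value]
            score *= (count + 1) / (self.class_counts[label] + len(self.values[name]))
        return score

    def ratio(self, feats):
        """Score for 'high' divided by score for 'low'."""
        return self._score(feats, "high") / self._score(feats, "low")


model = NaiveBayes().fit(
    [{"n1": "A"}, {"n1": "A"}, {"n1": "C"}, {"n1": "C"}],
    ["high", "high", "low", "low"],
)
```

With many features, summing log-probabilities instead of multiplying raw probabilities would be the more numerically robust choice.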

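Neural Network (1-of-n encoding with probe length = 3)

[Figure: neural network for the example probe sequence “CAG”. Input units A1, C1, G1, T1, A2, C2, G2, T2, A3, C3, G3, T3 connect through weights to an output giving Good or Bad; arrows are labeled ACTIVATION and ERROR.]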
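The 1-of-n input encoding can be sketched as below: each position gets four input units, one per base, and exactly one of them is switched on. The function name `one_of_n` and the A, C, G, T unit order within each position are assumptions for illustration.

```python
BASES = "ACGT"


def one_of_n(probe):
    """Encode a probe as 4 input units per position (order A, C, G, T)."""
    units = []
    for base in probe:
        units.extend(1 if base == b else 0 for b in BASES)
    return units


encoded = one_of_n("CAG")
# Units in order: A1 C1 G1 T1  A2 C2 G2 T2  A3 C3 G3 T3
```

For the 24-mer probes used here, the same scheme would give 24 × 4 = 96 input units.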

Decision Tree

Automatically builds a tree of rules

[Figure: example decision tree. Internal nodes test features such as n14 (branches A, C, G, T) and fracAC, fracT, fracTC, fracC, fracG (branches Low and High); leaves are Good Probe or Bad Probe.]

Decision Tree

The information gain of a feature, F, is:

$$\text{InformationGain}(S, F) = \text{Entropy}(S) - \sum_{v \in \text{Values}(F)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$
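The information-gain formula translates directly into code. The sketch below assumes a discrete feature (one value per example) and base-2 entropy; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) in bits for a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["good", "good", "bad", "bad"]
perfect = information_gain(["A", "A", "C", "C"], labels)  # feature separates the classes
useless = information_gain(["A", "C", "A", "C"], labels)  # feature tells us nothing
```

A feature that splits the examples into pure subsets recovers the full entropy of the set, while a feature independent of the class has zero gain.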

Information Gain per Feature

[Figure: Normalized Information Gain (0.0 to 1.0) per feature, in two panels. Probe Composition Features: the bases A, C, G, T and the 16 dimers AA through TT. Base Position Features: Base Position 1 to 24 and Dimer Position 1 to 23.]

Cross-Validation

Leave-one-out testing: for each gene (of the 8):

Train on all but this gene

Test on this gene

Record result

Forget what was learned

Average results across 8 test genes
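The leave-one-gene-out loop above can be sketched as follows. The `train` and `evaluate` callables are placeholders for whichever learner and metric are being tested, and `data_by_gene` is a hypothetical mapping from gene name to that gene's examples.

```python
def leave_one_gene_out(data_by_gene, train, evaluate):
    """Hold out each gene in turn, then average the per-gene results."""
    results = []
    for held_out in data_by_gene:
        # Train on the examples from every gene except the held-out one.
        training = [
            example
            for gene, examples in data_by_gene.items()
            if gene != held_out
            for example in examples
        ]
        model = train(training)  # a fresh model per fold: nothing carries over
        results.append(evaluate(model, data_by_gene[held_out]))
    return sum(results) / len(results)


# Toy stand-ins, only to show the control flow.
toy_data = {"g1": [1, 2], "g2": [3], "g3": [4, 5, 6]}
average = leave_one_gene_out(
    toy_data,
    train=lambda examples: sum(examples),
    evaluate=lambda model, test: model + len(test),
)
```

Splitting by gene rather than by probe keeps heavily overlapping neighboring probes from the same gene out of both the training and test sets at once.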

Typical Probe-Intensity Prediction Across Short Region

[Figure: Normalized Probe Intensity (0 to 1) vs. Starting Nucleotide Position for 24-mer Probe (650 to 700); series: Actual.]

Typical Probe-Intensity Prediction Across Short Region

[Figure: Normalized Probe Intensity (0 to 1) vs. Starting Nucleotide Position for 24-mer Probe (650 to 700); series: Naïve Bayes, Decision Tree, Neural Network, Actual.]

Probe-Picking Results

[Figure: Number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) vs. Number of probes selected (x-axis, 0 to 20); series: Perfect Selector.]

Probe-Picking Results

[Figure: Number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) vs. Number of probes selected (x-axis, 0 to 20); series: Naïve Bayes, Neural Network, Decision Tree, Primer Melting Point, Perfect Selector.]
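The quantity plotted on the y-axis can be sketched as a small function: pick the n probes a learner scores highest, then count how many have a measured intensity at or above the 90th percentile. The nearest-rank percentile definition and the name `hits_among_selected` are assumptions; the slides do not say how the percentile was computed.

```python
def hits_among_selected(scores, intensities, n, percentile=90):
    """Count top-n probes (by score) with intensity >= the given percentile."""
    ranked = sorted(intensities)
    # Nearest-rank percentile, using integer ceiling division (assumed definition).
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    threshold = ranked[rank - 1]
    # Indices of the n highest-scoring probes.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(1 for i in chosen if intensities[i] >= threshold)


measured = list(range(1, 11))           # toy intensities 1..10
perfect = hits_among_selected(measured, measured, n=2)
worst = hits_among_selected([-x for x in measured], measured, n=2)
```

A perfect selector scores probes in true intensity order, so every pick is a hit until the high-intensity probes run out.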

Current and Future Directions

Consider more features: folding patterns, melting point

Feature selection

Evaluate specificity along with sensitivity (i.e., consider false positives)

Evaluate probe selection + gene calling

Try more ML techniques: SVMs, ensembles, …

Take-Home Message

Machine learning does a good job on this part of the probe-selection problem

Easy to collect a large number of training examples

Easily measured features work well

Intelligent probe selection can increase microarray accuracy and efficiency

Acknowledgements

NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array. Darryl Roy for helping in creating the training data. Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.

Thanks
