Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks Yetian Chen 2008-12-12

Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks

Yetian Chen

2008-12-12

2008 Nobel Prize in Chemistry

Roger Tsien, Osamu Shimomura, Martin Chalfie

Green Fluorescent Protein (GFP)

Use GFP to track a protein in living cells

The cellular localization information of a protein is embedded in its amino acid sequence

PKKKRKV: Nuclear Localization Signal

VALLAL: transmembrane segment

Challenge: predict the cellular localization sites of a protein from its amino acid sequence

Extracting cellular localization information from protein sequence

mcg: McGeoch's method for signal sequence recognition.

gvh: von Heijne's method for signal sequence recognition.

alm: Score of the ALOM membrane spanning region prediction program.

mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.

erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.

pox: Peroxisomal targeting signal in the C-terminus.

vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.

nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.
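
The simplest attribute in the list above, erl, is a plain substring test and can be sketched in a few lines. This is only an illustration: the function name `erl_feature` and the toy sequences are made up here, and the other attributes (mcg, gvh, alm, and so on) come from dedicated scoring programs that are not reproduced.

```python
def erl_feature(sequence: str) -> int:
    """Binary attribute: 1 if the sequence contains the substring "HDEL", else 0."""
    return 1 if "HDEL" in sequence.upper() else 0


# Hypothetical toy sequences, not real proteins.
with_signal = "MKTAYIAKQRHDEL"
without_signal = "MKTAYIAKQRQISF"
print(erl_feature(with_signal), erl_feature(without_signal))
```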

Problem Statement & Datasets

Protein Name  mcg   gvh   lip   chg   aac   alm1  alm2  Location
EMRB_ECOLI    0.71  0.52  0.48  0.50  0.64  1.00  0.99  cp
ATKC_ECOLI    0.85  0.53  0.48  0.50  0.53  0.52  0.35  imS
NFRB_ECOLI    0.63  0.49  0.48  0.50  0.54  0.76  0.79  im

Dataset 1: 336 proteins from E.coli (Prokaryote Kingdom)
http://archive.ics.uci.edu/ml/datasets/Ecoli

Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom)
http://archive.ics.uci.edu/ml/datasets/Yeast
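
To make the data layout concrete, here is a minimal sketch of reading such rows into (name, attribute values, location) triples. It assumes whitespace-separated fields with the protein name first and the location last, as in the example rows on this slide; `RAW` and `parse_rows` are names chosen for this illustration.

```python
from typing import List, Tuple

# Three rows copied from the example table on this slide; the full E.coli set has 336 rows.
RAW = """\
EMRB_ECOLI 0.71 0.52 0.48 0.50 0.64 1.00 0.99 cp
ATKC_ECOLI 0.85 0.53 0.48 0.50 0.53 0.52 0.35 imS
NFRB_ECOLI 0.63 0.49 0.48 0.50 0.54 0.76 0.79 im
"""


def parse_rows(text: str) -> List[Tuple[str, List[float], str]]:
    """Split each non-empty line into protein name, numeric attributes, and location."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, *values, label = fields
        rows.append((name, [float(v) for v in values], label))
    return rows


rows = parse_rows(RAW)
for name, values, label in rows:
    print(name, values, label)
```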

Implementation of AI algorithms

Decision Tree

> C5

Neural Network

> Single-layer feed-forward NN: Perceptrons

> Multilayer feed-forward NN: one hidden layer

Implementation of Decision Tree: C5

Preprocessing of Dataset

> If an attribute is continuous, divide its range into 5 equal-width bins: tiny, small, medium, large, huge. Then discretize each value into one of these bins.

> If a feature value is missing (?), replace ? with tiny.
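
The preprocessing step can be sketched as follows. This is a hedged illustration rather than the author's code: the function name `discretize` is invented here, and taking the bin range from the observed minimum and maximum of a column is an assumption.

```python
BINS = ["tiny", "small", "medium", "large", "huge"]


def discretize(values, missing="?"):
    """Map numeric values to 5 equal-width bins; missing values become 'tiny'."""
    known = [float(v) for v in values if v != missing]
    lo, hi = min(known), max(known)          # assumption: range taken from observed data
    width = (hi - lo) / len(BINS)
    labels = []
    for v in values:
        if v == missing:
            labels.append("tiny")            # missing value rule from the slide
        elif width == 0:
            labels.append(BINS[0])           # constant column: everything in one bin
        else:
            index = int((float(v) - lo) / width)
            labels.append(BINS[min(index, len(BINS) - 1)])  # clamp the maximum into 'huge'
    return labels


print(discretize([0.0, 0.1, 0.5, "?", 0.95, 1.0]))
```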

Generating training set and test set

> Randomly split the data set into a training set and a test set, with 70% of the examples in the training set and 30% in the test set.
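
A random 70/30 split of this kind might look like the sketch below. The function name `split_dataset` and the `seed` parameter are assumptions added for reproducibility of the illustration.

```python
import random


def split_dataset(examples, train_fraction=0.7, seed=None):
    """Shuffle a copy of the examples and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# 336 stands in for the E.coli examples; any list of examples works.
train, test = split_dataset(range(336), seed=0)
print(len(train), len(test))
```

Repeating the split with different seeds gives the independent runs over which accuracy can be averaged.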

Learning the Decision Tree

> Using the decision tree learning algorithm in Chapter 18.3 of the textbook
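
Decision tree learners of this kind commonly pick the attribute to split on by information gain. The sketch below assumes that criterion and shows only the attribute-selection step on discretized data; the function names and the toy rows are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on one attribute; rows are dicts of bin labels."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy discretized examples (hypothetical values).
rows = [
    {"mcg": "small", "gvh": "large"},
    {"mcg": "small", "gvh": "small"},
    {"mcg": "huge", "gvh": "large"},
    {"mcg": "huge", "gvh": "small"},
]
labels = ["cp", "cp", "im", "im"]
print(best_attribute(rows, labels, ["mcg", "gvh"]))
```

In the toy data mcg separates the two classes perfectly while gvh carries no information, so mcg is selected.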

Testing

Implementation of Neural Networks: Structure of Perceptrons and Two-layer NN

Protein Name  mcg   gvh   lip   chg   aac   alm1  alm2  Location
EMRB_ECOLI    0.71  0.52  0.48  0.50  0.64  1.00  0.99  cp
ATKC_ECOLI    0.85  0.53  0.48  0.50  0.53  0.52  0.35  imS
NFRB_ECOLI    0.63  0.49  0.48  0.50  0.54  0.76  0.79  im

[Figure: two network diagrams, Perceptrons (left) and Two-layer NN (right). In each, the input nodes are the attributes (Att 1 to Att 4 shown) and there is one output node per location class (cp, imS, im). The desired output for an example of class cp is (1, 0, 0).]
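
The desired output vector in the diagrams is a one-of-N encoding of the location class. A minimal sketch, using only the three classes shown on this slide (the name `desired_output` is chosen for this illustration):

```python
CLASSES = ["cp", "imS", "im"]  # only the classes shown on this slide


def desired_output(label, classes=CLASSES):
    """One output node per class: 1 for the true class, 0 for the others."""
    return [1 if c == label else 0 for c in classes]


print(desired_output("cp"))   # [1, 0, 0]
print(desired_output("im"))   # [0, 0, 1]
```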

Implementation of Perceptrons & Two-layer NN: Algorithms

Function PERCEPTRONS-LEARNING (examples, network)
    initially set correct = 0
    initialize the weight matrix w[i][j] with random numbers within [-0.5, 0.5]
    while (correct < threshold)    // threshold = 0.0, 0.1, 0.2, …, 1.0
        for each e in examples do
            calculate the output of each output node j:    // g() is the sigmoid function
                $O_j = g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)$
            prediction = r such that $O_r = \max_j (O_j)$
            if r != y(e)
                for each output node j
                    $Err_j = y_j(e) - g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)$
                    for i = 1, …, m
                        $w[i][j] \leftarrow w[i][j] + \alpha \times Err_j \times x_i[e] \times g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right) \times \left(1 - g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)\right)$
                    endfor
                endfor
            endif
        endfor
    endwhile
    return w[i][j]
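
For readers who prefer running code, here is a hedged Python sketch of the perceptron learning loop above. Several details are assumptions rather than the author's implementation: `correct` is interpreted as the training accuracy, the bias weight w[0][j] is updated together with the others, a learning rate `alpha` and an epoch cap `max_epochs` are added, and the toy examples are invented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def outputs(w, x):
    """O_j = g(sum_i w[i][j] * x[i]) for every output node j."""
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x)))) for j in range(len(w[0]))]


def predict(w, x):
    o = outputs(w, x)
    return o.index(max(o))            # r such that O_r is the largest output


def accuracy(w, examples):
    return sum(predict(w, x) == y for x, y in examples) / len(examples)


def train_perceptrons(examples, n_classes, threshold, alpha=0.1, max_epochs=1000, seed=0):
    """examples: list of (x, y); x[0] == 1.0 is the bias input, y is a class index."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)] for _ in range(n_inputs)]
    for _ in range(max_epochs):       # assumption: epoch cap, not part of the pseudocode
        if accuracy(w, examples) >= threshold:
            break
        for x, y in examples:
            o = outputs(w, x)
            if o.index(max(o)) != y:  # update only on misclassified examples
                for j in range(n_classes):
                    err = (1.0 if j == y else 0.0) - o[j]
                    slope = o[j] * (1.0 - o[j])   # derivative of the sigmoid
                    for i in range(n_inputs):
                        w[i][j] += alpha * err * slope * x[i]
    return w


# Toy two-class examples (hypothetical), each with a leading bias input of 1.0.
examples = [([1.0, 0.0, 0.1], 0), ([1.0, 0.1, 0.0], 0),
            ([1.0, 0.9, 1.0], 1), ([1.0, 1.0, 0.8], 1)]
w = train_perceptrons(examples, n_classes=2, threshold=1.0)
print(accuracy(w, examples))
```

Lower thresholds stop training earlier, which is one way to limit overfitting to the training set.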

Two-layer NN (examples, network)

Uses the Back-Prop-Learning algorithm in Chapter 20.5 of the textbook
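
As a rough illustration of back-propagation for a network with one hidden layer, the sketch below performs gradient steps on the squared error of a single example. It is not the textbook's or the author's code: the weight layout (one row per node), the learning rate `alpha`, and the two-attribute toy input are assumptions made for brevity.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_out, x):
    """Return hidden activations (with a leading bias unit) and output activations."""
    hidden = [1.0] + [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(wi * hi for wi, hi in zip(row, hidden))) for row in w_out]
    return hidden, out


def backprop_step(w_hidden, w_out, x, target, alpha=0.5):
    """One gradient step on the squared error of a single example (in place)."""
    hidden, out = forward(w_hidden, w_out, x)
    # Output deltas: error times sigmoid derivative.
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
    # Hidden deltas: output deltas propagated back through w_out (index 0 is the bias unit).
    delta_hidden = [
        hidden[k + 1] * (1.0 - hidden[k + 1])
        * sum(w_out[j][k + 1] * delta_out[j] for j in range(len(w_out)))
        for k in range(len(w_hidden))
    ]
    for j, row in enumerate(w_out):
        for k in range(len(row)):
            row[k] += alpha * delta_out[j] * hidden[k]
    for k, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += alpha * delta_hidden[k] * x[i]


def squared_error(target, out):
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, out))


rng = random.Random(0)
n_inputs, n_hidden, n_classes = 3, 5, 3   # x[0] is a constant bias input; 5 hidden nodes
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_classes)]

x, target = [1.0, 0.71, 0.52], [1, 0, 0]  # bias, two attribute values, desired output for cp
before = squared_error(target, forward(w_hidden, w_out, x)[1])
for _ in range(200):
    backprop_step(w_hidden, w_out, x, target)
after = squared_error(target, forward(w_hidden, w_out, x)[1])
print(before, after)
```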

Results

Accuracy comparison

Dataset  Decision Tree  Perceptrons                  Two-layer NN (hidden nodes: 5)  Majority
E.coli   68.04±5.03%    66.76±6.34% (Threshold=0.7)  65.68±6.09% (Threshold=0.7)     45.05%
Yeast    46.63±2.55%    50.41±2.74% (Threshold=0.5)  50.28±2.23% (Threshold=0.55)    28.82%

• The statistics for Decision Tree are averaged over 100 runs

• The statistics for Perceptrons and Two-layer NN are averaged over 50 runs

• Threshold is the termination condition for training the neural networks
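
The Majority baseline and the "mean±spread" summary can be sketched as below. Two assumptions are made here: the baseline predicts the most frequent training class for every test example, and the ± value is a sample standard deviation over runs. The labels and accuracies in the example are toy values, not results from the slides.

```python
from collections import Counter
from statistics import mean, stdev


def majority_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


def summarize(accuracies):
    """Format per-run accuracies as 'mean±std%' (assumes sample standard deviation)."""
    return f"{100 * mean(accuracies):.2f}±{100 * stdev(accuracies):.2f}%"


# Toy labels and toy per-run accuracies (hypothetical).
train_labels = ["cp", "cp", "im", "cp", "imS"]
test_labels = ["cp", "im", "cp", "imS"]
print(majority_accuracy(train_labels, test_labels))   # 0.5
print(summarize([0.70, 0.66, 0.68]))                  # 68.00±2.00%
```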

Conclusions

The two datasets are not linearly separable.

For the E.coli dataset, DT, Perceptrons, and Two-layer NN achieve similar accuracy (about 66% to 68%).

For the yeast dataset, Perceptrons and Two-layer NN achieve slightly better accuracy (about 50%) than DT (46.63%).

All three AI algorithms are much more accurate than the simple majority algorithm (45.05% on E.coli, 28.82% on yeast).

Future work

Probabilistic model

Bayesian network

K-Nearest Neighbor

…

A protein localization site prediction scheme

mcg gvh alm mit erl pox vac nuc → Classifiers → prediction

Guides experimental design and biological research, saving much labor and time!
