Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks
Yetian Chen
2008-12-12
2008 Nobel Prize in Chemistry
Roger Tsien Osamu Shimomura Martin Chalfie
Green Fluorescent Protein (GFP)
Use GFP to track a protein in living cells
The cellular localization information of a protein is embedded in its sequence
PKKKRKV: Nuclear Localization Signal
VALLAL: transmembrane segment
Challenge: predict the cellular localization sites of a protein from its amino acid sequence
Extracting cellular localization information from protein sequence
mcg: McGeoch's method for signal sequence recognition.
gvh: von Heijne's method for signal sequence recognition.
alm: Score of the ALOM membrane spanning region prediction program.
mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
pox: Peroxisomal targeting signal in the C-terminus.
vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.
nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.
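The simplest attributes above are substring tests on the sequence. The sketch below is a minimal illustration, assuming a protein is given as a one-letter amino acid string; the function names `erl_feature` and `has_motif` are hypothetical, and the scoring methods behind mcg, gvh, alm, mit, vac and nuc are not reproduced here.

```python
def erl_feature(sequence: str) -> int:
    """Binary attribute: 1 if the sequence contains the substring "HDEL", else 0."""
    return 1 if "HDEL" in sequence.upper() else 0


def has_motif(sequence: str, motif: str = "PKKKRKV") -> bool:
    """True if the sequence contains the given motif.

    The default is the nuclear localization signal quoted on the earlier slide.
    """
    return motif in sequence.upper()


# Made-up sequences, used only to exercise the functions.
print(erl_feature("MKTAYIAKQRHDEL"))
print(has_motif("MAAPKKKRKVEDP"))
```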
Problem Statement & Datasets
Protein Name   mcg   gvh   lip   chg   aac   alm1  alm2  Location
EMRB_ECOLI     0.71  0.52  0.48  0.50  0.64  1.00  0.99  cp
ATKC_ECOLI     0.85  0.53  0.48  0.50  0.53  0.52  0.35  imS
NFRB_ECOLI     0.63  0.49  0.48  0.50  0.54  0.76  0.79  im
Dataset 1: 336 proteins from E.coli (Prokaryote Kingdom) http://archive.ics.uci.edu/ml/datasets/Ecoli
Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom) http://archive.ics.uci.edu/ml/datasets/Yeast
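A record like the sample rows above can be read with a few lines of code. This is a minimal sketch, assuming each record is one whitespace-separated line with the protein name first, the numeric attributes next and the location label last; `ProteinRecord` and `parse_row` are hypothetical names.

```python
from typing import List, NamedTuple


class ProteinRecord(NamedTuple):
    name: str
    features: List[float]
    location: str


def parse_row(line: str) -> ProteinRecord:
    """Split one record into name, numeric attributes and location label."""
    fields = line.split()
    return ProteinRecord(
        name=fields[0],
        features=[float(value) for value in fields[1:-1]],
        location=fields[-1],
    )


# First sample row from the slide.
row = parse_row("EMRB_ECOLI 0.71 0.52 0.48 0.50 0.64 1.00 0.99 cp")
print(row.name, row.features, row.location)
```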
Implementation of AI algorithms
Decision Tree
> C5
Neural Network
> Single-layer feed-forward NN: Perceptrons
> Multilayer feed-forward NN: one hidden layer
Implementation of Decision Tree: C5
Preprocessing of Dataset
> If a feature is numeric and continuous, divide its range into 5 equal-width bins (tiny, small, medium, large, huge), then discretize each value into its bin.
> If a feature value is missing (?), replace ? with tiny.
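The two preprocessing rules above can be sketched as follows. This is an illustration, not the author's code: it assumes the bin range runs from the smallest to the largest known value of the feature, and it maps a constant feature to tiny, which the slide does not specify. The name `discretize` is hypothetical.

```python
BIN_NAMES = ["tiny", "small", "medium", "large", "huge"]


def discretize(values, missing="?"):
    """Map raw feature values (numbers or '?') to five equal-width bin labels."""
    known = [float(v) for v in values if v != missing]
    if not known:
        return ["tiny"] * len(values)  # every value is missing
    low, high = min(known), max(known)
    width = (high - low) / len(BIN_NAMES)
    labels = []
    for v in values:
        if v == missing or width == 0:
            # Missing value (rule from the slide) or constant feature (assumption).
            labels.append("tiny")
            continue
        index = int((float(v) - low) / width)
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        labels.append(BIN_NAMES[min(index, len(BIN_NAMES) - 1)])
    return labels


print(discretize([0.0, 0.25, 0.5, 0.75, 1.0, "?"]))
```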
Generating training set and test set
> Randomly split the data set so that 70% of the examples go to the training set and 30% to the test set.
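A random 70/30 split can be sketched as below. The function name `split_dataset` and the `seed` parameter are assumptions added so that a run can be repeated; the slide only states the proportions.

```python
import random


def split_dataset(examples, train_fraction=0.7, seed=None):
    """Shuffle a copy of the examples and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 336 stands in for the E.coli data set size; the items are just indices here.
train, test = split_dataset(range(336), seed=0)
print(len(train), len(test))
```

Calling the function again with a different seed gives a fresh split, which is what repeated runs need.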
Learning the Decision Tree
> Using the decision tree learning algorithm in Chapter 18.3 of the textbook
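Decision tree learners of this kind pick, at each node, the attribute that best separates the classes. The sketch below assumes the selection criterion is information gain and that examples are (attribute dictionary, label) pairs over the discretized values; both the data layout and the function names are assumptions, not details taken from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, attribute):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    labels = [label for _, label in examples]
    groups = {}
    for features, label in examples:
        groups.setdefault(features[attribute], []).append(label)
    remainder = sum(len(group) / len(examples) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Tiny made-up example: mcg separates the classes perfectly, gvh not at all.
toy_examples = [
    ({"mcg": "tiny", "gvh": "small"}, "cp"),
    ({"mcg": "tiny", "gvh": "large"}, "cp"),
    ({"mcg": "huge", "gvh": "small"}, "im"),
    ({"mcg": "huge", "gvh": "large"}, "im"),
]
best = max(["mcg", "gvh"], key=lambda a: information_gain(toy_examples, a))
print(best)
```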
Testing
Implementation of Neural Networks
Structure of Perceptrons and two-layer NN
[Figure: two network diagrams, "Perceptrons" and "Two-layer NN". In each, the attribute values of an example (Att 1 to Att 4) feed the input nodes, and there is one output node per localization class (cp, imS, im). For an example whose location is cp, the desired output is 1, 0, 0.]
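The desired output of 1, 0, 0 for a cp protein is a one-hot encoding of the class label. A minimal sketch, restricted to the three classes shown in the diagram (the full data sets have more classes); `one_hot` is a hypothetical helper name.

```python
CLASSES = ["cp", "imS", "im"]


def one_hot(label, classes=CLASSES):
    """Desired output vector: 1 at the node of the true class, 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


print(one_hot("cp"))
```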
Implementation of Perceptrons & Two-layer NN: Algorithms
Function PERCEPTRONS-LEARNING (examples, network)
  initially set correct = 0
  initialize the weight matrix w[i][j] with random numbers within [-0.5, 0.5]
  While (correct < threshold)   // threshold = 0.0, 0.1, 0.2, …, 1.0
    for each e in the examples do
      calculate output for each output node j   // g() is the sigmoid function
        $O_j = g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)$
      prediction = r such that $O_r = \max_j (O_j)$
      if r != y(e)
        for each output node j
          for i = 1, …, m
            $w[i][j] \leftarrow w[i][j] + Err_j \times x_i[e] \times g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right) \times \left(1 - g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)\right)$
          endfor
        endfor
      endif
    endfor
  endwhile
  Return w[i][j]
where $Err_j = y_j(e) - g\left(\sum_{i=0}^{n} w[i][j]\, x_i[e]\right)$
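The perceptron procedure above can be sketched in runnable form as follows. Several details are assumptions rather than facts from the slides: a constant bias input x_0 = 1 is prepended, `correct` is taken to be the fraction of training examples classified correctly, a learning rate `alpha` (default 1.0, which leaves the update rule unchanged) and a `max_epochs` cap are added so the loop always ends, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def outputs(w, x):
    # O_j = g(sum_i w[i][j] * x[i]) for every output node j
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))))
            for j in range(len(w[0]))]


def classify(w, features):
    # Prepend the assumed bias input x_0 = 1, then pick the largest output.
    o = outputs(w, [1.0] + list(features))
    return o.index(max(o))


def accuracy(w, examples):
    return sum(classify(w, f) == y for f, y in examples) / len(examples)


def perceptrons_learning(examples, n_classes, threshold,
                         alpha=1.0, max_epochs=1000, seed=0):
    """examples: list of (feature list, class index) pairs."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0]) + 1  # +1 for the bias input
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)]
         for _ in range(n_inputs)]
    epochs = 0
    while accuracy(w, examples) < threshold and epochs < max_epochs:
        for features, y in examples:
            x = [1.0] + list(features)
            o = outputs(w, x)
            if o.index(max(o)) != y:  # update only on a wrong prediction
                for j in range(n_classes):
                    err = (1.0 if j == y else 0.0) - o[j]
                    for i in range(n_inputs):
                        # g'(in) = g(in) * (1 - g(in)) for the sigmoid
                        w[i][j] += alpha * err * x[i] * o[j] * (1.0 - o[j])
        epochs += 1
    return w


# Two made-up examples with two attributes each.
toy = [([0.0, 0.1], 0), ([0.9, 1.0], 1)]
weights = perceptrons_learning(toy, n_classes=2, threshold=1.0, max_epochs=50)
print(accuracy(weights, toy))
```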
Two-layer NN (examples, network)
Using the Back-Prop-Learning algorithm in Chapter 20.5 of the textbook
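For orientation, one back-propagation update for a network with a single hidden layer might look like the sketch below. It is a generic illustration with sigmoid units and squared error, not the textbook listing or the author's code; the bias inputs, the learning rate `alpha`, the weight range and all function names are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                for _ in range(n_in + 1)]      # +1 row for the bias input
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
             for _ in range(n_hidden + 1)]     # +1 row for the hidden bias
    return w_hidden, w_out


def forward(network, features):
    w_hidden, w_out = network
    x = [1.0] + list(features)
    hidden = [1.0] + [sigmoid(sum(w_hidden[i][k] * x[i] for i in range(len(x))))
                      for k in range(len(w_hidden[0]))]
    out = [sigmoid(sum(w_out[k][j] * hidden[k] for k in range(len(hidden))))
           for j in range(len(w_out[0]))]
    return x, hidden, out


def backprop_step(network, features, target, alpha=0.5):
    """Update the weights in place; return the squared error before the update."""
    w_hidden, w_out = network
    x, hidden, out = forward(network, features)
    delta_out = [(target[j] - out[j]) * out[j] * (1.0 - out[j])
                 for j in range(len(out))]
    # Hidden deltas use the output weights before they are changed.
    delta_hidden = [hidden[k] * (1.0 - hidden[k])
                    * sum(w_out[k][j] * delta_out[j] for j in range(len(out)))
                    for k in range(1, len(hidden))]
    for k in range(len(hidden)):
        for j in range(len(out)):
            w_out[k][j] += alpha * hidden[k] * delta_out[j]
    for i in range(len(x)):
        for k in range(len(delta_hidden)):
            w_hidden[i][k] += alpha * x[i] * delta_hidden[k]
    return sum((target[j] - out[j]) ** 2 for j in range(len(out)))


# 4 inputs, 5 hidden nodes, 3 output classes; one made-up training example.
net = init_network(n_in=4, n_hidden=5, n_out=3)
errors = [backprop_step(net, [0.71, 0.52, 0.48, 0.50], [1, 0, 0])
          for _ in range(200)]
print(errors[0], errors[-1])
```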
Results
Accuracy comparison
Dataset   Decision Tree   Perceptrons                    Two-layer NN (hidden nodes: 5)   Majority
E.coli    68.04±5.03%     66.76±6.34% (Threshold=0.7)    65.68±6.09% (Threshold=0.7)      45.05%
Yeast     46.63±2.55%     50.41±2.74% (Threshold=0.5)    50.28±2.23% (Threshold=0.55)     28.82%
• The statistics for Decision Tree are averaged over 100 runs
• The statistics for Perceptrons and Two-layer NN are averaged over 50 runs
• Threshold is the termination condition for training the neural networks
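The "Majority" column and the "mean ± spread" entries can be computed along these lines. This sketch assumes the majority baseline always predicts the most frequent training label and that the ± figure is a sample standard deviation over runs; neither detail is stated on the slide, and the function names are hypothetical.

```python
import statistics
from collections import Counter


def majority_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Made-up labels and accuracies, only to show the calls.
baseline = majority_accuracy(["cp", "cp", "im"], ["cp", "im", "im", "cp"])
mean_acc, spread = summarize([0.6, 0.7, 0.8])
print(baseline, mean_acc, spread)
```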
Conclusions
The two datasets are linearly inseparable.
For the E.coli dataset, DT, Perceptrons and Two-layer NN achieve similar accuracy.
For the yeast dataset, Perceptrons and Two-layer NN achieve slightly better accuracy than DT.
All three AI algorithms achieve much better accuracy than the simple majority algorithm.
Future work
Probabilistic model
Bayesian network
K-Nearest Neighbor
……
A protein localization site prediction scheme
mcg, gvh, alm, mit, erl, pox, vac, nuc → Classifiers → Prediction
Guide the experimental design and biological research, save much labor and time!