Application of Stacked Generalization to a Protein Localization Prediction Task

Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D.

Pace University, School of Computer Science and Information Systems

September 27, 2003

Overview

• Introduction

• Purpose

• Methods

• Algorithms

• Results

• Conclusions and Future Work

Introduction

Introduction: Data Mining

• Application of machine learning algorithms to large databases

• Often used to classify future data based on training set

• “Target” variable is the variable to be predicted

• Theoretically, algorithms are context-independent

Introduction: Stacked Generalization

• Method for combining models

• Part of training set used to train level-0, or base, models as usual

• Level-1 data built from predictions of level-0 models on remainder of set

• Level-1 generalizers are models trained on level-1 data
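
The flow described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the toy learner `train_lookup`, the two-attribute rows, and the labels are all hypothetical stand-ins for the real level-0 models and gene data.

```python
from collections import Counter

def train_lookup(rows, labels, key):
    """Toy learner (hypothetical): predict the label seen most often with key(row)."""
    fallback = Counter(labels).most_common(1)[0][0]
    counts = {}
    for row, label in zip(rows, labels):
        counts.setdefault(key(row), Counter())[label] += 1
    table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
    return lambda row: table.get(key(row), fallback)

def train_stack(train_rows, train_labels, held_rows, held_labels):
    # Level-0 (base) models are trained on the first part of the training set.
    level0 = [
        train_lookup(train_rows, train_labels, key=lambda r: r[0]),
        train_lookup(train_rows, train_labels, key=lambda r: r[1]),
    ]
    # Level-1 data: level-0 predictions on the remainder of the set.
    level1_rows = [tuple(m(r) for m in level0) for r in held_rows]
    # Level-1 generalizer: a model trained on level-1 data only.
    level1 = train_lookup(level1_rows, held_labels, key=lambda preds: preds)
    return lambda row: level1(tuple(m(row) for m in level0))

# Made-up data: each row is (attribute_a, attribute_b).
train_rows = [("a", "p"), ("a", "p"), ("b", "q"), ("b", "p")]
train_labels = ["nucleus", "nucleus", "cytoplasm", "cytoplasm"]
held_rows = [("a", "p"), ("b", "q"), ("b", "p")]
held_labels = ["nucleus", "cytoplasm", "cytoplasm"]

predict = train_stack(train_rows, train_labels, held_rows, held_labels)
```

For the row `("b", "p")` the two base models disagree (cytoplasm versus nucleus); the level-1 generalizer has seen that combination in the held-out part and resolves it to cytoplasm.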

Introduction: Bioinformatics and Protein Localization

• Bioinformatics: application of computing to molecular biology

• Currently much interest in information about proteins

• Expression of proteins localized in a particular type or part of cell (localization)

• Knowledge of protein localization can shed light on protein’s function

• Data mining employed to predict localization from database of information about encoding genes

Introduction: KDD Cup 2001 Task

• KDD Cup: Annual data mining competition sponsored by ACM SIGKDD

• Participants use training set to predict target variable values in test dataset of different instances

• Winner is most accurate model (correct predictions/total instances in test set)

• 2001 task: predict protein localization of genes

• Anonymized genes were instances, information about genes were attributes

• Datasets (incl. revealed target) used in this project

Purpose

• Use Stacked Generalization approach on this task

• Compare inter-algorithm performance using level-0 models and level-1 generalizers

• Evaluate strategy of equally distributing target variable

Methods

Methods: Dataset Manipulations

• Reduce number of input variables

• Reduce number of potential target values to 3

• Separate original training dataset into training and validation sets for stacking

• Eliminate effectively unary variables in final training dataset

Table: Target Variable Distribution

Localization    Training N (Percentage)    Validation N (Percentage)    Test N (Percentage)
Nucleus         366 (58.4%)                189 (60.4%)                  174 (63.3%)
Cytoplasm       192 (30.6%)                90 (28.75%)                  66 (24.0%)
Mitochondria    69 (11.0%)                 34 (10.9%)                   35 (12.7%)

Methods: Equally Distributed Approach

• Second training set created by stratifying to ensure equally distributed localizations

• Level-0 models trained on both raw (unequally distributed) and equally distributed training sets

• Separate level-1 data and level-1 generalizers from this dataset
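
The slides do not say how the equally distributed set was drawn. One common way to get equal class counts, shown here purely as an assumption, is to downsample every localization to the size of the rarest one. The function name `downsample_equal` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict

def downsample_equal(rows, labels, seed=0):
    """Return a subset in which every label occurs as often as the rarest label.

    Assumption: equal distribution is reached by random downsampling;
    the source does not specify the actual procedure.
    """
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        for row in rng.sample(group, n):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Class counts taken from the training column of the distribution table.
labels = ["nucleus"] * 366 + ["cytoplasm"] * 192 + ["mitochondria"] * 69
rows = list(range(len(labels)))          # placeholder instances
eq_rows, eq_labels = downsample_equal(rows, labels)
equal_counts = Counter(eq_labels)        # 69 instances of each localization
```

Under this assumption the 627 training instances shrink to 207, which would fit the later remark that a small sample size may explain the weaker results.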

Algorithms

Algorithms: Level-0 Artificial Neural Network (ANN)

• Fully connected feedforward network

• Input variables coded as dummy variables: 186 input nodes

• Target variable coded as dummy variables: 2 output nodes

• 1 hidden node

• Training based on change in misclassification rate
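
Two output nodes for three localizations is what reference-cell dummy coding produces: k categories become k-1 indicators, and one category is the all-zero baseline. A small sketch follows; which localization served as the baseline is not stated in the slides, so the ordering below is an assumption.

```python
def dummy_code(value, categories):
    """Code one categorical value as k-1 indicators.

    The first entry of `categories` is treated as the all-zero baseline
    (an assumption; the source does not name the baseline category).
    """
    return [1 if value == c else 0 for c in categories[1:]]

LOCALIZATIONS = ["nucleus", "cytoplasm", "mitochondria"]

dummy_code("nucleus", LOCALIZATIONS)        # [0, 0]
dummy_code("cytoplasm", LOCALIZATIONS)      # [1, 0]
dummy_code("mitochondria", LOCALIZATIONS)   # [0, 1]
```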

Algorithms: Level-0 Decision Tree

• Used CHAID-like algorithm

• Chi-squared p value splitting criterion: p < 0.2

• Model selection based on proportion of instances correctly classified
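
To make the splitting criterion concrete, the sketch below computes a Pearson chi-squared statistic for a candidate split and checks it against p < 0.2. It covers only the simplest case, a binary attribute against the three localizations, where the test has 2 degrees of freedom and the upper-tail p-value is exactly exp(-x/2). The counts are invented, and the other parts of a CHAID-like procedure (category merging, p-value adjustments) are not modeled.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper-tail p-value of the chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Hypothetical counts: rows = attribute value (yes / no),
# columns = nucleus, cytoplasm, mitochondria.
table = [[30, 10, 10],
         [20, 20, 10]]

stat = chi_squared(table)
p = p_value_df2(stat)
split_allowed = p < 0.2   # threshold from the slide
```

With these made-up counts the statistic is 16/3 and the p-value is about 0.07, so the attribute would qualify as a split under the p < 0.2 rule.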

Algorithms: Level-0 Nearest Neighbor (NN)

• Compare each instance between two datasets

• Count number of matching attributes

• Predict target value of instance matching on greatest number of attributes

• Use relative frequency in unequally distributed dataset to break ties
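
The matching rule above might look like the following in Python. It is a sketch under assumptions: instances are tuples of categorical values, the function name is hypothetical, and a tie means that several equally close training instances carry different labels.

```python
from collections import Counter

def predict_by_matches(query, train_rows, train_labels, class_freq):
    """Predict the label of the training instance that matches `query`
    on the greatest number of attributes.

    `class_freq` maps each label to its frequency in the unequally
    distributed dataset and is used only to break ties.
    """
    best_score, candidates = -1, set()
    for row, label in zip(train_rows, train_labels):
        score = sum(q == v for q, v in zip(query, row))
        if score > best_score:
            best_score, candidates = score, {label}
        elif score == best_score:
            candidates.add(label)
    # sorted() keeps the result deterministic if frequencies are also tied.
    return max(sorted(candidates), key=lambda label: class_freq[label])

# Made-up instances; frequencies taken from the training column of the table.
train_rows = [("a", "b", "c"), ("a", "x", "c"), ("z", "b", "c")]
train_labels = ["cytoplasm", "nucleus", "mitochondria"]
class_freq = Counter({"nucleus": 366, "cytoplasm": 192, "mitochondria": 69})

exact = predict_by_matches(("a", "b", "c"), train_rows, train_labels, class_freq)
tied = predict_by_matches(("a", "y", "c"), train_rows, train_labels, class_freq)
```

The second query matches the first two training instances on two attributes each, so the tie-breaker picks nucleus, the more frequent of the two candidate labels.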

Algorithms: Level-0 Hybrid Decision Tree/ANN

• Difficult for ANN to learn with too many variables

• Decision Tree can be used as a “feature selector”

• Important variables are those used as branching criteria

• New ANN trained using only important variables as inputs
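
A sketch of the feature-selection step, assuming the trained tree is available as a nested dictionary. The tree layout, the attribute names (`essential`, `complex`, `chromosome`), and both helper functions are hypothetical; only the idea of keeping the attributes used as branching criteria comes from the slide.

```python
def split_attributes(node):
    """Collect every attribute used as a branching criterion in the tree."""
    if not isinstance(node, dict):      # leaf: a predicted localization
        return set()
    used = {node["attribute"]}
    for child in node["children"].values():
        used |= split_attributes(child)
    return used

def keep_attributes(instance, attributes):
    """Reduce an instance (dict of attribute -> value) to the selected attributes."""
    return {name: instance[name] for name in sorted(attributes)}

# Hypothetical tree: internal nodes are dicts, leaves are labels.
tree = {
    "attribute": "essential",
    "children": {
        "yes": "nucleus",
        "no": {
            "attribute": "complex",
            "children": {"A": "cytoplasm", "B": "mitochondria"},
        },
    },
}

important = split_attributes(tree)
gene = {"essential": "no", "complex": "A", "chromosome": "4"}
ann_input = keep_attributes(gene, important)   # 'chromosome' is dropped
```

The reduced instances would then be dummy coded and fed to a new ANN in place of the full attribute set.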

Algorithms: Level-1 Generalizers

• ANN and Decision Tree

  – Designed and trained essentially the same as level-0 counterparts

  – ANN had 8 input nodes

• Naïve Bayesian Model

  – Calculated likelihood of each target value based on Bayes rule

  – Predicted value with highest likelihood
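
A naive Bayesian generalizer over level-1 data could be sketched as follows, where each level-1 row is the tuple of level-0 predictions for one instance. The add-one (Laplace) smoothing is an assumption added here to avoid zero likelihoods; the slides do not say whether smoothing was used. The data are invented.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Train on tuples of categorical values (here, level-0 predictions)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (position, label) -> value counts
    values_seen = defaultdict(set)        # position -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)

    def predict(row):
        scores = {}
        for label, n in class_counts.items():
            score = n / len(labels)       # prior P(label)
            for i, value in enumerate(row):
                # Smoothed estimate of P(value at position i | label); assumption.
                score *= (value_counts[(i, label)][value] + 1) / (n + len(values_seen[i]))
            scores[label] = score
        # Predict the target value with the highest score.
        return max(scores, key=scores.get)

    return predict

# Made-up level-1 data: predictions of two level-0 models per instance.
level1_rows = [
    ("nucleus", "nucleus"), ("nucleus", "nucleus"), ("cytoplasm", "cytoplasm"),
    ("cytoplasm", "nucleus"), ("nucleus", "cytoplasm"),
]
true_labels = ["nucleus", "nucleus", "cytoplasm", "cytoplasm", "nucleus"]

level1_bayes = train_naive_bayes(level1_rows, true_labels)
```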

Results

Results: Accuracy Rates

Level-0 Models

Approach                 Dataset       ANN      Tree     NN       Hybrid
Unequally Distributed    Validation    65.8%    72.2%    73.8%    71.3%
Unequally Distributed    Test          64.7%    65.1%    70.6%    71.3%
Equally Distributed      Validation    62.9%    61.7%    64.9%    62.3%
Equally Distributed      Test          66.9%    59.6%    65.1%    61.8%

Level-1 Generalizers

Approach                 Level-1 ANN    Level-1 Tree    Level-1 Bayesian
Unequally Distributed    71.27%         71.64%          72.00%
Equally Distributed      65.82%         67.64%          70.18%

Results: Evaluation of Accuracy Rates

• Similar to highest-performing KDD Cup models

• However, predictions drawn from much smaller pool of potential localizations

• Also not much better than just predicting nucleus

• Still, had fewer input variables with which to work
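
The "just predicting nucleus" comparison can be worked out from the test-set counts in the distribution table. The sketch assumes the level-1 rates were measured on the test set; the results table does not say so, although 72.00% of 275 instances is a whole number (198).

```python
# Test-set counts from the target variable distribution table.
test_counts = {"nucleus": 174, "cytoplasm": 66, "mitochondria": 35}

total = sum(test_counts.values())               # 275 instances
baseline = max(test_counts.values()) / total    # always predict nucleus
best_level1 = 0.7200                            # level-1 Bayesian, unequally distributed

gain_points = (best_level1 - baseline) * 100    # difference in percentage points
print(f"baseline {baseline:.1%}, best level-1 {best_level1:.1%}, gain {gain_points:.1f} points")
```

The majority-class baseline is 63.3%, so the best stacked model gains roughly 8.7 percentage points over it.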

Level-1 Decision Tree Diagram

Results: Statistical Comparisons

• No significant inter-algorithm differences for level-0 models

• Hybrid offered some improvement over ANN alone

• Equal distribution usually resulted in slightly worse performance

• Stacked Generalization resulted in better performance, sometimes significantly so

Conclusions and Future Work

Conclusions and Future Work: Stratifying for Equal Distribution

• Not worth it and perhaps harmful

• Resulting small sample size may be to blame

• Could sample from full training set

• Other sampling approaches could be used

• Weight variable not necessarily meaningful

Conclusions and Future Work: Specific Models

• Algorithms performed comparably to each other

• ANN may need more hidden nodes

• Hybrid model improved ANN’s performance slightly, but not much

• NN may owe some of performance to tie-breaker implementation

• Naïve Bayesian not standout, as might be expected

  – Could run A Priori search first

Conclusions and Future Work: Stacked Generalization in General

• Somewhat, not drastically, better performance

• Possible ways to improve performance

  – Cross-validation could improve both performance and evaluation

  – Use posterior probabilities instead of actual predictions

  – Try different algorithms

  – Continue stacking on more levels (level-2, level-3, etc.)

• Apply Stacked Generalization to actual KDD Cup task
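
One way the cross-validation suggestion could work is to build level-1 data from out-of-fold predictions, so that every training instance contributes a level-1 row instead of only a held-out part. The sketch below is an illustration of that idea, not something the project did; `learners` is a hypothetical list of functions that take rows and labels and return a prediction function, and `k` must be at least 2.

```python
from collections import Counter

def cross_validated_level1(rows, labels, learners, k=5):
    """Build one level-1 row per instance from out-of-fold level-0 predictions."""
    level1_rows = [None] * len(rows)
    for fold in range(k):
        held = [i for i in range(len(rows)) if i % k == fold]
        kept = [i for i in range(len(rows)) if i % k != fold]
        # Train each level-0 learner without the held-out fold.
        models = [
            learn([rows[i] for i in kept], [labels[i] for i in kept])
            for learn in learners
        ]
        # Predict only on instances the models have not seen.
        for i in held:
            level1_rows[i] = tuple(m(rows[i]) for m in models)
    return level1_rows

def majority_learner(rows, labels):
    """Hypothetical stand-in for a level-0 algorithm: always predict the top label."""
    top = Counter(labels).most_common(1)[0][0]
    return lambda row: top

# Made-up data: six placeholder instances.
cv_rows = list(range(6))
cv_labels = ["nucleus"] * 5 + ["cytoplasm"]
cv_level1 = cross_validated_level1(cv_rows, cv_labels, [majority_learner], k=3)
```

A level-1 generalizer would then be trained on `cv_level1` paired with the true labels, and the level-0 models refit on the full training set for use at prediction time.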

References

• Page, D. (2001). KDD Cup 2001. Website located at http://www.cs.wisc.edu/~dpage/kddcup2001/.

• Ting, K.M., Witten, I.H. (1997). Stacked generalization: when does it work? Proc. International Joint Conference on Artificial Intelligence, Japan, 866-871.

• Witten, I.H., Frank, E. (2000). Data Mining. Morgan Kaufmann (San Francisco).

• Wolpert, D.H. (1992). Stacked Generalization. Neural Networks, 5:241-259.