COP5992 – DATA MINING TERM PROJECT
RANDOM SUBSPACE METHOD + CO-TRAINING
by
SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM)
Proposed by Ho, "The Random Subspace Method for Constructing Decision Forests", 1998
Another combining technique for weak classifiers, like Bagging and Boosting
RSM ALGORITHM
1. Repeat for b = 1, 2, . . ., B:
(a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
(b) Construct a classifier Cb(x) in X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, . . ., B, by simple majority voting into a final decision rule
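The two steps above can be sketched in a few lines. This is a minimal illustration, not Ho's implementation: the base classifier here is a simple 1-nearest-neighbour rule, and all function names (`rsm_fit`, `rsm_predict`, `nn_predict`) are invented for this sketch.

```python
# Minimal Random Subspace Method sketch (assumed 1-NN base classifier).
import random
from collections import Counter

def nn_predict(train, x, feats):
    # 1-NN restricted to the chosen feature subset `feats`
    def dist(row):
        return sum((row[0][f] - x[f]) ** 2 for f in feats)
    return min(train, key=dist)[1]

def rsm_fit(train, p, r, B, seed=0):
    # Step 1(a): draw B random r-dimensional subspaces of the p features
    rng = random.Random(seed)
    return [rng.sample(range(p), r) for _ in range(B)]

def rsm_predict(train, subspaces, x):
    # Steps 1(b) and 2: one classifier per subspace, combined by majority vote
    votes = [nn_predict(train, x, feats) for feats in subspaces]
    return Counter(votes).most_common(1)[0][0]

# Toy data: class 0 near the origin, class 1 near (1, 1, 1, 1)
train = [([0.1, 0.0, 0.2, 0.1], 0), ([0.0, 0.1, 0.0, 0.2], 0),
         ([0.9, 1.0, 0.8, 1.1], 1), ([1.0, 0.9, 1.1, 0.9], 1)]
subspaces = rsm_fit(train, p=4, r=2, B=11)
print(rsm_predict(train, subspaces, [0.05, 0.1, 0.1, 0.0]))  # expect 0
```

Each of the B classifiers sees only r of the p features, so they make partly independent errors; the majority vote then averages those errors out.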
MOTIVATION FOR RSM
Redundancy in the data feature space:
Completely redundant feature set
Redundancy spread over many features
Weak classifiers that have critical training sample sizes
RSM PERFORMANCE ISSUES
RSM performance depends on:
Training sample size
The choice of base classifier
The choice of combining rule (simple majority vs. weighted)
The degree of redundancy of the dataset
The number of features chosen
DECISION FORESTS (by Ho)
A combination of trees instead of a single tree
Assumption: the dataset has some redundant features
Works efficiently with any decision tree algorithm and data splitting method
Ideally, look for the best individual trees with the lowest tree similarity
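As a sketch of the decision-forest idea, the fragment below grows several one-level trees (decision stumps), each on one randomly chosen feature, and lets them vote. The midpoint-threshold stump training is an illustrative simplification invented here, not Ho's actual tree construction; all names are hypothetical.

```python
# Tiny decision-forest sketch: random-feature stumps, majority vote.
# Assumes exactly two classes; stump training is a toy simplification.
import random
from collections import Counter

def fit_stump(train, f):
    # Threshold at the midpoint between the two per-class means of feature f
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x[f])
    (y0, v0), (y1, v1) = by_class.items()
    m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
    if m0 > m1:
        y0, y1, m0, m1 = y1, y0, m1, m0
    return (f, (m0 + m1) / 2, y0, y1)  # predict y0 at or below threshold

def stump_predict(stump, x):
    f, t, lo, hi = stump
    return lo if x[f] <= t else hi

def forest_fit(train, p, B, seed=0):
    # Each tree sees one randomly chosen feature out of p
    rng = random.Random(seed)
    return [fit_stump(train, rng.randrange(p)) for _ in range(B)]

def forest_predict(forest, x):
    return Counter(stump_predict(s, x) for s in forest).most_common(1)[0][0]

train = [([0.1, 0.2, 0.0, 0.1], 0), ([0.0, 0.1, 0.2, 0.0], 0),
         ([0.9, 1.0, 0.8, 1.1], 1), ([1.0, 0.8, 0.9, 1.0], 1)]
forest = forest_fit(train, p=4, B=5)
print(forest_predict(forest, [0.1, 0.0, 0.1, 0.2]))  # expect 0
```

Because the features are redundant, stumps grown on different features reach similar accuracy while disagreeing on different points, which is exactly the low-similarity property the slide asks for.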
UNLABELED DATA
Small number of labeled documents
Large pool of unlabeled documents
How to classify unlabeled documents accurately?
EXPECTATION-MAXIMIZATION (E-M)
CO-TRAINING
Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998
Requirements:
Two sufficiently strong feature sets
Conditionally independent
CO-TRAINING
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
Obtain a small set L of labeled examples
Obtain a large set U of unlabeled examples
Obtain two sets F1 and F2 of features that are sufficiently redundant
While U is not empty do:
  Learn classifier C1 from L based on F1
  Learn classifier C2 from L based on F2
  For each classifier Ci do:
    Ci labels examples from U based on Fi
    Ci chooses the most confidently predicted examples E from U
    E is removed from U and added (with their given labels) to L
End loop
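The loop above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the two views F1 and F2 are lists of feature indices, the base learner is an assumed nearest-centroid classifier whose confidence is inverse distance to the closest centroid, and every helper name (`centroid_fit`, `centroid_predict`, `co_train`) is invented for this sketch.

```python
# Co-training sketch with two feature views and an assumed
# nearest-centroid base learner (confidence = inverse distance).
import math

def centroid_fit(L, feats):
    # Mean vector per class over the chosen feature view
    sums, counts = {}, {}
    for x, y in L:
        counts[y] = counts.get(y, 0) + 1
        s = sums.setdefault(y, [0.0] * len(feats))
        for i, f in enumerate(feats):
            s[i] += x[f]
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def centroid_predict(model, x, feats):
    # Returns (label, confidence); closer centroid -> higher confidence
    best, conf = None, -1.0
    for y, c in model.items():
        d = math.sqrt(sum((x[f] - c[i]) ** 2 for i, f in enumerate(feats)))
        score = 1.0 / (1.0 + d)
        if score > conf:
            best, conf = y, score
    return best, conf

def co_train(L, U, F1, F2, per_round=1):
    L, U = list(L), list(U)
    while U:
        for feats in (F1, F2):  # each view labels its most confident examples
            if not U:
                break
            model = centroid_fit(L, feats)
            scored = [(centroid_predict(model, x, feats), x) for x in U]
            scored.sort(key=lambda t: -t[0][1])
            for (y, _), x in scored[:per_round]:
                L.append((x, y))  # move example from U to L with its new label
                U.remove(x)
    return L

L0 = [([0.0, 0.0, 0.0, 0.0], 'a'), ([1.0, 1.0, 1.0, 1.0], 'b')]
U0 = [[0.1, 0.1, 0.1, 0.1], [0.9, 0.9, 0.9, 0.9]]
labeled = co_train(L0, U0, F1=[0, 1], F2=[2, 3])
```

Each view's self-labeled examples expand the training set the other view learns from on the next pass, which is how the two classifiers bootstrap each other.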
THINGS TO DO
How can we measure redundancy and use it efficiently?
Can we improve Co-training?
How can we apply RSM efficiently to:
Supervised learning
Semi-supervised learning
Unsupervised learning
QUESTIONS