
COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI


Page 1:

COP5992 – DATA MINING TERM PROJECT

RANDOM SUBSPACE METHOD + CO-TRAINING

by SELIM KALAYCI

Page 2:

RANDOM SUBSPACE METHOD (RSM)

Proposed by Ho, "The Random Subspace Method for Constructing Decision Forests", 1998.

Another combining technique for weak classifiers, like Bagging and Boosting.

Page 3:

RSM ALGORITHM

1. Repeat for b = 1, 2, ..., B:
   (a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
   (b) Construct a classifier Cb(x) in the subspace X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule.
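The two steps above can be sketched in Python (not from the slides; decision trees serve as the weak base classifier, and the values of B and r are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

B, r = 25, 8   # ensemble size and subspace dimension (illustrative values)
models = []
for b in range(B):
    # step 1(a): select an r-dimensional random subspace of the p features
    feats = rng.choice(X.shape[1], size=r, replace=False)
    # step 1(b): construct a classifier in that subspace
    clf = DecisionTreeClassifier(random_state=b).fit(X[:, feats], y)
    models.append((feats, clf))

# step 2: combine the B classifiers by simple majority vote
votes = np.stack([clf.predict(X[:, feats]) for feats, clf in models])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = (y_pred == y).mean()
print("training accuracy of the majority vote:", acc)
```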

Page 4:

MOTIVATION FOR RSM

Redundancy in the data feature space:
    Completely redundant feature set
    Redundancy spread over many features

Weak classifiers that have critical training sample sizes

Page 5:

RSM PERFORMANCE ISSUES

RSM performance depends on:
    Training sample size
    The choice of base classifier
    The choice of combining rule (simple majority vs. weighted)
    The degree of redundancy of the dataset
    The number of features chosen

Page 6:

DECISION FORESTS (by Ho)

A combination of trees instead of a single tree.

Assumption: the dataset has some redundant features.
Works efficiently with any decision tree algorithm and data splitting method.
Ideally, look for the best individual trees with the lowest tree similarity.

Page 7:

UNLABELED DATA

Small number of labeled documents

Large pool of unlabeled documents

How to classify unlabeled documents accurately?

Page 8:

EXPECTATION-MAXIMIZATION (E-M)
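The slide itself carries no detail, so the following is only an assumed illustration of the idea E-M brings to this setting: alternately label the unlabeled pool with the current model (E-step) and refit the model on labeled plus pseudo-labeled data (M-step). All names and values are invented for the sketch, and the hard-label variant shown is closer to self-training than to full soft E-M:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
lab = np.arange(30)         # small labeled set
unlab = np.arange(30, 300)  # large unlabeled pool

clf = GaussianNB().fit(X[lab], y[lab])   # initial model from the labels only
for _ in range(10):
    pseudo = clf.predict(X[unlab])       # E-step (hard labels)
    clf = GaussianNB().fit(              # M-step: refit on all data
        np.vstack([X[lab], X[unlab]]),
        np.concatenate([y[lab], pseudo]))
acc = clf.score(X, y)
print("accuracy on all data:", acc)
```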

Page 9:

CO-TRAINING

Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998.

Requirements:
    Two sufficiently strong feature sets
    Conditionally independent given the class

Page 10:

CO-TRAINING

Page 11:

APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET

Algorithm:
Obtain a small set L of labeled examples
Obtain a large set U of unlabeled examples
Obtain two sets F1 and F2 of features that are sufficiently redundant
While U is not empty do:
    Learn classifier C1 from L based on F1
    Learn classifier C2 from L based on F2
    For each classifier Ci do:
        Ci labels examples from U based on Fi
        Ci chooses the most confidently predicted examples E from U
        E is removed from U and added (with their given labels) to L
End loop
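A minimal sketch of this loop, under assumptions not in the slides: two Gaussian naive Bayes classifiers, synthetic data, the first and second ten columns as the views F1 and F2, and k = 10 confident picks per classifier per round:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           random_state=0)
F1, F2 = np.arange(10), np.arange(10, 20)  # the two feature views

labeled = list(range(20))    # small set L of labeled examples
y_lab = list(y[:20])         # their labels; pseudo-labels get appended
pool = list(range(20, 400))  # large set U of unlabeled examples
k = 10                       # picks per classifier per round (assumed)

while pool:
    added = {}
    for view in (F1, F2):
        clf = GaussianNB().fit(X[labeled][:, view], y_lab)
        proba = clf.predict_proba(X[pool][:, view])
        # the k most confidently predicted examples E from U
        for i in np.argsort(proba.max(axis=1))[-k:]:
            added[pool[i]] = clf.classes_[np.argmax(proba[i])]
    for idx, lab in added.items():  # E is removed from U and added to L
        labeled.append(idx)
        y_lab.append(lab)
    pool = [i for i in pool if i not in added]

final = GaussianNB().fit(X[labeled][:, F1], y_lab)
acc = final.score(X[:, F1], y)
print("view-1 accuracy on all data:", acc)
```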

Page 12:

THINGS TO DO

How can we measure redundancy and use it efficiently?

Can we improve co-training?

How can we apply RSM efficiently to:
    Supervised learning
    Semi-supervised learning
    Unsupervised learning

Page 13:

QUESTIONS?
