Deriving Knowledge from Data at Scale

Barga Data Science lecture 7


Page 1: (title slide)

Page 2:

Good data preparation is key to producing valid and reliable models…

Page 3:

Lecture 7 Agenda

Pages 4-6: (no extractable text)

Page 7:

Heads Up, Lecture 8 Agenda

Page 8:

Course Project

Page 9: (no extractable text)

Page 10:

https://www.kaggle.com/c/springleaf-marketing-response

Determine whether or not to send a direct mail piece to a customer

Page 11:

The Data

Page 12:

The Rules

Page 13: (no extractable text)

Page 14:

What is the data telling you?

Page 15: (no extractable text)

Page 16:

The Tragedy of the Titanic

Pages 17-19: (no extractable text)

Page 20:

Does the new data point x* exactly match a previous point xi?

If so, assign it to the same class as xi

Otherwise, just guess.

This is the “rote” classifier
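A minimal sketch of this in Python (my own illustration; the names and structure are not from the lecture):

    import random

    def rote_classify(x_star, train_X, train_y, classes):
        # Exact-match lookup: if the new point matches a stored point,
        # return that point's class...
        for x_i, y_i in zip(train_X, train_y):
            if x_i == x_star:
                return y_i
        # ...otherwise just guess.
        return random.choice(classes)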

Page 21:

Does the new data point x* match a set of previous points xi on some specific attribute?

If so, take a vote to determine class.

Example: If most females survived, then assume every female survives

But there are lots of possible rules like this.

And an attribute can have more than two values.

If most people under 4 years old survive, then assume everyone under 4 survives

If most people with 1 sibling survive, then assume everyone with 1 sibling survives

How do we choose?

Page 22:

IF sex=‘female’ THEN survive=yes

ELSE IF sex=‘male’ THEN survive=no

confusion matrix

no yes <-- classified as

468 109 | no

81 233 | yes

(468 + 233) / (468+109+81+233) = 79% correct (and 21% incorrect)

Not bad!
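This arithmetic is easy to reproduce; a sketch in Python with pandas, assuming a Titanic file with sex and survived columns (the file and column names are my assumption, not the lecture's):

    import pandas as pd

    df = pd.read_csv("titanic.csv")  # assumed columns: sex, survived
    pred = (df["sex"] == "female").map({True: "yes", False: "no"})
    actual = df["survived"].map({1: "yes", 0: "no"})

    print(pd.crosstab(actual, pred))          # confusion matrix
    print(f"{(pred == actual).mean():.0%} correct")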

Page 23: (no extractable text)

Page 24:

IF pclass=‘1’ THEN survive=yes

ELSE IF pclass=‘2’ THEN survive=yes

ELSE IF pclass=‘3’ THEN survive=no

confusion matrix

no yes <-- classified as

372 119 | no

177 223 | yes

(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)

a little worse

Page 25:

Support Vector Machines

Pages 26-27: (no extractable text)

Page 28:

[Scatter plot of the training data; one marker denotes +1, the other denotes −1]

f(x, w, b) = sign(w·x − b)

How would you classify this data?

Estimation:

w: weight vector

x: data vector

Pages 29-31: (the same slide repeated, each time with a different candidate separating line drawn)

Page 32:

[Same scatter plot, with several candidate separating lines drawn]

f(x, w, b) = sign(w·x − b)

Any of these would be fine… but which is best?

Page 33:

f(x, w, b) = sign(w·x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
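With the boundary w·x − b = 0 scaled so that the closest points on either side satisfy w·x − b = ±1, the margin works out to

    margin = 2 / ‖w‖

so maximizing the margin is equivalent to minimizing ‖w‖² subject to y_i(w·x_i − b) ≥ 1 for every training point.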

Pages 34-35:

[Figures: O’s and X’s separated by a line, with the boundary-touching points labeled SUPPORT VECTORS]

Page 36:

f(x, w, b) = sign(w·x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).

Linear SVM

Page 37:

f(x, w, b) = sign(w·x − b)

Support vectors are those datapoints that the margin pushes up against.

Linear SVM

Page 38:

Why maximum margin?

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.

3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector data points.

4. Empirically it works very well.
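A minimal sketch of fitting a maximum-margin linear classifier, using scikit-learn rather than the Weka tools shown later in the lecture (my substitution):

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data, labeled +1 / -1.
    X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

    print(clf.support_vectors_)                   # points the margin pushes against
    w = clf.coef_[0]
    print("margin width:", 2 / np.linalg.norm(w))
    print(clf.predict([[2, 2], [6, 6]]))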

Page 39:

[Figure: Class 1 and Class 2 separated by a hyperplane; m marks the margin width]

Page 40:

This is going to be a problem!

Page 41: (no extractable text)

Page 42:

What if the data looks like that below?...

kernel function
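(The idea, stated compactly: map the points by some φ into a higher-dimensional feature space where they become linearly separable, without ever computing φ explicitly; the SVM only needs the inner products K(x, z) = φ(x)·φ(z). The RBF kernel, for example, is K(x, z) = exp(−γ‖x − z‖²).)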

Page 43:

What if the data looks like that below?...

Page 44:

Everything you need to know in one slide…

Pages 45-46: (no extractable text)

Page 47:

Choosing the kernel is probably the trickiest part of using an SVM.

• RBF is a good first option…

• It depends on your data: try several.

• Kernels have even been developed for non-numeric data like sequences, structures, and trees/graphs.

• It may help to use a combination of several kernels.

• Don’t touch your evaluation data while you’re trying out different kernels and parameters; use cross-validation for this if you’re short on data (a cross-validated search is sketched below).
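A sketch of that discipline with scikit-learn (my analogue of the Weka workflow): tune kernel and parameters by cross-validation on the training split only, and touch the held-out test set exactly once.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    grid = GridSearchCV(
        SVC(),
        {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
        cv=5,                       # 5-fold cross-validation on the training data
    )
    grid.fit(X_tr, y_tr)
    print(grid.best_params_, grid.score(X_te, y_te))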

Page 48:

The complexity of the optimization problem remains dependent only on the dimensionality of the input space, not on that of the feature space!
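This is because the standard dual problem the optimizer solves,

    maximize over α:  Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
    subject to:       α_i ≥ 0  and  Σ_i α_i y_i = 0

touches the data only through kernel evaluations K(x_i, x_j), never through feature-space coordinates.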

Page 49: (no extractable text)

Page 50:

• SVM 1 learns “Output==1” vs “Output != 1”

• SVM 2 learns “Output==2” vs “Output != 2”

….

• SVM N learns “Output==N” vs “Output != N”
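A one-vs-rest sketch in scikit-learn terms (my analogue; the scheme is exactly the N binary SVMs listed above):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

    print(len(ovr.estimators_))   # one binary SVM per class (3 for iris)
    print(ovr.predict(X[:5]))     # predict with the most confident of the N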

Pages 51-54: (no extractable text)

Page 55:

• weka.classifiers.functions.SMO

• weka.classifiers.functions.LibSVM

• weka.classifiers.functions.SMOreg

Pages 56-58: (no extractable text)

Page 59:

IF pclass=‘1’ THEN survive=yes

ELSE IF pclass=‘2’ THEN survive=yes

ELSE IF pclass=‘3’ THEN survive=no

confusion matrix

no yes <-- classified as

372 119 | no

177 223 | yes

(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)

a little worse

Page 60:

Support Vector Machine Model, Titanic Data, Linear Kernel

Page 61:

Support Vector Machine Model, RBF Kernel, Titanic Data

Page 62:

Support Vector Machine Model, RBF Kernel, Titanic Data

overfitting?

Page 63: (slide credit: Bill Howe, UW)

Support Vector Machine Model, RBF Kernel, Titanic Data

Gamma: a parameter that controls/balances model complexity against accuracy.
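A sketch of that trade-off (a scikit-learn stand-in for the Weka model above): small gamma gives a smooth, simple boundary; large gamma gives a wiggly one that can memorize the training set.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

    for gamma in [0.01, 1, 100]:
        train_acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)
        cv_acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
        # High train accuracy with poor CV accuracy signals overfitting.
        print(f"gamma={gamma:<6} train={train_acc:.2f}  cv={cv_acc:.2f}")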

Page 64:

How They Won It!

Lessons from mining past Kaggle contests…

Page 65: (no extractable text)

Page 66:

10 Minute Break…

Page 67:

Data Wrangling

Page 68:

gold standard data sets

Page 69:

…no one dominant attribute, an even distribution…

Page 70:

• Rule of thumb: 5,000 or more instances desired

• Rule of thumb: for each attribute, 10 or more instances

• Rule of thumb: >100 instances for each class

Page 71:

• Data cleaning

• Data integration

• Data transformation

• Data reduction

• Data discretization

Page 72: (no extractable text)

Page 73:

1. Missing values

2. Outliers

3. Coding

4. Constraints

Page 74:

• ReplaceMissingValues

• RemoveMisclassified

• MergeTwoValues

Page 75:

Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things…

• MAR, “Missing at Random”: usually the best case, but usually not true.

• Non-randomly missing.

• Presumed normal, so not measured.

• Causally missing: the attribute value is missing because of other attribute values (or because of the outcome value!).

Page 76: (no extractable text)

Page 77:

Representation matters…

Pages 78-79: (no extractable text)

Page 80:

[Figure: Bank Data A and Bank Data B, each with a query Q]

Page 81:

Data Transformation

Simple transformations can often have a large impact on performance.

Example transformations (not all for performance improvement):

• Difference of two date attributes, distance between coordinates, …

• Ratio of two numeric (ratio-scale) attributes, average for smoothing, …

• Concatenating the values of nominal attributes

• Encoding (probabilistic) cluster membership

• Adding noise to data (for robustness tests)

• Removing data randomly or selectively

• Obfuscating the data (for anonymity)

Intuition: add features that increase class discrimination (entropy, information gain)…
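For instance, the first two bullets in pandas (a generic sketch; the table and column names are mine):

    import pandas as pd

    df = pd.DataFrame({
        "applied": pd.to_datetime(["2015-01-03", "2015-02-10"]),
        "funded":  pd.to_datetime(["2015-01-20", "2015-03-01"]),
        "debt":    [12000, 4000],
        "income":  [48000, 52000],
    })

    # Difference of two date attributes.
    df["days_to_fund"] = (df["funded"] - df["applied"]).dt.days
    # Ratio of two numeric attributes.
    df["debt_to_income"] = df["debt"] / df["income"]
    print(df[["days_to_fund", "debt_to_income"]])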

Page 82:

• Combine attributes

• Normalizing data

• Simplifying data

Page 83:

Change of scale
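Two common changes of scale, sketched generically in Python:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 10.0])

    x_minmax = (x - x.min()) / (x.max() - x.min())  # rescale to [0, 1]
    x_standard = (x - x.mean()) / x.std()           # zero mean, unit variance
    print(x_minmax, x_standard)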

Pages 84-85: (no extractable text)

Page 86:

Aggregated data tends to have less variability and potentially more information for better predictions…

Pages 87-92: (no extractable text)

Page 93:

Fills in the missing values for an instance with the expected values.

Page 94:

Specify the new ordering…

• 2-last,1 (moves the first attribute to the end)

• 1-5,7-last,6 (moves attribute 6 to the end)

Page 95:

weka.filters.unsupervised.instance.Resample produces a random subsample, with or without replacement:

• To replace or not: choose sampling with or without replacement.

• The same random seed will produce the same (repeatable) sample.

• Sample size is specified as a percentage of the original data set size.
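The same options in pandas terms (a generic sketch, not the Weka filter itself):

    import pandas as pd

    df = pd.DataFrame({"x": range(100)})  # stand-in for any data set

    # 30% subsample without replacement; a fixed seed makes it repeatable.
    sub = df.sample(frac=0.3, replace=False, random_state=42)
    # Sampling WITH replacement (duplicates allowed, as in a bootstrap).
    boot = df.sample(frac=1.0, replace=True, random_state=42)
    print(len(sub), len(boot))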

Page 96:

Suggestions for you to try…

Page 97: (no extractable text)

Page 98:

Curse of Dimensionality

The number of samples needed to cover the input space grows exponentially with its dimensionality. In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!

Page 99:

• Sampling

Page 100:

A sample will work almost as well as using the entire data set if it is representative, i.e., if it has approximately the same property (of interest) as the original set of data.

Pages 101-102: (no extractable text)

Page 103:

• Principal Component Analysis

Page 104: (no extractable text)

Page 105:

Find k ≤ n new features (principal components) that can best represent the data.

Works for numeric data only…
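A minimal PCA sketch with scikit-learn (my stand-in for the Weka filter):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))        # 200 instances, n = 10 numeric features

    pca = PCA(n_components=3)             # k = 3 <= n new features
    X_reduced = pca.fit_transform(X)      # project onto the principal components
    print(pca.explained_variance_ratio_)  # variance captured by each component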

Page 106: (no extractable text)

Page 107:

Attribute Selection (feature selection)

Page 108:

What is feature selection?

DIMENSIONALITY REDUCTION

Page 109:

Problem: Where to focus attention?

Page 110:

Feature selection starts with you… aim for the smallest subset of attributes.

Page 111:

What is evaluated?

                          Attributes | Subsets of attributes
  Evaluation method:
    Independent           Filters    | Filters
    Learning algorithm               | Wrappers

Page 112: (the same taxonomy table, repeated)

Page 113:

A list of attributes, each evaluated individually and ranked; you then select a subset from the top of the list.

Page 114: (no extractable text)

Page 115:

A correlation coefficient shows the degree of linear dependence of x and y; in other words, the coefficient shows how close two variables lie along a line. If the coefficient is equal to 1 or -1, all the points lie along a line. If the correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not necessarily mean that there is no relation between the two variables: there could, for example, be a non-linear relation.
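For reference, the Pearson coefficient of x and y is

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( √Σ_i (x_i − x̄)² · √Σ_i (y_i − ȳ)² )

i.e., the covariance of x and y divided by the product of their standard deviations.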

Page 116:

Tab for selecting attributes in a data set…

Page 117:

Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 118:

Select CorrelationAttributeEval for Pearson correlation…

False: doesn’t return the R scores; True: returns the R scores.

Page 119:

Ranker ranks attributes by their individual evaluations; it is used in conjunction with GainRatio, Entropy, Pearson, etc.

• Number of attributes to return: -1 returns all ranked attributes.

• Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10].

• Cutoff below which attributes can be discarded: -1 means no cutoff.

Page 120:

Pearson Correlation Exercise…

Predicting Self-Reported Health Status
The data set: NHANES_data.csv (National Health and Nutrition Examination Survey)

“How would you say your health in general is?” An excellent predictor of mortality, health care utilization, and disability.

How I processed it…

• 4,000 variables;

• Attributes with > 30% missing values removed (dropped the column);

• 105 variables remaining;

• Chi-square test between each variable and the target; remove variables with P value > .20;

• Impute all missing values using expectation maximization;

• 85 variables remaining.

Page 121:

NHANES_data.csv

Page 122:

NHANES_data.csv

• Convert the last column from numeric to nominal

• Find the top 15 features using Pearson Correlation

Page 123:

• OneRAttributeEval

• GainRatioAttributeEval

• InfoGainAttributeEval

• ChiSquaredAttributeEval

• ReliefFAttributeEval
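A rough scikit-learn analogue of this kind of single-attribute ranking (the lecture itself uses Weka; this substitution is mine):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)

    # Score each attribute individually against the class
    # (cf. InfoGainAttributeEval) and keep the top 15.
    selector = SelectKBest(mutual_info_classif, k=15).fit(X, y)
    print(selector.get_support(indices=True))  # indices of selected attributes
    X_reduced = selector.transform(X)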


Page 126:

• Right-click on the new line in the Result list;

• From the pop-up menu, select the item Save reduced data…;

• Save the dataset with the 15 selected attributes to file NHanesPearson.arff;

• Switch to the Preprocess mode in Explorer;

• Click on Open file… and open the file NHanesPearson.arff;

• Switch to the Classify submode;

• Click on Choose, select a classifier, and use this feature set and data to build a predictive model.

Page 127:

• Anything below 0.3 isn’t highly correlated with the target…

Page 128:

What is evaluated? (taxonomy table again)

                          Attributes | Subsets of attributes
  Evaluation method:
    Independent           Filters    | Filters
    Learning algorithm               | Wrappers

Page 129:

Forward selection: at each step, add the attribute that helps most given the already picked features; the evaluation is independent of the final learner.

Page 130:

Backward elimination: start with all attributes and repeatedly remove the one that makes the least contribution; again, the evaluation is independent of the final learner.

Pages 131-134: (no extractable text)

Page 135:

Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 136: (no extractable text)

Page 137:

• Search direction: Forward, Backward, Bi-Directional;

• Attributes to “seed” the search, listed individually or by range;

• Cutoff for backtracking…

Page 138:

CfsSubsetEval

• True: adds features that are correlated with the class and NOT intercorrelated with other features already in the selection; False: eliminates redundant features.

• Precompute the correlation matrix in advance (useful for fast backtracking) or compute it lazily; when given a large number of attributes, compute lazily…

Page 139:

NHANES_data.csv

• Convert the last column from numeric to nominal

• Set the search method as Best First, Forward

• Set the attribute evaluator as CfsSubsetEval

• Run across all attributes in data set…
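In scikit-learn terms, a greedy forward subset search looks roughly like the sketch below; note this one is a wrapper (it scores subsets with a learner), whereas CfsSubsetEval scores subsets without one, so it is an analogue rather than an equivalent:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Grow the subset one attribute at a time, scoring each candidate
    # subset by cross-validated accuracy of the learner.
    sfs = SequentialFeatureSelector(
        KNeighborsClassifier(), n_features_to_select=10,
        direction="forward", cv=5,
    ).fit(X, y)
    print(sfs.get_support(indices=True))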

Page 140:

Important points 1/2

• Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy!

• Relevance <-> Optimality

• Correlation and mutual information between single variables and the target are often used as ranking criteria for variables.

Page 141:

Important points 2/2

• One cannot automatically discard variables with small scores; they may still be useful together with other variables.

• Filters – Wrappers – Embedded methods

• How to search the space of all feature subsets?

• How to assess the performance of a learner that uses a particular feature subset?

Page 142:

It’s not all about accuracy…

• Filtering is fast, linear, and intuitive, but it is model-oblivious and may not be optimal.

• Wrappers are model-aware, but slow and non-intuitive.

• PCA and SVD are lossy and work on the entire data set.

• Start with fast feature filtering first.

• Sometimes the right answer is NOT to use any feature selection at all.

Page 143:

That’s all for tonight…