Deriving Knowledge from Data at Scale

Barga Data Science lecture 7


Page 1: (title slide)

Page 2:

Good data preparation is key to producing valid and reliable models…

Page 3:

Lecture 7 Agenda

Pages 4-6: (no extractable text)

Page 7:

Heads Up, Lecture 8 Agenda

Page 8:

Course Project

Page 9: (no extractable text)

Page 10:

https://www.kaggle.com/c/springleaf-marketing-response

Determine whether or not to send a direct mail piece to a customer

Page 11:

The Data

Page 12:

The Rules

Page 13: (no extractable text)

Page 14:

What is the data telling you?

Page 15: (no extractable text)

Page 16:

The Tragedy of the Titanic

Pages 17-19: (no extractable text)

Page 20:

Does the new data point x* exactly match a previous point xi?

If so, assign it to the same class as xi

Otherwise, just guess.

This is the “rote” classifier
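A minimal sketch of this in Python (my own illustration; the names and structure are not from the lecture):

    import random

    def rote_classify(x_star, train_X, train_y, classes):
        # Exact-match lookup: if the new point matches a stored point,
        # return that point's class...
        for x_i, y_i in zip(train_X, train_y):
            if x_i == x_star:
                return y_i
        # ...otherwise just guess.
        return random.choice(classes)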

Page 21:

Does the new data point x* match a set of previous points xi on some specific attribute?

If so, take a vote to determine class.

Example: If most females survived, then assume every female survives

But there are lots of possible rules like this.

And an attribute can have more than two values.

If most people under 4 years old survive, then assume everyone under 4 survives

If most people with 1 sibling survive, then assume everyone with 1 sibling survives

How do we choose?

Page 22:

IF sex=‘female’ THEN survive=yes

ELSE IF sex=‘male’ THEN survive=no

confusion matrix

no yes <-- classified as

468 109 | no

81 233 | yes

(468 + 233) / (468+109+81+233) = 79% correct (and 21% incorrect)

Not bad!
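This arithmetic is easy to reproduce; a sketch in Python with pandas, assuming a Titanic file with sex and survived columns (the file and column names are my assumption, not the lecture's):

    import pandas as pd

    df = pd.read_csv("titanic.csv")  # assumed columns: sex, survived
    pred = (df["sex"] == "female").map({True: "yes", False: "no"})
    actual = df["survived"].map({1: "yes", 0: "no"})

    print(pd.crosstab(actual, pred))          # confusion matrix
    print(f"{(pred == actual).mean():.0%} correct")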

Page 23: (no extractable text)

Page 24:

IF pclass=‘1’ THEN survive=yes

ELSE IF pclass=‘2’ THEN survive=yes

ELSE IF pclass=‘3’ THEN survive=no

confusion matrix

no yes <-- classified as

372 119 | no

177 223 | yes

(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)

a little worse

Page 25:

Support Vector Machines

Pages 26-27: (no extractable text)

Page 28:

[Scatter plot of the training data; one marker denotes +1, the other denotes −1]

f(x, w, b) = sign(w·x − b)

How would you classify this data?

Estimation:

w: weight vector

x: data vector

Pages 29-31: (the same slide repeated, each time with a different candidate separating line drawn)

Page 32:

[Same scatter plot, with several candidate separating lines drawn]

f(x, w, b) = sign(w·x − b)

Any of these would be fine… but which is best?

Page 33:

f(x, w, b) = sign(w·x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
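With the boundary w·x − b = 0 scaled so that the closest points on either side satisfy w·x − b = ±1, the margin works out to

    margin = 2 / ‖w‖

so maximizing the margin is equivalent to minimizing ‖w‖² subject to y_i(w·x_i − b) ≥ 1 for every training point.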

Pages 34-35:

[Figures: O’s and X’s separated by a line, with the boundary-touching points labeled SUPPORT VECTORS]

Page 36:

f(x, w, b) = sign(w·x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).

Linear SVM

Page 37:

f(x, w, b) = sign(w·x − b)

Support vectors are those datapoints that the margin pushes up against.

Linear SVM

Page 38:

Why maximum margin?

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.

3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector data points.

4. Empirically it works very well.
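A minimal sketch of fitting a maximum-margin linear classifier, using scikit-learn rather than the Weka tools shown later in the lecture (my substitution):

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data, labeled +1 / -1.
    X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

    print(clf.support_vectors_)                   # points the margin pushes against
    w = clf.coef_[0]
    print("margin width:", 2 / np.linalg.norm(w))
    print(clf.predict([[2, 2], [6, 6]]))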

Page 39:

[Figure: Class 1 and Class 2 separated by a hyperplane; m marks the margin width]

Page 40:

This is going to be a problem!

Page 41: (no extractable text)

Page 42:

What if the data looks like that below?...

kernel function
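(The idea, stated compactly: map the points by some φ into a higher-dimensional feature space where they become linearly separable, without ever computing φ explicitly; the SVM only needs the inner products K(x, z) = φ(x)·φ(z). The RBF kernel, for example, is K(x, z) = exp(−γ‖x − z‖²).)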

Page 43:

What if the data looks like that below?...

Page 44:

Everything you need to know in one slide…

Pages 45-46: (no extractable text)

Page 47:

Choosing the kernel is probably the trickiest part of using an SVM.

• RBF is a good first option…

• It depends on your data: try several.

• Kernels have even been developed for non-numeric data like sequences, structures, and trees/graphs.

• It may help to use a combination of several kernels.

• Don’t touch your evaluation data while you’re trying out different kernels and parameters; use cross-validation for this if you’re short on data (a cross-validated search is sketched below).
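A sketch of that discipline with scikit-learn (my analogue of the Weka workflow): tune kernel and parameters by cross-validation on the training split only, and touch the held-out test set exactly once.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    grid = GridSearchCV(
        SVC(),
        {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
        cv=5,                       # 5-fold cross-validation on the training data
    )
    grid.fit(X_tr, y_tr)
    print(grid.best_params_, grid.score(X_te, y_te))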

Page 48:

The complexity of the optimization problem remains dependent only on the dimensionality of the input space, not on that of the feature space!
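This is because the standard dual problem the optimizer solves,

    maximize over α:  Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
    subject to:       α_i ≥ 0  and  Σ_i α_i y_i = 0

touches the data only through kernel evaluations K(x_i, x_j), never through feature-space coordinates.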

Page 49: (no extractable text)

Page 50:

• SVM 1 learns “Output==1” vs “Output != 1”

• SVM 2 learns “Output==2” vs “Output != 2”

….

• SVM N learns “Output==N” vs “Output != N”
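A one-vs-rest sketch in scikit-learn terms (my analogue; the scheme is exactly the N binary SVMs listed above):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

    print(len(ovr.estimators_))   # one binary SVM per class (3 for iris)
    print(ovr.predict(X[:5]))     # predict with the most confident of the N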

Pages 51-54: (no extractable text)

Page 55:

• weka.classifiers.functions.SMO

• weka.classifiers.functions.LibSVM

• weka.classifiers.functions.SMOreg

Pages 56-58: (no extractable text)

Page 59:

IF pclass=‘1’ THEN survive=yes

ELSE IF pclass=‘2’ THEN survive=yes

ELSE IF pclass=‘3’ THEN survive=no

confusion matrix

no yes <-- classified as

372 119 | no

177 223 | yes

(372 + 223) / (372+119+223+177) = 67% correct (and 33% incorrect)

a little worse

Page 60:

Support Vector Machine Model, Titanic Data, Linear Kernel

Page 61:

Support Vector Machine Model, RBF Kernel, Titanic Data

Page 62:

Support Vector Machine Model, RBF Kernel, Titanic Data

overfitting?

Page 63: (slide credit: Bill Howe, UW)

Support Vector Machine Model, RBF Kernel, Titanic Data

Gamma: a parameter that controls/balances model complexity against accuracy.
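A sketch of that trade-off (a scikit-learn stand-in for the Weka model above): small gamma gives a smooth, simple boundary; large gamma gives a wiggly one that can memorize the training set.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

    for gamma in [0.01, 1, 100]:
        train_acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)
        cv_acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
        # High train accuracy with poor CV accuracy signals overfitting.
        print(f"gamma={gamma:<6} train={train_acc:.2f}  cv={cv_acc:.2f}")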

Page 64:

How They Won It!

Lessons from mining past Kaggle contests…

Page 65: (no extractable text)

Page 66:

10 Minute Break…

Page 67:

Data Wrangling

Page 68:

gold standard data sets

Page 69:

…no one dominant attribute, an even distribution…

Page 70:

• Rule of thumb: 5,000 or more instances desired

• Rule of thumb: for each attribute, 10 or more instances

• Rule of thumb: >100 instances for each class

Page 71:

• Data cleaning

• Data integration

• Data transformation

• Data reduction

• Data discretization

Page 72: (no extractable text)

Page 73:

1. Missing values

2. Outliers

3. Coding

4. Constraints

Page 74:

• ReplaceMissingValues

• RemoveMisclassified

• MergeTwoValues

Page 75:

Missing values: in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. “Missing” can mean many things…

• MAR, “Missing at Random”: usually the best case, but usually not true.

• Non-randomly missing.

• Presumed normal, so not measured.

• Causally missing: the attribute value is missing because of other attribute values (or because of the outcome value!).

Page 76: (no extractable text)

Page 77:

Representation matters…

Pages 78-79: (no extractable text)

Page 80:

[Figure: Bank Data A and Bank Data B, each with a query Q]

Page 81:

Data Transformation

Simple transformations can often have a large impact on performance.

Example transformations (not all for performance improvement):

• Difference of two date attributes, distance between coordinates, …

• Ratio of two numeric (ratio-scale) attributes, average for smoothing, …

• Concatenating the values of nominal attributes

• Encoding (probabilistic) cluster membership

• Adding noise to data (for robustness tests)

• Removing data randomly or selectively

• Obfuscating the data (for anonymity)

Intuition: add features that increase class discrimination (entropy, information gain)…
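For instance, the first two bullets in pandas (a generic sketch; the table and column names are mine):

    import pandas as pd

    df = pd.DataFrame({
        "applied": pd.to_datetime(["2015-01-03", "2015-02-10"]),
        "funded":  pd.to_datetime(["2015-01-20", "2015-03-01"]),
        "debt":    [12000, 4000],
        "income":  [48000, 52000],
    })

    # Difference of two date attributes.
    df["days_to_fund"] = (df["funded"] - df["applied"]).dt.days
    # Ratio of two numeric attributes.
    df["debt_to_income"] = df["debt"] / df["income"]
    print(df[["days_to_fund", "debt_to_income"]])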

Page 82:

• Combine attributes

• Normalizing data

• Simplifying data

Page 83:

Change of scale
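Two common changes of scale, sketched generically in Python:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 10.0])

    x_minmax = (x - x.min()) / (x.max() - x.min())  # rescale to [0, 1]
    x_standard = (x - x.mean()) / x.std()           # zero mean, unit variance
    print(x_minmax, x_standard)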

Pages 84-85: (no extractable text)

Page 86:

Aggregated data tends to have less variability and potentially more information for better predictions…

Pages 87-92: (no extractable text)

Page 93:

Fills in the missing values for an instance with the expected values.

Page 94:

Specify the new ordering…

• 2-last,1 (moves the first attribute to the end)

• 1-5,7-last,6 (moves attribute 6 to the end)

Page 95:

weka.filters.unsupervised.instance.Resample produces a random subsample, with or without replacement:

• To replace or not: choose sampling with or without replacement.

• The same random seed will produce the same (repeatable) sample.

• Sample size is specified as a percentage of the original data set size.
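The same options in pandas terms (a generic sketch, not the Weka filter itself):

    import pandas as pd

    df = pd.DataFrame({"x": range(100)})  # stand-in for any data set

    # 30% subsample without replacement; a fixed seed makes it repeatable.
    sub = df.sample(frac=0.3, replace=False, random_state=42)
    # Sampling WITH replacement (duplicates allowed, as in a bootstrap).
    boot = df.sample(frac=1.0, replace=True, random_state=42)
    print(len(sub), len(boot))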

Page 96:

Suggestions for you to try…

Page 97: (no extractable text)

Page 98:

Curse of Dimensionality

The number of samples needed to cover the input space grows exponentially with its dimensionality. In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!

Page 99:

• Sampling

Page 100:

A sample will work almost as well as using the entire data set if it is representative, i.e., if it has approximately the same property (of interest) as the original set of data.

Pages 101-102: (no extractable text)

Page 103:

• Principal Component Analysis

Page 104: (no extractable text)

Page 105:

Find k ≤ n new features (principal components) that can best represent the data.

Works for numeric data only…
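A minimal PCA sketch with scikit-learn (my stand-in for the Weka filter):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))        # 200 instances, n = 10 numeric features

    pca = PCA(n_components=3)             # k = 3 <= n new features
    X_reduced = pca.fit_transform(X)      # project onto the principal components
    print(pca.explained_variance_ratio_)  # variance captured by each component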

Page 106: (no extractable text)

Page 107:

Attribute Selection (feature selection)

Page 108:

What is feature selection?

DIMENSIONALITY REDUCTION

Page 109:

Problem: Where to focus attention?

Page 110:

Feature selection starts with you… aim for the smallest subset of attributes.

Page 111:

What is evaluated?

                          Attributes | Subsets of attributes
  Evaluation method:
    Independent           Filters    | Filters
    Learning algorithm               | Wrappers

Page 112: (the same taxonomy table, repeated)

Page 113:

A list of attributes, each evaluated individually and ranked; you then select a subset from the top of the list.

Page 114: (no extractable text)

Page 115:

A correlation coefficient shows the degree of linear dependence of x and y; in other words, the coefficient shows how close two variables lie along a line. If the coefficient is equal to 1 or -1, all the points lie along a line. If the correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not necessarily mean that there is no relation between the two variables: there could, for example, be a non-linear relation.
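For reference, the Pearson coefficient of x and y is

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( √Σ_i (x_i − x̄)² · √Σ_i (y_i − ȳ)² )

i.e., the covariance of x and y divided by the product of their standard deviations.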

Page 116:

Tab for selecting attributes in a data set…

Page 117:

Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 118:

Select CorrelationAttributeEval for Pearson correlation…

False: doesn’t return the R scores; True: returns the R scores.

Page 119:

Ranker ranks attributes by their individual evaluations; it is used in conjunction with GainRatio, Entropy, Pearson, etc.

• Number of attributes to return: -1 returns all ranked attributes.

• Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10].

• Cutoff below which attributes can be discarded: -1 means no cutoff.

Page 120:

Pearson Correlation Exercise…

Predicting Self-Reported Health Status
The data set: NHANES_data.csv (National Health and Nutrition Examination Survey)

“How would you say your health in general is?” An excellent predictor of mortality, health care utilization, and disability.

How I processed it…

• 4,000 variables;

• Attributes with > 30% missing values removed (dropped the column);

• 105 variables remaining;

• Chi-square test between each variable and the target; remove variables with P value > .20;

• Impute all missing values using expectation maximization;

• 85 variables remaining.

Page 121:

NHANES_data.csv

Page 122:

NHANES_data.csv

• Convert the last column from numeric to nominal

• Find the top 15 features using Pearson Correlation

Page 123:

• OneRAttributeEval

• GainRatioAttributeEval

• InfoGainAttributeEval

• ChiSquaredAttributeEval

• ReliefFAttributeEval
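A rough scikit-learn analogue of this kind of single-attribute ranking (the lecture itself uses Weka; this substitution is mine):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)

    # Score each attribute individually against the class
    # (cf. InfoGainAttributeEval) and keep the top 15.
    selector = SelectKBest(mutual_info_classif, k=15).fit(X, y)
    print(selector.get_support(indices=True))  # indices of selected attributes
    X_reduced = selector.transform(X)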


Page 126:

• Right-click on the new line in the Result list;

• From the pop-up menu, select the item Save reduced data…;

• Save the dataset with the 15 selected attributes to file NHanesPearson.arff;

• Switch to the Preprocess mode in Explorer;

• Click on Open file… and open the file NHanesPearson.arff;

• Switch to the Classify submode;

• Click on Choose, select a classifier, and use this feature set and data to build a predictive model.

Page 127:

• Anything below 0.3 isn’t highly correlated with the target…

Page 128:

What is evaluated? (taxonomy table again)

                          Attributes | Subsets of attributes
  Evaluation method:
    Independent           Filters    | Filters
    Learning algorithm               | Wrappers

Page 129:

Forward selection: at each step, add the attribute that helps most given the already picked features; the evaluation is independent of the final learner.

Page 130:

Backward elimination: start with all attributes and repeatedly remove the one that makes the least contribution; again, the evaluation is independent of the final learner.

Pages 131-134: (no extractable text)

Page 135:

Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 136: (no extractable text)

Page 137:

• Search direction: Forward, Backward, Bi-Directional;

• Attributes to “seed” the search, listed individually or by range;

• Cutoff for backtracking…

Page 138:

CfsSubsetEval

• True: adds features that are correlated with the class and NOT intercorrelated with other features already in the selection; False: eliminates redundant features.

• Precompute the correlation matrix in advance (useful for fast backtracking) or compute it lazily; when given a large number of attributes, compute lazily…

Page 139:

NHANES_data.csv

• Convert the last column from numeric to nominal

• Set the search method as Best First, Forward

• Set the attribute evaluator as CfsSubsetEval

• Run across all attributes in data set…
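In scikit-learn terms, a greedy forward subset search looks roughly like the sketch below; note this one is a wrapper (it scores subsets with a learner), whereas CfsSubsetEval scores subsets without one, so it is an analogue rather than an equivalent:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Grow the subset one attribute at a time, scoring each candidate
    # subset by cross-validated accuracy of the learner.
    sfs = SequentialFeatureSelector(
        KNeighborsClassifier(), n_features_to_select=10,
        direction="forward", cv=5,
    ).fit(X, y)
    print(sfs.get_support(indices=True))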

Page 140:

Important points 1/2

• Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy!

• Relevance <-> Optimality

• Correlation and mutual information between single variables and the target are often used as ranking criteria for variables.

Page 141:

Important points 2/2

• One cannot automatically discard variables with small scores; they may still be useful together with other variables.

• Filters – Wrappers – Embedded methods

• How to search the space of all feature subsets?

• How to assess the performance of a learner that uses a particular feature subset?

Page 142:

It’s not all about accuracy…

• Filtering is fast, linear, and intuitive, but it is model-oblivious and may not be optimal.

• Wrappers are model-aware, but slow and non-intuitive.

• PCA and SVD are lossy and work on the entire data set.

• Start with fast feature filtering first.

• Sometimes the right answer is NOT to use any feature selection at all.

Page 143:

That’s all for tonight…