roger-barga
Deriving Knowledge from Data at Scale
• Opening Discussion 30 minutes: Review Discussion…
• Hands-On with Decision Trees 30 minutes
• Ensembles, Random Forests 60 minutes
• Data Science Modelling 30 minutes: Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes: Clustering, k-Means…
• Optional Reading: Data Science Weekly (2)
• Two Homework Assignments, due next Wednesday
  1. The first is described in the lecture notes
  2. The second is uploaded to the class Catalyst site
• Key Points to Understand, review and discuss
  1. Ensembles, the techniques of Bagging and Boosting
  2. Random Forests
  3. Clustering, specifically K-Means Clustering
What will your data science workflow be? (not having one is a fail…)
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
male    77   yes     gray       ?
male    19   yes     green      ?
female  44   no      gray       ?
Train ML Model
After training, the model predicts the three unlabeled rows:

male    77   yes   gray    yes
male    19   yes   green   no
female  44   no    gray    no
[Diagram: Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction]

1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem: classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical (or a very good proxy) to the quantitative metric identified in step 1.
[Diagram: Train/Test split → Feature selection → Model training → Model scoring → Evaluation]

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
Data Science Workflow.pdf

Develop your own workflow for defining and evaluating project opportunities…
Example 1: Amazon, big spenders. The target of the competition was to predict which customers spend a lot of money, using past purchases. The data consisted of transaction data in different categories. But a winning model identified that ‘Free Shipping = True’ was an excellent predictor.

Leakage: “Free Shipping = True” was simultaneous with the sale, which is a no-no… We can only use data from beforehand to predict the future…
Example 2: Cancer patients plotted by Patient ID – what happened?
What could you do to improve this?...
Winning a competition on leakage is easier than building good models.
But even if you don’t explicitly understand and game the leakage,
your model will do it for you. Either way, leakage is a huge problem.
• You need a strict temporal cutoff: remove all information just prior to the event of interest.
• There has to be a timestamp on every entry, and you need to keep it.
• The best practice is to start from scratch with clean, raw data after careful consideration.
• You need to know how the data was created! I (try to) work only with data I pulled and prepared myself…
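The strict temporal cutoff above can be sketched as a simple filter. This is an illustrative sketch, not from the lecture: the record layout (`timestamp`, `amount` fields) and the one-day gap are assumptions.

```python
from datetime import datetime, timedelta

def features_before_cutoff(events, event_time, gap=timedelta(days=1)):
    """Keep only records strictly before (event_time - gap).

    The gap guards against fields recorded just prior to, or simultaneous
    with, the event of interest (like 'Free Shipping = True' above)."""
    cutoff = event_time - gap
    return [e for e in events if e["timestamp"] < cutoff]

history = [
    {"timestamp": datetime(2015, 3, 1), "amount": 40},
    {"timestamp": datetime(2015, 3, 9), "amount": 15},    # day before the event
    {"timestamp": datetime(2015, 3, 10), "amount": 200},  # simultaneous with it
]
usable = features_before_cutoff(history, datetime(2015, 3, 10))
# only the 2015-03-01 record survives the one-day gap
```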
To avoid overfitting, we cross-validate and we cut down on the complexity of the model to begin with. Here’s a standard picture (although keep in mind we generally work in high-dimensional space and don’t have a pretty picture to look at).
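The cross-validation just mentioned can be sketched as plain k-fold splitting. A minimal sketch with no library dependencies; `fit_majority` and `accuracy` are stand-in toy functions, not part of any API.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices once, then deal them round-robin into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5):
    """Hold out each fold in turn; the spread of the k scores is what lets
    you report the metric with a confidence interval instead of one number."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for held_out in folds:
        train = [j for f in folds if f is not held_out for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        scores.append(score(model, [X[j] for j in held_out],
                            [y[j] for j in held_out]))
    return scores

# toy model: always predict the majority class of the training folds
fit_majority = lambda X, y: max(set(y), key=y.count)
accuracy = lambda m, X, y: sum(1 for t in y if t == m) / len(y)

X = list(range(20))
y = [0] * 14 + [1] * 6
scores = cross_validate(fit_majority, accuracy, X, y)
```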
The art in data science
The science in data science: some evaluation metric, and rigorous testing and experimentation to either validate or refute
[Figure: learning curve of accuracy (0.65–0.79) vs. # training records (0–3000). Accuracy on test data stabilizes above 1000 training samples.]
Review: Decision Tree
1. Automatically selects features
2. Able to handle a large number of features
3. Handles numeric, nominal, and missing values
4. Easy to ensemble (Random Forest, Boosted DT)
5. I can romance on DTs for hours …
[Figure: decision tree for assignment of a drug to a patient. The root splits on blood pressure (high / normal / low): high → Drug A; normal → split on age (≤ 40 → Drug A, > 40 → Drug B); low → Drug B.]
[Figure: feature space partitioned into rectangular regions around the labeled “+” points; the highlighted region has p_m = 5/6.]

Once regions are chosen, class probabilities are easy to calculate.
When completed, submit this assignment to the dropbox for homework Lecture 4.
Your task for this assignment: Design a simple, low-cost sensor that can distinguish
between red wine and white wine.
Your sensor must correctly distinguish between red and white wine for at least 95% of the
samples in a set of 6497 test samples of red and white wine.
Your technology is capable of sensing the following wine attributes:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Density
- Free sulphur dioxide
- Total sulphur dioxide
- Sulphates
- pH
- Alcohol
To keep your sensor cheap and simple, you need to sense as few of these attributes as
possible to meet the 95% requirement.
Question: Which attributes should your sensor be capable of measuring?
1. Go to our class website, Lecture 4, and download the associated homework files.
2. Read WineQuality.pdf.
3. Open the RedWhiteWine.arff file in Weka, and remove the quality attribute, which you will not need for this assignment.
4. Run J48 with the default set-up to see what percent-correct classification you get using all attributes.
5. Remove attributes to find the minimum number of attributes needed to meet the 95% correct classification requirement.
[Screenshot: Weka attribute list. You can use these buttons to simplify the task of removing attributes.]
(Paste a screenshot showing your results for your minimum attribute set here)
• Opening Discussion 30 minutes: Review Discussion…
• Ensembles, Random Forests 60 minutes
• Data Science Modelling 30 minutes: Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes: Clustering, k-Means…
• Close
Diversity of Opinion
Independence
Decentralization
Aggregation
The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
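The basic idea above can be sketched in a couple of lines; `bootstrap` is an illustrative name, not a library function.

```python
import random

def bootstrap(sample, rng=None):
    """Draw len(sample) items uniformly with replacement:
    the bootstrap sample has the same size as the original set."""
    rng = rng or random.Random(0)
    return [rng.choice(sample) for _ in sample]

data = list(range(10))
b = bootstrap(data)
# len(b) == 10, but some originals repeat and (on average) ~1/3 are absent
```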
[Figure: ellipsoid separation, two categories, two predictors: a single tree’s decision boundary vs. 100 bagged trees.]
Random Forest Classifier

Training data: N examples, M features.

1. Create bootstrap samples from the training data.
2. Construct a decision tree from each bootstrap sample.
3. At each node, choose the split feature only from a random subset of m < M features.
4. To classify a new example, take the majority vote over all trees.
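The four steps above can be sketched as follows. To stay short, this sketch grows one-node decision stumps instead of full trees, and samples the m < M features once per tree rather than at every node; `train_stump` and `random_forest` are illustrative names, not a library API.

```python
import random
from collections import Counter

def train_stump(rows, feat_indices):
    """Best single-feature threshold split among the allowed features.
    rows is a list of (feature_tuple, label) pairs."""
    best = None
    for f in feat_indices:
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            l_lab, l_n = Counter(left).most_common(1)[0]
            r_lab, r_n = Counter(right).most_common(1)[0]
            if best is None or l_n + r_n > best[0]:
                best = (l_n + r_n, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap: fall back to the majority class
        maj = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: maj
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def random_forest(rows, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    M = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]   # 1. bootstrap sample
        feats = rng.sample(range(M), m)           # 3. m < M random features
        trees.append(train_stump(boot, feats))    # 2. one (stump) tree each
    # 4. majority vote across the trees
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

rows = [((0.0, 0.0), 0), ((0.1, 0.1), 0), ((0.2, 0.2), 0),
        ((0.8, 0.8), 1), ((0.9, 0.9), 1), ((1.0, 1.0), 1)]
predict = random_forest(rows)
```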
Diversity of Opinion (private information)
Independence
Decentralization
Aggregation
Decision Trees and Decision Forests

A forest is an ensemble of trees. The trees are all slightly different from one another.

[Figure: a general tree structure with nodes numbered 0–14: a root node, internal (split) nodes, and terminal (leaf) nodes. Beside it, a decision tree whose splits ask “Is top part blue?”, “Is bottom part blue?”, and “Is bottom part green?”.]
Decision Forest Model: the randomness model

1) Bagging (randomizing the training set): during forest training, each tree t sees only a randomly sampled subset of the full training set.
Decision Forest Model: the randomness model

2) Randomized node optimization (RNO): at each node, only a randomly sampled subset of the full set of possible node test parameters is considered when training the node’s weak learner. A randomness control parameter sets the size of that subset: at one extreme there is no randomness and maximum tree correlation; at the other, maximum randomness and minimum tree correlation. A small value gives little tree correlation; a large value gives large tree correlation.
Decision Forest Model: training and information gain

Node training (for categorical, non-parametric distributions) picks the split that maximizes the information gain between the distribution before the split and the distributions after it (Split 1 and Split 2):

I = H(S) − Σ_i (|S^i| / |S|) H(S^i)

where H is Shannon’s entropy, H(S) = − Σ_c p(c) log p(c).
Classification Forest

Training data in feature space; the “?” points are unlabeled inputs to classify.

Model specialization for classification:
• Input: a data point
• Output: categorical (a discrete set of classes)
• Node weak learner and node test parameters: define each split (the test parameters select the feature response)
• Objective function for node j: the information gain
• Predictor model: the class posterior
• Entropy of a discrete distribution: H = − Σ_c p(c) log p(c), with Σ_c p(c) = 1
Classification Forest: the weak learner model

The node weak learner, parameterized by the node test parameters, splits the data arriving at node j. Examples of weak learners (feature responses for a 2D example):
• Axis aligned: threshold a single coordinate
• Oriented line: a generic line in homogeneous coordinates
• Conic section: a matrix representing a conic
In general the feature response may select only a very small subset of features.
Classification Forest: the prediction model

What do we do at the leaf? The prediction model is probabilistic: each leaf stores a class distribution.
Classification Forest: the ensemble model

The forest output probability is the average of the individual tree posteriors over trees t = 1, 2, …, T:

p(c|v) = (1/T) Σ_t p_t(c|v)
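Averaging tree posteriors into the forest output can be sketched directly; `forest_posterior` is an illustrative name, and the per-tree distributions here are made-up numbers.

```python
def forest_posterior(tree_posteriors):
    """Forest output: the average of the per-tree class posteriors,
    p(c|v) = (1/T) * sum_t p_t(c|v)."""
    T = len(tree_posteriors)
    return {c: sum(p[c] for p in tree_posteriors) / T
            for c in tree_posteriors[0]}

votes = [{"A": 0.9, "B": 0.1}, {"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}]
posterior = forest_posterior(votes)  # ≈ {"A": 0.6, "B": 0.4}
```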
Deriving Knowledge from Data at Scale
Training different trees in the forest
Testing different trees in the forest
(2 videos in this page)
Classification Forest: effect of the weak learner model
Parameters: T=200, D=2, weak learner = aligned, leaf model = probabilistic
• “Accuracy of prediction”
• “Quality of confidence”
• “Generalization”
Three concepts to keep in mind:
Training points
Classification Forest: with >2 classes

[Videos: training and testing the different trees in the forest on the training points. Parameters: T=200, D=3, weak learner = conic, leaf model = probabilistic.]
Classification Forest: effect of tree depth

[Videos: 4-class mixed training points fit with T=200 and max tree depth D = 3, 6, and 15 (weak learner = conic, predictor model = probabilistic), moving from underfitting to overfitting as D grows.]
Classification Forest: analysing generalization

[Videos: testing posteriors for T=200, D=13, weak learner = conic, predictor = probabilistic, on 4-class spiral training points with increasingly large gaps.]
• Opening Discussion 30 minutes: Review Discussion…
• Ensembles, Random Forests 60 minutes
• Data Science Modelling 30 minutes: Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes: Clustering, k-Means…
• Close
[Diagram: Data Acquisition → Data Exploration → Pre-processing → Feature and Target construction → Train/Test split → Feature selection → Model training → Model scoring → Evaluation. Two model scoring/evaluation branches feed a final “Compare metrics” step.]
Performance Metrics

Percent Reduction in Error
• 80% accuracy = 20% error
• Suppose learning increases accuracy from 80% to 90%: error is reduced from 20% to 10%, a 50% reduction in error
• 99.90% to 99.99% = 90% reduction in error
• 50% to 75% = 50% reduction in error
• The same idea can be applied to many other measures
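The bullet arithmetic above can be captured in one function; `error_reduction` is an illustrative name.

```python
def error_reduction(acc_before, acc_after):
    """Fractional reduction in error when accuracy
    moves from acc_before to acc_after."""
    before, after = 1 - acc_before, 1 - acc_after
    return (before - after) / before

error_reduction(0.80, 0.90)      # ≈ 0.5: half the errors eliminated
error_reduction(0.9990, 0.9999)  # ≈ 0.9
error_reduction(0.50, 0.75)      # ≈ 0.5
```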
Performance Metrics

Precision and Recall
• Typically used in document retrieval
• Precision: how many of the returned documents are correct; it varies with the decision threshold
• Recall: how many of the positives does the model return; it also varies with the decision threshold
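Both metrics reduce to counting true positives, false positives, and false negatives; a minimal sketch (`precision_recall` is an illustrative name):

```python
def precision_recall(y_true, y_pred, positive=1):
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 3 documents returned, 2 of them correct; 4 positives exist overall
p, r = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
# precision 2/3, recall 1/2
```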
Next week we will go deeper with ROC curves, kappa, lift charts, etc…
• Opening Discussion 30 minutes: Review Discussion…
• Ensembles, Random Forests 60 minutes
• Break 5 minutes
• Data Science Modelling 30 minutes: Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes: Clustering, k-Means…
• Close
Clustering: grouping objects so that objects within a group are similar to one another and different from (or unrelated to) the objects in other groups.

[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.]
• Outliers: objects that do not belong to any cluster; finding them is the subject of outlier analysis.

[Figure: a cluster of points alongside a few outliers.]
[Figure: the same four points p1–p4 grouped by traditional hierarchical clustering and by non-traditional hierarchical clustering, each with its dendrogram.]
Linkage criteria for hierarchical clustering:
• Single linkage: minimum distance between clusters
• Complete linkage: maximum distance between clusters
• Average linkage: average distance between clusters
• Ward’s method: minimization of within-cluster variance
• Centroid method: distance between cluster centres

Clusterings may be non-overlapping or overlapping, and hierarchical or non-hierarchical; hierarchical methods are either agglomerative or divisive.
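The first three linkage criteria can be sketched directly over any pairwise distance. The function names are illustrative, and the example uses 1-D points with absolute difference as the distance, purely for readability.

```python
def single_link(A, B, d):
    """Single linkage: minimum distance over all cross-cluster pairs."""
    return min(d(a, b) for a in A for b in B)

def complete_link(A, B, d):
    """Complete linkage: maximum distance over all cross-cluster pairs."""
    return max(d(a, b) for a in A for b in B)

def average_link(A, B, d):
    """Average linkage: mean distance over all cross-cluster pairs."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

dist = lambda a, b: abs(a - b)
A, B = [0, 1], [4, 6]
single_link(A, B, dist)    # 3   (pair 1 and 4)
complete_link(A, B, dist)  # 6   (pair 0 and 6)
average_link(A, B, dist)   # 4.5 ((4 + 6 + 3 + 5) / 4)
```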
A distance function d(x, y) between points x and y is a metric if:
• d(i, j) ≥ 0 (non-negativity)
• d(i, i) = 0 (isolation)
• d(i, j) = d(j, i) (symmetry)
• d(i, j) ≤ d(i, h) + d(h, j) (triangular inequality)
Attributes may be real, boolean, categorical, or ordinal.
p = 2, L2 Euclidean distance:

d(x, y) = ( |x_1 − y_1|² + |x_2 − y_2|² + … + |x_d − y_d|² )^(1/2)

Weighted distance:

d(x, y) = ( w_1|x_1 − y_1|² + w_2|x_2 − y_2|² + … + w_d|x_d − y_d|² )^(1/2)
     Q1  Q2  Q3  Q4  Q5  Q6
X     1   0   0   1   1   1
Y     0   1   1   0   1   0

• Jaccard similarity between binary vectors X and Y: JSim(X, Y) = |X ∩ Y| / |X ∪ Y|
• Jaccard distance between binary vectors X and Y: Jdist(X, Y) = 1 − JSim(X, Y)
• Example: JSim = 1/6, Jdist = 5/6
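The worked Jaccard example can be checked in a few lines; `jaccard_sim` is an illustrative name.

```python
def jaccard_sim(x, y):
    """Jaccard similarity of two binary vectors: |X ∩ Y| / |X ∪ Y|."""
    both = sum(1 for a, b in zip(x, y) if a and b)    # positions set in both
    either = sum(1 for a, b in zip(x, y) if a or b)   # positions set in either
    return both / either

X = [1, 0, 0, 1, 1, 1]
Y = [0, 1, 1, 0, 1, 0]
jaccard_sim(X, Y)      # 1/6 (only Q5 is shared; all six positions are in the union)
1 - jaccard_sim(X, Y)  # Jaccard distance, 5/6
```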
• Lp Minkowski distance:

  Lp(x, y) = ( |x_1 − y_1|^p + |x_2 − y_2|^p + … + |x_d − y_d|^p )^(1/p) = ( Σ_i |x_i − y_i|^p )^(1/p)

• p = 1, L1 Manhattan (or city block) distance:

  L1(x, y) = |x_1 − y_1| + |x_2 − y_2| + … + |x_d − y_d|
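The Minkowski family above is one line of code; `minkowski` is an illustrative name.

```python
def minkowski(x, y, p):
    """Lp distance: (sum_i |x_i - y_i|^p)^(1/p).
    p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
minkowski(x, y, 1)  # 7.0  (Manhattan)
minkowski(x, y, 2)  # 5.0  (Euclidean)
```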
[Figure: the same set of original points clustered two ways: a sub-optimal clustering and the optimal clustering.]
[Figures: four animated runs of k-means on the example points, showing cluster assignments and centroid updates across iterations (1–6, 1–6, 1–5, and 1–5).]
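The iteration snapshots above follow the standard k-means (Lloyd's) loop, which can be sketched as below. This is a minimal sketch assuming Euclidean distance and random-sample initialization; `kmeans` is an illustrative function, not a library call.

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # simple random initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # empty clusters keep their old centroid
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                     else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
# the two well-separated groups of three points end up in separate clusters
```

As the slides illustrate, the result depends on the initial centroids: a poor initialization can converge to a sub-optimal clustering, which is why k-means is usually restarted several times.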