Introduction to Predictive Modeling:
From Correlation to Supervised
Segmentation
KSE 521
Topic 4
Mun Yi
Agenda
Basic Terminology
Supervised Segmentation
Entropy and Information Gain
Tree-Based Classifier
Probability Estimation
Quotes on Prediction
Prediction is very difficult, especially if it’s about the future. – Niels Bohr, Physicist
A little prediction goes a long way. – Eric Siegel, Author of Predictive Analytics
An organization’s ability to learn, and translate that learning into action rapidly, is the ultimate competitive advantage. – Jack Welch, Former Chairman of GE
Gartner Hype Cycle
(Figure)
Introduction
Key concept of BI: Predictive modeling
Supervised segmentation: how can we segment the population with respect to something that we would like to predict or estimate?
- “Which customers are likely to leave the company when their contracts expire?”
- “Which potential customers are likely not to pay off their account balances?”
Technique: Find or select important, informative variables / attributes of the entities w.r.t. a target
- Are there one or more other variables that reduce our uncertainty about the value of the target?
- Select informative subsets in large databases
Models and Induction
A model is a simplified representation of reality created to serve a purpose
A predictive model is a formula for estimating the unknown value of interest: the target
- Classification / class-probability estimation and regression models
- Prediction = estimate an unknown value
- Credit scoring, spam filtering, fraud detection
Descriptive modeling: gain insight into the underlying phenomenon or process
Data Terminology
Person ID   Age   Gender   Income   Balance   Mortgage payment
123213      32    F        25000    32000     Y
17824       49    M        12000    -3000     N
232897      60    F        8000     1000      Y
288822      28    M        9000     3000      Y
...         ...   ...      ...      ...       ...
Dataset: “the data table” (flat file)
Variables (columns): attributes, features, explanatory or independent variables
Target variable: the dependent variable (here, Mortgage payment)
Records: (data) instances (rows)
(17824, 49, M, 12000, -3000) is a feature vector
Finding Informative Attributes
Are there one or more other variables that reduce our uncertainty about the value of the target variable?
(Same dataset table as above.)
Main Questions
How can we judge whether a variable
contains important information about the
target variable?
How can we (automatically) obtain a
selection of the more informative variables
with respect to predicting the value of the
target variable?
Even better, can we obtain the ranking of the
variables?
Example - A Set of People to be Classified
Attributes:
- head-shape: square, circular
- body-shape: rectangular, oval
- body-color: black, white
Target variable: Yes, No
Selecting Informative Attributes
Which attribute is the most informative? Or the most useful for distinguishing between data instances?
If we split our data according to this variable, we would like the resulting groups to be as pure as possible.
By pure we mean homogeneous with respect to the target variable.
If every member of a group has the same value for the target, then the group is totally pure.
Example
If this is our entire dataset:
Then, we can obtain two pure groups by splitting according to body shape:
Concerns
Attributes rarely split a group perfectly.
Even if one subgroup happens to be pure, the other may not.
Is a very small, pure group a good thing?
How should continuous and categorical attributes be handled?
Entropy and Information Gain
Target variable has two (or more) categories: 1, 2, …, m
- Probability p1 for category 1
- Probability p2 for category 2
- …
Entropy:
H(X) = −[p1 × log2(p1) + p2 × log2(p2) + … + pm × log2(pm)]
Entropy
H(X) = −[p1 × log2(p1) + … + pm × log2(pm)]
H(X) = −[0.5 × log2(0.5) + 0.5 × log2(0.5)] = 1
H(X) = −[0.75 × log2(0.75) + 0.25 × log2(0.25)] ≈ 0.81
H(X) = −[1 × log2(1)] = 0
Entropy
(Figure)
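To make the formula concrete, here is a minimal Python sketch of the entropy calculation; the function name and the use of a plain list of class probabilities are choices made for this sketch, not notation from the slides:

```python
import math

def entropy(probabilities):
    """H(X) = -[p1*log2(p1) + ... + pm*log2(pm)] over the class probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.75, 0.25]))  # ~0.81
print(entropy([1.0]))         # 0.0 (a totally pure group)
```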
Information Gain
Calculation of information gain (IG):
IG(parent, children) = entropy(parent) − [p(c1) × entropy(c1) + p(c2) × entropy(c2) + …]
Note: Higher IG indicates a more informative split by the variable.
(Diagram: a parent node split into child nodes c1, c2, …)
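A matching sketch of the IG formula, reusing the entropy helper above; the (weight, probabilities) representation of the children is an assumption of this sketch:

```python
def information_gain(parent_probs, children):
    """IG = entropy(parent) - sum of p(c_i) * entropy(c_i) over the children.

    `children` is a list of (weight, class_probabilities) pairs, where the
    weight p(c_i) is the fraction of the parent's instances falling into c_i.
    """
    return entropy(parent_probs) - sum(w * entropy(p) for w, p in children)
```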
Information Gain
person id age>50 gender residence balance
mortgage
payment
123213 N F own 52000 delayed
17824 Y M own -3000 OK
232897 N F rent 70000 delayed
288822 Y M other 30000 delayed
…. …. …. …. …. ….
24KSE 521
Information Gain
(Figure; legend: delay, OK)
Information Gain
Entropy(parent)
= −[p(c1) × log2 p(c1) + p(c2) × log2 p(c2)]
= −[0.53 × (−0.9) + 0.47 × (−1.1)]
= 0.99 (very impure!)
Left child: entropy(Balance < 50K)
= −[0.92 × (−0.12) + 0.08 × (−3.7)]
= 0.39
Right child: entropy(Balance ≥ 50K)
= −[0.24 × (−2.1) + 0.76 × (−0.39)]
= 0.79
Information Gain
Entropy(parent)
= 0.99
Left child: entropy(Balance < 50K)
= 0.39
Right child: entropy(Balance ≥50K)
= 0.79
IG for the split based on the “balance” variable:
IG = entropy(parent) − [p(Balance < 50K) × entropy(Balance < 50K)
+ p(Balance ≥ 50K) × entropy(Balance ≥ 50K)]
= 0.99 − [0.43 × 0.39 + 0.57 × 0.79]
= 0.37
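Plugging the probabilities from this slide into the sketches above reproduces these numbers (up to rounding):

```python
parent = [0.53, 0.47]            # entropy ~0.99
left   = (0.43, [0.92, 0.08])    # Balance < 50K
right  = (0.57, [0.24, 0.76])    # Balance >= 50K

print(information_gain(parent, [left, right]))  # ~0.37
```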
Information Gain
entropy(parent) = 0.99
entropy(Residence = OWN) = 0.54
entropy(Residence = RENT) = 0.97
entropy(Residence = OTHER) = 0.98
IG = 0.13
So far…
We have measures of:
- Purity of the data (entropy)
- How informative a split by a variable is
We can identify and rank informative variables.
Next, we use this method to build our first supervised learning classifier: a decision tree.
Decision Trees
If we select multiple attributes, each giving some information gain, it’s not clear how to put them together → decision trees
The tree creates a segmentation of the data
- Each node in the tree contains a test of an attribute
- Each path eventually terminates at a leaf
- Each leaf corresponds to a segment, and the attributes and values along the path give its characteristics
- Each leaf contains a value for the target variable
Decision trees are often used as predictive models
How to build a decision tree (1/3)
Manually build the tree
- based on expert knowledge
- very time-consuming
- trees are sometimes corrupt (redundancy, contradictions, non-completeness, inefficiency)
Build the tree automatically by induction
- recursively partition the instances based on their attributes (divide-and-conquer)
- easy to understand
- relatively efficient
(Diagram: EXPERT vs. INDUCTION approaches to building the DECISION TREE from the DWH; generated/sample elementary rules, heuristic or enumerative)
How to build a decision tree (2/3)
Recursively apply attribute selection to find the best attribute to partition the data set
The goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible w.r.t. the target variable
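A minimal sketch of this recursive, divide-and-conquer induction for categorical attributes. It assumes each instance is a plain Python dict; the function and variable names are illustrative, not a reference implementation from the course:

```python
import math
from collections import Counter

def group_entropy(rows, target):
    """Entropy of the target variable within one group of instances."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def build_tree(rows, attributes, target):
    """At each step, split on the attribute whose partition leaves the lowest
    weighted child entropy (equivalently, the highest information gain),
    then recurse on each subgroup."""
    if len({row[target] for row in rows}) == 1 or not attributes:
        # Pure group, or no attributes left: return the majority class (a leaf).
        return Counter(row[target] for row in rows).most_common(1)[0][0]

    def weighted_child_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row)
        return sum(len(g) / len(rows) * group_entropy(g, target) for g in groups.values())

    best = min(attributes, key=weighted_child_entropy)
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining, target)
                   for value in {row[best] for row in rows}}}
```

Called on the dataset shown on the next slides, e.g. build_tree(data, ["age>50", "gender", "residence", "balance>=50K"], "delay"), it returns nested dicts that mirror the tree structure built step by step below.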
How to build a decision tree (3/3)
Dataset
Based on this dataset we will build a tree-based classifier.
Person ID   Age>50   Gender   Residence   Balance≥50,000   Mortgage payment delay
123213      N        F        own         N                delayed
17824       Y        M        own         Y                OK
232897      N        F        rent        N                delayed
288822      Y        M        other       N                delayed
...         ...      ...      ...         ...              ...
Tree Structure
All customers
├─ Balance ≥ 50,000
│   ├─ Residence = Own → OK
│   ├─ Residence = Rent → OK
│   └─ Residence = Other → Delay
└─ Balance < 50,000
    ├─ Age ≥ 50 → OK
    └─ Age < 50 → Delay
Tree Structure
All customers
(14 Delay,16 OK)
Tree Structure
All customers
(14 Delay,16 OK)
(Bar chart: information gain of cust id, age, gender, residence, and balance)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Age ≥ 50
OK
(1 delay,2 OK)
Age<50
Delay
(11 delay,0 OK)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Age ≥ 50
OK
(1 delay,2 OK)
Age<50
Delay
(11 delay,0 OK)
(Bar chart: information gain of cust id, age, gender, and residence)
Tree Structure
All customers (14 delay, 16 OK)
├─ Balance ≥ 50,000 (4 delay, 12 OK)
│   ├─ Residence = Own → OK (0 delay, 5 OK)
│   ├─ Residence = Rent → OK (1 delay, 5 OK)
│   └─ Residence = Other → Delay (3 delay, 2 OK)
└─ Balance < 50,000 (12 delay, 2 OK)
    ├─ Age ≥ 50 → OK (1 delay, 2 OK)
    └─ Age < 50 → Delay (11 delay, 0 OK)
Tree Structure
All customers
├─ Balance ≥ 50,000
│   ├─ Residence = Own → OK
│   ├─ Residence = Rent → OK
│   └─ Residence = Other → Delay
└─ Balance < 50,000
    ├─ Age ≥ 50 → OK
    └─ Age < 50 → Delay
ID      Age>50   Gender   Residence   Balance≥50K   Delay
87594   Y        F        own         <50K          ???
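The finished tree can be read as a set of nested rules. A small sketch of applying it to the new customer above; the field names and the concrete balance value below 50,000 are illustrative:

```python
def classify(customer):
    """The tree above, written out as nested if/else tests."""
    if customer["balance"] >= 50000:
        return {"own": "OK", "rent": "OK", "other": "Delay"}[customer["residence"]]
    return "OK" if customer["age"] >= 50 else "Delay"

# Customer 87594: Balance < 50,000 and Age > 50, so the tree predicts OK.
print(classify({"balance": 20000, "residence": "own", "age": 55}))  # OK
```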
"Discretize" numeric attributes by split points
How to choose the split points that provide the highest
information gain?
Segmentation for regression problems
Information gain is not the right measure
We need a measure of purity for numeric values
Look at reduction of VARIANCE
To create the best segmentation given a numeric t
arget, we might choose the one that produces the
best weighted average variance reduction
44
Information gain for numeric attributes
KSE 521
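A minimal sketch of the weighted variance-reduction idea for a numeric target, in analogy with information gain; the function names are my own:

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent_values, children_values):
    """variance(parent) minus the weighted average variance of the children."""
    n = len(parent_values)
    weighted = sum(len(child) / n * variance(child) for child in children_values)
    return variance(parent_values) - weighted

# A split that separates low and high target values gives a large reduction.
print(variance_reduction([1, 2, 2, 9, 10, 11], [[1, 2, 2], [9, 10, 11]]))
```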
Open Issues
(The full tree, with the instance counts at each node, as shown above.)
Probability Estimation (1/3)
We often need a more informative prediction than just a classification
- E.g., allocate your budget to the instances with the highest expected loss
- More sophisticated decision-making processes
Classification may oversimplify the problem
- E.g., if all segments have a probability of < 0.5 for write-off, every leaf will be labeled “not write-off”
We would like each segment (leaf) to be assigned an estimate of the probability of membership in the different classes
→ Probability estimation tree
Probability Estimation (2/3)
Tree induction can easily produce probability estimation trees instead of simple classification trees
Instance counts at each leaf provide class probability estimates
Frequency-based estimate of class membership: if a leaf contains n positive and m negative instances, the probability of any new instance being positive may be estimated as n / (n + m).
This approach may be too optimistic for segments with a very small number of instances (→ overfitting)
Smoothed version of the frequency-based estimate by the Laplace correction, which moderates the influence of leaves with only a few instances:
p(c) = (n + 1) / (n + m + 2), with n the number of instances that belong to class c and m the number of instances not belonging to class c.
Probability Estimation (3/3)
(Figure: effect of the Laplace correction on several class ratios as the number of instances increases – 2/3, 3/4, 4/5)
Example:
A leaf of the classification tree that has 2 positive instances and no negative instances would produce the same frequency-based estimate (p = 1) as a leaf node with 20 positives and no negatives.
The Laplace correction smooths the estimate of the first leaf down to p = 0.75 to reflect this uncertainty, but it has much less effect on the leaf with 20 instances (p ≈ 0.95).
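A tiny sketch reproducing the two estimates from this example; the function names are my own:

```python
def frequency_estimate(n, m):
    """Raw frequency-based estimate: n instances of class c out of n + m."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate p(c) = (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

print(frequency_estimate(2, 0), laplace_estimate(2, 0))    # 1.0 0.75
print(frequency_estimate(20, 0), laplace_estimate(20, 0))  # 1.0 ~0.95
```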
Example - The Churn Problem (1/3)
Solve the churn problem by tree induction
- Historical data set of 20,000 customers
- Each customer either had stayed with the company or left
- Customers are described by the following variables:
We want to use this data to predict which new customers are going to churn.
The Churn Problem (2/3)
How good is each of these variables individually?
- Measure the information gain of each variable
- Compute information gain for each variable independently
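A sketch of this per-variable ranking step, reusing the group_entropy helper from the induction sketch above; the helper names are illustrative:

```python
def info_gain(rows, attr, target):
    """Information gain of a single attribute with respect to the target."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    weighted = sum(len(g) / len(rows) * group_entropy(g, target) for g in groups.values())
    return group_entropy(rows, target) - weighted

def rank_attributes(rows, attributes, target):
    """Return the attributes ordered from most to least informative."""
    return sorted(attributes, key=lambda a: info_gain(rows, a, target), reverse=True)
```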
The Churn Problem (3/3)
The highest information gain feature (HOUSE) is at the root of the tree.
Why is the order of features chosen for the tree different from the ranking?
When to stop building the tree?
How do we know that this is a good model?