Introduction to Predictive Modeling:
From Correlation to Supervised
Segmentation
KSE 521
Topic 4
Mun Yi
Agenda
Basic Terminology
Supervised Segmentation
Entropy and Information Gain
Tree-Based Classifier
Probability Estimation
Quotes on Prediction
Prediction is very difficult, especially if it’s about the future. – Niels Bohr, Physicist
A little prediction goes a long way. – Eric Siegel, Author of Predictive Analytics
An organization’s ability to learn, and translate that learning into action rapidly, is the ultimate competitive advantage. – Jack Welch, Former Chairman of GE
Gartner Hype Cycle
(Figure)
Introduction
Key concept of BI: Predictive modeling
Supervised segmentation: how can we segment the population with respect to something that we would like to predict or estimate?
- “Which customers are likely to leave the company when their contracts expire?”
- “Which potential customers are likely not to pay off their account balances?”
Technique: Find or select important, informative variables / attributes of the entities w.r.t. a target
- Are there one or more other variables that reduce our uncertainty about the value of the target?
- Select informative subsets in large databases
Models and Induction
A model is a simplified representation of reality created to serve a purpose
A predictive model is a formula for estimating the unknown value of interest: the target
- Classification / class-probability estimation and regression models
- Prediction = estimate an unknown value
- Credit scoring, spam filtering, fraud detection
Descriptive modeling: gain insight into the underlying phenomenon or process
Data Terminology
Person ID   Age   Gender   Income   Balance   Mortgage payment
123213      32    F        25000    32000     Y
17824       49    M        12000    -3000     N
232897      60    F        8000     1000      Y
288822      28    M        9000     3000      Y
...         ...   ...      ...      ...       ...
Dataset: “the data table” (flat file)
Variables (columns): attributes, features, explanatory or independent variables
Target variable: the dependent variable (here, Mortgage payment)
Records: (data) instances (rows)
(17824, 49, M, 12000, -3000) is a feature vector
Finding Informative Attributes
Are there one or more other variables that reduce our uncertainty about the value of the target variable?
(Same dataset table as above.)
Main Questions
How can we judge whether a variable
contains important information about the
target variable?
How can we (automatically) obtain a
selection of the more informative variables
with respect to predicting the value of the
target variable?
Even better, can we obtain the ranking of the
variables?
Example - A Set of People to be Classified
Attributes:
- head-shape: square, circular
- body-shape: rectangular, oval
- body-color: black, white
Target variable: Yes, No
Selecting Informative Attributes
Which attribute is the most informative? Or the most useful for distinguishing between data instances?
If we split our data according to this variable, we would like the resulting groups to be as pure as possible.
By pure we mean homogeneous with respect to the target variable.
If every member of a group has the same value for the target, then the group is totally pure.
Example
If this is our entire dataset:
Then, we can obtain two pure groups by splitting according to body shape:
Concerns
Attributes rarely split a group perfectly.
Even if one subgroup happens to be pure, the other may not.
Is a very small, pure group a good thing?
How should continuous and categorical attributes be handled?
Entropy and Information Gain
Target variable has two (or more) categories: 1, 2, …, m
- Probability p1 for category 1
- Probability p2 for category 2
- …
Entropy:
H(X) = −[p1 × log2(p1) + p2 × log2(p2) + … + pm × log2(pm)]
Entropy
H(X) = −[p1 × log2(p1) + … + pm × log2(pm)]
H(X) = −[0.5 × log2(0.5) + 0.5 × log2(0.5)] = 1
H(X) = −[0.75 × log2(0.75) + 0.25 × log2(0.25)] ≈ 0.81
H(X) = −[1 × log2(1)] = 0
Entropy
(Figure)
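To make the formula concrete, here is a minimal Python sketch of the entropy calculation; the function name and the use of a plain list of class probabilities are choices made for this sketch, not notation from the slides:

```python
import math

def entropy(probabilities):
    """H(X) = -[p1*log2(p1) + ... + pm*log2(pm)] over the class probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.75, 0.25]))  # ~0.81
print(entropy([1.0]))         # 0.0 (a totally pure group)
```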
Information Gain
Calculation of information gain (IG):
IG(parent, children) = entropy(parent) − [p(c1) × entropy(c1) + p(c2) × entropy(c2) + …]
Note: Higher IG indicates a more informative split by the variable.
(Diagram: a parent node split into child nodes c1, c2, …)
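A matching sketch of the IG formula, reusing the entropy helper above; the (weight, probabilities) representation of the children is an assumption of this sketch:

```python
def information_gain(parent_probs, children):
    """IG = entropy(parent) - sum of p(c_i) * entropy(c_i) over the children.

    `children` is a list of (weight, class_probabilities) pairs, where the
    weight p(c_i) is the fraction of the parent's instances falling into c_i.
    """
    return entropy(parent_probs) - sum(w * entropy(p) for w, p in children)
```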
Information Gain
person id age>50 gender residence balance
mortgage
payment
123213 N F own 52000 delayed
17824 Y M own -3000 OK
232897 N F rent 70000 delayed
288822 Y M other 30000 delayed
…. …. …. …. …. ….
24KSE 521
Information Gain
(Figure; legend: delay, OK)
Information Gain
Entropy(parent)
= −[p(c1) × log2 p(c1) + p(c2) × log2 p(c2)]
= −[0.53 × (−0.9) + 0.47 × (−1.1)]
= 0.99 (very impure!)
Left child: entropy(Balance < 50K)
= −[0.92 × (−0.12) + 0.08 × (−3.7)]
= 0.39
Right child: entropy(Balance ≥ 50K)
= −[0.24 × (−2.1) + 0.76 × (−0.39)]
= 0.79
Information Gain
Entropy(parent)
= 0.99
Left child: entropy(Balance < 50K)
= 0.39
Right child: entropy(Balance ≥50K)
= 0.79
IG for the split based on the “balance” variable:
IG = entropy(parent) − [p(Balance < 50K) × entropy(Balance < 50K)
+ p(Balance ≥ 50K) × entropy(Balance ≥ 50K)]
= 0.99 − [0.43 × 0.39 + 0.57 × 0.79]
= 0.37
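Plugging the probabilities from this slide into the sketches above reproduces these numbers (up to rounding):

```python
parent = [0.53, 0.47]            # entropy ~0.99
left   = (0.43, [0.92, 0.08])    # Balance < 50K
right  = (0.57, [0.24, 0.76])    # Balance >= 50K

print(information_gain(parent, [left, right]))  # ~0.37
```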
Information Gain
entropy(parent) = 0.99
entropy(Residence = OWN) = 0.54
entropy(Residence = RENT) = 0.97
entropy(Residence = OTHER) = 0.98
IG = 0.13
So far…
We have measures of:
- Purity of the data (entropy)
- How informative a split by a variable is
We can identify and rank informative variables.
Next, we use this method to build our first supervised learning classifier: a decision tree.
Decision Trees
If we select multiple attributes, each giving some information gain, it’s not clear how to put them together → decision trees
The tree creates a segmentation of the data
- Each node in the tree contains a test of an attribute
- Each path eventually terminates at a leaf
- Each leaf corresponds to a segment, and the attributes and values along the path give its characteristics
- Each leaf contains a value for the target variable
Decision trees are often used as predictive models
How to build a decision tree (1/3)
Manually build the tree
- based on expert knowledge
- very time-consuming
- trees are sometimes corrupt (redundancy, contradictions, non-completeness, inefficiency)
Build the tree automatically by induction
- recursively partition the instances based on their attributes (divide-and-conquer)
- easy to understand
- relatively efficient
(Diagram: EXPERT vs. INDUCTION approaches to building the DECISION TREE from the DWH; generated/sample elementary rules, heuristic or enumerative)
How to build a decision tree (2/3)
Recursively apply attribute selection to find the best attribute to partition the data set
The goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible w.r.t. the target variable
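A minimal sketch of this recursive, divide-and-conquer induction for categorical attributes. It assumes each instance is a plain Python dict; the function and variable names are illustrative, not a reference implementation from the course:

```python
import math
from collections import Counter

def group_entropy(rows, target):
    """Entropy of the target variable within one group of instances."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def build_tree(rows, attributes, target):
    """At each step, split on the attribute whose partition leaves the lowest
    weighted child entropy (equivalently, the highest information gain),
    then recurse on each subgroup."""
    if len({row[target] for row in rows}) == 1 or not attributes:
        # Pure group, or no attributes left: return the majority class (a leaf).
        return Counter(row[target] for row in rows).most_common(1)[0][0]

    def weighted_child_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row)
        return sum(len(g) / len(rows) * group_entropy(g, target) for g in groups.values())

    best = min(attributes, key=weighted_child_entropy)
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining, target)
                   for value in {row[best] for row in rows}}}
```

Called on the dataset shown on the next slides, e.g. build_tree(data, ["age>50", "gender", "residence", "balance>=50K"], "delay"), it returns nested dicts that mirror the tree structure built step by step below.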
How to build a decision tree (3/3)
Dataset
Based on this dataset we will build a tree-based classifier.
Person ID   Age>50   Gender   Residence   Balance≥50,000   Mortgage payment delay
123213      N        F        own         N                delayed
17824       Y        M        own         Y                OK
232897      N        F        rent        N                delayed
288822      Y        M        other       N                delayed
...         ...      ...      ...         ...              ...
Tree Structure
All customers
├─ Balance ≥ 50,000
│   ├─ Residence = Own → OK
│   ├─ Residence = Rent → OK
│   └─ Residence = Other → Delay
└─ Balance < 50,000
    ├─ Age ≥ 50 → OK
    └─ Age < 50 → Delay
Tree Structure
All customers
(14 Delay,16 OK)
Tree Structure
All customers
(14 Delay,16 OK)
(Bar chart: information gain of cust id, age, gender, residence, and balance)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Age ≥ 50
OK
(1 delay,2 OK)
Age<50
Delay
(11 delay,0 OK)
Tree Structure
All customers
(14 Delay,16 OK)
Balance≥50,000
(4 delay, 12 OK)
Balance<50,000
(2 OK, 12 delay)
Age ≥ 50
OK
(1 delay,2 OK)
Age<50
Delay
(11 delay,0 OK)
(Bar chart: information gain of cust id, age, gender, and residence)
Tree Structure
All customers (14 delay, 16 OK)
├─ Balance ≥ 50,000 (4 delay, 12 OK)
│   ├─ Residence = Own → OK (0 delay, 5 OK)
│   ├─ Residence = Rent → OK (1 delay, 5 OK)
│   └─ Residence = Other → Delay (3 delay, 2 OK)
└─ Balance < 50,000 (12 delay, 2 OK)
    ├─ Age ≥ 50 → OK (1 delay, 2 OK)
    └─ Age < 50 → Delay (11 delay, 0 OK)
Tree Structure
All customers
├─ Balance ≥ 50,000
│   ├─ Residence = Own → OK
│   ├─ Residence = Rent → OK
│   └─ Residence = Other → Delay
└─ Balance < 50,000
    ├─ Age ≥ 50 → OK
    └─ Age < 50 → Delay
ID      Age>50   Gender   Residence   Balance≥50K   Delay
87594   Y        F        own         <50K          ???
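The finished tree can be read as a set of nested rules. A small sketch of applying it to the new customer above; the field names and the concrete balance value below 50,000 are illustrative:

```python
def classify(customer):
    """The tree above, written out as nested if/else tests."""
    if customer["balance"] >= 50000:
        return {"own": "OK", "rent": "OK", "other": "Delay"}[customer["residence"]]
    return "OK" if customer["age"] >= 50 else "Delay"

# Customer 87594: Balance < 50,000 and Age > 50, so the tree predicts OK.
print(classify({"balance": 20000, "residence": "own", "age": 55}))  # OK
```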
"Discretize" numeric attributes by split points
How to choose the split points that provide the highest
information gain?
Segmentation for regression problems
Information gain is not the right measure
We need a measure of purity for numeric values
Look at reduction of VARIANCE
To create the best segmentation given a numeric t
arget, we might choose the one that produces the
best weighted average variance reduction
44
Information gain for numeric attributes
KSE 521
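A minimal sketch of the weighted variance-reduction idea for a numeric target, in analogy with information gain; the function names are my own:

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent_values, children_values):
    """variance(parent) minus the weighted average variance of the children."""
    n = len(parent_values)
    weighted = sum(len(child) / n * variance(child) for child in children_values)
    return variance(parent_values) - weighted

# A split that separates low and high target values gives a large reduction.
print(variance_reduction([1, 2, 2, 9, 10, 11], [[1, 2, 2], [9, 10, 11]]))
```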
Open Issues
(The full tree, with the instance counts at each node, as shown above.)
Probability Estimation (1/3)
We often need a more informative prediction than just a classification
- E.g., allocate your budget to the instances with the highest expected loss
- More sophisticated decision-making processes
Classification may oversimplify the problem
- E.g., if all segments have a probability of < 0.5 for write-off, every leaf will be labeled “not write-off”
We would like each segment (leaf) to be assigned an estimate of the probability of membership in the different classes
→ Probability estimation tree
Probability Estimation (2/3)
Tree induction can easily produce probability estimation trees instead of simple classification trees
Instance counts at each leaf provide class probability estimates
Frequency-based estimate of class membership: if a leaf contains n positive and m negative instances, the probability of any new instance being positive may be estimated as n / (n + m).
This approach may be too optimistic for segments with a very small number of instances (→ overfitting)
Smoothed version of the frequency-based estimate by the Laplace correction, which moderates the influence of leaves with only a few instances:
p(c) = (n + 1) / (n + m + 2), with n the number of instances that belong to class c and m the number of instances not belonging to class c.
Probability Estimation (3/3)
(Figure: effect of the Laplace correction on several class ratios as the number of instances increases – 2/3, 3/4, 4/5)
Example:
A leaf of the classification tree that has 2 positive instances and no negative instances would produce the same frequency-based estimate (p = 1) as a leaf node with 20 positives and no negatives.
The Laplace correction smooths the estimate of the first leaf down to p = 0.75 to reflect this uncertainty, but it has much less effect on the leaf with 20 instances (p ≈ 0.95).
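A tiny sketch reproducing the two estimates from this example; the function names are my own:

```python
def frequency_estimate(n, m):
    """Raw frequency-based estimate: n instances of class c out of n + m."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate p(c) = (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

print(frequency_estimate(2, 0), laplace_estimate(2, 0))    # 1.0 0.75
print(frequency_estimate(20, 0), laplace_estimate(20, 0))  # 1.0 ~0.95
```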
Example - The Churn Problem (1/3)
Solve the churn problem by tree induction
- Historical data set of 20,000 customers
- Each customer either had stayed with the company or left
- Customers are described by the following variables:
We want to use this data to predict which new customers are going to churn.
The Churn Problem (2/3)
How good is each of these variables individually?
- Measure the information gain of each variable
- Compute information gain for each variable independently
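A sketch of this per-variable ranking step, reusing the group_entropy helper from the induction sketch above; the helper names are illustrative:

```python
def info_gain(rows, attr, target):
    """Information gain of a single attribute with respect to the target."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    weighted = sum(len(g) / len(rows) * group_entropy(g, target) for g in groups.values())
    return group_entropy(rows, target) - weighted

def rank_attributes(rows, attributes, target):
    """Return the attributes ordered from most to least informative."""
    return sorted(attributes, key=lambda a: info_gain(rows, a, target), reverse=True)
```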
The Churn Problem (3/3)
The highest information gain feature (HOUSE) is at the root of the tree.
Why is the order of features chosen for the tree different from the ranking?
When to stop building the tree?
How do we know that this is a good model?