Zhangxi Lin
Texas Tech University
ISQS 7342-001, Business Analytics

Lecture Notes 2: Descriptive, Predictive, and Explanatory Analysis
Outline
- Context-Based Analysis
- Decision Tree Algorithms

Context-Based Analysis

From Descriptive to Explanatory Use of Data
Three types of data analysis with decision trees:
- Descriptive analysis describes the data or a relationship among various data elements in the data set.
- Predictive use of data, in addition, asserts that this relationship will hold over time and with new data.
- Explanatory use of data describes a relationship and attempts to show, by reference to the data, the effect and interpretation of that relationship.
The rigor of the data work and the task organization step up as one moves from descriptive to explanatory use.
Showing Context
Decision trees can display contextual effects – hot spots and soft spots in the relationships that characterize the data.
One intuitively knows the importance of these contextual effects, but finds it difficult to understand the context because of the inherent difficulty of capturing and describing the complexity of the contributing factors.
Terms:
- Antecedents (shown at the first level of the split): factors or effects that are at the base of a chain of events or relationships.
- Intervening factors (shown at the second or lower levels of the split): factors that come between the ordering established by the antecedents and the outcome. Intervening factors can interact with antecedents or with other intervening factors to produce an interactive effect.
Interactive effects are an important dimension of discussions about decision trees and are explained more fully later.
Simpson’s Paradox
Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox in which the success rates of several groups appear reversed when the groups are combined. The result is often encountered in social-science and medical-science statistics, and it occurs when frequency data are hastily given a causal interpretation; the paradox disappears when causal relations are derived systematically through formal analysis.
Edward H. Simpson described the phenomenon in 1951; similar effects had been noted earlier by Karl Pearson et al. and by Udny Yule in 1903. The name Simpson's paradox was coined by Colin R. Blyth in 1972. Because Simpson did not discover this statistical paradox, some authors instead use the impersonal names reversal paradox and amalgamation paradox for what is now called Simpson's paradox or the Yule-Simpson effect.
- Source: http://en.wikipedia.org/wiki/Simpson's_paradox
Reference: Colin R. Blyth, "On Simpson's Paradox and the Sure-Thing Principle," Journal of the American Statistical Association, Vol. 67, No. 338 (Jun. 1972), pp. 364-366
Simpson’s Paradox - Example
Lisa and Bart each edit Wikipedia articles for two weeks. In the first week, Lisa improves 60% of the 100 articles she edits, and Bart improves 90% of the 10 articles he edits. In the second week, Lisa improves only 10% of the 10 articles she edits, while Bart improves 30% of the 100 articles he edits.
In each week, Bart improved a higher percentage of the articles he edited than Lisa did. Yet when the two weeks are combined using a weighted average, Lisa improved a much higher overall percentage than Bart, because most of her edits fell in the week when her success rate was high (see the table and the sketch below).
- Source: wikipedia.org
        Week 1   Week 2   Total
Lisa    60/100   1/10     61/110
Bart    9/10     30/100   39/110
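A quick way to see the reversal is to recompute the rates directly. The following sketch (Python; the figures are taken from the table above) prints each editor's weekly and combined improvement rates.

```python
# Figures from the Lisa/Bart example: (improved, edited) per week.
weeks = {
    "Lisa": [(60, 100), (1, 10)],
    "Bart": [(9, 10), (30, 100)],
}

for editor, results in weeks.items():
    weekly = [improved / edited for improved, edited in results]
    total_improved = sum(improved for improved, _ in results)
    total_edited = sum(edited for _, edited in results)
    print(editor,
          [f"{rate:.0%}" for rate in weekly],
          f"overall {total_improved / total_edited:.0%}")

# Bart wins each week (90% vs 60%, 30% vs 10%), yet Lisa wins overall (55% vs 35%):
# the combined rate is a weighted average dominated by each editor's larger week.
```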
The Effect of Context
Segmentation makes a difference.
Demonstration: http://zlin.ba.ttu.edu/pwGrad/7342/AID-Example.xlsx
Decision Tree Algorithms
- AID (Automatic Interaction Detection), 1969, by Morgan and Sonquist
- CHAID (CHi-squared AID), 1975, by Gordon V. Kass
- XAID, 1982, by Gordon V. Kass
- CRT (or CART, Classification and Regression Trees), 1984, by Breiman et al.
- QUEST (Quick, Unbiased and Efficient Statistical Tree), 1997, by Wei-Yin Loh and Yu-Shan Shih
- CLS, 1966, by Hunt et al.
- ID3 (Iterative Dichotomizer 3), 1983, by Ross Quinlan
- C4.5 and C5.0, by Ross Quinlan
- SLIQ
AID
AID stands for Automatic Interaction Detector. It is a statistical technique for multivariate analysis that:
- can be used to determine the characteristics that differentiate buyers from nonbuyers;
- involves a successive series of analytical steps that gradually focus on the critical determinants of behavior, creating clusters of people with similar demographic characteristics and buying behavior.
The technique is explained in John A. Sonquist and James N. Morgan, The Detection of Interaction Effects, University of Michigan, Monograph No. 35, 1969.
AID – Interaction with Multicollinearity
[Scatter plot: Saving vs. Income, with the points split by gender; multicollinearity with an interaction effect.]
AID – Multicollinearity without Interaction
[Scatter plot: Saving vs. Income, with the points split by gender; multicollinearity without an interaction effect.]
Simpson's Paradox
Simpson's paradox for continuous data: a positive trend appears within each of two separate groups (blue and red), while a negative trend (black, dashed) appears when the data are combined.
- Source: Wikipedia.org
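The slope reversal is easy to reproduce numerically. The sketch below (Python with NumPy; the two groups and their values are made up purely for illustration) fits a straight line within each group and then to the pooled data.

```python
import numpy as np

# Two hypothetical groups: within each group the target rises with x,
# but the group with the larger x values sits uniformly lower.
x_blue = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_blue = 8.0 + 0.5 * x_blue
x_red = np.array([6.0, 7.0, 8.0, 9.0, 10.0])
y_red = 0.5 * x_red

for name, x, y in [("blue", x_blue, y_blue), ("red", x_red, y_red)]:
    slope = np.polyfit(x, y, 1)[0]
    print(f"{name} group slope: {slope:+.2f}")        # +0.50 within each group

pooled_slope = np.polyfit(np.concatenate([x_blue, x_red]),
                          np.concatenate([y_blue, y_red]), 1)[0]
print(f"pooled slope: {pooled_slope:+.2f}")           # negative once the groups are combined
```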
AID
Use a decision tree to search through the many factors that may influence a relationship, so that the final results presented are accurate:
- Partition the data set in terms of selected attributes (the split-selection idea is sketched below)
- Run a regression on each subset of the data
Features:
- Capability of dealing with multicollinearity (with or without interaction)
- Addresses the problem of hidden relationships
Morgan and Sonquist's notes:
- An intervening effect is due to an interaction, e.g., between the customer segment and the effect of the promotional program on retention; it can obscure the relationship.
- A decision tree can account for 2/3 of the variability in some relationships, while regression accounts for only 1/3.
- Decision trees perform well with strong categorical, non-linear effects, but are inefficient at packaging the predictive effects of generally linear relationships.
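AID grows its tree by choosing, at each node, the binary split that most reduces the variance (sum of squared errors) of the target. A minimal sketch of that split-selection idea (Python; the saving/gender/region data and the candidate splits are hypothetical):

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def variance_reduction(y, mask):
    """Reduction in SSE obtained by splitting y into y[mask] and y[~mask]."""
    return sse(y) - sse(y[mask]) - sse(y[~mask])

# Hypothetical data: saving (target) with gender and region as candidate predictors.
saving = np.array([1.0, 1.2, 0.9, 3.1, 3.4, 2.8, 1.1, 3.0])
gender = np.array(["F", "F", "F", "M", "M", "M", "F", "M"])
region = np.array(["N", "S", "N", "S", "N", "S", "S", "N"])

candidates = {
    "gender == 'M'": gender == "M",
    "region == 'S'": region == "S",
}
for name, mask in candidates.items():
    print(f"{name:15s} variance reduction = {variance_reduction(saving, mask):.3f}")

best = max(candidates, key=lambda name: variance_reduction(saving, candidates[name]))
print("AID-style first split:", best)
```

After the split, one would run a separate regression within each resulting subset, as the slide describes.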
Imperfections of AID
- Untrue relationships: because the algorithm looks through so many potential groupings of values, it is more likely to find groupings that are actually anomalies in the data.
- Biased selection of inputs or predictors: the successive partitioning of the data set into bins quickly exhausts the number of observations available in the lower levels of the decision tree.
- Potential overfitting: AID does not know when to stop growing branches, and it forms splits at the lower extremities of the decision tree where few data records are actually available.
Remedies
- Use statistical tests to check the efficacy of each branch that is grown: CHAID and XAID, by Kass (1975, 1982).
- Use validation data to test any branch that is formed for reproducibility: CRT/CART, by Breiman et al. (1984).
CHAID Algorithm
CHAID, which stands for CHi-squared Automatic Interaction Detector, is a decision tree technique published in 1980 by Gordon V. Kass.
In practice, it is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables.
Advantage: its output is highly visual and easy to interpret. However, because it uses multiway splits by default, it needs rather large sample sizes to work effectively; with small sample sizes the respondent groups can quickly become too small for reliable analysis.
CHAID detects interactions between variables in the data set. Using this technique we can establish relationships between a dependent variable and explanatory variables such as price, size, supplements, etc.
CHAID is often used as an exploratory technique and is an alternative to multiple regression, especially when the data set is not well suited to regression analysis.
Terms
Bonferroni correction
- The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level used for each hypothesis separately should be 1/n times what it would be if only one hypothesis were tested. ("Statistically significant" simply means that a given result is unlikely to have occurred by chance.) A numeric sketch follows below.
- It was developed by the Italian mathematician Carlo Emilio Bonferroni.
Kass adjustment
- A p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-square target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.
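A minimal numeric illustration (Python; the counts and p-value are hypothetical): the Bonferroni correction divides the per-test significance level by the number of hypotheses, which is equivalent to multiplying each raw p-value by that number before comparing it with the overall level.

```python
alpha = 0.05     # desired overall significance level
n_tests = 12     # hypothetical number of hypotheses (e.g., candidate splits) tested

# Bonferroni: the per-test threshold is alpha / n.
per_test_alpha = alpha / n_tests
print(f"per-test significance level: {per_test_alpha:.4f}")

# Equivalent Kass-style view: inflate the raw p-value by the Bonferroni multiplier.
raw_p = 0.008
adjusted_p = min(1.0, raw_p * n_tests)
print(f"adjusted p-value: {adjusted_p:.3f}, significant: {adjusted_p < alpha}")
```

(In the actual Kass adjustment the multiplier reflects the number of ways the input categories can be grouped into branches, not simply the number of tests.)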
CHAID
Statistical tests
- A test of similarity is used to determine whether individual values of an input should be combined (illustrated with a chi-square test below).
- After similar values of an input have been combined according to the previous rule, tests of significance are used to select which inputs are significant descriptors of the target values and, if so, what their strengths are relative to other inputs.
CHAID addresses all of the problems in the AID approach:
- A statistical test is used to ensure that only relationships that are significantly different from random effects are identified.
- Statistical adjustments address the biased selection of variables as candidates for the branch partitions.
- Tree growth is terminated when the branch that would be produced fails the test of significance.
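CHAID's category-merging step can be illustrated with an ordinary chi-square test of independence. The sketch below (Python with SciPy; the counts are hypothetical) tests whether two categories of an input have similar response rates and could therefore be merged into one branch.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: two input categories (rows) x target outcome (columns).
#            respond  no-response
observed = [[30, 170],            # input category A
            [36, 164]]            # input category B

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p:.3f}")

# A large p-value (no significant difference between the categories' response rates)
# is the cue that the two categories may be merged into a single branch.
if p > 0.05:
    print("categories look similar -> candidates for merging")
```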
CRT (or CART) Algorithm
- Closely follows the original AID goal, but with improvements through the application of validation and cross-validation.
- It can identify the overfitting problem and verify the reproducibility of the decision tree structure using hold-out or validation data.
- Breiman et al. found that hold-out or validation data are not strictly necessary for this grow-and-compare method: a cross-validation method can be used by resampling the training data that is used to grow the decision tree (see the sketch below).
Advantages:
- Prevents growing a tree bigger than what passes the validation tests
- Makes it possible to use prior probabilities
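As a hedged, minimal sketch of the grow-then-prune idea, the code below uses scikit-learn's DecisionTreeClassifier (an optimized CART-style implementation): grow a full tree, enumerate its cost-complexity pruning path, and choose the complexity penalty by cross-validation. The dataset is just a convenient built-in placeholder.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # any labeled data set would do

# Candidate complexity penalties from the cost-complexity pruning path of a full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each candidate alpha; a larger alpha yields a smaller, more pruned tree.
best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(path.ccp_alphas.clip(min=0)):
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5)
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"chosen ccp_alpha = {best_alpha:.5f}, cross-validated accuracy = {best_score:.3f}")
```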
QUEST
QUEST (Quick, Unbiased and Efficient Statistical Tree) is a binary-split decision tree algorithm for classification and data mining developed by Wei-Yin Loh (University of Wisconsin-Madison) and Yu-Shan Shih (National Chung Cheng University, Taiwan).
The objective of QUEST is similar to that of CRT by Breiman, Friedman, Olshen and Stone (1984). The major differences:
- QUEST uses an unbiased variable selection technique by default
- QUEST uses imputation instead of surrogate splits to deal with missing values
- QUEST can easily handle categorical predictor variables with many categories
If there are no missing values in the data, QUEST can optionally use the CART algorithm to produce a tree with univariate splits.
CHAID, CRT, and QUEST
For classification-type problems (categorical dependent variable), all three algorithms can be used to build a tree for prediction. QUEST is generally faster than the other two algorithms; however, for very large datasets its memory requirements are usually larger.
For regression-type problems (continuous dependent variable), the QUEST algorithm is not applicable, so only CHAID and CRT can be used.
CHAID will build non-binary trees that tend to be "wider." This has made the CHAID method particularly popular in market research applications. CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable or dimension of the table. This type of display matches the requirements of market segmentation research well.
CRT will always yield binary trees, which sometimes cannot be summarized as efficiently for interpretation and/or presentation.
Machine Learning
A general way of describing computer-mediated methods of learning or developing knowledge.
- Began as an academic discipline
- Often associated with using computers to simulate or reproduce intelligent behavior
Machine learning and business analytics share a common goal: in order to behave with intelligence, it is necessary to acquire intelligence and to refine it over time.
The development of decision trees to form rules is called rule induction in the machine learning literature.
- Induction is the process of developing general laws on the basis of an examination of particular cases.
ID3 Algorithm
ID3 (Iterative Dichotomizer 3), invented by Ross Quinlan in 1983, is an algorithm used to generate a decision tree.
The algorithm is based on Occam's razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy.
The ID3 algorithm can be summarized as follows:
- Take all unused attributes and compute their entropy with respect to the samples at the current node
- Choose the attribute for which the entropy is minimum, i.e., the information gain is maximum (see the sketch below)
- Make a node containing that attribute
Reference: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html
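A minimal sketch of the entropy-based attribute choice (Python; the toy outlook/wind data set is hypothetical). The attribute whose split yields the lowest weighted entropy of the target, equivalently the highest information gain, becomes the next node.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(rows, attr, target):
    """Weighted entropy of the target after splitting rows on attr."""
    total = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        total += len(subset) / len(rows) * entropy(subset)
    return total

# Hypothetical training data: play tennis depending on outlook and wind.
data = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
]

for attr in ("outlook", "wind"):
    print(attr, round(split_entropy(data, attr, "play"), 3))
# ID3 picks the attribute with the lowest split entropy ("outlook" here) as the node.
```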
Occam's Razor
“One should not increase, beyond what is necessary, the number of entities required to explain anything”
Occam's razor is a logical principle attributed to the mediaeval philosopher William of Occam. The principle states that one should not make more assumptions than the minimum needed.
This principle is often called the principle of parsimony. It underlies all scientific modeling and theory building.
It admonishes us to choose, from a set of otherwise equivalent models of a given phenomenon, the simplest one. In any given model, Occam's razor helps us to "shave off" those concepts, variables or constructs that are not really needed to explain the phenomenon. By doing so, developing the model becomes much easier, and there is less chance of introducing inconsistencies, ambiguities and redundancies.
ID3 Algorithm
ID3(Examples, Target_Attribute, Attributes):
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root
    with label = the most common value of the target attribute in the examples.
  Otherwise:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty:
        Below this new branch, add a leaf node with label = the most common
          target value in the examples.
      Else:
        Below this new branch, add the subtree
          ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  Return Root.
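For concreteness, a compact Python sketch of the same recursion (not Quinlan's original implementation; attribute selection here simply minimizes the weighted split entropy, as in the earlier sketch, and the toy data set is hypothetical). The tree is returned as nested dictionaries.

```python
import math
from collections import Counter

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # Base cases: a pure node, or no attributes left to split on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(attr):
        total = 0.0
        for v in {e[attr] for e in examples}:
            subset = [e[target] for e in examples if e[attr] == v]
            total += len(subset) / len(examples) * entropy(subset)
        return total

    # A = the attribute that best classifies the examples (lowest split entropy).
    best = min(attributes, key=split_entropy)
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Toy usage with the hypothetical data set from the earlier sketch.
data = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
]
print(id3(data, "play", ["outlook", "wind"]))
```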
C4.5
Features
- Builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
- Examines the normalized information gain (difference in entropy) that results from choosing an attribute to split the data (the gain ratio is sketched below).
- Can convert a decision tree into a rule set; an optimizer then goes through the rule set to reduce the redundancy of the rules.
- Can create fuzzy splits on interval inputs.
A few base cases:
- All the samples in the list belong to the same class. Once this happens, the algorithm simply creates a single leaf node for the decision tree.
- None of the features gives any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- It may also happen that there are no instances of a class. Then C4.5 creates a decision node higher up the tree using the expected value.
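C4.5's normalized information gain is the gain ratio: the information gain divided by the entropy of the split itself, which penalizes attributes with many distinct values. A minimal sketch (Python; the label and attribute lists are hypothetical):

```python
import math
from collections import Counter

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(attr_values, labels):
    """Information gain of splitting labels on attr_values, normalized by split info."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [lab for a, lab in zip(attr_values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    split_info = entropy(attr_values)   # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical example: an attribute with a unique value per record gets the same raw
# gain as a sensible two-valued attribute, but a much larger split info.
labels = ["yes", "yes", "no", "no", "yes", "no"]
coarse = ["a", "a", "b", "b", "a", "b"]        # two values, aligned with the labels
fine   = ["1", "2", "3", "4", "5", "6"]        # unique value per record
print("coarse gain ratio:", round(gain_ratio(coarse, labels), 3))   # 1.0
print("fine gain ratio:  ", round(gain_ratio(fine, labels), 3))     # ~0.39
```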
C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory; unsuitable for large data sets, which would need out-of-core sorting
Tutorial: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
YouTube: http://www.youtube.com/watch?v=8-vHunc4k8s
C4.5 vs. ID3
C4.5 made a number of improvements to ID3:
- Handling both continuous and discrete attributes. To handle continuous attributes, C4.5 creates a threshold and then splits the list into those records whose attribute value is above the threshold and those whose value is less than or equal to it (sketched below). [Quinlan, 96]
- Handling training data with missing attribute values. C4.5 allows attribute values to be marked as "?" for missing; missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation. C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.
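The continuous-attribute handling can be sketched as a threshold search: sort the distinct values, take the midpoints between consecutive values as candidate thresholds, and keep the one with the highest information gain. A minimal illustration (Python; the income/label data are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Candidate thresholds are midpoints between consecutive distinct values."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, -1.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = base - len(left) / len(labels) * entropy(left) \
                    - len(right) / len(labels) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical continuous attribute (income) with a buy/no-buy label.
income = [18, 22, 25, 40, 43, 50]
label = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(income, label))   # a threshold near 32.5 separates the classes
```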
C5.0
Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially.
C5.0 offers a number of improvements over C4.5:
- Speed: C5.0 is significantly faster than C4.5 (several orders of magnitude)
- Memory usage: C5.0 is more memory efficient than C4.5
- Smaller decision trees: C5.0 gets similar results to C4.5 with considerably smaller decision trees
- Support for boosting: improves the trees and gives them more accuracy
- Weighting: allows you to weight different attributes and misclassification types
- Winnowing: automatically winnows the data to help reduce noise
C5.0/See5 is a commercial and closed-source product, although free source code is available for interpreting and using the decision trees and rule sets it outputs.
Is See5/C5.0 better than C4.5? http://www.rulequest.com/see5-comparison.html
SLIQ
- A decision tree classifier that can handle both numerical and categorical attributes
- It builds compact and accurate trees
- It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
- It is suitable for classification of large disk-resident data sets, independently of the number of classes, attributes and records
- The Gini index is used to evaluate the "goodness" of the alternative splits for an attribute
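The Gini index used by SLIQ (and by CART) measures the impurity of a node; a candidate split is scored by the weighted Gini impurity of its children. A minimal sketch (Python; the class counts are hypothetical):

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(children):
    """Weighted Gini impurity of a candidate split (list of child class-count lists)."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

# Hypothetical parent node with 40 'yes' and 60 'no' records, and two candidate splits.
print("parent gini: ", gini([40, 60]))                     # 0.48
print("split A gini:", split_gini([[35, 10], [5, 50]]))    # fairly pure children (~0.25)
print("split B gini:", split_gini([[20, 30], [20, 30]]))   # uninformative split (0.48)
# The split with the lower weighted Gini (split A) is judged the better split.
```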
The Evolution of DT Algorithms
Statistical decision tree lineage:
- AID (Morgan & Sonquist, 1969)
- CHAID (Kass, 1975; chi-square statistical tests)
- XAID (Kass, 1982)
- CRT (Breiman et al., 1984; validation)
- QUEST (Loh & Shih, 1997; Gini)

Rule induction / machine learning lineage:
- CLS (Hunt et al., 1966; entropy)
- ID3 (Quinlan, 1983; entropy)
- C4.5 (Quinlan, 1993)
- C5.0 (Quinlan, ?; commercial version)

http://www.gavilan.edu/research/reports/DMALGS.PDF
Features of the ARBORETUM Procedure
Tree branching criteria
- Variance reduction for interval targets
- F-test for interval targets
- Gini or entropy reduction for categorical targets
- Chi-square for nominal targets
Missing value handling
- Use missing values as a separate, but legitimate, code in the split search
- Assign missing values to the leaf that they most closely resemble
- Distribute missing observations across all branches
- Use surrogate, non-missing inputs to impute the distribution of missing values in the branch
Methods
- Cost-complexity pruning and reduced-error pruning
- Prior probabilities can be used in training or assessment
- Misclassification costs can be used to influence decisions and branch construction
- Interactive training mode can be used to produce branches and prune branches
Others
- SAS code generation
- PMML code generation
Questions
- What are the differences between predictive and explanatory analysis?
- Why are input variables distinguished as antecedents and intervening factors?
- What is the implication of Simpson's paradox for decision tree construction and explanation?
- What does interaction mean in the context of decision tree construction?
- What is the primary purpose of the AID algorithm?
- What are the weaknesses of the AID algorithm? How are these problems addressed by other algorithms?
- Summarize the main principles in decision tree algorithm design and implementation.