Zhangxi Lin
Texas Tech University
ISQS 7342-001, Business Analytics

Lecture Notes 2: Descriptive, Predictive, and Explanatory Analysis
Outline
- Context-Based Analysis
- Decision Tree Algorithms

Context-Based Analysis

From Descriptive to Explanatory Use of Data
Three types of data analysis with decision trees:
- Descriptive analysis describes the data or a relationship among various data elements in the data set.
- Predictive use of data, in addition, asserts that this relationship will hold over time and with new data.
- Explanatory use of data describes a relationship and attempts to show, by reference to the data, the effect and interpretation of that relationship.
The rigor of the data work and the task organization step up as one moves from descriptive to explanatory use.
Showing Context
Decision trees can display contextual effects – hot spots and soft spots in the relationships that characterize the data.
One intuitively knows the importance of these contextual effects, but finds it difficult to understand the context because of the inherent difficulty of capturing and describing the complexity of the contributing factors.
Terms:
- Antecedents (shown at the first level of the split): factors or effects that are at the base of a chain of events or relationships.
- Intervening factors (shown at the second or lower levels of the split): factors that come between the ordering established by the antecedents and the outcome. Intervening factors can interact with antecedents or with other intervening factors to produce an interactive effect.
Interactive effects are an important dimension of discussions about decision trees and are explained more fully later.
Simpson’s Paradox
Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox in which the success rates of several groups appear reversed when the groups are combined. The result is often encountered in social-science and medical-science statistics, and it occurs when frequency data are hastily given a causal interpretation; the paradox disappears when causal relations are derived systematically through formal analysis.
Edward H. Simpson described the phenomenon in 1951; similar effects had been noted earlier by Karl Pearson et al. and by Udny Yule in 1903. The name Simpson's paradox was coined by Colin R. Blyth in 1972. Because Simpson did not discover this statistical paradox, some authors instead use the impersonal names reversal paradox and amalgamation paradox for what is now called Simpson's paradox or the Yule-Simpson effect.
- Source: http://en.wikipedia.org/wiki/Simpson's_paradox
Reference: Colin R. Blyth, "On Simpson's Paradox and the Sure-Thing Principle," Journal of the American Statistical Association, Vol. 67, No. 338 (Jun. 1972), pp. 364-366
Simpson’s Paradox - Example
Lisa and Bart each edit Wikipedia articles for two weeks. In the first week, Lisa improves 60% of the 100 articles she edits, and Bart improves 90% of the 10 articles he edits. In the second week, Lisa improves only 10% of the 10 articles she edits, while Bart improves 30% of the 100 articles he edits.
In each week, Bart improved a higher percentage of the articles he edited than Lisa did. Yet when the two weeks are combined using a weighted average, Lisa improved a much higher overall percentage than Bart, because most of her edits fell in the week when her success rate was high (see the table and the sketch below).
- Source: wikipedia.org
        Week 1   Week 2   Total
Lisa    60/100   1/10     61/110
Bart    9/10     30/100   39/110
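A quick way to see the reversal is to recompute the rates directly. The following sketch (Python; the figures are taken from the table above) prints each editor's weekly and combined improvement rates.

```python
# Figures from the Lisa/Bart example: (improved, edited) per week.
weeks = {
    "Lisa": [(60, 100), (1, 10)],
    "Bart": [(9, 10), (30, 100)],
}

for editor, results in weeks.items():
    weekly = [improved / edited for improved, edited in results]
    total_improved = sum(improved for improved, _ in results)
    total_edited = sum(edited for _, edited in results)
    print(editor,
          [f"{rate:.0%}" for rate in weekly],
          f"overall {total_improved / total_edited:.0%}")

# Bart wins each week (90% vs 60%, 30% vs 10%), yet Lisa wins overall (55% vs 35%):
# the combined rate is a weighted average dominated by each editor's larger week.
```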
The Effect of Context
Segmentation makes a difference.
Demonstration: http://zlin.ba.ttu.edu/pwGrad/7342/AID-Example.xlsx
Decision Tree Algorithms
- AID (Automatic Interaction Detection), 1969, by Morgan and Sonquist
- CHAID (CHi-squared AID), 1975, by Gordon V. Kass
- XAID, 1982, by Gordon V. Kass
- CRT (or CART, Classification and Regression Trees), 1984, by Breiman et al.
- QUEST (Quick, Unbiased and Efficient Statistical Tree), 1997, by Wei-Yin Loh and Yu-Shan Shih
- CLS, 1966, by Hunt et al.
- ID3 (Iterative Dichotomizer 3), 1983, by Ross Quinlan
- C4.5 and C5.0, by Ross Quinlan
- SLIQ
AID
AID stands for Automatic Interaction Detector. It is a statistical technique for multivariate analysis that:
- can be used to determine the characteristics that differentiate buyers from nonbuyers;
- involves a successive series of analytical steps that gradually focus on the critical determinants of behavior, creating clusters of people with similar demographic characteristics and buying behavior.
The technique is explained in John A. Sonquist and James N. Morgan, The Detection of Interaction Effects, University of Michigan, Monograph No. 35, 1969.
AID – Interaction with Multicollinearity
[Scatter plot: Saving vs. Income, with the points split by gender; multicollinearity with an interaction effect.]
AID – Multicollinearity without Interaction
[Scatter plot: Saving vs. Income, with the points split by gender; multicollinearity without an interaction effect.]
Simpson's Paradox
Simpson's paradox for continuous data: a positive trend appears within each of two separate groups (blue and red), while a negative trend (black, dashed) appears when the data are combined.
- Source: Wikipedia.org
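The slope reversal is easy to reproduce numerically. The sketch below (Python with NumPy; the two groups and their values are made up purely for illustration) fits a straight line within each group and then to the pooled data.

```python
import numpy as np

# Two hypothetical groups: within each group the target rises with x,
# but the group with the larger x values sits uniformly lower.
x_blue = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_blue = 8.0 + 0.5 * x_blue
x_red = np.array([6.0, 7.0, 8.0, 9.0, 10.0])
y_red = 0.5 * x_red

for name, x, y in [("blue", x_blue, y_blue), ("red", x_red, y_red)]:
    slope = np.polyfit(x, y, 1)[0]
    print(f"{name} group slope: {slope:+.2f}")        # +0.50 within each group

pooled_slope = np.polyfit(np.concatenate([x_blue, x_red]),
                          np.concatenate([y_blue, y_red]), 1)[0]
print(f"pooled slope: {pooled_slope:+.2f}")           # negative once the groups are combined
```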
AID
Use a decision tree to search through the many factors that may influence a relationship, so that the final results presented are accurate:
- Partition the data set in terms of selected attributes (the split-selection idea is sketched below)
- Run a regression on each subset of the data
Features:
- Capability of dealing with multicollinearity (with or without interaction)
- Addresses the problem of hidden relationships
Morgan and Sonquist's notes:
- An intervening effect is due to an interaction, e.g., between the customer segment and the effect of the promotional program on retention; it can obscure the relationship.
- A decision tree can account for 2/3 of the variability in some relationships, while regression accounts for only 1/3.
- Decision trees perform well with strong categorical, non-linear effects, but are inefficient at packaging the predictive effects of generally linear relationships.
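AID grows its tree by choosing, at each node, the binary split that most reduces the variance (sum of squared errors) of the target. A minimal sketch of that split-selection idea (Python; the saving/gender/region data and the candidate splits are hypothetical):

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def variance_reduction(y, mask):
    """Reduction in SSE obtained by splitting y into y[mask] and y[~mask]."""
    return sse(y) - sse(y[mask]) - sse(y[~mask])

# Hypothetical data: saving (target) with gender and region as candidate predictors.
saving = np.array([1.0, 1.2, 0.9, 3.1, 3.4, 2.8, 1.1, 3.0])
gender = np.array(["F", "F", "F", "M", "M", "M", "F", "M"])
region = np.array(["N", "S", "N", "S", "N", "S", "S", "N"])

candidates = {
    "gender == 'M'": gender == "M",
    "region == 'S'": region == "S",
}
for name, mask in candidates.items():
    print(f"{name:15s} variance reduction = {variance_reduction(saving, mask):.3f}")

best = max(candidates, key=lambda name: variance_reduction(saving, candidates[name]))
print("AID-style first split:", best)
```

After the split, one would run a separate regression within each resulting subset, as the slide describes.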
Imperfections of AID
- Untrue relationships: because the algorithm looks through so many potential groupings of values, it is more likely to find groupings that are actually anomalies in the data.
- Biased selection of inputs or predictors: the successive partitioning of the data set into bins quickly exhausts the number of observations available in the lower levels of the decision tree.
- Potential overfitting: AID does not know when to stop growing branches, and it forms splits at the lower extremities of the decision tree where few data records are actually available.
Remedies
- Use statistical tests to check the efficacy of each branch that is grown: CHAID and XAID, by Kass (1975, 1982).
- Use validation data to test any branch that is formed for reproducibility: CRT/CART, by Breiman et al. (1984).
CHAID Algorithm
CHAID, which stands for CHi-squared Automatic Interaction Detector, is a decision tree technique published in 1980 by Gordon V. Kass.
In practice, it is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables.
Advantage: its output is highly visual and easy to interpret. However, because it uses multiway splits by default, it needs rather large sample sizes to work effectively; with small sample sizes the respondent groups can quickly become too small for reliable analysis.
CHAID detects interactions between variables in the data set. Using this technique we can establish relationships between a dependent variable and explanatory variables such as price, size, supplements, etc.
CHAID is often used as an exploratory technique and is an alternative to multiple regression, especially when the data set is not well suited to regression analysis.
Terms
Bonferroni correction
- The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level used for each hypothesis separately should be 1/n times what it would be if only one hypothesis were tested. ("Statistically significant" simply means that a given result is unlikely to have occurred by chance.) A numeric sketch follows below.
- It was developed by the Italian mathematician Carlo Emilio Bonferroni.
Kass adjustment
- A p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-square target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.
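A minimal numeric illustration (Python; the counts and p-value are hypothetical): the Bonferroni correction divides the per-test significance level by the number of hypotheses, which is equivalent to multiplying each raw p-value by that number before comparing it with the overall level.

```python
alpha = 0.05     # desired overall significance level
n_tests = 12     # hypothetical number of hypotheses (e.g., candidate splits) tested

# Bonferroni: the per-test threshold is alpha / n.
per_test_alpha = alpha / n_tests
print(f"per-test significance level: {per_test_alpha:.4f}")

# Equivalent Kass-style view: inflate the raw p-value by the Bonferroni multiplier.
raw_p = 0.008
adjusted_p = min(1.0, raw_p * n_tests)
print(f"adjusted p-value: {adjusted_p:.3f}, significant: {adjusted_p < alpha}")
```

(In the actual Kass adjustment the multiplier reflects the number of ways the input categories can be grouped into branches, not simply the number of tests.)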
CHAID
Statistical tests
- A test of similarity is used to determine whether individual values of an input should be combined (illustrated with a chi-square test below).
- After similar values of an input have been combined according to the previous rule, tests of significance are used to select which inputs are significant descriptors of the target values and, if so, what their strengths are relative to other inputs.
CHAID addresses all of the problems in the AID approach:
- A statistical test is used to ensure that only relationships that are significantly different from random effects are identified.
- Statistical adjustments address the biased selection of variables as candidates for the branch partitions.
- Tree growth is terminated when the branch that would be produced fails the test of significance.
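CHAID's category-merging step can be illustrated with an ordinary chi-square test of independence. The sketch below (Python with SciPy; the counts are hypothetical) tests whether two categories of an input have similar response rates and could therefore be merged into one branch.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: two input categories (rows) x target outcome (columns).
#            respond  no-response
observed = [[30, 170],            # input category A
            [36, 164]]            # input category B

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p:.3f}")

# A large p-value (no significant difference between the categories' response rates)
# is the cue that the two categories may be merged into a single branch.
if p > 0.05:
    print("categories look similar -> candidates for merging")
```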
CRT (or CART) Algorithm
- Closely follows the original AID goal, but with improvements through the application of validation and cross-validation.
- It can identify the overfitting problem and verify the reproducibility of the decision tree structure using hold-out or validation data.
- Breiman et al. found that hold-out or validation data are not strictly necessary for this grow-and-compare method: a cross-validation method can be used by resampling the training data that is used to grow the decision tree (see the sketch below).
Advantages:
- Prevents growing a tree bigger than what passes the validation tests
- Makes it possible to use prior probabilities
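As a hedged, minimal sketch of the grow-then-prune idea, the code below uses scikit-learn's DecisionTreeClassifier (an optimized CART-style implementation): grow a full tree, enumerate its cost-complexity pruning path, and choose the complexity penalty by cross-validation. The dataset is just a convenient built-in placeholder.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # any labeled data set would do

# Candidate complexity penalties from the cost-complexity pruning path of a full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each candidate alpha; a larger alpha yields a smaller, more pruned tree.
best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(path.ccp_alphas.clip(min=0)):
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5)
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"chosen ccp_alpha = {best_alpha:.5f}, cross-validated accuracy = {best_score:.3f}")
```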
QUEST
QUEST (Quick, Unbiased and Efficient Statistical Tree) is a binary-split decision tree algorithm for classification and data mining developed by Wei-Yin Loh (University of Wisconsin-Madison) and Yu-Shan Shih (National Chung Cheng University, Taiwan).
The objective of QUEST is similar to that of CRT by Breiman, Friedman, Olshen and Stone (1984). The major differences:
- QUEST uses an unbiased variable selection technique by default
- QUEST uses imputation instead of surrogate splits to deal with missing values
- QUEST can easily handle categorical predictor variables with many categories
If there are no missing values in the data, QUEST can optionally use the CART algorithm to produce a tree with univariate splits.
CHAID, CRT, and QUEST
For classification-type problems (categorical dependent variable), all three algorithms can be used to build a tree for prediction. QUEST is generally faster than the other two algorithms; however, for very large datasets its memory requirements are usually larger.
For regression-type problems (continuous dependent variable), the QUEST algorithm is not applicable, so only CHAID and CRT can be used.
CHAID will build non-binary trees that tend to be "wider." This has made the CHAID method particularly popular in market research applications. CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable or dimension of the table. This type of display matches the requirements of market segmentation research well.
CRT will always yield binary trees, which sometimes cannot be summarized as efficiently for interpretation and/or presentation.
Machine Learning
A general way of describing computer-mediated methods of learning or developing knowledge.
- Began as an academic discipline
- Often associated with using computers to simulate or reproduce intelligent behavior
Machine learning and business analytics share a common goal: in order to behave with intelligence, it is necessary to acquire intelligence and to refine it over time.
The development of decision trees to form rules is called rule induction in the machine learning literature.
- Induction is the process of developing general laws on the basis of an examination of particular cases.
ID3 Algorithm
ID3 (Iterative Dichotomizer 3), invented by Ross Quinlan in 1983, is an algorithm used to generate a decision tree.
The algorithm is based on Occam's razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy.
The ID3 algorithm can be summarized as follows:
- Take all unused attributes and compute their entropy with respect to the samples at the current node
- Choose the attribute for which the entropy is minimum, i.e., the information gain is maximum (see the sketch below)
- Make a node containing that attribute
Reference: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html
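A minimal sketch of the entropy-based attribute choice (Python; the toy outlook/wind data set is hypothetical). The attribute whose split yields the lowest weighted entropy of the target, equivalently the highest information gain, becomes the next node.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(rows, attr, target):
    """Weighted entropy of the target after splitting rows on attr."""
    total = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        total += len(subset) / len(rows) * entropy(subset)
    return total

# Hypothetical training data: play tennis depending on outlook and wind.
data = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
]

for attr in ("outlook", "wind"):
    print(attr, round(split_entropy(data, attr, "play"), 3))
# ID3 picks the attribute with the lowest split entropy ("outlook" here) as the node.
```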
Occam's Razor
“One should not increase, beyond what is necessary, the number of entities required to explain anything”
Occam's razor is a logical principle attributed to the mediaeval philosopher William of Occam. The principle states that one should not make more assumptions than the minimum needed.
This principle is often called the principle of parsimony. It underlies all scientific modeling and theory building.
It admonishes us to choose, from a set of otherwise equivalent models of a given phenomenon, the simplest one. In any given model, Occam's razor helps us to "shave off" those concepts, variables or constructs that are not really needed to explain the phenomenon. By doing so, developing the model becomes much easier, and there is less chance of introducing inconsistencies, ambiguities and redundancies.
ID3 Algorithm
ID3(Examples, Target_Attribute, Attributes):
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root
    with label = the most common value of the target attribute in the examples.
  Otherwise:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty:
        Below this new branch, add a leaf node with label = the most common
          target value in the examples.
      Else:
        Below this new branch, add the subtree
          ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  Return Root.
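For concreteness, a compact Python sketch of the same recursion (not Quinlan's original implementation; attribute selection here simply minimizes the weighted split entropy, as in the earlier sketch, and the toy data set is hypothetical). The tree is returned as nested dictionaries.

```python
import math
from collections import Counter

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    # Base cases: a pure node, or no attributes left to split on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(attr):
        total = 0.0
        for v in {e[attr] for e in examples}:
            subset = [e[target] for e in examples if e[attr] == v]
            total += len(subset) / len(examples) * entropy(subset)
        return total

    # A = the attribute that best classifies the examples (lowest split entropy).
    best = min(attributes, key=split_entropy)
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Toy usage with the hypothetical data set from the earlier sketch.
data = [
    {"outlook": "sunny",    "wind": "weak",   "play": "no"},
    {"outlook": "sunny",    "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "weak",   "play": "yes"},
    {"outlook": "rain",     "wind": "strong", "play": "no"},
]
print(id3(data, "play", ["outlook", "wind"]))
```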
C4.5
Features
- Builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
- Examines the normalized information gain (difference in entropy) that results from choosing an attribute to split the data (the gain ratio is sketched below).
- Can convert a decision tree into a rule set; an optimizer then goes through the rule set to reduce the redundancy of the rules.
- Can create fuzzy splits on interval inputs.
A few base cases:
- All the samples in the list belong to the same class. Once this happens, the algorithm simply creates a single leaf node for the decision tree.
- None of the features gives any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- It may also happen that there are no instances of a class. Then C4.5 creates a decision node higher up the tree using the expected value.
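C4.5's normalized information gain is the gain ratio: the information gain divided by the entropy of the split itself, which penalizes attributes with many distinct values. A minimal sketch (Python; the label and attribute lists are hypothetical):

```python
import math
from collections import Counter

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(attr_values, labels):
    """Information gain of splitting labels on attr_values, normalized by split info."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [lab for a, lab in zip(attr_values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    split_info = entropy(attr_values)   # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical example: an attribute with a unique value per record gets the same raw
# gain as a sensible two-valued attribute, but a much larger split info.
labels = ["yes", "yes", "no", "no", "yes", "no"]
coarse = ["a", "a", "b", "b", "a", "b"]        # two values, aligned with the labels
fine   = ["1", "2", "3", "4", "5", "6"]        # unique value per record
print("coarse gain ratio:", round(gain_ratio(coarse, labels), 3))   # 1.0
print("fine gain ratio:  ", round(gain_ratio(fine, labels), 3))     # ~0.39
```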
C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory; unsuitable for large data sets, which would need out-of-core sorting
Tutorial: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
YouTube: http://www.youtube.com/watch?v=8-vHunc4k8s
C4.5 vs. ID3
C4.5 made a number of improvements to ID3:
- Handling both continuous and discrete attributes. To handle continuous attributes, C4.5 creates a threshold and then splits the list into those records whose attribute value is above the threshold and those whose value is less than or equal to it (sketched below). [Quinlan, 96]
- Handling training data with missing attribute values. C4.5 allows attribute values to be marked as "?" for missing; missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation. C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.
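The continuous-attribute handling can be sketched as a threshold search: sort the distinct values, take the midpoints between consecutive values as candidate thresholds, and keep the one with the highest information gain. A minimal illustration (Python; the income/label data are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Candidate thresholds are midpoints between consecutive distinct values."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, -1.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = base - len(left) / len(labels) * entropy(left) \
                    - len(right) / len(labels) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical continuous attribute (income) with a buy/no-buy label.
income = [18, 22, 25, 40, 43, 50]
label = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(income, label))   # a threshold near 32.5 separates the classes
```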
C5.0
Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially.
C5.0 offers a number of improvements over C4.5:
- Speed: C5.0 is significantly faster than C4.5 (several orders of magnitude)
- Memory usage: C5.0 is more memory efficient than C4.5
- Smaller decision trees: C5.0 gets similar results to C4.5 with considerably smaller decision trees
- Support for boosting: improves the trees and gives them more accuracy
- Weighting: allows you to weight different attributes and misclassification types
- Winnowing: automatically winnows the data to help reduce noise
C5.0/See5 is a commercial and closed-source product, although free source code is available for interpreting and using the decision trees and rule sets it outputs.
Is See5/C5.0 better than C4.5? http://www.rulequest.com/see5-comparison.html
SLIQ
- A decision tree classifier that can handle both numerical and categorical attributes
- It builds compact and accurate trees
- It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
- It is suitable for classification of large disk-resident data sets, independently of the number of classes, attributes and records
- The Gini index is used to evaluate the "goodness" of the alternative splits for an attribute
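The Gini index used by SLIQ (and by CART) measures the impurity of a node; a candidate split is scored by the weighted Gini impurity of its children. A minimal sketch (Python; the class counts are hypothetical):

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(children):
    """Weighted Gini impurity of a candidate split (list of child class-count lists)."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

# Hypothetical parent node with 40 'yes' and 60 'no' records, and two candidate splits.
print("parent gini: ", gini([40, 60]))                     # 0.48
print("split A gini:", split_gini([[35, 10], [5, 50]]))    # fairly pure children (~0.25)
print("split B gini:", split_gini([[20, 30], [20, 30]]))   # uninformative split (0.48)
# The split with the lower weighted Gini (split A) is judged the better split.
```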
The Evolution of DT Algorithms
Statistical decision tree lineage:
- AID (Morgan & Sonquist, 1969)
- CHAID (Kass, 1975; chi-square statistical tests)
- XAID (Kass, 1982)
- CRT (Breiman et al., 1984; validation)
- QUEST (Loh & Shih, 1997; Gini)

Rule induction / machine learning lineage:
- CLS (Hunt et al., 1966; entropy)
- ID3 (Quinlan, 1983; entropy)
- C4.5 (Quinlan, 1993)
- C5.0 (Quinlan, ?; commercial version)

http://www.gavilan.edu/research/reports/DMALGS.PDF
Features of the ARBORETUM Procedure
Tree branching criteria
- Variance reduction for interval targets
- F-test for interval targets
- Gini or entropy reduction for categorical targets
- Chi-square for nominal targets
Missing value handling
- Use missing values as a separate, but legitimate, code in the split search
- Assign missing values to the leaf that they most closely resemble
- Distribute missing observations across all branches
- Use surrogate, non-missing inputs to impute the distribution of missing values in the branch
Methods
- Cost-complexity pruning and reduced-error pruning
- Prior probabilities can be used in training or assessment
- Misclassification costs can be used to influence decisions and branch construction
- Interactive training mode can be used to produce branches and prune branches
Others
- SAS code generation
- PMML code generation
Questions
- What are the differences between predictive and explanatory analysis?
- Why are input variables distinguished as antecedents and intervening factors?
- What is the implication of Simpson's paradox for decision tree construction and explanation?
- What does interaction mean in the context of decision tree construction?
- What is the primary purpose of the AID algorithm?
- What are the weaknesses of the AID algorithm? How are these problems addressed by other algorithms?
- Summarize the main principles in decision tree algorithm design and implementation.