1
From Association Rules To Causality
Presenters:
Amol Shukla, University of Waterloo
Claude-Guy Quimper, University of Waterloo
2
From Association Rules To Causality
Presentation Outline
Limitations of Association Rules and the Support-Confidence Framework
Generalizing Association Rules to Correlations
Scalable Techniques for Mining Causal Structures
Applications of Correlation and Causality
Summary
3
Review: Association Rules Mining
Itemset I = {i1, …, ik}
Find all the rules X → Y with minimum confidence and support
support, s: probability that a transaction contains both X and Y
confidence, c: conditional probability that a transaction having X also contains Y, i.e., P(Y|X)
Let min_support = 50%, min_conf = 50%. Two example association rules are:
A → C (50%, 66.7%)
C → A (50%, 100%)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
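
The support and confidence figures for the two example rules can be checked with a short sketch. The function names `support` and `confidence` are illustrative choices, not part of the presentation; the data is the four-transaction table above.

```python
# Transactions from the example table (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of lhs and rhs together over support of lhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))        # support of A -> C and of C -> A
print(confidence({"A"}, {"C"}))   # confidence of A -> C
print(confidence({"C"}, {"A"}))   # confidence of C -> A
```

Both rules share the same support because support only depends on the union of the two sides; only the confidence depends on the direction of the rule.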
4
Limitations of Association Rules using Support-Confidence Framework
Negative implications or dependencies are ignored
Consider the database below.
X and Y are positively related; X and Z are negatively related.
Yet the support and confidence of X=>Z dominate.
Only the presence of items is taken into account

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule   Support  Confidence
X=>Y   25%      50%
X=>Z   37.50%   75%
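
A minimal sketch that recomputes the table from the X, Y, Z columns. It also reports the base rate of the right-hand side, which is an addition for illustration: comparing confidence with that base rate is one simple way to see that X=>Z is a misleading rule. The helper name `rule_stats` is hypothetical.

```python
# Columns of the example database, one entry per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def rule_stats(lhs, rhs):
    """Return (support, confidence, base rate of rhs) for the rule lhs => rhs."""
    n = len(lhs)
    both = sum(1 for a, b in zip(lhs, rhs) if a and b)
    support = both / n
    confidence = both / sum(lhs)
    base_rate = sum(rhs) / n
    return support, confidence, base_rate

for name, rhs in (("X=>Y", Y), ("X=>Z", Z)):
    s, c, base = rule_stats(X, rhs)
    print(f"{name}: support={s:.1%} confidence={c:.1%} base rate={base:.1%}")
```

For X=>Y the confidence is above the base rate of Y, while for X=>Z the confidence is below the base rate of Z, even though X=>Z has the higher support and confidence.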
5
Limitations of Association Rules using Support-Confidence Framework
Another market basket data example
Buys Tea => Buys Coffee (support=20%, confidence=80%)
Is this rule really valid?
Pr(Buys Coffee)=90%
Pr(Buys Coffee|Buys Tea)=80%
Negative correlation between buying tea and buying coffee is ignored

Items Bought  Coffee  No Coffee  Sum(row)
Tea           20      5          25
No Tea        70      5          75
Sum(col.)     90      10         100
7
What is Correlation?
P(A): Probability that event A occurs
P(A’): Probability that event A does not occur
P(AB): Probability that events A and B occur together
Events A and B are said to be independent if P(AB) = P(A) x P(B)
Otherwise A and B are dependent
Events A and B are said to be correlated if any of AB, A’B, AB’, A’B’ are dependent
A correlation rule is a set of items that are correlated
8
Computing Correlation Rules: Chi-squared Test for Independence
For an itemset I={i1,…,ik}, construct a k-dimensional contingency table R= {i1,i1’} x … x {ik,ik’}
We need to test whether each cell r= r1,…,rk in this table is dependent
Let O(r) denote the observed value of cell r in this table, and E(r) be its expected value.
The chi-squared statistic is then computed as:
$$\chi^2 = \sum_{r \in R} \frac{(O(r) - E(r))^2}{E(r)}$$
If χ² = 0, the cells are independent. If χ² > cut-off value, reject the independence assumption
9
Example: Computing the Chi-squared Statistic
             Coffee  No Coffee  Sum(row)
Tea          20      5          25
No Tea       70      5          75
Sum(col.)    90      10         100

E(Coffee,Tea) = (90 x 25)/100 = 22.5
E(No Coffee,Tea) = (10 x 25)/100 = 2.5
E(Coffee,No Tea) = (90 x 75)/100 = 67.5
E(No Coffee,No Tea) = (10 x 75)/100 = 7.5

χ² = (20-22.5)²/22.5 + (5-2.5)²/2.5 + (70-67.5)²/67.5 + (5-7.5)²/7.5
   = 0.28 + 2.5 + 0.09 + 0.83 = 3.7
Since this value is greater than the cut-off value (2.71 at the 90% confidence level, i.e., significance level 0.10), we reject the independence assumption
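
The arithmetic on this slide can be reproduced with a few lines of plain Python. This is a sketch for a two-way table only; the function name `chi_squared` is an illustrative choice, and expected counts are taken from the row and column sums exactly as in the E(...) lines above.

```python
def chi_squared(observed):
    """Chi-squared statistic for a two-way table given as a list of rows."""
    n = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# Rows: Tea, No Tea. Columns: Coffee, No Coffee.
table = [[20, 5], [70, 5]]
chi2 = chi_squared(table)
print(round(chi2, 2))   # 3.7
print(chi2 > 2.71)      # compare with the cut-off used on the slide
```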
10
Determining the Cause of Correlation
Measures of Interest
Define measures of interest for each cell: I(r) = O(r) / E(r)
I(r)>1 indicates positive dependence and I(r)<1 indicates negative dependence
The farther I(r) is from 1, the more a cell contributes to the χ² value, and the correlation.

Cell Counts
          Coffee  No Coffee
Tea       20      5
No Tea    70      5

Interest I(r)
          Coffee  No Coffee
Tea       0.89    2
No Tea    1.04    0.67
For example, I(Coffee,No Tea) = 70/67.5 = 1.04

Thus, [No Coffee,Tea] contributes the most to the correlation, indicating that buying tea might inhibit buying coffee
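
A short sketch of the interest computation for the tea and coffee counts, assuming expected counts come from the row and column sums as in the chi-squared example. The names `interest`, `labels`, and `farthest` are illustrative.

```python
def interest(observed):
    """I(r) = O(r) / E(r) for every cell of a two-way table."""
    n = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    return [
        [o / (row_sums[i] * col_sums[j] / n) for j, o in enumerate(row)]
        for i, row in enumerate(observed)
    ]

# Rows: Tea, No Tea. Columns: Coffee, No Coffee.
counts = [[20, 5], [70, 5]]
labels = [["Tea, Coffee", "Tea, No Coffee"],
          ["No Tea, Coffee", "No Tea, No Coffee"]]
ratios = interest(counts)

# The cell whose interest is farthest from 1.
farthest = max(
    (abs(ratios[i][j] - 1), labels[i][j]) for i in range(2) for j in range(2)
)[1]

for i in range(2):
    for j in range(2):
        print(f"{labels[i][j]}: {ratios[i][j]:.2f}")
print("farthest from 1:", farthest)
```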
11
Properties of Correlation
If a set of items is correlated, all its supersets are also correlated. Thus, correlation is upward-closed
We can focus on minimal correlated itemsets to reduce our search space
Support is downward-closed. A set has minimum support only if all its subsets have minimum support
We can combine correlation with support for an effective pruning strategy
12
Combining Correlation with Support
The support-confidence framework looks only at the top-left cell of the contingency table. To incorporate negative dependence, we must consider all the cells in the table
Combine correlation with support by defining “CT-support”
Let s be a user specified min-support threshold. Let p be a user-specified cut-off percentage value
An itemset I is CT-supported if at least p% of the cells in its contingency table have support not less than s
An itemset is significant if it is CT-supported and minimally correlated
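
The CT-support definition can be sketched as follows. This is an illustration under stated assumptions rather than the presenters' implementation: transactions are Python sets, `s` and `p` are both given as fractions between 0 and 1, and the function names are hypothetical.

```python
from itertools import product

def contingency_counts(transactions, itemset):
    """Count transactions in each presence/absence cell of the itemset's table."""
    items = sorted(itemset)
    counts = {cell: 0 for cell in product((True, False), repeat=len(items))}
    for t in transactions:
        counts[tuple(item in t for item in items)] += 1
    return counts

def is_ct_supported(transactions, itemset, s, p):
    """True if at least a fraction p of the cells have support of at least s."""
    counts = contingency_counts(transactions, itemset)
    n = len(transactions)
    supported_cells = sum(1 for c in counts.values() if c / n >= s)
    return supported_cells / len(counts) >= p

# The tea and coffee table as 100 baskets.
baskets = (
    [{"tea", "coffee"}] * 20 + [{"tea"}] * 5 + [{"coffee"}] * 70 + [set()] * 5
)

print(is_ct_supported(baskets, {"tea", "coffee"}, s=0.10, p=0.50))
print(is_ct_supported(baskets, {"tea", "coffee"}, s=0.10, p=0.75))
```

With s = 10%, only the two cells that contain coffee reach the threshold, so the pair is CT-supported at p = 50% but not at p = 75%.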
14
Steps performed by the algorithm at level k
Start
1) Construct the contingency table for the next itemset at the level
2) Is the itemset CT-supported? If no, go back to step 1 with the next itemset
3) Is χ² greater than the cut-off value? If yes, mark the itemset as ‘significant’; if no, add it to the set NOTSIG
4) Repeat until all itemsets at level k are processed
5) Generate itemset(s) of size k+1 such that all of their subsets are in NOTSIG
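
The level-wise procedure can be sketched in Python as below. This is a simplified illustration, not the presenters' code: it starts from all pairs of items, computes expected counts under full independence of the items, takes the cut-off as a plain parameter, and stops at a hypothetical `max_size`. All function and parameter names are assumptions.

```python
from itertools import combinations, product

def contingency(transactions, items):
    """Observed count for every presence/absence cell of `items`."""
    counts = {cell: 0 for cell in product((True, False), repeat=len(items))}
    for t in transactions:
        counts[tuple(i in t for i in items)] += 1
    return counts

def chi_squared(counts, transactions, items):
    """Chi-squared statistic against the all-items-independent expectation."""
    n = len(transactions)
    probs = [sum(1 for t in transactions if i in t) / n for i in items]
    stat = 0.0
    for cell, observed in counts.items():
        expected = n
        for present, prob in zip(cell, probs):
            expected *= prob if present else 1 - prob
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

def mine_significant(transactions, s, p, cutoff, max_size=3):
    """Return itemsets that are CT-supported and minimally correlated."""
    n = len(transactions)
    universe = sorted(set().union(*transactions))
    candidates = list(combinations(universe, 2))
    significant, size = [], 2
    while candidates and size <= max_size:
        notsig = set()
        for items in candidates:
            counts = contingency(transactions, items)
            supported = sum(1 for c in counts.values() if c / n >= s)
            if supported / len(counts) < p:
                continue                      # not CT-supported
            if chi_squared(counts, transactions, items) > cutoff:
                significant.append(items)     # correlated: stop growing it
            else:
                notsig.add(items)             # supported but uncorrelated
        size += 1
        grown = set()
        for base in notsig:
            for extra in universe:
                if extra not in base:
                    cand = tuple(sorted(base + (extra,)))
                    # keep only candidates whose smaller subsets are all in NOTSIG
                    if all(sub in notsig for sub in combinations(cand, size - 1)):
                        grown.add(cand)
        candidates = sorted(grown)
    return significant

# Item c is present exactly when one of a, b is present (exclusive or).
xor_baskets = [set(), {"b", "c"}, {"a", "c"}, {"a", "b"}] * 2
print(mine_significant(xor_baskets, s=0.1, p=0.5, cutoff=2.71))
```

In the usage example every pair of items is independent, so all pairs land in NOTSIG, and the correlation only shows up at level 3.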
15
Limitations of Correlation
Correlation might not be valid for ‘sparse’ itemsets. At least 80% of the cells in the contingency table must have expected value greater than 5.
Finding correlation rules is computationally more expensive than finding association rules.
Correlation only indicates the existence of a relationship. It does not specify the nature of the relationship, i.e., the cause and effect phenomenon is ignored.
Identifying causality is important for decision-making.
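
The rule of thumb for sparse itemsets (at least 80% of cells with expected value greater than 5) can be checked before running the test. A minimal sketch for a two-way table; the function name and the parameter names `min_expected` and `min_fraction` are illustrative.

```python
def chi_squared_applicable(observed, min_expected=5, min_fraction=0.8):
    """True if enough cells have an expected count above min_expected."""
    n = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    expected = [r * c / n for r in row_sums for c in col_sums]
    large_enough = sum(1 for e in expected if e > min_expected)
    return large_enough / len(expected) >= min_fraction

print(chi_squared_applicable([[40, 10], [35, 15]]))
print(chi_squared_applicable([[20, 5], [70, 5]]))
```

The second call uses the tea and coffee counts from the earlier slides: only three of the four expected values (22.5, 2.5, 67.5, 7.5) exceed 5, which is 75% of the cells, so the check returns False for that table.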
17
Causality
[Figure: Hot-Dogs, Hamburgers; 33%, 33%, 33%]
Association Rule: Hot-Dogs → BBQ Sauce [33%, 50%]
Causality Rule: Hamburgers → BBQ Sauce
18
Bayesian Networks
What is the best topology of a Bayesian network that describes the observed data?
Problem: Very expensive to compute
19
Simplifying Causal Relationships
Knowing the existence of a causal relationship is as good as knowing the relationship
20
Causality vs Correlation
Two correlated variables can have either:
A common ancestor
A causal relationship
21
First Rule of Causality
1) Suppose we have three pairwise dependent variables
2) And two of the variables become independent when conditioned on the third one
[Figure: diagrams of the three variables; the conditioned pair is labeled independent]
23
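Second Rule of Causality
1) Suppose we have three variables with these relationships: two pairs are dependent and one pair is independent
2) And the two independent variables become dependent when conditioned on the third variable
[Figure: diagrams of the three variables; edges labeled dependent, independent, dependent before conditioning, and all dependent after conditioning]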
25
Finding Causality
1) Construct a graph where each variable is a vertex
2) Perform a Chi-squared test to determine correlation
3) Add an edge labeled “C” for each correlated test
4) Add an edge labeled “U” for each uncorrelated test
5) For each triplet, check if a causality rule can be applied
[Figure: example graph with edges labeled C and one edge labeled U]
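
Steps 1 to 5 can be sketched as below for boolean variables. Several things here are assumptions made for illustration: rows are dictionaries of 0/1 values, separate cut-offs are used for "correlated" and "uncorrelated" (the value 3.84 corresponds to 95% with one degree of freedom, while 0.5 is an arbitrary placeholder), and the patterns are named "CCC" and "CCU" after their edge labels. The sketch only lists triplets whose edge labels match one of the two rules; the conditional-independence checks that the rules also require are left out.

```python
from itertools import combinations

def chi_squared_pair(rows, a, b):
    """Chi-squared statistic for the 2x2 table of boolean variables a and b."""
    n = len(rows)
    pa = sum(r[a] for r in rows) / n
    pb = sum(r[b] for r in rows) / n
    stat = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            observed = sum(1 for r in rows if r[a] == va and r[b] == vb)
            expected = n * (pa if va else 1 - pa) * (pb if vb else 1 - pb)
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def label_edges(rows, variables, corr_cutoff=3.84, uncorr_cutoff=0.5):
    """Label each pair 'C' or 'U'; pairs between the cut-offs get no edge."""
    edges = {}
    for a, b in combinations(variables, 2):
        stat = chi_squared_pair(rows, a, b)
        if stat > corr_cutoff:
            edges[frozenset((a, b))] = "C"
        elif stat < uncorr_cutoff:
            edges[frozenset((a, b))] = "U"
    return edges

def candidate_triplets(edges, variables):
    """Triplets whose three edges are all labeled and form CCC or CCU."""
    found = []
    for trio in combinations(variables, 3):
        labels = [edges.get(frozenset(pair)) for pair in combinations(trio, 2)]
        if None in labels:
            continue
        pattern = "".join(sorted(labels))
        if pattern in ("CCC", "CCU"):
            found.append((trio, pattern))
    return found

# a and b are independent; c is 1 whenever a or b is 1.
rows = [{"a": a, "b": b, "c": int(a or b)} for a in (0, 1) for b in (0, 1)] * 5
edges = label_edges(rows, ["a", "b", "c"])
print(candidate_triplets(edges, ["a", "b", "c"]))
```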
26
Weaknesses of the Algorithm
Causality rules do not cover all possible causality relationships
The χ² test at the 95% confidence level is expected to wrongly reject independence in about 5 of every 100 tests on independent variables
Some pairs of variables might be reported as neither correlated nor uncorrelated
28
Experiments (Census)
Correlation rules
Not a native English speaker, Not born in the U.S.
Served in the military, Male
Married, more than 40 years old
Causality rules
Male Moved Last 5 years, Support-Job
Native-Amer. $20-$40K House Holder
Asian, Laborer < $20K
29
Experiments (Text Data)
416 distinct frequent words
86320 pairs of words, 10% are correlated

Correlation             Causality Rules
Nelson, Mandela         upi, not reuter
area, province          Iraqi, Iraq
area, secretary, war    united, states
area, secretary, they   prime, minister
30
Beyond Correlation and Causality
Correlation and causality seem to be stronger mathematical models than confidence and support
It is possible to apply these concepts where confidence and support were previously applied
31
Association Rules with Constraints
Correlation can be seen as a monotone constraint
Algorithm obtained by modifying algorithms for mining constrained association rules
Example constraint: at least one item is meat
33
Conclusion (Good news)
Correlation and causality are stronger mathematical models to retrieve interesting association rules
They allow negative implications to be detected
Causality explains why there is a correlation
34
Conclusion (Bad news)
Difficult to precisely detect correlation (especially in sparse data cubes)
Not all causality relationships can be found
Are the results really better than with support and confidence?
35
Open Problems
How to discover hidden variables in causality
How to resolve bi-directional causality for disambiguation, e.g., prime → minister vs. minister → prime
How do we find causal patterns for more than 3 variables
36
References
Papers
“Beyond Market Baskets: Generalizing Association Rules to Correlations” - Brin, Motwani, Silverstein; SIGMOD 97
“Scalable Techniques for Mining Causal Structures” - Silverstein, Brin, Motwani, Ullman; VLDB 98
“Efficient Mining of Constrained Correlated Sets” - Grahne, Lakshmanan, Wang; ICDE 2000
“A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships” - Cooper; Data Mining and Knowledge Discovery, vol 1, 1997
Textbook
“Causality: models, reasoning, and inference” - Judea Pearl; Cambridge University Press, 2000