Classification Using Statistically Significant Rules
Page 1: Classification Using Statistically Significant Rules

1

Classification Using Statistically Significant Rules

Sanjay Chawla
School of IT

University of Sydney

(joint work with Florian Verhein and Bavani Arunasalam)

Page 2: Classification Using Statistically Significant Rules

2

Overview

• Data Mining Tasks

• Associative Classifiers

• Support and Confidence for Imbalanced Datasets

• The use of exact tests to mine rules

• Experiments

• Conclusion

Page 3: Classification Using Statistically Significant Rules

3

Data Mining

• Data Mining research has settled into an equilibrium involving four tasks

[Slide diagram: the four tasks (Pattern Mining / Association Rules, Classification, Clustering, and Anomaly or Outlier Detection), with Associative Classifiers bridging Pattern Mining (DB) and Classification (ML)]

Page 4: Classification Using Statistically Significant Rules

4

Outlier Detection

• Outlier Detection (Anomaly Detection) can be studied from two aspects
  – Unsupervised
    • Nearest Neighbor or K-Nearest Neighbor problem
  – Supervised
    • Classification for an imbalanced data set
      – Fraud Detection
      – Medical Diagnosis

Page 5: Classification Using Statistically Significant Rules

6

Association Rules (Agrawal, Imielinski and Swami, 93 SIGMOD)

• An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

For the example rule:
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

From “Introduction to Data Mining”, Tan,Steinbach and Kumar
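As a minimal illustration (my addition, not from the slides; the data is just the toy table above), support and confidence of the example rule can be computed directly in Python:

# Support and confidence of {Milk, Diaper} -> {Beer} over the five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y, transactions) / len(transactions)           # 2/5 = 0.4
confidence = sigma(X | Y, transactions) / sigma(X, transactions)   # 2/3 = 0.67
print(support, confidence)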

Page 6: Classification Using Statistically Significant Rules

7

Association Rule Mining (ARM) Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Page 7: Classification Using Statistically Significant Rules

8

Mining Association Rules

• Two-step approach:

1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup

2. Rule Generation
   – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

Page 8: Classification Using Statistically Significant Rules

9

Frequent Itemset Generation

[Slide diagram: the lattice of all candidate itemsets over {A, B, C, D, E}, from the null itemset up to ABCDE]

Given d items, there are 2^d possible candidate itemsets

From “Introduction to Data Mining”, Tan,Steinbach and Kumar

Page 9: Classification Using Statistically Significant Rules

10

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

• Apriori principle holds due to the following property of the support measure:

– Support of an itemset never exceeds the support of its subsets

– This is known as the anti-monotone property of support

∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
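A minimal Apriori-style sketch (my illustration, not code from the talk) of how the anti-monotone property prunes candidates: a k-itemset is only counted if every one of its (k − 1)-subsets was frequent. Transactions are assumed to be a list of item sets.

from itertools import combinations

def apriori(transactions, minsup_count):
    """Return all itemsets (frozensets) whose support count is >= minsup_count."""
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup_count}
    frequent, k = set(level), 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): drop candidates with an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count support only for the surviving candidates.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= minsup_count}
        frequent |= level
        k += 1
    return frequent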

Page 10: Classification Using Statistically Significant Rules

11

Illustrating Apriori Principle

[Slide diagram: the same itemset lattice over {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets are pruned from the search]

From “Introduction to Data Mining”, Tan,Steinbach and Kumar

Page 11: Classification Using Statistically Significant Rules

12

Classification using ARM

TID   Items                        Gender
1     Bread, Milk                  F
2     Bread, Diaper, Beer, Eggs    M
3     Milk, Diaper, Beer, Coke     M
4     Bread, Milk, Diaper, Beer    M
5     Bread, Milk, Diaper, Coke    F

In a Classification task we want to predict the class label (Gender) using the attributes

A good (albeit stereotypical) rule is {Beer, Diaper} → Male, whose support is 60% and confidence is 100%

Page 12: Classification Using Statistically Significant Rules

13

Classification Based on Association (CBA): Liu, Hsu and Ma (KDD 98)

• Mine association rules of the form A → c on the training data

• Prune or select rules using a heuristic

• Rank rules
  – Higher confidence; higher support; smallest antecedent

• New data is passed through the ordered rule set
  – Apply the first matching rule, or a variation thereof
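A rough sketch of this CBA-style ranking and first-match prediction (my paraphrase of the heuristic described above; the rule representation is assumed, not taken from the paper):

def rank_rules(rules):
    """rules: list of (antecedent_set, class_label, support, confidence)."""
    # Higher confidence first, then higher support, then smaller antecedent.
    return sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))

def classify(instance_items, ranked_rules, default_class):
    """Apply the first rule whose antecedent is contained in the instance."""
    for antecedent, label, _sup, _conf in ranked_rules:
        if antecedent <= instance_items:
            return label
    return default_class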

Page 13: Classification Using Statistically Significant Rules

14

Several Variations

Acronym   Authors             Forum       Comments
CMAR      Li, Han, Pei        ICDM 01     Uses Chi^2
--        Antonie, Zaiane     DMKD 04     Pos and Neg rules
FARMER    Cong et al.         SIGMOD 04   Row enumeration
Top-K     Cong et al.         SIGMOD 05   Limit no. of rules
CCCS      Arunasalam et al.   SIGKDD 06   Support-free

Page 14: Classification Using Statistically Significant Rules

17

Downsides of Support (3)

• Support is biased towards the majority class
  – E.g.: classes = {yes, no}, sup({yes}) = 90%
  – minSup > 10% wipes out any rule predicting "no"
  – Suppose X → no has confidence 1 and support 3%. The rule is discarded if minSup > 3%, even though it perfectly predicts 30% of the instances in the minority class!

• In summary, support has many downsides, especially for classification.

Page 15: Classification Using Statistically Significant Rules

18

Downside of Confidence (1)

        C     ¬C
A       20    5      25
¬A      70    5      75
        90    10     100

Conf(A → C) = 20/25 = 0.8
Support(A → C) = 20/100 = 0.2

Correlation between A and C:

  P(A, C) / (P(A) · P(C)) = 0.20 / (0.25 × 0.90) = 0.89 < 1

Thus, when the data set is imbalanced a high support and high confidence rule may not necessarily imply that the antecedent and the consequent are positively correlated.
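A quick numerical check of this point (my addition), computing the correlation measure P(A, C) / (P(A) · P(C)) from the 2×2 table above:

# Counts from the table: rows are A / not-A, columns are C / not-C.
a, b = 20, 5    # A & C,      A & not-C
c, d = 70, 5    # not-A & C,  not-A & not-C
n = a + b + c + d

p_ac = a / n                   # 0.20
p_a = (a + b) / n              # 0.25
p_c = (a + c) / n              # 0.90
print(p_ac / (p_a * p_c))      # ~0.89 < 1: A and C are negatively correlated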

Page 16: Classification Using Statistically Significant Rules

19

Downside of Confidence (2)

• Reasonable to expect that for “good rules” the antecedent and consequent are not independent!

• Suppose
  – P(Class = Yes) = 0.9
  – P(Class = Yes | X) = 0.9

• Then the rule X → Yes has confidence 0.9, yet X and the class are independent, so a high confidence value by itself does not indicate a good rule

Page 17: Classification Using Statistically Significant Rules

20

Complement Class Support (CCS)

  CCS(A → C) = σ(A, ¬C) / σ(¬C)

The following are equivalent for a rule A → C:

1. A and C are positively correlated

2. CCS(A → C) is less than the support of the antecedent (A)

3. Conf(A → C) is greater than the support of the consequent (C)
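A small check of the three conditions on the table from the earlier confidence slide (my illustration); since that rule A → C is negatively correlated, all three tests come out false together:

a, b = 20, 5    # A & C,      A & not-C
c, d = 70, 5    # not-A & C,  not-A & not-C
n = a + b + c + d

sup_A = (a + b) / n              # 0.25
sup_C = (a + c) / n              # 0.90
conf = a / (a + b)               # 0.80
ccs = b / (b + d)                # sigma(A, not-C) / sigma(not-C) = 5/10 = 0.5

positively_correlated = (a / n) > sup_A * sup_C
print(positively_correlated, ccs < sup_A, conf > sup_C)   # False False False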

Page 18: Classification Using Statistically Significant Rules

21

Downsides of Confidence (3)

Another useful observation:

• Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence, but neither of these holds for the majority class.

• Confidence (like support) is therefore biased towards the majority class.

Page 19: Classification Using Statistically Significant Rules

22

Statistically Significant Rules

• Support is a computationally efficient measure (anti-monotonic)
  – There is a tendency to "force" a statistical interpretation onto support

• Let's start with a statistically correct approach
  – And "force" it to be computationally efficient

Page 20: Classification Using Statistically Significant Rules

23

Exact Tests

• Let the class variable be {0,1}.

• Suppose we have two rules X → 1 and Y → 1

• We want to determine if the two rules are different, i.e., whether they have different effects on "causing" or "associating" with 1 or 0
  – e.g., medicine and placebo

Page 21: Classification Using Statistically Significant Rules

24

Exact Tests

We assume X and Y are binomial random variables with the same parameter p. We want to determine the probability that a specific table instance [a, b; c, d] occurs "purely by chance":

            X → 1            Y → 1
class 1     x1 [a]           x2 [b]           m1
class 0     n1 − x1 [c]      n2 − x2 [d]      m2
total       n1               n2               N = n1 + n2

Page 22: Classification Using Statistically Significant Rules

25

Exact Tests

P(X = x1 | X + Y = m1)
  = P(X = x1, Y = m1 − x1) / P(X + Y = m1)
  = P(X = x1) · P(Y = m1 − x1) / P(X + Y = m1)
  = [C(n1, x1) p^x1 (1 − p)^(n1 − x1)] · [C(n2, m1 − x1) p^(m1 − x1) (1 − p)^(n2 − m1 + x1)]
    / [C(n1 + n2, m1) p^m1 (1 − p)^(n1 + n2 − m1)]
  = C(n1, x1) · C(n2, m1 − x1) / C(n1 + n2, m1)

We can calculate the exact probability of a specific table instance without resorting to asymptotic approximations.

This can be used to calculate the p-value of [a,b;c,d]
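Under the binomial assumption above, this is the hypergeometric probability of the table given its margins; a sketch using scipy (my addition, with a, b, c, d laid out as on the previous slides):

from scipy.stats import hypergeom

def table_probability(a, b, c, d):
    """Exact probability of the 2x2 table [a, b; c, d] given its margins."""
    n1, n2 = a + c, b + d       # transactions covered by X and by Y
    m1 = a + b                  # total number of class-1 transactions
    # C(n1, a) * C(n2, m1 - a) / C(n1 + n2, m1)
    return hypergeom.pmf(a, n1 + n2, n1, m1)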

Page 23: Classification Using Statistically Significant Rules

26

Fisher Exact Test

• Given a table [a, b; c, d], the Fisher Exact Test finds the probability (p-value) of obtaining the given table or a more positively associated table, under the assumption that X and Y come from the same distribution.
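A sketch of computing that one-sided p-value with scipy (my addition; scipy.stats.fisher_exact takes the table as [[a, b], [c, d]]):

from scipy.stats import fisher_exact

def fet_p_value(a, b, c, d):
    """Probability of this table or a more positively associated one,
    under the null hypothesis that X and Y behave identically."""
    _odds_ratio, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p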

Page 24: Classification Using Statistically Significant Rules

27

Forcing “anti-monotonic”

• We test a rule X → 1 against all of its immediate generalizations {X − z → 1 : z ∈ X}

• The p-value of the rule is the minimum of the FET p-values over these comparisons (see the formula below)

• The rule is significant if
  – p-value < significance level (typically 0.05)

• Use a bottom up approach and only test rules whose immediate generalizations are significant

• Webb[06] has used Fisher Exact tests for generic association rule mining

p-value(X → 1) = min over z ∈ X of p_FET([a, b; c, d]), where [a, b; c, d] compares X → 1 with X − z → 1
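A simplified sketch of the test described on this slide (my reconstruction of the idea, not the authors' implementation): the 2×2 table for each generalization splits the cover of X − z by whether z holds, which is one plausible reading of the slides. Transactions are assumed to be sets of attribute-value items and labels are 0/1.

from scipy.stats import fisher_exact

def rule_p_value(X, transactions, labels):
    """Minimum FET p-value of the rule X -> 1 against its immediate generalizations."""
    p_values = []
    for z in X:
        # Transactions covered by the immediate generalization X - {z}.
        general = [(t, y) for t, y in zip(transactions, labels) if (X - {z}) <= t]
        a = sum(1 for t, y in general if z in t and y == 1)       # X holds, class 1
        b = sum(1 for t, y in general if z not in t and y == 1)   # only X - z, class 1
        c = sum(1 for t, y in general if z in t and y == 0)       # X holds, class 0
        d = sum(1 for t, y in general if z not in t and y == 0)   # only X - z, class 0
        _odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        p_values.append(p)
    return min(p_values) if p_values else 1.0

# A candidate rule X -> 1 is kept when rule_p_value(X, ...) < 0.05 and all of its
# immediate generalizations were themselves found significant (bottom-up search).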

Page 25: Classification Using Statistically Significant Rules

28

Example

• Suppose we have already determined that the rules (A1 = a1) → 1 and (A2 = a2) → 1 are significant.

• Now we want to test whether X = (A1 = a1) ∧ (A2 = a2) → 1 is significant.

• We carry out a FET on X versus X − {A1 = a1}, and on X versus X − {A2 = a2}.

• If the minimum of their p-values is less than the significance level we keep the rule X → 1; otherwise we discard it.

Page 26: Classification Using Statistically Significant Rules

30

Ranking Rules

• We have already observed that
  – A high confidence rule for the majority class may be "more" negatively correlated than the same rule predicting the other class
  – A high positively correlated rule that predicts the minority class may have a lower confidence than the same rule predicting the other class

Page 27: Classification Using Statistically Significant Rules

31

Experiments: Random Dataset

• Attributes independent and uniformly distributed.
• It makes no sense to find any rules, other than by chance.
• However, minSup = 1% and minConf = 0.5 mines 4149 of 13092 rules: over 31% of all possible rules.
• Using our FET technique with the standard significance level we find only 11 (0.08%).

Page 28: Classification Using Statistically Significant Rules

32

Experiments: Balanced Dataset

• Similar performance (within 1%)

Page 29: Classification Using Statistically Significant Rules

33

Experiments: Balanced Dataset

• But mines only 0.06% of the number of rules
• By searching only 0.5% of the search space
• And using only 0.4% of the time

Page 30: Classification Using Statistically Significant Rules

34

Experiments: Imbalanced Dataset

• Higher performance than support-confidence techniques
• Using 0.07% of the search space and time, and 0.7% of the number of rules

Page 31: Classification Using Statistically Significant Rules

35

Contributions

• Strong evidence and arguments against the use of support and confidence for imbalanced classification

• A simple technique for using Fisher's Exact test to find positively associated and statistically significant rules
  – Uses on average 0.4% of the time, searches only 0.5% of the search space, and finds only 0.06% as many rules as support-confidence techniques
  – Similar performance on balanced datasets, higher on imbalanced datasets

• Parameter free (except for the significance level)

Page 32: Classification Using Statistically Significant Rules

36

References

• Verhein and Chawla, Classification Using Statistically Significant Rules. http://www.it.usyd.edu.au/~chawla

• Arunasalam and Chawla, CCCS: A Top-Down Associative Classifier for Imbalanced Class Distribution. ACM SIGKDD 2006, pp. 517-522