
1

Mining Association Rules

Mohamed G. Elfeky

2

Introduction

Data mining is the discovery of knowledge and useful information from large amounts of data stored in databases.

Association Rules: describe association relationships among the attributes in the set of relevant data.

3

Rules

Body ==> Consequent [Support, Confidence]

Body: represents the examined data.

Consequent: represents a discovered property of the examined data.

Support: represents the percentage of the records satisfying both the body and the consequent.

Confidence: represents the percentage of the records satisfying both the body and the consequent, relative to those satisfying the body.
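To make the definitions concrete, here is a minimal Python sketch that computes support and confidence exactly as defined above; the basket data and item names are made up for illustration:

# Hypothetical basket data; each transaction is a set of items.
transactions = [
    {"tea", "milk", "sugar"},
    {"tea", "milk"},
    {"milk", "sugar"},
    {"tea", "milk", "sugar"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, consequent, transactions):
    # support(body and consequent together) / support(body).
    return support(body | consequent, transactions) / support(body, transactions)

body, consequent = {"tea", "milk"}, {"sugar"}
print(support(body | consequent, transactions))    # 0.5
print(confidence(body, consequent, transactions))  # 0.666...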

4

Association Rules Examples

Basket Data: Tea ^ Milk ==> Sugar [0.3, 0.9]

Relational Data: x.diagnosis = Heart ^ x.sex = Male ==> x.age > 50 [0.4, 0.7]

Object-Oriented Data: s.hobbies = {sport, art} ==> s.age() = Young [0.5, 0.8]

5

Topics of Discussion

Formal Statement of the Problem

Different Algorithms: AIS, SETM, Apriori, AprioriTid, AprioriHybrid

Performance Analysis

6

Formal Statement of the Problem

I = {i1, i2, …, im} is a set of items.

D is a set of transactions T.

Each transaction T is a set of items (a subset of I).

TID is a unique identifier associated with each transaction.

The problem is to generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

7

Problem Decomposition

The problem can be decomposed into two subproblems:

1. Find all sets of items (itemsets) whose support (the number of transactions containing them) is greater than the minimum support. These are the large itemsets.

2. Use the large itemsets to generate the desired rules.

For each large itemset l, find all non-empty proper subsets; for each subset a, generate the rule a ==> (l - a) if its confidence is greater than the minimum confidence, as sketched below.
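A minimal Python sketch of this rule-generation step, assuming the large itemsets and their support counts have already been collected into a dict; the function and variable names are mine, and the usage data is taken from the running example later in the deck:

from itertools import combinations

def generate_rules(large_supports, min_conf):
    # large_supports maps frozenset itemsets to their support counts.
    # Every subset of a large itemset is itself large (downward closure),
    # so large_supports[a] is guaranteed to exist.
    for l, l_count in large_supports.items():
        if len(l) < 2:
            continue
        for k in range(1, len(l)):
            for a in map(frozenset, combinations(l, k)):
                conf = l_count / large_supports[a]
                if conf >= min_conf:
                    yield a, l - a, conf

# Support counts from the example database used throughout this deck.
supports = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}
for body, cons, conf in generate_rules(supports, min_conf=0.9):
    print(set(body), "==>", set(cons), round(conf, 2))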

8

General Algorithm

1. In the first pass, the support of each individual item is counted, and the large ones are determined.

2. In each subsequent pass, the large itemsets determined in the previous pass are used to generate new itemsets called candidate itemsets.

3. The support of each candidate itemset is counted, and the large ones are determined.

4. This process continues until no new large itemsets are found.
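A Python skeleton of this level-wise loop, with the candidate-generation step left as a parameter since the algorithms that follow differ exactly there; the representation (transactions as sets, itemsets as frozensets) and all names are illustrative assumptions:

def levelwise(transactions, min_support, candidate_gen):
    # Pass 1: count every individual item and keep the large ones.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_support}
    all_large = dict(large)
    # Subsequent passes: generate candidates from the previous pass's
    # large itemsets, count them, keep the large ones; stop when no new
    # large itemsets are found.
    while large:
        candidates = candidate_gen(large)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in counts:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= min_support}
        all_large.update(large)
    return all_large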

9

AIS Algorithm

Candidate itemsets are generated and counted on-the-fly as the database is scanned.

1. For each transaction, it is determined which of the large itemsets of the previous pass are contained in this transaction.

2. New candidate itemsets are generated by extending these large itemsets with other items in this transaction.

The disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.
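A sketch of the per-transaction extension step in Python; extending only with items beyond the itemset's largest member (assuming an ordering on items) keeps each candidate from being generated twice per transaction. The function name and details are mine, not the paper's pseudocode:

def ais_extend(transaction, prev_large):
    # Extend each previous-pass large itemset contained in this
    # transaction with the transaction's items beyond its largest member.
    candidates = set()
    for l in prev_large:
        if l <= transaction:
            for item in transaction:
                if item > max(l):
                    candidates.add(frozenset(l | {item}))
    return candidates

For transaction 300 = {1 2 3 5} and the L1 of the example that follows, this yields {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}.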

10

Example

Database:

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:

Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2 (* marks large itemsets):

Itemset   Support
{1 3}*    2
{1 4}     1
{3 4}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2
{1 2}     1
{1 5}     1

C3:

Itemset    Support
{1 3 4}    1
{2 3 5}*   2
{1 3 5}    1

11

SETM Algorithm

Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.

1. New candidate itemsets are generated the same way as in the AIS algorithm, but the TID of the generating transaction is saved with the candidate itemset in a sequential structure.

2. At the end of the pass, the support count of each candidate itemset is determined by aggregating this sequential structure.

It has the same disadvantage as the AIS algorithm. Another disadvantage is that for each candidate itemset there are as many entries as its support value.
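A sketch of one SETM pass in Python, modeling the sequential structure as a list of (itemset, TID) entries; the candidate extension is inlined (same idea as the AIS sketch above), and the end-of-pass aggregation is what distinguishes SETM from AIS. Names and data layout are assumptions, with the usage data copied from the example that follows:

from collections import Counter

def setm_pass(db, prev_large, min_support):
    # db is a list of (tid, transaction) pairs.
    entries = []  # one (candidate itemset, TID) entry per occurrence
    for tid, t in db:
        for l in prev_large:
            if l <= t:
                for item in t:
                    if item > max(l):  # extend past l's largest item
                        entries.append((frozenset(l | {item}), tid))
    # Aggregate the sequential structure only at the end of the pass.
    counts = Counter(itemset for itemset, _ in entries)
    return {s: c for s, c in counts.items() if c >= min_support}

db = [(100, {1, 3, 4}), (200, {2, 3, 5}), (300, {1, 2, 3, 5}), (400, {2, 5})]
l1 = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})]
print(setm_pass(db, l1, min_support=2))  # the large 2-itemsets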

12

Example

Database:

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:

Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:

Itemset   TID
{1 3}     100
{1 4}     100
{3 4}     100
{2 3}     200
{2 5}     200
{3 5}     200
{1 2}     300
{1 3}     300
{1 5}     300
{2 3}     300
{2 5}     300
{3 5}     300
{2 5}     400

C3:

Itemset    TID
{1 3 4}    100
{2 3 5}    200
{1 3 5}    300
{2 3 5}    300

13

Apriori Algorithm

Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the transactions in the database.

1. The set of large itemsets of the previous pass is joined with itself to generate all itemsets whose size is larger by one.

2. Each generated itemset that has any subset which is not large is deleted. The remaining itemsets are the candidates.
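A sketch of this join-and-prune candidate generation (often called apriori-gen) in Python; the sorted-tuple representation is an implementation choice, not mandated by the algorithm, and the usage data is L2 from the example that follows:

from itertools import combinations

def apriori_gen(prev_large):
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    k = len(prev[0]) + 1
    candidates = set()
    for a in prev:
        for b in prev:
            # Join itemsets sharing their first k-2 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a) | {b[-1]})
    prev_set = set(map(frozenset, prev))
    # Prune: every (k-1)-subset of a candidate must itself be large.
    return {c for c in candidates
            if all(frozenset(sub) in prev_set
                   for sub in combinations(c, k - 1))}

l2 = [frozenset({1, 3}), frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]
print(apriori_gen(l2))  # {frozenset({2, 3, 5})}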

14

Example

Database:

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:

Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2 (* marks large itemsets):

Itemset   Support
{1 2}     1
{1 3}*    2
{1 5}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2

Joining L2 with itself yields {1 2 3}, {1 3 5}, and {2 3 5}; pruning deletes {1 2 3} (its subset {1 2} is not large) and {1 3 5} (its subset {1 5} is not large), leaving:

C3:

Itemset    Support
{2 3 5}*   2

15

AprioriTid Algorithm

The database is not used at all for counting the support of candidate itemsets after the first pass.

1. The candidate itemsets are generated the same way as in the Apriori algorithm.

2. Another set C’ is generated, in which each member holds the TID of a transaction and the large itemsets present in that transaction. This set is used to count the support of each candidate itemset.

The advantage is that the number of entries in C’ may be smaller than the number of transactions in the database, especially in the later passes.
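A sketch of one AprioriTid pass in Python, counting over C’ rather than the database. The containment test relies on the fact that a k-candidate is present in a transaction exactly when the two (k-1)-itemsets it was joined from are present; the representation and names are mine, with the usage data taken from the example that follows:

def aprioritid_pass(c_prime, candidates, min_support):
    # c_prime maps each TID to the set of previous-pass candidate
    # itemsets contained in that transaction; the raw database is
    # never touched.
    counts = {c: 0 for c in candidates}
    new_c_prime = {}
    for tid, present in c_prime.items():
        entry = set()
        for c in candidates:
            hi, second = sorted(c)[-1], sorted(c)[-2]
            # c is contained in the transaction iff the two (k-1)-itemsets
            # it was joined from are both in the transaction's entry.
            if (c - {hi}) in present and (c - {second}) in present:
                counts[c] += 1
                entry.add(c)
        if entry:
            new_c_prime[tid] = entry
    large = {c: n for c, n in counts.items() if n >= min_support}
    return large, new_c_prime

# C'2 from the next slide, with candidate {2 3 5}:
c2_prime = {
    100: {frozenset({1, 3})},
    200: {frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})},
    300: {frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 5}),
          frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})},
    400: {frozenset({2, 5})},
}
large, c3_prime = aprioritid_pass(c2_prime, {frozenset({2, 3, 5})}, 2)
print(large)     # {frozenset({2, 3, 5}): 2}
print(c3_prime)  # entries only for TIDs 200 and 300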

16

Example

Database:

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

L1:

Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2 (* marks large itemsets):

Itemset   Support
{1 2}     1
{1 3}*    2
{1 5}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2

C’2:

TID   Itemsets present
100   {1 3}
200   {2 3}, {2 5}, {3 5}
300   {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400   {2 5}

C3:

Itemset    Support
{2 3 5}*   2

C’3:

TID   Itemsets present
200   {2 3 5}
300   {2 3 5}

17

Performance Analysis

18

AprioriHybrid Algorithm

Performance analysis shows that:

1. Apriori does better than AprioriTid in the earlier passes.

2. AprioriTid does better than Apriori in the later passes.

Hence, a hybrid algorithm can be designed that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C’ will fit in memory.
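A sketch of the switching decision; the size estimate for C’ (roughly one entry per candidate occurrence plus one TID per transaction) and the memory budget are assumptions for illustration, not figures from the analysis:

def should_switch(candidate_counts, num_transactions, memory_budget_entries):
    # C' holds about one entry per (candidate, transaction) occurrence
    # plus a TID per transaction, so its size is roughly the sum of the
    # candidates' support counts plus the number of transactions.
    # Switch to AprioriTid once that estimate fits in memory.
    estimate = sum(candidate_counts.values()) + num_transactions
    return estimate <= memory_budget_entries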
