PCS - COLIMA, Sept 20-21, 2004

Efficient Data-Structures and Parallel Algorithms for Association Rules Discovery

Christophe Cérin, Michel Koskas, Gaël Le Mahec, Jean-Sébastien Gay


Outline

✘ Context
  • Large Scale Systems
  • Challenging Applications
  • Association Rule Discovery

✘ Data Mining
  • New Data Structure
  • New Algorithms

✘ Experimental results
  • Our Library - Preliminary results

✘ Conclusion & further work
  • Impl. choices - FT - SQL service


CLASSIFICATION OF DISTRIBUTED SYSTEMS

Two types of distributed systems:

  • "large scale" (P2P, Internet Computing): 100,000 - volatile - no trust - no identity
  • "grid computing": 1-100 - stable - trust - identity

[Figure: network diagram with two switching hubs]


GRID INITIATIVE IN FRANCE

French National Project: Grid’5000

ACI Masse de données: 1M EUR, 70 people, 3-year project, 4 sub-projects

Good properties of our approach for mining:

✘ DB scanned once

✘ Few and short messages


PROBLEM DESCRIPTION

✘ INPUT: list of tickets
    (A) champagne 13.45 EUR
    (B) appetizer 4.15 EUR
    (C) salmon 7.10 EUR
    ...

✘ Ex: if (AB) occurs 4000 times whereas (ABC) occurs 3000 times, produce: “if AB occurs, then there is a probability of 75% that C occurs too”


PROBLEM DESCRIPTION

➮ Alg. for generating rules:
    for all frequent sequences β do
        for all subsequences α < β do
            conf = fr(β) / fr(α)
            if conf > min_conf then
                output α ⇒ β
                output conf

➮ Main problem: discovering frequent episodes, which are rare in huge files (no enumeration, please!)
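The rule-generation loop can be sketched in C++ as follows. This is a minimal illustration, not the authors' code: itemsets are written as strings of sorted letters, the frequency table `fr` is assumed to be already computed, and the names `Rule` and `generate_rules` are hypothetical. Confidence is taken as fr(β) / fr(α), which is what gives 75% for the ticket example (AB 4000 times, ABC 3000 times).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A rule "alpha => beta" with its confidence (hypothetical type).
struct Rule {
    std::string alpha;
    std::string beta;
    double conf;
};

// For every frequent itemset beta and every proper, non-empty subset alpha
// found in the table, emit the rule when fr(beta) / fr(alpha) > min_conf.
std::vector<Rule> generate_rules(const std::map<std::string, long>& fr,
                                 double min_conf) {
    std::vector<Rule> rules;
    for (const auto& entry : fr) {
        const std::string& beta = entry.first;
        assert(beta.size() < 31);  // subsets are enumerated with a bit mask
        const unsigned full = (1u << beta.size()) - 1;
        for (unsigned mask = 1; mask < full; ++mask) {
            std::string alpha;
            for (std::size_t i = 0; i < beta.size(); ++i)
                if (mask & (1u << i)) alpha += beta[i];
            const auto it = fr.find(alpha);
            if (it == fr.end() || it->second == 0) continue;  // alpha unknown
            const double conf =
                static_cast<double>(entry.second) / static_cast<double>(it->second);
            if (conf > min_conf) rules.push_back({alpha, beta, conf});
        }
    }
    return rules;
}
```

The subset enumeration is exponential in the size of β, which is acceptable only because frequent itemsets are short; the expensive part remains finding them in the first place.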


PROBLEM DESCRIPTION

✘ KEY FACT of parallel / seq. algorithms

  ❄ Count the occurrences of each item (discard if support < XX → 1-itemsets)
  ❄ From the 1-itemsets, generate the 2-itemsets (discard if support < XX → 2-itemsets)
  ❄ Iterate until producing the last k-itemset

✘ OUR APPROACH: efficient ADT for counting and candidate (k-itemset) generation: Radix Trees
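The first step (counting each item and discarding the infrequent ones) can be sketched as a single scan of the database. This is only an illustration under assumptions: each ticket is a string of one-letter items, the occurrence lists are plain sorted vectors standing in for radix trees, and `frequent_items` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Lines = std::vector<std::size_t>;  // line numbers, in increasing order

// One scan of the database: record, for each item, the lines where it
// appears, then discard the items whose support is below the threshold.
std::map<char, Lines> frequent_items(const std::vector<std::string>& db,
                                     std::size_t min_support) {
    assert(min_support >= 1);
    std::map<char, Lines> occ;
    for (std::size_t line = 0; line < db.size(); ++line) {
        for (char item : db[line]) {
            Lines& where = occ[item];
            // An item repeated on the same ticket is counted once.
            if (where.empty() || where.back() != line) where.push_back(line);
        }
    }
    for (auto it = occ.begin(); it != occ.end();) {
        if (it->second.size() < min_support) it = occ.erase(it);
        else ++it;
    }
    return occ;
}
```

Keeping the line numbers rather than a bare counter is what lets later levels be computed from these sets alone, without rescanning the database.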


RADIX TREES

✓ For each item, we keep the line numbers where it appears

[Figure: radix tree for the line numbers 0 1 3 4 7 11, with its encodings as bit strings]

Space: 2.(2^h − 1) bits; h=20 => 262144 bytes

✓ Advantages: the complexity of a search is linked to the tree height - space - natural parallelism in tree management.
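One way to obtain the 2.(2^h − 1) bit figure is to store a complete binary tree of height h level by level in a bit vector, with an implicit root. The sketch below assumes that layout (the slides do not say which encoding the library uses), and the class name `RadixBits` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-height radix tree over line numbers in [0, 2^h), stored as one bit
// per node: levels 1..h hold 2 + 4 + ... + 2^h = 2*(2^h - 1) bits.
class RadixBits {
public:
    explicit RadixBits(unsigned height)
        : h_(height), bits_(2 * ((std::size_t{1} << height) - 1), false) {
        assert(height >= 1 && height < 31);
    }

    std::size_t bit_count() const { return bits_.size(); }

    // Mark every node on the path from the root to the leaf `line`.
    void insert(std::uint32_t line) {
        assert(line < (std::uint32_t{1} << h_));
        for (unsigned d = 1; d <= h_; ++d) bits_[index(line, d)] = true;
    }

    // A search follows a single root-to-leaf path: at most h bit tests.
    bool contains(std::uint32_t line) const {
        if (line >= (std::uint32_t{1} << h_)) return false;
        for (unsigned d = 1; d <= h_; ++d)
            if (!bits_[index(line, d)]) return false;
        return true;
    }

private:
    // Levels 1..d-1 occupy 2^d - 2 bits; within level d the node is
    // selected by the d most significant bits of `line`.
    std::size_t index(std::uint32_t line, unsigned d) const {
        return ((std::size_t{1} << d) - 2) + (line >> (h_ - d));
    }

    unsigned h_;
    std::vector<bool> bits_;
};
```

With h = 20 this gives 2097150 bits, which rounds up to the 262144 bytes quoted on the slide.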


RADIX TREE OPERATIONS (support = intersect)

[Figure: Union of Radix Tree. Two radix trees combined with ⋃ into one tree.]

[Figure: Intersection of Radix Tree. The same two radix trees combined with ⋂ into one tree.]
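Union and intersection can be sketched on a pointer-based binary radix tree as below. This is an illustrative implementation under assumptions, not the library's: the names `Node`, `unite`, `intersect` and `leaf_count` are hypothetical, and all trees share the same fixed height. The support of an itemset is then the number of leaves of the intersection of its items' trees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

struct Node {
    std::unique_ptr<Node> child[2];
};
using Tree = std::unique_ptr<Node>;  // nullptr stands for the empty set

// Insert `line`, reading its `height` low-order bits from the top down.
void insert(Tree& root, std::uint32_t line, unsigned height) {
    assert(height < 32 && (line >> height) == 0);
    if (!root) root = std::make_unique<Node>();
    Node* cur = root.get();
    for (unsigned d = height; d-- > 0;) {
        Tree& next = cur->child[(line >> d) & 1u];
        if (!next) next = std::make_unique<Node>();
        cur = next.get();
    }
}

// Union: a branch exists in the result if it exists in either operand.
Tree unite(const Node* a, const Node* b) {
    if (!a && !b) return nullptr;
    Tree r = std::make_unique<Node>();
    for (int i = 0; i < 2; ++i)
        r->child[i] = unite(a ? a->child[i].get() : nullptr,
                            b ? b->child[i].get() : nullptr);
    return r;
}

// Intersection: keep a branch only if it leads to a leaf present in both.
Tree intersect(const Node* a, const Node* b, unsigned height) {
    if (!a || !b) return nullptr;
    Tree r = std::make_unique<Node>();
    if (height == 0) return r;  // common leaf
    for (int i = 0; i < 2; ++i)
        r->child[i] = intersect(a->child[i].get(), b->child[i].get(), height - 1);
    if (!r->child[0] && !r->child[1]) return nullptr;  // prune empty subtree
    return r;
}

// Number of stored line numbers, i.e. the support.
std::size_t leaf_count(const Node* t, unsigned height) {
    if (!t) return 0;
    if (height == 0) return 1;
    return leaf_count(t->child[0].get(), height - 1) +
           leaf_count(t->child[1].get(), height - 1);
}
```

Both operations visit each node at most once, and the two subtrees of a node are independent, which is the natural parallelism the multithreaded version exploits.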


MULTITHREADING RADIX TREE OPERATIONS

✘ Problems: load balancing; maximum number of allowed threads

[Figure: binary tree split between Thread T1 and the Main Thread]

✘ Current implementation: a constant k (the maximum number of threads to be started) in the Radix-Tree class


MULTITHREADING RADIX TREE OPERATIONS

[Figure: binary tree split between the Main Thread, Thread T2 and Thread T1]

Heuristic (# threads > 2): start a thread on a right child; the left thread runs until it encounters a leaf.


MULTITHREADING RADIX TREE OPERATIONS

[Figure: binary tree split between the Main Thread, Thread T2 and Thread T1]

✘ Goal 1: limit busy waiting.

✘ Goal 2: limit pthread_create calls. Here: divide by 2 (complete binary tree).
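The heuristic can be sketched as follows: at a node with two children, a new thread takes the right child while the current thread continues on the left, and a fixed budget caps the number of threads ever created. This sketch uses `std::thread` instead of the pthread or Boost calls mentioned in the slides; `count_leaves`, `try_reserve` and the budget value are assumptions made for illustration, and leaf counting stands in for any tree operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>

struct Node {
    std::unique_ptr<Node> child[2];
};

// Build helper: insert `line` using `height` bits, most significant first.
void insert(std::unique_ptr<Node>& root, std::uint32_t line, unsigned height) {
    assert(height < 32 && (line >> height) == 0);
    if (!root) root = std::make_unique<Node>();
    Node* cur = root.get();
    for (unsigned d = height; d-- > 0;) {
        std::unique_ptr<Node>& next = cur->child[(line >> d) & 1u];
        if (!next) next = std::make_unique<Node>();
        cur = next.get();
    }
}

// Take one slot from the thread budget if any is left. Slots are never
// given back, so the budget bounds the total number of thread creations.
bool try_reserve(std::atomic<int>& budget) {
    int cur = budget.load();
    while (cur > 0)
        if (budget.compare_exchange_weak(cur, cur - 1)) return true;
    return false;
}

// Count leaves; a node without children is taken to be a leaf.
std::size_t count_leaves(const Node* t, std::atomic<int>& budget) {
    if (!t) return 0;
    const Node* left = t->child[0].get();
    const Node* right = t->child[1].get();
    if (!left && !right) return 1;

    std::size_t right_count = 0;
    std::thread worker;
    if (left && right && try_reserve(budget)) {
        // New thread on the right child; this thread keeps going left.
        worker = std::thread([&right_count, right, &budget] {
            right_count = count_leaves(right, budget);
        });
    } else {
        right_count = count_leaves(right, budget);
    }
    const std::size_t left_count = count_leaves(left, budget);
    if (worker.joinable()) worker.join();  // blocking join, no busy waiting
    return left_count + right_count;
}
```

A call such as `std::atomic<int> budget(2); count_leaves(root.get(), budget);` starts at most two extra threads. On many toolchains the program must be linked with the platform's thread library (for example `-pthread`).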


CANDIDATE GENERATION

✓ Two kinds of information stored in Radix Trees

A B C D

t1 t2 t4 t1 t4


CANDIDATE GENERATION

✓We focus on 2-itemset generation

A B C D

AA AB AC AD CA CB CC CD


CANDIDATE GENERATION

✓ Important note: operations on Radix Trees only!

A B C D

AB AC AD BC BD CD

ABC ABD ACD BCD

ABCD
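Moving from one level of this lattice to the next can be sketched as a join of itemsets that share their first k-1 items, with the new transaction set obtained by intersection. The sketch is an assumption-laden illustration: sorted vectors of transaction ids and `std::set_intersection` stand in for radix trees and their intersection, and `Level` and `next_level` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TidSet = std::vector<int>;              // sorted transaction ids
using Level = std::map<std::string, TidSet>;  // itemset (sorted letters) -> tids

// Build the frequent (k+1)-itemsets from the frequent k-itemsets.
Level next_level(const Level& lk, std::size_t min_support) {
    Level next;
    for (auto a = lk.begin(); a != lk.end(); ++a) {
        for (auto b = std::next(a); b != lk.end(); ++b) {
            const std::string& x = a->first;
            const std::string& y = b->first;
            assert(!x.empty() && x.size() == y.size());
            // The map is sorted, so itemsets sharing the first k-1 items of x
            // are contiguous: stop at the first one with a different prefix.
            if (x.compare(0, x.size() - 1, y, 0, y.size() - 1) != 0) break;
            TidSet common;
            std::set_intersection(a->second.begin(), a->second.end(),
                                  b->second.begin(), b->second.end(),
                                  std::back_inserter(common));
            // Support of the candidate = size of the intersection.
            if (common.size() >= min_support)
                next[x + y.back()] = std::move(common);
        }
    }
    return next;
}
```

Starting from A, B, C, D, repeated calls walk the lattice level by level (AB, AC, ... then ABC, ...) and stop when a level comes back empty, without ever rereading the transaction database.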


MAIN PARALLEL ALGORITHM

Algorithm executed on each Proc. i, 0 ≤ i < p.

/* Initially, each processor holds locally n/p lines of the transaction database, where n is the total number of lines and p is the number of processors. */

1- In parallel for each processor:
     Scan the local database to construct the 1-itemset tree.

2- In parallel for each processor:
     do
       Broadcast supports.
       /* This part can be de-synchronized */
       /* to perform overlapping (see above) */
       Wait for all supports from others.
       Perform the sum reductions.
       Eliminate itemsets with insufficient support.
       Lk = rest of Ck
       Construct the new candidate sets Ck+1.
     while (Ck+1 ≠ ∅)

3- frequent itemsets = ⋃ Lk
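The reduction step of the loop can be sketched without any message passing by treating each processor's local supports as one entry of a vector. This sequential simulation is an assumption made for illustration (the slides announce an MPI implementation), and `reduce_supports` and `keep_frequent` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Supports = std::map<std::string, long>;  // itemset -> support

// Sum reduction: the global support of a candidate is the sum of the
// supports counted by each processor on its n/p local lines.
Supports reduce_supports(const std::vector<Supports>& local) {
    Supports global;
    for (const Supports& s : local)
        for (const auto& kv : s) global[kv.first] += kv.second;
    return global;
}

// Lk = rest of Ck: keep the candidates whose global support is sufficient.
Supports keep_frequent(const Supports& ck, long min_support) {
    assert(min_support >= 1);
    Supports lk;
    for (const auto& kv : ck)
        if (kv.second >= min_support) lk.insert(kv);
    return lk;
}
```

Since every processor receives the same supports, each one computes the same Lk and the same next candidates on its own; only the supports travel, which keeps the messages few and short.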


Bench (1 thread) - SUN bi-Opteron v20z

[Plot: Time (sec.) versus # items (× 1000, from 40 to 150), one curve for Lists and one for Radix Trees]


MAIN LESSONS

➮ Two threads > one thread (http://www.boost.org thread impl.)

➮ Bitset, Hierarchy of bitsets?

    #include <bitset>
    #include <cstddef>
    #include <iostream>

    // Repeated OR of two large bitsets while bits of one operand get set.
    int main() {
        // static: about 1.6 MB in total, kept off the stack
        static std::bitset<4294967> aaaa(0);
        static std::bitset<4294967> bbbb(1);
        static std::bitset<4294967> cccc(0);

        // Three passes of 600 steps, setting bits i, 2*i, then 3*i.
        for (std::size_t stride = 1; stride <= 3; ++stride) {
            for (std::size_t i = 0; i < 600; ++i) {
                cccc = aaaa | bbbb;
                aaaa[stride * i] = 1;
            }
        }
        // std::cout << cccc << "\n";
        return 0;
    }

Properties: TIME, MEMORY SPACE, EASY TO MANAGE, DISK SPACE


CONCLUSION

➽ New approaches for computing candidates

➽ Parallel Alg. + Multithreading of Radix tree operations

➽ MPI code available soon (+ MPI-V → FT MPI developed by F. Cappello in the GridExplorer Initiative)

➽ How to represent Radix Trees (vector of bits?) → efficient library for yet another application in the project: SQL service


CONCLUSION

Efficient Data-Structures and Parallel Algorithms for Association Rules Discovery

www.laria.u-picardie.fr/~cerin/zeta.html
