
Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Page 1: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen

2005/9/23 Presenter: 董原賓

Page 2: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


The Characteristics of data streams

Continuity: Data continuously arrive at a high rate

Expiration: Data can be read only once

Infinity: The total amount of data is unbounded

Page 3: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

The requirements of data streams

Time-sensitivity: A model that adapts itself to the passage of time in a continuous data stream is needed

Approximation: Because past data cannot be stored, only approximate answers can be provided

Adjustability: Owing to the unlimited amount of data, a mechanism that adapts itself to the available resources is needed

Page 4: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

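Definition

t : time point

p : time period

Basic block B : the transactions that arrive in [t-p+1, t]; the basic block numbered i is denoted as Bi

|W| : length of the window

θ : support threshold

[Figure: a time axis from t-p+1 to t; the transactions abc, ac, and acd that arrive within this period of length p form the basic block B]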

Page 5: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Definition

TS : time-sensitive sliding window

TSi : the TS that consists of the |W| consecutive basic blocks from Bi-|W|+1 to Bi

∑i : the number of transactions in TSi

[Figure: a time axis from i-4 to i+1 with the basic blocks Bi-3 to Bi+1; with |W| = 3, TSi covers Bi-2 to Bi and ∑i = 5]
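The definition of ∑i can be illustrated with a minimal Python sketch. The function name `window_sizes` and the toy basic blocks are made up for illustration; they do not come from the slides.

```python
from collections import deque


def window_sizes(blocks, w):
    """Return {i: number of transactions in TS_i} for blocks B_1, B_2, ..."""
    window = deque(maxlen=w)  # holds at most the |W| most recent basic blocks
    sizes = {}
    for i, block in enumerate(blocks, start=1):
        window.append(block)  # the oldest block drops out once |W| is exceeded
        sizes[i] = sum(len(b) for b in window)
    return sizes


# Made-up basic blocks; each string is one transaction.
blocks = [["abc", "ac"], ["acd"], ["ab", "bd"], ["a"]]
sizes = window_sizes(blocks, w=3)
print(sizes)
```

A bounded `deque` is used here only because it discards the expired block automatically; any structure that keeps the last |W| blocks would do.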

Page 6: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Time-sensitive sliding window

The buffer continuously consumes transactions and pours them block by block into our system

Accuracy guarantees are provided in one of two modes: no false dismissal (NFD), which is recall-oriented, or no false alarm (NFA), which is precision-oriented

Page 7: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

New itemset insertion

Each frequent itemset is inserted into PFP (the potentially frequent-itemset pool) in the form of (ID, Items, Acount, Pcount), recording a unique identifier, the items in it, the accumulated count, and the potential count, respectively.

Acount accumulates its exact support counts in the subsequent basic blocks, while Pcount estimates the maximum possible sum of its support counts in the past basic blocks

Page 8: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

New itemset insertion

Check every frequent itemset discovered in Bi to see whether it is already kept in PFP. If it is, we increase its Acount by its support count in Bi.

Otherwise, we create a new entry in PFP and estimate its Pcount as the largest integer that is less than θ×∑i-1
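The insertion rule can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: PFP is modelled as a dictionary keyed by the itemset string (the ID field is omitted), the names `estimate_pcount` and `insert_frequent` are hypothetical, and flooring Pcount at 0 for an empty history is an assumption. The sample numbers are the counts reported for B1 and B2 in the worked example (θ = 0.4, ∑1 = 6).

```python
import math
from fractions import Fraction


def estimate_pcount(theta, sigma_prev):
    """Largest integer strictly less than theta * sigma_prev, floored at 0."""
    bound = Fraction(str(theta)) * sigma_prev  # exact arithmetic, no float drift
    return max(math.ceil(bound) - 1, 0)


def insert_frequent(pfp, frequent, theta, sigma_prev):
    """pfp maps itemset -> [Acount, Pcount]; frequent maps itemset -> support in B_i."""
    for itemset, support in frequent.items():
        if itemset in pfp:
            pfp[itemset][0] += support  # already kept: accumulate the exact count
        else:
            pfp[itemset] = [support, estimate_pcount(theta, sigma_prev)]
    return pfp


# PFP after B1, then the itemsets that are frequent in B2.
pfp = {"a": [4, 0], "b": [4, 0], "c": [3, 0], "d": [4, 0], "bd": [3, 0]}
insert_frequent(pfp, {"a": 3, "b": 2, "c": 2, "bc": 2}, theta=0.4, sigma_prev=6)
print(pfp)
```

`Fraction` is used so that a product such as 0.4×6 is compared exactly; with plain floats the "largest integer less than" test can be off by one.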

Page 9: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Old itemset update

For each itemset that is in PFP (the potentially frequent-itemset pool) but not frequent in Bi, we compute its support count in Bi by scanning the buffer and add it to its Acount.

An itemset in PFP is deleted if the sum of its Acount and Pcount is less than θ×∑i
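A minimal Python sketch of this step is given below, under stated assumptions: PFP is a dictionary keyed by the itemset string, the names `support_count` and `update_old` are hypothetical, and the transactions of B2 are a split chosen to be consistent with the support counts reported for B2 in the worked example (θ = 0.4, ∑2 = 11).

```python
from fractions import Fraction


def support_count(itemset, block):
    """Number of transactions in the block that contain every item of the itemset."""
    return sum(1 for t in block if set(itemset) <= set(t))


def update_old(pfp, frequent, block, theta, sigma_i):
    """pfp maps itemset -> [Acount, Pcount]; frequent holds the itemsets frequent in B_i."""
    for itemset, counts in pfp.items():
        if itemset not in frequent:
            counts[0] += support_count(itemset, block)  # scan the buffered block
    bound = Fraction(str(theta)) * sigma_i
    for itemset in list(pfp):  # copy the keys because entries are deleted
        acount, pcount = pfp[itemset]
        if acount + pcount < bound:
            del pfp[itemset]
    return pfp


# Assumed transactions of B2 and the PFP state right after new itemset insertion.
block_b2 = ["a", "bc", "ad", "a", "bc"]
pfp = {"a": [7, 0], "b": [6, 0], "c": [5, 0], "d": [4, 0], "bd": [3, 0], "bc": [2, 2]}
update_old(pfp, {"a", "b", "c", "bc"}, block_b2, theta=0.4, sigma_i=11)
print(pfp)
```

With these inputs d gains one count, while bd (3) and bc (2 + 2) fall below 0.4×11 = 4.4 and are removed.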

Page 10: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


DT maintenance

Each itemset in PFP is inserted into DT (discounting table) in the form of (B_ID, ID, Bcount), recording the serial number of the current basic block, the identifier in PFP, and its support count in the current basic block, respectively
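As a small illustration, DT maintenance can be sketched in Python like this. The function `dt_maintenance` and the dictionary `pfp_ids` (itemset to PFP identifier) are hypothetical names; the numbers are the counts reported for B3 in the worked example, where a does not occur and is therefore recorded with Bcount 0.

```python
def dt_maintenance(dt, pfp_ids, block_id, block_counts):
    """Append one (B_ID, ID, Bcount) entry per itemset currently kept in PFP."""
    for itemset, item_id in pfp_ids.items():
        # An itemset that does not occur in the block is recorded with Bcount 0.
        dt.append((block_id, item_id, block_counts.get(itemset, 0)))
    return dt


pfp_ids = {"a": 1, "b": 2, "c": 3, "d": 4}
dt = dt_maintenance([], pfp_ids, block_id=3, block_counts={"b": 2, "c": 1, "d": 1})
print(dt)
```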

Page 11: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Itemset discounting

Since the transactions in Bi-|W| will expire, the support counts of the itemsets kept in PFP are discounted accordingly

If the itemset's Pcount is nonzero, we subtract the support count thresholds of the expired basic blocks from Pcount

If Pcount is already 0, we subtract the Bcount of the corresponding entry in DT from Acount

Each entry in DT where B_ID = i−|W| is removed
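The discounting rules can be sketched in Python as below. This is a sketch under assumptions: PFP and DT are keyed by the itemset string rather than by ID, the function name `discount` is hypothetical, and the threshold to subtract from a nonzero Pcount is passed in by the caller because the slides do not spell out how it is looked up in TA. The sample data is the state at t = 4 of the worked example, when B1 expires.

```python
def discount(pfp, dt, expired_id, expired_threshold):
    """pfp: itemset -> [Acount, Pcount]; dt: list of (B_ID, itemset, Bcount)."""
    expired = {itemset: bcount for b_id, itemset, bcount in dt if b_id == expired_id}
    for itemset, counts in pfp.items():
        if counts[1] != 0:
            counts[1] -= expired_threshold  # discount the estimated part
        else:
            counts[0] -= expired.get(itemset, 0)  # discount the exact part
    dt[:] = [entry for entry in dt if entry[0] != expired_id]  # drop expired entries
    return pfp, dt


pfp = {"a": [7, 0], "b": [8, 0], "c": [6, 0], "d": [6, 0]}
dt = [(1, "a", 4), (1, "b", 4), (1, "c", 3), (1, "d", 4), (1, "bd", 3),
      (2, "a", 3), (2, "b", 2), (2, "c", 2), (2, "d", 1),
      (3, "a", 0), (3, "b", 2), (3, "c", 1), (3, "d", 1)]
discount(pfp, dt, expired_id=1, expired_threshold=2.4)
print(pfp)
```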

Page 12: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

TA update

TA (threshold array): dynamically compute the support count threshold θ×|Bi| for each basic block Bi and store it in an entry of the threshold array

Only |W|+1 entries are maintained in TA
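One way to keep only |W|+1 thresholds is a bounded queue. The sketch below is illustrative: the name `ta_update` is hypothetical, the newest threshold is placed first to mirror the order shown in the worked example, and the block sizes 6, 5, 3, 4, 5 are the ones implied by that example's thresholds at θ = 0.4.

```python
from collections import deque
from fractions import Fraction


def ta_update(ta, theta, block_size):
    """Store theta * |B_i| at the front; deque(maxlen=...) evicts the oldest entry."""
    ta.appendleft(float(Fraction(str(theta)) * block_size))
    return ta


w = 3
ta = deque(maxlen=w + 1)  # only |W|+1 thresholds are kept
history = []
for block_size in [6, 5, 3, 4, 5]:
    ta_update(ta, 0.4, block_size)
    history.append(list(ta))
print(history[-1])
```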

Page 13: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Algorithm

Page 14: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example: time axis 0 to 5 with basic blocks B1 to B5; |W| = 3, block size = 1, threshold = 0.4

t = 1, block B1: ab, bcd, ac, abd, bd, acd

Mining B1
frequent : a(4) b(4) c(3) d(4) bd(3)
infrequent : ab(2) ac(2) ad(2) bc(1) cd(2) bcd(1) abd(1) acd(1)

New itemset insertion
PFP (potentially frequent itemset pool), entries (ID, Itemset, Acount, Pcount):
(1, a, 4, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 4, 0) (5, bd, 3, 0)

DT maintenance
DT (discounting table), entries (B_ID, ID, Bcount):
(1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3)

TA update
TA (threshold array): 2.4
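The mining step for B1 can be reproduced with a brute-force Python sketch. Brute-force subset counting is only a stand-in here; the slides do not say which mining algorithm is applied to a basic block. The function name `mine_block` is hypothetical, and the six transactions are a split of the block's transaction string that is consistent with the reported support counts.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def mine_block(block, theta):
    """Count every itemset occurring in the block and split by theta * |B|."""
    counts = Counter()
    for transaction in block:
        items = sorted(set(transaction))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts["".join(combo)] += 1
    threshold = Fraction(str(theta)) * len(block)  # support count threshold
    frequent = {s: c for s, c in counts.items() if c >= threshold}
    infrequent = {s: c for s, c in counts.items() if c < threshold}
    return frequent, infrequent


block_b1 = ["ab", "bcd", "ac", "abd", "bd", "acd"]  # assumed split of B1
frequent, infrequent = mine_block(block_b1, theta=0.4)
print(frequent)
```

Enumerating all subsets is exponential in the transaction length, so this only suits tiny examples like the one on the slide.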

Page 15: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example: time axis 0 to 5 with basic blocks B1 to B5; |W| = 3, block size = 1, threshold = 0.4

t = 2, block B2: a, bc, ad, a, bc

Mining B2
frequent : a(3) b(2) c(2) bc(2)
infrequent : d(1) ad(1)

New itemset insertion
PFP before: (1, a, 4, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 4, 0) (5, bd, 3, 0)
PFP after: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 4, 0) (5, bd, 3, 0) (6, bc, 2, 2)
Pcount of bc = the largest integer less than θ×∑1 = 0.4×6 = 2.4, that is, 2

Old itemset update
PFP after: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 5, 0) (5, bd, 3, 0) (6, bc, 2, 2)
bd (3 + 0) and bc (2 + 2) are below θ×∑2 = 0.4×11 = 4.4 and are deleted

DT maintenance
DT before: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3)
DT after: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1)

TA update
TA before: 2.4
TA after: 2 2.4

Page 16: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example: time axis 0 to 5 with basic blocks B1 to B5; |W| = 3, block size = 1, threshold = 0.4

t = 3, block B3: b, bd, c

Mining B3
frequent : b(2)
infrequent : c(1) d(1) bd(1)

New itemset insertion
PFP before: (1, a, 7, 0) (2, b, 6, 0) (3, c, 5, 0) (4, d, 5, 0)
PFP after: (1, a, 7, 0) (2, b, 8, 0) (3, c, 5, 0) (4, d, 5, 0)

Old itemset update
PFP after: (1, a, 7, 0) (2, b, 8, 0) (3, c, 6, 0) (4, d, 6, 0)

DT maintenance
DT before: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1)
DT after: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1)

TA update
TA before: 2 2.4
TA after: 1.2 2 2.4

Page 17: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example: time axis 0 to 5 with basic blocks B1 to B5; |W| = 3, block size = 1, threshold = 0.4

t = 4, block B4: ab, abc, ab, ab

Mining B4
frequent : a(4) b(4) ab(4)
infrequent : c(1) abc(1)

Itemset discounting (B1 expires)
PFP before: (1, a, 7, 0) (2, b, 8, 0) (3, c, 6, 0) (4, d, 6, 0)
PFP after: (1, a, 3, 0) (2, b, 4, 0) (3, c, 3, 0) (4, d, 2, 0)
DT before: (1, 1, 4) (1, 2, 4) (1, 3, 3) (1, 4, 4) (1, 5, 3) (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1)
DT after: (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1)

New itemset insertion
PFP after: (1, a, 7, 0) (2, b, 8, 0) (3, c, 3, 0) (4, d, 2, 0) (5, ab, 4, 5)

Old itemset update
PFP after: (1, a, 7, 0) (2, b, 8, 0) (3, c, 4, 0) (4, d, 2, 0) (5, ab, 4, 5)
c (4 + 0) and d (2 + 0) are below θ×∑4 = 0.4×12 = 4.8 and are deleted

DT maintenance
DT after: (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4)

TA update
TA before: 1.2 2 2.4
TA after: 1.6 1.2 2 2.4

Page 18: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example: time axis 0 to 5 with basic blocks B1 to B5; |W| = 3, block size = 1, threshold = 0.4

t = 5, block B5: bc, bc, bc, bc, abc

Mining B5
frequent : b(5) c(5) bc(5)
infrequent : a(1) ab(1) abc(1)

Itemset discounting (B2 expires)
PFP before: (1, a, 7, 0) (2, b, 8, 0) (5, ab, 4, 5)
PFP after: (1, a, 4, 0) (2, b, 6, 0) (5, ab, 4, 2.6)
DT before: (2, 1, 3) (2, 2, 2) (2, 3, 2) (2, 4, 1) (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4)
DT after: (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4)

New itemset insertion
PFP after: (1, a, 4, 0) (2, b, 11, 0) (5, ab, 4, 2.6) (6, c, 5, 4) (7, bc, 5, 4)

Old itemset update
PFP after: (1, a, 5, 0) (2, b, 11, 0) (5, ab, 5, 2.6) (6, c, 5, 4) (7, bc, 5, 4)

DT maintenance
DT after: (3, 1, 0) (3, 2, 2) (3, 3, 1) (3, 4, 1) (4, 1, 4) (4, 2, 4) (4, 5, 4) (5, 1, 1) (5, 2, 5) (5, 5, 1) (5, 6, 5) (5, 7, 5)

TA update
TA before: 1.6 1.2 2 2.4
TA after: 2 1.6 1.2 2

Page 19: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Self-adjusting discounting table

In this approach, DT often consumes most of the memory space. When the space limit is reached, an efficient way to reduce the DT size without losing too much accuracy is required

Page 20: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Selective adjustment

Each entry DTk is in the new form of (B_ID, ID, Bcount, AVG, NUM, Loss)

DTk.AVG keeps the average of support counts for all the itemsets merged into DTk, DTk.NUM is the number of itemsets in DTk, while DTk.Loss records the merging loss of merging DTk with DTk-1

Page 21: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Selective adjustment

The main idea is to select the entry with the smallest merging loss, called the victim, and merge it into the entry above it

Page 22: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Merging loss and new Bcount

For k > 1 and DTk.B_ID = DTk-1.B_ID:

1. Under NFD (no false dismissal) mode

Bcount = min{DTk.Bcount, DTk-1.Bcount}

DTk.Loss = (DTk.NUM × DTk.AVG + DTk-1.NUM × DTk-1.AVG) − min{DTk.Bcount, DTk-1.Bcount} × (DTk.NUM + DTk-1.NUM)

2. Under NFA (no false alarm) mode

Bcount = max{DTk.Bcount, DTk-1.Bcount}

DTk.Loss = max{DTk.Bcount, DTk-1.Bcount} × (DTk.NUM + DTk-1.NUM) − (DTk.NUM × DTk.AVG + DTk-1.NUM × DTk-1.AVG)
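The two loss formulas translate directly into a small Python function. The name `merge_loss` and the dictionary layout of an entry are assumptions made for illustration; the sample values are taken from the example on the next slide.

```python
def merge_loss(cur, prev, mode="NFD"):
    """Merging loss of entry cur (DTk) with the entry prev (DTk-1) above it."""
    total = cur["num"] * cur["avg"] + prev["num"] * prev["avg"]
    size = cur["num"] + prev["num"]
    if mode == "NFD":  # merged Bcount is the minimum, so counts are underestimated
        return total - min(cur["bcount"], prev["bcount"]) * size
    if mode == "NFA":  # merged Bcount is the maximum, so counts are overestimated
        return max(cur["bcount"], prev["bcount"]) * size - total
    raise ValueError("mode must be 'NFD' or 'NFA'")


first = {"bcount": 12, "avg": 12, "num": 1}
second = {"bcount": 13, "avg": 13, "num": 1}
merged = {"bcount": 12, "avg": 12.5, "num": 2}
small = {"bcount": 2, "avg": 2, "num": 1}

loss_second = merge_loss(second, first)  # (13 + 12) - 12 * 2
loss_small = merge_loss(small, merged)   # (2 + 25) - 2 * 3
print(loss_second, loss_small)
```

In both modes the loss is the total amount by which the merged Bcount misstates the individual support counts, so a smaller loss means a cheaper merge.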

Page 23: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Example

DT_limit = 4, under NFD mode. Each entry is (B_ID, ID, Bcount, AVG, NUM, Loss).

1. Insert ID 1 with Bcount 12:
(1, 1, 12, 12, 1, ∞)

2. Insert ID 3 with Bcount 13:
(1, 1, 12, 12, 1, ∞) (1, 3, 13, 13, 1, 1)
Loss = (1×13 + 1×12) − min{13, 12} × (1+1) = 25 − 24 = 1

3. Insert IDs 4 and 5; DT now holds DT_limit = 4 entries:
(1, 1, 12, 12, 1, ∞) (1, 3, 13, 13, 1, 1) (1, 4, 2, 2, 1, 11) (1, 5, 10, 10, 1, 8)

4. The entry with ID 3 has the smallest loss and is merged into the entry above it:
(1, 1,3, 12, 12.5, 2, ∞) (1, 4, 2, 2, 1, 21) (1, 5, 10, 10, 1, 8)
AVG = (12×1 + 13×1) / (1+1) = 12.5
Loss of ID 4 = (1×2 + 2×12.5) − min{2, 12} × (1+2) = 27 − 6 = 21

5. Insert ID 6 with Bcount 10:
(1, 1,3, 12, 12.5, 2, ∞) (1, 4, 2, 2, 1, 21) (1, 5, 10, 10, 1, 8) (1, 6, 10, 10, 1, 0)
Loss = (1×10 + 1×10) − min{10, 10} × (1+1) = 20 − 20 = 0
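The example can be replayed with a short Python sketch of selective adjustment in NFD mode. It is a sketch under assumptions: the function names and the dictionary layout are hypothetical, and the merge is triggered lazily, just before an insertion that would exceed DT_limit, which is one possible reading of "when the space limit is reached".

```python
import math


def loss(cur, prev):
    """NFD merging loss of entry cur with the entry prev directly above it."""
    total = cur["num"] * cur["avg"] + prev["num"] * prev["avg"]
    return total - min(cur["bcount"], prev["bcount"]) * (cur["num"] + prev["num"])


def refresh(dt, k):
    """Recompute the loss of dt[k]; the first entry of a block has loss infinity."""
    if 0 <= k < len(dt):
        mergeable = k > 0 and dt[k]["b_id"] == dt[k - 1]["b_id"]
        dt[k]["loss"] = loss(dt[k], dt[k - 1]) if mergeable else math.inf


def merge_victim(dt):
    """Merge the entry with the smallest loss (the victim) into the entry above it."""
    k = min(range(len(dt)), key=lambda j: dt[j]["loss"])
    if math.isinf(dt[k]["loss"]):
        return  # no entry can be merged
    victim, above = dt[k], dt[k - 1]
    num = victim["num"] + above["num"]
    above["avg"] = (victim["num"] * victim["avg"] + above["num"] * above["avg"]) / num
    above["bcount"] = min(victim["bcount"], above["bcount"])  # NFD keeps the minimum
    above["num"] = num
    above["ids"] += victim["ids"]
    del dt[k]
    refresh(dt, k - 1)  # the merged entry
    refresh(dt, k)      # the entry that now follows it


def insert(dt, b_id, item_id, bcount, limit):
    """Append a new entry, merging one victim first if the table is full."""
    if len(dt) >= limit:
        merge_victim(dt)
    dt.append({"b_id": b_id, "ids": [item_id], "bcount": bcount,
               "avg": float(bcount), "num": 1, "loss": math.inf})
    refresh(dt, len(dt) - 1)


dt = []
for item_id, bcount in [(1, 12), (3, 13), (4, 2), (5, 10), (6, 10)]:
    insert(dt, 1, item_id, bcount, limit=4)
for entry in dt:
    print(entry)
```

After a merge only two losses change, the merged entry's and its successor's, which is why `refresh` is called on exactly those two positions.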

Page 24: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window

Experiment

Intel Pentium-M 1.3GHz CPU

256 MB main memory

Microsoft Windows XP Professional

The datasets streaming into this system are synthesized via the IBM data generator

Page 25: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Experiment

Page 26: Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window


Experiment