25
SLIQ and SPRINT for disk resident data

SLIQ and SPRINT for disk resident data

  • Upload
    zed

  • View
    56

  • Download
    0

Embed Size (px)

DESCRIPTION

SLIQ and SPRINT for disk resident data. SLIQ. SLIQ is a decision tree classifier that can handle both numerical and categorical attributes Builds compact and accurate trees Uses a pre-sorting technique in the tree growing phase Suitable for classification of large disk-resident datasets. - PowerPoint PPT Presentation

Citation preview

Page 1: SLIQ and SPRINT for disk resident data

SLIQ and SPRINTfor disk resident data

Page 2: SLIQ and SPRINT for disk resident data

SLIQ• SLIQ is a decision tree classifier that can handle

both numerical and categorical attributes• Builds compact and accurate trees • Uses a pre-sorting technique in the tree growing

phase• Suitable for classification of large disk-resident

datasets.

Page 3: SLIQ and SPRINT for disk resident data

Issues• There are two major, critical performance, issues in the tree-

growth phase:– How to find split points

– How to partition the data

• The well-known decision tree classifiers:– Grow trees depth-first

– Repeatedly sort the data at every node

• SLIQ:– Replace this repeated sorting with one-time sort

– Use new a data structure call class-list

– Class-list must remain memory resident at all time

Page 4: SLIQ and SPRINT for disk resident data

Some Data

rid age salary marital car

1 30 60 single sports

2 25 20 single mini

3 40 80 married van

4 45 100 single luxury

5 60 150 married luxury

6 35 120 single sports

7 50 70 married van

8 55 90 single sports

9 65 30 married mini

10 70 200 single luxury

Page 5: SLIQ and SPRINT for disk resident data

SLIQ - Attribute Lists

rid age

1 30

2 25

3 40

4 45

5 60

6 35

7 50

8 55

9 65

10 70

rid salary

1 60

2 20

3 80

4 100

5 150

6 120

7 70

8 90

9 30

10 200

rid marital

1 single

2 single

3 married

4 single

5 married

6 single

7 married

8 single

9 married

10 single

These are projections on (rid, attribute).

Page 6: SLIQ and SPRINT for disk resident data

SLIQ - Sort Numeric, Group Categorical

rid age

2 25

1 30

6 35

3 40

4 45

7 50

8 55

5 60

9 65

10 70

rid salary

2 20

9 30

1 60

7 70

3 80

8 90

4 100

6 120

5 150

10 200

rid marital

3 married

5 married

7 married

9 married

1 single

2 single

4 single

6 single

8 single

10 single

Page 7: SLIQ and SPRINT for disk resident data

SLIQ - Class List

rid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

N1

Page 8: SLIQ and SPRINT for disk resident data

SLIQ - Histograms

rid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

rid age

2 25

1 30

6 35

3 40

4 45

7 50

8 55

5 60

9 65

10 70

sports mini van luxury

L 0 0 0 0

R 3 2 2 3

sports mini van luxury

L

R

sports mini van luxury

L

R

...

N1

Evaluate each split, using GINI or Entropy.

age25 ?

age30 ?

Page 9: SLIQ and SPRINT for disk resident data

SLIQ - Histograms

rid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

rid age

2 25

1 30

6 35

3 40

4 45

7 50

8 55

5 60

9 65

10 70

sports mini van luxury

L 0 0 0 0

R 3 2 2 3

sports mini van luxury

L 0 1 0 0

R 3 1 2 3

sports mini van luxury

L 1 1 0 0

R 2 1 2 3

...

N1

Evaluate each split, using GINI or Entropy.

age25

age30

Page 10: SLIQ and SPRINT for disk resident data

SLIQ - Histogramsrid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

sports mini van luxury

L 0 0 0 0

R 3 2 2 3

sports mini van luxury

L 0 1 0 0

R 3 1 2 3

...

rid salary

2 20

9 30

1 60

7 70

3 80

8 90

4 100

6 120

5 150

10 200

sports mini van luxury

L 0 2 0 0

R 3 0 2 3

N1

Evaluate each split, using GINI or Entropy.

salary20

salary30

Page 11: SLIQ and SPRINT for disk resident data

SLIQ - Histograms

rid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

sports mini van luxury

Yes 0 1 2 1

No 3 1 0 2

sports mini van luxury

Yes 3 1 0 2

No 0 1 2 1

rid marital

3 married

5 married

7 married

9 married

1 single

2 single

4 single

6 single

8 single

10 single

Married

Single

N1

Evaluate each split, using GINI or Entropy.

Page 12: SLIQ and SPRINT for disk resident data

SLIQ - Perform best split and Update Class List

rid car LEAF

1 sports N1

2 mini N1

3 van N1

4 luxury N1

5 luxury N1

6 sports N1

7 van N1

8 sports N1

9 mini N1

10 luxury N1

N1 salary60

N2 N3

rid salary

2 20

9 30

1 60

7 70

3 80

8 90

4 100

6 120

5 150

10 200

Page 13: SLIQ and SPRINT for disk resident data

SLIQ - Perform best split and Update Class List

rid car LEAF

1 sports N2

2 mini N2

3 van N3

4 luxury N3

5 luxury N3

6 sports N3

7 van N3

8 sports N3

9 mini N2

10 luxury N3

N1 salary60

N2 N3

rid salary

2 20

9 30

1 60

7 70

3 80

8 90

4 100

6 120

5 150

10 200

Page 14: SLIQ and SPRINT for disk resident data

SLIQ - Histogramsrid age

2 25

1 30

6 35

3 40

4 45

7 50

8 55

5 60

9 65

10 70

sports mini van luxury

L 0 0 0 0

R 1 1 1 0

sports mini van luxury

L 0 0 0 0

R 2 0 2 3

...

rid car LEAF

1 sports N2

2 mini N2

3 van N3

4 luxury N3

5 luxury N3

6 sports N3

7 van N3

8 sports N3

9 mini N2

10 luxury N3

N1 salary60

N2 N3

sports mini van luxury

L

R

sports mini van luxury

L

R

N1

N2

N1

N2Evaluate each split, using GINI or Entropy.

age25 ?

Page 15: SLIQ and SPRINT for disk resident data

SLIQ - Histogramsrid age

2 25

1 30

6 35

3 40

4 45

7 50

8 55

5 60

9 65

10 70

sports mini van luxury

L 0 0 0 0

R 1 1 1 0

sports mini van luxury

L 0 0 0 0

R 2 0 2 3

...

rid car LEAF

1 sports N2

2 mini N2

3 van N3

4 luxury N3

5 luxury N3

6 sports N3

7 van N3

8 sports N3

9 mini N2

10 luxury N3

N1 salary60

N2 N3

sports mini van luxury

L 0 1 0 0

R 1 0 1 0

sports mini van luxury

L 0 0 0 0

R 2 0 2 3

N1

N2

N1

N2Evaluate each split, using GINI or Entropy.

age25

Page 16: SLIQ and SPRINT for disk resident data

SLIQ - Pseudocode• Split evaluation:

EvaluateSplits()for each numeric attribute A do

for each value v in the attribute list dofind the corresponding entry in the class

list, and hence the corresponding class and the leaf node Ni

update the class histogram in leaf Ni

compute splitting score for test (A ≤ v) for Ni

for each categorical attribute A dofor each leaf of the tree do

find subset of A with best split

Page 17: SLIQ and SPRINT for disk resident data

SLIQ - Pseudocode• Updating the class list

UpdateLabels()for each split leaf Ni do

Let A be the split attribute for Ni.

for each (rid,v) in the attribute list for A dofind the corresponding entry in the class

list e (using the rid)if the leaf referenced by e is Ni then

find the new leaf Nj to which (rid,v) belongs

(by applying the splitting test) update the leaf pointer for e to Nj

Page 18: SLIQ and SPRINT for disk resident data

SLIQ - bottleneck• Class-list must remain memory resident at all time!

– Although not a big problem with today's memories, still there might be cases where this is a bottleneck.

• So, what can we do when the class-list doesn't fit in main memory?– SPRINT is a solution...

Page 19: SLIQ and SPRINT for disk resident data

SPRINTrid age car

2 25 mini

1 30 sports

6 35 sports

3 40 van

4 45 luxury

7 50 van

8 55 sports

5 60 luxury

9 65 mini

10 70 luxury

rid salary car

2 20 mini

9 30 mini

1 60 sports

7 70 van

3 80 van

8 90 sports

4 100 luxury

6 120 sports

5 150 luxury

10 200 luxury

rid marital car

3 married van

5 married luxury

7 married van

9 married mini

1 single sports

2 single mini

4 single luxury

6 single sports

8 single sports

10 single luxury

The main data structures used in SPRINT are:Attribute lists and Class histograms

Page 20: SLIQ and SPRINT for disk resident data

SPRINT - Histograms

sports mini van luxury

L 0 0 0 0

R 3 2 2 3

sports mini van luxury

L 0 1 0 0

R 3 1 2 3

sports mini van luxury

L 1 1 0 0

R 2 1 2 3

...

rid age car

2 25 mini

1 30 sports

6 35 sports

3 40 van

4 45 luxury

7 50 van

8 55 sports

5 60 luxury

9 65 mini

10 70 luxury

age25

age30Evaluate each split, using GINI or Entropy.

Page 21: SLIQ and SPRINT for disk resident data

SPRINT - Histograms

sports mini van luxury

L 0 0 0 0

R 3 2 2 3

sports mini van luxury

L 0 1 0 0

R 3 1 2 3

...

sports mini van luxury

L 0 2 0 0

R 3 0 2 3

rid salary car

2 20 mini

9 30 mini

1 60 sports

7 70 van

3 80 van

8 90 sports

4 100 luxury

6 120 sports

5 150 luxury

10 200 luxury

salary20

salary30Evaluate each split, using GINI or Entropy.

Page 22: SLIQ and SPRINT for disk resident data

SPRINT - Histograms

sports mini van luxury

Yes 0 1 2 1

No 3 1 0 2

sports mini van luxury

Yes 3 1 0 2

No 0 1 2 1

Married

Single

rid marital car

3 married van

5 married luxury

7 married van

9 married mini

1 single sports

2 single mini

4 single luxury

6 single sports

8 single sports

10 single luxury

Evaluate each split, using GINI or Entropy.

Page 23: SLIQ and SPRINT for disk resident data

SPRINT - Performing Best Split• Once the best split point has been found for a node, we execute

the split by creating child nodes.

• Requires splitting the node’s lists for every attribute into two.

• Partitioning the attribute list of the winning attribute (salary) is easy.– We scan the list, apply the split test, and move the records to two

new attribute lists - one for each new child.

Page 24: SLIQ and SPRINT for disk resident data

SPRINT - Performing Best Split• Unfortunately, for the remaining attribute lists of the node (age

and marital), we have no test that we can apply to the attribute values to decide how to divide the records.

• Solution: use the rids. – As we partition the list of the splitting attribute (i.e. salary), we

insert the rids of each record into a probe structure (hash table), noting to which child the record was moved.

• Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record. – The retrieved information tells us with which child to place the

record.

Page 25: SLIQ and SPRINT for disk resident data

SPRINT - Performing Best Split• If the hash-table is too large for the memory, splitting is done in

more than one step. – The attribute list for the splitting attribute is partitioned up to the

attribute record for which the hash table will fit in memory;

– Portions of attribute lists of non-splitting attributes are partitioned; and the process is repeated for the remainder of the attribute list of the splitting attribute.