
Page 1: Handling Numeric Attributes in Hoeffding Trees

Handling Numeric Attributes in Hoeffding Trees

Bernhard Pfahringer, Geoff Holmes and Richard Kirkby

Page 2: Handling Numeric Attributes in Hoeffding Trees


Overview

• Hoeffding trees are excellent for classification tasks on data streams.

• Handling numeric attributes well is crucial to the performance of conventional decision tree learners (for example, the improvements from C4.5 to C4.8)

• Does handling numeric attributes matter for streamed data?

• We implement a range of methods and empirically evaluate their accuracy and costs.

Page 3: Handling Numeric Attributes in Hoeffding Trees


Data Streams - reminder

• The idea is that data arrives from a continuous source:
– Examples are processed one at a time (each inspected only once)
– Memory is limited (!)
– Model construction must scale (at most N log N in the number of examples)
– Be ready to predict at any time

• Because memory is limited, this has implications for any numeric-handling method you might construct

• Only consider methods that work as the tree is built

Page 4: Handling Numeric Attributes in Hoeffding Trees


Main assumptions/limitations

• Assume a stationary concept, i.e. no concept drift or change
– may seem very limiting, but …

• Three-way trade-off:
– memory
– speed
– accuracy

• Used only artificial data sources

Page 5: Handling Numeric Attributes in Hoeffding Trees


Hoeffding Trees

• Introduced by Domingos and Hulten (VFDT)
• An “extension” of decision trees to streams
• HT Algorithm:
– Init tree T to a root node
– For each example from the stream:
• Find the leaf L for this example
• Update the counts in L with the example's attribute values and compute the split function (e.g. Info Gain, IG) for each attribute
• If IG(best attr) – IG(next best attr) > ε then split L on the best attribute
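
The heart of the algorithm is the split decision: split only when the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), with R the range of the split function (log₂(#classes) for information gain), says the attribute that currently looks best is very likely the truly best one. Below is a minimal Python sketch of just that decision; the function names, the choice of δ, and the example gain values are illustrative assumptions, not the VFDT/MOA code.

import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a variable with the given
    # range lies within epsilon of the mean observed over n examples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def choose_split(gains, n_seen, num_classes, delta=1e-7):
    # gains: attribute name -> information gain estimated from the leaf's counts
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, best_gain), (_, second_gain) = ranked[0], ranked[1]
    R = math.log2(num_classes)            # range of information gain
    eps = hoeffding_bound(R, delta, n_seen)
    if best_gain - second_gain > eps:     # IG(best attr) - IG(next best attr) > epsilon
        return best_attr                  # confident enough to split now
    return None                           # otherwise keep accumulating examples

# Example: after 5000 examples the gap 0.31 - 0.19 exceeds the bound, so split.
print(choose_split({"age": 0.31, "income": 0.19}, n_seen=5000, num_classes=2))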

Page 6: Handling Numeric Attributes in Hoeffding Trees


Active leaf data structure

• For each class value:
– for each nominal attribute:
• for each possible value: keep a sum of counts/weights
– for each numeric attribute:
• keep sufficient statistics to approximate the distribution
• various possibilities: here assume a normal distribution, so estimate/record n, mean, variance, plus min/max
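
To make this concrete, here is a small sketch of the statistics kept for one (numeric attribute, class) pair, updated incrementally as examples arrive. The class name and the use of Welford-style online updates are implementation assumptions for illustration, not necessarily what the authors' code does.

class GaussianEstimator:
    """Incrementally maintained n, mean, variance, min and max for one
    (numeric attribute, class) pair inside an active leaf."""
    def __init__(self):
        self.n = 0.0
        self.mean = 0.0
        self.m2 = 0.0                 # weighted sum of squared deviations
        self.min = float("inf")
        self.max = float("-inf")

    def add(self, value, weight=1.0):
        self.n += weight
        delta = value - self.mean
        self.mean += weight * delta / self.n
        self.m2 += weight * delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    def variance(self):
        return self.m2 / (self.n - 1.0) if self.n > 1.0 else 0.0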

Page 7: Handling Numeric Attributes in Hoeffding Trees


Numeric Handling Methods

• VFDT (VFML – Hulten & Domingos, 2003)
– Summarize the numeric distribution with a histogram made up of a maximum number of bins N (default 1000)
– Bin boundaries are determined by the first N unique values seen in the stream
– Issues: the method is sensitive to data order, and choosing a good N for a particular problem is hard

• Exhaustive Binary Tree (BINTREE – Gama et al, 2003)
– Closest to an implementation of a batch method
– Incrementally update a binary tree as data is observed
– Issues: high memory cost, high cost of split search, sensitive to data order
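
As a concrete illustration of the VFML-style summary above, here is a rough sketch in which the first N distinct values seen become fixed bin boundaries and every later value only increments the count of an existing bin. The class name, the bin-assignment rule and the single (rather than per-class) count per bin are simplifying assumptions, not the VFML code itself.

import bisect

class BinnedHistogram:
    """First-N-distinct-values binning for one numeric attribute."""
    def __init__(self, max_bins=1000):
        self.max_bins = max_bins
        self.boundaries = []          # sorted distinct values seen first
        self.counts = {}              # boundary value -> accumulated weight

    def add(self, value, weight=1.0):
        if value in self.counts:
            self.counts[value] += weight
        elif len(self.boundaries) < self.max_bins:
            bisect.insort(self.boundaries, value)   # a new value opens a new bin
            self.counts[value] = weight
        else:
            # once all bins exist, count the value in the bin of the nearest
            # boundary at or below it (or the first bin if below them all)
            i = bisect.bisect_right(self.boundaries, value)
            nearest = self.boundaries[max(i - 1, 0)]
            self.counts[nearest] += weight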

Page 8: Handling Numeric Attributes in Hoeffding Trees


Numeric Handling Methods

• Quantile Summaries (GK – Greenwald and Khanna, 2001)
– Motivation comes from the VLDB community
– Maintain a sample of values (quantiles) plus the range of possible ranks that the samples can take (tuples)
– Extremely space efficient
– Issues: must fix a maximum number of tuples per summary

Page 9: Handling Numeric Attributes in Hoeffding Trees


Numeric Handling Methods

• Gaussian Approximation (GAUSS)
– Assume values conform to a normal distribution
– Maintain five numbers per class (mean, variance, weight, max, min)
– Note: not sensitive to data order
– Incrementally updateable
– Using the per-class min/max information, split the range into N equal parts
– For each part, use the five numbers per class to compute the approximate class distribution
• Use the above to compute the IG of that split
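
The following sketch shows how such a split search could be carried out, reusing the GaussianEstimator sketched on the active-leaf slide: the observed [min, max] range is divided into N parts, the normal CDF estimates how much of each class falls below each candidate threshold, and the information gain of each candidate is computed from those estimates. The candidate generation and the entropy-based gain are my reading of the description above, not the authors' exact formulation.

import math

def normal_cdf(x, mean, std):
    if std == 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def entropy(weights):
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0.0)

def best_gauss_split(per_class, n_parts=10):
    """per_class: class label -> GaussianEstimator for one numeric attribute."""
    estimators = list(per_class.values())
    lo = min(e.min for e in estimators)
    hi = max(e.max for e in estimators)
    totals = [e.n for e in estimators]
    base = entropy(totals)
    grand = sum(totals)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n_parts):
        t = lo + i * (hi - lo) / n_parts        # split the range into N equal parts
        # estimated class weights on each side of threshold t
        left = [e.n * normal_cdf(t, e.mean, math.sqrt(e.variance()))
                for e in estimators]
        right = [n - l for n, l in zip(totals, left)]
        gain = base - (sum(left) / grand) * entropy(left) \
                    - (sum(right) / grand) * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_threshold, best_gain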

Page 10: Handling Numeric Attributes in Hoeffding Trees


Gaussian approximation – 2 class problem

Page 11: Handling Numeric Attributes in Hoeffding Trees


Gaussian approximation – 3 class problem

Page 12: Handling Numeric Attributes in Hoeffding Trees


Gaussian approximation – 4 class problem

Page 13: Handling Numeric Attributes in Hoeffding Trees


Empirical Evaluation

• Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)

• Vary parameters of some methods (VFML10,100,1000; BT; GK100,1000; GAUSS10,100)

• Train models for 10 hours – then test on one million (holdout) examples

• Define three application scenarios:
– Sensor network (100KB memory limit)
– Handheld (32MB)
– Server (400MB)

Page 14: Handling Numeric Attributes in Hoeffding Trees


Data generators

• Random tree (Domingos & Hulten):
– (RTS) 10 numeric, 10 nominal with 5 values, 2 classes, leaves start at level 3, max level 5; plus a version with 10% noise added (RTSN)
– (RTC) 50 numeric, 50 nominal with 5 values, 2 classes, leaves start at level 5, max level 10; plus a version with 10% noise added (RTCN)

• Random RBF (Kirkby):
– (RRBFS) 10 numeric, 100 centers, 2 classes
– (RRBFC) 50 numeric, 1000 centers, 2 classes

• Waveform (Aha):
– (Wave21) 21 noisy numeric; (Wave40) +19 irrelevant numeric; 3 classes

• (GenF1–GenF10) (Agrawal et al):
– hypothetical loan applications, 10 different rules over 6 numeric + 3 nominal attributes, 5% noise, 2 classes

Page 15: Handling Numeric Attributes in Hoeffding Trees


Tree Measurements

• Accuracy (% correct)
• Number of training examples processed in 10 hours (millions)
• Number of active leaves (hundreds)
• Number of inactive leaves (hundreds)
• Total nodes (hundreds)
• Tree depth
• Training speed (% of generation speed)
• Prediction speed (% of generation speed)

Page 16: Handling Numeric Attributes in Hoeffding Trees


Sensor Network (100KB memory limit)

Method      % correct
GAUSS100    85.33
GAUSS10     86.16
GK1000      74.65
GK100       82.92
BT          74.45
VF1000      76.06
VF100       79.47
VF10        87.7

(Other columns in the full table: training examples processed in millions, active and inactive leaves in hundreds, total nodes in hundreds, average tree depth, and training and prediction speed as % of generation speed.)

Page 17: Handling Numeric Attributes in Hoeffding Trees


Handheld Environment (32MB memory limit)

Method      % correct
GAUSS100    90.91
GAUSS10     91.35
GK1000      90.94
GK100       89.96
BT          90.48
VF1000      90.97
VF100       90.97
VF10        91.53

(Other columns in the full table: training examples processed in millions, active and inactive leaves in hundreds, total nodes in hundreds, average tree depth, and training and prediction speed as % of generation speed.)

Page 18: Handling Numeric Attributes in Hoeffding Trees


Server Environment (400MB memory limit)

Method      % correct
GAUSS100    90.75
GAUSS10     91.21
GK1000      91.03
GK100       89.88
BT          90.50
VF1000      91.12
VF100       91.19
VF10        91.41

(Other columns in the full table: training examples processed in millions, active and inactive leaves in hundreds, total nodes in hundreds, average tree depth, and training and prediction speed as % of generation speed.)

Page 19: Handling Numeric Attributes in Hoeffding Trees


Overall results - comments

• VFML10 is superior on average in all environments, followed closely by GAUSS10

• GK methods are generally competitive
• BINTREE is only competitive in a server setting
• The default setting of 1000 bins for VFML is a poor choice
• Crude binning uses less memory, which leaves more room for the tree to grow and leads to faster growth and better trees
• Higher values of N for GAUSS lead to very deep trees (in excess of the number of attributes), suggesting repeated splitting on the same attributes (too fine grained)

Page 20: Handling Numeric Attributes in Hoeffding Trees


Remarks – sensor network environment

• The number of training examples processed is low because learning stops when the last active leaf is deactivated (memory management freezes nodes; with few examples reaching each leaf, the probability of splitting is low)

• Most accurate methods VFML10, GAUSS10

Page 21: Handling Numeric Attributes in Hoeffding Trees


Remarks – Handheld Environment

• Generates smaller trees (than server) and can therefore process more examples

Page 22: Handling Numeric Attributes in Hoeffding Trees


Remarks – Server Environment

Page 23: Handling Numeric Attributes in Hoeffding Trees


VFML10 vs GAUSS10 – Closer Analysis

• Recall that VFML10 is superior on average
• Sensor (avg 87.7 vs 86.2):
– GAUSS10 superior on 10 datasets
– VFML10 superior on 6 (2 no difference)
• Handheld (avg 91.5 vs 91.4):
– GAUSS10 superior on 4
– VFML10 superior on 8 (6 no difference)
• Server (avg 91.4 vs 91.2):
– GAUSS10 superior on 6
– VFML10 superior on 6 (6 no difference)

Page 24: Handling Numeric Attributes in Hoeffding Trees


Data order

Page 25: Handling Numeric Attributes in Hoeffding Trees


Conclusion

• We have presented a method for handling numeric attributes in data streams that performs well in empirical studies

• The methods employing the most approximation were superior – they allow greater growth when memory is limited.

• On a dataset-by-dataset analysis there is little to choose between VFML10 and GAUSS10

• Gains made in handling numeric variables come at a cost in terms of training and prediction speed – the cost is high in some environments

Page 26: Handling Numeric Attributes in Hoeffding Trees


All algorithms available

• https://sourceforge.net/projects/moa-datastream
• All the methods, plus an environment for experimental evaluation on data streams, are available from the above URL – the system is called Massive Online Analysis (MOA)