VHT: Vertical Hoeffding Tree
Nicolas Kourtellis (Telefonica I+D), Gianmarco De Francisci Morales (QCRI), Albert Bifet (Telecom ParisTech), Arinto Murdopo (LARC-SMU)
IEEE International Conference on Big Data, December 2016

VHT: Vertical Hoeffding Tree (IEEE BigData 2016)




Decision Trees (DT)
• Easy to visualize and understand
• Fast to predict new instances
• Can model non-linear relationships
• Constructed using data batches
• Scans data multiple times
• Optimal Tree? NP-complete…
• Greedy heuristics to build them


Big data anyone?


DT + Streaming
• Data arrive one example at a time, at high speed
• Tree must be modified incrementally
• VFDT with Hoeffding bound for guarantees
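
The Hoeffding bound used by VFDT can be illustrated with a short sketch. The formula is the standard one, epsilon = sqrt(R^2 ln(1/delta) / (2n)); the function names, the parameter values, and the simple "best minus second best" split test are illustrative assumptions, not code from SAMOA or MOA.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical split test: split once the best attribute beats the
    runner-up by more than the Hoeffding bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

For information gain with two classes the range R is 1, so the bound shrinks as 1/sqrt(n): a leaf that cannot yet separate two close attributes only needs to wait for more instances.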


DT + Streaming + Distributed
• Tree construction & maintenance distributed across machines

How?
• Task parallelism
• Horizontal parallelism
• Vertical parallelism


Horizontal Parallelism
• Independent instances processed in isolation
• Instances distributed randomly to machines
• Same attribute counters exist multiple times
• Memory for model grows linearly with the parallelism
• Split criterion centrally computed after partial counters aggregated
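
A minimal sketch of the horizontal scheme described above, assuming categorical attributes and plain in-memory counters. The names horizontal_update and aggregate are hypothetical and only illustrate why counters are duplicated and why a central aggregation step is needed.

```python
import random
from collections import Counter


def horizontal_update(workers, instance, label, rng):
    """Send the whole instance to one randomly chosen worker, which updates
    its own partial counters for every attribute of that instance."""
    chosen = rng.randrange(len(workers))
    for attr, value in enumerate(instance):
        workers[chosen][(attr, value, label)] += 1


def aggregate(workers):
    """Central step: merge the partial counters of all workers before the
    split criterion can be computed."""
    total = Counter()
    for partial in workers:
        total.update(partial)  # Counter.update adds counts
    return total
```

Because any worker may see any attribute, the same (attribute, value, class) counter can exist on every worker, which is the memory growth noted on the slide.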


Vertical Parallelism
• Independent attributes processed in isolation
• Instances must be transformed into column format
• Attributes distributed consistently to same machine
• Attribute counters exist only once
• Memory for model same as sequential version
• Split criterion computed in parallel
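
The vertical scheme can be sketched the same way. This assumes categorical attributes and a simple modulo routing rule; to_column_format, route, and vertical_update are hypothetical names, and the actual key grouping used in SAMOA may differ.

```python
from collections import Counter


def to_column_format(leaf_id, instance, label):
    """Decompose one instance into one message per attribute."""
    return [(leaf_id, attr, value, label) for attr, value in enumerate(instance)]


def route(attr, num_workers):
    """Consistent routing: a given attribute always maps to the same worker."""
    return attr % num_workers


def vertical_update(workers, leaf_id, instance, label):
    """Each worker only updates counters for the attributes it owns."""
    for leaf, attr, value, lbl in to_column_format(leaf_id, instance, label):
        workers[route(attr, len(workers))][(leaf, attr, value, lbl)] += 1
```

Since each attribute lives on exactly one worker, no counter is duplicated, and each worker can evaluate the split criterion for its own attributes without a central merge of counters.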


Algorithm


VHT Optimizations
• Optimistic split execution
  - Use instances during split decision (in case no split)
• Instance buffering
  - Keep instances at model for replay (in case of split)
• Timeout before model decides to split
• Model replication
  - Remove bottleneck of aggregation in single model
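
One possible reading of the first three optimizations, as a toy sketch. The class PendingLeaf, its fields, and the use of a plain numeric clock are all assumptions made for illustration; the slide does not specify how SAMOA implements them.

```python
class PendingLeaf:
    """Toy state for a leaf that may be waiting on a split decision."""

    def __init__(self, timeout):
        self.timeout = timeout   # how long the model waits for a decision
        self.waiting = False     # True while a split decision is pending
        self.started_at = None
        self.buffer = []         # instances kept for replay
        self.seen = 0            # instances used to update statistics

    def start_split_decision(self, now):
        self.waiting = True
        self.started_at = now
        self.buffer = []

    def on_instance(self, instance):
        # Optimistic execution: keep learning from the instance even while
        # the decision is pending, in case no split happens.
        self.seen += 1
        if self.waiting:
            # Instance buffering: keep a copy for replay in case of a split.
            self.buffer.append(instance)

    def timed_out(self, now):
        return self.waiting and (now - self.started_at) >= self.timeout

    def finish(self, split):
        """Return the instances to replay into the new children (empty if
        the leaf did not split) and clear the pending state."""
        replay = self.buffer if split else []
        self.waiting = False
        self.buffer = []
        return replay
```

A caller would check timed_out on each tick and call finish with whatever decision is available at that point.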


SAMOA Architecture
[Architecture diagram; recoverable labels: Machine Learning Algorithms, Distributed Stream Processing Engines, Flink, Apex]

Scalable Advanced Massive Online Analysis
• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
• No system downtime
• No complex backup/update process
• No need to select update frequency


Experimental Setup: Artificial Tweets
• Zipf skew: 1.5
• Bag of words: 100, 1000, 10000 (attributes)
• Size of tweet: ~15 words
• Instances: 1,000,000
• Class: positive or negative
• Gaussian random variable
• 10 different seeded runs
• Test every 100k instances
• MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
• More experiments on dense instances in paper!
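
To give a feel for how sparse such instances are, here is a rough sketch of drawing one bag-of-words tweet from a Zipf-like distribution with skew 1.5. It is not the generator used in the experiments: the fixed length of 15 words, the function names, and the omission of class labels are simplifying assumptions.

```python
import random


def zipf_weights(vocab_size, skew):
    """Unnormalized Zipf weights: the word of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)]


def make_tweet(rng, weights, num_words=15):
    """Draw num_words word indices (with repetition) and return the sorted
    set of distinct indices, i.e. a sparse bag-of-words instance."""
    words = rng.choices(range(len(weights)), weights=weights, k=num_words)
    return sorted(set(words))
```

With 10000 attributes and at most 15 non-zero entries per instance, almost every attribute is absent from any given tweet, which is why this setup is treated separately from the dense one.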


Local VHT vs. MOA HT

• Accuracy: Local VHT ≥ MOA HT
• Exec. time: extra overhead due to interfacing with the DSPE without scaling out


VHT vs. Horizontal HT

• Small drop in accuracy due to scaling and more attributes
• Always better than Hor. HT (more gains in dense instances)


VHT vs. Horizontal HT


• Up to 20x faster than MOA HT

• 5-10x faster than Hor. HT

• In dense instances, Hor. HT fails to run due to overhead

• Scaling out: not much impact


VHT Evolution

• Closely following MOA, better than Hor. HT
• Quickly captures best accuracy


Experimental Setup: Dense Instances
• Random decision tree
• Mixed categorical and numerical attributes
  - 10-10, 100-100, 1k-1k, 10k-10k
• Instances: 1,000,000
• 2 balanced classes
• 10 different seeded runs
• Test every 100k instances
• MOA HT, Local VHT, Storm cluster VHT, Horizontal HT


Local VHT vs. MOA HT


VHT vs. Horizontal HT


VHT vs. Horizontal HT


VHT: Vertical Hoeffding Tree
@ApacheSAMOA

http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa

Nicolas Kourtellis
@kourtellis

[email protected]


Extra slides


What is SAMOA?
• Scalable Advanced Massive Online Analysis
• A platform for mining big data streams
• Framework for developing new distributed stream mining algorithms
• Framework for deploying algorithms on new distributed stream processing engines


Taxonomy


Algorithms in SAMOA

Existing:
• Vertical Hoeffding Tree (classification)
• CluStream (clustering)
• Adaptive Model Rules (regression)

Pending:
• Distributed Naïve Bayes
• Stochastic Gradient Descent
• Adaptive + Boosting VHT
• Parallelized Gradient Boosted Decision Tree
• PARMA (frequent pattern mining)
• …

Check Samoa Roadmap for more

Looking for contributors!


VHT Evolution
