VHT: Vertical Hoeffding Tree
Nicolas Kourtellis, Telefonica I+D
Gianmarco De Francisci Morales, QCRI
Albert Bifet, Telecom ParisTech
Arinto Murdopo, LARC-SMU
VHT: Vertical Hoeffding Tree. IEEE International Conference on Big Data, December 2016
Decision Trees (DT)
• Easy to visualize and understand
• Fast to predict new instances
• Can model non-linear relationships
• Constructed using data batches
• Scans data multiple times
• Optimal Tree? NP-complete…
• Greedy heuristics to build them
Big data anyone?
DT + Streaming
• Data arrive one example at a time, at high speed
• Tree must be modified incrementally
• VFDT with Hoeffding bound for guarantees
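The Hoeffding bound mentioned above can be illustrated with a minimal sketch. It uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)); the function names `hoeffding_bound` and `should_split` are hypothetical and not taken from VFDT or SAMOA code, and the tie-breaking rule used by practical implementations is left out.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean
    of a variable with range `value_range` lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical helper: split when the best attribute beats the
    runner-up by more than epsilon (tie-breaking omitted)."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

For example, with a range of 1, delta = 1e-7 and n = 200 instances, the bound is about 0.20, so a leaf would split only if the two best attributes differ in gain by more than that. The bound shrinks as more instances are seen.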
DT + Streaming + Distributed
• Tree construction & maintenance distributed across machines
How?
• Task parallelism
• Horizontal parallelism
• Vertical parallelism
Horizontal Parallelism
• Independent instances processed in isolation
• Instances distributed randomly to machines
• Same attribute counters exist multiple times
• Memory for model grows linearly with the parallelism
• Split criterion centrally computed after partial counters aggregated
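The horizontal scheme above can be sketched as follows. This is a simplified illustration, not SAMOA code: counters are kept for a single leaf, workers are plain in-memory `Counter` objects, and the names `horizontal_counts` and `aggregate` are hypothetical.

```python
import random
from collections import Counter


def horizontal_counts(instances, n_workers, seed=0):
    """Send each whole instance to a random worker; every worker keeps its
    own (attribute, value, label) counters, so the same counter key can
    exist on several workers."""
    rng = random.Random(seed)
    workers = [Counter() for _ in range(n_workers)]
    for attrs, label in instances:
        w = rng.randrange(n_workers)
        for attr_idx, value in enumerate(attrs):
            workers[w][(attr_idx, value, label)] += 1
    return workers


def aggregate(workers):
    """Central step: merge the partial counters before the split
    criterion can be evaluated."""
    total = Counter()
    for partial in workers:
        total.update(partial)
    return total
```

The combined number of counter entries across workers is at least the size of the merged counter, which is the memory growth the slide refers to.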
Vertical Parallelism
• Independent attributes processed in isolation
• Instances must be transformed into column format
• Attributes distributed consistently to the same machine
• Attribute counters exist only once
• Memory for model same as sequential version
• Split criterion computed in parallel
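The vertical scheme above can be sketched in the same simplified setting (one leaf, in-memory workers). The modulo routing in `owner`, the use of information gain as the split criterion, and all function names are assumptions made for illustration; they are not taken from the VHT implementation.

```python
import math
from collections import Counter, defaultdict


def owner(attr_idx, n_workers):
    """Assumed routing rule: an attribute always maps to the same worker."""
    return attr_idx % n_workers


def vertical_counts(instances, n_workers):
    """Slice each instance by attribute (column format) and send every
    attribute to its owner, so each counter exists exactly once."""
    workers = [Counter() for _ in range(n_workers)]
    for attrs, label in instances:
        for attr_idx, value in enumerate(attrs):
            workers[owner(attr_idx, n_workers)][(attr_idx, value, label)] += 1
    return workers


def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c)


def local_best_split(counter):
    """Each worker evaluates only its own attributes; returns
    (information gain, attribute index) or None if it owns no attribute."""
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr_idx, value, label), c in counter.items():
        per_attr[attr_idx][value][label] += c
    best = None
    for attr_idx, by_value in per_attr.items():
        class_totals = Counter()
        for label_counts in by_value.values():
            class_totals.update(label_counts)
        n = sum(class_totals.values())
        remainder = sum(sum(lc.values()) / n * entropy(lc)
                        for lc in by_value.values())
        gain = entropy(class_totals) - remainder
        if best is None or gain > best[0]:
            best = (gain, attr_idx)
    return best
```

The model then only needs the per-worker winners, for example `max(r for r in map(local_best_split, workers) if r is not None)`, rather than the raw counters.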
Algorithm
VHT Optimizations
• Optimistic split execution
  • Use instances during split decision (in case of no split)
• Instance buffering
  • Keep instances at model for replay (in case of split)
  • Timeout before model decides to split
• Model replication
  • Remove bottleneck of aggregation in single model
SAMOA Architecture
Scalable Advanced Massive Online Analysis
• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
• No system downtime
• No complex backup/update process
• No need to select update frequency
[Figure: architecture diagram; surviving labels: Machine Learning Algorithms, Distributed Stream Processing Engines, Flink, Apex]
Experimental Setup: Artificial Tweets
• Zipf skew: 1.5
• Bag of words: 100, 1000, 10000 (attributes)
• Size of tweet: ~15 words
• Instances: 1,000,000
• Class: positive or negative
• Gaussian random variable
• 10 different seeded runs
• Test every 100k instances
• MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
• More experiments on dense instances in paper!
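A rough sketch of how sparse instances with these parameters could be generated is given below. It is not the generator used in the experiments. Several choices are assumptions: the Gaussian is applied to tweet length with a made-up standard deviation (`std_len`), and the class label is drawn at random as a placeholder, since the slide does not say how labels depend on words. The names `zipf_weights` and `make_tweet` are hypothetical.

```python
import random


def zipf_weights(vocab_size, skew):
    """Unnormalized Zipf weights: the word of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)]


def make_tweet(rng, weights, mean_len=15, std_len=3):
    """Return (sorted distinct word ids, label) as a bag-of-words instance.
    ASSUMPTIONS: Gaussian tweet length with std_len=3; random label."""
    length = max(1, round(rng.gauss(mean_len, std_len)))
    words = rng.choices(range(len(weights)), weights=weights, k=length)
    label = rng.choice(["positive", "negative"])  # placeholder labeling
    return sorted(set(words)), label
```

A seeded run would look like `rng = random.Random(1); w = zipf_weights(1000, 1.5); tweet = make_tweet(rng, w)`, with the seed varied across the 10 runs.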
Local VHT vs. MOA HT
• Accuracy: Local VHT ≥ MOA HT
• Exec. time: extra overhead due to interfacing with DSPE without scaling out
VHT vs. Horizontal HT
• Small drop in accuracy due to scaling and more attributes
• Always better than Hor. HT (more gains in dense instances)
VHT vs. Horizontal HT
• Up to 20x faster than MOA HT
• 5-10x faster than Hor. HT
• In dense instances, Hor. HT fails to run due to overhead
• Scaling out: not much impact
VHT Evolution
• Closely following MOA, better than Hor. HT
• Quickly captures best accuracy
Experimental Setup: Dense Instances
• Random decision tree
• Mixed categorical and numerical attributes: 10-10, 100-100, 1k-1k, 10k-10k
• Instances: 1,000,000
• 2 balanced classes
• 10 different seeded runs
• Test every 100k instances
• MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
Local VHT vs. MOA HT
VHT vs. Horizontal HT
VHT vs. Horizontal HT
VHT: Vertical Hoeffding Tree
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
Extra slides
What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
• Framework for developing new distributed stream mining algorithms
• Framework for deploying algorithms on new distributed stream processing engines
Taxonomy
Algorithms in SAMOA
Existing:
• Vertical Hoeffding Tree (classification)
• CluStream (clustering)
• Adaptive Model Rules (regression)
Pending:
• Distributed Naïve Bayes
• Stochastic Gradient Descent
• Adaptive + Boosting VHT
• Parallelized Gradient Boosted Decision Tree
• PARMA (frequent pattern mining)
• …
Check Samoa Roadmap for more
Looking for contributors!
VHT Evolution