KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
Hello
✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]
Outline
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to MACHINE LEARNING
Basic Concepts
✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data is essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ The process of learning “concepts”
Shades of Gray
✛ Information Retrieval
> Draws on information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
Hadoop in Traditional Enterprises Today
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
Hadoop All The Time?
✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
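A minimal sketch of that train-once / score-in-parallel pattern (illustrative Python, not the Mahout API; the toy `train_locally` and `score` functions are hypothetical stand-ins for any serialized model applied inside a map task):

```python
# Train once on one machine, then score records as a pure function
# per map input. The "model" here is just a weight vector.

def train_locally(examples):
    """Toy training: average each feature across the positive examples."""
    n = len(examples[0][0])
    w = [0.0] * n
    pos = [x for x, y in examples if y == 1]
    for x in pos:
        for i, v in enumerate(x):
            w[i] += v / len(pos)
    return w

def score(weights, record):
    """Map-side scoring: a dot product, trivially parallel per record."""
    return sum(w * v for w, v in zip(weights, record))

model = train_locally([([1.0, 0.0], 1), ([0.0, 1.0], 0)])
scores = [score(model, r) for r in [[1.0, 0.0], [0.0, 1.0]]]
```

Because `score` is stateless, broadcasting the small serialized model to every mapper lets a single-machine model grade arbitrarily large data.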
Twitter Pipeline
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Studies data with descriptive statistics in the hope of building models for predictive analytics
✛ Does the majority of ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
Typical Pipeline for Cloudera Customer
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
Introduction to MAHOUT
Algorithm Groups in Apache Mahout
Copyright 2010 Cloudera Inc. All rights reserved
✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
Classification
✛ Stochastic Gradient Descent
> Single process
> Logistic Regression model construction
✛ Naïve Bayes
> MapReduce-based
> Text classification
✛ Random Forests
> MapReduce-based
What Are Recommenders?
✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
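The idea behind a user-based recommender can be sketched in a few lines of Python. This is a deliberately tiny nearest-neighbor collaborative filter (not Mahout’s recommender API); the `ratings` matrix, user names, and `recommend` helper are all illustrative:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(ratings, user, top_n=1):
    """Suggest items the most similar user rated that `user` has not."""
    sims = {other: cosine(ratings[user], ratings[other])
            for other in ratings if other != user}
    nearest = max(sims, key=sims.get)
    unseen = [i for i, r in enumerate(ratings[user]) if r == 0]
    ranked = sorted(unseen, key=lambda i: ratings[nearest][i], reverse=True)
    return ranked[:top_n]

# Rows are users, columns are items; 0 means "not yet rated".
ratings = {
    "alice": [5, 3, 0, 1],
    "bob":   [4, 3, 4, 1],
    "carol": [1, 1, 0, 5],
}
picks = recommend(ratings, "alice")  # bob is nearest; his top unseen item wins
```

Production recommenders add normalization, implicit feedback, and scale tricks, but the past-actions-to-suggestions core is exactly this.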
Clustering: Topic Modeling
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
Taking a Breath For a Minute
✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear models are simple, useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building models is still time consuming
> The “need for speed”
> “More data beats a cleverer algorithm”
Introducing KNITTING BOAR
Goals
✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop/YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
Stochastic Gradient Descent
✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using (parameter vector dot example) as the exponential parameter

[Diagram: Training Data feeds SGD, which produces a Model]
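The training/prediction loop described above fits in a short sketch. This is plain Python for illustration, not Mahout’s `OnlineLogisticRegression`; the learning rate and epoch count are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Logistic regression: sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_step(w, x, y, lr=0.5):
    """One stochastic gradient step on the (convex) log loss."""
    err = predict(w, x) - y  # gradient of log loss w.r.t. the logit
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

# Learn a tiny AND-like concept one example at a time.
# First feature is a constant bias term.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    for x, y in data:
        w = sgd_step(w, x, y)
```

Each step touches one example, which is what makes SGD cheap per update and easy to stream, and also what the parallel scheme later in the deck has to batch up.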
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
MapReduce vs. Parallel Iterative
[Diagram: MapReduce runs Input → Map → Reduce → Output in a single pass; the parallel iterative model runs a set of processors through repeated supersteps (Superstep 1, Superstep 2, …) with a synchronization point between each]
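The contrast in the diagram can be written schematically. This sketch is illustrative only (the function names are not a real framework API): one pass for MapReduce versus a synchronized loop for the iterative model:

```python
# Schematic contrast between the two execution models.

def map_reduce_once(splits, map_fn, reduce_fn):
    """Classic MapReduce: one pass from input to output."""
    mapped = [map_fn(s) for s in splits]  # the Map phase
    return reduce_fn(mapped)              # the Reduce phase

def parallel_iterative(splits, step_fn, merge_fn, state, supersteps):
    """BSP-style loop: each superstep runs all workers, then merges state."""
    for _ in range(supersteps):           # Superstep 1, 2, ...
        partials = [step_fn(state, s) for s in splits]
        state = merge_fn(partials)        # barrier + global merge
    return state

# Toy use: a one-pass sum vs. iteratively relaxing toward the splits' mean.
total = map_reduce_once([1, 2, 3], lambda s: s * 2, sum)
mean = parallel_iterative([1.0, 2.0, 3.0],
                          lambda st, s: st + 0.5 * (s - st),
                          lambda ps: sum(ps) / len(ps),
                          0.0, 20)
```

The point of the slide is the second shape: shared state that must survive across supersteps is exactly what vanilla MapReduce, which ends after one reduce, does not give you.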
Why Stay on Hadoop?
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
— Lin, 2012
The Boar
✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps global copy of the merged parameter vector
Worker
✛ Each given a split of the total dataset
> Similar to a map task
✛ Using a modified OLR
> Processes N samples in a batch (subset of the split)
✛ Batched gradient accumulation updates sent to the master node
> Gradient influences future model vectors towards better predictions
Master
✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces new global parameter vector
> By averaging workers’ vectors
✛ Sends update to all workers
> Workers replace local parameter vector with the new global parameter vector
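The worker/master protocol above can be simulated in-process. This is a hedged sketch of the averaging scheme, not Knitting Boar’s actual Java code: each "worker" runs OLR-style updates over its split, and the "master" averages the resulting parameter vectors into the next global vector:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def worker_pass(w, split, lr=0.5):
    """One worker: OLR-style updates over its split; returns its local vector."""
    w = list(w)
    for x, y in split:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def master_merge(vectors):
    """Master: average the workers' parameter vectors into a new global one."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Two workers, each with its own split (first feature is a bias term).
splits = [
    [([1, 0], 0), ([1, 1], 1)],
    [([1, 0], 0), ([1, 1], 1)],
]
w = [0.0, 0.0]
for _ in range(200):  # each loop iteration = one superstep
    w = master_merge([worker_pass(w, s) for s in splits])
```

Averaging keeps communication to one vector per worker per superstep, which is the trade the deck’s cited distributed-training papers analyze.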
Comparison: OLR vs POLR
[Diagram: OnlineLogisticRegression trains a single Model from all the Training Data; in Knitting Boar’s POLR, Workers 1…N each take a split (Split 1, Split 2, Split 3, …) and produce Partial Models, which the Master merges into a Global Model]
20Newsgroups
Input Size vs Processing Time
[Chart: processing time (0–300) as input size grows (4.1–41), for OLR vs. POLR]
Knitting Boar: PARTING THOUGHTS
Knitting Boar Lessons Learned
✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
Bits
✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 licensed
✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
References
✛ Strata / Hadoop World 2012 slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
References
✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
Photo Credits
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg