
KnittingBoar Toronto Hadoop User Group Nov 27 2012


Page 1: KnittingBoar Toronto Hadoop User Group Nov 27 2012


KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson

Principal Solutions Architect

Page 2: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Hello

✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]

Page 3: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Outline

✛ Introduction to Machine Learning

✛ Mahout

✛ Knitting Boar and YARN

✛ Parting Thoughts

Page 4: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Introduction to MACHINE LEARNING

Page 5: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Basic Concepts

✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data is essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ The process of learning “concepts”

Page 6: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Shades of Gray

✛ Information Retrieval
> Draws on information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence

Page 7: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Hadoop in Traditional Enterprises Today

✛ ETL

✛ Joining multiple disparate data sources

✛ Filtering data

✛ Aggregation

✛ Cube materialization

“Descriptive Statistics”

Page 8: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Hadoop All The Time?

✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data (see the mapper sketch below)
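To make that last point concrete, here is a minimal sketch of scoring new records with an already-trained model inside a mapper. This is illustrative, not Knitting Boar code: the one-record-per-line CSV layout and the ModelLoader.loadModel() helper are hypothetical, and it assumes a Mahout-style two-category OnlineLogisticRegression on the classpath.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/** Scores each input record with a pre-trained logistic regression model. */
public class ScoringMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private OnlineLogisticRegression model;

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical helper: deserialize a model shipped to each task
    // (e.g. via the DistributedCache).
    model = ModelLoader.loadModel(context.getConfiguration());
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed record layout: id,feat0,feat1,...,featN (all numeric).
    String[] fields = value.toString().split(",");
    Vector features = new RandomAccessSparseVector(fields.length - 1);
    for (int i = 1; i < fields.length; i++) {
      features.setQuick(i - 1, Double.parseDouble(fields[i]));
    }
    // classifyScalar returns P(category == 1) for a two-category model.
    context.write(new Text(fields[0]), new DoubleWritable(model.classifyScalar(features)));
  }
}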

Page 9: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Twitter Pipeline

✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Studies data with descriptive statistics in the hope of building models for predictive analytics
✛ Does the majority of its ML work via custom Pig integrations
> The pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining

✛ Questions they try to answer

> Is this tweet spam?

> What star rating might this user give this movie?

Page 10: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Typical Pipeline for Cloudera Customer

✛ Data collection performed with Flume

✛ Data cleansing / ETL performed with Hive or Pig

✛ ML work performed with
> SAS

> SPSS

> R

> Mahout

Page 11: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Introduction to MAHOUT

Page 12: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Algorithm Groups in Apache Mahout


✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”

✛ Frequent Itemset Mining

Page 13: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Classification

✛ Stochastic Gradient Descent
> Single process
> Logistic Regression model construction (usage sketch below)
✛ Naïve Bayes
> MapReduce-based
> Text classification
✛ Random Forests
> MapReduce-based

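For concreteness, single-process SGD training with Mahout’s OnlineLogisticRegression looks roughly like the following. This is a sketch against the Mahout 0.x SGD API with toy inline data; it is not taken from the talk.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class OlrExample {
  public static void main(String[] args) {
    // Two categories, three features, L1 prior (regularization).
    OnlineLogisticRegression olr =
        new OnlineLogisticRegression(2, 3, new L1())
            .learningRate(0.1)
            .lambda(1.0e-4);

    // Toy training data: label is 1 when the first feature is large.
    double[][] examples = {{5.0, 0.1, 0.2}, {0.1, 0.9, 0.3}, {4.2, 0.4, 0.1}, {0.3, 0.8, 0.7}};
    int[] labels = {1, 0, 1, 0};

    for (int pass = 0; pass < 20; pass++) {
      for (int i = 0; i < examples.length; i++) {
        olr.train(labels[i], new DenseVector(examples[i])); // one SGD step per example
      }
    }

    // classifyScalar returns P(category == 1) for a two-category model.
    Vector query = new DenseVector(new double[] {4.0, 0.2, 0.3});
    System.out.println(olr.classifyScalar(query));
  }
}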

Page 14: KnittingBoar Toronto Hadoop User Group Nov 27 2012

What Are Recommenders?

✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html

Page 15: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Clustering: Topic Modeling

✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation

Page 16: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Taking a Breath For a Minute

✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time-Consuming
> The “need for speed”
> “More data beats a cleverer algorithm”

Page 17: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Introducing KNITTING BOAR

Page 18: KnittingBoar Toronto Hadoop User Group Nov 27 2012

Goals

✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop/YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later

Page 19: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Stochastic Gradient Descent

✛ Training
> Simple gradient descent procedure
> The loss function needs to be convex
✛ Prediction
> Logistic Regression: sigmoid function of (parameter vector · example), spelled out below

[Diagram: Training Data feeds SGD, which produces the Model]
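Spelled out, the prediction and the per-example update the slide is describing are the standard logistic-regression SGD equations:

P(y = 1 \mid \mathbf{x}) \;=\; \sigma(\mathbf{w} \cdot \mathbf{x}) \;=\; \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}

\mathbf{w} \;\leftarrow\; \mathbf{w} + \eta \,\bigl(y - \sigma(\mathbf{w} \cdot \mathbf{x})\bigr)\,\mathbf{x}

where \mathbf{w} is the parameter vector, \mathbf{x} the example, y \in \{0, 1\} its label, and \eta the learning rate.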

Page 20: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Current Limitations

✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> We need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms

Page 21: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Distributed Learning Strategies

✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches

Page 22: KnittingBoar Toronto Hadoop User Group Nov 27 2012


MapReduce vs. Parallel Iterative

[Diagram: MapReduce (Input → Map tasks → Reduce tasks → Output) versus the parallel iterative / BSP model (a fixed set of processors advancing together through Superstep 1, Superstep 2, …)]

Page 23: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Why Stay on Hadoop?

“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”

–– Lin, 2012

Page 24: KnittingBoar Toronto Hadoop User Group Nov 27 2012


The Boar

✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector

Page 25: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Worker

✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR
> Processes N samples in a batch (a subset of its split)
✛ Batched gradient-accumulation updates are sent to the master node (see the sketch below)
> The gradient influences future model vectors towards better predictions
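In spirit, a worker’s superstep looks like the sketch below. The class names MasterClient and LabeledExample are hypothetical stand-ins for Knitting Boar’s actual plumbing, and it assumes Mahout’s getBeta() accessor on the OLR model.

import java.util.List;

import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

/** Illustrative worker loop: train OLR on one split, syncing in batches of N. */
public class WorkerSketch {

  /** Hypothetical RPC handle: send local params, receive the merged global vector. */
  public interface MasterClient {
    Vector updateAndFetchGlobal(Vector localParams);
  }

  /** Minimal labeled-example holder. */
  public static final class LabeledExample {
    final int label;
    final Vector features;
    public LabeledExample(int label, Vector features) {
      this.label = label;
      this.features = features;
    }
  }

  private final OnlineLogisticRegression olr; // trains on this worker's split
  private final MasterClient master;

  public WorkerSketch(OnlineLogisticRegression olr, MasterClient master) {
    this.olr = olr;
    this.master = master;
  }

  /** One superstep: N local SGD steps, then synchronize with the master. */
  public void runSuperstep(List<LabeledExample> batch) {
    for (LabeledExample ex : batch) {
      olr.train(ex.label, ex.features); // local gradient steps accumulate in beta
    }
    // For a two-category model the parameters live in row 0 of the beta matrix.
    Vector local = olr.getBeta().viewRow(0);
    Vector global = master.updateAndFetchGlobal(local);
    olr.getBeta().assignRow(0, global); // adopt the averaged global parameter vector
  }
}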

Page 26: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Master

✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces a new global parameter vector
> By averaging the workers’ vectors (see the sketch below)
✛ Sends the update to all workers
> Workers replace their local parameter vector with the new global parameter vector
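The master’s merge step is essentially an element-wise vector average; a minimal sketch, assuming every worker reports a parameter vector of the same length:

import java.util.List;

import org.apache.mahout.math.Vector;

/** Illustrative master-side merge: average the workers' parameter vectors. */
public class MasterSketch {

  /** New global vector = element-wise mean of the K workers' vectors. */
  public static Vector average(List<Vector> workerParams) {
    Vector sum = workerParams.get(0).clone();
    for (int i = 1; i < workerParams.size(); i++) {
      sum = sum.plus(workerParams.get(i)); // element-wise add
    }
    return sum.divide(workerParams.size()); // divide each element by K
  }
}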

Page 27: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Comparison: OLR vs POLR

[Diagram: on the left, OnlineLogisticRegression trains a single Model from all the Training Data; on the right, Knitting Boar’s POLR hands Splits 1…N to Workers 1…N, each holding a Partial Model, while the Master maintains the Global Model]

Page 28: KnittingBoar Toronto Hadoop User Group Nov 27 2012


20Newsgroups

Input Size vs Processing Time

[Chart comparing OLR and POLR processing time as input size grows]

Page 29: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Knitting Boar: PARTING THOUGHTS

Page 30: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Knitting Boar Lessons Learned

✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated

Page 31: KnittingBoar Toronto Hadoop User Group Nov 27 2012


Bits

✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 licensed
> Quick Start: https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 licensed

Page 32: KnittingBoar Toronto Hadoop User Group Nov 27 2012


✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here

Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg

Page 34: KnittingBoar Toronto Hadoop User Group Nov 27 2012


References

✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068