Mahout and Distributed Machine Learning 101

Brief introduction to Mahout and distributed machine learning presented to Orlando Data Science

Introduction to machine learning with Mahout

John Ternent

@jaternent

Orlando Data Science – www.orlandods.com

May 13, 2014

Welcome!

Updates

Social Media

Facebook.com/orlandodata
Twitter.com/orlandodata
LinkedIn

OrlandoDS.com

Social Network
Forum
Articles and Content
And More

Send articles to: scott@orlandods.com

Orlando Wiki

Completely Open Aggregate Learning Resources! Go NUTS

May 28th Event

Full Sail, UCF, and Florida Polytechnic

Submit Your Questions! @orlandodata

Member Survey

Need n=30!!!
OrlandoDS.com/member-survey
OR: find it in our past meetup announcements

Learn Hadoop

First Class: June 3rd.

Location: Here

Future Plans

Establish Non-Profit
Increase Global Following
Become Strong Networking and Education Resource for YOU

A (very) little bit about me…

Consultant (Management & Technology)
Open Source Evangelist
Full-spectrum data nerd

A little about you!

Rate yourself (1 – 10) on Mahout
Rate yourself (1 – 10) on Machine Learning/Data Mining
Rate yourself (1 – 10) on Big Data/Hadoop

Please wait… optimizing presentation…

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

-- Tom M. Mitchell, 1997

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.

-- Ian H. Witten & Eibe Frank, 2005

If you’re in academia, you call it “machine learning.” If you’re in business, you call it “data mining.”

-- Mark Hall

I create or improve general-purpose algorithms for machine learning

I use multiple machine learning algorithms for practical data discovery

Source : xkcd

Machine Learning Uses

Clustering

Classification

Recommendation

Machine Learning Algorithms

Regression
K-means Clustering
K-NN
CART
Neural Networks
Support Vector Machines
Association Rules
Principal Component Analysis
Singular Value Decomposition
Ensemble Methods
Naïve Bayes
…

Real-World Applications

Recommender Systems
Image recognition
Signal Processing
Propensity to buy/churn
Fraud analysis
Text analytics
Spam filtering
Forecasting methods
Revenue management
…

The Problem … and Opportunity

Big Data™

“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”

-- Omar Tawakol

Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)

-- Ian Witten

Potential Solutions

Expand RAM
Use incremental algorithms
Use distributable algorithms

Scale Up

Scale Out
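
To make "incremental algorithms" concrete, here is a minimal sketch of one: a running mean and variance (Welford's method) that sees each value once and keeps only three numbers in memory, so the dataset never has to fit in RAM. The class and attribute names are hypothetical and chosen for illustration; this is not taken from Weka or Mahout.

```python
class RunningStats:
    """Incremental mean and variance (Welford's method): one pass, constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; returned as 0.0 here when fewer than two values were seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for a stream too large to load
        stats.update(value)
    print(stats.mean, stats.variance)
```

The same idea (update a small summary per record instead of holding all records) is what lets an algorithm process data larger than memory on a single machine.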

Hadoop in 30 seconds

[Diagram: Input splits → Map (K,V) → Shuffle / Sort → Reduce → Output]
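
The Map, Shuffle / Sort, and Reduce stages above can be sketched in a single process. This is only an illustration of the data flow using a word count, not Hadoop's API; the function names are hypothetical, and on a real cluster each input split and each key group would be handled by a separate task.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (key, value) pair of (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_sort(pairs):
    """Shuffle / sort: group all values by key, then order the groups by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single output record."""
    return key, sum(values)


def run_job(input_splits):
    """Run map over every split, shuffle the pairs, then reduce each key group."""
    mapped = chain.from_iterable(map_phase(line) for line in input_splits)
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


if __name__ == "__main__":
    splits = ["big data big algorithms", "more data"]  # made-up input
    print(run_job(splits))
```

Because each map call only sees its own line and each reduce call only sees one key, both phases can be spread across machines; the shuffle is the only step that moves data between them.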

Finally -- Mahout

A Java-based library of machine learning algorithms designed to support distributed processing

Initially on MapReduce, now leaning heavily towards Spark

Primarily focused on Recommenders, Clustering, and Classification spaces.

Running Mahout

Locally – download the Mahout distro.

/bin/mahout is the wrapper script; by default it shows all the example programs available.

Lots of tools are included to convert data into vector formats and pre-process text; they are worth a look.

Amazon EC2 – configure the stack from scratch on EC2 servers.

Amazon EMR – quicker start; a lot of the build is already optimized for MapReduce jobs. Just add Mahout as a custom jar and pass the script as a parameter.

Running Recommenders

Multiple Recommender Algorithms
User-based
Item-based

A Recommender Needs:
DataModel (e.g. FileDataModel)
Similarity driver (e.g. PearsonCorrelationSimilarity)
Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
Recommender
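
The four pieces listed above (data model, similarity, neighborhood, recommender) can be sketched in plain Python to show how they fit together for a user-based recommender. This is a conceptual sketch, not Mahout's Java API: the function names and the ratings are invented for illustration, and the scoring rule (a similarity-weighted average of neighbor ratings) is one common choice assumed here.

```python
from math import sqrt

# Hypothetical data model: user -> {item: preference}. Values are made up.
RATINGS = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
    "dave":  {"A": 5.0, "B": 3.0, "D": 4.0},
}


def pearson(prefs_a, prefs_b):
    """Similarity: Pearson correlation over the items both users have rated."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def nearest_n(user, data, n):
    """Neighborhood: the n other users most similar to `user`."""
    scored = [(pearson(data[user], data[other]), other)
              for other in data if other != user]
    scored.sort(reverse=True)
    return scored[:n]


def recommend(user, data, n_neighbors=2, how_many=3):
    """Recommender: rank unseen items by similarity-weighted neighbor ratings."""
    totals, weights = {}, {}
    for sim, other in nearest_n(user, data, n_neighbors):
        if sim <= 0:
            continue  # skip neighbors that are uncorrelated or negatively correlated
        for item, pref in data[other].items():
            if item in data[user]:
                continue  # only score items the user has not rated yet
            totals[item] = totals.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return ranked[:how_many]


if __name__ == "__main__":
    print(recommend("alice", RATINGS))
```

Swapping `nearest_n` for a function that keeps every user above a similarity cutoff would give the threshold-style neighborhood instead of the nearest-N one.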

Running Recommenders

Tip : If your data has no preference values, there are Boolean equivalents of the recommender classes

Evaluate user vs. item similarities

Example

Clustering Algorithms

To cluster you need:
Location in n-dimensional space
Distance metric
Threshold

K-means
Canopy
Dirichlet
Fuzzy K-means
Spectral Clustering
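
As a rough illustration of how those three ingredients are used, here is a minimal single-machine k-means sketch. Points are locations in n-dimensional space, Euclidean distance is the metric, and the threshold is interpreted here as a convergence tolerance on centroid movement (an assumption; the slide does not say which threshold is meant). The function name and parameters are hypothetical, and this is not Mahout's implementation.

```python
import random
from math import dist  # Euclidean distance between two points (Python 3.8+)


def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stop once the centroids have effectively stopped moving
            break
    return centroids, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # made-up points
    centers, groups = k_means(data, k=2)
    print(centers)
```

The assignment step is independent per point, which is what makes k-means a natural fit for a distributed map phase, with the centroid update as the reduce.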

Clustering

Clustering Text

Identify k topics in a document corpus
Requires conversion of text into vectors
Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.

seqdirectory – from a directory of text files

lucene.vector – from a Lucene index
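
To show what "conversion of text into vector" with stop-word and weighting criteria means, here is a small sketch that tokenizes documents, drops stop words, and weights each term by TF-IDF. It does not use Lucene or Mahout's tools; the stop-word list, function names, and the plain `tf * log(N / df)` weighting are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real tools ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] * math.log(n_docs / doc_freq[term])
                        for term in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    docs = ["The cat sat", "The dog sat", "A cat and a dog"]  # made-up corpus
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    print(vecs)
```

Once every document is a vector over the same vocabulary, any of the clustering algorithms can be applied to the corpus.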

Classifiers

NaïveBayes
RandomForests
LogisticRegression (SGD)
HiddenMarkov

Example : 20 Newsgroups
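
As an illustration of the first classifier in the list, here is a minimal multinomial Naïve Bayes sketch for tokenized text with add-one (Laplace) smoothing. It is not Mahout's implementation; the class name, the `alpha` parameter, and the tiny two-topic training set (loosely in the spirit of newsgroup posts) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed default)
        self.class_counts = Counter()            # documents seen per label
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocabulary = set()

    def fit(self, documents, labels):
        """Count documents per label and term occurrences per label."""
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        """Return the label with the highest log posterior for the tokens."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_terms = sum(self.term_counts[label].values())
            for term in tokens:
                if term not in self.vocabulary:
                    continue  # ignore terms never seen in training
                count = self.term_counts[label][term]
                score += math.log((count + self.alpha)
                                  / (total_terms + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    train_docs = [["engine", "car", "wheel"], ["car", "brake", "engine"],
                  ["orbit", "rocket", "launch"], ["rocket", "moon", "orbit"]]
    train_labels = ["autos", "autos", "space", "space"]  # hypothetical labels
    model = NaiveBayes().fit(train_docs, train_labels)
    print(model.predict(["rocket", "orbit"]))
```

Training only needs counts per label and per term, which is why this classifier distributes well: each mapper can count its own documents and a reducer can sum the counts.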

Sidebar : Risks of Big Data Unsupervised Learning