Mahout and Distributed Machine Learning 101

Introduction to machine learning with Mahout

John Ternent

@jaternent

Orlando Data Science – www.orlandods.com

May 13, 2014

Welcome!

Updates

Social Media

Facebook.com/orlandodata
Twitter.com/orlandodata
LinkedIn

OrlandoDS.com

Social Network
Forum
Articles and Content
And More

Send articles to: scott@orlandods.com

Orlando Wiki

Completely Open
Aggregate Learning Resources!
Go NUTS

May 28th Event

Full Sail, UCF, and Florida Polytechnic

Submit Your Questions! @orlandodata

Member Survey

Need n=30!!!
OrlandoDS.com/member-survey
OR: find it in our past meetup announcements

Learn Hadoop

First Class: June 3rd.

Location: Here

Future Plans

Establish Non-Profit
Increase Global Following
Become Strong Networking and Education Resource for YOU

A (very) little bit about me…

Consultant (Management & Technology)
Open Source Evangelist
Full-spectrum data nerd

A little about you!

Rate yourself (1 – 10) on Mahout
Rate yourself (1 – 10) on Machine Learning/Data Mining
Rate yourself (1 – 10) on Big Data/Hadoop

Please wait… optimizing presentation…

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

-- Tom M. Mitchell, 1997

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.

-- Ian H. Witten & Eibe Frank, 2005

If you’re in academia, you call it “machine learning.” If you’re in business, you call it “data mining.”

-- Mark Hall

I create or improve general-purpose algorithms for machine learning

I use multiple machine learning algorithms for practical data discovery

Source: xkcd

Machine Learning Uses

Clustering

Classification

Recommendation

Machine Learning Algorithms

Regression
K-means Clustering
K-NN
CART
Neural Networks
Support Vector Machines
Association Rules
Principal Component Analysis
Singular Value Decomposition
Ensemble Methods
Naïve Bayes
…

Real-World Applications

Recommender systems
Image recognition
Signal processing
Propensity to buy/churn
Fraud analysis
Text analytics
Spam filtering
Forecasting methods
Revenue management
…

The Problem … and Opportunity

Big Data™

“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”

-- Omar Tawakol

Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)

-- Ian Witten

Potential Solutions

Expand RAM
Use incremental algorithms
Use distributable algorithms

Scale Up

Scale Out

Hadoop in 30 seconds

Map (K,V)

Shuffle / Sort

Reduce

Output
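The four stages above can be imitated in a single process. The sketch below is plain Python, not Hadoop code; the word-count task and the function names (map_phase, shuffle_sort, reduce_phase, word_count) are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)


def shuffle_sort(pairs):
    """Shuffle / Sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single result."""
    for key, values in grouped:
        yield (key, sum(values))


def word_count(records):
    """Output: the reduced (word, count) pairs, in key order."""
    return list(reduce_phase(shuffle_sort(map_phase(records))))


counts = word_count(["big data", "Big ideas"])
```

On a real cluster the map and reduce functions run on many machines at once and the framework performs the shuffle; only the shape of the data flow carries over from this sketch.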

Finally -- Mahout

A Java-based library of machine learning algorithms designed to support distributed processing

Initially on MapReduce, now leaning heavily towards Spark

Primarily focused on Recommenders, Clustering, and Classification spaces.

Running Mahout

Locally – download the Mahout distro.

/bin/mahout is the wrapper script; run with no arguments, it lists all the example programs available.

Lots of tools are included to convert data into vector formats and pre-process text. They are worth a look.

Amazon EC2 – configure the stack from scratch on EC2 servers.

Amazon EMR – quicker start; a lot of the build is already optimized for MapReduce jobs. Just add Mahout as a custom jar and pass the script as a parameter.

Running Recommenders

Multiple Recommender Algorithms
User-based
Item-based

A Recommender Needs:
DataModel (e.g. FileDataModel)
Similarity driver (e.g. PearsonCorrelationSimilarity)
Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
Recommender
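To show how those four pieces fit together, here is a minimal user-based sketch in plain Python. It is not Mahout's API: the dictionary stands in for a data model, pearson() for the similarity, nearest_n() for a nearest-N neighborhood, and recommend() for the recommender. All names and the sample ratings are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def nearest_n(data, user, n):
    """Neighborhood: the n most similar users with positive similarity."""
    scored = [(pearson(data[user], prefs), other)
              for other, prefs in data.items() if other != user]
    scored = [(sim, other) for sim, other in scored if sim > 0]
    scored.sort(reverse=True)
    return scored[:n]


def recommend(data, user, n_neighbors=2, how_many=3):
    """Recommender: similarity-weighted average rating of unseen items."""
    totals, weights = {}, {}
    for sim, other in nearest_n(data, user, n_neighbors):
        for item, rating in data[other].items():
            if item in data[user]:
                continue  # skip items the user has already rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:how_many]]


# Data model stand-in: user -> {item: preference value} (made-up sample data)
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
}
suggestions = recommend(ratings, "alice")
```

Swapping the similarity function or the neighborhood rule (for example, a similarity threshold instead of a fixed n) changes the results without touching the rest, which is the reason the components are kept separate.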

Running Recommenders

Tip: If your data has no preference values (only which items each user is associated with), there are Boolean equivalents of the recommender classes

Evaluate user vs. item similarities

Example

Clustering Algorithms

To cluster you need:
Location in n-dimensional space
Distance metric
Threshold

K-means
Canopy
Dirichlet
Fuzzy K-means
Spectral Clustering
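As a rough illustration of the three ingredients, here is a small single-machine k-means in plain Python (not Mahout's implementation). Points are tuples in n-dimensional space, the distance metric is Euclidean, and the threshold is read here as a convergence delta, which is an assumption about what the slide means. Seeding the centroids with the first k points is also a simplification.

```python
from math import sqrt


def euclidean(p, q):
    """Distance metric: straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, threshold=1e-6):
    """Return (centroids, clusters) for the given points."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        shift = max(euclidean(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:  # stop once no centroid moved more than the threshold
            break
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, k=2)
```

The assignment step is independent per point, which is what makes the algorithm distributable: mappers can assign points to centroids and reducers can recompute each centroid.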

Clustering

Clustering Text

Identify k topics in a document corpus
Requires conversion of text into vectors
Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.

seqdirectory – from a directory of text files

lucene.vector – from a Lucene index
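What "conversion of text into vectors" means can be sketched in a few lines of plain Python. This is not what the Mahout or Lucene utilities do internally; the stop-word list is a tiny made-up sample and the weighting is a textbook TF-IDF (term count times log of inverse document frequency), both assumptions for illustration.

```python
from collections import Counter
from math import log

# Tiny illustrative stop-word list (an assumption, not Lucene's list)
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF weighted vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] * log(n / df[w]) for w in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["the cat sat", "the dog sat"])
```

Each document becomes a point in a space with one dimension per vocabulary term, which is the "location in n-dimensional space" that clustering needs. Terms that appear in every document get weight zero and so do not influence distances.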

Classifiers

NaïveBayes
RandomForests
LogisticRegression (SGD)
HiddenMarkov

Example: 20 Newsgroups
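For a feel of the first classifier in the list, here is a toy multinomial Naïve Bayes text classifier in plain Python with add-one smoothing. It is a sketch of the general technique, not Mahout's implementation, and the four training sentences are invented stand-ins for a labeled corpus such as 20 Newsgroups.

```python
from collections import Counter, defaultdict
from math import log


def train(labeled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Pick the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (label, text)
training = [
    ("sports", "the team won the game"),
    ("sports", "a great game by the team"),
    ("tech", "new phone with a fast chip"),
    ("tech", "the chip runs the new software"),
]
model = train(training)
```

Training is just counting, so it splits naturally across machines: each mapper counts words for its share of the documents and a reducer sums the counts.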

Sidebar : Risks of Big Data Unsupervised Learning
