Machine learning with Mahout

Sandeep Giri | Hadoop

• A scalable machine learning library built on Hadoop, written in Java.
• Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”.
• Started as a Lucene sub-project. Became an Apache TLP in Apr ’10.
• Mahout: Keeper/Driver of Elephants.

http://www.knowbigdata.com

MACHINE LEARNING

“Programming computers to optimize a performance criterion using example data or past experience”

• Branch of Artificial Intelligence
• Design and development of algorithms
• Computers evolve behaviour based on empirical data

MACHINE LEARNING - TYPES

Supervised Learning
Using labeled training data to create a classifier that can predict output for unseen inputs.

Unsupervised Learning
Using unlabeled training data to create a function that describes hidden structure in the data (for example, clusters of similar items).

Semi-Supervised Learning
Making use of unlabeled data for training: typically a small amount of labeled data with a large amount of unlabeled data.

MACHINE LEARNING - APPLICATIONS

• Recommend friends, dates, products to the end user.
• Classify content into pre-defined groups.
• Find similar content based on object properties.
• Identify key topics in large collections of text.
• Detect anomalies within given data.
• Rank search results with user feedback learning.
• Classify DNA sequences.
• Sentiment analysis / opinion mining.
• Computer vision.
• Natural language processing.
• Bioinformatics.
• Speech and handwriting recognition.

MACHINE LEARNING - TOOLS

DATA SIZE        CLASSIFICATION   TOOLS
Lines            Sample Data      Analysis and Visualization: Whiteboard, …
KBs - low MBs    Prototype Data   Analysis and Visualization: Matlab, Octave, R, Processing
MBs - low GBs    Online Data      Analysis: NumPy, SciPy, Weka, BLAS/LAPACK
                                  Visualization: Flare, AmCharts, Raphael, Protovis
GBs - TBs - PBs  Big Data         Analysis: Mahout, Giraph, MLlib

MAHOUT - RECOMMENDATION EXAMPLE

REVIEWS

PERSON    MOVIES                RATING
Sandeep   Matrix                9
Sandeep   Casablanca            9
Kumar     Matrix                9
Kumar     Shawshank Redemption  8.5
…

RECOMMENDATION

PERSON    MOVIES                RATING
Kumar     Casablanca            9
Sandeep   Shawshank Redemption  8.5
…

Mahout - SIMILARITY_LOGLIKELIHOOD
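The SIMILARITY_LOGLIKELIHOOD measure scores how much more often two items are rated by the same users than chance would predict. The sketch below follows Dunning's log-likelihood ratio over a 2x2 contingency table of user counts. The function names, and the mapping of the ratio to a 0..1 similarity as 1 - 1/(1 + LLR), are assumptions made for illustration; this is not a copy of Mahout's code, and its implementation may differ in details.

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    k11: users who rated both items
    k12: users who rated only the first item
    k21: users who rated only the second item
    k22: users who rated neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def log_likelihood_similarity(k11, k12, k21, k22):
    """Map the ratio into [0, 1); assumed mapping, for illustration only."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


if __name__ == "__main__":
    # Hypothetical counts: 10 users rated both items, 10 rated neither.
    print(log_likelihood_similarity(10, 0, 0, 10))
```

A table whose rows and columns are independent (for example 1, 1, 1, 1) yields a ratio of zero, so the two items are treated as unrelated.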

MAHOUT - USE CASES

Collaborative filtering
Mines user behaviour and makes product recommendations (e.g. Amazon recommendations).

Clustering
Takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.

Classification
Learns from existing categorisations and then assigns unclassified items to the best category.

Frequent itemset mining
Analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
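To make the frequent itemset idea concrete, here is a minimal sketch that counts how often pairs of items appear in the same group. It is a naive in-memory pair count, not a scalable algorithm and not Mahout's implementation; the function names, the sample baskets, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_item_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts


def frequent_pairs(baskets, min_support):
    """Keep only the pairs seen in at least min_support baskets."""
    counts = count_item_pairs(baskets)
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Hypothetical shopping carts.
    carts = [
        ["milk", "bread"],
        ["milk", "bread", "eggs"],
        ["eggs", "milk"],
    ]
    print(frequent_pairs(carts, min_support=2))
```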

MAHOUT LIVE EXAMPLE

1. Download the dataset from MovieLens
    cd sgiri/
    wget -nc http://www.grouplens.org/system/files/ml-1m.zip

2. Unzip
    unzip ml-1m.zip

3. Prepare comma-separated format of: user, movie, rating
    cat ml-1m/ratings.dat | awk -F:: '{print $1","$2","$3}' > ratings.csv
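The awk step turns each `UserID::MovieID::Rating::Timestamp` line of ratings.dat into `user,movie,rating`. For readers who prefer to see the transformation spelled out, the following is a rough Python equivalent. The function names are hypothetical, and the file paths are the ones used in the steps above.

```python
def ratings_line_to_csv(line):
    """Convert 'user::movie::rating::timestamp' to 'user,movie,rating'."""
    user_id, movie_id, rating = line.rstrip("\n").split("::")[:3]
    return f"{user_id},{movie_id},{rating}"


def convert_ratings(src_path, dst_path):
    """Write a CSV copy of the ratings file, dropping the timestamp."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(ratings_line_to_csv(line) + "\n")


if __name__ == "__main__":
    convert_ratings("ml-1m/ratings.dat", "ratings.csv")
```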

MAHOUT LIVE EXAMPLE

4. Copy data to Hadoop
    hadoop fs -copyFromLocal ratings.csv sgiri/

5. Run Mahout
    export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
    mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD \
        -i sgiri/ratings.csv -o sgiri/recos

MAHOUT LIVE EXAMPLE

6. Print results (from the output directory given with -o in step 5)
    hadoop fs -cat sgiri/recos/part-r-00000
    ===========
    6029 [2712:5.0, 1358:5.0, 1283:5.0, ...]
    6030 [349:5.0, 2478:5.0, 724:5.0, ...]
    6031 [805:5.0, 2478:5.0, 3108:5.0, ...]
    6032 [3507:5.0, 3097:5.0, 3614:5.0, ...]
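Each output line pairs a user ID with a bracketed list of `item:score` recommendations. A small parser along the following lines could load those results for further use. The function name is hypothetical, and the parser assumes only the layout visible in the sample output, skipping any literal `...` placeholder.

```python
def parse_recommendations(line):
    """Parse 'user [item:score, item:score, ...]' into (user, [(item, score)])."""
    user_part, items_part = line.strip().split(None, 1)
    recommendations = []
    for entry in items_part.strip().strip("[]").split(","):
        entry = entry.strip()
        if not entry or entry == "...":
            continue  # ignore empty fields and the elision marker
        item_id, score = entry.split(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_part), recommendations


if __name__ == "__main__":
    user, recs = parse_recommendations("6029 [2712:5.0, 1358:5.0, 1283:5.0, ...]")
    print(user, recs)
```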

MAHOUT - OTHER RECOMMENDER ALGOS

SIMILARITY_LOGLIKELIHOOD
SIMILARITY_COOCCURRENCE
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_EUCLIDEAN_DISTANCE
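As a rough illustration of what two of these measures compute, the sketch below implements the textbook Tanimoto coefficient (over the sets of users who rated each item) and cosine similarity (over the rating vectors). The function names and input shapes are assumptions for the example; Mahout's own variants may differ in details such as normalization.

```python
import math


def tanimoto(users_a, users_b):
    """Shared users divided by all users who rated either item."""
    a, b = set(users_a), set(users_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(ratings_a, ratings_b):
    """Cosine of the angle between two {user: rating} vectors."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical users and ratings for two items.
    print(tanimoto({"u1", "u2", "u3"}, {"u2", "u3", "u4"}))
    print(cosine({"u1": 3.0, "u2": 4.0}, {"u1": 3.0, "u2": 4.0}))
```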
