Classification on Mahout Naoki Nakatani San Jose State University CS185C Spring 2014

Mahout classification presentation

DESCRIPTION

These slides were presented in class on April 7th, 2014.

Page 1: Mahout classification presentation

Classification on Mahout
Naoki Nakatani
San Jose State University

CS185C Spring 2014

Page 2: Mahout classification presentation

Agenda

● Classification Overview
● Mahout Overview
  ○ Classification on Mahout
● Case Study with Demo
  ○ Problem Description
  ○ Working Environment
  ○ Data Preparation
  ○ ML Model Generation

Page 3: Mahout classification presentation

Classification?
● Classifying examples into a given set of categories
● Supervised learning
  ○ Prepare data
  ○ Build classifier (train & test)
  ○ Apply classifier to new data

http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg

Page 4: Mahout classification presentation

Mahout?
● Scalable machine learning library = can handle Big Data
● Runs on Hadoop, with data stored in HDFS
● Classification, Clustering, Collaborative Filtering, etc.

http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png

Page 5: Mahout classification presentation

Classification on Mahout?
● Classification: classifying examples into a given set of categories
● Mahout: scalable machine learning library that can handle big data
● Classification on Mahout: classifying big data into a given set of categories

Page 6: Mahout classification presentation

Case Study & Demo

Given question with title and body, can we automatically generate tags for it?

Title: Where can I find the LaTeX3 manual?

Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?

Tags: Documentation, latex3, expl3

Page 7: Mahout classification presentation

Dataset
File:
● TrainSmall.tsv
Fields:
● id, title, body, tags
Characteristics:
● Each question contains only one tag

[Figure: three sample records, each drawn as \0 followed by four quoted placeholder fields: “----” , ”-----------” , “------------------------” , “--- --- --- ---”]

Page 8: Mahout classification presentation

Working Environment

● Mac OS X 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.

Page 9: Mahout classification presentation

Prerequisite (Where are you?)
● You have the input TSV file at result > output-topfivetags.
● You are in the “result” directory in Terminal.
● The “hadoop” and “mahout” commands work.

Page 10: Mahout classification presentation

Prepare Data
1. Convert the TSV file to Hadoop sequence file format.
   Specify the tag as the category. (Run TSVToSeq.java)
   The output-tsvtoseq folder and the chunk-0 file are created.
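The source of TSVToSeq.java is not shown in these slides, so the following is only a sketch of the kind of key/value pairs such a converter might write. It assumes the common Mahout convention in which the category is the first path component of the key (for example /latex3/1), so that label extraction can read it back from the key later. The function name rows_to_pairs and the sample row are hypothetical, and plain Python tuples stand in for sequence file records.

```python
import csv
import io


def rows_to_pairs(tsv_text):
    """Yield (key, value) pairs from TSV text with columns id, title, body, tags.

    Assumption: the key is "/<tag>/<id>" and the value is the title
    followed by the body. The real TSVToSeq.java may differ.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows instead of guessing at their layout
        doc_id, title, body, tag = (field.strip() for field in row)
        yield "/{}/{}".format(tag, doc_id), "{} {}".format(title, body)


# Hypothetical sample row, modeled on the question shown earlier in the deck.
sample = "1\tWhere can I find the LaTeX3 manual?\tDoes anyone have a link?\tlatex3\n"
pairs = list(rows_to_pairs(sample))
print(pairs)
```

Putting the tag in the key keeps the text value free of label information, which matters because the value is what gets turned into features.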

Page 11: Mahout classification presentation

Prepare Data
1. Make a directory in HDFS and upload chunk-0 (the sequence file) to that directory.

Page 12: Mahout classification presentation

hadoop fs -mkdir <directory>

Page 13: Mahout classification presentation

hadoop fs -put <source> <destination>

Page 14: Mahout classification presentation

Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)

Page 15: Mahout classification presentation

mahout seq2sparse -i <input directory> -o <output directory>
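To make "transform questions into vectors" concrete, here is a minimal sketch of TF-IDF weighting, the kind of term weighting seq2sparse produces unless configured otherwise. It uses the textbook formula tf × log(N / df) and whitespace tokenization. Both are simplifying assumptions and not Mahout's exact implementation. The function name tfidf_vectors and the two sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors


docs = ["latex3 manual pdf", "latex3 syntax question"]
vecs = tfidf_vectors(docs)
for vec in vecs:
    print(vec)
```

A term that appears in every document, such as "latex3" here, gets weight zero, which is why TF-IDF tends to push attention toward words that distinguish one question from another.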

Page 16: Mahout classification presentation
Page 17: Mahout classification presentation

Prepare Data
3. Split data into
   a. Train set: to train the model
   b. Test set: to test the model

Page 18: Mahout classification presentation

mahout split \
  -i <input directory> \
  --trainingOutput <output dir to train> \
  --testOutput <output dir to test> \
  --randomSelectionPct <integer> \
  --overwrite \
  --sequenceFiles \
  -xm sequential
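The idea behind --randomSelectionPct can be sketched in a few lines: each item goes to the test set with the given percentage probability, and to the train set otherwise. This is an illustration of random percentage splitting in general. Whether Mahout selects items in exactly this way is not confirmed here, and the function name split_by_pct and the fixed seed are assumptions.

```python
import random


def split_by_pct(items, test_pct, seed=42):
    """Randomly send roughly test_pct percent of items to the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for item in items:
        if rng.random() * 100 < test_pct:
            test.append(item)
        else:
            train.append(item)
    return train, test


train, test = split_by_pct(list(range(1000)), test_pct=20)
print(len(train), len(test))
```

Because the choice is made per item, the test set size is only approximately the requested percentage. The two sets never overlap.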

Page 19: Mahout classification presentation
Page 20: Mahout classification presentation

Build Classifier
1. Choose an algorithm to use for classification
Available algorithms:
○ Naive Bayes
  ■ trainnb, testnb
  ■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
  ■ baumwelch, hmmpredict
  ■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
  ■ trainlogistic, runlogistic
  ■ org.apache.mahout.classifier.sgd
○ Random Forest
  ■ ?
  ■ ?

Page 21: Mahout classification presentation

Build Classifier (Naive Bayes)
2. Train the model on the train set, then test it on the same train set
   This should yield high accuracy

Page 22: Mahout classification presentation

mahout trainnb \
  -i <dir to train vectors> \
  -el \
  -li <dir to put label index> \
  -o <dir to put model> \
  -ow \
  -c
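For intuition about what training produces, here is a toy multinomial naive Bayes with add-one smoothing. It is a sketch of the general technique only: the command above passes -c, which selects Mahout's complement variant, and this plain version does not reproduce that. The function names train_nb and classify and the four training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Train on a list of (label, tokens); return (log_prior, log_likelihood, vocab)."""
    label_counts = Counter(label for label, _ in examples)
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {label: math.log(count / len(examples))
                 for label, count in label_counts.items()}
    log_like = {}
    for label in label_counts:
        # Add-one smoothing: every vocabulary term gets one extra count.
        denom = sum(token_counts[label].values()) + len(vocab)
        log_like[label] = {term: math.log((token_counts[label][term] + 1) / denom)
                           for term in vocab}
    return log_prior, log_like, vocab


def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    log_prior, log_like, vocab = model

    def score(label):
        # Terms never seen in training are ignored.
        return log_prior[label] + sum(log_like[label][t] for t in tokens if t in vocab)

    return max(log_prior, key=score)


training = [
    ("latex3", ["latex3", "manual", "expl3"]),
    ("latex3", ["expl3", "syntax"]),
    ("fonts", ["font", "size", "bold"]),
    ("fonts", ["font", "family"]),
]
model = train_nb(training)
print(classify(model, ["expl3", "manual"]))
print(classify(model, ["font", "bold"]))
```

The returned priors and per-label term likelihoods play the role of the model, and the set of labels plays the role of the label index.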

Page 23: Mahout classification presentation
Page 24: Mahout classification presentation

mahout testnb \
  -i <dir to train vectors> \
  -m <dir to model> \
  -l <dir to label index> \
  -ow \
  -o <output dir> \
  -c

Page 25: Mahout classification presentation
Page 26: Mahout classification presentation

Build Classifier (Naive Bayes)
3. Test the model using the test set
   Check whether the accuracy is satisfactory
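Checking accuracy amounts to comparing predicted tags with actual tags on the test set. The sketch below computes accuracy and a confusion matrix, the same kind of summary a classifier test run reports. The function name evaluate and the sample labels are hypothetical, and no real results from the demo are reproduced here.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion), where confusion counts (actual, predicted) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical labels for four test questions.
actual = ["latex3", "latex3", "fonts", "fonts"]
predicted = ["latex3", "fonts", "fonts", "fonts"]
accuracy, confusion = evaluate(actual, predicted)
print(accuracy)
print(dict(confusion))
```

If accuracy on the train set is high but accuracy on the test set is much lower, the model is likely overfitting, which is why both checks appear in the workflow.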

Page 27: Mahout classification presentation
Page 28: Mahout classification presentation

Apply Classifier
What do you have at this point?
● Model
● Label index

You can start classifying new data! (Check this example)
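How the model and the label index work together can be sketched as follows: the model produces one score per category position, and the label index maps the best-scoring position back to a tag name. The scores and the index contents below are made-up placeholders, and best_label is a hypothetical helper, not part of Mahout.

```python
def best_label(scores, label_index):
    """Return the label whose position has the highest score."""
    best_position = max(range(len(scores)), key=scores.__getitem__)
    return label_index[best_position]


# Placeholder values: a label index (position -> tag) and per-category scores.
label_index = {0: "Documentation", 1: "latex3", 2: "expl3"}
scores = [-12.5, -3.2, -7.9]
print(best_label(scores, label_index))
```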

Page 30: Mahout classification presentation

Happy Machine Learning!