Big Data Analytics using Mahout


Big Data Analytics Using Mahout

Assoc. Prof. Dr. Thanachart Numnonda
Executive Director, IMC Institute
April 2015

2

Mahout

3

What is Mahout?

Mahout is a Java library implementing machine learning techniques for clustering, classification, and recommendation.

4

Mahout in Apache Software

5

Why Mahout?

Apache License

Good Community

Good Documentation

Scalable

Extensible

Command Line Interface

Java Library

6

List of Algorithms

7

List of Algorithms

8

List of Algorithms

9

Mahout Architecture

10

Use Cases

11

Installing Mahout

12

13

Select the EC2 service and click on Launch Instance

14

Choose My AMIs and select “Hadoop Lab Image”

15

Choose the m3.medium instance type

16

Leave configuration details as default

17

Add Storage: 20 GB

18

Name the instance

19

Select an existing security group > Select Security Group Name: default

20

Click Launch and choose imchadoop as a key pair

21

Review the instance, then click Connect for instructions on connecting to it

22

Connect to an instance from Mac/Linux

23

Connect to an instance from Windows using Putty

24

Connect to the instance

25

Install Maven

$ sudo apt-get install maven

$ mvn -v

26

Install Subversion

$ sudo apt-get install subversion

$ svn --version

27

Install Mahout

$ cd /usr/local/

$ sudo mkdir mahout

$ cd mahout

$ sudo svn co http://svn.apache.org/repos/asf/mahout/trunk

$ cd trunk

$ sudo mvn install -DskipTests

28

Install Mahout (cont.)

29

Edit the .bashrc file

$ sudo vi $HOME/.bashrc

$ exec bash
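The lines added to .bashrc for Mahout typically look like the following; the MAHOUT_HOME path assumes the SVN checkout above, so adjust it if your layout differs.

export MAHOUT_HOME=/usr/local/mahout/trunk
export PATH=$PATH:$MAHOUT_HOME/bin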

30

Running Recommendation Algorithms

31

MovieLens: http://grouplens.org/datasets/movielens/

32

Architecture for Recommender Engine

33

Item-Based Recommendation

Step 1: Gather some test data

Step 2: Pick a similarity measure

Step 3: Configure the Mahout command

Step 4: Make use of the output and do more with Mahout

34

Preparing MovieLens data

$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

$ unzip ml-100k.zip

$ hadoop fs -mkdir /input

$ hadoop fs -put u.data /input/u.data

$ hadoop fs -mkdir /results

$ unset MAHOUT_LOCAL
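For reference, each line of u.data is a tab-separated rating record of the form userID, itemID, rating, timestamp, for example:

196	242	3	881250949
186	302	3	891717742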

35

Running Recommend Command

$ mahout recommenditembased -i /input/u.data -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --tempDir /temp/recommend1

$ hadoop fs -ls /results/itemRecom.txt

36

View the result

$ hadoop fs -cat /results/itemRecom.txt/part-r-00000
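Each output line pairs a user ID with its top recommended item IDs and estimated preference scores, in a form similar to the following (the IDs and scores here are only illustrative):

1	[234:4.9,108:4.7,56:4.5]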

37

Similarity Classname

SIMILARITY_COOCCURRENCE

SIMILARITY_LOGLIKELIHOOD

SIMILARITY_TANIMOTO_COEFFICIENT

SIMILARITY_CITY_BLOCK

SIMILARITY_COSINE

SIMILARITY_PEARSON_CORRELATION

SIMILARITY_EUCLIDEAN_DISTANCE
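For example, to rerun the job with cosine similarity instead of log-likelihood (the output and temp paths below are only illustrative):

$ mahout recommenditembased -i /input/u.data -o /results/itemRecomCosine.txt -s SIMILARITY_COSINE --tempDir /temp/recommend2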

38

Running Recommendation on a Single Machine

$ export MAHOUT_LOCAL=true

$ mahout recommenditembased -i ml-100k/u.data -o results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --numRecommendations 5

$ cat results/itemRecom.txt/part-r-00000

39

Running Example Program

Using the CBayes classifier

40

Running Example Program

41

Preparing data

$ export WORK_DIR=/tmp/mahout-work-${USER}

$ mkdir -p ${WORK_DIR}

$ mkdir -p ${WORK_DIR}/20news-bydate

$ cd ${WORK_DIR}/20news-bydate

$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

$ tar -xzf 20news-bydate.tar.gz

$ mkdir ${WORK_DIR}/20news-all

$ cd

$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

42

Note: Running on MapReduce

If you want to run in MapReduce mode, you need to run the following commands before running the feature extraction commands:

$ unset MAHOUT_LOCAL

$ hadoop fs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all

43

Preparing the Sequence File

Mahout provides a utility to convert the given input files into sequence file format.

The input directory is where the original data resides.

The output directory is where the converted sequence files are to be stored.
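The utility is the seqdirectory job; a typical invocation, with placeholder paths, looks like this:

$ mahout seqdirectory -i <input directory> -o <output directory> -c UTF-8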

44

Sequence Files

Sequence files are a binary encoding of key/value pairs. There is a header at the top of the file containing metadata, which includes:

– Version

– Key name

– Value name

– Compression

To view a sequence file:

mahout seqdumper -i <input file> | more

45

Generate Vectors from Sequence Files

Mahout provides a command to create vector files from sequence files.

mahout seq2sparse -i <input file path> -o <output file path>

Important Options:

-lnorm Whether output vectors should be log-normalized

-nv Whether output vectors should be NamedVectors

-wt The kind of weight to use, currently TF or TFIDF. Default: TFIDF
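Putting these options together, a typical run that produces log-normalized, named TF-IDF vectors (paths are placeholders) is:

$ mahout seq2sparse -i <sequence file dir> -o <vector dir> -lnorm -nv -wt tfidf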

46

Extract Features

Convert the full 20 newsgroups dataset into a <Text, Text> SequenceFile.

Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
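A sketch of these two steps, following the standard Mahout 20 newsgroups example (the output directory names under ${WORK_DIR} are illustrative):

$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow

$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf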

47

Prepare Testing Dataset

Split the preprocessed dataset into training and testing sets.
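A typical split using Mahout's split utility holds out a percentage of the vectors for testing; directory names follow the sketch above and are illustrative:

$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential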

48

Training process

Train the classifier.
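A sketch of the training command for Complementary Naive Bayes; the -c flag selects CBayes, and directory names are illustrative:

$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c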

49

Testing the result

Test the classifier.
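A matching test command (directory names are illustrative) that prints the accuracy summary and confusion matrix:

$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c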

50

Dumping a vector file

We can dump vector files to plain text, as follows:

mahout vectordump -i <input file> -o <output file>

Options:

--useKey If the key is a vector, then dump that instead

--csv Output the vector as CSV

--dictionary The dictionary file
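For example, to dump TF-IDF vectors with terms resolved through the dictionary (paths are illustrative):

$ mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 -d ${WORK_DIR}/20news-vectors/dictionary.file-0 -dt sequencefile -o /tmp/vectordump.txt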

51

Sample Output

52

Command line options

53

Command line options

54

Command line options

55

K-means clustering

56

Reuters Newswire

57

Preparing data

$ export WORK_DIR=/tmp/kmeans

$ mkdir $WORK_DIR

$ mkdir $WORK_DIR/reuters-out

$ cd $WORK_DIR

$ wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

$ mkdir $WORK_DIR/reuters-sgm

$ tar -xzf reuters21578.tar.gz -C $WORK_DIR/reuters-sgm

58

Convert input to a sequence file

$ mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/reuters-sgm $WORK_DIR/reuters-out

59

Convert input to a sequence file (cont.)

$ mahout seqdirectory -i $WORK_DIR/reuters-out -o $WORK_DIR/reuters-out-seqdir -c UTF-8 -chunk 5

60

Create the sparse vector files

$ mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

61

Running K-Means

$ mahout kmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c $WORK_DIR/reuters-kmeans-clusters -o $WORK_DIR/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow

62

K-Means command line options

63

Viewing Result

$ mkdir $WORK_DIR/reuters-kmeans/clusteredPoints

$ mahout clusterdump -i $WORK_DIR/reuters-kmeans/clusters-*-final -o $WORK_DIR/reuters-kmeans/clusterdump -d $WORK_DIR/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 --pointsDir $WORK_DIR/reuters-kmeans/clusteredPoints

64

Viewing Result

65

Dumping a cluster file

We can dump cluster files to plain text, as follows:

mahout clusterdump -i <input file> -o <output file>

Options:

-of The output format for the results: TEXT, CSV, JSON, or GRAPH_ML

-dt The dictionary file type

--evaluate Run the ClusterEvaluator

66

Canopy Clustering
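As a sketch, canopy clustering can be run over the same Reuters TF-IDF vectors; the -t1/-t2 distance thresholds and the output path below are illustrative and need tuning:

$ mahout canopy -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -o $WORK_DIR/reuters-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 0.5 -t2 0.3 -ow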

67

Fuzzy k-means Clustering
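Similarly, a sketch of fuzzy k-means over the same vectors; the fuzziness factor -m and the paths below are illustrative:

$ mahout fkmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c $WORK_DIR/reuters-fkmeans-clusters -o $WORK_DIR/reuters-fkmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -m 1.1 -ow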

68

Command line options

69

Exercise: Traffic Accidents Dataset

http://fimi.ua.ac.be/data/accidents.dat.gz

70

Import-Export RDBMS data

71

Sqoop Hands-On Labs

1. Loading Data into MySQL DB

2. Installing Sqoop

3. Configuring Sqoop

4. Installing DB driver for Sqoop

5. Importing data from MySQL to Hive Table

6. Reviewing data from Hive Table

7. Reviewing HDFS Database Table files


1. MySQL RDS Server on AWS

An RDS server is running on AWS with the following configuration:

> database: imc_db

> username: admin

> password: imcinstitute

>addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com

[This address may change]
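Assuming the mysql command-line client is installed on the lab instance, the connection can be verified as follows (enter the password above when prompted):

$ mysql -h imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com -u admin -p imc_db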

73

1. country_tbl data

Testing data query from MySQL DB

Table name > country_tbl

74

2. Installing Sqoop

# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/

# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

75

Installing Sqoop (cont.): Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc
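The entries added for Sqoop typically look like the following; the path matches the install location above:

export SQOOP_HOME=/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin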

76

3. Configuring Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/

ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
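In sqoop-env.sh, point Sqoop at the local Hadoop and Hive installations; the exact paths depend on the lab image, so the values below are placeholders:

export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive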

77

4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit

78

5. Importing data from MySQL to Hive Table

[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Warning: $HADOOP_HOME is deprecated.

Enter password: <enter here>

79

6. Reviewing data from Hive Table
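One way to review the imported data is from the Hive CLI; the table name matches the --hive-table option used above:

$ hive -e 'SELECT * FROM country LIMIT 10;'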

80

7. Reviewing HDFS Database Table files

Open a web browser to http://54.68.149.232:50070, then navigate to /user/hive/warehouse
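Alternatively, the table files can be listed from the command line; the directory name follows the Hive table name:

$ hadoop fs -ls /user/hive/warehouse/country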

81

Sqoop commands

82

Recommended Books

83

www.facebook.com/imcinstitute

84

Thank you

thanachart@imcinstitute.com
www.facebook.com/imcinstitute
www.slideshare.net/imcinstitute
www.thanachart.org
