Big Data Analytics Using Mahout. Assoc. Prof. Dr. Thanachart Numnonda, Executive Director, IMC Institute, April 2015


Page 1: Big Data Analytics using Mahout

Big Data Analytics Using Mahout

Assoc. Prof. Dr. Thanachart Numnonda, Executive Director, IMC Institute, April 2015

Page 2: Big Data Analytics using Mahout

2

Mahout

Page 3: Big Data Analytics using Mahout

3

Mahout is a Java library that implements machine learning techniques for

clustering, classification, and recommendation

What is Mahout?

Page 4: Big Data Analytics using Mahout

4

Mahout in Apache Software

Page 5: Big Data Analytics using Mahout

5

Why Mahout?

Apache License

Good Community

Good Documentation

Scalable

Extensible

Command Line Interface

Java Library

Page 6: Big Data Analytics using Mahout

6

List of Algorithms

Page 7: Big Data Analytics using Mahout

7

List of Algorithms

Page 8: Big Data Analytics using Mahout

8

List of Algorithms

Page 9: Big Data Analytics using Mahout

9

Mahout Architecture

Page 10: Big Data Analytics using Mahout

10

Use Cases

Page 11: Big Data Analytics using Mahout

11

Installing Mahout

Page 12: Big Data Analytics using Mahout

12

Page 13: Big Data Analytics using Mahout

13

Select the EC2 service and click on Launch Instance

Page 14: Big Data Analytics using Mahout

14

Choose My AMIs and select “Hadoop Lab Image”

Page 15: Big Data Analytics using Mahout

15

Choose the m3.medium instance type

Page 16: Big Data Analytics using Mahout

16

Leave configuration details as default

Page 17: Big Data Analytics using Mahout

17

Add Storage: 20 GB

Page 18: Big Data Analytics using Mahout

18

Name the instance

Page 19: Big Data Analytics using Mahout

19

Select an existing security group > Select SecurityGroup Name: default

Page 20: Big Data Analytics using Mahout

20

Click Launch and choose imchadoop as a key pair

Page 21: Big Data Analytics using Mahout

21

Review the instance, then click Connect for instructions on connecting to it

Page 22: Big Data Analytics using Mahout

22

Connect to an instance from Mac/Linux

Page 23: Big Data Analytics using Mahout

23

Connect to an instance from Windows using Putty

Page 24: Big Data Analytics using Mahout

24

Connect to the instance

Page 25: Big Data Analytics using Mahout

25

Install Maven

$ sudo apt-get install maven

$ mvn -v

Page 26: Big Data Analytics using Mahout

26

Install Subversion

$ sudo apt-get install subversion

$ svn --version

Page 27: Big Data Analytics using Mahout

27

Install Mahout

$ cd /usr/local/

$ sudo mkdir mahout

$ cd mahout

$ sudo svn co http://svn.apache.org/repos/asf/mahout/trunk

$ cd trunk

$ sudo mvn install -DskipTests

Page 28: Big Data Analytics using Mahout

28

Install Mahout (cont.)

Page 29: Big Data Analytics using Mahout

29

Edit the .bashrc file

$ sudo vi $HOME/.bashrc

$ exec bash

Page 30: Big Data Analytics using Mahout

30

Running Recommendation Algorithms

Page 31: Big Data Analytics using Mahout

31

MovieLens: http://grouplens.org/datasets/movielens/

Page 32: Big Data Analytics using Mahout

32

Architecture for Recommender Engine

Page 33: Big Data Analytics using Mahout

33

Item-Based Recommendation

Step 1: Gather some test data

Step 2: Pick a similarity measure

Step 3: Configure the Mahout command

Step 4: Making use of the output and doing more with Mahout
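The four steps above can be sketched in miniature. This is not Mahout's implementation — raw co-occurrence counts stand in here for the pluggable similarity measure picked in step 2 — but it shows the shape of item-based scoring:

```python
from collections import defaultdict

def recommend_item_based(prefs, user, top_n=3):
    """prefs: {user: {item: rating}}. Score items the user has not
    rated by co-occurrence-weighted sums of their rated items."""
    # Steps 1-2: build a co-occurrence "similarity" between item pairs
    cooc = defaultdict(lambda: defaultdict(int))
    for ratings in prefs.values():
        items = list(ratings)
        for a in items:
            for b in items:
                if a != b:
                    cooc[a][b] += 1
    # Step 3: weighted sum of the user's ratings over similar items
    scores = defaultdict(float)
    for item, rating in prefs[user].items():
        for other, sim in cooc[item].items():
            if other not in prefs[user]:
                scores[other] += sim * rating
    # Step 4: return the highest-scoring unseen items
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

prefs = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 4, "C": 5},
    "u3": {"B": 2, "C": 4},
}
print(recommend_item_based(prefs, "u1"))  # u1 has never rated C
```

Mahout's `recommenditembased` job does the same scoring as a series of MapReduce passes over the preference matrix.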

Page 34: Big Data Analytics using Mahout

34

Preparing MovieLens data

$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

$ unzip ml-100k.zip

$ hadoop fs -mkdir /input

$ hadoop fs -put u.data /input/u.data

$ hadoop fs -mkdir /results

$ unset MAHOUT_LOCAL
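Each line of u.data holds a tab-separated user id, item id, rating, and timestamp, a layout Mahout's file-based data model can read directly. A quick sanity-check parser (the sample lines below are taken from the ml-100k file):

```python
def parse_udata(lines):
    """Parse MovieLens u.data lines: user \t item \t rating \t timestamp."""
    triples = []
    for line in lines:
        user, item, rating, _ts = line.strip().split("\t")
        triples.append((int(user), int(item), int(rating)))
    return triples

sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
print(parse_udata(sample))
```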

Page 35: Big Data Analytics using Mahout

35

Running Recommend Command

$ mahout recommenditembased -i /input/u.data -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --tempDir /temp/recommend1

$ hadoop fs -ls /results/itemRecom.txt

Page 36: Big Data Analytics using Mahout

36

View the result

$ hadoop fs -cat /results/itemRecom.txt/part-r-00000

Page 37: Big Data Analytics using Mahout

37

Similarity Classname

SIMILARITY_COOCCURRENCE

SIMILARITY_LOGLIKELIHOOD

SIMILARITY_TANIMOTO_COEFFICIENT

SIMILARITY_CITY_BLOCK

SIMILARITY_COSINE

SIMILARITY_PEARSON_CORRELATION

SIMILARITY_EUCLIDEAN_DISTANCE
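Each classname selects a different formula for how alike two rating vectors are. As plain-Python illustrations (simplified: Mahout's classes add details such as mean-centering, and the 1/(1+d) distance-to-similarity mapping below is an assumption about the exact convention):

```python
from math import sqrt

def cosine(a, b):
    """SIMILARITY_COSINE: cosine of the angle between rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def tanimoto(a, b):
    """SIMILARITY_TANIMOTO_COEFFICIENT on binary seen/not-seen vectors:
    intersection over union of the items both vectors touch."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union

def euclidean_similarity(a, b):
    """SIMILARITY_EUCLIDEAN_DISTANCE: distance d mapped to 1 / (1 + d)."""
    d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + d)

print(cosine([5, 3, 0], [4, 4, 5]))
```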

Page 38: Big Data Analytics using Mahout

38

Running Recommendation on a single machine

$ export MAHOUT_LOCAL=true

$ mahout recommenditembased -i ml-100k/u.data -o results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --numRecommendations 5

$ cat results/itemRecom.txt/part-r-00000

Page 39: Big Data Analytics using Mahout

39

Running Example Program

Using the CBayes classifier

Page 40: Big Data Analytics using Mahout

40

Running Example Program

Page 41: Big Data Analytics using Mahout

41

Preparing data

$ export WORK_DIR=/tmp/mahout-work-${USER}

$ mkdir -p ${WORK_DIR}

$ mkdir -p ${WORK_DIR}/20news-bydate

$ cd ${WORK_DIR}/20news-bydate

$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

$ tar -xzf 20news-bydate.tar.gz

$ mkdir ${WORK_DIR}/20news-all

$ cd

$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

Page 42: Big Data Analytics using Mahout

42

Note: Running on MapReduce

If you want to run in MapReduce mode, you need to run the following commands before running the feature extraction commands

$ unset MAHOUT_LOCAL

$ hadoop fs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all

Page 43: Big Data Analytics using Mahout

43

Preparing the Sequence File

Mahout provides a utility to convert a given input file into the sequence file format.

The input file directory where the original data resides.

The output file directory where the clustered data is to be stored.

Page 44: Big Data Analytics using Mahout

44

Sequence Files

Sequence files are binary encodings of key/value pairs. There is a header at the top of the file with some metadata, which includes:

– Version

– Key name

– Value name

– Compression

To view a sequence file:

mahout seqdumper -i <input file> | more
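The idea behind the format — a small header followed by length-delimited key/value records — can be shown with a toy Python analogue (this is not Hadoop's actual on-disk layout, which also carries class names, sync markers, and compression blocks):

```python
import struct, io

def write_records(stream, records):
    stream.write(b"TOY1")  # toy stand-in for the metadata header
    for key, value in records:
        for field in (key, value):
            data = field.encode("utf-8")
            stream.write(struct.pack(">I", len(data)))  # length prefix
            stream.write(data)

def read_records(stream):
    assert stream.read(4) == b"TOY1"
    out = []
    while True:
        raw = stream.read(4)
        if not raw:  # end of stream
            break
        klen, = struct.unpack(">I", raw)
        key = stream.read(klen).decode("utf-8")
        vlen, = struct.unpack(">I", stream.read(4))
        value = stream.read(vlen).decode("utf-8")
        out.append((key, value))
    return out

buf = io.BytesIO()
write_records(buf, [("doc1", "hello world")])
buf.seek(0)
print(read_records(buf))
```

`mahout seqdumper` is essentially the read side of this loop, printing each key/value pair as text.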

Page 45: Big Data Analytics using Mahout

45

Generate Vectors from Sequence Files

Mahout provides a command to create vector files from sequence files.

mahout seq2sparse -i <input file path> -o <output file path>

Important Options:

-lnorm Whether output vectors should be log-normalized.

-nv Whether output vectors should be NamedVectors.

-wt The kind of weight to use, currently TF or TFIDF. Default: TFIDF
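The TFIDF weighting can be illustrated in a few lines. This is the classic tf × log(N/df) formula; Mahout's seq2sparse applies its own smoothing and (with -lnorm) normalization on top:

```python
from math import log
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    # document frequency: in how many docs does each term appear?
    df = Counter(t for doc in docs for t in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this doc
        out.append({t: tf[t] * log(n / df[t]) for t in tf})
    return out

docs = [["ball", "goal"], ["ball", "vote"]]
print(tfidf(docs))  # "ball" occurs everywhere, so its weight is 0
```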

Page 46: Big Data Analytics using Mahout

46

Extract Features

Convert the full 20 newsgroups dataset into a <Text, Text> SequenceFile.

Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.

Page 47: Big Data Analytics using Mahout

47

Prepare Testing Dataset

Split the preprocessed dataset into training and testing sets.

Page 48: Big Data Analytics using Mahout

48

Training process

Train the classifier.

Page 49: Big Data Analytics using Mahout

49

Testing the result

Test the classifier.
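The split/train/test flow above can be miniaturized. This toy multinomial Naive Bayes with Laplace smoothing is an illustration only — Mahout's CBayes is the complement variant and runs as MapReduce jobs — and the two-class corpus is made up:

```python
from math import log
from collections import Counter

class MultinomialNB:
    def fit(self, docs, labels):
        self.classes = set(labels)
        # log class priors from label frequencies
        self.prior = {c: log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
        self.vocab = {t for cnt in self.counts.values() for t in cnt}
        return self

    def predict(self, doc):
        def score(c):  # log P(c) + sum of log P(term | c), add-one smoothed
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.prior[c] + sum(
                log((self.counts[c][t] + 1) / total) for t in doc)
        return max(self.classes, key=score)

train_docs = [["ball", "goal"], ["ball", "score"], ["vote", "law"], ["law", "court"]]
train_y = ["sport", "sport", "politics", "politics"]
nb = MultinomialNB().fit(train_docs, train_y)
print(nb.predict(["goal", "score"]))
```

Testing, as in `mahout testnb`, is just running `predict` over the held-out split and tallying a confusion matrix.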

Page 50: Big Data Analytics using Mahout

50

Dumping a vector file

We can dump vector files to plain text files, as follows

mahout vectordump -i <input file> -o <output file>

Options:

--useKey If the key is a vector, then dump that instead

--csv Output the vector as CSV

--dictionary The dictionary file

Page 51: Big Data Analytics using Mahout

51

Sample Output

Page 52: Big Data Analytics using Mahout

52

Command line options

Page 53: Big Data Analytics using Mahout

53

Command line options

Page 54: Big Data Analytics using Mahout

54

Command line options

Page 55: Big Data Analytics using Mahout

55

K-means clustering

Page 56: Big Data Analytics using Mahout

56

Reuters Newswire

Page 57: Big Data Analytics using Mahout

57

Preparing data

$ export WORK_DIR=/tmp/kmeans

$ mkdir $WORK_DIR

$ mkdir $WORK_DIR/reuters-out

$ cd $WORK_DIR

$ wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

$ mkdir $WORK_DIR/reuters-sgm

$ tar -xzf reuters21578.tar.gz -C $WORK_DIR/reuters-sgm

Page 58: Big Data Analytics using Mahout

58

Convert input to a sequential file

$ mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/reuters-sgm $WORK_DIR/reuters-out

Page 59: Big Data Analytics using Mahout

59

Convert input to a sequential file (cont)

$ mahout seqdirectory -i $WORK_DIR/reuters-out -o $WORK_DIR/reuters-out-seqdir -c UTF-8 -chunk 5

Page 60: Big Data Analytics using Mahout

60

Create the sparse vector files

$ mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector

Page 61: Big Data Analytics using Mahout

61

Running K-Means

$ mahout kmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c $WORK_DIR/reuters-kmeans-clusters -o $WORK_DIR/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow
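Under the hood, kmeans repeats two steps up to -x times: assign each TF-IDF vector to its nearest centroid under the chosen distance measure, then recompute each centroid as the mean of its cluster. A compact sketch with cosine distance (toy 2-D points stand in for the Reuters vectors):

```python
from math import sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def kmeans(points, centroids, max_iter=10):
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: cosine_distance(p, centroids[i]))
            clusters[i].append(p)
        # update step: recompute centroids as cluster means
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)]
    return clusters

points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
clusters = kmeans(points, [[1, 0], [0, 1]])
print(clusters)
```

Mahout distributes the assignment step as map tasks and the centroid update as the reduce, which is why it scales to the full corpus.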

Page 62: Big Data Analytics using Mahout

62

K-Means command line options

Page 63: Big Data Analytics using Mahout

63

Viewing Result

$ mkdir $WORK_DIR/reuters-kmeans/clusteredPoints

$ mahout clusterdump -i $WORK_DIR/reuters-kmeans/clusters-*-final -o $WORK_DIR/reuters-kmeans/clusterdump -d $WORK_DIR/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 --pointsDir $WORK_DIR/reuters-kmeans/clusteredPoints

Page 64: Big Data Analytics using Mahout

64

Viewing Result

Page 65: Big Data Analytics using Mahout

65

Dumping a cluster file

We can dump cluster files to plain text files, as follows

mahout clusterdump -i <input file> -o <output file>

Options:

-of The optional output format for the results: TEXT, CSV, JSON, or GRAPH_ML

-dt The dictionary file type

--evaluate Run the ClusterEvaluator

Page 66: Big Data Analytics using Mahout

66

Canopy Clustering

Page 67: Big Data Analytics using Mahout

67

Fuzzy k-means Clustering

Page 68: Big Data Analytics using Mahout

68

Command line options

Page 69: Big Data Analytics using Mahout

69

Exercise: Traffic Accidents Dataset

http://fimi.ua.ac.be/data/accidents.dat.gz

Page 70: Big Data Analytics using Mahout

70

Import-Export RDBMS data

Page 71: Big Data Analytics using Mahout

71

Sqoop Hands-On Labs

1. Loading Data into MySQL DB

2. Installing Sqoop

3. Configuring Sqoop

4. Installing DB driver for Sqoop

5. Importing data from MySQL to Hive Table

6. Reviewing data from Hive Table

7. Reviewing HDFS Database Table files

Page 72: Big Data Analytics using Mahout

Thanachart Numnonda, [email protected], Feb 2015. Big Data Hadoop on Amazon EMR – Hands On Workshop

1. MySQL RDS Server on AWS

An RDS server is running on AWS with the following configuration

> database: imc_db

> username: admin

> password: imcinstitute

> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com

[This address may change]

Page 73: Big Data Analytics using Mahout

73

1. country_tbl data

Testing data query from MySQL DB

Table name > country_tbl

Page 74: Big Data Analytics using Mahout

74

2. Installing Sqoop

# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/

# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

Page 75: Big Data Analytics using Mahout

75

Installing Sqoop (cont.): Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc

Page 76: Big Data Analytics using Mahout

76

3. Configuring Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/

ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh

Page 77: Big Data Analytics using Mahout

77

4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit

Page 78: Big Data Analytics using Mahout

78

5. Importing data from MySQL to Hive Table

[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Warning: $HADOOP_HOME is deprecated.

Enter password: <enter here>
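Conceptually, the import runs a SELECT over JDBC and writes each row out as delimited text that Hive can read from HDFS. A minimal analogue using Python's bundled sqlite3 in place of MySQL (the column names and row contents here are made up; only the table name follows the lab's country_tbl):

```python
import sqlite3, csv, io

# Stand-in for the MySQL source table (hypothetical columns)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE country_tbl (country_code TEXT, country_name TEXT)")
db.executemany("INSERT INTO country_tbl VALUES (?, ?)",
               [("TH", "Thailand"), ("JP", "Japan")])

# The "import" step: SELECT every row and write delimited text,
# roughly what Sqoop hands to Hive as table files
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in db.execute("SELECT * FROM country_tbl"):
    writer.writerow(row)
print(out.getvalue())
```

Sqoop parallelizes this same loop across -m mapper tasks, each pulling a slice of the table.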

Page 79: Big Data Analytics using Mahout

79

6. Reviewing data from Hive Table

Page 80: Big Data Analytics using Mahout

80

7. Reviewing HDFS Database Table files

Start a web browser to http://54.68.149.232:50070, then navigate to /user/hive/warehouse

Page 81: Big Data Analytics using Mahout

81

Sqoop commands

Page 82: Big Data Analytics using Mahout

82

Recommended Books

Page 83: Big Data Analytics using Mahout

83

www.facebook.com/imcinstitute

Page 84: Big Data Analytics using Mahout

84

Thank you

[email protected]
www.facebook.com/imcinstitute
www.slideshare.net/imcinstitute
www.thanachart.org