CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1

CS525: Big Data Analytics

Machine Learning on Hadoop

Fall 2013

Elke A. Rundensteiner

Analytics?

• Machine learning, data mining & statistics tools

• Analyze/mine/summarize large datasets

• Extract knowledge from past or streaming data

• Predict trends in future data

ML Today

• Internet search clustering

• Social network analysis

• Taxonomy transformations

• Market analytics

• Recommendation systems

• Log analysis & event filtering

• SPAM filtering

• Fraud detection

Tools & Algorithms

• Collaborative Filtering

• Clustering Techniques

• Classification Algorithms

• Association Rules

• Frequent Pattern Mining

• Statistical libraries (Regression, SVM, …)

• Others…

Common Use Cases

Make It Industry Strength: Big Data

• Machine learning, data mining & statistics tools: efficient at analyzing/mining data, but do not scale

• Big data platforms such as Hadoop: efficient at managing big data, but do not analyze or mine data

How to integrate these two worlds?

Some Projects

• Apache Mahout
  • Open-source package on Hadoop for data mining and machine learning

• Revolution R (R-Hadoop or Radoop)
  • Extensions to R package to run on Hadoop

Apache Mahout

Apache Mahout

• Apache Software Foundation project

• Create scalable machine learning libraries

• Why?

• Many Open Source ML libraries either:
  • Lack Community
  • Lack Documentation
  • Lack Scalability
  • Or are research-oriented only

Support Machine Learning

But Must Scale & Perform

• Be as fast as possible

• Scale to as much data as possible

But Must Scale & Perform

• Be as fast as the algorithm itself intrinsically allows!

• What is expressible as MapReduce jobs?

• Work in progress…
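
To make the question of what fits the map-reduce model more concrete, here is a minimal single-process sketch in Python of the map, shuffle and reduce phases, using item counting as the task. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are illustrative assumptions, not Hadoop or Mahout APIs; a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> Iterable[Tuple[str, int]]:
    """Map: emit a (key, 1) pair for every token in one input record."""
    for token in record.lower().split():
        yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (the framework does this in Hadoop)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values that share a key into a single result."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially in one process (hypothetical driver)."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["milk bread", "bread cheese", "milk bread cheese"]))
```

Counting fits the model well because each record can be mapped independently and the per-key sum does not depend on the order in which values arrive. Algorithms that need many passes over shared state are harder to express this way.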

C1: Collaborative Filtering
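
As a rough illustration of the idea, the following Python sketch recommends items to a user from the ratings of similar users (user-based collaborative filtering with cosine similarity). The data layout and the names `Ratings`, `cosine` and `recommend` are assumptions made for this example; it is not Mahout's recommender API, and it ignores scale entirely.

```python
import math
from typing import Dict, List

# Hypothetical layout: user -> {item: rating}
Ratings = Dict[str, Dict[str, float]]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not shared or norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    return dot / (norm_a * norm_b)


def recommend(ratings: Ratings, user: str, top_n: int = 3) -> List[str]:
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    own = ratings[user]
    scores: Dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(own, other_ratings)
        for item, rating in other_ratings.items():
            if item not in own:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:top_n]


if __name__ == "__main__":
    demo: Ratings = {
        "alice": {"A": 5, "B": 4},
        "bob": {"A": 5, "B": 4, "C": 5},
        "carol": {"A": 1, "D": 5},
    }
    print(recommend(demo, "alice"))
```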

C2: Clustering

• Group similar objects together

• K-Means, Fuzzy K-Means, Density-Based, …

• Different distance measures
  • Manhattan, Euclidean, …
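
The sketch below shows, under simplifying assumptions, how K-Means alternates between assigning points to the nearest centroid and recomputing centroids, with the distance measure passed in as a parameter. The names `kmeans`, `euclidean` and `manhattan` are chosen for this example and are not Mahout classes; initial centroids are supplied by the caller rather than chosen randomly.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(
    points: List[Point],
    centroids: List[Point],
    distance: Distance = euclidean,
    iterations: int = 10,
) -> Tuple[List[Point], List[int]]:
    """Return (centroids, cluster index per point) after at most `iterations` rounds."""
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        # (With Manhattan distance the median is the usual choice; the mean
        # is kept here only to keep the sketch short.)
        updated: List[Point] = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assignment


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)], distance=manhattan))
```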

C3: Classification

FPM: Frequent Pattern Mining

• Find the frequent itemsets
  • <milk, bread, cheese> are sold frequently together

• Very common in market analysis, access pattern analysis, etc…
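
A small Python sketch can show what "frequent itemset" means in practice: count how many transactions contain each item combination and keep those that reach a minimum support. This is brute-force enumeration written for illustration, not an optimized algorithm such as Apriori or FP-Growth, and the function name `frequent_itemsets` and its parameters are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(
    transactions: List[Set[str]],
    min_support: int,
    max_size: int = 3,
) -> Dict[FrozenSet[str], int]:
    """Return itemsets of up to `max_size` items found in at least `min_support` transactions."""
    counts: Counter = Counter()
    for basket in transactions:
        items = sorted(basket)  # fixed order so each combination is generated once
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    baskets = [
        {"milk", "bread", "cheese"},
        {"milk", "bread", "cheese", "eggs"},
        {"milk", "bread"},
        {"eggs"},
    ]
    for itemset, support in frequent_itemsets(baskets, min_support=2).items():
        print(sorted(itemset), support)
```

The number of candidate combinations grows quickly with basket size, which is one reason dedicated algorithms and distributed execution matter for large datasets.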

Matrices and Statistics

• Math libraries
  • Vectors, matrices, etc.

• Noise reduction

• Similarity Functions
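
As an illustration of what a similarity function computes, here are two common ones sketched in plain Python: Jaccard similarity for sets and Pearson correlation for numeric vectors. The slides do not say which functions are meant, so this choice and the function names are assumptions.

```python
import math
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
    print(pearson([1, 2, 3], [2, 4, 6]))
```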

Apache Mahout

• http://mahout.apache.org/