Driving New Value from Big Data Investments
An Introduction to Using R with Hadoop

Jeffrey Breen
Principal, Think Big Academy
[email protected]
http://www.thinkbigacademy.com/
February 2013



Building Modern Analytics Solutions to Monetize Big Data Investments

Leading Provider of Innovative Big Analytics Services

• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering

We Accelerate Your Time to Value

THINK BIG Analytics Methodology

Experiment-Driven Short Projects with Nimble Test Solution Cycles

• Breaking Down Business and IT Barriers
• Discrete Projects with Beginning and End
• Early Releases to Validate ROI and Ensure Long Term Success

Phases: IMAGINE, ILLUMINATE, IMPLEMENT

Innovation and Value

ILLUMINATE: Training and Education

Enable Your IT Staff with New Skills

• Expert Training/Courses
  - e.g. Hadoop Developer, HBase, Pig and Hive for Modelers
• Joint Application Development
• Side-by-Side Mentoring

[Diagram of roles: Data Architect; Data Architect Big Data; Monitoring; Database Administrator; Big Data Administrator; Business Analyst; Data Science Math Modeler; Developers; Big Data Engineering]

• Build Capabilities to Manage Rapid Innovation Needed with Big Data
• Invest in and Scale Skills to Create Data-Driven Organization

Agenda

• Why R?
• What is Hadoop?
• Counting words with MapReduce
• Writing MapReduce jobs with RHadoop
• Data Warehousing with Hive
• Big Data ≠ Hadoop
• Want to learn more?
• Q&A

Revolution Confidential

Image source: http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html

Image source: http://www.wengerna.com/giant-knife-16999

Number of R Packages Available

How many R packages are there now?

At the command line, enter:

> dim(available.packages())

Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup


2003: Google File System is the storage.

2004: MapReduce is the framework.

Enter Hadoop

About this time, Doug Cutting, the creator of Lucene, was working on Nutch.

Nutch Timeline

Year       Topics
2003       Google’s GFS paper.
2004       Nutch Distributed File System (NDFS).
2004       Google’s MapReduce paper.
2004-2005  Nutch MapReduce implementation.

Hadoop Timeline

Year  Topics
2006  NDFS and Nutch MapReduce extracted to separate Hadoop Apache project.
2008  Hadoop is a top-level Apache project. Yahoo! announces 10K core cluster.

Hadoop Design Goals

• Optimize disk I/O performance.
  - Minimize disk head seeks!
• Redundant data storage and processing to eliminate many kinds of data loss.
• Horizontal scalability.
• Run on commodity, server-class hardware.

from Jeff Dean, based on Peter Norvig’s http://norvig.com/21-days.html

What is Hadoop?

• An open source project designed to support large-scale data processing
• Inspired by Google’s MapReduce-based computational infrastructure
• Composed of several components
  - Hadoop Distributed File System (HDFS)
  - MapReduce processing framework, job scheduler, etc.
  - Ingest/outgest services (Sqoop, Flume, etc.)
  - Higher-level languages and libraries (Hive, Pig, Cascading, Mahout)
• Written in Java, first opened up to alternatives through its Streaming API
  → If your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
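To make the stdin/stdout point concrete, here is a minimal sketch of what a Streaming-style mapper could look like in plain R. It is an illustration, not code from the talk: the function name stream_wordcount_mapper is hypothetical, and the "word<TAB>1" output layout assumes the usual tab-separated key/value convention. The function takes connections as arguments so it can be tried locally without a cluster.

```r
# Hypothetical Streaming-style mapper: reads text lines, writes "word<TAB>1" lines.
stream_wordcount_mapper <- function(input, output) {
  lines <- readLines(input, warn = FALSE)
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  words <- words[nzchar(words)]          # drop empty tokens (e.g. from blank lines)
  if (length(words) > 0) {
    writeLines(paste(words, 1, sep = "\t"), output)
  }
  invisible(length(words))               # number of pairs emitted
}

# Local demonstration with in-memory connections instead of stdin/stdout.
demo_in  <- textConnection(c("Hadoop uses MapReduce", "", "There is a Map phase"))
demo_out <- textConnection("mapper_output", open = "w", local = FALSE)
n_emitted <- stream_wordcount_mapper(demo_in, demo_out)
close(demo_in)
close(demo_out)
print(mapper_output)
```

In an actual Streaming job the same function would presumably be called as stream_wordcount_mapper(file("stdin"), stdout()) from an Rscript file.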

Hadoop cluster components

[Diagram, from Think Big Academy’s Hadoop Developer Course. Key: italics = process; ✲ = MR jobs.]

• Client Servers: ✲ Hive, Pig, ...; ✲ cron+bash, Azkaban, …; Sqoop, Scribe, …; Monitoring, Management
• Ingest Service, Outgest Service, SQL Stores, Logs
• Primary Master Server: ✲ Job Tracker, Name Node
• Secondary Master Server: Secondary Name Node
• Slave Servers (three shown): ✲ Task Tracker, Data Node, eight disks each

Hadoop’s distributed file system

• Services
  - Name Node
  - Data Nodes
• 64MB blocks
• 3x replication

[Same cluster diagram as the previous slide, from Think Big Academy’s Hadoop Developer Course.]
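As a back-of-the-envelope illustration of the two figures on this slide (64MB blocks, 3x replication), the sketch below estimates how a file is laid out. The helper name hdfs_footprint and the 1 GB file size are made up for the example; both settings are configurable on a real cluster.

```r
# Rough HDFS sizing using the slide's figures (both are cluster settings).
block_size_mb <- 64      # from the slide
replication   <- 3       # from the slide

# Hypothetical helper: blocks, block replicas, and raw storage for one file.
hdfs_footprint <- function(file_size_mb) {
  blocks <- ceiling(file_size_mb / block_size_mb)
  list(blocks = blocks,
       block_replicas = blocks * replication,
       # the last block only occupies the bytes it holds, so raw storage
       # scales with the file size, not with blocks * block size
       raw_storage_mb = file_size_mb * replication)
}

# Hypothetical 1 GB (1024 MB) file
fp <- hdfs_footprint(1024)
print(fp)
```

For this file the result is 16 blocks, 48 block replicas spread across the Data Nodes, and 3072 MB of raw disk.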


True confession: I was wrong about MapReduce

• When the Google paper was published in 2004, I was running a typical enterprise IT department
• Big hardware (Sun, EMC) + big applications (Siebel, PeopleSoft) + big databases (Oracle, SQL Server) = big licensing & support costs
• Loved the scalability, COTS components, and price, but missed the fact that keys (and values) could be compound & complex
• ... and examples like Wordcount didn’t help!

Source: Hadoop: The Definitive Guide, Second Edition, p. 20
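A small sketch of what "compound & complex" keys and values can mean in practice, written in base R so it runs without Hadoop. The flight records are made up for illustration, and the "|" separator used to build the compound key is an arbitrary choice, not something from the talk. The key combines carrier and month; the value is a (sum, n) pair so that a mean can be derived at the end.

```r
# Made-up flight records, for illustration only.
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "AA"),
  month   = c(1, 1, 1, 2, 2),
  delay   = c(10, 20, 5, 15, 30),
  stringsAsFactors = FALSE
)

# "Map": emit a compound key (carrier, month) and a compound value (delay, 1).
keys   <- paste(flights$carrier, flights$month, sep = "|")
values <- lapply(flights$delay, function(d) c(sum = d, n = 1))

# "Shuffle": group the values by key.
grouped <- split(values, keys)

# "Reduce": add up the (sum, n) pairs per key, then derive the mean delay.
reduced    <- lapply(grouped, function(vs) Reduce(`+`, vs))
mean_delay <- sapply(reduced, function(v) v[["sum"]] / v[["n"]])
print(mean_delay)
```

Carrying (sum, n) rather than a running mean keeps the reduce step a plain addition, which is one common reason to use a compound value.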

Copyright © 2011-2013, Think Big Analytics, All Rights Reserved

Stages: Input, Mappers, Sort/Shuffle, Reducers, Output

Input:
  Hadoop uses MapReduce
  There is a Map phase
  (blank line)
  There is a Reduce phase

Output:
  a 2
  hadoop 1
  is 2
  map 1
  mapreduce 1
  phase 2
  reduce 1
  there 2
  uses 1

We need to convert the Input into the Output.

from Think Big Academy’s Hadoop Developer Course


Image source: http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png


Input (each line reaches a mapper as a record):
  (N, "…")   Hadoop uses MapReduce
  (N, "…")   There is a Map phase
  (N, "")    (blank line)
  (N, "…")   There is a Reduce phase

Mappers emit one (word, 1) pair per word:
  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Sort, Shuffle routes each pair to a reducer by key range and groups the values:
  0-9, a-l:  (a, [1,1]), (hadoop, [1]), (is, [1,1])
  m-q:       (map, [1]), (mapreduce, [1]), (phase, [1,1])
  r-z:       (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers write the Output:
  0-9, a-l:  a 2, hadoop 1, is 2
  m-q:       map 1, mapreduce 1, phase 2
  r-z:       reduce 1, there 2, uses 1

Map:
• Transform one input to 0-N outputs.

Reduce:
• Collect multiple inputs into one output.

from Think Big Academy’s Hadoop Developer Course
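The whole flow on this slide can be imitated in a few lines of base R. This is only a single-process sketch for building intuition, not how Hadoop executes a job; the function names map_line and partition are hypothetical, and the partition ranges are simply the ones drawn on the slide.

```r
input <- c("Hadoop uses MapReduce",
           "There is a Map phase",
           "",
           "There is a Reduce phase")

# Map: one input line -> zero or more (word, 1) pairs.
map_line <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1, length(words)),
             stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(input, map_line))

# Partition: pick a reducer from the key's first character,
# using the ranges shown on the slide (0-9 and a-l, m-q, r-z).
partition <- function(key) {
  first <- substr(key, 1, 1)
  ifelse(first %in% c(0:9, letters[1:12]), "0-9, a-l",
         ifelse(first %in% letters[13:17], "m-q", "r-z"))
}
pairs$reducer <- partition(pairs$key)

# Sort, Shuffle: within each reducer, group the values by key.
shuffled <- lapply(split(pairs, pairs$reducer),
                   function(p) split(p$value, p$key))

# Reduce: many values for one key -> one output value.
output <- lapply(shuffled, function(groups) sapply(groups, sum))
print(output)
```

The blank input line produces zero pairs, which is the "0" in "0-N outputs"; each reducer only ever sees the keys in its own range.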


Enter RHadoop

• RHadoop is an open source project sponsored by Revolution Analytics
• Package Overview
  - rmr2 - all MapReduce-related functions
  - rhdfs - interaction with Hadoop’s HDFS file system
  - rhbase - access to the NoSQL HBase database
• rmr2 uses Hadoop’s Streaming API to allow R users to write MapReduce jobs in R
  - handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)

RHadoop Advantages

• Modular
  - Packages group similar functions
  - Only load (and learn!) what you need
  - Minimizes prerequisites and dependencies
• Open Source
  - Cost: Low (no) barrier to start using
  - Transparency: Development, issue tracker, Wiki, etc. hosted on github
    • https://github.com/RevolutionAnalytics/RHadoop/
• Supported
  - Sponsored by Revolution Analytics
  - Training & professional services available
  - Support available with Revolution R Enterprise subscriptions

wordcount: code

library(rmr2)

# Mapper: split the incoming lines into words and emit a (word, 1) pair per word
map = function(k, lines) {
  words.list = strsplit(lines, '\\s')
  words = unlist(words.list)
  return( keyval(words, 1) )
}

# Reducer: sum the counts collected for each word
reduce = function(word, counts) {
  keyval(word, sum(counts))
}

# Job definition: read text input and run the mapper and reducer above
wordcount = function (input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = "text",
            map = map,
            reduce = reduce)
}

from Revolution Analytics’ Getting Started with RHadoop course

wordcount: submit job and fetch results

Submit job

> hdfs.root = 'wordcount'
> hdfs.data = file.path(hdfs.root, 'data')
> hdfs.out = file.path(hdfs.root, 'out')
> out = wordcount(hdfs.data, hdfs.out)

Fetch results from HDFS

> results = from.dfs( out )
> results.df = as.data.frame(results, stringsAsFactors=F )
> colnames(results.df) = c('word', 'count')
> head(results.df)
       word count
1 greatness     2
2    damned     3
3       tis     5
4      jade     1
5  magician     1

from Revolution Analytics’ Getting Started with RHadoop course

Code notes

• Scalable
  - Hadoop and MapReduce abstract away system details
  - Code runs on 1 node or 1,000 nodes without modification
• Portable
  - You write normal R code, interacting with normal R objects
  - RHadoop’s rmr2 library abstracts away Hadoop details
  - All the functionality you expect is there, including Enterprise R’s
• Flexible
  - Only the mapper deals with the data directly
  - All components communicate via key-value pairs
  - Key-value “schema” chosen for each analysis rather than as a prerequisite to loading data into the system

rmr2 Function Overview

• Convenience
  - keyval() - creates a key-value pair from any two R objects. Used to generate output from input formatters, mappers, reducers, etc.
• Input/output
  - from.dfs(), to.dfs() - read/write data from/to the HDFS
  - make.input.format() - provides common file parsing (text, CSV) or will wrap a user-supplied function
• Job execution
  - mapreduce() - submit job and return an HDFS path to the results if successful

rhdfs function overview

• File & directory manipulation
  - hdfs.ls(), hdfs.list.files()
  - hdfs.delete(), hdfs.del(), hdfs.rm()
  - hdfs.dircreate(), hdfs.mkdir()
  - hdfs.chmod(), hdfs.chown(), hdfs.file.info()
  - hdfs.exists()
• Copying, moving & renaming files to/from/within HDFS
  - hdfs.copy(), hdfs.move(), hdfs.rename()
  - hdfs.put(), hdfs.get()
• Reading files directly from HDFS
  - hdfs.file(), hdfs.read(), hdfs.write(), hdfs.flush()
  - hdfs.seek(), hdfs.tell(con), hdfs.close()
  - hdfs.line.reader(), hdfs.read.text.file()
• Misc.
  - hdfs.init(), hdfs.defaults()

rhbase function overview

• Initialization
  - hb.init()
• Create and manage tables
  - hb.list.tables(), hb.describe.table()
  - hb.new.table(), hb.delete.table()
• Read and write data
  - hb.insert(), hb.insert.data.frame()
  - hb.get(), hb.get.data.frame(), hb.scan()
  - hb.delete()
• Administrative, etc.
  - hb.defaults(), hb.set.table.mode()
  - hb.regions.table(), hb.compact.table()

Big Data Warehousing with Hive

• Hive supplies a SQL-like query language
  - very familiar for those with relational database experience
• But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
• Can be used in conjunction with other Hadoop jobs, such as those written with rmr2
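To give a feel for how a SQL-like query can turn into a MapReduce job, here is a conceptual base-R sketch of a simple GROUP BY. It is not Hive's actual query plan, and the table contents and the dept column are made up for the example; the query in the comment is only meant to show the correspondence.

```r
# Hypothetical rows of an 'employees' table (made-up data).
employees <- data.frame(name = c("ann", "bob", "cat", "dan"),
                        dept = c("eng", "eng", "ops", "eng"),
                        stringsAsFactors = FALSE)

# Conceptually: SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept
mapped  <- data.frame(key = employees$dept, value = 1,   # map: emit (dept, 1)
                      stringsAsFactors = FALSE)
grouped <- split(mapped$value, mapped$key)               # shuffle: group by key
counts  <- sapply(grouped, sum)                          # reduce: sum per key

result <- data.frame(dept = names(counts), n = as.vector(counts),
                     stringsAsFactors = FALSE)
print(result)
```

The GROUP BY column plays the role of the key, and the aggregate function plays the role of the reducer.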

Hive architecture & access

[Diagram]

• Access: CLI (terminal), HWI (browser), Thrift Server (JDBC, ODBC; from R via RODBC, RJDBC, etc.)
• Hive: Driver (compiles, optimizes, executes), Metastore
• Hadoop: Master (✲ Job Tracker, Name Node), DFS

Accessing Hive via ODBC/JDBC

library(RJDBC)

# set the classpath to include the JDBC driver location, plus commons-logging
[...]
class.path = c(hive.class.path, commons.class.path)
drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", classPath=class.path, "`")

# make a connection to the running Hive Server:
conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

# setting the database name in the URL doesn't help,
# so issue 'use databasename' command:
res = dbSendQuery(conn, 'use mydatabase')

# submit the query and fetch the results as a data.frame:
df = dbGetQuery(conn, 'SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subView AS sub')

Other ways to use R and Hadoop

• HDFS
  - Revolution R Enterprise can read and write files directly on the distributed file system
  - Files can include ScaleR’s XDF-formatted data sets
• MapReduce
  - Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle’s R Connector for Hadoop, etc.
• Hive
  - Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
  - RHive leverages Rserve to connect the two
    • http://cran.r-project.org/web/packages/RHive/

Big Data ≠ Hadoop

• NoSQL databases offer low-latency random access to key-values
  - HBase
  - Cassandra
  - CouchDB
  - MongoDB
  - Accumulo
• Next week, Think Big’s Douglas Moore will be presenting at the Boston Storm Meetup:
  - “Predictive Analytics with Storm, Hadoop, R and AWS”
  - http://www.meetup.com/Boston-Storm-Users/events/103506142/

Want to learn more?

• Upcoming public Getting Started with RHadoop 1-day classes
  - Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
  - Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files and Twitter
    • February 25, 2013 - Palo Alto, CA
    • March 13, 2013 - Boston, MA
    • 25% off with “useR” discount code @ http://bit.ly/rhadoop0313
• Revolution Analytics Quick Start Program for Hadoop
  - Private Getting Started with RHadoop training
  - Onsite consulting assistance for initial use case
  - Revolution R for Hadoop licenses and support
  - More info @ http://bit.ly/rhadoopqs