
RHadoop


Outline Introduction RHadoop RHadoop Installation rhdfs rmr2 Examples

Big Data Analytics with R and Hadoop

D. Praveen Kumar

Research Scholar (Full-Time)

Department of Computer Science & Engineering

YSREC of Yogi Vemana University, Proddatur

Kadapa Dt., A. P, India

November 30, 2016

YSR Engineering College of YVU, Proddatur, Kadapa

Big Data Analytics with R and Hadoop

November 30, 2016 Slide: 1 / 70


1 Introduction

2 RHadoop

3 RHadoop Installation

4 rhdfs Methods

5 rmr2

6 Examples


Big Data - Introduction

Big Data involves large and complex data sets that can be structured, semi-structured, or unstructured and that typically do not fit into memory. They have to be processed in place, which means the computation is done where the data resides.


Big Data - 3V’s

Velocity refers to the low-latency, real-time speed at which the analytics need to be applied. (Example: performing analytics on a continuous stream of data originating from a social networking site.)

Volume refers to the size of the data set. It may be in KB, MB, GB, TB, or PB, depending on the type of application that generates or receives the data.

Variety refers to the various types of data that can exist, for example, text, audio, video, and photos.


Popular Organizations that hold Big Data

Some of the popular organizations that hold Big Data are as follows (figures up to 2014):

Facebook: It has 40 PB of data and captures 100 TB/day

Yahoo!: It has 60 PB of data

Twitter: It captures 8 TB/day

eBay: It has 40 PB of data and captures 50 TB/day


Hadoop - Introduction

Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.

Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting.

Its impact can be boiled down to four salient characteristics: scalable, cost-effective, flexible, and fault-tolerant.

Apache Hadoop has two main features:

HDFS (Hadoop Distributed File System) - Storing

MapReduce - Processing


Requirements

Necessary

Java >= 7

ssh

Linux OS (Ubuntu >=14.04)

Hadoop framework

Optional

Eclipse

Internet connection


Java 7 & Installation

Hadoop requires a working Java installation; Java 1.7 or later is recommended.

Use one of the following commands to install Java on Linux:

sudo apt-get install openjdk-7-jdk

(or)

sudo apt-get install default-jdk


Java PATH Setup

We need to set the Java path.

Open the .bashrc file located in the home directory:

gedit ~/.bashrc

Add the line below at the end:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


Installation & Configuration of SSH

Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.

Install SSH using the following command:

sudo apt-get install ssh

Next, generate a DSA SSH key for the user and authorize it for login:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


Download & Extract Hadoop

Download Hadoop from the Apache Download Mirrors

http://mirror.fibergrid.in/apache/hadoop/common/

Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.

$ sudo chmod 777 /usr/local

$ cd /usr/local

$ tar xzf hadoop-2.7.2.tar.gz

$ sudo mv hadoop-2.7.2 hadoop


Add Hadoop configuration in .bashrc

Add the Hadoop configuration to .bashrc in the home directory:

export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"


Create tmp, DataNode & NameNode directories

Run the command below to create the NameNode directory:

mkdir -p /usr/local/hadoopdata/hdfs/namenode

Run the command below to create the DataNode directory:

mkdir -p /usr/local/hadoopdata/hdfs/datanode

Run the commands below to create the Hadoop tmp directory (hadoop1 is the user and group used on these slides):

sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp


Files to Configure

The following are the files we need to configure

core-site.xml

hadoop-env.sh

mapred-site.xml

hdfs-site.xml


Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.

Add the property below to specify the location of the tmp directory:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>

Add the property below to specify the default file system and its port number:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
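Every Hadoop setting on these slides has the same name/value shape, so the XML can be generated instead of typed by hand. The following is a minimal sketch in base R; the helper name hadoop_property() is hypothetical, and the two settings are the core-site.xml values shown above.

```r
# Hypothetical helper: render one Hadoop <property> element as text.
hadoop_property <- function(name, value) {
  paste0(
    "<property>\n",
    "  <name>", name, "</name>\n",
    "  <value>", value, "</value>\n",
    "</property>"
  )
}

# Settings taken from the core-site.xml slide (single-node setup).
core_site <- c(
  "hadoop.tmp.dir"  = "/app/hadoop/tmp",
  "fs.default.name" = "hdfs://localhost:54310"
)

# One <property> block per setting, wrapped in <configuration>.
props <- mapply(hadoop_property, names(core_site), core_site)
xml <- paste0("<configuration>\n",
              paste(props, collapse = "\n"),
              "\n</configuration>")
cat(xml, "\n")
```

The printed text can be pasted into the configuration file; writing it to disk with writeLines() is left out on purpose so that the sketch cannot overwrite an existing file.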


Set JAVA_HOME in /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Uncomment the JAVA_HOME line and give the correct path for Java:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml

In this file we add the host name and port that the MapReduce job tracker runs at. Add the following to mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>


Add properties in ... etc/hadoop/hdfs-site.xml

In the file hdfs-site.xml, add the following.

Set the replication factor:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Specify the NameNode directory:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>

Specify the DataNode directory (the Hadoop 2.x property is dfs.datanode.data.dir):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>


Formatting the HDFS file system via the NameNode

The first step in starting up your Hadoop installation is formatting the Hadoop file system.

This is needed only the first time you set up a Hadoop cluster.

Do not format a running Hadoop file system, as you will lose all the data currently in HDFS.

To format the file system, run the command:

hadoop namenode -format


Starting single-node cluster

Run the command:

start-all.sh

This will start up a NameNode, SecondaryNameNode, DataNode, ResourceManager and a NodeManager on your machine.

A nifty tool for checking whether the expected Hadoop processes are running is jps:

hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager
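The jps check can also be done from R, which is handy before starting an RHadoop session. This is a sketch in base R that works on the sample output shown on the slide; the variable names are illustrative, and on a live machine the sample lines could be replaced by the output of system2("jps", stdout = TRUE), assuming jps is on the PATH.

```r
# Sample jps output copied from the slide.
jps_lines <- c(
  "2598 NameNode", "3112 ResourceManager", "3523 Jps",
  "2917 SecondaryNameNode", "2727 DataNode", "3242 NodeManager"
)

# Daemons that a single-node cluster is expected to run.
expected <- c("NameNode", "SecondaryNameNode", "DataNode",
              "ResourceManager", "NodeManager")

# The process name is the second whitespace-separated field of each line.
running <- vapply(strsplit(jps_lines, "\\s+"), function(x) x[2], character(1))
not_running <- setdiff(expected, running)

if (length(not_running) == 0) {
  message("All expected Hadoop daemons are running.")
} else {
  message("Not running: ", paste(not_running, collapse = ", "))
}
```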


Stopping your single-node cluster

Run the command:

stop-all.sh

This stops all the daemons running on your machine. The output will look like this:

stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode


R - Introduction

R is an open source software package for performing statistical analysis on data.

R is a programming language derived from S, a language for statistical computing.

R provides a wide variety of statistical, machine learning, and graphical techniques, and is highly extensible.

R can connect to other data stores, such as MySQL, SQLite, MongoDB, and Hadoop.


R - Features

The following are some of R's features:

Effective statistical programming language

Relational database support

Data analytics

Data visualization

Extension through the vast library of R packages


R - Operations

R supports data analytics through operations such as:

Regression

Classification

Clustering

Recommendation

Text mining


R - Installation (Windows)

For Windows, follow the given steps:

1 Navigate to www.r-project.org.

2 Click on the CRAN section, select a CRAN mirror, and select your Windows OS (stick to Linux; Hadoop is almost always used in a Linux environment).

3 Download the latest R version from the mirror.

4 Execute the downloaded .exe to install R.


R - Installation (Ubuntu)

For Linux-Ubuntu, follow the given steps:

1 Navigate to www.r-project.org.

2 Click on the CRAN section, select a CRAN mirror, and select your OS.

3 In the /etc/apt/sources.list file, add the CRAN <mirror> entry.

4 Download and update the package lists from the repositories using the sudo apt-get update command.

5 Install the R system using the sudo apt-get install r-base command.


R - Installation (RHEL/CentOS)

For Linux-RHEL/CentOS, follow the given steps:

1 Navigate to www.r-project.org.

2 Click on CRAN, select CRAN mirror, and select Red Hat OS.

3 Download the R-*core-*.rpm file.

4 Install the .rpm package using the rpm -ivh R-*core-*.rpm command.

5 Install R system using sudo yum install R.


Hadoop MapReduce in R

We can run Hadoop MapReduce from R in three ways:

1 R and Hadoop Integrated Programming Environment (RHIPE)

2 HadoopStreaming

3 RHadoop

Among these three, RHadoop is the most efficient and the easiest to use.


RHadoop - Introduction

RHadoop was developed by Revolution Analytics.

RHadoop is available as three main R packages:

1 rhdfs - provides HDFS data operations

2 rmr - provides MapReduce execution operations

3 rhbase - provides access to HBase as an input data source

It is not necessary to install all three RHadoop packages to run Hadoop MapReduce operations with R and Hadoop.


RHadoop - Architecture


rhdfs

rhdfs is an R interface that makes HDFS usable from the R console.

The rhdfs package calls the HDFS API in the backend to operate on data sources stored in HDFS.

With the rhdfs methods, an R programmer can easily perform read and write operations on distributed data files.


rmr

rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment.

The R programmer only needs to divide the application logic into map and reduce phases and submit it with the rmr methods.

After that, rmr calls the Hadoop streaming MapReduce API with several job parameters, such as input directory, output directory, mapper, and reducer, to run the R MapReduce job on the Hadoop cluster.
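To make the split into map and reduce phases concrete, here is a small in-memory imitation of the map, shuffle, and reduce flow in base R. It is only an illustration: it needs neither Hadoop nor rmr, the sample text is made up, and the names map_fn and reduce_fn are not part of any package.

```r
# Made-up input: three lines of text.
text_lines <- c("big data with r", "r and hadoop", "big data")

# Map: emit one (word, 1) pair per word of a line.
map_fn <- function(line) {
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  data.frame(key = words, val = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(text_lines, map_fn))

# Shuffle: group all emitted values by their key.
grouped <- split(mapped$val, mapped$key)

# Reduce: collapse the values of each key to a single number.
reduce_fn <- function(counts) sum(counts)
word_counts <- vapply(grouped, reduce_fn, numeric(1))

print(word_counts)
```

On a cluster the shuffle step is performed by Hadoop itself; only the map and reduce functions are written by the R programmer.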


rhbase

rhbase is an R interface for operating on Hadoop HBase data sources stored on the distributed network, via a Thrift server.

The rhbase package provides several methods for initialization, read/write, and table manipulation operations.


R and Hadoop installation

We already installed R and Hadoop


Installing the R packages

To connect R and Hadoop we need to install some of the packages:

httr

functional

devtools

plyr

reshape2

rJava

RJSONIO

itertools

digest

Rcpp

install.packages(c('httr', 'functional', 'devtools', 'plyr', 'reshape2'))

install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp'))
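When the setup is repeated on several machines, it can help to install only what is missing. A minimal sketch in base R, using the package list from this slide; the actual install call is left commented out because it needs an Internet connection.

```r
# Prerequisite packages listed on the slide.
pkgs <- c("httr", "functional", "devtools", "plyr", "reshape2",
          "rJava", "RJSONIO", "itertools", "digest", "Rcpp")

# Packages from the list that are not yet installed in this R library.
to_install <- setdiff(pkgs, rownames(installed.packages()))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(to_install)
} else {
  message("All prerequisite packages are already installed.")
}
```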


Setting environment variables

We need to set the following environment variables from the R console (the version in the streaming jar name must match your Hadoop installation).

## Setting HADOOP_CMD
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")

## Setting up HADOOP_STREAMING
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

Alternatively, set them from the command line before starting R:

export HADOOP_CMD="/usr/local/hadoop/bin/hadoop"
export HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"
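A wrong path in either variable usually shows up only later, when a job fails to start, so it may be worth checking them first. The sketch below uses base R only; check_hadoop_var() is a hypothetical helper name, not part of RHadoop.

```r
# Report whether a variable is set and points at an existing file.
check_hadoop_var <- function(var) {
  path <- Sys.getenv(var, unset = "")
  ok <- nzchar(path) && file.exists(path)
  message(sprintf("%-17s %s", var, if (ok) path else "not set or file missing"))
  ok
}

# Named logical vector: TRUE where the variable is usable.
status <- vapply(c("HADOOP_CMD", "HADOOP_STREAMING"),
                 check_hadoop_var, logical(1))
```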


Usage of Hadoop Streaming jar


Downloading RHadoop Packages

Download the RHadoop packages from the GitHub repository of Revolution Analytics:

https://github.com/RevolutionAnalytics/RHadoop

rmr: [rmr-2_3.3.1.tar.gz]

rhdfs: [rhdfs-1.0.8.tar.gz]

rhbase: [rhbase-1.2.1.tar.gz]

We can install these packages using the R command line or RStudio.


Installing rmr package

Install from the command line using the following command:

R CMD INSTALL rmr-2_3.3.1.tar.gz

To install using RStudio, follow these steps:

Click on Tools → Install Packages

Change the Install from option from Repository (CRAN) to Package Archive File (.tar.gz)

Choose the rmr-2_3.3.1.tar.gz file from your local system

Click on the Install button (it also installs the supporting packages of rmr)


Installing rhdfs package

Install from the command line using the following command:

R CMD INSTALL rhdfs-1.0.8.tar.gz

To install using RStudio, follow these steps:

Click on Tools → Install Packages

Change the Install from option from Repository (CRAN) to Package Archive File (.tar.gz)

Choose the rhdfs-1.0.8.tar.gz file from your local system

Click on the Install button (it also installs the supporting packages of rhdfs)


Installing rhbase package

Install from the command line using the following command:

R CMD INSTALL rhbase-1.2.1.tar.gz

To install using RStudio, follow these steps:

Click on Tools → Install Packages

Change the Install from option from Repository (CRAN) to Package Archive File (.tar.gz)

Choose the rhbase-1.2.1.tar.gz file from your local system

Click on the Install button (it also installs the supporting packages of rhbase)


Loading the RHadoop libraries

Just as we load any other library in R, we can load the RHadoop libraries using the require() or library() functions.

library('rhdfs') # Loading HDFS
library('rmr2') # Loading MapReduce
library('rhbase') # Loading HBase


Initializing the RHadoop

Initialize the rhdfs package with a parameter specifying the location of the Hadoop configuration files.

Syntax:
hdfs.init(hadoop=PATH)

Here PATH specifies the location of the Hadoop configuration file. If no parameter is passed, the configuration is taken by default from the HADOOP_CMD environment variable.


hdfs.ls

Lists the files and directories in HDFS. It returns a data frame whose columns correspond to permissions, owner, group, size (in bytes), modification time, and file or directory name.

Syntax:
hdfs.ls(path, recurse=FALSE)

If recurse is TRUE, it also lists the subdirectories recursively.


hdfs.defaults

This method is used to set and get the default configuration of HDFS.

Syntax:
hdfs.defaults(arg)

arg is the name of a parameter, or NULL.

This function lists the following values:

local: an rJava object corresponding to the local file system
blocksize: default block size of the files stored in HDFS
fs: an rJava object corresponding to HDFS
fu: helper object for rhdfs
classpath: the Java classpath
replication: default replication factor in HDFS
conf: name-value mappings for the Hadoop configuration parameters


hdfs.defaults : Examples


hdfs.cat

This method reads lines from a file on HDFS.

Syntax:
hdfs.cat(path, n, buffersize)

path: location of the source file
n: number of lines to read from the file
buffersize: size of the buffer (optional)

Example:
hdfs.cat('/RHadoop/1/example.txt')


hdfs.put

This method transfers data from the local system to HDFS.

Syntax:
hdfs.put(src, dest, dstFS=hdfs.defaults("fs"))

src: location of the source directory or file
dest: location of the destination directory or file
dstFS: the destination file system (optional)

Example:
hdfs.put('/home/dp/Desktop/example.txt', '/RHadoop/1/')


hdfs.get

This method transfers data from HDFS to the local system.

Syntax:
hdfs.get(src, dest, srcFS=hdfs.defaults("fs"))

src: location of the source directory or file
dest: location of the destination directory or file
srcFS: the source file system (optional)

Example:
hdfs.get('/RHadoop/1/', '/home/dp/Desktop/1/')


hdfs.copy | hdfs.cp

This method copies data from one location in HDFS to another location in HDFS.

Syntax:
hdfs.copy(src, dest, overwrite=FALSE)

src: location of the source directory or file
dest: location of the destination directory or file
overwrite: whether an existing file should be overwritten

Example:
hdfs.copy('/RHadoop/1/', '/RHadoop/2/')


hdfs.move

This method moves data from one location in HDFS to another location in HDFS and removes the source directory or file.

Syntax:
hdfs.move(src, dest)

src: location of the source directory or file
dest: location of the destination directory or file

Example:
hdfs.move('/RHadoop/1/', '/RHadoop/2/')


hdfs.rename

This method renames a file or directory in HDFS from R.

Syntax:
hdfs.rename(src, dest)

src: location of the source directory or file
dest: location of the destination directory or file

Example:
hdfs.rename('/RHadoop/1/example.txt', '/RHadoop/1/sample.txt')


hdfs.rm | hdfs.rmr | hdfs.delete

These functions delete files or directories in HDFS from R.

Syntax:
hdfs.delete(path)
hdfs.rm(path)
hdfs.rmr(path)

Example:
hdfs.delete("/RHadoop/1/")
hdfs.rm("/RHadoop/1/")
hdfs.rmr("/RHadoop/1/")


hdfs.chmod

This method changes the permissions of HDFS files or directories.

Syntax:
hdfs.chmod(path, permissions='777')

permissions is a character string that represents the permissions of a file or directory.

Example:
hdfs.chmod("/RHadoop", permissions='777')
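The permission string uses the usual octal notation, one digit each for owner, group, and others. As a reading aid, the following base R sketch translates such a string into rwx notation; perm_to_rwx() is an illustrative helper and not part of rhdfs.

```r
# Translate an octal permission string such as "750" into rwx notation.
perm_to_rwx <- function(perm) {
  digits <- as.integer(strsplit(perm, "")[[1]])
  # Each digit is a 3-bit mask: 4 = read, 2 = write, 1 = execute.
  one <- function(d) {
    paste0(if (bitwAnd(d, 4L) > 0) "r" else "-",
           if (bitwAnd(d, 2L) > 0) "w" else "-",
           if (bitwAnd(d, 1L) > 0) "x" else "-")
  }
  paste(vapply(digits, one, character(1)), collapse = "")
}

perm_to_rwx("777")  # "rwxrwxrwx"
perm_to_rwx("750")  # "rwxr-x---"
```

So '777' grants every permission to everyone, which is convenient on a single-node test setup but broader than most shared clusters would want.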


hdfs.dircreate | hdfs.mkdir

Both of these functions create a directory on the HDFS file system.

Syntax:
hdfs.mkdir(Dirname)

Example:
hdfs.mkdir("/RHadoop/3/")


hdfs.file

This is used to initialize a file for read/write operations on the local system or HDFS.

Syntax:
hdfs.file(path, mode, buffersize ..)

mode is 'r' for read mode or 'w' for write mode. Append mode is not allowed.

Example:
f = hdfs.file("/RHadoop/2/README.txt", "r", buffersize=104857600)


hdfs.write

This is used to write to a file stored on HDFS via streaming.

Syntax:
hdfs.write(object, con, hsync=FALSE)

object is any R object; con is an HDFS connection opened with hdfs.file.

Example:
obj = c(1,2,3,4,5,6,7)
con = hdfs.file("/RHadoop/2/obj.bin", "w")  # example path
hdfs.write(obj, con, hsync=FALSE)
hdfs.close(con)


hdfs.read

This is used to read from binary files in an HDFS directory. It uses a stream for the deserialization of the data.

Syntax:
hdfs.read(con, n, start)

n indicates the number of bytes, start indicates the starting block.

Example:
f = hdfs.file("/RHadoop/2/README.txt", "r", buffersize=104857600)
m = hdfs.read(f)
c = rawToChar(m)
print(c)


hdfs.close

This is used to close the stream when a file operation is complete. It closes the stream and does not allow further file operations.

Syntax:
hdfs.close(con)

con is the HDFS connection.

Example:
hdfs.close(f)
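The file functions above are normally used together: open, write or read, then close. The sketch below first shows the serialize/unserialize round trip locally in base R, then repeats it on HDFS. It assumes, as in the model-saving example of the rhdfs documentation, that hdfs.write stores a serialized copy of a non-raw R object; the HDFS part runs only when rhdfs is installed and HADOOP_CMD is set, and the file path is a made-up example.

```r
obj <- c(1, 2, 3, 4, 5, 6, 7)

# Local round trip: an R object becomes raw bytes and comes back unchanged.
bytes <- serialize(obj, connection = NULL)
restored <- unserialize(bytes)
identical(obj, restored)

# The same round trip on HDFS (skipped when rhdfs or HADOOP_CMD is missing).
if (requireNamespace("rhdfs", quietly = TRUE) &&
    nzchar(Sys.getenv("HADOOP_CMD"))) {
  library(rhdfs)
  hdfs.init()
  path <- "/RHadoop/2/obj.bin"   # hypothetical example path

  con <- hdfs.file(path, "w")    # open for writing
  hdfs.write(obj, con)
  hdfs.close(con)

  con <- hdfs.file(path, "r")    # reopen for reading
  restored_hdfs <- unserialize(hdfs.read(con))
  hdfs.close(con)
}
```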


hdfs.file.info

This is used to get meta information about a file stored on HDFS.

Syntax:
hdfs.file.info(PATH)


to.dfs

Writes R objects to the file system.

Syntax:
to.dfs(kv, output, format="native")

kv is any valid key-value pair, vector, matrix, etc.; output is any valid path; format is a string naming the format.

Example:
small.ints <- to.dfs(1:10)


from.dfs

This is used to read R objects from the HDFS file system that are stored in a serialized binary format.

Syntax:
from.dfs(input, format)

input is any valid path, and format is a string naming the format.

Example:
from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')


mapreduce

This is used for defining and executing a MapReduce job.

Syntax:
mapreduce(input, output, map, reduce, input.format, output.format)

input: path to the input folder on HDFS
output: path to the output folder on HDFS
map: an optional R function returning NULL or a value of keyval()
reduce: an optional R function of two arguments, a key and a data structure representing all the values associated with that key
input.format: type of input data
output.format: type of output data


keyval

The keyval function creates the return values of the map and reduce functions, which are themselves parameters to mapreduce.

Syntax:
keyval(key, val)

Here key is the desired key or keys, and val is the desired value or values.
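The pieces to.dfs, mapreduce, keyval, and from.dfs fit together as in the sketch below, which squares the numbers 1 to 10. It uses the rmr2 local backend, so no cluster is needed, and it runs only when rmr2 is installed. The expected data frame is computed in plain R so the job result has something to be compared against; its name is illustrative.

```r
# What the job should produce, computed without MapReduce.
expected <- data.frame(n = 1:10, square = (1:10)^2)

# Runs only when rmr2 is installed; the local backend needs no Hadoop cluster.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")

  small.ints <- to.dfs(1:10)
  squares <- mapreduce(
    input = small.ints,
    map = function(k, v) keyval(v, v^2)  # key: the number, value: its square
  )

  # from.dfs returns a list with the components key and val.
  result <- from.dfs(squares)
  print(data.frame(n = result$key, square = result$val))
}
```

Because no reduce function is given, the output of the map step is returned as it is.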


WordCount Mapreduce source code

# Set environment variables
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop/")

# Load libraries
library(rmr2)
library(rhdfs)

# Initialize the rhdfs package
hdfs.init()


# Map: split each line into words and emit (word, 1) pairs
map <- function(k, lines) {
  words.list <- strsplit(lines, ' ')
  words <- unlist(words.list)
  return(keyval(words, 1))
}

# Reduce: sum the counts of each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}


## Read text files from folder /in1/wc/
hdfs.root <- '/in1'
hdfs.data <- file.path(hdfs.root, 'wc')

## Save result in folder /in1/out
hdfs.out <- file.path(hdfs.root, 'out')

## Submit job
out <- wordcount(hdfs.data, hdfs.out)

## Fetch the result and show the first rows
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df)
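The word counts come back in no particular order, so a natural next step is to sort them. The sketch below runs without Hadoop by using a stand-in for the value of from.dfs(out): a list with the components key and val, filled here with made-up words and counts.

```r
# Stand-in for from.dfs(out); the words and counts are invented.
results <- list(key = c("hadoop", "r", "data", "big"),
                val = c(1, 3, 2, 2))

results.df <- data.frame(word = results$key, count = results$val,
                         stringsAsFactors = FALSE)

# Most frequent words first; ties are broken alphabetically.
results.df <- results.df[order(-results.df$count, results.df$word), ]
rownames(results.df) <- NULL
print(results.df)
```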


WordCount Output


Thank You
