Outline Introduction RHadoop RHadoop Installation rhdfs rmr2 Examples
Big Data Analytics with R and Hadoop
D. Praveen Kumar
Research Scholar (Full-Time)
Department of Computer Science & Engineering
YSREC of Yogi Vemana University, Proddatur
Kadapa Dt., A. P, India
November 30, 2016
YSR Engineering College of YVU, Proddatur, Kadapa
Big Data Analytics with R and Hadoop
November 30, 2016 Slide: 1 / 70
1 Introduction
2 RHadoop
3 RHadoop Installation
4 rhdfs Methods
5 rmr2
6 Examples
Big Data - Introduction
Big Data deals with large and complex data sets that can be structured, semi-structured, or unstructured and that typically do not fit into memory. They have to be processed in place, which means that the computation has to be done where the data resides.
Big Data - 3V’s
Velocity refers to the low-latency, real-time speed at which the analytics need to be applied. (Example: performing analytics on a continuous stream of data originating from a social networking site.)
Volume refers to the size of the data set. It may be in KB, MB, GB, TB, or PB, depending on the type of application that generates or receives the data.
Variety refers to the various types of data that can exist, for example, text, audio, video, and photos.
Popular Organizations that hold Big Data
Some of the popular organizations that hold Big Data are as follows (figures up to 2014):
Facebook: It has 40 PB of data and captures 100 TB/day
Yahoo!: It has 60 PB of data
Twitter: It captures 8 TB/day
eBay: It has 40 PB of data and captures 50 TB/day
Hadoop - Introduction
Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.
Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting.
Its impact can be boiled down to four salient characteristics: scalable, cost-effective, flexible, fault-tolerant solutions.
Apache Hadoop has two main features:
HDFS (Hadoop Distributed File System) - Storing
MapReduce - Processing
Requirements
Necessary
Java >= 7
ssh
Linux OS (Ubuntu >=14.04)
Hadoop framework
Optional
Eclipse
Internet connection
Java 7 & Installation
Hadoop requires a working Java installation; Java 1.7 or later is recommended.
Either of the following commands installs Java on Linux:
sudo apt-get install openjdk-7-jdk
sudo apt-get install default-jdk
Java PATH Setup
We need to set the Java path.
Open the .bashrc file located in the home directory:
gedit ~/.bashrc
Add the line below at the end:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
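If the setup is scripted rather than typed into an editor, re-running it should not pile up duplicate export lines. A minimal sketch of an idempotent append follows; append_once is a hypothetical helper name, not part of Hadoop or of these slides, and the demonstration writes to a scratch file instead of the real ~/.bashrc.

```shell
# append_once FILE LINE: append LINE to FILE unless an identical line is already there.
# (Hypothetical helper, shown only as an illustration.)
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file; pass "$HOME/.bashrc" only once you trust the helper.
profile=$(mktemp)
append_once "$profile" 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64'
append_once "$profile" 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64'
cat "$profile"
```

The second call is a no-op, so the scratch file ends up with a single export line.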
Installation & Configuration of SSH
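Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
Install SSH using the following command:
sudo apt-get install ssh
Next, generate an SSH key for the user (here a DSA key with an empty passphrase) and authorize it for logins to this machine:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
OpenSSH 7.0 and later disable DSA keys by default; on such systems, generate an RSA key (-t rsa, file id_rsa) instead.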
Download & Extract Hadoop
Download Hadoop from the Apache Download Mirrors
http://mirror.fibergrid.in/apache/hadoop/common/
Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
$ sudo chmod 777 /usr/local
$ cd /usr/local
$ tar xzf hadoop-2.7.2.tar.gz
$ sudo mv hadoop-2.7.2 hadoop
Add Hadoop configuration in .bashrc
Add the Hadoop configuration to .bashrc in the home directory:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Create temp file, DataNode & NameNode
Create the NameNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
Create the DataNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/datanode
Create the Hadoop tmp directory and hand it over to the Hadoop user (hadoop1 in this setup):
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Files to Configure
The following are the files we need to configure
core-site.xml
hadoop-env.sh
mapred-site.xml
hdfs-site.xml
Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
Add the property below to specify the location of the tmp directory:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
Add the property below to specify the default file system and its port number:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
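For a scripted setup, the two properties can be written into a complete core-site.xml in one step. The sketch below is only an illustration: write_core_site is a hypothetical helper name, and the demonstration targets a scratch directory rather than the real Hadoop configuration directory.

```shell
# write_core_site DIR: write a minimal core-site.xml with the two properties above into DIR.
# (Hypothetical helper; it overwrites any existing core-site.xml in DIR.)
write_core_site() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
}

# Demonstration in a scratch directory; inspect the result before copying it anywhere.
scratch_conf=$(mktemp -d)
write_core_site "$scratch_conf"
cat "$scratch_conf/core-site.xml"
```

Writing the whole file avoids hand-editing between the configuration tags, at the cost of replacing whatever was in the file before.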
Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Uncomment the JAVA_HOME line and give the correct path for Java:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml
This file holds the host name and port at which the MapReduce job tracker runs. Add the following to mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
Add properties in ... etc/hadoop/hdfs-site.xml
In the file hdfs-site.xml, add the following.
Set the replication factor:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Specify the NameNode directory:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
Specify the DataNode directory (the Hadoop 2.x property is dfs.datanode.data.dir):
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>
Formatting the HDFS file system via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop file system.
This is needed only the first time you set up a Hadoop cluster.
Do not format a running Hadoop file system, as you will lose all the data currently in HDFS.
To format the file system, run the command (Hadoop 2.x also accepts the newer form hdfs namenode -format):
hadoop namenode -format
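Because formatting destroys existing HDFS metadata, a setup script can check the NameNode directory first. The sketch below assumes that a formatted NameNode directory contains a current/VERSION file; the function names are hypothetical, and the demonstration only prints what it would do, using scratch directories instead of /usr/local/hadoopdata/hdfs/namenode.

```shell
# namenode_is_formatted DIR: succeed if DIR already holds NameNode metadata.
# Assumption: a formatted NameNode directory contains current/VERSION.
namenode_is_formatted() {
    [ -e "$1/current/VERSION" ]
}

# format_if_fresh DIR: report whether formatting DIR looks safe (it never formats anything).
format_if_fresh() {
    if namenode_is_formatted "$1"; then
        echo "refusing to format: $1 already contains metadata"
        return 1
    fi
    echo "safe to run: hadoop namenode -format"
}

# Demonstration with one empty and one already-populated scratch directory.
fresh_dir=$(mktemp -d)
used_dir=$(mktemp -d)
mkdir -p "$used_dir/current"
touch "$used_dir/current/VERSION"

format_if_fresh "$fresh_dir"
format_if_fresh "$used_dir" || true
```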
Starting single-node cluster
Run the command:
start-all.sh
This will start up a NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager on your machine.
A nifty tool for checking whether the expected Hadoop processes are running is jps:
hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager
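Reading the jps listing by eye works for one machine; a small filter can do the comparison instead. The sketch below reads jps-style output ("pid name" per line) on standard input and prints every expected daemon that is absent. missing_daemons is a hypothetical helper name, and the demonstration feeds it a captured listing, with the DataNode deliberately left out, rather than calling jps.

```shell
# missing_daemons: read `jps` output on stdin and print each expected daemon that is absent.
# (Hypothetical helper; the daemon list is the one from the slide above.)
missing_daemons() {
    listing=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>"; compare the second field exactly.
        printf '%s\n' "$listing" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || printf '%s\n' "$daemon"
    done
}

# Demonstration with a captured listing in which the DataNode is not running.
sample='2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
3242 NodeManager'
printf '%s\n' "$sample" | missing_daemons
```

On a live machine the same check would be run as jps | missing_daemons; empty output means all five daemons are up.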
Stopping your single-node cluster
To stop all the daemons running on your machine, run the command:
stop-all.sh
The output will look like this:
stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode
R - Introduction
R is an open source software package for performing statistical analysis on data.
R is a programming language derived from S (a statistical language).
R provides a wide variety of statistical, machine learning, and graphical techniques, and is highly extensible.
R can connect to other data stores, such as MySQL, SQLite, MongoDB, and Hadoop.
R - Features
The following are some of R's features:
Effective statistical programming language
Relational database support
Data analytics
Data visualization
Extension through the vast library of R packages
R - Operations
R supports data analytics through various operations, such as:
Regression
Classification
Clustering
Recommendation
Text mining
R - Installation (Windows)
For Windows, follow the given steps:
1 Navigate to www.r-project.org.
2 Click on the CRAN section, select a CRAN mirror, and select your Windows OS (stick to Linux; Hadoop is almost always used in a Linux environment).
3 Download the latest R version from the mirror.
4 Execute the downloaded .exe to install R.
R - Installation (Ubuntu)
For Linux-Ubuntu, follow the given steps:
1 Navigate to www.r-project.org.
2 Click on the CRAN section, select a CRAN mirror, and select your OS.
3 In the /etc/apt/sources.list file, add the CRAN <mirror> entry.
4 Download and update the package lists from the repositories using the sudo apt-get update command.
5 Install the R system using the sudo apt-get install r-base command.
RHEL/CentOS
For Linux-RHEL/CentOS, follow the given steps:
1 Navigate to www.r-project.org.
2 Click on CRAN, select CRAN mirror, and select Red Hat OS.
3 Download the R-*core-*.rpm file.
4 Install the .rpm package using the rpm -ivh R-*core-*.rpm command.
5 Install R system using sudo yum install R.
Hadoop MapReduce in R
Hadoop MapReduce can be used from R in three ways:
1 R and Hadoop Integrated Programming Environment (RHIPE)
2 HadoopStreaming
3 RHadoop
Among these three, RHadoop is efficient and the easiest to use.
RHadoop - Introduction
RHadoop was developed by Revolution Analytics.
RHadoop is available as three main R packages:
1 rhdfs - provides HDFS data operations
2 rmr - provides MapReduce execution operations
3 rhbase - provides access to data stored in HBase
It is not necessary to install all three RHadoop packages to run Hadoop MapReduce operations with R and Hadoop.
RHadoop - Architecture
rhdfs
rhdfs is an R interface that makes HDFS usable from the R console.
The rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS.
With the rhdfs methods, an R programmer can easily perform read and write operations on distributed data files.
rmr
rmr is an R interface that provides the Hadoop MapReduce facility inside the R environment.
The R programmer only needs to divide the application logic into map and reduce phases and submit it with the rmr methods.
After that, rmr calls the Hadoop streaming MapReduce API with several job parameters, such as input directory, output directory, mapper, and reducer, to run the R MapReduce job on the Hadoop cluster.
rhbase
rhbase is an R interface for operating on the Hadoop HBase data source stored on the distributed network, via a Thrift server.
The rhbase package provides several methods for initialization, read/write, and table manipulation operations.
R and Hadoop installation
We already installed R and Hadoop
Installing the R packages
To connect R and Hadoop we need to install some of the packages:
httr
functional
devtools
plyr
reshape2
rJava
RJSONIO
itertools
digest
Rcpp
install.packages(c('httr', 'functional', 'devtools', 'plyr', 'reshape2'))
install.packages(c('rJava', 'RJSONIO', 'itertools', 'digest', 'Rcpp'))
Setting environment variables
We need to set the following environment variables through the R console. The version in the streaming jar's file name must match the installed Hadoop release.
## Setting HADOOP_CMD
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
## Setting up HADOOP_STREAMING
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")
Alternatively, set them on the command line before starting R (HADOOP_CMD must point to the hadoop executable, as above):
export HADOOP_CMD="/usr/local/hadoop/bin/hadoop"
export HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"
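Hard-coding the jar version is fragile, since the file name changes with every Hadoop release. One option is to look the jar up under the install directory. The sketch below assumes the share/hadoop/tools/lib layout used above; find_streaming_jar is a hypothetical helper name, and the demonstration runs against a scratch tree with an empty placeholder jar instead of a real installation.

```shell
# find_streaming_jar HADOOP_DIR: print the path of the first streaming jar found under HADOOP_DIR.
# (Hypothetical helper; assumes the share/hadoop/tools/lib layout.)
find_streaming_jar() {
    find "$1/share/hadoop/tools/lib" -name 'hadoop-streaming-*.jar' 2>/dev/null | head -n 1
}

# Demonstration against a scratch tree that mimics the install layout.
fake_install=$(mktemp -d)
mkdir -p "$fake_install/share/hadoop/tools/lib"
touch "$fake_install/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar"

HADOOP_STREAMING=$(find_streaming_jar "$fake_install")
export HADOOP_STREAMING
echo "$HADOOP_STREAMING"
```

With a real installation, the argument would be /usr/local/hadoop (or $HADOOP_INSTALL).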
Usage of Hadoop Streaming jar
Downloading RHadoop Packages
Download the RHadoop packages from the GitHub repository of Revolution Analytics:
https://github.com/RevolutionAnalytics/RHadoop
rmr: [rmr-2_3.3.1.tar.gz]
rhdfs: [rhdfs-1.0.8.tar.gz]
rhbase: [rhbase-1.2.1.tar.gz]
We can install these packages using the R command line or RStudio.
Installing rmr package
Install from the command line using the following command:
R CMD INSTALL rmr-2_3.3.1.tar.gz
To install using RStudio, follow these steps:
Click on Tools → Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rmr-2_3.3.1.tar.gz file from your local system
Click the Install button (the supporting packages listed earlier should already be installed)
Installing rhdfs package
Install from the command line using the following command:
R CMD INSTALL rhdfs-1.0.8.tar.gz
To install using RStudio, follow these steps:
Click on Tools → Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rhdfs-1.0.8.tar.gz file from your local system
Click the Install button (the supporting packages listed earlier should already be installed)
Installing rhbase package
Install from the command line using the following command:
R CMD INSTALL rhbase-1.2.1.tar.gz
To install using RStudio, follow these steps:
Click on Tools → Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rhbase-1.2.1.tar.gz file from your local system
Click the Install button (the supporting packages listed earlier should already be installed)
Loading the RHadoop libraries
The RHadoop libraries are loaded like any other R library, using the require() or library() methods.
library('rhdfs') # Loading HDFS
library('rmr2') # Loading MapReduce
library('rhbase') # Loading HBase
Initializing the RHadoop
Initialize the rhdfs package with a parameter specifying the location of the Hadoop configuration.
Syntax:
hdfs.init(hadoop=PATH)
Here PATH specifies the location of the Hadoop configuration. If no parameter is passed, the setting is taken by default from the HADOOP_CMD environment variable.
hdfs.ls
This method lists the files and directories of HDFS. It returns a data frame whose columns correspond to permissions, owner, group, size (in bytes), modification time, and file or directory name.
Syntax:
hdfs.ls(path, recurse=FALSE)
If recurse is TRUE, it also lists the subdirectories recursively.
hdfs.defaults
This method is used to set and get the default configuration of HDFS.
Syntax:
hdfs.defaults(arg)
arg is the name of a parameter, or NULL.
This function lists the following values:
local: an rJava object corresponding to the local system
blocksize: default block size of the files stored in HDFS
fs: an rJava object corresponding to HDFS
fu: helper object for rhdfs
classpath: the Java classpath
replication: default replication factor in HDFS
conf: name-value mappings for Hadoop configuration parameters
hdfs.defaults : Examples
hdfs.cat
This method reads lines from a file on HDFS.
Syntax:
hdfs.cat(path, n, buffersize)
path: location of the source file
n: number of lines to read from the file
buffersize: size of the buffer (optional)
Example:
hdfs.cat('/RHadoop/1/example.txt')
hdfs.put
This method transfers data from the local system to HDFS.
Syntax:
hdfs.put(src, dest, dstFS=hdfs.defaults("fs"))
src: location of the source directory or file
dest: location of the destination directory or file
dstFS: the destination file system (optional)
Example:
hdfs.put('/home/dp/Desktop/example.txt', '/RHadoop/1/')
hdfs.get
This method transfers data from HDFS to the local system.
Syntax:
hdfs.get(src, dest, srcFS=hdfs.defaults("fs"))
src: location of the source directory or file
dest: location of the destination directory or file
srcFS: the source file system (optional)
Example:
hdfs.get('/RHadoop/1/', '/home/dp/Desktop/1/')
hdfs.copy | hdfs.cp
This method copies data from one location in HDFS to another location in HDFS.
Syntax:
hdfs.copy(src, dest, overwrite=FALSE)
src: location of the source directory or file
dest: location of the destination directory or file
overwrite: whether an existing file should be overwritten
Example:
hdfs.copy('/RHadoop/1/', '/RHadoop/2/')
hdfs.move
This method moves data from one location in HDFS to another location in HDFS and removes the source directory or file.
Syntax:
hdfs.move(src, dest)
src: location of the source directory or file
dest: location of the destination directory or file
Example:
hdfs.move('/RHadoop/1/', '/RHadoop/2/')
hdfs.rename
This method renames a file or directory in HDFS from R.
Syntax:
hdfs.rename(src, dest)
src: location of the source directory or file
dest: location of the destination directory or file
Example:
hdfs.rename('/RHadoop/1/example.txt', '/RHadoop/1/sample.txt')
hdfs.rm | hdfs.rmr | hdfs.delete
These functions delete files or directories in HDFS from R.
Syntax:
hdfs.delete(path)
hdfs.rm(path)
hdfs.rmr(path)
Example:
hdfs.delete("/RHadoop/1/")
hdfs.rm("/RHadoop/1/")
hdfs.rmr("/RHadoop/1/")
hdfs.chmod
This method changes the permissions of HDFS files or directories.
Syntax:
hdfs.chmod(path, permissions='777')
permissions is a character string that represents the permissions of a file or directory.
Example:
hdfs.chmod("/RHadoop", permissions='777')
hdfs.dircreate | hdfs.mkdir
Both of these functions create a directory on the HDFS file system.
Syntax:
hdfs.mkdir(Dirname)
Example:
hdfs.mkdir("/RHadoop/3/")
hdfs.file
This is used to initialize a file for read/write operations on the local system or HDFS.
Syntax:
hdfs.file(path, mode, buffersize, ...)
mode is 'r' for read mode or 'w' for write mode. Append mode is not allowed.
Example:
f = hdfs.file("/RHadoop/2/README.txt","r",buffersize=104857600)
hdfs.write
This is used to write to a file stored on HDFS via streaming.
Syntax:
hdfs.write(object, con, hsync=FALSE)
object is any R object; con is an HDFS connection.
Example (con is a connection opened in write mode with hdfs.file):
obj = c(1,2,3,4,5,6,7)
hdfs.write(obj, con, hsync=FALSE)
hdfs.read
This is used to read from binary files in an HDFS directory. It uses the stream to deserialize the data.
Syntax:
hdfs.read(con, n, start)
n indicates the number of bytes; start indicates the starting block.
Example:
f = hdfs.file("/RHadoop/2/README.txt","r",buffersize=104857600)
m = hdfs.read(f)
c = rawToChar(m)
print(c)
hdfs.close
This is used to close the stream when a file operation is complete. It closes the stream and does not allow further file operations.
Syntax:
hdfs.close(con)
con is the HDFS connection.
Example:
hdfs.close(f)
hdfs.file.info
This is used to get meta information about a file stored on HDFS.
Syntax:
hdfs.file.info(PATH)
to.dfs
Write R objects to the file system.
Syntax:
to.dfs(kv, output, format="native")
kv is any valid key-value pair, vector, matrix, etc.; output is any valid path; format is a string naming the format.
Example:
small.ints <- to.dfs(1:10)
from.dfs
This is used to read R objects from the HDFS file system that are stored in the binary serialized format.
Syntax:
from.dfs(input, format)
input is any valid path; format is a string naming the format.
Example:
from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
mapreduce
This is used for defining and executing a MapReduce job.
Syntax:
mapreduce(input, output, map, reduce, input.format, output.format)
input: path to the input folder on HDFS
output: path to the output folder on HDFS
map: an optional R function returning NULL or a value of keyval()
reduce: an optional R function of two arguments, a key and a data structure representing all the values associated with that key
input.format: type of input data
output.format: type of output data
keyval
The keyval function creates the return values of the map and reduce functions, which are themselves parameters to mapreduce.
Syntax:
keyval(key, val)
Here key is the desired key or keys, and val is the desired value or values.
WordCount Mapreduce source code
# Set environment variables (the jar version must match the installed Hadoop release)
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop/")
# load libraries
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
# map: split each line into words and emit (word, 1)
map <- function(k, lines) {
  words.list <- strsplit(lines, ' ')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
# reduce: sum the counts collected for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder /in1/wc/
hdfs.root <- '/in1'
hdfs.data <- file.path(hdfs.root, 'wc')
## save result in folder /in1/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df)
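Before submitting the job to the cluster, the same map, sort, and reduce flow can be tried on a small sample with ordinary Unix tools, which is also how a Hadoop streaming job sees its data (tab-separated key and value on each line). This is a local illustration only, not part of rmr2: the function names are hypothetical, and unlike the R mapper above, which splits on single spaces, this sketch splits on any whitespace.

```shell
# Mapper: split the input into words and emit "word<TAB>1" for each one.
map_words() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Reducer: sum the counts for each word.
reduce_counts() {
    awk -F '\t' '{ total[$1] += $2 } END { for (w in total) print w "\t" total[w] }'
}

# Map, sort by key (standing in for the shuffle phase), reduce, then sort for stable output.
wordcount_local() {
    map_words | LC_ALL=C sort | reduce_counts | LC_ALL=C sort
}

# Demonstration on two short lines of text.
printf 'big data with r\nr and hadoop with big data\n' | wordcount_local
```

Comparing this output with head(results.df) for the same sample file is a quick way to check that the cluster job counts what is expected.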
WordCount Output
Thank You