Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs

Big Data using Hadoop on Amazon Elastic MapReduce

Hands-On Workshop

Dr. Thanachart Numnonda, [email protected]

Danairat T., Certified Java Programmer, TOGAF – [email protected], +66-81-559-1446

Aug 2013

Lecture: Big Data Development Process




Big Data Development Process Guideline

Architecture Planning

• Targeted Users

• Target Opportunities

• Data Scientist

• Data Source/Type

• Data Capturing Approach

• Data Processing and Visualize Planning

• Technology Architecture

• Big Data Ecosystem (Hadoop Ecosystem)

• Sizing

• Integration

• Security

• Administration and Operation Planning

Big Data Development

• Develop Use Cases

• Set up Big Data Pseudo-distributed Mode

• Set up HDFS

• Develop Data Capturing System

• Develop Data Analytic: Map Reduce, Hive, R, etc.

• Integrate result to Enterprise Analytic System

• Set up Big Data Cluster Mode

Operation and Support

• Monitor HDFS utilization and capacity planning

• Monitor Job Tracker availability

• Monitor Data Capturing System

• Upgrade or Patch Big Data Hadoop ecosystem

• System admin. Training

• Helpdesk Training

• End-User Training (Analytic Results)

System Evaluation

• Adoption Rates for each analytics results

• No. of Missing Analytic Results

• No. of Missing Data

• Lost hours per month

• Avg. of each Analytic Result Response Time

• No. of Technology System Failure per month




Hands-On: Running Hadoop on Local Mode




Hadoop Installation

Hadoop provides three installation choices:

● Local mode: This is an unzip and run mode to get you started right away, where all parts of Hadoop run within the same JVM

● Pseudo distributed mode: This mode runs the different parts of Hadoop as separate Java processes, but within a single machine

● Distributed mode: This is the real setup that spans multiple machines




Installing Hadoop and Ecosystem

1. Installing VirtualBox or VMware Player

2. Running Image File

3. Start Hadoop

4. Hadoop Web Console

5. Stop Hadoop

Notes:

Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6




Hadoop's Ecosystem in the VM

• HDFS (Hadoop Distributed File System)

• MapReduce (Job Scheduling/Execution System)

• HBase

• Pig, Hive, Sqoop




Starting Hadoop

[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/start-all.sh

This starts a NameNode, DataNode, JobTracker and TaskTracker on your machine.

[hdadmin@localhost hadoop]$ /usr/lib/jvm/jdk1.6.0_39/bin/jps

11567 Jps

10766 NameNode

11099 JobTracker

11221 TaskTracker

10899 DataNode

11018 SecondaryNameNode

[hdadmin@localhost hadoop]$

Check the Java processes with jps; you are now running Hadoop in pseudo distributed mode.




Hadoop is up!




Stopping Hadoop

[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode




Hands-On: Importing Data to HDFS using Hadoop Command Line




Importing Data to Hadoop

Creating new file in /tmp

$ vi /tmp/input_test.txt

GNOME Terminal is a terminal emulation application that you can use to perform the following tasks:

Access a UNIX shell in the GNOME environment

A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time.

Type the text above into the file, or type your own data.

$ hadoop dfs -mkdir /input

$ hadoop dfs -mkdir /output

$ hadoop dfs -copyFromLocal /tmp/input_test.txt /input




Hands-On: Reviewing, Retrieving, Deleting Data from HDFS




Review file in Hadoop HDFS

List HDFS File

[hdadmin@localhost bin]$ hadoop dfs -ls /input

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 1016 2013-03-13 20:11 /input/input_test.txt

Read HDFS File

[hdadmin@localhost bin]$ hadoop dfs -cat /input/input_test.txt

Retrieve HDFS File to Local File System

[hdadmin@localhost bin]$ hadoop dfs -copyToLocal /input/input_test.txt /tmp/file.txt

Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html




Review file in Hadoop HDFS using WebUI

http://localhost:50070/













Scroll Down
















Hadoop Port Numbers

Daemon                   Default Port   Configuration Parameter in conf/*-site.xml

HDFS Namenode            50070          dfs.http.address

HDFS Datanodes           50075          dfs.datanode.http.address

HDFS Secondarynamenode   50090          dfs.secondary.http.address

MR JobTracker            50030          mapred.job.tracker.http.address

MR Tasktrackers          50060          mapred.task.tracker.http.address




Review Content from System shell

[hdadmin@localhost current]$ cd /app/hadoop/tmp/dfs/data/current

[hdadmin@localhost current]$ ls -l

total 24

-rw-r--r--. 1 hdadmin hadoop 1016 Mar 13 20:11 blk_1997667773574667398

-rw-r--r--. 1 hdadmin hadoop 15 Mar 13 20:11 blk_1997667773574667398_1005.meta

-rw-r--r--. 1 hdadmin hadoop 4 Mar 13 20:04 blk_-6735227193197163844

-rw-r--r--. 1 hdadmin hadoop 11 Mar 13 20:04 blk_-6735227193197163844_1004.meta

-rw-r--r--. 1 hdadmin hadoop 482 Mar 13 20:18 dncp_block_verification.log.curr

-rw-r--r--. 1 hdadmin hadoop 154 Mar 13 20:03 VERSION

[hdadmin@localhost current]$ more blk_1997667773574667398

GNOME Terminal is a terminal emulation application that you can use to perform the following tasks:

Access a UNIX shell in the GNOME environment

A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time.

[hdadmin@localhost current]$




Removing data from HDFS using Shell Command

[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt

Deleted hdfs://localhost:54310/input/input_test.txt

[hdadmin@localhost detach]$




Hands-On: Running Hadoop on Amazon Elastic MapReduce




Architecture Overview of Amazon EMR




Creating an AWS account




Signing up for the necessary services

● Simple Storage Service (S3)

● Elastic Compute Cloud (EC2)

● Elastic MapReduce (EMR)

Caution! This costs real money!




Creating Amazon S3 bucket




Create access key using Security Credentials in the AWS Management Console







Creating a new Job Flow in EMR

























View Result from the S3 bucket




Lecture: Understanding Map Reduce Processing

[Diagram: Client; master node running the Name Node and Job Tracker; three worker nodes, each running a Data Node and a Task Tracker; Map Reduce]




MapReduce Framework

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)
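The two signatures above can be sketched in plain Java, without the Hadoop libraries. This is only an illustrative single-process sketch: the class MapReduceSketch, the helper run(), and the wordCount() example are hypothetical names made up for this example and are not part of the Hadoop API. A TreeMap stands in for the sort and grouping that Hadoop performs between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

public class MapReduceSketch {

    // map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3)
    public static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, List<Map.Entry<K3, V3>>> reduceFn) {
        // Group intermediate values by key; the TreeMap keeps the keys sorted.
        TreeMap<K2, List<V2>> grouped = new TreeMap<K2, List<V2>>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> kv : mapFn.apply(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<V2>()).add(kv.getValue());
            }
        }
        // Call the reduce function once per key with the list of its values.
        List<Map.Entry<K3, V3>> output = new ArrayList<Map.Entry<K3, V3>>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.addAll(reduceFn.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count: K1 = line offset, V1 = line, K2 = K3 = word, V2 = V3 = count.
    public static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Long, String>> input = new ArrayList<Map.Entry<Long, String>>();
        long offset = 0;
        for (String line : lines) {
            input.add(new SimpleEntry<Long, String>(offset, line));
            offset += line.length() + 1; // assumes single-byte characters and "\n" line endings
        }
        BiFunction<Long, String, List<Map.Entry<String, Integer>>> mapFn = (key, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(new SimpleEntry<String, Integer>(token, 1));
                }
            }
            return out;
        };
        BiFunction<String, List<Integer>, List<Map.Entry<String, Integer>>> reduceFn = (word, counts) -> {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            out.add(new SimpleEntry<String, Integer>(word, sum));
            return out;
        };
        return run(input, mapFn, reduceFn);
    }
}
```

Calling MapReduceSketch.wordCount with a list of text lines returns one (word, count) entry per distinct word, ordered by word.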




MapReduce Processing – The Data flow

1. InputFormat, InputSplits, RecordReader

2. Mapper - your focus is here

3. Partition, Shuffle & Sort

4. Reducer - your focus is here

5. OutputFormat, RecordWriter




How does the MapReduce work?

[Diagram: InputSplit and RecordReader feed the map step, whose output is a list of (Key, Value); Partitioning and Sorting produce a list of (Key, List of Values) in the intermediate file for the reduce step; RecordWriter writes the result.]




How does the MapReduce work?

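[Diagram: the same flow with a Combining step. InputSplit and RecordReader feed the map step, whose output is a list of (Key, Value); Combining merges repeated keys locally (Car, 2); Partitioning and Sorting produce a list of (Key, List of Values) for the reduce step: Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}; RecordWriter writes the result.]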
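The combining example on this slide (Car, 2 before the shuffle, then Car, {2,1} at the reducer) can be reproduced with a small plain-Java sketch. The three input lines used in main() are an assumption chosen to match the values shown; the slide text itself does not list them. The class and method names are hypothetical and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Map + combine for one input split: count each word locally before the shuffle.
    public static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<String, Integer>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    // Shuffle and sort: gather the combined counts for each key from all splits.
    public static Map<String, List<Integer>> shuffle(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> e : mapAndCombine(split).entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<Integer>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the list of values for each key.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed input: one line per split.
        List<String> splits = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(shuffle(splits));
        System.out.println(reduce(shuffle(splits)));
    }
}
```

Without the combine step the second split would send (Car, 1) twice across the network; with it, a single (Car, 2) is sent, and the reducer still arrives at the same total.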




InputFormat

TextInputFormat

Description: Default format; reads lines of text files

Key: The byte offset of the line

Value: The line contents

KeyValueTextInputFormat

Description: Parses lines into key, val pairs

Key: Everything up to the first tab character

Value: The remainder of the line

SequenceFileInputFormat

Description: A Hadoop-specific high-performance binary format

Key: user-defined

Value: user-defined
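To make the key and value columns concrete, here is a minimal plain-Java sketch of how the two text formats turn file contents into records. It only imitates the behaviour described above; the class and method names are hypothetical, it does not call Hadoop's record readers, and it assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InputFormatSketch {

    // Text format: key = byte offset of the line, value = line contents.
    public static List<Map.Entry<Long, String>> textRecords(String fileContents) {
        List<Map.Entry<Long, String>> records = new ArrayList<Map.Entry<Long, String>>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new SimpleEntry<Long, String>(offset, line));
            // +1 accounts for the newline byte that ended this line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Key/value text format: key = text before the first tab, value = the remainder.
    public static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No tab on the line: the whole line becomes the key, the value is empty.
            return new SimpleEntry<String, String>(line, "");
        }
        return new SimpleEntry<String, String>(line.substring(0, tab), line.substring(tab + 1));
    }
}
```

The Mapper in the WordCount lab receives records of the first kind, which is why its input key type is LongWritable and its input value type is Text.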




InputSplit

An InputSplit describes a unit of work that comprises a single map task.

InputSplit presents a byte-oriented view of the input.

You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

RecordReader

RecordReader reads <key, value> pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.



http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/RecordReader.html


Mapper

The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.

Partition, Shuffle & Sort

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers.

The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.

The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

This process of moving map outputs to the reducers is known as shuffling.
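As a rough illustration of partitioning, the sketch below uses the same arithmetic as Hadoop's default hash partitioner: the key's hash code, with the sign bit masked off, modulo the number of reduce tasks. The class name PartitionerSketch is hypothetical, and because Hadoop's Text type computes its hash differently from java.lang.String, the partition numbers here will not necessarily match those of a real job.

```java
public class PartitionerSketch {

    // Returns a partition number in the range [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit so that keys with a
    // negative hash code still map to a valid, non-negative partition.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"Bear", "Car", "Deer", "River"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + getPartition(key, 2));
        }
    }
}
```

Because the result depends only on the key, every (key, value) pair with the same key reaches the same reducer, no matter which map task emitted it.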




Reducer

The Reducer is an instance of user-provided code that reads each key, together with the iterator over its values, in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.

OutputFormat, Record Writer

OutputFormat governs the writing format in OutputCollector and RecordWriter writes output into HDFS.

OutputFormat               Description

TextOutputFormat           Default; writes lines in "key \t value" form

SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs

NullOutputFormat           Generates no output files




Hands-On: Writing Your Own Map Reduce Program




Wordcount (HelloWorld in Hadoop)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it; args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}




Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment




Packaging Map Reduce Program

Usage

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir /home/hduser/wordcount_classes

$ cd /home/hduser

$ javac -classpath /usr/local/hadoop/hadoop-core-0.20.205.0.jar -d wordcount_classes WordCount.java

$ jar -cvf ./wordcount.jar -C wordcount_classes/ .

$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir

Output:

…….

$ hadoop dfs -cat /output/wordcount_output_dir/part-00000




Reviewing MapReduce Output Result

Scroll down the web page
































Hands-On: Running WordCount.jar on Amazon EMR




Upload .jar file and input file to Amazon S3

1. Select <yourbucket> in Amazon S3 service

2. Create folder : applications

3. Upload wordcount.jar to the applications folder

4. Create another folder: input

5. Upload input_test.txt to the input folder




Create a new Job Flow in EMR




Input JAR Location and Arguments
















View the Result




Lecture: Understanding Hive




Introduction: A Petabyte Scale Data Warehouse Using Hadoop

Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.




What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).




System Architecture and Components

• Metastore: To store the meta data.

• Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.

• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.

• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).

• Clients: Command line client similar to the MySQL command line.
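The SerDe idea can be illustrated with a small plain-Java sketch. It is not Hive's SerDe interface: the class DelimitedSerDeSketch and its methods are hypothetical, and the two-column layout (id INT, country STRING) is borrowed from the test_tbl table used in the Hive labs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DelimitedSerDeSketch {
    private final String delimiter;

    public DelimitedSerDeSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    // Deserializer: text record -> Java object (here, a list of typed column values).
    public List<Object> deserialize(String record) {
        // Pattern.quote treats the delimiter literally, so "|" or "." also work.
        String[] fields = record.split(Pattern.quote(delimiter), -1);
        List<Object> row = new ArrayList<Object>();
        row.add(Integer.parseInt(fields[0].trim())); // id INT
        row.add(fields[1]);                          // country STRING
        return row;
    }

    // Serializer: Java object -> text record that could be written back to storage.
    public String serialize(List<Object> row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(row.get(i));
        }
        return sb.toString();
    }
}
```

With a comma delimiter, deserialize("66,Thailand") yields the Integer 66 and the String "Thailand", and serialize turns that row back into the original line.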

hive.apache.org




Architecture Overview

[Diagram components: Hive CLI (DDL, Queries, Browsing); Mgmt. Web UI; Thrift API; MetaStore; Hive QL (Parser, Planner, Execution); SerDe (Thrift, Jute, JSON, ...); Map Reduce; HDFS]




Sample HiveQL

The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, for example the following queries:

SELECT * FROM t where t.c = 'xyz'

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

SELECT t1.c1, count(1) from t1 group by t1.c1
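To give a feel for how a query could become a map/reduce job, the sketch below evaluates the JOIN query above in the style of a reduce-side join: rows are grouped by the join column c1, and matching rows are paired per key. This is one common strategy shown under simplifying assumptions; it is not Hive's actual plan or code, and the class name and the String-array row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Rows are String arrays: t1 rows are {c1, c2}; t2 rows start with c1.
    // Returns the t1.c2 values of SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1).
    public static List<String> join(List<String[]> t1, List<String[]> t2) {
        // "Map" phase: key every row by the join column c1. A real job would tag
        // each row with its source table and let the shuffle do this grouping.
        Map<String, List<String>> t1ByKey = new TreeMap<String, List<String>>();
        Map<String, Integer> t2CountByKey = new TreeMap<String, Integer>();
        for (String[] row : t1) {
            t1ByKey.computeIfAbsent(row[0], k -> new ArrayList<String>()).add(row[1]);
        }
        for (String[] row : t2) {
            t2CountByKey.merge(row[0], 1, Integer::sum);
        }
        // "Reduce" phase: for each key, emit t1.c2 once per matching t2 row.
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, List<String>> e : t1ByKey.entrySet()) {
            int matches = t2CountByKey.getOrDefault(e.getKey(), 0);
            for (int i = 0; i < matches; i++) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }
}
```

The first query (a simple filter) needs no reduce phase at all, which matches the "Number of reduce tasks is set to 0" message that appears in the Hive labs.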





Running Hive

Hive Shell

● Interactive

hive

● Script

hive -f myscript

● Inline

hive -e 'SELECT * FROM mytable'





Hands-On: Creating Table and Retrieving Data using Hive




Hive Hands-On Labs

1. Creating Hive Table

2. Reviewing Hive Table in HDFS

3. Alter and Drop Hive Table

4. Loading Data to Hive Table

5. Querying Data from Hive Table

6. Reviewing Hive Table Content from HDFS Command and WebUI

7. Insert Overwriting the Hive Table




Starting Hive

$ hive

Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties

Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303171635_1944738265.txt

hive>

Quit from Hive

hive> quit;

Re-start the Hive CLI again




1. Creating Hive Table

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

OK

Time taken: 4.069 seconds

hive (default)> show tables;

OK

test_tbl


hive (default)> describe test_tbl;

OK

id int

country string


hive (default)>

See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html




2. Reviewing Hive Table in HDFS

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl

[hdadmin@localhost hdadmin]$

Review Hive Table from HDFS WebUI





3. Alter and Drop Hive Table

hive (default)> alter table test_tbl add columns (remarks STRING);

hive (default)> describe test_tbl;

OK

id int

country string

remarks string


hive (default)> drop table test_tbl;

OK


See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html





CREATE EXTERNAL TABLE weblog_entries (
    ip STRING, dash1 STRING, dash2 STRING,
    date STRING, status1 STRING, getstr STRING,
    link STRING, http STRING,
    Status STRING,
    size INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/data/';

weblog.hsql

hive -f weblog_create_external_table.hql

See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html




4. Loading Data to Hive Table

$ hive

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Creating Hive table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data.csv

Copying file: file:/tmp/test_tbl_data.csv

Loading data to table default.test_tbl

OK


hive (default)>

Loading data to Hive table





5. Querying Data from Hive Table

hive (default)> select * from test_tbl;

OK

1 USA

62 Indonesia

63 Philippines

65 Singapore

66 Thailand


hive (default)>





hive (default)> select country from test_tbl;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201303171733_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303171733_0001

Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303171733_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2013-03-17 18:13:19,097 Stage-1 map = 0%, reduce = 0%

2013-03-17 18:13:25,151 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec







MapReduce Total cumulative CPU time: 250 msec

Ended Job = job_201303171733_0001

MapReduce Jobs Launched:

Job 0: Map: 1 Cumulative CPU: 0.25 sec HDFS Read: 282 HDFS Write: 45 SUCCESS

Total MapReduce CPU Time Spent: 250 msec

OK

USA

Indonesia

Philippines

Singapore

Thailand


hive (default)>




6. Reviewing Hive Table Content from HDFS Command and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv


[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv

1,USA

62,Indonesia

63,Philippines

65,Singapore

66,Thailand





7. Insert Overwriting the Hive Table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data_updated.csv

Copying file: file:/tmp/test_tbl_data_updated.csv

Loading data to table default.test_tbl

Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl

OK


hive (default)>




Review Hive Table Created in HDFS and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 3510 2013-03-17 18:25 /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv


[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv

93,Afghanistan

355,Albania

213,Algeria

1684,AmericanSamoa

376,Andorra

244,Angola

1264,Anguilla

672,Antarctica

1268,AntiguaandBarbuda

54,Argentina

374,Armenia

297,Aruba

61,Australia

43,Austria

994,Azerbaijan

1242,Bahamas

973,Bahrain

…




Hands-On: Install the Amazon EMR Command Line Interface




Installing Amazon EMR CLI

1. Install Ruby

2. Download the Amazon EMR CLI

3. Install the Amazon EMR CLI

4. Create your credentials file (credentials.json)

5. Create an Amazon EC2 key pair

6. Configure your SSH credentials

7. Verify installation of the Amazon EMR CLI

Instructions:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html




Example: Credentials file

{

"access_id": "AKI..........................A",

"private_key": "SaJHI4wjyK.............UWDaYOw2el",

"keypair": "imckey",

"key-pair-file": "~/elastic-mapreduce-cli/imckey.pem",

"log_uri": "s3n://imcbucket/",

"region": "us-west-2"

}




Running Amazon EMR CLI

THANACHARTs-MacBook-Air:~ THANACHART$ cd elastic-mapreduce-cli/

THANACHARTs-MacBook-Air:elastic-mapreduce-cli THANACHART$

THANACHARTs-MacBook-Air:elastic-mapreduce-ruby THANACHART$ ./elastic-mapreduce --list

j-2JW8QBWXIYNV8 TERMINATED ec2-54-213-112-102.us-west-2.compute.amazonaws.com HBase CLI

COMPLETED Start HBase

j-1JNA9G1O7ET2G TERMINATED ec2-54-213-112-74.us-west-2.compute.amazonaws.com Hive Interactive2

COMPLETED Setup Hive

j-1H7NX8OGFNFRW TERMINATED ec2-54-213-10-135.us-west-2.compute.amazonaws.com Hive Interactive




Hands-On: Running Hive Interactive on Amazon EMR




Running Hive on Amazon EMR

● Amazon EMR enables you to run Hive scripts in two modes:

● Interactive

● Batch





Upload an input file to Amazon S3


2. Create a folder: data

3. Upload hdi-data.csv to the data folder




Running Hive Interactive










Select EC2 Key Pair







Find Job Flow ID




Running CLI to check the Job Flow

$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D

j-37WK3Z1T2FZ7D STARTING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo

PENDING Setup Hive

$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D

j-37WK3Z1T2FZ7D RUNNING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo

RUNNING Setup Hive

$ ./elastic-mapreduce --ssh j-37WK3Z1T2FZ7D

hadoop@ip-172-31-24-126:~$ hive

Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties

Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201308011448_800175951.txt

hive>





Create a table using HiveQL

hive> CREATE TABLE HDI(

> id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch

> INT, gni INT)

> ROW FORMAT DELIMITED

> FIELDS TERMINATED BY ","

> STORED AS TEXTFILE

> LOCATION "s3://imcbucket/data";

OK


hive> SHOW TABLES;

OK

hdi





Running a SELECT statement

hive> SELECT country, gni FROM hdi WHERE gni > 2000;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201308011444_0001, Tracking URL = http://ip-172-31-24-126:9100/jobdetails.jsp?jobid=job_201308011444_0001

Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=172.31.24.126:9001 -kill job_201308011444_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2013-08-01 14:55:53,846 Stage-1 map = 0%, reduce = 0%





Running a SELECT statement (cont.)

MapReduce Total cumulative CPU time: 15 seconds 520 msec

Ended Job = job_201308011444_0001

Counters:

MapReduce Jobs Launched:

Job 0: Map: 1 Accumulative CPU: 15.52 sec HDFS Read: 372 HDFS Write: 2435 SUCCESS

Total MapReduce CPU Time Spent: 15 seconds 520 msec

OK

Norway 47557

Australia 34431

Netherlands 36402

United States 43017

New Zealand 23737

...




Lecture: Understanding Pig




Introduction: A high-level platform for creating MapReduce programs using Hadoop

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.




Pig Components

● Two Components

● Language (Pig Latin)

● Compiler

● Two Execution Environments

● Local: pig -x local

● Distributed: pig -x mapreduce




Running Pig

● Script: pig myscript

● Command line (Grunt): pig

● Embedded: writing a Java program




Pig Latin





Pig Execution Stages

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi




Why Pig?

● Makes writing Hadoop jobs easier

● 5% of the code, 5% of the time

● You don't need to be a programmer to write Pig scripts

● Provides major functionality required for Data Warehouse and Analytics

● Load, Filter, Join, Group By, Order, Transform

● Users can write custom UDFs (User Defined Functions)

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi




Pig vs. Hive




Hands-On: Running a Pig script




Starting Pig Command Line

[hdadmin@localhost ~]$ pig -x local

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log

2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found

2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

grunt>




Writing a Pig Script

#Preparing Data

[hdadmin@localhost ~]$ cp hadoop_data/hdi-data.csv /usr/local/pig-0.11.1/bin/

#Edit Your Script

[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/

[hdadmin@localhost ~]$ vi countryFilter.pig

countryFilter.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);

B = FILTER A BY gni > 2000;

C = ORDER B BY gni;

dump C;
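For readers more at home in Java, the three Pig statements in countryFilter.pig correspond roughly to the stream pipeline below. This is a single-machine sketch for comparison only, not what Pig generates; the class CountryFilterSketch and its Row type are hypothetical, and only the id, country and gni columns are kept.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class CountryFilterSketch {

    // The columns of hdi-data.csv that this example keeps.
    public static final class Row {
        public final int id;
        public final String country;
        public final int gni;

        public Row(int id, String country, int gni) {
            this.id = id;
            this.country = country;
            this.gni = gni;
        }
    }

    // Parses one line laid out as: id,country,hdi,lifeex,mysch,eysch,gni
    public static Row parse(String line) {
        String[] f = line.split(",");
        return new Row(Integer.parseInt(f[0]), f[1], Integer.parseInt(f[6]));
    }

    public static List<Row> filterAndOrder(List<String> lines) {
        return lines.stream()
                .map(CountryFilterSketch::parse)                    // A = load ... AS (...)
                .filter(r -> r.gni > 2000)                          // B = FILTER A BY gni > 2000
                .sorted(Comparator.comparingInt((Row r) -> r.gni))  // C = ORDER B BY gni
                .collect(Collectors.toList());                      // dump C
    }
}
```

The Pig version is shorter and, in mapreduce mode, runs the same logic as distributed jobs, which is the main reason to prefer it for large files.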




[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/

[hdadmin@localhost ~]$ pig -x local

grunt> run countryFilter.pig

....

(150,Cameroon,0.482,51,5,10,2031)

(126,Kyrgyzstan,0.615,67,9,12,2036)

(156,Nigeria,0.459,51,5,8,2069)

(154,Yemen,0.462,65,2,8,2213)

(138,Lao People's Democratic Republic,0.524,67,4,9,2242)

(153,Papua New Guinea,0.466,62,4,5,2271)

(165,Djibouti,0.43,57,3,5,2335)

(129,Nicaragua,0.589,74,5,10,2430)

(145,Pakistan,0.504,65,4,6,2550)

Running a Pig Script




Writing a Join operation script

CountryJoin.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);

B = FILTER A BY gni > 2000;

C = ORDER B BY gni;

D = load 'export-data.csv' using PigStorage(',') AS (country:chararray, expct:float);

E = JOIN C BY country, D BY country;

dump E;




Hands-On: Running a Pig script on Amazon EMR




Upload .pig file to Amazon S3


2. Upload countryFilter-EMR.pig to the data folder




Creating a Pig program













Viewing a result







Lecture: Understanding HBase




Introduction: An open source, non-relational, distributed database

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.




HBase Features

● Column oriented data store, known as Hadoop Database

● Supports random realtime CRUD operations (unlike HDFS)

● NoSQL database

● Open source, written in Java

● Runs on a cluster of commodity hardware




HBase Architecture





When to use HBase?

● When you need high volume data to be stored

● Un-structured data

● Sparse data

● Column-oriented data

● Versioned data (same data template, captured at various times; time-lapse data)

● When you need high scalability




Hands-On: Running HBase




Starting HBase shell

[hdadmin@localhost ~]$ start-hbase.sh

starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out

[hdadmin@localhost ~]$ jps

3064 TaskTracker

2836 SecondaryNameNode

2588 NameNode

3513 Jps

3327 HMaster

2938 JobTracker

2707 DataNode

[hdadmin@localhost ~]$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013

hbase(main):001:0>




Create a table and insert data in HBase

hbase(main):009:0> create 'test', 'cf'

0 row(s) in 1.0830 seconds

hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'


hbase(main):011:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1375363287644, value=val1


hbase(main):002:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1375363287644, value=val1
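The shell session above hints at HBase's data model: a sorted map from row key to column ("family:qualifier") to timestamped versions of a value. The plain-Java sketch below mimics that model in memory. It is not the HBase client API; the class HBaseModelSketch and its methods are hypothetical and exist only to show why put, get and scan behave the way they do.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

public class HBaseModelSketch {

    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // put: store a new version of a cell; older versions are kept.
    public void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
             .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.<Long>reverseOrder()))
             .put(timestamp, value);
    }

    // get: return the newest version of a cell, or null if the cell does not exist.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = table.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    // scan: row keys come back in sorted order because the outer map is sorted.
    public Iterable<String> scanRowKeys() {
        return table.keySet();
    }
}
```

Because absent cells simply have no entry in the maps, sparse rows cost nothing to store, which is the property the "Sparse data" bullet in the lecture refers to.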





Hands-On: Running HBase commands on Amazon EMR




Create a HBase shell
















Find Job Flow ID




Starting HBase Shell

$ ./elastic-mapreduce --list -j j-3MKWRS0K8IH7K

j-3MKWRS0K8IH7K WAITING ec2-54-213-117-162.us-west-2.compute.amazonaws.com HBase Interactive

COMPLETED Start HBase

$ ./elastic-mapreduce --ssh j-3MKWRS0K8IH7K

hadoop@ip-172-31-33-161:~$ hbase shell




Recommendations for Further Study

Hadoop Beginner's Guide

Hadoop: The Definitive Guide, 3rd Edition





Hadoop in Practice

Hadoop MapReduce Cookbook





Amazon Elastic MapReduce Developer Guide




Thank you