Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs

Big Data using Hadoop on Amazon Elastic MapReduce

Hands On Workshop

Dr. Thanachart Numnonda

Danairat T.

Certified Java Programmer, TOGAF – Silver, +66-81-559-1446

Danairat T. and Thanachart Numnonda, Aug 2013. Big Data Hadoop on Amazon EMR – Hands On Workshop

Lecture: Big Data Development Process

Big Data Development Process Guideline

Architecture Planning

• Targeted Users
• Target Opportunities
• Data Scientist
• Data Source/Type
• Data Capturing Approach
• Data Processing and Visualize Planning
• Technology Architecture
• Big Data Ecosystem (Hadoop Ecosystem)
• Sizing
• Integration
• Security
• Administration and Operation Planning

Big Data Development

• Develop Use Cases
• Set up Big Data Pseudo-distribution Mode
• Set up HDFS
• Develop Data Capturing System
• Develop Data Analytic: Map Reduce, Hive, R, etc.
• Integrate result to Enterprise Analytic System
• Set up Big Data Cluster Mode

Operation and Support

• Monitor HDFS utilization and capacity planning
• Monitor Job Tracker availability
• Monitor Data Capturing System
• Upgrade or Patch Big Data Hadoop ecosystem
• System admin. Training
• Helpdesk Training
• End-User Training (Analytic Results)

System Evaluation

• Adoption Rates for each analytics results
• No. of Missing Analytic Results
• No. of Missing Data
• Lost hours per month
• Avg. of each Analytic Result Response Time
• No. of Technology System Failure per month

Hands-On: Running Hadoop on Local Mode

Hadoop Installation

Hadoop provides three installation choices:

● Local mode: This is an unzip-and-run mode to get you started right away, where all parts of Hadoop run within the same JVM

● Pseudo distributed mode: This mode runs the different parts of Hadoop as separate Java processes, but within a single machine

● Distributed mode: This is the real setup that spans multiple machines

Installing Hadoop and Ecosystem

1. Installing VirtualBox or VMware Player

2. Running Image File

3. Start Hadoop

4. Hadoop Web Console

5. Stop Hadoop

Notes:

Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6

Hadoop's Ecosystem in the VM

• Pig, Sqoop, Hive, HBase
• MapReduce (Job Scheduling/Execution System)
• HDFS (Hadoop Distributed File System)

Starting Hadoop

Start up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine:

[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/start-all.sh

Check the Java processes to confirm that you are now running Hadoop in pseudo distributed mode:

[hdadmin@localhost hadoop]$ /usr/lib/jvm/jdk1.6.0_39/bin/jps
11567 Jps
10766 NameNode
11099 JobTracker
11221 TaskTracker
10899 DataNode
11018 SecondaryNameNode
[hdadmin@localhost hadoop]$

Hadoop is up!

Stopping Hadoop

[hdadmin@localhost hadoop]$ /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode

Hands-On: Importing Data to HDFS using Hadoop Command Line

Importing Data to Hadoop

Create a new file in /tmp:

$ vi /tmp/input_test.txt

Type the contents of the text file. The text below is only a sample; please type your own data:

GNOME Terminal is a terminal emulation application that you can use to perform the following tasks:

Access a UNIX shell in the GNOME environment

A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time.

Create the input and output directories in HDFS and copy the file into /input:

$ hadoop dfs -mkdir /input
$ hadoop dfs -mkdir /output
$ hadoop dfs -copyFromLocal /tmp/input_test.txt /input

Hands-On: Reviewing, Retrieving, Deleting Data from HDFS

Review file in Hadoop HDFS

List HDFS File

[hdadmin@localhost bin]$ hadoop dfs -ls /input
Found 1 items
-rw-r--r-- 1 hdadmin supergroup 1016 2013-03-13 20:11 /input/input_test.txt

Read HDFS File

[hdadmin@localhost bin]$ hadoop dfs -cat /input/input_test.txt

Retrieve HDFS File to Local File System

[hdadmin@localhost bin]$ hadoop dfs -copyToLocal /input/input_test.txt /tmp/file.txt

Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

Review file in Hadoop HDFS using WebUI

http://localhost:50070/

Hadoop Port Numbers

Daemon                     Default Port   Configuration Parameter in conf/*-site.xml
HDFS  Namenode             50070          dfs.http.address
HDFS  Datanodes            50075          dfs.datanode.http.address
HDFS  Secondarynamenode    50090          dfs.secondary.http.address
MR    JobTracker           50030          mapred.job.tracker.http.address
MR    Tasktrackers         50060          mapred.task.tracker.http.address

Review Content from System shell

[hdadmin@localhost current]$ cd /app/hadoop/tmp/dfs/data/current

[hdadmin@localhost current]$ ls -l

total 24

-rw-r--r--. 1 hdadmin hadoop 1016 Mar 13 20:11 blk_1997667773574667398

-rw-r--r--. 1 hdadmin hadoop 15 Mar 13 20:11 blk_1997667773574667398_1005.meta

-rw-r--r--. 1 hdadmin hadoop 4 Mar 13 20:04 blk_-6735227193197163844

-rw-r--r--. 1 hdadmin hadoop 11 Mar 13 20:04 blk_-6735227193197163844_1004.meta

-rw-r--r--. 1 hdadmin hadoop 482 Mar 13 20:18 dncp_block_verification.log.curr

-rw-r--r--. 1 hdadmin hadoop 154 Mar 13 20:03 VERSION

[hdadmin@localhost current]$ more blk_1997667773574667398

GNOME Terminal is a terminal emulation application that you can use to perform the following tasks:

Access a UNIX shell in the GNOME environment

A shell is a program that interprets and executes the commands that you type at a command line prompt. When you start GNOME Terminal, the application starts the default shell that is specified in your system account. You can switch to a different shell at any time.

[hdadmin@localhost current]$

Removing data from HDFS using Shell Command

[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt

Deleted hdfs://localhost:54310/input/input_test.txt

[hdadmin@localhost detach]$

Hands-On: Running Hadoop on Amazon Elastic MapReduce

Architecture Overview of Amazon EMR

Creating an AWS account

Signing up for the necessary services

● Simple Storage Service (S3)
● Elastic Compute Cloud (EC2)
● Elastic MapReduce (EMR)

Caution! This costs real money!

Creating Amazon S3 bucket

Create access key using Security Credentials in the AWS Management Console

Creating a new Job Flow in EMR

View Result from the S3 bucket

Lecture: Understanding Map Reduce Processing

[Figure: Map Reduce cluster layout. A Client; a master with the Name Node and Job Tracker; three worker nodes, each with a Data Node and a Task Tracker.]

MapReduce Framework

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)
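
The two signatures above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not the Hadoop API: the names MiniMapReduce, Pair, MapFn, ReduceFn and run are hypothetical, a TreeMap stands in for the sort and group step, and character offsets stand in for byte offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map/reduce type signatures; not the Hadoop API. */
class MiniMapReduce {

    /** A simple (key, value) pair. */
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    /** map: (K1, V1) -> list(K2, V2) */
    interface MapFn<K1, V1, K2, V2> {
        List<Pair<K2, V2>> map(K1 key, V1 value);
    }

    /** reduce: (K2, list(V2)) -> list(K3, V3) */
    interface ReduceFn<K2, V2, K3, V3> {
        List<Pair<K3, V3>> reduce(K2 key, List<V2> values);
    }

    /** Runs map on every input record, groups the output by sorted key, then runs reduce. */
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Pair<K3, V3>> run(
            List<Pair<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Pair<K1, V1> record : input) {
            for (Pair<K2, V2> out : mapper.map(record.key, record.value)) {
                grouped.computeIfAbsent(out.key, k -> new ArrayList<>()).add(out.value);
            }
        }
        List<Pair<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> entry : grouped.entrySet()) {
            result.addAll(reducer.reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    /** Word count expressed with the two functions. */
    static List<Pair<String, Integer>> wordCount(List<String> lines) {
        List<Pair<Long, String>> input = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            input.add(new Pair<>(Long.valueOf(offset), line));
            offset += line.length() + 1; // character offset used in place of a byte offset
        }
        MapFn<Long, String, String, Integer> mapper = (key, value) -> {
            List<Pair<String, Integer>> out = new ArrayList<>();
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(new Pair<>(token, 1));
                }
            }
            return out;
        };
        ReduceFn<String, Integer, String, Integer> reducer = (key, values) -> {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            List<Pair<String, Integer>> out = new ArrayList<>();
            out.add(new Pair<>(key, sum));
            return out;
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        for (Pair<String, Integer> p : wordCount(lines)) {
            System.out.println(p.key + "\t" + p.value);
        }
    }
}
```

Here K1 is the line offset, V1 the line text, K2 and K3 the word, and V2 and V3 the count, which mirrors the types used by the WordCount program later in this workshop.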

MapReduce Processing – The Data flow

1. InputFormat, InputSplits, RecordReader

2. Mapper - your focus is here

3. Partition, Shuffle & Sort

4. Reducer - your focus is here

5. OutputFormat, RecordWriter

How does the MapReduce work?

[Figure: word-count data flow. InputSplit and RecordReader feed the map tasks, which write their output as a list of (Key, Value) in the intermediate file; combining merges repeated keys (e.g. Car, 2). Sorting and partitioning then produce a list of (Key, List of Values) in the intermediate file (Bear, {1,1}; Car, {2,1}; Deer, {1,1}; River, {1,1}) for the reduce tasks, and the RecordWriter writes the final output.]

InputFormat

InputFormat: TextInputFormat
Description: Default format; reads lines of text files
Key: The byte offset of the line
Value: The line contents

InputFormat: KeyValueTextInputFormat
Description: Parses lines into key, val pairs
Key: Everything up to the first tab character
Value: The remainder of the line

InputFormat: SequenceFileInputFormat
Description: A Hadoop-specific high-performance binary format
Key: user-defined
Value: user-defined
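
To make the Key and Value columns concrete, here is a small plain-Java sketch of how the first two formats derive records from text. It does not use Hadoop classes: RecordSketch, textRecords and keyValueRecord are hypothetical names, and treating a line without a tab as a key with an empty value is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of how line-based input formats form (key, value) records. */
class RecordSketch {

    /** TextInputFormat style: key = byte offset of the line, value = line contents. */
    static List<Map.Entry<Long, String>> textRecords(String contents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        if (contents.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : contents.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<>(Long.valueOf(offset), line));
            // Advance by the encoded length of the line plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    /** Key/value style: key = text before the first tab, value = the remainder. */
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption of this sketch: no tab means the whole line is the key.
            return new AbstractMap.SimpleEntry<>(line, "");
        }
        return new AbstractMap.SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : textRecords("hello\nbig data\n")) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
        Map.Entry<String, String> kv = keyValueRecord("TH\tThailand\tAsia");
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```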

InputSplit

An InputSplit describes a unit of work that comprises a single map task.

InputSplit presents a byte-oriented view of the input.

You can control the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

RecordReader

RecordReader reads <key, value> pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.
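
As a rough illustration of how a minimum split size interacts with the HDFS block size, the sketch below assumes the split size is chosen as max(minSize, min(goalSize, blockSize)), which is how the Hadoop 1.x FileInputFormat is understood to compute it. This is plain Java rather than Hadoop code: SplitSizeSketch and its methods are hypothetical names, and the small tolerance Hadoop applies to the last split is ignored.

```java
/** Plain-Java illustration of split-size arithmetic; an assumption-based sketch. */
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    /**
     * goalSize:  total input size divided by the requested number of map tasks
     * minSize:   the value of mapred.min.split.size
     * blockSize: the HDFS block size of the file
     */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Number of splits for one file, rounding up and ignoring any slack on the last split. */
    static long splitCount(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileSize = 200 * MB;
        long blockSize = 64 * MB;
        long goalSize = fileSize / 2; // as if two map tasks were requested

        long defaultSplit = computeSplitSize(goalSize, 1, blockSize);
        long raisedSplit = computeSplitSize(goalSize, 128 * MB, blockSize);

        System.out.println("default split size: " + defaultSplit / MB + " MB, splits: "
                + splitCount(fileSize, defaultSplit));
        System.out.println("raised split size: " + raisedSplit / MB + " MB, splits: "
                + splitCount(fileSize, raisedSplit));
    }
}
```

Under these assumptions, raising the minimum above the block size yields fewer, larger splits and therefore fewer map tasks.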

Mapper

The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.

Partition, Shuffle & Sort

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.

The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
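
The partitioning rule can be sketched in a few lines. The example below is plain Java, not Hadoop code: it applies the same arithmetic as Hadoop's default hash partitioner (a non-negative hash code modulo the number of reduce tasks) to String keys, and then sorts the keys inside each partition. PartitionSketch, partitionFor and shuffle are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of partitioning and per-partition sorting of map output keys. */
class PartitionSketch {

    /** Non-negative hash of the key, modulo the number of reduce tasks. */
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Assigns every key to a partition and sorts the keys within each partition. */
    static Map<Integer, List<String>> shuffle(List<String> mapOutputKeys, int numReduceTasks) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (int p = 0; p < numReduceTasks; p++) {
            partitions.put(p, new ArrayList<>());
        }
        for (String key : mapOutputKeys) {
            partitions.get(partitionFor(key, numReduceTasks)).add(key);
        }
        for (List<String> keys : partitions.values()) {
            Collections.sort(keys);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Deer", "Bear", "River", "Car", "Car", "River");
        Map<Integer, List<String>> partitions = shuffle(keys, 2);
        for (Map.Entry<Integer, List<String>> e : partitions.entrySet()) {
            System.out.println("reduce task " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Because the partition depends only on the key, every occurrence of the same key lands at the same reduce task, which is what allows a single reducer to see all values for that key.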

Reducer

A Reducer is an instance of user-provided code that reads each key, together with the iterator over its values, in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect() which collects a (key, value) output.

OutputFormat, Record Writer

OutputFormat governs the format in which the OutputCollector's output is written, and RecordWriter writes the output into HDFS.

OutputFormat: Description

TextOutputFormat: Default; writes lines in "key \t value" form

SequenceFileOutputFormat: Writes binary files suitable for reading into subsequent MapReduce jobs

NullOutputFormat: Generates no output files

Hands-On: Writing your own Map Reduce Program

WordCount (HelloWorld in Hadoop)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it.
    // args[0] is the input path and args[1] is the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment

Packaging Map Reduce Program

Usage

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir /home/hduser/wordcount_classes
$ cd /home/hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-0.20.205.0.jar -d wordcount_classes WordCount.java
$ jar -cvf ./wordcount.jar -C wordcount_classes/ .

$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir

Output:

…….

$ hadoop dfs -cat /output/wordcount_output_dir/part-00000

Reviewing MapReduce Output Result

Hands-On: Running WordCount.jar on Amazon EMR

Upload .jar file and input file to Amazon S3

1. Select <yourbucket> in Amazon S3 service

2. Create folder: applications

3. Upload wordcount.jar to the applications folder

4. Create another folder: input

5. Upload input_test.txt to the input folder

Create a new Job Flow in EMR

Input JAR Location and Arguments

View the Result

Lecture: Understanding Hive

Introduction

A Petabyte Scale Data Warehouse Using Hadoop

Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL.

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).

System Architecture and Components

• Metastore: To store the meta data.

• Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.

• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.

• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).

• Clients: Command line client similar to the MySQL command line.

Source: hive.apache.org
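
The two SerDe roles described above can be illustrated with a toy example. The sketch below is plain Java and does not use Hive's SerDe interfaces: CsvRowSerDe and Row are hypothetical names, and the record layout (an integer id and a country string separated by a comma) is an assumption chosen to resemble a simple comma-delimited text table.

```java
/** Plain-Java illustration of the Serializer and Deserializer roles; not Hive's interfaces. */
class CsvRowSerDe {

    /** Hypothetical row type for a table with columns (id INT, country STRING). */
    static final class Row {
        final int id;
        final String country;
        Row(int id, String country) {
            this.id = id;
            this.country = country;
        }
    }

    /** Deserializer role: stored text record -> Java object that can be manipulated. */
    static Row deserialize(String line) {
        String[] fields = line.split(",", 2);
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected two comma-separated fields: " + line);
        }
        return new Row(Integer.parseInt(fields[0].trim()), fields[1]);
    }

    /** Serializer role: Java object -> text record ready to be written to storage. */
    static String serialize(Row row) {
        return row.id + "," + row.country;
    }

    public static void main(String[] args) {
        Row row = deserialize("1,Thailand");
        System.out.println(row.id + " / " + row.country);
        System.out.println(serialize(row));
    }
}
```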

Architecture Overview

[Figure: Hive architecture. Clients: Hive CLI (queries, browsing, DDL), Mgmt. Web UI and Thrift API. Hive QL: Parser, Planner and Execution. SerDe: Thrift, Jute, JSON, etc. MetaStore. Hive runs on Map Reduce and HDFS. Source: hive.apache.org]

Sample HiveQL

The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:

SELECT * FROM t where t.c = 'xyz'

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

SELECT t1.c1, count(1) from t1 group by t1.c1

Hive.apache.org

Running Hive

Hive Shell

● Interactive

hive

● Script

hive -f myscript

● Inline

hive -e 'SELECT * FROM mytable'

Hive.apache.org

Hands-On: Creating Table and Retrieving Data using Hive


Hive Hands-On Labs

1. Creating Hive Table

2. Reviewing Hive Table in HDFS

3. Alter and Drop Hive Table

4. Loading Data to Hive Table

5. Querying Data from Hive Table

6. Reviewing Hive Table Content from HDFS Command and WebUI

7. Insert Overwriting the Hive Table


Starting Hive

Start the Hive CLI again

$ hive

Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties

Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303171635_1944738265.txt

hive>

Quit from Hive

hive> quit;


1. Creating Hive Table

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

OK

Time taken: 4.069 seconds

hive (default)> show tables;

OK

test_tbl

Time taken: 0.138 seconds

hive (default)> describe test_tbl;

OK

id int

country string

Time taken: 0.147 seconds

hive (default)>

See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
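A delimited text table like test_tbl applies its schema when the file is read, not when it is written. The sketch below imitates that for a single line: it splits on the declared delimiter, casts each field to the declared type, and yields None (NULL) when a field cannot be converted. The helper name and the second sample line are assumptions.

```python
# Mirrors test_tbl(id INT, country STRING); the helper itself is hypothetical.
SCHEMA = [("id", int), ("country", str)]


def read_row(line, schema=SCHEMA, delimiter=","):
    """Split one text line and cast each field to its declared type."""
    row = {}
    for (name, cast), text in zip(schema, line.rstrip("\n").split(delimiter)):
        try:
            row[name] = cast(text)
        except ValueError:
            # A value that does not fit the declared type reads as NULL.
            row[name] = None
    return row


good = read_row("66,Thailand")
bad = read_row("abc,Laos")  # made-up malformed line
print(good, bad)
```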


2. Reviewing Hive Table in HDFS

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl

[hdadmin@localhost hdadmin]$

Review Hive Table from HDFS WebUI


3. Alter and Drop Hive Table

hive (default)> alter table test_tbl add columns (remarks STRING);

hive (default)> describe test_tbl;

OK

id int

country string

remarks string

Time taken: 0.077 seconds

hive (default)> drop table test_tbl;

OK

Time taken: 0.9 seconds

See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
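Adding a column changes the table definition only; data files already in the table directory are not rewritten. The sketch below shows the effect for a text table: the same lines are read before and after the column list grows, and the new column reads as None (NULL) for existing records. Names and sample lines are assumptions.

```python
def read_rows(lines, columns, delimiter=","):
    """Read delimited lines against the current column list."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        # Pad short records so columns added later read as None (NULL).
        fields += [None] * (len(columns) - len(fields))
        rows.append(dict(zip(columns, fields)))
    return rows


data_file = ["66,Thailand", "65,Singapore"]  # stands in for the stored file
columns = ["id", "country"]
before = read_rows(data_file, columns)

columns = columns + ["remarks"]  # metadata-only change, data_file untouched
after = read_rows(data_file, columns)
print(before[0], after[0])
```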


3. Alter and Drop Hive Table

CREATE EXTERNAL TABLE weblog_entries (
  ip STRING, dash1 STRING, dash2 STRING,
  date STRING, status1 STRING, getstr STRING,
  link STRING, http STRING,
  Status STRING,
  size INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/data/';

weblog.hsql

hive -f weblog_create_external_table.hql

See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
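The practical difference between a managed table and an EXTERNAL table shows up on DROP TABLE: dropping a managed table removes its data directory, while dropping an external table removes only the table definition. The toy metastore below sketches that behavior on the local file system; the class and its methods are hypothetical and are not part of Hive.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Hypothetical stand-in for table metadata; not Hive's metastore API."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = {"location": location, "external": external}

    def drop_table(self, name):
        table = self.tables.pop(name)
        if not table["external"]:
            # Managed table: the data directory goes away with the table.
            shutil.rmtree(table["location"])
        # External table: only the metadata entry is removed.


root = tempfile.mkdtemp()
store = ToyMetastore()
store.create_table("test_tbl", os.path.join(root, "warehouse", "test_tbl"))
store.create_table("weblog_entries", os.path.join(root, "data"), external=True)

store.drop_table("test_tbl")
store.drop_table("weblog_entries")

managed_exists = os.path.exists(os.path.join(root, "warehouse", "test_tbl"))
external_exists = os.path.exists(os.path.join(root, "data"))
print(managed_exists, external_exists)
shutil.rmtree(root)  # clean up the temporary directory
```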


4. Loading Data to Hive Table

$ hive

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Creating Hive table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data.csv

Copying file: file:/tmp/test_tbl_data.csv

Loading data to table default.test_tbl

OK

Time taken: 0.241 seconds

hive (default)>

Loading data to Hive table


5. Querying Data from Hive Table

hive (default)> select * from test_tbl;

OK

1 USA

62 Indonesia

63 Philippines

65 Singapore

66 Thailand

Time taken: 0.287 seconds

hive (default)>


5. Querying Data from Hive Table

hive (default)> select country from test_tbl;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201303171733_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303171733_0001

Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201303171733_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2013-03-17 18:13:19,097 Stage-1 map = 0%, reduce = 0%

2013-03-17 18:13:25,151 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:26,161 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:27,175 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:28,186 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:29,208 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:30,217 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.25 sec

2013-03-17 18:13:31,224 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.25 sec

MapReduce Total cumulative CPU time: 250 msec

Ended Job = job_201303171733_0001

MapReduce Jobs Launched:

Job 0: Map: 1 Cumulative CPU: 0.25 sec HDFS Read: 282 HDFS Write: 45 SUCCESS

Total MapReduce CPU Time Spent: 250 msec

OK

USA

Indonesia

Philippines

Singapore

Thailand

Time taken: 19.829 seconds

hive (default)>


6. Reviewing Hive Table Content from HDFS Command and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv

[hdadmin@localhost hdadmin]$

[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv

1,USA

62,Indonesia

63,Philippines

65,Singapore

66,Thailand

[hdadmin@localhost hdadmin]$


7. Insert Overwriting the Hive Table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data_updated.csv

Copying file: file:/tmp/test_tbl_data_updated.csv

Loading data to table default.test_tbl

Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl

OK

Time taken: 0.204 seconds

hive (default)>
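As the log lines show, LOAD DATA LOCAL copies the file into the table directory, and with OVERWRITE the existing contents are deleted first. The sketch below reproduces that with local directories standing in for the warehouse path; the function name, temporary paths and file contents are assumptions.

```python
import os
import shutil
import tempfile


def load_data_local(src_path, table_dir, overwrite=False):
    """Copy a local file into the table directory, optionally clearing it first."""
    if overwrite and os.path.isdir(table_dir):
        shutil.rmtree(table_dir)  # OVERWRITE: drop the files already there
    os.makedirs(table_dir, exist_ok=True)
    shutil.copy(src_path, table_dir)  # the file is copied as-is, not parsed
    return sorted(os.listdir(table_dir))


root = tempfile.mkdtemp()
table_dir = os.path.join(root, "warehouse", "test_tbl")
first = os.path.join(root, "test_tbl_data.csv")
second = os.path.join(root, "test_tbl_data_updated.csv")
with open(first, "w") as handle:
    handle.write("1,USA\n66,Thailand\n")
with open(second, "w") as handle:
    handle.write("93,Afghanistan\n355,Albania\n")

after_load = load_data_local(first, table_dir)
after_overwrite = load_data_local(second, table_dir, overwrite=True)
print(after_load, after_overwrite)
shutil.rmtree(root)  # clean up the temporary directory
```

Without overwrite=True the second call would leave both files in the directory, and a query would then read the rows of both.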


Review Hive Table Created in HDFS and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 3510 2013-03-17 18:25 /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv

[hdadmin@localhost hdadmin]$

[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data_updated.csv

93,Afghanistan

355,Albania

213,Algeria

1684,AmericanSamoa

376,Andorra

244,Angola

1264,Anguilla

672,Antarctica

1268,AntiguaandBarbuda

54,Argentina

374,Armenia

297,Aruba

61,Australia

43,Austria

994,Azerbaijan

1242,Bahamas

973,Bahrain


Hands-On: Install the Amazon EMR Command Line Interface


Installing Amazon EMR CLI

1. Install Ruby

2. Download the Amazon EMR CLI

3. Install the Amazon EMR CLI

4. Create your credentials file (credentials.json)

5. Create an Amazon EC2 key pair

6. Configure your SSH credentials

7. Verify installation of the Amazon EMR CLI

Instruction:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html


Example: Credentials file

{

"access_id": "AKI..........................A",

"private_key": "SaJHI4wjyK.............UWDaYOw2el",

"keypair": "imckey",

"key-pair-file": "~/elastic-mapreduce-cli/imckey.pem",

"log_uri": "s3n://imcbucket/",

"region": "us-west-2"

}
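Before running the CLI it can help to check that credentials.json is valid JSON and contains the keys shown in the example above. The sketch below does only that; the key list is copied from the example, and treating all of them as expected is an assumption, not a statement about what the CLI enforces.

```python
import json

# Keys taken from the example credentials file above.
EXPECTED_KEYS = {
    "access_id", "private_key", "keypair",
    "key-pair-file", "log_uri", "region",
}


def missing_keys(text):
    """Return the expected keys that are absent from the JSON text."""
    config = json.loads(text)  # raises ValueError on malformed JSON
    return sorted(EXPECTED_KEYS - set(config))


# Placeholder values only; never put real keys in sample code.
sample = '{"access_id": "PLACEHOLDER", "keypair": "imckey", "region": "us-west-2"}'
missing = missing_keys(sample)
print(missing)
```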


Running Amazon EMR CLI

THANACHARTs-MacBook-Air:~ THANACHART$ cd elastic-mapreduce-cli/

THANACHARTs-MacBook-Air:elastic-mapreduce-cli THANACHART$

THANACHARTs-MacBook-Air:elastic-mapreduce-ruby THANACHART$ ./elastic-mapreduce --list

j-2JW8QBWXIYNV8 TERMINATED ec2-54-213-112-102.us-west-2.compute.amazonaws.com HBase CLI

COMPLETED Start HBase

j-1JNA9G1O7ET2G TERMINATED ec2-54-213-112-74.us-west-2.compute.amazonaws.com Hive Interactive2

COMPLETED Setup Hive

j-1H7NX8OGFNFRW TERMINATED ec2-54-213-10-135.us-west-2.compute.amazonaws.com Hive Interactive
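When scripting around the CLI, the listing can be turned into records. The sketch below assumes the layout seen in this transcript: a job flow line with ID, state, master DNS name and job name, followed by indented step lines with state and name. The exact whitespace of the real output is an assumption, and the parser is hypothetical.

```python
import re

# Job flow line: ID, state, master DNS name, then the free-text job name.
JOB_FLOW = re.compile(r"^(j-[A-Z0-9]+)\s+(\S+)\s+(\S+)\s+(.*)$")


def parse_listing(text):
    """Parse --list output into job flows with their steps."""
    flows = []
    for line in text.splitlines():
        line = line.strip()
        match = JOB_FLOW.match(line)
        if match:
            flow_id, state, dns, name = match.groups()
            flows.append({"id": flow_id, "state": state, "dns": dns,
                          "name": name, "steps": []})
        elif line and flows:
            parts = line.split(None, 1)  # step state, then step name
            if len(parts) == 2:
                flows[-1]["steps"].append((parts[0], parts[1]))
    return flows


listing = (
    "j-37WK3Z1T2FZ7D STARTING "
    "ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo\n"
    "   PENDING Setup Hive\n"
)
flows = parse_listing(listing)
print(flows[0]["state"], flows[0]["steps"])
```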


Hands-On: Running Hive Interactive on Amazon EMR


Running Hive on Amazon EMR

● Amazon EMR enables you to run Hive scripts in two modes:

● Interactive

● Batch


Upload an input file to Amazon S3

1. Select <yourbucket> in Amazon S3 service

2. Create a folder: data

3. Upload hdi-data.csv to the data folder


Running Hive Interactive


Select EC2 Key Pair


Find Job Flow ID


Running CLI to check the Job Flow

$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D

j-37WK3Z1T2FZ7D STARTING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo

PENDING Setup Hive

$ ./elastic-mapreduce --list -j j-37WK3Z1T2FZ7D

j-37WK3Z1T2FZ7D RUNNING ec2-54-213-119-89.us-west-2.compute.amazonaws.com Hive Interactive Demo

RUNNING Setup Hive

$ ./elastic-mapreduce --ssh j-37WK3Z1T2FZ7D

hadoop@ip-172-31-24-126:~$ hive

Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties

Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201308011448_800175951.txt

hive>


Create a table using HiveQL

hive> CREATE TABLE HDI(

> id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch

> INT, gni INT)

> ROW FORMAT DELIMITED

> FIELDS TERMINATED BY ","

> STORED AS TEXTFILE

> LOCATION "s3://imcbucket/data";

OK

Time taken: 4.292 seconds

hive> SHOW TABLES;

OK

hdi

Time taken: 0.305 seconds


Running a SELECT statement

hive> SELECT country, gni FROM hdi WHERE gni > 2000;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201308011444_0001, Tracking URL = http://ip-172-31-24-126:9100/jobdetails.jsp?jobid=job_201308011444_0001

Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=172.31.24.126:9001 -kill job_201308011444_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2013-08-01 14:55:53,846 Stage-1 map = 0%, reduce = 0%

2013-08-01 14:58:37,725 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.52 sec


Running a SELECT statement (cont.)

MapReduce Total cumulative CPU time: 15 seconds 520 msec

Ended Job = job_201308011444_0001

Counters:

MapReduce Jobs Launched:

Job 0: Map: 1 Accumulative CPU: 15.52 sec HDFS Read: 372 HDFS Write: 2435 SUCCESS

Total MapReduce CPU Time Spent: 15 seconds 520 msec

OK

Norway 47557

Australia 34431

Netherlands 36402

United States 43017

New Zealand 23737

...


Lecture: Understanding Pig


Introduction

A high-level platform for creating MapReduce programs using Hadoop

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.


Pig Components

● Two Components

  ● Language (Pig Latin)

  ● Compiler

● Two Execution Environments

  ● Local

    pig -x local

  ● Distributed

    pig -x mapreduce


Running Pig

● Script

pig myscript

● Command line (Grunt)

pig

● Embedded

Writing a Java program


Pig Latin


Pig Execution Stages

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi


Why Pig?

● Makes writing Hadoop jobs easier

  ● 5% of the code, 5% of the time

  ● You don't need to be a programmer to write Pig scripts

● Provides major functionality required for data warehouse and analytics

  ● Load, Filter, Join, Group By, Order, Transform

● Users can write custom UDFs (User Defined Functions)

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi


Pig vs. Hive


Hands-On: Running a Pig script


Starting Pig Command Line

[hdadmin@localhost ~]$ pig -x local

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53

2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log

2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found

2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

grunt>


countryFilter.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;

#Preparing Data

[hdadmin@localhost ~]$ cp hadoop_data/hdi-data.csv /usr/local/pig-0.11.1/bin/

#Edit Your Script

[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/

[hdadmin@localhost ~]$ vi countryFilter.pig

Writing a Pig Script


[hdadmin@localhost ~]$ cd /usr/local/pig-0.11.1/bin/

[hdadmin@localhost ~]$ pig -x local

grunt> run countryFilter.pig

....

(150,Cameroon,0.482,51,5,10,2031)

(126,Kyrgyzstan,0.615,67,9,12,2036)

(156,Nigeria,0.459,51,5,8,2069)

(154,Yemen,0.462,65,2,8,2213)

(138,Lao People's Democratic Republic,0.524,67,4,9,2242)

(153,Papua New Guinea,0.466,62,4,5,2271)

(165,Djibouti,0.43,57,3,5,2335)

(129,Nicaragua,0.589,74,5,10,2430)

(145,Pakistan,0.504,65,4,6,2550)

Running a Pig Script
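For readers new to Pig Latin, the three statements of countryFilter.pig (load, FILTER, ORDER) map onto ordinary list operations. The sketch below mirrors them in memory; the first two input lines are taken from the output above and the third is a made-up low-income row added so that the filter has something to drop.

```python
# Field types from the AS (...) clause of the load statement.
TYPES = [int, str, float, int, int, int, int]
GNI = 6  # position of the gni field in each tuple


def load(lines, delimiter=","):
    """A = load ... using PigStorage(','): split each line and cast fields."""
    return [
        tuple(cast(value) for cast, value in zip(TYPES, line.split(delimiter)))
        for line in lines
    ]


lines = [
    "156,Nigeria,0.459,51,5,8,2069",
    "999,Exampleland,0.3,50,3,6,1500",  # hypothetical row, not in the data set
    "150,Cameroon,0.482,51,5,10,2031",
]

A = load(lines)
B = [row for row in A if row[GNI] > 2000]   # B = FILTER A BY gni > 2000
C = sorted(B, key=lambda row: row[GNI])     # C = ORDER B BY gni
for row in C:                               # dump C
    print(row)
```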


Writing a Join operation script

CountryJoin.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
D = load 'export-data.csv' using PigStorage(',') AS (country:chararray, expct:float);
E = JOIN C BY country, D BY country;
dump E;
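The JOIN step is an inner join on country: each output tuple is a row of C followed by the matching row of D, and rows without a match on either side are dropped. A minimal in-memory sketch follows; the export values are made up, since export-data.csv is not shown in the slides.

```python
from collections import defaultdict


def inner_join(left, left_key, right, right_key):
    """Join two lists of tuples on the given field positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Concatenate each left row with every right row sharing its key.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# C: filtered HDI rows (values from the earlier output).
C = [
    (150, "Cameroon", 0.482, 51, 5, 10, 2031),
    (156, "Nigeria", 0.459, 51, 5, 8, 2069),
]
# D: (country, expct) rows; both values are hypothetical.
D = [("Cameroon", 12.5), ("Atlantis", 1.0)]

E = inner_join(C, 1, D, 0)  # E = JOIN C BY country, D BY country
print(E)
```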


Hands-On: Running a Pig script on Amazon EMR


Upload .pig file to Amazon S3

1. Select <yourbucket> in Amazon S3 service

2. Upload countryFilter-EMR.pig to the data folder


Creating a Pig program


Viewing a result


Lecture: Understanding HBase


Introduction

An open source, non-relational, distributed database

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.


HBase Features

● Column-oriented data store, known as the Hadoop Database

● Supports random real-time CRUD operations (unlike HDFS)

● NoSQL database

● Open source, written in Java

● Runs on a cluster of commodity hardware


HBase Architecture


When to use HBase?

● When you need high volume data to be stored

  ● Unstructured data

  ● Sparse data

  ● Column-oriented data

  ● Versioned data (same data template, captured at various times, time-elapse data)

● When you need high scalability


Hands-On: Running HBase


Starting HBase shell

[hdadmin@localhost ~]$ start-hbase.sh

starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out

[hdadmin@localhost ~]$ jps

3064 TaskTracker

2836 SecondaryNameNode

2588 NameNode

3513 Jps

3327 HMaster

2938 JobTracker

2707 DataNode

[hdadmin@localhost ~]$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013

hbase(main):001:0>


Create a table and insert data in HBase

hbase(main):009:0> create 'test', 'cf'

0 row(s) in 1.0830 seconds

hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'

0 row(s) in 0.0750 seconds

hbase(main):011:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1375363287644, value=val1

1 row(s) in 0.0640 seconds

hbase(main):002:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1375363287644, value=val1

1 row(s) in 0.0370 seconds
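The shell session above reflects HBase's data model: a table maps a row key to columns named family:qualifier, and each cell keeps timestamped versions. The toy class below sketches that model with nested dictionaries so that put, get and scan can be tried without a cluster; it is a hypothetical illustration, not the HBase client API.

```python
import time


class ToyTable:
    """Hypothetical in-memory model: row key -> column -> timestamp -> value."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        ts = int(time.time() * 1000) if timestamp is None else timestamp
        # Rows and columns are created on demand, so the table stays sparse.
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key):
        """Return the newest version of every column in the row."""
        cells = self.rows.get(row_key, {})
        return {col: versions[max(versions)] for col, versions in cells.items()}

    def scan(self):
        """Return all rows in row-key order."""
        return [(key, self.get(key)) for key in sorted(self.rows)]


test = ToyTable("cf")                          # create 'test', 'cf'
test.put("row1", "cf:a", "val1", timestamp=1)  # put 'test', 'row1', 'cf:a', 'val1'
test.put("row1", "cf:a", "val2", timestamp=2)  # a newer version of the same cell
test.put("row2", "cf:b", "val3", timestamp=3)
print(test.get("row1"))
print(test.scan())
```

Only the column family has to be declared up front; column qualifiers such as a and b appear when data is written, which is what makes the model suitable for sparse data.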


Hands-On: Running HBase commands on Amazon EMR


Create a HBase shell


Find Job Flow ID


Starting HBase Shell

$ ./elastic-mapreduce --list -j j-3MKWRS0K8IH7K

j-3MKWRS0K8IH7K WAITING ec2-54-213-117-162.us-west-2.compute.amazonaws.com HBase Interactive

COMPLETED Start HBase

$ ./elastic-mapreduce --ssh j-3MKWRS0K8IH7K

hadoop@ip-172-31-33-161:~$ hbase shell


Recommendation to Further Study

Hadoop Beginner's Guide

Hadoop: The Definitive Guide, 3rd Edition


Hadoop in Practice

Hadoop MapReduce Cookbook


Amazon Elastic MapReduce Developer Guide


Thank you