SDEC2011 Essentials of Pig

Essentials of Pig: Mastering Hadoop MapReduce for Data Analysis

Shashank Tiwari

blog: shanky.org | twitter: @tshanky | st@treasuryofideas.com

Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.

Session Agenda

• What is Pig and why should you use it?

• Installing & Setting up Pig

• Pig’s Components

• Using Pig with Hadoop MapReduce

• Summary & Conclusion

What is Pig?

• Higher-level abstraction for Hadoop MapReduce

• An infrastructure for data analysis using a scripting language named Pig Latin

Why should you use Pig?

• Hadoop MapReduce:

• Requires you to be a programmer

• Forces you to design all your algorithms in terms of the map and reduce primitives

Installing & Setting Up Pig -- Required Software

• Required Software:

• Java 1.6.x

• Hadoop 0.20.x

• Ant 1.7+ (for builds)

• JUnit 4.5 (for tests)

• Cygwin (on Windows)

Download

• Source: http://pig.apache.org/

• Version:

• 0.8.1 -- current stable

Install & Configure

• Extract: tar zxvf pig-0.8.1.tar.gz

• Move & Create Symbolic Link:

• ln -s pig-0.8.1 pig

• Edit: bin/pig

• export PIG_CLASSPATH=$HADOOP_HOME/conf

Verify Installation

• Verify:

(remember to start Hadoop first.)

• bin/pig -help (command options)

• bin/pig (run the grunt shell)

Running Pig

• Run Mode

• Local Mode -- single machine

• MapReduce Mode -- needs a Hadoop cluster (with HDFS)

• Run via:

• grunt shell

• pig scripts

• embedded programs

Pig IDE

• PigPen, an Eclipse-based IDE

• graphical data flow definition

• can show example data flow

Pig Components

• Pig Latin

• Pig Engine

• execution engine on top of Hadoop

• includes default optimal configurations

A client for your cluster

• Pig does not run on a Hadoop cluster

• It connects to one

Pig Latin

• Data flow language (Not declarative like SQL)

• Increases productivity (fewer lines do more)

• Includes standard operations like join, filter, group, sort

• User code and existing binaries can be included

• Supports nested data types

• Does not require metadata

Pig Latin Example

• Will leverage the tutorial that comes with the distribution

• Check the tutorial folder in the distribution

Start Grunt Shell

• cd $PIG_HOME

• bin/pig -x local

Aggregate Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);

• Alternate delimiters can be used (for example, USING PigStorage(',')), and de-serializers like PigJsonLoader can be leveraged

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log);

• grunt> STORE counted INTO 'output';

Group Data

• grunt> grouped = GROUP log BY user;

• In Pig, the GROUP operation generates (key, collection) pairs, where each collection is a bag of tuples.

• Every tuple in a collection carries the same key value as the key of its (key, collection) pair
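
The (key, collection) shape can be sketched in plain Python. This is only an analogy, not how Pig executes GROUP; the helper name `group_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_by(records, key_index):
    """Return {key: [tuple, ...]}, mimicking the (key, collection) pairs of GROUP."""
    groups = defaultdict(list)
    for record in records:
        # Each tuple is kept whole, so it still contains its own key field.
        groups[record[key_index]].append(record)
    return dict(groups)

# Hypothetical rows shaped like (user, timestamp, query).
log = [
    ("u1", "970916105432", "yahoo chat"),
    ("u2", "970916001949", "pig latin"),
    ("u1", "970916105501", "hadoop"),
]

grouped = group_by(log, 0)
for key, bag in grouped.items():
    print(key, bag)
```

Because each tuple is stored unchanged, the key appears both as the group key and inside every tuple of that group's collection.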

Filter Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;

• grunt> filtered = FILTER counted BY cnt > 75;

• grunt> STORE filtered INTO 'output1';

Order Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;

• grunt> filtered = FILTER counted BY cnt > 50;

• grunt> sorted = ORDER filtered BY cnt;

• grunt> STORE sorted INTO 'output2';
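
For readers who want to check the logic of the count, filter, and order steps without a Pig installation, here is a minimal Python sketch of the same flow. The function name `count_filter_sort`, the sample rows, and the threshold of 1 are assumptions for illustration; the slides use thresholds of 75 and 50 on the tutorial log.

```python
from collections import Counter

def count_filter_sort(records, threshold):
    """Count rows per user, keep users above threshold, sort ascending by count."""
    # GROUP ... BY user, then COUNT
    counts = Counter(user for user, _time, _query in records)
    # FILTER ... BY cnt > threshold
    kept = [(user, cnt) for user, cnt in counts.items() if cnt > threshold]
    # ORDER ... BY cnt (ascending)
    return sorted(kept, key=lambda pair: pair[1])

# Hypothetical rows shaped like (user, time, query).
log = [
    ("u1", "t1", "a"), ("u2", "t2", "b"), ("u1", "t3", "c"),
    ("u3", "t4", "d"), ("u1", "t5", "e"), ("u3", "t6", "f"),
]

result = count_filter_sort(log, threshold=1)
print(result)
```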

Join Data Example

• Words appearing in Adventures of Huckleberry Finn by Mark Twain

• http://www.gutenberg.org/ebooks/76

• Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle

• http://www.gutenberg.org/ebooks/1661

Loading & Counting Huckleberry Finn Data

• grunt> A = load 'pg76.txt';

• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

• grunt> C = filter B by word matches '\\w+';

• grunt> D = group C by word;

• grunt> E = foreach D generate COUNT(C), group;

• grunt> store E into 'huckleberry_finn_freq';

Loading & Counting Sherlock Holmes Data

• grunt> A = load 'pg1661.txt';

• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

• grunt> C = filter B by word matches '\\w+';

• grunt> D = group C by word;

• grunt> E = foreach D generate COUNT(C), group;

• grunt> store E into 'sherlock_holmes_freq';
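
The two word-count pipelines above follow the same tokenize, filter, group, count pattern. A rough Python equivalent is sketched below. The delimiter set used to imitate TOKENIZE is an assumption and may not match Pig exactly, and Python's `\w` is Unicode-aware, so results on real text can differ slightly. The function name and sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed approximation of TOKENIZE delimiters: whitespace, quote, comma, parens, star.
DELIMITERS = re.compile(r'[\s",()*]+')
WORD = re.compile(r"\w+")

def word_frequencies(lines):
    """Count tokens that consist entirely of word characters."""
    counts = Counter()
    for line in lines:
        for token in DELIMITERS.split(line):
            # Like `word matches '\\w+'`: the whole token must match.
            if WORD.fullmatch(token):
                counts[token] += 1
    return counts

sample = ["the river was wide", "the raft, the river", "it don't matter"]
freq = word_frequencies(sample)
print(freq.most_common(2))
```

Tokens containing an apostrophe, such as "don't", are dropped by the filter because the whole token has to match `\w+`.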

Join Data

• grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);

• grunt> inboth = JOIN hf BY word, sh BY word;

• grunt> STORE inboth INTO 'output3';
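
The join above keeps only words present in both inputs and emits the fields of both matching tuples side by side. A small Python sketch of that inner-join behavior follows; `inner_join` and the sample frequencies are made up for illustration and say nothing about how Pig plans the join.

```python
def inner_join(left, right):
    """Join two lists of (freq, word) tuples on word.

    Each output row is the left tuple followed by the right tuple:
    (left_freq, word, right_freq, word).
    """
    by_word = {}
    for freq, word in right:
        by_word.setdefault(word, []).append((freq, word))
    return [l + r for l in left for r in by_word.get(l[1], [])]

# Hypothetical word frequencies.
hf = [(12, "river"), (3, "raft"), (40, "the")]
sh = [(55, "the"), (2, "river"), (7, "violin")]

inboth = inner_join(hf, sh)
print(inboth)
```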

Set Difference (A - B, in A but not in B; here A = sh, B = hf)

• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• sh = LOAD 'sherlock_holmes_freq' AS (freq, word);

• grouped = COGROUP hf BY word, sh BY word;

• not_in_hf = FILTER grouped BY COUNT(hf) == 0;

• out = FOREACH not_in_hf GENERATE FLATTEN(sh);

• STORE out INTO 'output4';

Cogroup Data

• Extends the idea of grouping to multiple collections

• Instead of a (key, collection) pair, it emits a key and one collection of tuples per input

• With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2.
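
The (key, collection1, collection2) shape, and the set-difference recipe built on it earlier, can be sketched in Python as follows. This is an illustrative analogy only: `cogroup` is a hypothetical helper, the sample data is invented, and keys are sorted here just to make the output deterministic.

```python
def cogroup(first, second, key_index=1):
    """Return (key, tuples_from_first, tuples_from_second) for every key in either input."""
    keys = {t[key_index] for t in first} | {t[key_index] for t in second}
    return [
        (
            k,
            [t for t in first if t[key_index] == k],
            [t for t in second if t[key_index] == k],
        )
        for k in sorted(keys)
    ]

# Hypothetical (freq, word) tuples.
hf = [(12, "river"), (40, "the")]
sh = [(55, "the"), (7, "violin")]

grouped = cogroup(hf, sh)

# Set difference sh - hf: keep groups whose hf collection is empty, then flatten sh.
only_in_sh = [t for _key, hf_bag, sh_bag in grouped if not hf_bag for t in sh_bag]
print(grouped)
print(only_in_sh)
```

A key that appears in only one input still produces a row; the collection for the other input is simply empty, which is what the set-difference filter relies on.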

Data types Supported

• int, long, float, double, chararray, bytearray

• map, tuple (ordered), bag (unordered)

Data type Declaration

• hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);

• explicit data type declaration

• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• weighted = FOREACH hf GENERATE freq * 100;

• type inference, freq cast to int

Custom Extensions

• User defined functions can be called from Pig scripts

• Nested operations can be carried out

• FOREACH grouped { sorted = ORDER hf BY counted;

• GENERATE group, CustomFunction(sorted); }

• Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
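
The SPLIT conditions above are evaluated independently for each output, so a row that satisfies neither condition lands in neither output. A short Python sketch of that behavior, using a hypothetical helper and sample values:

```python
def split_by_sign(values):
    """Route each value to Negative, Positive, or nowhere (when it is zero)."""
    negative = [v for v in values if v < 0]
    positive = [v for v in values if v > 0]
    return negative, positive

negative, positive = split_by_sign([-3, 0, 5, -1, 2])
print(negative, positive)
```

With these conditions a value of 0 is dropped from both outputs.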

Questions?

• blog: shanky.org | twitter: @tshanky | st@treasuryofideas.com