Pig Workshop Sudar Muthu http://sudarmuthu.com http://twitter.com/sudarmuthu https://github.com/sudar


Slides that I used for my Pig Workshop


Page 1: Pig workshop

Pig Workshop
Sudar Muthu

http://sudarmuthu.com
http://twitter.com/sudarmuthu
https://github.com/sudar

Page 2: Pig workshop

Research Engineer by profession
I mine useful information from data
You might recognize me from other HasGeek events
Blog at http://sudarmuthu.com
Builds robots as a hobby ;)

Who am I?

Page 3: Pig workshop

HasGeek

Special Thanks

Page 4: Pig workshop

What I will not cover?

Page 5: Pig workshop

What is BigData, or why it is needed?
What is MapReduce?
What is Hadoop?
Internal architecture of Pig

http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig

What I will not cover?

Page 6: Pig workshop

What we will see today?

Page 7: Pig workshop

What is Pig
How to use it
Loading and storing data
Pig Latin
SQL vs Pig
Writing UDFs
Debugging Pig Scripts
Optimizing Pig Scripts
When to use Pig

What we will see today?

Page 8: Pig workshop

So, all of you have Pig installed right? ;)

Page 9: Pig workshop

“Platform for analyzing large sets of data”

What is Pig?

Page 10: Pig workshop

Pig Shell (Grunt)
Pig Language (Latin)
Libraries (Piggy Bank)
User Defined Functions (UDF)

Components of Pig

Page 11: Pig workshop

It is a data flow language
Provides standard data processing operations
Insulates Hadoop complexity
Abstracts Map Reduce
Increases programmer productivity

… but there are cases where Pig is not suitable.

Why Pig?

Page 12: Pig workshop

Pig Modes

Page 13: Pig workshop

For this workshop, we will be using Pig only in local mode

Page 14: Pig workshop

Getting to know your Pig shell

Page 15: Pig workshop

Similar to Python’s shell

pig -x local

Page 16: Pig workshop

Inline in shell
From a file
Streaming through other executable
Embed script in other languages

Different ways of executing Pig Scripts

Page 17: Pig workshop

Pigs eat anything

Loading and Storing data

Page 18: Pig workshop

file = LOAD 'data/dropbox-policy.txt' AS (line);

data = LOAD 'data/tweets.csv' USING PigStorage(',');

data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (field1, field2, field3);

Loading Data into Pig

Page 19: Pig workshop

PigStorage – for most cases
TextLoader – to load text files
JsonLoader – to load JSON files
Custom loaders – You can write your own custom loaders as well

Loading Data into Pig

Page 20: Pig workshop

DUMP input;

Very useful for debugging, but don’t use it on huge datasets

Viewing Data

Page 21: Pig workshop

STORE data INTO 'output_location';

STORE data INTO 'output_location' USING PigStorage();

STORE data INTO 'output_location' USING PigStorage(',');

STORE data INTO 'output_location' USING BinStorage();

Storing Data from Pig

Page 22: Pig workshop

Similar to `LOAD`, a lot of options are available
Can store locally or in HDFS
You can write your own custom Storage as well

Storing Data

Page 23: Pig workshop

data = LOAD 'data/data-bag.txt' USING PigStorage(',');

STORE data INTO 'data/output/load-store' USING PigStorage('|');

https://github.com/sudar/pig-samples/load-store.pig

Load and Store example

Page 24: Pig workshop

Pig Latin

Page 25: Pig workshop

Scalar Types
Complex Types

Data Types

Page 26: Pig workshop

int, long – (32, 64 bit) integer
float, double – (32, 64 bit) floating point
boolean (true/false)
chararray (String in UTF-8)
bytearray (blob) (DataByteArray in Java)

If you don’t specify a type, bytearray is used by default

Scalar Types

Page 27: Pig workshop

tuple – ordered set of fields (data)
bag – collection of tuples
map – set of key value pairs

Complex Types

Page 28: Pig workshop

Row with one or more fields
Fields can be of any data type
Ordering is important
Enclosed inside parentheses ()

Eg:
(Sudar, Muthu, Haris, Dinesh)
(Sudar, 176, 80.2F)

Tuple

Page 29: Pig workshop

Set of tuples
SQL equivalent is Table
Each tuple can have different set of fields
Can have duplicates
Inner bag uses curly braces {}
Outer bag doesn’t use anything

Bag

Page 30: Pig workshop

Outer bag

(1,2,3)
(1,2,4)
(2,3,4)
(3,4,5)
(4,5,6)

https://github.com/sudar/pig-samples/data-bag.pig

Bag - Example

Page 31: Pig workshop

Inner bag

(1,{(1,2,3),(1,2,4)})
(2,{(2,3,4)})
(3,{(3,4,5)})
(4,{(4,5,6)})

https://github.com/sudar/pig-samples/data-bag.pig

Bag - Example

Page 32: Pig workshop

Set of key value pairs
Similar to HashMap in Java
Key must be unique
Key must be of chararray data type
Values can be any type
Key/value is separated by #
Map is enclosed by []

Map

Page 33: Pig workshop

[name#sudar, height#176, weight#80.5F]

[name#(sudar, muthu), height#176, weight#80.5F]

[name#(sudar, muthu), languages#(Java, Pig, Python)]

Map - Example

Page 34: Pig workshop

Similar to SQL
Denotes that value of data element is unknown
Any data type can be null

Null

Page 35: Pig workshop

We can specify a schema (collection of datatypes) to `LOAD` statements

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);

data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

Schemas in Load statement

Page 36: Pig workshop

Fields can be looked up by

Position
Name
Map Lookup

Expressions

Page 37: Pig workshop

data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

by_pos = FOREACH data GENERATE $0;
DUMP by_pos;

by_field = FOREACH data GENERATE f2;
DUMP by_field;

by_map = FOREACH data GENERATE f3#'name';
DUMP by_map;

https://github.com/sudar/pig-samples/lookup.pig

Expressions - Example

Page 38: Pig workshop

Operators

Page 39: Pig workshop

All usual arithmetic operators are supported

Addition (+) Subtraction (-) Multiplication (*) Division (/) Modulo (%)

Arithmetic Operators

Page 40: Pig workshop

All usual boolean operators are supported

AND OR NOT

Boolean Operators

Page 41: Pig workshop

All usual comparison operators are supported

== != < > <= >=

Comparison Operators

Page 42: Pig workshop

FOREACH FLATTEN GROUP FILTER COUNT ORDER BY DISTINCT LIMIT JOIN

Relational Operators

Page 43: Pig workshop

Generates data transformations based on columns of data

x = FOREACH data GENERATE *;

x = FOREACH data GENERATE $0, $1;

x = FOREACH data GENERATE $0 AS first, $1 AS second;

FOREACH

Page 44: Pig workshop

Un-nests tuples and bags. Most of the time results in cross product

(a, (b, c)) => (a,b,c)

({(a,b),(d,e)}) => (a,b) and (d,e)

(a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)

FLATTEN
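
The three FLATTEN cases above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not how Pig implements it: tuples are modelled as Python tuples, bags as lists of tuples, and the `flatten` helper is a made-up name that flattens every field of one record.

```python
def flatten(record):
    """Model Pig's FLATTEN applied to every field of one record.

    Assumption for this sketch: a tuple is a Python tuple and a bag is a
    Python list of tuples.
    """
    rows = [()]  # partial output rows built so far
    for field in record:
        if isinstance(field, list):
            # Bag: every partial row is paired with every inner tuple,
            # which is where the cross product comes from.
            rows = [row + inner for row in rows for inner in field]
        elif isinstance(field, tuple):
            # Tuple: its fields are spliced into the row in place.
            rows = [row + field for row in rows]
        else:
            # Scalar: copied through unchanged.
            rows = [row + (field,) for row in rows]
    return rows


print(flatten(("a", ("b", "c"))))                # [('a', 'b', 'c')]
print(flatten(([("a", "b"), ("d", "e")],)))      # [('a', 'b'), ('d', 'e')]
print(flatten(("a", [("b", "c"), ("d", "e")])))  # [('a', 'b', 'c'), ('a', 'd', 'e')]
```

With two bags in the same record, the output has one row for every pairing of their inner tuples, which is the cross product mentioned above.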

Page 45: Pig workshop

Groups data in one or more relations
Groups tuples that have the same group key
Similar to SQL group by operator

outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP outerbag;

innerbag = GROUP outerbag BY f1;
DUMP innerbag;

https://github.com/sudar/pig-samples/group-by.pig

GROUP
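
To see what GROUP does to the shape of the data, here is a rough Python model of it, using the five tuples from the outer bag example. It is a sketch only: `group_by` is a hypothetical helper, bags are modelled as lists, and unlike this model Pig does not promise any particular order of groups.

```python
def group_by(relation, key_index):
    """Model GROUP ... BY: collect tuples that share a key into one bag."""
    groups = {}
    for row in relation:
        # The whole tuple goes into the bag, including the key field.
        groups.setdefault(row[key_index], []).append(row)
    # Each output tuple is (group key, bag of matching tuples).
    return [(key, bag) for key, bag in groups.items()]


outerbag = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
innerbag = group_by(outerbag, 0)  # same idea as GROUP outerbag BY f1
for row in innerbag:
    print(row)
```

The first output tuple is `(1, [(1, 2, 3), (1, 2, 4)])`, which matches the inner bag `(1,{(1,2,3),(1,2,4)})` shown earlier.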

Page 46: Pig workshop

Selects tuples from a relation based on some condition

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

filtered = FILTER data BY f1 == 1;
DUMP filtered;

https://github.com/sudar/pig-samples/filter-by.pig

FILTER

Page 47: Pig workshop

Counts the number of tuples in a bag, such as each group produced by GROUP

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f2;

counted = FOREACH grouped GENERATE group, COUNT(data);
DUMP counted;

https://github.com/sudar/pig-samples/count.pig

COUNT

Page 48: Pig workshop

Sort a relation based on one or more fields. Similar to SQL order by

data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

ordera = ORDER data BY f1 ASC;
DUMP ordera;

orderd = ORDER data BY f1 DESC;
DUMP orderd;

https://github.com/sudar/pig-samples/order-by.pig

ORDER BY

Page 49: Pig workshop

Removes duplicates from a relation

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

unique = DISTINCT data;
DUMP unique;

https://github.com/sudar/pig-samples/distinct.pig

DISTINCT

Page 50: Pig workshop

Limits the number of tuples in the output.

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

limited = LIMIT data 3;
DUMP limited;

https://github.com/sudar/pig-samples/limit.pig

LIMIT

Page 51: Pig workshop

Joins relations based on a field. Both outer and inner joins are supported

a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP a;

b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;

joined = JOIN a BY f1, b BY t1;
DUMP joined;

https://github.com/sudar/pig-samples/join.pig

JOIN
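
The shape of an inner JOIN result can be sketched in Python as below. This is a model of the semantics only: `inner_join` is a hypothetical helper, and the rows in `b` are made up for illustration because the contents of `data/simple-tuples.txt` are not shown in these slides.

```python
def inner_join(left, left_key, right, right_key):
    """Model an inner JOIN: concatenate every pair of tuples whose keys match."""
    # Index the right-hand relation by its join key.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left:
        key = row[left_key]
        if key is None:
            continue  # a null key never matches anything in an inner join
        for match in index.get(key, []):
            joined.append(row + match)  # fields of a, then fields of b
    return joined


a = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
b = [(1, 10), (3, 30), (5, 50)]  # hypothetical contents, for illustration only
joined = inner_join(a, 0, b, 0)  # same idea as JOIN a BY f1, b BY t1
for row in joined:
    print(row)
```

Rows of `a` with no partner in `b` (here the ones with key 2 and 4) are dropped, and the nulls check is also why dropping nulls before a join costs nothing in the result.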

Page 52: Pig workshop

From Table – Load file(s)
Select – FOREACH GENERATE
Where – FILTER BY
Group By – GROUP BY + FOREACH GENERATE
Having – FILTER BY
Order By – ORDER BY
Distinct – DISTINCT

SQL vs Pig

Page 53: Pig workshop

Count the number of words in a text file

Let’s see a complete example

https://github.com/sudar/pig-samples/count-words.pig
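
For comparison, this is roughly the same computation in plain Python. It assumes the usual word count pipeline (split each line into words, group identical words, count each group); the linked script may differ in detail, and `count_words` is a name made up for this sketch.

```python
from collections import Counter


def count_words(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        # Splitting a line into words plays the role of tokenizing and
        # flattening; Counter does the grouping and counting in one step.
        counts.update(line.split())
    return sorted(counts.items())


print(count_words(["pigs eat anything", "pigs fly"]))
# [('anything', 1), ('eat', 1), ('fly', 1), ('pigs', 2)]
```

The Python version keeps every count in memory on one machine, whereas the Pig script can run the same steps as Hadoop jobs once the file no longer fits on a single box.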

Page 54: Pig workshop

Extending Pig - UDF

Page 55: Pig workshop

Do operations on more than one field
Do more than grouping and filtering
Programmer is comfortable
Want to reuse existing logic

Traditionally, UDFs could be written only in Java. Now other languages like Python are also supported

Why UDF?

Page 56: Pig workshop

Eval Functions
Filter functions
Load functions
Store functions

Different types of UDFs

Page 57: Pig workshop

Can be used in FOREACH statement
Most common type of UDF
Can return simple types or Tuples

b = FOREACH a generate udf.Function($0);

b = FOREACH a generate udf.Function($0, $1);

Eval Functions

Page 58: Pig workshop

Extend the EvalFunc<T> abstract class
The generic <T> should be the return type
Input comes as a Tuple
Should check for empty and null input
Implement exec(), which should return the value
Override getArgToFuncMapping() to tell Pig about argument mapping
Override outputSchema() to tell Pig about the output schema

Eval Functions

Page 59: Pig workshop

Create a jar file which contains your UDF classes
Register the jar at the top of Pig script
Register other jars if needed
Define the UDF function
Use your UDF function

Using Java UDF in Pig Scripts

Page 60: Pig workshop

Let’s see an example which returns a string

https://github.com/sudar/pig-samples/strip-quote.pig

Page 61: Pig workshop

Let’s see an example which returns a Tuple

https://github.com/sudar/pig-samples/get-twitter-names.pig

Page 62: Pig workshop

Can be used in the Filter statements
Returns a boolean value

Eg: vim_tweets = FILTER data BY FromVim(StripQuote($6));

Filter Functions

Page 63: Pig workshop

Extends FilterFunc, which is an EvalFunc<Boolean>
Should return a boolean
Input is the same as for EvalFunc<T>
Should check for empty and null input
Override getArgToFuncMapping() to tell Pig about argument mapping

Filter Functions

Page 64: Pig workshop

Let’s see an example which returns a Boolean

https://github.com/sudar/pig-samples/from-vim.pig

Page 65: Pig workshop

If the error affects only particular row then return null.

If the error affects other rows, but can recover, then throw an IOException

If the error affects other rows, and can’t recover, then also throw an IOException. Pig and Hadoop will quit, if there are many IOExceptions.

Error Handling in UDF
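
The first rule (return null when only one row is bad) might look like this in a Python UDF. Treat it as a sketch under assumptions: `to_int` is a made-up function, the schema string is illustrative, and the fallback `outputSchema` below exists only so the file also runs under plain Python; when the script is registered with Pig's Jython engine, Pig supplies the real decorator.

```python
try:
    outputSchema  # provided by Pig when the script is registered with Jython
except NameError:
    def outputSchema(schema):
        """Stand-in decorator (assumption) that just records the schema."""
        def decorate(func):
            func.output_schema = schema
            return func
        return decorate


@outputSchema("value:int")
def to_int(text):
    """Convert a quoted or unquoted chararray to an int, or null if it is bad."""
    if text is None:
        return None  # null in, null out
    try:
        return int(text.strip().strip('"'))
    except ValueError:
        # Only this row is affected, so return null instead of raising.
        return None


print(to_int('"176"'))  # 176
print(to_int("tall"))   # None
```

A `None` returned from Python shows up as a null in Pig, so one malformed value does not fail the whole job.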

Page 66: Pig workshop

Can we try to write some more UDFs?

Page 67: Pig workshop

Writing UDF in other languages

Page 68: Pig workshop

Streaming

Page 69: Pig workshop

Entire data set is passed through an external task
The external task can be in any language
Even a shell script works
Uses the `STREAM` operator

Streaming

Page 70: Pig workshop

data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `cut -f6,8`;

DUMP filtered;

https://github.com/sudar/pig-samples/stream-shell-script.pig

Stream through shell script

Page 71: Pig workshop

data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `strip.py`;

DUMP filtered;

https://github.com/sudar/pig-samples/stream-python.pig

Stream through Python
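
The contents of `strip.py` are not shown in the slides, so the following is only a guess at what such a script could look like. It assumes Pig's default streaming format (one tuple per line, fields separated by tabs) and strips surrounding double quotes from each field; `strip_quotes` and `stream` are names invented for this sketch.

```python
import io
import sys


def strip_quotes(line):
    """Remove surrounding double quotes from every tab-separated field."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field.strip('"') for field in fields)


def stream(source, sink):
    """Read tuples line by line from source and write cleaned lines to sink."""
    for line in source:
        sink.write(strip_quotes(line) + "\n")


# Simulate what Pig would send: one tab-separated tuple per line.
# In the real script this would be: stream(sys.stdin, sys.stdout)
sample_input = io.StringIO('"sudar"\t"hello pig"\n')
sample_output = io.StringIO()
stream(sample_input, sample_output)
print(sample_output.getvalue(), end="")
```

Because the script only talks to standard input and output, it can be tested from the command line with `cat` and a pipe before it is ever wired into a Pig script.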

Page 72: Pig workshop

DUMP is your friend, but use it with LIMIT
DESCRIBE – prints the schema of a relation
ILLUSTRATE – shows, step by step, how sample data flows through the script along with its schema
In UDFs, we can use the warn() function. It supports up to 15 different debug levels
Use Penny - https://cwiki.apache.org/PIG/pennytoollibrary.html

Debugging Pig Scripts

Page 73: Pig workshop

Project early and often
Filter early and often
Drop nulls before a join
Prefer DISTINCT over GROUP BY
Use the right data structure

Optimizing Pig Scripts

Page 74: Pig workshop

-p key=value – substitutes a single key, value
-m file.ini – substitutes using an ini file
%default – provides default values

http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts

Using Param substitution

Page 75: Pig workshop

Anything data related

Problems that can be solved using Pig

Page 76: Pig workshop

A lot of custom logic needs to be implemented
Need to do a lot of cross lookups
Data is mostly binary (processing image files)
Real-time processing of data is needed

When not to use Pig?

Page 77: Pig workshop

PiggyBank - https://cwiki.apache.org/PIG/piggybank.html

DataFu – LinkedIn Pig Library - https://github.com/linkedin/datafu

Elephant Bird – Twitter Pig Library - https://github.com/kevinweil/elephant-bird

External Libraries

Page 78: Pig workshop

Pig homepage - http://pig.apache.org/
My blog about Pig - http://sudarmuthu.com/blog/category/hadoop-pig
Sample code – https://github.com/sudar/pig-samples
Slides – http://slideshare.net/sudar

Useful Links

Page 79: Pig workshop

Thank you