Apache Pig: Making data transformation easy

Víctor Sánchez Anguix
Universitat Politècnica de València

MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image

Course 2014/2015

Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce Problem Solving

Complex problem


➢ We need to solve a complex problem

➢ It requires atomic operations more complex than a single map/reduce (M/R) step

➢ Java is not a data-oriented language → low productivity

➢ Any solutions?


Apache Pig to the rescue!

Join in Apache Hadoop

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper for the delivery file: emits (cellNumber, "DR~" + deliveryCode).
public class DeliveryFileMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String splitarray[] = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}

// Reducer: joins the tagged values received for each cell number.
// "DR" values carry a delivery code; "CD" values carry the customer name
// (presumably emitted by a second mapper that is not shown on the slide).
public class SmsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        loadDeliveryStatusCodes();
    }

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String valueSplitted[] = currValue.split("~");
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }

    // loadDeliveryStatusCodes() is not shown on the slide.
}

** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html

Join in Apache Pig

C = JOIN A BY keyA, B BY keyB;

Apache Pig overview

➢ Framework layer over HDFS and Hadoop
➢ Developed at Yahoo in 2006
➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
➢ Last major release: 0.14.0 (November 2014)
➢ http://pig.apache.org/

Apache Hadoop vs. Apache Pig

Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ M/R inner flexibility
➢ Efficiency

Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: Data scripting language
➢ UDF with Java (and others)
➢ Overhead of transforming scripts to M/R jobs


Pig Programming Model: Data

➢ Pig operations operate on relations

➢ A relation is a bag

➢ A bag is a collection of tuples

➢ A tuple is an ordered set of fields

➢ A field is any type of data

Sounds complicated… but it’s not!

➢ Basic data types:
○ Boolean: True, False
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: 'Hello', 'I am a string'
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won’t use them very often

➢ Tuple: A catch-all data type, e.g. (1,John,Doe,M,18)

➢ Bag: A collection of tuples, e.g. {(1,John,Doe,M,18),(2,Mary,Doe,F,20)}

➢ And relations? Just the outermost (distributed) bags
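As a minimal sketch of how fields, tuples and bags nest, a schema can declare a tuple and a bag as field types. The file name `enrolments.txt`, the alias `Enrolments` and the field names are assumptions made for illustration; they are not part of the course material.

```pig
-- Hypothetical input file: each line is one tuple of the relation.
-- student_id is a basic field, full_name holds a tuple, grades holds a bag of tuples.
Enrolments = LOAD 'enrolments.txt' AS (
    student_id: long,
    full_name: tuple(name: chararray, surname: chararray),
    grades: bag{t: tuple(course: chararray, mark: double)}
);

-- Prints the nested schema of the relation.
DESCRIBE Enrolments;
```

The relation `Enrolments` is itself the outermost bag; every line of the file becomes one tuple inside it.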

Loading data?

➢ Loading data? No, first let’s meet our friend Grunt
➢ Interactive Pig shell → Nice for debugging/experimenting
➢ Start it with pig -x local (local mode) or pig -x mapred (MapReduce mode)

Loading data?

➢ Data source: Local or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded in a distributed relation

Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);

○ Students: relation name
○ 'student_path': path on the local disk or HDFS
○ PigStorage: connector
○ '\t': field separator
○ AS (…): tuple schema

Loading data?

Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');

○ Grades: relation name
○ 'grade_path': path on the local disk or HDFS
○ PigStorage: connector
○ ',': field separator
○ '-schema': load the schema from .pig_schema

Checking relations’ content

➢ DUMP instruction:
○ Prints the content of a relation to standard output

DUMP Students;
(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
grunt>

○ Students: relation name

Checking relations’ content

➢ DESCRIBE instruction:
○ Prints the schema of the relation to standard output

DESCRIBE Students;
Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
grunt>

○ Students: relation name

Checking relations’ content

➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and a tuple example to standard output

ILLUSTRATE Students;
------------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
------------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
------------------------------------------------------------------------------------------------
grunt>

○ Students: relation name

Operating on relations

➢ FOREACH instruction:
○ Generate new relations by projecting data of a relation

StudentsProj = FOREACH Students GENERATE student_id, name, age;

○ StudentsProj: relation name
○ Students: base relation
○ student_id, name, age: projected data

➢ We can generate new data too!!

StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;

Operating on relations

➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that nothing happens!
○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE…
○ Any ideas on this issue?

Operating on relations

➢ Pig employs lazy evaluation
➢ Computation is only triggered by:
○ DUMP, STORE, ILLUSTRATE
➢ LOAD only declares the input; the data is read when a job actually runs
➢ Pig keeps a DAG of the MR jobs needed to compute relations (optimized!)
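The behaviour can be observed in Grunt with a short sequence like the following sketch. It reuses the `Students` relation and 'student_path' from the loading slides; the aliases `Adults` and `Names` are illustrative.

```pig
-- None of these three statements launches a job: Pig only extends its plan.
Students = LOAD 'student_path' USING PigStorage('\t') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
Adults = FILTER Students BY age >= 21;
Names = FOREACH Adults GENERATE name, surname;

-- EXPLAIN prints the plans Pig would run for Names, still without executing them.
EXPLAIN Names;

-- DUMP (like STORE) triggers the execution of everything Names depends on.
DUMP Names;
```

Because nothing runs before DUMP, Pig can optimize the three statements together rather than one at a time.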

Exercise: Who is under 25?

➢ Extend the Students relation with a field that tells whether each student is under 25 years old

(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...
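One possible solution, sketched with the bincond operator (`condition ? a : b`) and assuming `Students` was loaded with the schema shown earlier. The alias `StudentsFlagged` and the field name `under_25` are illustrative choices, not part of the exercise statement.

```pig
-- Keep every original field (*) and append a boolean computed for each tuple.
StudentsFlagged = FOREACH Students GENERATE *, (age < 25 ? true : false) AS under_25;

DUMP StudentsFlagged;
```

Other formulations are possible; this one only relies on FOREACH, which is the operator introduced so far.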

Operating on relations

➢ FILTER instruction:
○ Generate a new relation by filtering data on a relation

StudentsFilt = FILTER Students BY age > 24 AND age < 34;
DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)

○ StudentsFilt: relation name
○ Students: base relation
○ age > 24 AND age < 34: condition to fulfill

Operating on relations

➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions

SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)

○ Students: base relation
○ StudentsMale: new relation; gender == 'M' is the condition to fulfill by the new relation
○ StudentsFemale: new relation; OTHERWISE means the rest

Operating on relations

➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions

SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
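SPLIT conditions need not be mutually exclusive: in the statement above, a student aged 18 satisfies both age<25 and age<30, so the tuple lands in both StudentsUnder25 and StudentsUnder30. If disjoint partitions are wanted, the conditions have to say so explicitly. A minimal sketch, where the alias `Students25To29` is an illustrative name:

```pig
-- Each student ends up in exactly one of the three relations.
SPLIT Students INTO
    StudentsUnder25 IF age < 25,
    Students25To29 IF (age >= 25 AND age < 30),
    OtherStudents OTHERWISE;
```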

Operating on relations

➢ GROUP instruction:
○ Creates tuples with the key and a bag of the tuples that share that key value

StudentsGr = GROUP Students BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}

○ StudentsGr: new relation, with a new schema!
○ Students: base relation
○ gender: use these fields’ values to make groups

Operating on relations

➢ GROUP instruction:
○ We can use multiple relations. Creates one bag per relation

StudentsGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}

○ StudentsGr: new relation, with a new schema!
○ StudentsUnder25, OtherStudents: base relations
○ gender: use these fields’ values to make groups

Operating on relations

➢ Nested FOREACH:
○ Operate on data in bags inside a relation and then project

StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}
DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})

○ StudentsGr: base relation
○ StudentsNested: new relation
○ Students: bag inside the base relation
○ Information: new bag
○ GENERATE: finally project

Operating on relations

➢ (inner) JOIN instruction:
○ Our classic database operator for relations!

StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

○ StudentsGrades: new relation, with a new schema!
○ Students: base relation 1
○ Grades: base relation 2
○ student_id: use these fields’ values to join

Operating on relations

➢ (left) JOIN instruction:
○ Our classic database operator for relations!

StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

○ StudentsGrades: new relation, with a new schema!
○ Students: left relation
○ Grades: right relation
○ LEFT: do not forget this one!

Operating on relations

➢ CROSS instruction:
○ Cartesian product of two or more relations

StudentsCr = CROSS Students, Grades;
DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…
DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

○ StudentsCr: new relation, with a new schema!
○ Students: relation 1
○ Grades: relation 2

Operating on relations

➢ UNION instruction:
○ Merges the tuples of multiple relations into a single relation

StudentsUnion = UNION Students, Grades;
DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…
DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.

○ StudentsUnion: new relation
○ Students: relation 1
○ Grades: relation 2

➢ Union does not preserve schemas! (Students and Grades have different schemas, so the result has none)
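When the inputs do share field names, the ONSCHEMA variant of UNION merges them by name and yields a relation with a schema. A minimal sketch, reusing the `StudentsMale` and `StudentsFemale` relations from the SPLIT example; the alias `AllStudents` is illustrative:

```pig
-- Both inputs come from the same SPLIT, so they carry the Students schema.
SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;

-- ONSCHEMA requires every input to have a schema and aligns fields by name.
AllStudents = UNION ONSCHEMA StudentsMale, StudentsFemale;

DESCRIBE AllStudents;
```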

Operating on relations

➢ DISTINCT instruction:
○ Only preserves unique tuples

Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses;
DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)

○ UniqueCourses: new relation

Operating on relations

➢ ORDER BY instruction:
○ Sorts a relation by a specific criterion

SortedGrades = ORDER Grades BY mark DESC;
DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…

○ Grades: base relation
○ SortedGrades: new relation
○ mark: field(s) used to sort
○ Sort criterion: DESC (descending) or ASC (ascending)

Operating on relations

➢ LIMIT instruction:
○ Truncates relation’s size

BestGrades = LIMIT SortedGrades 3;
DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)

○ SortedGrades: base relation
○ BestGrades: new relation
○ 3: maximum number of tuples

Operating on relations

➢ RANK instruction:
○ Appends position of each tuple in the relation

RankedGrades = RANK SortedGrades;
DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}

○ SortedGrades: base relation
○ RankedGrades: new relation
○ rank_SortedGrades: rank number!

Operating on relations

➢ RANK instruction:
○ We can also sort and rank!

RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}

○ SortedGrades: base relation
○ RankedGrades: new relation
○ student_id ASC, mark DESC: fields to sort

Operating on relations

➢ SAMPLE instruction:
○ Keeps a random sample of the relation’s tuples

SampledGrades = SAMPLE Grades 0.05;
DUMP SampledGrades;
(4,Engineering,8.0)

○ Grades: base relation
○ SampledGrades: new relation
○ 0.05: proportion to sample

Exercise: Top grades

➢ Get the 3 top grades for each student

(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...
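One way to approach this exercise is to combine the left JOIN, GROUP and nested FOREACH seen so far. The sketch below assumes the `Students` and `Grades` relations loaded earlier; the aliases `Marks`, `MarksGr`, `Sorted`, `Top3` and `TopGrades` are illustrative names. The left join is there so that a student without grades still shows up, as in the (6,{(,)}) line of the expected output.

```pig
-- Left join so that students without any grade are kept.
StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;

-- Keep only the fields the exercise needs, with unambiguous names.
Marks = FOREACH StudentsGrades GENERATE
    Students::student_id AS student_id,
    Grades::course AS course,
    Grades::mark AS mark;

MarksGr = GROUP Marks BY student_id;

-- Nested FOREACH: sort each student's bag, truncate it, then project.
TopGrades = FOREACH MarksGr {
    Sorted = ORDER Marks BY mark DESC;
    Top3 = LIMIT Sorted 3;
    GENERATE group AS student_id, Top3.(course, mark) AS top_grades;
};

DUMP TopGrades;
```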

Operating on relations

➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation

CubedGrades = CUBE Grades BY CUBE(student_id, course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;
((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…

Operating on relations

➢ CUBE/ROLLUP instruction:
○ Like standard CUBE, but null values are introduced from right to left

RolledGrades = CUBE Grades BY ROLLUP(course, student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;
((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…

○ ROLLUP(course, student_id): order matters!

Operating on relations

➢ ASSERT instruction:
○ Assert that the whole relation fulfills a condition
○ Useful for debugging

ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';

○ Grades: base relation
○ mark > 0.0: condition
○ 'marks should be greater than 0': error message

Finally, storing data!

➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging

STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');

○ BestGrades: relation
○ 'best_grades_path': path to store data
○ PigStorage: connector
○ '\t': field separator


Problems solved?!

Only these operations?

➢ ASSERT
➢ COGROUP
➢ CROSS
➢ CUBE
➢ DISTINCT
➢ FILTER
➢ FOREACH
➢ GROUP
➢ JOIN
➢ LIMIT
➢ LOAD
➢ ORDER, RANK
➢ SAMPLE
➢ SPLIT
➢ UNION
➢ DUMP, ILLUSTRATE, DESCRIBE

Functions & user defined functions

➢ Transform data in data projections
➢ Built-in functions:
○ math functions, string functions, datetime functions, casting functions, etc.
➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, JavaScript, etc.

Functions & user defined functions

➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values

GradesGr = GROUP Grades BY course;
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)

○ Grades.mark: employ only this field in bag/tuple

Functions & user defined functions

➢ Bag functions:
○ COUNT: number of elements (not null) in a bag

GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)
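The "not null" remark matters when a bag can contain tuples with missing values: COUNT skips a tuple when its first field is null, whereas the built-in COUNT_STAR counts every tuple. A small sketch comparing both on the grouped grades; the alias `GradesCounts` and the output field names are illustrative:

```pig
GradesGr = GROUP Grades BY course;

-- COUNT ignores tuples whose first field is null; COUNT_STAR counts them all.
GradesCounts = FOREACH GradesGr GENERATE
    group AS course,
    COUNT(Grades) AS counted,
    COUNT_STAR(Grades) AS counted_with_nulls;
```

On data without nulls, such as the grades shown in these slides, both columns should agree.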

Functions & user defined functions

➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input
○ On a bag: un-nests it, producing one output tuple per element of the bag

DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...
GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…

Functions & user defined functions

➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input
○ On a tuple: its fields are promoted to fields of the enclosing tuple

GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...
GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…

Functions & user defined functions

➢ Bag/Tuple functions:
○ SUBTRACT: Tuples in the first bag that are not in the second

SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})

Functions & user defined functions

➢ Bag/Tuple functions:
○ DIFF: Non-overlapping tuples of two bags

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})

Functions & user defined functions

➢ Math functions:
○ Common math functions for numeric values:
■ ABS
■ EXP
■ FLOOR
■ LOG
■ RANDOM
■ ROUND
■ SQRT
■ ...

Functions & user defined functions

➢ String functions:
○ Transform chararrays:
■ ENDSWITH
■ LOWER
■ UPPER
■ SUBSTRING
■ TRIM
■ REPLACE
■ ...
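Math and string functions are typically called inside a FOREACH projection. The following sketch applies a few of them to the `Grades` and `Students` relations used throughout; the aliases and output field names are illustrative.

```pig
-- Math functions applied to the numeric mark field (a double).
GradeStats = FOREACH Grades GENERATE
    student_id,
    course,
    ROUND(mark) AS rounded_mark,
    FLOOR(mark) AS floor_mark,
    SQRT(mark) AS sqrt_mark;

-- String functions applied to chararray fields.
StudentNames = FOREACH Students GENERATE
    student_id,
    LOWER(name) AS name_lower,
    UPPER(surname) AS surname_upper,
    SUBSTRING(name, 0, 1) AS initial;
```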

Functions & user defined functions

➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration
■ CurrentTime
■ ToDate
■ ToString
■ ToUnixTime
■ DaysBetween
■ ...
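As a sketch of how some of these datetime functions combine, the example below parses the date column of the purchases.tsv file used in the final exercise. It assumes a tab-separated file without a header line; the field name `purchase_date` and the aliases are illustrative.

```pig
-- Assumption: no header line; the date is first read as a plain chararray.
Purchases = LOAD 'purchases.tsv' USING PigStorage('\t') AS (product_id: long, user_id: long, price: double, purchase_date: chararray);

-- ToDate turns the string into a datetime using the given pattern.
PurchaseDates = FOREACH Purchases GENERATE
    product_id,
    ToDate(purchase_date, 'yyyy-MM-dd') AS purchased_at;

-- Derive a formatted string, a Unix timestamp and an age in days.
PurchaseInfo = FOREACH PurchaseDates GENERATE
    product_id,
    ToString(purchased_at, 'dd/MM/yyyy') AS formatted,
    ToUnixTime(purchased_at) AS unix_seconds,
    DaysBetween(CurrentTime(), purchased_at) AS days_ago;
```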

User defined functions

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the input bag with its tuples in random order.
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the tuples into a list so that they can be shuffled in memory.
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        DataBag resBag = BagFactory.getInstance().newDefaultBag(l);
        return resBag;
    }

    // The output bag has the same schema as the input bag.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}
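Once compiled and packaged, a UDF like this could be called from Pig Latin roughly as follows. The jar name `myudfs.jar` is hypothetical (the slides do not say how the class is packaged), and the bare class name is used because SHUFFLE is declared without a package.

```pig
-- Hypothetical jar containing the compiled SHUFFLE class.
REGISTER myudfs.jar;

-- Give the UDF a short name to call it by.
DEFINE Shuffle SHUFFLE();

GradesGr = GROUP Grades BY course;

-- Each course keeps its grades, but in random order inside the bag.
ShuffledGrades = FOREACH GradesGr GENERATE group AS course, Shuffle(Grades) AS grades;
```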

More functions: DataFu Pig

➢ Library of useful UDFs released in 2010
➢ Created by the LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ Page rank
○ Session estimation
➢ Last major release: 1.2.0 (Dec, 2013)
➢ http://datafu.incubator.apache.org/

How to use UDF libraries

REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();

○ DEFINE: indicate the UDF to be included and the name to call it by

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})

Scripting

-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();

-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);

-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);

-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');

Calling a script

pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path

○ -x mapred: execution mode
○ -f myscript.pig: script file
○ -param name=value: parameter definition
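A script can also carry fallback values for its parameters through the %default preprocessor statement, so that it still runs when a -param is omitted. This is a sketch under the assumption that the same parameter names as above are used; the default values simply repeat the ones from the command line.

```pig
-- Used only when the script is called without the matching -param.
%default student_file 'students.csv';
%default output 'myoutput_path';

Students = LOAD '$student_file' USING PigStorage('\t') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);

STORE Students INTO '$output' USING PigStorage('\t');
```

A value passed with -param on the command line takes precedence over the %default value.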

More on load/store

➢ Not limited to plain text
➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
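For instance, JSON can be read and written with the built-in JsonLoader and JsonStorage functions instead of PigStorage. In this sketch the paths `students.json` and `students_json_out` are hypothetical, and the schema string mirrors the Students schema used in these slides.

```pig
-- Hypothetical input path; JsonLoader takes the schema as a string.
StudentsJson = LOAD 'students.json' USING JsonLoader('student_id:long, name:chararray, surname:chararray, gender:chararray, age:int');

-- Hypothetical output path; JsonStorage writes one JSON record per tuple.
STORE StudentsJson INTO 'students_json_out' USING JsonStorage();
```

Only the USING clause changes; the rest of a script works on the relation exactly as before.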

Exercise: Product association

➢ Detect pairs of products bought together (e.g., chairs and tables)
➢ Goal: recommend related products
➢ Association score:

Product association

➢ Purchases: purchases.tsv

product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...

➢ Products: products.tsv

product_id  status
1           ok
5           ko
99          ok
...
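A possible starting point for the exercise is to count how many users bought each pair of products. The sketch below stops at those counts and does not compute the association score itself. It assumes tab-separated files without a header line, loads the purchases twice under different aliases for the self join, and uses illustrative alias and field names throughout.

```pig
-- Assumption: purchases.tsv has no header line. Loaded twice for the self join.
PurchasesA = LOAD 'purchases.tsv' USING PigStorage('\t') AS (product_id: long, user_id: long, price: double, purchase_date: chararray);
PurchasesB = LOAD 'purchases.tsv' USING PigStorage('\t') AS (product_id: long, user_id: long, price: double, purchase_date: chararray);

-- One tuple per (user, product), however many times the product was bought.
ProjA = FOREACH PurchasesA GENERATE user_id, product_id;
UsersA = DISTINCT ProjA;
ProjB = FOREACH PurchasesB GENERATE user_id, product_id;
UsersB = DISTINCT ProjB;

-- Pairs of different products bought by the same user; "<" keeps each pair once.
Joined = JOIN UsersA BY user_id, UsersB BY user_id;
Pairs = FILTER Joined BY UsersA::product_id < UsersB::product_id;

-- Number of users who bought both products of each pair.
PairsGr = GROUP Pairs BY (UsersA::product_id, UsersB::product_id);
PairCounts = FOREACH PairsGr GENERATE FLATTEN(group) AS (product_a, product_b), COUNT(Pairs) AS buyers;
```

The products.tsv file is not used here; how its status field should enter the solution is left to the exercise.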


Time to work!

Final notes: Pros & cons

Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others

Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops

Extra information

➢ http://pig.apache.org/
➢ Programming Pig. Alan Gates. Ed. O’Reilly
➢ Stack Overflow