Apache Pig: Making data transformation easy
Víctor Sánchez Anguix, Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Course 2014/2015
Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce Problem Solving
Complex problem
➢ Need to solve a complex problem
➢ It requires atomic operations more complex than map and reduce
➢ Java is not a data-oriented language → low productivity
➢ Any solutions?
Apache Pig to the rescue!
Join in Apache Hadoop
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper for the delivery file: emits (cellNumber, "DR~" + deliveryCode)
public class DeliveryFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String splitarray[] = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}
** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html

// Reducer: joins the values tagged "CD~" (customer) and "DR~" (delivery) that share a key
public class SmsReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        loadDeliveryStatusCodes(); // helper defined in the original post, not shown on the slide
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String valueSplitted[] = currValue.split("~");
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        // Emit the joined pair, or a placeholder for the missing side
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }
}
Join in Apache Pig
C = JOIN A BY keyA, B BY keyB;
Apache Pig overview
➢ Framework layer over HDFS and Hadoop
➢ Developed at Yahoo in 2006
➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
➢ Last major release: 0.14.0 (November 2014)
➢ http://pig.apache.org/
Apache Hadoop vs. Apache Pig
Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ M/R inner flexibility
➢ Efficiency
Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: data scripting language
➢ UDFs in Java (and other languages)
➢ Overhead of the translation to M/R
Pig Programming Model: Data
➢ Pig operations operate on relations
➢ A relation is a bag
➢ A bag is a collection of tuples
➢ A tuple is an ordered set of fields
➢ A field is any type of data
Sounds complicated… but it’s not!
➢ Basic data types:
○ Boolean: True, False
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: 'Hello', 'I am a string'
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won’t use them very often
➢ Tuple: a catch-all data type
➢ Bag: a collection of tuples
➢ And relations? Just the outermost (distributed) bags
Loading data?
➢ Loading data? No, first let’s meet our friend Grunt
➢ Interactive Pig shell → nice for debugging/experimenting
➢ pig -x local or pig -x mapreduce
Loading data?
➢ Data source: local file system or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded into a distributed relation
-- Students: relation name; 'student_path': path on the local disk or HDFS
-- PigStorage: connector; '\t': field separator; AS (...): tuple schema
Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
-- Grades: relation name; 'grade_path': path on the local disk or HDFS
-- PigStorage: connector; ',': field separator; '-schema': load the schema from .pig_schema
Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');
Checking relations’ content
➢ DUMP instruction:
○ Prints the content of a relation to standard output
grunt> DUMP Students; -- Students: relation name
(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
Checking relations’ content
➢ DESCRIBE instruction:
○ Prints the schema of the relation to standard output
grunt> DESCRIBE Students; -- Students: relation name
Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
Checking relations’ content
➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and an example tuple to standard output
grunt> ILLUSTRATE Students; -- Students: relation name
------------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
------------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
------------------------------------------------------------------------------------------------
Operating on relations
➢ FOREACH instruction:
○ Generates a new relation by projecting data from a relation
-- StudentsProj: new relation; Students: base relation; GENERATE ...: projected data
StudentsProj = FOREACH Students GENERATE student_id, name, age;
○ We can generate new data too!
StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
Operating on relations
➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that nothing happens!
○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE…
○ Any ideas on this issue?
Operating on relations
➢ Pig employs lazy evaluation
➢ Computation is triggered only by:
○ DUMP, STORE, ILLUSTRATE (a LOAD statement is only parsed and validated)
➢ Pig keeps a DAG of the M/R jobs needed to compute relations (optimized!)
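The lazy behaviour described above can be mimicked with a few lines of plain Python. This is only a toy model under stated assumptions, not how Pig is implemented: the class `LazyRelation` and its methods `foreach`, `filter` and `dump` are hypothetical names, and the "plan" is a simple list of steps rather than an optimized DAG of M/R jobs.

```python
class LazyRelation:
    """Toy stand-in for a relation: records steps, runs them only on dump()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def foreach(self, fn):
        # Nothing is computed here: the step is only appended to the plan.
        return LazyRelation(self._rows, self._steps + (("foreach", fn),))

    def filter(self, pred):
        return LazyRelation(self._rows, self._steps + (("filter", pred),))

    def dump(self):
        # The whole recorded plan is executed now.
        out = list(self._rows)
        for kind, fn in self._steps:
            if kind == "foreach":
                out = [fn(row) for row in out]
            else:
                out = [row for row in out if fn(row)]
        return out


calls = []  # records every row the projection actually touches


def project(row):
    calls.append(row)
    return (row[0], row[1], row[4])


students = LazyRelation([(1, "John", "Doe", "M", 18), (3, "Lara", "Croft", "F", 25)])
students_proj = students.foreach(project)  # "nothing happens"
calls_before_dump = len(calls)
result = students_proj.dump()              # the projection runs here
```

Defining `students_proj` touches no rows; only the call to `dump()` evaluates the recorded steps, which matches what we observed in Grunt after typing the FOREACH statement.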
Exercise: Who is under 25?
➢ Extend the Students relation with a field that tells whether the student is under 25 years old
(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...
Operating on relations
➢ FILTER instruction:
○ Generates a new relation by filtering the data of a relation
-- StudentsFilt: new relation; Students: base relation; BY ...: condition to fulfill
StudentsFilt = FILTER Students BY age > 24 AND age < 34;
DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions
-- Students: base relation
-- StudentsMale: new relation; IF ...: condition to fulfill by the new relation
-- StudentsFemale: new relation; OTHERWISE means the rest
SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
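Note that the conditions of a SPLIT do not have to be exclusive: a tuple is sent to every relation whose condition it fulfills, and OTHERWISE only collects the tuples that matched none. The following plain-Python sketch models that behaviour on the Students data shown earlier; the function `split` and the dictionary keys are hypothetical names chosen for illustration.

```python
def split(rows, conditions):
    """Send each row to every output whose predicate holds; unmatched rows go to 'otherwise'."""
    out = {name: [] for name in conditions}
    out["otherwise"] = []
    for row in rows:
        matched = False
        for name, pred in conditions.items():
            if pred(row):
                out[name].append(row)
                matched = True
        if not matched:
            out["otherwise"].append(row)
    return out


STUDENTS = [
    (1, "John", "Doe", "M", 18), (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25), (4, "Sherlock", "Holmes", "M", 36),
    (5, "John", "Watson", "M", 38), (6, "Sarah", "Kerrigan", "F", 21),
    (7, "Bruce", "Wayne", "M", 32), (8, "Tony", "Stark", "M", 33),
    (9, "Princess", "Peach", "F", 21), (10, "Peter", "Parker", "M", 23),
]

parts = split(STUDENTS, {
    "under25": lambda s: s[4] < 25,
    "under30": lambda s: s[4] < 30,
})
# student_id values that ended up in each output
ids = {name: [s[0] for s in rows] for name, rows in parts.items()}
```

Every student under 25 appears both in `under25` and in `under30`, so a SPLIT is not necessarily a partition of the base relation.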
Operating on relations
➢ GROUP instruction:
○ Creates one tuple per key, made of the key and a bag with the tuples that share that key
-- StudentsGr: new relation; Students: base relation; gender: field whose values define the groups
StudentsGr = GROUP Students BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
-- New schema!
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
Operating on relations
➢ GROUP instruction:
○ We can group multiple relations at once. Creates one bag per relation
-- StudentsGr: new relation; StudentsUnder25, OtherStudents: base relations; gender: field whose values define the groups
StudentsGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
-- New schema!
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
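To make the shape of the grouped output concrete, here is a minimal plain-Python model of grouping one relation and of grouping several relations on the same key. It is a sketch of the semantics only: `group` and `cogroup` are hypothetical helper names, relations are Python lists of tuples, and the sorted key order is a convenience of the sketch (Pig does not promise any order of groups or of tuples inside a bag).

```python
def group(relation, key):
    """One (key, bag) tuple per distinct key value."""
    bags = {}
    for row in relation:
        bags.setdefault(key(row), []).append(row)
    return [(k, bags[k]) for k in sorted(bags)]


def cogroup(relations, key):
    """One (key, bag_1, ..., bag_n) tuple per key; a bag is empty if its relation lacks the key."""
    keys = sorted({key(row) for rel in relations for row in rel})
    return [
        (k,) + tuple([row for row in rel if key(row) == k] for rel in relations)
        for k in keys
    ]


def gender(student):
    return student[3]


under25 = [(2, "Mary", "Doe", "F", 20), (1, "John", "Doe", "M", 18),
           (10, "Peter", "Parker", "M", 23)]
others = [(4, "Sherlock", "Holmes", "M", 36)]

grouped = group(under25, gender)
cogrouped = cogroup([under25, others], gender)
```

In `cogrouped`, the tuple for key `F` carries an empty second bag because no tuple of `others` has that key, just like the `{}` in the DUMP output above.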
Operating on relations
➢ Nested FOREACH:
○ Operates on the data in the bags inside a relation and then projects
-- StudentsNested: new relation; StudentsGr: base relation
-- Students: bag inside the base relation; Information: new bag
-- The final GENERATE projects the result
StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}
DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})
Operating on relations
➢ (inner) JOIN instruction:
○ Our classic database operator for relations!
-- StudentsGrades: new relation; Students, Grades: base relations; student_id: fields whose values are used to join
StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…
-- New schema!
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Operating on relations
➢ (left) JOIN instruction:
○ Our classic database operator for relations!
-- StudentsGrades: new relation; Students: left relation; Grades: right relation
-- LEFT: do not forget this one!
StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…
-- New schema!
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
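The difference between the inner and the left join can be sketched in plain Python as a small hash join. This is an illustrative model, not Pig's execution strategy: the function `join` and its parameters (`right_width`, `how`) are hypothetical, and `None` plays the role of the empty (null) fields in the DUMP output.

```python
def join(left, right, left_key, right_key, right_width, how="inner"):
    """Hash join on one key; how='left' keeps unmatched left rows padded with None."""
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)
    out = []
    for l in left:
        matches = index.get(left_key(l), [])
        if matches:
            out.extend(l + r for r in matches)
        elif how == "left":
            out.append(l + (None,) * right_width)
    return out


students = [(6, "Sarah", "Kerrigan", "F", 21), (7, "Bruce", "Wayne", "M", 32)]
grades = [(7, "Engineering", 8.5), (7, "Physics", 8.9)]


def first_field(row):
    return row[0]  # student_id is the first field in both relations


inner = join(students, grades, first_field, first_field, right_width=3)
left = join(students, grades, first_field, first_field, right_width=3, how="left")
```

Student 6 has no grades, so she disappears from `inner` but survives in `left` with three null fields, which is the `(6,Sarah,Kerrigan,F,21,,,)` tuple printed above.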
Operating on relations
➢ CROSS instruction:
○ Cartesian product of two or more relations
-- StudentsCr: new relation; Students: relation 1; Grades: relation 2
StudentsCr = CROSS Students, Grades;
DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…
-- New schema!
DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Operating on relations
➢ UNION instruction:
○ Merges the tuples of multiple relations into a single relation
-- StudentsUnion: new relation; Students: relation 1; Grades: relation 2
StudentsUnion = UNION Students, Grades;
DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…
-- UNION does not preserve the schema when the input schemas differ!
DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.
Operating on relations
➢ DISTINCT instruction:
○ Only preserves unique tuples
Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses; -- UniqueCourses: new relation
DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)
Operating on relations
➢ ORDER BY instruction:
○ Sorts a relation by one or more fields
-- SortedGrades: new relation; Grades: base relation; mark: field(s) used to sort
-- Sort criterion: DESC (descending) or ASC (ascending)
SortedGrades = ORDER Grades BY mark DESC;
DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…
Operating on relations
➢ LIMIT instruction:
○ Truncates the relation’s size
-- BestGrades: new relation; SortedGrades: base relation; 3: maximum number of tuples
BestGrades = LIMIT SortedGrades 3;
DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)
Operating on relations
➢ RANK instruction:
○ Prepends to each tuple its position in the relation
-- RankedGrades: new relation; SortedGrades: base relation
RankedGrades = RANK SortedGrades;
DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…
-- rank_SortedGrades: rank number!
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Operating on relations
➢ RANK instruction:
○ We can also sort and rank!
-- RankedGrades: new relation; SortedGrades: base relation; BY ...: fields to sort
RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Operating on relations
➢ SAMPLE instruction:
○ Sample the relation!
-- SampledGrades: new relation; Grades: base relation; 0.05: proportion to sample
SampledGrades = SAMPLE Grades 0.05;
DUMP SampledGrades;
(4,Engineering,8.0)
Exercise: Top grades
➢ Get the 3 top grades for each student
(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...
Operating on relations
➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation
CubedGrades = CUBE Grades BY CUBE(student_id, course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;
((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…
Operating on relations
➢ CUBE/ROLLUP instruction:
○ Like the standard CUBE, but null values are introduced from right to left
-- The order of the fields matters!
RolledGrades = CUBE Grades BY ROLLUP(course, student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;
((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…
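The group keys produced by CUBE and ROLLUP for a single tuple can be enumerated with a short plain-Python sketch. The helpers `cube_keys` and `rollup_keys` are hypothetical names, and `None` stands for the null that marks an aggregated-away dimension; the sketch only lists the keys and leaves the aggregation (AVG above) aside.

```python
from itertools import combinations


def cube_keys(values):
    """Every combination of kept dimensions: 2**n keys for n dimensions."""
    n = len(values)
    keys = []
    for size in range(n, -1, -1):
        for kept in combinations(range(n), size):
            keys.append(tuple(v if i in kept else None for i, v in enumerate(values)))
    return keys


def rollup_keys(values):
    """Nulls are introduced from right to left: n + 1 keys for n dimensions."""
    n = len(values)
    return [tuple(values[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


cube = cube_keys((2, "Math"))      # dimensions: (student_id, course)
rollup = rollup_keys(("Math", 2))  # dimensions: (course, student_id)
```

With two dimensions, CUBE yields four keys per tuple while ROLLUP yields three: ROLLUP never produces a key that keeps `student_id` but drops `course`, which is why the order of the fields matters.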
Operating on relations
➢ ASSERT instruction:
○ Asserts that the whole relation fulfills a condition
○ Useful for debugging
-- Grades: base relation; mark > 0.0: condition; '...': error message
ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
Finally, storing data!
➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging
-- BestGrades: relation; 'best_grades_path': path to store data
-- PigStorage: connector; '\t': field separator
STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
Problems solved?!
Only these operations?
➢ ASSERT, COGROUP, CROSS, CUBE, DISTINCT, FILTER, FOREACH, GROUP
➢ JOIN, LIMIT, LOAD, ORDER, RANK, SAMPLE, SPLIT, UNION
➢ DUMP, ILLUSTRATE, DESCRIBE
Functions & user defined functions
➢ Transform data in data projections
➢ Built-in functions:
○ Math functions, string functions, datetime functions, casting functions, etc.
➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
Functions & user defined functions
➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
GradesGr = GROUP Grades BY course;
-- Grades.mark: use only this field of the tuples in the bag
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)
Functions & user defined functions
➢ Bag functions:
○ COUNT: number of (non-null) elements in a bag
GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)
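How the bag functions treat nulls can be sketched in plain Python. The model below assumes the behaviour described in the Pig documentation, namely that COUNT skips tuples whose first field is null and that AVG skips null values; `count` and `avg` are hypothetical helper names, and the third tuple of the bag is made up to show the effect.

```python
def count(bag):
    """Count the tuples of a bag whose first field is not null."""
    return sum(1 for t in bag if t[0] is not None)


def avg(values):
    """Average of the non-null values; None if there is nothing to average."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Bag of (student_id, course, mark) tuples; the last one is an illustrative all-null row.
biology = [(5, "Biology", 10.0), (4, "Biology", 6.7), (None, "Biology", None)]

n_students = count(biology)
avg_mark = avg([t[2] for t in biology])  # models AVG(Grades.mark)
```

The list comprehension passed to `avg` plays the role of the `Grades.mark` projection: only the `mark` field of each tuple in the bag reaches the function.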
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on the input
DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...
GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on the input
GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...
GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…
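The two behaviours of FLATTEN can be modelled in plain Python: flattening a bag produces one output row per tuple in the bag, whereas flattening a tuple just splices its fields into the row. The helpers `flatten_bag` and `flatten_tuple` are hypothetical names used only for this sketch.

```python
def flatten_bag(row, bag_index):
    """One output row per tuple in the bag at bag_index; an empty bag yields no rows."""
    prefix, bag, suffix = row[:bag_index], row[bag_index], row[bag_index + 1:]
    return [prefix + tuple(t) + suffix for t in bag]


def flatten_tuple(row, tuple_index):
    """Splice the fields of the nested tuple into the row; always exactly one output row."""
    return row[:tuple_index] + tuple(row[tuple_index]) + row[tuple_index + 1:]


grouped_row = ("Math", [(6.7,), (5.6,), (10.0,)])  # (course, bag of (mark) tuples)
tupled_row = (1, ("Math", 5.6))                    # (student_id, (course, mark))

from_bag = flatten_bag(grouped_row, 1)
from_tuple = flatten_tuple(tupled_row, 1)
no_rows = flatten_bag(("Latin", []), 1)            # hypothetical course with an empty bag
```

Flattening a bag can therefore change the number of rows of the relation, while flattening a tuple only changes the number of fields.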
Functions & user defined functions
➢ Bag/Tuple functions:
○ SUBTRACT: tuples in the first bag that are not in the second
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
Functions & user defined functions
➢ Bag/Tuple functions:
○ DIFF: tuples that appear in only one of the two bags
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
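SUBTRACT and DIFF return the same bags in the example above only because the second bag is contained in the first. A plain-Python sketch makes the difference visible once the arguments are swapped; `subtract` and `diff` are hypothetical helper names, and duplicate tuples inside a bag are not given any special treatment here.

```python
def subtract(bag1, bag2):
    """Tuples of bag1 that do not appear in bag2 (one-sided)."""
    return [t for t in bag1 if t not in bag2]


def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags (two-sided)."""
    return subtract(bag1, bag2) + subtract(bag2, bag1)


under25 = [(10, "Peter", "Parker", "M", 23), (1, "John", "Doe", "M", 18)]
under20 = [(1, "John", "Doe", "M", 18)]

same_sub = subtract(under25, under20)     # as in the example above
same_diff = diff(under25, under20)
swapped_sub = subtract(under20, under25)  # arguments swapped
swapped_diff = diff(under20, under25)
```

With the arguments swapped, the one-sided `subtract` returns an empty bag, while `diff` still reports Peter Parker.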
Functions & user defined functions
➢ Math functions:
○ Common math functions for numeric values:
■ ABS, EXP, FLOOR, LOG, RANDOM, ROUND, SQRT, ...
Functions & user defined functions
➢ String functions:
○ Transform chararrays:
■ ENDSWITH, LOWER, UPPER, SUBSTRING, TRIM, REPLACE, ...
Functions & user defined functions
➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration, CurrentTime, ToDate, ToString, ToUnixTime, DaysBetween, ...
User defined functions
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the input bag with its tuples in random order
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the tuples to a list so that they can be shuffled
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        DataBag resBag = BagFactory.getInstance().newDefaultBag(l);
        return resBag;
    }

    // The output has the same schema as the input bag
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}
More functions: DataFu Pig
➢ Library of useful UDFs released in 2010
➢ Created by the LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ Page rank
○ Session estimation
➢ Last major release: 1.2.0 (Dec, 2013)
➢ http://datafu.incubator.apache.org/
How to use UDF libraries
-- REGISTER: library to include; DEFINE: UDF to use and the name we give it
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
Scripting
-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
Calling a script
pig -x mapreduce -f myscript.pig -param student_file=students.csv -param output=myoutput_path
○ -x: execution mode
○ -f: script file
○ -param: parameter definition
More on load/store
➢ Not limited to plain text
➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
Exercise: Product association
➢ Detect pairs of products bought together (e.g., chairs and tables)
➢ Goal: recommend related products
➢ Association score:
Product association
➢ Purchases: purchases.tsv
product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...
➢ Products: products.tsv
product_id  status
1           ok
5           ko
99          ok
...
Time to work!
Final notes: Pros & cons
Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others
Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops
Extra information
➢ http://pig.apache.org/
➢ Programming Pig. Alan Gates. Ed. O’Reilly
➢ StackOverflow