© Hortonworks Inc. 2011
Daniel Dai (@daijy), Thejas Nair (@thejasn)
Making Pig Fly: Optimizing Data Processing on Hadoop
What is Apache Pig?
Architecting the Future of Big Data
• Pig Latin, a high-level data processing language
• An engine that executes Pig Latin locally or on a Hadoop cluster
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
• Query : Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Pig optimizations
• Ideally, the user should not have to bother with optimization
• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)
• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use
Rule based optimizer
• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer
Column Pruner
• Pig will do column pruning automatically
• Cases where Pig will not prune columns automatically
  – No schema specified in the load statement
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
Pig will prune a2 automatically
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';
Column Pruner
• Another case where Pig does not prune columns
  – Pig does not keep track of unused columns after grouping
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it up
[Diagram] Original query: a single filter (a0>0 and b0>10) sits above the join of A and B. The condition is first split into two filters (a0>0 and b0>10); each filter is then pushed above the join onto its own input, so a0>0 is applied to A and b0>10 to B before the join.
Other push up/down
• Push down flatten

[Diagram] Load → Flatten → Order becomes Load → Order → Flatten: the flatten is pushed below the order, so fewer records are sorted.

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit

[Diagram] Load → Foreach → Limit becomes Load → Limit → Foreach, and finally Load (limited) → Foreach: the limit is pushed into the loader. Similarly, Load → Order → Limit becomes Load → Order (limited).
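The limit push-up can be sketched in Pig Latin (the input name is illustrative); Pig rewrites the plan so the limit is applied as early as possible:

```pig
A = load 'input';
B = foreach A generate $0;
C = limit B 100;  -- pushed above the foreach, then into the load
store C into 'output';
```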
Partition pruning
• Prune unnecessary partitions entirely
  – HCatLoader
[Diagram] Without pruning, HCatLoader reads all partitions (2010, 2011, 2012) and a separate filter (year>=2011) discards 2010. With partition pruning, the filter is pushed into HCatLoader, which never reads the 2010 partition.
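A minimal sketch of a query that benefits from partition pruning; the table name and partition column are illustrative:

```pig
-- 'web_logs' is a hypothetical HCatalog table partitioned by year
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;  -- pushed into HCatLoader; the 2010 partition is never read
store B into 'output';
```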
Intermediate file compression
[Diagram] A Pig script compiles into a chain of MapReduce jobs (map 1/reduce 1 → map 2/reduce 2 → map 3/reduce 3), with a Pig temp file written between consecutive jobs.
• Intermediate files between map and reduce
  – Snappy
• Temp files between MapReduce jobs
  – No compression by default
Enable temp file compression
• Pig temp files are not compressed by default
  – Issues with Snappy (HADOOP-7990)
  – LZO: not Apache-licensed
• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO: over 90% disk savings and a 4x query speedup
Multiquery
• Combines two or more MapReduce jobs into one
  – Happens automatically
  – Sometimes you want to control multiquery: combining too many jobs can hurt
[Diagram] One Load feeds three branches (group by $0, group by $1, group by $2), each followed by a Foreach and a Store; multiquery runs them in a single job.
Control multiquery
• Disable multiquery
  – Command line option: -M
• Using "exec" to mark the boundary

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
  – Initial
  – Intermediate
  – Final
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';
[Diagram] The map runs Initial, the combiner runs Intermediate, and the reduce runs Final.
Implement the right UDF
• Accumulator UDF
  – Reduce-side UDF
  – Normally takes a bag
• Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size is configurable
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';
my_accum implements Accumulator {
  public void accumulate(Tuple b) {
    // take one chunk of the bag
  }
  public T getValue() {
    // called after all bag chunks are processed
  }
}

pig.accumulative.batchsize = 20000
Memory optimization
• Control bag size on the reduce side
  – If a bag's size exceeds the threshold, it spills to disk
  – Control the bag size to fit the bag in memory if possible
MapReduce: reduce(Text key, Iterator<Writable> values, ...)

[Diagram] The reducer's value iterator is materialized into one bag per input (bag of input 1, bag of input 2, bag of input 3).

pig.cachedbag.memusage = 0.2
Optimization starts before Pig
• Input format
• Serialization format
• Compression
Input format: test query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)
Input formats
[Chart] Runtime (sec), 0-140 scale, comparing PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types).
Columnar format
• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1
  – Project 1 out of 5 columns
• Test 2
  – Project all 5 columns
RCFile test results
[Chart] Runtime (sec), 0-140 scale, for "Project 1" and "Project all", comparing Plain Text vs RCFile.
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune
Cost based optimization - Aggregation

• Hash Based Aggregation (HBA)
  – Use pig.exec.mapPartAgg=true to enable

[Diagram] In the map task, the map logic feeds HBA before the map output; the reduce task combines the partially aggregated HBA output.
Cost based optimization – Hash Agg.
• Auto-off feature
  – Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
  – Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  – Configure memory used: pig.cachedbag.memusage
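Putting the knobs together, a conf/pig.properties fragment might look like this (the values are illustrative, not recommendations):

```
# Enable map-side hash-based aggregation
pig.exec.mapPartAgg = true
# Switch HBA off unless map output shrinks by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# Fraction of memory that cached bags may use
pig.cachedbag.memusage = 0.2
```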
Cost based optimization - Join
• Use the appropriate join algorithm
  – Skew on the join key: skew join
  – One input fits in memory: FR (fragment-replicate) join
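In Pig Latin the join algorithm is chosen with the `using` clause; a sketch with illustrative relation names:

```pig
USERS = load 'users' as (uid, age);
PAGES = load 'pages' as (url, uid);
-- skew join: handles heavily skewed join keys
J1 = join PAGES by uid, USERS by uid using 'skewed';
-- FR join: USERS is small enough to replicate into memory on each map
J2 = join PAGES by uid, USERS by uid using 'replicated';
```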
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
  – Control spills using map sort params
  – Reduce shuffle/sort-merge params
Parallelism of reduce tasks
[Chart] Runtime (h:mm:ss, y-axis from 0:14:24 to 0:25:55) vs number of reduce tasks: 4, 6, 8, 24, 48, 256.
• Number of reduce slots = 6
• Factors affecting runtime
  – Cores simultaneously used / skew
  – Cost of having additional reduce tasks
Cost based optimization – keep data sorted
• Frequent join operations on the same keys
  – Keep data sorted on the keys
  – Use merge join
  – Optimized group on sorted keys
  – Works with few load functions; needs an additional interface implementation
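A merge join sketch, assuming both inputs are already sorted on uid (file names illustrative):

```pig
A = load 'sorted_users' as (uid, age);
B = load 'sorted_pages' as (uid, url);
J = join A by uid, B by uid using 'merge';
```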
Optimizations for sorted data
[Chart] Runtime breakdown (sec, 0-90 scale): "sort+sort+join+join" vs "join + join" on pre-sorted data; stacked components are Sort1, Sort2, Join 1, Join 2.
Future Directions
• Optimize using stats
• Using historical stats with HCatalog
• Sampling
Questions
?