Becoming a Pig Developer
Alan Gates
- 2 -
Who Am I?
• Pig committer
• Hadoop PMC Member
• Yahoo! architect for Pig
- 3 -
Current Status
• Release 0.3, June 2009
  – Multi-store queries
• Pig added to Amazon Elastic MapReduce, August 2009
• Release 0.4, September 2009
  – Added skew and merge join
  – Added outer join (for default hash join only)
• Release 0.5, November 2009
  – Hadoop 0.20
- 4 -
Components
[Diagram: user machine connected to Hadoop cluster]
• Pig resides on the user machine
• The job executes on the cluster
• No need to install anything extra on your Hadoop cluster
- 5 -
How It Works
[Diagram: compilation pipeline]
Script (A = load, B = filter, C = group, D = foreach)
-> Parser -> Logical Plan
-> Semantic Checks -> Logical Plan
-> Logical Optimizer -> Logical Plan
-> Logical to Physical Translator -> Physical Plan
-> Physical to MR Translator -> Map-Reduce Plan
-> MapReduce Launcher -> Jar to Hadoop
• Logical Plan ≈ relational algebra
• Plan standard optimizations
• Physical Plan = physical operators to be executed
• Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages
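A rough sketch of the last translation step, in plain Python rather than Pig's own code. It assumes a linear list of operator names and treats group as the only shuffle boundary; the function name and the BLOCKING set are hypothetical, and the Combine stage is left out.

```python
# Hypothetical simplification: only "group" forces a shuffle.
BLOCKING = {"group"}

def split_into_stages(operators):
    """Split a linear list of operator names into map and reduce stages."""
    map_stage, reduce_stage = [], []
    seen_blocking = False
    for op in operators:
        if not seen_blocking and op in BLOCKING:
            # The grouping is split in two: the map side emits keyed records
            # ("local rearrange"), the reduce side collects them ("package").
            map_stage.append("local rearrange")
            reduce_stage.append("package")
            seen_blocking = True
        elif seen_blocking:
            reduce_stage.append(op)
        else:
            map_stage.append(op)
    return {"map": map_stage, "reduce": reduce_stage}

plan = split_into_stages(["load", "filter", "group", "foreach"])
print(plan)
```

Everything before the grouping stays in the map stage; everything after it lands in the reduce stage.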
- 6 -
Fragment Replicate Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "replicated";
[Diagram: Pages is split into blocks; every map task also gets a full copy of Users]
Map 1: Pages block 1 + Users
Map 2: Pages block 2 + Users
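The idea can be sketched in plain Python: each map task holds the whole small input in memory and streams one block of the large input past it. This is a simulation under assumed sample data, not Pig's implementation; map_task and the sample rows are made up for illustration.

```python
def map_task(pages_block, users):
    """One map task: join a block of Pages against a full in-memory copy of Users."""
    # Build a lookup table from the replicated (small) input.
    lookup = {}
    for name, age in users:
        lookup.setdefault(name, []).append((name, age))
    # Stream the large input block past the lookup table; no reduce phase needed.
    joined = []
    for user, url in pages_block:
        for match in lookup.get(user, ()):
            joined.append((user, url) + match)
    return joined

# Hypothetical sample data.
users = [("fred", 25), ("jane", 30)]
pages_blocks = [
    [("fred", "a.com"), ("jane", "b.com")],   # block read by Map 1
    [("fred", "c.com"), ("bob", "d.com")],    # block read by Map 2
]
joined = [row for block in pages_blocks for row in map_task(block, users)]
print(joined)
```

The trade-off is memory: the replicated input has to fit in each map task.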
- 7 -
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: Pages and Users are read by map tasks, tagged by input, and shuffled to reducers on the join key]
Map 1: Pages block n, emits (1, user)
Map 2: Users block m, emits (2, name)
Reducer 1: (1, fred) (2, fred) (2, fred)
Reducer 2: (1, jane) (2, jane) (2, jane)
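A small Python simulation of the tagging scheme in the diagram: map tasks tag each record with its input (1 for Pages, 2 for Users), the shuffle groups by join key, and the reducer pairs records with different tags. Function names and sample data are hypothetical; this is a sketch of the technique, not Pig's code.

```python
from collections import defaultdict
from itertools import product

def tag_records(records, tag, key_index):
    """Map side: emit (join key, (input tag, record))."""
    return [(rec[key_index], (tag, rec)) for rec in records]

def shuffle(pairs):
    """Group tagged records by join key, as the shuffle would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_join(tagged):
    """Reduce side: pair every tag-1 record with every tag-2 record."""
    left = [rec for tag, rec in tagged if tag == 1]
    right = [rec for tag, rec in tagged if tag == 2]
    return [l + r for l, r in product(left, right)]

# Hypothetical sample data.
pages = [("fred", "a.com"), ("jane", "b.com"), ("fred", "c.com")]
users = [("fred", 25), ("jane", 30)]

pairs = tag_records(pages, 1, 0) + tag_records(users, 2, 0)
joined = []
for key, tagged in sorted(shuffle(pairs).items()):
    joined.extend(reduce_join(tagged))
print(joined)
```

All records for one key land on one reducer, which is what makes a very frequent key a problem.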
- 8 -
Skew Join
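Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "skewed";
[Diagram: as in the hash join, but each map output passes through SP before the reducers, and the records for fred are spread over both reducers]
Map 1: Pages block n, emits (1, user) -> SP
Map 2: Users block m, emits (2, name) -> SP
Reducer 1: (1, fred, p1) (1, fred, p2) (2, fred)
Reducer 2: (1, fred, p3) (1, fred, p4) (2, fred)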
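One way to picture the diagram in Python: records for a hot key on the Pages side are spread across reducers, and the matching Users record is copied to every reducer that may need it. This sketch assumes the hot keys are already known and passed in; how Pig finds them is not simulated. The round-robin placement, stable_partition, and the sample data are all hypothetical.

```python
from collections import defaultdict

def stable_partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (hypothetical)."""
    return sum(key.encode("utf-8")) % num_reducers

def skew_join(pages, users, hot_keys, num_reducers):
    """Return the joined rows produced by each reducer."""
    pages_at = [defaultdict(list) for _ in range(num_reducers)]
    users_at = [defaultdict(list) for _ in range(num_reducers)]
    next_slot = 0
    for rec in pages:
        key = rec[0]
        if key in hot_keys:
            # Spread a hot key's records over all reducers, round-robin.
            r = next_slot % num_reducers
            next_slot += 1
        else:
            r = stable_partition(key, num_reducers)
        pages_at[r][key].append(rec)
    for rec in users:
        key = rec[0]
        # A hot key's Users record must reach every reducer holding its Pages.
        if key in hot_keys:
            targets = range(num_reducers)
        else:
            targets = [stable_partition(key, num_reducers)]
        for r in targets:
            users_at[r][key].append(rec)
    output = []
    for r in range(num_reducers):
        rows = []
        for key, page_recs in pages_at[r].items():
            for p in page_recs:
                for u in users_at[r].get(key, []):
                    rows.append(p + u)
        output.append(rows)
    return output

# Hypothetical sample data: fred is the skewed key.
pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 30)]
per_reducer = skew_join(pages, users, {"fred"}, 2)
for r, rows in enumerate(per_reducer):
    print(r, rows)
```

Each reducer ends up with two of fred's four page records instead of one reducer taking all four.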
- 9 -
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using "merge";
[Diagram: Pages and Users are both sorted on the join key, aaron ... zach]
Map 1: Pages aaron…amr, Users aaron…
Map 2: Pages amy…barb, Users amy…
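The core of a merge join is a two-pointer walk over inputs that are already sorted on the join key. The Python sketch below does this in a single pass over in-memory lists; it does not model how each map task finds its starting point in Users. The function name and sample data are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of tuples, both sorted on their first field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append(l + r)
            i, j = i_end, j_end
    return out

# Hypothetical sample data, already sorted on the join key.
pages = [("aaron", "a.com"), ("aaron", "b.com"), ("amy", "c.com"), ("zach", "d.com")]
users = [("aaron", 20), ("amr", 31), ("amy", 27)]
joined = merge_join(pages, users)
print(joined)
```

No shuffle or hash table is needed, which is why the inputs must be sorted beforehand.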
- 10 -
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram: data flow]
load users -> filter nulls -> two branches:
  – group by age, gender -> apply UDFs -> store into 'bydemo'
  – group by state -> apply UDFs -> store into 'bystate'
- 11 -
Multi-Store Map-Reduce Plan
[Diagram: one Map-Reduce job serving both stores]
map: filter -> split -> local rearrange (one per branch)
reduce: multiplex -> package (one per branch) -> foreach (one per branch)
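The point of the multi-store plan is that the input is read and filtered once, then fed to both groupings. A plain Python sketch of that single pass, assuming in-memory rows with the five fields from the script; multi_store and the sample data are hypothetical.

```python
from collections import Counter

def multi_store(users):
    """Read the input once, filter once, and feed two independent groupings."""
    by_demo = Counter()    # stands in for the 'bydemo' output
    by_state = Counter()   # stands in for the 'bystate' output
    for name, age, gender, city, state in users:
        if name is None:           # filter A by name is not null
            continue
        by_demo[(age, gender)] += 1   # group by age, gender + COUNT
        by_state[state] += 1          # group by state + COUNT
    return dict(by_demo), dict(by_state)

# Hypothetical sample data.
users = [
    ("fred", 25, "m", "Austin", "TX"),
    (None, 30, "f", "Reno", "NV"),
    ("jane", 25, "m", "Dallas", "TX"),
    ("amy", 40, "f", "Reno", "NV"),
]
by_demo, by_state = multi_store(users)
print(by_demo)
print(by_state)
```

Running the two stores as separate scripts would load and filter users twice.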
- 12 -
Basic User Defined Functions
A = load 'users';
B = group A all;
C = foreach B generate COUNT(A);
Reduce:
long exec(bag b) {
    return b.size();
}
- 13 -
Algebraic User Defined Functions
A = load 'users';
B = group A all;
C = foreach B generate COUNT(A);
Map (Initial):
long exec(tuple t) {
    return 1;
}
Combine (Intermediate):
long exec(bag b) {
    long sum = 0;
    for (long s : b) {
        sum += s;
    }
    return sum;
}
Reduce (Final):
long exec(bag b) {
    long sum = 0;
    for (long s : b) {
        sum += s;
    }
    return sum;
}
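The three stages can be traced in plain Python. This sketch assumes the input is already divided into two map splits; initial, intermediate, and final mirror the Initial, Intermediate, and Final functions on the slide, and the splits are made-up sample data.

```python
def initial(record):
    """Map side: each record contributes a count of 1."""
    return 1

def intermediate(partial_counts):
    """Combine side: add up the partial counts from one map task."""
    return sum(partial_counts)

def final(partial_counts):
    """Reduce side: add up the combined counts from all map tasks."""
    return sum(partial_counts)

# Hypothetical input, already divided into two map splits.
splits = [["a", "b", "c"], ["d", "e"]]

map_outputs = [[initial(rec) for rec in split] for split in splits]
combined = [intermediate(counts) for counts in map_outputs]
total = final(combined)
print(map_outputs, combined, total)
```

Only one number per map task crosses the network, instead of every record.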
- 14 -
Accumulative User Defined Functions
A = load 'users' as (name, url, timestamp);
B = group A by name;
C = foreach B {
    D = order A by timestamp;
    generate SessionAnalysis(D);
}
Reduce:
public interface Accumulator<T> {
    public void accumulate(List<Tuple> b);
    public T getValue();
}
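A Python sketch of the accumulate/getValue pattern: the function sees a group's records in batches and keeps only a small running state, rather than holding the whole bag. SessionCount is a made-up stand-in for SessionAnalysis; the 1800-second timeout and the sample batches are assumptions.

```python
class SessionCount:
    """Hypothetical accumulator: counts sessions in time-ordered records."""

    def __init__(self, timeout=1800):
        self.timeout = timeout   # assumed gap (seconds) that starts a new session
        self.last_ts = None
        self.sessions = 0

    def accumulate(self, batch):
        """Consume one batch of (name, url, timestamp) records."""
        for name, url, ts in batch:
            if self.last_ts is None or ts - self.last_ts > self.timeout:
                self.sessions += 1
            self.last_ts = ts

    def get_value(self):
        """Return the result once all batches have been seen."""
        return self.sessions

# Hypothetical records for one user, ordered by timestamp, in two batches.
batches = [
    [("fred", "a.com", 0), ("fred", "b.com", 100)],
    [("fred", "c.com", 5000), ("fred", "d.com", 5100)],
]
acc = SessionCount()
for batch in batches:
    acc.accumulate(batch)
print(acc.get_value())
```

Memory use depends on the batch size, not on how many records the group holds.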
- 15 -
Performance Tips
• Project early and often
• Use parallel to set the number of reducers
• Filter out nulls before a join
• For integer arithmetic, declare types in the load schema
- 16 -
Performance
[Chart: performance by release: 0.1, 0.2, 0.3, 0.4/0.5, trunk]
- 17 -
Upcoming Features
• Redesign of load and store function interfaces
• Adding outer join to all join types
• UDFs in Python and Ruby
• Changing spilling strategy to avoid running out of memory
• Adding Accumulator interface
- 18 -
Learn More
• Read the online documentation: http://hadoop.apache.org/pig/
• Online tutorials
  – From Yahoo, http://developer.yahoo.com/hadoop/tutorial/
  – From Cloudera, http://www.cloudera.com/hadoop-training
• A couple of Hadoop books include chapters on Pig; search at your favorite bookstore
• Join the mailing lists:
  – [email protected] for user questions
  – [email protected] for developer issues
• Contribute back your work; over 40 people have contributed so far
- 19 -
Questions