LOGO
www.nordridesign.com
Outline
Overview of the Pig Query Compiler
Implementation of ReStore
Experiments
Overview of the Pig Query Compiler
(1) A parser syntactically checks the input query and transforms it into a logical plan, which is a directed acyclic graph (DAG) of logical operators.
(2) A logical optimizer applies optimization rules to this logical plan.
(3) A MapReduce compiler transforms the logical plan into a physical plan and then compiles it into a series of MapReduce jobs, which form a workflow.
Overview of the Pig Query Compiler - Continued
(4) A MapReduce optimizer applies rules to reduce the number of MapReduce jobs in the workflow.
(5) The Hadoop job manager submits the jobs in a workflow to Hadoop for execution, taking into account the dependencies between them.
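Step (5) amounts to running the jobs of a workflow in an order that respects their dependencies. The following is a minimal sketch in Python, assuming a hypothetical three-job workflow; the job names and the dependency map are invented for illustration, and this is not Pig's actual job manager code.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs whose output it reads.
workflow = {
    "job1": set(),
    "job2": {"job1"},
    "job3": {"job1", "job2"},
}

def submission_order(deps):
    """Return one order in which the jobs can be submitted so that every job
    runs only after all the jobs it depends on."""
    return list(TopologicalSorter(deps).static_order())

order = submission_order(workflow)
print(order)
```

A topological order is only one possible scheduling policy; jobs with no dependency between them could also be submitted in parallel.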
Overview of the Pig Query Compiler - Continued
JobControlCompiler is a component of the Hadoop job manager of Pig.
Its input is a workflow of MapReduce jobs.
The intermediate outputs passed between jobs are deleted once all the MapReduce jobs in the workflow have finished executing.
Implementation of ReStore
The input of ReStore is a workflow of MapReduce jobs. Every physical plan of these jobs passes through two stages: (1) matching with plans in the repository, and (2) generating candidate sub-jobs.
The repository is implemented as a table that contains in every record: (1) a physical plan of a MapReduce job, (2) the filename of the output of this job in HDFS, and (3) statistics about this job.
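The repository record and the matching stage can be sketched as follows. This is a simplified illustration under loud assumptions: a physical plan is modeled as a flat sequence of operator strings rather than a DAG, matching is reduced to finding the longest stored plan that is a prefix of the incoming plan, and all class names, paths, and statistics are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RepositoryEntry:
    plan: tuple                                  # (1) physical plan of a MapReduce job
    output_file: str                             # (2) filename of the job's output in HDFS
    stats: dict = field(default_factory=dict)    # (3) statistics about the job

class Repository:
    def __init__(self):
        self.entries = []

    def add(self, plan, output_file, **stats):
        self.entries.append(RepositoryEntry(tuple(plan), output_file, stats))

    def match(self, plan):
        """Return the entry with the longest stored plan that is a prefix
        of `plan`, or None if no stored plan matches (simplified matching)."""
        plan = tuple(plan)
        best = None
        for entry in self.entries:
            n = len(entry.plan)
            if plan[:n] == entry.plan and (best is None or n > len(best.plan)):
                best = entry
        return best

def rewrite(plan, repo):
    """Replace the matched part of the plan with a load of the stored output."""
    entry = repo.match(plan)
    if entry is None:
        return list(plan)
    return [f"load {entry.output_file}"] + list(plan[len(entry.plan):])

repo = Repository()
# Hypothetical stored job; the path and the statistic are made up.
repo.add(["load A", "filter x"], "/restore/out1", runtime_s=120)

rewritten = rewrite(["load A", "filter x", "group by k"], repo)
print(rewritten)
```

Preferring the longest match is a design choice of this sketch: it reuses as much stored work as possible when several stored plans match.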
Experiments
Reusing the Output of Whole Jobs (7.1)
Reusing the Output of Sub-Jobs (7.2)
Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
Reusing Sub-Jobs vs. Whole Jobs (7.4)
Effect of Data Reduction (7.5)
Reusing the Output of Whole Jobs (7.1)
Reusing the output of whole jobs greatly reduces query execution time compared to no data reuse (e.g. L3 and L11 in PigMix).
Example: PigMix queries L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct and Union).
Reusing the Output of Sub-Jobs (7.2)
Job execution time for queries is further reduced by reusing the output of sub-jobs, compared to no data reuse and to generating sub-jobs.
Example: PigMix queries L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct and Union).
Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
Total size of input data loaded by the different queries:
Q     I/P (GB)   HC (GB)   HA (GB)   NH (GB)   O/P
L2    150.6      3.1       3.1       6.7       1.1 MB
L3    150.7      3.2       8.2       22.1      62.9 MB
L4    150.6      2         2.8       10.8      34.2 MB
L5    150.7      1.8       4.6       7.4       2 B
L6    150.6      3.7       10.1      24.3      92.7 MB
L7    150.6      2.2       5.4       5.4       1.5 MB
L8    150.6      3.3       3.3       11.4      27 B
L11   173.6      2.6       2.7       2.8       1.6 GB
Reusing Sub-Jobs vs. Whole Jobs (7.4)
Field name   Cardinality   % Selected Data
field6       200           0.5%
field7       100           1%
field8       20            5%
field9       10            10%
field10      5             20%
field11      2             50%
field12      1.6           60%
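For most rows, the percentage of selected data equals 100 divided by the cardinality, which is what an equality filter on a single value would select if the values were uniformly distributed. That uniformity is an assumption made here for illustration, not something the slides state. A quick check in Python:

```python
# (field name, cardinality, % selected data) as listed in the table above
rows = [
    ("field6", 200, 0.5),
    ("field7", 100, 1),
    ("field8", 20, 5),
    ("field9", 10, 10),
    ("field10", 5, 20),
    ("field11", 2, 50),
    ("field12", 1.6, 60),
]

def expected_percent(cardinality):
    """Percent of rows matching one value, assuming uniformly distributed values."""
    return 100.0 / cardinality

for name, cardinality, listed in rows:
    print(f"{name}: listed {listed}%, 100/cardinality = {expected_percent(cardinality):g}%")
```

Only field12 deviates from this rule: 100/1.6 is 62.5, while the table lists 60%.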
Reusing Sub-Jobs vs. Whole Jobs (7.4) - Continued
Figure: overhead and speedup of different jobs; the dark line is the speedup.
Effect of Data Reduction (7.5)
Figure: overhead and speedup of different jobs with filter operators.
Effect of Data Reduction (7.5) - Continued
Query Template QP
A = load '$synth_data' as (field1, ..., field12);
B = foreach A generate field1, ...;
C = group B by (field1, ...);
D = foreach C generate COUNT($1);
store D into '$out';
Effect of Data Reduction (7.5) - Continued
Query Template QF
A = load '$synth_data' as (field1, ..., field12);
B = filter A by $fieldi == $val;
C = group B by field1;
D = foreach C generate COUNT($1);
store D into '$out';
Related Work
The paper addresses challenges posed by MapReduce, such as massive data sizes and the procedural nature of the query language.
Other work: materialized views and MRShare.