ReStore Implementation as an Extension to Pig

Vijay S

Outline

Overview of the Pig Query Compiler

Implementation of ReStore

Experiments



Overview of the Pig Query Compiler

(1) A parser syntactically checks the input query and transforms it into a logical plan, which is a directed acyclic graph (DAG) of logical operators.

(2) A logical optimizer applies optimization rules to this logical plan.

(3) A MapReduce compiler transforms the logical plan into a physical plan and then compiles it into a series of MapReduce jobs, which form a workflow.



Overview of the Pig Query Compiler (Continued)

(4) A MapReduce optimizer applies rules to reduce the number of MapReduce jobs in the workflow.

(5) The Hadoop job manager submits the jobs in the workflow to Hadoop for execution, taking into account the dependencies between them.
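Step (5) amounts to ordering jobs so that each one runs only after the jobs it depends on. A minimal Python sketch of that idea follows. It is not Pig's actual job manager code; the job names and the `submission_order` helper are hypothetical, and a real job manager would submit each job to Hadoop rather than just list it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def submission_order(dependencies):
    """Return job names so that every job comes after the jobs it depends on.

    `dependencies` maps a job name to the set of jobs whose output it reads.
    Raises graphlib.CycleError if the workflow is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-job workflow: job2 reads job1, job3 reads job1 and job2.
workflow = {
    "job1": set(),
    "job2": {"job1"},
    "job3": {"job1", "job2"},
}

order = submission_order(workflow)
print(order)
```

Jobs with no unmet dependencies could also be submitted in parallel; the sketch only shows one valid sequential order.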



Overview of the Pig Query Compiler (Continued)

JobControlCompiler

A component of the Hadoop job manager of Pig.

Its input is a workflow of MapReduce jobs.

After all the MapReduce jobs in the workflow have finished executing, the intermediate outputs of those jobs are deleted.



Implementation of ReStore

The input of ReStore is a workflow of MapReduce jobs. Every physical plan of these jobs passes through two stages: (1) matching with plans in the repository, and (2) generating candidate sub-jobs.

The repository is implemented as a table that contains in every record: (1) a physical plan of a MapReduce job, (2) the filename of the output of this job in HDFS, and (3) statistics about this job.
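The repository record and the matching stage can be sketched in Python as below. This is only an illustration under stated assumptions: a physical plan is reduced to a tuple of operator strings, matching is simplified to exact equality of plans, and the class names, the HDFS path, and the statistics keys are all hypothetical. The slides do not specify how ReStore represents or compares plans.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryEntry:
    plan: tuple          # (1) physical plan, simplified to operator strings
    output_file: str     # (2) filename of the job's output in HDFS
    stats: dict = field(default_factory=dict)  # (3) statistics about the job


class Repository:
    """A table of stored job outputs, looked up by physical plan."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def match(self, plan):
        """Return the first entry whose plan equals `plan`, else None.

        Assumption: exact equality stands in for the real plan matching.
        """
        wanted = tuple(plan)
        for entry in self._entries:
            if entry.plan == wanted:
                return entry
        return None


repo = Repository()
repo.add(RepositoryEntry(
    plan=("load synth_data", "group by field1", "count"),
    output_file="/restore/out_0001",          # hypothetical path
    stats={"input_bytes": 1000, "output_bytes": 10},  # hypothetical keys
))

hit = repo.match(["load synth_data", "group by field1", "count"])
miss = repo.match(["load other_data"])
print(hit.output_file if hit else None, miss)
```

On a hit, the incoming job could be rewritten to read `output_file` instead of recomputing the plan; on a miss, the job runs as usual.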



Experiments

Reusing the Output of Whole Jobs (7.1)

Reusing the Output of Sub-Jobs (7.2)

Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Reusing Sub-Jobs vs. Whole Jobs (7.4)

Effect of Data Reduction (7.5)



Reusing the Output of Whole Jobs (7.1)

Job execution time for queries is much reduced by reusing the outputs of whole jobs, compared to no data reuse (L3 and L11 from PigMix).

Example: PigMix queries L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct, and Union).



Reusing the Output of Sub-Jobs (7.2)

Job execution time for queries is further reduced by reusing the outputs of sub-jobs, compared both to no data reuse and to the run that generates the sub-jobs.





Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Job execution time for queries is further reduced by reusing the outputs of sub-jobs, compared both to no data reuse and to the run that generates the sub-jobs.





Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Total size of input data loaded by different queries:

Query  Input (GB)  HC (GB)  HA (GB)  NH (GB)  Output
L2     150.6       3.1      3.1      6.7      1.1 MB
L3     150.7       3.2      8.2      22.1     62.9 MB
L4     150.6       2        2.8      10.8     34.2 MB
L5     150.7       1.8      4.6      7.4      2 B
L6     150.6       3.7      10.1     24.3     92.7 MB
L7     150.6       2.2      5.4      5.4      1.5 MB
L8     150.6       3.3      3.3      11.4     27 B
L11    173.6       2.6      2.7      2.8      1.6 GB



Reusing Sub-Jobs vs. Whole Jobs (7.4)

Field name  Cardinality  % Selected Data
field6      200          0.5%
field7      100          1%
field8      20           5%
field9      10           10%
field10     5            20%
field11     2            50%
field12     1.6          60%
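The relationship between cardinality and selected data in this table can be checked with a short Python calculation. It assumes, and the slides do not state this, that values are uniformly distributed, so an equality filter on a field with cardinality c keeps 100/c percent of the rows. The `selected_percent` helper is hypothetical.

```python
def selected_percent(cardinality):
    """Percent of rows kept by `field == value` if values are uniform."""
    return 100.0 / cardinality


# Cardinalities copied from the table above (field6 to field11).
fields = [("field6", 200), ("field7", 100), ("field8", 20),
          ("field9", 10), ("field10", 5), ("field11", 2)]

percents = {name: selected_percent(c) for name, c in fields}
for name, pct in percents.items():
    print(name, pct)
```

The field12 row (cardinality 1.6, 60%) does not follow this formula exactly, since 100/1.6 is 62.5, so the uniformity assumption should be treated as approximate for that field.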



Reusing Sub-Jobs vs. Whole Jobs (7.4)

Overhead and speedup of different jobs. The dark line is the speedup.



Effect of Data Reduction (7.5)

Overhead and speedup of different jobs with filter operators.



Effect of Data Reduction (7.5) (Continued)

Query Template QP

A = load '$synth_data' as (field1, ..., field12);
B = foreach A generate field1, ...;
C = group B by (field1, ...);
D = foreach C generate COUNT($1);
store D into '$out';



Effect of Data Reduction (7.5) (Continued)

Query Template QF

A = load '$synth_data' as (field1, ..., field12);
B = filter A by $fieldi == $val;
C = group B by field1;
D = foreach C generate COUNT($1);
store D into '$out';



Related Work

The paper addresses challenges posed by MapReduce, such as massive data sizes and the procedural nature of the query language.

Other work: materialized views and MRShare.

