
Apache Pig ETL Features

Is Pig an ETL Tool?

• Pig is a data flow language.

• It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.

• Best of all, it supports many relational features, making it easy to join, group, and aggregate data.

• If you think this sounds a lot like an ETL tool, you’d be right.

• Pig has many things in common with ETL tools, if those ETL tools ran on many servers simultaneously.

Case 1 – Time Sensitive Data Loads

• Data comes in from outside of the database as

– text,

– XML,

– CSV, or

– some other arbitrary file format

• The data then has to be processed into a different format and loaded into a database for later querying.

• Sometimes there are a lot of steps involved, and sometimes the data has to be translated into an intermediate format.

• An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem.
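
The load described in this case can be sketched in Pig Latin. This is only a minimal sketch: the input and output paths, the field names, and the choice of tab-separated output as the intermediate format are hypothetical and not taken from the slides.

```pig
-- Hypothetical input path and field names, chosen for illustration only.
raw = LOAD 'incoming/orders.csv' USING PigStorage(',')
      AS (order_id:chararray, customer:chararray, amount:double);

-- Values that cannot be parsed as a double are loaded as null; drop those rows.
valid = FILTER raw BY amount IS NOT NULL;

-- Reshape the rows into the intermediate format expected downstream.
shaped = FOREACH valid GENERATE order_id, UPPER(customer) AS customer, amount;

-- Write tab-separated files that a database bulk loader could pick up.
STORE shaped INTO 'staging/orders_tsv' USING PigStorage('\t');
```

The script stays the same as the cluster grows, because Pig decides how to spread the work across the available servers.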

Case 2 – Processing Many Data Sources

• Consider combining advertising information from multiple sources and mixing it together with

– web server traffic,

– IP geo-location, and

– click-through metrics

• This makes it possible to gain a deeper understanding of customer behavior and judge just how effective certain ads are in certain parts of the country.

• Pig isn’t just designed to scale out over many servers.

• Pig can be used to build complex data flows and extend them with custom code.

• A job can be written to

– collect web server logs,

– use external programs to fetch geo-location data for the users’ IP addresses, and

– join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department

• The result is a rich view of user behavior overlaid with advertising effectiveness.
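
A job of this shape might look roughly like the following sketch. Every path, field name, and the geolocate.py script are hypothetical. The script is assumed to echo each input row back with a region column appended, and the advertising spreadsheet is assumed to have been exported as CSV first. The web analytic CSV would be loaded and joined in the same way.

```pig
-- Hypothetical external program, shipped to the cluster with the job.
DEFINE geolocate `python geolocate.py` SHIP('geolocate.py');

logs = LOAD 'weblogs' USING PigStorage('\t')
       AS (ip:chararray, url:chararray);

-- Stream every log row through the external program.
located = STREAM logs THROUGH geolocate
          AS (ip:chararray, url:chararray, region:chararray);

-- Click maps stored as JSON (hypothetical schema).
clicks = LOAD 'clickmaps.json'
         USING JsonLoader('url:chararray, clicks:long');

-- Advertising spreadsheet, assumed to be exported as CSV.
ads = LOAD 'ad_spend.csv' USING PigStorage(',')
      AS (region:chararray, campaign:chararray, spend:double);

-- Join traffic to click maps by URL, then to ad spend by region.
with_clicks = JOIN located BY url, clicks BY url;
full_view   = JOIN with_clicks BY located::region, ads BY region;

STORE full_view INTO 'output/ad_effectiveness' USING PigStorage('\t');
```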

Case 3 – Analytic Insight Through Sampling

• Even in case 2, we’ve seen how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter.

• It’s easy to fall into the trap of thinking that Pig is an ETL glue that moves data from a log file, processes it, and drops it off for another database to consume.

• Pig is more than just an ETL tool.

• One of Pig’s strengths is its ability to perform sampling of large data sets.

• As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling.

• By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyzed and still deliver meaningful results.
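
As a minimal sketch of sampling, Pig's SAMPLE operator keeps a random fraction of the rows. The input path, field names, and the 1% sampling rate below are assumptions made for illustration.

```pig
-- Hypothetical input path and field names.
events = LOAD 'events' USING PigStorage('\t')
         AS (user_id:chararray, amount:double);

-- Keep roughly 1% of the rows, chosen at random.
sampled = SAMPLE events 0.01;

-- Compute summary statistics over the sample only.
everything = GROUP sampled ALL;
stats = FOREACH everything GENERATE
            COUNT(sampled) AS rows_sampled,
            AVG(sampled.amount) AS avg_amount;

DUMP stats;
```

Because the sample is random, the number of rows kept varies from run to run, and the average is an estimate rather than an exact figure.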

Summing Up

• Pig isn’t a replacement for SQL Server Integration Services.

• Their use cases overlap for many tasks, but they also solve very different problems.

• Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance.

• On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package.

• In either situation you should pick the best tool for the job.

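Mapping SQL to Pig

Each entry names the SQL clause, its Pig counterpart, and an example of both.

• SQL: From table. Pig: Load file(s).

– SQL: from X;

– Pig: A = load 'mydata' using PigStorage('\t') as (col1, col2, col3);

• SQL: Select. Pig: Foreach … generate.

– SQL: select col1 + col2, col3 …

– Pig: B = foreach A generate col1 + col2, col3;

• SQL: Where. Pig: Filter.

– SQL: select col1 + col2, col3 from X where col2 > 2;

– Pig: C = filter A by col2 > 2;
       C2 = foreach C generate col1 + col2, col3;
  (The filter runs on A, before the projection, because B no longer has a col2 field.)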

• SQL: Group by. Pig: Group + foreach … generate.

– SQL: select col1, col2, sum(col3) from X group by col1, col2;

– Pig: D = group A by (col1, col2);
       E = foreach D generate flatten(group), SUM(A.col3);

• SQL: Having. Pig: Filter.

– SQL: select col1, sum(col2) from X group by col1 having sum(col2) > 5;

– Pig: F = filter E by $2 > 5;
  (In E, $0 and $1 are the flattened group keys and $2 is the sum.)

• SQL: Order by. Pig: Order … by.

– SQL: select col1, sum(col2) from X group by col1 order by col1;

– Pig: H = ORDER E by $0;


• SQL: Distinct. Pig: Distinct.

– SQL: select distinct col1 from X;

– Pig: I = foreach A generate col1;
       J = distinct I;

• SQL: Distinct aggregate. Pig: Distinct in foreach.

– SQL: select col1, count(distinct col2) from X group by col1;

– Pig: G = group A by col1;
       K = foreach G {
           L = distinct A.col2;
           generate group, COUNT(L); }


• SQL: Join. Pig: Cogroup + flatten (also shortcut: JOIN).

– SQL: select A.col1, B.col3 from A join B using (col1);

– Pig:
  A = load 'data1' using PigStorage('\t') as (col1, col2);
  B = load 'data2' using PigStorage('\t') as (col1, col3);
  C = cogroup A by col1 inner, B by col1 inner;
  D = foreach C generate flatten(A), flatten(B);
  E = foreach D generate A::col1, B::col3;
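
The JOIN shortcut mentioned in this entry is not shown on the slide. A minimal sketch of it, reusing the same two hypothetical inputs, could look like this. The aliases J and R are chosen here for illustration.

```pig
A = load 'data1' using PigStorage('\t') as (col1, col2);
B = load 'data2' using PigStorage('\t') as (col1, col3);

-- Inner join on col1; the result has fields A::col1, A::col2, B::col1, B::col3.
J = join A by col1, B by col1;

-- Keep only the two columns selected by the SQL query.
R = foreach J generate A::col1, B::col3;
```

The `::` prefix is needed after a join because both inputs contribute a field named col1.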


End of session

Day – 3: Apache Pig ETL Features