Apache Pig ETL Features
Is Pig an ETL Tool?
• Pig is a data flow language.
• It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
• Best of all, it supports many relational features, making it easy to join, group, and aggregate data.
• If you think this sounds a lot like an ETL tool, you’d be right.
• Pig has many things in common with ETL tools, if those ETL tools ran on many servers simultaneously.
Case 1 – Time Sensitive Data Loads
• Data comes in from outside the database as
– text,
– XML,
– CSV, or
– some other arbitrary file format
• The data then has to be processed into a different format and loaded into a database for later querying.
• Sometimes many steps are involved; sometimes the data has to be translated into an intermediate format first.
• An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem.
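A minimal Pig Latin sketch of such a load, assuming a hypothetical CSV feed of orders. The paths, field names, and cleanup rule are illustrative assumptions, not part of the original slides.

```pig
-- Hypothetical input path, output path, and fields.
raw = LOAD 'incoming/orders' USING PigStorage(',')
      AS (order_id:long, customer:chararray, amount:double, order_date:chararray);

-- Values that cannot be cast to the declared type become null; drop those records.
valid = FILTER raw BY order_id IS NOT NULL AND amount IS NOT NULL;

-- Reshape into the layout the target database is assumed to expect.
shaped = FOREACH valid GENERATE order_id, UPPER(customer) AS customer, amount, order_date;

-- Write tab-delimited output for a bulk loader to pick up.
STORE shaped INTO 'staging/orders' USING PigStorage('\t');
```

The script itself does not name a server count. Pig compiles it into Hadoop jobs, so the same script can run on a larger cluster unchanged.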
Case 2 – Processing Many Data Sources
• Advertising information from multiple sources can be combined with
– web server traffic,
– IP geo-location, and
– click-through metrics.
• This makes it possible to gain a deeper understanding of customer behavior and to judge how effective certain ads are in certain parts of the country.
• Pig isn’t just designed to scale out over many servers.
• Pig can be used to build complex data flows and extend them with custom code.
• A single job can be written to
– collect web server logs,
– use external programs to fetch geo-location data for the users’ IP addresses, and
– join the new set of geo-located web traffic to click maps stored as JSON, web analytics data in CSV format, and spreadsheets from the advertising department
• in order to build a rich view of user behavior overlaid with advertising effectiveness.
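The job described above could be sketched roughly as follows. Everything named here is an assumption made for illustration: the paths, the field layouts, and the external script geolocate.py (imagined as an executable that reads tab-delimited log lines on stdin and writes them back with a region column appended). The advertising spreadsheets are left out, since they would first need to be exported to a format Pig can load.

```pig
-- Hypothetical external program, shipped to the cluster with the job.
DEFINE geolocate `geolocate.py` SHIP('geolocate.py');

logs = LOAD 'weblogs' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, ts:chararray);

-- Send every log record through the external program.
located = STREAM logs THROUGH geolocate
          AS (ip:chararray, url:chararray, ts:chararray, region:chararray);

-- Click maps stored as JSON, web analytics stored as CSV (assumed schemas).
clickmaps = LOAD 'clickmaps' USING JsonLoader('url:chararray, clicks:long');
analytics = LOAD 'analytics' USING PigStorage(',') AS (url:chararray, visits:long);

-- Join the three sources on the page URL.
joined = JOIN located BY url, clickmaps BY url, analytics BY url;

-- After a join, fields are disambiguated as relation::field.
view = FOREACH joined GENERATE located::region AS region, located::url AS url,
                               clickmaps::clicks AS clicks, analytics::visits AS visits;

STORE view INTO 'ad_effectiveness';
```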
Case 3 – Analytic Insight Through Sampling
• Even in case 2, we’ve seen how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter.
• It’s easy to fall into the trap of thinking that Pig is an ETL glue that moves data from a log file, processes it, and drops it off for another database to consume.
• Pig is more than just an ETL tool.
• One of Pig’s strengths is its ability to perform sampling of large data sets.
• As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling.
• By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyzed and still deliver meaningful results.
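As a small illustration of sampling, Pig Latin provides a SAMPLE operator that keeps each row with a given probability. The path, fields, and 1% fraction below are assumptions chosen for the sketch.

```pig
-- 'weblogs' and the field names are hypothetical.
logs = LOAD 'weblogs' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, bytes:long);

-- Keep roughly 1% of the rows, chosen at random.
sampled = SAMPLE logs 0.01;

-- Analyze the reduced data set instead of the full one.
by_url = GROUP sampled BY url;
stats = FOREACH by_url GENERATE group AS url,
                                COUNT(sampled) AS hits,
                                AVG(sampled.bytes) AS avg_bytes;
DUMP stats;
```

Because SAMPLE is probabilistic, the number of rows kept varies from run to run. Counts taken from the sample, such as hits, have to be scaled by the sampling fraction to estimate totals, while averages can be read directly.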
Summing Up
• Pig isn’t a replacement for SQL Server Integration Services.
• Their use cases overlap for many tasks, but they also solve very different problems.
• Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance.
• On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package.
• In either situation you should pick the best tool for the job.
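Mapping SQL to Pig
• SQL: From table. Pig: Load file(s).
– SQL: from X;
– Pig: A = load 'mydata' using PigStorage('\t') as (col1, col2, col3);
• SQL: Select. Pig: Foreach … generate.
– SQL: select col1 + col2, col3 …
– Pig: B = foreach A generate col1 + col2, col3;
• SQL: Where. Pig: Filter.
– SQL: select col1 + col2, col3 from X where col2 > 2;
– Pig (filter A first, because B no longer carries col2; compare against the number 2, not the string '2'):
  C = filter A by col2 > 2;
  C2 = foreach C generate col1 + col2, col3;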
Mapping SQL to Pig
• SQL: Group by. Pig: Group + foreach … generate.
– SQL: select col1, col2, sum(col3) from X group by col1, col2;
– Pig:
  D = group A by (col1, col2);
  E = foreach D generate flatten(group), SUM(A.col3);
• SQL: Having. Pig: Filter.
– SQL: select col1, sum(col2) from X group by col1 having sum(col2) > 5;
– Pig:
  G = group A by col1;
  S = foreach G generate group as col1, SUM(A.col2) as total;
  F = filter S by total > 5;
• SQL: Order by. Pig: Order … by.
– SQL: select col1, sum(col2) from X group by col1 order by col1;
– Pig: H = order S by col1;
Mapping SQL to Pig
• SQL: Distinct. Pig: Distinct.
– SQL: select distinct col1 from X;
– Pig:
  I = foreach A generate col1;
  J = distinct I;
• SQL: Distinct aggregate. Pig: Distinct in foreach.
– SQL: select col1, count(distinct col2) from X group by col1;
– Pig:
  M = group A by col1;
  K = foreach M {
      L = distinct A.col2;
      generate group, COUNT(L); }
Mapping SQL to Pig
• SQL: Join. Pig: Cogroup + flatten (also shortcut: JOIN).
– SQL: select A.col1, B.col3 from A join B using (col1);
– Pig:
  A = load 'data1' using PigStorage('\t') as (col1, col2);
  B = load 'data2' using PigStorage('\t') as (col1, col3);
  C = cogroup A by col1 inner, B by col1 inner;
  D = foreach C generate flatten(A), flatten(B);
  E = foreach D generate A::col1, B::col3;
  -- shortcut for C and D: D = join A by col1, B by col1;
End of session
Day – 3: Apache Pig ETL Features