
Introduction to Apache Pig


DESCRIPTION

An introductory presentation that was delivered by a co-worker at the NYC Hadoop Meetup


1. Introduction To PIG
The evolution of data processing frameworks

2. What is PIG?
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
Pig generates and compiles Map/Reduce programs on the fly.
3. Why PIG?
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
4. File Formats
PigStorage
Custom Load / Store Functions
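A rough sketch of both options in Pig Latin; 'logs.tsv' and its schema are made up for illustration, and the custom loader shown is the CWHeaderLoader that appears in the sample scripts later in this deck:
-- built-in PigStorage with an explicit field delimiter (tab is the default)
logs = LOAD 'logs.tsv' USING PigStorage('\t') AS (url:chararray, impressions:int);
STORE logs INTO 'logs_out' USING PigStorage(',');
-- a custom load function (see the CW sample scripts below)
raw = LOAD 'wide_data' USING com.contextweb.pig.CWHeaderLoader('schema/wide.xml');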
5. Installing PIG
Download / Unpack tarball (pig.apache.org)
Install RPM / DEB package (cloudera.com)
6. Running PIG
Grunt Shell: Enter Pig commands manually using Pig's interactive shell, Grunt (a short sketch follows this list).
Script File: Place Pig commands in a script file and run the script.
Embedded Program: Embed Pig commands in a host language and run the program.
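As a rough sketch of the Grunt route, reusing the passwd example from slide 8 (grunt> is the prompt printed by the interactive shell):
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;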
7. Run Modes
Local Mode: To run Pig in local mode, you need access to a single machine.
Hadoop (MapReduce) Mode: To run Pig in Hadoop (MapReduce) mode, you need access to a Hadoop cluster and an HDFS installation.
8. Sample PIG script
-- load a passwd-style file, fields separated by ':'
A = load 'passwd' using PigStorage(':');
-- keep only the first field and call it id
B = foreach A generate $0 as id;
store B into 'id.out';
9. Sample Script With Schema
-- assumes the jar containing the myudfs.UPPER UDF has been REGISTERed beforehand
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
10. Eval Functions
AVG
CONCAT
COUNT
COUNT_STAR
DIFF
IsEmpty
MAX
MIN
SIZE
SUM
TOKENIZE
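A small sketch of a few of these in action; the 'grades' relation and its schema are made up for illustration:
grades = LOAD 'grades' AS (name:chararray, score:double);
by_name = GROUP grades BY name;
stats = FOREACH by_name GENERATE group AS name, COUNT(grades) AS n,
        AVG(grades.score) AS avg_score, MAX(grades.score) AS best_score;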
11. Math Functions
ABS
ACOS
ASIN
ATAN
CBRT
CEIL
COSH
COS
EXP
FLOOR
LOG
LOG10
RANDOM
ROUND
SIN
SINH
SQRT
TAN
TANH
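Again a sketch only; the 'measurements' relation and its single field are hypothetical:
m = LOAD 'measurements' AS (x:double);
r = FOREACH m GENERATE x, SQRT(x) AS root, ROUND(x) AS nearest, LOG10(x) AS log_x;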
12. Pig Types
13. Sample CW PIG script
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = foreach RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions;
GroupedInput = GROUP input BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
14. Sample PIG script (Filtering)
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = foreach RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
15. What is PIG UDF?
UDF - User Defined Function
Types of UDFs:
Eval Functions (extends EvalFunc)
Aggregate Functions (extends EvalFunc implements Algebraic)
Filter Functions (extends FilterFunc)
UDFContext
Allows UDFs to get access to the JobConf object
Allows UDFs to pass configuration information between instantiations of the UDF on the front end and back end.
16. Sample UDF
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Extracts the top-level domain from a domain-name string
public class TopLevelDomain extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        Object o = tuple.get(0);
        if (o == null) {
            return null;
        }
        // Validator is a project-specific helper class
        return Validator.getTLD(o.toString());
    }
}
17. UDF In Action
REGISTER '$WORK_DIR/pig-support.jar';
DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
AA = foreach input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;
18. Resources
Apache PIG http://pig.apache.org/
Apache Hadoop http://hadoop.apache.org/
Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation
19. PIG DEMO