jason-shao
An introductory presentation that was delivered by a co-worker at the NYC Hadoop Meetup
2. What is PIG?
Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs.
Pig generates and compiles one or more Map/Reduce programs on the
fly.
3. Why PIG?
Ease of programming - It is trivial to achieve parallel execution
of simple, "embarrassingly parallel" data analysis tasks. Complex
tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
4. File Formats
PigStorage
Custom Load / Store Functions
5. Installing PIG
Download / Unpack tarball (pig.apache.org)
Install RPM / DEB package (cloudera.com)
6. Running PIG
Grunt Shell: Enter Pig commands manually using Pig's interactive
shell, Grunt.
Script File: Place Pig commands in a script file and run the
script.
Embedded Program: Embed Pig commands in a host language and run the
program.
7. Run Modes
Local Mode: To run Pig in local mode, you need access to a single
machine.
Hadoop (mapreduce) Mode: To run Pig in Hadoop (mapreduce) mode, you
need access to a Hadoop cluster and an HDFS installation.
8. Sample PIG script
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';
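The script above amounts to splitting each line on ':' and keeping the first field. As a rough illustration of those semantics outside Pig, here is a minimal plain-Java sketch. The class and method names (PasswdIdsSketch, firstFields) and the sample lines are made up for this example, and no Pig API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: mimics what the sample script computes,
// without Pig or Hadoop.
public class PasswdIdsSketch {

    // Split each line on the delimiter and keep field $0, as the FOREACH does.
    static List<String> firstFields(List<String> lines, String delimiter) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            // Pattern.quote treats the delimiter literally; -1 keeps empty fields.
            String[] fields = line.split(Pattern.quote(delimiter), -1);
            ids.add(fields[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample input in passwd format (invented for the example).
        List<String> passwd = Arrays.asList(
                "root:x:0:0:root:/root:/bin/bash",
                "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin");
        for (String id : firstFields(passwd, ":")) {
            System.out.println(id);
        }
    }
}
```

In Pig the same work is spread across map tasks, which is why the three-line script scales to inputs far larger than a single machine could loop over.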
9. Sample Script With Schema
-- Assumes myudfs.jar contains the myudfs.UPPER UDF used below.
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
10. Eval Functions
AVG
CONCAT
COUNT
COUNT_STAR
DIFF
IsEmpty
MAX
MIN
SIZE
SUM
TOKENIZE
11. Math Functions
ABS
ACOS
ASIN
ATAN
CBRT
CEIL
COSH
COS
EXP
FLOOR
LOG
LOG10
RANDOM
ROUND
SIN
SINH
SQRT
TAN
TANH
12. Pig Types
13. Sample CW PIG script
RawInput = LOAD '$INPUT' USING
com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category,
TagId, URL, Impressions;
GroupedInput = GROUP input BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group,
SUM(input.Impressions) as Impressions;
STORE result INTO '$OUTPUT' USING
com.contextweb.pig.CWHeaderStore();
14. Sample PIG script (Filtering)
RawInput = LOAD '$INPUT' USING
com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category,
DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId ==
12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group,
SUM(defFilter.Impressions) as Impressions;
STORE result INTO '$OUTPUT' USING
com.contextweb.pig.CWHeaderStore();
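To make the data flow of the filtering script concrete, the following plain-Java sketch performs the same three steps in memory: keep rows whose DefLevelId is 8 or 12, group by (Category, TagId, URL), and sum Impressions per group. The field names come from the script. The Row class, the rollup method, the field types, and the sample rows are assumptions for illustration, and no Pig API is involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of FILTER + GROUP + SUM from the script.
public class ImpressionRollupSketch {

    // One input record; field types are assumed, not taken from the slides.
    static final class Row {
        final String category;
        final int defLevelId;
        final String tagId;
        final String url;
        final long impressions;

        Row(String category, int defLevelId, String tagId, String url, long impressions) {
            this.category = category;
            this.defLevelId = defLevelId;
            this.tagId = tagId;
            this.url = url;
            this.impressions = impressions;
        }
    }

    static Map<List<String>, Long> rollup(List<Row> rows) {
        Map<List<String>, Long> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            // FILTER ... BY (DefLevelId == 8) or (DefLevelId == 12)
            if (r.defLevelId != 8 && r.defLevelId != 12) {
                continue;
            }
            // GROUP ... BY (Category, TagId, URL): the key is the field triple.
            List<String> key = Arrays.asList(r.category, r.tagId, r.url);
            // SUM(Impressions) within each group.
            totals.merge(key, r.impressions, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("news", 8, "t1", "http://a.example", 10),
                new Row("news", 12, "t1", "http://a.example", 5),
                new Row("news", 3, "t1", "http://a.example", 100),
                new Row("sports", 8, "t2", "http://b.example", 7));
        rollup(rows).forEach((key, total) -> System.out.println(key + " -> " + total));
    }
}
```

The Pig version expresses the same logic declaratively, and Pig decides how to split the filter and the aggregation across map and reduce phases.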
15. What is PIG UDF?
UDF: User Defined Function
Types of UDFs:
Eval Functions (extends EvalFunc)
Aggregate Functions (extends EvalFunc implements Algebraic)
Filter Functions (extends FilterFunc)
UDFContext
Allows UDFs to get access to the JobConf object
Allows UDFs to pass configuration information between
instantiations of the UDF on the front end and the back end.
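Aggregate functions that implement Algebraic are evaluated in stages (initial, intermediate, final), so partial results can be combined before the final value is produced. The plain-Java sketch below illustrates that idea for an average. It does not implement the Pig interfaces, and every name in it (AlgebraicAvgSketch, Partial, initial, intermediate, finish) is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of staged (algebraic) aggregation for an average.
public class AlgebraicAvgSketch {

    // Partial result passed between stages: a running sum and count.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Initial stage: turn a single input value into a partial result.
    static Partial initial(double value) {
        return new Partial(value, 1);
    }

    // Intermediate stage: merge partial results; can be applied repeatedly.
    static Partial intermediate(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    // Final stage: merge what is left and produce the average (null if empty).
    static Double finish(List<Partial> partials) {
        Partial total = intermediate(partials);
        if (total.count == 0) {
            return null;
        }
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        // Two chunks of input, combined separately as two tasks might do.
        Partial chunkA = intermediate(Arrays.asList(initial(1.0), initial(2.0)));
        Partial chunkB = intermediate(Arrays.asList(initial(3.0), initial(4.0)));
        System.out.println(finish(Arrays.asList(chunkA, chunkB))); // prints 2.5
    }
}
```

Carrying (sum, count) pairs rather than averages is what makes the merge order irrelevant, which is the property an algebraic aggregate relies on.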
16. Sample UDF
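package com.contextweb.pig.udf;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// Eval UDF: returns the top-level domain of its first argument.
public class TopLevelDomain extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        Object o = tuple.get(0);
        if (o == null) {
            return null; // null input yields null output
        }
        // Validator is a helper class not shown in these slides.
        return Validator.getTLD(o.toString());
    }
}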
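The Validator.getTLD helper called by the UDF is not shown in the slides. As a guess at the kind of logic such a helper might contain, the sketch below returns the label after the last dot of a host name. The class name TldSketch and the method getTld are hypothetical, and the real helper may behave differently (for example, by consulting a list of known suffixes).

```java
import java.util.Locale;

// Hypothetical stand-in for the unseen Validator.getTLD helper.
public class TldSketch {

    // Returns the label after the last dot, lower-cased, or null if none.
    static String getTld(String domain) {
        if (domain == null) {
            return null;
        }
        String host = domain.trim().toLowerCase(Locale.ROOT);
        // Ignore a single trailing dot, as in "example.org.".
        if (host.endsWith(".")) {
            host = host.substring(0, host.length() - 1);
        }
        int lastDot = host.lastIndexOf('.');
        // No dot, or nothing after the last dot: no TLD to report.
        if (lastDot < 0 || lastDot == host.length() - 1) {
            return null;
        }
        return host.substring(lastDot + 1);
    }

    public static void main(String[] args) {
        System.out.println(getTld("www.Example.COM"));
        System.out.println(getTld("localhost"));
    }
}
```

Returning null for unusable input mirrors the UDF above, which passes null through rather than throwing.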
17. UDF In Action
REGISTER '$WORK_DIR/pig-support.jar';
DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
AA = FOREACH input GENERATE TagId,
getTopLevelDomain(PublisherDomain) AS RootDomain;
18. Resources
Apache PIG http://pig.apache.org/
Apache Hadoop http://hadoop.apache.org/
Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation
19. PIG DEMO