Previously …
• (Traditional) databases are not Swiss Army knives
• Large data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable distributed data processing applications
• Storing large data requires redesign from the ground up, i.e. the filesystem (HDFS)
Previously …
• HDFS: A reliable open source distributed file system
• HBase: A sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)
MapReduce is great but …
Got to write all this for a WordCount!!!
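For reference, the three phases a hand-written WordCount has to spell out can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Collect all values that share a key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # Sum the ones emitted for a single word.
    return word, sum(counts)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

In a real job each phase also needs job configuration, input/output formats and packaging, which is where most of the boilerplate comes from.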
MapReduce
• Development cycles too long
  – Writing code
  – Packaging code
• JOINs on large data too hard to implement in MapReduce
• Today’s class: Keeping it Simple
  – Can we hide MapReduce from users behind an abstraction?
Pig
• Started in Fall 2007 at Yahoo!
• Simplify MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers barrier to entry for large data processing
• Today: Runs 40% of Yahoo!’s large data jobs
• Who else: Twitter, LinkedIn, AOL, …
• Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)
Pig = Query Language + Interpreter
• Language: Pig Latin
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Interpreter: Grunt
  – An execution environment to convert Pig Latin to MapReduce
• Two modes
  – Local: JVM
  – Distributed: via Hadoop
Pig Latin
Example from Pittsburgh Hadoop Users Group
Equivalent MapReduce code
Pig Latin from an Example
• Find users who visit “good” pages
(Example courtesy: Yahoo! Research)
Conceptual Dataflow
Pig Latin script
Pig Latin: The Language
• Structure
  – Collection of STATEMENTS
  – Statement has an OPERATOR and ends in ';'
Summary of Pig Latin Operators

Category                  Operators
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT
LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';
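The difference between loading with and without a schema can be sketched in Python. This is a rough model, assuming tab-delimited input (Pig's default delimiter); the `load` helper and the sample rows are invented for illustration.

```python
def load(lines, schema=None):
    """Split tab-delimited lines; apply (name, cast) pairs if a schema is given."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if schema is None:
            # No schema: fields stay untyped and are addressable by position only.
            rows.append(tuple(fields))
        else:
            # Schema: each field gets a name and is cast to the declared type.
            rows.append({name: cast(value)
                         for (name, cast), value in zip(schema, fields)})
    return rows

SCHEMA = [("year", int), ("temperature", int), ("quality", int)]
sample = ["1950\t0\t1", "1950\t22\t1", "1949\t111\t-1"]  # made-up rows

typed = load(sample, SCHEMA)   # named, typed fields
untyped = load(sample)         # positional, uninterpreted fields
```

With a schema, later statements can refer to `year` or `quality` by name; without one, only positional references such as `$0` are available.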
FILTER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
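In Python terms, FILTER keeps whole rows that satisfy a predicate and FOREACH … GENERATE projects fields out of each row. A minimal sketch, using made-up (year, temperature, quality) tuples:

```python
# Made-up sample rows: (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, -1), (1949, 78, -2)]

# FILTER records BY quality < 0: keep rows, drop nothing from inside them.
bad_records = [r for r in records if r[2] < 0]

# FOREACH bad_records GENERATE year: keep one field of every row.
bad_years = [(year,) for (year, _temperature, _quality) in bad_records]
```

Note that the projection does not deduplicate: a year appears once per bad record.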
STREAM
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f 1,3`;
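One detail worth keeping in mind when swapping a FOREACH projection for an external command: Pig's positional references count from zero, while `cut -f` counts fields from one. The sketch below mimics both conventions in plain Python; `pig_project` and `cut_fields` are illustrative helpers, not Pig or shell APIs.

```python
def pig_project(row, positions):
    # Pig positional references ($0, $2, ...) are zero-based.
    return tuple(row[p] for p in positions)

def cut_fields(line, fields):
    # cut -f numbers fields from 1 and splits on tabs by default.
    parts = line.split("\t")
    return "\t".join(parts[f - 1] for f in fields)

row = ("1950", "22", "1")
via_pig = pig_project(row, [0, 2])           # like GENERATE $0, $2
via_cut = cut_fields("\t".join(row), [1, 3])  # like cut -f 1,3
```

Both select the first and third fields of the row.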
JOIN
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>>     AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE profit, records::year;
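An inner equi-join like the one above can be modeled as a hash join: index one input by the join key, then probe it with each row of the other. A small sketch with invented data; `inner_join` is a hypothetical helper, and Pig's actual join strategies are not shown here.

```python
# Made-up sample rows.
records = [(1949, 111, 1), (1950, 22, 1), (1951, 5, 1)]  # (year, temperature, quality)
sales = [(1949, 10.5), (1950, -3.0), (1950, 4.0)]        # (year, profit)

def inner_join(left, right, left_key, right_key):
    # Build an index of the right input by its join key.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Concatenate every matching pair; unmatched rows are dropped.
    return [l + r for l in left for r in index.get(l[left_key], [])]

combined = inner_join(records, sales, 0, 0)

# Each joined row carries both year columns (positions 0 and 3),
# so a projection has to say which one it means.
profit_year = [(row[4], row[0]) for row in combined]
```

Because both inputs contribute a `year` column, the joined relation holds two of them, and a projection must pick one explicitly.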
GROUP
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
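GROUP does not aggregate by itself: it produces one tuple per distinct key, holding the key and a bag of all the original rows with that key. A Python sketch of that shape, with made-up rows and a hypothetical `group_by` helper that accepts either a field lookup or a computed key:

```python
from collections import defaultdict

def group_by(rows, key_fn):
    # One (key, bag) pair per distinct key; the bag keeps the full rows.
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return sorted(bags.items())

# Made-up sample rows: (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, -1)]

by_quality = group_by(records, lambda r: r[2])                         # key is a field
by_label = group_by(records, lambda r: "bad" if r[2] < 0 else "good")  # key is an expression
```

Aggregates such as COUNT or AVG are then applied to each bag in a following FOREACH.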
ORDER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
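In the ORDER statement above, DESC applies only to `quality`; `year` keeps the default ascending order. The same mixed ordering in Python, on made-up rows:

```python
# Made-up sample rows: (year, temperature, quality)
records = [(1950, 0, 1), (1949, 111, -1), (1950, 22, 2), (1949, 78, 1)]

# Ascending by year, then descending by quality within each year.
# Negating the numeric quality flips its sort direction.
ordered = sorted(records, key=lambda r: (r[0], -r[2]))
```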
Parallelism
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
PARALLEL can be used on any operator that starts a reduce phase (e.g. GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER)
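PARALLEL sets the number of reduce tasks. Conceptually, rows are then split across that many reducers by key, so all rows sharing a key end up at the same reducer. A sketch of hash partitioning in Python; the `partition` helper and the data are invented, and Hadoop's real partitioner differs in detail.

```python
def partition(rows, key_fn, num_reducers):
    # Route each row to a bucket chosen by hashing its key.
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(key_fn(row)) % num_reducers].append(row)
    return buckets

# Made-up sample rows: (year, temperature, quality)
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 0), (1949, 78, 2), (1951, 5, 2)]

buckets = partition(records, lambda r: r[2], 4)  # analogous to PARALLEL 4
```

With only a handful of distinct keys, some reducers receive nothing, which is one reason a very large PARALLEL value does not always help.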
User Defined Functions
• Unlike standard SQL, user-defined functions can be invoked directly in a query
  – Proprietary extensions such as PL/SQL also allow this
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);
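The REGISTER / DEFINE / call sequence can be pictured as a name-to-function registry that the query consults at run time. The Python sketch below is only an analogy: the registry, the `define` helper and the body given to `MyFunc` are all assumptions, since the slide does not say what the function computes.

```python
from collections import defaultdict

# Registry playing the role of DEFINE: alias -> callable.
udfs = {}

def define(alias, fn):
    udfs[alias] = fn

# Hypothetical body for MyFunc; the slide does not specify it.
define("MyFunc", lambda quality: "bad" if quality < 0 else "ok")

def group_by_udf(rows, alias, field):
    # Group rows by the value the named function returns for one field.
    bags = defaultdict(list)
    for row in rows:
        bags[udfs[alias](row[field])].append(row)
    return dict(bags)

# Made-up sample rows: (year, temperature, quality)
records = [(1950, 0, 1), (1949, 111, -1), (1949, 78, -2)]
combined = group_by_udf(records, "MyFunc", 2)
```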
Pig Latin Review

Category                  Operators
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT
Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>>     USING TextLoader() AS (sentence:chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>>     GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';
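Each Pig statement in the script corresponds to one step of a plain data pipeline. A line-for-line Python approximation, using two made-up sentences; `str.split()` only approximates TOKENIZE, which also splits on some punctuation.

```python
from itertools import groupby

# LOAD ... AS (sentence:chararray): one string per input line (made-up data).
sentences = ["the quick fox", "the lazy dog"]

# FOREACH ... GENERATE FLATTEN(TOKENIZE(sentence)) AS word:
# tokenize each sentence, then flatten the per-sentence bags into one list.
words = [w for s in sentences for w in s.split()]

# GROUP words BY word: one (key, bag) pair per distinct word.
word_kinds = [(k, list(g)) for k, g in groupby(sorted(words))]

# FOREACH word_kinds GENERATE group, COUNT(words): size of each bag.
word_count = [(k, len(bag)) for k, bag in word_kinds]
```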
No more of this …
Related Project: Hive
• Started at Facebook, now open source
• Like Pig but supports SQL
• Trend: Move towards in-database MapReduce
• Allows existing DB applications to scale up
• Makes MapReduce capabilities easily accessible
• Business opportunity: www.vertica.com
Summary (this and last class)
• MapReduce as a radically different solution to large data problems
• Exploit the power of parallel I/O and computation
• Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase
• Basic MapReduce too complicated for DB end users
Summary (this and last class)
• Efforts to simplify MapReduce based data processing
• Pig from Yahoo!
• Pig Latin: a not-so-SQL-like language
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Facebook Hive supports direct SQL interface
• Emerging trend: Fusion of MapReduce and DB technologies
Happy Thanksgiving!