Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu


Previously …

• (Traditional) Databases are not Swiss-Army knives
• Large data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable distributed data processing applications
• Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)

Previously …

• HDFS : A reliable open source distributed file system

• HBase : A sorted multi-dimensional map for record oriented data
– Not Relational
– No query language other than map semantics (Get and Put)

MapReduce is great but …

Got to write all this for a WordCount!!!

MapReduce

• Development cycles too long
– Writing code
– Packaging code

• JOINs on large data too hard to implement in MapReduce

• Today’s class: Keeping it Simple
– Can we abstract MapReduce away from users?

Pig

• Started in Fall 2007 at Yahoo!
• Simplify MapReduce by capturing common data processing patterns
– Results in improved productivity
– Lowers barrier to entry for large data processing
• Today: Runs 40% of Yahoo!’s large data jobs
• Who else: Twitter, LinkedIn, AOL, …
• Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

Pig = Query Language + Interpreter

• Language: Pig Latin
– A data flow language
• LOAD, STORE, FILTER, ORDER, GROUP, JOIN

• Interpreter: Grunt
– An execution environment to convert Pig Latin to MapReduce
• Two modes
– Local : JVM
– Distributed: via Hadoop

Pig Latin

Example from Pittsburgh Hadoop Users Group

Equivalent MapReduce code

Pig Latin from an Example

• Find users who visit “good” pages

(Example courtesy: Yahoo! Research)

Conceptual Dataflow

Pig Latin script

Pig Latin: The Language

• Structure
– Collection of STATEMENTS
– Statement has an OPERATOR and ends in ‘;’

Summary of Pig Latin Operators

Category                  Operator
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

LOAD/STORE and Schemas

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> records = LOAD 'input/sample.txt';

grunt> STORE records INTO 'output/sample.out';

FILTER

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> bad_records = FILTER records BY quality < 0;

grunt> bad_years = FOREACH bad_records GENERATE year;
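As a rough illustration of what the FILTER and FOREACH … GENERATE statements above compute, here is a minimal plain-Python sketch. It is not Pig code and does not run on Hadoop; the sample rows are made up for illustration.

```python
# Hypothetical sample rows in the schema (year, temperature, quality).
records = [
    (1950, 0, 1),
    (1950, 22, -1),
    (1949, 111, 1),
    (1949, 78, -4),
]

# FILTER records BY quality < 0: keep only rows whose third field is negative.
bad_records = [row for row in records if row[2] < 0]

# FOREACH bad_records GENERATE year: project each row down to a 1-field tuple.
bad_years = [(year,) for (year, _temperature, _quality) in bad_records]

print(bad_years)
```

FILTER changes how many rows flow through; FOREACH … GENERATE changes what each row looks like.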

STREAM

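grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> projected = FOREACH records GENERATE $0, $2;

-- Same projection via an external program; cut numbers fields from 1,
-- so Pig's $0 and $2 are fields 1 and 3.
grunt> projected = STREAM records THROUGH `cut -f 1,3`;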

JOIN

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> sales = LOAD 'input/sales.txt'
>> AS (year:float, profit:float);

grunt> combined = JOIN records BY year, sales BY year;

-- Both inputs have a year field, so it must be qualified after the JOIN.
grunt> profit_year = FOREACH combined GENERATE profit, records::year;
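The semantics of the JOIN above (an inner equi-join on year) can be sketched in plain Python as a hash join. This is only an illustration under assumed sample data; the helper name `inner_join` is hypothetical, and Pig itself compiles the join into MapReduce jobs and does not guarantee output order.

```python
from collections import defaultdict

# Hypothetical sample data.
records = [(1949, 111, 1), (1950, 22, 1), (1951, 30, 1)]  # (year, temperature, quality)
sales = [(1949, 10.5), (1950, 7.25), (1950, 1.0)]         # (year, profit)


def inner_join(left, right, left_key, right_key):
    """Inner equi-join: index the right side by key, then probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Each output row is the left row followed by a matching right row.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# JOIN records BY year, sales BY year
combined = inner_join(records, sales, 0, 0)

# FOREACH combined GENERATE profit, year
profit_year = [(row[4], row[0]) for row in combined]

print(profit_year)
```

Rows without a partner on the other side (1951 here) are dropped, and a key that matches several rows produces one output row per match.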

GROUP

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> combined = GROUP records BY quality;

-- The grouping key can also be an expression over the fields.
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
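GROUP does not aggregate by itself: for each distinct key it produces one tuple holding the key and a bag of all rows with that key. A minimal plain-Python sketch of GROUP records BY quality, using made-up sample rows and a hypothetical helper `group_by`:

```python
from collections import defaultdict

# Hypothetical sample rows in the schema (year, temperature, quality).
records = [(1950, 0, 1), (1950, 22, -1), (1949, 111, 1), (1949, 78, -4)]


def group_by(rows, key):
    """Collect rows into one 'bag' (list) per distinct key value."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    return dict(bags)


# GROUP records BY quality
combined = group_by(records, key=lambda row: row[2])

for group, bag in combined.items():
    print(group, bag)
```

Aggregates such as COUNT or AVG are then applied to each bag in a following FOREACH … GENERATE.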

ORDER

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> combined = ORDER records BY year, quality DESC;
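In the ORDER statement above, DESC applies only to quality; year is sorted in the default ascending order. A small plain-Python sketch of the same ordering, on made-up sample rows:

```python
# Hypothetical sample rows in the schema (year, temperature, quality).
records = [(1950, 0, 1), (1950, 22, -1), (1949, 111, 1), (1949, 78, -4)]

# ORDER records BY year, quality DESC:
# ascending on year, then descending on quality (negated numeric key).
combined = sorted(records, key=lambda row: (row[0], -row[2]))

for row in combined:
    print(row)
```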

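Parallelism

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> combined = GROUP records BY quality PARALLEL 50;

PARALLEL can be added to any operator that starts a reduce phase (GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER); it sets the number of reducers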

User Defined Functions

• Unlike standard SQL, custom-defined functions can be invoked in a query
– Proprietary solutions like PL/SQL allow that

grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);

grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);

PIG LATIN Review

Category                  Operator
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

Revisiting WordCount

grunt> sentences = LOAD 'input/*.txt'
>> USING TextLoader() AS (sentence:chararray);

grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;

grunt> word_kinds = GROUP words BY word;

grunt> word_count = FOREACH word_kinds
>> GENERATE group, COUNT(words);

grunt> STORE word_count INTO 'output/wordcount';
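To see the dataflow of the WordCount script step by step, here is a minimal plain-Python sketch of the same pipeline on a few made-up sentences. It is an illustration only: `str.split()` merely approximates TOKENIZE (which also splits on some punctuation), and nothing here runs in parallel.

```python
from collections import Counter

# Hypothetical input: one sentence per line, as TextLoader would deliver it.
sentences = ["the quick brown fox", "the lazy dog", "the fox"]

# FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word:
# split every sentence into words and flatten into one stream of words.
words = [word for sentence in sentences for word in sentence.split()]

# GROUP words BY word, then GENERATE group, COUNT(words):
# count how many times each distinct word occurs.
word_count = Counter(words)

for word, count in word_count.items():
    print(word, count)
```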

No more this …

Related Project: Hive

• Started in Facebook, now open source
• Like PIG but supports SQL
• Trend : Move towards in-database MapReduce
• Allows existing DB applications to scale up
• Makes MapReduce capabilities easily accessible
• Business opportunity: www.vertica.com

Summary (this and last class)

• MapReduce as a radically different solution to large data problems

• Exploit the power of parallel I/O and computation

• Need to think from the “ground up”
– Filesystem: HDFS
– Table store: HBase

• Basic MapReduce too complicated for DB end users

Summary (this and last class)

• Efforts to simplify MapReduce-based data processing

• PIG from Yahoo!
• Pig Latin: a not-so-SQL-like language
– A data flow language
• LOAD, STORE, FILTER, ORDER, GROUP, JOIN

• Facebook Hive supports direct SQL interface
• Emerging trend: Fusion of MapReduce and DB technologies

Happy Thanksgiving!
