Page 1: Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu

Storage and Analysis of Tera-scale Data : 2 of 2

415 Database Class, 11/24/09

delip@jhu.edu

Previously …

• (Traditional) Databases are not Swiss-Army knives
• Large data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable distributed data processing applications
• Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)

Previously …

• HDFS: A reliable open source distributed file system
• HBase: A sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)

MapReduce is great but …

Got to write all this for a WordCount!!!
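
The code listing on this slide did not survive extraction. As a rough stand-in, here is a minimal single-process sketch of the three phases a MapReduce WordCount has to spell out by hand: map, shuffle, and reduce. It is plain Python rather than Hadoop's Java API, and all function names are illustrative assumptions, not part of any framework.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum the partial counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

A real Hadoop job adds job configuration, input/output formats, and packaging on top of this, which is the overhead the slide is complaining about.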

MapReduce

• Development cycles too long
  – Writing code
  – Packaging code
• JOINs on large data too hard to implement in MapReduce
• Today’s class: Keeping it Simple
  – Can we abstract users from MapReduce?

Pig

• Started in Fall 2007 at Yahoo!
• Simplify MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers barrier to entry for large data processing
• Today: Runs 40% of Yahoo!’s large data jobs
• Who else: Twitter, LinkedIn, AOL, …
• Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

Pig = Query Language + Interpreter

• Language: Pig Latin
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Interpreter: Grunt (Pig's interactive shell)
  – An execution environment to convert Pig Latin to MapReduce
• Two modes
  – Local: single JVM
  – Distributed: via Hadoop

Pig Latin

Example from Pittsburgh Hadoop Users Group

Equivalent MapReduce code

Pig Latin from an Example

• Find users who visit “good” pages

(Example courtesy: Yahoo! Research)

Conceptual Dataflow

Pig Latin script
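
The script shown on this slide was an image and is not recoverable here. To make the dataflow for "find users who visit good pages" concrete, the following is a hedged plain-Python sketch of one plausible reading: join visits with page ranks, group by user, average the ranks, and keep users above a cutoff. The field names, the 0.5 threshold, and the function name are assumptions for illustration, not taken from the slide.

```python
from collections import defaultdict

def good_users(visits, pages, threshold=0.5):
    """Return users whose visited pages have an average rank above threshold.

    visits: iterable of (user, url) tuples
    pages:  iterable of (url, pagerank) tuples
    """
    rank_of = dict(pages)                        # LOAD pages, indexed by url
    ranks_by_user = defaultdict(list)
    for user, url in visits:                     # JOIN visits with pages on url
        if url in rank_of:
            ranks_by_user[user].append(rank_of[url])  # GROUP by user
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}
    # FILTER on the per-user average; sorted only to make the output stable
    return sorted(u for u, avg in averages.items() if avg > threshold)

if __name__ == "__main__":
    visits = [("amy", "a.com"), ("amy", "b.com"), ("fred", "c.com")]
    pages = [("a.com", 0.9), ("b.com", 0.7), ("c.com", 0.2)]
    print(good_users(visits, pages))
```

Each commented step corresponds to one Pig Latin statement, which is the point of a data flow language: the script reads in the same order as the conceptual diagram.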

Pig Latin: The Language

• Structure
  – Collection of STATEMENTS
  – Statement has an OPERATOR and ends in ‘;’

Summary of Pig Latin Operators

Category                  Operator
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

LOAD/STORE and Schemas

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> records = LOAD 'input/sample.txt';

grunt> STORE records INTO 'output/sample.out';

FILTER

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> bad_records = FILTER records BY quality < 0;

grunt> bad_years = FOREACH bad_records GENERATE year;
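
For readers more at home in a general-purpose language, the two statements above behave roughly like a filter followed by a projection. This is a small Python sketch of that idea only; the `Record` type and the `bad_years` function are illustrative names, and nothing here reflects how Pig executes the statements.

```python
from typing import NamedTuple

class Record(NamedTuple):
    """One input tuple with the schema declared in the LOAD statement."""
    year: int
    temperature: int
    quality: int

def bad_years(records):
    """Keep records with negative quality, then project out the year."""
    bad_records = [r for r in records if r.quality < 0]  # FILTER ... BY quality < 0
    return [r.year for r in bad_records]                  # FOREACH ... GENERATE year

if __name__ == "__main__":
    sample = [Record(1950, 0, 1), Record(1950, 22, -1), Record(1949, 111, -4)]
    print(bad_years(sample))
```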

STREAM

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> projected = FOREACH records GENERATE $0, $2;

grunt> projected = STREAM records THROUGH `cut -f1,3`;

JOIN

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> sales = LOAD 'input/sales.txt'
>>   AS (year:int, profit:float);

grunt> combined = JOIN records BY year, sales BY year;

grunt> profit_year = FOREACH combined GENERATE profit, records::year;
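
To see why the earlier slide called JOINs hard to write directly in MapReduce, here is a hedged single-process sketch of a reduce-side join: tag every tuple with its source, shuffle on the join key, then pair the two sides inside each key group. It is plain Python with illustrative names, assuming tuple layouts that match the schemas above; it is not a description of how Pig itself plans the join.

```python
from collections import defaultdict

def reduce_side_join(records, sales):
    """Inner-join records and sales on year (the first field of each tuple).

    records: list of (year, temperature, quality) tuples
    sales:   list of (year, profit) tuples
    """
    # Map: key every tuple by year and tag it with the relation it came from.
    tagged = [(r[0], ("records", r)) for r in records]
    tagged += [(s[0], ("sales", s)) for s in sales]

    # Shuffle: bring all tuples that share a year together.
    by_year = defaultdict(lambda: {"records": [], "sales": []})
    for year, (source, row) in tagged:
        by_year[year][source].append(row)

    # Reduce: pair every records tuple with every sales tuple for that year.
    joined = []
    for year in sorted(by_year):
        for r in by_year[year]["records"]:
            for s in by_year[year]["sales"]:
                joined.append(r + s)
    return joined

if __name__ == "__main__":
    print(reduce_side_join([(1950, 0, 1), (1951, 5, 1)], [(1950, 10.0)]))
```

The joined tuple carries both `year` fields, which is why a later projection has to say which one it means.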

GROUP

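grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> combined = GROUP records BY quality;

grunt> everything = GROUP records ALL;
grunt> mean = FOREACH everything GENERATE AVG(records.quality) AS avg_quality;
grunt> combined = GROUP records BY (quality < mean.avg_quality ? 'below' : 'above');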
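
The grouping key can be a plain field or an expression computed from the fields. A minimal Python sketch of both cases follows, assuming `(year, temperature, quality)` tuples; `group_by` is a hypothetical helper written for this note, not a Pig or library function.

```python
from collections import defaultdict

def group_by(rows, key):
    """Collect rows into lists keyed by key(row), like GROUP ... BY."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

if __name__ == "__main__":
    records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 4)]

    # Group on a field: quality is the third element of each tuple.
    print(group_by(records, key=lambda r: r[2]))

    # Group on an expression: is the quality below the overall average?
    mean_quality = sum(r[2] for r in records) / len(records)
    print(group_by(records, key=lambda r: r[2] < mean_quality))
```

Note that the average has to be computed in a separate pass before it can be used in the key, since it depends on the whole relation rather than on a single tuple.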

ORDER

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> combined = ORDER records BY year, quality DESC;
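
In the ORDER statement, DESC applies only to the field it follows, so the result is sorted by year ascending and, within a year, by quality descending. A short Python sketch of that ordering, assuming `(year, temperature, quality)` tuples and an illustrative function name:

```python
def order_by_year_quality_desc(records):
    """Sort by year ascending, then by quality descending within each year."""
    # Negating the integer quality reverses its order inside one sort key.
    return sorted(records, key=lambda r: (r[0], -r[2]))

if __name__ == "__main__":
    sample = [(1950, 0, 1), (1949, 111, 1), (1950, 22, 4)]
    print(order_by_year_quality_desc(sample))
```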

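Parallelism

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> combined = GROUP records BY quality PARALLEL 50;

The PARALLEL keyword sets the number of reducers, so it applies to any operator that starts a reduce phase (e.g. GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER)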
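
PARALLEL 50 asks for 50 reduce tasks, and every group key is then routed to exactly one of them. The sketch below illustrates that routing with a simple modulo rule over non-negative integer keys. This is a simplifying assumption made for illustration; the actual partitioning is decided by Hadoop and Pig, and the `partition` function is hypothetical.

```python
from collections import defaultdict

def partition(keys, num_reducers):
    """Assign each non-negative integer key to one reducer by key % num_reducers."""
    assignment = defaultdict(list)
    for key in keys:
        assignment[key % num_reducers].append(key)
    return dict(assignment)

if __name__ == "__main__":
    # Five distinct quality values spread over four reducers.
    print(partition([0, 1, 4, 5, 9], 4))
```

With only a handful of distinct keys, asking for more reducers than keys leaves some reducers idle, so the useful degree of parallelism depends on the data as well as on the number requested.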

User Defined Functions

• Unlike standard SQL, Pig Latin can invoke custom user-defined functions in a query
  – Proprietary solutions like PL/SQL allow that

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl();
grunt> combined = GROUP records BY MyFunc(quality);

PIG LATIN Review

Category                  Operator
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

Revisiting WordCount

grunt> sentences = LOAD 'input/*.txt'
>>   USING TextLoader() AS (sentence:chararray);

grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;

grunt> word_kinds = GROUP words BY word;

grunt> word_count = FOREACH word_kinds
>>   GENERATE group, COUNT(words);

grunt> STORE word_count INTO 'output/wordcount';

No more of this …

Related Project: Hive

• Started at Facebook, now open source
• Like Pig, but with a SQL-like query language (HiveQL)
• Trend: Move towards in-database MapReduce
• Allows existing DB applications to scale up
• Makes MapReduce capabilities easily accessible
• Business opportunity: www.vertica.com

Summary (this and last class)

• MapReduce as a radically different solution to large data problems

• Exploit the power of parallel I/O and computation

• Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase

• Basic MapReduce is too complicated for DB end users

Summary (this and last class)

• Efforts to simplify MapReduce-based data processing
• PIG from Yahoo!
• Pig Latin: a not-so-SQL-like language
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Facebook Hive supports direct SQL interface
• Emerging trend: Fusion of MapReduce and DB technologies

Happy Thanksgiving!