Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu

Previously …

• (Traditional) databases are not Swiss Army knives
• Large data problems require radically different solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable distributed data processing applications
• Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)

Previously …

• HDFS : A reliable open source distributed file system

• HBase : A sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)

MapReduce is great but …

Got to write all this for a WordCount!!!

MapReduce

• Development cycles too long
  – Writing code
  – Packaging code
• JOINs on large data too hard to implement in MapReduce
• Today’s class: Keeping it Simple
  – Can we shield users from MapReduce?
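To make the point about JOINs concrete, here is a rough sketch of a reduce-side join written by hand, emulated in plain Python rather than on Hadoop. The sample rows, the function names, and the tagging scheme are all assumptions made for illustration; a real job would also need serialization, partitioning, and job configuration.

```python
from collections import defaultdict

# Hypothetical sample data: (year, temperature, quality) and (year, profit).
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 1)]
sales = [(1950, 10.5), (1949, 7.25), (1948, 3.0)]

def map_phase(records, sales):
    """Emit (join key, tagged row) so the reducer can tell the inputs apart."""
    for row in records:
        yield row[0], ("records", row)
    for row in sales:
        yield row[0], ("sales", row)

def shuffle(pairs):
    """Collect mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each key, pair every 'records' row with every 'sales' row."""
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == "records"]
        right = [row for tag, row in groups[key] if tag == "sales"]
        for l_row in left:
            for r_row in right:
                yield l_row + r_row

joined = list(reduce_phase(shuffle(map_phase(records, sales))))
for row in joined:
    print(row)
```

Even in this toy form, the join needs source tagging, a grouping step, and a nested loop per key, which is the bookkeeping a single JOIN statement hides.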

Pig

• Started in Fall 2007 at Yahoo!
• Simplify MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers barrier to entry for large data processing
• Today: Runs 40% of Yahoo!’s large data jobs
• Who else: Twitter, LinkedIn, AOL, …
• Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

Pig = Query Language + Interpreter

• Language: Pig Latin
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Interpreter: Grunt
  – An execution environment to convert Pig Latin to MapReduce
• Two modes
  – Local: JVM
  – Distributed: via Hadoop

Pig Latin

Example from Pittsburgh Hadoop Users Group

Equivalent MapReduce code

Pig Latin from an Example

• Find users who visit “good” pages

(Example courtesy: Yahoo! Research)

Pig Latin script

Pig Latin: The Language

• Structure
  – Collection of STATEMENTS
  – Statement has an OPERATOR and ends in ‘;’

Summary of Pig Latin Operators

Category                   Operator
Loading and Storing        LOAD, STORE, DUMP
Filtering                  FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining       JOIN, COGROUP, CROSS
Sorting                    ORDER, LIMIT
Combining and Splitting    UNION, SPLIT

LOAD/STORE and Schemas

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> records = LOAD 'input/sample.txt';

grunt> STORE records INTO 'output/sample.out';

FILTER

grunt> bad_records = FILTER records BY quality < 0;

grunt> bad_years = FOREACH bad_records GENERATE year;
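For readers new to data flow languages, the two statements above behave roughly like the following Python sketch. The sample rows are hypothetical, and lists of tuples stand in for Pig relations.

```python
# Hypothetical rows in the (year, temperature, quality) schema.
records = [(1950, 0, 1), (1950, 22, -1), (1949, 111, 1), (1949, 78, -9)]

# FILTER records BY quality < 0: keep whole rows that satisfy the predicate.
bad_records = [row for row in records if row[2] < 0]

# FOREACH bad_records GENERATE year: project one field out of every row.
bad_years = [(row[0],) for row in bad_records]

print(bad_records)
print(bad_years)
```

FILTER changes the number of rows but not their shape, while FOREACH … GENERATE keeps the number of rows and changes their shape.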

STREAM

grunt> projected = FOREACH records GENERATE $0, $2;

grunt> projected = STREAM records THROUGH `cut -f 1,3`;

JOIN

grunt> sales = LOAD 'input/sales.txt'
>>   AS (year:int, profit:float);

grunt> combined = JOIN records BY year, sales BY year;

grunt> profit_year = FOREACH combined GENERATE profit, records::year;

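GROUP

grunt> combined = GROUP records BY quality;

grunt> everything = GROUP records ALL;
grunt> stats = FOREACH everything GENERATE AVG(records.quality) AS mean_quality;
grunt> combined = GROUP records BY (quality < stats.mean_quality ? 'below' : 'at_or_above');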
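The result of GROUP is easy to misread: each output row holds the group key plus a bag of the complete input rows for that key. A minimal Python sketch of that shape follows, with hypothetical sample rows and a hypothetical `group_by` helper; the second call shows grouping by an expression (here, whether quality is below the average) instead of a plain field.

```python
from collections import defaultdict

# Hypothetical rows in the (year, temperature, quality) schema.
records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 2), (1949, 78, 1)]

def group_by(rows, key_func):
    """Return sorted (group key, bag of full rows) pairs, like Pig's GROUP."""
    bags = defaultdict(list)
    for row in rows:
        bags[key_func(row)].append(row)
    return sorted(bags.items())

# Group by a plain field: quality.
by_quality = group_by(records, lambda row: row[2])

# Group by an expression: is the row's quality below the average quality?
mean_quality = sum(row[2] for row in records) / len(records)
by_average = group_by(records, lambda row: row[2] < mean_quality)

print(by_quality)
print(by_average)
```

Aggregates such as COUNT or AVG are then applied to each bag in a following FOREACH … GENERATE step.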

ORDER

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> combined = ORDER records BY year, quality DESC;
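Note that DESC applies only to the field it follows, so the statement sorts by year ascending and then by quality descending within each year. A small Python sketch with hypothetical rows:

```python
# Hypothetical rows in the (year, temperature, quality) schema.
records = [(1950, 0, 1), (1949, 111, 2), (1950, 22, 4), (1949, 78, 1)]

# ORDER records BY year, quality DESC:
# year ascending, then quality descending (negated to reverse its order).
combined = sorted(records, key=lambda row: (row[0], -row[2]))

for row in combined:
    print(row)
```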

Parallelism

grunt> records = LOAD 'input/sample.txt'
>>   AS (year:int, temperature:int, quality:int);

grunt> combined = GROUP records BY quality PARALLEL 50;

Can use the PARALLEL keyword with any operator that starts a reduce phase (e.g. GROUP, JOIN, ORDER, DISTINCT)
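PARALLEL sets the number of reduce tasks, and each group key is routed to exactly one of them. The sketch below illustrates the idea of hash partitioning in Python; the `partition` helper and the use of CRC32 are assumptions chosen to keep the example deterministic, not what Hadoop runs internally.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key: hash the key, then take it modulo the count.

    CRC32 is a stand-in hash used only so this sketch gives repeatable results.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

# Hypothetical quality values acting as group keys.
qualities = [0, 1, 4, 5, 9, 1, 4, 4]
num_reducers = 3  # plays the role of the number after PARALLEL

assignment = defaultdict(list)
for quality in qualities:
    assignment[partition(quality, num_reducers)].append(quality)

for reducer, keys in sorted(assignment.items()):
    print(reducer, keys)
```

Because the reducer is a function of the key alone, all rows with the same key meet at the same reducer, which is what makes grouping correct no matter how many reducers are requested.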

User Defined Functions

• Unlike standard SQL, Pig Latin can invoke custom-defined functions in a query
  – Proprietary solutions like PL/SQL allow that

grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);

PIG LATIN Review

Category                   Operator
Loading and Storing        LOAD, STORE, DUMP
Filtering                  FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining       JOIN, COGROUP, CROSS
Sorting                    ORDER, LIMIT
Combining and Splitting    UNION, SPLIT

Revisiting WordCount

grunt> sentences = LOAD 'input/*.txt'
>>   USING TextLoader() AS (sentence:chararray);

grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;

grunt> word_kinds = GROUP words BY word;

grunt> word_count = FOREACH word_kinds
>>   GENERATE group, COUNT(words);

grunt> STORE word_count INTO 'output/wordcount';
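To follow what each relation in the script holds, here is a step-by-step Python emulation on hypothetical input. It is only a sketch: `str.split()` approximates TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import defaultdict

# Hypothetical input lines, standing in for LOAD ... USING TextLoader().
sentences = ["the quick brown fox", "the lazy dog", "the fox"]

# FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word:
# TOKENIZE yields a bag of words per line, FLATTEN turns it into one row per word.
words = [word for sentence in sentences for word in sentence.split()]

# GROUP words BY word: one (group, bag) pair per distinct word.
word_kinds = defaultdict(list)
for word in words:
    word_kinds[word].append(word)

# FOREACH word_kinds GENERATE group, COUNT(words): the size of each bag.
word_count = sorted((group, len(bag)) for group, bag in word_kinds.items())

for group, count in word_count:
    print(group, count)
```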

No more this …

Related Project: Hive

• Started at Facebook, now open source
• Like Pig but supports SQL
• Trend: Move towards in-database MapReduce
• Allows existing DB applications to scale up
• Makes MapReduce capabilities easily accessible
• Business opportunity: www.vertica.com

Summary (this and last class)

• MapReduce as a radically different solution to large data problems

• Exploit the power of parallel I/O and computation

• Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase

• Basic MapReduce is too complicated for DB end users

Summary (this and last class)

• Efforts to simplify MapReduce based data processing

• PIG from Yahoo!
• Pig Latin: a not-so-SQL-like language
  – A data flow language
    • LOAD, STORE, FILTER, ORDER, GROUP, JOIN
• Facebook Hive supports direct SQL interface
• Emerging trend: Fusion of MapReduce and DB technologies

Happy Thanksgiving!
