
Thinking at Scale with Hadoop


Presentation by Ken Krugler at the SDForum SAM SIG (Software Architecture & Modeling) meeting on Sept 22nd, 2010. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.


Page 1: Thinking at scale with hadoop

Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.

Thinking at Scale with Hadoop

photo by: i_pinz, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 2: Thinking at scale with hadoop


Introduction - Ken Krugler

Founder, Bixo Labs - http://bixolabs.com

Large scale web mining workflows, running in Amazon’s cloud

Instructor, Scale Unlimited - http://scaleunlimited.com

Teaching Hadoop Boot Camp

Founder, Krugle (2005 - 2009)

Code search startup

Parse/index 2.6 billion lines of open source code


Page 3: Thinking at scale with hadoop

Goals for Talk

Intro to Map-Reduce

Overview of Hadoop

Challenges with Map-Reduce

Solutions to Complexity


Page 4: Thinking at scale with hadoop



Introduction to Map-Reduce

photo by: exfordy, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 5: Thinking at scale with hadoop


Definitions

Map -> the ‘map’ function in the MapReduce algorithm, user-defined

Reduce -> the ‘reduce’ function in the MapReduce algorithm, user-defined

Group -> the ‘magic’ that happens between Map and Reduce

Key/Value Pair -> two units of data, exchanged between Map & Reduce

Page 6: Thinking at scale with hadoop


How MapReduce Works

Map translates input keys and values to new keys and values

The system groups each unique key with all its values

Reduce translates the values of each unique key to new keys and values

[K1,V1] -> Map -> [K2,V2]

[K2,V2] -> Group -> [K2,{V2,V2,...}]

[K2,{V2,V2,...}] -> Reduce -> [K3,V3]


Page 7: Thinking at scale with hadoop


All Together

[Diagram: multiple Mappers feeding multiple Reducers through the Shuffle]


Page 8: Thinking at scale with hadoop


Canonical Example - Word Count

Word Count

Read a document, parse out the words, count the frequency of each word

Specifically, in MapReduce

With a document consisting of lines of text

Translate each line of text into (key = word, value = 1) pairs, one per word

Sum the values (ones) for each unique word


Page 9: Thinking at scale with hadoop


Word Count on the opening of the Declaration of Independence:

Input [K1,V1] = [line number, line of text]:
  [1, When in the Course of human events, it becomes]
  [2, necessary for one people to dissolve the political bands]
  [3, which have connected them with another, and to assume]

Map (user-defined) emits [K2,V2] = [word, 1]:
  [When,1] [in,1] [the,1] [Course,1] ... [k,1]

Group (done by the system) produces [K2,{V2,V2,...}]:
  [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... [k,{v,v,...}]

Reduce (user-defined) emits [K3,V3] = [word, count]:
  [When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,...)]
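A minimal sketch of this word count in Java, against the Hadoop 0.20 "mapreduce" API (the class names are illustrative, not from the deck):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: [K1,V1] = [byte offset, line of text] -> [K2,V2] = [word, 1]
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: [K2,{V2,V2,...}] = [word, {1,1,...}] -> [K3,V3] = [word, sum]
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}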


Page 10: Thinking at scale with hadoop


Divide & Conquer

Because

The Map function only cares about the current key and value, and

The Reduce function only cares about the current key and its values

Mappers & Reducers can run Map/Reduce functions independently

Then

A Mapper can invoke Map on an arbitrary number of input keys and values

A Reducer can invoke Reduce on an arbitrary number of the unique keys

Any number of Mappers & Reducers can be run on any number of nodes


Page 11: Thinking at scale with hadoop



Overview of Hadoop

photo by: exfordy, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 12: Thinking at scale with hadoop


Logical Architecture

Logically, Hadoop is simply a computing cluster that provides:

a Storage layer, and

an Execution layer

[Diagram: a Cluster providing a Storage layer and an Execution layer]


Page 13: Thinking at scale with hadoop


Logical Architecture

Where users can submit jobs that:

Run on the Execution layer, and

Read/write data from the Storage layer

[Diagram: Users' Clients submit jobs that run on the Cluster's Execution layer and read/write the Storage layer]


Page 14: Thinking at scale with hadoop


Storage - HDFS


Performance

Optimized for many concurrent serialized reads of large files

But only a single writer (write-once, no append)

Fidelity

Replication via block replication

[Diagram: file blocks replicated across Nodes, Racks, and Geographies]
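The write-once model shows up directly in the client API. A minimal sketch (the path is illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/example.txt");

        // Single writer: create() makes a brand-new file; HDFS of this era has no append.
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("write once");
        out.close();

        // Any number of readers can then stream the replicated blocks concurrently.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
    }
}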


Page 15: Thinking at scale with hadoop


Execution

Performance

Divide and Conquer

MapReduce

Data Affinity

Move code to data

Fidelity

Applications must choose to fail on bad data, or ignore bad data

[Diagram: map and reduce tasks spread across Nodes, Racks, and Geographies]
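The usual idiom for the "ignore bad data" choice is to catch per-record failures inside the task and count them, rather than let one corrupt record kill the job. A hedged sketch (the tab-separated record format is assumed for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TolerantMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        try {
            context.write(new Text(parseUser(line.toString())), ONE);
        } catch (IllegalArgumentException e) {
            // Ignore the bad record, but keep it visible via a job counter.
            context.getCounter("Quality", "BAD_RECORDS").increment(1);
        }
    }

    // Hypothetical parsing: the second tab-separated field is the user.
    private String parseUser(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed record: " + line);
        }
        return fields[1];
    }
}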


Page 16: Thinking at scale with hadoop


Architectural Components

Solid boxes are unique applications. Dashed boxes are child JVM instances on the same node as their parent. Dotted boxes are blocks of managed files on the same node as their parent.

[Diagram: a Client submits jobs to the JobTracker, which hands tasks to TaskTrackers; each TaskTracker spawns mapper and reducer child JVMs. The NameNode (with a Secondary NameNode) handles namespace operations, while DataNodes serve block reads and writes against their data blocks.]


Page 17: Thinking at scale with hadoop



Challenges of Map-Reduce

photo by: Graham Racher, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 18: Thinking at scale with hadoop


Complexity of keys/values

Simple != Simplicity

People think about processing records/objects

The correct key definition changes during a workflow

A “value” is typically a set of fields


Page 19: Thinking at scale with hadoop


Complexity of chaining jobs

Most real-world workflows require multiple Map-Reduce jobs

Connecting data between jobs is tedious

Keeping key/value definitions in sync is error-prone

As an example...


Page 20: Thinking at scale with hadoop


A trivial filter/count/sort workflow

private void start(String inputPath, String outputPath, String filterUser) throws Exception {
    // Counts the pages the user clicked, then sorts by number of clicks.

    // filter and count
    Path outputDir = new Path(outputPath);
    Path parent = outputDir.getParent();
    Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());

    // Create the job configuration
    Configuration conf = new Configuration();
    conf.set(FilterMapper.USER_KEY, filterUser);

    Job filterCountJob = new Job(conf, "filter+count job");
    FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(filterCountJob, tmpOutput);

    filterCountJob.setMapperClass(FilterMapper.class);
    filterCountJob.setMapOutputKeyClass(Text.class);
    filterCountJob.setMapOutputValueClass(LongWritable.class);

    filterCountJob.setReducerClass(CounterReducer.class);
    filterCountJob.setOutputKeyClass(Text.class);
    filterCountJob.setOutputValueClass(LongWritable.class);
    filterCountJob.setInputFormatClass(TextInputFormat.class);
    filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);

    // Run the job and block until it is done.
    filterCountJob.waitForCompletion(false);

    // sort
    Job sortJob = new Job(conf, "sort job");
    FileInputFormat.setInputPaths(sortJob, tmpOutput);
    FileOutputFormat.setOutputPath(sortJob, outputDir);

    sortJob.setMapperClass(SortMapper.class);
    sortJob.setMapOutputKeyClass(UserAndCount.class);
    sortJob.setMapOutputValueClass(NullWritable.class);

    // A total sort requires a single reducer, but you lose distribution.
    sortJob.setNumReduceTasks(1);

    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setOutputFormatClass(TextOutputFormat.class);
    // Run the job and block until it is done.
    sortJob.waitForCompletion(false);

    // Get rid of the temp directory used to bridge between the two jobs.
    FileSystem.get(conf).delete(tmpOutput, true);
}
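The UserAndCount key used by the sort job isn't shown on the slide; a minimal sketch of what such a WritableComparable might look like (an assumption, sorting by descending count so the single reducer emits the most-clicked users first):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class UserAndCount implements WritableComparable<UserAndCount> {
    private String user = "";
    private long count = 0;

    public UserAndCount() {
        // No-arg constructor required by the Writable machinery.
    }

    public UserAndCount(String user, long count) {
        this.user = user;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(user);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        user = in.readUTF();
        count = in.readLong();
    }

    @Override
    public int compareTo(UserAndCount other) {
        // Descending by count; break ties on the user name.
        int result = -Long.valueOf(count).compareTo(Long.valueOf(other.count));
        return result != 0 ? result : user.compareTo(other.user);
    }

    @Override
    public String toString() {
        return user + "\t" + count;
    }
}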


Page 21: Thinking at scale with hadoop


Not seeing the forest...

Too much low-level detail obscures intent

Not seeing intent leads to bugs

And makes optimization harder


Page 22: Thinking at scale with hadoop


Hard to use MR optimizations

Job shapes evolve, in regex-like notation: MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}

Combiners can reduce map output size (see the sketch after this list)

Map-side joins where one of the sides is “small”
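For word count, the reducer's summing logic is associative, so it can double as a combiner; a one-line addition to the job setup, reusing the illustrative classes from the earlier word count sketch:

Job job = new Job(conf, "word count");
job.setMapperClass(WordCountMapper.class);
// The combiner runs the reduce logic on each map node's partial output,
// so only per-node partial sums cross the network during the shuffle.
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);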


Page 23: Thinking at scale with hadoop


Solutions to Complexity

Cascading - Tuples, Pipes, and Operations

FlumeJava - Java Classes and Deferred Execution

Pig/Hive - “It’s just a big database”

Datameer - “It’s just a big spreadsheet”


Page 24: Thinking at scale with hadoop



Cascading

photo by: Graham Racher, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 25: Thinking at scale with hadoop


Cascading

An API for defining data processing workflows

First public release Jan 2008

Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2

http://www.cascading.org


Page 26: Thinking at scale with hadoop


Cascading

Basic unit of data is “Tuple” - roughly equal to a record

Implements all standard data processing operations

function, filter, group by, co-group, aggregator, etc

Complex applications are built by chaining operations with pipes

And are planned into the minimum number of MapReduce jobs


Page 27: Thinking at scale with hadoop


Operations on Tuples

Map and Reduce communicate through Keys and Values

Our problems are really about elements of Keys and Values

e.g. “username”, not “line”

The effect is that our ‘map’ and ‘reduce’ operations become more reusable


Page 28: Thinking at scale with hadoop


Mapping onto Hadoop

Functions & Filters are map operations

Aggregates & Buffers are reduce operations

Built-in support for joining, multi-field sorting

Concept of “Taps” with “Schemes” as input and output formats


Page 29: Thinking at scale with hadoop


Loose Coupling

[Diagram: Source, func, Group, aggr, and Sink, with all possible connections between them]

Above, all the possible relationships between operations and data

If Fields are the interface between operations, we can build flexible apps


Page 30: Thinking at scale with hadoop


Layered Over MapReduce

[Diagram: a pipe assembly with two Sources, functions, a CoGroup and a GroupBy with aggregators, and a Sink, planned onto Map -> Reduce -> Map -> Reduce phases with a temp file bridging the jobs]


Page 31: Thinking at scale with hadoop


Cascading Code

Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

Pipe pipe = new Pipe( "example" );

RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "\\s+" );
pipe = new Each( pipe, new Fields( "line" ), function );

pipe = new GroupBy( pipe, new Fields( "word" ) );

pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );

pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
flow.complete();


Page 32: Thinking at scale with hadoop


[Flow plan: log tuples ('ip', 'time', 'method', 'event', 'status', 'size') are read from a SequenceFile tap at output/logs/ and split into two 'arrival rate' branches: DateParser -> GroupBy('ts') -> Count -> TextLine tap at output/arrivalrate/sec, and ExpressionFunction -> GroupBy('tm') -> Count -> TextLine tap at output/arrivalrate/min]


Page 33: Thinking at scale with hadoop


[Flow plan: a much larger example with multiple TextLine sources under /data/, RegexSplitter and Identity functions, a chain of CoGroup joins plus GroupBy/Aggregator steps, and TempHfs SequenceFiles bridging the planned MapReduce jobs (field names are elided in the original)]


Page 34: Thinking at scale with hadoop



FlumeJava

photo by: Graham Racher, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 35: Thinking at scale with hadoop


FlumeJava

Recent paper (June 2010) by Googlites - Chambers et al.

Internal project to make it easier to write Map-Reduce jobs

Equivalent open source project just started

Plume - http://github.com/tdunning/Plume


Page 36: Thinking at scale with hadoop


Java Classes, Deferred Exec

Java-centric view of map-reduce processing

Handful of classes to represent immutable, parallel collections

PCollection, PTable

Small number of primitive operations on classes

parallelDo, groupByKey, combineValues, flatten

Deferred execution allows construction/optimization of workflow


Page 37: Thinking at scale with hadoop


FlumeJava Code

PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
    void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
            emitFn.emit(word);
        }
    }
}, collectionOf(strings()));
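The paper continues the example by pairing each word with a one, grouping, and combining; paraphrased from the paper (Pair.of, tableOf, ints, and SUM_INTS are the paper's helper names):

PTable<String,Integer> wordsWithOnes = words.parallelDo(
    new DoFn<String, Pair<String,Integer>>() {
        void process(String word, EmitFn<Pair<String,Integer>> emitFn) {
            emitFn.emit(Pair.of(word, 1));
        }
    }, tableOf(strings(), ints()));

// groupByKey is the deferred analog of the shuffle/Group phase.
PGroupedTable<String,Integer> groupedWordsWithOnes = wordsWithOnes.groupByKey();

// combineValues lets the optimizer run part of the sum map-side, like a combiner.
PTable<String,Integer> wordCounts = groupedWordsWithOnes.combineValues(SUM_INTS);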


Page 38: Thinking at scale with hadoop


Results of Use at Google

[Charts: 'Lines of code' and 'Number of Map-Reduce Jobs']


Page 39: Thinking at scale with hadoop



Pig & Hive

photo by: Tambako the Jaguar, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 40: Thinking at scale with hadoop


Pig

Pig is a data flow ‘programming environment’ for processing large files

PigLatin is the text language it provides for defining data flows

Developed by researchers at Yahoo! and widely used internally

A sub-project of Hadoop

Apache licensed

http://hadoop.apache.org/pig/

First incubator release v0.1.0 in Sept 2008


Page 41: Thinking at scale with hadoop


Pig

Provides two kinds of optimizations:

Physical - caching and stream optimizations at the MapReduce level

Algebraic - mathematical reductions at the syntax level

Allows for User Defined Functions (UDF)

must register jar of functions via register path/some.jar

can be called directly from PigLatin

B = foreach A generate myfunc.MyEvalFunc($0);


Page 42: Thinking at scale with hadoop


PigLatin - Example

W = LOAD 'filename' AS (url, outlink);
G = GROUP W BY url;
R = FOREACH G {
    FW = FILTER W BY outlink eq 'www.apache.org';
    PW = FW.outlink;
    DW = DISTINCT PW;
    GENERATE group, COUNT(DW);
}


Page 43: Thinking at scale with hadoop


Hive

Hive is a data warehouse built on top of flat files

Developed by Facebook

Available as a Hadoop contribution in v0.19.0

http://wiki.apache.org/hadoop/Hive


Page 44: Thinking at scale with hadoop


Hive

Data Organization into Tables with logical and hash partitioning

A Metastore to store metadata about Tables/Partitions/Buckets

Partitions are ‘how’ the data is stored (collections of related rows)

Buckets further divide Partitions based on a hash of a column

The metadata is actually stored in an RDBMS

A SQL-like query language over object data stored in Tables

Custom types, structs of primitive types


Page 45: Thinking at scale with hadoop


Hive - Queries

Queries always write results into a table


FROM user
INSERT OVERWRITE TABLE user_active
SELECT user.*
WHERE user.active = true;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY pv_users.age;


Page 46: Thinking at scale with hadoop



Datameer Analytics Solution (DAS)

photo by: Tambako the Jaguar, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 47: Thinking at scale with hadoop


DAS - Big Spreadsheets

Commercial solution from Datameer

Similar to IBM BigSheets

GUI + workflow editor on top of Hadoop

Define data sources, operations (via spreadsheet), and data sinks

http://www.datameer.com


Page 48: Thinking at scale with hadoop

[Image-only slide]

Page 49: Thinking at scale with hadoop

[Image-only slide]

Page 50: Thinking at scale with hadoop



Summary

photo by: Tambako the Jaguar, flickr

Ken Krugler

Bixo Labs & Scale Unlimited

Page 51: Thinking at scale with hadoop


Thinking at Scale with Hadoop

Understand the underlying Map-Reduce architecture

Model the problem as a workflow with operations on records

Consider using Cascading or (eventually) Plume

Use Pig/Hive to solve query-centric problems

Look at DAS or BigSheets for easier analytics solutions


Page 52: Thinking at scale with hadoop

Resources

Hadoop mailing lists - many, e.g. [email protected]

Hadoop: The Definitive Guide by Tom White (O’Reilly)

Hadoop Training

Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training/

Cascading: http://www.cascading.org

DAS: http://www.datameer.com


Page 53: Thinking at scale with hadoop


Questions?
