Map/Reduce Programming Model Ahmed Abdelsadek

Outline

•Introduction

•What is Map/Reduce?

•Framework Architecture

•Map/Reduce Algorithm Design

•Tools and Libraries built on top of Map/Reduce

Introduction

•Big Data

•Scaling ‘out’ not ‘up’

•Scaling ‘everything’ linearly with data size

•Data-intensive applications

Map/Reduce

•Origins
  ▫Google Map/Reduce
  ▫Hadoop Map/Reduce

•The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.

Mapper

• The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs.
• The input and output types of the mapper can be different from each other.

Reducer

• The Reduce function takes a key and the list of all values associated with it, processes them, and generates zero or more output key/value pairs.
• The input and output types of the reducer can be different from each other.

Mappers/Reducers

•map: (k1; v1) -> [(k2; v2)]

•reduce: (k2; [v2]) -> [(k3; v3)]

WordCount Example

•Problem: count the number of occurrences of every word in a text collection.

Map(docid a, doc d)
    for all term t in doc d do
        Emit(term t, count 1)

Reduce(term t, counts [c1, c2, …])
    sum = 0
    for all count c in counts [c1, c2, …] do
        sum = sum + c
    Emit(term t, count sum)
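The WordCount pseudocode can be tried out in a single process. The following is only a sketch: `run_job` and the in-memory grouping stand in for the framework's shuffle and sort, and all function names and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (term, 1) for every term in the document."""
    for term in text.lower().split():
        yield term, 1

def reduce_fn(term, counts):
    """Sum all counts that were emitted for one term."""
    yield term, sum(counts)

def run_job(documents):
    """Simulate map -> shuffle/sort -> reduce inside one process."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)      # group values by key
    result = {}
    for key in sorted(groups):             # reducers see keys in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

docs = {"d1": "a rose is a rose", "d2": "a daisy"}
word_counts = run_job(docs)
```

Here `word_counts` maps each word to its total number of occurrences across both documents.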

Map/Reduce Framework Architecture and Execution Overview

Architecture - Overview

•Map/Reduce runs on top of DFS

Data Flow

Job Timeline

Job Work Flow

Fault Tolerance

• Task fails
  ▫ Re-execution
• TaskTracker fails
  ▫ Removes the node from the pool of TaskTrackers
  ▫ Reschedules its tasks
• JobTracker fails
  ▫ Single point of failure: the job fails

Map/Reduce Framework Features

• Locality
  ▫ Move code to the data
• Task granularity
  ▫ The number of map and reduce tasks should be much larger than the number of machines, but not excessively so; this enables dynamic load balancing
• Backup tasks
  ▫ Avoid slow workers
  ▫ Scheduled when the job is near completion

Map/Reduce Framework Features

• Skipping bad records
  ▫ Many failures on the same record
• Local execution
  ▫ Debug in isolation
• Status information
  ▫ Progress of computations
• User counters, report progress
  ▫ Periodically propagated to the master node

Hadoop Streaming and Pipes

• APIs to MapReduce that allow you to write your map and reduce functions in languages other than Java
• Hadoop Streaming
  ▫ Uses Unix standard streams as the interface between Hadoop and your program
  ▫ You can use any language that can read standard input and write to standard output
• Hadoop Pipes (for C++)
  ▫ Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function
  ▫ JNI is not used
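As a rough illustration of the Streaming idea, the sketch below writes a word-count mapper and reducer as plain functions over text lines, using the tab-separated `key<TAB>value` line format that Streaming uses by default. The function names are assumptions, and the local `sorted(...)` call only imitates the framework's sort between the two stages.

```python
from itertools import groupby

def stream_mapper(lines):
    """Turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorting stands in for the shuffle-and-sort phase.
mapped = sorted(stream_mapper(["to be or", "not to be"]))
reduced = list(stream_reducer(mapped))
```

In a real Streaming job each function would live in its own script, reading `sys.stdin` and printing records to standard output.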

Keep in Mind

• The programmer has little control over many aspects of execution
  ▫ Where a mapper or reducer runs (i.e., on which node in the cluster)
  ▫ When a mapper or reducer begins or finishes
  ▫ Which input key-value pairs are processed by a specific mapper
  ▫ Which intermediate key-value pairs are processed by a specific reducer

Map/Reduce Algorithm Design

Partitioners

• Divide up the intermediate key space
• Simplest: hash value of the key mod the number of reducers
  ▫ Assigns roughly the same number of keys to each reducer
  ▫ Only considers the key and ignores the value
  ▫ May yield large differences in the number of values sent to each reducer
• A more complex partitioning algorithm is needed to handle the imbalance in the amount of data associated with each key
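A minimal sketch of the simplest partitioner, assuming string keys. `zlib.crc32` is used here only because it is a stable hash across runs; it is not what Hadoop uses. The sample data is invented to show how one very frequent key skews the number of values per reducer even though keys are spread evenly.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Pick a reducer from a stable hash of the key alone (the value is ignored)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def values_per_reducer(pairs, num_reducers):
    """Count how many intermediate values each reducer would receive."""
    load = Counter()
    for key, _value in pairs:
        load[partition(key, num_reducers)] += 1
    return load

# One hot key ("the") dominates the intermediate data.
pairs = [("the", 1)] * 1000 + [("zebra", 1), ("quark", 1), ("lemma", 1)]
load = values_per_reducer(pairs, num_reducers=4)
```

Whichever reducer receives "the" handles at least 1000 of the 1003 values, which is the imbalance the last bullet refers to.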

Combiners

• In the WordCount example, the amount of intermediate data is larger than the input collection itself
• Combiners are an optimization for local aggregation before the shuffle and sort phase
  ▫ Compute a local count for a word over all the documents processed by the mapper
• Think of combiners as “mini-reducers”
  ▫ However, combiners and reducers are not always interchangeable
• Combiner input and output pairs have the same types as the mapper output pairs
  ▫ These are also the reducer input pair types
• A combiner may be invoked zero, one, or multiple times
• A combiner can emit any number of key-value pairs
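The properties above can be checked on a toy word count. This is a sketch with assumed names (`combine`, `reduce_all`) and invented data; it shows that a summing combiner keeps the (word, count) shape and that the final result does not change whether it runs zero, one, or two times.

```python
from collections import defaultdict

def combine(pairs):
    """Locally sum counts per word; output pairs have the same shape as the input."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

def reduce_all(pairs):
    """Final aggregation, as the reducers would perform it."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
no_combiner = reduce_all(mapper_output)
one_pass = reduce_all(combine(mapper_output))
two_passes = reduce_all(combine(combine(mapper_output)))
```

All three results are identical, while `combine(mapper_output)` shrinks four intermediate pairs to two.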

Complete View of Map/Reduce

Local Aggregation

• Network and disk latency are high!
• Features that help local aggregation
  ▫ A single (Java) Mapper object handles multiple (key, value) pairs in an input split (state is preserved across multiple calls of the map() method)
  ▫ Share in-object data structures and counters
  ▫ Initialization and finalization code shared across all map() calls in a single task
  ▫ JVM reuse across multiple tasks on the same machine

Basic WordCount Example

Per-Document Aggregation

• Associative array inside the map() call to sum up term counts within a single document
• Emits a key-value pair for each unique term, instead of emitting a key-value pair for each term in the document
  ▫ Substantial savings in the number of intermediate key-value pairs emitted

Per-Mapper Aggregation

• Associative array inside the Mapper object to sum up term counts across multiple documents

In-Mapper Combining

• Pros
  ▫ More control over when local aggregation occurs and exactly how it takes place (recall: no guarantees on combiners)
  ▫ More efficient than using actual combiners: no additional overhead from object creation, serializing, reading, and writing the key-value pairs
• Cons
  ▫ Breaks the functional programming model (not a big deal!)
  ▫ Scalability bottleneck: needs sufficient memory to store intermediate results
  ▫ Solution: block and flush after every N key-value pairs have been processed or every M bytes have been used
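A sketch of in-mapper combining with the block-and-flush remedy, assuming a word-count mapper. The class name, the `emit` callback, and the `flush_every` parameter are hypothetical; a real Hadoop mapper would hook into the framework's setup and cleanup methods instead of `close()`.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Word-count mapper that aggregates in memory and flushes in blocks."""

    def __init__(self, emit, flush_every=1000):
        self.emit = emit                  # callback that receives (key, value)
        self.flush_every = flush_every    # N: terms processed between flushes
        self.counts = defaultdict(int)    # state preserved across map() calls
        self.seen = 0

    def map(self, doc_id, text):
        for term in text.split():
            self.counts[term] += 1
            self.seen += 1
            if self.seen >= self.flush_every:
                self.flush()

    def flush(self):
        """Emit the partial counts and release the memory."""
        for term, count in self.counts.items():
            self.emit(term, count)
        self.counts.clear()
        self.seen = 0

    def close(self):
        """Finalization step: emit whatever is still buffered."""
        self.flush()

emitted = []
mapper = InMapperCombiningMapper(lambda k, v: emitted.append((k, v)), flush_every=3)
mapper.map("d1", "a a b a")
mapper.map("d2", "c a")
mapper.close()
```

With `flush_every=3` the six input terms produce four intermediate pairs; a larger N trades memory for fewer pairs.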

Correctness with Local Aggregation

• Combiners are viewed as optional optimizations
  ▫ Correctness of the algorithm should not depend on the combiner's computations
• Combiners and reducers are not interchangeable
  ▫ Unless the reduce computation is both commutative and associative
• Make sure of the semantics of your aggregation algorithm
  ▫ Notice for example
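One common illustration of this pitfall is the arithmetic mean; it is assumed here, since the slide's own example did not survive extraction. A mean of partial means is generally wrong, whereas carrying (sum, count) pairs through the combiner keeps the result correct. All names below are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def combine_sum_count(values):
    """Combiner-safe partial result: a (sum, count) pair."""
    return (sum(values), len(values))

def reduce_mean(partials):
    """Reducer: merge (sum, count) pairs, then divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

split_a, split_b = [1, 2], [3, 4, 5]                     # values seen by two mappers
true_mean = mean(split_a + split_b)                      # 3.0
mean_of_means = mean([mean(split_a), mean(split_b)])     # 2.75, not the true mean
safe_mean = reduce_mean([combine_sum_count(split_a),
                         combine_sum_count(split_b)])    # 3.0
```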

Pairs and Stripes

• In some problems, a common approach is to construct complex keys and values to achieve more efficiency
• Example: the problem of building a word co-occurrence matrix from a large document collection
  ▫ Formally, the co-occurrence matrix of a corpus is a square N x N matrix, where N is the number of unique words in the corpus
  ▫ Cell Mij contains the number of times word Wi co-occurred with word Wj

Pairs Approach

• Mapper: emits each co-occurring word pair as the key and the integer one as the value
• Reducer: sums up all the values associated with the same co-occurring word pair
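A small single-process sketch of the pairs approach. It assumes that two words "co-occur" when they appear in the same document; the slides do not fix the neighborhood definition, and the function names and sample documents are illustrative.

```python
def pairs_map(doc_id, text):
    """Emit ((w, u), 1) for every ordered pair of co-occurring words."""
    words = text.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1

def pairs_reduce(pair, counts):
    """Sum the ones emitted for a single word pair."""
    return pair, sum(counts)

def run_pairs(documents):
    grouped = {}
    for doc_id, text in documents.items():
        for pair, one in pairs_map(doc_id, text):
            grouped.setdefault(pair, []).append(one)   # stand-in for the shuffle
    return dict(pairs_reduce(p, c) for p, c in grouped.items())

cooccurrence = run_pairs({"d1": "a b c", "d2": "a b"})
```

Each key of `cooccurrence` is one cell (Wi, Wj) of the matrix.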

Pairs Approach

•Pairs algorithm generates a massive number of key-value pairs

•Combiners have few opportunities to perform local aggregation

•The sparsity of the key space also limits the effectiveness of in-memory combining

Stripes Approach

• Store co-occurrence information in an associative array
• Mapper: emits words as keys and associative arrays as values
• Reducer: element-wise sum of all associative arrays of the same key
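The same toy problem in stripes form, again assuming that co-occurrence means "appears in the same document". `Counter` plays the role of the associative array, and the names and data are illustrative.

```python
from collections import Counter

def stripes_map(doc_id, text):
    """Emit (w, stripe), where the stripe counts the words co-occurring with w."""
    words = text.split()
    for i, w in enumerate(words):
        yield w, Counter(u for j, u in enumerate(words) if j != i)

def stripes_reduce(word, stripes):
    """Element-wise sum of all stripes that share the same key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)     # Counter.update adds counts
    return word, total

def run_stripes(documents):
    grouped = {}
    for doc_id, text in documents.items():
        for word, stripe in stripes_map(doc_id, text):
            grouped.setdefault(word, []).append(stripe)   # stand-in for the shuffle
    return dict(stripes_reduce(w, s) for w, s in grouped.items())

stripes = run_stripes({"d1": "a b c", "d2": "a b"})
```

Each entry of `stripes` is one full row of the co-occurrence matrix, so the reducers handle three keys here instead of six pair keys.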

Stripes Approach

•Much more compact representation

•Far fewer intermediate key-value pairs

•More opportunities to perform local aggregation

•May introduce scalability bottlenecks in the algorithm

Which approach is faster?

• APW (Associated Press Worldstream): corpus of 2.27 million documents totaling 5.7 GB

Computing Relative Frequencies

•In the previous example, the (Wi, Wj) co-occurrence count may be high just because one of the words is very common!

•Solution: Compute relative frequencies
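Written out under the usual definition, the relative frequency divides a joint count by the marginal count of the left word. The symbol N(·,·) for raw co-occurrence counts is an assumption introduced here, not notation from the slides.

```latex
f(W_j \mid W_i) = \frac{N(W_i, W_j)}{\sum_{W'} N(W_i, W')}
```

For instance, if N(a, b) = 2 and N(a, c) = 1 are the only co-occurrences of a, then f(b | a) = 2/3 and f(c | a) = 1/3.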

Relative Frequencies with Stripes

• Straightforward!
• In the reducer:
  ▫ Sum the counts of all words that co-occur with the key word
  ▫ Divide each count by that sum to get the relative frequency!
• Lessons:
  ▫ Use complex data structures to coordinate distributed computations
  ▫ Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation
• Drawback?
  ▫ As before, this algorithm also assumes that each associative array fits into memory (scalability bottleneck!)
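The reducer-side step is short enough to sketch directly. It assumes the stripe for the key word has already been summed element-wise; the function name and sample stripe are illustrative.

```python
def relative_frequencies(word, stripe):
    """Normalize one summed stripe by its marginal (the sum of its counts)."""
    marginal = sum(stripe.values())
    return {other: count / marginal for other, count in stripe.items()}

freqs = relative_frequencies("a", {"b": 2, "c": 1, "d": 1})
```

The whole stripe is held in memory while it is normalized, which is where the scalability limit comes from.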

Relative Frequencies with Pairs

• The reducer receives (Wi, Wj) as the key and the counts as the value
  ▫ From this alone it is not possible to compute f(Wj | Wi)
• Hint: reducers, like mappers, can preserve state across multiple keys
• Solution: at the reducer side, buffer in memory all the words that co-occur with Wi
  ▫ In essence, building the associative array of the stripes approach
• Problem?
  ▫ Word pairs can arrive in any arbitrary order!
• Solution: we must define the sort order of the pairs
  ▫ Keys are first sorted by the left word, and then by the right word
• So that, when the left word changes:
  ▫ Sum, calculate and emit the results, flush the memory

Relative Frequencies with Pairs

• Problem?
  ▫ Pairs with the same left word may be sent to different reducers!
• Solution?
  ▫ We must ensure that all pairs with the same left word are sent to the same reducer
• How?
  ▫ Custom partitioners, which pay attention to the left word only and partition based on its hash
• Will it work?
  ▫ Yeah!
• Drawback?
  ▫ Still a scalability bottleneck!

Relative Frequencies with Pairs

• Another approach? With no bottlenecks?
• Can we compute or ‘have’ the sum before processing the pair counts?
• The notion of ‘before’ and ‘after’ can be seen in the ordering of the key-value pairs
• The insight lies in properly sequencing the data presented to the reducer
  ▫ The programmer should define the sort order of keys so that data needed earlier is presented earlier to the reducer
• So now, we need two things
  ▫ Compute the sum for a given word Wi
  ▫ Send that sum to the reducer before any word pair where Wi is the left word

Relative Frequencies with Pairs

• How?
• To get the sum
  ▫ Modify the mapper to additionally emit a ‘special’ key of the form (Wi, *), with a value of one
• To ensure the order
  ▫ Define the sort order of the keys so that pairs with the special symbol, of the form (Wi, *), are ordered before any other key-value pairs where the left word is Wi
• In addition:
  ▫ The partitioner must pay attention to only the left word

Relative Frequencies with Pairs

•Example

•Memory bottlenecks?
  ▫No!

Order Inversion Design Pattern

• To summarize
  ▫ Emitting a special key-value pair for getting the sum
  ▫ Controlling the sort order of the intermediate key
  ▫ Defining a custom partitioner
  ▫ Preserving state across multiple keys in the reducer
• A quite common pattern in many problems
• The key insight
  ▫ Convert the sequencing of computations into a sorting problem
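The four ingredients can be put together in a single-process sketch. Several choices here are assumptions made for illustration: the empty string stands in for the special `*` symbol because it sorts before every real word under Python's default tuple ordering, `zlib.crc32` is an arbitrary stable hash, and co-occurrence means "in the same document".

```python
import zlib
from collections import defaultdict

STAR = ""  # hypothetical stand-in for '*': sorts before any non-empty word

def oi_map(doc_id, text):
    words = text.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1
                yield (w, STAR), 1       # special pair: contributes to the sum for w

def oi_partition(key, num_reducers):
    left, _right = key
    return zlib.crc32(left.encode("utf-8")) % num_reducers   # left word only

def oi_reduce_partition(sorted_items):
    """The only state kept across keys is the marginal of the current left word."""
    marginal = 0
    for (left, right), counts in sorted_items:
        total = sum(counts)
        if right == STAR:
            marginal = total             # arrives before every (left, word) pair
        else:
            yield (left, right), total / marginal

def run_order_inversion(documents, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in documents.items():
        for key, value in oi_map(doc_id, text):
            partitions[oi_partition(key, num_reducers)][key].append(value)
    result = {}
    for part in partitions:              # each partition is one reducer's input
        result.update(oi_reduce_partition(sorted(part.items())))
    return result

rel = run_order_inversion({"d1": "a b c", "d2": "a b"})
```

No reducer buffers more than one number per left word, which is why the memory bottleneck disappears.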

Secondary Sort

• In addition to sorting by key, we also need to sort by value
• Built into Google's Map/Reduce implementation, but not into Hadoop
• Two main techniques
  ▫ Buffer all the values for a key in memory and then sort them
    ▫ May lead to too much memory consumption
  ▫ Value-to-key conversion
    ▫ Move part of the value into the intermediate key to form a composite key
    ▫ We must define the intermediate key sort order
    ▫ We must define the partitioner so that all pairs associated with the same key are sent to the same reducer
    ▫ The reducer will need to preserve state across multiple pairs
    ▫ May lead to too many intermediate pairs
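A sketch of value-to-key conversion on invented sensor data: the records `(sensor, (timestamp, reading))` and all names are hypothetical. The timestamp moves into the key, the partitioner looks only at the original key, and sorting the composite keys delivers each sensor's readings in time order.

```python
from itertools import groupby

def to_composite(records):
    """Move the timestamp from the value into the key."""
    for sensor, (timestamp, reading) in records:
        yield (sensor, timestamp), reading

def partition(composite_key, num_reducers):
    """Partition on the original (natural) key only, ignoring the timestamp."""
    sensor, _timestamp = composite_key
    return sum(sensor.encode("utf-8")) % num_reducers

def reduce_sorted(sorted_pairs):
    """Walk the sorted pairs; a change of sensor marks the end of a group."""
    for sensor, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        yield sensor, [(key[1], reading) for key, reading in group]

records = [("m1", (3, 80.2)), ("m2", (1, 75.0)), ("m1", (1, 80.9)), ("m1", (2, 81.5))]
# sorted() stands in for the framework's sort on the composite key
ordered = dict(reduce_sorted(sorted(to_composite(records))))
```

The reducer never sorts or buffers a whole value list itself; the ordering comes from the key sort.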

Relational Joins

• For databases, data warehousing, and data analytics
• Semi-structured data
• Example of a join
• Where S and T are datasets (relations), k is the key we want to join on, si and ti are the unique IDs of S and T respectively, and Si and Ti are the rest of the tuple attributes

Reduce-side Join

• One-to-one join
  ▫ Emit the tuple’s join attribute as the key and the rest of the attributes as the value
• One-to-many join
  ▫ Buffer all tuples in memory
  ▫ Use the value-to-key pattern

Reduce-side Join

• Many-to-many join
  ▫ The previous algorithm works as well
  ▫ The smaller set should come first
  ▫ The reducer will buffer it in memory
• Lessons
  ▫ The basic idea is to repartition the two datasets by the join key
  ▫ Not efficient, since it shuffles both datasets across the network
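A single-process sketch of the repartitioning idea. Tuples are assumed to be `(join_key, rest)` pairs and are tagged with the name of their relation so the reducer can tell the two sides apart; the relation names "S" and "T" follow the slides, everything else is illustrative. This simple version buffers both sides of a key in memory.

```python
from collections import defaultdict

def join_map(relation_name, tuples):
    """Key each tuple by its join attribute and tag it with its relation."""
    for join_key, rest in tuples:
        yield join_key, (relation_name, rest)

def join_reduce(join_key, tagged):
    """Pair every S tuple with every T tuple that shares the join key."""
    s_side = [rest for name, rest in tagged if name == "S"]
    t_side = [rest for name, rest in tagged if name == "T"]
    for s in s_side:
        for t in t_side:
            yield join_key, s, t

def reduce_side_join(S, T):
    groups = defaultdict(list)           # stand-in for shuffling both datasets
    for name, relation in (("S", S), ("T", T)):
        for key, tagged in join_map(name, relation):
            groups[key].append(tagged)
    return [row for key in sorted(groups) for row in join_reduce(key, groups[key])]

S = [(1, "s1"), (2, "s2")]
T = [(1, "t1"), (1, "t2"), (3, "t3")]
joined = reduce_side_join(S, T)
```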

Map-side Joins

• Assume the datasets are
  ▫ Both sorted by the join key
  ▫ Divided into the same number of files
  ▫ Partitioned in the same manner by the join key
  ▫ In each file, tuples are sorted by the join key
• We can perform a join by scanning through both datasets simultaneously
  ▫ This is known as a merge join
• Parallelize by partitioning and sorting both datasets in the same way
  ▫ Map over one of the datasets (the larger one)
  ▫ Inside the mapper, read the corresponding part of the other dataset (a non-local read)
  ▫ Perform the merge join
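The merge join performed inside each mapper can be sketched as follows, assuming both inputs are lists of `(key, value)` tuples already sorted by key. The function name and sample data are illustrative.

```python
def merge_join(left, right):
    """Join two key-sorted lists of (key, value) in one simultaneous scan."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ... and pair it with every left tuple carrying the same key.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
merged = merge_join(left, right)
```

Only the current run of equal keys is revisited, so neither dataset has to be shuffled or held in memory as a whole.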

Map-side Joins

• More efficient than a reduce-side join
  ▫ Doesn’t shuffle all the datasets
• Drawback:
  ▫ Strong assumptions about the format of the input files
• Advice
  ▫ If used in a workflow with multiple Map/Reduce jobs, ensure the previous reducer writes its output in a convenient format.

Memory-backed Join

•If one of the datasets can fit in memory
•Load it in memory
•Map over the other dataset
•Use random access to tuples based on the join key
•Great performance improvement
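A sketch of the memory-backed join, assuming the small dataset fits in a Python dictionary. The `users` and `clicks` datasets and the function names are invented for illustration.

```python
from collections import defaultdict

def build_lookup(small_dataset):
    """Load the small dataset into memory, indexed by the join key."""
    table = defaultdict(list)
    for key, value in small_dataset:
        table[key].append(value)
    return table

def memory_join_map(lookup, key, value):
    """Map over one tuple of the large dataset and probe the in-memory table."""
    for other in lookup.get(key, ()):
        yield key, value, other

users = [(1, "ann"), (2, "bob")]                                 # small side
clicks = [(1, "/home"), (2, "/cart"), (1, "/faq"), (9, "/x")]    # large side
lookup = build_lookup(users)
joined = [row for k, v in clicks for row in memory_join_map(lookup, k, v)]
```

The join happens entirely in the map phase, so nothing is shuffled across the network.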

Summary

• In-mapper combining
  ▫ Aggregates partial results
  ▫ Emits fewer intermediate pairs
• Pairs and Stripes
  ▫ Keep track of joint events, either one by one (pairs) or in stripe fashion (stripes)
• Order inversion
  ▫ Convert the sequencing of computations into a sorting problem
• Value-to-key conversion
  ▫ Scalable solution for secondary sorting
  ▫ Moving part of the value into the key

Before we go!

• Remember: limitations of the Map/Reduce model
  ▫ Map/Reduce is mainly designed for batch processing, not for online queries
  ▫ Prevents modifying or adding input data while the job is running, as well as modifying the number of machines
  ▫ A Map/Reduce job has a single entry and a single exit: we cannot keep it alive waiting for an event to trigger it
  ▫ Map/Reduce works on flat files: lack of schema support

What’s Next?

Map/Reduce vs RDBMS

• A living debate in the database and data analytics communities
• In 2008, D. DeWitt and M. Stonebraker wrote
  ▫ “MapReduce: A major step backwards”
  ▫ A giant step backward in the programming paradigm
  ▫ An implementation that uses brute force instead of indexing
  ▫ Not novel at all: well-known techniques developed nearly 25 years ago
  ▫ Missing most of the features that are routinely included in current DBMSs
  ▫ Incompatible with all of the tools DBMS users have come to depend on
• MapReduce is missing features
  ▫ Indexing, bulk loader, updates, transactions, integrity constraints, referential integrity, views
• MapReduce is incompatible with the DBMS tools
  ▫ Report writers, business intelligence tools, data mining tools, replication tools, database design tools

Map/Reduce vs RDBMS

• In 2010, the same authors and others wrote “MapReduce and Parallel DBMSs: Friends or Foes?”
• They argue that
  ▫ Map/Reduce is a complement to DBMSs, not a competitor
  ▫ They are used in different application domains
• Parallel DBMSs excel at efficient querying of large data sets
• MR-style systems excel at ETL (extract-transform-load) tasks

NoSQL

• Mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases
  ▫ To achieve higher scalability and availability
• Usually in the form of a key-value store
• Built on top of distributed file systems
• Examples
  ▫ Google Bigtable
  ▫ Apache HBase
  ▫ Apache Cassandra
  ▫ Amazon Dynamo

Tools on top of Hadoop

• Apache Pig
  ▫ Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce
  ▫ Apache Pig features “Pig Latin”, a relational data-flow language that enables SQL-like queries to be performed on distributed datasets within Hadoop applications
  ▫ Pig originated as a Yahoo Research project
  ▫ In 2007, Pig became an open source project of the Apache Software Foundation

Apache Pig

•Pig Latin Example

Apache Pig

•Pig execution flow

Tools on top of Hadoop

• Apache Hive
  ▫ Hive is a data warehouse system for the open source Apache Hadoop project
  ▫ Hive features a SQL-like HiveQL language that facilitates data analysis and summarization for large datasets stored in Hadoop-compatible file systems
  ▫ Hive originated at Facebook
  ▫ It later became an open source project under the Apache Software Foundation

Apache Hive

•HiveQL Example

Pig vs Hive

• They are/were independent projects and there was no centrally coordinated goal
• They were in different spaces early on and have grown to overlap with time as both projects expand
• Some differences are
  ▫ Pig Latin is procedural, whereas HiveQL is declarative
  ▫ Pig Latin allows developers to insert their own code almost anywhere in the data pipeline
• Both compile to Map and Reduce jobs

Libraries on top of Hadoop

• Mahout
  ▫ Machine learning library for building scalable machine learning algorithms

Libraries on top of Hadoop

• HIPI (Hadoop Image Processing Interface)
  ▫ Framework that provides an API for performing image processing tasks in a distributed computing environment

Summary

•Map/Reduce

•Framework Architecture

•Map/Reduce Algorithm Design

•Tools and Libraries built on top of Map/Reduce

Demo

•Starting the Hadoop cluster
•Copying data to HDFS
•Compiling our Java Map/Reduce code and creating the Jar file
•Submitting the Hadoop job
•Showing progress and dashboards
•Retrieving the output from HDFS
•Shutting down the Hadoop cluster

Appendix

•Study materials
  ▫“Data-Intensive Text Processing with MapReduce”, Jimmy Lin and Chris Dyer
  ▫“Hadoop: The Definitive Guide”, Tom White
  ▫“MapReduce Design Patterns”, Donald Miner and Adam Shook

Questions?