Cloud Infrastructure: GFS & MapReduce
Andrii Vozniuk
Based on slides by Jeff Dean and Ed Austin
Data Management in the Cloud, EPFL, February 27, 2012




This presentation is based on two papers:
1) S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP, 2003.
2) J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.



Outline

• Motivation
• Problem Statement
• Storage: Google File System (GFS)
• Processing: MapReduce
• Benchmarks
• Conclusions


Motivation

• Huge amounts of data to store and process
• Example @2004:
– 20+ billion web pages x 20 KB/page = 400+ TB
– Reading from one disk at 30-35 MB/s:
• Four months just to read the web
• 1000 hard drives just to store the web
• Even more complicated if we want to process the data

• Exponential growth: the solution must scale (a quick check of these numbers is sketched below)
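A rough back-of-the-envelope check of the figures above (a sketch; the page count, page size, and disk rate are the slide's numbers, the ~400 GB-per-drive capacity is my assumption for 2004-era disks):

    # Rough check of the numbers above (single sequential disk reader).
    pages = 20e9                                 # 20+ billion web pages
    page_size = 20e3                             # ~20 KB per page
    total_bytes = pages * page_size              # ~4e14 bytes = 400+ TB

    disk_rate = 30e6                             # ~30 MB/s from one disk
    read_days = total_bytes / disk_rate / 86400  # ~154 days, i.e. months of reading
    disks_to_store = total_bytes / 400e9         # ~1000 drives at ~400 GB each (assumption)
    print(round(read_days), round(disks_to_store))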


Motivation

• Buy super fast, ultra reliable hardware?
– Ultra expensive
– Controlled by third party
– Internals can be hidden and proprietary
– Hard to predict scalability
– Fails less often, but still fails!
– No suitable solution on the market


Motivation

• Use commodity hardware? Benefits:

– Commodity machines offer much better perf/$
– Full control over and understanding of internals
– Can be highly optimized for their workloads
– Really smart people can do really smart things

• Not that easy:
– Fault tolerance: something breaks all the time
– Applications development
– Debugging, optimization, locality
– Communication and coordination
– Status reporting, monitoring

• These issues have to be handled anew for every problem you want to solve


Problem Statement

• Develop a scalable distributed file system for large data-intensive applications running on inexpensive commodity hardware

• Develop a tool for processing large data sets in parallel on inexpensive commodity hardware

• Develop both in coordination for optimal performance


Google Cluster Environment

Servers -> Racks -> Clusters -> Datacenters -> Cloud


Google Cluster Environment

• @2009:
– 200+ clusters
– 1000+ machines in many of them
– 4+ PB file systems
– 40 GB/s read/write load
– Frequent HW failures
– 100s to 1000s active jobs (1 to 1000 tasks)

• A cluster is 1000s of machines
– Stuff breaks: with 1000 machines expect ~1 failure per day, with 10,000 expect ~10 per day
– How to store data reliably with high throughput?
– How to make it easy to develop distributed applications?


Google Technology Stack


Google Technology Stack

Focus of this talk: GFS and MapReduce


Google File System*

* The Google File System. S. Ghemawat, H. Gobioff, S. Leung. SOSP, 2003


• Inexpensive commodity hardware
• High throughput favored over low latency
• Large files (multi-GB)
• Multiple clients
• Workload:
– Large streaming reads
– Small random writes
– Concurrent appends to the same file


Architecture

• User-level process running on commodity Linux machines
• Consists of Master Server and Chunk Servers
• Files broken into chunks (typically 64 MB), 3x redundancy (clusters, DCs)
• Data transfers happen directly between clients and Chunk Servers
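As an illustration of this split, here is a minimal sketch of the metadata the master might keep in memory (the names and structure are mine, not actual GFS code):

    from dataclasses import dataclass, field

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

    @dataclass
    class Chunk:
        handle: int                  # globally unique chunk handle
        replicas: list = field(default_factory=list)  # chunk server addresses (3x)

    class MasterMetadata:
        """Namespace plus file -> chunks -> replica locations, all held in RAM."""
        def __init__(self):
            self.files = {}          # path -> ordered list of Chunk

        def locate(self, path, offset):
            # The client asks the master only for this metadata, then reads or
            # writes the chunk data directly from/to the chunk servers listed.
            return self.files[path][offset // CHUNK_SIZE]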


Master Node

• Centralization for simplicity
• Namespace and metadata management
• Managing chunks:

– Where they are (file -> chunks -> replica locations)
– Where to put new ones
– When to re-replicate (failure, load balancing)
– When and what to delete (garbage collection)

• Fault tolerance:
– Shadow masters
– Monitoring infrastructure outside of GFS
– Periodic snapshots
– Mirrored operations log



Master Node

• Metadata is in memory – it’s fast!
– A 64 MB chunk needs less than 64 B of metadata => 640 TB of data needs less than 640 MB
• Asks Chunk Servers for their chunk locations when:
– the Master starts
– a Chunk Server joins the cluster

• Operation log:
– Used for serialization of concurrent operations
– Replicated
– Respond to the client only after the log is flushed locally and remotely
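The memory estimate above follows from simple arithmetic (a sketch; the <64 B-per-chunk bound is the figure quoted on the slide):

    chunk_size = 64 * 2**20          # 64 MB per chunk
    meta_per_chunk = 64              # < 64 B of master metadata per chunk
    stored = 640 * 2**40             # 640 TB of file data
    chunks = stored // chunk_size    # ~10.5 million chunks
    print(chunks * meta_per_chunk / 2**20)  # ~640 MB of metadata -> fits in RAM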



Chunk Servers

• 64 MB chunks stored as Linux files
– Reduces the size of the master's data structures
– Reduces client-master interaction
– Internal fragmentation => allocate space lazily
– Possible hotspots => re-replicate

• Fault tolerance:
– Heartbeat messages to the master
– Something wrong => the master initiates re-replication



Mutation Order

1. Client asks the Master which Chunk Server holds the lease for the chunk (the primary) and where the replicas are.
2. Master replies with the identity of the primary and the locations of the replicas (cached by the client).
3. Client pushes the data to all replicas (3a, 3b, 3c), in any order.
4. Client sends the write request to the primary.
5. Primary assigns a serial number (#) to the mutation, applies it locally, and forwards the write request to the secondaries.
6. Secondaries apply the mutation in the same order and report that the operation completed.
7. Primary replies to the client: operation completed, or an error report.
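A minimal client-side sketch of that flow, assuming hypothetical lookup_lease / push_data / send_write RPC helpers (these names are illustrative, not the real GFS client API):

    def gfs_write(master, path, chunk_index, data):
        # Steps 1-2: ask the master who holds the lease and where the replicas
        # are (the client caches this answer for later writes to the same chunk).
        primary, replicas = master.lookup_lease(path, chunk_index)

        # Step 3: push the data to every replica; it is only buffered there.
        for server in replicas:          # the primary is one of the replicas
            server.push_data(data)

        # Steps 4-7: the primary assigns a serial number, applies the mutation,
        # forwards the request to the secondaries, and reports success or error.
        return primary.send_write(path, chunk_index)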


Control & Data Flow

• Decouple control flow and data flow
• Control flow:
– Master -> Primary -> Secondaries

• Data flow:
– Carefully picked chain of Chunk Servers
• Forward to the closest first
• Distance estimated based on IP
– Fully utilize outbound bandwidth
– Pipelining to exploit full-duplex links
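A sketch of how such a chain could be ordered, with the IP-based distance heuristic abstracted into a caller-supplied distance function (illustrative, not the actual GFS code):

    def build_forwarding_chain(client, replicas, distance):
        """Greedy chain: each hop forwards to the nearest replica not yet visited."""
        chain, current, remaining = [], client, set(replicas)
        while remaining:
            nearest = min(remaining, key=lambda r: distance(current, r))
            chain.append(nearest)
            remaining.remove(nearest)
            current = nearest        # data is pipelined hop by hop along the chain
        return chain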



Other Important Things

• Snapshot operation – make a copy very fast
• Smart chunk creation policy:
– Prefer Chunk Servers with below-average disk utilization, limit the # of recent creations per server
• Smart re-replication policy:
– Under-replicated chunks first
– Chunks that are blocking a client
– Live files first (rather than deleted ones)

• Rebalance and GC periodically

How to process data stored in GFS?



MapReduce*

• A simple programming model applicable to many large-scale computing problems

• Divide and conquer strategy
• Hide messy details in the MapReduce runtime library:
– Automatic parallelization
– Load balancing
– Network and disk transfer optimizations
– Fault tolerance
– Part of the stack: improvements to the core library benefit all users of the library

*MapReduce: Simplified Data Processing on Large Clusters. J. Dean, S. Ghemawat. OSDI, 2004



Typical problem

• Read a lot of data and break it into parts
• Map: extract something important from each part
• Shuffle and sort
• Reduce: aggregate, summarize, filter, or transform the Map results
• Write the results
• Chain or cascade multiple MapReduce jobs

• Implement Map and Reduce to fit the problem
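Run on a single machine, the whole pattern is only a few lines; a toy, in-memory sketch of the map -> shuffle/sort -> reduce pipeline (it illustrates the model, not the distributed runtime):

    from collections import defaultdict

    def run_mapreduce(inputs, mapper, reducer):
        # Map: turn every input record into intermediate (key, value) pairs.
        intermediate = defaultdict(list)
        for record in inputs:
            for k, v in mapper(record):
                intermediate[k].append(v)
        # Shuffle & sort: group values by key; Reduce: aggregate each group.
        return [(k, reducer(k, intermediate[k])) for k in sorted(intermediate)]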



Nice Example

[example figure]


Other suitable examples

• Distributed grep
• Distributed sort (Hadoop MapReduce won TeraSort)
• Term-vector per host
• Document clustering
• Machine learning
• Web access log stats
• Web link-graph reversal
• Inverted index construction
• Statistical machine translation



Model

• Programmer specifies two primary methods:
– Map(k, v) -> <k’, v’>*
– Reduce(k’, <v’>*) -> <k’, v’>*

• All v’ with the same k’ are reduced together, in order
• Usually also specify how to partition k’:
– Partition(k’, total partitions) -> partition for k’
• Often a simple hash of the key
• Allows reduce operations for different k’ to be parallelized
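The partition function is typically nothing more than a hash of the key; a minimal sketch:

    def default_partition(key, num_reducers):
        # All values for a given key end up at the same reducer, while different
        # keys are spread across reducers and processed in parallel.
        return hash(key) % num_reducers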



Code

Map:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Reduce:
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
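For comparison, the same word count expressed against the toy run_mapreduce sketch from the "Typical problem" slide (illustrative only):

    def wc_map(doc):                     # doc = (name, contents)
        for word in doc[1].split():
            yield word, 1

    def wc_reduce(word, counts):
        return sum(counts)

    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]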


Architecture

• One master, many workers
• Infrastructure manages scheduling and distribution
• User implements Map and Reduce


[architecture figures]

• Combiner = a local Reduce applied to Map output on the same worker


Important Things

• Mappers are scheduled close to the data
• Chunk replication improves locality
• Reducers often run on the same machines as mappers
• Fault tolerance:
– Map worker crash – re-run all of that machine's map tasks
– Reduce worker crash – re-run only the crashed task
– Master crash – re-run the whole job
– Skip bad records
• Fighting ‘stragglers’ by launching backup tasks
• Proven scalability:
– In September 2009 Google ran 3,467,000 MR jobs, averaging 488 machines per job
• Extensively used at Yahoo and Facebook via Hadoop
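A sketch of the backup-task idea mentioned above: near the end of a phase, duplicate the remaining in-progress tasks on idle workers and keep whichever copy finishes first (illustrative scheduler logic with hypothetical task/worker objects, not Google's implementation):

    def maybe_launch_backups(in_progress_tasks, idle_workers, phase_nearly_done):
        """Stop a single slow machine ('straggler') from stalling the whole job."""
        if not phase_nearly_done:
            return
        for task in in_progress_tasks:
            if not idle_workers:
                break
            if not task.has_backup:
                idle_workers.pop().run(task)   # first copy to finish wins;
                task.has_backup = True         # the other result is discarded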



Benchmarks: GFS

[benchmark figures]


Benchmarks: MapReduce

[benchmark figures]


Conclusions

• GFS & MapReduce:
– Google achieved their goals
– A fundamental part of their stack

• Open-source implementations:
– GFS -> Hadoop Distributed File System (HDFS)
– MapReduce -> Hadoop MapReduce

“We believe we get tremendous competitive advantage by essentially building our own infrastructure” -- Eric Schmidt


Thank you for your attention! [email protected]