Why do we need Grid?
Storage --> a single disk cannot host all the data
Computation --> a single CPU cannot provide all the computing power needed
Parallel jobs --> serial execution is no longer a viable option
What we expect from a framework
Distributed storage
Job specification platform
Job splitting/merging
Job execution and monitoring
Basic attributes expected
Resource management
  Disk
  CPU
  Memory
  Network bandwidth
Fault tolerance
  Network failure
  Machine failure
  Job/code bugs
  …
Scalability
Hadoop Core
A separate distributed file system, based on the Google File System architecture --> HDFS
A separate job splitting and merging mechanism --> the MapReduce framework, built on top of the distributed file system
Provides a custom job specification mechanism --> InputFormat, Mapper, Partitioner, Combiner, Reducer, OutputFormat (wired together as in the sketch below)
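To make the pipeline concrete, here is a minimal sketch of wiring those pieces together with the org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are hypothetical classes (sketched under the MapReduce slides below), not part of Hadoop itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountJob.class);

            // Every stage of the pipeline is a pluggable class.
            job.setInputFormatClass(TextInputFormat.class);    // how input is split and read
            job.setMapperClass(WordCountMapper.class);         // map phase
            job.setPartitionerClass(HashPartitioner.class);    // routes keys to reducers
            job.setCombinerClass(WordCountReducer.class);      // local pre-aggregation
            job.setReducerClass(WordCountReducer.class);       // reduce phase
            job.setOutputFormatClass(TextOutputFormat.class);  // how output is written

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }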
HDFS attributes
Distributed, reliable, scalable
Optimized for streaming reads and very large data sets
Assumes write once, read many times
No local caching possible, due to large files and streaming reads
High data replication
Fits logically with MapReduce
Synchronized access to metadata --> namenode
Metadata (edit log, FsImage) stored in the namenode's local OS file system
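As a small illustration of the streaming-read model, a sketch of a client that copies one HDFS file to stdout; HDFS favors exactly this kind of sequential scan over random access:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads the cluster config
            FileSystem fs = FileSystem.get(conf);     // the configured HDFS instance
            InputStream in = null;
            try {
                in = fs.open(new Path(args[0]));      // sequential stream over the file
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }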
HDFS
[Architecture diagram copied from the HDFS design document]
MapReduce framework attributes
Fair isolation of tasks --> easy synchronization and failover... (a minimal mapper sketch follows)
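A minimal sketch of such an isolated task: the hypothetical WordCountMapper referenced in the job sketch above. It holds no state shared with any other task, so the framework can rerun it on any node after a failure.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Called once per input record; sees only its own split of the data.
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }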
MapReduce
[Data-flow diagrams copied from the Yahoo! Hadoop tutorial]
Fault tolerance goal
Hadoop assumes that at any given time at least one machine is down
HDFS
  Block-level replication (see the sketch below)
  Replicated and persistent metadata
  Rack awareness, and consideration of whole-rack failure
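A sketch of controlling the replication factor from client code, assuming the standard dfs.replication property (in practice it is usually set cluster-wide in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default replication factor for files this client creates.
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be changed per existing file; the
            // namenode re-replicates blocks in the background.
            fs.setReplication(new Path(args[0]), (short) 4);
        }
    }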
Fault tolerance goal contd..
MapReduce
  No dependency assumed between tasks
  Tasks from a failed node can be transferred to other nodes without any state information
  Mapper --> whole tasks are re-executed on other nodes
  Reducer --> only unexecuted tasks need to be transferred, since all completed results are already written to the output
Resource management goal
CPU / Memory
  Mechanisms are provided so that data can be streamed directly to the file descriptor --> no user-level operations for very large objects
  Optimized sorting is possible, so that the order can mostly be decided from the raw bytes without instantiating objects around them (see the comparator sketch below)
  …
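The byte-level sorting corresponds to Hadoop's RawComparator mechanism. A sketch for IntWritable keys, which would be registered with job.setSortComparatorClass(RawIntComparator.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparator;

    // Orders serialized IntWritable keys straight from their bytes;
    // the sort never has to instantiate the keys as objects.
    public class RawIntComparator extends WritableComparator {

        public RawIntComparator() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            int a = readInt(b1, s1); // an IntWritable is 4 big-endian bytes
            int b = readInt(b2, s2);
            return a < b ? -1 : (a == b ? 0 : 1);
        }
    }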
Resource management goal contd..
Bandwidth
  The HDFS architecture ensures that a read request is served from the nearest node holding a replica
  The MapReduce framework ensures that operations are executed as near to the data as possible --> moving operations is cheaper than moving data
  Optimized operations at every stage --> combiner (sketched below), data replication (parallel buffering and transfer from one node), ...
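A sketch of the hypothetical WordCountReducer used in the job sketch above. Because summing is associative and commutative, the same class can double as the combiner, collapsing map output on each node before it crosses the network:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Sums all counts for one word; run per-node as a combiner it
        // turns many (word, 1) pairs into a single (word, n) pair.
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }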
Scalability goal
Flat scalability --> adding or removing a node is fairly straightforward
Subprojects
ZooKeeper for small shared information (useful for synchronization, locks, leader selection, and many other sharing problems in distributed systems); a minimal sketch follows
HBase for semi-structured data (provides an implementation of Google's Bigtable design)
Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multi-table selection; order by is under construction)
Avro for data serialization, applicable to MapReduce
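A minimal leader-selection sketch with the ZooKeeper Java client ("zkhost:2181" and the "/leader" path are placeholders): whichever process manages to create the ephemeral znode becomes the leader, and the znode disappears automatically if that process's session dies.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElection {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, null);
            try {
                // Ephemeral znodes live only as long as the session that
                // created them, so a crashed leader frees the role itself.
                zk.create("/leader", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                System.out.println("I am the leader");
            } catch (KeeperException.NodeExistsException e) {
                System.out.println("Another process is the leader");
            }
            // Keep zk open while holding the role; zk.close() releases it.
        }
    }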
How about other frameworks ??
Questions ???