Map Reduce


Page 1: Map Reduce

Why do we need a grid?

Storage --> a single disk cannot host all the data
Computation --> a single CPU cannot provide all the computing power needed
Parallel jobs --> serial execution is no longer a viable option

Page 2: Map Reduce

What we expect from a framework

Distributed storage
Job specification platform
Job splitting/merging
Job execution and monitoring

Page 3: Map Reduce

Basic attributes expected

Resource management
- Disk
- CPU
- Memory
- Network bandwidth

Fault tolerance
- Network failure
- Machine failure
- Job/code bugs
- ...

Scalability

Page 4: Map Reduce

Hadoop Core

A separate distributed file system based on a Google File System-style architecture --> HDFS

A separate job splitting and merging mechanism: the MapReduce framework, built on top of the distributed file system

Provides a custom job specification mechanism --> InputFormat, Mapper, Partitioner, Combiner, Reducer, OutputFormat
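The stages of that job specification can be sketched as a toy, in-memory pipeline in plain Python (this mirrors the concepts, not Hadoop's actual Java API; all function names here are illustrative):

```python
from collections import defaultdict

# Toy word-count job walking through the MapReduce stages:
# mapper -> combiner -> partitioner -> reducer.

def mapper(line):
    # An InputFormat would normally split the input; here each line is a record.
    for word in line.split():
        yield word, 1

def combiner(pairs):
    # Pre-aggregate map output locally to cut shuffle traffic.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts.items()

def partitioner(key, num_reducers):
    # Route each key to a reducer, like Hadoop's default hash partitioner.
    return hash(key) % num_reducers

def reducer(key, values):
    return key, sum(values)

def run_job(lines, num_reducers=2):
    # Map + combine per "task", then shuffle into reducer partitions.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            partitions[partitioner(key, num_reducers)][key].append(value)
    # Reduce each partition independently, as separate reduce tasks would.
    return dict(reducer(k, vs) for part in partitions for k, vs in part.items())

print(run_job(["a b a", "b c"]))  # counts: a=2, b=2, c=1
```

Because reduce tasks only see their own partition, the partitioner must send every occurrence of a key to the same reducer, which is exactly why hashing on the key works.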

Page 5: Map Reduce

HDFS attributes

- Distributed, reliable, scalable
- Optimized for streaming reads and very large data sets
- Assumes write-once, read-many access
- No local caching possible, due to large files and streaming reads
- High data replication; fits logically with MapReduce
- Synchronized access to metadata --> NameNode
- Metadata (edit log, FSImage) stored in the NameNode's local OS file system

Page 6: Map Reduce

HDFS

Copied from the HDFS design document

Page 7: Map Reduce

MapReduce framework attributes

Fair isolation --> easy synchronization and failover...

Page 8: Map Reduce

MapReduce

Copied from the Yahoo! tutorial

Page 9: Map Reduce

Copied from the Yahoo! tutorial

Page 10: Map Reduce

Fault tolerance goal

Hadoop assumes that at any given time at least one machine is down.

HDFS
- Block-level replication
- Replicated and persistent metadata
- Rack awareness and consideration of whole-rack failure
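Rack awareness shapes where block replicas go. The sketch below is a toy Python version of HDFS's default three-replica placement (one copy on the writer's node, two on nodes in a single different rack, so a whole-rack failure still leaves a copy); the cluster layout and names are made up for illustration:

```python
import random

# Toy sketch of rack-aware placement for 3 replicas. Not Hadoop code:
# the real BlockPlacementPolicy also weighs load, free space, etc.

def place_replicas(writer, nodes_by_rack):
    """Return 3 node names for a block written from `writer` = (rack, node)."""
    local_rack, local_node = writer
    replicas = [local_node]                       # replica 1: the local node
    remote_rack = random.choice(
        [r for r in nodes_by_rack if r != local_rack])
    replicas += random.sample(nodes_by_rack[remote_rack], 2)  # replicas 2 and 3
    return replicas

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(("rack1", "n1"), cluster))
```

Keeping two of the three replicas in one remote rack is a deliberate trade: it tolerates the loss of either rack while using only one cross-rack transfer on the write path.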

Page 11: Map Reduce

Fault tolerance goal, contd.

MapReduce
- No dependencies are assumed between tasks
- Tasks from a failed node can be transferred to other nodes without any state information
- Mapper --> whole tasks are re-executed on other nodes
- Reducer --> only unexecuted tasks need to be transferred, since all completed results have already been written to output

Page 12: Map Reduce

Resource management goal

CPU / memory
- Mechanisms are provided so that data can be streamed directly to the file descriptor --> no user-level operations for very large objects
- Optimized sorting is possible: the order can mostly be decided from the raw bytes, without instantiating objects around them
- ...
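The "decide the order from the bytes" idea (Hadoop exposes it as RawComparator) can be illustrated in Python: if keys are serialized so that byte order matches logical order, sorting never needs to deserialize anything. The encoding below is an assumption chosen for the demo:

```python
import struct

# Toy raw-byte sorting: big-endian unsigned encoding preserves numeric
# order under plain byte comparison (for non-negative integers), so the
# sort touches only bytes, never reconstructed objects.

def encode_key(n):
    return struct.pack(">I", n)

records = [encode_key(n) for n in (42, 7, 100, 7)]
records.sort()                      # compares raw bytes only
print([struct.unpack(">I", r)[0] for r in records])  # [7, 7, 42, 100]
```

Avoiding per-record object construction matters at shuffle time, where the framework may sort millions of serialized keys per task.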

Page 13: Map Reduce

Resource management goal, contd.

Bandwidth
- The HDFS architecture ensures that a read request is served from the nearest node (thanks to replication)
- The MapReduce framework ensures that operations are executed as close to the data as possible --> moving computation is cheaper than moving data
- Optimized operations at every stage --> combiner, data replication (parallel buffering and transfer from one node), ...

Page 14: Map Reduce

Scalability goal

Flat scalability --> adding or removing a node is fairly straightforward

Page 15: Map Reduce

Subprojects

- ZooKeeper for small shared state (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems)
- HBase for semi-structured data (an implementation of Google's Bigtable design)
- Hive for ad hoc query analysis (currently supports insertion into multiple tables, GROUP BY, and multi-table selection; ORDER BY is under construction)
- Avro for data serialization, applicable to MapReduce

Page 16: Map Reduce

What about other frameworks?

Page 17: Map Reduce

Questions ???