Why do we need Grid?
Storage --> a single disk cannot host all the data
Computation --> a single CPU cannot provide all the computing power needed
Parallel jobs --> serial execution is no longer a viable option
What we expect from a framework
Distributed storage
Job specification platform
Job splitting/merging
Job execution and monitoring
Basic attributes expected
Resource management
  Disk
  CPU
  Memory
  Network bandwidth
Fault tolerance
  Network failure
  Machine failure
  Job/code bugs
  …
Scalability
Hadoop Core
A separate distributed file system, based on the Google File System architecture --> HDFS
A separate job splitting and merging mechanism --> the MapReduce framework, built on top of the distributed file system
Provides a custom job specification mechanism --> InputFormat, Mapper, Partitioner, Combiner, Reducer, OutputFormat (wired together as in the sketch below)
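To make the pipeline concrete, here is a minimal sketch of wiring those pieces together with the org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are hypothetical classes (sketched under the MapReduce slides below), not part of Hadoop itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountJob.class);

            // Every stage of the pipeline is a pluggable class.
            job.setInputFormatClass(TextInputFormat.class);    // how input is split and read
            job.setMapperClass(WordCountMapper.class);         // map phase
            job.setPartitionerClass(HashPartitioner.class);    // routes keys to reducers
            job.setCombinerClass(WordCountReducer.class);      // local pre-aggregation
            job.setReducerClass(WordCountReducer.class);       // reduce phase
            job.setOutputFormatClass(TextOutputFormat.class);  // how output is written

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }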
HDFS attributes
Distributed, reliable, scalable
Optimized for streaming reads and very large data sets
Assumes write once, read many times
No local caching possible, due to large files and streaming reads
High data replication
Fits logically with MapReduce
Synchronized access to metadata --> namenode
Metadata (edit log, FsImage) stored in the namenode's local OS file system
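As a small illustration of the streaming-read model, a sketch of a client that copies one HDFS file to stdout; HDFS favors exactly this kind of sequential scan over random access:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads the cluster config
            FileSystem fs = FileSystem.get(conf);     // the configured HDFS instance
            InputStream in = null;
            try {
                in = fs.open(new Path(args[0]));      // sequential stream over the file
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }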
HDFS
[Architecture diagram copied from the HDFS design document]
MapReduce framework attributes
Fair isolation of tasks --> easy synchronization and failover... (a minimal mapper sketch follows)
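A minimal sketch of such an isolated task: the hypothetical WordCountMapper referenced in the job sketch above. It holds no state shared with any other task, so the framework can rerun it on any node after a failure.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Called once per input record; sees only its own split of the data.
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }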
MapReduce
[Data-flow diagrams copied from the Yahoo! Hadoop tutorial]
Fault tolerance goal
Hadoop assumes that at any given time at least one machine is down
HDFS
  Block-level replication (see the sketch below)
  Replicated and persistent metadata
  Rack awareness, and consideration of whole-rack failure
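A sketch of controlling the replication factor from client code, assuming the standard dfs.replication property (in practice it is usually set cluster-wide in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default replication factor for files this client creates.
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be changed per existing file; the
            // namenode re-replicates blocks in the background.
            fs.setReplication(new Path(args[0]), (short) 4);
        }
    }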
Fault tolerance goal contd..
MapReduce
  No dependency assumed between tasks
  Tasks from a failed node can be transferred to other nodes without any state information
  Mapper --> whole tasks are re-executed on other nodes
  Reducer --> only unexecuted tasks need to be transferred, since all completed results are already written to the output
Resource management goal
CPU / Memory
  Mechanisms are provided so that data can be streamed directly to the file descriptor --> no user-level operations for very large objects
  Optimized sorting is possible, so that the order can mostly be decided from the raw bytes without instantiating objects around them (see the comparator sketch below)
  …
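The byte-level sorting corresponds to Hadoop's RawComparator mechanism. A sketch for IntWritable keys, which would be registered with job.setSortComparatorClass(RawIntComparator.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparator;

    // Orders serialized IntWritable keys straight from their bytes;
    // the sort never has to instantiate the keys as objects.
    public class RawIntComparator extends WritableComparator {

        public RawIntComparator() {
            super(IntWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            int a = readInt(b1, s1); // an IntWritable is 4 big-endian bytes
            int b = readInt(b2, s2);
            return a < b ? -1 : (a == b ? 0 : 1);
        }
    }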
Resource management goal contd..
Bandwidth
  The HDFS architecture ensures that a read request is served from the nearest node holding a replica
  The MapReduce framework ensures that operations are executed as near to the data as possible --> moving operations is cheaper than moving data
  Optimized operations at every stage --> combiner (sketched below), data replication (parallel buffering and transfer from one node), ...
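A sketch of the hypothetical WordCountReducer used in the job sketch above. Because summing is associative and commutative, the same class can double as the combiner, collapsing map output on each node before it crosses the network:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Sums all counts for one word; run per-node as a combiner it
        // turns many (word, 1) pairs into a single (word, n) pair.
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }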
Scalability goal
Flat scalability --> adding or removing a node is fairly straightforward
Subprojects
ZooKeeper for small shared information (useful for synchronization, locks, leader selection, and many other sharing problems in distributed systems); a minimal sketch follows
HBase for semi-structured data (provides an implementation of Google's Bigtable design)
Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multi-table selection; order by is under construction)
Avro for data serialization, applicable to MapReduce
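A minimal leader-selection sketch with the ZooKeeper Java client ("zkhost:2181" and the "/leader" path are placeholders): whichever process manages to create the ephemeral znode becomes the leader, and the znode disappears automatically if that process's session dies.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElection {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, null);
            try {
                // Ephemeral znodes live only as long as the session that
                // created them, so a crashed leader frees the role itself.
                zk.create("/leader", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                System.out.println("I am the leader");
            } catch (KeeperException.NodeExistsException e) {
                System.out.println("Another process is the leader");
            }
            // Keep zk open while holding the role; zk.close() releases it.
        }
    }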
How about other frameworks ??
Questions ???