Map Reduce In 5 Minutes

Page 1: Map Reduce In 5 Minutes

MapReduce

Page 2: Map Reduce In 5 Minutes

MapReduce reduced

Page 3: Map Reduce In 5 Minutes

Peta Mengurangi

Page 4: Map Reduce In 5 Minutes

Peta Mengurangi*

*Google Translate

Page 5: Map Reduce In 5 Minutes

the problem

Page 6: Map Reduce In 5 Minutes

lots of data

Page 7: Map Reduce In 5 Minutes

e.g. the entire interwebs

Page 8: Map Reduce In 5 Minutes

single computer not going to work

Page 9: Map Reduce In 5 Minutes

lots of computers

Page 10: Map Reduce In 5 Minutes

we have that

Page 11: Map Reduce In 5 Minutes

cluster programming

Page 12: Map Reduce In 5 Minutes

cluster programming = suck

Page 13: Map Reduce In 5 Minutes

MapReduce

Page 14: Map Reduce In 5 Minutes

makes the pain go away

Page 15: Map Reduce In 5 Minutes

2 main stages

Page 16: Map Reduce In 5 Minutes
Page 17: Map Reduce In 5 Minutes
Page 18: Map Reduce In 5 Minutes
Page 19: Map Reduce In 5 Minutes

map

Page 20: Map Reduce In 5 Minutes

map: process data on hosts

Page 21: Map Reduce In 5 Minutes

reduce

Page 22: Map Reduce In 5 Minutes

reduce: summarise the results

Page 23: Map Reduce In 5 Minutes

example: count words on lines

Page 24: Map Reduce In 5 Minutes

>>> reduce(operator.add, map(countWords, lines))

Page 25: Map Reduce In 5 Minutes

>>> reduce(operator.add, map(countWords, lines))

Page 26: Map Reduce In 5 Minutes

>>> reduce(operator.add, map(countWords, lines))
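A runnable take on the one-liner above, as a minimal sketch for Python 3 (where `reduce` lives in `functools`). The body of `countWords` and the contents of `lines` are not on the slides; both are assumptions made up for illustration.

```python
import operator
from functools import reduce

def countWords(line):
    # Assumed helper: number of whitespace-separated words on one line.
    return len(line.split())

# Made-up sample input, not from the talk.
lines = ["the quick brown fox", "jumps over", "the lazy dog"]

# map: count the words on each line; reduce: add the per-line counts together.
total = reduce(operator.add, map(countWords, lines))
print(total)
```

On one machine this is just a loop in disguise; the point of the slides that follow is what changes when `lines` no longer fits on a single computer.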

Page 27: Map Reduce In 5 Minutes

except in this case

Page 28: Map Reduce In 5 Minutes

lots of machines

Page 29: Map Reduce In 5 Minutes

typical cluster

Page 30: Map Reduce In 5 Minutes
Page 31: Map Reduce In 5 Minutes

O(10^3) machines, each with 2-8 GB RAM and local IDE disks

Page 32: Map Reduce In 5 Minutes

GFS distributes the data

Page 33: Map Reduce In 5 Minutes

process data on hosts
summarise results

Page 34: Map Reduce In 5 Minutes

split data into chunks
process data on hosts
summarise results

Page 35: Map Reduce In 5 Minutes

split data into chunks
allocate machines
process data on hosts
summarise results

Page 36: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
process data on hosts
summarise results

Page 37: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
summarise results

Page 38: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
summarise results

Page 39: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
send results to reducers
summarise results

Page 40: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
redo failed and stragglers
send results to reducers
summarise results

Page 41: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
redo failed and stragglers
send results to reducers
summarise results
output final results
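The data flow in the steps above can be imitated in a single process. This is only a toy sketch: there are no machines to allocate, no hosts to monitor, and nothing is redone. The names `split_into_chunks`, `run_mapreduce`, `word_mapper`, and `sum_reducer` are hypothetical and do not come from any real MapReduce library.

```python
from collections import defaultdict
from itertools import chain

def split_into_chunks(records, chunk_size):
    # "split data into chunks": one chunk per pretend map task.
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def run_mapreduce(records, mapper, reducer, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # "process data on hosts": run the mapper over every record of every chunk.
    mapped = chain.from_iterable(
        mapper(record) for chunk in chunks for record in chunk
    )
    # "send results to reducers": group the emitted values by key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # "summarise results": one reducer call per key.
    return {key: reducer(key, values) for key, values in grouped.items()}

def word_mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    # Add up all the counts emitted for one key.
    return sum(values)

counts = run_mapreduce(["a b a", "b c", "a"], word_mapper, sum_reducer)
print(counts)
```

Only `word_mapper` and `sum_reducer` are specific to the job; everything inside `run_mapreduce` is the part a framework would take over.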

Page 42: Map Reduce In 5 Minutes

MapReduce does the yukky stuff

Page 43: Map Reduce In 5 Minutes

split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
redo failed and stragglers
send results to reducers
summarise results
output final results

MapReduce

programmer

Page 44: Map Reduce In 5 Minutes

handles failures

Page 45: Map Reduce In 5 Minutes

handles stragglers
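One way to picture "redo failed" is a retry loop around each task. The sketch below is an assumption about the general idea, not a description of how MapReduce schedules work internally; `run_with_retries` and `flaky_count` are hypothetical names, and the failure is simulated. Stragglers are not modelled here.

```python
def run_with_retries(task, chunk, max_attempts=3):
    # Re-run a failed task on the same chunk; give up after max_attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise

calls = {"n": 0}

def flaky_count(chunk):
    # Hypothetical task that fails on its first call to simulate a dead host.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated host failure")
    return len(chunk)

result, attempts = run_with_retries(flaky_count, ["x", "y", "z"])
print(result, attempts)
```

Retrying like this is only safe because a map task can be rerun on the same chunk without side effects, which is part of why the programmer's code stays so small.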

Page 46: Map Reduce In 5 Minutes

a vanity search

Page 47: Map Reduce In 5 Minutes

% of refs to Anthony

Page 48: Map Reduce In 5 Minutes

% of refs to Anthony Baxter

Page 49: Map Reduce In 5 Minutes

count(‘Anthony Baxter’) / count(‘Anthony’)

Page 50: Map Reduce In 5 Minutes

C++ library

Page 51: Map Reduce In 5 Minutes

... with Python bindings, yay!

Page 52: Map Reduce In 5 Minutes

class AnthonyMapper(mrpython.Mapper):
    def Map(self, map_input):
        meCount = otherCount = 0
        docId = map_input.key()    # ignored - docid
        src = map_input.value()    # document source
        text = ExtractText(src).split()
        seenAnthony = False
        for word in text:
            if not seenAnthony:
                if word.lower() == 'anthony':
                    seenAnthony = True
            else:
                if word.lower() == 'baxter':
                    meCount += 1
                else:
                    otherCount += 1
                seenAnthony = False
        yield 'me', meCount
        yield 'other', otherCount

Page 53: Map Reduce In 5 Minutes

class AnthonyReducer(mrpython.Reducer):
    def Reducer(self, reduce_input):
        '''Passed a key (either 'me' or 'other') and a list of counts.
        Adds the counts and returns them.'''
        count = 0
        for val in reduce_input.values():
            count += int(val)    # accumulate into count, not the builtin sum
        yield count
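`mrpython` is not publicly available, so the classes above cannot be run as they stand. As a rough stand-in, the same counting logic can be tried locally with plain functions. `anthony_map`, `anthony_reduce`, and the sample `docs` are hypothetical, and `ExtractText` is replaced by treating each document as plain text.

```python
def anthony_map(text):
    # Same counting logic as the mapper above, applied to a plain string.
    me_count = other_count = 0
    seen_anthony = False
    for word in text.split():
        if not seen_anthony:
            if word.lower() == 'anthony':
                seen_anthony = True
        else:
            if word.lower() == 'baxter':
                me_count += 1
            else:
                other_count += 1
            seen_anthony = False
    yield 'me', me_count
    yield 'other', other_count

def anthony_reduce(counts):
    # Add up all the counts emitted for one key.
    return sum(int(c) for c in counts)

# Made-up documents, not from the talk.
docs = [
    "Anthony Baxter gave a talk",
    "Anthony Hopkins and Anthony Baxter",
    "no match here",
]

# Group mapper output by key, then reduce each key.
totals = {'me': [], 'other': []}
for doc in docs:
    for key, count in anthony_map(doc):
        totals[key].append(count)

me = anthony_reduce(totals['me'])
other = anthony_reduce(totals['other'])
print(me, other, me / (me + other))
```

The final ratio is the quantity on the earlier slide: references to Anthony Baxter as a fraction of all references to Anthony.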

Page 54: Map Reduce In 5 Minutes

the result:

Page 55: Map Reduce In 5 Minutes

the result:

about 1 in 4000

Page 56: Map Reduce In 5 Minutes

other uses for MapReduce

Page 57: Map Reduce In 5 Minutes

web link graphs

Page 58: Map Reduce In 5 Minutes

access logs

Page 59: Map Reduce In 5 Minutes

text analysis

Page 60: Map Reduce In 5 Minutes

Google News clustering

Page 61: Map Reduce In 5 Minutes

local search

Page 62: Map Reduce In 5 Minutes

road traffic

Page 63: Map Reduce In 5 Minutes

take speed samples

Page 64: Map Reduce In 5 Minutes
Page 65: Map Reduce In 5 Minutes

group by road segment

Page 66: Map Reduce In 5 Minutes
Page 67: Map Reduce In 5 Minutes

take the average

Page 68: Map Reduce In 5 Minutes

once per minute

Page 69: Map Reduce In 5 Minutes

output to a map layer
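The road traffic steps (take samples, group by road segment, take the average) fit the same map-then-reduce shape. A small local sketch follows; the segment ids, the speeds, and the function name `average_speed_by_segment` are made up for illustration, and nothing here reflects how the real pipeline is built.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical speed samples: (road segment id, speed in km/h).
samples = [
    ("seg-1", 60.0),
    ("seg-2", 30.0),
    ("seg-1", 50.0),
    ("seg-2", 40.0),
    ("seg-3", 80.0),
]

def average_speed_by_segment(samples):
    # Group by road segment (the key), then average each group (the reduce).
    by_segment = defaultdict(list)
    for segment, speed in samples:
        by_segment[segment].append(speed)
    return {segment: mean(speeds) for segment, speeds in by_segment.items()}

averages = average_speed_by_segment(samples)
print(averages)
```

Running something like this once per minute over fresh samples would give the per-segment values that feed a map layer.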

Page 70: Map Reduce In 5 Minutes
Page 71: Map Reduce In 5 Minutes

limitation: availability of data

Page 72: Map Reduce In 5 Minutes

MapReduce is pretty cool

Page 73: Map Reduce In 5 Minutes

for more information

Page 74: Map Reduce In 5 Minutes

“mapreduce paper” “gfs paper”

“google papers”

Page 75: Map Reduce In 5 Minutes

if you’d like to play

Page 76: Map Reduce In 5 Minutes

hadoop.apache.org

Page 77: Map Reduce In 5 Minutes

open source

Page 78: Map Reduce In 5 Minutes

Java implementation

Page 79: Map Reduce In 5 Minutes

HDFS

Page 80: Map Reduce In 5 Minutes