Map Reduce In 5 Minutes
MapReduce
MapReduce, reduced
Peta Mengurangi*
*google translate
the problem
lots of data
e.g. the entire interwebs
single computer: not going to work
lots of computers
we have that
cluster programming
cluster programming = suck
MapReduce
makes the pain go away
2 main stages
map
map: process data on hosts
reduce
reduce: summarise the results
example: count words on lines
>>> reduce(operator.add, map(countWords, lines))
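A runnable version of the one-liner above, as a minimal sketch. The slide does not show countWords or lines, so the definitions here are assumptions made for illustration. In Python 3, reduce lives in functools.

```python
import operator
from functools import reduce


def countWords(line):
    # Assumed helper: number of whitespace-separated words on one line.
    return len(line.split())


# Assumed sample input; the slide does not show the contents of `lines`.
lines = [
    "the quick brown fox",
    "jumps over",
    "the lazy dog",
]

# map: count the words on each line.
# reduce: add the per-line counts into a single total.
total = reduce(operator.add, map(countWords, lines))
print(total)
```

If lines can be empty, pass 0 as a third argument to reduce so there is a starting value.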
except in this case
lots of machines
typical cluster
O(10^3) machines, each with 2-8 GB RAM and local IDE disks
GFS distributes the data
split data into chunks
allocate machines
start processes
send data to mappers
process data on hosts
monitor hosts
redo failed and stragglers
send results to reducers
summarise results
output final results
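The steps above can be sketched in a single process. This is a toy illustration, not how the real library works: there are no machines, no monitoring and no retries, and every name here (split_into_chunks, word_count_mapper, sum_reducer, run_mapreduce) is hypothetical.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    # split data into chunks
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def word_count_mapper(chunk):
    # process data on hosts: emit (word, 1) for every word in the chunk
    for line in chunk:
        for word in line.split():
            yield word.lower(), 1


def sum_reducer(key, values):
    # summarise results: add up all the values emitted for one key
    return key, sum(values)


def run_mapreduce(lines, mapper, reducer, chunk_size=2):
    grouped = defaultdict(list)
    # In a real cluster each chunk would be sent to a different mapper host.
    for chunk in split_into_chunks(lines, chunk_size):
        for key, value in mapper(chunk):
            # send results to reducers: group the values by key
            grouped[key].append(value)
    # output final results: one reduced value per key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


lines = ["the cat sat", "the dog sat", "the end"]
result = run_mapreduce(lines, word_count_mapper, sum_reducer)
print(result)
```

The programmer-visible parts are the mapper and the reducer; run_mapreduce stands in for everything else on the list.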
MapReduce does the yukky stuff
MapReduce
programmer
handles failures
handles stragglers
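A rough sketch of what "redo failed" could look like from the outside, assuming a failed task simply gets run again on the same chunk. The names run_with_retries and FlakyWordCounter are hypothetical, the failure is simulated, and stragglers (slow tasks rather than failed ones) are not modelled here.

```python
def run_with_retries(task, chunk, max_attempts=3):
    """Run task(chunk); if it raises RuntimeError, try again.

    Returns (result, attempts_used). Re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise


class FlakyWordCounter:
    """Hypothetical map task that fails the first time it sees each chunk."""

    def __init__(self):
        self.seen = set()

    def __call__(self, chunk):
        key = tuple(chunk)
        if key not in self.seen:
            self.seen.add(key)
            raise RuntimeError("simulated host failure")
        # Second and later attempts succeed: count the words in the chunk.
        return sum(len(line.split()) for line in chunk)


task = FlakyWordCounter()
chunks = [["a b c", "d e"], ["f g"]]
outcomes = [run_with_retries(task, chunk) for chunk in chunks]
print(outcomes)
```

Redoing a task is only safe because a map task depends on nothing but its input chunk, so running it twice gives the same answer.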
a vanity search
% of refs to Anthony that are to Anthony Baxter
count('Anthony Baxter') / count('Anthony')
C++ library
...with Python bindings, yay!
import mrpython

class AnthonyMapper(mrpython.Mapper):
    def Map(self, map_input):
        meCount = otherCount = 0
        docId = map_input.key()     # ignored - docid
        src = map_input.value()     # document source
        text = ExtractText(src).split()
        seenAnthony = False
        for word in text:
            if not seenAnthony:
                if word.lower() == 'anthony':
                    seenAnthony = True
            else:
                # the previous word was 'anthony': look at the word after it
                if word.lower() == 'baxter':
                    meCount += 1
                else:
                    otherCount += 1
                seenAnthony = False
        yield 'me', meCount
        yield 'other', otherCount

class AnthonyReducer(mrpython.Reducer):
    def Reducer(self, reduce_input):
        '''Passed a key (either 'me' or 'other') and a list of counts.
        Adds the counts and returns them.'''
        count = 0
        for val in reduce_input.values():
            count += int(val)
        yield count
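The mrpython bindings are not something you can pip install, so here is a plain-Python sketch of the same counting logic for trying it locally. The function names (count_anthony, add_counts) and the sample documents are assumptions for illustration; only the word-pair logic follows the mapper above.

```python
def count_anthony(words):
    # Same state machine as the mapper: after seeing 'anthony',
    # the next word decides whether it counts as 'me' or 'other'.
    me_count = other_count = 0
    seen_anthony = False
    for word in words:
        if not seen_anthony:
            if word.lower() == 'anthony':
                seen_anthony = True
        else:
            if word.lower() == 'baxter':
                me_count += 1
            else:
                other_count += 1
            seen_anthony = False
    return {'me': me_count, 'other': other_count}


def add_counts(per_doc_counts):
    # Same job as the reducer: add up the counts for each key.
    totals = {'me': 0, 'other': 0}
    for counts in per_doc_counts:
        for key, value in counts.items():
            totals[key] += value
    return totals


# Made-up documents, standing in for the crawled pages.
docs = [
    "Anthony Baxter gave a talk",
    "Anthony Hopkins and Anthony Perkins are actors",
    "nobody called anthony",
]

totals = add_counts(count_anthony(doc.split()) for doc in docs)
share = totals['me'] / (totals['me'] + totals['other'])
print(totals, share)
```

An 'anthony' that is the last word of a document is followed by nothing, so it is counted in neither bucket, the same as in the mapper.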
the result: about 1 in 4000
other uses for MapReduce
web link graphs
access logs
text analysis
Google News clustering
local search
road traffic
take speed samples
group by road segment
take the average
once per minute
output to a map layer
limitation: availability of data
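The road traffic steps fit the same map / group / reduce shape. A minimal sketch for one batch of samples, assuming each sample is a (road segment, speed) pair; the segment ids, the speeds and the function names are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Assumed sample format: (road segment id, speed in km/h).
samples = [
    ("A1:km10", 62.0),
    ("A1:km10", 58.0),
    ("M4:km3", 95.0),
    ("A1:km11", 40.0),
    ("M4:km3", 105.0),
]


def map_sample(sample):
    # take speed samples: emit (segment, speed) so the segment is the key
    segment, speed = sample
    return segment, speed


def reduce_segment(segment, speeds):
    # take the average of every speed seen for one segment
    return segment, mean(speeds)


# group by road segment
grouped = defaultdict(list)
for segment, speed in map(map_sample, samples):
    grouped[segment].append(speed)

averages = dict(reduce_segment(seg, speeds) for seg, speeds in grouped.items())
print(averages)
```

Running this once per minute over the latest samples would give the values to draw on the map layer.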
MapReduce is pretty cool
for more information
“mapreduce paper” “gfs paper”
“google papers”
if you’d like to play
hadoop.apache.org
open source
Java implementation
HDFS