Big Data Systems




Big Data Parallelism

• Huge datasets

• crawled documents, web request logs, etc.

• Natural parallelism:

• can work on different parts of the data independently

• image processing, grep, indexing, many more

Challenges

• Parallelize the application

• Where to place input and output data?

• Where to place computation?

• How to communicate data? How to manage threads? How to avoid network bottlenecks?

• Balance computations

• Handle failures of nodes during computation

• Scheduling several applications that want to share the infrastructure

Goal of MapReduce

• To solve these distribution/fault-tolerance issues once, in a reusable library

• To shield the programmer from having to re-solve them for each program

• To obtain adequate throughput and scalability

• To provide the programmer with a conceptual framework for designing their parallel program

MapReduce

• Overview:

• Partition the large dataset into M splits

• Run Map on each partition, which produces R local partitions, using a partition function (a sketch follows below)

• Hidden intermediate shuffle phase

• Run Reduce on each intermediate partition, which produces R output files
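
A minimal sketch of what such a partition function might look like (hash partitioning, the common default; the function name is illustrative):

def partition(key, R):
    # Route an intermediate key to one of R reduce partitions.
    # Every mapper applies the same function, so all values for a
    # given key land at the same reducer.
    return hash(key) % R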

Details

• Input values: a set of key-value pairs

• The job will read chunks of key-value pairs

• "key-value" pairs are a good enough abstraction

• Map(key, value):

• The system will execute this function on each key-value pair

• Generates a set of intermediate key-value pairs

• Reduce(key, values):

• Intermediate key-value pairs are sorted

• The Reduce function is executed on these intermediate key-value pairs

Count words in web pages

Map(key, value) {
    // key is a URL; value is the content of the page
    For each word W in the content
        Generate(W, 1);
}

Reduce(key, values) {
    // key is a word (W); values are basically all 1s
    Sum = sum of all 1s in values
    // generate word-count pairs
    Generate(key, Sum);
}
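
To make the hidden shuffle concrete, here is a minimal single-process simulation of this job in Python (illustrative only; a real MapReduce distributes each of these steps):

from collections import defaultdict

def map_fn(url, content):
    # emit (word, 1) for every word on the page
    for word in content.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum the 1s for one word
    yield (word, sum(counts))

def run_job(inputs):
    # the hidden shuffle: group intermediate pairs by key
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return [pair for k in sorted(groups)
                 for pair in reduce_fn(k, groups[k])]

print(run_job([("url1", "a b a"), ("url2", "b c")]))
# -> [('a', 2), ('b', 2), ('c', 1)]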

Reverse web-link graph

Go to Google advanced search: "find pages that link to the page:" cnn.com

Map(key, value) {
    // key = url (the source page); value = content
    For each target URL linked to in the content
        Generate(target, url);
}

Reduce(key, values) {
    // key = target URL; values = all URLs that point to the target URL
    Generate(key, list of values);
}

• Question: how do we implement "join" in MapReduce?

• Imagine you have a log table L and some other table R that contains, say, user information

• Perform Join(L.uid == R.uid)

• Say size of L >> size of R (one possible approach is sketched below)

• Bonus: consider real-world Zipf distributions
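
One standard answer (not spelled out on the slide) is a map-side "broadcast" join: because R is small, ship a full copy of R to every Map task and join there, so the huge L is never shuffled. A minimal Python-style sketch, assuming R fits in each mapper's memory (load_R is a hypothetical loader):

r_by_uid = dict(load_R())   # hypothetical: uid -> user info, replicated to every mapper

def map_fn(_, log_record):
    # inner join on uid, done entirely map-side; no Reduce phase needed
    uid = log_record["uid"]
    if uid in r_by_uid:
        yield (uid, (log_record, r_by_uid[uid]))

The alternative, a reduce-side join that shuffles both tables on uid, suffers under Zipf-distributed keys: the reducer that owns the hottest uid receives a disproportionate share of L.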

Comparisons

• Worth comparing it to other programming models:

• distributed shared memory systems

• bulk synchronous parallel programs

• key-value storage accessed by general programs

• MapReduce has a more constrained programming model

• The other models are latency sensitive and have poor throughput efficiency

• MapReduce provides for easy fault recovery

Implementation

• Depends on the underlying hardware: shared memory, message passing, NUMA shared memory, etc.

• Inside Google:

• commodity workstations

• commodity networking hardware (1 Gbps, now 10 Gbps, at the node level, and much smaller bisection bandwidth)

• cluster = 100s or 1000s of machines

• storage is through GFS

MapReduce Input

• Where does input come from?

• Input is striped + replicated over GFS in 64 MB chunks

• But in fact Map always reads from a local disk

• They run the Maps on the GFS server that holds the data

• Tradeoff:

• Good: Map reads at disk speed (local access)

• Bad: only two or three choices of where a given Map can run

• potential problem for load balance, stragglers

Intermediate Data

• Where does MapReduce store intermediate data?

• On the local disk of the Map server (not in GFS)

• Tradeoff:

• Good: a local disk write is faster than writing over the network to a GFS server

• Bad: only one copy; a potential problem for fault tolerance and load balance

Output Storage

• Where does MapReduce store output?

• In GFS, replicated, with a separate file per Reduce task

• So output requires network communication -- slow

• It can then be used as input for a subsequent MapReduce

Question

• What are the scalability bottlenecks for MapReduce?

Scaling

• Map calls probably scale

• but the input might not be infinitely partitionable, and small input/intermediate files incur high overheads

• Reduce calls probably scale

• but you can't have more workers than keys, and some keys could have more values than others

• The network may limit scaling

• Stragglers could be a problem

Fault Tolerance

• The main idea: Map and Reduce are deterministic, functional, and independent

• so MapReduce can deal with failures by re-executing

• What if a worker fails while running Map?

• Can we restart just that Map on another machine?

• Yes: GFS keeps a copy of each input split on 3 machines

• The master knows, and tells the Reduce workers where to find intermediate files

Fault Tolerance

• If a Map finishes, and then that worker fails, do we need to re-run that Map?

• The intermediate output is now inaccessible on the worker's local disk.

• Thus we need to re-run the Map elsewhere, unless all Reduce workers have already fetched that Map's output.

• What if the Map had started to produce output, then crashed?

• Need to ensure that Reduce does not consume the output twice

• What if a worker fails while running Reduce?

Role of the Master

• Tracks the state of each worker machine (pings each machine)

• Reschedules work corresponding to failed machines

• Orchestrates the passing of locations to Reduce functions

Load Balance

• What if some Map machines are faster than others?

• Or some input splits take longer to process?

• Solution: many more input splits than machines

• The master hands out more Map tasks as machines finish

• Thus faster machines do a bigger share of the work

• But there's a constraint:

• Want to run each Map task on a machine that stores its input data

• GFS keeps 3 replicas of each input data split

• so there are only three efficient choices of where to run each Map task

Stragglers

• Often one machine is slow at finishing the very last task

• bad hardware, overloaded with some other work

• Load balance only balances newly assigned tasks

• Solution: always schedule multiple copies of the very last tasks! (a sketch follows below)
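
A minimal sketch of this backup-task idea; the scheduler below is illustrative, not the paper's implementation:

import random

def next_task(pending, in_flight):
    # normal case: hand an idle worker an unassigned task
    if pending:
        return pending.pop()
    # tail of the job: re-issue a copy of a still-running task;
    # whichever copy finishes first wins, the rest are discarded
    if in_flight:
        return random.choice(list(in_flight))
    return None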

How many MR tasks?

• The paper uses M = 10x the number of workers, R = 2x.

• More =>

• finer-grained load balance.

• less redundant work for straggler reduction.

• spreads the tasks of a failed worker over more machines.

• overlaps Map and shuffle, shuffle and Reduce.

• Less => big intermediate files with less overhead.

• M and R may also be constrained by how data is striped in GFS (e.g., 64 MB chunks)

Discussion

• What are the constraints imposed on the Map and Reduce functions?

• How would you like to expand the capability of MapReduce?

MapReduce Criticism

• "Giant step backwards" in the programming model

• Sub-optimal implementation

• "Not novel at all"

• Missing most of the DB features

• Incompatible with all of the DB tools

Comparison to Databases

• Huge source of controversy; claims:

• parallel databases have much more advanced data processing support that leads to much more efficiency

• they support an index, so selection is accelerated

• they provide query optimization

• parallel databases support a much richer semantic model

• they support a schema, and sharing across apps

• they support SQL, efficient joins, etc.

Where does MR win?

• Scaling

• Loading data into the system

• Fault tolerance (partial restarts)

• Approachability

Spark Motivation

• MR problems

• cannot support complex applications efficiently

• cannot support interactive applications efficiently

• Root cause

• inefficient data sharing

In MapReduce, the only way to share data across jobs is stable storage -> slow!

Motivation

Goal: In-Memory Data Sharing

Challenge

• How to design a distributed memory abstraction that is both fault tolerant and efficient?

Other options

• Existing storage abstractions have interfaces based on fine-grained updates to mutable state

• e.g., RAMCloud, databases, distributed memory, Piccolo

• Requires replicating data or logs across nodes for fault tolerance

• Costly for data-intensive apps

• 10-100x slower than a memory write

RDD Abstraction

• Restricted form of distributed shared memory

• immutable, partitioned collection of records

• can only be built through coarse-grained deterministic transformations (map, filter, join, ...)

• Efficient fault tolerance using lineage

• Log the coarse-grained operations instead of fine-grained data updates

• An RDD has enough information about how it was derived from other datasets

• Recompute lost partitions on failure


Fault-tolerance

Design Space

Operations

• Transformations (e.g., map, filter, groupBy, join)

• Lazy operations that build RDDs from other RDDs

• Actions (e.g., count, collect, save)

• Return a result or write it to storage

Example: Mining Console Logs

Load error messages from a log into memory, then interactively search:

lines = spark.textFile("hdfs://...")                    # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.persist()

messages.filter(lambda s: "foo" in s).count()           # Action
messages.filter(lambda s: "bar" in s).count()
...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split('\t')[2])

Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = contains(...)) -> MappedRDD (func = split(...))

Lineage

• Spark uses the lineage to schedule jobs

• Transformations on the same partition form a stage

• Joins, for example, are a stage boundary

• need to reshuffle data

• A job runs one stage at a time

• transformations are pipelined within a stage

• Schedule a job where its RDD partitions are

Lineage & Fault Tolerance

• Great opportunity for efficient fault tolerance

• Let's say one machine fails

• Want to recompute only its state

• The lineage tells us what to recompute

• Follow the lineage to identify all the partitions needed

• Recompute them

• For the last example, identify the missing partitions of lines

• All dependencies are "narrow"; each partition depends on one parent partition

• Need to read the missing partition of lines and recompute the transformations (a toy sketch follows below)
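
A toy sketch of lineage-based recovery in Python (illustrative names, not Spark's internals): each RDD records its parent and the per-partition function applied to it, so with narrow dependencies a lost partition can be rebuilt from its parent's matching partition alone.

class SourceRDD:
    def __init__(self, partitions):
        self.partitions = partitions
    def compute(self, i):
        return self.partitions[i]          # re-read split i from stable storage

class RDD:
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn  # the lineage
        self.cache = {}                    # partition index -> materialized data
    def compute(self, i):
        if i not in self.cache:            # lost, or never computed
            # narrow dependency: partition i needs only the parent's partition i
            self.cache[i] = self.fn(self.parent.compute(i))
        return self.cache[i]

lines = SourceRDD([["ERROR\ta\tx", "INFO\tb\ty"], ["ERROR\tc\tz"]])
errors = RDD(lines, lambda p: [s for s in p if "ERROR" in s])
messages = RDD(errors, lambda p: [s.split("\t")[2] for s in p])
print(messages.compute(0))   # ['x']
messages.cache.pop(0)        # simulate losing partition 0
print(messages.compute(0))   # rebuilt via lineage: ['x']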

Fault Recovery

Example: PageRank

Optimizing Placement

• links & ranks are repeatedly joined

• Can co-partition them (e.g., hash both on URL) -- a sketch follows below

• Can also use app knowledge, e.g., hash on DNS name
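
A minimal PySpark-style sketch of the co-partitioning (illustrative; assumes links is a pair RDD of (url, list of neighbor urls)):

# Hash-partition links once and keep it in memory; ranks inherits the same
# partitioning, so the repeated join never reshuffles links over the network.
links = links.partitionBy(64).persist()
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):   # PageRank iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda r: 0.15 + 0.85 * r)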

PageRank Performance

TensorFlow: System for ML

• Open source; lots of developers, external contributors

• Used in: RankBrain (ranking results), Photos (image recognition), SmartReply (automatic email responses)

Three types of ML

• Large-scale training: huge datasets, generate models

• Google's previous DistBelief ran on 100s of machines

• Low-latency inference: running models in data centers, on phones, etc.

• custom engines

• Testing new ideas

• single-node flexible systems (Torch, Theano)

TensorFlow

• A common way to write programs

• Dataflow + tensors

• Mutable state

• Simple mathematical operations

• Automatic differentiation

Background: NN Training

• Take an input image

• Compute the loss function (forward pass)

• Compute the error gradients (backward pass)

• Update the weights

• Repeat (a sketch of this loop follows below)
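
A minimal NumPy sketch of this loop for a single linear layer (illustrative; not TensorFlow code):

import numpy as np

X = np.array([[1.0], [2.0], [3.0]])   # toy inputs
y = np.array([[2.0], [4.0], [6.0]])   # targets: y = 2x
W = np.zeros((1, 1))                  # the weights

for step in range(200):
    pred = X @ W                              # forward pass
    loss = ((pred - y) ** 2).mean()           # loss function
    grad = 2 * X.T @ (pred - y) / len(X)      # backward pass: dloss/dW
    W -= 0.05 * grad                          # update weights

print(W)   # converges to roughly [[2.0]]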

Computation is a DFG

Example Code


Parameter Server Architecture

Stateless workers, stateful parameter servers (DHT)

Commutative updates to the parameter server (a sketch follows below)
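
A minimal sketch of the architecture (illustrative, not TensorFlow's API): workers are stateless and only push gradients; the server applies them in whatever order they arrive, which is safe because the updates commute.

class ParameterServer:
    def __init__(self, dim):
        self.w = [0.0] * dim                  # the only mutable state

    def push(self, grad, lr=0.01):
        # commutative update: the order of workers' pushes does not matter
        for i, g in enumerate(grad):
            self.w[i] -= lr * g

    def pull(self):
        return list(self.w)                   # workers fetch the current weights

def worker(ps, batch, grad_fn):
    # stateless: read weights, compute a gradient on one batch, push it back
    ps.push(grad_fn(ps.pull(), batch))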

TensorFlow

• Flexible architecture for mapping operators and parameter servers to different devices

• Supports multiple concurrent executions on overlapping subgraphs of the overall graph

• Individual vertices may have mutable state that can be shared between different executions of the graph

TensorFlow handles the glue

Synchrony?

• Asynchronous execution is sometimes helpful, and addresses stragglers

• Asynchrony causes consistency problems

• TensorFlow pursues synchronous training

• but adds k backup machines to reduce the straggler problem

• Uses domain-specific knowledge to enable this optimization

Open Research Problems

• Automatic placement: dataflow is a great mechanism, but it is not clear how to use it appropriately

• mutable state is split round-robin across parameter server nodes; stateless tasks are replicated on GPUs as far as they fit, with the rest on CPUs

• How to take the dataflow representation and generate more efficient code?