
10/3/17

CompSci 516: Database Systems

Lecture 12: Map-Reduce and Spark

Instructor: Sudeepa Roy

Duke CS, Fall 2017

Announcements

• Practice midterm posted on Sakai
  – First prepare and then attempt!
• Midterm next Wednesday 10/11 in class
  – Closed book/notes, no electronic devices
  – Everything until and including Lecture 12 included

Reading Material

• Recommended (optional) readings:
  – Chapter 2 (Sections 1, 2, 3) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html
  – Original Google MR paper by Jeff Dean and Sanjay Ghemawat, OSDI '04: http://research.google.com/archive/mapreduce.html
  – "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia et al., 2012 (see course website)

Acknowledgement: Some of the following slides have been borrowed from Prof. Shivnath Babu, Prof. Dan Suciu, Prajakta Kalmegh, and Junghoon Kang

MapReduce


Big Data

• It cannot be stored on one machine
  – store the data sets on multiple machines: Google File System
• It cannot be processed on one machine
  – parallelize computation on multiple machines: MapReduce

Ack: Slide by Junghoon Kang

The Map-Reduce Framework

• Google published the MapReduce paper in OSDI 2004, a year after the Google File System paper
• A high-level programming paradigm
  – allows many important data-oriented processes to be written simply
• Processes large data by:
  – applying a function to each logical record in the input (map)
  – categorizing and combining the intermediate results into summary values (reduce)


Where does Google use MapReduce?

Input → MapReduce → Output

• Input: crawled documents, web request logs
• Output: inverted indices, graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a day

Ack: Slide by Junghoon Kang

Storage Model

• Data is stored in large files (TB, PB)
  – e.g. market-basket data (more when we do data mining)
  – or web data
• Files are divided into chunks
  – typically many MB (e.g. 64 MB)
  – sometimes each chunk is replicated for fault tolerance (later in distributed DBMS)

Map-Reduce Steps

• Input is typically (key, value) pairs
  – but could be objects of any type
• Map and Reduce are performed by a number of processes
  – physically located on some processors

[Diagram: Input key-value pairs → Map → Shuffle (sort by key; same key goes to the same reducer) → Reduce → output lists]

Map-Reduce Steps

1. Read data
2. Map
   – extract some info of interest in (key, value) form
3. Shuffle and sort
   – send same keys to the same reduce process
4. Reduce
   – operate on the values of the same key
   – e.g. transform, aggregate, summarize, filter
5. Output the results (key, final-result)

Simple Example: Map-Reduce

• Word counting
• Inverted indexes

Ack: Slide by Prof. Shivnath Babu
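The word-counting example can be sketched in plain Python, simulating the three phases on a single machine. This is only an illustration of the programming model, not the Hadoop or Google API; the function names and the toy `docs` input are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit one (word, 1) pair per word occurrence
    for word in text.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # Reduce: sum the 1s emitted for each word
    return (word, sum(counts))

docs = {1: "the cat sat", 2: "the dog sat"}
pairs = [p for doc_id, text in docs.items() for p in map_fn(doc_id, text)]
result = dict(reduce_fn(w, cs) for w, cs in shuffle(pairs).items())
# result == {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real deployment, the map calls run in parallel on chunks of the input and the shuffle moves data across the network; only the two user-written functions change per application.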

Map Function

• Each map process works on a chunk of data
• Input: (input-key, value)
• Output: (intermediate-key, value) – may not be the same as the input key/value
• Example: list all docids containing a word
  – output of map: (word, docid) – emits each such pair
  – word is key, docid is value
  – duplicate elimination can be done at the reduce phase


Reduce Function

• Input: (intermediate-key, list-of-values-for-this-key)
  – list can include duplicates
  – each map process can leave its output on the local disk; the reduce process can retrieve its portion
• Output: (output-key, final-value)
• Example: list all docids containing a word
  – output will be a list of (word, [doc-id1, doc-id5, ….])
  – if the count is needed, reduce counts #docs; output will be a list of (word, count)
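The inverted-index example from the Map Function and Reduce Function slides can be put together in a plain-Python sketch (a hypothetical single-machine simulation, not a real MR framework): map emits (word, docid) pairs, and reduce eliminates duplicate docids per word.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit (word, doc_id) for each word occurrence in the document
    for word in text.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # Reduce: eliminate duplicates; sorting just makes output deterministic
    return (word, sorted(set(doc_ids)))

docs = {1: "error in disk", 5: "disk is full", 7: "no error"}
groups = defaultdict(list)
for doc_id, text in docs.items():      # map phase
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)         # shuffle: group docids by word
index = dict(reduce_fn(w, ds) for w, ds in groups.items())
# index["disk"] == [1, 5], index["error"] == [1, 7]
```

If a count is needed instead of the docid list, the reduce function would return `(word, len(set(doc_ids)))`.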

Example Problem: MapReduce

Explain how the query will be executed in MapReduce:

SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a

Specify the computation performed in the map and the reduce functions

Map

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Each map task
  – scans a block of R
  – calls the map function for each tuple
  – the map function applies the selection predicate to the tuple
  – for each tuple satisfying the selection, it outputs a record with key = a and value = b
• When a map task scans multiple relations, it needs to output something like key = a and value = ('R', b), which includes the relation name 'R'

Shuffle

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• The MapReduce engine reshuffles the output of the map phase and groups it on the intermediate key, i.e. the attribute a
• Note that the programmer has to write only the map and reduce functions; the shuffle phase is done by the MapReduce engine (although the programmer can rewrite the partition function), but you should still mention this in your answers
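The default partition function that decides which reduce task receives a key is typically a hash of the key modulo the number of reducers. A minimal sketch (the exact hash a real engine uses differs; the names here are illustrative):

```python
def partition(key, num_reducers):
    # Every map task applies the same function, so all pairs with
    # the same key are routed to the same reduce task.
    return hash(key) % num_reducers

pairs = [(3, 10), (6, 2), (3, 5)]
buckets = {r: [] for r in range(4)}   # 4 hypothetical reducers
for k, v in pairs:
    buckets[partition(k, 4)].append((k, v))
# both (3, 10) and (3, 5) land in the same bucket
```

This is the piece the programmer can override, e.g. to use a range partitioner instead of a hash partitioner.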

Reduce

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Each reduce task
  – computes the aggregate value max(b) = topb for each group (i.e. a) assigned to it (by calling the reduce function)
  – outputs the final results: (a, topb)
• Multiple aggregates can be output by the reduce phase, like key = a and value = (sum(b), min(b)), etc.
• Sometimes a second (third, etc.) level of Map-Reduce phases might be needed
• A local combiner can be used to compute a local max before data gets reshuffled (in the map tasks)
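Putting the three phases together for this query, a plain-Python simulation (illustrative only; `R` here is a hypothetical list of (a, b) tuples standing in for the relation):

```python
from collections import defaultdict

def map_fn(tup):
    a, b = tup
    if a > 0:              # WHERE a > 0: selection applied in the map
        yield (a, b)       # key = a, value = b

def reduce_fn(a, bs):
    return (a, max(bs))    # max(b) as topb for the group with key a

R = [(1, 10), (2, 7), (-3, 99), (1, 42), (2, 1)]
groups = defaultdict(list)             # shuffle: group values of b by a
for t in R:
    for a, b in map_fn(t):
        groups[a].append(b)
result = dict(reduce_fn(a, bs) for a, bs in groups.items())
# result == {1: 42, 2: 7}; the tuple with a = -3 was filtered out
```

Since max is associative, a combiner could apply `max` to each map task's local output first, shrinking the data that gets reshuffled.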

More Terminology

• A Map-Reduce "Job"
  – e.g. count the words in all docs
  – complex queries can have multiple MR jobs
• Map or Reduce "Tasks"
  – a group of map or reduce "functions"
  – scheduled on a single "worker"
• Worker
  – a process that executes one task at a time
  – one per processor, so 4-8 per machine
• A master controller
  – divides the data into chunks
  – assigns different processors to execute the map function on each chunk
  – other/same processors execute the reduce functions on the outputs of the map functions

However, there is no uniform terminology across systems.

Ack: Slide by Prof. Dan Suciu


Why is Map-Reduce Popular?

• Distributed computation before MapReduce
  – how to divide the workload among multiple machines?
  – how to distribute data and program to other machines?
  – how to schedule tasks?
  – what happens if a task fails while running?
  – … and … and ...
• Distributed computation after MapReduce
  – how to write the Map function?
  – how to write the Reduce function?
• Developers' tasks made easy

Ack: Slide by Junghoon Kang

Handling Fault Tolerance in MR

• Although the probability of any one machine failing is low, some machine failing among thousands of machines is common
• Worker failure
  – The master sends a heartbeat to each worker node
  – If a worker node fails, the master reschedules the tasks handled by that worker
• Master failure
  – The whole MapReduce job gets restarted through a different master

Ack: Slide by Junghoon Kang

Other aspects of MapReduce

• Locality
  – The input data is managed by GFS
  – Choose the cluster of MapReduce machines such that those machines contain the input data on their local disk
  – We can conserve network bandwidth
• Task granularity
  – It is preferable to have the number of tasks be a multiple of the number of worker nodes
  – The smaller the partition size, the faster the failover and the better the granularity in load balancing, but it incurs more overhead
  – Need a balance
• Backup tasks
  – In order to cope with a "straggler", the master schedules backup executions of the remaining in-progress tasks

Ack: Slide by Junghoon Kang

Apache Hadoop

• Apache Hadoop has open-source versions of GFS and MapReduce
  – GFS -> HDFS (Hadoop Distributed File System)
  – Google MapReduce -> Hadoop MapReduce
• You can download the software and implement your own MapReduce applications

Ack: Slide by Junghoon Kang

MapReduce Pros and Cons

• MapReduce is good for off-line batch jobs on large data sets
• MapReduce is not good for iterative jobs, due to high I/O overhead: each iteration needs to read/write data from/to GFS
• MapReduce is bad for jobs on small datasets and jobs that require a low-latency response

Ack: Slide by Junghoon Kang

Spark



What is Spark?

• Not a modified version of Hadoop
• A separate, fast, MapReduce-like engine
  – In-memory data storage for very fast iterative queries
  – General execution graphs and powerful optimizations
  – Up to 40x faster than Hadoop
  – Up to 100x faster (2-10x on disk)
• Compatible with Hadoop's storage APIs
  – Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

A distributed in-memory large-scale data processing engine!

Ack: Slide by Prajakta Kalmegh

Applications (Big Data Analysis)

• In-memory analytics & anomaly detection (Conviva)
• Interactive queries on data streams (Quantifind)
• Exploratory log analysis (Foursquare)
• Traffic estimation w/ GPS data (Mobile Millennium)
• Twitter spam classification (Monarch)
• ...

Ack: Slide by Prajakta Kalmegh

Why a New Programming Model?

• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
  – More complex, multi-stage iterative applications (graph algorithms, machine learning)
  – More interactive ad-hoc queries
  – More real-time online processing
• All three of these apps require fast data sharing across parallel jobs

NOTE: What were the workarounds in the MR world? YSmart [1], Stubby [2], PTF [3], HaLoop [4], Twister [5]

Ack: Slide by Prajakta Kalmegh

Data Sharing in MapReduce

[Diagram: iterative job — Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …; interactive queries — Input → HDFS read, then query 1 / query 2 / query 3 each producing result 1 / result 2 / result 3]

Slow due to replication, serialization, and disk IO

Ack: Slide by Prajakta Kalmegh

Data Sharing in Spark

[Diagram: iterative job — Input → iter. 1 → iter. 2 → …, with intermediate data kept in distributed memory; interactive queries — Input → one-time processing into distributed memory → query 1 / query 2 / query 3]

10-100x faster than network and disk

Ack: Slide by Prajakta Kalmegh

RDD: Spark Programming Model

• Key idea: Resilient Distributed Datasets (RDDs)
  – Distributed collections of objects that can be cached in memory or stored on disk across cluster nodes
  – Manipulated through various parallel operators
  – Automatically rebuilt on failure (How? Use lineage)

Ack: Slide by Prajakta Kalmegh


Additional Slides on Spark (Optional Reading)

Ack: The following slides are by Prajakta Kalmegh

More on RDDs

• Transformations: created through deterministic operations on either
  ‣ data in stable storage, or
  ‣ other RDDs
• Lineage: an RDD has enough information about how it was derived from other datasets
• Immutable: an RDD is a read-only, partitioned collection of records
  ‣ Checkpointing of RDDs with long lineage chains can be done in the background
  ‣ Mitigating stragglers: we can use backup tasks to recompute transformations on RDDs
• Persistence level: users can choose a re-use storage strategy (caching in memory, storing the RDD only on disk, or replicating it across machines; also choose a persistence priority for data spills)
• Partitioning: users can ask that an RDD's elements be partitioned across machines based on a key in each record

RDD Transformations and Actions

*http://www.tothenew.com/blog/spark-1o3-spark-internals/
*https://spark.apache.org/docs/1.0.1/cluster-overview.html

Note: Lazy Evaluation: a very important concept

DAG of RDDs

*https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/

Fault Tolerance

• RDDs track the series of transformations used to build them (their lineage) to recompute lost data
• E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Tradeoff: low computation cost (cache more RDDs) VS high memory cost (not much work for GC)

Ack: Slide by Prajakta Kalmegh
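The lineage idea can be illustrated with a toy lazy "RDD" in plain Python (emphatically not the Spark API; class and method bodies here are illustrative): each transformation only records its parent and a function, and data is computed on demand by walking the chain back to stable storage, which is exactly what makes a lost partition recomputable.

```python
class ToyRDD:
    """Toy RDD: records lineage and evaluates lazily (not real Spark)."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def filter(self, pred):
        # Transformation: returns a new RDD; no computation happens yet
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def map(self, f):
        # Transformation: likewise just extends the lineage chain
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def collect(self):
        # Action: recompute from the source through the whole lineage
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

lines = ["ok\tstart\t1", "error\tdisk\t2", "error\tnet\t3"]
messages = ToyRDD(source=lines).filter(lambda l: "error" in l) \
                               .map(lambda l: l.split("\t")[2])
# Nothing has run yet; the action triggers evaluation from the source:
print(messages.collect())   # -> ['2', '3']
```

If a cached copy of `messages` were lost, calling `collect()` again simply replays the recorded filter and map over the source, mirroring lineage-based recovery.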

Representing RDDs

• Graph-based representation. Five components: a set of partitions, a set of dependencies on parent RDDs, a function for computing the dataset from its parents, and metadata about its partitioning scheme and preferred data placement


Representing RDDs (Dependencies)

• Narrow dependencies: each parent partition is used by at most one child partition (one-to-one or many-to-one)
• Wide (shuffle) dependencies: many-to-many; multiple child partitions may depend on each parent partition

Representing RDDs (An example)

Advantages of the RDD model

Checkpoint!

• Data Sharing in Spark and Some Applications
• RDD Definition, Model, Representation, Advantages

Other Engine Features: Implementation

• Not covered in detail; some summary:
• Spark local vs Spark Standalone vs Spark cluster (resource sharing handled by YARN/Mesos)
• Job scheduling: DAGScheduler vs TaskScheduler (Fair vs FIFO at task granularity)
• Memory management: serialized in-memory (fastest) VS deserialized in-memory VS on-disk persistence
• Support for checkpointing: tradeoff between using lineage for recomputing partitions VS checkpointing partitions on stable storage
• Interpreter integration: ship external instances of variables referenced in a closure, along with the closure class, to worker nodes in order to give them access to these variables