CS378 – Big Data Programming
Lecture 23: Closures, Caching, Partitions
Review
• Assignment 11
  – Create user sessions
  – Order events by timestamp, event type, subtype
  – Order sessions by user ID
  – Partition sessions by referring domain
  – Sample SHOWER sessions (1 in 10)
Big Data Programming, CS378 - Fall 2018
Distributed Spark Application
[Figure 7-1, from Learning Spark]
Distributing a Spark Application
• Spark Driver runs your main() method
  – Converts Spark program into tasks
  – Creates an execution plan based on DAG
    • DAG is derived from transformations
  – Performs optimization (like: pipelining map()’s)
• Tasks are bundled up to be sent to cluster
  – Cluster has multiple task executors
Distributing a Spark Application
• Scheduling individual tasks
  – Executors register with driver
  – Tasks scheduled based on data location
  – Cached data is tracked (for future task scheduling)
• Driver exposes data on task status
Distributing a Spark Application
• With Hadoop the JAR was sent to workers
  – Spark also needs to get the code to workers
• Hadoop has two tasks: map, reduce
  – Instantiation takes place on the workers
• Spark sends object instances to workers
  – Individual tasks defined in your Spark code
  – Objects are serialized (via Java serialization)
Closures
• Functions as first class objects
  – Can be passed to a function as an argument
  – Can be returned from a function
  – Can be assigned to variables
• Closures contain free variables that are bound in the lexical environment/scope
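The three properties above, plus a free variable bound in the enclosing scope, can be sketched in plain Java. This is only an illustration using `java.util.function.Function`; the class and method names (`ClosureBasics`, `makeAdder`, `applyTwice`) are made up for this sketch and do not come from the lecture.

```java
import java.util.function.Function;

class ClosureBasics {
    // Returned from a function: the lambda closes over the free variable `offset`,
    // which is bound in the lexical scope of makeAdder.
    static Function<Integer, Integer> makeAdder(int offset) {
        return x -> x + offset;
    }

    // Passed to a function as an argument.
    static int applyTwice(Function<Integer, Integer> f, int x) {
        return f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        // Assigned to a variable.
        Function<Integer, Integer> addFive = makeAdder(5);
        System.out.println(addFive.apply(1));       // prints 6
        System.out.println(applyTwice(addFive, 1)); // prints 11
    }
}
```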
Closures
• In Scala, functions as a type are built-in
• In Java, closures are realized as instances
  – Define an object that implements an interface
  – Interface requires implementation of an abstract method
  – In Spark API, that method is call()
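A minimal sketch of "closure as an instance", written without a Spark dependency so it runs on its own. The interface `Fn` is a hypothetical stand-in that mirrors the shape of Spark's Java function interfaces (a single abstract `call()` method on a serializable type); `ParseLength` and `applyTo` are likewise invented for illustration.

```java
import java.io.Serializable;

class CallInterfaceDemo {
    // Hypothetical stand-in for a Spark-style function interface:
    // one abstract method named call(), and the type is Serializable.
    interface Fn<T, R> extends Serializable {
        R call(T input) throws Exception;
    }

    // Option 1: a named class that implements the interface.
    static class ParseLength implements Fn<String, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(String line) {
            return line.length();
        }
    }

    // Helper that invokes call() and wraps the checked exception.
    static <T, R> R applyTo(Fn<T, R> f, T value) {
        try {
            return f.call(value);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Option 2: an anonymous class.
        Fn<String, String> firstWord = new Fn<String, String>() {
            @Override
            public String call(String line) {
                return line.split(" ")[0];
            }
        };
        // Option 3: a lambda, since the interface has a single abstract method.
        Fn<String, String> upper = s -> s.toUpperCase();

        System.out.println(applyTo(new ParseLength(), "spark")); // prints 5
        System.out.println(applyTo(firstWord, "big data"));      // prints big
        System.out.println(applyTo(upper, "rdd"));               // prints RDD
    }
}
```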
Closures
• Our Java functions are:
  – Instantiated
  – Sent off to the worker tasks (via serialization)
  – Each task gets its own copy (no communication)
• Non-local references will cause containing object to be serialized as well.
  – Variable value types must be serializable
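The "own copy, no communication" point can be simulated locally with Java serialization. The sketch below is not Spark code: `roundTrip` stands in for shipping a function object to a task, and `CountingFunction` is a hypothetical function object with mutable state. Updates made on the deserialized copies never reach the original or each other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TaskCopyDemo {
    // Hypothetical function object with mutable state.
    static class CountingFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        int callCount = 0;

        int call(int x) {
            callCount++;
            return x * 2;
        }
    }

    // Serialize and deserialize, as a stand-in for sending the object to a worker task.
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the call counts seen by the driver-side object and two task-side copies.
    static int[] simulate() {
        CountingFunction driverSide = new CountingFunction();
        CountingFunction task1 = roundTrip(driverSide);
        CountingFunction task2 = roundTrip(driverSide);
        task1.call(1);
        task1.call(2);
        task2.call(3);
        return new int[] { driverSide.callCount, task1.callCount, task2.callCount };
    }

    public static void main(String[] args) {
        int[] counts = simulate();
        // prints driver=0 task1=2 task2=1
        System.out.println("driver=" + counts[0] + " task1=" + counts[1] + " task2=" + counts[2]);
    }
}
```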
Closures – Issues in Java
• A function references a method in an enclosing scope
  – Method itself cannot be serialized
  – The entire containing class must be serialized
• Issues
  – The containing class may not be serializable
  – The associated data might be large
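The issue can be reproduced with plain Java serialization, no Spark required. In this sketch (all names are hypothetical), `Job` is deliberately not `Serializable`. A lambda that calls the instance method `scale()` captures `this`, so serializing it fails; copying the needed value into a local variable first is one common workaround, since the lambda then captures only that value.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class EnclosingScopeDemo {
    // Hypothetical serializable function interface.
    interface Fn<T, R> extends Serializable {
        R call(T input);
    }

    // Deliberately NOT Serializable, like a typical driver class.
    static class Job {
        private final int factor = 3;

        private int scale(int x) {
            return x * factor;
        }

        // Calls an instance method, so the lambda captures `this` (the whole Job).
        Fn<Integer, Integer> capturesThis() {
            return x -> scale(x);
        }

        // Copies the value to a local first, so only an int is captured.
        Fn<Integer, Integer> capturesLocalOnly() {
            final int localFactor = factor;
            return x -> x * localFactor;
        }
    }

    // True if Java serialization accepts the object, false if it is rejected.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println(canSerialize(job.capturesThis()));      // prints false
        System.out.println(canSerialize(job.capturesLocalOnly())); // prints true
    }
}
```

Making `Job` serializable would also remove the error, but then all of its fields travel with every task, which is the "associated data might be large" concern.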
Persistence
• Recall that RDDs are recomputed as needed
  – An action initiates evaluation
  – Additional action results in another evaluation
• An RDD can be persisted for efficiency
• Making an RDD persistent:
  – cache()
  – persist(StorageLevel level)
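The recompute-per-action behavior can be illustrated with a plain-Java analogy. `LazyDataset` below is a hypothetical toy class, not the Spark API: it re-runs its lineage on every action unless `cache()` was requested, in which case the first result is kept and reused.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class LazyDataset {
    private final Supplier<List<Integer>> lineage; // how to (re)compute the data
    private boolean persistRequested = false;
    private List<Integer> stored = null;
    int evaluations = 0;                           // how many times the lineage ran

    LazyDataset(Supplier<List<Integer>> lineage) {
        this.lineage = lineage;
    }

    // Marks the dataset for persistence; nothing is computed yet.
    LazyDataset cache() {
        persistRequested = true;
        return this;
    }

    private List<Integer> evaluate() {
        if (stored != null) {
            return stored;                         // reuse persisted result
        }
        evaluations++;
        List<Integer> result = lineage.get();
        if (persistRequested) {
            stored = result;
        }
        return result;
    }

    // Two "actions": each one triggers evaluation.
    int count() {
        return evaluate().size();
    }

    int sum() {
        int total = 0;
        for (int v : evaluate()) {
            total += v;
        }
        return total;
    }

    static int evaluationsAfterTwoActions(boolean persisted) {
        LazyDataset data = new LazyDataset(() -> Arrays.asList(1, 2, 3));
        if (persisted) {
            data.cache();
        }
        data.count();
        data.sum();
        return data.evaluations;
    }

    public static void main(String[] args) {
        System.out.println(evaluationsAfterTwoActions(false)); // prints 2
        System.out.println(evaluationsAfterTwoActions(true));  // prints 1
    }
}
```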
Persistence Options
[Figure from: http://training.databricks.com/workshop/itas_workshop.pdf]
Partitioning
• Prudent partitioning can greatly reduce the amount of communication (shuffle)
• If an RDD is scanned only once, there is no need to partition it
• If an RDD is reused multiple times in key-oriented operations
  – Partitioning can improve performance significantly
Partitioning
• Partitioning on pair RDDs (key, value)
• Consider an RDD containing user sessions
  – All users over some time period (day or week)
  – We want to merge in the last hour of events
• We’ll be joining sessions and events by user ID
Partitioning
[Figure 4-4, from Learning Spark]
Partitioning
[Figure 4-5, from Learning Spark]
Partitioning
• Consider an RDD containing user sessions
  – All users over some time period (day or week)
  – We want to merge events, multiple times
• To set up for this:
  – Create the session RDD
  – Partition (call partitionBy(), a transformation)
  – Persist
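Why partitioning by key helps a repeated join can be sketched in plain Java. This is an analogy, not Spark code: `partitionFor` assigns a key to a partition by a non-negative modulo of its hash code (the same idea a hash partitioner uses), and `partitionBy` here is a hypothetical helper operating on lists of user IDs. Because the rule depends only on the key, a session and an event with the same user ID always land in the same partition, so the already-partitioned sessions need not move again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashPartitionDemo {
    // Non-negative modulo of the key's hash code.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Hypothetical helper: distributes user IDs into numPartitions buckets.
    static List<List<String>> partitionBy(List<String> userIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String userId : userIds) {
            partitions.get(partitionFor(userId, numPartitions)).add(userId);
        }
        return partitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // Made-up user IDs for illustration.
        List<String> sessionUsers = Arrays.asList("alice", "bob", "carol", "dave");
        List<String> eventUsers = Arrays.asList("carol", "alice");

        // Partition the large session data once and keep it around.
        List<List<String>> sessions = partitionBy(sessionUsers, numPartitions);

        // Each incoming event is routed to the partition that already holds
        // the matching session, so the join is local to that partition.
        for (String user : eventUsers) {
            int p = partitionFor(user, numPartitions);
            System.out.println(user + " -> partition " + p
                + ", session present: " + sessions.get(p).contains(user));
        }
    }
}
```

In Spark's Java API the corresponding setup would be along the lines of calling `partitionBy(new HashPartitioner(n))` on the session pair RDD and then `persist()` on the result, so that later joins reuse the partitioning instead of reshuffling the sessions.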