Programming Patterns and Tools for Cloud Computing SeSe ACC


Page 1

Programming Patterns and Tools for Cloud Computing

SeSe ACC

Page 2


Basic service models: IaaS, PaaS, SaaS

• IaaS: Amazon Web Services, HP Helion, Rackspace, SNIC Cloud
• PaaS: Google App Engine, OpenShift, CloudFoundry, …
• SaaS: Google Docs, Office 365, Dropbox, StochSS

Image: Sam Johnson, http://creativecommons.org/licenses/by-sa/3.0/

Page 3

Virtual Appliances

Installation of complete software stacks packaged as virtual machine images.

• Distributed as files (image formats) or directly in a cloud environment
• Typically used to package up a single application as a VMI, for simple distribution
• Common way to make SaaS out of a legacy application
• Can easily move between different hosting environments (public/private cloud, VirtualBox, VMware, etc.)
• Users typically deploy and use them as virtual private resources

https://en.wikipedia.org/wiki/Virtual_appliance

Page 4

Platform as a Service (PaaS)

The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

A major aim of PaaS is to aid in the development of “cloudy” applications by:

• Reducing the burden of explicitly managing IaaS
• Cloud interoperability (a goal for some of them)
• Enforcing programming models/frameworks that lead to designs that make good use of the cloud model

Page 5

Platform as a Service (PaaS)

Directly consuming IaaS gives great flexibility and control, but may be too complicated for the average user. Platform as a Service aims at simplifying application development by abstracting away IaaS. Also, in some cases, by enforcing certain environments and programming models, the platform can provide added value, particularly autoscaling (see Johan Tordsson’s lectures).

Page 6

Example: Google App Engine

• Lets you easily build scalable web applications
• Develop in a local SDK sandbox, then deploy to Google Cloud with no code modification (see the sketch after this list)
• The platform will autoscale your web service to meet increased load and traffic
• Supports different languages, e.g. Python, Java, Go, and your favorite frameworks like Django, Jinja2, etc.
• But not all possible Python packages (sandbox)
• Works really well for many (traditional) web applications
• The sandbox can be limiting for non-typical applications, like those from CSE
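For a feel of the programming model, here is a minimal sketch of a classic App Engine Python application (a webapp2 handler; the route and message are placeholders):

# Minimal sketch of a classic App Engine (Python runtime) handler.
# webapp2 ships with the App Engine SDK.
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond with plain text; the platform scales instances for us.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello from App Engine!')

# The WSGI application object that App Engine serves.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)

Deployment is driven by an app.yaml file pointing at this app object; the SDK’s local development server simulates the sandbox before you deploy.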

Page 7

Example: CloudFoundry

http://12factor.net/

https://www.cloudfoundry.org/platform/

Page 8

Example: OpenShift

https://www.openshift.org/

From Red Hat. Similar goals as CloudFoundry; based on Docker and Kubernetes.

Page 9

Software as a Service

The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Page 10

Case study: MOLNs and StochSS

www.molns.org, github.com/MOLNs/molns
Cloud virtual appliance/platform for large scale simulations with PyURDME

Page 11

The tasks: Spatial Stochastic Simulations in Systems Biology

PyURDME (www.pyurdme.org) is a modeling and simulation library for stochastic reaction-diffusion simulations of biochemical networks.

• Stochastic process simulated on a computational mesh.
• Applications include studying gene regulation and yeast polarization.
• Output is a time series of spatial data, X.

Page 12

What problems did we want to solve?

1. PyURDME simulations, in particular parameter sweeps, require large computational resources.

Page 13

X is just one possible outcome; we need to generate large ensembles of realizations to conduct a statistical analysis.

1k–1M trajectories —> 10*1000 ~ 10 GB of data

This falls in the categories “High-Throughput Computing” (HTC) and “Many-Task Computing” (MTC).
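Because the realizations are independent, generating an ensemble is embarrassingly parallel. A minimal sketch, with a hypothetical simulate(seed) function standing in for a single PyURDME solver run:

# Sketch: generate an ensemble of independent realizations in parallel.
# simulate() is a hypothetical stand-in for one stochastic solver run.
from multiprocessing import Pool

def simulate(seed):
    import random
    rng = random.Random(seed)
    return rng.gauss(0.0, 1.0)  # placeholder for a trajectory X

if __name__ == "__main__":
    with Pool() as pool:
        # Each task is independent: a classic HTC/MTC workload.
        ensemble = pool.map(simulate, range(1000))
    print("ensemble mean:", sum(ensemble) / len(ensemble))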

Page 14

Potential solution: use HPC clusters

• We did that for a while, and used them to make large sweeps and write papers ourselves. Then we started to think about the problem of scientific software. There were problems:
• The modeling process really benefits from interactivity, but HPC resources are not meant for that.
• It was daunting to try to write software to work with any type of university cluster.
• The software stack needed is complex and fragile and would likely be very time-consuming to install in the HPC environment (leading to long wait times for users).
• Computational experiments become hard to reproduce since they rely on specific resources and organizations.

Page 15

Components in the implementation

Runs on: Amazon EC2, HP Helion, OpenStack

• PyURDME: spatial stochastic modeling and simulation engine
• molnsclient: automatic setup and provisioning of a distributed computational environment, the “virtual lab”
• molnsutil: library for scalable parallel execution of simulation engines; unified data management in the cloud environments
• IPython project: web-based notebook front-end to Python; sharable, computable documents; parallel computing via IPython.parallel

(Diagram: the MOLNs stack connects simulation engines and data analysis frameworks on top of these components.)

Page 16

Summary of Experiences so far

• Building virtual-appliance VM images with the software stack for a single OS only reduces time to release, which is critical for a small science team.

• Some parts were easy, essentially just deploying existing software in a virtual environment (e.g. IPython)

• Special care was needed for data transport and for storage abstractions

• As the stack grows, the large monolithic virtual appliance/image is starting to give us problems. We are now looking into alternative technologies as a remedy, e.g. containers and more extensive use of orchestration tools.


Page 17

What we have not covered in this course that is important for developing SaaS

• Scalable web sites/services
• frameworks
• sessions
• authentication
• databases
• security

In practice, this is an integral part of almost all cloud computing software projects, but many sources of information, tools and frameworks (webapp2, Django, Flask, …) and PaaS offerings (GAE, …) exist that will make your life fairly easy.

Page 18

“I need my application to scale”

• What do we mean by a scalable CSE application?

Page 19

Tightly coupled applications?

• Common in CSE; typical examples are Partial Differential Equation (PDE) solvers

(Figure: communication over block boundaries)

Page 20

HPC programming models

• Message Passing Interface (distributed memory; see the sketch below)
• OpenMP, Pthreads, etc. (shared memory)
• CUDA, OpenCL (GPGPU)
• FPGA

• Low-latency interprocess communication is critical to application performance
• Specialized hardware/instruction sets
• “Every cycle counts”; users of HPC become concerned about VM overhead
• Failed tasks cause major problems/performance losses
• Compute infrastructure is organized accordingly: HPC cluster resources, where you wait in a queue to obtain dedicated, direct access to high-bandwidth hardware resources
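For a flavor of the message-passing model, here is a minimal sketch using mpi4py (assumes an MPI installation; run with e.g. mpiexec -n 2 python hello_mpi.py, where the file name is arbitrary):

# Minimal point-to-point message passing sketch with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 sends a Python object to rank 1.
    comm.send({'payload': 42}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print('rank 1 received:', data)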

Page 21

“Cloud computing doesn’t scale”

21

Whenyouhearthosetypesofstatements,theytypicallyrefertoconceptionsabouttraditionalcommunicationintensiveHPCapplications,typicallyimplementedwithMPI.

• Referstolow-performingnetworkbetweeninstancesinPublicClouds• DoesnottypicallyaccountfortailoredHPCclusterofferingsandplacementgroups(availableine.g.EC2)

Representsaverynarrowviewonwhatapplicationsandprogrammingmodelsareadmissible.

http://star.mit.edu/cluster/StarClusterdeploysMPI-clustersoverAWS:

Page 22

Vertical vs. Horizontal Scaling

Vertical scaling (“scale-up”): increase the capacity of your servers to meet application demand.

• Add more vCPUs
• Add more RAM/storage
• Fundamentally limited by the HW capacity of the underlying host
• Cost increases sharply for high-end machines

Horizontal scaling (“scale-out”): meet increased application demand by adding more servers/storage/etc.

• Also has limitations, but a lot of work is going on to improve scale-out properties
• The cost of adding another machine of the same (commodity) type scales linearly

Horizontal scaling is typically what we are most interested in in Cloud Computing.

Page 23

Strong and Weak Scaling

Strong scaling: how does the execution time decrease as a function of increased resources, for a fixed problem size?

How quickly can I solve a problem of this size by adding more HW?
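Strong scaling is commonly summarized by Amdahl’s law (a standard result, not from the slides): if a fraction p of the work parallelizes perfectly over N workers and the rest is serial, the ideal speedup is

S(N) = \frac{1}{(1-p) + p/N} \to \frac{1}{1-p} \quad \text{as } N \to \infty

so the serial fraction bounds the achievable speedup no matter how much hardware you add.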

Page 24

Strong and Weak Scaling

Weak scaling: how does the execution time behave as a function of increased resources, for an increasing problem size?

How much can I grow my problem size and still maintain the same execution time?
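The weak-scaling counterpart is Gustafson’s law (again a standard result, not from the slides): if the serial fraction at scale N is s and the parallel part grows with the machine, the scaled speedup is

S(N) = s + (1-s)N

i.e. useful speedup keeps growing almost linearly as long as the problem grows with the resources.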

Page 25

Scalability and Elasticity

Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

Design principles to allow for horizontal scaling and elasticity?

(Discuss/brainstorm in small groups for 5 min)

Page 26

Strive for statelessness of task/component

(Diagram: a Controller routes data and task input/output to Workers.)

“Growing is easier than shrinking.”

But of course, all computations/applications need data and state. It is not so much about whether you store it as it is about where you store it.

Ideally you want your task to be able to execute anywhere, at any time.

• Robustness
• Can increase or decrease the number of workers

Page 27

(Diagram: a Controller holding all data, with Workers reading and writing through it.)

• Is the controller machine designed for scalable storage?
• High pressure on a critical component (robustness)
• The network is a potential performance bottleneck

Page 28

Use the object store when you can

(Diagram: Controller and Workers exchange task data through an object store, S3 or Swift: key-value storage designed for horizontal scaling.)

In this kind of master-slave setup the controller will still maintain a lot of information, route tasks, etc.

Why even store task information on the controller? Performance (latency) is one good reason.
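A minimal sketch of a stateless worker following this pattern, assuming boto3 and hypothetical bucket and key names (Swift would be analogous):

# Sketch: a stateless worker that reads its input from the object store,
# computes, and writes the result back. Bucket/key names are hypothetical.
import boto3

s3 = boto3.client("s3")

def run_task(task_id):
    # Fetch the input by key; the worker keeps no local state between tasks.
    obj = s3.get_object(Bucket="sim-inputs", Key="task-%d.json" % task_id)
    payload = obj["Body"].read()

    result = payload.upper()  # placeholder for the real computation

    # Write output under a deterministic key; any worker could have run this.
    s3.put_object(Bucket="sim-outputs", Key="task-%d.out" % task_id, Body=result)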

Page 29

Use the object store when you can

(Diagram: Controller and Workers, with data kept in S3/Swift independently of the servers.)

Decouple the lifetime of compute servers and data!

• Needed for elasticity
• Servers are cattle (http://www.slideshare.net/randybias/pets-vs-cattle-the-elastic-cloud-story)
• Data can be precious

Page 30

(Diagram: data replicated across Worker hosts in a distributed file system.)

Another option for robustness: store replicas of data over multiple hosts in a distributed file system.

• HDFS (Hadoop, Big Data)
• Does not decouple compute/storage lifetimes as easily
• Designed for extreme I/O throughput

Page 31

Asynchronous Tasks

(Diagram: a Controller dispatches asynchronous tasks; results flow through S3/Swift.)

Aim to execute tasks in non-blocking mode, and have as few global synchronization points as possible.

Page 32

Example: Celery/RabbitMQ (Lab 3)

Celery: a distributed task queue. Results are kept in a key-value store/database such as Redis.

http://abhishek-tiwari.com/post/amqp-rabbitmq-and-celery-a-visual-guide-for-dummies#fn:1
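A minimal Celery sketch in the spirit of Lab 3 (the broker and backend URLs are placeholders for your RabbitMQ and Redis instances):

# tasks.py: a minimal Celery task.
from celery import Celery

app = Celery(
    "tasks",
    broker="amqp://guest@localhost//",   # RabbitMQ message broker
    backend="redis://localhost:6379/0",  # Redis result store
)

@app.task
def add(x, y):
    return x + y

Start a worker with celery -A tasks worker; a client then calls add.delay(2, 3), which returns immediately with an AsyncResult. That non-blocking call is exactly the pattern from the previous slide.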

Page 33

Our favorite example: MOLNs

PyURDME (www.pyurdme.org) is a modeling and simulation library for stochastic reaction-diffusion simulations of biochemical networks.

We have seen how we worked with automation, orchestration, etc. to turn things into a cloud service; let’s look a bit at the computational problem.

Page 34

What problems did we want to solve?

1. PyURDME simulations, in particular parameter sweeps, require large computational resources.

• Monte Carlo
• Tasks are numerous
• But independent

HPC, HTC or MTC? Mostly MTC.

But a lot of the time we spend building and debugging models… or developing solvers. For this, we don’t need big resources, but user interactivity is invaluable.

Page 35

Interactive Computation

Personal opinion: one of the more exciting things about cloud computing technologies from the CSE perspective is the opportunity to enable interactive parallel computing.

(Diagram contrasting two modes:)
• Interactive: interact with partial output as the computation proceeds.
• Non-interactive: obtain the complete output when the job is done.

Page 36

MOLNs revisited

Runs on: Amazon EC2, HP Helion, OpenStack

• PyURDME: spatial stochastic modeling and simulation engine
• molnsclient: automatic setup and provisioning of a distributed computational environment, the “virtual lab”
• molnsutil: library for scalable parallel execution of simulation engines; unified data management in the cloud environments
• IPython project: web-based notebook front-end to Python; sharable, computable documents; parallel computing via IPython.parallel

(Diagram: the MOLNs stack connects simulation engines and data analysis frameworks on top of these components.)

Page 37

Architecture

(Architecture diagram: a web browser talks to an IPython Notebook server on the Controller via the MOLNs client; the IPython Controller dispatches work to IPython Engines on the Worker nodes; the nodes share a SharedStorage area and use PersistentStorage in S3/Swift; an infrastructure-deployment layer provisions the whole environment.)

Page 38

IPython Parallel

Two types of views of your cluster:

• direct_view
  • Supports explicit mapping of data/tasks to engines (workers)
  • Run tasks on specific machines
  • MPI
• load_balanced_view
  • Asynchronous task queue (uses ZeroMQ)

Engines always block on execution; a set of schedulers provides the asynchronous interface to clients.
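A sketch of both view types, using the package’s current name ipyparallel (in the IPython versions contemporary with these slides the import was IPython.parallel). Assumes a running cluster, e.g. started with ipcluster start -n 4:

# Sketch: direct and load-balanced views in IPython Parallel.
import ipyparallel as ipp

def square(x):
    return x * x

rc = ipp.Client()                 # connect to the running cluster

dview = rc[:]                     # direct view: address engines explicitly
dview.scatter("chunk", list(range(100)))  # partition data across engines

lview = rc.load_balanced_view()   # asynchronous task queue
ar = lview.map_async(square, range(1000))
print(ar.get()[:5])               # block only when the results are needed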

Page 39

IPython Parallel

• Lets you work with many models simultaneously
• Interactively from IPython (Notebook)
• Ideal for rapid prototyping/simple parallel flows

IPython Parallel in clouds: it does not attempt to solve the problem of data management. That is one of the goals of molnsutil.

In molnsutil we have abstractions for storage systems:
SharedStorage()
PersistentStorage()
LocalStorage()

Page 40

Programming Patterns and Tools for Cloud Computing

Part 2: “Big Data”

Page 41

What is “Big Data”?

Not easy to define just based on size:

• Moving target: today’s “big” might be tomorrow’s “small”
• Right now, maybe multiple TBs to PBs (maybe)
• Depends on the type of data (structured, unstructured)
• Depends on the “velocity” of the data (streaming data)

Relative to state-of-the-art commodity systems: the problem is “Big Data” if it can’t be stored, handled, or processed using commodity systems or traditional RDBMS solutions. Typically, single workstations/disks are not sufficient. More about this in Lectures 2 and 3.

The three Vs: Volume, Velocity, Variety

Page 42

The three Vs: Volume, Velocity and Variety

http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

Page 43

Volume

Big Data problems involve large datasets, and computations typically become I/O bound:

• Hard drives’ capacity to (cheaply) store large amounts of data has increased rapidly, but read/write speeds have not.

—> Distribute datasets over multiple disks/machines and process them in parallel.

This poses fundamental requirements on the systems that support such analysis (more about this in Lectures 2 and 3).

Page 44

1000 Genomes project

Today: DNA sequences from ~2600 individuals —> more than 460 TB of raw data (full genomic sequences), and growing.

Aim: create an atlas of human genetic variation.

• Find biomedically relevant DNA variations
• Understand simple and complex traits, and the in-betweens
• Capture genetic variations that are found in at least 1% of the human population

Page 45

Fig. 3. Cloud computing usage in big data.

Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, Samee Ullah Khan: “The rise of ‘big data’ on cloud computing: Review and open research issues”, Information Systems, Volume 47, 2015, pp. 98–115. http://dx.doi.org/10.1016/j.is.2014.07.006

Page 46

Frameworks for Big Data

Problems with very large datasets require special frameworks:

• Hadoop
• Hive
• Spark
• …

These are the focus of “Large Datasets for Scientific Applications”, which can be followed as a PhD course (March 28 to the end of the semester).

Page 47

(Figure from “The rise of ‘big data’ on cloud computing: Review and open research issues”.)

Page 48

What is Hadoop?

The original implementation of MapReduce is by Google. (Read “googlemapreduce.pdf” in the portal.)

Hadoop is an open-source implementation inspired by Google’s original framework.

First developers: Doug Cutting and Mike Cafarella (at that time at Yahoo!); Hadoop is named after Cutting’s son’s toy elephant.

Page 49

Hadoop

Today, the Hadoop ecosystem is quite large and contains many analytics tools. The core components are:

• HDFS (distributed filesystem)
• Hadoop Commons (libraries, utilities, etc.)
• YARN (resource management platform; not in 1.2.1)
• MapReduce

Page 50

What is MapReduce?

• A programming model for writing (simple) (data-)parallel programs
• A framework to execute such programs

Advantages: if you manage to express your problem as MapReduce, the framework handles:

• Data distribution (HDFS)
• Fault tolerance (on many levels)
• Brings computations close to data*
• Runs on commodity hardware
• Scales to 1000s of nodes (failure is inevitable)
• Simple programming model (when you know it…)

The data (and the cluster) should be very large for the advantages to become apparent.

It was designed for the case of commodity HW without high-performance network interconnects. You should not think of HDFS as a “HPC parallel filesystem”: you don’t communicate data to compute, you communicate compute to data.

Page 51

Basic design

(Diagram, from http://en.wikipedia.org/wiki/Apache_Hadoop:)

• MapReduce layer: handles jobs, tasks, and the logic of computing close to the data
• Storage layer: the distributed filesystem, HDFS

Page 52

Resources

There are a couple of excellent basic tutorials for getting started playing with Hadoop on single-node or small distributed clusters (yes, this is fun; I recommend it if you have the time and some Linux (admin) experience):

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Also, it is rather easy to set up a “toy Hadoop” on your own laptop:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

Page 53

Setting up Hadoop in Production

This is a complex process. In practice, there are many vendors that provide hosted services, or projects that make it simpler, e.g.:

• Cloudera
• Hortonworks
• Elastic MapReduce (EMR) in AWS
• OpenStack Sahara
• Apache Bigtop

—> It is more important to get familiar with MapReduce itself.

Page 54

So how do you use it?

https://developers.google.com/appengine/docs/python/dataprocessing/

Page 55

(Diagram: the framework provides the optimized sort and shuffle; you provide the map and reduce functions.)

Page 56

Key stages of MapReduce

1. Split input
2. Map
3. Shuffle
4. Reduce
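To make the stages concrete, here is a tiny sketch that mimics all four in plain Python for word counting (the real framework distributes each stage across machines):

# Sketch: the four MapReduce stages, simulated in-process for word count.
from collections import defaultdict

def mapper(line):
    # 2. Map: emit (key, value) pairs for each word in an input split.
    return [(word, 1) for word in line.split()]

def run_mapreduce(lines):
    # 1. Split input: here, one "split" per line of text.
    mapped = [kv for line in lines for kv in mapper(line)]
    groups = defaultdict(list)
    for key, value in mapped:        # 3. Shuffle: group values by key.
        groups[key].append(value)
    # 4. Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

print(run_mapreduce(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}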

Page 57

Hadoop Streaming

What if I don’t want to use Java?

The Hadoop streaming API lets you write the mapper and reducer as standalone programs, reading from stdin and writing to stdout. Your mapper and reducer can now be written using Python, Ruby, Bash shell, R, you name it.

Logically, it corresponds to the following Unix pipe:

cat input.txt | ./map | sort | ./reduce > output.txt
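For example, a word-count mapper and reducer for Hadoop streaming might look like this sketch (file names are arbitrary; the reducer relies on its input arriving sorted by key, which the framework, or sort in the pipe above, guarantees):

# mapper.py: read lines from stdin, emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)

# reducer.py: input is sorted by key, so counts can be accumulated
# over each run of identical words.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            sys.stdout.write("%s\t%d\n" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    sys.stdout.write("%s\t%d\n" % (current, count))

The job is then submitted with the streaming jar, roughly hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in -output out (the exact jar path and extra flags such as -file depend on your installation).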

Page 58

Performance guidelines

• There is a significant latency in starting Map tasks in the framework
• Maps should do considerable work (rule of thumb: at least 1 min of work)
• The default block size is 64–128 MB; input files should be large (GB-sized)
• Works best for “one-shot” computations, not iterative flows, due to the large latency
• Use a Combiner (local reduce) to reduce the network overhead for shuffle/reduce

Page 59

Criticism against MapReduce

• Restricted programming model
• Performance
• Not novel

An old but still interesting discussion:

http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html

Page 60

Apache Spark
