Programming Patterns and Tools for Cloud Computing...• HDFS (Hadoop, Big Data) • Does not...

Programming Patterns and Tools for Cloud

Computing SeSe ACC

AmazonWebServicesHPHelionBackspaceSNICCloud

GoogleAppEngineOpenShiftCloudFoundry……

GoogleDocsOffice365DropboxStochSS

2SamJohnson,http://creativecommons.org/licenses/by-sa/3.0/

Basicservicemodels:IaaS,PaaS,SaaS

Virtual Appliances

Installationofcompletesoftwarestackpackagedasvirtualmachineimages.

• Distributedasfiles(imageformats)ordirectlyinacloudenvironment

• TypicallyusedtopackageupasingleapplicationasaVMI,simpledistribution

• CommonwaytomakeSaaSoutoflegacyapplication• Caneasymovebetweendifferenthostingenvironments(Public/PrivateCloud,VirtualBox,VMWareetc)

• Userstypicallydeployandusethemasvirtualprivateresources

https://en.wikipedia.org/wiki/Virtual_appliance

Platform as a Service (PaaS)

Thecapabilityprovidedtotheconsumeristodeployontothecloudinfrastructureconsumer-createdoracquiredapplica\onscreatedusingprogramminglanguages,libraries,services,andtoolssupportedbytheprovider.Theconsumerdoesnotmanageorcontroltheunderlyingcloudinfrastructureincludingnetwork,servers,opera\ngsystems,orstorage,buthascontroloverthedeployedapplica\onsandpossiblyconfigura\onse^ngsfortheapplica\on-hos\ngenvironment.

AmajoraimofPaaSistoaidindevelopmentof“cloudy”applicationsby:

• ReducingtheburdenofexplicitlymanagingIaaS• Cloudinteroperability(agoalforsomeofthem)• Enforcingprogrammingmodels/frameworksthatleadstodesignsthatmakegooduseofthecloudmodel.

Platform as a Service (PaaS)DirectlyconsumingIaaSgivesgreatflexibilityandcontrolbutmaybetoocomplicatedfortheaverageuser.PlatformasaServiceaimsatsimplifyingapplicationdevelopmentbyabstractingawayIaaS.Also,insomecases,byenforcingcertainenvironmentsandprogrammingmodels,theplatformcanprovideaddedvalue,particularlyautoscaling(seeJohanTordsson’slectures).

Example: Google App Engine

• Letsyoueasilybuildscalablewebapplications• DevelopinalocalSDKsandbox,thendeploytoGoogleCloudwithnocodemodification

• Theplatformwillautoscaleyouwebservicetomeetincreasedloadandtraffic• Supportsdifferentlanguages,e.g.Python,Java,GoandyourfavoriteframeworkslikeDjango,Jinja2etc.

• ButnotallpossiblePythonpackages(sandbox)• Worksreallywellformany(traditional)webapplications• Sandboxcanbelimitingfornon-typicalapplications,likethosefromCSE

Example: CloudFoundry

http://12factor.net/

https://www.cloudfoundry.org/platform/

Example: OpenShift

https://www.openshift.org/

FromRedHat.SimilargoalsasCloudFoundy,basedonDockerandKubernetes

Software as a Service

Thecapabilityprovidedtotheconsumeristousetheprovider’sapplica\onsrunningonacloudinfrastructure.Theapplica\onsareaccessiblefromvariousclientdevicesthrougheitherathinclientinterface,suchasawebbrowser(e.g.,web-basedemail),oraprograminterface.Theconsumerdoesnotmanageorcontroltheunderlyingcloudinfrastructureincludingnetwork,servers,opera\ngsystems,storage,orevenindividualapplica\oncapabili\es,withthepossibleexcep\onoflimiteduser-specificapplica\onconfigura\onse^ngs.

www.molns.org,github.com/MOLNs/molnsCloudvirtualappliance/platformforlargescalesimulationswithPyURDME

Case study: MOLNs and StochSS

The tasks: Spatial Stochastic Simulations in Systems Biology

PyURDME(www.pyurdme.org)isamodelingandsimulationlibraryforstochasticreaction-diffusionsimulationsofbiochemicalnetworks.

• Stochasticprocesssimulatedonacomputationalmesh.

• Applicationsincludestudyinggeneregulation,yeastpolarization.

• Outputisatimeseriesofspatialdata,X.

What problems did we want to solve?

1. PyURDMEsimulations,inparticularparametersweeps,requirelargecomputationalresources.

Xisjustonepossibleoutcome,weneedtogeneratelargeensemblesofrealizationstoconductastatisticalanalysis.

1k-1Mtrajectories—> 10*1000~10Gbdata

Fallsinthecategory“High-TroughputComputing(HTC)”and“Many-TaskComputing(MTC)”

Potential solution, use HPC -clusters

• We did that for a while, and used them to make large sweeps and write papers ourselves. Then we started to think about the problem of scientific software. There were problems: • The modeling process really benefits from interactivity,

but HPC resources are not for that. • It was daunting to try to write software to work with any

type of university cluster. • The software stack needed is complex and fragile and

would likely be very time-consuming to install in the HPC environment (leading to long wait times for users)

• Computational experiments become hard to reproduce since they rely on specific resources and organizations.

Components in the implementation

Amazon EC2

HP Helion

OpenStack

PyURDME: !Spatial stochastic modeling

and simulation engine

molnsclient: automatic setup and provisioning of a distributed !! ! ! computational environment - the “virtual lab”

molnsutil: Library for scalable parallel execution of simulation engines. !! ! Unified data management in the cloud environments.

IPython project: Web based notebook front-end to Python. !! ! ! ! Sharable, computable documents. !

! ! ! ! ! ! Parallel computing in via Ipython.parallel.

Simulation engine

Data analysis frameworks

Simulation engine

Summary of Experiences so far

• Virtual appliance for building VM images with the software stack for one OS only reduces time to release, critical for a small science team.

• Some parts were easy, essentially just deploying existing software in a virtual environment (e.g. IPython)

• Special care was needed for data transport + Special care was needed for storage abstractions

• As the stack grows, the large monolithic virtual appliance/image is starting to give us problems. We are now looking into alternative technologies as a remedy, e.g. containers and more extensive use of orchestration tools.

What we have not cover in this course that is important for developing SaaS

• Scalable web sites/services • frameworks • sessions • authentication • databases • security

Inpractice,thisisanintegralpartofalmostallcloudcomputingsoftwareprojects,butheremanysourcesofinformation,toolsandframeworks(Webapp2,Djago,Flask..),PaaS(GAE…),existthatwillmakeyourlifefairlyeasy.

“I need my application to scale”• What do we mean with a scalable CSE application?

Tightly coupled applications?• Common in CSE, typical examples are Partial Differential

Equation (PDE) solvers

Communicationoverblock-boundaries

HPC programming models • Message Passing Interface (Distributed Memory) • OpenMP, Pthreads etc. (Shared Memory) • CUDA, OpenCL (GPGPU) • FGPA

• Low-latencyinterprocesscommunicationiscriticaltoapplicationperformance• Specializedhardware/instructionsets.• “Everycyclecounts”,usersofHPCbecomeconcernedaboutVMoverhead• Failedtaskscausesmajorproblems/performancelosses• Computeinfrastructureorganizedaccordingly• ’High-PerformanceComputing,HPCclusterresources.Waitinaqueuetoobtaindedicateddirectaccesstohigh-bandwidthhardwareresources.

“Cloud computing doesn’t scale”

Whenyouhearthosetypesofstatements,theytypicallyrefertoconceptionsabouttraditionalcommunicationintensiveHPCapplications,typicallyimplementedwithMPI.

• Referstolow-performingnetworkbetweeninstancesinPublicClouds• DoesnottypicallyaccountfortailoredHPCclusterofferingsandplacementgroups(availableine.g.EC2)

Representsaverynarrowviewonwhatapplicationsandprogrammingmodelsareadmissible.

http://star.mit.edu/cluster/StarClusterdeploysMPI-clustersoverAWS:

Vertical vs. Horizontal Scaling

Verticalscaling(“scale-up”): Increasethecapacityofyourserverstomeetapplicationdemand.

• AddmorevCPUs• AddmoreRAM/storage • FundamentallylimitedbyHWcapacityoftheunderlyinghost• Costincreasessharplyforhigh-endmachines

Horizontalscaling(“scale-out”): Meetincreasedapplicationdemandbyaddingmoreservers/storage/etc.

• Alsohaslimitations,butalotofworkgoingontoimprovescale-outproperties.• Costofaddinganothermachineofthesame(commodity)typescaleslinearlyTypicallywhatwearemostinterestedininCloudComputing.

Strong and Weak Scaling

Strongscaling:Howdoestheexecutiontimedecreaseasafunctionofincreasedresources,forafixedproblemsize?

HowquicklycanIsolveaproblemofthissizebyaddingmoreHW?

Weakscaling:Howdoestheexecutiontimebehaveasafunctionofincreasedresources,foranincreasingproblemsize?

Strong and Weak Scaling

HowmuchcanIgrowmyproblemsizeandstillmaintainthesameexecutiontime?

Scalability and Elasticity

Capabili\escanbeelas\callyprovisionedandreleased,insomecasesautoma\cally,toscalerapidlyoutwardandinwardcommensuratewithdemand.Totheconsumer,thecapabili\esavailableforprovisioningomenappeartobeunlimitedandcanbeappropriatedinanyquan\tyatany\me.

Designprinciplestoallowforhorizontalscalingandelasticity?

(Discuss/Brainstorminsmallgroupsfor5min)

Strive for statelessness of task/component

Controller

Worker

“Growingiseasierthanshrinking”Data,taskinput/output

Butofcourse,allcomputations/applicationsneeddataandstate.Itisnotsomuchaboutifyoustoreitasitisaboutwhereyoustoreit.

Ideallyyouwantyourtasktobeabletoexecuteanywhere,atanytime.

• Robustness• Canincrease,decreasenumberofworkers

Controller

Worker

Isthecontrollermachinedesignedforscalablestorage?

Highpressureoncriticalcomponent(robustness)

Networkapotentialperformancebottleneck

Use the object store when you can

Controller

Worker

S3,Swift

key-valuestorage,designedforhorizontalscaling

Inthiskindofmaster-slavesetupthecontrollerwillstillmaintainalotofinformation,routetasksetc.

Whyevenstoretaskinformationoncontroller?Performance(latency)isonegoodreason

Use the object store when you can

Controller

Worker

S3,Swift

Decouplethelife-timeofcomputeserversanddata!

• Neededforelasticity• Serversarecattle

http://www.slideshare.net/randybias/pets-vs-cattle-the-elastic-cloud-story

• Datacanbeprecious

Controller

Worker

Anotheroptionforrobustness,storereplicasofdataovermultiplehostsinadistributedfilesystem.

• HDFS(Hadoop,BigData)• Doesnotdecouplecompute/storagelifetimesaseasily• DesignedforextremeI/Othroughput

AsynchronousTask

Controller

S3,Swift

Aimtoexecutetasksinnon-blockingmode,andhaveasfewglobalsynchronizationpointsaspossible.

32http://abhishek-tiwari.com/post/amqp-rabbitmq-and-celery-a-visual-guide-for-dummies#fn:1

key,valuestoredatabaseRedis

Example: Celery/RabbitMQ (Lab3)

Celery-DistributedTaskQueue

Our favorite example: MOLNs

PyURDME(www.pyurdme.org)isamodelingandsimulationlibraryforstochasticreaction-diffusionsimulationsofbiochemicalnetworks.

Wehaveseenhowweworkedwithautomation,orchestrationetctomakethingsintocloudservice,let’slookabitatthecomputationalproblem.

What problems did we want to solve?

1. PyURDMEsimulations,inparticularparametersweeps,requirelargecomputationalresources.

• MonteCarlo• Tasksarenumerous• Butindependent

mostlyMTCHPC,HTCorMTC?

Butalotofthetimewespendbuildinganddebuggingmodels…

ordevelopingsolvers

Forthis,wedon’tneedbigresources,butuserinteractivityisinvaluable.

Interactive Computation

Personalopinion:OneofthemoreexcitingthingsaboutcloudcomputingtechnologiesfromtheCSEperspectiveistheopportunitytoenableinteractiveparallelcomputing.

Input Output

Output

Interactwithpartialoutputascomputationproceeds

Non-interactive,obtaincompleteoutputwhenjobisdone.

MOLNs revisited

Amazon EC2

HP Helion

OpenStack

PyURDME: !Spatial stochastic modeling

and simulation engine

molnsclient: automatic setup and provisioning of a distributed !! ! ! computational environment - the “virtual lab”

molnsutil: Library for scalable parallel execution of simulation engines. !! ! Unified data management in the cloud environments.

IPython project: Web based notebook front-end to Python. !! ! ! ! Sharable, computable documents. !

! ! ! ! ! ! Parallel computing in via Ipython.parallel.

Simulation engine

Data analysis frameworks

Simulation engine

Architecture

Workers

Controller

Workers

WebBrowser

Molns Client

SharedStorage

PersistentStorage

(S3 / Swift)

InfrastructureDeployment

IPythonNotebookServer

IPythonController

IPythonEngine(s)

IPython ParallelTwotypesofviewstoyourcluster:

• direct_view• Supportsexplicitmappingofdata/taskstoengines(workers)• Runtasksonspecificmachines• MPI

• load_balanced_view• Asynchronoustaskqueue(usesZeroMQ)

Enginesalwaysblockonexecution

AsetofschedulersprovideasynchronousinterfacetoClients

IPython Parallel

• Let’syouworkwithmanymodelssimultaneously• InteractivelyfromIPython(Notebook)• Idealforrapidprototyping/simpleparallelflows

IPparallelinClouds:Doesnotattempttosolvetheproblemofdatamanagement

Oneofthegoalsof‘molnsutil’

Inmolnsutilwehaveabstractionsforstoragesystems: SharedStorage() PersistentStorage() LocalStorage()

Programming Patterns and Tools for Cloud

ComputingPart 2: “Big Data”

What is “Big Data”

Noteasytodefinejustbasedonsize:• Movingtarget,today’s“big”mightbetomorrows“small”• Rightnow,maybemultipleTBstoPBs(maybe)• Dependsonthetypeofdata(structured,unstructured)• Dependsonthe“velocity”ofthedata(streamingdata)

Relativetostate-of-theartcommoditysystems: Theproblemis“BigData”ifitcan’tbestored,handled,processedusingcommoditysystemsortraditionalRDMSsolutions.Typically,singleworkstations/disksarenotsufficient.MoreaboutthisinLectures2and3.

ThethreeV’s:Volume,Velocity,Variety

The three Vs, Volume, Velocity and Variety

42http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

VolumeBigDataproblemsinvolvelargedatasets,andcomputationstypicallybecomeI/Obound:

• HardDrivescapacityto(cheaply)storelargeamountsofdatahasincreasedrapidlybutread/writespeedshavenot.

—>Distributedatasetsovermultipledisks/machinesandprocessitinparallel.

Posesfundamentalrequirementsonthesystemstosupportsuchanalysis(moreaboutthisinLectures2,3)

1000 Genomes project

Today,DNAsequencesfrom~2600individuals—>>460TBorrawdata(fullgenomicsequences),andgrowing.

Aim:Createanatlasofthehumangeneticvariation

FindbiomedicalrelevantDNAvariations

Understandsimpleandcomplextraits,andthein-betweens.

Capturegeneticvariationsthatisfoundinatleast1%ofthehumanpopulation

Fig. 3. Cloud computing usage in big data.

Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, Samee Ullah Khan

The rise of “big data” on cloud computing: Review and open research issues

Information Systems, Volume 47, 2015, 98–115

http://dx.doi.org/10.1016/j.is.2014.07.006

FrameWorks for Big Data

Problemswithverylargedatasetsrequirespecialframeworks:• Hadoop• Hive• Spark• …• …

Focusof“LargeDatasetsforScientificApplications” CanbefollowedasaPhDcourse.March28-endofsemester.

47Theriseof“bigdata”oncloudcomputing:Reviewandopenresearchissues

What is Hadoop?

OriginalImplementationofMapReducebyGoogle.(Read“googlemapreduce.pdf”intheportal)

HadoopisanOpenSourceimplementationinspiredbyGoogle’soriginalframework.

Firstdevelopers:DougCuttingandMikeCafarellaatthattimeatYahoo!,namedafterCutting’sson’stoyelephant.

Hadoop

Today,theHadoopecosystemisquitelarge,andcontainsmanyanalyticstools.Thecorecomponentsare:

• HDFS(DistributedFilesystem)• Hadoopcommons(libraries,utilitiesetc.)• YARN(Resourcemanagementplatform,notin1.2.1)• MapReduce

What is MapReduce ?

• Aprogrammingmodelforwriting(simple)(data)parallelprograms• Aframeworktoexecutesuchprograms.

Advantages:IfyoumanagetoexpressyourproblemasMapReduce,theframeworkhandles:

• Datadistribution(HDFS)• Fault-tolerance(onmanylevels)• Bringscomputationsclosetodata*• Runsoncommodityhardware• Scalesto1000’sofnodes(failureisinevitable) • Simpleprogrammingmodel(whenyouknowit…)

Thedata(andthecluster)shouldbeverylargefortheadvantagestobecomeapparent.

ItwasdesignedforthecaseofcommodityHWwithouthighperformancenetworkconnects.YoushouldnotthinkofHDFSasa“HPCparallelfilesystem”.Youdon’tcommunicatedatatocompute,youcommunicatecomputetodata.

Basic design

http://en.wikipedia.org/wiki/Apache_Hadoop

HandlesJobs,Tasks,logicofcomputingclosetodata

DistributedFilesystem,HDFS

Resources

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

ThereareacoupleofexcellentbasictutorialsforgettingstartedtoplayingwithHadooponsinglenodeclustersorsmalldistributedclusters(yes,thisisfun,IrecommenditifyouhavethetimeandasomeLinux(admin)experience).

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Also,itisrathereasytosetupa“toy-hadoop”onyourownlaptop:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

Setting up Hadoop in Production

Thisisacomplexprocess.Inpractice,therearemanyvendorsthatprovidehostedservices,orprojectstomakeitsimplere.g.

• Cloudera• Hortonworks• ElasticMapReduce(EMR)inAWS• OpenStackSahara • ApacheBigtop

—>MoreimportanttogetfamiliarwithMapReduce.

https://developers.google.com/appengine/docs/python/dataprocessing/

So how do you use it?

55Frameworkprovidesoptimizedsort,shuffle.

Youprovide.

Key stages of MapReduce

1.Splitinput2.Map3.Shuffle4.Reduce

Hadoop Streaming

WhatifIdon’twanttouseJava?

TheHadoopstreamingAPIlet’syouwritethemapperandreducerasastandaloneprogram,readingfromstdinandwritingtostdout.YoumapperandreducercannowbewrittenusingPython,Ruby,Bashshell,R,younameit.

catinput.txt|./map|sort|./reduce>output.txt

Logically,itcorrespondstothefollowingUnixpipe:

Performance guidelines

•ThereisasignificantlatencyinstartingMap-tasksbytheframework•Mapsshoulddoconsiderablework(ruleofthumb,workatleast1min)•Defaultblocksizeis64-128MB,inputfilesshouldbelarge(GBsized)•Worksbestfor“one-shot”computations,notiterativeflows,duetothelargelatency.

•UseCombiner(localreduce)toreducethenetworkoverheadforshuffle/reduce.

Criticism against MapReduce• Restricted programming model. • Performance

http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html

Anoldbutstillinterestingdiscussion:

http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html

• Not novel

Apache Spark

Programming Patterns and Tools for Cloud Computing...• HDFS (Hadoop, Big Data) • Does not...

Documents

20030103 Lifetimes

Fluorescence Lifetimes

February 2013 LifeTIMES magazine

HDFS Architecture

LifeTimes: Winter 2011

HDFS Commands

Hdfs Dhruba

Certificates - HDFS

Understanding hdfs

Fast Decouple

HDFS & MapReduce

Somerset Lifetimes

LifeTimes: Fall 2011

Federated HDFS

LifeTimes: Winter 2012

LifeTimes: Spring 2009

Somerset Lifetimes, July 2014

McHarrie LifeTimes Fall 2015

Hadoop and HDFS Overview - segintechnologies.comsegintechnologies.com/.../Hadoop-and-HDFS-overview.pdf · HDFS with High Availability •We can deploy HDFS with high availability

b lifetimes