Airflow allows developers, admins and operations teams to author, schedule and orchestrate workflows and jobs within an organization. While its main focus started with orchestrating data pipelines, its ability to work seamlessly outside of the Hadoop stack makes it a compelling solution to manage even traditional workloads. The paper discusses the architecture of Airflow as a big data platform and how it can help address these challenges to create stable data pipelines for enterprises.
Orchestrating Big Data with Apache Airflow July 2016
6185 W DETROIT ST | CHANDLER, AZ 85226 | (623) 282-2385 | CLAIRVOYANTSOFT.COM | [email protected]
OVERVIEW
Data Analytics is playing a key role in the decision-making process at various stages of business in many industries. Data is being generated at a very fast pace through various sources across the business. Applications that automate the business processes are literally fountains of data today. Implementing solutions for use cases like "real time data ingestion from various sources", "processing the data at different levels of the data ingestion" and preparing the final data for analysis is a serious challenge given the dynamic nature of the data that is being generated. Proper orchestration, scheduling, management and monitoring of the data pipelines is a critical task for any data platform to be stable and reliable. Given the dynamic nature of the data sources, data inflow rates, data schema and processing needs, workflow management (pipeline generation / maintenance / monitoring) poses these challenges for any data platform.
This whitepaper provides a view on some of the open source tools available today. The paper also discusses the unique architecture of Airflow as a big data platform and how it can help address these challenges to create a stable data platform for enterprises. In addition, ingestion won't be halted at the first sign of trouble.
INTRODUCTION TO AIRFLOW
Airflow is a platform to programmatically author, schedule and monitor data pipelines that meets the needs of almost all the stages of the lifecycle of Workflow Management. The system was built by Airbnb on the below four principles:
• Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
• Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
• Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
• Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
Basic concepts of Airflow
• DAGs: Directed Acyclic Graph – a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
o DAGs are defined as Python scripts and are placed in the DAGs folder (which could be any location, but needs to be configured in the Airflow config file).
o Once a new DAG is placed into the DAGs folder, it is picked up by Airflow automatically within a minute's time.
• Operators: An operator describes a single task in a workflow. While DAGs describe how to run a workflow, operators determine what actually gets done.
o Task: Once an operator is instantiated with some parameters, it is referred to as a "task".
o Task Instance: A task executed at a point in time is called a task instance.
• Scheduling the DAGs/Tasks: The DAGs and tasks can be scheduled to run at a certain frequency using the below parameters.
o Schedule interval: Determines when the DAG should be triggered. This can be a cron expression or a Python timedelta object.
• Executors: Once the DAGs, tasks and the scheduling definitions are in place, something needs to execute the jobs/tasks. This is where executors come into the picture. There are three types of executors provided by Airflow out of the box:
o Sequential: The Sequential executor is meant for test drives; it executes the tasks one by one (sequentially). Tasks cannot be parallelized.
o Local: The Local executor is like the Sequential executor, but it can parallelize task instances locally.
o Celery: The Celery executor builds on Celery, an open source distributed task execution engine based on message queues, making it more scalable and fault tolerant. Message queues like RabbitMQ or Redis can be used along with Celery. This is typically used for production purposes.
Airflow has an edge over other tools in the space

Below are some key features where Airflow has an upper hand over other tools like Luigi and Oozie:
• Pipelines are configured via code, making the pipelines dynamic
• A graphical representation of the DAG instances and task instances along with the metrics
• Scalability: distribution of workers and queues for task execution
• Hot deployment of DAGs/tasks
• Support for Celery and SLAs, and a great UI for monitoring metrics
• Support for calendar schedules and crontab scheduling
• Backfill: ability to rerun a DAG instance in case of a failure
• Variables for making changes to the DAGs/tasks quick and easy
ARCHITECTURE OF AIRFLOW
Airflow typically consists of the below components.
• Configuration file: Where all the configuration points are set, like "which port to run the webserver on", "which executor to use", config related to RabbitMQ/Redis, workers, the DAGs location, the repository, etc.
• Metadata database (MySQL or Postgres): The database where all the metadata related to the DAGs, DAG runs, tasks and variables is stored.
• DAGs (Directed Acyclic Graphs): The workflow definitions (logical units) that contain the task definitions along with the dependency information. These are the actual jobs that the user would like to execute.
• Scheduler: A component that is responsible for triggering the DAG instances and job instances for each DAG. The scheduler is also responsible for invoking the executor (be it Local, Celery or Sequential).
• Broker (Redis or RabbitMQ): In the case of a Celery executor, the broker is required to hold the messages and act as a communicator between the executor and the workers.
• Worker nodes: The actual workers that execute the tasks and return the results.
• Webserver: A webserver that renders the UI for Airflow, through which one can view the DAGs and their status, rerun jobs, create variables, connections, etc.
HOW IT WORKS
• Initially the primary (instance 1) and standby (instance 2) schedulers would be up and running. Instance 1 would be declared as the primary Airflow server in the MySQL table.
• The DAGs folder on the primary instance (instance 1) would contain the actual DAGs, and the DAGs folder on the standby instance (instance 2) would contain the CounterpartPoller (PrimaryServerPoller).
o The primary server would be scheduling the actual DAGs as required.
o The standby server would be running the PrimaryServerPoller, which would continuously poll the primary Airflow scheduler.
• Let's assume the primary server has gone down. In that case, the PrimaryServerPoller would detect this and:
o Declare itself as the primary Airflow server in the MySQL table.
o Move the actual DAGs out of the DAGs folder and move the PrimaryServerPoller DAG into the DAGs folder on the old primary server (instance 1).
o Move the actual DAGs into the DAGs folder and move the PrimaryServerPoller DAG out of the DAGs folder on the old standby (instance 2).
• So here, the primary and standby servers have swapped their positions.
• Even if the Airflow scheduler on the current standby server (instance 1) comes back, since only the PrimaryServerPoller DAG would be running on it, there would be no harm. This server (instance 1) would remain the standby until the current primary server (instance 2) goes down.
• In case the current primary server goes down, the same process would repeat and the Airflow instance running on instance 1 would become the primary server again.
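The hand-off above can be sketched in plain Python, with the shared MySQL "primary flag" table mocked as a dict and scheduler liveness mocked as a heartbeat map. The function names (`primary_is_alive`, `promote_to_primary`) are illustrative; they are not part of Airflow or the whitepaper's actual poller code:

```python
# Mock of the shared MySQL table that records which instance is primary.
primary_table = {"primary": "instance1"}

def primary_is_alive(table, heartbeat):
    """The standby's poller checks whether the current primary responds."""
    return heartbeat.get(table["primary"], False)

def promote_to_primary(table, standby):
    """On failure: flip the primary flag and return the demoted instance.

    The real implementation would also swap the actual DAG files and
    the PrimaryServerPoller DAG between the two DAGs folders here.
    """
    old_primary = table["primary"]
    table["primary"] = standby
    return old_primary

# instance1 is primary and healthy; instance2's poller sees it alive.
heartbeat = {"instance1": True, "instance2": True}
assert primary_is_alive(primary_table, heartbeat)

# instance1 goes down: the poller on instance2 detects the failure
# and promotes itself, exactly as in the swap described above.
heartbeat["instance1"] = False
demoted = None
if not primary_is_alive(primary_table, heartbeat):
    demoted = promote_to_primary(primary_table, "instance2")

print(primary_table["primary"])  # instance2 is now primary
```

Because the flag lives in the shared MySQL table, a recovered instance 1 reads it, sees it is no longer primary, and stays in standby mode until instance 2 fails in turn.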
DEPLOYMENT VIEWS

Based on the needs, one may have to go with a simple setup or a complex setup of Airflow. There are different ways Airflow can be deployed (especially from an executor point of view). Below are the deployment options along with a description of each.
Standalone mode of deployment

Description: As mentioned in the above section, the typical installation of Airflow consists of the following.

• Configuration file (airflow.cfg): Contains the details of where to pick the DAGs from, what executor to run, how frequently the scheduler should poll the DAGs folder for new definitions, which port to start the webserver on, etc.
• Metadata Repository: Typically, a MySQL or Postgres database is used for this purpose. All the metadata related to the DAGs, their metrics, tasks and their statuses, SLAs, variables, etc. is stored here.
• Web Server: This renders the beautiful UI that shows all the DAGs and their current states, along with the metrics (which are pulled from the repository).
• Scheduler: This reads the DAGs and puts the details about them into the repository. It initiates the executor.
• Executor: This is responsible for reading the schedule interval info and creating the instances for the DAGs and tasks in the repository.
• Worker: The worker reads the task instances, performs the tasks and writes the status back to the repository.
Distributed mode of deployment

Description: The description of most of the components mentioned in the Standalone section remains the same, except for the executor and the workers.

• RabbitMQ: RabbitMQ is the distributed messaging service that the Celery executor leverages to queue the task instances. This is where the workers would typically read the tasks for execution. Basically, RabbitMQ exposes a broker URL that the Celery executor and workers talk to.
• Executor: Here the executor would be the Celery executor (configured in airflow.cfg). The Celery executor is configured to point to the RabbitMQ broker.
• Workers: The workers are installed on different nodes (based on the requirement) and are configured to read the task info from the RabbitMQ broker. The workers are also configured with a worker result backend, which can typically be the MySQL repository itself.

The important point to be noted here is:

Each worker node is nothing but an Airflow installation. The DAG definitions should be in sync on all the nodes (both the primary Airflow installation and the worker nodes).
Distributed mode of deployment with High Availability setup

Description: As part of the setup for high availability of the Airflow installation, we are assuming that the MySQL repository and RabbitMQ are configured to be highly available. The focus is on how to make the Airflow components like the Web Server and the Scheduler highly available.

The description of most of the components would remain the same as above. Below is what changes:

• New Airflow instance (standby): There would be another instance of Airflow set up as a standby.
o The ones shown in green are the primary Airflow instance.
o The one in red is the standby.
• A new DAG must be put in place, something called the "CounterpartPoller". The purpose of this DAG would be two-fold:
o To continuously poll the counterpart scheduler to see if it is up and running.
o If the counterpart instance is not reachable (which means the instance is down), to:
• Declare the current Airflow instance as the primary.
• Move the DAGs of the (previous) primary instance out of its DAGs folder and move the CounterpartPoller DAG into that DAGs folder.
• Move the actual DAGs into the DAGs folder and move the CounterpartPoller out of the DAGs folder on the standby server (the one which declares itself as primary now).

Note that the declaration as primary server by the instances can be done as a flag in some MySQL table.
TYPICAL STAGES OF WORKFLOW MANAGEMENT

The typical stages of the lifecycle for Workflow Management of Big Data are as follows:

• Create jobs to interact with systems that operate on data
o Use of tools/products like Hive / Presto / HDFS / Postgres / S3 etc.
• (Dynamic) workflow creation
o Based on the number of sources, size of data, business logic, variety of data, changes in the schema, and the list goes on.
• Manage dependencies between operations
o Upstream, downstream, cross-pipeline dependencies, previous job state, etc.
• Schedule the jobs/operations
o Calendar schedule, event driven, cron expression, etc.
• Keep track of the operations and the metrics of the workflow
o Monitor the current/historic state of the jobs, the results of the jobs, etc.
• Ensure fault tolerance of the pipelines and the capability to backfill any missing data, etc.

This list grows as the complexity increases.
TOOLS THAT SOLVE WORKFLOW MANAGEMENT

There are many Workflow Management tools in the market. Some have support for Big Data operations out of the box, and some need extensive customization/extensions to support Big Data pipelines.
• Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs
• BigDataScript: BigDataScript is intended as a scripting language for big data pipelines
• Makeflow: Makeflow is a workflow engine for executing large complex workflows on clusters and grids
• Luigi: Luigi is a Python module that helps you build complex pipelines of batch jobs (this is a strong contender to Airflow)
• Airflow: Airflow is a platform to programmatically author, schedule and monitor workflows
• Azkaban: Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs
• Pinball: Pinball is a scalable workflow manager developed at Pinterest
Most of the mentioned tools meet the basic needs of workflow management. When it comes to dealing with complex workflows, only a few of the above shine. Luigi, Airflow, Oozie and Pinball are the tools preferred (and being used in production) by most teams across the industry.

None of the existing resources (on the web) talk about the architecture and the setup of Airflow in production with the Celery executor, and more importantly about how Airflow needs to be configured to be highly available. Hence, here is an attempt to share that missing information.
INSTALLATION STEPS FOR AIRFLOW AND HOW IT WORKS
Install pip

Installation steps

sudo yum install epel-release
sudo yum install python-pip python-wheel
Install Erlang

Installation steps

sudo yum install wxGTK
sudo yum install erlang
RabbitMQ

Installation steps

wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.2/rabbitmq-server-3.6.2-1.noarch.rpm
sudo yum install rabbitmq-server-3.6.2-1.noarch.rpm
Celery

Installation steps

pip install celery
Airflow: Pre-requisites

Installation steps

sudo yum install gcc-gfortran libgfortran numpy redhat-rpm-config python-devel gcc-c++
Airflow

Installation steps

# create a home directory for airflow
mkdir ~/airflow
# export the location to the AIRFLOW_HOME variable
export AIRFLOW_HOME=~/airflow
pip install airflow
Initialize the Airflow Database

Installation steps

airflow initdb
By default, Airflow installs with a SQLite DB. The above step would create an airflow.cfg file within the $AIRFLOW_HOME directory. Once this is done, you may want to change the repository database to some well-known (highly available) relational database like MySQL or Postgres. Then reinitialize the database (using the airflow initdb command). That would create all the required tables for Airflow in the relational database.
Start the Airflow components

Installation steps

# Start the Scheduler
airflow scheduler
# Start the Webserver
airflow webserver
# Start the Worker
airflow worker
Here are a few important configuration points in the airflow.cfg file:

• dags_folder = /root/airflow/DAGS
o The folder where your Airflow pipelines live.
• executor = LocalExecutor
o The executor class that Airflow should use.
• sql_alchemy_conn = mysql://root:root@localhost/airflow
o The SqlAlchemy connection string to the metadata database.
• base_url = http://localhost:8080
o The hostname and port at which the Airflow webserver runs.
• broker_url = sqla+mysql://root:root@localhost:3306/airflow
o The Celery broker URL. Celery supports RabbitMQ, Redis and, experimentally, a SQLAlchemy database.
• celery_result_backend = db+mysql://root:root@localhost:3306/airflow
o A key Celery setting that determines where the workers write the results.
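Put together, these settings land in airflow.cfg roughly as follows. This is a sketch using the illustrative host, credential and path values from the list above (not values you should keep in production), with the section names as generated by Airflow 1.x; the Celery settings only take effect once the executor is switched from LocalExecutor to CeleryExecutor:

```ini
[core]
dags_folder = /root/airflow/DAGS
executor = LocalExecutor
sql_alchemy_conn = mysql://root:root@localhost/airflow

[webserver]
base_url = http://localhost:8080

[celery]
broker_url = sqla+mysql://root:root@localhost:3306/airflow
celery_result_backend = db+mysql://root:root@localhost:3306/airflow
```

After editing these values, rerun airflow initdb so the metadata tables are created in the configured database.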
This should give you a running Airflow instance and set you on the path to running it in production.
If you have any questions or feedback, we would love to hear from you; do drop us a line at [email protected]. We use Airflow extensively in production and would love to hear more about your needs and thoughts.
ABOUT CLAIRVOYANT
Clairvoyant is a global technology consulting and services company. We help organizations build innovative products and solutions using big data, analytics, and the cloud. We provide best-in-class solutions and services that leverage big data and continually exceed client expectations. Our deep vertical knowledge combined with expertise on multiple, enterprise-grade big data platforms helps support purpose-built solutions to meet our clients' business needs.