High-Availability Algorithms for Distributed Stream Processing

Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Ugur Cetintemel, Michael Stonebraker, and Stan Zdonik
Brown University and MIT
{jhhwang, alexr, ugur, sbz}@cs.brown.edu, {mbalazin, stonebraker}@lcs.mit.edu

Abstract

Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker recovery guarantees. In this paper, we study various recovery guarantees and pertinent recovery techniques that can meet the correctness and performance requirements of stream-processing applications.

We discuss the design and algorithmic challenges associated with the proposed recovery techniques and describe how each can provide different guarantees with proper combinations of redundant processing, checkpointing, and remote logging. Using analysis and simulations, we quantify the cost of our recovery guarantees and examine the performance and applicability of the recovery techniques. We also analyze how the knowledge of query network properties can help decrease the cost of high availability.

1 Introduction

Stream-processing engines (SPEs) [1, 3, 5, 6, 16, 18] are designed to support a new class of data processing applications, called stream-based applications, where data is pushed to the system in the form of streams of tuples and queries are continuously executed over these streams. These applications include sensor-based monitoring (car traffic, air quality, battlefield), financial applications (stock-price monitoring, ticker failure detection), and asset tracking. Because data sources are commonly located at remote sites, stream-based applications can gain in both scalability and efficiency if the servers collectively process and aggregate data streams while routing them from their origins to the target applications. As a result, recent attention has been focused on extending stream processing to distributed environments, resulting in so-called distributed stream-processing systems (DSPSs) [6, 7, 22].

In a DSPS, the failure of a single server can significantly disrupt or even halt overall stream processing. Indeed, such

This material is based upon work supported by the National Science Foundation under Grants No. IIS-0205445, IIS-0325838, IIS-0325525, IIS-0325703, and IIS-0086057. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.



a failure causes the loss of a potentially large amount of transient information and, perhaps more importantly, prevents downstream servers from making progress. A DSPS therefore must incorporate a high-availability mechanism that allows processing to continue in spite of server failures. This aspect of stream processing, however, has received little attention until now [23]. In this paper, we focus on approaches where once a server fails, a backup server takes over the operation of the failed one. Tightly synchronizing a primary and a secondary so that they always have the same state incurs high run-time overhead. Hence, we explore approaches that relax this requirement, allowing the backup to rebuild the missing state instead.

Because different stream processing applications have different high-availability requirements, we define three types of recovery guarantees that address these different needs.

Precise recovery hides the effects of a failure perfectly, except for some transient increase in processing latency, and is well-suited for applications that require the post-failure output be identical to the output without failure. Many financial services applications have such strict correctness requirements.

Rollback recovery avoids information loss without guaranteeing precise recovery. The output produced after a failure is equivalent to that of an execution without failure, but not necessarily to the output of the execution that failed. The output may also contain duplicate tuples. To avoid information loss, the system must preserve all the necessary input data for the backup server to rebuild (from its current state) the primary's state at the moment of failure. Rollback recovery is thus appropriate for applications that cannot tolerate information loss but may tolerate imprecise output caused by the backup server reprocessing the input somewhat differently than the primary did. Example applications include those that alert when specific conditions occur (e.g., fire alarms, theft prevention through asset tracking). We show in Section 6 that this recovery guarantee can be provided more efficiently than precise recovery both in terms of runtime overhead and recovery speed.

Gap recovery, our weakest recovery guarantee, addresses the needs of applications that operate solely on the most recent information (e.g., sensor-based environment monitoring), where dropping some old data is tolerable for reduced recovery time and runtime overhead.

We define these recovery semantics more precisely in Section 3. To the best of our knowledge, commercial DBMSs typically offer precise or gap recovery capabilities [8, 19, 20, 21] and no existing solution addresses rollback recovery or a similar weak recovery model.

We also investigate four recovery approaches that can provide one or more of the above recovery guarantees. Since each approach employs a different combination of redundant computation, checkpointing, and remote logging, they offer different tradeoffs between runtime overhead and recovery performance.

We first introduce amnesia, a lightweight scheme that provides gap recovery without any runtime overhead (Section 4). We then present passive standby and active standby, two process-pairs [4, 10] approaches tailored to stream processing. In passive standby, each primary server (a.k.a. node) periodically reflects its state updates to its secondary node. In active standby, the secondary nodes process all tuples in parallel with their primaries. We also propose upstream backup, an approach that significantly reduces runtime overhead compared to the standby approaches while trading off a small fraction of recovery speed. In this approach, upstream nodes act as backups for their downstream neighbors by preserving tuples in their output queues while their downstream neighbors process them. If a server fails, its upstream nodes replay the logged tuples on a recovery node. In Section 5, we describe the details of these approaches with an emphasis on the unique design challenges that arise in stream processing.

Upstream backup and the standby approaches provide rollback recovery in their simplest forms and can be extended to provide precise recovery at a higher runtime cost, as we discuss in Section 6. Interestingly, for a given high-availability approach, the overhead to achieve precise recovery can noticeably change with the properties of the operators constituting the query network. We thus develop in Section 3 a taxonomy of stream-processing operators, classifying them according to their impact on recovery semantics. Section 6 shows how such knowledge helps reduce high-availability costs and affects the choice of most appropriate high-availability technique.

Finally, by comparing the runtime overhead and recovery performance for each combination of recovery approach and guarantee (Section 7), we characterize the tradeoffs among the approaches and describe the scenarios when each is most appropriate. We find that upstream backup requires only a small fraction of the runtime cost of others, while keeping recovery time relatively short for queries with moderate state size. The size of query state and the frequency of high-availability tasks significantly influence the recovery performance of upstream backup and the runtime performance of passive standby. We also find that there is a fundamental tradeoff between recovery time and runtime overhead and that each approach covers a complementary portion of the solution space.

2 The System Model

A data stream is a sequence of tuples that are continuously generated in real time and need to be processed on arrival.

Figure 1. An example DSPS

This model of processing data before (or instead of) storing it contrasts with the traditional process-after-store model employed by all conventional DBMSs. In stream-processing systems [1, 3, 6], each operator is a processing unit (map, filter, join, aggregate, etc.) that receives input tuples through its input queues (one for each input stream) and produces output tuples based on its execution semantics. A loop-free, directed graph of operators is called a query network. A DSPS partitions its query network across multiple nodes. Each node runs a stream-processing engine (SPE). Figure 1 illustrates a query network distributed across three nodes, Nu, N, and Nd. In the figure, streams are represented by solid line arrows while operators are represented as boxes labeled with symbols denoting their functions. Since messages flow on streams I1 and I2 from Nu to N, Nu is said to be upstream of N, and N is said to be downstream of Nu. We assume that the communication network ensures order-preserving, reliable message transport (e.g., TCP).

Since we focus on single-node fail-stop failures (i.e., handling network failures, partitions, or multiple simultaneous failures including those during recovery is beyond the scope of this paper), we associate each node N with a recovery node N' that is in charge of detecting as well as handling the failure of N. N in this case is called a primary node. For N' we use the terms recovery node, secondary node, and backup node interchangeably. Each recovery node runs its own SPE, and has the same query-network fragment as its primary, but its state is not necessarily the same as that of the primary.

To detect failures, each recovery node periodically sends keep-alive requests to its primary and assumes that the latter failed if a few consecutive responses do not return within a timeout period (for example, our prototype uses three messages with 100 ms transmission interval, for an average failure detection delay of 250 ms). When a recovery node detects the failure of its primary, if it was not already receiving the input streams, it asks the upstream nodes to start sending it the data (in Figure 1, I1 and I2 switch to I1' and I2' respectively). The recovery node also starts forwarding its output streams to downstream nodes (in Figure 1, O switches to O'). Because the secondary may need to reprocess some earlier input tuples to bring its state up-to-date with the pre-failure state of the primary, each redirected input stream must be able to replay earlier tuples. For this purpose, each output stream has an output queue as a temporary storage for tuples sent.
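The keep-alive scheme above can be sketched as a small counter-based detector. This is a toy model, not the paper's implementation: the class name and interface are hypothetical, and the defaults follow the prototype's parameters (three missed responses at a 100 ms interval).

```python
class KeepAliveDetector:
    """Toy failure detector for a secondary node: declare the primary
    failed after `max_missed` consecutive keep-alive responses fail to
    arrive within the timeout (defaults mirror the paper's prototype)."""

    def __init__(self, interval=0.1, max_missed=3):
        self.interval = interval      # seconds between keep-alive requests
        self.max_missed = max_missed  # consecutive misses before declaring failure
        self.missed = 0

    def on_response(self):
        # Any response from the primary resets the miss counter.
        self.missed = 0

    def on_timeout(self):
        # Called when a keep-alive response did not arrive in time.
        # Returns True once the primary should be declared failed.
        self.missed += 1
        return self.missed >= self.max_missed
```

With these defaults, failure is declared on the third consecutive miss; a single late-but-arriving response resets the counter, which keeps transient delays from triggering takeover.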

Finally, once a failed node comes back to life, it assumes the role of the secondary. As we discuss in Section 5, each approach requires a different amount of time for recovery and, thus, for the system to tolerate a new failure.

3 High-Availability Semantics

In this section, we define three recovery types, based on their effects as perceived by the nodes downstream from the failure. Since some operator properties facilitate stronger recovery guarantees, we also devise an operator classification based on their effects on recovery semantics.

3.1 Recovery Types

We assume that a query-network fragment, Q, is given to a primary/secondary pair. Q has a set of n input streams (I1, I2, ..., In) and produces one output stream O. The definitions below can easily be extended to query-network fragments with multiple output streams.

Because the processing may be non-deterministic, as we discuss in Section 3.2, executing Q over the same input streams may each time produce a different sequence of tuples on the output stream. We define an execution to be the sequence of events (such as the arrival, processing or production of a tuple) that occur while a node runs Q. Given an execution e, we denote with Oe the output stream produced by e. We express the overall output stream after failure and recovery as Of + O', where f is the pre-failure execution of the primary and O' is the output stream produced by the secondary after it took over.

Precise Recovery: The strongest failure recovery guarantee, called precise recovery, completely masks a failure and ensures that the output produced by an execution with failure (and recovery) is identical to the output produced by an execution e without failure: i.e., Of + O' = Oe.

Rollback Recovery: A weaker recovery guarantee, called rollback recovery, ensures that failures do not cause information loss. More specifically, it guarantees that the effects of all input tuples are always forwarded to downstream nodes in spite of failures. Achieving this guarantee requires:

1. Input preservation - The upstream nodes must store in their output queues all tuples that the secondary needs to rebuild, from its current state, the primary's state. We refer to such tuples as duplicate input tuples because they have already entered the primary node.

2. Output preservation - If a secondary is running ahead of its primary, the secondary must store tuples in its output queues until all the downstream nodes receive the corresponding tuples from the primary node. The tuples at the secondary are then considered duplicate.

Because the secondary may follow a different execution than its primary, duplicate output tuples are not necessarily identical to those produced by the primary. We consider an output tuple t at the secondary to be duplicate if the primary has already processed all input tuples that affected the value of t and forwarded the resulting output tuples downstream. We formally define rollback recovery and duplicate output tuples in [11].
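The rollback guarantee can be illustrated with a toy check (the helper name is hypothetical): the concatenation Of + O', after removing duplicate tuples, should equal the output Oe of a failure-free execution. Identity-based deduplication only models repeating recovery; convergent recovery would require comparing tuples by the inputs that produced them rather than by value.

```python
def equivalent_after_dedup(pre_failure, post_failure, no_failure):
    """Return True if pre-failure output + secondary output, with
    duplicate tuples removed, matches a failure-free execution."""
    combined = list(pre_failure)
    for t in post_failure:
        if t not in combined:   # duplicates were already produced pre-failure
            combined.append(t)
    return combined == list(no_failure)

# Repeating recovery (cf. Figure 2): t2 and t3 are produced twice,
# but deduplication recovers the failure-free sequence.
assert equivalent_after_dedup(
    ["t1", "t2", "t3"], ["t2", "t3", "t4", "t5", "t6"],
    ["t1", "t2", "t3", "t4", "t5", "t6"])

# Gap recovery loses t4, so no deduplication can recover Oe.
assert not equivalent_after_dedup(
    ["t1", "t2", "t3"], ["t5", "t6"],
    ["t1", "t2", "t3", "t4", "t5", "t6"])
```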

Figure 2. Outputs produced by each type of recovery:

  Recovery type            Before failure    After failure
  Precise                  t1 t2 t3          t4 t5 t6 ...
  Gap                      t1 t2 t3          t5 t6 ...
  Rollback - Repeating     t1 t2 t3          t2 t3 t4 ...
  Rollback - Convergent    t1 t2 t3          t2' t3' t4 ...
  Rollback - Divergent     t1 t2 t3          t2' t3' t4' ...

We use the configuration in Figure 1 to illustrate these concepts. We cannot discard tuples in the output queues of I1 and I2 if N' requires them to rebuild N's state. Similarly, if N' is running ahead of N, it must preserve all tuples in O's output queue until they become duplicate (i.e., Nd receives from N tuples resulting from processing the same input tuples).

Rollback recovery allows the secondary to forward duplicate output tuples downstream. The characteristics of Q determine the characteristics of such duplicate output tuples as well as the properties of Of + O'. We distinguish three types of rollback recovery. In the first type, repeating recovery, duplicate output tuples are identical to those produced previously by the primary. With the second type, convergent recovery, duplicate output tuples are different from those produced by the primary. The details on such situations are discussed in Section 3.2 under convergent-capable operators. In both recovery types, however, the concatenation of Of and O' after removing duplicate tuples is identical to an output without failure, Oe. Finally, the third type of recovery, divergent recovery, has the same properties as convergent recovery regarding duplicate output tuples. Eliminating these duplicates, however, does not produce an output that is achievable without failure, because of the non-determinism in processing.

Gap Recovery: Any recovery technique that does not ensure both input and output preservation may result in information loss. This recovery type is called gap recovery.

Example: Figure 2 shows examples of outputs produced by each recovery type. With precise recovery, the output corresponds to an output without failure: tuples t1 through t6 are produced in sequence. With gap recovery, the failure causes the loss of tuple t4. Repeating recovery produces tuples t2 and t3 twice. Convergent recovery generates different tuples t2' and t3' after failure (but corresponding to t2 and t3) but then produces tuples t4 and following as would an execution without failure. Finally, divergent recovery keeps producing equivalent rather than identical tuples after the failure.

Propagation of Recovery Effects: The semantics above define the effects of failure and recovery on the output stream of the failed query-network fragment. These effects then propagate through the rest of the query network until they reach client applications. Because precise recovery masks failures, no side effects propagate. Gap recovery may lose tuples. After a failure, client applications may thus miss a burst of tuples. Because the query network may aggregate many tuples into a single output tuple, missing tuples may also result in incorrect output values: e.g., a sum operator may produce a lower sum. Rollback recovery does not lose tuples but may generate duplicate tuples. The final output stream may thus

Figure 3. Taxonomy of Aurora operators:

  Arbitrary (e.g., Union, operators with timeout)
    ⊃ Deterministic (e.g., BSort, Resample)
      ⊃ Convergent-capable (e.g., Aggregate with no timeout)
        ⊃ Repeatable (e.g., Filter, Map, Join with no timeout)

Table 1. Summary of notation:

  p             per-input-tuple processing time
  d             network transmission delay between any nodes
  λ             input tuple arrival rate
  C             size of checkpoint message
  c             size of queue-trimming message
  M             checkpoint or queue-trimming interval
  D             failure detection delay
  r             time to redirect input streams
  nbops         number of operators in the query network
  nbpaths       number of paths from input to output streams
  bwoverhead    bandwidth consumed for high availability
  procoverhead  processing required for high availability

contain a burst of either redundant or incorrect tuples: e.g., a sum operator downstream may produce a higher sum value. It is also possible, however, that duplicate-insensitive operators (e.g., max) downstream can always guarantee correct results. In general, the recovery type for a node must be chosen based on the application's correctness criteria and the characteristics of the operators on the node and downstream.

3.2 Operator Classification

We distinguish four types of operators based on their effects on recovery semantics: arbitrary (including non-deterministic), deterministic, convergent-capable, and repeatable. Figure 3 depicts the containment relationship among these operator types and the classification of Aurora operators [1, 2]. The type of a query network is determined by the type of its most general operator.

An operator is deterministic if it produces the same output stream every time it starts from the same initial state and receives the same sequence of tuples on each input stream. There are three possible causes of non-determinism in operators: dependence on time (either execution time or input tuple arrival times), dependence on the arrival order of tuples on different input streams (e.g., union, which interleaves tuples from multiple streams), and use of non-determinism in processing such as randomization.

A deterministic operator is called convergent-capable if it yields a convergent recovery when it restarts from an empty internal state and re-processes the same input streams, starting from an arbitrary earlier point in time. To be convergent-capable, an operator must thus rebuild its internal state from scratch and update it on subsequent inputs in a manner that eventually converges to the execution that would have existed without failure. Window alignment is the only possible cause that prevents a deterministic operator from being convergent-capable. This is because window boundaries define the sequences of tuples over which operators perform computations. Therefore, a deterministic operator is convergent-capable if and only if its window alignments always converge to the same alignment when restarted from an arbitrary one.

A convergent-capable operator is repeatable if it yields a repeating recovery when it restarts from an empty internal state and re-processes the same input streams, starting from an arbitrary earlier point in time (the operator must produce identical duplicate tuples). A necessary condition for an operator to be repeatable is for the operator to use at most one tuple from each input stream to produce an output tuple. If a sequence of multiple tuples contributes to an output tuple, then

restarting the operator from the middle of that sequence may yield at least one different output tuple. Aggregates are thus not repeatable in general, whereas filter (which simply drops tuples that do not match a given predicate) and map (which transforms tuples by applying functions to their attributes) are repeatable as they have one input stream and process each tuple independently of others. Join (without timeout) is also repeatable because its windows defined on input streams have alignments relative to the latest input tuple being processed.

In the following sections, we present approaches for gap recovery, rollback recovery, and precise recovery, respectively. For each approach, we discuss the impact of the query-network type on recovery and analyze the tradeoffs between recovery time and runtime overhead. Table 1 summarizes the notation that we use.

Table 1. Summary of notation (continued):

  β        number of lost or redundant tuples
  K        delay before processing first duplicate input tuple
  Q        average number of input tuples to re-process
  rectime  time spent recreating the failed state (after failure detection)
           bandwidth consumed for tuple transmission
           processing required for ordinary tuple processing

4 Gap Recovery

The simplest approach to high availability is for the secondary node to restart the failed query network from an empty state and continue processing input tuples as they arrive. This approach, called amnesia, produces a gap recovery for all types of query networks. In amnesia, the failure detection delay, the rates of tuples on streams, and the size of the state of the query network determine the number, β, of lost tuples. This approach imposes no overhead at runtime (cf. Table 3).

We define recovery time as the interval between the time when the secondary discovers that its primary failed and the time it reaches the primary's pre-failure state (or an equivalent state for a non-deterministic query network). Recovery time thus measures the time spent recreating the failed state. Since amnesia does not recreate the lost state and drops tuples until the secondary is ready to accept them, the recovery time is zero. It takes time r to redirect the inputs to the secondary, but when processing restarts, the first tuples processed are those that would have been processed at the same time if the failure did not happen. I.e., there is no extra delay due to the failure or recovery.

5 Rollback Recovery Protocols

We present three approaches to achieve rollback recovery, each one using a different combination of redundant computation, checkpointing, and logging at other nodes. We first

Table 2. Type of rollback recovery achieved by each high-availability approach for each query-network type:

  Approach          Repeatable   Convergent-capable   Deterministic   Arbitrary
  Passive standby   Repeating    Repeating            Repeating       Divergent
  Upstream backup   Repeating    Convergent           Divergent       Divergent
  Active standby    Repeating    Repeating            Repeating       Divergent

Table 3. Recovery time and runtime overhead for each approach:

  Approach          rectime                                     bwoverhead        procoverhead
  Amnesia           0                                           0                 0
  Passive standby   K + Qp, where K = r + d; Q ≈ λM/2           f1(M, C)          f2(M, C)
  Upstream backup   K + Qp, where K = r + d; Q ≈ |state| + λ(M + 2d)   f3(M, c)   f4(M, nbops, nbpaths)
  Active standby    ≈ 0 (negligible)                            100% + f3(M, c)   100% + 2 f4(M, nbops, nbpaths)

present passive standby, an adaptation of the process-pairs model with passive backup. Passive standby relies on checkpointing to achieve high availability. Then, we introduce upstream backup, where upstream nodes in the processing flow serve as backup for their downstream neighbors by logging their output tuples. Finally, we describe active standby, another adaptation of the process-pairs model where each standby performs processing in parallel with its primary. We discuss active standby last, because it relies on concepts introduced in upstream backup.

For each approach, we examine the recovery guarantees it provides, the average recovery time, and the runtime overhead. We divide the runtime overhead into processing and communication (or bandwidth) overhead. Table 2 summarizes the recovery types achieved by each approach while Table 3 summarizes their performance metrics.

5.1 Passive Standby

In passive standby, each primary periodically sends the delta of its state to the secondary, which takes over from the latest checkpoint when the primary fails. Since real-time response is crucial for many stream-processing applications, the main challenge in passive standby is to enable the primary to continue processing even during a checkpoint.

The state of a query network consists of the states of input queues of operators, operators themselves, and the node output queues (one for each output stream). Each checkpoint message (a.k.a. state update message) thus captures the changes to the states of those queues and operators on the primary node since the last checkpoint message was composed. For each queue, the checkpoint message contains the newly enqueued tuples as well as the last dequeue position. For an operator, however, the content of the message depends on the operator type. For example, the message is empty for stateless operators while it stores, for an aggregate operator, either some summary values (e.g., count, sum, etc.) or the actual tuples that newly entered the operator's state.

To avoid the suspension of processing, the composition of a checkpoint message is conducted along a virtual sweep line that moves from left (upstream) to right (downstream). At every step, an operator closest to the right of the sweep line is chosen and once its state difference is saved in the checkpoint message, the sweep line moves to the right of the operator. The primary is free to execute operators away from the

sweep line both upstream and downstream because these concurrent tasks do not violate the consistency of the checkpoint message. Indeed, executing operators to the left of the sweep line is equivalent to executing them after checkpointing. Executing operators to the right of the sweep line corresponds to executing them before the message composition.

Passive standby guarantees rollback recovery as follows: (1) input preservation - each upstream primary node preserves output tuples in its output queues until they are safely stored at the downstream secondaries. In Figure 1, whenever standby node N' receives a checkpoint from N, it informs upstream node Nu about the new tuples that it received on its input streams, I1 and I2. Nu discards only those acknowledged tuples from its output queues. (2) output preservation - the secondary is always behind the primary because its state corresponds to the last checkpointed state. If a primary fails, the secondary takes over and sends all tuples from its output queues to the downstream nodes. The secondary also asks upstream nodes to start sending it their output streams, including tuples stored in their output queues. When the failed node rejoins the system, it assumes the role of the secondary. Because the new secondary has an empty state, the primary sends its complete state in the first checkpoint message.

Recovery Type: Because the secondary node restarts from a past state of its primary, passive standby provides repeating recovery for deterministic query networks and divergent recovery for others.

Recovery Time: Passive standby has a short recovery time because the backup holds a complete and recent snapshot of the primary's state. Recovery time is equal to K + Qp, where K is the delay before the recovery node receives its first input tuple, Q is the number of duplicate input tuples it reprocesses, and p is the average processing time per input tuple. K is the sum of r (the time to redirect input streams) and d (the time for the first tuple to propagate from the upstream nodes). Q is on average half a checkpoint interval worth of input tuples. The average number, β, of duplicate tuples is close to M·λout, where M is the checkpoint interval and λout is the rate of tuples on output streams.

Overhead: Passive standby imposes high runtime overhead. The bandwidth overhead is inversely proportional to

[Figure 4. Inter-node communication in upstream backup. Each node produces tuples and stores them in its output queues; nodes acknowledge received tuples with level-0 acks, map output tuples onto input tuples, send level-1 acks, and trim their output queues accordingly.]

[Figure 5(a). Na receives acks from downstream neighbors Nb and Nc and new tuples from upstream. The filter processes I1[900] and produces S[500].]

the checkpoint interval and proportional to the size of checkpoint messages. The processing overhead consists of generating and processing checkpoint messages (proportional to the bandwidth overhead). The checkpoint interval (M) determines the tradeoff between runtime overhead and recovery time. Table 3 summarizes these results.
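The delta-checkpoint idea can be sketched as follows. This is a minimal model, not the prototype's code: it assumes a hypothetical operator interface in which each operator accumulates its state changes since the last checkpoint, and it visits operators in upstream-to-downstream order as the sweep line would (the concurrent execution of operators away from the line is not simulated).

```python
def compose_checkpoint(operators):
    """Walk the sweep line from upstream to downstream, capturing each
    operator's state difference since the last checkpoint message."""
    message = {}
    for op in operators:                 # sweep line moves left to right
        message[op["name"]] = dict(op["delta"])  # copy the delta
        op["delta"].clear()              # difference is now captured
    return message

def apply_checkpoint(backup_state, message):
    """Secondary merges the received delta into its copy of the state."""
    for name, delta in message.items():
        backup_state.setdefault(name, {}).update(delta)

# One aggregate operator whose summary values changed since last checkpoint.
primary = [{"name": "agg", "delta": {"count": 7, "sum": 42}}]
backup = {}
apply_checkpoint(backup, compose_checkpoint(primary))
assert backup == {"agg": {"count": 7, "sum": 42}}
```

Sending only deltas keeps checkpoint messages small for slowly changing state, which is why the bandwidth overhead scales with message size C and inversely with the interval M.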


5.2 Upstream Backup

In upstream backup, upstream nodes act as backups for their downstream neighbors by logging tuples in their output queues until all downstream neighbors completely process these tuples. For instance, in Figure 1, node Nu serves as backup for node N: if N fails, N' restores the lost state by re-processing the tuples logged at Nu. When a failed node rejoins the system, it assumes the role of the recovery node starting from an empty state. The system is then able to tolerate a new failure without further delay.

The main difficulty of this approach is to determine the maximum set of logged tuples that can safely be discarded given operator non-determinism and the many-to-many relationship between input and output tuples.

Figure 4 shows a typical communication sequence between three nodes Nu, N, and App. Each node produces and sends tuples downstream while storing them in its output queues. Each node also periodically acknowledges reception of input tuples by sending level-0 acks to its direct upstream neighbors. When a node (e.g., N) receives level-0 acks from downstream neighbors (e.g., App), it notifies its own upstream neighbors (e.g., Nu) about the earliest logged tuples (one per Nu's output) that contributed to producing the acknowledged tuples and are thus the oldest tuples necessary to re-build the current state (of N). Discarding only earlier tuples allows the system to survive single failures. The notifications are thus called level-1 acks (denoted ACK(1, S, u), where S identifies a stream and u identifies a tuple on that stream). Leaf nodes in the DSPS use level-0 acks to trim their output queues.

Since upstream nodes log all tuples necessary for the secondary to re-build the primary's state from an empty state (input preservation) and the secondary restarts from an empty state (output preservation), upstream backup provides rollback recovery.
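The log-and-replay behavior of an upstream node can be sketched as a small queue model. The class and method names are hypothetical illustrations, not the paper's API: `send` logs each tuple as it is forwarded, `trim` discards tuples once they are acknowledged at level 1, and `replay` yields what a recovery node would reprocess.

```python
class UpstreamNode:
    """Toy model of upstream backup: keep every sent tuple in an
    output queue until downstream neighbors confirm (via level-1
    acks) that it is no longer needed for recovery."""

    def __init__(self):
        self.output_queue = []   # logged (sequence number, tuple) pairs

    def send(self, seq, tup):
        self.output_queue.append((seq, tup))  # log before/while forwarding
        return tup

    def trim(self, seq):
        # Safe to discard tuples strictly older than `seq` once all
        # downstream neighbors have level-1-acked up to that position.
        self.output_queue = [(s, t) for s, t in self.output_queue if s >= seq]

    def replay(self):
        # Tuples a recovery node would reprocess after a downstream failure.
        return [t for _, t in self.output_queue]

up = UpstreamNode()
for i in range(5):
    up.send(i, f"tuple-{i}")
up.trim(2)   # tuples 0 and 1 no longer contribute to downstream state
assert up.replay() == ["tuple-2", "tuple-3", "tuple-4"]
```

The correctness of `trim` hinges entirely on choosing `seq` conservatively, which is exactly what the level-1 ack protocol below computes.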

5.2.1 Queue Trimming Protocol

To avoid spurious transmissions, nodes produce both level-0 and level-1 acks every M seconds. A lower ack frequency

[Figure 5. One iteration of upstream backup. (b) Na trims its output queue at (O, 50) while pushing new tuples O[187] and O[188] downstream. Na also maps the lowest level-0 ack received, (O, 123), onto level-1 acks.]

reduces bandwidth utilization, but increases the size of output queues and the recovery time.

To compose level-1 acks, each node finds, for each output stream O, the latest output tuple O[v] acknowledged at level-0 by all downstream neighbors. For each input stream I, the node maps O[v] back onto the earliest input tuple I[u] that caused O[v]. This backward mapping is conducted by a function cause((O, v), I) → (I, u), where (I, u) denotes the identifier of tuple I[u] and marks the beginning of the sequence of tuples on I necessary to regenerate O[v]. We discuss the cause function next. The node performs these mappings for each output stream and identifies the earliest tuple on each input stream that can now be trimmed. The node produces level-1 acks for these tuples. Each upstream neighbor trims its output queues up to the position that corresponds to the oldest tuple acknowledged at level-1 by all downstream neighbors. We present this algorithm in more detail in [11].

Figure 5 illustrates one iteration of the upstream-backup algorithms on one node. In the example, node Na receives level-0 and level-1 acks from two downstream neighbors Nb and Nc. First, since both neighbors have now sent level-1 acks for tuples up to O[50], Na removes from its output queue all tuples preceding O[50]. Second, since both Nb and Nc have sent level-0 acks for tuples up to O[123], Na maps O[123] back onto the first input tuples that caused it. Na sends level-1 acks for these tuples, identified with (I1, 200) and (I2, 100). In the example, Na also receives tuples I1[901] and I2[257] from its upstream neighbors and acknowledges their reception with level-0 acks.
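The level-1 ack computation above can be sketched in a few lines. This is an illustrative reduction of the protocol, with hypothetical names: it takes the level-0 acks from every downstream neighbor, keeps the most conservative (minimum) acknowledged output position, and maps it back through a `cause` function to the earliest contributing input tuple on each input stream.

```python
def level1_acks(level0_acks, cause, input_streams=("I1", "I2")):
    """Compute level-1 ack positions: take the latest output tuple
    acknowledged at level 0 by *all* downstream neighbors (the min of
    their ack positions), then map it back onto the earliest input
    tuple on each input stream that contributed to it."""
    v = min(level0_acks.values())          # acknowledged by every neighbor
    return {i: cause(v, i) for i in input_streams}

# Hypothetical cause function consistent with the Figure 5 example:
# output tuple O[123] was first affected by I1[200] and I2[100].
cause = lambda v, stream: {"I1": 200, "I2": 100}[stream] if v >= 123 else 0

# Nb has level-0-acked up to O[123], Nc up to O[125]; the common
# prefix ends at O[123], yielding level-1 acks (I1, 200) and (I2, 100).
acks = level1_acks({"Nb": 123, "Nc": 125}, cause)
assert acks == {"I1": 200, "I2": 100}
```

Taking the minimum over neighbors is what makes trimming safe: an upstream queue position is discarded only once every downstream neighbor could rebuild its state without it.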

5.2.2 Mapping Output Tuples onto Input Tuples

We now discuss how nodes compute the cause function, cause((O, v), I) → (I, u). This function maps an arbitrary

output tuple O[v] on stream O onto the earliest input tuple I[u] on input stream I that has contributed to the production of O[v] (i.e., affected the value of O[v]). To facilitate this mapping, we propose to keep track of the oldest input tuples that affect any computation, by appending input-tuple indicators to tuples as they travel through operators on a node. For a tuple O[v], these indicators, denoted with indicators(O,v), contain the identifiers of the oldest tuples on input streams necessary to generate O[v]. We also call these indicators low watermarks. On any stream, indicator values are monotonically non-decreasing.

When a tuple enters a node, its indicators are initialized to its identifier: e.g., indicators(I,u) = {(I,u)}. Each operator uses the indicators of its input tuples to compute the indicators for its output tuples. When it is first set up, each operator o initializes a watermark variable w[I,S] for each node-wide input stream I that contributes to each input stream S of o: w[I,S] = 0. As it processes tuples, the operator updates each w[I,S] to hold the indicator of the oldest tuple currently in the state or, for stateless operators, the indicators of the last tuples processed. When it produces a tuple t, the operator iterates through all values and appends (I, min) to indicators(t), where min is the minimum of all w[I,·].

Some operators, such as union, have many input streams but only a few of them actually contribute to any single output tuple. These operators can reduce the number of indicators on output tuples by appending only indicators for input streams that actually affected the output tuple value. Thus, cause((O,v), I) refers to the indicator of O[v] that corresponds to stream I, or to the indicator of the last preceding tuple affected by I, if O[v] was not affected by I. Note that indicators are not sent to downstream nodes. More details about the use of indicators can be found in [11].

Figure 5 shows an example of managing input-tuple indicators. In Figure 5(a), the filter processes I1[900] and produces S[500]. Hence, indicators(S,500) = {(I1,900)}. In Figure 5(b), the union operator processes tuples S[500] and I2[257] to produce O[187] and O[188], respectively. Hence, indicators(O,187) = {(I1,900)} and indicators(O,188) = {(I2,257)}. Therefore, cause((O,188), I1) = (I1,900), cause((O,188), I2) = (I2,257), and cause((O,187), I1) = (I1,900). cause((O,187), I2) depends on the indicators of the tuples preceding O.

Recovery Type: Upstream backup restarts from an empty state, producing a repeating recovery for repeatable query networks, a convergent recovery for convergent-capable query networks, and a divergent recovery for all others. These guarantees are weaker than those of the standby approaches.

Recovery Time: The time, K, to receive the first tuple is the same as for passive standby, but the recovery node may re-process significantly more tuples. It must re-process (1) all tuples that contributed to the lost state, (2) a complete queue-trimming interval worth of tuples on average (due to the periodic transmission of both level-0 and level-1 acks), and (3) some extra tuples that account for the propagation delays of level-0 acks. The number of redundant tuples is the product of the number of tuples to reprocess (Q) and the query-

network selectivity, minus the number of tuples that remain as part of the query-network state.

Overhead: Upstream backup has the lowest bandwidth overhead because queue-trimming messages, which contain only the tuple identifiers for streams crossing node boundaries, are significantly smaller than the checkpoint messages used by the other approaches. The processing overhead is also small: operators keep track of the oldest tuple (and its indicators) on each of their input streams that contributes to their current states. Furthermore, we can reduce the spatial and computational overhead of managing indicators by processing them and appending them to tuples only occasionally. In general, the total overhead, as summarized in Table 3, is proportional to the number of operators and the number of paths, where a path is a dataflow connecting an input stream to an output stream.

5.3 Active Standby

Active standby is another variation on the process-pairs model. In contrast to passive standby, with active standby, each secondary node receives tuples from upstream and processes them in parallel with the primary. The secondary, however, does not send any output tuples downstream. It logs these tuples in its output queues instead.

The challenge of active standby lies in bounding the output queues on each secondary, while ensuring output preservation. Because the primary and secondary may have non-deterministic operators, they may have different tuples in their output queues. To identify duplicate output tuples, we add a second set of input-tuple indicators to each tuple. For a tuple O[v], this second set contains, for each input stream I, the identifier (I,u) of the most recent tuple that contributed to the production of O[v]. We call these identifiers high watermarks. A tuple at the secondary is duplicate if it has a lower-valued high watermark than a tuple at the primary. Indeed, this tuple results from processing the same or even older input tuples. Each secondary thus trims all logged output tuples that have a high watermark lower than the high watermarks of the tuples already received by downstream nodes. For high watermarks to be correct, we need to distinguish input-tuple indicators that travel on different paths through a node. We discuss these details further in [11].

Watermarks are never sent between upstream and downstream nodes, but they are sent between primary and secondary nodes, as illustrated in the following example. We use Figure 5 to illustrate active standby, but we assume indicators are high watermarks. When ACK(0,O,125) and ACK(0,O,123) arrive, node Na determines that O[123] is now acknowledged at level-0 by both downstream neighbors. Since tuple O[123] maps onto input tuples identified with (I1,200) and (I2,100), the set of identifiers {(I1,200),(I2,100)} is added to the queue-trimming message as the entry value for O. When the secondary receives the queue-trimming message, it discards tuples u (from the output queue corresponding to O) for which cause((O,u), I1) returns a tuple older than I1[200] and cause((O,u), I2) returns a tuple older than I2[100].

If the primary fails, the secondary takes over by sending the logged tuples to all downstream neighbors, and then con-

Approach          Q. network     bw overhead                      proc overhead   rec time
Passive standby   Deterministic  none                             negligible      none
                  Arbitrary      none                             negligible      none
Active standby    Deterministic  none                             negligible      r
                  Arbitrary      determinants                     determinants    r + f(log. freq.)
Upstream backup   Repeatable     f(k)·size(tuple id)/size(tuple)  negligible      none
                  Convergent     f(k)·size(tuple id)/size(tuple)  double          negligible
                  Arbitrary      determinants                     determinants    negligible

Table 4. Added overhead for precise recovery

tinuing its processing. When the failed node rejoins the system as the new secondary, it starts with an empty state and becomes up-to-date with respect to the new primary only after processing sufficiently many input tuples. Active standby guarantees rollback recovery since each secondary always receives what its primary receives (input preservation) and each secondary discards logged output tuples only when they become duplicate (output preservation).

Recovery Type: Because the secondary processes tuples in parallel with the primary, active standby provides repeating recovery for all deterministic query networks and divergent recovery for others.

Recovery Time: Because the standby continues processing during failure, it only needs to transmit all duplicate tuples in its output queue to reach a state equivalent to that of the primary. Recovery time is therefore negligible. The number of redundant tuples is on average M·λout + 2d·λout for each output stream. M determines the trimming interval for the secondary's output queues.

Overhead: Because all processing is replicated by the standby node, both proc overhead and bw overhead are approximately 100%. The overheads are actually somewhat higher due to the processing of input-tuple indicators and the transmission of queue-trimming messages. Table 3 summarizes these results.

6 Precise-Recovery Extensions

All recovery approaches can achieve precise recovery for convergent-capable query networks by eliminating duplicate tuples during convergence. It is also possible, though much more costly, to provide precise recovery for arbitrary networks. Table 4 summarizes the extra runtime overhead and recovery time required for precise recovery.

Passive Standby: Passive standby provides repeating recovery for deterministic query networks. To make recovery precise, before sending any output tuples, the failover node must ask downstream neighbors for the identifiers of the last tuples they received and then discard all tuples preceding the ones identified. These requests can be made while the recovery node regenerates the failed state, achieving precise recovery without additional overhead. For a non-deterministic query network, because the secondary may produce different duplicate output tuples when it takes over, the primary

can only forward checkpointed tuples downstream. This constraint causes bursty output while also increasing the end-to-end latency.

Active Standby: For a deterministic query network, active standby also makes recovery precise by asking downstream nodes for the identifiers of the latest tuples they received. The delay imposed by this request cannot be masked and thus extends the recovery time by r. For other query networks, we must ensure that both the primary and secondary follow the same execution. To do so, whenever a non-deterministic operator executes, the primary must collect all information necessary to replay the execution of the operator. The primary accumulates such information, called determinants [9]¹, in a log message. Determinants affect both bandwidth and processing overhead. The logging frequency affects (1) the recovery time, because non-deterministic operators on the secondary cannot execute until they obtain the appropriate determinants, and (2) the end-to-end delay, because the primary cannot send tuples downstream until the secondary receives all determinants involved in generating these tuples.

Upstream Backup: In repeatable query networks, operators produce output tuples by combining at most one tuple from each input stream. Input-tuple indicators therefore uniquely identify tuples and can serve for duplicate elimination, offering precise recovery with negligible extra processing overhead. For a convergent query network, the secondary must be able to remove duplicate output tuples during recovery. It achieves this by using the additional high watermarks, as discussed in Section 5.3. This approach thus doubles the processing overhead. For repeatable query networks, nodes forward low watermarks downstream, while for convergent-capable query networks, nodes forward high watermarks instead. In both cases, the extra bandwidth overhead is approximately f(k)·size(tuple id)/size(tuple), where f(k) is a function of the average number of input streams (at a node) that contribute to an output stream. As in active standby, upstream backup can provide precise recovery for more complex query networks by logging determinants from primary to secondary. Unlike active standby, these determinants are processed only when the secondary takes over. The details of the protocol are presented in [11].

7 Evaluation

We evaluate and compare the performance of each approach through simulations. Using CSIM [17], we built a detailed simulator of a DSPS. Table 5 summarizes the main simulation parameters. The parameter values were obtained from our prototype implementation, which currently supports all our recovery types for simple repeatable query networks. Each point shown in the figures is the average of 25 simulation runs, at least one simulated minute each. Because amnesia has no overhead and a zero recovery time, but provides

¹The representation of a determinant depends on the operator type. For example, the determinant for a random filter could be represented as a bit vector where each bit shows whether the corresponding tuple passed or was dropped. For a union operator, the determinant must include the exact interleaving of tuples.
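To make footnote 1 concrete, the determinant of a non-deterministic union is the exact interleaving of its inputs. The sketch below is a hypothetical Python illustration, not the paper's implementation: a primary logs the interleaving it happens to choose, and a secondary replays the log to reproduce an identical output.

```python
import random

def primary_union(s1, s2, log):
    """Merge two input streams in a non-deterministic order and
    record the interleaving (the 'determinant') in `log`."""
    i = j = 0
    out = []
    while i < len(s1) or j < len(s2):
        # Choose a source arbitrarily whenever both still have tuples.
        take_s1 = j >= len(s2) or (i < len(s1) and random.random() < 0.5)
        if take_s1:
            out.append(s1[i]); i += 1; log.append(0)
        else:
            out.append(s2[j]); j += 1; log.append(1)
    return out

def secondary_union(s1, s2, log):
    """Replay the primary's merge deterministically from its log."""
    i = j = 0
    out = []
    for src in log:
        if src == 0:
            out.append(s1[i]); i += 1
        else:
            out.append(s2[j]); j += 1
    return out
```

Whatever interleaving the primary happens to choose, the secondary's replayed output matches it tuple for tuple, which is exactly the property precise recovery requires of non-deterministic operators.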

[Figure 6 plots recovery time (ms) vs. bandwidth overhead for high availability (%), with curves for Passive Standby, Active Standby (rollback and precise), and Upstream Backup (rollback and precise).]

Parameter    Meaning                                              Default
λ            input tuple arrival rate (tuples/s)                  1000
D            delay to detect the failure of a node (ms)           250
M            queue-trimming/checkpoint interval (ms)              50
r            time to redirect input streams (ms)                  40
Tuple        size of a tuple and of a tuple id (bytes)            50, 8
Network      bandwidth (Mbps) and delay (ms)                      16, 5
Proc. cost   avg. processing time per input tuple (μs):
               Filter                                             10
               Aggregate: (proc. cost of Filter) × Window/Advance 100
Selectivity  expected value of (# of output tuples emitted) /
             (# of input tuples consumed)                         0.1

Table 5. Simulation parameters and their default values

Figure 6. Recovery time and runtime overhead for rollback and precise recovery as the communication interval varies from 25 ms to 200 ms (indicated by the arrows)

only gap recovery, we focus our evaluation on the other three approaches.

We first examine the overhead and recovery performance of each approach for rollback recovery and a convergent-capable query network (Section 7.1). We then evaluate the added overhead of achieving precise recovery (Section 7.2) and examine the effect of query-network types and other query-network properties on the performance of each approach (Section 7.3). We finally examine how performance changes as a function of query-network size (Section 7.4).

For the overhead measurements, we only present bandwidth overhead because processing overhead poses similar tradeoffs while being more difficult to reproduce and evaluate accurately in simulations. We refer the reader to Sections 4 through 6 for a detailed discussion of processing overheads.

7.1 Runtime Overhead vs. Recovery Time

To examine the runtime overhead and recovery time tradeoffs for rollback recovery and a convergent-capable query network, we simulate an aggregate with a 100 ms window and a 10 ms advance (this aggregate consumes 10% of a node's processing capacity) and default values for the other parameters. The only tunable parameter for each approach is the communication interval, which is the queue-trimming interval for upstream backup and active standby and the checkpointing interval for passive standby. Figure 6 shows the relation between recovery time and bandwidth overhead as the communication interval varies from 25 to 50, 100, 150, and 200 ms.
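The simulated operator is an ordinary sliding-window aggregate. A minimal sketch follows (hypothetical code, counting window and advance in tuples rather than milliseconds; at the default rate of 1000 tuples/s, 100 tuples span roughly 100 ms and 10 tuples roughly 10 ms):

```python
def sliding_window_sums(stream, window=100, advance=10):
    """Emit one aggregate per `advance` input tuples, each
    summarizing the last `window` inputs, once the window has filled."""
    out, buf, since_emit = [], [], 0
    for t in stream:
        buf.append(t)
        if len(buf) > window:
            buf.pop(0)          # slide the window forward
        since_emit += 1
        if len(buf) == window and since_emit >= advance:
            out.append(sum(buf))
            since_emit = 0
    return out
```

On 1000 input tuples this emits 91 outputs, i.e., a selectivity close to the 0.1 default of Table 5.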

Looking at the overhead, upstream backup is the clear winner with an overhead close to zero. Even with a 25 ms communication interval, the node transmits only one 8-byte tuple identifier for every 25 tuples it receives, yielding an overhead of 0.64%. Upstream backup, however, has the slowest recovery as it must recreate the complete state of the failed query network. Upstream backup's recovery time is also the most sensitive to the duration of the communication interval. Frequent trimming reduces recovery time for a negligible added overhead until the size of the query network and the time to redirect the input streams (r is 40 ms in our prototype) eventually limit the recovery speed. Recovery time is still relatively short compared with the 250 ms failure-detection delay.

Active standby has an overhead of at least 100% because the secondary receives all input tuples in parallel with the primary. Queue-trimming messages used to discard output tuples from the secondary make the overhead slightly exceed 100%. Active standby has a negligible recovery time, though. The secondary only needs to resend half a queue-trimming interval worth of duplicate tuples stored in its output queues.

Passive standby's recovery time is between that of the other approaches because the secondary already has a snapshot of the last checkpoint but must ask upstream nodes to redirect their output streams and must re-process on average half a checkpoint worth of tuples. Passive standby's overhead varies significantly with the communication interval as each checkpoint message contains an update of the query-network state. When operators have a selectivity of less than 1.0, increasing the interval between checkpoints also increases the number of tuples processed and dropped without being checkpointed. The knee at 100 ms corresponds to the 100 ms window size. The curve would be smoother for a larger query network.

7.2 Cost of Precise Recovery

Figure 6 also presents the recovery time and runtime overhead of precise recovery. For passive standby and active standby, precise recovery of convergent-capable query networks adds no runtime overhead compared with rollback recovery. Precise recovery increases the runtime overhead of upstream backup by a little over 16% (equal to k·size(tuple id)/size(tuple), with k = 1 and size(tuple id)/size(tuple) = 8/50 = 0.16) because watermarks are now sent downstream. The overhead thus remains much lower than that of the process-pair based approaches.

For upstream backup and passive standby, the precise recovery time is almost the same as the rollback recovery time. Upstream backup must now process additional offset indicators, but this adds negligible delay. For all approaches, recovery nodes must now ask downstream neighbors for the latest tuples they received. For upstream backup and passive standby, this communication proceeds in parallel with tuple re-processing (or input-stream redirection). Active standby cannot mask this delay, and recovery extends by the constant value r (40 ms in our prototype). Overall, all approaches can offer precise recovery for convergent-capable query networks at a negligible incremental cost.
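The 0.64% and 16% figures quoted above follow directly from the Table 5 defaults; a quick back-of-the-envelope check:

```python
# Table 5 defaults: 50-byte tuples, 8-byte tuple ids, 1000 tuples/s.
tuple_size, id_size = 50, 8        # bytes
rate = 1000                        # tuples/s
interval = 0.025                   # 25 ms queue-trimming interval

# Upstream backup sends one tuple id per interval, i.e. 8 bytes for
# every 25 tuples (25 ms x 1000 tuples/s) of 50 bytes each.
tuples_per_interval = rate * interval
ub_overhead = id_size / (tuples_per_interval * tuple_size)
print(f"{ub_overhead:.2%}")        # prints 0.64%

# Precise recovery additionally ships k watermarks per output tuple.
k = 1
precise_extra = k * id_size / tuple_size
print(f"{precise_extra:.0%}")      # prints 16%
```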

Query network type   Result            Upstream Backup   Active Standby   Passive Standby
Repeatable           Bw overhead (%)   0.64              100.96           101.27
                     Rec. time (ms)    47.62             1.80             45.88
Convergent-capable   Bw overhead (%)   0.64              100.96           111.55
                     Rec. time (ms)    69.86             0.07             48.88
Non-deterministic    Bw overhead (%)   1.28              101.91           101.90
                     Rec. time (ms)    50.92             1.82             47.24

Table 6. Effects of query-network type

7.3 Effects of Query-Network Type

We now examine the effects of query-network types on the basic performance of rollback recovery. Table 6 summarizes the recovery time and bandwidth overhead of each approach when the query network consists of a repeatable filter with selectivity 1.0, our default convergent-capable aggregate, and a non-deterministic union operator that merges two streams (500 tuples/s each) into one. Interestingly, the results show that neither the overheads nor the recovery times of the approaches are affected by the query-network type.

Upstream backup and active standby use queue-trimming messages. Their overheads thus depend on the relative rates of these messages and tuples on input streams rather than on any other property of the query network. In Table 6, the union has a slightly higher overhead with these approaches because it has two input streams at half the rate each. The overhead of passive standby is proportional to the size of the checkpoint messages, which does not depend on the type of the query network but on the magnitude of changes in query-network state between two checkpoints. Because the aggregate has the greatest differences in state between checkpoints, its overhead is highest with passive standby.

Active standby recovers by retransmitting output tuples. In Table 6, the output rate is ten times lower for the aggregate because of the 10 ms advance, resulting in a faster recovery for that operator. The other two approaches recover by re-processing tuples. Passive standby re-processes half a checkpoint worth of tuples on average. Its recovery performance is thus independent of the type of the query network but rather depends on processing complexity (during recovery, tuples are re-processed at the maximum rate). Upstream backup's recovery also depends on processing complexity. There is, however, a second parameter. The number of tuples that upstream backup must re-process depends on the size of the query-network state. For these reasons, the aggregate has the longest recovery time with these approaches, especially with upstream backup. For passive standby, the increase is negligible compared with the stream-redirection delay.

Hence, for rollback recovery, the query-network type does not affect recovery time or runtime overhead. Rather, the size of the query-network state and the rate and magnitude of the state changes affect the recovery time of upstream backup, and the overhead and, somewhat, the recovery time of passive standby.
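One way to see why passive standby's cost tracks the rate and magnitude of state changes rather than the absolute size of the state is an incremental (delta) checkpoint: only entries that changed since the last checkpoint cross the network. The following is a hypothetical sketch, not the paper's protocol:

```python
def delta_checkpoint(prev_state, cur_state):
    """Compute the parts of an operator's state that changed since the
    previous checkpoint; only these need to be sent to the secondary."""
    changed = {k: v for k, v in cur_state.items()
               if prev_state.get(k) != v}
    removed = [k for k in prev_state if k not in cur_state]
    return changed, removed

def apply_checkpoint(backup_state, changed, removed):
    """Apply a delta checkpoint to the secondary's copy of the state."""
    backup_state.update(changed)
    for k in removed:
        backup_state.pop(k, None)
    return backup_state
```

The checkpoint message size is then proportional to how many state entries were touched during the interval, which is what Tables 7 and 8 vary.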

7.3.1 Size of Query-Network State

We examine the effects of increasing the size of the query-network state by simulating the failure and recovery of an

Window size (tuples)   100      200      300      400      500
PS overhead (%)        111.55   111.55   111.54   111.54   111.54
PS rec. time (ms)      48.9     51.7     54.6     60.0     63.9
UB rec. time (ms)      69.9     98.9     138.7    188.5    248.3

Table 7. Effects of query-network state size

Advance (tuples)       100      50       25       10       5
PS overhead (%)        102.6    103.6    105.6    111.6    121.5
PS rec. time (ms)      47.5     47.5     47.6     48.8     51.6
UB rec. time (ms)      62.6     61.4     61.3     69.9     83.8

Table 8. Effects of rate of query-network state change

aggregate operator with an increasing window size (100 to 500 tuples), but a constant 10-tuple advance. Table 7 shows the resulting passive standby (PS) overhead and both passive standby and upstream backup (UB) recovery times.

Increasing the size of the query-network state does not necessarily increase the rate at which that state changes. In this experiment, the overhead of passive standby remains constant at 112%. The recovery time of passive standby due to reprocessing tuples (the part in excess of 40 ms) increases by about a factor of three when the size of the state quintuples. This increase is due to the heavier per-tuple processing cost, which results from computing aggregate values over larger numbers of tuples. The increase in recovery time is more pronounced for upstream backup. The time spent reprocessing tuples increases roughly linearly with the size of the state. Upstream backup must indeed reprocess a number of tuples directly proportional to the size of the query-network state.

7.3.2 Rate of Query-Network State Change

We examine the impact of increasing the rate at which the state of a query network changes using an aggregate operator with a window advance decreasing from 100 ms to 5 ms, and thus a selectivity increasing from 0.01 to 0.2. Table 8 shows the impact of this increase in query-network state-update rate on the overhead of passive standby and the recovery times of both passive standby and upstream backup.

As expected, the overhead of passive standby increases with the magnitude of changes in query-network state. The advance determines the number of tuples that the operator produces during a checkpoint interval. This number increases from 1 to 20 as the advance decreases from 100 to 5 ms. The increased per-input-tuple processing cost due to a smaller advance slightly prolongs recovery for passive standby (visible for an advance of 10 tuples or less).

We might expect the same effect to cause a slight increase in the recovery time of upstream backup. We measure a decrease instead. Upstream backup periodically updates the identifiers of the oldest tuples on each input stream that contribute to the current query-network state. When the state changes more rapidly, the older tuples are discarded faster and recovery restarts from a later point. This in turn results in a faster recovery. For a small enough advance, however, the added processing cost dominates recovery time. As the advance reaches 10 ms, the recovery time starts increasing.

In summary, for rollback recovery, the size of the query-


[Figure 7 plots recovery time (ms) vs. bandwidth overhead for high availability (%), with curves for Passive Standby, Active Standby (rollback and precise), and Upstream Backup (rollback and precise).]

Additionally, both in our prototype and simulator, we make the first nodes in the system adopt the passive-standby model since the other approaches impose extra requirements on stream sources. Active standby requires that each source sends the


Figure 7. Effects of the number of operators. The arrows indicate the directions of the trends

network state increases upstream backup's recovery time, while the rate and magnitude at which that state changes impacts the runtime overhead of passive standby.

7.4 Effect of Network Size

Increasing the size and complexity of the query network translates into increasing the size of the query-network state, the rate at which this state changes, and the processing complexity. As an example, Figure 7 shows the performance of each approach for a chain of 1 to 5 aggregate operators (with the parameter values from Table 5). Other configurations yield similar results.

As expected, increasing the number of operators increases the overhead of passive standby because the number of tuples that are produced inside or at the output of the query network increases. Larger query networks also slightly increase recovery time for passive standby because the processing complexity of each tuple increases. The recovery time of upstream backup increases rapidly as the state of the query network increases with each extra aggregate. It reaches 170 ms for 5 operators, which is still relatively short compared with the 250 ms failure-detection delay. Interestingly, even with a larger query network, upstream backup still provides precise recovery at a fraction of the cost of the other approaches.

7.5 Discussion

The results show that each approach poses a clear tradeoff between recovery time and processing overhead. Active standby, with its high overhead and negligible recovery time, appears particularly well suited for systems where quick recovery justifies high runtime costs (e.g., financial services, military applications).

Passive standby does not seem well suited to stream-processing systems as its performance is worse than that of active standby for both recovery time and runtime overhead. Passive standby, however, is the only approach that easily provides precise recovery for arbitrary query networks. It is thus best suited for applications, such as patient monitoring and other medical applications, that impose a somewhat lower load on the system but necessitate precise recovery.

stream to two different locations, and upstream backup requires that each source logs the tuples it produces.

Upstream backup provides precise recovery for most query networks with the lowest runtime overhead, but at the cost of a longer recovery. The recovery time of this approach, however, can be significantly reduced by distributing the recovery load over multiple nodes. In general, upstream backup is appropriate when short recovery delays are tolerable and is thus particularly suitable for sensor-based environment and infrastructure monitoring applications. In contrast to process-pair approaches, recovery nodes can be chosen among live nodes, allowing all servers to process data streams at runtime.

8 Related Work

Reliability through redundant processing, checkpointing, and logging has been widely studied in the context of traditional applications [9]. Recently, there has been much work on data-stream processing (e.g., Aurora [1, 5], STREAM [18], TelegraphCQ [6]), including proposals for distributed engines [7, 22]. In this paper, we investigate how to achieve high availability in these new systems.

The process-pairs model is adopted by many existing DBMSs [8, 19, 21, 20]. Oracle 10g/Data Guard [19] is one such facility built on top of Oracle Streams [20]. The latter enables cross-database event propagation and trigger-rule-based processing of event streams. Data Guard supports three recovery modes: maximum protection (MPR), availability (MAV), and performance (MPE). MPR synchronously applies the same update to multiple machines as part of the same transaction, providing precise recovery. MPE asynchronously transmits redo logs to the standby, providing gap recovery only. MAV switches between MPR and MPE based on the accessibility of the standby. Our approaches provide precise recovery at a lower overhead because checkpoints are asynchronous, and they also offer rollback recovery.

Commercial workflow systems [13] also rely on redundant components to achieve high availability. A variation of the process-pairs approach is used in the Exotica workflow system [14]. Instead of backing up process states, Exotica logs changes to the workflow components, which store inter-process messages. This approach is similar to upstream backup in that the system state can be recovered by reprocessing the component backups. Unlike upstream backup, however, this approach does not take advantage of the dataflow nature of processing, and therefore has to explicitly back up the components at remote servers.

The DR scheme [15], which efficiently resumes failed warehouse loads, is also similar to upstream backup. Instead of offset indicators, DR uses output tuples and properties of operators to compute, during recovery, the trimming bounds on input streams. In contrast to DR, our scheme supports infinite inputs by trimming output queues at runtime. We also support failure recovery at the granularity of nodes instead of

the whole system. We do not require that input streams have any property such as order on some attribute.

In parallel processing systems, router nodes distribute incoming messages across a set of parallel servers [12, 22]. If a server fails, the router redirects incoming messages to other nodes. These approaches address how to select failover nodes and re-route messages to them, whereas we focus on replicating and recovering state. In MQSeries [12], messages that are being processed by a server when the failure happens are trapped until the server recovers. Flux [23] introduces a technique similar to our active-standby method. It tries to accomplish loss-free and duplication-free failure/recovery semantics by exploiting sequence numbers assigned to tuples. It currently only considers order-preserving or set-preserving operators, though, and thus cannot support the convergent-capable and divergent queries discussed in this paper.

9 Conclusion

In this paper, we argue that the distributed and dataflow nature of stream-processing applications raises novel challenges and opportunities for high availability. We define three recovery types that provide increasingly stronger guarantees. We also define four classes of operators and query networks based on their impact on the cost of providing various recovery guarantees. Within this framework, we introduce three recovery approaches that provide the proposed guarantees with different combinations of redundant processing, checkpointing, and logging.

Using analysis and simulations, we quantitatively characterize the runtime overhead and recovery time tradeoffs among the approaches. We find that each approach covers a complementary portion of the solution space. Process-pair based approaches, especially active standby, provide the fastest recovery but at a high cost. Active standby is thus best suited for environments where fast failure recovery (i.e., minimal disruptions) justifies higher runtime costs. Passive standby is best suited to provide precise recovery for arbitrary query networks. In contrast, upstream backup has a significantly lower runtime overhead but a longer recovery time that depends mostly on the size of the query-network state. This approach is thus best suited for an environment where failures are infrequent and short recovery delays are tolerable.

We currently have a basic prototype implementation that can provide the proposed recovery types for repeatable query networks. We will extend our prototype to support arbitrary query networks and perform experiments on real deployments. We also plan to investigate how to simultaneously use different recovery approaches at nodes in a DSPS, and thus leverage the benefits of all schemes. We also plan to study network partitions, multiple failures, and the interaction between high availability and load balancing.

References

[1] D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, Sep. 2003.

[2] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of 2002 ACM PODS, June 2002.
[4] J. Barlett, J. Gray, and B. Horst. Fault tolerance in Tandem computer systems. Technical Report 86.2, Tandem Computers, Mar. 1986.
[5] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: A new class of data management applications. In Proc. of the 28th VLDB, Aug. 2002.
[6] S. Chandrasekaran, A. Deshpande, M. Franklin, and J. Hellerstein. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proc. of the 1st CIDR, Jan. 2003.
[7] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik. Scalable distributed stream processing. In Proc. of the 1st CIDR, 2003.
[8] E. Cialini and J. Macdonald. Creating hot snapshots and standby databases with IBM DB2 Universal Database(TM) V7.2 and EMC TimeFinder(TM). DB2 Information Management White Papers, Sept. 2001.
[9] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408, 2002.
[10] J. Gray. Why do computers stop and what can be done about it? Technical Report 85.7, Tandem Computers, 1985.
[11] J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. Technical Report CS-04-05, Department of Computer Science, Brown University, 2004.

[12] IBM Corporation. Getting the most out of MQSeries. White paper. http://www.bmc.com/resourcecenter/partners/mqseries/gettingthemostoutofmqseries.html, 2003.
[13] IBM Corporation. IBM WebSphere V5.0: Performance, scalability, and high availability: WebSphere Handbook Series. IBM Redbook, July 2003.
[14] M. Kamath, G. Alonso, R. Guenthor, and C. Mohan. Providing high availability in very large workflow management systems. In Proc. of the 5th Int. Conf. on Extending Database Technology, 1996.
[15] W. Labio, J. L. Wiener, H. Garcia-Molina, and V. Gorelik. Efficient resumption of interrupted warehouse loads. In Proc. of the 2000 ACM SIGMOD, May 2000.
[16] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proc. of the 18th ICDE, 2002.
[17] Mesquite Software, Inc. CSIM 18 user guide. http://www.mesquite.com.
[18] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In Proc. of the 1st CIDR, Jan. 2003.
[19] Oracle Inc. Oracle 10g high availability solutions. http://otn.oracle.com/deploy/availability.
[20] Oracle Inc. Oracle9i streams - online documentation. http://www.oracle.com.
[21] A. Ray. Oracle Data Guard: Ensuring disaster recovery for the enterprise. An Oracle white paper, Mar. 2002.
[22] M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. An adaptive partitioning operator for continuous query systems. Technical Report CS-02-1205, UC Berkeley, 2002.
[23] M. A. Shah, J. M. Hellerstein, and E. Brewer. Highly-available, fault-tolerant, parallel dataflows. In Proc. of the 2004 ACM SIGMOD, June 2004.