Astrolabe: A Robust and Scalable Technology For Distributed System Monitoring, Management, and Data Mining

Robbert van Renesse, Kenneth P. Birman, and Werner Vogels
Department of Computer Science, Cornell University, Ithaca, NY 14853
{rvr, ken, vogels}@cs.cornell.edu

Scalable management and self-organizational capabilities are emerging as central requirements for a generation of large-scale, highly dynamic, distributed applications. We have developed an entirely new distributed information management system called Astrolabe. Astrolabe collects large-scale system state, permitting rapid updates and providing on-the-fly attribute aggregation. This latter capability permits an application to locate a resource without knowing the machines on which it resides, and also offers a scalable way to track system state as it evolves over time. The combination of features makes it possible to solve a wide variety of management and self-configuration problems. The paper describes the design of the system with a focus upon its scalability. After describing the Astrolabe service, we present examples of the use of Astrolabe for locating resources, publish-subscribe, and distributed synchronization in large systems. Astrolabe is implemented using a peer-to-peer protocol, and uses a restricted form of mobile code based on the SQL query language for aggregation. This protocol gives rise to a novel consistency model. Astrolabe addresses several security considerations using a built-in PKI. The scalability of the system is evaluated using both simulation and experiments; these confirm that Astrolabe could scale to thousands and perhaps millions of nodes, with information propagation delays in the tens of seconds.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design – network communications; C.2.4 [Computer-Communication Networks]: Distributed Systems – distributed applications; D.1.3 [Programming Techniques]: Concurrent Programming – distributed programming; D.4.4 [Operating Systems]: Communications Management – network communication; D.4.5 [Operating Systems]: Reliability – fault tolerance; D.4.6 [Operating Systems]: Security and Protection – authentication; D.4.7 [Operating Systems]: Organization and Design – distributed systems; H.3.3 [Information Systems]: Information Search and Retrieval – information filtering; H.3.4 [Information Systems]: Information Storage and Retrieval – distributed systems; H.3.5 [Information Systems]: Online Information Services – data sharing.

This research was funded in part by DARPA/AFRL-IFGA grant F30602-99-1-0532, in part by a grant under NASA’s REE program administered by JPL, and in part by NSF-CISE grant 9703470. The authors are also grateful for support from the AFRL/Cornell Information Assurance Institute.



General Terms: Algorithms, monitoring, management, performance, reliability, security.

Additional Key Words and Phrases: membership, failure detection, epidemic protocols, gossip, aggregation, scalability, publish-subscribe.

1 Introduction

With the wide availability of low-cost computers and pervasive network connectivity, many organizations are facing the need to manage large collections of distributed resources. These might include personal workstations, dedicated nodes in a distributed application such as a web farm, or objects stored at these computers. The computers may be co-located in a room, spread across a building or campus, or even scattered around the world. Configurations of these systems change rapidly: failures and changes in connectivity are the norm, and significant adaptation may be required if the application is to maintain desired levels of service.

To a growing degree, applications are expected to be self-configuring and self-managing, and as the range of permissible configurations grows, this is becoming an enormously complex undertaking. Indeed, the management subsystem for a contemporary distributed system (e.g., a Web Services system reporting data collected from a set of corporate databases, file systems, and other resources) is often more complex than the application itself. Yet the technology options for building management mechanisms have lagged. Current solutions, such as cluster management systems, directory services, and event notification services, either do not scale adequately or are designed for relatively static settings (see Section 8).

At the time of this writing, the most widely used scalable distributed management system is DNS [18]. DNS is a directory service that organizes machines into domains, and associates attributes (called resource records) with each domain. Although designed primarily to map domain names to IP addresses and mail servers, DNS has been extended in a variety of ways to make it more dynamic and support a wider variety of applications. These extensions include Round-Robin DNS (RFC 1794) to support load balancing, the Server record (RFC 2782) to support service location, and the Naming Authority Pointer (RFC 2915) for Uniform Resource Names. To date, however, acceptance of these new mechanisms has been limited.

In this paper, we describe a new information management service called Astrolabe. Astrolabe monitors the dynamically changing state of a collection of distributed resources, reporting summaries of this information to its users. Like DNS, Astrolabe organizes the resources into a hierarchy of domains, which we call zones to avoid confusion, and associates attributes with each zone. Unlike DNS, zones are not bound to specific servers, the attributes may be highly dynamic, and updates propagate quickly; typically, in tens of seconds.

Astrolabe continuously computes summaries of the data in the system using on-the-fly aggregation. The aggregation mechanism is controlled by SQL queries, and can be understood as a type of data mining capability. For example, Astrolabe aggregation can be used to monitor the status of a set of servers scattered within the network, to locate a desired resource on the basis of its attribute values, or to compute a summary description of loads on critical network components. As this information changes, Astrolabe will automatically and rapidly recompute the associated aggregates and report the changes to applications that have registered their interest.

Aggregation is intended as a summarizing mechanism.¹ For example, an aggregate could count the number of nodes satisfying some property, but could not concatenate their names into a list, since that list could be of unbounded size. The approach is intended to bound the rate of information flow at each participating node, so that even under worst-case conditions, it will be independent of system size. To this end, each aggregate is restricted to some scope, within which it is computed on behalf of and visible to all nodes. Only aggregates with high global value should have global scope. The number of aggregating queries active within any given scope is assumed to be reasonably small, and independent of system size. To ensure that applications do not accidentally violate these policies, nodes seeking to introduce a new aggregating function must have administrative rights within the scope where it will be computed.

Initial experience with the Astrolabe aggregation mechanisms demonstrates that the system is extremely powerful despite its limits. Managers of an application might use the technology to monitor and control a distributed application using aggregates that summarize the overall state within the network as a whole, and also within the domains (scopes) of which it is composed. A new machine joining a system could use Astrolabe to discover information about resources available in the vicinity: by exploiting the scoping mechanisms of the aggregation facility, the resources reported within a domain will be those of most likely value to applications joining within that region of the network. After a failure, Astrolabe can be used both for notification and to coordinate reconfiguration. More broadly, any form of loosely coupled application could use Astrolabe as a platform for coordinating distributed tasks. Indeed, Astrolabe uses its own capabilities for self-management.

It may sound as if designing an aggregate to be sufficiently concise and yet to have high value to applications is something of an art. Yet the problem turns out to be relatively straightforward and not unlike the design of a hierarchical database. A relatively small number of aggregating mechanisms suffice to deal with a wide variety of potential needs. Indeed, experience supports the hypothesis that the forms of information needed for large-scale management, configuration and control are generally amenable to a compact representation.

Astrolabe maintains excellent responsiveness even as the system becomes very large, and even if it exhibits significant dynamism. The loads associated with the technology are small and bounded, both at the level of message rates seen by participating machines and loads imposed on communication links. Astrolabe also has a small “footprint” in the sense of computational and storage overheads.

The Astrolabe system looks to a user much like a database, although it is a virtual database that does not reside on a centralized server, and does not support atomic transactions. This database presentation extends to several aspects. Most importantly, each zone can be viewed as a relational table containing the attributes of its child zones, which in turn can be queried using SQL. Also, using database integration mechanisms like ODBC [24] and JDBC [22], standard database programming tools can access and manipulate the data available through Astrolabe.

¹ Aggregation is a complex topic. We have only just begun to explore the power of Astrolabe’s existing mechanisms, and have also considered several possible extensions. This paper limits itself to the mechanisms implemented in the current version of Astrolabe and focuses on what we believe will be common ways of using them.


The design of Astrolabe reflects four principles:

1. Scalability through hierarchy: A scalable system is one that maintains constant, or slowly degrading, overheads and performance as its size increases. Astrolabe achieves scalability through its zone hierarchy. Given bounds on the size and amount of information in a zone, the computational and communication costs of Astrolabe are also bounded. Information in zones is summarized before being exchanged between zones, keeping wide-area storage and communication requirements at a manageable level.

2. Flexibility through mobile code: Different applications monitor different data, and a single application may need different data at different times. A restricted form of mobile code, in the form of SQL aggregation queries, allows users to customize Astrolabe by installing new aggregation functions on the fly.

3. Robustness through a randomized peer-to-peer protocol: Systems based on centralized servers are vulnerable to failures, attacks, and mismanagement. Instead, Astrolabe uses a peer-to-peer approach by running a process on each host.² These processes communicate through an epidemic protocol. Such protocols are highly tolerant of many failure scenarios, easy to deploy, and efficient. They communicate using randomized point-to-point message exchange, an approach that makes Astrolabe robust even in the face of localized overloads, which may briefly shut down regions of the Internet.

4. Security through certificates: Astrolabe uses digital signatures to identify and reject potentially corrupted data and to control access to potentially costly operations. Zone information, update messages, configuration, and client credentials are all encapsulated in signed certificates. The zone tree itself forms the Public Key Infrastructure.

This paper discusses each of these principles. In Section 2, we present an overview of the Astrolabe service. Section 3 illustrates the use of Astrolabe in a number of applications. Astrolabe’s implementation is described in Section 4. We describe Astrolabe’s security mechanisms in Section 5. In Section 6, we explain how Astrolabe leverages mobile code, while Section 7 describes performance and scalability. Here, we show that Astrolabe could scale to thousands and perhaps millions of nodes, with information propagation delays in the tens of seconds. Section 8 describes various related work, from which Astrolabe borrows heavily. Section 9 concludes.

2 Astrolabe Overview

Astrolabe gathers, disseminates and aggregates information about zones. A zone is recursively defined to be either a host or a set of non-overlapping zones. Zones are said to be non-overlapping if they do not have any hosts in common. Thus, the structure of Astrolabe’s zones can be viewed as a tree. The leaves of this tree represent the hosts (see Figure 1).

² Processes on hosts that do not run an Astrolabe process can still access the Astrolabe service using an RPC protocol to any remote Astrolabe process.


Figure 1: An example of a three-level Astrolabe zone tree. The top-level root zone has three child zones. Each zone, including the leaf zones (the hosts), has an attribute list (called a MIB). Each host runs an Astrolabe agent.

Each zone (except the root) has a local zone identifier, a string name unique within the parent zone. A zone is globally identified by its zone name, which is its path of zone identifiers from the root, separated by slashes (e.g., “/USA/Cornell/pc3”).

Each host runs an Astrolabe agent. The zone hierarchy is implicitly specified when the system administrator initializes these agents with their names. For example, the “/USA/Cornell/pc3” agent creates the “/”, “/USA”, and “/USA/Cornell” zones if they did not exist already. Thus the zone hierarchy is formed in a decentralized manner, but one ultimately determined by system administrators. As we will see, representatives from within the set of agents are elected to take responsibility for running the gossip protocols that maintain these internal zones; if they fail or become unsuitable, the protocol will automatically elect others to take their places. Associated with each zone is an attribute list which contains the information associated with the zone. Borrowing terminology from SNMP [26], this attribute list is best understood as a form of Management Information Base or MIB, although the information is certainly not limited to traditional management information.
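
The way an agent's name implies its ancestor zones can be sketched in a few lines of Python. This is only an illustration of the naming rule described above; the function name `ancestor_zones` is hypothetical and is not part of the Astrolabe library.

```python
def ancestor_zones(zone_name):
    """Return the ancestor zone names implied by an absolute zone name.

    Hypothetical helper: it only illustrates the slash-separated naming rule.
    """
    if not zone_name.startswith("/"):
        raise ValueError("zone names are absolute paths starting with '/'")
    parts = [p for p in zone_name.split("/") if p]
    if not parts:
        return []  # the root zone has no ancestors
    # The root is "/"; each further ancestor appends one zone identifier.
    ancestors = ["/"]
    for depth in range(1, len(parts)):
        ancestors.append("/" + "/".join(parts[:depth]))
    return ancestors


print(ancestor_zones("/USA/Cornell/pc3"))  # ['/', '/USA', '/USA/Cornell']
```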

Unlike SNMP, the Astrolabe attributes are not directly writable, but generated by so-called aggregation functions. Each zone has a set of aggregation functions that calculate the attributes for the zone’s MIB. An aggregation function for a zone is an SQL program, which takes a list of the MIBs of the zone’s child zones and produces a summary of their attributes.
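
As a rough illustration of this idea, the sketch below models MIBs as Python dictionaries and an aggregation function as a plain function over the child MIBs. Astrolabe itself expresses such functions in SQL; the attribute names `load` and `nmembers` are assumptions chosen for the example.

```python
from statistics import mean


def aggregate_zone(child_mibs):
    """Summarize a list of child MIBs into one fixed-size parent MIB.

    Sketch only: 'load' and 'nmembers' are hypothetical attributes.
    A host MIB without 'nmembers' counts as one member.
    """
    loads = [m["load"] for m in child_mibs if "load" in m]
    return {
        "nmembers": sum(m.get("nmembers", 1) for m in child_mibs),
        # Unweighted mean of the children's values, whatever their size.
        "load": mean(loads) if loads else None,
    }


cornell = aggregate_zone([{"load": 0.5}, {"load": 1.5}])
usa = aggregate_zone([cornell, {"load": 3.0, "nmembers": 4}])
print(cornell)  # {'nmembers': 2, 'load': 1.0}
print(usa)      # {'nmembers': 6, 'load': 2.0}
```

The output of each level has the same bounded shape as its inputs, so it can be fed to the next level up without growing with the number of hosts.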

Leaf zones form an exception. Each leaf zone has a set of virtual child zones. The virtual child zones are local to the corresponding agent. The attributes of these virtual zones are writable, rather than being generated by aggregation functions. Each leaf zone has at least one virtual child zone called “system”, but the agent allows new virtual child zones to be created. For example, an application on a host may create a virtual child zone called “SNMP” and populate it with attributes from the SNMP MIB. The application would then be responsible for updating Astrolabe’s MIB whenever the SNMP attributes change.


Astrolabe is designed under the assumption that MIBs will be relatively small objects: a few hundred or even thousand bytes, not millions. An application dealing with larger objects would not include the object itself in the MIB, but would instead export information about the object, such as a URL for downloading a copy, a time stamp, a version number, or a content summary. In our examples we will treat individual computers as the owners of leaf zones, and will assume that the machine has a reasonably small amount of state information to report.

Astrolabe can also support systems in which individual objects are the leaf zones, and hence could be used to track the states of very large numbers of files, database records, or other objects. Using Astrolabe’s aggregation functions, one could then query the states of these objects. However, keep in mind that aggregation functions summarize: their outputs must be bounded in size. Thus, one might design an aggregation function to count all pictures containing images of a very tall, thin, bearded man, or even to list the three pictures with the strongest such match. One could not, however, use aggregation to make a list of all such pictures. In fact, it is easy to enumerate the nodes that contribute to a counting aggregate, but this would be done by the application, not the aggregation function.

We can now describe the mechanism whereby aggregation functions are used to construct the MIB of a zone.

Aggregation functions are programmable. The code of these functions is embedded in so-called aggregation function certificates (AFCs), which are signed and time-stamped certificates that are installed as attributes inside MIBs. The names of such attributes are required to start with the reserved character ’&’.

For each zone it is in, the Astrolabe agent at each host scans the MIBs of its child zones looking for such attributes. If it finds more than one by the same name, but of different values, it selects the most recent one for evaluation. Each aggregation function is then evaluated to produce the MIB of the zone. The agent learns about the MIBs of other zones through the gossip protocol described in Section 4.1.
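
The selection rule in this paragraph (scan the child MIBs for attributes whose names start with ’&’, and keep the most recent certificate per name) might be sketched as follows. The `AFC` record and its fields are hypothetical stand-ins for the signed certificates; signature checking is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AFC:
    """Hypothetical stand-in for an aggregation function certificate."""
    name: str      # attribute name; must start with the reserved '&'
    issued: float  # time stamp carried by the certificate
    code: str      # SQL text of the aggregation function


def select_afcs(child_mibs):
    """Pick, for each '&' attribute name, the most recently issued AFC."""
    newest = {}
    for mib in child_mibs:
        for attr, value in mib.items():
            if not attr.startswith("&"):
                continue  # ordinary attribute, not a certificate
            current = newest.get(attr)
            if current is None or value.issued > current.issued:
                newest[attr] = value
    return newest


old = AFC("&files", 100.0, "SELECT SUM(file_count) AS file_count")
new = AFC("&files", 200.0, "SELECT COUNT(*) AS file_count")
chosen = select_afcs([{"&files": old, "load": 0.3}, {"&files": new}])
print(chosen["&files"].issued)  # 200.0
```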

Thus, if one thinks of Astrolabe as a form of decentralized hierarchical database, there will be a table (a relation) for each zone. Each “column” in a leaf zone is a value extracted from the corresponding node or object. Each column in an inner zone is a value computed by an aggregating function to summarize its children. These columns might be very different from those of the children. For example, the child zones might report loads, numbers of files containing pictures of tall, thin, bearded men, etc. An inner zone could have one column giving the mean load on its children, another counting the total number of matching pictures reported by its children, and a third listing the three child nodes with the strongest matches. In the latter case we would probably also have a column giving the actual quality of those matches, so that further aggregation can be performed at higher levels of the zone hierarchy. However, this isn’t required: using Astrolabe’s scoping mechanism, we could search for those matching pictures only within a single zone, or within some corner of the overall tree, or within any other well-defined scope. In effect, any two leaf nodes sharing a common ancestral zone will agree on the layout of the MIB for that ancestral zone, but two sibling zones at the same level of the hierarchy could have different schemas and different kinds of data, that is, different columns.

In addition to code, AFCs may contain other information. Two important other uses of AFCs are information requests and run-time configuration. An Information Request AFC specifies what information the application wants to retrieve at each participating host, and how to aggregate this information in the zone hierarchy. Both are specified using SQL queries. A Configuration AFC specifies run-time parameters that applications may use for dynamic on-line configuration. We will present examples of these uses later in this paper.

Method                                   Description
find_contacts(time, scope)               search for Astrolabe agents in the given scope
set_contacts(addresses)                  specify addresses of initial agents to connect to
get_attributes(zone, event_queue)        report updates to attributes of zone
get_children(zone, event_queue)          report updates to zone membership
set_attribute(zone, attribute, value)    update the given attribute

Table 1: Application Programmer Interface.

Applications invoke Astrolabe interfaces through calls to a library (see Table 1). Initially, the library connects to an Astrolabe agent using TCP. The set of agents from which the library can choose is specified using set_contacts. Optionally, eligible agents can be found automatically using find_contacts. The time parameter specifies how long to search, while the scope parameter specifies how to search (e.g., using a broadcast request on the local network). (In the simplest case, an Astrolabe agent is run on every host, so that application processes can always connect to the agent on the local host and need not worry about the connection breaking.)

From then on, the library allows applications to peruse all the information in the Astrolabe tree, setting up connections to other agents as necessary. The creation and termination of connections is transparent to application processes, so the programmer can think of Astrolabe as one single service. Updates to the attributes of zones, as well as updates to the membership of zones, are posted on local event queues. Applications can also update the attributes of virtual zones using set_attribute.

Besides a native interface, the library has an SQL interface that allows applications to view each node in the zone tree as a relational database table, with a row for each child zone and a column for each attribute. The programmer can then simply invoke SQL operators to retrieve data from the tables. Using selection, join, and union operations, the programmer can create new views of the Astrolabe data that are independent of the physical hierarchy of the Astrolabe tree. An ODBC driver is available for this SQL interface, so that many existing database tools can use Astrolabe directly, and many databases can import data from Astrolabe. SQL does not support instant notifications of attribute changes, so applications that need such notifications would need to obtain them using the native interface.
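
To make the "zone as a table" view concrete, the sketch below uses Python's built-in sqlite3 module as a stand-in for the Astrolabe SQL interface. It does not talk to Astrolabe or its ODBC driver; the table name `cornell`, its columns, and the sample rows are assumptions made up for the illustration.

```python
import sqlite3


def build_zone_table():
    """Model one zone as a table: one row per child zone, one column per attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cornell (zone TEXT PRIMARY KEY, load REAL, file_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cornell VALUES (?, ?, ?)",
        [("pc1", 0.3, 0), ("pc2", 1.2, 1), ("pc3", 0.7, 2)],
    )
    return conn


def hosts_with_file(conn):
    """Select the child zones that report at least one copy, least loaded first."""
    query = "SELECT zone, load FROM cornell WHERE file_count > 0 ORDER BY load"
    return conn.execute(query).fetchall()


print(hosts_with_file(build_zone_table()))  # [('pc3', 0.7), ('pc2', 1.2)]
```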

Astrolabeagentsalsoact asweb servers,henceinformationcanbe browsedandchangedusingany standardwebbrowserinsteadof goingthroughthelibrary.

3 Examples

The foregoing overview describes the full feature set of the system, but may not convey the simplicity and elegance of the programming model it enables. The examples that follow are intended to illustrate the power and flexibility that Astrolabe brings to a distributed environment.


3.1 Example 1: Peer-to-peer Caching of Large Objects

Many distributed applications operate on one or a few large objects. It is often infeasible to keep these objects on one central server, and copy them across the Internet whenever a process needs a copy. The loading time would be much too long, and the load on the network too high. A solution is for processes to find a nearby existing copy in the Internet, and to “side-load” the object using a peer-to-peer copying protocol. In this section we look at the use of Astrolabe to locate nearby copies and to manage freshness.

Suppose we are trying to locate a copy of the file ‘game.db’. Assume each host has a database ‘files’ that contains one entry per file. (We have, in fact, written an ODBC driver that makes a host’s file system appear like a database.) To find out if a particular host has a copy of the file, we may execute the following SQL query on this host:

    SELECT COUNT(*) AS file_count
    FROM files
    WHERE name = 'game.db'

If file_count > 0, the host has at least one copy of the given file. There may be many hosts that have a copy of the file, so we also need to aggregate this information along with the location of the files. For the purpose of this example, assume that each host installs an attribute ’result’ containing its host name in its leaf MIB. (In practice, this attribute would be extracted using another query on, say, the registry of the host.) Then, the following aggregation query counts the number of copies in each zone, and picks one host from each zone that has the file. This host name is exported into the ’result’ attribute of each zone³:

    SELECT FIRST(1, result) AS result,
           SUM(file_count) AS file_count
    WHERE file_count > 0

We now simply combine both queries in an Information Request AFC (IR-AFC), and install it. As the IR-AFC propagates through the Astrolabe hierarchy, the necessary information is collected and aggregated.
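
The effect of the aggregation query can be emulated in plain Python to see what each zone ends up exporting. This is a sketch under stated assumptions: MIBs are modeled as dictionaries, FIRST(1, ...) is taken to mean the first matching row in input order, and an empty match yields a count of 0 (an SQL engine would typically return NULL for SUM over no rows).

```python
def aggregate_file_query(child_mibs):
    """Emulate: SELECT FIRST(1, result) AS result, SUM(file_count) AS file_count
    WHERE file_count > 0, over the MIBs of a zone's children (illustrative only)."""
    matching = [m for m in child_mibs if m.get("file_count", 0) > 0]
    if not matching:
        return {"result": None, "file_count": 0}
    return {
        "result": matching[0]["result"],                      # FIRST(1, result)
        "file_count": sum(m["file_count"] for m in matching),  # SUM(file_count)
    }


hosts = [
    {"result": "pc1", "file_count": 0},
    {"result": "pc2", "file_count": 1},
    {"result": "pc3", "file_count": 2},
]
cornell = aggregate_file_query(hosts)
print(cornell)  # {'result': 'pc2', 'file_count': 3}
```

Because the output has the same attributes as the input, the same function can be applied again one level up, which mirrors how a single query is evaluated at every zone of the hierarchy.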

This discussion may make it sound as if a typical application might install new aggregation queries with fairly broad, even global, scope. As noted earlier, this is not the case: such a pattern of use would violate the scalability of the system by creating zones (those near the root) with very large numbers of attributes. An aggregate such as the one just shown should either be of global importance and value, or limited to a smaller zone within which many programs need the result, or used with a short lifetime (to find the file but then “terminate”). But notice, too, that aggregation is intrinsically parallel. The aggregated value seen in a zone is a regional value computed from that zone’s children.

In our example, if searching for ‘game.db’ is a common need, each node doing so can use the aggregate to find a nearby copy, within the distance metric used to define the Astrolabe hierarchy. In effect, many applications can use the same aggregation to search for different files (albeit ones matching the same query). This is a somewhat counter-intuitive property of aggregates and makes Astrolabe far more powerful than would be the case if only the root aggregation value was of interest. Indeed, for many purposes, the root aggregation is almost a helper function, while the values at inner zones are more useful.

³ FIRST(n, attrs) is an Astrolabe SQL extension that returns the named attributes from the first n rows. It is not necessary to specify the “FROM” clause of the SELECT statement, as there is only one input table. “FROM” is necessary in nested statements, however.

In particular, to find the nearest copy of the file, a process would first inspect its most local zone for a MIB that lists a ’result’. If there is none, it would simply travel up the hierarchy until it finds a zone with a ’result’ attribute, or reaches the root zone. If the root zone’s ’result’ attribute is empty, there is no copy of the file.
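
The upward walk described here might look as follows. The sketch assumes the zone MIBs are already available locally as a dictionary keyed by zone name (in Astrolabe they would be obtained through the library); `find_nearest_copy` and the sample data are hypothetical.

```python
def find_nearest_copy(leaf_zone, zone_mibs):
    """Walk from a leaf zone toward the root and return the first 'result' found.

    zone_mibs maps zone names to MIB dictionaries (an assumption of this sketch).
    Returns None when even the root zone lists no copy.
    """
    zone = leaf_zone
    while True:
        result = zone_mibs.get(zone, {}).get("result")
        if result:
            return result
        if zone == "/":
            return None  # the root has no 'result': no copy exists
        # Strip the last zone identifier; an empty prefix means the root.
        zone = zone.rsplit("/", 1)[0] or "/"


mibs = {
    "/USA/Cornell/pc3": {},
    "/USA/Cornell": {},
    "/USA": {"result": "mit-7"},
    "/": {"result": "mit-7"},
}
print(find_nearest_copy("/USA/Cornell/pc3", mibs))  # mit-7
```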

In the example above, we did not care about the freshness of the copy retrieved. But now suppose each file maintains a version attribute as well. Now each zone can list the host that has the most recent version of the file as follows:

    SELECT result, version
    WHERE version ==
        (SELECT MAX(version))

Any process may determine what the most recent version number is by checking the ’version’ attribute of the root zone. A process wishing to obtain the latest version can simply go up the tree, starting in its leaf zone, until it finds the right version. A process that wants to download new versions as they become available simply has to monitor the root zone’s version number, and then find a nearby host that has a copy.
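
A small emulation of the version query shows why the root zone's ’version’ attribute ends up holding the newest version number in the system. As before, MIBs are modeled as dictionaries, and ties between children with the same version are broken by input order, which is an assumption of this sketch.

```python
def newest_version(child_mibs):
    """Emulate: SELECT result, version WHERE version == (SELECT MAX(version))."""
    candidates = [m for m in child_mibs if m.get("version") is not None]
    if not candidates:
        return {"result": None, "version": None}
    best = max(candidates, key=lambda m: m["version"])  # first maximum wins ties
    return {"result": best["result"], "version": best["version"]}


cornell = newest_version([
    {"result": "pc1", "version": 3},
    {"result": "pc2", "version": 5},
])
mit = newest_version([{"result": "mit-7", "version": 4}])
root = newest_version([cornell, mit])
print(root)  # {'result': 'pc2', 'version': 5}
```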

This example illustrates the use of Astrolabe for data mining and resource discovery. Before we move on to the next example, we will explore how the “raw” information may actually be collected.

ODBC is a standard mechanism for retrieving data from a variety of sources, including relational databases and spreadsheets from a large variety of vendors. The language for specifying which data is to be accessed is SQL. The data itself is represented using ODBC data structures.

We have developed another agent that can connect to ODBC data sources, retrieve information, and post this information in the Astrolabe zone tree. Each host runs both an Astrolabe and an ODBC agent. The ODBC agent inspects certificates in the Astrolabe tree whose names start with the prefix &odbc:. These certificates specify which data source to connect to, and what information to retrieve. The agent posts this information into the virtual system zone of the local Astrolabe agent. The &odbc: certificates may also specify how this information is to be aggregated by Astrolabe. Applications can update these certificates on the fly to change the data sources, the actual information that is retrieved, or how it is aggregated.

With this new agent, the host of an Astrolabe zone hierarchy is no longer a true leaf node. Instead, the host can be understood as a zone whose child nodes represent the objects (programs, files, etc.) that reside on the host. Astrolabe now becomes a representation capable of holding information about every object in a potentially world-wide network: millions of hosts and perhaps billions of objects.

Of course, no user would actually “see” this vast collection of information at any one place. Astrolabe is a completely decentralized technology and any given user sees only its parent zones, and those of its children. The power of dynamic aggregation is that even in a massive network, the parent zones can dynamically report on the status of these vast numbers of objects. Suppose, for example, that a virus enters the system, and is known to change a particular object. Merely by defining the appropriate AFC, a system administrator would be able to identify infected computers, even if their virus detection software has been disabled.

Figure 2: In this example of SelectCast, each zone elects two routers for forwarding messages, even though only one is used for forwarding messages. The sender, A, sends the message to a router of each child zone of the root zone.

Other agents for retrieving data are possible. We have written a Windows Registry agent, which retrieves information out of the Windows Registry (including a large collection of local performance information). A similar agent for Unix system information is also available.

3.2 Example 2: Peer-to-peer Data Diffusion

Many distributed games and other applications require a form of multicast that scales well, is fairly reliable, and does not put a TCP-unfriendly load on the Internet. In the face of slow participants, the multicast protocol’s flow control mechanism should not force the entire system to grind to a halt. This section describes SelectCast, a multicast facility we have built using Astrolabe. SelectCast uses Astrolabe for control, but sets up its own tree of TCP connections for actually transporting messages. A full-length paper is in preparation on the SelectCast subsystem; here, we limit ourselves to a high-level summary.

Each multicast group has a name, say “game”. The participants announce their interest in receiving messages for this group by installing their TCP/IP address in the attribute “game” of their leaf zone’s MIB. This attribute is aggregated using the query “SELECT FIRST(2, game) AS game”. That is, each zone selects two of its participants’ TCP/IP addresses (see Figure 2). We call these participants the routers for their zone. Two routers allow for fast recovery in case one fails. If both fail, recovery will also happen, as Astrolabe will automatically select new routers, but this mechanism is relatively slow.
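
The router election can be illustrated with a small sketch. It is not Astrolabe's SQL engine: the function name `first_k`, the list-based representation of the “game” attribute, and the example addresses are all assumptions made for illustration.

```python
from typing import List

def first_k(k: int, child_values: List[List[str]]) -> List[str]:
    """Mimic FIRST(k, attr): keep the first k distinct addresses found in the child zones."""
    selected: List[str] = []
    for addresses in child_values:      # one list per child zone (empty if no subscriber)
        for addr in addresses:
            if len(selected) == k:
                return selected
            if addr not in selected:
                selected.append(addr)
    return selected

# Hypothetical leaf zones: subscribers post their address, other hosts post nothing.
leaves = [["10.0.0.1:7000"], [], ["10.0.0.3:7000"], ["10.0.0.4:7000"]]
routers = first_k(2, leaves)            # the two routers elected for the parent zone
```

Applying the same function again to the router lists of sibling zones gives the routers of the next level up, which is how a single query yields routers for every zone in the tree.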


Participants exchange messages of the form (zone, data). A participant that wants to initiate a multicast lists the child zones of the root zone, and, for each child that has a non-empty “game” attribute, sends the message (child-zone, data) to a router for that child zone (more on this selection later). Each time a participant (router or not) receives a message (zone, data), the participant finds the child zones of the given zone that have non-empty “game” attributes and recursively continues the dissemination process.
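
The recursive forwarding rule can be sketched as follows. This is a single-process simulation under stated assumptions: the `CHILDREN` table is a hypothetical snapshot of the “game” attribute, and a recursive call stands in for sending the message over TCP to a router of the child zone.

```python
from typing import Dict, List

# Hypothetical snapshot: zone name -> {child zone name -> routers elected for that child}.
# Leaf zones (individual hosts) have no entry of their own.
CHILDREN: Dict[str, Dict[str, List[str]]] = {
    "/": {"/USA": ["pc1", "pc3"], "/Europe": [], "/Asia": ["pc9"]},
    "/USA": {"/USA/pc1": ["pc1"], "/USA/pc2": [], "/USA/pc3": ["pc3"]},
    "/Asia": {"/Asia/pc9": ["pc9"]},
}

def handle(zone: str, data: str, delivered: List[str]) -> None:
    """Process a (zone, data) message at whichever participant received it."""
    children = CHILDREN.get(zone)
    if children is None:                # leaf zone: the message has reached a participant
        delivered.append(zone)
        return
    for child, routers in children.items():
        if routers:                     # skip child zones with an empty "game" attribute
            # A real sender would transmit (child, data) to routers[0] over TCP.
            handle(child, data, delivered)

delivered: List[str] = []
handle("/", "hello", delivered)         # the initiator starts at the root zone
```

Zones without subscribers (here /Europe and /USA/pc2) are never contacted, so the traffic follows only the branches that lead to participants.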

This approach effectively constructs a tree of TCP connections spanning the set of participants. Each TCP connection is cached so long as the end-points remain active as participants, for re-use (perhaps even by a different query). The tree is updated automatically, when Astrolabe reports zone membership changes, by terminating unneeded TCP connections and creating new ones as appropriate.

To make sure that the dissemination latency does not suffer from slow routers or connections in the tree, some measures must be taken. First, each participant could post (in Astrolabe) the rate of messages that it is able to process. The aggregation query can then be updated as follows to select only the highest performing participants for routers.

    SELECT FIRST(2, game) AS game
    ORDER BY rate

Senders can also monitor their outgoing TCP pipes. If one fills up, they may want to try another participant for the corresponding zone. It is even possible to use all routers for a zone concurrently, thus constructing a “fat tree” for dissemination, but then care should be taken to drop duplicates and reconstruct the order of messages. We are currently investigating this option in our implementation. These mechanisms together effectively route messages around slow parts of the Internet, much as Resilient Overlay Networks [3] do for point-to-point traffic.

SelectCast does not provide an end-to-end acknowledgement mechanism, and thus there are scenarios in which messages may not arrive at all members (particularly if the set of receivers is dynamic). Additional reliability can be implemented on top of SelectCast. For example, Bimodal Multicast [5] combines a message logging facility with an epidemic protocol that identifies and recovers lost messages, thereby achieving end-to-end reliability (with high probability). Running Bimodal Multicast over SelectCast would also remove Bimodal Multicast’s dependence on IP multicast, which is poorly supported in the Internet, and provide Bimodal Multicast with a notion of locality, which may be used to improve the performance of message retransmission strategies.

Notice that each distinct SelectCast instance (each data distribution pattern) requires a separate aggregation function. As noted previously, Astrolabe can only support bounded numbers of aggregation functions at any level of its hierarchy. Thus, while a single Astrolabe instance could probably support as many as several hundred SelectCast instances, the number would not be unbounded. Our third example explores options for obtaining a more general form of Publish/Subscribe by layering additional filtering logic over a SelectCast instance. So doing has the potential of eliminating the restriction on numbers of simultaneously active queries, while still benefiting from the robustness and scalability of the basic SelectCast data distribution tree.

Although a detailed evaluation of SelectCast is outside of the scope of this paper, we have compared the performance of the system with that of other application-level router architectures and with IP multicast. We find that with steady multicast rates, SelectCast imposes message loads and latencies comparable to other application-level solutions, but that IP multicast achieves lower latencies and lower message loads (a benefit of being implemented in the hardware routers). Our solution has not yet been fully optimized but there is no reason that peak message forwarding rates should be lower than for other application-level solutions, since the critical path for SelectCast (when the tree is not changing) simply involves relaying messages received on an incoming TCP connection into some number of outgoing connections. There is an obvious tradeoff between fanout (hence, work done by the router) and depth of the forwarding tree (hence, latency), but this is not under our control since the topology of the tree is determined by the human administrator’s assignment of zone names.

3.3 Example 3: Publish-Subscribe

In Publish/Subscribe systems [19], receivers subscribe to certain topics of interest, and publishers post messages to topics. As just noted, while the SelectCast protocol introduced in the previous section supports a form of Publish/Subscribe (as well as the generalization that we termed selective multicast), the mechanism can only handle limited numbers of queries. Here, we explore a means of filtering within a SelectCast instance to obtain a form of subset delivery in which that limitation is largely eliminated.

In the extended protocol, each message is tagged with an SQL condition, chosen by the publisher of the message. Say that a publisher wants to send an update to all hosts having a version of some object less than 3.1. First she would install a SelectCast query that calculates the minimum version number of that object in each zone, and call it, say, “MIN(version)”. We call this the covering query. Next she would attach the condition “MIN(version) < 3.1” to the message. The message is then forwarded using SelectCast.

Recall that in SelectCast, the participants at each layer simply forward each message to the routers, which are calculated for each zone using the SelectCast query. In the extended protocol, the participants in the SelectCast protocol first apply the condition as a filter (using a WHERE clause added to the SelectCast query that calculates the set of routers), to decide to which routers to forward the message.⁴ Topic-based Publish/Subscribe can then be expressed by having the publisher specify that a message should be delivered to all subscribers to a particular topic. Our challenge is to efficiently evaluate this query without losing scalability. For example, while a new attribute could potentially be defined for each SQL condition in use by the system, doing so scales poorly if there are many conditions.

A solution that scales reasonably well uses a Bloom filter to compress what would otherwise be a single bit per query into a bit vector [8].⁵ This solution associates a fixed-size bit map with each covering query. Assume for the moment that our goal is simply to implement Publish/Subscribe to a potentially large number of topics. We define a hashing function on topics, mapping each topic to a bit value. The condition tagged to the message is “BITSET(HASH(topic))”, and the associated attribute can be aggregated using bitwise OR. In the case of hash collisions, this solution may lead to messages being routed to more destinations than strictly necessary, which is safe, but inefficient. Thus the size of the bitmap and the number of covering SelectCast queries should be adjusted, perhaps dynamically, so that the rate of collisions will be acceptably low.

⁴ The result of the query is cached for a limited amount of time (currently, 10 seconds), so that under high throughput the overhead can be amortized over many messages, assuming they often use the same condition or small set of conditions.

⁵ Bloom filters are also used in the directory service of the Ninja system [14].

Notice that the resulting protocol needs time to react when a condition is used for the first time, or when a new subscriber joins the system, since the aggregation mechanism will need time to update the Bloom filter. During the period before the filter has been updated in Astrolabe (a few tens of seconds), the new destination process might not receive messages intended for it. However, after this warmup period, reliability will be the same as for SelectCast, and performance limited only by the speed at which participants forward messages. As in the case of SelectCast, gossip-based recovery from message logs can be used to make the solution reliable.⁶
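
The topic bitmap scheme can be sketched as follows. The 64-bit bitmap size, the use of SHA-256 as the hash, and all function names are assumptions made for illustration; the text above does not specify them.

```python
import hashlib
from functools import reduce
from typing import Iterable

BITMAP_BITS = 64  # assumed bitmap size; larger bitmaps give fewer collisions

def topic_bit(topic: str) -> int:
    """Map a topic to a one-bit mask in the fixed-size bitmap (stable across hosts)."""
    digest = hashlib.sha256(topic.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % BITMAP_BITS)

def leaf_bitmap(subscriptions: Iterable[str]) -> int:
    """Bitmap a leaf zone would post: one bit set per subscribed topic."""
    return reduce(lambda acc, t: acc | topic_bit(t), subscriptions, 0)

def zone_bitmap(child_bitmaps: Iterable[int]) -> int:
    """Aggregate the child bitmaps with bitwise OR."""
    return reduce(lambda acc, b: acc | b, child_bitmaps, 0)

def should_forward(bitmap: int, topic: str) -> bool:
    """Stand-in for BITSET(HASH(topic)): false positives are possible, false negatives are not."""
    return bool(bitmap & topic_bit(topic))

zone = zone_bitmap([leaf_bitmap(["sports"]), leaf_bitmap(["weather", "news"]), leaf_bitmap([])])
```

A message for a topic that nobody below a zone subscribed to is usually filtered out, but a hash collision can make `should_forward` return True anyway, which corresponds to the safe but inefficient over-delivery described above.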

With a more elaborate filtering mechanism, this behavior could be extended. For example, the Siena system provides content-based Publish/Subscribe [10]. In Siena, subscribers specify information about the content of messages they are interested in, while publishers specify information about the content of messages they send. Subscribers’ specifications are aggregated in the internal routers of Siena, and then matched against the publishers’ specifications. By adding such aggregation functionality to Astrolabe’s SQL engine, we could extend the above solution to support expressions (rather than just matching a single topic at a time) or even full-fledged content addressing in the manner of Siena.

3.4 Example 4: Synchronization

Astrolabe may be used to run basic distributed programming facilities in a scalable manner. For example, barrier synchronization may be done by having a counter at each participating host, initially 0. Each time a host reaches the barrier, it increments the counter, and waits until the aggregate minimum of the counters equals the local counter.
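
The barrier rule can be illustrated with a minimal sketch. In Astrolabe the minimum would be computed by an aggregation function and arrive through gossip; here a plain dictionary of counters stands in for that aggregate, and the host names are hypothetical.

```python
from typing import Dict

def barrier_released(local_counter: int, counters: Dict[str, int]) -> bool:
    """A host may proceed once the aggregate minimum equals its own counter."""
    return min(counters.values()) == local_counter

counters = {"pc1": 0, "pc2": 0, "pc3": 0}
counters["pc1"] += 1                    # pc1 reaches the barrier
counters["pc2"] += 1                    # pc2 reaches the barrier
early = barrier_released(counters["pc1"], counters)   # pc3 has not arrived yet
counters["pc3"] += 1                    # pc3 reaches the barrier
late = barrier_released(counters["pc1"], counters)    # everyone has arrived
```

Because counters only increase, the same attribute can be reused for successive barriers without resetting it.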

Recall that Astrolabe is currently configured to impose very low background loads at the expense of somewhat slower propagation of new data. More precisely, although gossip rate is a parameter, most of our work uses a gossip rate of once per five seconds.⁷ The delay before barrier notification occurs scales as the gossip period (the inverse of the gossip rate) times the log of the size of the system (the logarithmic base being the size of an average zone). If we assume zones of size 64, a system with 500,000 nodes would update global aggregations in about twenty seconds, which is quite acceptable for settings such as grid computing, where latencies are high in any case.
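
The order of magnitude of this estimate can be checked with a short calculation. The formula below is only the scaling rule stated above (gossip period times the logarithm of the system size, taken in the base of the zone size) with no constant factors, so it is a rough sketch rather than a model of the protocol.

```python
import math

def propagation_delay(gossip_period_s: float, nodes: int, zone_size: int) -> float:
    """Gossip period times log of the system size, with the zone size as the base."""
    return gossip_period_s * math.log(nodes, zone_size)

estimate = propagation_delay(5.0, 500_000, 64)
```

With a five-second period, 500,000 nodes and zones of size 64, the result is a little under 16 seconds, which is consistent with the figure of about twenty seconds quoted in the text.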

Similarly, voting can be done by having two attributes, yes and no, in addition to the nmembers attribute, all aggregated by taking the sum. Thus, participants have access to the total number of members, the number that have voted in favor, and the number that have voted against. This information can be used in a variety of distributed algorithms, such as commit protocols.

⁶ At the time of this writing, an implementation of reliable Publish/Subscribe over SelectCast was still under development, and a systematic performance evaluation had not yet been undertaken.

⁷ It is tempting to speculate about the behavior of Astrolabe with very high gossip rates, but doing so leads to misleading conclusions. Gossip convergence times have a probabilistic distribution and there will always be some nodes that see an event many rounds earlier than others. Thus, while Astrolabe has rather predictable behavior in large-scale settings, the value of the system lies in its scalability, not its speed or real-time characteristics.


Again, the value of such a solution is that it scales extremely well (and can co-exist with other scalable monitoring and control mechanisms). For small configurations, Astrolabe would be a rather slow way to solve the problem. In contrast, traditional consensus protocols are very fast in small configurations, but have costs linear in system size (they often have a 2-phase commit or some sort of circular token-passing protocol at the core). With as few as a few hundred participants, such a solution would break down.

A synchronization problem that comes up with Astrolabe applications is that AFC propagation not only takes some time, but that the amount of time is only probabilistically bounded. How long should a process wait before it can be sure every agent has received the AFC? This question can be answered by summing up the number of agents that have received the AFC, as a simple aggregation query within the AFC itself. When this sum equals the nmembers attribute of the root zone, all agents have received the AFC.

4 Implementation

Each host runs an Astrolabe agent. Such an agent runs Astrolabe’s gossip protocol with other agents, and also supports clients that want to access the Astrolabe service. Each agent has access to (that is, keeps a local copy of) only a subset of all the MIBs in the Astrolabe zone tree. This subset includes all the zones on the path to the root, as well as the sibling zones of each of those. In particular, each agent has a local copy of the root MIB,⁸ and the MIBs of each child of the root. As stated before, there are no centralized servers associated with internal zones; their MIBs are replicated on all agents within those zones.

However, this replication is not lock-step: different agents in a zone are not guaranteed to have identical copies of MIBs even if queried at the same time, and not all agents are guaranteed to perceive each and every update to a MIB. Instead, the Astrolabe protocols guarantee that agents do not lag behind, using an old version of a MIB, forever. More precisely, Astrolabe implements a probabilistic consistency model under which, if updates to the leaf MIBs cease for long enough, an operational agent is arbitrarily likely to reflect all the updates that have been seen by other operational agents. We call this eventual consistency and discuss the property in Section 4.3. First we turn our attention to the implementation of the Astrolabe protocols.

4.1 Gossip

Astrolabe propagates information using an epidemic peer-to-peer protocol known as gossip [11]. As we will see later, this protocol is scalable, fast, and secure. The basic idea is simple: periodically, each agent selects some other agent at random and exchanges state information with it. If the two agents are in the same zone, the state exchanged relates to MIBs in that zone; if they are in different zones, they exchange state associated with the MIBs of their least common ancestor zone. In this manner, the states of Astrolabe agents will tend to converge as data ages.

⁸ Because the root MIB is calculated locally by each agent, it never needs to be communicated between agents.


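Figure 3: A simplified representation of the data structure maintained by the agent corresponding to /USA/Cornell/pc3. (Labels in the figure: a record for / with child zones USA, Europe, and Asia; a record for /USA with Cornell and MIT; a record for /USA/Cornell with pc1 through pc4; a record for /USA/Cornell/pc3 with system, monitor, and inventory; and self pointers.)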

Conceptually, each zone periodically chooses another sibling zone at random, and the two exchange the MIBs of all their sibling zones. After this exchange, each adopts the most recent MIBs according to the issued timestamps. The details of the protocol are somewhat more complicated, particularly since only leaf zones actually correspond to individual machines, while internal zones are collections of machines that collectively are responsible for maintaining their state and gossiping this state to peer zones.

As elaborated in Section 4.2, Astrolabe gossips about membership information just as it gossips about MIBs and other data. If a process fails, its MIB will eventually expire and be deleted. If a process joins the system, its MIB will spread through its parent zone by gossip, and as this occurs, aggregates will begin to reflect the content of that MIB.

The remainder of this section describes Astrolabe’s protocol in more detail.

Each Astrolabe agent maintains the data structure depicted in Figure 3. For each level in the hierarchy, the agent maintains a record with the list of child zones (and their attributes), and which child zone represents its own zone (self). The first (bottom) record contains the local virtual child zones, whose attributes can be updated by writing them directly (through an RPC interface). In the remaining records, the MIBs pointed to by self are calculated by the agent locally using the aggregation functions. The other MIBs are learned through the gossip protocol.

The MIB of any zone is required to contain at least the following attributes:

id: the local zone identifier;

rep: the zone name of the representative agent for the zone, that is, the agent that generated the MIB of the zone;

issued: a timestamp for the version of the MIB, used for the replacement strategy in the epidemic protocol, as well as for failure detection;

contacts: a small set of addresses for representative agents of this zone, used for the peer-to-peer protocol that the agents run;

servers: a small set of TCP/IP addresses for (representative agents of) this zone, used by applications to interact with the Astrolabe service;⁹

nmembers: the total number of hosts in the zone. The attribute is constructed by taking the sum of the nmembers attributes of the child zones. It is used for pacing multicast location mechanisms, as well as in the calculation of averages.

id and rep are automatically assigned, that is, their values are not programmable. The id attribute is set to the local identifier within the parent zone, while rep is set to the full zone name of the local agent. The AFCs can provide values for each of the other attributes. If the AFCs do not compute a value for issued, the local wall clock time is used.
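
The required attributes can be pictured as a small record type. The sketch below is illustrative only: the `MIB` class and the `aggregate` helper are hypothetical, the rule of taking the first two child contacts is an assumption standing in for whatever aggregation function is installed, and only the summing of nmembers follows directly from the text.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MIB:
    id: str                      # local zone identifier within the parent zone
    rep: str                     # full zone name of the agent that generated this MIB
    issued: float                # timestamp of this version of the MIB
    contacts: List[str] = field(default_factory=list)
    servers: List[str] = field(default_factory=list)
    nmembers: int = 1            # a leaf zone counts as one host

def aggregate(zone_id: str, rep: str, children: List[MIB], now: float) -> MIB:
    """Build a parent MIB: nmembers is the sum over the children (contacts rule is assumed)."""
    contacts = [c for child in children for c in child.contacts][:2]
    return MIB(zone_id, rep, now, contacts, list(contacts),
               sum(child.nmembers for child in children))

leaves = [
    MIB("pc1", "/USA/Cornell/pc1", 100.0, ["10.0.0.1"], ["10.0.0.1"]),
    MIB("pc2", "/USA/Cornell/pc2", 101.0, ["10.0.0.2"], ["10.0.0.2"]),
    MIB("pc3", "/USA/Cornell/pc3", 102.0, ["10.0.0.3"], ["10.0.0.3"]),
]
cornell = aggregate("Cornell", "/USA/Cornell/pc3", leaves, now=103.0)
```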

The contacts attribute is dynamically computed based on an aggregation function, much like the routers in SelectCast (Section 3.2). In effect, each zone elects the set of agents that gossip on behalf of that zone. The election can be arbitrary, or based on characteristics like load or longevity of the agents.¹⁰ Note that an agent may be elected to represent more than one zone, and thus run more than one gossip protocol, as described below. The maximum number of zones an agent can represent is bounded by the number of levels in the Astrolabe tree.

Each agent periodically runs the gossip algorithm. First, the agent updates the issued attribute in the MIB of its virtual system zone, and re-evaluates the AFCs that depend on this attribute. Next, the agent has to decide at which levels (in the zone tree) it will gossip. For this decision, the agent traverses the list of records in Figure 3. An agent gossips on behalf of those zones for which it is a contact, as calculated by the aggregation function for that zone. The rate of gossip at each level can be set individually (using the &config certificate described in Section 6.2).

When it is time to gossip within some zone, the agent picks one of the child zones, other than its own, from the list at random. Next the agent looks up the contacts attribute for this child zone, and picks a random contact agent from the set of hosts in this attribute. (Gossips always are between different child zones, thus if there is only one child zone at a level, no gossip will occur.) The gossiping agent then sends the chosen agent the id, rep, and issued attributes of all the child zones at that level, and does the same thing for the higher levels in the tree up until the root level. The recipient compares the information with the MIBs that it has in its memory, and can determine which of the gossiper’s entries are out-of-date, and which of its own entries are. It sends the updates for the first category back to the gossiper, and requests updates for the second category.

⁹ Typically these refer to the same agents as contacts, but applications can choose other agents by installing the appropriate AFC.

¹⁰ The latter may be advised if there is a high rate of agents joining and leaving the system. Many peer-to-peer systems suffer degraded performance when network partitioning or a high rate of churn occurs. We are currently focused on using Astrolabe in comparatively stable settings, but see this topic as an interesting one deserving further study.

There is one important detail when deciding if one MIB is newer than another. Originally, we simply compared the issued timestamps with one another, but found that as we scaled up the system we could not rely on clocks being synchronized. This lack of synchronization is the reason for the rep attribute: we now only compare the issued timestamps of the same agent, identified by rep. For each zone, we maintain the most recent MIB for each representative, that is, the agent that generated the MIB, until it times out (see Section 4.2). We expose to the Astrolabe clients only the MIB of one of these representatives. Since we cannot compare their issued timestamps, we select the one for which we received an update most recently.
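
The comparison step of a gossip exchange can be sketched as follows. The digest layout (a dictionary keyed by zone id and rep) and the function name `reconcile` are assumptions for illustration; the point of the sketch is that issued timestamps are only ever compared between entries with the same rep.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]            # (zone id, rep): timestamps are comparable only per rep
Digest = Dict[Key, float]        # key -> issued timestamp

def reconcile(gossiper: Digest, recipient: Digest) -> Tuple[List[Key], List[Key]]:
    """Return (entries the recipient sends back, entries the recipient requests)."""
    missing = float("-inf")      # an entry the other side has never seen counts as oldest
    send_back = [k for k, t in recipient.items() if t > gossiper.get(k, missing)]
    request = [k for k, t in gossiper.items() if t > recipient.get(k, missing)]
    return send_back, request

gossiper = {("USA", "/USA/Cornell/pc3"): 10.0, ("Europe", "/Europe/x/pc1"): 7.0}
recipient = {("USA", "/USA/Cornell/pc3"): 12.0, ("Asia", "/Asia/y/pc2"): 3.0}
send_back, request = reconcile(gossiper, recipient)
```

In this example the recipient holds a newer USA entry and an Asia entry the gossiper lacks, so it sends those back, and it requests the Europe entry that only the gossiper has.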

We are currently in the process of evaluating various compression schemes for reducing the amount of information that has to be exchanged this way. Nevertheless, gossip within a zone spreads quickly, with dissemination time growing as $O(\log n)$, where $n$ is the number of child zones of the zone (see Section 7.1). Gossip is going on continuously, its rate being independent of the rate of updates. (The sole impact of a high update rate is that our compression algorithms will not perform as well, and hence network loads may be somewhat increased.) These properties are important to the scalability of Astrolabe, as we will discuss in Section 7.

4.2 Membership

Up until now we have tacitly assumed that the set of machines is fixed. In a large distributed application, chances are that machines will be joining and leaving at a high rate. The overall frequency of crashes rises linearly with the number of machines. Keeping track of membership in a large distributed system is no easy matter [32]. In Astrolabe, membership is simpler, because each agent only has to know a relatively small subset of the agents (logarithmic in the total size of the membership). These are its own gossip contacts, and those for its parent and child zones.

There are two aspects to membership: removing members that have failed or are disconnected, and integrating members that have just started up or were previously partitioned away.

The mechanism used for failure detection in Astrolabe is fairly simple. As described above, each MIB has a rep attribute that contains the name of the representative agent that generated the MIB, and an issued attribute that contains the time at which the agent last updated the MIB. Agents keep track, for each zone and for each representative agent of the zone, of the last MIBs from those representative agents. When an agent has not seen an update for a zone from a particular representative agent for that zone for some time $T_{fail}$, it removes its corresponding MIB. When the last MIB of a zone is removed, the zone itself is removed from the agent’s list of zones. This algorithm will always detect and remove failed participants and empty zones within $T_{fail}$ seconds. $T_{fail}$ should grow logarithmically with membership size (see [32]), which in turn can be determined from the nmembers attribute.
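
The expiry rule can be sketched in a few lines. The table layout (last update time per zone and representative, measured on the local clock) and the function name `expire` are assumptions for illustration.

```python
from typing import Dict, Tuple

def expire(last_seen: Dict[Tuple[str, str], float], now: float, t_fail: float) -> Dict[str, int]:
    """Drop MIBs not refreshed within t_fail seconds; return surviving MIB counts per zone."""
    for key in [k for k, seen in last_seen.items() if now - seen > t_fail]:
        del last_seen[key]
    zones: Dict[str, int] = {}
    for zone, _rep in last_seen:
        zones[zone] = zones.get(zone, 0) + 1
    return zones                 # a zone with no remaining MIB no longer appears

# Hypothetical state: (zone, representative) -> local time of the last update received.
last_seen = {("Cornell", "pc1"): 100.0, ("Cornell", "pc3"): 40.0, ("MIT", "pc7"): 30.0}
zones = expire(last_seen, now=110.0, t_fail=60.0)
```

Here the stale entry from pc3 is dropped while Cornell survives through pc1, and MIT disappears from the zone list because its only MIB expired.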

The other part of membership is integration. Either because of true network partitions, or because of setting $T_{fail}$ too aggressively, it is possible that the Astrolabe tree splits up into two or more independent pieces. New machines, and machines recovering from crashes, also form independent, degenerate Astrolabe trees (where each parent has exactly one child). We need a way to glue the pieces together.

Astrolabe relies on IP multicast to set up the initial contact between trees. Each tree multicasts a gossip message at a fixed rate $\rho$ which is typically on the order of ten seconds. The collective members of the tree are responsible for this multicasting, and they do so by each tossing a coin every $\rho$ seconds that is weighted by the nmembers attribute of the root zone. Thus each member multicasts at an average rate of $\rho / \mathit{nmembers}$.
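
The effect of the weighted coin can be checked with a small simulation. It assumes that “weighted by nmembers” means that each member multicasts with probability 1/nmembers per interval; the member count, number of rounds, and seed below are arbitrary choices for the example.

```python
import random

def should_multicast(nmembers: int, rng: random.Random) -> bool:
    """Coin toss made by each member once per interval: heads with probability 1/nmembers."""
    return rng.random() < 1.0 / nmembers

rng = random.Random(42)          # fixed seed so the simulation is repeatable
nmembers, rounds = 50, 2000
per_round = [sum(should_multicast(nmembers, rng) for _ in range(nmembers))
             for _ in range(rounds)]
mean_multicasts = sum(per_round) / rounds   # average multicasts per interval for the whole tree
```

Under this assumption the tree as a whole sends about one multicast per interval regardless of its size, while the load on any individual member shrinks as the tree grows.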

The current implementation of Astrolabe also occasionally broadcasts gossips on the local LAN in order to integrate machines that do not support IP multicast. In addition, Astrolabe agents can be configured with a set of so-called relatives, which are addresses of agents that should occasionally be contacted using point-to-point messages. This strategy allows the integration of Astrolabe trees that cannot reach each other by any form of multicast. These mechanisms are described in more detail in [31].

Astrolabe assumes that the administrators responsible for configuring the system will assign zone names in a manner consistent with the physical topology of the network. In particular, for good performance, it is desirable that the siblings of a leaf node be reasonably close (in the network). Since zones typically contain 32 to 64 members, the vast majority of messages are exchanged between sibling leaf nodes. Thus, if this rather trivial placement property holds, Astrolabe will not overload long-distance network links.

4.3 Eventual Consistency

Astrolabe takes snapshots of the distributed state, and provides aggregated information to its users. The aggregated information is replicated among all the agents that were involved in taking the snapshot. The set of agents is dynamic. This raises many questions about consistency. For example, when retrieving an aggregate value, does it incorporate all the latest changes to the distributed state? When two users retrieve the same attribute at the same time, do they obtain the same result? Do snapshots reflect a single instant in time? When looking at snapshot 1, and then later at snapshot 2, is it guaranteed that snapshot 2 was taken after snapshot 1?

The answer to all these questions is no. For the sake of scalability, robustness, and rapid dissemination of updates, a weak notion of consistency has been adopted for Astrolabe. Given an aggregate attribute F that depends on some other attribute G, Astrolabe guarantees with probability 1 that when an update H is made to G, either H itself, or an update to G made after H, is eventually reflected in F. We call this eventual consistency, because if updates cease, replicated aggregate results will eventually be the same.

Aggregate attributes are updated frequently, but their progress may not be monotonic. This is because the issued time in a MIB is the time when the aggregation was calculated, but ignores the times at which the leaf attributes that are used in the calculation were updated. Thus it is possible that a new aggregation, calculated by a different agent, is based on some attributes that are older than a previously reported aggregated value.

For example, perhaps agent a computes the mean load within some zone as 6.0 by averaging the loads for MIBs in the child zones known at a. Now agent b computes the mean load as 5.0. The design of Astrolabe is such that these computations could occur concurrently and might be based on temporarily incomparable attribute sets: a might have more recent data for child zone x and yet b might have more recent data for child zone y.

This phenomenon could cause computed attributes to jump around in time, an effect that would be confusing. To avoid the problem, Astrolabe can track the (min, max) interval for the issued attribute associated with the inputs to any AFC. Here, min is the issued time of the earliest updated input attribute, and max the issued time of the most recently updated input attribute. An update with such an interval is not accepted unless the minimum issued time of the new MIB is at least as large as the maximum issued time of the current one. This way we can guarantee monotonicity, in that all attributes that were used in the aggregation are at least as new as those of the old aggregated value.
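
The acceptance test on (min, max) intervals can be sketched as follows. The function names and the sample timestamps are hypothetical; the comparison itself is the rule stated above.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]   # (min issued, max issued) over the inputs of an aggregate

def input_interval(issued_times: Iterable[float]) -> Interval:
    """Interval covered by the issued times of the attributes fed into an AFC."""
    times = list(issued_times)
    return (min(times), max(times))

def accept_update(current: Interval, new: Interval) -> bool:
    """Accept only if every input of the new value is at least as recent as the current ones."""
    return new[0] >= current[1]

current = input_interval([10.0, 14.0, 20.0])
overlapping = input_interval([15.0, 22.0, 30.0])   # partly based on older inputs: rejected
fresh = input_interval([20.0, 23.0, 25.0])         # based only on newer inputs: accepted
```

Rejecting overlapping intervals is what trades update frequency for a sensible progression in time: an aggregate is replaced only once a completely new set of inputs is available.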

Rather than producing an update to an aggregate result after each received gossip, Astrolabe then generates an update only after a completely new set of input attributes has had a chance to propagate. As we will see later, this propagation may take many rounds of gossip (5 to 35 rounds for practical Mariner hierarchies). The user thus sees fewer updates, but the values represent a sensible progression in time. The trade-off can be made by the applications.

4.4 Communication

We have tacitly assumed that Astrolabe agents have a simple way to address each other and exchange gossip messages. Unfortunately, in this age of firewalls, Network Address Translation (NAT), and DHCP, many hosts have no way of addressing each other, and even if they do, firewalls often stand in the way of establishing contact. One solution would have been to e-mail gossip messages between hosts, but we rejected this solution, among others, for efficiency considerations. We also realized that IPv6 may still be a long time in coming, and that IT managers are very reluctant to create holes in firewalls.

We currently offer two solutions to this problem. Both solutions involve HTTP as the communication protocol underlying gossip, and rely on the ability of most firewalls and NAT boxes to set up HTTP connections from within a firewall to an HTTP server outside the firewall, possibly through an HTTP proxy server. One solution deploys Astrolabe agents on the core Internet (reachable by HTTP from anywhere), while the other is based on Application Level Gateways (ALGs) such as used by AOL Instant Messenger (www.aol.com/aim) and Groove (www.groove.net). The solutions can both be used simultaneously.

In the solution based on ALGs, a host wishing to receive a message sends an HTTP POST request to an ALG (see Figure 4). The ALG does not respond until a message is available. A host wishing to send a message sends an HTTP POST request with a message in the body to the ALG. The ALG forwards the message to the appropriate receiver, if available, as a response to the receiver’s POST request. The ALG has limited capacity to buffer messages that arrive between a receiver’s POST requests. When using persistent HTTP connections, the efficiency is reasonable if the ALG is close to its connected receivers. (It turns out that no special encoding is necessary for the messages.)

Figure 4: Application Level Gateway. (1) Receiver sends a RECEIVE request using an HTTP POST request; (2) Sender sends the message using a SEND request using an HTTP POST request; (3) ALG forwards the message to the receiver using an HTTP 200 response; (4) ALG sends an empty HTTP 200 response back to the sender.

The ALGs can either be deployed stand-alone on the core Internet, or as servlets within existing enterprise web servers. For efficiency and scalability reasons, hosts preferably receive through a nearby ALG, which requires that a sufficient number of servers be deployed across the Internet. Note also that machines that are already directly connected to the core Internet do not have to receive messages through an ALG, but can receive them directly.

This solution even works for mobile machines, but for efficiency we have included a redirection mechanism inspired by the one used in cellular phone networks. A mobile machine, when connecting in a new part of the Internet, has to set up a redirection address for a nearby ALG with its “home” ALG. When a sender tries to connect to the (home) ALG of the mobile host, the sender is informed of the ALG closer to the mobile host.

Another solution is to deploy, instead of ALGs, ordinary Astrolabe agents in the core Internet. Astrolabe agents can gossip both over UDP and HTTP. One minor problem is that gossip cannot be initiated from outside a firewall to the inside, but updates will still spread rapidly because once a gossip is initiated from inside a firewall to the outside, the response causes updates to spread in the other direction. A larger problem is that the Astrolabe hierarchy has to be carefully configured so that each zone that is located behind a firewall, but has attributes that should be visible outside the firewall, has at least one sibling zone that is outside the firewall.

To increasescalabilityandefficiency, we designeda new way of addressingend-points. In particular, we would like to have machinesthat cancommunicatedirectlythroughUDP or someotherefficient mechanismto do so, insteadof going throughHTTP.

We define a realm to be a set of peers that can communicate with each other using a single communication protocol. For example, the peers within a corporate network typically form a realm. In fact, there are two realms: a UDP and an HTTP realm. Peers that use the ALG also form a realm. The peers with static IP addresses on the core Internet that are not behind firewalls also form a UDP and an HTTP realm. Thus, a peer may be in multiple realms. In each realm, it has a local address. Two peers may have more than one realm in common.

Figure 5: The many ways gossip can travel from a source host in Site 1 to a destination host in Site 2. Each site has four hosts, two of which are representative agents for their associated sites, behind a firewall. The representatives of Site 2 connect to two different ALG servers to receive messages from outside their firewall.

We assign to each realm a globally unique identifier called RealmID. The ALG realm is called “internet/HTTP”. A corporate network realm may be identified by the IP address of its firewall, plus a “UDP” or “HTTP” qualifier (viz. “a.b.c.d/UDP” resp. “a.b.c.d/HTTP”).

We define for each peer its AddressSet to be a set of triples of the form (RealmID, Address, Preference), where

• RealmID is the globally unique identifier of the realm the peer is in,

• Address is the address within the realm (and is only locally unique),

• Preference indicates a preference sorting order on the corresponding communication protocols. Currently, UDP is preferred over HTTP.

A peer may have multiple triples in the same realm with different addresses (“multi-homing”), typically for fault-tolerance reasons, as well as the same address in distinct realms. For fault-tolerance purposes, a receiver registers with multiple ALGs, and thus has multiple addresses in the Chat realm (see Figure 5).

When peer F wants to send a message to peer G, F first determines the common realms. Next, it will typically use a weighted preference (based on both F and G's preferences) to decide which address to send the message to. Astrolabe agents randomize this choice in order to deal with permanent network failures in particular realms.
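The address selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `Triple` type, the `choose_address` function, the integer encoding of Preference (larger means more preferred), and the rule of adding the two peers' preferences to obtain a weight are all hypothetical, since the text does not specify how the weighted preference is computed.

```python
import random
from typing import NamedTuple, Optional, Sequence

class Triple(NamedTuple):
    realm_id: str     # globally unique realm identifier, e.g. "internet/HTTP"
    address: str      # address inside the realm; only locally unique
    preference: int   # assumed encoding: larger means more preferred

def choose_address(sender: Sequence[Triple], receiver: Sequence[Triple],
                   rng=random) -> Optional[Triple]:
    """Pick one of the receiver's addresses in a realm both peers share."""
    # Best preference the sender has in each of its realms.
    sender_pref = {}
    for t in sender:
        sender_pref[t.realm_id] = max(sender_pref.get(t.realm_id, 0), t.preference)
    # Only addresses in common realms are reachable.
    candidates = [t for t in receiver if t.realm_id in sender_pref]
    if not candidates:
        return None
    # Assumed weighting: sum of both peers' preferences. The draw stays random
    # so that a realm with a permanent failure is not chosen every time.
    weights = [sender_pref[t.realm_id] + t.preference for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With a receiver registered at two ALGs, repeated calls spread messages over both ALG addresses, which is the multi-homing behaviour the text relies on for fault tolerance.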

More detail on wide-area networking in Astrolabe is described in [31].

4.5 Fragmentation

Messages in Astrolabe grow in size approximately as a function of the branching factor used in the hierarchy. The larger the branching factor, the more zones that need to be gossiped about, and the larger the gossip messages. For this reason, the branching factor should be limited. In practice, we have found that Astrolabe requires about 200 to 300 bytes per (compressed) MIB in a gossip message. The concern is that UDP messages are typically limited to approximately 8 Kbytes, which can therefore just contain about 25 - 40 MIBs in a message. This limit is reached very easily, and therefore Astrolabe has to be able to fragment its gossip messages sent across UDP.

The Astrolabe agents use a simple protocol for fragmentation. Rather than including all updated MIBs in a gossip message, an agent will just include as many as will fit. In order to compensate for lost time, the agent speeds up the rate of gossip accordingly. For example, if on average only half of the updated MIBs fit in a gossip message, the agent will gossip twice as fast. It turns out that a good strategy for choosing which MIBs to include in a message is random selection, as described in Section 7.4.
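A minimal sketch of this fragmentation strategy is given below. The function name `build_gossip`, the representation of a MIB as a `bytes` object, and the rule of scaling the gossip interval by the fraction of MIBs that fit are assumptions made for illustration; the text states only that the agent picks MIBs at random and speeds up gossip to compensate.

```python
import random

UDP_LIMIT = 8 * 1024  # approximate UDP message limit quoted in the text, in bytes

def build_gossip(updated_mibs, base_interval, limit=UDP_LIMIT, rng=random):
    """Return (mibs_to_send, next_interval) for one gossip message.

    updated_mibs: list of bytes objects, one per (compressed) MIB.
    base_interval: normal time between gossips, in seconds.
    """
    if not updated_mibs:
        return [], base_interval
    order = list(updated_mibs)
    rng.shuffle(order)                  # random selection of which MIBs to send
    chosen, used = [], 0
    for mib in order:
        if used + len(mib) <= limit:    # include as many as will fit
            chosen.append(mib)
            used += len(mib)
    # If only half of the MIBs fit, the interval halves (gossip twice as fast).
    fraction = max(len(chosen), 1) / len(order)
    return chosen, base_interval * fraction
```

For example, with 60 updated MIBs of 256 bytes each, 32 fit into one message, so a 5 second interval shrinks to about 2.7 seconds.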

5 Security

Each Astrolabe zone is a separate unit of management, each with its own set of policy rules. Such policies govern child zone creation, gossip rate, failure detection time-outs, introducing new AFCs, etc. These policies are under the secure control of an administrator. That is, although the administration of Astrolabe as a whole is decentralized, each zone is centrally administered in a fully secure fashion. Each zone may have its own administrator, even if one zone is nested within another.

We believe that an important principle of achieving scale in a hierarchical system is that children should have a way to override policies enforced by their parents. This principle is perhaps unintuitive, since it means that managers of zones with only a few machines have more control (over those machines) than managers of larger encapsulating zones. This creates an interesting tension: managers of large zones control more machines, but have less control over each machine than managers of small zones. Astrolabe is designed to conform to this principle, which guarantees that its own management is decentralized and scalable.

Security in Astrolabe is currently only concerned with integrity and write access control, not confidentiality (secrecy).11 We wish to prevent adversaries from corrupting information, or introducing non-existent zones.

Individual zones in Astrolabe can each decide whether they want to use public key cryptography, shared key cryptography, or no cryptography, in decreasing order of security and overhead. For simplicity of exposition, in what follows we will present just the public key mechanisms, although experience with the real system suggests that the shared key cryptography option often represents the best trade-off between security and overhead.

5.1 Certificates

Each zone in Astrolabe has a corresponding Certification Authority (CA) that issues certificates. (In practice, a single server process is often responsible for several such CAs.) Each Astrolabe agent has to know and trust the public keys of the CAs of its ancestor zones. Certificates can also be issued by other principals, and Astrolabe can autonomously decide to trust or not trust such principals.

11 The problem of confidentiality is significantly harder, as it would involve replicating the decryption key on a large number of agents, an inherently insecure solution.

An Astrolabe certificate is a signed attribute list. It has at least the following two attributes:

• id: the issuer of the certificate;

• issued: the time at which the certificate was issued. Astrolabe uses this attribute to distinguish between old and new versions of a certificate.

Optionally, a certificate can have an attribute expires, which specifies the time at which the certificate will be no longer valid. (For this mechanism to work well, all agents' clocks should be approximately synchronized with real time.)
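The role of the issued and expires attributes can be illustrated with a small sketch. It assumes, hypothetically, that a certificate is represented as a Python dictionary and that times are seconds since the epoch; signature handling is deliberately left out, and the helper names `is_expired` and `newer` are not part of Astrolabe.

```python
import time

def is_expired(cert, now=None):
    """True if the certificate has an 'expires' attribute that has passed."""
    now = time.time() if now is None else now
    expires = cert.get("expires")        # optional attribute
    return expires is not None and now >= expires

def newer(a, b):
    """Return the more recently issued of two versions of a certificate."""
    return a if a["issued"] > b["issued"] else b
```

Because both checks compare against the local clock or against timestamps set by the issuer, they are only meaningful when clocks are roughly synchronized, as the text notes.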

Each zone in Astrolabe has two public/private key pairs associated with it: the CA keys (the private key of which is kept only by the corresponding CA), and the zone keys. These are used to create four kinds of certificates:

1. a zone certificate binds the id of a zone to its public zone key. It is signed using the private CA key of its parent zone. Note that the root zone cannot have (and will turn out not to need) a zone certificate. Zone certificates contain two attributes in addition to id and issued: name, which is the zone name, and pubkey, which is the public zone key. As only the parent CA has the private CA key to sign such certificates, adversaries cannot introduce arbitrary child zones.

2. a MIB certificate is a MIB, signed by the private zone key of the corresponding zone. These are gossiped along with their corresponding zone certificates between hosts to propagate updates. The signature prevents the introduction of “false gossip” about the zone.

3. an aggregation function certificate (AFC) contains the code and other information about an aggregation function. An agent will only install those AFCs that are issued directly by one of its ancestor zones (as specified by the id attribute), or by one of their clients (see next bullet).

4. a client certificate is used to authenticate clients to Astrolabe agents. A client is defined to be a user of the Astrolabe service. The agents do not maintain a client database, but if the certificate is signed correctly by a CA key of one of the ancestor zones of the agent (specified by the id attribute), the connection is accepted. Client certificates can also specify certain restrictions on interactions between client and agent, such as which attributes the client may inspect. As such, client certificates are not unlike capabilities. A client certificate may contain a public key. The corresponding private key is used to sign AFCs created by the client.

In order to function correctly, each agent needs to be configured with its zone name (or “path”), and the public CA keys of each zone it is in. (It is important that agents do not have access to private CA keys.) As we saw when discussing the network protocols (Section 4.1), some agents in each zone (except the root zone) need the private zone keys and corresponding zone certificates of those zones for which they are allowed to post updates. In particular, each host needs the private key of its leaf zone.


The root zone is an exception: since it does not have sibling zones, it never gossips updates and therefore only needs CA keys, which it uses when signing certificates on behalf of top-level child zones. There is the common issue of the trade-off between fault-tolerance and security: the more hosts that have the private key for a zone, the more fault-tolerant the system, but also the more likely the key will get compromised. (Potentially this problem can be fixed using a threshold scheme; we have not yet investigated this option.)

Note that zone certificates are not chained. Although each is signed by the private CA key of the parent zone, the agents are configured with the public CA key of each ancestor zone, so that no recursive checking is necessary. Chaining would imply transitive trust, which does not scale and violates Astrolabe's governing principle of zones being more powerful than their ancestor zones.

The CA of a zone, and only that CA, can create new child zones by generating a new zone certificate and a corresponding private key, thus preventing impersonation attacks. The private key is used to sign MIB certificates (updates of the MIB of the new zone), which will only be accepted if the zone certificate's signature checks using the public CA key, and the MIB certificate's signature checks using the public key in the zone certificate. Similarly, a zone CA has control over what AFC code is installed within that zone. This is described in more detail in the next section, which treats the question of which clients are allowed which types of access.

5.2 Client Access

As mentioned above, Astrolabe agents do not maintain information about clients. The CA of a zone may choose to keep such information, but it is not accessible to the Astrolabe agents themselves, as we do not believe this solution would scale (eventually there would be too many clients to track). Instead, Astrolabe provides a mechanism that has features of both capabilities and access control lists.

A client that wants to use Astrolabe has to obtain a client certificate from a CA. The client will only be able to access Astrolabe within the zone of the CA. When it contacts one of the agents in the zone of the CA, the agent will use the id attribute in the certificate to determine the zone, and if it is in fact in this zone, will have the public key of the zone. It uses the public key to check that the certificate is signed correctly before allowing any access to the agent. Thus, in some sense, the client certificate is a capability for the zone of the CA that signed it.

Additional fine-grained control over clients is exercised in (at least) two ways:

1. the client certificate can specify certain constraints on the holder's access. For example, it may specify which attributes the client is allowed to read or update.

2. each zone can specify (and update at run-time) certain constraints on access by holders of client certificates signed by any of its ancestor zones.

That is, the client is constrained by the union of the security restrictions specified in its certificate, and the restrictions specified by all zones on the path from the leaf zone of the agent it is connected to, to the zone that signed its certificate.
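Taking the union of all restrictions is equivalent to intersecting what each party allows. The sketch below illustrates this for read access only. It assumes, purely for illustration, that the certificate and each zone on the path express their constraint as a set of attribute names the client may read; the function name `readable_attributes` is hypothetical.

```python
def readable_attributes(cert_allowed, zones_allowed_on_path):
    """Attributes a client may read at the agent it is connected to.

    cert_allowed: attribute names permitted by the client certificate.
    zones_allowed_on_path: one collection of permitted attribute names per
        zone, from the agent's leaf zone up to the zone that signed the
        certificate.
    """
    allowed = set(cert_allowed)
    for zone_allowed in zones_allowed_on_path:
        allowed &= set(zone_allowed)   # every zone on the path can only narrow access
    return allowed
```

A zone that changes its constraints at run-time simply changes one of the sets on the path, which narrows or widens access without reissuing the client certificate.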

In general, client certificates issued by larger zones have different security rules than client certificates issued by their child zones. The former will have more power in the larger zones, while the latter will have more power in the smaller zones. For this reason, users may require a set of client certificates.


Security will not scale if policies cannot be changed on the fly. The zone restrictions can be changed at run-time by installing an “&config” AFC for the zone (see Section 6.2). But client certificates share a disadvantage with capabilities and certificates in general: they cannot be easily revoked until they expire. We are considering two possible solutions. One is to have client certificates with short expiration times. The other is to use Certificate Revocation Lists (CRLs). Both have scaling problems.

The mechanisms discussed so far take care of authentication and authorization issues. They prevent impersonation and spoofing attacks, but they do not prevent authorized agents and clients from lying about their attributes. Such lies could have significant impact on calculated aggregates. Take, for example, the simple program:

    SELECT MIN(load) AS load;

This function exports the minimum of the load attributes of the children of some zone to the zone's attribute by the same name. The intention is that the root zone's load attribute will contain the global minimum load. Both clients and agents, when holding valid certificates, can do substantial damage to such aggregated data. Clients can lie about load by installing an unrealistically low or high load in a leaf zone's MIB, while an agent that holds a private key of a zone can gossip a value different from the computed minimum of its child zones' loads. To make an application robust against such attacks, we recommend removing outliers as much as possible. Unfortunately, standard SQL does not provide support for removing outliers. We are planning to extend Astrolabe's SQL engine with such support. More problematically, when applied recursively, the result of aggregation after removing outliers may have unclear semantics.
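The effect of a single false value on a recursive minimum, and the unclear semantics of recursive outlier removal, can be seen in a small sketch. The zone tree is modelled here, as an assumption, by nested Python lists with numeric loads at the leaves; `aggregate_trimmed_min` is a hypothetical outlier rule (drop the smallest value in each zone) and is not something Astrolabe's SQL engine provides.

```python
def aggregate_min(zone):
    """Recursive MIN(load): a zone is a number (leaf) or a list of child zones."""
    if isinstance(zone, (int, float)):
        return zone
    return min(aggregate_min(child) for child in zone)

def aggregate_trimmed_min(zone, trim=1):
    """Like aggregate_min, but each zone first discards its `trim` lowest values."""
    if isinstance(zone, (int, float)):
        return zone
    values = sorted(aggregate_trimmed_min(child, trim) for child in zone)
    kept = values[trim:] if len(values) > trim else values
    return kept[0]

honest = [[0.7, 0.4, 0.9], [0.5, 0.6, 0.8], [0.3, 0.95, 0.85]]
lying = [[0.7, 0.4, 0.9], [0.5, 0.0, 0.8], [0.3, 0.95, 0.85]]  # one leaf reports 0.0
```

Here `aggregate_min(honest)` is 0.3, while the single false report makes `aggregate_min(lying)` equal to 0.0. The trimmed variant ignores the lie but returns 0.7 for both trees, which is neither the true minimum nor any simple global statistic, an instance of the unclear semantics mentioned above.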

6 Aggregation Functions

The aggregation function of a zone reads the list of MIBs belonging to its child zones, and produces a MIB for its zone as output.12 The code for an AFC is provided in attributes of its child zone MIBs whose name starts with the character '&'. AFCs themselves are attribute lists. An AFC has at least the following attributes in addition to the standard certificate attributes:

• lang: specifies the language in which the program is coded.

• code: contains the SQL code itself.

• deps: contains the input attributes on which the output of the function depends. Astrolabe reduces overhead by only re-evaluating those AFCs for which the input has changed.

• category: specifies the attribute in which the AFC is to be installed. As we will see later, this explicit specification prevents rogue users from misusing correctly signed AFCs.

12 Astrolabe is not secured against faulty aggregation functions. For example, suppose that an aggregation function is expected to report the highest load in some zone, but sometimes incorrectly reports zero. The aggregation value seen by users would seem to bounce around, sometimes reflecting the correct value, and sometimes reflecting this erroneous input. It may be possible to use a form of voting to overcome such failures, but at present, this is simply a known security deficiency of the initial system.


Function                          Description

MIN(attribute)                    Find the minimum attribute
MAX(attribute)                    Find the maximum attribute
SUM(attribute)                    Sum the attributes
AVG(attribute [, weight])         Calculate the weighted average
OR(attribute)                     Bitwise OR of a bit map
AND(attribute)                    Bitwise AND of a bit map
FIRST(n, attribute)               Return a set with the first n attributes
RANDOM(n, attribute [, weight])   Return a set with n randomly selected attributes

Table 2: Extended SQL aggregation functions. The optional weight corrects for imbalance in the hierarchy, and is usually set to nmembers.

As described later, an AFC may also have the following attributes:

• copy: a Boolean that specifies if the AFC can be “adopted.” Adoption controls propagation of AFCs into sibling zones.

• level: an AFC is either “weak” or “strong.” Strong AFCs cannot be replaced by ancestor zones, but weak AFCs can if they have more recent issued attributes.

• client: in case of an AFC issued by a client, this attribute contains the entire client certificate of the client. The client certificate may be checked with the CA key of the issuer, while the AFC may be checked using the public key in the client certificate.

In this section, we will discuss how AFCs are programmed, and the propagation rules of AFCs.

6.1 Programming

We have extended SQL with a set of new aggregation functions that we found helpful in the applications that we have pursued. A complete table of the current set of functions appears in Table 2.

The following aggregation query is installed by Astrolabe by default (but may be changed at run-time):

    SELECT
        SUM(nmembers) AS nmembers,
        MAX(depth) + 1 AS depth,
        FIRST(3, contacts) AS contacts,
        FIRST(3, servers) AS servers

Here, nmembers is the total number of hosts in the zone, depth is the nesting level of the zones, contacts are the first three “representative” gossip addresses in the zone, and servers the first three TCP/IP addresses, used by clients to interact with this zone. nmembers is often used to weigh the values when calculating the average of a value, so that the actual average is calculated (rather than the average of averages).
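The difference between a weighted average and an average of averages can be made concrete with a short sketch. It assumes, for illustration only, that each child MIB is a Python dictionary; the function `avg` mimics the AVG(attribute [, weight]) signature from Table 2 but is not Astrolabe's implementation.

```python
def avg(children, attribute, weight=None):
    """Average of `attribute` over child MIBs, optionally weighted by `weight`."""
    if weight is None:
        return sum(c[attribute] for c in children) / len(children)
    total_weight = sum(c[weight] for c in children)
    return sum(c[attribute] * c[weight] for c in children) / total_weight

# Two child zones: one host with load 1.0, and nine hosts with average load 0.2.
children = [
    {"load": 1.0, "nmembers": 1},
    {"load": 0.2, "nmembers": 9},
]
unweighted = avg(children, "load")              # average of averages
weighted = avg(children, "load", "nmembers")    # actual per-host average
```

The unweighted result is 0.6, while weighting by nmembers gives 0.28, which is the true mean load over the ten hosts. This is why the default query maintains nmembers at every level.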

Astrolabe's SQL engine also has a simple exception handling mechanism that produces descriptive error messages in a standard output attribute.


6.2 Propagation

An application introduces a new AFC by writing an attribute of a virtual child zone at some Astrolabe agent. This Astrolabe agent will now automatically start evaluating this AFC. The Astrolabe architecture includes two mechanisms whereby an AFC can propagate through the system. First, the AFC can include another AFC (usually, a copy of itself) as part of its output. When this propagation mechanism is employed, the output AFC will be copied into the appropriate parent MIB and hence will be evaluated when the MIB is computed for the parent. This approach causes the aggregation process to recursively repeat itself until the root MIB is reached.

Because of the gossip protocol, these AFCs will automatically propagate to the other agents just like normal attributes do. However, since the other agents only share ancestor zones, we need a second mechanism to propagate these AFCs down into the leaf MIBs. This second mechanism, called adoption, works as follows. Each Astrolabe agent scans its ancestor zones for new AFC attributes. If it detects a new one, the agent will automatically copy the AFC into its virtual “system” MIB (described above in Section 2). This way, an introduced AFC will propagate to all agents within the entire Astrolabe tree.

Jointly, these two AFC propagation mechanisms permit an application to introduce a new AFC dynamically into the entire system. The AFC will rapidly spread to the appropriate nodes, and, typically within tens of seconds, the new MIB attributes will have been computed. For purposes of garbage collection, the creator of an AFC can specify an expiration time; unless the time is periodically advanced (by introducing an AFC with a new issued and expiration time), the aggregation function will eventually expire and the computed attributes will then vanish from the system.

It is clear that a secure mechanism is necessary to prevent clients from spreading code around that can change attributes arbitrarily. The first security rule is that certificates are only considered for propagation if

• the AFC is correctly signed by an ancestor zone or a client of one, preventing “outsiders” from installing AFCs. In case of an AFC signed by a client, the client has to be granted such propagation in its client certificate.

• the AFC has not expired.

• the category attribute of the AFC is the same as the name of the attribute in which it is installed. This prevents a rogue client from trying to use an AFC for an attribute it was not intended for.

We call such AFCs valid.

It is possible, and in fact common, for an AFC to be installed only in a subtree of Astrolabe. Doing so requires only that the corresponding zone sign the certificate, a signature that is easier to obtain than the CA-signature of the root zone.

The adoption mechanism allows certificates to propagate “down” the Astrolabe tree. Each agent continuously inspects its zones, scanning them for new certificates, and installs one in its own MIB if the AFC satisfies the following conditions:

• if another one of the same category is already installed, the new one is preferable.

• its copy attribute is not set to “no” (such AFCs have to be installed in each zone where they are to be used).


“Preferable” is a partial order relation between certificates in the same category: $M \succ N$ means that certificate $M$ is preferable to certificate $N$. $M \succ N$ iff

• $M$ is valid, and $N$ is not, or

• $M$ and $N$ are both valid, the level of $M$ is strong, and $M.id$ is a child zone of $N.id$, or

• $M$ and $N$ are both valid, $M$ is not stronger than $N$, and $M.issued > N.issued$.

Note that these rules of propagation allow the CAs to exercise significant control over where certificates are installed.

This controlled propagation can be exploited to do run-time configuration. Astrolabe configures itself this way. The &config certificate contains attributes that control the run-time behavior of Astrolabe agents. Using this certificate, the gossip rate and security policies can be controlled at every zone in the Astrolabe tree.

7 Performance

In this section, we present simulation results that support our scalability claims, and experimental measurements to verify the accuracy of our simulations. Finally, we describe various implementation details with the gossip protocol in the current Internet.

7.1 Latency

Since sibling zones exchange gossip messages, the zone hierarchy has to be based on the network topology so that load on network links and routers remains within reason. As a rule of thumb, if a collection of machines can be divided into two groups separated by a single router, these groups should be in disjoint zones. On the other hand, as we will see, the smaller the branching factor of the zone tree, the slower the dissemination. So some care should be taken not to make the zones too small in terms of number of child zones. With present day network packet sizes, CPU speeds, and memory sizes, the number of child zones should be approximately between 5 and 50. (In the near future, we hope that through improving our compression techniques we can support higher branching factors.)

As the membership grows, it may be necessary to create more zones. Initially, new machines can be added simply to leaf zones, but at some point it becomes necessary to divide the leaf zones into smaller zones. Note that this re-configuration only involves the machines in that leaf zone. Other parts of Astrolabe do not need to know about the re-configuration, and this is extremely important for scaling and deploying Astrolabe.

We know that the time for gossip to disseminate in a “flat” population grows logarithmically with the size of the population, even in the face of network links and participants failing with a certain probability [11]. The question is, is this slow growth in latency also true in Astrolabe, which uses a hierarchical protocol? The answer appears to be yes, albeit that the latency grows somewhat faster. To demonstrate this, we conducted a number of simulated experiments.13

13 Although Astrolabe is currently deployed on approximately 60 machines, this is not a sufficiently large system to evaluate its scalability.


[Plot: expected #rounds (0 to 35) versus #members (1 to 100,000, log scale); curves for bf = 5, bf = 25, bf = 125, and flat gossip.]

Figure 6: The average number of rounds necessary to infect all participants, using different branching factors. In all these measurements, the number of representatives is 1, and there are no failures.

In the experiments, we varied the branching factor of the tree, the number of representatives in a zone, the probability of message loss, and the ratio of failed hosts. In all experiments, we used a balanced tree with a fixed branching factor. We simulated up to $5^8$ (390,625) members. In the simulation, gossip occurred in rounds, with all members gossiping at the same time.14 We assumed that successful gossip exchanges complete within a round. (Typically, Astrolabe agents are configured to gossip once every two to five seconds, so this assumption seems reasonable.) Each experiment was conducted at least ten times. (For small numbers of members much more often than that.) In all experiments, the variance observed was low.

In the first experiment, we varied the branching factor. We used branching factors 5, 25, and 125 (that is, $5^1$, $5^2$, and $5^3$). In this experiment there was just one representative per zone, and there were no failures. We measured the average number of rounds necessary to disseminate information from one node to all other nodes.15 We show the results in Figure 6 (on a log scale), and compare these with flat (non-hierarchical) gossip. Flat gossip would be impractical in a real system, as the required memory grows linearly, and network load quadratically with the membership [32], but it provides a useful baseline.
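The flat baseline can be approximated with a very small round-based simulation. This sketch is not the simulator used for the figures: it assumes a simplified push-only epidemic in which every informed member sends to one uniformly chosen other member per round, with no failures, whereas Astrolabe's gossip is an exchange between both parties. The function name `rounds_to_infect_all` is hypothetical.

```python
import random

def rounds_to_infect_all(n, rng):
    """Rounds until all n members are informed, starting from member 0."""
    infected = [False] * n
    infected[0] = True
    count, rounds = 1, 0
    while count < n:
        rounds += 1
        # All members gossip "at the same time": only those informed at the
        # start of the round act as sources in this round.
        sources = [i for i in range(n) if infected[i]]
        for i in sources:
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1                 # a member never gossips to itself
            if not infected[j]:
                infected[j] = True
                count += 1
    return rounds

def average_rounds(n, runs=10, seed=0):
    """Average over several runs, as in the experiments described above."""
    rng = random.Random(seed)
    return sum(rounds_to_infect_all(n, rng) for _ in range(runs)) / runs
```

In this push-only model the informed population can at most double per round, so at least $\lceil \log_2 n \rceil$ rounds are needed; calling `average_rounds` for growing n shows the slow, roughly logarithmic growth that the flat curve in Figure 6 represents.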

Flat gossip provides the lowest dissemination latency. The corresponding line in the graph is slightly curved, because Astrolabe agents never gossip to themselves, which significantly improves performance if the number of members is small. Hierarchical gossip also scales well, but is significantly slower than flat gossip. Latency improves when the branching factor is increased, but doing so also increases overhead.

14 In more detailed discrete event simulations, in which the members did not gossip in rounds but in a more randomized fashion, we found that gossip propagates faster, and thus that the round-based gossip provides useful “worst-case” results.

15 We used a non-representative node as the source of information. Representatives have an advantage, and their information disseminates significantly faster.

[Plot: expected #rounds (0 to 35) versus #members (1 to 100,000, log scale); curves for nrep = 1, nrep = 2, nrep = 3, and flat gossip.]

Figure 7: The average number of rounds necessary to infect all participants, using a different number of representatives. In all these measurements, the branching factor is 25, and there are no failures.

For example, at 390,625 members and branching factor 5, there are 8 levels in the tree. Thus each member has to maintain and gossip only $8 \times 5 = 40$ MIBs. (Actually, since gossip messages do not include the MIBs of the destination agent, the gossip message only contains 32 MIBs.) With a branching factor of 25, each member maintains and gossips $4 \times 25 = 100$ MIBs. In the limit (flat gossip), each member would maintain and gossip an impractical 390,625 MIBs.
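The arithmetic in this example generalizes to any balanced tree, as the following sketch shows. The function name `tree_stats` is hypothetical, and the per-message count assumes, as the parenthetical remark suggests, that one MIB per level (the destination's own) is left out.

```python
def tree_stats(members, bf):
    """Return (levels, MIBs maintained, MIBs per gossip message)."""
    levels, capacity = 0, 1
    while capacity < members:      # smallest depth whose tree holds all members
        capacity *= bf
        levels += 1
    maintained = levels * bf        # bf child MIBs at each level
    in_message = levels * (bf - 1)  # the destination's own MIB per level is omitted
    return levels, maintained, in_message
```

For 390,625 members, `tree_stats(390625, 5)` gives 8 levels, 40 MIBs maintained and 32 per message, `tree_stats(390625, 25)` gives 4 levels and 100 MIBs, and the flat case `tree_stats(390625, 390625)` gives a single level with 390,625 MIBs.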

With only one representative per zone, Astrolabe is highly sensitive to host crashes. The protocols still work, as faulty representatives are detected and replaced automatically, but this detection and replacement takes time and leads to significant delays in dissemination. Astrolabe is preferably configured with more than one representative in each non-leaf zone. In Figure 7, we show the average number of rounds necessary to infect all participants in an Astrolabe tree with branching factor 25. In the experiments that produced these numbers, we varied the number of representatives from one to three. Besides increasing fault-tolerance, more representatives also decrease the time to disseminate new information. But three times as many representatives also lead to three times as much load on the routers, so the advantages come at some cost.

In the next experiment, we determined the influence of message loss on the dissemination latency. In this experiment we used, again, a branching factor of 25, but this time we fixed the number of representatives at three. Gossip exchanges were allowed to fail with a certain independent probability $p_{loss}$, which we varied from 0 to .15 (15%). As can be seen from the results in Figure 8, loss does lead to slower dissemination, but, as in flat gossip [32], the amount of delay is surprisingly low. In a practical setting, we would probably observe dependent message loss due to faulty or overloaded routers and/or network links, with more devastating effects. Nevertheless, because of the randomized nature of the gossip protocol, updates can often propagate around faulty components in the system. An example of such dependent message loss is the presence of crashed hosts.

[Plot: expected #rounds (0 to 35) versus #members (1 to 100,000, log scale); curves for loss = .00, loss = .05, loss = .10, and loss = .15.]

Figure 8: The average number of rounds necessary to infect all participants, using different message loss probabilities. In all these measurements, the branching factor is 25, and the number of representatives is three.

In this final simulated experiment, we stopped certain agents from gossiping in order to investigate the effect of host crashes on Astrolabe. Again we used a branching factor of 25, and three representatives. Message loss did not occur this time, but each host was down with a probability that we varied from 0 to .08 (8%). (With large numbers of members, doing so made the probability that all representatives for some zone are down rather high. Astrolabe's aggregation functions will automatically assign new representatives in such cases.) As with message loss, the effect of crashed hosts on latency is quite low (see Figure 9).

If we configure Astrolabe so that agents gossip once every two to five seconds, as we normally do, we can see that updates propagate with latencies on the order of tens of seconds, rather than hours, as can be the case with DNS if TTL settings are large.

7.2 Load

We were also interested in the load on Astrolabe agents. We consider two kinds of load: the number of received messages per round, and the number of signature checks that an agent has to perform per round. The average message reception load is easily determined: on average, each agent receives one message per round for each zone it represents. Thus, if there are $t$ levels, an agent that represents zones on each level will have the worst average load of $t$ messages per round. Obviously, this load grows as $O(\log n)$.


[Plot: expected #rounds (0 to 35) versus #members (1 to 100,000, log scale); curves for down = .00, down = .02, down = .04, down = .06, and down = .08.]

Figure 9: The average number of rounds necessary to infect all participants, using different probabilities of a host being down. In all these measurements, the branching factor is 25, and the number of representatives is three.

Dueto randomization,it is possiblethatanagentreceivesmorethanonemessageper round and per level. The varianceof messageload on an agentis expectedtogrow !#"%$'&)(+*-, : if a processis involved in t epidemicswith iid distributions,whereeachepidemicinvolvesthe samenumberof participants(the branchingfactorof theAstrolabetree), thenthe varianceis simply t timesthe varianceof the load of eachindividualepidemic.

In order to evaluatethe load on Astrolabeagentsexperimentally, we usedthreerepresentativesperzone,but eliminatedhostandmessageomissionfailures(astheseonly serve to reduceload).We ranasimulationfor 180rounds(fifteenminutesin caseeachroundis five seconds),andmeasuredthemaximumnumberof messagereceivedperroundacrossall agents.Theresultsareshown in Figure10. Thisfigureperhapsre-flectsbestthetrade-off betweenchoosingsmallandlargebranchingfactorsmentionedearlier.

If we simply checked all signatures in all messages that arrived, the overhead of checking could become enormous. As the number of MIBs in a message grows as $O(\log n)$, the computational load would grow as $O(\log^2 n)$. The larger the branching factor, the higher this load, as larger branching factors result in more MIBs per message, and can easily run in the thousands of signature checks per round even for moderately sized populations.

[Figure 10 plot: max # messages / second (y-axis, 0 to 60) versus # members (x-axis, 1 to 100000); curves for bf = 5, bf = 25, bf = 125]

Figure 10: The maximum load in terms of number of messages per round as a function of number of participants and branching factor.

Experiment   # agents   agents/host   Description
1            48         1             (8 8 8 8 8 8)
2            48         1             ((8 8) (8 8) (8 8))
3            63         1             (8 8 8 7 8 8 8 8)
4            63         1             ((8 8) (8 7 8) (8 8 8))
5            96         2             (16 16 16 16 16 16)
6            96         2             ((16 16) (16 16) (16 16))
7            126        2             (16 16 16 14 16 16 16 16)
8            126        2             ((16 16) (16 14 16) (16 16 16))

Table 3: Hierarchies used in each experiment.

Rather than checking all signatures each time a message arrives, the agent buffers all arriving MIBs without processing them until the start of a new round of gossip. (Actually, only those MIBs that are new with respect to what the agent already knows are buffered.) For each zone, the agent first checks the signature on the most recent MIB, then on the second most recent MIB, etc., until it finds a correct signature (usually the first time). The other versions of the same MIB are then ignored. Thus we have artificially limited the computational load to $O(\log n)$ without affecting the speed of gossip dissemination. In fact, the maximum number of checks per round is at most $t \times (bf - 1)$, where t is the number of levels and bf the branching factor. Moreover, the computational load is approximately the same on all agents, that is, not worse on those agents that represent many zones.
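The buffering scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the Astrolabe implementation: the `Agent` class, the integer `version` field, and the HMAC-based `sign` helper with a shared demo key are all hypothetical stand-ins (Astrolabe itself relies on public-key signatures).

```python
import hashlib
import hmac
from collections import defaultdict

KEY = b"demo-key"  # hypothetical shared key, used only to keep the sketch self-contained


def sign(zone, version, payload):
    """Stand-in for a real signature over a MIB."""
    msg = f"{zone}|{version}|{payload}".encode()
    return hmac.new(KEY, msg, hashlib.sha256).hexdigest()


class Agent:
    def __init__(self):
        self.known = {}                  # zone -> (version, payload) already accepted
        self.buffer = defaultdict(list)  # zone -> MIBs received since the last round
        self.checks = 0                  # signature checks performed so far

    def receive(self, zone, version, payload, signature):
        # Buffer only MIBs that are newer than what the agent already knows.
        current = self.known.get(zone)
        if current is None or version > current[0]:
            self.buffer[zone].append((version, payload, signature))

    def start_round(self):
        for zone, mibs in self.buffer.items():
            # Newest first; stop at the first MIB whose signature verifies
            # and ignore the older versions.
            for version, payload, signature in sorted(
                mibs, key=lambda m: m[0], reverse=True
            ):
                self.checks += 1
                if hmac.compare_digest(signature, sign(zone, version, payload)):
                    self.known[zone] = (version, payload)
                    break
        self.buffer.clear()
```

When the newest buffered MIB of a zone carries a valid signature, which the text says is the usual case, the agent performs a single check for that zone in that round no matter how many versions arrived.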

Another approach to limit computational overhead is based on a Byzantine voting approach. It removes almost all need for signature checking [17].

7.3 Validating the Simulations

In order to verify the accuracy of our simulator, we performed experiments with up to 126 Astrolabe agents on 63 hosts. For these experiments, we created the topology of Figure 11 in the Emulab network testbed of the University of Utah. The set-up consists of eight LANs, each consisting of eight Compaq DNARD Sharks (233 MHz StrongARM processor, 32 MB RAM, 10 Mbps Ethernet interface, running NetBSD) connected to an Asante 8+2 10/100 Ethernet switch. (As indicated in the figure, one of the Sharks was broken.) The set-up also includes three routers, consisting of 600 MHz Intel Pentium III “Coppermine” processors, each with 256 MB RAM and four 10/100 Mbps Ethernet interfaces, and running FreeBSD 4.0. All the routers' Ethernet interfaces, as well as the Asante switches, are connected to a Cisco 6509 switch, and then configured as depicted in Figure 11.

Figure 11: The experimental topology, consisting of 63 hosts, eight 10 Mbps LANs, and three routers connected by a 100 Mbps backbone link.

On this hardware, we created eight different Astrolabe hierarchies (see Table 3) with the obvious mappings to the hardware topology. For example, in Experiment 8 we created a three-level hierarchy. The top-level zone consisted of a child zone per router. Each child zone in turn had a grandchild zone per LAN. In Experiments 1, 3, 5, and 7 we did not reflect the presence of the routers in the Astrolabe hierarchy, but simply created a child zone per LAN. Experiments 1, 2, 5, and 6 did not use the 4th and 7th LAN. In the last four experiments, we ran two agents per host. Here, the agents were configured to gossip once every five seconds over UDP, and the (maximum) number of representatives per zone was configured to be three.

In each experiment, we measured how long it took for a value, updated at one agent, to disseminate to all agents. To accomplish this dissemination, we installed a simple aggregation function 'SELECT SUM(test) AS test'. All agents monitored the attribute 'test' in the top-level zone, and noted the time at which this attribute was updated. We updated the 'test' attribute in the virtual system zone of the “right-most” agent (the one in the bottom-right of the topology), since this agent has the longest gossiping distance to the other agents. Each experiment was executed at least 100 times. In Figure 12(a) we report the measured average, minimum, maximum, and 95% confidence intervals for each experiment.
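To illustrate why an update at a single leaf becomes visible in the top-level zone, the sketch below evaluates the same aggregation query bottom-up over the shape of Experiment 8 from Table 3. It uses SQLite purely as a convenient SQL engine; the table name `mib`, the explicit `FROM` clause, and the helpers `aggregate` and `lan` are assumptions made for the example, and the sketch ignores gossip timing entirely.

```python
import sqlite3

# The paper's query has no FROM clause; SQLite needs one, so a table
# named `mib` (an assumption) holds one row per child zone.
AGGREGATE = "SELECT SUM(test) AS test FROM mib"


def aggregate(zone):
    """Return the `test` value of a zone.

    A zone is either an int (the `test` attribute of a leaf agent) or a
    list of child zones whose rows are summarized by the query."""
    if isinstance(zone, int):
        return zone
    child_values = [aggregate(child) for child in zone]
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mib (test INTEGER)")
        conn.executemany(
            "INSERT INTO mib (test) VALUES (?)", [(v,) for v in child_values]
        )
        (total,) = conn.execute(AGGREGATE).fetchone()
        return total
    finally:
        conn.close()


def lan(size, updated=None):
    """A LAN zone with `size` agents; the agent at index `updated` holds test = 1."""
    return [1 if i == updated else 0 for i in range(size)]


# Shape of Experiment 8 in Table 3: ((16 16) (16 14 16) (16 16 16)).
hierarchy = [
    [lan(16), lan(16)],
    [lan(16), lan(14), lan(16)],
    [lan(16), lan(16), lan(16, updated=15)],  # the "right-most" agent is updated
]

if __name__ == "__main__":
    print(aggregate(hierarchy))
```

Because SUM is recomputed at every level, a change in one leaf row changes the value of each enclosing zone, which is what lets every agent detect the update by watching only the top-level attribute.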

Figure 12(b) reports the simulation results of the same experiments. In the simulations, we assumed no message loss, as we had no easy way of modeling the actual behavior of loss on the Emulab testbed. The simulations, therefore, show more regular behavior than the experiments. They sometimes give slightly pessimistic results, as in the simulations the agents gossip in synchronous rounds (non-synchronized gossip disseminates somewhat faster). Nevertheless, the results of the simulations correspond well to the results of the experiments, giving us confidence that the simulations predict the actual behavior in large networks well. The experimental results show the presence of some outliers in the latencies (although never more than two rounds of gossip), which could be explained by occasional message loss.

[Figure 12 plots: latency (y-axis, 0 to 60) versus experiment (x-axis, 1 to 8); panels (a) and (b)]

Figure 12: Results of (a) the experimental measurements and (b) the corresponding simulations. The x-axes indicate the experiment number in Table 3. The error bars indicate the averages and the 95% confidence intervals.

We plan to present comprehensive experimental data for Astrolabe in some other forum. At present, our experiments are continuing with an emphasis on understanding how the system behaves on larger configurations, under stress, or when the rate of updates is much higher.

[Figure 13 plot: # rounds (y-axis, 0 to 10) versus # entries in gossip message (x-axis, 0 to 16); curves for random, round robin, random + self, round robin + self]

Figure 13: Cost of fragmentation.

7.4 Fragmentation

As described in Section 4.5, the underlying network may enforce a maximum transmission unit size. In order to deal with this limitation, we reduce the number of MIBs in a message, and speed up the gossip rate accordingly. In order to decide which is a good method for choosing MIBs to include in messages, we simulated two basic strategies, with a variant for each. The two basic strategies are round robin and random. In the first, the agent fills up gossip messages in a round robin manner, while in the second the agent picks random (but different) MIBs. The variant strategy for each is that the agent's own MIB is forced to be included as exactly one of the entries.

In Figure 13 we show the results when the number of hosts is 16, and the hierarchy is flat. Round robin performs better than random, and including the local MIB of the sending agent is a good idea. The current implementation of the Astrolabe agent actually uses random filling, but also includes the local MIB. The reason for not using round robin is that, in a hierarchical epidemic, round robin can cause the MIBs of entire zones to be excluded.
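The four filling strategies compared in this section can be sketched as a single function. This is a hedged illustration only: the function name `fill_message`, its parameters, and the representation of MIBs as strings are assumptions, and the sketch models a flat list of MIBs rather than the hierarchical tables an Astrolabe agent maintains.

```python
import random


def fill_message(mibs, own, k, strategy, include_self, rng, cursor=0):
    """Choose at most k MIBs for one gossip message.

    Returns (entries, next_cursor); the cursor only matters for round robin.
    With include_self, the agent's own MIB takes exactly one of the k entries."""
    pool = [m for m in mibs if m != own] if include_self else list(mibs)
    slots = min(k - 1 if include_self else k, len(pool))
    picked = []
    if pool and slots > 0:
        if strategy == "round robin":
            # Continue where the previous message stopped.
            picked = [pool[(cursor + i) % len(pool)] for i in range(slots)]
            cursor = (cursor + slots) % len(pool)
        elif strategy == "random":
            # Random but different MIBs: sample without replacement.
            picked = rng.sample(pool, slots)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    entries = ([own] if include_self else []) + picked
    return entries, cursor


if __name__ == "__main__":
    rng = random.Random(1)
    mibs = [f"host{i}" for i in range(16)]  # 16 hosts, flat hierarchy
    # Roughly the choice the text attributes to the current agent: random + self.
    print(fill_message(mibs, "host3", 4, "random", True, rng)[0])
```

Calling the function with `strategy="random"` and `include_self=True` corresponds to the "random + self" curve of Figure 13.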

8 Related Work

8.1 Directory Services

Much work has been done in the area of scalable mapping of names of objects (often machines) onto meta-information of the objects. The best known examples are the DNS [18] and X.500 [21], and LDAP (RFC 1777) standards. Similar to DNS, and particularly influential to the design of Astrolabe, is the Clearinghouse directory service [11]. Clearinghouse was an early alternative to DNS, used internally for the Xerox Corporate Internet. Like DNS, it maps hierarchical names onto meta-information. Unlike DNS, it does not centralize the authority of parts of the name space to any particular servers. Instead, the top two levels of the name space are fully replicated and kept eventually consistent using a gossip algorithm much like Astrolabe's. Unlike Astrolabe, Clearinghouse does not apply aggregation functions or hierarchical gossiping, and thus its scalability is inherently limited. The amount of storage grows $O(n)$, while the total bandwidth taken up by gossip grows $O(n^2)$ (because the size of gossip messages grows linearly with the number of members). Clearinghouse has never been scaled to more than a few hundred servers. (Neither has Astrolabe at this time, but analysis and simulation indicate that Astrolabe could potentially scale to millions.)

More recent work applies variants of the Clearinghouse protocol to databases (e.g., Bayou's anti-entropy protocol [20] and Golding's timestamped anti-entropy protocol [12]). These systems suffer from the same scalability problems, limiting scale to perhaps a few thousand participants.

Also influential to the design of Astrolabe is Butler Lampson's paper on the design of a global name service [16], based on experience with Grapevine [7] and Clearinghouse. This paper enumerates the requirements of a name service, which include large size, high availability, fault isolation, and tolerance of mistrust, all of which we believe Astrolabe supports. The paper's design does not include aggregation, but otherwise shares many of the advantages of Astrolabe over DNS.

In the Intentional Naming System [1], names of objects are based on properties rather than location. A self-organizing resolver maps names onto locations, and can route messages to objects, even in an environment with mobile hosts. A recent extension to INS, Twine, allows for partial matching of properties based on a peer-to-peer protocol [4].

The Globe system [33] is an example of a very scalable directory service that maps arbitrary object names onto object identifiers, and then onto location. Globe also supports locating objects that move around.

8.2 Network Monitoring

Network monitors collect runtime information from various sources in the network. A standard for such retrieval is the SNMP protocol (RFC 1156, 1157). A large variety of commercial and academic network monitors exist. Many of these systems provide tools for collecting monitoring data in a centralized place, and visualizing the data. However, these systems provide little or no support for dissemination to a large number of interested parties, aggregation of monitored data, or security. The scale of these systems is often intended for, and limited to, clusters of up to a few hundred machines. Of these, Argent's Guardian (www.argent.com) is closest to Astrolabe, in that it uses regional agents to collect information in a hierarchical fashion (although it does not install agents on the monitored systems themselves). Monitoring information is reported to a single site, or at most a few sites, and Guardian does not support aggregation.

Note that Astrolabe does not actually retrieve monitoring information; it just provides the ability to disseminate and aggregate such data. Neither does Astrolabe provide visualization. The monitoring agents and visualization tools provided by these products, together with Astrolabe, could form an interesting marriage.

8.3 Event Notification

Event Notification or Publish/Subscribe services allow applications to subscribe to certain classes of events of interest, and systems to post events. Examples are ISIS [6], TIB/Rendezvous™ [19], Gryphon [2], and Siena [10]. Note that although Astrolabe processes events, it does not route them to subscribers, but processes them to determine some aggregate state “snapshot” of a distributed system. However, a Publish/Subscribe system has been built on top of Astrolabe and is described in Section 3.3.

XMLBlaster (www.xmlblaster.com) and a system built at MIT [25] encode events in XML format, and allow routing based on queries over the XML content in the messages. The latter system's protocol for this routing is similar to SelectCast (Section 3.3), although our protocol can do routing not only based on the content of the messages, but also on the aggregate state of the environment in which the messages are routed.

Although many of these systems support both scale and security, the events are either very low-level, or generated from aggregating low-level event histories. In order to monitor the collective state of a distributed system, a process would have to subscribe to many event sources, and do the entire event processing internally. This strategy does not scale. Event notification and Publish/Subscribe services are intended to disseminate information from few sources to many subscribers, rather than integrating information from many sources.

8.4 Sensor Networks

In a sensor network, a large number of sensors monitor a distributed system. The problem is detecting certain conditions and querying the state of the network. There are many projects in this area. A couple that relate closely to our work are Cougar [9] and an unnamed project described in [15].

In Cougar, a user can express a long-running SQL query, as in Astrolabe. A centralized server breaks the query into components that it spreads among the various devices. The devices then report back to the server, which combines the results and reports the result to the user. Although much of the processing is in the network, Cougar's centralized server may prevent adequate scaling and robustness.

In [15], a sensor network for a wireless system is described. As in Active Networks [28], code is dynamically installed in the network that routes and filters events. Using a routing mechanism called “Directed Diffusion,” events are routed towards areas of interest. The event matching is specified using a low-level language that supports binary comparison operators on attributes. The system also allows application-specific filters to be installed in the network that can aggregate events.

We have studied various epidemic protocols, including the ones that are used by Astrolabe, for use in power-constrained wireless sensor networks [30]. What is clear from this study is that the hierarchical epidemics used in Astrolabe cannot be used as is in such networks, as the protocol does send messages across large distances, albeit occasionally.

8.5 Cluster Management

As made clear above, directory services and network monitoring tools do not support dynamic distributed configuration. There are a number of cluster management tools available that take care of configuration, for example, Wolfpack, Beowulf, NUCM, and the “Distributed Configuration Platform” of EKS. Unlike Astrolabe, they do not scale beyond a few dozen machines, but they do provide various problem-specific functionality for the management of distributed systems. Astrolabe is not such a shrink-wrapped application, but could be incorporated into a cluster management system to support scalability.

8.6 Peer-to-Peer Routing

A peer-to-peer routing protocol (P2PRP) routes messages to locations, each determined by a location-independent key included in the corresponding message. Such a protocol may be used to build a so-called Distributed Hash Table, simply by having each location provide a way to map the key to a value. Well-known examples of P2PRPs include Chord [27], Pastry [23], and Tapestry [34]. These protocols have been used to implement distributed file systems and application-level multicast protocols.

As in Astrolabe, each location runs an agent. When receiving a message, the agent inspects the key and, unless it is responsible for the key itself, forwards the message to an agent that knows more about the key. The number of hops grows $O(\log n)$ for each of the P2PRPs named above, as does the size of the routing tables maintained by the agents.
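The logarithmic hop count can be seen in a deliberately idealized model. The sketch below is not Chord, Pastry, or Tapestry: it assumes a fully populated ring of 2^m agents in which each agent knows the agents at clockwise distances 1, 2, 4, ..., 2^(m-1) (a routing table of size m), and in which agent k is responsible for key k. The function name `route` is hypothetical.

```python
def route(start, key, m):
    """Forward a message from agent `start` to the agent responsible for `key`
    on a fully populated ring of 2**m agents; return the path of agent ids."""
    n = 1 << m
    target = key % n
    current = start % n
    path = [current]
    while current != target:
        distance = (target - current) % n
        # Use the routing-table entry that gets closest without overshooting:
        # the largest power of two that is <= the remaining distance.
        step = 1 << (distance.bit_length() - 1)
        current = (current + step) % n
        path.append(current)
    return path


if __name__ == "__main__":
    m = 10  # 1024 agents
    worst = max(len(route(0, key, m)) - 1 for key in range(1 << m))
    print("worst-case hops:", worst, "routing table size:", m)
```

Each hop removes the highest set bit of the remaining distance, so in this model a message needs at most m = log2(n) hops, matching the growth rate stated above for both hop count and routing table size.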

The functionalities of a P2PRP and Astrolabe are orthogonal, yet a comparison proves interesting. Although sometimes called location protocols, P2PRPs can't find an object; they can only place the object in a location where they will be able to retrieve the object later. Astrolabe, on the other hand, is able to find objects based on attributes of the object. However, Astrolabe is quite limited in how many objects it can find in a short period of time, while P2PRPs can potentially handle high loads.

In their implementations, there are some interesting differences between a P2P routing protocol and Astrolabe. While Astrolabe exploits locality heavily in its implementation, P2PRPs only try to optimize the length of hops chosen when routing (so-called proximity routing). At each hop in the routing path, the number of eligible next hops is small, and thus the effectiveness of this approach may be variable. Worse, as agents join and leave the system (so-called churn), objects have to be moved around, potentially across very large distances. The P2PRP protocols also employ ping-based failure detection across these large distances.

Additionally, whereas P2PRPs treat agents as identical, Astrolabe is able to select agents as representatives or routers based on various attributes such as load, operating system type, or longevity. That is, while P2PRPs assume the agents are homogeneous, Astrolabe tries to exploit the heterogeneity offered by the agents.

8.7 Epidemic Protocols

The earliest epidemic protocol that we are aware of is Gene Spafford's Usenet protocol, the predecessor of today's NNTP (RFC 977, February 1986), which gossiped news over UUCP connections starting in 1983. Although epidemic in nature, the connections were not intentionally randomized, so that a stochastic analysis would have been impossible. The first system that used such randomization on purpose was the previously described Clearinghouse directory service [11], to solve the scaling problems that the designers were facing in 1987 (initially with much success). More recently, the REFDBMS bibliographic database system [13] uses a gossip-based replication strategy for reference databases, while Bimodal Multicast [5] (see also Section 3.2) is also based on gossip. Astrolabe itself is based directly on earlier work described in [29].

9 Conclusions and Future Work

Astrolabe is a DNS-like distributed management service, but differs from DNS in some important ways. First, in Astrolabe, updates propagate in seconds or tens of seconds, rather than tens of minutes at best (and days more commonly) in DNS. Second, Astrolabe users can introduce new attributes easily. Third, Astrolabe supports on-the-fly aggregation of attributes, making it possible, for example, to find resources based on attributes rather than by name. A restricted form of mobile code, in the form of SQL SELECT statements, makes it possible to rapidly change the way attributes are aggregated. This, and other aspects of our database-like presentation, build on the existing knowledge base and tool set for database application development, making it easy to integrate the technology into existing settings. Finally, Astrolabe uses a peer-to-peer architecture that is easy to deploy, and its gossip protocol is highly scalable, fault tolerant, and secure.

These properties enable new distributed applications, such as scalable system management, resource location services, sensor networks, and application-level routing in peer-to-peer services.

We are currently pursuing several improvements to Astrolabe. Currently, the amount of information maintained by the attributes of an Astrolabe zone cannot grow beyond about a kilobyte. By sending deltas rather than entire new versions, we believe we can significantly compress the amount of information sent in the gossip protocol and increase the size or number of the attributes significantly. Alternatively or additionally, the set reconciliation techniques of [17] can be used.

Also aimed at increasing scale, we are looking at ways to direct the flow of information towards the nodes requesting aggregation. Although a certain level of control over the flow is already possible using the copy attribute of AFCs, we are looking at additional ways that would allow us to increase the number of concurrently installed aggregations considerably.

Acknowledgements

We would like to thank the following people for various contributions to the Astrolabe design and this paper: Tim Clark, Al Demers, Dan Dumitriu, Terrin Eager, Johannes Gehrke, Barry Gleeson, Indranil Gupta, Kate Jenkins, Anh Look, Yaron Minsky, Andrew Myers, Venu Ramasubramanian, Richard Shoenhair, Emin Gun Sirer, Werner Vogels, Lidong Zhou, and the anonymous reviewers.

References

[1] W. Adjie-Winoto, E. Schwartz, H. Balakrishnan, and J. Lilley. The design and implementation of an intentional naming system. In Proc. of the Seventeenth ACM Symp. on Operating Systems Principles, Kiawah Island, SC, December 1999.

[2] M.K. Aguilera, R.E. Strom, D.C. Sturman, M. Astley, and T.D. Chandra. Matching events in a content-based subscription system. In Proc. of the ACM Symp. on Principles of Distributed Computing, Atlanta, GA, May 1999.

[3] D.G. Andersen, H. Balakrishnan, M.F. Kaashoek, and R. Morris. Resilient overlay networks. In Proc. of the Eighteenth ACM Symp. on Operating Systems Principles, pages 131–145, Banff, Canada, October 2001.

[4] M. Balazinska, H. Balakrishnan, and D. Karger. INS/Twine: A scalable peer-to-peer architecture for intentional resource discovery. In F. Mattern and M. Naghshineh, editors, Pervasive Computing, First International Conference, Pervasive 2002, Zurich, Switzerland, August 26-28, 2002, Proceedings, volume 2414 of Lecture Notes in Computer Science, pages 195–210. Springer, 2002.

[5] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal Multicast. ACM Transactions on Computer Systems, 17(2):41–88, May 1999.

[6] K. P. Birman and T. A. Joseph. Exploiting virtual synchrony in distributed systems. In Proc. of the Eleventh ACM Symp. on Operating Systems Principles, pages 123–138, Austin, TX, November 1987.

[7] A. Birrell, R. Levin, R. Needham, and M. Schroeder. Grapevine: an exercise in distributed computing. CACM, 25(4):260–274, April 1982.

[8] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. CACM, 13(7):422–426, July 1970.

[9] P. Bonnet, J.E. Gehrke, and P. Seshadri. Towards sensor database systems. In Proc. of the Second Int. Conf. on Mobile Data Management, Hong Kong, January 2001.

[10] A. Carzaniga, D.S. Rosenblum, and A.L. Wolf. Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems, 19(3):332–383, August 2001.

[11] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proc. of the Sixth ACM Symp. on Principles of Distributed Computing, pages 1–12, Vancouver, BC, August 1987.

[12] R.A. Golding. A weak-consistency architecture for distributed information services. Computing Systems, 5(4):379–405, Fall 1992.

[13] R.A. Golding, D.D.E. Long, and J. Wilkes. The REFDBMS distributed bibliographic database system. In Proc. of Usenix'94, pages 47–62, Winter 1994.

[14] S.D. Gribble, M. Welsh, R. Von Behren, E.A. Brewer, D. Culler, N. Borisov, S. Czerwinski, R. Gummadi, J. Hill, A. Joseph, R.H. Katz, Z.M. Mao, S. Ross, and B. Zhao. The Ninja architecture for robust internet-scale systems and services. Computer Networks, Special Issue of Computer Networks on Pervasive Computing, 35(4):473–497, 2001.

[15] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient wireless sensor networks with low-level naming. In Proc. of the Eighteenth ACM Symp. on Operating Systems Principles, pages 146–159, Banff, Canada, October 2001.

[16] B.W. Lampson. Designing a global name service. In Proc. of the Fifth ACM Symp. on Principles of Distributed Computing, Calgary, Alberta, August 1986.

[17] Y. Minsky. Spreading Rumors Cheaply, Quickly, and Reliably. PhD thesis, Department of Computer Science, Cornell University, Ithaca, NY, February 2002.

[18] P. Mockapetris. The Domain Name System. In Proc. of IFIP 6.5 International Symposium on Computer Messaging, May 1984.

[19] B. M. Oki, M. Pfluegl, A. Siegel, and D. Skeen. The Information Bus - an architecture for extensible distributed systems. In Proc. of the Fourteenth ACM Symp. on Operating Systems Principles, pages 58–68, Asheville, NC, December 1993.

[20] K. Petersen, M.J. Spreitzer, D.B. Terry, M.M. Theimer, and A.J. Demers. Flexible update propagation for weakly consistent replication. In Proc. of the Sixteenth ACM Symp. on Operating Systems Principles, pages 288–301, Saint-Malo, France, October 1997.

[21] S. Radicati. X.500 Directory Services: Technology and Deployment. International Thomson Computer Press, 1994.

[22] G. Reese. Database Programming with JDBC and Java, 2nd Edition. O'Reilly, 2000.

[23] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. of the Middleware 2001, November 2001.

[24] R.E. Sanders. ODBC 3.5 Developer's Guide. M&T Books, 1998.

[25] A. Snoeren, K. Conley, and D.K. Gifford. Mesh-based content routing using XML. In Proc. of the Eighteenth ACM Symp. on Operating Systems Principles, pages 160–173, Banff, Canada, October 2001.

[26] W. Stallings. SNMP, SNMPv2, and CMIP. Addison-Wesley, 1993.

[27] I. Stoica, R. Morris, D. Karger, and M.F. Kaashoek. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. of the '95 Symp. on Communications Architectures & Protocols, Cambridge, MA, August 1995. ACM SIGCOMM.

[28] D.L. Tennenhouse, J.M. Smith, W.D. Sincoskie, D.J. Wetherall, and G.J. Minden. A survey of active network research. IEEE Communications Magazine, 35(1):80–86, January 1997.

[29] R. Van Renesse. Distributed Processing System with Replicated Management Information Base. United States Patent 6,411,967, June 2002.

[30] R. Van Renesse. Power-aware epidemics. In Proc. of the International Workshop on Reliable Peer-to-Peer Distributed Systems, Osaka, Japan, October 2002.

[31] R. Van Renesse and D. Dumitriu. Collaborative networking in an uncooperative internet. In Proc. of the 21st Symposium on Reliable Distributed Systems, Osaka, Japan, October 2002.

[32] R. van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proc. of Middleware'98, pages 55–70. IFIP, September 1998.

[33] M. van Steen, F.J. Hauck, P. Homburg, and A.S. Tanenbaum. Locating objects in wide-area systems. IEEE Communications Magazine, pages 104–109, January 1998.

[34] B.Y. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, University of California, Berkeley, Computer Science Department, 2001.
