Don’t Settle for Eventual Consistency

Stronger properties for low-latency geo-replicated storage.

BY WYATT LLOYD, MICHAEL J. FREEDMAN, MICHAEL KAMINSKY, AND DAVID G. ANDERSEN

DOI: 10.1145/2596624
Article development led by queue.acm.org

GEO-REPLICATED STORAGE PROVIDES copies of the same data at multiple, geographically distinct locations. Facebook, for example, geo-replicates its data (profiles, friends lists, likes, and so on) to datacenters on the East and West coasts of the U.S. and in Europe. In each datacenter, a tier of separate Web servers accepts browser requests and then handles those requests by reading and writing data from the storage system, as shown in Figure 1.

Geo-replication brings two key benefits to Web services: fault tolerance and low latency. It provides fault tolerance through redundancy: if one datacenter fails, others can continue to provide the service. It provides low latency through proximity: clients can be directed to and served by a nearby datacenter to avoid speed-of-light delays associated with cross-country or round-the-globe communication.

Geo-replication brings its challenges, however. The famous CAP theorem, conjectured by Brewer1 and proved by Gilbert and Lynch,7 shows it is impossible to create a system that has strong consistency, is always available for reads and writes, and is able to continue operating during network partitions. Each of these properties is highly desirable. Strong consistency—more formally known as linearizability—makes programming easier. Availability ensures front-end Web servers can always respond to client requests. Partition tolerance ensures the system can continue operating even when datacenters cannot communicate with one another. Faced with the choice of at most two of these properties, many systems5,8,16 have chosen to sacrifice strong consistency to ensure availability and partition tolerance. Other systems—for example, those that deal with money—sacrifice availability and/or partition tolerance to achieve the strong consistency necessary for the applications built on top of them.4,15

The former choice of availability and partition tolerance is not surprising, however, given it also enables the storage system to provide low latency—defined as latency for reads and writes that is less than half the speed-of-light delay between the two most distant datacenters. A proof that predates the CAP theorem by 14 years10 shows it is impossible to guarantee low latency and provide strong consistency at the same time.



Front-end Web servers read or write data from the storage system potentially many times to answer a single request; therefore, low latency in the storage system is critical for enabling fast page-load times, which are linked to user engagement with a service—and, thus, revenue. An always-available and partition-tolerant system can provide low latency on the order of milliseconds by serving all operations in the local datacenter. A strongly consistent system must contact remote datacenters for reads and/or writes, which takes hundreds of milliseconds.

Thus, systems that sacrifice strong consistency gain much in return. They can be always available, guarantee responses with low latency, and provide partition tolerance. In Clusters of Order-preserving Servers (COPS),11 developed for our original work on this subject, we coined the term ALPS for systems that provide these three properties—always available, low latency, and partition tolerance—and one more: scalability. Scalability implies that adding storage servers to each datacenter produces a proportional increase in storage capacity and throughput. Scalability is critical for modern systems because data has grown far too large to be stored or served by a single machine.

The question remains as to what consistency properties ALPS systems can provide. Before answering this, let’s consider the consistency offered by existing ALPS systems. For systems such as Amazon’s Dynamo, LinkedIn’s Project Voldemort, and Facebook/Apache’s Cassandra, the answer is eventual consistency.

Figure 1. Geo-replicated storage. Browsers connect to a Web tier in each datacenter; the Web servers read and write a local data store, and the data stores geo-replicate to one another.

Figure 2. Anomaly 1: Comment reordering. West Coast: Alice posts “I’ve lost my wedding ring,” then “Whew found it upstairs!”, and Bob replies “I’m glad to hear that.” East Coast: only Alice’s first post and Bob’s reply appear.

Figure 3. Anomaly 2: Photo privacy violation. West Coast: the student deletes incriminating photos, then accepts the advisor as a friend. East Coast: the acceptance appears before the deletion, so the advisor can see the incriminating photos.


Eventual Consistency

Eventual consistency is a widely used term that can have many meanings. Here, it is defined as the strongest property provided by all systems that claim to provide it: namely, writes to one datacenter will eventually appear at other datacenters, and if all datacenters have received the same set of writes, they will have the same values for all data.

Contrast this with the following part of the definition of strong consistency (linearizability): a total order exists over all operations in the system. This makes programming a strongly consistent storage system simple, or at least simpler: it behaves as a single entity. Eventual consistency does not say anything about the ordering of operations. This means that different datacenters can reflect arbitrarily different sets of operations. For example, if someone connected to the West Coast datacenter sets A=1, B=2, and C=3, then someone else connected to the East Coast datacenter may see only B=2 (not A=1 or C=3), and someone else connected to the European datacenter may see only C=3 (not A=1 or B=2). This makes programming eventually consistent storage complicated: operations can appear out of order.

The out-of-order arrival leads to many potential anomalies in eventually consistent systems. Here are a few examples for a social network:

Figure 2 shows that in the West Coast datacenter, Alice posts, she comments, and then Bob comments. In the East Coast datacenter, however, Alice’s comment has not appeared, making Bob look less than kind. Figure 3 shows that in the West Coast datacenter, a grad student carefully deletes incriminating photos before accepting an advisor as a friend. Unfortunately, in the East Coast datacenter, the friend acceptance appears before the photo deletions, allowing the advisor to see the photos.3

Figure 4 shows that in the West Coast datacenter, Alice uploads photos, creates an album, and then adds the photos to the album, but in the East Coast datacenter, the operations appear out of order and her photos do not end up in the album. Finally, in Figure 5, Cindy and Dave have $1,000 in their joint bank account. Concurrently, Dave withdraws $1,000 from the East Coast datacenter and Cindy withdraws $1,000 from the West Coast datacenter. Once both withdrawals propagate to each datacenter, their account is in a consistent state (-$1,000), but it is too late to prevent the mischievous couple from making off with their ill-gotten gains.

Is eventual consistency the only option? Given that theoretical results show the ALPS properties are incompatible with strong consistency, do we have to settle for eventual consistency? Are we stuck with all the anomalies that come with eventual consistency? No!

Our research systems, COPS11 and Eiger,12 have pushed on the properties that ALPS systems can provide. In particular, they provide causal consistency instead of eventual, which prevents the first three anomalies. (The fourth anomaly in Figure 5 is unfortunately unavoidable in a system that accepts writes in every location and guarantees low latency.) In addition, they provide limited forms of read-only and write-only transactions that allow programmers to consistently read or write data spread across many different machines in a datacenter.

    Causal Consistency

Causal consistency ensures that operations appear in the order the user intuitively expects. More precisely, it enforces a partial order over operations that agrees with the notion of potential causality. If operation A happens before operation B, then any datacenter that sees operation B must see operation A first.

Three rules define potential causality:9

1. Thread of Execution. If a and b are two operations in a single thread of execution, then a → b if operation a happens before operation b.
2. Reads-From. If a is a write operation and b is a read operation that returns the value written by a, then a → b.
3. Transitivity. For operations a, b, and c, if a → b and b → c, then a → c. Thus, the causal relationship between operations is the transitive closure of the first two rules.
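To make the rules concrete, here is a rough sketch (not the authors’ code) that builds the happens-before relation over a small log of operations like those in Figure 6; the operation format and field names are assumptions made for this example.

```python
from itertools import product

# Each operation: id, issuing client, kind, and key. A read's "reads_from"
# field names the write whose value it returned. This format is invented
# purely for the sketch.
ops = [
    {"id": "w1", "client": "Alice", "kind": "write", "key": "Alice:Town"},
    {"id": "r2", "client": "Bob",   "kind": "read",  "key": "Alice:Town", "reads_from": "w1"},
    {"id": "w3", "client": "Bob",   "kind": "write", "key": "Bob:Town"},
]

def happens_before(ops):
    """Return the set of (a, b) pairs with a -> b under potential causality."""
    order = set()
    # Rule 1: thread of execution -- program order within each client.
    by_client = {}
    for op in ops:                                   # ops are listed in issue order
        for earlier in by_client.get(op["client"], []):
            order.add((earlier["id"], op["id"]))
        by_client.setdefault(op["client"], []).append(op)
    # Rule 2: reads-from -- a write happens before any read that returns it.
    for op in ops:
        if op["kind"] == "read" and "reads_from" in op:
            order.add((op["reads_from"], op["id"]))
    # Rule 3: transitivity -- close the relation.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(order), list(order)):
            if b == c and (a, d) not in order:
                order.add((a, d))
                changed = True
    return order

print(happens_before(ops))  # includes ('w1', 'w3'): via r2 and transitivity
```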

Figure 4. Anomaly 3: Broken photo album. West Coast: Alice uploads photos, creates an album, and adds the photos to the album. East Coast: the operations arrive out of order; at first no photos or album exist, then the album exists but is empty, and it remains empty.

Figure 5. Anomaly 4: Double money withdrawal. West Coast: Cindy takes $1,000. East Coast: Dave concurrently takes $1,000. After both withdrawals propagate, the balance is -$1,000.



Causal consistency ensures operations appear in an order that agrees with these rules. This makes users happy because their operations are applied everywhere in the order they intended. It makes programmers happy because they no longer have to reason about out-of-order operations.

Causal consistency prevents each of our first three anomalies, turning them into regularities.

    Regularity 1: No Missing Comments

In the West Coast datacenter, Alice posts, and then she and Bob comment:

Op1 Alice: I’ve lost my wedding ring.
Op2 Alice: Whew, found it upstairs.
Op3 [Bob reads Alice’s post and comment.]
Op4 Bob: I’m glad to hear that.

Op1 → Op2 by the thread-of-execution rule; Op2 → Op3 by the reads-from rule; Op3 → Op4 by the thread-of-execution rule.

Write operations are the only operations propagated and applied to other datacenters, so the full causal ordering that is enforced is Op1 → Op2 → Op4.

Now, in the East Coast datacenter, operations can appear only in an order that agrees with causality. Thus:

Op1 Alice: I’ve lost my wedding ring.
Then
Op1 Alice: I’ve lost my wedding ring.
Op2 Alice: Whew, found it upstairs.
Then
Op1 Alice: I’ve lost my wedding ring.
Op2 Alice: Whew, found it upstairs.
Op4 Bob: I’m glad to hear that.

but never the anomaly that makes Bob look unkind.

    Regularity 2: No Leaked Photos3

In the West Coast datacenter, a grad student carefully deletes incriminating photos before accepting an advisor as a friend:

Op1 [Student deletes incriminating photos.]
Op2 [Student accepts advisor as a friend.]

Op1 → Op2 by the thread-of-execution rule.

Now, in the East Coast datacenter, operations can appear only in an order that agrees with causality, which is:

[Student deletes incriminating photos.]
Then
[Student deletes incriminating photos.]
[Student accepts advisor as a friend.]

but never the anomaly that leaks photos to the student’s advisor.

    Regularity 3: Normal Photo Album

In the West Coast datacenter, Alice uploads photos and then adds them to her Summer 2013 album:

Op1 [Alice uploads photos.]
Op2 [Alice creates an album.]
Op3 [Alice adds photos to the album.]

Op1 → Op2 → Op3 by the thread-of-execution rule.

Now, in the East Coast datacenter, the operations can appear only in an order that agrees with causality:

Op1 [Alice uploads photos.]
Then
Op1 [Alice uploads photos.]
Op2 [Alice creates an album.]
Then
Op1 [Alice uploads photos.]
Op2 [Alice creates an album.]
Op3 [Alice adds photos to the album.]

but never in a different order that results in an empty album or complicates what a programmer must think about.

What causal consistency cannot do. Anomaly 4 represents the primary limitation of causal consistency: it cannot enforce global invariants. Anomaly 4 has an implicit global invariant—that bank accounts cannot go below $0—that is violated. This invariant cannot be enforced in an ALPS system. Availability dictates that operations must complete, and low latency ensures they are faster than the time it takes to communicate between datacenters. Thus, the operations must return before the datacenters can communicate and discover the concurrent withdrawals.

True global invariants are quite rare, however. E-commerce sites, where it seems inventory cannot go below 0, have back-order systems in place to deal with exactly that scenario. Even some banks do not enforce global $0 invariants, as shown by a recent concurrent withdrawal attack on ATMs that extracted $40 million from only 12 account numbers.14

Figure 6. Causality and dependency. (a) A set of example operations; (b) the graph of causality between them; (c) the corresponding dependency graph; (d) dependencies, with one-hop dependencies marked with an asterisk.

(a) Operations:
User    Op ID   Operation
Alice   w1      write(Alice:Town, NYC)
Bob     r2      read(Alice:Town)
Bob     w3      write(Bob:Town, LA)
Alice   r4      read(Bob:Town)
Carol   w5      write(Carol:Likes, ACM, 8/31/12)
Alice   w6      write(Alice:Likes, ACM, 9/1/12)
Alice   r7      read(Carol:Likes, ACM)
Alice   w8      write(Alice:Friends, Carol, 9/2/12)

(d) Dependencies:
Op ID   Dependencies
w1      —
w3      w1*
w5      —
w6      w3*, w1
w8      w6*, w5*, w3, w1


Another limitation of causal consistency also stems from the possibility of concurrent operations. Programmers must decide how to deal with concurrent write operations to the same data at different datacenters. A common strategy is the last-writer-wins rule, in which one concurrent update overwrites the other. For example, a social-network user can have only one birthday. Some situations, however, require a more careful approach. Consider a scenario where Alice has two pending friend requests being accepted concurrently at different datacenters. Each accepted friend request should increase Alice’s friend count by one. With the last-writer-wins rule, however, one of the increments will overwrite the other. Instead, the two increments must be merged to increase Alice’s total friend count by two. With causally consistent storage (as with eventually consistent storage), programmers must determine if the last-writer-wins rule is sufficient, or if they have to write a special function for merging concurrent updates.
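A minimal sketch of such a merge function, under the assumption that the friend count is stored as per-datacenter contributions rather than a single overwritable total (the representation and names are invented for illustration):

```python
# Hypothetical representation: each datacenter records its own increment count,
# so concurrent "accept friend request" operations add to different entries
# instead of overwriting a single total.
def merge_friend_counts(counts_a, counts_b):
    """Merge two {datacenter_id: increments} maps; the true total is the sum."""
    merged = dict(counts_a)
    for dc, n in counts_b.items():
        merged[dc] = max(merged.get(dc, 0), n)   # keep the most advanced count per writer
    return merged

west = {"west": 1}            # one accepted request applied on the West Coast
east = {"east": 1}            # a concurrent accept applied on the East Coast
total = sum(merge_friend_counts(west, east).values())   # 2, not 1
```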

The final limitation of causal consistency is that it cannot see or enforce causality outside of the system. The classic example is a cross-country phone call. If Alice on the West Coast updates her profile, calls Bob on the East Coast, and then Bob updates his profile, the system will not see the causal relationship between the two updates and will not enforce any ordering between them.

Providing causal consistency. At a high level, our systems, COPS and Eiger, capture causality through a client library and then enforce the observed ordering when replicating writes to other datacenters. The ordering is enforced by delaying the application of a write until all causally previous operations have been applied. This delay is necessary only in remote datacenters; all causally previous operations have already been applied at the datacenter that accepts the write. The client library that tracks causality sits between the Web servers and the storage tiers in each datacenter. (In current implementations it is on the Web servers.) Individual clients are identified through a special actor_id field in the API to the client library that allows the operations of different users on the same Web server to be disentangled. For example, in a social network the unique user ID could be used as the actor_id.

Let’s first describe an inefficient system that provides causality and then explain how to refine it to make it efficient.

Our systems operate by tracking and enforcing the ordering only between write operations. Read operations establish causal links between write operations by different clients, but they are not replicated to other datacenters and thus do not need to have an ordering enforced on them. For example, in anomaly/regularity 1, Bob’s read (Op3) of Alice’s post (Op1) and comment (Op2) creates the causal link that orders Bob’s later comment (Op4) after Alice’s post and comment. A causal link between two write operations is called a dependency—the later operation depends on the earlier operation.

Figure 6 shows the relationship between the graph of causality and the graph of dependencies. A dependency is a small piece of metadata that uniquely identifies a write operation. It has two fields: a key, which is the data location that is updated by the write; and a timestamp, which is a globally unique logical timestamp assigned by the logical clock of the server in the datacenter where it was originally written. Figure 6 illustrates (a) a set of example operations; (b) the graph of causality between them; (c) the corresponding dependency graph; and (d) a table listing dependencies with one-hop dependencies shown in bold.

In the initial design the client library tracks the full set of dependencies for each client. Tracking all dependencies for a client requires tracking three sets of write operations:

1. All of the client’s previous write operations, because of the thread-of-execution rule.
2. All of the operations that wrote values it read, because of the reads-from rule.
3. All of the operations that the operations in 1 and 2 depend on, because of the transitivity rule.

Tracking the first set is straightforward: servers return the unique timestamp assigned to each write to the client library, which then adds a dependency on that write. Tracking the second set is also straightforward: servers return the timestamp of the write that wrote the value when they respond to reads, and then the client library adds a dependency on that write. The third set of operations is a bit trickier: it requires that every write carry with it all of its dependencies, that these dependencies are stored with the value, returned with reads of that value, and then added to the reader’s set of dependencies by the client library.
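The sketch below shows what this kind of client library might look like for a simple key-value API; the Dep tuple, the store interface, and the method names are assumptions for illustration rather than the COPS or Eiger API.

```python
from collections import namedtuple

# A dependency names a write by the key it updated and the globally unique
# logical timestamp assigned by the server that accepted it.
Dep = namedtuple("Dep", ["key", "timestamp"])

class CausalClientLibrary:
    """Tracks the full dependency set for one actor (e.g., one user ID)."""

    def __init__(self, local_store, actor_id):
        self.store = local_store      # client of the local datacenter's storage tier
        self.actor_id = actor_id
        self.deps = set()             # every write this client currently depends on

    def put(self, key, value):
        # Attach all known dependencies to the write (the inefficient initial design).
        ts = self.store.write(key, value, deps=list(self.deps))
        self.deps.add(Dep(key, ts))            # rule 1: thread of execution
        return ts

    def get(self, key):
        value, ts, stored_deps = self.store.read(key)
        if ts is not None:
            self.deps.add(Dep(key, ts))        # rule 2: reads-from
            self.deps.update(stored_deps)      # rule 3: transitivity
        return value
```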

With the full set of dependencies for each client stored in its client library, all of these dependencies can be attached to each write operation the client issues. Now when a server in a remote datacenter receives a write with its full set of dependencies, it blocks the write and verifies each dependency is satisfied. Blocking these replicated write operations is acceptable because they are not client-facing and do not block reads to whatever data they update. Here, we have explicitly chosen to delay these write operations until they can appear in the correct order, as shown in Figure 7. The dependency check for Bglad does not return until after Afound is applied on the East Coast, which ensures Bob is never misunderstood.

The system described thus far provides causal consistency and all of the ALPS properties. Causal consistency is provided by tracking causality with a client library and enforcing the causal order with dependency checks on replicated writes. Availability and partition tolerance are ensured by keeping all operations inside the local datacenter. Low latency is guaranteed by keeping all operations local, nonblocking, and lock free. Finally, a fully decentralized design ensures the system has scalability.

The current system, however, is inefficient. It has a huge amount of dependency metadata that travels around with write operations and a huge number of dependency checks to execute before applying them. Both of these factors steal throughput from user-facing operations and reduce the utility of the system. Luckily, our systems can exploit the transitivity inherent in the graph of causality to drastically reduce the dependencies that must be tracked and enforced. The subset of dependencies being tracked is the one-hop dependencies, which have an arc to the current operation in the graph of causality. (Note that in graph-theoretic terms, the one-hop dependencies subset is the direct predecessor set of an operation.) In Figure 6, the one-hop dependencies are shown in bold. They transitively capture all of the ordering constraints on an operation. In particular, because all other dependencies are depended upon by at least one of the one-hop dependencies by definition, if this current operation occurs after the one-hop dependencies, then by transitivity it will occur after all others as well.
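A rough sketch of the receiving datacenter’s dependency check and of the one-hop reduction, again with an assumed store interface and helper names:

```python
import time

def apply_replicated_write(store, key, value, ts, deps):
    """Block a replicated write until every listed dependency is visible locally."""
    for dep in deps:
        # Wait until the write named by (dep.key, dep.timestamp) has been
        # applied in this datacenter; a real system would use notifications.
        while not store.has_applied(dep.key, dep.timestamp):
            time.sleep(0.001)
    store.apply(key, value, ts)

def one_hop(deps, depends_on):
    """Keep only dependencies that no other retained dependency already covers.

    depends_on(a, b) is True if write a (transitively) depends on write b.
    """
    return {d for d in deps
            if not any(other != d and depends_on(other, d) for other in deps)}
```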

Figure 7. Regularity 1-Redux. West Coast: Alice’s post, Alice’s comment, and Bob’s comment are applied in order. East Coast: Alice’s post appears, and Bob’s comment is delayed until Alice’s comment has been applied.


    Limited Transactions

In addition to causal consistency, our systems provide limited forms of transactions. These include read-only transactions, which transactionally read data spread across many servers in a datacenter, and write-only transactions, which transactionally update data spread across many servers in a datacenter.

These limited transactions are necessitated—and complicated—by the current scale of data. Data for many services is now far too large to fit on a single machine and instead must be spread across many machines. With data resident on many machines, extracting a consistent view of that data becomes tricky. Even though a data store itself may always be consistent, a client can extract an inconsistent view because the client’s reads will be served at different times by different servers. This, unfortunately, can reintroduce many of the anomalies inherent in eventual consistency. In Figure 8, for example, in the West Coast datacenter, a grad student removes photo-viewing permissions from an advisor and uploads incriminating photos. The advisor concurrently tries to view the student’s photos and, incorrectly, is shown the incriminating photos. To avoid these anomalies, causal consistency must be extended from the storage system to the Web servers and then on to users of the service. This can be done using read-only transactions.

Read-only transactions allow programmers to transactionally read data spread across many servers, yielding a consistent view of the data. The interface for a read-only transaction is simple: a list of data locations. Instead of issuing many individual reads for different data locations, a programmer issues a single read for all those locations. This is similar to batching these operations, which is often done to make dispatching reads more efficient—except it also ensures the results are isolated.

With read-only transactions, anomaly 5 can now be converted into a regularity as well. Figure 9 shows that with read-only transactions, the permissions and photos are read together transactionally, yielding any of the three valid states shown, but never the anomaly that leaks the incriminating photos to the student’s advisor.
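As a usage sketch, the check in Figure 9 might look like the following, where read_only_txn and the key names are placeholders rather than the systems’ actual API:

```python
def view_photos_for(client, viewer):
    # One call, many locations: the library returns a mutually consistent snapshot.
    snapshot = client.read_only_txn(["student:permissions", "student:photos"])
    if viewer in snapshot["student:permissions"]:        # still allowed to view?
        return snapshot["student:photos"]
    return []
```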


Figure 8. Anomaly 5: Leaked photos. West Coast: the student removes the advisor on the friends server, then adds bad photos on the photo server. The advisor’s concurrent friends check and photo fetch return “yes, friends” along with the bad photos.

Figure 9. Regularity 5: No leaked photos. With a read-only transaction, the advisor’s combined check returns one of three valid states: still friends with the old photos, not friends with the old photos, or not friends with the bad photos, but never friends with the bad photos.


Providing read-only transactions. There are many techniques for ensuring isolated access to data locations spread across many machines. The most popular of these include two-phase locking (2PL), using a transaction manager (TM) to schedule when reads are applied, and maintaining multiple versions of data with multiversion concurrency control (MVCC). The first two approaches are at odds with the ALPS properties. All forms of locking, and 2PL in particular, can encounter locks that are already acquired and then either fail the operation, which gives up on availability, or block the operation until it can acquire the lock, which gives up on low latency. A TM is a centralized entity, and directing all reads through it is a bottleneck that inhibits scalability. This leaves MVCC, and our approach may be viewed as a particularly aggressive variant of it that is possible because our transactions are limited.

The basic idea behind our read-only transaction algorithm is that we want to read the entire distributed data store at a single logical time. For logical time, each node in the system keeps a logical clock that is updated every time an event occurs (for example, writing a value or receiving a message). When sending a message, a node includes a timestamp t set to its logical clock c; when receiving a message, a node sets c ← max(c, t + 1).9 Logical time LT provides a progressing logical view of the system even though it is distributed and there is no centralized coordination of it. If event a happens before event b, then LT(a) < LT(b). Thus, if distributed data is read at a single logical time t, we know that all events that happen before the values seen at time t have lower logical times and thus are reflected in the results. Figure 10 shows an example of this graphically, with validity periods for values, represented by letters, written to different locations.
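The clock rule from the text, written out as a small illustrative class (not the systems’ code):

```python
class LogicalClock:
    """Lamport-style logical clock: c advances on every local event and on receipt."""

    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1                      # e.g., writing a value
        return self.c

    def send(self):
        self.c += 1
        return self.c                    # timestamp t carried on the message

    def receive(self, t):
        self.c = max(self.c, t + 1)      # c <- max(c, t + 1)
        return self.c
```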

You can determine whether values within a set are consistent with one another by annotating them with the logical time they became visible and the logical time they were overwritten. For example, in Figure 10 consistent sets include {A,J,X}, {B,K,X}, {B,K,Y}, and {B,L,Y}, while inconsistent sets include {A,K,X}, {B,J,Y}, and {C,L,X}, among others. Our servers annotate values in this way and include the annotations when returning results to the client library so it can determine if values are mutually consistent.

Our read-only transaction algorithm is run by the client library and takes at most two rounds of parallel reads. In the first round, the client library sends out parallel requests for all requested data. Servers respond with their current visible values and validity intervals, which span the logical time the value was written to the current logical time at the server. The value may be valid at future logical times as well, but conservatively we know it is valid for at least this interval. Once the client receives all responses, it determines if all the values are mutually consistent by checking for intersection in their validity intervals. If there is intersection—which is almost always the case unless some of the values are overwritten concurrently with the read-only transaction—then the client library knows the values are consistent and returns them to the client.

If the validity intervals do not all intersect, then the process moves to the second round of the algorithm. The second round begins by calculating a single logical time at which to read values, called the effective time. It is calculated by choosing a time that ensures an up-to-date view of the data instead of being stuck on an old consistent cut of it, and that allows the reuse of many of the values retrieved in the first round. The client library then issues a second round of parallel reads for all data for which it does not have a valid value at the effective time. These second-round reads ask for the value of the data at the effective time, and servers answer them by traversing the older versions of a value until they find the one that was valid at the effective time. Figure 11 shows the second round in action. Figure 11a is a read-only transaction that completes in a single round, while Figure 11b is a read-only transaction that requires a second round; it requests data location 1 at the effective time 15 and receives value B in response.
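A condensed sketch of the two-round client logic described above; the server interface (a read that returns a value with its validity interval, and a read at a given logical time) is assumed for illustration:

```python
def read_only_txn(servers, keys):
    """Round 1: read everything in parallel; Round 2 only if the results don't overlap."""
    first = {k: servers[k].read(k) for k in keys}       # value, visible_from, valid_until
    start = max(r.visible_from for r in first.values())
    end = min(r.valid_until for r in first.values())
    if start <= end:
        # Validity intervals intersect: the values are mutually consistent.
        return {k: r.value for k, r in first.items()}

    # Choose a single effective time that keeps the view up to date and lets us
    # reuse as many first-round values as possible.
    effective = start
    result = {}
    for k, r in first.items():
        if r.visible_from <= effective <= r.valid_until:
            result[k] = r.value                              # still valid at the effective time
        else:
            result[k] = servers[k].read_at(k, effective).value   # second-round read
    return result
```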

This read-only transaction algorithm is specifically designed to maintain all the ALPS properties and provide high performance. It is available

Figure 10. Logical time. Location 1 holds value A from logical time 1, B from 11, and C from 21; Location 2 holds J from 2, K from 12, and L from 19; Location 3 holds X from 3 and Y from 15.

Figure 11. Read-only transactions. (a) A transaction whose first-round validity intervals intersect and that completes in one round. (b) A transaction whose intervals do not intersect; the client re-requests location 1 at the effective time 15 and receives value B.
