
Reliable Client-Server Communication

Reliable Communication

• So far: Concentrated on process resilience (by means of process groups).
• What about reliable communication channels?
• Error detection:
  – Framing of packets to allow for bit error detection
  – Use of frame numbering to detect packet loss
• Error correction:
  – Add so much redundancy that corrupted packets can be automatically corrected
  – Request retransmission of lost packets, or of the last N packets
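The two detection ideas on this slide can be sketched in a few lines. This is only an illustration under assumptions not in the slides: the frame layout (a 4-byte sequence number, the payload, then a CRC-32 trailer) and the helper names `make_frame`, `parse_frame`, and `missing_sequence_numbers` are hypothetical.

```python
import struct
import zlib

# Hypothetical frame layout: 4-byte sequence number | payload | 4-byte CRC-32.
UINT32 = struct.Struct("!I")

def make_frame(seq: int, payload: bytes) -> bytes:
    """Frame a packet so the receiver can detect bit errors and loss."""
    body = UINT32.pack(seq) + payload
    return body + UINT32.pack(zlib.crc32(body))

def parse_frame(frame: bytes):
    """Return (seq, payload), or None if the frame is too short or corrupted."""
    if len(frame) < 2 * UINT32.size:
        return None
    body, trailer = frame[:-UINT32.size], frame[-UINT32.size:]
    (crc,) = UINT32.unpack(trailer)
    if zlib.crc32(body) != crc:
        return None  # bit error detected
    (seq,) = UINT32.unpack(body[:UINT32.size])
    return seq, body[UINT32.size:]

def missing_sequence_numbers(received_seqs, expected_count):
    """Frame numbering: gaps in the sequence reveal lost packets."""
    return sorted(set(range(expected_count)) - set(received_seqs))
```

A receiver that gets `None` from `parse_frame`, or a non-empty list from `missing_sequence_numbers`, would then request retransmission.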

Reliable Communication

• Observation: Most of this work assumes point-to-point communication
  – TCP is reliable
  – It masks omission failures (loss of messages)
  – What if the TCP connection breaks?

• High-level communication facilities.

Traditional RPC

Principle of RPC between a client and server program.

Remote Procedure Calls

• A remote procedure call occurs in the following steps:

1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client’s OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server’s OS sends the message to the client’s OS.
9. The client’s OS gives the message to the client stub.
10. The stub unpacks the result and returns to the client.
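The ten steps can be traced in a small in-process sketch. Everything here is an assumption made for illustration: the procedure `add`, the JSON message format, and the `transport` function, which stands in for both operating systems and the network.

```python
import json

def add(a, b):
    """The server procedure (hypothetical example)."""
    return a + b

PROCEDURES = {"add": add}

def server_stub(request_bytes: bytes) -> bytes:
    request = json.loads(request_bytes)          # step 5: unpack the parameters
    procedure = PROCEDURES[request["proc"]]
    result = procedure(*request["args"])         # steps 5-6: call the server, get the result
    return json.dumps({"result": result}).encode()  # step 7: pack the result

def transport(request_bytes: bytes) -> bytes:
    # Steps 3-4 and 8-9: stand-in for the two operating systems and the network.
    return server_stub(request_bytes)

def client_stub_add(a, b):
    # Step 2: build a message and hand it to the "OS".
    request = json.dumps({"proc": "add", "args": [a, b]}).encode()
    reply = transport(request)
    return json.loads(reply)["result"]           # step 10: unpack and return

# Step 1: the client calls the stub like an ordinary local procedure.
total = client_stub_add(2, 3)
```

The caller never sees the message format, which is the transparency that the failure cases later in the deck put at risk.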

RPC Failures

• Five different classes of failures.

1. Can’t find server.
2. Request message lost.
3. Server crashes after receiving request.
4. Reply message is lost.
5. Client crashes after sending request.

Methods

• 1: No server -- report back to client
  – Raise an exception.
  – Transparency is lost.

• 2: Lost request -- resend message
  – Start a timer; if it expires before a reply arrives, send the request again.
  – But was the request lost, or is the server down?
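A minimal sketch of both methods, assuming a hypothetical `send` function that returns `None` when the timer expires without a reply. The names `ServerNotFound`, `NoReply`, `locate`, and `make_lossy_send` are illustrative and not part of any real RPC library.

```python
class ServerNotFound(Exception):
    """Case 1: raised to the caller, so the call is no longer transparent."""

class NoReply(Exception):
    """Gave up: the request was lost repeatedly, or the server is down."""

def locate(registry, name):
    """Look up a server's send function; raise if no such server exists."""
    try:
        return registry[name]
    except KeyError:
        raise ServerNotFound(name) from None

def call_with_retry(send, request, max_attempts=3):
    """Case 2: resend the request each time the (simulated) timer expires."""
    for attempt in range(1, max_attempts + 1):
        reply = send(request)          # None models a timer expiry
        if reply is not None:
            return reply, attempt
    raise NoReply(request)

def make_lossy_send(drop_first):
    """Simulated channel that loses the first `drop_first` requests."""
    state = {"dropped": 0}
    def send(request):
        if state["dropped"] < drop_first:
            state["dropped"] += 1
            return None
        return "reply to " + request
    return send
```

Note that `NoReply` cannot say which failure occurred: from the client's side, a lost request and a dead server look the same.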

3: Server Crashes

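• Harder issue: The server can crash at two different points: before carrying out the request, or after carrying it out but before sending the reply.
  – The client could treat the two cases differently if it knew which one had occurred.
  – But the client only knows that no reply arrived, so how can it tell the cases apart and act accordingly?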

• Solution?

Server Crashes

• At least once: The server guarantees it will carry out an operation at least once, no matter what. So keep trying until a reply comes back.

• At most once: The server guarantees it will carry out an operation at most once. So report failure immediately.

• No general solution for exactly once.

• Consider a print server that crashes and comes back up.
  – Client sends a message, gets an ack.
  – Server sends a completion message either right before or right after printing.
  – If the server crashes, the client can: never reissue, always reissue, reissue only if no ack was received, or reissue only if an ack was received.

Print Server

• Three events that can happen at the server:

1. Send the completion message (M).
2. Print the text (P).
3. Crash (C).

Server Crashes

• These events can occur in six different orderings:

1. M → P → C: A crash occurs after sending the completion message and printing the text.
2. M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
3. P → M → C: A crash occurs after printing the text and sending the completion message.
4. P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.
5. C (→ P → M): A crash happens before the server could do anything.
6. C (→ M → P): A crash happens before the server could do anything.

Server Crashes

• Server crashes and comes back up.
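The interaction between the four client reissue strategies and the six orderings can be enumerated mechanically. The sketch below rests on a simplified model that is an assumption, not something stated in the slides: the client has an ack exactly when M happened before the crash, and a reissued request is always printed by the recovered server. The labels OK, DUP, and ZERO are shorthand for printed once, printed twice, and never printed.

```python
# Events completed before the crash, per server strategy.
# Order within each list: both events done, first event only, nothing done.
CRASH_PREFIXES = {
    "M->P": [("M", "P"), ("M",), ()],
    "P->M": [("P", "M"), ("P",), ()],
}

# Client reissue strategies: given "was an ack received?", reissue or not.
CLIENT_STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when acked": lambda acked: acked,
    "only when not acked": lambda acked: not acked,
}

def outcome(done_before_crash, reissue_policy):
    """Count how often the text is printed for one ordering and one policy."""
    printed = 1 if "P" in done_before_crash else 0
    acked = "M" in done_before_crash
    if reissue_policy(acked):
        printed += 1  # assumption: the recovered server prints the reissue
    return {0: "ZERO", 1: "OK", 2: "DUP"}[printed]

def outcome_table():
    """Map (client strategy, server strategy) to outcomes per crash point."""
    return {
        (client, server): [outcome(prefix, policy) for prefix in prefixes]
        for client, policy in CLIENT_STRATEGIES.items()
        for server, prefixes in CRASH_PREFIXES.items()
    }

for key, outcomes in outcome_table().items():
    print(key, outcomes)
```

Under this model every combination of client and server strategy has at least one ordering that ends in DUP or ZERO, which is consistent with the earlier claim that there is no general solution for exactly-once semantics.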

4: reply lost

• Detecting lost replies can be hard, because the server may also have crashed. You don’t know whether the server has carried out the operation.

• Solution:
  – None, except that you can try to make your operations idempotent: repeatable without any harm done if the operation happened to be carried out before.
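The difference between an idempotent and a non-idempotent operation can be shown with a small hypothetical `Account` class. The last method illustrates a different, commonly used technique that the slide does not mention: tagging each request with an ID and caching the reply, so that a retransmitted request is not executed twice.

```python
class Account:
    """Hypothetical server-side object used only for illustration."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request ID -> cached reply

    def deposit(self, amount):
        """Not idempotent: repeating the request changes the result."""
        self.balance += amount
        return self.balance

    def set_balance(self, value):
        """Idempotent: repeating the request does no harm."""
        self.balance = value
        return self.balance

    def deposit_once(self, request_id, amount):
        """Duplicate filtering: execute each request ID at most once."""
        if request_id not in self.replies:
            self.replies[request_id] = self.deposit(amount)
        return self.replies[request_id]
```

If the reply to `set_balance(10)` is lost, the client can simply resend it. Resending `deposit(10)` would credit the account twice, unless the request carries an ID as in `deposit_once`.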

5: client crashes

• Problem: The server is doing work and holding resources for nothing (called doing an orphan computation).
  – Orphan is killed (or rolled back) by the client when it reboots.
  – Broadcast a new epoch number when recovering ⇒ servers kill orphans.
  – Require computations to complete within T time units. Old ones are simply removed.

• Question: What’s the rolling back for?
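The epoch-number and time-limit approaches might look roughly as follows on the server side. This is a sketch under assumptions: the `Server` class, its method names, and the explicit `now` parameter (used in place of a real clock) are all hypothetical.

```python
class Server:
    """Tracks running computations so that orphans can be removed."""

    def __init__(self, T):
        self.T = T            # maximum lifetime of a computation, in time units
        self.running = {}     # job ID -> (client, epoch, start time)

    def start(self, job_id, client, epoch, now):
        self.running[job_id] = (client, epoch, now)

    def on_new_epoch(self, client, epoch):
        """A recovering client broadcast a new epoch: kill its older jobs."""
        orphans = [job for job, (c, e, _) in self.running.items()
                   if c == client and e < epoch]
        for job in orphans:
            del self.running[job]
        return orphans

    def expire(self, now):
        """Remove computations that have run longer than T time units."""
        old = [job for job, (_, _, started) in self.running.items()
               if now - started > self.T]
        for job in old:
            del self.running[job]
        return old
```

In this sketch killing a job only drops it from the table; a real server would also have to release or undo whatever the orphan had already changed.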
