Unicamp MC714 Distributed Systems
Slides by Maarten van Steen, adapted from Distributed Systems, 3rd edition
Chapter 04: Communication


Revision:

Revision: Threads and Distributed Systems

Improve performance

- Starting a thread is typically much cheaper than starting a new process.
- Having a single-threaded server prohibits simple scale-up to a multiprocessor system.
- As with clients: hide network latency by reacting to the next request while the previous one is being replied to.

Better structure

- Most servers have high I/O demands. Using simple, well-understood blocking calls simplifies the overall structure.
- Multithreaded programs tend to be smaller and easier to understand due to simplified flow of control.
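The point about blocking calls can be illustrated with a minimal sketch. It assumes a hypothetical `handle_request` function whose blocking disk or network access is simulated with `time.sleep`; the pool size and request identifiers are made up for illustration. Each worker thread uses a plain blocking call, yet the waiting times of different requests overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id, io_delay=0.05):
    """Hypothetical handler: a blocking I/O call followed by a reply."""
    time.sleep(io_delay)            # stands in for a blocking disk/network call
    return f"reply-{request_id}"

def serve(requests, workers=4):
    """Dispatch each request to a worker thread; replies keep request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, requests))

if __name__ == "__main__":
    start = time.perf_counter()
    replies = serve(range(8))
    elapsed = time.perf_counter() - start
    # The eight requests overlap their waiting time across four threads.
    print(replies, f"{elapsed:.2f}s")
```

With a single worker the same eight requests would wait one after another, which is the single-threaded case the slide warns about.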


Revision: Ways of virtualization

(a) Process VM, (b) Native VMM, (c) Hosted VMM

[Figure: layered stacks for (a), (b), and (c), built from Application/Libraries, Runtime system, Virtual machine monitor, Operating system, and Hardware.]

Differences

(a) Separate set of instructions, an interpreter/emulator, running atop an OS.
(b) Low-level instructions, along with a bare-bones minimal operating system.
(c) Low-level instructions, but delegating most work to a full-fledged OS.


Revision: Servers and state

Stateless servers

Never keep accurate information about the status of a client after having handled a request:

- Don’t record whether a file has been opened (simply close it again after access)
- Don’t promise to invalidate a client’s cache
- Don’t keep track of your clients

Consequences

- Clients and servers are completely independent
- State inconsistencies due to client or server crashes are reduced
- Possible loss of performance because, e.g., a server cannot anticipate client behavior (think of prefetching file blocks)

Question

Does connection-oriented communication fit into a stateless design?


Revision: Server Clusters

Common organization

[Figure: client requests arrive at a logical switch (possibly multiple) in the first tier, which dispatches each request to the application/compute servers in the second tier; the third tier is a distributed file/database system.]

Crucial element

The first tier is generally responsible for passing requests to an appropriate server: request dispatching.


Revision: Request Handling

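Observation

Having the first tier handle all communication from/to the cluster may lead to a bottleneck.

A solution: TCP handoff

[Figure: a client, a switch, and two servers. The client's request reaches the switch and is handed off to one of the servers; the response reaches the client as part of what is logically a single TCP connection.]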


Models for code migration

Model  Client (before)        Server (before)        Client (after)          Server (after)
CS     -                      code, exec, resource   -                       code, exec*, resource
REV    code →                 exec, resource         →                       code, exec*, resource

CS: Client-Server
REV: Remote evaluation


Models for code migration

Model  Client (before)        Server (before)        Client (after)          Server (after)
CoD    exec, resource         ← code                 code, exec*, resource   ←
MA     code, exec, resource   → resource             resource                → code, exec*, resource

CoD: Code-on-demand
MA: Mobile agents


Strong and weak mobility

Object components

Code segment: contains the actual code

Data segment: contains the state

Execution state: contains context of thread executing the object’s code

Weak mobility: Move only code and data segment (and reboot execution)

Relatively simple, especially if code is portable

Distinguish code shipping (push) from code fetching (pull)

Strong mobility: Move component, including execution state

Migration: move entire object from one machine to the other

Cloning: start a clone, and set it in the same execution state.


Revision: Exercises

1 Consider a service that takes a total of 10 ms to handle a request, provided the required data is in a cache in main memory. When the data is not in the cache, a disk operation that takes 90 ms is needed before the request can be completed, and during this time the thread handling the request is suspended. Assume that the data is in the cache for 50% of the requests. How many requests per second can the server handle if it is implemented with a single thread? And what if the server uses multiple threads?

2 Does it make sense to limit the number of threads in a server process? Argue your answer.

3 Are there cases in which a single-threaded server performs better than a multithreaded server? Argue your answer.

4 A multiprocess server has some advantages and disadvantages compared with a multithreaded server. Give some examples.

5 Is a server that maintains a TCP/IP connection to a client stateful or stateless?

Exercises:

1 Describe the connection process between client and server with TCP/IP sockets.

2 Distinguish synchronous from asynchronous communication, and persistent from transient communication. Give examples of each combination.

3 Describe a scalability problem with transient synchronous communication.

4 What is the role of a broker in message-oriented communication?

5 In Figure 4.35, what is the stretch factor of the overlay network on the route A→C?

6 Explain the anti-entropy principle used in epidemic protocols.

7 Describe the data-removal problem in epidemic protocols and present a solution.

8 Describe an epidemic algorithm that computes the size of a network.

Communication: Foundations Layered Protocols

Basic networking model

[Figure: the OSI reference model. Two protocol stacks communicate over a network, and each layer talks to its peer through its own protocol. Layers from top to bottom: 7 Application, 6 Presentation, 5 Session, 4 Transport, 3 Network, 2 Data link, 1 Physical.]

Drawbacks

- Focus on message-passing only
- Often unneeded or unwanted functionality
- Violates access transparency


Low-level layers

Recap

- Physical layer: contains the specification and implementation of bits, and their transmission between sender and receiver
- Data link layer: prescribes the transmission of a series of bits into a frame to allow for error and flow control
- Network layer: describes how packets in a network of computers are to be routed

Observation

For many distributed systems, the lowest-level interface is that of the network layer.


Transport Layer

Important

The transport layer provides the actual communication facilities for most distributed systems.

Standard Internet protocols

- TCP: connection-oriented, reliable, stream-oriented communication
- UDP: unreliable (best-effort) datagram communication
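To contrast UDP with the connection-oriented TCP case, here is a minimal sketch of datagram communication over the loopback interface. The function name `udp_send_once` and the timeout value are assumptions made for this example. No connection is set up, and the receiver uses a timeout because UDP itself gives no delivery guarantee.

```python
import socket

def udp_send_once(payload: bytes) -> bytes:
    """Send one datagram over loopback and return what the receiver got."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
    receiver.settimeout(2.0)               # do not wait forever for a lost datagram
    addr = receiver.getsockname()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(payload, addr)       # no connect/accept handshake needed
        data, _ = receiver.recvfrom(1024)  # one whole datagram per call
    finally:
        sender.close()
        receiver.close()
    return data

if __name__ == "__main__":
    print(udp_send_once(b"hello"))
```

On a real network the datagram could be lost, duplicated, or reordered; an application that needs reliability on top of UDP has to add it itself.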


Middleware layer

Observation

Middleware is invented to provide common services and protocols that can be used by many different applications:

- A rich set of communication protocols
- (Un)marshaling of data, necessary for integrated systems
- Naming protocols, to allow easy sharing of resources
- Security protocols for secure communication
- Scaling mechanisms, such as for replication and caching

Note

What remains are truly application-specific protocols... such as?


An adapted layering scheme

[Figure: adapted layering scheme. From top to bottom: Application (application protocol), Middleware (middleware protocol), Operating system (host-to-host protocol), Hardware (physical/link-level protocol), connected through the network.]

Communication: Foundations Types of Communication

Types of communication

Distinguish...

[Figure: timeline of a client sending a request to a server and receiving a reply, with a storage facility and a transmission interrupt in between. Three synchronization points are marked: at request submission, at request delivery, and after processing by the server.]

- Transient versus persistent communication
- Asynchronous versus synchronous communication

Transient versus persistent

- Transient communication: the communication server discards a message when it cannot be delivered to the next server or to the receiver.
- Persistent communication: a message is stored at a communication server for as long as it takes to deliver it.
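The difference between the two delivery policies can be sketched with a toy communication server. The class `CommServer` and its methods are hypothetical names introduced only for this illustration; a real system would involve several servers and actual network transfer.

```python
class CommServer:
    """Toy communication server; `persistent` selects the delivery policy."""

    def __init__(self, persistent):
        self.persistent = persistent
        self.stored = []          # messages waiting for an absent receiver
        self.delivered = []       # messages handed to the receiver

    def send(self, message, receiver_up):
        if receiver_up:
            self.delivered.append(message)
        elif self.persistent:
            self.stored.append(message)     # keep it for as long as it takes
        # transient policy: an undeliverable message is simply discarded

    def receiver_comes_up(self):
        self.delivered.extend(self.stored)  # flush anything that was stored
        self.stored.clear()

if __name__ == "__main__":
    for persistent in (False, True):
        server = CommServer(persistent)
        server.send("m1", receiver_up=False)
        server.receiver_comes_up()
        print(persistent, server.delivered)
```

In the transient case the message sent while the receiver was down never arrives; in the persistent case it is delivered once the receiver comes up.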

Places for synchronization

- At request submission
- At request delivery
- After request processing


Client/Server

Some observations

Client/Server computing is generally based on a model of transient synchronous communication:

- Client and server have to be active at time of communication
- Client issues request and blocks until it receives reply
- Server essentially waits only for incoming requests, and subsequently processes them

Drawbacks of synchronous communication

- Client cannot do any other work while waiting for reply
- Failures have to be handled immediately: the client is waiting
- The model may simply not be appropriate (mail, news)


Communication: Message-oriented communication Simple transient messaging with sockets

Transient messaging: sockets

Berkeley socket interface

Operation   Description
socket      Create a new communication end point
bind        Attach a local address to a socket
listen      Tell operating system what the maximum number of pending connection requests should be
accept      Block caller until a connection request arrives
connect     Actively attempt to establish a connection
send        Send some data over the connection
receive     Receive some data over the connection
close       Release the connection

[Figure: order of socket calls. Server: socket, bind, listen, accept, receive, send, close. Client: socket, connect, send, receive, close. The figure marks the synchronization point and the communication between the two sides.]


Sockets: Python code

Server

    from socket import *

    HOST, PORT = "localhost", 50007     # placeholder address; not given on the slide

    s = socket(AF_INET, SOCK_STREAM)
    s.bind((HOST, PORT))
    s.listen(1)
    (conn, addr) = s.accept()           # returns new socket and client address
    while True:                         # forever
        data = conn.recv(1024)          # receive data from client
        if not data: break              # stop if client stopped
        conn.send(data + b"*")          # return sent data plus an "*"
    conn.close()                        # close the connection

Client

    from socket import *

    HOST, PORT = "localhost", 50007     # placeholder address; must match the server

    s = socket(AF_INET, SOCK_STREAM)
    s.connect((HOST, PORT))             # connect to server (block until accepted)
    s.send(b"Hello, world")             # send some data (bytes in Python 3)
    data = s.recv(1024)                 # receive the response
    print(data)                         # print the result
    s.close()                           # close the connection


Messaging

Message-oriented middleware

Aims at high-level persistent asynchronous communication:

- Processes send each other messages, which are queued
- Sender need not wait for immediate reply, but can do other things
- Middleware often ensures fault tolerance

Communication: Remote procedure call Basic RPC operation

Basic RPC operation

Observations

- Application developers are familiar with the simple procedure model
- Well-engineered procedures operate in isolation (black box)
- There is no fundamental reason not to execute procedures on a separate machine

Conclusion

Communication between caller & callee can be hidden by using a procedure-call mechanism.

[Figure: timeline. The client calls a remote procedure and sends a request, then waits for the result; the server calls the local procedure and returns the results in a reply; the client returns from the call.]


Basic RPC operation

[Figure: a client machine (client process, client stub, client OS) and a server machine (server process with the implementation of doit, server stub, server OS). 1. Client call to procedure, r = doit(a,b). 2. Stub builds message (proc: “doit”, type1: val(a), type2: val(b)). 3. Message is sent across the network. 4. Server OS hands message to server stub. 5. Stub unpacks message. 6. Stub makes local call to “doit”.]

1 Client procedure calls client stub.
2 Stub builds message; calls local OS.
3 OS sends message to remote OS.
4 Remote OS gives message to stub.
5 Stub unpacks parameters; calls server.
6 Server does local call; returns result to stub.
7 Stub builds message; calls OS.
8 OS sends message to client’s OS.
9 Client’s OS gives message to stub.
10 Client stub unpacks result; returns to client.
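The ten steps can be traced in a small single-process sketch. It is an illustration under simplifying assumptions: JSON stands in for the message format, the function `network` stands in for both operating systems and the network, and the names `client_stub`, `server_stub`, and `PROCEDURES` are invented for this example.

```python
import json

def doit(a, b):
    """The server-side implementation of the remote procedure."""
    return a + b

PROCEDURES = {"doit": doit}      # procedures the server stub may call

def server_stub(request: bytes) -> bytes:
    msg = json.loads(request.decode())              # step 5: unpack parameters
    result = PROCEDURES[msg["proc"]](*msg["args"])  # step 6: local call
    return json.dumps({"result": result}).encode()  # step 7: build reply message

def network(request: bytes) -> bytes:
    """Stand-in for steps 3-4 and 8-9 (OS and network transfer)."""
    return server_stub(request)

def client_stub(proc, *args):
    request = json.dumps({"proc": proc, "args": args}).encode()  # step 2
    reply = network(request)
    return json.loads(reply.decode())["result"]     # step 10: unpack result

if __name__ == "__main__":
    r = client_stub("doit", 2, 3)   # step 1: looks like a local call
    print(r)
```

The caller only sees `client_stub("doit", 2, 3)`; everything about message building and transfer is hidden behind the stub, which is the point of the mechanism.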


RPC: Parameter passing

There’s more than just wrapping parameters into a message

- Client and server machines may have different data representations (think of byte ordering)
- Wrapping a parameter means transforming a value into a sequence of bytes
- Client and server have to agree on the same encoding:
  - How are basic data values represented (integers, floats, characters)
  - How are complex data values represented (arrays, unions)

Conclusion

Client and server need to properly interpret messages, transforming them into machine-dependent representations.
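The byte-ordering issue can be made concrete with Python's standard `struct` module. This is a minimal sketch in which both sides agree on network (big-endian) byte order for a signed 32-bit integer; the helper names `marshal_int32` and `unmarshal_int32` are assumptions for this example.

```python
import struct

def marshal_int32(value: int) -> bytes:
    """Encode a signed 32-bit integer in network (big-endian) byte order."""
    return struct.pack("!i", value)

def unmarshal_int32(data: bytes) -> int:
    """Decode a signed 32-bit integer that was sent in network byte order."""
    return struct.unpack("!i", data)[0]

if __name__ == "__main__":
    wire = marshal_int32(5)
    print(wire.hex())                      # 00000005
    print(unmarshal_int32(wire))           # 5
    # Reading the same bytes as little-endian gives a different number.
    print(struct.unpack("<i", wire)[0])    # 83886080
```

A receiver that interprets the four bytes in its own native order without an agreed encoding may silently obtain the wrong value, as the last line shows.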


RPC: Parameter passing

Some assumptions

- Copy in/copy out semantics: while procedure is executed, nothing can be assumed about parameter values.
- All data that is to be operated on is passed by parameters. Excludes passing references to (global) data.

Conclusion

Full access transparency cannot be realized.

A remote reference mechanism enhances access transparency

- Remote reference offers unified access to remote data
- Remote references can be passed as parameter in RPCs
- Note: stubs can sometimes be used as such references
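Copy in/copy out semantics can be sketched by serializing the argument before the call and the result after it. Here `pickle` stands in for marshaling, and `rpc` and `remote_append` are hypothetical names used only for illustration.

```python
import pickle

def remote_append(items):
    """Hypothetical remote procedure: it works on its own copy of the data."""
    items.append("added-by-server")
    return items

def rpc(procedure, argument):
    """Simulate an RPC: the argument is copied in and the result copied out."""
    copied_in = pickle.loads(pickle.dumps(argument))   # copy in
    result = procedure(copied_in)
    return pickle.loads(pickle.dumps(result))          # copy out

if __name__ == "__main__":
    local = ["a"]
    returned = rpc(remote_append, local)
    print(local)      # the caller's list is unchanged
    print(returned)   # the change is visible only in the returned copy
```

Had `remote_append` been called locally, it would have modified the caller's list in place. That difference in behavior is one reason full access transparency cannot be realized.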


Communication: Remote procedure call Variations on RPC

Asynchronous RPCs

Essence

Try to get rid of the strict request-reply behavior, but let the client continue without waiting for an answer from the server.

[Figure: asynchronous RPC timeline. The client calls a remote procedure, sends the request, and waits only for acceptance; after the server accepts the request, the client returns from the call while the server calls the local procedure. The server returns the results through a callback to the client.]
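A minimal sketch of the idea, using a thread pool as a stand-in for the remote server. The names `async_rpc` and `slow_remote_square` are assumptions for this example; the client submits the call, continues immediately, and receives the result later through a callback.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_remote_square(x):
    """Hypothetical remote procedure; the sleep stands in for server work."""
    time.sleep(0.05)
    return x * x

def async_rpc(executor, procedure, arg, callback):
    """Submit the call and return at once; `callback` later gets the result."""
    future = executor.submit(procedure, arg)
    future.add_done_callback(lambda f: callback(f.result()))
    return future

if __name__ == "__main__":
    results = []
    with ThreadPoolExecutor(max_workers=1) as server:
        async_rpc(server, slow_remote_square, 7, results.append)
        print("client continues with other work")
    # Leaving the with-block waits for the server thread to finish.
    print(results)
```

The callback runs in the server's worker thread, so a real client would have to make sure its callback is safe to run concurrently with its own work.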


Sending out multiple RPCs

Essence

Sending an RPC request to a group of servers.

[Figure: multicast RPC timeline. The client calls remote procedures on two servers; each server calls its local procedure and reports back through a callback to the client.]
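Sending the same request to a group of servers can be sketched as follows. The servers are plain local functions created by a hypothetical `make_server` helper, and threads stand in for the parallel network calls; replies are collected in the order in which they complete.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_server(name):
    """Build a hypothetical server whose procedure tags replies with its name."""
    def lookup(key):
        return (name, key.upper())
    return lookup

def multicast_rpc(servers, arg):
    """Call every server in parallel and collect the replies as they arrive."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(server, arg) for server in servers]
        return [f.result() for f in as_completed(futures)]

if __name__ == "__main__":
    group = [make_server("s1"), make_server("s2")]
    print(sorted(multicast_rpc(group, "x")))
```

Whether the client waits for all replies, as here, or proceeds after the first one is a design choice that depends on the application.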


Communication: Remote procedure call Example: DCE RPC

RPC in practice

[Figure: writing a client and a server. Uuidgen produces an interface definition file, which the IDL compiler turns into a client stub, a header, and a server stub. Client code and server code #include the header. The C compiler produces the client, client stub, server stub, and server object files; on each side the linker combines the object files with the runtime library into the client binary and the server binary.]


Client-to-server binding (DCE)

Issues

(1) Client must locate server machine, and (2) locate the server.

[Figure: binding a client to a server. 1. The server registers its port in the port table of the DCE daemon on the server machine. 2. The server registers its service with the directory server on the directory machine. 3. The client looks up the server at the directory server. 4. The client asks the DCE daemon for the port. 5. The client does the RPC.]
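The five binding steps can be mimicked with two lookup tables. This is only a sketch of the lookup logic, not of the DCE API: the classes `Directory` and `Machine`, the service name, and the port number are all invented for illustration.

```python
class Directory:
    """Hypothetical directory server: service name -> server machine."""
    def __init__(self):
        self.services = {}
    def register(self, service, machine):
        self.services[service] = machine
    def lookup(self, service):
        return self.services[service]

class Machine:
    """Hypothetical server machine with a daemon-managed port table."""
    def __init__(self, name):
        self.name = name
        self.port_table = {}
    def register_port(self, service, port):
        self.port_table[service] = port
    def ask_port(self, service):
        return self.port_table[service]

def bind_client(directory, service):
    machine = directory.lookup(service)    # step 3: look up server
    port = machine.ask_port(service)       # step 4: ask daemon for port
    return machine.name, port              # step 5 would send the RPC here

if __name__ == "__main__":
    directory, host = Directory(), Machine("server-machine")
    host.register_port("video", 4711)      # step 1: register port
    directory.register("video", host)      # step 2: register service
    print(bind_client(directory, "video"))
```

Splitting the lookup in two lets a server pick its port dynamically: only the machine-local daemon needs to know it, while the directory stores the more stable mapping from service to machine.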


Communication: Remote procedure call Message-oriented persistent communication

Message-oriented middleware

Essence

Asynchronous persistent communication through support of middleware-level queues. Queues correspond to buffers at communication servers.

Operations

Operation   Description
put         Append a message to a specified queue
get         Block until the specified queue is nonempty, and remove the first message
poll        Check a specified queue for messages, and remove the first. Never block
notify      Install a handler to be called when a message is put into the specified queue
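The four operations can be sketched on top of Python's standard `queue.Queue`. The class `MessageQueue` is a hypothetical in-process illustration: a real message-queuing system would also store messages persistently and work across machines.

```python
import queue

class MessageQueue:
    """Toy queue offering the four operations from the table."""

    def __init__(self):
        self._q = queue.Queue()
        self._handlers = []

    def put(self, message):
        self._q.put(message)                 # append to the queue
        for handler in self._handlers:       # call installed handlers
            handler(message)

    def get(self):
        return self._q.get()                 # blocks while the queue is empty

    def poll(self):
        try:
            return self._q.get_nowait()      # never blocks
        except queue.Empty:
            return None

    def notify(self, handler):
        self._handlers.append(handler)       # called on every later put

if __name__ == "__main__":
    mq, seen = MessageQueue(), []
    mq.notify(seen.append)
    mq.put("m1")
    print(mq.poll(), mq.poll(), seen)
```

In this sketch `poll` returns `None` for an empty queue; that convention is a choice made here, not something the table prescribes.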


General model

Queue managers

Queues are managed by queue managers. An application can put messages only into a local queue. Getting a message is possible by extracting it from a local queue only ⇒ queue managers need to route messages.

Routing

[Figure: general architecture of a message-queuing system. The source queue manager takes the logical queue-level address (name), looks up the contact address of the destination queue manager in an address lookup database, and sends the message through its local OS and the network to the destination queue manager.]


Message broker

Observation

Message queuing systems assume a common messaging protocol: all applications agree on message format (i.e., structure and data representation)

Broker handles application heterogeneity in an MQ system

- Transforms incoming messages to target format
- Very often acts as an application gateway
- May provide subject-based routing capabilities (i.e., publish-subscribe capabilities)
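A toy broker can illustrate both format transformation and subject-based routing. The `Broker` class, the converter functions, and the subjects are hypothetical; each subscriber registers the format it expects, and the broker converts a published message before delivering it.

```python
import json

class Broker:
    """Toy broker: subject-based routing plus per-subscriber format rules."""

    def __init__(self):
        self.subscriptions = {}     # subject -> list of (convert, deliver)

    def subscribe(self, subject, convert, deliver):
        self.subscriptions.setdefault(subject, []).append((convert, deliver))

    def publish(self, subject, message):
        for convert, deliver in self.subscriptions.get(subject, []):
            deliver(convert(message))    # transform to the target format

def to_json(message):
    return json.dumps(message, sort_keys=True)

def to_csv(message):
    return ",".join(str(message[k]) for k in sorted(message))

if __name__ == "__main__":
    inbox_a, inbox_b = [], []
    broker = Broker()
    broker.subscribe("stock", to_json, inbox_a.append)
    broker.subscribe("stock", to_csv, inbox_b.append)
    broker.publish("stock", {"symbol": "ABC", "price": 10})
    broker.publish("weather", {"temp": 20})   # no subscribers: nothing delivered
    print(inbox_a, inbox_b)
```

The publisher never learns which formats its receivers use; that knowledge lives in the broker's rules, which is what lets heterogeneous applications share one queuing system.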


Message broker: general architecture

[Figure: message broker, general architecture. A source application and a destination application each reach the queuing layer through an interface on top of their local OS; the message broker sits between them on the queuing layer and contains broker plugins and rules.]

Communication: Remote procedure call Advanced transient messaging

Making sockets easier to work with

Observation

Sockets are rather low level and programming mistakes are easily made. However, the way that they are used is often the same (such as in a client-server setting).

Alternative: ZeroMQ

Provides a higher level of expression by pairing sockets: one for sending messages at process P and a corresponding one at process Q for receiving messages. All communication is asynchronous.

Three patterns

- Request-reply
- Publish-subscribe
- Pipeline


Request-reply

Server

    import zmq

    HOST = "127.0.0.1"                    # placeholder host; not given on the slide
    PORT1, PORT2 = "5555", "5556"         # placeholder ports (strings)

    context = zmq.Context()

    p1 = "tcp://"+ HOST +":"+ PORT1       # how and where to connect
    p2 = "tcp://"+ HOST +":"+ PORT2       # how and where to connect
    s  = context.socket(zmq.REP)          # create reply socket

    s.bind(p1)                            # bind socket to address
    s.bind(p2)                            # bind socket to address
    while True:
        message = s.recv()                # wait for incoming message (bytes)
        if b"STOP" not in message:        # if not to stop...
            s.send(message + b"*")        # append "*" to message
        else:                             # else...
            break                         # break out of loop and end

Request-reply

Client

    import zmq

    HOST, PORT = "127.0.0.1", "5555"      # placeholders; must match the server

    context = zmq.Context()

    php = "tcp://"+ HOST +":"+ PORT       # how and where to connect
    s   = context.socket(zmq.REQ)         # create socket

    s.connect(php)                        # connect (ZeroMQ completes this in the background)
    s.send(b"Hello World")                # send message
    message = s.recv()                    # block until response
    s.send(b"STOP")                       # tell server to stop
    print(message)                        # print result

Publish-subscribe

Server

    import zmq, time

    HOST, PORT = "127.0.0.1", "5555"      # placeholders; not given on the slide

    context = zmq.Context()
    s = context.socket(zmq.PUB)           # create a publisher socket
    p = "tcp://"+ HOST +":"+ PORT         # how and where to communicate
    s.bind(p)                             # bind socket to the address
    while True:
        time.sleep(5)                     # wait 5 seconds between messages
        s.send(("TIME " + time.asctime()).encode())   # publish the current time

Client

    import zmq

    HOST, PORT = "127.0.0.1", "5555"      # placeholders; must match the server

    context = zmq.Context()
    s = context.socket(zmq.SUB)           # create a subscriber socket
    p = "tcp://"+ HOST +":"+ PORT         # how and where to communicate
    s.connect(p)                          # connect to the server
    s.setsockopt(zmq.SUBSCRIBE, b"TIME")  # subscribe to TIME messages

    for i in range(5):                    # five iterations
        msg = s.recv()                    # receive a message
        print(msg.decode())

Pipeline

Source

import zmq, time, pickle, sys, random

context = zmq.Context()
me = str(sys.argv[1])
s = context.socket(zmq.PUSH)               # create a push socket
src = SRC1 if me == '1' else SRC2          # check task source host
prt = PORT1 if me == '1' else PORT2        # check task source port
p = "tcp://" + src + ":" + prt             # how and where to connect
s.bind(p)                                  # bind socket to address

for i in range(100):                       # generate 100 workloads
    workload = random.randint(1, 100)      # compute workload
    s.send(pickle.dumps((me, workload)))   # send workload to worker


Pipeline

Worker

import zmq, time, pickle, sys

context = zmq.Context()
me = str(sys.argv[1])
r = context.socket(zmq.PULL)           # create a pull socket
p1 = "tcp://" + SRC1 + ":" + PORT1     # address first task source
p2 = "tcp://" + SRC2 + ":" + PORT2     # address second task source
r.connect(p1)                          # connect to task source 1
r.connect(p2)                          # connect to task source 2

while True:
    work = pickle.loads(r.recv())      # receive work from a source
    time.sleep(work[1] * 0.01)         # pretend to work


Example: RabbitMQ

Objective
RabbitMQ is a message broker. It accepts and forwards messages.

Persistent, asynchronous communication

Sender
- Create a communication channel and declare a message queue
- Publish data to the queue

Receiver
- Create a communication channel and declare a message queue
- Define a callback function to handle incoming information
- Start consuming data from the channel
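The order of the sender and receiver steps can be sketched with a small in-memory stand-in for a broker. This is only an illustration under stated assumptions: `ToyBroker` and its methods are hypothetical names, not RabbitMQ's client API, and nothing here is persisted to disk.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not RabbitMQ)."""

    def __init__(self):
        self.queues = {}      # queue name -> deque of pending messages
        self.consumers = {}   # queue name -> list of callbacks

    def declare_queue(self, name):
        # Declaring is idempotent, so sender and receiver can both do it.
        self.queues.setdefault(name, deque())
        self.consumers.setdefault(name, [])

    def publish(self, name, body):
        # The broker stores the message; the sender does not wait for a receiver.
        self.queues[name].append(body)

    def consume(self, name, callback):
        self.consumers[name].append(callback)

    def deliver(self, name):
        # Hand stored messages to the registered callbacks in round-robin order.
        callbacks = self.consumers[name]
        delivered = 0
        while self.queues[name] and callbacks:
            body = self.queues[name].popleft()
            callbacks[delivered % len(callbacks)](body)
            delivered += 1
        return delivered


broker = ToyBroker()

# Sender: declare the queue, then publish.
broker.declare_queue("hello")
broker.publish("hello", "Hello World")

# Receiver (possibly started later): declare, define a callback, consume.
received = []
broker.declare_queue("hello")
broker.consume("hello", received.append)
broker.deliver("hello")
print(received)
```

Because the message sits in the broker's queue until a consumer is registered, the sender and receiver do not need to be active at the same time.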


Example: RabbitMQ

Features
- Round-robin dispatching
- Durable (persistent) messages
- Publish/Subscribe (fanout)
- Topic-based exchange (filtering)
- RPC interface
- Multiple language bindings


MPI: When lots of flexibility is needed

Representative operations

Operation      Description
MPI_bsend      Append outgoing message to a local send buffer
MPI_send       Send a message and wait until copied to local or remote buffer
MPI_ssend      Send a message and wait until transmission starts
MPI_sendrecv   Send a message and wait for reply
MPI_isend      Pass reference to outgoing message, and continue
MPI_issend     Pass reference to outgoing message, and wait until receipt starts
MPI_recv       Receive a message; block if there is none
MPI_irecv      Check if there is an incoming message, but do not block

The Message-Passing Interface (MPI) 46 / 62


Communication: Remote procedure call Example: IBM’s WebSphere message-queuing system

IBM’s WebSphere MQ

Basic concepts

- Application-specific messages are put into, and removed from, queues
- Queues reside under the regime of a queue manager
- Processes can put messages only in local queues, or through an RPC mechanism

Message transfer

- Messages are transferred between queues
- Message transfer between queues at different processes requires a channel
- At each end point of a channel is a message channel agent (MCA)
- Message channel agents are responsible for:
    - Setting up channels using lower-level network communication facilities (e.g., TCP/IP)
    - (Un)wrapping messages from/in transport-level packets
    - Sending/receiving packets

Overview 47 / 62


IBM’s WebSphere MQ

Schematic overview

[Figure: a sending client and a receiving client, each with an MQ interface, stub, server stub and queue manager. Other labels: MCAs, send queue, routing table, client's receive queue; RPC (synchronous) over the local network; message passing (asynchronous) over the enterprise network; links to other remote queue managers.]

- Channels are inherently unidirectional
- MCAs are started automatically when messages arrive
- Any network of queue managers can be created
- Routes are set up manually (system administration)


Message channel agents

Some attributes associated with message channel agents

Attribute           Description
Transport type      Determines the transport protocol to be used
FIFO delivery       Indicates that messages are to be delivered in the order they are sent
Message length      Maximum length of a single message
Setup retry count   Specifies maximum number of retries to start up the remote MCA
Delivery retries    Maximum times MCA will try to put received message into queue

Channels 49 / 62


IBM’s WebSphere MQ

Routing

By using logical names, in combination with name resolution to local queues, it is possible to put a message in a remote queue.

[Figure: a network of four queue managers (QMA, QMB, QMC, QMD). Each has a routing table that refers to local send queues (SQ1, SQ2); alias tables relate the logical names LA1 and LA2 to queue managers.]
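The two lookups involved in routing can be sketched as follows. The table contents below are invented for illustration and are not the entries of the figure; the function names `resolve` and `put` are likewise hypothetical.

```python
# Hypothetical tables held by one sending queue manager; all entries are made up.
ALIAS_TABLE = {"LA1": "QMC", "LA2": "QMD"}                    # logical name -> queue manager
ROUTING_TABLE = {"QMB": "SQ1", "QMC": "SQ1", "QMD": "SQ2"}    # destination -> local send queue

send_queues = {"SQ1": [], "SQ2": []}


def resolve(logical_name, alias_table, routing_table):
    """Return (destination queue manager, local send queue) for a logical name."""
    destination = alias_table[logical_name]      # step 1: name resolution
    send_queue = routing_table[destination]      # step 2: pick the outgoing queue
    return destination, send_queue


def put(logical_name, message):
    """Append the message to the local send queue that leads to its destination."""
    destination, send_queue = resolve(logical_name, ALIAS_TABLE, ROUTING_TABLE)
    send_queues[send_queue].append((destination, message))
    return send_queue


print(put("LA1", "order #1"))
print(put("LA2", "order #2"))
```

The application only ever names `LA1` or `LA2`; changing where a logical name points means editing the alias table, not the application.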

Message transfer 50 / 62


Communication: Multicast communication Application-level tree-based multicasting

Application-level multicasting

Essence
Organize nodes of a distributed system into an overlay network and use that network to disseminate data:
- Oftentimes a tree, leading to unique paths
- Alternatively, also mesh networks, requiring a form of routing


Application-level multicasting in Chord

Basic approach
1. Initiator generates a multicast identifier mid.
2. Lookup succ(mid), the node responsible for mid.
3. Request is routed to succ(mid), which will become the root.
4. If P wants to join, it sends a join request to the root.
5. When request arrives at Q:
   - Q has not seen a join request before ⇒ it becomes forwarder; P becomes child of Q. Join request continues to be forwarded.
   - Q knows about tree ⇒ P becomes child of Q. No need to forward join request anymore.
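Step 5 can be sketched as a walk along the route of the join request. The sketch assumes the overlay route from the joining node to the root is already known (in Chord it would come from the lookup of succ(mid)); the `join` function and the node names are hypothetical.

```python
def join(tree, path):
    """Add path[0] to the multicast tree.

    tree maps every node in the tree to its parent (the root maps to None).
    path is the overlay route from the joining node to the root; here it is
    simply given instead of being computed by a Chord lookup.
    """
    child = path[0]
    if child in tree:
        return tree                  # already part of the tree
    for node in path[1:]:
        known = node in tree
        tree[child] = node           # the previous hop becomes a child of this node
        if known:
            return tree              # node knows the tree: stop forwarding
        child = node                 # node becomes a forwarder and passes the request on
    return tree


tree = {"R": None}                          # R plays the role of succ(mid), the root
join(tree, ["P1", "A", "B", "R"])           # A and B become forwarders
join(tree, ["P2", "C", "A", "B", "R"])      # the request stops at A
print(tree)
```

The second join never reaches the root: A already knows the tree, so only C is added as a new forwarder.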


ALM: Some costs

Different metrics

[Figure: end hosts A, B, C, D and E attached to routers Ra, Rb, Rc, Rd and Re in the Internet, with a delay on every link; the overlay network connects the end hosts.]

- Link stress: How often does an ALM message cross the same physical link? Example: message from A to D needs to cross 〈Ra,Rb〉 twice.
- Stretch: Ratio in delay between ALM-level path and network-level path. Example: messages B to C follow path of length 73 at ALM, but 47 at network level ⇒ stretch = 73/47.
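Both metrics are simple to compute once the network-level route of every overlay hop is known. In the sketch below the three routes for a message from A to D are an assumption made for illustration (the figure's exact topology is not reproduced here); only the delays 73 and 47 come from the example above.

```python
from collections import Counter
from fractions import Fraction


def link_stress(physical_routes):
    """Count how often each physical link is crossed by one ALM message.

    physical_routes holds, for every overlay hop the message takes, the list
    of hosts and routers traversed at the network level.
    """
    stress = Counter()
    for route in physical_routes:
        for u, v in zip(route, route[1:]):
            stress[frozenset((u, v))] += 1     # links are undirected
    return stress


def stretch(alm_delay, network_delay):
    """Ratio of the delay along the overlay to the delay of the direct network path."""
    return Fraction(alm_delay, network_delay)


# Assumed network-level routes for the overlay hops A-B, B-E and E-D.
routes_a_to_d = [
    ["A", "Ra", "Rb", "B"],
    ["B", "Rb", "Ra", "Re", "E"],
    ["E", "Re", "Rc", "Rd", "D"],
]
stress = link_stress(routes_a_to_d)
print(stress[frozenset(("Ra", "Rb"))])     # crossings of the link between Ra and Rb
print(float(stretch(73, 47)))              # stretch for the B-to-C example
```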

Performance issues in overlays 53 / 62


Communication: Multicast communication Flooding-based multicasting

Flooding

Essence
P simply sends a message m to each of its neighbors. Each neighbor will forward that message, except to P, and only if it had not seen m before.

Performance
The more edges, the more expensive!

[Figure: the size of a random overlay as a function of the number of nodes. x-axis: number of nodes (100 to 1000); y-axis: number of edges (× 1000, 0 to 300); one curve each for pedge = 0.2, 0.4 and 0.6.]

Variation
Let Q forward a message with a certain probability pflood, possibly even dependent on its own number of neighbors (i.e., node degree) or the degree of its neighbors.
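A small simulation makes the cost of flooding concrete. It is a sketch under assumptions: the overlay is a random graph in which every pair of nodes is linked with probability pedge, messages are delivered in FIFO order, and the names `expected_edges`, `random_overlay` and `flood` are hypothetical.

```python
import random
from collections import deque


def expected_edges(n, p_edge):
    """Expected number of edges when each of the n(n-1)/2 node pairs is linked with p_edge."""
    return p_edge * n * (n - 1) / 2


def random_overlay(n, p_edge, rng):
    """Build an undirected random overlay as a dict: node -> set of neighbors."""
    neighbors = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_edge:
                neighbors[u].add(v)
                neighbors[v].add(u)
    return neighbors


def flood(neighbors, source, p_flood=1.0, rng=None):
    """Flood one message from source; return (nodes reached, messages sent)."""
    rng = rng or random.Random()
    seen = {source}
    in_flight = deque((source, q) for q in neighbors[source])
    sent = len(in_flight)
    while in_flight:
        sender, node = in_flight.popleft()
        if node in seen:
            continue                      # duplicate: dropped, not forwarded
        seen.add(node)
        if rng.random() >= p_flood:
            continue                      # probabilistic variation: do not forward
        for q in neighbors[node]:
            if q != sender:               # never send back to where it came from
                in_flight.append((node, q))
                sent += 1
    return len(seen), sent


overlay = random_overlay(100, 0.2, random.Random(1))
print(expected_edges(100, 0.2), flood(overlay, 0))
print(flood(overlay, 0, p_flood=0.5, rng=random.Random(2)))
```

With pflood = 1 and a connected overlay, the source sends one message per neighbor and every other node sends one per neighbor except the one it heard from, so the message count grows with the number of edges rather than the number of nodes.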



Communication: Multicast communication Gossip-based data dissemination

Epidemic protocols

Assume there are no write–write conflicts:
- Update operations are performed at a single server
- A replica passes updated state to only a few neighbors
- Update propagation is lazy, i.e., not immediate
- Eventually, each update should reach every replica

Two forms of epidemics

- Anti-entropy: Each replica regularly chooses another replica at random, and exchanges state differences, leading to identical states at both afterwards.
- Rumor spreading: A replica which has just been updated (i.e., has been contaminated) tells a number of other replicas about its update (contaminating them as well).


Anti-entropy

Principal operations

- A node P selects another node Q from the system at random.
- Pull: P only pulls in new updates from Q
- Push: P only pushes its own updates to Q
- Push-pull: P and Q send updates to each other

Observation

For push-pull it takes O(log(N)) rounds to disseminate updates to all N nodes(round = when every node has taken the initiative to start an exchange).
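The logarithmic behaviour can be observed with a short simulation. It is a sketch under assumptions: a single source, peers chosen uniformly at random, and exchanges within a round applied one after another (so an update learned early in a round can be passed on later in the same round). The function name `push_pull_rounds` is hypothetical.

```python
import math
import random


def push_pull_rounds(n, rng, max_rounds=1000):
    """Rounds of push-pull anti-entropy until all n nodes (n >= 2) hold the update."""
    updated = [False] * n
    updated[0] = True                      # node 0 is the single source
    rounds = 0
    while not all(updated) and rounds < max_rounds:
        rounds += 1
        for p in range(n):                 # every node takes the initiative once
            q = rng.randrange(n - 1)
            if q >= p:
                q += 1                     # uniform choice among the other nodes
            if updated[p] or updated[q]:   # push-pull: both end up with the update
                updated[p] = updated[q] = True
    return rounds


for n in (100, 1000, 10000):
    print(n, push_pull_rounds(n, random.Random(1)), round(math.log2(n), 1))
```

Multiplying the number of nodes by ten should add only a few rounds, which is the behaviour the O(log(N)) bound describes.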

Information dissemination models 56 / 62


Anti-entropy: analysis

Basics
Consider a single source, propagating its update. Let $p_i$ be the probability that a node has not received the update after the $i$-th round.

Analysis: staying ignorant

- With pull, $p_{i+1} = (p_i)^2$: the node was not updated during the $i$-th round and should contact another ignorant node during the next round.
- With push, $p_{i+1} = p_i \left(1 - \frac{1}{N}\right)^{N(1-p_i)} \approx p_i e^{-1}$ (for small $p_i$ and large $N$): the node was ignorant during the $i$-th round and no updated node chooses to contact it during the next round.
- With push-pull: $(p_i)^2 \cdot (p_i e^{-1})$

[Figure: probability that a node has not yet been updated (0 to 1.0) versus round (0 to 25) for push, pull and push-pull; N = 10,000.]
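The recurrences can be iterated directly to see how quickly each variant drives the probability of staying ignorant towards zero. This is a sketch, assuming a single source among N nodes (so the starting value is 1 - 1/N) and keeping the push term unapproximated; for push-pull it uses the product form given above. The function name `ignorance_curve` is hypothetical.

```python
def ignorance_curve(n, rounds, mode):
    """Probability p_i that a node is still ignorant after round i, for i = 0..rounds."""
    p = 1.0 - 1.0 / n                       # one source: every other node is ignorant
    curve = [p]
    for _ in range(rounds):
        pull = p * p                                        # p_{i+1} = (p_i)^2
        push = p * (1.0 - 1.0 / n) ** (n * (1.0 - p))       # p_{i+1} = p_i (1 - 1/N)^{N(1 - p_i)}
        if mode == "pull":
            p = pull
        elif mode == "push":
            p = push
        elif mode == "push-pull":
            p = pull * push                                 # product form from the analysis
        else:
            raise ValueError(mode)
        curve.append(p)
    return curve


N = 10000
for mode in ("push", "pull", "push-pull"):
    curve = ignorance_curve(N, 25, mode)
    print(mode, ["%.3g" % curve[i] for i in (5, 10, 15, 20, 25)])
```

In the late rounds pull squares an already small probability, whereas push only shrinks it by a constant factor per round, which is why push alone has the long tail.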


Anti-entropy performance

[Figure: probability that a node has not yet been updated (y-axis, 0 to 1.0) versus round (x-axis, 0 to 25) for push, pull and push-pull; N = 10,000.]


Rumor spreading

Basic model
A server S having an update to report contacts other servers. If a server is contacted to which the update has already propagated, S stops contacting other servers with probability $p_{stop}$.

Observation
If $s$ is the fraction of ignorant servers (i.e., which are unaware of the update), it can be shown that with many servers

$$s = e^{-(1/p_{stop} + 1)(1 - s)}$$
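The equation has no closed-form solution, but it can be solved numerically. The sketch below uses plain fixed-point iteration started at s = 0, which avoids the trivial solution s = 1; the function name `ignorant_fraction` and the iteration count are choices made for this illustration.

```python
import math


def ignorant_fraction(inv_p_stop, iterations=200):
    """Solve s = exp(-(1/p_stop + 1)(1 - s)) for the root below 1.

    inv_p_stop is the value of 1/p_stop.
    """
    s = 0.0                                   # start away from the trivial root s = 1
    for _ in range(iterations):
        s = math.exp(-(inv_p_stop + 1) * (1 - s))
    return s


n = 10000
for k in range(1, 8):
    s = ignorant_fraction(k)
    print(k, round(s, 6), round(n * s))       # 1/p_stop, s, expected ignorant servers
```

For 1/pstop = 1 the iteration settles at roughly 0.2: about a fifth of the servers never hear the rumor.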


Rumor spreading

The effect of stopping

[Figure: fraction s of ignorant servers (y-axis, 0.00 to 0.20) as a function of pstop (x-axis, 0.0 to 1.0).]

Consider 10,000 nodes

1/pstop   s          Ns
1         0.203188   2032
2         0.059520   595
3         0.019827   198
4         0.006977   70
5         0.002516   25
6         0.000918   9
7         0.000336   3

Note
If we really have to ensure that all servers are eventually updated, rumor spreading alone is not enough.


Deleting values

Fundamental problem

We cannot remove an old value from a server and expect the removal to propagate. Instead, mere removal will be undone in due time using epidemic algorithms.

Solution
Removal has to be registered as a special update by inserting a death certificate.

Removing data 61 / 62


Deleting values

When to remove a death certificate (it is not allowed to stay forever)

- Run a global algorithm to detect whether the removal is known everywhere, and then collect the death certificates (looks like garbage collection)
- Assume death certificates propagate in finite time, and associate a maximum lifetime for a certificate (can be done at risk of not reaching all servers)

Note
It is necessary that a removal actually reaches all servers.
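The difference between a mere removal and a death certificate can be shown with two replicas that run push-pull anti-entropy. The sketch rests on assumptions not stated on the slides: every entry carries a timestamp, the newer entry wins during an exchange, and a certificate is stored as an entry whose value is `None`. The class `Replica` and its methods are hypothetical.

```python
class Replica:
    """Key-value replica that records removals as death certificates."""

    def __init__(self):
        self.store = {}   # key -> (timestamp, value); value None marks a death certificate

    def write(self, key, value, now):
        self.store[key] = (now, value)

    def delete(self, key, now):
        self.store[key] = (now, None)          # register the removal as a special update

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def exchange(self, other):
        # Push-pull anti-entropy: for every key the newer entry wins on both sides.
        for key in set(self.store) | set(other.store):
            entries = [e for e in (self.store.get(key), other.store.get(key)) if e is not None]
            newest = max(entries, key=lambda e: e[0])
            self.store[key] = other.store[key] = newest

    def expire(self, now, max_lifetime):
        # Drop certificates older than max_lifetime; unsafe if some replica was never reached.
        for key, (stamp, value) in list(self.store.items()):
            if value is None and now - stamp > max_lifetime:
                del self.store[key]


a, b = Replica(), Replica()
a.write("x", 1, now=1)
a.exchange(b)

del a.store["x"]          # mere removal ...
a.exchange(b)
print(a.read("x"))        # ... is undone: the value comes back from b

a.delete("x", now=2)      # death certificate
a.exchange(b)
print(b.read("x"), "x" in b.store)
```

The certificate occupies space on every replica until `expire` collects it, which is the trade-off behind the two options listed above.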
