Improving Robustness in Distributed Systems Per Bergqvist per@synapse.se Erlang User Conference 2001 (courtesy CellPoint Systems AB)

Improving Robustness in Distributed Systems

Per Bergqvist per@synapse.se

Erlang User Conference 2001

(courtesy CellPoint Systems AB)

Design base

- Cluster of cooperating hosts
- Erlang and CCOTS hardware based
- Unix based (e.g. Solaris or Linux)
- 10/100/1000 base-T back plane (”system area network”)

Cluster

- Shared, distributed system configuration
- Each host has ONE cluster controller
- Dispatches and supervises worker tasks
- Master cluster controller: holds configuration database (persistent replica)
- Slave cluster controller: gets configuration from master cluster controllers
- Cluster is DOWN when all master cluster controllers are inaccessible
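
The DOWN rule above can be sketched in a few lines. This is only an illustration: the `Controller` record, the host names and the `reachable` flag are assumptions, and the talk does not say how accessibility is actually checked.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MASTER = "master"  # holds a persistent replica of the configuration database
    SLAVE = "slave"    # gets its configuration from a master


@dataclass(frozen=True)
class Controller:
    host: str        # hypothetical host name
    role: Role
    reachable: bool  # outcome of some liveness check (assumed, not from the talk)


def cluster_is_up(controllers):
    """The cluster is DOWN only when every master controller is inaccessible."""
    return any(c.role is Role.MASTER and c.reachable for c in controllers)


controllers = [
    Controller("host-a", Role.MASTER, reachable=False),
    Controller("host-b", Role.MASTER, reachable=True),
    Controller("host-c", Role.SLAVE, reachable=True),
]
print(cluster_is_up(controllers))
```

With one master still reachable the cluster counts as up; reachable slaves alone do not keep it up.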

Typical system

[Figure: network diagram with labels Firewall, Switch, Traffic, Control]

Cluster Key Benefits

Single system view

Enforces decoupling of parts of O&M from actual traffic processing

Implementing a cluster

- Cluster -> Host -> Node -> NodeData
- Cluster global parameters
- Subscription mechanisms for configuration changes
- Mnesia as configuration database on master cluster controllers
- Home-brewed configuration distribution to slave controllers (NOT using Mnesia)
- (Worker) node supervision

Mnesia gotchas

- First distributed node startup
- Disallow writes when all replicas are not accessible
- Use timeout on table load and force load

... BUT ...

TCP based distribution

Network partitioning

Network parameters

- Align TCP retransmission intervals w/ Erlang heartbeats
- Align TCP and IP rerouting parameters
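
A rough sketch of the arithmetic behind the first point. Erlang's `net_ticktime` (default 60 s) means a silent peer is declared down after between 0.75 and 1.25 tick times. The TCP side below assumes plain exponential backoff with a cap; the values for `initial_rto`, `max_rto` and `abort_after` are hypothetical and not taken from the talk or from any particular stack.

```python
def retransmit_times(initial_rto, max_rto, abort_after):
    """Times (seconds after the first send) at which a TCP stack using plain
    exponential backoff would retransmit, up to abort_after seconds."""
    times = []
    elapsed, rto = 0.0, initial_rto
    while elapsed + rto <= abort_after:
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)  # double the timeout, capped at max_rto
    return times


def tick_window(net_ticktime):
    """Earliest and latest time at which Erlang declares a silent peer down."""
    return 0.75 * net_ticktime, 1.25 * net_ticktime


# Hypothetical values, for illustration only.
earliest, latest = tick_window(60)
schedule = retransmit_times(initial_rto=1.0, max_rto=8.0, abort_after=earliest)
print(earliest, latest)
print(schedule)
```

Under these assumptions TCP gets seven retransmission attempts before the earliest point at which the heartbeat mechanism could give up on the peer. One reading of "align" is that the TCP abort interval and the tick window should be tuned so that neither layer gives up long before the other.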

Typical system II: Dual back plane

[Figure: network diagram with labels Firewall, Switch, Traffic, Control]

Erlang multi-homing problem

[Figure: diagram of Host A, Host B and Host C]

Multi-home Erlang w/ TCP

- Add an alias interface to the loopback i/f
- Patch TCP distribution to bind to the alias
- Publish the alias interface via (all wanted) real hw i/f’s
  - Method 1: Static routes and gratuitous/proxy ARP
  - Method 2: Use new (routing) protocol

ARP method

- Implement a utility to:
  - broadcast unsolicited ARP responses
  - respond to ARP requests for the alias i/f address
- Add static routes on all far end systems
- NOTE: all real i/f’s need to be on the same IP subnet
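
As an illustration of the first task, an unsolicited (gratuitous) ARP reply frame can be built as below. This is a sketch, not the utility from the talk: the MAC and IP addresses are made up, and the broadcast target hardware address is one common convention rather than the only one. Actually sending the frame needs a raw link-layer socket and privileges, which are left out.

```python
import socket
import struct


def gratuitous_arp_reply(mac: bytes, alias_ip: str) -> bytes:
    """Build an Ethernet frame carrying an unsolicited ARP reply for alias_ip."""
    broadcast = b"\xff" * 6
    ip = socket.inet_aton(alias_ip)
    ethernet = broadcast + mac + struct.pack("!H", 0x0806)  # EtherType: ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,              # hardware type: Ethernet
        0x0800,         # protocol type: IPv4
        6, 4,           # hardware and protocol address lengths
        2,              # opcode: reply
        mac, ip,        # sender: MAC of the real i/f, IP of the alias
        broadcast, ip,  # target: broadcast MAC, same alias IP
    )
    return ethernet + arp


# Hypothetical addresses, for illustration only.
frame = gratuitous_arp_reply(bytes.fromhex("020000000001"), "192.0.2.10")
print(len(frame))
```

Sender and target protocol addresses are both the alias address, which is what lets neighbours update their ARP caches to point at the real interface that sent the frame.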

New routing protocol

- Broadcast (Ethernet frames) what you have, including interface priority
- Let the far end select path based on what/when they receive
- Far end dynamically sets up host routes
- Use short retransmission intervals
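
The far-end selection step might look roughly like this. The talk does not give the message format, so the `Announcement` fields, the "lower number is preferred" priority rule and the `max_age` cut-off are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Announcement:
    alias_ip: str       # advertised address (the loopback alias)
    via_iface: str      # local interface the broadcast arrived on
    priority: int       # advertised interface priority; lower is preferred here
    received_at: float  # local receive time, in seconds


def select_paths(announcements, now, max_age):
    """Per alias address, pick the best-priority announcement that is still fresh."""
    best = {}
    for a in announcements:
        if now - a.received_at > max_age:
            continue  # stale: nothing heard recently on this path
        current = best.get(a.alias_ip)
        if current is None or (a.priority, -a.received_at) < (
            current.priority, -current.received_at
        ):
            best[a.alias_ip] = a
    # The result is what the far end would install as host routes.
    return {ip: a.via_iface for ip, a in best.items()}


# Hypothetical data: eth0 is preferred but has gone quiet.
seen = [
    Announcement("10.0.0.1", "eth0", priority=1, received_at=0.0),
    Announcement("10.0.0.1", "eth1", priority=2, received_at=9.0),
]
print(select_paths(seen, now=10.0, max_age=3.0))
```

Short retransmission intervals on the sender keep `max_age` small, so a failed link is dropped from the selection quickly.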

Erlang multi-homing resolved?

[Figure: diagram of Host A, Host B and Host C]

Summing up

- Erlang can support multihoming with some additional work
- By using a loopback alias i/f, link failure becomes a routing problem (peer-peer association is kept intact)
- Solaris TCP/IP stack parameters are:
  - hard to find (only in out-of-date app. notes)
  - hard to set ”right”
  - host global
- A distribution mechanism with built-in support for multi-homing is preferred

Erlang Distribution over SCTP

Per Bergqvist et al.

Erlang User Conference 2002