State Machine Replication
Project Presentation
Ido Zachevsky, Marat Radan
Supervisor: Ittay Eyal
Winter Semester 2010
Goals
• Learn and understand Paxos and Python.
• Design a program for a fault-tolerant distributed system using the Paxos algorithm.
• Test on a real Internet-scale system, Planet-Lab.
The Problem – Distributed Storage
• Using Distributed Algorithms on a network has many advantages
• It also has many problems
• This project focuses on the Synchronization Problem
Synchronization
• The task: Successfully run a state machine that spans all the computers of a network
• All the computers need to be in sync regarding the Current State and the Next States.
• All the computers need to know the transitions.
Problems?
• Can any computer choose the next state?
• What if a computer disconnects ungracefully?
• What if a message is delayed due to congestion?
• Other problems…
• Solution: Use a dedicated algorithm
A Solution – Paxos
• Meeting the Safety requirements ensures that the chosen value is one agreed upon by all computers
• Meeting the Liveness requirements ensures that a value will eventually be chosen
Paxos - Background
• Paxos Made Simple, Leslie Lamport, 01 Nov 2001
• Paxos Made Live
Principles
• The system consists of three agent classes:
– Proposers
– Acceptors
– Learners
• Some of the agents are distinguished
• Communicate via messages
Principles – continued
• A single computer – a Leader – is in charge
• Decision cycle in two phases:
1. A majority must promise to commit to a recent proposal.
2. Once a majority has committed, all computers are informed of the Decision.
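The two-phase cycle above can be sketched in memory as follows. This is a minimal illustration, not the project's code: the `Acceptor` class, the `propose` function, and the use of plain integers as proposal numbers are all assumptions, and there is no networking, persistence, or failure handling.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    """Hypothetical in-memory acceptor; a real one would persist its state."""
    promised: int = -1          # highest proposal number promised so far
    accepted_n: int = -1        # number of the last accepted proposal
    accepted_value: Any = None  # value of the last accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None  # reject: an equal or higher number was already promised

    def on_accept(self, n: int, value: Any) -> bool:
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: Any) -> Optional[Any]:
    """Run one decision cycle; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    promises = [p for p in (a.on_prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    accepted = sum(a.on_accept(n, value) for a in acceptors)
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="A")   # chosen: "A"
second = propose(acceptors, n=2, value="B")  # must adopt the earlier "A"
stale = propose(acceptors, n=1, value="C")   # rejected: number too low
```

The second call shows why a later leader cannot overwrite a decision: the promises it collects already carry the accepted value, so it re-proposes that value instead of its own.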
Safety requirements
• Only a value that has been proposed may be chosen,
• Only a single value is chosen, and
• A process never learns that a value has been chosen unless it actually has been.
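The third requirement can be illustrated with a learner that reports a value only after a majority of acceptors have accepted the same proposal. This is a sketch under stated assumptions: the `Learner` class, the `on_accepted` callback, and the string acceptor ids are hypothetical, and values are assumed to be hashable.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Set, Tuple


class Learner:
    """Hypothetical learner that tallies 'accepted' notifications."""

    def __init__(self, n_acceptors: int) -> None:
        self.majority = n_acceptors // 2 + 1
        # (proposal number, value) -> ids of the acceptors that accepted it
        self.votes: Dict[Tuple[int, Any], Set[str]] = defaultdict(set)
        self.chosen: Optional[Any] = None

    def on_accepted(self, acceptor_id: str, n: int, value: Any) -> Optional[Any]:
        # A set ignores duplicate deliveries from the same acceptor.
        self.votes[(n, value)].add(acceptor_id)
        if self.chosen is None and len(self.votes[(n, value)]) >= self.majority:
            self.chosen = value
        return self.chosen


learner = Learner(n_acceptors=3)
r1 = learner.on_accepted("a1", 1, "X")  # one vote: nothing learned yet
r2 = learner.on_accepted("a1", 1, "X")  # duplicate message: still one vote
r3 = learner.on_accepted("a2", 1, "X")  # second distinct acceptor: majority
```

Counting distinct acceptor ids rather than messages keeps a retransmitted notification from being mistaken for a second vote.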
Liveness requirements
• Some proposed value is eventually chosen.
• A process can eventually learn the value which has been chosen.
Implementing a State Machine
• Collection of servers, each implementing a state machine.
• The i-th state machine command in the sequence is the value chosen by the i-th instance of the Paxos consensus algorithm.
• A pre-decided set of commands is necessary.
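A minimal sketch of how a server might apply decided commands, assuming an integer state and a small hypothetical command set (`inc`, `double`, `reset`). The `Replica` class and `on_decision` method are illustrative names, not the project's; decisions are assumed to come from separate Paxos instances indexed from 0.

```python
from typing import Callable, Dict

# Pre-decided command set (hypothetical): name -> transition on an integer state.
COMMANDS: Dict[str, Callable[[int], int]] = {
    "inc": lambda s: s + 1,
    "double": lambda s: s * 2,
    "reset": lambda s: 0,
}


class Replica:
    def __init__(self) -> None:
        self.state = 0
        self.decided: Dict[int, str] = {}  # instance index -> chosen command
        self.next_index = 0                # first instance not yet applied

    def on_decision(self, index: int, command: str) -> None:
        if command not in COMMANDS:
            raise ValueError(f"unknown command: {command!r}")
        self.decided[index] = command
        # Apply strictly in index order and stop at the first gap.
        while self.next_index in self.decided:
            self.state = COMMANDS[self.decided[self.next_index]](self.state)
            self.next_index += 1


log = [(0, "inc"), (1, "double"), (2, "inc")]
a, b = Replica(), Replica()
for index, command in log:
    a.on_decision(index, command)
for index, command in reversed(log):  # same decisions, delivered out of order
    b.on_decision(index, command)
```

Both replicas end in the same state because each one buffers decisions and applies them only once every earlier instance has been decided.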
Planet-Lab
• Planet-Lab is a global research network that supports the development of new network services.
• Understanding the system is required
• Monitoring is necessary
– Generally, implemented via NSSL-lab.
Project Design
• Chosen language for implementation: Python
• Network framework: Twisted Matrix
• Implementation stages:
– Single Decision on NSSL
– Multiple Decisions on NSSL
– Single Decision on Planet-Lab
– Multiple Decisions on Planet-Lab
[Figure: Clients 1 … N and Server 1 … N, joined by The Network. A second diagram shows the server structure: a Reactor Loop, a Listening Socket, Transports, Protocols, and ProtocolFactory objects around the Paxos Algorithm.]
Implementation
• Use Cases
– Acceptor disconnects?
– Leader disconnects? At which stage?
– Acceptor message fails to deliver?
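For the acceptor cases above, the arithmetic behind majority quorums shows how many disconnected or unreachable acceptors a decision can survive. The function names below are hypothetical, and the sketch covers only acceptor loss, not a leader failing mid-decision.

```python
def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority."""
    return n_acceptors // 2 + 1


def tolerated_failures(n_acceptors: int) -> int:
    """How many acceptors may be lost while a majority remains."""
    return n_acceptors - majority(n_acceptors)


def can_decide(n_acceptors: int, alive: int) -> bool:
    """True if the reachable acceptors still form a majority."""
    return alive >= majority(n_acceptors)
```

With 5 acceptors a decision survives 2 losses, while with 4 acceptors it survives only 1, so an even-sized group tolerates no more failures than the odd-sized group one smaller.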
Implementation
• Leader Election
– In fact an inherent part of the algorithm
• Output and monitoring
– Actual output not visible in general
– Only via monitoring
Flow
1. Register Nodes
2. Verify and install necessary files
3. Upload
4. Initiate Monitor
5. Run and wait for activity
6. Review results
Implementation – File Structure
• Initial Installation
– Installation: my_install (csh)
– Initial Communication: send_install (py)
– Alive Machines Server: install_serv (py)
• Uploading and Running
– Deployment: my_deploy (csh)
– Multi-Run: my_multirun (csh)
– Multi-Stop: my_multistop (csh)
• Core Paxos Program
– Paxos Instance: paxos_inst (py)
– Paxos Algorithm: paxos_alg (py)
– Network Data: paxos_net_data (txt)
• Service Scripts and Files
– Alive Nodes listnodes (txt)
– Paxos Monitor: paxos_mon_serv (py)
– Additional files: combine_nodes (csh), conv_nodes (csh), remove_done (csh)
Results
• Everything works at the NSSL
• In Real-Life, not necessarily
• Communication phenomena – messages arriving unordered, in large chunks, etc.
• Works well for up to 20-30 Nodes
• Use cases tested in Lab
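Messages arriving "in large chunks" is typical of stream transports, which deliver bytes without preserving message boundaries. One common remedy, sketched here as an assumption rather than as the project's actual protocol, is to delimit each message explicitly and buffer partial input. The `LineFramer` class, the `encode` helper, and the newline-delimited JSON format are all hypothetical.

```python
import json
from typing import Any, List


class LineFramer:
    """Splits a byte stream into newline-delimited JSON messages."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, chunk: bytes) -> List[Any]:
        # Append the new bytes, emit every complete line, keep the remainder.
        self._buffer += chunk
        *complete, self._buffer = self._buffer.split(b"\n")
        return [json.loads(line) for line in complete if line]


def encode(message: Any) -> bytes:
    # json.dumps escapes newlines, so b"\n" is safe as a delimiter.
    return json.dumps(message).encode("utf-8") + b"\n"


framer = LineFramer()
stream = encode({"type": "prepare", "n": 1}) + encode({"type": "accept", "n": 1})
part1 = framer.feed(stream[:10])  # partial first message: nothing emitted
part2 = framer.feed(stream[10:])  # rest arrives as one chunk: two messages
```

Framing addresses only the chunking; messages that arrive out of order still have to be sorted out by the algorithm itself, for example through proposal numbers.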
Conclusions
• Preliminary work needed to understand Twisted Matrix and Planet-Lab
• Dealing with network problems
– SSH Tunnel instead of “real” monitoring
• Requirements fulfilled
Further work
• Optimize networking protocol
– Improve client-server interface
– Inefficient startup – N(N-1) for N machines
• Partition Decision processes
– Only few nodes decide each resolution
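To give a feel for the N(N-1) startup cost mentioned above, the following sketch assumes the figure counts connections opened when every machine connects to every other machine. That reading is an assumption, and the function names are hypothetical.

```python
def startup_connections(n_machines: int) -> int:
    """Connections opened if every machine dials every other machine."""
    return n_machines * (n_machines - 1)


def shared_connections(n_machines: int) -> int:
    """Connections needed if each pair of machines shares a single link."""
    return n_machines * (n_machines - 1) // 2


costs = {n: startup_connections(n) for n in (10, 20, 30)}
```

Under this assumption the cost grows quadratically: 90 connections for 10 machines, 380 for 20, and 870 for 30.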
Thank you