DISTRIBUTED COORDINATION WITH PYTHON
Ben Bangert, Mozilla
Tools of the Trade
DISTRIBUTED COORDINATION IS NOT...
• Distributed Databases (Cassandra, Riak)
• Distributed Computing (Hadoop, etc.)
• Distributed Event Analysis (Storm)
The Common Element
Apache ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services.
ZOOKEEPER
WHY NOT USE...
• Memcached?
• MongoDB?
• Postgres/MySQL?
• Hierarchical data structure in znodes
• Session Based
• Znode watches
• Ephemeral and Sequential Znodes
EPHEMERAL ZNODES
• Last for the duration of the client session
• Session dies when the connection is closed or expires
• Can’t have child znodes
SEQUENTIAL ZNODES
• Supply a node name (or not), get node name back with a trailing sequence number (0001, 0002, 0003, etc.)
• Can be combined with ephemeral flag
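As a rough illustration of the naming scheme: ZooKeeper appends a zero-padded, 10-digit counter to the requested name (the slide abbreviates it to four digits). The helper below, `split_sequence`, is a hypothetical name and not part of any client library; it only shows how a returned node name can be split back into its prefix and sequence number.

```python
SEQUENCE_DIGITS = 10  # ZooKeeper pads the sequence counter to 10 digits


def split_sequence(node_name):
    """Split a name such as 'lock-0000000003' into ('lock-', 3)."""
    prefix = node_name[:-SEQUENCE_DIGITS]
    suffix = node_name[-SEQUENCE_DIGITS:]
    if len(suffix) != SEQUENCE_DIGITS or not suffix.isdigit():
        raise ValueError(f"no sequence suffix in {node_name!r}")
    return prefix, int(suffix)
```

Sorting by the integer value rather than by the raw string keeps the ordering correct even when clients use different prefixes.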
BASIC COMMANDS
• create(PATH, DATA...)
• get(PATH...)
• get_children(PATH...)
• set(PATH, DATA...)
• delete(PATH...)
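A minimal sketch of these commands as they appear in kazoo, assuming `client` is an already started `KazooClient`. The function name and the `/demo` path are illustrative only, and a real run would raise an error if `/demo/item` already exists.

```python
def basic_commands_demo(client, root="/demo"):
    """Exercise create/get/get_children/set/delete on a started client."""
    client.ensure_path(root)                     # create the parent if missing
    path = client.create(root + "/item", b"v1")  # returns the created path
    data, stat = client.get(path)                # raw bytes plus node metadata
    children = client.get_children(root)         # child names, not full paths
    client.set(path, b"v2")                      # overwrite the node data
    updated, _ = client.get(path)
    client.delete(path)                          # remove the znode again
    return data, children, updated
```

Node data is always bytes, so any structured value has to be serialized by the caller.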
PYTHON CLIENTS
• txzookeeper
• kazoo
• unified client that works with gevent
• implements wire protocol in pure Python
USE KAZOO
EASY TO USE
from kazoo.client import KazooClient
client = KazooClient()
client.start()
USE CASES
CONFIGURATION
• Store settings in node data
• Organize node structure
• Set watches on nodes of interest
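One way this could look with kazoo's `DataWatch`, assuming settings are stored as JSON in a single znode and `client` is a started `KazooClient`. The JSON encoding and the names `decode_settings` and `watch_settings` are assumptions made for this sketch.

```python
import json


def decode_settings(data):
    """Turn raw znode bytes into a dict; a missing or empty node yields {}."""
    if not data:
        return {}
    return json.loads(data.decode("utf-8"))


def watch_settings(client, path, on_change):
    """Call on_change(settings) now and again whenever the znode changes."""
    @client.DataWatch(path)
    def _changed(data, stat):
        # kazoo passes data=None and stat=None while the node does not exist
        on_change(decode_settings(data))
    return _changed
```

The callback runs on kazoo's own thread, so `on_change` should return quickly and hand heavier work to another thread.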
PARTY MEMBERSHIP
• Join a party, find out who else is around
• Elect a leader if desired
• Recipe in Kazoo
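A short sketch of the party and election recipes, assuming `client` is a started `KazooClient`. The paths, identifiers and function names are placeholders chosen for illustration.

```python
def join_party(client, path, identifier):
    """Join the party at `path`; return it with the current member list."""
    party = client.Party(path, identifier)
    party.join()
    members = sorted(party)  # iterating a Party yields member identifiers
    return party, members


def lead_when_elected(client, path, identifier, leader_func):
    """Block until this client wins the election, then run leader_func."""
    election = client.Election(path, identifier)
    election.run(leader_func)
```

Membership is backed by ephemeral nodes, so a member whose session dies drops out of the party without any explicit cleanup.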
LOCKS
• Lock a resource for a single client
• Lock a resource for multiple clients (Semaphore)
• Hard to write properly
• Recipe in Kazoo
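Using the recipes might look like the following, assuming `client` is a started `KazooClient`. The wrapper functions and their names are hypothetical; only `client.Lock` and `client.Semaphore` come from kazoo.

```python
def with_exclusive_lock(client, path, identifier, work):
    """Run work() while holding the exclusive lock at `path`."""
    lock = client.Lock(path, identifier)
    with lock:  # blocks until acquired, releases on exit
        return work()


def with_lease(client, path, identifier, max_leases, work):
    """Run work() while holding one of `max_leases` semaphore leases."""
    semaphore = client.Semaphore(path, identifier, max_leases=max_leases)
    with semaphore:
        return work()
```

The `with` form releases the lock even when `work()` raises, which is one of the details that is easy to miss in a hand-written version.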
BUILDING HIGHER-LEVEL ABSTRACTIONS ON ZOOKEEPER
CAVEAT
DO NOT IMPLEMENT IT YOURSELF; USE THE RECIPE
BASIC STEPS
• Create lock parent node if needed
• Create ephemeral+sequence node under parent, store node name returned
• Get children of lock node
• Sort children list by sequence number
• First child in the list has the lock!
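The steps above can be walked through without a server using a small in-memory stand-in. This is a teaching sketch only: `InMemoryLockDir` is a hypothetical class that mimics sequence-suffixed names, and it ignores sessions, watches and failures, which is exactly what the real recipe handles.

```python
import itertools


class InMemoryLockDir:
    """Stand-in for a lock parent znode; hands out sequence-suffixed names."""

    def __init__(self):
        self._counter = itertools.count()
        self.children = []

    def create_sequential(self, prefix):
        # Mimics an ephemeral+sequence create: prefix plus 10-digit counter
        name = f"{prefix}{next(self._counter):010d}"
        self.children.append(name)
        return name

    def delete(self, name):
        self.children.remove(name)


def holds_lock(lock_dir, my_node):
    """List the children, sort by sequence number, check who is first."""
    ordered = sorted(lock_dir.children, key=lambda name: int(name[-10:]))
    return bool(ordered) and ordered[0] == my_node
```

Deleting the lowest node, as happens when the holder releases the lock or its session ends, makes the next node in sequence order the new holder.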
THINGS TO WATCH OUT FOR
• Avoid the thundering herd, use watches only when needed
• When our node isn’t the lowest, watch the one in front of us
• Only one client wanting a lock is ‘woken’ when the lock is released by a different client
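Picking the node to watch is a small sorting problem. The sketch below assumes child names end in ZooKeeper's 10-digit sequence suffix; `node_to_watch` is a hypothetical helper, not a kazoo function.

```python
def node_to_watch(children, my_node):
    """Return the node just ahead of ours, or None if we hold the lock."""
    ordered = sorted(children, key=lambda name: int(name[-10:]))
    position = ordered.index(my_node)  # ValueError if our node is gone
    return ordered[position - 1] if position > 0 else None
```

Because each waiter watches only its immediate predecessor, a release triggers one notification instead of waking every waiting client.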
HANDLING FAILURE
ROBUST CODE TAKES EFFORT
• What happens when a server fails?
• What happens when the client fails?
• What happens when we don’t know if the server has failed?
STOPPING WHEN UNCERTAIN
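One possible way to stop when the connection state is uncertain is to gate work on a connection listener. The sketch assumes kazoo's `KazooState` constants compare equal to the strings below (real code should import `KazooState` from `kazoo.client` and compare against it); `ConnectionGate` is a hypothetical class.

```python
import threading

# Assumption: these mirror KazooState.CONNECTED / SUSPENDED / LOST in kazoo.
CONNECTED, SUSPENDED, LOST = "CONNECTED", "SUSPENDED", "LOST"


class ConnectionGate:
    """Tracks whether it is currently safe to act on coordinated state."""

    def __init__(self):
        self._connected = threading.Event()
        self.session_lost = False  # stays True until the caller resets it

    def listener(self, state):
        if state == CONNECTED:
            self._connected.set()
        else:
            # SUSPENDED or LOST: we no longer know what the server sees
            self._connected.clear()
            if state == LOST:
                self.session_lost = True  # ephemeral nodes are gone

    def safe_to_work(self):
        return self._connected.is_set()
```

It would be registered with `client.add_listener(gate.listener)`, and worker code would check `gate.safe_to_work()` before each step that depends on holding a lock or leadership.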
A BIT BETTER VERSION...
EVEN BETTER
FAILURE WILL HAPPEN
• Fail fast, fail completely.
• Session expiration is a good time to sys.exit
• Always include jitter (kazoo includes jitter on its connection and command retry operations)
• Consider what exceptions can occur in any code relying on a distributed system
• Distributed systems are hard
• Use existing battle-proven tools (ZooKeeper, Kazoo)
• Always consider everything that can fail, and how
• Be wary of tools that don’t tell you how they fail
• Read Kyle Kingsbury’s Jepsen posts to see examples of systems failing: http://aphyr.com/tags/jepsen
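To make the earlier advice about jitter concrete, here is a generic capped exponential backoff with full jitter. It is not kazoo's own retry implementation; the function name, parameters and defaults are all chosen for illustration.

```python
import random


def backoff_delays(attempts, base=0.1, factor=2.0, cap=30.0, rng=random.random):
    """Yield one delay in seconds per retry: capped exponential, full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        # Scale by a random fraction so clients do not retry in lockstep
        yield rng() * ceiling
```

Without the random factor, every client that lost its connection at the same moment would also reconnect at the same moment, recreating the load spike that caused the trouble.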
FIN
QUESTIONS?