An introduction to the inner workings of Amazon Dynamo
Smokehouse Software | Jonathan Lau | [email protected]
THE INNER WORKINGS OF AMAZON DYNAMO
Jonathan Lau Nov 2013
MOTIVATION AND BIO
• Early stage companies
• Build bigger systems
• Specialize in backend systems
DISTRIBUTE / CENTRALIZE
            Distributed                                    Centralized
Data        Different data on each node                    One master copy
Replicas    Replicate a smaller data set for each node     Replicate the master copy into read slaves
Scaling     Data is sharded across the nodes by default    Extra work to shard
WHAT ABOUT NOSQL?
High performance solution != scaling
DYNAMO DESIGN CONSIDERATIONS
• Distributed key value store
• Incremental scalability - Scaling one node at a time
• Decentralized design - Gossip-based protocol for membership and failure detection
• Symmetry - All the nodes have the same functionality
• Heterogeneity - The system will be deployed in an environment with huge variance in hardware and system performance.
HIGH LEVEL CONCEPT
Distribute the data across N nodes in a ring
[Figure: ring of nodes A through H; a put() / get() request for key "K", which is in [C, D)]
DYNAMO’S CHALLENGES
• Data partitioning
• N-1 replicas
• High availability for writes
• Handling temporary failures
• Recovering from permanent failures
• Membership and failure detection
PARTITIONING
[Figure: nodes A, B, C, D on the ring; request for key K in [B, C)]
• Keys are hashed with MD5 into a 128-bit identifier
• Consistent hashing for key partitioning
• Virtual nodes help even out the load distribution
• A request can hit any node on the key's preference list; that node acts as the coordinator
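The partitioning bullets above can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo's actual code: the class name `HashRing`, the helper `md5_token`, the `node#i` naming of virtual nodes and the count of 8 virtual nodes per node are all assumptions made for the example.

```python
import bisect
import hashlib


def md5_token(value: str) -> int:
    """Map a string to a ring position using the 128-bit MD5 digest."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Consistent-hash ring where each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes_per_node=8):
        # Each (token, node) pair is one virtual node; the ring is kept sorted.
        self._ring = sorted(
            (md5_token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, key: str) -> str:
        """Return the node holding the first token at or after the key's hash."""
        index = bisect.bisect_left(self._tokens, md5_token(key))
        return self._ring[index % len(self._ring)][1]  # wrap past the last token


ring = HashRing(["A", "B", "C", "D"])
print(ring.owner("K"))  # one of "A", "B", "C", "D"
```

Adding a node to such a ring only moves the keys that now fall on the new node's tokens, which is what makes scaling one node at a time cheap.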
REPLICATION
• Replicas are stored on the N-1 successor nodes of the coordinator
• The coordinator node and the nodes holding the replicas form the preference list.
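A rough sketch of how a preference list could be built by walking the ring clockwise from the key. The function name `preference_list`, the small integer tokens and the rule that a key belongs to the first token at or after its position are assumptions made for illustration.

```python
import bisect


def preference_list(ring, key_token, n):
    """Collect n distinct nodes, walking clockwise from the key's position.

    ring: list of (token, node) pairs sorted by token. A node may appear
    several times (virtual nodes) but is only counted once.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, key_token)
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]  # wrap around the ring
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


# Hypothetical tokens: A=10, B=20, C=30, D=40; the key lands between B and C.
ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D")]
print(preference_list(ring, 25, 3))  # ['C', 'D', 'A']
```

The first entry is the coordinator for the key and the remaining N-1 entries hold the replicas.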
AVAILABLE FOR WRITES
• Accepts all writes, each based on the version it modified
• Tracks modifications and their base versions with vector clocks
• Stores every accepted write together with its vector clock
• Conflicts are resolved by examining the vector clocks on the objects and reconciling during the read operation
• Consistency issues arise because of network or node failures
• The oldest vector clock entries are purged
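To make the vector clock bullets concrete, here is a small sketch of how clocks can be compared and how stale versions can be dropped at read time. The function names and the node names `Sx`, `Sy`, `Sz` are illustrative assumptions, and purging of old clock entries is left out.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has (a is equal to or newer than b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are ancestors of another version; keep concurrent ones.

    versions: list of (value, clock) pairs.
    """
    survivors = []
    for value, clock in versions:
        dominated = any(
            descends(other, clock) and other != clock for _, other in versions
        )
        if not dominated and (value, clock) not in survivors:
            survivors.append((value, clock))
    return survivors


d2 = increment(increment({}, "Sx"), "Sx")  # {"Sx": 2}
d3 = increment(d2, "Sy")                   # written via Sy, based on d2
d4 = increment(d2, "Sz")                   # written via Sz, also based on d2
print(reconcile([("v2", d2), ("v3", d3), ("v4", d4)]))
# [('v3', {'Sx': 2, 'Sy': 1}), ('v4', {'Sx': 2, 'Sz': 1})]
```

Here `v2` is dropped because both later writes descend from it, while `v3` and `v4` are concurrent, so both are handed back to the reader to reconcile.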
HANDLING TEMPORARY FAILURES
• Trade-off between durability and availability
• Sloppy quorum - a write / read is performed on the first N healthy nodes from the preference list, skipping over nodes that are down.
• Hinted handoff - when a designated node is down, its write is picked up by another node. That write carries a hint about the intended recipient, so the state can be reconstructed once the recipient is back.
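The sloppy quorum and hinted handoff bullets can be illustrated with a toy in-memory model. The `Node` class and the functions `sloppy_write` and `deliver_hints` are hypothetical names for this sketch, which ignores networking, timeouts and W.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}  # key -> value for keys this node is a replica of
        self.hints = []  # (intended_node_name, key, value) held for a down node


def sloppy_write(ring_walk, n, key, value):
    """Write to the first n healthy nodes found while walking the ring from the key.

    ring_walk: nodes in ring order starting at the key; the first n are the
    intended replicas. A healthy node further along stands in for each down one.
    """
    intended = ring_walk[:n]
    intended_down = [node for node in intended if not node.up]
    healthy = [node for node in ring_walk if node.up][:n]
    for node in healthy:
        if node in intended:
            node.store[key] = value
        else:
            # Stand-in: keep the write aside with a hint naming the real owner.
            owner = intended_down.pop(0)
            node.hints.append((owner.name, key, value))
    return [node.name for node in healthy]


def deliver_hints(holder, nodes_by_name):
    """Hand hinted writes back to their owners once those have recovered."""
    remaining = []
    for owner_name, key, value in holder.hints:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.store[key] = value
        else:
            remaining.append((owner_name, key, value))
    holder.hints = remaining


a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
a.up = False
print(sloppy_write([a, b, c, d], 3, "k", "v"))  # ['B', 'C', 'D']
print(d.hints)                                  # [('A', 'k', 'v')]
a.up = True
deliver_hints(d, {"A": a})
print(a.store, d.hints)                         # {'k': 'v'} []
```

With A down, D accepts the write on A's behalf and returns it once A recovers, so the write stays available without giving up the N copies.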
REPLICA SYNCHRONIZATION
• Dynamo uses a Merkle tree to track hashes of the keys
• Replicas exchange only the root hash to check whether they are in sync
• If a replica is found to be out of sync, the node can traverse down the tree to find the exact portion that mismatches.
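A compact sketch of the Merkle tree comparison described above. It assumes both replicas hold the same non-empty, sorted set of keys, and it uses SHA-256 purely as a stand-in hash; `build_tree` and `diff_leaves` are names invented for this example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(items):
    """Build a Merkle tree as a list of levels, leaves first and root last.

    items: non-empty list of (key, value) strings sorted by key. Each parent
    hashes the concatenation of its children's hashes.
    """
    level = [h(f"{k}={v}".encode("utf-8")) for k, v in items]
    levels = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_leaves(a, b, depth=None, index=0):
    """Return leaf indexes that differ, descending only into mismatched subtrees."""
    if depth is None:
        depth = len(a) - 1  # start at the root level
    if a[depth][index] == b[depth][index]:
        return []           # identical subtree: nothing below needs checking
    if depth == 0:
        return [index]
    out = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[depth - 1]):
            out += diff_leaves(a, b, depth - 1, child)
    return out


replica_1 = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]
replica_2 = [("k1", "a"), ("k2", "b"), ("k3", "STALE"), ("k4", "d")]
tree_1, tree_2 = build_tree(replica_1), build_tree(replica_2)
print(tree_1[-1] == tree_2[-1])     # False: the root hashes differ
print(diff_leaves(tree_1, tree_2))  # [2], only k3 needs to be transferred
```

When the roots match, nothing else is exchanged, which is why the root hash alone is enough for the common in-sync case.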
NODE MEMBERSHIP
• Partition and placement information is propagated via a gossip protocol
• Each node will be aware of the token ranges of its peers
• Seed nodes in the cluster speed up the spread of membership and key range information around the ring
• Nodes are not really aware of each other until an actual delete happens
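As a rough illustration of gossip-based membership, the sketch below has each node keep a versioned view of the ring and reconcile it with one random peer per round. The view layout (node name mapped to a version and a token range), the function names and the token ranges are assumptions for the example, not Dynamo's wire format.

```python
import random


def merge(mine, theirs):
    """Merge two membership views, keeping the higher-versioned entry per node.

    A view maps node name -> (version, token_range).
    """
    merged = dict(mine)
    for name, entry in theirs.items():
        if name not in merged or entry[0] > merged[name][0]:
            merged[name] = entry
    return merged


def gossip_round(views, rng):
    """Each node contacts one random peer and the two reconcile their views."""
    names = list(views)
    for name in names:
        peer = rng.choice([other for other in names if other != name])
        combined = merge(views[name], views[peer])
        views[name] = combined
        views[peer] = dict(combined)


# Initially every node only knows its own token range.
views = {
    "A": {"A": (1, (0, 100))},
    "B": {"B": (1, (100, 200))},
    "C": {"C": (1, (200, 300))},
}
rng = random.Random(0)
rounds = 0
while any(len(view) < len(views) for view in views.values()):
    gossip_round(views, rng)
    rounds += 1
print(rounds, sorted(views["A"]))
```

After a few rounds every node holds the token ranges of all its peers without any central registry.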
GET() AND PUT()
What happens during a read or write request?
GET() AND PUT()
• get() and put() are routed through a generic load balancer or through a partition-aware client library
• Any of the top N nodes in the preference list for key K can act as the coordinator.
• Requests go down the list, skipping over unhealthy nodes
• Two configuration parameters: R and W, the minimum number of nodes that must take part in a successful read and write, where R + W > N.
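Why R + W > N matters can be checked by brute force: in a strict quorum over the same N replicas, every read set then shares at least one replica with every write set. The function name below is made up for this sketch, and the check ignores sloppy quorums, where stand-in nodes can weaken the overlap.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_always_overlap(3, 2, 2))  # True: R + W = 4 > N = 3
print(quorums_always_overlap(3, 1, 2))  # False: R + W = 3, a read can miss the write
```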
MORE ON GET() AND PUT()
When a write happens:
• The coordinator generates a vector clock value for the new version
• It sends the new value along with the vector clock value to the N highest-ranked reachable nodes
• If at least W-1 nodes respond, the write is considered successful (the coordinator's own local write counts toward W).
When a read happens:
• The coordinator sends a read request to the N highest-ranked reachable nodes
• It waits for R nodes to respond, then returns the result to the client
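The write and read steps above, modelled with in-memory replicas. This is a simplified sketch: `Replica`, `put` and `get` are hypothetical names, a failed write is not rolled back, and versions returned by a read are only deduplicated rather than reconciled by vector clock.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}  # key -> (value, vector_clock)


def put(coordinator, others, key, value, clock, w):
    """Coordinator writes locally, then needs w - 1 acknowledgements from the others."""
    new_clock = dict(clock)
    new_clock[coordinator.name] = new_clock.get(coordinator.name, 0) + 1
    coordinator.data[key] = (value, new_clock)
    acks = 0
    for replica in others:
        if replica.up:  # a down replica never acknowledges
            replica.data[key] = (value, new_clock)
            acks += 1
    return acks >= w - 1, new_clock


def get(replicas, key, r):
    """Return the distinct versions from the first r responses, or None if fewer than r answer."""
    responses = [rep.data.get(key) for rep in replicas if rep.up]
    if len(responses) < r:
        return None
    versions = []
    for version in responses[:r]:
        if version is not None and version not in versions:
            versions.append(version)
    return versions


a, b, c = Replica("A"), Replica("B"), Replica("C", up=False)  # N = 3, C is down
ok, clock = put(a, [b, c], "k", "v1", {}, w=2)
print(ok, clock)            # True {'A': 1}
print(get([a, b, c], "k", r=2))  # [('v1', {'A': 1})]
```

With W = 2 and R = 2 the request still succeeds while one of the three replicas is down.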
WHAT DOES IT ALL MEAN
How does all of this tie together?
WHAT DOES IT MEAN?
• Dynamo shards the data from day 1
• Replication and redundancy are baked in from day 1
• The configuration parameters W and R have a huge effect on the trade-off between availability and durability.
• W + R > N
• Resolving conflicts at read time allows a more controlled conflict resolution strategy
HAPPY SCALING
Read the Dynamo design paper @
http://bit.ly/QeM8AC