Description: This is the Expert Q&A from 2600hz and Cloudant on databases in telecom. If you are a service provider, an MSP, or anyone running a VoIP switch, you should definitely check this out.
Powerful, Distributed, API Communications
Call-in Number: 513.386.0101
Pin: 705-705-141
Expert Q&A: Database Edition
May 31st, 2013
Welcome
Our Panelists
Joshua Goldbard
Marketing Ninja, 2600hz, Moderator
Darren Schreiber
Founder, 2600hz
Sam Bisbee
Cloudant
Database: It’s all good until it isn’t
Some background…
What is a Database?
• A record of things remembered or forgotten
• Used to be unbelievably hard; now it’s just hard sometimes
• Modern databases are amazingly resilient
• Failure modes still require lots of attention
• In distributed environments…
• The database is inextricably linked to the network
• The network is always unreliable if public
Masters and Slaves
• Databases have to replicate
• Most databases use a form of master-slave relationship to manage replication and deduplication
• Masters are where new data is entered
• Then it’s mirrored out to the slaves for storage
• If you lose access to the original master, you can convert a slave into a master and restore operation
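The promote-a-slave step above can be sketched in a few lines. This is an illustrative toy model, not any particular database's election code; the `Node` class, sequence numbers, and node names are all invented for the example.

```python
# Toy model of master-slave failover: when the master is lost,
# promote the most up-to-date slave so writes can resume.
class Node:
    def __init__(self, name, role, last_seq):
        self.name = name          # node identifier
        self.role = role          # "master" or "slave"
        self.last_seq = last_seq  # highest replicated sequence number

def promote_slave(nodes):
    """Pick the slave with the most replicated data and make it master."""
    slaves = [n for n in nodes if n.role == "slave"]
    best = max(slaves, key=lambda n: n.last_seq)
    best.role = "master"
    return best

cluster = [Node("db1", "master", 100), Node("db2", "slave", 98), Node("db3", "slave", 95)]
cluster = [n for n in cluster if n.name != "db1"]  # the master fails
new_master = promote_slave(cluster)
print(new_master.name)  # db2: it replicated the most data, so it wins
```

Note the tradeoff the model exposes: db2 was still 2 sequence numbers behind the failed master, so anything not yet mirrored is lost — one reason failover deserves the "lots of attention" mentioned above.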
Durability
Other Replication Strategies
• Other strategies exist, such as…
• Master-Master (what 2600hz uses)
• Tokenized exchange
• Time-delimited
• The most popular methods tend to be Master-Slave or Master-Master
Each database has its advantages and tradeoffs. Once again, there is no magic bullet.
Failure and Quorum
• When a database needs to elect a new master…
• There are many different strategies
• Most involve the concept of quorum (figuring out where the greatest number of copies reside)
• Once quorum is established, a new master is elected and (hopefully) operation can resume
• Quorum works differently in Master-Master setups
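The core quorum rule is simple enough to state in code: a partition may elect a new master only if it can see a strict majority of the cluster. This is a minimal sketch of that majority check, with the cluster sizes invented for illustration.

```python
# Minimal quorum check: a side of a partition may proceed only if it
# holds a strict majority of the cluster's nodes. This guarantees at
# most one side can ever win the election.
def has_quorum(visible_nodes, cluster_size):
    """True if this partition contains more than half the cluster."""
    return visible_nodes > cluster_size // 2

# A 5-node cluster splits 3 / 2: only the 3-node side may elect a master.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False -> this side must wait for the partition to heal
```

The strict majority is the whole point: since two disjoint partitions can never both hold more than half the nodes, the rule prevents two masters from being elected at once.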
CAP Theorem
Databases can have (at most) 2 out of 3 of the following:
• Consistency
• Availability
• Partition Tolerance
Modern database management is a balance between consistency and availability, because all modern networks are unreliable.
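In quorum-replicated stores (BigCouch among them), that consistency/availability balance is often tuned with three knobs: N (replicas per document), W (writes that must acknowledge), and R (replicas consulted per read). A read is guaranteed to overlap the latest acknowledged write whenever R + W > N. A sketch of just that rule, with the numbers chosen for illustration:

```python
# N = replicas per document, W = write acks required, R = replicas read.
# If R + W > N, every read quorum overlaps every write quorum, so a read
# always touches at least one replica holding the latest acknowledged
# write. Otherwise reads are faster but may be stale.
def read_sees_latest_write(n, r, w):
    return r + w > n

print(read_sees_latest_write(3, 2, 2))  # True: overlap guaranteed
print(read_sees_latest_write(3, 1, 1))  # False: lower latency, possibly stale
```

Turning R and W down buys availability and speed at the cost of consistency; turning them up does the reverse — CAP as a dial rather than a switch.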
Examples of Databases
What is Important in a Database?
• Reliable Storage of Data?
• Fast Retrieval of Data?
• Fast Saving of Data?
• Resilience during failures?
• <other>
Examples
• Buying tickets from Ticketmaster
• What’s important and why?
• Withdrawing money from a bank?
• Storing call forwarding settings?
• Storing a list of favorite stocks?
Each scenario has a different set of requirements and constraints. There is no silver bullet; if you could write one database for all these scenarios, you’d be rich.
Which Database is Better?
• STUPID QUESTION
• But I thought there were no stupid questions?
• This is the only stupid question.
• The fight over which database is better is almost always silly
• Databases are tools for getting a job done
• Like the previous examples, each job is different
• Each database stresses different pros/cons
Let’s Get Technical!
Trouble With Databases
• HUGE TOPIC (we’re only going to cover a little)
• Network Partitions
• Layer 1 disasters
• Flapping Internet (a special class of network partition)
Network Partitions
• Common in distributed databases
• When databases lose contact with each other, they can partition
• Caused by unreliable or faulty network connections
• Databases can behave very weirdly when partitioned
Arguably, most of what a database admin does is prepare for network partitions and plan how to resolve them.
Network without Partitions
Network with Partitions
Split-Brain
• During a partition, some databases will elect N masters, one for each partition in the network.
• When the partition is fixed, unless there is a pre-defined restoral procedure, there will be conflicts.
• Databases have all kinds of strategies for handling WAN split-brain failure, but you should understand them.
Key Takeaway: No database is perfect. Understand the automation, but also understand the manual intervention procedure.
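To make the conflict problem concrete: CouchDB-family databases keep both conflicting revisions after a split-brain heals and pick a winner deterministically, leaving the loser available for manual resolution. The toy resolver below mimics that idea — highest revision number wins, ties broken by revision id. The document shape, revision ids, and phone numbers are invented for the example.

```python
# After a split-brain heals, two partitions may hold different revisions
# of the same document. Pick a deterministic winner (highest rev number,
# ties broken by rev id — arbitrary but identical on every node) and
# keep the losers around for manual inspection.
def resolve_conflict(revisions):
    """revisions: list of (rev_num, rev_id, doc). Returns (winner, losers)."""
    ordered = sorted(revisions, key=lambda r: (r[0], r[1]), reverse=True)
    return ordered[0], ordered[1:]

# Each side of the partition edited the same call-forwarding document:
side_a = (2, "aaa", {"call_forward": "+14155550100"})
side_b = (2, "bbb", {"call_forward": "+14155550199"})
winner, losers = resolve_conflict([side_a, side_b])
print(winner[2]["call_forward"])  # "bbb" sorts above "aaa", so side_b wins
```

The key property is not that the winner is "right" — it usually isn't, in any business sense — but that every node picks the *same* winner, so the cluster converges. Deciding which call-forwarding number the customer actually wanted is the manual intervention procedure.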
Layer 1 Failures
• Rut Roh
• Actual physical disaster
• No easy way out except…
• Don’t be in a datacenter that’s hit by a disaster, OR
• Be nimble enough to evade disaster
Evading Disaster
• We’re not magicians; we can’t simply predict disasters
• The next best thing is being able to move, and move fast
• Kazoo requires one line of code to move
• Kazoo moves fast
• Moving the database fast is awesome (thanks, BigCouch!)
During Hurricane Sandy, we cut our datacenters away from Downtown New York to a datacenter above the 100-year flood plain on the East Coast. Result: no downtime.
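Moving a CouchDB/BigCouch database between datacenters really is close to one call: you POST a replication document to the cluster's `/_replicate` endpoint. The sketch below only builds the request body — the host names are invented, and in practice you would POST this JSON to the source cluster with your HTTP client of choice.

```python
import json

# Build the body of a CouchDB/BigCouch /_replicate request that would
# continuously copy a database to another datacenter. Host names are
# invented placeholders; nothing is actually sent here.
def replication_request(source, target, continuous=True):
    return json.dumps({
        "source": source,            # database being evacuated
        "target": target,            # database in the safe datacenter
        "continuous": continuous,    # keep streaming changes as they happen
        "create_target": True,       # create the target db if it's missing
    })

body = replication_request("https://nyc.example.com/accounts",
                           "https://east.example.com/accounts")
print(body)
```

`"continuous": true` is what makes the Sandy-style cutover painless: replication keeps running in the background, so by the time you redirect traffic, the target is already caught up.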
No Silver Bullets
• Layer 1 disasters are a humbling experience
• Don’t rely on datacenters in the path of a storm
• Flooding will brick datacenters that have generators below ground
• To avoid being powerless in a disaster…
• Plan, Test, Analyze, Repeat
• Check out the Netflix Simian Army for examples of tests
Flapping
• Is it up? Is it down? Around and around it goes; where it stops, nobody knows…
• A flapping Internet connection is a special case of network partition, or lost connectivity
• Flapping connections lose contact with other servers, then appear to come back online before going offline again
Why is this bad?
Fixing Flapping
• I’m trying to fix a partition
• The network keeps going up and down
• As I repair my cluster, the repair keeps starting and failing (by attempting to reintegrate the unreliable nodes)
Flapping nodes make everything awful.
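A common defense against flapping is dampening: refuse to reintegrate a node the moment it answers one ping, and instead require a stretch of consecutive healthy checks first. The sketch below is a generic illustration of that idea, not any specific database's logic; the heartbeat traces and threshold are invented.

```python
# Naive repair logic trusts a node after a single successful ping, so a
# flapping node restarts the repair over and over. Dampening waits for a
# run of consecutive good heartbeats before reintegrating the node.
def stable_enough(heartbeats, required=3):
    """True once the trailing run of successful heartbeats reaches `required`."""
    run = 0
    for ok in heartbeats:
        run = run + 1 if ok else 0  # any failure resets the streak
    return run >= required

flappy = [True, False, True, True, False, True, True]
steady = [False, True, True, True, True]
print(stable_enough(flappy))  # False: only 2 good beats since the last failure
print(stable_enough(steady))  # True: 4 consecutive good beats
```

The cost is slower recovery for genuinely healthy nodes; the benefit is that a node bouncing every few seconds never gets the chance to derail the cluster repair.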
Why is the Network Difficult?
“Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be considered a failure. When partitions arise, we have no way to determine what happened on the other nodes: are they alive? Dead? Did they receive our message? Did they try to respond? Literally no one knows. When the network finally heals, we'll have to re-establish the connection and try to work out what happened–perhaps recovering from an inconsistent state.”
-Kyle Kingsbury, Aphyr.com
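Kingsbury's point — that delays are indistinguishable from failure — is exactly why failure detectors are built on timeouts, and why they are guesses. A deliberately minimal sketch, with the timings invented:

```python
# A timeout-based failure detector cannot tell a dead node from a slow
# one: silence is the only observable, so past the timeout we *guess*.
def suspect_failed(last_heartbeat_age, timeout=5.0):
    """Suspect a node once we haven't heard from it for `timeout` seconds."""
    return last_heartbeat_age > timeout

print(suspect_failed(12.0))  # True  -- but the node may just be slow
print(suspect_failed(2.0))   # False -- or it may have died 1.9 seconds ago
```

Both answers can be wrong, which is the quote's whole argument: any timeout you pick trades false alarms against slow detection, and no value eliminates both.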
What does 2600hz use?
• Cloudant BigCouch
• NoSQL database
• Master-Master
• Very sensibly designed for our use case

Why BigCouch?
DEMANDS
1. On-the-fly schema changes
2. Scale in a distributed fashion
3. Configuration changes will happen as we grow
4. Has to be equipment agnostic
5. Accessible raw data view
6. Simple to install and keep up
7. It can’t fail, ergo fault tolerance
8. Multi-master writes
9. Simple (to cluster, to back up, to replicate, to split)

TRADEOFFS
1. Eventual consistency is OK
2. Nodes going offline randomly
3. Multi-server only

Why are we OK with these tradeoffs? They suit our use case.
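The first demand, "on-the-fly schema changes," is worth unpacking: in a document store there is no ALTER TABLE, because documents in the same database need not share fields — a new field simply appears on new documents. A small illustration (the device documents and field names are invented, not Kazoo's actual schema):

```python
# Schemaless documents: two device records in the same "database", where
# the newer one grew a field the older one never had. No migration needed;
# code just checks for the field's presence.
devices = [
    {"_id": "dev1", "owner": "alice", "sip_user": "alice01"},
    # A later release started storing codec preferences per device:
    {"_id": "dev2", "owner": "bob", "sip_user": "bob01", "codecs": ["G722", "PCMU"]},
]

with_codecs = [d["_id"] for d in devices if "codecs" in d]
print(with_codecs)  # only dev2 carries the new field
```

The flip side of the flexibility is that the schema lives in application code: every reader has to tolerate documents written before the field existed, which is part of why eventual consistency and loose schemas tend to travel together.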
Let’s take some time to pontificate about databases at scale…
What are the first things you think of when you get errors reported from the database? What’s your thought process?
Recap
• A database is where you put stuff
• You want your database not to die
• 2600hz uses BigCouch because it’s really awesome technology
• Great for our use case
• Easy to administer
• Resilient and quick to restore
QUESTIONS???