Ship It!!! Coding Reliable Couchbase Applications for Production
Michael Nitschinger, SDK Engineer
@daschl
©2015 Couchbase Inc. 2
Warning
In this session you will hear stories of lost packets, corrupted data, confused administrators sending terabytes of logs to even more confused developers and many other insanely scary things. If the thought of a bit flip frightens you because you have only parity checking and no error correction, this session may not be for you.
Computers were harmed while preparing this talk.
If what you typically type after “catch” involves only the word “log”, this session may help you. If you hope to learn how an HTTP 503 can be useful, this presentation is for you.
Game Show Time (war stories from the field)
Obligatory Raising of Hands
Who here has used Couchbase?
Who has seen this?
Hundredaire!
Question One
System: Virtual machines at a public cloud provider. Node.js application.
Observation: Under load testing, saw high latencies (>100ms).
Causes?
A) Bugs in Couchbase.
B) The system software wasn’t well matched and tested.
C) Running too many Node.js processes for the number of OS CPU cores.
D) It’s the “cosmic rays” man.
Root cause: The Ethernet device driver in the Linux distro didn’t work that well with the virtualized hardware interface, causing high latencies.
Solution: Swap out the Linux OS distribution.
– Went from one that was less common but had better user tooling to one of the most common ones in production deployments.
Question Two
System: Private virtual machines on a private cloud. Strong monitoring and control of the environment.
Observation: As daily load would ramp, latencies would rise and failures to meet the SLA would ensue.
Causes?
A) Bugs in Couchbase.
B) JVM Garbage Collection Pauses.
C) Virtualization is overprovisioned.
D) The NSA wiretap program was slowing things down.
Root cause: Memory resources were overprovisioned on the private cloud.
Solution: Adjust the memory allocation within the environment.
– Also found that the number of Tomcat workers was set unusually high: thousands of worker processes for systems with 8 virtual cores.
Question Three
System: Database running on physical hardware, applications on VMs across the network. SLA need was 50ms or less.
Observation: Regular heartbeat of high latency in the 300-400ms range.
Causes?
A) Bugs in Couchbase.
B) Misconfigured load balancer sending all traffic to one app JVM.
C) Monitoring system interrogating the kernel causing lock contention.
D) Standing waves from running a 50Hz power supply under 60Hz.
Root cause: The monitoring system was inspecting kernel counters on a regular basis and was somehow hitting a hot lock.
Solution: Disable that one poller in the monitor.
– There were no other apps in that environment that had the same latency requirements, so it was assumed that the environment was clean.
Planning for Success
Define & Measure!
(Cycle diagram: Requirements, Develop, Test, Measure, Evaluate)
Requirements
If it’s not defined, you can’t measure it.
– SLAs
– Throughput at max. latency
Define & Measure!
Develop
Ideally from the get-go: Error Detection, Error Recovery, Error Mitigation
Define & Measure!
Test
Not just unit testing: Stress Tests, Load Tests, Failure Tests
Define & Measure!
Measure
You can’t manage what you don’t measure.
Define & Measure!
Evaluate
Evaluate, rinse, repeat.
Service Level Required
100% uptime is not easily achievable.
For instance, is it 100% available if 50% of your users are leaving because it’s too slow?
The question must always be: “At max latency, what throughput do I get?”
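One way to answer that question from load-test data, as a minimal sketch: run the system at several load levels, record latencies for each, and report the highest throughput whose tail latency still meets the SLA. The function names, the 50ms SLA, the choice of the 99th percentile and the sample data below are all hypothetical and only illustrate the idea.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

def max_throughput_within_sla(runs, sla_ms, pct=99.0):
    """Highest measured throughput whose pct-th percentile latency meets the SLA.

    runs: list of (ops_per_sec, [latency_ms, ...]) pairs, one per load level.
    Returns None when no load level meets the SLA.
    """
    passing = [ops for ops, latencies in runs if percentile(latencies, pct) <= sla_ms]
    return max(passing) if passing else None

# Hypothetical load-test results: (offered ops/s, sampled latencies in ms).
runs = [
    (1000, [2] * 99 + [5]),
    (5000, [4] * 99 + [40]),
    (9000, [10] * 90 + [300] * 10),
]
print(max_throughput_within_sla(runs, sla_ms=50))  # prints 5000
```

In this made-up data the 9000 ops/s run has the highest throughput but fails the SLA at the tail, so 5000 ops/s is the number to plan with.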
Avoid the Coffin Corner
(Figure: coffin corner chart with axes Height and Speed. Source: http://de.wikipedia.org/wiki/Coffin_Corner#/media/File:CoffinCorner.png)
Both airplanes and your applications do not like the extremes.
Resource contention and overload conditions result in high latency.
Keep some headroom to fly smoothly.
Prepare for bad weather
https://stephenthepilot.files.wordpress.com/2015/03/aircraft_deicing.jpg
... with Error Detection: System Monitors, Periodic Checking, Watchdogs, Voting, Auditing
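As a rough illustration of the watchdog and periodic-checking ideas, here is a minimal sketch: the monitored component checks in regularly, and a periodic checker treats a missing check-in as a stall. The class and method names are hypothetical, and the fake clock exists only so the example runs instantly.

```python
import time

class Watchdog:
    """Flags a component as stalled when it has not checked in recently."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self):
        """Called by the monitored component to signal it is still making progress."""
        self._last_kick = self._clock()

    def expired(self):
        """Called by a periodic checker; True means the component looks stalled."""
        return self._clock() - self._last_kick > self.timeout_s

# Usage with a fake clock (hypothetical values).
now = [0.0]
dog = Watchdog(timeout_s=5.0, clock=lambda: now[0])
now[0] = 3.0
print(dog.expired())  # False: last check-in was 3 seconds ago
now[0] = 6.0
print(dog.expired())  # True: no check-in for more than 5 seconds
dog.kick()
print(dog.expired())  # False again after the check-in
```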
... with Error Recovery: Timeouts, Failover, Retries
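For the retry part, a minimal sketch under stated assumptions: the helper below retries only errors considered transient, backs off exponentially and adds jitter. It is not a Couchbase SDK API; the function name, the defaults and the choice of `ConnectionError` and `TimeoutError` as retryable are assumptions made for illustration.

```python
import random
import time

def retry(operation, attempts=3, base_delay_s=0.01,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error wait and try again, up to attempts times."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay_s * (2 ** attempt)    # exponential backoff
            sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries

# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "document"

print(retry(flaky_read))  # prints "document" on the third attempt
```

Retries only make sense for operations that are safe to repeat, and they should be bounded so that a struggling server is not hit even harder.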
... with Error Mitigation: Intelligent Data Structures, Failing Fast, Circuit Breakers, Backpressure
Timeouts
Are your last resort when calling external resources, so always use them.
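A minimal sketch of bounding a call to an external resource with a timeout, using only the Python standard library. The helper name, the pool size and the two lookup functions are hypothetical stand-ins for a real network call, not the API of any SDK.

```python
import concurrent.futures
import time

# Shared worker pool for the sketch; the size is an arbitrary assumption.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback=None):
    """Run fn() on a worker thread; return fallback if no result arrives in timeout_s."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort only: an already running call keeps running
        return fallback

def fast_lookup():
    return "value"

def slow_lookup():
    time.sleep(0.3)  # stands in for a stalled network call
    return "value"

print(call_with_timeout(fast_lookup, timeout_s=1.0))
print(call_with_timeout(slow_lookup, timeout_s=0.05, fallback="timed out"))
```

The caller regains control after the timeout even though the slow call is still running in the background, which is why a timeout bounds the caller's latency but does not free the underlying resource by itself.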
Circuit Breakers
Monitor traffic.
Open if errors happen:
– Latency
– Throughput
– Wrong results
Close in a controlled fashion.
Expose metrics.
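The behaviour listed above can be sketched as a small state machine. This is a simplified illustration, not a production implementation: it only counts raised exceptions as errors (latency or wrong-result checks would need extra hooks), and all names and thresholds are hypothetical.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.metrics = {"success": 0, "failure": 0, "rejected": 0}

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # controlled close: let one trial call through
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            self.metrics["rejected"] += 1
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = operation()
        except Exception:
            self.metrics["failure"] += 1
            self._failures += 1
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self.metrics["success"] += 1
        self._failures = 0
        self._opened_at = None  # a successful call closes the circuit
        return result
```

While the circuit is open, callers fail immediately instead of queueing up behind a struggling resource, and the `metrics` dictionary gives monitoring something to read.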
Backpressure
Allows for coordinated flow control under stress conditions.
Is used to shed load and provide a partially good experience.
Source: http://mechanical-sympathy.blogspot.co.at/2011/10/smart-batching.html
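A minimal sketch of load shedding with a bounded queue: when the inbox is full, new work is rejected immediately instead of piling up. The class name and capacity are hypothetical; in a web application a rejected offer could, for example, be turned into an HTTP 503 so the client knows to back off.

```python
import queue

class BoundedInbox:
    """Accepts work up to a fixed capacity and sheds the rest."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.accepted = 0
        self.shed = 0

    def offer(self, request):
        """Try to enqueue without blocking; False tells the caller to back off."""
        try:
            self._q.put_nowait(request)
        except queue.Full:
            self.shed += 1
            return False
        self.accepted += 1
        return True

    def drain(self, handler, max_items):
        """Process up to max_items queued requests; returns how many were handled."""
        handled = 0
        while handled < max_items:
            try:
                item = self._q.get_nowait()
            except queue.Empty:
                break
            handler(item)
            handled += 1
        return handled

inbox = BoundedInbox(capacity=2)
results = [inbox.offer(r) for r in ("a", "b", "c")]
print(results, inbox.shed)  # prints [True, True, False] 1
```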
Testing & Benchmarking
This is NOT a benchmark
Benchmarking
Benchmarks assert expectations while tests verify correctness.
Like statistics, almost always wrong and biased.
Three (not two) hard problems in computer science:
– Cache Invalidation
– Naming Things
– Benchmarking
Benchmarking
The appropriate Workload
– Concurrency
– Think Time
The right Environment
– Hardware, OS
– External effects
The proper Tool
– Measure NOOPs
– Be aware of GC, Coordinated Omission, ...
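Coordinated omission is easiest to see in a small simulation. The sketch below models a single-threaded client that intends to send one request every `interval_ms` but blocks while a slow request is outstanding. Measuring from the actual send time hides the queueing delay; measuring from the intended send time exposes it. The function name and the numbers are hypothetical, and real tools correct for this in more elaborate ways.

```python
def simulate(service_times_ms, interval_ms):
    """Return (naive, corrected) latency lists for a blocking fixed-rate client."""
    naive, corrected = [], []
    clock = 0.0
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        # The client cannot send until the previous request has finished.
        actual_start = max(clock, intended_start)
        finish = actual_start + service
        naive.append(finish - actual_start)        # what a naive harness records
        corrected.append(finish - intended_start)  # what a waiting user experiences
        clock = finish
    return naive, corrected

# One 100ms stall in an otherwise 1ms service, one request planned every 10ms.
naive, corrected = simulate([1, 1, 100, 1, 1], interval_ms=10)
print(naive)      # [1.0, 1.0, 100.0, 1.0, 1.0]
print(corrected)  # [1.0, 1.0, 100.0, 91.0, 82.0]
```

The naive numbers show a single outlier, while the corrected ones show that every request scheduled during the stall was slow as well.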
And the industry?
Yahoo! Cloud Serving Benchmark (YCSB)
– Industry standard
– Makes it easy to compare solutions
– Be aware of the (many) pitfalls!
Pioneering a new fork: https://github.com/YCSB/YCSB
– Maintained NoSQL versions
– Coordinated Omission fixes
– ...
And the industry?
Java Microbenchmark Harness (JMH): http://openjdk.java.net/projects/code-tools/jmh/
http://shipilev.net/talks/jvmls-July2013-benchmarking.pdf
Load & Stress Testing
Load Testing
– Determine behaviour during normal traffic
Stress Testing
– Traffic heavily increased (to the “Coffin Corner”)
– Explicitly test edge cases
– Knowing where and how it breaks is important
Failure Testing
Test specific failure cases
– Node failures
– Netsplits
– Firewall issues (dropped packets, closed sockets)
Failures will happen, so it is better to prepare for them early.
http://www.bloomberg.com/ss/09/04/0427_mdea_awards/image/002_lifepak15monitorde_220a.jpg
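A failure test can start small: wrap the dependency in a test double that injects a specific fault and check that the application code degrades the way you expect. The sketch below is a hypothetical in-memory stand-in, not a Couchbase API; `FlakyStore` and `read_with_fallback` are names invented for this example.

```python
class FlakyStore:
    """In-memory store that raises on selected calls, simulating a node failure."""

    def __init__(self, data, fail_on_calls):
        self._data = dict(data)
        self._fail_on = set(fail_on_calls)
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.calls in self._fail_on:
            raise ConnectionError("injected fault: node unreachable")
        return self._data[key]

def read_with_fallback(store, key, default):
    """Application code under test: serve a default when the store is unreachable."""
    try:
        return store.get(key)
    except ConnectionError:
        return default

# Inject a failure on the second call only.
store = FlakyStore({"user::1": "alice"}, fail_on_calls={2})
print(read_with_fallback(store, "user::1", default="guest"))  # alice
print(read_with_fallback(store, "user::1", default="guest"))  # guest
print(read_with_fallback(store, "user::1", default="guest"))  # alice
```

The same pattern extends to netsplits and dropped packets when the faults are injected at the network level instead of in a test double.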
Some Tools to Consider
Tools of the trade
Run tools to validate a setup with a reasonably known workload.
– libcouchbase’s cbc pillowfight
– Java’s RoadRunner
– .NET’s MeepMeep
Isolate performance statistics at different layers.
– libcouchbase and Java SDKs have performance profiling abilities
– Couchbase has cbstats timings
Questions?