Ship It!!! Coding Reliable Couchbase Applications for Production
Michael Nitschinger, SDK Engineer (@daschl)

Ship It! Coding Reliable Couchbase Applications to Production – Couchbase Live New York 2015


©2015 Couchbase Inc. 2

Warning

In this session you will hear stories of lost packets, corrupted data, confused administrators sending terabytes of logs to even more confused developers, and many other insanely scary things. If the thought of a bit flip frightens you because you have only parity checking and no error correction, this session may not be for you.

Computers were harmed while preparing this talk.

If what you typically type after “catch” involves only the word “log”, this session may help you. If you hope to learn how an HTTP 503 can be useful, this presentation is for you.


Game Show Time (war stories from the field)


Obligatory Raising of Hands

Who here has used Couchbase?
Who has seen this?


Hundredaire!


Question One

System: Virtual machines at a public cloud provider. Node.js application.
Observation: Under load testing, saw high latencies (>100ms).

Causes?

A) Bugs in Couchbase.
B) The system software wasn’t well matched and tested.
C) Running too many Node.js processes for the number of OS CPU cores.
D) It’s the “cosmic rays”, man.

Root cause: The Ethernet device driver in the Linux distro didn’t work well with the virtualized hardware interface, causing high latencies.

Solution: Swap out the Linux OS distribution.
– Went from one that was less common but had better user tooling to one of the most common ones in production deployments.


Question Two

System: Private virtual machines on a private cloud. Strong monitoring and control of the environment.
Observation: As daily load ramped up, latencies would rise and failures to meet the SLA would ensue.

Causes?

A) Bugs in Couchbase.
B) JVM Garbage Collection Pauses.
C) Virtualization is overprovisioned.
D) The NSA wiretap program was slowing things down.

Root cause: Memory resources were overprovisioned on the private cloud.

Solution: Adjust the memory allocation within the environment.
– Also found that the number of Tomcat workers was set unusually high: thousands of worker processes for systems with 8 virtual cores.


Question Three

System: Database running on physical hardware, applications on VMs across the network. SLA need was 50ms or less.
Observation: Regular heartbeat of high latency in the 300-400ms range.

Causes?

A) Bugs in Couchbase.
B) Misconfigured load balancer sending all traffic to one app JVM.
C) Monitoring system interrogating the kernel, causing lock contention.
D) Standing waves from running a 50 Hz power supply under 60 Hz.

Root cause: The monitoring system was inspecting kernel counters on a regular basis and was somehow hitting a hot lock.

Solution: Disable that one poller in the monitor.
– There were no other apps in that environment that had the same latency requirements, so it was assumed that the environment was clean.


Planning for Success


Define & Measure!

[Diagram: Requirements, Develop, Test, Measure, Evaluate]

If it’s not defined you can’t measure it.
– SLAs
– Throughput at max. latency


Ideally from the get-go:
– Error Detection
– Error Recovery
– Error Mitigation


Not just unit testing.
– Stress Tests
– Load Tests
– Failure Tests


You can’t manage what you don’t measure.


Evaluate, rinse, repeat.


Service Level Required

100% uptime is not easily achievable.

For instance, is it 100% available if 50% of your users are leaving because it’s too slow?

The question must always be:

“At max latency, what throughput do I get?”
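One way to answer that question from load test data is sketched below. It is a minimal illustration, not something from the talk: the `LoadStep` type, the choice of p99 as the latency measure, and all numbers are assumptions made up for the example.

```python
from typing import NamedTuple, Optional, Sequence


class LoadStep(NamedTuple):
    """One step of a load test: offered throughput and the latency seen at it."""
    ops_per_sec: float
    p99_latency_ms: float


def max_throughput_within_sla(steps: Sequence[LoadStep],
                              sla_ms: float) -> Optional[float]:
    """Walk the steps in order of increasing load and return the last
    throughput measured before the latency SLA is first violated.
    Returns None if even the lightest step misses the SLA."""
    best = None
    for step in sorted(steps, key=lambda s: s.ops_per_sec):
        if step.p99_latency_ms > sla_ms:
            break  # everything beyond this point is past the SLA
        best = step.ops_per_sec
    return best


# Hypothetical measurements, for illustration only.
measurements = [
    LoadStep(1_000, 4.0),
    LoadStep(5_000, 9.0),
    LoadStep(10_000, 35.0),
    LoadStep(12_000, 180.0),
]
```

With a 50ms SLA, `max_throughput_within_sla(measurements, 50.0)` reports the 10,000 ops/s step; the 12,000 ops/s step is "available" but no longer within the service level.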


Avoid the Coffin Corner

[Figure: coffin corner chart, axes: Height, Speed. Source: http://de.wikipedia.org/wiki/Coffin_Corner#/media/File:CoffinCorner.png]


Both airplanes and your applications do not like the extremes

Resource contention and overload conditions result in high latency

Keep some headroom to fly smoothly
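Why headroom matters can be illustrated with a textbook queueing formula. The sketch below assumes an M/M/1 queue (a single server with random arrivals and service times), which is a simplification chosen for this example and not a model used in the talk; under it, mean latency is the service time divided by (1 - utilization).

```python
def mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).

    service_time_ms: average time to serve one request on an idle system.
    utilization:     fraction of capacity in use, 0 <= rho < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


def latency_table(service_time_ms: float, utilizations):
    """Return (utilization, mean latency) pairs for a range of load levels."""
    return [(u, mean_latency_ms(service_time_ms, u)) for u in utilizations]
```

For a 1ms service time this gives roughly 2ms at 50% utilization, 10ms at 90%, and 100ms at 99%: the last few percent of capacity cost far more latency than all the rest combined.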


Prepare for bad weather

https://stephenthepilot.files.wordpress.com/2015/03/aircraft_deicing.jpg


with Error Detection
– System Monitors
– Periodic Checking
– Watchdogs
– Voting
– Auditing
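As an illustration of one of these techniques, a watchdog can be sketched in a few lines. The `Watchdog` class below is hypothetical and written for this example only: the monitored component calls `kick()` whenever it makes progress, and a periodic checker asks `expired()` to detect a stall.

```python
import time


class Watchdog:
    """Detects a stalled component: if nobody calls kick() within
    timeout_s seconds, expired() starts returning True."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so tests can control time
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the monitored component to signal it is still alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """Called by a periodic checker; True means no recent sign of life."""
        return self._clock() - self._last_kick > self._timeout_s
```

What to do on expiry (alert, restart, fail over) is a separate decision; the watchdog only provides the detection.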


with Error Recovery
– Timeouts
– Failover
– Retries
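Retries and timeouts work best together: retry transient failures, but only within an overall time budget. The sketch below is one possible shape, assuming a hypothetical `TransientError` marker exception and exponential backoff with jitter; neither the names nor the policy come from the talk or from any Couchbase SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, *, deadline_s, base_delay_s=0.01,
                    max_delay_s=0.5, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the overall deadline would be
    exceeded. Delays grow exponentially and are randomized (jitter) so that
    many clients do not retry in lockstep."""
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            attempt += 1
            ceiling = min(max_delay_s, base_delay_s * (2 ** (attempt - 1)))
            delay = random.uniform(0, ceiling)
            # Give up if waiting again would run past the time budget.
            if clock() - start + delay >= deadline_s:
                raise
            sleep(delay)
```

Errors that are not transient (bad arguments, missing documents) propagate immediately; retrying them only adds load.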


with Error Mitigation
– Intelligent Data Structures
– Failing Fast
– Circuit Breakers
– Backpressure


Timeouts

Are your last resort when calling external resources.

So: always use them.
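A minimal sketch of bounding a blocking call with a timeout, using only the Python standard library. The helper names (`call_with_timeout`, `slow_lookup`) and the fallback strategy are assumptions for illustration and not an SDK API; client libraries usually offer their own per-operation timeout settings, which should be preferred where they exist.

```python
import concurrent.futures
import time

# Shared pool for calls we want to bound in time.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread and wait at most timeout_s seconds.
    On timeout, return the fallback instead of blocking the caller."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running keeps running
        # in its worker thread; the caller just stops waiting for it.
        future.cancel()
        return fallback


def slow_lookup():
    """Hypothetical stand-in for a remote call that takes too long."""
    time.sleep(0.3)
    return "value"
```

Here `call_with_timeout(slow_lookup, 0.05, "fallback")` returns the fallback after about 50ms rather than waiting for the slow call to finish.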


Circuit Breakers

– monitor traffic
– open if errors happen
  – Latency
  – Throughput
  – Wrong results
– close in a controlled fashion
– expose metrics
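The behaviour listed above can be sketched as a small state machine. This is a simplified illustration written for these notes: it opens only on consecutive exceptions (not on latency, throughput, or wrong results), and the class name, thresholds, and counters are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0        # consecutive failures seen so far
        self._opened_at = None    # time the circuit opened, None if closed
        self.calls = 0            # exposed metrics
        self.rejected = 0

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"    # allow a trial call through
        return "open"

    def call(self, operation):
        self.calls += 1
        state = self.state
        if state == "open":
            self.rejected += 1
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit in a controlled way: one good call first.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of piling more requests onto a struggling backend, and the `calls` and `rejected` counters give monitoring something to chart.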


Backpressure

– Allows for coordinated flow control under stress conditions
– Is used to shed load and provide a partially good experience

Source: http://mechanical-sympathy.blogspot.co.at/2011/10/smart-batching.html
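One simple form of backpressure is a bounded queue in front of the workers: when it is full, new work is refused right away rather than queued indefinitely. The `BoundedInbox` class below is a hypothetical sketch for illustration; a web tier might translate a refused offer into an HTTP 503 so that clients back off.

```python
import queue


class BoundedInbox:
    """Accepts work up to a fixed capacity and sheds the rest."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.accepted = 0   # simple counters for monitoring
        self.shed = 0

    def offer(self, request) -> bool:
        """Try to enqueue without blocking. False means: overloaded,
        tell the caller to slow down or come back later."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            self.shed += 1
            return False
        self.accepted += 1
        return True

    def take(self):
        """Remove and return the oldest request (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

The requests that are accepted still see bounded queueing delay, which is the "partially good experience" part: some users are served well instead of all users being served badly.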


Testing & Benchmarking


This is NOT a benchmark



Benchmarking

Benchmarks assert expectations while tests verify correctness.

Like with statistics, almost always wrong and biased.

Three (not two) hard problems in computer science:
– Cache Invalidation
– Naming Things
– Benchmarking


The appropriate Workload
– Concurrency
– Think Time

The right Environment
– Hardware, OS
– External effects

The proper Tool
– Measure NOOPs
– Be aware of GC, Coordinated Omission, ...
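Coordinated Omission is easiest to see in a small simulation. The sketch below models a single-threaded load generator that intends to send one request every `interval_ms` but cannot send the next request until the previous one returns. The function and the numbers are made up for illustration; no real benchmark tool is being reproduced here.

```python
def simulate(service_times_ms, interval_ms):
    """Return (naive, corrected) latency lists.

    naive:     measured from when the request was actually sent.
    corrected: measured from when the request was supposed to be sent,
               so time spent stuck behind a slow request is counted.
    """
    naive, corrected = [], []
    free_at = 0.0  # simulated time at which the client can send again
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        actual_start = max(free_at, intended_start)
        finish = actual_start + service
        naive.append(finish - actual_start)
        corrected.append(finish - intended_start)
        free_at = finish
    return naive, corrected


# One 100ms stall in a stream of 1ms requests, one request every 10ms.
naive, corrected = simulate([1, 1, 100, 1, 1], interval_ms=10)
```

The naive numbers show a single slow request (1, 1, 100, 1, 1), while the corrected numbers (1, 1, 100, 91, 82) show that the requests scheduled during the stall were delayed as well. A tool that records only the naive values will understate tail latency.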


And the industry?

Yahoo! Cloud Serving Benchmark (YCSB)
– Industry Standard
– Makes it easy to compare solutions
– Be aware of the (many) pitfalls!

Pioneering a new fork: https://github.com/YCSB/YCSB
– Maintained NoSQL versions
– Coordinated Omission fixes
– ...


Java Microbenchmark Harness (JMH)
– http://openjdk.java.net/projects/code-tools/jmh/
– http://shipilev.net/talks/jvmls-July2013-benchmarking.pdf


Load & Stress Testing

Load Testing
– Determine behaviour during normal traffic

Stress Testing
– Traffic heavily increased (to the “Coffin Corner”)
– Explicitly test edge cases
– Knowing where and how it breaks is important


Failure Testing

Test specific failure cases
– Node failures
– Netsplits
– Firewall issues (dropped packets, closed sockets)

Failures will happen; better to prepare for them early.
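Some failure cases can be rehearsed inside the application's own test suite by wrapping a client call so that it fails on purpose. The `FaultInjector` below is a hypothetical helper sketched for these notes; it complements, and does not replace, killing real nodes or dropping real packets in a test cluster.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a configurable fraction of calls fail
    with ConnectionError, to exercise the caller's error handling."""

    def __init__(self, target, failure_rate: float, seed=None):
        self._target = target
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)   # seedable for repeatable tests
        self.injected = 0

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0, 1): a rate of 1.0 always fails,
        # a rate of 0.0 never does.
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault: simulated dropped connection")
        return self._target(*args, **kwargs)
```

Wrapping a lookup function with `FaultInjector(lookup, failure_rate=0.2, seed=42)` in a test lets the timeout, retry, and fallback paths run long before production exercises them.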

http://www.bloomberg.com/ss/09/04/0427_mdea_awards/image/002_lifepak15monitorde_220a.jpg


Some Tools to Consider


Tools of the trade

Run tools to validate a setup with a reasonably known workload.
– libcouchbase’s cbc pillowfight
– Java’s RoadRunner
– .NET’s MeepMeep

Isolate performance statistics at different layers.
– libcouchbase and Java SDKs have performance profiling abilities
– Couchbase has cbstats timings


Questions?