Making clouds go faster, for fun and profit!

DESCRIPTION

Everyone loves it when things are fast, and that holds true whether you're visiting http://www.livingsocial.com or hitting the OpenStack Nova API and asking, "Please show me all the instances which I've got running". Nobody ever writes in to support saying, "All of my API calls are completing far too quickly. Slow it down!" Optimizing the performance of software is arguably a never-ending crusade. At some point you'll get things fast enough that you can say, "Any effort invested beyond this point is not adding value for the business", but then along comes new code which adds a zillion awesome features and also regresses performance back to a level where it needs another tune-up.

In the process of transforming our infrastructure and preparing our new OpenStack IaaS to host all our applications, we've been looking for performance wins across the whole stack. We've got some aggressive targets to meet. We've investigated many hardware options and chosen the one that suits us best, we've instrumented some of the OpenStack APIs and benchmarked them, and whilst we're not done yet, we do have a "Half-Time Match Report". Join me as I walk through our learnings so far and propose follow-on areas for investigation and optimization.

This slide intentionally left blank.

Wednesday, 17 October 12

MAKING CLOUDS GO FASTER
FOR FUN AND PROFIT

Speakers
Who crafted this talk?

Alex Howells
@nixgeek

Technical Operations
LivingSocial

alex.howells@livingsocial.com
http://github.com/agh

Paul Thomas
@ftergl0w

Technical Operations
LivingSocial

paul.thomas@livingsocial.com
http://github.com/AfterGlow

Bedtime Reading
You can get a copy of these slides after the talk -

https://speakerdeck.com/u/nixgeek

Problem?

Performance
It doesn’t need to be rocket science.

It does matter though!

I promise I’m not trolling you.

“Oh man, that was too fast!
It’s so much better now it’s slow!!”

-- Average User

In a parallel universe...

YEAH RIGHT
I wish I had users who were that easy to please!

But since we live in the real world...

“Why is that dude smiling?!
This is too slow!

Why can’t it be faster?”

-- Average Users

In our universe...

THINGS ARE IMPROVING
Cactus => Diablo => Essex => Folsom

But things can improve faster with focus!

Today

Mostly reliable,
but can be a bit slow!

The Future?

Faster. More scalable.
A real driving experience.

Why should I listen to you?

What’s the big deal?

WE’RE A LOT LIKE YOU!
Developers. Operators. Engineers. Users.
We see potential. We see opportunities.

Airspace
LivingSocial PaaS

We care about speed because ...

* Scaling services up/down needs to happen fast!
* Needing to maintain huge pools of “slack capacity” to account for sudden spikes in traffic sucks.
* Upgrading applications should be fast.

What does fast mean to us? One example?

New instances online in under 10 seconds.
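One way to check a target like this is to time how long an instance takes to report itself as running. The sketch below is only an illustration: `get_status` is a hypothetical callable you would supply (for example, a wrapper around whatever client you use to fetch a server's status), and the polling interval is an arbitrary choice. It assumes the status strings "ACTIVE" and "ERROR" that Nova reports for servers.

```python
import time
from typing import Callable


def seconds_until_active(
    get_status: Callable[[], str],          # hypothetical: returns the instance status
    timeout: float = 10.0,                  # the "under 10 seconds" target
    poll_interval: float = 0.25,            # arbitrary polling interval (assumption)
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll get_status() until it returns "ACTIVE" and return the elapsed seconds.

    Raises RuntimeError if the instance reports "ERROR", and TimeoutError
    if it is still not ACTIVE once `timeout` seconds have elapsed.
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status == "ACTIVE":
            return elapsed
        if status == "ERROR":
            raise RuntimeError("instance entered ERROR state")
        if elapsed >= timeout:
            raise TimeoutError(f"instance not ACTIVE after {elapsed:.1f}s")
        sleep(poll_interval)
```

The `clock` and `sleep` parameters exist so the function can be exercised with a fake clock; in normal use you would pass only `get_status`.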

Performance Matters

What could your business do if instances came online in under 5 seconds vs. 50 seconds?

> Makes integration tests leveraging the Cloud complete much faster.
> Seasonal spikes? React to them faster - happier customers spend more money.
> Engineers who don’t grumble that “getting servers is a pain in the ass”.
> Deploy new applications and services more quickly and easily.

Along with many other things ...

What do we do?

Think Positive

Because solutions are better than problems!

Two-Pronged Approach

Hardware & Software
“A Love Story”

Warning!

Picking the right hardware is quite hard.
It’s often individual to your users’ needs.

What works for us may not rock your world.

Hardware

Our Servers

Supermicro 1027R-WRFT+
2x Intel Xeon E5-2670 (8C/16T 2.60GHz)
16 x 8GB 1600MHz ECC Memory
LSI 9266-8i (1-LD RAID-10)
8 x Intel 520-series 240GB SSD
Dual-Port Intel X540 10GBASE-T

Benefits

* ‘Just right’ balance of CPU/RAM for us.

* Exceptional ephemeral I/O performance
  > Not using eMLC - trade off?
  > We can think about SQL on IaaS

* A surplus of network bandwidth

Servers are not a bottleneck!
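For a feel of what that CPU/RAM balance looks like per server, here is the arithmetic from the "Our Servers" spec as a small sketch. The figures are nominal: the RAID-10 number assumes usable capacity is simply half the raw capacity, and it ignores formatting and controller overhead.

```python
# Figures taken from the server specification slide.
SOCKETS = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2          # 8C/16T per socket
DIMMS, DIMM_GB = 16, 8
SSDS, SSD_GB = 8, 240

cores = SOCKETS * CORES_PER_SOCKET
threads = cores * THREADS_PER_CORE
ram_gb = DIMMS * DIMM_GB
ram_per_thread_gb = ram_gb / threads

# RAID-10 mirrors every drive, so nominal usable space is half the raw total.
raw_ssd_gb = SSDS * SSD_GB
raid10_usable_gb = raw_ssd_gb // 2

if __name__ == "__main__":
    print(f"{cores} cores / {threads} threads, {ram_gb} GB RAM "
          f"({ram_per_thread_gb:.0f} GB per thread)")
    print(f"{raw_ssd_gb} GB raw SSD, about {raid10_usable_gb} GB usable in RAID-10")
```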

Our Network

Top of Rack -
Arista Networks 7050T
48-port 10GBASE-T Switch
+ 4-port 40GbE (uplinks)

Zone Spine -
Arista Networks 7050Q
16-port 40GbE Switch

Benefits

* A network which runs Linux!
* Ability to automate it via ZTP and Chef

* Non-blocking communication in a rack.
* Provision 160Gbps to spine via four cables.
* Under 2:1 contention for comms in/out of rack.

* Less need to think about QoS!

Network is not a bottleneck!
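The contention figure is simple arithmetic: server-facing bandwidth divided by uplink bandwidth. The slides do not say how many of the 48 ports are populated, so the 30 ports used below (15 dual-port servers) is purely an assumption for illustration.

```python
PORT_GBPS = 10                 # 10GBASE-T server-facing ports
UPLINK_GBPS = 4 * 40           # four 40GbE uplinks = 160 Gbps to the spine
TOTAL_PORTS = 48


def contention_ratio(ports_in_use: int) -> float:
    """Server-facing bandwidth divided by uplink bandwidth (e.g. 2.0 means 2:1)."""
    return ports_in_use * PORT_GBPS / UPLINK_GBPS


def max_ports_under(limit: float) -> int:
    """Largest number of populated ports whose ratio stays strictly below `limit`."""
    return max(n for n in range(1, TOTAL_PORTS + 1) if contention_ratio(n) < limit)


if __name__ == "__main__":
    # Assumption: 15 servers x 2 ports each = 30 populated ports.
    print(f"30 ports: {contention_ratio(30)}:1")
    print(f"48 ports: {contention_ratio(48)}:1")
    print(f"ports allowed under 2:1: {max_ports_under(2.0)}")
```

With these numbers a fully populated switch would sit at 3:1, so staying under 2:1 means leaving some of the 48 ports unused.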

Software

Production

Ubuntu 12.04 LTS (‘Precise Pangolin’)
Hypervisor -- KVM

CloudScaling OCS 1.3
.. based off OpenStack Essex ..

Moving to OCS 2.0 in near future...
.. that one is OpenStack Folsom ..

Ubuntu 12.04 LTS (‘Precise Pangolin’)
Hypervisor -- KVM

Useful for development and testing
.. we’re running OpenStack Folsom now ..

Most of the data shown later was grabbed with help from DevStack running on similar hardware to our production environment.

WHAT NOW?
We’ve picked the hardware stack. It’s awesome.

We’ve got our software installed. It’s looking great.

Support calls are imprecise. We need data!

Monitoring

Old School

* Is my service (API) responding on TCP/8774?
* Am I able to make a GET and fetch instance info?
* Is my server running all the processes it should?
* Are there any errors on my network ports?

If any of this looks broken,send me alerts saying so!
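Checks of this kind boil down to a set of yes/no probes plus an alert when any of them says no. Below is a minimal sketch of the first check (is the API port accepting connections?) and a tiny runner. The function names are made up for illustration, and the other checks would be added as further callables that return True or False.

```python
import socket
from typing import Callable, Dict, List


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check; a False result or an exception counts as a failure."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

A caller might pass `{"nova-api tcp/8774": lambda: tcp_port_open("127.0.0.1", 8774)}` and send an alert whenever the returned list is non-empty. The host here is an assumption.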

New Thinking

* “How long did my website take to show?”
* Individual performance of each click or API call
* Inspection of latency within the application

If lots of users’ interactions are slow,
then I want you to alert me.

If it’s just an outlier - log it and shut up.

“End-User Experience Monitoring”
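The alert-versus-log rule can be sketched as a small function over a window of measured latencies. The thresholds here (1 second counts as slow, alert once 20% of interactions are slow) are placeholder assumptions, not values from the talk.

```python
from typing import Sequence, Tuple


def classify_window(
    latencies_s: Sequence[float],
    slow_threshold_s: float = 1.0,   # assumption: what counts as "slow"
    alert_fraction: float = 0.2,     # assumption: what counts as "lots of users"
) -> Tuple[str, int]:
    """Return (action, slow_count) for one window of request latencies.

    action is "alert" when the share of slow requests reaches alert_fraction,
    "log" when there are only a few slow outliers, and "ok" otherwise.
    """
    if not latencies_s:
        return ("ok", 0)
    slow = sum(1 for value in latencies_s if value > slow_threshold_s)
    if slow == 0:
        return ("ok", 0)
    if slow / len(latencies_s) >= alert_fraction:
        return ("alert", slow)
    return ("log", slow)
```

A single 9-second request among a hundred fast ones comes back as "log", while a window where a third of requests are slow comes back as "alert".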

DEMO TIME!
Because pretty pictures are awesome.

We’ll call the slowest transactions our “Disaster Porn”.

Boundary

“AppViz”

* Port-to-port throughput/latency
* How much SQL traffic are you doing?

Updates in real-time.
Look backwards in time.

Powered by IPFIX (RFC 5101)

Tracelytics

Lots more cool stuff to help ...
We’ll blitz through a few more things next ...

Latency Trends
* Over the last 60 minutes
* Over the last 24 hours
* Over the last 7 days

Top Tip: This is bad news.

Tracelytics Patches

If you want to try out OpenStack APM -
https://github.com/Afterglow/tracelytics-openstack

Any questions? Just open an issue!

Glance

Keystone

Nova

“Call to Arms”

Reminder about those patches -
https://github.com/Afterglow/tracelytics-openstack

> Performance regression tests as an OpenStack CI gate?
> More people talking about “How I fixed those >5 second outliers!”
> Better ‘shared knowledge’ about what settings to tweak for added oomph
> Architectural analysis asking about “big picture” (big impact) changes
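As a rough idea of what a performance regression gate could look like, the sketch below compares the 95th-percentile latency of a candidate build against a baseline and fails when it is more than 10% worse. The percentile choice, the tolerance, and the function names are all assumptions for illustration, not an existing OpenStack CI job.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gate_passes(
    baseline: Sequence[float],
    candidate: Sequence[float],
    pct: int = 95,             # assumption: gate on p95 latency
    tolerance: float = 0.10,   # assumption: allow up to 10% regression
) -> bool:
    """True if the candidate's percentile latency is within tolerance of the baseline's."""
    allowed = percentile(baseline, pct) * (1 + tolerance)
    return percentile(candidate, pct) <= allowed
```

Gating on a high percentile rather than the mean keeps the focus on the slow outliers that users actually notice.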

Credits
Because these folks are awesome

N.B. Not intended as an exhaustive list of all the awesome people in the world/room!

http://www.livingsocial.com

http://www.cloudscaling.com

http://www.aristanetworks.com

http://www.tracelytics.com

We’re done talking,
thanks for listening!

Any questions?

Interested?
E-mail Ken -

ken.persel@livingsocial.com

Or just find me!

Reminder that these slides are over at -
https://speakerdeck.com/u/nixgeek