Building a Private Cloud to Efficiently Handle 40 Billion Requests / Day
October 28th, 2015
Pierre Gohon | Sr. Site Reliability Engineer | pierre.gohon@tubemogul.com
Pierre Grandin | Sr. Site Reliability Engineer | pierre.grandin@tubemogul.com
Who are we?
TubeMogul (Nasdaq: TUBE)
● Enterprise software company for digital branding
● Over 27 billion ads served in 2014
● Over 40 billion ad auctions per day in Q3 2015
● Bids processed in less than 50 ms
● Bids served in less than 80 ms (incl. network round trip)
● 5 PB of monthly video traffic served
● 1.6 EB of data stored
Who are we?
Operations Engineering
● Ensure the smooth day-to-day operation of the platform infrastructure
● Provide a cost-effective and cutting-edge infrastructure
● Provide support to dev teams
● Team composed of SREs, SEs and DBAs (US and UA)
● Managing over 2,500 servers (virtual and physical)
Our Infrastructure
Public Cloud On Premises
Multiple locations with a mix of Public Cloud and On Premises
● 6 AWS Regions (us-east*2, us-west*2, europe, apac)
● Physical servers in Michigan / Arizona (web / databases)
● DNS served by third parties (UltraDNS + Dynect)
● External monitoring using Catchpoint
● CDNs to deliver content
● External security audits
We’re not adding complexity!
Before OpenStack: we were already very “hybrid”…
Why?
● Own your infrastructure stack
● Physical proximity matters (reduced / controlled latency)
● Better infrastructure planning
● Technological transparency
● … $$!
Project timeline
Where do we stand?
● DIY?
  ○ Small ops team
    ■ 12 members in two timezones
    ■ only 3 dedicated to OpenStack
  ○ New challenges
    ■ Internal training
    ■ Little external support (really?) vs. AWS
    ■ Manage data centers (servers, network, …)
OpenStack challenges - Operational aspect
● Are applications AWS-dependent?
  ○ Internal ops tools
  ○ Developers’ applications
  ○ AWS S3, DynamoDB, SNS, SQS, SES, SWF
● Convert developers to the project: we need their support
● OpenStack release cycle (when shall we update to the latest version?)
● Which OpenStack components do we really need?
● How far do we go (S3 replacement? Network control? Hardware control?)
OpenStack challenges - Application migration aspect
● Managing our own ASN / IPs (v4/v6)
● Choose “best for needs” transit providers (tier 1)
● Better control of routes to/from our endpoints
● Allow dedicated AWS connections / others
● Allow direct peerings to ad networks
● Want to be accountable for networking issues
● Cost control
How? Networking - External connectivity
● Applications are already designed for redundancy / cloud
● Circumvent virtualized networking limitations
● Fine-tune bare-metal nodes for HAProxy
● Future equipment is “cloud ready” (Nexus 5K as top-of-rack switch)
  ○ automatic switch configuration
  ○ Cisco software evolutions?
● 1G for admin, X*10G for public?
● Leverage multicast?
How? Networking - Hybrid physical / virtualized
[Diagram: network node, compute node and load balancer, connected to the public network and to a private network using VLANs]
How? Networking - RTT
● Latency from our DC to AWS is 6 ms on average in US-WEST

rtb-bidder01(rtb):~$ mtr -r -c 50 gw01.us-west-1a.public
HOST: rtb-bidder01                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.4.1                    0.0%    50    0.2   0.2   0.1   0.3   0.0
  2.|-- XXX.XXX.XXX.XXX             0.0%    50    0.2   0.3   0.2   2.6   0.3
  3.|-- ae-43.r02.snjsca04.us.bb.   0.0%    50    1.4   1.5   1.2   2.3   0.2
  4.|-- ae-4.r06.plalca01.us.bb.g   0.0%    50    2.0   2.1   1.8   3.4   0.3
  5.|-- ae-1.amazon.plalca01.us.b   0.0%    50   39.2   3.5   1.5  39.2   5.6
  6.|-- 205.251.229.40              0.0%    50    3.5   2.8   2.2   4.9   0.6
  7.|-- 205.251.230.120             0.0%    50    2.1   2.3   2.0   8.5   0.9
  8.|-- ???                        100.0    50    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                        100.0    50    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                        100.0    50    0.0   0.0   0.0   0.0   0.0
 11.|-- 216.182.237.133             0.0%    50    4.0   6.0   2.7  20.2   5.2
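As a side note, the end-to-end figure can be pulled out of such a report programmatically. A minimal sketch, with sample lines copied from the report above (the parsing is illustrative, not a general mtr parser):

```python
# Sketch: extract the end-to-end average RTT from `mtr -r` report output.
# Field layout follows mtr's report mode: Loss% Snt Last Avg Best Wrst StDev.
report = """\
  1.|-- 10.0.4.1                    0.0%    50    0.2   0.2   0.1   0.3   0.0
  8.|-- ???                        100.0    50    0.0   0.0   0.0   0.0   0.0
 11.|-- 216.182.237.133             0.0%    50    4.0   6.0   2.7  20.2   5.2
"""

def end_to_end_avg(mtr_report):
    """Average RTT of the last hop that answered (100%-loss hops are skipped)."""
    avg = None
    for line in mtr_report.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[2].rstrip("%") != "100.0":
            avg = float(fields[5])  # the Avg column
    return avg

print(end_to_end_avg(report))  # 6.0
```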
● If you are not building a multi-thousand-hypervisor cloud, you don’t need it to be complex
● Simplifies day-to-day operations
● Home-made Puppet catalog
  ○ because fewer lines of code
  ○ because of the learning curve
  ○ because we need to tweak settings (ulimit?)
● No need for Horizon
● No need for shared storage
How? Keep it simple
● Affinity / anti-affinity rules
  ○ Enforce resiliency using anti-affinity rules
  ○ Improve performance using affinity rules
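The effect of these rules can be sketched as a simple placement constraint (a toy model with hypothetical host and cluster names; in practice OpenStack enforces this through server groups and scheduler hints, not application code):

```python
# Toy model of the placement constraint behind affinity / anti-affinity rules.
# Host and group names are hypothetical; OpenStack's scheduler does this
# for real via server groups.

def pick_host(hosts, placements, group, policy):
    """Return a host for a new instance of `group`, honoring the policy."""
    used = {h for h, g in placements if g == group}
    if policy == "anti-affinity":
        candidates = [h for h in hosts if h not in used]   # spread instances out
    elif policy == "affinity":
        candidates = sorted(used) or list(hosts)           # pack instances together
    else:
        candidates = list(hosts)
    if not candidates:
        raise RuntimeError("no valid host: add hypervisors or relax the rule")
    return candidates[0]

placements = []
for _ in range(3):
    host = pick_host(["hv01", "hv02", "hv03"], placements,
                     "rtb-hbase", "anti-affinity")
    placements.append((host, "rtb-hbase"))

print(placements)  # each instance lands on a distinct hypervisor
```

With anti-affinity, losing one hypervisor takes out at most one member of the cluster, which is the resiliency property the slide refers to.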
How? Leverage your knowledge of your infrastructure
{
  "profile": "OpenStack",
  "cluster": "rtb-hbase",
  "hostname": "rtb-hbase-region01",
  "nagios_host": "mgmt01"
}
How? Treat your infrastructure as any other engineering project
Infrastructure As Code
● Follow standard development lifecycle
● Repeatable and consistent server provisioning

Continuous Delivery
● Iterate quickly
● Automated code review to improve code quality

Reliability
● Improve production stability
● Enforce better security practices
How? Continuous Delivery
● We already have a lot of automation:
  ○ ~10,000 Puppet deployments last year
  ○ Over 8,500 production deployments via Jenkins last year
● On the infrastructure:
  ○ masterless mode for the deployment
  ○ master mode once the node is up and running
● On the VMs:
  ○ the Puppet run is triggered by cloud-init, directly at boot
  ○ from boot to production-ready: < 5 minutes
Puppet
see also : http://www.slideshare.net/NicolasBrousse/puppet-camp-paris-2015
Infrastructure As Code - Code Review
Gerrit, an industry standard: OpenStack, Eclipse, Google, Chromium, WikiMedia, LibreOffice, Spotify, GlusterFS, etc.

● Fine-grained permission rules
● Plugged into LDAP
● Code review per commit
● Stream events
● Integrated with Jenkins, Jira and Hipchat
● Managing about 600 Git repositories
Infrastructure As Code - Gerrit Integration
Infrastructure As Code - Gerrit in Action
Automatic verify : -1 if the commit doesn’t pass Jenkins code validation
Infrastructure As Code - The Workflow
Lab / QA
Prod cluster
Infrastructure As Code - Continuous Delivery with Jenkins
Infrastructure As Code - Team Awareness
Infrastructure As Code - Safe upgrade paths
Easy as 1-2-3:
1. Test your upgrades using Jenkins
2. Deploy the upgrade by pressing a single button*
3. Enjoy the rest of your day

* https://github.com/pgrandin/lcam

fig.1: N. Brousse, Sr. Director of Operation Engineering, switching our production workload to OpenStack
Get ready for production: Monitor everything
Monitor as much as you can ?
● Existing monitoring (Nagios, Graphite) still in use
● Specific checks for OpenStack
  ○ check component APIs: performance / availability / operability
  ○ check resources: ports, failed instances
● Monitoring capacity metrics for all hardware
● SNMP traps for network equipment
● Monitoring is just an extension of our existing monitoring in AWS
Monitoring auto-discovery
● New OpenStack nodes are monitored automatically / upon request
  ○ Nagios detects new hosts (API query)
  ○ Nagios applies component-related checks by role
  ○ graphing is also automatically updated
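A rough sketch of what such a discovery loop can look like, assuming instance metadata shaped like the rtb-hbase example shown earlier (the API response is mocked and the check names are illustrative; real data would come from the Nova API):

```python
# Sketch of monitoring auto-discovery: query the compute API for instances,
# read their metadata, and emit Nagios host entries with role-based checks.
# Mocked API response; check names and roles are illustrative assumptions.

ROLE_CHECKS = {
    "rtb-hbase": ["check_hbase_regionserver", "check_jvm_heap"],
    "haproxy":   ["check_haproxy_backends"],
}

def nagios_hosts(instances, known):
    """Build Nagios host entries for instances not yet monitored."""
    configs = []
    for inst in instances:
        meta = inst["metadata"]
        if inst["name"] in known or meta.get("profile") != "OpenStack":
            continue  # already monitored, or not one of ours
        configs.append({
            "host_name": inst["name"],
            "address": inst["ip"],
            "checks": ROLE_CHECKS.get(meta["cluster"], []),
            "parents": meta["nagios_host"],
        })
    return configs

api_response = [{
    "name": "rtb-hbase-region01", "ip": "10.0.4.21",
    "metadata": {"profile": "OpenStack", "cluster": "rtb-hbase",
                 "hostname": "rtb-hbase-region01", "nagios_host": "mgmt01"},
}]
print(nagios_hosts(api_response, known=set()))
```

The same metadata drives both alerting and graphing, which is what keeps a new node from ever being invisible to monitoring.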
Centralized monitoring
Monitoring is graphing
A look in the rearview mirror
Benefits - Transparency / visibility
Discover new odd/unexpected traffic/activity patterns
Benefits - Tailored Instances
Before After
Need an m3.xlarge plus 2 GB of RAM? On AWS that means paying for an m3.2xlarge!
# nova flavor-create rtb.collector rtb.collector 17408 8 2
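The positional arguments to `nova flavor-create` are name, ID, RAM (MB), disk (GB) and vCPUs. The 17408 MB figure works out to exactly an m3.xlarge’s RAM plus the extra 2 GiB (a quick sanity check, assuming the 15 GiB m3.xlarge spec):

```python
# Why 17408 MB? An m3.xlarge offers 15 GiB of RAM; the tailored flavor
# adds the 2 GiB the workload actually needs, instead of doubling up.
# `nova flavor-create <name> <id> <ram_mb> <disk_gb> <vcpus>`
m3_xlarge_ram_gib = 15
extra_gib = 2
ram_mb = (m3_xlarge_ram_gib + extra_gib) * 1024
print(ram_mb)  # 17408
```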
Benefits - Operational Transparency
AWS:
# cerveza -m noc -- --zone us-east-1a --start demo01

OpenStack:
# cerveza -m noc -- --zone tm-sjc-1a --start demo01
Benefits - Efficiency
Before After
Benefits - Efficiency
1+ million rx packets/s on only 2 HAProxy load balancers, full SSL
What does not fit?
Downscaling does not really make sense for us: CPUs are online and paid for, so we should use them.

Upscaling has its limits: AWS refreshes instance types every year…

Sometimes a small added feature can have a huge load impact.
It makes sense to keep the elastic workloads (machine learning, ...) in AWS
● We can be “double hybrids” (AWS + OpenStack + HAProxy bare metal)
● A dev environment is needed for OpenStack (new versions / break things)
● Storage is still a big issue due to our volume (1.6 EB)
● Some things may stay “forever” on AWS?
● More dev/ops communication
● OpenStack is flexible
● No need for HA everywhere
● Spikes can be offloaded to AWS (cloud bursting)
What we’ve learnt
Still a lot left to do
Technical aspect
● Need to migrate other AWS Regions
● Gain more experience
● Version upgrades
● Continue to adapt our tooling
● Add more alarms for capacity issues
● Different Regions, different issues?
Human aspect
● Dev teams still think in the AWS world (and sometimes Ops too…)
- Ad serving in production since 2015-05- Bidding traffic in production since 2015-09- 100% uptime since pre-production (2015-03)
Cost of operation for our current production workload:- Reduced by a factor of two, including OpEx cost!
Aftermath
Questions?
Pierre Gohon | @pierregohon
Pierre Grandin | @p_grandin