Project Skyfall
Matt Abrams (@abramsm)
Agenda
• A bit about AddThis
• Why did we need Skyfall?
• Architecture
• Operations/Performance
Introduction
Fun with Numbers
• AddThis JavaScript loads > 3 Billion times per day
• Edge Network (Skyfall) receives around 4B hits per day
• Either data center can handle 100% of the load (we test this often)
• Currently using around 1K servers (will double next year)
Data Center Porn
Why did we need Skyfall?
• We couldn’t find anyone else to do it for us
  • Previous vendor’s log aggregation was delayed by a minimum of 3 hours and could take up to 5 days
• Minimize impact on our publishers
  • Combining log collection with remote services means we only need 1 event instead of n
• Support near real time applications
Why did we call it Skyfall?
Skyfall Goals and Architecture
Skyfall Goals (Technical)
• High availability
• Low latency
• Use for internal and external logging needs
• O(1) reads and writes
• Smart clients
• Handle server and DC failure gracefully
• Zero downtime deployment and configuration
• In-session RPC
• Support data filtering at the edge
Why speed and robustness matters
Architecture
[Diagram: Web Events; Global Traffic Management; two data centers (DC1 and DC2), each with Skyfall, Consumer, and Service nodes]
Repeater
1. Messages are placed on concurrent non-blocking queue (CNBQ) to minimize latency impact on producer
2. Messages are then popped from CNBQ and placed on a Disk-Backed queue (DBQ)
3. DBQ is used to provide temporary storage in case Kafka is down or backed up
4. Messages from DBQ are popped and sent to Kafka where they are persisted to file system
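The four steps above can be sketched in a few lines. This is a minimal illustration, not the Skyfall implementation: `DiskBackedQueue`, `Repeater`, and the `send` callable are hypothetical names, a `collections.deque` stands in for the CNBQ, a line-per-message append-only file stands in for the DBQ, and `send` stands in for a Kafka producer that raises `ConnectionError` when Kafka is unavailable.

```python
import collections


class DiskBackedQueue:
    """Toy disk-backed FIFO (hypothetical): one message per line in an append-only file.

    Assumes messages contain no newline characters.
    """

    def __init__(self, path):
        self.path = path
        self.read_offset = 0          # byte offset of the next unsent message
        open(path, "ab").close()      # create the file if it does not exist

    def put(self, message):
        with open(self.path, "ab") as f:
            f.write(message.encode("utf-8") + b"\n")

    def peek(self):
        """Return the next complete line (with trailing newline) or None."""
        with open(self.path, "rb") as f:
            f.seek(self.read_offset)
            line = f.readline()
        return line if line.endswith(b"\n") else None

    def commit(self, line):
        """Advance past a line only after it has been delivered."""
        self.read_offset += len(line)


class Repeater:
    def __init__(self, dbq, send):
        self.inbox = collections.deque()  # stand-in for the CNBQ
        self.dbq = dbq
        self.send = send                  # stand-in for a Kafka producer call

    def accept(self, message):
        # Step 1: cheap in-memory enqueue so the producer is not delayed.
        self.inbox.append(message)

    def spool(self):
        # Step 2: drain the in-memory queue onto disk.
        while self.inbox:
            self.dbq.put(self.inbox.popleft())

    def forward(self):
        # Steps 3 and 4: send from disk; on failure leave messages spooled.
        sent = 0
        while True:
            line = self.dbq.peek()
            if line is None:
                return sent
            try:
                self.send(line[:-1].decode("utf-8"))
            except ConnectionError:
                return sent               # Kafka down or backed up: retry later
            self.dbq.commit(line)
            sent += 1
```

In this sketch a message is removed from the disk queue only after `send` returns, so a failed send leaves it in place for the next `forward()` call.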
Kafka
• Kafka treats persistence as a first-class citizen
• Focus is on high throughput vs lots of bells and whistles
• State about what has been consumed is maintained in the client rather than the server
• Kafka is explicitly distributed
• Supports O(1) reads and writes
• Pull rather than push
http://incubator.apache.org/kafka/design.html
Circuit Breaker for Remote Services
• Pattern is used to detect failures and encapsulates the logic of preventing a failure from recurring constantly [1]
• If a service instance throws an error, times out, or responds with a failure message, an error event is marked
• If the error rate threshold is exceeded, that service instance is removed from the pool of available services
• Before re-adding a service to the pool, a test request is made and validated
• Internal service failures should not be reflected in the response to the message originator
[1] - http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
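A rough sketch of the circuit breaker behavior described above, under stated assumptions: `ServicePool`, its parameters (`error_threshold`, `window`, `min_samples`), and the `probe` method are hypothetical names chosen for illustration, and service instances are modeled as plain callables that raise on error or timeout. The slides do not say how the error rate is measured; a sliding window of recent outcomes is assumed here.

```python
import collections


class ServicePool:
    """Hypothetical pool of service instances guarded by an error-rate breaker."""

    def __init__(self, instances, error_threshold=0.5, window=10, min_samples=3):
        self.instances = dict(instances)        # name -> callable(request)
        self.available = list(self.instances)   # names currently in the pool
        self.error_threshold = error_threshold
        self.min_samples = min_samples
        self.outcomes = {
            name: collections.deque(maxlen=window) for name in self.instances
        }

    def _record(self, name, ok):
        # Mark an error event and trip the breaker if the rate is too high.
        history = self.outcomes[name]
        history.append(ok)
        if len(history) < self.min_samples:
            return
        error_rate = history.count(False) / len(history)
        if error_rate > self.error_threshold and name in self.available:
            self.available.remove(name)

    def call(self, request, default=None):
        # Try available instances in order; failures are never raised to the
        # caller, so internal errors do not reach the message originator.
        for name in list(self.available):
            try:
                response = self.instances[name](request)
            except Exception:
                self._record(name, False)
                continue
            self._record(name, True)
            return response
        return default

    def probe(self, name, test_request, is_valid):
        # Re-add a tripped instance only after a validated test request.
        if name in self.available:
            return True
        try:
            ok = is_valid(self.instances[name](test_request))
        except Exception:
            ok = False
        if ok:
            self.outcomes[name].clear()
            self.available.append(name)
        return ok
```

Returning `default` instead of raising is one way to keep service failures out of the response path; a real deployment would also need a timeout around each call, which is omitted here.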
What does a call to our endpoint look like?
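"GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
• Topic: live
• Version: t00
• Resource: 250lo.gif
• URL Parameters: foo=bar
• Status Code: 200
• Bytes Transferred: 37
• CDN Resource: http://s7.addthis.com/static/r07/sh103.html
• User Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
The endpoint also receives header and cookie information not shown here.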
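To make the field labels concrete, here is a small sketch that splits the example line into those fields. The regular expression and the `parse_hit` function are assumptions written for this one example, not Skyfall's parser; in particular, the path layout `/<topic>/<version>/<resource>` is inferred from the labels, and parameters are taken to start at the first `?` or `&` because the sample path uses `&`.

```python
import re
from urllib.parse import parse_qsl

# Hypothetical pattern matching the shape of the example line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>[^"\s]+)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<cdn_resource>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_hit(line):
    """Split one request line into the fields labeled on the slide."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognized log line")

    # Assumed layout: /<topic>/<version>/<resource>[&key=value...]
    _, topic, version, rest = match.group("path").split("/", 3)
    parts = re.split(r"[?&]", rest, maxsplit=1)
    resource = parts[0]
    query = parts[1] if len(parts) > 1 else ""

    return {
        "topic": topic,
        "version": version,
        "resource": resource,
        "params": dict(parse_qsl(query)),
        "status": int(match.group("status")),
        "bytes": int(match.group("bytes")),
        "cdn_resource": match.group("cdn_resource"),
        "user_agent": match.group("user_agent"),
    }
```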
Zero Downtime Deployment and Configuration
[Diagram: two server groups (Group 1 and Group 2), each containing servers S1 through S5, with the labels 4, 8, and 16]
Endpoint Configuration
• Each endpoint maps to a ‘topic’
• Header elements may be extracted from the HTTP request
• Parameters may be mapped to new key names
• Variables may be extracted from the URL path
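One way to picture such a configuration is as a small mapping applied to each request. Everything below is hypothetical: the slides do not show Skyfall's configuration format, so the keys (`path_template`, `topic`, `extract_headers`, `rename_params`) and the functions are invented purely to illustrate the four capabilities listed.

```python
# Hypothetical endpoint configuration; the real format is not shown in the slides.
ENDPOINT = {
    "path_template": "/live/{version}/{resource}",  # variables taken from the URL path
    "topic": "live",                                # each endpoint maps to a topic
    "extract_headers": ["User-Agent", "Referer"],   # header elements to copy
    "rename_params": {"foo": "example_key"},        # parameter -> new key name
}


def extract_path_vars(template, path):
    """Match a path against a template; return the variables or None."""
    template_parts = template.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(template_parts) != len(path_parts):
        return None
    variables = {}
    for expected, actual in zip(template_parts, path_parts):
        if expected.startswith("{") and expected.endswith("}"):
            variables[expected[1:-1]] = actual
        elif expected != actual:
            return None
    return variables


def build_message(config, path, params, headers):
    """Apply one endpoint configuration to a request; None if the path differs."""
    path_vars = extract_path_vars(config["path_template"], path)
    if path_vars is None:
        return None
    message = {"topic": config["topic"]}
    message.update(path_vars)
    for key, value in params.items():
        message[config["rename_params"].get(key, key)] = value
    for name in config["extract_headers"]:
        if name in headers:
            message[name] = headers[name]
    return message
```

Headers not listed in `extract_headers` (a cookie header, for example) are simply left out of the resulting message in this sketch.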
Data Center Repeater
[Diagram: Data Center Repeater nodes N1, N2, and N3 in one data center; N1 and N2 in the other]
• DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center
• If a peer node becomes unreachable, the local node will select a new peer
• DC Repeater nodes are special consumers of the Kafka log data created by the local node
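The peer failover rule can be sketched as follows. The slides do not describe the negotiation protocol, so this illustration assumes the simplest possible policy (keep the current peer while it is reachable, otherwise take the first reachable remote node); `DcRepeaterNode`, `ensure_peer`, and the `is_reachable` callable are hypothetical.

```python
class DcRepeaterNode:
    """Hypothetical local repeater node that tracks one peer in the other DC."""

    def __init__(self, name, remote_nodes, is_reachable):
        self.name = name
        self.remote_nodes = list(remote_nodes)  # candidate peers in the other DC
        self.is_reachable = is_reachable        # callable(node_name) -> bool
        self.peer = None

    def ensure_peer(self):
        # Keep the current peer while it still responds.
        if self.peer is not None and self.is_reachable(self.peer):
            return self.peer
        # Otherwise select a new peer; None means no remote node is reachable.
        self.peer = next(
            (node for node in self.remote_nodes if self.is_reachable(node)),
            None,
        )
        return self.peer
```

A production version would presumably balance peers across nodes rather than always preferring the first one, but that detail is not covered in the slides.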
Skyfall Operations
Requests per second (VA Data Center)
TCP - When do you say goodbye?
http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg
Connection Tracking – what you need to know
• Connection information is maintained in memory
• The message “ip_conntrack: table full, dropping packet” is BAD
• Chrome doesn’t close the connection on FIN
  • This means the connection info remains open until it times out, drastically increasing the number of connections your server needs to track
• You need some mechanism for timing out the connection in a reasonable time period
HA Proxy
• We use a simple round-robin load balancing algorithm with a liveness check
• Default connection timeouts are way too high. Reasonable values are used to prevent excessive connection tracking
• “http-close” and “http-server-close” are enabled to ensure low latency for clients and fast session reuse for the server
• HA Proxy is our solution of choice for our LB needs. We prefer software solutions on commodity hardware vs expensive custom LB appliances
• They could use a new logo