There are many, many groups and people that work on nytimes.com. I am just one of them. I do not claim to speak for everyone, or for the company.
The New York Times building on 8th Ave. This is where I work, as do 90% of all NYT employees. The building was built in 2007. It is super nice; I can give you a tour if you are ever in the area. In the past, the technology NYT adopted internally might have looked dated to an outsider. This is changing very quickly as technology becomes more and more ubiquitous in our daily lives.
September 18, 1851
On this date, we printed our first paper. We have printed a paper almost every day since then.
July 21, 1969
http://timesmachine.nytimes.com/timesmachine/1969/07/21/issue.html
We put a man on the moon, but the paper looked basically the same.
October 16, 1997
First color photo. The choice was made to add color to the front page more than 140 years after the paper started.
January 8, 2014
nyt5 was released on this date. This was the first redesign in many years (I think 6 or 8). For the first time, the changes were made so that the reader would have a better experience: the main website, mobile site, and mobile apps all share the same user experience. The goal of nyt5 was to be sleeker, faster, and more intuitive. Cloud computing had been used in production and development before, but nyt5 relied on it for a number of new features. One of these features is that the page does not reload when breaking news alerts or live video are occurring. There is no page polling for this.
In-house front end
Nimbul, our open-source in-house front end (http://nimbul.github.io/nimbul/), was originally developed to control access to AWS and is being extended to include VMware services as well. It greatly simplifies working with the AWS cloud.
Internal Private Cloud
There is an internal private cloud based on VMware. It runs in a number of physically separate data centers. The front end is home-grown.
External Public Cloud
The other option is to use the NYT Amazon Web Services account. This is not a general account that any user can use. Access goes through the same home-grown front end rather than the AWS API or the normal AWS console.
But... we had bigger plans
Nimbul really cut down on developer time. Developers could ask for an instance (or cluster) that the infrastructure team could create rapidly and automatically. There are some caveats, though:
•No root access
•No direct API access to AWS
•No new machine images (AMIs)
•No Auto Scaling
A cloud case study
Andrew Canaday Michael Laing Mike Buzzetti
The nyt fabrik went live in production on January 8, 2014. We will use it as a case study for using the cloud at the New York Times.
Principles
•Provide a standard input and output pipeline
•Provide a pipeline to transmit messages to front end clients
•Provide a pipeline to receive messages from front end in real time
•Provide a standard inter-service pipeline
•Provide a flexible caching layer
Message Pipes
Basically, we build “pipes” for messages. These pipes guarantee that messages get delivered. Generally, stdin is an AMQP channel and stdout is a WebSocket.
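To make the delivery guarantee concrete, here is a minimal sketch of one pipe, assuming at-least-once semantics: a message only leaves the pipe once the output side has accepted it. A `queue.Queue` stands in for the AMQP channel and a plain callable stands in for the WebSocket; the `Pipe` class and its requeue-on-failure behavior are hypothetical, not the fabrik's code.

```python
import queue
from typing import Callable


class Pipe:
    """Move messages from an input queue to an output sink (hypothetical sketch).

    A message leaves the pipe for good only after the sink accepts it. If the
    sink raises, the message is put back on the queue so a later call retries it.
    """

    def __init__(self, source: "queue.Queue[str]", sink: Callable[[str], None]) -> None:
        self.source = source  # stands in for the AMQP channel ("stdin")
        self.sink = sink      # stands in for the WebSocket ("stdout")

    def pump_once(self) -> bool:
        """Try to deliver one message; return True if one was delivered."""
        try:
            message = self.source.get_nowait()
        except queue.Empty:
            return False
        try:
            self.sink(message)
        except Exception:
            # Delivery failed: requeue (at the back) instead of dropping it,
            # similar in spirit to an AMQP reject with requeue.
            self.source.put(message)
            return False
        return True
```

One consequence of at-least-once delivery is that a receiver may see the same message twice; a per-message UUID is one way a receiver could drop such duplicates.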
Cloud Elasticity
We use the elasticity of the AWS cloud to meet our growing (and shrinking) demands.
Front End
•Clients connect through WebSockets or SockJS
•Server code, soon to be open sourced, is based on Python and C
•Front end “shovels” messages to and from core through RabbitMQ
•Front end has an incredibly fast in-memory internal cache
Every user gets a pipe
Every user gets a connection per device or tab. As one can imagine, this adds up to a very high number.
We can scale the FE
We gather these stats from the front end. We then call the CloudWatch API to store them.
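The slides do not say exactly which stats are collected; this sketch assumes simple counters such as a connected-reader count and sends them with the boto3 CloudWatch `put_metric_data` call. The namespace `Fabrik/FrontEnd`, the metric names, and the `Region` dimension are made up for illustration. The client is passed in, so the payload can be checked without AWS credentials.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_metric_data(stats: Dict[str, int], region: str) -> List[Dict[str, Any]]:
    """Turn front-end counters into CloudWatch MetricData entries.

    Metric names come from the keys of `stats`; the "Region" dimension is a
    hypothetical choice for illustration.
    """
    now = datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "Region", "Value": region}],
            "Timestamp": now,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in sorted(stats.items())
    ]


def publish_stats(cloudwatch, stats: Dict[str, int], region: str,
                  namespace: str = "Fabrik/FrontEnd") -> None:
    """Store the stats via CloudWatch PutMetricData.

    `cloudwatch` is expected to be a boto3 client, for example
    boto3.client("cloudwatch", region_name=region). The namespace is hypothetical.
    """
    cloudwatch.put_metric_data(Namespace=namespace,
                               MetricData=build_metric_data(stats, region))
```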
Instances scale with readers
We use the number of readers to tell CloudWatch when to increase the number of instances behind the Elastic Load Balancer. We do this predictively, based on a few internal algorithms.
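The internal algorithms are not described in these slides. As a stand-in, the sketch below forecasts the next reader count by linear extrapolation of recent samples and sizes the fleet with some headroom. `readers_per_instance`, `headroom`, and the min/max bounds are made-up parameters, not NYT values.

```python
import math
from typing import Sequence


def forecast_readers(samples: Sequence[float], steps_ahead: int = 1) -> float:
    """Extrapolate the reader count linearly from evenly spaced samples."""
    if not samples:
        return 0.0
    if len(samples) == 1:
        return float(samples[0])
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)  # average change per sample
    return max(0.0, samples[-1] + slope * steps_ahead)


def desired_instances(samples: Sequence[float], readers_per_instance: int = 20000,
                      headroom: float = 0.25, minimum: int = 2, maximum: int = 40) -> int:
    """Instances needed for the forecast load plus headroom, clamped to bounds."""
    expected = forecast_readers(samples) * (1.0 + headroom)
    needed = math.ceil(expected / readers_per_instance)
    return max(minimum, min(maximum, needed))
```

A real predictor would likely also account for daily traffic patterns and breaking-news spikes; linear extrapolation only reacts to the recent trend.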
Core
•RabbitMQ brokers are clustered across availability zones
•Services are written in Python
•Messages are “shoveled” between regions (RabbitMQ)
•Core is more static than front end
Core message rates
2014: The core can handle around 600 messages a second. We normally see around 3 to 30 messages a second. Because the core can already handle the peak load, it does not need to scale as much.
2015: The core can handle around 5,000 messages a second. Throughout the day we normally range from 500 to 2,500 messages a second.
Cache
•Clustered across regions
•Python services protect it from product changes
•No deletes; TTL-based expiry instead
•Nodes are small for such a heavy Java application
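A minimal sketch of the “no deletes, TTL based” rule: callers can only write and read, and an entry simply stops being visible once its TTL passes. This is a single-process Python illustration with a hypothetical `TTLCache` class, not the clustered Java cache itself; the injectable clock is there so expiry can be tested.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Key/value cache with no client-facing delete: entries just expire.

    Hypothetical single-process sketch; expired entries are only reclaimed
    internally by sweep().
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def put(self, key: str, value: Any, ttl_secs: float) -> None:
        """Store (or overwrite) a value that expires ttl_secs from now."""
        self._data[key] = (self._clock() + ttl_secs, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the value, or None if it is missing or has expired."""
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[0]:
            return None
        return entry[1]

    def sweep(self) -> int:
        """Reclaim memory held by expired entries; return how many were dropped."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._data.items() if now >= expires_at]
        for k in expired:
            del self._data[k]
        return len(expired)
```

Avoiding deletes keeps writes idempotent across replicas: a region that never saw a delete cannot resurrect stale data, because every copy expires on the same schedule.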
NYT Backends
•The fabrik has a proxy to other NYT AWS services
•Each backend can have its own cluster of instances similar to core
•Some backends can just post messages to AMQP endpoints (both gold and silver service levels)
Cloud Availability
Although availability is not one of the defining characteristics of the cloud, it is still very important to us. We have to be up close to 100% of the time. This shows our users and readers that we are as reliable as our sources, and it improves their perception of the quality of our name.
Cloud Responsiveness
The fact that we can run in multiple geographies is also important. If a reader in India wants to access the digital Times, we do not want them to have to connect to servers in the USA. The cloud lets us have data centers closer to the people who read our site. It also helps us spread the load to minimize outages.
Cloud DNS (weighted)
Amazon offers Route 53, a dynamic, cloud-based DNS service. It can direct traffic with DNS CNAME records; in this case we use weighted CNAMEs. The weights are either controlled manually by the Fabrik team or driven by CloudWatch alarms.
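The weighted setup can be expressed as a Route 53 change batch. This sketch only builds the request body; sending it would use boto3's `change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`. The hostnames, the 60-second TTL, and using the target hostname as the `SetIdentifier` are assumptions for illustration. Route 53 sends each weighted record roughly weight / (sum of weights) of the traffic, which `traffic_shares` computes.

```python
from typing import Any, Dict, Mapping


def weighted_cname_batch(name: str, targets: Mapping[str, int],
                         ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch that UPSERTs one weighted CNAME per target.

    `targets` maps a target hostname to its weight. Each record in a weighted
    set needs a distinct SetIdentifier; here it is simply the target hostname.
    """
    return {
        "Comment": "Adjust weighted routing",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "SetIdentifier": target,
                    "Weight": weight,
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
            for target, weight in targets.items()
        ],
    }


def traffic_shares(targets: Mapping[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers that go to each target."""
    total = sum(targets.values())
    if total == 0:
        # Route 53 treats an all-zero weighted set as equally weighted.
        return {t: 1 / len(targets) for t in targets}
    return {t: w / total for t, w in targets.items()}
```

Shifting traffic away from one set of front ends is then an UPSERT with a smaller weight for its record.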
Cloud DNS (latency)
We can also use latency-based CNAMEs. These direct the reader to the lowest-latency AWS region, which then resolves to the more local weighted CNAME.
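The two-step lookup can be modeled in a few lines: first pick the region with the lowest latency to the reader, then pick a target within that region by weight. This is a toy simulation of the resolution order, not how Route 53 measures latency; the region names, targets, and latency figures are hypothetical.

```python
import random
from typing import Mapping


def resolve(latency_ms: Mapping[str, float],
            weighted_by_region: Mapping[str, Mapping[str, int]],
            rng: random.Random) -> str:
    """Toy model of latency-based CNAME -> weighted CNAME resolution.

    Step 1 (latency record): choose the region with the lowest latency to the reader.
    Step 2 (weighted record): choose a target in that region with probability
    weight / total weight. Raises ValueError if every weight in the region is 0.
    """
    region = min(latency_ms, key=latency_ms.get)
    targets = weighted_by_region[region]
    names = list(targets)
    return rng.choices(names, weights=[targets[n] for n in names], k=1)[0]
```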
Cloud Networking
Also known as broad network access. We would not have built the fabrik if we could not use well-defined APIs to create, destroy, and modify our cloud resources.
API Usage
This example uses boto, the (now legacy) AWS SDK for Python.

```python
import logging
import traceback

import boto.ec2  # boto 2.x

logger = logging.getLogger(__name__)


def _update_instance_status(self):
    """
    Get instance status from AWS for all instances we monitor.

    A method of the monitoring class: it expects self._config.region and
    self._as_instances (keyed by instance ID) to exist.
    """
    logger.info("Getting all instance status")
    try:
        status_by_instance = {}  # instance ID -> state name
        instance_by_status = {}  # state name -> [instance IDs]

        conn = boto.ec2.connect_to_region(self._config.region)
        instances = conn.get_all_instance_status(
            instance_ids=list(self._as_instances.keys()))
        for instance in instances:
            status_by_instance[instance.id] = instance.state_name
            instance_by_status.setdefault(instance.state_name, []).append(instance.id)

        self._status_by_instance = status_by_instance
        self._instance_by_status = instance_by_status
    except Exception:
        logger.error("Error getting instance status: %s", traceback.format_exc())
        return False
    return True
```
Cloud Logging

```text
==> outbox.log <==
2014-02-15 14:22:05,749 - INFO - transform_and_send:rk: client-message.us-west-2.core.app_buddy.i-4f675946.2603.us-west-2.hermes.push.-.-
2014-02-15 14:22:09,790 - INFO - transform_and_send:rk: client-message.us-west-2.core.app_buddy.i-4f675946.2603.us-west-2.hermes.push.-.-
2014-02-15 14:33:40,744 - INFO - transform_and_send:rk: client-message.us-west-2.core.app_buddy.i-4f675946.2603.us-west-2.hermes.push.-.-

==> input.log <==
2014-02-15 18:56:12,746 - DEBUG - get_message:Received raw json message from searchcloud.
2014-02-15 18:56:12,747 - WARNING - check_set_uuid:uuid missing - set to d6ebf748-9672-11e3-a706-124015c90b0b
2014-02-15 18:56:12,748 - DEBUG - send_metal_message:Message sent with rk: process.us-west-2.searchcloud.input.i-f0ee2cc7.10388.us-west-2.searchcloud.silver.-.-

==> route.log <==
2014-02-15 18:55:42,737 - DEBUG - route_message:published rk: metrics.searchcloud.minute.2014-02-15T18:55Z.terms
2014-02-15 18:56:12,758 - DEBUG - route_message:received rk: process.us-west-2.searchcloud.input.i-f0ee2cc7.10388.us-west-2.searchcloud.silver.-.-
2014-02-15 18:56:12,759 - DEBUG - route_message:published rk: metrics.searchcloud.minute.2014-02-15T18:56Z.terms

==> process.log <==
2014-02-15 18:55:02,732 - INFO - message_callback:rk: process.us-west-2.searchcloud.input.i-2e8f091a.6439.us-west-2.searchcloud.silver.-.-
2014-02-15 18:55:32,731 - INFO - message_callback:rk: process.us-west-2.searchcloud.input.i-2e8f091a.6439.us-west-2.searchcloud.silver.-.-
2014-02-15 18:56:02,732 - INFO - message_callback:rk: process.us-west-2.searchcloud.input.i-2e8f091a.6439.us-west-2.searchcloud.silver.-.-
```
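These log lines look like the output of Python's logging module with a format along the lines of `%(asctime)s - %(levelname)s - %(funcName)s:%(message)s` (an assumption). A small parser can turn them into records and pull out the routing key (`rk:`) so a message can be followed across log files. `parse_line` and `LogEntry` are hypothetical helpers.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Matches lines such as:
# 2014-02-15 18:56:12,759 - DEBUG - route_message:published rk: metrics....
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+) - (?P<func>\w+):(?P<msg>.*)$"
)
RK_RE = re.compile(r"\brk: (\S+)")


class LogEntry(NamedTuple):
    when: datetime
    level: str
    func: str
    message: str
    routing_key: Optional[str]  # None when the line carries no "rk:"


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None for lines that do not match (e.g. file headers)."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    when = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
    rk = RK_RE.search(m["msg"])
    return LogEntry(when, m["level"], m["func"], m["msg"].strip(),
                    rk.group(1) if rk else None)
```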
Usage
•Front-end 8 in US, 8 in EU (c3.large)
•Core 3 in US, 3 in EU (c3.xlarge)
•Cache 6 in US, 6 in EU (r3.large)
•Product 3 in US, 3 in EU (c3.large)
•This is per product
c3.large (2 vCPU, 3.75 GB memory, 2x16 GB SSD)
c3.xlarge (4 vCPU, 7.5 GB memory, 2x40 GB SSD)
r3.large (2 vCPU, 15.25 GB memory, 1x32 GB SSD)
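As a quick check on the footprint, the arithmetic below totals one product's fleet from the instance counts and specs listed above (the product tier is counted once, since it is per product).

```python
# Instance specs from the slide: instance type -> (vCPUs, memory in GB)
SPECS = {"c3.large": (2, 3.75), "c3.xlarge": (4, 7.5), "r3.large": (2, 15.25)}

# (tier, US + EU instance count, instance type) for one product
FLEET = [
    ("Front-end", 8 + 8, "c3.large"),
    ("Core", 3 + 3, "c3.xlarge"),
    ("Cache", 6 + 6, "r3.large"),
    ("Product", 3 + 3, "c3.large"),
]


def totals(fleet):
    """Return (instances, vCPUs, memory in GB) summed over all tiers."""
    instances = sum(count for _, count, _ in fleet)
    vcpus = sum(count * SPECS[kind][0] for _, count, kind in fleet)
    mem_gb = sum(count * SPECS[kind][1] for _, count, kind in fleet)
    return instances, vcpus, mem_gb


print(totals(FLEET))  # (40, 92, 310.5)
```

That is 40 instances, 92 vCPUs, and about 310.5 GB of memory across the US and EU for a single product.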
Message Types
•There are three main types of messages
•Registered-user messages (regi)
•Feed messages (bna, live video, etc)
•Post office (inter-service)
•There are two service classes
•Silver (processed in one region)
•Gold (processed in all regions)
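The difference between the two service classes can be sketched as a routing decision. The slides say silver messages are processed in one region and gold messages in all regions; which single region handles a silver message is not stated, so this sketch assumes the region it arrived in. The region names are placeholders.

```python
from enum import Enum
from typing import List, Sequence


class ServiceClass(Enum):
    SILVER = "silver"  # processed in one region
    GOLD = "gold"      # processed in all regions


def processing_regions(service_class: ServiceClass, origin_region: str,
                       all_regions: Sequence[str]) -> List[str]:
    """Regions that should process a message of the given service class.

    Assumption: a silver message stays in the region where it entered the system.
    """
    if service_class is ServiceClass.GOLD:
        return list(all_regions)
    if origin_region not in all_regions:
        raise ValueError(f"unknown region: {origin_region}")
    return [origin_region]
```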
Silver Message
```json
{
  "body": {
    "sub_type": "CreditCardExpired",
    "display": "sartre-display",
    "meta_link": "metaLink",
    "pub_date": 1390939740,
    "display_duration": 600,
    "title": "The credit card we have on file is no longer valid.",
    "end_time": 1391544540,
    "start_time": 1390939740
  },
  "collection": "regi.29721864",
  "hash_key": 410121,
  "uuid": "06ccb430-8858-11e3-ba35-1231381031c5",
  "cache_ttl_secs": 604800
}
```
Gold Message
```json
{
  "body": {
    "label": "Sports Alert",
    "sub_type": "BreakingNews",
    "links": [
      {
        "url": "http://www.nytimes.com/2014/02/20/sports/olympics/ligety-takes-big-lead-in-giant-slalom.html",
        "offset": 0,
        "count": 0
      }
    ],
    "display_duration": null,
    "title": "American Ted Ligety Wins Gold in Giant Slalom",
    "end_time": 1392811081,
    "start_time": 1392809818,
    "status": "updated",
    "display_type_id": 1,
    "id": 2996930,
    "version": 2
  },
  "collection": "feeds.breaking-news",
  "hash_key": 2996930,
  "uuid": "4a3dd1d2-995a-11e3-af7a-12313b01ac80",
  "cache_ttl_secs": 1199
}
```
Subscription Status Request Format
```text
{
  ...
  "type": "subscription-status-request",
  "correlation_id": String,
  "reply_to": String,
  "body": {
    "session_id": String,
    "version": String,
    "user_id": String | null,
    "password": String | null,
    "client_app": String,
    "regi_id": Integer | null
  }
}
```
Subscription Status Request Message
```json
{
  "body": {
    "client_app": "hermes.push",
    "password": null,
    "regi_id": 68752897,
    "session_id": "r6pko_pq",
    "user_id": null,
    "version": "2.8.11b"
  },
  "correlation_id": "r6pko_pq",
  "reply_to": "subscription-response.us-west-2.hermes.push.-.-.us-west-2.core.app_buddy.i-efd010d9.20110",
  "type": "subscription-status-request",
  "uuid": "622a7954-98b4-11e3-af4b-02e44d415eaf"
}
```
Subscription Response Format
```text
{
  "routing_key": String,
  "type": "subscription-status",
  "correlation_id": String,
  "uuid": String,
  "body": {
    "client_app": String,
    "regi_id": Integer | null,
    "feeds": List,
    "cache_pull": [
      {
        "collection": String,
        "hash_key": String
      },
      ...
    ]
  }
}
```
Subscription Response Message
```json
{
  "body": {
    "feeds": ["feeds.broadcast.#", "feeds.breaking-news.#", "feeds.video.#", "regi.67835031.#"],
    "regi_id": "67835031",
    "cache_pull": [
      {"hash_key": "#", "collection": "regi.67835031"},
      {"hash_key": "#", "collection": "feeds.broadcast"},
      {"hash_key": "#", "collection": "feeds.breaking-news"},
      {"hash_key": "#", "collection": "feeds.video"}
    ],
    "client_app": "hermes.push"
  },
  "uuid": "1266b820-98ae-11e3-b9f0-0aeb4d17d40d",
  "routing_key": "subscription-response.us-west-2.hermes.push.-.-.us-west-2.core.app_buddy.i-efd010d9.20110",
  "type": "subscription-status",
  "correlation_id": "3c59f5f1-38f4-4e9e-94e0-54ea3f7a817e"
}
```
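The `feeds` entries end in `.#`, which reads like AMQP topic binding keys (an interpretation; the slides do not say exactly how they are used). Under RabbitMQ's topic-exchange rules, words are dot-separated, `*` matches exactly one word, and `#` matches zero or more words. A small matcher shows which routing keys a subscription such as `feeds.breaking-news.#` would receive; the sample routing keys in the checks are hypothetical.

```python
from functools import lru_cache


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Match a routing key against an AMQP topic binding pattern.

    Words are separated by '.', '*' matches exactly one word and
    '#' matches zero or more words (RabbitMQ's topic-exchange rules).
    """
    pat = pattern.split(".")
    key = routing_key.split(".")

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # True if pat[i:] matches key[j:].
        if i == len(pat):
            return j == len(key)
        if pat[i] == "#":
            # Either '#' matches no more words, or it absorbs key[j] and stays active.
            return match(i + 1, j) or (j < len(key) and match(i, j + 1))
        if j == len(key):
            return False
        return (pat[i] == "*" or pat[i] == key[j]) and match(i + 1, j + 1)

    return match(0, 0)
```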
Message Tracing (stored in nyt-history header)
```text
account:fabrik action:publish domain:fabrik.nytimes.com environment:prd instance_id:i-ff3b1acb org_unit:fabrik pid:8807 product:core project:standard region:us-west-2 service:input timestamp:2014-02-19T16:16:47.731324Z zone:us-west-2c
account:fabrik action:publish domain:fabrik.nytimes.com environment:prd instance_id:i-344bf003 org_unit:fabrik pid:6897 product:core project:standard region:us-west-2 service:process timestamp:2014-02-19T16:16:47.742523Z zone:us-west-2b
account:fabrik action:publish domain:fabrik.nytimes.com environment:prd instance_id:i-ff3b1acb org_unit:fabrik pid:5709 product:core project:standard region:us-west-2 service:route timestamp:2014-02-19T16:16:47.745159Z zone:us-west-2c
```
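Because each hop records a timestamp, per-hop latency falls out of the trace directly. This sketch assumes each entry is written as space-separated `key:value` pairs; `parse_trace_entry` and `hop_latencies_ms` are hypothetical helpers. Only the first `:` in each pair separates key from value, so the timestamps stay intact.

```python
from datetime import datetime
from typing import Dict, List

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_trace_entry(entry: str) -> Dict[str, str]:
    """Split 'key:value key:value ...' into a dict.

    Only the first ':' in each pair separates key from value, so a value
    such as '2014-02-19T16:16:47.731324Z' survives intact.
    """
    fields: Dict[str, str] = {}
    for token in entry.split():
        key, _, value = token.partition(":")
        fields[key] = value
    return fields


def hop_latencies_ms(entries: List[str]) -> List[float]:
    """Milliseconds between consecutive trace entries, in the order given."""
    times = [datetime.strptime(parse_trace_entry(e)["timestamp"], TS_FORMAT)
             for e in entries]
    return [(later - earlier).total_seconds() * 1000
            for earlier, later in zip(times, times[1:])]
```

For the three entries above this gives about 11.2 ms from input to process and about 2.6 ms from process to route.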
Message Structure
•A message has:
•message_uuid (version 1 UUID)
•replica_uuid (version 1 UUID)
•body (optional: large bodies are referenced in metadata)
•A time to live (all TTLs are < 30 days)
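The fields in this list can be captured in a small data class that enforces the stated rules: both UUIDs must be version 1 and the TTL must be under 30 days. This is a hypothetical in-memory shape, not the fabrik's storage format; requiring a positive TTL is an extra assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Optional

MAX_TTL_SECS = 30 * 24 * 60 * 60  # "all ttls are < 30 days"


@dataclass(frozen=True)
class Message:
    """Hypothetical in-memory shape of a message with the listed fields."""

    message_uuid: uuid.UUID
    replica_uuid: uuid.UUID
    ttl_secs: int
    body: Optional[Any] = None  # None when a large body is referenced from metadata instead

    def __post_init__(self) -> None:
        for name in ("message_uuid", "replica_uuid"):
            if getattr(self, name).version != 1:
                raise ValueError(f"{name} must be a version 1 UUID")
        # The slides only give the upper bound; requiring a positive TTL is an assumption.
        if not 0 < self.ttl_secs < MAX_TTL_SECS:
            raise ValueError("ttl_secs must be positive and under 30 days")
```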
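We use a universally unique identifier (UUID) for every message. A version 1 UUID embeds a timestamp, so messages can easily be sorted chronologically by that timestamp. (The plain string form of a version 1 UUID does not sort by time, because the low-order time bits come first.)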
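Python's standard `uuid` module exposes the 60-bit timestamp of a version 1 UUID as `UUID.time`, counted in 100 ns intervals since 1582-10-15. Converting it to Unix time and sorting on it takes only a few lines; `uuid1_unix_seconds`, `uuid1_datetime`, and `sort_chronologically` are hypothetical helper names. For example, the `uuid` in the Silver message decodes to within a second of that message's `pub_date` (1390939740).

```python
import uuid
from datetime import datetime, timezone
from typing import Iterable, List

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_unix_seconds(u: uuid.UUID) -> float:
    """Unix time (seconds) embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    return (u.time - UUID_EPOCH_OFFSET) / 1e7


def uuid1_datetime(u: uuid.UUID) -> datetime:
    """The same timestamp as an aware UTC datetime."""
    return datetime.fromtimestamp(uuid1_unix_seconds(u), tz=timezone.utc)


def sort_chronologically(uuids: Iterable[uuid.UUID]) -> List[uuid.UUID]:
    """Order version 1 UUIDs by embedded timestamp, not by their string form."""
    return sorted(uuids, key=lambda u: u.time)
```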
Message indexing
•A message has one or more ‘paths’ carried in its metadata
•Each path consists of:
•collection
•hash_key
•range_key (implicit = message_uuid)
Query patterns: get latest
•Get latest message in a subtree:
•walk a subtree of the path
•return the latest message for each complete path found
•Used to:
•get the latest versions of news items within a category (e.g. query path ‘feeds.breaking-news.#’ will retrieve the latest version of each breaking news item)
•get the latest versions of client information for a client
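A minimal in-memory sketch of “get latest”, assuming a flat list of `(collection, hash_key, message_uuid, body)` rows rather than the real cache. A query ending in `.#` selects a subtree of paths, and for each complete path the message whose version 1 UUID carries the newest timestamp wins. `get_latest` and the row shape are hypothetical.

```python
import uuid
from typing import Dict, Iterable, Tuple

# (collection, hash_key, message_uuid, body): a hypothetical flat index row
Row = Tuple[str, str, uuid.UUID, dict]


def get_latest(rows: Iterable[Row], query: str) -> Dict[str, dict]:
    """Latest message body for each complete path under a '.#' subtree query."""
    if not query.endswith(".#"):
        raise ValueError("expected a subtree query ending in '.#'")
    prefix = query[:-2]
    latest: Dict[str, Tuple[int, dict]] = {}  # path -> (uuid timestamp, body)
    for collection, hash_key, msg_uuid, body in rows:
        path = f"{collection}.{hash_key}"
        if path != prefix and not path.startswith(prefix + "."):
            continue  # outside the requested subtree
        best = latest.get(path)
        if best is None or msg_uuid.time > best[0]:
            latest[path] = (msg_uuid.time, body)
    return {path: body for path, (_, body) in latest.items()}
```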
Query patterns: get all
•Get all unexpired messages for a path, up to a limit:
•find the path
•return messages in reverse date order up to the limit
•Used to:
•get metrics for a time bucket (e.g. query path ‘metrics.searchcloud.minute.2014-02-01T09:39Z’ will retrieve all the messages in that bucket)
•get all the unexpired versions of a specific information set, e.g. a to-do list
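“Get all” differs from “get latest” in two ways: the path must match exactly, and every unexpired message is returned, newest first, up to the limit. This sketch assumes rows of `(path, message_uuid, ttl_secs, body)` and derives each message's age from its version 1 UUID; the row shape and `get_all` are hypothetical.

```python
import time
import uuid
from typing import Iterable, List, Optional, Tuple

UUID_EPOCH_OFFSET = 0x01B21DD213814000  # 100 ns ticks from 1582-10-15 to 1970-01-01

# (path, message_uuid, ttl_secs, body): a hypothetical flat index row
Row = Tuple[str, uuid.UUID, int, dict]


def get_all(rows: Iterable[Row], path: str, limit: int,
            now: Optional[float] = None) -> List[dict]:
    """Unexpired message bodies stored at exactly `path`, newest first, at most `limit`.

    A message expires once `now` reaches its UUID timestamp plus its TTL.
    """
    now = time.time() if now is None else now
    live = []
    for row_path, msg_uuid, ttl_secs, body in rows:
        created = (msg_uuid.time - UUID_EPOCH_OFFSET) / 1e7  # Unix seconds
        if row_path == path and now < created + ttl_secs:
            live.append((msg_uuid.time, body))
    live.sort(key=lambda item: item[0], reverse=True)  # reverse date order
    return [body for _, body in live[:limit]]
```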
Other query patterns
•get a message by message_uuid
•get all messages by time bucket (journal)
•get a range of paths