PubSubHubbub for Developers

Brett Slatkin, Software Engineer, Google Inc.
September 28, 2009

Agenda
- Background
- Intro
- Motivation
- Scale
- Progress

Background

Why do real-time messaging?
- Syndication
  o Creating a "flow"
  o Simultaneous delivery of an event spurs immediate conversation
  o More participation enables more developed conversations and a better exchange of ideas
  o Cross-site delivery allows promotion, linking, swarming around sources, mash-ups, growth opportunity

Why do real-time messaging?
- Business, politics
  o 1 minute of delay could cost a company millions, cause a political scandal, be harmful to investors, etc.
  o Concrete example: SEC earnings requirements

Why do real-time messaging?
- Future applications (out of scope, but ...)
  o Financial data
  o Public scientific measurements (e.g., stream of weather data, traffic status, polling, votes)
  o Sensor networks
  o Emergency information distribution
  o Anything you can think of that's a stream of information!

Why do decentralized messaging?
- Web was built on decentralized protocols
- No single point of failure
- Interoperability is key to network effects and growth
- One API for application developers

Intro

What is PubSubHubbub?
- A simple publish/subscribe protocol
- Turns Atom and RSS feeds into real-time streams
- Web-scale, low-latency messaging
- Three participants: Publisher, Subscriber, Hub

[Diagram: Publisher, Hub, Subscriber]

Design goals of PubSubHubbub
- Decentralized: no one company in control
- Scale to the size of the whole web
- Publishing and subscribing as easy as possible
- Complexity in the Hub
- Pragmatic (i.e., not theoretically perfect, but solves huge, known use cases with minimal effort)

How-to for Publishers
1. Add a declaration in your feed with your Hub of choice
2. Add something to your feed!
3. Send a ping to the Hub with the feed URL

   POST / HTTP/1.1
   Content-Type: application/x-www-form-urlencoded
   ...
   hub.mode=publish&hub.url=

4. 204 response = Success, 4xx = Bad request, 5xx = Try again
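
The publish ping in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's client library: the function names and the example feed URL are hypothetical, and only the `hub.mode` and `hub.url` parameters and the status-code meanings come from the steps above.

```python
from urllib.parse import urlencode


def build_publish_ping(feed_url: str) -> bytes:
    """Form-encode the body of a publish ping for one feed URL."""
    return urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")


def interpret_ping_status(status: int) -> str:
    """Map the Hub's HTTP status code to the outcomes listed in step 4."""
    if status == 204:
        return "success"
    if 400 <= status < 500:
        return "bad request"
    if 500 <= status < 600:
        return "try again"
    return "unexpected"


if __name__ == "__main__":
    # Hypothetical feed URL, used only for illustration.
    print(build_publish_ping("http://example.com/feed.atom"))
    print(interpret_ping_status(204))
```

The body would be sent with `Content-Type: application/x-www-form-urlencoded` to whichever Hub URL the feed declares.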

How-to for Subscribers
1. Detect the Hub declaration in a feed
2. Send a subscribe request to the feed's Hub

   POST / HTTP/1.1
   Content-Type: application/x-www-form-urlencoded
   ...
   hub.mode=subscribe&hub.verify=sync&hub.topic=&hub.callback=

3. Hub will send a request to verify the subscription

   GET /callback?hub.challenge= HTTP/1.1

   HTTP/1.1 200
   ...
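
A rough Python sketch of the three subscriber steps above. Two details are assumptions that the slide text does not show: that the Hub is declared with an Atom `<link rel="hub" href="...">` element, and that the subscriber confirms the subscription by echoing the `hub.challenge` value in the body of its 200 response. Both match my reading of the spec, but check the current draft. All function names and example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlencode, urlparse

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hub(feed_xml: str) -> Optional[str]:
    """Step 1: return the href of the feed-level rel="hub" link, if any."""
    root = ET.fromstring(feed_xml)
    for link in root.findall(ATOM_NS + "link"):
        if link.get("rel") == "hub":
            return link.get("href")
    return None


def build_subscribe_body(topic_url: str, callback_url: str) -> str:
    """Step 2: form-encode the subscribe request parameters."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.verify": "sync",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })


def answer_verification(request_target: str) -> Tuple[int, str]:
    """Step 3: answer the Hub's GET by echoing hub.challenge (assumed behavior)."""
    query = parse_qs(urlparse(request_target).query)
    challenge = query.get("hub.challenge", [""])[0]
    return 200, challenge
```

In a real subscriber, `answer_verification` would sit behind whatever web framework serves the callback URL.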

How-to for Subscribers
Process new content from the Hub

   POST /callback HTTP/1.1
   Content-Type: application/atom+xml
   ...
   Awesome feed ... ...
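
Handling the pushed content could look like the sketch below. It assumes the POST body is an ordinary Atom document, as the `application/atom+xml` content type suggests. The function name and the choice to return only each entry's ID and title are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_entries(body: bytes) -> List[Tuple[str, str]]:
    """Return (entry id, entry title) pairs from a pushed Atom document."""
    root = ET.fromstring(body)
    entries = []
    for entry in root.iter(ATOM_NS + "entry"):
        entry_id = entry.findtext(ATOM_NS + "id", default="")
        title = entry.findtext(ATOM_NS + "title", default="")
        entries.append((entry_id, title))
    return entries
```

Keying on the entry ID lets the subscriber ignore entries it has already stored if a delivery is repeated.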

The role of the Hub
- Logical component
  o Publishers may be their own Hub
  o Combined Hub/Publisher has p2p speed-up
- Distinct functions
  o Accept and verify subscriptions to new topics
  o Receive pings from publishers, retrieve content
  o Extract new/updated items from feed
  o Send all subscribers the new content
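
One way a Hub might "extract new/updated items" is to remember a content hash per entry ID and forward only the entries whose hash changed. The sketch below illustrates that idea. It is not the reference Hub's code, and the in-memory `seen` dictionary stands in for whatever datastore a real Hub would use.

```python
import hashlib
from typing import Dict, List


def new_or_updated(entries: Dict[str, str], seen: Dict[str, str]) -> List[str]:
    """Return IDs of entries that are new or whose content changed.

    `entries` maps entry ID -> entry content from the freshly fetched feed.
    `seen` maps entry ID -> SHA-1 hex digest of the last content delivered;
    it is updated in place.
    """
    changed = []
    for entry_id, content in entries.items():
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if seen.get(entry_id) != digest:
            seen[entry_id] = digest
            changed.append(entry_id)
    return changed
```

Calling the function twice with the same feed returns an empty list the second time, so only the delta goes out to subscribers.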

The role of the Hub
- Scalability
  o # of subscribers & feeds, update frequency
  o Delegation of content distribution (= bandwidth)
- Reliability
  o Retry fetch, delivery, idempotence
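
The reliability bullet can be illustrated with a small retry loop. The exponential backoff schedule, the attempt limit, and the `send` callable are all assumptions made for this sketch; the slides do not say how the Hub spaces its retries.

```python
import time
from typing import Callable


def deliver_with_retry(
    send: Callable[[bytes], int],
    payload: bytes,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send(payload) until it returns a 2xx status or attempts run out.

    Waits base_delay * 2**attempt seconds between attempts (assumed policy).
    """
    for attempt in range(max_attempts):
        status = send(payload)
        if 200 <= status < 300:
            return True
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

Because a retry can deliver the same entry twice, the receiving side needs to be idempotent, for example by de-duplicating on entry ID.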

How the hub works

See my talk on building a hub using App Engine http://tinyurl.com/building-a-hub

Security model
- Subscriber verification prevents DoS attacks
- Declaration of the Hub is a delegation of trust
  o Subscribers may trust the Hub to deliver content on publisher's behalf
  o v0.2 supports shared-secret HMACs for subscribers to verify that notifications came from the hub
- Privacy through HTTPS for hubs, feeds, and callbacks
  o URLs and payloads can be sent via encrypted channel
  o Subscribed topics are not discoverable
  o Unguessable, capability URLs (e.g., from OAuth)
- Publishers can run their own hub!
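
The shared-secret check can be sketched with Python's standard `hmac` module. The signature format used here (`sha1=` followed by the hex HMAC-SHA1 of the request body, carried in an `X-Hub-Signature` header) is how I recall it from later spec drafts and should be treated as an assumption. The secret and payload are made up.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature string the Hub would attach to a notification."""
    return "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()


def is_authentic(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check a received signature header against the body, in constant time."""
    return hmac.compare_digest(sign(secret, body), header_value)


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"
    body = b"<feed>...</feed>"
    header = sign(secret, body)
    print(is_authentic(secret, body, header))
    print(is_authentic(secret, body + b"tampered", header))
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters of the signature matched.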

Motivation

Push it to the limit
Why push content? Learn from our forefathers.

TCP (est. 1974)

Push it to the limit
What is magical about TCP? The Window.

Push it to the limit
Without the window, the tube can't be full.

Push it to the limit
- TCP maximizes the throughput of a link
- Dump data in, it will be received
- The window means no waiting for acks!
- When acks are missed, the sender will retransmit
- Receivers reassemble the message in-order, de-dupe
- Good citizenship with congestion control
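
The window argument can be made concrete with the usual back-of-the-envelope bound: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is capped at window size divided by round-trip time (or the link rate, whichever is lower). The numbers in the example are illustrative, not measurements.

```python
def max_throughput(window_bytes: float, rtt_seconds: float,
                   link_bytes_per_second: float) -> float:
    """Upper bound on throughput: one window per round trip, capped by the link."""
    return min(link_bytes_per_second, window_bytes / rtt_seconds)


if __name__ == "__main__":
    rtt = 0.25            # hypothetical 250 ms round trip
    link = 125_000_000    # hypothetical 1 Gbit/s link, in bytes per second
    # Waiting for an ack after every 1460-byte segment (no window):
    print(max_throughput(1460, rtt, link))
    # Keeping a 64 KiB window in flight:
    print(max_throughput(65536, rtt, link))
```

The same reasoning is why the deck treats poll-then-fetch protocols as "killing the window": each extra round trip per update caps how fast content can flow.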

Push it to the limit
Where is such efficiency for application-level protocols?
- Exists, but often proprietary or an interoperability nightmare (cough SOAP cough)

Why another protocol?
- We want interoperable, web-scale messaging
- Almost every company already has an internal system
  o TIBCO, WebsphereMQ, ActiveMQ, RabbitMQ, ...
  o Proprietary message payloads, topics, networks
- Existing attempts at a standard haven't caught on
  o XMPP weirds people out; started in 1999, still isn't widely used for interop beyond IM
  o These standards are too complex or not pragmatic (XEP-0060, WS-*, AMQP, RestMS, new REST-*)

Why another protocol?
- Build the simplest interoperable messaging protocol that can scale to the size of the web
- Make the base specification bare-bones, easy-to-use
- Target Atom/RSS initially as a payload format; everyone uses them for time-based, idempotent streams
- In the future, add extensions for cool stuff

Why another protocol?
- Proof of simplicity is in the code
  o Bret Taylor added PubSubHubbub subscription to FriendFeed in a single evening

Scale

Goal
- World-wide RSS publishing currently
  o ~X,000 updates per second
- Legitimate email currently
  o ~X,000,000 per second
- Need to scale by at least 1000x; hopefully more
- Trying to enable new use-cases

Light pinging
- Protocols exist for faster Atom/RSS
  o Ping-o-Matic, changes.xml, SUP, rssCloud
- All only indicate the feed URL that has changed
  o Still need to go and fetch the content
  o These protocols are just optimized polling
  o Equivalent to killing the TCP window!

Light pinging
- Optimized polling is still worse
  o Latency is high: 3 round trips
  o Thundering herd as subscribers fetch published feeds
    - Unpredictable, bursty load pattern
  o More bandwidth, CPU, connection star-pattern

Light pinging at scale
What if you had to use light pinging at scale?
- Send out pings slowly to reduce the herd
- Herd causes all feeds to be fully regenerated
  o Invalidates existing caches
- Bandwidth increases extremely fast
  o (average updates per feed) * (# feeds) * (# subscribers) * (average feed size)
  o Often 99.5%+ more than you needed
- CPU costs increase for subscribers with update frequency

Light pinging at scale
- Consider a single-master replication scheme
- After each update, wait for copying to all replicas

Fat pinging
Compared to light pings
- Latency: 1/3 as much
- Based on reasonable averages
  o Bandwidth: ~20x less
  o CPU: ~20x less
- Never wait for replication delays

Fat pinging at scale
What if you had to scale fat pinging?
- Run your own hub
- Compute feed deltas at update time; no need to regenerate a whole feed (or churn your caches)
- Send out new content at sustained network rate
- Bandwidth is the minimum possible per subscriber
  o (update size) * (# feeds) * (# subscribers)
- CPU cost is the minimum possible per subscriber
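
The two bandwidth formulas from the "at scale" slides can be compared directly. The sketch below plugs in made-up numbers (1,000 feeds, 100 subscribers per feed, a 20 KB feed document, a 1 KB update) and looks at a single update per feed, since the fat-ping formula as written has no update-frequency term. The inputs are assumptions chosen only to show how the ratio falls out.

```python
def light_ping_bandwidth(updates_per_feed: int, feeds: int,
                         subscribers: int, feed_size: int) -> int:
    """(average updates per feed) * (# feeds) * (# subscribers) * (average feed size)."""
    return updates_per_feed * feeds * subscribers * feed_size


def fat_ping_bandwidth(update_size: int, feeds: int, subscribers: int) -> int:
    """(update size) * (# feeds) * (# subscribers)."""
    return update_size * feeds * subscribers


if __name__ == "__main__":
    feeds, subscribers = 1_000, 100          # hypothetical
    feed_size, update_size = 20_000, 1_000   # hypothetical sizes in bytes
    light = light_ping_bandwidth(1, feeds, subscribers, feed_size)
    fat = fat_ping_bandwidth(update_size, feeds, subscribers)
    print(light, fat, light / fat)
```

With everything else equal, the ratio reduces to feed size over update size, so the saving grows as feeds get longer relative to each new entry.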

Fat pinging at scale
Advanced protocol pieces
- Connection reuse from HTTP/1.1
- Pipeline HTTP requests for feed fetching
- Use aggregated content delivery
  o Many Atom feeds in a single XML doc
  o Fewer connections

Progress

PubSubHubbub status
- Over 100 million feeds are PubSubHubbub-enabled
- Companies: Google, FriendFeed (FB), livedoor, Six Apart, LiveJournal, LazyFeed, Superfeedr, ...
- Google products: FeedBurner, Blogger, Reader shared items, Google Alerts, ...
- Cool apps: Socnode, Reader2Twitter, chat gateways, ...
- More publishers, subscribers, hubs, apps on the way
- Publisher clients: Perl, PHP, Python, Ruby, Java, Haskell, C#, MovableType, WordPress, Django, Zend
- Active mailing list with 240+ members

Getting involved
- Review the spec; recommend improvements
  o Open process, will be licensed by Open Web Foundation
- Write some sample code for your favorite language or CMS
- Contribute to one of the open source Hub implementations
- Write on your blog about why we need push for the future
  o Do it for the children

What Facebook can do right now
- Subscribe to feeds that are PubSubHubbub-enabled
  o Put that great UI to work
  o Maybe reuse the FriendFeed index pipeline?
  o Call Bret and Ben
- Enable PubSubHubbub for activity streams
  o Provide Facebook app developers with real-time updates to users' home streams
  o Speeds up surfacing Facebook in other apps
  o Detecting new events could trigger the app to take action in real-time (send an email, classify a photo, initiate an action in a game, etc.)

What Facebook can do next
- Figure out if private feeds will work with this model
  o Run your own hub
  o Use capability URLs (OAuth token in the query string)
- Give your developers more feeds to consume and syndicate

Rehash

Rehash
- Push for the future!
- Scale to new use-cases
- Decentralized, open spec: no company owns it
- One API for all stream-based content

Rehash
- Project page: http://pubsubhubbub.googlecode.com
  o Full Hub source code with tests
  o Example publisher and subscriber apps
  o Demo hub at http://pubsubhubbub.appspot.com

?

Hub storage space
- How much storage space does a Hub need?
  o Manageable costs
    - ~10 million feeds
    - ~1 million subscribers
  o Assume 1 billion events per day (~11,000/second)
- Thar be dragons!

Hub storage space
FeedEntryRecord
- Key name
  o "FeedEntryRecord" + entry_id_hash + parent key
  o 400 bytes, could be smaller
- Indexed properties
  o Entry ID hash (again, doh!): 160 bytes
  o Entry content hash: 160 bytes
  o Update time: 8 bytes
- Unindexed properties
  o Entry ID: 2048 bytes maximum, 200 on average
- Result: ~1KB per entry
- 27TB per month at ~11,000 req/sec: no sweat!
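
The storage estimate can be re-derived from the per-field sizes listed above. The sketch assumes a 30-day month, decimal units, and the 200-byte average entry ID; under those assumptions the listed sizes sum to 928 bytes per entry and roughly 27.8 TB per month, in line with the slide's rounded figures. How the original numbers were actually derived is not stated.

```python
SECONDS_PER_DAY = 86_400

# Per-entry sizes in bytes, as listed on the slide.
ENTRY_BYTES = {
    "key name": 400,
    "entry ID hash (indexed)": 160,
    "entry content hash (indexed)": 160,
    "update time (indexed)": 8,
    "entry ID (unindexed, average)": 200,
}


def bytes_per_entry() -> int:
    """Sum of the listed field sizes for one FeedEntryRecord."""
    return sum(ENTRY_BYTES.values())


def monthly_bytes(events_per_day: int, days: int = 30) -> int:
    """Storage added per month if every event writes one record."""
    return bytes_per_entry() * events_per_day * days


if __name__ == "__main__":
    events_per_day = 1_000_000_000
    print(bytes_per_entry())                        # bytes per record
    print(events_per_day / SECONDS_PER_DAY)         # events per second
    print(monthly_bytes(events_per_day) / 1e12)     # TB per month (decimal)
```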

WebFinger
Unified discovery for email addresses
- Transform an email address into XRD
- XRD defines all the services that address has
- Helps provide social networking as a protocol
- E.g., simple way to discover if an account has a Portable Contacts interface