
Page 1: Flashback: QCon San Francisco 2012

Sergejus Barinovas

Page 2: Flashback: QCon San Francisco 2012

Why San Francisco?

Learn how others operate at scale

Learn what problems others have

Learn whether their solutions apply to us

Learn whether their problems apply to us

Page 3: Flashback: QCon San Francisco 2012

Silicon Valley-based companies:

- Google

- Facebook

- Twitter

- Netflix

- Pinterest

- Quora

- tons of others...

Why San Francisco?

Page 4: Flashback: QCon San Francisco 2012

NoSQL: Past, Present, Future

Eric Brewer – author of the CAP theorem

the CP vs. AP trade-off applies only on time-out (failure)

Page 5: Flashback: QCon San Francisco 2012

Real-time Web with Node.js

Page 6: Flashback: QCon San Francisco 2012

Real-time web

node.js – the de-facto standard for the real-time web

open a connection per user and leave it open

WebSockets are great, but use fallbacks (see the sketch below)

- mobile devices don't support WebSockets

- long polling, infinite frame, etc.

more companies are moving to the SPDY protocol
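To make the fallback idea concrete, here is a minimal Node.js sketch using the socket.io library, which starts with HTTP long polling and upgrades to WebSockets only when the client supports them (the port and event name are arbitrary):

    import { createServer } from "http";
    import { Server } from "socket.io";

    const httpServer = createServer();
    // socket.io negotiates the best transport per client:
    // WebSockets where available, long polling as the fallback
    const io = new Server(httpServer);

    io.on("connection", (socket) => {
      // the connection stays open; push events as they happen
      socket.emit("news", { time: Date.now() });
    });

    httpServer.listen(3000);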

Page 7: Flashback: QCon San Francisco 2012

Quora on mobile

Page 8: Flashback: QCon San Francisco 2012

Quora on mobile

first iPhone app

- a mobile app is like an old-style app shipped on a CD

- hybrid application

- native code for controls and navigation

- HTML for viewing Q&A from the site

- separate mobile-optimized HTML layout of the web page

Page 9: Flashback: QCon San Francisco 2012

Quora on mobile

second, the Android app

- created a clone of the iPhone app – failed!

- a UI that is natural on iPhone is alien on Android

- bought Android devices and learned their philosophy

- used the new Google Android UI design guidelines

- created a new app with a native Android look & feel

- users in India pay per MB, so traffic had to be optimized

- those optimizations were then applied to the iPhone app and the web page

Page 10: Flashback: QCon San Francisco 2012

Quora on mobile

mobile-first experience

- mobile has unique requirements

- if you're good on mobile, you're good anywhere

- don't use the mobile app on tablets; create a separate one or use the web

Page 11: Flashback: QCon San Francisco 2012

Continuous delivery

Page 12: Flashback: QCon San Francisco 2012

Continuous delivery

Jesse Robbins – co-founder of Opscode, the company behind Chef

infrastructure as code

- full-stack automation

- datacenter API (for provisioning VMs, etc.; see the sketch below)

- infrastructure is a product and the app is its customer
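To make the "datacenter API" idea concrete, here is a hypothetical sketch of provisioning a VM through such an API; the endpoint, payload fields, and service URL are invented for illustration, not any specific vendor's API:

    // request a VM from a hypothetical datacenter provisioning service
    interface VmRequest {
      role: string;     // e.g. "web", "api", "db"
      cpus: number;
      memoryGb: number;
    }

    async function provisionVm(req: VmRequest): Promise<string> {
      const res = await fetch("https://dc.example.internal/api/v1/vms", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(req),
      });
      if (!res.ok) throw new Error(`provisioning failed: ${res.status}`);
      const { id } = (await res.json()) as { id: string };
      return id; // hand the new machine to configuration management (e.g. Chef)
    }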

Page 13: Flashback: QCon San Francisco 2012

Continuous delivery

applications as services

- service orientation

- software resiliency

- deep instrumentation

dev / ops as teams

- service owners

- shared metrics / monitoring

- continuous integration / deployment

Page 14: Flashback: QCon San Francisco 2012

Release engineering at Facebook

Page 15: Flashback: QCon San Francisco 2012

Release engineering at Facebook

Chuck Rossi – release engineering manager

deployment process

- teams do not deploy to production by themselves

- IRC is used for communication during deployment

- if a team member is not connected to IRC, the release is skipped

- BitTorrent is used to distribute deployments

- powerful app monitoring and profiling (instrumentation)

Page 16: Flashback: QCon San Francisco 2012

Release engineering at Facebook

deployment process

- ability to release to a subset of servers

- very powerful feature-flag mechanism gating by IP, gender, age, … (see the sketch below)

- karma points for developers, with a down-vote button
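A minimal sketch of what attribute-based gating like this might look like; the flag name and rules below are invented for illustration and are not Facebook's actual Gatekeeper system:

    // the user attributes a gate can inspect
    interface User {
      ip: string;
      gender: string;
      age: number;
    }

    // one rule set per flag: every predicate must pass
    type Rule = (u: User) => boolean;

    const flags: Record<string, Rule[]> = {
      // hypothetical flag: new timeline for adults outside the office network
      new_timeline: [
        (u) => u.age >= 18,
        (u) => !u.ip.startsWith("10."),
      ],
    };

    function isEnabled(flag: string, user: User): boolean {
      const rules = flags[flag];
      return rules !== undefined && rules.every((r) => r(user));
    }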

facebook.com

- continuously deployed internally

- employees always access the latest facebook.com

- easy to report a bug from the internal facebook.com

Page 17: Flashback: QCon San Francisco 2012

Scaling Pinterest

Page 18: Flashback: QCon San Francisco 2012

Scaling Pinterest

everything in the Amazon cloud

before

- had every possible ‘hot’ technology, including MySQL, Cassandra, Mongo, Redis, Memcached, Membase, and ElasticSearch – FAIL

- keep it simple: major re-architecting in late 2011

Page 19: Flashback: QCon San Francisco 2012

Scaling Pinterest

January 2012

- Amazon EC2 + S3 + Akamai, ELB

- 90 Web Engines + 50 API Engines

- 66 sharded MySQL DBs + 66 slave replicas

- 59 Redis

- 51 Memcache

- 1 Redis task queue + 25 task processors

- sharded Solr

- 6 engineers

Page 20: Flashback: QCon San Francisco 2012

Scaling Pinterest

now

- Amazon EC2 + S3 + Akamai, Level3, EdgeCast, ELB

- 180 Web Engines + 240 API Engines

- 80 sharded MySQL DBs + 80 slave replicas

- 110 Redis

- 200 Memcache

- 4 Redis task queues + 80 task processors

- sharded Solr

- 40 engineers

Page 21: Flashback: QCon San Francisco 2012

Scaling Pinterest

schemaless DB design (see the sketch below)

- no foreign keys

- no joins

- denormalized data (id + JSON data)

- users, user_has_boards, boards, board_has_pins, pins

- read slaves

- heavy use of cache for speed & better consistency

thinking of moving to their own datacenter
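A minimal sketch of this id-plus-JSON pattern, assuming shards are picked from the owning user's id; the shard math and table layout are illustrative, not Pinterest's exact scheme:

    const NUM_SHARDS = 80; // matches the sharded MySQL fleet above

    // route every object to a shard derived from its owner's id
    function shardFor(userId: number): number {
      return userId % NUM_SHARDS;
    }

    // a row is just (id, JSON blob): no foreign keys, no joins;
    // relations live in mapping tables like user_has_boards
    interface Row {
      id: number;
      data: Record<string, unknown>;
    }

    // a read picks the shard, then does a single primary-key lookup
    function pinQuery(userId: number, pinId: number): { shard: number; sql: string } {
      return {
        shard: shardFor(userId),
        sql: `SELECT data FROM pins WHERE id = ${pinId}`,
      };
    }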

Page 22: Flashback: QCon San Francisco 2012

Architectural patterns for high availability at Netflix

Page 23: Flashback: QCon San Francisco 2012

Architectural patterns for HA

Adrian Cockcroft – director of architecture at Netflix

architecture

- everything in the Amazon cloud, across 3 availability zones

- Chaos Gorilla, Latency Monkey

- service-based architecture, stateless micro-services

- high attention to service resilience (see the sketch below)

- handle dependent-service unavailability and increased latency

started open-sourcing to improve the quality of the code
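A minimal sketch of that resilience pattern: wrap each call to a dependent service in a latency budget with a fallback, so one slow dependency cannot stall the caller (Netflix packaged this approach in its Hystrix library; the service URL and budget below are illustrative):

    // reject if the dependency takes longer than `ms`
    function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
      return Promise.race([
        p,
        new Promise<T>((_, reject) =>
          setTimeout(() => reject(new Error("dependency timed out")), ms)
        ),
      ]);
    }

    // call a dependent service, but degrade instead of failing
    async function getRecommendations(userId: number): Promise<string[]> {
      try {
        const res = await withTimeout(
          fetch(`http://recs.example.internal/users/${userId}`), // hypothetical service
          200 // latency budget in milliseconds
        );
        return (await res.json()) as string[];
      } catch {
        return []; // fallback: an empty list keeps the page rendering
      }
    }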

Page 24: Flashback: QCon San Francisco 2012

Architectural patterns for HA

Cassandra usage

- 2 dedicated Cassandra teams

- over 50 Cassandra clusters, over 500 nodes, over 30 TB of data; the biggest cluster has 72 nodes

- mostly write operations; a Memcache layer is used for reads

- moved to SSDs in Amazon instead of spinning disks plus cache

- for ETL: Cassandra backup files are read using Hadoop

- can scale from zero to 500 instances in 8 minutes

Page 25: Flashback: QCon San Francisco 2012

Twitter timelines at scale

Page 26: Flashback: QCon San Francisco 2012

Timelines at scale

Raffi Krikorian – director of Twitter's platform services

core architecture

- pull (timelines & search) and push (mobile, streams) use-cases

- 300K QPS for timelines

- on write, a fan-out process copies the data for each use-case (see the sketch below)

- timeline cache in Redis

- when you tweet and have 200 followers, there are 200 inserts, one into each follower's timeline
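A minimal sketch of that write-time fan-out into Redis, assuming the ioredis client; the key names and the per-timeline cap are illustrative:

    import Redis from "ioredis";

    const redis = new Redis(); // assumes a local Redis instance

    // on write: copy the tweet id into every follower's timeline list
    async function fanOut(tweetId: string, followerIds: string[]): Promise<void> {
      for (const followerId of followerIds) {
        const key = `timeline:${followerId}`;
        await redis.lpush(key, tweetId); // newest first
        await redis.ltrim(key, 0, 799);  // cap each cached timeline
      }
    }

    // on read: a home timeline is a single cheap range lookup
    async function homeTimeline(userId: string): Promise<string[]> {
      return redis.lrange(`timeline:${userId}`, 0, 49);
    }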

Page 27: Flashback: QCon San Francisco 2012

Timelines at scale

core architecture

- Hadoop for batch compute and recommendations

- code heavily instrumented (load times, latencies, etc.)

- uses Cassandra, but moving off it due to read times

Page 28: Flashback: QCon San Francisco 2012

More info

Slides - http://qconsf.com/sf2012

Videos - http://www.infoq.com/