65
Pinterest Engineering Scaling Pinterest Yash Nelapati Ascii Artist Saturday, August 31, 13

Scaling Pinterest - QConSP · Pinterest Engineering April 2012 Growth Scaling Pinterest Mar 2010 · Amazon EC2 + S3 + Edge Cast · 135 Web Engines + 75 API Engines · 10 Service Instances

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Pinterest Engineering

Scaling Pinterest

Yash NelapatiAscii Artist

Saturday, August 31, 13

Pinterest Engineering

An online pinboard to organize and share what inspires you.

Pinterest is...

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Saturday, August 31, 13

Pinterest Engineering

March 2010Growth Scaling Pinterest

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

March 2010Growth Scaling Pinterest

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

March 2010Growth Scaling Pinterest

· RackSpace

· 1 small Web Engine

· 1 small MySQL DB

· 1 Engineer + 2 Founders

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

March 2010Growth

Saturday, August 31, 13

Pinterest Engineering

March 2010Growth

Saturday, August 31, 13

Pinterest Engineering

January 2011GrowthScaling Pinterest

Mar 2010 Jan 2011 Jan 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

January 2011GrowthScaling Pinterest

Mar 2010 Jan 2011 Jan 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

January 2011GrowthScaling Pinterest

· Amazon EC2 + S3 + CloudFront

· 1 NGinX, 4 Web Engines

· 1 MySQL DB + 1 Read Slave

· 1 Task Queue + 2 Task Processors

· 1 MongoDB

· 2 Engineers + 2 Founders

Mar 2010 Jan 2011 Jan 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

Saturday, August 31, 13

Pinterest Engineering

September 2011GrowthScaling Pinterest

Mar 2010 Jan 2011 Jan 2012 May 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

September 2011GrowthScaling Pinterest

Mar 2010 Jan 2011 Jan 2012 May 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

September 2011GrowthScaling Pinterest

· Amazon EC2 + S3 + CloudFront· 2 NGinX, 16 Web Engines + 2 API

Engines· 5 Functionally Sharded MySQL DB +

9 read slaves· 4 Cassandra Nodes· 15 Membase Nodes (3 separate

clusters)· 8 Memcache Nodes· 10 Redis Nodes· 3 Task Routers + 4 Task Processors· 4 Elastic Search Nodes· 3 Mongo Clusters· 3 Engineers (8 Total)

Mar 2010 Jan 2011 Jan 2012 May 2012

Page views per day

Saturday, August 31, 13

Pinterest Engineering

It will fail. Keep it simple.

Saturday, August 31, 13

Pinterest Engineering

April 2012GrowthScaling Pinterest

Mar 2010

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

April 2012GrowthScaling Pinterest

Mar 2010

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

April 2012GrowthScaling Pinterest

Mar 2010

· Amazon EC2 + S3 + Edge Cast

· 135 Web Engines + 75 API Engines

· 10 Service Instances

· 80 MySQL DBs (m1.xlarge) + 1 slave

each

· 110 Redis Instances

· 60 Memcache Instances

· 2 Redis Task Manager + 60 Task

Processors

· 3rd party sharded Solr

· 15 Engineers (25 Total)

Page views per day

Mar 2010 Jan 2011 Jan 2012 May 2012

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

January 2012Growth

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Saturday, August 31, 13

Pinterest Engineering

August 2013GrowthScaling Pinterest

Page views per day

April 2012 August 2013

Saturday, August 31, 13

Pinterest Engineering

August 2013GrowthScaling Pinterest

Page views per day

April 2012 August 2013

Saturday, August 31, 13

Pinterest Engineering

August 2013GrowthScaling Pinterest

· Amazon EC2 + S3 + Edge Cast· 400+ Web Engines + 400+ API

Engines· 70+ MySQL DBs (hi.4xlarge on SSDs)

+ 1 slave each· 100+ Redis Instances· 230+ Memcache Instances· 10 Redis Task Manager + 500 Task

Processors· 70+ Engineers (130+ Total)

Page views per day

April 2012 August 2013

Saturday, August 31, 13

Pinterest Engineering

August 2013GrowthScaling Pinterest

· Amazon EC2 + S3 + Edge Cast· 400+ Web Engines + 400+ API

Engines· 70+ MySQL DBs (hi.4xlarge on SSDs)

+ 1 slave each· 100+ Redis Instances· 230+ Memcache Instances· 10 Redis Task Manager + 500 Task

Processors· 70+ Engineers (130+ Total)

Page views per day

April 2012 August 2013· 6 services (80 instances)· Sharded Solr· 20 HBase· 12 Kafka + Azkabhan· 8 Zookeeper Instances· 12 Varnish

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

ELB

Routing & Filtering(Varnish)

All connection pairings managed by ZooKeeper

Puppet

StatD

Monit

GangliaAPI App(Python)

Web App(Python)

Task Processing(Python/Pyres)

MySQL Service(Java/Finagle)

Memcache Mux(Nutcracker)

Feed Service(Python/Thrift)

Follower Service(Python/Thrift)

Images(S3 + CDN) MySQL Memcache Redis HBase

Task Queue(Redis)

Follower Service(Python/Thrift)

Follower Service(Python/Thrift)

Saturday, August 31, 13

Pinterest Engineering

Tripwire

EMR

S3

Scaling Pinterest

Web App(Python)

Task Processing(Python/Pyres)

Kafka

S3 Copier

Hive Azkaban

API(Python)

Saturday, August 31, 13

Pinterest Engineering

Saturday, August 31, 13

Pinterest Engineering

Technologies

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

ChoosingYourTech

Questions to ask • Does it meet your needs?

• How mature is the product?

• Is it commonly used? Can you hire people who have used it?

• Is the community active?

• How robust is it to failure?

• How well does it scale? Will you be the biggest user?

• Does it have a good debugging tools? Profiler? Backup software?

• Is the cost justified?

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Hosting Why Amazon Web Services?• Variety of servers running Linux

• Very good peripherals, such as load balancing, DNS, map reduce, basic firewalls, and more

• Good reliability (don’t throw tomatoes at me!)

• Very active dev community

• Not cheap, but...

• New instances ready in seconds

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Code Why Python?• Extremely mature

• Well known and well liked

• Solid active community

• Very good libraries specifically targeted to web development

• Effective rapid prototyping

• Free

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Production Data

Why MySQL and Memcache? • Extremely mature

• Well known and well liked

• Rarely catastrophic loss of data

• Response time to request rate increases linearly

• Very good software support: XtraBackup, Innotop, Maatkit

• Solid active community

• Free

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Why Redis?• Well known and well liked

• Active community

• Consistently good performance

• Variety of convenient and efficient data structures

• 3 Flavors of Persistence: Now, Snapshot, Never

• Free

Production Data

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Why HBASE? (Why not MySQL)• Efficient Storage

• Handle large write throughput

• Solid Hadoop interface

• Maturing quickly, used by facebook

• Built on HDFS

• Free

Production Data

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

What happened to Cassandra, Mongo, ES, and Membase?Production

Data • Does it meet your needs?

• How mature is the product?

• Is it commonly used? Can you hire people who have used it?

• Is the community active?

• How robust is it to failure?

• How well does it scale? Will you be the biggest user?

• Does it have a good debugging tools? Profiler? Backup software?

• Is the cost justified?

Saturday, August 31, 13

Pinterest Engineering

If you’re the biggest user of a technology, the challenges will

be greatly amplified

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

What’s happening now?

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Challenge: One Codebase + Lots of Engineers = Deploy Hell

• Major bugs and performance issues stall deploys

• Performance issues creep in under radar

• 7+ development teams, 1 ops team

• Workload changing more rapidly and less predictably

• Want developers to not fear moving fast

EmployeeGrowth

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Solution: Deploy Checkpoints• Aggressive unit tests (careful! don’t erase your DB!)

• Rings of deployment

• Canary, employees only, 5% of user base, etc.

• Continuous deployment

• Production integration tests

EmployeeGrowth

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Challenge: Increase Availability, Decrease Latency

• Push for better uptime and lower latency

• Initially, most uptime and latency issues due to DB + caching

• Fewer Instances => Few, but big failures

• More Instances => More smaller failures + more complexity

• How aggressively can you retry without hurting the system?

Uptime &Latency

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Solution: Metrics Dashboard and Alerts

• Create dashboard + alerts, and review response times weekly

• When? Soon after launch at latest

• Profile everything

• MySQL - Maatkit, InnoTop

• Memcache - Maatkit

• Frontend - New Relic

• General Ops - StatsD, Nagios / Monit, Ganglia

Uptime &Latency

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Solution: Configuration Manager and Failover

• Provides load balancing and automatic connection reconfiguration

• When? 30+ caches / DBs

• One option: Intermediate load balancers

• Example: HAProxy, Nginx, Varnish

• Extra latency hop

• More complication

• Configuration hassle (1 LB / 7 services?)

Uptime &Latency

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Co-ordination

Solution: Zookeeper• Centralized configuration management

• Used for service discovery

• Notifies of service failures

• WATCH and its callback are pretty reliable

• Experiment framework

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Co-ordination

Zookeeper

Services

app

Register

Solution: Zookeeper• Centralized configuration management

• Used for service discovery

• Notifies of service failures

• WATCH and its callback are pretty reliable

• Experiment framework

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Co-ordination

Zookeeper

Services

app

Register

WATCH

Solution: Zookeeper• Centralized configuration management

• Used for service discovery

• Notifies of service failures

• WATCH and its callback are pretty reliable

• Experiment framework

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

MySQL Failover

Part 1: Configuration Manager and Failover

A B

App

{“master” : “A”}

readonly=True

Zookeeper

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

MySQL Failover

Part 2: Configuration Manager and Failover

A B

App

{“master” : “B”}

readonly=True

Zookeeper

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

MySQL Failover

Part 2: Configuration Manager and Failover

A B

App

{“master” : “B”}

Zookeeper

readonly=False

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Challenge: Number of Connections Rising

• Initially, entire app tier connected to all Memcache, Redis, MySQL

• On Memcache...

• 20k connections * 10kB / connection = 195MB / Memcache

• 40 Memcaches means 7.6 GB used on connections

• Connection space is not allocated from slab memory!

• Can eventually cause Memcache process to leak into swap

• On MySQL

• At least 256 kB / connection

Connections

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Solution: Connection Pooling and Multiplexing

• Data Services, Nutcracker

• When? Once any service gets close to 10k connections

• Success: Memcache

• Once was >20k connections

• Now 1.3k connections

• But, aggressive fan-out causes...

• Network contention

• Incast congestion

Connections

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Memcache Failures

App Nutcracker

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Memcache Failures

App Nutcracker

Ketama Ring Adjusted

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Finagle• RPC for high concurrency

• Twitter

• Completely asynchronous

• Previous experience with Finagle

• Lots of compatible libraries

• JVM

• Lots of bells and whistles - Ostrich, Zipkin, lago

Why Java Over Python?

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

How did you shard?

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

How we Sharded

db00001db00002 ....db00512

db00513db00514 ....db01024

db03073db03074 ....db03583

db03584db03585 ....db04096

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

High Availability

db00001db00002 ....db00512

db00513db00514 ....db01024

db03073db03074 ....db03583

db03584db03585 ....db04096

Master Master Replication

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

IncreasedLoad on DB?

db00001db00002 ....db00512

db00001db00002 ....db00256

db00256db00257 ....db00512

Split your Shards

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

ID Structure 64 bits

Shard ID Local IDType

· A lookup data structure has physical server to

shard ID range (cached by each app server process)

· Shard ID denotes which shard

· Type denotes object type (e.g., pins)

· Local ID denotes position in table

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Objects &Mappings

Object tables (e.g., pin, board, user, comment)

· Local ID MySQL blob (JSON / Serialized thrift)

Mapping tables (e.g., user has boards, pin has likes)

· Full ID Full ID (+ timestamp)

· Naming schema is noun_verb_nounQueries are PK or index lookups (no joins)

· Data DOES NOT MOVEAll tables exist on all shards

No schema changes required (index = new table)

Saturday, August 31, 13

Pinterest Engineering

Scaling Pinterest

Looking Forward• Continually improve Pinner experience

• Better uptime and lower latency

• Help Pinners discover more of the things they love

• Reduce spam and abuse

• Continually collaborate and build bigger, better, faster products

• 140 Pinployees and beyond

• MySQL 5.6

What’s next?

Saturday, August 31, 13