SCALING INSTAGRAM INFRA › system › files › presentation-slides › qcon20… · RabbitMQ...

Preview:

Citation preview

SCALING INSTAGRAM INFRALisa Guo— March 7th, 2017 lguo@instagram.com

2017

INSTAGRAM HISTORY

2010

2012/4/9joined

Facebook

2014/1

600M users/month

INSTAGRAM EVERYDAY

400 Million Users

4+ Billion likes

100 Million photo/video uploads

Top account: 110 Million followers

SCALING MEANS

Scale out

Scale up

Scale dev team

SCALE OUT

SCALE OUT

SCALE OUT

“Let’s all pray that Amazon gets everything sorted out in short order.”

INSTAGRAM STACK

Tuesday, June 25th, 2013

memcache

RabbitMQ

PostgreSQL

Cassandra

Celery

OtherServicesDjango

STORAGE VS. COMPUTING

• Storage: needs to be consistent across data centers• Computing: driven by user traffic, as needed basis

SCALE OUT: STORAGE

Tuesday, June 25th, 2013

user, media, friendship etc

SCALE OUT: STORAGE

Tuesday, June 25th, 2013

user, media, friendship etc

Master

Replica

ReplicaDjango

Write

Read

SCALE OUT: STORAGE

Tuesday, June 25th, 2013

user, media, friendship etc

Master

Replica

ReplicaDjango

Write

ReadDC1

DC2

DC3

SCALE OUT: STORAGE

Tuesday, June 25th, 2013

user feeds, activities etc

Replica

ReplicaReplica

Write - 2Read - 1

SCALE OUT: STORAGE

Tuesday, June 25th, 2013

user feeds, activities etc

Replica

ReplicaReplica

Write - 2Read - 1

COMPUTING

Tuesday, June 25th, 2013

Tuesday, June 25th, 2013

Django

RabbitMQ PostgreSQL

CassandraCelery

Django

RabbitMQPostgreSQL

CassandraCelery

memcacheDC1 DC2memcache

MEMCACHE

Tuesday, June 25th, 2013

• High performance key-value store in memory• Millions of reads/writes per second• Sensitive to network condition• Cross region operation is prohibitive

No global consistency

feed

get

Django

User R

DC1

Django

PostgreSQL memcache

User Ccomment

setinsert

Django

memcache PostgreSQL

User Ccomment

insertset

DC1

Django

memcachePostgreSQL

User R

feed

get

DC2

replication

Django

memcache PostgreSQL

User Ccomment

insertset

DC1

Django

memcachePostgreSQL

User R

feed

DC2

replication

Cache invalidate

Cache invalidate

get

COUNTERS

select count(*) from user_likes_media

where media_id=12345;

100s ms

COUNTER

Tuesday, June 25th, 2013

select count from media_likes where media_id=12345;

10s us

Cache invalidatedAll djangos try to access DB

MEMCACHE LEASE

d1 d2 memcache dbtime

lease-get

filllease-get

wait or use stale

read from DB

lease-set

lease-get

hit

INSTAGRAM STACK - MULTI REGION

Tuesday, June 25th, 2013

Django

RabbitMQ

PostgreSQL

Cassandra

Celery

memcache

Django

RabbitMQ

PostgreSQL

Cassandra

Celery

memcache

DC1 DC2

SCALING OUT

Tuesday, June 25th, 2013

• Capacity• Reliability• Regional failure ready

SCALING OUT - CHALLENGES, OPPORTUNITIES

Tuesday, June 25th, 2013

• Beyond North America• More localized social network• Direct messaging• Live streaming

20

40

60

80

100

0 2 4 6 8 10 12 14 16 18 20 22 24

User growth Server growth

“Don’t count the servers, make the servers count”

SCALE UP

SCALE UP

Use as few CPU instructions as possible

Use as few servers as possible

SCALE UP

Use as few CPU instructions as possibleUse as few servers as possible

CPU

Monitor

Optimize

Analyze

COLLECT

struct perf_event_attr pe;

pe.type = PERF_TYPE_HARDWARE;

pe.config = PERF_COUNT_HW_INSTRUCTIONS;

fd = perf_event_open(&pe, 0, -1, -1, 0);

ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); <code you want to measure> ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); read(fd, &count, sizeof(long long));

DYNOSTATS

20

40

60

80

100

0 2 4 6 8 10 12 14 16 18 20 22 24

Follow

Feed

Explore

REGRESSION

20

40

60

80

100

0 2 4 6 8 10 12 14 16 18 20 22 24

With new feature

Without new feature

CPU

Monitor

Optimize

Analyze

PYTHON CPROFILE

import cProfile, pstats, StringIO pr = cProfile.Profile()

pr.enable() # ... do something ... pr.disable() s = StringIO.StringIO() sortby = 'cumulative' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() print s.getvalue()

CPU - ANALYZEcontinuous profiling

generate_profile explore --start <start-time> --duration <minutes>

CPU - ANALYZEcontinuous profiling

20

40

60

80

100

0 2 4 6 8 10 12 14 16 18 20 22 24

Caller

Callee

Callee

CPU

Monitor

Optimize

Analyze

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s150x150/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s400x600/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s200x200/12345678_1234567890_987654321_a.jpg

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

CPU - OPTIMIZE

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

150x150

400x600

200x200

CPU - OPTIMIZE

C is really faster

• Candidate functions:• Used extensively• Stable

• Cython or C/C++

Use as few CPU instructions as possible

Use as few servers as possible

SCALE UP

ONE WEB SERVER

Process 1

SharedMemory

PrivateMemory

Process N

SCALE UP: MEMORY

• Run in optimized mode (-O)• Remove dead code

Reduce code

SCALE UP: MEMORY

• Move configuration into shared memory• Disable garbage collection

Share more

SCALE UP: MEMORY

20+% capacity increase

SCALE UP: NETWORK LATENCY

Synchronous processing model with long latency

===> Worker starvation and fewer CPU instr executed

Stories

FeedDjango

Feed

Stories

SuggestedUsers

ASYNC IO

Use as few CPU instructions as possible

Use as few servers as possible

Scale up

SCALE UP: CHALLENGES, OPPORTUNITIES

• Faster python run-time• Async web framework• Better memory analysis• etc etc

SCALE DEV TEAM

SCALING TEAM

30% engineers joined in last 6 months

Bootcampers - 1 week

Hack-A-Month - 4 weeks

Intern - 12 weeks

Comment Filtering

Self-harm Prevention

Windows App

Multiple media in one post

Video View Notification

Saved Posts

First Story Notification

Instagram Live

Instagram Stories

Which server?

NewTable or New Column?

What Index?Should

I cache it?

Will I lock up DB?

Will I bring down Instagram?

WHAT WE WANT

• Automatically handle cache• Define relations, not worry about implementations• Self service by product engineers• Infra focuses on scale

TAO

USER1

USER2

USER3mediaposted

posted bylikes

liked by

likes

liked by

Comment Filtering

Self-harm Prevention

Windows App

Multiple media in one post

Video View Notification

Saved Posts

First Story Notification

Instagram Live

Instagram Stories

SOURCE CONTROL

Master

Live

Direct

SOURCE CONTROL

• Context switching• Code sync/merge overhead• Surprises• Refactor/major upgrade• Performance tracking harder

With branches

SOURCE CONTROL

Master

Live

Direct

SOURCE CONTROL

Master Live Direct

SOURCE CONTROL

• Continous integration• Collaborate easily• Fast bisect and revert• Continuous performance monitoring

No branches

FEATURE LAUNCH

Engineers

Employees

Dogfooder

Some demographics

World

FEATURE LOAD TEST

Once a

40-60 rollouts per day

daydiffweek?!!

CHECKS AND BALANCES

Code reviewunittest

Code acceptedcommitted Canary To the Wild

SCALING MEANS

Scale out

Scale up

Scale dev team

TAKEAWAYS

Scaling is everybody’s responsibility

Scaling is continuous effort

Scaling is multi-dimensional

QUESTIONS?

Recommended