SCALING INSTAGRAM INFRA - QCon San Francisco … INSTAGRAM INFRA Lisa Guo— Nov 7th, 2016...

SCALING INSTAGRAM INFRALisa Guo— Nov 7th, 2016 lguo@instagram.com

INSTAGRAM HISTORY

201114M users

2012/4/3Android release

2012/4/9Facebook acquisition

2014/1

INSTAGRAM EVERYDAY

300 Million Users

4.2 Billion likes

95 Million photo/video uploads

100 Million followers

SCALING MEANS

Scale out

Scale up

Scale dev team

SCALE OUT

“To scale horizontally means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might involve scaling out from one Web server system to

three.”

- Wikipedia

MICROSERVICE

SCALING OUT

—> vertical partition horizontal sharding

SCALING OUT

INSTAGRAM STACK

Tuesday, June 25th, 2013

memcache

RabbitMQ

PostgreSQL

Cassandra

Celery

OtherServicesDjango

STORAGE VS. COMPUTING

• Storage: needs to be consistent across data centers• Computing: driven by user traffic, as needed basis

SCALE OUT: STORAGE

-Masterless-Async, low latency-Multiple data center ready-Tunable latency vs consistency trade-off

user feeds, stories, activities, and other logs

SCALE OUT: STORAGE

user, media, friendship etc

• One master, replicas are in each region• Reads are done locally• Writes are cross region to the master.

COMPUTING

Django

RabbitMQ PostgreSQL

CassandraCelery

Django

RabbitMQPostgreSQL

CassandraCelery

memcacheDC1 DC2memcache

MEMCACHE

• Millions of reads/writes per second• Sensitive to network condition• Cross region operation is prohibitive

Django

User R

Django

PostgreSQL memcache

User Ccomment

setinsert

Django

memcache PostgreSQL

User Ccomment

insertset

Django

memcachePostgreSQL

User R

replication

Django

memcache PostgreSQL

User Ccomment

insertset

Django

memcachePostgreSQL

User R

replication

Cache invalidate

COUNTERS

select count(*) from user_likes_media

where media_id=12345;

100s ms

COUNTERS

COUNTER

select count from media_likes where media_id=12345;

10s us

Cache invalidatedAll djangos try to access DB

MEMCACHE LEASE

d1 d2 memcache dbtime

lease-get

filllease-get

wait or use stale

read from DB

lease-set

lease-get

INSTAGRAM STACK - MULTI REGION

Django

RabbitMQ

PostgreSQL

Cassandra

Celery

memcache

Django

RabbitMQ

PostgreSQL

Cassandra

Celery

memcache

DC1 DC2

SCALING OUT

• Capacity• Reliability• Regional failure ready

Requests/second

LOAD TEST

Servers

0 2 4 6 8 10 12 14 16 18 20 22 24CPU instructions

Loaded

Regular

Load Balancer

DjangoServers

0 2 4 6 8 10 12 14 16 18 20 22 24

User growth Server growth

“Don’t count the servers, make the servers count”

SCALE UP

Use as few CPU instructions as possible

Use as few servers as possible

SCALE UP

Scale up

Monitor

Optimize

Analyze

COLLECT

struct perf_event_attr pe;

pe.type = PERF_TYPE_HARDWARE;

pe.config = PERF_COUNT_HW_INSTRUCTIONS;

fd = perf_event_open(&pe, 0, -1, -1, 0);

ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); <code you want to measure> ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); read(fd, &count, sizeof(long long));

DYNOSTATS

0 2 4 6 8 10 12 14 16 18 20 22 24

Follow

Explore

REGRESSION

0 2 4 6 8 10 12 14 16 18 20 22 24

GRADUAL REGRESSION

0 2 4 6 8 10 12 14 16 18 20 22 24

With new feature

Without new feature

Monitor

Optimize

Analyze

PYTHON CPROFILE

import cProfile, pstats, StringIO pr = cProfile.Profile()

pr.enable() # ... do something ... pr.disable() s = StringIO.StringIO() sortby = 'cumulative' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() print s.getvalue()

CPU - ANALYZEcontinuous profiling

generate_profile explore --start <start-time> --duration <minutes>

CPU - ANALYZEcontinuous profiling

0 2 4 6 8 10 12 14 16 18 20 22 24

Caller

Callee

CPU - ANALYZEdecorator

def get_photos(): ……

def feed(): get_photos()

@log_stats def get_follows(): …… def follow(): get_follows()

@log_stats

get_follows get_photos

feed follow

log_stats

get_followsget_photos

feed follow

Keeping Demand in Check

Monitor

Optimize

Analyze

igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg

CPU - OPTIMIZE

150x150

400x600

200x200

CPU - OPTIMIZE

C is really faster

• Candidate functions:• Used extensively• Stable

• Cython or C/C++

CPU - CHALLENGE

cProfile is not free

False positive alerts

Better automation

Scale up

SCALE UP: MEMORY

(memory budget /process) X (# of processes) < system memory

Less memory budget/process

===> Dies sooner

===> More processes

LOAD TEST

Servers

0 2 4 6 8 10 12 14 16 18 20 22 24CPU instructions

Loaded

Regular

Load Balancer

DjangoServers

SCALE UP: MEMORY

Large configuration

SCALE UP: MEMORY

• Run in optimized mode (-O)• Use shared memory• NUMA• Remove dead code

SCALE UP: LATENCY

Synchronous Processing model

===> All user experience impacted

===> Worker starvation

Single service degradation

===> Fewer CPU instr executedLonger latency

Stories

FeedDjango

Stories

SuggestedUsers

ASYNC IO

Scale up

SCALE DEV TEAM

SCALING TEAM

30% engineers joined in last 6 months

Bootcampers - 1 week

Hack-A-Month - 4 weeks

Intern - 12 weeks

Comment Filtering

Self-harm Prevention

Windows App

Story Viewer Ranking

Video View Notification

Save Draft

First Story Notification

Which server?

NewTable or New Column?

What Index?Should

I cache it?

Will I lock up DB?

Will I bring down Instagram?

WHAT WE WANT

• Automatically handle cache• Define relations, not worry about implementations• Self service by product engineers• Infra focuses on scale this service

USER3mediaposted

posted bylikes

liked by

SCALE DEV - END OF POSTGRES

SHIPPING LOVE

60-80 daily diffs

>120 engineers committed code last month

RELEASE

• Master, no branch• All features developed on master

gated by configuration• Continuous integration • No branch integration overhead

• No surprises• Iterate fast, collaborate easily• Fast bisect and revert

Once a week?

40-50 rollouts per day

Once a day?Once a diff!!

CHECKS AND BALANCES

Code reviewunittest

Code acceptedcommitted Canary To the Wild

Dark launchLoad test

TAKEAWAYS

Scaling is a continuous effort

Scaling is multi-dimensional

Scaling is everybody’s responsibility

QUESTIONS?

SCALING INSTAGRAM INFRA - QCon San Francisco … INSTAGRAM INFRA Lisa Guo— Nov 7th, 2016...

Documents

Revista Infra

Padamshree Infra

Shipbuilding Infra

Wave Infra

Infra Apparel

Kridhan Infra Ltd.19May15 - Rakesh Jhunjhunwalarakesh-jhunjhunwala.in/.../Kridhan-Infra-Initiating-Coverage.pdf · Kridhan Infra Ltd. Strong foundation ... Kridhan Infra Ltd: Bagged

Infra Financing

RREEF Infra Australian Infra Mkt Review Q1 2012

Instagram · Users can find other Instagram users to follow by submitting their email contact list to Instagram or logging into Facebook orTwitter through Instagram. Instagram will

H.G. Infra Engineering Limited. Infra Engineering... · 2018. 2. 22. · HG Infra Engineering Ltd (HG Infra) is an infrastructure construction, development and management ... HG infra

Infra UPBResidenceHallSpecifications

TECHNICAL MANUAL INFRA LINE TECHNISCHES HANDBUCH INFRA …

Stroke infra

Instagram Integration Pro extension User Manual · Instagram Integration Pro extension User Manual Instagram Integration Pro extension integrates Magento with Instagram to Instagram

Infra Report

INFRA Platform in Action INFRA Platform in Action

Firewall Infra

Infra Design

SCALING INSTAGRAM INFRA › system › files › presentation-slides › qcon20… · RabbitMQ PostgreSQL Cassandra Celery Other Services Django. STORAGE VS. COMPUTING • Storage:

Infra [ initiatives ]