View
4
Download
0
Category
Preview:
Citation preview
2017
INSTAGRAM HISTORY
2010
2012/4/9joined
2014/1
600M users/month
INSTAGRAM EVERYDAY
400 Million Users
4+ Billion likes
100 Million photo/video uploads
Top account: 110 Million followers
SCALING MEANS
Scale out
Scale up
Scale dev team
SCALE OUT
SCALE OUT
SCALE OUT
“Let’s all pray that Amazon gets everything sorted out in short order.”
INSTAGRAM STACK
Tuesday, June 25th, 2013
memcache
RabbitMQ
PostgreSQL
Cassandra
Celery
OtherServicesDjango
STORAGE VS. COMPUTING
• Storage: needs to be consistent across data centers• Computing: driven by user traffic, as needed basis
SCALE OUT: STORAGE
Tuesday, June 25th, 2013
user, media, friendship etc
SCALE OUT: STORAGE
Tuesday, June 25th, 2013
user, media, friendship etc
Master
Replica
ReplicaDjango
Write
Read
SCALE OUT: STORAGE
Tuesday, June 25th, 2013
user, media, friendship etc
Master
Replica
ReplicaDjango
Write
ReadDC1
DC2
DC3
SCALE OUT: STORAGE
Tuesday, June 25th, 2013
user feeds, activities etc
Replica
ReplicaReplica
Write - 2Read - 1
SCALE OUT: STORAGE
Tuesday, June 25th, 2013
user feeds, activities etc
Replica
ReplicaReplica
Write - 2Read - 1
COMPUTING
Tuesday, June 25th, 2013
Tuesday, June 25th, 2013
Django
RabbitMQ PostgreSQL
CassandraCelery
Django
RabbitMQPostgreSQL
CassandraCelery
memcacheDC1 DC2memcache
MEMCACHE
Tuesday, June 25th, 2013
• High performance key-value store in memory• Millions of reads/writes per second• Sensitive to network condition• Cross region operation is prohibitive
No global consistency
feed
get
Django
User R
DC1
Django
PostgreSQL memcache
User Ccomment
setinsert
Django
memcache PostgreSQL
User Ccomment
insertset
DC1
Django
memcachePostgreSQL
User R
feed
get
DC2
replication
Django
memcache PostgreSQL
User Ccomment
insertset
DC1
Django
memcachePostgreSQL
User R
feed
DC2
replication
Cache invalidate
Cache invalidate
get
COUNTERS
select count(*) from user_likes_media
where media_id=12345;
100s ms
COUNTER
Tuesday, June 25th, 2013
select count from media_likes where media_id=12345;
10s us
Cache invalidatedAll djangos try to access DB
MEMCACHE LEASE
d1 d2 memcache dbtime
lease-get
filllease-get
wait or use stale
read from DB
lease-set
lease-get
hit
INSTAGRAM STACK - MULTI REGION
Tuesday, June 25th, 2013
Django
RabbitMQ
PostgreSQL
Cassandra
Celery
memcache
Django
RabbitMQ
PostgreSQL
Cassandra
Celery
memcache
DC1 DC2
SCALING OUT
Tuesday, June 25th, 2013
• Capacity• Reliability• Regional failure ready
SCALING OUT - CHALLENGES, OPPORTUNITIES
Tuesday, June 25th, 2013
• Beyond North America• More localized social network• Direct messaging• Live streaming
20
40
60
80
100
0 2 4 6 8 10 12 14 16 18 20 22 24
User growth Server growth
“Don’t count the servers, make the servers count”
SCALE UP
SCALE UP
Use as few CPU instructions as possible
Use as few servers as possible
SCALE UP
Use as few CPU instructions as possibleUse as few servers as possible
CPU
Monitor
Optimize
Analyze
COLLECT
struct perf_event_attr pe;
pe.type = PERF_TYPE_HARDWARE;
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
fd = perf_event_open(&pe, 0, -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); <code you want to measure> ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); read(fd, &count, sizeof(long long));
DYNOSTATS
20
40
60
80
100
0 2 4 6 8 10 12 14 16 18 20 22 24
Follow
Feed
Explore
REGRESSION
20
40
60
80
100
0 2 4 6 8 10 12 14 16 18 20 22 24
With new feature
Without new feature
CPU
Monitor
Optimize
Analyze
PYTHON CPROFILE
import cProfile, pstats, StringIO pr = cProfile.Profile()
pr.enable() # ... do something ... pr.disable() s = StringIO.StringIO() sortby = 'cumulative' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() print s.getvalue()
CPU - ANALYZEcontinuous profiling
generate_profile explore --start <start-time> --duration <minutes>
CPU - ANALYZEcontinuous profiling
20
40
60
80
100
0 2 4 6 8 10 12 14 16 18 20 22 24
Caller
Callee
Callee
CPU
Monitor
Optimize
Analyze
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s150x150/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s400x600/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s200x200/12345678_1234567890_987654321_a.jpg
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
CPU - OPTIMIZE
igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpl1/t51.2885-19/s300x300/12345678_1234567890_987654321_a.jpg
150x150
400x600
200x200
CPU - OPTIMIZE
C is really faster
• Candidate functions:• Used extensively• Stable
• Cython or C/C++
Use as few CPU instructions as possible
Use as few servers as possible
SCALE UP
ONE WEB SERVER
Process 1
SharedMemory
PrivateMemory
Process N
SCALE UP: MEMORY
• Run in optimized mode (-O)• Remove dead code
Reduce code
SCALE UP: MEMORY
• Move configuration into shared memory• Disable garbage collection
Share more
SCALE UP: MEMORY
20+% capacity increase
SCALE UP: NETWORK LATENCY
Synchronous processing model with long latency
===> Worker starvation and fewer CPU instr executed
Stories
FeedDjango
Feed
Stories
SuggestedUsers
ASYNC IO
Use as few CPU instructions as possible
Use as few servers as possible
Scale up
SCALE UP: CHALLENGES, OPPORTUNITIES
• Faster python run-time• Async web framework• Better memory analysis• etc etc
SCALE DEV TEAM
SCALING TEAM
30% engineers joined in last 6 months
Bootcampers - 1 week
Hack-A-Month - 4 weeks
Intern - 12 weeks
Comment Filtering
Self-harm Prevention
Windows App
Multiple media in one post
Video View Notification
Saved Posts
First Story Notification
Instagram Live
Instagram Stories
Which server?
NewTable or New Column?
What Index?Should
I cache it?
Will I lock up DB?
Will I bring down Instagram?
WHAT WE WANT
• Automatically handle cache• Define relations, not worry about implementations• Self service by product engineers• Infra focuses on scale
TAO
USER1
USER2
USER3mediaposted
posted bylikes
liked by
likes
liked by
Comment Filtering
Self-harm Prevention
Windows App
Multiple media in one post
Video View Notification
Saved Posts
First Story Notification
Instagram Live
Instagram Stories
SOURCE CONTROL
Master
Live
Direct
SOURCE CONTROL
• Context switching• Code sync/merge overhead• Surprises• Refactor/major upgrade• Performance tracking harder
With branches
SOURCE CONTROL
Master
Live
Direct
SOURCE CONTROL
Master Live Direct
SOURCE CONTROL
• Continous integration• Collaborate easily• Fast bisect and revert• Continuous performance monitoring
No branches
FEATURE LAUNCH
Engineers
Employees
Dogfooder
Some demographics
World
FEATURE LOAD TEST
Once a
40-60 rollouts per day
daydiffweek?!!
CHECKS AND BALANCES
Code reviewunittest
Code acceptedcommitted Canary To the Wild
SCALING MEANS
Scale out
Scale up
Scale dev team
TAKEAWAYS
Scaling is everybody’s responsibility
Scaling is continuous effort
Scaling is multi-dimensional
QUESTIONS?
Recommended