Scaling Hike Messenger to 15M Users

Page 1: Scaling Hike Messenger to 15M Users

Rajat Bansal Chief Technology Officer

Scaling to Millions, FAST!

Page 2: Scaling Hike Messenger to 15M Users

Agenda

- The Hike Journey
- Why MongoDB?
- MongoDB Use Cases: Overview and Deep Dive
- Lessons Learned & Next Steps

Page 3: Scaling Hike Messenger to 15M Users

The fastest-growing Made in India IM app!

Pages 4-12: Scaling Hike Messenger to 15M Users

[No text was recovered from these slides.]

Page 13: Scaling Hike Messenger to 15M Users

AND TODAY

Page 14: Scaling Hike Messenger to 15M Users

[No text was recovered from this slide.]

Page 15: Scaling Hike Messenger to 15M Users

Why MongoDB?

Storage decisions governed by CAP Theorem

Page 16: Scaling Hike Messenger to 15M Users

Why MongoDB?

CAP theorem: Consistency, Availability, Partition Tolerance

- C: Cost
- A: Availability of Talent
- P: Production Support

Page 17: Scaling Hike Messenger to 15M Users

Cost

- Start small, grow big, FAST, REALLY FAST!
- Started with 1 replica set on EC2
- Grew to multiple replica sets and sharded clusters

Page 18: Scaling Hike Messenger to 15M Users

Availability

- Small team of 3 people writing the entire server
- Easy ramp-up and management
- Low setup and administration requirements

Page 19: Scaling Hike Messenger to 15M Users

Production Support

- Cost of downtime is very high
- Small team needed support during downtime
- Decision to take Production Support proved life-saving in outages

Page 20: Scaling Hike Messenger to 15M Users

HIKE’s MongoDB Use-cases

- User Profile Store: 3000 reads / 500 writes per sec
- Temporary Message Store: 1000 reads / 3900 writes per sec
- Other miscellaneous usage: GridFS etc.

Page 21: Scaling Hike Messenger to 15M Users

HIKE MongoDB Architecture

Offline Message Store
- App-level sharding
- 4 Mongo instances with 32 DBs each
- Horizontally scalable up to 128 instances when needed
- Tested up to 30K ops in a simulated environment
- Protected by a “Redis” layer to reduce queries
- Latencies < 1ms
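
The slides say only that a “Redis” layer protects the offline message store and reduces queries. One plausible reading is a cache-aside read path, sketched below under that assumption. The class name, the dict standing in for Redis, and the callable standing in for a MongoDB query are all hypothetical and are not taken from the talk.

```python
from typing import Callable, Dict, Optional


class CachedStore:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, fetch_from_db: Callable[[str], Optional[str]]):
        self._fetch_from_db = fetch_from_db          # stand-in for a MongoDB query
        self._cache: Dict[str, Optional[str]] = {}   # stand-in for Redis
        self.db_queries = 0                          # how often the database was hit

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]                  # served without touching the database
        self.db_queries += 1
        value = self._fetch_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry, e.g. after the underlying document changes."""
        self._cache.pop(key, None)
```

Repeated reads of the same key reach the database once; every later read is answered from the cache until the entry is invalidated.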

Page 22: Scaling Hike Messenger to 15M Users

App Layer

Shard Manager

- Mongod-1: Primary, Secondary, Secondary
- Mongod-2: Primary, Secondary, Secondary
- Mongod-3: Primary, Secondary, Secondary
- Mongod-4: Primary, Secondary, Secondary

32 dbs each
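
A minimal sketch of how an app-level shard manager could route requests across this layout, assuming a fixed pool of 128 logical databases (4 instances x 32 dbs each). The hash function, the database naming scheme, and the function names are assumptions for illustration; the slides do not describe the actual routing logic.

```python
import hashlib
from typing import Tuple

TOTAL_DBS = 128  # 4 instances x 32 dbs each, per the slides


def logical_db(user_id: str) -> int:
    """Map a user id to one of the 128 logical databases with a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_DBS


def locate(user_id: str, num_instances: int = 4) -> Tuple[int, str]:
    """Return (instance index, database name) for a user id.

    num_instances must divide TOTAL_DBS evenly (4, 8, ..., 128), so growing
    the cluster moves whole databases and never re-hashes users.
    """
    if num_instances <= 0 or TOTAL_DBS % num_instances != 0:
        raise ValueError("num_instances must divide 128 evenly")
    dbs_per_instance = TOTAL_DBS // num_instances
    db = logical_db(user_id)
    # "offline_msgs_NNN" is a hypothetical naming scheme.
    return db // dbs_per_instance, f"offline_msgs_{db:03d}"
```

Under this assumption a user always maps to the same logical database. Only the database-to-instance assignment changes as instances are added, which is consistent with the stated ceiling of 128 instances (one database per instance).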

Page 23: Scaling Hike Messenger to 15M Users

HIKE MongoDB Architecture

User Profile Store
- Replica set (1 primary, 2 secondaries)
- Writes to primary
- Reads from secondary
- Latencies < 1ms
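
The read/write split for the User Profile Store can be expressed in a MongoDB connection string through the `replicaSet` and `readPreference` options. The sketch below only builds such a string; the host names and the replica set name are placeholders, and the talk does not show how Hike actually configured its clients.

```python
from urllib.parse import urlencode

# Read preference modes defined by MongoDB.
READ_PREFERENCES = {
    "primary", "primaryPreferred", "secondary", "secondaryPreferred", "nearest",
}


def replica_set_uri(hosts, replica_set, read_preference="secondary"):
    """Build a MongoDB connection string for a replica set."""
    if read_preference not in READ_PREFERENCES:
        raise ValueError(f"unknown read preference: {read_preference}")
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,
    })
    return f"mongodb://{','.join(hosts)}/?{options}"


# Placeholder hosts: one primary and two secondaries, as on the slide.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="profiles",
)
```

Writes go to the primary regardless of the read preference. Reads served by a secondary can lag slightly behind the primary, which is the trade-off accepted in exchange for offloading read traffic.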

Page 24: Scaling Hike Messenger to 15M Users

mongoDB (Happy State < 1ms)

[Latency chart; labeled values: 0.65ms and 0.80ms]

Page 25: Scaling Hike Messenger to 15M Users

mongoDB (1 Year Timeline)

Page 26: Scaling Hike Messenger to 15M Users

mongoDB (Outage 1)

Outage 1

Page 27: Scaling Hike Messenger to 15M Users

HIKE Learnings

Outage 1
- Latencies went through the roof: “1ms -> 1000ms”
- What went wrong: lots of operations on “Arrays”
- “Production Support” to the rescue!

“Adding and modifying array entries can require a scan of much or all of each array being updated, resulting in slow operations”

Page 28: Scaling Hike Messenger to 15M Users

mongoDB (Outage 2)

Outage 2

Page 29: Scaling Hike Messenger to 15M Users

HIKE Learnings

Outage 2
- Latencies increased 20-50X
- What went wrong:
  - Disk I/O was the bottleneck
  - “ReadAhead” was high

“Read/Write Throughput Exceeds I/O”
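
A proactive check for the “ReadAhead” problem could look like the sketch below. On Linux the setting is exposed per block device under `/sys/block/<device>/queue/read_ahead_kb`. The 16 KiB limit and the device name in the usage note are assumptions chosen for illustration, not values given in the talk.

```python
from pathlib import Path


def read_ahead_kb(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the kernel readahead setting for a block device, in KiB."""
    path = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    return int(path.read_text().strip())


def readahead_too_high(device: str, limit_kb: int = 16,
                       sysfs_root: str = "/sys/block") -> bool:
    """True if the device's readahead exceeds limit_kb (an assumed threshold)."""
    return read_ahead_kb(device, sysfs_root) > limit_kb
```

For example, `readahead_too_high("xvdf")` would flag a hypothetical EBS volume whose readahead is above the chosen limit. With a high readahead, each small random read pulls many extra kilobytes from disk, which is one way read throughput can exceed what the volume delivers.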

Page 30: Scaling Hike Messenger to 15M Users

mongoDB (Outage 3)

Outage 3

Page 31: Scaling Hike Messenger to 15M Users

HIKE Learnings

Outage 3
- MongoDB crashed
- Ad hoc script doing a full table scan
- Need to protect your systems: “noTableScan”

“Protect your production systems. Use the mechanisms available”
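
The “noTableScan” safeguard corresponds to MongoDB's `notablescan` server parameter, which can be set at runtime with the `setParameter` admin command. The sketch below only builds that command document. The helper name is hypothetical, and sending the command through a driver such as pymongo is an assumption that the slides do not show.

```python
def notablescan_command(enabled: bool = True) -> dict:
    """Build the admin command document that toggles the notablescan parameter.

    The command name must be the first key; Python dicts preserve insertion
    order, so "setParameter" stays first.
    """
    return {"setParameter": 1, "notablescan": bool(enabled)}


# With a driver such as pymongo (assumed here), the document would be sent
# to the admin database, for example:
#   client.admin.command(notablescan_command(True))
```

With the parameter enabled, a query that cannot use an index is rejected with an error instead of scanning the whole collection, so an ad hoc script fails fast rather than taking the database down.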

Page 32: Scaling Hike Messenger to 15M Users

HIKE Learnings

- Proactive health checks
- Production Support helps
- Put mechanisms in place to safeguard production

Page 33: Scaling Hike Messenger to 15M Users

http://hike.in @hikeapp
