Replicated RocksDB at Pinterest @scale 2016 San Jose

August 31, 2016

Pinterest Engineering

Bo Liu, Software Engineer, Serving Systems

Replicated RocksDB at Pinterest

Example 1: Online event tracking system

[Diagram labels: Kafka, Writes, Reads]

Writes
  • John saw Pin 1, Pin 2, … Pin K at Time T

Reads
  • Fetch the last 1,000 Pins seen by John
  • Fetch the number of Pins seen by John between Time T1 and T2
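One way to serve both reads from an ordered key-value store such as RocksDB is to key each event by user and timestamp, so that a user's events sit next to each other in time order. The sketch below uses a sorted Python list as a stand-in for the store. The key layout, the `EventStore` class, and its method names are assumptions made for illustration; the slides do not describe the actual schema.

```python
import bisect
import struct


def event_key(user_id: int, ts: int, pin_id: int) -> bytes:
    # Big-endian fixed-width fields, so byte order matches numeric order.
    return struct.pack(">QQQ", user_id, ts, pin_id)


class EventStore:
    """In-memory stand-in for an ordered key-value store (hypothetical)."""

    def __init__(self):
        self._keys = []  # kept sorted, like keys in an ordered store

    def record_seen(self, user_id, ts, pin_ids):
        # "John saw Pin 1, Pin 2, ... Pin K at Time T"
        for pin_id in pin_ids:
            bisect.insort(self._keys, event_key(user_id, ts, pin_id))

    def _range(self, start_key, end_key):
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return lo, hi

    def last_seen(self, user_id, n):
        # All keys of this user, then the newest n, newest first.
        lo, hi = self._range(event_key(user_id, 0, 0),
                             event_key(user_id + 1, 0, 0))
        newest = self._keys[max(lo, hi - n):hi][::-1]
        return [struct.unpack(">QQQ", k)[2] for k in newest]

    def count_between(self, user_id, t1, t2):
        # Number of events with t1 <= ts < t2.
        lo, hi = self._range(event_key(user_id, t1, 0),
                             event_key(user_id, t2, 0))
        return hi - lo
```

With this layout both reads become a single contiguous range scan over one user's keys.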

Example 2: Board based Pin retrieving and ranking system

[Diagram labels: Kafka, Writes, Reads]

Writes
  • John just followed Board 1
  • Pin 1 was just saved to Board 1

Reads
  • Fetch the most relevant Pins followed by John

Example 3: Distributed storage system with data structure support

[Diagram labels: Kafka, Writes, Reads]

Writes
  • Add u to HyperLogLog A
  • Add e to List B

Reads
  • Fetch List B
  • Fetch the unique member # of HyperLogLog A

Common system architecture

[Diagram, built up over several slides]

Components
  • Application API, Application Logic
  • Admin API, Admin Logic
  • RocksDB Replicator
  • RocksDB instances (four shown)
  • ZooKeeper
  • Admin tool

Labeled interactions, in the order the slides introduce them
  • Generate cluster config
  • Load config at startup
  • Create/Open DB
  • Add/Remove DB for replication
  • Data Replication (local updates, remote updates)
  • Cluster management
  • GetDB()
  • Read/Write

RocksDB replicator design
  • Support async Master-Slave replication only
  • Replicate multiple RocksDBs in one process
  • Replication role at RocksDB instance level
  • Work reactively (AddDB(), RemoveDB())
  • Low replication latency
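The design points above (many databases per process, a role per database, and a reactive AddDB()/RemoveDB() interface) can be pictured as a small registry. This is a minimal sketch under assumptions: the `Replicator` class, the `add_db`/`remove_db` signatures, and the `upstream` field are hypothetical and only mirror the names on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


@dataclass
class ReplicatedDB:
    name: str
    role: Role                      # role is tracked per database, not per process
    upstream: Optional[str] = None  # "ip:port" of the master; slaves only


class Replicator:
    """Hypothetical registry of the databases replicated by one process."""

    def __init__(self):
        self._dbs: Dict[str, ReplicatedDB] = {}

    def add_db(self, name: str, role: Role, upstream: Optional[str] = None):
        if name in self._dbs:
            raise ValueError(f"{name} is already registered")
        if role is Role.SLAVE and upstream is None:
            raise ValueError("a slave needs an upstream address")
        self._dbs[name] = ReplicatedDB(name, role, upstream)

    def remove_db(self, name: str):
        # Raises KeyError if the database was never added.
        del self._dbs[name]

    def role_of(self, name: str) -> Role:
        return self._dbs[name].role

    def names(self):
        return sorted(self._dbs)
```

Because the role belongs to each database, one process can act as master for some databases and slave for others at the same time.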

RocksDB replicator implementation
  • RocksDB WAL sequence # as global replication sequence #
  • fbthrift for RPC
  • Pull & Push
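To illustrate why a write-ahead-log sequence number works as a replication position, here is a toy log in which every write gets the next sequence number and a reader can ask for everything after a given number. The `SequencedLog` class and its methods are assumptions for illustration only; they are not the RocksDB API and ignore details such as write batches and log truncation.

```python
class SequencedLog:
    """Toy stand-in for a WAL: entry i carries sequence number i + 1."""

    def __init__(self):
        self._entries = []

    def append(self, key, value):
        # Each write is assigned the next sequence number.
        self._entries.append((key, value))
        return len(self._entries)

    def latest_seq(self):
        return len(self._entries)

    def updates_since(self, seq):
        # Every entry with a sequence number strictly greater than seq.
        if seq < 0:
            raise ValueError("sequence numbers start at 0")
        return [(s, k, v)
                for s, (k, v) in enumerate(self._entries[seq:], start=seq + 1)]
```

A replica that remembers only its own latest sequence number can then resume from exactly where it stopped.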

RocksDB replicator workflow: a slave pulls from its upstream

[Diagram labels: Thrift Server, Worker threads, DB1 Master, DB2 Slave (Upstream: ip_Port)]

Steps, in the order the slides build them
  • Latest SEQ #
  • Get updates since SEQ# for DB2
  • Updates since SEQ# for DB2
  • Apply updates
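The pull steps on these slides (read the slave's latest sequence number, ask the upstream for updates since that number, apply them) can be sketched as a single function call. The `MasterDB` and `SlaveDB` classes are hypothetical, and a direct method call stands in for the Thrift RPC to the upstream address.

```python
class MasterDB:
    """Hypothetical master: keeps writes in order; sequence number = index + 1."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        self.log.append((key, value))

    def get_updates_since(self, seq):
        # Returns [(seq, (key, value)), ...] for entries newer than seq.
        return list(enumerate(self.log[seq:], start=seq + 1))


class SlaveDB:
    """Hypothetical slave that replays the master's writes in order."""

    def __init__(self, upstream):
        self.upstream = upstream  # stands in for "Upstream: ip_Port"
        self.data = {}
        self.latest_seq = 0

    def pull_once(self):
        # "Get updates since SEQ#" using the slave's own latest sequence number.
        updates = self.upstream.get_updates_since(self.latest_seq)
        for seq, (key, value) in updates:
            # "Apply updates", advancing the local position as we go.
            self.data[key] = value
            self.latest_seq = seq
        return len(updates)
```

Calling `pull_once()` repeatedly keeps the slave converging on the master's state, and a call that finds nothing new simply returns 0.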

RocksDB replicator workflow: the master already has newer updates

[Diagram labels: Thrift Server, Worker threads, DB1 Master, DB2 Slave (Upstream: ip_Port)]

Steps, in the order the slides build them
  • Get updates since SEQ# for DB1
  • Send request
  • Has updates since SEQ#?
  • Yes, this is the data
  • Response
  • Response

RocksDB replicator workflow: the master has no newer updates yet

[Diagram labels: Thrift Server, Worker threads, DB1 Master, DB2 Slave (Upstream: ip_Port)]

Steps, in the order the slides build them
  • Get updates since SEQ# for DB1
  • Send request
  • Has updates since SEQ#?
  • No, wait for my notification
  • Writes
  • These are the new updates
  • Response
  • Response
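The "No, wait for my notification" case is where pull and push meet: the request is held open until a write arrives, then answered immediately. Below is a minimal sketch of that idea using a condition variable. The `MasterLog` class, its method names, and the `timeout` parameter are assumptions for illustration; the slides do not show how the real replicator implements the wait.

```python
import threading


class MasterLog:
    """Hypothetical master-side log that can hold a request until data exists."""

    def __init__(self):
        self._entries = []  # sequence number = index + 1
        self._cond = threading.Condition()

    def write(self, key, value):
        with self._cond:
            self._entries.append((key, value))
            # Wake any held request: "These are the new updates".
            self._cond.notify_all()

    def get_updates_since(self, seq, timeout=None):
        with self._cond:
            # "Has updates since SEQ#?" If not, wait for a write or the timeout.
            self._cond.wait_for(lambda: len(self._entries) > seq,
                                timeout=timeout)
            # Empty list if the timeout expired without new writes.
            return list(enumerate(self._entries[seq:], start=seq + 1))
```

Holding the request means a new write can reach the slave without waiting for the next polling interval, which fits the low-latency goal stated in the design slide.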

Performance
  • Production load: 1 MB/s, P99 12 ms, Max 60 ms
  • Synthetic load: 76 MB/s, P99 106 ms, Max 224 ms
  • Developer velocity: Build a production quality real-time counter service in one week

Open source - coming soon

[Diagram: the common system architecture, shown again]

Thank you

Serving Systems Team @Pinterest
Bo Liu, Shu Zhang, Jian Fang, Jinru He, Linda Lo, Yongsheng Wu

Data Analytics Team @Pinterest
Bryant Xiao, Justin Mejorada Pier, Shuo Xiang, Qingxian Lai, Tien Nguyen, Chunyan Wang

Q&A