MHBtree - InnoDB backend for Manhattan
@JagtapUnmesh @inaamrana
4/21/2016
Manhattan
● Twitter’s holistic storage solution
● Storage as a service, not just technology
● Focus on reliability, availability, extensibility, operability and developer productivity
● Multi-Tenancy, Quality of Service (QoS) and Self-Service are first-class citizens
Manhattan Architecture
Manhattan Data Model
● Datasets are namespaces for keys
● Pkeys are used for partitioning data
● Lkeys are ordered within a pkey
○ dataset=user pkey=”john” lkey=”DOB” => “1985-05-18”
○ dataset=user pkey=”john” lkey=”Movies”, “Cast Away” => “”
○ dataset=user pkey=”john” lkey=”Movies”, “Shutter Island” => “”
● Each attribute can be modified concurrently
● No read-modify-write required
Manhattan Client API
● Basic Operations
○ insert
○ delete
○ deleteRange
○ get
○ get_slice
● Strong consistency
○ increment
○ checkAndSet
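The data model and basic operations above can be sketched as a per-pkey sorted map. This is a toy illustration, not Manhattan's actual client library; all names here are hypothetical:

```python
import bisect

class Dataset:
    """Toy sketch of the Manhattan data model: a dataset maps
    pkey -> ordered (lkey, value) pairs, so get_slice is a range
    read over lkeys. Illustrative only."""

    def __init__(self):
        self.rows = {}  # pkey -> sorted list of [lkey, value]

    def insert(self, pkey, lkey, value):
        entries = self.rows.setdefault(pkey, [])
        i = bisect.bisect_left(entries, [lkey])
        if i < len(entries) and entries[i][0] == lkey:
            entries[i][1] = value  # each attribute is updated independently
        else:
            entries.insert(i, [lkey, value])

    def get(self, pkey, lkey):
        entries = self.rows.get(pkey, [])
        i = bisect.bisect_left(entries, [lkey])
        if i < len(entries) and entries[i][0] == lkey:
            return entries[i][1]
        return None

    def get_slice(self, pkey, start, end):
        # Lkeys are ordered within a pkey, so a slice is a range scan.
        return [(k, v) for k, v in self.rows.get(pkey, []) if start <= k < end]

user = Dataset()
user.insert("john", "DOB", "1985-05-18")
user.insert("john", "Movies/Cast Away", "")
user.insert("john", "Movies/Shutter Island", "")
```

Because attributes are separate (lkey, value) entries, two clients can write different attributes of the same pkey without a read-modify-write cycle.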
Storage Engine Requirements
● Implement Manhattan API
● Fast topology transitions aka “optimized streaming”
● Incremental backups
● Efficient space reclamation
● On-demand and scheduled data exports
● Application-level visibility
SSTable Storage Engine
● For write-heavy use cases at Twitter
● Append-only architecture
● Reconciliation of versions on reads
● Compactions to reuse space
● Consistent snapshots for optimized streaming and backups are trivial
● Exports are served from backups
MHBtree Storage Engine
● For read-heavy use cases at Twitter
● Update in place
● Compactions replaced by MHBtree purge
● Consistent snapshots for backups and streaming are non-trivial due to mutable nature
● Exports pushed down to nodes with data
MHBtree Architecture
[Diagram: an MH Replica speaks the MHBtree protocol to the MHBtree plugin inside mysqld (MySQL Server); the plugin drives the InnoDB storage engine through the Handler API]
Request flow with MHBtree
[Diagram: the MH Client sends requests to an MH Coordinator, which routes them to MH Replicas (each fronting mysql) on Storage Nodes 1, 2 and 3]
MHBtree Plugin Internals
[Diagram: network threads in the plugin accept MH Replica requests and dispatch them to four worker groups: read, write, long-running, and transactional]
MHBtree Data Model
● A mysql table corresponds to one copy of a shard
● Contains key, metadata, value
● Key components are stitched together with MSB1 encoding
● Manhattan keys can be up to 64KB long
○ Segmentation at schema level
● No prefix deduplication
○ Compression with key block size of 4/8 KB

MHBtree Purge
● Space reclamation mechanism for MHBtree
● Periodic throttled activity orchestrated by the replica
● Cleans up:
○ Expired data
○ Reconciled tombstones
○ Data affected by range/dataset deletion
● A sweep for a shard gathers stats at the logical level
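The slides do not spell out the MSB1 encoding itself, but the point of stitching key components is that byte-wise comparison of the stitched key must match component-wise order. One common order-preserving scheme (illustrative only, not necessarily MSB1) terminates each component with 0x00 and escapes embedded 0x00 bytes:

```python
def stitch(components):
    """Order-preserving concatenation of key components.
    NOTE: a generic escape-based scheme for illustration; the actual
    MSB1 encoding used by MHBtree is not described in the slides.
    Each component ends with 0x00; embedded 0x00 bytes become
    0x00 0xFF, so memcmp on stitched keys matches component order."""
    out = bytearray()
    for c in components:
        out += c.replace(b"\x00", b"\x00\xff") + b"\x00"
    return bytes(out)

def unstitch(key):
    """Inverse of stitch(): split on unescaped 0x00 terminators."""
    parts, cur, i = [], bytearray(), 0
    while i < len(key):
        if key[i] == 0:
            if i + 1 < len(key) and key[i + 1] == 0xFF:
                cur.append(0)        # unescape embedded 0x00
                i += 2
            else:
                parts.append(bytes(cur))  # component terminator
                cur = bytearray()
                i += 1
        else:
            cur.append(key[i])
            i += 1
    return parts
```

The escape step matters because a plain join would make ("a", "b") and ("ab",) collide or sort inconsistently once keys contain the separator byte.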
Strong Consistency Architecture
[Diagram: the MH Client subscribes through the MH Coordinator; operations flow into a Distributed Log (Rlog streams 0 to N-1) consumed by MH Replicas (each fronting mysql) on Storage Nodes 1, 2 and 3, which ACK back]
Strong Consistency
● Eventually consistent operations are self-enclosed txns
● Transaction lifecycle managed by replica for strong operations
● Transaction management support in the plugin
○ Separate worker group for transactions
○ Reshuffling of THD/txn across executor threads during transaction scope
○ Primitive deadlock avoidance with rate limiting
Read Operation
● Separate configurable thread pools for reads and writes
● Read path
○ get_slice() translates to range read
○ READ_COMMITTED isolation level
○ Tag transactions as READ_ONLY
Write Operation
● Write Path
○ First try to insert the row
○ If DUPLICATE KEY encountered
■ Existing row is locked and returned from InnoDB
■ Do timestamp-based conflict resolution
■ Update existing row if incoming request has higher timestamp
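The write path above is a last-writer-wins upsert. A minimal sketch, with a plain dict standing in for the InnoDB row (the real plugin works through the Handler API and row locks, not this code):

```python
def apply_write(table, key, value, ts):
    """Sketch of the write path: try to insert; on a duplicate key,
    resolve by timestamp and update only if the incoming write is
    newer. `table` maps key -> (value, ts) and stands in for the
    locked row returned by InnoDB. Illustrative only."""
    existing = table.get(key)
    if existing is None:
        table[key] = (value, ts)   # plain insert succeeded
        return "inserted"
    _, old_ts = existing           # duplicate key: row is "locked"
    if ts > old_ts:                # timestamp-based conflict resolution
        table[key] = (value, ts)
        return "updated"
    return "ignored"               # incoming write is stale
```

Insert-first is cheap for the common case (new keys) and only pays the lock-and-compare cost when a conflict actually exists.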
Write Contention
● Hot Keys
○ Multiple nearly simultaneous requests to update the same row
○ Results in all/most threads waiting for the row lock
● Mitigation:
○ Do a consistent non-locking read before trying the update
○ Make InnoDB return a new error, HA_ERR_LOCK_NOT_AVAILABLE, when a lock wait is encountered
○ Cache currently executing mutate requests in the plugin (WIP)
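The first two mitigations can be sketched together: a non-locking snapshot read filters out stale writes before any lock is taken, and a fail-fast update lets a thread back off instead of queueing on the row lock. `try_update` is a hypothetical stand-in for an InnoDB update that returns an error rather than waiting:

```python
def mutate_with_preread(table, key, value, ts, try_update):
    """Sketch of the hot-key mitigation (illustrative only).
    `table` maps key -> (value, ts); `try_update(key, value, ts)`
    stands in for a locking InnoDB update that returns
    "LOCK_NOT_AVAILABLE" instead of blocking on a lock wait."""
    row = table.get(key)                # consistent non-locking read
    if row is not None and row[1] >= ts:
        return "stale"                  # stale write: no lock needed at all
    status = try_update(key, value, ts)
    if status == "LOCK_NOT_AVAILABLE":
        return "retry_later"            # another thread holds the row lock
    return status
```

For a hot key, most contending writers lose the timestamp race anyway, so dropping them on the snapshot read keeps them off the lock queue entirely.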
Consistent Snapshot
● Requirement:
○ Ability to take table-level snapshot
○ Use the same primitives for both topology transitions and backups (including incremental backups)
● Available Solutions:
○ Logical Backup
■ No support for incremental backup
○ Transportable Tablespace
■ DML lock during copy stage
○ XtraBackup
■ Overhead of copying ibdata and redo logs
Consistent Snapshot
● Copy on write
○ Built upon transportable tablespace feature
○ A page-level operation
■ Copy the before-image of a page before allowing a user to change it
■ Track/intercept page access requests in the buffer pool
■ If a page is requested with X-lock, copy the consistent image to an in-memory buffer
■ Periodically persist the in-memory buffer to a .cow file on disk
○ Allows InnoDB write activity to continue on the .ibd file while it is being copied
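The page-level mechanics above can be sketched as a small tracker hooked into buffer-pool latch grants. This is an illustration of the idea, not the plugin's actual InnoDB hook:

```python
class CowTracker:
    """Sketch of page-level copy-on-write for snapshots (illustrative).
    While a snapshot is active, the first X-lock request for each page
    captures its before-image in memory; flush() periodically persists
    that buffer to the .cow file."""

    def __init__(self):
        self.active = False
        self.copied = set()       # pages whose before-image was captured
        self.before_images = {}   # page_no -> page bytes at snapshot time
        self.cow_file = []        # stands in for the on-disk .cow file

    def start_snapshot(self):
        self.active = True

    def on_x_lock(self, page_no, page_bytes):
        # Called when the buffer pool grants an exclusive latch:
        # copy the consistent image before the writer can modify it.
        if self.active and page_no not in self.copied:
            self.copied.add(page_no)
            self.before_images[page_no] = bytes(page_bytes)

    def flush(self):
        # Periodically persist the in-memory buffer to the .cow file.
        self.cow_file.extend(self.before_images.items())
        self.before_images.clear()
```

Only the first modification of each page pays the copy, so write activity continues on the .ibd file while the snapshot copy proceeds.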
Transportable Tablespace Design
[Diagram: DDL/Select/DML requests and the purge and ibuf threads operating on the .ibd, ibdata, and .cfg files]
t0: DDL + DML lock
t1: ibuf merge
t2: stop purge
t3: flush dirty pages
t4: write .cfg file
t5: copy .ibd & .cfg files
t6: start purge
t7: release lock
Copy on Write Design
[Diagram: DDL/Select/DML requests and the purge and ibuf threads operating on the .ibd, ibdata, .cfg, and .cow files]
t0: DDL + DML lock
t1: ibuf merge
t2: stop purge
t3: trigger COW
t4: release DML lock
t5: start purge
t6: flush dirty pages
t7: copy .ibd file
t8: release DDL lock
Optimized Streaming
[Diagram: source and destination MH Backends (MySQL/InnoDB + BTree plugin) transferring the ibd, cfg, and cow files into a tmp_btree on the destination]
Dest: Enable dual writes
Source: Plugin handshake
Source: Backup Locks/COW
Source: Copy ibd file
Source: Unlock table
Source: Copy cfg and cow files
Dest: IMPORT tablespace
Dest: Switch writes
Dest: Merge btrees
Dest: Enable reads
Physical Backups
● COW mechanism is used for:
○ Table-level full physical backups
○ Table-level incremental backups
■ The .ibd file is scanned and a .dlt file is generated based on the last full backup LSN
■ The .dlt file is made consistent by applying the .cow file
■ The .dlt file is then applied to the full backup to move it from one consistent state to the next
○ All the above changes are made in InnoDB, extending FLUSH TABLES … FOR EXPORT
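The incremental path can be sketched as follows: select pages whose LSN is newer than the last full backup, then overlay the .cow before-images so the delta reflects one consistent point in time. An illustrative sketch, not InnoDB's actual file format or code:

```python
def build_delta(ibd_pages, last_backup_lsn, cow_images):
    """Sketch of .dlt generation (illustrative only).
    `ibd_pages` maps page_no -> (page_lsn, page_bytes);
    `cow_images` maps page_no -> consistent page bytes captured
    at snapshot time via the COW mechanism."""
    delta = {}
    for page_no, (page_lsn, page_bytes) in ibd_pages.items():
        if page_lsn > last_backup_lsn:
            delta[page_no] = page_bytes   # modified since last full backup
    # Make the .dlt consistent by applying the .cow before-images,
    # replacing pages that changed while the .ibd was being scanned.
    for page_no, image in cow_images.items():
        if page_no in delta:
            delta[page_no] = image
    return delta

def apply_delta(full_backup, delta):
    """Apply the .dlt to the full backup to reach the next
    consistent state. `full_backup` maps page_no -> page bytes."""
    full_backup.update(delta)
    return full_backup
```

Since the same COW primitive feeds both streaming and backups, the incremental case only adds the LSN filter on top of it.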
Dealing with I/O
● Batch AIO requests
○ Idea taken from FB patch and extended to write I/O
● Add separate LRU flush thread
○ Ported from Percona
○ Disable single-page flushing
● Use fdatasync() instead of fsync()
● Use larger doublewrite buffer
○ A quick hack to increase the size of the doublewrite buffer
Visibility
● Information Schema tables
○ Get state of plugin
■ mhbtree_streaming
■ mhbtree_purge_stats
■ mhbtree_keyspace_options
■ innodb_copy_on_write
○ Get InnoDB file space usage
■ innodb_segement_usage
■ innodb_sys_tablespace
MHBtree Read Performance
MHBtree Write Performance
MHBtree Disk Usage
Questions?
Follow us @corestorage!