MHBtree - InnoDB backend for Manhattan @JagtapUnmesh @inaamrana 4/21/2016


MHBtree - InnoDB backend for Manhattan

@JagtapUnmesh @inaamrana

4/21/2016


Manhattan

● Twitter’s holistic storage solution
● Storage as a service, not just technology
● Focus on reliability, availability, extensibility, operability and developer productivity
● Multi-Tenancy, Quality of Service (QoS) and Self-Service are first-class citizens


Manhattan Architecture


Manhattan Data Model

● Datasets are namespaces for keys
● Pkeys are used for partitioning data
● Lkeys are ordered within a pkey
  ○ dataset=user pkey="john" lkey="DOB" => "1985-05-18"
  ○ dataset=user pkey="john" lkey="Movies", "Cast Away" => ""
  ○ dataset=user pkey="john" lkey="Movies", "Shutter Island" => ""
● Each attribute can be modified concurrently
● No read modify write required
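The data model above can be pictured with a small in-memory sketch. This is only an illustration, not Manhattan's implementation: the `Dataset` class, its method names and the use of Python tuples for multi-component lkeys are all assumptions made for the example. It shows the two properties the slide names: lkeys stay ordered within a pkey, and each attribute is its own row, so adding a movie never requires reading and rewriting a whole "john" record.

```python
from bisect import bisect_left, insort


class Dataset:
    """Toy model of one dataset (hypothetical class, not Manhattan code)."""

    def __init__(self, name):
        self.name = name
        self.lkeys = {}   # pkey -> sorted list of lkey tuples
        self.values = {}  # (pkey, lkey) -> value

    def insert(self, pkey, lkey, value):
        """Write a single attribute; no other attribute of the pkey is touched."""
        if (pkey, lkey) not in self.values:
            insort(self.lkeys.setdefault(pkey, []), lkey)
        self.values[(pkey, lkey)] = value

    def get_slice(self, pkey, prefix):
        """Return (lkey, value) pairs under pkey whose lkey starts with prefix."""
        ordered = self.lkeys.get(pkey, [])
        result = []
        # Tuples sort component by component, so all matches are contiguous.
        for lkey in ordered[bisect_left(ordered, prefix):]:
            if lkey[:len(prefix)] != prefix:
                break
            result.append((lkey, self.values[(pkey, lkey)]))
        return result


user = Dataset("user")
user.insert("john", ("Movies", "Shutter Island"), "")
user.insert("john", ("DOB",), "1985-05-18")
user.insert("john", ("Movies", "Cast Away"), "")
movies = user.get_slice("john", ("Movies",))
```

Although "Shutter Island" was inserted first, `movies` comes back in lkey order, which is what makes slice reads over a pkey cheap.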


Manhattan Client API

● Basic Operations
  ○ insert
  ○ delete
  ○ deleteRange
  ○ get
  ○ get_slice
● Strong consistency
  ○ increment
  ○ checkAndSet
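The two strongly consistent operations differ from the basic ones in that they read and write a value as one atomic step. The sketch below is a single-process toy that assumes a plain mutex is enough; the `ToyStore` class and the snake_case method names are hypothetical, and the real service coordinates these operations across replicas rather than with a local lock.

```python
import threading


class ToyStore:
    """Single-process stand-in for the client API (hypothetical names)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def insert(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def increment(self, key, delta=1):
        """Atomically add delta and return the new value."""
        with self._lock:
            new_value = self._data.get(key, 0) + delta
            self._data[key] = new_value
            return new_value

    def check_and_set(self, key, expected, new_value):
        """Store new_value only if the current value equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new_value
            return True


store = ToyStore()
store.insert("followers", 10)
after_increment = store.increment("followers")
stale_update = store.check_and_set("followers", 10, 99)   # expected value is outdated
fresh_update = store.check_and_set("followers", 11, 12)
```

The stale `check_and_set` is rejected because another operation changed the value first, which is the guarantee a caller relies on when using it for optimistic updates.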


Storage Engine Requirements

● Implement Manhattan API
● Fast topology transitions aka “optimized streaming”
● Incremental backups
● Efficient space reclamation
● On-demand and scheduled data exports
● Application level visibility


SSTable Storage Engine

● For write-heavy use cases at Twitter
● Append-only architecture
● Reconciliation of versions on reads
● Compactions to reuse space
● Consistent snapshots for optimized streaming and backups are trivial
● Exports are served from backups


MHBtree Storage Engine

● For read-heavy use cases at Twitter
● Update in place
● Compactions replaced by MHBtree purge
● Consistent snapshots for backups and streaming are non-trivial due to the engine's mutable nature
● Exports pushed down to nodes with data


MHBtree Architecture

[Diagram. Labels: MH Replica, MHBtree protocol, mysqld, MySQL Server, MHBtree Plugin, Handler API, InnoDB Storage Engine]


Request flow with MHBtree

[Diagram. Labels: MH Client, MH Coordinator, and Storage Nodes 1 to 3, each running mysql and an MH Replica]


MHBtree Plugin Internals

[Diagram. Labels: MH Replica, Plugin, Network Threads, Read Worker Group, Write Worker Group, Long Running Worker Group, Transactional Worker Group]


MHBtree Data Model

● A MySQL table corresponds to one copy of a shard
● Contains key, metadata, value
● Key components are stitched together with MSB1 encoding
● Manhattan keys can be up to 64KB long
  ○ Segmentation at schema level
● No prefix deduplication
  ○ Compression with key block size of 4/8 KB
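The slides do not describe how MSB1 encoding works, so the sketch below does not attempt to reproduce it. It shows one common way to stitch key components into a single byte string whose byte-wise order matches the component-wise order, which is the property a B-tree primary key needs for range reads. The function name and the escape scheme (zero bytes escaped as `00 FF`, each component terminated by `00 00`) are assumptions chosen for the illustration.

```python
def encode_key(components):
    """Join string components into one order-preserving byte string.

    Illustrative stand-in, not MSB1: a zero byte inside a component is
    escaped as 00 FF, and every component ends with the terminator 00 00.
    """
    out = bytearray()
    for part in components:
        raw = part.encode("utf-8")
        out += raw.replace(b"\x00", b"\x00\xff")
        out += b"\x00\x00"
    return bytes(out)


keys = [
    ("john", "Movies", "Shutter Island"),
    ("john",),
    ("jo", "hn"),
    ("john", "DOB"),
    ("john", "Movies", "Cast Away"),
]
by_encoding = sorted(keys, key=encode_key)
by_components = sorted(keys, key=lambda k: tuple(p.encode("utf-8") for p in k))
```

Because of the terminator, `("jo", "hn")` and `("john",)` encode differently, so stitching never makes two distinct keys collide.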


MHBtree Purge

● Space reclamation mechanism for MHBtree
● Periodic throttled activity orchestrated by the replica
● Cleans up:
  ○ Expired data
  ○ Reconciled tombstones
  ○ Data affected by range/dataset deletion
● A sweep for a shard gathers stats at logical level
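A purge sweep of the kind listed above might look roughly like the following. Everything here is an assumption made for illustration: the `Row` fields, the `reconciled` flag, the representation of range deletions as `(start, end, timestamp)` triples, and throttling by sleeping between batches. The slides do not say how the plugin decides any of these.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    key: bytes
    timestamp: int
    expires_at: Optional[float] = None  # hypothetical TTL deadline
    tombstone: bool = False
    reconciled: bool = False            # hypothetical "safe to drop" flag


def purge_sweep(rows, now, range_deletes, batch_size=2, pause_seconds=0.0):
    """Return (kept_rows, stats) for one shard, pausing between batches."""
    stats = {"scanned": 0, "expired": 0, "tombstones": 0, "range_deleted": 0}
    kept = []
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            stats["scanned"] += 1
            if row.expires_at is not None and row.expires_at <= now:
                stats["expired"] += 1
            elif row.tombstone and row.reconciled:
                stats["tombstones"] += 1
            elif any(lo <= row.key < hi and row.timestamp <= ts
                     for lo, hi, ts in range_deletes):
                stats["range_deleted"] += 1
            else:
                kept.append(row)
        time.sleep(pause_seconds)  # throttle so foreground traffic is not starved
    return kept, stats


shard = [
    Row(b"a", 1, expires_at=50.0),
    Row(b"b", 2, tombstone=True, reconciled=True),
    Row(b"c", 3),
    Row(b"d", 9),
    Row(b"e", 4, tombstone=True),
]
kept, stats = purge_sweep(shard, now=100.0, range_deletes=[(b"c", b"e", 5)])
```

Row `d` survives because it was written after the range deletion, and the unreconciled tombstone `e` is kept so that it can still win conflict resolution on other replicas.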


Strong Consistency Architecture

[Diagram. Labels: MH Client, MH Coord, Distributed Log with Rlog streams 0 to N-1, SUBSCRIBE, ACK, and Storage Nodes 1 to 3, each running mysql and an MH Replica]


Strong Consistency

● Eventually consistent operations are self enclosed txns
● Transaction lifecycle managed by replica for strong operations
● Transaction management support in the plugin
  ○ Separate worker group for transactions
  ○ Reshuffling of THD/txn across executor threads during transaction scope
  ○ Primitive deadlock avoidance with rate limiting


Read Operation

● Separate configurable thread pools for reads and writes
● Read path
  ○ get_slice() translates to range read
  ○ READ_COMMITTED isolation level
  ○ Tag transactions as READ_ONLY
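To make "get_slice() translates to range read" concrete, the sketch below runs a prefix query as a range scan over a primary key. It uses Python's built-in `sqlite3` purely as a stand-in for InnoDB, so isolation levels and READ_ONLY tagging are not modeled. The table layout, the `/`-separated keys and the extra "jane" row are invented for the example.

```python
import sqlite3


def prefix_upper_bound(prefix):
    """Smallest byte string greater than every key that starts with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None  # no finite upper bound exists
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def get_slice(db, prefix):
    """Range read over the primary key: rows whose key starts with prefix."""
    upper = prefix_upper_bound(prefix)
    if upper is None:
        sql, args = "SELECT k, v FROM shard WHERE k >= ? ORDER BY k", (prefix,)
    else:
        sql = "SELECT k, v FROM shard WHERE k >= ? AND k < ? ORDER BY k"
        args = (prefix, upper)
    return db.execute(sql, args).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shard (k BLOB PRIMARY KEY, v BLOB)")
db.executemany(
    "INSERT INTO shard VALUES (?, ?)",
    [
        (b"john/Movies/Shutter Island", b""),
        (b"john/DOB", b"1985-05-18"),
        (b"john/Movies/Cast Away", b""),
        (b"jane/DOB", b"1990-01-01"),
    ],
)
movie_keys = [k for k, _ in get_slice(db, b"john/Movies/")]
```

The query touches only the contiguous key range between the prefix and its upper bound, which is why an ordered B-tree suits slice reads.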


Write Operation

● Write Path
  ○ First try to insert the row
  ○ If DUPLICATE KEY encountered
    ■ Existing row is locked and returned from InnoDB
    ■ Do timestamp-based conflict resolution
    ■ Update existing row if incoming request has higher timestamp
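The insert-first write path can be sketched as follows, again with `sqlite3` standing in for InnoDB. The table and function names are hypothetical. Two simplifications are worth noting: there is no explicit row lock here, since the conditional `UPDATE` compares and writes in one statement, and a write with an equal timestamp is treated as losing, which the slides do not specify.

```python
import sqlite3


def open_shard():
    """Create an in-memory table with key, timestamp metadata and value."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE shard (k BLOB PRIMARY KEY, ts INTEGER NOT NULL, v BLOB)"
    )
    return db


def write(db, key, ts, value):
    """Apply a write; return True if the incoming version won."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO shard (k, ts, v) VALUES (?, ?, ?)", (key, ts, value)
            )
        return True
    except sqlite3.IntegrityError:
        pass  # duplicate key: fall through to conflict resolution
    with db:
        cursor = db.execute(
            "UPDATE shard SET ts = ?, v = ? WHERE k = ? AND ts < ?",
            (ts, value, key, ts),
        )
        return cursor.rowcount == 1


db = open_shard()
first = write(db, b"john/DOB", 5, b"1985-05-18")
stale = write(db, b"john/DOB", 3, b"1984-01-01")   # older timestamp loses
newer = write(db, b"john/DOB", 9, b"1985-05-19")   # newer timestamp wins
stored = db.execute("SELECT ts, v FROM shard WHERE k = ?", (b"john/DOB",)).fetchone()
```

Trying the insert first keeps the common case of a brand-new key to a single statement; only duplicates pay for the comparison.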


Write Contention

● Hot Keys
  ○ Multiple nearly simultaneous requests to update the same row
  ○ Results in all/most threads waiting for the row lock
● Mitigation:
  ○ Do a consistent non-locking read before trying the update
  ○ Make InnoDB return a new error, HA_ERR_LOCK_NOT_AVAILABLE, if a lock wait is encountered
  ○ Cache currently executing mutate requests in plugin (WIP)
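The first two mitigations can be illustrated with an ordinary mutex. This is a sketch under assumptions, not the plugin's code: `HotRow`, `try_update` and the string results are made up, and `LOCK_NOT_AVAILABLE` merely stands in for the `HA_ERR_LOCK_NOT_AVAILABLE` error. What the plugin does after receiving that error is not described in the slides.

```python
import threading

LOCK_NOT_AVAILABLE = "LOCK_NOT_AVAILABLE"  # stand-in for HA_ERR_LOCK_NOT_AVAILABLE


class HotRow:
    """One row plus its lock (hypothetical structure)."""

    def __init__(self, ts, value):
        self.ts = ts
        self.value = value
        self.lock = threading.Lock()


def try_update(row, ts, value):
    # Mitigation 1: read without locking. A write that is already stale
    # can be dropped without ever queueing for the row lock.
    if ts <= row.ts:
        return "SKIPPED"
    # Mitigation 2: fail fast instead of parking the thread in a lock wait.
    if not row.lock.acquire(blocking=False):
        return LOCK_NOT_AVAILABLE
    try:
        if ts > row.ts:  # re-check now that the row is locked
            row.ts, row.value = ts, value
            return "APPLIED"
        return "SKIPPED"
    finally:
        row.lock.release()


row = HotRow(ts=10, value="v10")
applied = try_update(row, 12, "v12")
skipped = try_update(row, 11, "v11")      # stale, never touches the lock
row.lock.acquire()                        # simulate another writer holding the lock
contended = try_update(row, 13, "v13")
row.lock.release()
```

Returning an error immediately frees the worker thread for other requests, so a single hot key cannot tie up the whole write worker group.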


Consistent Snapshot

● Requirement:
  ○ Ability to take table level snapshot
  ○ Use the same primitives for both topology transitions and backups (including incremental backups)
● Available Solutions:
  ○ Logical Backup
    ■ No support for incremental backup
  ○ Transportable Tablespace
    ■ DML lock during copy stage
  ○ XtraBackup
    ■ Overhead of copying ibdata and redo logs


Consistent Snapshot

● Copy on write
  ○ Built upon transportable tablespace feature
  ○ A page-level operation
    ■ Copy before image of a page before allowing user to change it
    ■ Track/intercept page access requests in the buffer pool
    ■ If page is requested with X-lock, copy the consistent image to an in-memory buffer
    ■ Periodically persist the in-memory buffer to .cow file on disk
  ○ Allows InnoDB write activity to continue on the .ibd file while it is being copied
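The page-level idea can be shown with a toy tablespace. This is a conceptual sketch only: the class and method names are invented, pages are plain byte strings in a list, and the periodic flush of the in-memory buffer to a `.cow` file is left out.

```python
class CowTablespace:
    """Toy tablespace with copy-on-write snapshots (hypothetical names)."""

    def __init__(self, pages):
        self.pages = list(pages)       # live pages, playing the role of the .ibd file
        self.snapshot_active = False
        self.before_images = {}        # page_no -> image at snapshot start (.cow)

    def begin_snapshot(self):
        self.before_images = {}
        self.snapshot_active = True

    def write_page(self, page_no, data):
        """Corresponds to a page being requested with an X-lock and modified."""
        if self.snapshot_active and page_no not in self.before_images:
            # Save the before image once, the first time the page changes.
            self.before_images[page_no] = self.pages[page_no]
        self.pages[page_no] = data

    def end_snapshot(self, copied_pages):
        """Patch a fuzzy copy back to the state at begin_snapshot()."""
        self.snapshot_active = False
        consistent = list(copied_pages)
        for page_no, image in self.before_images.items():
            consistent[page_no] = image
        return consistent


space = CowTablespace([b"A0", b"B0", b"C0"])
space.begin_snapshot()
space.write_page(1, b"B1")          # write lands before the file copy reaches it
fuzzy_copy = list(space.pages)      # the copy therefore sees B1
space.write_page(2, b"C1")          # write lands after the copy
snapshot = space.end_snapshot(fuzzy_copy)
```

The copied file alone is inconsistent, but combining it with the saved before images recovers the exact state at the moment the snapshot began, while writers were never blocked during the copy.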


Transportable Tablespace Design

[Diagram. Labels: i.ibd file, ibdata file, ibuf, purge, .cfg file; concurrent activity: DDL, Select, DML, DML]

t0: DDL + DML Lock
t1: ibuf merge
t2: stop purge
t3: flush dirty pages
t4: write .cfg file
t5: copy .ibd & .cfg file
t6: start purge
t7: release lock


Copy on Write Design

[Diagram. Labels: i.ibd file, ibdata file, ibuf, purge, .cfg file, .cow file; concurrent activity: DDL, Select, DML, DML]

t0: DDL + DML Lock
t1: ibuf merge
t2: stop purge
t3: trigger COW
t4: release DML lock
t5: start purge
t6: flush dirty pages
t7: copy .ibd file
t8: release DDL lock


Optimized Streaming

[Diagram. Source and Destination nodes, each with an MH Backend, MySQL/InnoDB and the BTree Plugin; files shown: ibd, cfg, cow, tmp_btree]

Dest: Enable dual writes
Source: Plugin handshake
Source: Backup Locks/COW
Source: Copy ibd file
Source: Unlock table
Source: Copy cfg and cow files
Dest: IMPORT tablespace
Dest: Switch writes
Dest: Merge btrees
Dest: Enable reads


Physical Backups

● COW mechanism is used for:
  ○ Table level full physical backups
  ○ Table level incremental backups
    ■ The .ibd file is scanned and a .dlt file is generated based on last full backup LSN
    ■ .dlt file is made consistent by applying .cow file
    ■ .dlt file is then applied to the full backup to move it from one consistent state to next
  ○ All above changes are made in InnoDB, extending FLUSH TABLES … FOR EXPORT
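One way to read the three incremental-backup steps is sketched below. The data layout is entirely assumed: pages are modeled as a dict from page number to an `(lsn, data)` pair, and the function names are invented. In particular, dropping a page from the delta when its before image is no newer than the last full backup is an interpretation, not something the slides state.

```python
def build_delta(ibd_pages, last_backup_lsn):
    """Scan the live file and keep pages changed since the last full backup."""
    return {
        page_no: (lsn, data)
        for page_no, (lsn, data) in ibd_pages.items()
        if lsn > last_backup_lsn
    }


def make_consistent(delta, cow_pages, last_backup_lsn):
    """Replace pages modified during the scan with their before images."""
    fixed = dict(delta)
    for page_no, (lsn, data) in cow_pages.items():
        if lsn > last_backup_lsn:
            fixed[page_no] = (lsn, data)
        else:
            # The page was unchanged at snapshot time; the full backup has it.
            fixed.pop(page_no, None)
    return fixed


def apply_delta(full_backup, delta):
    """Move the full backup forward to the snapshot's consistent state."""
    merged = dict(full_backup)
    merged.update(delta)
    return merged


full_backup = {0: (10, b"a0"), 1: (10, b"b0"), 2: (10, b"c0")}
# State of the live file when scanned: pages 1 and 2 were rewritten mid-scan.
live_file = {0: (10, b"a0"), 1: (30, b"b2"), 2: (31, b"c1")}
# Before images captured by copy-on-write at snapshot time.
cow_file = {1: (20, b"b1"), 2: (10, b"c0")}

delta = make_consistent(build_delta(live_file, 10), cow_file, 10)
restored = apply_delta(full_backup, delta)
```

Only page 1 ends up in the delta, at the version it had when the snapshot started, so the restored backup matches a single point in time.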


Dealing with I/O

● Batch AIO requests
  ○ Idea taken from FB patch and extended to write I/O
● Add separate LRU flush thread
  ○ Ported from Percona
  ○ Disable single page flushing
● Use fdatasync() instead of fsync()
● Use larger doublewrite buffer
  ○ A quick hack solution to increase the size of double write buffer
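The fdatasync() point can be demonstrated outside InnoDB. `fdatasync()` flushes file data plus only the metadata needed to read it back, so it can avoid the extra metadata write that `fsync()` may issue for things like timestamps. The helper below is a generic sketch with an invented name; it falls back to `os.fsync` on platforms where Python does not expose `os.fdatasync`.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload and flush it to stable storage before returning."""
    # Prefer fdatasync where available; otherwise use the stricter fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        sync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "example.log")
    durable_append(log_path, b"page-1\n")
    durable_append(log_path, b"page-2\n")
    with open(log_path, "rb") as handle:
        contents = handle.read()
```

How much this saves depends on the filesystem and on whether the write changes the file size, so any gain should be measured on the target hardware.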


Visibility

● Information Schema tables
  ○ Get state of plugin
    ■ mhbtree_streaming
    ■ mhbtree_purge_stats
    ■ mhbtree_keyspace_options
    ■ innodb_copy_on_write
  ○ Get InnoDB file space usage
    ■ innodb_segement_usage
    ■ innodb_sys_tablespace


MHBtree Read Performance


MHBtree Write Performance


MHBtree Disk Usage


Questions? Follow us @corestorage!