MHBtree - InnoDB backend for Manhattan
@JagtapUnmesh @inaamrana
4/21/2016
Manhattan
● Twitter’s holistic storage solution
● Storage as a service, not just technology
● Focus on reliability, availability, extensibility, operability and developer productivity
● Multi-Tenancy, Quality of Service (QoS) and Self-Service are first-class citizens
Manhattan Architecture
Manhattan Data Model
● Datasets are namespaces for keys
● Pkeys are used for partitioning data
● Lkeys are ordered within a pkey
○ dataset=user pkey=”john” lkey=”DOB” => “1985-05-18”
○ dataset=user pkey=”john” lkey=”Movies”, “Cast Away” => “”
○ dataset=user pkey=”john” lkey=”Movies”, “Shutter Island” => “”
● Each attribute can be modified concurrently
● No read-modify-write required
Manhattan Client API
● Basic Operations
○ insert
○ delete
○ deleteRange
○ get
○ get_slice
● Strong consistency
○ increment
○ checkAndSet
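The data model and basic operations above can be sketched as a per-pkey sorted map. This is a toy illustration, not Manhattan's actual client library; all names here are hypothetical:

```python
import bisect

class Dataset:
    """Toy sketch of the Manhattan data model: a dataset maps
    pkey -> ordered (lkey, value) pairs, so get_slice is a range
    read over lkeys. Illustrative only."""

    def __init__(self):
        self.rows = {}  # pkey -> sorted list of [lkey, value]

    def insert(self, pkey, lkey, value):
        entries = self.rows.setdefault(pkey, [])
        i = bisect.bisect_left(entries, [lkey])
        if i < len(entries) and entries[i][0] == lkey:
            entries[i][1] = value  # each attribute is updated independently
        else:
            entries.insert(i, [lkey, value])

    def get(self, pkey, lkey):
        entries = self.rows.get(pkey, [])
        i = bisect.bisect_left(entries, [lkey])
        if i < len(entries) and entries[i][0] == lkey:
            return entries[i][1]
        return None

    def get_slice(self, pkey, start, end):
        # Lkeys are ordered within a pkey, so a slice is a range scan.
        return [(k, v) for k, v in self.rows.get(pkey, []) if start <= k < end]

user = Dataset()
user.insert("john", "DOB", "1985-05-18")
user.insert("john", "Movies/Cast Away", "")
user.insert("john", "Movies/Shutter Island", "")
```

Because attributes are separate (lkey, value) entries, two clients can write different attributes of the same pkey without a read-modify-write cycle.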
Storage Engine Requirements
● Implement Manhattan API
● Fast topology transitions aka “optimized streaming”
● Incremental backups
● Efficient space reclamation
● On-demand and scheduled data exports
● Application-level visibility
SSTable Storage Engine
● For write-heavy use cases at Twitter
● Append-only architecture
● Reconciliation of versions on reads
● Compactions to reuse space
● Consistent snapshots for optimized streaming and backups are trivial
● Exports are served from backups
MHBtree Storage Engine
● For read-heavy use cases at Twitter
● Update in place
● Compactions replaced by MHBtree purge
● Consistent snapshots for backups and streaming are non-trivial due to mutable nature
● Exports pushed down to nodes with data
MHBtree Architecture
[Diagram: an MH Replica speaks the MHBtree protocol to the MHBtree plugin inside mysqld (MySQL Server); the plugin drives the InnoDB storage engine through the Handler API]
Request flow with MHBtree
[Diagram: the MH Client sends requests to an MH Coordinator, which routes them to MH Replicas (each fronting mysql) on Storage Nodes 1, 2 and 3]
MHBtree Plugin Internals
[Diagram: network threads in the plugin accept MH Replica requests and dispatch them to four worker groups: read, write, long-running, and transactional]
MHBtree Data Model
● A mysql table corresponds to one copy of a shard
● Contains key, metadata, value
● Key components are stitched together with MSB1 encoding
● Manhattan keys can be up to 64KB long
○ Segmentation at schema level
● No prefix deduplication
○ Compression with key block size of 4/8 KB

MHBtree Purge
● Space reclamation mechanism for MHBtree
● Periodic throttled activity orchestrated by the replica
● Cleans up:
○ Expired data
○ Reconciled tombstones
○ Data affected by range/dataset deletion
● A sweep for a shard gathers stats at the logical level
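The slides do not spell out the MSB1 encoding itself, but the point of stitching key components is that byte-wise comparison of the stitched key must match component-wise order. One common order-preserving scheme (illustrative only, not necessarily MSB1) terminates each component with 0x00 and escapes embedded 0x00 bytes:

```python
def stitch(components):
    """Order-preserving concatenation of key components.
    NOTE: a generic escape-based scheme for illustration; the actual
    MSB1 encoding used by MHBtree is not described in the slides.
    Each component ends with 0x00; embedded 0x00 bytes become
    0x00 0xFF, so memcmp on stitched keys matches component order."""
    out = bytearray()
    for c in components:
        out += c.replace(b"\x00", b"\x00\xff") + b"\x00"
    return bytes(out)

def unstitch(key):
    """Inverse of stitch(): split on unescaped 0x00 terminators."""
    parts, cur, i = [], bytearray(), 0
    while i < len(key):
        if key[i] == 0:
            if i + 1 < len(key) and key[i + 1] == 0xFF:
                cur.append(0)        # unescape embedded 0x00
                i += 2
            else:
                parts.append(bytes(cur))  # component terminator
                cur = bytearray()
                i += 1
        else:
            cur.append(key[i])
            i += 1
    return parts
```

The escape step matters because a plain join would make ("a", "b") and ("ab",) collide or sort inconsistently once keys contain the separator byte.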
Strong Consistency Architecture
[Diagram: the MH Client subscribes through the MH Coordinator; operations flow into a Distributed Log (Rlog streams 0 to N-1) consumed by MH Replicas (each fronting mysql) on Storage Nodes 1, 2 and 3, which ACK back]
Strong Consistency
● Eventually consistent operations are self-enclosed txns
● Transaction lifecycle managed by replica for strong operations
● Transaction management support in the plugin
○ Separate worker group for transactions
○ Reshuffling of THD/txn across executor threads during transaction scope
○ Primitive deadlock avoidance with rate limiting
Read Operation
● Separate configurable thread pools for reads and writes
● Read path
○ get_slice() translates to range read
○ READ_COMMITTED isolation level
○ Tag transactions as READ_ONLY
Write Operation
● Write Path
○ First try to insert the row
○ If DUPLICATE KEY encountered
■ Existing row is locked and returned from InnoDB
■ Do timestamp-based conflict resolution
■ Update existing row if incoming request has higher timestamp
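The write path above is a last-writer-wins upsert. A minimal sketch, with a plain dict standing in for the InnoDB row (the real plugin works through the Handler API and row locks, not this code):

```python
def apply_write(table, key, value, ts):
    """Sketch of the write path: try to insert; on a duplicate key,
    resolve by timestamp and update only if the incoming write is
    newer. `table` maps key -> (value, ts) and stands in for the
    locked row returned by InnoDB. Illustrative only."""
    existing = table.get(key)
    if existing is None:
        table[key] = (value, ts)   # plain insert succeeded
        return "inserted"
    _, old_ts = existing           # duplicate key: row is "locked"
    if ts > old_ts:                # timestamp-based conflict resolution
        table[key] = (value, ts)
        return "updated"
    return "ignored"               # incoming write is stale
```

Insert-first is cheap for the common case (new keys) and only pays the lock-and-compare cost when a conflict actually exists.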
Write Contention
● Hot Keys
○ Multiple nearly simultaneous requests to update the same row
○ Results in all/most threads waiting for the row lock
● Mitigation:
○ Do a consistent non-locking read before trying the update
○ Make InnoDB return a new error, HA_ERR_LOCK_NOT_AVAILABLE, when a lock wait is encountered
○ Cache currently executing mutate requests in the plugin (WIP)
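The first two mitigations can be sketched together: a non-locking snapshot read filters out stale writes before any lock is taken, and a fail-fast update lets a thread back off instead of queueing on the row lock. `try_update` is a hypothetical stand-in for an InnoDB update that returns an error rather than waiting:

```python
def mutate_with_preread(table, key, value, ts, try_update):
    """Sketch of the hot-key mitigation (illustrative only).
    `table` maps key -> (value, ts); `try_update(key, value, ts)`
    stands in for a locking InnoDB update that returns
    "LOCK_NOT_AVAILABLE" instead of blocking on a lock wait."""
    row = table.get(key)                # consistent non-locking read
    if row is not None and row[1] >= ts:
        return "stale"                  # stale write: no lock needed at all
    status = try_update(key, value, ts)
    if status == "LOCK_NOT_AVAILABLE":
        return "retry_later"            # another thread holds the row lock
    return status
```

For a hot key, most contending writers lose the timestamp race anyway, so dropping them on the snapshot read keeps them off the lock queue entirely.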
Consistent Snapshot
● Requirement:
○ Ability to take table-level snapshot
○ Use the same primitives for both topology transitions and backups (including incremental backups)
● Available Solutions:
○ Logical Backup
■ No support for incremental backup
○ Transportable Tablespace
■ DML lock during copy stage
○ XtraBackup
■ Overhead of copying ibdata and redo logs
Consistent Snapshot
● Copy on write
○ Built upon transportable tablespace feature
○ A page-level operation
■ Copy the before-image of a page before allowing a user to change it
■ Track/intercept page access requests in the buffer pool
■ If a page is requested with X-lock, copy the consistent image to an in-memory buffer
■ Periodically persist the in-memory buffer to a .cow file on disk
○ Allows InnoDB write activity to continue on the .ibd file while it is being copied
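The page-level mechanics above can be sketched as a small tracker hooked into buffer-pool latch grants. This is an illustration of the idea, not the plugin's actual InnoDB hook:

```python
class CowTracker:
    """Sketch of page-level copy-on-write for snapshots (illustrative).
    While a snapshot is active, the first X-lock request for each page
    captures its before-image in memory; flush() periodically persists
    that buffer to the .cow file."""

    def __init__(self):
        self.active = False
        self.copied = set()       # pages whose before-image was captured
        self.before_images = {}   # page_no -> page bytes at snapshot time
        self.cow_file = []        # stands in for the on-disk .cow file

    def start_snapshot(self):
        self.active = True

    def on_x_lock(self, page_no, page_bytes):
        # Called when the buffer pool grants an exclusive latch:
        # copy the consistent image before the writer can modify it.
        if self.active and page_no not in self.copied:
            self.copied.add(page_no)
            self.before_images[page_no] = bytes(page_bytes)

    def flush(self):
        # Periodically persist the in-memory buffer to the .cow file.
        self.cow_file.extend(self.before_images.items())
        self.before_images.clear()
```

Only the first modification of each page pays the copy, so write activity continues on the .ibd file while the snapshot copy proceeds.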
Transportable Tablespace Design
[Diagram: DDL/Select/DML requests and the purge and ibuf threads operating on the .ibd, ibdata, and .cfg files]
t0: DDL + DML lock
t1: ibuf merge
t2: stop purge
t3: flush dirty pages
t4: write .cfg file
t5: copy .ibd & .cfg files
t6: start purge
t7: release lock
Copy on Write Design
[Diagram: DDL/Select/DML requests and the purge and ibuf threads operating on the .ibd, ibdata, .cfg, and .cow files]
t0: DDL + DML lock
t1: ibuf merge
t2: stop purge
t3: trigger COW
t4: release DML lock
t5: start purge
t6: flush dirty pages
t7: copy .ibd file
t8: release DDL lock
Optimized Streaming
[Diagram: source and destination MH Backends (MySQL/InnoDB + BTree plugin) transferring the ibd, cfg, and cow files into a tmp_btree on the destination]
Dest: Enable dual writes
Source: Plugin handshake
Source: Backup Locks/COW
Source: Copy ibd file
Source: Unlock table
Source: Copy cfg and cow files
Dest: IMPORT tablespace
Dest: Switch writes
Dest: Merge btrees
Dest: Enable reads
Physical Backups
● COW mechanism is used for:
○ Table-level full physical backups
○ Table-level incremental backups
■ The .ibd file is scanned and a .dlt file is generated based on the last full backup LSN
■ The .dlt file is made consistent by applying the .cow file
■ The .dlt file is then applied to the full backup to move it from one consistent state to the next
○ All the above changes are made in InnoDB, extending FLUSH TABLES … FOR EXPORT
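The incremental path can be sketched as follows: select pages whose LSN is newer than the last full backup, then overlay the .cow before-images so the delta reflects one consistent point in time. An illustrative sketch, not InnoDB's actual file format or code:

```python
def build_delta(ibd_pages, last_backup_lsn, cow_images):
    """Sketch of .dlt generation (illustrative only).
    `ibd_pages` maps page_no -> (page_lsn, page_bytes);
    `cow_images` maps page_no -> consistent page bytes captured
    at snapshot time via the COW mechanism."""
    delta = {}
    for page_no, (page_lsn, page_bytes) in ibd_pages.items():
        if page_lsn > last_backup_lsn:
            delta[page_no] = page_bytes   # modified since last full backup
    # Make the .dlt consistent by applying the .cow before-images,
    # replacing pages that changed while the .ibd was being scanned.
    for page_no, image in cow_images.items():
        if page_no in delta:
            delta[page_no] = image
    return delta

def apply_delta(full_backup, delta):
    """Apply the .dlt to the full backup to reach the next
    consistent state. `full_backup` maps page_no -> page bytes."""
    full_backup.update(delta)
    return full_backup
```

Since the same COW primitive feeds both streaming and backups, the incremental case only adds the LSN filter on top of it.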
Dealing with I/O
● Batch AIO requests
○ Idea taken from FB patch and extended to write I/O
● Add separate LRU flush thread
○ Ported from Percona
○ Disable single-page flushing
● Use fdatasync() instead of fsync()
● Use larger doublewrite buffer
○ A quick hack to increase the size of the doublewrite buffer
Visibility
● Information Schema tables
○ Get state of plugin
■ mhbtree_streaming
■ mhbtree_purge_stats
■ mhbtree_keyspace_options
■ innodb_copy_on_write
○ Get InnoDB file space usage
■ innodb_segement_usage
■ innodb_sys_tablespace
MHBtree Read Performance
MHBtree Write Performance
MHBtree Disk Usage
Questions?
Follow us @corestorage!