IBM Labs in Haifa © 2003 IBM Corporation
Building a Distributed Database with Device Served Leases
or, Distributed ARIES
Ohad Rodeh
IBM Labs in Haifa – Object Based Storage
Presentation Structure
- Motivation
- A single node database, DBs
  - The database uses object-disks instead of regular disks
- A distributed database, DBm
  - Based on DBs
- Summary
- Acknowledgments
Motivation
- Object-Disks (OSDs) are a novel storage appliance
- They allow adding new functionality to disks
- Clustered databases have been built using group-services
- Adding leases to OSDs allows constructing database clusters without group-services
- Group services limit scalability and complicate programming
  - Who is alive? Who should be fenced out?
  - A fencing mechanism is needed
A Clustered Database
[Diagram: client, network, database compute nodes, object disks]

- Shared everything: the disks are shared
- Shared nothing: disks are local to the database servers
- DB2 uses (mostly) shared nothing; the mainframe version uses shared disks
- Oracle uses shared disks
- This paper is focused on shared disks
Building it the Old Way
[Diagram: client, network, compute nodes connected by a GCS, disks; fencing inside the switch]

- A GCS (group communication service) connects the compute-nodes
- If a compute-node is declared dead, it is fenced out
- Fencing is supported by the switch
- Result: complexity. To build a clustered database one needed to sew together the GCS, the database, and the switch
An Object Disk
- An object disk is an appliance connected to the network
  - Talks a standard protocol
  - Implements a flat file-system
- SNIA has a working group on standardization
  - Participating companies: Panasas, IBM, HP, Veritas, Seagate, …
A Single Node Database, DBs
- Mapping tables to objects
  - A database table is realized as an object on an OSD
- ARIES
  - DBs is assumed to use ARIES
  - Each journal entry refers to a single page
- Locking
  - Transactions can get into deadlocks
  - Deadlock detection is used; after detection the database chooses a victim and aborts it
Distributed Locking I
- DBm is based on DBs
- DBm requires distributed locking
- Lease support in the OSD
  - Each OSD provides a major-lease
  - The lease is valid for, say, 30 seconds
  - The holder of the major-lease can perform operations on the OSD
  - The major-lease can be delegated
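The bullet list above can be sketched in code. This is a minimal sketch under stated assumptions: the `MajorLease` class, its field names, and the injectable clock are hypothetical illustrations, not part of the OSD protocol.

```python
import time

class MajorLease:
    """Sketch of an OSD major-lease: a time-limited right to operate on one OSD."""

    def __init__(self, duration_sec=30.0, now=time.monotonic):
        # `now` is injectable so the expiry logic can be tested with a fake clock.
        self._now = now
        self.duration = duration_sec
        self.expires_at = self._now() + duration_sec

    def is_valid(self):
        # An OSD would reject requests carrying an expired lease.
        return self._now() < self.expires_at

    def renew(self):
        # The holder must renew before expiry to keep operating on the OSD.
        self.expires_at = self._now() + self.duration

    def delegate(self):
        # The major-lease can be handed to another node; the delegated copy
        # carries the same expiry time, so it cannot outlive the original.
        copy = MajorLease(self.duration, now=self._now)
        copy.expires_at = self.expires_at
        return copy
```

The injectable clock is purely a testing convenience; a real node would use its local monotonic clock and renew well before the 30-second window closes to mask network delays.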
Locking for records, pages, tables
- A table is composed of records
- A database provides locking on a record basis
- Here we assume distributed locking is done per page
- Internally to a node, per-record locking is provided
Using the OSD Lease
- Per OSD, a lock server is run on a compute-node; for OSD X it is XLKM
- XLKM takes the major lease for X and provides page-level/object-level locking
- Requests with outdated leases are rejected

[Diagram: XLKM takes L=Lease(30) on OSD X; a client sends {Read(OID), L} requests to the OSD]
An OSD lock-server
- Locks are hardened to an object on X, Xlocks
- If the compute-node fails, the locks are recoverable

[Diagram: as before, with XLKM persisting its lock state to the Xlocks object on OSD X]
Connecting to a lock-manager
- Compute-nodes connect to XLKM
- They can take and release locks on pages and tables on X
- XLKM gives the client the major-lease for X
- This allows the client direct access to the OSD
Connecting to a lock-manager II
- The client takes a lease on XLKM; the lease protects the locks taken
- If the lease is not renewed in time, the locks are broken
- The client provides the location of its log to XLKM
- When the lease is broken, the locks are revoked and the pages are marked to-recover
- The next client to take a lock on a to-recover page is provided with the log and needs to perform recovery
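The lock-manager bookkeeping described above can be sketched as follows. This is a hypothetical illustration of the mechanism, not XLKM's actual interface: the class name, method names, and the string log-location are all assumptions.

```python
class LockManager:
    """Sketch of XLKM's per-client lease bookkeeping (names hypothetical)."""

    def __init__(self):
        self.locks = {}         # page -> client currently holding the lock
        self.log_location = {}  # client -> location of its log object
        self.to_recover = {}    # page -> log location needed before reuse

    def register(self, client, log_location):
        # The client provides the location of its log when connecting.
        self.log_location[client] = log_location

    def lock(self, client, page):
        if page in self.to_recover:
            # Hand the failed client's log to the next locker, which must
            # perform recovery before using the page.
            return ("recover-first", self.to_recover.pop(page))
        self.locks[page] = client
        return ("granted", None)

    def lease_broken(self, client):
        # Lease not renewed in time: revoke the client's locks and mark
        # its pages to-recover.
        for page, holder in list(self.locks.items()):
            if holder == client:
                del self.locks[page]
                self.to_recover[page] = self.log_location[client]
```

A real XLKM would also harden this state to the Xlocks object so the table itself survives a lock-manager crash, as the previous slide notes.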
Deadlocks
- Deadlocks can happen in DBm
- For local deadlocks the DBs algorithm is sufficient
- For distributed deadlocks there is known literature; for example:
  - Once in a while, each compute node requests the set of locks from the other compute nodes
  - Search for cycles
  - For each cycle, kill a victim transaction
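The cycle-search step above can be sketched as a depth-first search over a wait-for graph. This assumes the lock sets collected from the other nodes have already been merged into one graph; the function name and graph representation are illustrative, not from the paper.

```python
def find_deadlock(wait_for):
    """Find one cycle in a wait-for graph {txn: set of txns it waits for}.

    Returns a list of transactions forming a cycle (from which a victim
    would be chosen and killed), or None if the graph is acyclic.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited / on current path / done
    color = {t: WHITE for t in wait_for}
    stack = []

    def visit(t):
        color[t] = GREY
        stack.append(t)
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GREY:
                return stack[stack.index(u):]   # back-edge: found a cycle
            if color.get(u, WHITE) == WHITE:
                cycle = visit(u)
                if cycle:
                    return cycle
        color[t] = BLACK
        stack.pop()
        return None

    for t in list(wait_for):
        if color[t] == WHITE:
            cycle = visit(t)
            if cycle:
                return cycle
    return None
```

Running this periodically on a merged snapshot trades detection latency for low overhead; transient cycles that resolve between snapshots are never seen, which is acceptable since a real deadlock persists.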
Tables are Implemented as B+-trees
- The table is physically allocated on an OSD object
- The internal nodes contain keys
- The leaf nodes contain keys and data
- Each node is represented as an 8K page
- Each page (and key) can be locked separately

[Diagram: an example B+-tree, with internal nodes holding keys and leaf nodes holding keys and data]
Transactions on DBm I
- Each client has a log object: if A is a compute-node, then logA is its log
- logA contains the write-ahead log for A
- Normally, each node accesses only its own log
- If A fails, another node will recover logA
Transactions on DBm II
- Take locks on pages, from the appropriate lock-manager
- Write open-transaction to logA
- Add log-records and modify pages in memory
- Write close-transaction to logA
- Release locks
- Modified pages need to be written to disk prior to releasing locks
- A node can do write-back caching as long as other nodes do not request a page
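The steps above can be sketched as one function. This is a sketch under stated assumptions: the lock-manager interface (`lock`, `flush_and_unlock`), the log as an append-only list, and the record shapes are all hypothetical, and the write-back-caching optimization is omitted for clarity.

```python
def run_transaction(lock_mgr, log, pages, updates):
    """Sketch of one DBm transaction (hypothetical interfaces).

    pages:   in-memory page cache, page_id -> value
    updates: list of (page_id, new_value) pairs to apply
    """
    # 1. Take locks on the pages from the appropriate lock-manager.
    for page_id, _ in updates:
        lock_mgr.lock(page_id)
    # 2. Write open-transaction to the node's log (logA).
    log.append(("open-transaction",))
    # 3. Add log-records and modify pages in memory (log record first:
    #    write-ahead logging).
    for page_id, new_value in updates:
        log.append(("update", page_id, new_value))
        pages[page_id] = new_value
    # 4. Write close-transaction to the log.
    log.append(("close-transaction",))
    # 5. Modified pages must reach disk before the locks are released;
    #    flush_and_unlock models that ordering.
    for page_id, _ in updates:
        lock_mgr.flush_and_unlock(page_id)
```

With write-back caching, step 5 would be deferred: the node keeps dirty pages and locks until another node actually requests one of the pages.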
Example Transaction
- Assume table T contains keys K1 and K2, with data D1 and D2 respectively
- Node A wishes to switch the values D1 and D2
- Node A:
  - Takes read-locks for K1 and K2
  - Reads values D1 and D2
  - Takes write-locks for K1 and K2
  - Modifies the value of K1 to D2; adds a log entry
  - Modifies the value of K2 to D1; adds a log entry
  - Releases locks
DBm: Rollback
- Basically, DBm uses the DBs solution
- Assume node A is performing transaction T; initially A holds a lock on logA and a set of pages
- To rollback T, A needs to:
  - Perform the set of log-entries in undo mode and add a CLR to logA for each modification
  - Read the pages from disk, modify them, and write them back to disk
  - Release all locks
- Since A initially holds the set of locks, no deadlocks
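The undo pass can be sketched as follows, assuming hypothetical log-record shapes that carry both the old and new value (before- and after-images); the real DBm record format is not given in the slides.

```python
def rollback(log, pages, txn_records):
    """Sketch of rolling back transaction T on node A (record shapes assumed).

    txn_records: T's log entries, each ("update", page_id, old_value, new_value).
    A already holds locks on logA and on the touched pages, so no deadlock
    can arise during the rollback itself.
    """
    # Undo in reverse order, emitting a CLR (compensation log record) to
    # logA for each modification so a crash mid-rollback is recoverable.
    for kind, page_id, old_value, new_value in reversed(txn_records):
        assert kind == "update"
        pages[page_id] = old_value          # restore the before-image
        log.append(("CLR", page_id, old_value))
    # After the undo pass the caller writes the pages back and releases
    # all locks, as in the slide above.
```

Logging CLRs rather than merely restoring pages is the ARIES discipline: if the node crashes during rollback, recovery resumes from the last CLR instead of undoing the same entries twice.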
Recovery: Lease Expiration
- Node A loses the lease to a lock-manager on OSD X
- Case: logA is on X
  - A breaks all connections to lock-managers
  - All of A's pages are marked to-recover in XLKM
  - Full recovery is needed
- Case: logA is not on X
  - A reconnects to XLKM
- If B attempts to lock page P, it sees to-recover; B attempts to lock logA, fails, and releases the lock on P
Recovery: Compute Node Failure
- Scenario: node A fails and recovers
- LogA needs to be replayed; recovery is done ARIES style
  - Take an exclusive lock on logA
  - Perform a redo scan, then an undo scan
- A log entry E that applies to record R in page P is replayed by the following sequence:
  - Take a lock on P
  - Check if P's LSN is lower than E's LSN; if so, then apply the update
- Node B can recover logA if A does not recover
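The replay sequence for a single log entry can be sketched as below. The function and record shapes are hypothetical; what the sketch shows is the LSN comparison from the slide, which makes replay idempotent.

```python
def replay_entry(lock_mgr, pages, page_lsn, entry):
    """Sketch of redoing one log entry E during recovery (names hypothetical).

    entry: (lsn, page_id, new_value). The update is applied only when the
    page's stamped LSN is lower than the entry's LSN, so updates already
    on disk are not redone.
    """
    e_lsn, page_id, new_value = entry
    lock_mgr.lock(page_id)                     # take a lock on P
    try:
        if page_lsn.get(page_id, 0) < e_lsn:   # is P's LSN lower than E's?
            pages[page_id] = new_value         # if so, apply the update
            page_lsn[page_id] = e_lsn          # and re-stamp the page
    finally:
        lock_mgr.unlock(page_id)
```

The correctness of the `page_lsn < e_lsn` test is exactly what the later slides on LSN synchronization are about: it only works if per-page LSNs are monotonically increasing across the whole cluster.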
Applications
- Possible applications:
  - Databases with low sharing
  - Search-mostly databases
  - Storage-Tank meta-data?

[Diagram: an MDS cluster of servers S1, S2, S3, each with its own log object Log(S1), Log(S2), Log(S3), sharing a directory structure on object disks accessed by clients over the network]
Summary
- We have shown a method to construct clustered databases without group-services
- Pros
  - Good scalability
  - Good performance on low-sharing workloads
- Cons
  - Bad performance on high-sharing workloads
Acknowledgments
- Avi Teperman
- Gary Valentin
- Effi Offer
- Mark Hayden
LSNs
Log Sequence Number (LSN)
- Each page is stamped with an LSN
- LSNs are monotonically increasing and chosen by the log-manager component
- Node A performs:
  - Take read-locks for K1 and K2
  - Read values D1 and D2
  - Take write-locks for K1 and K2
  - Modify the value of K1 to D2; add a log entry to LogA with LSN=8; the page where K1 is located is marked with LSN=8
  - Modify the value of K2 to D1; add a log entry to LogA with LSN=11; the page where K2 is located is marked with LSN=11
  - Release locks
Synchronizing LSNs
- There needs to be a way to synchronize the LSNs; this is required for ARIES to work
- Example
  - Node A takes the write-lock for page P
  - Node A modifies P and marks it with LSN 10
  - Node A writes P to disk and releases the lock
  - Node B takes the write-lock for P
  - Node B modifies P and marks it with LSN 6
  - Node B writes P to disk and releases the lock
- After this sequence, if A modifies P and fails, during recovery the log entry with LSN 10 will be redone a second time: the page on disk is stamped with LSN 6, which is lower than 10, so the redo test wrongly concludes the update was never applied
LSN Solution
- When reading a page from disk, a node updates its maximal LSN to the maximum of its current LSN and the page's LSN
- This ensures a monotonically increasing LSN per page; this is sufficient
- It is relatively cheap, as no cluster-wide synchronization is needed
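The fix above can be sketched as a small per-node counter; the class and method names are hypothetical stand-ins for the log-manager's LSN component.

```python
class LSNAllocator:
    """Sketch of a per-node LSN counter with the page-read max-update rule."""

    def __init__(self):
        self.max_lsn = 0

    def observe_page(self, page_lsn):
        # On reading a page from disk, raise this node's counter to at
        # least the page's stamped LSN. This is the whole fix: it needs
        # no cluster-wide synchronization, only a local max.
        self.max_lsn = max(self.max_lsn, page_lsn)

    def next_lsn(self):
        # New log entries always get an LSN above anything this node has
        # seen on any page, so per-page LSNs stay monotonically increasing.
        self.max_lsn += 1
        return self.max_lsn
```

In the two-node example from the previous slide, node B would call `observe_page(10)` when it reads P, so its next update stamps P with an LSN above 10 and the double-redo cannot occur.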