IBM Labs in Haifa © 2003 IBM Corporation Building a Distributed Database with Device Served Leases or, Distributed ARIES Ohad Rodeh



IBM Labs in Haifa – Object Based Storage


Presentation Structure

Motivation
A single node database, DBs
  The database uses object-disks instead of regular disks
A distributed database, DBm
  Based on DBs
Summary
Acknowledgments


Motivation

Object-Disks (OSDs) are a novel storage appliance
  They allow adding new functionality to disks
Clustered databases have been built using group-services
Adding leases to OSDs allows constructing database clusters without group-services
Group services limit scalability and complicate programming
  Who is alive? Who should be fenced out?
  A fencing mechanism is needed


A Clustered Database

Shared everything
  The disks are shared
Shared nothing
  Disks are local to database servers
DB2 uses (mostly) shared nothing
  The mainframe version uses shared-disks
Oracle uses shared disks
This paper is focused on shared disks

[Figure: client, network, database compute nodes, object disks]


Building it the Old Way

A GCS connects the compute-nodes
If a compute-node is declared dead it is fenced out
Fencing is supported by the switch
Result: complexity
  To build a clustered database one needed to sew together
    GCS
    Database
    Switch

[Figure: client, network, database compute nodes connected by a GCS, disks, fencing inside the switch]


An Object Disk

An object disk is
  An appliance connected to the network
  Talks a standard protocol
  Implements a flat file-system
SNIA has a working group on standardization
  Participating companies: Panasas, IBM, HP, Veritas, Seagate, …


A Single Node Database, DBs

Mapping tables to objects
  A database table is realized as an object on an OSD
ARIES
  DBs is assumed to use ARIES
  Each journal entry refers to a single page
Locking
  Transactions can get into deadlocks
  Deadlock detection is used
  After detection the database chooses a victim and aborts it.


Distributed Locking I

DBm is based on DBs
DBm requires distributed locking
Lease support in the OSD
  Each OSD provides a major-lease
  The lease is valid for, say, 30 seconds
  The holder of the major-lease can perform operations on the OSD
  The major-lease can be delegated


Locking for records, pages, tables

A table is composed of records
A database provides locking on a record basis
Here we assume distributed locking is done per page
Internally to a node per-record locking is provided


Using the OSD Lease

Per OSD a lock server is run on a compute-node
  For OSD X it is XLKM
  XLKM takes the major lease for X and provides page-level/object-level locking
Requests with outdated leases are rejected

[Figure: OSD X and XLKM, with messages L=Lease(30), L, and {Read(OID), L}]
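
The lease check described on this slide can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the OSD protocol: the class `ObjectDisk`, its method names, the integer lease id, and the injected clock are all hypothetical. Lease renewal and delegation are left out.

```python
class LeaseError(Exception):
    """Raised when a request carries a stale or expired lease."""


class ObjectDisk:
    """Toy model of an OSD serving a single major-lease (hypothetical API)."""

    def __init__(self, clock):
        self._clock = clock        # callable returning the current time in seconds
        self._objects = {}         # flat namespace: object id -> bytes
        self._lease_id = 0         # id of the most recently granted lease
        self._expires_at = 0.0     # time at which that lease lapses

    def lease(self, duration):
        """Grant the major-lease if no unexpired lease is outstanding."""
        now = self._clock()
        if now < self._expires_at:
            raise LeaseError("major-lease is held by another node")
        self._lease_id += 1
        self._expires_at = now + duration
        return self._lease_id

    def _check(self, lease_id):
        # A request is accepted only with the current, unexpired lease.
        if lease_id != self._lease_id or self._clock() >= self._expires_at:
            raise LeaseError("outdated lease")

    def write(self, oid, data, lease_id):
        self._check(lease_id)
        self._objects[oid] = data

    def read(self, oid, lease_id):
        self._check(lease_id)
        return self._objects[oid]
```

A lock server would call `lease(30)` once and attach the returned value to every request, in the spirit of `{Read(OID), L}`. A node that was partitioned away keeps an old lease value, so its late requests fail the check without any switch-level fencing.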


An OSD lock-server

Locks are hardened to an object on X, Xlocks

If the compute-node fails, the locks are recoverable

[Figure: OSD X holding the Xlocks object, and XLKM, with messages L=Lease(30), L, and {Read(OID), L}]


Connecting to a lock-manager

Compute-nodes connect to XLKM

  Can take and release locks on pages and tables on X
  XLKM gives the client the major-lease for X
  This allows the client direct access to the OSD


Connecting to a lock-manager II

The client takes a lease on XLKM
  The lease protects locks taken
  If the lease is not renewed in time, the locks are broken
The client provides the location of its log to XLKM
When the lease is broken, the locks are revoked and the pages are marked to-recover
The next client to take a lock on a to-recover page is provided with the log and needs to perform recovery
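
The behaviour on this slide can be sketched as a toy lock manager. Everything here is an assumption made for illustration: the class `LockManager`, its method names, the exclusive-only page locks, and the explicit `recovered` call that clears the to-recover mark. Hardening the lock table to the Xlocks object is not modelled.

```python
class LockManager:
    """Toy XLKM: page locks guarded by per-client leases (hypothetical API)."""

    def __init__(self, clock, lease_seconds=30):
        self._clock = clock
        self._lease_seconds = lease_seconds
        self._clients = {}      # client -> (lease expiry, log location)
        self._owner = {}        # page -> client holding the lock
        self._to_recover = {}   # page -> log location of the failed holder

    def connect(self, client, log_location):
        """Take a lease on the lock manager and register the client's log."""
        expiry = self._clock() + self._lease_seconds
        self._clients[client] = (expiry, log_location)

    def renew(self, client):
        self._expire()
        if client not in self._clients:
            raise KeyError("lease already broken; reconnect")
        _, log = self._clients[client]
        self._clients[client] = (self._clock() + self._lease_seconds, log)

    def _expire(self):
        """Revoke the locks of every client whose lease lapsed."""
        now = self._clock()
        for client, (expiry, log) in list(self._clients.items()):
            if now >= expiry:
                del self._clients[client]
                for page, owner in list(self._owner.items()):
                    if owner == client:
                        del self._owner[page]
                        self._to_recover[page] = log   # mark to-recover

    def lock(self, client, page):
        """Lock a page; return None if clean, else the log to replay first."""
        self._expire()
        if client not in self._clients:
            raise KeyError("no valid lease")
        if self._owner.get(page, client) != client:
            raise RuntimeError("page is locked by another client")
        self._owner[page] = client
        return self._to_recover.get(page)

    def recovered(self, page):
        """Clear the to-recover mark once the new holder has replayed the log."""
        self._to_recover.pop(page, None)

    def unlock(self, client, page):
        if self._owner.get(page) == client:
            del self._owner[page]
```

In this sketch a caller that receives a log location from `lock` is expected to replay that log against the page and then call `recovered`.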


Deadlocks

Deadlocks can happen in DBm
For local deadlocks the DBs algorithm is sufficient
For distributed deadlocks there is known literature
For example:
  Once in a while each compute node requests the set of locks from other compute nodes
  Search for cycles
  For each cycle kill a victim transaction
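
The cycle search in the example above can be sketched as a depth-first search over a merged wait-for graph. The graph representation (a dict from a transaction to the set of transactions it waits on), the function names, and the victim policy (largest id) are assumptions made for illustration; the slide does not prescribe any of them.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of transactions, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    stack = []

    def visit(txn):
        colour[txn] = GREY
        stack.append(txn)
        for other in sorted(waits_for.get(txn, ())):
            state = colour.get(other, WHITE)
            if state == GREY:                       # back edge: cycle found
                return stack[stack.index(other):]
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        stack.pop()
        colour[txn] = BLACK
        return None

    for txn in sorted(waits_for):
        if colour.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None


def resolve_deadlocks(waits_for, abort):
    """Abort one victim per cycle until the graph is acyclic; return the victims."""
    graph = {txn: set(ws) for txn, ws in waits_for.items()}   # work on a copy
    victims = []
    while True:
        cycle = find_cycle(graph)
        if cycle is None:
            return victims
        victim = max(cycle)          # placeholder policy, not from the slides
        abort(victim)
        victims.append(victim)
        graph.pop(victim, None)
        for ws in graph.values():
            ws.discard(victim)
```

A node would build `waits_for` by merging its own lock table with the lock sets it collected from the other compute nodes, then call `resolve_deadlocks` with a callback that aborts the chosen transaction.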


Tables are implemented as B-trees

Tables are implemented as B+-trees

The table is physically allocated on an OSD object

The internal nodes contain keys

The leaf nodes contain keys and data

Each node is represented as an 8K page

Each page (and key) can be locked separately

[Figure: example B+-tree with internal key nodes and leaf pages]


Transactions on DBm I

Each client has a log object
  If A is a compute-node then logA is the log
  logA contains the write-ahead log for A
Normally, each node accesses only its own log
If A fails another node will recover logA


Transactions on DBm II

Take locks on pages
  From appropriate lock-manager
Write open-transaction to logA
Add log-records and modify pages in memory
Write close-transaction to logA
Release locks
  Modified pages need to be written to disk prior to releasing locks
  A node can do write-back caching as long as other nodes do not request a page
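
The order of steps on this slide can be sketched as a single function. The names `LockTable` and `run_transaction`, the tuple layout of log records, and the use of plain dicts and lists for the cache, the OSD object, and logA are assumptions made for illustration. The sketch flushes pages eagerly on every commit; the write-back caching that the slide allows is not modelled.

```python
class LockTable:
    """Stand-in for the lock manager: exclusive page locks only."""

    def __init__(self):
        self.held = set()

    def lock(self, page):
        if page in self.held:
            raise RuntimeError("page is already locked")
        self.held.add(page)

    def unlock(self, page):
        self.held.discard(page)


def run_transaction(txn_id, updates, locks, log, cache, disk):
    """Apply updates (page -> new value) in the order given on the slide.

    log is a list standing in for logA, cache holds in-memory pages,
    and disk is a dict standing in for the OSD object.
    """
    pages = sorted(updates)
    for page in pages:                       # 1. take locks on pages
        locks.lock(page)
    log.append(("open", txn_id))             # 2. write open-transaction
    for page in pages:                       # 3. log, then modify in memory
        old = cache.get(page, disk.get(page))
        log.append(("update", txn_id, page, old, updates[page]))
        cache[page] = updates[page]
    log.append(("close", txn_id))            # 4. write close-transaction
    for page in pages:                       # 5. write modified pages to disk
        disk[page] = cache[page]             #    before the locks are released
    for page in pages:                       # 6. release locks
        locks.unlock(page)
```

Logging the old value next to the new one is what later makes undo possible; the slides do not spell out the record layout, so treat it as one reasonable choice.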


Example Transaction

Assume table T contains
  Keys K1 and K2
  Data D1 and D2 respectively
Node A
  Wishes to swap the values D1 and D2
Node A
  Takes read-locks for K1 and K2
  Reads values D1 and D2
  Takes write-locks for K1 and K2
  Modifies value of K1 to D2
    Adds a log entry
  Modifies value of K2 to D1
    Adds a log entry
  Releases locks


DBm: Rollback

Basically, DBm uses the DBs solution
Assume node A is performing transaction T
  Initially A holds a lock on logA and a set of pages
To roll back T, A needs to
  Perform the set of log-entries in undo mode and add a CLR to logA for each modification
  Read the pages from disk, modify, write back to disk
  Release all locks
Since A initially holds the set of locks, no deadlocks can occur
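
The undo pass can be sketched as follows. The tuple layout of log records and the function name `rollback` are assumptions made for illustration; the caller is assumed to already hold the locks on logA and on every page involved, as the slide states. CLR is the ARIES compensation log record.

```python
def rollback(txn_id, log, pages):
    """Undo txn_id's updates newest-first, appending a CLR for each one.

    Log entries are tuples: ("update", txn, page, old, new) for a change
    and ("clr", txn, page, restored_value) for its compensation record.
    pages maps page id -> current value.
    """
    updates = [e for e in log if e[0] == "update" and e[1] == txn_id]
    for _, _, page, old, _new in reversed(updates):
        pages[page] = old                        # undo the modification
        log.append(("clr", txn_id, page, old))   # record that it was undone
    return len(updates)
```

Because every page touched by the undo is already locked by A, the function never has to wait for a lock, which is the reason the slide gives for the absence of deadlocks.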


Recovery: Lease Expiration

Node A loses the lease to a lock-manager on OSD X
logA is on X
  A breaks all connections to lock-managers
  All A’s pages are marked to-recover in XLKM
  Full recovery is needed
logA is not on X
  A reconnects to XLKM
  If B attempts to lock page P, it sees to-recover
  B attempts to lock logA, fails, and releases the lock on P


Recovery: Compute Node Failure

Scenario: node A fails and recovers
logA needs to be replayed
Recovery is done ARIES style
  Take exclusive lock on logA
  Perform redo scan, then undo scan
A log entry E that applies to record R in page P is replayed by the following sequence:
  Take lock on P
  Check if PLSN is lower than ELSN
  If so, then apply the update
Node B can recover logA if A does not recover
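
The redo rule above (apply entry E to page P only if PLSN is lower than ELSN) can be sketched as follows. The `Page` dataclass, the tuple layout of log entries, and the function name `redo` are assumptions made for illustration; locking of logA and of each page is reduced to a comment.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int        # PLSN: LSN of the last update applied to this page
    records: dict   # record key -> value


def redo(log, pages):
    """Replay entries (lsn, page_id, record, value); return the LSNs applied."""
    applied = []
    for lsn, page_id, record, value in log:
        page = pages[page_id]        # stands in for "take lock on P" and read P
        if page.lsn < lsn:           # PLSN lower than ELSN: update is missing
            page.records[record] = value
            page.lsn = lsn
            applied.append(lsn)
    return applied
```

The comparison makes the redo scan idempotent: running it a second time over the same log applies nothing, which matters if the recovering node itself fails midway.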


Applications

Possible applications:
  Databases with low sharing
  Search-mostly databases
  Storage-Tank meta-data?

[Figure: clients and an MDS cluster (S1, S2, S3) on a network with object disks holding Log(S1), Log(S2), Log(S3) and the directory structure]


Summary

We have shown a method to construct clustered databases without group-services

Pros
  Good scalability
  Good performance on low-sharing workloads
Cons
  Bad performance on high-sharing workloads


Acknowledgments

Avi Teperman
Gary Valentin
Effi Offer
Mark Hayden


LSNs


Log Sequence Number (LSN)

Each page is stamped with an LSN
LSNs are monotonically increasing and chosen by the log-manager component
Node A performs
  Take read-locks for K1 and K2
  Read values D1 and D2
  Take write-locks for K1 and K2
  Modify value of K1 to D2
    Add a log entry to LogA with LSN=8
    The page where K1 is located is marked with LSN=8
  Modify value of K2 to D1
    Add a log entry to LogA with LSN=11
    The page where K2 is located is marked with LSN=11
  Release locks


Synchronizing LSNs

There needs to be a way to synchronize the LSNs; this is required for ARIES to work
Example
  Node A takes write-lock for page P
  Node A modifies P and marks it with LSN 10
  Node A writes P to disk and releases the lock
  Node B takes the write-lock for P
  Node B modifies P and marks it with LSN 6
  Node B writes P to disk and releases the lock
After this sequence
  If A modifies P and fails
  During recovery the log entry with LSN 10 will be redone twice.


LSN Solution

When reading a page from disk, a node updates its maximal LSN to the maximum of its current LSN and the page LSN

This ensures a monotonically increasing LSN per page. This is sufficient.
It is relatively cheap, as no cluster-wide synchronization is needed
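
The rule above fits in a few lines. The class `LogManager` and its method names are assumptions made for illustration, as is the choice of integer LSNs that grow by one per log entry.

```python
class LogManager:
    """Per-node LSN allocator that follows the page LSNs it observes."""

    def __init__(self):
        self.max_lsn = 0

    def observe_page(self, page_lsn):
        """Call whenever a page is read from disk."""
        self.max_lsn = max(self.max_lsn, page_lsn)

    def next_lsn(self):
        """Allocate the LSN for the next log entry on this node."""
        self.max_lsn += 1
        return self.max_lsn
```

With node A at LSN 9 and node B at LSN 5, A stamps page P with 10. When B later reads P it observes 10, so its next entry gets LSN 11 instead of 6, and the LSN on P never goes backwards.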