31
1 Storage in Database Systems CMPSCI 445 Fall 2010

Lecture storage-buffer

Embed Size (px)

DESCRIPTION

Storage

Citation preview

Page 1: Lecture storage-buffer

1

Storage in Database Systems

CMPSCI 445Fall 2010

Page 2: Lecture storage-buffer

Storage Topics

Architecture and Overview Disks Buffer management Files of records

2

Page 3: Lecture storage-buffer

3

DBMS Architecture

Disk Space Manager

File & Access Methods

Buffer Manager

Query Parser

Query Rewriter

Query Optimizer

Query Executor

Lock Manager Log Manager

Page 4: Lecture storage-buffer

4

Data on External Storage Disks: Can retrieve random page at fixed cost

But reading several consecutive pages is much cheaper than reading them in random order

Tapes: Can only read pages in sequence Cheaper than disks; used for archival storage

Page: Unit of information read from or written to disk Size of page: DBMS parameter, 4KB, 8KB

Disk space manager: Abstraction: a collection of pages. Allocate/de-allocate a page. Read/write a page.

Page I/O: Pages read from disk and pages written to disk Dominant cost of database operations

Page 5: Lecture storage-buffer

5

Buffer Management Architecture:

Data is read into memory for processing Data is written to disk for persistent storage

Buffer manager stages pages between external storage and main memory buffer pool.

Access method layer makes calls to the buffer manager.

Page 6: Lecture storage-buffer

6

Access Methods Access methods: routines to manage various disk-based

data structures. Files of records Various kinds of indexes

File of records: Important abstraction of external storage in a DBMS! Record id (rid) is sufficient to physically locate a record

Indexes: Auxiliary data structures Given values in index search key fields, find the record ids of

records with those values

Page 7: Lecture storage-buffer

7

File organizations & access methods

Many alternatives exist, each ideal for some situations, and not so good in others: Heap (unordered) files: Suitable when typical

access is a file scan retrieving all records. Sorted Files: Best if records must be retrieved in

some order, or only a `range’ of records is needed. Indexes: Data structures to organize records via

trees or hashing. • Like sorted files, they speed up searches for a subset of

records, based on values in certain (“search key”) fields• Updates are much faster than in sorted files.

Page 8: Lecture storage-buffer

Disks

8

Page 9: Lecture storage-buffer

9

Disks and DBMS Design

DBMS stores information on disks. This has major implications for DBMS design!

READ: transfer data from disk to main memory (RAM) for data processing.

WRITE: transfer data from RAM to disk for persistent storage.

Both are high-cost operations, relative to in-memory operations, so must be planned carefully!

Page 10: Lecture storage-buffer

10

Why Not Store Everything in Main Memory?

Main memory is volatile. We want data to be saved between runs. (Obviously!)

Costs too much. $100 will buy you either 4GB of RAM or 1TB of disk today.

32-bit addressing limitation. 232 bytes can be directly addressed in memory. Number of objects cannot exceed this number.

Page 11: Lecture storage-buffer

11

Basics of Disks

Unit of storage and retrieval: disk block or page. A disk block/page is a contiguous sequence of bytes. Size of a DBMS parameter, 4KB or 8KB.

Disks support direct access to a page. Unlike RAM, time to retrieve a page varies!

It depends upon the location on disk. Therefore, relative placement of pages on disk has

major impact on DBMS performance!

Page 12: Lecture storage-buffer

12

Components of a Disk

Platters

Platters spin (say, 7200rpm).

Spindle

Arm assembly is moved in or out to position a head on a desired track.

Disk head

Arm movement

Arm assembly

Only one head reads/ writes at any one time.

Tracks under heads make a cylinder (imaginary!). Each track is divided into sectors (whose size is fixed). Block size is a multiple of sector size.

Sector

Track

Page 13: Lecture storage-buffer

13

Accessing a Disk Page

Time to access (read/write) a disk block: seek time (moving arms to position disk head on track) rotational delay (waiting for block to rotate under head) transfer time (actually moving data to/from disk surface)

Seek time and rotational delay dominate. Seek time varies from about 2 to 30 msec Rotational delay varies from 0 to 11 msec Transfer rate between 25 to 100 MB/s

Key to lower I/O cost: reduce seek/rotation delays!

Page 14: Lecture storage-buffer

14

Arranging Pages on Disk

`Next’ block concept: blocks on same track, followed by blocks on same cylinder, followed by blocks on adjacent cylinder

Blocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay.

For a sequential scan, pre-fetching several pages at a time is a big win!

Page 15: Lecture storage-buffer

Buffer Management

15

Page 16: Lecture storage-buffer

16

Buffer Management in a DBMS

Data must be in RAM for DBMS to operate on it! Table of <frame#, pageid> pairs is maintained.

DB

MAIN MEMORY

DISK

disk page

free frame

Page Requests from Higher Levels

BUFFER POOL

choice of frame dictatedby replacement policy

Page 17: Lecture storage-buffer

17

More on Buffer Management

Requestor of page must unpin it, and indicate whether page has been modified: dirty bit is used for this.

Page in pool may be requested many times, a pin count is used. A page is a candidate for

replacement iff pin count = 0. CC & recovery may entail additional I/O

when a frame is chosen for replacement. (Write-Ahead Log protocol; more later.)

Page 18: Lecture storage-buffer

18

When a Page is Requested ...

If requested page is not in pool: Choose a frame for replacement If frame is dirty, write it to disk Read requested page into chosen frame

Pin the page and return its address.

If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time!

Page 19: Lecture storage-buffer

19

Buffer Replacement Policy

Frame is chosen for replacement by a replacement policy: Least-recently-used (LRU), Clock, MRU etc.

Policy can have big impact on # of I/O’s; depends on the access pattern.

Sequential flooding: Nasty situation caused by LRU + repeated sequential scans. # buffer frames < # pages in file means each page

request causes an I/O. MRU much better in this situation (but not in all situations, of course).

Page 20: Lecture storage-buffer

20

DBMS vs. OS File System OS does disk space & buffer mgmt: why not let

OS manage these tasks?

Differences in OS support: portability issues Some limitations, e.g., files can’t span disks. Buffer management in DBMS requires ability to:

pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),

adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.

Page 21: Lecture storage-buffer

Files

21

Page 22: Lecture storage-buffer

22

Files

Access method layer offers an abstraction of data on disk: a file of records residing on multiple pages A number of fields are organized in a record A collection of records are organized in a page A collection of pages are organized in a file

Page 23: Lecture storage-buffer

23

Files of Records

Page or block is OK when doing I/O, but higher levels of DBMS operate on records and files of records.

FILE: A collection of pages, each containing a collection of records. Must support: insert/delete/modify record read a particular record (specified using record id) scan all records (possibly with some conditions on

the records to be retrieved)

Page 24: Lecture storage-buffer

24

Unordered (Heap) Files

Simplest file structure contains records in no particular order.

As file grows and shrinks, disk pages are allocated and de-allocated.

To support record level operations, we must: keep track of the pages in a file keep track of free space on pages keep track of the records on a page

There are many alternatives for keeping track of this.

Page 25: Lecture storage-buffer

25

Heap File Implemented as a List

(heap file name, header page id) stored in a known place. Two doubly linked lists, for full pages & pages with space.

Each page contains 2 `pointers’ plus data.

Upon insertion, scan the list of pages with space, or ask disk space manager to allocate a new page

HeaderPage

DataPage

DataPage

DataPage

DataPage

DataPage

DataPage Pages with

Free Space

Full Pages

Page 26: Lecture storage-buffer

26

Heap File Using a Page Directory

A directory entry per page; it can include the number of free bytes on the page.

The directory is a collection of pages; linked list implementation is just one alternative. Much smaller than linked list of all HF pages!

DataPage 1

DataPage 2

DataPage N

HeaderPage

DIRECTORY

Page 27: Lecture storage-buffer

27

Page Format

How to store a collection of records on a page? Consider a page as a collection of slots, one for

each record. A record is identified by rid = <page id, slot #> Record ids (rids) are used in indexes

(Alternatives 2 and 3).

Page 28: Lecture storage-buffer

28

Page Format: Fixed Length Records

Moving records for free space management changes rid! May not be acceptable.

Slot 1Slot 2

Slot N

PACKED

N

FreeSpace

number of records

. . .

M10. . .M ... 3 2 1

UNPACKED, BITMAP

Slot 1Slot 2

Slot N

Slot M

11

numberof slots

. . .

Page 29: Lecture storage-buffer

29

Page Format: Variable Length Records

Can move records on page without changing rid; so, attractive for fixed-length records too. (level of indirection)

Page iRid = (i,N)

Rid = (i,2)

Rid = (i,1)

20 16 24 N Pointerto startof freespace

SLOT DIRECTORY

N . . . 2 1 # slots

Rid of record is <page id, slot #>

Page 30: Lecture storage-buffer

30

Record Format: Fixed Length

Information of a record type e.g., the number of fields and field types is stored in the system catalog.

Fixed length record: (1) the number of fields is fixed, (2) each field has a fixed length.

Store fields consecutively in a record. Finding i’th field does not require scan of record.

Base address (B)

L1 L2 L3 L4

F1 F2 F3 F4

Address = B+L1+L2

Page 31: Lecture storage-buffer

31

Record Format: Variable Length Variable length record: (1) number of fields is fixed, (2)

some fields are variable length Two alternatives:

Second offers direct access to i’th field, efficient storage of NULLs ; small directory overhead.

$ $ $ $

Scan

Fields Delimited by Special Symbols

F1 F2 F3 F4

F1 F2 F3 F4S1 S2 S3 S4 E4

Array of Field Offsets