Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)

2 December 2005

Introduction to Databases: Storage Management

Prof. Beat Signer

Department of Computer Science

Vrije Universiteit Brussel

http://www.beatsigner.com


Beat Signer - Department of Computer Science - [email protected], April 3, 2015

Context of Today's Lecture

[Figure: components of a DBMS, showing programmers, users and DB admins; application programs, queries and database schema; DML preprocessor, query compiler, DDL compiler, program object code and catalogue manager; database manager, authorisation control, command processor, integrity checker and query optimiser; data manager, transaction manager, scheduler, buffer manager and recovery manager; file manager, access methods and system buffers; data, indices and system catalogue]

Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010


Storage Device Hierarchy

Storage devices vary in

data capacity

access speed

cost per byte

Devices with the fastest access time have the highest cost and the smallest capacity

[Figure: storage hierarchy from top to bottom: Cache, Main Memory, Flash Memory, Magnetic Disk, Optical Disk, Magnetic Tapes]


Cache

On-board cache on the same chip as the microprocessor

level 1 (L1) cache

temporary storage of instructions and data

typical size of ~64 kB

Extra cache levels located on separate chips

e.g. level 2 (L2) cache

typical size of ~1 MB

Data items in the cache are copies of values in main memory locations

If data in the cache has been updated, the changes must be reflected in the corresponding memory locations


Main Memory

Main memory can be several gigabytes large

Normally too small and too expensive for storing the entire database

content is lost during power failure or crash (volatile memory)

in-memory databases (IMDB) primarily rely on main memory

- note that IMDBs lack durability (D of the ACID properties)

IMDB size limited by the maximal addressable memory space

- e.g. maximal 4 GB for 32-bit address space

Random access memory (RAM)

time to access data is more or less independent of its location (different from magnetic tapes)

Typical access time of ~10 nanoseconds (10^-8 seconds)


Secondary Storage (Hard Disk)

Essentially random access

Files are moved between a hard disk and main memory (disk I/O) by the operating system (OS) or the DBMS

the transfer units are blocks

tendency for larger block sizes

Parts of the main memory are used to buffer blocks

the buffer manager of the DBMS manages the loading and unloading of blocks for specific DBMS operations

Typical block I/O time (seek time) ~10 milliseconds

1'000'000 times slower than main memory access

Capacity of multiple terabytes and a system can use many disk units


Hard Disk

A hard disk contains one or more platters and one or more heads

The platters were originally addressed in terms of cylinders, heads and sectors (blocks)

cylinder-head-sector (CHS) scheme

max of 1024 cylinders, 16 heads and 63 sectors

Current hard disks offer logical block addressing (LBA)

hides the physical disk geometry
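The relation between the two addressing schemes can be illustrated with the conventional CHS-to-LBA conversion. The following is a minimal sketch: the function names are illustrative, the default geometry is the 16 heads and 63 sectors per track mentioned above, and sectors are assumed to be numbered from 1.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder=16, sectors_per_track=63):
    """Convert a cylinder-head-sector address to a logical block address."""
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range (sectors are 1-based)")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder=16, sectors_per_track=63):
    """Inverse mapping from a logical block address to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1
```

With 1024 cylinders this geometry addresses 1024 * 16 * 63 = 1'032'192 blocks, which shows why a flat block number is the more convenient interface.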


Solid-State Drives (SSD)

Storage device that uses solid-state memory

(flash memory) to persistently store data

Offers a hard disk interface with a storage capacity of up

to a few hundred gigabytes

Typical block I/O time (seek time) ~0.1 milliseconds

SSDs might help to reduce the gap between primary and

secondary storage in DBMS systems

Currently there are still some limitations of SSDs

the limited number of SSD write operations before failure can be a problem for DBs with a lot of update operations

write operations are often still much slower than read operations


Tertiary Storage

No random access

access time depends on data location

Different devices

tapes

optical disk jukeboxes

- racks of CD-ROMs (read only)

tape silos

- room-sized devices holding racks of tapes operated by tape robots

- e.g. StorageTek PowderHorn with up to 28.8 petabytes


Models of Computation

RAM model of computation

assumes that all data is held in main memory

DBMS model of computation

assumes that data does not fit into main memory

efficient algorithms must take into account secondary and even tertiary storage

best algorithms for processing large amounts of data often differ from those for the RAM model of computation

minimising disk accesses plays a major role

- I/O model of computation

I/O model of computation

the time to move a block between disk and memory is much higher than the time for the corresponding computation


Accelerating Secondary Storage Access

Various possible strategies to improve secondary storage access

placement of blocks that are often accessed together on the same disk cylinder

distribute data across multiple disks to profit from parallel disk accesses (e.g. RAID)

mirroring of data

use of disk scheduling algorithms in OS, DBMS or disk controller to determine the order of requested block reads/writes

- e.g. elevator algorithm

prefetching of disk blocks

efficient caching

- main memory

- disk controllers
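The elevator algorithm mentioned above can be sketched as follows. This is a simplified illustration and not the scheduler of any particular system: requests are plain cylinder numbers, all requests are known in advance, and the head reverses at the last pending request rather than at the edge of the disk. The function names and the request list are made up for the example.

```python
def elevator_order(requests, head, direction=1):
    """Order in which pending cylinder requests are served.

    The head sweeps in `direction` (+1 towards higher cylinders, -1 towards
    lower ones), serves every request on its way and then reverses once.
    """
    ahead = sorted(c for c in requests if (c - head) * direction >= 0)
    behind = sorted(c for c in requests if (c - head) * direction < 0)
    if direction > 0:
        return ahead + behind[::-1]
    return ahead[::-1] + behind


def total_seek_distance(order, head):
    """Number of cylinders the head travels when serving `order`."""
    distance = 0
    for cylinder in order:
        distance += abs(cylinder - head)
        head = cylinder
    return distance


requests = [98, 183, 37, 122, 14, 124, 65, 67]
fcfs_distance = total_seek_distance(requests, head=53)
elevator_distance = total_seek_distance(elevator_order(requests, head=53), head=53)
```

For this request list the head travels 640 cylinders when the requests are served in arrival order and 299 cylinders with the elevator order.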


Redundant Array of Independent Disks

The redundant array of independent disks (RAID) organisation technique provides a single disk view for a number (array) of disks

divide and replicate data across multiple hard disks

introduced in 1987 by D.A. Patterson, G.A. Gibson and R. Katz

The main goals of a RAID solution are

higher capacity by grouping multiple disks

- originally a RAID was also a cheaper alternative to expensive large disks

• original name: Redundant Array of Inexpensive Disks

higher performance due to parallel disk access

- multiple parallel read/write operations

increased reliability since data might be stored redundantly

- data can be restored if a disk fails


RAID ...

There are three main concepts in RAID systems

identical data is written to more than one disk (mirroring)

data is split across multiple disks (striping)

redundant parity data is stored on separate disks and used to detect and fix problems (error correction)
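Striping and parity can be illustrated with a small sketch. It assumes block-level striping in round-robin order and a simple XOR parity block per stripe; the function names, block size and data are chosen for the example only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (used as the parity block)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, num_disks, block_size):
    """Distribute the blocks of `data` round-robin over `num_disks` disks."""
    disks = [[] for _ in range(num_disks)]
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size].ljust(block_size, b"\0")
        disks[(start // block_size) % num_disks].append(block)
    return disks


disks = stripe(b"ABCDEFGHIJKL", num_disks=3, block_size=4)
parity = xor_blocks([disk[0] for disk in disks])

# If the second disk fails, its block is the XOR of the surviving blocks and the parity.
recovered = xor_blocks([disks[0][0], disks[2][0], parity])
```

Mirroring needs no computation at all since the second disk simply holds a copy of each block; the price is twice the storage, whereas a parity block costs one extra block per stripe.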


RAID Reliability

The mean time between failures (MTBF) is the average time until a disk failure occurs

e.g. a hard disk might have an MTBF of 200'000 hours (22.8 years)

- note that the MTBF decreases as disks get older

If a DBMS uses an array of disks, then the overall system's MTBF can be much lower

e.g. the MTBF for a disk array of 100 of the disks mentioned above is 200'000 hours/100 = 2'000 hours (83 days)

By storing information redundantly, data can be restored in the case of a disk failure


RAID Reliability ...

The mean time to data loss (MTTDL) depends on the MTBF and the mean time to repair

if we mirror the information on two disks with an MTBF of 200'000 hours and a mean time to repair of 10 hours, then the MTTDL is 200'000^2/(2*10) hours = 2*10^9 hours ≈ 228'000 years

of course in reality it is more likely that an error occurs on multiple disks around the same time

- drives have the same age

- power failure, earthquake, fire, ...
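The reliability figures from the two slides above can be reproduced with a few lines. This is only a sketch of the arithmetic under the same simplifying assumption of independent disk failures; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def array_mtbf(disk_mtbf_hours, num_disks):
    """MTBF of an array without redundancy: any single disk failure counts."""
    return disk_mtbf_hours / num_disks


def mirrored_mttdl(disk_mtbf_hours, repair_hours):
    """Mean time to data loss for two mirrored disks: MTBF^2 / (2 * MTTR)."""
    return disk_mtbf_hours ** 2 / (2 * repair_hours)


single_disk_years = 200_000 / HOURS_PER_YEAR        # about 22.8 years
array_days = array_mtbf(200_000, 100) / 24          # about 83 days
mirror_years = mirrored_mttdl(200_000, 10) / HOURS_PER_YEAR  # about 228'000 years
```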


RAID Levels

The different RAID levels offer different

cost-performance trade-offs

RAID 0

block level striping without any redundancy

RAID 1

mirroring without striping

RAID 2

bit level striping

multiple parity disks

RAID 3

byte level striping

one parity disk

[http://en.wikipedia.org/wiki/RAID]


RAID Levels ...

RAID 4

block level striping

one parity disk

similar to RAID 3

RAID 5

block level striping with distributed parity

no dedicated parity disk

RAID 6

block level striping with dual distributed parity

no dedicated parity disk

similar to RAID 5


Data Representation

A DBMS has to define how the elements of its data model (e.g. relational model) are mapped to secondary storage

a field contains a fixed- or variable-length sequence of bytes and represents an attribute

a record contains a fixed- or variable-length sequence of fields and represents a tuple

records are stored in fixed-length physical block storage units representing a set of tuples

- the blocks also represent the units of data transfer

a file contains a collection of blocks and represents a relation

A database is finally mapped to a number of files managed by the underlying operating system

index structures are stored in separate files


Relational Model Representation

A number of issues have to be addressed when mapping the basic elements of the relational model to secondary storage

how to map the SQL datatypes to fields?

how to represent tuples as records?

how to represent records in blocks?

how to represent a relation as a collection of blocks?

how to deal with record sizes that do not fit into blocks?

how to deal with variable-length records?

how to deal with schema updates and growing record lengths?

...


Representation of SQL Datatypes

Fixed-length character string (CHAR(n))

represented as a field which is an array of n bytes

strings that are shorter than n bytes are filled up with a special "pad" character

Variable-length character string (VARCHAR(n))

two common representations (non-fixed length version later)

length plus content

- allocate an array of n + 1 bytes

- the first byte represents the length of the string (8-bit integer) followed by the string content

- limited to a maximal string length of 255 characters

null-terminated string

- allocate an array of n + 1 bytes

- terminate the string with a special null character (like in C)
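The three string representations can be sketched as follows. The sketch assumes one byte per character (ASCII), a space as the pad character and zero bytes as filler for the unused part of a VARCHAR field; none of these choices is prescribed by the slides, and the function names are illustrative.

```python
PAD = b" "  # assumed pad character for CHAR(n)


def encode_char(value, n):
    """CHAR(n): exactly n bytes, shorter strings are padded."""
    raw = value.encode("ascii")
    if len(raw) > n:
        raise ValueError("string too long")
    return raw.ljust(n, PAD)


def encode_varchar_length_prefixed(value, n):
    """VARCHAR(n) as n + 1 bytes: one length byte followed by the content."""
    raw = value.encode("ascii")
    if n > 255 or len(raw) > n:
        raise ValueError("string too long for a one-byte length")
    return bytes([len(raw)]) + raw.ljust(n, b"\0")


def decode_varchar_length_prefixed(field):
    return field[1:1 + field[0]].decode("ascii")


def encode_varchar_null_terminated(value, n):
    """VARCHAR(n) as n + 1 bytes: content followed by a null character."""
    raw = value.encode("ascii")
    if len(raw) > n or b"\0" in raw:
        raise ValueError("string too long or contains a null character")
    return (raw + b"\0").ljust(n + 1, b"\0")


def decode_varchar_null_terminated(field):
    return field[:field.index(b"\0")].decode("ascii")
```

Both VARCHAR variants occupy n + 1 bytes regardless of the actual string length, which is what keeps the enclosing record at a fixed length.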


Representation of SQL Datatypes ...

Dates (DATE)

fixed-length character string

Time (TIME(n))

the precision n leads to strings of variable length and two possible representations

fixed-precision

- limit the precision to a fixed value and store as VARCHAR(m)

true-variable length

- store the time as true variable length value

Bits (BIT(n))

bit values of size n can be packed into single bytes

packing of multiple bit values into a single byte is not recommended

- makes the retrieval and updating of a value more complex and error-prone
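A possible packing of a single BIT(n) value into whole bytes is sketched below. The bit order (most significant bit first) and the function names are assumptions made for the example; each value gets its own bytes, in line with the recommendation not to share a byte between several values.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into ceil(n / 8) bytes, MSB first."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n bits from a packed BIT(n) value."""
    return [(packed[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]
```

A BIT(10) value therefore occupies two bytes, with the last six bits of the second byte unused.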


Storage Access

A part of the system's main memory is used as a buffer to store copies of disk blocks

The buffer manager is responsible for moving data from secondary disk storage into memory

the number of block transfers between disk and memory should be minimised

as many blocks as possible should be kept in memory

The buffer manager is called by the DBMS every time a disk block has to be accessed

the buffer manager has to check whether the block is already allocated in the buffer (main memory)


Buffer Manager

If the requested block is already in the buffer, the buffer manager returns the corresponding address

If the block is not yet in the buffer, the buffer manager performs the following steps

allocate buffer space

- if no space is available, remove an existing block from the buffer (based on a buffer replacement strategy) and write it back to the disk if it has been modified since it was last fetched/written to disk

read the block from the disk, add it to the buffer and return the corresponding memory address

Note the similarities to a virtual memory manager
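The steps above can be sketched as a small buffer manager. This is a simplified illustration and not the design of any specific DBMS: the disk is simulated by a dictionary, LRU is assumed as the replacement strategy, and the class and method names are made up for the example.

```python
from collections import OrderedDict


class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity      # number of blocks that fit into the buffer
        self.disk = disk              # dict: block id -> bytes (stands in for the disk)
        self.frames = OrderedDict()   # block id -> bytearray, least recently used first
        self.dirty = set()            # blocks modified since they were read
        self.reads = 0
        self.writes = 0

    def get_block(self, block_id):
        """Return the buffered copy of a block, reading it from disk if needed."""
        if block_id in self.frames:
            self.frames.move_to_end(block_id)   # now the most recently used block
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:
            self._evict()
        self.frames[block_id] = bytearray(self.disk[block_id])
        self.reads += 1
        return self.frames[block_id]

    def mark_dirty(self, block_id):
        self.dirty.add(block_id)

    def _evict(self):
        """Remove the least recently used block, writing it back if modified."""
        victim, content = self.frames.popitem(last=False)
        if victim in self.dirty:
            self.disk[victim] = bytes(content)
            self.dirty.discard(victim)
            self.writes += 1
```

A block that was never modified is simply dropped on eviction, so only dirty blocks cost an additional disk write.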


Buffer Replacement Strategies

Most operating systems use a least recently used (LRU) strategy where the block that was least recently used is moved back from memory to disk

use past access pattern to predict future block access

A DBMS is able to predict future access patterns more accurately than an operating system

a request to the DBMS involves multiple steps and the DBMS might be able to determine which blocks will be needed by analysing the different steps of the operation

note that LRU might not always be the best replacement strategy for a DBMS


Buffer Replacement Strategies ...

Let us have a look at the procedure to compute the following natural join query: order ⋈ customer

note that we will see more efficient solutions for this problem when discussing query optimisation

for each tuple o of order {
    for each tuple c of customer {
        if o.customerID = c.customerID {
            create a new tuple r with:
                r.customerID := c.customerID
                r.name := c.name
                ...
                r.orderID := o.orderID
                ...
            add tuple r to the result set of the join operation
        }
    }
}



We further assume that the two relations order and customer are stored in separate files

From the pseudocode we can see that

once an order tuple has been processed, it is not needed anymore

- if a whole block of order tuples has been processed, that block is no longer required in memory (but an LRU strategy might keep it)

- as soon as the last tuple of an order block has been processed, the buffer manager should free the memory space (toss-immediate strategy)

once a customer tuple has been processed, it is not accessed again until all the other customer tuples have been accessed

- when the processing of a customer block has been finished, the least recently used customer block will be requested next

- we should replace the block that has been most recently used (MRU)
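The effect of the replacement strategy on the repeated scan of the customer relation can be checked with a small simulation. It is a sketch under simplifying assumptions: only the customer blocks of the inner loop are considered, the relation has four blocks, the buffer has three frames, and the block names are made up.

```python
def count_misses(accesses, capacity, policy):
    """Count the block reads for an access sequence; policy is 'LRU' or 'MRU'."""
    buffer = []   # ordered from least to most recently used
    misses = 0
    for block in accesses:
        if block in buffer:
            buffer.remove(block)
        else:
            misses += 1
            if len(buffer) >= capacity:
                # LRU evicts the oldest entry, MRU the most recently used one
                buffer.pop(0 if policy == "LRU" else -1)
        buffer.append(block)
    return misses


customer_blocks = ["c0", "c1", "c2", "c3"]
accesses = customer_blocks * 3   # the inner relation is scanned three times

lru_misses = count_misses(accesses, capacity=3, policy="LRU")
mru_misses = count_misses(accesses, capacity=3, policy="MRU")
```

With LRU every one of the 12 accesses has to go to disk, since each block is evicted just before it is needed again, whereas MRU needs only 6 block reads in this setting.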



A memory block can be marked to indicate that this block is not allowed to be written back to disk (pinned block)

note that if we want to use an MRU strategy for the inner loop of the previous example, the block has to be pinned

- the block has to be unpinned after the last tuple in the block has been processed

the pinning of blocks provides some control to restrict the time when blocks can be written back to disk

- important for crash recovery

- blocks that are currently updated should not be written to disk

The prefetching of blocks might be used to further increase the performance of the overall system

e.g. for serial scans (relation scans)



The buffer manager can also use statistical information about the probability that a request will reference a particular relation (and its related blocks)

the system catalogue (data dictionary) with its metadata is one of the most frequently accessed parts of the database

- if possible, system catalogue blocks should always be in the buffer

index files might be accessed more often than the corresponding files themselves

- do not remove index files from the buffer if not necessary

the crash recovery manager can also provide constraints for the buffer manager

- the recovery manager might demand that other blocks have to be written first (force-output) before a specific block can be written to disk


System Catalogue / Data Dictionary

Stores metadata about the database

names of the relations

names, domains and lengths of the attributes of each relation

names of views

names of indices

- name of relation that is indexed

- name of attributes

- type of index

integrity constraints

users and their authorisations

statistical data

- number of tuples in relation, storage method, ...

...


File Organisation

A file is logically organised as a sequence of records

each record contains a sequence of fields

name, datatype and offset of record fields are defined by the schema

record types (schema) might change over time

The records are mapped to disk blocks

the block size is fixed and defined by the physical properties of the disk and the operating system

the record size might vary for different relations and even between tuples of the same relation (variable field size)

There are different possible mappings of records to files

use multiple files and only store fixed-length records in each file

store variable-length records in a file


Fixed-Length Records

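If we assume that an integer requires 2 bytes and characters are represented by one byte, then the customer record is 64 bytes long (2 bytes for cID plus 31 bytes for each of the two varchar(30) fields)

type customer = record
    cID int;
    name varchar(30);
    street varchar(30);
end

[Figure: block containing customer records; within a record, cID starts at offset 0, name at offset 2 and street at offset 33, and the record ends at offset 64]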
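The 64-byte layout can be reproduced with Python's struct module. The sketch assumes a little-endian 2-byte integer and that each varchar(30) field occupies 31 bytes filled up with zero bytes; the helper names and the sample customers are illustrative.

```python
import struct

# 2-byte integer followed by two 31-byte fields (30 characters plus one extra byte)
CUSTOMER = struct.Struct("<h31s31s")


def pack_customer(cid, name, street):
    """Build one fixed-length customer record."""
    return CUSTOMER.pack(cid, name.encode("ascii"), street.encode("ascii"))


def unpack_customer(record):
    cid, name, street = CUSTOMER.unpack(record)
    return cid, name.rstrip(b"\0").decode("ascii"), street.rstrip(b"\0").decode("ascii")


def read_record(block, index):
    """Fixed-length records can be located by simple offset arithmetic."""
    start = index * CUSTOMER.size
    return unpack_customer(block[start:start + CUSTOMER.size])


block = pack_customer(1, "Max Frisch", "Bahnhofstrasse 7") + pack_customer(2, "Eddy Merckx", "Pleinlaan 25")
```

Because every record has the same size, record i of a block starts at offset i * 64 and no per-record directory is required.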


Fixed-Length Records ...

Often a record header is added to each record for managing metadata about

the record schema (pointer s to the DBMS schema information)

timestamp t about the last access or modification time

the length l of the record

- could be computed from the schema but the information is convenient if we want to quickly access the next record without having to consult the schema

...

[Figure: block with a record consisting of the header fields s, t and l followed by cID, name and street; marked offsets 0, 12, 16, 48 and 80]


Fixed-Length Records in Blocks/Files

Problems with this fixed-length representation

after a record has been deleted, its space has to be filled with another record

- could move all records after the deleted one but that is too expensive

- can move the last record to the deleted record's position but that might also require an additional block access

if the block size is not a multiple of the record size, some records will cross block boundaries and we need two block accesses to read/write such a record

[Figure: file with five records, where h denotes the record header]

record 0   h   1    Max Frisch        Bahnhofstrasse 7
record 1   h   2    Eddy Merckx       Pleinlaan 25
record 2   h   5    Claude Debussy    12 Rue Louise
record 3   h   53   Albert Einstein   Bergstrasse 18
record 4   h   8    Max Frisch        ETH Zentrum


Fixed-Length Records in Blocks/Files ...

Since insert operations tend to be more frequent than delete operations, it might be acceptable to leave the space of the deleted record open until a new record is inserted

we cannot just add an additional boolean flag ("free") to the record since it will be hard to find the free records

allocate a certain amount of bytes for a file header containing metadata about the file

The block/file header contains a pointer (address) to the first deleted record

each deleted record contains a pointer (address) to the next deleted record

the linked list of deleted records is called a free list


[Figure: file with a header followed by records 0 to 4]

Fixed-Length Records in Blocks/Files ...

To insert a new record, the first free record pointed to by the header is used and the address in the header is updated to the free record that the used record was pointing to

to save some space, the pointers of the free list can also be stored in the unused space of deleted records (no additional field)

[Figure: file with a free list; the records still in use are shown below, where h denotes the record header]

h   1   Max Frisch       Bahnhofstrasse 7
h   5   Claude Debussy   12 Rue Louise
h   8   Max Frisch       ETH Zentrum
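The free list handling for fixed-length records can be sketched as follows. The sketch keeps the records in a Python list instead of a real file, uses slot numbers as addresses and stores the pointer to the next free slot in the space of the deleted record; the class name and the ("free", next) marker are assumptions made for the example.

```python
class RecordFile:
    def __init__(self):
        self.records = []        # slot -> record, or ("free", next free slot)
        self.first_free = None   # header: slot of the first deleted record

    def insert(self, record):
        """Reuse the first free slot if there is one, otherwise append."""
        if self.first_free is None:
            self.records.append(record)
            return len(self.records) - 1
        slot = self.first_free
        self.first_free = self.records[slot][1]   # header now points to the next free slot
        self.records[slot] = record
        return slot

    def delete(self, slot):
        """Put the slot at the front of the free list."""
        self.records[slot] = ("free", self.first_free)
        self.first_free = slot
```

Both operations touch only the header and a single record, so no other records have to be moved.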


Address Space

There are several ways in which the database address space (blocks and block offsets) can be represented

physical addresses consisting of byte strings (up to 16 bytes) that address

- host

- storage device identifier (e.g. hard disk ID)

- cylinder number of the disk

- track within the cylinder (for multi-surface disks)

- block within the track

- potential offset of record within the block

logical addresses consisting of an arbitrary string of length n


Address Space Mapping

A map table is stored at a known disk location and provides a mapping between the logical and physical address spaces

introduces some indirection since the map table has to be consulted to get the physical address

flexibility to rearrange records within blocks or move them to other blocks without affecting the record's logical address

different combinations of logical and physical addresses are possible (structured address schemes)

[Figure: map table with a logical address column and a physical address column, mapping each logical address to a physical address]


Variable-Length Data

Records of the same type may have different lengths

We may want to represent

record fields with varying size (e.g. VARCHAR(n))

large fields (e.g. images)

...

We need an alternative data representation to deal with

these requirements


Variable-Length Record Fields

Scheme for records with variable-length fields

put all fixed-length fields first (e.g. cID)

add the length of the record to the record header

add the offsets of the variable-length fields to the record header

Note that if the order of the variable-length fields is always the same, we do not have to store an offset for the first variable-length field (e.g. name)

[Figure: record with the record length in the header, followed by the fields cID, name and street]


Variable-Length Records

There are different reasons why we might have to use variable-length records

to store records that have at least one field with a variable length

to store different record types in a single block/file

Structured address scheme (slotted-page structure)

address of a record consists of the block address in combination with an offset table index

records can be moved around

[Figure: block containing an offset table, free space and the records record3, record2 and record1]
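A slotted page can be sketched in a few lines. The sketch keeps the offset table as a Python list next to the block instead of inside it, lets records grow from the end of the block and compacts the block after every delete; these are simplifications chosen for the example, and the class and method names are illustrative.

```python
class SlottedPage:
    def __init__(self, size=64):
        self.data = bytearray(size)
        self.offsets = []       # offset table: slot -> (start, length) or None
        self.free_end = size    # records are placed from the end of the block

    def insert(self, record):
        """Store a record and return its slot (index into the offset table)."""
        if len(record) > self.free_end:
            raise ValueError("not enough free space in block")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.offsets.append((self.free_end, len(record)))
        return len(self.offsets) - 1

    def read(self, slot):
        start, length = self.offsets[slot]
        return bytes(self.data[start:start + length])

    def delete(self, slot):
        self.offsets[slot] = None
        self.compact()

    def compact(self):
        """Slide the remaining records together; slot numbers stay valid."""
        end = len(self.data)
        live = [s for s, entry in enumerate(self.offsets) if entry is not None]
        for slot in sorted(live, key=lambda s: self.offsets[s][0], reverse=True):
            start, length = self.offsets[slot]
            record = bytes(self.data[start:start + length])
            end -= length
            self.data[end:end + length] = record
            self.offsets[slot] = (end, length)
        self.free_end = end
```

Since other parts of the database only refer to a record by block address and slot number, the records can be moved within the block without changing any external reference.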


Large Records

Sometimes we have to deal with values that do not fit into a single block (e.g. audio or movie clips)

a record that is split across two or more blocks is called a spanned record

spanned records can also be used to pack blocks more efficiently

Extra header information

each record header carries a bit to indicate if it is a fragment

- fragments have some more bits telling whether they are the first or last fragment of the record

potential pointers to previous and next fragment

[Figure: block 1 contains a block header, record1 and the fragment record2a; block 2 contains a block header, the fragment record2b and record3; each record or fragment has its own record header]


Storage of Binary Large Objects (BLOBS)

BLOB is stored as a sequence of blocks

often blocks allocated successively on a disk cylinder

BLOB might be striped across multiple disks for more efficient retrieval

BLOB field might not be automatically fetched into memory

user has to explicitly load parts of the BLOB

possibly index structures to retrieve parts of a BLOB


Insertion of Records

If the records are not kept in a particular order, we can just find a block with some empty space or create a new block if there is no such space

If the record has to be inserted in a particular order, but there is no space in the block, there are two alternatives

find space in a nearby block and rearrange some records

create an overflow block and link it from the header of the original block

- note that an overflow block might point to another overflow block and so on




Deletion of Records

If we use an offset table, we may compact the free space

in the block (slide around the records)

If the records cannot be moved, we might have a free list

in the header

We might also be able to remove an overflow block after

a delete operation




Update of Records

If we have to update a fixed-length record there is no problem since we will still use the same space

If the updated record is larger than the original version, then we might have to create more space

same options as discussed for insert operation

If the updated record is smaller, then we may compact some free space or remove overflow blocks

similar to delete operation




Homework

Study the following chapter of the Database System Concepts book

chapter 10

- sections 10.1-10.9

- Storage and File Structure


Exercise 8

Functional Dependencies and Normalisation


References

H. Garcia-Molina, J.D. Ullman and J. Widom, Database Systems – The Complete Book, Prentice Hall, 2002

A. Silberschatz, H. Korth and S. Sudarshan, Database System Concepts (Sixth Edition), McGraw-Hill, 2010


Next Lecture: Access Methods
