Lecture 8: Physical Schema: Storage - New York University · Lecture 8: Physical Schema: Storage . Physical Schema Conceptual Schema View 1 View 2 View 3 Disk What happens under the

Database Systems

Mohamed Zahran (aka Z)

[email protected]

http://www.mzahran.com

CSCI-GA.2433-001

Lecture 8: Physical Schema: Storage

Physical Schema

Conceptual Schema

View 1 View 2 View 3

Disk What happens under the hood?

1. Create a model of the enterprise (e.g., using ER)

2. Create a logical “implementation” (using a relational model and normalization)

Requirements Analysis

Logical Design

Conceptual Design

Schema Refinement

Physical Design

Application & Security Design

• Uses a file system to store the relations • Requires knowledge of hardware and

operating systems characteristics

First, Let’s Look at a Typical Hierarchy

Level Access time Typical size

Registers "instantaneous" under 1KB

Level 1 Cache 1-3 ns 64KB per core

Level 2 Cache 3-10 ns 256KB per core

Level 3 Cache 10-20 ns 2-20 MB per chip

Main Memory 30-60 ns 4-32 GB per system

Hard Disk 3,000,000-10,000,000 ns over 1TB

In DB, we care about those two levels

Physical Design

Criteria Storage media

Indexes File Structures

Performance • Hard-disk drives • Solid-state Disks

Disk Drives

• To access data: — seek time: position head over the proper track — rotational latency: wait for desired sector (RPM) — transfer time: grab the data (one or more sectors)

Platter

Track

Platters

Sectors

Tracks

Hard Disks

• spinning platter of special material • mechanical arm with read/write head

must be close to the platter to read/write data

• data is stored magnetically • disks are random access meaning data

can be read/written anywhere on the disk

A Conventional Hard Disk Structure

Hard Disk Architecture

• Surface = group of tracks

• Track = group of sectors

• Sector = group of bytes

• Cylinder: several tracks on corresponding surfaces

Disk Sectors and Access • Each sector records

– Sector ID – Data (512 bytes, 4096 bytes proposed) – Error correcting code (ECC)

• Used to hide defects and recording errors

– Synchronization fields and gaps

• Access to a sector involves – Queuing delay if other accesses are pending – Seek: move the heads – Rotational latency – Data transfer – Controller overhead

Example of a Real Disk: MKxx59GSM from Toshiba

• Size: 1TB • 5,400 RPM • 2.5” • Number of platters: 3 • Number of data heads: 6 • Interface: SATA • Transfer rate to host: 3GB/sec • Average seek time: 12ms • Track-to-Track seek: 2ms

Disks: Other Issues

• Average seek and rotation times are helped by locality.

• Disk performance improves about 10%/year

• Capacity increases about 60%/year

• Common disk interfaces/controllers: • SCSI, IDE, SATA

The Disk: A View from the Top

• A sequence of blocks

– A physical unit of access is always a block

• All blocks are of the same size

• Actually, a block consists of 1 or more sectors

• A file:

– Logically: a series of records, of similar or different sizes

– Physically: a series of, no necessarily contiguous blocks

• The file system helps us to:

– Find the first block

– Find the last block

– Find the next block

– Find the previous block

Relation

File

Blocks

Records

Sectors

Each tuple is a record in the file.

Logical view

Physical view

Assumptions: •There can be several records in a block •No record spans more than one block

There is one extra layer that we will discuss shortly!

1 1200

3 2100

4 1800

2 1200

6 2300

9 1400

8 1900

E# Salary

1 1200

3 2100

4 1800

2 1200

6 2300

9 1400

8 1900

RecordsRelation Blocks

1 1200

3 2100

4 1800

2 1200

6 2300

9 1400

8 1900

Records

6 23009 1400

1 1200 3 2100 8 1900

4 18002 1200

Left-over

SpaceFirst block

of the file

Example

Processing a Query: What Happens Under The Hood?

SELECT E# FROM R WHERE SALARY > 1500;

Read from disk into RAM all

relevant blocks

Get the relevant information from

blocks

Additional processing then

produce the answer What is the cost of this?

A Simple Cost Model

• Assumptions – Reading or Writing a block costs one time unit – Processing is free

• Justification – Accessing the disk is much more expensive

than any reasonable CPU processing of queries • Implication

– Goal: minimize number of block accesses – Good Heuristic: Organize the physical

database so that you make as much use as possible from any block you read/write

Why Not Store Everything in Main Memory?

Example

Blocks on disk

1 1200

3 2100

4 1800

2 1200

6 2300

9 1400

8 1900

Array in RAM

6 23009 1400

1 1200 3 2100 8 1900

4 18002 1200

What is the cost of accessing: E# 2 and E# 9?

What is the cost of accessing: E# 2 and E# 4?

What Is The Best Place for the Next Block?

RAID

• Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.

• Goals: Increase performance and reliability. • Two main techniques:

– Data striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks.

– Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.

RAID Levels

• Level 0: No redundancy

• Level 1: Mirrored (two identical copies)

– Each disk has a mirror image – Parallel reads, a write involves two disks. – Maximum transfer rate = transfer rate of one disk – Better reliability but no protection against data corruption of

viruses

RAID Levels

• Level 0+1: Striping and Mirroring – Parallel reads, a write involves two disks.

– Maximum transfer rate = aggregate bandwidth

Level 1+0 is the same as

0+1 with reverse in mirroring and stripping.

RAID Levels • Level 2: Bit-Interleaved

– Uses hamming code for error correction – Can recover data from single-bit corruption – Out of fashion!

For Error correction (Hamming code)

RAID Levels • Level 3: Byte-Interleaved with Parity

– Striping Unit: One byte. – One check disk. – Each read and write request involves all

disks; disk array can process one request at a time.

RAID Levels • Level 4: Block-Interleaved Parity

– Striping Unit: One disk block. One check disk.

– Parallel reads possible for small requests, large requests can utilize full bandwidth

– Writes involve modified block and check disk

RAID Levels • Level 5: Block-Interleaved Distributed

Parity – Similar to RAID Level 4, but parity blocks

are distributed over all disks

Level 6 is similar to 5 but with extra parity block

What Is This Story of Solid State Disks (SSD)?

• No moving parts hence called “solid” state

• reads and writes to a medium called NAND flash memory

• Faster startup: no spinning

• Extremely low read latency

• Deterministic: performance does not depend on the location of the data

BUT …

• Much more expensive than hard disks (~3$/GB vs. 0.15$/GB)

• Limited write erase time

• Slower write speeds

• High capacity SSDs may have significant higher power requirements

• SSD can get slower as it ages

SSD Organization

• SSD has its own page

• SSD pages are getting larger: 8KB, 16KB, 32KB

• Sector addressable

• Most SSD uses 4KB sectors

Comparison of typical hard-drives and SSD

More up-to-date- SSD: Samsung SSD 840 Pro Series: 256 GB • Read XFR Rate: ~510 MB/s •Write XFR Rate: ~500 MB/s

DBMS

Application

OS

DISK Controller

Data must be in RAM for DBMS to operate on it!

Query Optimization

and Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

DB

Important: Tow flavors of DBMS •Relying on OS file systems •Uses its own or extend OS

Disk Space Manager

• Abstractly: deals with pages as a unit of data (read, write, allocate, deallocate)

• Page size usually chosen as disk block • Keeps track of:

– Which blocks are in use • List of bitmap

– Which pages are on which page blocks

• Hides the details of the hardware and OS and makes higher level layers think in term of pages

Buffer Manager

• Memory may not be able to hold all needed pages

• the software layer responsible for bringing pages from disk to main memory as needed

• Implements replacement policy • Higher levels of the DBMS code can be

written without worrying about whether data pages are in memory or not.

Buffer Manager

• It does its job as follows: – Partitions the memory into frames (1 frame holds 1 page)

– The collection of all frames/pages is called buffer pool

DB

MAIN MEMORY

DISK

disk page

free frame

Page Requests from Higher Levels

BUFFER POOL

choice of frame dictated by replacement policy

Hierarchy of Data

Tuples/Records

Relations

Sectors

Files

Blocks

Pages

We can try to minimize the number of block

accesses for “frequent” queries.

Indexing File Organization

Oversimplification: Tries to provide When you read a block you get “many” useful records

Oversimplification: Tries to provide where blocks containing useful records are

Conclusions

• From the application to the disk, each layer “sees” the data differently

• Disks provide cheap, non-volatile storage.

• Performance and cost are the main issues

Documents

Lecture 8: Physical Schema: Storage - New York University · Lecture 8: Physical Schema: Storage . Physical Schema Conceptual Schema View 1 View 2 View 3 Disk What happens under the