CS2300: File Structures and
Introduction to Database Systems
Lecture 4: File Structure
Doug McGeehan
How is data stored
in the database?
File Structure
Outline
▪ Disk storage devices
▪ Files of records
▪ Operations on files
▪ Types of files
▪ Unordered files
▪ Ordered files
▪ Hash files
Disk Storage Devices
▪ Storage hierarchy.
▪ Primary storage: main memory, cache
▪ Secondary storage: magnetic disk, CD-ROM/DVD, tape, solid state, etc.
▪ Most databases are stored on disk
▪ Disk is cheaper and non-volatile, though slower.
▪ DBMS files often optimized for spinning magnetic disks
Disk Storage Devices
▪ Disks are divided into concentric circular tracks on each disk surface.
▪ A track is further divided into sectors, whose size is traditionally 512 bytes (modern: 4096 bytes).
▪ A sector is the smallest addressable unit on a disk.
▪ Tracks from all surfaces which are at the same diameter form a cylinder.
Platters
Spindle
Disk head
Arm movement
Arm assembly
Tracks
Sector
Disk Storage Devices
▪ Contiguous sectors are organized into blocks
▪ The block size B is fixed during disk formatting
▪ Typical range: 512 bytes to 8192 bytes
▪ Blocks are separated by fixed-size interblock gaps
▪ Whole blocks are transferred between disk and main
memory for processing
▪ A read-write head moves to the track that contains
the block to be transferred
▪ Disk rotation moves the block under the read-write
head for reading or writing.
Disk Storage Devices
▪ A physical disk block address consists of:
▪ Surface number
▪ Track number (within surface)
▪ Block number (within track)
▪ Disk drives typically rotate continuously at a
constant speed, i.e. a fixed number of revolutions
per minute (rpm)
Disk Storage Devices
▪ The total time of reading or writing a disk block is the sum of:
– Seek time: the time required to move the read-write head to the correct cylinder.
– Rotational delay: the time required to rotate the disk so the desired block can be placed under the read-write head.
– Transfer time: the time required to transfer the data from the disk to main memory (buffer).
Disk Storage Devices
Let B be the block size in bytes
T be the track size in bytes
V be the spindle speed in rpm (revolutions per minute)
s be the seek time in ms
Average rotational delay (rd) = the time of half a revolution
= 0.5 * (60*1000 / V) ms
Transfer rate (tr) = track size / time to spin once
= T / (60*1000 / V) bytes/ms
Block transfer time (btt) = block size / bytes transferred per ms
= B / tr ms
Total time to find and transfer a block is (s + rd + btt)
Example
Block size = 500 bytes
# of blocks per track = 20
# of tracks per surface = 400
A disk pack consists of 15 double-sided disks.
Seek time = 30ms
1. How many cylinders?
2. What is the total capacity of a disk pack?
3. At 5000 rpm (revolutions per minute)
1. What is the transfer rate and block transfer time?
2. What is the total time to find and transfer a block?
Example
1. 400 cylinders
2. Total capacity = 500*20*400*15*2 = 120,000,000 bytes
3. Track size = 500*20 = 10,000 bytes
Transfer rate = 10,000 / ((60*1000)/5000) ≈ 833 bytes/ms
Block transfer time = 500/833 ≈ 0.6 ms
Average rotational delay = 0.5 * (60*1000/5000) = 6 ms
Total time to find and transfer a block
= seek time + average rotational delay + Block transfer time
= 30 + 6 + 0.6 = 36.6 (ms)
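The arithmetic in this example can be checked with a short script. This is only a sketch: the function and parameter names (`disk_timing`, `pack_capacity`, and so on) are made up for illustration and are not part of the lecture.

```python
def disk_timing(block_size, blocks_per_track, rpm, seek_ms):
    """Return (tr in bytes/ms, btt in ms, rd in ms, total in ms) for one block access."""
    track_size = block_size * blocks_per_track   # T, in bytes
    ms_per_revolution = 60 * 1000 / rpm          # time to spin once
    tr = track_size / ms_per_revolution          # transfer rate
    btt = block_size / tr                        # block transfer time
    rd = 0.5 * ms_per_revolution                 # average rotational delay
    return tr, btt, rd, seek_ms + rd + btt


def pack_capacity(block_size, blocks_per_track, tracks_per_surface, disks, sides=2):
    """Total capacity in bytes of a disk pack."""
    return block_size * blocks_per_track * tracks_per_surface * disks * sides


tr, btt, rd, total = disk_timing(block_size=500, blocks_per_track=20, rpm=5000, seek_ms=30)
capacity = pack_capacity(500, 20, 400, 15)
print(f"tr={tr:.0f} bytes/ms, btt={btt:.1f} ms, rd={rd:.0f} ms, total={total:.1f} ms")
print(f"capacity={capacity:,} bytes")
```

The seek time dominates the total here, which is why placing related blocks on the same cylinder matters so much.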
Disk Storage Devices
▪ Reading or writing a disk block is time
consuming
▪ Seek time
▪ Rotational delay (latency)
▪ Locating data on disk is a major
bottleneck in database applications.
Outline
▪ Disk storage devices
▪ Files of records
▪ Operations on files
▪ Types of files
▪ Unordered files
▪ Ordered files
▪ Hash files
Files of Records
▪ A file is a sequence of records
▪ A record is a collection of fields
▪ Corresponds to an entity
▪ Records are stored on disk blocks
▪ The blocking factor (bfr) for a file: the (average)
number of file records stored in a disk block
▪ A file descriptor (or file header) includes information that
▪ Describes the file (e.g. field names and data types)
▪ Gives the addresses of the file blocks on disk
Files of Records
Fields
Records
Blocks
Files
Files of Records
▪ A file can have fixed-length records or
variable-length records.
▪ A record may have fixed-length fields or
variable-length fields.
Fixed-length Records
Employee record schema:
(1) EID, 2-byte integer
(2) Name, 10 characters
(3) Dept, 2-byte code
[Figure: example records]
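A fixed-length layout like this Employee record can be sketched with Python's `struct` module. The byte order (little-endian), the signed integer type, and the helper names are assumptions made for illustration; the slide only gives the field sizes.

```python
import struct

# Assumed layout: little-endian, 2-byte signed int EID, 10-byte Name, 2-byte Dept code.
EMPLOYEE = struct.Struct("<h10s2s")


def pack_employee(eid, name, dept):
    """Encode one record; struct pads a short name with zero bytes up to 10 bytes."""
    return EMPLOYEE.pack(eid, name.encode("ascii"), dept.encode("ascii"))


def read_record(data, i):
    """Read record number i; its offset is simply i * record size."""
    eid, name, dept = EMPLOYEE.unpack_from(data, i * EMPLOYEE.size)
    return eid, name.rstrip(b"\x00").decode("ascii"), dept.decode("ascii")


file_bytes = pack_employee(1, "Smith", "CS") + pack_employee(2, "Jones", "EE")
print(EMPLOYEE.size)               # bytes per record
print(read_record(file_bytes, 1))  # second record
```

Because every record has the same size, the position of any record can be computed directly without scanning the ones before it.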
Variable-length Records
• Variable-length fields: exact length not known
ahead of time
• Special separator characters denote the end
of a field, such as ?, %, or $
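A minimal sketch of separator-terminated fields, assuming `%` as the separator and assuming field values never contain it (a real system would need escaping or length prefixes). The function names are hypothetical.

```python
SEPARATOR = "%"  # assumed choice; the slide also mentions ? and $


def encode_record(fields):
    """Terminate every field with the separator."""
    for field in fields:
        if SEPARATOR in field:
            raise ValueError("field value contains the separator character")
    return "".join(field + SEPARATOR for field in fields)


def decode_record(record):
    """Split on the separator; the final piece after the last separator is empty."""
    return record.split(SEPARATOR)[:-1]


encoded = encode_record(["123456", "Smith", "CS"])
print(encoded)
print(decode_record(encoded))
```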
Records
▪ Records can be unspanned (no record can span
two blocks) or spanned (a record can be stored
in more than one block).
▪ Unspanned (Block 1, Block 2): R1 R2 R3 R4 R5, each record stored whole in one block
▪ Spanned: Block 1 holds R1 R2 R3(a); Block 2 holds R3(b) R4 R5 R6 R7(a)
Example
• Suppose that blocks are of size 1024 bytes;
a table has 1000 tuples of 100 bytes each;
No tuple is allowed to span two blocks
• How many blocks are needed to store these
tuples?
floor(1024/100) = 10 tuples/block
1000/10 = 100 blocks
• How much space is “wasted”?
100*(1024 – 100*10) = 2400 bytes
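The same calculation can be written as a small sketch, with a spanned layout added for comparison. The function names are illustrative, and the spanned count ignores any per-fragment pointer overhead, which is an assumption.

```python
def unspanned_layout(block_size, record_size, num_records):
    """Return (blocking factor, blocks needed, unused bytes) when no record spans blocks."""
    if record_size > block_size:
        raise ValueError("unspanned storage needs record_size <= block_size")
    bfr = block_size // record_size        # records per block, rounded down
    blocks = -(-num_records // bfr)        # ceiling division
    unused = blocks * block_size - num_records * record_size
    return bfr, blocks, unused


def spanned_blocks(block_size, record_size, num_records):
    """Blocks needed when records may continue into the next block."""
    return -(-(num_records * record_size) // block_size)


print(unspanned_layout(1024, 100, 1000))
print(spanned_blocks(1024, 100, 1000))
```

With these numbers the spanned layout needs 98 blocks instead of 100, at the cost of more complex record access.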
Spanned vs. Unspanned
• Unspanned is much simpler, but may
waste space…
• Spanned is essential if
record size > block size
Files of Records
▪ Allocated blocks for storing records may
be contiguous, linked, or indexed.
• Contiguous: file blocks occupy consecutive disk blocks
• Linked: each file block holds a pointer to the next file block
• Indexed: one or more index blocks hold pointers to the file blocks
Operations on Files
Actual operations vary from system to system
▪ OPEN
▪ FIND
▪ FINDNEXT
▪ READ
▪ INSERT
▪ DELETE
▪ MODIFY
▪ CLOSE
Types of Files
• Typically, all the tuples in a given table in the database are stored in files
• The database takes over all interactions with these files from the operating system
• Files can be organized in three different ways:
– Heap files (unordered files)
– Sorted files (ordered files)
– Hash files
Unordered Files
▪ Also called a heap or a pile file.
▪ New records are inserted at the end of the file, which is efficient.
▪ To search for a record, a linear search through the file records is necessary.
▪ Requires reading / searching half the file blocks
on average
▪ Quite expensive for large files
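The "half the file blocks on average" claim can be illustrated with a small sketch. The in-memory list of blocks, the four-records-per-block size, and the made-up keys are all assumptions for illustration; a real heap file would read each block from disk.

```python
def linear_search(blocks, key):
    """Scan blocks in file order; return (record, number of blocks read)."""
    for blocks_read, block in enumerate(blocks, start=1):
        for record in block:
            if record[0] == key:
                return record, blocks_read
    return None, len(blocks)  # a miss reads every block


# 40 records with keys in no particular order, 4 records per block -> 10 blocks.
keys = [(7 * i) % 40 for i in range(40)]
records = [(k, f"record-{k}") for k in keys]
blocks = [records[i:i + 4] for i in range(0, len(records), 4)]

reads = [linear_search(blocks, k)[1] for k in range(40)]
average_reads = sum(reads) / len(reads)
print(average_reads)  # (b + 1) / 2 for b = 10 blocks
```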
Example of an Unordered File
123456 CS305 F1998 3.0
232323 MAT123 S1996 2.0 Block 1 (Page 1)
123456 CS305 F1995 2.0
234567 EE101 F1995 3.0
123456 CS315 S1997 4.0
111111 MGT123 F1994 4.0
123456 EE101 S1998 3.0 Block 2 (Page 2)
Ordered Files
▪ Also called a sorted file.
▪ File records are kept sorted by the values of
an ordering field.
▪ Insertion is expensive: records must be
inserted in the correct order.
▪ Searching for a record by its ordering-field value
is quite efficient.
▪ Can use a binary search
▪ Requires accessing about log2(b) blocks on average, where b is the number of file blocks
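A binary search over the blocks of an ordered file might look like the sketch below. It assumes unique ordering-field values, non-empty blocks, and an in-memory list standing in for the disk; the names are illustrative.

```python
def binary_search_blocks(blocks, key):
    """Binary search on block boundaries; return (record, number of blocks read)."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block access
        blocks_read += 1
        if key < block[0][0]:        # smaller than the first key in this block
            hi = mid - 1
        elif key > block[-1][0]:     # larger than the last key in this block
            lo = mid + 1
        else:                        # key can only be inside this block
            for record in block:
                if record[0] == key:
                    return record, blocks_read
            return None, blocks_read
    return None, blocks_read


# 256 sorted records, 4 per block -> 64 blocks.
records = [(k, f"record-{k}") for k in range(256)]
blocks = [records[i:i + 4] for i in range(0, len(records), 4)]
worst_case = max(binary_search_blocks(blocks, k)[1] for k in range(256))
print(worst_case)  # at most floor(log2(64)) + 1 block reads
```

Compare this with the linear search on an unordered file of 64 blocks, which reads about 32 blocks on average.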
Example of an Ordered File
CS 238 - Dan Lin 31
111111 CS238 F2008 4.0
111111 CS238 S2008 4.0 Block 1 (Page 1)
123456 CS101 F2007 2.0
123456 CS338 F2008 3.0
123456 CS315 S2008 4.0
123456 EE101 S2008 3.0 Block 2 (Page 2)
234567 EE101 F2008 3.0
600017 EE101 F2008 3.0
Insert new record “150000 CS101 S2007 3.0”
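The cost of this insertion can be sketched as follows. The records are taken from the slide and ordered by the first field; the block capacity of four records is an assumption, and the in-memory list stands in for the file.

```python
import bisect

BLOCK_CAPACITY = 4  # assumed: four records per block

records = [
    (111111, "CS238", "F2008", 4.0),
    (111111, "CS238", "S2008", 4.0),
    (123456, "CS101", "F2007", 2.0),
    (123456, "CS338", "F2008", 3.0),
    (123456, "CS315", "S2008", 4.0),
    (123456, "EE101", "S2008", 3.0),
    (234567, "EE101", "F2008", 3.0),
    (600017, "EE101", "F2008", 3.0),
]
new_record = (150000, "CS101", "S2007", 3.0)

# Find the slot that keeps the file sorted on the ordering field (first field).
ordering_keys = [r[0] for r in records]
position = bisect.bisect_right(ordering_keys, new_record[0])
shifted = len(records) - position   # records that must move to make room
records.insert(position, new_record)

blocks = [records[i:i + BLOCK_CAPACITY] for i in range(0, len(records), BLOCK_CAPACITY)]
print(position, shifted, len(blocks))
```

Both blocks were already full, so the two records after the insertion point shift and a third block is needed, which is why insertion into an ordered file is expensive.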
Hash Files
▪ Also called a direct file
▪ Execute a hash function on a key field
▪ Yields disk block address containing record
(ideally)
▪ Static external hashing
▪ Dynamic hashing techniques
Static External Hashing
▪ The file blocks are divided into M equal-sized
buckets, numbered bucket 0, bucket 1, ...,
bucket M-1
▪ Typically, a bucket corresponds to one
(or a fixed number of) disk block(s)
▪ One of the key fields is designated the
hash key of the file
Static External Hashing
▪ Record with key value K
▪ Hash value i = h(K) for hash function h()
▪ Store the record in bucket i
▪ The hash table is maintained in the file header
Example of a Hash File
• There are 10 buckets.
• Suppose the hash function h on
branch names gives:
• h(Perryridge) = 5
• h(Round Hill) = 3
• h(Brighton) = 3
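The slide does not say which hash function produces these values. One function that happens to reproduce them, shown here purely as an assumption, sums the alphabet positions of the letters (a = 1, ..., z = 26, ignoring spaces and case) and takes the result modulo the number of buckets.

```python
def h(branch_name, num_buckets=10):
    """Sum of letter positions (a=1 ... z=26) modulo the bucket count."""
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"   # skip spaces and anything that is not a letter
    )
    return total % num_buckets


for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

Round Hill and Brighton land in the same bucket, which is fine as long as the bucket still has room.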
Static External Hashing
▪ Search is very efficient on the hash key.
▪ Search on fields other than the hash key is as
expensive as in an unordered file.
▪ Collisions can occur
▪ A new record hashes to a full bucket
▪ An overflow file is kept for storing such records
▪ Overflow records from a particular bucket can be linked together
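A minimal in-memory sketch of static hashing with per-bucket overflow chains. The class name, the `key % M` hash function, and the access-counting convention (one access for the bucket, one more per overflow record followed) are assumptions for illustration.

```python
class StaticHashFile:
    def __init__(self, num_buckets, bucket_capacity):
        self.buckets = [[] for _ in range(num_buckets)]    # main buckets
        self.overflow = [[] for _ in range(num_buckets)]   # overflow chain per bucket
        self.capacity = bucket_capacity

    def _bucket(self, key):
        return key % len(self.buckets)   # assumed hash function

    def insert(self, key, record):
        i = self._bucket(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:                            # bucket full: chain into overflow
            self.overflow[i].append((key, record))

    def search(self, key):
        """Return (record, accesses)."""
        i = self._bucket(key)
        accesses = 1                     # read the main bucket
        for k, record in self.buckets[i]:
            if k == key:
                return record, accesses
        for k, record in self.overflow[i]:
            accesses += 1                # follow one link in the chain
            if k == key:
                return record, accesses
        return None, accesses


f = StaticHashFile(num_buckets=3, bucket_capacity=2)
for key in (0, 3, 6, 9, 1):              # 0, 3, 6, 9 all hash to bucket 0
    f.insert(key, f"record-{key}")
print(f.search(3), f.search(6), f.search(9))
```

Records that overflow cost extra accesses, so lookups slow down as chains grow.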
Hashed Files - Overflow handling
Static External Hashing
▪ To reduce overflow records, a hash file is typically kept 70-80% full.
▪ A good hash function h
▪ Distributes records uniformly among buckets
▪ Otherwise, search time will increase because many overflow records will exist
▪ A fixed bucket count M is problematic
▪ when the number of records in the file grows or shrinks