Storage Requirements of Databases
• Need data to be stored “permanently” or persistently for long periods of time
• Usually too big to fit in main memory
• Low cost of storage per unit of data and the definition of “very large databases”
• Main cost incurred after storage is of searching the database
• Primary and secondary (auxiliary) file organizations
File Organizations
• Relations usually stored in files as logical “records” and read in terms of physical “blocks”
• File organization refers to the way records are stored in terms of blocks and the way blocks are placed on the storage medium and interlinked.
• Types of organizations
– Unsorted
– Sorted
– Hashing
Records
• Represents a tuple in a relation
• A file is a sequence of records
• Records could be either fixed-length or variable-length
• Records comprise a sequence of fields (columns, attributes)
Blocks
• Refer to physical units of storage in storage devices (Example: sectors in hard disks, pages in virtual memory)
• Of fixed length, based on physical characteristics of the storage/computing device and operating system
• Storage device is either defragmented or fragmented depending on whether contiguous sets of records lie in contiguous blocks
Blocking Factor
The number of records stored in a block is called the “blocking factor”. The blocking factor is constant across blocks if the record length is fixed, and variable otherwise.
If B is the block size and R is the (fixed) record size, with B ≥ R, then the blocking factor is:
$bfr = \lfloor B/R \rfloor$
Since R may not divide B exactly, each block may have some left-over space equal to:
$B - (bfr \times R)$ bytes.
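The arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming fixed-length records and integer division (a block holds only whole records); the function names and the sample sizes B = 512 and R = 100 are made up for the example.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole fixed-length records that fit in one block (integer division)."""
    if block_size <= 0 or record_size <= 0:
        raise ValueError("sizes must be positive")
    return block_size // record_size


def unused_bytes(block_size: int, record_size: int) -> int:
    """Left-over space per block: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Illustrative values only: B = 512 bytes, R = 100 bytes.
B, R = 512, 100
print(blocking_factor(B, R))  # 5
print(unused_bytes(B, R))     # 12
```

A blocking factor of 0 signals that a record does not fit in a single block at all.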
Spanned and Unspanned Records
When the extra space in a block is left unused, the record organization is said to be “unspanned”.
[Figure: a block holding Record 1, Record 2 and Record 3, with the remaining space unused]
Spanned and Unspanned Records
In “spanned” record storage, records can be split so that they “span” across blocks.
[Figure: Block m holds Record 1, Record 2, Record 3 and the first part of Record 4, followed by a pointer p; Block p holds the remaining part of Record 4]
Spanned and Unspanned Records
When record size is greater than block size (i.e. R > B), use of “spanned” record storage is compulsory.
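To compare the two organizations, the following sketch counts the blocks needed to store a file of fixed-length records. It assumes the blocks hold nothing but record bytes (the pointer that links the pieces of a spanned record is ignored), and the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def blocks_unspanned(num_records: int, block_size: int, record_size: int) -> int:
    """Each block holds only whole records; the left-over space is wasted."""
    bfr = block_size // record_size
    if bfr == 0:
        raise ValueError("R > B: unspanned storage is not possible")
    return ceil_div(num_records, bfr)


def blocks_spanned(num_records: int, block_size: int, record_size: int) -> int:
    """Records may cross block boundaries, so blocks are filled completely."""
    return ceil_div(num_records * record_size, block_size)


# Illustrative values only: 1000 records, B = 512 bytes, R = 100 bytes.
print(blocks_unspanned(1000, 512, 100))  # 200
print(blocks_spanned(1000, 512, 100))    # 196
```

With R > B, `blocks_unspanned` raises an error while `blocks_spanned` still returns a valid count.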
Indexes
• Index files
– Secondary or auxiliary files that help speed up data access in primary files
• Indexes or access structures
– Data structures (and search methods) used for fast access
• Single-level index
– Index file maps directly to the block or the address of the record
• Multi-level index
– Multiple levels of indirection among indexes
Definitions
• Indexing field (indexing attribute): The field on which an index structure is built (searching is fast on this field)
• Primary index: An index structure that is defined on the ordering key field (the field that is used to physically order records on disk in sorted file organizations)
Definitions
• Clustering index: When the ordering field is not a key field (i.e. not unique) a clustering index is used instead of a primary index
• Secondary index: An index structure defined on a non-ordering field.
Primary Indexes
• Comprises an ordered file of fixed-length records, each having two fields
• The first field is of the same data type as the ordering key (primary key), and the second field is a block address.
Primary index records are represented by a pair:
(k(i), a(i))
– where k(i) is the key in the i-th index record and a(i) is the address of the block containing the record with that key.
Primary Index
[Figure: an index file of (K(i), a(i)) entries (2003-0101, 2003-0121, 2003-0181, …), each pointing to a block of the ordered data file with fields RollNo, Name, Age, Gender, Grade]
Primary Index
• The number of entries in the index is equal to the number of disk blocks in the ordered data file
• The first record in each block of the file is indexed (in sparse indexes). These records are called anchor records
• A sparse index has index entries for only some of the search values
• A dense index has an index entry for every search key value (every record in the data file). A dense index brings little benefit on an ordered data file, since a sparse index locates the same block with far fewer entries.
Primary Index
• Search:
– Easy. Perform binary search on the index file to identify the block containing the required record
• Insertion / Deletion:
– Easy if key values in records are fixed length and statically allocated to blocks without block spanning (results in wasted space, however).
– Else, re-computation of the index is required on insertion / deletion. Use of overflow buffers may be necessary.
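The search step can be sketched as follows. This is an illustration only, assuming the index is a sorted in-memory list of (anchor key, block address) pairs and the data file is a dictionary from block address to records; the names `find_block`, `lookup` and the sample records are hypothetical.

```python
from bisect import bisect_right


def find_block(index, key):
    """Binary-search the sparse index for the block whose anchor key is the
    largest one not exceeding `key`. Returns None if key precedes all anchors."""
    anchors = [k for k, _ in index]
    pos = bisect_right(anchors, key) - 1
    return index[pos][1] if pos >= 0 else None


def lookup(index, blocks, key):
    """Locate the block through the index, then scan that one block."""
    addr = find_block(index, key)
    if addr is None:
        return None
    for record in blocks[addr]:
        if record[0] == key:
            return record
    return None


# Hypothetical data file: three blocks ordered on the key (first field).
blocks = {
    0: [("2003-0101", "A"), ("2003-0120", "B")],
    1: [("2003-0121", "C"), ("2003-0140", "D")],
    2: [("2003-0181", "E"), ("2003-0200", "F")],
}
# One index entry per block, keyed by the block's anchor record.
index = [("2003-0101", 0), ("2003-0121", 1), ("2003-0181", 2)]

print(lookup(index, blocks, "2003-0140"))  # ('2003-0140', 'D')
print(lookup(index, blocks, "2003-0150"))  # None
```

Only one data block is read per lookup; the rest of the work is the binary search over the much smaller index.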
Clustering Index
• Clustering field: A non-key ordering field. That is, blocks are ordered on this field which does not have the UNIQUE constraint
• The structure of the index file is similar to that of a primary index file, but each index entry points to the first block having the given value in its clustering field.
• One index entry for every distinct value of the clustering field.
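A clustering index of this shape could be built with a single pass over the ordered file. The sketch below assumes the data file is a list of blocks, each a list of records whose first field is the clustering field; the function name and sample data are illustrative.

```python
def build_clustering_index(blocks):
    """Return (value, block_number) pairs: one entry per distinct value of the
    clustering field, pointing to the first block that contains that value."""
    index = []
    for block_no, block in enumerate(blocks):
        for record in block:
            value = record[0]
            # The file is ordered on the clustering field, so a value is new
            # exactly when it differs from the last value indexed.
            if not index or index[-1][0] != value:
                index.append((value, block_no))
    return index


# Hypothetical file ordered on Dept No; a block may hold several values.
blocks = [
    [(1, "a"), (1, "b"), (1, "c"), (2, "d")],
    [(2, "e"), (2, "f"), (3, "g"), (3, "h")],
    [(3, "i"), (4, "j")],
]
print(build_clustering_index(blocks))  # [(1, 0), (2, 0), (3, 1), (4, 2)]
```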
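Clustering Index
[Figure: index entries (K(I), A(I)) for the distinct Dept No values (1, 2, 3, …, 80, …), each pointing to the first block of the data file (Dept No, Name, Gender, DOB, Job) containing that value; a block may hold more than one value, e.g. 1, 1, 1, 2 followed by 2, 2, 3, 3]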
Clustering Index
• A sparse index, since only distinct values are indexed
• Insertion and deletion cause problems when a block can hold more than one value for clustering field
• Alternative solution: Allocate blocks for each value of clustering field.
Clustering Index
[Figure: the same clustering index with separate blocks allocated for each Dept No value (1, 2, …, 80, …, 89); additional blocks hold further records for values 1, 2 and 89]
Secondary Index
• Used to index fields that are not ordering fields; the indexed field may be a key or a non-key field.
• Many secondary indexes are possible on a single file.
• One index entry for every record in the data file (dense index), containing the value of the indexed attribute and a pointer to the block / record.
Secondary Index on Key Field
[Figure: a dense index of (K(i), A(i)) entries sorted on RollNo (2003-0101, 2003-0102, 2003-0103, …), pointing into a data file (RollNo, Name, Age, Dept No, Job) that is not sorted on RollNo]
Has as many index entries as the number of records.
Secondary Index on Key Field
• Since key fields are unique, the number of index entries equals the number of records
• Data file need not be sorted on disk
• Fixed length records for index file
Secondary Index on non-key Field
• When a non-key field is indexed, duplicate values have to be handled.
• There are three different techniques for handling duplicates:
– Duplicate index entries
– Variable length records
– Extra redirection levels
Duplicate Index Entries
[Figure: an index of (K(i), A(i)) entries in which the same K(i) value appears in several consecutive entries, one per matching record]
Index entries are repeated for each duplicate occurrence of the non-key attribute.
Binary search becomes more complicated. Mid-point of a search may have duplicate entries on either side.
Insertion of records may need restructuring of index table.
Variable Length Records
• Use variable length records for index table in order to accommodate duplicate key entries
• For a given key K(i), there is a set of address pointers instead of a single address pointer
• Binary search becomes complicated since address mid points cannot be computed efficiently
• Insertion of records may need restructuring of the index table
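Extra Redirection Levels
[Figure: index entries (K(I), A(I)) for the LabId values 1, 2, 3, 4; each A(I) points to a block of addresses, whose entries point to the records of the data file (RollNo, Name, Age, LabId, Grade) having that LabId]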
Extra Indirection Levels
• Most frequently used technique
• Index records are of fixed length
• A(i) in an index record points to a block of address fields
• Block overflows handled by chaining
• Retrieval requires sequential search within blocks
• Insertion of records straightforward
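The indirection scheme can be sketched with in-memory lists. This is a simplified model under stated assumptions: record addresses are list positions, each address block is an unbounded Python list (so overflow chaining is not modeled), and the names `build_secondary_index` and `search` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_secondary_index(records, field):
    """Return (index, address_blocks). Each index entry is a fixed-size pair
    (key, block_id); address_blocks[block_id] lists the matching addresses."""
    pointers = defaultdict(list)
    for address, record in enumerate(records):
        pointers[record[field]].append(address)
    index, address_blocks = [], []
    for key in sorted(pointers):
        index.append((key, len(address_blocks)))
        address_blocks.append(pointers[key])
    return index, address_blocks


def search(index, address_blocks, key):
    """Binary-search the index, then read the whole address block."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, key)
    if pos == len(keys) or keys[pos] != key:
        return []
    return address_blocks[index[pos][1]]


# Hypothetical data file; LabId is a non-key, non-ordering field.
records = [
    {"RollNo": 1, "LabId": 3},
    {"RollNo": 2, "LabId": 1},
    {"RollNo": 3, "LabId": 3},
    {"RollNo": 4, "LabId": 2},
]
index, address_blocks = build_secondary_index(records, "LabId")
print(index)                             # [(1, 0), (2, 1), (3, 2)]
print(search(index, address_blocks, 3))  # [0, 2]
```

Inserting a record with an existing LabId only appends an address to one block; the index itself is untouched, which is why insertion is straightforward with this technique.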
Multi-level Indexes
• Binary search on a single-level index requires on the order of $\log_2 b$ block accesses, where b is the number of blocks in the index file
• If the bfr of the index file is greater than 2, the number of block accesses can be reduced even further
• Multi-level indexes are meant for such a reduction.
Multi-level Indexes
• Contains several levels of the index file
• Each index block at a given level connects to a maximum of fo number of blocks at the next level. Here fo is called the “fan out” of the index structure
• Block accesses are reduced from about $\log_2 b$ to about $\log_{fo} b$ on average.
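The saving can be estimated numerically. The sketch below assumes one block access per index level and that the fan-out equals the number of index entries per block; the function names and the sample figures (30,000 first-level entries, fan-out 68) are chosen only for illustration.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def binary_search_accesses(index_blocks: int) -> int:
    """Approximate block accesses for binary search on a single-level index."""
    return max(1, math.ceil(math.log2(index_blocks)))


def multilevel_accesses(first_level_entries: int, fan_out: int) -> int:
    """Number of index levels needed until the top level fits in one block;
    a search reads one block per level."""
    if fan_out < 2:
        raise ValueError("fan-out must be at least 2")
    levels = 1
    blocks = ceil_div(first_level_entries, fan_out)
    while blocks > 1:
        blocks = ceil_div(blocks, fan_out)
        levels += 1
    return levels


entries, fo = 30000, 68
b = ceil_div(entries, fo)                # blocks in the first-level index
print(b)                                 # 442
print(binary_search_accesses(b))         # 9
print(multilevel_accesses(entries, fo))  # 3
```

Both counts exclude the final access to the data block itself.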
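A Two-level Index Structure
[Figure: a two-level index. The second (top) level holds the keys 2 and 10; the first (base) level holds the keys 2, 5, 10, 15 in Block 1 and Block 2; the data file holds the keys 2, 4, 5, 8, 10, 12, 15, 18]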
Two-level Index Structure
• First (Base) level is the usual primary index that is maintained in a sorted file
• Second (Top) level is a primary index into the first level index file
• The process can be repeated to any number of levels
• Each level has fewer entries than the level below it by a factor of fo.
Summary
Types of Indexes
                 Ordering field      Non-ordering field
Key field        Primary index       Secondary index (key)
Non-key field    Clustering index    Secondary index (non-key)
Summary
Properties of Indexes
Type of index          Number of (first-level) index entries                   Dense or non-dense
Primary                Number of blocks in data file                           Non-dense
Clustering             Number of distinct index field values                   Non-dense
Secondary (key)        Number of records in data file                          Dense
Secondary (non-key)    Number of records or number of distinct field values    Dense or non-dense
Summary
• Multi-level indexes: several levels of index files
• Characteristic “fan out” property. Fan out fo preferably greater than 2
• Reduces the number of block accesses to the order of $\log_{fo} b$.
Insertions and Deletions
• Balanced property of index trees should be maintained during insertions and deletions
• Insertions and deletions are problematic in a multi-level index, since every index level is a physically sorted file
• An approach to overcome this is to use dynamic multi-level indexes
B-Trees
• A Tree data structure where each node has a predetermined maximum fan-out p
• Terminologies: root node, leaf nodes, internal nodes, parent, children
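Structure of a Node
[Figure: a node with keys K1, K2, …, Ki-1, Ki, each with a data pointer, and tree pointers between the keys. The left-most subtree holds values X < K1, the next subtree holds K1 < X < K2, and the right-most subtree holds values greater than the last key]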
B-Tree constraints
• For a node containing p-1 keys (and hence p subtrees), the following condition must always hold:
– K1 < K2 < … < Kp-1
• For any data element X in the subtree pointed to by Pi, it should always be the case that:
– Ki-1 < X < Ki for an interior subtree, X < K1 for the left-most subtree, and X > Kp-1 for the right-most subtree
B-Tree Constraints
• Each node has at most p tree pointers
• Each node, except the root and leaf nodes, has at least $\lceil p/2 \rceil$ tree pointers (tree balancing constraint)
• The root node has at least 2 tree pointers unless it is the only node in the tree
• All leaf nodes are at the same level. In a leaf node, all tree pointers are null.
B+ Trees
• Most common index structures in RDBMS.
• Leaf and non-leaf nodes have different structures: data pointers are stored only at the leaf nodes
• Leaf nodes form a “dense index” containing every entry for the search field and its corresponding record pointer
• Leaf nodes are linked to provide ordered access to data file records.
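Non-leaf Nodes in B+ Trees
[Figure: a non-leaf node with keys K1, K2, …, Ki-1, Ki and tree pointers only (no data pointers). The left-most subtree holds values X < K1, the next subtree holds K1 < X < K2, and the right-most subtree holds values greater than the last key]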
Leaf Nodes in B+ Trees
[Figure: a leaf node with keys K1, K2, …, Ki-1, Ki, a data pointer for each key, and a pointer to the next leaf node in the tree]
Properties of Leaf Nodes
• Keys along the chain of leaf nodes are organized in sorted order
– K1 < K2 < … < Kn
• Each leaf node has at least p/2 values
• All leaf nodes are at the same level
Searching in B+ Trees
Generalization of binary search.
1. Given a search key k, start from the root node
2. If the key is present in the current node, then success; else
3. If the current node is a leaf node and the key is not present in the node, then the key is not in the database
4. Otherwise, find the tree pointer Pi such that Ki-1 < k ≤ Ki and follow it to the child node
5. Return to step 2 to continue the search.
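The search can be sketched in Python as below. Two assumptions are made here and are not taken from the slides: the subtree under pointer Pi is taken to hold keys k with Ki-1 < k ≤ Ki, and, because data pointers live only in leaf nodes, the sketch always descends to a leaf rather than stopping at an internal node that happens to contain the key. The `Node` class and `search` function are hypothetical names.

```python
from bisect import bisect_left


class Node:
    """Minimal B+ tree node for illustration (no insertion or deletion)."""

    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys                 # sorted keys in this node
        self.children = children or []   # tree pointers (internal nodes only)
        self.pointers = pointers or []   # data pointers (leaf nodes only)
        self.next_leaf = None            # link to the next leaf node

    @property
    def is_leaf(self):
        return not self.children


def search(root, k):
    """Return the data pointer stored for key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # bisect_left picks child i such that keys[i-1] < k <= keys[i].
        node = node.children[bisect_left(node.keys, k)]
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None


# A small hand-built tree: one root with two linked leaves.
leaf1 = Node([3, 5], pointers=["r3", "r5"])
leaf2 = Node([7, 8], pointers=["r7", "r8"])
leaf1.next_leaf = leaf2
root = Node([5], children=[leaf1, leaf2])

print(search(root, 5))  # r5
print(search(root, 7))  # r7
print(search(root, 6))  # None
```

Each iteration of the loop reads one node, so the number of node accesses equals the height of the tree.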
Insertion
• Originally, tree begins with only the root node.
• As and when nodes fill up, they are “split” and made children of a new node.
• Keys are split uniformly across the three nodes.
Insertion
Let p = 2.
Let the insertion sequence of keys be: 5, 8, 3, 7, 2, 9, 17, 10, …
[Figure: the tree after insertion of 5 and 8, a single root node holding the keys 5 and 8]
Insertion of the next key, 3, causes an overflow requiring a split.
Deletion
• Deletion of keys may cause underflows which have to be handled separately
• An underflow occurs when a node contains fewer than p/2 keys
• Nodes are merged with their siblings when underflows occur
Indexes on Multiple Attributes
• All index structures explored so far assume simple attributes, each comprising only one value
• Many applications require multi-attribute (composite) keys
Ordered Index on Multi-attributes
• Considers a composite key as a tuple of simple keys (k1, k2, …kn)
• Ordered index files maintained by ordering each key in sequence.
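Ordering on a composite key is lexicographic, which Python tuples provide directly. The sketch below assumes an in-memory sorted list of ((k1, k2), address) entries with numeric components; the attribute choice (Dept No, Age), the addresses and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def lookup_exact(entries, key):
    """Find the address stored for a full composite key, or None."""
    keys = [k for k, _ in entries]
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return entries[pos][1]
    return None


def lookup_prefix(entries, first):
    """Return addresses of all entries whose first component equals `first`."""
    keys = [k for k, _ in entries]
    lo = bisect_left(keys, (first,))                # (first,) sorts before (first, x)
    hi = bisect_right(keys, (first, float("inf")))  # assumes numeric second component
    return [addr for _, addr in entries[lo:hi]]


# Composite key (Dept No, Age); sorting tuples orders on Dept No, then Age.
entries = sorted([
    ((4, 30), "r1"), ((3, 45), "r2"), ((4, 25), "r3"),
    ((5, 30), "r4"), ((3, 22), "r5"),
])
print(lookup_exact(entries, (3, 45)))  # r2
print(lookup_prefix(entries, 4))       # ['r3', 'r1']
```

An ordered composite index answers queries on the leading component efficiently, but a query on the second component alone gets no help from the ordering.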
Partitioned Hashing
• Given a composite key (k1, k2, …kn), partitioned hashing applies a separate hash function to each component, yielding n hash values
• The hash bucket address is determined by concatenating the n values.
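A minimal sketch of the concatenation step follows. It assumes non-negative integer key components and uses a simple modulo as a stand-in for each per-attribute hash function; the bit widths and the function name are illustrative, not prescribed by the technique.

```python
def partitioned_hash(key, bits):
    """Hash each component of a composite key to a fixed number of bits and
    concatenate the bit strings into a single bucket address."""
    if len(key) != len(bits):
        raise ValueError("one bit width is needed per key component")
    address = 0
    for component, width in zip(key, bits):
        part = component % (1 << width)      # stand-in hash for this attribute
        address = (address << width) | part  # append its bits to the address
    return address


# Hypothetical composite key (Dept No, Age) with 3 and 5 bits respectively.
addr = partitioned_hash((4, 59), (3, 5))
print(addr)               # 155
print(format(addr, "08b"))  # 10011011
```

Because each attribute owns a fixed slice of the address, a search that knows only one attribute can restrict itself to the buckets whose address agrees on that slice.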
Grid Files
• Partitions the range of key values for each key into several buckets
• Combinations of the buckets of each key form a “grid”
• A grid file stores a grid in either a row major or a column major form.
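Grid Files
[Figure: a grid with Roll No. partitions 1 to 5 along one axis and Grade values A, B, C, D along the other; the grid cells refer to buckets in the bucket pool]
Linear scale for Roll No.:
1: 001 – 025
2: 026 – 050
3: 051 – 075
4: 076 – 100
5: 101 – 125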
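Locating a grid cell can be sketched as below, using the Roll No. ranges and the grades A to D from the example. The row-major layout with grades as rows, and the function names, are assumptions made for illustration; the mapping from a cell to an actual bucket in the bucket pool is left out.

```python
# Linear scales: Roll No. ranges (inclusive) and Grade values.
ROLL_SCALE = [(1, 25), (26, 50), (51, 75), (76, 100), (101, 125)]
GRADE_SCALE = ["A", "B", "C", "D"]


def roll_partition(roll_no: int) -> int:
    """Return the 0-based partition of the Roll No. scale containing roll_no."""
    for i, (lo, hi) in enumerate(ROLL_SCALE):
        if lo <= roll_no <= hi:
            return i
    raise ValueError("roll number outside the linear scale")


def grid_cell(roll_no: int, grade: str) -> int:
    """Return the position of the cell in a row-major array of the grid,
    with one row per grade and one column per Roll No. partition."""
    col = roll_partition(roll_no)
    row = GRADE_SCALE.index(grade)
    return row * len(ROLL_SCALE) + col


print(grid_cell(1, "A"))    # 0
print(grid_cell(30, "B"))   # 6
print(grid_cell(125, "D"))  # 19
```

A column-major layout would compute `col * len(GRADE_SCALE) + row` instead.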