View
230
Download
0
Category
Preview:
Citation preview
7/29/2019 Lt20 21 Index
1/28
Database Systems and
Applications
Lecture 20-21
Indexing
7/29/2019 Lt20 21 Index
2/28
Physical model in 3 level design?
Recall 3 levels of database design
Conceptual model: high level abstract description
Logical model: description of a concrete realization
Physical model: implementation using basic components
Analogy with vehicles
Conceptual model: mechanisms to move, turn, stop, ...
Logical models:
Car: accelerator pedal, steering wheel, brake pedal,
Bicycle: pedal forward to move, turn handle, pull brakes on handle
Physical models :
Car: engine, transmission, master cylinder, break lines, brake pads,
Bicycle: chain from pedal to wheels, gears, wire from handle to brakepads
7/29/2019 Lt20 21 Index
3/28
What is a physical data model?
What is a physical data model of a database?
Concepts to implement logical data model
Using current components, e.g. computer hardware,operating systems
In an efficient and fault-tolerant manner
Why learn physical data model concepts?
To be able to use DBMS facilities for performancetuning
For example, If a query is running slow,
one may create an index to speed it up
For example, if loading of a large number of tuplestakes for ever
one may drop indices on the table before the inserts
and recreate index after inserts are done!
7/29/2019 Lt20 21 Index
4/28
Concepts in a physical data model
Database concepts
Conceptual data model - entity, (multi-valued) attributes,relationship,
Logical model - relations, atomic attributes, primary andforeign keys
Physical model - secondary storage hardware, file structures,indices,
Examples of physical model concepts from relational DBMS
Secondary storage hardware: Disk drives
File structures - sorted
Auxiliary search structure -
search trees (hierarchical collections of one-dimensional ranges)
7/29/2019 Lt20 21 Index
5/28
Storage Hierarchy in Computers
Computers have several componentsCentral Processing Unit (CPU)Input, output devices, e.g. mouse, keyword, monitors, printersCommunication mechanisms, e.g. internal bus, network card,modemStorage Hierarchy
Types of storage DevicesMain memories - fast but content is lost when power is offSecondary storage - slower, retains content without powerTertiary storage - very slow, retains content, very large capacity
DBMS usually manage dataon secondary storage, e.g. disksUse main memory to improve performanceUser tertiary storage (e.g. tapes) for backup, archival etc.
7/29/2019 Lt20 21 Index
6/28
Software view of Disks: Fields,Records and File
Views of secondary storage (e.g. disks)Hardware views previous slides.
Software views - Data on disks is organized into fields, records, files
ConceptsField presents a property or attribute of a relation or an entity
Records represent a row in a relational table
Collection of fields for attributes in relational schema of the table
Files are collections of records /A relation is typically stored as a file ofrecords.
Homogeneous collection of records may represent a relation
Heterogeneous collections may be a union of related relations.
7/29/2019 Lt20 21 Index
7/28
File organization
7/29/2019 Lt20 21 Index
8/28
DBMS Plan Executor ParserQuery
EvaluationOperator Evaluator Optimizer
Engine%
File and Access Methods
Transaction
Manager
Buffer ManagerRecovery
Manager
Lock
ManagerDisk Space Manager
Concurrency Control%
Index FilesSystem
Data FilesCatalog
Web Forms Application Front Ends SQL Interface
SQL Commands
DBMS ARCHITECTURE
7/29/2019 Lt20 21 Index
9/28
Data on External Storage Disks: Can retrieve random page at fixed cost.
But reading several consecutive pages is much cheaper thanreading them in random order.
Tapes: Can only read pages in sequence. Cheaper than disks; used for archival storage.
File organization: Method of arranging a file of recordson external storage. Record id (rid) is sufficient to physically locate record.
Indexes are data structures that allow us to find the record ids ofrecords with given values in index search key fields.
Architecture: Buffer manager fetches pages fromexternal storage to main memory buffer pool. File and
index layers make calls to the buffer manager.
7/29/2019 Lt20 21 Index
10/28
Alternative File Organizations Many alternatives exist, each ideal for some situations,
and not so good in others:
Heap (random order) files: Suitable when typical access is a filescan retrieving all records.
Sorted Files: Best if records must be retrieved in some order, oronly a `range of records is needed.
Indexes: Data structures to organize records via trees orhashing.
Like sorted files, they speed up searches for a subset of records, basedon values in certain (search key) fields.
Updates are much faster than in sorted files.
7/29/2019 Lt20 21 Index
11/28
Indexes
An indexon a file speeds up selections on the
search key fields for the index. Any subset of the fields of a relation can be the
search key for an index on the relation.
Search keyis not the same as key(minimal set offields that uniquely identify a record in a relation).
An index contains a collection ofdata entries,and supports efficient retrieval of all dataentries k* with a given key value k.
A t t D t E t
7/29/2019 Lt20 21 Index
12/28
A ternat ves or Data Entry nIndex
Three alternatives: Data record with key value k
Choice of alternative for data entries is
orthogonal (list of axes) to the indexingtechnique used to locate data entries with agiven key value k. Examples of indexing techniques: B+ trees, hash-
based structures.
Typically, index contains auxiliary information thatdirects searches to the desired data entries.
Al i f D E i
7/29/2019 Lt20 21 Index
13/28
Alternatives for Data Entries(Contd.)
Alternative 1: If this is used, index structure is a file
organization for data records (instead of aHeap file or sorted file).
At most one index on a given collection of datarecords can use Alternative 1. (Otherwise,data records are duplicated, leading toredundant storage and potentialinconsistency.)
If data records are very large, # of pagescontaining data entries is high.
7/29/2019 Lt20 21 Index
14/28
erna ves or a a n r es(Contd.)
Alternatives 2 and 3: Data entries typically much smaller than data
records. So, better than Alternative 1 withlarge data records, especially if search keysare small. (Portion of index structure used to
direct search, which depends on size of dataentries, is much smaller than with Alternative1.)
Alternative 3 more compact than Alternative 2,
but leads to variable sized data entries even ifsearch keys are of fixed length.
7/29/2019 Lt20 21 Index
15/28
Example of Alternative 1
8bluerectangle6
4bluesquare5
2blueround4
Red
Red
Red
colour
3
2
1
Location
8rectangle
4square
2round
holesshape
6 data entries,sorted by colour
7/29/2019 Lt20 21 Index
16/28
Example of Alternative 2
blue6
blue5
blue4
Red
Red
Red
colour
3
2
1
Location
6 data entries,sorted bycolour
7/29/2019 Lt20 21 Index
17/28
Example of Alternative 3
Locations
colour
1, 2, 3 Red
4,5,6 Blue
2 dataentries,
variablelength
7/29/2019 Lt20 21 Index
18/28
Index Classification
Primaryvs. secondary: If search key contains primary key,
then called primary index. Unique index: Search key contains a candidate key.
Clustered vs. unclustered: If order of data records is thesame as, or `close to, order of data entries, then calledclustered index.
Alternative 1 implies clustered; in practice, clustered also impliesAlternative 1 (since sorted files are rare).
A file can be clustered on at most one search key.
Cost of retrieving data records through index varies greatlybasedon whether index is clustered or not!
7/29/2019 Lt20 21 Index
19/28
Suppose that Alternative (2) is used fordata entries, and that the data recordsare stored in a Heap file.
To build clustered index, first sort theHeap file (with some free space on eachpage for future inserts).
Overflow pages may be needed for inserts.(Thus, order of data recs is `close to, but
not identical to, the sort order.)
Index Classification(cntd)
7/29/2019 Lt20 21 Index
20/28
Clustered vs. Unclustered Index
CLUSTERED Index entries UNCLUSTEREDdirect searchfor data entries
Data entries Data entries
(index file)
(data file)
Data Records DataRecords
7/29/2019 Lt20 21 Index
21/28
Hash-Based Indexes
Good for equality selections. Index is a collection ofbuckets.Bucket =primarypage plus zero or more
overflow pages.
Hashing function h: h(r) = bucket in whichrecord rbelongs. h looks at the search keyfields ofr.
If Alternative (1) is used, the buckets contain the datarecords; otherwise, they contain
or pairs.
7/29/2019 Lt20 21 Index
22/28
Samu,44,3000
Jones,40,6003
Tracy,44,5004Sahu,25,3000
Kones,33,4003
Tincy,29,2007Sanju,50,5004
John,22,6003
h1
age
h(age)=00
h(age)=10
h(age)=01
3000
3000
5004
5004
4003
2007
6003
6003
h2
sal
h(sal)=00
h(sal)=11
7/29/2019 Lt20 21 Index
23/28
Tree Indexes
Non-leafpages
Leaf pages contain data entries and are chained (prev. &next).
Non-leaf pages contain index entries and direct searches.
leaf
pages
7/29/2019 Lt20 21 Index
24/28
Cost Model for Our Analysis
We ignore CPU costs, for simplicity: B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page
Measuring number of page I/Os ignores gains ofpre-fetching a sequence of pages; thus, even I/O
cost is only approximated. Average-case analysis; based on several
simplistic assumptions.* Good enough to show the overall trends!
7/29/2019 Lt20 21 Index
25/28
Comparing File Organizations
Heap files (random order; insert at eof)
Sorted files, sorted on
Clustered B+ tree file, Alternative (1),search key
Heap file with unclustered B + treeindex on search key < age, sal >
Heap file with unclustered hash index onsearch key < age, sal >
7/29/2019 Lt20 21 Index
26/28
Operations to Compare
Scan: Fetch all records from disk
Equality search
Range selection Insert a record
Delete a record
7/29/2019 Lt20 21 Index
27/28
Assumptions in Our Analysis
Heap Files: Equality selection on key; exactly one match.
Sorted Files:
Files compacted after deletions. Indexes:
Alt (2), (3): data entry size = 10% size of record
Hash: No overflow buckets.
80% page occupancy => File size = 1.25 data sizeTree: 67% occupancy (this is typical).
Implies file size = 1.5 data size
7/29/2019 Lt20 21 Index
28/28
(a)scan
(b)equality
(c) range (d)insert
(e)delete
1.Heap BD 0.5BD BD 2D Searc
h+D
2.Sorted BD Dlog2B Dlog2B+#matchingpages
Search+BD
Search+BD
3.Clustered 1.5BD DlogF1.5B
DlogF1.5B+#matchingpages
Search+D
Search+D
4.Unclusteredtree index
BD(R+0.15)
D(1+logF0.15B)
D(logF0.15B+#matchingrecords)
D(3+logF0.15B)
Search+2D
5.Unclusteredhash index
BD(R+0.125)
2D BD 4D Search+2D
Recommended