Select Operation: Disk Access and Indexing
*Some info on slides from Dr. S. Son, U. Va
Disk access
• DBs traditionally stored on disk
• Cheaper to store on disk than in memory
• Costs for: seek time, latency, data transfer time
• Disk access is page (block) oriented
• 2 - 4 KB page size
Access time
• Access time is the time to randomly access a page
• System initially determines if page in memory buffer (page tables, etc.)
• Large disparity between disk access and memory access
Select operation using table scan
• Reading the entire table to answer a select is a table scan
• Improvements to table scan of disk:
– Parallel access
– Sequential prefetch
Parallel access
• Linear search - all data rows read in from disk
– I/O parallelism can be used (RAID)
• multiple I/O read requests satisfied at the same time
• stripe the data across different disks
– Problems with parallelism?
• must balance disk arm load to gain maximum parallelism
• requires the same total number of random I/Os, but uses each device for a shorter time
Sequential prefetch I/O
• Retrieve one disk page after another (on same track) – (32 in DB2, varies in Oracle)
• Seek time no longer a problem
• Must know in advance to read 32 successive pages
• Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)
Access time
• Seek time –as low as 4 ms server
• Latency time –as low as 1 ms or less
• Data transfer time – .4-2 ms
• Solid state disks up to 100,000 I/Os per sec. – still expensive
Access time for fast I/O
Times in seconds                 RIO       Seq. Prefetch
Seek - disk arm to cylinder      .004      .004
Latency - platter to sector      .001      .001
Data transfer                    .0005     .016      (1 page vs. 32 pages)
Total                            .0055     .021
32 pages for both                .176*     .021
* .0055 x 32 = .176 for 32 pages of RIO vs. .021 for 32 pages of Seq. Prefetch
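The arithmetic in this table can be reproduced with a few lines of Python. This is only a sketch using the slide's timings; the function names are illustrative, and the model assumes sequential prefetch pays one seek and one rotational delay for the whole run of pages.

```python
# Timings in seconds, taken from the access-time table.
SEEK = 0.004
LATENCY = 0.001
TRANSFER_PER_PAGE = 0.0005


def random_io_time(pages):
    """Every page pays its own seek, rotational latency and transfer."""
    return pages * (SEEK + LATENCY + TRANSFER_PER_PAGE)


def sequential_prefetch_time(pages):
    """One seek and one latency, then the pages stream off the track."""
    return SEEK + LATENCY + pages * TRANSFER_PER_PAGE


rio = random_io_time(32)
sp = sequential_prefetch_time(32)
print(f"32 pages: RIO {rio:.3f} s, sequential prefetch {sp:.3f} s")
```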
Organizing disk space
• How should data be stored to minimize access time when the entire table is read?
Disk allocation
• Disk Resource Allocation for Databases (DBA has control)
• Goal – contiguous sectors on disk - want data as close together as possible to minimize seek time
• No standard SQL approach, but general way to deal with allocation
• Some OS allow specification of size of file and disk device
Types of Files
• Heap files (unordered – sequential)
• Sorted files (ordered – sort key)
• Hash files (hash key, hash function)
• B+-trees
• Storage Area Networks (SAN) – ERP (enterprise resource planning) and DW (data warehouses)
– Storage devices configured as nodes in network – can attach/detach
Tablespace
Tablespace is:
• Allocation medium for tables and indexes for ORACLE, DB2, etc.
• Can put >1 table in a tablespace if accessed together
• Tablespace corresponds to 1 or more OS files and can span disk devices
• Usually relations cannot span disk devices
DB storage structures
[Figure: storage hierarchy. DB: Company Database. Tablespaces: tspace 1, system. OS files: fname1, fname2, fname3. Tables: Empl, Dept, Proj, Dep; index: EmpIndx. Segments: data, data, data, data, index. Extents.]
Tablespace
• ORACLE DBs contain several tablespaces, including one called system - data description + indexes + user-defined tables
• default tablespace given to each user
• if multiple tablespaces - better control over load balancing
• can take some disk space off-line
Extent
• Relation composed of 1 or more extents
• Extent - contiguous storage on disk
• when data segment or index segment first created, given an initial extent from tablespace, 10KB (5 pages)
• if need more space given next contiguous extent
Extent
• Can increase the size by a positive % (cannot decrease)
– initial n - size of initial extent
– next n - size of next
– max extents - maximum number of extents
– min extents - number of extents initially allocated
– pct increase n - % by which next extent grows over previous one
Oracle create tablespace
• http://www.adp-gmbh.ch/ora/sql/create_tablespace.html
Create table
• Create table statement - can specify tablespace, no. of extents
– When initial extent full, new extent allocated
– pctfree - determines how much space in a page can be used for inserts of new rows
• if pctfree = 10%, inserts stop when page is 90% full
» Uses another page
– pctused - determines when new inserts start again
• if space used falls below a certain percentage of total, default pctused = 40%
• pctfree + pctused < 100
Rows
• Row layout on each disk page
| Header info | Row directory: 1 2 3 … N | free space | Data rows: Row N, Row N-1, …, Row 1 |
• Header -
• Row directory – row number and page byte offset
– Row number is row number in page – also called slot#
– Page byte offset – with varchar, row size not constant
• To identify a particular row use RID (RowID) – page #, slot # [file#]
– slot# is number in row directory (logical #)
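A small Python sketch of this page layout may help: the row directory grows from the front of the page, rows are packed backward from the end, and a RID is (page #, slot #). The class name `SlottedPage`, the 2 KB page, the 16-byte header and the 4-byte directory entries are all assumptions made for illustration, not the layout of any particular DBMS.

```python
class SlottedPage:
    """Toy slotted page: directory at the front, rows packed from the back."""

    HEADER_BYTES = 16   # assumed header size
    SLOT_BYTES = 4      # assumed size of one row-directory entry

    def __init__(self, page_no, size=2048):
        self.page_no = page_no
        self.data = bytearray(size)
        self.directory = []      # slot# - 1 -> (page byte offset, row length)
        self.free_end = size     # first byte used by data rows

    def free_space(self):
        used_front = self.HEADER_BYTES + self.SLOT_BYTES * len(self.directory)
        return self.free_end - used_front

    def insert(self, row: bytes):
        """Store a row; return its RID (page#, slot#) or None if it won't fit."""
        if len(row) + self.SLOT_BYTES > self.free_space():
            return None
        self.free_end -= len(row)
        self.data[self.free_end:self.free_end + len(row)] = row
        self.directory.append((self.free_end, len(row)))
        return (self.page_no, len(self.directory))   # slots numbered from 1

    def fetch(self, rid):
        """Look up a row through the directory, so varchar rows work too."""
        page_no, slot = rid
        if page_no != self.page_no:
            raise KeyError(rid)
        offset, length = self.directory[slot - 1]
        return bytes(self.data[offset:offset + length])


page = SlottedPage(page_no=7)
rid_a = page.insert(b"Smith,Research")
rid_b = page.insert(b"Wong,Administration")
print(rid_a, rid_b, page.fetch(rid_b))
```

Because the RID names a slot rather than a byte offset, a row can be moved within the page by updating only its directory entry.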
Differences in DBMSs re: rows
• ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule)
• ORACLE
– rows can be split between pages (row record fragmentation)
– Can have rows from multiple tables on same page, more info
• DB2, no splitting, entire row moved to new page, need forwarding pointer
Select operation using Indexes
• Alternative to table scan
Why use an index?
• If a select (or join) on the same attribute is used frequently
• want a way to improve performance - use indexes
– For example:
Select * from Employee
where ssn = 333445555
B+-tree
• Most commonly used index structure type in DBs today
• Based on B-tree
• Good for equality and range searches
• B+ tree: dynamic, adjusts gracefully under inserts and deletes.
• Used to minimize disk I/O
• available in DB2; ORACLE also has hash cluster; Ingres has heap structure, B-tree, isam (chain together new nodes)
Structure of B+ Trees
• leaf level pointers to data (RIDs)
• the remaining are directory (index) nodes that point to other index nodes
[Figure: index entries in the upper levels (direct search); data entries ("sequence set") at the leaf level]
Example of B+Tree
Index node: 10 20 40
Leaf nodes: [1 2 3] [10 12] [20 35] [40 42 50]
Leaf entries point to data
Characteristics of B+ Tree
• Order of tree (fan out) – max number of child nodes
• Minimum 50% occupancy (except for root). Each node contains d/2 <= m <= d-1 entries. – Where the parameter d is the order of the tree.
• Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)
• Supports equality and range-searches efficiently
Cost of I/O for B+-tree
• One index node is one page
• If tree with depth of 3, 3 I/Os to get pointer to data
• An index node that has been read in can remain in memory
– likely since frequent access to upper-level nodes of actively used B+-trees
B+ Trees in Practice
• Typical order: between 100-200 children
• Typical fill-factor: 2/3 full (66.6%)
– average fanout = 133 (if 200 children)
• Typical capacities:
– Height 4: 133^4 = 312,900,721 records
– Height 3: 133^3 = 2,352,637 records
• Can often hold top levels in buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 138 MBytes
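The capacity figures follow directly from the average fanout, and can be recomputed with a short Python sketch. It assumes 200 children per node, a 2/3 fill factor and 8 KB pages (8 x 1024 bytes); the helper names are illustrative.

```python
PAGE_BYTES = 8 * 1024       # assumed page size
MAX_CHILDREN = 200          # order of the tree
FILL_FACTOR = 2 / 3         # typical occupancy

fanout = round(MAX_CHILDREN * FILL_FACTOR)   # average fanout


def records_at_height(height):
    """Records reachable through a tree with `height` levels of fanout."""
    return fanout ** height


def pages_at_level(level):
    """Number of index pages on a level (the root is level 1)."""
    return fanout ** (level - 1)


for height in (3, 4):
    print(f"height {height}: {records_at_height(height):,} records")

for level in (1, 2, 3):
    pages = pages_at_level(level)
    megabytes = pages * PAGE_BYTES / 2**20
    print(f"level {level}: {pages:,} pages = {megabytes:.2f} MB")
```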
Why B+-tree
• Directory structure - retrieve range of values efficiently
– search for leftmost index entry Si such that X <= Si
• Index entries always in sequence by value - can use sequential prefetch on index
• Index entries shorter than data rows - less I/O
B+-tree
• Balancing of B+-trees - insert, delete
• Nodes usually not full
• Utilities to reorganize to lower disk I/O
• Most systems allow nodes to become depopulated- no automatic algorithm to balance
• Average node below root level 71% full in active growing B+-trees
Duplicate key values
• Duplicate key values in index
• leaf nodes have sibling pointers
• but a delete of a row that has a heavily duplicated key entails a long search through the leaf-level of the B+-tree
• Index compression - with multiple duplicates
| header info | PrX keyval RID RID ... RID | PrX keyval RID…RID |
where PrX is count of RID values
Create Index
Options:
• multiple columns
• tablespace
• storage - initial extents, etc.
• percent free (default = 10) - % of each page left unfilled (creation)
• free page (1 free page for every n index pages during creation)
Types of indexes (textbook)
• Primary index - key field is a candidate key (must be unique)
– data file ordered by key field
• Clustering index - key field is not unique, data file is ordered
– all records with same values on same pages
• Secondary index - non-clustering index
– data file not ordered
• First record in the data page (or block) is called the anchor record
• Non-dense index - pointer in index entry points to anchor
• Dense index - pointer to every record in the file
Clustering
• Efficiency advantage: read in a page, get all of the rows with the same value
• clustering is useful for range queries, e.g. between keyval1 and keyval2
Clustering
• Can only cluster table by 1 clustering index at a time
• In SQL Server
– creates clustered index on PK automatically if no other clustered index on table and PK nonclustered index not specified
• In DB2
– if the table is empty, rows sorted as placed on disk
– subsequent insertions not clustered, must use REORG
• In Oracle
– Cluster index – now available for PK in 10g
– Define a cluster to create cluster index for 2 tables
Indexes vs. table scan
• To illustrate the difference between table scan, secondary (non-clustered) index and clustered index
Assume 10M customers, 200 cities, 2KB/page, row = 100 bytes, 20 rows/page
Select *
From Customers
Where city = 'Birmingham'
If assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city
Rules of Thumb for I/O
• Assume slightly slower times than before:
– Random I/O – 160 pages/second, .00625 seconds/page
– Sequential prefetch I/O – 1600 pages/second, .000625 seconds/page
• Will discuss later:
– List prefetch I/O – 400 pages/second, .0025 seconds/page
Table Scan
Table Scan - read entire table
If used random I/O (RIO) – WHICH ONE WOULD NEVER DO
10,000,000/20 = 500,000 pages
500,000*RIO = 3125 seconds
Instead, it makes more sense to use sequential prefetch (SP): read 32 pages at a time
500,000*SP = 312.5 seconds
Clustering Index
• All entries for B'ham clustered on same pages
• 50,000/20 = 2500 data pages (with 20 rows per page)
• Assume 3 upper nodes of the tree
• Assume 1000 index entries per leaf node, read 50,000/1000 = 50 index pages
3 + 50,000/1000 + 50,000/20 = 2,553 pages to access
• If top 3 levels of tree in memory, count access time as 0
• Access time:
(3*0) + (50*SP) + (2500*SP) = 2,550 * .000625 = 1.6
Secondary Index
• In the worst case 1 entry for B'ham per page
• 50,000 data pages (10M/200)
3 + 50 + 50,000 = 50,053 page accesses
(3*0) + (50*SP) + (50,000*RIO) = 312.53 seconds access time
REALLY slow – see next slide for a better solution!
Use List Prefetch instead of RIO
List Prefetch – Better solution
Create list of data pages to access
Pages not necessarily in contiguous sequential order
System orders pages to minimize disk I/O
E.g. elevator algorithm for disk request scheduling
Using list prefetch (LP)
0+(50*SP)+50,000*LP=125.03 access time
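The access-time estimates for the Customers example can be compared side by side with a short Python sketch. It uses the rule-of-thumb rates from these slides and assumes, as the slides do, that the upper index levels are in memory and cost nothing; the variable names are illustrative.

```python
# Rule-of-thumb I/O costs in seconds per page.
RIO = 1 / 160     # random I/O
SP = 1 / 1600     # sequential prefetch
LP = 1 / 400      # list prefetch

ROWS = 10_000_000
ROWS_PER_PAGE = 20
MATCHING_ROWS = ROWS // 200          # selectivity 1/200
ENTRIES_PER_LEAF = 1000

table_pages = ROWS // ROWS_PER_PAGE                  # pages in the table
leaf_pages = MATCHING_ROWS // ENTRIES_PER_LEAF       # index leaf pages read
clustered_pages = MATCHING_ROWS // ROWS_PER_PAGE     # matching rows packed together

costs = {
    "table scan, random I/O": table_pages * RIO,
    "table scan, sequential prefetch": table_pages * SP,
    "clustering index": leaf_pages * SP + clustered_pages * SP,
    # Worst case for a secondary index: one matching row per data page.
    "secondary index, random I/O": leaf_pages * SP + MATCHING_ROWS * RIO,
    "secondary index, list prefetch": leaf_pages * SP + MATCHING_ROWS * LP,
}

for plan, seconds in costs.items():
    print(f"{plan:34s} {seconds:10.2f} s")
```

Note that under these assumptions a secondary index read with random I/O costs about the same as a sequential-prefetch table scan, which is why list prefetch matters.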
% Free
• Redo the previous calculations assuming relations created with 50% free option specified.
Creating Indexes
• When determining what indexes to create consider:
– workload - mix of queries and frequencies of requests
• 20% of requests are updates, etc.
– can create lots of indexes but:
• cost to create
• insertions
• initial load time high if a large table
• index entries can become longer and longer as multiple columns included
Multiple Indexes
• More than one index on a relation
– e.g. age - one index, class - one index, gender - one index
Composite Index
• One index based on more than one attribute
Create Index index_name on Table (col1, col2,... coln)
• Composite index entry - values for each attribute
– for age, class, gender the entry in the index is: C1, C2, C3, RID
Using Indexes
• System must decide whether to use an index
• What if more than one index, which one?
• What if composite index?
Plans using Indexes
Can use an index if index matches select condition in where clause:
1. A matching index scan - only have to access a limited number of contiguous leaf entries to access data
2. Predicate screening with matching index scan – use index entries to eliminate RIDs
3. Non-matching index scan – use index to identify RIDs
4. Index-only retrieval – don’t access data, RIDs only
5. Multiple index retrieval – use >1 index to identify RIDs
Matching index scan
Definition of a matching index scan - Only have to access contiguous leaf nodes
1) Single where clause and index matches
Create index Idx1 on T1 (C1)
Select * from T1
where C1=10
• search B+-tree to leaf level for leftmost entry having specified values
• useful for =, between
Matching Index Scan
2) If multiple where clauses and all '='
Select * from T1 where C1=10 and C2=5
i) if there is a composite index and select columns match all index columns, e.g.
Create index Idx2 on T1 (C1, C2)
only have to read contiguous leaf pages
ii) if there is a separate index for each clause, e.g.
Create index idx3 on T1(C1);
Create index idx4 on T1(C2);
must choose one or more of the indexes (later)
Matching Index Scan - Rules
A matching scan can be used ONLY IF one of the columns in the select condition is the first column of the index
Decide how many attributes to match in a composite index after the first column, so can read in a small contiguous range of leaf entries in B+-tree to get RIDs
• Match first column of composite index then:
– look at index columns from left to right
– Match ends when no predicate found
– If range (<=, like, between) for a column, match terminates thereafter
• easier to scan all entries for range – process rest of entries using predicate screening
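The matching rules can be expressed as a short Python sketch. The function `matching_columns` and its representation of predicates as a column-to-operator dictionary are hypothetical, and the sketch treats every operator other than `=` as a range that ends the match; a real optimizer has more cases.

```python
def matching_columns(index_columns, predicates):
    """Return the leading index columns usable for a matching index scan.

    index_columns: columns of the composite index, in index order.
    predicates:    dict mapping column name -> operator in the where clause.
    """
    matched = []
    for column in index_columns:          # look at index columns left to right
        operator = predicates.get(column)
        if operator is None:              # match ends when no predicate found
            break
        matched.append(column)
        if operator != "=":               # a range matches, then terminates the match
            break
    return matched


# C1 and C2 match; C4 can only be used for predicate screening.
print(matching_columns(["C1", "C2", "C3", "C4"], {"C1": "=", "C2": "=", "C4": "="}))
# The range on C2 ends the match, so C3 is left for screening.
print(matching_columns(["C1", "C2", "C3"], {"C1": "=", "C2": "between", "C3": "="}))
# No predicate on the first column: an empty result means a non-matching scan.
print(matching_columns(["C1", "C2", "C3"], {"C2": "=", "C3": "="}))
```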
Matching Index Scan with Predicate screening
1) If select conditions match some index columns of composite index
Create index idx6 on T1(C1, C2, C3, C4);
Select * from T1
where C1=10 and C2=3 and C4=20
• Access contiguous leaf pages, but not all results on contiguous leaf pages
• Must examine index entries to determine if in the result -- called predicate screening
Matching Index Scan with Predicate screening
Another example:
2) If all select conditions match composite index columns and some selects are a range
Create index idx7 on T1(C1, C2, C3);
Select * from T1 where C1=10 and C2 between 1 and 5 and C3 = 'F'
Advantages to Predicate screening
• discard RIDs based on values (for index)
• will access fewer tuples because RIDs used to eliminate potential tuples
Non-matching index scan
• Not always used by DBMSs
• attributes in where clause don't include initial attribute of index
Create index idx3 on T1(C1, C2, C3);
Select * from T1 where C2=2 and C3='M'
• Search leaf entries of index and compare values for entries
• must read in all index leaf pages to find C2, C3 value (so why do it?)
– 50 index pages vs 500,000 data pages
Index only retrieval
• Elements retrieved in select clause are attributes of composite index
• Don't need to access rows (actual data)
Create index idx5 on T1(C1, C3);
Select C1, C3 from T1 where C1=5 and C3 between 2 and 5
Select sum(C3) from T1
Multiple Index Access
• If conjunctive conditions (and) in where clause, can use >1 index
– Extract RIDs from each index satisfying matching predicate
– Intersect lists of RIDs (and them) from each index
– Final list - satisfies all predicates indexed
• If disjunctive conditions (or)
– Union the two lists of RIDs
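Combining RID lists is plain set intersection and union, as in this Python sketch. RIDs are modeled as (page #, slot #) tuples, and the sample lists are made up for illustration. Sorting the result puts RIDs in page order, which is the order a list prefetch would want.

```python
def intersect_rids(*rid_lists):
    """RIDs that appear in every list (conjunctive 'and' predicates)."""
    result = set(rid_lists[0])
    for rids in rid_lists[1:]:
        result &= set(rids)
    return sorted(result)          # page order, ready for list prefetch


def union_rids(*rid_lists):
    """RIDs that appear in any list (disjunctive 'or' predicates)."""
    return sorted(set().union(*rid_lists))


# Hypothetical RID lists extracted from two single-column indexes.
age_rids = [(12, 3), (12, 7), (40, 1), (95, 2)]
hobby_rids = [(12, 7), (33, 4), (95, 2)]

print(intersect_rids(age_rids, hobby_rids))
print(union_rids(age_rids, hobby_rids))
```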
Some Query optimizer rules for using RID-lists (then use list prefetch)
1. predicted active resulting RIDs must not be > 50% of RID pool
2. Limit to any single RID list the size of the RID memory pool (16M RIDs)
3. RID list cannot be generated by screening predicates
Rules for multiple index Access
Optimizer determines diminishing returns using multiple index access
1. List indexes with matching predicates in where clause
2. Place indexes in order by increasing filter factor
3. For successive indexes, extract RID list only if reduced cost for final row returned
e.g. no sense reading 100's of pages of a new index to get number of rows to only 1 tuple
Example: Using RID lists with Multiple Indexes
Prospects Table: 50M rows - 10 rows per page
Pages in table: 5,000,000
There are 4 Indexes:
• age – 50 values (1000 entries per page)
• zipcode – 100,000 values (100 entries per page)
• hobby – 100 values (1000 entries per page)
• incomeclass – 10 values (1000 entries per page)
Problem cont’d
Select name, straddr from prospects
where zipcode between 02159 and 02658
and age = 40 and hobby = 'chess' and incomeclass = 10;
Compute FF: Make sure in ascending order
• FF(zipcode) = 500/100,000 = 1/200
• FF(hobby) = 1/100
• FF(age) = 1/50
• FF(incomeclass) = 1/10
Problem cont’d
Data rows read if use indexes:
(1) 50,000,000/200 = 250,000
(1,2) 250,000/100 = 2500
(1,2,3) 2500/50 = 50
(1,2,3,4) 50/10 = 5
How much time will this take? Is it cost effective to use all of these indexes?
Problem cont’d I/O costs
• Cost:
– Random IO: RIO = 1/160 = .00625
– Sequential Prefetch: SP = 1/1600 = .000625
– List Prefetch: LP = 1/400 = .0025
• Note:
– Some textbooks assume if read <= 3 pages use RIO
– They also assume non-leaf nodes RIO, we assume in memory so it takes 0 disk access time
Problem cont’d
Table scan:
50M/10 per page * SP
Total time: 5,000,000 * 0.000625 = 3125
Using index 1: (100 entries per page)
data: 50M*FF*LP
250,000 * 0.0025 = 625
index:
(non-leaf pages + (# leaf entries * FF / entries per page)) * SP
(3*0) + (50,000,000/200/100) * 0.000625 = 1.56
Total time: 1.56 + 625 = 626.56
Problem cont’d
Using indexes 1&2:
data: 250,000/100 * LP
2500 * 0.0025 = 6.25
index 2: (1000 entries per page)
(3*0) + (50,000,000/100/1000)* 0.000625 = 0.3125
To use both indexes: 1.56 + 0.3125 = 1.8725
Total time: 1.8725 + 6.25 = 8.1225
Problem cont’d
Using indexes 1,2,3:
data: 50 * 0.0025 = 0.125
index 3: (1000 entries per page)
(3*0) + (50,000,000/50/1000) * .000625= .625
To use 3 indexes: 1.56 + 0.3125 + 0.625 = 2.4975
Total time: 2.4975 + 0.125 = 2.6225
Using indexes 1,2,3,4:
data: 5 * 0.0025 = 0.0125
index 4: (1000 entries per page)
(3*0)+ (50,000,000/10/1000)*.000625 = 3.125
To use 4 indexes: 1.56+0.3125+0.625+3.125=5.6225
Total time: 5.6225 + 0.0125 = 5.635
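The whole calculation can be scripted to show where adding another index stops paying off. The Python sketch below uses the slide's cost model: index leaf pages are read with sequential prefetch, data pages with list prefetch at one page per remaining row, and non-leaf pages are free. The names are illustrative. It keeps the unrounded index cost 1.5625 instead of 1.56, so totals differ from the slide values in the third decimal place.

```python
SP = 1 / 1600     # sequential prefetch, seconds per page
LP = 1 / 400      # list prefetch, seconds per page
TABLE_ROWS = 50_000_000

# (index, 1/FF, entries per leaf page), ordered by increasing filter factor.
INDEXES = [
    ("zipcode", 200, 100),
    ("hobby", 100, 1000),
    ("age", 50, 1000),
    ("incomeclass", 10, 1000),
]


def plan_costs():
    """Cost of using the first 1, 2, 3, 4 indexes together."""
    rows = TABLE_ROWS
    index_cost = 0.0
    plans = []
    for name, inverse_ff, entries_per_page in INDEXES:
        leaf_pages = TABLE_ROWS // (inverse_ff * entries_per_page)
        index_cost += leaf_pages * SP      # scan matching leaf pages of this index
        rows //= inverse_ff                # rows left after this predicate
        data_cost = rows * LP              # worst case: one data page per row
        plans.append((name, rows, index_cost, data_cost, index_cost + data_cost))
    return plans


plans = plan_costs()
for count, (name, rows, index_cost, data_cost, total) in enumerate(plans, start=1):
    print(f"{count} index(es), last = {name:12s} rows = {rows:7,d} "
          f"index = {index_cost:7.4f} data = {data_cost:8.4f} total = {total:9.4f}")

best = min(plans, key=lambda plan: plan[4])
print("cheapest plan ends with index:", best[0])
```

Under these assumptions the fourth index (incomeclass) costs more to read than it saves in data page accesses.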
Problem cont’d
| Index used | Data rows | I/O cost   | Index I/O cost                    | Trade off if use index                                   |
| None       | 50M       | 3125 sec   |                                   |                                                          |
| 1          | 250,000   | 625 sec    | 1.56 sec                          | Decrease 3125 to 625 sec with 1.56 additional sec        |
| 1,2        | 2500      | 6.25 sec   | 1.56 + 0.3125 sec                 | Decrease 625 to 6.25 sec with 0.3125 additional sec      |
| 1,2,3      | 50        | 0.125 sec  | 1.56 + 0.3125 + 0.625 sec         | Decrease 6.25 to 0.125 sec with 0.625 additional sec     |
| 1,2,3,4    | 5         | 0.0125 sec | 1.56 + 0.3125 + 0.625 + 3.125 sec | Decrease 0.125 to 0.0125 sec with 3.125 additional sec   |
Indexes and Information Retrieval
Some information on slides taken from CS245 – Stanford Univ.
Query: Get employees in
(Toy Dept) ^ (2nd floor)
[Figure: the Dept. index entry for Toy and the Floor index entry for 2nd each point to EMP records]
Intersect toy RIDs and 2nd Floor RIDs to get set of matching EMP’s
This idea used in text information retrieval
Documents
...the cat is fat ...
...was raining cats and dogs...
...Fido the dog ...
Inverted lists
cat
dog
IR QUERIES
• Find articles with “cat” and “dog”
• Find articles with “cat” or “dog”
• Find articles with “cat” and not “dog”
• Find articles with “cat” in title
• Find articles with “cat” and “dog” within 5 words
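Most of these query types reduce to set operations on inverted lists, sketched below in Python. The positional lists are hypothetical: they assume the three sample documents are numbered 1 to 3, that "cats" and "dogs" are stemmed to "cat" and "dog", and that word positions are counted within the visible fragments. The title query is not covered because it needs a per-field payload.

```python
# term -> {doc_id: [word positions]}  (hypothetical positional inverted lists)
positional = {
    "cat": {1: [1], 2: [2]},      # "the cat is fat", "was raining cats and dogs"
    "dog": {2: [4], 3: [2]},      # "was raining cats and dogs", "Fido the dog"
}


def docs(term):
    """Set of document ids containing the term."""
    return set(positional.get(term, {}))


def within(term_a, term_b, k):
    """Documents where the two terms occur at most k words apart."""
    hits = set()
    for doc in docs(term_a) & docs(term_b):
        if any(abs(pa - pb) <= k
               for pa in positional[term_a][doc]
               for pb in positional[term_b][doc]):
            hits.add(doc)
    return hits


cat_and_dog = docs("cat") & docs("dog")
cat_or_dog = docs("cat") | docs("dog")
cat_not_dog = docs("cat") - docs("dog")
near = within("cat", "dog", 5)

print(cat_and_dog, cat_or_dog, cat_not_dog, near)
```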
IR – Web search problems
• Crawling and indexing share similar characteristics and requirements
– Both are offline problems, no need for real-time
– Tolerable for a few minutes delay before content searchable
– OK to run smaller-scale index updates frequently
• Querying is an online problem
– Demands sub-second response time
– Low latency, high throughput
– Loads can vary greatly
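Architecture of IR Systems
[Figure: offline, Documents pass through a Representation Function to a Document Representation, which is stored in the Index. Online, the Query passes through a Representation Function to a Query Representation. A Comparison Function matches the Query Representation against the Index and returns Hits.]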
How do we represent text?
• “Bag of words”
– Treat all the words in a document as index terms for that document
– Assign a “weight” to each term based on “importance”
– Disregard order, structure, meaning, etc. of the words
– Simple, yet effective!
• Assumptions
– Term occurrence is independent
– Document relevance is independent
– “Words” are well-defined
Stop Word List
• Words filtered out
• Common words
• Match on common word not as useful as match on rare words...
• Not one definite list
Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: the, is, for, to, of

Term     Document 1   Document 2
aid      0            1
all      0            1
back     1            0
brown    1            0
come     0            1
dog      1            0
fox      1            0
good     0            1
jump     1            0
lazy     1            0
men      0            1
now      0            1
over     1            0
party    0            1
quick    1            0
their    0            1
time     0            1
Inverted Index
• Inverted indexing is fundamental to all IR models
• Consists of postings lists, one with each term in the collection
• Posting list – document id and payload
– Payload can be term frequency or number of times occurs on document, position of occurrence, properties, etc.
– Can be ordered by document id, page rank, etc.
– Data structure necessary to map from document id to e.g. URL
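A minimal Python sketch of building such an index, using the two sample documents and the stopword list from the earlier slide. The payload here is the list of word positions, so term frequency is its length. The `STEMS` table is a toy stand-in for a real stemmer, and all names are illustrative.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "for", "to", "of"}
STEMS = {"jumped": "jump", "dog's": "dog"}    # toy substitute for a stemmer

documents = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}


def terms(text):
    """Yield (position, term) pairs, skipping stopwords."""
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        if word not in STOPWORDS:
            yield position, STEMS.get(word, word)


def build_index(docs):
    """Return term -> {doc_id: [positions]}, i.e. one postings list per term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in terms(text):
            index[term].setdefault(doc_id, []).append(position)
    return index


index = build_index(documents)
for term in sorted(index):
    print(f"{term:6s} {index[term]}")
```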
Inverted Index
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid     0      0      0      1      0      0      0      1       4, 8
all     0      1      0      1      0      1      0      0       2, 4, 6
back    1      0      1      0      0      0      1      0       1, 3, 7
brown   1      0      1      0      1      0      1      0       1, 3, 5, 7
come    0      1      0      1      0      1      0      1       2, 4, 6, 8
dog     0      0      1      0      1      0      0      0       3, 5
fox     0      0      1      0      1      0      1      0       3, 5, 7
good    0      1      0      1      0      1      0      1       2, 4, 6, 8
jump    0      0      1      0      0      0      0      0       3
lazy    1      0      1      0      1      0      1      0       1, 3, 5, 7
men     0      1      0      1      0      0      0      1       2, 4, 8
now     0      1      0      0      0      1      0      1       2, 6, 8
over    1      0      1      0      1      0      1      1       1, 3, 5, 7, 8
party   0      0      0      0      0      1      0      1       6, 8
quick   1      0      1      0      0      0      0      0       1, 3
their   1      0      0      0      1      0      1      0       1, 5, 7
time    0      1      0      1      0      1      0      0       2, 4, 6
Posting: an entry in inverted list. Represents occurrence of term in article
Size of a list (in postings): from 1 (rare words or misspellings) to 10^6 (common words)
Size of a posting: 10-15 bits (compressed)
Process query
• Given a query, fetch posting lists associated with query, traverse postings to compute result set
• Query document scores must be computed
• Partial scores stored in accumulators
• Top k documents extracted
• Optimization strategies to reduce # postings must examine
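These steps can be sketched in Python with one accumulator per document and a heap for the top k. The postings and the scoring rule (sum of term frequencies) are assumptions chosen to keep the example small; real systems use weighted scores.

```python
import heapq
from collections import defaultdict

# Hypothetical postings: term -> list of (doc_id, term frequency).
postings = {
    "cat": [(1, 2), (2, 1), (5, 3)],
    "dog": [(2, 4), (3, 1), (5, 1)],
}


def top_k(query_terms, k):
    """Traverse the postings of each query term and return the k best documents."""
    accumulators = defaultdict(int)            # doc_id -> partial score
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += tf
    # Highest score first; ties broken in favor of the smaller doc_id.
    return heapq.nlargest(k, accumulators.items(),
                          key=lambda item: (item[1], -item[0]))


result = top_k(["cat", "dog"], k=2)
print(result)
```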
Indexing: Performance Analysis
• The indexing problem
– Must be relatively fast, but need not be real time
– For Web, incremental updates are important
• How large is the inverted index?
– Size of vocabulary
– Size of postings
• Fundamentally, a large sorting problem
– Terms usually fit in memory
– Postings usually don’t
Index
• Size of index depends on payload
• Well-optimized inverted index can be 1/10 of size of original document collection
• If store position info, could be several times larger
• Usually can hold entire vocabulary in memory (using front-coding)
• Postings lists usually too large to store in memory
• Query evaluation involves random disk access and decoding postings
– Try to minimize random seeks