04/18/23 PSU’s CS 587
14. Evaluating Relational Operators
• Table Statistics
• Operators and example schema
• Selection
• Projection
• Equijoins
• General Joins
• Set Operators
• Buffering
Intergalactic standard reference:
G. Graefe, "Query Evaluation Techniques for Large Databases," ACM Computing Surveys, 25(2) (1993), pp. 73-170
Learning Objectives
• In a typical major DBMS, what statistics are automatically collected, and when?
• Given collected statistics, estimate a predicate’s output size/selectivity.
• For each relational operator, describe the major algorithms, their optimizations, their pros and cons, and their costs:
  • Selection
  • Projection
  • Equijoins
  • General Joins
  • Set Operators
Motivation/Review
Assume Sailors has an index on age. Does the optimal plan for this query use the index?
SELECT *
FROM Sailors S
WHERE S.age < 31
Moral: To choose the optimal plan, we need to know the selectivity of the predicate, i.e., the size of the output.
Collecting Statistics
• What are statistics? Table sizes, index sizes, ranges of values, etc.
• Where are statistics kept? In the system catalog.
• Why collect statistics?
  • Statistics are needed to determine the selectivity of predicates and the sizes of operator outputs.
  • Selectivities and output sizes are needed to calculate the costs of plans.
  • Plan costs are needed by the optimizer to find the optimal plan.
• How often are statistics collected?
  • Typically when 10% of the data has been updated.
  • Can be overridden manually: UPDATE STATISTICS, ANALYZE
  • Typically done by sampling, if the table is large.
• Which tables/columns are monitored? Typically all tables/columns are monitored.
Sailors Example
• What statistics would you collect to find the number of rows satisfying age < 31? rank = 4?
• In Intro DB we assumed data was uniformly distributed.
Age values:
30/30.2/30.4/30.6/30.8/30.9/31/32/33/34.5/
35/36/36.6/37.8/39/39.5/40/41/43/44/
46/48/50/51/52/54/56/58/59/60
Histograms
Histograms are more accurate than the uniformity assumption.
Equiwidth histogram:
range  30  33  36  39  42  45  48  51  54  57
#vals   9   3   3   3   2   1   2   2   2   3
Estimate number of rows/selectivity for the predicate “age < 31”:
Equidepth histogram:
range  30  30.4  30.9  33  36  39  41  46  51  56
#vals   3   3     3     3   3   3   3   3   3   3
Estimate number of rows/selectivity for the predicate “age < 31”:
Actual value: 6. Moral: Equidepth histograms are more accurate.
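The two estimates asked for above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: values are taken to be uniformly distributed inside each bucket, the function name estimate_rows_below is made up for this sketch, and the final upper bound of 60 is assumed from the largest age in the sample data.

```python
def estimate_rows_below(bounds, counts, value):
    """Estimate how many rows satisfy attr < value from a histogram.

    bounds has len(counts) + 1 entries; bucket i covers bounds[i]..bounds[i+1].
    Assumption: values are uniformly distributed inside each bucket.
    """
    total = 0.0
    for i, count in enumerate(counts):
        lo, hi = bounds[i], bounds[i + 1]
        if value >= hi:
            total += count                              # whole bucket qualifies
        elif value > lo:
            total += count * (value - lo) / (hi - lo)   # partial bucket
    return total

# The last bound (60) is an assumption taken from the largest sample age.
equiwidth_bounds = [30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60]
equiwidth_counts = [9, 3, 3, 3, 2, 1, 2, 2, 2, 3]
equidepth_bounds = [30, 30.4, 30.9, 33, 36, 39, 41, 46, 51, 56, 60]
equidepth_counts = [3] * 10

equiwidth_estimate = estimate_rows_below(equiwidth_bounds, equiwidth_counts, 31)
equidepth_estimate = estimate_rows_below(equidepth_bounds, equidepth_counts, 31)
print(round(equiwidth_estimate, 2), round(equidepth_estimate, 2))
```

Under these assumptions the equiwidth estimate is about 3 rows and the equidepth estimate about 6.1 rows, against the actual value of 6.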
Real DBMSs Store Histograms and More
The next slide is the result of:
SELECT attname, n_distinct, most_common_vals,
       most_common_freqs, histogram_bounds
FROM pg_stats WHERE tablename = 'sailors';
• n_distinct: a positive value is the number of distinct values; a negative value is minus the ratio of distinct values to rows, so -1 means unique
• most_common_vals: NULL if no values are more common than others
• histogram_bounds: equidepth histogram over the values, excluding the most common ones
Example Statistics: Postgres
EXPLAIN SELECT * FROM sailors WHERE age < 31;
QUERY PLAN
Seq Scan on sailors (cost=0.00..1.38 rows=6 width=21)
  Filter: (age < 31::double precision)
Compound Conditions
How many rows?
• WHERE age < 31 AND rank = 4
  • Assume statistical independence
  • Selectivity of AND is the product of the selectivities: s1*s2
• WHERE age < 31 OR rank = 4
  • Typical formula is s1+s2-s1*s2
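As a rough worked example of the two formulas, the sketch below plugs in numbers from the running Sailors example. The inputs are assumptions for illustration: 30 rows, 6 of which satisfy age < 31 (the actual value on the histogram slide), and rank uniformly distributed over 1 to 10.

```python
def and_selectivity(s1, s2):
    # Assumes the two terms are statistically independent.
    return s1 * s2

def or_selectivity(s1, s2):
    # Inclusion-exclusion under the same independence assumption.
    return s1 + s2 - s1 * s2

ROWS = 30          # assumed table size (the 30 sample ages)
s_age = 6 / ROWS   # age < 31
s_rank = 1 / 10    # rank = 4, with rank uniform over 1..10

and_rows = and_selectivity(s_age, s_rank) * ROWS
or_rows = or_selectivity(s_age, s_rank) * ROWS
print(f"AND: {and_rows:.1f} rows, OR: {or_rows:.1f} rows")
```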
Join Size Estimation
• Suppose there are |R| rows in R and |S| rows in S, and R ⋈ S is a join on ri = sj
• ri is a foreign key that references sj
• How many rows are there in R ⋈ S?
• Most real joins are foreign key joins
Pages, Rows in a Table?
select relpages, reltuples
from pg_class where relname = 'sailors';
Assumptions
General assumptions:
• All attributes are uniformly distributed.
• 2 MB of buffer space, or 256 8K pages
  • So we can sort up to 512 MB of data (64K pages) in two passes
• All indexes are Alternative 2, height 2.
• Data entries are 10% the size of data records.
• Each I/O takes 10 ms
Reserves, Sailors, Boats:
• Sailors (sid: integer, sname: string, rank: integer, age: real)
  • Clustered index on sid
  • Nonclustered index on rank
• Reserves (sid: integer, bid: integer, day: dates, rname: string)
  • Clustered index on (sid, bid, day)
  • Nonclustered index on bid
• Boats (bid: integer, color: string)
  • Clustered index on bid
Table     Bytes/Row  Rows/Page  Pages
Sailors   50         80         500
Reserves  40         100        1000
Boats     16         250        20
• Rank: 1 to 10
• Bid: 1 to 100
14. Evaluating Relational Operators
• Operators and example schema
• Selection
  • One term
  • Nonclustered index optimization
  • Multiple terms
• Projection
• Equijoins
• General Joins
• Set Operators
• Buffering
14.1 One Table, One Term
• Predicate is of the form R.attr op value, where op is a comparison operator such as <.
• Assume 10% of rnames begin with A or B.
• Suppose there is a clustered B+ tree index and an unclustered B+ tree index, both on rname.
• What are the possible plans (algorithms)?
• Which plan will the optimizer choose?
• We’ll compute the cost of each plan/algorithm.
SELECT *
FROM Reserves R
WHERE R.rname < 'C%'
14. Op.Eval.
What is the optimal plan?
• File Scan Plan Cost
  • I/Os
  • Seconds
• Nonclustered Index Scan Plan Cost
  • Access first qualifying data entry
  • Access first qualifying data item
  • Access all qualifying data items
  • Total Cost
• Clustered Index Scan Plan Cost
  • Access first qualifying data entry
  • Access first qualifying data item
  • Access all qualifying data items
  • Access all qualifying data entries
  • Total Cost
• What is the optimal plan?
[Figure: CLUSTERED vs. UNCLUSTERED index. In each, index entries in the index file lead to data entries, which point to data records in the data file.]
Costs for a one-term selection on R, with selectivity ρ
(M = number of pages in R, pR = number of rows per page of R)
• File scan: number of pages in R = M
• Nonclustered index scan: worst case is at least the number of qualifying rows in R = M*pR*ρ
• Clustered index scan: height of index + number of pages of qualifying data entries + number of qualifying data pages = 2 + .1Mρ + Mρ = 2 + 1.1Mρ
*The .1 comes from our assumption that data entries are 10% the length of data items, and should be adjusted accordingly.
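To make the three formulas concrete, here is a small sketch that evaluates them for the Reserves query (rname < 'C%'). The inputs come from the assumptions slide: M = 1000 pages, 100 rows per page, ρ = 0.10, index height 2, and 10 ms per I/O. The function names are made up for this sketch.

```python
IO_SECONDS = 0.010  # each I/O takes 10 ms (slide assumption)

def file_scan_cost(M):
    return M

def nonclustered_scan_cost(M, rows_per_page, rho):
    # Lower bound on the worst case: one I/O per qualifying row.
    return M * rows_per_page * rho

def clustered_scan_cost(M, rho, height=2, entry_ratio=0.1):
    # Index height + pages of qualifying data entries + qualifying data pages.
    return height + entry_ratio * M * rho + M * rho

M, P_R, RHO = 1000, 100, 0.10  # Reserves; 10% of rnames begin with A or B
costs = {
    "file scan": file_scan_cost(M),
    "nonclustered index scan": nonclustered_scan_cost(M, P_R, RHO),
    "clustered index scan": clustered_scan_cost(M, RHO),
}
for plan, ios in costs.items():
    print(f"{plan}: {ios:.0f} I/Os, {ios * IO_SECONDS:.2f} s")
```

With these inputs the clustered index scan costs about 112 I/Os, the file scan 1000, and the plain nonclustered index scan at least 10,000.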
How do the access methods compare?
Access method            Cost (I/Os)
File Scan                M
Nonclustered Index Scan  At least pR*ρ*M
Clustered Index Scan     2+1.1*ρ*M
Remember:
• Clustered indexes are rare
• pR is typically large
Conclusions:
• Clustered Index Scan is optimal if ρ < 1/1.1
• If there is no clustered index, File Scan is a big win unless ρ is very small
• Let’s look at ways to make a Nonclustered Index Scan even better
Nonclustered Index Optimization
Consider this algorithm for using a nonclustered index (the “shopping list” optimization):
1. Retrieve the RIDs in all the qualifying data entries. Cost: 2 + ρ*0.1*M
2. Sort those RIDs in order of page number. Cost: ???
3. Fetch data records using those RIDs in sorted order. This ensures that each data page is retrieved just once (though the # of such pages is likely to be higher than with clustering). Cost: at most M
   • If ρ is small, cost is much less.
Total cost? At most 2+(1+.1ρ)M + ???
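The sort-then-fetch steps can be sketched as follows. This is an in-memory illustration only: RIDs are modeled as (page number, slot number) pairs, and the three-page table and the read_page helper are hypothetical stand-ins used to count page reads.

```python
def fetch_in_page_order(rids, read_page):
    """rids: iterable of (page_no, slot_no); read_page(page_no) -> list of records."""
    records = []
    current_page_no, current_page = None, None
    for page_no, slot_no in sorted(rids):       # step 2: sort RIDs by page number
        if page_no != current_page_no:          # step 3: one I/O per distinct page
            current_page_no, current_page = page_no, read_page(page_no)
        records.append(current_page[slot_no])
    return records

# Hypothetical 3-page table, used only to count page reads.
pages = {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0", "c1"]}
reads = []

def read_page(page_no):
    reads.append(page_no)
    return pages[page_no]

result = fetch_in_page_order([(2, 1), (0, 0), (2, 0), (0, 1)], read_page)
print(result, reads)
```

Four qualifying RIDs on two pages cost two page reads here, instead of up to four if the RIDs were followed in index order.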
A Second Nonclustered Index Optimization: The ??? Term
What if the RIDs are too large to sort in memory?
1. Retrieve the RIDs in all the qualifying data entries.
2’. Make a bitmap, one bit for each page in the table.
3’. For each page in a qualifying RID, turn on the corresponding bit in the bitmap.
4’. Fetch each page whose bit is on.
5’. Scan every record in every fetched page to find which ones qualify.
Why is step 5’ necessary?
Cost: At most 2+(1+.1ρ)M. If ρ is small, cost is much less.
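A minimal sketch of steps 2’ to 5’, again in memory. The table contents, the read_page helper, and the qualifies predicate are hypothetical; the point is that the bitmap keeps only page numbers, so the predicate has to be rechecked on every record of each fetched page.

```python
def bitmap_scan(rids, num_pages, read_page, qualifies):
    bitmap = [False] * num_pages                # step 2': one bit per page
    for page_no, _slot in rids:                 # step 3': the slot number is discarded
        bitmap[page_no] = True
    out = []
    for page_no in range(num_pages):            # step 4': fetch marked pages in order
        if bitmap[page_no]:
            # step 5': recheck the predicate on every record of the page
            out.extend(r for r in read_page(page_no) if qualifies(r))
    return out

# Hypothetical table of rname values, three pages of two records each.
pages = [["adams", "zed"], ["mills", "nora"], ["bob", "carl"]]
reads = []

def read_page(page_no):
    reads.append(page_no)
    return pages[page_no]

qualifying_rids = [(2, 0), (0, 0)]              # RIDs of 'bob' and 'adams'
result = bitmap_scan(qualifying_rids, len(pages), read_page, lambda r: r < "c")
print(result, reads)
```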
How do the access methods compare?
Access method            Cost (I/Os)
File Scan                M
Nonclustered Index Scan  At most 2+(1+.1ρ)M; if ρ is small it will be much less
Clustered Index Scan     2+1.1*ρ*M
Conclusions:
• Clustered Index Scan is optimal if ρ < 1/1.1
• If there is no clustered index, File Scan is a slight win unless ρ is very small.
Access Paths, Matching
• An access path is a way of retrieving tuples.
  • A full or partial scan
  • An indexed access
• So an optimal plan, in this simple case (one-term WHERE clause), is an optimal access path.
• A term matches an access path if that access path can be used to retrieve just the tuples satisfying the term.
  • Example: bid > 50 matches what access path?
General Selections
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk    (k > 1)
First approach:
1. Choose a term (Which one?)
2. Retrieve tuples using the optimal access path for that term.
3. Apply any remaining terms, in memory.
Second approach:
Applicable if we have 2 or more access paths matching terms and using Alternatives (2) or (3).
1. Get sets of rids of data records using each matching index.
2. Then intersect these sets of rids.
3. Retrieve the records and apply any remaining terms.
Which approach is cheaper?
14. Evaluating Relational Operators
• Operators and example schema
• Selection
• Projection
  • What is the problem?
  • Sort and hash approaches
• Equijoins
• General Joins
• Set Operators
• Buffering
14.3 Projection
SELECT DISTINCT R.sid, R.bid
FROM Reserves R
• The hard part of projection is duplicate elimination.
• Simple sort-based projection (assume size ratio is 0.50, duplicate ratio is 0.25):
  1. Eliminate all attributes except sid, bid.
  2. Sort the result by (sid, bid).
  3. Scan that result, eliminating duplicates.
  • Cost?
• Optimizations?
  • Combine attribute elimination with sort.
    • Cost?
  • Combine duplicate elimination with sort.
    • Best case cost?
Projection Based on Hashing
• Simple hash-based projection:
  • Eliminate unwanted attributes.
  • Hash the result into B-1 output buffers and partitions.
  • For each partition: read it in, eliminate duplicates in memory, write the answer.
• Optimizations?
  • Combine elimination of unwanted attributes with hashing.
  • Eliminate duplicates as soon as possible.
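The hash-based approach might look like this in miniature. It is a sketch under assumptions: Python lists stand in for the on-disk partitions, the partition count is arbitrary, and the input rows are the (sid, bid, day, rname) tuples from the Reserves example used in the sort-merge slides. Attribute elimination is combined with hashing, as in the first optimization.

```python
def hash_project_distinct(rows, columns, num_partitions):
    # Partitioning phase: drop unwanted attributes while hashing.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        projected = tuple(row[c] for c in columns)
        partitions[hash(projected) % num_partitions].append(projected)
    # Duplicate elimination phase: duplicates always land in the same
    # partition, so each partition can be de-duplicated on its own.
    result = []
    for partition in partitions:
        result.extend(set(partition))
    return result

reserves = [
    (28, 103, "12/4/96", "guppy"),
    (28, 103, "11/3/96", "yuppy"),
    (31, 101, "10/10/96", "dustin"),
    (31, 102, "10/12/96", "lubber"),
    (31, 101, "10/11/96", "lubber"),
    (58, 103, "11/12/96", "dustin"),
]
distinct_pairs = sorted(hash_project_distinct(reserves, columns=(0, 1), num_partitions=3))
print(distinct_pairs)
```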
Discussion of Projection
• Sort-based approach is the standard; it handles skew better and its result is sorted.
• If an index on the relation contains all wanted attributes in its search key, can do an index-only scan.
  • Apply projection techniques to data entries (much smaller!)
• If an ordered (i.e., tree) index contains all wanted attributes as a prefix of its search key, can do even better:
  • Retrieve data entries in order (index-only scan), discard unwanted fields, compare adjacent tuples to check for duplicates.
• Example: with a single index on (bid, sid, day), the prefixes are (bid) and (bid, sid).
14. Evaluating Relational Operators
• Operators and example schema
• Selection
• Projection
• Equijoins
  • Nested Loop Join
  • Index Nested Loop Join
  • Sort-Merge Join
  • Hash Join
  • Hybrid Hash Join
• General Joins
• Set Operators
• Buffering
14.4 Equijoin With One Join Column
Example:
SELECT S.sname, R.bid
FROM Reserves R, Sailors S
WHERE R.sid = S.sid AND bid = 100
[Figure: query plan tree over Reserves and Sailors with a join ⋈ sid=sid, a selection bid=100, and a projection on sname]
Nested Loops Join
R is called the outer table, S the inner table.
(1) foreach tuple r in R do
(2)   foreach tuple s in S do
        if ri == sj then output <r, s>
The Join Algorithms
• Every in-memory join algorithm is in some sense a special case of Nested Loops
  • Simple Nested Loops: Scan outer, scan inner
  • Index Nested Loops: Scan outer, index scan inner
  • Sort Merge: Inputs must be sorted on joining attributes. Scan outer in order, scan matching interval of inner.
  • Hash Join: special case of Index Nested Loops, where the index is constructed on-the-fly and inputs are partitioned.
• Implications
  • Every join algorithm scans its outer relation.
  • Every join algorithm costs at least M.
Cost of Nested Loops Join
• (Scan of R) + (# rows in R)*(Scan of S) = M + (M*pR*ρ)*N
  • where ρ is the selectivity of the selection on R
• For simplicity we are assuming that R is accessed with a file scan. In general R could be accessed with an index scan, and the first cost “M” would be replaced by the cost of the index scan.
• For our example, the cost is prohibitive: 1000 + (1,000)*500 = 501,000 I/Os = 5,010 seconds ~= 1.4 hours
• Nested loops is used only for small tables.
Index Nested Loops Join
Valid only when S has an index on sj.
foreach tuple r in R do
  foreach tuple s in S where ri == sj    (use index here)
    add <r, s> to result
• Cost: M + (M*pR*ρ) * (Cost of finding matching S tuples)
  • Again, the first cost M might be replaced by an index scan cost.
• Cost of finding matching S tuples = cost of probing S index + cost of finding S tuples.
  • For each R tuple, cost of probing S index is about 1.2 for a hash index, 2-4 for a B+ tree.
  • Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering.
    • Clustered index: 1 I/O (typical); unclustered: up to 1 I/O per matching S tuple.
Examples of Index Nested Loops
• In our example Sailors has a clustered index on sid, so we can apply INL.
• Cost is M + (M*pR*ρ)*3 = 1000 + (1000*100*.01)*3 = 4000 I/Os = 40 seconds.
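The gap between simple and index nested loops on this example can be checked with a short calculation. The sketch uses the slide values (M = 1000, pR = 100, N = 500, ρ = .01 for bid = 100, 3 I/Os per probe, 10 ms per I/O); the function names are made up here.

```python
IO_SECONDS = 0.010          # each I/O takes 10 ms (slide assumption)
M, P_R, N = 1000, 100, 500  # Reserves pages, Reserves rows/page, Sailors pages
RHO = 0.01                  # selectivity of bid = 100 (bids uniform over 1..100)

def nested_loops_cost(M, rows_per_page, rho, N):
    # Scan R once, then scan all of S for every qualifying R row.
    return M + (M * rows_per_page * rho) * N

def index_nested_loops_cost(M, rows_per_page, rho, probe_cost):
    # Scan R once, then pay one index probe per qualifying R row.
    return M + (M * rows_per_page * rho) * probe_cost

nlj = nested_loops_cost(M, P_R, RHO, N)
inl = index_nested_loops_cost(M, P_R, RHO, probe_cost=3)
print(f"NLJ: {nlj:.0f} I/Os = {nlj * IO_SECONDS / 3600:.1f} hours")
print(f"INL: {inl:.0f} I/Os = {inl * IO_SECONDS:.0f} seconds")
```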
14.4.2 Sort-Merge Join
Assume R and S can each be sorted in 2 passes, i.e., M, N < B^2.
No-brainer implementation:
• Sort R: 4M
• Sort S: 4N
• Merge-join R and S: M+N
• Total cost: 5(M+N)
Better implementation:
• Sort* R into runs: 2M
• Sort* S into runs: 2N
• Merge runs of R and S and merge-join R and S: M+N
• Total cost: 3(M+N)
*Use snowplow/replacement sort optimization
Example of Sort-Merge Join
Sailors:
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
Reserves:
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin
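The merge step on these two tables can be sketched as below. It assumes both inputs are already sorted on sid and held in memory as Python tuples; a real implementation would stream pages rather than index into lists.

```python
def merge_join(outer, inner, key_outer, key_inner):
    """Both inputs must already be sorted on their join keys."""
    result = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        ko, ki = key_outer(outer[i]), key_inner(inner[j])
        if ko < ki:
            i += 1
        elif ko > ki:
            j += 1
        else:
            # Find the interval of inner tuples that match this key.
            j_end = j
            while j_end < len(inner) and key_inner(inner[j_end]) == ko:
                j_end += 1
            # Pair every outer tuple with this key against that interval.
            while i < len(outer) and key_outer(outer[i]) == ko:
                for s in inner[j:j_end]:
                    result.append(outer[i] + s)
                i += 1
            j = j_end
    return result

sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]

joined = merge_join(sailors, reserves, lambda s: s[0], lambda r: r[0])
print([row[0] for row in joined])
```

Sailors 22 and 44 have no reservations, so they contribute nothing; the other three sailors contribute 2, 3, and 1 result rows.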
Sort-Merge Join Optimization
[Figure: runs from R and runs from S fill the buffers; the runs are merged and merge-joined in the same pass to produce the join output.]
Hybrid SMJ costs
• If R and S are sorted, cost is M+N
• If R is sorted, cost is M + N + 2ρS*N - 2(B - ρS*N/B)
  • Argument similar to preceding page
• If neither R nor S is sorted, cost is M + N + 2ρR*M + 2ρS*N - 2(B - ρR*M/B - ρS*N/B)
• The initial M and N costs might be replaced by index scan costs if appropriate.
[Figure: join plan over R and S]
Example: Hybrid SMJ
• Assume this plan uses a clustered sid index scan of Reserves at a cost of 1102 and a file scan of Sailors at a cost of 500, with B = 256.
• Plan cost is then M + N + 2ρS*N - 2(B - ρS*N/B) = 1102 + 500 + 2*.5*500 - 2(256 - .5*500/256) ≈ 1592
[Figure: plan tree joining Reserves (selection bid=100) with Sailors (selection rank>5)]
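The arithmetic for the case where R is already sorted can be checked with a few lines of Python. The function name is made up for this sketch, and it simply evaluates the slide's formula with the stated inputs (scan costs 1102 and 500, ρS = .5, N = 500, B = 256).

```python
def smj_cost_r_sorted(scan_r, scan_s, rho_s, N, B):
    # M + N + 2*rhoS*N - 2*(B - rhoS*N/B), with the scan costs standing in for M and N.
    return scan_r + scan_s + 2 * rho_s * N - 2 * (B - rho_s * N / B)

plan_cost = smj_cost_r_sorted(scan_r=1102, scan_s=500, rho_s=0.5, N=500, B=256)
print(round(plan_cost, 2))
```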
14.4.3 Hash-Join [677]
Confusing because it combines two distinct ideas:
1. Partition R and S using a hash function.
   • You should understand how and why this works. Read the text if you don’t.
2. Join corresponding partitions using a version of INL with hash indexes built on-the-fly.
   • You should understand why the hash function used here must differ from the hash function used for partitioning.
The partition step could be followed by any join method in step 2. Hashing is used because:
• The tables are large (they needed partitioning), so NLJ is not a contender.
• The partitions were just created, so they have no indexes on them, so classic INL is not possible.
• The partitions were just created, so they are not sorted, so SMJ is not a big winner.
There may be cases where a sort order is needed later, but this is not considered.
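[Figure: partitioning phase. The original relation is read from disk through an INPUT buffer; hash function h distributes tuples across B-1 OUTPUT buffers, which are written to disk as partitions 1..B-1 (B main memory buffers).]
[Figure: joining phase. Partitions of R & S are read from disk; a hash table for partition Ri (k < B-1 pages) is built with hash function h2; Si is read through an input buffer and probed with h2; matches go through the output buffer to the join result (B main memory buffers).]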
Hash-Join Example: Partition
R: 16,23,17,86,84,21,61
S: 19,17,84,25,22
[Figure: R and S are read through the input buffer and split by hash value (0 mod k, 1 mod k) into k partitions of R and k partitions of S]
Hash-Join Example: INL
Partitions of R: 16,86,84 | 23,17,21,61
Partitions of S: 84,22 | 19,17,25,43
[Figure: the partition of R (16,86,84) is loaded into a hash table; the matching partition of S is scanned and probed against it; matches go to the join output]
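Both phases of the example can be sketched together. This is an in-memory illustration with assumptions stated up front: k = 2 with partitioning function x mod k, a Python dict playing the role of the second hash function h2, and the R and S values as listed on the partition slide.

```python
def hash_join(r_keys, s_keys, k):
    # Phase 1: partition both inputs with the same function h(x) = x mod k.
    r_parts = [[] for _ in range(k)]
    s_parts = [[] for _ in range(k)]
    for x in r_keys:
        r_parts[x % k].append(x)
    for x in s_keys:
        s_parts[x % k].append(x)
    # Phase 2: for each pair of partitions, build a table on R and probe with S.
    output = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = {}
        for x in r_part:
            table.setdefault(x, []).append(x)   # the dict stands in for h2
        for y in s_part:
            for x in table.get(y, []):
                output.append((x, y))
    return r_parts, s_parts, output

R = [16, 23, 17, 86, 84, 21, 61]
S = [19, 17, 84, 25, 22]
r_parts, s_parts, output = hash_join(R, S, k=2)
print(r_parts, s_parts, output)
```

Only tuples in corresponding partitions are ever compared, which is why the partitioning function must be the same for R and S.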
Cost of Hash Join
• We only need one of M, N to be < B^2! Let it be M.
• Read R, write B partitions: 2M
  • Each one is, on average, at most B pages
  • Some fudging going on here
• Read S, write B partitions: 2N
  • We don’t care about the size of these partitions
• For each of the B partitions of R and S: M+N
  • Build a hash table for the R-partition
  • Read each partition of R, and the matching partition of S
• Total: 3(M+N)
• Can we do better if M << B^2? Hybridize!
Hybrid Hash Join, M 2B, Partition
[Figure: R and then S are read through an input buffer. Partition 1 of R stays in memory, so R1 ⋈ S1 is computed during partitioning; partition 0 of R and partition 0 of S go through the partition 0 output buffer to disk.]
Hybrid Hash Join, M 2B, INL
[Figure: R0 ⋈ S0. Partition 0 of R is loaded into memory, partition 0 of S is scanned, and results go to the output buffer.]
Hybrid Hash Join, Partition
[Figure: general case. Partition k of R stays in memory, so Rk ⋈ Sk is computed during partitioning; partitions 0..k-1 of R and of S are written to disk through the output buffers.]
Hybrid Hash Join, INL, Partition i
[Figure: Ri ⋈ Si. Partition i of R is loaded into memory, the matching partition of S is scanned, and results go to the output buffer.]
Fudging in our treatment of Hash-Join
• If we build an in-memory hash table to speed up the matching of tuples, a little more memory is needed.
• If the hash function does not partition uniformly, one or more R partitions may not fit in memory.
  • In this case we can apply the hash-join technique recursively to do the join of this R-partition with its corresponding S-partition.
Sort-Merge vs. Hash Join
• When can we achieve the low cost of 3(M+N)?
  • Hash: smaller of M, N < B^2
  • Sort-merge: both M, N < B^2
• So Hash Join is preferable if table sizes differ greatly.
• But Sort-Merge Join is less sensitive to data skew.
• The result of Sort-Merge Join is ordered.
Hash Join comments
• Can view the first phase of Hash Join of R, S as divide and conquer: partitioning the join into subjoins that can be done with one input smaller than memory.
• What to do if we don’t know the size of R? Assume R fits in memory (M ≤ B), then split off more and more of the hash table onto disk as it grows. Used in Postgres.
Equijoin Cost Summary
Algorithm           Cost (I/Os)
Nested Loops        M + MN
Index Nested Loops  M + M*pR*((1.2 to 4) + (1 or more))
Sort-Merge, Hash    3(M+N)
• Nested Loops is best for small tables
• Others have high overhead costs
14.4.4 General Join Conditions
• Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname):
  • For Index NL, build an index on <sid, sname> (if S is inner), or use existing indexes on sid or sname.
  • For Sort-Merge and Hash Join, sort/partition on the combination of the two join columns.
  • Not much different from one-equality equijoins.
Inequality Conditions
• Typical example: R.rname < S.sname
  • Relatively rare. Does the example make sense?
• For Index NL, need a (clustered!) B+ tree index to get any efficiency.
  • Range probes on inner; # matches likely to be much higher than for equality joins.
• Hash Join, Sort Merge Join not applicable.
• With no index, Block NL is the best join method.
14.5 Set Operations
• INTERSECTION:
  • INTERSECTION is a special case of join.
    • How is it a special case?
  • So all the usual join algorithms apply.
• CROSS PRODUCT:
  • Also a special case of join.
    • How?
  • What join algorithm is best to compute CROSS PRODUCT?
    • Will Sort-Merge or Hash help?
UNION and EXCEPT
• They are similar; we’ll do UNION.
• The hard part is removing duplicates.
• Sorting-based approach to union:
  • Sort both relations (on a key).
  • Merge sorted relations, discarding duplicates.
  • Alternative: Merge runs from Pass 0 for both relations.
• Hash-based approach to union:
  • Partition R and S using hash function h.
  • For each S-partition, build an in-memory hash table, scan the corresponding R-partition, and add tuples to the table while discarding duplicates.
14.6 Aggregates w/o grouping
• Example: SELECT AVG(S.age) FROM Sailors S
• In general, requires scanning the relation.
  • What is the cost for this query?
• Given an index whose search key includes all attributes in the SELECT or WHERE clauses, can do an index-only scan.
  • If there is an index on age, what is the cost of this query?
Aggregates with grouping
• Example: SELECT S.rating, AVG(S.age) FROM Sailors S GROUP BY S.rating
• What is an algorithm based on sorting?
  • What is the cost of this query, assuming 2-pass sort?
  • Optimizations? Cost?
• What is an algorithm based on hashing?
  • How much memory is needed? Cost?
• Conclusion: Hash is best
• If there is an appropriate index, use an index-only algorithm.
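One possible hash-based algorithm keeps a running (sum, count) per group in an in-memory table, so memory grows with the number of distinct ratings rather than with the number of rows. The sketch below assumes the whole table of groups fits in memory; the (rating, age) rows are hypothetical.

```python
def hash_group_avg(rows):
    groups = {}  # rating -> [sum_of_ages, count]
    for rating, age in rows:
        acc = groups.setdefault(rating, [0.0, 0])
        acc[0] += age
        acc[1] += 1
    return {rating: total / count for rating, (total, count) in groups.items()}

# Hypothetical (rating, age) pairs, one per Sailors row.
rows = [(7, 45.0), (9, 35.0), (7, 35.0), (9, 55.0)]
averages = hash_group_avg(rows)
print(averages)
```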
14.7 Impact of Buffering
• If several operations are executing concurrently, estimating the number of available buffer pages is guesswork.
• Repeated access patterns interact with buffer replacement policy.
  • e.g., the inner relation is scanned repeatedly in Simple Nested Loop Join. With enough buffer pages to hold the inner, replacement policy does not matter. Otherwise, MRU is best, LRU is worst (sequential flooding).
• What replacement policy is best for
  • Block Nested Loops?
  • Index Nested Loops?
  • Sort-Merge Join sort phase?
  • Sort-Merge Join merge phase?
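Sequential flooding is easy to reproduce with a toy simulation. The sketch below is an illustration with made-up sizes: it counts page misses when an inner relation of 4 pages is scanned 5 times through a pool of 3 buffers, under LRU and under MRU.

```python
from collections import OrderedDict

def count_misses(num_buffers, inner_pages, scans, policy):
    """Count page misses when pages 0..inner_pages-1 are scanned repeatedly."""
    pool = OrderedDict()   # keys ordered from least to most recently used
    misses = 0
    for _ in range(scans):
        for page in range(inner_pages):
            if page in pool:
                pool.move_to_end(page)          # hit: mark as most recently used
                continue
            misses += 1
            if len(pool) >= num_buffers:
                # LRU evicts the oldest entry, MRU evicts the newest.
                pool.popitem(last=(policy == "MRU"))
            pool[page] = None
    return misses

lru_misses = count_misses(num_buffers=3, inner_pages=4, scans=5, policy="LRU")
mru_misses = count_misses(num_buffers=3, inner_pages=4, scans=5, policy="MRU")
print(lru_misses, mru_misses)
```

With the inner one page larger than the pool, LRU misses on every access, while MRU keeps most of the inner resident between scans.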