Chapter 19-1
CSE4701
Chapter 19 (6e) / Chapters 17 & 18 (5e): System Catalog and Query Optimization
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
A portion of these slides is used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech.
Other slides have been adapted from the AWL web site for the textbook. Remaining slides represent new material.
Chapter 19-2
CSE4701
Overview of Material
Key Background Topics:
- What are Typical Database Processing Actions?
- Disk Drives and Disk Storage
- Database Processing/Architectures
- Motivating Query Optimization
- Query Processing
Chapter 17 - System Catalog
- What is it?
- How is it Used?
Chapter 18 - Query Optimization in RDBMS
- High-level Query Optimization (Algebraic)
- Low-level Query Optimization (Cost-based)
Chapter 19-3
CSE4701
Typical Database Processing
[Figure: processing pipeline from User Transaction to Response to User]
Pre-Processing
- Parser/Lexical
- Optimizer/Views
High-Level Processing
- Enqueue Trans.
- Request Locks
- Release Locks
- Dequeue Trans.
Low-Level Processing
- Enqueue Trans.
- Request Locks
- Issue I/Os
- Process Returned Data
- Integrity Checks
- Security Checks
- Logging for Recovery
- Release Locks
- Dequeue Trans.
Post-Processing
- Collection of Results
- Aggregation Operations
- Security Checks
Supporting Components: Concurrency Control, Disk I/O, Recovery
Arrow Labels: User Transaction, Parsed and Optimized User Trans., Errors, Results, Lock Request, Response to Lock Request, I/O Request, Response to User
Chapter 19-4
CSE4701
What are the Processing Issues for DBs?
- Database Applications of Today and Tomorrow Require High Volumes of Information!
- Increase of Information Still Requires High Performance!
  - Throughput and Response Time
- Where's the Bottleneck in DBS?
  - CPU ??
  - Main Memory Size/Speed ??
  - Virtual Memory Limitations ??
  - Communications Bus ??
  - I/O Channel ??
Chapter 19-5
CSE4701
90-10 Rule for Database Processing Load (Transaction per second) vs.
Performance (Response Time of Transactions) Processing of Large Amounts of Raw Data
Addressed in Secondary Storage Staged to Main Memory
Identifying Relevant Data Large Amounts of Raw Data Discarded Focus on Data Most Likely to Contain Answers Possible Loss of CPU and Main Memory Cycles
This is Double Jeopardy! Load of DBS Must be Reduced Performance of DBS Degrades
Chapter 19-6
CSE4701
Only 10% of Relevant Data has Answers
Note: Naive Approach to Database Searching Often Occurs (Little or No Indexing in Practice!)
90-10 Rule for Conventional DBS
ApplicationPrograms
OperatingSystem
DatabaseFunctions
On-LineI/O
Disk I/O
Only 10% of Raw Data is Relevant
Chapter 19-7
CSE4701
Randomly Accessed Storage Devices Popular Media (Hard Drives, CDs, DVDs, etc.) Access to Information in Any Order Sequential Access Not Typically Supported or Needed,
Since “Files” Not Stored Sequentially Recall, Disk Defragmentation on PC Platform Block-Oriented Utilization of Device
Block Access to Optimize Transfer Block Size is Device/Controller Dependent Linear/Non-Linear Byte Orders with Blocks
Key Concepts … Platter Track Sector Cylinder Read/Write Heads
Chapter 19-8
CSE4701
Rotating Storage
[Figure: Top View of a Surface showing a Track and a Sector; side view showing Cylinder, Platters, and R/W Heads]
Note: Parallel Read/Write Drives Activate All Heads Simultaneously
Chapter 19-9
CSE4701
Disk Drive Components
Chapter 19-10
CSE4701
Disk Characteristics and Access
- Transfer Time: Time to Copy Bits From Disk Surface to Primary Memory
- Disk Latency Time: Rotational Delay Waiting for Proper Sector to Rotate Under R/W Head
  - Rotate to Next Sector to Process Next Request
- Disk Seek Time: Delay While R/W Head Moves to the Destination Track/Cylinder
  - Move Head In/Out to Seek Next Track/Cylinder
- Access = Seek (In/Out) + Latency (Around) + Transfer (Bytes)
- For DBMS - Key is Moving Data To/From Disk ASAP w.r.t. Performance and Response Time
- Improve on 90-10 via Processing/Optimization
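The relationship Access = Seek + Latency + Transfer can be sketched numerically. The drive parameters below (9 ms seek, 7200 RPM, 100 MB/s transfer, 4 KB block) are illustrative assumptions, not values from the slides, and the average rotational latency is taken to be half a revolution.

```python
def access_time_ms(seek_ms: float, rpm: float,
                   transfer_mb_per_s: float, bytes_to_read: int) -> float:
    """Access = Seek (In/Out) + Latency (Around) + Transfer (Bytes), in ms."""
    # Assumption: on average the wanted sector is half a revolution away.
    latency_ms = 0.5 * 60_000.0 / rpm
    # Time to copy the bytes once the head is over the right sector.
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + latency_ms + transfer_ms


# Hypothetical drive: 9 ms seek, 7200 RPM, 100 MB/s, one 4 KB block.
one_block_ms = access_time_ms(seek_ms=9.0, rpm=7200,
                              transfer_mb_per_s=100.0, bytes_to_read=4096)
print(f"{one_block_ms:.3f} ms per random block access")
```

Under these assumed numbers, seek and latency dominate the transfer time, which is why reducing the number of block accesses matters more than the number of bytes moved.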
Chapter 19-11
CSE4701
Historical DB Architecture - Mainframe
Chapter 19-12
CSE4701
Client/Server DBS Architecture
Chapter 19-13
CSE4701
Mixed Architecture
Chapter 19-14
CSE4701
Three and Four Tier Architectures
From: http://java.sun.com/javaone/javaone98/sessions/T400/index.html
Chapter 19-15
CSE4701
What is MBDS?
- MBDS is a Multi-Process, Multi-Computer, Parallel Database System
- MBDS is Composed of …
  - Host for Issuing User Requests
  - Controller to Interact with Host (and User)
  - One or More Backend Database Processors
- Goals of MBDS
  - Suppose a Request Takes 4 Minutes with One Backend
  - Improve Response Time by Increasing Backends
    - Two Backends - Request 2+ Minutes
    - Four Backends - Request 1+ Minutes
Chapter 19-16
CSE4701
What is MBDS Architecture?
[Figure: Host/User connected to the Database Controller, which is connected to several Backend Database Processors]
- Database Blocks are Distributed Across All Backends
- Backend (BE) DB Processors are Replicated
- Database Controller Sends Same Query in Parallel to all BEs
- BEs Work in Parallel on Each Query and Communicate for Join
- Results are Sent to and Collected by the DB Controller - then to the User
Chapter 19-17
CSE4701
Approach Distributes Data Across Backends
- Suppose System has 10 Backends
- Consider a Number of Tables: Inventory, Customers, Employees, …
- What Happens if you Place One Table per Backend?
- What Happens if you Distribute … Table Across 10 Backends?
[Figure: Backend Database Processor 1, Backend Database Processor 2, …, Backend Database Processor 10]
Chapter 19-18
CSE4701
What are MBDS Processes?
[Figure: MBDS processes]
Database Controller
- Get Msg. / Put Msg.
- Request Preparation
- Post Processing
Backend Database Processor
- Get Msg. / Put Msg.
- Directory Management
- Record Processing
- Concurrency Control
- Disk I/O
Chapter 19-19
CSE4701
What are MBDS Messages?
No.  Type                                  SRC   DST
1    New Request                           Host  ReqP
2    Results of Request                    PoPr  Host
3    Number of Reqs in Transaction         ReqP  PoPr
4    Aggregate Operators (Sum, etc.)       ReqP  PoPr
6    Parsed Request to Backends            ReqP  DM
12   Backend Aggregate Operator Results    RecP  PoPr
15   Ids for Accessing Database Indexes    DM    DMs
16   Request and Disk Addresses            DM    RecP
21   Ids for Accessing Database Records    DM    CC
22   Locks Obtained: Okay to Execute       CC    RecP
23   Request ID of Finished Request        RecP  CC
Chapter 19-20
CSE4701
Sample Processing of Retrieve Request
[Figure: message flow among Request Preparation, Post Processing, Directory Management, Record Processing, Concurrency Control, and Disk I/O]
Labeled Steps (Step Letter + Message Number): A1, B3, C4, D6, E15 (To Backend(s)), F15 (From Other Backend), G21, H22, I16, J23, K12
Chapter 19-21
CSE4701
What are Synchronization Issues in MBDS?
Coordination of Synchronous Behavior …
- Within Controller and Backend to Allow
  - Multiple Active Requests within Each Process
  - Requests at Different Stages in Different Processes
- Between Controller and Backends to Allow
  - A Request to be Processed by All Backends
  - A Request to be Processed by One Backend
- Among Multiple Backends to Allow a Backend to
  - Synchronize its Work on one Request with Other Backends
  - Forward Results to Another Backend
Chapter 19-22
CSE4701
Introduction to Query Processing
Query optimization:
- The process of choosing a suitable execution strategy for processing a query.
Two internal representations of a query:
- Query Tree
- Query Graph
Chapter 19-23
CSE4701
Introduction to Query Processing
Chapter 19-24
CSE4701
Translating SQL Queries into Relational Algebra
Query block:
- The basic unit that can be translated into the algebraic operators and optimized.
- A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
- Nested queries within a query are identified as separate query blocks.
- Aggregate operators in SQL must be included in the extended algebra.
Chapter 19-25
CSE4701
Translating SQL Queries into Relational Algebra
Original Query:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
                 FROM EMPLOYEE
                 WHERE DNO = 5);

Inner Query Block (its result is the constant C):
SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5
Algebra: ℱMAX SALARY (σDNO=5 (EMPLOYEE))

Outer Query Block:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > C
Algebra: πLNAME, FNAME (σSALARY>C (EMPLOYEE))
Chapter 19-26
CSE4701
Why is Query Optimization Needed?
Data Volume in any Type of Join or Cartesian Product has the Potential to be Very Large!
Consider R(A, B) = {r1, r2, ..., rn}
Consider S(C, D) = {s1, s2, ..., sm}
R x S = {r1 s1, r1 s2, r1 s3, r1 s4, …, r2 s1, r2 s2, r2 s3, r2 s4, … }
which contains n x m tuples!
What is the Issue?
- If n is 10,000 and m is 20,000, then the Cartesian Product has 200,000,000 Tuples
- Join must Perform 200,000,000 Comparisons
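The n x m growth can be checked on a small scale. This is a minimal sketch with made-up toy relations; the tuple values and the helper name `cartesian_size` are hypothetical.

```python
from itertools import product


def cartesian_size(n: int, m: int) -> int:
    """Number of tuples in R x S when R has n tuples and S has m."""
    return n * m


# Toy relations: R has 3 tuples, S has 4 tuples.
R = [("r1",), ("r2",), ("r3",)]
S = [("s1",), ("s2",), ("s3",), ("s4",)]

# Every tuple of R is paired with every tuple of S.
pairs = list(product(R, S))
print(len(pairs))                        # size of the toy product
print(cartesian_size(10_000, 20_000))    # size for the slide's n and m
```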
Chapter 19-27
CSE4701
Why is Query Optimization Needed?
n / m - Number of Tuples of R / S Respectively
bR / bS - Number of Tuples/Block of Memory
Assume that K Blocks Fit into Primary Memory
[Figure: K memory blocks numbered 1, 2, 3, …, K-1, holding 1 Block of R and K-1 Blocks of S]
(n / bR), (m / bS)                     Number of Blocks for R / S
(m / bS)/(K-1)                         Number of Times that the K-1 Memory Chunk is Filled by S
(n / bR)[(m / bS)/(K-1)]               Which is Filled for Each Block of R
(n / bR) + (n / bR)[(m / bS)/(K-1)]    Total Block Reads (Must also Read Blocks of R)
Chapter 19-28
CSE4701
Why is Query Optimization Needed?
Total Block Reads = (n / bR) + (n / bR)[(m / bS)/(K-1)]
If n = m = 10,000, bR = bS = 5, and K = 100:
(10,000/5) + (10,000/5)[(10,000/5)/99] ≈ 42,400 Blocks to Read
At 20 Blocks/Second - 35 Minutes!
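The arithmetic above can be reproduced with a short sketch. The function name `block_reads` and its parameter names are illustrative; the formula is the one on the slide, evaluated without rounding to whole blocks.

```python
def block_reads(n: int, m: int, b_r: int, b_s: int, k: int) -> float:
    """Total block reads per the slide: (n/bR) + (n/bR) * [(m/bS) / (K-1)]."""
    blocks_r = n / b_r                  # number of blocks holding R
    blocks_s = m / b_s                  # number of blocks holding S
    chunk_fills = blocks_s / (k - 1)    # times the K-1 block chunk is filled by S
    return blocks_r + blocks_r * chunk_fills


total = block_reads(n=10_000, m=10_000, b_r=5, b_s=5, k=100)
minutes = total / 20 / 60               # at 20 blocks/second
print(f"{total:,.0f} block reads, about {minutes:.0f} minutes")
```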
Chapter 19-29
CSE4701
Observation
- Cartesian Product Yields Unwanted Data
SELECT R.A
FROM R, S
WHERE R.B = S.C AND S.D = 99
In Relational Algebra:
π_A (σ_{B=C AND D=99} (R × S))
= π_A (σ_{B=C} (R × σ_{D=99} (S)))
= π_A (R ⋈_{B=C} (σ_{D=99} (S)))
Has Performance Improved? How?
Chapter 19-30
CSE4701
Evaluation
- Cartesian Product for SELECT - 40,000 Blocks
SELECT R.A
FROM R, S
WHERE R.B = S.C AND S.D = 99
Relational Algebra with Equijoin:
π_A (R ⋈_{B=C} (σ_{D=99} (S)))
- The σ_{D=99} (S) Limits the Size of S Dramatically
- As a Result, the Equijoin of R and σ_{D=99} (S) Would Likely Reduce the Total Blocks Required to 4,000
- Thus, a “Smart” Query Execution Strategy Can Dramatically Reduce the Amount of I/Os
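The effect of applying the D = 99 selection before the join can be sketched by counting tuple comparisons. The relations below are invented toy data for R(A, B) and S(C, D), and comparison counts stand in for I/O only as a rough proxy.

```python
def product_then_select(R, S):
    """Strategy 1: form every pair of R x S, then test B = C and D = 99."""
    comparisons, out = 0, []
    for a, b in R:
        for c, d in S:
            comparisons += 1
            if b == c and d == 99:
                out.append(a)
    return out, comparisons


def select_then_join(R, S):
    """Strategy 2: apply D = 99 to S first, then join R with the reduced S."""
    comparisons, s_small = 0, []
    for c, d in S:
        comparisons += 1                # one pass over S for the selection
        if d == 99:
            s_small.append((c, d))
    out = []
    for a, b in R:
        for c, _ in s_small:
            comparisons += 1
            if b == c:
                out.append(a)
    return out, comparisons


# Toy data (hypothetical): 100 tuples in R, 200 in S, 10 of which have D = 99.
R = [(i, i % 10) for i in range(100)]
S = [(j % 10, 99 if j % 20 == 0 else 0) for j in range(200)]

result1, cost1 = product_then_select(R, S)
result2, cost2 = select_then_join(R, S)
print(cost1, cost2)
```

Both strategies return the same answer; only the amount of work differs.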
Chapter 19-31
CSE4701
Query Optimization Goal Limit Costly Join Operation by Reducing Data to be
Scanned or that Participates in the Join Query Optimization is Strategy to Achieve Goal While Improving Selection and Projection can Help,
the Main Objective is Join In Worst Case - Cartesian Product Can Improve by Introducing Indices on the Join
Attributes (R.B and S.C) to Limit “Product” Can Further Improve by Sorting on the Join
Attributes (R.B and S.C) This Reduces Block Accesses by Limiting the Number
of Blocks that Must be Examined in a Join If B’s Values Range from 0 to 100 and C from 50 to
150, only need to Compare from 50 to 100
Chapter 19-32
CSE4701
Query Processing Internal Data Structure
Memory Hierarchy Main Memory + Secondary Memory Information Must be Staged from Secondary to Primary
Memory for Database Operation Sequential Search
Brute force Approach Direct Access (Indexed Search)
Hash, Inverted Index file, Binary Search Tree, B-tree, B+-tree
Improves Selection by Focusing on Subset of Tuples that are Involved in the Answer and Equijoin by Not Having to Compare All Blocks in Two Relations
Chapter 19-33
CSE4701
Algorithms for Database Query Operators Largely Fall into Three Classes
Sorting-Based Methods Hash-Based Methods Index-Based Methods
Such Algorithms are Divided into Three Degrees of Difficulty and Cost (Limiting Factor is Size of Data) One Pass Algorithms
Where Data is Only Read Once From Disk Two-pass Algorithms
Data is Read from Disk, Processed in Some Way, Written Back to Disk, Read Again for Processing, etc.
Multi-pass Algorithms Where 3 or More Passes Are Required, i.e., Recursive
Generalization of the Two-pass Algorithms
Chapter 19-34
CSE4701
Database Join and Sort are External
[Figure: memory blocks numbered 1, 2, 3, …, 1000]
- Suppose that your DBS has 1,000 1K Blocks of Memory Available for Performing Operations (e.g., Select, Project, Join, Union, Aggregation, etc.)
- Suppose Sort R by R.B
  - R Contains 5000 Blocks
  - In order to Perform a Sort/Merge - You Must Use an External Algorithm since all 5000 Blocks Cannot Fit Into Memory at the Same Time
- Suppose Join R (500 Blocks) and S (800 Blocks)
  - Again - their Total Exceeds Memory - Hence you Must Take an Approach that Compares One Block of R with All Blocks of S, etc. (Slides 22,23)
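A two-pass external sort for the 5000-block case can be sketched as follows. This is a simplified in-memory simulation, not an actual disk-based implementation: "blocks" are Python lists, the block size of 4 keys and the random keys are assumptions, and `heapq.merge` stands in for the merge phase that would read one buffer per sorted run.

```python
import heapq
import random


def external_sort(blocks, memory_blocks):
    """Sort keys stored in `blocks` using at most `memory_blocks` blocks at once."""
    runs = []
    # Pass 1: load as many blocks as fit in memory, sort them, emit a sorted run.
    for start in range(0, len(blocks), memory_blocks):
        chunk = blocks[start:start + memory_blocks]
        runs.append(sorted(key for block in chunk for key in block))
    # Pass 2: merge the sorted runs; only the head of each run is needed at a time.
    merged = list(heapq.merge(*runs))
    return merged, len(runs)


rng = random.Random(0)
# R occupies 5000 blocks of 4 keys each (hypothetical block size).
R_blocks = [[rng.randrange(1_000_000) for _ in range(4)] for _ in range(5000)]
sorted_keys, num_runs = external_sort(R_blocks, memory_blocks=1000)
print(num_runs, len(sorted_keys))
```

With 1,000 memory blocks and 5,000 blocks of data, pass 1 produces five sorted runs, which is few enough to merge in a single second pass.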
Chapter 19-35
CSE4701
Database Join and Sort are External What’s True about Today’s DBMS Like Oracle? Oracle Recommends 2 Gigabytes of Primary Memory That 2 Gigabytes Must be Shared by:
Operating System Other Applications Running on “Same” Server
(Web Server, etc.) Database Management Software
Even if there was 1.5 Gigabytes Available, Modern DBs can Exceed that size Very Easily
Moreover, Cartesian Product Could Exceed Available Mem. Join Could Require External Approach Since All
Tables Involved in Join Can’t fit in 1.5 Gigabytes External Sorting/Block Oriented Processing is Norm
Chapter 19-36
CSE4701
Algorithms for DB Query Operators Relational Algebra Operators can be Classified into
Three Groups Tuple-at-a-time Unary Operators
Selection and Projection No Need to Bring Entire Relation into Memory at One
Time Full-Relation Unary Operators
Duplicate Elimination and Grouping Requires Seeing All or Most of the Tuples in Memory
at Once Full-Relation Binary Operators
Set and Bag Versions of Union, Intersection, and Difference, Joins, and Cartesian Products
Requires Seeing the Tuples of Both Relations in Memory
Chapter 19-37
CSE4701
Query Access
Application Interfaces
Application Programs
Database Schema
Query
DMLPreprocessor
Query Processor
DDLPreprocessor
DatabaseManager
ObjectCode of Aps
Dbms
File Manager
System Catalog
Data Files
DiskStorage
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO= WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
Chapter 19-38
CSE4701
DBMSSystem Buffer
User Program A
LanguageUser Work Area (UWA)
external schemaused by user
program A
Schema
Physical/InternalData Schema
Operating System
Database
1 2
3
45
6
78
910
Database Access
(DBMS)
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
Chapter 19-39
CSE4701
Database Access
1. User program A sends the DBMS an invoke command to retrieve a record (or set of records).
2. DBMS analyzes the external schema of user program A and finds the database description of the record.
3. DBMS checks the schema to get the data types and location information of the record.
4. DBMS checks the physical schema to find out which device the record is on and what access methods can be used.
5. According to step 4, DBMS sends the OS a read command to execute the search.
6. OS issues the page invoke command to the corresponding device, and then puts the fetched page into the system buffer.
7. DBMS uses the schema and the external schema to infer the logical structure of the retrieved record.
8. DBMS places the relevant data in the UWA, and
9. provides the status information at the program invocation exit.
Chapter 19-40
CSE4701
The System Catalog Store the Meta Information that Describes Each
Database, Including a Description of Conceptual Database Schema (Logical Data
Model) Relations, Attributes, Keys, Indexes, Views
Internal Schema External Schema
Store Information Needed by Specific DBMS Modules Query Optimization Module Security and Authorization
Chapter 19-41
CSE4701
Metadata - What is it? System metadata:
Where data came from How data were changed How data are stored How data are mapped Who owns data Who can access data Data usage history Data usage statistics
System metadata are critical in a DBMS
Application metadata: What data are available Where data are located What the data mean How to access the data Predefined reports Predefined queries How current the data are
Application metadata are critical in a database system
Chapter 19-42
CSE4701
Metadata vs. Data
- Meta Schema: describes all schemata that can be defined in the data model
- Data Dictionary Schema: contains copy of metaschema; schema for format definitions; schema for data about application data
- Data Dictionary Data: schema for application data; metadata about application data
- Data: raw formatted application data
[Figure: example relations at each level]
- relations (rel-name, att-name, dom-name)
- access-rights (user, relation, operation): (u1, supplier, insert), (u2, supplier, delete)
- supplier (s#, sname, location): (s1, smith, london), (s2, jones, boston)
Chapter 19-43
CSE4701
Example of Catalog Information
Chapter 19-44
CSE4701
Relational DBMS Catalog All Metadata Stored as Relations Example of Metadata Tables are:
Chapter 19-45
CSE4701
EER Diagram for Relational Catalog
Chapter 19-46
CSE4701
Metadata in Oracle Complex Data Dictionary
All Schema Objects (Tables,Views, Indices, …) User, All, and DBA Views
SELECT *
FROM ALL_CATALOG
WHERE OWNER = 'SMITH';

SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NUM_DISTINCT, LOW_VALUE, HIGH_VALUE
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'ORDERS';
Chapter 19-47
CSE4701
Metadata in Oracle
SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS, BLOCKS, EMPTY_BLOCKS, AVG_ROW_LENGTH
FROM USER_TABLES
WHERE TABLE_NAME = 'ORDERS';

SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY
FROM USER_INDEXES
WHERE TABLE_NAME = 'ORDERS';
Chapter 19-48
CSE4701
Uses of System Catalog
Example Query:
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO) AND (WORKS.PNO = PROJ.PNO) AND (PROJ.PNAME = 'CAD/CAM')
- DDL Compilers:
  - Correct Definition of Relations and Attributes
- DML (Query) Compiler:
  - DML Parser: Guided by the Description of DML Syntax and the Schema Information in the Catalog, Generates a Query Tree after Parsing
  - Optimizer: Generates an Access Path that is Relatively Optimal for Executing a Query/DML Command, by Accessing the Database Structure Information (Schemas), and Mapping High-level SQL Queries Into Low-level File Access Commands
Chapter 19-49
CSE4701
Revisit Typical Database Processing
Chapter 19-50
CSE4701
Typical Database Processing Pre-Processing
Actions Taken Upon Receipt of a Query from User SQL Query via Query Tool or JDBC Call “Compilation” of DB Query Check Syntax, Optimize, Develop Run-Time
Strategy (Similar to PL Compilation) Query is Translated to DB Transaction
A Transaction Contains Multiple DB Operations Transaction has Explicit Order of Operations
Database Transaction Must Succeed or Fail There is no Intermediate State – All or Nothing Completely Executed and Committed or
Aborts at any Point and Undone New State or Previous State of DB
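The all-or-nothing behavior can be sketched with Python's built-in `sqlite3` module. The `account` table, its CHECK constraint, and the `transfer` helper are hypothetical and only serve to force a failure partway through a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.execute("INSERT INTO account VALUES (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Two DB operations in one transaction: both are applied, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


# The second UPDATE violates the CHECK constraint, so the first is undone too.
ok = transfer(conn, src=1, dst=2, amount=500)
rows = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(ok, rows)
```

After the failed transfer the database is back in its previous state; no intermediate state with only the credit applied is visible.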
Chapter 19-51
CSE4701
Typical Database Processing High-Level Processing
Enqueue Transaction from Pre-Processing Transaction Must Wait for “Earlier” Transactions Remember - Shared DB State!
Request Locks from Concurrency Control All Locks Before Proceeding vs. Locks as Needed Avoid Deadlock and Livelock
Release Locks As Use of Data Completes to Increase Availability What Happens if Failure of Later Step in Transaction
Dequeue Transaction Completes Transaction Processing Return “Result” to Post-Processing
Chapter 19-52
CSE4701
What are Deadlock and Livelock?
- Deadlock
  - Query 1 Gets Access to Table A, Needs Table B
  - Query 2 Gets Access to Table B, Needs Table A
  - Query 1 Won’t Release A until it Gets B
  - Query 2 Won’t Release B until it Gets A
  - This is Deadlock!
- Livelock
  - Query 1 Gets A, Seeks B, Can’t, so Releases A
  - Query 2 Gets B, Seeks A, Can’t, so Releases B
  - Process Keeps Repeating
  - Can Lead to Starvation
- Analogy – Two People Trying to Pass in Narrow Hall
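One common way to detect the deadlock case is to look for a cycle in a wait-for graph. The slides do not prescribe a detection method, so the sketch below is one possible approach; the function name and the dictionary representation are assumptions.

```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle.

    waits_for maps each query to the set of queries whose locks it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2    # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:       # reached a query already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Query 1 holds A and waits for Query 2 (which holds B), and vice versa.
deadlocked = has_deadlock({"Q1": {"Q2"}, "Q2": {"Q1"}})
# Query 1 waits for Query 2, but Query 2 waits for nobody.
not_deadlocked = has_deadlock({"Q1": {"Q2"}, "Q2": set()})
print(deadlocked, not_deadlocked)
```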
Chapter 19-53
CSE4701
Typical Database Processing Low-Level Processing
Enqueue Transaction - Do Actual DB Operations Request Locks - Lower Granularity Level Issue I/Os - Based on Operations to Access
“Correct” and “Relevant” DB Records Process Returned Data - Aggregation, Sorting Integrity Checks: Do I/D/U Satisfy Constraints? Security Checks: Is DB R/I/D/U Allowed? Logging for Recovery - Commit the Transaction Release Locks - Available to Others Dequeue Transaction - Return Results to High-
Level Processing Note: The Multiple Operations of Each DB
Transaction All Must be Successful
Chapter 19-54
CSE4701
Typical Database Processing Post Processing
Collection of Results May be Passed Portions of Results as they Complete For Example, Sorted Blocks of Data that are then
Merged in a Final Step Aggregation Operations
May be Passed Aggregate Intermediate Results Sum for Different Departments to be Totaled
Security Checks
- Last Step Filtering to Ensure Only Allowed Data is Returned
- May Execute Query but Only See Aggregate Result
Send Results to User
Chapter 19-55
CSE4701
Typical Database Processing Concurrency Control
Control Access to Information Data and Metadata Prevent Simultaneous Updates Ensure Database Always Correct and Consistent Serial Schedule vs. Serializable Transaction Two Types
Pessimistic - Locking-Based - Assume Collisions Will Occur - e.g., Peoplesoft Course Registration
Optimistic - Time-Based - Fix Problems After the Fact - e.g., ATM Machines Example
CC Manages Locks at Different Granularity Levels (Table, Attribute, View, Tuple, Metadata, etc.)
Chapter 19-56
CSE4701
Typical Database Processing Disk I/O
Performs the Actual Disk I/O for Read/Writes Block Oriented Activity Maintain Queue of All I/O Requests
Ordering is Critical Related to Concurrency Control and Consistency
Single DB Transactions can have Multiple DB Operations with Multiple Disk I/Os
Disk I/Os for Different Operations at Different Times
High and Low Level Processing will Determine What Operations Needed When
Disk I/O - Relatively “Dumb”
Chapter 19-57
CSE4701
Typical Database Processing Recovery
Tightly Tied to DB Transaction Concept Transactions Must be:
Atomic - Happens or Doesn’t Durable - Once Committed, Results Survive Failure Consistent - Follows Protocol/Correct DB State
When Failure Occurs, Can we: Recover to a Correct “Earlier” State Reconcile all “Active” Transactions that were
Executing at Failure Time Involves Logging of Database Actions Objective: High Availability and Reliability
Chapter 19-58
CSE4701
Query Optimization Not Really Optimizing, but Planning to Avoid Bad
Execution Strategies Models
Heuristics-Based Apply Transformation Rules According to a General
Strategy Focus on Relational Algebra that Underlies Each Query Improve the “Order” of Relational Operations
Cost-Based Minimize a Cost Function
I/O Cost + CPU Cost Subject to a Set of Constraints
Chapter 19-59
CSE4701
Query Processing Methodology
High-level Calculus-based Query
QueryPreprocessing
QueryPreprocessing
QueryOptimization
QueryOptimization
Algebraic Query (a tree structure) LOGICALSCHEMA
LOGICALSCHEMA
INTERNALSCHEMA
INTERNALSCHEMA
Execution Schedule (file access plan)
EXTERNALSCHEMA
EXTERNALSCHEMA
Chapter 19-60
CSE4701
Query Preprocessing Input: Calculus Query on Base Relations Normalization
Manipulate Query Quantifiers and Qualification Analysis
Detect and Reject Incorrect Queries Possible for Only a Subset of Relational Calculus
Simplification Eliminate Redundant Predicates
Restructuring Calculus Query Algebraic Query More Than One Translation is Possible Use Transformation Rules
Chapter 19-61
CSE4701
Normalization
- Lexical and Syntactic Analysis (Similar to Compilers)
  - Check Validity
  - Check for Attributes and Relations
  - Type Checking on the Qualification
- Put into Normal Form
  - Conjunctive Normal Form: (p11 ∨ p12 ∨ … ∨ p1n) ∧ … ∧ (pm1 ∨ pm2 ∨ … ∨ pmn)
  - Disjunctive Normal Form: (p11 ∧ p12 ∧ … ∧ p1n) ∨ … ∨ (pm1 ∧ pm2 ∧ … ∧ pmn)
  - OR's Mapped into Union
  - AND's Mapped into Join or Selection
Chapter 19-62
CSE4701
Refute Incorrect Queries
Example:
E(ENAME, ENO), P(JNO, JNAME), W(ENO, JNO, DUR)
SELECT ENAME, JNAME
FROM E, P, W
WHERE DUR > 27 AND DUR < 25
- Incorrect
  - Disjoint Components are Useless
  - Multiple Relations with Missing Joins may not be Incorrect, but may Indicate a Cartesian Product
- Contradictory
  - Qualification cannot be Satisfied by any Tuple
  - DUR > 27 AND DUR < 25
Chapter 19-63
CSE4701
Simplification
- Why Simplify?
  - The Simpler the Query, the Less Work there is and the Better the Performance
- How? Use Transformation Rules
  - Elimination of Redundancy
    - Idempotency Rules
      p1 ∧ ¬(p1) = false
      ¬(p1 ∧ p2) = ¬(p1) ∨ ¬(p2)
      p1 ∨ false = p1
      …
  - Application of Transitivity
  - Use of Integrity Rules
- Example
  - x > a and x > b Reduces to x > max(a, b)
  - DUR > 27 AND DUR > 25 Reduces to DUR > 27
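Redundancy elimination and contradiction detection for simple range predicates can be sketched as below. The function name `simplify_bounds` and its representation of a conjunction as lists of strict lower and upper bounds are assumptions made for illustration.

```python
def simplify_bounds(lower_bounds, upper_bounds=()):
    """Simplify a conjunction of predicates x > a (lower) and x < b (upper).

    Returns (low, high) for the single tightest pair of bounds, with None for
    a missing side, or None if no value of x can satisfy the conjunction.
    """
    low = max(lower_bounds) if lower_bounds else None
    high = min(upper_bounds) if upper_bounds else None
    if low is not None and high is not None and low >= high:
        return None                 # contradictory qualification
    return (low, high)


# DUR > 27 AND DUR > 25 keeps only the tighter bound.
redundant = simplify_bounds([27, 25])
# DUR > 27 AND DUR < 25 cannot be satisfied by any tuple.
contradictory = simplify_bounds([27], [25])
print(redundant, contradictory)
```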
Chapter 19-64
CSE4701
Restructuring
- Convert Relational Calculus to Relational Algebra
- Make use of Query Trees
- Example: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.
SELECT ENAME
FROM E, W, P
WHERE E.ENO = W.ENO
AND W.JNO = P.JNO
AND E.ENAME <> "J. Doe"
AND P.JNAME = "CAD/CAM"
AND (W.DUR = 12 OR W.DUR = 24)
Query Tree (Project at the Root, then Select, then Joins over P, W, and E):
π_ENAME (σ_{(DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME≠"J. DOE"} (P ⋈_JNO (W ⋈_ENO E)))
Chapter 19-65
CSE4701
Query Optimization Objectives Improving Performance Arriving at a Query Plan of Execution Analyzing the Relational Algebra Query
Replace Costly Operations Do Selections and Projections Early
Optimization Heuristics for the Relational Algebra Performing Selection and Projection Before Join Combining Several Selections Over a Single
Relation Into One Selection Find Common Subexpressions Algebraic Rewriting/transformation Rules
General Transformation Rules for Relational Algebra (Equivalence-preserving Algebraic Rewriting Rules)
Chapter 19-66
CSE4701
Query Optimization: An Example
Why is it important?
SELECT ENAME
FROM E, W
WHERE E.ENO = W.ENO AND W.RESP = "Manager"
Strategy 1: π_ENAME (σ_{RESP="Manager" ∧ E.ENO=W.ENO} (E × W))
Strategy 2: π_ENAME (E ⋈_ENO (σ_{RESP="Manager"} (W)))
Chapter 19-67
CSE4701
Cost of Alternatives
Assume:
- card(E) = 4,000; card(W) = 10,000
- 10% of tuples in W satisfy RESP="Manager" (selection generates 1,000 tuples)
- Execution time Proportional to the Sum of the Cardinalities of the Temporary Relations
- Searching is Done by Sequential Scanning

Strategy 1                        Strategy 2
Cartesian prod. = 40,000,000      Selection over W   = 10,000
Search over all = 40,000,000      Join (4,000*1,000) = 4,000,000
Total           = 80,000,000      Total              = 4,010,000
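The two totals follow directly from the stated cardinalities, as this small sketch shows. The function name `strategy_costs` is illustrative; the cost model is the slide's (sum of tuples touched, sequential scanning).

```python
def strategy_costs(card_e: int, card_w: int, selectivity: float):
    """Return (cost of Strategy 1, cost of Strategy 2) in tuples touched."""
    # Strategy 1: build E x W, then scan the whole product for the selection.
    product_size = card_e * card_w
    strategy1 = product_size + product_size
    # Strategy 2: scan W for the selection, then join E with the selected tuples.
    selected = round(card_w * selectivity)
    strategy2 = card_w + card_e * selected
    return strategy1, strategy2


s1, s2 = strategy_costs(card_e=4_000, card_w=10_000, selectivity=0.10)
print(f"Strategy 1: {s1:,}  Strategy 2: {s2:,}  ratio: {s1 / s2:.1f}x")
```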
Chapter 19-68
CSE4701
General Query Optimization Strategy
- Perform Selections Early
  - Yields Smaller Intermediate Results
  - Direct Impact on Subsequent Join/Cartesian Prod.
- Combine Selections with a Prior Cartesian Product into a Theta or Equi Join
  - Join is a Cheaper Operation
- Combine (Cascade) Selections and Projections
  - π_A (π_B (R)) ≡ π_A (R), where A ⊆ B
  - σ_p1 (σ_p2 (R)) ≡ σ_{p1 ^ p2} (R)
  - This Results in One Pass Instead of Two over the Table
Chapter 19-69
CSE4701
General Query Optimization Strategy Identify Common Subexpressions
Compute Once and Store use Stored Version for Subsequent Times Often Useful When Views are Employed
Preprocess Data via Sorts and Indexes Speeds up Searches and Joins by Limiting Scope
Evaluate and Assess Different Options For Cartesian Product, Use Smaller Relation for
Comparison Use System Catalog (Meta-data) to Effect Order in
Query Execution Plan
Chapter 19-70
CSE4701
Relational Algebra Transformations
1. Cascade of Selection
   σ_{p1 ^ p2 ^ … ^ pn} (R) ≡ σ_p1 (σ_p2 (...(σ_pn (R))...))
2. Commutativity of Selection
   σ_p1 (σ_p2 (R)) ≡ σ_p2 (σ_p1 (R))
   σ_{p1 or p2} (R) ≡ σ_p1 (R) ∪ σ_p2 (R)
3. Cascade of Projection
   π_A1 (π_A2 (...(π_An (R))...)) ≡ π_A1 (R) if A1 ⊆ A2 ⊆ ... ⊆ An
4. Commuting Selection with Projection (p Involves Only Attributes A1, ..., An)
   π_{A1,A2,...,An} (σ_p (R)) ≡ σ_p (π_{A1,A2,...,An} (R))
Chapter 19-71
CSE4701
Relational Algebra Transformations
5. Commutativity of Theta Join and Cartesian Product
   R ⋈_A S ≡ S ⋈_A R
   R × S ≡ S × R
6. Commuting Selection with Theta Join (Cartesian)
   σ_{p(A)} (R × S) ≡ (σ_{p(A)} (R)) × S   (A defined on R only)
   σ_{p(A) ^ p(B)} (R × S) ≡ (σ_{p(A)} (R)) × (σ_{p(B)} (S))   (A defined on R, B defined on S)
   Also Holds for Theta Join
7. Commuting Projection with Theta Join (Cartesian)
   π_C (R × S) ≡ π_A (R) × π_B (S) where A ∪ B = C
   A are Attributes in C for R and B are Attributes in C for S
Chapter 19-72
CSE4701
Relational Algebra Transformations
8. Commutativity of Set Operations
   R ∪ S ≡ S ∪ R
   R ∩ S ≡ S ∩ R
9. Associativity of Set Operations
   (R × S) × T ≡ R × (S × T)
   (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
   (R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
   (R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
10. Commuting Select with Set Operations
   σ_{p(Ai)} (R ∪ T) ≡ σ_{p(Ai)} (R) ∪ σ_{p(Ai)} (T)
   where Ai is defined on both R and T
   σ_{p(Ai)} (R ∩ T) ≡ σ_{p(Ai)} (R) ∩ σ_{p(Ai)} (T)
   where Ai is defined on both R and T
Chapter 19-73
CSE4701
Relational Algebra Transformations
11. Commuting Projection with Binary Operations
   π_C (R ⋈_{q(Aj,Bk)} S) ≡ π_A' (R) ⋈_{q(Aj,Bk)} π_B' (S)
   π_C (R × S) ≡ π_A' (R) × π_B' (S)
   where R[A] and S[B], C = A' ∪ B', A' ⊆ A, B' ⊆ B
12. Converting Selection/Cartesian Into Theta Join
   σ_C (R × S) ≡ R ⋈_C S
Chapter 19-74
CSE4701
Using Heuristics in Query Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial internal representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.
The main heuristic is to apply first the operations that reduce the size of intermediate results.
- E.g., apply SELECT and PROJECT operations before applying the JOIN or other operations.
Chapter 19-75
CSE4701
Using Heuristics in Query Optimization (2) Query tree:
A tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.
An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
Query graph: A graph data structure that corresponds to a relational
calculus expression. It does not indicate an order on which operations to perform first. There is only a single graph corresponding to each query.
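A query tree and its bottom-up execution can be sketched with a few node classes. The class names, the dictionary-based rows, and the sample E, W, P tuples are all hypothetical; the join is a naive nested loop and the projection keeps duplicates (bag semantics).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Relation:                     # leaf node: an input relation
    rows: List[Row]

    def execute(self) -> List[Row]:
        return self.rows


@dataclass
class Select:                       # internal node: selection
    predicate: Callable[[Row], bool]
    child: object

    def execute(self) -> List[Row]:
        return [r for r in self.child.execute() if self.predicate(r)]


@dataclass
class Project:                      # internal node: projection (keeps duplicates)
    attrs: Tuple[str, ...]
    child: object

    def execute(self) -> List[Row]:
        return [{a: r[a] for a in self.attrs} for r in self.child.execute()]


@dataclass
class Join:                         # internal node: equijoin on one shared attribute
    attr: str
    left: object
    right: object

    def execute(self) -> List[Row]:
        right_rows = self.right.execute()
        return [{**l, **r} for l in self.left.execute()
                for r in right_rows if l[self.attr] == r[self.attr]]


E = [{"ENO": 1, "ENAME": "A. Lee"}, {"ENO": 2, "ENAME": "B. Kim"}]
W = [{"ENO": 1, "JNO": 10, "DUR": 12}, {"ENO": 2, "JNO": 20, "DUR": 24}]
P = [{"JNO": 10, "JNAME": "CAD/CAM"}, {"JNO": 20, "JNAME": "Payroll"}]

# Each internal node runs once its operands (children) have produced relations.
tree = Project(("ENAME",),
               Join("JNO",
                    Select(lambda r: r["JNAME"] == "CAD/CAM", Relation(P)),
                    Join("ENO", Relation(W), Relation(E))))
result = tree.execute()
print(result)
```

Rearranging the nodes of `tree` without changing `result` is exactly what the heuristic transformation rules aim to do.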
Chapter 19-76
CSE4701
Using Heuristics in Query Optimization
Heuristic Optimization of Query Trees: The same query could correspond to many different
relational algebra expressions — and hence many different query trees.
The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.
Example:
Q: SELECT LNAME
   FROM EMPLOYEE, WORKS_ON, PROJECT
   WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31';
Chapter 19-77
CSE4701
Heuristics Algebraic Optimization Concepts Using Cascade of Selections Rule, Break up Any
Selections With Conjunctive Conditions Into a Cascade of Selections Allows More Freedom in Moving Selections
Down Different Branches of the Tree Using Commutativity of Selections with Other
Operations Rules, Move Each Selection Down the Query Tree as far as Possible
If Possible, Combine a Cartesian Product With a Selection Into a Join
Chapter 19-78
CSE4701
Heuristics Algebraic Optimization Concepts Using Associativity of Binary Operations, Rearrange
the Leaf Nodes So That the Most Restrictive Selections Are Executed First The Fewer Tuples the Resulting Relation Contains,
the More Restrictive the Selection Reducing the Size of Intermediate Results
Improves Performance Using Cascade of Projections and Commutativity of
Projections with Other Operations, Move Projections Down the Query Tree as Far as Possible
Identify Subtrees that Represent Groups of Operations that can be Executed by a Single Algorithm
Chapter 19-79
CSE4701
Heuristic Algebraic Optimization Algorithm
Use Rule 1 to Break up Selects with Conjunctions into a Cascade to Move them Down the Query Tree
Use Rules 2, 4, 6, and 10 to Commute Select with Project, Join, Cart. Prod., Union, and Intersection
Use Rule 5 (Commute) and 9 (Associative) to Rearrange the Leaf Nodes of Query Tree to:
  Most Restrictive Select Executed First
  Avoid Cartesian Product in Leaf Nodes
Use Rule 12 to Convert a Select/Cart Prod to Join
Use Rules 3, 4, 7, and 11 to Cascade and Commute Project - Pushing Down Tree as Far as Possible
Identify Subtrees that Can Execute as Independent Algorithms (Set of Operations)
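The first two steps of the algorithm (split a conjunctive selection into a cascade, then push each selection as far down the tree as its attributes allow) can be sketched in Python. This is only an illustration under stated assumptions: the tuple-based tree encoding and the names `attrs_of`, `push`, and `push_selections` are hypothetical, and the relations E, P, and W are taken from the example slides that follow.

```python
# Hypothetical tree encoding (not from the slides):
#   ("rel", name, attrs)      a base relation with its attribute names
#   ("join", left, right)     a join or Cartesian product
#   ("select", label, child)  a selection; label is the condition text

def attrs_of(node):
    """Attributes available in the output of a subtree."""
    if node[0] == "rel":
        return set(node[2])
    if node[0] == "select":
        return attrs_of(node[2])
    return attrs_of(node[1]) | attrs_of(node[2])

def push(node, label, needed):
    """Place one selection at the lowest node that supplies every attribute it needs."""
    if node[0] == "join":
        _, left, right = node
        if needed <= attrs_of(left):
            return ("join", push(left, label, needed), right)
        if needed <= attrs_of(right):
            return ("join", left, push(right, label, needed))
    elif node[0] == "select":
        # Selections commute, so slide below an existing selection.
        return ("select", node[1], push(node[2], label, needed))
    # Leaf reached, or the condition spans both join inputs: apply it here.
    return ("select", label, node)

def push_selections(tree, conjuncts):
    """conjuncts: the cascade obtained by splitting a conjunctive condition."""
    for label, needed in conjuncts:
        tree = push(tree, label, needed)
    return tree

E = ("rel", "E", ("ENAME", "ENO"))
P = ("rel", "P", ("JNO", "JNAME"))
W = ("rel", "W", ("ENO", "PNO", "DUR"))
canonical = ("join", ("join", W, E), P)
cascade = [
    ("DUR=12 OR DUR=24", {"DUR"}),
    ("JNAME='CAD/CAM'", {"JNAME"}),
    ("ENAME='J. DOE'", {"ENAME"}),
]
optimized = push_selections(canonical, cascade)
```

Each selection ends up directly above the single relation that supplies its attribute, which is the state the example reaches after its "push selection down" steps. A condition that mentions attributes from both inputs of a join stays above that join.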
Chapter 19-80
CSE4701
Heuristic Optimization: Example
Relations: E(ENAME, ENO), P(JNO, JNAME), W(ENO, PNO, DUR)
Canonical query tree at the end of the query preprocessing phase (root first):
π ENAME
  σ (DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME="J. DOE"
    ⋈ JNO
      ⋈ ENO
        W
        E
      P
Chapter 19-81
CSE4701
Heuristic Optimization: Example
Use the cascading of selections rule to decompose selections
Query tree (root first):
π ENAME
  σ DUR=12 OR DUR=24
    σ JNAME="CAD/CAM"
      σ ENAME="J. DOE"
        ⋈ JNO
          ⋈ ENO
            W
            E
          P
Chapter 19-82
CSE4701
Heuristic Optimization: Example
Push selection down using commutativity of selection over join
Query tree (root first):
π ENAME
  σ DUR=12 OR DUR=24
    σ JNAME="CAD/CAM"
      ⋈ JNO
        ⋈ ENO
          W
          σ ENAME="J. Doe"
            E
        P
Chapter 19-83
CSE4701
Heuristic Optimization: Example
Push selection down using commutativity of selection over join
Query tree (root first):
π ENAME
  σ DUR=12 OR DUR=24
    ⋈ JNO
      σ JNAME="CAD/CAM"
        P
      ⋈ ENO
        W
        σ ENAME="J. Doe"
          E
Chapter 19-84
CSE4701
Heuristic Optimization: Example
Push selection down
Query tree (root first):
π ENAME
  ⋈ JNO
    σ JNAME="CAD/CAM"
      P
    ⋈ ENO
      σ DUR=12 OR DUR=24
        W
      σ ENAME="J. Doe"
        E
Chapter 19-85
CSE4701
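Heuristic Optimization: Example
Do early projection
[Query tree figure: projections added to the tree. Joins: JNO, ENO. Selections: ENAME = "J. Doe" on E; JNAME = "CAD/CAM" on P; DUR=12 OR DUR=24 on W. Projection lists shown: ENAME; JNO,ENAME; JNO; JNO,ENO; JNO,ENAME]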
Chapter 19-86
CSE4701
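Heuristic Optimization: Example
Identify subtrees that can be implemented in one algorithm
[Query tree figure: the tree with early projections. Joins: JNO, ENO. Selections: ENAME = "J. Doe" on E; JNAME = "CAD/CAM" on P; DUR=12 OR DUR=24 on W. Projection lists shown: ENAME; JNO,ENAME; JNO; JNO,ENO; JNO,ENAME]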
Chapter 19-87
CSE4701
Heuristic Optimization: A Second Example
BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)
Let XLOANS = π_S(σ_F(Loans x Borrowers x Books)) where:
  S = {Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date}
  F = {Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No}
Chapter 19-88
CSE4701
Heuristic Optimization: A Second Example
XLOANS query tree (root first):
π Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date
  σ Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No
    X
      X
        Loans
        Borrower
      Books
Chapter 19-89
CSE4701
Heuristic Optimization: A Second Example
Query = π_Title(σ_{Date 1/1/88}(XLOANS))
Query tree (root first):
π Title
  σ Date 1/1/88
    π Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date
      σ Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No
        X
          X
            Loans
            Borrower
          Books
Chapter 19-90
CSE4701
Heuristic Optimization: A Second Example
Try to Cascade
[Query tree figure. Node labels: Title; Date 1/1/88 (shown twice); Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
Chapter 19-91
CSE4701
Heuristic Optimization: A Second Example
Commute Select and Project
[Query tree figure. Node labels: Title; Date 1/1/88; Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
Chapter 19-92
CSE4701
Heuristic Optimization: A Second Example
Commute Select and Select
[Query tree figure. Node labels: Title; Date 1/1/88; Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
Chapter 19-93
CSE4701
Heuristic Optimization: A Second Example
Commute Select and Cartesian Product Two Levels Down
[Query tree figure. Node labels: Title; Date 1/1/88; Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
Chapter 19-94
CSE4701
Heuristic Optimization: A Second Example
Try to Cascade
[Query tree figure. Node labels: Title; Date 1/1/88; Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No; Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
Chapter 19-95
CSE4701
Heuristic Optimization: A Second Example
Commute Select and Cartesian Product One Level Down
[Query tree figure. Node labels: Title; Date 1/1/88; Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date; Borrower.Card_No = Loans.Card_No; Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
What's Next?
Chapter 19-96
CSE4701
Heuristic Optimization: A Second Example
Combine Projections
[Query tree figure. Node labels: Title; Date 1/1/88; Borrower.Card_No = Loans.Card_No; Books.LC_No = Loans.LC_No; X; X. Leaves: Books, Loans, Borrower]
What is Still a Problem?
We are Not Projecting, so All Attributes are Still Collected Until the Final Project!
Chapter 19-97
CSE4701
Heuristic Optimization: A Second Example
Add Strategic Projections to Send Only the Minimum Up the Tree as Needed for Join/Result Set
[Query tree figure. Node labels: Title; Date 1/1/88; Borrower.Card_No = Loans.Card_No; Books.LC_No = Loans.LC_No; X; X. Added projection lists: Loans.LC_No, Loans.Card_No; Loans.LC_No; Borr.Card_No; Books.LC_No, Title. Leaves: Books, Loans, Borrower]
Chapter 19-98
CSE4701
Heuristic Optimization: A Second Example
[Query tree figure. Node labels: Title; Date 1/1/88; Borrower.Card_No = Loans.Card_No; Books.LC_No = Loans.LC_No; X; X. Projection lists: Loans.LC_No, Loans.Card_No; Loans.LC_No; Borr.Card_No; Books.LC_No, Title. Leaves: Books, Loans, Borrower]
What is the Final Step?
  Combine Select and Cartesian Product
  Result: Equijoins!
Chapter 19-99
CSE4701
Heuristics Query Optimization: Summary
First Apply Operations that Reduce the Size of Intermediate Results
Move Selections and Projections Down the Tree as Far as Possible
  Early Selections Reduce the Number of Tuples
  Early Projections Reduce the Number of Attributes
The Most Restrictive Selection and Join Operations Should be Executed Before Other Similar Operations
  This is Accomplished by Reordering the Leaf Nodes of the Tree Among Themselves and Adjusting the Rest of the Tree Appropriately
Chapter 19-100
CSE4701
Cost-Based Optimization
Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
  Access Cost to Secondary Storage
    Search for Data Block (Index)
    Read/Write Index and Data Blocks
  Storage Cost
    Index and Data Blocks
    Intermediate Files
  Computation Cost
    Query Planning - Optimization Effort
    Record Search, Sort, Merge
    Actual Transaction/Query Operations
  Communications Cost
    Transfer of Results to the User
Chapter 19-101
CSE4701
Complexity of Relational Operations
Assuming:
  Relations of Cardinality n
  Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated:
  Operation                                      Complexity
  Select, Project (w/o duplicate elimination)    O(n)
  Project (with duplicate elimination), Group    O(n log n)
  Join, Division, Set Operators                  O(n log n)
  Cartesian Product                              O(n^2)
Avoid Cartesian Product at All Costs!
Chapter 19-102
CSE4701
Cost-Based Optimization
To Understand Cost-Based Operations, we Must Focus on Implementation Strategy of:
  Select
  Project
  Join
For Select and Project - There is a Fixed Cost that we Must Live With
For Join Implementation Strategy
  Different Join Strategies
  Objective: Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic Optimization Can Complement One Another
Chapter 19-103
CSE4701
Implementation of SELECT
Principles
  Equality Eliminates Many Tuples
  Index Focuses and Limits Search Scope
Sequential Scan
  Brute Force Search of All Records to Find Matching Ones
Binary Search
  Equality Comparison on a Key Attribute on Which the File is Ordered
Primary Index or Hash Key for Single Record
  Equality Comparison on a Key Attribute With Primary Index or Hash Key
  Go Directly to Record; No Need to Scan Entire Table
  Cost to Maintain Index/Hash
Chapter 19-104
CSE4701
Implementation of SELECT
Primary Index for Multiple Records
  Use the Primary Index to Find the Record that Satisfies the Equality Condition
  Go Forward (> or >=) or Backward (< or <=) According to the Comparison Operator
Clustering Index for Multiple Records
  Equality Comparison on a Non-key Attribute With a Clustering Index (e.g., Sort-Merge Algorithm)
Secondary Index
  Equality or Range Queries
Primary Indexes Play a Role Similar to Searching Sorted Array
We'll Discuss Indexing Techniques at a Later Time
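The sorted-array analogy can be made concrete with Python's standard `bisect` module. This is a small sketch, not a DBMS implementation: `range_select` and the sample `keys` list are hypothetical, and the sorted list stands in for a file ordered on its primary key. The lookup finds the boundary position by binary search and then takes everything forward or backward from it, depending on the comparison operator.

```python
from bisect import bisect_left, bisect_right

def range_select(sorted_keys, op, value):
    """Return the keys satisfying `key op value` from a list sorted on the key."""
    if op == "=":
        lo, hi = bisect_left(sorted_keys, value), bisect_right(sorted_keys, value)
    elif op == ">":
        lo, hi = bisect_right(sorted_keys, value), len(sorted_keys)   # go forward
    elif op == ">=":
        lo, hi = bisect_left(sorted_keys, value), len(sorted_keys)    # go forward
    elif op == "<":
        lo, hi = 0, bisect_left(sorted_keys, value)                   # go backward
    elif op == "<=":
        lo, hi = 0, bisect_right(sorted_keys, value)                  # go backward
    else:
        raise ValueError(f"unsupported operator: {op}")
    return sorted_keys[lo:hi]

keys = [3, 7, 7, 12, 18, 25]
print(range_select(keys, ">", 7))    # [12, 18, 25]
print(range_select(keys, "<=", 12))  # [3, 7, 7, 12]
```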
Chapter 19-105
CSE4701
Recall B+ Tree – Find Leaf and Go Left or Right
Chapter 19-106
CSE4701
Recall B+ Tree – Find Leaf and Go Left or Right
Chapter 19-107
CSE4701
Implementation of SELECT
Conjunctive Selection (C1 AND C2 AND … AND CN)
  If One of the Conjuncts has a Good Access Path, Use it and Check the Other Conjuncts for Each of These Records
Composite Index
  If an Index has Been Established Jointly for a Number of Attributes in the Conjunct Equality Condition
Intersection of Pointers
  If Secondary Indexes Exist on All or Most of the Attributes in the Conjunct and the Indexes Include Record Pointers
  Retrieve the Record Pointers for Each Condition Using These Indexes and Then Take Their Intersection
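The intersection-of-pointers strategy can be sketched with Python sets standing in for lists of record pointers. All names here (`build_index`, `conjunctive_select`, the sample `records`) are hypothetical, and each secondary index is modeled as a dictionary from attribute value to the set of record ids holding that value.

```python
def build_index(records, attr):
    """Secondary index on one attribute: value -> set of record ids (pointers)."""
    index = {}
    for rid, rec in records.items():
        index.setdefault(rec[attr], set()).add(rid)
    return index

def conjunctive_select(indexes, conditions):
    """Intersect the pointer sets of all equality conditions (at least one required)."""
    pointer_sets = [indexes[attr].get(value, set()) for attr, value in conditions.items()]
    pointer_sets.sort(key=len)          # start from the smallest set
    result = set(pointer_sets[0])
    for pointers in pointer_sets[1:]:
        result &= pointers
    return result

records = {
    1: {"bank": "BofA", "customer": "Bill"},
    2: {"bank": "BofA", "customer": "Ann"},
    3: {"bank": "Chase", "customer": "Bill"},
}
indexes = {attr: build_index(records, attr) for attr in ("bank", "customer")}
matches = conjunctive_select(indexes, {"bank": "BofA", "customer": "Bill"})
print(matches)  # {1}
```

Only the records whose ids survive the intersection need to be fetched from disk, which is the point of the technique.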
Chapter 19-108
CSE4701
Implementing PROJECT
If <Attribute List> Includes Key
  Simple Since the Cardinality of the Result is the Same as the Cardinality of the Original Relation
  No Need to Remove Duplicates - Key Attribute
If <Attribute List> Does Not Include Key
  Duplicates Allowed
  Duplicate Elimination
    Sort After Projection and then Eliminate Consecutively Appearing Duplicates
    See Textbook for Algorithms
    Use Hashing: Hash Each Record Into a Bucket and Check Against Records Already in That Bucket
Size Estimation: card(π_A(R)) = card(R)
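Both duplicate-elimination approaches can be illustrated in a few lines of Python. This is a sketch on in-memory rows rather than disk blocks; `project_sort`, `project_hash`, and the sample `rows` are hypothetical names, and a Python `set` plays the role of the hash buckets.

```python
def project_sort(rows, attrs):
    """Project, sort, then drop duplicates that now appear consecutively."""
    projected = sorted(tuple(row[a] for a in attrs) for row in rows)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def project_hash(rows, attrs):
    """Project and keep a tuple only if it has not been hashed before."""
    seen = set()
    result = []
    for row in rows:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result

rows = [
    {"bank": "BofA", "customer": "Bill", "balance": 1500},
    {"bank": "Chase", "customer": "Ann", "balance": 300},
    {"bank": "BofA", "customer": "Cy", "balance": 50},
]
print(project_sort(rows, ["bank"]))  # [('BofA',), ('Chase',)]
print(project_hash(rows, ["bank"]))  # [('BofA',), ('Chase',)]
```

The sort-based version returns tuples in sorted order, while the hash-based version keeps first-seen order and avoids the sort.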
Chapter 19-109
CSE4701
Implementing JOIN
Nested Loop
  Simple Iteration and Block-Oriented Iteration
    For Each Block in R do
      Retrieve Every Record from S and Test Join Condition
  An Index for S may Speed up the Inner Loop
  Smaller Relation should be Outer Loop
Calculation of I/O
  Let bo (bi) be the Number of Blocks taken up by Outer (Inner) Relation
  Let nB (>1) be the Buffer Size (in blocks) Devoted to Arguments
  Let bR be the Size of the Resulting Relation (in blocks)
  Total no. of Block Accesses = bo + (bo/(nB-1)) * bi + bR
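A block-oriented nested loop join that counts its own block reads is sketched below. Two things here are assumptions rather than statements from the slide: `nB - 1` buffer blocks hold outer blocks while one block holds the current inner block, and the outer relation is therefore processed in ceil(bo/(nB-1)) chunks. The function name, the `match` predicate, and the sample blocks are hypothetical, and writing the result (the bR term) is not counted.

```python
def block_nested_loop_join(outer_blocks, inner_blocks, match, n_buffers=2):
    """Join two relations given as lists of blocks (each block is a list of tuples).

    Returns (result_tuples, block_reads).
    """
    chunk = n_buffers - 1              # assumed: buffer blocks available for the outer relation
    reads = 0
    result = []
    for start in range(0, len(outer_blocks), chunk):
        loaded = outer_blocks[start:start + chunk]
        reads += len(loaded)           # read the next chunk of outer blocks
        for inner_block in inner_blocks:
            reads += 1                 # rescan the inner relation for every chunk
            for block in loaded:
                for r in block:
                    for s in inner_block:
                        if match(r, s):
                            result.append(r + s)
    return result, reads

outer = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d")]]   # bo = 3 blocks
inner = [[(2, "x")], [(3, "y"), (5, "z")]]               # bi = 2 blocks
same_key = lambda r, s: r[0] == s[0]

print(block_nested_loop_join(outer, inner, same_key, n_buffers=2))
# ([(2, 'b', 2, 'x'), (3, 'c', 3, 'y')], 9)
```

With `n_buffers=2` the count is 3 + 3 * 2 = 9 reads; with `n_buffers=3` the outer relation is loaded in two chunks and the count drops to 3 + 2 * 2 = 7.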
Chapter 19-110
CSE4701
Implementing JOIN
Sort-Merge Join
  Physically Sort Relations R and S
  Scan R and S in the Sorted Order and Merge
  See Algorithm in Textbook
If Files are Not Physically Sorted, but Indexes Provide an Ordering on the Join Attributes, a Variation May be Used
  Quite Inefficient Since Records are Scattered Over the Disk
Total Number of Block Accesses = bo + bi + bo*log2(bo) + bi*log2(bi) + bR
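The sort-then-merge idea can be shown on in-memory tuples. This is a simplified sketch, not the textbook's algorithm: `sort_merge_join` is a hypothetical name, join attributes are given as tuple positions, and Python's `sorted` stands in for the external sort that accounts for the b*log2(b) terms.

```python
def sort_merge_join(r, s, key_r, key_s):
    """Equijoin of two lists of tuples on the attribute positions key_r and key_s."""
    r = sorted(r, key=lambda t: t[key_r])
    s = sorted(s, key=lambda t: t[key_s])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        kr, ks = r[i][key_r], s[j][key_s]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the run of S tuples sharing this key value.
            j_end = j
            while j_end < len(s) and s[j_end][key_s] == kr:
                j_end += 1
            # Pair every R tuple with this key against the whole run.
            while i < len(r) and r[i][key_r] == kr:
                for t in s[j:j_end]:
                    result.append(r[i] + t)
                i += 1
            j = j_end
    return result

r = [(2, "b"), (1, "a"), (2, "c")]
s = [(2, "x"), (3, "y"), (2, "z")]
print(sort_merge_join(r, s, 0, 0))
# [(2, 'b', 2, 'x'), (2, 'b', 2, 'z'), (2, 'c', 2, 'x'), (2, 'c', 2, 'z')]
```

After sorting, each relation is scanned once, which corresponds to the bo + bi part of the cost.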
Chapter 19-111
CSE4701
Implementing JOIN
Hash Join
  Hash R and S Using the Same Hash Function
  If Hash File Can Be Memory-Resident, it is Efficient and Easy to Implement
  If Buffer Space is Insufficient, then Part of the Hash File has to be on Disk
    Various Optimizations for this Case
    Hybrid Hash Join is Described in the Book
Again - Biggest Problem is Overhead Associated with Maintaining Hash Index Over Time
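The memory-resident case can be sketched as a build phase followed by a probe phase. The function name `hash_join` and the sample `customers` and `deposits` tuples are hypothetical, and a Python dictionary stands in for the hash file. The sketch assumes the build relation fits in memory, so it does not cover the partitioned or hybrid variants.

```python
from collections import defaultdict

def hash_join(build, probe, key_build, key_probe):
    """Equijoin: hash the (smaller) build relation, then probe with the other one."""
    table = defaultdict(list)
    for t in build:                       # build phase
        table[t[key_build]].append(t)
    result = []
    for u in probe:                       # probe phase
        for t in table.get(u[key_probe], []):
            result.append(t + u)
    return result

customers = [("Bill", "Main St"), ("Ann", "Oak St")]
deposits = [
    ("BofA", 101, "Bill", 1500),
    ("Chase", 202, "Ann", 300),
    ("BofA", 303, "Bill", 50),
]
joined = hash_join(customers, deposits, 0, 2)
print(len(joined))  # 3
```

Each relation is read once, which is why this variant is attractive when the hash table fits in the buffer.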
Chapter 19-112
CSE4701
Access Using Indices: Estimation of Costs
Example: Given a bank database consisting of the following three relation schemas:
  Branch(bank-name, assets, bank-city)
  Deposit(bank-name, account-number, customer-name, balance)
  Customer(customer-name, street, zipcode, customer-city)
Consider the SQL query for the bank database:
  Select account-number
  From Deposit
  Where bank-name = “BofA” and customer-name = “Bill” and balance > 1000;
Chapter 19-113
CSE4701
Heuristic Optimization
Use Cascading of Selections Rule to Decompose; Three Logical Query Plan Alternatives Are Obtained
  [Figure: three query trees, each projecting Account-Number over a cascade of the selections bank-name = “BofA”, customer-name = “Bill”, and balance > 1000 applied to Deposit]
Objective - Choose the “Best” Alternative in Terms of Execution Time (Block Reads)
What should be the Focus in Select Order?
Chapter 19-114
CSE4701
Access Using Indices: Estimation of Costs
Assumptions:
  100 Different Banks (bank-name)
  1000 Customers (on average) per bank
  Balance could range from 0 to 10,000 dollars
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Select account-number
From Deposit
Where bank-name = “BofA” and customer-name = “Bill” and balance > 1000;
Chapter 19-115
CSE4701
Estimation of Cost of Access - Version 1
[Figure: query tree projecting Account-Number over the selections bank-name = “BofA”, customer-name = “Bill”, and balance > 1000 on Deposit]
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit? 100,000
What Does balance > 1000 do?
  Retrieve 90% of Accounts
  All Banks, All Customers
What Does customer-name = “Bill” do?
  All Customers Named Bill Regardless of the Bank
Is this a Good Strategy?
Chapter 19-116
CSE4701
Estimation of Cost of Access - Version 2
[Figure: query tree projecting Account-Number over the selections bank-name = “BofA”, customer-name = “Bill”, and balance > 1000 on Deposit]
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does bank-name = “BofA” do?
  Retrieves 1000 Tuples for BofA on Average
What Does customer-name = “Bill” do?
  The Customer “Bill”
What Does balance > 1000 do?
Is this a Good Strategy?
Chapter 19-117
CSE4701
Estimation of Cost of Access - Version 3
[Figure: query tree projecting Account-Number over the selections bank-name = “BofA”, customer-name = “Bill”, and balance > 1000 on Deposit]
Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)
Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does customer-name = “Bill” do?
  Retrieves 100 Tuples, One per Bank
What Does balance > 1000 do?
  Do they Have Enough Money?
What Does bank-name = “BofA” do?
Is this a Good Strategy?
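The three versions can be compared numerically with a short Python sketch. The selectivities below are derived from the stated assumptions (1 of 100 banks, 100 of 100,000 tuples for one customer name, 90% of balances above 1000). Treating the three conditions as independent, and measuring work as the number of tuples fed into each selection, are simplifying assumptions of this sketch, as are all the names used.

```python
from fractions import Fraction
from itertools import permutations

DEPOSIT_TUPLES = 100_000                       # 100 banks * 1000 customers per bank
SELECTIVITY = {
    "bank-name = 'BofA'": Fraction(1, 100),
    "customer-name = 'Bill'": Fraction(1, 1000),
    "balance > 1000": Fraction(9, 10),
}

def intermediate_sizes(order, n=DEPOSIT_TUPLES):
    """Estimated result size after each selection, applied in the given order."""
    sizes = []
    for predicate in order:
        n = n * SELECTIVITY[predicate]
        sizes.append(n)
    return sizes

def tuples_examined(order, n=DEPOSIT_TUPLES):
    """Tuples fed into the selections: the base relation plus all but the last result."""
    return n + sum(intermediate_sizes(order, n)[:-1])

version1 = ("balance > 1000", "customer-name = 'Bill'", "bank-name = 'BofA'")
version2 = ("bank-name = 'BofA'", "customer-name = 'Bill'", "balance > 1000")
version3 = ("customer-name = 'Bill'", "balance > 1000", "bank-name = 'BofA'")
for name, order in [("V1", version1), ("V2", version2), ("V3", version3)]:
    print(name, [int(x) for x in intermediate_sizes(order)[:2]])
# V1 [90000, 90]
# V2 [1000, 1]
# V3 [100, 90]

best = min(permutations(SELECTIVITY), key=tuples_examined)
```

Under these assumptions the most restrictive condition (customer-name) should be applied first, and Version 1, which starts with the least restrictive condition, carries the largest intermediate results.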
Chapter 19-118
CSE4701
Join Strategies
Several Factors Influence the Selection of an Optimal Strategy
  The Physical Order of Tuples in a Relation
  The Presence of Indices and the Type of Index (Clustering or Nonclustering)
  The Cost of Computing a Temporary Index for the Sole Purpose of Processing One Query
Example: Consider the natural join Deposit ⋈ Customer
  nDeposit = 10,000
  nCustomer = 200
  20 tuples fit in one block for both relations
  buffersize = 2 blocks
Chapter 19-119
CSE4701
Join Strategies: Block-Oriented Iteration
Block-oriented Iteration:
  Process the relations on a per-block basis rather than on a per-tuple basis
  Using this approach, a major saving in block accesses results
Example: Consider the natural join Deposit ⋈ Customer
  nDeposit = 10,000
  nCustomer = 200
  20 tuples fit in one block for both relations
Case 1: outer loop: Deposit, inner loop: Customer
  Reading Customer once for every block of Deposit tuples requires (200/20) * (10,000/20) = 10 * 500 = 5000 block reads
  Reading the Deposit relation requires 10,000/20 = 500 block reads
  The total cost in terms of block accesses is 5500
    -> 5000 block accesses to Customer and
    -> 500 block accesses to Deposit
Chapter 19-120
CSE4701
Join Strategies: Block-Oriented Iteration
Case 2: outer loop: Customer, inner loop: Deposit
  Reading Deposit once for every block of Customer tuples requires (10,000/20) * (200/20) = 500 * 10 = 5000 block reads
  Reading the Customer relation requires 200/20 = 10 block reads
  The total cost in terms of block accesses is 5010
    ==> 5000 accesses to Deposit blocks and
    ==> 10 accesses to Customer blocks
Case 3: If the Customer relation is small enough to fit in main memory, our strategy requires only
    ==> 500 block reads for the Deposit relation and
    ==> 10 block reads for the Customer relation
  The total comes to 510 blocks
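The arithmetic behind the three cases can be reproduced with a short Python sketch. The helper names `blocks` and `block_iteration_reads` are hypothetical. The loop charges one read per outer block and one full scan of the inner relation for each outer block, matching the 2-block buffer of the example.

```python
from math import ceil

def blocks(n_tuples, tuples_per_block):
    """Number of blocks needed to store a relation."""
    return ceil(n_tuples / tuples_per_block)

def block_iteration_reads(outer_blocks, inner_blocks, inner_fits_in_memory=False):
    """Block reads for block-oriented iteration with a 2-block buffer."""
    if inner_fits_in_memory:
        return inner_blocks + outer_blocks   # each relation is read exactly once
    reads = 0
    for _ in range(outer_blocks):
        reads += 1                           # read one outer block
        reads += inner_blocks                # scan the whole inner relation
    return reads

deposit = blocks(10_000, 20)    # 500 blocks
customer = blocks(200, 20)      # 10 blocks
case1 = block_iteration_reads(deposit, customer)
case2 = block_iteration_reads(customer, deposit)
case3 = block_iteration_reads(deposit, customer, inner_fits_in_memory=True)
print(case1, case2, case3)  # 5500 5010 510
```

The inner relation is scanned 5000 blocks' worth in both Case 1 and Case 2; the saving in Case 2 comes entirely from the smaller outer relation (10 reads instead of 500).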
Chapter 19-121
CSE4701
Query Execution Cost: Summary
Access Cost to Secondary Storage
  Search for Data Block (Index)
  Read/Write Index and Data Blocks
Storage Cost
  Index and Data Blocks
  Intermediate Files
Computation Cost
  Query Planning
  Record Search, Sort, Merge
  Actual Transaction/Query Operations
Communications Cost
  Data Transfer Across a Network
Chapter 19-122
CSE4701
Access Plan
An Access Plan is a Concrete Query Processing Plan which Presents a Detailed Strategy for Processing a Query
The Main Cost Factors to Be Considered Include
  The Relational Operations to be Performed
  Indices to be Used
  The Order in Which Tuples are to be Accessed
  The Order in Which Operations are to be Performed
Typical Focus is on Join and Optimizing its Execution, Particularly when Multiple Tables are Involved
Chapter 19-123
CSE4701
Statistics
The Following are Kept in the System Catalog for Optimization Purposes
  File Parameters: Block Size
  Number of Tuples in Each Relation
  Size of Tuples
  Key Fields, Indices
  Number of Levels in an Index
  Highest Key, Lowest Key
  Number of Distinct Values (Maybe)
  Others: Frequency of Operations, Join Keys, Etc.
All DBMSs Keep the First Four, Many Keep All
Chapter 19-124
CSE4701
Join Ordering
Given R ⋈ S ⋈ T ⋈ W, Determine the Best Ordering Alternative:
  ((R ⋈ S) ⋈ T) ⋈ W
  (R ⋈ (S ⋈ T)) ⋈ W
  R ⋈ (S ⋈ (T ⋈ W))
  ((R ⋈ T) ⋈ S) ⋈ W
  ((R ⋈ W) ⋈ S) ⋈ T
  …
  (R ⋈ S) ⋈ (T ⋈ W)
Ordering is Critical to Arrive at “Best” Strategy for Execution, Particularly as
  Number of Relations Increases
  Size of Relations (Tuples/Blocks) Increases
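How quickly the number of alternatives grows can be computed directly. The sketch below uses a standard combinatorial count that is not stated on the slide: with n relations there are n! left-deep orders, and (2(n-1))!/(n-1)! binary join trees when both tree shape and operand order are counted. The function names are hypothetical.

```python
from math import factorial

def left_deep_count(n):
    """Left-deep join orders over n relations: one per permutation."""
    return factorial(n)

def join_tree_count(n):
    """All binary join trees over n relations, counting operand order."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (2, 3, 4, 8):
    print(n, left_deep_count(n), join_tree_count(n))
# 2 2 2
# 3 6 12
# 4 24 120
# 8 40320 17297280
```

For the four relations R, S, T, W there are already 120 candidate trees, so the handful listed on the slide is only a sample.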
Chapter 19-125
CSE4701
Query Optimization Search Strategies
Exhaustive Search
  “Optimal”
  Combinatorial Complexity in the Number of Relations
Heuristics
  Not Optimal
  Group Common Sub-expressions
  Perform Selection, Projection First
  Replace a Join by a Series of Semi-joins
  Reorder Operations to Reduce Intermediate Relation Size
  Optimize Individual Operations
Chapter 19-126
CSE4701
Query Optimization Timing Issues
Static
  Compilation ==> Optimize Prior to the Execution
  Difficult to Estimate the Size of the Intermediate Results ==> Error Propagation
  Can Amortize Over Many Executions
Dynamic
  Run Time Optimization
  Exact Information on the Intermediate Relation Sizes
  Have to Reoptimize for Multiple Executions
Hybrid
  Compile Using a Static Algorithm
  If the Error in Estimate Sizes > Threshold, Reoptimize at Run Time
Chapter 19-127
CSE4701
Concluding Remarks
Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by Making a Heuristic Guess of Strategy for Each Query
  The Optimizer Considers Every Possible Strategy, but Stops Evaluating a Strategy as Soon as it Determines its Cost is Greater than that of the Pre-chosen Strategy
  Thus Only a Few Competing Strategies Require Full Analysis of the Cost
  The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
  For PL - Optimization is Pre-Execution (Compile)
  For DB - Optimization is Part of Execution (Run)
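The pruning idea in the remarks above (cost a candidate only until it exceeds the pre-chosen strategy) can be sketched as follows. Everything in this example is hypothetical: the function `choose_strategy`, the strategy names, and the per-step costs are made up for illustration, and real optimizers estimate costs from catalog statistics rather than from a fixed list.

```python
def choose_strategy(strategies, initial_guess):
    """strategies: name -> list of per-step cost estimates.

    Returns (best_name, best_cost, names_that_were_fully_costed).
    """
    best_name = initial_guess
    best_cost = sum(strategies[initial_guess])
    fully_costed = [initial_guess]
    for name, step_costs in strategies.items():
        if name == initial_guess:
            continue
        running = 0
        for cost in step_costs:
            running += cost
            if running >= best_cost:
                break                      # abandon: already no better than the best so far
        else:
            # All steps were costed without exceeding the best: new best strategy.
            best_name, best_cost = name, running
            fully_costed.append(name)
    return best_name, best_cost, fully_costed

strategies = {
    "A": [500, 5000],
    "B": [10, 5000],
    "C": [6000, 1],
    "D": [10, 500],
}
print(choose_strategy(strategies, "A"))  # ('D', 510, ['A', 'B', 'D'])
```

Strategy C is dropped after its first step because its partial cost already exceeds the best complete cost found so far, so it never receives a full cost analysis.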