
Chapter 19-1

CSE4701

Chapter 19 6e - 17 & 18 5: System Catalog and Query Optimization

Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department

The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT [email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818

A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech.

Other slides have been adapted from the AWL web site for the textbook. Remaining slides represent new material.

Chapter 19-2

CSE4701

Overview of Material
- Key Background Topics:
  - What are Typical Database Processing Actions?
  - Disk Drives and Disk Storage
  - Database Processing/Architectures
  - Motivating Query Optimization
  - Query Processing
- Chapter 17 - System Catalog
  - What is it?
  - How is it Used?
- Chapter 18 - Query Optimization in RDBMS
  - High-level Query Optimization (Algebraic)
  - Low-level Query Optimization (Cost-based)

Chapter 19-3

CSE4701

Typical Database Processing

[Figure: flow of a user transaction through the DBS. Boxes and their actions:]

Pre-Processing
- Parser/Lexical
- Optimizer/Views

High-Level Processing
- Enqueue Trans.
- Request Locks
- Release Locks
- Dequeue Trans.

Low-Level Processing
- Enqueue Trans.
- Request Locks
- Issue I/Os
- Process Returned Data
- Integrity Checks
- Security Checks
- Logging for Recovery
- Release Locks
- Dequeue Trans.

Post-Processing
- Collection of Results
- Aggregation Operations
- Security Checks

Supporting components: Concurrency Control, Disk I/O, Recovery

Labeled flows: User Transaction; Parsed and Optimized User Trans.; Lock Request; Response; I/O Request; Results; Errors; Response to User

Chapter 19-4

CSE4701

What are the Processing Issues for DBs?
- Database Applications of Today and Tomorrow
  - Require High Volumes of Information!
- Increase of Information Still Requires High Performance!
  - Throughput and Response Time
- Where's the Bottleneck in DBS?
  - CPU ??
  - Main Memory Size/Speed ??
  - Virtual Memory Limitations ??
  - Communications Bus ??
  - I/O Channel ??

Chapter 19-5

CSE4701

90-10 Rule for Database Processing
- Load (Transactions per Second) vs. Performance (Response Time of Transactions)
- Processing of Large Amounts of Raw Data
  - Addressed in Secondary Storage
  - Staged to Main Memory
- Identifying Relevant Data
  - Large Amounts of Raw Data Discarded
  - Focus on Data Most Likely to Contain Answers
  - Possible Loss of CPU and Main Memory Cycles
- This is Double Jeopardy!
  - Load of DBS Must be Reduced
  - Performance of DBS Degrades

Chapter 19-6

CSE4701

90-10 Rule for Conventional DBS

[Figure: layers of a conventional DBS: Application Programs, Operating System, Database Functions, On-Line I/O, Disk I/O]

- Only 10% of Raw Data is Relevant
- Only 10% of Relevant Data has Answers
- Note: Naive Approach to Database Searching Often Occurs (Little or No Indexing in Practice!)

Chapter 19-7

CSE4701

Randomly Accessed Storage Devices
- Popular Media (Hard Drives, CDs, DVDs, etc.)
- Access to Information in Any Order
- Sequential Access Not Typically Supported or Needed, Since “Files” Not Stored Sequentially
  - Recall, Disk Defragmentation on PC Platform
- Block-Oriented Utilization of Device
  - Block Access to Optimize Transfer
  - Block Size is Device/Controller Dependent
  - Linear/Non-Linear Byte Orders with Blocks
- Key Concepts …
  - Platter
  - Track
  - Sector
  - Cylinder
  - Read/Write Heads

Chapter 19-8

CSE4701

Rotating Storage

[Figure: top view of a surface showing tracks and sectors; side view showing platters, a cylinder, and the R/W heads]

Note: Parallel Read/Write Drives Activate All Heads Simultaneously

Chapter 19-9

CSE4701

Disk Drive Components

Chapter 19-10

CSE4701

Disk Characteristics and Access
- Transfer Time: Time to Copy Bits From Disk Surface to Primary Memory
- Disk Latency Time: Rotational Delay Waiting for Proper Sector to Rotate Under R/W Head
  - Rotate to Next Sector to Process Next Request
- Disk Seek Time: Delay While R/W Head Moves to the Destination Track/Cylinder
  - Move Head In/Out to Seek Next Track/Cylinder
- Access = Seek (In/Out) + Latency (Around) + Transfer (Bytes)
- For DBMS - Key is Moving Data To/From Disk ASAP w.r.t. Performance and Response Time
  - Improve on 90-10 via Processing/Optimization
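The relation Access = Seek + Latency + Transfer can be turned into a small calculation. The sketch below is illustrative only: the drive figures (9 ms average seek, 7200 RPM, 100 MB/s transfer rate, 4 KB block) are assumptions chosen for the example and do not come from these slides.

```python
def access_time_ms(seek_ms, rpm, transfer_mb_per_s, block_bytes):
    """Estimate the time to read one block: seek + rotational latency + transfer."""
    # On average the desired sector is half a revolution away.
    latency_ms = 0.5 * (60_000.0 / rpm)
    # Time to copy the block's bytes from the surface to memory.
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + latency_ms + transfer_ms

# Hypothetical drive parameters, used only for illustration.
t = access_time_ms(seek_ms=9.0, rpm=7200, transfer_mb_per_s=100.0, block_bytes=4096)
print(f"{t:.2f} ms per random block read")
```

With numbers like these, seek and rotational delay dominate the transfer itself, which is why a DBMS works to reduce the number of random block accesses.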

Chapter 19-11

CSE4701

Historical DB Architecture - Mainframe

Chapter 19-12

CSE4701

Client/Server DBS Architecture

Chapter 19-13

CSE4701

Mixed Architecture

Chapter 19-14

CSE4701

Three and Four Tier Architectures

From: http://java.sun.com/javaone/javaone98/sessions/T400/index.html

Chapter 19-15

CSE4701

What is MBDS?
- MBDS is a Multi-Process, Multi-Computer, Parallel Database System
- MBDS is Composed of …
  - Host for Issuing User Requests
  - Controller to Interact with Host (and User)
  - One or More Backend Database Processors
- Goals of MBDS
  - Suppose a Request Takes 4 Minutes with One Backend
  - Improve Response Time by Increasing Backends
    - Two Backends - Request Takes 2+ Minutes
    - Four Backends - Request Takes 1+ Minutes

Chapter 19-16

CSE4701

What is MBDS Architecture?

[Figure: Host/User, the Database Controller, and several Backend Database Processors]

- Database Blocks are Distributed Across All Backends
- Backend (BE) DB Processors are Replicated
- Database Controller Sends Same Query in Parallel to all BEs
- BEs Work in Parallel on Each Query and Communicate for Join
- Results are Sent to and Collected by the DB Controller - then to the User

Chapter 19-17

CSE4701

Approach Distributes Data Across Backends
- Suppose System has 10 Backends
- Consider a Number of Tables
  - Inventory
  - Customers
  - Employees
  - …
- What Happens if you Place One Table per Backend?
- What Happens if you Distribute … Table Across 10 Backends?

[Figure: Backend Database Processor 1, Backend Database Processor 2, ..., Backend Database Processor 10]

Chapter 19-18

CSE4701

What are MBDS Processes?

[Figure: MBDS processes and their message interfaces (Get Msg./Put Msg.): Request Preparation, Post Processing, Directory Management, Record Processing, Concurrency Control, Disk I/O; the Database Controller is labeled in the figure]

Chapter 19-19

CSE4701

What are MBDS Messages?

No.  Type                                  SRC    DST
1    New Request                           Host   ReqP
2    Results of Request                    PoPr   Host
3    Number of Reqs in Transaction         ReqP   PoPr
4    Aggregate Operators (Sum, etc.)       ReqP   PoPr
6    Parsed Request to Backends            ReqP   DM
12   Backend Aggregate Operator Results    RecP   PoPr
15   Ids for Accessing Database Indexes    DM     DMs
16   Request and Disk Addresses            DM     RecP
21   Ids for Accessing Database Records    DM     CC
22   Locks Obtained: Okay to Execute       CC     RecP
23   Request ID of Finished Request        RecP   CC

Chapter 19-20

CSE4701

Sample Processing of Retrieve Request

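[Figure: the MBDS process diagram (Request Preparation, Post Processing, Directory Management, Record Processing, Concurrency Control, Disk I/O, each with Get Msg./Put Msg.) annotated with the steps of a retrieve request. Each label is a step letter followed by a message number:]

A1, B3, C4, D6, E15 (To Backend(s)), F15 (From Other Backend), G21, H22, I16, J23, K12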

Chapter 19-21

CSE4701

What are Synchronization Issues in MBDS?
- Coordination of Synchronous Behavior …
- Within Controller and Backend to Allow
  - Multiple Active Requests within Each Process
  - Requests at Different Stages in Different Processes
- Between Controller and Backends to Allow
  - A Request to be Processed by All Backends
  - A Request to be Processed by One Backend
- Among Multiple Backends to Allow a Backend
  - to Synchronize its Work on one Request with Other Backends
  - to Forward Results to Another Backend

Chapter 19-22

CSE4701

Introduction to Query Processing

- Query optimization: The process of choosing a suitable execution strategy for processing a query.
- Two internal representations of a query:
  - Query Tree
  - Query Graph

Chapter 19-23

CSE4701

Introduction to Query Processing

Chapter 19-24

CSE4701

Translating SQL Queries into Relational Algebra

- Query block: The basic unit that can be translated into the algebraic operators and optimized.
- A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
- Nested queries within a query are identified as separate query blocks.
- Aggregate operators in SQL must be included in the extended algebra.

Chapter 19-25

CSE4701

Translating SQL Queries into Relational Algebra

SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
                 FROM EMPLOYEE
                 WHERE DNO = 5);

Inner block:

SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5

ℱ_{MAX SALARY}(σ_{DNO=5}(EMPLOYEE))

Outer block, where C is the result of the inner block:

SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > C

π_{LNAME, FNAME}(σ_{SALARY>C}(EMPLOYEE))
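The decomposition into two query blocks can be mimicked directly: evaluate the inner block to a constant C, then run the outer block against it. This is a minimal sketch over a hypothetical EMPLOYEE table; the rows are invented for illustration.

```python
# Hypothetical EMPLOYEE rows: (LNAME, FNAME, SALARY, DNO).
EMPLOYEE = [
    ("Smith", "John", 30000, 5),
    ("Wong", "Franklin", 40000, 5),
    ("Wallace", "Jennifer", 43000, 4),
    ("Borg", "James", 55000, 1),
]

# Inner block: C = MAX(SALARY) over the employees with DNO = 5.
c = max(salary for (_, _, salary, dno) in EMPLOYEE if dno == 5)

# Outer block: select on SALARY > C, then project LNAME, FNAME.
result = [(lname, fname) for (lname, fname, salary, _) in EMPLOYEE if salary > c]
print(c, result)
```

Because the inner block does not refer to the outer block, it needs to be evaluated only once.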

Chapter 19-26

CSE4701

Why is Query Optimization Needed?

- Data Volume in any Type of Join or Cartesian Product has the Potential to be Very Large!
- Consider R(A, B) = {r1, r2, ..., rn}
- Consider S(C, D) = {s1, s2, ..., sm}
- R x S = {r1 s1, r1 s2, r1 s3, r1 s4, …, r2 s1, r2 s2, r2 s3, r2 s4, … }, which contains n x m tuples!
- What is the Issue?
  - If n is 10,000 and m is 20,000 then the Cartesian Product has 200,000,000 Tuples
  - Join must Perform 200,000,000 Comparisons

Chapter 19-27

CSE4701


[Figure: primary memory holding K blocks: 1 block of R and K-1 blocks of S]

- n / m: Number of Tuples of R / S, Respectively
- b_R / b_S: Number of Tuples per Block of Memory for R / S
- Assume that K Blocks Fit into Primary Memory: 1 Block of R and K-1 Blocks of S
- n / b_R and m / b_S: Number of Blocks for R and S
- (m / b_S) / (K-1): Number of Times that the K-1 Block Memory Chunk is Filled by S
- (n / b_R) [(m / b_S) / (K-1)]: The Chunk is Filled this Often for Each Block of R
- Must also Read the Blocks of R, so Total Block Reads:

  (n / b_R) + (n / b_R) [(m / b_S) / (K-1)]

Chapter 19-28

CSE4701


[Figure: primary memory holding K blocks: 1 block of R and K-1 blocks of S]

Total Block Reads (the Blocks of R, plus the K-1 Block Memory Chunk Filled by S for Each Block of R):

  (n / b_R) + (n / b_R) [(m / b_S) / (K-1)]

If n = m = 10,000, b_R = b_S = 5, and K = 100:

  (10,000/5) + (10,000/5) [(10,000/5)/99] ≈ 42,400 Blocks to Read

At 20 Blocks/Second - 35 Minutes!
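The arithmetic on this slide can be checked with a few lines of code. The sketch below evaluates the slide's formula as written, without rounding the number of chunk fills up to a whole number; the function name and parameter names are hypothetical.

```python
def nested_loop_block_reads(n, m, b_r, b_s, k):
    """Block reads per the slide formula: (n/bR) + (n/bR) * [(m/bS) / (K-1)]."""
    blocks_r = n / b_r                  # number of blocks of R
    blocks_s = m / b_s                  # number of blocks of S
    chunk_fills = blocks_s / (k - 1)    # times the (K-1)-block area is filled by S
    return blocks_r + blocks_r * chunk_fills

reads = nested_loop_block_reads(n=10_000, m=10_000, b_r=5, b_s=5, k=100)
minutes = reads / 20 / 60               # 20 blocks per second, as on the slide
print(round(reads), round(minutes, 1))
```

The exact value is about 42,404 block reads, which the slide rounds to 42,400, and roughly 35 minutes at 20 blocks per second.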

Chapter 19-29

CSE4701

Observation
- Cartesian Product Yields Unwanted Data

SELECT R.A
FROM R, S
WHERE R.B = S.C AND S.D = 99

In Relational Algebra:

  π_A(σ_{B=C ∧ D=99}(R × S))
= π_A(σ_{B=C}(R × σ_{D=99}(S)))
= π_A(R ⋈_{B=C} σ_{D=99}(S))

- Has Performance Improved? How?

Chapter 19-30

CSE4701

Evaluation
- Cartesian Product for SELECT - 40,000 Blocks

SELECT R.A
FROM R, S
WHERE R.B = S.C AND S.D = 99

Relational Algebra with Equijoin:

π_A(R ⋈_{B=C} σ_{D=99}(S))

- The σ_{D=99}(S) Limits the Size of S Dramatically
- As a Result, the Equijoin of R and σ_{D=99}(S) Would Likely Reduce the Total Blocks Required to 4,000
- Thus, a “Smart” Query Execution Strategy Can Dramatically Reduce the Amount of I/Os
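A small experiment makes the saving visible. The sketch below assumes relations R(A, B) and S(C, D) with the selection D = 99, as in the algebra expressions; the data is synthetic, and counting intermediate pairs stands in for counting block reads.

```python
def naive(R, S):
    """Project A after selecting B=C and D=99 over the full product R x S."""
    pairs = [(r, s) for r in R for s in S]
    answer = [r[0] for (r, s) in pairs if r[1] == s[0] and s[1] == 99]
    return answer, len(pairs)

def pushed_down(R, S):
    """Select D=99 on S first, then join the reduced S with R on B=C."""
    small_s = [s for s in S if s[1] == 99]
    pairs = [(r, s) for r in R for s in small_s]
    answer = [r[0] for (r, s) in pairs if r[1] == s[0]]
    return answer, len(pairs)

# Hypothetical data: R holds (A, B) tuples and S holds (C, D) tuples.
R = [(a, a % 10) for a in range(100)]
S = [(c, 99 if c % 5 == 0 else 0) for c in range(50)]

slow, slow_pairs = naive(R, S)
fast, fast_pairs = pushed_down(R, S)
print(sorted(slow) == sorted(fast), slow_pairs, fast_pairs)
```

Both strategies return the same answer, but the second one builds one fifth as many intermediate pairs on this data, because only 10 of the 50 tuples of S survive the selection.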

Chapter 19-31

CSE4701

Query Optimization
- Goal: Limit the Costly Join Operation by Reducing the Data to be Scanned or that Participates in the Join
- Query Optimization is the Strategy to Achieve this Goal
- While Improving Selection and Projection can Help, the Main Objective is Join
  - In Worst Case - Cartesian Product
- Can Improve by Introducing Indices on the Join Attributes (R.B and S.C) to Limit “Product”
- Can Further Improve by Sorting on the Join Attributes (R.B and S.C)
  - This Reduces Block Accesses by Limiting the Number of Blocks that Must be Examined in a Join
  - If B’s Values Range from 0 to 100 and C’s from 50 to 150, only need to Compare from 50 to 100
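The range argument in the last bullet can be sketched as follows, assuming both join columns are already sorted. The helper names `overlap` and `window` are hypothetical, and the value lists are synthetic stand-ins for R.B and S.C.

```python
from bisect import bisect_left, bisect_right

def overlap(range_b, range_c):
    """Intersection of two closed value ranges, or None if they are disjoint."""
    lo, hi = max(range_b[0], range_c[0]), min(range_b[1], range_c[1])
    return (lo, hi) if lo <= hi else None

def window(sorted_values, lo, hi):
    """Slice of a sorted column whose values lie in [lo, hi]."""
    return sorted_values[bisect_left(sorted_values, lo):bisect_right(sorted_values, hi)]

# Hypothetical sorted join columns matching the ranges on the slide.
b_values = list(range(0, 101))      # R.B: 0..100
c_values = list(range(50, 151))     # S.C: 50..150

lo, hi = overlap((b_values[0], b_values[-1]), (c_values[0], c_values[-1]))
b_window = window(b_values, lo, hi)
c_window = window(c_values, lo, hi)
print((lo, hi), len(b_window), len(c_window))
```

Only the values between 50 and 100 on each side can possibly match, so the join never needs to touch the rest of either column.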

Chapter 19-32

CSE4701

Query Processing Internal Data Structure

- Memory Hierarchy
  - Main Memory + Secondary Memory
  - Information Must be Staged from Secondary to Primary Memory for Database Operation
- Sequential Search
  - Brute Force Approach
- Direct Access (Indexed Search)
  - Hash, Inverted Index File, Binary Search Tree, B-tree, B+-tree
  - Improves Selection by Focusing on the Subset of Tuples Involved in the Answer, and Equijoin by Not Having to Compare All Blocks in Two Relations

Chapter 19-33

CSE4701

Algorithms for Database Query Operators
- Largely Fall into Three Classes
  - Sorting-Based Methods
  - Hash-Based Methods
  - Index-Based Methods
- Such Algorithms are Divided into Three Degrees of Difficulty and Cost (Limiting Factor is Size of Data)
  - One-pass Algorithms: Data is Only Read Once From Disk
  - Two-pass Algorithms: Data is Read from Disk, Processed in Some Way, Written Back to Disk, Read Again for Processing, etc.
  - Multi-pass Algorithms: 3 or More Passes Are Required, i.e., Recursive Generalization of the Two-pass Algorithms

Chapter 19-34

CSE4701

[Figure: memory blocks numbered 1, 2, 3, ..., 1000]

Database Join and Sort are External
- Suppose that your DBS has 1,000 1K Blocks of Memory Available for Performing Operations (e.g., Select, Project, Join, Union, Aggregation, etc.)
- Suppose Sort R by R.B
  - R Contains 5000 Blocks
  - In order to Perform a Sort/Merge - You Must Use an External Algorithm, since all 5000 Blocks Cannot Fit Into Memory at the Same Time
- Suppose Join R (500 Blocks) and S (800 Blocks)
  - Again - their Total Exceeds Memory - Hence you Must Take an Approach that Compares One Block of R with All Blocks of S, etc. (Slides 27, 28)
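The sort case can be sketched as a two-pass external merge sort: sort memory-sized runs, then merge the runs. In this sketch Python lists stand in for disk files, and the block size of two records is an assumption made only to keep the example small.

```python
import heapq

def external_sort(blocks, memory_blocks):
    """Two-pass sort: sort memory-sized runs, then merge the sorted runs."""
    runs = []
    for start in range(0, len(blocks), memory_blocks):
        # Pass 1: load up to memory_blocks blocks and sort their records as one run.
        chunk = blocks[start:start + memory_blocks]
        runs.append(sorted(rec for block in chunk for rec in block))
    # Pass 2: k-way merge of the runs (a real system keeps one buffer per run).
    return list(heapq.merge(*runs)), len(runs)

# 5,000 blocks of R, but only 1,000 blocks of memory, as on the slide.
records = list(range(10_000, 0, -1))                            # keys in reverse order
blocks = [records[i:i + 2] for i in range(0, len(records), 2)]  # 2 records per block
ordered, run_count = external_sort(blocks, memory_blocks=1000)
print(run_count, ordered[:3], ordered[-3:])
```

With 5,000 blocks and room for 1,000, the first pass produces five sorted runs, and the second pass merges them while holding only a small part of each run in memory.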

Chapter 19-35

CSE4701

Database Join and Sort are External
- What’s True about Today’s DBMS Like Oracle?
- Oracle Recommends 2 Gigabytes of Primary Memory
- That 2 Gigabytes Must be Shared by:
  - Operating System
  - Other Applications Running on “Same” Server (Web Server, etc.)
  - Database Management Software
- Even if there were 1.5 Gigabytes Available, Modern DBs can Exceed that Size Very Easily
- Moreover, Cartesian Product Could Exceed Available Memory
- Join Could Require External Approach Since All Tables Involved in Join Can’t Fit in 1.5 Gigabytes
- External Sorting/Block-Oriented Processing is the Norm

Chapter 19-36

CSE4701

Algorithms for DB Query Operators
- Relational Algebra Operators can be Classified into Three Groups
- Tuple-at-a-time Unary Operators
  - Selection and Projection
  - No Need to Bring Entire Relation into Memory at One Time
- Full-Relation Unary Operators
  - Duplicate Elimination and Grouping
  - Require Seeing All or Most of the Tuples in Memory at Once
- Full-Relation Binary Operators
  - Set and Bag Versions of Union, Intersection, and Difference, Joins, and Cartesian Products
  - Require Seeing the Tuples of Both Relations in Memory

Chapter 19-37

CSE4701

Query Access

[Figure: DBMS components: Application Interfaces, Application Programs, Query, Database Schema, DML Preprocessor, Query Processor, DDL Preprocessor, Object Code of Apps, Database Manager, File Manager, System Catalog, Data Files, Disk Storage]

SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
  AND (WORKS.PNO = PROJ.PNO)
  AND (PROJ.PNAME = 'CAD/CAM')

Chapter 19-38

CSE4701

Database Access

[Figure: numbered steps 1-10 connecting User Program A (language, User Work Area (UWA)), the external schema used by user program A, the DBMS and its System Buffer, the Schema, the Physical/Internal Data Schema, the Operating System, and the Database]

SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
  AND (WORKS.PNO = PROJ.PNO)
  AND (PROJ.PNAME = 'CAD/CAM')

Chapter 19-39

CSE4701

Database Access
1. User program A sends the DBMS an invoke command to retrieve a record (or set of records).
2. The DBMS analyzes the external schema of user program A and finds the database description of the record.
3. The DBMS checks the schema to get the data types and location information of the record.
4. The DBMS checks the physical schema to find out which device the record is on and what access methods can be used.
5. According to step 4, the DBMS sends the OS a read command to execute the search.
6. The OS issues the page invoke command to the corresponding device, and then puts the fetched page into the system buffer.
7. The DBMS uses the schema and the external schema to infer the logical structure of the record being retrieved.
8. The DBMS places the relevant data in the UWA, and
9. provides the status information at the program invocation exit.

Chapter 19-40

CSE4701

The System Catalog
- Stores the Meta Information that Describes Each Database, Including a Description of
  - Conceptual Database Schema (Logical Data Model): Relations, Attributes, Keys, Indexes, Views
  - Internal Schema
  - External Schema
- Stores Information Needed by Specific DBMS Modules
  - Query Optimization Module
  - Security and Authorization

Chapter 19-41

CSE4701

Metadata - What is it?
- System metadata:
  - Where data came from
  - How data were changed
  - How data are stored
  - How data are mapped
  - Who owns data
  - Who can access data
  - Data usage history
  - Data usage statistics
- System metadata are critical in a DBMS
- Application metadata:
  - What data are available
  - Where data are located
  - What the data mean
  - How to access the data
  - Predefined reports
  - Predefined queries
  - How current the data are
- Application metadata are critical in a database system

Chapter 19-42

CSE4701

Metadata vs. Data
- Meta schema: describes all schemata that can be defined in the data model
- Data Dictionary Schema: contains copy of metaschema; schema for format definitions; schema for data about application data
- Data Dictionary Data: schema for application data; metadata about application data
- Data: raw formatted application data

[Figure: an example at each level. Labels: relations, access-rights, supplier; rel-name, att-name, dom-name; access-rights (user, relation, operation) with tuples (u1, supplier, insert), (u2, supplier, delete); supplier (s#, sname, location) with tuples (s1, smith, london), (s2, jones, boston)]

Chapter 19-43

CSE4701

Example of Catalog Information

Chapter 19-44

CSE4701

Relational DBMS Catalog All Metadata Stored as Relations Example of Metadata Tables are:

Chapter 19-45

CSE4701

EER Diagram for Relational Catalog

Chapter 19-46

CSE4701

Metadata in Oracle
- Complex Data Dictionary
  - All Schema Objects (Tables, Views, Indices, …)
  - User, All, and DBA Views

SELECT *
FROM ALL_CATALOG
WHERE OWNER = 'SMITH';

SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NUM_DISTINCT, LOW_VALUE, HIGH_VALUE
FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'ORDERS';

Chapter 19-47

CSE4701

Metadata in Oracle

SELECT PCT_FREE, INITIAL_EXTENT, NUM_ROWS, BLOCKS, EMPTY_BLOCKS, AVG_ROW_LENGTH
FROM USER_TABLES
WHERE TABLE_NAME = 'ORDERS';

SELECT INDEX_NAME, UNIQUENESS, BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY
FROM USER_INDEXES
WHERE TABLE_NAME = 'ORDERS';
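The idea that catalog metadata is itself stored in relations and queried with ordinary SQL is not specific to Oracle. As an illustration in a different system, the sketch below uses Python's built-in `sqlite3` module and SQLite's own catalog table `sqlite_master`; the `orders` table and its index are hypothetical and unrelated to the Oracle ORDERS example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# The catalog is itself a relation: sqlite_master lists tables and indexes.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE tbl_name = 'orders' ORDER BY name"
).fetchall()

# Column metadata for one table: each row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(orders)")]
print(objects)
print(columns)
conn.close()
```

SQLite's catalog is far simpler than Oracle's data dictionary views and does not expose optimizer statistics in this form, so this only illustrates the principle.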

Chapter 19-48

CSE4701

Uses of System Catalog

SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
  AND (WORKS.PNO = PROJ.PNO)
  AND (PROJ.PNAME = 'CAD/CAM')

- DDL Compilers: Correct Definition of Relations and Attributes
- DML (Query) Compiler:
  - DML Parser: Guided by the Description of DML Syntax and the Schema Information in the Catalog, Generates a Query Tree after Parsing
  - Optimizer: Generates Access Paths that are Relatively Optimal for Executing a Query/DML Command, by Accessing the Database Structure Information (Schemas) and Mapping High-level SQL Queries Into Low-level File Access Commands

Chapter 19-49

CSE4701

Revisit Typical Database Processing

[Figure: the same processing diagram as on slide Chapter 19-3: Pre-Processing, High-Level Processing, Low-Level Processing, and Post-Processing, supported by Concurrency Control, Disk I/O, and Recovery]

Chapter 19-50

CSE4701

Typical Database Processing
- Pre-Processing
  - Actions Taken Upon Receipt of a Query from User
    - SQL Query via Query Tool or JDBC Call
  - “Compilation” of DB Query
    - Check Syntax, Optimize, Develop Run-Time Strategy (Similar to PL Compilation)
  - Query is Translated to DB Transaction
    - A Transaction Contains Multiple DB Operations
    - Transaction has Explicit Order of Operations
  - Database Transaction Must Succeed or Fail
    - There is no Intermediate State: All or Nothing
    - Completely Executed and Committed, or Aborted at any Point and Undone
    - New State or Previous State of DB

Chapter 19-51

CSE4701

Typical Database Processing
- High-Level Processing
  - Enqueue Transaction from Pre-Processing
    - Transaction Must Wait for “Earlier” Transactions
    - Remember - Shared DB State!
  - Request Locks from Concurrency Control
    - All Locks Before Proceeding vs. Locks as Needed
    - Avoid Deadlock and Livelock
  - Release Locks
    - As Use of Data Completes, to Increase Availability
    - What Happens if a Later Step in the Transaction Fails?
  - Dequeue Transaction
    - Completes Transaction Processing
    - Return “Result” to Post-Processing

Chapter 19-52

CSE4701

What are Deadlock and Livelock?
- Deadlock
  - Query 1 Gets Access to Table A, Needs Table B
  - Query 2 Gets Access to Table B, Needs Table A
  - Query 1 Won’t Release A until it Gets B
  - Query 2 Won’t Release B until it Gets A
  - This is Deadlock!
- Livelock
  - Query 1 Gets A, Seeks B, Can’t, so Releases A
  - Query 2 Gets B, Seeks A, Can’t, so Releases B
  - Process Keeps Repeating
  - Can Lead to Starvation
  - Analogy: Two People Trying to Pass in a Narrow Hall
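One common way to detect the deadlock described above is to look for a cycle in a wait-for graph, where each transaction points to the transactions whose locks it is waiting on. The slides do not prescribe this technique; the sketch below is an assumption-based illustration, and the names `has_deadlock`, `Q1`, and `Q2` are hypothetical.

```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph (txn -> txns it waits on) has a cycle."""
    visiting, done = set(), set()

    def visit(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True          # back edge: a cycle of waiting transactions
        visiting.add(txn)
        cyclic = any(visit(nxt) for nxt in waits_for.get(txn, ()))
        visiting.discard(txn)
        done.add(txn)
        return cyclic

    return any(visit(txn) for txn in waits_for)

# Query 1 holds A and waits for B (held by Query 2); Query 2 holds B and waits for A.
deadlocked = has_deadlock({"Q1": ["Q2"], "Q2": ["Q1"]})
# Query 1 waits for Query 2, but Query 2 waits for nobody and will finish.
not_deadlocked = has_deadlock({"Q1": ["Q2"], "Q2": []})
print(deadlocked, not_deadlocked)
```

Livelock does not show up as a cycle in this graph, because each query releases its lock before retrying; it has to be handled by other means, such as retry limits or priorities.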

Chapter 19-53

CSE4701

Typical Database Processing
- Low-Level Processing
  - Enqueue Transaction - Do Actual DB Operations
  - Request Locks - Lower Granularity Level
  - Issue I/Os - Based on Operations to Access “Correct” and “Relevant” DB Records
  - Process Returned Data - Aggregation, Sorting
  - Integrity Checks: Do I/D/U Satisfy Constraints?
  - Security Checks: Is DB R/I/D/U Allowed?
  - Logging for Recovery - Commit the Transaction
  - Release Locks - Available to Others
  - Dequeue Transaction - Return Results to High-Level Processing
- Note: The Multiple Operations of Each DB Transaction All Must be Successful

Chapter 19-54

CSE4701

Typical Database Processing
- Post Processing
  - Collection of Results
    - May be Passed Portions of Results as they Complete
    - For Example, Sorted Blocks of Data that are then Merged in a Final Step
  - Aggregation Operations
    - May be Passed Aggregate Intermediate Results
    - Sum for Different Departments to be Totaled
  - Security Checks
    - Last-Step Filtering to Ensure Only Allowed Data is Returned
    - May Execute Query but Only See Aggregate Result
  - Send Results to User

Chapter 19-55

CSE4701

Typical Database Processing
- Concurrency Control
  - Control Access to Information: Data and Metadata
  - Prevent Simultaneous Updates
  - Ensure Database is Always Correct and Consistent
  - Serial Schedule vs. Serializable Transaction
  - Two Types
    - Pessimistic - Locking-Based - Assume Collisions Will Occur - e.g., Peoplesoft Course Registration
    - Optimistic - Time-Based - Fix Problems After the Fact - e.g., ATM Machines
  - CC Manages Locks at Different Granularity Levels (Table, Attribute, View, Tuple, Metadata, etc.)

Chapter 19-56

CSE4701

Typical Database Processing
- Disk I/O
  - Performs the Actual Disk I/O for Reads/Writes
  - Block-Oriented Activity
  - Maintain Queue of All I/O Requests
    - Ordering is Critical
    - Related to Concurrency Control and Consistency
  - Single DB Transactions can have Multiple DB Operations with Multiple Disk I/Os
  - Disk I/Os for Different Operations at Different Times
  - High and Low Level Processing will Determine What Operations are Needed When
  - Disk I/O - Relatively “Dumb”

Chapter 19-57

CSE4701

Typical Database Processing
- Recovery
  - Tightly Tied to DB Transaction Concept
  - Transactions Must be:
    - Atomic - Happens or Doesn’t
    - Durable - Once Committed, Results Survive Failure
    - Consistent - Follows Protocol/Correct DB State
  - When Failure Occurs, Can we:
    - Recover to a Correct “Earlier” State
    - Reconcile all “Active” Transactions that were Executing at Failure Time
  - Involves Logging of Database Actions
  - Objective: High Availability and Reliability

Chapter 19-58

CSE4701

Query Optimization
- Not Really Optimizing, but Planning to Avoid Bad Execution Strategies
- Models
  - Heuristics-Based
    - Apply Transformation Rules According to a General Strategy
    - Focus on Relational Algebra that Underlies Each Query
    - Improve the “Order” of Relational Operations
  - Cost-Based
    - Minimize a Cost Function: I/O Cost + CPU Cost
    - Subject to a Set of Constraints

Chapter 19-59

CSE4701

Query Processing Methodology

[Figure: High-level Calculus-based Query -> Query Preprocessing -> Algebraic Query (a tree structure) -> Query Optimization -> Execution Schedule (file access plan); the stages draw on the EXTERNAL SCHEMA, LOGICAL SCHEMA, and INTERNAL SCHEMA]

Chapter 19-60

CSE4701

Query Preprocessing
- Input: Calculus Query on Base Relations
- Normalization
  - Manipulate Query Quantifiers and Qualification
- Analysis
  - Detect and Reject Incorrect Queries
  - Possible for Only a Subset of Relational Calculus
- Simplification
  - Eliminate Redundant Predicates
- Restructuring
  - Calculus Query → Algebraic Query
  - More Than One Translation is Possible
  - Use Transformation Rules

Chapter 19-61

CSE4701

Normalization
- Lexical and Syntactic Analysis (Similar to Compilers)
  - Check Validity
  - Check for Attributes and Relations
  - Type Checking on the Qualification
- Put into Normal Form
  - Conjunctive Normal Form: (p11 ∨ p12 ∨ … ∨ p1n) ∧ … ∧ (pm1 ∨ pm2 ∨ … ∨ pmn)
  - Disjunctive Normal Form: (p11 ∧ p12 ∧ … ∧ p1n) ∨ … ∨ (pm1 ∧ pm2 ∧ … ∧ pmn)
  - OR's Mapped into Union
  - AND's Mapped into Join or Selection

Chapter 19-62

CSE4701

Refute Incorrect Queries
- Example: E(ENAME, ENO), P(JNO, JNAME), W(ENO, PNO, DUR)

SELECT ENAME, PNAME
FROM E, P, W
WHERE DUR > 27 AND DUR < 25

- Incorrect
  - Disjoint Components are Useless
  - Multiple Relations with Missing Joins may not be Incorrect, but may Indicate a Cartesian Product
- Contradictory
  - Qualification cannot be Satisfied by any Tuple
  - DUR > 27 AND DUR < 25

Chapter 19-63

CSE4701

Simplification
- Why Simplify?
  - The Simpler the Query, the Less Work there is and the Better the Performance
- How? Use Transformation Rules
  - Elimination of Redundancy
    - Idempotency Rules
      - p1 ∧ ¬(p1) = false
      - ¬(p1 ∧ p2) = ¬(p1) ∨ ¬(p2)
      - p1 ∨ false = p1
      - …
  - Application of Transitivity
  - Use of Integrity Rules
- Example: x > a and x > b
  - DUR > 27 AND DUR > 25 Simplifies to DUR > 27
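Both the redundant case (DUR > 27 AND DUR > 25) and the contradictory case (DUR > 27 AND DUR < 25) reduce to comparing bounds on a single attribute. The following is a minimal sketch that handles only conjunctions of strict `>` and `<` predicates; the function name and the tuple encoding of predicates are hypothetical.

```python
def simplify_bounds(predicates):
    """Combine conjunctive predicates of the form (op, value) on one attribute.

    Returns ("false",) if no value can satisfy them; otherwise the tightest
    (lower, upper) strict bounds, where None means unbounded on that side.
    """
    lower = max((v for op, v in predicates if op == ">"), default=None)
    upper = min((v for op, v in predicates if op == "<"), default=None)
    if lower is not None and upper is not None and lower >= upper:
        return ("false",)
    return (lower, upper)

redundant = simplify_bounds([(">", 27), (">", 25)])      # keeps only DUR > 27
contradictory = simplify_bounds([(">", 27), ("<", 25)])  # cannot be satisfied
print(redundant, contradictory)
```

A query whose qualification simplifies to false can be rejected or answered with an empty result without touching the database.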

Chapter 19-64

CSE4701

Restructuring
- Convert Relational Calculus to Relational Algebra
- Make Use of Query Trees
- Example: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.

SELECT ENAME
FROM E, W, P
WHERE E.ENO = W.ENO
  AND W.JNO = P.JNO
  AND E.ENAME <> "J. Doe"
  AND P.JNAME = "CAD/CAM"
  AND (W.DUR = 12 OR W.DUR = 24)

[Figure: query tree, root first]

Project: π_ENAME
Select:    σ_{(DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME≠"J. Doe"}
Join:        ⋈_JNO
               ⋈_ENO
                 W
                 E
               P

Chapter 19-65

CSE4701

Query Optimization
- Objectives
  - Improving Performance
  - Arriving at a Query Plan of Execution
- Analyzing the Relational Algebra Query
  - Replace Costly Operations
  - Do Selections and Projections Early
- Optimization Heuristics for the Relational Algebra
  - Performing Selection and Projection Before Join
  - Combining Several Selections Over a Single Relation Into One Selection
  - Find Common Subexpressions
- Algebraic Rewriting/Transformation Rules
  - General Transformation Rules for Relational Algebra (Equivalence-preserving Algebraic Rewriting Rules)

Chapter 19-66

CSE4701

Query Optimization: An Example

Why is it important?

SELECT ENAME
FROM E, W
WHERE E.ENO = W.ENO AND W.RESP = "Manager"

Strategy 1: π_ENAME(σ_{RESP="Manager" ∧ E.ENO=W.ENO}(E × W))

Strategy 2: π_ENAME(E ⋈_ENO σ_{RESP="Manager"}(W))

Chapter 19-67

CSE4701

Cost of Alternatives

Assume:
- card(E) = 4,000; card(W) = 10,000
- 10% of tuples in W satisfy RESP="Manager" (selection generates 1,000 tuples)
- Execution Time is Proportional to the Sum of the Cardinalities of the Temporary Relations
- Searching is Done by Sequential Scanning

Strategy 1                         Strategy 2
Cartesian prod.  = 40,000,000      Selection over W  =    10,000
Search over all  = 40,000,000      Join (4000*1000)  = 4,000,000
Total            = 80,000,000      Total             = 4,010,000
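The two totals follow directly from the stated assumptions, as this short calculation shows. The variable names are hypothetical; the cost model (sum of the cardinalities scanned or produced) is the one stated on the slide.

```python
card_e, card_w = 4_000, 10_000
selectivity_pct = 10                      # percent of W with RESP = "Manager"

# Strategy 1: build the Cartesian product, then scan all of it to apply the selection.
product_size = card_e * card_w
strategy_1 = product_size + product_size

# Strategy 2: scan W for the selection, then join E with the selected tuples.
selected = card_w * selectivity_pct // 100
strategy_2 = card_w + card_e * selected

print(strategy_1, strategy_2, strategy_1 / strategy_2)
```

Under this cost model, Strategy 2 is roughly twenty times cheaper than Strategy 1.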

Chapter 19-68

CSE4701

General Query Optimization Strategy
- Perform Selections Early
  - Yields Smaller Intermediate Results
  - Direct Impact on Subsequent Join/Cartesian Prod.
- Combine Selections with a Prior Cartesian Product into a Theta or Equi Join
  - Join is a Cheaper Operation
- Combine (Cascade) Selections and Projections

  AB(B (R)) AB(R)

  σ_p1(σ_p2(R)) ≡ σ_{p1 ∧ p2}(R)

  - This Results in One Pass Instead of Two over the Table

Chapter 19-69

CSE4701

General Query Optimization Strategy
- Identify Common Subexpressions
  - Compute Once and Store
  - Use Stored Version for Subsequent Times
  - Often Useful When Views are Employed
- Preprocess Data via Sorts and Indexes
  - Speeds up Searches and Joins by Limiting Scope
- Evaluate and Assess Different Options
  - For Cartesian Product, Use Smaller Relation for Comparison
  - Use System Catalog (Meta-data) to Effect Order in Query Execution Plan

Chapter 19-70

CSE4701

Relational Algebra Transformations

1. Cascade of Selection
   σ_{p1 ∧ p2 ∧ … ∧ pn}(R) ≡ σ_p1(σ_p2(...(σ_pn(R))...))
2. Commutativity of Selection
   σ_p1(σ_p2(R)) ≡ σ_p2(σ_p1(R))
   σ_{p1 ∨ p2}(R) ≡ σ_p1(R) ∪ σ_p2(R)
3. Cascade of Projection
   π_A1(π_A2(...(π_An(R))...)) ≡ π_A1(R) if A1 ⊆ A2 ⊆ ... ⊆ An
4. Commuting Selection with Projection (p involves only A1, ..., An)
   π_{A1,A2,...,An}(σ_p(R)) ≡ σ_p(π_{A1,A2,...,An}(R))

Chapter 19-71

CSE4701


5. Commutativity of Theta Join and Cartesian Product
   R ⋈_A S ≡ S ⋈_A R
   R × S ≡ S × R
6. Commuting Selection with Theta Join (Cartesian)
   σ_{p(A)}(R × S) ≡ (σ_{p(A)}(R)) × S
     (A defined on R only)
   σ_{p(A) ∧ p(B)}(R × S) ≡ (σ_{p(A)}(R)) × (σ_{p(B)}(S))
     (A defined on R, B defined on S)
   Also Holds for Theta Join
7. Commuting Projection with Theta Join (Cartesian)
   π_C(R × S) ≡ π_A(R) × π_B(S), where A ∪ B = C
   A are the Attributes in C for R and B are the Attributes in C for S

Chapter 19-72

CSE4701


8. Commutativity of Set Operations
   R ∪ S ≡ S ∪ R
   R ∩ S ≡ S ∩ R
9. Associativity of Set Operations
   (R × S) × T ≡ R × (S × T)
   (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
   (R ∪ S) ∪ T ≡ R ∪ (S ∪ T)
   (R ∩ S) ∩ T ≡ R ∩ (S ∩ T)
10. Commuting Select with Set Operations
   σ_{p(Ai)}(R ∪ T) ≡ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(T)
     where Ai is defined on both R and T
   σ_{p(Ai)}(R ∩ T) ≡ σ_{p(Ai)}(R) ∩ σ_{p(Ai)}(T)
     where Ai is defined on both R and T

Chapter 19-73

CSE4701

11. Commuting Projection with Union
   π_C(R ⋈_{q(Aj,Bk)} S) ≡ π_A'(R) ⋈_{q(Aj,Bk)} π_B'(S)
   π_C(R ∪ S) ≡ π_A'(R) ∪ π_B'(S)
     where R[A] and S[B], C = A' ∪ B', A' ⊆ A, B' ⊆ B
12. Converting Selection/Cartesian Into Theta Join
   σ_C(R × S) ≡ R ⋈_C S
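Equivalence rules like these can be spot-checked on small relations. The sketch below implements selection, projection, and Cartesian product over lists of dictionaries and checks Rules 1, 2, 3, and 6 on synthetic data. It is a sanity check under set semantics, not a proof, and all helper names are hypothetical.

```python
def select(pred, rel):
    return [t for t in rel if pred(t)]

def project(attrs, rel):
    """Projection with duplicate elimination (set semantics)."""
    seen, out = set(), []
    for t in rel:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

def product(r, s):
    return [{**a, **b} for a in r for b in s]

def same(x, y):
    """Compare two relations while ignoring tuple order."""
    key = lambda t: sorted(t.items())
    return sorted(x, key=key) == sorted(y, key=key)

R = [{"A": a, "B": a % 3} for a in range(6)]
S = [{"C": c, "D": c * 10} for c in range(3)]
p1 = lambda t: t["A"] > 1          # mentions only attributes of R
p2 = lambda t: t["B"] == 0

# Rule 1: a conjunctive selection equals a cascade of selections.
rule_1 = same(select(lambda t: p1(t) and p2(t), R), select(p1, select(p2, R)))
# Rule 2: selections commute.
rule_2 = same(select(p1, select(p2, R)), select(p2, select(p1, R)))
# Rule 3: in a cascade of projections only the outermost one matters ({A} within {A, B}).
rule_3 = same(project(["A"], project(["A", "B"], R)), project(["A"], R))
# Rule 6: a selection on R's attributes can be applied before the product.
rule_6 = same(select(p1, product(R, S)), product(select(p1, R), S))
print(rule_1, rule_2, rule_3, rule_6)
```

Rule 6 is the one that matters most for cost: the right-hand side builds a product over four tuples of R instead of six.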

Chapter 19-74

CSE4701

Using Heuristics in Query Optimization

- Process for heuristic optimization
  1. The parser of a high-level query generates an initial internal representation.
  2. Heuristic rules are applied to optimize the internal representation.
  3. A query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.
- The main heuristic is to apply first the operations that reduce the size of intermediate results.
  - E.g., apply SELECT and PROJECT operations before applying the JOIN or other operations.

Chapter 19-75

CSE4701

Using Heuristics in Query Optimization (2)
- Query tree:
  - A tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.
  - An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
- Query graph:
  - A graph data structure that corresponds to a relational calculus expression. It does not indicate an order in which to perform operations. There is only a single graph corresponding to each query.
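The description of executing a query tree maps naturally onto a recursive evaluator: leaves return their relation, and each internal node runs once its operands are available. This is a minimal sketch with only three operations; the `Node` class, the operation names, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """Leaf nodes hold a relation; internal nodes hold an operation over children."""
    op: str
    children: List["Node"] = field(default_factory=list)
    relation: Optional[list] = None
    arg: Optional[Callable] = None

def execute(node):
    """Evaluate the operands first, then produce this node's result relation."""
    if node.op == "relation":
        return node.relation
    inputs = [execute(child) for child in node.children]
    if node.op == "select":
        return [t for t in inputs[0] if node.arg(t)]
    if node.op == "project":
        return [node.arg(t) for t in inputs[0]]
    if node.op == "join":
        return [{**l, **r} for l in inputs[0] for r in inputs[1] if node.arg(l, r)]
    raise ValueError(f"unknown operation: {node.op}")

# Hypothetical data for E(ENO, ENAME) and W(ENO, RESP).
E = Node("relation", relation=[{"ENO": 1, "ENAME": "Ann"}, {"ENO": 2, "ENAME": "Bob"}])
W = Node("relation", relation=[{"ENO": 1, "RESP": "Manager"}, {"ENO": 2, "RESP": "Analyst"}])

# Project ENAME over the join on ENO of E with the managers selected from W.
tree = Node("project", arg=lambda t: t["ENAME"], children=[
    Node("join", arg=lambda e, w: e["ENO"] == w["ENO"], children=[
        E,
        Node("select", arg=lambda t: t["RESP"] == "Manager", children=[W]),
    ]),
])
names = execute(tree)
print(names)
```

Heuristic optimization rearranges a tree like this one, for example by moving `select` nodes closer to the leaves, without changing the result of `execute`.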

Chapter 19-76

CSE4701

Using Heuristics in Query Optimization

- Heuristic Optimization of Query Trees:
  - The same query could correspond to many different relational algebra expressions, and hence many different query trees.
  - The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.
- Example:

Q: SELECT LNAME
   FROM EMPLOYEE, WORKS_ON, PROJECT
   WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31';

Chapter 19-77

CSE4701

Heuristics Algebraic Optimization Concepts
- Using Cascade of Selections Rule, Break up Any Selections With Conjunctive Conditions Into a Cascade of Selections
  - Allows More Freedom in Moving Selections Down Different Branches of the Tree
- Using Commutativity of Selections with Other Operations Rules, Move Each Selection Down the Query Tree as Far as Possible
- If Possible, Combine a Cartesian Product With a Selection Into a Join

Chapter 19-78

CSE4701

Heuristics Algebraic Optimization Concepts
- Using Associativity of Binary Operations, Rearrange the Leaf Nodes So That the Most Restrictive Selections Are Executed First
  - The Fewer Tuples the Resulting Relation Contains, the More Restrictive the Selection
  - Reducing the Size of Intermediate Results Improves Performance
- Using Cascade of Projections and Commutativity of Projections with Other Operations, Move Projections Down the Query Tree as Far as Possible
- Identify Subtrees that Represent Groups of Operations that can be Executed by a Single Algorithm

Heuristic Algebraic Optimization Algorithm

Use Rule 1 to Break up Selects with Conjunctions into a Cascade to Move them Down the Query Tree
Use Rules 2, 4, 6, and 10 to Commute Select with Project, Join, Cart. Prod., Union, and Intersection
Use Rule 5 (Commute) and 9 (Associative) to Rearrange the Leaf Nodes of the Query Tree to:
  Execute the Most Restrictive Select First
  Avoid Cartesian Product in Leaf Nodes
Use Rule 12 to Convert a Select/Cart. Prod. to a Join
Use Rules 3, 4, 7, and 11 to Cascade and Commute Project, Pushing it Down the Tree as Far as Possible
Identify Subtrees that Can Execute as Independent Algorithms (Set of Operations)
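To see why moving a selection down the tree pays off, the sketch below evaluates one query two ways: selecting after a Cartesian product, and selecting on the base relation first. It is only an illustration. The helper names (`cartesian`, `select`) and the tiny E and W relations are hypothetical, and the attribute names are prefixed so the two relations do not clash.

```python
from itertools import product

def cartesian(r, s):
    """Cartesian product of two relations stored as lists of dicts."""
    return [{**a, **b} for a, b in product(r, s)]

def select(relation, predicate):
    return [t for t in relation if predicate(t)]

# Hypothetical sample data: 4 employees, 3 work assignments.
E = [{"E_ENO": i, "ENAME": name}
     for i, name in enumerate(["J. Doe", "A. Lee", "M. Smith", "K. Jones"], start=1)]
W = [{"W_ENO": 1, "JNO": 7, "DUR": 12},
     {"W_ENO": 2, "JNO": 7, "DUR": 24},
     {"W_ENO": 1, "JNO": 9, "DUR": 36}]

# Plan A: product first, then one conjunctive selection on top.
late_intermediate = cartesian(E, W)
late = select(late_intermediate,
              lambda t: t["ENAME"] == "J. Doe" and t["E_ENO"] == t["W_ENO"])

# Plan B: cascade the selection and push ENAME = "J. Doe" down to E.
early_intermediate = cartesian(select(E, lambda t: t["ENAME"] == "J. Doe"), W)
early = select(early_intermediate, lambda t: t["E_ENO"] == t["W_ENO"])

print(len(late_intermediate), len(early_intermediate))
```

Both plans return the same tuples, but Plan B's intermediate relation holds 3 tuples instead of 12. Replacing the remaining selection over the product with an equijoin is the step that Rule 12 describes.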

Heuristic Optimization: Example

Relations: E(ENAME, ENO), P(JNO, JNAME), W(ENO, PNO, DUR)

[Figure: canonical query tree at the end of the query preprocessing phase. Root: projection on ENAME. Below it: a single selection (DUR=12 OR DUR=24) AND JNAME="CAD/CAM" AND ENAME="J. DOE". Below that: joins on JNO and ENO over the leaves P, W, and E.]

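Heuristic Optimization: Example

Use the cascading of selections rule to decompose selections

[Figure: query tree with the projection on ENAME at the root, followed by three separate selections (DUR=12 OR DUR=24; JNAME="CAD/CAM"; ENAME="J. DOE") above the joins on JNO and ENO over P, W, and E.]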

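Heuristic Optimization: Example

Push selection down using commutativity of selection over join

[Figure: query tree in which the selection ENAME="J. Doe" now sits directly above E. The selections DUR=12 OR DUR=24 and JNAME="CAD/CAM" remain above the joins on JNO and ENO; the projection on ENAME is at the root.]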

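Heuristic Optimization: Example

Push selection down using commutativity of selection over join

[Figure: query tree in which the selection JNAME="CAD/CAM" now sits directly above P and ENAME="J. Doe" directly above E. The selection DUR=12 OR DUR=24 remains above the joins on JNO and ENO; the projection on ENAME is at the root.]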

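Push selection down

[Figure: query tree in which every selection sits directly above its base relation: ENAME="J. Doe" above E, JNAME="CAD/CAM" above P, and DUR=12 OR DUR=24 above W. The joins on JNO and ENO and the projection on ENAME are above them.]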

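Do early projection

[Figure: the query tree from the previous step with projections added below the joins. Projection lists shown in the figure: JNO; JNO,ENO; JNO,ENAME; and ENAME at the root. Selections: ENAME="J. Doe" on E, JNAME="CAD/CAM" on P, DUR=12 OR DUR=24 on W. Joins on JNO and ENO.]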

Identify subtrees that can be implemented in one algorithm

[Figure: the same query tree as in the early projection step, with projection lists JNO; JNO,ENO; JNO,ENAME; ENAME, selections on E, P, and W, and joins on JNO and ENO. Subtrees that one algorithm can execute are marked.]

Heuristic Optimization: A Second Example

BOOKS(Title, Author, Pname, LC_No)
PUBLISHERS(Pname, Paddr, Pcity)
BORROWERS(Name, Addr, City, Card_No)
LOANS(Card_No, LC_No, Date)

Let XLOANS = $\pi_S(\sigma_F(\text{Loans} \times \text{Borrowers} \times \text{Books}))$ where:
S = {Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date} and
F = {Borrower.Card_No = Loans.Card_No $\wedge$ Books.LC_No = Loans.LC_No}

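XLOANS

[Figure: query tree for XLOANS. Root: projection on Title, Author, Pname, LC_No, Name, Addr, City, Card_No, Date. Below it: selection Borrower.Card_No = Loans.Card_No ^ Books.LC_No = Loans.LC_No. Below that: two Cartesian products over the leaves Loans, Borrower, and Books.]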

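Query = $\pi_{\text{Title}}(\sigma_{\text{Date 1/1/88}}(\text{XLOANS}))$

[Figure: query tree with the projection on Title at the root, the selection on Date 1/1/88 below it, and the XLOANS tree (projection, selection, and two Cartesian products over Loans, Borrower, and Books) beneath.]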

Try to Cascade

[Figure: the query tree over Loans, Borrower, and Books with the projection on Title at the root; the selection on Date 1/1/88 is labeled at two positions in the tree.]

Commute Select and Project

[Figure: the query tree over Loans, Borrower, and Books after the selection on Date 1/1/88 is commuted with a projection; the projection on Title is at the root.]

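Commute Select and Select

[Figure: the query tree over Loans, Borrower, and Books after the selection on Date 1/1/88 is commuted with the other selection; the projection on Title is at the root.]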

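Commute Select and Cartesian Product Two Levels Down

[Figure: the query tree after the selection on Date 1/1/88 is moved down past both Cartesian products toward Loans; the projection on Title is at the root, with Borrower and Books as the other leaves.]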

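Try to Cascade

[Figure: the query tree with the condition Books.LC_No = Loans.LC_No split out as its own selection; the selection on Date 1/1/88 is near Loans, the projection on Title is at the root, and the Cartesian products over Loans, Borrower, and Books are unchanged.]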

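Commute Select and Cartesian Product One Level Down

[Figure: the query tree with the selection Borrower.Card_No = Loans.Card_No moved one level down, directly above the Cartesian product of Loans and Borrower. The selection Books.LC_No = Loans.LC_No stays above the product with Books; the selection on Date 1/1/88 is near Loans and the projection on Title is at the root.]

What's Next?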

Combine Projections

[Figure: the query tree over Loans, Borrower, and Books with a single projection on Title at the root and the selection on Date 1/1/88 near Loans.]

What is Still a Problem?
We are Not Projecting, so All Attributes are Still Collected Until the Final Project!

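Add Strategic Projections to Send Only the Minimum Up the Tree as Needed for Join/Result Set

[Figure: the query tree over Loans, Borrower, and Books with added projections. Projection lists shown in the figure: Loans.LC_No, Loans.Card_No; Loans.LC_No; Borr.Card_No; Books.LC_No, Title; and Title at the root. The selection on Date 1/1/88 is near Loans.]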

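What is the Final Step? Combine Select and Cartesian Product

Result: Equijoins!

[Figure: the query tree from the previous step, with projection lists Loans.LC_No, Loans.Card_No; Loans.LC_No; Borr.Card_No; Books.LC_No, Title; and Title at the root, and the selection on Date 1/1/88 near Loans. Each selection over a Cartesian product is to be replaced by an equijoin.]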

Heuristic Query Optimization: Summary

First Apply Operations that Reduce the Size of Intermediate Results
Move Selections and Projections Down the Tree as Far as Possible
  Early Selections Reduce the Number of Tuples
  Early Projections Reduce the Number of Attributes
The Most Restrictive Selection and Join Operations Should be Executed Before Other Similar Operations
  This is Accomplished by Reordering the Leaf Nodes of the Tree Among Themselves and Adjusting the Rest of the Tree Appropriately

Cost-Based Optimization

Reduce the Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
  Access Cost to Secondary Storage
    Search for Data Block (Index)
    Read/Write Index and Data Blocks
  Storage Cost
    Index and Data Blocks
    Intermediate Files
  Computation Cost
    Query Planning - Optimization Effort
    Record Search, Sort, Merge
    Actual Transaction/Query Operations
  Communications Cost
    Transfer of Results to the User

Complexity of Relational Operations

Assuming Relations of Cardinality n and a Sequential Scan of Data in each Relation, the Complexity of Each Operation is Indicated:

Operation                                        Complexity
Select, Project (w/o duplicate elimination)      O(n)
Project (with duplicate elimination), Group      O(n log n)
Join, Division, Set Operators                    O(n log n)
Cartesian Product                                O(n^2)

Avoid Cartesian Product at All Costs!

Cost-Based Optimization

To Understand Cost-Based Operations, we Must Focus on the Implementation Strategy of:
  Select
  Project
  Join
For Select and Project - There is a Fixed Cost that we Must Live With
For Join Implementation Strategy
  Different Join Strategies
  Objective: Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic Optimization Can Complement One Another

Implementation of SELECT

Principles
  Equality Eliminates Many Tuples
  Index Focuses and Limits Search Scope
Sequential Scan
  Brute Force Search of All Records to Find Matching Ones
Binary Search
  Equality Comparison on a Key Attribute on Which the File is Ordered
Primary Index or Hash Key for Single Record
  Equality Comparison on a Key Attribute With Primary Index or Hash Key
  Go Directly to Record; No Need to Scan Entire Table
  Cost to Maintain Index/Hash

Implementation of SELECT

Primary Index for Multiple Records
  Use the Primary Index to Find the Record that Satisfies the Equality Condition
  Go Forward (> or ≥) or Backward (< or ≤) According to the Comparison Operator
Clustering Index for Multiple Records
  Equality Comparison on a Non-key Attribute With a Clustering Index (e.g., Sort-Merge Algorithm)
Secondary Index
  Equality or Range Queries
Primary Indexes Play a Role Similar to Searching a Sorted Array
We'll Discuss Indexing Techniques at a Later Time

Recall B+ Tree: Find Leaf and Go Left or Right
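The "find the leaf, then go left or right" idea can be sketched with a sorted list standing in for the chained leaf level of a B+ tree. This is an assumption made for brevity: a real B+ tree descends through index nodes to reach the leaf, whereas here the standard `bisect` module performs the search. The function name `range_select` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def range_select(sorted_keys, op, value):
    """Locate `value` in the sorted keys, then scan in the direction
    required by the comparison operator."""
    lo = bisect_left(sorted_keys, value)    # first position with key >= value
    hi = bisect_right(sorted_keys, value)   # first position with key > value
    if op == "=":
        return sorted_keys[lo:hi]
    if op == ">=":
        return sorted_keys[lo:]             # go right, including matches
    if op == ">":
        return sorted_keys[hi:]             # go right, skipping matches
    if op == "<=":
        return sorted_keys[:hi]             # go left, including matches
    if op == "<":
        return sorted_keys[:lo]             # go left, skipping matches
    raise ValueError(f"unsupported operator: {op}")

keys = [5, 10, 10, 20, 30]
print(range_select(keys, ">=", 10))
```

Only the matching end of the key sequence is touched, which is why an ordered index beats a sequential scan for range conditions.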

Implementation of SELECT

Conjunctive Selection (C1 ∧ C2 ∧ … ∧ CN)
  If One of the Conjuncts has a Good Access Path, Use it and Check the Other Conjuncts for Each of These Records
Composite Index
  If an Index has Been Established Jointly for a Number of Attributes in the Conjunct Equality Condition
Intersection of Pointers
  If Secondary Indexes Exist on All or Most of the Attributes in the Conjunct and the Indexes Include Record Pointers
  Retrieve the Record Pointers for Each Attribute Using These Indexes and Then Take Their Intersection
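The intersection-of-pointers strategy can be sketched as follows. Here a "record pointer" is simply a position in a Python list and each secondary index is a dictionary from attribute value to a set of positions; both are simplifying assumptions, and the names `build_index` and `conjunctive_select` are hypothetical.

```python
from collections import defaultdict

def build_index(records, attr):
    """Secondary index: attribute value -> set of record pointers (list positions)."""
    index = defaultdict(set)
    for rid, record in enumerate(records):
        index[record[attr]].add(rid)
    return index

def conjunctive_select(records, indexes, conditions):
    """Evaluate attr1 = v1 AND attr2 = v2 ... by intersecting pointer sets."""
    pointer_sets = [indexes[attr].get(value, set())
                    for attr, value in conditions.items()]
    if not pointer_sets:
        return list(records)
    rids = set.intersection(*pointer_sets)
    # Only the records that survive the intersection are fetched.
    return [records[rid] for rid in sorted(rids)]

# Hypothetical sample data.
records = [
    {"bank": "BofA", "customer": "Bill", "account": 1},
    {"bank": "BofA", "customer": "Ann", "account": 2},
    {"bank": "Chase", "customer": "Bill", "account": 3},
]
indexes = {attr: build_index(records, attr) for attr in ("bank", "customer")}
matches = conjunctive_select(records, indexes, {"bank": "BofA", "customer": "Bill"})
```

If only some conjuncts have an index, the remaining conditions would be checked on the fetched records.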

Implementing PROJECT

If <Attribute List> Includes Key
  Simple Since the Cardinality of the Result is the Same as the Cardinality of the Original Relation
  No Need to Remove Duplicates - Key Attribute
If <Attribute List> Does Not Include Key
  Duplicates Allowed
  Duplicate Elimination
    Sort After Projection and then Eliminate Consecutively Appearing Duplicates
      See Textbook for Algorithms
    Use Hashing: Hash Each Record Into a Bucket and Check Against Records Already in That Bucket
Size Estimation: $\mathrm{card}(\pi_A(R)) = \mathrm{card}(R)$ (when A includes a key or duplicates are kept)
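Both duplicate-elimination strategies can be sketched in a few lines. This is an in-memory illustration with hypothetical function names and sample data; a real system would use an external sort or disk-resident buckets when the projected relation does not fit in memory.

```python
def project(relation, attrs):
    """Projection without duplicate elimination: one output tuple per input row."""
    return [tuple(row[a] for a in attrs) for row in relation]

def distinct_by_sort(tuples):
    """Sort, then drop duplicates that now appear consecutively."""
    result = []
    for t in sorted(tuples):
        if not result or result[-1] != t:
            result.append(t)
    return result

def distinct_by_hash(tuples, n_buckets=8):
    """Hash each tuple into a bucket and compare only against that bucket."""
    buckets = [[] for _ in range(n_buckets)]
    result = []
    for t in tuples:
        bucket = buckets[hash(t) % n_buckets]
        if t not in bucket:
            bucket.append(t)
            result.append(t)
    return result

# Hypothetical sample data.
publishers = [
    {"Pname": "AWL", "Pcity": "Storrs"},
    {"Pname": "MIT Press", "Pcity": "Hartford"},
    {"Pname": "UConn Press", "Pcity": "Storrs"},
]
cities = project(publishers, ["Pcity"])
```

The sort-based version returns tuples in sorted order, while the hash-based version keeps the order of first appearance.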

Implementing JOIN

Nested Loop
  Simple Iteration and Block-Oriented Iteration
    For Each Block in R do
      Retrieve Every Record from S and Test Join Condition
  An Index for S may Speed up the Inner Loop
  Smaller Relation should be the Outer Loop
  Calculation of I/O
    Let $b_o$ ($b_i$) be the Number of Blocks taken up by the Outer (Inner) Relation
    Let $n_B$ (> 1) be the Buffer Size (in blocks) Devoted to Arguments
    Let $b_R$ be the Size of the Resulting Relation (in blocks)
    Total no. of Block Accesses = $b_o + \lceil b_o/(n_B - 1) \rceil \cdot b_i + b_R$
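A block-oriented nested-loop join that counts its own block reads can be sketched as below. It assumes one buffer block is reserved for the inner relation and the remaining buffers hold outer blocks, and it counts reads only, leaving out the cost of writing the result. Relations are lists of dicts and the names (`to_blocks`, `block_nested_loop_join`) are hypothetical.

```python
def to_blocks(relation, block_size):
    """Split a relation into blocks of at most block_size tuples."""
    return [relation[i:i + block_size] for i in range(0, len(relation), block_size)]

def block_nested_loop_join(outer, inner, attr, block_size, n_buffers):
    """Equijoin on attr; returns (result, number of block reads)."""
    outer_blocks = to_blocks(outer, block_size)
    inner_blocks = to_blocks(inner, block_size)
    chunk = n_buffers - 1          # outer blocks held in memory per pass
    reads = 0
    result = []
    for start in range(0, len(outer_blocks), chunk):
        group = outer_blocks[start:start + chunk]
        reads += len(group)        # read this group of outer blocks once
        for inner_block in inner_blocks:
            reads += 1             # inner relation is re-read for every group
            for outer_block in group:
                for o in outer_block:
                    for t in inner_block:
                        if o[attr] == t[attr]:
                            result.append({**o, **t})
    return result, reads

# Hypothetical sample data: 6 outer tuples, 4 inner tuples, 2 tuples per block.
outer = [{"k": i, "o": i} for i in range(6)]
inner = [{"k": i, "v": i * 10} for i in range(0, 8, 2)]
joined, reads_2 = block_nested_loop_join(outer, inner, "k", block_size=2, n_buffers=2)
_, reads_3 = block_nested_loop_join(outer, inner, "k", block_size=2, n_buffers=3)
```

With 3 outer blocks and 2 inner blocks the counter gives 3 + 3·2 = 9 reads for two buffers and 3 + 2·2 = 7 for three, so extra buffer space reduces the number of passes over the inner relation.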

Implementing JOIN

Sort-Merge Join
  Physically Sort Relations R and S
  Scan R and S in the Sorted Order and Merge
  See Algorithm in Textbook
If Files are Not Physically Sorted, but Sorted on the Join Attributes, a Variation May be Used
  Quite Inefficient Since Records are Scattered Over the Disk
Total Number of Block Accesses = $b_o + b_i + b_o \log_2 b_o + b_i \log_2 b_i + b_R$
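The merge phase can be sketched in memory as follows. This is not the textbook's algorithm verbatim: it sorts with Python's built-in sort rather than an external sort, and the function name and sample data are hypothetical. Runs of equal join values on both sides are paired so that duplicates are handled.

```python
def sort_merge_join(r, s, attr):
    """Equijoin of two relations (lists of dicts) on a common attribute."""
    r = sorted(r, key=lambda t: t[attr])
    s = sorted(s, key=lambda t: t[attr])
    i = j = 0
    result = []
    while i < len(r) and j < len(s):
        if r[i][attr] < s[j][attr]:
            i += 1
        elif r[i][attr] > s[j][attr]:
            j += 1
        else:
            key = r[i][attr]
            # Find the run of tuples with this key in each relation.
            i_end = i
            while i_end < len(r) and r[i_end][attr] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][attr] == key:
                j_end += 1
            for a in r[i:i_end]:
                for b in s[j:j_end]:
                    result.append({**a, **b})
            i, j = i_end, j_end
    return result

# Hypothetical sample data.
R = [{"k": 2, "a": "x"}, {"k": 1, "a": "y"}, {"k": 2, "a": "z"}]
S = [{"k": 2, "b": 1}, {"k": 3, "b": 2}, {"k": 2, "b": 3}]
joined = sort_merge_join(R, S, "k")
```

After sorting, each relation is scanned once, which corresponds to the $b_o + b_i$ term of the cost formula; the logarithmic terms account for the sorts.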

Implementing JOIN

Hash Join
  Hash R and S Using the Same Hash Function
  If the Hash File Can Be Memory-Resident, it is Efficient and Easy to Implement
  If Buffer Space is Insufficient, then Part of the Hash File has to be on Disk
    Various Optimizations for this Case
    Hybrid Hash Join is Described in the Book
  Again - Biggest Problem is Overhead Associated with Maintaining the Hash Index Over Time
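For the memory-resident case, a hash join can be sketched with a Python dictionary playing the role of the hash file. That substitution is an assumption of this sketch, as are the function name and sample data; the disk-partitioned and hybrid variants are not modeled.

```python
from collections import defaultdict

def hash_join(build, probe, attr):
    """Equijoin on attr: hash the build relation, then probe with the other."""
    table = defaultdict(list)
    for row in build:                       # build phase
        table[row[attr]].append(row)
    result = []
    for row in probe:                       # probe phase
        for match in table.get(row[attr], []):
            result.append({**match, **row})
    return result

# Hypothetical sample data; the smaller relation is used as the build side.
customers = [{"customer_name": "Bill", "city": "Storrs"},
             {"customer_name": "Ann", "city": "Hartford"}]
deposits = [{"customer_name": "Bill", "account": 1},
            {"customer_name": "Bill", "account": 2},
            {"customer_name": "Cara", "account": 3}]
joined = hash_join(customers, deposits, "customer_name")
```

Each relation is read once, so the work is linear in the input sizes when the build side fits in memory.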

Access Using Indices: Estimation of Costs

Example: Given a bank database consisting of the following three relation schemas:

Branch(bank-name, assets, bank-city)
Deposit(bank-name, account-number, customer-name, balance)
Customer(customer-name, street, zipcode, customer-city)

Consider the SQL query for the bank database:

Select account-number
From Deposit
Where bank-name = "BofA" and customer-name = "Bill" and balance > 1000;

Heuristic Optimization

[Figure: three alternative query trees over Deposit. Each projects Account-Number and applies the selections bank-name = "BofA", customer-name = "Bill", and balance > 1000 as a cascade, in a different order.]

Use the Cascading of Selections Rule to Decompose; Three Logical Query Plan Alternatives Are Obtained
Objective - Choose the "Best" Alternative in Terms of Execution Time (Block Reads)
What should be the Focus in Select Order?

Access Using Indices: Estimation of Costs

Assumptions:
  100 Different Banks (bank-name)
  1000 Customers (on average) per bank
  Balance could range from 0 to 10,000 dollars

Select account-number
From Deposit
Where bank-name = "BofA" and customer-name = "Bill" and balance > 1000;

Estimation of Cost of Access - Version 1

[Figure: query tree over Deposit projecting Account-Number, with balance > 1000 as the first selection applied to Deposit.]

Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit? 100,000
What Does balance > 1000 do?
  Retrieve 90% of Accounts
  All Banks, All Customers
What Does customer-name = "Bill" do?
  All Customers Named Bill Regardless of the Bank
Is this a Good Strategy?

Estimation of Cost of Access - Version 2

[Figure: query tree over Deposit projecting Account-Number, with the selections applied in the order discussed below.]

Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does bank-name = "BofA" do?
  Retrieves 1000 Tuples for BofA on Average
What Does customer-name = "Bill" do?
  The Customer "Bill"
What Does balance > 1000 do?
Is this a Good Strategy?

Estimation of Cost of Access - Version 3

[Figure: query tree over Deposit projecting Account-Number, with the selections applied in the order discussed below.]

Recall Assumptions
  100 Banks
  1000 Customers/Bank
  0 to 10,000 dollars/account that are Distributed Evenly Across Accts.
Tuples in Deposit: 100,000
What Does customer-name = "Bill" do?
  Retrieves 100 Tuples, One per Bank
What Does balance > 1000 do?
  Do they Have Enough Money?
What Does bank-name = "BofA" do?
Is this a Good Strategy?
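The three select orders can be compared with a small calculation. The sketch assumes the conditions are statistically independent, so selectivities simply multiply; that assumption is not stated on the slides. The selectivities themselves follow the slide figures: 1 bank in 100, 100 "Bill" tuples out of 100,000, and 90% of balances above 1000. `Fraction` keeps the arithmetic exact, and the names used are hypothetical.

```python
from fractions import Fraction

N_TUPLES = 100 * 1000          # 100 banks x 1000 customers per bank

SELECTIVITY = {
    "bank-name": Fraction(1, 100),            # one bank out of 100
    "customer-name": Fraction(100, N_TUPLES), # one "Bill" per bank
    "balance": Fraction(9, 10),               # balances spread evenly over 0..10,000
}

def intermediate_sizes(order):
    """Expected number of tuples left after each selection in the cascade."""
    sizes = []
    n = Fraction(N_TUPLES)
    for condition in order:
        n *= SELECTIVITY[condition]
        sizes.append(n)
    return sizes

version_1 = intermediate_sizes(["balance", "customer-name", "bank-name"])
version_2 = intermediate_sizes(["bank-name", "customer-name", "balance"])
version_3 = intermediate_sizes(["customer-name", "balance", "bank-name"])

for name, sizes in [("Version 1", version_1), ("Version 2", version_2),
                    ("Version 3", version_3)]:
    print(name, [float(s) for s in sizes])
```

All three orders end with the same expected result size, but Version 1 passes 90,000 tuples to its second selection, Version 2 passes 1000, and Version 3 passes 100, so applying the most restrictive selection first keeps the intermediate results smallest.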

Join Strategies

Several Factors Influence the Selection of an Optimal Join Strategy
  The Physical Order of Tuples in a Relation
  The Presence of Indices and the Type of Index (Clustering or Nonclustering)
  The Cost of Computing a Temporary Index for the Sole Purpose of Processing One Query

Example: Consider the natural join Deposit ⋈ Customer
• nDeposit = 10,000.
• nCustomer = 200.
• 20 tuples fit in one block for both relations
• buffersize = 2 blocks

Join Strategies: Block-Oriented Iteration

Block-oriented Iteration:
• Process the relations on a per-block basis rather than on a per-tuple basis
• Using this approach, a major saving in block accesses results

Example: Consider the natural join Deposit ⋈ Customer
  nDeposit = 10,000.
  nCustomer = 200.
  20 tuples fit in one block for both relations

Case 1: outer loop: Deposit, inner loop: Customer
• reading Customer once for every block of Deposit tuples requires (200/20) * (10,000/20) = 10 * 500 = 5000 block reads
• reading the Deposit relation requires 10,000/20 = 500 block reads
• the total cost in terms of block accesses is 5500
  -> 5000 block accesses to Customer and
  -> 500 block accesses to Deposit

Join Strategies: Block-Oriented Iteration

Case 2: outer loop: Customer, inner loop: Deposit
• Reading Deposit once for every block of Customer tuples requires (10,000/20) * (200/20) = 5000 block reads
• Reading the Customer relation requires 200/20 = 10 block reads
• The total cost in terms of block accesses is 5010
  ==> 5000 accesses to Deposit blocks and
  ==> 10 accesses to Customer blocks

Case 3: If the Customer relation is small enough to fit in main memory, our strategy requires only
  ==> 500 blocks to read the Deposit relation and
  ==> 10 blocks to read the Customer relation.
The total comes to 510 blocks
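The arithmetic for the three cases can be checked with a short script. The function names are hypothetical, and the cost model is the one used in the cases above: every outer block is read once, and the inner relation is read once per outer block.

```python
from math import ceil

def n_blocks(n_tuples, tuples_per_block):
    return ceil(n_tuples / tuples_per_block)

def block_iteration_cost(b_outer, b_inner):
    """Outer blocks read once; inner relation re-read for every outer block."""
    return b_outer + b_outer * b_inner

def in_memory_cost(b_a, b_b):
    """One relation fits in memory, so each relation is read exactly once."""
    return b_a + b_b

b_deposit = n_blocks(10_000, 20)
b_customer = n_blocks(200, 20)

case_1 = block_iteration_cost(b_deposit, b_customer)   # outer: Deposit
case_2 = block_iteration_cost(b_customer, b_deposit)   # outer: Customer
case_3 = in_memory_cost(b_deposit, b_customer)         # Customer held in memory

print(case_1, case_2, case_3)
```

The inner-loop term is the same 5000 reads in Cases 1 and 2; the difference comes from the outer relation, which is why the smaller relation belongs in the outer loop.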

Query Execution Cost: Summary

Access Cost to Secondary Storage
  Search for Data Block (Index)
  Read/Write Index and Data Blocks
Storage Cost
  Index and Data Blocks
  Intermediate Files
Computation Cost
  Query Planning
  Record Search, Sort, Merge
  Actual Transaction/Query Operations
Communications Cost
  Data Transfer Across a Network

Access Plan

An Access Plan is a Concrete Query Processing Plan which Presents a Detailed Strategy for Processing a Query
The Main Cost Factors to Be Considered Include
  The Relational Operations to be Performed
  Indices to be Used
  The Order in Which Tuples are to be Accessed
  The Order in Which Operations are to be Performed
Typical Focus is on Join and Optimizing its Execution, Particularly when Multiple Tables are Involved

Statistics

The Following are Kept in the System Catalog for Optimization Purposes
  File Parameters: Block Size
  Number of Tuples in Each Relation
  Size of Tuples
  Key Fields, Indices
  Number of Levels in an Index
  Highest Key, Lowest Key
  Number of Distinct Values
  (Maybe) Others: Frequency of Operations, Join Keys, Etc.
All DBMSs Keep the First Four, Many Keep All

Join Ordering

Given R ⋈ S ⋈ T ⋈ W, Determine the Best Ordering
Alternatives
  ((R ⋈ S) ⋈ T) ⋈ W
  (R ⋈ (S ⋈ T)) ⋈ W
  R ⋈ (S ⋈ (T ⋈ W))
  ((R ⋈ T) ⋈ S) ⋈ W
  ((R ⋈ W) ⋈ S) ⋈ T
  …
  (R ⋈ S) ⋈ (T ⋈ W)
Ordering is Critical to Arrive at the "Best" Strategy for Execution, Particularly as
  The Number of Relations Increases
  The Size of the Relations (Tuples/Blocks) Increases
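How quickly the alternatives multiply can be seen by enumerating them. The sketch below generates every fully parenthesized join expression over a set of relations, treating a join and its commuted form (R with S versus S with R) as different orders; that counting convention and the function name are assumptions of this example.

```python
from itertools import combinations

def join_orders(relations):
    """All fully parenthesized join expressions over the given relations."""
    rels = tuple(relations)
    if len(rels) == 1:
        return [rels[0]]
    orders = []
    for k in range(1, len(rels)):
        for left in combinations(rels, k):
            right = tuple(r for r in rels if r not in left)
            for left_expr in join_orders(left):
                for right_expr in join_orders(right):
                    orders.append(f"({left_expr} JOIN {right_expr})")
    return orders

orders = join_orders("RSTW")
print(len(orders))
print(orders[0])
```

For the four relations R, S, T, and W this yields 120 orders, against 12 for three relations and 2 for two, which is why exhaustive enumeration becomes impractical as the number of relations grows.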

Query Optimization Search Strategies

Exhaustive Search
  "Optimal"
  Combinatorial Complexity in the Number of Relations
Heuristics
  Not Optimal
  Group Common Sub-expressions
  Perform Selection, Projection First
  Replace a Join by a Series of Semi-joins
  Reorder Operations to Reduce Intermediate Relation Size
  Optimize Individual Operations

Query Optimization Timing Issues

Static
  Compilation ==> Optimize Prior to the Execution
  Difficult to Estimate the Size of the Intermediate Results ==> Error Propagation
  Can Amortize Over Many Executions
Dynamic
  Run Time Optimization
  Exact Information on the Intermediate Relation Sizes
  Have to Reoptimize for Multiple Executions
Hybrid
  Compile Using a Static Algorithm
  If the Error in Estimated Sizes > Threshold, Reoptimize at Run Time

Concluding Remarks

Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by Making a Heuristic Guess of a Strategy for Each Query
  The Optimizer Considers Every Possible Strategy, but Stops Evaluating a Strategy as Soon as it Determines that its Cost is Greater than that of the Pre-chosen Strategy
  Thus Only a Few Competing Strategies Require Full Analysis of the Cost
  The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
  For PL - Optimization is Pre-Execution (Compile)
  For DB - Optimization is Part of Execution (Run)
