
Course: Database Applications (NDBI026) WS2018/19

RNDr. Michal Kopecký, Ph.D. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

Schema modification
▪ Adding and deleting columns
▪ Adding and deleting constraints
▪ Changing column definition

ANSI SQL-92 Joins

Optimizer and Query Optimization
▪ Indexes
▪ Execution Plans
▪ Hinting

M. Kopecký: Schema Modification and Query Optimization (NDBI026, Lect. 2)

Error in application design
▪ Wrong normal form of the schema
▪ Not all required data can be stored
▪ Wrongly defined constraint

Customer changes his/her requirements
▪ Support for new attributes and entities
▪ Change of constraints in the real world
▪ …


Schema and/or application logic changes represent an important part of the application lifetime

Data stored in the database are usually more valuable and more important than the software and hardware

The ability to modify the schema without any data loss is more important than the ability to create it from scratch


Adding a column to an existing table

ALTER TABLE tab_name ADD column_definition;

Example

ALTER TABLE Person ADD Note CHARACTER VARYING(1000);

ALTER TABLE Product ADD EAN NUMERIC(13) CONSTRAINT Product_U_EAN UNIQUE;
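The point that a column can be added without losing stored data can be tried out in a few lines. The following is only a sketch: it uses Python's built-in `sqlite3` module as a stand-in for Oracle or MS SQL (an assumption, SQLite's dialect differs in details), and the sample row is made up for illustration.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for Oracle / MS SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name VARCHAR(50))")
conn.execute("INSERT INTO Person (ID, Name) VALUES (1, 'Francis Drake')")  # hypothetical row

# Add a column to the already populated table; existing rows are kept.
conn.execute("ALTER TABLE Person ADD Note CHARACTER VARYING(1000)")

# The new column exists and holds NULL (None in Python) for existing rows.
columns = [info[1] for info in conn.execute("PRAGMA table_info(Person)")]
rows = conn.execute("SELECT ID, Name, Note FROM Person").fetchall()
print(columns)
print(rows)
```

The existing row survives the change and simply carries an empty (NULL) value in the new column.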


Dropping a column in an existing table

It is usually necessary to transfer the data somewhere before dropping the column!

ALTER TABLE tab_name DROP COLUMN col_name;

Example

ALTER TABLE Person DROP COLUMN ZipCode;


Adding a constraint to an existing table

ALTER TABLE tab_name ADD constraint_definition;

Example

ALTER TABLE Person ADD CONSTRAINT Person_FK_Mother FOREIGN KEY(Mother) REFERENCES Person(ID) ON DELETE SET NULL;


Dropping an unnecessary constraint in the table

ALTER TABLE tab_name DROP CONSTRAINT constraint_name;

Example

ALTER TABLE Person DROP CONSTRAINT Person_U_Name;


More columns and constraints can be added in one step using the statement

ALTER TABLE tab_name ADD ( column_definition | constraint_definition [, …] );

Example

ALTER TABLE Person ADD ( Note CHARACTER VARYING(1000), CONSTRAINT Person_Chk_Age CHECK (Age>=0) );


Columns can be modified using the statement

ALTER TABLE tab_name MODIFY ( new_incremental_column_definition [, …] ); -- Oracle

ALTER TABLE tab_name ALTER COLUMN new_incremental_column_definition; -- MS SQL

Example

ALTER TABLE Person MODIFY ( Note CHARACTER VARYING(2000) );

Features not mentioned in the new definition remain unchanged

It is possible to change
▪ NULL to NOT NULL and vice versa
▪ Column width: it can be increased, or decreased (usually only if the column is empty)



The ANSI SQL-92 standard introduced more types of table join in the FROM clause (semantics taken from relational algebra)
▪ Cartesian product
▪ Equijoin
▪ Inner join
▪ Natural join
▪ Left/Right/Full outer join

The previous version allowed only
▪ A comma-separated list of data sources (tables and views)
▪ Each source can be followed by an alias, separated by a space
▪ Join conditions only in the WHERE clause


ANSI SQL-92 syntax

Allows usage of the keyword AS between a data source and its alias

… FROM Emp AS E, Dept AS D

Distinguishes semantically different types of join using new keywords in the FROM clause

The WHERE clause remains for additional conditions (row selection)


X CROSS JOIN Y

Cartesian product

Equivalent of the previous style X, Y

SELECT EmpNo, Loc FROM Emp CROSS JOIN Dept;

Emp
EmpNo  DeptNo
1111   10
2222   20

Dept
DeptNo  Loc
20      NEW YORK
30      DALLAS

Emp CROSS JOIN Dept
EmpNo  DeptNo  DeptNo  Loc
1111   10      20      NEW YORK
1111   10      30      DALLAS
2222   20      20      NEW YORK
2222   20      30      DALLAS

X NATURAL [INNER] JOIN Y

Natural join over all common columns of both tables (here only DeptNo)

SELECT EmpNo, Loc FROM Emp NATURAL JOIN Dept;

Emp
EmpNo  DeptNo
1111   10
2222   20

Dept
DeptNo  Loc
20      NEW YORK
30      DALLAS

Emp NATURAL JOIN Dept
EmpNo  DeptNo  Loc
2222   20      NEW YORK

X [INNER] JOIN Y ON (condition)
▪ Standard join of tables, equivalent to the older FROM … X, Y … WHERE condition

X [INNER] JOIN Y USING (column [,…])
▪ Join over equality of values in all listed columns (both tables have to have those columns defined)



Instead of the INNER keyword, it is possible to use one of LEFT [OUTER], RIGHT [OUTER], FULL [OUTER]

In the case of … X LEFT JOIN Y ON (condition) … the result contains all rows from the left table (X), even if there is no corresponding row in the right table (Y)


INNER can be replaced by one of the keywords LEFT [OUTER], RIGHT [OUTER], FULL [OUTER]

SELECT * FROM Emp NATURAL LEFT JOIN Dept;

The result contains all employees, including those that have no matching department

Non-existing fields from the Dept table are empty (contain the NULL value)

EmpNo  DeptNo  Loc
1111   10      NULL
2222   20      NEW YORK
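The join variants shown so far can be reproduced on the two-row Emp and Dept tables from the slides. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine (an assumption; the lecture targets Oracle and MS SQL), and the helper name `run` is introduced only for this illustration.

```python
import sqlite3

# SQLite stands in for Oracle / MS SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Emp  (EmpNo INTEGER, DeptNo INTEGER);
    CREATE TABLE Dept (DeptNo INTEGER, Loc TEXT);
    INSERT INTO Emp  VALUES (1111, 10), (2222, 20);
    INSERT INTO Dept VALUES (20, 'NEW YORK'), (30, 'DALLAS');
""")

def run(sql):
    """Hypothetical helper: execute a query and return all rows."""
    return conn.execute(sql).fetchall()

# Cartesian product: every employee paired with every department.
cross = run("SELECT EmpNo, Loc FROM Emp CROSS JOIN Dept ORDER BY EmpNo, Loc")

# Natural join: rows are matched on the common column DeptNo.
natural = run("SELECT EmpNo, Loc FROM Emp NATURAL JOIN Dept")

# Natural left outer join: unmatched employees are kept, Loc is NULL (None).
left = run("SELECT EmpNo, DeptNo, Loc FROM Emp NATURAL LEFT JOIN Dept ORDER BY EmpNo")

print(cross)
print(natural)
print(left)
```

The cross join yields four rows, the natural join one, and the left outer join keeps employee 1111 with an empty location.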


SELECT * FROM Emp NATURAL RIGHT JOIN Dept;

The result contains all departments, including those that have no assigned employees

Non-existing fields from the Emp table are empty (contain the NULL value)

EmpNo  DeptNo  Loc
2222   20      NEW YORK
NULL   30      DALLAS


SELECT * FROM Emp NATURAL FULL JOIN Dept;

Combination of both left and right outer join

Non-existing fields from both tables are empty (contain the NULL value)

EmpNo  DeptNo  Loc
2222   20      NEW YORK
NULL   30      DALLAS
1111   10      NULL

Oracle also has its own native (proprietary) syntax for outer joins
▪ Older, only left and right outer joins, only for equality of values
▪ The ANSI version is better and portable

Left outer join

SELECT * FROM Dept, Emp WHERE Dept.Deptno = Emp.Deptno(+);

Right outer join

SELECT * FROM Dept, Emp WHERE Dept.Deptno(+) = Emp.Deptno;

Full outer join does not exist in this syntax



Indexes serve to speed up data access according to some condition in the WHERE clause

They change neither the syntax nor the semantics of DML statements

Unique vs. non-unique indexes

One-column vs. multi-column (concatenated) indexes

Clustered vs. unclustered indexes

B-trees vs. bitmaps

Indexes on columns vs. on expressions

Domain indexes (full-text, spatial, XML, …)


Index creation is not standardized in SQL-92

Individual RDBMSs implement indexes in a proprietary way. What can vary:
▪ Syntax
▪ Support of particular type(s) of indexes (bitmap, hash, …)
▪ Their (non)usage for a given query and data content


B-tree indexes are usually redundant B+-trees
▪ Values in leaves
▪ Leaves in a bi-directional list to allow easy range search

Suitable for columns having high selectivity (a high number of different values in the column)

Concatenated indexes can combine more columns together to increase selectivity
▪ Suitable if the query searches rows according to values of the first k columns in the index. The first k-1 columns have to be restricted by equality to a constant value.
▪ Not suitable if there is no condition on the first column of the index.

It is usually not possible to combine more B-tree indexes. The query is evaluated using one of them (the most selective one) and the other conditions have to be tested programmatically.


An index cannot help
▪ If the percentage of matching rows is too high: overhead is caused by reading additional blocks of the index and mainly by non-sequential access to the data blocks
▪ In queries searching for rows containing NULL values in the indexed column: NULL values are usually not stored in the index

An index can help
▪ In queries searching rows according to equality of a column value to a constant
▪ In queries searching rows with a column value belonging to an interval


For each possible column value, one bitmap (bit string) is created, containing 1 exactly for the rows with the given value in the column, and 0 otherwise

Suitable for columns with low selectivity

Bitmaps from an arbitrary number of indexes can be combined efficiently, and the combination can increase the selectivity

SELECT * FROM Citizen
WHERE Gender='M'
AND State IN ('US-NY','US-WA');

▪ Combination of three bitmaps: M AND (NY OR WA)

M                 0 1 0 1 1 0 0 1 0 0 1
NY                0 0 1 0 0 0 0 1 0 1 0
WA                0 1 0 0 0 0 1 0 0 0 1
M AND (NY OR WA)  0 1 0 0 0 0 0 1 0 0 1
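The bitmap combination for the Citizen query can be sketched in a few lines of Python. This is only an illustration of the bitwise principle, not how any particular engine stores bitmaps; the bit strings are the ones from the figure on this slide, and the variable and helper names are made up for the example.

```python
def parse(bits):
    """Hypothetical helper: turn '0 1 0' into [0, 1, 0]."""
    return [int(b) for b in bits.split()]

# One bitmap per column value; position i describes row i of the table.
gender_m = parse("0 1 0 1 1 0 0 1 0 0 1")
state_ny = parse("0 0 1 0 0 0 0 1 0 1 0")
state_wa = parse("0 1 0 0 0 0 1 0 0 0 1")

# Gender='M' AND State IN ('US-NY','US-WA')  ->  M AND (NY OR WA), bit by bit.
combined = [m & (ny | wa) for m, ny, wa in zip(gender_m, state_ny, state_wa)]

# Positions (0-based) of rows that satisfy the whole condition.
matching_rows = [i for i, bit in enumerate(combined) if bit]

print(combined)
print(matching_rows)
```

Only the rows whose bit survives both operations need to be fetched from the table, which is why combining several low-selectivity bitmaps can still produce a selective result.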

Both Oracle and MS SQL automatically create unique indexes for

Primary keys
▪ The name is the same as the name of the constraint

Candidate keys (UNIQUE columns)
▪ The name is the same as the name of the constraint


It is important to create indexes suitable for foreign key searches!

They speed up manipulation with the master table
▪ If a master row is deleted, all child rows have to be found. Without the index, the engine has to do it using a full table scan.
▪ If cascade delete is used on a table containing a hierarchy of entities, a full scan has to be done for each found and deleted child recursively.
▪ A full scan reads all blocks, even empty ones containing only already deleted rows
▪ An index range scan finds all child rows efficiently

Oracle used to use a full table lock when it needed to lock all child rows and no suitable index was available. This restricts parallel access to the data by more users at the same time.



In other cases, indexes should be created only if they substantially help to speed up frequently used queries

Each index speeds up some queries, but slows down data modification


Indexes on columns

CREATE [UNIQUE] INDEX index_name ON tab_name(column1[, column2 [, …]]);

Example

CREATE INDEX Person_Sn_Nm_Inx ON Person(Surname,Name);

The index can be used in a statement that searches data according to the value of the first declared column

SELECT * FROM Person WHERE Surname='Drake';


The index Person_Sn_Nm_Inx ON Person(Surname,Name) cannot be used in a statement that searches data only according to the value of the second declared column

SELECT * FROM Person WHERE Name='Francis';
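The leading-column rule can be observed on a small engine. The sketch below asks SQLite (via Python's built-in `sqlite3`) for its plan with `EXPLAIN QUERY PLAN`; SQLite is an assumption standing in for Oracle or MS SQL, whose optimizers and plan output differ. The extra `City` column and the helper `plan` are hypothetical additions so that the index does not cover the whole row.

```python
import sqlite3

# SQLite stands in for Oracle / MS SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person (ID INTEGER PRIMARY KEY, Surname TEXT, Name TEXT, City TEXT)"
)
conn.execute("CREATE INDEX Person_Sn_Nm_Inx ON Person(Surname, Name)")

def plan(sql):
    """Hypothetical helper: return SQLite's textual plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last field of each plan row.
    return " / ".join(row[-1] for row in rows)

# Condition on the first index column: the index can be used.
by_surname = plan("SELECT * FROM Person WHERE Surname = 'Drake'")

# Condition only on the second index column: the table is scanned instead.
by_name = plan("SELECT * FROM Person WHERE Name = 'Francis'")

print(by_surname)
print(by_name)
```

The exact wording of the plan text depends on the SQLite version, so only the presence or absence of the index name is meaningful here.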


It is better to declare uniqueness using PRIMARY KEY and UNIQUE constraints
▪ Not only indexes, but also constraints are defined
▪ Constraints have to be used to allow those columns to be the target of foreign key(s)


Indexes with ordering

CREATE [UNIQUE] INDEX index_name ON tab_name(column1 [{ASC|DESC}] [, …]);

▪ Define the ordering for each individual column
▪ Can define the resulting row ordering in index search queries

Example

CREATE INDEX Employee_Job_Sal_Inx ON Employee(Job, Salary DESC);


Bitmap indexes (only non-unique)

CREATE BITMAP INDEX index_name ON tab_name({column1|expression1}, …);

Example

CREATE BITMAP INDEX Teaching_Day_Inx ON Teaching(DayOfWeek);


CLUSTERED
▪ At most one per table; by default the primary key

If it is defined
▪ Data in the table are ordered according to the index (ISF). In fact, the table forms the leaves of the index tree.
▪ Other indexes point to primary key values instead of row IDs

If it is not defined
▪ Data in the table are not particularly ordered (HEAP)
▪ All indexes point to row IDs

NONCLUSTERED


create table onheap(
  id numeric(5) identity (100,10)
    constraint onheap_pk primary key NONCLUSTERED,
  name character varying(10)
    constraint onheap_u_name unique
);

select object_id, name, index_id iid, type typ, type_desc
from sys.indexes;

object_id | name |iid|typ| type_desc

1357247890 | category_pk | 1 | 1 | CLUSTERED

1357247890 | category_u_name| 2 | 2 | NONCLUSTERED

1417772108 | NULL | 0 | 0 | HEAP

1417772108 | onheap_u_name | 2 | 2 | NONCLUSTERED

1417772108 | onheap_pk | 3 | 2 | NONCLUSTERED


Oracle: index-organized tables, the equivalent of CLUSTERED in MS SQL
▪ The table is ordered according to the primary key; rows form the leaf level of the primary key index
▪ Other indexes point to logical ROWIDs (primary key value + supposed address)

CREATE TABLE Person( ID VARCHAR2(11) CONSTRAINT Person_PK PRIMARY KEY, … ) ORGANIZATION INDEX;


Index dropping

ORACLE: DROP INDEX index_name;

MSSQL: DROP INDEX tab_name.index_name;


In Oracle, index information is stored in the views
▪ USER_INDEXES
▪ USER_IND_COLUMNS

In MS SQL, index information is stored in the catalog views
▪ sys.indexes
▪ sys.index_columns


Use the correct type of index for the given selectivity

Do not create all possible indexes over all columns and their combinations
▪ It slows down data updates
▪ It increases the amount of disk space taken


When developing the application, use all available means in the target database to find the best possible variant of the query

Hint the optimizer only if all other attempts have failed

Optimizers have their limits
▪ Heuristics are used to find the best plan; non-promising branches of the plan space are pruned
▪ Thus, only some combinations of data access paths and table joins are taken into account


One query can be written in many ways
▪ The same semantics
▪ Different ways to achieve the result
▪ The time spent can differ many times!

The plan for executing a given query written in a given form is provided by the query optimizer

You need to
▪ Know how to find out the plan used
▪ Use the best optimizable form of the query, or help the optimizer explicitly (when nothing else helps)


An execution plan is a (binary) tree of elementary operations
▪ Evaluated in post-order manner; the root operation provides the complete result

The leaves are data access paths to sources
▪ Table ROWID direct access
▪ Index UNIQUE SCAN
▪ Index RANGE SCAN
▪ Table FULL SCAN
▪ …

The inner nodes are
▪ Accesses to table rows according to index-provided addresses
▪ Joins (NESTED LOOPS, MERGE JOIN, HASH JOIN)
▪ Data sorting operations
▪ Filters for remaining predicates
▪ …


In Oracle

Older RULE BASED optimization (RBO)
▪ Derives the plan from the statement syntax and from available indexes

Newer COST BASED optimization (CBO)
▪ Oracle 8+, recommended for better results
▪ Based on metadata available/computed for tables and columns, computes the overall cost of the plan according to the estimated usage of resources for operation execution (amount of time, space, ordering, data block accesses, …)
▪ Can distinguish the effectiveness of two different index range scans as well as the cost of execution for different constants used in the query


Data access paths ordered by cost, from the most expensive

Table Full-scan
▪ All data blocks of the table are read one by one. Conditions are checked programmatically for each row.
▪ Can be optimal if the number of matching rows is large enough.

Index-Range-Scan
▪ The interval is found in the index. Other conditions are checked programmatically.

Unique-Index-Scan
▪ At most one matching row is found by searching the unique index. Other conditions are checked programmatically.

ROWID-Scan
▪ The row is fetched according to its direct address in the database


Join cost for two tables

The optimizer usually tries to use the table with the more expensive data access as the pivotal table (outer loop in nested loops)

Then, for each found row of the pivotal table, it searches for the corresponding data in the other table

If both tables provide only the Full-Scan data access path, the data in both tables are temporarily ordered and Merge-Join is used.
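The two join strategies named above can be sketched in plain Python to show that they produce the same rows by different means. This is a toy model under stated assumptions, not an engine implementation: tables are lists of tuples, the third employee (3333) is hypothetical sample data, and the function names are made up for the illustration.

```python
from operator import itemgetter

emp = [(1111, 10), (2222, 20), (3333, 20)]   # (EmpNo, DeptNo); 3333 is hypothetical
dept = [(20, "NEW YORK"), (30, "DALLAS")]     # (DeptNo, Loc)

def nested_loops_join(outer, inner):
    """For each row of the pivotal (outer) table, search the other table."""
    result = []
    for emp_no, emp_dept in outer:
        for dept_no, loc in inner:
            if emp_dept == dept_no:
                result.append((emp_no, dept_no, loc))
    return result

def merge_join(left, right):
    """Sort both inputs on the join key, then advance through them in step."""
    left = sorted(left, key=itemgetter(1))
    right = sorted(right, key=itemgetter(0))
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][1] < right[j][0]:
            i += 1
        elif left[i][1] > right[j][0]:
            j += 1
        else:
            # Emit every right row sharing the key of the current left row.
            k = j
            while k < len(right) and right[k][0] == left[i][1]:
                result.append((left[i][0], right[k][0], right[k][1]))
                k += 1
            i += 1
    return result

print(nested_loops_join(emp, dept))
print(merge_join(emp, dept))
```

Nested loops touch the inner table once per outer row, which is cheap only when the inner lookup is fast (for example through an index); the merge join pays for sorting once and then reads each input sequentially.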


How to find out the plan?

In Oracle you should have a table named PLAN_TABLE with the correct schema available (newer versions of Oracle provide it automatically)
▪ It can be created by the script @?\rdbms\admin\utlxplan[.sql]
▪ The optimizer can then store the plan in this table if it is asked to do so

The SQL*Plus client provides the option SET AUTOTRACE {OFF|ON|TRACEONLY}

Oracle provides the statement EXPLAIN PLAN


EXPLAIN PLAN
  SET STATEMENT_ID = 'name'
  [INTO tab_name]
  FOR statement;

Example

EXPLAIN PLAN
  SET STATEMENT_ID = 'emp_dept'
  FOR
  SELECT Emp.*, Dept.Loc
  FROM Dept, Emp
  WHERE Dept.DeptNo = Emp.Deptno;


Obtaining the execution plan (version 10+)

select plan_table_output
from table(
  dbms_xplan.display(
    'PLAN_TABLE', {statement_id|null},
    {'ALL'|'TYPICAL'|'BASIC'|'SERIAL'}
  )
);

PLAN_TABLE_OUTPUT

----------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|

----------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 96961 | 1893K| 270 (2)|

| 1 | NESTED LOOPS | | 96961 | 1893K| 270 (2)|

| 2 | INDEX RANGE SCAN | MF_CISPOLATR_SK_ATR_DO_PBCP | 12 | 216 | 3 (0)|

| 3 | COLLECTION ITERATOR PICKLER FETCH| XMLSEQUENCEFROMXMLTYPE | | | |

----------------------------------------------------------------------------------------------------



How to find out the plan?

In MS SQL, the ISQL console provides the possibility to show the plan in textual form

set showplan_text on
go
<statement>
go

Usual recommendation: use placeholders instead of constants in your application and bind application variables to them

Two “different” queries have two distinct (but equal) execution plans. Their creation costs time and resources of the database.
▪ SELECT * FROM Emp WHERE DeptNo=10;
  SELECT * FROM Emp WHERE DeptNo=20;
▪ SELECT * FROM Emp WHERE DeptNo=:d;

Sometimes, of course, two CBO plans can be helpful because they are different (if the way of execution justifiably differs)
▪ SELECT * FROM Soldiers WHERE Gender='M'
▪ SELECT * FROM Soldiers WHERE Gender='F'
▪ A value matching 90% of the data is best served by a full scan
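Binding a variable to a placeholder looks as follows from application code. This is a sketch using Python's built-in `sqlite3` module as a stand-in driver (an assumption; Oracle and MS SQL drivers have their own binding APIs), with made-up sample rows and a hypothetical helper `employees_in`. The statement text stays identical for every department, only the bound value changes.

```python
import sqlite3

# SQLite stands in for Oracle / MS SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (EmpNo INTEGER, DeptNo INTEGER)")
conn.executemany(
    "INSERT INTO Emp VALUES (?, ?)",
    [(1111, 10), (2222, 20), (3333, 20)],  # hypothetical sample rows
)

# One statement text with a named placeholder instead of a constant.
QUERY = "SELECT EmpNo FROM Emp WHERE DeptNo = :d ORDER BY EmpNo"

def employees_in(dept_no):
    """Hypothetical helper: bind the application variable to :d and run the query."""
    return [row[0] for row in conn.execute(QUERY, {"d": dept_no})]

dept_10 = employees_in(10)
dept_20 = employees_in(20)
print(dept_10)
print(dept_20)
```

Because the SQL text never changes, the database (or the driver's statement cache) can reuse the parsed statement instead of analyzing a new one for each constant.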

Write a given statement in the same form in all places in the application

Different styles cause different plans and repeated analysis of statements
▪ SELECT * FROM Emp WHERE Ename LIKE 'A%' AND DeptNo=10;
▪ SELECT * FROM Emp WHERE DeptNo=10 AND Ename LIKE 'A%';


If more non-unique indexes exist on the table, RBO can choose the worse of them

SELECT * FROM Person WHERE Name='John' AND City='Idaho City';

Either all Johns are found and the city is tested programmatically, or vice versa


Usage of one of the indexes can be “disabled” by using some expression in the query

SELECT * FROM Person WHERE CONCAT(Name,'')='John' AND City='Idaho City';

The index on Name cannot be used; the index on City will be used instead

Note: a more sophisticated optimizer could recognize this trick and rewrite the query to its original form.


The overall cost of individual plans is computed using many criteria
▪ Amount of I/O operations, rows, bytes, …
▪ The cost of needed ordering operations
▪ The cost of HASH operations

The plan with the lowest weighted cost is chosen


The cost-based optimizer uses statistical information about stored data
▪ Number of different values in indexed columns
▪ Histograms of data values in columns
▪ Lowest/highest values in columns
▪ Number of rows in the table
▪ Average length of one row
▪ Number of data blocks in the table
▪ Number of empty data blocks in the table
▪ Number of NULLs in columns

For a given value or interval, it can estimate
▪ The percentage of matching rows
▪ The percentage of needed blocks
▪ Their volume


In Oracle, the CBO allows creating indexes over expressions, not only columns (RBO cannot use them)

CREATE INDEX Emp_Income_INX ON Emp(Sal+COALESCE(Comm,0));

A query with the identical expression can use the index

SELECT EName FROM Emp WHERE Sal+COALESCE(Comm,0) > 25000;

A query with a modified expression cannot use that index

SELECT EName FROM Emp WHERE COALESCE(Comm,0)+Sal > 25000;


Selection of optimizer

ALTER SESSION SET OPTIMIZER_MODE*) =
▪ CHOOSE: the optimizer is chosen according to the presence of statistics
▪ ALL_ROWS: CBO will be used, minimizes the cost of obtaining all rows of the select; indexes are used less. Suitable for batch processing.
▪ FIRST_ROWS: CBO will be used, minimizes the cost of obtaining the first few rows; indexes are used more. Suitable for interactive processing.
▪ RULE: always RBO

*) Note: older syntax: OPTIMIZER_GOAL


ANALYZE TABLE tab_name {COMPUTE | ESTIMATE | DELETE} STATISTICS [FOR {TABLE | ALL [INDEXED] COLUMNS}];

DBMS_UTILITY.ANALYZE_SCHEMA( 'schema_name', {'compute' | 'delete' | 'estimate'} );

DBMS_STATS.GATHER_SCHEMA_STATS('sch_name');

Views in the data dictionary
▪ INDEX_STATS
▪ USER_TAB_COL_STATISTICS
▪ USER_USTATS


By default, the option AUTO_CREATE_STATISTICS is enabled
▪ Automatic statistics generation
▪ ALTER DATABASE dbname SET AUTO_CREATE_STATISTICS {ON|OFF}

Manually by the procedure sp_createstats

Example: creation of an additional two-column statistic based on a data sample

CREATE STATISTICS FirstLast ON Person.Contact(FirstName,LastName) WITH SAMPLE 50 PERCENT


Tables
▪ Number of rows
▪ Number of rows in one block
▪ Number of empty/all blocks
▪ …

Columns
▪ Number of different values
▪ Number of NULL values
▪ Histograms of values
▪ …


Hints are written as “plus sign” comments placed immediately after the first keyword of the statement (SELECT/UPDATE/INSERT/DELETE)
▪ SELECT --+ list of hints (seems to be ignored)
▪ SELECT /*+ list of hints */

Can be used for statement-level selection of the optimizer
▪ SELECT /*+ RULE */ * FROM EMP …;
▪ SELECT /*+ FIRST_ROWS */ * FROM EMP …;

Using a hint (except for the RULE hint) always forces the use of the CBO based on statistics. If statistics are not computed or are too old, the result can be counterproductive.


General setting for optimizer

CHOOSE
▪ The optimizer chooses the method according to the presence or absence of statistics

RULE
▪ The optimizer uses RBO even if statistics are available. When SQL-92 joins are used in the statement, the RULE hint will be ignored!

ALL_ROWS
▪ The optimizer will minimize the cost of retrieving all rows

FIRST_ROWS, FIRST_ROWS(n)
▪ The optimizer will minimize the cost of retrieving the first row / the first n rows


Other hints (for data access paths)

FULL(tab_name)
▪ The given table should be full-scanned

INDEX (tab_name index_name)
▪ The given index should be used to retrieve data from the table

NO_INDEX (tab_name index_name)
▪ The given index should not be used to retrieve data from the table

ORDERED
▪ The order of tables in joins should correspond to the order of appearance in the FROM clause

USE_NL, USE_MERGE, USE_HASH
▪ Joins should be implemented using nested loops / merge joins / hash joins


FULL(tab_name)

SELECT /*+ FULL(Emp) */ EmpNo, Ename FROM Emp WHERE EName>'X';

Use FULL SCAN even if the number of retrieved rows is small

If the table has an alias, the hint has to use this alias; this allows the same table to be used several times with different hints


INDEX(tab_name index [index …])

SELECT /*+ INDEX(Emp ENameInx EDeptInx) */ EmpNo, Ename FROM Emp WHERE EName LIKE 'SC%' AND DeptNo>50;

Use one of the listed indexes; do not use other indexes, even if available and suitable


NO_INDEX(tab_name index [index …])

SELECT /*+ NO_INDEX(Emp ENameInx) */ EmpNo, Ename FROM Emp WHERE EName LIKE 'SC%' AND DeptNo>50;

Do not consider the listed indexes during query optimization


ORDERED

SELECT /*+ ORDERED */ EmpNo, Ename FROM Emp, Dept WHERE …;

Tables will be joined in the order of appearance in the FROM clause

It saves time by not considering other orders of tables in the join


SELECT … OPTION (hint …);

Hints can be chosen from:

{ { HASH | ORDER } GROUP
| { CONCAT | HASH | MERGE } UNION
| { LOOP | MERGE | HASH } JOIN
| FAST number_rows
| FORCE ORDER
| MAXDOP number_of_processors
| OPTIMIZE FOR ( @variable_name { UNKNOWN | = literal_constant } [ , ...n ] )
| … }


{ HASH | ORDER } GROUP
▪ Implement GROUP BY using hashing or ordering of data

{ CONCAT | HASH | MERGE } UNION
▪ Implement UNION without duplicates by simple concatenating, hashing, or merging of individual results

{ LOOP | MERGE | HASH } JOIN
▪ Implement joins by nested loops / merge joins / hash joins

FAST number_rows
▪ Optimize the query for fast retrieval of the first number_rows rows


FORCE ORDER
▪ Keep the order of tables in joins according to the FROM clause

MAXDOP number_of_processors
▪ Limits the maximal degree of parallelism

OPTIMIZE FOR ( @variable_name { UNKNOWN | = literal_constant } [ , ...n ] )
▪ If the statement contains a variable (placeholder), assume either the given value or an unknown value

