49
A small companion for the class Information Systems for Engineers A summary of M.Fourny’s lecture. In each chapter, you will find a subsection called ”some additional definitions” 1 , which might go beyond the scope of this class, but give some context. Available here: Exercise class website of Marie-Louise Achart Disclaimer: no guarantee for neither rightness nor completeness. Remarks are very welcome: [email protected] Contents 1 Introduction 2 2 The relational Model 5 3 Data Modelling with SQL 8 4 The Relational Algebra 11 5 SQL queries 14 6 Database design theory 25 7 Databases and host languages 30 8 Transactions 33 9 Views and indices 41 10 Storage and architecture 43 11 Data Cubes 46 1 from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom 1

A small companion for the class Information Systems for

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A small companion for the class Information Systems for

A small companion for the class

Information Systems for Engineers

A summary of M.Fourny’s lecture.In each chapter, you will find a subsection called ”some additional definitions”1, whichmight go beyond the scope of this class, but give some context.Available here: Exercise class website of Marie-Louise AchartDisclaimer: no guarantee for neither rightness nor completeness.Remarks are very welcome: [email protected]

Contents

1 Introduction 2

2 The relational Model 5

3 Data Modelling with SQL 8

4 The Relational Algebra 11

5 SQL queries 14

6 Database design theory 25

7 Databases and host languages 30

8 Transactions 33

9 Views and indices 41

10 Storage and architecture 43

11 Data Cubes 46

1from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom

1

Page 2: A small companion for the class Information Systems for

1 Introduction

1.1 Different forms of databases over time

- oral tradition: speaking, singing- primitive writing / accounting: stone tablets- printing press- computers- the Relational Era 1970s- the Object Era 1980s- the NoSQL Era 2000s

1.2 Edgar Codd

Edgar Frank Codd (19 August 1923 – 18 April 2003) was an English computer scientistwho, while working for IBM, invented the relational model for database management, thetheoretical basis for relational databases and relational database management systems.

1.3 Role of data independence

Data independence refers to the ability to make changes to data characteristics (defini-tion and organization) without having to make changes to the programs that access thedata (user applications). It saves time and potential errors. We do not need to rewritethe application program(external view of the data), when we change the physical level(storage...).

1.4 From data to knowledge

DATA+ meaning−−−−−−−→ INFORMATION

+ meaningful application−−−−−−−−−−−−−−−−−→ KNOWLEDGE

1.5 Fundamental shapes of data

- text: unstructured data- tree: semi-structured data- graph: semi-structured data- table: structured data- cube: structured data

In this course, we focus on tables and cubes.

unstructured data: are in a very raw form, with no explicit structure, but onlyimplicit features (grammar, pixel,...).

1.6 Data model

A model is an abstraction from reality. A data model is a notation for describing dataor information. The description generally consists of three parts: Structure of the data,Operations on the data, Constraints on the data.

1.7 International system of units

k kilo 1 000M Mega 1 000 000G Giga 1 000 000 000T Tera 1 000 000 000 000P Peta 1 000 000 000 000 000E Exa 1 000 000 000 000 000 000Z Zetta 1 000 000 000 000 000 000 000Y Yotta 1 000 000 000 000 000 000 000 000

2

Page 3: A small companion for the class Information Systems for

1.8 Why we need a database management system

- to enforce integrity constraints- to support multiple users- to avoid data loss- to support access control- to save development costs

1.9 Architecture of a database management system

DDL Interpreter + DML Interpreter

. m mMetadata Manager ⇔ Data Manager ⇔ Transaction Manager

. m mData + Metadata (files on disk)

1.10 Information System

Information system refers to a program or a set of programs that manages information.

User Interface

Business Logic

Persistency : DBMS

Database

1.11 Different job positions related to databases

- parametric user: business logic + user interface (menus/masks, front-end)- data analyst: business logic + user interface (menus/masks + sql queries)- power user: application (3 layers: persistency + business logic + user interface)- database designer: DBMS- database administrator: DBMS + database

1.12 DDL vs DML

- DDL (Data Manipulation Language): it is used to define data structures.- DML (Data Definition Language): it is used to manipulate data itself.

1.13 Declarative vs imperative languages

- imperative languages: C, Java, C++, Python: tell the compiler what you want tohappen, step by step.- declarative languages: Haskell, CAML, SQL, XQuery: code that describes what youwant, but not necessarily how to get it.

1.14 Some additional definitions 2

database: a collection of information that exists over a long periodeof time. a collectionof data that are managed by a DBMS. are at the core of many scientific investigations.

DBMS: Database Management System: tool for creating and managing large amountsof data efficiently and allowing it to persist over long periods of time, safely.

functions of a DBMS:- create new database with specific schema, using a data-definition language- query the data and modify using a data-manipulation language- store large amounts (many terabytes) of data over a long period of time- durability: recovery of the database when errors, failures, misuse

2from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom,chapter 1: The Worlds of Database Systems

3

Page 4: A small companion for the class Information Systems for

- control access with user isolation and actions on the data performed completely withatomicity.

data-definition language commands: performed by the database administrator toimplement of modify the logicial structure of the data: schema.

data-manipulation language statement: performed by a user or an application pro-gram, does not affect the schema of the database but may affect his content.

metadata: the database schema that describes the structure of, and the constraintson, the database.

1.15 Review questions

What is the name of the principle, posited by Edgar Codd in 1970, according to whichusers should not be exposed to the implementation level of a database system?→ data independance

1.16 Review questions

What is a table?→ a mathematical relation over attribute domains.→ a set of tuples mapping strings to values and with identical support.

4

Page 5: A small companion for the class Information Systems for

2 The relational Model

2.1 Most important concepts (with synonyms) to describe a table

- table = collection

id name gini

1 Switzerland 29.5

2 USA 40.8

3 Costa Rica 40.7

4 Iran 37.4

- attribute = column = field = property

name

Switzerland

USA

Costa Rica

Iran

- tuple = business object = row = entity = document = record

1 Switzerland 29.5

2.2 Schema of a table vs. Extension of a table

- the schema of a table describes the name of the table, and the name and type of each ofits column.- the extension of a table describe the data contained inside the table itself.- extension :

1 Switzerland 29.5

2 USA 40.8

3 Costa Rica 40.7

4 Iran 37.4

2.3 Mathematical concepts used in the relational model

- set, eg. N integers, D decimals, B booleans, § strings, V atomic values, T relationaltables, ...have no order: {a, b} = {b, a}have no duplicates: a, a, b violates the definition.- list- bag

- function f ∈ A → B (all the elements of the domain are mapped to elements in thecodomain)- partial function g ∈ A 9 B (there can be some lonely elements in the domain, notmapped to elements of the codomain)- relation

relation partial function function

All functions are partial function. All partial function are relation. Not all partial functionare function.

- cartesian product- pair: a tuple of two elements

5

Page 6: A small companion for the class Information Systems for

- tuple (in the mathematical sense, not the database sense) t ∈ A ∗ B ∗ C,eg. (1,”Switzerland”,29.5) ∈ N ∗ S ∗ Dare ordered: (a, b, c, d) 6= (d, b, c, a)

- logical predicate- closure

2.4 Relational table (also called relation), in terms of partial functions

Definition as a mathematical object:A relation R is made of:- attributes: AttributesR ⊆ S- an extension (set of tuples): ExtensionR ⊆ S 9 Vsuch that: ∀t ∈ ExtensionR, support(t) = AttributesR

Actual user-friendly representation:A table is a set of partial functions mapping strings to values, and with identical support.

⇒ More rigorous definition, and pedagogically nice to separate definition from domainintegrity.- set of partial functions:{(id 7→ 1, name 7→ Switzerland, gini 7→ 29.5), (id 7→ 2, name 7→ USA, gini 7→ 40.8),(id 7→ 3, name 7→ CostaRica, gini 7→ 40.9), (id 7→ 4, name 7→ Iran, gini 7→ 37.4)}

2.5 Relational table (also called relation), in terms of mathematical re-lations (alternative definition in textbooks)

Definition as a mathematical object:A relation R is made of SchemaR, DomainR, ExtensionR as follows:- a set of schema attributes, ordered for convenience (arbitrarily): a1, a2, ..., an- a family of domains associated with these attributes (the types): a 7→ DomainR(a)- a set of tuples (the extension): ExtensionR ⊆ DomainR(a1)× ...×DomainR(an)

Domain integrity holds by definition.

Actual user-friendly representation:A table is a relation over its attribute domains.

⇒ Very popular definition in database books, but imprecise regarding attribute (non-)order.

2.6 Domain integrity

Given a relation R and a domain constraint DomainR, domain integrity is fulfilled if∀t ∈ ExtensionR,∀a ∈ AttributesR: t · a ∈ DomainR(a)

In other words: Tuples must have their values in the domains associated with their at-tributes.eg. Domain(id) = N , Domain(name) = S , Domain(gini) = DThus (1,”Switzerland”,”plouf”) violates domain integrity.

2.7 Some additional definitions 3

relation: two-dimensional table. convention: relation name begins with a Capital letter.

data model: notation for describing data or information consisting of structure, op-erations and constraints.

structure of the data: higher-level than data structures like in C++.

3from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom,chapter 2: The Relational Model of Data

6

Page 7: A small companion for the class Information Systems for

operations on the data: limited set of queries (that retrieve information) and modifi-cations (that change the database).

constraints on the data: limitations of what data can be.

data models:- relational model: efficiency of access and modifications of the data (ability of the compilerto produce highly optimized code), ease of programming; can include object-relational ex-tensions (values can have structure rather tan being elementary types, relations can haveassociated methods);- semistructured-data model: more flexibility than relations, including XML, eg. trees orgraphs.

attribute: column of a relation. convention: attribute name begins with a lower-caseletter.

schema for a relation: name of a relation and the set of attributes (not a list, butwith a ”standard” order).

(relational) database schema: schema for a set of relations.

tuples: rows of a relation, except the header row with attributes. A tuple has onecomponent for each attribute of the relation.

domain: to each attribute of the relation is associated a domain, that must be an el-ementary type.

atomicity: the relational model requires that each component of each tuple be atomic.eg. set, list, array are not permitted.

equivalent representation: relation are set of tuples, not list of tuples, thus the or-der used for representing the tuples is immaterial.

2.8 Review questions

Integrity constraints:→ relational integrity: partial functions have same support.→ flat integrity: values are atomic.→ attribute integrity: the attributes in a table must all have the same length.→ domain integrity: values must be in the domain associated with their attribute.

2.9 Review questions

Domain integrity:→ char(3) must take exactly 3 characters.→ numeric(5)=numeric(precision=5,scale=0) can have up to 5 numbers, and up to 0number in the fractional part (behind the comma). eg. 1234 ok but 1234.5 not ok→ date must be of the form ??list all allowed formats??→ boolean must be of the form TRUE, FALSE, NULL.

7

Page 8: A small companion for the class Information Systems for

3 Data Modelling with SQL

3.1 SQL

SEQUEL (Structured English QUEry Language) is a declarative language.

3.2 Mapping between mathematical domains and SQL datatypes

character char(p), varchar(p), clob

binary binary(p), varbinary(p), blob

number, exact numeric(p,s), decimal(p,s), smallint, integer(p), bigint

number, approximate float(p), real, double precision

boolean boolean

date and time date, time, timestamp

intervals interval year to month, interval day to second

3.3 SQL implementation of a primary key

- uniquely identifies a row- can not be null- must be unique

3.4 SQL implementation of a foreign key

A foreign key is a field (or collection of fields) in one table that uniquely identifies a rowof another table or the same table. The foreign key is defined in a second table, but itrefers to the primary key or a unique key in the first table.

3.5 Some additional definitions 4

The language SQL is the principal query language for relational database systems.Postgresql is an opensource version of SQL.

The most common form of SQL query has the form SELECT FROM WHERE. It allowsus to take the product of several relations (the FROM clause), apply a condition to the tu-ples of the result (the WHERE clause), and produce desired components (the SELECT clause).

SELECT FROM WHERE queries can also be used as subquery within a WHERE clause ora FROM clause of another query.

logical operator: AND OR NOTcomparison operator: <,<=, >,>=,=, <>, ! = The ! = operator is converted to <>by the parser.

4from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom

8

Page 9: A small companion for the class Information Systems for

a BETWEEN x AND y assumption: x<=y⇔ a >= x AND a <= y

a BETWEEN SYMMETRIC x AND y no assumption: x<=y or x>=y⇔ a >= x AND a <= y

a NOT BETWEEN x AND y

⇔ a < x AND a > y

a IS DISTINCT FROM b ⇔ a <> b assumption: a and b not NULLa IS NOT DISTINCT FROM b ⇔ a = b

a IS DISTINCT FROM NULL 5 ⇔ TRUE a <> NULL ⇔ NULL

NULL IS DISTINCT FROM NULL ⇔ FALSE NULL <> NULL ⇔ NULL

a IS NOT DISTINCT FROM NULL ⇔ FALSE a = NULL ⇔ NULL

NULL IS NOT DISTINCT FROM NULL ⇔ TRUE NULL = NULL ⇔ NULL

a IS TRUE assumption: a is of type booleana IS NOT TRUE 2 IS TRUE or UNKNOWN IS TRUE throw an errora IS FALSE NULL IS TRUE or NULL IS FALSE return FALSE

a IS NOT FALSE always return TRUE or FALSE

a IS NULL no assumption: a can be a number or a booleana IS NOT NULL 2 IS NULL return FALSE

a IS UNKNOWN assumption: a is of type booleana IS NOT UNKNOWN 2 IS UNKNOWN throw an error

the bag model: SQL actually regards relations as bags of tuples, not sets of tuples.We can force elimination of duplicate tuples with DISTINCT, while ALL allows the resultto be a bag in certain circumstances where bags are not default.

aggregation: the values appearing in one column of a relation can be summarized (ag-gregated) by using one of the keywords: SUM, AVG, MIN, MAX, COUNT. Tupples ca bepartitioned prior to aggregation with GROUP BY. Certain groups ca be eliminated with aclause introduced by HAVING.

key constraint: a set of attributes forms a key for a relation if we do not allowtwo tuples in a relation instance to have the same values in all the attributes of the key.(statement about all possible instances of the relation, not only the current instance).convention: underlying the attribute names that are key in the schema.

eg. for the relation Pers = name address with name as key, we write the constraintas the following relational algebra expression:σPers1.name=Pers2.name AND Pers1.address6=Pers2.address(Pers1× Pers2)

artificial key: creating an attribute, eg ID Number, that serves as a key.

declaring keys: set of attributes S declared as a key for the relation R :

UNIQUE: two tuples in R cannot agree on all of the attributes in S, unless one of themis NULL.Inserting or updating a tuple that violates this rule causes the DBMS to reject the actionthat caused the violation.

PRIMARY KEY: as above and in addition, the attributes of S are not allowed to have NULL

as a value for their components.

3.6 Review questions

DDL command:→ CREATE TABLE persons(name varchar, registered boolean);

5NULL is treated like the boolean UNKNOWN

9

Page 10: A small companion for the class Information Systems for

→ ALTER TABLE persons ADD COLUMN birth date NON NULL;→ DROP TABLE persons;

3.7 Review questions

DML command:→ INSERT INTO persons VALUES (’foo’,’y’,DATE ’1897-10-20’),(’bar’,’false’,DATE’1977-10-20’);

SELECT p.name, p.price, s.discount, s.customer |and projectionFROM products p JOIN sales s |inner join

ON p.pid=s.productWHERE s.number > 10 |followed by selection

10

Page 11: A small companion for the class Information Systems for

4 The Relational Algebra

4.1 Relational algebra

It’s a family of algebras providing the theoretical foundation for relational databases, usingoperators (unary or binary) to perform queries.

4.2 Different categories of relational algebra operations

- set queries: union, intersection, substraction- filter queries: selection, projection- renaming queries: relation renaming, attribute renaming- binary queries: cartesian product, natural join, theta join

4.3 Projection

- new name- specified subset of attributes- same attached domains- projection of tuples on attribute subset

4.4 Selection

- new name- same attributes- same domains- subset of tuples matching predicate- duplicates elimination is superfluous

4.5 Union

- new name- same attributes- same domains- union of extensions- duplicates are eliminated

4.6 Intersection

- new name- same attributes- same domains- intersection of extensions- duplicates elimination is superfluous

4.7 Renaming

- new name- set of attributes with the one changed- same domains (with attributes changed)- same tuples (with attributes changed)

4.8 Cartesian Product

- new name- union of attributes- ”union” of domains- cartesian product of tuples- duplicate elimination is superfluous

11

Page 12: A small companion for the class Information Systems for

4.9 Grouping

general form:γattribute1,...,attributeN,aggregation→aggregationname(attributetoaggregate)(tablename)exemple:γcountry,city,SUM(price)→volume,COUNT (customer)→transactionsamoumt(tableoftransactions)

- aggregation functions: MOD, MIN, MAX, SUM, AVERAGE, VARIANCE.

4.10 List or bag semantics vs. set semantics

- in sets: duplicates are not allowed- in bags: duplicates are allowed.

4.11 Some additional definitions 6

algebra: consists of operators (+ − ∗ /) and atomic operands(x y).

relational algebra: algebra with atomic operands being: variables that stand for re-lations and constants, which are finite relations.

cartesian product: cross-product R × S: the resulting schema of this relation is theUNION of the schema of R and of S. Warning: rename attributes that are in common inR and in S, eg. attribute name R.name and S.name.

natural join: R ./ S: given that R and S have some attributes in common, pairs oftuples from R and S which common attributes’ tuples agree with one another.

theta join: R ./conditions S: join on R and S whose tuples satisfy the conditions.

left outer join: R ./ Sright outer join: R ./ Sfull outer join: R ./ Sinner join: R ./ S

expression tree: are evaluated bottom-up.

expression tree of the relational algebra formula:πtitle,year(σlength>=100(Movies) ∩ σstudioName=′Fox′(Movies))is also equivalent to: πtitle,year(σlength>=100 AND studioName=′Fox′(Movies))

6from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom

12

Page 13: A small companion for the class Information Systems for

relationships among operations: R ∩ S = R− (R− S) ;R ./C S = σC(R× S) ; etc.Expressing constraints with expressions of relational algebra:R− S = ∅ is equivalent to R ⊆ S.

referential integrity constraint: πA(R) ⊆ πB(S) if we have a value v as a compo-nent in attribute A of some tuple in a relation R, then we expect that v will appear ina component of -let say- attribute B of some tuple of another relation S equivalent toπA(R)− πB(R) = ∅.

13

Page 14: A small companion for the class Information Systems for

5 SQL queries

5.1 Useful commands and shortcuts on PostgreSQL

- Load the SQL module : %load ext sql

- Connect to the PostgreSQL database:%sql postgresql://username:password@localhost/Database name

Database name: postgres- Write a one-line command : %sql- Write a few lines command(s) : %%sql

- Shortcut esc: naviguate up and down the cells- Shortcut enter: modify the selected cell- Shortcut ctrl+enter: run the cell- Shortcut a: insert a new cell above the current cell- Shortcut b: insert a new cell bellow the current cell- Shortcut d,d: delete the cell

5.2 Queries on table

- Create the table called countries with one column/attribute called name :%sql CREATE TABLE countries(name VARCHAR7(30));

%sql CREATE TABLE countries(name CHAR(5));

- Print the table countries on the screen : %sql SELECT * FROM countries;

- Insert the tuple Switzerland in the attribute name of the table countries :%sql INSERT INTO countries(name) VALUES(’Switzerland’);

- Insert different tuples within one command line :%sql INSERT INTO countries(name) VALUES(’USA’), (’Costa Rica’), (’Iran’), (’Kazakhstan’);

WARNING : in the practice it happens that the data to be inserted do not fit the attributedomains properly, which will return an error.

- Truncate (to the 5 first characters) values while inserting :%sql INSERT INTO countries(name) VALUES (left(’Australia’, 5)),(left(’Austria’,

15));

WARNING : Tuples should be unique but most database, including Postgres do not check.

- Delete the table countries :%sql DROP TABLE countries;

- Insert an attribute called population to the table countries :%sql ALTER TABLE countries ADD COLUMN name CHAR(5);

- Drop the attribute population of the table countries :%sql ALTER TABLE countries DROP COLUMN name;

- Choose the appropriate attribute type for the following tupples to insert in the ta-ble countries :

attribute value SQL command

id 4 %sql ALTER TABLE countries ADD COLUMN id SMALLINT;population 7000 000 000 %sql ALTER TABLE countries ADD COLUMN population BIGINT;area 500 000 000 %sql ALTER TABLE countries ADD COLUMN area INT;gini8 46.1 %sql ALTER TABLE countries ADD COLUMN gini numeric(4,1);

7

char(n) fixed-length character

varchar(n) variable-length with limit

text variable unlimited length

8Gini index: statistical measure of income distribution among the population, developed by CorradoGini in 1912. The coefficient ranges from 0 (or 0%) to 1 (or 100%), with 0 representing perfect equalityand 1 representing perfect inequality. Values over 1 are theoretically possible due to negative income or

14

Page 15: A small companion for the class Information Systems for

.

- Drop different attributes of the table countries within one command line :%sql ALTER TABLE countries DROP COLUMN id, DROP COLUMN name, DROP COLUMN population;

WARNING : deleting all attributes of a table does not actually remove the extension.

- Clear the table countries / delete the extension:%sql DELETE FROM countries;

- Insert the attribute called id with the constraint: cannot be null, to the table coun-tries :%sql ALTER TABLE countries ADD COLUMN id INT NOT NULL;

- Insert the attribute called area with the constraint : can be null, to the table coun-tries :%sql ALTER TABLE countries ADD COLUMN area INT NULL;

WARNING : the constraint NULL and NOT NULL only apply on numeric data type, eg. isnot working with CHAR()

- Add tupple to the table countries :%%sql INSERT INTO countries

(id, name, population, area, gini) VALUES

(1, ’Switzerland’, 8401120, 41285, 29.5),

(2, ’USA’, 325365189, 9833520, 40.8),

(3, ’Kazakhstan’, 17987736, 2724900, 26.4);

-Insert the attribute called length to the table movies with the default value 120:%sql ALTER TABLE movies ADD COLUMN length INT DEFAULT 120);

- Update the value of the attribute genre to Sci-Fi, of the tuple which has the attributetitle Arrival :%sql UPDATE movies SET genre = ’Sci-Fi’ WHERE title = ’Arrival’;

- Update the value of the attribute genre to Thriller, of the tuple which has the 11 firstletter (starting from the left) of the attribute title being Bladerunner, of the table movies:%sql UPDATE movies SET genre = ’Thriller’ WHERE left(title, 11) = ’Bladerunner’;

- Create a table credits, with two attributes, one being of type DATE :%%sql CREATE TABLE credits(

name VARCHAR(30) NOT NULL,

birthdate DATE NULL);

- Modify the attribute birthdate to have a default value of 1st January 1970, of thetable credits :%sql ALTER TABLE credits ALTER COLUMN birthdate SET DEFAULT ’1970-01-01’;

- Insert two tuples to the table credits, one being the default value :%%sql INSERT INTO credits

(name, birthdate) VALUES

wealth.

15

Page 16: A small companion for the class Information Systems for

(’Rutger Hauer’, DEFAULT),

(’Sean Young’, ’1959-11-20’);

-Create a new Schema9:%sql CREATE SCHEMA ise18;

- Delete a schema, if it exists, and all the objects (tables, functions...) inside it:%sql DROP IF EXISTS ise18 CASCADE;

- Delete a schema, only if it is empty (default setting):%sql DROP ise18 RESTRICT;

- Create a table country inside a specific schema ise18:%sql CREATE TABLE ise18.countries();10

- Create a table with attribute population positive or null integer:%sql CREATE TABLE country(population BIGINT NULL CHECK(population >= 0));

- Create a table country with attribute gini that must be within the range [0,100] andcan be null:%sql CREATE TABLE country(gini NUMERIC(4,1) NULL CHECK(gini >= 0 AND gini <=

100));

- Insert tuples into a table country which is inside a specific schema ise18:%%sql INSERT INTO ise18.country(id, name)VALUES(1,’Switzerland’),(2,’USA’);

- Print the table country which is inside the schema ise18:%sql SELECT * FROM ise18.country;

- Create table movies with attributes genre with default value drama:&sq CREATE TABLE movies(genre VARCHAR(20) DEFAULT ’Drama’);

- Create table friends with attributes birthdate with default value 1st March 1970:%sql CREATE TABLE friends(birthdate DATE DEFAULT ’1970-03-01’);

- Create the table employees with appropriate constraints:%%sql CREATE TABLE employees(

. id SERIAL11 PRIMARY KEY,

. firstname VARCHAR(255)12 NOT NULL,

. lastname VARCHAR()255 NOT NULL,13

. email VARCHAR(255) UNIQUE, 14

. );

- Define the column name as primary key from an existing table:%sql ALTER TABLE ise18.credits ADD PRIMARY KEY(name);

- Define the column name as primary key during the table declaration:%%sql CREATE TABLE credits(

9schema: allow the creation and editting of a collection of tables independently from the rest of thedatabase system.

10creating table with no attributes accepted, but creating table with attributes without type is not.11serial: autoincrementing positive integer starting at 1.12string of characters up to 255 characters.13the employee must have a first and last name, it cannot be empty.14all the employees must have a different email from each other-if they have one-.

When you define an UNIQUE index for a column, the column cannot store multiple rows with thesame values.If you define a UNIQUE index for two or more columns, the combined values in these columns cannot beduplicated in multiple rows.PostgreSQL treats NULL as distinct value, therefore, you can have multiple NULL values in a column.

16

Page 17: A small companion for the class Information Systems for

. name CHAR(30) PRIMARY KEY,15

. gender CHAR(1)

. );

- Define the column name as primary key during the table declaration:%%sql CREATE TABLE credits(

. name CHAR(30),

. gender CHAR(1),

. PRIMARY KEY(name)

. );

- Define several attributes as primary key during the table declaration:%%sql CREATE TABLE credits(name CHAR(30), gender CHAR(1), length INT, year INT,

. PRIMARY KEY(name,gender,year));

- Define explicitly16 the primary key constraint’s name: clef, during the table declara-tion:%%sql CREATE TABLE credits(name CHAR(30), gender CHAR(1), length INT, year INT,

. CONSTRAINT clef PRIMARY KEY(name,gender,year));

-Remove primary key of the table stars:%sql ALTER TABLE stars DROP CONSTRAINT stars pkey;17

- Create table stars with 3 attributes: title, year and stars name with title and yearused as a foreign key from table movies:%% sql CREATE TABLE stars (

. title VARCHAR(250),

. year SMALLINT,

. stars name VARCHAR(50),

. FOREIGN KEY(title, year)18 REFERENCES ise18.movies(nom,annee));

- Create table stars with 3 attributes: title, year and stars name with stars name used asa foreign key from table acteurs:%%sql CREATE TABLE stars (

. title VARCHAR(250),

. year SMALLINT,

. stars name VARCHAR(50) REFERENCES acteurs(nom)

. );

- Modify the table directed by such that his attribute name is a foreign key from theattribute directors of the table films:%sql ALTER TABLE directed by ADD FOREIGN KEY(name) REFERENCES films(directors);

- Delete in the table movies the row/tuple whose attribute name is Fox:%sql DELETE FROM movies WHERE name = ’Fox’;

- Create a table abilities with attribute name as a foreign key from the attribute nomof the table joueur, such that when deleting a tuple from the table joueur, its name be-comes NULL/none:

15primary key: a column or a group of columns used to identify a row uniquely in a table.Technically, a primary key constraint is the combination of a not-null constraint and a UNIQUE constraint.A table can have one and only one primary key.Good practice: add a primary key to every table, then PostgreSQL creates a unique B-tree index on thecolumn(s) of the primary key.

16By default, PostgreSQL uses tablename pkey as the default name for the primary key constraint17stars pkey or explicit name of the constraint previously given in the CONSTRAINT clause.18a foreign key is defined in a table (called referencing table or child table, here stars)that references to

the primary key of the other table(called referenced table or parent table, here movies).A foreign key constraint indicates that values in column(s) in the child table match with the values incolumn(s) of the parent table. We say that a foreign key constraint maintains referential integritybetween child and parent tables.

17

Page 18: A small companion for the class Information Systems for

%sql CREATE TABLE abilities(name VARCHAR(30) NULL REFERENCES joueur(nom) ON

DELETE SET NULL);

- Create a table weapons with attribute name as a foreign key from the attribute nom ofthe table joueur, such that when deleting a tuple from the table joueur, its name becomes’placeholder’:%%sql CREATE TABLE weapons(

. name VARCHAR(30) DEFAULT ’placeholder’ REFERENCES joueur(nom) ON DELETE

SET DEFAULT);

- Create a table joueur actif with attribute name as a foreign key from the attributenom of the table joueur, such that when deleting a tuple from the table joueur, the ac-cording tuple get deleted in the table joueur actif too:%%sql CREATE TABLE joueur actif(

. name VARCHAR(30) NOT NULL REFERENCES joueur(nom) ON DELETE CASCADE

. );

- Create a table joueur actif with attribute name as a foreign key from the attributenom of the table joueur, such that tuples from the table joueur also appearing in the tablejoueur actif will not be deleted:%%sql CREATE TABLE joueur actif(;. name VARCHAR(30) NOT NULL REFERENCES joueur(nom) ON DELETE RESTRICT

. );

- Create a table joueur actif with attribute name as a foreign key from the attributenom of the table joueur, such that when trying to delete a tuple from the table joueur,also appearing in the table joueur actif, it will throw an error:%%sql CREATE TABLE joueur actif(

. name VARCHAR(30) NOT NULL REFERENCES tableau2(nom) ON DELETE NO ACTION19

. );

5.3 Using several SQL clauses jointly

relational algebra formula

SQL query

- Display the union of table1 and table2 with no duplicates:δ(table1 ∪ table2) bag semantics

%sql SELECT * FROM table1 UNION SELECT * FROM table2;

- Display the union of table1 and table2 with duplicates:table1 ∪ table2 bag semantics

%sql SELECT * FROM table1 UNION ALL SELECT * FROM table2;

- Display the table1 with the tuples in ascending order with respect to the attributetime:τtime ascending(δ(table1))

% sql SELECT* FROM table1 ORDER BY time ASC;

- Display the union with tuples in desc. order wrt. attribute time:τtime descending(table1 ∪ table2) bag sementics

%sql SELECT* FROM table1 UNION ALL SELECT* FROM table2 ORDER BY time DESC;

19if we don’t specify RESTRICT or CASCADE, Postgresql will apply NO ACTION by default.

18

Page 19: A small companion for the class Information Systems for

- Display the intersection with duplicates:table1 ∩ table2 bag sementics

%sql SELECT* FROM table1 INTERSECT ALL SELECT* FROM table2;

table1 =

nom 1 prenom 1

Beaulieu EddyChirac JackChirac JackChirac Jack

table2 =

nom 2 prenom 2

Hardy FrancoiseChirac JackChirac Jack

table1 ∩ table2 =

nom 1 prenom 1

Chirac JackChirac Jack

- Display the intersection with no duplicates:δ(table1 ∩ table2) bag semantics 20

%sql SELECT* FORM table1 INTERSECT SELECT* FROM table2;

table1 =

nom 1 prenom 1

Beaulieu EddyChirac JackChirac JackChirac Jack

table2 =

nom 2 prenom 2

Hardy FrancoiseChirac JackChirac Jack

δ(table1 ∩ table2) =nom 1 prenom 1

Chirac Jack

- Display:table1\table2 bag semantics

%sql SELECT* FROM table1 EXCEPT ALL SELECT* FROM table2

table1 =

nom 1 prenom 1

Beaulieu EddyChirac JackChirac JackChirac Jack

table2 =

nom 2 prenom 2

Hardy FrancoiseChirac Jack

table1\table2 =

nom 1 prenom 1

Beaulieu EddyChirac JackChirac Jack

- Display:

20the order matters, eg. %sql SELECT*FROM table2 INTERSECT SELECT*FROM table1; will display the

following temporary table : δ(table2 ∩ table1) =nom 2 prenom 2

Chirac Jack

19

Page 20: A small companion for the class Information Systems for

δ(table1\table2) bag semantics

%sql SELECT* FROM table1 EXCEPT SELECT* FROM table2

table1 =

nom 1 prenom 1

Beaulieu EddyChirac JackChirac JackChirac Jack

table2 =

nom 2 prenom 2

Hardy FrancoiseChirac Jack

δ(table1\table2) =nom 1 prenom 1

Beaulieu Eddy

- Display the columns game1,game2,game3 of the table players:πgame1,game2,game3(players)

%sql SELECT game1,game2,game3 FROM players;

- Display the rows, whose attribute game2 = won of the table players:σgame2=won(players)

%sql SELECT* FROM players WHERE game2=’won’;

- Display the attributes start and end from a previous selection of attributes containingstart,end,high and low:πstart,end(πstart,end,high,low(table))

%sql SELECT start,end FROM (SELECT start,end,high,low FROM table) AS projection;

- Display the tuples with a volume<20 among a selection of tuples whose time<23.10.2017:σvolume<20(σtime<23.10.2017(table))

%sql SELECT*

FROM (SELECT* FROM table WHERE time<’2017-10-23’)AS selection

WHERE volume<20;

- Display the number of tuples in the attribute game for each distinct (name,bday):γname,bday,COUNT (game)−>played(players)

%sql SELECT name,bday,COUNT(game) AS played FROM (players) GROUP BY name,bday;

table before the query:

name bday game

Tom 1992 02.03.2017Tom 1992 02.04.2017Tom 1992 02.05.2017Pablo 1985 02.03.2017Bassima 1978 02.03.2017Bassima 1978 02.05.2017

output:

name bday played

Tom 1992 3Pablo 1985 1Bassima 1978 2

20

Page 21: A small companion for the class Information Systems for

- Display the maximum tuple of the attribute volume under the name max vol :γMAX(volume)−>max vol(πvolume(table))

%sql SELECT MAX(volume) AS max vol FROM table;

output:max vol

250

5.4 Join

5.4.1 Inner Join

- Display the inner join = cartesian product with selection on matching tuples of at-tribute of both tables:%%sql SELECT*

. FROM

. fruit1

. INNER JOIN

. fruit2

. ON name1=name2;

fruit1:=

name1 qtty1

orange 1orange 1orange 4pomme 2ananas 1

fruit2:=

name2 qtty2

orange 1pomme 3kiwi 1

σname1=name2(fruit1 ./ fruit2)

⇔ fruit1 ./name1=name2 fruit2:=

name1 qtty1 name2 qtty2

orange 1 orange 1orange 1 orange 1orange 4 orange 1

pomme 2 pomme 3

- Display the inner join with renaming:%%sql SELECT

. name1 AS nom

. fruit1.qtty AS q 21

. name2 AS nom

. fruit2.qtty AS q

. FROM

. fruit1

. INNER JOIN

. fruit2

. ON fruit1.name1=fruit2.name2;

fruit1:=

name1 qtty

orange 1orange 1orange 4pomme 2ananas 1

fruit2:=

name2 qtty

orange 1pomme 3kiwi 1

21the attribute qtty can designate the one from table fruit1 or fruit2, thus we need to precise withtablename.columnname.

21

Page 22: A small companion for the class Information Systems for

ρname1→nom,fruit1.qtty→q,name2→nom,fruit2.qtty→q(fruit1 ./name1=name2 fruit2):=

nom q nom 1 q 122

orange 1 orange 1orange 1 orange 1orange 4 orange 1pomme 2 pomme 3

5.4.2 Outer Join

- Display the left outer join:%%sql SELECT*

. FROM

. fruit1

. LEFT JOIN

. fruit2

. ON name1=name2;

fruit1:=

name1 qtty1

orange 1orange 1orange 4pomme 2ananas 1

fruit2:=

name2 qtty2

orange 1pomme 3kiwi 1

σname1=name2(fruit1 ./ fruit2)

⇔ fruit1 ./name1=name2 fruit2:=

name1 qtty1 name2 qtty2

orange 1 orange 1orange 1 orange 1orange 4 orange 1pomme 2 pomme 3ananas 1 None None

- Display the right outer join:%%sql SELECT*

. FROM

. fruit1

. RIGHT JOIN

. fruit2

. ON name1=name2;

fruit1:=

name1 qtty1

orange 1orange 1orange 4pomme 2ananas 1

fruit2:=

name2 qtty2

orange 1pomme 3kiwi 1

σname1=name2(fruit1 ./ fruit2)

⇔ fruit1 ./ name1=name2 fruit2:=

name1 qtty1 name2 qtty2

orange 4 orange 1orange 1 orange 1orange 1 orange 1pomme 2 pomme 3None None kiwi 1

22Postgresql automatically add 1 to the attribute’s name such that each column’s name is distinct.

22

Page 23: A small companion for the class Information Systems for

- Display the full outer join:%%sql SELECT*

. FROM

. fruit1

. RIGHT JOIN

. fruit2

. ON name1=name2;

fruit1:=

name1 qtty1

orange 1orange 1orange 4pomme 2ananas 1

fruit2:=

name2 qtty2

orange 1pomme 3kiwi 1

σname1=name2(fruit1 ./ fruit2)

⇔ fruit1 ./ name1=name2 fruit2:=

name1 qtty1 name2 qtty2

orange 1 orange 1orange 1 orange 1orange 4 orange 1pomme 2 pomme 3ananas 1 None NoneNone None kiwi 1

5.4.3 Equivalence between relational algebra formulas and SQL queries

SELECT* a ./a.key=b.key bFROM a INNER JOIN b ON a.key=b.key;

SELECT* a ./ a.key=b.key bFROM a RIGHT [OUTER]23 JOIN b ON a.key=b.key;

SELECT* a ./a.key=b.key bFROM a LEFT [OUTER]17JOIN b ON a.key=b.key;24

SELECT* relative complement of a in bFROM a RIGHT JOIN b ON a.key=b.key

WHERE a.key IS NULL;

SELECT* relative complement of b in aFROM a LEFT JOIN b ON a.key=b.key

WHERE b.key IS NULL;

SELECT* a ./ a.key=b.key bFROM a FULL JOIN b ON a.key=b.key;

5.5 Some additional definitions 25

referential-integrity constraints: we can declare that a value appearing in some at-tribute or set of attributes must also appear in the corresponding attribute-s of some tupleof the same or another relation. To do so, we use a REFERENCES or FOREIGN KEY declara-tion in the relation schema.

23by definition: left, right, full join are outer join, thus a RIGHT JOIN b is equivalent to a RIGHT OUTER

JOIN24SELECT* FROM a LEFT JOIN b ⇔ SELECT* FROM b RIGHT JOIN a: only the order of the attributes

in the table’s schema varies.25from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom,

chapter 7: Constraints and Triggers

23

Page 24: A small companion for the class Information Systems for

CREATE TABLE principal(

name CHAR(10),

address principal TEXT REFERENCES sources(address sources)

);

CREATE TABLE principal(

name CHAR(10),

address principal TEXT,

FOREIGN KEY (address principal) REFERENCES sources(address sources)

);

principal:=

name address principal

a

b

c

NULL

sources:=

col 1 col 2 address sources

a

b

c

address principal: is the FOREIGN KEY, and all values appearing in this attribute mustalso be found in address sources, except of the value NULL.

address sources: is the REFERENCES and must be declared UNIQUE or PRIMARY KEY forthe relation sources, ie. cannot have the value NULL.

constraints on attributes and tuples:- NOT NULL constraint,- attribute-based CHECK constraint,- tuple-based CHECK constraint.

attribute-based CHECK constraints: we can place a constraint on the value of an at-tribute by adding CHECK with the condition to the declaration of the relation itself. Thecondition must hold for every value of this attribute, and is checked every time a tuple isinserted into R and every time a tuple is updated.

tuple-based CHECK constraints: can be added or deleted with an ALTER statementfor the appropriate table. The condition is evaluated for the new or updated tuple. If thecondition is false for that tuple, then the constraint is violated and the insertion or updatestatement that caused the violation is rejected).

24

Page 25: A small companion for the class Information Systems for

6 Database design theory

6.1 First normal form

- a “plain” table with flat rows- no nested tables- no nested structures- atomic integrity

6.2 Functional dependency

The functional dependency is a relationship that exists when one attribute uniquely de-termines another attribute.In order to find out which functional dependencies a table fulfils, we can try to check ifan attribute seems to be a function of another attribute.A set of functional dependencies expresses constraints on its table: Given a relation R,given two subset of attributes A,B ⊆ AttributesRB is functionally dependent on A, denoted A →R B when ∀t1, t2 ∈ ExtensionR, t1.A =t2.A⇒ t1.B = t2.B

6.3 Transitivity rule

(A→ B)(B → C)⇒ (A→ C)

6.4 Splitting rule

(A→ B,C)⇒ (A→ B)(A→ C)

6.5 Combining rule

(A→ B)(A→ C)⇒ (A→ B,C)

6.6 Reflexivity rule

(A ⊆ B)⇒ (B → A)

6.7 Augmentation rule

(A→ B)⇒ (A,C → B,C)

25

Page 26: A small companion for the class Information Systems for

6.8 Minimal basis for a set of dependencies

Given a set of functional dependencies S a minimal basis for S is a set of functional de-pendencies such that:1. It is equivalent to S (we say it is a basis for S)2. All right-hand sides are singletons3. It is a minimal basis (any strict subset is not a basis for S any more)4. No attributes can be removed from left-hand-sides (doing so, it would no longer be abasis for S)

How to find a minimal basis?- use the splitting rule- eliminate trivial: delete all relations pointing on itself- simplify left hand side

Application exemple: if we have a table and a set of functional dependencies. Laterwe project the original table into a smaller table. We can compute how the functionaldependencies get projected if we compute the minimal basis of the original given table.

6.9 Super-key

A super-key is a set of attribute(s) within a table whose values can be used to uniquelyidentify a tuple. It is a set of attribute(s) that functionally determine all the table’sattributes.

6.10 Minimal Super-key

A minimal super-key is a set of attribute(s) within a table whose values can be usedto uniquely identify a tuple, with the extra condition that none of its strict subset is asuper-key.

Example: We have a table R(A,B,C,D,E, F ), with the following super-key : {A,C,D}and {A,D}.

The set {A,C,D} is not a minimal super-key on this table, because there exist a strictsubset {A,D} also being a super-key of this table. Thus the minimal super-key is {A,D}.

6.11 Candidate key

A candidate key is a minimal super-key holding for all relations defined on a universe.The distinction between the universe and a specific table is important. Indeed we mightlook at a specific table, ie. a relation and find on this table super-keys that hold on thisspecific table but not on the universe.

Example: a relation that a garage is using to have an overview of the car waiting to be re-paired, with -among others- the following super-keys; {carBrand, carY ear, repairOrder},{carBrand, carY ear}, {carBrand}, ...

So this specific table is populated with different cars, which happens to all be froma different brands, eg. Dodge, Toyota, Mitsubishi. But when we designed the schema ofthis universe, we logically included the highly probable case of two cars of the same brandhaving to be repaired in this garage.

Thus the super-key {carBrand} holds for this specific table, and is here for instancethe minimal super-key. Although it is not the candidate key, because it does not hold onthe universe U, but only on this specific table/relation of the universe. A careful readerwould have also noticed that {carBrand, carY ear} will not hold on the universe U, astwo different cars, from the same brand and the same year can be parked in this garage.

26

Page 27: A small companion for the class Information Systems for

6.12 Primary Key

A primary key is chosen by the database designer among the candidate key(s).

Example: Let’s imagine a travel agency, storing the following informations about theirswiss customers: first name, last name, birthdate, email address, passport identificationnumber, passport expiration date, AHV number, insurance name. We can imagine havingas candidate keys the following: {email Address}, {AHV Number}, {Passport Number}.

We will choose the most meaningful attribute to be our primary key. All of the abovethree attributes uniquely identify a customer, but maybe some are more meaningful forthe considered travel agency.

Indeed they hopefully will not need to use their AHV Number, thus even if we imag-ine for the sake of our example that this information has to be saved, the travel agencywill likely choose the {email Address} as primary key. Their focus is on the customer,and how to reach it, so using the email address seems a good choice. But this is a choicemade by the database designer, and there is not one perfect answer in favour of one orthe other.

6.13 BCNF

Boyce-Codd normal form: Any non-trivial Functional Dependency F →R G with F asuper-key. A table with two columns is always in Boyce-Codd normal form.

How to decompose a table that is not in BCNF into tables that are in BCNF?- find a functional dependency that breaks BCNF- project- iterate until BCNF

To prove that the decomposition of a given table is lossless, we use the Chase algorithm.Remember the impossibility triangle: we can only get 2 out of the following 3 character-istics at the same time: BCNF, Lossless join, Dependencies preserving.

6.14 Some additional definitions 26

goal: designing a relational database schema for an application

functional dependency FD: generalization of the idea of a key for a relation. a FD onR is a statement:”A1, A2, ..., An → B1, B2, ..., Bm”⇔ ”A1, A2, ..., An functionally determine B1, B2, ..., Bm”

A1, A2, ..., An → B1

A1, A2, ..., An → B2

...

A1, A2, ..., An → Bn

If two tuples agrees on all their attributes A1, A2, ..., An then they must also agree onanother list of attributes: B1, B2, ..., Bm. R satisfies the FD if it holds for every instanceof the relation R.

key: a set {A1, A2, ..., An } of on or more attributes is a key for a relation R if:- they functionally determine all attributes (all other attributes than A1, A2, ..., An) of therelation. There is no two distinct tuples that agree on all the A1, A2, ..., An.- no proper subset determine functionally all other attributes → a key must be minimal.If there is more than one key for a relation, we will call one of them as primary key.

superkey: superset of a key: set of attributes that contains a key. Every key is asuperkey. Not every superkey is a minimal key.

26from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom,chapter 3: Design Theory for Relational Databases

27

Page 28: A small companion for the class Information Systems for

notation may differ:

book lecture

key candidate key (minimal)superkey key

equivalent: two sets of FD’s S and T are equivalent if the set of relation instancessatisfying S is exactly the same as the set of relation instances satisfying T. are equivalentiif S follows from T, and T follows from S.

rules:- transitive rule: A→ B, B → C ⇔ A→ C- splitting rule: A1, A2, ..., An → B1, B2, ..., Bm ⇔ ∀i ∈ {1,m} A1, A2, ..., An → Bi

- combining rule: ∀i ∈ {1,m} A1, A2, ..., An → Bi ⇔ A1, A2, ..., An → B1, B2, ..., Bm

trivial functional dependencies: trivial ≡ it holds for every instance of the relation.A1, A2, ..., An → B1, B2, ..., Bm such that {B1, B2, ..., Bm} ⊆ {A1, A2, ..., An}a trivial FD has a right side that is a subset of its left side.

trivial-dependency rule: when some B are the same as A -but not all of them-A1, A2, ..., An → B1, B2, ..., Bm ⇔ A1, A2, ..., An → C1, C2, ..., Ck

where the C are all the B that are not also A.

closure of {A1, A2, ..., An} : {A1, A2, ..., An}+ under the FD’s in S is the set of attributes Bsuch that every relation that satisfies all the FD’s in set S also satisfies A1, A2, ..., An → B.A1, A2, ..., An → B follows from the FD’s of S.

eg. we want to know if A,B → D and A,B → G follow from the given FD’s.We compute {A,B}+ which is {A,B,C,D,H}. Since D is a member of the closure, thenA,B → D follows from the FD’s, but since G is not a member of the closure then A,B → Gdoes not follow.

closure and keys: {A1, A2, ..., An}+ is the set of all attributes of a relation iif A1, A2, ..., An

is a superkey for the relation, ie. A1, A2, ..., An functionally determine all the other at-tributes.

eg. we want to test if A1, A2, ..., An is a key for a relation: 1. {A1, A2, ..., An}+ is allattributes. 2. for no set X formed by removing one attribute from {A1, A2, ..., An} is X+

the set of all attributes.

basis: for a given set S of FD’s, any set of FD’s equivalent to S is called a basis forS.convention: we only consider basis with singleton right hand side.

minimal basis for a relation is a basis B that satisfies:- all the FD’s in B have singleton right sides.- if any FD is removed from B, the result is no longer a basis.- if for any FD in B we remove one or more attributes from the left side, the result is nolonger a basis.(no trivial FD can be in a minimal basis).

Armstrong’s axioms: reflexivity, augmentation, transitivity.

projection of functional dependencies S: all the functional dependencies that fol-low from S and that only involve attributes in the projection of the corresponding table.1/ list all subsets of the projected schema, ie. compute the power set of the projectedschema2/ compute the closure of all the sets3/ remove from the RHS the attributes that are not in the projected schema4/ compute the minimal basis

design of good relation schemas to avoid anomalies.

28

Page 29: A small companion for the class Information Systems for

anomalies: - redundancy- update anomalies- deletion anomalies

decomposition: of a relation in two, by splitting the attributes to make the schemasof two new relations.

Boyce-Codd Normal Form: decompose the relation by several that does not haveanomalies,ie. relations that fullfill the condition BCNF: a relation R is in BCNF iif whenever thereis a nontrivial FD A1, A2, ..., An → B1, B2, ..., Bm for R, it is the case that {A1, A2, ..., An}is a superky for R.The left side of every nontrivial FD must be a superkey (no need to be minimal), ie. mustcontain a key.

BCNF decomposition algorithm:INPUT: a relation R0 with a set of functional dependencies S0OUTPUT: a decomposition of R0 into a collection of relations, all of which are in BCNF.METHOD: recursively, starting with R = R0 and S = S0- check wether R is BCNF : if yes return R.- for a BCNF violation X → Y compute X+, choose R1 = X+, and R2 = X + (R\X+)- compute S1 := set of FD of R1, ie the projection of the FD on R1

- compute S2 := set of FD of R2, ie the projection of the FD on R2

- recursively decompose R1 and R2

- return the union of the results of these decompositions

29

Page 30: A small companion for the class Information Systems for

7 Databases and host languages

7.1 The three-level architecture

Clients. m

Web servers

Application servers

Database servers

. mDatabase

7.2 ACID

- atomicity: a transaction gets fully executed or not at all.- consistency: database is consistent at all times.- isolation: ”as if” transactions were executed serially.- durability: a change will not be lost (power outrage, ...).

7.3 Some additional definitions 27

three-tier architectures: Large database installations that support large-scale user in-teractions over the Web commonly use three tiers of processes: web servers, applicationservers, and database servers. There can be many processes active at each tier, and theseprocesses can be at one processor or distributed over many processors. The processes mayall run on the same processor in a small system, but it is common to dedicate a large

27from the Book: ”Database Systems: The Complete Book”, H. Garcia-Molina, J.D. Ullman, J. Widom,chapter 9: SQL in a Server Environment

30

Page 31: A small companion for the class Information Systems for

number of processors to each of the tiers.

web servers: These are processes that connect clients to the database system, usu-ally over the Internet or possibly a local connection. The web-server processes managethe interactions with the user. When a user makes contact, perhaps by opening a URL,a web server responds to the request. The user then becomes a client of this web-serverprocess. Typically, the client’s actions are performed by the web-browser, e.g., managingof the filling of forms, which are then posted to the web server.

application servers: The job of the application tier is to turn data, from the database,into a response to the request that it receives from the web-server. Each web-server processcan invoke one or more application-tier processes to handle the request; these processescan be on one machine or many, and they may be on the same or different machines fromthe web-server processes. The actions performed by the application tier are often referredto as the business logic of the organization operating the database. In a simple system,the application tier may issue database queries directly to the database tier, and assemblethe results of those queries, perhaps in an HTML page. In a more complex system, therecan be several subtiers to the application tier, and each may have its own processes.

database servers: These processes run the DBMS and perform queries and modifica-tions at the request of the application servers. Since creating connections to the databasetakes significant time, we normally keep a large number of connections open and allowapplication processes to share these connections.

database environment: An installation using a SQL DBMS creates a SQL environ-ment. Within the environment, database elements such as relations are grouped into(database) schemas, catalogs, and clusters. A catalog is a collection of schemas, and acluster is the largest collection of elements that one user may see.

SQL environment: is the framework under which data may exist and SQL operationson data may be executed. In practice, we should think of a SQL environment as a DBMSrunning at some installation. For example, ABC company buys a license for the Mega-tron 2010 DBMS to run on a collection of ABC’s machines. The system running on thesemachines constitutes a SQL environment.All the database elements we have discussed — tables, views, triggers, and so on — aredefined within a SQL environment. These elements axe organized into a hierarchy ofstructures, each of which plays a distinct role in the organization. The structures definedby the SQL standard are:

impedance mismatch: The data model of SQL is quite different from the data mod-els of conventional host languages. Thus, information passes between SQL and the hostlanguage through shared variables that can represent components of tuples in the SQLportion of the program.

31

Page 32: A small companion for the class Information Systems for

embedded SQL: Instead of using a generic query interface to express SQL queries andmodifications, it is often more effective to write programs that embed SQL queries in aconventional host language. A preprocessor converts the embedded SQL statements intosuitable function calls of the host language.

JDBC: Java Database Connectivity is a facility similar to CLI for allowing Java pro-grams to access SQL databases. It is a collection of Java classes.The first steps we must take to use JDBC are:1. include the line: import java.sql.*; to make the JDBC classes available to yourJava program.2. Load a “driver” for the database system we shall use. The driver we need dependson which DBMS is available to us, but we load the needed driver with the statement:Class.forName(<driver name>);

3. Establish a connection to the database. A variable of class Connection is created if weapply the method getConnection to DriverM anager. The Java statem ent to establisha connection looks like:Connection myCon = DriverM anager.getConnection(<URL>, <username>, <password>);

That is, the method getConnection takes as arguments the URL for the database to whichyou wish to connect, your user name, and your password. It returns an object of classConnection, which we have chosen to call myCon.

There are two methods we can apply to a Connection object in order to create state-ments: 1. createStatement() returns a Statement object. This object has no associatedSQL statement yet. 2. prepareStatement(Q), where Q is a SQL query passed as a stringargument, returns a PreparedStatement object.

There are four different methods that execute SQL statements: executeQuery(), executeQuery(Q),executeUpdate(U), executeUpdate().

PHP: is a scripting language for helping to create HTML Web pages. It provides supportfor database operations through an available library, much as JDBC does. It is anotherpopular system for implementing a call-level interface. This language is found embeddedin HTML pages and enables these pages to interact with a database. All SQL statementsare then referred to as “queries” and are executed by the function query, which takes thestatement as an argument and is applied to the connection variable.28

7.4 Review questions

How is SQL typically interfaced with host languages?→ With a library providing functions that take SQL queries as string parameters.→ Inlining SQL queries into a host programming languages such as C++, with a prepro-cessor translating them to library calls.→With libraries providing iterators with which one can imperatively iterate over the rowsand the columns of a relation output by a SQL query.

28All PHP code is intended to exist inside HTML text. A browser will recognize that text is PHP codeby placing it inside a special tag, which looks like: <?php "PHP code goes here" ?>

32

Page 33: A small companion for the class Information Systems for

8 Transactions

What happens if multiple users are reading and updating a database at the same time?

8.1 Let’s imagine...

Let’s image a simple case with only two users, Alice and Bob, two students of the ISE classthat developed a movies’s database during the last month, as a result of great interest forthe class and social distancing.They still need to recall the concepts of foreign keys, so they implemented only one table,and populated it with some movies of Buster Keaton to begin with.

In the future they want to watch the movies written in the table (Alice or Bob or both)and they want to give it a grade between 1 and 6, thus they could later easily choose theirmovie-evening.

NAME OF THE MOVIE GRADE

The Frozen North NoneSeven Chances NoneOur Hospitality NoneBattling Butler NoneSteamboat Bill Jr. None

8.2 Conflicts / Dependencies

We are Monday 20th April.Alice and Bob watch separately (#socialdistancing) a movie in the weekend. Now theyhave a bit a time between two zoom classes and want to update their table, at the sametime, (not knowing about the other) and give a grade to the movie they watched. Let’ssee five scenarii.

8.2.1 updating-only two different tuples

Alice watched The Frozen North and wants to give it a 4.

Bob watched Seven Chances and wants to give it a 5.

NAME OF THE MOVIE GRADE

The Frozen North 4Seven Chances 5Our Hospitality NoneBattling Butler NoneSteamboat Bill Jr. None

8.2.2 Alice reads before Bob updates

Alice wants to read the grade of Seven Chances .

Bob watched Seven Chances and wants to give it a 5.

READ-WRITE conflict

She will see the following:NAME OF THE MOVIE GRADE

The Frozen North NoneSeven Chances NoneOur Hospitality NoneBattling Butler NoneSteamboat Bill Jr. None

33

Page 34: A small companion for the class Information Systems for

8.2.3 Alice reads after Bob updates

Bob watched Seven Chances and wants to give it a 5.

Alice wants to read the grade of Seven Chances .

WRITE-READ conflict

She will see the followingNAME OF THE MOVIE GRADE

The Frozen North NoneSeven Chances 5Our Hospitality NoneBattling Butler NoneSteamboat Bill Jr. None

8.2.4 Alice updates before Bob updates

Alice watched Our Hospitality and wants to give it a 5.

Bob watched Our Hospitality and wants to give it a 6.

WRITE-WRITE conflict

At the end the table is the following:NAME OF THE MOVIE GRADE

The Frozen North NoneSeven Chances NoneOur Hospitality 6Battling Butler NoneSteamboat Bill Jr. None

8.2.5 Alice updates after Bob updates

Bob watched Our Hospitality and wants to give it a 6.

Alice watched Our Hospitality and wants to give it a 5.

WRITE-WRITE conflict

At the end the table is the following:NAME OF THE MOVIE GRADE

The Frozen North NoneSeven Chances NoneOur Hospitality 5Battling Butler NoneSteamboat Bill Jr. None

We saw READ-WRITE, WRITE-READ, and WRITE-WRITE dependencies. Would itmake sense to have a READ-READ dependency? No, because a READ transaction doesnot modify the table, thus having different users reading from the same table in differentorders always display the same output.

8.3 Timeline/ Schedule / History of Transactions

Now let’s represent the prior cases in a timeline manner. The following illustrates thehistory of transactions. Imagine a time axis, the first action being the first row, and thelast action being the last row. This is also called a schedule.

You will need to understand theses schedules, how to draw a graph with the queries(as shown in the next subsection 8.4), how to draw a graph with the users (as shownin the subsection 8.5), and with the help of those graphs, to conclude if the schedule isserializable or not, and if yes, you might be asked to make it serial.

34

Page 35: A small companion for the class Information Systems for

8.3.1 updating-only two different tuples

Alice Bob

WA(The Frozen North)WB(Seven Chances)

8.3.2 Alice reads before Bob updates

Alice Bob

RA(Seven Chances)WB(Seven Chances)

8.3.3 Alice reads after Bob updates

Alice Bob

WB(Seven Chances)RA(Seven Chances)

8.3.4 Alice updates before Bob updates

Alice Bob

WA(Our Hospitality)WB(Our Hospitality)

8.3.5 Alice updates after Bob updates

Alice Bob

WB(Our Hospitality)WA(Our Hospitality)

8.4 Graph with the queries

We implement the dependencies presented in the section 8.2 in a directed graph manner.Here the nodes are the queries, and the edges are the dependencies between them.

8.4.1 updating-only two different tuples

Alice Bob

WA(The Frozen North)WB(Seven Chances)

8.4.2 Alice reads before Bob updates

Alice Bob

RA(Seven Chances)WB(Seven Chances)

8.4.3 Alice reads after Bob updates

Alice Bob

WB(Seven Chances)RA(Seven Chances)

8.4.4 Alice updates before Bob updates

Alice Bob

WA(Our Hospitality)WB(Our Hospitality)

35

Page 36: A small companion for the class Information Systems for

8.4.5 Alice updates after Bob updates

Alice Bob

WB(Our Hospitality)WA(Our Hospitality)

8.5 Graph with the users

Here the nodes are the users, eg. Alice and Bob, and the edges are the dependenciesbetween them.

8.5.1 updating-only two different tuples

8.5.2 Alice reads before Bob updates

8.5.3 Alice reads after Bob updates

8.5.4 Alice updates before Bob updates

8.5.5 Alice updates after Bob updates

This graph with the users as nodes is helping us to decide wether or not the scheduleis serializable. If we have some cycle, there is no way to detangle the arrows, and theschedule is not serializable. In the case, we have no cycle, than we could detangle thearrows, which means : the schedule is serializable. If a schedule is serializable, we are ableto modify the schedule to obtain a conflict-equivalent serial schedule.

36

Page 37: A small companion for the class Information Systems for

8.6 Detecting cycles

Here we have a graph with two nodes and a cycle.

Here we have a graph with 5 nodes, 4 edges and no cycle.

Here we have a graph with 5 nodes, 5 edges, and a cycle highlighted in orange.

37

Page 38: A small companion for the class Information Systems for

8.7 Detailed example: written support for the video

You can find here the transcript of the example written on the blackboard of the videorecording of last year.

8.7.1 Timeline

Alice Bob Charles

RA(α)

WB(α)

RB(β)

WC(α)

RA(β)

WC(β)

RA(α)

WC(α)

WB(β)

8.7.2 Graph with the queries

Alice Bob Charles

RA(α).WB(α)

RB(β).WC(α)

RA(β).WC(β)

.RA(α)

.WC(α)

.WB(β)

38

Page 39: A small companion for the class Information Systems for

8.7.3 Graph with the users

We obtain a complete graph. There are cycles everywhere, thus the schedule is not seri-alizable.

8.8 Review questions

Serializable schedules:

Bob reads A Bob reads A Bob reads A Bob reads ABob writes A Bob writes A Bob writes A Alice reads BAlice writes A Alice writes A Alice reads A Bob writes ABob reads A Alice reads A Bob reads A Alice writes BAlice reads A Alice writes A Bob reads A

Alice reads B(i.) (ii.) (iii.) (iv.)

(ii.)→ already serial, all actions together of one person and then all actions togetherof the other person.(iv.)→ serializable, there is no interaction between Alice and Bob. We can swap Alice andBob.(iii.)→ serializable, we swap Alice and Bob reading transaction, there is no dependenciesbetween them.(i.)→ not serializable, we have to have Bob-Alice-Bob, we cannot swap Alice and Bob.There is a write-write conflict (Bob writes A. Alice writes A) and a write-read conflict(Alice writes A. Bob reads A).

8.9 Some additional definitions 29

transaction: group of one or more database operations, which is a unit of work thatmust be executed atomically and in apparent isolation from other transactions.

transactions: SQL allows the programmer to group SQL statements into transactions,which may be committed or rolled back (aborted). Transactions may be rolled back bythe application in order to undo changes, or by the system in order to guarantee atomicityand isolation.

consistent database states: Database states that obey whatever implied or declaredconstraints the designers intended are called consistent. It is essential that operations onthe database preserve consistency, that is, they turn one consistent database state intoanother.

consistency of concurrent transactions: It is normal for several transactions to haveaccess to a database at the same time. Transactions, run in isolation, are assumed to pre-serve consistency of the database. It is the job of the scheduler to assure that concurrentlyoperating transactions also preserve the consistency of the database.

29from Database Systems, Garcia-Molina

39

Page 40: A small companion for the class Information Systems for

schedules: Transactions are broken into actions, mainly reading and writing from thedatabase. A sequence of these actions from one or more transactions is called a schedule.

serial schedules: If transactions execute one at a time, the schedule is said to be serial.If its actions consist of all the actions of one transaction, then all the actions of anothertransaction, and so on. No mixing of the actions is allowed.

notations:- An action is an expression of the form ri(X) or Wi(X), meaning that transaction Ti,reads or writes, respectively, the database element X .- A transaction Ti is a sequence of actions with subscript i.- A schedule S of a set of transactions T is a sequence of actions, in which for each trans-action Ti in T, the actions of Ti appear in S in the same order that they appear in thedefinition of Tj itself. We say that S is an interleaving of the actions of the transactionsof which it is composed.

serializable schedules: A schedule that is equivalent in its effect on the database tosome serial schedule is said to be serializable. Interleaving of actions from several trans-actions is possible in a serializable schedule that is not itself serial, but we must be verycareful what sequences of actions we allow, or an interleaving will leave the database inan inconsistent state.

conflict-serializability: stronger than the general notion of serializability. A simple-to-test, sufficient condition for serializability is that the schedule can be made serial by asequence of swaps of adjacent actions without conflicts. Such a schedule is called conflict-serializable. A conflict occurs if we try to swap two actions of the same transaction, or toswap two actions th at access the same database element, at least one of which actions isa write.

We say that two schedules are conflict-equivalent if they can be turned one into the otherby a sequence of nonconflicting swaps of adjacent actions. We shall call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule. Note that conflict-serializabilityis a sufficient condition for serializability; i.e., a conflict-serializable schedule is a serializ-able schedule. Conflict-serializability is not required for a schedule to be serializable, butit is the condition th at the schedulers in commercial systems generally use when theyneed to guarantee serializability.

Any two actions of different transactions may be swapped unless:- They involve the same database element- At least one is a write

locking: The most common approach to assuring serializable schedules is to lock databaseelements before accessing them, and to release the lock after finishing access to the ele-ment. Locks on an element prevent other transactions from accessing the element.

optimistic concurrency control: Instead of locking, a scheduler can assume trans-actions will be serializable, and abort a transaction if some potentially non-serializablebehavior is seen. This approach, called optimistic, is divided into timestamp-based, andvalidation-based scheduling.

validation-based schedulers: These schedulers validate transactions after they haveread everything they need, but before they write. Transactions that have read, or willwrite, an element that some other transaction is in the process of writing, will have anambiguous result, so the transaction is not validated. A transaction that fails to validateis rolled back.

40

Page 41: A small companion for the class Information Systems for

9 Views and indices

9.1 SQL statement to create a view

CREATE VIEW viewname ASSELECT column1, column2, ...FROM tablenameWHERE condition

9.2 Some additional definitions 30

relations in SQL:- tables: stored relations, can be modified and queried. %sql CREATE TABLE

- views: defined by a computation, not stored, but constructed when needed.- temporary tables: constructed by the SQL processor when he executes queries and datamodifications.

updatable views: Some virtual views on a single relation are updatable meaning thatwe can insert into, delete from, and update the view as if it were a stored table. Theseoperations are translated into equivalent modification to the base table over which theview is defined.

indexes: While not of the SQL standard, commercial SQL systems allow the declara-tion of indexes on attributes. The existence of an index on an attribute may speed upgreatly the execution of those queries in which a value, or range of values, is specifiedfor that attribute, and may speed up joins involving that attribute as well. On the otherhand, every index built for one or more attributes of some relation makes insertions, dele-tions, and updates to that relation more complex and time-consuming. Some DBMS’soffer tools that choose indexes for a database automatically. They examine the typicalqueries and modifications performed on the database and evaluate the cost trade-offs fordifferent indexes that might be created.The typical relation is stored over many disk blocks (pages), and the principal cost ofa query or modification is often the number of pages that need to be brought to mainmemory. Thus, indexes that let us find a tuple without examining the entire relation cansave a lot of time. However, the indexes themselves have to be stored, at least partially, ondisk, so accessing and modifying the indexes themselves cost disk accesses. In fact, modi-fication, since it requires one disk access to read a page and another disk access to writethe changed page, is about twice as expensive as accessing the index or the data in a query.

cost with/without indexes:Relation: StarsIn(movieTitle, movieYear, starName)

Query 1: SELECT movieTitle, movieYear

. FROM StarsIn

. WHERE starName = s;

Query 2: SELECT starName

. FROM StarsIn

. WHERE movieTitle = t AND movieYear = y;

Insertion: INSERT INTO StarsIn VALUES (t,y,s);

fraction of time

Query 1 p1Query 2 p2Insertion 1− p1 − p2

30from Database Systems, Garcia-Molina

41

Page 42: A small companion for the class Information Systems for

Let us make the following assumptions about the data:

1. StarsIn occupies 10 pages, so if we need to examine the entire relation the cost is10.

2. On the average, a star has appeared in 3 movies and a movie has 3 stars.

3. Since the tuples for a given star or a given movie are likely to be spread over the10 pages of StarsIn, even if we have an index on starName or on the combination ofmovieTitle and movieYear, it will take 3 disk accesses to find the (average of) 3 tuplesfor a star or movie. If we have no index on the star or movie, respectively, then 10 diskaccesses are required.

4. One disk access is needed to read a page of the index every time we use that in-dex to locate tuples with a given value for the indexed attribute(s). If an index page mustbe modified (in the case of an insertion), then another disk access is needed to write backthe modified page.

5. Likewise, in the case of an insertion, one disk access is needed to read a page onwhich the new tuple will be placed, and another disk access is needed to write back thispage. We assume that, even without an index, we can find some page on which an addi-tional tuple will fit, without scanning the entire relation.

no index star index movie index both indexes

Query 1 1031 432 10 4Query2 10 10 4 4Insertion 233 434 4 635

Average 2 + 8p1 + 8p236 4 + 6p2 4 + 6p1 6− 2p1− 2p2

9.3 Review questions

Are there SQL queries thant can be evaluated without even reading any table data fromthe disk?→ yes, if the indices contains all the necessary information.

9.4 Review questions

Which queries can be evaluation very fast with an index?→ SELECT name FROM persons WHERE name = ’Einstein’→ SELECT* FROM persons→ SELECT AVG(age) FROM persons GROUP BY field HAVING country=’Switzerland’→ SELECT name FROM persons ORDER BY age LIMIT 1 OFFSET 2

9.5 Review questions

Updating views:→ a view is a pre-defined query. Updating a view is possible under few restrictions:- only one table name in the FROM clause- all attributes in the SELECT clause (except those who can be NULL or have defaultvalue)- no recursion in the WHERE clause

31scan the entire relation: 10 disk accesses: cost=1032accessing one index page to find 3 tuples + 3 accesses to find those tuples in the relation: cost=1+333one disk access to find a new page + one disk access to write back this page: cost=234read + write a page for the index and read + write a page for the data: cost=1+1+1+135read+write a page for index on star and read+write a page for index on movie and read+write a page

for data: cost=2+2+23610p 1 + 10p 2 + 2(1− p 1− p 2) cf. fraction time probabilites table

42

Page 43: A small companion for the class Information Systems for

10 Storage and architecture

10.1 The levels at which data is stored (disk, memory, etc.)

10.2 Some additional definitions 37

sequential files: Several simple file organizations begin by sorting the data file accordingto some sort key and placing an index on this file.

B-trees: These structures are essentially multilevel indexes, with graceful growth ca-pabilities. Blocks with n keys and n + 1 pointers are organized in a tree, with the leavespointing to records. All non-root blocks are between half-full and completely full at alltimes.B-trees automatically maintain as many levels of index as is appropriate for the size of thefile being indexed. B-trees manage the space on the blocks they use so that every block isbetween half used and completely full.

A B-tree organizes its blocks into a tree that is balanced, meaning that all paths fromthe root to a leaf have the same length. Typically, there are three layers in a B-tree: theroot, an intermediate layer, and leaves, but any number of layers is possible. There is aparameter n associated with each B-tree index, and this parameter determines the layoutof all blocks of the B-tree. Each block will have space for n search-key values and n + 1pointers. We pick n to be as large as will allow n - 1-1 pointers and n keys to fit in one block.

There are several important rules about what can appear in the blocks of a B-tree:- The keys in leaf nodes are copies of keys from the data file. These keys are distributedamong the leaves in sorted order, from left to right.- At the root, there are at least two used pointers. All pointers point to B-tree blocks atthe level below.- At a leaf, the last pointer points to the next leaf block to the right, i.e., to the blockwith the next higher keys. Among the other n pointers in a leaf block, at least [n + 1)/2]of these pointers are used and point to data records; unused pointers are null and do notpoint anywhere. The ith pointer, if it is used, points to a record with the ith key.

The B-tree is a powerful tool for building indexes. The sequence of pointers at the leavesof a B-tree can play the role of any of the pointer sequences coming out of an index file:1. The search key of the B-tree is the primary key for the data file, and the index is dense.That is, there is one key-pointer pair in a leaf for every record of the data file. The datafile may or may not be sorted by primary key.2. The data file is sorted by its primary key, and the B-tree is a sparse index with onekey-pointer pair at a leaf for each block of the data file.3. The data file is sorted by an attribute that is not a key, and this attribute is the

37from Database Systems, Garcia-Molina, chapter 14: Index Structures

43

Page 44: A small companion for the class Information Systems for

Figure 1: B-Tree

Figure 2: Interior node of a B-Tree

search key for the B-tree. For each key value K that appears in the data file there is onekey-pointer pair at a leaf. T hat pointer goes to the first of the records th at have K astheir sort-key value.

B-trees for range queries: B-trees are useful not only for queries in which a singlevalue of the search key is sought, but for queries in which a range of values are asked for.Typically, range queries have a term in the WHERE clause that compares the search keywith a value or values, using comparison operator(s) <,>=,= ...If we want to find all keys in the range [a, 6] at the leaves of a B-tree, we do a lookup tofind the key a. Whether or not it exists, we are led to a leaf where a could be, and wesearch the leaf for keys that are a or greater. Each such key we find has an associatedpointer to one of the records whose key is in the desired range. As long as we do notfind a key greater than b in the current block, we follow the pointer to the next leaf andrepeat our search for keys in the range [a, 6], The above search algorithm also works if bis infinite; i.e., there is only a lower bound and no upper bound. In that case, we searchall the leaves from the one th at would hold key a to the end of the chain of leaves. Ifa is −∞ (there is an upper bound on the range but no lower bound), then the searchfor “minus infinity” as a search key will always take us to the first leaf. The search thenproceeds as above, stopping only when we pass the key b.

hash tables: We can create hash tables out of blocks in secondary memory, much aswe can create main-memory hash tables. A hash function maps search-key values to buck-ets, effectively partitioning the records of a data file into many small groups (the buckets).Buckets are represented by a block and possible overflow blocks.

linear hashing: This method grows the number of buckets by 1 each time the ratioof records to buckets exceeds a threshold. Since the population of a single bucket cannotcause the table to expand, overflow blocks for buckets are needed in some situations.

10.3 Some additional definitions 38

memory hierarchy: A computer system uses storage components ranging over manyorders of magnitude in speed, capacity, and cost per bit. From the smallest/most expen-sive to largest/cheapest, they are: cache, main memory, secondary memory (disk), and

38from Database Systems, Garcia-Molina, chapter 13: Secondary Storage Management

44

Page 45: A small companion for the class Information Systems for

Figure 3: Leaf of a B-Tree

tertiary memory.

disks/secondary storage: Secondary storage devices are principally magnetic diskswith multigigabyte capacities. Disk units have several circular platters of magnetic mate-rial, with concentric tracks to store bits.

blocks and sectors: Tracks are divided into sectors, which are separated by unmag-netized gaps. Sectors are the unit of reading and writing from the disk. Blocks are logicalunits of storage used by an application such as a DBMS. Blocks typically consist of severalsectors.

disk access time: The latency of a disk is the time between a request to read or write ablock, and the time the access is completed.

45

Page 46: A small companion for the class Information Systems for

11 Data Cubes

11.1 OLTP environment

OLTP: OnLine Transaction Processing:- consistent and reliable record-keeping- transactions and results on small portions of data- lots of transactions on small portions of data- normalized data

11.2 OLAP environment

OLAP: OnLine Analytical Processing:- OLAP is a category of software that allows users to analyse information from multipledatabase systems at the same time. It is a technology that enables analysts to extractand view business data from different points of view.- large portions of the data- possibly many joins- few long heavy queries

11.3 Data cube model

Data is stored in multidimensional hypercubes.Different operations performed on data cubes:- slicing- dicing- crosstabulating

11.4 Roll-up

Roll-up is also known as ”consolidation” or ”aggregation.” The Roll-up operation can beperformed by reducing dimensions.It refers to the process of viewing data with decreasing detail.

11.5 Drill-down

In drill-down, data is fragmented into smaller parts. It is the opposite of the rollup process.It can be done by increasing a dimension.It refers to the process of viewing data at a level of increased detail.

46

Page 47: A small companion for the class Information Systems for

11.6 Let’s imagine

Let’s imagine we are a swiss company selling machines, among which the three following:3D printer, CNC, Laser Cutter.

Figure 4: 3D printer, CNC, Laser Cutter

We have 4 different stores, each one in a different city: Zurich, Geneva, Bern, Lugano.

We want to review our data, and interpret them to take major decision like ”Shouldwe stop selling 3D printers?” or ”Should we close our store in Lugano?”, etc...

We are using the simple following table:

year store product quantitySold revenue2020 Zurich 3D printer 43 25.6002020 Zurich CNC 12 740.000

...2012 Bern Laser Cutter 4 200.0002012 Lugano Laser Cutter 2 100.000

...

This table chSales has three dimensions (in blue), and two value columns (in green).year store product quantitySold revenue

If we want to take a decision regarding the 3D printer:

- First we fix the product = ’3D printer’ in a WHERE-clause

- Then we choose from the value columns (green) the one that is meaningful. Here thequantitySold. For this simplified example, the value column as given is already meaning-ful, but maybe in a more realistic example where every item sold corresponds to a distinctentry in the table, we will need to aggregate the data with some operator SUM(), MiN(),MAX()...

- Finally we choose from the dimensions (blue) the -often two- dimensions we want todisplay. Here we might choose to compare the quantitySold of 3D printer for each yearand store. We say we dice the data on the year and store dimensions. In SQL we will findthose dimensions in the SELECT-clause and the GROUP BY-clause.

Targeting the quantity sold, we dice the data on the year and store dimensions, andslice the product to 3D printer:

47

Page 48: A small companion for the class Information Systems for

SELECT year, store, quantitySoldFROM chSalesWHERE product = ’3D printer’GROUP BY year, store

The cross tabulation report of the above query is:

Slice Valueproduct 3D printer

store Zurich Geneva Bern Luganoyear2020 43 55 20 82019 36 46 4 22018 23 44 7 52017 20 36 4 12016 18 37 9 62015 18 38 3 32014 25 27 7 12013 5 15 1 62012 7 2 0 2

Now imagine that we want to compare the quantity sold of all our products in all ourstores, during the year 2020.Targeting the quantity sold, we dice the data on the product and store dimensions, andslice the year to 2020:

SELECT product, store, quantitySoldFROM chSalesWHERE year = ’2020’GROUP BY product, store

The cross tabulation report of the above query is:

Slice Valueyear 2020

store Zurich Geneva Bern Luganoproduct3D printer 43 55 24 8CNC 12 7 4 2Laser Cutter 5 6 3 1

To remember:the slicing is done in the WHERE-clauseaggregation appears both in the SELECT-clause with an operator SUM(),... and in theGROUP BY

11.7 Some additional definitions 39

SQL recursive queries: In SQL, one can define a relation recursively — th at is, interms of itself. Or, several relations can be defined to be mutually recursive.Monotonicity: Negations and aggregations involved in a SQL recursion must be monotone— inserting tuples in one relation does not cause tuples to be deleted from any relation,including itself. Intuitively, a relation may not be defined, directly or indirectly, in termsof a negation or aggregation of itself.

OLAP: An important application of databases is examination of data for patterns or

39from Database Systems, Garcia-Molina, chapter 10

48

Page 49: A small companion for the class Information Systems for

trends. This activity, called OLAP (standing for On-Line Analytic Processing and pro-nounced “oh-lap”), generally involves highly complex queries that use one or more aggrega-tions. These queries are often termed OLAP queries or decision-support queries.Decision-support queries typically examine very large amounts of data, even if the query resultsare small. On-line analytic processing involves complex queries that touch all or much ofthe data, at the same time. Often, a separate database, called a data warehouse, is con-structed to run such queries while the actual database is used for short-term transactions(OLTP, or on-line transaction processing).

ROLAP and MOLAP: It is frequently useful, for OLAP queries, to think of the dataas residing in a multidimensional space, with dimensions corresponding to independentaspects of the data represented. Systems that support such a view of data take eithera relational point of view (ROLAP, or relational-OLAP systems), or use the specializeddata-cube model (MOLAP, or multidimensional-OLAP systems).

cube operator: A specialized operator called cube pre-aggregates the fact table alongall subsets of dimensions. It may add little to the space needed by the fact table, andgreatly increases the speed with which many OLAP queries can be answered.We can turn the result of a query into a data cube by appending WITH CUBE to a GROUP

BY clause. We can also construct a portion of the cube by using WITH ROLLUP there.

49