80
Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Embed Size (px)

Citation preview

Page 1: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1

Chapter 2

Databases & Data Warehouses

Page 2: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 2

Outline

Database Concepts Steps in Database Design Entity-Relationship Model Logical Database Design

Relational Model Object-Relational Model

Physical Database Design Data Warehouse Concepts

Logical Data Warehouse Design Physical Data Warehouse Design Data Warehouse Architecture

Page 3: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 3

Steps in Database Design

Requirements specification: Collects information about users’ needs with respect to the database system

Conceptual design: Builds a user-oriented representation of the database without any implementation considerations

Logical design: Translates the conceptual schema from the previous phase into an implementation model common to several DBMSs, e.g., relational or object-relational

Physical design: Customizes the logical schema from the previous phase to a particular platform, e.g., Oracle or SQL Server

Page 4: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 4

Entity Relationship Model: University Example

Project

Project idProject acronymProject nameDescription (0,1)Start dateEnd dateBudgetFunding agency

(0,n)

Academic staff

Employee no Name First name Last nameAddressEmail HomepageResearch areas (1,n)

Course idCourse nameLevel

(1,n)(0,n)

(partial,exclusive)

(0,n)

Start dateEnd date

Is a prereqHas prereq

Course

Professor

Status Tenure date (0,1) /No courses

Assistant

Thesis titleThesis description

(0,n)(1,1)

Advisor

SemesterYearHomepage (0,1)

Participates

Prerequisite

(1,1)(1,n)

Section

CouSec

(0,n)(1,1)

Teaches

Page 5: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 5

ER Model: Entity and Relationship Types (1)

Entity Relationship Model: Most often used conceptual model for database design

Entity type: Represents a set of real-world objects of interest to an application

Entity: object belonging to an entity type Relationship type: Represents an association

between objects Relationship: association belonging to a relationship type

Role: Participation of an entity type in a relationship type

Cardinalities in a role: Minimum and maximum number of times that the entity may participate in the relationship type

Page 6: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 6

ER Model: Entity and Relationship Types (2)

Types of roles Optional vs mandatory: minimum cardinality 0 vs 1 Monovalued vs multivalued: maximum cardinality 1 vs n

Relationship types Binary vs n-ary: two vs more than two participating entity

types Binary relationship types:

One-to-one, one-to-many, or many-to-many depending on the maximum cardinality of the two roles

Recursive relationship types: the same entity type participates more than once in the relationship type

Role names are then necessary to distinguish different roles

Page 7: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 7

ER Model: Attributes

Attributes: Structural characteristics describing objects and relationships

Attributes have cardinalities, as for roles Depending on cardinalities, attributes may be

Optional vs mandatory Monovalued vs multivalued

Complex attributes: Attributes composed of other attributes

Derived attributes: Their value for each instance may be calculated from other elements of the schema

Page 8: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 8

ER Model: Identifiers and Weak Entity Types

Identifier or key: One or several attributes that uniquely identify a particular object

Weak entity type: Do not have identifier of their own Regular entity types are called strong entity types

A weak entity type is dependent on the existence of another entity type, called the identifying or owner entity type

An identifying relationship type relates a weak entity type to its owner

Normal relationship types are called regular relationship types Partial key: Set of attributes that can uniquely identify

weak entities related to the same owner entity

Page 9: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 9

ER Model: Generalization (1)

Generalization or is-a relationship: Relates perspectives of the same concept at different abstraction levels

Supertype: Entity at higher abstraction level Subtype: Entity at lower abstraction level

Population inclusion: Every instance of the subtype is also an instance of the supertype

Inheritance: All characteristics (e.g., attributes, roles) of the supertype are inherited by the subtype

Substitutability: Each time an instance of a supertype is required, an instance of the subtype can be used instead

Page 10: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 10

ER Model: Generalization (2)

Several generalization relationships accounting for the same criterion can be grouped

Types of generalizations Total vs partial: depending on whether every instance of the

supertype is also an instance of one of the subtypes Disjoint vs overlapping: depending on whether an instance

may belong to one or several subtypes Multiple inheritance: A subtype that has several

supertypes May induce conflicts when a property of the same name is

inherited from two supertypes

Page 11: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 11

ER Model: Summary of Notation

Professor

IdentifierOther attributes

Attributes

Teaches

Partial IdentifierOther Attributes

Section

(0,1)(1,1)(0,n)(1,n)

Simple attributeComplex attribute Attribute 1 Attribute 2Optional att. (0,1)Multivalued attr. (1,n)

(1,1)

role name

Entity Type Relationship Type

Weak Entity Type

Attribute typesRole Cardinalities

Identifying Relationship Type Generalization

CouSec(total,exclusive)(partial,exclusive)(total,overlapping)(partial,overlapping)

Generalization types

Page 12: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 12

Relational Model: University Example

Professor

Employee noStatusTenure date (0,1)No courses

Academic staff

Employee noFirst name Last nameAddressEmailHomepage

Academic staff Research area

Employee noResearch area

Course idCourse nameLevel

Course

Course idHas prereq

Prerequisite

Project

Project idProject acronymProject nameDescription (0,1)Start dateEnd date BudgetFunding agency

Employee noProject id Start dateEnd date

Participates

Assistant

Employee noThesis titleThesis descriptionAdvisor

Course idSemesterYearHomepage (0,1)Employee no

Section

Page 13: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 13

Relational Model (1)

Based on a simple data structure A relation or table, composed of attributes or columns

Each attribute is defined over a domain or data type Typical domains: integer, float, date, string, …

Tuple or row: element of the table Provides several declarative integrity constraints Not null attribute: a value for the attribute must be

provided Key: column(s) that uniquely identify one row of the table

Simple vs composite key: depending on whether key is composed of one or several columns

Primary vs alternate key: one of the keys must be chosen as primary by the designer, the others are alternate keys

Page 14: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 14

Relational Model (2)

Referential integrity Defines a link between two tables (or twice the same one) A set of attributes in one table (foreign key) references the

primary key of the other table Values in the foreign key columns must also exist in the

primary key of the referenced table Check constraint: defines a predicate that must be

valid when inserting or updating a tuple in a relation Predicate can only involve the tuple being inserted/deleted

All other constraints must be expressed by triggers: an event-condition-action rule that is automatically activated when a relation is updated

Page 15: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 15

Translating from ER to Relational Schemas (1)

Rule 1: A strong entity type is associated with a table Table contains the simple monovalued attributes and the

simple component of the monovalued complex attributes of the entity type

Table also defines not null constraints for mandatory attributes

Identifier of the entity type define the primary key of the table

Academic staff

Employee no Name First name Last nameAddressEmail HomepageResearch areas (1,n)

Academic staff

Employee noFirst name Last nameAddressEmailHomepage

Page 16: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 16

Translating from ER to Relational Schemas (2)

Rule 2: A weak entity type is transformed as a strong entity type, but the associated table also contains the identifier of the owner entity

A referential integrity constraint relates this identifier to the table associated with the owner entity type

Primary key of the table: partial identifier of the weak entity type + identifier of the owner entity type

SemesterYear...

Section Course(1,n)(1,1)

Course id...

CouSec

Course idSemesterYear...

Section

Page 17: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 17

Translating from ER to Relational Schemas (3)

Rule 3: A binary one-to-one relationship type is represented by including the identifier of one of the participating entity types in the table associated to the other entity type

A referential integrity constraint relates this identifier to the table associated with the corresponding entity type

Simple monovalued attributes and simple components of monovalued complex attributes are also included in the table

Not null constraint for mandatory attributes are also definedDepartment

...

Professor(0,1)(1,1)

Dean

...

Department

...Dean

Page 18: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 18

Translating from ER to Relational Schemas (4)

Rule 4: A binary one-to-many relationship type is represented by including the identifier of the entity type with cardinality 1 in the table associated to the other entity type

A corresponding referential integrity constraint is added Attributes and not null constraints are also added as in

previous rulesAssistant

...

Professor(0,n)(1,1)

Advisor

...

Assistant

...Advisor

Page 19: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 19

Translating from ER to Relational Schemas (5)

Rule 5: A binary many-to-many relationship type or an n-ary relationship type is associated with a table containing the identifiers of the entity types that participate in all its roles

Corresponding referential integrity constraints are added Attributes and not null constraints are also added as in

previous rules Identifier if any is the key of the relation, but the

combination of all role identifiers can be used as a keyAcademic staff Project

Project id...

Employee no ...

(1,n)(0,n)Start dateEnd date

Participates

Employee noProject id Start dateEnd date

Participates

Page 20: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 20

Translating from ER to Relational Schemas (6)

Rule 6: A multivalued attribute is associated with a table containing the identifier of the entity or relationship type to which it belongs

Corresponding referential integrity constraint is added Primary key of table is composed of all its attributes

Academic staff

...Research areas (1,n)

Academic staff Research area

Employee noResearch area

Page 21: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 21

Translating from ER to Relational Schemas (7)

Rule 7: A generalization can be translated in 3 ways The supertype and the subtype are associated with tables

containing the identifier of the entity type. There is a referential integrity constraint from the table of the subtype to the table of the subtype

Academic staff

Employee noName ...

Professor

Status ...

Assistant

Thesis title...

Academic staff

Employee noName ...

Professor

Employee no Status ...

Assistant

Employee no Thesis title...

Page 22: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 22

Translating from ER to Relational Schemas (8)

Rule 7 (cont.): Only the supertype is associated with a table, which

contains also the attributes of the subtype (these are optional attributes)

Only the subtype is associated with a table containing all attributes of the subtype and the supertype

Academic staff

Employee no Name…Thesis title (0,1)...Status ...

Professor

Employee no Name…Status ...

Assistant

Employee noName …Thesis title...

Page 23: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 23

Defining Relational Schemas in SQL

create table AcademicStaff as (EmployeeNo integer primary key,FirstName character varying (30) not null,LastName character varying (30) not null,Address character varying (50) not null,Email character varying (30) not null,Homepage character varying (64) not null );

create table AcademicStaffResearchArea as (EmployeeNo integer not null,ResearchArea character varying (30) not null,primary key (EmployeeNo,ResearchArea),foreign key EmployeeNo references AcademicStaff(EmployeeNo) );

Page 24: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 24

Redundancies and Update Anomalies

Relation with redundancies may induce anomalies in the presence of insertions, updates, and deletions

In relation Participates the information about a project is repeated for each staff member who works on that project

In relation Affiliation the information about a research area is repeated for all departments to which the staff member is affiliated

Dependencies and normal forms are used to describe such anomalies

Employee noProject id Start dateEnd dateProject acronymProject name

Participates Affiliation

Employee noResearch areaDepartment id

Page 25: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 25

Functional Dependencies

Given a relation R and to sets of attributes X and Y in R, a functional dependency X Y holds if in all tuples of the relation, each value of X is associated with at most one value of Y

In Participates above, Project Id {Project acronym, Project name}

A key is a particular case of a functional dependency A prime attribute is an attribute that is part of a key A full functional dependency is a dependency X Y

in which the removal of an attribute from X invalidates the dependency

A dependency X Z is transitive if there is a set of attributes Y such that X Y and Y Z hold

Page 26: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 26

Multivalued Dependencies

Given a relation R and to sets of attributes X and Y in R, a multivalued dependency X Y holds if the value of X determines a set of values for Y, independently of any other attributes

In Affiliation above, Employee no Research area and Employee no Department Id

A multivalued dependency X Y is trivial if either Y X or X Y = R, otherwise it is nontrivial

Page 27: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 27

Normal Forms (1)

Normal Form: Integrity constraint certifying that a relation schema satisfies particular properties

In the relational model the relations are in first normal form since all attributes are atomic and monovalues

A relation schema is in the second normal form if every nonprime attribute is functionally dependent on every key

Employee noProject id Start dateEnd dateProject acronymProject name

Participates

Employee noProject id Start dateEnd date

Participates

Project id Project acronymProject name

Project

Page 28: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 28

Normal Forms (2)

A relation is in the third normal form if it is in the second normal form and there are no transitive dependencies between a key and a nonprime attribute

A relation is in the Boyce-Codd normal form if for every nontrivial dependency X Y, X is a key or contains a key

Assistant

Employee noThesis titleThesis descriptionAdvisor idAdvisor first name Advisor last name

Assistant

Employee noThesis titleThesis descriptionAdvisor id

Professor

Employee noFirst name Last name

Employee noProject id Start dateEnd dateLocation

Participates

Employee noProject id Start dateEnd date

Participates

LocationProject id

Location

Page 29: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 29

Normal Forms (3)

A relation is in the fourth normal form if for every nontrivial dependency X Y, X is a key or contains a key

Affiliation

Employee noResearch areaDepartment id

Affiliation

Employee noResearch area

Research

Employee noResearch area

Page 30: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 30

Weaknesses of the Relational Model

It provides a very simple data structure, which disallows multivalued and complex attributes

complex objects are split into several tables performance problems

The set of types provided by relational DBMSs is very restrictive (integer, float, string, date, …)

Insufficient for complex application domains There is no integration of operations with data

structures It is not possible to directly reference an object by

use of a surrogate or a pointer links are based on comparison of values performance

problems

Page 31: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 31

Object-Relational Model: Motivations

Preserves the foundations of the relational model Organizes data using an object model

Allow attributes to have complex types groups related facts into a single row

Extensions Complex and/or multivalued attributes User-defined types with associated methods System-generated identifiers Inheritance among types

Defined in the SQL:1999 and SQL:2003 standards Current DBMSs (Oracle, …) introduced object-

relational extensions, but do not necessarily comply with the standard

Page 32: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 32

Object-Relational Model: Definition (1)

Allow composite types, in addition to predefined types

Complex values may be stored in a column of a table Row type: structured values (i.e., composite attributes) Array type: variable-sized vectors of values Multiset type: Unordered collection of values, with duplicates

permitted (bags) Composite types can be nested

This is considered an “advanced feature” in the standard

Page 33: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 33

Object-Relational Model: Definition (2)

Two sorts of user-defined types Distinct types: Represented internally by a predefined

type, but cannot be mixed in expressions Does not make sense to compare a person’s name and a SSN

Structured user-defined types: class declarations in OO languages

May have attributes, which can be of any SQL type May have methods, which can be instance or static (class)

methods Attributes are encapsulated through observer and mutator

functions Both of them can be used as a domain of a column, of an

attribute of another type, or a table Comparison/ordering of values: with user-defined methods

Page 34: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 34

Object-Relational Model: Definition (3)

SQL:2003 supports single inheritance: A subtype inherits attributes and methods from its

supertype A subtype can

Define additional attributes and methods Overload inherited methods: provide another definition with

a different signature Override inherited methods: provide another

implementation with a the same signature Final types: cannot have subtypes Final methods: cannot be redefined Instantiable type: includes constructor methods for

creating instances

Page 35: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 35

Object-Relational Model: Definition (4)

Two types of tables Relational tables: usual ones, where domains for

attributes are all predefined or user-defined types Typed tables: defined on structured user-defined

types Constraints (e.g. keys) defined in the table declaration Have a self-referencing column that contains identifiers

(surrogates) generated by the DBMS Row identifiers can be used for establishing links between

tables Ref type: values are the unique identifiers

Always associated with a specified structured type

Page 36: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 36

Object-Relational Model: University Example

Academic staff Type

Employee noName First name Last nameAddressEmailHomepageResearch areas (1,n)Participates (0,n)

Professor Type

StatusTenure date (0,1)No coursesAdvises (0,n)Teaches (0,n)

Project Type

Project idProject acronymProject nameDescription (0,1)Start dateEnd dateBudgetFunding agencyParticipates (1,n)

SemesterYearHomepage (0,1)ProfessorCourse

Section Type

EmployeeProjectStart dateEnd date

Participates Type

Assistant Type

Thesis titleThesis descriptionAdvisor

Course idCourse nameLevelSections (1,n)Has prereq (0,n)Is a prereq (0,n)

Course Type

Page 37: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 37

Translating from ER to OR Schemas (1)

Rule 1: A strong entity type is associated with a structured type containing all its attributes

Multivalued attributes defined with the array or the multiset type

Composite attributes defined with the row or the structured type

A table of this structured type must be declared Table defines not null constraints for mandatory attributes Identifier of the entity type define the primary key of the

tableAcademic staff

Employee noName First name Last nameAddressEmailHomepageResearch areas (1,n)

Academic staff Type

Employee noName First name Last nameAddressEmailHomepageResearch areas (1,n)

Page 38: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 38

Translating from ER to OR Schemas (2)

Rule 2: A weak entity type is transformed as a strong entity type, but the structured type contains the identifier of its owner

SemesterYearHomepage (0,1)

Section Course(1,n)(1,1)

Course id...

CouSec

SemesterYearHomepage (0,1)Course

Section Type Course Type

Course id...

Page 39: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 39

Translating from ER to OR Schemas (3)

Rule 3: A regular relationship type is associated with structured type containing the identifiers of all participating entity types, and all attributes of the relationship

A table of this structured type must be declared Table defines not null constraints for mandatory attributes Set of identifiers of entity types define the primary key of

the table Inverse references may also be defined

Academic staff Project

Project id...

Employee no...

(1,n)(0,n)Start dateEnd date

Participates

Acad. staffProjectStart dateEnd date

ParticipatesType

Academic staffType

…Participates (0,n)

ProjectType

…Participates (1,n)

Page 40: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 40

Translating from ER to OR Schemas (4)

Rule 4: Only single inheritance is supported by the model multiple inheritance must be removed by transforming some generalization links into binary one-to-one relationships, to which Rule 3 is applied

PhD Student

Thesis title...

Assistant

Assistant Type...

Student

Major...

PhD Student

Thesis title…Student

Assistant

Assistant Type...

Student

Major...PhD Student

Page 41: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 41

Defining Object-Relational Schemas in SQL

create type AcademicStaffType as (EmployeeNo integer, Name row (

FirstName character varying (30), LastName character varying (30) ),

Address character varying (50), Email character varying (30), Homepage character varying (64), ResearchAreas character varying (30) multiset,Participates ref(ParticipatesType) multiset scope Participates

references are checked )ref is system generated;

create table AcademicStaff of AcademicStaffType (constraint AcademicStaffPK primary key (EmployeeNo),Email with options constraint AcademicStaffEmailNN Email not null,ref is oid system generated );

Page 42: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 42

Physical Database Design

Specifies how database records are stored, accessed, and related in order to ensure adequate performance of a database application

Requires to know the specificities of an application, properties of data, and usage patterns

Involves analyzing the transactions/queries that run frequently, that are critical, and the periods of time in which there will be a high demand on the database (peak load)

Page 43: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 43

Performance of Database Applications

Factors for measuring performance of database applications

Transaction throughput: # transaction processed in a time interval

Response time: Elapsed time for the completion of a transaction Disk storage: Space required to store the database file

A compromise has to be made among these factors Space-time trade-off: Reduce time to perform an operation by

using more space, and vice versa Query-update trade-off: Access to data can be more efficient by

imposing some structure upon it However the more elaborate the structure, the more time is

needed to built it and maintain it when content changes After initial physical design is done, it is necessary to

monitor it and tune it

Page 44: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 44

Data Organization

A database is organized in secondary storage into files, each composed of records, at their turn composed of fields

In a disk, data is stored in disk blocks (pages) Set by the operating system during disk formatting

Transfer of data between main memory and disk takes place in units of disk blocks

DBMSs store data on database blocks (pages) Selection of a database block depends on several issues

Locking granularity may be at the block level, not the record level

For disk efficiency, database block size must be equal to or be a multiple of the disk block size

Page 45: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 45

File Organization

Arrangement of data in a file into records and blocks Heap files (unordered files): Records placed in the

file in the order as they are inserted Efficient insertions, slow retrieval

Sequential files (ordered files): Records sorted on the values of ordering fields

Fast retrieval, slow inserting and deleting Hash files: Use a hash function that calculates the

block (bucket) of a record based on one or several fields

Collision: Bucket is filled and a new record must be inserted Fastest retrieval, but collision management degrades

performance

Page 46: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 46

Indexes

Additional access structures that speed up retrieval of records in response to search conditions

Provide alternative ways of accessing the records based on the indexing fields on which the index is constructed

Many different types of indexes Clustering vs nonclustering: Whether records in the file are

physically ordered according to the fields of the index Single column vs multiple column, depending on the number

of indexing files Unique vs nonunique: Whether duplicate values are allowed

Page 47: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 47

Indexes, Clustering

Types of indexes (cont.) Sparse or dense: Whether there are index records for all

search values Single-level vs multilevel: Whether an index is split in

several smaller indexes, with an index to these indexes Multi-level indexes often implemented by using B-trees of B+-

trees

Clustering: Tables physically stored together as they share common columns (cluster key) and are often used together

Improves data retrieval Cluster key stored only once storage efficiency

Page 48: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 48

Outline

Database Concepts Steps in Database Design Entity-Relationship Model Logical Database Design

Relational Model Object-Relational Model

Physical Database Design Data Warehouse Concepts

Logical Data Warehouse Design Physical Data Warehouse Design Data Warehouse Architecture

Page 49: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 49

Data Warehouses: Definition (1)

Operational databases (online transaction processing systems or OLTP), are not suitable for data analysis

Contain detailed data, do not include historical data, perform poorly for complex queries due to normalization

Data warehouses address requirements of decision-making users

A data warehouse is a collection of subject-oriented, integrated, nonvolatile, and time-varying data to support data management decisions

Subject-oriented: focus on particular subjects of analysis (e.g., purchase, inventory, and sales in a retail company)

Operational databases focus on specific functions of an application

Page 50: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 50

Data Warehouses: Definition (2)

Integrated: Join together data from several operational and external systems problems due to differences in data definition and content

In operational DBs these problems are solved in the design phase

Nonvolatile: Data modification and removal is disallowed scope of the data is long period of time

Operational DBs keep data for a short period of time, as required for daily operations

Time-varying: Keep different values for the same informa-tion and the time when changes to these values occurred

Operational DBs typically do not have temporal support Online analytical processing (OLAP): Allows decision-

making users to perform interactive analysis of data

Page 51: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 51

Multidimensional Model

DWs and OLAP use a multidimensional view of data Represented as a data cube or an hypercube

Dimensions: Perspectives for analyzing data Cells (facts): Contain measures, values that are to be

analyzed

21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

ua

rter

)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

3226

14

2014 31

12 20 24 33

24 18 28 1433 25 23 25

measure values

dimensions35

27

Page 52: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 52

Hierarchies

Data granularity: Level of detail of measures Data analyzed at different granularities (abstraction

levels) Hierarchies relate low-level (detailed) concepts to

higher-level (general concepts) Example: Store – City – Region/Province – Country

Given two related levels in a hierarchy, lower level is called child, higher level is called parent

Instances of these levels are called members

Page 53: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 53

Kinds of Hierarchies

Balanced: Same number of levels from leaf members to root

Noncovering (ragged): Parent of some members does not belong to the level immediately above

E.g., small countries do not have Region/Province level

Store level

ParisCity level

Region/Province level

Country level

Store 1 Store 2

Ile-de-France

France

Nice

Store 3

Store dimension

Rome

Store 10 Store 11

Milan

Store 12

Lazio

Italy

All

... ...

...

...

LombardyProvence-Alpes-Côte d'Azur

Page 54: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 54

Kinds of Hierarchies

Unbalanced: May not have instances at the lower levels

E.g., some cities do not have stores, sales are through wholesalers

Parent-child (recursive): Involve twice the same level

Jottard

Hallez Berger

Nguyen

Vancamp Rochat

Weiss

Ivanov

Spiegel Praet

Pirlot Quintin

Claus

Lepage

Page 55: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 55

Measure Aggregation and Summarizability

Measures are aggregated when using hierarchies for visualizing data at different abstraction levels

Summarizability conditions ensure correct aggregation

Disjointness of instances: Grouping of instances in a level with respect to the parent in the next level must result in disjoint sets

Completeness: All instances are included in the hierarchy and each instance is related to one parent in the next level

Correct use of aggregation functions: Type of measures determine the kind of aggregation functions that can be applied

Page 56: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 56

Measure Classification: Additivity

Additive measures (flow or rate measures): Can be meaningfully summarized using addition along all dimensions

E.g., sales amount can be summarized when the hierarchies in Store, Time, and Product dimensions are traversed

Semiadditive measures (stock or level measures): Can be meaningfully summarized using addition along some (not all) dimensions

E.g., inventory quantities, can be aggregated in the Store dimension, but cannot be aggregated in the Time dimension

Nonadditive measures (value-per-unit measures): Cannot be meaningfully summarized using addition along any dimension

E.g., item price, cost per unit, exchange rate

Page 57: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 57

Measure Classification: Aggregation Complexity

Distributive measures: Defined by an aggregation function that can be computed in a distributed way

Data is partitioned into n sets, aggregate function applied to each set, aggregated value is computed by applying a function to these n subaggregate values

E.g., count, sum, min, max Algebraic measures: Defined by an aggregation

function that has a bounded number of arguments, each is obtained by applying a distributive function

E.g., average, computed from sum and count, which are distributive

Holistic measures: Cannot be computed from other subaggregate values

E.g., median, mode, rank

Page 58: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 58

OLAP Operations: Roll up

Transforms detailed measures into summarized ones when one moves up in a hierarchy

Roll-up to the Country level21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

12 20 24 33

24 18 28 1433 25 23 25

33

12

11

68

Q4

FranceItaly

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

CDs

DVDsgames

30 42

30

323526

27 14

2014 31

57 43 51 39

Page 59: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 59

OLAP Operations: Drill down

Opposite to the roll-up operation, i.e., it moves from a more general level to a detailed level in a hierarchy

Drill-down to the

Month level

21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

12 20 24 33

24 18 28 1433 25 23 25

7

4

8

13

...

ParisNice

Milan

Product (Category)

Tim

e (Q

ua

rter

)books

...

Mar

Feb

Jan

Rome

CDs

DVDsgames

2 6

12

1046

8 4

...... ...

4 7 8 10

8 6 9 510 8 11 8

1644 7Dec

Page 60: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 60

OLAP Operations: Pivot or Rotate

Rotates the axes of a cube to provide an alternative presentation of the data

21

28

11

14

Q4

Milan

Rome

Paris

Time (Quarter)

books

25

Q3Q2Q1

Nice

CDsDVDs

games

27 26

13

323533

12 14

2324 18

10 14 12 20

35 30 32 3118 11 35 47

Sto

re (

Cit

y)

Pivot

21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

12 20 24 33

24 18 28 1433 25 23 25

Page 61: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 61

OLAP Operations: Slice

Performs a selection on one dimension of a cube, resulting in a subcube

Slice on Store.City = ‘Paris’

21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

12 20 24 33

24 18 28 1433 25 23 25 21

12

11

35

Q4

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

Page 62: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 62

OLAP Operations: Dice

Defines a selection on two or more dimensions, thus again defining a subcube

Dice on Store.Country = ‘France’ and Time.Quarter=

‘Q1’ or ‘Q2’

21

12

11

35

Q4

ParisNice

Milan

Product (Category)

Tim

e (Q

uar

ter)

books

47

Q3

Q2

Q1

Rome

CDs

DVDsgames

10 18

30

323526

27 14

2014 31

12 20 24 33

24 18 28 1433 25 23 25

21

11

35

ParisNice

Product (Category)

Tim

e(Q

uar

ter)

books

Q2

Q1

CDs

DVDsgames

10 18

3027 14

12 20 24 33

Page 63: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 63

Other OLAP Operations

Drill-across: Executes queries involving more than one cube, which share at least one dimension

E.g., compare sales data in a cube with purchasing data in another one

Drill-through: Allows one to move from data at the bottom level in a cube, to data in the operational systems from which the cube was derived

E.g., for determining the reason for outlier values in a data cube

Page 64: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 64

Implementations of a Multidimensional Model

Relational OLAP (ROLAP): Store data in relational databases, support extensions to SQL and special access methods to efficiently implement the model and its operations

Multidimensional OLAP (MOLAP): Store data in special data structures (e.g., arrays) and implement OLAP operations in these structures

Better performance than ROLAP for query and aggregation, less storage capacity than ROLAP

Hybrid OLAP (HOLAP): Combine both technologies, E.g., detailed data stored in relational databases,

aggregations kept in a separate MOLAP store

Page 65: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 65

Logical Data Warehouse Design

In ROLAP systems, tables organized in specialized structures

Star schema: One fact table and a set of dimension tables Referential integrity constraints between fact table and

dimension tables Dimension tables may contain redundancy in the presence of

hierarchies Snowflake schema: Avoids redundancy of star schemas

by normalizing dimension tables Normalized tables optimize storage space, but decrease

performance Starflake schema: Combination of the star and snowflake

schemas, some dimensions normalized, other not Constellation schema: Multiple fact tables that share

dimension tables

Page 66: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 66

Logical DW Design: Star Schema

Time

Time keyDateEventWeekday flagWeekend flagSeason...

Product

Product keyProduct number Product nameDescriptionSizeCategory nameCategory descr.Department nameDepartment descr....

Sales

Product fkeyStore fkeyPromotion fkeyTime fkeyAmountQuantity

Store

Store keyStore numberStore nameStore addressManager nameCity nameCity populationCity areaState nameState populationState areaState major activity...Promotion

Promotion keyPromotion descr.Discount pct.TypeStart dateEnd date...

Page 67: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 67

Logical DW Design: Snowflake Schemas

Time

Time keyDateEventWeekday flagWeekend flagSeason...

Category

Category keyCategory nameDescriptionDepartment fkey...

Department

Department keyDepartment nameDescription...

Product keyProduct number Product nameDescriptionSizeCategory fkey...

Sales

Product fkeyStore fkeyPromotion fkeyTime fkeyAmountQuantity

Store

Store keyStore numberStore nameStore addressManager nameCity fkey...

City

City keyCity nameCity populationCity areaState fkey...

State

State keyState nameState populationState areaState major activity...

Promotion

Promotion keyPromotion descr.Discount pct.TypeStart dateEnd date...

Product

Page 68: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 68

Logical DW Design: Constellation Schemas

Purchases

Product fkeySupplier fkeyOrder time fkeyDue time fkeyAmountQuantityFreight cost

Time

Time keyDateEventWeekday flagWeekend flagSeason...

Product

Product keyProduct number Product nameDescriptionSizeCategory nameCategory descr.Department nameDepartment descr....

Sales

Product fkeyStore fkeyPromotion fkeyTime fkeyAmountQuantity

Store

Store keyStore numberStore nameStore addressManager nameCity nameCity populationCity areaState nameState populationState areaState major activity...

Promotion

Promotion keyPromotion descr.Discount pct.TypeStart dateEnd date...

Supplier

Supplier keySupplier nameContact personSupplier addressCity nameState name...

Page 69: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 69

Physical Data Warehouse Design: Views (1)

Data warehouses are several orders of magnitude larger than operational database

Physical design is crucial for ensuring adequate response time for complex queries

View: Table derived from a query Query may involve base tables or other views

Materialized view : View physically stored in a database Enhances query performance by precalculating costly operations This is achieved at the expense of storage space

Updating materialized views: Modifications of underlying base tables must be propagated into the view

Must be performed incrementally to ensure efficiency

Page 70: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 70

Physical Data Warehouse Design: Views (2)

In a DW not all possible aggregations can be materialized

Number of aggregations grows exponentially with the number of dimensions and hierarchies

Selection of materialized views: Identify the queries to be prioritized, which determines the views to be materialized

DBMSs are providing tools that tune the selection of materialized views given previous queries

Queries must be rewritten in order to best exploit existing materialized views to improve response time

Drawback of materialized view approach: Requires one to anticipate queries to be materialized

DW queries are often ad hoc and cannot always be anticipated

Page 71: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 71

Physical Data Warehouse Design: Indexes (1)

Traditional indexes are not appropriate for data warehouses

Designed for OLTP transactions that access a small number of tuples

Bitmap indexes: For columns with a low number of distinct values

Each value is associated with a bit vector whose length is the number of records in the table

ith bit is set if ith row has that value in the column Small in size, logical operators used to manipulate

them

Professor

PId Name Rank

P1

P2

P3

P4

John Smith

Ada Jones

...

...

Assistant

Associate

Full

Associate

Bill Kock

Ed Low

...

...

Index on Rank

Rec# Assistant

1

Associate

2

3

4

1

Full

0 0

0 1 0

0 0 1

0 1 0

...

Page 72: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 72

Physical Data Warehouse Design: Indexes (2)

Join indexes: Materialize relational join by keeping pairs of row identifiers that participate in the join

Index selection: Select indexes to minimize execution cost of a series of queries and updates

DBMSs are providing tools to assist DBAs to this task

P1P1,C1,...

P1,C2,...

P2,C1,...

Index on Client

Index on Product

Sales

P3,C2,...

P2

P3

...

C1

C2

...

...

Page 73: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 73

Fragmentation or Partitioning

Divide the contents of a relation into several files that can be more efficiently processed

Vertical fragmentation: Splits attribute of a table into groups that are stored independently

E.g., most often used attributes in one partition Horizontal fragmentation: Divides records of a table

into groups according to a particular criterion Range partitioning: Partitioned wrt range of values

E.g., partitioning based on time, each partition contains data about a particular time period (year, month)

Hash partitioning: Partitioned wrt hash function Combined partitioning: Combine these two methods

E.g., rows first partitioned according to range values, then subdivided using a hash function

Page 74: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 74

Data Warehouse Architecture (1)

Operationaldatabases

Externalsources

Internalsources

OLAP tools

Reportingtools

Data miningtools

Data marts

Back-endtier

OLAPtier

Front-endtier

Datasources

Data warehousetier

Statisticaltools

Data staging Metadata

ETLprocess

Enterprisedata

warehouse

OLAPserver

Page 75: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 75

Data Warehouse Architecture (2)

Data sources Operational databases Other internal or external sources of information (e.g. files)

Back-end tier Extraction-Transformation-Loading (ETL) tools for

manipulating data from sources Data staging area: Intermediate database where

manipulation is done OLAP tier

OLAP Server: Supports multidimensional data and operations Front-end tier: Deals with data analysis and

visualization Composed of OLAP tools, reporting tools, statistical tools,

data-mining tools, …

Page 76: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 76

Back-End Tier

Extraction: Gathers data from multiple heterogeneous data sources

May be operational databases or files in various formats May be internal or external to the organization Uses APIs such as ODBC, JDBC, … for achieving interoperability

Transformation: Modifies data to conform to the data warehouse format

Cleaning: Removes errors, inconsistencies, format transformation Integration: Reconciles data from different sources Aggregation: Summarizes data according to the granularity (level

of detail) of the DW Loading: Feeds the DW with transformed data

Also includes refreshing the data warehouse at a specified frequency

Page 77: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 77

Data Warehouse Tier

Enterprise data warehouse: Centralized DW that encompasses all areas in an organization

Data mart: Specialized DW targeted to a particular functional area or user group

Their data can be derived from the enterprise DW or collected from data sources

Metadata repository: Describes the content of the DW Business metadata: Meaning (semantics) of data, organization

rules, policies, constraints, … Technical metadata: How data is structured/stored in the

computer Data sources, data warehouse, and data marts: logical and physical

schemas, security information, monitoring information … ETL process: Data lineage (trace to sources), rules, defaults, refresh

and purging rules, algorithms for summarization, …

Page 78: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 78

OLAP Tier

OLAP servers that provides multidimensional view from DWs and data marts

Can be ROLAP, MOLAP, or HOLAP Most database products provide OLAP extensions

and related tools for manipulating cubes However, no standardized language for querying

data cubes Oracle uses Java and query language OLAP DML SQL Server uses .NET and query language MDX

XMLA (XML for Analysis) aims at providing a common language for exchanging multidimensional data

Page 79: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 79

Front-End Tier

OLAP tools: Allow interactive exploration and manipulation of the warehouse data

Facilitate formulation of ad hoc queries (no prior knowledge of them)

Reporting tools: Enable production, delivery and management of reports (paper and web-based)

Use predefined queries Statistical tools: Used to analyze and visualize the

cube data using statistical methods Data-mining tools: Allow users to analyze data to

discover patterns, trends, enable predictions

Page 80: Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 1 Chapter 2 Databases & Data Warehouses

Copyright © 2008 Elzbieta Malinowski & Esteban Zimányi 80

Variations of the Architecture

Enterprise DW without data marts, or data marts without enterprise DW

Building enterprise DW is costly, building a data mart is easier Integrating independently created data marts is costly

No OLAP server: Client tools access directly DW Extreme situation: No DW either, client tools access

sources Virtual DW: Set of materialized views over data sources Easier to build, but does not provide a real DW solution

Does not contain historical data, no centralized metadata, no possibility to clean and transform the data

No data staging area: If the source systems conform very closely to the data in the DW

Rarely the case in real-world situations