54
0.5 setgray00.5 setgray0.250.5 setgray0.50.5 setgray0.750.5 setgray1 Theoretische Grundlagen der Aggregation in OLAP-Modellen OLTP and OLAP Schemata Towards a Sound Treatment of Aggregation Functions and OLAP Databases Free University Berlin, Dept. of Business Administration and Economics Institute of Production, Information Systems and Operations Research Colloquium May 12, 2005 Bernhard Thalheim (& Hans-Joachim Lenz) Technologie der Informationssysteme Institut f¨ ur Informatik, Christian-Albrechts-Universit¨ at zu Kiel, BRD Kolmogorow-Professor e.h. der Lomonossow-Universit¨ at Moskau

Theoretische Grundlagen der Aggregation in OLAP-Modellen fileISE, CAU, D Colloquium FU Berlin May 12, 2005 B. Thalheim Overview State-of-the-art Aggregation OLAP cube Schemata Methodology

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

0.5 setgray00.5 setgray0.250.5 setgray0.50.5 setgray0.750.5 setgray1

0.5 setgray0 0.5 setgray0.25 0.5 setgray0.5 0.5 setgray0.75 0.5 setgray1

Theoretische Grundlagen der Aggregation in

OLAP-ModellenOLTP and OLAP Schemata

Towards a Sound Treatment of Aggregation Functions and OLAP Databases

Free University Berlin, Dept. of Business Administration and Economics

Institute of Production, Information Systems and Operations Research

Colloquium

May 12, 2005

Bernhard Thalheim (& Hans-Joachim Lenz)Technologie der Informationssysteme

Institut fur Informatik, Christian-Albrechts-Universitat zu Kiel, BRD

Kolmogorow-Professor e.h. der Lomonossow-Universitat Moskau

1

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

DBIS/TIS in DD, Kuwait, HRO, CB, KI

IS Engineering

Interactive media

z

Integration, farms

?

Information services

9

DB modeling

z?9

(Object-)relational theory

z

ER theory

?

OO/XML theory

9

DB theory

9 ? z

Theoria cum praxi

Demanding projects with industrial partners

as impulse for novel theory development

and expertise for teaching and students

2

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Information System Engineering Chair@ CAU Kiel�� ��One goody of the German university system: Research stability

Research @ ISE: or what you cannot expect

Codesign of database applications structuring, functionality, distri-

bution and interactivity

B. Thalheim, Entity-Relationship Modeling. Springer, Berlin, 2000

Database and information systems theory

Website development

K.-D. Schewe, B. Thalheim,

Design and Development of Web Information Systems.

Springer, Berlin 2005?, 2006

Component architectures of information systems

Content management systems on demand - the Kiel approach

3

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Information System Engineering Chair@ CAU Kiel�� ��Another goody of the German university system: Partner selectivity

Projects - for students, knowledge enhancement and leisure

Websites: Information sites, e-learning sites, community sites, e-

government sites

35 successful website development projects

and a good number of “experience-gaining” projects

Media on merge on demand, portfolio, profile and history:

radio/tv + video+ internet + news paper

Database and information systems development,

maintenance, emergency management, tuning

application library with more than 4.500 large applications

Database integration, embedded databases

Partners: Berlin-Brandenburg Graduate school, CoreMedia, VW,

Dataport

4

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Information System Engineering Chair@ CAU Kiel

�� ��Main goody of the German university system: Integrated teaching and research

Internet information services

�j

Intelligent information systems

R

Applications of information systems

?

Database technologyand programming

Distributedinformation systems

Engineering ofinformation systems

Information systemstheory

?

Foundations of database systems

9

R

z

• CS II (Algorithms and data structures), CS III (Software engineering)

• Contemporary courses: Intelligent information systems, Artificial intelligence, None-classical

logics, Problem solving strategies, Web databases, Website engineering, Programming of

information-intensive services, Knowledge bases

• Service courses within other study programs

5

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Overview��

��Nothing is more practical than a good theory (A.N. Kolmogorow)

(1) Quick look onto OLTP and OLAP engines

(2) The pitfalls, misunderstandings, practice, and theory of aggregation

functions

Yet another chapter in the book of stupidity or

Know what you are doing

(3) The OLAP cube

Towards a sound foundation

(4) OLAP-OLTP schemata

Our proposal to overcome the problems

(5) Methodology of OLTP-OLAP application development

Provable properties and quality criteria

There are three kinds of lie: lies, damned lies and statistics.attributed to Benjamin Disraeli (1804-81) by Mark Twain (Autobiography, 1924)

6

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

OLTP and OLAP�� ��Database System Architecture

Appli-cationsystem

(workflow(functio-nality))

Presen-tationsystem(playout(story))

?6

-

DBMS

Data-base

DBS =

DBMS +{ DB }

Storage management

Compiler

Communication subsystem

Operatingsystem

Synchronizationof parallelaccess

Datadictionary

6 ?

Data manager Buffer manager

Transactionmanagement

Scheduler

Recoverymanager

Code generation Distribution management

OptimizerAccess plangeneration

Updateprocessor

Queryprocessor Integrity

control andenforcement

Logbook

6?

-�

Input-outputprocessor

Parser Pre-compiler

Authoritycontrol

Supportingsystems(graphicaletc. )

6? 6?

7

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

OLTP and OLAP�� ��Generalized Data Warehouse Architecture

OLAP/DW SystemLegacyData

ForeignData

OLTPdata

Micro-dataimportexporttools-

-

-

-Content

managementsystem

-Macro-dataextractorsdatabasemining

EIS/DSSuser

-

-

-

Businessunit user

Anonymoususer

Active acquisition

Storage

User profiles

Data suites Access, historymanager

Gate

Purger

Paymentmanager

Workspace

Unit, applet,data provider

Units generator

8

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Micro-data versus Macro-data�� ��The right level of data granularity

Snodgrass Example for Temporality Problems

Lot Pen

HD CNT

penn idlot id nrfrom to

� -Resides

Awful queries, overwhelming complexity

Microdata of the Temporal Database

(0,.) (1,1) (1,.) (0,.)

� -BelongsTo

Lot Cattle

lot id nr gender code penn idfrom to

� -Resides Pen

9

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Micro-data versus Macro-data�� ��The right level of data granularity

Typical Snodgrass query: “Give the History of Lots being co-resident in a Pen”

select L1.Lot Id num, L2.Lot Id Num, L1.Pen Id, L1.From Date, L1.To Datefrom Lot Loc as L1, Lot Loc as L2where L1.Lot Id num< L2.Lot Id num and L1.Fdyd Id = L2.Fdyd Id

and L1.Pen Id= L2.Pen Id and L1.From Date= L2.From Dateand L1.To Date<= L2.To Date

unionselect L1.Lot Id num, L2.Lot Id Num, L1.Pen Id, L1.From Date, L2.To Datefrom Lot Loc as L1, Lot Loc as L2where L1.Lot Id num< L2.Lot Id num and L1.Fdyd Id = L2.Fdyd Id and ...

unionselect L1.Lot Id num, L2.Lot Id Num, L1.Pen Id, L2.From Date, L1.To Datefrom Lot Loc as L1, Lot Loc as L2where L1.Lot Id num< L2.Lot Id num and L1.Fdyd Id = L2.Fdyd Id and ...

unionselect L1.Lot Id num, L2.Lot Id Num, L1.Pen Id, L2.From Date, L2.To Datefrom Lot Loc as L1, Lot Loc as L2where L1.Lot Id num< L2.Lot Id num and L1.Fdyd Id = L2.Fdyd Id

and L1.Pen Id = L2.Pen Id and L1.From Date > L1.From Dateand L2.To Date <= L1.To Date;

select distinct L1.Lot ID, L2.Lot ID, R1.Pen ID, R2.From, min(R1.To, R2.To)from Cattle C1, Cattle C2, Resides R1, Resides R2, Lot L1, Lot L2where L1.Lot ID = C1.BelongsTo and L2.Lot ID = C2.BelongsTo and

R1.Cattle ID = C1.Cattle ID and R2.Cattle ID = C2.Cattle ID andR1.Pen ID = R2.Pen ID and R1.From <= R2.From andR2.From < R1.To and L1.Lot ID <> L2.Lot ID;

10

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Hierarchy Paradox

UniversityLectures

� -HeldOn

Room

#Participants

#Participants

WorkingDay

6

-IsA Day � -OrganizedOn

EveningLectures

TitleRoom

Scheduling Schema on University and Evening Lectures

� -HeldOnLectures Day

#Participants

Cube A: Participants per Lecture and Day

� -UsageRoom Day

#TotalUsage

Cube B: Usage of Rooms per Day

11

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Hierarchies in the Example 2: 1

date ofuniversity lectures

date of

evening lectures

R

morning evening

+

day inside term

s

day

)

date ofevening lectures

j

day outside term

?

12

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Hierarchies in the Example 2: 2

working day

day

non-working day

q

date

?

evening

q

j

date ofevening lecturesat working days

morning

day

inside term

j

date of

university lectures

dayoutside term

13

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Cube B Data in the Example 2Day category Room utilization Daytime # of rooms

working occupied morning 20

working occupied evening 12

working free morning 10

working free evening 4

non-working occupied morning 2

non-working occupied evening 8

non-working free morning 6

non-working free evening 16

Daytime Working day Non-working day All days

Evening 75 % 33 % 50 %

Morning 67 % 25 % 58 %

14

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Cube B Data in the Example 2Day category Room utilization Daytime # of rooms

working occupied morning 20

working occupied evening 12

working free morning 10

working free evening 4

non-working occupied morning 2

non-working occupied evening 8

non-working free morning 6

non-working free evening 16

Daytime Working day Non-working day All days

Evening 75 % 33 % 50 %

Morning 67 % 25 % 58 %

15

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Weighted Cube B Data in the Example 2

Day category Room utilization Daytime # of rooms

working occupied morning 20

working occupied evening 12

working free morning 10

working free evening 4

non-working occupied morning 2

non-working occupied evening 8

non-working free morning 6

non-working free evening 16

Daytime Working day Non-working day All days

Evening (75 % , 1630 ) (33 % , 24

30 ) (50 %, 2430 )

Morning (67 %, 3030 ) (25 %, 8

30 ) (58 %, 3030 )

The hierarchy paradox

16

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Simpson Paradox�� ��Revisited but based on a classical OLAP example

model Chevy Ford ALL

color blue white blue white ALL

year 90 91 90 91 90 91 90 91 ALL

count 187 155 131 108 217 180 151 125 1254

count(*) by model, color, year (data under independence constraint)

Ratios ’Market Share Percentages of Chevy’

p(chevy|blue, 90) =187

187 + 217· 100 = 46%

p(chevy|blue, 91) =155

335· 100 = 46%

p(chevy|white, 90) =131

282· 100 = 46%

p(chevy|white, 90) =108

233· 100 = 46%

Market share of sold units where model = ‘chevy’constant for both colors over years 90-91

17

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Simpson Paradox�� ��Revisited but based on a classical OLAP example

model Chevy Ford ALL

year 90 91 90 91 ALL

count 187 155 217 180 1254

count (*) by model, year

p(chevy, 90) =308

686· 100 ≈ 46%

p(chevy, 91) =263

568· 100 ≈ 46%

coherent with the results before

global independence assumption (integrity constraint) on the data spacespanned by the attributes ‘model’, ‘year’ and ‘color’, cite

18

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Simpson Paradox�� ��Revisited but based on a classical OLAP example

Tm: model Chevy Ford ALL

count 581 673 1254

Ty: year 90 91 ALL

count 739 515 1254

Tc: color blue white ALL

count 687 567 1234

Table 1: Marginal tables Tm, Ty, Tc used to generate ??

insert and delete operations

model Chevy Ford ALL

color blue white blue white ALL

year 90 91 90 91 90 91 90 91 ALL

count 255 156 88 82 174 102 222 175 1254

count (*) by model, color, year

19

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Simpson Paradox�� ��Revisited but based on a classical OLAP example

p(chevy|blue, 90) =255

429· 100 ≈ 59%

p(chevy|blue, 91) =156

258· 100 ≈ 60%

Market share of a blue car of type ‘chevy’ increases slightly over years

Increase of the share is stronger for white cars:

p(chevy|white, 90) =88

310· 100 ≈ 28%

p(chevy|white, 91) =82

257· 100 ≈ 32%

20

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Simpson Paradox�� ��Revisited but based on a classical OLAP example

Slice now over ‘color’, πyear,count

model Chevy Ford ALL

year 90 91 90 91 ALL

count 324 238 396 277 1254

count (*) by model, year

p(chevy, 90) =308

686· 100 ≈ 46%

p(chevy, 91) =263

568· 100 ≈ 46%

The bad guy:

j�

Z (color)

X (year) Y (model)

Dependency graph XYZ with separator (influential) attribute Z

21

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Other Problematic OLAP Applications�� ��Discoveries of H.-J. Lenz

Non-commutative operators: Let O be a given set of numeric opera-

tors. Let o1 ∈ O be a linear operator and o2 ∈ O a non-linear operator.

Then it is generally not true that o1 ◦ o2 = o2 ◦ o1 .

Perfect aggregation: The perfect homomorphism H : y = y exists if

and only if y = SAT ∨ A = S+AT+ where S+(T+) is the Moore-Penrose

inverse of S(T ). x

?

A

T S

A

y

Y

-

-X

6

Summarizability and cover properties

Extensions of These Cases

Non-confluence of aggregations

Identifiability and inclusion/exclusion computation

22

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

n-Pass Aggregation Functions I�� ��Basics

• turn data (micro-data) into information (macro-data)by (one-pass) aggregation

(statistical) summary from that data

• count (tally)

• average (arithmetic mean)not understood for non-linear relative values (pH)

beside other ‘average’ values

• geometric average (√a× b),

quadratic average (√

1n(a21 + ...+ a2n))

• median (just as many cases with a value below it as above it)[measure of central tendency]

‘statistical median’: left or right, dividing the data

into two half

‘financial median’: average of the highest value in the

lower half and the lowest value in the higher half

• sum (total)

• min, max

23

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

n-Pass Aggregation Functions II�� ��First problems due to SQL

• null values, default values

• count({ (Null) }) = 1

• sum({ (Null) }) = sum({ (0) }) = 0

sum(∅) = count(∅) = 0

sum({di|di = NULL, di = ti(A)}) = sum({di|di = ti(A)})

• max({ (Null) }) = min({ (Null) }) = undef or

max({ (Null) }) = maxValue min({ (Null) }) = minValue

SQL assumptions:

max(∅) = max(∅) = NULL

similarly fancy for data types TIME, DATETIME,

STRING, VARSTRING

e.g., MAX(temporal) farthest in the future or most recent

• avg(∅) = NULL

RCno−null[A] = {|{ di|di = NULL, di = ti(A), ti ∈ RC }|}

avg(RCno−null[A]) =

sum(RCno−null[A])

count(RCno−null[A])

avg(RCno−null[A]) = avg(Bag2Set(RC

no−null[A]))

24

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

n-Pass Aggregation Functions�� ��Towards more complex functionality

• Cumulative or running statistics

relation of each data value to the whole data

changes in an aggregate value over time or any dimension set

banded reports, control break reports, OLAP dimensions

“total amount of sales broken down by salesman within territories”

based on two-pass reports

1. properties of a group as a whole2. second pass to produce the results for each row in the group

“What percentage each customer contributes to total sales?”

“Total sales in each territory, ordered from high to low!”

• Key - area computations

• regions: contiguous n values that are the same

• runs: set of consecutive values within a sequence

• sub-sequences: set of consecutive unique data within a sequence

25

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

n-Pass Aggregation Functions�� ��Examples of complex functions

• running totals (tracking changes over time (bank account))

running differences (tracking differences over time since the

last moment)

• cumulative percentage (what percentage of the whole set of

data values the current subset of data values is)

ranking (integers that represent the ordinal values (first,

second, ..) of the elements of a set based on one of the values)

cross tabulation (two-dimensional spreadsheet generation)

[cross join, outer join, subquery]

• mode (most frequently occurring value)

n-modal distribution (n most frequently occurring values)

weak descriptive statistic (minor DB changes with major reflection)

mode with threshold variation

• standard deviation (√∑

k(xk − EX)2 ), variance

average deviation (how much values drift away from the average)

26

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Aggregation Functions: General Definition

Family F = {f0, ...., fk, ..., fω}fk : Bagk → Num (maps a bag on dom(Mj) with k elements

to a numerical range Num)

Minimum preserving: fk(min, ....,min) = minNum min ∈ dom(Mj)

Maximum preserving: fk(max, ...,max) = maxNum max ∈ dom(Mj)

Monotone according to the order of dom(Mj) and T (k ≥ 1)

Idempotent: fk(x, ...., x) = x for all x ∈ dom(Mj)

Continuous: limxi→x f(xi) = f(x) for all sequences xi of size k

Lipschitz property: |fk(x1, ..., xk)− fk(y1, ..., yk)|≤ c

∑ni=1 |xi − yj | for some constant c

Symmetric: fk(x1, ..., xk) = fk(xρ(1), ..., xρ(k)) for any k-permutation ρ

Self-identical: fk(x1, ..., xk) = fk+1(x1, ..., xk, fk(x1, ..., xk))

Shift-invariant: fk(x1 + b, ..., xk + b) = fk(x1, ..., xk) + b

Homogeneous: fk(bx1, ..., bxk) = bfk(x1, ..., xk)

Additive: fk(x1 + y1, ..., xk + yk) = fk(x1, ..., xk) + fk(y1, ..., yk)

Associative: fr(fk1(x1), ..., fkr (xr)) = fk1+...+kr (x1, ..., xr)

27

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Aggregation Functions: Properties�� ��Be careful with the average functions used for OLAP

• max, min are minimum preserving, maximum preserving, idempo-

tent, continuous, symmetric, self-identical, additive, homogeneous,

associative, and obey the Lipschitz property,

• sum is 1-minimum preserving, 1-maximum preserving, 1-monotone,

continuous, symmetric, homogeneous, additive, associative, obeys

the Lipschitz property, and

not idempotent, not self-identical, not shift-invariant,

• avg is minimum preserving, maximum preserving, 1-monotone,

idempotent, continuous, symmetric, shift-invariant, homogeneous,

additive, obeys the Lipschitz property,

is not self-identical, not associative,

• count is continuous, symmetric, associative, obeys the Lipschitz

property,

not idempotent, not self-identical, not shift-invariant, not homoge-

neous, not additive.

28

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Classes of Aggregation Functions:Distributive (Inductive) Functions

preserving partitions of sets

∀f ∃g f(X) = g(f(X1), ..., f(Xn))

X = X1 ∪X2 ∪ ... ∪Xn, Xi ∩Xj = ∅ , i = j

types T , T ′, collection type CT on T , operations ∪CT , ∩CT , ∅CT on CT

h0 ∈ T ′ h1 : T → T ′ h2 : T ′ × T ′ → T ′

srech0,h1,h2(∅CT ) = h0

srech0,h1,h2(|{|s|}|) = h1(s) for |{|s|}|srech0,h1,h2(R

C1 ∪CT RC

2 ) = h2(srech0,h1,h2(RC1 ), srech0,h1,h2(R

C2 ))

iff RC1 ∩CT RC

2 = ∅CT

• sum = srec0,Id,+ for relations without nulls

sumnull0 = srec0,h0Id

,+ h0f (s) =

{0 if s = NULL

f(s) if s = NULL

sumnullundef = srec0,hundefId

,+ hundeff (s) =

{undef if s = NULL

f(s) if s = NULL

• count = srec0,1,+ countnull1 = srec0,h01,+

, countnullundef = srec0,hundef1 ,+

• max = srecNULL,Id,max min = srecNULL,Id,min for sets, bags

max, maxundef are different functions

29

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Classes of Aggregation Functions:Algebraic Functions

Finite algebraic expressions based on distributive functions

• average:(++)

sum

count(??)

sum

countnull1

(??)sum

countnullundef

(SQL!?)sumnull0

count(+!)

sumnull0

countnull1

(??)sumnull0

countnullundef

(+?!)sumnullundef

count(??)

sumnullundef

countnull1

(++)sumnullundef

countnullundef

remember SQL x0= 0

• variance (∑

k(xk − EX)2 · pk EX =∑

k xk · pk )

• ratio values (e.g. consumption)

most of them can be extended to distributive functions

e.g. avgnull0,1 based on ordering on pairs (value, occ#)

30

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Classes of Aggregation Functions: HolisticFunctions

the vast majority of functions

no storage bound for computing sub-aggregates

mostFrequent, n-mostFrequent

statistical and financial median

rank

averageDeviation

geometric key-area computations

computable over temporal views (on temporal views ...)

distinction between raw data and ‘data’

maintenance of identification becomes the main issue

data warehouse data

versions of data, temporality

31

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Invariance of FunctionsGiven: aggregation function ψ

database function f ,

e.g., view function, selection function

Problem: does the application of f change the aggregation ?

Can we delay computation of aggregation to the view?

Is there a need to compute results based on raw data only?

D

?

f

ψ ψ

f(D)-

ψ(D) = ψ(f(D))�

f isψ-invariant

in Dif

ψ(f(D)) = ψ(D)

32

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Invariance Examples

• Bag2Set: (DISTINCT)

• min-, max-invariant

• not sum-invariant

sum-invariant for sets

• not avg-invariant

avg-invariant for sets without NULL’s

• Project: invariant for all distributive functions

• Select, Join: not invariant for most aggregation functions

• Union, Difference, Intersection: mostly not invariant

• Renaming: invariant for all aggregation functions

33

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Repairing Approaches1. Aggregation Transformation

Abstraction functionf

Aggregation functionΨf

Aggregation functionψ

Abstraction functionF

ψ(D)Ψf (f(D))

=F (ψ(D))

D f(D)

-

-

? ?

Problem: find the f -transformation Ψf of ψ

infeasible

34

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Repairing Approaches1. Aggregation Transformation

Abstraction functionf

Aggregation functionΨf

Aggregation functionψ

Abstraction functionF

ψ(D)Ψf (f(D))

=F (ψ(D))

D f(D)

-

-

? ?

Problem: find the f -transformation Ψf of ψ

infeasible

35

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Repairing Approaches2. Preparing Explicit Aggregation

g - ψ-supplement of f

∃ψ∗∀D ψ∗(g(f(D),D)) = ψ(D)

D

?

(f, g)

ψ ψ∗

g(f(D),D)-

ψ(D) = ψ∗(g(f(D),D))+

Supposition: g is based on the mapping of algebraic functions to dis-

tributive functions

36

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Observation Supporting the Supposition

supplement function which generates the occurrence number for each

value

for instance for

f = Bag2Set

• sum∗ = srec0,h,+ for h((v, n)) = n · v

• count∗ = srec0,h′,+for h((v, n)) = n , count = srec0,1,+

• avgnull0,1 =sumnull0

countnull1extended to

avgnull∗0,1 =sumnull∗0

countnull∗1

37

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Behavior of Aggregation Functions

Subset sensitivity: For given D, ψ and a selection condition α:

ψ(D) = ψ(σα(D))

All classical aggregation functions are subset-sensitive.

Critical operations: selection, join, intersection, difference

Object invariance: All objects are residing after application of the

operation.

Object-invariant functions are min- and max-invariant.

Value-set invariance: All values of the domain considered are re-

maining.

Set-value invariant functions are min- and max-invariant.

38

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Behavior within HierarchiesPoly-Hierarchy Lattices

generalizing the characteristic function h to Boolean lattices of classes

C1, ..., Cm:

L(Ci1 , ...Cin) - intersection Ci1 ,..., Cincounting based on the exclusion/inclusion property:

ψ(D) =∑m

i=1 ψ(Ci)−∑m−1

j=1

∑nk=j+1 ψ(L(Cj , Ck)) +−...

+(−1)r−1 ∑1≤j1<...<jr≤m ψ(L(Cj1 , ...Cjr )) + ....

+(−1)m−1ψ(L(C1, ..., Cm))

Distributive aggregation functions can be computed based on the ex-

clusion/inclusion principle.

If an abstraction function f is distributive with the exclusion/inclusion principle then f is ψ-

invariant for distributive aggregation functions.

39

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Algebraic Aggregation FunctionsHandling Through Distributive Functions

D

?

Aggregationfunction

Extendedaggregationfunction

Forgetfulfunctor

ψ(D)-

ψ∗(D)

3

Example: average

h1({s}) = (1, s)

h2((i, k)(j, l)) = (i+ j,i · k + j · l

i+ j)

40

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Supplements for Algebraic FunctionsGiven a database abstraction f and an algebraic aggregation function

ψ. If a ψ-supplement g of f exists then the function ψ∗ is distributive.

Collapseability along MVD Trees

Collapsing along MVD trees is invariant for distributive functions.

Collapsing along MVD trees is invariant for algebraic functions.

41

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The OLAP Cube: DefinitionThe cube is defined by

• a partially ordered set of dimensions D called lattice

({L1, ..., Ln, ALL},≼),

• a family of composite, equivalence-invariant and monotone func-

tions ancLi,Lj such that for each pair Li ≼ Lj in each dimension

the function ancLi,Lj maps each element of the domain dom(Li)

to an element of dom(Lj),

• a family of relationships descLi,Lj inverse to ancLi,Lj ,

• a cube schema (L1, ..., Ln,M1, ...,Mk)

with key {L1, ..., Lm},

• a set of aggregation functions

agg1(M1,1), ..., agg1(M

1,p1), ..., aggk(Mk,pk) and

• a cube algebra consisting of navigation, selection, projection and

split functions.

42

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Query Operations of the OLAP Cube

Basic drill-down functions: decomposing groups of data along a

hierarchy

refine grouping for one dimension Li ≽ L′i

values for fact values on M1, ...,Mk through ancLi,L

′i by decompo-

sition

require additivity property

Basic dice functions: projection

require additive property along Li ≽ L′i.

Basic slice functions: selection

Basic roll-up functions : opposite of the basic drill-down

require disjointness property

Additional functions: join functions, union functions, rotation or pivot-

ing functions, rename functions

43

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Some General ResultsRoll-up functions are neither sum-invariant nor avg-invariant in general.

Roll-up functions are not min- or max-invariant in general.

Rearrangement functions are min-, max-, sum- and avg-invariant.

Summarizability cannot be obtained for relations after application of

any subset-generating function.

Summarizing is correct for union of classes if an exclusion constraint

is valid for classes to which union is applied.

Summarizability cannot be maintained for functions after application

of which identification of objects is lost.

Summarizability cannot be obtained trough collapsing hierarchies.

44

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Some General ExtensionsSummarizability cannot be obtained moving up to generalizations of

classes.

Summarizability is maintained moving downwards through functional

dependencies.

Summarizability is correct on separting collapsing hierarchies if the

portion is used for the supplement.

Summarizability can be maintained on the basis of identification

properties of values.

45

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The OLTP-OLAP Specification FrameF = (A,O,Ψ,M)

Set A of aggregation functions

Set O of OLAP query operations

Set M of modeling assumptions

Set Ψ of properties

Cube C satisfies F

• OLTP-OLAP transformations are restricted to A

• fact domains fulfill Ψ

• modeling assumptions are valid for C

Domain types: precision and accuracy, granularity, and ordering

nominal, absolute, rank, ratio, atomar, complex, interval types

Domains: scales, classifications, default and neutral values

Domain values: extended by measures, e.g. relative, absolute, linear and non-linear

Domain values: transformed by casting functions to values of other domain types

46

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

OLTP-OLAP Modeling Assumptions

• Disjointness: OLTP-OLAP transformations restricted to group-

ings with disjoint groups

• Completeness: Groupings cover the entire set of database ob-

jects.

• P-subset invariance: Fact values are stable if the OLTP

database is restricted to objects based on the policy P.

• P-union invariance: Fact values are stable if the OLTP database

is extended by new objects depending on the policy P.

• Equidistance: Linear transformations

47

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

The Good Message�� ��EER Functions Enable Complete OLAP

• operations on views (selection as dice, projection as slice, general-

ized relational algebra)

• calculation functions, visualization functions

• group-by as roll-up (drill-up), un-nest as drill-down

• schema restructuring operation for unfold (e.g., drill-down), fold

(nest), classification

B. Thalheim, Entity-Relationship Modeling –

Foundations of Database Technology.

Springer 2000, 640pp.

48

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Modeling Assumptions: Drill-DownFunctions

Decomposing groups of data along a hierarchy, refine grouping

Observation 1.Drill-down functions are well defined if the cube construction is

based on disjointness and completeness modeling assumptions.

Observation 2.Drill-down functions are well defined if data granularity is guaranteed

at leaf level L1 and no structural null are used at any level Li (i > 1)

in between.

49

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Modeling Assumptions: Roll-Up Functions

used for combining groups along a hierarchy, i.e., for fusion of groups.

Problematic for collapsing hierarchiesespecially in the case of algebraic and holistic aggregation functions

Observation 3.Roll-up functions are only well-defined if data granularity is guaran-

teed at leaf level L1 and no structural null are used at any level Li(i > 1) in between.

Observation 4.Roll-up functions must be query-invariant, i.e. for the roll-up func-

tion f and the query function ψ:

ψ(x1, ...xn) = ψ(f(x1), ...., f(xn)) .

Observation 5.The Simpson paradox is observed if for groups at roll-up level Lp ≼Lr(∀j(f(Gij) < f(Gkj))) ⇒ f(

∑nj=1Gij) < f(

∑nj=1Gkj)

50

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Modeling Assumptions: Dice Functions

! projection, marginalization (statistics),∑

unions of values

unproblematic for distributive functions combinable with repairing functions

Observation 6.The Simpson paradox is observed if for groups at roll-up level Lp ≼ Lr

(∀j(f(Gij) < f(Gkj))) ⇒ f(∑n

j=1Gij) < f(∑n

j=1Gkj)

Observation 7.Dice functions can only correctly be applied if the cube construction is based

on union invariance, i.e. aggr(⊔∗o∈Gi

value(o)) = ⊔∗o∈Gi

(aggr(value(o))) for

groups Gi along dimensions.

If f is distributive then the ⊔∗ ≡ f .

If f is algebraic then repair functions must be applied.

Observation 8.Dice functions can only be used along dimensions for which constraints among

cube dimensions are not lost, i.e. Σ|π(X) |= π(X)(Σ).

Observation 9.Dice functions must be based on disjointness and completeness modeling as-

sumptions.

51

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Modeling Assumptions: Slice Functions

Requirement for applicability: query-invariance,

for the roll-up function f and the query function ψ:

ψ(x1, ...xn) = ψ(f(x1), ...., f(xn)) .

Observation 10.

Slice functions must be subset invariant.

Constraints invalidated by subset construction are those integrity con-

straints that have to be expressed through ∀∃-constraints, e.g., in-

clusion dependencies, multivalued dependencies, tuple-generating con-

straints.

52

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Modeling Assumptions: OLTP-OLAPTransformations

Modeling pragmatism, e.g.,

• instead of arithmetic average function: mean average, weighted average, root

mean square, average mean, moving average, weighted average

very sensitive to extreme values such as outliers and may be distorted by them

• arithmetic averages may be normalized by skew functions

Averages are measures of central tendency. The median is the middle-most value in

an ordered set.

Observation 11.Application of median instead of mean average functions for aggregation leads

to more stable OLAP query operations.Observation 12.Harmonic mean functions n∑n

i=11xi

are shift-invariant, additive, symmetric, con-

tinuous, and homogeneous.Observation 13.Geometric mean functions n√x1 · x2 · ... · xn provide a better picture for relative

scales among values and are OLAP query invariant.Observation 14.The sampling frequency must be based on the Sampling Theorem and the

generalizations of Nyquist.

53

ISE, CAU, D

Colloquium FU Berlin

May 12, 2005

B. Thalheim

Overview

State-of-the-art

Aggregation

OLAP cube

Schemata

Methodology

Concluding

Concept Topic

Content

Information

Concluding

• Aggregation functions and OLAP computation functions do not fit

to each other

• Aggregation functions are too powerful

• Variety of paradoxes

• Development of an application repository for OLAP applications

• Our solution: OLTP-OLAP spefication frame

• Derivation or proof of modeling assumptions

• Development of methodologies for OLAP applications

• Future research: Constraints and OLAP functions

• Future research: Theory of database aggregations and abstractions

54