30
From Spreadsheets to Relational Databases and Back acome Cunha Jo˜ ao Saraiva Joost Visser Departamento de Inform´ atica SIG & CWI Universidade do Minho The Netherlands Portugal PEPM’09, January, 19-20

From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

From Spreadsheets to Relational

Databases and Back

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

PEPM’09, January, 19-20

Page 2: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.1

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Property Renting System

From Spreadsheets to Relational Databases and Back

Each row represents a renting transaction

It includes information about owners, clients, properties

It also contains dates and prices (formulas)

Row 3 “says” that john has rented a property owned by tony, with

address 5 Novar Dr., price per day 70, between dates 1/7/00 and

8/31/01 and paid 25550 for it

Page 3: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.2

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Redundancy

From Spreadsheets to Relational Databases and Back

This unstructured model is valid and serves its purpose

But, it contains data redundancy

For example, the rent per day is repeated several times

Page 4: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.3

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Updating Problems

From Spreadsheets to Relational Databases and Back

As a result, updates can cause data inconsistence

For example, updating the renting value of property pg4 must be

performed in several places

Page 5: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.4

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Deleting Problems

From Spreadsheets to Relational Databases and Back

Deleting rows can be problematic, too

For example, deleting the row 5 will remove all the information

about property pg36

Page 6: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.5

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

A Database Solution

From Spreadsheets to Relational Databases and Back

ClientclientNo VARCHAR(256) PcName VARCHAR(256)

RentingclientNo VARCHAR(256) FPpropertyNo VARCHAR(256) FPrentStart VARCHAR(256) PrentFinish VARCHAR(256) PownerNo VARCHAR(256) FPtotal_rent VARCHAR(256)totalDays VARCHAR(256)

OwnerownerNo VARCHAR(256) PoName VARCHAR(256) U

PropertypropertyNo VARCHAR(256) PpAddress VARCHAR(256)rentPerDay VARCHAR(256)oName VARCHAR(256) F

A normalized (3NF) data representation

No redundancy

No update/delete problems

Page 7: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.6

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

A Database Solution

From Spreadsheets to Relational Databases and Back

A normalized (3NF) data representation

No redundancy

No update/delete problems: the price per day occurs only once

Tables can be stored in different sheets

Page 8: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

1.6

Motivation

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

A Database Solution

From Spreadsheets to Relational Databases and Back

We can use information computed during normalization to pro-

duce a refactored spreadsheet that helps users to introduce cor-

rect data

Spreadsheet warns the user when errors are introduced

Page 9: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.0

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

From Spreadsheets to Relational Databases and Back

s : S Ti

FDs CFDs

fun

to

from choosePKs

synthesize

S spreadsheet schema (header)s spreadsheet data (remaining rows)CFDs compound functional dependenciesT relational database schemai referential integrity constraintsto data conversion from RDB to SSfrom data conversion from SS to RDB

Infer functional dependencies (Fun)

Normalize (3NF) those functional dependencies (synthesize)

Create a Relational Database (RDB) schema (3NF)

Use data refinements to perform the data migration (to and from)

HaExcel combines all these steps creating a complete system

Page 10: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.1

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Calculate FDs

From Spreadsheets to Relational Databases and Back

A Functional Dependency (FD) denoted A ⇀ B means that an

element of A is uniquely associated with an element of B

The Fun algorithm (Novelli et. al.) calculates the FDs from data

For our example, the Haskell implementation returns:

ownerNo ⇀ oName

totalDays ⇀ clientNo, cName

propertyNo ⇀ pAddress, rentPerDay , ownerNo, oName

...

Formulas also induce FDs:

C0 = f(X0, . . . , Xn) induces X0, . . . , Xn ⇀ C0

Page 11: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.2

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

A Few Concepts

From Spreadsheets to Relational Databases and Back

Each tuple in a DB table is uniquely identified by a set of attri-

butes called Primary Key (PK)

There may be more then one set suitable for becoming PK. They

are designated Candidate Keys (CK)

First Normal Form (1NF) is respected if each element of all tuples

are atomic values

Second Normal Form (2NF) is respected if the 1NF is respected

and its non-key attributes are not functionally dependent on part

of the key

Third Normal Form (3NF) is respected if the 2NF is respected and

if the non-key attributes are only dependent on the key attributes

Page 12: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.3

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Normalize FDs

From Spreadsheets to Relational Databases and Back

Synthesize algorithm (Maier) calculates 3NF set of FDs

It returns a set of attributes with candidate keys

It is necessary to choose one CK to be the primary key

To produce a schema with the lossless decomposition property

one must add a FD (Maier):

{all atts } \ {atts rep formulas }⇀ {new att , atts rep formulas }

For our example:

clientNo, propertyNo, cName, pAddress, rentStart , rentFinish,

rentPerDay , ownerNo, oName ⇀ totalDays, total rent , new att

Page 13: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.4

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Calculate a RDB Schema

From Spreadsheets to Relational Databases and Back

Fun produces too many, not necessary and redundant FDs

Also, attributes representing columns with formulas can not be

used in a PK

We use Synthesize to produce a 3NF set of attributes/CK

Its input is the result from Fun without the FDs with formulas

in PKs and the FD that guarantees the losslees decomposition

property

We choose the smallest CK to be the PK

Page 14: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

2.5

Our Approach

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

From Spreadsheets to Relational Databases and Back

For our example,

owners

ownerNo ⇀ oName

clients

clientNo ⇀ cName

properties

propertyNo ⇀ pAddress, rentPerDay , oName

renting

clientNo, propertyNo, rentStart , rentFinish, ownerNo ⇀ total rent , totalDays

Page 15: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.1

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Data Refinements

From Spreadsheets to Relational Databases and Back

We need to migrate data from spreadsheets to the relational

model

We use data refinements:

A[rr]−tooo 6 B[ll]−fromppaaaaaaaaaaaaaaaaaaaaaaa

to : A→ B is an injective function;from : B → A a surjective function;from · to = idA (identity function on A);

B is a refinement of A witnessed by the to and from functions

In the case where the refinement works in both directions we have

an isomorphism A ∼= B

Page 16: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.1

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Data Refinements

From Spreadsheets to Relational Databases and Back

Composition

if A[rr]−tooo 6 B[ll]−fromppaaaaaaaaaaaaaaaaaaaaaaa and B[rr]−to′oo 6 C[ll]−from′

ppaaaaaaaaaaaaaaaaaaaaaaaa then A[rr]−from · from′oo 6 C[ll]−to′ · toppaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Propagation

if A[rr]−tooo 6 B[ll]−fromppaaaaaaaaaaaaaaaaaaaaaaa then FA[rr]−Ftooo 6 FB[ll]−Ffrom

ppaaaaaaaaaaaaaaaaaaaaaaaaaa

Page 17: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.2

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Examples: Hierarchical-Relational Data Mapping

From Spreadsheets to Relational Databases and Back

A? 6 N ⇀ A List elimination2A ∼= A ⇀ 1 Set eliminationA? ∼= 1 ⇀ A Optional elimination

A+B 6 A?×B? Sum eliminationA× (B + C) ∼= (A×B) + (A× C) Distribute product over sumA ⇀ (B + C) 6 (A ⇀ B)× (A ⇀ C) Distribute map over sum (range)(B + C) ⇀ A ∼= (B ⇀ A)× (C ⇀ A) Distribute map over sum (domain)

Page 18: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.3

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Data Refinements With Constraints

From Spreadsheets to Relational Databases and Back

To represent database constraints we use type constraints

Associate a constraint φ to a type A: Aφ where φ : A→ Bool

Add a second constraint to a constrained datatype: (Aφ)ψ∼= Aφ∧ψ

Functorial pull: F(Aφ)∼= (FA)(Fφ)

For example, a constraint on the elements of a list can be pulled

up to a constraint on the list: (Aφ)∗ ∼= (A∗)listφ

Page 19: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.3

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Data Refinements With Constraints

From Spreadsheets to Relational Databases and Back

Refining a constrained datatype (source)

if A[rr]−tooo 6 B[ll]−fromppaaaaaaaaaaaaaaaaaaaaaaa then Aφ[rr]

−tooo 6 Bφ·from[ll]−fromppaaaaaaaaaaaaaaaaaaaaaaaaa

A law that introduces a constraint to a constrained datatype

if A[rr]−tooo 6 Bψ[ll]−from

ppaaaaaaaaaaaaaaaaaaaaaaa then Aφ[rr]−tooo 6 (Bψ)φ·from[ll]−from

ppaaaaaaaaaaaaaaaaaaaaaaaaa∼= Bψ∧(φ·from)

Page 20: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.4

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Constraint-Aware Rewriting

From Spreadsheets to Relational Databases and Back

There is an Haskell implementation of data refinement called 2LT

– Two Level Transformation (http://code.google.com/p/2lt)

We need a type-safe representation for types, so we use a GADT

data Type t where

Int :: Type Int

[·] :: Type a → Type [a ]

· × · :: Type a → Type b → Type (a, b)

Func :: Type a → Type b → Type (a → b)

(·)· :: Type a → PF (Pred a) → Type a (Aφ)

...

type Pred a = a → Bool

Page 21: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.5

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Constraint-Aware Rewriting

From Spreadsheets to Relational Databases and Back

For the function representation we also use a GADT

data PF a where

π1 :: PF ((a, b) → a)

δ :: PF ((a ⇀ b) → Set a)

·? :: PF (a → b) → PF ([a ] → [b ])

· × · :: PF (a → b) → PF (c → d) → PF ((a, c) → (b, d))

...

Function representations can be evaluated to the function that is

represented:

eval :: Type (a → b) → PF (a → b) → a → b

Page 22: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.5

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Constraint-Aware Rewriting

From Spreadsheets to Relational Databases and Back

Data refinement rules in Haskell

type Rule = ∀ a . Type a → Maybe (View (Type a))

data View a where

View :: Rep a b → Type b → View (Type a)

data Rep a b = Rep{to = PF (a → b), from = PF (b → a)}

Page 23: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.5

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Constraint-Aware Rewriting

From Spreadsheets to Relational Databases and Back

Rewriting composition operators

nop :: Rule -- identity

. ::Rule → Rule → Rule -- sequential composition

� ::Rule → Rule → Rule -- left-biased choice

many :: Rule → Rule -- repetition

once :: Rule → Rule -- arbitrary depth rule application

Page 24: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.6

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Refining a Relational Table to a Spreadsheet Table

From Spreadsheets to Relational Databases and Back

A ⇀ B

Table2sstable

$$

6 (A×B)?fd

Sstable2table

bb

data PF a where

...

Sstable2table :: PF ([(a, b)] → (a ⇀ b))

Table2sstable :: PF ((a ⇀ b) → [(a, b)])

table2sstable :: Rule

table2sstable (a ⇀ b) = return (View rep [a × b]fd)

where

rep = Rep{to = Table2sstable, from = Sstable2table }

Page 25: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.7

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Refining Relational Tables where a PK Contains a FK

From Spreadsheets to Relational Databases and Back

((A ⇀ B)× (C ⇀ D))πA◦δ◦π1⊆πC◦δ◦π2[dd]−Table2sstable× Table2sstableoo

6

((A×B)?fd × (C ×D)?fd)πA◦list2set◦π?1◦π1⊆πC◦list2set◦π?1◦π2[uu]−Sstable2table× Sstable2table

OO

A Foreign Key (FK) is a set of attributes within one relation that

matches the PK of some relation

Page 26: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.8

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

Refining Tables where Non-Key Attributes Contain FKs

From Spreadsheets to Relational Databases and Back

((A ⇀ B)× (C ⇀ D))πB◦ρ◦π1⊆πC◦δ◦π2[dd]−Table2sstable× Table2sstableoo

6

((A×B)?fd × (C ×D)?fd)πB◦list2set◦π?2◦π1⊆πC◦list2set◦π?1◦π2[uu]−Sstable2table× Sstable2table

OO

Page 27: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

3.9

From a Spreadsheet to a Database

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

The Strategy

From Spreadsheets to Relational Databases and Back

rdb2ss :: Rule

rdb2ss = simplifyInv B

(many (aux tables2table)) B

(many ((aux tables2sstables) B (aux tables2sstables ′))) B

(many (aux table2sstable))

where

aux r = ((once r) B simplifyInv) B ((many (once r)) B simplifyInv)

Page 28: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

4.0

The HaExcel Framework

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

From Spreadsheets to Relational Databases and Back

HaExcel is a framework to manipulate and transform spreadsheets

Haskell library: generic and reusable library to transform spre-

adsheets into RDBs and back

Front-ends: Excel/Gnumeric (XML) formats can be imported and

exported

Tools:

an online tool, available at

http://haskell.di.uminho.pt/jacome/index.html

a batch tool, compilable from sources available at

http://haskell.di.uminho.pt/websvn

Page 29: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

5.0

Conclusions

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

From Spreadsheets to Relational Databases and Back

We extended the 2LT framework with new constraint-aware two-

level data refinements

We shown how these rules can be combined into a strategic re-

write system

We constructed HaExcel that transforms a relational schema into

a tabular schema and back

Our techniques also migrate the formulas between the models

We connected importers and exporters for SQL and spreadsheet

formats

This allows refactoring of spreadsheets to reduce data redun-

dancy and improve error detection and migration of spreadsheet

applications

An Excel plugin that implements HaExcel is under development

Page 30: From Spreadsheets to Relational Databases and Backalfa.di.uminho.pt/~jacome/down/pepm09_talk.pdfCreate a Relational Database (RDB) schema (3NF) Use data refinements to perform the

6.0

Questions?

Jacome Cunha Joao Saraiva Joost Visser

Departamento de Informatica SIG & CWIUniversidade do Minho The Netherlands

Portugal

From Spreadsheets to Relational Databases and Back

Questions?