University of Münster
My Favorite Issues in Data Warehouse Modeling
Jens Lechtenbörger
University of Münster & ERCIS, Germany
http://dbms.uni-muenster.de
Context
Data Warehouse (DW) modeling
• ETL design
• DW schema design
– Database design
– Methodical process in several phases
• Focus here: Conceptual schema design
DOLAP 2005, November 5 Jens Lechtenbörger 1
Outline
• Context
• Conceptual Modeling
• Meaning of Features
• Multidimensional Normal Forms
• Schema Versioning
• Conclusions
Conceptual Modeling (1/5)
• Conceptual representation of multidimensional scenario
– System- and implementation-independent
• No standard data model in sight
– Ad hoc
– E/R variants
– Object-oriented, based upon UML
• Specification of facts’ structure, i.e.,
– Relevant dimensions and their inner structure (→ dimension schema),
– Measures within their multidimensional contexts (→ fact schema)
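Conceptual Modeling (2/5): Fact Schema
[Figure: fact schema Transactions with measure #Transactions and dimensions Account, Branch, Time; levels shown: AccountID, CustID, CustType, Job, Branch, BranchID, City, Region, Day, Month, Quarter, Year; annotations "CustType: Person" and "CustType: Company"]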
Conceptual Modeling (3/5): Meaning of Fact Schema
• Universal relation
• Universal relation schema assumption (URSA): Semantics of attribute tied to its name
• Defining dimension levels form key
• Each arc represents functional dependency (FD)
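The reading of arcs as FDs over a universal relation can be made concrete with a small check. The following is a minimal Python sketch, assuming the universal relation is given as a list of dicts; the helper name `holds_fd` and the sample rows are illustrative and not taken from the talk.

```python
from typing import Dict, Iterable, Sequence

Row = Dict[str, str]

def holds_fd(rows: Iterable[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for a key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical universal relation for a Location dimension
rows = [
    {"City": "ci1", "Region": "r1", "Country": "co1"},
    {"City": "ci2", "Region": "r1", "Country": "co1"},
    {"City": "ci3", "Region": "r2", "Country": "co1"},
]
print(holds_fd(rows, ["City"], ["Region"]))   # True
print(holds_fd(rows, ["Region"], ["City"]))   # False
```

An arc City → Region in a dimension schema then corresponds to `holds_fd(rows, ["City"], ["Region"])` being true for every admissible instance.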
Conceptual Modeling (4/5): Some Features
(Incomplete list)
• Standard Features
– Fact schema represents M:N relationship among dimensions
– Arc in dimension schema represents M:1 relationship, i.e., FD
• Typical Features (some with challenges for summarizability)
– M:N relationships among dimension levels (non-strict hierarchies)
– Alternative and parallel paths, possibly including joining levels
– Optional levels allowing NULL values (heterogeneous, unbalanced, non-onto hierarchies)
Conceptual Modeling (5/5): Guidelines
• A rich set of features is good
• A set of guidelines for their proper use is even better
• Let’s consider above typical features in turn
Meaning of Features: M:N relationships (1/4)
• M:N relationships are generally implicitly understood
• Consider levels Day and City
– Many cities exist at a given day
– A city exists for many days
• There is no need to model this M:N relationship (if we don’t do history)
Meaning of Features: M:N relationships (2/4)
Consider geographical levels City, Region, State, Country
• One Region per City, i.e., City → Region
• M:N between Region and State, i.e., Region ←→ State
• One Country per State, i.e., State → Country
[Figure: dimension schema Location with levels City, Region, State, Country, All]
Legal instance:
City  Region  State  Country
ci1   r1      s1     co1
ci1   r1      s2     co2
• City and State are in M:N relationship.
• Probably not intended. Different dimension schema needed.
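To see why this instance is a problem, one can compute which attribute pairs are functional in it. A minimal Python sketch using the two rows of the legal instance; the helper names `images` and `is_functional` are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# The legal instance from the slide: (City, Region, State, Country)
COLUMNS = ("City", "Region", "State", "Country")
instance = [
    ("ci1", "r1", "s1", "co1"),
    ("ci1", "r1", "s2", "co2"),
]

def images(rows, source, target):
    """Map each source value to the set of target values it co-occurs with."""
    s, t = COLUMNS.index(source), COLUMNS.index(target)
    result = defaultdict(set)
    for row in rows:
        result[row[s]].add(row[t])
    return dict(result)

def is_functional(rows, source, target):
    """True if every source value has exactly one target value."""
    return all(len(v) == 1 for v in images(rows, source, target).values())

print(is_functional(instance, "City", "Region"))    # True
print(is_functional(instance, "State", "Country"))  # True
print(is_functional(instance, "City", "State"))     # False
print(is_functional(instance, "City", "Country"))   # False
```

All declared FDs hold, yet city ci1 rolls up to two states and two countries, which is the unintended M:N relationship noted above.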
Meaning of Features: M:N relationships (3/4)
[Figure: dimension schema Location with levels City, Region, State, Country, All; Region and State drawn side by side]
• Implicit M:N relationship
• No problems with summarizability
• Guideline
– Avoid “M:N arcs” within dimensions
– Joint work with Bodo Hüsemann and Gottfried Vossen, DMDW 2000
∗ Synthesize fact schemata
∗ Follow FDs to build dimension schemata
– Side remark: Bridge tables of Kimball et al. arise automatically as fact schemata
Meaning of Features: M:N relationships (4/4)
However
• Maybe there was a reason to place State above Region
• Roll-Up like change in granularity
– In general, regions fit into state boundaries
– But not always
• Then, add a new type of “M:N navigational arc”
– This is not Roll-Up!
[Figure: dimension schema Location with levels City, Region, State, Country, All]
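Meaning of Features: Joining Levels (1/5)
[Figure: three variants of the Location dimension. Left: levels City, Region, State, Country, All, with Region and State side by side and a single level Country. Middle: levels City, Region, State, RCountry, SCountry, All. Right: object-oriented variant with City, Region, State, Country, All and association multiplicities 1 and 1..*]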
Meaning of Features: Joining Levels (2/5)
Semantics of schema definable via admissible instances.
Consider City c in Region r and State s.
• With universal relations, admissible instances are tables that satisfy FDs
– For left schema, by transitivity of FDs Country of r must be equal to Country of s
• With objects, associations are implemented via references
– Object c has references to r and s
– Objects r and s each have exactly one reference to a country object
– That object for r may be distinct from the one of s
• Thus, left schema on previous slide has different meaning than other two, whose meaning is the same
Meaning of Features: Joining Levels (3/5)
It’s even worse...
• Consider a 3NF implementation of left schema
– Tables for City, Region, State, Country
– Table for City has foreign keys to tables for Region, State
– Tables for Region and State each have a foreign key to table for Country
∗ Those foreign keys need not be “in sync”
• Thus, again a city may wind up in two countries
– Star and snowflake schemata have different semantics!
• What does your favorite OLAP tool do?
• Gap in relational theory. Research in progress.
• Guideline: Use handwritten code to maintain consistency. Be careful!
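The kind of handwritten consistency check meant here might look as follows. This is only a sketch: the normalized tables are modeled as Python dicts, and all table contents and the function name `inconsistent_cities` are hypothetical.

```python
# Hypothetical normalized tables, modeled as dicts: key -> foreign keys
city = {"c": {"region": "r", "state": "s"}}
region = {"r": {"country": "co1"}}
state = {"s": {"country": "co2"}}   # not in sync with region "r"

def inconsistent_cities(city, region, state):
    """Cities whose country via Region differs from their country via State."""
    bad = []
    for name, fks in city.items():
        via_region = region[fks["region"]]["country"]
        via_state = state[fks["state"]]["country"]
        if via_region != via_state:
            bad.append((name, via_region, via_state))
    return bad

print(inconsistent_cities(city, region, state))  # [('c', 'co1', 'co2')]
```

Foreign key constraints alone accept this data; only the extra comparison of the two paths detects that city c winds up in two countries.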
Meaning of Features: Joining Levels (4/5)
Reuse of levels is different from joining
[Figure: fact schema Sales with measure Amount and dimensions Product (ProdID), Customer (CustID), Supplier (SuppID); Customer and Supplier share a single level City, with further levels Region, State, Country]
Here, customer and supplier must be in the same city
Meaning of Features: Joining Levels (5/5)
Reuse of levels is different from joining
[Figure: fact schema Sales with measure Amount and dimensions Product (ProdID), Customer (CustID), Supplier (SuppID); levels CCity and SCity, each annotated with [City]]
Notice: New notation
Meaning of Features: Parallel vs Alternative Paths (1/5)
Parallel paths allow levels from different paths in single Group-By clause, e.g.:
[Figure: dimension Location with levels City, Region, State, Country, All, and dimension Time with levels Day, Week, Month, Quarter, Year, All]
Meaning of Features: Parallel vs Alternative Paths (2/5)
Observations on parallel paths
• Including levels from more than one path increases level of detail
– E.g., grouping by Week and Month is OK
• Guideline: There are fewer problems than you might have thought
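The observation on Week and Month can be tried out on a toy fact table. A minimal Python sketch, assuming one hypothetical transaction per day over two weeks that straddle a month boundary; ISO weeks stand in for the Week level, which is an assumption of this example.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical daily facts: one transaction per day, 24 Jan to 6 Feb 2005
days = [date(2005, 1, 24) + timedelta(days=i) for i in range(14)]
facts = [(d, 1) for d in days]

def month_of(d):
    return (d.year, d.month)

def week_of(d):
    iso = tuple(d.isocalendar())      # (ISO year, ISO week, ISO weekday)
    return iso[:2]

def group(facts, key):
    """Sum the measure per group, like SELECT key, SUM(m) ... GROUP BY key."""
    totals = Counter()
    for day, measure in facts:
        totals[key(day)] += measure
    return dict(totals)

by_month = group(facts, month_of)
by_week = group(facts, week_of)
by_both = group(facts, lambda d: (month_of(d), week_of(d)))

print(len(by_month), len(by_week), len(by_both))  # 2 2 3
```

Grouping by both levels simply yields a finer partition: rolling `by_both` up along either component reproduces `by_month` or `by_week`, so no measure value is lost or counted twice.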
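Meaning of Features: Parallel vs Alternative Paths (3/5)
Alternative paths require exclusive choice, e.g.:
[Figure: dimension Customer with levels CustID, CustType, Job, Branch, All and sample instances (CustType values Person and Company; customers P1, P2, P42042, C1; Job values such as Artist and Zoo director; Branch value Airline; null entries); arcs annotated with the context dependencies "CustType: Person" and "CustType: Company"]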
Grouping by Job and Branch is inconsistent
Meaning of Features: Parallel vs Alternative Paths (4/5)
Observations on alternative paths
• Alternative paths usually arise from optional levels
• Use context dependencies to explain presence of structural NULLs
• Or more complex dimension constraints
– Hurtado and Mendelzon, PODS 2002
• Guideline: Avoid/explain optional levels.
– Notice: Subclassing in object-oriented models expresses context dependencies
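A context dependency can be read as a rule stating when a structural NULL is expected. The sketch below assumes, for illustration only, that Job is defined exactly for customers with CustType Person and Branch exactly for those with CustType Company; the class and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional

class Customer(NamedTuple):
    cust_id: str
    cust_type: str            # "Person" or "Company"
    job: Optional[str]        # assumed to be defined only for persons
    branch: Optional[str]     # assumed to be defined only for companies

def violates_context(c: Customer) -> bool:
    """True if a NULL (or its absence) is not explained by CustType."""
    job_ok = (c.job is not None) == (c.cust_type == "Person")
    branch_ok = (c.branch is not None) == (c.cust_type == "Company")
    return not (job_ok and branch_ok)

customers: List[Customer] = [
    Customer("P1", "Person", "Artist", None),
    Customer("C1", "Company", None, "Airline"),
    Customer("X1", "Company", "Artist", None),   # hypothetical bad row
]
print([c.cust_id for c in customers if violates_context(c)])  # ['X1']
```

With such rules in place, a tool could reject a query that groups by Job and Branch together, since no customer has both levels defined.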
Meaning of Features: Parallel vs Alternative Paths (5/5)
[Figure: Customer dimension structured via classes Customer, Person, Company; levels shown: CustID, CustType, Job, Branch, Legal Form, Capital C., Subs. Capital, Business P., All]
Multidimensional Normal Forms (1/4)
Joint work with Gottfried Vossen: Multidimensional Normal Forms for Data Warehouse Design, Information Systems, 2003
• Three multidimensional normal forms (MNFs)
• 1MNF based on analysis of FDs
• 2MNF requires context dependencies for optional levels
• 3MNF places restrictions upon context dependencies
Multidimensional Normal Forms (2/4)
Implications of 1MNF
• Faithful representation of the application domain
• Completeness w.r.t. the application domain
• Avoidance of redundancies
• Avoidance of M:N relationships
Multidimensional Normal Forms (3/4)
Implications of 2MNF and 3MNF
• Explanation for structural NULLs allows
– context-sensitive summarizability
– avoidance of contradictory queries
• Relational implementation of class hierarchies within dimensions without structural NULLs possible
• Avoidance of alternative paths
Multidimensional Normal Forms (4/4)
Final remarks concerning 2MNF and 3MNF
• Both rely on purely relational techniques
• For object-oriented models considerable simplifications possible
– Disallow optional levels
– Construction (see paper in Information Systems mentioned above)
∗ As long as optional level l exists, introduce further sub-classes
∗ One with l, now mandatory
∗ The other without l
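One possible reading of this construction is sketched below in Python. A class is modeled simply as a set of level names, and the function name `eliminate_optional` and the sample levels are assumptions of this example rather than the procedure from the paper.

```python
def eliminate_optional(mandatory, optional):
    """Split a class until no optional level is left.

    Returns one frozenset of (now mandatory) levels per resulting sub-class.
    """
    pending = [(frozenset(mandatory), tuple(optional))]
    done = []
    while pending:                      # as long as an optional level exists
        levels, remaining = pending.pop()
        if not remaining:
            done.append(levels)
            continue
        l, rest = remaining[0], remaining[1:]
        pending.append((levels | {l}, rest))   # sub-class with l, now mandatory
        pending.append((levels, rest))         # sub-class without l
    return done

classes = eliminate_optional({"CustID", "CustType"}, ["Job", "Branch"])
for c in classes:
    print(sorted(c))
```

With k optional levels this naive version produces 2^k sub-classes; in practice, domain knowledge such as context dependencies would presumably rule out many of them (for example a customer with both Job and Branch).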
Schema Versioning (1/14)
Joint work with Matteo Golfarelli, Stefano Rizzi, Gottfried Vossen.
Schema Versioning in Data Warehouses: Enabling Cross-Version Querying via Schema Augmentation. To appear in Data & Knowledge Engineering.
Challenges
• Storage of historical data under changing business requirements
• Non-volatility, in particular consistent re-execution of old queries
Our proposal
• Maintenance of history of schema versions
• Simple graph model representing core of multidimensional models
• Schema augmentation to represent new schema information on old data
• Schema intersection to answer cross-version queries
Schema Versioning (2/14)
[Figure: schema graph for Shipment with attributes Part, Type, Brand, Size, Container, Category, Customer, City, Region, Nation, SaleDistrict, Deal, Incentive, Allowance, Terms, ShipMode, Type, Carrier, Date, Month, Year and measures QtyShipped, ShippingCostsDM]
Schema Versioning (3/14)
At t1 = 1/1/2003, the schema undergoes a major revision.
1. The temporal granularity changes from Date to Month.
2. A classification into Subcategories is added to part hierarchy.
3. A new constraint in customer hierarchy states that SaleDistricts belong to Nations.
4. The Incentive is independent of shipment Terms.
At t2 = 1/1/2004, another version is created.
1. New measures ShippingCostsEU and ShippingCostsLIT are added.
2. The ShipMode dimension is deleted.
3. A ShipFrom dimension is added.
4. A descriptive attribute PartDescr is added to Part.
Schema Versioning (4/14)
[Figure: resulting schema graph for Shipment with attributes Part, PartDescr, Type, Brand, Size, Container, Category, Subcategory, Customer, City, Region, Nation, SaleDistrict, Deal, Incentive, Allowance, Terms, ShipFrom, Month, Year and measures QtyShipped, ShippingCostsDM, ShippingCostsEU, ShippingCostsLIT]
Schema Versioning (5/14)
[Figure: resulting schema graph for Shipment, including Subcategory, PartDescr, ShipFrom, ShippingCostsEU, ShippingCostsLIT]
Three sample query challenges:
• Compute the total quantity of each part category shipped from each warehouse to each customer nation since July 2002.
• Drill down from Category to Subcategory
• Drill down from Nation to SaleDistrict
Schema Versioning (6/14): Schema Modification (1/4)
Four schema modification operations on schema graph
• AddA() to add a new attribute
• DelA() to delete an existing attribute
• AddF() to add an arc involving existing attribute
• DelF() to remove an existing arc
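The four operations can be sketched on a very simple graph representation. The Python class below is an illustration under stated assumptions, not the formal model of the paper: attributes are strings, arcs are (source, target) pairs read as FDs, and `del_a` reconnects the predecessors of the deleted attribute to its successors (an assumption made here so that, for example, Month stays reachable once Date is deleted).

```python
class SchemaGraph:
    """Toy schema graph: a set of attributes plus a set of arcs (FDs)."""

    def __init__(self, attributes=(), arcs=()):
        self.attributes = set(attributes)
        self.arcs = set(arcs)            # (source, target) pairs

    def add_a(self, a):
        """AddA: add a new, initially unconnected attribute."""
        self.attributes.add(a)

    def del_a(self, a):
        """DelA: delete an attribute together with its arcs."""
        preds = {s for (s, t) in self.arcs if t == a}
        succs = {t for (s, t) in self.arcs if s == a}
        self.arcs = {(s, t) for (s, t) in self.arcs if a not in (s, t)}
        # Assumption: bridge the gap left by the deleted attribute
        self.arcs |= {(p, q) for p in preds for q in succs}
        self.attributes.discard(a)

    def add_f(self, source, target):
        """AddF: add an arc between existing attributes."""
        if not {source, target} <= self.attributes:
            raise ValueError("both ends of an arc must be existing attributes")
        self.arcs.add((source, target))

    def del_f(self, source, target):
        """DelF: remove an existing arc."""
        self.arcs.discard((source, target))

# Illustrative fragment of the time hierarchy
g = SchemaGraph(
    {"Shipment", "Date", "Month", "Year"},
    {("Shipment", "Date"), ("Date", "Month"), ("Month", "Year")},
)
g.del_a("Date")
print(sorted(g.arcs))  # [('Month', 'Year'), ('Shipment', 'Month')]
```

Raising an error in `add_f` for unknown attributes mirrors the requirement that AddF() involves existing attributes; a new attribute has to be introduced with `add_a` first.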
Schema Versioning (7/14): Schema Modification (2/4)
Consider again
[Figure: initial schema graph for Shipment, including Date, Month, Year, ShipMode, Carrier]
First goal: Delete Date
Schema Versioning (8/14): Schema Modification (3/4)
Result of DelA(Date)
[Figure: schema graph for Shipment without Date; Month and Year remain]
Next goal: Insert Subcategory below Category
Schema Versioning (9/14): Schema Modification (4/4)
Result of AddA(Subcategory)
[Figure: part hierarchy of Shipment with attributes Part, Type, Brand, Container, Size, Category and the new attribute Subcategory]
Result of AddA(Subcategory), AddF(Type → Subcategory)
[Figure: part hierarchy of Shipment with attributes Part, Type, Brand, Container, Size, Category, Subcategory]
Result of AddA(Subcategory), AddF(Type → Subcategory), AddF(Subcategory → Category)
[Figure: part hierarchy of Shipment with attributes Part, Type, Brand, Container, Size, Subcategory, Category]
Schema Versioning (10/14): Schema Augmentation (1/2)
Previous schema versions associated with augmented schemata
• Previous schema computable via projection from augmented one
• Designer chooses to add information to augmented schemata based on current schema modification, e.g.,
– old data enriched with new attributes, e.g., Subcategory
– more constraints expressed on old data, e.g., SaleDistrict → Nation
• Augmented schemata used by querying subsystem
Schema Versioning (11/14): Schema Augmentation (2/2)
Element              Condition                       Augm. action
A ∈ Diff+_A(S, S′)   A is measure                    estimate values for A
                     (E → A) ∈ F′, A is dimension    disaggregate measure values
                     A is derived measure            compute values for A
                     (E → A) ∉ F′, A is property     consistently add values for A
f ∈ Diff+_F(S, S′)   -                               check if f holds
Schema Versioning (12/14): Cross-version Querying (1/3)
General idea: Formulation context for OLAP query is a schema graph
• Intersection of schema versions is the largest schema for uniform querying
• Query can be answered if formulation context is sub-graph of intersection
• More precisely, augmented schemata instead of real versions
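The answerability test can be illustrated with plain set operations. The Python sketch below is a deliberate simplification: a schema graph is a pair (attributes, arcs), intersection is computed component-wise without any reasoning about implied FDs, and the two tiny versions are hypothetical fragments rather than the full Shipment schema.

```python
from typing import FrozenSet, Tuple

Arc = Tuple[str, str]
Graph = Tuple[FrozenSet[str], FrozenSet[Arc]]

def graph(attrs, arcs) -> Graph:
    return frozenset(attrs), frozenset(arcs)

def intersection(g1: Graph, g2: Graph) -> Graph:
    # Simplification: component-wise intersection, ignoring implied FDs
    return g1[0] & g2[0], g1[1] & g2[1]

def answerable(context: Graph, common: Graph) -> bool:
    """True if the formulation context is a sub-graph of the common schema."""
    return context[0] <= common[0] and context[1] <= common[1]

# Hypothetical fragments of an old and a new schema version
v_old = graph({"Shipment", "Part", "ShipMode"},
              {("Shipment", "Part"), ("Shipment", "ShipMode")})
v_new = graph({"Shipment", "Part", "ShipFrom"},
              {("Shipment", "Part"), ("Shipment", "ShipFrom")})
# Old version augmented with the new dimension ShipFrom
v_old_augmented = graph(v_old[0] | {"ShipFrom"},
                        v_old[1] | {("Shipment", "ShipFrom")})

query = graph({"Shipment", "Part", "ShipFrom"},
              {("Shipment", "Part"), ("Shipment", "ShipFrom")})

print(answerable(query, intersection(v_old, v_new)))            # False
print(answerable(query, intersection(v_old_augmented, v_new)))  # True
```

The query spanning both versions becomes answerable only once the old version has been augmented with ShipFrom, which is why augmented schemata rather than the real versions are intersected.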
Schema Versioning (13/14): Cross-version Querying (2/3)
[Figure: schema graph for Shipment with attributes Part, Type, Brand, Size, Container, Category, Subcategory, Customer, City, Region, Nation, SaleDistrict, Deal, Incentive, Allowance, Terms, ShipFrom, Month, Year and measures QtyShipped, ShippingCostsDM]
Compute the total quantity of each part category shipped from each warehouse to each customer nation since July 2002.
Schema Versioning (14/14): Cross-version Querying (3/3)
Observations
• Query well-formulated only if ShipFrom augmented
• Drilling down from Category to Subcategory only if subcategories established also for 2002 data
• Drilling down from Nation to SaleDistrict only if FD from sale districts to nations also satisfied before 2003.
Conclusions (1/3)
Summary
• FDs help in data warehouse design
• Meaning and potential of multidimensional features sometimes underspecified
• Sub-classing helps to structure multidimensional schemata
• Versioning with cross-version querying is feasible
Conclusions (2/3)
• Schema versioning offers further potential
– What-if analysis
– Horizontal benchmarking
• Open issue: Generalization to hyper-graphs (cross-dimensional attributes, derived measures)
Conclusions (3/3)
There’s more...
• Taking full advantage of rich models
• Transformations of conceptual to logical models for ETL
– Alkis Simitsis: Mapping Conceptual to Logical Models for ETL Processes. DOLAP 2005
• More generally, model-driven design
– Jose-Norberto Mazon et al.: Applying MDA to the Development of Data Warehouses. DOLAP 2005
• Where do the requirements come from?
– Paolo Giorgini et al.: Goal-oriented requirement analysis for data warehouse design. DOLAP 2005
http://dbms.uni-muenster.de
Thank you for your attention!