amanjot-singh
Dimensional Modelling
Dimensional modeling has two basic concepts:
• Facts
• Dimensions
Other related concepts:
• Aggregates
• Metadata
Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business.
Fact
Definition
• A fact is a collection of related data items, consisting of measures
• A fact is a focus of interest for the decision making process.
• Measures are continuously valued attributes that describe facts (Golfarelli et al.)
• A fact is a business measure (Kimball and Ross)
What exactly is being analysed? What numbers are being analysed?
Examples of Facts
A university provides education services to its students. What are its facts and measures?
Fact: Measures
• Applications: number, revenue from prospectus sales
• Enrollment: number, revenue
• Student Performance: grades, marks, percentage marks, division
• Student Placement: designation, nature of job, salary
• Student Awards: title, amount
Each fact typically represents
• a business item: an order
• a business transaction: order processing
• an event: arrival of an order
that can be used in analyzing the business or business processes.
Some Aspects of Facts
A fact is continuously valued: it takes a value from a broad range of values, such as the set of integers or the real numbers.
The most useful facts are numeric and additive, because we almost never work with a single fact; we summarise many facts at once.
Textual facts occur very rarely: their free format and unpredictable contents make it impossible to analyse them.
Recent interest in unstructured data warehouses looks at these.
Types of Facts
• Additive: facts that can be summed up through all of the dimensions in the fact table, e.g. Sales_Amount along date and product
• Semi-additive: facts that can be summed up for some of the dimensions in the fact table but not the others, e.g. current_balance along account but not along date
• Non-additive: facts that cannot be summed up for any of the dimensions present in the fact table, e.g. percentage or profit margin
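The three kinds of additivity can be illustrated with a minimal Python sketch. The account snapshots, product figures, and function names below are hypothetical and only serve to show which sums are meaningful.

```python
# Hypothetical balance snapshots: (date, account, current_balance).
snapshots = [
    ("2024-01-31", "A", 100),
    ("2024-01-31", "B", 250),
    ("2024-02-29", "A", 120),
    ("2024-02-29", "B", 200),
]

def total_balance_on(date):
    """Semi-additive: summing balances across accounts for one date is meaningful."""
    return sum(balance for d, _, balance in snapshots if d == date)

def average_balance_of(account):
    """Across dates a balance is averaged rather than summed."""
    values = [balance for _, acc, balance in snapshots if acc == account]
    return sum(values) / len(values)

# Hypothetical sales rows: (product, revenue, profit). Revenue and profit are additive.
sales = [("P1", 200.0, 50.0), ("P2", 100.0, 10.0)]

def overall_margin(rows):
    """Non-additive: a margin is recomputed from additive parts, never summed."""
    total_revenue = sum(revenue for _, revenue, _ in rows)
    total_profit = sum(profit for _, _, profit in rows)
    return total_profit / total_revenue

january_total = total_balance_on("2024-01-31")
average_a = average_balance_of("A")
margin = overall_margin(sales)
```

Adding the two per-product margins (0.25 and 0.10) would give 0.35, which is not the margin of the combined sales; this is why ratios are usually stored as their additive components.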
Dimension
Definition
• The parameter over which we want to perform analysis of facts
  – sales is a fact; perform analysis over region, product, time
• The parameter that gives meaning to a measure
  – number of customers is a fact; perform analysis over time
• A discretely valued description that is more or less constant and participates in constraints
• Qualifying characteristics that provide additional perspective to a given fact
Examples of Dimension
A university provides education services to its students. What are its facts and dimensions?
Fact: Dimensions
• Applications: Age, Region
• Enrollment: Region
• Performance: Year, Discipline, Student
• Placement: Year, Discipline, Grades
• Student awards: Discipline, Year
Dimensions and their Values
Dimension: Dimension Value
• Age: 10, 11, 12, ...
• Region: North, South
• Year: 1999, 2000, ...
• Discipline: ECE, CSE, IT, ...
• Grades: A+, A, ...
• Student: Name of student
Aspects of Dimensions
The values of dimensions are more or less constant, but they can change with time:
• slowly changing dimensions
• rapidly changing dimensions
Such changes need to be handled
Dimensions are the primary source of query constraints, report headings, and groupings
Dimension Hierarchies/Categories
Dimensions are composed of smaller units called categories or members
• simpler components forming a hierarchy: country, zone, branch, unit
  – Hierarchies are a basis for drill down and roll-up
• special, notable units: holidays
  – For special queries: sales performance on holidays
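Roll-up along a hierarchy such as country, zone, branch, unit can be sketched as follows. The member names, the `PARENT` map, and the `roll_up` helper are hypothetical; a real warehouse would normally keep the hierarchy in a dimension table.

```python
from collections import defaultdict

# Hypothetical location hierarchy: unit -> branch -> zone -> country.
PARENT = {
    "Unit-1": "Branch-A", "Unit-2": "Branch-A", "Unit-3": "Branch-B",
    "Branch-A": "Zone-North", "Branch-B": "Zone-North",
    "Zone-North": "India",
}

def roll_up(facts, levels=1):
    """Re-key each fact `levels` steps up the hierarchy and sum the measures."""
    totals = defaultdict(int)
    for member, amount in facts.items():
        for _ in range(levels):
            member = PARENT[member]
        totals[member] += amount
    return dict(totals)

unit_sales = {"Unit-1": 10, "Unit-2": 20, "Unit-3": 5}
branch_sales = roll_up(unit_sales)           # one level up: per branch
zone_sales = roll_up(unit_sales, levels=2)   # two levels up: per zone
```

Drill down is the reverse navigation: starting from a zone total and returning to the branch or unit rows that make it up, which is only possible if the detailed facts are kept.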
Organising Facts and Dimensions
The model should
• provide drill down/roll up along dimension hierarchies
• provide good data access
• be query centric
• be optimised for queries and analysis
• let each dimension interact fully with the fact
The Star Schema
Figure: a central Fact table surrounded by five Dimension tables, each linked directly to the fact
A DW is a collection of star schemata
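A star schema maps directly onto relational tables: one fact table whose foreign keys point at the dimension tables. The sketch below uses Python's standard `sqlite3` module; the table names, column names, and sample rows are illustrative assumptions, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical names and columns).
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, region TEXT, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_type TEXT, product_name TEXT);

-- Fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    city_id    INTEGER REFERENCES dim_city(city_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    rupees     INTEGER
);

INSERT INTO dim_time VALUES (1, 2000, 1), (2, 2000, 2);
INSERT INTO dim_city VALUES (1, 'North', 'Delhi'), (2, 'South', 'Chennai');
INSERT INTO dim_product VALUES (1, 'Stationery', 'Pen');
INSERT INTO fact_sales VALUES (1, 1, 1, 100), (2, 1, 1, 150), (1, 2, 1, 70);
""")

# A typical query: constrain and group by dimension attributes, sum the fact.
sales_by_region = conn.execute("""
    SELECT c.region, SUM(f.rupees)
    FROM fact_sales AS f
    JOIN dim_city AS c ON c.city_id = f.city_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
```

The query touches only the fact table and the one dimension it groups by, which is the access pattern the star layout is meant to keep simple.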
Example: Facts and Dimensions
Fact: Sales (measure: Rupees)
Dimension hierarchies:
• Time: Year, Season, Month
• Location: Region, City
• Product: Product type, Product name
Computing Fact Sizes
Let there be
• 5000 products
• 60 months
• 50 cities
Number of sales facts = 5000 * 60 * 50 = 15000000
Assume one sale fact per product, per city, per month
Sparse Facts
Not all 5000 products may be sold each month in each city
Assume that 3000 products are sold each month in each city
Number of sales facts = 3000 * 60 * 50 = 9000000
Approximately 60% of the cube is occupied and 40% is empty
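The arithmetic above can be checked with a few lines of Python. The variable names are illustrative; the numbers are the ones given in the text.

```python
products, months, cities = 5000, 60, 50

# Dense cube: one sales fact per product, per city, per month.
cube_cells = products * months * cities

# Sparse case: only 3000 products are sold each month in each city.
products_sold = 3000
stored_facts = products_sold * months * cities

density = stored_facts / cube_cells   # fraction of the cube that is occupied
empty_fraction = 1 - density          # fraction of the cube that is empty
```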
Aggregation
We need the total sales for each region, product wise and month-wise
Number of products = 5000
Number of regions = 5
Number of months = 60
Total number of facts = 5000 * 5 * 60 = 1500000
Space-time tradeoff
• if the frequency of use is high then pay the storage expense
Aggregation guideline
• if the number of facts summarised is more than 10, then do aggregation
Aggregation is performed in order to speed up common queries
Aggregates are pre-calculated summaries along dimension hierarchies derived from basic facts.
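Pre-calculating an aggregate amounts to re-keying the basic facts at a higher level of a hierarchy and summing the measure. The sketch below rolls city-level sales up to regions; the city-to-region map, the sample rows, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical step of the location hierarchy: city -> region.
CITY_TO_REGION = {"Delhi": "North", "Chandigarh": "North", "Chennai": "South"}

# Basic facts: (product, city, month, rupees).
base_facts = [
    ("Pen", "Delhi", 1, 100),
    ("Pen", "Chandigarh", 1, 40),
    ("Pen", "Chennai", 1, 70),
    ("Ink", "Delhi", 1, 30),
]

def build_region_aggregate(facts):
    """Sum rupees per (product, region, month); the result is stored and reused."""
    aggregate = defaultdict(int)
    for product, city, month, rupees in facts:
        aggregate[(product, CITY_TO_REGION[city], month)] += rupees
    return dict(aggregate)

def facts_per_aggregate_row(base_count, aggregate_count):
    """Average number of basic facts summarised by one aggregate fact."""
    return base_count / aggregate_count

region_aggregate = build_region_aggregate(base_facts)
ratio = facts_per_aggregate_row(15_000_000, 1_500_000)
```

With the figures from the text, each region-level fact summarises 10 city-level facts on average, which sits right at the threshold of the guideline, so the frequency of use would decide whether the storage is worth paying for.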
Figure: the Time (Year, Season, Month), Location (Region, City) and Product (Product type, Product name) hierarchies, marked for no aggregation, one-way aggregation, two-way aggregation and three-way aggregation
When aggregation is done by rising along n dimensions, n-way aggregation is said to be performed
Sparsity and Aggregation
As the amount of aggregation increases, sparsity decreases
One-way aggregation on regions results in 1.5M facts (5000 products * 5 regions * 60 months)
• The probability of all 5000 products being sold in a month in a region is higher than that of all 5000 being sold in a month in a city
Two-way aggregation on regions and seasons results in 0.5M facts, which corresponds to 20 seasons over the 60 months
• The probability of all 5000 products being sold in a season in a region is higher than that of all 5000 being sold in a month in a region
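The effect can be seen on a toy cube. In the sketch below the products, cities, regions, and the set of combinations that actually had sales are all made up; only the mechanism (rolling cities up to regions fills a larger share of the cells) reflects the text.

```python
products = ["P1", "P2", "P3"]
city_region = {"C1": "R1", "C2": "R1", "C3": "R2", "C4": "R2"}

# Hypothetical (product, city) combinations that had at least one sale.
sold = {("P1", "C1"), ("P2", "C2"), ("P3", "C1"), ("P1", "C3"), ("P2", "C4")}

# Density at city level: occupied cells / all product-city cells.
city_density = len(sold) / (len(products) * len(city_region))

# One-way aggregation: a product-region cell is occupied if any of its cities is.
region_cells = {(product, city_region[city]) for product, city in sold}
region_count = len(set(city_region.values()))
region_density = len(region_cells) / (len(products) * region_count)
```

Here 5 of 12 city-level cells are occupied, but 5 of 6 region-level cells are, so the aggregate is much less sparse than the detail.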
Aggregation and the Star Schema
Each aggregate is a fact with its own derived dimensions
Derived dimensions may be defined 'on the fly'
• For example, a sales summary by quarter, even though quarter was not in the original dimension hierarchy
Each aggregate has its own star schema
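A derived dimension such as quarter can be computed from an existing level when the aggregate is built. This is a minimal sketch assuming calendar quarters and month numbers 1 to 12; the sample monthly figures and the `quarter_of` helper are hypothetical.

```python
from collections import defaultdict

def quarter_of(month):
    """Derive the calendar quarter (1-4) from a month number (1-12)."""
    return (month - 1) // 3 + 1

# Hypothetical monthly sales keyed by (year, month).
monthly_sales = {(2000, 1): 100, (2000, 2): 150, (2000, 4): 80, (2000, 12): 60}

# Build the aggregate keyed by the derived (year, quarter) dimension.
quarterly_totals = defaultdict(int)
for (year, month), rupees in monthly_sales.items():
    quarterly_totals[(year, quarter_of(month))] += rupees
quarterly_sales = dict(quarterly_totals)
```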
Metadata
Metadata contains the answers to questions about the data in the Data Warehouse
Different definitions:
• Data about the data
• Table of contents for the data
• Catalog for the data
• Data warehouse atlas
• Data warehouse roadmap
Central Role of Metadata
Metadata for End Users
Example
• Entity Name: Customer
• Aliases: Account, Client
• Definition: Anyone who purchases hotel rooms
• Source Systems: Reservations, Accounts, Housekeeping
• Create Date: 1 January 2000
• Last Update Date: 13 September 2003
• Update Cycle: weekly
• Full Refresh Cycle: six months
• Data Quality Review: 15 September 2003
• Planned Archival: every six months
Metadata for IT Professionals
Metadata Driven Data Warehouse Process
Data Acquisition Metadata Types
Information Delivery
• Functions:
  – Report generation
  – Query processing
  – Complex analysis
• Metadata recorded in the information delivery functional area
  – relates to predefined queries, predefined reports, and input parameter definitions for queries and reports
  – also includes information for OLAP
Information Delivery Metadata Types
Challenges for Metadata Management
• Reconcile the formats of metadata of several tools
• No industry-wide accepted standards
• Centralized metadata repository: a collection of fragmented metadata stores
• No easy and accepted methods of passing metadata
• Preserving version control of metadata
• Unifying the metadata relating to the data sources can be an enormous task
Common Warehouse Model
Foundation Metadata
• Business information about model elements
• Data types
• Keys and indexes
• Expressions
• Software deployment: software deployed in the DW
• Type mapping: mapping of data types between different systems
Metadata for Resources
• Relational data sources
• Record data sources
• Multidimensional resources
• XML data sources
Analysis Metadata
• Data transformation tools
• OLAP processing tools
• Data mining tools
• Information visualisation tools
• Business taxonomy and glossary
Management
• Warehouse processes
• Results of warehouse operations
The Star Schema Revisited
The Star contains ‘detailed’ facts and dimensions
Aggregates are facts and have their own dimensions
Metadata support is built around the star schema
Star Schema: Benefits
• Depicts a fuller description of each dimension
• Explicitly shows multiple levels of aggregation on each dimension
• Depicts multiple facts at the intersection of all dimensions
• Directly implementable in a Relational DBMS
• Can utilize new, accelerated approaches to indexing (STARindex) and joining (STARjoin)
Dimensional Modelling vs. Spread Sheet
Annual product sales by region ($,000)
=============================================================
           REGION:
PRODUCT:   SOUTHERN   WESTERN   NORTHERN   EASTERN      TOTAL
-------------------------------------------------------------
Stibes       $7,140   $14,790    $13,260   $15,810    $51,000
Farkles       5,460    11,310     10,140    12,090     39,000
Teglers       3,150     6,525      5,850     6,975     22,500
Qwerts        5,250    11,875     10,750    12,625     40,500
-------------------------------------------------------------
TOTALS:     $21,000   $44,500    $40,000   $47,500   $153,000
=============================================================
Is this a Relational Table?
• What is the Entity?
• What is the Identifier?
• What are the Attributes?
How to make it a Relational Table?
How many Fact types? How many Dimensions?
Dimensional Modelling vs. Relations
REGION     PRODUCT    SALES
Southern   Stibes     $7,140
Southern   Farkles     5,460
Southern   Teglers     3,150
Southern   Qwerts      5,250
Western    Stibes     14,790
Western    Farkles    11,310
Western    Teglers     6,525
Western    Qwerts     11,875
Northern   Stibes     13,260
Northern   Farkles    10,140
Northern   Teglers     5,850
Northern   Qwerts     10,750
Eastern    Stibes     15,810
Eastern    Farkles    12,090
Eastern    Teglers     6,975
Eastern    Qwerts     12,625

(all)      Stibes     51,000
(all)      Qwerts     40,500
Southern   (all)      21,000
(all)      (all)     153,000
How many Facts?
How many Dimensions?
What type of Table?
What is the Identifier?
Where are the Dimension Tables?
REGION:
NAME       LEVEL
---------------------
(all)      1
Southern   2
Western    2
Northern   2
Eastern    2
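The step from the spreadsheet to the relational form can be sketched in Python: each spreadsheet cell becomes one (region, product, sales) row, and the totals become rows whose dimension value is "(all)". The figures are those of the spreadsheet; the variable names are illustrative.

```python
from collections import defaultdict

REGIONS = ["Southern", "Western", "Northern", "Eastern"]

# The spreadsheet body: one list of regional sales per product.
SPREADSHEET = {
    "Stibes":  [7140, 14790, 13260, 15810],
    "Farkles": [5460, 11310, 10140, 12090],
    "Teglers": [3150, 6525, 5850, 6975],
    "Qwerts":  [5250, 11875, 10750, 12625],
}

# Unpivot: one (region, product, sales) row per spreadsheet cell.
rows = [
    (region, product, amount)
    for product, amounts in SPREADSHEET.items()
    for region, amount in zip(REGIONS, amounts)
]

# Aggregate rows use "(all)" as the value at the top level of a dimension.
totals = defaultdict(int)
for region, product, amount in rows:
    totals[("(all)", product)] += amount
    totals[(region, "(all)")] += amount
    totals[("(all)", "(all)")] += amount
```

The 16 detail rows have a single fact (sales) identified by two dimensions (region, product); the "(all)" rows are aggregates at level 1 of the corresponding hierarchy.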
ER Diagram
• REGION: Region ID, Region Name
• DEPARTMENT: Department ID, Department Name
• PRODUCT GROUP: Product Group ID, Product Group Desc., Department ID (fk)
• PRODUCT: Product ID, Product Desc., Product Group ID (fk)
• STORE: Store ID, Store Name, Address, City, State, ZipCode, Region ID (fk)
• SALES: Sales Date, Store ID (fk), Product ID (fk), Sale Amount, Sale Units
• INVENTORY: Week, Store ID (fk), Product ID (fk), Quantity
ER Diagram
Good for OLTP
Update in exactly one place
No redundancy
Oriented towards insertion, deletion, and modification of data
Weak entities/relationships create normalised structures
What are the facts and dimensions?
Transformation of ER to Star
• Product Dimension: DEPARTMENT, PRODUCT GROUP, PRODUCT ITEM
• Location Dimension: REGION, STORE
• Time Dimension: YEAR, MONTH, WEEK, DATE
• Fact: SALES FACTS
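One way to picture the transformation is to flatten the normalised PRODUCT, PRODUCT GROUP and DEPARTMENT entities into a single product dimension row per product. The sketch below does this in Python; the sample rows, dictionary layout, and function name are hypothetical, and the same pattern would apply to the location and time dimensions.

```python
# Normalised ER tables keyed by ID (hypothetical sample rows).
department = {1: "Grocery"}
product_group = {10: ("Dairy", 1)}        # group_id -> (description, department_id)
product = {
    100: ("Milk 1L", 10),                 # product_id -> (description, group_id)
    101: ("Butter", 10),
}

def build_product_dimension():
    """Follow the foreign keys and emit one denormalised row per product."""
    dimension = []
    for product_id, (description, group_id) in sorted(product.items()):
        group_description, department_id = product_group[group_id]
        dimension.append({
            "product_id": product_id,
            "product_desc": description,
            "product_group": group_description,
            "department": department[department_id],
        })
    return dimension

product_dimension = build_product_dimension()
```

The redundancy this introduces (the group and department names repeat on every product row) is the price paid for letting queries reach every level of the hierarchy with a single join to the fact table.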