Upload
edward-reed
View
217
Download
2
Embed Size (px)
Citation preview
Database Schema
Akanksha (HT00-6255Y)
E. V. Anantha Lakshmi (HD99-9054A)
Sze Eng Koon (HD99-2800H)
Material Available at http://www.comp.nus.edu.sg/~szeek/cs6203
Agenda
Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional schema
Data Warehouse
Relational database designed for query and analysis
Contains historical data derived from transaction data
Enables organization to consolidate data from several sources
Schema
Consists of DB objects such as tables, views, indexes
There are many types of schema, e.g., star, snowflake
Fact
In decision support DW, a number of queries ask about an essential fact E.g., top 20% spending customers
Fact: Loyalty card transactions
Characteristics Transactions that occurred in past and won’t
change Analyzed in many ways by cross-referencing
with attributes
Fact Table
A table that contains facts Contains numeric, additive fields
(measurements of the business) It has two types of columns:
containing facts foreign keys to dimension tables
Dimension
A structure that categorizes data to enable end users to answer business questions
e.g., Time, Location, Customer
Example
Video chain store builds DW to analyze sales of its products across all stores over time Effect of promoting one product on sale of related
product that is not promoted? Product sales before and after the promotion?
Facts sales (units sold or rented) and profits
Dimensions products, locations (stores), promotions, time
Major Considerations
Data warehouse supports business process, not specific queries
Must understand what information will be used Challenge:
Solution should be effective for a reasonable time period (3-5 years)
Design should allow unknown queries to be able to perform
Identifying Facts & Dimensions
Given large entity model1. Look for elemental transactions in business
process2. Determine key dimensions that apply to facts3. Check that a candidate fact is not a dimension
table4. Check that a candidate dimension is not a fact
table If 3 or 4 changes assignment go to step 2
1) Look for Elemental Transactions in Business
Examine business and identify transactions of interest
Transactions that describe fundamental events e.g., record of account transactions made by a
banking customer Is information operated on by business? Don’t assume reported transactions are facts Design to store most detailed transaction to
prevent restructuring in future
Example
Consider sales information in retail shop Stored record
Store_id, Product_id, Date, Sale Business operates on basket,
i.e.,individual sales transaction and not aggregate sales Fact should be basket transaction
2) Determine Key Dimensions for Facts
Find out entities that are associated with the facts
Need to focus on key dimensions
Account_id
Owner_id
Account_type
Balance
account
Owner_id
Name
Address
Contact_no
customer
Account transaction
analyze how customer uses services
analyze account transaction by account
Account_id
Owner_id
Account_type
Balance
Owner_id
Name
Address
Contact_no
customer
3) Check If Fact Is Dimension
Entities may appear to be fact tables but may be combination of both facts and dimensions
Consider whether attributes were designed around data entry requirements
Data in fact table rows should not vary over time
Example
Address Zip Date cables laid
Connect
Date
… Date sub
added
Address
Mistaken for fact table because users will query on information about address
Example (Cont.)
Code Address Zip Code Date Code Date
Address Date Cable laid
Date sub added
•Will affect the DB size
•Query of operational events much easier
4) Check If Dimension Is Fact
An entity could be either dimensions or facts
Consider focal point of analysis of the business
Good check: by how many dimensions entity can be viewed, >3 probably fact
Account_id
Owner_id
Account_type
Balance
account
Owner_id
Name
Address
Contact_no
customer
Account transaction
analyze how customer uses services
Owner_id
Name
Address
Contact_no
customer
Customer_id
Name
Age
Location_id
Occupation_id
customer
Occupation
Location wise % patronage
Location
Designing Fact Table
No practical limit to size of fact table Balance between size of fact table and
value of data
Techniques to Reduce Size of Fact Tables
Identification of significant historical period for each supported function
Determination of samples to satisfy requirements of detailed data
Selection of relevant / appropriate columns
Techniques (Cont.)
Minimization of column size Determination of intelligent or non-
intelligent keys Determination of format of storing time Partition the fact table (Next group)
1.Identification Of Significant Historical Period
Draw retention period graph showing period and detail necessary for each business function
e.g., comparison of sales of seasonal products such as winter clothing, text books etc
Sales Analysis
Jan ‘98 July ‘98 Jan ‘99
2. Determine samples to satisfy requirement of detailed data
Appropriate in situations where analysis requirement is to analyze trends
The business requirement does not require all the detailed data
Store samples and aggregate the rest
3.Selection of Relevant / appropriate columns
Decide whether optional data needs to be stored in the fact table
e.g., for a table related to bank accounts The columns typically required are:
Acc#, IC#, Amt#, Bank # or Branch# Optional columns related to Teller Name and
Time of entry of records to be decided based on queries’ requirements
4. Minimization of Column Size
Saving per row can have significant effect on total table size
e.g., a typical transaction table contains 4 billion rows (2 million customers, 3 transactions per day per customer for 2 or 3 yr period) saving of 20 bytes per row will save 20 x 4 =80 GB
5. Intelligent /Non-intelligent keys
A fact table can be structured in two ways using Intelligent keys:
Each key represents unique identifier for the item in the real world
Non-Intelligent: Each unique key is generated automatically and
refers to unique identifier of the item in the real world
Salesman_nameCustomerVehicle_desc DateSales_price
Sales
Intelligent Keys
Time idDate
Sales idSalesman Name
Sales idCust idVehicle idTime idSales_price
Vehicle idVehicle Description
Sales AnalysisSalesman
Sales Vehicle
Time
Non – Intelligent Keys
Comparison
Advantages Intelligent keys improve query performance
by avoiding joins Disadvantages
For intelligent keys, if any of the unique identifier changes, a lot of rows in fact table have to be updated to reflect new identifiers
6. Determination of format of storing time
Time information can be stored in many ways within fact table, e.g., Use of foreign key into time dimension table Actual physical dates are stored within
dimension table
Determination of format of storing time (Cont.)
Possible techniques to store dates are Storing the physical date Storing an offset from the inherent start of the
table Storing a date range
Use physical dates rather than foreign keys as the cost of storing is minimum
Determination of format of storing time (Cont.)
A fact table is usually partitioned in a unit of time (e.g. week, month,quarter etc)
Dates can be referred as offsets from the inherent start
Storage costs are low. Two bytes enough to store up to 31 numbers
Determination of format of storing time (Cont.)
Disadvantages: Queries have to be constructed to convert physical
dates into date offsets. E.g.: To return sum of transactions on ‘9-Jan-00’
Select count(*) from customer
where date=‘9-Jan-00’ – ‘1-Jan-00’; Solution:
If access tools can’t interpret offset, create view that looks like logical table by adding offset to start date
Date Range
Significant saving in disk capacity and query performance
Reduces table size Has significance when changes occur
substantially over time
Date Range (Cont.)
Select count (*) from customer where
date between (st_date,end_date); A number of access tools may not be able
to process fact tables in this view Sometimes a Cartesian product is formed
by joining time dimension table against date range in the fact table
Date Range (Cont.)
Cost of Cartesian product is high, it requires Significant processing power Temporary space
Consider the use of date range if access tools can cope directly with structure and do not allow data expansion
Designing Dimension Table
Done after fact tables are designed dimension tables support querying on fact
tables Wrong design is not a major disaster
volumes relatively small restructuring costs relatively small
Considerations
Star dimension Hierarchies and networks Dimensions that vary over time Managing large dimension tables
Star Dimension Denormalized tables
put reference information into a single table speed up query performance
Rely on perceived use of information by typical queries
Not appropriate when additional data is not accessed often overhead of scanning expanded table
Star Dimension (cont.)
View enrollment size by department, faculty
acad_year, mod#, enrollment
mod#, title, dept#
dept#, department, fac#
fac#, faculty
1
n
1n
mod#, title, dept#, department, faculty#, faculty
Put under same table
Star Dimension (cont.)
Shape of a star
fact
dimension
dimension
dimension
dimension
dimension
Hierarchies and Network
Not possible to denormalize all entities into star dimensions
Many-to-many relationships should not be denormalized into star dimension not efficient
Multiple hierarchies are used to represent different views denormalized the hierarchy likely to be
used by most queries
Hierarchies and Network (cont.)
Example
Fact
Dimension
Dimension
Dimension
Dimension
Dimension Dimension DimensionDimension Dimension Dimension
Hierarchies and Network (cont.)
Useful if query profiles do not change to a point where other hierarchy becomes more popular existing star dimension table is useless
For minor changes add columns note impact on canned queries if columns are
dropped or modified
Dimensions That Vary Over Time
Change in business t-shirts move from menswear to unisex baked beans move from canned foods to canned
vegetables May need to support queries that compare
facts based on present and past groupings for example, to compare sales by departments
between past and present, need to use both old and new classification
instead of updating record in dimension table, add new record and validity dates
Dimensions That Vary Over Time (cont.)
Year 1996select sum(s.revenue_achieved)from sales_year_to_date s, product_dimension pdwhere s.product_id = pd.idand pd.department = ‘Menswear’and pd.end_date > 1-Jan-96and pd.start_date < 31-Dec-96;
Year 1997select sum(s.revenue_achieved)from sales_year_to_date s, product_dimension pdwhere s.product_id = pd.idand pd.department = ‘Menswear’and pd.end_date > 1-Jan-97and pd.start_date < 31-Dec-97;
Dimensions That Vary Over Time (cont.)
If query needs to compare a significant event in corporate calendar, year on year, keep in time dimension Great Singapore Sales in 1998 is from 15 Jul
to 15 Aug Great Singapore Sales in 1999 is from 10 Jun
to 10 Jul Instead of just joining sales_year_to_date
and product_dimension tables, now we also join with time_dimension table.
Managing Large Dimension Tables
Need to watch out for dimension tables that grow too large over time especially for those dimension tables that store
new values instead of updating existing tuples Indicated by:
size similar to that of fact table full table scan takes too long
Horizontal partition the large dimension tables product_dimension_1990s product_dimension_2000s
Designing Starflake Schema Starflake schema
combination of denormalized star and normalized snowflake schemas, plus some additional ideas
Difficult to restructure all entities into a set of distinct dimensions common for entities to span one or more
dimensions for example, common for stores to apply local
pricing for products, thus need to join product and store tables instead of getting the price from product table
allow a degree of crossover between dimensions
Salestransactions
Locations
Time
Products
Department
Products
Businessunit
Style
Color
Size
Locations
Region
Week
Time
Summersales
Eastersales
Month
Facts
Star dimensions
Snowflake dimensions
Designing Starflake Schema (cont.) Crossovers of starflake schema will reduce over time
once system is operational requirements become better understood expect greatest amount of change for outlying entities
Degree of change should decrease as we move from outlying entities towards fact tables need not spend much time on requirements that use outlying
entities Need to direct query to most appropriate source
starflake schema not supported by access tools create views and synonyms on fact tables to be used by
access tools
Multidimensional Schemas Class of decision support queries that
analyze data by representing facts and dimensions within a multidimensional cube
Each dimension occupy an axis, and values within the cube correspond to factual transactions Trend analysis
Good for viewing statistical operations/aggregations apply functions against planes of cube cube can be pivoted, sliced, diced
ProductLocation
Time
Products
Location
ProductLocationTime
Multidimensional Schemas (cont.) Cube extracts data from data warehouse
created as physical stores in many access tools populate on demand or in advance
If data warehouse is constantly being accessed, performance will be poor most appropriate solution is when cube satisfies
majority of queries, but require substantial design effort
Summary
Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional schema