Database Schema Akanksha (HT00-6255Y) E. V. Anantha Lakshmi (HD99-9054A) Sze Eng Koon (HD99-2800H) Material Available at szeek/cs6203

Database Schema

Akanksha (HT00-6255Y)

E. V. Anantha Lakshmi (HD99-9054A)

Sze Eng Koon (HD99-2800H)

Material Available at http://www.comp.nus.edu.sg/~szeek/cs6203

Agenda

Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional schema

Data Warehouse

Relational database designed for query and analysis

Contains historical data derived from transaction data

Enables organization to consolidate data from several sources

Schema

Consists of DB objects such as tables, views, indexes

There are many types of schema, e.g., star, snowflake

Fact

In decision support DW, a number of queries ask about an essential fact E.g., top 20% spending customers

Fact: Loyalty card transactions

Characteristics Transactions that occurred in past and won’t

change Analyzed in many ways by cross-referencing

with attributes

Fact Table

A table that contains facts Contains numeric, additive fields

(measurements of the business) It has two types of columns:

containing facts foreign keys to dimension tables

Dimension

A structure that categorizes data to enable end users to answer business questions

e.g., Time, Location, Customer

Example

Video chain store builds DW to analyze sales of its products across all stores over time Effect of promoting one product on sale of related

product that is not promoted? Product sales before and after the promotion?

Facts sales (units sold or rented) and profits

Dimensions products, locations (stores), promotions, time

Major Considerations

Data warehouse supports business process, not specific queries

Must understand what information will be used Challenge:

Solution should be effective for a reasonable time period (3-5 years)

Design should allow unknown queries to be able to perform

Identifying Facts & Dimensions

Given large entity model1. Look for elemental transactions in business

process2. Determine key dimensions that apply to facts3. Check that a candidate fact is not a dimension

table4. Check that a candidate dimension is not a fact

table If 3 or 4 changes assignment go to step 2

1) Look for Elemental Transactions in Business

Examine business and identify transactions of interest

Transactions that describe fundamental events e.g., record of account transactions made by a

banking customer Is information operated on by business? Don’t assume reported transactions are facts Design to store most detailed transaction to

prevent restructuring in future

Example

Consider sales information in retail shop Stored record

Store_id, Product_id, Date, Sale Business operates on basket,

i.e.,individual sales transaction and not aggregate sales Fact should be basket transaction

2) Determine Key Dimensions for Facts

Find out entities that are associated with the facts

Need to focus on key dimensions

Account_id

Owner_id

Account_type

Balance

account

Owner_id

Name

Address

Contact_no

customer

Account transaction

analyze how customer uses services

analyze account transaction by account

Account_id

Owner_id

Account_type

Balance

Owner_id

Name

Address

Contact_no

customer

3) Check If Fact Is Dimension

Entities may appear to be fact tables but may be combination of both facts and dimensions

Consider whether attributes were designed around data entry requirements

Data in fact table rows should not vary over time

Example

Address Zip Date cables laid

Connect

Date

… Date sub

added

Address

Mistaken for fact table because users will query on information about address

Example (Cont.)

Code Address Zip Code Date Code Date

Address Date Cable laid

Date sub added

•Will affect the DB size

•Query of operational events much easier

4) Check If Dimension Is Fact

An entity could be either dimensions or facts

Consider focal point of analysis of the business

Good check: by how many dimensions entity can be viewed, >3 probably fact

Account_id

Owner_id

Account_type

Balance

account

Owner_id

Name

Address

Contact_no

customer

Account transaction

analyze how customer uses services

Owner_id

Name

Address

Contact_no

customer

Customer_id

Name

Age

Location_id

Occupation_id

customer

Occupation

Location wise % patronage

Location

Designing Fact Table

No practical limit to size of fact table Balance between size of fact table and

value of data

Techniques to Reduce Size of Fact Tables

Identification of significant historical period for each supported function

Determination of samples to satisfy requirements of detailed data

Selection of relevant / appropriate columns

Techniques (Cont.)

Minimization of column size Determination of intelligent or non-

intelligent keys Determination of format of storing time Partition the fact table (Next group)

1.Identification Of Significant Historical Period

Draw retention period graph showing period and detail necessary for each business function

e.g., comparison of sales of seasonal products such as winter clothing, text books etc

Sales Analysis

Jan ‘98 July ‘98 Jan ‘99

2. Determine samples to satisfy requirement of detailed data

Appropriate in situations where analysis requirement is to analyze trends

The business requirement does not require all the detailed data

Store samples and aggregate the rest

3.Selection of Relevant / appropriate columns

Decide whether optional data needs to be stored in the fact table

e.g., for a table related to bank accounts The columns typically required are:

Acc#, IC#, Amt#, Bank # or Branch# Optional columns related to Teller Name and

Time of entry of records to be decided based on queries’ requirements

4. Minimization of Column Size

Saving per row can have significant effect on total table size

e.g., a typical transaction table contains 4 billion rows (2 million customers, 3 transactions per day per customer for 2 or 3 yr period) saving of 20 bytes per row will save 20 x 4 =80 GB

5. Intelligent /Non-intelligent keys

A fact table can be structured in two ways using Intelligent keys:

Each key represents unique identifier for the item in the real world

Non-Intelligent: Each unique key is generated automatically and

refers to unique identifier of the item in the real world

Salesman_nameCustomerVehicle_desc DateSales_price

Sales

Intelligent Keys

Time idDate

Sales idSalesman Name

Sales idCust idVehicle idTime idSales_price

Vehicle idVehicle Description

Sales AnalysisSalesman

Sales Vehicle

Time

Non – Intelligent Keys

Comparison

Advantages Intelligent keys improve query performance

by avoiding joins Disadvantages

For intelligent keys, if any of the unique identifier changes, a lot of rows in fact table have to be updated to reflect new identifiers

6. Determination of format of storing time

Time information can be stored in many ways within fact table, e.g., Use of foreign key into time dimension table Actual physical dates are stored within

dimension table

Determination of format of storing time (Cont.)

Possible techniques to store dates are Storing the physical date Storing an offset from the inherent start of the

table Storing a date range

Use physical dates rather than foreign keys as the cost of storing is minimum


A fact table is usually partitioned in a unit of time (e.g. week, month,quarter etc)

Dates can be referred as offsets from the inherent start

Storage costs are low. Two bytes enough to store up to 31 numbers


Disadvantages: Queries have to be constructed to convert physical

dates into date offsets. E.g.: To return sum of transactions on ‘9-Jan-00’

Select count(*) from customer

where date=‘9-Jan-00’ – ‘1-Jan-00’; Solution:

If access tools can’t interpret offset, create view that looks like logical table by adding offset to start date

Date Range

Significant saving in disk capacity and query performance

Reduces table size Has significance when changes occur

substantially over time

Date Range (Cont.)

Select count (*) from customer where

date between (st_date,end_date); A number of access tools may not be able

to process fact tables in this view Sometimes a Cartesian product is formed

by joining time dimension table against date range in the fact table

Date Range (Cont.)

Cost of Cartesian product is high, it requires Significant processing power Temporary space

Consider the use of date range if access tools can cope directly with structure and do not allow data expansion

Designing Dimension Table

Done after fact tables are designed dimension tables support querying on fact

tables Wrong design is not a major disaster

volumes relatively small restructuring costs relatively small

Considerations

Star dimension Hierarchies and networks Dimensions that vary over time Managing large dimension tables

Star Dimension Denormalized tables

put reference information into a single table speed up query performance

Rely on perceived use of information by typical queries

Not appropriate when additional data is not accessed often overhead of scanning expanded table

Star Dimension (cont.)

View enrollment size by department, faculty

acad_year, mod#, enrollment

mod#, title, dept#

dept#, department, fac#

fac#, faculty

1

n

1n

mod#, title, dept#, department, faculty#, faculty

Put under same table

Star Dimension (cont.)

Shape of a star

fact

dimension

dimension

dimension

dimension

dimension

Hierarchies and Network

Not possible to denormalize all entities into star dimensions

Many-to-many relationships should not be denormalized into star dimension not efficient

Multiple hierarchies are used to represent different views denormalized the hierarchy likely to be

used by most queries

Hierarchies and Network (cont.)

Example

Fact

Dimension

Dimension

Dimension

Dimension

Dimension Dimension DimensionDimension Dimension Dimension

Hierarchies and Network (cont.)

Useful if query profiles do not change to a point where other hierarchy becomes more popular existing star dimension table is useless

For minor changes add columns note impact on canned queries if columns are

dropped or modified

Dimensions That Vary Over Time

Change in business t-shirts move from menswear to unisex baked beans move from canned foods to canned

vegetables May need to support queries that compare

facts based on present and past groupings for example, to compare sales by departments

between past and present, need to use both old and new classification

instead of updating record in dimension table, add new record and validity dates

Dimensions That Vary Over Time (cont.)

Year 1996select sum(s.revenue_achieved)from sales_year_to_date s, product_dimension pdwhere s.product_id = pd.idand pd.department = ‘Menswear’and pd.end_date > 1-Jan-96and pd.start_date < 31-Dec-96;

Year 1997select sum(s.revenue_achieved)from sales_year_to_date s, product_dimension pdwhere s.product_id = pd.idand pd.department = ‘Menswear’and pd.end_date > 1-Jan-97and pd.start_date < 31-Dec-97;

Dimensions That Vary Over Time (cont.)

If query needs to compare a significant event in corporate calendar, year on year, keep in time dimension Great Singapore Sales in 1998 is from 15 Jul

to 15 Aug Great Singapore Sales in 1999 is from 10 Jun

to 10 Jul Instead of just joining sales_year_to_date

and product_dimension tables, now we also join with time_dimension table.

Managing Large Dimension Tables

Need to watch out for dimension tables that grow too large over time especially for those dimension tables that store

new values instead of updating existing tuples Indicated by:

size similar to that of fact table full table scan takes too long

Horizontal partition the large dimension tables product_dimension_1990s product_dimension_2000s

Designing Starflake Schema Starflake schema

combination of denormalized star and normalized snowflake schemas, plus some additional ideas

Difficult to restructure all entities into a set of distinct dimensions common for entities to span one or more

dimensions for example, common for stores to apply local

pricing for products, thus need to join product and store tables instead of getting the price from product table

allow a degree of crossover between dimensions

Salestransactions

Locations

Time

Products

Department

Products

Businessunit

Style

Color

Size

Locations

Region

Week

Time

Summersales

Eastersales

Month

Facts

Star dimensions

Snowflake dimensions

Designing Starflake Schema (cont.) Crossovers of starflake schema will reduce over time

once system is operational requirements become better understood expect greatest amount of change for outlying entities

Degree of change should decrease as we move from outlying entities towards fact tables need not spend much time on requirements that use outlying

entities Need to direct query to most appropriate source

starflake schema not supported by access tools create views and synonyms on fact tables to be used by

access tools

Multidimensional Schemas Class of decision support queries that

analyze data by representing facts and dimensions within a multidimensional cube

Each dimension occupy an axis, and values within the cube correspond to factual transactions Trend analysis

Good for viewing statistical operations/aggregations apply functions against planes of cube cube can be pivoted, sliced, diced

ProductLocation

Time

Products

Location

ProductLocationTime

Multidimensional Schemas (cont.) Cube extracts data from data warehouse

created as physical stores in many access tools populate on demand or in advance

If data warehouse is constantly being accessed, performance will be poor most appropriate solution is when cube satisfies

majority of queries, but require substantial design effort

Summary

Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables Designing starflake schemas Multidimensional schema

Documents

Database Schema Akanksha (HT00-6255Y) E. V. Anantha Lakshmi (HD99-9054A) Sze Eng Koon (HD99-2800H) Material Available at szeek/cs6203