The Rules of Time: Data Quality Issues for Time Varying Databases DAMA National Capital Region – Mar 2002 Dr. Jerry Rosenbaum ConcentrX, LLC 410-764-1843 voice 443-253-6054 mobile 410-764-2445 fax [email protected]


The Rules of Time:
Data Quality Issues for Time Varying Databases

DAMA National Capital Region – Mar 2002

Dr. Jerry Rosenbaum
ConcentrX, LLC
410-764-1843 voice
443-253-6054 mobile
410-764-2445 fax
[email protected]


Sept 2001 Concentrx, LLC 2

Outline

• Perspective
• Example
• Aspects of Time
• Example (with LDM)
• Queries
• Design Guidelines


Perspective

• Designing, building, and using a time-dependent database looks simple
  – Just add dates or date ranges to some tables
  – Rows are only logically deleted (to maintain history)
  – Make sure the SQL includes date logic

• BUT . . .


Perspective continued

• There are often many issues and a lot of complexity lurking under the covers
  – You must understand the requirements
  – You must understand the uses of the data
  – You must be prepared to help the “ad hoc” customer obtain valid results
    • Ferret out what the customer forgot to tell you
    • Understand what they are really saying

• Transaction Path Analysis is very useful for physical design


Perspective continued

• Primary keys often have a time factor
• Queries must take into account the (multiple) times and/or time ranges
• Relationships between entities tend to become more complex
• The notion of referential integrity may need to change
• Training customers is difficult
• Training developers is no easier


Simple Questions

• How will we represent dates?
  – ymd, mdy, dmy, yd, day count since a start date
• Which calendar?
  – Julian, Gregorian, Hebrew, Chinese, Muslim, Hindu
• When does a day begin?
  – Just after midnight (local time)
  – At sunset (local time)
• What about am/pm vs. the 24-hour clock?
• How does daylight saving time fit in?
• What are the transformation rules between them?


Example - Person's Residence

Track every residence a person has lived in and when they resided at each place

• Basic table design includes
  – Name
  – Address
  – Start date
  – End date


Some Issues

• However, we are not yet done
  – We must understand the business purpose for tracking the data
  – We must understand how the data may be used
  – We must uncover and handle possible “quirks” in the data
  – Are other attributes needed?
  – How should we handle the primary key?


Example of Business Issue

• How do we plan to use the address?
  – General mailings
  – Bills
  – Time-sensitive material (e.g. auction catalog)
  – Visit the person
  – Call the person
  – Aggregate reporting
  – Etc., etc., etc.


Questions

• Is a day sufficiently granular?

• What if the person lives in Bombay, India and the user lives in NYC?
  – What do we do about the roughly 10-hour time zone difference (9.5 or 10.5 hours, depending on daylight saving time in New York), especially if it bridges days?
  – For this type of application we can probably ignore the time zone (unless we wish to call the person)


More Questions

• Can there be a time with no residence?
• Can a person have more than one residence at one time?
  – Is one residence primary and the other secondary?
  – Can we have a temporary overlap of times as the person moves residences?
  – How about winter and summer residences, with each primary in its season?
  – Should a temporary residence be included?
  – Can one buy two residences on one day?
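Whichever answers the business gives, it helps to be able to see where the data has gaps or overlaps. The following is a minimal sketch, assuming residence periods are stored with inclusive start and end dates; the addresses, dates, and function names are hypothetical and only illustrate the checks.

```python
from datetime import date
from itertools import combinations

# Hypothetical residence rows: (address, start_date, end_date), both dates inclusive.
residences = [
    ("12 Oak St", date(1998, 1, 1), date(1999, 6, 30)),
    ("7 Elm Ave", date(1999, 6, 15), date(2001, 3, 31)),
    ("3 Pine Rd", date(2001, 5, 1), date(9999, 12, 31)),
]

def overlaps(a_start, a_end, b_start, b_end):
    """Two closed periods overlap when each starts on or before the other ends."""
    return a_start <= b_end and b_start <= a_end

def find_overlaps(rows):
    """Return pairs of addresses whose residence periods share at least one day."""
    return [(a[0], b[0]) for a, b in combinations(rows, 2)
            if overlaps(a[1], a[2], b[1], b[2])]

def find_gaps(rows):
    """Return (last_covered_day, next_start) pairs with no recorded residence between them."""
    ordered = sorted(rows, key=lambda r: r[1])
    gaps = []
    latest_end = ordered[0][2]
    for _, start, end in ordered[1:]:
        if (start - latest_end).days > 1:
            gaps.append((latest_end, start))
        latest_end = max(latest_end, end)
    return gaps
```

Whether an overlap or a gap is an error or a legitimate situation (a move in progress, a seasonal residence) remains a business decision; the code only surfaces the cases.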


Primary Key Questions

• Does it make sense to use
  – Name + date + address sequence number
  – Name + address sequence number
  – Surrogate key
• If a surrogate key is used, what is the underlying business key?
• What effect does this have on foreign keys?


Possible Design

• So far we are led to the following possible design
  – Surrogate Key
  – Name
  – Address
  – Address Type
  – Start Date
  – End Date


Yet More Questions

• Do we have to track
  – When we knew about a new address
  – When we knew that an address is to end
• Note that these two dates can be
  – Before the person moved to an address
  – During the time a person is at an address
  – After a person leaves an address
• This data would add two more dates to the table design


One Last Thought

• Alternative physical design could be 2 tables
• Table 1
  – Person Id
  – Name
• Table 2
  – Person Id
  – Address Seq Number
  – Rest of the attributes
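One way the two-table alternative might look, sketched with SQLite through Python. The table and column names, the ISO text dates, and the '9999-12-31' default are assumptions made for illustration; the two "known on" columns stand in for the two extra dates discussed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE person_address (
    person_id      INTEGER NOT NULL REFERENCES person(person_id),
    address_seq_no INTEGER NOT NULL,
    address        TEXT NOT NULL,
    address_type   TEXT NOT NULL,                      -- e.g. 'primary', 'secondary'
    start_date     TEXT NOT NULL,                      -- ISO yyyy-mm-dd
    end_date       TEXT NOT NULL DEFAULT '9999-12-31', -- assumed "unknown end" default
    start_known_on TEXT,                               -- when we learned of the new address
    end_known_on   TEXT,                               -- when we learned the address would end
    PRIMARY KEY (person_id, address_seq_no),
    CHECK (start_date <= end_date)
);
""")

# Hypothetical sample row: the address was known two weeks before the move.
conn.execute("INSERT INTO person VALUES (1, 'Smith')")
conn.execute(
    "INSERT INTO person_address "
    "(person_id, address_seq_no, address, address_type, start_date, start_known_on) "
    "VALUES (1, 1, '12 Oak St', 'primary', '1998-01-01', '1997-12-15')"
)
row = conn.execute(
    "SELECT end_date FROM person_address WHERE person_id = 1"
).fetchone()
```

Storing dates as ISO text keeps string comparison and date order aligned in SQLite; other databases would normally use a native DATE type.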


Key Points

• The basics of tracking time varying data appear easy

• The details cannot be ignored because they will cause changes in both the design and use of the database

• One must understand the business

• One must understand the customers

• Rules are subject to change


Aspects of Time

The degree of time dependency can vary from table to table and attribute to attribute

• Some data has no time dependency (or we don’t care about the time dependency)

• Some data is time annotated

• Other data is valid only for a specified time or time period (i.e. time period dependent)


Time Data Types

• Time Points

• Time Periods

• Time Point Categories

• Time Period Categories

• Bounded Time Periods


Events and Time

Time by itself is rarely of interest

• Events and Things are important and we may need to track time in relation to them

• An event or thing may have one or more time factors associated with it that are relevant to the business

• Time factors may be interdependent


Time Points

• Refers to a single “moment” in time

• Examples
  – The time that an event happened
  – The time we found out that the event happened
  – The time the data about the event was entered into the system

• Any single event may have multiple point-in-time dimensions


Picking a Point in Time

• Suppose a widget is imported by ship
• What is the import date?
  – Date widget is loaded onto the ship
  – Date ship arrives in U.S. port
  – Date container is taken off ship
  – Date customs inspector gets manifest
  – Date customs inspector verifies manifest
  – Etc., etc., etc.
• If widgets are subject to a quota, this is very important


Time Periods

• Has a duration - beginning time point and end time point

• Examples
  – U.S. government fiscal year 1999 (Oct 1, 1998 to Sept 30, 1999)
  – Effective and expiration dates of an insurance policy

• An event may have multiple time periods associated with it
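The fiscal year example can be made concrete. This is a small sketch of the rule that U.S. government fiscal year N runs from Oct 1 of year N-1 through Sept 30 of year N; the function names are illustrative only.

```python
from datetime import date

def us_federal_fiscal_year(d: date) -> int:
    """Fiscal year that contains date d (the year starts on Oct 1)."""
    return d.year + 1 if d.month >= 10 else d.year

def fiscal_year_period(fy: int):
    """Inclusive (start, end) dates of fiscal year fy."""
    return date(fy - 1, 10, 1), date(fy, 9, 30)
```

For instance, `fiscal_year_period(1999)` gives Oct 1, 1998 and Sept 30, 1999, matching the example above.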


Time Point Categories

• Generalization of a Time Point
• Examples
  – Last day of month (Jan 31, Feb 28, etc.)
  – New Moon
  – Mondays

• Categories must be well defined and data may be entered or calculated for each entry

• Example of use - service customer first Monday of each month
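Some time point categories can be calculated rather than entered. A minimal sketch for two of the examples above (first Monday of a month, last day of a month); the helper names are assumptions, and a category such as New Moon would more likely be loaded from a table.

```python
from datetime import date, timedelta

def first_weekday_of_month(year, month, weekday=0):
    """First occurrence of a weekday in a month (0 = Monday ... 6 = Sunday)."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)

def last_day_of_month(year, month):
    """Last calendar day of a month, leap years included."""
    next_month = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return next_month - timedelta(days=1)
```

A "service customer first Monday of each month" schedule for 1998 could then be generated with `[first_weekday_of_month(1998, m) for m in range(1, 13)]`.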


Time Period Categories

• Generalization of Time Periods

• Examples
  – Fiscal Year
  – Accounting Months
  – Sales weeks for a retailer (often Mon - Sun and numbered sequentially from the first full week in January)

• Example of use - comparing retail sales from last year and this year
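The retail sales week rule can be sketched as follows. One assumption goes beyond the slide: days in January that fall before the first full Mon - Sun week are treated as belonging to the last sales week of the previous year. A real retailer's calendar may define this differently.

```python
from datetime import date, timedelta

def first_full_week_start(year):
    """Monday that begins the first full Mon-Sun week in January."""
    jan1 = date(year, 1, 1)
    return jan1 + timedelta(days=(0 - jan1.weekday()) % 7)

def sales_week(d):
    """Return (sales_year, week_number) for date d, with week 1 starting on
    the first Monday in January (assumed convention, see the note above)."""
    year = d.year
    start = first_full_week_start(year)
    if d < start:
        # Assumption: early-January days belong to the previous sales year.
        year -= 1
        start = first_full_week_start(year)
    return year, (d - start).days // 7 + 1
```

Comparing "week 10 this year" with "week 10 last year" then compares like with like, even though the calendar dates differ.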


Bounded Time Periods

• Similar to time periods, but the span of time is not predefined

• Examples
  – The period when a person works for a company (or department)
  – Car ownership - day you acquire a car until the day you dispose of it


Tense

• Time factors can be
  – Past
  – Present
  – Future

• There are often business rules about recording past and future information as well as rules for changing that data


The Global Aspect

Many companies operate in multiple time zones (including global operations)

• To correlate time factors between different time zones, one generally sets up
  – A reference time zone
  – Rules for recording local time zones (or location)
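A minimal sketch of the reference-time-zone idea, assuming UTC as the reference zone and storing the local offset alongside it. The fixed offsets below are a simplification for illustration; a production system would normally use a time zone database, because offsets change with daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, for illustration only.
MUMBAI = timezone(timedelta(hours=5, minutes=30), "IST")
NEW_YORK_WINTER = timezone(timedelta(hours=-5), "EST")

def to_record(local_dt):
    """Store the instant in the reference zone (UTC) plus the local offset in minutes."""
    utc = local_dt.astimezone(timezone.utc)
    offset_min = int(local_dt.utcoffset().total_seconds() // 60)
    return utc.replace(tzinfo=None).isoformat(), offset_min

# Hypothetical event recorded at 8:00 in the morning, Mumbai local time.
event_in_mumbai = datetime(2002, 3, 15, 8, 0, tzinfo=MUMBAI)
record = to_record(event_in_mumbai)
same_instant_in_ny = event_in_mumbai.astimezone(NEW_YORK_WINTER)
```

The same instant falls on March 15 in Mumbai and on the evening of March 14 in New York, which is exactly the "bridges days" problem when dates from different locations are compared.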


Example – Tracking Employees

• We need to track some HR data and maintain history
  – Employees
    • Hours worked each day
    • Salary
    • Paychecks
  – Departments
    • Departmental Manager
    • Employees working in the department


Business Question

• Determine the number of hours a person worked during the week of January 1, 1998 (Thursday)

• If a work day includes midnight, we attribute all hours to the day in which the work period began
  – Note: Midnight is the beginning of the next day
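The attribution rule can be sketched as below. The shift data is hypothetical, and the week start day is a parameter because it is itself a business decision.

```python
from datetime import date, datetime, timedelta

def week_start(d, first_weekday=0):
    """First day of the week containing d (0 = Monday ... 6 = Sunday)."""
    return d - timedelta(days=(d.weekday() - first_weekday) % 7)

# Hypothetical shifts as (start, end); three of them cross midnight.
shifts = [
    (datetime(1997, 12, 31, 22, 0), datetime(1998, 1, 1, 6, 0)),
    (datetime(1998, 1, 1, 22, 0), datetime(1998, 1, 2, 6, 0)),
    (datetime(1998, 1, 4, 22, 0), datetime(1998, 1, 5, 6, 0)),
    (datetime(1998, 1, 5, 9, 0), datetime(1998, 1, 5, 17, 0)),
]

def hours_in_week(shifts, any_day_in_week, first_weekday=0):
    """Total hours for the week containing any_day_in_week. All hours of a
    shift are attributed to the day on which the shift began."""
    target = week_start(any_day_in_week, first_weekday)
    total = 0.0
    for start, end in shifts:
        if week_start(start.date(), first_weekday) == target:
            total += (end - start).total_seconds() / 3600
    return total
```

With a Monday week start the week of January 1, 1998 contains three of these shifts; with a Sunday week start it contains two, so the answer to the business question depends on the week definition.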


Additional Questions

• When does a work week start: Friday, Saturday, Sunday or Monday

• The week of Jan 1 goes across a calendar year boundary, do we split the week into two

• Are there two types of weeks: tax weeks and work weeks. We use tax week for the IRS and work week to calculate payroll

• Payroll withholding rules change every year


Logical Design

Building the logical data model

• Include time independent and time annotated items

• Temporarily ignore time dependencies and treat the model as if you were looking at the business at a specific point in time.

• Add time dependencies as a second step


LDM Without time dependencies


Add Some Time Dependencies

• Employees
  – Have hire and termination dates
  – Change salary
  – Change departments
• Departments
  – Are created and eliminated
  – Have changes in management


LDM With Time Dependencies


Notes

• Primary keys for tables (except PayCheck) include a start time

• All time periods include both a start time and an end time

• If we do not know the end time, should we
  – Use a standard default value (preferred)
  – Use a null
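One likely reason the default value is preferred is that it keeps date logic simple. The sketch below, using SQLite through Python with hypothetical table names, shows the same "as of" query against both conventions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp_default (emp_id TEXT, start_dt TEXT, end_dt TEXT NOT NULL);
CREATE TABLE emp_null    (emp_id TEXT, start_dt TEXT, end_dt TEXT);
INSERT INTO emp_default VALUES ('001', '1995-01-01', '9999-12-31');
INSERT INTO emp_null    VALUES ('001', '1995-01-01', NULL);
""")

as_of = "1999-12-31"

# Default end date: a plain BETWEEN finds the open-ended row.
with_default = conn.execute(
    "SELECT emp_id FROM emp_default WHERE ? BETWEEN start_dt AND end_dt",
    (as_of,),
).fetchall()

# Null end date: the same BETWEEN silently drops the row.
naive_null = conn.execute(
    "SELECT emp_id FROM emp_null WHERE ? BETWEEN start_dt AND end_dt",
    (as_of,),
).fetchall()

# Null end date handled explicitly: every query must remember this extra test.
handled_null = conn.execute(
    "SELECT emp_id FROM emp_null "
    "WHERE ? >= start_dt AND (end_dt IS NULL OR ? <= end_dt)",
    (as_of, as_of),
).fetchall()
```

The null version works only when every query, including ad hoc ones, includes the extra condition.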

• This is the normalized logical model


More Notes

• The LDM maintains RI
  – Physical model will generally not have RI
• The business rules for integrity of the data (similar to “RI”) are critical
  – The “basic business key” portions must match
  – The time period of the “referenced table” must include the time period of the “referencing table”
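The two-part rule can be checked with a query. This is a sketch under a simplifying assumption: a single row of the referenced table must cover the whole referencing period (periods stitched together from several rows are not handled). Table names, column names, and the two "bad" rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee  (emp_id TEXT, name TEXT, start_dt TEXT, end_dt TEXT);
CREATE TABLE member_of (emp_id TEXT, dept TEXT, from_dt TEXT, to_dt TEXT);
INSERT INTO employee VALUES ('001', 'Smith', '1995-01-01', '9999-12-31');
INSERT INTO employee VALUES ('002', 'Jones', '1996-04-01', '9999-12-31');
INSERT INTO member_of VALUES ('001', 'Acct', '1995-01-01', '1997-02-01');
-- Hypothetical bad row: membership starts before the employee's start date.
INSERT INTO member_of VALUES ('002', 'Finance', '1996-01-01', '9999-12-31');
-- Hypothetical bad row: no employee with this business key.
INSERT INTO member_of VALUES ('003', 'Finance', '1996-01-01', '9999-12-31');
""")

# A membership row is valid when some employee row has the same business key
# and a period that fully contains the membership period.
violations = conn.execute("""
SELECT m.emp_id, m.dept
FROM member_of m
WHERE NOT EXISTS (
    SELECT 1
    FROM employee e
    WHERE e.emp_id = m.emp_id
      AND e.start_dt <= m.from_dt
      AND m.to_dt <= e.end_dt
)
ORDER BY m.emp_id
""").fetchall()
```

Because the database cannot enforce this as a foreign key, a check like this has to run in the load process or as a periodic audit.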


Still More notes

• PayCheck is still the same except there is an important business rule

• The attribute salary has become a separate table with a 1:M relationship

• The 1:1 “manages a dept” relationship became an M:M relationship

• The 1:M “member of a dept” relationship became an M:M relationship


Looking At Queries

• Consider the following tables

Employee
Emp Id  Name   Start Dt    End Dt
001     Smith  1995-01-01  9999-12-31
002     Jones  1996-04-01  9999-12-31

Member Of
Emp Id  Dept     From Dt     To Dt
001     Acct     1995-01-01  1997-02-01
001     Finance  1997-02-01  9999-12-31
002     Finance  1996-04-01  9999-12-31

Salary
Emp Id  Salary  From Dt     To Dt
001     40000   1995-01-01  1997-01-31
001     50000   1997-02-01  9999-12-31
002     90000   1996-04-01  1997-03-14
002     100000  1997-03-15  9999-12-31


Average at a Point in Time

• Average salary for Finance at the end of 1999

  Select Avg(T3.Salary)
  From Member_Of T2, Salary T3
  Where T2.Dept = 'Finance'
    And T2.Emp_Id = T3.Emp_Id
    And '1999-12-31' Between T2.From_Dt And T2.To_Dt
    And '1999-12-31' Between T3.From_Dt And T3.To_Dt

• We have a similar query for Average 1998 salary
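A runnable version of the point-in-time query, sketched with SQLite through Python. The rows are the Member Of and Salary rows from the tables slide; the underscore table and column names and the text date storage are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member_of (emp_id TEXT, dept TEXT, from_dt TEXT, to_dt TEXT);
CREATE TABLE salary    (emp_id TEXT, salary INTEGER, from_dt TEXT, to_dt TEXT);
INSERT INTO member_of VALUES
    ('001', 'Acct',    '1995-01-01', '1997-02-01'),
    ('001', 'Finance', '1997-02-01', '9999-12-31'),
    ('002', 'Finance', '1996-04-01', '9999-12-31');
INSERT INTO salary VALUES
    ('001',  40000, '1995-01-01', '1997-01-31'),
    ('001',  50000, '1997-02-01', '9999-12-31'),
    ('002',  90000, '1996-04-01', '1997-03-14'),
    ('002', 100000, '1997-03-15', '9999-12-31');
""")

def avg_salary(dept, as_of):
    """Average salary of the people in dept on as_of, using their salary on as_of."""
    row = conn.execute(
        """
        SELECT AVG(s.salary)
        FROM member_of m
        JOIN salary s ON s.emp_id = m.emp_id
        WHERE m.dept = ?
          AND ? BETWEEN m.from_dt AND m.to_dt
          AND ? BETWEEN s.from_dt AND s.to_dt
        """,
        (dept, as_of, as_of),
    ).fetchone()
    return row[0]
```

The same date is applied to both tables, so each call describes the department as it stood on that single day.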


Are We Comparing the Right Averages

• People have changed departments
• Average salary at the end of 1998 and 1999 reflects those people who just happened to be in Finance at those points in time

• The average salary in Finance dropped because we transferred in a low salary employee

• We must create views that take into account the organizational changes


Yesterday’s Salary with Today’s Glasses

• What is the average salary for Finance at the end of 1998 based on those in finance at the end of 1999

  Select Avg(T3.Salary)
  From Member_Of T2, Salary T3
  Where T2.Dept = 'Finance'
    And T2.Emp_Id = T3.Emp_Id
    And '1999-12-31' Between T2.From_Dt And T2.To_Dt
    And '1998-12-31' Between T3.From_Dt And T3.To_Dt
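The two-date query can be sketched the same way, with one date selecting who counts as "in the department" and a second date selecting which salary to read. The data is again taken from the tables slide; the names and the function are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member_of (emp_id TEXT, dept TEXT, from_dt TEXT, to_dt TEXT);
CREATE TABLE salary    (emp_id TEXT, salary INTEGER, from_dt TEXT, to_dt TEXT);
INSERT INTO member_of VALUES
    ('001', 'Acct',    '1995-01-01', '1997-02-01'),
    ('001', 'Finance', '1997-02-01', '9999-12-31'),
    ('002', 'Finance', '1996-04-01', '9999-12-31');
INSERT INTO salary VALUES
    ('001',  40000, '1995-01-01', '1997-01-31'),
    ('001',  50000, '1997-02-01', '9999-12-31'),
    ('002',  90000, '1996-04-01', '1997-03-14'),
    ('002', 100000, '1997-03-15', '9999-12-31');
""")

def avg_salary_for_cohort(dept, cohort_date, salary_date):
    """Average salary on salary_date of the people who were in dept on cohort_date."""
    row = conn.execute(
        """
        SELECT AVG(s.salary)
        FROM member_of m
        JOIN salary s ON s.emp_id = m.emp_id
        WHERE m.dept = ?
          AND ? BETWEEN m.from_dt AND m.to_dt   -- who is in the cohort
          AND ? BETWEEN s.from_dt AND s.to_dt   -- which salary to read
        """,
        (dept, cohort_date, salary_date),
    ).fetchone()
    return row[0]
```

With this sample data, the people in Finance at the end of 1999 averaged 65000 at the end of 1996, while the people actually in Finance at the end of 1996 averaged 90000. The two numbers answer different questions, and the difference comes entirely from Smith transferring in.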


Query Notes

• Queries that involve one time point (or period) are usually straightforward

• Queries involving two time points (or periods) can cause significant confusion

• Watch out for a set of 2 (or more) queries which involve more than one time point (or period). They look deceptively simple


Design Guidelines

• First build the logical data model
  – Include time independent and time annotated items (e.g. date of birth)
  – Temporarily ignore time dependencies and treat the model as if you were looking at the business at a specific point in time
  – Make notes about all time dependent attributes, entities, and relationships

• Gather potential queries, query sets and reports


Design - 2

• The LDM should be in Third Normal Form
• The primary keys in this model will be the basic business keys for future integrity rules
• Do not combine entities in 1:1 relationships unless they truly represent the same thing or concept

• In general column vectors are preferred to row vectors

• Delay design changes for physical considerations until later


Adding in the Time Factors

• Individual Attributes
• Groups of Attributes
• 1:1 Relationships
• 1:M Relationships
• M:M Relationships
• N-ary Relationships
• Integrity Rules
• Multiple Time Factor Case


Integrity Rules

• Referential integrity often does not hold in the physical database design
  – There is no exact matching of primary and foreign keys
• A business integrity rule usually replaces RI
  – An exact match of the business key (like RI)
  – A rule for how the time factors must relate to each other


Hard Problem

• The design of the database is the easy problem

• Training customers to properly understand and use a time varying database is hard and you should not underestimate the task.


Thank you for your patience

Questions

Dr. Jerry Rosenbaum

ConcentrX, LLC

410-764-1843 voice

443-253-6054 mobile

410-764-2445 fax

[email protected]