Relational Database Design

Bill Woolfolk
Public Health Sciences
University of Virginia
woolfolk@virginia.edu


Objectives

Understand definition of modern relational database

Understand and be able to apply a practical method for designing databases

Recognize and avoid common pitfalls of database design

What’s a database?

A collection of logically related information stored in a consistent fashion
◦ Phone book
◦ Bank records (checking statements, etc.)
◦ Library card catalog
◦ Soccer team roster

The storage format typically appears to users as some kind of tabular list (table, spreadsheet)

What Does a Database Do?

Stores information in a highly organized manner

Manipulates information in various ways, some of which are not available in other applications or are easier to accomplish with a database

Models some real-world process or activity through electronic means
◦ Often called modeling a business process
◦ Often replicates the process only in appearance or end result

Databases and the Systems Which Manage Them

Modern electronic databases are created and managed by means of an RDBMS: Relational DataBase Management System

An individual data storage structure created with an RDBMS is typically called a “database”

A database together with its attendant views, reports, and procedures is called an “application”

Database Applications

Database (the actual DB with its attendant storage structure)

SQL Engine – interprets between the database and the interface/application

Interface or application – the part the user gets to see and use

Relational Database Management Systems

Low-end, proprietary, specific purpose
◦ Email: Outlook, Eudora, Mulberry
◦ Bibliographic: Ref. Mgr., EndNote, ProCite

Mid-level
◦ Microsoft Access, Lotus Approach, Borland’s Paradox
◦ More or less total control of design allows custom builds

High-end
◦ Oracle, Microsoft SQL Server, Sybase, IBM DB2
◦ Professional level DBs: Banks, e-commerce, secure
◦ Amazon.com, Ebay.com, Yahoo.com

Problems with Bad Design

Early computers were slow and had limited storage capacity

Redundant or repeating data slowed operations and took up too much precious storage space

Poor design increased chance of data errors, lost or orphaned information

Benefits of Good Design

Computers today are faster and possess much larger storage devices

Rigid structure of modern relational databases helped codify problems and solutions

Design problems are still possible, because the DBMS software won’t protect you from poor practices

Good design still increases efficiency of data processes, reduces waste of storage, and helps eliminate data entry errors

Codd’s Rules

Edgar F. Codd
◦ Mathematician and Researcher at IBM
◦ Devised the relational data model in 1970
◦ Published 12 rules in 1985 defining ideal relational database, added 6 more in 1990

E. F. Codd: A Relational Model of Data for Large Shared Data Banks. CACM 13(6): 377-387 (1970) (http://www.acm.org/classics/nov95/toc.html)

Codd, E. (1985). "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" ComputerWorld, October 14 and October 21.

Modification Anomalies

A search for “General Tool Co.” would miss “General Tool” and “General Toll”. A case-sensitive search for “Totally Toys” would miss “TOTALLY TOYS”.

Customers_Orders_Inventory

Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

Insertion Anomalies

How would you enter a new item into your inventory if no one had ordered it yet?

Customers_Orders_Inventory

Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

Deletion Anomalies

If you wanted to stop selling “dot matrix printer” and remove it from your inventory, you would have to delete the order and customer info for “XYZ Inc.”

Customers_Orders_Inventory

Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

The Fix

Order_Items

OrderNum  ItemNum
06755     2246
07456     2246
08134     3145
08622     3145
08622     3967
09010     0446

Orders

CustomerNum  OrderNum
7822         09010
8755         06755
8755         08134
9123         07456
9123         08622

Customers

CustomerNum  Customer
7822         XYZ Inc.
8755         Totally Toys
9123         General Tool Co.

Products

ItemNum  Item
0446     Dot Matrix Printer
2246     Pentium Computer
3145     Hewlett-Packard printer
3967     17” monitor
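As a rough sketch of how the four tables above sidestep the three anomalies, the following loads them into an in-memory SQLite database using Python's built-in sqlite3 module. The choice of SQLite, the TEXT column types, and the extra "Scanner" item are assumptions made for illustration; the slides do not prescribe a particular RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers   (CustomerNum TEXT NOT NULL PRIMARY KEY, Customer TEXT NOT NULL);
CREATE TABLE Products    (ItemNum TEXT NOT NULL PRIMARY KEY, Item TEXT NOT NULL);
CREATE TABLE Orders      (OrderNum TEXT NOT NULL PRIMARY KEY, CustomerNum TEXT NOT NULL);
CREATE TABLE Order_Items (OrderNum TEXT NOT NULL, ItemNum TEXT NOT NULL,
                          PRIMARY KEY (OrderNum, ItemNum));
""")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [("7822", "XYZ Inc."), ("8755", "Totally Toys"),
                  ("9123", "General Tool Co.")])
conn.executemany("INSERT INTO Products VALUES (?, ?)",
                 [("0446", "Dot Matrix Printer"), ("2246", "Pentium Computer"),
                  ("3145", "Hewlett-Packard printer"), ("3967", '17" monitor')])
conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [("09010", "7822"), ("06755", "8755"), ("08134", "8755"),
                  ("07456", "9123"), ("08622", "9123")])
conn.executemany("INSERT INTO Order_Items VALUES (?, ?)",
                 [("06755", "2246"), ("07456", "2246"), ("08134", "3145"),
                  ("08622", "3145"), ("08622", "3967"), ("09010", "0446")])

# Insertion: a new product can be stored before anyone orders it.
# ("5001" / "Scanner" is a made-up item, not from the slides.)
conn.execute("INSERT INTO Products VALUES (?, ?)", ("5001", "Scanner"))

# Modification: the customer name is stored once, so one spelling finds every order.
general_tool_orders = [row[0] for row in conn.execute(
    "SELECT o.OrderNum FROM Orders o "
    "JOIN Customers c ON c.CustomerNum = o.CustomerNum "
    "WHERE c.Customer = ? ORDER BY o.OrderNum", ("General Tool Co.",))]

# Deletion: dropping the dot matrix printer leaves the customer record alone.
conn.execute("DELETE FROM Order_Items WHERE ItemNum = '0446'")
conn.execute("DELETE FROM Products WHERE ItemNum = '0446'")
xyz_still_there = conn.execute(
    "SELECT COUNT(*) FROM Customers WHERE Customer = 'XYZ Inc.'").fetchone()[0]

print(general_tool_orders, xyz_still_there)
```

Order 09010 still exists after the delete, only without line items; whether that is acceptable is a business-rule decision rather than something the schema decides.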

The Design Process

1) Identify the purpose of the database
2) Review existing data
3) Make a preliminary list of fields
4) Make a preliminary list of tables and enter fields
5) Identify the key fields
6) Draft the table relationships
7) Enter sample data and normalize the data/tables
8) Review and finalize the design

Database Modeling

Refers to various, more-or-less formal methods for designing a database

Some provide precision steps and tools
◦ Ex.: Entity-Relationship (E-R) Modeling
◦ Widely used, especially by high-end database designers who can’t afford to miss things
◦ Fairly complex process
◦ Extremely precise

1. Identify Purpose of the DB

Clients can tell you what information they want but have no idea what data they need.
◦ “We need to keep track of inventory”
◦ “We need an order entry system”
◦ “I need monthly sales reports”
◦ “We need to provide our product catalog on the Web”

Be sure to Limit the Scope of the database.

2. Review Existing Data

Electronic
◦ Legacy database(s)
◦ Spreadsheets
◦ Web forms

Manual
◦ Paper forms
◦ Receipts and other printed output

3. Make Preliminary Field List

Make sure fields exist to support needs
◦ Ex. If client wants monthly sales reports, you need a date field for orders.
◦ Ex. To group employees by division, you need a division identifier

Make sure values are atomic
◦ Ex. First and Last names stored separately
◦ Ex. Addresses broken down to Street, City, State, etc.

Do not store values that can be calculated from other values
◦ Ex. “Age” can be calculated from “Date of Birth”
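To illustrate the last point, here is a minimal sketch of deriving age on demand instead of storing it. The function name `age_on` and the sample dates are made up for the example; only the date of birth would live in the table.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the number of whole years between date_of_birth and today."""
    # Compare (month, day) pairs to see whether this year's birthday has passed.
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


dob = date(1970, 6, 15)                 # the stored value
print(age_on(dob, date(2024, 6, 14)))   # 53: the day before the birthday
print(age_on(dob, date(2024, 6, 15)))   # 54: on the birthday
```

A stored "Age" field would be wrong the day after it was entered; a calculated one cannot drift away from the date of birth.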

4. Make Preliminary Tables (and insert the fields into them)

Each table holds info about one subject

Don’t worry about the quantity of tables

Look for logical groupings of information

Use a consistent naming convention

Naming Conventions

Rules of thumb
◦ Table names must be unique in DB; should be plural
◦ Field names must be unique in the table(s)
◦ Clearly identify table subject or field data
◦ Be as brief as possible
◦ Avoid abbreviations and acronyms
◦ Use fewer than 30 characters
◦ Use letters, numbers, underscores (_)
◦ Do not use spaces or other special characters
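The mechanical rules above (length and allowed characters) are easy to check automatically. The following is a small sketch using Python's re module; the function name `is_valid_name` is hypothetical, and the check deliberately ignores the judgment-based rules such as brevity or avoiding abbreviations.

```python
import re

# Letters, digits and underscores only; 1 to 29 characters ("fewer than 30").
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,29}")


def is_valid_name(name: str) -> bool:
    """Return True if name satisfies the character and length rules of thumb."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Order_Items", "Order Items", "Customers", "x" * 30]:
    print(candidate, is_valid_name(candidate))
```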

Naming Conventions (cont’d)

Leszynski Naming Convention (LNC)
◦ Example: tblEmployees, qryPartNum
◦ tbl, qry = tag
◦ Employees, PartNum = basename

LNC at Microsoft Developers Network

5. Identify the Key Fields

Primary Key(s)
◦ Can never be Null; must hold unique values
◦ Automatically indexed in most RDBMSs
◦ Values rarely (if ever) change
◦ Try to include as few fields as possible

Multi-field Primary Key
◦ Combination of two or more fields that uniquely identify an individual record

Candidate Key
◦ Field or fields that qualify as a primary key
◦ Important in Third and Boyce-Codd Normal Forms
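The primary key rules above can be seen in action with a short sketch, again assuming SQLite through Python's sqlite3 module. The table and field names are borrowed from the employee examples later in the deck. Note that SQLite, unlike most RDBMSs, needs NOT NULL spelled out on a text primary key, so it is declared explicitly here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID TEXT NOT NULL PRIMARY KEY, LastName TEXT)")
# Multi-field primary key: the combination must be unique, not each field alone.
conn.execute("""CREATE TABLE Employees_Projects (
    EmployeeID TEXT NOT NULL,
    ProjNum    TEXT NOT NULL,
    PRIMARY KEY (EmployeeID, ProjNum))""")
conn.execute("INSERT INTO Employees VALUES ('EN1-26', 'O''Brien')")

rejected = []
for label, sql in [
    ("duplicate key", "INSERT INTO Employees VALUES ('EN1-26', 'Guya')"),
    ("null key", "INSERT INTO Employees VALUES (NULL, 'Baranco')"),
]:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        rejected.append(label)

# The same employee on two different projects is fine...
conn.execute("INSERT INTO Employees_Projects VALUES ('EN1-26', '30-452-T3')")
conn.execute("INSERT INTO Employees_Projects VALUES ('EN1-26', '30-457-T3')")
# ...but repeating the same employee/project pair is not.
try:
    conn.execute("INSERT INTO Employees_Projects VALUES ('EN1-26', '30-452-T3')")
except sqlite3.IntegrityError:
    rejected.append("duplicate pair")

print(rejected)
```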

6. Identify Table Relationships

Based on business rules being modeled

Examples:
◦ “each customer can place many orders”
◦ “all employees belong to a department”
◦ “each TA is assigned to one course”

Relationship Terminology

Relationship Type
◦ One-to-One: expressed as 1:1
◦ One-to-Many: expressed as 1:N or 1:M or 1:∞
◦ Many-to-Many: expressed as N:N or M:M

Primary or Parent Table
◦ Table on the left side of 1:N relationship

Related or Child Table
◦ Table on the right side of 1:N relationship

Relational Schema
◦ Diagram of table relationships in database

Relationship Terminology (cont’d)

Join
◦ Definition of how related records are returned

Join Line
◦ Visual relationship indicators in schema

Key fields
◦ Primary Key: the linking field on the one side of a 1:N relationship
◦ Foreign Key: the primary key from one table that is added to another table so the records can be related
◦ Non-Key Fields: any field that is not part of a primary key, multi-field primary key, or foreign key

One-to-One (1:1)

Each record in Table A relates to one, and only one, record in Table B, and vice versa.

Either table can be considered the Primary, or Parent Table

Can usually be combined into one table, although may not be most efficient design

One-to-Many (1:N)

Each record in Table A may relate to zero, one or many records in Table B, but each record in Table B relates to only one record in Table A.

The potential relationship is what’s important: there might be no related records, or only one, but there could be many.

The table on the One (or left) side of a 1:N relationship is considered the Primary Table.

Many-to-Many (N:N)

A record in Table A can relate to many records in Table B, and a record in Table B can relate to many records in Table A.

Most RDBMSs do not support N:N relationships, requiring the use of a linking (or intersection or bridge) table that breaks the N:N relationship down into two 1:N relationships with the linking table being on the Many side of both new relationships.
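A linking table can be sketched as follows, assuming SQLite via Python's sqlite3 module and borrowing the employee and project names used later in the deck. Employees_Projects sits on the Many side of both 1:N relationships and carries a foreign key to each parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID TEXT NOT NULL PRIMARY KEY, LastName TEXT);
CREATE TABLE Projects  (ProjNum TEXT NOT NULL PRIMARY KEY, Title TEXT);
-- Linking table: one row per (employee, project) pairing.
CREATE TABLE Employees_Projects (
    EmployeeID TEXT NOT NULL REFERENCES Employees(EmployeeID),
    ProjNum    TEXT NOT NULL REFERENCES Projects(ProjNum),
    PRIMARY KEY (EmployeeID, ProjNum));
INSERT INTO Employees VALUES ('EN1-26', 'O''Brien'), ('EN1-33', 'Guya');
INSERT INTO Projects VALUES ('30-452-T3', 'STAR Manual'), ('30-457-T3', 'ISO Procedures');
INSERT INTO Employees_Projects VALUES
    ('EN1-26', '30-452-T3'), ('EN1-26', '30-457-T3'), ('EN1-33', '30-452-T3');
""")

# One employee, many projects.
projects_for_obrien = [row[0] for row in conn.execute(
    "SELECT p.Title FROM Projects p "
    "JOIN Employees_Projects ep ON ep.ProjNum = p.ProjNum "
    "WHERE ep.EmployeeID = 'EN1-26' ORDER BY p.Title")]

# One project, many employees.
staff_on_star = [row[0] for row in conn.execute(
    "SELECT e.LastName FROM Employees e "
    "JOIN Employees_Projects ep ON ep.EmployeeID = e.EmployeeID "
    "WHERE ep.ProjNum = '30-452-T3' ORDER BY e.LastName")]

print(projects_for_obrien, staff_on_star)
```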

Relational Schema

Table 1: Field1_1, Field1_2, Field1_3, Field1_4

Table 2: Field2_1, Field1_1, Field2_2, Field2_3

Join line: Table 1.Field1_1 (1) to Table 2.Field1_1 (N)

7. Normalization

Normal Forms (NF): design standards based on database design theory

Normalization is the process of applying the NFs to table design to eliminate redundancy and create a more efficient organization of DB storage.

Each successive NF applies an increasingly stringent set of rules

First Normal Form (1NF)

A table is in first normal form if there are no repeating groups.

Repeating Groups: a set of logically related fields or values that occur multiple times in one record
◦ 1: non-atomic value, or multiple values, stored in a field
◦ 2: multiple fields in the same table that hold logically similar values

Sample 1NF Violation - 1

Employee_Projects_Time

EmployeeID  Name            Project                          Time
EN1-26      Sean O’Brien    30-452-T3, 30-457-T3, 32-244-T3  0.25, 0.40, 0.30
EN1-33      Amy Guya        30-452-T3, 30-328-TC, 32-244-T3  0.05, 0.35, 0.60
EN1-35      Steven Baranco  30-452-T3, 31-238-TC             0.15, 0.80

Sample 1NF Violation - 2

Employee_Projects_Time

EmpID   Last Name  First Name  Proj1      Time1  Proj2      Time2
EN1-26  O’Brien    Sean        30-452-T3  0.25   30-457-T3  0.40
EN1-33  Guya       Amy         30-452-T3  0.05   30-328-TC  0.35

Tables in 1NF

Employees

*EmployeeID  LastName  FirstName
EN1-26       O’Brien   Sean
EN1-33       Guya      Amy
EN1-35       Baranco   Steven

Employees_Projects

*ProjNum   *EmployeeID  Time
30-328-TC  EN1-33       0.35
30-452-T3  EN1-26       0.25
30-452-T3  EN1-33       0.05
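The move from the first 1NF violation to the tables above can be sketched in plain Python. This is only an illustration of the idea: it assumes the multi-valued fields are comma-separated strings and that names are stored as "First Last", which is how the sample data happens to look.

```python
# Rows as they appear in the un-normalized Employee_Projects_Time table.
raw = [
    ("EN1-26", "Sean O'Brien", "30-452-T3, 30-457-T3, 32-244-T3", "0.25, 0.40, 0.30"),
    ("EN1-33", "Amy Guya", "30-452-T3, 30-328-TC, 32-244-T3", "0.05, 0.35, 0.60"),
    ("EN1-35", "Steven Baranco", "30-452-T3, 31-238-TC", "0.15, 0.80"),
]

employees = []           # (EmployeeID, LastName, FirstName)
employees_projects = []  # (ProjNum, EmployeeID, Time)

for emp_id, name, projects, times in raw:
    # Make the name atomic: one field per name part.
    first, last = name.split(" ", 1)
    employees.append((emp_id, last, first))
    # Turn each repeating group into its own row.
    for proj, time in zip(projects.split(", "), times.split(", ")):
        employees_projects.append((proj, emp_id, float(time)))

for row in sorted(employees_projects):
    print(row)
```

Three source rows become three employee rows plus eight employee/project rows, each holding a single value per field.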

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and each non-key field is functionally dependent on the entire primary key.

Functional dependency: a relationship between fields such that the value in one field determines the one value that can be contained in the other field.

Determinant: a field in which the value determines the value in another field.

Example: Airport – City

Dulles – Washington, DC

Sample 2NF Violation

Employees_Projects

*EmpID  Lname    Fname  *ProjNum   ProjTitle
EN1-26  O’Brien  Sean   30-452-T3  STAR Manual
EN1-26  O’Brien  Sean   30-457-T3  ISO Procedures
EN1-26  O’Brien  Sean   31-124-T3  Employee Handbook
EN1-33  Guya     Amy    30-452-T3  STAR Manual
EN1-33  Guya     Amy    30-482-TC  Web site

Tables in 2NF

Employees

*EmployeeID  LastName  FirstName
EN1-26       O’Brien   Sean
EN1-33       Guya      Amy

Employees_Projects

*EmployeeID  *ProjNum
EN1-26       30-452-T3
EN1-33       30-452-T3

Projects

*ProjNum   Title
30-452-T3  STAR Manual
30-457-T3  ISO Procedures
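One way to spot the partial dependencies that 2NF removes is to test functional dependencies directly on sample data. The helper below, `holds`, is a hypothetical name for a small sketch; the rows are modeled on the un-normalized Employees_Projects sample. A check like this can only disprove a dependency from the data at hand; the business rules decide whether it really holds.

```python
def holds(rows, determinant, dependent):
    """Return True if every determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[field] for field in determinant)
        value = tuple(row[field] for field in dependent)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


rows = [
    {"EmpID": "EN1-26", "Lname": "O'Brien", "ProjNum": "30-452-T3", "ProjTitle": "STAR Manual"},
    {"EmpID": "EN1-26", "Lname": "O'Brien", "ProjNum": "30-457-T3", "ProjTitle": "ISO Procedures"},
    {"EmpID": "EN1-33", "Lname": "Guya", "ProjNum": "30-452-T3", "ProjTitle": "STAR Manual"},
]

# Each non-key field depends on only PART of the (EmpID, ProjNum) key:
print(holds(rows, ["ProjNum"], ["ProjTitle"]))  # True
print(holds(rows, ["EmpID"], ["Lname"]))        # True
# ...and the other half of the key does not determine it:
print(holds(rows, ["EmpID"], ["ProjTitle"]))    # False
```

Because ProjTitle depends on ProjNum alone and Lname on EmpID alone, each moves to a table keyed by just that field.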

Third Normal Form (3NF)

A table is in 3NF when it is in 2NF and there are no transitive dependencies.

Transitive Dependency: a type of functional dependency in which the value of a non-key field is determined by the value in another non-key field and that field is not a candidate key.

Sample 3NF Violation

Projects_Managers

*ProjNum   ProjTitle       ProjMgr   Phone
30-452-T3  STAR Manual     Garrison  2756
30-457-T3  ISO Procedures  Jacanda   2954
30-482-TC  Web Site        Friedman  2846
31-124-T3  STAR prototype  Garrison  2756
35-272-TC  Order System    Jacanda   2954

Tables in 3NF

Projects

*ProjNum   ProjTitle       Manager
30-452-T3  STAR Manual     Garrison
30-457-T3  ISO Procedures  Jacanda

Project Managers

*Manager  Phone
Garrison  2756
Jacanda   2954
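The payoff of removing the transitive dependency (ProjNum determines Manager, Manager determines Phone) is that a phone number lives in exactly one row. A minimal sketch, assuming SQLite via Python's sqlite3 module, with phone values taken from the Projects_Managers sample and a made-up new extension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Project_Managers (Manager TEXT NOT NULL PRIMARY KEY, Phone TEXT);
CREATE TABLE Projects (
    ProjNum   TEXT NOT NULL PRIMARY KEY,
    ProjTitle TEXT,
    Manager   TEXT REFERENCES Project_Managers(Manager));
INSERT INTO Project_Managers VALUES ('Garrison', '2756'), ('Jacanda', '2954');
INSERT INTO Projects VALUES
    ('30-452-T3', 'STAR Manual',    'Garrison'),
    ('30-457-T3', 'ISO Procedures', 'Jacanda'),
    ('31-124-T3', 'STAR prototype', 'Garrison');
""")

# One UPDATE touches one row ('2999' is a hypothetical new extension)...
conn.execute("UPDATE Project_Managers SET Phone = '2999' WHERE Manager = 'Garrison'")

# ...and every project managed by Garrison reflects it through the join.
phones = conn.execute(
    "SELECT p.ProjNum, m.Phone FROM Projects p "
    "JOIN Project_Managers m ON m.Manager = p.Manager "
    "ORDER BY p.ProjNum").fetchall()
print(phones)
```

In the un-normalized Projects_Managers table the same change would have to be repeated on every Garrison row, with a stale number left behind on any row that was missed.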

Boyce-Codd Normal Form (BCNF)

A table is in BCNF when it is in 3NF and all determinants are candidate keys.

Developed to cover situations that 3NF did not address.

Applies to situations where you have overlapping candidate keys.

Sample Business Rules

Business Rules:
◦ Each course can have many students
◦ Each student can take many courses
◦ Each course can have multiple teaching assistants (TAs)
◦ Each TA is associated with only one course
◦ For each course, each student has one TA

Sample BCNF Violation

Course_Students_TAs

CourseNum  Student  TA
ENG101     Jones    Clark
ENG101     Grayson  Chen
ENG101     Samara   Chen
MAT350     Grayson  Powers
MAT350     Jones    O’Shea
MAT350     Berg     Powers

Tables in BCNF

Students

*Student  *TA
Jones     Clark
Grayson   Chen

TAs

*CourseNum  *TA
ENG101      Clark
ENG101      Chen

Courses

*CourseNum  *Student
ENG101      Jones
MAT350      Grayson

Fourth Normal Form (4NF)

A table is in 4NF when it is in BCNF and there are no multi-valued dependencies.

Multi-valued Dependency: occurs when, for each value in field A, there is a set of values for field B and a set of values for field C, but B and C are not related.

Occurs when the table contains fields that are not logically related.

Sample 4NF Violation - 1

Movies

*Movie            *Star           *Producer
Once Upon a Time  Judy Garland    Alfred Brown
Once Upon a Time  Mickey Rooney   Alfred Brown
Once Upon a Time  Judy Garland    Muriel Hemingway
Once Upon a Time  Mickey Rooney   Muriel Hemingway
Moonlight         Humphrey Bogart Alfred Brown
Moonlight         Judy Garland    Alfred Brown

Tables in 4NF - 1

Stars

*Movie            *Star
Once Upon a Time  Judy Garland
Once Upon a Time  Mickey Rooney
Moonlight         Humphrey Bogart
Moonlight         Judy Garland

Producers

*Movie            *Producer
Once Upon a Time  Alfred Brown
Once Upon a Time  Muriel Hemingway
Moonlight         Alfred Brown
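A quick way to convince yourself that nothing was lost in this split is to rejoin the two tables on Movie and compare the result with the original. The sketch below does this with plain Python sets; it is an illustration of the idea, not a general 4NF test.

```python
movies = {
    ("Once Upon a Time", "Judy Garland", "Alfred Brown"),
    ("Once Upon a Time", "Mickey Rooney", "Alfred Brown"),
    ("Once Upon a Time", "Judy Garland", "Muriel Hemingway"),
    ("Once Upon a Time", "Mickey Rooney", "Muriel Hemingway"),
    ("Moonlight", "Humphrey Bogart", "Alfred Brown"),
    ("Moonlight", "Judy Garland", "Alfred Brown"),
}

# Project the three-field table onto the two independent facts it mixes.
stars = {(movie, star) for movie, star, _ in movies}
producers = {(movie, producer) for movie, _, producer in movies}

# Join the projections back together on Movie.
rejoined = {
    (movie, star, producer)
    for movie, star in stars
    for other_movie, producer in producers
    if movie == other_movie
}

print(len(movies), len(stars), len(producers), rejoined == movies)
```

Six three-field rows become four plus three two-field rows, and adding a producer to a movie now takes one new row instead of one per star.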

Sample 4NF Violation - 2

Projects_Equipment

DeptCode  ProjNum    ProjMgrID  Equip               PropID
IS        36-272-TC  EN1-15     CD-ROM              657
IS                              VGA monitor         305
AC        36-152-TC  EN1-15
AC                              Dot matrix printer  358
AC                              Calculator w/tape   239
TW        30-452-T3  EN1-10     486 PC              275
TW        30-457-T3  EN1-15
TW        31-124-T3  EN1-15     Laser Printer       109

Tables in 4NF - 2

Equipment

*PropID  Equip               DeptCode
657      CD-ROM              IS
305      VGA monitor         IS
358      Dot matrix printer  AC

Projects

*ProjNum   ProjMgrID  DeptCode
36-272-TC  EN1-15     IS
36-152-TC  EN1-15     AC
30-452-T3  EN1-10     TW

Fifth Normal Form (5NF)

A table is in 5NF when it is in 4NF and there are no cyclic dependencies.

Cyclic Dependency: occurs when there is a multi-field primary key with three or more fields (ex. A, B, C) and those fields are related in pairs AB, BC and AC.

Can occur only with a multi-field primary key of three or more fields

Sample 5NF Violation

BUYING

*Buyer  *Product  *Company
Chris   Jeans     Levi
Chris   Jeans     Wrangler
Chris   Shirts    Levi
Lori    Jeans     Levi

Do the math

Our sample has two buyers, two products and two companies, so the table could hold as many as…

2 x 2 x 2 = 8 total records

But, what if our store has 20 buyers, 50 products and 100 companies?

20 x 50 x 100 = 100,000 total records

A Tempting Solution

Buyers

*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products

*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi

The Correct Solution

Buyers

*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products

*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi

Companies

*Buyer  *Company
Chris   Wrangler
Chris   Levi
Lori    Levi

Check the Math, Again

What if our company has 20 buyers, 50 products and 100 companies?

Buyers = 20 x 50 = 1000

Products = 50 x 100 = 5000

Companies = 20 x 100 = 2000

8,000 total records instead of 100,000!
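The difference between the tempting two-table split and the three-table split can be checked by rejoining the pieces, and the record counts above by simple arithmetic. The following sketch uses plain Python sets built from the BUYING sample; the variable names are made up for the example.

```python
buying = {
    ("Chris", "Jeans", "Levi"),
    ("Chris", "Jeans", "Wrangler"),
    ("Chris", "Shirts", "Levi"),
    ("Lori", "Jeans", "Levi"),
}

# The three pairwise projections (Buyers, Products, Companies tables).
buyer_product = {(b, p) for b, p, c in buying}
product_company = {(p, c) for b, p, c in buying}
buyer_company = {(b, c) for b, p, c in buying}

# Tempting solution: rejoin only Buyers and Products on Product.
two_way = {
    (b, p, c)
    for b, p in buyer_product
    for other_p, c in product_company
    if p == other_p
}
spurious = two_way - buying  # rows the join invents

# Correct solution: also require the (Buyer, Company) pair to exist.
three_way = {(b, p, c) for b, p, c in two_way if (b, c) in buyer_company}

print(spurious)             # the two-table join claims Lori buys Wrangler jeans
print(three_way == buying)  # the three-table join recovers the original

# Worst-case record counts for 20 buyers, 50 products and 100 companies.
single_table = 20 * 50 * 100
three_tables = 20 * 50 + 50 * 100 + 20 * 100
print(single_table, three_tables)
```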

8. Finalizing the Design

Double-check to ensure good, principle-based design

Evaluate design in light of business model and determine desired deviations from design principles
◦ Process efficiency
◦ Security concerns

That’s it for Table Design

Watch for repeating values and fields

Check against the Normal Forms

Make new tables when necessary

Re-check all tables against the NFs

Remember the business rules

Use common sense, but check anyway!

Ensuring Data Integrity

Placing constraints on how and when and where data can be entered

Done after or along with table design

Part of design process because many constraints are established at the database and table levels

Referential Integrity

True relational databases support Referential Integrity: every non-null foreign key value must match an existing primary key value.

In other words, every record in a related table must have a matching record in the primary table.

Preserves the validity of foreign key values.

Enforced at database level.
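As a small sketch of database-level enforcement, the following uses SQLite through Python's sqlite3 module. One SQLite-specific assumption matters here: foreign key checking is off by default and has to be switched on per connection with `PRAGMA foreign_keys = ON`. Order numbers 09011 and 09012 and customer 9999 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs unless asked
conn.executescript("""
CREATE TABLE Customers (CustomerNum TEXT NOT NULL PRIMARY KEY, Customer TEXT);
CREATE TABLE Orders (
    OrderNum    TEXT NOT NULL PRIMARY KEY,
    CustomerNum TEXT REFERENCES Customers(CustomerNum));
INSERT INTO Customers VALUES ('7822', 'XYZ Inc.');
INSERT INTO Orders VALUES ('09010', '7822');
""")

# A null foreign key is allowed: the rule applies to non-null values only.
conn.execute("INSERT INTO Orders VALUES ('09011', NULL)")

# A non-null foreign key must match an existing primary key value.
try:
    conn.execute("INSERT INTO Orders VALUES ('09012', '9999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(orphan_rejected, order_count)
```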

Cascading Updates

When a primary key value changes, Cascade Update changes the corresponding values in the related records, so no records get orphaned.

Usually only one level deep
◦ Foreign key is not usually primary key of related table (except in 1:1 relationships), hence no other tables are usually related to it

Cascade Deletes

When a primary table record is deleted, all matching records in any related table are also deleted

Can propagate through multiple tables if Cascade Delete is turned on in all relationships between those tables

Another protection against orphan records, only this time by eradicating them instead!
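Both cascade behaviors can be sketched with SQLite's `ON UPDATE CASCADE` and `ON DELETE CASCADE` clauses, again through Python's sqlite3 module and with foreign key enforcement switched on explicitly. The new customer number 8800 is hypothetical; the other values come from the Orders and Order_Items examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Customers (CustomerNum TEXT NOT NULL PRIMARY KEY, Customer TEXT);
CREATE TABLE Orders (
    OrderNum    TEXT NOT NULL PRIMARY KEY,
    CustomerNum TEXT REFERENCES Customers(CustomerNum)
                ON UPDATE CASCADE ON DELETE CASCADE);
CREATE TABLE Order_Items (
    OrderNum TEXT NOT NULL REFERENCES Orders(OrderNum) ON DELETE CASCADE,
    ItemNum  TEXT NOT NULL,
    PRIMARY KEY (OrderNum, ItemNum));
INSERT INTO Customers VALUES ('8755', 'Totally Toys');
INSERT INTO Orders VALUES ('06755', '8755'), ('08134', '8755');
INSERT INTO Order_Items VALUES ('06755', '2246'), ('08134', '3145');
""")

# Cascade Update: changing the primary key rewrites the matching foreign keys.
conn.execute("UPDATE Customers SET CustomerNum = '8800' WHERE CustomerNum = '8755'")
updated = conn.execute("SELECT DISTINCT CustomerNum FROM Orders").fetchall()

# Cascade Delete: removing the customer removes its orders, and because the
# Orders -> Order_Items relationship also cascades, the line items go too.
conn.execute("DELETE FROM Customers WHERE CustomerNum = '8800'")
remaining = conn.execute(
    "SELECT (SELECT COUNT(*) FROM Orders), (SELECT COUNT(*) FROM Order_Items)"
).fetchone()

print(updated, remaining)
```

The delete propagates through two tables only because both relationships declare it; dropping the clause on Order_Items would make the customer delete fail instead.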

Levels of Enforcement

Referential Integrity enforced at database level because it affects relationship between two tables.

Many other business rules enforced at field and table level to ensure data integrity.

Business rule implementation should be documented: how and where it is enforced in the design.

Some rules can’t be enforced at table or field level; must be enforced in the application level.

Testing of Business Rules

Always test business rule implementation
◦ What happens when rule is met?
◦ What happens when rule is violated?

Not much good as a data entry constraint if it doesn’t constrain properly

Good application or interface design will provide feedback when user violates a constraint or rule

Field Level Integrity

Constraining by use of field properties
◦ Data type: text, number, Yes/No, Date/Time
◦ Field size
◦ Formats

Entry and editing constraints
◦ Required
◦ Indexed, with or without duplicates
◦ Input masks
◦ Default value
◦ Validation Rule
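The field properties listed above are named the way desktop tools such as Access present them. As a rough sketch of the same ideas in SQL, the following maps several of them onto SQLite column constraints via Python's sqlite3 module. The Email, Active and HireDate fields, the size limit of 10, and the date pattern are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Employees (
    EmployeeID TEXT NOT NULL PRIMARY KEY
               CHECK (length(EmployeeID) <= 10),           -- field size
    LastName   TEXT NOT NULL,                              -- required
    Email      TEXT UNIQUE,                                -- indexed, no duplicates
    Active     INTEGER NOT NULL DEFAULT 1
               CHECK (Active IN (0, 1)),                   -- default value, Yes/No
    HireDate   TEXT CHECK (HireDate GLOB
               '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')  -- crude input mask
)""")

conn.execute("INSERT INTO Employees (EmployeeID, LastName) VALUES ('EN1-26', 'O''Brien')")
default_active = conn.execute("SELECT Active FROM Employees").fetchone()[0]

bad_rows = {
    "missing required field":
        "INSERT INTO Employees (EmployeeID) VALUES ('EN1-33')",
    "bad date format":
        "INSERT INTO Employees (EmployeeID, LastName, HireDate) "
        "VALUES ('EN1-35', 'Baranco', '6/1/2001')",
    "value too long":
        "INSERT INTO Employees (EmployeeID, LastName) "
        "VALUES ('EN1-0000000035', 'Baranco')",
}
rejected = []
for reason, sql in bad_rows.items():
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        rejected.append(reason)

print(default_active, rejected)
```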

Table Level Integrity

Field Comparisons
◦ Compare value in one field to value in another
◦ Comparison performed before record is saved
◦ Violations could display an error message or force constraint of available values

Validation or Lookup Tables
◦ Store generally static set of values
◦ Stored values used to populate new records to ensure accuracy of data entry
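Both table-level techniques can be sketched in SQLite through Python's sqlite3 module: a table-level CHECK compares two fields before the record is saved, and a lookup table restricts a field to a static list of values. The StartDate and EndDate fields and the sample dates are assumptions for the example; the department codes come from the equipment sample earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Departments (DeptCode TEXT NOT NULL PRIMARY KEY);  -- lookup table
INSERT INTO Departments VALUES ('IS'), ('AC'), ('TW');
CREATE TABLE Projects (
    ProjNum   TEXT NOT NULL PRIMARY KEY,
    DeptCode  TEXT NOT NULL REFERENCES Departments(DeptCode),
    StartDate TEXT NOT NULL,  -- ISO dates (YYYY-MM-DD) compare correctly as text
    EndDate   TEXT,
    CHECK (EndDate IS NULL OR EndDate >= StartDate));  -- field comparison
""")

conn.execute("INSERT INTO Projects VALUES ('30-452-T3', 'TW', '2024-01-15', '2024-06-30')")

rejected = []
for reason, sql in [
    ("end before start",
     "INSERT INTO Projects VALUES ('30-457-T3', 'TW', '2024-03-01', '2024-02-01')"),
    ("unknown department",
     "INSERT INTO Projects VALUES ('36-272-TC', 'XX', '2024-03-01', NULL)"),
]:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        rejected.append(reason)

print(rejected)
```

An application would typically also use the Departments rows to fill a drop-down list, so that invalid codes are hard to enter in the first place.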

Documentation

A good design deserves good documentation

Data Dictionary for database/table design
◦ Table and field names
◦ Table and field properties
◦ Relationships, including primary and foreign keys
◦ Indexes

Provide reasons for design features, especially if they intentionally violate normal design principles