Data Modelling and Cleaning CMPT 455/826 - Week 8, Day 2 Sept-Dec 2009 – w8d2 1


Data Modelling and Cleaning

CMPT 455/826 - Week 8, Day 2

Sept-Dec 2009 – w8d2 1


Conceptual Modelling Solutions for the Data Warehouse

Stefano Rizzi


Definition 1: Facts

• A fact is a focus of interest for the decision-making process.

• Typically, it models a set of events occurring in the enterprise world.

• A fact is graphically represented by a box with two sections: one for the fact name and one for the measures.


Guideline 1: Facts

• Concepts represented in the data source by frequently updated archives are good candidates for facts.

• Concepts represented by almost-static archives are not good candidates for facts.


Definition 2: Measure

• A measure is a numerical property of a fact and describes one of its quantitative aspects of interest for analysis.

• Measures are included in the bottom section of the fact.


Definition 3: Dimension

• A dimension is a fact property with a finite domain and describes one of its analysis coordinates.

• The set of dimensions of a fact determines its finest representation granularity.

• Graphically, dimensions are represented as circles attached to the fact by straight lines.


Guideline 2: Dimensions

• At least one of the dimensions of the fact should represent time, at any granularity.


Definition 4: Primary Event

• A primary event is an occurrence of a fact and is identified by a tuple of values, one value for each dimension.

• Each primary event is described by one value for each measure.
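
As a minimal sketch of Definitions 1 to 4, a fact can be pictured as a mapping from dimension tuples to measure values. The SALE fact, its dimensions (date, product, store), its measures (quantity, revenue) and all values below are hypothetical illustrations, not taken from the slides.

```python
from typing import NamedTuple

# Hypothetical SALE fact: every name and value here is an illustrative assumption.
class SaleCoordinates(NamedTuple):
    """One value per dimension; together they identify a primary event."""
    date: str
    product: str
    store: str

class SaleMeasures(NamedTuple):
    """One value per measure; together they describe a primary event."""
    quantity: int
    revenue: float

# The fact is pictured as the set of its primary events,
# keyed by their tuple of dimension values.
sale_fact: dict[SaleCoordinates, SaleMeasures] = {
    SaleCoordinates("2009-10-27", "milk", "Saskatoon-1"): SaleMeasures(12, 30.0),
    SaleCoordinates("2009-10-27", "bread", "Saskatoon-1"): SaleMeasures(5, 12.5),
    SaleCoordinates("2009-10-28", "milk", "Regina-1"): SaleMeasures(7, 17.5),
}

# Looking up a primary event by its coordinates yields its measure values.
event = SaleCoordinates("2009-10-27", "milk", "Saskatoon-1")
print(sale_fact[event].quantity)
```

Using the dimension tuple as the dictionary key mirrors the definition directly: two primary events cannot share the same coordinates.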


Definition 5: Dimension Attributes

• A dimension attribute is a property, with a finite domain, of a dimension.

• Like dimensions, it is represented by a circle.


Definition 6: Hierarchy

• A hierarchy is a directed tree, rooted in a dimension, whose nodes are all the dimension attributes that describe that dimension, and whose arcs model many-to-one associations between pairs of dimension attributes.

• Arcs are graphically represented by straight lines.
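
One way to picture the many-to-one arcs is as lookup tables from a child attribute value to its parent value. The sketch below follows a single path of a hypothetical store hierarchy (store, city, province); the attribute names, the values and the `roll_up` helper are assumptions made for illustration.

```python
# Hypothetical path of a store hierarchy: store -> city -> province.
# Each arc is a many-to-one association, so a dict (child value -> parent value) models it.
store_to_city = {
    "Saskatoon-1": "Saskatoon",
    "Saskatoon-2": "Saskatoon",
    "Regina-1": "Regina",
}
city_to_province = {"Saskatoon": "Saskatchewan", "Regina": "Saskatchewan"}

# Arcs leaving each attribute: attribute -> (parent attribute, value mapping).
hierarchy = {
    "store": ("city", store_to_city),
    "city": ("province", city_to_province),
}

def roll_up(value, from_attr, to_attr):
    """Follow many-to-one arcs from from_attr up to to_attr."""
    attr = from_attr
    while attr != to_attr:
        if attr not in hierarchy:
            raise ValueError(f"{to_attr!r} is not reachable from {from_attr!r}")
        parent_attr, mapping = hierarchy[attr]
        value = mapping[value]
        attr = parent_attr
    return value

print(roll_up("Saskatoon-2", "store", "province"))
```

Because every arc is many-to-one, each step of the loop has exactly one parent value to move to, which is what makes aggregation along the hierarchy well defined.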


Definition 8: Descriptive attribute

• A descriptive attribute specifies a property of a dimension attribute, to which it is related by an x-to-one association.

• Descriptive attributes are not used for aggregation; they are always leaves of their hierarchy and are graphically represented by horizontal lines.


Definition 9: Cross-dimension attributes

• A cross-dimension attribute is an attribute (either dimension or descriptive) whose value is determined by the combination of two or more dimension attributes, possibly belonging to different hierarchies.

• It is denoted by connecting the arcs that determine it with a curved line.


Definition 10: Convergence

• A convergence takes place when two dimension attributes within a hierarchy are connected by two or more alternative paths of many-to-one associations.

• Convergences are represented by letting two or more arcs converge on the same dimension attribute.


Definition 13: Ragged Hierarchy

• A ragged (or incomplete) hierarchy is a hierarchy where, for some instances, the values of one or more attributes are missing (because they are undefined or unknown).

• A ragged hierarchy is graphically denoted by marking with a dash the attributes whose values may be missing.


Definition 14: Unbalanced Hierarchy

• An unbalanced (or recursive) hierarchy is a hierarchy where, though inter-attribute relationships are consistent, the instances may have different lengths.

• Graphically, it is represented by introducing a cycle within the hierarchy.


Definition 15: Additive

• A measure is said to be additive along a dimension if its values can be aggregated along the corresponding hierarchy by the sum operator; otherwise it is called nonadditive.

• A nonadditive measure is nonaggregable if no other aggregation operator can be used on it.
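
The difference can be illustrated with two hypothetical measures aggregated along the time dimension. The measure names and values below are assumptions: quantity sold is treated as additive along time, while stock level is treated as nonadditive along time but still aggregable with another operator (the average).

```python
from statistics import mean

# Hypothetical daily values for one product in one store.
days = ["2009-10-26", "2009-10-27", "2009-10-28"]
quantity_sold = {"2009-10-26": 4, "2009-10-27": 6, "2009-10-28": 5}
stock_level = {"2009-10-26": 20, "2009-10-27": 14, "2009-10-28": 9}

# quantity_sold is additive along time: summing over days gives the total sold.
total_sold = sum(quantity_sold[d] for d in days)

# stock_level is nonadditive along time: adding the daily levels gives a number
# with no business meaning, so a different operator (here the average) is used.
average_stock = mean(stock_level[d] for d in days)

print(total_sold, average_stock)
```

A measure for which neither the sum nor any other operator gives a meaningful result along a dimension would be nonaggregable in the sense of the definition above.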


Open Issues

• Lack of a standard for conceptual models

• Need for design patterns to support modelling

• Need for a method to model security issues


Data Cleaning

(Based on Rahm)


Single source problems

• Lack of appropriate model-specific integrity constraints

– Attribute: illegal values

– Record: uniqueness violation

– Relationship: referential integrity not validated
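
When the source does not enforce such constraints, each of the three problem classes can be detected after the fact. The following is a minimal sketch over hypothetical customer and order records; the field names, the legal age range and the helper functions are all assumptions made for illustration.

```python
# Hypothetical records; field names and the legal age range are assumptions.
customers = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": -5},   # attribute: illegal value
    {"id": 2, "name": "Cy", "age": 51},    # record: uniqueness violation
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 7},    # relationship: dangling reference
]

def illegal_ages(rows, low=0, high=130):
    """Return ids of rows whose age falls outside the assumed legal range."""
    return [r["id"] for r in rows if not low <= r["age"] <= high]

def duplicate_keys(rows, key="id"):
    """Return key values that occur more than once."""
    seen, dups = set(), set()
    for r in rows:
        (dups if r[key] in seen else seen).add(r[key])
    return sorted(dups)

def dangling_references(children, parents, fk="customer_id", pk="id"):
    """Return order ids whose foreign key matches no parent key."""
    known = {p[pk] for p in parents}
    return [c["order_id"] for c in children if c[fk] not in known]

print(illegal_ages(customers), duplicate_keys(customers),
      dangling_references(orders, customers))
```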


Single source problems

• Lack of appropriate application-specific integrity constraints can lead to:

– Attribute problems: missing values, misspellings, cryptic abbreviations, embedded values, misfielded values

– Record problems: violated attribute dependencies, word transpositions, duplicated records, contradicting records

– Relationship problems: wrong references


Multi-source Problems

• In addition to single source problems, there can be:

– overlapping or contradicting data

– schema naming and structural conflicts

– different data types / granularities / interpretations / points in time


Data Analysis for cleaning

• Using metadata for data profiling

– focuses on the instance analysis of individual attributes

– derives information such as the data type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of null values, typical string pattern (e.g., for phone numbers)

– provides an exact view of various quality aspects of the attribute

• Data mining

– helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes

– focuses on so-called descriptive data mining models, including clustering, summarization, association discovery and sequence discovery
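
A minimal sketch of attribute-level profiling, covering a few of the properties listed above (null occurrences, uniqueness, length range, value frequencies and string patterns). The `profile` and `string_pattern` helpers and the sample phone numbers are hypothetical; a real profiler would also infer data types and numeric statistics such as variance.

```python
import re
from collections import Counter

def string_pattern(value):
    """Map digits to 9 and letters to A, keeping punctuation (e.g. 999-9999)."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile(values):
    """Profile one attribute given as a list of strings, with None for nulls."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "unique": len(set(present)) == len(present),
        "min_length": min(map(len, present), default=0),
        "max_length": max(map(len, present), default=0),
        "frequencies": Counter(present),
        "patterns": Counter(string_pattern(v) for v in present),
    }

# Hypothetical phone-number column with a null, a duplicate and two odd formats.
phones = ["966-4886", "966-4886", "9664887", None, "966-48A8"]
report = profile(phones)
print(report["patterns"].most_common())
```

Rare patterns in the output are candidates for inspection: here the dominant pattern suggests the expected format, and the values that deviate from it are likely misformatted or misspelled.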


Data transformations

• Can be done via SQL operations

– which allows tracking of all transformations

– can include: extracting values from free-form attributes (attribute split), validation and correction, standardization, duplicate elimination

• May require considerable human involvement

– some transformations will be more complex than others

– some transformations will apply to more or less data
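
As a rough sketch of SQL-based transformation, the script below runs three of the listed steps (attribute split, standardization, duplicate elimination) against an in-memory SQLite database through Python's standard `sqlite3` module. The table names, column names and sample rows are hypothetical, and validation and correction are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source table with a free-form 'city postal-code' attribute.
CREATE TABLE raw_customer (name TEXT, city_zip TEXT);
INSERT INTO raw_customer VALUES
  ('ann lee',  'Saskatoon S7N5C9'),
  ('Ann Lee ', 'Saskatoon S7N5C9'),
  ('BOB RAY',  'Regina S4P3Y2');

-- Standardization (trim + upper case), attribute split (city / postal code)
-- and duplicate elimination (DISTINCT) in a single statement.
CREATE TABLE clean_customer AS
SELECT DISTINCT
  upper(trim(name))                             AS name,
  substr(city_zip, 1, instr(city_zip, ' ') - 1) AS city,
  substr(city_zip, instr(city_zip, ' ') + 1)    AS postal_code
FROM raw_customer;
""")

rows = conn.execute(
    "SELECT name, city, postal_code FROM clean_customer ORDER BY name"
).fetchall()
print(rows)
```

Keeping the transformation as a stored SQL script is one way to get the tracking mentioned above: the script itself documents what was done to the data and can be rerun or reviewed.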
