Introduction to Data Quality
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Tecnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 1.
March 19, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Quality 1 / 21
Introduction
Data is an important asset for organizations.
Data warehouses and exploration tools depend on data quality.
1. DQ Problems within a Single Data Source
1.1. DQ Problems within a Single Relation
1.1.1. An Attribute Value of a Single Tuple
Missing value.
Syntax violation.
Outdated value.
Interval violation.
Set violation.
Misspelling error.
Inadequate value to the attribute context.
Value items beyond the attribute context.
Meaningless value.
Value with imprecise or doubtful meaning.
Domain constraint violation.
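A few of the single-value problems listed above can be checked mechanically. The following is a minimal sketch, not the detection method from the cited taxonomy: the attribute names, the postal-code pattern, the age interval, and the allowed set are all assumptions chosen for illustration.

```python
import re

# Hypothetical rules; every name and threshold below is an illustrative assumption.
POSTCODE_PATTERN = re.compile(r"\d{4}-\d{3}")  # assumed syntax for a postal code
ALLOWED_GENDERS = {"F", "M"}                   # assumed set of valid values
AGE_INTERVAL = (0, 120)                        # assumed valid interval


def check_value(attribute, value):
    """Return the list of problems found in one attribute value of one tuple."""
    if value is None or value == "":
        return ["missing value"]
    problems = []
    if attribute == "postcode" and not POSTCODE_PATTERN.fullmatch(value):
        problems.append("syntax violation")
    if attribute == "age" and not AGE_INTERVAL[0] <= value <= AGE_INTERVAL[1]:
        problems.append("interval violation")
    if attribute == "gender" and value not in ALLOWED_GENDERS:
        problems.append("set violation")
    return problems
```

Problems such as outdated or meaningless values usually need external knowledge and are not covered by rule checks of this kind.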
1.1.2. The Values of a Single Attribute
Uniqueness value violation.
Synonyms existence.
Domain constraint violation.
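A uniqueness value violation can be found by counting how often each value of the attribute occurs. This is a small sketch under the assumption that the column is given as a plain list and that empty entries are represented as `None`; the function name is hypothetical.

```python
from collections import Counter


def uniqueness_violations(values):
    """Return the values that occur more than once in a column meant to be unique."""
    # Missing entries are a different problem, so they are ignored here.
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)
```

Synonyms (different values with the same meaning) cannot be found this way and would typically need a dictionary or a similarity measure.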
1.1.3. The Attribute Values of a Single Tuple
Semi-empty tuple.
Inconsistency among attribute values.
Domain constraint violation.
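The two tuple-level problems above can be illustrated as follows. This is a sketch only: the 50% threshold for a semi-empty tuple and the `order_date`/`delivery_date` attributes are assumptions, not values taken from the source.

```python
from datetime import date


def is_semi_empty(row, threshold=0.5):
    """True when the share of empty attributes reaches the (assumed) threshold."""
    if not row:
        return False
    empty = sum(1 for v in row.values() if v is None or v == "")
    return empty / len(row) >= threshold


def dates_inconsistent(row):
    """True when a hypothetical order is delivered before it was placed."""
    placed, delivered = row.get("order_date"), row.get("delivery_date")
    return placed is not None and delivered is not None and delivered < placed
```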
1.1.4. The Attribute Values of Several Tuples
Redundancy about an entity.
Inconsistency about an entity.
Domain constraint violation.
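Redundancy and inconsistency about an entity both show up when several tuples describe the same entity. The sketch below makes a strong simplifying assumption: tuples that share a key value (a hypothetical `id` attribute in the usage) refer to the same entity, exact copies count as redundancy, and any differing attribute counts as inconsistency. Real data usually needs approximate matching instead.

```python
from collections import defaultdict


def classify_entities(rows, key):
    """Map each entity with several tuples to 'redundancy' or 'inconsistency'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    report = {}
    for entity, tuples in groups.items():
        if len(tuples) < 2:
            continue  # a single tuple cannot be redundant or inconsistent
        # Identical tuples collapse to a single element in this set.
        distinct = {tuple(sorted(t.items())) for t in tuples}
        report[entity] = "redundancy" if len(distinct) == 1 else "inconsistency"
    return report
```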
1.2. Relationships among Multiple Relations
Referential integrity violation.
Outdated reference.
Syntax inconsistency.
Inconsistency among related attribute values.
Circularity among tuples in a self-relationship.
Domain constraint violation.
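Two of the problems above lend themselves to a short illustration: referential integrity violations and circularity among tuples in a self-relationship. The sketch assumes rows are dictionaries and that a self-relationship (for example, an employee and their manager) is given as a child-to-parent mapping; both function names are hypothetical.

```python
def dangling_references(child_rows, fk, parent_keys):
    """Return foreign-key values in child_rows that match no parent key."""
    parents = set(parent_keys)
    return sorted({r[fk] for r in child_rows
                   if r[fk] is not None and r[fk] not in parents})


def has_cycle(parent_of):
    """Detect circularity in a self-relationship given as child -> parent."""
    for start in parent_of:
        seen = {start}
        node = parent_of.get(start)
        # Follow the chain of parents until it ends or revisits a tuple.
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = parent_of.get(node)
    return False
```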
2. Multiple Data Sources
Syntax inconsistency.
Different measure units.
Representation inconsistency.
Different aggregation levels.
Synonyms existence.
Homonyms existence.
Redundancy about an entity.
Inconsistency about an entity.
Domain constraint violation.
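Different measure units and representation inconsistencies are commonly handled by converting every source to one agreed form before integration. The sketch below assumes lengths should be stored in metres and dates in ISO 8601; the unit table and the two input date formats are illustrative assumptions.

```python
from datetime import datetime

# Assumed conversion factors to a common unit (metres).
TO_METRES = {"m": 1.0, "cm": 0.01, "in": 0.0254}

# Assumed date representations used by two hypothetical sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_metres(value, unit):
    """Convert a length in a known unit to metres."""
    return value * TO_METRES[unit]


def to_iso_date(text):
    """Parse a date in any assumed source format and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date representation: {text!r}")
```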
Data Cleaning Problems (Rahm and Do)
Phases of Data Cleaning
Data analysis: data profiling and data mining (descriptive data mining models such as clustering, summarization, association discovery, and sequence discovery).
Definition of transformation workflow and mapping rules.
Verification.
Transformation.
Backflow of cleaned data.
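As an illustration of the data analysis phase, data profiling can start from simple per-column statistics. This is a minimal sketch, assuming a column is a list with `None` for missing entries; it is not tied to any of the tools named in this presentation.

```python
def profile_column(values):
    """Compute basic profiling statistics for one column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }
```

An unexpectedly high `missing` count, or a `min`/`max` outside the expected interval, points to the kind of problems that the later transformation rules would need to address.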
Tool Support
Data analysis and reengineering tools.
Data profiling - MigrationArchitect.
Data mining - WizRule and DataMiningSuite.
Data reengineering - Integrity.
Specialized cleaning tools.
Special domain cleaning - idCentric, PureIntegrate, QuickAddress, Reunion, and Trillium.
Duplicate elimination - DataCleanser, Merge/PurgeLibrary, matchIT, and MasterMerge.
ETL (Extraction, Transformation, Loading) tools.
CopyManager, DataStage, Extract, PowerMart, DecisionBase, DataTransformationService, MetaSuite, SagentSolution, and WarehouseAdministrator.
Conclusions
Identification, classification and systematization of DQ problems.
Taxonomy using a bottom-up approach.
Definition of methods to detect DQ problems, represented as binary classification trees.
Thank you! Questions?
References
Oliveira, P., Rodrigues, F., Henriques, P., & Galhardas, H. (2005, June). A taxonomy of data quality problems. In Proc. 2nd Int. Workshop on Data and Information Quality (in conjunction with CAiSE 2005), Porto, Portugal.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Barateiro, J., & Galhardas, H. (2005). A Survey of Data Quality Tools. Datenbank-Spektrum, 14(15-21), 48.
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., & Lee, D. (2003). A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery, 7, 81-99.
Muller, H., & Freytag, J.-C. (2003). Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University, Berlin.