1SAD Tagus
AJAX: Model, Declarative Language, and Algorithms
Helena Galhardas
2SAD Tagus
Plan
• Context
• Problem statement
• Contributions
• Our data cleaning solution
• Validation
• Related solutions
• Conclusions
3SAD Tagus
Application context
– Eliminate errors and duplicates within a single source
– Integrate data from different sources
– Migrate poorly structured data into structured data
4SAD Tagus
Typical architecture
[Figure: components between SOURCE DATA and TARGET DATA: Data Extraction, Data Transformation, Data Loading, Data Analysis, Schema Integration, Metadata, Dictionaries, Human Knowledge]
5SAD Tagus
Data cleaning
Activity of transforming source data into target data without errors, duplicates, and inconsistencies
6SAD Tagus
Motivating example (1)
DirtyData(paper:String)
Data Cleaning
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
7SAD Tagus
Motivating example (2)
DirtyData
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95
Data Cleaning
Events
PDIS | Conference on Parallel and Distributed Information Systems
Publications
QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors
DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
...
PubsAuthors
QGMW96 | DQua
QGMW96 | AGup
...
8SAD Tagus
Plan
• Context
• Problem statement
• Contributions
• Our data cleaning solution
• Validation
• Related solutions
• Conclusions
9SAD Tagus
Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Figure: example transformation graph. Data nodes: DirtyData, DirtyAuthors, DirtyTitles, ..., DirtyEvents, Cities, Tags, Authors. Transformations: Extraction, Standardization, Formatting, Duplicate Elimination]
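The graph model can be sketched in a few lines of Python. This is only an illustration, not AJAX's representation: the edges below are assumed (the slide does not list them), and the standard-library `graphlib` module is used to derive one valid execution order and to reject cyclic specifications.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each table lists the tables it is derived from.
# The edges are assumed for illustration; the slide does not spell them out.
derived_from = {
    "DirtyAuthors": {"DirtyData"},
    "DirtyTitles": {"DirtyData"},
    "DirtyEvents": {"DirtyData"},
    "Authors": {"DirtyAuthors"},
}


def execution_order(graph):
    """Return one valid order in which the tables can be produced."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(derived_from)
print(order)
```

Any order returned places DirtyData before the tables derived from it, which is the property an execution engine needs from an acyclic specification.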
10SAD Tagus
Existing technology
• Ad-hoc code
– difficult to maintain
• Extraction Transformation Loading (ETI, Informatica, Sagent)
– limited cleaning functionality
• Data Reengineering (Integrity)
– fixed implementation for certain operators
• Specific-domain cleaning (idCentric, PureIntegrate)
– names and addresses
• Duplicate elimination (DataCleanser, matchIt)
– finds/eliminates duplicates
11SAD Tagus
Problems of existing solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Figure: application domains (App. Domain 1, App. Domain 2, App. Domain 3, ...) and data cleaning transformations]
12SAD Tagus
Problems of existing solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Figure: Dirty data → Cleaning process → Clean data and Rejected data]
13SAD Tagus
AJAX
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application program
14SAD Tagus
Data cleaning framework
• Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation
• Physical level: set of algorithms that implement the logical operations
15SAD Tagus
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
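Two of these operators can be approximated in Python to show what "one-to-many mapping" and "group-by with user-defined aggregates" mean in practice. This is a minimal sketch, not the AJAX implementation: `map_op`, `merge_op`, `split_authors`, and `longest_name` are hypothetical names, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def map_op(rows, fn):
    """Map: apply a user-defined function that may emit zero, one, or many rows per input row."""
    for row in rows:
        yield from fn(row)


def merge_op(rows, key, aggregate):
    """Merge: group rows by key, then reduce each group with a user-defined aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: aggregate(group) for k, group in groups.items()}


# Hypothetical user-defined function: one output row per author of a paper.
def split_authors(row):
    pub_key, authors = row
    for name in authors.split(","):
        yield (pub_key, name.strip())


# Hypothetical aggregate: keep the longest name in a group of duplicates.
def longest_name(group):
    return max((name for _, name in group), key=len)


papers = [("QGMW96", "D. Quass, A. Gupta")]
author_rows = list(map_op(papers, split_authors))

clustered = [(1, "jc freytag"), (1, "johann christoph freytag"), (2, "a gupta")]
merged = merge_op(clustered, key=lambda r: r[0], aggregate=longest_name)
```

Passing the function as a parameter mirrors the "parametric" nature of the operators: the operator fixes the iteration pattern, the user supplies the domain logic.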
16SAD Tagus
Logical level
[Figure: transformation graph over DirtyData, DirtyAuthors, DirtyTitles, ..., Cities, Tags, and Authors, with the transformations Extraction, Standardization, Formatting, and Duplicate Elimination]
17SAD Tagus
Logical level
[Figure: the transformation graph with each transformation expressed through logical operators (Map, Match, Cluster, Merge)]
Physical level
[Figure: the transformation graph with operators bound to physical algorithms (SQL Scan, Java Scan, NL, TC)]
18SAD Tagus
Contributions
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application program
19SAD Tagus
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
20SAD Tagus
Example
[Figure: Duplicate Elimination expressed with Match, Cluster, and Merge, taking DirtyAuthors to Authors through the intermediate relation MatchAuthors]
21SAD Tagus
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
22SAD Tagus
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input:
DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag
Output:
MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
23SAD Tagus
Implementation of the match operator
s1 ∈ S1, s2 ∈ S2
(s1, s2) is a match if editDistance(s1, s2) < maxDist
24SAD Tagus
Nested loop
[Figure: each tuple of S1 compared with each tuple of S2 using editDistance]
• Very expensive evaluation when handling large amounts of data
• Need alternative execution algorithms for the same logical specification
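The cost of the nested loop is easiest to see in code. The sketch below is an illustration under assumptions, not the AJAX implementation: `edit_distance` is a plain Levenshtein distance, the `key1 < key2` test for the self-join is an added convention, and the sample names and keys are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nested_loop_match(s1, s2, max_dist):
    """Visit every pair of rows and keep those with distance < max_dist."""
    matches = []
    for key1, name1 in s1:
        for key2, name2 in s2:
            # Assumption for the self-join: skip identical and mirrored pairs.
            if key1 < key2:
                d = edit_distance(name1, name2)
                if d < max_dist:
                    matches.append((key1, key2, d))
    return matches


authors = [(1, "john smith"), (2, "john smit"), (3, "jogn smith"), (4, "john smithe")]
pairs = nested_loop_match(authors, authors, max_dist=2)
```

The number of distance computations grows with the product of the input sizes, and each computation is itself quadratic in the string lengths, which is why the evaluation becomes expensive on large inputs.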
25SAD Tagus
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
No optimization supported for a Cartesian product with external function calls
26SAD Tagus
Window scanning
[Figure: a window of size n sliding over S]
May lose some matches
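A small sketch makes the trade-off concrete. It assumes the relation is first sorted on a key (here the name itself) and that a window of size n means each row is compared with the n - 1 rows that follow it; both are assumptions, as are the function names. To keep the example short, `same_surname` stands in for a real distance test, and key 900 is a made-up value.

```python
def window_scan_match(rows, n, is_match, sort_key=lambda row: row[1]):
    """Sort rows, then compare each row only with the n - 1 rows that follow it."""
    ordered = sorted(rows, key=sort_key)
    matches = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + n]:
            if is_match(left, right):
                matches.append((left[0], right[0]))
    return matches


# Hypothetical stand-in for a distance test: same last token (surname).
def same_surname(left, right):
    return left[1].split()[-1] == right[1].split()[-1]


authors = [
    (861, "johann christoph freytag"),
    (822, "jc freytag"),
    (819, "j freytag"),
    (900, "jennifer widom"),
]
narrow = window_scan_match(authors, n=2, is_match=same_surname)
wide = window_scan_match(authors, n=4, is_match=same_surname)
```

With n=2 the scan finds only the pair (819, 822): "jennifer widom" sorts between "jc freytag" and "johann christoph freytag", so the other two Freytag pairs never share a window. With n=4 all three pairs are found, at the cost of more comparisons.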
29SAD Tagus
String distance filtering
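[Figure: strings of S1 and S2 arranged by length, with maxDist = 1; editDistance is applied to pairs taken from the length groups]
John Smith: length
John Smit: length - 1
Jogn Smith: length
John Smithe: length + 1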
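The filter relies on a known property of edit distance: the difference in length between two strings is a lower bound on their distance, so a cheap length comparison can rule out pairs before the expensive function is called. The sketch below is an illustration only. It uses the strict predicate `distance < max_dist` from the earlier example with `max_dist=2`, a simple recursive `edit_distance` suited to short strings, and an extra string ("Jonathan Smithers") that is not on the slide.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Recursive Levenshtein distance; adequate for short strings only."""
    if not a or not b:
        return len(a) + len(b)
    return min(edit_distance(a[1:], b) + 1,
               edit_distance(a, b[1:]) + 1,
               edit_distance(a[1:], b[1:]) + (a[0] != b[0]))


def filtered_match(s1, s2, max_dist):
    """Skip pairs whose length gap already rules out edit distance < max_dist."""
    matches, expensive_calls = [], 0
    for a in s1:
        for b in s2:
            # abs(len(a) - len(b)) is a lower bound on the edit distance.
            if abs(len(a) - len(b)) >= max_dist:
                continue
            expensive_calls += 1
            if edit_distance(a, b) < max_dist:
                matches.append((a, b))
    return matches, expensive_calls


s1 = ["John Smith"]
s2 = ["John Smit", "Jogn Smith", "John Smithe", "Jonathan Smithers"]
matches, calls = filtered_match(s1, s2, max_dist=2)
```

Here three of the four candidate pairs reach `edit_distance`; the fourth is discarded on length alone. The filter never drops a true match, so the result is the same as that of the plain nested loop.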
30SAD Tagus
Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors
31SAD Tagus
Contributions
• An extensible data cleaning framework
• A declarative language for logical operators
• Efficient implementation of the match operator
• A debugger facility for tuning a data cleaning application program
32SAD Tagus
Management of exceptions
• Problem: to mark tuples not handled by the cleaning criteria of an operator
• Solution: to specify the generation of exceptional tuples within a logical operator
– exceptions are thrown by external functions
– output constraints are violated
33SAD Tagus
Example (1)
CREATE MAP ExtractionCities
FROM StandardizedDirtyData dd
LET city = extractCities(dd.paper, Cities),
{ SELECT dd.paperKey AS pubKey, city AS city
  INTO ExtractedCities
  CONSTRAINT NOT NULL city }
[Figure: the Map operator ExtractionCities (Extraction) reads StandardizedDirtyData(pubKey, paper) and Cities and writes ExtractedCities(pubKey, city)]
34SAD Tagus
Example(2)
[Figure: ExtractionCities reads StandardizedDirtyData and Cities and produces ExtractedCities and StandardizedDirtyDataexc]
StandardizedDirtyData
4 | y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park
StandardizedDirtyDataexc
4 | ManyDifferentCities
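The two sources of exceptional tuples can be mimicked in Python. This is a sketch under assumptions rather than AJAX's mechanism: `extract_city` is a hypothetical stand-in for the external function, the substring test is deliberately naive, and rows 7 and 9 are invented to exercise the normal path and the NOT NULL constraint.

```python
def extract_city(paper, cities):
    """Hypothetical external function: return the single known city named in the text."""
    found = [city for city in cities if city in paper]
    if len(found) > 1:
        raise ValueError("ManyDifferentCities")
    return found[0] if found else None


def extraction_cities(rows, cities):
    """Map with exception handling: failed rows are set aside instead of aborting the run."""
    extracted, exceptions = [], []
    for pub_key, paper in rows:
        try:
            city = extract_city(paper, cities)
        except ValueError as err:
            # Case 1: the external function threw an exception.
            exceptions.append((pub_key, str(err)))
            continue
        if city is None:
            # Case 2: the output constraint NOT NULL city is violated.
            exceptions.append((pub_key, "NOT NULL city"))
            continue
        extracted.append((pub_key, city))
    return extracted, exceptions


cities = ["madison", "college park", "miami beach"]
rows = [
    (4, "parametric query optimization technical report univ of wisconsin "
        "madison and univ of maryland college park"),
    (7, "making views self maintainable for data warehousing miami beach"),
    (9, "a paper with no recognizable city"),
]
extracted, exceptions = extraction_cities(rows, cities)
```

Tuple 4 ends up in the exception output with the marker ManyDifferentCities, as on the slide, while the remaining rows continue through the operator.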
35SAD Tagus
Debugger facility
• Supports backward and forward data derivation of tuples with respect to an operator, to debug exceptions
• Supports interactive data modification and incremental execution of some logical operators
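One way to picture data derivation is as a lineage record kept alongside the operators: each output tuple remembers which input tuples produced it, and the debugger walks those links in either direction. The sketch below is a hypothetical illustration, not the AJAX debugger. The `Lineage` class is invented, tuples are identified by (relation, key) pairs, and the recorded edges are assumed.

```python
from collections import defaultdict


class Lineage:
    """Records which input tuples produced which output tuples."""

    def __init__(self):
        self.parents = defaultdict(set)   # output tuple -> input tuples
        self.children = defaultdict(set)  # input tuple -> output tuples

    def record(self, source, target):
        self.parents[target].add(source)
        self.children[source].add(target)

    @staticmethod
    def _walk(start, edges):
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward(self, item):
        """All tuples the given tuple was derived from, directly or indirectly."""
        return self._walk(item, self.parents)

    def forward(self, item):
        """All tuples derived from the given tuple, directly or indirectly."""
        return self._walk(item, self.children)


lineage = Lineage()
# Assumed edges: tuples are identified as (relation, key).
lineage.record(("KeyDirtyData", 4), ("StandardizedDirtyData", 4))
lineage.record(("StandardizedDirtyData", 4), ("StandardizedDirtyDataexc", 4))
lineage.record(("KeyDirtyData", 4), ("DirtyEvents", 4))

origin = lineage.backward(("StandardizedDirtyDataexc", 4))
affected = lineage.forward(("KeyDirtyData", 4))
```

Backward derivation from the exception tuple leads to the source record that needs correcting; forward derivation from that record lists the tuples to recompute after the correction.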
36SAD Tagus
Backward/forward data derivation
[Figure: operator graph with nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, Cities, ExtractionCities, ExtractedCities, and StandardizedDirtyDataexc; arrows labeled Backward Derivation and Forward Derivation]
Tuples shown:
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992
4 | ManyDifferentCities
4 | Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland
37SAD Tagus
Interactive data correction (1)
[Figure: operator graph with nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, Cities, ExtractionCities, ExtractedCities, and StandardizedDirtyDataexc]
Tuples shown:
4 | ManyDifferentCities
4 | Technical Report, Univ. Of Wisconsin and Univ. Of Maryland
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
38SAD Tagus
Interactive data correction (2)
[Figure: operator graph with nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, Cities, ExtractionCities, and ExtractedCities; four operators are marked "incremental"]
Tuples shown:
4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992
101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
4 | Technical Report, Univ. Of Wisconsin
101 | Technical Report, Univ. Of Maryland
4 | Madison
101 | College Park
39SAD Tagus
AJAX Architecture
40SAD Tagus
AJAX Demo