Extract, Transform, Load (ETL)
H. Galhardas, SAD 2007/08
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
DW Phases
• Design phase
  – Modeling, DB design, source selection, …
• Loading phase
  – First load/population of the DW
  – Based on all data in sources
• Refreshment phase
  – Keep the DW up-to-date wrt. source data changes
The ETL Process
• The most underestimated process in DW development
• The most time-consuming process in DW development
  – Often, 80% of development time is spent on ETL
• Extract
  – Extract relevant data
• Transform
  – Transform data to DW format
  – Build keys, etc.
  – Cleansing of data
• Load
  – Load data into DW
  – Build aggregates, etc.
DW Architecture
[Architecture diagram: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed through a data staging area, coordinated by a monitor & integrator with metadata; data storage comprises the data warehouse and data marts, served by an OLAP server (OLAP engine) to front-end tools for analysis, query, reports, and data mining]
ETL Process
[Diagram: operational DBs and other sources pass through the data staging area (Extract, Transform, Load, Refresh) into the data warehouse]
Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services
Data Staging Area
• Transit storage for data underway in the ETL process
  – Transformations/cleansing done here
• No user queries (some do it)
• Sequential operations (few) on large data volumes
  – Performed by central ETL logic
  – Easily restarted
  – No need for locking, logging, etc.
  – RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions copied from DSA to relevant marts
ETL Construction Process
Plan
1) Make a high-level diagram of source-destination flow
2) Test, choose, and implement an ETL tool
3) Outline complex transformations, key generation, and job sequence for every destination table
Construction of dimensions
4) Construct and test the static dimension build
5) Construct and test change mechanisms for one dimension
6) Construct and test the remaining dimension builds
Construction of fact tables and automation
7) Construct and test the initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation
Building Dimensions
• Static dimension table
  – Assignment of keys: map production keys to DW keys using a table
  – Combination of data sources: find a common key?
• Handling dimension changes
  – Slowly changing dimensions
  – Find the newest DW key for a given production key
  – The table mapping production keys to DW keys must be updated
• Load of dimensions
  – Small dimensions: replace
  – Large dimensions: load only changes
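As a sketch of the key-assignment step described above (all names are hypothetical, not from any specific ETL tool), the production-key-to-DW-key mapping can be kept as a simple lookup table that hands out the next surrogate key whenever an unseen production key arrives:

```python
# Sketch of surrogate key assignment for a dimension (hypothetical
# names): production keys from the source are mapped to DW surrogate
# keys via a lookup table; unseen production keys get a new key.

class KeyMap:
    def __init__(self):
        self.prod_to_dw = {}   # production key -> DW surrogate key
        self.next_key = 1

    def dw_key(self, prod_key):
        """Return the DW key for a production key, assigning one if new."""
        if prod_key not in self.prod_to_dw:
            self.prod_to_dw[prod_key] = self.next_key
            self.next_key += 1
        return self.prod_to_dw[prod_key]
```

In a real DW the mapping table would live in the staging area and be persisted between loads, so that "find the newest DW key for a given production key" stays cheap.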
Building Fact Tables
Two types of load:
• Initial load
  – ETL for all data up till now
  – Done when the DW is started for the first time
  – Often problematic to get correct historical data
  – Very heavy: large data volumes
• Incremental update
  – Move only changes since the last load
  – Done periodically (month/week/day/hour/...) after DW start
  – Less heavy: smaller data volumes
• Dimensions must be updated before facts
  – The relevant dimension rows for new facts must be in place
  – Special key considerations if the initial load must be performed again
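Since dimension rows must exist before the facts that reference them, an incremental fact load can be sketched as a lookup of each production key in the dimension mappings, with facts referencing missing dimension members set aside rather than loaded (all names are illustrative):

```python
# Sketch: an incremental fact load that replaces production keys with
# DW surrogate keys; facts referencing unknown dimension members are
# routed to a reject set instead of the fact table (hypothetical names).

def load_facts(fact_rows, cust_keys, prod_keys):
    """fact_rows: (customer_prod_key, product_prod_key, measure) tuples.
    cust_keys / prod_keys: dicts mapping production keys to DW keys."""
    loaded, rejected = [], []
    for cust, prod, measure in fact_rows:
        if cust in cust_keys and prod in prod_keys:
            loaded.append((cust_keys[cust], prod_keys[prod], measure))
        else:
            rejected.append((cust, prod, measure))
    return loaded, rejected
```

Rejected rows can then trigger a dimension update and be retried, which is one way to enforce "dimensions before facts" operationally.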
Types of Data Sources
Non-cooperative sources
• Snapshot sources – provide only a full copy of the source
• Specific sources – each is different, e.g., legacy systems
• Logged sources – write a change log (DB log)
• Queryable sources – provide a query interface, e.g., SQL
Cooperative sources
• Replicated sources – publish/subscribe mechanism
• Call-back sources – call external code (ETL) when changes occur
• Internal action sources – only internal actions when changes occur (DB triggers are an example)
The extract strategy is very dependent on the source types
Extract Phase
Goal: fast extraction of relevant data
• Extraction from source systems can take a long time
Types of extracts:
• Extract applications (SQL): co-existence with other applications
• DB unload tools: much faster than SQL-based extracts
• Extract applications are sometimes the only solution
Often too time-consuming to ETL everything:
• Extracts can take days/weeks
• Drain on the operational systems
• Drain on DW systems
=> Extract/ETL only changes since the last load (the delta)
Computing Deltas
• Much faster to "ETL" only changes since the last load
• A number of methods can be used
• Store sorted total extracts in the DSA
  – Delta can easily be computed from current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
• Put an update timestamp on all rows
  – Updated by DB trigger
  – Extract only where "timestamp > time of last extract"
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - Source system must be changed
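The first method, computing the delta from the current and last sorted extracts, can be sketched as a single merge pass over both files (record layout and names assumed for illustration):

```python
# Sketch (assumed record layout): compute the delta between the
# previous and current full extracts, both sorted by a record key.
# A single merge pass yields inserts, deletes, and updates, which is
# why this method handles deletions but does not shorten extraction.

def compute_delta(last, current):
    """last / current: lists of (key, row) tuples, sorted by key."""
    inserts, deletes, updates = [], [], []
    i = j = 0
    while i < len(last) and j < len(current):
        k_old, r_old = last[i]
        k_new, r_new = current[j]
        if k_old == k_new:
            if r_old != r_new:
                updates.append((k_new, r_new))
            i += 1; j += 1
        elif k_old < k_new:          # key vanished -> deletion
            deletes.append(k_old); i += 1
        else:                        # new key -> insertion
            inserts.append((k_new, r_new)); j += 1
    deletes.extend(k for k, _ in last[i:])
    inserts.extend(current[j:])
    return inserts, deletes, updates
```
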
Load (1)
Goal: fast loading into the DW
• Loading deltas is much faster than a total load
• SQL-based update is slow
  – Large overhead (optimization, locking, etc.) for every SQL call
  – DB load tools are much faster
  – Some load tools can also perform UPDATEs
• Indexes on tables slow the load a lot
  – Drop indexes and rebuild after the load
  – Can be done per partition
• Parallelization
  – Dimensions can be loaded concurrently
  – Fact tables can be loaded concurrently
  – Partitions can be loaded concurrently
Load (2)
• Relationships in the data
  – Referential integrity must be ensured
  – Can be done by the loader
• Aggregates
  – Must be built and loaded at the same time as the detail data
  – Today, RDBMSs can often do this
• Load tuning
  – Load without log
  – Sort the load file first
  – Make only simple transformations in the loader
  – Use loader facilities for building aggregates
  – Use the loader within the same database
• Should the DW be on-line 24×7?
  – Use partitions or several sets of tables
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
Data Cleaning
The activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e.,
cleaning and transforming to get…
high-quality data!
Why Data Cleaning and Transformation?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=""
• noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  – e.g., Age="42" but Birthday="03/07/1997"
  – e.g., was rating "1,2,3", now rating "A, B, C"
  – e.g., discrepancy between duplicate records
Why Is Data Dirty?
• Incomplete data comes from:
  – data values not available when collected
  – different criteria between the time the data was collected and the time it is analyzed
  – human/hardware/software problems
• Noisy data comes from:
  – data collection: faulty instruments
  – data entry: human or computer errors
  – data transmission
• Inconsistent (and redundant) data comes from:
  – different data sources, hence non-uniform naming conventions/data codes
  – functional dependency and/or referential integrity violations
Why Is Data Cleaning Important?
• A data warehouse needs consistent integration of quality data
  – Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• No quality data, no quality decisions!
  – Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Types of Data Cleaning
• Conversion, parsing, and normalization
  – Text coding, date formats, etc.
  – Most common type of cleansing
• Special-purpose cleansing
  – Normalize spellings of names, addresses, etc.
  – Remove duplicates, e.g., duplicate customers
• Domain-independent cleansing
  – Approximate, "fuzzy" joins on not-quite-matching keys
Data Quality vs. Cleaning
• Data quality = data cleaning +
  – Data enrichment
    • enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
  – Data profiling
    • analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  – Data monitoring
    • deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
• Data stewards responsible for data quality
• DW-controlled improvement
• Source-controlled improvement
• Construct programs to check data quality
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
ETL Tools
ETL tools from the big vendors, e.g.,
• Oracle Warehouse Builder
• IBM DB2 Warehouse Manager
• Microsoft Integration Services
Offer much functionality at a reasonable price (included…)
• Data modeling
• ETL code generation
• Scheduling DW jobs
• …
Many others
• Hundreds of tools
• Often specialized tools for certain jobs (insurance cleansing, …)
The "best" tool does not exist
• Choose based on your own needs
• Check first if the "standard tools" from the big vendors are OK
ETL and Data Quality Tools
• http://www.etltool.com/
• Magic Quadrant for Data Quality Tools, 2007
• Some open-source ETL tools:
  – Talend
  – Enhydra Octopus
  – Clover.ETL
• Not so many open-source quality/cleaning tools
Application Context
• Integrate data from different sources
  – E.g., populating a DW from different operational data stores
• Eliminate errors and duplicates within a single source
  – E.g., duplicates in a file of customers
• Migrate data from a source schema into a different fixed target schema
  – E.g., discontinued application packages
• Convert poorly structured data into structured data
  – E.g., processing data collected from the Web
The AJAX Data Transformation and Cleaning Framework
Motivating Example (1)
Input: DirtyData(paper: String)
↓ Data Cleaning & Transformation ↓
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
Motivating Example (2)
Input (DirtyData):
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS'95
After data cleaning & transformation:
Events: PDIS | Conference on Parallel and Distributed Information Systems
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; …
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; …
Modeling a Data Cleaning Process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Diagram: Extraction splits DirtyData into DirtyAuthors, DirtyTitles, ..., DirtyEvents; Standardization (using auxiliary data such as CitiesTags), Duplicate Elimination, and Formatting then produce clean tables such as Authors]
Existing Technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  – Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool
Problems of ETL and Data Quality Solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Diagram: the same data cleaning transformations are re-implemented separately for application domains 1, 2, 3, ...]
Problems of ETL and Data Quality Solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Diagram: dirty data enters the cleaning process, which outputs clean data and rejected data]
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Logical Level: Parametric Operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
Logical Level
[Diagram: the cleaning DAG from the motivating example – Extraction, Standardization, Duplicate Elimination, and Formatting over DirtyData, DirtyAuthors, DirtyTitles, ..., and CitiesTags, producing Authors]
Logical Level vs. Physical Level
[Diagram, logical level: the cleaning DAG expressed with logical operators – Extraction and Standardization as Map, Duplicate Elimination as Match, Cluster, and Merge, Formatting as Map – over DirtyData, DirtyAuthors, DirtyTitles, ..., and CitiesTags, producing Authors]
[Diagram, physical level: the same graph with physical algorithms chosen for each operator – SQL Scan, Java Scan, NL (nested loop) for Match, and TC (transitive closure) for Cluster]
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Match
• Input: 2 relations
• Finds data records that correspond to the same real-world object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples, and possibly 1 or 2 relations containing non-matching tuples
Example
[Diagram: within Duplicate Elimination, DirtyAuthors feeds Match, producing MatchAuthors, followed by Cluster and Merge to produce Authors]
Example

CREATE MATCH MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    distance = editDistance(da1.name, da2.name)
WHERE  distance < maxDist
INTO   MatchAuthors

[Diagram: within Duplicate Elimination, DirtyAuthors feeds Match, producing MatchAuthors, followed by Cluster and Merge to produce Authors]
Example

CREATE MATCH MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    distance = editDistance(da1.name, da2.name)
WHERE  distance < maxDist
INTO   MatchAuthors

Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
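A possible next step, Cluster, can group the matched pairs by transitive closure (the declarative specification later in these slides names a TransitiveClosureSourceClustering algorithm); a union-find sketch of that idea, with illustrative names:

```python
# Sketch: clustering matched pairs by transitive closure using
# union-find -- records connected by a chain of matches end up in the
# same cluster, ready for a Merge step to pick one representative.

def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # union every matched pair
        parent[find(a)] = find(b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

On the example output above, the pairs (861, 822) and (822, 814) chain together into one cluster even though 861 and 814 were never directly matched.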
Implementation of the Match Operator
∀ s1 ∈ S1, s2 ∈ S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist
Nested Loop
[Diagram: every s1 in S1 is compared with every s2 in S2 via editDistance]
• Very expensive evaluation when handling large amounts of data
⇒ Need alternative execution algorithms for the same logical specification
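A minimal sketch of this nested-loop evaluation, with a textbook dynamic-programming edit distance standing in for the editDistance function (helper names are illustrative, not AJAX's API):

```python
# Sketch of the naive nested-loop match: every pair from S1 x S2 is
# scored with an edit distance and kept when the distance is below
# maxDist -- O(|S1| * |S2|) distance computations.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def nested_loop_match(s1, s2, max_dist):
    return [(a, b) for a in s1 for b in s2
            if edit_distance(a, b) < max_dist]
```

The quadratic number of (itself quadratic-cost) distance calls is exactly why the following slides look for cheaper execution strategies.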
A Database Solution

CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

• No optimization supported for a Cartesian product with external function calls
Window Scanning
[Diagram, shown in three animation steps: a window of n records slides over the sorted relation S; only records within the same window are compared]
• May lose some matches
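A sketch of window scanning, assuming the records can be sorted on some key so that likely matches land near each other; the distance function is passed in, and all names are illustrative:

```python
# Sketch of window scanning (sorted-neighborhood style): sort the
# records, then compare each record only with the next n-1 records.
# Cost drops from O(N^2) to O(N * n) distance calls, at the price of
# possibly missing matches whose records sort far apart.

def window_scan_match(records, n, distance, max_dist):
    recs = sorted(records)
    matches = []
    for i, a in enumerate(recs):
        for b in recs[i + 1:i + n]:          # window of size n
            if distance(a, b) < max_dist:
                matches.append((a, b))
    return matches
```
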
String Distance Filtering
[Diagram: with maxDist = 1, each string in S1 is mapped to its length and compared via editDistance only against strings in S2 of length - 1, length, and length + 1; e.g., "John Smith" against "John Smit", "Jogn Smith", and "John Smithe"]
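A sketch of the filtering idea: an edit distance below maxDist implies the strings' lengths differ by less than maxDist, so records can be bucketed by length and editDistance called only across nearby buckets. Names are illustrative, and the length window below is the conservative ±maxDist from the diagram, so no true matches are lost:

```python
# Sketch of string-distance filtering: bucket S2 by string length and
# call the (expensive) distance function only for strings whose length
# is within max_dist of the probe's length -- a safe pre-filter for
# edit distance, since |len(a) - len(b)| <= editDistance(a, b).

from collections import defaultdict

def filtered_match(s1, s2, max_dist, distance):
    by_len = defaultdict(list)
    for b in s2:
        by_len[len(b)].append(b)
    matches = []
    for a in s1:
        # only lengths within +/- max_dist of len(a) can qualify
        for l in range(len(a) - max_dist, len(a) + max_dist + 1):
            for b in by_len.get(l, []):
                if distance(a, b) < max_dist:
                    matches.append((a, b))
    return matches
```
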
Annotation-Based Optimization
• The user specifies the type of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    dist = editDistance(da1.name, da2.name)
WHERE  dist < maxDist
       % distance-filtering: map = length; dist = abs %
INTO   MatchAuthors
Declarative Specification

DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING(600)

DEFINE ALGORITHMS AS
  TransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData(paper STRING (400));
  TABLE City(city STRING (80), citysyn STRING (80)) KEY city, citysyn;

DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper}
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Management of Exceptions
• Problem: how to mark tuples not handled by the cleaning criteria of an operator
• Solution: specify the generation of exception tuples within a logical operator, when
  – exceptions are thrown by external functions
  – output constraints are violated
Debugger Facility
• Supports the (backward and forward) data derivation of tuples wrt. an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators
Architecture
To see it working...
Workshop-UQ? Tomorrow at Tagus Park, 10h-16h