Extract, Transform, Load (ETL)
H. Galhardas, SAD 2007/08
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
DW Phases
• Design phase
  – Modeling, DB design, source selection, …
• Loading phase
  – First load/population of the DW
  – Based on all data in sources
• Refreshment phase
  – Keep the DW up-to-date wrt. source data changes
The ETL Process
• The most underestimated process in DW development
• The most time-consuming process in DW development
  – Often, 80% of development time is spent on ETL
• Extract
  – Extract relevant data
• Transform
  – Transform data to DW format
  – Build keys, etc.
  – Cleansing of data
• Load
  – Load data into DW
  – Build aggregates, etc.
DW Architecture
[Architecture diagram: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed through a data staging area, coordinated by a monitor & integrator with metadata; data storage comprises the data warehouse and data marts, served by an OLAP server (OLAP engine) to front-end tools for analysis, query, reports, and data mining]
ETL Process
[Diagram: operational DBs and other sources pass through the data staging area (Extract, Transform, Load, Refresh) into the data warehouse]
Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services
Data Staging Area
• Transit storage for data underway in the ETL process
  – Transformations/cleansing done here
• No user queries (some do it)
• Sequential operations (few) on large data volumes
  – Performed by central ETL logic
  – Easily restarted
  – No need for locking, logging, etc.
  – RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions copied from DSA to relevant marts
ETL Construction Process
Plan
1) Make a high-level diagram of source-destination flow
2) Test, choose, and implement an ETL tool
3) Outline complex transformations, key generation, and job sequence for every destination table
Construction of dimensions
4) Construct and test the static dimension build
5) Construct and test change mechanisms for one dimension
6) Construct and test the remaining dimension builds
Construction of fact tables and automation
7) Construct and test the initial fact table build
8) Construct and test incremental update
9) Construct and test aggregate build
10) Design, construct, and test ETL automation
Building Dimensions
• Static dimension table
  – Assignment of keys: map production keys to DW keys using a table
  – Combination of data sources: find a common key?
• Handling dimension changes
  – Slowly changing dimensions
  – Find the newest DW key for a given production key
  – The table mapping production keys to DW keys must be updated
• Load of dimensions
  – Small dimensions: replace
  – Large dimensions: load only changes
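As a sketch of the key-assignment step described above (all names are hypothetical, not from any specific ETL tool), the production-key-to-DW-key mapping can be kept as a simple lookup table that hands out the next surrogate key whenever an unseen production key arrives:

```python
# Sketch of surrogate key assignment for a dimension (hypothetical
# names): production keys from the source are mapped to DW surrogate
# keys via a lookup table; unseen production keys get a new key.

class KeyMap:
    def __init__(self):
        self.prod_to_dw = {}   # production key -> DW surrogate key
        self.next_key = 1

    def dw_key(self, prod_key):
        """Return the DW key for a production key, assigning one if new."""
        if prod_key not in self.prod_to_dw:
            self.prod_to_dw[prod_key] = self.next_key
            self.next_key += 1
        return self.prod_to_dw[prod_key]
```

In a real DW the mapping table would live in the staging area and be persisted between loads, so that "find the newest DW key for a given production key" stays cheap.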
Building Fact Tables
Two types of load:
• Initial load
  – ETL for all data up till now
  – Done when the DW is started for the first time
  – Often problematic to get correct historical data
  – Very heavy: large data volumes
• Incremental update
  – Move only changes since the last load
  – Done periodically (month/week/day/hour/...) after DW start
  – Less heavy: smaller data volumes
• Dimensions must be updated before facts
  – The relevant dimension rows for new facts must be in place
  – Special key considerations if the initial load must be performed again
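Since dimension rows must exist before the facts that reference them, an incremental fact load can be sketched as a lookup of each production key in the dimension mappings, with facts referencing missing dimension members set aside rather than loaded (all names are illustrative):

```python
# Sketch: an incremental fact load that replaces production keys with
# DW surrogate keys; facts referencing unknown dimension members are
# routed to a reject set instead of the fact table (hypothetical names).

def load_facts(fact_rows, cust_keys, prod_keys):
    """fact_rows: (customer_prod_key, product_prod_key, measure) tuples.
    cust_keys / prod_keys: dicts mapping production keys to DW keys."""
    loaded, rejected = [], []
    for cust, prod, measure in fact_rows:
        if cust in cust_keys and prod in prod_keys:
            loaded.append((cust_keys[cust], prod_keys[prod], measure))
        else:
            rejected.append((cust, prod, measure))
    return loaded, rejected
```

Rejected rows can then trigger a dimension update and be retried, which is one way to enforce "dimensions before facts" operationally.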
Types of Data Sources
Non-cooperative sources
• Snapshot sources – provide only a full copy of the source
• Specific sources – each is different, e.g., legacy systems
• Logged sources – write a change log (DB log)
• Queryable sources – provide a query interface, e.g., SQL
Cooperative sources
• Replicated sources – publish/subscribe mechanism
• Call-back sources – call external code (ETL) when changes occur
• Internal action sources – only internal actions when changes occur (DB triggers are an example)
The extract strategy is very dependent on the source types
Extract Phase
Goal: fast extraction of relevant data
• Extraction from source systems can take a long time
Types of extracts:
• Extract applications (SQL): co-existence with other applications
• DB unload tools: much faster than SQL-based extracts
• Extract applications are sometimes the only solution
Often too time-consuming to ETL everything:
• Extracts can take days/weeks
• Drain on the operational systems
• Drain on DW systems
=> Extract/ETL only changes since the last load (the delta)
Computing Deltas
• Much faster to "ETL" only changes since the last load
• A number of methods can be used
• Store sorted total extracts in the DSA
  – Delta can easily be computed from current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
• Put an update timestamp on all rows
  – Updated by DB trigger
  – Extract only where "timestamp > time of last extract"
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - Source system must be changed
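The first method, computing the delta from the current and last sorted extracts, can be sketched as a single merge pass over both files (record layout and names assumed for illustration):

```python
# Sketch (assumed record layout): compute the delta between the
# previous and current full extracts, both sorted by a record key.
# A single merge pass yields inserts, deletes, and updates, which is
# why this method handles deletions but does not shorten extraction.

def compute_delta(last, current):
    """last / current: lists of (key, row) tuples, sorted by key."""
    inserts, deletes, updates = [], [], []
    i = j = 0
    while i < len(last) and j < len(current):
        k_old, r_old = last[i]
        k_new, r_new = current[j]
        if k_old == k_new:
            if r_old != r_new:
                updates.append((k_new, r_new))
            i += 1; j += 1
        elif k_old < k_new:          # key vanished -> deletion
            deletes.append(k_old); i += 1
        else:                        # new key -> insertion
            inserts.append((k_new, r_new)); j += 1
    deletes.extend(k for k, _ in last[i:])
    inserts.extend(current[j:])
    return inserts, deletes, updates
```
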
Load (1)
Goal: fast loading into the DW
• Loading deltas is much faster than a total load
• SQL-based update is slow
  – Large overhead (optimization, locking, etc.) for every SQL call
  – DB load tools are much faster
  – Some load tools can also perform UPDATEs
• Indexes on tables slow the load a lot
  – Drop indexes and rebuild after the load
  – Can be done per partition
• Parallelization
  – Dimensions can be loaded concurrently
  – Fact tables can be loaded concurrently
  – Partitions can be loaded concurrently
Load (2)
• Relationships in the data
  – Referential integrity must be ensured
  – Can be done by the loader
• Aggregates
  – Must be built and loaded at the same time as the detail data
  – Today, RDBMSs can often do this
• Load tuning
  – Load without log
  – Sort the load file first
  – Make only simple transformations in the loader
  – Use loader facilities for building aggregates
  – Use the loader within the same database
• Should the DW be on-line 24×7?
  – Use partitions or several sets of tables
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
Data Cleaning
The activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e.,
cleaning and transforming to get…
high-quality data!
Why Data Cleaning and Transformation?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=""
• noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  – e.g., Age="42" but Birthday="03/07/1997"
  – e.g., was rating "1,2,3", now rating "A, B, C"
  – e.g., discrepancy between duplicate records
Why Is Data Dirty?
• Incomplete data comes from:
  – data values not available when collected
  – different criteria between the time the data was collected and the time it is analyzed
  – human/hardware/software problems
• Noisy data comes from:
  – data collection: faulty instruments
  – data entry: human or computer errors
  – data transmission
• Inconsistent (and redundant) data comes from:
  – different data sources, hence non-uniform naming conventions/data codes
  – functional dependency and/or referential integrity violations
Why Is Data Cleaning Important?
• A data warehouse needs consistent integration of quality data
  – Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• No quality data, no quality decisions!
  – Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Types of Data Cleaning
• Conversion, parsing, and normalization
  – Text coding, date formats, etc.
  – Most common type of cleansing
• Special-purpose cleansing
  – Normalize spellings of names, addresses, etc.
  – Remove duplicates, e.g., duplicate customers
• Domain-independent cleansing
  – Approximate, "fuzzy" joins on not-quite-matching keys
Data Quality vs. Cleaning
• Data quality = data cleaning +
  – Data enrichment
    • enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
  – Data profiling
    • analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  – Data monitoring
    • deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
• Data stewards responsible for data quality
• DW-controlled improvement
• Source-controlled improvement
• Construct programs to check data quality
Overview
• General ETL issues:
  – Data staging area (DSA)
  – Building dimensions
  – Building fact tables
  – Extract
  – Load
  – Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation framework
ETL Tools
ETL tools from the big vendors, e.g.,
• Oracle Warehouse Builder
• IBM DB2 Warehouse Manager
• Microsoft Integration Services
Offer much functionality at a reasonable price (included…)
• Data modeling
• ETL code generation
• Scheduling DW jobs
• …
Many others
• Hundreds of tools
• Often specialized tools for certain jobs (insurance cleansing, …)
The "best" tool does not exist
• Choose based on your own needs
• Check first if the "standard tools" from the big vendors are OK
ETL and Data Quality Tools
• http://www.etltool.com/
• Magic Quadrant for Data Quality Tools, 2007
• Some open-source ETL tools:
  – Talend
  – Enhydra Octopus
  – Clover.ETL
• Not so many open-source quality/cleaning tools
Application Context
• Integrate data from different sources
  – E.g., populating a DW from different operational data stores
• Eliminate errors and duplicates within a single source
  – E.g., duplicates in a file of customers
• Migrate data from a source schema into a different fixed target schema
  – E.g., discontinued application packages
• Convert poorly structured data into structured data
  – E.g., processing data collected from the Web
The AJAX Data Transformation and Cleaning Framework
Motivating Example (1)
Input: DirtyData(paper: String)
↓ Data Cleaning & Transformation ↓
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
Motivating Example (2)
Input (DirtyData):
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS'95
After data cleaning & transformation:
Events: PDIS | Conference on Parallel and Distributed Information Systems
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors: DQua | Dallan Quass; AGup | Ashish Gupta; JWid | Jennifer Widom; …
PubsAuthors: QGMW96 | DQua; QGMW96 | AGup; …
Modeling a Data Cleaning Process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Diagram: Extraction splits DirtyData into DirtyAuthors, DirtyTitles, ..., DirtyEvents; Standardization (using auxiliary data such as CitiesTags), Duplicate Elimination, and Formatting then produce clean tables such as Authors]
Existing Technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  – Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool
Problems of ETL and Data Quality Solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Diagram: the same data cleaning transformations are re-implemented separately for application domains 1, 2, 3, ...]
Problems of ETL and Data Quality Solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Diagram: dirty data enters the cleaning process, which outputs clean data and rejected data]
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Logical Level: Parametric Operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
Logical Level
[Diagram: the cleaning DAG from the motivating example – Extraction, Standardization, Duplicate Elimination, and Formatting over DirtyData, DirtyAuthors, DirtyTitles, ..., and CitiesTags, producing Authors]
Logical Level vs. Physical Level
[Diagram, logical level: the cleaning DAG expressed with logical operators – Extraction and Standardization as Map, Duplicate Elimination as Match, Cluster, and Merge, Formatting as Map – over DirtyData, DirtyAuthors, DirtyTitles, ..., and CitiesTags, producing Authors]
[Diagram, physical level: the same graph with physical algorithms chosen for each operator – SQL Scan, Java Scan, NL (nested loop) for Match, and TC (transitive closure) for Cluster]
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Match
• Input: 2 relations
• Finds data records that correspond to the same real-world object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples, and possibly 1 or 2 relations containing non-matching tuples
Example
[Diagram: within Duplicate Elimination, DirtyAuthors feeds Match, producing MatchAuthors, followed by Cluster and Merge to produce Authors]
Example

CREATE MATCH MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    distance = editDistance(da1.name, da2.name)
WHERE  distance < maxDist
INTO   MatchAuthors

[Diagram: within Duplicate Elimination, DirtyAuthors feeds Match, producing MatchAuthors, followed by Cluster and Merge to produce Authors]
Example

CREATE MATCH MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    distance = editDistance(da1.name, da2.name)
WHERE  distance < maxDist
INTO   MatchAuthors

Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
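A possible next step, Cluster, can group the matched pairs by transitive closure (the declarative specification later in these slides names a TransitiveClosureSourceClustering algorithm); a union-find sketch of that idea, with illustrative names:

```python
# Sketch: clustering matched pairs by transitive closure using
# union-find -- records connected by a chain of matches end up in the
# same cluster, ready for a Merge step to pick one representative.

def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # union every matched pair
        parent[find(a)] = find(b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

On the example output above, the pairs (861, 822) and (822, 814) chain together into one cluster even though 861 and 814 were never directly matched.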
Implementation of the Match Operator
∀ s1 ∈ S1, s2 ∈ S2:
(s1, s2) is a match if editDistance(s1, s2) < maxDist
Nested Loop
[Diagram: every s1 in S1 is compared with every s2 in S2 via editDistance]
• Very expensive evaluation when handling large amounts of data
⇒ Need alternative execution algorithms for the same logical specification
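A minimal sketch of this nested-loop evaluation, with a textbook dynamic-programming edit distance standing in for the editDistance function (helper names are illustrative, not AJAX's API):

```python
# Sketch of the naive nested-loop match: every pair from S1 x S2 is
# scored with an edit distance and kept when the distance is below
# maxDist -- O(|S1| * |S2|) distance computations.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def nested_loop_match(s1, s2, max_dist):
    return [(a, b) for a in s1 for b in s2
            if edit_distance(a, b) < max_dist]
```

The quadratic number of (itself quadratic-cost) distance calls is exactly why the following slides look for cheaper execution strategies.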
A Database Solution

CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

• No optimization supported for a Cartesian product with external function calls
Window Scanning
[Diagram, shown in three animation steps: a window of n records slides over the sorted relation S; only records within the same window are compared]
• May lose some matches
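A sketch of window scanning, assuming the records can be sorted on some key so that likely matches land near each other; the distance function is passed in, and all names are illustrative:

```python
# Sketch of window scanning (sorted-neighborhood style): sort the
# records, then compare each record only with the next n-1 records.
# Cost drops from O(N^2) to O(N * n) distance calls, at the price of
# possibly missing matches whose records sort far apart.

def window_scan_match(records, n, distance, max_dist):
    recs = sorted(records)
    matches = []
    for i, a in enumerate(recs):
        for b in recs[i + 1:i + n]:          # window of size n
            if distance(a, b) < max_dist:
                matches.append((a, b))
    return matches
```
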
String Distance Filtering
[Diagram: with maxDist = 1, each string in S1 is mapped to its length and compared via editDistance only against strings in S2 of length - 1, length, and length + 1; e.g., "John Smith" against "John Smit", "Jogn Smith", and "John Smithe"]
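A sketch of the filtering idea: an edit distance below maxDist implies the strings' lengths differ by less than maxDist, so records can be bucketed by length and editDistance called only across nearby buckets. Names are illustrative, and the length window below is the conservative ±maxDist from the diagram, so no true matches are lost:

```python
# Sketch of string-distance filtering: bucket S2 by string length and
# call the (expensive) distance function only for strings whose length
# is within max_dist of the probe's length -- a safe pre-filter for
# edit distance, since |len(a) - len(b)| <= editDistance(a, b).

from collections import defaultdict

def filtered_match(s1, s2, max_dist, distance):
    by_len = defaultdict(list)
    for b in s2:
        by_len[len(b)].append(b)
    matches = []
    for a in s1:
        # only lengths within +/- max_dist of len(a) can qualify
        for l in range(len(a) - max_dist, len(a) + max_dist + 1):
            for b in by_len.get(l, []):
                if distance(a, b) < max_dist:
                    matches.append((a, b))
    return matches
```
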
Annotation-Based Optimization
• The user specifies the type of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM   DirtyAuthors da1, DirtyAuthors da2
LET    dist = editDistance(da1.name, da2.name)
WHERE  dist < maxDist
       % distance-filtering: map = length; dist = abs %
INTO   MatchAuthors
Declarative Specification

DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING(600)

DEFINE ALGORITHMS AS
  TransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData(paper STRING (400));
  TABLE City(city STRING (80), citysyn STRING (80)) KEY city, citysyn;

DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper}
AJAX Features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Management of Exceptions
• Problem: how to mark tuples not handled by the cleaning criteria of an operator
• Solution: specify the generation of exception tuples within a logical operator, when
  – exceptions are thrown by external functions
  – output constraints are violated
Debugger Facility
• Supports the (backward and forward) data derivation of tuples wrt. an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators
Architecture
To see it working...
Workshop-UQ? Tomorrow at Tagus Park, 10h-16h