1
Wesley Deneke
A Domain Specific Model for Generating ETL
Workflows from Business Intents
Thesis directors
Wing-Ning Li
Craig Thompson
Thesis committee
Gordon Beavers
Rick Couvillion
2
Outline
Problem Context
Thesis and Objectives
Approach
Prototype
Results
Contributions and Future Work
5
How to leverage this data?
Data Sources
Diverse
Large
Changing
Processing Tasks
Objective specific
Manipulate data
8
Workflow Representation
Directed Acyclic Graph
G is a pair (V, E), where:
V is a set of vertices
E is a set of ordered pairs of vertices that denote
directed edges connecting vertex pairs.
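As a minimal sketch of the definition above, a workflow graph can be held as a vertex list plus a list of ordered pairs, and checked for acyclicity with a topological sort (Kahn's algorithm). The vertex names below are illustrative placeholders, not a workflow taken from the thesis.

```python
from collections import deque

def topological_order(vertices, edges):
    """Return the vertices in dependency order; raise ValueError on a cycle."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Start from vertices with no incoming edges.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for nxt in successors[v]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Any vertex left unprocessed lies on a cycle, so G is not a DAG.
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle")
    return order

# Illustrative vertices (V) and directed edges (E).
V = ["Source", "Parser", "AddressSelect", "Sink"]
E = [("Source", "Parser"), ("Parser", "AddressSelect"), ("AddressSelect", "Sink")]
order = topological_order(V, E)
```

A graph that passes this check can be executed one operator at a time in the returned order.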
18
Problem
Analysis
Design
Construction
Verification
Maintenance
ETL Workflow Specification
Time consuming
Required expertise
Error prone
Operator logic
Workflow standards
Business rules
22
Thesis Statement
ETL workflow specification can be automated in an
extensible manner by translating high-level statements
of intent into a set of ETL workflow requirements and
generating ETL workflow solutions that accomplish these
specifications.
23
Objectives
Better solution accuracy
Lower required level of expertise
Faster turnaround
Fewer errors
24
Approach
Out of Scope:
Field characterization
Unknown data resources
Unknown operators
Assumptions:
Sources and sinks given
Single data source
Properly characterized
Flat files of data records
Homogeneous data
Known state
Operators given
Well-defined
25
Approach
Extensible framework for the creation of domain-specific
modeling languages that enable users to express the
intent of a desired ETL solution at a high level of
abstraction and automatically generate workflows
satisfying such specifications.
26
ETL Domain Knowledge
Not uniform across domains
Not guaranteed to remain static
Capture ETL domain knowledge in a
formal representation.
Domain-specific
27
Domain
US
Canada
Retail
Financial
Client
Client
A means of constraining
the set of considerations.
28
Workflow ∪ State
S = (Q, Σ, δ, q0, F):
Q - set of states
Σ - set of valid input symbols
δ - set of state transitions
q0 - start state
F - set of accepting states
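The five-tuple above can be sketched as a small state machine in which each input symbol is an operator applied to the data. The state and operator names here are hypothetical placeholders chosen for illustration; the thesis does not specify this encoding.

```python
def accepts(delta, q0, accepting, symbols):
    """Run the machine from q0 and report whether it ends in an accepting state."""
    state = q0
    for symbol in symbols:
        key = (state, symbol)
        if key not in delta:
            # No transition is defined, so the input sequence is invalid.
            return False
        state = delta[key]
    return state in accepting

# Hypothetical transitions: (current state, operator) -> next state.
DELTA = {
    ("raw", "parse"): "parsed",
    ("parsed", "standardize"): "standardized",
}
Q0 = "raw"
F = {"standardized"}

ok = accepts(DELTA, Q0, F, ["parse", "standardize"])
bad = accepts(DELTA, Q0, F, ["standardize"])
```

Here `ok` is true because the operator sequence drives the data from the start state to an accepting state, while `bad` is false because no transition exists for that operator in the start state.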
29
Attributed Field
The data each field contains can be
described at a higher level of
abstraction.
Name Data Type
varchar
int
float
timestamp
30
Attributed Field
Content Type
High-level concepts used to categorize data.
Semantic relationships
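One way to picture an attributed field is as a record that carries a content type and a set of attributes alongside the usual name and data type. This is a sketch under assumptions: the class name, the field names, and the `describe` helper are hypothetical, and the `Address.Primary +Standardized` notation is borrowed from the condition slides in this deck.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributedField:
    name: str                 # physical column name
    data_type: str            # e.g. varchar, int, float, timestamp
    content_type: str         # high-level concept, e.g. "Address.Primary"
    attributes: frozenset = frozenset()  # e.g. {"Standardized"}

    def describe(self):
        """Render the content type followed by its attributes in sorted order."""
        suffix = "".join(f" +{a}" for a in sorted(self.attributes))
        return f"{self.content_type}{suffix}"

# Hypothetical example field.
addr = AttributedField("addr1", "varchar", "Address.Primary",
                       frozenset({"Standardized"}))
label = addr.describe()
```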
34
Preconditions
Assertions that must be true
prior to execution to guarantee
the result produced.
Predicate expressions representing valid input
35
Preconditions
Required:
(Input1 && Input2) ||
(Input1 && Input3) ||
(Input3 && Input4 && Input6) ||
(Input3 && Input4 && Input5 && Input6)
Unparsed Name
Unparsed Address
Parsed Address
First Name
Middle Name
Last Name
Y:
Input1::Name.Full
||Input3::Address.Primary
+Corrected
Input1::Name.Full &&
Input2::Address.PrimaryStandardized
Input1::Name.Full &&
Input2::Address.Primary
+Standardized
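The "Required" expression on this slide is a disjunction of conjunctions, so it can be sketched as a list of input sets: the precondition holds when every input in at least one set is bound. The function name and the set-based representation are assumptions made for illustration.

```python
# Each set is one alternative from the "Required" expression.
REQUIRED = [
    {"Input1", "Input2"},
    {"Input1", "Input3"},
    {"Input3", "Input4", "Input6"},
    {"Input3", "Input4", "Input5", "Input6"},
]

def required_satisfied(bound_inputs, alternatives=REQUIRED):
    """True when all inputs of at least one alternative are bound."""
    bound = set(bound_inputs)
    return any(group <= bound for group in alternatives)

name_and_address = required_satisfied({"Input1", "Input3"})
incomplete = required_satisfied({"Input3", "Input4"})
```

Note that the fourth alternative is implied by the third, so it never changes the outcome of this check.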
36
Postconditions
Assertions that will be true
after execution, provided that
the preconditions are satisfied.
37
Postconditions
Input1::Address.Primary
+Standardized
Input1::Address.Primary
+Standardized
+Verified
Input1::Name.Full
+Corrected
Input1::Name.Full
+Filtered
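Read this way, a postcondition adds attributes to a content type once the operator has run. The sketch below assumes a state represented as a mapping from content type to a set of attributes; that representation and the function name are hypothetical.

```python
def apply_postconditions(state, postconditions):
    """Return a new state with each postcondition's attributes added."""
    # Copy so the caller's state is left untouched.
    new_state = {content: set(attrs) for content, attrs in state.items()}
    for content_type, added in postconditions:
        new_state.setdefault(content_type, set()).update(added)
    return new_state

before = {"Address.Primary": set(), "Name.Full": set()}
after = apply_postconditions(
    before,
    [("Address.Primary", {"Standardized", "Verified"}),
     ("Name.Full", {"Corrected"})],
)
```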
38
Workflow Engine
AI Planner
Initial State ➔ Source Data
Goal State ➔ Target State of Data
State Transitions ➔ Available Operators
Planning strategies:
Depth First
Breadth First
A*
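The planning formulation above can be sketched with a breadth-first search over data states, where each operator is applicable when its preconditions hold and moves the state forward by adding its postconditions. The operator table below is a hypothetical toy, not the operator set used in the prototype, and a state is simplified to a set of attribute names.

```python
from collections import deque

# Hypothetical operators: name -> (required attributes, added attributes).
OPERATORS = {
    "Parse": (frozenset(), frozenset({"Parsed"})),
    "Standardize": (frozenset({"Parsed"}), frozenset({"Standardized"})),
    "Verify": (frozenset({"Standardized"}), frozenset({"Verified"})),
}

def plan(initial, goal, operators=OPERATORS):
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    start = frozenset(initial)
    goal = frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for name, (pre, post) in operators.items():
            if pre <= state:              # preconditions satisfied
                nxt = state | post        # postconditions applied
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None  # goal unreachable with the given operators

steps = plan(set(), {"Verified"})
```

Swapping the queue for a stack gives depth-first search, and ordering the frontier by path cost plus a heuristic gives A*.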
39
Intent Language
Need an intuitive goal specification
Express in terms of the given domain
Familiar terminology
Understandable to users
Mapping:
High-level ➔ Low-level
Intent ➔ Goal State
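The intent-to-goal mapping can be sketched as a lookup table from intent names to the field attributes the goal state must contain. The intent names below appear elsewhere in this deck, but the attribute pairs assigned to them are assumptions made for illustration.

```python
# Hypothetical mapping: intent -> set of (content type, attribute) goals.
INTENT_GOALS = {
    "Address hygiene": {("Address.Primary", "Standardized")},
    "Validate names": {("Name.Full", "Corrected")},
}

def goal_state(intents, mapping=INTENT_GOALS):
    """Union the low-level goals of every requested intent."""
    unknown = [i for i in intents if i not in mapping]
    if unknown:
        raise KeyError(f"no mapping for intents: {unknown}")
    goal = set()
    for intent in intents:
        goal |= mapping[intent]
    return goal

goal = goal_state(["Address hygiene", "Validate names"])
```

The resulting set can be handed to a planner as its goal state, which keeps the user-facing vocabulary separate from the low-level field attributes.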
46
Operators
AddressEditCheck
AddressEnhance
AddressSelect
ContactLink
IndustryCode
NameEditCheck
Parser
PremiumAddress
47
Intents
Address hygiene
Premium address hygiene
Change of address
Premium change of address
Filter profanity
Validate names
Determine industry demographic
Validate addresses
Delivery sequencing
Geocode addresses
Link contacts
48
Test Scenario 1
Workflow operators (from figure): Parser, IndustryCode, AddressEnhance, AddressSelect, NameEditCheck
Intents:
Determine industry demographic
Address hygiene
Validate names
Initial state:
Full name
Unparsed primary address
Unparsed city/state/zip
49
Test Scenario 2
Intents:
Determine industry demographic
Address hygiene
Link contacts
Validate addresses
Validate names
Initial state:
Full name
Unparsed primary address
Unparsed city/state/zip
Workflow operators (from figure): Parser, IndustryCode, AddressEnhance, AddressSelect, ContactLink, AddressEditCheck, NameEditCheck
50
Analysis
Accuracy
Scenario 1: 384 workflows
Scenario 2: 7680 workflows
Consolidated: 16 workflows
Ease of use
Scenario 1: 3 intents
Scenario 2: 5 intents
Reduced time
Both < 1 minute
Error free
Assuming proper modeling
51
Contributions
Represent and enforce ETL domain knowledge.
Automatic generation of ETL workflow solutions using
AI planning.
Mapping between workflow requirements and
higher-level abstractions called “intents”.
52
Future Work
Operator verification
Correctness
Equivalence
Optimization
Data heritage
Generic set operators
Intermediate goals
Input mappings
Goal indexing
Caching
Nested intent statements
Intent relationships
Result filtering