Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Algorithms and Architecture for Managing Evolving
ETL Workflows
Judith AwitiUniversité Libre de Bruxelles, Belgium
Esteban Zimányi (Home Supervisor) : Université Libre de BruxellesRobert Wrembel (Host Supervisor) : Poznań University of Technology
1
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
2
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
3
Introduction
DW
…
CRM
ERP
Flat File
ETL
• Data Extraction (into a data staging area)
• Data Transformation (Aggregation, Cleaning, removing duplicates, data type change, correct null values,… )
• Data Loading
Source Systems Data Warehouse
Extract-Transform-Load (ETL)
4
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
5
Problem Statement
?
?DWΔDS schema
…
ETL
DS1
DS2
DS1
DS2
6
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
7
Objectives
1. To propose a methodology for designing ETL processes that will facilitate asmooth transition from gathering user requirements to the actualimplementation. This methodology will include all aspects of ETL design,from conceptual modelling to physical implementation
2. To develop a framework to (semi-) automatically repair ETL workflowsupon data source changes
Currently focusing on relational data
8
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
9
ETL Modelling (Our approach)
BPMN4ETL [1,2]
Extended RA [3]
SQL
BEXF (XML Interchange Format )
Models of ETL Tools (SSIS, Pentaho PDI, Talend, etc)
Conceptual Model
Logical Model
Physical Implementation
10
ETL ModellingScenarioThe historical ETL load of DimBroker Dimension of TPC-DI [6] Dataset
ATTRIBUTES
HR.csv EmployeeID, ManagerID, EmployeeJobCode, EmployeeFirstName… , EmployeePhone
DimDate SK_DateID, DateValue, CalendarYearDesc,…, HolidayFlag
DimBroker Sk_BrokerID, BrokerID, ManagerID, FirstName,…,IsCurrent, BatchID, EffectiveDate, EndDate
Transformations• Records where EmployeeJobCode is not 314 are not broker records, and are ignored (Filter)• SK_BrokerID is set appropriately for new records (Surrogate key assignment)• IsCurrent is set to true • EffectiveDate is set to the earliest date in the DimDate table and EndDate is set to 9999-12-31 (Aggregate)• BatchID is set as described in TPC-DI specification document
11
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
12
BPMN4ETL: Conceptual modelling [1,2]
• BPMN is a standard modelling language and can be used for documentation• Models both control and data flow• ETL activities (e.g., aggregations, conversions, etc) can be plugged in easily• Less complex because user is not overwhelmed with inter-attribute mappings• Easy communication and validation between an Operational Database Designer, an ETL Designer and a BI analyst• Exposes the manipulation of data and their order from one ETL task to the other• Can be translated directly to relational algebra, SQL, or an XML interchange format
Load DimBroker+
Load DimBroker
13
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
14
Logical Modelling : Extended Relational Algebra [3]
15
Logical Modelling : Extended Relational Algebra
• RA provides a set of operators that manipulates relations to ensure that there is no ambiguity• Can also be directly translated into SQL to be executed in any Relational Database Management System
(RDBMS). We avoid dealing with the peculiarities of a particular programming language• When extended with update operations, they can provide a logical model of different ETL scenarios. E.g.
Slowly changing dimension with dependencies found in the TPC-DI Benchmark
LimitationDifficult to model certain complex tasks in relational algebra even though they can be done directly with SQLs. (E.g. window functions and loops)
16
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
17
ETL Modelling (Experiments)Experimental EvaluationExperiments implemented in two ways:1. Using Pentaho PDI , translating the BPMN4ETL directly into Pentaho PDI2. Using RA , translating BPMN4ETL into extended RA, and then implementing the RA operations
using Postgres PLSQL.
TPC-DI Benchmark• Data sources are of different formats (xml, csv, txt, and so on)• Source data model: Based on a fictitious retail brokerage firm and external sources• Target data model: Has a snowstorm schema• One historical load and two identical incremental loads• Scale factor (number of records) - 3 (4.5 million), 5 (7.8 million), 10 (16.1 million)
PlatformIntel i7 computer, with a RAM of 16 GB, running the Windows 10 Enterprise operating system, using the Postgres SQL database as the DW storage
18
ETL Modelling (Experiments)PerformanceExecution times to complete TPC-DI benchmark LoadTime = hours:minutes:seconds
Pentaho PDI Optimization• PDI memory limit was increased from 2G to 4G• PDI performance tuning tips were applied- https://help.pentaho.com/Documentation/7.1/0P0/100/040/010
19
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
20
ETL Modelling (XML Interchange format)
BPMN4ETL eXchange format (BEXF)
<ETLProcess id="_idProcess" name="Load of DimBroker"><StartEvent id="_idStartEvent" name="Start Event">
<outRefId>_idS1</outRefId><outRefId>_idS6</outRefId>
</StartEvent><ETLTask id="_idInputData" name="Input Data" type="Input Data">
<File name=“HR.csv“ Type=“csv”/><inputs>
<inputColumn name="EmployeeID"/><inputColumn name=" ManagerID"/><inputColumn name=" EmployeeJobCode "/>
…</inputs><inRefId>_idS1</inRefId><outRefId>_idS2</outRefId>
</ETLTask>...</ETLProcess>
21
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
22
ETL Evolution (Current Approaches)
HECATAEUS Framework – based on rules/policies [4]
• Abstract ETL activities as queries and sequence of views
• Transforms SQL queries to graph
• User annotate graph with rules/policies (Propagate, Block, Prompt)
• System detects parts of the graph affected by a change in data source and highlights the way they respond to it
23
ETL Evolution (Current Approaches)
HECATAEUS
DS change = Add Phone to EMP
Policy = Propagate
Q: SELECT EMP.Emp# as Emp#, Sum(WORKS.Hours) as T_HoursFROM EMP, WORKS WHERE EMP.Emp# = WORKS.Emp# AND EMP.STD_SAL >5000 GROUP BY EMP.Emp#
Phone
Phone
Detailed graph representation of ETL1_ACT9
24
ETL Evolution (Current Approaches)
Concerns with Hecataeus
• Near manual – policies must be explicitly stated for each node
• User must determine policy in advance before evolution event occurs
25
ETL Evolution (Current Approaches)
E-ETL (Evolving ETL) Framework – based on case-based reasoning [5]
• Applies case-based reasoning
• Keeps library of repair cases (LRC) as knowledge base
Concerns with E-ETL
• Developers cannot guarantee correctness
• It needs a case base in advance to work
Library of repair cases
DSC_A
DSC_B
Mod_A
Mod_B
ETL processData Sources
Case_A
Case_B
26
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
27
ETL Evolution (Our approach)Subgoals:
• Develop algorithms for (semi-)automatic repair of ETL workflows upon DS changes
– Rules may be inferred from cases
– Cases may be built from applying rules
– Rule based + Case based (a quality measure for RB and CB)
• Develop an architecture for handling ETL evolution
• Implement a prototype
• Verify the applicability of the proposed solution with the TPC-DI benchmark [6]
28
ETL Evolution (Our approach)Extended Evolving ETL (E3TL) framework
RBR + CBR• ETL workflows are rewritten by applying rules• Rules are inferred from cases (By applying algorithms)• Cases are built from user input
29
ETL Evolution (Our approach)Extended Evolving ETL (E3TL) framework – Learns rules from user input
Components:ETL Parser: The ETL parser takes an entire ETL workflow in the form of RA or SQLs and parses the parts of each command of the workflowETL Manager: The ETL manager assesses the impact of the data source change on each command of the ETL workflow and takes these decisions by applying rules stored in a the rule baseETL Rewriter: This component of the framework rewrites the commands in the ETL workflow by applying recommendations from the ETL managerRule Base: This contains distinct rules based on conditionsUser Input: This part of the framework request the user's input if any of the following conditions is true:• no rule is available in the rule base to deal with the problem• several solutions are applicable to solve the problemCase Base: This is a repository to store casesTranslator: This component applies algorithms to develop distinct rules from cases
30
Outline
➢ Introduction➢Problem Statement➢Project Objectives➢ETL Modelling
• BPMN4ETL• Extended Relational Algebra• Experiments• BEXF (XML Interchange format)
➢ETL Evolution• Current Approaches• Our Approach
➢Conclusion
31
ConclusionThis project provides a means of managing ETL processes in two ways. First, their modelling and second, their reparation upon DS schema changes.
Currently, we have provided a modelling strategy of ETL processes with BPMN4ETL, an extended BPMN model for ETL at the conceptual level and with extended relational algebra (RA) extended with update operations at a logical level.
We propose the E3TL framework in which we will develop algorithms for (semi-) automatic repair of ETL workflows upon DS schema changes.
32
References• [1] Akkaoui, Z.E., Zimányi, E.: Defining ETL workflows using BPMN and BPEL. In: Proc. of the 12th ACM International
Workshop on Data Warehousing and OLAP. pp. 41–48. ACM, Hong Kong, China (2009)
• [2] El Akkaoui, Z., Zimányi, E., Mazón, J.N., Trujillo, J.: A BPMN-Based design and maintenance framework for ETL processes. International Journal of Data Ware housing and Mining 9(3), 46–72 (2013)
• [3] Awiti, J., Vaisman, A., Zimányi, E.: From conceptual to logical ETL design using BPMN and relational algebra. In: Proc. of the 21st ACM International Conference on Big Data Analytics and Knowledge Discovery, (DAWAK), Springer, Linz, Austria (2019).
• [4] Papastefanatos, G., Vassiliadis, P., Simitsis, A., & Vassiliou, Y. (2009). Policy-regulated management of ETL evolution. In Journal on Data Semantics XIII (pp. 147-177). Springer, Berlin, Heidelberg.
• [5] Wojciechowski, A.: E-ETL: Framework for managing evolving ETL processes. In: Proc. of the 4th Workshop for Ph.D. students in information & knowledge management, ACM, Glasgow, Scotland, UK (2011).
• [6] Poess, M., Rabl, T., Jacobsen, H. A., & Caufield, B. (2014). TPC-DI: the first industry benchmark for data integration. Proceedings of the VLDB Endowment, 7(13), 1367-1378.
33
Thank You
34