
XML Data Warehousing with ETL (Extracting, Transforming and Loading)


DESCRIPTION

A short presentation by a group of four (Keyur Patel, Manav Sharma, Shubham Shah and Ruturaj Rawal) at Laurentian University, Greater Sudbury, ON, Canada, on the research topic of XML data warehousing (with ETL). Thanks to the many research papers, as well as SlideShare, used for reference. Special thanks to Manav Sharma for the "Our Approach" section.


Page 1: Xml datawarehousing with ETL(Extracting, Transforming and Loading)

Project Topic: XML Data Warehousing

Group 5
• Keyur Patel (1100101106013)
• Shubham Shah (1100101106037)
• Manav Sharma (110010107056)
• Ruturaj Raval (110090107036)

Guided by: Dr. Kalpdrum Passi

Page 2

Introduction

• XML can be considered a standard syntax for the exchange of semi-structured data

• Feature of the semi-structured data model
  • Lack of schema, so that the data is self-describing

• However, an XML document can be associated with, and validated against, a DTD or an XML Schema

• It is more difficult to store and retrieve data in a warehouse in semi-structured form than in structured form

Page 3

XML - The Key to the Next Generation Data Warehouse

• A large amount of data needed in decision-making processes is stored in the XML data format, which is widely used for E-commerce and Internet-based information exchange

• The importance of integrating XML data into data warehousing environments is steadily increasing

Source: http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf

Page 4

XML in Data warehouse

• XML documents are usually generated as part of a transaction, complex business process, or communication exchanged between partnering businesses

• Applying this information to data warehousing (DW) and business intelligence (BI) is challenging because of the hierarchies and complex data structures that are typical of XML schemas
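To make the hierarchy problem concrete, here is a minimal sketch of flattening one nested XML document into the flat rows a warehouse table expects. The order document, its element names and its attribute names are hypothetical and chosen only for illustration; they do not come from the cited papers.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; element and attribute names are illustrative only.
ORDER_XML = """
<order id="1001">
  <customer id="C7">Acme Ltd</customer>
  <items>
    <item sku="A1" qty="2" price="9.50"/>
    <item sku="B4" qty="1" price="20.00"/>
  </items>
</order>
"""

def flatten_order(xml_text):
    """Turn one nested <order> document into flat rows, one per line item."""
    order = ET.fromstring(xml_text)
    customer = order.find("customer")
    rows = []
    for item in order.iter("item"):
        qty = int(item.get("qty"))
        rows.append({
            "order_id": int(order.get("id")),      # repeated on every row
            "customer_id": customer.get("id"),     # repeated on every row
            "sku": item.get("sku"),
            "qty": qty,
            "amount": qty * float(item.get("price")),
        })
    return rows

rows = flatten_order(ORDER_XML)
```

Each line item becomes one row, and the parent-level values (order and customer identifiers) are repeated on every row. That repetition is the price of moving from a tree to a table.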

Page 5

Integration of XML documents within a data warehouse.

Source: http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf

A data mart is the access layer of the data warehouse environment that is used to get data out to the users

XML documents are integrated and stored into data warehouses

Page 6

Data Warehouse Architecture

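[Figure: data warehouse architecture. A central data warehouse is fed through ETL steps from the sources RDBMS 1, RDBMS 2, HTML 1 and XML 1, and from the outputs of an ETL pipeline.]

Source: http://research.cs.wisc.edu/dibook/slides/Chapter_10.ppt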

Page 7

Data Warehouse Architecture

• At the top: a centralized database
  • Generally configured for queries and appends
  • Many indices, materialized views, etc.

• Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools

Page 8

Data Warehouse Architecture

• A traditional data warehouse architecture consists of four layers:

1. The data sources
  • e.g. legacy systems, flat files, or files in any other format

2. The back end
  • ETL (Extracting, Transforming and Loading) takes place here

3. The global data warehouse
  • Keeps a record of the data that result from transformation, integration and aggregation

4. The front end
  • Consists of the applications and techniques that business users use to interact with the data stored in the data warehouse

Page 9

ETL

Page 10

ETL

• ETL means Extract, Transform, and Load
  • Extracts data from outside sources
  • Transforms it to fit operational needs, which can include quality levels
  • Loads it into the end target (a database or, more specifically, an operational data store, data mart, or data warehouse)
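The three steps can be sketched in a few lines. This is only an illustration under stated assumptions: the XML source, the `sales` table and its columns are hypothetical, and an in-memory SQLite database stands in for the end target.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical source document; names and values are illustrative only.
SOURCE_XML = "<sales><sale region=' north ' amount='12.5'/><sale region='SOUTH' amount='7'/></sales>"

def extract(xml_text):
    """Extract: pull raw attribute dictionaries out of the source."""
    return [dict(sale.attrib) for sale in ET.fromstring(xml_text).iter("sale")]

def transform(records):
    """Transform: normalise region names and convert amounts to numbers."""
    return [(r["region"].strip().lower(), float(r["amount"])) for r in records]

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")   # stand-in for the warehouse
load(conn, transform(extract(SOURCE_XML)))
loaded = conn.execute("SELECT region, amount FROM sales ORDER BY region").fetchall()
```

Keeping the three steps as separate functions mirrors the ETL definition and lets each one be tested or replaced independently.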

Page 11

ETL

• The set of operations taking place in the back stage of data warehouse architecture is generally known as the Extraction, Transformation, and Loading (ETL) processes

• ETL processes are responsible for the extraction of data from different, distributed, and often, heterogeneous data sources, their cleansing and customization in order to fit business needs and rules, their transformation in order to fit the data warehouse schema, and finally, their loading into a data warehouse

Page 12

Extraction

• Extraction is conceptually the simplest step: it identifies the subset of source data that should be submitted to the ETL workflow for further processing.

• In practice the task is not easy, mainly because it must interfere as little as possible with the software configuration at the source side.

• This requirement is imposed by two factors:
  • (a) the source must suffer minimum overhead during extraction, since other administrative activities also take place during that period, and
  • (b) for both technical and political reasons, administrators are quite reluctant to accept major interventions in their system's configuration.

Page 13

Transformation & Cleaning

• After their extraction from the sources, the data are transported into an intermediate storage area, where they are transformed and cleansed. 

• That area is frequently called the Data Staging Area (DSA); physically, it can be either on a separate machine or on the one used for the data warehouse.

• The transformation and cleaning tasks constitute the core functionality of an ETL process.
  • Depending on the application, different problems may exist and different kinds of transformations may be needed.

• The problems can be categorized as follows:
  • (a) schema-level problems: naming and structural conflicts, including granularity differences,
  • (b) record-level problems: duplicated or contradicting records, and consistency problems, and
  • (c) value-level problems: several low-level technical problems such as different value representations or different interpretations of the values.
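As a rough illustration of the record-level and value-level categories, the sketch below normalises inconsistent value representations and then removes duplicate records. The sample records, the country lookup table and the list of date formats are assumptions made for this example, not part of any real source system.

```python
from datetime import datetime

# Hypothetical raw records with duplicates and inconsistent value formats.
RAW = [
    {"id": "17", "country": "CA", "joined": "2014-03-01"},
    {"id": "17", "country": "Canada", "joined": "01/03/2014"},
    {"id": "42", "country": "canada ", "joined": "2014-04-15"},
]

COUNTRY_CODES = {"ca": "CA", "canada": "CA"}   # assumed master values
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")        # assumed source formats

def normalise(record):
    """Value-level fix: map each field to one canonical representation."""
    joined = None
    for fmt in DATE_FORMATS:
        try:
            joined = datetime.strptime(record["joined"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    country = COUNTRY_CODES.get(record["country"].strip().lower())
    return {"id": int(record["id"]), "country": country, "joined": joined}

def deduplicate(records):
    """Record-level fix: keep the first record seen for each id."""
    seen = {}
    for rec in records:
        seen.setdefault(rec["id"], rec)
    return list(seen.values())

clean = deduplicate([normalise(r) for r in RAW])
```

Normalising before deduplicating matters here: the two records for id 17 only become recognisably identical once their country and date values share one representation.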

Page 14

Loading

• After the appropriate transformation and cleaning operations, the data are loaded into the respective fact or dimension table of the data warehouse.

• There are two broad categories of solutions for the loading of data:
  • Bulk loading through a DBMS-specific utility, or
  • Inserting data as a sequence of rows.
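The difference between the two loading styles can be sketched with SQLite. Note the assumption: SQLite has no separate bulk-load utility in the Python standard library, so a single batched `executemany` call inside one transaction is used here as a stand-in for a DBMS-specific bulk loader. The table names are hypothetical.

```python
import sqlite3

# Hypothetical rows destined for a fact table.
rows = [(i, f"customer-{i}", i * 1.5) for i in range(1000)]

def make_table(conn, name):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")

def load_row_by_row(conn, rows):
    """Insert the data as a sequence of single-row statements."""
    for row in rows:
        conn.execute("INSERT INTO fact_rows VALUES (?, ?, ?)", row)
    conn.commit()

def load_in_bulk(conn, rows):
    """Batched insert in one transaction; a stand-in for a DBMS bulk loader."""
    with conn:
        conn.executemany("INSERT INTO fact_bulk VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
make_table(conn, "fact_rows")
make_table(conn, "fact_bulk")
load_row_by_row(conn, rows)
load_in_bulk(conn, rows)
counts = [conn.execute(f"SELECT COUNT(*) FROM {t}").fetchall()[0][0]
          for t in ("fact_rows", "fact_bulk")]
```

Both functions end with the same table contents. In a production DBMS the bulk path typically bypasses per-statement overhead, which is why it is preferred for large loads.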

Page 15

ETL Tools

• ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful

• Arbitrary pieces of code that take data from a source and convert it into data for the warehouse:
  • import filters: read and convert from data sources
  • data transformations: join, aggregate, filter, convert data
  • de-duplication: finds multiple records referring to the same entity and merges them
  • profiling: builds tables, histograms, etc. to summarize data
  • quality management: test against master values, known business rules, constraints, etc.

Page 16

Example ETL Tool Chain

• This is an example for e-commerce loading
• Note the multiple stages of filtering (using selection or join-like operations) and the logging of bad records before we group and load.

[Figure: example ETL tool chain. Stages: Split date-time, Filter invalid, Join, Filter invalid, Filter non-match, Group by customer. Data sets: Invoice line items, Item records, Customer records, Invalid dates/times, Invalid items, Invalid customers, Customer balance.]
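The tool chain can be approximated in plain Python. This sketch follows one plausible reading of the figure labels (split the date-time, reject bad rows into separate logs, join against item and customer records, then group by customer); the exact wiring of the original diagram is not recoverable here, and all field names and sample data are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical inputs; field names are assumptions, not taken from the slide.
ITEMS = {"A1": 9.5, "B4": 20.0}              # item records: sku -> unit price
CUSTOMERS = {"C7": "Acme Ltd"}               # customer records: id -> name
INVOICE_LINES = [
    {"customer": "C7", "sku": "A1", "qty": 2, "when": "2014-03-01 10:15"},
    {"customer": "C7", "sku": "ZZ", "qty": 1, "when": "2014-03-01 10:20"},
    {"customer": "C9", "sku": "B4", "qty": 1, "when": "2014-03-02 09:00"},
    {"customer": "C7", "sku": "B4", "qty": 1, "when": "not a date"},
]

def run_pipeline(lines):
    rejects = {"invalid_dates": [], "invalid_items": [], "invalid_customers": []}
    balances = defaultdict(float)
    for line in lines:
        # Split date-time, logging rows whose timestamp cannot be parsed.
        try:
            stamp = datetime.strptime(line["when"], "%Y-%m-%d %H:%M")
        except ValueError:
            rejects["invalid_dates"].append(line)
            continue
        row = dict(line, date=stamp.date().isoformat(), time=stamp.time().isoformat())
        # Join with item records, logging unknown items.
        if row["sku"] not in ITEMS:
            rejects["invalid_items"].append(line)
            continue
        # Filter rows that match no customer record.
        if row["customer"] not in CUSTOMERS:
            rejects["invalid_customers"].append(line)
            continue
        # Group by customer to accumulate the balance.
        balances[row["customer"]] += row["qty"] * ITEMS[row["sku"]]
    return dict(balances), rejects

balances, rejects = run_pipeline(INVOICE_LINES)
```

Rejected rows are kept in separate logs rather than silently dropped, which matches the slide's point about logging bad records before grouping and loading.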

Page 17

Implementation of the ETL system

[Figure: Source A and Source B connected through ETL]

Page 18

• The process flow is: Source A → ETL → Source B → ETL → Source A

• Source A
  • Files
  • Database
  • Message queues
  • Web services

• ETL
  • Read
  • Apply logic
  • Write

• Source B
  • Files
  • Database
  • Message queues
  • Web services

Page 19

• ETL is a bridge for bi-directional flow
• True data integration is agnostic of the source or target application

[Figure: the four staging steps for implementing a data warehouse]

Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004 

Page 20

• How is an ETL system implemented?
  • Scripting (shell, Perl, Python)
  • PL/SQL, sqlldr
  • Transformation hardcoded in Java, C#
  • Developing a (universal) ETL tool in-house
  • Using an off-the-shelf ETL tool

Page 21

ETL tools for data warehouse implementation

• Commercial
  • Ab Initio
  • IBM DataStage
  • Informatica PowerCenter
  • Microsoft Data Integration Services
  • Oracle Data Integrator
  • SAP Business Objects – Data Integrator
  • SAS Data Integration Studio

• Open-source based
  • Adeptia Integration Suite
  • Apatar
  • CloverETL
  • Pentaho Data Integration (Kettle)
  • Talend Open Studio/Integration Suite

Page 22

Data migration scheme over implementation

Page 23

ETL solutions for integration in data warehouse (6 parts)

• Data migration
  • Process of transferring data between storage types or formats
  • An automated migration frees up human resources from tedious tasks
  • Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs

• Data consolidation
  • Usually associated with moving data from remote locations to a central location
  • Combining data due to an acquisition or merger

• Data integration
  • Process of combining data residing at different sources and providing a unified view
  • Emerges in both commercial and scientific fields and is the focus of extensive theoretical work
  • Referred to as Enterprise Information Integration

Page 24

• Master data management
  • Processes and tools to define and manage non-transactional data
  • Provides for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data
  • Ensures consistency and control

• Data warehouse
  • Repository of electronically stored data
  • Facilitates populating, reporting and analysis
  • Metadata retrieval can be done

• Data synchronization
  • Process of making sure two or more locations contain the same up-to-date files
  • When a file is added, changed, or deleted at one location, synchronization mirrors the action at the others

Page 25

Data Migration

Data consolidation

Data integration

Master data management

Data warehouse

Data synchronization

Page 26

Contextual implementation

• Documentation
  • Data sources/targets/transformations
  • Data lineage

• Important to know and publish
  • Frequency of ETL process runs
  • Error handling
  • Support and monitoring checklist

• RTP

Page 27

ELT (Extracting, Loading, Transforming)

The newer trend suggests the use of ETLT systems. ETLT represents an intermediate solution between ETL and ELT, allowing the designer to use the best solution for the current need. 

Page 28

Problem Statement

One of the problems in building any data warehouse is the process of extracting, transforming, cleansing, and loading the data from the source system

Almost all ETL tools and systems, whether based on off-the-shelf products or custom-coded, operate in a batch mode 

They assume that the data becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or monthly

Then the system transforms and cleanses the data and loads it into the data warehouse

This process typically involves downtime of the data warehouse, so no users are able to access it while the load takes place. Since these loads are usually performed late at night, this scheduled downtime typically does not inconvenience many users

When loading data continuously in real-time, there can't be any system downtime.

Page 29

Existing Approach (from the literature)

Instead of loading the data in real time into the actual warehouse tables, the data can be continuously fed into staging tables that are in exactly the same format as the target tables.

Depending on the data modeling approach being used, the staging tables will either contain a copy of just the data for the current day or, for smaller fact tables, a complete copy of all the historical data.

Then, periodically, the staging table is duplicated and the copy is swapped with the fact table, bringing the data warehouse instantly up to date.

Depending on how the swap is handled by the particular database, it might be advisable to temporarily pause the Online Analytical Processing (OLAP) server while this flip takes place, so that no new queries are initiated while the swap occurs.
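The staging-and-swap idea can be sketched as follows. This is an illustration under assumptions, not the mechanism of any particular warehouse product: SQLite stands in for the warehouse, the table names `fact_sales` and `staging_sales` are hypothetical, and the staging table is assumed to hold a complete copy of the history (the "smaller fact tables" case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the warehouse database
conn.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER, amount REAL);
    CREATE TABLE staging_sales (sale_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 10.0);
    INSERT INTO staging_sales VALUES (1, 10.0);
""")

def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchall()[0][0]

def feed_staging(conn, sale_id, amount):
    """Continuous feed: new rows go to the staging table only."""
    conn.execute("INSERT INTO staging_sales VALUES (?, ?)", (sale_id, amount))
    conn.commit()

def swap_staging_into_fact(conn):
    """Duplicate the staging table, then swap the copy in as the fact table."""
    conn.executescript("""
        BEGIN;
        CREATE TABLE fact_sales_new AS SELECT * FROM staging_sales;
        DROP TABLE fact_sales;
        ALTER TABLE fact_sales_new RENAME TO fact_sales;
        COMMIT;
    """)

feed_staging(conn, 2, 25.0)
before = count_rows(conn, "fact_sales")   # fact table has not seen the new row yet
swap_staging_into_fact(conn)
after = count_rows(conn, "fact_sales")    # fact table is now up to date
```

Running the duplicate, drop and rename inside one transaction keeps the window in which readers could observe a missing fact table as small as the database allows. Whether queries must be paused during that window depends on the database, as the slide notes.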

Page 30

Our Approach

We can introduce an intermediate database server that is dedicated to a real-time cache mechanism.

First, the data are extracted from the data store, and part (roughly half) of the transformation is done in the Data Staging Area (DSA).

Then the data are saved in the real-time cache database.

Users can then execute queries directly on both the data warehouse and the cache database server, and the remaining transformation is performed depending on the queries.
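A minimal sketch of querying both stores at once is shown below. It is only one possible reading of the approach: two SQLite databases attached to one connection stand in for the warehouse and the cache server, the table and column names are hypothetical, and the "remaining transformation" is assumed to be trimming region names and converting cents to currency units at query time.

```python
import sqlite3

# Two in-memory databases stand in for the warehouse and the cache server.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS cache")
conn.executescript("""
    CREATE TABLE main.fact_sales (region TEXT, amount REAL);
    CREATE TABLE cache.recent_sales (region TEXT, amount_cents INTEGER);
    INSERT INTO main.fact_sales VALUES ('north', 100.0), ('south', 40.0);
    INSERT INTO cache.recent_sales VALUES (' North', 2550), ('south ', 1000);
""")

def total_by_region(conn):
    """Query history and real-time data together, finishing the
    transformation of the cached rows at query time."""
    sql = """
        SELECT region, SUM(amount) FROM (
            SELECT region, amount FROM main.fact_sales
            UNION ALL
            SELECT lower(trim(region)), amount_cents / 100.0 FROM cache.recent_sales
        )
        GROUP BY region ORDER BY region
    """
    return conn.execute(sql).fetchall()

totals = total_by_region(conn)
```

The cached rows are stored half-transformed and are only brought into the warehouse's representation inside the query, so the cost of the second half of the transformation is paid per query rather than per load.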

Page 31

Advantages 

The processing work is distributed, because the transformation is divided into two parts. The real-time data in the cache and the historical data in the warehouse can be queried simultaneously without frequent downtime

Disadvantage

It would require extra hardware for the establishment of the intermediate cache database server 

Page 32

Conclusion

• The approach from the literature and our approach differ in how they handle warehousing: the existing approach performs real-time data extraction into staging tables, whereas ours is oriented towards ETLT and focuses mainly on data integrity in the real-time cache database.

• In the existing approach, depending on how the particular database handles the swap, it might be advisable to temporarily pause the Online Analytical Processing (OLAP) server while the flip takes place. In our approach, by contrast, users can execute queries directly on both the data warehouse and the database server acting as a cache, and the remaining transformation is performed depending on the queries and the ETLT system.

Page 33

References

• http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf
• http://www.w3.org/XML/
• http://www.slideshare.net/cloveretl/introduction-to-etl-and-data-integration
• http://www.ksi.mff.cuni.cz/~pokorny/papers/BALTIC02.pdf
• http://www.xml.com/
• http://arxiv.org/abs/1308.6683
• http://www.sciencedirect.com/science/article/pii/S0306437909001203

Page 34

References

• http://airccj.org/CSCP/vol2/csit2133.pdf
• http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf
• http://tdwi.org/webcasts/2010/06/xml-the-key-to-the-next-generation-data-warehouse.aspx
• ftp://ftp.irit.fr/IRIT/SIG/[Tournier-10]%20IS%20Survey%20XML%20DW%20OLAP.pdf
• http://dssresources.com/papers/features/langseth/langseth02082004.html

Page 35

Thank you!


Any Questions?