Xml datawarehousing with ETL(Extracting, Transforming and Loading)

Project Topic XML Data warehousing

Group 5 Guided By:Keyur Patel (1100101106013) Dr Kalpdrum PassiShubham Shah (1100101106037) Manav Sharma ( 110010107056 )Ruturaj Raval (110090107036)

1

Introduction

• XML can be considered as a particular standard syntax for the exchange of semi-structured data

• Feature of semi-structured data model• Lack of schema, so that data is self-describing

• However, XML can be associated and validated using DTD and XML schema

• Difficult to store and retrieve data in warehouse in semi-structured form than in structured form

2

XML - The Key to the Next Generation Data Warehouse

• A large amount of data needed in decision-making processes is stored in the XML data format, which is widely used for E-commerce and Internet-based information exchange

• Importance of integrating XML data in data warehousing environments is becoming increasingly high

http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf

3



XML in Data warehouse

• XML documents are usually generated as part of a transaction, complex business process, or communication exchanged between partnering businesses

• Applying the information to Data Warehousing(DW)/Business Intelligence(BI) is challenging, because of the hierarchies and complex data structures that are typical of XML schema

4

Integration of XML documents within a data warehouse.

5http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf

A data mart is the access layer of the data warehouse environment that is used to get data out to the users

XML documents are integrated and stored into data warehouses

http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf


Data Warehouse Architecture

Data Warehouse

ETL ETL ETL ETL

RDBMS 1 RDBMS 2

HTML 1 XML 1

ETL pipeline

outputs

ETL

http://research.cs.wisc.edu/dibook/slides/Chapter_10.ppt6

http://research.cs.wisc.edu/dibook/slides/Chapter_10.ppt


• At the top – a centralized database• Generally configured for queries and appends• Many indices, materialized views, etc.

• Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools

7


• A traditional data warehouse architecture consists of four layers: 1. The data sources

• e.g. legacy systems, flat files or files under any format

2. The back-end• ETL(Extracting transforming and Loading) takes place here

3. The global data warehouse• Keeps record of data that result from the transformation, integration and aggregation

4. The front-end• Consists of applications and techniques that business users use to interact with data stored in the data warehouse

8

ETL

9

ETL

• ETL means Extract, transform, and load• Extracts data from outside sources• Transforms it to fit operational needs, which can include quality levels• Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)

10

ETL

• The set of operations taking place in the back stage of data warehouse architecture is generally known as the Extraction, Transformation, and Loading (ETL) processes

• ETL processes are responsible for the extraction of data from different, distributed, and often, heterogeneous data sources, their cleansing and customization in order to fit business needs and rules, their transformation in order to fit the data warehouse schema, and finally, their loading into a data warehouse

11

Extraction

• The extraction conceptually is the simplest step, aiming at the identification of the subset of source data that should be submitted to the ETL workflow for further processing.

• In practice, this task is not easy, basically, due to the fact that there must be minimum interference with the software configuration at the source side.

• This requirement is imposed by two factors: • (a) the source must suffer minimum overhead during the extraction, since other administrative activities also take place during that period, and,

• (b) both for technical and political reasons, administrators are quite reluctant to accept major interventions to their system’s configuration.

12

Transformation & Cleaning

• After their extraction from the sources, the data are transported into an intermediate storage area, where they are transformed and cleansed.

• That area is frequently called Data Staging Area (DSA), and physically, it can be either in a separate machine or the one used for the data warehouse.

• The transformation and cleaning tasks constitute the core functionality of an ETL process. • Depending on the application, different problems may exist and different kinds of transformations may be needed.

• The problems can be categorized as follows: • (a) schema-level problems: naming and structural conflicts, including granularity differences, • (b) record-level problems: duplicated or contradicting records, and consistency problems, • and (c) value-level problems: several low-level technical problems such as different value representations or different interpretation of the values.

13

Loading

• The appropriate transformations and cleaning operations, the data are loaded to the respective fact or dimension table of the data warehouse.

• There are two broad categories of solutions for the loading of data:• Bulk loading through a DBMS-specific utility or • Inserting data as a sequence of rows.

14

ETL Tools

• ETL tools are the equivalent of Schema mappings in virtual integration, but are more powerful

• Arbitrary pieces of code to take data from a source, convert it into data for the warehouse:

• import filters – read and convert from data sources• data transformations – join, aggregate, filter, convert data• de-duplication – finds multiple records referring to the same entity, merges them

• profiling – builds tables, histograms, etc. to summarize data• quality management – test against master values, known business rules, constraints, etc.

15

Example ETL Tool Chain

• This is an example for e-commerce loading• Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load.

Invoice line items

SplitDate-time

Filterinvalid

JoinFilterinvalid

Invalid dates/ times

Invaliditems

Itemrecords

Filternon -

match

Invalidcustomers

Group by customer

Customerbalance

Customerrecords

16

Implementation of the ETL system

ETL

Source A

Source B

17

• The process flow occurs as, Source A ETL Source B ETL Source A

• Source A• Files• Database• Message queues• Web services

• ETL• Read• Apply logic• Write

• Source B• Files• Database• Message queues• Web services

18

ETL is a bridge for bi-directional flow• True data integration is agnostic of source or target application

Four staging steps for data warehouse to be implemented

Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004

19

• How the system is included in implementation of ETL?• Scripting (shell, perl, python) • PL/SQL, sqlldr • Transformation hardcoded in Java, C# • Develop (universal) ETL tool in-house • Using off-the-shelf ETL tool

20

ETL tools for implementation over data warehouse

• Commercial • Ab Initio • IBM DataStage • Informatica PowerCenter • Microsoft Data Integration Services • Oracle Data Integrator • SAP Business Objects – Data Integrator • SAS Data Integration Studio

• Open-source based • Adeptia Integration Suite • Apatar • CloverETL • Pentaho Data Integration (Kettle) • Talend Open Studio/Integration Suite

21

Data migration scheme over implementation

22

ETL solutions for integration in data warehouse (6 parts)

• Data migration• Process of transferring data between storage types or formats• An automated migration frees up human resources from tedious tasks• Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs

• Data consolidation• Usually associated with moving data from remote locations to a central location • combining data due to an acquisition or merger

• Data integration• Process of combining data residing at different sources and providing a unified view• Emerges in both commercial and scientific fields and is focus of extensive theoretical work • Referred to as Enterprise Information Integration

23

• Master data management• Processes and tools to define and manage non-transactional data

• Provides for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data

• Ensures consistency and control

• Data warehouse• Repository of electronically stored data• facilitates populating, reporting and analysis• metadata retrieval can be done

• Data synchronization• Process of making sure two or more locations contain the same up-to-date files

• Add, change, or delete a file from one location, synchronization will mirror the action

24

Data Migration

Data consolidation

Data integration

Master data management

Data warehouse

Data synchronization

25

Contextual implementation

• Documentation• Data sources/target/transformations• Data lineage

• Important to know and publish• Frequency of ETL processes runs• Error handling• Support- monitoring checklist

• RTP

26

ELT-Extracting Loading Transforming

The newer trend suggests the use of ETLT systems. ETLT represents an intermediate solution between ETL and ELT, allowing the designer to use the best solution for the current need.

27

Problem Statement

One of the problem of building any data warehouse is the process of extracting, transforming, cleansing, and loading the data from the source system

Almost all ETL tools and systems, whether based on off-the-shelf products or custom-coded, operate in a batch mode

They assume that the data becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or monthly

Then the system transforms and cleanses the data and loads it into the data warehouse

This process typically involves downtime of the data warehouse, so no users are able to access it while the load takes place. Since these loads are usually performed late at night, this scheduled downtime typically does not inconvenience many users

When loading data continuously in real-time, there can't be any system downtime.

28

Existing Approach(From literature)

Instead of loading the data in real-time into the actual warehouse tables, the data can be continuously fed into staging tables that are in the exact same format as the target tables.

Depending on the data modeling approach being used, the staging tables will either contain copy of just the data for the current day, or for smaller fact tables can contain a complete copy of all the historical data.

Then on a periodically the staging table is duplicated and the copy is swapped with the fact table, bring the data warehouse instantly up-to-date.

Depending upon the characteristics of how the swap is handled by the particular database, it might be advisable to temporally pause the Online Analytical Processing(OLAP) server while this flip takes place, no new queries are initiated while the swap occurs.

29

Our Approach

We can introduce an intermediate database server that is dedicated to have a real time cache mechanism.

First the data will be extracted from the data store and the half of the transformation will be done at the Data Staging Area(DSA)

Then the data will be saved in real time cache database.

So now the people can directly execute the queries on both the data warehouse and the database server and the remaining transformation will be performed depending upon the queries

30

Advantages

The processing work will be distributed as transformation is divided into 2 parts. The real time data in the cache and the history data in the warehouse can be queried simultaneously without frequent downtime

Disadvantage

It would require extra hardware for the establishment of the intermediate cache database server

31

Conclusion

• Talking about the actual author’s approach, and our approach towards the database system when the concerns is abut the warehousing will differ being a real time data extraction in data staging approach and the ours one will targeted towards the ETLT approach which will mainly focusing upon the data integrity about real time cache database.

• Characteristics of how the swap is handled by the particular database, it might be advisable to temporally pause the Online Analytical Processing(OLAP) server while this flip takes place in existing approach where on other side our approach focuses on people can directly execute the queries on both the data warehouse and the database server acting as cache and the remaining transformation will be performed depending upon the queries and ETLT system.

32

References

• http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf• http://www.w3.org/XML/• http://www.slideshare.net/cloveretl/introduction-to-etl-and-data-integration

• http://www.ksi.mff.cuni.cz/~pokorny/papers/BALTIC02.pdf• http://www.xml.com/• http://arxiv.org/abs/1308.6683• http://www.sciencedirect.com/science/article/pii/S0306437909001203

33


http://www.w3.org/XML/

http://www.w3.org/XML/

http://www.slideshare.net/cloveretl/introduction-to-etl-and-data-integration

http://www.slideshare.net/cloveretl/introduction-to-etl-and-data-integration

http://www.ksi.mff.cuni.cz/~pokorny/papers/BALTIC02.pdf

http://www.xml.com/

http://arxiv.org/abs/1308.6683

http://www.sciencedirect.com/science/article/pii/S0306437909001203

http://www.sciencedirect.com/science/article/pii/S0306437909001203

References

• http://airccj.org/CSCP/vol2/csit2133.pdf• http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf• http://tdwi.org/webcasts/2010/06/xml-the-key-to-the-next-generation-data-warehouse.aspx

• ftp://ftp.irit.fr/IRIT/SIG/[Tournier-10]%20IS%20Survey%20XML%20DW%20OLAP.pdf

• http://dssresources.com/papers/features/langseth/langseth02082004.html

34

http://airccj.org/CSCP/vol2/csit2133.pdf


http://tdwi.org/webcasts/2010/06/xml-the-key-to-the-next-generation-data-warehouse.aspx

http://tdwi.org/webcasts/2010/06/xml-the-key-to-the-next-generation-data-warehouse.aspx

http://dssresources.com/papers/features/langseth/langseth02082004.html

http://dssresources.com/papers/features/langseth/langseth02082004.html

Thank you!

35

Any Questions?

Engineering

Xml datawarehousing with ETL(Extracting, Transforming and Loading)