Upload
keyur-patel
View
212
Download
3
Embed Size (px)
DESCRIPTION
Small presentation by group of 4 - Keyur Patel, Manav Sharma, Shubham Shah and Ruturaj Rawal at Laurentian University, Greater Sudbury, ON, Canada on the research topic of XML data warehousing(with ETL). Thanks to many of the Research papers as well as Slideshare for reference. Special thanks to Manav Sharma for "Our approach" section
Citation preview
Project Topic XML Data warehousing
Group 5 Guided By:Keyur Patel (1100101106013) Dr Kalpdrum PassiShubham Shah (1100101106037) Manav Sharma ( 110010107056 )Ruturaj Raval (110090107036)
1
Introduction
• XML can be considered as a particular standard syntax for the exchange of semi-structured data
• Feature of semi-structured data model• Lack of schema, so that data is self-describing
• However, XML can be associated and validated using DTD and XML schema
• Difficult to store and retrieve data in warehouse in semi-structured form than in structured form
2
XML - The Key to the Next Generation Data Warehouse
• A large amount of data needed in decision-making processes is stored in the XML data format, which is widely used for E-commerce and Internet-based information exchange
• Importance of integrating XML data in data warehousing environments is becoming increasingly high
http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf
3
XML in Data warehouse
• XML documents are usually generated as part of a transaction, complex business process, or communication exchanged between partnering businesses
• Applying the information to Data Warehousing(DW)/Business Intelligence(BI) is challenging, because of the hierarchies and complex data structures that are typical of XML schema
4
Integration of XML documents within a data warehouse.
5http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf
A data mart is the access layer of the data warehouse environment that is used to get data out to the users
XML documents are integrated and stored into data warehouses
Data Warehouse Architecture
Data Warehouse
ETL ETL ETL ETL
RDBMS 1 RDBMS 2
HTML 1 XML 1
ETL pipeline
outputs
ETL
http://research.cs.wisc.edu/dibook/slides/Chapter_10.ppt6
Data Warehouse Architecture
• At the top – a centralized database• Generally configured for queries and appends• Many indices, materialized views, etc.
• Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools
7
Data Warehouse Architecture
• A traditional data warehouse architecture consists of four layers: 1. The data sources
• e.g. legacy systems, flat files or files under any format
2. The back-end• ETL(Extracting transforming and Loading) takes place here
3. The global data warehouse• Keeps record of data that result from the transformation, integration and aggregation
4. The front-end• Consists of applications and techniques that business users use to interact with data stored in the data warehouse
8
ETL
9
ETL
• ETL means Extract, transform, and load• Extracts data from outside sources• Transforms it to fit operational needs, which can include quality levels• Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)
10
ETL
• The set of operations taking place in the back stage of data warehouse architecture is generally known as the Extraction, Transformation, and Loading (ETL) processes
• ETL processes are responsible for the extraction of data from different, distributed, and often, heterogeneous data sources, their cleansing and customization in order to fit business needs and rules, their transformation in order to fit the data warehouse schema, and finally, their loading into a data warehouse
11
Extraction
• The extraction conceptually is the simplest step, aiming at the identification of the subset of source data that should be submitted to the ETL workflow for further processing.
• In practice, this task is not easy, basically, due to the fact that there must be minimum interference with the software configuration at the source side.
• This requirement is imposed by two factors: • (a) the source must suffer minimum overhead during the extraction, since other administrative activities also take place during that period, and,
• (b) both for technical and political reasons, administrators are quite reluctant to accept major interventions to their system’s configuration.
12
Transformation & Cleaning
• After their extraction from the sources, the data are transported into an intermediate storage area, where they are transformed and cleansed.
• That area is frequently called Data Staging Area (DSA), and physically, it can be either in a separate machine or the one used for the data warehouse.
• The transformation and cleaning tasks constitute the core functionality of an ETL process. • Depending on the application, different problems may exist and different kinds of transformations may be needed.
• The problems can be categorized as follows: • (a) schema-level problems: naming and structural conflicts, including granularity differences, • (b) record-level problems: duplicated or contradicting records, and consistency problems, • and (c) value-level problems: several low-level technical problems such as different value representations or different interpretation of the values.
13
Loading
• The appropriate transformations and cleaning operations, the data are loaded to the respective fact or dimension table of the data warehouse.
• There are two broad categories of solutions for the loading of data:• Bulk loading through a DBMS-specific utility or • Inserting data as a sequence of rows.
14
ETL Tools
• ETL tools are the equivalent of Schema mappings in virtual integration, but are more powerful
• Arbitrary pieces of code to take data from a source, convert it into data for the warehouse:
• import filters – read and convert from data sources• data transformations – join, aggregate, filter, convert data• de-duplication – finds multiple records referring to the same entity, merges them
• profiling – builds tables, histograms, etc. to summarize data• quality management – test against master values, known business rules, constraints, etc.
15
Example ETL Tool Chain
• This is an example for e-commerce loading• Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load.
Invoice line items
SplitDate-time
Filterinvalid
JoinFilterinvalid
Invalid dates/ times
Invaliditems
Itemrecords
Filternon -
match
Invalidcustomers
Group by customer
Customerbalance
Customerrecords
16
Implementation of the ETL system
ETL
Source A
Source B
17
• The process flow occurs as, Source A ETL Source B ETL Source A
• Source A• Files• Database• Message queues• Web services
• ETL• Read• Apply logic• Write
• Source B• Files• Database• Message queues• Web services
18
ETL is a bridge for bi-directional flow• True data integration is agnostic of source or target application
Four staging steps for data warehouse to be implemented
Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004
19
• How the system is included in implementation of ETL?• Scripting (shell, perl, python) • PL/SQL, sqlldr • Transformation hardcoded in Java, C# • Develop (universal) ETL tool in-house • Using off-the-shelf ETL tool
20
ETL tools for implementation over data warehouse
• Commercial • Ab Initio • IBM DataStage • Informatica PowerCenter • Microsoft Data Integration Services • Oracle Data Integrator • SAP Business Objects – Data Integrator • SAS Data Integration Studio
• Open-source based • Adeptia Integration Suite • Apatar • CloverETL • Pentaho Data Integration (Kettle) • Talend Open Studio/Integration Suite
21
Data migration scheme over implementation
22
ETL solutions for integration in data warehouse (6 parts)
• Data migration• Process of transferring data between storage types or formats• An automated migration frees up human resources from tedious tasks• Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs
• Data consolidation• Usually associated with moving data from remote locations to a central location • combining data due to an acquisition or merger
• Data integration• Process of combining data residing at different sources and providing a unified view• Emerges in both commercial and scientific fields and is focus of extensive theoretical work • Referred to as Enterprise Information Integration
23
• Master data management• Processes and tools to define and manage non-transactional data
• Provides for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data
• Ensures consistency and control
• Data warehouse• Repository of electronically stored data• facilitates populating, reporting and analysis• metadata retrieval can be done
• Data synchronization• Process of making sure two or more locations contain the same up-to-date files
• Add, change, or delete a file from one location, synchronization will mirror the action
24
Data Migration
Data consolidation
Data integration
Master data management
Data warehouse
Data synchronization
25
Contextual implementation
• Documentation• Data sources/target/transformations• Data lineage
• Important to know and publish• Frequency of ETL processes runs• Error handling• Support- monitoring checklist
• RTP
26
ELT-Extracting Loading Transforming
The newer trend suggests the use of ETLT systems. ETLT represents an intermediate solution between ETL and ELT, allowing the designer to use the best solution for the current need.
27
Problem Statement
One of the problem of building any data warehouse is the process of extracting, transforming, cleansing, and loading the data from the source system
Almost all ETL tools and systems, whether based on off-the-shelf products or custom-coded, operate in a batch mode
They assume that the data becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or monthly
Then the system transforms and cleanses the data and loads it into the data warehouse
This process typically involves downtime of the data warehouse, so no users are able to access it while the load takes place. Since these loads are usually performed late at night, this scheduled downtime typically does not inconvenience many users
When loading data continuously in real-time, there can't be any system downtime.
28
Existing Approach(From literature)
Instead of loading the data in real-time into the actual warehouse tables, the data can be continuously fed into staging tables that are in the exact same format as the target tables.
Depending on the data modeling approach being used, the staging tables will either contain copy of just the data for the current day, or for smaller fact tables can contain a complete copy of all the historical data.
Then on a periodically the staging table is duplicated and the copy is swapped with the fact table, bring the data warehouse instantly up-to-date.
Depending upon the characteristics of how the swap is handled by the particular database, it might be advisable to temporally pause the Online Analytical Processing(OLAP) server while this flip takes place, no new queries are initiated while the swap occurs.
29
Our Approach
We can introduce an intermediate database server that is dedicated to have a real time cache mechanism.
First the data will be extracted from the data store and the half of the transformation will be done at the Data Staging Area(DSA)
Then the data will be saved in real time cache database.
So now the people can directly execute the queries on both the data warehouse and the database server and the remaining transformation will be performed depending upon the queries
30
Advantages
The processing work will be distributed as transformation is divided into 2 parts. The real time data in the cache and the history data in the warehouse can be queried simultaneously without frequent downtime
Disadvantage
It would require extra hardware for the establishment of the intermediate cache database server
31
Conclusion
• Talking about the actual author’s approach, and our approach towards the database system when the concerns is abut the warehousing will differ being a real time data extraction in data staging approach and the ours one will targeted towards the ETLT approach which will mainly focusing upon the data integrity about real time cache database.
• Characteristics of how the swap is handled by the particular database, it might be advisable to temporally pause the Online Analytical Processing(OLAP) server while this flip takes place in existing approach where on other side our approach focuses on people can directly execute the queries on both the data warehouse and the database server acting as cache and the remaining transformation will be performed depending upon the queries and ETLT system.
32
References
• http://www-db.deis.unibo.it/~srizzi/PDF/dolap01.pdf• http://www.w3.org/XML/• http://www.slideshare.net/cloveretl/introduction-to-etl-and-data-integration
• http://www.ksi.mff.cuni.cz/~pokorny/papers/BALTIC02.pdf• http://www.xml.com/• http://arxiv.org/abs/1308.6683• http://www.sciencedirect.com/science/article/pii/S0306437909001203
33
References
• http://airccj.org/CSCP/vol2/csit2133.pdf• http://bias.csr.unibo.it/golfarelli/Papers/softcom01.pdf• http://tdwi.org/webcasts/2010/06/xml-the-key-to-the-next-generation-data-warehouse.aspx
• ftp://ftp.irit.fr/IRIT/SIG/[Tournier-10]%20IS%20Survey%20XML%20DW%20OLAP.pdf
• http://dssresources.com/papers/features/langseth/langseth02082004.html
34
Thank you!
35
Any Questions?