Data Warehouse Design Architectures

Amirkabir University

Morteza Zaker

Supervisor : Prof . Abbdolahzadeh

Presentation plan

Introduction
Data Warehouse Architecture
Concepts of dimensional model
History of Data Warehouse
Modeling issues
Conclusions

DW and OLAP – general concepts

Data warehouses (DW) contain historical data to support the decision-making process.

On-Line Analytical Processing (OLAP) systems facilitate the manipulation of DW data.

DW and OLAP require a clear definition of facts, dimensions, and hierarchies.

DW logical-level design is based on the star or snowflake schema.

Data Warehouse Architecture

Data Flow Architectures: Single DDS, NDS+DDS, ODS+DDS
System Architecture
Federated Architectures
ETL Architectures

[Figure: data warehouse data flow. Components shown: source system, data profiler, stage, ETL, DQ, DDS, MDB, metadata, control system + audit, DQ correction reports, and front-end applications (reports, ad hoc query, pivot tables, spreadsheet, analytics, data mining, other BI applications).]

Source systems are OLTP systems that contain the data to be loaded into the DW.

OLTP (online transaction processing) systems capture and store business transactions online.

Before loading, a data profiler examines the data to understand its characteristics.

For example, the data profiler reports how many rows a table has, which columns contain NULL values, and so on.
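The two profiling checks mentioned above (row count and NULL detection) can be sketched in a few lines. This is only an illustration: the `profile` function and the `customers` sample are hypothetical names, and a real profiler would read from the source database rather than from an in-memory list.

```python
from collections import Counter

def profile(rows):
    """Return the row count and, per column, how many values are None (NULL)."""
    null_counts = Counter()
    columns = set()
    for row in rows:
        columns.update(row)
        for col, value in row.items():
            if value is None:
                null_counts[col] += 1
    return {
        "row_count": len(rows),
        "null_counts": {col: null_counts[col] for col in sorted(columns)},
    }

# Hypothetical sample of a source table.
customers = [
    {"id": 1, "name": "Ann", "city": "Tehran"},
    {"id": 2, "name": None, "city": "Shiraz"},
    {"id": 3, "name": "Bob", "city": None},
]
report = profile(customers)
```

Here `report` holds the number of rows and a per-column NULL count, which is the kind of summary used to decide on data quality rules before the ETL is designed.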

The extract, transform, and load (ETL) system then brings data from the various source systems into a stage area.

The ETL system integrates and transforms the staged data and then loads it into the dimensional data store (DDS). While loading data into the DDS, the data quality part of the ETL (DQ ETL) applies various rules to check the data. Bad data is pushed into a DQ database for reporting and correction. Depending on the rule, bad data can also be corrected automatically or tolerated.

The ETL system is managed by the control system, based on the rules in the metadata. The metadata is a database that contains information about the data structure, data usage, data quality rules, and other information about the data. The audit system records what happens during the ETL process and logs system operations into the metadata database.
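A minimal sketch of the three outcomes described above (reject into the DQ store, correct automatically, or tolerate), with an audit trail. The rules, column names, and the `load_with_dq` function are assumptions made for illustration, not rules taken from the presentation.

```python
def load_with_dq(staged_rows):
    """Apply DQ rules while loading: reject, auto-correct, or tolerate."""
    dds, dq_store, audit_log = [], [], []
    for row in staged_rows:
        row = dict(row)  # copy so the staged data stays untouched
        if row["amount"] is None:
            # Reject: a missing measure goes to the DQ store for reporting.
            dq_store.append({"row": row, "rule": "amount_required"})
            audit_log.append(("rejected", row["id"]))
            continue
        if row["currency"] != row["currency"].upper():
            # Correct automatically: normalize the currency code.
            row["currency"] = row["currency"].upper()
            audit_log.append(("corrected", row["id"]))
        if row["note"] is None:
            # Tolerate: load the row anyway, but leave a trace in the audit log.
            audit_log.append(("tolerated", row["id"]))
        dds.append(row)
    return dds, dq_store, audit_log

staged = [
    {"id": 1, "amount": 10.0, "currency": "usd", "note": "ok"},
    {"id": 2, "amount": None, "currency": "EUR", "note": "ok"},
    {"id": 3, "amount": 5.5, "currency": "EUR", "note": None},
]
dds, dq_store, audit_log = load_with_dq(staged)
```

In a real system the rules would be read from the metadata database instead of being hard-coded, so that the control system can change them without touching the ETL code.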

Users can access the data through several front-end tools and applications.

Some applications operate on a multidimensional format, so data in the DDS is loaded into a multidimensional database (MDB).

A multidimensional database stores data in the form of a cube. The position of each cell of the cube is defined by a number of variables called dimensions. The dimension values show when and where a business event happened.
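To make the cube idea concrete, the sketch below stores each cell under a tuple of dimension values and rolls the cube up along chosen dimensions. The dimension names, the sample values, and the `roll_up` function are assumptions for illustration; a real MDB engine stores and indexes cells very differently.

```python
from collections import defaultdict

# Each cell is addressed by (date, store, product); the cell value is sales.
DIMENSIONS = ("date", "store", "product")
cube = {
    ("2024-01-01", "Tehran", "tea"): 120.0,
    ("2024-01-01", "Shiraz", "tea"): 80.0,
    ("2024-01-02", "Tehran", "coffee"): 50.0,
}

def roll_up(cube, keep):
    """Aggregate the cube, keeping only the named dimensions."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(float)
    for coords, value in cube.items():
        out[tuple(coords[i] for i in idx)] += value
    return dict(out)

sales_by_store = roll_up(cube, ["store"])
```

Rolling up to `store` answers "where" questions; keeping `date` instead would answer "when" questions from the same cells.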

[Figure: single DDS architecture. Components shown: source systems (s1, s2), stage, ETL + DQ, DDS, MDB, applications, control system + audit, metadata.]

The advantage of the single DDS architecture is that it is simple to design, because the data from the stage is loaded straight into the dimensional data store without going through any kind of normalized store. It suits a system that has just one source or needs only one dimensional store.

The main disadvantage is that, in this architecture, it is more difficult to create a second DDS. The DDS in the single DDS architecture is the master data store.

1. Extract data from several source systems.

2. Push it into the stage area. The stage area can be a database or a file system.

3. The stage is necessary because of limited memory space, among other reasons.

1. A second ETL package picks up the data from the stage and integrates it.

2. It applies data quality rules.

3. It puts the consolidated data into the DDS.

1. The control system and audit manage the ETL system concurrently.

2. They log the ETL process to a metadata file or database.

3. The metadata describes the data structures and the data processing within the data warehouse.
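The steps above can be sketched as two small ETL packages with a file-based stage and an audit list standing in for the metadata log. All names here (`extract_to_stage`, `load_dds`, the column names, the lower-casing integration rule) are hypothetical and chosen only to keep the example short.

```python
import csv
import tempfile
from pathlib import Path

def extract_to_stage(sources, stage_dir, audit):
    """First ETL package: copy each source's rows into a CSV file in the stage."""
    for name, rows in sources.items():
        path = Path(stage_dir) / f"{name}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["customer", "amount"])
            writer.writeheader()
            writer.writerows(rows)
        audit.append(("staged", name, len(rows)))

def load_dds(stage_dir, audit):
    """Second ETL package: integrate staged files and consolidate per customer."""
    totals = {}
    for path in sorted(Path(stage_dir).glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = row["customer"].strip().lower()  # assumed integration rule
                totals[key] = totals.get(key, 0.0) + float(row["amount"])
    audit.append(("loaded", "dds", len(totals)))
    return totals

sources = {
    "s1": [{"customer": "Ann", "amount": 10}],
    "s2": [{"customer": "ann ", "amount": 5}, {"customer": "Bob", "amount": 7}],
}
audit = []
with tempfile.TemporaryDirectory() as stage_dir:
    extract_to_stage(sources, stage_dir, audit)
    dds = load_dds(stage_dir, audit)
```

Because the stage lives on disk, the second package can be rerun after a failure without going back to the source systems.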

Single DDS

[Figure: NDS + DDS architecture. Components shown: source systems (s1, s2), stage, NDS ETL + DQ, NDS, DDS ETL, DDS, MDB, applications, control system + audit, metadata.]

1. Data stores: stage, NDS, and DDS.
2. Core DW store: normalized and dimensional formats.
3. Data marts: 1 to N data marts in each DDS.
4. ETL engine: 4 ETL packages.
5. The NDS contains master tables and transaction tables.
6. Master tables become dimensions in the DDS.
7. Transaction tables become facts in the DDS.

1. The NDS sits in front of the DDS and is the master data store. It contains all the historical and structural data.

2. The DDS holds the transactional data and can be limited to a single year of data.
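As a rough sketch of points 5 to 7, the example below uses an in-memory SQLite database: a master table and a transaction table in the NDS are turned into a dimension and a fact table in the DDS, and the DDS is loaded with one year of data only. The table names, columns, and the chosen year are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- NDS: normalized master and transaction tables (hypothetical schema)
    CREATE TABLE nds_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE nds_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                            order_date TEXT, amount REAL);
    INSERT INTO nds_customer VALUES (1, 'Ann', 'Tehran'), (2, 'Bob', 'Shiraz');
    INSERT INTO nds_order VALUES (10, 1, '2023-05-01', 40.0),
                                 (11, 1, '2024-02-01', 25.0),
                                 (12, 2, '2024-03-01', 60.0);

    -- DDS: master table -> dimension, transaction table -> fact
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
                               customer_id INTEGER, name TEXT, city TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, order_date TEXT, amount REAL);
""")
con.execute("INSERT INTO dim_customer (customer_id, name, city) "
            "SELECT customer_id, name, city FROM nds_customer ORDER BY customer_id")
# The DDS keeps only one year of transactions; the NDS keeps all of them.
con.execute("""
    INSERT INTO fact_sales
    SELECT d.customer_key, o.order_date, o.amount
    FROM nds_order o JOIN dim_customer d ON d.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01' AND o.order_date < '2025-01-01'
""")
sales_by_city = con.execute("""
    SELECT d.city, SUM(f.amount) FROM fact_sales f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.city ORDER BY d.city
""").fetchall()
```

Since the NDS still holds every order, a second DDS with a different time window could be built from it with a similar pair of statements.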


NDS + DDS

In the ODS + DDS architecture we have:

1. Data stores: stage, ODS, and DDS.
2. Core DW store: normalized and dimensional formats.
3. Data marts: 1 to N data marts in each DDS.
4. ETL engine: 4 ETL packages.
5. The ODS contains master tables and transaction tables, but it is not the master data store.
6. Master tables become dimensions in the DDS.
7. Transaction tables become facts in the DDS.

The advantage of this architecture is that the third normal form store (the ODS) is slimmer than the NDS because it contains only current values.

In this architecture we have a central place to integrate, maintain, and publish master data.

The normalized relational store is updatable by the user application.

The main disadvantage is that, as in the single DDS architecture, it is more difficult to create a second DDS, because the DDS is the master data store.
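The difference between a store that keeps only current values and one that keeps history can be sketched as follows. The `apply_change` function and the record layout are hypothetical; the point is only that the ODS overwrites while an NDS-style store appends.

```python
def apply_change(ods, nds_history, customer_id, city, changed_on):
    """Record a master data change in both styles of store."""
    # ODS style: overwrite, so only the current value survives.
    ods[customer_id] = {"city": city}
    # NDS style: append, so every historical version is kept.
    nds_history.append(
        {"customer_id": customer_id, "city": city, "valid_from": changed_on}
    )

ods, nds_history = {}, []
apply_change(ods, nds_history, 1, "Tehran", "2023-01-01")
apply_change(ods, nds_history, 1, "Shiraz", "2024-06-01")
```

After two changes the ODS holds one row for the customer while the history holds two, which is why the ODS stays slimmer and why it cannot serve as the source for rebuilding historical data.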

[Figure: ODS + DDS architecture. Components shown: source systems (s1, s2), stage, ODS ETL + DQ, ODS, DDS ETL, DDS, MDB, applications, control system + audit, metadata.]

ODS + DDS

The ODS is a hybrid data store, so users can also access data from the ODS.

[Figure: three federated data warehouse (FDW) configurations. Panel 1: DW1, DW2, DW3, ETL, FDW, application. Panel 2: DW1, DW2, DW3, EII, FDW, application. Panel 3: DM1, DM2, DM3, ETL, FDW, application.]

1. The FDW ETL needs to match the update frequency of the source DWs.

2. The FDW ETL needs to integrate the data from the source DWs based on business rules.

3. Duplicate records need to be merged.

4. The subject area here is narrower than that of the source DWs.
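Merging duplicate records across source DWs might look like the sketch below. The business rule used here (records with the same e-mail address, ignoring case, are the same customer, and the most recently updated version wins) is an assumption for illustration, as are the function and field names.

```python
def merge_source_dws(*source_dws):
    """Integrate customer records from several source DWs, merging duplicates."""
    merged = {}
    for dw in source_dws:
        for rec in dw:
            key = rec["email"].strip().lower()  # assumed identity rule
            current = merged.get(key)
            if current is None or rec["updated"] > current["updated"]:
                merged[key] = rec  # keep the most recently updated version
    return merged

dw1 = [{"email": "ann@example.com", "city": "Tehran", "updated": "2024-01-01"}]
dw2 = [
    {"email": "Ann@Example.com", "city": "Shiraz", "updated": "2024-05-01"},
    {"email": "bob@example.com", "city": "Yazd", "updated": "2023-01-01"},
]
fdw = merge_source_dws(dw1, dw2)
```

The ISO date strings compare correctly as text, which keeps the "latest wins" rule simple in this sketch.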

EII (Enterprise Information Integration)

1. EII is a method to integrate data by accessing different source systems online and aggregating the outputs on the fly before bringing the end result to the user.

2. All three DWs must be standardized to the same structure.
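A small sketch of the EII idea: nothing is copied into a central store; each source is queried when the user asks, and the partial results are combined on the fly. The three in-memory SQLite databases stand in for the source DWs, and the `sales` table and function names are hypothetical.

```python
import sqlite3

def make_dw(rows):
    """Stand-in for one source DW; all DWs share the same table structure."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

def eii_sales_by_region(dws):
    """Query each DW at request time and combine the partial results."""
    totals = {}
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    for con in dws:
        for region, amount in con.execute(query):
            totals[region] = totals.get(region, 0.0) + amount
    return totals

dws = [
    make_dw([("north", 10.0), ("south", 5.0)]),
    make_dw([("north", 2.5)]),
    make_dw([("south", 4.0), ("south", 1.0)]),
]
result = eii_sales_by_region(dws)
```

The same query text runs against every source, which only works because all sources expose the same structure, as point 2 requires.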

1. Data marts in the same data warehouse are nonintegrated data marts.

2. They can be dimensional, normalized, or neither.

Federated DW

[Figure: system architecture. Labels: Web Server (Win 2000, SQL 2000); IBM DB2; Informix; ETL Server (Oracle 11g, 4 processors, 16 GB RAM); NDS + DDS (a set of 2 for failover and clustering); Clients; OLAP Server (SQL Server 2008 SSAS); Report Server (web farm); Storage Area Network (SAN), 20 terabytes; HP; OLE DB; ODBC; fiber network; gigabit network.]

System Architecture

[Figure: four ETL architecture panels. Labels in each panel: Source System, Extract, ETL Server, Transform, Load, DW Database server. Panel 3 adds "Stage on disk"; panel 4 adds "No stage, in memory".]

ETL Architectures

Main issues that must be considered

There are two different types of database software:

1. Symmetric multiprocessing (SMP): a database system that runs on one or more machines with several identical processors sharing the same disk storage. The database is physically located in a single disk storage system. Examples of SMP database systems are SQL Server, Oracle, DB2, Informix, and Sybase.

2. Massively parallel processing (MPP): a database system that runs on more than one machine, where each machine has its own disk storage. The database is physically located in several disk storage systems that are interconnected. Examples of MPP database systems are Teradata, Neoview, and Netezza.

An MPP database system is generally faster and more scalable than an SMP database system. In an MPP database system, a table is physically located in several nodes, each with its own storage.
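How a table can be spread over several nodes is sketched below with hash distribution on a key column. This is an assumption made for illustration: MPP products use their own distribution schemes, and the code runs sequentially in one process, so it shows only the data layout and the "partial result per node" pattern, not real parallelism.

```python
import zlib

def node_for(key, n_nodes):
    """Pick a node by hashing the distribution key (stable across runs)."""
    return zlib.crc32(str(key).encode()) % n_nodes

def distribute(rows, key, n_nodes):
    """Place every row on exactly one node according to its key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[node_for(row[key], n_nodes)].append(row)
    return nodes

def parallel_sum(nodes, column):
    # Each node sums its own slice; a coordinator adds the partial results.
    return sum(sum(r[column] for r in node) for node in nodes)

rows = [{"order_id": i, "amount": 10 * i} for i in range(1, 9)]
nodes = distribute(rows, "order_id", 3)
total = parallel_sum(nodes, "amount")
```

Because each node scans only its own slice, adding nodes shrinks the slice each one has to read, which is the intuition behind the scalability claim above.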

Research challenges (1)

Spatial measure aggregations, considering their types:

Distributive: reuse of aggregates, e.g., spatial union
Algebraic: additional treatments for reusing aggregates, e.g., center of n points
Holistic: new calculation with raw data, e.g., equi-partition

Topological relationships between hierarchy levels

Types of hierarchies
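The three aggregation types can be illustrated on a small point set split into two parts. Two simplifications are assumed here: the spatial union is modeled as a plain set union of points, and the holistic case uses the median of the x coordinates as a stand-in for equi-partition, since both need the raw data.

```python
from statistics import median

# Two partitions of a point set, e.g. two child members of a hierarchy level.
part_a = [(0.0, 0.0), (2.0, 0.0)]
part_b = [(4.0, 6.0)]

# Distributive: the union of the parts' unions is the union of the whole.
union_all = set(part_a) | set(part_b)

# Algebraic: the center is not reusable directly, but (sum_x, sum_y, n) is.
def summary(points):
    return (sum(x for x, _ in points), sum(y for _, y in points), len(points))

def center(*summaries):
    sx = sum(s[0] for s in summaries)
    sy = sum(s[1] for s in summaries)
    n = sum(s[2] for s in summaries)
    return (sx / n, sy / n)

center_all = center(summary(part_a), summary(part_b))

# Holistic: the median x needs the raw points; medians of parts do not combine.
median_x = median(x for x, _ in part_a + part_b)
```

The practical consequence for a DW is that distributive and algebraic measures can be precomputed per hierarchy level and reused, while holistic measures have to be recomputed from the detailed data.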