datawarehouse-trainings
®
IBM Software Group
© 2007 IBM Corporation
IBM Software Group | WebSphere software
04/08/23 TCS Confidential 3
Course Roadmap
• Why we use Data warehousing
• Difference between Operational System and Data Warehouse
• Introduction to Data warehousing
• Data Warehousing Approaches
• Data Warehouse Technical Architecture
• Data Modelling concepts
• Operational Data Store
• Schema Design of Data warehouse
• Data Acquisition
• ETL Products
• Project Life Cycle
Why We Need Data Warehousing?
Better business intelligence for end-users
Reduction in time to locate, access, and analyze information
Consolidation of disparate information sources
To Store Large Volumes of Historical Detail Data from Mission Critical Applications
Strategic advantage over competitors
Faster time-to-market for products and services
Replacement of older, less-responsive decision support systems
Reduction in demand on IS to generate reports
What is an Operational System?
Operational systems are just what their name implies; they are the systems that
help us run the day-to-day enterprise operations.
These are the backbone systems of any enterprise, such as order entry and inventory.
The classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.
Characteristics of Operational Systems
• Continuous availability
• Predefined access paths
• Transaction integrity
• Volume of transaction - High
• Data volume per query - Low
• Used by operational staff
• Supports day to day control operations
• Large number of users
OLTP Vs Data Warehouse
Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing
OLTP Vs Warehouse
Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for a quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non-Volatile Data
Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness
What is a Data Warehouse?
A data warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data in support of management's decision-making process.
W. H. Inmon - Regarded as the Father of Data Warehousing
Subject Oriented Analysis
Transactional Storage (Process Oriented): Entry, Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address
Data Warehouse Storage (Subject Oriented): Sales, Customers, Products
Integration of Data
Integration issue | Transactional Storage | Data Warehouse Storage
Encoding | Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y | M, F
Unit of Attributes | Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf | pipeline cm
Physical Attributes | Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float | balance dec(13,2)
Naming Conventions | Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance | balance
Data Consistency | Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute) | date (Julian)
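As a rough illustration of the integration step, the sketch below maps application-specific encodings and units onto one warehouse convention. The direction of the code mapping (for example, that Appl. B's 1 means M), the record layout and the function name are assumptions made for this example; the slide only lists the source and target formats.

```python
# Hypothetical per-application gender encodings; the slide lists the codes
# but not which code corresponds to which value, so this mapping is assumed.
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

INCHES_TO_CM = 2.54  # exact by definition


def integrate(record: dict, app: str) -> dict:
    """Return the record in the warehouse convention (M/F, centimetres).

    Only cm and inches are handled; mcf is a volume unit and would need a
    business rule rather than a unit conversion.
    """
    length = record["pipeline"]
    if record["unit"] == "inches":
        length = length * INCHES_TO_CM
    elif record["unit"] != "cm":
        raise ValueError(f"unsupported unit: {record['unit']}")
    return {
        "gender": GENDER_MAP[app][record["gender"]],
        "pipeline_cm": round(length, 2),
    }


row_b = integrate({"gender": "1", "pipeline": 10, "unit": "inches"}, "B")
row_c = integrate({"gender": "Y", "pipeline": 30, "unit": "cm"}, "C")
```

Keeping the mappings in plain dictionaries makes each source system's convention visible in one place, which is the point of the integration layer.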
Volatility of Data
Transactional Storage (Volatile): record-by-record data manipulation (Insert, Change, Delete, Access)
Data Warehouse Storage (Non-Volatile): mass load / access of data (Load, Access)
Time Variant Data Analysis
Transactional Storage: Current Data
Data Warehouse Storage: Historical Data
[Chart: Sales (in lakhs) by region (East, West, North) for January, February and March, Year 97, 1st Qtr]
Data Warehouse - Differences from Operational Systems
Operational systems database (constant change)
• Updated constantly
• Data changes according to need, not a fixed schedule
• Operations: Insert, Update, Delete
Data warehouse (consistent points in time)
• Added to regularly, but loaded data is rarely directly changed
• Does NOT mean the data warehouse is never updated or never changes!
• Operations: Initial Load, Incremental Load, Load/Update
Difference Between OLTP and OLAP
DW Implementation Approaches
Top Down
Bottom-up
Combination of both
Choices depend on: current infrastructure, resources, architecture, ROI, implementation speed
EDW - “Top Down” Approach
Heterogeneous Source Systems (Source1, Source2, Source3)
-> Staging (Common Staging interface Layer)
-> Enterprise Data Warehouse
-> Data mart bus architecture Layer: Incremental Architected data marts (DM 1, DM 2, DM 3)
EDW - “Bottom up” Approach
Heterogeneous Source Systems (Source1, Source2, Source3)
-> Staging (Common Staging interface Layer)
-> Data mart bus architecture Layer: Incremental Architected data marts (DM 1, DM 2, DM 3)
-> Enterprise Data Warehouse
Ralph Kimball’s Approach
Independent Data Marts: Ralph Kimball’s Ideology
Source System -> (Extract) -> Data Staging Area -> (Load) -> Presentation Area -> (Access) -> Data Access Tools
Data Staging Area
• Services: transform from source to target; maintain conformed dimensions; no user query support
• Data Store: flat files or relational tables
• Design Goals: staging throughput, integrity/consistency
Presentation Area
• Data Mart #1, Data Mart #2, Data Mart #...: dimensional; atomic AND summary data; business process centric
• Design Goals: ease of use, query performance
• Data Mart Bus: conformed facts and dimensions
Data Access Tools
• Ad hoc query tools
• Report writers
• Analytic applications
• Modeling: forecasting, scoring, data mining
Bottom Up Approach
Staging Data Store
• E/R design or flat file
• Retain history needed for regular processing
• No end user access
Data Marts
• Dimensional
• Transaction & summary data
• Data mart: single subject area (i.e. fact table)
• Multiple marts may exist in a single database instance
Data Warehouse
• Integrated data
• Timely user access
• Conformed dimensions
• Single process to build dimension
Bill Inmon’s Approach
Dependent Data Marts: Bill Inmon’s Ideology
Source System -> (Extract) -> Data Staging Area (ETL) -> (Load) -> “Enterprise Data Warehouse” (DWH) -> Presentation Area -> (Access) -> Data Access Tools
“Enterprise Data Warehouse”
• Normalized tables
• Atomic data
• User query support to atomic data
Presentation Area
• Data Mart #1, Data Mart #2, Data Mart #...: dimensional; summary data; departmental centric
Top Down Approach
Staging Data Store (Flat File)
• Raw input data
Data Warehouse
• E/R model
• Subject areas
• Transaction level detail
• Historical persistency as justified - archive for retrieval if needed
Data Marts
• Most are dimensional
• Data mart design by business function
• Summary level data
• Integrated data
• Timely user access
• Single process to build dimension
DW Implementation Approaches
Top Down
• More planning and design initially
• Involves people from different work-groups and departments
• Data marts may be built later from the global DW
• Overall data model to be decided up-front
Bottom Up
• Can plan initially without waiting for global infrastructure
• Built incrementally
• Can be built before or in parallel with the global DW
• Less complexity in design
DW Implementation Approaches
Top Down
• Consistent data definition and enforcement of business rules across the enterprise
• High cost, lengthy process, time consuming
• Works well when there is a centralized IS department responsible for all H/W and resources
Bottom Up
• Data redundancy and inconsistency between data marts may occur
• Integration requires great planning
• Less cost of H/W and other resources
• Faster pay-back
DW Architectures
[Figure: data warehouse architecture, from left to right]
Data Sources
• Transaction Data: Prod, Mkt, HR, Fin, Acctg (IBM IMS, VSAM, Oracle, Sybase)
• Other Internal Data: ERP (SAP), Clickstream, Informix
• Web Data
• External Data: Demographic (Harte-Hanks)
ETL Software
• Extract, Clean/Scrub, Transform, Load, using a Staging Area and an Operational Data Store
• Products shown: Ascential DataStage, Sagent, SAS, Firstlogic
Data Stores
• Data Warehouse, Meta Data, Data Marts (Finance, Marketing, Sales)
• Products shown: Teradata, IBM, Essbase, Microsoft
Data Analysis Tools and Applications
• Queries, Reporting, DSS/EIS, Data Mining
• Products shown: SQL, Cognos, SAS, MicroStrategy, Siebel, BusinessObjects, Web Browser
Users
• Analysts, Managers, Executives, Operational Personnel, Customers/Suppliers
Benefits of DWH
To formulate effective business, marketing
and sales strategies.
To precisely target promotional activity.
To discover and penetrate new markets.
To successfully compete in the marketplace
from a position of informed strength.
To build predictive rather than retrospective models.
Data Modeling
WHAT IS A DATA MODEL?
A data model is an abstraction of some aspect of the real world (system).
WHY A DATA MODEL?
• Helps to visualize the business
• A model is a means of communication.
• Models help elicit and document requirements.
• Models reduce the cost of change.
• Model is the essence of DW architecture based on which DW will be implemented
STEPS in DATA MODELING
Problem & scope definition
Requirement Gathering
Analysis
Logical Database Design
Deciding Database
Physical Database design
Schema Generation
Levels of modeling
Conceptual modeling
• Describe data requirements from a business point of view without technical details
Logical modeling
• Refine conceptual models
• Data structure oriented, platform independent
Physical modeling
• Detailed specification of what is physically implemented using specific technology
Modeling Techniques
Entity-Relationship Modeling
Traditional modeling technique
Technique of choice for OLTP
Suited for corporate data warehouse
Dimensional Modeling
Analyzing business measures in the specific business context
Helps visualize very abstract business questions
End users can easily understand and navigate the data structure
Entity-Relationship Modeling - Basic Concepts
Relationship
• Relationship between entities: structural interaction and association
• Described by a verb
• Cardinality: 1-1, 1-M, M-M
• Example: Books belong to Printed Media
Entity-Relationship Modeling - Basic Concepts
Attributes
• Characteristics and properties of entities
• Example: Book Id, Description, book category are attributes of entity “Book”
• Attribute name should be unique and self-explanatory
• Primary Key, Foreign Key, Constraints are defined on Attributes
Review of Logical Modeling Terms & Symbols
Entities define specific groups of information
Entity: Sales Organization (Sales Org ID, Distribution Channel)
Review of Logical Modeling Terms & Symbols
One or more attributes uniquely identify an instance of an entity
Identifier: Sales Org ID, in the entity Sales Organization (Sales Org ID, Distribution Channel)
Review of Logical Modeling Terms & Symbols
The logical model identifies relationships between entities
Relationship: Sales Detail (Sales Record ID) - Sales Rep (Sales Rep ID)
Logical Data Model
Entities and their identifiers:
• Sales Detail (Sales Record ID)
• Customer (Customer ID)
• Product (Product SKU)
• Suppliers (Supplier ID)
• Manufacturing Group (Manufacturing Org ID)
• Factory (Factory ID)
• Sales Organization (Sales Org ID, Distribution Channel)
• Sales Rep (Sales Rep ID)
• Product Sales Plan (Plan ID)
• Retail, Market, Wholesale, Industry
Examples: ER Model
Limitations of E-R Modeling
Poor Performance
Tend to be very complex and difficult to navigate.
Dimensional Modeling
Dimensional modeling uses three basic concepts: measures, facts, and dimensions.
It is powerful in representing the requirements of the business user in the context of database tables.
It focuses on numeric data, such as values, counts, weights, balances and occurrences.
What is a Fact?
A fact is a collection of related data items, consisting of measures and context data.
Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.
Granularity
The level of detail of data contained in the data warehouse
e.g. Daily item totals by product, by store
Types of Facts
Additive
• Able to add the facts along all the dimensions
• Discrete numerical measures, e.g. retail sales in $
Semi-Additive
• Snapshot, taken at a point in time
• Measures of intensity
• Not additive along the time dimension, e.g. account balance, inventory balance
• Added and divided by the number of time periods to get a time-average
Non-Additive
• Numeric measures that cannot be added across any dimensions
• Intensity measure averaged across all dimensions, e.g. room temperature
• Textual facts - AVOID THEM
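To make the difference between additive and semi-additive facts concrete, here is a minimal sketch with made-up rows: deposits are summed across every dimension, while balances are summed across accounts but averaged across days. The data and function names are illustrative assumptions.

```python
from collections import defaultdict

# (day, account, deposits_that_day, end_of_day_balance) - illustrative data
ROWS = [
    ("2000-01-01", "A", 100, 100),
    ("2000-01-02", "A", 50, 150),
    ("2000-01-01", "B", 200, 200),
    ("2000-01-02", "B", 0, 200),
]


def total_deposits(rows):
    """Additive fact: a plain sum is valid along every dimension."""
    return sum(deposit for _day, _acct, deposit, _bal in rows)


def average_daily_balance(rows):
    """Semi-additive fact: sum across accounts, then average across time."""
    per_day = defaultdict(int)
    for day, _acct, _deposit, balance in rows:
        per_day[day] += balance
    return sum(per_day.values()) / len(per_day)


deposits = total_deposits(ROWS)
avg_balance = average_daily_balance(ROWS)
```

Summing the balance column directly over both days would double count money that simply stayed in the accounts, which is why the time dimension is handled with an average.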
Dimensions
A dimension is a collection of members or units of the same type of views.
Dimensions determine the contextual background for the facts.
Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how
Dimensional Hierarchy
Geography Dimension
World Level: World
Continent Level: America, Europe, Asia
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Tampa, Miami, Orlando, Naples
Each entry is a dimension member (business entity) linked to the level above by a parent relation.
Attributes: Population, Tourist’s Place
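The parent relation in a hierarchy like this one can be sketched as a simple child-to-parent map. The structure below is an assumption about how the figure's members connect (the cities are taken to belong to FL); it is only meant to show how a roll-up path is derived.

```python
# Child -> parent links for the Geography dimension (assumed from the figure).
PARENT = {
    "Tampa": "FL", "Miami": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "GA": "USA", "VA": "USA", "CA": "USA", "WA": "USA",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "America": "World", "Europe": "World", "Asia": "World",
}


def path_to_root(member: str) -> list:
    """Follow parent relations from a member up to the top level."""
    path = [member]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


tampa_path = path_to_root("Tampa")
```

A query tool uses the same path to drill up from city to state, country, continent and world.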
Dimensions Types
Conformed Dimension
Junk Dimension
Fast Changing Dimension
Role Playing Dimension
‘Garbage’ Dimension
Slowly Changing Dimension
Degenerated Dimension
What is a Slowly Changing Dimension?
Although dimension tables are typically static lists, most dimension tables do change over
time.
Since these changes are smaller in magnitude compared to changes in fact tables, these
dimensions are known as slowly growing or slowly changing dimensions.
Slowly Changing Dimension - Classification
Slowly changing dimensions are classified into three different types:
TYPE I
TYPE II
TYPE III
Slowly Changing Dimensions Type I
[Figure: source and target rows with columns Emp id, Name, Email for employee 1001 (Shane), shown before and after a change. With Type I the target row is overwritten in place, so no history is kept.]
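A minimal sketch of a Type I load, assuming the target is keyed by Emp id: the incoming row simply replaces whatever the target held. The email addresses are placeholders, since the slide's values are not available.

```python
def apply_type1(target: dict, source_row: dict) -> None:
    """Insert the row, or overwrite the existing row with the same emp_id."""
    target[source_row["emp_id"]] = dict(source_row)


target = {}
apply_type1(target, {"emp_id": 1001, "name": "Shane", "email": "shane@old.example.com"})
# The email changes in the source; the old value is lost in the target.
apply_type1(target, {"emp_id": 1001, "name": "Shane", "email": "shane@new.example.com"})
```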
Slowly Changing Dimensions Type II
[Figure: the source row (Emp id 10, Name Shane, Email) is loaded into the target as PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_VERSION_NUMBER 0.]
Slowly Changing Dimensions - Versioning
[Figure: the source row for Emp id 10 (Shane) changes; the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_VERSION_NUMBER) now holds two versions, with keys 1000 and 1001.]
Slowly Changing Dimensions - Versioning
[Figure: after a further change to the source row for Emp id 10 (Shane), the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_VERSION_NUMBER) holds three versions, with keys 1000, 1001 and 1003.]
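The versioning variant of Type II can be sketched as follows. This is an illustration under assumptions: the surrogate keys come from a plain counter (the slide's own key sequence is not reproduced), only name and email are tracked for changes, and the email values are placeholders.

```python
from itertools import count

TRACKED = ("name", "email")  # attributes whose change creates a new version


def apply_type2_version(target: list, source_row: dict, keys) -> None:
    """Append a new version when a tracked attribute changed; never overwrite."""
    history = [r for r in target if r["emp_id"] == source_row["emp_id"]]
    if history:
        latest = max(history, key=lambda r: r["version"])
        if all(latest[col] == source_row[col] for col in TRACKED):
            return  # nothing changed, keep the current version
        version = latest["version"] + 1
    else:
        version = 0
    target.append({"pk": next(keys), **source_row, "version": version})


keys = count(1000)  # hypothetical surrogate key generator
target = []
for email in ("shane@a.example.com", "shane@b.example.com",
              "shane@b.example.com", "shane@c.example.com"):
    apply_type2_version(target, {"emp_id": 10, "name": "Shane", "email": email}, keys)
```

The row with the highest version number for an Emp id is the current one; earlier rows remain available for historical reporting.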
Slowly Changing Dimensions Type II - Flag
[Figure: the source row (Emp id 10, Name Shane, Email) is loaded into the target as PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_CURRENT_FLAG Y.]
Slowly Changing Dimensions - Flag Current
[Figure: the source row for Emp id 10 (Shane) changes; the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_CURRENT_FLAG) holds rows 1000 and 1001, with the flag marking the current row.]
Slowly Changing Dimensions - Flag Current
[Figure: after a further change, the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_CURRENT_FLAG) holds rows 1000, 1001 and 1003, with the flag marking the current row.]
Slowly Changing Dimensions Type II - Effective Date
[Figure: the source row (Emp id 10, Name Shane, Email) is loaded into the target as PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_BEGIN_DATE 01/01/00, with PM_END_DATE left empty.]
Slowly Changing Dimensions - Effective Date
[Figure: the source row for Emp id 10 (Shane) changes; the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_BEGIN_DATE, PM_END_DATE) holds row 1000 (01/01/00 to 03/01/00) and row 1001 (from 03/01/00, end date open).]
Slowly Changing Dimensions - Effective Date
[Figure: after a further change, the target (PM_PRIMARYKEY, Emp id, Name, Email, PM_BEGIN_DATE, PM_END_DATE) holds row 1000 (01/01/00 to 03/01/00), row 1001 (03/01/00 to 05/02/00) and row 1003 (from 05/02/00, end date open).]
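The effective-date variant can be sketched in the same spirit: when a tracked value changes, the open row is closed with an end date and a new open row is added. The dates below read the slide's values as mm/dd/yy, which is an assumption, and the emails and function name are placeholders.

```python
from datetime import date


def apply_type2_dates(target: list, source_row: dict, effective: date, pk: int) -> None:
    """Close the open row for this emp_id (if changed) and append a new one."""
    current = next(
        (r for r in target
         if r["emp_id"] == source_row["emp_id"] and r["end_date"] is None),
        None,
    )
    if current is not None:
        if current["email"] == source_row["email"]:
            return  # no change, nothing to do
        current["end_date"] = effective
    target.append({"pk": pk, **source_row, "begin_date": effective, "end_date": None})


target = []
apply_type2_dates(target, {"emp_id": 10, "name": "Shane", "email": "shane@a.example.com"},
                  date(2000, 1, 1), 1000)
apply_type2_dates(target, {"emp_id": 10, "name": "Shane", "email": "shane@b.example.com"},
                  date(2000, 3, 1), 1001)
apply_type2_dates(target, {"emp_id": 10, "name": "Shane", "email": "shane@c.example.com"},
                  date(2000, 5, 2), 1003)
```

A row whose end date is empty is the current one; any past date can be matched to exactly one row by checking begin and end dates.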
Slowly Changing Dimensions Type III
[Figure: the source row (Emp id 10, Name Shane, Email) is loaded into the target as PM_PRIMARYKEY 1, Emp id 10, Name Shane, with a PM_Prev_ColumnName column for the previous value and PM_EFFECT_DATE 01/01/00.]
Slowly Changing Dimensions Type III
[Figure: the source row for Emp id 10 (Shane) changes; the single target row (PM_PRIMARYKEY 1) keeps the new value, moves the old value to PM_Prev_ColumnName and sets PM_EFFECT_DATE to 01/02/00.]
Slowly Changing Dimensions Type III
[Figure: after a further change, the single target row (PM_PRIMARYKEY 1) again keeps the new value, holds only the immediately previous value in PM_Prev_ColumnName and sets PM_EFFECT_DATE to 01/03/00.]
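A Type III update keeps one row per member and one level of history in a "previous value" column. The sketch below assumes email is the tracked column and reads the slide's dates as mm/dd/yy; the column names and addresses are illustrative.

```python
from datetime import date


def apply_type3(row: dict, new_email: str, effective: date) -> None:
    """Shift the current value into prev_email and record when it changed."""
    if row["email"] == new_email:
        return
    row["prev_email"] = row["email"]
    row["email"] = new_email
    row["effect_date"] = effective


row = {"pk": 1, "emp_id": 10, "name": "Shane",
       "email": "shane@a.example.com", "prev_email": None,
       "effect_date": date(2000, 1, 1)}
apply_type3(row, "shane@b.example.com", date(2000, 1, 2))
apply_type3(row, "shane@c.example.com", date(2000, 1, 3))
```

After the second change the first address is gone: only the current and the immediately previous value survive, which is the trade-off of Type III against Type II.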
Degenerate Dimension
Dimension keys in fact table without corresponding dimension tables are called Degenerate Dimensions
Purpose of Degenerate Dimensions
1. Generally used when each record in fact represents transaction line item
2. Useful for grouping transaction line items belonging to a single transaction
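As an illustration of the grouping use, the sketch below treats an order number stored directly in the fact rows as a degenerate dimension and totals the line items per order. The column names and data are assumptions for the example.

```python
from collections import defaultdict

# Fact rows at line-item grain; order_no has no dimension table of its own.
FACT_ROWS = [
    {"order_no": "SO-1", "product_key": 1, "amount": 10.0},
    {"order_no": "SO-1", "product_key": 2, "amount": 5.5},
    {"order_no": "SO-2", "product_key": 1, "amount": 7.0},
]


def order_totals(rows):
    """Group line items by the degenerate dimension and sum the measure."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["order_no"]] += row["amount"]
    return dict(totals)


totals = order_totals(FACT_ROWS)
```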
Fast Changing Dimension
A fast changing dimension is a dimension whose attribute or attributes for a record (row) change rapidly over time.
1. Example: Age of associates, income, daily balance, etc.
2. Technique to handle fast changing dimension: create band tables
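A band table replaces a rapidly changing value with the range it falls into, so the dimension row changes only when the value crosses a band boundary. The band limits below are invented for illustration.

```python
from bisect import bisect_right

# (lower bound, band label) - hypothetical age bands
AGE_BANDS = [(0, "0-17"), (18, "18-29"), (30, "44 and under"), (45, "45-64"), (65, "65+")]
AGE_BANDS[2] = (30, "30-44")  # keep labels consistent with their bounds
_LOWER_BOUNDS = [lower for lower, _label in AGE_BANDS]


def age_band(age: int) -> str:
    """Return the label of the band whose lower bound is the last one <= age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return AGE_BANDS[bisect_right(_LOWER_BOUNDS, age) - 1][1]


bands = [age_band(a) for a in (17, 18, 44, 65)]
```

The fact or dimension row then carries the band key instead of the exact age, which changes far less often.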
Role Playing Dimension
A single dimension that plays several different roles in a fact table is called a role-playing dimension. Each role can be exposed by creating a view on the dimension table.
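A small sketch of the view technique, using Python's built-in sqlite3 module: one physical date dimension is exposed through two views so that a fact table can join to it once per role. The table, view and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_orders (order_date_key INTEGER, ship_date_key INTEGER, amount REAL);

-- One view per role played by the same physical dimension.
CREATE VIEW dim_order_date AS
    SELECT date_key AS order_date_key, calendar_date AS order_date FROM dim_date;
CREATE VIEW dim_ship_date AS
    SELECT date_key AS ship_date_key, calendar_date AS ship_date FROM dim_date;

INSERT INTO dim_date VALUES (1, '2000-01-01'), (2, '2000-01-05');
INSERT INTO fact_orders VALUES (1, 2, 99.5);
""")

rows = conn.execute("""
    SELECT o.order_date, s.ship_date, f.amount
    FROM fact_orders AS f
    JOIN dim_order_date AS o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date AS s ON s.ship_date_key = f.ship_date_key
""").fetchall()
```

The views give each role its own column names, so report builders do not have to alias the same table twice by hand.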
Conformed Dimension
A conformed dimension means the same thing to each fact table to which it can be joined.
Typically, dimension tables that are referenced or are likely to be referenced by multiple fact tables (multiple dimensional models) are called conformed dimensions.
Conformed Dimension Option #1
Identical dimensions with the same keys, labels, definitions and values
Sales Schema
• SALES Facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
Inventory Schema
• INVENTORY Facts: DATE KEY, PRODUCT KEY, STORE KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
Conformed Dimension Option #2
Subset of base dimension with common labels, definitions and values
Sales Schema
• SALES $ facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
• Date dimension: DATE KEY, Day-of-week, Week Desc, Month Desc
Forecast Schema
• SALES $ facts: MONTH KEY, BRAND KEY
• Brand dimension: BRAND KEY, Brand Desc, Category Desc
• Month dimension: MONTH KEY, Month Desc
Example rows
PROD KEY | Prod Desc | Brand Desc | Category Desc
12345 | Cherriors 10 | Cherriors | Cereal
BRAND KEY | Brand Desc | Category Desc
12345 | Cherriors | Cereal
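One way to keep a subset dimension conformed is to derive it from the base dimension rather than maintain it separately. The sketch below builds a brand dimension from product rows; the key assignment scheme and the second product row are assumptions added for illustration.

```python
def derive_brand_dimension(product_dim):
    """Build brand rows whose labels are copied from the product dimension."""
    brands = {}
    for row in product_dim:
        label = (row["brand_desc"], row["category_desc"])
        if label not in brands:
            brands[label] = {
                "brand_key": len(brands) + 1,  # hypothetical surrogate key
                "brand_desc": row["brand_desc"],
                "category_desc": row["category_desc"],
            }
    return list(brands.values())


product_dim = [
    {"prod_key": 12345, "prod_desc": "Cherriors 10",
     "brand_desc": "Cherriors", "category_desc": "Cereal"},
    {"prod_key": 12346, "prod_desc": "Cherriors 20",  # invented second product
     "brand_desc": "Cherriors", "category_desc": "Cereal"},
]
brand_dim = derive_brand_dimension(product_dim)
```

Because the brand and category labels are copied, a report by brand from the forecast schema lines up with a report by brand from the sales schema.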
‘Garbage’ Dimension
A garbage dimension is a dimension that consists of low-cardinality columns such as codes, indicators, and status flags.
Approach to handle Garbage dimension:
• Put the new attributes into existing dimension tables.
• Put the new attributes into the fact table.
• Create new, separate dimension tables for the attributes.
• Create a separate ‘Garbage Dimension’ table.
Junk Dimensions
Whether to use junk dimension
• 5 indicators, each has 3 values -> 243 (3^5) rows
• 5 indicators, each has 100 values -> 10 billion (100^5) rows
When to insert rows in the dimension
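The row counts come from multiplying the number of values of each indicator, since a fully pre-built junk dimension holds every combination. A short sketch, with invented indicator names, shows both the count and the pre-built rows; note that 100^5 is 10^10, that is, 10 billion.

```python
from itertools import product


def junk_row_count(n_indicators: int, n_values: int) -> int:
    """Number of rows if every combination is stored up front."""
    return n_values ** n_indicators


def build_junk_dimension(indicators: dict) -> list:
    """Pre-build one row per combination of indicator values."""
    names = list(indicators)
    return [dict(zip(names, combo)) for combo in product(*indicators.values())]


small = build_junk_dimension({f"flag_{i}": ["Y", "N", "U"] for i in range(5)})
large_count = junk_row_count(5, 100)
```

With 243 rows, building all combinations in advance is cheap. With 10 billion possible rows, it is more practical to insert a row only when a combination is first seen in the incoming data.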
Factless Fact Tables
The two types of factless fact tables are:
Coverage tables
Event tracking tables
Factless Fact Tables - Coverage Tables
Coverage tables are required when a primary fact table is sparse
Example: Tracking products in a store that did not sell
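The "did not sell" question is answered by subtracting what appears in the sales fact table from what the coverage table says was on offer. A minimal sketch with invented (store, product) keys:

```python
def unsold(coverage, sales):
    """Return (store, product) pairs that were offered but never sold."""
    return sorted(set(coverage) - set(sales))


# Coverage table: every product stocked in every store (no measures).
coverage_rows = [("S1", "P1"), ("S1", "P2"), ("S2", "P1"), ("S2", "P3")]
# Sales fact table keys: only combinations with at least one sale.
sales_rows = [("S1", "P1"), ("S2", "P3"), ("S1", "P1")]

not_sold = unsold(coverage_rows, sales_rows)
```

The sales fact table alone cannot answer this, because a product that did not sell leaves no row there.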
Factless Fact Tables - Event Tracking
These tables are used for tracking an event:
Example: Tracking student attendance
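An event tracking table holds only keys, so analysis is done by counting rows. The sketch below counts attendance per class from invented (date, student, class) rows.

```python
from collections import Counter

# One row per attendance event; there is no numeric measure column.
ATTENDANCE = [
    ("2000-01-10", "student_1", "Math"),
    ("2000-01-10", "student_2", "Math"),
    ("2000-01-10", "student_1", "Physics"),
    ("2000-01-11", "student_2", "Math"),
]


def attendance_by_class(rows):
    """Count attendance events per class."""
    return Counter(class_name for _day, _student, class_name in rows)


counts = attendance_by_class(ATTENDANCE)
```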
Fact Constellation
Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
What is a Data mart?
Data mart is a decentralized subset of data found either in a data warehouse or as a standalone subset designed to support the unique business unit requirements of a specific decision-support system.
Data marts have specific business-related purposes, such as measuring the impact of marketing promotions or measuring and forecasting sales performance.
[Figure: an Enterprise Data Warehouse feeding several data marts]
Data marts - Main Features
Main Features:
Low cost
Controlled locally rather than centrally, conferring power on the user group.
Contain less information than the warehouse
Rapid response
More easily understood and navigated than an enterprise data warehouse.
Within the range of divisional or departmental budgets
Advantages of Data Mart over Data Warehouse
Data mart advantages:
Typically single subject area and fewer dimensions
Limited feeds
Very quick time to market (30-120 days to pilot)
Quick impact on bottom line problems
Focused user needs
Limited scope
Optimum model for DW construction
Demonstrates ROI
Allows prototyping
Disadvantages of Data Mart
Data mart disadvantages:
Does not provide an integrated view of business information.
Uncontrolled proliferation of data marts results in redundancy
A large number of data marts is complex to maintain
Scalability issues for large number of users and increased data volume
Data marts
DM - Types
• Embedded data marts are marts that are stored within the central DW. They can be stored relationally, as files, or as cubes.
• Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.
• Independent data marts are marts that are fed directly by external sources and do not use the DW.
The Operational Data Store
Why We Need Operational Data Store?
Need
To obtain a “system of record” that contains the best data that
exists in a legacy environment as a source of information
Best here implies data to be
Complete
Up to date
Accurate
In conformance with the organization’s information model
Operational Data Store - Insulated from OLTP
• ODS data resolves data integration issues
• Data physically separated from production environment to insulate it from the processing demands of reporting and analysis
• Access to current data facilitated.
[Figure: OLTP Server -> ODS -> Tactical Analysis]
Operational Data Store - Data
• Detailed data
• Records of business events (e.g. orders capture)
• Data from heterogeneous sources
• Does not store summary data
• Contains current data
ODS - Benefits
• Integrates the data
• Synchronizes the structural differences in data
• High transaction performance
• Serves the operational and DSS environment
• Transaction level reporting on current data
[Figure: flat files, relational databases and Excel files feed the Operational Data Store; sample records: 60,5.2,”JOHN” and 72,6.2,”DAVID”]
Operational Data Store - Update schedule
ODS Data
• Update schedule: daily or less time frequency
• Detail of data is mostly between 30 and 90 days
• Addresses operational needs
Data Warehouse Data
• Update schedule: weekly or greater time frequency
• Potentially infinite history
• Addresses strategic needs
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic
Star Schema Design
Single fact table surrounded by denormalized dimension tables
The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage required
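The points above can be sketched with Python's built-in sqlite3 module: a fact table whose primary key is the composite of its dimension foreign keys, surrounded by denormalized dimension tables. All table names, column names and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: brand is stored directly on the product row.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_desc TEXT, brand_desc TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: the primary key is the composite of the dimension keys.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    date_key INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_key, store_key, date_key)
);

INSERT INTO dim_product VALUES (1, 'Widget S', 'Acme'), (2, 'Widget L', 'Acme'),
                               (3, 'Gadget', 'Zenith');
INSERT INTO dim_store VALUES (1, 'Tampa'), (2, 'Miami');
INSERT INTO fact_sales VALUES (1, 1, 20000101, 10.0), (2, 1, 20000101, 5.0),
                              (3, 2, 20000101, 7.5);
""")

sales_by_brand = conn.execute("""
    SELECT p.brand_desc, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.brand_desc
    ORDER BY p.brand_desc
""").fetchall()
```

A typical query needs only one join per dimension, which is what makes the star layout easy for end users to navigate; the repeated brand text on each product row is the extra storage the slide mentions.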
EXAMPLE OF STAR SCHEMA
Snowflake Schema
Single fact table surrounded by normalized dimension tables
Normalizes dimension tables to save data storage space.
Used when dimensions become very large
Less intuitive, slower performance due to joins
May want to use both approaches, especially if supporting multiple end-user tools.
Example of Snowflake Schema
Snowflake - Disadvantages
Normalization of dimension makes it difficult for user to understand
Decreases the query performance because it involves more joins
Dimension tables are normally smaller than fact tables - space may not be a major issue to warrant snowflaking
Data Acquisition
Data Extraction
Data Transformation
Data Loading
Representative DW Tools
Tool Category | Products
ETL Tools | ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server | Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools | Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos Powerplay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Data Warehouse | Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBricks
Data Mining & Analysis | SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
ETL PRODUCTS
CODE BASED ETL TOOLS
GUI BASED ETL TOOLS
CODE BASED ETL TOOLS
SAS ACCESS
SAS BASE
TERADATA ETL TOOLS
1. BTEQ
2. TPUMP
3. FAST LOAD
4. MULTI LOAD
GUI BASED ETL TOOLS
Informatica
DT/Studio
DataStage
Business Objects Data Integrator (BODI)
AbInitio
Data Junction
Oracle Warehouse Builder
Microsoft SQL Server Integration Services
IBM DB2 Warehouse Center
Extraction Types
Extraction
• Full Extract
• Periodic/Incremental Extract
Full Extract
[Figure: Source System -> Full Extract -> Data Mart; everything extracted is loaded as new data]
Incremental Extract
[Figure: Source System -> Incremental Extract -> incremental data -> Data Mart, which already holds existing data]
Incremental Extract
[Figure: Source System -> Incremental Extract -> incremental data, made up of new data and changed data -> Data Mart, which already holds existing data]
Incremental Extract
[Figure: Source System -> Incremental Extract -> incremental data -> Data Mart]
• New data: incremental addition to the data mart
• Changed data: existing data updated using changed data
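The incremental flow can be sketched as two steps: pick up only the source rows touched since the last run, then insert the new ones and update the changed ones in the mart. Using a last-modified timestamp to find the increment is an assumption of this sketch (the slides do not say how changes are detected), as are the field names.

```python
def incremental_extract(source_rows, last_extract_ts):
    """Select rows created or changed after the previous extract."""
    return [r for r in source_rows if r["updated_at"] > last_extract_ts]


def load_increment(mart: dict, increment) -> tuple:
    """Insert new rows and update existing ones; return (inserted, updated)."""
    inserted = updated = 0
    for row in increment:
        if row["id"] in mart:
            updated += 1
        else:
            inserted += 1
        mart[row["id"]] = dict(row)
    return inserted, updated


source = [
    {"id": 1, "value": "a", "updated_at": 5},    # unchanged since last run
    {"id": 2, "value": "b2", "updated_at": 12},  # changed
    {"id": 3, "value": "c", "updated_at": 15},   # new
]
mart = {1: {"id": 1, "value": "a", "updated_at": 5},
        2: {"id": 2, "value": "b", "updated_at": 7}}

increment = incremental_extract(source, last_extract_ts=10)
result = load_increment(mart, increment)
```

A full extract would instead reload all three rows on every run, which is simpler but moves far more data as the source grows.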
DATA WAREHOUSE LOADING
Types of Data warehouse Loading
Target update types
Insert
Update
Types of Data Warehouse Updates
Insert
Full Replace
Selective Replace
Update plus Retain History
Update
Point in Time Snapshots
New Data Changed Data
Data Warehouse
Source data Data Staging
New Data and Point-In-Time Data Insert
Source data
New data
OR
Point-in-Time Snapshot (e.g. monthly)
New Data Added to Existing Data
Changed Data Insert
Source data Changed Data Added to Existing Data
Changed data
Data Warehouse Life cycle
[Figure: business requirement, map requirements to OLTP, reverse engineering of the OLTP system, map data sources, logical modeling and refine model lead into ETL; ETL loads data from the OLTP system and external data storage into the enterprise data warehouse and data warehouses; info access is through reporting tools, web browsers, OLAP and mining.]
Project Life Cycle
Software Requirement Specification
High level Design(HLD)
Low level Design(LLD)
Development
Unit Testing
System Integration Testing
Peer Review
User Acceptance Testing
Production
Maintenance
Meta Data in a Data Warehouse
What is Metadata?
• Data about data and the processes
• Metadata is stored in a data dictionary and repository.
• Insulates the data warehouse from changes in the schema of operational systems.
• It serves to identify the contents and location of data in the data warehouse
Why Do You Need Meta Data?
• Share resources: users, tools
• Document system
• Without meta data: not sustainable; not able to fully utilize resources
The Role of Meta Data in the Data Warehouse
Meta Data enables data to become information, because with it you
• Know what data you have and
• You can trust it!
Meta Data Answers….
How have business definitions and terms changed over time?
How do product lines vary across organizations?
What business assumptions have been made?
How do I find the data I need?
What is the original source of the data?
How was this summarization created?
What queries are available to access the data?
Meta Data Process
• Integrated with entire process and data flow
• Populated from beginning to end
• Begin population at design phase of project
• Dedicated resources throughout: build, maintain
[Figure: Meta Data and System Monitoring span every stage: Design, Mapping -> Extract, Scrub, Transform -> Load, Index, Aggregation -> Replication, Data Set Distribution -> Access & Analysis, Resource Scheduling & Distribution]
Types of ETL Meta Data
ETL Meta data
• Technical Meta data
• Operational Meta data
Classification of ETL Meta Data
Data Warehouse Meta data
This Meta data stores descriptive information about the physical implementation details of the data warehouse.
Source Meta data
This Meta data stores information about the source data and the mapping of source data to data warehouse data
ETL Meta Data
Transformations & Integrations
This Meta data describes comprehensive information about the transformation and loading.
Processing Information
This Meta data stores information about the activities involved in the processing of data, such as scheduling and archives.
End User Information
This Meta data records information about the user profile and security.
ETL -Planning for the Movement
The following may be helpful for planning the movement
Develop an ETL plan
Specifications
Implementation