ulysses-xiong
An Introduction to Data Warehousing. 1. Business Intelligence. - PowerPoint PPT Presentation
An Introduction to
Data Warehousing
1
Business Intelligence
Now, if the estimates made before a battle indicate victory, it is because careful calculations show that your conditions are more favorable than those of your enemy; if they indicate defeat, it is because careful calculations show that the favorable conditions for a battle are fewer. With more careful calculations one can win; with less one cannot. How much chance of victory has one who makes no calculations at all!
- Sun Tzu, The Art of War
Business these days is war minus shooting.
- Anonymous
Course Roadmap
• Introduction to Data Warehousing
• Difference between Operational System and Data Warehouse
• Emergence of Decision Support Systems
• Data Warehouse Theoretical Architecture
• Data Warehouse Technical Architecture
• Data Warehouse Bus Architecture
• Data Modelling concepts
• E-R Modelling for OLTP System
• Dimensional Modelling for a Data Warehouse
• Schema generation for Data Warehouse
• Star Schema Design
• Snowflake Schema Design
• Key aspects in designing the Dimensional Model
• Granularity with respect to the Fact Table in the Schemas
• Conformed Facts, Dimensions
• Factless Fact Tables, Aggregate Fact Tables
• Outrigger Entities in the Schemas
• Types of Relationships to be maintained between Facts and Dimensions
• Dependencies while generating Physical Schema for a Data Warehouse
• Case Study of design of Data Warehouse for an existing ER model
Objectives
At the end of this session, you will know:
– What is Data Warehousing
– The evolution of Data Warehousing
– Need for Data Warehousing
– OLTP Vs Warehouse Applications
– Data marts Vs Data Warehouses
– Operational Data Stores
– Overview of Warehouse Architecture
Objectives
At the end of this lesson, you will know:
– Data Warehouse Architectures
– Components of Data Warehousing Architecture
– An overview of each of the components
– Considerations for Data Warehouse Design
– Common mistakes in Warehouse designs
– An overview of Warehouse on the web
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- WH Inmon
WH Inmon is regarded as the father of data warehousing.
Subject-Oriented - Characteristics of a Data Warehouse
Operational: Quotes, Orders, Prospects, Leads
Data Warehouse: Customers, Products, Regions, Time
Focus is on subject areas rather than applications.
Integrated - Characteristics of a Data Warehouse
Operational (Appl A | Appl B | Appl C) -> Data Warehouse
m,f | 1,0 | male,female -> m,f
balance dec fixed (13,2) | balance pic 9(9)V99 | balance pic S9(7)V99 comp-3 -> balance dec fixed (13,2)
bal-on-hand | current-balance | cash-on-hand -> Current balance
date (julian) | date (yymmdd) | date (absolute) -> date (julian)
An integrated view is the essence of a data warehouse.
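The Appl A/B/C encodings above can be reconciled in a small translation step. The following is only a sketch: the function name `integrate`, the record layout, and the assumption that Appl B's `1` means male and `0` means female (inferred from the order on the slide) are illustrative, not taken from the source.

```python
from decimal import Decimal

# Per-application gender codes mapped onto the warehouse's m/f encoding.
# ASSUMPTION: for Appl B, 1 -> m and 0 -> f (order as listed on the slide).
GENDER_MAP = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"male": "m", "female": "f"},
}

# Per-application name of the balance field, as listed on the slide.
BALANCE_FIELD = {
    "A": "bal-on-hand",
    "B": "current-balance",
    "C": "cash-on-hand",
}


def integrate(app, record):
    """Translate one source record into the single warehouse format."""
    gender_code = str(record["gender"]).lower()
    raw_balance = record[BALANCE_FIELD[app]]
    return {
        "gender": GENDER_MAP[app][gender_code],
        # Two decimal places, matching "dec fixed (13,2)".
        "current_balance": Decimal(str(raw_balance)).quantize(Decimal("0.01")),
    }


row_b = integrate("B", {"gender": 1, "current-balance": "100.5"})
row_c = integrate("C", {"gender": "Female", "cash-on-hand": "20"})
```

Whatever the source application, every row ends up with the same field names and codes, which is the point of the integrated view.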
Non-volatile - Characteristics of a Data Warehouse
Operational: insert, change, replace, delete
Data Warehouse: load, read only access
Data warehouse is relatively static in nature.
Time Variant - Characteristics of a Data Warehouse
Operational: current value data
• time horizon: 60-90 days
Data Warehouse: snapshot data
• time horizon: 5-10 years
• data warehouse stores historical data
Data warehouse typically spans across time.
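The contrast between current-value data and snapshot data can be sketched in a few lines. This is a minimal illustration under assumed names (`operational`, `warehouse`, `record_balance`, `balance_as_of` are all hypothetical): the operational store overwrites a value in place, while the warehouse only appends dated snapshots and is then read.

```python
from datetime import date

operational = {}  # current value only: key -> balance (overwritten in place)
warehouse = []    # append-only snapshots: (snapshot_date, key, balance)


def record_balance(key, balance, as_of):
    """Update the operational value and load a dated snapshot."""
    operational[key] = balance               # change/replace
    warehouse.append((as_of, key, balance))  # load; never updated afterwards


def balance_as_of(key, as_of):
    """Read-only lookup of the latest snapshot on or before a date."""
    rows = [r for r in warehouse if r[1] == key and r[0] <= as_of]
    return max(rows)[2] if rows else None


record_balance("cust-1", 100, date(2020, 1, 31))
record_balance("cust-1", 250, date(2020, 2, 29))
```

After both calls the operational store knows only the latest balance, while the warehouse can still answer what the balance was at the start of February.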
Alternate Definitions
A collection of integrated, subject oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time.
- Imhoff
Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let the user get at the data for decision support.
- Babcock
Evolution of Data Warehousing
1960 - 1985 : MIS Era
Focus on Reporting
• Unfriendly
• Slow
• Dependent on IS programmers
• Inflexible
• Analysis limited to defined reports
Evolution of Data Warehousing
1985 - 1990 : Querying Era
Focus on Online Querying
• Ad hoc, unstructured access to corporate data
• SQL as interface not scalable
• Cannot handle complex analysis
Ad hoc queries: queries that are formulated by the user on the spur of the moment
Evolution of Data Warehousing
1990 - 20xx : Analysis Era
Focus on Online Analysis
• Trend Analysis
• What If ?
• Cross Dimensional Comparisons
• Statistical profiles
• Automated pattern and rule discovery
Need for Data Warehousing
• Better business intelligence for end-users
• Reduction in time to locate, access, and analyze information
• Consolidation of disparate information sources
• Strategic advantage over competitors
• Faster time-to-market for products and services
• Replacement of older, less-responsive decision support systems
• Reduction in demand on IS to generate reports
Typical Business Queries
• Which product generated maximum revenue over the last two quarters in a chosen geographical region, city wise, relative to the previous version of the product, compared with the plan?
• What percent of customers procure product A with B in a chosen region, broken down by city, season, and income group?
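A question of the "product A with B" kind can be sketched against a toy table using Python's built-in `sqlite3` module. Everything here is illustrative: the `sales` table, its columns, and the sample rows are assumptions, and the question is read as "of the customers who bought A, what percent also bought B", broken down by city only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer TEXT, city TEXT, product TEXT);
INSERT INTO sales VALUES
  ('c1', 'Pune',  'A'), ('c1', 'Pune',  'B'),
  ('c2', 'Pune',  'A'),
  ('c3', 'Delhi', 'A'), ('c3', 'Delhi', 'B'),
  ('c4', 'Delhi', 'B');
""")

# Inner query: one row per customer with 0/1 flags for "bought A" / "bought B".
# Outer query: per city, share of A-buyers who also bought B.
QUERY = """
SELECT city,
       100.0 * SUM(has_a AND has_b) / SUM(has_a) AS pct_a_with_b
FROM (
    SELECT customer, city,
           MAX(product = 'A') AS has_a,
           MAX(product = 'B') AS has_b
    FROM sales
    GROUP BY customer, city
)
GROUP BY city
ORDER BY city;
"""

result = conn.execute(QUERY).fetchall()
```

Adding season or income group to the breakdown would mean adding those columns to both `GROUP BY` clauses, which is the kind of cross-dimensional slicing the warehouse is meant to serve.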
OLTP Systems Vs Data Warehouse
Remember: between OLTP and Data Warehouse systems
• users are different
• data content is different
• data structures are different
• hardware is different
Understanding the differences is the key.
OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing
OLTP Vs Warehouse
Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
OLTP Vs Warehouse
Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness
Capacity Planning
[Figure: processing power plotted against time of day]
Processing load peaks during the beginning and end of day.
Examples Of Some Applications
• Target Marketing
• Market Segmentation
• Budgeting
• Credit Rating Agencies
• Financial Reporting and Consolidation
• Market Basket Analysis - POS Analysis
• Fraud Management
• Profitability Management
• Event tracking
[Figure labels: Manufacturers, Customers, Retailers]
Do we need a separate database ?
• OLTP and data warehousing require two very differently configured systems
• Isolation of Production System from Business Intelligence System
• Significant and highly variable resource demands of the data warehouse
• Cost of disk space no longer a concern
• Production systems not designed for query processing
Data Marts
• Enterprise wide data warehousing projects have a very large cycle time
• Getting consensus between multiple parties may also be difficult
• Departments may not be satisfied with priority accorded to them
• Sometimes individual departmental needs may be strong enough to warrant a local implementation
• Application/database distribution is also an important factor
Data Marts
Subject or Application Oriented Business View of Warehouse
» Finance, Manufacturing, Sales etc.
» Smaller amount of data used for Analytic Processing
» Address a single business process
A logical subset of the complete data warehouse.
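The idea of a data mart as a logical subset of the warehouse can be illustrated with a database view. This is a sketch using `sqlite3`; the table `warehouse_facts`, the view `sales_mart`, and the sample rows are hypothetical, and a real mart is often a separate physical database rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table covering several subject areas.
CREATE TABLE warehouse_facts (subject TEXT, region TEXT, amount REAL);
INSERT INTO warehouse_facts VALUES
  ('Sales',         'North', 120.0),
  ('Sales',         'South',  80.0),
  ('Finance',       'North',  50.0),
  ('Manufacturing', 'South',  30.0);

-- The Sales mart exposes only the rows one department needs.
CREATE VIEW sales_mart AS
  SELECT region, amount FROM warehouse_facts WHERE subject = 'Sales';
""")

mart_rows = conn.execute(
    "SELECT region, amount FROM sales_mart ORDER BY region"
).fetchall()
```

The mart holds a smaller amount of data for a single business process, yet every row in it still comes from the shared warehouse.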
Data Warehouse and Data Mart
 | Data Warehouse | Data Marts
Scope | Application neutral; centralized, shared; cross LOB/enterprise | Specific application requirement; LOB, department; business process oriented
Data Perspective | Historical detailed data; some summary | Detailed (some history); summarized
Subjects | Multiple subject areas | Single partial subject; multiple partial subjects; OLTP snapshots
Data Warehouse and Data Mart
 | Data Warehouse | Data Marts
Data Sources | Many; operational/external data | Few; operational, external data; OLTP snapshots
Implement Time Frame | 9-18 months for first stage; multiple stage implementation | 4-12 months
Characteristics | Flexible, extensible; durable/strategic; data orientation | Restrictive, non-extensible; short life/tactical; project orientation
Warehouse or Mart First ?
Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Large development cycle | Delivered in < 6 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible
Different kinds of Information Needs
Current: Is this medicine available in stock?
Recent: What are the tests this patient has completed so far?
Historical: Has the incidence of tuberculosis increased in the last 5 years in the Southern region?
Operational Data Store - Definition
A subject oriented, integrated, volatile, current valued data store containing only corporate detailed data.
• Data from multiple sources is integrated for a subject. Example: Can I see the credit report from Accounts, sales from Marketing and the open order report from Order Entry for this customer?
• Identical queries may give different results at different times. Supports analysis requiring current data.
• Data is stored only for the current period. Old data is either archived or moved to the Data Warehouse.
Operational Data Store
• Increasingly becoming integrated with the data warehouse
• Are nothing but more responsive real time data warehouses
• Data Mining has anyway forced Data Warehouses to store transactional level data
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
Data organization | Functional | Subject-oriented | Subject-oriented
Data quality | All application specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
Response time | Sub-second | Seconds to minutes | Seconds to minutes
Return set | Small amount of data | Small to medium amount of data | Small to large amount of data
Typical Data Warehouse Architecture
[Figure: Multi-tiered Data Warehouse without ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse with Metadata; Data Marts; Middleware/API; EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining]
Typical Data Warehouse Architecture
[Figure: Multi-tiered Data Warehouse with ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); ODS with Metadata; Data Preparation (Select, Extract, Transform, Load); Data Warehouse with Metadata; Data Marts]
Benefits of DWH
• To formulate effective business, marketing and sales strategies.
• To precisely target promotional activity.
• To discover and penetrate new markets.
• To successfully compete in the marketplace from a position of informed strength.
• To build predictive rather than retrospective models.
These capabilities empower the corporate...
Warehouse Architecture - 1
[Figure: Enterprise Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse with Metadata; Middleware/API; EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining]
Warehouse Architecture - 2
[Figure: Single Department Data Mart. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); three Data Marts, each with its own Metadata; Middleware/API; EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining]
Warehouse Architecture - 3
[Figure: Multi-tiered Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Operational Data Store; Data Warehouse with Metadata; Data Marts; Middleware/API; EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining]
Data Warehouse Architectures
There are three schools of thought about DW architectures:
– One supports Dimensional Modeling all through (Ralph Kimball)
– Second supports ER for Data Warehouse and Star Schemas for Data Marts
– Third supports ER model for DW (NCR)
Kimball’s View
[Figure: Operational Systems, Staging Area, Data Warehouse Server and Presentation Server connected over a LAN, with a DW Bus using conformed dimensions]
Data Warehouse Server processes:
• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance
DW is the sum total of all Data Marts.
Each star is a Data Mart and has both summary and detail data.
Multiple Data Marts With Conformed Dimensions
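What a conformed dimension buys can be shown with two small stars that share one product dimension. This is only a sketch with `sqlite3`; the tables `dim_product`, `fact_sales` and `fact_inventory`, their columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One product dimension shared (conformed) across both data marts.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- Two fact tables, one per mart, keyed by the same product_key.
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
CREATE TABLE fact_inventory (product_key INTEGER, on_hand INTEGER);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10.0), (1, 15.0), (2, 7.5);
INSERT INTO fact_inventory VALUES (1, 40), (2, 5);
""")

# Aggregate each fact table separately, then combine the results
# through the shared dimension key.
DRILL_ACROSS = """
SELECT p.name, s.total_sales, i.total_on_hand
FROM dim_product p
JOIN (SELECT product_key, SUM(amount) AS total_sales
      FROM fact_sales GROUP BY product_key) s
  ON s.product_key = p.product_key
JOIN (SELECT product_key, SUM(on_hand) AS total_on_hand
      FROM fact_inventory GROUP BY product_key) i
  ON i.product_key = p.product_key
ORDER BY p.name;
"""

rows = conn.execute(DRILL_ACROSS).fetchall()
```

Because both marts use the same product keys and names, their results line up row for row; with independently built dimensions the join would have nothing reliable to match on.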
Inmon’s View
[Figure: Operational Systems, Staging Area, Data Warehouse and Data Marts connected over a LAN]
Data Warehouse Server processes:
• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance
Data Warehouse: detail data in ER format
Data Marts: summarized data in star formats
Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema)
Components of a Data Warehouse Architecture
• Source Databases
• Data extraction/transformation/load (ETL) tool
• Data warehouse maintenance and administration tools
• Data modeling tool or interface to external data models
• Warehouse databases
• End-user data access and analysis tools
[Figure: components of a data warehouse architecture. Components shown: Source Databases; Data Cleansing Tools; ETL Tool; Data Modeling Tool; Central Metadata; Central Warehouse (RDBMS); Warehouse Admin Tool; Warehouse Databases; Architected Datamarts (RDBMS, ROLAP Engine, MDDB) with local meta data; Data Access and Analysis Tools (Managed Query, Desktop OLAP, ROLAP, MOLAP, Data Mining)]
A data warehouse is not just about data, but tools too.
Source Databases - Characteristics
• Legacy, relational, text or external sources
• Designed for high-speed transaction processing
• Real-time, current, volatile data
• Fast response for larger numbers of concurrent users
• Many short transactions
• Update-intensive; modifications by row
• Inquiry-oriented; access by keys
• High integrity, security, recoverability
• Source data is often inconsistent and poorly modeled
Data Cleaning Tools
• To clean data at the source
• Clean up source data in-place on the host
• Business rule discovery tools which analyse the source data and write cleaning rules based on lexical analysis and AI techniques
• Poorly integrated with data warehousing tools
• ETL tools have limited yet adequate data cleansing functionality
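The kind of limited cleansing an ETL step might perform can be sketched with a few hand-written rules. This is an illustration only, not a rule discovery tool: the function `clean_record`, the field names, and the choice of rules (collapse whitespace, normalize case, parse a `yymmdd` date) are assumptions.

```python
from datetime import datetime


def clean_record(raw):
    """Apply simple, hand-written cleaning rules to one source record."""
    # Collapse repeated whitespace and normalize capitalization of names.
    name = " ".join(raw["name"].split()).title()
    # Trim and upper-case the city so equal values compare equal.
    city = raw["city"].strip().upper()
    # Convert a yymmdd source date into an ISO 8601 date string.
    joined = datetime.strptime(raw["joined"], "%y%m%d").date().isoformat()
    return {"name": name, "city": city, "joined": joined}


cleaned = clean_record(
    {"name": "  john   SMITH ", "city": " pune", "joined": "990131"}
)
```

Rules like these handle formatting noise; inconsistencies in meaning across sources still need the integration step and, for harder cases, dedicated cleansing tools.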