chetan-gadodia
What’s a Data Warehouse?
• Large volume of data (GB, TB)
• Non-volatile
• Historical
• Time attributes are important
• Updates infrequent
• May be append-only
What’s Data Warehousing?
The process of:
• Extracting
• Integrating
• Filtering
• Standardizing
• Transforming
• Cleaning & quality checking
• Storing it in a consolidated database
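The steps listed on this slide can be sketched as a tiny pipeline. This is a minimal illustration, not a production ETL job: the two source lists (`crm_rows`, `web_rows`), their field names, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the consolidated database.

```python
import sqlite3
from datetime import date

# Hypothetical raw rows from two operational sources with inconsistent formats.
crm_rows = [
    {"customer": " Alice ", "amount": "120.50", "day": "2024-01-05"},
    {"customer": "BOB", "amount": "n/a", "day": "2024-01-06"},  # fails the quality check
]
web_rows = [
    {"cust_name": "alice", "total": 30, "date": "2024-01-07"},
]

def extract():
    """Extract and integrate: map both sources onto one common record shape."""
    for r in crm_rows:
        yield {"customer": r["customer"], "amount": r["amount"], "day": r["day"]}
    for r in web_rows:
        yield {"customer": r["cust_name"], "amount": r["total"], "day": r["date"]}

def clean(record):
    """Standardize, transform and quality-check one record.

    Returns None when the record should be filtered out.
    """
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None  # filter: amount is not numeric
    name = record["customer"].strip().title()              # standardize names
    day = date.fromisoformat(record["day"]).isoformat()    # validate the date
    return (name, amount, day)

def load(records):
    """Store the cleaned records in a consolidated (here: in-memory) database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, day TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    return conn

cleaned = [c for c in map(clean, extract()) if c is not None]
warehouse = load(cleaned)
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once the names are standardized, the two "Alice" records from different sources collapse into one customer, which is the point of integrating before storing.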
Need
• Huge amount of operational data
• Knowledge workers want to turn this data into useful information
• Support strategic decision making
• From a business perspective
– Marketing weapon
– Valuable tool in today’s world
– Learning more about customer needs
Benefits
• Potentially high returns on investment.
• Substantial competitive advantage.
• Increased productivity of corporate decision-makers.
Definition
• Subject oriented: Finance, Marketing, Inventory
• Integrated: SAP, Weblog, Legacy
• Time variant: Daily, Monthly, Quarterly
• Non-volatile: Same data for different periods
Operational Database (OLTP) vs Data Warehouse (OLAP)
OLTP
• Performs on-line transaction & query processing.
• Day-to-day operations of an organization
OLAP
• Data analysis & decision making.
• Systems can organize & present data in various formats
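The contrast between the two workloads can be illustrated with two kinds of queries. This is only a sketch: the `orders` table and its rows are hypothetical, and in practice the two workloads usually run on separate systems rather than on one SQLite table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2023, 100.0),
        (2, "West", 2023, 250.0),
        (3, "East", 2024, 175.0),
    ],
)

# OLTP-style work: read or change a single row, located by its key.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.0, 1))
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (1,)
).fetchone()

# OLAP-style work: scan many rows and summarize them for analysis.
by_region_year = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(one_order)
print(by_region_year)
```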
Data Marts: Overview
• Data Mart is a decentralized subset of data
• Data Marts have specific business-related purposes
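One way to picture a data mart as a purpose-specific subset is to copy only one department's rows out of a larger table. The sketch below assumes a hypothetical `warehouse_sales` table and a `marketing_mart` derived from it; real data marts are typically built by ETL jobs and may live in a separate database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("Marketing", "Campaign A", 500.0),
        ("Finance", "Audit", 900.0),
        ("Marketing", "Campaign B", 300.0),
    ],
)

# Build the mart: only the rows and columns the marketing group needs.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE department = 'Marketing'"
)

mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```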
Data Marts: Needs
• Much better performance querying from a data mart than from a data warehouse
• Much easier time navigating through data marts
Data Marts: Features
• Low cost
• Controlled locally rather than centrally, conferring power on the user group
• Contain less information
• Rapid response
• More easily understood and navigated than an enterprise Data Warehouse
• Within the range of divisional or departmental budgets
Dimensional Data Modeling
E-R model
• Symmetric
• Divides data into many entities
• Describes entities and relationships
• Seeks to eliminate data redundancy
• Good for high transaction performance
Dimensional model
• Asymmetric
• Divides data into dimensions and facts
• Describes dimensions and measures
• Encourages data redundancy
• Good for high query performance
What is a Dimension?
• Single join to the fact table (single primary key)
• Stores business attributes
• Attributes are textual in nature
• Organized into hierarchies
• More or less constant data
• E.g. Time, Product, Customer, Store, etc.
What is a Fact?
• Central, dominant table
• Multi-part primary key
• Links directly to dimensions
• Stores business measures
• Constantly varying data
Star Schema
• A single, large and central fact table and one table for each dimension.
• For example, a fact table surrounded by 4-15 dimensions
• Dimensions are de-normalized
Star Schema Example
• Fact Table: Store Key, Product Key, Period Key, Units, Price
• Store Dimension: Store Key, Store Name, City, State, Region
• Time Dimension: Period Key, Year, Quarter, Month
• Product Dimension: Product Key, Product Desc
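The star schema in this example can be sketched in SQL, here run through Python's `sqlite3` module. The table names (`sales_fact`, `store_dim`, `time_dim`, `product_dim`) and all sample rows are assumptions made for illustration; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: one table each, joined to the fact by a single key.
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT, city TEXT, state TEXT, region TEXT
);
CREATE TABLE time_dim (
    period_key INTEGER PRIMARY KEY,
    year INTEGER, quarter TEXT, month TEXT
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_desc TEXT
);
-- Central fact table: multi-part primary key made of the dimension keys.
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES time_dim(period_key),
    units INTEGER,
    price REAL,
    PRIMARY KEY (store_key, product_key, period_key)
);
-- Hypothetical sample data.
INSERT INTO store_dim VALUES
    (1, 'Downtown', 'Pune', 'MH', 'West'),
    (2, 'Airport', 'Delhi', 'DL', 'North');
INSERT INTO time_dim VALUES (10, 2024, 'Q1', 'Jan'), (11, 2024, 'Q1', 'Feb');
INSERT INTO product_dim VALUES (100, 'Widget');
INSERT INTO sales_fact VALUES
    (1, 100, 10, 5, 2.0),
    (1, 100, 11, 3, 2.0),
    (2, 100, 10, 4, 2.5);
""")

# Typical star query: one join per dimension, then aggregate the measures.
revenue_by_region = conn.execute("""
    SELECT s.region, t.quarter, SUM(f.units * f.price)
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    JOIN time_dim  t ON t.period_key = f.period_key
    GROUP BY s.region, t.quarter
    ORDER BY s.region
""").fetchall()
print(revenue_by_region)
```

Because `region` lives directly in the store dimension, the query needs exactly one join to reach it.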
Snowflake Schema
• Variant of star schema model.
• A single, large and central fact table and one or more tables for each dimension.
• Dimension tables are normalized i.e. split dimension table data into additional tables
Snowflake Schema Example
• Fact Table: Store Key, Product Key, Period Key, Units, Price
• Store Dimension: Store Key, Store Name, City Key
• City table (split out of the Store Dimension): City Key, City, State, Region
• Time Dimension: Period Key, Year, Quarter, Month
• Product Dimension: Product Key, Product Desc
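A minimal sketch of the snowflaked store dimension follows, again using `sqlite3`. The table names (`city_dim`, `store_dim`, `sales_fact`) and sample rows are hypothetical; the time and product dimensions are left out to keep the focus on the normalized part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City attributes are split out of the store dimension into their own table.
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY,
    city TEXT, state TEXT, region TEXT
);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    store_key INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER,
    period_key INTEGER,
    units INTEGER,
    price REAL
);
-- Hypothetical sample data: two stores share one city row.
INSERT INTO city_dim VALUES (1, 'Pune', 'MH', 'West'), (2, 'Delhi', 'DL', 'North');
INSERT INTO store_dim VALUES (1, 'Downtown', 1), (2, 'Mall', 1), (3, 'Airport', 2);
INSERT INTO sales_fact VALUES
    (1, 100, 10, 5, 2.0),
    (2, 100, 10, 3, 2.0),
    (3, 100, 10, 4, 2.5);
""")

# Reaching "region" now takes two joins: fact -> store -> city.
units_by_region = conn.execute("""
    SELECT c.region, SUM(f.units)
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    JOIN city_dim  c ON c.city_key  = s.city_key
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(units_by_region)
```

The city attributes are stored once per city instead of once per store, at the price of an extra join in every query that groups by city, state, or region.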
Avoid Snowflakes
• Avoid the natural desire to normalize the model:
– Complicates end-user query construction
– Adds an additional level of “JOIN” complexity
– Database optimizers do not handle it very well
– Saves some space at the cost of longer queries
So,
• Don’t snowflake to save space
• Snowflake if secondary dimensions have many attributes
Widely used ETL Tools
• IBM Information Server (DataStage)
• Informatica PowerCenter
• Ab Initio
• SAS Data Integration Studio
• Oracle Warehouse Builder (OWB)
• SQL Server Integration Services (SSIS)