Data Warehousing
Vikas Singh
Computer Science Department
BITS, Pilani
Background
• 1980’s to early 1990’s
– Focus on computerizing business processes
– To gain competitive advantage
• By early 1990’s
– All companies had operational systems
– It no longer offered any advantage
• How to get competitive advantage?
04/28/2023 2SS ZG515, Data Warehousing
OLTP Systems: Primary Purpose
Run the operations of the business
• For example: banks, railway reservation, etc.
• Based on ER Data Modeling
• Transaction based system
• Data is always current valued
• Little history is available
• Data is highly volatile
• Has “Intelligent keys”
OLTP Systems
• Has relational normalized design
• Redundant data is undesirable
• Consists of many tables
• High volume retrieval is inefficient
• Optimized for repetitive “narrow” queries
• Common data in many applications
Need for Data Warehousing
• Companies, over the years, gathered huge volumes of data
• “Hidden Treasure”
• Can this data be used in any way?
• Can we analyze this data to get any competitive advantage?
• If yes, what kind of advantage?
Benefits of Data Warehousing
• Allows “efficient” analysis of data
• Competitive Advantage
• Analysis aids strategic decision making
• Increased productivity of decision makers
• Potential high ROI
• Classic example: Diaper and Beer
Decision Support Systems, DW, & OLAP
• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
• Data Warehouse is a DSS
• A data warehouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems.
• Data Warehouse is not an Intelligent system
• On-Line Analytical Processing (OLAP) is an element of DSS
DW: Interesting Statistics
[Figure: bar chart of Investment ($ BN), Users, and Size (GB) for 2000 and 2003; vertical axis 0 to 2500; data labels: 37.4, 626393, 148.5, 2178, 1100]
Data Warehouse: Characteristics
• Analysis driven
• Ad-hoc queries
• Complex queries
• Used by top managers
• Based on Dimensional Modeling
• Denormalized structures
Data Warehouse: Major Players
• SAS Institute
• IBM
• Oracle
• Sybase
• Microsoft
• HP
• Cognos
• Business Objects
Data Warehouse
• A decision support database that is maintained separately from the organization’s operational databases.
• A data warehouse is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in organizational decision making
Subject Oriented
• Data Warehouse is designed around “subjects” rather than processes
• A company may have
– Retail Sales System
– Outlet Sales System
– Catalog Sales System
• Problems Galore!!!
• DW will have a Sales Subject Area
Subject Oriented
[Diagram: Retail Sales System, Outlet Sales System, and Catalog Sales System feeding a single Sales Subject Area]
Integrated
• Heterogeneous Source Systems
• Little or no control
• Need to integrate source data
• For example: product codes could be different in different systems
• Arrive at common code in DW
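The product-code point can be sketched in a few lines of Python. This is only an illustration: the source system names, the local codes, and the common codes such as `PRD0001` are all hypothetical, and a real warehouse would keep such a mapping in a maintained lookup table.

```python
# Hypothetical mapping from (source system, local product code)
# to the single common code used in the warehouse.
COMMON_CODE = {
    ("retail", "P-001"): "PRD0001",
    ("outlet", "1"): "PRD0001",
    ("catalog", "TISSUE-A"): "PRD0001",
    ("retail", "P-002"): "PRD0002",
}


def to_common_code(source_system: str, local_code: str) -> str:
    """Return the warehouse-wide code for a source-specific product code."""
    try:
        return COMMON_CODE[(source_system, local_code)]
    except KeyError:
        raise ValueError(
            f"unmapped product code {local_code!r} from {source_system!r}"
        ) from None


# Hypothetical sales rows: (source system, local product code, units sold).
sales = [("retail", "P-001", 3), ("outlet", "1", 2), ("catalog", "TISSUE-A", 5)]

# Once codes are integrated, sales of the same product add up across systems.
units_by_product = {}
for system, code, units in sales:
    key = to_common_code(system, code)
    units_by_product[key] = units_by_product.get(key, 0) + units

print(units_by_product)
```

Raising an error on an unmapped code, rather than passing it through, is one possible design choice: it surfaces integration gaps during loading instead of in later analysis.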
Non-Volatile (Read-Mostly)
[Diagram: in OLTP the user both reads and writes; in the DW the user only reads]
Time Variant
• Most business analysis has a time component
• Trend Analysis (historical data is required)
[Figure: Sales plotted by year, 2001 to 2004]
Data Warehousing Architecture
[Diagram components: External Sources and Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; Serve; OLAP servers; Analysis, Query/Reporting, Data Mining]
Populating & Refreshing the Warehouse
• Data Extraction
• Data Cleaning
• Data Transformation
– Convert from legacy/host format to warehouse format
• Load
– Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
• Refresh
– Bring new data from source systems
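The steps above can be sketched as a tiny pipeline. This is a minimal illustration using Python's standard `sqlite3` module with an in-memory database; the row layout, the `sales` table, the dates, and the cleaning rules are assumptions made for the example, not part of the course material.

```python
import sqlite3

# Hypothetical extracted rows in a "legacy" format: everything is a string,
# with stray whitespace, inconsistent case, a duplicate, and a missing value.
raw_rows = [
    ("2003-03-28", " tissue paper ", "store1", "25"),
    ("2003-03-28", "Tissue Paper", "store1", "25"),   # duplicate after cleaning
    ("2003-03-28", "paper towel", "store1", "35"),
    ("2003-03-28", "paper towel", "store1", ""),      # quantity not captured
]


def clean(row):
    """Cleaning + transformation: normalize text, convert types, drop bad rows."""
    day, product, store, qty = row
    if not qty.strip():
        return None
    return (day, product.strip().lower(), store.strip(), int(qty))


def transform(rows):
    """Apply clean() to every row and remove exact duplicates, keeping order."""
    seen = set()
    result = []
    for row in map(clean, rows):
        if row is not None and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def load(conn, rows):
    """Load: insert the rows into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(day TEXT, product TEXT, store TEXT, units INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON sales (day)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(raw_rows))
loaded = conn.execute(
    "SELECT product, units FROM sales ORDER BY product"
).fetchall()
print(loaded)
```

A refresh would, under the same assumptions, call `load` again with only the rows extracted since the previous run.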
ETL Process: Issues & Challenges
• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Source systems operating in different time zones
• Different currencies
• Different measurement units
• Data not captured by OLTP systems
• Ensuring data quality
Data Staging Area
• A storage area where extracted data is
– Cleaned
– Transformed
– Deduplicated
• Initial storage for data
• Need not be based on Relational model
• Spread over a number of machines
• Mainly sorting and sequential processing
• COBOL or C code running against flat files
• Does not provide data access to users
• Analogy – kitchen of a restaurant
Presentation Servers
• A target physical machine on which DW data is organized for direct querying by end users using
– OLAP
– Report writers
– Data visualization tools
– Data mining tools
• Data stored in Dimensional framework
• Analogy – sitting area of a restaurant
Data Cleaning
• Why?
– Data warehouse contains data that is analyzed for business decisions
– More data and multiple sources could mean more errors in the data, and such errors are harder to trace
– Results in incorrect analysis
– Detecting data anomalies and rectifying them early has huge payoffs
• Long Term Solution
– Change business practices and data entry tools
– Repository for meta-data
Soundex Algorithms
• Misspelled terms
– For example, names
• Phonetic algorithms can find similar sounding names
• Based on the six phonetic classifications of human speech sounds
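As an illustration, here is a sketch of the widely used American Soundex variant in Python. The slides do not say which variant is meant, so the specific rules below (keep the first letter, six consonant groups, H and W ignored, a four-character code padded with zeros) are an assumption.

```python
# The six consonant groups; vowels, Y, H and W receive no digit.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


for candidate in ("Robert", "Rupert", "Smith", "Smyth"):
    print(candidate, soundex(candidate))
```

Names that sound alike, such as "Smith" and "Smyth", map to the same code, which lets a cleaning step flag them as possible misspellings of one another.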
Data Warehouse Design
• OLTP Systems are Data Capture Systems
• “DATA IN” systems
• DW are “DATA OUT” systems
[Diagram: OLTP and DW]
Analyzing the DATA
• Active Analysis – User Queries
– User-guided data analysis
– Show me how X varies with Y
– OLAP
• Automated Analysis – Data Mining
– What’s in there?
– Set the computer FREE on your data
– Supervised Learning (classification)
– Unsupervised Learning (clustering)
OLAP Queries
• How much of product P1 was sold in 1999 state wise?
• Top 5 selling products in 2002
• Total Sales in Q1 of FY 2002-03?
• Color wise sales figure of cars from 2000 to 2003
• Model wise sales of cars for the month of Jan from 2000 to 2003
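Two of these questions can be sketched as SQL aggregations. The example below uses Python's standard `sqlite3` module; the single flat `sales` table, its columns, and every row in it are hypothetical and chosen only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER)"
)
# Hypothetical sales rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "Rajasthan", 1999, 10),
        ("P1", "Rajasthan", 1999, 5),
        ("P1", "Gujarat", 1999, 7),
        ("P2", "Gujarat", 1999, 4),
        ("P1", "Gujarat", 2002, 3),
        ("P2", "Rajasthan", 2002, 9),
    ],
)

# "How much of product P1 was sold in 1999 state wise?"
p1_by_state = conn.execute(
    "SELECT state, SUM(units) FROM sales "
    "WHERE product = 'P1' AND year = 1999 "
    "GROUP BY state ORDER BY state"
).fetchall()

# "Top 5 selling products in 2002"
top_2002 = conn.execute(
    "SELECT product, SUM(units) AS total FROM sales "
    "WHERE year = 2002 "
    "GROUP BY product ORDER BY total DESC LIMIT 5"
).fetchall()

print(p1_by_state)
print(top_2002)
```

Both queries follow the same pattern: filter on some dimensions, group by others, and aggregate a numeric measure.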
Data Mining Investigations
• Which types of customers are most likely to spend the most with us in the coming year?
• What additional products are most likely to be sold to customers who buy sportswear?
• In which area should we open a new store in the next year?
• What are the characteristics of customers most likely to default on their loans before the year is out?
Continuum of Analysis
[Diagram: a continuum running from OLTP (Primitive & Canned Analysis) through OLAP (Complex Ad-hoc Analysis) to Data Mining (Automated Analysis); the tools range from SQL to Specialized Algorithms]
Design Requirements
• Design of the DW must directly reflect the way the managers look at the business
• Should capture the measurements of importance along with parameters by which these measurements are viewed
• It must facilitate data analysis, i.e., answering business questions
ER Modeling
• A logical design technique that seeks to eliminate data redundancy
• Illuminates the microscopic relationships among data elements
• Perfect for OLTP systems
• Responsible for success of transaction processing in Relational Databases
Problems with ER Model
ER models are NOT suitable for DW?
• End user cannot understand or remember an ER Model
• Many DWs have failed because of overly complex ER designs
• Not optimized for complex, ad-hoc queries
• Data retrieval becomes difficult due to normalization
• Browsing becomes difficult
ER vs Dimensional Modeling
• ER models are constituted to
– Remove redundant data (normalization)
– Facilitate retrieval of individual records having certain critical identifiers
– Thereby optimizing OLTP performance
• Dimensional model supports the reporting and analytical needs of a data warehouse system.
Dimensional Modeling: Salient Features
• Represents data in a standard framework
• Framework is easily understandable by end users
• Contains same information as ER model
• Packages data in symmetric format
• Resilient to change
• Facilitates data retrieval/analysis
Dimensional Modeling: Vocabulary
• Measures or facts
• Facts are “numeric” & “additive”
• For example: Sale Amount, Sale Units
• Factors or dimensions
• Star Schemas
• Snowflake & Starflake Schemas
Sales Amt = f (Product, Location, Time)
Here Sales Amt is the fact; Product, Location, and Time are the dimensions.
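One way to read "Sales Amt = f (Product, Location, Time)" is as a lookup keyed by one value per dimension. The sketch below makes that concrete with a Python dictionary; the products, stores, date, and amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact values keyed by (product, location, time).
sales_amt = {
    ("tissue paper", "store1", "2003-03-28"): 250,
    ("paper towel", "store1", "2003-03-28"): 350,
    ("tissue paper", "store2", "2003-03-28"): 100,
}

DIMENSIONS = ("product", "location", "time")


def total_by(dimension: str) -> dict:
    """Sum the additive fact over all dimensions except the named one."""
    position = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, amount in sales_amt.items():
        totals[key[position]] += amount
    return dict(totals)


print(total_by("product"))
print(total_by("location"))
```

Because the fact is numeric and additive, it can be summed along any dimension, which is what makes it useful for analysis.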
Star Schema
[Diagram: Sales Fact Table at the center, linked by foreign keys (FK) to the Location, Promotion, Product, and Time dimensions]
Dimensional Modeling
• Facts are stored in FACT tables
• Dimensions are stored in DIMENSION tables
• Dimension tables contain textual descriptors of the business
• Fact and dimension tables form a Star Schema
• “BIG” fact table in center surrounded by “SMALL” dimension tables
The “Classic” Star Schema
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars_sold, Units, Dollars_cost
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day
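A cut-down version of this schema can be sketched with Python's standard `sqlite3` module. The table names (`store_dim`, `product_dim`, `period_dim`, `sales_fact`), the reduced column lists, and all inserted rows are assumptions made for illustration; only the column names themselves follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Small, descriptive dimension tables (only some attributes shown).
    CREATE TABLE store_dim (
        store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT, color TEXT);
    CREATE TABLE period_dim (
        period_key INTEGER PRIMARY KEY, period_desc TEXT,
        year INTEGER, quarter INTEGER, month INTEGER, day INTEGER);

    -- Fact table: one foreign key per dimension plus the numeric facts.
    CREATE TABLE sales_fact (
        store_key INTEGER REFERENCES store_dim (store_key),
        product_key INTEGER REFERENCES product_dim (product_key),
        period_key INTEGER REFERENCES period_dim (period_key),
        dollars_sold REAL, units INTEGER, dollars_cost REAL,
        PRIMARY KEY (store_key, product_key, period_key));

    -- Hypothetical sample rows.
    INSERT INTO store_dim VALUES (1, 'Store 1', 'Pilani', 'Rajasthan');
    INSERT INTO product_dim VALUES (1, 'Tissue paper', 'BrandA', 'White');
    INSERT INTO product_dim VALUES (2, 'Paper towel', 'BrandB', 'White');
    INSERT INTO period_dim VALUES (1, '28 Mar 2003', 2003, 1, 3, 28);
    INSERT INTO sales_fact VALUES (1, 1, 1, 250.0, 25, 200.0);
    INSERT INTO sales_fact VALUES (1, 2, 1, 350.0, 35, 300.0);
    """
)

# A typical star-join: constrain or group by dimension attributes, sum the facts.
sales_by_brand_year = conn.execute(
    """
    SELECT p.brand, t.year, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN period_dim AS t ON t.period_key = f.period_key
    GROUP BY p.brand, t.year
    ORDER BY p.brand
    """
).fetchall()
print(sales_by_brand_year)
```

The composite primary key on the three foreign keys reflects the idea that each fact row sits at the intersection of its dimensions.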
Fact Tables
• Contain numerical measurements of the business
• Each measurement is taken at the intersection of all dimensions
• Intersection is the composite key
• Represent many-to-many relationships between dimensions
• Examples of facts
– Sale_amt, Units_sold, Cost, Customer_count
Dimension Tables
• Contain attributes for dimensions
• 50 to 100 attributes common
• Best attributes are textual and descriptive
• DW is only as good as the dimension attributes
• Contain hierarchical information, albeit redundantly
• Entry points into the fact table
Types of Facts
• Fully-additive: additive across all dimensions
– Units_sold, Sales_amt
• Semi-additive: additive across some dimensions
– Account_balance, Customer_count
– 28/3, tissue paper, store1, 25, 250, 20
– 28/3, paper towel, store1, 35, 350, 30
– Is the no. of customers who bought either tissue paper or paper towel 50? NO, because a customer who bought both would be counted twice.
• Non-additive: additive across no dimension
– Gross margin = Gross profit/amount
– Note that GP and Amount are fully additive
– Ratio of the sums and not sum of the ratios
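A short Python sketch of the last two cases. The customer counts 20 and 30 come from the two sample rows above (assuming the last column is the customer count); the customer IDs, the size of the overlap, and the gross-profit figures are hypothetical.

```python
# Semi-additive: customer counts cannot be summed along the product dimension.
# Hypothetical customer IDs; 10 customers are assumed to have bought both items.
tissue_buyers = set(range(0, 20))   # 20 customers bought tissue paper
towel_buyers = set(range(10, 40))   # 30 customers bought paper towel

naive_count = len(tissue_buyers) + len(towel_buyers)   # double counts overlap
distinct_count = len(tissue_buyers | towel_buyers)     # true number of customers
print(naive_count, distinct_count)

# Non-additive: gross margin must be recomputed from its additive parts.
# Hypothetical (gross_profit, amount) pairs for two fact rows.
rows = [(50.0, 250.0), (35.0, 350.0)]

ratio_of_sums = sum(gp for gp, _ in rows) / sum(amt for _, amt in rows)
sum_of_ratios = sum(gp / amt for gp, amt in rows)
print(ratio_of_sums, sum_of_ratios)
```

Storing the additive components (gross profit and amount) in the fact table and computing the ratio at query time is what allows the margin to be reported correctly at any level of aggregation.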
Data Warehouse: Design Steps
Step 1: Identify the Business Process
Step 2: Declare the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts
Grocery Store: The Universal Example
The Scenario:
• Chain of 100 Grocery Stores
• 60000 individual products in each store
• 10000 of these products sold on any given day (average)
• 3 year data
Some Terms
• SKU (Stock Keeping Units)
• UPC (Universal Product Codes)
• EPOS (Electronic Point of Sales)
What Management is Interested In?
• Ordering logistics
• Stocking shelves
• Selling products
• Maximize profits
Grocery Store DW
• Step 1: Sales Business Process
• Step 2: Daily Grain
• A word about GRANULARITY
– Temp sensor data: per ms, sec, min, hr?
– Size of the DW is governed by granularity
– Daily grain (club products sold on a day for each store): aggregated data
– Receipt line grain (each line in the receipt is recorded): finest grain data
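The difference between the two grains can be sketched by rolling receipt lines up to one row per day, store, and product. The receipt-line layout and every value below are hypothetical.

```python
from collections import defaultdict

# Hypothetical receipt lines (finest grain):
# (transaction no., day, store, product, units, amount)
receipt_lines = [
    (100, "2003-03-28", "store1", "tissue paper", 2, 20),
    (101, "2003-03-28", "store1", "tissue paper", 1, 10),
    (101, "2003-03-28", "store1", "paper towel", 3, 30),
    (102, "2003-03-29", "store1", "tissue paper", 4, 40),
]


def to_daily_grain(lines):
    """Club all sales of a product in a store on a day into one fact row."""
    daily = defaultdict(lambda: [0, 0])
    for _txn, day, store, product, units, amount in lines:
        totals = daily[(day, store, product)]
        totals[0] += units
        totals[1] += amount
    return {key: tuple(value) for key, value in daily.items()}


daily_facts = to_daily_grain(receipt_lines)
for key, (units, amount) in sorted(daily_facts.items()):
    print(key, units, amount)
```

The roll-up drops the transaction number, so questions about individual receipts can no longer be answered at the daily grain; that is the price paid for the smaller fact table.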
Grocery Store: DW Size Estimate
• Daily Grain
• Size of Fact Table = 100*10000*3*365 = 1095 million records
• 3 facts & 4 dimensions (49 bytes)
• 1095 m * 49 bytes = 53655 m bytes
• i.e. ~ 50 GB
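The arithmetic above can be reproduced in a few lines of Python. All inputs are the slide's own figures; the variable names are made up for the example. The "~ 50 GB" figure corresponds to binary gigabytes (2^30 bytes); in decimal units the same total is a little under 54 GB.

```python
stores = 100
products_sold_per_store_per_day = 10_000
years = 3
days_per_year = 365
bytes_per_row = 49  # 3 facts + 4 dimension keys, as stated on the slide

# One fact row per product sold, per store, per day (daily grain).
fact_rows = stores * products_sold_per_store_per_day * years * days_per_year
total_bytes = fact_rows * bytes_per_row

print(f"{fact_rows:,} rows")
print(f"{total_bytes:,} bytes")
print(f"{total_bytes / 10**9:.1f} GB (decimal), {total_bytes / 2**30:.1f} GiB (binary)")
```

Changing `products_sold_per_store_per_day` or the grain shows directly how granularity governs the size of the warehouse.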
Facts for Grocery Store
1. Quantity sold (additive)
2. Dollar revenue (additive)
3. Dollar cost (additive)
4. Customer count (semi-additive, not additive along the product dimension)
Fact Table for Grocery Store
Field name | Example Values | Description/Remarks
Date key (FK) | 1 | Surrogate key
Product key (FK) | 1 | Surrogate key
Store key (FK) | 1 | Surrogate key
EPOS transaction no. | 100 | Transaction number generated by the operational system to record sales
Sales Quantity | 2 | No. of units bought by a customer
Sales amount | 72 | Amount received by selling 2 units
Cost amount | 65 | Cost price of 2 units
Market-Basket Analysis
• What products are customers buying together?
• Beer & Diapers
• Polo Shirts & Barbie Dolls
• How do we find this out?
• Market-Basket Analysis!
• Transaction No.
• Receipt Line Grain
• Degenerate Dimension
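A very small sketch of the idea in Python: group receipt lines by transaction number, then count how often each pair of products appears in the same basket. The receipt lines are hypothetical, and real market-basket analysis typically relies on association-rule algorithms rather than this brute-force pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipt lines: (transaction no., product).
receipt_lines = [
    (100, "beer"), (100, "diapers"), (100, "chips"),
    (101, "beer"), (101, "diapers"),
    (102, "polo shirt"), (102, "barbie doll"),
    (103, "beer"),
]

# The transaction number groups receipt lines into baskets.
baskets = {}
for txn, product in receipt_lines:
    baskets.setdefault(txn, set()).add(product)

# Count every unordered pair of products bought in the same transaction.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

This only works at receipt-line grain with the transaction number kept in the fact table, which is why the transaction number is carried as a degenerate dimension.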
Promotion Dimension
• Causal Dimension
• Causal: that which causes, or is the cause of, a change in sales
• Promotion conditions include
– TPRs (temporary price reductions)
– End-aisle displays
– Newspaper ads
– Coupons
– Combinations are common
Q & A
Thank You