
Page 1: L1.pptx

Data Warehousing

Vikas Singh
Computer Science Department

BITS, Pilani

Page 2: L1.pptx

Background

• 1980s to early 1990s
– Focus on computerizing business processes
– To gain competitive advantage

• By early 1990s
– All companies had operational systems
– Operational systems no longer offered any advantage

• How to get competitive advantage?

04/28/2023 2SS ZG515, Data Warehousing

Page 3: L1.pptx

OLTP Systems: Primary Purpose

Run the operations of the business

• For example: banks, railway reservation, etc.
• Based on ER Data Modeling
• Transaction based system
• Data is always current valued
• Little history is available
• Data is highly volatile
• Has “Intelligent keys”

Page 4: L1.pptx

OLTP Systems

• Has relational normalized design
• Redundant data is undesirable
• Consists of many tables
• High volume retrieval is inefficient
• Optimized for repetitive “narrow” queries
• Common data in many applications

Page 5: L1.pptx

Need for Data Warehousing

• Companies, over the years, gathered huge volumes of data
• “Hidden Treasure”
• Can this data be used in any way?
• Can we analyze this data to get any competitive advantage?
• If yes, what kind of advantage?

Page 6: L1.pptx

Benefits of Data Warehousing

• Allows “efficient” analysis of data
• Competitive Advantage
• Analysis aids strategic decision making
• Increased productivity of decision makers
• Potential high ROI
• Classic example: Diapers and Beer

Page 7: L1.pptx

Decision Support Systems, DW, & OLAP

• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
• Data Warehouse is a DSS
• A data warehouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems.
• Data Warehouse is not an Intelligent system
• On-Line Analytical Processing (OLAP) is an element of DSS

Page 8: L1.pptx

DW: Interesting Statistics

[Bar chart comparing 2000 and 2003 on three series: Investment ($ BN), Users, Size (GB). Data labels in the extracted text: 37.4, 626393, 148.5, 2178, 1100; the mapping of labels to bars is not recoverable.]

Page 9: L1.pptx

Data Warehouse: Characteristics

• Analysis driven
• Ad-hoc queries
• Complex queries
• Used by top managers
• Based on Dimensional Modeling
• Denormalized structures

Page 10: L1.pptx

Data Warehouse: Major Players

• SAS Institute
• IBM
• Oracle
• Sybase
• Microsoft
• HP
• Cognos
• Business Objects

Page 11: L1.pptx

Data Warehouse

• A decision support database that is maintained separately from the organization’s operational databases.
• A data warehouse is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in organizational decision making

Page 12: L1.pptx

Subject Oriented

• Data Warehouse is designed around “subjects” rather than processes
• A company may have
– Retail Sales System
– Outlet Sales System
– Catalog Sales System
• Problems Galore!!!
• DW will have a Sales Subject Area

Page 13: L1.pptx

Subject Oriented

[Diagram: Retail Sales System, Outlet Sales System, and Catalog Sales System combined into a single Sales Subject Area]

Page 14: L1.pptx

Integrated

• Heterogeneous Source Systems
• Little or no control
• Need to Integrate source data
• For Example: Product codes could be different in different systems
• Arrive at common code in DW
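The product-code example can be sketched as a lookup from (source system, local code) to a common warehouse code. This is only an illustration: the system names reuse the three sales systems mentioned earlier, and every code and the helper names `to_common_code` and `integrate` are made up.

```python
# Hypothetical mapping: (source system, local product code) -> common DW code.
CODE_MAP = {
    ("retail", "P-001"): "PRD1",
    ("outlet", "1001"): "PRD1",
    ("catalog", "TISSUE_SM"): "PRD1",
    ("retail", "P-002"): "PRD2",
}


def to_common_code(system: str, local_code: str) -> str:
    """Return the common DW product code, or raise if the code is unmapped."""
    try:
        return CODE_MAP[(system, local_code)]
    except KeyError:
        raise ValueError(f"no common code for {system}:{local_code}") from None


def integrate(rows):
    """Total units per common code from rows of (system, local_code, units)."""
    totals = {}
    for system, local_code, units in rows:
        common = to_common_code(system, local_code)
        totals[common] = totals.get(common, 0) + units
    return totals


source_rows = [("retail", "P-001", 5), ("outlet", "1001", 3), ("retail", "P-002", 7)]
print(integrate(source_rows))
```

Raising on an unmapped code, rather than passing it through, keeps unknown products from silently entering the warehouse under a source-specific code.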

Page 15: L1.pptx

Non-Volatile (Read-Mostly)

[Diagram: with OLTP, the user both writes and reads; with the DW, the user only reads]

Page 16: L1.pptx

Time Variant

• Most business analysis has a time component
• Trend Analysis (historical data is required)

[Chart: Sales by year, 2001 to 2004]

Page 17: L1.pptx

Data Warehousing Architecture

[Architecture diagram; labels as extracted]
• Operational dbs, External Sources
• Extract, Transform, Load, Refresh
• Metadata Repository
• Monitoring & Administration
• Data Marts
• Serve
• OLAP servers
• Analysis, Query/Reporting, Data Mining

Page 18: L1.pptx

Populating & Refreshing the Warehouse

• Data Extraction
• Data Cleaning
• Data Transformation
– Convert from legacy/host format to warehouse format
• Load
– Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
• Refresh
– Bring new data from source systems

Page 19: L1.pptx

ETL Process: Issues & Challenges

• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Source systems operating in different time zones
• Different currencies
• Different measurement units
• Data not captured by OLTP systems
• Ensuring data quality
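Three of the challenges above (time zones, currencies, measurement units) come down to converting every source record to one agreed convention during transformation. The following is a minimal sketch under stated assumptions: the record layout, the `normalize` helper, and the exchange rate are invented for illustration, and a real ETL job would take rates and offsets from reference data.

```python
from datetime import datetime, timezone

# Assumed reference data for illustration only (the INR rate is made up).
RATE_TO_USD = {"USD": 1.0, "INR": 0.0125}
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}


def normalize(record):
    """Convert one source record to UTC time, USD amounts, and kilograms."""
    local = datetime.fromisoformat(record["sold_at"])  # string carries a UTC offset
    return {
        "sold_at_utc": local.astimezone(timezone.utc).isoformat(),
        "amount_usd": round(record["amount"] * RATE_TO_USD[record["currency"]], 2),
        "weight_kg": round(record["weight"] * KG_PER_UNIT[record["unit"]], 3),
    }


source_record = {
    "sold_at": "2003-01-15T10:30:00+05:30",
    "amount": 800,
    "currency": "INR",
    "weight": 2,
    "unit": "lb",
}
clean = normalize(source_record)
print(clean)
```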

Page 20: L1.pptx

Data Staging Area

• A storage area where extracted data is
– Cleaned
– Transformed
– Deduplicated
• Initial storage for data
• Need not be based on Relational model
• Spread over a number of machines
• Mainly sorting and Sequential processing
• COBOL or C code running against flat files
• Does not provide data access to users
• Analogy – kitchen of a restaurant

Page 21: L1.pptx

Presentation Servers

• A target physical machine on which DW data is organized for direct querying by end users using
– OLAP
– Report writers
– Data Visualization tools
– Data mining tools
• Data stored in Dimensional framework
• Analogy – Sitting area of a restaurant

Page 22: L1.pptx

Data Cleaning

• Why?
– Data warehouse contains data that is analyzed for business decisions
– More data and multiple sources could mean more errors in the data, and such errors are harder to trace
– Results in incorrect analysis
• Detecting data anomalies and rectifying them early has huge payoffs
• Long Term Solution
– Change business practices and data entry tools
– Repository for meta-data

Page 23: L1.pptx

Soundex Algorithms

• Misspelled terms
• For example NAMES
• Phonetic algorithms – can find similar sounding names
• Based on the six phonetic classifications of human speech sounds
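As a sketch of the idea, here is the common American Soundex variant, which may differ in detail from whatever variant the lecture had in mind. The six digit groups in `GROUPS` correspond to the six phonetic classes; the function name and the test names are illustrative.

```python
# Six consonant groups of American Soundex; vowels, H, W and Y carry no digit.
GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]


for name in ("Robert", "Rupert", "Smith", "Smyth"):
    print(name, soundex(name))
```

Names that share a code (here Robert/Rupert and Smith/Smyth) become candidates for the same person during data cleaning; a human or a stricter rule still has to confirm the match.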

Page 24: L1.pptx

Data Warehouse Design

• OLTP Systems are Data Capture Systems
• “DATA IN” systems
• DW are “DATA OUT” systems

[Diagram: OLTP and DW]

Page 25: L1.pptx

Analyzing the DATA

• Active Analysis – User Queries
– User-guided data analysis
– Show me how X varies with Y
– OLAP
• Automated Analysis – Data Mining
– What’s in there?
– Set the computer FREE on your data
– Supervised Learning (classification)
– Unsupervised Learning (clustering)

Page 26: L1.pptx

OLAP Queries

• How much of product P1 was sold in 1999 state wise?
• Top 5 selling products in 2002
• Total Sales in Q1 of FY 2002-03?
• Color wise sales figure of cars from 2000 to 2003
• Model wise sales of cars for the month of Jan from 2000 to 2003
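The first question in the list (product P1, 1999, state wise) is a filter plus a group-by. A minimal sketch using Python's built-in `sqlite3` module follows; the single flat `sales` table, its columns, and all rows are assumptions made for the example, not a schema from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "Rajasthan", 1999, 10),
        ("P1", "Rajasthan", 1999, 5),
        ("P1", "Gujarat", 1999, 7),
        ("P1", "Gujarat", 2000, 9),   # wrong year, excluded by the filter
        ("P2", "Gujarat", 1999, 4),   # wrong product, excluded by the filter
    ],
)


def units_by_state(conn, product, year):
    """Total units of one product in one year, grouped by state."""
    cursor = conn.execute(
        "SELECT state, SUM(units) FROM sales "
        "WHERE product = ? AND year = ? "
        "GROUP BY state ORDER BY state",
        (product, year),
    )
    return cursor.fetchall()


print(units_by_state(conn, "P1", 1999))
```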

Page 27: L1.pptx

Data Mining Investigations

• Which type of customers are more likely to spend most with us in the coming year?
• What additional products are most likely to be sold to customers who buy sportswear?
• In which area should we open a new store in the next year?
• What are the characteristics of customers most likely to default on their loans before the year is out?

Page 28: L1.pptx

Continuum of Analysis

[Diagram: a continuum from OLTP (Primitive & Canned Analysis) through OLAP (Complex Ad-hoc Analysis) to Data Mining (Automated Analysis), with tooling ranging from SQL to Specialized Algorithms]

Page 29: L1.pptx

Design Requirements

• Design of the DW must directly reflect the way the managers look at the business
• Should capture the measurements of importance along with parameters by which these measurements are viewed
• It must facilitate data analysis, i.e., answering business questions

Page 30: L1.pptx

ER Modeling

• A logical design technique that seeks to eliminate data redundancy
• Illuminates the microscopic relationships among data elements
• Perfect for OLTP systems
• Responsible for success of transaction processing in Relational Databases

Page 31: L1.pptx

Problems with ER Model

ER models are NOT suitable for DW?

• End user cannot understand or remember an ER Model
• Many DWs have failed because of overly complex ER designs
• Not optimized for complex, ad-hoc queries
• Data retrieval becomes difficult due to normalization
• Browsing becomes difficult

Page 32: L1.pptx

ER vs Dimensional Modeling

• ER models are constituted to
– Remove redundant data (normalization)
– Facilitate retrieval of individual records having certain critical identifiers
– Thereby optimizing OLTP performance
• Dimensional model supports the reporting and analytical needs of a data warehouse system.

Page 33: L1.pptx

Dimensional Modeling: Salient Features

• Represents data in a standard framework
• Framework is easily understandable by end users
• Contains same information as ER model
• Packages data in symmetric format
• Resilient to change
• Facilitates data retrieval/analysis

Page 34: L1.pptx

Dimensional Modeling: Vocabulary

• Measures or facts
• Facts are “numeric” & “additive”
• For example: Sale Amount, Sale Units
• Factors or dimensions
• Star Schemas
• Snowflake & Starflake Schemas

Sales Amt = f (Product, Location, Time)
(Sales Amt is the fact; Product, Location, and Time are the dimensions)

Page 35: L1.pptx

Star Schema

[Diagram: Sales Fact Table at the center, linked by foreign keys (FK) to the Location, Promotion, Product, and Time dimensions]

Page 36: L1.pptx

Dimensional Modeling

• Facts are stored in FACT tables
• Dimensions are stored in DIMENSION tables
• Dimension tables contain textual descriptors of business
• Fact and dimension tables form a Star Schema
• “BIG” fact table in center surrounded by “SMALL” dimension tables

Page 37: L1.pptx

The “Classic” Star Schema

[Diagram: a fact table joined to three dimension tables on their keys]

• Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars_sold, Units, Dollars_cost
• Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
• Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
• Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day
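A cut-down version of this star schema can be sketched with Python's built-in `sqlite3` module. The table names, the subset of columns, and all rows are assumptions for illustration; the query shows the typical pattern of joining the fact table to its dimensions and grouping by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER,
                          month INTEGER, day INTEGER);
-- The fact table's key is the combination of its dimension keys.
CREATE TABLE sales_fact (
    store_key    INTEGER REFERENCES store_dim(store_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    period_key   INTEGER REFERENCES time_dim(period_key),
    dollars_sold REAL,
    units        INTEGER,
    dollars_cost REAL,
    PRIMARY KEY (store_key, product_key, period_key)
);
INSERT INTO store_dim VALUES (1, 'Store 1', 'Pilani', 'Rajasthan');
INSERT INTO product_dim VALUES (1, 'Tissue paper', 'BrandA'), (2, 'Paper towel', 'BrandB');
INSERT INTO time_dim VALUES (1, 2003, 1, 3, 28), (2, 2003, 1, 3, 29);
INSERT INTO sales_fact VALUES
    (1, 1, 1, 250.0, 25, 200.0),
    (1, 2, 1, 350.0, 35, 300.0),
    (1, 1, 2, 100.0, 10, 80.0);
""")

# Dimensions act as entry points: filter and group on their attributes,
# then sum the additive facts.
QUERY = """
SELECT p.brand, t.year, SUM(f.dollars_sold), SUM(f.units)
FROM sales_fact AS f
JOIN product_dim AS p ON p.product_key = f.product_key
JOIN time_dim AS t ON t.period_key = f.period_key
GROUP BY p.brand, t.year
ORDER BY p.brand
"""
result = conn.execute(QUERY).fetchall()
print(result)
```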

Page 38: L1.pptx

Fact Tables

• Contains numerical measurements of the business
• Each measurement is taken at the intersection of all dimensions
• Intersection is the composite key
• Represents Many-to-many relationships between dimensions
• Examples of facts
– Sale_amt, Units_sold, Cost, Customer_count

Page 39: L1.pptx

Dimension Tables

• Contains attributes for dimensions
• 50 to 100 attributes common
• Best attributes are textual and descriptive
• DW is only as good as the dimension attributes
• Contains hierarchical information albeit redundantly
• Entry points into the fact table

Page 40: L1.pptx

Types of Facts

• Fully additive: additive across all dimensions
– Units_sold, Sales_amt
• Semi-additive: additive across some dimensions
– Account_balance, Customer_count
– Example rows:
28/3, tissue paper, store1, 25, 250, 20
28/3, paper towel, store1, 35, 350, 30
– Is the number of customers who bought either tissue paper or paper towel 50? NO. A customer who bought both products is counted in both rows.
• Non-additive: additive across no dimension
– Gross margin = Gross profit/amount
– Note that GP and Amount are fully additive
– Ratio of the sums and not sum of the ratios
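The three kinds of facts can be checked with a few lines of arithmetic. In this sketch the column meanings of the two example rows (units sold, sales amount, customer count) are an interpretation of the slide, and the customer ids and gross-profit figures are invented.

```python
# Assumed columns: date, product, store, units_sold, sales_amt, customer_count.
rows = [
    ("28/3", "tissue paper", "store1", 25, 250, 20),
    ("28/3", "paper towel", "store1", 35, 350, 30),
]

# Fully additive: summing across products gives a meaningful total.
total_units = sum(r[3] for r in rows)
total_amt = sum(r[4] for r in rows)

# Semi-additive: customer counts overlap across products.
# Hypothetical customer ids; 10 customers bought both products.
tissue_buyers = set(range(1, 21))    # 20 customers
towel_buyers = set(range(11, 41))    # 30 customers
naive_count = len(tissue_buyers) + len(towel_buyers)
distinct_count = len(tissue_buyers | towel_buyers)

# Non-additive: gross margin must be the ratio of the sums.
# Hypothetical (gross_profit, amount) pairs for the two rows.
lines = [(50, 250), (35, 350)]
ratio_of_sums = sum(gp for gp, _ in lines) / sum(amt for _, amt in lines)
sum_of_ratios = sum(gp / amt for gp, amt in lines)

print(total_units, total_amt)
print(naive_count, distinct_count)
print(round(ratio_of_sums, 4), round(sum_of_ratios, 4))
```

With these made-up numbers the summed margins come to about 0.3 while the true overall margin is about 0.14, which is why the slide insists on the ratio of the sums.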

Page 41: L1.pptx

Data Warehouse: Design Steps

Step 1: Identify the Business Process

Step 2: Declare the Grain

Step 3: Identify the Dimensions

Step 4: Identify the Facts


Page 42: L1.pptx

Grocery Store: The Universal Example

The Scenario:
• Chain of 100 Grocery Stores
• 60000 individual products in each store
• 10000 of these products sold on any given day (average)
• 3 year data

Page 43: L1.pptx

Some Terms

• SKU (Stock Keeping Unit)
• UPC (Universal Product Code)
• EPOS (Electronic Point of Sale)

Page 44: L1.pptx

What Management is Interested In?

• Ordering logistics
• Stocking shelves
• Selling products
• Maximize profits

Page 45: L1.pptx

Grocery Store DW

• Step 1: Sales Business Process
• Step 2: Daily Grain
• A word about GRANULARITY
– Temp sensor data: per ms, sec, min, hr?
– Size of the DW is governed by granularity
– Daily grain (club products sold on a day for each store): aggregated data
– Receipt line grain (each line in the receipt is recorded): finest grain data
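The difference between the two grains can be seen by rolling receipt lines up to one row per (date, store, product). This is a sketch with invented data; the tuple layout and the helper name `to_daily_grain` are assumptions.

```python
from collections import defaultdict

# Hypothetical receipt lines: (date, store, product, transaction_no, units, amount)
receipt_lines = [
    ("2003-03-28", "store1", "tissue paper", 100, 2, 20.0),
    ("2003-03-28", "store1", "tissue paper", 101, 1, 10.0),
    ("2003-03-28", "store1", "paper towel", 100, 3, 30.0),
    ("2003-03-29", "store1", "tissue paper", 102, 4, 40.0),
]


def to_daily_grain(lines):
    """Aggregate receipt-line rows to one row per (date, store, product)."""
    daily = defaultdict(lambda: [0, 0.0])
    for date, store, product, _txn_no, units, amount in lines:
        totals = daily[(date, store, product)]
        totals[0] += units
        totals[1] += amount
    return {key: tuple(value) for key, value in daily.items()}


daily = to_daily_grain(receipt_lines)
for key, value in sorted(daily.items()):
    print(key, value)
```

The roll-up shrinks the table but discards the transaction number, so questions about what was bought together can no longer be answered from the daily grain.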

Page 46: L1.pptx

Grocery Store: DW Size Estimate

• Daily Grain
• Size of Fact Table = 100*10000*3*365 = 1095 million records
• 3 facts & 4 dimensions (49 bytes)
• 1095 m * 49 bytes = 53655 m bytes
• i.e. ~ 50 GB
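The estimate can be reproduced step by step. The inputs are the figures from the scenario; reading "GB" as 2^30 bytes is an assumption that makes the result round to 50 (in decimal gigabytes it is about 53.7).

```python
stores = 100
products_sold_per_day = 10_000   # per store, on average
years = 3
days_per_year = 365
bytes_per_record = 49            # figure given for 3 facts and 4 dimension keys

records = stores * products_sold_per_day * years * days_per_year
size_bytes = records * bytes_per_record
size_gib = size_bytes / 2**30    # assumed reading of "GB" as 2**30 bytes
size_gb_decimal = size_bytes / 10**9

print(f"{records:,} records")
print(f"{size_bytes:,} bytes")
print(f"{size_gib:.1f} GiB, {size_gb_decimal:.1f} GB (decimal)")
```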

Page 47: L1.pptx

Facts for Grocery Store

1. Quantity sold (additive)
2. Dollar revenue (additive)
3. Dollar cost (additive)
4. Customer count (semi-additive, not additive along the product dimension)

Page 48: L1.pptx

Fact Table for Grocery Store

Field name | Example value | Description/Remarks
Date key (FK) | 1 | Surrogate key
Product key (FK) | 1 | Surrogate key
Store key (FK) | 1 | Surrogate key
EPOS transaction no. | 100 | Transaction number generated by the operational system to record sales
Sales quantity | 2 | No. of units bought by a customer
Sales amount | 72 | Amount received by selling 2 units
Cost amount | 65 | Cost price of 2 units

Page 49: L1.pptx

Market-Basket Analysis

• What products are customers buying together?
• Beer & Diapers
• Polo Shirts & Barbie Dolls
• How do we find this out?
• Market-Basket Analysis!
• Transaction No.
• Receipt Line Grain
• Degenerate Dimension
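With receipt-line grain and the transaction number kept in the fact table, the simplest form of market-basket analysis is to group lines by transaction and count product pairs. This is a sketch with invented receipt lines, not a full association-rule miner; `pair_counts` is an illustrative name.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical receipt lines: (transaction_no, product)
receipt_lines = [
    (100, "beer"), (100, "diapers"), (100, "chips"),
    (101, "beer"), (101, "diapers"),
    (102, "diapers"), (102, "milk"),
]


def pair_counts(lines):
    """Count how many transactions contain each unordered pair of products."""
    baskets = defaultdict(set)
    for txn_no, product in lines:
        baskets[txn_no].add(product)
    counts = Counter()
    for products in baskets.values():
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(products), 2))
    return counts


counts = pair_counts(receipt_lines)
print(counts.most_common(1))
```

The grouping step is only possible because the transaction number survives in the fact table as a degenerate dimension; at daily grain the baskets cannot be rebuilt.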

Page 50: L1.pptx

Promotion Dimension

• Causal Dimension
• Which causes or being the cause
• Promotion conditions include
– TPRs (temporary price reductions)
– End-aisle displays
– Newspaper ads
– Coupons
– Combinations are common

Page 51: L1.pptx

Q & A

Page 52: L1.pptx

Thank You