31
1 IST 210 Data Warehousing

Data Warehousing

Embed Size (px)

DESCRIPTION

Data Warehousing. Data Rich, but Information Poor. Data is s tored , not explored : by its volume and complexity it represents a burden, not a support Data overload results in uninformed decisions, contradictory information, higher overhead, wrong decisions, increased costs - PowerPoint PPT Presentation

Citation preview

Page 1: Data Warehousing

1

IST 210 Data Warehousing

Page 2: Data Warehousing

2

IST 210 Data Rich, but Information Poor

Data is stored, not explored : by its volume and complexity it represents a burden, not a support

Data overload results in uninformed decisions, contradictory information, higher overhead, wrong decisions, increased costs

Data is not designed and is not structured for successful management decision making

Page 3: Data Warehousing

3

IST 210 Improving Decision Making

Data

Information

Decisions

Data Warehouse

Page 4: Data Warehousing

4

IST 210 Data Warehouse Concepts

Page 5: Data Warehousing

5

IST 210 What’s a Data Warehouse?

A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarising this information to enable improved decision making.

A data warehouse is designed for easy access by users to large amounts of information, and data access is typically supported by specialized analytical tools and applications.

Page 6: Data Warehousing

6

IST 210

Data Warehouse Characteristics

Key Characteristics of a Data Warehouse

Subject-oriented Integrated Time-variant Non-volatile

Page 7: Data Warehousing

7

IST 210 Subject Oriented• Example for an insurance company :

PolicyPolicyCustomerCustomer

Data

LossesLosses PremiumPremium

Commercial and Life

Insurance Systems

Commercial and Life

Insurance Systems

Auto and Fire Policy

Processing Systems

Auto and Fire Policy

Processing Systems

Data

Accounting System

Accounting System

Claims Processing

System

Claims Processing

SystemBilling System

Billing System

Applications Area Data Warehouse

Page 8: Data Warehousing

8

IST 210 Integrated

• Data is stored once in a single integrated location(e.g. insurance company)

Data WarehouseDatabase

Subject = Customer

Auto PolicyProcessing

System

Auto PolicyProcessing

System

Customer data stored in severaldatabases

Fire PolicyProcessing

System

Fire PolicyProcessing

System

FACTS, LIFECommercial, Accounting

Applications

FACTS, LIFECommercial, Accounting

Applications

Page 9: Data Warehousing

9

IST 210 Time - Variant

Data is tagged with some element of time - creation date, as of date, etc.

Data is available on-line for long periods of time for trend analysis and forecasting. For example, five or more years

Data Warehouse Data

Time Data

{Key

• Data is stored as a series of snapshots or views which record how it is collected across time.

Page 10: Data Warehousing

10

IST 210 Non-Volatile

• Existing data in the warehouse is not overwritten or

updated. External Sources

• Read-Only

DataWarehouseDatabaseData

WarehouseEnvironment

Data Warehouse

Environment

ProductionDatabases

ProductionApplications

ProductionApplications

• Update• Insert• Delete

• Load

Page 11: Data Warehousing

11

IST 210 Transaction System vs. Data Warehouse

Page 12: Data Warehousing

12

IST 210

On-line, real time update into disparate systems

Day-to-day operations System Experts

UsersData Manipulation

Unix

VMS

MVS

Other

Transaction-Based Reporting System

Page 13: Data Warehousing

13

IST 210

BENEFIT: Reduce data processing

costs

BENEFIT: Integrated, consistent data

available for analysis

BENEFIT: Improve Network Reporting processes and

analytical capabilities

Data Staging, Transformation and Cleansing

Data Staging, Transformation and Cleansing

Interfaces

Executive Reporting and On-Line Analysis

EnvironmentOther

VMS

MVS

Unix

Su

mm

arization

OLAP

DataWarehouse

Warehouse-Based Reporting System

Page 14: Data Warehousing

14

IST 210

Transaction - Warehouse Process

TransformSummarize &

Refine

On-line, real time update.

“Transaction Based Process”

Day-to-day operations

Detailed Information to operational systems.

“Warehouse Based Process”

Decision support for management use.

Batch Load

Page 15: Data Warehousing

15

IST 210

Supports management analysis and decision-

making processes Contains summarized, refined, and cleansed

information Non-volatile -- provides a data “snapshot”;

adjustments are not permitted, or are limited Business analysis requirements drive the data

structure and system design Integrated, consistent information on a single

technology platform Users have direct, fast access via On-line

Analytical Processing tools Minimal impact on operational processes

Data Warehouse

Supports day-to-day operational processes Contains raw, detailed data that has not been

refined or cleansed Volatile -- data changes from day-to-day, with

frequent updates Technical issues drive the data structure and

system design Disparate data structures, physical locations,

query types, etc. Users rely on technical analysts for reporting

needs Operational processes impacted by queries

run off of system

Transaction System

Transaction System vs. Data Warehouse

Page 16: Data Warehousing

16

IST 210 Data Warehouse Architecture

Page 17: Data Warehousing

17

IST 210 Data Warehouse Architecture

Conversion & Interface

OLAPCubes

Ad-hocReporting

CannedReports

Data MartsStaging AreaODS

Operational System Data Warehouse

Page 18: Data Warehousing

18

IST 210

Systems are widely dispersed Systems are organized for on-line

transaction processing (OLTP) Functionality and data definitions

are typically duplicated across many systems

Data Warehouse ArchitectureOperational System Characteristics

Page 19: Data Warehousing

19

IST 210

Map source data to target Data scrubbing Derive new data Data Extraction Transform / convert data Create / modify metadata

Conversion& Cleansing

Data Warehouse ArchitectureConversion and Cleansing Activities

Page 20: Data Warehousing

20

IST 210

Contains Current and near current data

Contains almost all detail data Data is updated frequently Used to report a status

continuously and ask specific questions – not flexible

Contains historical data Contains summarized and

detailed data Data is non-volatile Used to populate the DWH,

which makes OLAP possible - flexible

ODS

Data Warehouse ArchitectureODS vs. DWH Staging Area

DWHStaging

Area

Page 21: Data Warehousing

21

IST 210

Back room Sequential Processing:

clean, combine, sort, archive, remove duplicates, add keys

Off limits to the end users

Front room ROLAP – OLAP:

subject oriented, locally implemented, user group driven

Available for end user inquiry

DWHStaging

Area

Data Warehouse ArchitectureData Staging Area vs. Presentation Area

DWHDetailed

Data

Page 22: Data Warehousing

22

IST 210

DetailedData

Metadata

Ranges from detailed to summarized data

Contains metadata Many views of the data Subject-Oriented Time-variant

SummaryData

Data Warehouse ArchitectureData Warehouse Components

Page 23: Data Warehousing

23

IST 210 Data Warehouse Model

Page 24: Data Warehousing

24

IST 210

Requirements Gathering Process Business Measure Definition

Standard definition and related business rules and formulas

Source data element(s), including quality constraints

Data granularity levels (e.g., county detail for state)

Data retention (e.g., one month, one quarter, one year, multiple years)

Priority of the information (For example, is the information necessary to derive other business measures?)

Data load frequency (e.g., monthly, quarterly, etc.)

Page 25: Data Warehousing

25

IST 210

Data Modeling Process Fact Table and Dimension

Each subject area (e.g. Business Unit) has its own Fact Table. Fact tables relate or link dimensions. Each attribute of the fact table is a measure or foreign key. “The best facts are numeric, additive and continuously

valued.” Fact tables never contain direct links to other fact tables.

Fact Table

Dimension

Best defined in focus sessions and interviews Business Unit specific and overall perspectives Typically, hierarchical in nature

Page 26: Data Warehousing

26

IST 210 Star Join Schema

Region_Dimension_Tableregion _id

NENWSESW

region _doc

NortheastNorthwestSoutheastSouthwest

account _id

100000110000120000130000140000

account _doc

ABC ElectronicsMidway ElectricVictor ComponentsWashburn, Inc.Zerox

Account_Dimension_Table

Product_Dimension_Tableprod_grp_id

102030

prod_id

100140220

prod_grp_desc

Fewer devicesCircuit boardsComponents

prod_desc

Power supplyMotherboardCo-processor

month

01-199602-199603-1996

mo_in_fiscal_yr

456

month_name

JanuaryFebruaryMarch

prod_id

100140220

region_id

SWNESW

account_id

100000110000100000

vend_id

100200300

net-sales

30,00023,00032,000

gross_sales

50,00042,00049,000

Monthly_Sales_Summary_Table

Time_Dimension_Table

Fact Table

Dimension Tables

Vendor_Dimension_Tablevend_id

100200300

vendor_desc

PowerAge, Inc.Advanced Micro DevicesFarad Incorporated

month

01-199602-199603-1996

Page 27: Data Warehousing

27

IST 210 Storage of cubes Rolap (Relational On-line Analytical Processing)

Fact table is stored in relational data bases using keys Building is faster, consulting is slower Less storage space Used for very large amounts of data

Molap (Multi-dimensional On-line Analytical Processing) Cube is stored using multidimensional tables Data is stored in the cube, using sparsity Longer building time, faster consulting time of the cube More storage space needed Used for smaller amounts of data

Holap (Hybrid On-line Analytical Processing) Cube is stored using a combination of both techniques Detailed data is stored using ROLAP Summarized data is stored using MOLAP Synergy between storage and efficiency

Page 28: Data Warehousing

28

IST 210 Multi-Dimensional Analysis

Zip Code

County

Region

State

ProductFamily

Client Type

Account

Store

ProductLine Brand

Category

GroupItem

Class of Trade

Net Sales by Brand byRegion by Client Type

Geography Dimension

Customer Dimension

Product DimensionProduct Dimension

Business Measure:Net Sales

DW0117

Page 29: Data Warehousing

29

IST 210

Application of a Data Warehouse

Page 30: Data Warehousing

30

IST 210 Application Solution Classes

Executive information system (EIS) : Present information at the highest level of summarization using

corporate business measures. They are designed for extreme ease-of-use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance

Decision Support Systems (DSS) : They ideally present information in graphical and tabular

form, providing the user with the ability to drill down on selected information. Note the increased detail and data manipulation options presented

Page 31: Data Warehousing

31

IST 210 Data Mining Data Mining provides techniques to :

Detect trends or patterns, find correlations Exploratory data analysis Forecasting and business modeling

Intelligent agents are coupled to the data warehouse using different techniques:

Neural networks Expert systems Advanced statistics

The volume and complexity of information may not become a barrier

Applications : Early warning systems, Fraud detection, market research, direct mail.