CSE5230 - Data Mining, 2001 Lecture 10.1

Data Mining - CSE5230

The Data Warehouse (DW) and Business Intelligence (BI)

CSE5230/DMS/2001/10


Lecture Outline

Overview of Data Warehousing
Data Warehouse Architecture
Overview of Business Intelligence (BI)
OLAP


What is a DW?

A data store to support data analysis or decision support
Decision support:
» a methodology to extract information from data
Decision support system:
» an arrangement of computerized tools to assist in managerial decision making
Answers questions by combining historical operational data with a business data model that reflects business activity
Data may come from both operational and external sources
» external data, e.g. industry average salaries


Data Warehouse Definitions - 1

The information in a DW is subject-oriented, non-volatile, and of an historic nature, and so DWs tend to contain extremely large datasets

The purpose of the DW is to provide the tools and facilities to manage and deliver complete, timely, accurate, and understandable business information to authorized individuals for effective business decision making

DW implementation needs a company-wide effort that requires user involvement and commitment at all levels

A successful DW implementation tracks return on investment


Data Warehouse Definitions - 2

A DW is a concept, not a product

It is the compiling, assembling, and consolidating of application data common to user communities at a single logical point

Typical use includes ad hoc queries, “what if”, data matching, trend analysis and other sophisticated information functions

Warehouse data is typically extracted from OLTP systems

A DW can be described as a read-only database that provides users with access to consolidated, historic, or static data extracted from operational databases, usually augmented with external data


Operational Data vs. the DW - 1

Integration
» Data found within the DW is ALWAYS integrated, e.g. encoding, measurements of attributes, etc. are standardized
Normalized vs. denormalized
» Operational data is normalized
Timespan
» Operational data is current
» DW data is historical
Granularity
» Operational data is at transaction level
» DW data is at an aggregation level


Operational Data vs. the DW - 2

Dimensionality
» data is clustered according to functional requirements, e.g. all orders to be delivered to a particular suburb
» data analyst requires access to all dimensions
Use
» DW is read only


MIS, or Before the DW

MIS: Management Information System
required detailed knowledge of the operational systems
no Business Information Directory
data quality is ad hoc
limited data integration from source systems
integration and querying performed by MIS specialists using 3+GL tools such as SAS, or at best performing queries using SQL against images of unintegrated operational databases


Inmon’s 12 Rules - 1

DW and operational environments are separated
DW data is integrated
DW contains historical data
DW is snapshot data captured at a particular point in time
DW data is subject-oriented


Inmon’s 12 Rules - 2

No online update
DW SDLC is data-driven
DW contains several levels of data - raw to summarized
Data sources are traced
Meta-data is a critical component
DW contains a charge back mechanism


DW Architecture

[Figure: DW architecture diagram. Recoverable labels, grouped by layer:]
Authoritative Source: source systems, external systems
Extract / Enhance / Transform Layer: copy management, extract, transform; process once; business rules; consistency & controls; value add
Data Warehouse: enterprise single-image data view; separates data from application; fully modelled & documented
Data Marts: build data for appropriate datamart; parallel process; denormalize for specific use; customise; meets specific OLAP requirements
Delivery to user: industry standard tools; tailored applications where appropriate
Other labels: Load; Business Information Directory


Source Systems/Authoritative Source

must first identify authoritative source data
Authoritative Source
» atomic data from the creating/owning source system
data propagation must be subject to a delivery contract
data propagation is asynchronous
» no reverse propagation
» no periodic synchronization
delivery must have minimal impact on operational systems


Extract/Enhance/Transform Layer

must create integrated and standardized data
deduping process happens here
denormalize into a format for direct loading into the DW
cleanse
» must remove semantic and syntactic inconsistencies
» return invalid data to the source system for repair
requires a data quality process
simple business transformations
addition of surrogate keys and time variance
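
The steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production ETL job: the record layout, the `STATE_CODES` mapping, and the function name `transform` are all hypothetical and do not come from the lecture.

```python
from datetime import date

# Hypothetical mapping used to standardize inconsistent source encodings.
STATE_CODES = {"vic": "VIC", "victoria": "VIC", "tas": "TAS", "tasmania": "TAS"}

def transform(rows, load_date):
    """Cleanse, dedupe, and key source rows; return (clean, rejected)."""
    clean, rejected, seen = [], [], set()
    next_key = 1
    for row in rows:
        state = STATE_CODES.get(str(row.get("state", "")).strip().lower())
        cust = str(row.get("cust", "")).strip().upper()
        if not cust or state is None:
            rejected.append(row)       # invalid data goes back to the source for repair
            continue
        if (cust, state) in seen:      # deduping on the standardized values
            continue
        seen.add((cust, state))
        clean.append({
            "cust_key": next_key,      # surrogate key assigned by the warehouse
            "cust": cust,
            "state": state,
            "load_date": load_date,    # time variance: when this version was loaded
        })
        next_key += 1
    return clean, rejected

source_rows = [
    {"cust": "c100", "state": "Victoria"},
    {"cust": "C100", "state": "VIC"},      # duplicate once standardized
    {"cust": "C200", "state": "tas"},
    {"cust": "C300", "state": "??"},       # invalid state code
]
clean_rows, rejected_rows = transform(source_rows, date(2001, 1, 1))
```

Here `clean_rows` holds two keyed customers and `rejected_rows` holds the record with the unrecognized state code.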


Handling Inserts/Deltas - 1

Scenarios
» additions to a (1) New or (2) Existing partition
» partitions are (1) Atomic or (2) Aggregates
New partition - atomic or aggregate
» work off-line
» do summation outside of database and use efficient tools, e.g. Syncsort or C, then SQL*LOADER
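
The "sum outside the database, then bulk load" idea can be sketched as follows. Python's `sorted` and `itertools.groupby` stand in for an external sort utility, and an in-memory SQLite table stands in for the bulk loader and the warehouse. The table name `sales_agg` and the sample rows are assumptions for illustration only.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

def summarize_offline(records):
    """Sort by (cust, prod), then sum cost per group, outside the database."""
    key = itemgetter(0, 1)
    ordered = sorted(records, key=key)
    return [(cust, prod, sum(r[2] for r in group))
            for (cust, prod), group in groupby(ordered, key=key)]

def bulk_load(conn, aggregates):
    """Load the pre-summed rows into a new partition in a single batch."""
    conn.execute("CREATE TABLE sales_agg (cust TEXT, prod TEXT, total INTEGER)")
    conn.executemany("INSERT INTO sales_agg VALUES (?, ?, ?)", aggregates)
    conn.commit()

atomic = [("C100", "P100", 510), ("C100", "P200", 2000), ("C100", "P100", 490)]
conn = sqlite3.connect(":memory:")
bulk_load(conn, summarize_offline(atomic))
loaded = conn.execute("SELECT * FROM sales_agg ORDER BY prod").fetchall()
```

The database only ever sees the finished aggregates, so the load does not compete with queries for sorting or summing work.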


Handling Inserts/Deltas - 2

Updates to an existing partition
Atomic Partition
» Unload, Sort, Reload or
» Insert directly into DB - concurrency issues
Aggregate Partition
» example: R1 (X, 1) and R2 (X, 2) give X 3, stored in the database; then R3 (X, 1) arrives
» Update directly to DW
» Unload and update out of the database
» Keep source data and re-sort and sum
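
Two of the options for an aggregate partition can be compared in a short sketch, using the slide's example of a stored total of 3 for key X and a new record R3 with value 1. SQLite stands in for the warehouse; the table name `agg` and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agg (k TEXT PRIMARY KEY, total INTEGER)")
conn.execute("INSERT INTO agg VALUES ('X', 3)")   # R1 (X, 1) + R2 (X, 2)

def apply_delta(conn, key, amount):
    """Update the stored aggregate directly in the warehouse."""
    cur = conn.execute("UPDATE agg SET total = total + ? WHERE k = ?", (amount, key))
    if cur.rowcount == 0:   # key has no aggregate yet
        conn.execute("INSERT INTO agg VALUES (?, ?)", (key, amount))
    conn.commit()

def resum(source_rows):
    """Keep the source rows and recompute every sum from scratch."""
    totals = {}
    for key, amount in source_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

apply_delta(conn, "X", 1)   # R3 (X, 1) arrives
direct_total = conn.execute("SELECT total FROM agg WHERE k = 'X'").fetchone()[0]
recomputed = resum([("X", 1), ("X", 2), ("X", 1)])
```

Both routes arrive at the same total. The direct update touches the live table, while recomputing needs the retained source rows but can run entirely outside the database.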


The Data Warehouse

contains atomic data
Star Schema structure contains
» Facts
» Dimensions
» Attributes - Surrogate keys
» Attribute Hierarchies
Key Issues
» size
» data retention period - YTD
» backup and recovery
» security


Star Schemas

a data modeling technique used to map decision support data into a relational database

this structure is based on the premise that a highly normalized data structure does not serve advanced data analysis requirements well

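[Figure: star schema. The SALES fact table is linked to four dimension tables by their keys.]
Fact Table: SALES
Dim A: Customer (Cust#)
Dim B: Product (Prod#)
Dim C: Salesrep (SalesrepID)
Dim D: Location (Loc#)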
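
A star schema like the one in this figure might be declared as follows. This is a sketch using SQLite from Python: the column names, the `cost` measure, and the sample rows are assumptions chosen to echo the slide labels, not a schema given in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE product  (prod_id TEXT PRIMARY KEY, pname TEXT);
CREATE TABLE salesrep (salesrep_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE location (loc_id TEXT PRIMARY KEY, state TEXT);
-- Fact table: one foreign key per dimension plus the measured value.
CREATE TABLE sales (
    cust_id     TEXT REFERENCES customer(cust_id),
    prod_id     TEXT REFERENCES product(prod_id),
    salesrep_id TEXT REFERENCES salesrep(salesrep_id),
    loc_id      TEXT REFERENCES location(loc_id),
    cost        INTEGER
);
INSERT INTO customer VALUES ('C100', 'Example customer');
INSERT INTO product  VALUES ('P100', 'Nuts');
INSERT INTO salesrep VALUES ('S1', 'Rep one'), ('S2', 'Rep two');
INSERT INTO location VALUES ('L1', 'VIC');
INSERT INTO sales VALUES ('C100', 'P100', 'S1', 'L1', 510),
                         ('C100', 'P100', 'S2', 'L1', 490);
""")

# Every fact row is described by joining outward to its dimensions.
rows = conn.execute("""
    SELECT c.name, p.pname, r.name, l.state, s.cost
    FROM sales s
    JOIN customer c ON c.cust_id     = s.cust_id
    JOIN product  p ON p.prod_id     = s.prod_id
    JOIN salesrep r ON r.salesrep_id = s.salesrep_id
    JOIN location l ON l.loc_id      = s.loc_id
    ORDER BY s.cost
""").fetchall()
```

Each join is a single hop from the fact table to a dimension, which is the property that gives the schema its star shape.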


Snowflake Schemas

[Figure: snowflake schema. Recoverable labels:]
Fact Table: SALES
Dim A: Customer
Dim B: Product (Prod#)
Dim C: Salesrep (SalesrepID)
Dim D: Location
Additional tables: Customer Category, Customer Address, Customer State


Fact Tables

Facts measure something of interest to an enterprise

atomic level or transactional data
summarization will reduce volume but may lose information

CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000

CUST#  PROD#  REP  DATE  COST
C100   P100   S1   1/12  $510
C100   P100   S2   2/12  $490


Dimensions

drill down to atomic data from dimensions or reference tables

A Query
» List sales of Product P100 for each State for each Month of 1999

Product      Location     Time
P#=P100      State=Each   Year=1999
PName Nuts   Region       Month=Each
PCat
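
The query on this slide maps naturally onto a filter on two dimensions and a grouping on two others. The sketch below uses SQLite from Python. The table names, the `amount` measure, the region and category values, and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prod_id TEXT PRIMARY KEY, pname TEXT, pcat TEXT);
CREATE TABLE location (loc_id INTEGER PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales    (prod_id TEXT, loc_id INTEGER, time_id INTEGER, amount INTEGER);

INSERT INTO product  VALUES ('P100', 'Nuts', 'Hardware'), ('P200', 'Bolts', 'Hardware');
INSERT INTO location VALUES (1, 'VIC', 'South'), (2, 'TAS', 'South');
INSERT INTO time_dim VALUES (1, 1999, 1), (2, 1999, 2), (3, 2000, 1);
INSERT INTO sales VALUES
    ('P100', 1, 1, 100), ('P100', 1, 1, 50), ('P100', 2, 2, 70),
    ('P200', 1, 1, 999), ('P100', 1, 3, 30);
""")

QUERY = """
    SELECT l.state, t.month, SUM(s.amount)
    FROM sales s
    JOIN location l ON l.loc_id  = s.loc_id
    JOIN time_dim t ON t.time_id = s.time_id
    WHERE s.prod_id = 'P100' AND t.year = 1999   -- P#=P100, Year=1999
    GROUP BY l.state, t.month                    -- State=Each, Month=Each
    ORDER BY l.state, t.month
"""
result = conn.execute(QUERY).fetchall()
```

The fixed values in the slide's grid become `WHERE` conditions and the "Each" entries become `GROUP BY` columns.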


Attributes & Attribute Hierarchies

each dimension table contains attributes
surrogate keys are commonly added to improve performance of joins between Fact tables and their associated Dimensions
attributes are used to search, filter or classify facts
Attribute Hierarchies: classification attributes, e.g.
» SALES_REGION: VIC, TAS


Datamarts/Customization/Cubes

customization - select only the attributes and rows of interest for export to a datamart or data cube

apply coding techniques to the attributes of interest, suitable for the search algorithm to be used

each cell of a cube is a view consisting of an aggregation of interest e.g. TOTAL_SALES
» used as a performance improving technique to pre-aggregate groupby cells

remove data not required for the problem at hand from the search algorithm
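
Pre-aggregating the group-by cells of a cube can be sketched in plain Python. The function below computes a total for every combination of kept and aggregated-away dimensions, in the spirit of the data cube operator of Gray et al. The name `build_cube`, the `ALL` marker, and the sample rows are assumptions for illustration.

```python
from itertools import combinations
from collections import defaultdict

ALL = "ALL"  # placeholder for a dimension that has been aggregated away

def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for n in range(len(dims) + 1):
        for kept in combinations(dims, n):
            for row in rows:
                cell = tuple(row[d] if d in kept else ALL for d in dims)
                cube[cell] += row[measure]
    return dict(cube)

sales = [
    {"product": "P100", "state": "VIC", "amount": 510},
    {"product": "P100", "state": "TAS", "amount": 490},
    {"product": "P200", "state": "VIC", "amount": 2000},
]
TOTAL_SALES = build_cube(sales, ("product", "state"), "amount")
```

A lookup such as `TOTAL_SALES[("P100", ALL)]` then answers "total sales of P100 across all states" without rescanning the fact rows. The cost is that the number of cells grows quickly with the number of dimensions.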


Business Intelligence & The DW

most enterprises have a data repository to allow data analysis to occur

databases provide enabling techniques
» efficient data storage and access
» query optimization

80% of knowledge discovery in databases (KDD) is the preparation of the data - this is the data warehouse

the evolution of the desktop, database, networks and AI/search has made it possible to perform KDD in commercial databases


The BI Process - 1

Understand and define the process
Perform data collection and extraction
Perform data cleaning and exploration
Data Engineering
» select attributes of interest
» select records of interest
» map attributes to suit DM algorithms


The BI Process - 2

Algorithm Engineering
» which algorithm to use
» ability to deal with
  » quality of input
  » quality of output
  » performance
Run the data mining algorithm
Preliminary evaluation of the results
Refine the data and the problem
Use the results to implement a business strategy


A BI Model

[Figure: BI model diagram. Recoverable labels: Analysis, Discovery, Pattern Recognition, Prediction/Verification, Model, Answer, Variables, Learning, Adaptive Modelling]

Return on Investment = (Profit from targeted customers buying Product X) / (Cost of Producing the Model and Predicting the Answer)


DM Techniques - a BI view

Verification Driven Data Mining Techniques
» Naive evaluation - exhaustive search
» Random walk
» ad hoc query
» OLAP
» Hypothesis testing - statistics
Discovery Driven Data Mining Techniques
» Statistical Modeling (e.g. linear regression)
» Visualization
» Rule-based and inductive learning
» Neural networks
» Genetic algorithms (an optimization technique)


OLAP: On-Line Analytical Processing

an environment for the analysis of multi-dimensional data
» dice
» rotate
» drill-down
» rollup
OLAP provides
» advanced database support involving attribute selection, attribute encoding, row sampling, data cleansing
» the use of multiple different search engines
» an easy-to-use user interface
» an open system architecture using local processing power
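
The four operations named on this slide can be illustrated on a small list of fact rows. The sketch below reads them in their usual sense: dice filters on dimension values, rollup aggregates dimensions away, drill-down adds a dimension back, and rotate swaps the axes of a result. The function names and sample data are hypothetical, and a real OLAP engine would do this over pre-built cubes rather than Python lists.

```python
from collections import defaultdict

facts = [
    {"product": "P100", "state": "VIC", "month": 1, "amount": 100},
    {"product": "P100", "state": "TAS", "month": 1, "amount": 70},
    {"product": "P100", "state": "VIC", "month": 2, "amount": 50},
    {"product": "P200", "state": "VIC", "month": 1, "amount": 999},
]

def dice(rows, **allowed):
    """Keep rows whose dimension values fall in the allowed sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def rollup(rows, dims):
    """Sum amount over every dimension not named in dims."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[d] for d in dims)] += r["amount"]
    return dict(out)

def rotate(table):
    """Swap the two axes of a two-dimensional result."""
    return {(b, a): v for (a, b), v in table.items()}

p100 = dice(facts, product={"P100"})
by_state = rollup(p100, ("state",))                 # coarse view
by_state_month = rollup(p100, ("state", "month"))   # drill-down: add a dimension
by_month_state = rotate(by_state_month)
```

Drill-down is expressed here as a rollup with one more dimension retained, which shows that the two operations move in opposite directions along the same hierarchy of detail.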


References

Rob, P. & Coronel, C. Database Systems: Design, Implementation, and Management, 3rd Ed., Nelson, 1997

Inmon, W. H. - numerous. See http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example

Kimball, R. - numerous. See http://www.rkimball.com/

Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii International Conference on System Sciences, 1998

Lee, A. J. and Rundensteiner, E. A. Data Warehouse Evolution: Consistent Metadata Management.

Gray, J. et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery 1, pp. 29-53, 1997

Maier, D. et al. Selected Research Issues in Decision Support Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998
