CSE5230 - Data Mining, 2001 Lecture 10.1
Data Mining - CSE5230
The Data Warehouse (DW) and Business Intelligence (BI)
CSE5230/DMS/2001/10
CSE5230 - Data Mining, 2001 Lecture 10.2
Lecture Outline
Overview of Data Warehousing
Data Warehouse Architecture
Overview of Business Intelligence (BI)
OLAP
CSE5230 - Data Mining, 2001 Lecture 10.3
What is a DW?
A data store to support data analysis or decision support
  Decision support:
    » a methodology to extract information from data
  Decision support system:
    » an arrangement of computerized tools to assist in managerial decision making
Answers questions by combining historical operational data with a business data model that reflects business activity
Data may come from both operational and external sources
  external data - e.g. industry average salaries
CSE5230 - Data Mining, 2001 Lecture 10.4
Data Warehouse Definitions - 1
The information in a DW is subject-oriented, non-volatile, and of an historic nature, and so DWs tend to contain extremely large datasets
The purpose of the DW is to provide the tools and facilities to manage and deliver complete, timely, accurate, and understandable business information to authorized individuals for effective business decision making
DW implementation needs a company-wide effort that requires user involvement and commitment at all levels
A successful DW implementation tracks return on investment
CSE5230 - Data Mining, 2001 Lecture 10.5
Data Warehouse Definitions - 2
A DW is a concept, not a product
It is the compiling, assembling, and consolidating of application data common to user communities at a single logical point
Typical use includes ad hoc queries, “what if”, data matching, trend analysis and other sophisticated information functions
Warehouse data is typically extracted from OLTP systems
A DW can be described as a read-only database that provides users with access to consolidated, historic, or static data extracted from operational databases, usually augmented with external data
CSE5230 - Data Mining, 2001 Lecture 10.6
Operational Data vs. the DW - 1
Integration
  Data found within the DW is ALWAYS integrated, e.g. encoding, measurements of attributes, etc. are standardized
Normalized vs. denormalized
  Operational data is normalized
Timespan
  Operational data is current
  DW data is historical
Granularity
  Operational data is at transaction level
  DW data is at an aggregation level
CSE5230 - Data Mining, 2001 Lecture 10.7
Operational Data vs. the DW - 2
Dimensionality
  data is clustered according to functional requirements, e.g. all orders to be delivered to a particular suburb
  data analyst requires access to all dimensions
Use
  DW is read only
CSE5230 - Data Mining, 2001 Lecture 10.8
MIS, or Before the DW
MIS: Management Information System
  required detailed knowledge of the operational systems
  no Business Information Directory
  data quality is ad hoc
  limited data integration from source systems
  integration and querying performed by MIS specialists using 3+GL tools such as SAS
  or at best performing queries using SQL against images of unintegrated operational databases
CSE5230 - Data Mining, 2001 Lecture 10.9
Inmon’s 12 Rules - 1
DW and operational environments are separated
Integrated DW data
DW contains historical data
DW is snapshot data captured at a particular point in time
DW data is subject-oriented
CSE5230 - Data Mining, 2001 Lecture 10.10
Inmon’s 12 Rules - 2
No online update
DW SDLC is data-driven
DW contains several levels of data - raw to summarized
Data sources are traced
Meta-data is a critical component
DW contains a charge back mechanism
CSE5230 - Data Mining, 2001 Lecture 10.11
DW Architecture
[Figure: DW architecture diagram. Labels recovered from the diagram:]
Authoritative Source: Source Systems, External systems
Extract / Enhance / Transform Layer: Copy mgt, Extract, Transform; Process once; Business rules; Consistency & controls; Value add
Data Warehouse: Enterprise single image data view; Separates data from application; Fully modelled & documented
Build data for appropriate datamart: Parallel process; Denormalize for specific use; Customise
DataMarts: Meets specific OLAP requirements
Delivery to user: Industry standard tools; Tailored applications where appropriate
Load
Business Information Directory
CSE5230 - Data Mining, 2001 Lecture 10.12
Source Systems/Authoritative Source
must first identify authoritative source data
Authoritative Source
  atomic data from the creating/owning source system
data propagation must be subject to a delivery contract
data propagation is asynchronous
  no reverse propagation
  no periodic synchronization
delivery must have minimal impact on operational systems
CSE5230 - Data Mining, 2001 Lecture 10.13
Extract/Enhance/Transform Layer
must create integrated and standardized data
  deduping process happens here
  denormalize into a format for direct loading into the DW
cleanse
  must remove semantic and syntactic inconsistencies
  return invalid data to the source system for repair
  requires a data quality process
simple business transformations
addition of surrogate keys and time variance
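The steps on this slide can be illustrated with a minimal Python sketch. It is not taken from the lecture: the `transform` function, the field names, and the gender encoding map are all hypothetical, chosen only to show standardization, deduping, rejection of invalid rows, and the addition of a surrogate key and a load date.

```python
from datetime import date

def transform(rows, load_date):
    """Standardize, dedupe, and key a batch of extracted customer rows."""
    # Hypothetical encoding map: each source system encodes gender differently.
    gender_codes = {"m": "M", "male": "M", "f": "F", "female": "F"}
    seen = set()
    clean, rejected = [], []
    for row in rows:
        code = gender_codes.get(str(row["gender"]).strip().lower())
        if code is None:
            rejected.append(row)  # invalid data is returned to the source for repair
            continue
        natural_key = row["cust_no"].strip().upper()
        if natural_key in seen:
            continue  # dedupe on the standardized natural key
        seen.add(natural_key)
        clean.append({
            "cust_key": len(clean) + 1,  # surrogate key assigned by the warehouse
            "cust_no": natural_key,
            "gender": code,
            "load_date": load_date,      # time variance: when this version was loaded
        })
    return clean, rejected

rows = [
    {"cust_no": "c100", "gender": "male"},
    {"cust_no": "C100 ", "gender": "M"},
    {"cust_no": "C200", "gender": "f"},
    {"cust_no": "C300", "gender": "?"},
]
clean, rejected = transform(rows, date(2001, 1, 1))
```

Here the second row is dropped as a duplicate of the first once both keys are standardized, and the last row is rejected because its gender code cannot be mapped.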
CSE5230 - Data Mining, 2001 Lecture 10.14
Handling Inserts/Deltas - 1
Scenarios
  additions to a (1) New or (2) Existing partition
  partitions are (1) Atomic or (2) Aggregates
New partition - atomic or aggregate
  work off-line
  do summation outside of database and use efficient tools, e.g. Syncsort or C, then SQL*LOADER
CSE5230 - Data Mining, 2001 Lecture 10.15
Handling Inserts/Deltas - 2
Updates to an existing partition
  Atomic Partition
    » Unload, Sort, Reload or
    » Insert directly into DB - concurrency issues
  Aggregate Partition
      R1  X  1
      R2  X  2
          X  3  - stored in database
      R3  X  1
    Update directly to DW
    Unload and update out of the database
    Keep source data and re-sort sum
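A small Python sketch of two of the options for an aggregate partition. It assumes one reading of the slide's example: records R1 and R2 for key X (values 1 and 2) are already stored as the aggregate X = 3, and a new record R3 (value 1) arrives. The function names are hypothetical.

```python
from collections import defaultdict

def apply_delta(aggregates, delta_rows):
    """Update the stored aggregate directly with the newly arrived rows."""
    updated = dict(aggregates)
    for key, value in delta_rows:
        updated[key] = updated.get(key, 0) + value
    return updated

def rebuild(source_rows):
    """Keep the source data, re-sort it, and recompute the sums from scratch."""
    totals = defaultdict(int)
    for key, value in sorted(source_rows):
        totals[key] += value
    return dict(totals)

stored = {"X": 3}               # aggregate of R1 (X, 1) and R2 (X, 2)
history = [("X", 1), ("X", 2)]  # R1, R2 kept as source data
delta = [("X", 1)]              # R3, the new record

in_place = apply_delta(stored, delta)
resummed = rebuild(history + delta)
```

Both routes arrive at X = 4. The direct update touches only the stored aggregate, while the rebuild needs the retained source rows but never has to trust the previously stored value.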
CSE5230 - Data Mining, 2001 Lecture 10.16
The Data Warehouse
contains atomic data
Star Schema structure contains
  » Facts
  » Dimensions
  » Attributes - Surrogate keys
  » Attribute Hierarchies
Key Issues
  size
  data retention period - YTD
  backup and recovery
  security
CSE5230 - Data Mining, 2001 Lecture 10.17
Star Schemas
a data modeling technique used to map decision support data into a relational database
this structure is based on the premise that a highly normalized data structure does not serve advanced data analysis requirements well
[Figure: star schema. Fact table SALES at the centre, joined to four dimension tables:]
DimA Customer (Cust#)
DimB Product (Prod#)
DimC Salesrep (SalesrepID)
DimD Location (Loc#)
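The star layout can be sketched with Python's built-in `sqlite3` module. The table and key names follow the diagram labels; the column types, the `amount` measure, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension, each with its own key.
CREATE TABLE customer (cust_no     TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE product  (prod_no     TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE salesrep (salesrep_id TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE location (loc_no      TEXT PRIMARY KEY, state TEXT);

-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales (
    cust_no     TEXT REFERENCES customer(cust_no),
    prod_no     TEXT REFERENCES product(prod_no),
    salesrep_id TEXT REFERENCES salesrep(salesrep_id),
    loc_no      TEXT REFERENCES location(loc_no),
    amount      INTEGER
);

INSERT INTO customer VALUES ('C100', 'Acme');
INSERT INTO product  VALUES ('P100', 'Nuts'), ('P200', 'Bolts');
INSERT INTO salesrep VALUES ('S1', 'Rep one'), ('S2', 'Rep two');
INSERT INTO location VALUES ('L1', 'VIC'), ('L2', 'TAS');
INSERT INTO sales VALUES
    ('C100', 'P100', 'S1', 'L1', 510),
    ('C100', 'P100', 'S2', 'L2', 490),
    ('C100', 'P200', 'S1', 'L1', 2000);
""")

# A typical decision-support query: one join per dimension that is needed.
by_state = conn.execute("""
    SELECT l.state, SUM(s.amount)
    FROM sales s JOIN location l ON l.loc_no = s.loc_no
    GROUP BY l.state
    ORDER BY l.state
""").fetchall()
```

Each analytical query reaches a dimension through a single join from the fact table, which is the property the star shape is meant to provide.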
CSE5230 - Data Mining, 2001 Lecture 10.18
Snowflake Schemas
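[Figure: snowflake schema. Fact table SALES joined to DimA Customer, DimB Product (Prod#), DimC Salesrep (SalesrepID) and DimD Location, with the Customer dimension further normalized into Customer Category, Customer Address and Customer State tables.]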
CSE5230 - Data Mining, 2001 Lecture 10.19
Fact Tables
Facts measure something of interest to an enterprise
atomic level or transactional data
summarization will reduce volume but may lose information

CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000

CUST#  PROD#  REP  DATE  COST
C100   P100   S1   1/12  $510
C100   P100   S2   2/12  $490
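The trade-off can be seen by summarizing the two atomic rows shown on the slide. This is a minimal sketch; the `summarize` helper is hypothetical.

```python
from collections import defaultdict

# Atomic rows from the slide: (CUST#, PROD#, REP, DATE, COST)
atomic = [
    ("C100", "P100", "S1", "1/12", 510),
    ("C100", "P100", "S2", "2/12", 490),
]

def summarize(rows):
    """Collapse atomic rows to one total per (customer, product)."""
    totals = defaultdict(int)
    for cust, prod, _rep, _date, cost in rows:
        totals[(cust, prod)] += cost  # rep and date are discarded here
    return dict(totals)

summary = summarize(atomic)
```

Two rows become one, matching the C100/P100 total of $1000 in the summary table, but the summary can no longer say which sales rep made each sale or on which date.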
CSE5230 - Data Mining, 2001 Lecture 10.20
Dimensions
drill down to atomic data from dimensions or reference tables
A Query
  List sales of Product P100 for each State for each Month of 1999?

Product      Location     Time
P#=P100      State=Each   Year=1999
PName Nuts   Region       Month=Each
PCat
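The query on this slide might be written as follows, again using `sqlite3`. The dimension attributes (P#, PName, PCat, State, Region, Year, Month) come from the slide; the table names, the `time_dim` key, and all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prod_no TEXT PRIMARY KEY, pname TEXT, pcat TEXT);
CREATE TABLE location (loc_no  TEXT PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales    (prod_no TEXT, loc_no TEXT, time_id INTEGER, amount INTEGER);

INSERT INTO product  VALUES ('P100', 'Nuts', 'Hardware'), ('P200', 'Bolts', 'Hardware');
INSERT INTO location VALUES ('L1', 'VIC', 'South'), ('L2', 'TAS', 'South');
INSERT INTO time_dim VALUES (1, 1999, 1), (2, 1999, 2), (3, 2000, 1);
INSERT INTO sales VALUES
    ('P100', 'L1', 1, 100),
    ('P100', 'L1', 1, 50),
    ('P100', 'L2', 2, 70),
    ('P200', 'L1', 1, 999),
    ('P100', 'L1', 3, 30);
""")

# Product is fixed (P# = P100), Year is fixed (1999),
# and State and Month are set to "Each", so they become GROUP BY columns.
query = """
SELECT l.state, t.month, SUM(s.amount)
FROM sales s
JOIN location l ON l.loc_no = s.loc_no
JOIN time_dim t ON t.time_id = s.time_id
WHERE s.prod_no = 'P100' AND t.year = 1999
GROUP BY l.state, t.month
ORDER BY l.state, t.month
"""
result = conn.execute(query).fetchall()
```

Dimension attributes given a fixed value end up in the WHERE clause, and attributes marked "Each" end up in the GROUP BY.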
CSE5230 - Data Mining, 2001 Lecture 10.21
Attributes & Attribute Hierarchies
each dimension table contains attributes
surrogate keys are commonly added to improve performance of joins between Fact tables and their associated Dimensions
attributes are used to search, filter or classify facts
Attribute Hierarchies: classification attributes, e.g. SALES_REGION: VIC, TAS
CSE5230 - Data Mining, 2001 Lecture 10.22
Datamarts/Customization/Cubes
customization - select only the attributes and rows of interest for export to a datamart or data cube
apply coding techniques to the attributes of interest, suitable for the search algorithm to be used
each cell of a cube is a view consisting of an aggregation of interest, e.g. TOTAL_SALES
used as a performance improving technique to pre-aggregate groupby cells
remove data not required for the problem at hand from the search algorithm
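Pre-aggregating group-by cells can be sketched in a few lines of Python. The approach follows the general idea of the data cube operator (see Gray et al. in the references), with an `ALL` marker for a rolled-up dimension; the `build_cube` function and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # marker for a dimension that has been aggregated away

def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every combination of group-by columns."""
    cube = defaultdict(int)
    for row in rows:
        # Each row contributes to 2**len(dims) cells: every dimension is
        # either kept at its own value or rolled up to ALL.
        for mask in product((False, True), repeat=len(dims)):
            cell = tuple(row[d] if keep else ALL for d, keep in zip(dims, mask))
            cube[cell] += row[measure]
    return dict(cube)

rows = [
    {"product": "P100", "state": "VIC", "sales": 510},
    {"product": "P100", "state": "TAS", "sales": 490},
    {"product": "P200", "state": "VIC", "sales": 2000},
]
cube = build_cube(rows, ["product", "state"], "sales")

total_p100 = cube[("P100", ALL)]  # TOTAL_SALES for P100 across all states
grand_total = cube[(ALL, ALL)]    # TOTAL_SALES over everything
```

Once the cube is built, a query for any of these totals is a dictionary lookup rather than a scan of the fact rows. The cost is storage, since the number of cells grows quickly with the number of dimensions.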
CSE5230 - Data Mining, 2001 Lecture 10.23
Business Intelligence & The DW
most enterprises have a data repository to allow data analysis to occur
databases provide enabling techniques
  efficient data storage and access
  query optimization
80% of knowledge discovery in databases (KDD) is the preparation of the data - this is the data warehouse
the evolution of the desktop, database, networks and AI/search has made it possible to perform KDD in commercial databases
CSE5230 - Data Mining, 2001 Lecture 10.24
The BI Process - 1
Understand and define the process
Perform data collection and extraction
Perform Data Cleaning and Exploration
Data Engineering
  select attributes of interest
  select records of interest
  map attributes to suit DM algorithms
CSE5230 - Data Mining, 2001 Lecture 10.25
The BI Process - 2
Algorithm Engineering
  which algorithm to use
  ability to deal with
    » quality of input
    » quality of output
    » performance
Run the data mining algorithm
Preliminary evaluation of the results
Refine the data and the problem
Use the results to implement a business strategy
CSE5230 - Data Mining, 2001 Lecture 10.26
A BI Model
[Figure: a BI model. Diagram labels: Analysis, Discovery, Pattern Recognition, Prediction/Verification, Model, Answer, Variables, Learning, Adaptive Modelling]
Return on Investment = (Profit from targeted customers buying Product X) / (Cost of Producing the Model and Predicting the Answer)
CSE5230 - Data Mining, 2001 Lecture 10.27
DM Techniques - a BI view
Verification Driven Data Mining Techniques
  Naive evaluation - exhaustive search
  Random walk
  ad hoc query
  OLAP
  Hypothesis testing - statistics
Discovery Driven Data Mining Techniques
  Statistical Modeling (e.g. linear regression)
  Visualization
  Rule-based and inductive learning
  Neural networks
  Genetic algorithms (an optimization technique)
CSE5230 - Data Mining, 2001 Lecture 10.28
OLAP: On-Line Analytical Processing
an environment for the analysis of multi-dimensional data
  dice
  rotate
  drill-down
  rollup
OLAP provides
  advanced database support involving attribute selection, attribute encoding, row sampling, data cleansing
  allows the use of multiple different search engines
  easy to use user-interface
  open system architecture using local processing power
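The four operations named on this slide can be illustrated on a tiny in-memory fact list. This is a sketch only: the `aggregate` and `dice` helpers and the sample data are hypothetical, and a real OLAP tool would work against a cube or a warehouse rather than a Python list.

```python
from collections import defaultdict

facts = [
    {"state": "VIC", "year": 1999, "month": 1, "sales": 100},
    {"state": "VIC", "year": 1999, "month": 2, "sales": 50},
    {"state": "TAS", "year": 1999, "month": 1, "sales": 70},
    {"state": "TAS", "year": 2000, "month": 1, "sales": 30},
]

def aggregate(rows, dims):
    """Sum sales grouped by the given dimensions, in the given order."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)

def dice(rows, **allowed):
    """Keep only rows whose dimension values fall in the allowed sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

rollup = aggregate(facts, ["state", "year"])              # months summed into years
drilldown = aggregate(facts, ["state", "year", "month"])  # back down to month level
rotated = aggregate(facts, ["year", "state"])             # same cells, axes swapped
diced = aggregate(dice(facts, state={"VIC"}, year={1999}), ["state", "month"])
```

Rollup and drill-down move along the time hierarchy (month to year and back), rotate reorders the axes without changing any totals, and dice restricts the data to a sub-cube before aggregating.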
CSE5230 - Data Mining, 2001 Lecture 10.29
References
Rob, P. & Coronel, C. Database Systems: Design, Implementation, and Management, 3rd Ed., Nelson, 1997
Inmon, W. H. - numerous. See http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example
Kimball, R. - numerous. See http://www.rkimball.com/
Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii International Conference on System Sciences, 1998
Lee, A. J. and Rundensteiner, E. A. Data Warehouse Evolution: Consistent Metadata Management.
Gray, J. et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery 1, pp. 29-53, 1997
Maier, D. et al. Selected Research Issues in Decision Support Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998