
ITEC 423 Data Warehousing and Data Mining Lecture 2

Page 1: ITEC 423 Data Warehousing and Data Mining Lecture 2

ITEC 423 Data Warehousing and Data Mining Lecture 2

Page 2: ITEC 423 Data Warehousing and Data Mining Lecture 2
Page 3: ITEC 423 Data Warehousing and Data Mining Lecture 2

What is a data warehouse?

“A data warehouse is a subject-oriented, integrated (consolidated), time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
W. H. Inmon

Page 4: ITEC 423 Data Warehousing and Data Mining Lecture 2

Subject-Oriented Data

The warehouse is organized around the major subjects of the enterprise rather than its major application areas.

It contains decision-support data rather than application-oriented data.

The focus of the design is providing users easy access to data so that current and future questions can be answered.

Example subjects: customers, products, sales.

Example application areas: customer invoicing, stock control.

Page 5: ITEC 423 Data Warehousing and Data Mining Lecture 2

Application-Oriented vs Subject-Oriented

Page 6: ITEC 423 Data Warehousing and Data Mining Lecture 2

Integrated or Consolidated Data

The warehouse integrates corporate-level, application-oriented data from different source systems. Source data is often inconsistent or missing.

The integrated data must be made consistent to present a unified view of the data to the users.
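
As a rough illustration of making source data consistent, the sketch below maps differently coded values from several source systems onto one warehouse code. The code table, the system names, and the function `unify_gender` are all hypothetical and chosen only for the example; real integration rules depend on the actual sources.

```python
from typing import Optional

# Hypothetical code table: each source system labels the same fact differently.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}

def unify_gender(raw: Optional[str]) -> Optional[str]:
    """Map a source-specific code to one warehouse code; None if missing or unknown."""
    if raw is None:
        return None
    return GENDER_CODES.get(raw.strip().lower())

# Made-up rows from three made-up source systems.
source_rows = [
    {"system": "billing", "customer": "C1", "gender": "male"},
    {"system": "crm", "customer": "C2", "gender": "F"},
    {"system": "orders", "customer": "C3", "gender": None},
]

# After integration every row uses the same coding, and missing values are explicit.
unified = [{**row, "gender": unify_gender(row["gender"])} for row in source_rows]
```

Missing values are kept as `None` rather than guessed, so the gap stays visible to whoever analyzes the data.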

Page 7: ITEC 423 Data Warehousing and Data Mining Lecture 2

Integrated Data

Page 8: ITEC 423 Data Warehousing and Data Mining Lecture 2

Time-Variant Data

Data in the warehouse is accurate and valid only at some point in time or over some time interval.

The warehouse contains slices of data across different periods of time. With these data slices, the user can view current and past reports. The data represents a series of snapshots.

Time-variance is also shown in the extended time over which data is stored. The warehouse contains several years’ worth of data, and time is associated, implicitly or explicitly, with all data.

This is necessary to support trending, forecasting, and time-based performance reporting, such as current year versus previous year.
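
A minimal sketch of a "current year versus previous year" comparison over stored snapshots follows. The yearly figures and the helper `year_over_year` are made up for illustration.

```python
# Hypothetical yearly snapshots of sales units; the figures are invented.
snapshots = {2022: 1200, 2023: 1500}

def year_over_year(snapshots, year):
    """Return the fractional change between `year` and the year before it."""
    previous = snapshots[year - 1]
    return (snapshots[year] - previous) / previous

growth = year_over_year(snapshots, 2023)  # (1500 - 1200) / 1200
```

The comparison is only possible because the earlier snapshot is still stored alongside the current one.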

Page 9: ITEC 423 Data Warehousing and Data Mining Lecture 2

Non-Volatile Data

Data in the warehouse is not updated in real-time but is refreshed from operational systems on a regular basis

New data is always added as a supplement to the database, rather than a replacement.
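
The idea of adding new data as a supplement rather than a replacement can be sketched as an append-only load. The row layout `(load_date, product, units)` and the function `refresh` are assumptions made for this example only.

```python
# Warehouse rows are (load_date, product, units); a hypothetical layout.
warehouse = []

def refresh(warehouse, load_date, operational_rows):
    """Append a dated snapshot of the operational rows; earlier rows are never changed."""
    warehouse.extend((load_date, product, units) for product, units in operational_rows)

# Two periodic refreshes from an imaginary operational system.
refresh(warehouse, "2024-01-31", [("Widget", 10), ("Gadget", 4)])
refresh(warehouse, "2024-02-29", [("Widget", 12), ("Gadget", 4)])
```

After the second refresh the January rows are still present, so both periods can be reported on.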

Page 10: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Granularity

In an operational database, data is usually kept at the lowest level of detail.

The units of sale are captured at the level of units of a product per transaction at the checkout counter.

The quantity ordered is captured and stored at the level of units of a product per order received from the customer.

If summary data is needed, the individual transactions are grouped.

In a data warehouse, initial requests are for summary data to use for analysis, for example the total sale units of a product in an entire region.

Progressively more detail may be required: a breakdown by states in the region, then sale units at individual stores.

Page 11: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Granularity

Granularity is the level of detail. The warehouse keeps data summarized at different levels of detail.
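
To make the levels of detail concrete, the sketch below groups the same transaction rows by region and by state. The rows and the function `summarize` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (region, state, store, units)
transactions = [
    ("East", "NY", "Store 1", 5),
    ("East", "NY", "Store 2", 3),
    ("East", "NJ", "Store 3", 4),
]

def summarize(rows, level):
    """Total units grouped by the first `level` columns (1=region, 2=state, 3=store)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:level]] += row[-1]
    return dict(totals)

by_region = summarize(transactions, 1)  # coarsest summary
by_state = summarize(transactions, 2)   # one level more detail
```

Calling `summarize(transactions, 3)` would give the store-level breakdown from the same underlying rows.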

Page 12: ITEC 423 Data Warehousing and Data Mining Lecture 2

Typical Properties of a Data Warehouse

A data warehouse is housed on an enterprise mainframe server.

Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized. The extracted data is a read-only copy and is restructured.

The data warehouse database is used for processing analytical applications and user queries.

Page 13: ITEC 423 Data Warehousing and Data Mining Lecture 2

OLTP vs. Warehousing

Organized by transactions vs. organized by particular subject

More users vs. fewer users

Accesses few records vs. entire tables

Smaller database vs. larger database

Normalized data structure vs. unnormalized

Continuous update vs. periodic update (load)

Page 14: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Warehouse Compared to OLTP

PROPERTY            OLTP                     DATA WAREHOUSE
ACTIVITIES          Processes                Analysis
RESPONSE TIME       Subseconds to seconds    Seconds to hours
OPERATIONS          DML                      Primarily read-only
NATURE OF DATA      30-60 days               Snapshots over time
DATA ORGANIZATION   By application           By subject, time
SIZE                Small to large           Large to very large
DATA SOURCES        Operational, Internal    Operational, Internal, External
USAGE CURVE         Predictable              Unpredictable

Page 15: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Warehouse or Data Mart

Page 16: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Warehouse Compared to Data Mart

Property              Data Warehouse      Data Mart
Scope                 Enterprise          Department
Subjects              Multiple            Single-subject, line of business (LOB)
Data Source           Many                Few
Size (typical)        100 GB to > 1 TB    < 100 GB
Implementation time   Months to years     Months

Page 17: ITEC 423 Data Warehousing and Data Mining Lecture 2

Which one to build first? Data warehouse or data mart?

Page 18: ITEC 423 Data Warehousing and Data Mining Lecture 2

Top Down Approach

Page 19: ITEC 423 Data Warehousing and Data Mining Lecture 2

Top Down Approach

Build the overall, big, enterprise-wide data warehouse instead of a collection of fragmented islands of information.

The data warehouse is large and integrated.

It takes longer to build and has a high risk of failure.

If you do not have experienced professionals on your team, this approach could be hazardous.

It is difficult to sell this approach to senior management and sponsors, because they are not likely to see results soon enough.

Page 20: ITEC 423 Data Warehousing and Data Mining Lecture 2

Pros and Cons of Top Down Approach

Advantages

A truly corporate effort, an enterprise view of data

Inherently architected, not a union of disparate data marts

Single, central storage of data about the content

Centralized rules and control

May see quick results if implemented with iterations

Disadvantages

Takes longer to build even with an iterative method

High exposure to risk of failure

Needs high level of cross-functional skills

High outlay without proof of concept

Page 21: ITEC 423 Data Warehousing and Data Mining Lecture 2

Bottom Up Approach

Page 22: ITEC 423 Data Warehousing and Data Mining Lecture 2

Bottom Up Approach

Build departmental data marts one by one, based on priority.

The collection of data marts makes up the data warehouse.

Beware of data fragmentation: independent data marts are blind to the overall requirements of the entire organization.

Data marts contain data at the lowest level of granularity, as well as summaries depending on the needs for analysis.

Data marts are joined or “unioned” together by conforming the dimensions.

Page 23: ITEC 423 Data Warehousing and Data Mining Lecture 2

Pros and Cons of Bottom Up Approach

Advantages

Faster and easier implementation of manageable pieces

Favorable return on investment and proof of concept

Less risk of failure

Inherently incremental; can schedule important data marts first

Allows project team to learn and grow

Disadvantages

Each data mart has its own narrow view of data

Permeates redundant data in every data mart

Perpetuates inconsistent and irreconcilable data

Proliferates unmanageable interfaces

Page 24: ITEC 423 Data Warehousing and Data Mining Lecture 2

Architectural Types

Page 25: ITEC 423 Data Warehousing and Data Mining Lecture 2

Architectural Types

Centralized Data Warehouse

Takes into account the enterprise-level information requirements

Atomic-level data at the lowest level of granularity is stored

Some summarized data may be included

Queries and applications access the central data warehouse

No separate data marts

Independent Data Marts

Evolves in companies where the organizational units develop their own data marts for their own specific purposes

Each data mart serves a particular organizational unit

More than one version of the truth may be found

Data marts are independent of one another

Different data marts may have inconsistent data definitions and standards

Such variances hinder analysis of data across data marts

Page 26: ITEC 423 Data Warehousing and Data Mining Lecture 2

Architectural Types

Federated

An existing legacy of an assortment of DSS in the form of operational systems, extracted datasets, primitive data marts, …

May not be possible to discard investment and start from scratch

Practical solution is a federated architectural type

Data may be physically or logically integrated through shared key fields, overall global metadata, distributed queries, and other such methods

No one overall data warehouse

Data-Mart Bus

Conformed supermarts approach

Analyzing requirements for a specific business subject such as orders, shipments, billings, insurance claims, car rentals, and ...

Build the first data mart (supermart) using business dimensions and metrics

These business dimensions will be shared in the future data marts

Conform dimensions among the various data marts

Result would be logically integrated supermarts that will provide an enterprise view of the data

Data marts contain atomic data organized as a dimensional data model

Results from adopting an enhanced bottom-up approach to data warehouse development
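
One way to picture conformed dimensions is two data marts that both key their facts to the same product dimension. The sketch below is an illustration under that assumption; the dimension table, the fact rows, and `total_by_product` are all hypothetical.

```python
# Hypothetical conformed product dimension shared by both marts.
product_dim = {101: "Widget", 102: "Gadget"}

# Hypothetical fact rows in two separate marts: (product_key, units)
orders_mart = [(101, 10), (102, 4), (101, 6)]
shipments_mart = [(101, 12), (102, 4)]

def total_by_product(fact_rows):
    """Sum a mart's measure per product name, looked up in the shared dimension."""
    totals = {}
    for product_key, units in fact_rows:
        name = product_dim[product_key]
        totals[name] = totals.get(name, 0) + units
    return totals

ordered = total_by_product(orders_mart)
shipped = total_by_product(shipments_mart)

# Because both marts use the same dimension, their results line up by product.
unshipped = {name: ordered[name] - shipped.get(name, 0) for name in ordered}
```

If each mart had its own product codes, the final comparison would first need a mapping between them, which is the inconsistency the conformed approach tries to avoid.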

Page 27: ITEC 423 Data Warehousing and Data Mining Lecture 2

Architectural Types

Hub-and-Spoke

Similar to the centralized data warehouse architecture

Overall enterprise-wide data warehouse

Atomic data is stored in the centralized data warehouse

Major and useful difference is the presence of dependent data marts in this architectural type

Dependent data marts obtain data from the centralized data warehouse

The centralized data warehouse forms the hub to feed data to the data marts on the spokes

Dependent data marts may be developed for a variety of purposes: departmental analytical needs, specialized queries, data mining, and ...

Dependent data mart may have normalized, denormalized, summarized, or dimensional data structures based on individual requirements

Most queries are directed to the dependent data marts, although the centralized data warehouse may itself be used for querying

Results from adopting a top-down approach to data warehouse development
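
As a small sketch of the hub feeding a spoke, the code below derives a summarized dependent mart purely from atomic hub rows. The row layout and `build_dependent_mart` are assumptions for illustration, not a prescribed design.

```python
# Hypothetical atomic rows in the central warehouse (the hub): (department, product, amount)
hub = [
    ("sales", "Widget", 100),
    ("sales", "Gadget", 250),
    ("sales", "Widget", 50),
    ("finance", "Widget", 40),
]

def build_dependent_mart(hub_rows, department):
    """Derive a summarized mart for one department using hub data only."""
    mart = {}
    for dept, product, amount in hub_rows:
        if dept == department:
            mart[product] = mart.get(product, 0) + amount
    return mart

sales_mart = build_dependent_mart(hub, "sales")
```

Here the mart is summarized; a different spoke could equally keep the rows in normalized or dimensional form, as the slide notes.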

Page 28: ITEC 423 Data Warehousing and Data Mining Lecture 2

Building Blocks of Data Warehouses

Page 29: ITEC 423 Data Warehousing and Data Mining Lecture 2

What is OLAP?

Online analytical processing

An approach to answering multi-dimensional analytical queries

Part of the broader category of business intelligence, which also includes reporting and data mining

Applications include business reporting for sales, management reporting, budgeting, and forecasting
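
A multi-dimensional query can be sketched with a tiny in-memory cube. This is only an illustration: the dimensions, the figures, and the helpers `roll_up` and `slice_cube` are hypothetical, and real OLAP tools work very differently at scale.

```python
# Hypothetical sales cube: (product, region, quarter) -> units
DIMENSIONS = ("product", "region", "quarter")
cube = {
    ("Widget", "East", "Q1"): 10,
    ("Widget", "West", "Q1"): 7,
    ("Gadget", "East", "Q1"): 3,
    ("Widget", "East", "Q2"): 12,
}

def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = {}
    for key, units in cube.items():
        reduced = tuple(key[i] for i in positions)
        result[reduced] = result.get(reduced, 0) + units
    return result

def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {key: units for key, units in cube.items() if key[i] == value}

by_product = roll_up(cube, ["product"])
east_by_quarter = roll_up(slice_cube(cube, "region", "East"), ["quarter"])
```

The same cube answers both "units per product" and "units per quarter in the East" without changing the stored data.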

Page 30: ITEC 423 Data Warehousing and Data Mining Lecture 2

Typical Data Warehousing Process

Phase I - STRATEGY: Identify business requirements. Define objectives and purpose of DW.

Phase II - DEFINITION: Project scoping and planning, using building block approach.

Phase III - ANALYSIS: Information requirements are defined.

Phase IV - DESIGN: Database structures to hold base data and summaries are created; translation mechanisms are designed.

Phase V - BUILD & DOCUMENT: The warehouse is built and documentation is developed.

Phase VI - POPULATE, TEST & TRAIN: The warehouse is populated and tested; the users are trained on the system and tools.

Phase VII - DISCOVERY & EVOLUTION: The warehouse is monitored and adjustments are applied, or future extensions are planned.

The process is iterative.

Page 31: ITEC 423 Data Warehousing and Data Mining Lecture 2

What Does All This Mean?

On a daily basis, organizations turn to their data warehouses to answer a limitless variety of questions.

Nothing is free: these benefits do come with a cost.

The value of a data warehouse is a result of the new and changed business processes it enables.

There are limitations: a DW cannot correct problems with the data, although it may help to clearly identify them.

Page 32: ITEC 423 Data Warehousing and Data Mining Lecture 2

Comparison of Typical DW Costs and Benefits

Costs

Hardware, software, development personnel and consultant costs.

Operational costs like ongoing systems maintenance.

Benefits

Added revenue

Will the new (business objective) process generate new customers (what is the estimated value?)

Will the new (business objective) process increase the buying propensity of existing customers (by how much?)

Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?

Reduced costs

What costs of current systems will be eliminated?

Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?

Page 33: ITEC 423 Data Warehousing and Data Mining Lecture 2

The Cost of Warehousing Data

Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.

The initial costs can be further divided into hardware costs and software costs.

Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse)

Page 34: ITEC 423 Data Warehousing and Data Mining Lecture 2

Expenditures Associated with Building a DW

Capital, recurring costs: hardware maintenance, software maintenance, terminal analysis, middleware.

Capital, one-time costs (hardware and software): disk, DBMS, CPU, terminal analysis, network, middleware, log utility, processing, metadata infrastructure.

Operational, recurring costs: ongoing refreshment, integration/transformation, data model maintenance, record identification maintenance, metadata infrastructure maintenance, archival of data, data aging within the DW.

Operational, one-time costs: integration/transformation processing specification, metadata infrastructure population, system of record definition, data dictionary language definition, network transfer definition, CASE/repository interface, initial data warehouse population, data model definition, database design definition.

Page 35: ITEC 423 Data Warehousing and Data Mining Lecture 2

Cost is Highly Variable

A company that spends less money for their data warehouse is often happier with it.

The main justification for the development expense is that a DW reduces the cost of accessing the information owned by the organization.

Since information has to be retrieved just once (when it is placed in the warehouse), DW users see a lower cost on each report generated.

Page 36: ITEC 423 Data Warehousing and Data Mining Lecture 2

Typical Multidatabase Report and Screen Generation

[Figure: Source Systems A, B, C, and D]

Data download and transformation contribute to retrieval costs for every report or screen generated.

Page 37: ITEC 423 Data Warehousing and Data Mining Lecture 2

Typical DW Report and Screen Generation

[Figure: Source Systems A, B, C, and D and the Organizational Data Warehouse]

Data upload and transformation costs occur just once.

Retrieval costs are lower.
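
The cost argument can be sketched as simple arithmetic. The cost units and report counts below are invented, and as a simplifying assumption the per-report retrieval cost is taken to be the same in both setups, which understates the warehouse's advantage.

```python
def multidatabase_cost(reports, transform_cost, retrieval_cost):
    """Every report pays for transformation and retrieval."""
    return reports * (transform_cost + retrieval_cost)

def warehouse_cost(reports, transform_cost, retrieval_cost):
    """Transformation is paid once at load time; each report pays retrieval only."""
    return transform_cost + reports * retrieval_cost

# Made-up figures: 100 reports, transformation costs 50 units, retrieval costs 5 units.
without_dw = multidatabase_cost(100, 50, 5)  # 100 * (50 + 5)
with_dw = warehouse_cost(100, 50, 5)         # 50 + 100 * 5
```

With a single report the two totals are equal; the gap grows with every additional report drawn from the same loaded data.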

Page 38: ITEC 423 Data Warehousing and Data Mining Lecture 2

Farmers and Explorers

Every corporation has two types of DW users.

Farmers know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.

Explorers are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless nuggets.

Cost justification for the DW is usually done on the basis of the results obtained by farmers since explorers are unpredictable.

Page 39: ITEC 423 Data Warehousing and Data Mining Lecture 2

Data Marts and the Data Warehouse

[Figure: Legacy Systems, four Operational Data Stores, the Organizational Data Warehouse, and the Finance, Accounting, Marketing, and Sales Data Marts]

Legacy systems feed data to the warehouse.

The warehouse feeds specialized information to departments.

Page 40: ITEC 423 Data Warehousing and Data Mining Lecture 2

The Data Mart is More Specialized

[Figure: the Organizational Data Warehouse and the Finance, Accounting, Marketing, and Sales Data Marts]

Data Marts

Departmentalized

Summarized, aggregated data

Star join design

Limited historical data

Limited data volume

Requirements driven data

Focused on departmental needs

Multi-dimensional DBMS technologies

Organizational Data Warehouse

Corporate

Highly granular data

Normalized design

Robust historical data

Large data volume

Data model driven data

Versatile

General purpose DBMS technologies

The data mart serves the needs of one business unit, not the organization.

Page 41: ITEC 423 Data Warehousing and Data Mining Lecture 2

Foundations of Data Mining

Data mining is the process of using raw data to infer important business relationships.

Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.

It is a collection of powerful techniques intended for analyzing large datasets.

There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.

Page 42: ITEC 423 Data Warehousing and Data Mining Lecture 2

The Roots of Data Mining

The approach has roots in practice dating back over 30 years.

In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.

By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.

Page 43: ITEC 423 Data Warehousing and Data Mining Lecture 2

A General Approach

Although all data mining endeavors are unique, they possess a common set of process steps:

1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools

2. Exploration – looking at summary data, sampling and applying intuition

3. Analysis – each discovered pattern is analyzed for significance and trends

Page 44: ITEC 423 Data Warehousing and Data Mining Lecture 2

A General Approach (continued)

4. Interpretation – once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.

5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
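
The exploration and analysis steps can be sketched on a toy dataset. Everything below is hypothetical: the daily sales figures are made up, and the weekend-versus-weekday comparison is just one example of a pattern one might look at, not a prescribed mining technique.

```python
import random
import statistics

# Hypothetical daily sales units labeled by day type; the numbers are invented.
sales = [("weekday", 20), ("weekday", 22), ("weekday", 19), ("weekday", 21),
         ("weekend", 30), ("weekend", 33), ("weekend", 31), ("weekend", 34)]
all_units = [units for _, units in sales]

# Exploration: look at summary data and at a reproducible random sample.
summary = {"mean": statistics.mean(all_units), "stdev": statistics.stdev(all_units)}
sample = random.Random(0).sample(sales, 3)

# Analysis: measure how large the candidate pattern is.
def group_mean(label):
    """Average units for rows with the given day-type label."""
    return statistics.mean([units for day, units in sales if day == label])

lift = group_mean("weekend") - group_mean("weekday")
```

Interpretation would then ask whether a difference of this size reflects seasonality or a business cycle before anyone acts on it.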

Page 45: ITEC 423 Data Warehousing and Data Mining Lecture 2

Review Vocabulary

Data warehouse

Data mart

OLTP

OLAP

Dimensional model

Subject-oriented

Time-variant

Non-volatile

Integrated/consolidated