Databases & Data Mining

Types of database systems

How are they related to data mining

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved


Contemporary Database

• Gain competitive advantage – customer information systems

• data mining

• Develop and market new products
• micromarketing


Systems
• Database
– Personal, small business level
• On-Line Analytic Processing (OLAP)
– Ability to use many dimensions, reports & graphics
• Data Mart
– Usually temporary analysis
• Data Warehouse
– Usually permanent repository


Data Warehousing

Price Waterhouse definition:

A data warehouse is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. The data warehouse provides a unified repository of consistent data for decision making that is subject oriented, integrated, time variant, and nonvolatile.


Data Warehousing

• Provide business users views of data appropriate to mission

• Consolidate & reconcile data

• Give macro views of critical aspects

• Timely & detailed access to information

• Provide specific information to groups

• Ability to identify trends


Data Warehousing

Price Waterhouse:

Not just a technology;

an architecture and process designed to support decision making

special-purpose database systems to improve query performance significantly

index, partition, pre-aggregate data


Data Warehousing

Beyond OLAP: Data warehouse

OLAP                               On-Line Transactional Processing
summary data                       detailed operational data
few users                          many concurrent users
data driven                        transaction driven
effectiveness                      efficiency
use EIS, spreadsheets to access


Data Marts

• Intermediate-level database system

• Often used as temporary storage
– Gather data for study from data warehouse, other sources (including external)
– Clean & transform for data mining


OLAP
• Multidimensional spreadsheet
• Hypercube – term to reflect ability to sort on many dimensions
• Many forms
– MOLAP – multidimensional
– ROLAP – relational (uses SQL)
– DOLAP – desktop
– WOLAP – web enabled
– HOLAP – hybrid
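The hypercube idea can be sketched in a few lines of Python. This is only an illustration: the sample rows, the dimension names (`item`, `store`, `day`), and the `roll_up` helper are assumptions made up for the sketch, not features of any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (item, store, day) and one measure (units).
SALES = [
    {"item": "hose", "store": "A", "day": "Mon", "units": 5},
    {"item": "hose", "store": "B", "day": "Mon", "units": 3},
    {"item": "belt", "store": "A", "day": "Tue", "units": 2},
    {"item": "hose", "store": "A", "day": "Tue", "units": 4},
]

def roll_up(rows, dims, measure="units"):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

# The same facts viewed along one dimension, then along two.
by_item = roll_up(SALES, ["item"])
by_item_store = roll_up(SALES, ["item", "store"])
```

Choosing a different list of dimensions gives a different view of the same facts, which is roughly what a multidimensional spreadsheet lets a user do interactively.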


Key Concepts
• Scalability
– Ability to accurately cope with changing conditions (especially magnitude of computing)
• Granularity
– Level of detail
  • Data warehouse – tends to be fine granularity
  • OLAP – tends to aggregate to coarse granularity
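A small sketch of why granularity matters, using made-up daily sales figures (the numbers and the `summarize` helper are assumptions for illustration): fine-grained data can always be aggregated to a coarse summary, but the detail cannot be rebuilt from the summary.

```python
def summarize(daily_units):
    """Collapse fine-grained daily values into one coarse weekly total."""
    return sum(daily_units)

steady = [10, 10, 10, 10, 10, 10, 10]   # same sales every day
spiky = [0, 0, 0, 0, 0, 0, 70]          # everything sold on one day

# Both detailed histories give the same coarse summary,
# so the daily pattern cannot be recovered from the total alone.
same_summary = summarize(steady) == summarize(spiky)
```

Here `same_summary` is true even though the two weeks look nothing alike, which is why tools that need detail work from fine-granularity data.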


Data Warehouse Implementation

• Reliable, comprehensive source of clean data
– Accurate, complete, in correct format
• Processes
– System development
– Data acquisition
– Data extraction for use


Data Warehouse Generation

• Extract data from sources

• Transform

• Clean

• Load into data warehouse
– 60-80% of effort in operating data warehouse
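The four steps above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions: the source rows, the `sales` table layout, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records, as they might arrive from an operational system.
SOURCE_ROWS = [
    {"store": " A ", "item": "hose", "amount": "12.50"},
    {"store": "B", "item": "belt", "amount": "bad"},     # unparseable amount
    {"store": "A", "item": "hose", "amount": "12.50"},   # duplicate once trimmed
]

def extract():
    """Extract: read rows from the source (here, a list in memory)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: trim text fields and convert amounts to numbers (None if invalid)."""
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            amount = None
        out.append((r["store"].strip(), r["item"].strip(), amount))
    return out

def clean(rows):
    """Clean: drop rows with invalid amounts and exact duplicates, keeping order."""
    seen, kept = set(), []
    for row in rows:
        if row[2] is not None and row not in seen:
            seen.add(row)
            kept.append(row)
    return kept

def load(conn, rows):
    """Load: insert the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (store TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, clean(transform(extract())))
loaded = conn.execute("SELECT store, item, amount FROM sales").fetchall()
```

Of the three source rows only one survives to the load step, a small-scale hint at why transformation and cleaning take so much of the effort.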


Data Extraction Routines

• Interpret data formats

• Identify changed records

• Copy information to intermediate file
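One possible way to identify changed records and copy them to an intermediate file is to compare two snapshots keyed by record id. The snapshot layout, the field names, and the `changed_records` helper below are assumptions for illustration; real extraction routines often rely on timestamps or database logs instead.

```python
import csv
import io

def changed_records(previous, current):
    """Return rows in `current` that are new or differ from `previous`.

    Both arguments map a record key to a dict of field values.
    Records deleted from `current` are not reported by this sketch.
    """
    return [
        {"id": key, **fields}
        for key, fields in current.items()
        if previous.get(key) != fields
    ]

previous = {1: {"price": "9.99"}, 2: {"price": "4.50"}}
current = {1: {"price": "9.99"}, 2: {"price": "4.75"}, 3: {"price": "1.00"}}

changes = changed_records(previous, current)

# Copy the changed rows to an intermediate file (an in-memory CSV here).
staging = io.StringIO()
writer = csv.DictWriter(staging, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(changes)
```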


Data Transformation
• Consolidate data from multiple sources
• Filter to eliminate unnecessary details
• Clean data
– eliminate incorrect entries
– eliminate duplications

• Convert & translate data into proper format

• Aggregate data as designed
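A minimal sketch of consolidating, filtering, converting, and aggregating, assuming two hypothetical sources that record sale amounts in different units (dollars and cents). All names and values are invented for the example.

```python
from collections import defaultdict

# Two hypothetical sources that record the same kind of sale differently.
SOURCE_A = [{"store": "A", "amount_usd": 12.5, "clerk": "x"}]
SOURCE_B = [{"store": "A", "amount_cents": 250}, {"store": "B", "amount_cents": 400}]

def consolidate(source_a, source_b):
    """Merge both sources into one format: (store, amount in dollars)."""
    # Filter: the 'clerk' field is not needed downstream, so it is dropped.
    rows = [(r["store"], r["amount_usd"]) for r in source_a]
    # Convert: cents are translated into dollars to match source A.
    rows += [(r["store"], r["amount_cents"] / 100) for r in source_b]
    return rows

def aggregate(rows):
    """Aggregate: total sales per store."""
    totals = defaultdict(float)
    for store, amount in rows:
        totals[store] += amount
    return dict(totals)

store_totals = aggregate(consolidate(SOURCE_A, SOURCE_B))
```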


Data Management

• Retrieve information
• Extraction programs
• Problems:
– Required data not available
– Initial data warehouse scope too broad
– Not enough time to do prototyping, or needs analysis
– Insufficient senior direction


Meta Data

• Data to keep track of data

• Life cycle:
– Manage meta data
– Design data warehouse
– Ensure data quality
– Manage system during operations


Business Meta Data

• What data are available

• Source of each data element

• Frequency of data updates

• Location of specific data

• Predefined reports & queries

• Methods of data access


Technical Meta Data
• Data source
– (internal or external)
• Data preparation features
– (transformation & aggregation rules)
• Logical structure of data
• Physical structure & content
• Data ownership
• Security aspects
– (access rights, restrictions)
• System information
– (date of last update, retention policy, data usage)
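As a rough sketch, some of the technical meta data items listed above could be kept as one catalog record per data element. The class name, field names, and sample values below are hypothetical; real warehouse catalogs define their own schemas.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class TechnicalMetadata:
    """One catalog entry describing a data element in the warehouse."""
    element: str
    source: str                       # data source: "internal" or "external"
    transformation_rule: str          # data preparation: how the value was derived
    owner: str                        # data ownership
    allowed_roles: List[str] = field(default_factory=list)  # security: access rights
    last_update: Optional[date] = None    # system information
    retention_days: Optional[int] = None  # system information: retention policy

    def can_read(self, role: str) -> bool:
        """Check whether a role is listed in the access rights."""
        return role in self.allowed_roles

# Sample entry with made-up values.
entry = TechnicalMetadata(
    element="weekly_sales",
    source="internal",
    transformation_rule="sum of daily sales by store",
    owner="finance",
    allowed_roles=["buyer", "forecaster"],
    last_update=date(2007, 1, 1),
    retention_days=365,
)
```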


Wal-Mart’s Data Warehouse
• Heavy user of IT
• Core competency – supply chain distribution
– 2900 outlets
– Data warehouse of 101 terabytes ($4 billion)
– 65 million transactions per week
– Subject-oriented, integrated, time-variant, nonvolatile data
– 65 weeks of data by item, store, day


Wal-Mart

• Use data warehouse to:
– Support decision making
– Buyers, merchandisers, logistics, forecasters
– 3,500 vendor partners can query
– Can handle 35 thousand queries per week
  • Benefit $12,000 per query
  • Some users about 1 thousand queries per day


Summers Rubber Company

• Distribution firm
– 7 operating locations
– 10,000 items
– 3,000 customers
• Old system:
– OLAP
– Databases transactional & summarized, distributed


Summers Data Storage System

• Built in-house, PCs, Access database
• Visual Basic & Excel
• Distributed system
– Data warehouse server controlled queries, managed resources
• Security
– Passwords gave some protection
– To protect from leaving employees, used data marts with small versions of central database


Summers

• Move from transactional databases to new system

• Small prototype, iterative feedback from users

• Data came from many sources
• Scrubbing data
– Reformatting (time units, scales, currency measures, etc.)


Summers – Negative features

• Too much disk space on user local drives

• Often difficult to understand & use

• Updating multiple data sites slow, limited access

• Summary data often wrong

• Couldn’t use data mining tools
– Problem was aggregated data stored


Comparison

Product     Use                 Duration     Granularity
Warehouse   Repository          Permanent    Finest
Mart        Specific study      Temporary    Aggregate
OLAP        Report & analysis   Repetitive   Summary


Examples of Data Uses

• Customer information systems

• Fingerhut


Customer Information Systems

• Massive databases

• Detailed information about individuals and households

• Use automated analysis
– identify focused market target


Micromarketing
• Target small groups of highly responsive customers
• Own niches like smaller competitors
• EXAMPLES:
– Great Atlantic & Pacific Tea Company (A&P)
  • target customers, centralize buying
– Fingerhut
  • sell on credit to households <$25,000 income


Media Companies
• R. R. Donnelley & Sons
– world’s largest printer
– provide consumer & life-style data
– customized individual publications
• Mass marketing has become less effective
• Profit in developing niche-oriented strategy
• Need marketing information technology


Information Overload
• Retail food (groceries)
– average store - 20,000 items
  • larger stores 40,000 to 60,000;
  • with weights, flavors, etc., hundreds of thousands
– every year 10,000 new items
– 550 corporate and regional buying offices
– 100,000 salespeople
– several hundred thousand price changes/year


Information Overload

• Grocery data collection
– point-of-sale scanning
– used to allocate shelf space
– used to optimize product mix
– control inventories
– avoid shortages


Customer Information Systems

• tens of thousands of characters of information

• tens of millions of customers

• enormous data storage
– hundreds of gigabytes

• parallel computing

• YOU HAVE TO BE BIG TO AFFORD


Customer Information Systems

• USES
– adjust prices
– see new product possibilities
– develop promotions
– personalized advertising


Customer Information Systems

• OPERATION
– artificial intelligence
  • neural networks to wade through data
  • identify shopping trends
  • segment groups of customers


Customer Information Systems
• AIRLINE INDUSTRY
– 1980s - deregulation
– number of possible fares & rates skyrocketed
– SABRE - 45 million fares, 40 million changes/month
– industry now dominated by American (SABRE) & United (Apollo)
– cost - hundreds of millions of dollars


Own the Customer
• A&P
– point-of-sale scanning
– frequent shopper programs
  • used to build customer database
  • sign up, get free bonus saver cards, check cashing, hundreds of special discounts
  • A&P gathers list of purchases, feeds database
– centralized buying, better inventory, advertising


Versioning
• Assemble hundreds of versions of the same ad
• Switch & reassemble products & prices
• Cigarette makers
– some of most advanced database marketing
– direct mail, discount coupons, freebies
– have built databases on smoker demographics
– anticipate market changes, target promotions
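The mechanics of versioning, switching and reassembling products and prices into many variants of one ad, can be sketched as a template filled from every combination of parts. The template text, headlines, and offers below are invented for illustration and do not come from any of the companies discussed.

```python
from itertools import product

# Hypothetical ad template and interchangeable parts.
TEMPLATE = "{headline}: {item} now ${price:.2f}"

HEADLINES = ["Weekend special", "Members only"]
OFFERS = [("coffee", 3.99), ("tea", 2.49), ("cocoa", 2.99)]

def build_versions(headlines, offers):
    """Assemble one ad per combination of headline and offer."""
    return [
        TEMPLATE.format(headline=h, item=item, price=price)
        for h, (item, price) in product(headlines, offers)
    ]

versions = build_versions(HEADLINES, OFFERS)
```

Two headlines and three offers already give six versions; adding more interchangeable parts multiplies the count, which is how hundreds of versions of the same ad become practical.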


Versioning
• FINGERHUT
– 150 catalog mailings in 1992
– based on statistically predicted consumer response
– 13 million customers, 14% annual growth
– database captures 1400 pieces of information about a household
  • demographics, purchasing histories


FINGERHUT

• identify your kid’s birthdays, send ideas
– FRONT-END programs
  • get new customers (purchased from others)
– TRANSITION programs
  • evaluate new purchasers, keep best
– BACK-END programs
  • maximize profit


FINGERHUT
• FRONT-END
– newspaper, magazines, TV, postcards, catalogs
– predictive models – lists from other companies
– if you respond
• TRANSITION
– sort out good credit risks, good purchasers


FINGERHUT

• BACK-END
– 80% of revenue from repeat customers
– customers segmented
  • 75 specialty catalogs
  • personalized messages


Marketing Budgets

• Saturated advertising channels
– expenditures more than doubled in 1980s
– too much advertising, too little relevant
• Shift to
– promotional discounts
– slotting - buy shelf space
– undermines brand loyalty


Narrowcasting

• Cable TV

• In-store coupons

• Special monitors
– doctors’ offices, airport lounges

• Interactive kiosks

• Interactive home TV shopping


R.R. Donnelley & Sons
• Will manage customer’s database

• Supply consumer data

• Identify market segments

• Printing
– Farm Journal - 8000 different editions/month
– tailored editorial & advertising content


Customer Information Systems

• Barriers to competition

• Cost up to $100 million to develop

• Years to gather data and build

• Basic shift in source of competitive advantage