Data Mining Kathy S Schwaig

Data Mining Kathy S Schwaig. Outline Motivation Definitions Techniques Applications Portions of this presentation are adapted from J. Han Simon

Data Mining

Kathy S Schwaig

Outline

• Motivation

• Definitions

• Techniques

• Applications

Portions of this presentation are adapted from J. Han, Simon Fraser University, Canada

Motivation

Data found in data warehouses is not, by itself, of great intrinsic value.

Value comes from the knowledge that can be discovered from data.

What do you do with it?

Data Volume Problems

• Magnitude of data due to machine-readable text disseminated across networks.

• Difficult to distill information for analysis.

• Tools needed to 'mine' information to bring out key, relevant facts.

• Users need to rapidly filter and assimilate useful information from a variety of data sources.

Data Mining

The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Extraction of hidden, predictive information from large databases.

Provides answers to questions a decision maker had not previously thought to ask.

Data Mining

Search for relationships, patterns, and trends which, prior to the search, were not known to exist or were not visible.

E.g. “Find related buying patterns.”

“There is a pattern that occurs X% of the time that when someone buys window coverings (not shades, blinds, or other specifics), and within 1 to 3 months buys linens, within the next 4 months buys furniture.”

Data Mining

[Figure: the data mining process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining and Business Intelligence

[Figure: pyramid of layers, listed here from top (greatest potential to support business decisions) to bottom]

• Making Decisions

• Data Presentation: Visualization Techniques

• Data Mining: Information Discovery

• Data Exploration: OLAP; Statistical Analysis, Querying and Reporting

• Data Warehouses / Data Marts

• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Data Mining Analysis Techniques: Examples

• Characterization

• Association

• Classification

• Prediction

• Clustering (Data Segmentation)

Characterization

Demographics: address, income, recreational equipment ownership, etc.

Psychographics: lifestyle/personality characteristics like “highly protective of children; impulsive shopper”

Technographics (web-based): attributes of your computer system: browser, operating system, modem speed, etc.

Association

Occurrences linked to a single event. Identify items that are likely to be purchased or viewed in the same session (web).

Example: Amazon.com. Customers who bought The Grapes of Wrath also bought The Great Gatsby.
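
One common way to quantify an association like this is with support (how often two items appear together) and confidence (how often the second item appears given the first). The sketch below is only illustrative: the session data and the `pair_rules` helper are made up for this example and do not come from the presentation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each set holds the items bought or viewed together.
sessions = [
    {"Grapes of Wrath", "Great Gatsby"},
    {"Grapes of Wrath", "Great Gatsby", "Moby Dick"},
    {"Grapes of Wrath", "Moby Dick"},
    {"Great Gatsby"},
]

def pair_rules(sessions):
    """Return {(a, b): (support, confidence)} for every pairwise rule a -> b."""
    item_counts = Counter(item for s in sessions for item in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    n = len(sessions)
    rules = {}
    for (a, b), both in pair_counts.items():
        # support: share of all sessions containing both items
        # confidence: share of sessions with the left-hand item that also
        # contain the right-hand item
        rules[(a, b)] = (both / n, both / item_counts[a])
        rules[(b, a)] = (both / n, both / item_counts[b])
    return rules

rules = pair_rules(sessions)
support, confidence = rules[("Grapes of Wrath", "Great Gatsby")]
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule is usually reported only when both numbers clear thresholds chosen by the analyst.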

Classification

Recognize patterns that describe the group to which an item belongs by examining existing items that have already been classified and inferring a set of rules.

Example: Credit Card companies have discovered the characteristics of customers likely to leave and have provided a model to help predict who will leave in the future.
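
As a minimal sketch of "inferring a rule from items that have already been classified", the code below learns a single threshold rule (a decision stump) from labelled examples. The attribute (complaints per year), the data, and the helper names are assumptions made for illustration; real churn models use many attributes and richer rule sets.

```python
def learn_stump(examples):
    """Pick the threshold t for which 'value >= t means True' makes the fewest errors."""
    best_t, best_errors = None, len(examples) + 1
    for t in sorted({value for value, _ in examples}):
        errors = sum((value >= t) != label for value, label in examples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

# Hypothetical, already-classified customers: (complaints last year, left?)
history = [
    (0, False), (1, False), (2, False), (1, False),
    (4, True), (5, True), (6, True),
]

threshold = learn_stump(history)

def will_leave(complaints):
    """Apply the inferred rule to a customer who has not been classified yet."""
    return complaints >= threshold

print(threshold, will_leave(5), will_leave(1))
```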

Prediction

Guesses an unknown value, such as income, when you know other things about a person.

Example: lifetime monetary value. Often used with demographic data to fill in blank information. For example, we know someone’s address, car preference, and job title but not their income. We can look at others with similar characteristics and infer the missing income figure from their data.
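
The income example above can be sketched as a nearest-neighbour estimate: find the most similar people whose income is known and average their figures. The records, attribute names, and the `infer_income` helper are hypothetical, and counting matching attributes is just one simple similarity measure.

```python
from statistics import mean

# Hypothetical records for people whose income is known.
known = [
    {"area": "north", "car": "sedan", "job": "engineer", "income": 80_000},
    {"area": "north", "car": "sedan", "job": "engineer", "income": 90_000},
    {"area": "south", "car": "truck", "job": "clerk", "income": 40_000},
    {"area": "south", "car": "sedan", "job": "clerk", "income": 45_000},
]

def infer_income(person, known, k=2):
    """Average the income of the k records sharing the most attributes with person."""
    def similarity(other):
        return sum(person[a] == other[a] for a in ("area", "car", "job"))
    nearest = sorted(known, key=similarity, reverse=True)[:k]
    return mean(record["income"] for record in nearest)

person = {"area": "north", "car": "sedan", "job": "engineer"}
estimate = infer_income(person, known)
print(estimate)
```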

Clustering

Identify people who share common characteristics. A way of identifying differing groups within the data
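
To make the idea concrete, here is a very small k-means sketch on a single numeric attribute. The presentation does not name a clustering algorithm, so k-means, the age data, and the seeding rule are all assumptions chosen for illustration.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means for one numeric attribute (requires k >= 2)."""
    ordered = sorted(values)
    # Assumption: seed the centroids with evenly spaced values from the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

ages = [22, 25, 27, 61, 64, 70]  # hypothetical customer ages
young, old = kmeans_1d(ages)
print(young, old)
```

Unlike classification, no labels are supplied: the two groups emerge from the data alone.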

Patterns

Scuba gear and Australian vacations

Skim milk and whole wheat bread

AT&T’s stock rises at least 2% after every 3-day slump in the Dow

Camelot Music Inc.

• Discovered what appeared to be a curious purchasing trend.

• Music retailer’s 493 stores were selling a lot of rap and alternative CDs to people older than 65.

Are All the “Discovered” Patterns Interesting?

A data mining query may generate thousands of patterns.

Are they interesting? Why or why not?

Interesting if:

• easily understood by humans

• valid on new or test data with some degree of certainty

• potentially useful

• novel

• validates some hypothesis that a user seeks to confirm

Applications: MCI

How to find the customers you want to keep from among the millions?

Comb marketing data on 140 million households, each evaluated on as many as 10,000 attributes, e.g. income, lifestyle, and details about past calling habits.

But which set of those attributes is the most important to monitor, and within what range of values?

MCI

• Using an IBM SP/2 supercomputer as its data warehouse, MCI has identified the variables it finds most telling about its customers and, from those, compiled a set of 22 very detailed and highly confidential statistical customer profiles, none of which could have been developed without data mining programs.

Wal-Mart

Point-of-sale transaction data is captured at each retail store and transmitted to Wal-Mart’s Arkansas data warehouse.

Over 3,500 independent suppliers have online access to information about their respective products in that data warehouse. They may query that data to analyze trends by item and store, using that information to find the products that need replenishment, and thus get the right products to each store on time.

Data Mining Should Not be Used Blindly!

Data mining finds regularities in history, but history is not the same as the future.

Association does not dictate trend or causality. E.g. “Drinking diet drinks leads to obesity!”

David Heckerman’s counter-example (1997): barbecue sauce, hot dogs, and hamburgers.

Web Mining: Lots To Be Done!

Types of Web mining

• Web usage mining: which page or graphic was served (URL), linked to date, time, and browser information

• Web content mining: how visitors are responding to your content (which links they select, where they spend time, which search terms they use, where they browse)

Other than managers, who could REALLY use this information?

Challenges to Web Mining

Web: A huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving hypertext/hypermedia information repository.

Problems:

• the “abundance” problem

• limited coverage of the Web (hidden Web sources)

• limited query interface: keyword-oriented search

• limited customisation to individual users

DBMSs and data miners will play an increasingly important role in the new generation of the Internet.

Summary

• Need for data mining

• Approaches

• Problems

• Applications

• Web data mining

Appendix: Market Analysis and Management

Data sources

• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, studies.

Target marketing

• Clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Customer purchasing patterns

• Conversion of a single to a joint bank account: marriage, etc.

Cross-market analysis

• Associations/correlations between product sales

• Prediction based on the association information.

Appendix: Market Analysis and Management (Cont’d)

Customer profiling

• Data mining can tell you what types of customers buy what products (clustering or classification).

Customer requirements

• Identify the best products for different customers

• Use prediction to find what factors will attract new customers

Summary information

• Multi-dimensional summary reports; statistical summary information

Appendix: Corporate Analysis and Risk Management

Finance planning and asset evaluation

• Cash flow analysis and prediction

• Contingent claim analysis to evaluate assets

• Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning

• Summarize and compare resources and spending

Competition

• Monitor competitors and market directions.

• Segment customers into classes with a class-based pricing procedure.

• Set pricing strategy in a highly competitive market.

Appendix: Fraud Detection and Management

Applications

• Widely used in health care, retail, credit card services, telecommunications (phone card fraud).

Approach

• Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.

Examples

• Auto insurance: detect a group of people who stage accidents to collect insurance

• Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)

• Medical insurance: detect professional patients and rings of doctors and rings of references

Appendix: Fraud Detection and Management (Cont’d)

Telephone fraud

• Telephone call model: destination of call, duration, time of day or week. Analyze patterns that deviate from the expected norm.

• British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud.
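
One simple way to "analyze patterns that deviate from the expected norm" is to compare new calls against a caller's own history and flag values far from the mean. The sketch below uses call duration only; the data, the three-standard-deviation threshold, and the `unusual_calls` helper are assumptions for illustration and are not the method used by any carrier named above.

```python
from statistics import mean, stdev

def unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations lying more than `threshold`
    standard deviations from the mean of the caller's history."""
    mu, sigma = mean(history), stdev(history)
    return [d for d in new_calls if abs(d - mu) > threshold * sigma]

# Hypothetical call durations in minutes for one caller.
history = [4, 5, 6, 5, 4, 6, 5, 5]
flagged = unusual_calls(history, [5, 7, 45])
print(flagged)
```

A fuller call model would also score destination and time of day or week, as the slide lists.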

Appendix: Other Applications

Internet Web Surf-Aid

• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Appendix: Decision Support and OLAP

DSS: Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions, e.g.:

• What were the sales volumes by region and product category for the last year?

• How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

• Will a 10% discount increase sales volume sufficiently?

• OLAP: On-line analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multi-dimensional databases. An element of a decision support system.

• Data mining is a powerful, high-performance data analysis tool for decision support.
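
The multi-dimensional view that OLAP provides can be sketched as a roll-up: summing a measure over any chosen combination of dimensions, as in a query such as sales volume by region and product category. The fact rows, the dimension names, and the `roll_up` helper below are hypothetical and stand in for what an OLAP engine does at much larger scale.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product category, sales volume).
sales = [
    ("East", "Computers", 120),
    ("East", "Printers", 30),
    ("West", "Computers", 200),
    ("West", "Printers", 50),
    ("East", "Computers", 80),
]

def roll_up(facts, *dims):
    """Sum the sales measure over every combination of the chosen dimensions."""
    index = {"region": 0, "category": 1}  # position of each dimension in a row
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[index[d]] for d in dims)
        totals[key] += fact[2]
    return dict(totals)

by_region = roll_up(sales, "region")
by_cell = roll_up(sales, "region", "category")
grand_total = roll_up(sales)
print(by_region, by_cell, grand_total)
```

Passing fewer dimensions rolls the data up to a coarser summary; passing more drills down to finer cells.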