
DM Lecture 2


Page 1: DM Lecture 2

Introduction to Data Mining

Credit: Natasha Balac, Ph.D.

Page 2: DM Lecture 2

© Copyright 2006, Natasha Balac 2

Outline

Motivation: Why Data Mining?

What is Data Mining?

History of Data Mining

Data Mining Functionality and Terminology

Data Mining Applications

Are all the Patterns Interesting?

Issues in Data Mining

Page 3: DM Lecture 2

Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s:

Data mining and data warehousing, multimedia databases, and Web databases

Page 4: DM Lecture 2

Necessity is the Mother of Invention

Data explosion

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Page 5: DM Lecture 2

Necessity is the Mother of Invention

We are drowning in data, but starving for knowledge!

Solution - Data Mining

Data Warehousing and online analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Page 6: DM Lecture 2

Data Mining Main Objectives

Identification of data as a source of useful information

Use of discovered information for competitive advantage in a business environment

Page 7: DM Lecture 2

Why DATA MINING?

Huge amounts of data

Electronic records of our decisions

Choices in the supermarket

Financial records

Our comings and goings

We swipe our way through the world – every swipe is a record in a database

Data rich – but information poor

Lying hidden in all this data is information!

Page 8: DM Lecture 2

Data vs. Information

Society produces massive amounts of data: business, science, medicine, economics, sports, …

Potentially valuable resource

Raw data is useless on its own: techniques are needed to automatically extract information

Data: recorded facts (as in databases)

Information: patterns underlying the data

Patterns must be discovered automatically

Data – Information – Knowledge

Page 9: DM Lecture 2

What is DATA MINING?

Extracting or “mining” knowledge from large amounts of data

Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data

Process for extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

Page 10: DM Lecture 2

What Is Data Mining?

Data mining:

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names (is “data mining” a misnomer?): knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 11: DM Lecture 2

Data Mining is NOT

Data Warehousing

(Deductive) query processing

SQL/Reporting

Software Agents

Expert Systems

Online Analytical Processing (OLAP)

Statistical Analysis Tool

Data visualization

Page 12: DM Lecture 2

Data Mining

Programs that detect patterns and rules in the data

Strong patterns can be used to make non-trivial predictions on new data

Page 13: DM Lecture 2

Data Mining Challenges

Problem 1: most patterns are not interesting

Problem 2: patterns may be inexact or completely spurious when the data is noisy

Page 14: DM Lecture 2

Machine Learning Techniques

Technical basis for data mining: algorithms for acquiring structural descriptions from examples

Methods originate from artificial intelligence, statistics, and research on databases

Page 15: DM Lecture 2

Machine Learning Techniques

Structural descriptions represent patterns explicitly and can be used to

predict the outcome in a new situation

understand and explain how the prediction is derived (maybe even more important)

Page 16: DM Lecture 2

Multidisciplinary Field

[Figure: Data Mining at the confluence of Database Technology, Statistics, Artificial Intelligence, Machine Learning, Visualization, and Other Disciplines]

Page 17: DM Lecture 2

Multidisciplinary Field

Database technology

Artificial Intelligence

Machine Learning including Neural Networks

Statistics

Pattern recognition

Knowledge-based systems/acquisition

High-performance computing

Data visualization

Page 18: DM Lecture 2

History of Data Mining

Page 19: DM Lecture 2

History

Emerged in the late 1980s

Flourished in the 1990s

Roots traced back along three family lines

Classical Statistics

Artificial Intelligence

Machine Learning

Page 20: DM Lecture 2

Statistics

Foundation of most DM technologies

Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals

Building blocks

Significant role in today’s data mining, but not powerful enough on its own

Page 21: DM Lecture 2

Artificial Intelligence

Heuristics vs. Statistics

Human-thought-like processing

Requires vast computer processing power

Supercomputers

Page 22: DM Lecture 2

Machine Learning

Union of statistics and AI

Blends AI heuristics with advanced statistical analysis

Machine learning lets computer programs learn about the data they study and make different decisions based on the quality of that data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms

Page 23: DM Lecture 2

Data Mining

Adaptation of machine learning techniques to real-world problems

Union: Statistics, AI, Machine learning

Used to find previously hidden trends or patterns

Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends which could not be found otherwise

Page 24: DM Lecture 2

Terminology

Gold mining (turning data tombs into golden nuggets of knowledge; finding small sets of precious nuggets in a great deal of raw material)

Knowledge mining from databases

Knowledge extraction

Data/pattern analysis

Knowledge Discovery in Databases (KDD)

Information harvesting

Business intelligence

Page 25: DM Lecture 2

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]

Page 26: DM Lecture 2

Steps of a KDD Process

Learning the application domain:

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (remove noise, may take 60% of effort!)

Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
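
The steps above can be sketched as a small pipeline. This is only an illustrative sketch: the records, the field names, and the helper functions (`clean`, `select`, `transform`, `mine`, `evaluate`) are hypothetical and do not come from the lecture.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value that cleaning must handle.
raw = [
    {"age": 25, "income": 30000, "buys": "PC"},
    {"age": 27, "income": None, "buys": "PC"},
    {"age": 45, "income": 80000, "buys": "TV"},
    {"age": 23, "income": 28000, "buys": "PC"},
    {"age": 52, "income": 90000, "buys": "TV"},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Transformation: discretize age into two coarse bands."""
    return [
        {"age_band": "under_40" if r["age"] < 40 else "40_plus", "buys": r["buys"]}
        for r in records
    ]

def mine(records):
    """Data mining: count how often each (age_band, buys) pair occurs."""
    return Counter((r["age_band"], r["buys"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(
    mine(transform(select(clean(raw), ["age", "buys"]))), min_count=2
)
print(patterns)
```

Each function stands in for one stage of the process; in practice every stage is far more involved, and cleaning alone may dominate the effort, as the slide notes.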

Page 27: DM Lecture 2

Data Mining and Business Intelligence

[Figure: pyramid of layers with increasing potential to support business decisions, listed from top to bottom]

Making Decisions

Data Presentation: Visualization Techniques

Data Mining: Information Discovery

Data Exploration: Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts: OLAP, MDA

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 28: DM Lecture 2

Architecture of a Typical Data Mining System

[Figure: layered architecture, from bottom to top: Databases and Data Warehouse (data cleaning, data integration, filtering) → Database or data warehouse server → Data mining engine → Pattern evaluation → Graphical user interface, with a Knowledge base alongside]

Knowledge base: guides the search and evaluates the interestingness of patterns

Interestingness measures focus the search on interesting patterns

Page 29: DM Lecture 2

Data Mining: On What Kind of Data?

Should be applicable to any kind of data repository, and to transient data (data streams)

Types of databases and information repositories on which data mining can be performed

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW

Page 30: DM Lecture 2

LEARNING ALGORITHMS

Fundamental idea:

learn rules/patterns/relationships automatically from the data

Page 31: DM Lecture 2

Data Mining Tasks

Exploratory Data Analysis

Predictive Modeling: inference on current data to make predictions

Classification and Regression

Descriptive Modeling: characterise general properties of the data

Cluster analysis/segmentation

Discovering Patterns and Rules

Association/Dependency rules

Sequential patterns

Temporal sequences

Deviation detection

Page 32: DM Lecture 2

Data Mining Tasks

Data is associated with classes (e.g., computers, printers) or concepts (e.g., customer types)

Concept/Class description: Characterization and discrimination

Generalize, summarize (target class), and contrast data characteristics, e.g., dry vs. wet regions (contrasting class)

Frequent patterns: patterns occurring frequently in the data

Frequent itemsets: sets of items frequently appearing together in transactional data

Association (correlation and causality)

Multi-dimensional or single-dimensional association

age(X, “20-29”) ^ income(X, “60-90K”) => buys(X, “TV”)

age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]

buys(T, “computer”) => buys(T, “software”) [support = 1%, confidence = 50%]

An association rule is discarded as uninteresting if it does not satisfy both a minimum support threshold and a minimum confidence threshold
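
As a rough illustration of the threshold test described above, the sketch below computes support and confidence for one rule over a handful of transactions. The transactions, the threshold values, and the function names (`support`, `confidence`, `passes_thresholds`) are made up for this example.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support of both sides divided by support of lhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def passes_thresholds(lhs, rhs, transactions, min_support, min_confidence):
    """A rule is kept only if it meets both minimum thresholds."""
    return (
        support(lhs | rhs, transactions) >= min_support
        and confidence(lhs, rhs, transactions) >= min_confidence
    )

lhs, rhs = {"computer"}, {"software"}
print(support(lhs | rhs, transactions))     # 2 of 5 transactions contain both items
print(confidence(lhs, rhs, transactions))   # 2 of the 3 computer buyers also buy software
print(passes_thresholds(lhs, rhs, transactions, min_support=0.3, min_confidence=0.6))
```

Raising `min_confidence` to 0.7 in this toy data would cause the same rule to be discarded, which is the filtering behavior the slide describes.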

Page 33: DM Lecture 2

Data Mining Tasks

Classification

Finding models (functions) that describe and distinguish classes or concepts for future prediction

Example: classify countries based on climate, or classify cars based on gas mileage

The derived model is based on analysis of a set of training data (objects of known class label)

Model may be represented in various forms:

IF-THEN rules, decision tree, classification rule, neural network

Prediction

Models continuous-valued functions: predict some unknown or missing numerical values (e.g., regression analysis)

Relevance analysis: identify attributes that do not contribute to classification/prediction and can therefore be excluded
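
To make the prediction task concrete, here is a minimal sketch of simple linear regression by ordinary least squares. The slide names regression analysis only in general; the data points and the function names `fit_line` and `predict` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the numerical value for a new x."""
    return slope * x + intercept

# Hypothetical training data that happens to lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(5, slope, intercept))
```

Real data will not lie on a line, so the fitted model gives an estimate of the unknown value rather than an exact answer.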

Page 34: DM Lecture 2

Data Mining Tasks

Cluster analysis

Class label is unknown: group data to form new classes

Example: cluster houses to find distribution patterns

Clustering is based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity

Each cluster formed may be viewed as a class of objects
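
The slide does not name a clustering algorithm. As one common choice, the sketch below uses k-means, which groups points so that each is close to its own cluster centre. The points, the starting centroids, and the function names are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points and moving centroids until stable."""
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

No class labels are supplied anywhere: the two groups emerge from the distances alone, and each resulting cluster can then be treated as a class.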

Page 35: DM Lecture 2

Data Mining Tasks

Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data

Mostly considered noise or an exception, but quite useful in fraud detection and rare-event analysis

Trend and evolution analysis: models regularities or trends for objects whose behavior changes with time

Trend and deviation: regression analysis

Sequential pattern mining, periodicity analysis
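
One simple way to flag outliers, offered here only as an assumed example since the slide names no method, is to mark values that lie more than a chosen number of standard deviations from the mean. The amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(amounts))
```

In a fraud-detection setting the flagged value is the interesting one, not noise to be thrown away. Note that a large outlier inflates the standard deviation itself, so more robust measures are often preferred on real data.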

Page 36: DM Lecture 2

Data Mining: Classification Schemes

General functionality

Descriptive data mining vs. predictive data mining (descriptive: characterise general properties of the data; predictive: perform inference on the data in order to make predictions)

Different views - different classifications

Kinds of databases to be mined

Kinds of knowledge to be discovered

Kinds of techniques employed

Kinds of applications

Page 37: DM Lecture 2

A Multi-Dimensional View of Data Mining Classification

Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, WWW, etc.

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

Multiple/integrated functions

Mining at multiple levels of abstraction

Page 38: DM Lecture 2

A Multi-Dimensional View of Data Mining Classification

Techniques utilized

Decision/regression trees, clustering, neural networks, etc.

Applications adapted

Retail, telecom, banking, DNA mining, stock market analysis, Web mining

Page 39: DM Lecture 2

Data Mining Applications

Science: Chemistry, Physics, Medicine

Biochemical analysis

Remote sensors on a satellite

Telescopes – star galaxy classification

Medical Image analysis

Page 40: DM Lecture 2

Data Mining Applications

Bioscience

Sequence-based analysis

Protein structure and function prediction

Protein family classification

Microarray gene expression

Page 41: DM Lecture 2

Data Mining Applications

Pharmaceutical companies, Insurance and Health care, Medicine

Drug development

Identify successful medical therapies

Claims analysis, fraudulent behavior

Medical diagnostic tools

Predict office visits

Page 42: DM Lecture 2

Data Mining Applications

Financial Industry, Banks, Businesses, E-commerce

Stock and investment analysis

Identify loyal customers vs. risky customers

Predict customer spending

Risk management

Sales forecasting

Page 43: DM Lecture 2

Data Mining Applications

Retail and Marketing

Customer buying patterns/demographic characteristics

Mailing campaigns

Market basket analysis

Trend analysis

Page 44: DM Lecture 2

Data Mining Applications

Database analysis and decision support

Market analysis and management

Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and management

Page 45: DM Lecture 2

Data Mining Applications

Sports and Entertainment

IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Astronomy

JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Page 46: DM Lecture 2

DATA MINING EXAMPLES

Grocery store

NBA

Banking and Credit Card scoring

Fraud detection

Personalization & Customer Profiling

Campaign Management and Database Marketing

Page 47: DM Lecture 2

Data mining at work:

Case study 1

Page 48: DM Lecture 2

Processing Loan Applications

Given: questionnaire with financial and personal information

Problem: should money be lent?

Borderline cases referred to loan officers

But: 50% of accepted borderline cases defaulted!

Solution: reject all borderline cases?

But borderline cases are the most active customers!

Page 49: DM Lecture 2

Enter Machine Learning

Given: 1000 training examples of borderline cases

20 attributes, including:

age, years with current employer, years at current address, years with the bank, years at current job, other credit cards

Learned rules predicted 2/3 of borderline cases correctly!

Rules could be used to explain decisions to customers

Page 50: DM Lecture 2

Case study 2:

Screening images

Given: radar satellite images of coastal waters

Problem: detecting oil slicks in those images

Oil slicks = dark regions with changing size and shape

Look-alike dark regions can be caused by weather conditions (e.g. high wind)

Expensive process requiring highly trained personnel

Page 51: DM Lecture 2

Enter Machine Learning

Dark regions extracted from normalized image

Attributes:

size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background

Constraints:

Scarcity of training examples (oil slicks are rare!)

Unbalanced data: most dark regions aren’t oil slicks

Regions from same image form a batch

Requirement: adjustable false-alarm rate
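
An adjustable false-alarm rate is often obtained by thresholding a classifier's score. The sketch below assumes, hypothetically, that each dark region receives a score between 0 and 1 for being an oil slick, and picks the threshold from scores of known look-alike regions. None of the numbers or names come from the case study.

```python
def threshold_for_false_alarm_rate(negative_scores, max_rate):
    """Return a threshold such that at most max_rate of the known
    negatives score strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ranked))  # number of false alarms tolerated
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged
    # Only scores strictly above this value will raise an alarm.
    return ranked[allowed]

def is_alarm(score, threshold):
    """Flag a region as a suspected oil slick."""
    return score > threshold

# Hypothetical scores the classifier gave to known look-alike (non-slick) regions.
look_alike_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

threshold = threshold_for_false_alarm_rate(look_alike_scores, max_rate=0.2)
flagged = [s for s in look_alike_scores if is_alarm(s, threshold)]
print(threshold, flagged)
```

Lowering `max_rate` raises the threshold and so trades missed slicks for fewer false alarms, which is the kind of adjustment the requirement asks for.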

Page 52: DM Lecture 2

Data Mining Challenges

Computationally expensive to investigate all possibilities

Dealing with noise/missing information and errors in data

Choosing appropriate attributes/input representation

Finding the minimal attribute space

Finding adequate evaluation function(s)

Extracting meaningful information

Avoiding overfitting

Page 53: DM Lecture 2

Are All the “Discovered” Patterns Interesting?

A DM system may generate thousands or millions of patterns/rules

Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

support: percentage of transactions that the given rule satisfies

support(X => Y) = P(X ∪ Y)

confidence: degree of certainty of the detected association

confidence(X => Y) = P(Y | X)
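
In these formulas P(X ∪ Y) is the probability that a transaction contains every item of X and every item of Y, not either one. Written with counts, where N, n_X and n_XY are notation introduced here for the total number of transactions, the number containing X, and the number containing both X and Y:

```latex
\[
\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{n_{XY}}{N},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{n_{XY}}{n_X}
\]
% Hypothetical counts: N = 1000, n_X = 20, n_{XY} = 10
\[
\mathrm{support} = \frac{10}{1000} = 1\%,
\qquad
\mathrm{confidence} = \frac{10}{20} = 50\%
\]
```

The hypothetical counts were chosen so that the result matches the [1%, 50%] annotation format used for association rules earlier in the lecture.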

Page 54: DM Lecture 2

Are All the “Discovered” Patterns Interesting?

Objective vs. subjective measures:

Objective: based on statistics and structures of patterns

support and confidence

Subjective: based on the user’s beliefs about the data

unexpectedness, novelty, actionability, or confirmation of a hypothesis or hunch the user wishes to validate

Page 55: DM Lecture 2

Can We Find All and Only Interesting Patterns?

An optimisation problem: completeness of the DM algorithm

Completeness: find all the interesting patterns

Can a data mining system find all the interesting patterns?

Association vs. classification vs. clustering

Page 56: DM Lecture 2

Can We Find All and Only Interesting Patterns?

Optimization: search for only interesting patterns

Can a data mining system find only the interesting patterns?

Approaches

First generate all the patterns and then filter out the uninteresting ones

Mining query optimization (improve search efficiency by pruning away subsets of the pattern space that do not satisfy the interestingness constraints)
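
The second approach can be illustrated with an Apriori-style level-wise search for frequent itemsets, where minimum support plays the role of the interestingness constraint. The slide does not name this algorithm; it is used here as one well-known instance of pruning, and the transactions and function name are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search that never counts a candidate with an infrequent subset."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        frequent_k = set(current)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune: a candidate survives only if all of its k-subsets are frequent,
        # so whole regions of the pattern space are skipped without counting.
        candidates = [
            c for c in candidates
            if all(frozenset(sub) in frequent_k for sub in combinations(c, k))
        ]
        current = [c for c in candidates if support(c) >= min_support]
        k += 1
    return result

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

result = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in result.items():
    print(sorted(itemset), s)
```

Because butter alone falls below the threshold, no itemset containing butter is ever generated or counted, in contrast to the generate-everything-then-filter approach.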

Page 57: DM Lecture 2

Major Issues in Data Mining

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Incorporation of background knowledge: allows discovered patterns to be expressed in concise terms at different levels of abstraction

Handling noise and incomplete data, which may confuse the process

Pattern evaluation: the interestingness problem

Expression and visualization of data mining results in expressive forms, so that the knowledge is easily understood and usable by humans

Page 58: DM Lecture 2

Major Issues in Data Mining

Performance and scalability: huge data volumes

Efficiency of data mining algorithms

Parallel, distributed and incremental mining methods

Issues relating to the diversity of data types

Handling relational and complex types of data

Mining information from diverse databases

Unrealistic to expect one system to mine all kinds of data

Page 59: DM Lecture 2

Major Issues in Data Mining

Issues related to applications and social impacts

Application of discovered knowledge

Domain-specific data mining tools

Intelligent query answering

Expert systems

Process control and decision making

A knowledge fusion problem

Protection of data security, integrity, and privacy

Page 60: DM Lecture 2

Summary

Data mining: discovering interesting patterns from large amounts of data

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Page 61: DM Lecture 2

Summary

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.

Classification of data mining systems

Major issues in data mining

Page 62: DM Lecture 2

Kinds of Data Mining

Decision Tree Learning

Clustering

Neural Networks

Association Rules

Support Vector Machines

Genetic Algorithms

Nearest Neighbor Method

Page 63: DM Lecture 2

Decision Tree Example

[Figure: decision tree; surviving labels: Grandparents, A lot, A little]

Page 64: DM Lecture 2

DECISION TREE FOR THE CONCEPT “Play Tennis”

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

[Mitchell, 1997]
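
The slide gives only the training data. As a sketch of how a decision tree learner might choose its first split, the code below computes the information gain of each attribute on these 14 examples, the criterion used by the ID3 algorithm in Mitchell (1997). The function and variable names are this example's own.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best first split:", best)
```

Outlook has the highest gain on this data, so a learner using this criterion would place it at the root of the tree.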

Page 65: DM Lecture 2

DECISION TREE FOR THE CONCEPT “Play Tennis”

[Mitchell, 1997]
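
The tree diagram itself did not survive in this transcript. The tree that Mitchell (1997) gives for the PlayTennis data can be written as nested conditions, as in the sketch below; the function name `play_tennis` is introduced here, and it is an assumption that the slide showed this same tree.

```python
def play_tennis(outlook, humidity, wind):
    """Classify a day by walking the PlayTennis tree from the root (Outlook) down."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        # Sunny days are decided by humidity.
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        # Rainy days are decided by wind.
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_tennis("Sunny", "High", "Weak"))       # day D1 in the table
print(play_tennis("Overcast", "High", "Weak"))    # day D3 in the table
print(play_tennis("Rain", "High", "Strong"))      # day D14 in the table
```

Temperature never appears in the tree: it is an example of an attribute that does not contribute to the classification and can be left out.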