DM Lecture 2


Introduction to Data Mining

Credit: Natasha Balac, Ph.D.

© Copyright 2006, Natasha Balac

Outline

Motivation: Why Data Mining?

What is Data Mining?

History of Data Mining

Data Mining Functionality and Terminology

Data Mining Applications

Are all the Patterns Interesting?

Issues in Data Mining

Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial,

scientific, engineering, etc.)

1990s—2000s:

Data mining and data warehousing, multimedia databases, and

Web databases



Necessity is the Mother of Invention

Data explosion

Automated data collection tools and mature database

technology lead to tremendous amounts of data stored in

databases, data warehouses and other information

repositories

We are drowning in data, but starving for

knowledge!


Necessity is the Mother of Invention

Solution - Data Mining

Data Warehousing and online analytical processing

Extraction of interesting knowledge (rules, regularities,

patterns, constraints) from data in large databases

Data Mining Main Objectives

Identification of data as a source of useful

information

Use of discovered information for competitive advantage in a business environment



Why DATA MINING?

Huge amounts of data

Electronic records of our decisions

Choices in the supermarket

Financial records

Our comings and goings

We swipe our way through the world – every swipe is a record in a database

Data rich – but information poor

Lying hidden in all this data is information!


Data vs. Information

Society produces massive amounts of data: business, science, medicine, economics, sports, …

Potentially valuable resource

Raw data is useless: we need techniques to automatically extract information

Data: recorded facts (as in databases)

Information: patterns underlying the data

Patterns must be discovered automatically

Data – Information - Knowledge


What is DATA MINING?

Extracting or “mining” knowledge from large amounts of data

Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data

Process for extracting implicit, previously unknown and unexpected, potentially extremely useful information from data


What Is Data Mining?

Data mining:

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Data mining: a misnomer?

Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.


Data Mining is NOT

Data Warehousing

(Deductive) query processing

SQL/ Reporting

Software Agents

Expert Systems

Online Analytical Processing (OLAP)

Statistical Analysis Tool

Data visualization


Data Mining

Programs that detect patterns and rules in

the data

Strong patterns can be used to make non-trivial predictions on new data


Data Mining Challenges

Problem 1: most patterns are not

interesting

Problem 2: patterns may be inexact or completely spurious when the data is noisy


Machine Learning Techniques

Technical basis for data mining: algorithms for

acquiring structural descriptions from examples

Methods originate from artificial intelligence,

statistics, and research on databases


Machine Learning Techniques

Structural descriptions represent patterns explicitly and can be used to:

predict the outcome in a new situation

understand and explain how a prediction is derived (maybe even more important)


Multidisciplinary Field

Data Mining

Database Technology

Statistics

Other Disciplines

Artificial Intelligence

Machine Learning

Visualization


Multidisciplinary Field

Database technology


Machine Learning including Neural Networks

Statistics

Pattern recognition

Knowledge-based systems/acquisition

High-performance computing

Data visualization


History of Data Mining


History

Emerged late 1980s

Flourished in the 1990s

Roots traced back along three family lines

Classical Statistics

Artificial Intelligence

Machine Learning


Statistics

Foundation of most DM technologies

Regression analysis, standard

distribution/deviation/variance, cluster

analysis, confidence intervals

Building blocks

Significant role in today’s data mining –

but alone is not powerful enough



Heuristics vs. Statistics

Human-thought-like processing

Requires vast computer processing power

Supercomputers


Machine Learning

Union of statistics and AI

Blends AI heuristics with advanced statistical analysis

Machine learning lets computer programs learn about the data they study, so that they make different decisions based on the quality of the studied data

Uses statistics for fundamental concepts and adds more advanced AI heuristics and algorithms


Data Mining

Application of machine learning techniques to real-world problems

Union: Statistics, AI, Machine learning

Used to find previously hidden trends or patterns

Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends which could not be found otherwise


Terminology

Gold mining (turning data tombs into golden nuggets of knowledge; finding small sets of precious nuggets from a great deal of raw material)

Knowledge mining from databases

Knowledge extraction

Data/pattern analysis

Knowledge Discovery in Databases (KDD)

Information harvesting

Business intelligence

Data Mining: A KDD Process

September 6, 2014, Data Mining: Concepts and Techniques

Data mining: the core of knowledge discovery process.

[Figure: KDD process flow] Databases → Data Cleaning and Data Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation

Steps of a KDD Process

Learning the application domain:

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (remove noise, may take 60% of effort!)

Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge


Data Mining and Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases from bottom to top. Layers listed from top to bottom, with the role shown beside each layer in parentheses.]

Making Decisions (End User)

Data Presentation: Visualization Techniques (Business Analyst)

Data Mining: Information Discovery (Data Analyst)

Data Exploration: Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts: OLAP, MDA (DBA)

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Architecture of a Typical Data Mining System

[Figure: system components, listed from the user interface down to the data sources.]

Graphical user interface

Pattern evaluation: interestingness measures focus the search on interesting patterns

Data mining engine

Database or data warehouse server

Data cleaning, data integration, and filtering

Databases and data warehouse

Knowledge base: guides the search and evaluates the interestingness of patterns

Data Mining: On What Kind of Data?

Should be applicable to any kind of data repository, and to transient data (data streams)

Types of databases and information repositories on which data mining can be performed

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW



LEARNING ALGORITHMS

Fundamental idea:

learn rules/patterns/relationships

automatically from the data


Data Mining Tasks

Exploratory Data Analysis

Predictive Modeling: inference on current data to make predictions

Classification and Regression

Descriptive Modeling: characterise general properties of the data

Cluster analysis/segmentation

Discovering Patterns and Rules

Association/Dependency rules

Sequential patterns

Temporal sequences

Deviation detection


Data Mining Tasks

Data is associated with classes (e.g., computers, printers) or concepts (e.g., customer types)

Concept/Class description: Characterization and discrimination

Generalize, summarize (target class), and contrast data characteristics, e.g., dry vs. wet regions (contrasting class)

Frequent patterns: patterns occurring frequently in the data

Frequent itemsets: sets of items frequently appearing together in transactional data

Association (correlation and causality)

Multi-dimensional or single-dimensional association

age(X, “20-29”) ^ income(X, “60-90K”) => buys(X, “TV”)

age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]

buys(T, “computer”) => buys(T, “software”) [support = 1%, confidence = 50%]

An association rule is discarded as uninteresting if it does not satisfy both a minimum support threshold and a minimum confidence threshold
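The support and confidence figures attached to the rules above can be estimated by counting transactions. The following is a minimal sketch, not a production miner: the transaction data, the function names and the threshold values are all hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from transaction counts."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, frozenset(antecedent) | frozenset(consequent))
    return both / base


def keep_rule(transactions, antecedent, consequent, min_support, min_confidence):
    """A rule is kept only if it meets BOTH thresholds."""
    both = frozenset(antecedent) | frozenset(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


# Hypothetical toy transactions (not taken from the lecture).
TRANSACTIONS = [
    frozenset({"computer", "software"}),
    frozenset({"computer"}),
    frozenset({"computer", "software", "printer"}),
    frozenset({"printer"}),
    frozenset({"software"}),
]

if __name__ == "__main__":
    print(support(TRANSACTIONS, {"computer", "software"}))
    print(confidence(TRANSACTIONS, {"computer"}, {"software"}))
    print(keep_rule(TRANSACTIONS, {"computer"}, {"software"}, 0.3, 0.6))
```

On this toy data, two of five transactions contain both items and three contain "computer", so the rule computer => software has support 2/5 and confidence 2/3.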


Data Mining Tasks

Classification

Finding models (functions) that describe and distinguish classes or

concepts for future prediction

Example: classify countries based on climate, or classify cars based on

gas mileage

Derived model based on analysis of a set of training data (objects of known

class label)

Model may be represented in various forms:

IF-THEN rules, decision trees, classification rules, neural networks

Prediction

Models continuous-valued functions: predicts unknown or missing numerical values (e.g., regression analysis)

Relevance analysis – identify attributes not contributing to classification/prediction,

hence can be excluded


Data Mining Tasks

Cluster analysis

Class label is unknown: group data to form new classes

Example: cluster houses to find distribution

patterns

Clustering based on the principle: maximizing the

intra-class similarity and minimizing the interclass

similarity

Each cluster formed may be viewed as a class of objects
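One common way to approximate the "high intra-class similarity, low inter-class similarity" principle is k-means, which the slides do not name; it is used here only as an illustrative technique. The sketch assumes hypothetical (x, y) house locations and a deliberately simple initialisation (the first k points), which real implementations usually replace with something more robust.

```python
from math import dist


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical house locations forming two visible groups.
HOUSES = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]

if __name__ == "__main__":
    centroids, clusters = k_means(HOUSES, k=2)
    for centre, members in zip(centroids, clusters):
        print(centre, members)
```

Each returned cluster can then be treated as a new class of objects, as the slide describes. The result depends on the initial centroids, so a poor starting choice can produce a worse grouping.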



Data Mining Tasks

Outlier analysis

Outlier: a data object that does not comply with the general

behavior of the data

Mostly considered as noise or exception, but is quite useful

in fraud detection, rare events analysis

Trend and evolution analysis: models regularities or trends for objects whose behaviour changes with time

Trend and deviation: regression analysis

Sequential pattern mining, periodicity analysis
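As a rough illustration of outlier analysis as described above, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is only one simple technique and is not claimed to be the method the lecture has in mind; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts with one suspicious charge.
AMOUNTS = [20, 25, 22, 19, 24, 21, 23, 500]

if __name__ == "__main__":
    print(z_score_outliers(AMOUNTS))
```

A single extreme value also inflates the mean and standard deviation it is measured against, so fraud-detection settings often prefer robust statistics such as the median; the z-score is kept here for brevity.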


Data Mining: Classification Schemes

General functionality

Descriptive data mining vs. predictive data mining

Descriptive data mining (DDM): characterise general properties of data

Predictive data mining (PDM): perform inference on data in order to make predictions

Different views - different classifications

Kinds of databases to be mined

Kinds of knowledge to be discovered

Kinds of techniques employed

Kinds of applications


A Multi-Dimensional View of Data

Mining Classification

Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, WWW, etc.

Knowledge to be mined

Characterization, discrimination, association,

classification, clustering, trend, deviation and outlier

analysis, etc.

Multiple/integrated functions

Mining at multiple levels of abstractions


A Multi-Dimensional View of Data

Mining Classification

Techniques utilized

Decision/Regression trees, clustering, neural

networks, etc.

Applications adapted

Retail, telecom, banking, DNA mining, stock

market analysis, Web mining



Science: Chemistry, Physics, Medicine

Biochemical analysis

Remote sensors on a satellite

Telescopes – star galaxy classification

Medical Image analysis



Bioscience

Sequence-based analysis

Protein structure and function prediction

Protein family classification

Microarray gene expression


Pharmaceutical companies, Insurance

and Health care, Medicine

Drug development

Identify successful medical therapies

Claims analysis, fraudulent behavior

Medical diagnostic tools

Predict office visits



Financial Industry, Banks, Businesses, E-commerce

Stock and investment analysis

Identify loyal customers vs. risky customers

Predict customer spending

Risk management

Sales forecasting



Retail and Marketing

Customer buying patterns/demographic

characteristics

Mailing campaigns

Market basket analysis

Trend analysis



Database analysis and decision support

Market analysis and management

Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting,

quality control, competitive analysis

Fraud detection and management



Sports and Entertainment

IBM Advanced Scout analyzed NBA game

statistics (shots blocked, assists, and fouls) to

gain competitive advantage for New York

Knicks and Miami Heat

Astronomy

JPL and the Palomar Observatory discovered

22 quasars with the help of data mining



DATA MINING EXAMPLES

Grocery store

NBA

Banking and Credit Card scoring

Fraud detection

Personalization & Customer Profiling

Campaign Management and Database

Marketing


Data mining at work:

Case study 1


Processing Loan Applications

Given: questionnaire with financial and personal information

Problem: should money be lent?

Borderline cases referred to loan officers

But: 50% of accepted borderline cases defaulted!

Solution: reject all borderline cases?

Borderline cases are most active customers!


Enter Machine Learning

Given: 1000 training examples of borderline cases

20 attributes, including:

age, years with current employer, years at current address, years with the bank, years at current job, other credit cards

Learned rules predicted 2/3 of borderline cases correctly!

Rules could be used to explain decisions to customers


Case study 2:

Screening images

Given:

radar satellite images of coastal waters

Problem: detecting oil slicks in those images

Oil slicks = dark regions with changing size and shape

Look-alike dark regions can be caused by weather conditions (e.g. high wind)

Expensive process requiring highly trained personnel


Enter Machine Learning

Dark regions extracted from normalized image

Attributes:

size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background

Constraints:

Scarcity of training examples (oil slicks are rare!)

Unbalanced data: most dark regions aren’t oil slicks

Regions from same image form a batch

Requirement: adjustable false-alarm rate


Data Mining Challenges

Computationally expensive to investigate all possibilities

Dealing with noise/missing information and errors in data

Choosing appropriate attributes/input representation

Finding the minimal attribute space

Finding adequate evaluation function(s)

Extracting meaningful information

Not overfitting


Are All the “Discovered” Patterns

Interesting?

DM system may generate thousands/millions of patterns/rules

Interestingness measures: A pattern is interesting if it is easily understood

by humans, valid on new or test data with some degree of certainty,

potentially useful, novel, or validates some hypothesis that a user seeks to

confirm

support: percentage of transactions that the given rule satisfies

support( X => Y ) = P ( X U Y )

confidence: degree of certainty of detected association

confidence(X => Y ) = P ( Y | X )
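The two definitions above can be written in terms of transaction counts. In this notation, X ∪ Y denotes the combined itemset, so P(X ∪ Y) is the probability that a transaction contains every item of both X and Y. The worked numbers are hypothetical, chosen only so that the result matches a rule annotated [1%, 50%].

```latex
\[
\operatorname{support}(X \Rightarrow Y) = P(X \cup Y)
  \approx \frac{\#\{\text{transactions containing all items of } X \text{ and } Y\}}
               {\#\{\text{all transactions}\}}
\]
\[
\operatorname{confidence}(X \Rightarrow Y) = P(Y \mid X)
  = \frac{P(X \cup Y)}{P(X)}
\]
% Hypothetical example: 1000 transactions, 20 contain "computer",
% 10 contain both "computer" and "software".
\[
\operatorname{support} = \frac{10}{1000} = 1\%,
\qquad
\operatorname{confidence} = \frac{10/1000}{20/1000} = 50\%
\]
```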


Are All the “Discovered” Patterns

Interesting?

Objective vs. subjective measures:

Objective: based on statistics and structures of

patterns

support and confidence

Subjective: based on user’s belief in the data

unexpectedness, novelty, actionability, or confirmation of a hypothesis (or hunch) that the user wishes to validate


Can We Find All and Only Interesting

Patterns?

An optimisation problem: completeness of the data mining algorithm

Completeness - Find all the interesting

patterns

Can a data mining system find all the interesting

patterns?

Association vs. classification vs. clustering


Can We Find All and Only Interesting

Patterns?

Optimization - Search for only interesting patterns

Can a data mining system find only the interesting

patterns?

Approaches

First generate all the patterns and then filter out the

uninteresting ones

Mining query optimization (improve search efficiency by pruning

away subsets of the pattern space not satisfying interestingness constraints)
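The pruning idea in the second approach can be sketched with a level-wise, Apriori-style search. The slides do not name a specific algorithm, so this is an illustrative choice. A larger itemset is counted only if all of its smaller subsets already met the minimum support, so large parts of the pattern space are never examined. The basket data and function name are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search that prunes candidates with an infrequent subset."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level 1 candidates
    frequent = {}
    size = 1
    while current:
        # Count each surviving candidate against the transactions.
        survivors = {}
        for candidate in current:
            fraction = sum(1 for t in transactions if candidate <= t) / n
            if fraction >= min_support:
                survivors[candidate] = fraction
        frequent.update(survivors)
        # Build the next level only from survivors; a candidate is kept
        # only if every subset one item smaller is itself frequent.
        size += 1
        next_level = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(subset) in survivors
                for subset in combinations(union, size - 1)
            ):
                next_level.add(union)
        current = list(next_level)
    return frequent


# Hypothetical market baskets.
BASKETS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

if __name__ == "__main__":
    for itemset, fraction in frequent_itemsets(BASKETS, min_support=0.5).items():
        print(sorted(itemset), fraction)
```

With a minimum support of 0.5, "butter" fails at the first level, so no itemset containing it is ever counted. That is the saving compared with generating every pattern first and filtering afterwards.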


Major Issues in Data Mining

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Incorporation of background knowledge: allows discovered patterns to be expressed in concise terms at different levels of abstraction

Handling noise and incomplete data, which may confuse the process

Pattern evaluation: the interestingness problem

Expression and visualization of data mining results in expressive forms so that knowledge is easily understood, usable by humans


Performance and scalability: huge data volumes

Efficiency of data mining algorithms

Parallel, distributed and incremental mining

methods

Issues relating to the diversity of data types

Handling relational and complex types of data

Mining information from diverse databases

Unrealistic to expect one system to mine all kinds of data



Issues related to applications and social

impacts

Application of discovered knowledge

Domain-specific data mining tools

Intelligent query answering

Expert systems

Process control and decision making

A knowledge fusion problem

Protection of data security, integrity, and privacy



Summary

Data mining: discovering interesting patterns

from large amounts of data

A KDD process includes data cleaning, data

integration, data selection, transformation, data

mining, pattern evaluation, and knowledge

presentation


Summary

Mining can be performed in a variety of

information repositories

Data mining functionalities: characterization,

association, classification, clustering, outlier

and trend analysis, etc.

Classification of data mining systems

Major issues in data mining


Kinds of Data Mining

Decision Tree Learning

Clustering

Neural Networks

Association Rules

Support Vector Machines

Genetic Algorithms

Nearest Neighbor Method


Decision Tree Example

[Figure: decision tree; recoverable labels: "Grandparents", "A lot", "A little"]


DECISION TREE FOR THE CONCEPT

“Play Tennis”

Day   Outlook    Temp   Humidity   Wind     PlayTennis
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No

[Mitchell, 1997]
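A decision tree learner has to decide which attribute to test first. One standard criterion is information gain, the reduction in entropy of the class label after splitting on an attribute. The sketch below computes it for the 14 examples in the table; the function and variable names are illustrative, and the slides themselves do not state which splitting criterion is used.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),           # D1
    ("Sunny", "Hot", "High", "Strong", "No"),         # D2
    ("Overcast", "Hot", "High", "Weak", "Yes"),       # D3
    ("Rain", "Mild", "High", "Weak", "Yes"),          # D4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),        # D5
    ("Rain", "Cool", "Normal", "Strong", "No"),       # D6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),  # D7
    ("Sunny", "Mild", "High", "Weak", "No"),          # D8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),       # D9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),        # D10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),     # D11
    ("Overcast", "Mild", "High", "Strong", "Yes"),    # D12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),     # D13
    ("Rain", "Mild", "High", "Strong", "No"),         # D14
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [row[label_index] for row in rows]
    gain = entropy(labels)
    for value in {row[attribute_index] for row in rows}:
        subset = [row[label_index] for row in rows if row[attribute_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


GAINS = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
BEST = max(GAINS, key=GAINS.get)

if __name__ == "__main__":
    for name, gain in sorted(GAINS.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {gain:.3f}")
    print("Best first split:", BEST)
```

On this data, Outlook has the highest gain (about 0.247 bits), so it is the first attribute a gain-based learner would test.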


DECISION TREE FOR THE CONCEPT

“Play Tennis”

[Mitchell,1997]
