CMPT 843, SFU, Martin Ester, 1-06 1

Database and Knowledge-Base Systems: Data Mining

Martin Ester

Simon Fraser University

School of Computing Science

Graduate Course

Spring 2006


Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 1996]

Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is

• valid

• previously unknown

• and potentially useful.

Remarks

• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.

• valid: in the statistical sense.

• previously unknown: not explicit, no "common sense knowledge".

• potentially useful: for some given application.


Introduction

Statistics [Hand, Mannila & Smyth 2001]

• representation of uncertainty

• model-based inferences

• focus on numeric data

Machine Learning [Mitchell 1997]

• knowledge representation

• search strategies

• focus on symbolic data

Database Systems [Han & Kamber 2000]

• data management

• integration of data mining with DBS

• scalability for large databases


Introduction

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]

(Figure) Database → Focussing → Pre-processing → Transformation → Data Mining → Patterns → Evaluation → Knowledge

KDD Process [Han & Kamber 2000]

(Figure) Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge


Data Mining

Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]

• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks

• clustering

• classification

• association rules (e.g., A and B → C)

• generalisation

• other tasks: regression, outlier detection, . . .
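To make the association rule "A and B → C" concrete, here is a minimal Python sketch. It assumes the usual support and confidence measures, which these slides do not define. The transaction data and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, body, head):
    """Estimate of P(head | body): support(body and head) / support(body)."""
    both = set(body) | set(head)
    return support(transactions, both) / support(transactions, body)


# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
```

Here the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain both A and B.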


Trends in KDD Research

KDD 2000 Conference

• New Data Mining Algorithms

• Efficiency and Scalability of Data Mining Algorithms

• Interactive Data Exploration

• Visualization

• Constraints and Evaluation in the KDD Process



KDD 2002 Conference

• Statistical Methods

• Frequent Patterns

• Streams and Time Series

• Visualization

• Web Search and Navigation

• Text and Web Page Classification

• Intrusion and Privacy

• Applications



KDD 2004 Conference

• Frequent Patterns / Association Rules

• Clustering

• Mining Spatio-Temporal Data

• Mining Data Streams

• Dimensionality Reduction

• Privacy-Preserving Data Mining

• Mining Biological Data

• Applications (Web, biological data, security, . . .)



KDD 2005 Conference

• Clustering

• Privacy

• Mining Spatio-Temporal Data

• Mining Data Streams

• SVMs

• Text and Web Mining

• Mining (Social) Networks

• Graph Mining (best paper on graphs over time)



Increasing Importance

• Mining data streams

• Clustering high-dimensional data

• Mining spatio-temporal data

• Privacy-preserving data mining

• Network analysis

• Graph mining

• Multi-relational data mining


Overview of this Course

Prerequisites

Basics in database systems and statistics

Introductory graduate data mining course

Objectives

• Introduction to some hot topics of data mining research

• Introduction to some ongoing research projects of our DDM Lab

• General research methodology

• Presentation skills

start thesis work after this class!



Topics

• Clustering high-dimensional data

• Mining data streams

• Spatio-temporal data mining

• Multi-relational data mining

• Graph mining



Format

• Tutorial surveys

• Research paper presentations (and discussions)

• Small research projects

Grading

• Paper presentation

• Project presentation

• Project report

→ originality, technical quality, presentation quality


Clustering High-Dimensional Data

Applications

Biological Data

• Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition

• Often: thousands of columns

• Co-regulated genes: similar expression levels in a subset of all conditions

Text / Web Data

• Text / web document: attributes = term frequencies

• Typically, >> 1000 relevant terms

• Document clusters: document sets that share some important terms



Curse of Dimensionality

• The more dimensions, the larger the (average) pairwise distances

• Clusters only in lower-dimensional subspaces

(Figure) Example: clusters only in the 1-dimensional subspace "salary"
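The growth of average pairwise distances with the number of dimensions can be checked empirically. The following Python sketch assumes points drawn uniformly from the unit hypercube and Euclidean distance. Both are illustrative choices, not taken from the slides.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance over all pairs of uniform random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
means = {d: mean_pairwise_distance(100, d, rng) for d in (1, 10, 100)}
for d, m in means.items():
    print(f"{d:>3} dimensions: mean pairwise distance {m:.2f}")
```

With the number of points held fixed, the mean distance increases with the dimensionality, so the same data set looks sparser and full-dimensional clusters become harder to detect.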



Approaches

• Approach 1: a cluster is a dense, connected region in the data space

Find interesting subspaces, then clusters within these subspaces

→ density threshold hard to determine (should differ across subspaces)

→ clusters highly overlapping

• Approach 2: start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions

→ result ill-defined

→ number of clusters / cluster dimensions hard to determine


Mining Data Streams

Applications

• Telecommunications

o Telecommunications providers collect call records (from, to, when, how long, . . .)

o Want to use the data not only for billing, but also for analysis (monitor trends in usage, customer segmentation, campaign design, . . .)

• Sensor networks

o Network of distributed sensors measuring several parameters such as precipitation, temperature, amount of traffic, blood pressure, . . .

o Data need to be monitored and analyzed on-line (immediate response)



Challenges

• Characteristics of data streams

o Massive volumes of data

o Records arrive at a rapid rate

• Requirements

o Main memory too small to store all records

o Each record is examined at most once

o Real-time response, i.e., very efficient processing
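One classical way to meet these requirements is reservoir sampling, which keeps a uniform random sample of fixed size while reading each record exactly once. The slides do not name this technique, so it is swapped in here as an example. The sketch below is a minimal Python version under that assumption.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k records from a stream of unknown length.

    Each record is looked at once; memory use is O(k).
    """
    rng = rng or random.Random()
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: the first k records are always kept.
            reservoir.append(record)
        else:
            # Keep the n-th record with probability k / n,
            # replacing a uniformly chosen earlier sample.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical stream: integers stand in for call records or sensor readings.
sample = reservoir_sample(iter(range(100_000)), k=5, rng=random.Random(42))
```

The sample can then be passed to any standard mining algorithm, at the price of an approximate answer.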



Approach

• Summarize using samples, histograms or novel methods such as CF-trees

→ How to maximize the approximation accuracy?

→ How to exploit the temporal dimension (aging of data)?

(Figure) Data Stream 1, . . ., Data Stream m → Stream Processing Engine with main-memory synopsis → (Approximate) Answer
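The entries of a CF-tree are clustering features: triples (N, LS, SS) holding the number of points, their linear sum and their sum of squares. The Python sketch below shows only this additive summary, not the tree itself. The class and method names are illustrative.

```python
import math


class ClusteringFeature:
    """Additive summary (N, LS, SS) of a set of d-dimensional points."""

    def __init__(self, dims):
        self.n = 0
        self.linear_sum = [0.0] * dims
        self.square_sum = 0.0

    def add(self, point):
        """Absorb one record; the record itself need not be stored."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum += x * x

    def merge(self, other):
        """Two summaries combine by component-wise addition."""
        self.n += other.n
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        self.square_sum += other.square_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root mean squared distance of the summarized points to the centroid."""
        centroid_norm_sq = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - centroid_norm_sq, 0.0))


left, right = ClusteringFeature(2), ClusteringFeature(2)
for p in [(1.0, 1.0), (3.0, 1.0)]:
    left.add(p)
for p in [(1.0, 3.0), (3.0, 3.0)]:
    right.add(p)
left.merge(right)  # now summarizes all four points
```

Because the summary is additive, each record is processed once in constant space, and centroid and radius are available without access to the original records.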


Spatio-Temporal Data Mining

Applications

• Geo-marketing

Purchasing patterns for particular geographical areas (e.g., for choice of store location)

• Health care data analysis

Analysis of the spread of diseases

Interventions by public health authorities

→ Data referencing the earth's surface (spatial) and the time (temporal)



Challenges

• Independence assumption no longer valid

Attribute values of neighboring objects are typically correlated

• Operations on spatial data are very expensive

Spatial objects are complex (lines, polygons, 3D surfaces, . . .)

• Temporal dimension

Blows up the pattern search space

→ What patterns do we really want to find in a spatio-temporal DB?



Approaches

• Consider spatial auto-correlation

Find only patterns that deviate from what is expected according to spatial auto-correlation

• Efficient support by the DBMS

Indexes, basic operations, . . .

• Models for spatio-temporal data mining

Definition of new pattern types such as spatio-temporal trends
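Spatial auto-correlation can be quantified in several ways. One standard statistic is Moran's I, which is not named in these slides and is swapped in here as an example. The Python sketch below assumes a symmetric 0/1 neighborhood matrix. The two toy data sets are made up.

```python
def morans_i(values, weights):
    """Moran's I for attribute values and a spatial weight matrix.

    weights[i][j] > 0 means that locations i and j are neighbors.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_total) * cross / sum(d * d for d in dev)


# Four locations on a line; each is a neighbor of the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)       # neighbors are similar
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)  # neighbors are dissimilar
```

Positive values indicate that neighboring objects tend to have similar attribute values, negative values the opposite. A pattern is then interesting only if it deviates from what such a baseline predicts.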


Multi-Relational Data Mining

Applications

• Mining biological data

o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, . . .

o Want to learn, e.g., about the process of gene regulation

• Text mining

o Using information extraction methods, entities (companies, persons, genes, . . .) and their relationships (directs, married, regulates, . . .) can be extracted from a text document

o Can be used as input for true text mining: finding knowledge rather than documents



Limitations of Existing Methods

• Emerging applications are inherently multi-relational

o Input: multiple tables (entity sets) and their relationships

o Record characteristics: own attributes, related records from other tables and the attributes of these related records

• Existing data mining methods are single-relational

o Input: a single table (relation), Output: refers to attributes of a single table

o Data representation as a universal relation (single table) is possible, but may lose a lot of information

→ single-table representation corresponds to propositional logic
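The information loss of a single-table representation can be illustrated with two small tables. The Python sketch below uses hypothetical customer and order data and shows the two usual ways of flattening: a join and an aggregation.

```python
from collections import Counter, defaultdict

# Hypothetical tables: customers (entity set) and their orders (related records).
customers = {1: {"city": "Burnaby"}, 2: {"city": "Vancouver"}}
orders = [(1, 10.0), (1, 30.0), (1, 20.0), (2, 60.0)]  # (customer id, amount)

# Option 1: join into one table with one row per order.
# Customer attributes are repeated, which distorts their statistics.
joined = [
    {"customer": cid, "city": customers[cid]["city"], "amount": amount}
    for cid, amount in orders
]
rows_per_city = Counter(row["city"] for row in joined)

# Option 2: aggregate into one table with one row per customer.
# The individual orders can no longer be recovered.
totals = defaultdict(float)
for cid, amount in orders:
    totals[cid] += amount
aggregated = {
    cid: {"city": customers[cid]["city"], "total": totals[cid]}
    for cid in customers
}
```

After the join, Burnaby appears three times although only one customer lives there. After the aggregation, both customers have the same total, although one placed three small orders and the other a single large one.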



Approaches

• Inductive Logic Programming

o Logic program: facts (records) and deduction rules (background knowledge)

o Task: find (first order) logic rules with some target predicate in the conclusion

o Restrict search space by user-specified (syntactic) constraints

→ huge search space

→ syntactic constraints are hard to define

→ only for classification tasks



Approaches

• First-order versions of standard data mining algorithms

o Multi-relational decision trees

o Multi-relational association rules

→ What rule format / semantics (in particular, aggregation operations)?

• Multi-relational distances

o Family of distance functions with different depths, taking into account attributes of related records up to the given depth

o Standard methods can be applied, e.g., k-means or k-NN classification

→ (global) distance function loses a lot of information


Graph Mining

Applications

• Analysis of the internet

o What are the most important web pages?

o What will the internet / web look like next year?

• Social network analysis

o Which customers should be targeted to maximize the profit of a marketing campaign?

o Whom to immunize in order to stop the spread of some virus?

o Find abnormal subgraphs (e.g., criminal rings).
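One well-known answer to the question of the most important web pages is PageRank. The slides do not name a method, so it is swapped in here as an example. The Python sketch below is a minimal power iteration on a made-up four-page link graph. The damping factor 0.85 and the iteration count are conventional assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for PageRank.

    links maps each page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a -> b, a -> c, b -> c, c -> a, d -> c.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
```

Page c is linked from all other pages and ends up with the highest rank. Page d has no incoming links and keeps only the share from random jumps.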



Challenges

• Definition of new types of patterns

o Certain subgraphs . . .

o Which ones are interesting in a given application?

• Complexity

o Many graph problems are NP-complete.

o Real graphs tend to be extremely large.

→ Need efficient algorithms

• Dynamics

o Many networks evolve rapidly.


References

Text Books

• Han J., Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.

• Hand D., Mannila H., Smyth P., "Principles of Data Mining", MIT Press, 2001.

• Mitchell T. M., "Machine Learning", McGraw-Hill, 1997.
