CMPT 843, SFU, Martin Ester, 1-06 1
Database and Knowledge-Base Systems: Data Mining
Martin Ester
Simon Fraser University
School of Computing Science
Graduate Course
Spring 2006
Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 1996]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.
Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.
Introduction
Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data
Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data
Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases
Introduction
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Database → Focussing → Pre-processing → Transformation → Data Mining → Patterns → Evaluation → Knowledge
KDD Process [Han & Kamber 2000]
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the
patterns contained in some database.
Data-Mining Tasks
• clustering
• classification
• association rules (e.g., A and B → C)
• generalisation
other tasks: regression, outlier detection . . .
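A rule such as "A and B → C" is usually judged by its support and confidence. The following is a minimal sketch of these two measures; the transaction data and the function names are made up for illustration and are not taken from the course material.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy transactions (illustration only).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"A", "B", "C"},
    {"B", "C"},
]

rule_support = support(transactions, {"A", "B", "C"})       # 2 of 5 transactions
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})  # 2 of the 3 with A and B
print(rule_support, rule_confidence)
```

Here the rule holds in 2 of the 3 transactions that contain both A and B, so its confidence is about 0.67.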
Trends in KDD Research
KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process
Trends in KDD Research
KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications
Trends in KDD Research
KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, . . .)
Trends in KDD Research
KDD 2005 Conference
• Clustering
• Privacy
• Mining Spatio-Temporal Data
• Mining Data Streams
• SVMs
• Text and Web Mining
• Mining (Social) Networks
• Graph Mining (best paper on graphs over time)
Trends in KDD Research
Increasing Importance
• Mining data streams
• Clustering high-dimensional data
• Mining spatio-temporal data
• Privacy-preserving data mining
• Network analysis
• Graph mining
• Multi-relational data mining
Overview of this Course
Prerequisites
Basics in database systems and statistics
Introductory graduate data mining course
Objectives
• Introduction to some hot topics of data mining research
• Introduction to some ongoing research projects of our DDM Lab
• General research methodology
• Presentation skills
start thesis work after this class!
Overview of this Course
Topics
• Clustering high-dimensional data
• Mining data streams
• Spatio-temporal data mining
• Multi-relational data mining
• Graph mining
Overview of this Course
Format
• Tutorial surveys
• Research paper presentations (and discussions)
• Small research projects
Grading
• Paper presentation
• Project presentation
• Project report
→ originality, technical quality, presentation quality
Clustering High-Dimensional Data
Applications
Biological Data
• Micro-array data: rows = genes, columns = conditions / experiments; each value measures the expression level of a gene under a given condition
• Often: thousands of columns
• Co-regulated genes: similar expression levels in a subset of all conditions
Text / Web Data
• Text / web document: attributes = term frequencies
• Typically, >> 1000 relevant terms
• Document clusters: document sets that share some important terms
Clustering High-Dimensional Data
Curse of Dimensionality
• The more dimensions, the larger the (average) pairwise distances
• Clusters only in lower-dimensional subspaces
[Figure: example with clusters only in the 1-dimensional subspace "salary"]
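The first point above can be checked empirically. The sketch below assumes points drawn uniformly at random from the unit hypercube (an assumption made only for this illustration) and reports the mean pairwise Euclidean distance together with the relative contrast (max - min) / min of the distances. The function name and parameters are hypothetical.

```python
import itertools
import math
import random


def pairwise_stats(dim, n_points=50, seed=0):
    """Mean pairwise distance and relative contrast for random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    mean = sum(dists) / len(dists)
    # Relative contrast: how much farther the farthest pair is than the closest pair.
    contrast = (max(dists) - min(dists)) / min(dists)
    return mean, contrast


for dim in (2, 10, 100):
    mean, contrast = pairwise_stats(dim)
    print(f"dim={dim:3d}  mean distance={mean:.2f}  relative contrast={contrast:.2f}")
```

Under this assumption the mean distance grows with the number of dimensions while the relative contrast shrinks, so nearest and farthest neighbors become hard to tell apart in the full space.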
Clustering High-Dimensional Data
Approaches
• Approach 1: a cluster is a densely connected region in data space
• Find interesting subspaces, then clusters within these subspaces
→ density threshold hard to determine (should differ between subspaces)
→ clusters highly overlapping
• Approach 2: start with a full-dimensional clustering and iteratively refine the clusters and the relevant cluster dimensions
→ result ill-defined
→ number of clusters / cluster dimensions hard to determine
Mining Data Streams
Applications
• Telecommunications
o Telecommunications providers collect call records (from, to, when, how
long, . . .)
o Want to use the data not only for billing, but also for analysis (monitor
trends in usage, customer segmentation, campaign design, . . .)
• Sensor networks
o Network of distributed sensors measuring several parameters such as
precipitation, temperature, amount of traffic, blood pressure, . . .
o Data need to be monitored and analyzed on-line (immediate response)
Mining Data Streams
Challenges
• Characteristics of data streams
o Massive volumes of data
o Records arrive at a rapid rate
• Requirements
o Main memory too small to store all records
o Each record is examined at most once
o Real time response, i.e. very efficient processing
Mining Data Streams
Approach
• Summarize using samples, histograms or novel methods such as CF-trees
→ How to maximize the approximation accuracy?
→ How to exploit the temporal dimension (aging of data)?
[Figure: Data Stream 1, . . ., Data Stream m → Stream Processing Engine with Main Memory Synopsis → (Approximate) Answer]
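One standard way to keep a sample as the main memory synopsis is reservoir sampling (Algorithm R). The slides do not name a specific sampling method, so this is only an illustrative sketch: it examines each record once and stores at most k records. The function name and the example stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical stream of call durations in seconds (illustration only).
call_durations = (30 + (7 * n) % 600 for n in range(100_000))
sample = reservoir_sample(call_durations, k=100, rng=random.Random(42))
print(len(sample), sum(sample) / len(sample))
```

The average over the sample then serves as an approximate answer for the average over the whole stream. Plain reservoir sampling treats all records alike, so it does not address the aging of data.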
Spatio-Temporal Data Mining
Applications
• Geo-marketing
Purchasing patterns for particular geographical areas (e.g., for choice of store location)
• Health care data analysis
Analysis of the spread of diseases
Interventions by public health authorities
→ Data referencing the earth's surface (spatial) and time (temporal)
Spatio-Temporal Data Mining
Challenges
• Independence assumption no longer valid
Attribute values of neighboring objects are typically correlated
• Operations on spatial data are very expensive
Spatial objects are complex (lines, polygons, 3D surfaces, . . .), which makes the corresponding operations very expensive
• Temporal dimension
Blows up the pattern search space
→ What patterns do we really want to find in a spatio-temporal DB?
Spatio-Temporal Data Mining
Approaches
• Consider spatial auto-correlation
Find only patterns that deviate from what is expected according to spatial auto-correlation
• Efficient support by the DBMS
Indexes, basic operations, . . .
• Models for spatio-temporal data mining
Definition of new pattern types such as spatio-temporal trends
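Spatial auto-correlation is often quantified with Moran's I. The slides do not prescribe a particular measure, so this is an illustrative sketch only. It assumes a binary neighborhood matrix, and the chain-shaped neighborhood and the sample values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for attribute values and a spatial weight matrix weights[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over all pairs of objects.
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    squared = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared)


def chain_weights(n):
    """Assumed neighborhood: objects lie on a line, direct neighbors get weight 1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


smooth = [1, 2, 3, 4, 5]        # neighbors have similar values
alternating = [1, -1, 1, -1]    # neighbors have opposite values
print(morans_i(smooth, chain_weights(5)))        # positive (about 0.5)
print(morans_i(alternating, chain_weights(4)))   # negative (about -1.0)
```

Values near zero indicate no spatial auto-correlation. A pattern would then count as interesting only if it deviates from what such a baseline already predicts.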
Multi-Relational Data Mining
Applications
• Mining biological data
o Molecular biologists collect data on genes, proteins, gene expression, metabolic pathways, . . .
o Want to learn, e.g., about the process of gene regulation
• Text mining
o Using information extraction methods, entities (companies, persons, genes, . . .)
and their relationships (directs, married, regulates, . . .) can be extracted from a
text document
o Can be used as input for true text mining: finding knowledge rather than
documents
Multi-Relational Data Mining
Limitations of Existing Methods
• Emerging applications are inherently multi-relational
o Input: multiple tables (entity sets) and their relationships
o Record characteristics: own attributes, related records from other tables and the attributes of these related records
• Existing data mining methods are single-relational
o Input: a single table (relation), Output: refers to attributes of a single table
o Data representation as a universal relation (single table) is possible, but may lose a lot of information
→ propositional logic
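The information loss can be seen on a tiny example. The sketch below uses two hypothetical tables (customers and orders, not from the slides) and builds a single table in two ways: by joining and by aggregating.

```python
# Hypothetical tables (illustration only).
customers = [
    {"id": 1, "age": 30},
    {"id": 2, "age": 45},
    {"id": 3, "age": 52},
]
orders = [
    {"cust_id": 1, "amount": 10.0},
    {"cust_id": 1, "amount": 250.0},
    {"cust_id": 1, "amount": 15.0},
    {"cust_id": 2, "amount": 99.0},
]


def universal_relation(customers, orders):
    """Inner join: one row per (customer, order) pair."""
    return [
        {**c, "amount": o["amount"]}
        for c in customers
        for o in orders
        if o["cust_id"] == c["id"]
    ]


def aggregated_relation(customers, orders):
    """One row per customer, with the orders reduced to summary attributes."""
    rows = []
    for c in customers:
        amounts = [o["amount"] for o in orders if o["cust_id"] == c["id"]]
        avg = sum(amounts) / len(amounts) if amounts else None
        rows.append({**c, "n_orders": len(amounts), "avg_amount": avg})
    return rows


joined = universal_relation(customers, orders)
aggregated = aggregated_relation(customers, orders)
print([row["id"] for row in joined])   # customer 1 three times, customer 3 missing
print(aggregated[0])                   # individual order amounts are gone
```

The join repeats customer 1 once per order and drops customer 3, which distorts any statistics over customers. The aggregation keeps one row per customer but hides the individual order amounts.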
Multi-Relational Data Mining
Approaches
• Inductive Logic Programming
o Logic program: facts (records) and deduction rules (background knowledge)
o Task: find (first order) logic rules with some target predicate in the conclusion
o Restrict search space by user-specified (syntactic) constraints
→ huge search space
→ syntactic constraints are hard to define
→ only for classification tasks
Multi-Relational Data Mining
Approaches
• First-order versions of standard data mining algorithms
o Multi-relational decision trees
o Multi-relational association rules
→ What rule format / semantics (in particular, aggregation operations)?
• Multi-relational distances
o Family of distance functions with different depths, taking into account attributes of related records up to the given depth
o Standard methods can be applied, e.g. k-means or k-NN classification
→ (global) distance function loses a lot of information
Graph Mining
Applications
• Analysis of the internet
o What are the most important web pages?
o What will the internet / web look like next year?
• Social network analysis
o Which customers should be targeted to maximize the profit of a marketing campaign?
o Who should be immunized in order to stop the spread of some virus?
o Find abnormal subgraphs (e.g., criminal rings).
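One well-known answer to the question about the most important web pages is PageRank. The slides do not name a method, so the following power-iteration sketch is only an illustration. The tiny link graph, the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank; `links` maps each page to the pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # A page passes its rank on in equal shares along its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical three-page web (illustration only).
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # prints "c"
```

Page c receives links from both other pages and ends up with the highest rank. On real web graphs the same iteration has to run over billions of edges, which ties into the complexity challenge of graph mining.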
Graph Mining
Challenges
• Definition of new types of patterns
o Certain subgraphs . . .
o Which ones are interesting in a given application?
• Complexity
o Many graph problems are NP-complete.
o Real graphs tend to be extremely large.
→ Need efficient algorithms
• Dynamics
o Many networks evolve rapidly.
References
Text Books
• Han J., Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
• Hand D., Mannila H., Smyth P., "Principles of Data Mining", MIT Press, 2001.
• Mitchell T. M., "Machine Learning", McGraw-Hill, 1997.