
CMPT 884, SFU, Martin Ester, 1-09 1

Special Topics in Database Systems

Martin Ester

Simon Fraser University

School of Computing Science

CMPT 884

Spring 2009


Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 96]

Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is

• valid

• previously unknown

• and potentially useful.

Remarks

• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.

• valid: in the statistical sense.

• previously unknown: not explicit, no "common sense knowledge".

• potentially useful: for some given application.


Introduction

Statistics [Hand, Mannila & Smyth 2001]

• representation of uncertainty

• model-based inferences

• focus on numeric data

Machine Learning [Mitchell 1997]

• knowledge representation

• search strategies

• focus on symbolic data

Database Systems [Han & Kamber 2000]

• data management

• integration of data mining with DBS

• scalability for large databases


Introduction

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]

Database → Focussing → Pre-processing → Transformation → Data Mining → Patterns → Evaluation → Knowledge

KDD Process [Han & Kamber 2000]

Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge


Data Mining

Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]

• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks

• clustering

• classification

• association rules (e.g., A and B → C)

• generalisation

other tasks: regression, outlier detection . . .
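
For the association-rule task, a rule such as "A and B → C" is commonly judged by its support (fraction of transactions containing A, B and C) and its confidence (fraction of transactions containing A and B that also contain C). The following is a minimal sketch of these two measures; the function names and the toy transactions are made up for illustration and are not part of the course material.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical toy database: each transaction is a set of items.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "C"}),
]

# Rule "A and B -> C"
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_support, rule_confidence)
```

On this toy data, two of five transactions contain A, B and C, and three contain A and B, so the rule holds in two of the three cases where its left-hand side applies.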


Trends in KDD Research

KDD 2000 Conference

• New Data Mining Algorithms

• Efficiency and Scalability of Data Mining Algorithms

• Interactive Data Exploration

• Visualization

• Constraints and Evaluation in the KDD Process



KDD 2002 Conference

• Statistical Methods

• Frequent Patterns

• Streams and Time Series

• Visualization

• Web Search and Navigation

• Text and Web Page Classification

• Intrusion and Privacy

• Applications



KDD 2004 Conference

• Frequent Patterns / Association Rules

• Clustering

• Mining Spatio-Temporal Data

• Mining Data Streams

• Dimensionality Reduction

• Privacy-Preserving Data Mining

• Mining Biological Data

• Applications (Web, biological data, security, . . .)



KDD 2006 Conference

• Clustering

• Classification / supervised ML

• Privacy

• Web / Graph Mining

• Web / Text Mining

• Frequent Pattern Mining

• Structured Data



KDD 2008 Conference

• Text Mining

• Data Integration

• Social Networks

• Graph Mining

• Distance Functions and Metric Learning

• Active and Semi-supervised Learning

• Pattern Mining

• Collaborative Filtering



Some Hot Topics

• Social Networks

THE hot topic of KDD 08

topic of the only panel

• Graph mining

• Text mining

and information extraction / integration

• Collaborative Filtering

more general, recommender systems

$1M NetFlix prize


Overview of this Course

Prerequisites

Foundations of database systems and statistics

Introductory graduate data mining course or equivalent

Objectives

• Introduction into some hot topics of data mining research

• Training in research methodology

• Presentation skills

start thesis work after this class!



Topics

• Graph mining

social network analysis and analysis of biological networks

as driving applications

• Recommender systems

in particular trust-based recommendation

• Information extraction and integration

integration with existing databases



Format

• Tutorial surveys

by instructor

• Written research paper reviews

by students

• Research paper presentations

by students

discussions in class

• Course research projects

by students

on a topic of their choice



Tentative Grading Scheme

• Paper review (20 %)

• Paper presentation (20 %)

• Course project report (40%)

two steps:

project proposal, final project report

• Course project presentation (20 %)

marking criteria:

originality, technical quality, presentation



Types of Course Projects

• Literature survey

summarize the state-of-the-art and identify open research problems

• New problem

introduce and analyze a new problem

• New algorithm for known problem

implement and evaluate algorithm

• Improvement of existing algorithm

implement and compare algorithm

• Comparison of existing algorithms on a new, interesting dataset

identify criteria for choice of algorithms / open research problems


Graph Mining

Motivating Applications

• Social network analysis

o What communities exist?

o How does information about a new product spread?

o What customers should be targeted to maximize the profit of a marketing campaign?

• Analysis of biological networks

o What are the functional modules of an organism?

o How do biological networks evolve in the course of time?

o What protein should be targeted to inhibit some virulent bacteria?


Graph Mining

Methods

• Frequent subgraph mining

frequent pattern mining approach

• Graph clustering

e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters relative to the size of the clusters

• Graph generative models

probabilistic models that generate graphs similar to

real graphs / networks
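
The normalized-cut criterion mentioned under graph clustering can be sketched as follows. The sketch assumes the common two-way definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of node degrees, on an unweighted, undirected graph. The function names and the toy graph are hypothetical, and the exhaustive search is only meant for tiny graphs; practical methods use approximations such as spectral relaxation.

```python
from itertools import combinations


def normalized_cut(edges, part_a):
    """Ncut of the bipartition (part_a, remaining nodes) of an undirected graph."""
    part_a = set(part_a)
    cut = vol_a = vol_b = 0
    for u, v in edges:
        # Every edge contributes 1 to the degree of each endpoint.
        for node in (u, v):
            if node in part_a:
                vol_a += 1
            else:
                vol_b += 1
        # An edge is cut if its endpoints lie on different sides.
        if (u in part_a) != (v in part_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        return float("inf")
    return cut / vol_a + cut / vol_b


def best_bipartition(edges):
    """Exhaustive search over all bipartitions (exponential; toy graphs only)."""
    nodes = sorted({n for e in edges for n in e})
    anchor, rest = nodes[0], nodes[1:]
    best_part, best_score = None, float("inf")
    # Fix the first node on side A so each bipartition is visited once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            part_a = {anchor, *extra}
            score = normalized_cut(edges, part_a)
            if score < best_score:
                best_part, best_score = part_a, score
    return best_part, best_score


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part, score = best_bipartition(edges)
print(part, score)
```

Splitting the toy graph at the bridge edge cuts one edge between two sides of volume 7 each, which gives the smallest score; an unbalanced split such as {0, 1} cuts more edges and is penalized further by the small volume of one side.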


Graph Mining

Challenges

• Complexity of graph algorithms

o Many graph mining problems are NP-hard.

o Real graphs tend to be extremely large.

need efficient algorithms

• Attribute data

o Many graphs have attributes associated with the nodes.

o Transformation into a weighted graph loses a lot of information.

need new models / algorithms considering relationship and attribute data


Recommender Systems


• Motivation

o The internet provides a flood of information on all kinds of items.

o There is a great need for personalized recommendations.

o The internet also provides a wealth of item ratings / reviews.

• Typical applications

o Movie recommendation

o Product recommendation

o Keyword recommendation


Recommender Systems

Methods

• Collaborative filtering

o Uses only a database of user-item ratings.

o Recommendation based on ratings by users with similar rating patterns.

• Content-based recommender systems

o Uses information about the content of items and / or the properties of users.

o Recommends items that have content similar to items liked by user.

• Trust-based recommender systems

o Assume a social network / trust network. Trust can be defined explicitly or implicitly.

o Recommendation based on ratings by trusted neighbors.
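
As a rough illustration of two of the methods above, the sketch below predicts a missing rating (a) with memory-based collaborative filtering, as a similarity-weighted average of other users' ratings, and (b) with a very simple trust-based variant that averages the ratings of directly trusted neighbors. The cosine similarity, the function names, the users and the ratings are all assumptions chosen for the example; real systems differ in similarity measure, normalization and trust propagation.

```python
from math import sqrt


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_cf(ratings, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den > 0 else None


def predict_trust(ratings, trust, user, item):
    """Plain average of the ratings given to `item` by directly trusted users."""
    values = [ratings[n][item] for n in trust.get(user, ())
              if item in ratings.get(n, {})]
    return sum(values) / len(values) if values else None


# Hypothetical user-item ratings (1-5 scale) and explicit trust statements.
ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
trust = {"alice": ["carol"]}

cf_prediction = predict_cf(ratings, "alice", "m3")
trust_prediction = predict_trust(ratings, trust, "alice", "m3")
print(cf_prediction, trust_prediction)
```

The collaborative-filtering prediction is pulled towards bob's rating because his rating pattern matches alice's, whereas the trust-based prediction follows carol, the only user alice explicitly trusts. A user with no ratings in common with anyone (a cold start user) gets no collaborative-filtering prediction but can still get a trust-based one.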


Recommender Systems

Challenges

• High dimensionality and sparsity of data

o The overwhelming majority (> 99%) of user-item ratings are unknown.

o Recommendation especially hard for cold start users and controversial items.

dimensionality reduction, model based methods, trust-based approach

• Fraud

o Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.

trust-based approach more robust to fraud

• Privacy issues with trust network data

o Only very few trust networks are in the public domain.


Information Extraction and Integration


• Importance of unstructured text data

o The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.

• Biomedical literature

o Contains a wealth of valuable information that cannot be processed / searched automatically.

o Extraction of entities and relationships such as proteins and their localizations.

• Online product reviews

o A lot of product "reviews" available online in community databases or blogs.

o Companies want to know what customers think of their products.


Information Extraction and Integration

Methods

• Basic NLP methods

o Part-of-speech tagging

o Lexica, ontologies, . . .

• Machine learning methods

o Typically, supervised classification.

o CRFs and similar methods are state-of-the-art.

• Bootstrapping approach

o Using a small labeled training dataset, find textual extraction patterns.

o Using these patterns, extract further entities / relationships and continue.
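
The bootstrapping loop described above can be sketched in a few lines. This is a deliberately simplified version under stated assumptions: an "extraction pattern" is just the literal text between the two entities of a known pair, entities are single words, and the sentences, protein names and the `bootstrap` function are invented for illustration.

```python
import re


def bootstrap(sentences, seed_pairs, rounds=2):
    """Alternate between learning patterns from known pairs and applying them."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Step 1: learn patterns as the text between the entities of known pairs.
        for sentence in sentences:
            for first, second in pairs:
                match = re.search(
                    re.escape(first) + r"(.+?)" + re.escape(second), sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply every pattern to extract further (entity, entity) pairs.
        for sentence in sentences:
            for middle in patterns:
                match = re.search(
                    r"(\w+)" + re.escape(middle) + r"(\w+)", sentence)
                if match:
                    pairs.add((match.group(1), match.group(2)))
    return pairs, patterns


# Hypothetical corpus and a single labeled (protein, localization) seed pair.
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the cytoplasm.",
    "ProtC is found in the membrane.",
    "ProtA is found in the nucleus.",
]
pairs, patterns = bootstrap(sentences, {("ProtA", "nucleus")})
print(sorted(pairs))
print(sorted(patterns))
```

Starting from one seed pair, the loop learns two textual patterns and uses them to extract the two pairs that were never labeled. In practice, patterns and extractions usually need to be scored and filtered, since a single wrong pattern can pull in wrong pairs that then generate further wrong patterns.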


Information Extraction and Integration

Challenges

• Text data is hard to understand

o Many of the NLP problems are still essentially unsolved.

relatively simple NLP methods often sufficient for information extraction

• Portability across domains

o Extraction methods need to be portable from one domain to another.

o Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.

machine learning methods

• Entity mentions need to be resolved

o Information extraction produces strings referencing an entity of a given type.

o Without mapping to known real world entities, extracted information is of limited usefulness.

need to integrate extracted information with existing databases


References

Graph mining

- X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08

- Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007

Recommender systems

- Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008

Information extraction and integration

- Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06

- AnHai Doan, Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006
