CMPT 884, SFU, Martin Ester, 1-09 1
Special Topics in Database Systems
Martin Ester
Simon Fraser University
School of Computing Science
CMPT 884
Spring 2009
Introduction
[Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.
Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.
Introduction
Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data
Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data
Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases
Introduction
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Database → Focussing → Pre-processing → Transformation → Data Mining → Pattern → Evaluation → Knowledge
KDD Process [Han & Kamber 2000]
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the
patterns contained in some database.
Data-Mining Tasks
[Figure: four panels illustrating clustering, classification, association rules (A and B → C) and generalisation]
other tasks: regression, outlier detection . . .
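To make the association rule task concrete, the following sketch computes support and confidence for a rule of the form A and B → C. The toy transactions and the helper names `support` and `confidence` are assumptions made for illustration only; they are not taken from the course material.

```python
from typing import FrozenSet, List

# Toy market-basket data (hypothetical): each transaction is a set of items.
transactions: List[FrozenSet[str]] = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C", "D"}),
    frozenset({"B", "C"}),
    frozenset({"A", "C"}),
]


def support(itemset: FrozenSet[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(body: FrozenSet[str], head: FrozenSet[str],
               db: List[FrozenSet[str]]) -> float:
    """Estimate of P(head | body): support(body and head) / support(body)."""
    return support(body | head, db) / support(body, db)


rule_body = frozenset({"A", "B"})
rule_head = frozenset({"C"})
rule_support = support(rule_body | rule_head, transactions)
rule_confidence = confidence(rule_body, rule_head, transactions)
print(f"A and B -> C: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

An association rule miner would typically report only rules whose support and confidence exceed user-given thresholds.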
Trends in KDD Research
KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process
Trends in KDD Research
KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications
Trends in KDD Research
KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, . . .)
Trends in KDD Research
KDD 2006 Conference
• Clustering
• Classification / supervised ML
• Privacy
• Web / Graph Mining
• Web / Text Mining
• Frequent Pattern Mining
• Structured Data
Trends in KDD Research
KDD 2008 Conference
• Text Mining
• Data Integration
• Social Networks
• Graph Mining
• Distance Functions and Metric Learning
• Active and Semi-supervised Learning
• Pattern Mining
• Collaborative Filtering
Trends in KDD Research
Some Hot Topics
• Social Networks
THE hot topic of KDD 08
topic of the only panel
• Graph mining
• Text mining
and information extraction / integration
• Collaborative Filtering
more general, recommender systems
$1M Netflix Prize
Overview of this Course
Prerequisites
Foundations of database systems and statistics
Introductory graduate data mining course or equivalent
Objectives
• Introduction to some hot topics of data mining research
• Training in research methodology
• Presentation skills
start thesis work after this class!
Overview of this Course
Topics
• Graph mining
social network analysis and analysis of biological networks
as driving applications
• Recommender systems
in particular trust-based recommendation
• Information extraction and integration
integration with existing databases
Overview of this Course
Format
• Tutorial surveys
by instructor
• Written research paper reviews
by students
• Research paper presentations
by students
discussions in class
• Course research projects
by students
on a topic of their choice
Overview of this Course
Tentative Grading Scheme
• Paper review (20 %)
• Paper presentation (20 %)
• Course project report (40 %)
two steps:
project proposal, final project report
• Course project presentation (20 %)
marking criteria:
originality, technical quality, presentation
Overview of this Course
Types of Course Projects
• Literature survey
summarize the state-of-the-art and identify open research problems
• New problem
introduce and analyze a new problem
• New algorithm for known problem
implement and evaluate algorithm
• Improvement of existing algorithm
implement and compare algorithm
• Comparison of existing algorithms on a new, interesting dataset
identify criteria for choice of algorithms / open research problems
Graph Mining
Motivating Applications
• Social network analysis
o What communities exist?
o How does information about a new product spread?
o What customers should be targeted to maximize the profit of a marketing campaign?
• Analysis of biological networks
o What are the functional modules of an organism?
o How do biological networks evolve in the course of time?
o What protein should be targeted to inhibit some virulent bacteria?
Graph Mining
Methods
• Frequent subgraph mining
frequent pattern mining approach
• Graph clustering
e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters relative to the size of the components
• Graph generative models
probabilistic models that generate graphs similar to
real graphs / networks
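The normalized cut criterion named under graph clustering can be sketched as follows. For a two-way partition (A, B) of an unweighted, undirected graph, the standard definition is Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where cut counts the edges between A and B and vol sums the node degrees of a part. The example graph and the function names are assumptions for illustration; the sketch only evaluates given partitions and does not search for the best one.

```python
from typing import Dict, Set

# Hypothetical undirected graph as adjacency sets:
# two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
graph: Dict[int, Set[int]] = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}


def cut_size(g: Dict[int, Set[int]], part_a: Set[int]) -> int:
    """Number of edges with exactly one endpoint in part_a."""
    return sum(1 for u in part_a for v in g[u] if v not in part_a)


def volume(g: Dict[int, Set[int]], part: Set[int]) -> int:
    """Sum of the degrees of the nodes in `part`."""
    return sum(len(g[u]) for u in part)


def normalized_cut(g: Dict[int, Set[int]], part_a: Set[int]) -> float:
    """Ncut of the partition (part_a, remaining nodes)."""
    part_b = set(g) - part_a
    vol_a, vol_b = volume(g, part_a), volume(g, part_b)
    if vol_a == 0 or vol_b == 0:
        raise ValueError("both parts must contain at least one edge endpoint")
    cut = cut_size(g, part_a)
    return cut / vol_a + cut / vol_b


good = normalized_cut(graph, {0, 1, 2})  # splits along the bridge edge 2-3
bad = normalized_cut(graph, {0, 1, 3})   # cuts through both triangles
print(good, bad)
```

The partition along the bridge edge has the lower value, which matches the intuition that clusters should be densely connected inside and sparsely connected to each other.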
Graph Mining
Challenges
• Complexity of graph algorithms
o Many graph mining problems are NP-hard.
o Real graphs tend to be extremely large.
need efficient algorithms
• Attribute data
o Many graphs have attributes associated with the nodes.
o Transformation into weighted graph loses a lot of information.
need new models / algorithms considering relationship and attribute data
Recommender Systems
Motivating Applications
• Motivation
o The internet provides a flood of information on all kinds of items.
o There is a great need for personalized recommendations.
o The internet also provides a wealth of item ratings / reviews.
• Typical applications
o Movie recommendation
o Product recommendation
o Keyword recommendation
Recommender Systems
Methods
• Collaborative filtering
o Uses only a database of user – item ratings.
o Recommendation based on ratings by users with similar rating patterns.
• Content-based recommender systems
o Uses information about the content of items and / or the properties of users.
o Recommends items that have content similar to items liked by user.
• Trust-based recommender systems
o Assume a social network / trust network. Trust can be defined explicitly or implicitly.
o Recommendation based on ratings by trusted neighbors.
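As a minimal sketch of the collaborative filtering idea above, the following code predicts a missing rating from the ratings of users with similar rating patterns. It assumes a user-based, memory-based variant with cosine similarity over co-rated items; the rating data and all names are hypothetical, and other similarity measures (e.g., Pearson correlation) are equally common.

```python
import math
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]

# Hypothetical user-item ratings on a 1-5 scale; missing entries are unknown.
ratings: Ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}


def cosine_similarity(r: Ratings, u: str, v: str) -> float:
    """Cosine similarity of two users over the items both have rated."""
    common = set(r[u]) & set(r[v])
    if not common:
        return 0.0
    dot = sum(r[u][i] * r[v][i] for i in common)
    norm_u = math.sqrt(sum(r[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(r[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(r: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in r.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(r, user, other)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    # None signals that no other user has rated the item.
    return weighted_sum / weight_total if weight_total > 0 else None


prediction = predict(ratings, "alice", "m4")
print(prediction)
```

Because alice's ratings resemble bob's much more than carol's, the prediction for m4 is pulled towards bob's high rating.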
Recommender Systems
Challenges
• High dimensionality and sparsity of data
o The overwhelming majority (> 99%) of user-item ratings are unknown.
o Recommendation especially hard for cold start users and controversial items.
dimensionality reduction, model-based methods, trust-based approach
• Fraud
o Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.
trust-based approach more robust to fraud
• Privacy issues with trust network data
o Only very few trust networks are in the public domain.
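The robustness of the trust-based approach to fraudulent ratings can be illustrated with a small sketch. It assumes explicit trust weights and uses only the ratings of directly trusted neighbours; the data, the weights and the function name are hypothetical, and trust propagation over longer paths is left out.

```python
from typing import Dict, Optional

# Hypothetical data: explicit trust weights in [0, 1] and ratings on a 1-5 scale.
trust: Dict[str, Dict[str, float]] = {
    "alice": {"bob": 1.0, "carol": 0.5},
}
ratings: Dict[str, Dict[str, float]] = {
    "bob": {"item": 4.0},
    "carol": {"item": 2.0},
    # Fraudulent profiles that nobody trusts.
    "shill1": {"item": 5.0},
    "shill2": {"item": 5.0},
}


def trust_based_prediction(user: str, item: str) -> Optional[float]:
    """Trust-weighted average of the ratings given by directly trusted neighbours."""
    weighted_sum = 0.0
    weight_total = 0.0
    for neighbour, weight in trust.get(user, {}).items():
        rating = ratings.get(neighbour, {}).get(item)
        if rating is not None:
            weighted_sum += weight * rating
            weight_total += weight
    # None signals that no trusted neighbour has rated the item.
    return weighted_sum / weight_total if weight_total > 0 else None


prediction = trust_based_prediction("alice", "item")
print(prediction)
```

The fraudulent profiles have no influence on the prediction because no trust edge leads to them, whereas a similarity-based method would consider any profile that mimics the target user's ratings.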
Information Extraction and Integration
Motivating Applications
• Importance of unstructured text data
o The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.
• Biomedical literature
o Contains a wealth of valuable information that cannot be processed / searched automatically.
o Extraction of entities and relationships such as proteins and their localizations.
• Online product reviews
o A lot of product "reviews" available online in community databases or blogs.
o Companies want to know what customers think of their products.
Information Extraction and Integration
Methods
• Basic NLP methods
o Part-of-speech tagging
o Lexica, ontologies, . . .
• Machine learning methods
o Typically, supervised classification.
o Conditional random fields (CRFs) and similar methods are state-of-the-art.
• Bootstrapping approach
o Using a small labeled training dataset, find textual extraction patterns.
o Using these patterns, extract further entities / relationships and continue.
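The bootstrapping loop described above can be sketched as follows. The sketch assumes that an extraction pattern is simply the literal text between two entities and that entities are single words; the mini-corpus, the seed pair and all function names are hypothetical.

```python
import re
from typing import List, Set, Tuple

Pair = Tuple[str, str]

# Hypothetical mini-corpus and seed set of (protein, localization) pairs.
corpus: List[str] = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the cytoplasm.",
    "ProtC was found in the membrane.",
    "ProtB was found in the cytoplasm.",
]
seeds: Set[Pair] = {("ProtA", "nucleus")}


def learn_patterns(sentences: List[str], pairs: Set[Pair]) -> Set[str]:
    """Collect the text between the two entities of every known pair."""
    patterns: Set[str] = set()
    for sentence in sentences:
        for first, second in pairs:
            m = re.search(re.escape(first) + r"(.+?)" + re.escape(second), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(sentences: List[str], patterns: Set[str]) -> Set[Pair]:
    """Extract a word pair wherever a learned pattern occurs between two words."""
    found: Set[Pair] = set()
    for sentence in sentences:
        for pattern in patterns:
            m = re.search(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence)
            if m:
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(sentences: List[str], seed_pairs: Set[Pair], max_rounds: int = 5) -> Set[Pair]:
    """Alternate pattern learning and extraction until no new pair is found."""
    known = set(seed_pairs)
    for _ in range(max_rounds):
        new_pairs = apply_patterns(sentences, learn_patterns(sentences, known)) - known
        if not new_pairs:
            break
        known |= new_pairs
    return known


extracted = bootstrap(corpus, seeds)
print(sorted(extracted))
```

In this toy run the seed pair yields the pattern " is localized in the ", which extracts the ProtB pair; that pair then yields " was found in the ", which extracts the ProtC pair. The sketch accepts every pattern, so a single bad pattern would propagate errors in later rounds.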
Information Extraction and Integration
Challenges
• Text data is hard to understand
o Many of the NLP problems are still essentially unsolved.
relatively simple NLP methods often sufficient for information extraction
• Portability across domains
o Extraction methods need to be portable from one domain to another.
o Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.
machine learning methods
• Entity mentions need to be resolved
o Information extraction produces strings referencing an entity of a given type.
o Without mapping to known real world entities, extracted information is of limited usefulness.
need to integrate extracted information with existing databases
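Resolving entity mentions against an existing database can be sketched as a lookup of normalized strings with a fuzzy fallback. The reference entries, the identifiers, the similarity cutoff and the function names are assumptions for illustration; the fuzzy step uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical reference database: canonical ID -> known names / synonyms.
protein_db: Dict[str, List[str]] = {
    "P001": ["tumor protein p53", "p53", "TP53"],
    "P002": ["insulin", "INS"],
}


def normalize(mention: str) -> str:
    """Lower-case, treat hyphens as spaces and collapse whitespace."""
    return " ".join(mention.lower().replace("-", " ").split())


# Index from normalized name to canonical ID.
name_index: Dict[str, str] = {
    normalize(name): entity_id
    for entity_id, names in protein_db.items()
    for name in names
}


def resolve(mention: str, cutoff: float = 0.85) -> Optional[str]:
    """Map an extracted mention string to a known entity ID, or None."""
    key = normalize(mention)
    if key in name_index:
        return name_index[key]
    # Fall back to the closest known name above the similarity cutoff.
    close = difflib.get_close_matches(key, list(name_index), n=1, cutoff=cutoff)
    return name_index[close[0]] if close else None


for mention in ["TP53", "Tumor-protein P53", "insuline", "hemoglobin"]:
    print(mention, "->", resolve(mention))
```

Mentions that cannot be mapped return None, which makes explicit which extracted strings are not yet linked to a known real-world entity.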
References
Graph mining
- X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 2008
- Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007
Recommender systems
- Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
Information extraction and integration
- Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 2006
- AnHai Doan, Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006