
Page 1: What's New in Data Mining?

© Padhraic Smyth, Dec 2000: 1

What’s New in Data Mining?

Padhraic Smyth
Information and Computer Science

University of California, Irvine

© December 2000

Invited Talk at NonParametrics/Data Mining Workshop, SMU, Dallas

Page 2: What's New in Data Mining?


Outline of Talk

• What is Data Mining?

• Computer Science and Statistics: the Interface

• Hot Topics in Data Mining

• Conclusions

Page 3: What's New in Data Mining?


Technological Driving Factors

• Larger, cheaper memory
  – Moore’s law for magnetic disk density: “capacity doubles every 18 months” (Jim Gray, Microsoft)
  – storage cost per byte falling rapidly

• Faster, cheaper processors
  – the CRAY of 10 years ago is now on your desk

• Success of relational database technology
  – everybody is a “data owner”

• Flexible modeling paradigms
  – GLMs, trees, etc.
  – computationally intensive modeling, massive search

Page 4: What's New in Data Mining?


The Emergence of Data Mining

• Distinct threads of evolution
  – AI/machine learning
    • 1989 KDD workshop -> ACM SIGKDD 2000
    • focus on “automated discovery, novelty”
  – Database research
    • focus on massive data sets (since 1995)
    • e.g., ACM SIGMOD -> association rules, scalable algorithms
  – “Data owners”
    • what can we do with all this data in commercial databases?
    • primarily customer-oriented transaction data
    • industry dominated, applications-oriented

Page 5: What's New in Data Mining?


The Emergence of Data Mining

• The “mother-in-law phenomenon”
  – even your mother-in-law has heard about data mining
  – people are hoping they can do data analysis without the “nuisance factor” of statistics

• Beware of the hype!
  – remember expert systems, neural nets, etc.
  – basically sound ideas that were oversold, creating a backlash

Page 6: What's New in Data Mining?


What is data mining?

Page 7: What's New in Data Mining?


What is data mining?

“the art of fishing over alternative models ….”

M. C. Lovell, The Review of Economics and Statistics, February 1983

Page 8: What's New in Data Mining?


What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Page 9: What's New in Data Mining?


What is data mining?

“The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc”

Page 10: What's New in Data Mining?


What is data mining?

“The magic phrase you use to sell your…
  - database software
  - statistical analysis software
  - parallel computing hardware
  - consulting services”

Page 14: What's New in Data Mining?


What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

Statistics, Inference

Languages and Representations

Engineering, Data Management

Retrospective Analysis

Page 15: What's New in Data Mining?


Who is involved in Data Mining?

• Business applications
  – customer-oriented, transaction-oriented applications
  – very specific applications in fraud, ecommerce, credit-scoring
    • in-house applications (e.g., AT&T, Microsoft, Amazon, etc.)
    • consulting firms: considerable hype factor!
  – largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”)

• Academic researchers
  – mainly in computer science
  – extensions of existing ideas, significant “bandwagon effect”
  – database-oriented: “what can we compute quickly?”

• Bottom line:
  – primarily computer scientists, often with little knowledge of statistics; main focus is on algorithms

Page 16: What's New in Data Mining?


Current Data Mining Software Toolkits

1. General-purpose tools

  – software systems for data mining (IBM, SGI, etc.)
    • just simple statistical algorithms with SQL?
    • limited support for statistical inference, temporal, spatial data
    • also: “born-again” statistical software packages
  – some successes (difficult to validate)
    • banking, marketing, retail
    • mainly useful for large-scale EDA?
  – “mining the miners” (Jerry Friedman):
    • similar to expert systems/neural networks hype in the 80’s?

Page 17: What's New in Data Mining?


Transaction Data and Association Rules

• Supermarket example (Srikant and Agrawal, 1997)
  – #items = 500,000; #transactions = 1.5 million

[figure: transactions (rows) × items (columns) matrix; an x marks an item purchased in a transaction]

Page 18: What's New in Data Mining?


Transaction Data and Association Rules

• Example of an association rule: if a customer buys beer they will also buy chips
  – p(chips | beer) = “confidence”
  – p(beer) = “support”

• Algorithm: basically a fast way to compute correlations

[figure: transactions (rows) × items (columns) matrix; an x marks an item purchased in a transaction]
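The support and confidence quantities above can be computed directly. A minimal sketch in Python, with a hypothetical four-basket data set standing in for the 1.5 million transactions:

```python
# Hypothetical mini-basket data; real applications involve millions of transactions.
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "salsa"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate p(consequent | antecedent) from the data."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"beer"}, transactions))                # 0.75
print(confidence({"beer"}, {"chips"}, transactions))  # 2/3
```

A real association-rule miner (e.g., Apriori) enumerates only itemsets whose support exceeds a threshold, using one linear scan of the data per itemset size; the two functions above are the statistics it reports.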

Page 19: What's New in Data Mining?


Current Data Mining Software

2. Special-purpose (“niche”) applications

  - fraud detection, ecommerce profiling, credit-scoring, etc.
  - often solve high-dimensional classification/regression problems
  - fraud detection: telecom (AT&T), credit cards (HNC)
  - profiling -> advertising
    • profile: “histogram” of products/terms
    • Engage: database of 70 million internet user profiles
  - common theme: “track the customer!”
  - difficult to validate claims of success (few publications)

Page 20: What's New in Data Mining?


General Characteristics of Data Mining Applications

• Emphasis on predictive modeling
  – scoring, classification, detection

• Massive data sets
  – significant “data engineering” component
  – variable selection, “feature definition”
  – offline: computational issues in model fitting
  – online: real-time response (e.g., e-commerce)

• “Scaling up” traditional ideas
  – e.g., wide use of CART (decision trees)
  – often modified to handle large-scale issues

Page 23: What's New in Data Mining?


Myths and Legends in Data Mining

• “Data analysis can be fully automated”
  – human judgement is critical in almost all applications
  – “semi-automation” is however very useful

• “Association rules are useful”
  – association rules are essentially lists of correlations
  – none or few documented successful applications
  – compare with decision trees (numerous applications)

• “With massive data sets you don’t need statistics”
  – massiveness can bring more heterogeneity and noise
    • even more need for statistics!

Page 24: What's New in Data Mining?


Outline

• What is Data Mining?

• Computer Science and Statistics: the Interface

Page 33: What's New in Data Mining?


Historical Perspective

Statistics | Computer Science/Engineering: a rough timeline, 1950–2000

  Statistics: Statistical Pattern Recognition -> EDA -> Trees -> MARS -> Flexible Predictors
  Computer Science/Engineering: AI -> ML: Trees/Rules -> Neural Networks -> KDD; DB -> OLAP
  2000: Data Mining (the two threads converge)

Page 34: What's New in Data Mining?


Observations

• Significant synergy/convergence of CS and Statistics emerged from neural networks
  – flexible prediction models = “super offspring”
  – role of NIPS, Snowbird meetings, etc.

• Data Mining/KDD is still back where neural nets were 10 years ago
  – DM: “our stuff is cool and we don’t really need statistics - do we?”
  – Statistics: “what are these guys talking about and why don’t they know some basic statistics?”
  – Nonetheless… the DM folks have some very interesting applications and some interesting approaches

Page 35: What's New in Data Mining?


Where Work is Published (from Statistics to Computer Science)

  Statistical Inference: JASA, JRSS
  Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
  Neural Networks: NIPS, Neural Computation
  Machine Learning: ICML, COLT, Machine Learning Journal, UAI, www.jmlr.org
  Data Mining: KDD, IJDMKD
  Databases: SIGMOD, VLDB

Page 36: What's New in Data Mining?


The Predictive Modeling Cycle: Modeling/Inference – Computation/Algorithms – Evaluation/Interpretation

Page 37: What's New in Data Mining?


The Computer Scientist’s View: the same cycle, with the emphasis on Computation/Algorithms

Page 38: What's New in Data Mining?


A Statistician’s View: the same cycle, with the emphasis on Modeling/Inference

Page 39: What's New in Data Mining?


The Customer’s View: the same cycle, with the emphasis on Evaluation/Interpretation

Page 42: What's New in Data Mining?


Educational Differences

• Computer scientists:
  – undergraduate exposure to statistics
    • cookbook hypothesis tests
  – little or no exposure to mathematical modeling
  – good at algorithms, data structures

• Statisticians:
  – undergraduate exposure to CS
    • how to write Fortran code
  – little or no exposure to data structures/algorithms
  – how to learn the “art” of data analysis?

• Bottom line
  – need a new breed of “data engineers”
  – note: easier to go from statistics to CS than vice versa

Page 45: What's New in Data Mining?


Cultural Differences

• Computer scientists:
  – little exposure to the “modeling art” of data analysis
  – stick to a small set of well-understood models and problems
  – “close to the data”: they often have ready access to data
  – business-oriented culture

• Statisticians:
  – applied statisticians often very good at the “art” component
  – little experience with the data management/engineering part
  – papers focus on inference/models, not algorithms
  – science-oriented culture

• Bottom line
  – computer scientists get more attention since they are much more marketing-savvy (less worried about objectivity) than statisticians

Page 46: What's New in Data Mining?


Modeling – Computation – Evaluation

Page 47: What's New in Data Mining?


Task | Data Set
  Modeling: Representation, Objective Function
  Algorithm: Optimization, Data Access
Evaluation and Deployment

Page 48: What's New in Data Mining?


CART (emphasis on predictive power and flexibility of model)

  Task: Prediction
  Data Set: Multivariate
  Representation (Modeling): hierarchical representation of piecewise-constant mapping
  Objective Function (Modeling): Cross-Validation
  Optimization (Algorithm): Greedy Search
  Data Access (Algorithm): Flat File
  Evaluation: Accuracy and Interpretability

Page 49: What's New in Data Mining?


Association Rules (emphasis on computational efficiency and data access)

  Task: Exploratory
  Data Set: Transaction Data
  Representation (Modeling): sets of local rules / conditional probabilities
  Objective Function (Modeling): thresholds on p
  Optimization (Algorithm): Systematic Search
  Data Access (Algorithm): Linear Data Scans
  Evaluation: ????

Page 50: What's New in Data Mining?


The Reductionist Viewpoint

• General framework for modeling
  – reduce problems to fundamental components
  – think in terms of:
    • application first
    • modeling second
    • algorithm third
  – ultimately the application should “drive” the algorithm
  – allows systematic comparison and synthesis
    • for work on synthesis, see Buntine et al., KDD 99
  – clarifies the relative roles of statistics, databases, search, etc.
  – see Hand, Mannila, and Smyth, MIT Press, May(?) 2001

Page 51: What's New in Data Mining?


Implications

• The “renaissance data miner” is skilled in:
  – statistics: theories and principles of inference
  – modeling: languages and representations for data
  – optimization and search
  – algorithm design and data management

• The educational problem
  – is it necessary to know all these areas in depth?
  – is it possible?
  – do we need a new breed of professionals?

• The applications viewpoint:
  – how does a scientist or business person keep up with all these developments?
  – how can they choose the best approach for their problem?

Page 52: What's New in Data Mining?


Outline

• What is Data Mining?

• Computer Science and Statistics: the Interface

• Hot Topics in Data Mining

Page 53: What's New in Data Mining?


Subspecies of Data Miners

• SIGMOD/VLDB conferences
  – database issues: querying, efficiency; no modeling
  – fast querying/association-rule algorithms

• SIGKDD conferences
  – algorithm focus: scaling machine learning/statistics methods
  – rule-finding algorithms

• Machine Learning conference
  – algorithmic focus
  – decision trees, reinforcement learning

• NIPS
  – originally neural networks, but now mathematical/probabilistic learning: heavy statistical influence
  – SVMs, boosting, Gaussian processes, latent variable models

• ICPR (Pattern Recognition), SIGIR, etc.
  – speech, images, classifiers, etc.: engineering applications

Page 54: What's New in Data Mining?


Hot Topics, New Directions from Computer Science

• Flexible predictive modeling
  – neural networks, boosting, SVMs

• Engineering of scale
  – scaling up statistical methods to new large-scale applications

• Hidden/latent variable models
  – wide-scale application of EM, e.g., HMMs for speech

• Pattern finding
  – associations, rules, bumps: “non-global” patterns

• Heterogeneous data
  – modeling structured data, e.g., Web, multimedia (video/audio)

Page 55: What's New in Data Mining?


Flexible Predictive Modeling

• Model combining:
  – Stacking
    • linear combinations of models with X-validated weights
  – Bagging
    • equally weighted models from bootstrap samples
  – Boosting
    • iterative re-training on data points in error

• Flexible model forms
  – decision trees, neural networks, support vector machines

• Common theme:
  – many of these ideas were popularized in computer science
  – later “legitimized” by statisticians (e.g., by Breiman, Friedman)
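As an illustration of the bagging idea above (equally weighted models, each fit to a bootstrap sample), here is a minimal sketch: the one-split “stump” base learner and the synthetic 1-D data are invented for the example, not any particular system from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data: a step function plus noise.
x = rng.uniform(0, 1, 200)
y = (x > 0.5).astype(float) + rng.normal(0, 0.1, 200)

def fit_stump(x, y):
    """Fit a one-split 'stump': pick the threshold minimizing squared error."""
    best = (np.inf, 0.5, y.mean(), y.mean())
    for t in np.linspace(0.05, 0.95, 19):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda q, t=t, lo=lo, hi=hi: np.where(q <= t, lo, hi)

# Bagging: fit one stump per bootstrap sample, then average their predictions.
models = []
for _ in range(25):
    idx = rng.integers(0, len(x), len(x))   # bootstrap resample with replacement
    models.append(fit_stump(x[idx], y[idx]))

def bagged_predict(q):
    return np.mean([m(q) for m in models], axis=0)
```

Averaging the bootstrap-trained stumps smooths out the variance of any single split, which is the whole point of bagging an unstable base learner.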

Page 56: What's New in Data Mining?



Example of a Document-Term Matrix

Page 57: What's New in Data Mining?


Application: Flexible Classification Models for Text

• The Web represents a huge data set of text documents
  – problem: classification of Web pages into “topic categories”
  – e.g., automated creation of topic hierarchies for Yahoo
  – automated crawlers for information gathering

• Technical challenges
  – standard representation of a Web page?
    • typically use “list of term vectors”
    • very high-dimensional information
  – other information: images, page structure, etc.

• Current activity
  – much research in data mining on document classification:
    • Web page -> high-d term vector -> flexible classifier
  – commercial companies: Whizbang, Autonomy, IBM, etc.
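The “Web page -> high-d term vector” step can be illustrated with a toy document-term count matrix; the three-document corpus below is invented, and real systems add stemming, stop-word removal, and sparse storage.

```python
import numpy as np

# Hypothetical miniature corpus; real Web applications use vocabularies of 10^5+ terms.
docs = ["cheap disk drive for sale",
        "new graphics card and disk",
        "faith and scripture discussion"]

# Build the vocabulary and a term -> column index map.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}

# The term-vector representation: one row of term counts per page.
X = np.zeros((len(docs), len(vocab)), int)
for i, d in enumerate(docs):
    for w in d.split():
        X[i, index[w]] += 1
```

Each row of `X` is the term vector for one page; a flexible classifier (naive Bayes, SVM, tree ensemble, etc.) is then trained on these rows.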

Page 59: What's New in Data Mining?


2. Scale: How far away are the data?

CPU – RAM – Disk

Random access times: RAM ~10^-8 seconds; disk ~10^-3 seconds

Page 60: What's New in Data Mining?


2. Scale: How far away are the data?

CPU – RAM – Disk

Effective distances: RAM ~1 meter; disk ~100 km

Page 62: What's New in Data Mining?


Basic Idea

Massive Database (“slow” memory) -> Approximate Model of Data (“fast”/cached memory) -> Human or Algorithm

Comments:
1. Even if the data fits in main memory there are many advantages to clever data structures (e.g., see Andrew Moore’s talk).
2. Particularly relevant for massive streams of transaction data, e.g., telephone data (see Diane Lambert’s talk).

Page 63: What's New in Data Mining?


2. Scalable Algorithms

• “Scaling down the data” or “data approximation”
  – work from clever data summarizations (e.g., sufficient statistics)
  – e.g., “data squashing” (DuMouchel et al., AT&T, KDD ‘99)
    • create a small “pseudo data set”
    • similar statistical properties to the original (massive) data set
    • now run your standard algorithm on the pseudo-data
    • interesting theoretical (statistical) basis

• “Scaling up the algorithm”
  – data structures/caching strategies to speed up known algorithms
    • ADTrees, etc., from Andrew Moore (CMU)
    • scalable decision trees (Johannes Gehrke, Cornell)
  – can get orders-of-magnitude speed improvements
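A much cruder stand-in for data squashing conveys the idea: represent the massive data set by a few weighted centroids and hand those to the standard algorithm. This is k-means-style summarization, not the moment-matching construction of DuMouchel et al.; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0, 1, (10000, 2))   # stand-in for a massive data set

def squash(data, k=50, iters=10):
    """Crude 'squashing': k weighted centroids serve as the pseudo data set."""
    centers = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            pts = data[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    weights = np.bincount(labels, minlength=k)  # pseudo-point weights
    return centers, weights

pseudo, w = squash(data)
# A weighted algorithm run on (pseudo, w) sees 50 rows instead of 10,000,
# while weighted moments such as the grand mean are preserved.
```

Any learning procedure that accepts case weights can then be run on the 50 pseudo-points instead of the full data.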

Page 64: What's New in Data Mining?


3. Pattern Finding

• Patterns = unusual, hard-to-find local “pockets” of data
  – finding patterns is not the same as global model fitting
  – the simplest example of patterns are association rules
  – much other work on rule-finding in data mining/AI
  – other applications:
    • motif-finding in protein sequences
    • unusual objects in sky-survey data

• “Bump-hunting”
  – PRIM algorithm of Friedman and Fisher (1999)
  – finds multivariate “boxes” in high-dimensional spaces where the mean of the target variable is higher
  – trades off “support” with “mean height”
  – effective and flexible
    • e.g., finding small, highly profitable groups of customers
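The peeling phase of PRIM can be sketched as follows. This is a simplified reading of Friedman and Fisher’s algorithm (no pasting step, fixed peeling fraction), run on synthetic data with a planted bump in one corner:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with a "bump": the target mean is high inside a small corner box.
X = rng.uniform(0, 1, (2000, 2))
y = np.where((X[:, 0] > 0.7) & (X[:, 1] > 0.7), 1.0, 0.0) + rng.normal(0, 0.1, 2000)

def prim_peel(X, y, alpha=0.05, min_support=0.05):
    """Top-down peeling: repeatedly shave the alpha-fraction edge of the box
    that most raises the mean of y, trading support for 'mean height'."""
    box = np.array([[X[:, j].min(), X[:, j].max()] for j in range(X.shape[1])])
    inside = np.ones(len(X), bool)
    while inside.mean() > min_support:
        best_mean, best_move = -np.inf, None
        for j in range(X.shape[1]):
            lo = np.quantile(X[inside, j], alpha)
            hi = np.quantile(X[inside, j], 1 - alpha)
            for bound, val in ((0, lo), (1, hi)):
                trial = inside & ((X[:, j] >= val) if bound == 0 else (X[:, j] <= val))
                if trial.sum() and y[trial].mean() > best_mean:
                    best_mean, best_move = y[trial].mean(), (j, bound, val, trial)
        if best_move is None or best_mean <= y[inside].mean():
            break   # no shave improves the in-box mean: stop peeling
        j, bound, val, inside = best_move
        box[j, bound] = val
    return box, y[inside].mean(), inside.mean()

box, mean_in_box, frac = prim_peel(X, y)
# The peeled box should home in on the high-mean corner.
```

The full algorithm adds a bottom-up “pasting” pass and then removes the found box to hunt for further bumps; the peel loop above is the core of the support/mean-height trade-off.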

Page 65: What's New in Data Mining?


“Bump-Hunting”


Page 71: What's New in Data Mining?


Pattern Finding (ctd.)

• Contrast sets (Bay and Pazzani, KDD ‘99)
  – individuals or objects categorized into 2 groups
    • e.g., students enrolled in CS and in Engineering
  – high-dimensional multivariate measurements on each
  – automatically produces a summary of significant differences between groups
  – combines massive search with statistical estimation

• Time-series pattern spotting
  – “find me a shape that looks like this”
  – semi-Markov deformable templates (Ge and Smyth, KDD 2000)
  – significantly outperforms template matching and DTW
  – Bayesian approach integrates prior knowledge with data

Page 72: What's New in Data Mining?


Example: Deformable Templates

• Segmental hidden semi-Markov model (Ge and Smyth, KDD 2000)
• Each waveform segment corresponds to a state in the model

[figure: waveform partitioned into segments, each mapped to a state S1, S2, …, ST]

Page 73: What's New in Data Mining?


Pattern-Based End-Point Detection

[figure: two time series over time (seconds): “Original Pattern” (top) and “Detected Pattern” (bottom)]

End-Point Detection in Semiconductor Manufacturing

Page 74: What's New in Data Mining?


Heterogeneous Data Modeling

• Clustering objects (sequences, curves, etc.)
  – probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)
  – unified framework for clustering objects of different dimensions
  – applications:
    • curve clustering
      – e.g., mixture of regression models (Gaffney and Smyth, KDD ‘99)
      – video movement, gene expression data, storm trajectories
    • sequence clustering
      – e.g., mixtures of Markov models
      – clustering of MSNBC Web data (Cadez et al., KDD ‘00)

Page 75: What's New in Data Mining?


[figures: “Trajectories of centroids of moving hand in video streams” (position vs. time) and two “Estimated cluster trajectory” panels]

Page 76: What's New in Data Mining?


4. (Un)structured Data (e.g., Text, Web)

• Applications
  – classification of text documents
    • automatic classification of emails as junk/non-junk
    • automatic creation of taxonomies for Web page portals such as Yahoo
  – discovery of authoritative documents
    • search engines (Yahoo)
    • citation rankings

• Techniques
  – “vector-space” model
  – adaptations of simple classification and clustering algorithms
  – graph-based techniques

• Challenges for statistics
  – scale of problem: huge documents, huge Web
  – structure, semantics of documents, Web

Page 77: What's New in Data Mining?


Document Clustering

• Techniques
  – model-based mixture clustering
  – mixtures of multinomials
  – mixtures of conditional-independence models

• Connections to statistics
  – probabilistic models are well known (latent class models)
  – EM algorithm for training

• Differences from statistics:
  – scale and nature of applications
  – e.g., Hofmann (Brown U.), “probabilistic PCA”
  – e.g., Lafferty (CMU), maxent models for text prediction
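The mixture-of-multinomials model trained with EM, in miniature: the two-topic corpus below is simulated, and a real application would use a vocabulary of thousands of stemmed terms rather than six.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate 300 documents of 50 words each from two well-separated "topics".
true_theta = np.array([[.40, .40, .10, .05, .03, .02],
                       [.02, .03, .05, .10, .40, .40]])
z = rng.integers(0, 2, 300)                      # hidden topic of each document
docs = np.vstack([rng.multinomial(50, true_theta[k]) for k in z])

def em_multinomial_mixture(docs, K=2, iters=50):
    """EM for a mixture of multinomials (the latent class model on the slide)."""
    n, V = docs.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    theta = rng.dirichlet(np.ones(V), K)         # per-component term probabilities
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each document
        log_r = np.log(pi) + docs @ np.log(theta).T
        log_r -= log_r.max(1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate mixing weights and term probabilities
        pi = r.mean(0)
        theta = r.T @ docs + 1e-9
        theta /= theta.sum(1, keepdims=True)
    return pi, theta, r

pi, theta, r = em_multinomial_mixture(docs)
clusters = r.argmax(1)   # hard cluster assignment per document
```

With well-separated topics the recovered clusters match the simulated labels up to a label permutation, which is the usual EM identifiability caveat.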

Page 78: What's New in Data Mining?



Example of a Document-Term Matrix

Page 79: What's New in Data Mining?


Example of a Document Cluster

Most Likely Terms in Component 5 (weight = 0.08):

  TERM     p(t|k)
  write    0.571
  drive    0.465
  problem  0.369
  mail     0.364
  articl   0.332
  hard     0.323
  work     0.319
  system   0.303
  good     0.296
  time     0.273

Highest Lift Terms in Component 5 (weight = 0.08):

  TERM     LIFT  p(t|k)  p(t)
  scsi     7.7   0.13    0.02
  drive    5.7   0.47    0.08
  hard     4.9   0.32    0.07
  card     4.2   0.23    0.06
  format   4.0   0.12    0.03
  softwar  3.8   0.21    0.05
  memori   3.6   0.14    0.04
  install  3.6   0.14    0.04
  disk     3.5   0.12    0.03
  engin    3.3   0.21    0.06

Page 80: What's New in Data Mining?


Example of a Document Cluster

Most Likely Terms in Component 1 (weight = 0.11):

  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239

Highest Lift Terms in Component 1 (weight = 0.11):

  TERM       LIFT  p(t|k)  p(t)
  fbi        8.3   0.26    0.03
  jesu       5.5   0.16    0.03
  fire       5.2   0.20    0.04
  christian  4.9   0.26    0.05
  evid       4.8   0.24    0.05
  god        4.6   0.32    0.07
  gun        4.2   0.17    0.04
  faith      4.2   0.12    0.03
  kill       3.8   0.22    0.06
  bibl       3.7   0.11    0.03

Page 81: What's New in Data Mining?


Pixel Representation of Mixture Components

[figure: pixel image of the 10 component models (rows) over 100 terms (columns)]

Page 82: What's New in Data Mining?


128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

Sessions encoded as sequences of page categories:

  User 1: 3 3 3 3 1 3 1 1 1 3 3 3 2 2 3 2
  User 2: 1 1 1 3 3 3
  User 3: 7 7 7 7 7 7 7 7
  User 4: 1 1 1 1 1 1 5 1 5 1 1 1 5 1
  User 5: 5 1 1 5

Example: Web Log Mining

Page 83: What's New in Data Mining?


Clusters of Dynamic Behavior

[figure: three clusters of dynamic behavior, each shown as a Markov transition graph over page categories A, B, C, D]

Page 84: What's New in Data Mining?


Page 85: What's New in Data Mining?

WebCanvas: Cadez, Heckerman, et al., KDD 2000

Page 86: What's New in Data Mining?


Application: Product Purchasing Data

[figure: transactions (rows) × product categories (columns) matrix]

Page 88: What's New in Data Mining?


Application: Recommender Systems

[figure: transactions (rows) × products (columns) matrix; the row for a new customer shows two purchases (x) and many unknowns (?)]

– high-dimensional inference/prediction problem
– sparse data
– recommendations must be in real time!

Page 89: What's New in Data Mining?


Approaches to Recommender Systems

• Collaborative filtering
  – “infer your interests from people with similar behavior”
  – essentially a nearest-neighbor algorithm
  – considerable commercial interest (e.g., NetPerceptions, Firefly)
  – scalability problems

• Model-based recommenders
  – model the joint distribution of the products explicitly
  – example:
    • dependency networks from Microsoft Research
    • decision-tree models/MRFs
    • extremely fast, shipping in Microsoft products
    • Heckerman et al. (2000), Journal of Machine Learning Research, www.jmlr.org
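A nearest-neighbor collaborative filter of the kind described above fits in a few lines; the tiny purchase matrix is hypothetical, and real systems must handle millions of sparse rows, which is exactly the scalability problem noted on the slide.

```python
import numpy as np

# Hypothetical purchase matrix: rows = customers, columns = products (1 = bought).
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
], float)

def recommend(new_row, R, k=2):
    """Score unseen products by the purchases of the k customers most
    similar (cosine similarity) to the new customer; return the top product."""
    sims = (R @ new_row) / (np.linalg.norm(R, axis=1) * np.linalg.norm(new_row) + 1e-12)
    neighbors = np.argsort(sims)[-k:]      # indices of the k nearest customers
    scores = R[neighbors].mean(0)          # fraction of neighbors who bought each product
    scores[new_row > 0] = -1               # never re-recommend an owned product
    return int(scores.argmax())

new_customer = np.array([1, 1, 0, 0, 0], float)
print(recommend(new_customer, R))  # → 2
```

The model-based recommenders on the slide replace the neighbor lookup with a fitted joint model, which is what makes real-time scoring feasible at scale.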

Page 90: What's New in Data Mining?


Final Comments

• Successful data mining requires integration/understanding of
  – statistics
  – computer science
  – the application discipline

• Current practice of data mining
  – computer scientists focused on business applications
  – relatively little statistical sophistication, but some new ideas
  – considerable “hype” factor

• Opportunities for statisticians
  – new problems: e.g., statistical scalability
  – new applications: e.g., inference from Web and text data
  – ready audience for statistical techniques
    • need better marketing!

Page 91: What's New in Data Mining?


Pointers

• Papers:
  – www.ics.uci.edu/~datalab
  – e.g., “Data mining: data analysis on a grand scale?”, P. Smyth (2000), Statistical Methods in Medical Research

• Web resources:
  – www.kdnuggets.com

• Interface ‘01
  – data mining and bioinformatics themes
  – June 13-16, 2001, Costa Mesa, CA

• Text (forthcoming)
  – Principles of Data Mining
    • D. J. Hand, H. Mannila, P. Smyth
    • MIT Press, May 2001?