© Padhraic Smyth, Dec 2000: 1
What’s New in Data Mining?
Padhraic Smyth
Information and Computer Science
University of California, Irvine
© December 2000
Invited Talk at NonParametrics/Data Mining Workshop, SMU, Dallas
© Padhraic Smyth, Dec 2000: 2
Outline of Talk
• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining
• Conclusions
© Padhraic Smyth, Dec 2000: 3
Technological Driving Factors
• Larger, cheaper memory
– Moore’s law for magnetic disk density: “capacity doubles every 18 months” (Jim Gray, Microsoft)
– storage cost per byte falling rapidly
• Faster, cheaper processors
– the CRAY of 10 years ago is now on your desk
• Success of relational database technology
– everybody is a “data owner”
• Flexible modeling paradigms
– GLMs, trees, etc.
– computationally-intensive modeling, massive search
© Padhraic Smyth, Dec 2000: 4
The Emergence of Data Mining
• Distinct threads of evolution
– AI/machine learning
• 1989 KDD workshop -> ACM SIGKDD 2000
• focus on “automated discovery, novelty”
– Database research
• focus on massive data sets (since 1995)
• e.g., ACM SIGMOD -> association rules, scalable algorithms
– “Data owners”
• what can we do with all this data in commercial databases?
• primarily customer-oriented transaction data
• industry dominated, applications-oriented
© Padhraic Smyth, Dec 2000: 5
The Emergence of Data Mining
• The “mother-in-law phenomenon”
– even your mother-in-law has heard about data mining
– people are hoping they can do data analysis without the “nuisance factor” of statistics
• Beware of the hype!
– remember expert systems, neural nets, etc.
– basically sound ideas that were oversold, creating a backlash
© Padhraic Smyth, Dec 2000: 6
What is data mining?
© Padhraic Smyth, Dec 2000: 7
What is data mining?
“the art of fishing over alternative models ….”
M. C. Lovell, The Review of Economics and Statistics, February 1983
© Padhraic Smyth, Dec 2000: 8
What is data mining?
“Data-driven discovery of models and patterns from massive observational data sets”
© Padhraic Smyth, Dec 2000: 9
What is data mining?
“The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc”
© Padhraic Smyth, Dec 2000: 10
What is data mining?
“The magic phrase you use to sell your…
- database software
- statistical analysis software
- parallel computing hardware
- consulting services”
© Padhraic Smyth, Dec 2000: 11–14

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

[built up over four slides, one ingredient per slide:]
– Statistics, Inference
– Languages and Representations
– Engineering, Data Management
– Retrospective Analysis
© Padhraic Smyth, Dec 2000: 15
Who is involved in Data Mining?
• Business applications
– customer-oriented, transaction-oriented applications
– very specific applications in fraud, ecommerce, credit-scoring
• in-house applications (e.g., AT&T, Microsoft, Amazon, etc)
• consulting firms: considerable hype factor!
– largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”)
• Academic researchers
– mainly in computer science
– extensions of existing ideas, significant “bandwagon effect”
– database-oriented: “what can we compute quickly?”
• Bottom line:
– primarily computer scientists, often with little knowledge of statistics; main focus is on algorithms
© Padhraic Smyth, Dec 2000: 16
Current Data Mining Software Toolkits

1. General purpose tools
– software systems for data mining (IBM, SGI, etc)
• just simple statistical algorithms with SQL?
• limited support for statistical inference, temporal, spatial data
• also: “born-again” statistical software packages
– some successes (difficult to validate)
• banking, marketing, retail
• mainly useful for large-scale EDA?
– “mining the miners” (Jerry Friedman):
• similar to expert systems/neural networks hype in the 80’s?
© Padhraic Smyth, Dec 2000: 17
Transaction Data and Association Rules
• Supermarket example (Srikant and Agrawal, 1997)
– #items = 500,000; #transactions = 1.5 million
[figure: sparse items × transactions matrix; each x marks an item bought in a transaction]
© Padhraic Smyth, Dec 2000: 18
Transaction Data and Association Rules
• Example of an association rule: “If a customer buys beer they will also buy chips”
– p(chips|beer) = “confidence”
– p(beer) = “support”
• Algorithm: basically a fast way to compute correlations
[figure: sparse items × transactions matrix, as on the previous slide]
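The support and confidence defined on this slide reduce to simple counting over the transaction matrix. A minimal sketch, using the slide’s definitions (support = p(antecedent), confidence = p(consequent | antecedent)) on an invented toy basket list:

```python
# Toy computation of support/confidence for a rule such as "beer -> chips".
# Follows the slide's definitions: support = p(beer), confidence = p(chips|beer).

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_ante / n
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"},
    {"beer"}, {"milk", "bread"}, {"chips"},
]
s, c = rule_stats(baskets, "beer", "chips")
# support p(beer) = 3/5, confidence p(chips|beer) = 2/3
```

Association-rule algorithms such as Apriori are essentially fast ways to do this counting for all rules above given support/confidence thresholds.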
© Padhraic Smyth, Dec 2000: 19
Current Data Mining Software
2. Special purpose (“niche”) applications
– fraud detection, ecommerce profiling, credit-scoring, etc.
– often solve high-dimensional classification/regression problems
– Fraud detection
• telecom (AT&T), credit cards (HNC)
– Profiling -> Advertising
• profile: “histogram” of products/terms
• Engage: database of 70 million internet user profiles
– common theme: “track the customer!”
– difficult to validate claims of success (few publications)
© Padhraic Smyth, Dec 2000: 20
General Characteristics of Data Mining Applications
• Emphasis on predictive modeling
– scoring, classification, detection
• Massive data sets
– significant “data engineering” component
– variable selection, “feature definition”
– offline: computational issues in model fitting
– online: real-time response (e.g., e-commerce)
• “Scaling up” traditional ideas
– e.g., wide use of CART (decision trees)
– often modified to handle large-scale issues
© Padhraic Smyth, Dec 2000: 21–23

Myths and Legends in Data Mining

• “Data analysis can be fully automated”
– human judgement is critical in almost all applications
– “semi-automation” is, however, very useful
• “Association rules are useful”
– association rules are essentially lists of correlations
– few or no documented successful applications
– compare with decision trees (numerous applications)
• “With massive data sets you don’t need statistics”
– massiveness can bring more heterogeneity and noise
• so you need even more statistics!
© Padhraic Smyth, Dec 2000: 24
Outline
• What is Data Mining?
• Computer Science and Statistics: the Interface
© Padhraic Smyth, Dec 2000: 25–33

Historical Perspective

[timeline figure, built up over nine slides; rows 1950–2000, columns for Statistics and Computer Science/Engineering]
– Statistics: EDA, Trees, MARS, Flexible Predictors
– Computer Science/Engineering: AI, Statistical Pattern Recognition, ML: Trees/Rules, Neural Networks, KDD, DB/OLAP
– the two threads converge in Data Mining by 2000
© Padhraic Smyth, Dec 2000: 34
Observations
• Significant synergy/convergence of CS and Statistics emerged from neural networks
– flexible prediction models = “super offspring”
– role of NIPS, Snowbird meetings, etc
• Data Mining/KDD is still back where neural nets were 10 years ago
– DM: “our stuff is cool and we don’t really need statistics, do we?”
– Statistics: “what are these guys talking about, and why don’t they know some basic statistics?”
– Nonetheless… the DM folks have some very interesting applications and some interesting approaches
© Padhraic Smyth, Dec 2000: 35
Where Work is Published

[spectrum from Statistics to Computer Science:]
– Statistical Inference: JASA, JRSS
– Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
– Neural Networks: NIPS, Neural Computation
– Machine Learning: ICML, COLT, ML Journal, UAI, www.jmlr.org
– Data Mining: KDD, IJDMKD
– Databases: SIGMOD, VLDB
© Padhraic Smyth, Dec 2000: 36–39

The Predictive Modeling Cycle

[figure, repeated with different emphasis on four slides: Modeling/Inference, Computation/Algorithms, Evaluation/Interpretation]
– The Computer Scientist’s View: emphasis on Computation, Algorithms
– A Statistician’s View: emphasis on Modeling, Inference
– The Customer’s View: emphasis on Evaluation, Interpretation
© Padhraic Smyth, Dec 2000: 40–42

Educational Differences

• Computer Scientists:
– undergraduate exposure to statistics
• cookbook hypothesis tests
– little or no exposure to mathematical modeling
– good at algorithms, data structures
• Statisticians:
– undergraduate exposure to CS
• how to write Fortran code
– little or no exposure to data structures/algorithms
– how do they learn the “art” of data analysis?
• Bottom line
– need a new breed of “data engineers”
– note: easier to go from statistics to CS than vice versa
© Padhraic Smyth, Dec 2000: 43–45

Cultural Differences

• Computer Scientists:
– little exposure to the “modeling art” of data analysis
– stick to a small set of well-understood models and problems
– “close to the data”: they often have ready access to data
– business-oriented culture
• Statisticians:
– applied statisticians are often very good at the “art” component
– little experience with the data management/engineering part
– papers focus on inference/models, not algorithms
– science-oriented culture
• Bottom line
– computer scientists get more attention since they are much more marketing-savvy (less worried about objectivity) than statisticians
© Padhraic Smyth, Dec 2000: 46
[diagram: Modeling, Computation, Evaluation]
© Padhraic Smyth, Dec 2000: 47
Components of a data mining algorithm (the figure groups them under “Modeling” and “Algorithm”):
– Task
– Data set
– Representation
– Objective function
– Optimization
– Data access
– Evaluation and deployment
© Padhraic Smyth, Dec 2000: 48
CART (emphasis on predictive power and flexibility of model)
– Task: Prediction
– Data set: Multivariate
– Representation: Hierarchical representation of piecewise-constant mapping
– Objective function: Cross-validation
– Optimization: Greedy search
– Data access: Flat file
– Evaluation and deployment: Accuracy and interpretability
© Padhraic Smyth, Dec 2000: 49
Association Rules (emphasis on computational efficiency and data access)
– Task: Exploratory
– Data set: Transaction data
– Representation: Sets of local rules / conditional probabilities
– Objective function: Thresholds on p
– Optimization: Systematic search
– Data access: Linear data scans
– Evaluation and deployment: ????
© Padhraic Smyth, Dec 2000: 50
The Reductionist Viewpoint
• General framework for modeling
– reduce problems to fundamental components
– think in terms of
• application first
• modeling second
• algorithm third
– ultimately the application should “drive” the algorithm
– allows systematic comparison and synthesis
• for work on synthesis, see Buntine et al., KDD ’99
– clarifies relative roles of statistics, databases, search, etc
– see Hand, Mannila, and Smyth, MIT Press, May(?) 2001
© Padhraic Smyth, Dec 2000: 51
Implications
• The “renaissance data miner” is skilled in:
– statistics: theories and principles of inference
– modeling: languages and representations for data
– optimization and search
– algorithm design and data management
• The educational problem
– is it necessary to know all these areas in depth?
– is it possible?
– do we need a new breed of professionals?
• The applications viewpoint:
– how does a scientist or business person keep up with all these developments?
– how can they choose the best approach for their problem?
© Padhraic Smyth, Dec 2000: 52
Outline
• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining
© Padhraic Smyth, Dec 2000: 53
Subspecies of Data Miners
• SIGMOD/VLDB conferences
– database issues: querying, efficiency; no modeling
– fast querying/association-rule algorithms
• SIGKDD conferences
– algorithm focus: scaling machine learning/statistics methods
– rule-finding algorithms
• Machine Learning conference
– algorithmic focus
– decision trees, reinforcement learning
• NIPS
– originally neural networks, but now mathematical/probabilistic learning; heavy statistical influence
– SVMs, boosting, Gaussian processes, latent variable models
• ICPR (pattern recognition), SIGIR, etc
– speech, images, classifiers, etc.: engineering applications
© Padhraic Smyth, Dec 2000: 54
Hot Topics, New Directions from Computer Science
• Flexible predictive modeling
– neural networks, boosting, SVMs
• Engineering of scale
– scaling up statistical methods to new large-scale applications
• Hidden/latent variable models
– wide-scale application of EM, e.g., HMMs for speech
• Pattern finding
– associations, rules, bumps: “non-global” patterns
• Heterogeneous data
– modeling structured data, e.g., Web, multimedia (video/audio)
© Padhraic Smyth, Dec 2000: 55
Flexible Predictive Modeling
• Model combining:
– Stacking
• linear combinations of models with cross-validated weights
– Bagging
• equally weighted models fit to bootstrap samples
– Boosting
• iterative re-training that upweights data points in error
• Flexible model forms
– decision trees, neural networks, support vector machines
• Common theme:
– many of these ideas were popularized in computer science
– later “legitimized” by statisticians (e.g., by Breiman, Friedman)
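The bagging recipe above can be sketched in a few lines: fit the same model to bootstrap resamples and average the equally weighted predictions. The base “model” here is just a sample mean, an invented stand-in for a tree or other learner:

```python
import random

# Sketch of bagging: train on bootstrap resamples, average the predictions.
# The base learner is deliberately trivial (the sample mean) so the
# procedure itself is the whole example.

def bagged_mean(data, n_models=100, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]   # bootstrap sample, same size
        preds.append(sum(boot) / len(boot))        # "fit" one model
    return sum(preds) / n_models                   # equally weighted average

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data with one outlier
estimate = bagged_mean(data)
```

With a high-variance base learner such as a deep decision tree, the same averaging is what reduces prediction variance.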
© Padhraic Smyth, Dec 2000: 56
Example of a Document-Term Matrix
[figure: pixel view of a sparse matrix, ~500 documents × 200 terms]
© Padhraic Smyth, Dec 2000: 57
Application: Flexible Classification Models for Text
• The Web represents a huge data set of text documents
– problem: classification of Web pages into “topic categories”
– e.g., automated creation of topic hierarchies for Yahoo
– automated crawlers for information gathering
• Technical challenges
– standard representation of a Web page?
• typically use “list of term vectors”
• very high-dimensional information
– other information: images, page structure, etc
• Current activity
– much data mining research on document classification:
• Web page -> high-dimensional term vector -> flexible classifier
– commercial companies: Whizbang, Autonomy, IBM, etc
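The “page -> term vector -> classifier” pipeline can be sketched end to end with a nearest-centroid classifier; the topic names and training documents below are invented, and real systems use far richer term weighting:

```python
from collections import Counter

# Minimal bag-of-words topic classifier: each document becomes a term
# vector, each topic a summed centroid, and a new page gets the topic
# whose centroid it overlaps most.

def term_vector(text):
    return Counter(text.lower().split())

def train_centroids(labeled_docs):
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids

def classify(text, centroids):
    vec = term_vector(text)
    def score(centroid):
        return sum(vec[t] * centroid[t] for t in vec)  # unnormalized dot product
    return max(centroids, key=lambda lbl: score(centroids[lbl]))

docs = [
    ("sports", "game team score win match"),
    ("finance", "stock market price shares trade"),
]
centroids = train_centroids(docs)
label = classify("the team won the game", centroids)
# -> "sports"
```

The high dimensionality mentioned on the slide shows up here as the size of the combined vocabulary; flexible classifiers (SVMs, boosted trees) replace the dot-product scoring in practice.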
© Padhraic Smyth, Dec 2000: 58–60

2. Scale: How far away are the data?

CPU – RAM – Disk
Random access times: RAM ~10^-8 seconds, disk ~10^-3 seconds
Effective distances: RAM ~1 meter, disk ~100 km
© Padhraic Smyth, Dec 2000: 61–62

Basic Idea

[diagram: Massive Database (“slow” memory) -> Approximate Model of Data (“fast”/cached memory) -> Human or Algorithm]

Comments:
1. Even if the data fit in main memory, there are many advantages to clever data structures (e.g., see Andrew Moore’s talk).
2. Particularly relevant for massive streams of transaction data, e.g., telephone data (see Diane Lambert’s talk).
© Padhraic Smyth, Dec 2000: 63
2. Scalable Algorithms
• “Scaling down the data” or “data approximation”
– work from clever data summarizations (e.g., sufficient statistics)
– e.g., “data squashing” (DuMouchel et al., AT&T, KDD ’99)
• create a small “pseudo data set” with statistical properties similar to the original (massive) data set
• then run your standard algorithm on the pseudo-data
• interesting theoretical (statistical) basis
• “Scaling up the algorithm”
– data structures/caching strategies to speed up known algorithms
• ADTrees, etc., from Andrew Moore (CMU)
• scalable decision trees (Johannes Gehrke, Cornell)
– can get orders-of-magnitude speed improvements
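The “work from sufficient statistics” idea can be shown with the simplest case: per-chunk (count, sum, sum-of-squares) summaries are merged to recover the mean and variance exactly, so the raw data never needs to sit in memory at once. A toy sketch, not DuMouchel et al.’s squashing algorithm itself:

```python
# Sufficient statistics for mean/variance: summarize each chunk during a
# single linear scan, then merge the summaries. The answer matches what
# you would compute on the pooled data.

def summarize(chunk):
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(stats):
    n = sum(s[0] for s in stats)
    sx = sum(s[1] for s in stats)
    sxx = sum(s[2] for s in stats)
    mean = sx / n
    var = sxx / n - mean * mean   # population variance
    return mean, var

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]   # stand-ins for disk-sized blocks
mean, var = merge([summarize(c) for c in chunks])
# same answer as computing on the pooled data [1, 2, 3, 4, 5]
```

Data squashing generalizes this: instead of exact sufficient statistics, it builds a small weighted pseudo data set whose moments approximate the original’s, so arbitrary standard algorithms can run on it.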
© Padhraic Smyth, Dec 2000: 64
3. Pattern Finding
• Patterns = unusual, hard-to-find local “pockets” of data
– finding patterns is not the same as global model fitting
– the simplest example of patterns is association rules
– much other work on rule-finding in data mining/AI
– other applications:
• motif-finding in protein sequences
• unusual objects in sky-survey data
• “Bump-hunting”
– PRIM algorithm of Friedman and Fisher (1999)
– finds multivariate “boxes” in high-dimensional spaces where the mean of the target variable is higher
– trades off “support” with “mean height”
– effective and flexible
• e.g., finding small, highly profitable groups of customers
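The support-versus-mean-height trade-off can be illustrated with a one-dimensional toy version of box peeling: repeatedly trim a small fraction of points from whichever edge raises the target mean most, stopping when peeling no longer helps. This is only the idea, not Friedman and Fisher’s actual PRIM implementation:

```python
# Toy 1-D "box peeling": shrink an interval around the region where the
# target variable y has high mean, trading support for mean height.

def peel_1d(points, peel_frac=0.2, min_support=2):
    """points: list of (x, y) pairs; returns (lo, hi) bounds of the box."""
    box = sorted(points)                      # sort by x
    while len(box) > min_support:
        k = max(1, int(peel_frac * len(box)))
        candidates = [box[k:], box[:-k]]      # peel low edge or high edge
        mean = lambda b: sum(y for _, y in b) / len(b)
        best = max(candidates, key=mean)
        if mean(best) <= mean(box):           # no improvement: stop peeling
            break
        box = best
    return box[0][0], box[-1][0]

# target is high only on the "bump" x in [4, 6]
pts = [(x, 1.0 if 4 <= x <= 6 else 0.0) for x in range(11)]
lo, hi = peel_1d(pts)
# -> box (4, 6)
```

PRIM does this in many dimensions at once, peeling along whichever variable’s edge improves the mean most, then “pasting” edges back to refine the box.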
© Padhraic Smyth, Dec 2000: 65–70

“Bump-Hunting”

[figure sequence repeated over six slides]
Pattern Finding (ctd.)
• Contrast sets (Bay and Pazzani, KDD ’99)
– individuals or objects categorized into 2 groups
• e.g., students enrolled in CS and in Engineering
– high-dimensional multivariate measurements on each
– automatically produces a summary of significant differences between groups
– combines massive search with statistical estimation
• Time-series pattern spotting
– “find me a shape that looks like this”
– semi-Markov deformable templates (Ge and Smyth, KDD 2000)
– significantly outperforms template matching and DTW
– Bayesian approach integrates prior knowledge with data
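The core of the contrast-set idea, stripped of the search machinery, is comparing attribute frequencies across the two groups and keeping the large differences. A rough sketch with invented student data; the real method searches over attribute conjunctions and applies significance corrections:

```python
# Minimal contrast-set sketch: report attributes whose within-group
# frequencies differ by more than a threshold between two groups.

def contrasts(group_a, group_b, min_diff=0.3):
    """group_a, group_b: lists of attribute sets; returns {attr: (p_a, p_b)}."""
    attrs = set().union(*group_a, *group_b)
    out = {}
    for a in attrs:
        p_a = sum(a in x for x in group_a) / len(group_a)
        p_b = sum(a in x for x in group_b) / len(group_b)
        if abs(p_a - p_b) >= min_diff:
            out[a] = (p_a, p_b)
    return out

cs_students = [{"math", "programming"}, {"programming"}, {"programming", "art"}]
eng_students = [{"math", "physics"}, {"math"}, {"physics"}]
diff = contrasts(cs_students, eng_students)
# e.g. "programming" appears for 100% of CS students and 0% of engineers
```

Bay and Pazzani’s contribution is doing this over exponentially many conjunctions of attributes while controlling the false-discovery problem that the massive search creates.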
© Padhraic Smyth, Dec 2000: 72
Example: Deformable Templates
• Segmental hidden semi-Markov model (Ge and Smyth, KDD 2000)
• Each waveform segment corresponds to a state in the model
[figure: waveform divided into segments, mapped to states S1, S2, …, ST]
© Padhraic Smyth, Dec 2000: 73
Pattern-Based End-Point Detection
[figure: original pattern (top) and detected pattern (bottom) over 0–400 seconds]
End-Point Detection in Semiconductor Manufacturing
© Padhraic Smyth, Dec 2000: 74
Heterogeneous Data Modeling
• Clustering objects (sequences, curves, etc)
– probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)
– unified framework for clustering objects of different dimensions
– applications:
• curve clustering
– e.g., mixtures of regression models (Gaffney and Smyth, KDD ’99)
– video movement, gene expression data, storm trajectories
• sequence clustering
– e.g., mixtures of Markov models
– clustering of MSNBC Web data (Cadez et al., KDD ’00)
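The building block of the mixtures-of-Markov-models approach is the maximum-likelihood transition matrix for a set of symbol sequences; fitting several such matrices with EM, one per cluster, gives the mixture. Only the single-component estimate is sketched here, on invented sequences:

```python
from collections import defaultdict

# Estimate first-order Markov transition probabilities from sequences:
# P(b | a) = count(a followed by b) / count(a followed by anything).

def transition_probs(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive symbol pairs
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs

P = transition_probs(["AAB", "ABB", "AAA"])
# A is followed by A 3 times and by B 2 times -> P(A|A) = 0.6
```

In the mixture version, EM soft-assigns each whole sequence to a cluster and re-estimates one transition matrix per cluster from the weighted counts.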
© Padhraic Smyth, Dec 2000: 75
[figure: trajectories of centroids of a moving hand in video streams (position vs. time), with two estimated cluster trajectories]
© Padhraic Smyth, Dec 2000: 76
4. (Un)structured Data (e.g., Text, Web)

• Applications
– classification of text documents
• automatic classification of emails as junk/non-junk
• automatic creation of taxonomies for Web page portals such as Yahoo
– discovery of authoritative documents
• search engines (Yahoo)
• citation rankings
• Techniques
– “vector-space” model
– adaptations of simple classification and clustering algorithms
– graph-based techniques
• Challenges for statistics
– scale of problem: huge documents, huge Web
– structure, semantics of documents, Web
© Padhraic Smyth, Dec 2000: 77
Document Clustering
• Techniques
– model-based mixture clustering
– mixtures of multinomials
– mixtures of conditional-independence models
• Connections to statistics
– probabilistic models are well known (latent class models)
– EM algorithm for training
• Differences from statistics:
– scale and nature of applications
– e.g., Hofmann (Brown U.), “probabilistic PCA”
– e.g., Lafferty (CMU), maxent models for text prediction
© Padhraic Smyth, Dec 2000: 78
Example of a Document-Term Matrix
[figure: pixel view of the ~500 × 200 document-term matrix, repeated]
© Padhraic Smyth, Dec 2000: 79
Example of a Document Cluster

Most likely terms in component 5 (weight = 0.08):
TERM      p(t|k)
write     0.571
drive     0.465
problem   0.369
mail      0.364
articl    0.332
hard      0.323
work      0.319
system    0.303
good      0.296
time      0.273

Highest-lift terms in component 5 (weight = 0.08):
TERM      LIFT   p(t|k)  p(t)
scsi      7.7    0.13    0.02
drive     5.7    0.47    0.08
hard      4.9    0.32    0.07
card      4.2    0.23    0.06
format    4.0    0.12    0.03
softwar   3.8    0.21    0.05
memori    3.6    0.14    0.04
install   3.6    0.14    0.04
disk      3.5    0.12    0.03
engin     3.3    0.21    0.06
© Padhraic Smyth, Dec 2000: 80
Example of a Document Cluster

Most likely terms in component 1 (weight = 0.11):
TERM      p(t|k)
articl    0.684
good      0.368
dai       0.363
fact      0.322
god       0.320
claim     0.294
apr       0.279
fbi       0.256
christian 0.256
group     0.239

Highest-lift terms in component 1 (weight = 0.11):
TERM      LIFT   p(t|k)  p(t)
fbi       8.3    0.26    0.03
jesu      5.5    0.16    0.03
fire      5.2    0.20    0.04
christian 4.9    0.26    0.05
evid      4.8    0.24    0.05
god       4.6    0.32    0.07
gun       4.2    0.17    0.04
faith     4.2    0.12    0.03
kill      3.8    0.22    0.06
bibl      3.7    0.11    0.03
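The lift column in these cluster summaries is p(term | cluster) / p(term): how much more likely a term is inside the cluster than in the corpus overall. A hedged re-computation on an invented two-cluster toy corpus (the real tables come from a fitted mixture model, not hard counts):

```python
from collections import Counter

# Compute lift(term, cluster) = p(term | cluster) / p(term) from raw
# token counts, using hard cluster assignments for simplicity.

def term_lift(clusters):
    """clusters: {name: list of token lists} -> {(name, term): lift}."""
    overall, per_cluster, total = Counter(), {}, 0
    for name, docs in clusters.items():
        c = Counter()
        for doc in docs:
            c.update(doc)
        per_cluster[name] = c
        overall.update(c)
        total += sum(c.values())
    lifts = {}
    for name, c in per_cluster.items():
        n = sum(c.values())
        for t, cnt in c.items():
            lifts[(name, t)] = (cnt / n) / (overall[t] / total)
    return lifts

clusters = {
    "hardware": [["scsi", "drive"], ["drive", "disk"]],
    "talk": [["god", "claim"], ["claim", "fbi"]],
}
L = term_lift(clusters)
# "drive" is twice as frequent in the hardware cluster as overall -> lift 2.0
```

High-lift terms (like “scsi” above) characterize a cluster even when common words dominate the most-likely-terms list.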
© Padhraic Smyth, Dec 2000: 81
Pixel Representation of Mixture Components
[figure: pixel view of 10 mixture-component models over ~100 terms]
© Padhraic Smyth, Dec 2000: 82
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
…

Sessions encoded as page-category sequences, one per user (Users 1–5):
5115
11111151511151
77777777
111333
3333131113332232
…
Example: Web Log Mining
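The preprocessing this slide illustrates, raw server-log lines reduced to a page-category symbol sequence per user, can be sketched as follows. Field positions follow the IIS-style log shown above; the category map is invented for illustration:

```python
# Fold raw server-log lines into per-user sequences of page-category
# symbols, the input format for sequence clustering.

# Hypothetical page -> category-symbol map (for illustration only).
category = {"/top.html": "1", "/spt/main.html": "5", "/spt/images/bk1.jpg": "7"}

def user_sequences(log_lines):
    seqs = {}
    for line in log_lines:
        fields = [f.strip() for f in line.split(",")]
        user, page = fields[0], fields[13]     # client IP and requested URL
        seqs.setdefault(user, []).append(category.get(page, "?"))
    return {u: "".join(s) for u, s in seqs.items()}

log = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
]
seqs = user_sequences(log)
# -> {"128.195.36.195": "15"}
```

These categorical sequences are then clustered, e.g., with the mixtures of Markov models mentioned on the previous slides.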
© Padhraic Smyth, Dec 2000: 83
Clusters of Dynamic Behavior

[figure: three clusters, each drawn as a transition graph over page categories A, B, C, D]
© Padhraic Smyth, Dec 2000: 84
© Padhraic Smyth, Dec 2000: 85

WebCanvas: Cadez, Heckerman, et al., KDD 2000
© Padhraic Smyth, Dec 2000: 86
Application: Product Purchasing Data
[figure: transactions × product categories matrix, ~350 transactions by 50 categories]
© Padhraic Smyth, Dec 2000: 87
Application: Recommender Systems
[figure: sparse products × transactions matrix; a new customer’s row has two known purchases (x) and many unknowns (?)]
© Padhraic Smyth, Dec 2000: 88
Application: Recommender Systems
[figure repeated from the previous slide]
– high-dimensional inference/prediction problem
– sparse data
– recommendations must be made in real time!
© Padhraic Smyth, Dec 2000: 89
Approaches to Recommender Systems
• Collaborative filtering
– “infer your interests from people with similar behavior”
– essentially a nearest-neighbor algorithm
– considerable commercial interest (e.g., NetPerceptions, Firefly)
– scalability problems
• Model-based recommenders
– model the joint distribution of the products explicitly
– example: dependency networks from Microsoft Research
• decision-tree models/MRFs
• extremely fast, shipping in Microsoft products
• Heckerman et al. (2000), Journal of Machine Learning Research, www.jmlr.org
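The nearest-neighbor character of collaborative filtering can be sketched directly: score unseen products by votes from the most similar users. The similarity measure (Jaccard) and the toy users are invented; real systems use rating correlations and heavy indexing, which is exactly where the scalability problems arise:

```python
# Minimal user-based collaborative filtering: find the k users most
# similar to the target (by Jaccard overlap of purchase sets) and
# recommend the highest-voted product the target has not bought.

def recommend(target, others, k=1):
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    neighbours = sorted(others, key=lambda u: jaccard(target, u), reverse=True)[:k]
    votes = {}
    for u in neighbours:
        for p in u - target:                   # products the target lacks
            votes[p] = votes.get(p, 0.0) + jaccard(target, u)
    return max(votes, key=votes.get) if votes else None

users = [{"beer", "chips", "salsa"}, {"milk", "bread"}, {"beer", "wine"}]
pick = recommend({"beer", "chips"}, users)
# nearest user also bought salsa -> recommend "salsa"
```

The naive version scans all users per query; the model-based recommenders on this slide trade that scan for an explicit fitted model, which is why they can answer in real time.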
© Padhraic Smyth, Dec 2000: 90
Final Comments
• Successful data mining requires integration/understanding of
– statistics
– computer science
– the application discipline
• Current practice of data mining
– computer scientists focused on business applications
– relatively little statistical sophistication, but some new ideas
– considerable “hype” factor
• Opportunities for statisticians
– new problems: e.g., statistical scalability
– new applications: e.g., inference from Web and text data
– ready audience for statistical techniques
• need better marketing!
© Padhraic Smyth, Dec 2000: 91
Pointers
• Papers:
– www.ics.uci.edu/~datalab
– e.g., “Data mining: data analysis on a grand scale?”, P. Smyth (2000), Statistical Methods in Medical Research
• Web resources:
– www.kdnuggets.com
• Interface ’01
– data mining and bioinformatics themes
– June 13–16, 2001, Costa Mesa, CA
• Text (forthcoming)
– Principles of Data Mining
• D. J. Hand, H. Mannila, P. Smyth
• MIT Press, May 2001?