© Padhraic Smyth, Dec 2000: 1
What’s New in Data Mining?
Padhraic Smyth
Information and Computer Science
University of California, Irvine
© December 2000
Invited Talk at NonParametrics/Data Mining Workshop, SMU, Dallas
© Padhraic Smyth, Dec 2000: 2
Outline of Talk
• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining
• Conclusions
© Padhraic Smyth, Dec 2000: 3
Technological Driving Factors
• Larger, cheaper memory
– Moore’s law for magnetic disk density: “capacity doubles every 18 months” (Jim Gray, Microsoft)
– storage cost per byte falling rapidly
• Faster, cheaper processors
– the CRAY of 10 years ago is now on your desk
• Success of relational database technology
– everybody is a “data owner”
• Flexible modeling paradigms
– GLMs, trees, etc.
– computationally-intensive modeling, massive search
© Padhraic Smyth, Dec 2000: 4
The Emergence of Data Mining
• Distinct threads of evolution
– AI/machine learning
• 1989 KDD workshop -> ACM SIGKDD 2000
• focus on “automated discovery, novelty”
– Database research
• focus on massive data sets (since 1995)
• e.g., ACM SIGMOD -> association rules, scalable algorithms
– “Data owners”
• what can we do with all this data in commercial databases?
• primarily customer-oriented transaction data
• industry dominated, applications-oriented
© Padhraic Smyth, Dec 2000: 5
The Emergence of Data Mining
• The “mother-in-law phenomenon”
– even your mother-in-law has heard about data mining
– people are hoping they can do data analysis without the “nuisance factor” of statistics
• Beware of the hype!
– remember expert systems, neural nets, etc.
– basically sound ideas that were oversold, creating a backlash
© Padhraic Smyth, Dec 2000: 6
What is data mining?
© Padhraic Smyth, Dec 2000: 7
What is data mining?
“the art of fishing over alternative models ….”
M. C. Lovell, The Review of Economics and Statistics, February 1983
© Padhraic Smyth, Dec 2000: 8
What is data mining?
“Data-driven discovery of models and patterns from massive observational data sets”
© Padhraic Smyth, Dec 2000: 9
What is data mining?
“The magic phrase to put in every funding proposal you write to NSF, DARPA, NASA, etc”
© Padhraic Smyth, Dec 2000: 10
What is data mining?
“The magic phrase you use to sell your…
- database software
- statistical analysis software
- parallel computing hardware
- consulting services”
© Padhraic Smyth, Dec 2000: 11–14

What is data mining?

“Data-driven discovery of models and patterns from massive observational data sets”

[built up over four slides, one ingredient per slide:]
– Statistics, Inference
– Languages and Representations
– Engineering, Data Management
– Retrospective Analysis
© Padhraic Smyth, Dec 2000: 15
Who is involved in Data Mining?
• Business applications
– customer-oriented, transaction-oriented applications
– very specific applications in fraud, ecommerce, credit-scoring
• in-house applications (e.g., AT&T, Microsoft, Amazon, etc)
• consulting firms: considerable hype factor!
– largely involve the application of existing statistical ideas, scaled up to massive data sets (“engineering”)
• Academic researchers
– mainly in computer science
– extensions of existing ideas, significant “bandwagon effect”
– database-oriented: “what can we compute quickly?”
• Bottom line:
– primarily computer scientists, often with little knowledge of statistics; main focus is on algorithms
© Padhraic Smyth, Dec 2000: 16
Current Data Mining Software Toolkits

1. General purpose tools
– software systems for data mining (IBM, SGI, etc)
• just simple statistical algorithms with SQL?
• limited support for statistical inference, temporal, spatial data
• also: “born-again” statistical software packages
– some successes (difficult to validate)
• banking, marketing, retail
• mainly useful for large-scale EDA?
– “mining the miners” (Jerry Friedman):
• similar to expert systems/neural networks hype in the 80’s?
© Padhraic Smyth, Dec 2000: 17
Transaction Data and Association Rules
• Supermarket example (Srikant and Agrawal, 1997)
– #items = 500,000; #transactions = 1.5 million
[figure: sparse items × transactions matrix; each x marks an item bought in a transaction]
© Padhraic Smyth, Dec 2000: 18
Transaction Data and Association Rules
• Example of an association rule: “If a customer buys beer they will also buy chips”
– p(chips|beer) = “confidence”
– p(beer) = “support”
• Algorithm: basically a fast way to compute correlations
[figure: sparse items × transactions matrix, as on the previous slide]
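The support and confidence defined on this slide reduce to simple counting over the transaction matrix. A minimal sketch, using the slide’s definitions (support = p(antecedent), confidence = p(consequent | antecedent)) on an invented toy basket list:

```python
# Toy computation of support/confidence for a rule such as "beer -> chips".
# Follows the slide's definitions: support = p(beer), confidence = p(chips|beer).

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_ante / n
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"},
    {"beer"}, {"milk", "bread"}, {"chips"},
]
s, c = rule_stats(baskets, "beer", "chips")
# support p(beer) = 3/5, confidence p(chips|beer) = 2/3
```

Association-rule algorithms such as Apriori are essentially fast ways to do this counting for all rules above given support/confidence thresholds.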
© Padhraic Smyth, Dec 2000: 19
Current Data Mining Software
2. Special purpose (“niche”) applications
– fraud detection, ecommerce profiling, credit-scoring, etc.
– often solve high-dimensional classification/regression problems
– Fraud detection
• telecom (AT&T), credit cards (HNC)
– Profiling -> Advertising
• profile: “histogram” of products/terms
• Engage: database of 70 million internet user profiles
– common theme: “track the customer!”
– difficult to validate claims of success (few publications)
© Padhraic Smyth, Dec 2000: 20
General Characteristics of Data Mining Applications
• Emphasis on predictive modeling
– scoring, classification, detection
• Massive data sets
– significant “data engineering” component
– variable selection, “feature definition”
– offline: computational issues in model fitting
– online: real-time response (e.g., e-commerce)
• “Scaling up” traditional ideas
– e.g., wide use of CART (decision trees)
– often modified to handle large-scale issues
© Padhraic Smyth, Dec 2000: 21–23

Myths and Legends in Data Mining

• “Data analysis can be fully automated”
– human judgement is critical in almost all applications
– “semi-automation” is, however, very useful
• “Association rules are useful”
– association rules are essentially lists of correlations
– few or no documented successful applications
– compare with decision trees (numerous applications)
• “With massive data sets you don’t need statistics”
– massiveness can bring more heterogeneity and noise
• so you need even more statistics!
© Padhraic Smyth, Dec 2000: 24
Outline
• What is Data Mining?
• Computer Science and Statistics: the Interface
© Padhraic Smyth, Dec 2000: 25–33

Historical Perspective

[timeline figure, built up over nine slides; rows 1950–2000, columns for Statistics and Computer Science/Engineering]
– Statistics: EDA, Trees, MARS, Flexible Predictors
– Computer Science/Engineering: AI, Statistical Pattern Recognition, ML: Trees/Rules, Neural Networks, KDD, DB/OLAP
– the two threads converge in Data Mining by 2000
© Padhraic Smyth, Dec 2000: 34
Observations
• Significant synergy/convergence of CS and Statistics emerged from neural networks
– flexible prediction models = “super offspring”
– role of NIPS, Snowbird meetings, etc
• Data Mining/KDD is still back where neural nets were 10 years ago
– DM: “our stuff is cool and we don’t really need statistics, do we?”
– Statistics: “what are these guys talking about, and why don’t they know some basic statistics?”
– Nonetheless… the DM folks have some very interesting applications and some interesting approaches
© Padhraic Smyth, Dec 2000: 35
Where Work is Published

[spectrum from Statistics to Computer Science:]
– Statistical Inference: JASA, JRSS
– Statistical Pattern Recognition: IEEE PAMI, ICPR, ICCV
– Neural Networks: NIPS, Neural Computation
– Machine Learning: ICML, COLT, ML Journal, UAI, www.jmlr.org
– Data Mining: KDD, IJDMKD
– Databases: SIGMOD, VLDB
© Padhraic Smyth, Dec 2000: 36–39

The Predictive Modeling Cycle

[figure, repeated with different emphasis on four slides: Modeling/Inference, Computation/Algorithms, Evaluation/Interpretation]
– The Computer Scientist’s View: emphasis on Computation, Algorithms
– A Statistician’s View: emphasis on Modeling, Inference
– The Customer’s View: emphasis on Evaluation, Interpretation
© Padhraic Smyth, Dec 2000: 40–42

Educational Differences

• Computer Scientists:
– undergraduate exposure to statistics
• cookbook hypothesis tests
– little or no exposure to mathematical modeling
– good at algorithms, data structures
• Statisticians:
– undergraduate exposure to CS
• how to write Fortran code
– little or no exposure to data structures/algorithms
– how do they learn the “art” of data analysis?
• Bottom line
– need a new breed of “data engineers”
– note: easier to go from statistics to CS than vice versa
© Padhraic Smyth, Dec 2000: 43–45

Cultural Differences

• Computer Scientists:
– little exposure to the “modeling art” of data analysis
– stick to a small set of well-understood models and problems
– “close to the data”: they often have ready access to data
– business-oriented culture
• Statisticians:
– applied statisticians are often very good at the “art” component
– little experience with the data management/engineering part
– papers focus on inference/models, not algorithms
– science-oriented culture
• Bottom line
– computer scientists get more attention since they are much more marketing-savvy (less worried about objectivity) than statisticians
© Padhraic Smyth, Dec 2000: 46
[diagram: Modeling, Computation, Evaluation]
© Padhraic Smyth, Dec 2000: 47
Components of a data mining algorithm (the figure groups them under “Modeling” and “Algorithm”):
– Task
– Data set
– Representation
– Objective function
– Optimization
– Data access
– Evaluation and deployment
© Padhraic Smyth, Dec 2000: 48
CART (emphasis on predictive power and flexibility of model)
– Task: Prediction
– Data set: Multivariate
– Representation: Hierarchical representation of piecewise-constant mapping
– Objective function: Cross-validation
– Optimization: Greedy search
– Data access: Flat file
– Evaluation and deployment: Accuracy and interpretability
© Padhraic Smyth, Dec 2000: 49
Association Rules (emphasis on computational efficiency and data access)
– Task: Exploratory
– Data set: Transaction data
– Representation: Sets of local rules / conditional probabilities
– Objective function: Thresholds on p
– Optimization: Systematic search
– Data access: Linear data scans
– Evaluation and deployment: ????
© Padhraic Smyth, Dec 2000: 50
The Reductionist Viewpoint
• General framework for modeling
– reduce problems to fundamental components
– think in terms of
• application first
• modeling second
• algorithm third
– ultimately the application should “drive” the algorithm
– allows systematic comparison and synthesis
• for work on synthesis, see Buntine et al., KDD ’99
– clarifies relative roles of statistics, databases, search, etc
– see Hand, Mannila, and Smyth, MIT Press, May(?) 2001
© Padhraic Smyth, Dec 2000: 51
Implications
• The “renaissance data miner” is skilled in:
– statistics: theories and principles of inference
– modeling: languages and representations for data
– optimization and search
– algorithm design and data management
• The educational problem
– is it necessary to know all these areas in depth?
– is it possible?
– do we need a new breed of professionals?
• The applications viewpoint:
– how does a scientist or business person keep up with all these developments?
– how can they choose the best approach for their problem?
© Padhraic Smyth, Dec 2000: 52
Outline
• What is Data Mining?
• Computer Science and Statistics: the Interface
• Hot Topics in Data Mining
© Padhraic Smyth, Dec 2000: 53
Subspecies of Data Miners
• SIGMOD/VLDB conferences
– database issues: querying, efficiency; no modeling
– fast querying/association-rule algorithms
• SIGKDD conferences
– algorithm focus: scaling machine learning/statistics methods
– rule-finding algorithms
• Machine Learning conference
– algorithmic focus
– decision trees, reinforcement learning
• NIPS
– originally neural networks, but now mathematical/probabilistic learning; heavy statistical influence
– SVMs, boosting, Gaussian processes, latent variable models
• ICPR (pattern recognition), SIGIR, etc
– speech, images, classifiers, etc.: engineering applications
© Padhraic Smyth, Dec 2000: 54
Hot Topics, New Directions from Computer Science
• Flexible predictive modeling
– neural networks, boosting, SVMs
• Engineering of scale
– scaling up statistical methods to new large-scale applications
• Hidden/latent variable models
– wide-scale application of EM, e.g., HMMs for speech
• Pattern finding
– associations, rules, bumps: “non-global” patterns
• Heterogeneous data
– modeling structured data, e.g., Web, multimedia (video/audio)
© Padhraic Smyth, Dec 2000: 55
Flexible Predictive Modeling
• Model combining:
– Stacking
• linear combinations of models with cross-validated weights
– Bagging
• equally weighted models fit to bootstrap samples
– Boosting
• iterative re-training that upweights data points in error
• Flexible model forms
– decision trees, neural networks, support vector machines
• Common theme:
– many of these ideas were popularized in computer science
– later “legitimized” by statisticians (e.g., by Breiman, Friedman)
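The bagging recipe above can be sketched in a few lines: fit the same model to bootstrap resamples and average the equally weighted predictions. The base “model” here is just a sample mean, an invented stand-in for a tree or other learner:

```python
import random

# Sketch of bagging: train on bootstrap resamples, average the predictions.
# The base learner is deliberately trivial (the sample mean) so the
# procedure itself is the whole example.

def bagged_mean(data, n_models=100, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]   # bootstrap sample, same size
        preds.append(sum(boot) / len(boot))        # "fit" one model
    return sum(preds) / n_models                   # equally weighted average

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data with one outlier
estimate = bagged_mean(data)
```

With a high-variance base learner such as a deep decision tree, the same averaging is what reduces prediction variance.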
© Padhraic Smyth, Dec 2000: 56
Example of a Document-Term Matrix
[figure: pixel view of a sparse matrix, ~500 documents × 200 terms]
© Padhraic Smyth, Dec 2000: 57
Application: Flexible Classification Models for Text
• The Web represents a huge data set of text documents
– problem: classification of Web pages into “topic categories”
– e.g., automated creation of topic hierarchies for Yahoo
– automated crawlers for information gathering
• Technical challenges
– standard representation of a Web page?
• typically use “list of term vectors”
• very high-dimensional information
– other information: images, page structure, etc
• Current activity
– much data mining research on document classification:
• Web page -> high-dimensional term vector -> flexible classifier
– commercial companies: Whizbang, Autonomy, IBM, etc
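The “page -> term vector -> classifier” pipeline can be sketched end to end with a nearest-centroid classifier; the topic names and training documents below are invented, and real systems use far richer term weighting:

```python
from collections import Counter

# Minimal bag-of-words topic classifier: each document becomes a term
# vector, each topic a summed centroid, and a new page gets the topic
# whose centroid it overlaps most.

def term_vector(text):
    return Counter(text.lower().split())

def train_centroids(labeled_docs):
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids

def classify(text, centroids):
    vec = term_vector(text)
    def score(centroid):
        return sum(vec[t] * centroid[t] for t in vec)  # unnormalized dot product
    return max(centroids, key=lambda lbl: score(centroids[lbl]))

docs = [
    ("sports", "game team score win match"),
    ("finance", "stock market price shares trade"),
]
centroids = train_centroids(docs)
label = classify("the team won the game", centroids)
# -> "sports"
```

The high dimensionality mentioned on the slide shows up here as the size of the combined vocabulary; flexible classifiers (SVMs, boosted trees) replace the dot-product scoring in practice.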
© Padhraic Smyth, Dec 2000: 58–60

2. Scale: How far away are the data?

CPU – RAM – Disk
Random access times: RAM ~10^-8 seconds, disk ~10^-3 seconds
Effective distances: RAM ~1 meter, disk ~100 km
© Padhraic Smyth, Dec 2000: 61–62

Basic Idea

[diagram: Massive Database (“slow” memory) -> Approximate Model of Data (“fast”/cached memory) -> Human or Algorithm]

Comments:
1. Even if the data fit in main memory, there are many advantages to clever data structures (e.g., see Andrew Moore’s talk).
2. Particularly relevant for massive streams of transaction data, e.g., telephone data (see Diane Lambert’s talk).
© Padhraic Smyth, Dec 2000: 63
2. Scalable Algorithms
• “Scaling down the data” or “data approximation”
– work from clever data summarizations (e.g., sufficient statistics)
– e.g., “data squashing” (DuMouchel et al., AT&T, KDD ’99)
• create a small “pseudo data set” with statistical properties similar to the original (massive) data set
• then run your standard algorithm on the pseudo-data
• interesting theoretical (statistical) basis
• “Scaling up the algorithm”
– data structures/caching strategies to speed up known algorithms
• ADTrees, etc., from Andrew Moore (CMU)
• scalable decision trees (Johannes Gehrke, Cornell)
– can get orders-of-magnitude speed improvements
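The “work from sufficient statistics” idea can be shown with the simplest case: per-chunk (count, sum, sum-of-squares) summaries are merged to recover the mean and variance exactly, so the raw data never needs to sit in memory at once. A toy sketch, not DuMouchel et al.’s squashing algorithm itself:

```python
# Sufficient statistics for mean/variance: summarize each chunk during a
# single linear scan, then merge the summaries. The answer matches what
# you would compute on the pooled data.

def summarize(chunk):
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(stats):
    n = sum(s[0] for s in stats)
    sx = sum(s[1] for s in stats)
    sxx = sum(s[2] for s in stats)
    mean = sx / n
    var = sxx / n - mean * mean   # population variance
    return mean, var

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]   # stand-ins for disk-sized blocks
mean, var = merge([summarize(c) for c in chunks])
# same answer as computing on the pooled data [1, 2, 3, 4, 5]
```

Data squashing generalizes this: instead of exact sufficient statistics, it builds a small weighted pseudo data set whose moments approximate the original’s, so arbitrary standard algorithms can run on it.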
© Padhraic Smyth, Dec 2000: 64
3. Pattern Finding
• Patterns = unusual, hard-to-find local “pockets” of data
– finding patterns is not the same as global model fitting
– the simplest example of patterns is association rules
– much other work on rule-finding in data mining/AI
– other applications:
• motif-finding in protein sequences
• unusual objects in sky-survey data
• “Bump-hunting”
– PRIM algorithm of Friedman and Fisher (1999)
– finds multivariate “boxes” in high-dimensional spaces where the mean of the target variable is higher
– trades off “support” with “mean height”
– effective and flexible
• e.g., finding small, highly profitable groups of customers
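The support-versus-mean-height trade-off can be illustrated with a one-dimensional toy version of box peeling: repeatedly trim a small fraction of points from whichever edge raises the target mean most, stopping when peeling no longer helps. This is only the idea, not Friedman and Fisher’s actual PRIM implementation:

```python
# Toy 1-D "box peeling": shrink an interval around the region where the
# target variable y has high mean, trading support for mean height.

def peel_1d(points, peel_frac=0.2, min_support=2):
    """points: list of (x, y) pairs; returns (lo, hi) bounds of the box."""
    box = sorted(points)                      # sort by x
    while len(box) > min_support:
        k = max(1, int(peel_frac * len(box)))
        candidates = [box[k:], box[:-k]]      # peel low edge or high edge
        mean = lambda b: sum(y for _, y in b) / len(b)
        best = max(candidates, key=mean)
        if mean(best) <= mean(box):           # no improvement: stop peeling
            break
        box = best
    return box[0][0], box[-1][0]

# target is high only on the "bump" x in [4, 6]
pts = [(x, 1.0 if 4 <= x <= 6 else 0.0) for x in range(11)]
lo, hi = peel_1d(pts)
# -> box (4, 6)
```

PRIM does this in many dimensions at once, peeling along whichever variable’s edge improves the mean most, then “pasting” edges back to refine the box.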
© Padhraic Smyth, Dec 2000: 65–70

“Bump-Hunting”

[figure sequence repeated over six slides]
Pattern Finding (ctd.)
• Contrast sets (Bay and Pazzani, KDD ’99)
– individuals or objects categorized into 2 groups
• e.g., students enrolled in CS and in Engineering
– high-dimensional multivariate measurements on each
– automatically produces a summary of significant differences between groups
– combines massive search with statistical estimation
• Time-series pattern spotting
– “find me a shape that looks like this”
– semi-Markov deformable templates (Ge and Smyth, KDD 2000)
– significantly outperforms template matching and DTW
– Bayesian approach integrates prior knowledge with data
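The core of the contrast-set idea, stripped of the search machinery, is comparing attribute frequencies across the two groups and keeping the large differences. A rough sketch with invented student data; the real method searches over attribute conjunctions and applies significance corrections:

```python
# Minimal contrast-set sketch: report attributes whose within-group
# frequencies differ by more than a threshold between two groups.

def contrasts(group_a, group_b, min_diff=0.3):
    """group_a, group_b: lists of attribute sets; returns {attr: (p_a, p_b)}."""
    attrs = set().union(*group_a, *group_b)
    out = {}
    for a in attrs:
        p_a = sum(a in x for x in group_a) / len(group_a)
        p_b = sum(a in x for x in group_b) / len(group_b)
        if abs(p_a - p_b) >= min_diff:
            out[a] = (p_a, p_b)
    return out

cs_students = [{"math", "programming"}, {"programming"}, {"programming", "art"}]
eng_students = [{"math", "physics"}, {"math"}, {"physics"}]
diff = contrasts(cs_students, eng_students)
# e.g. "programming" appears for 100% of CS students and 0% of engineers
```

Bay and Pazzani’s contribution is doing this over exponentially many conjunctions of attributes while controlling the false-discovery problem that the massive search creates.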
© Padhraic Smyth, Dec 2000: 72
Example: Deformable Templates
• Segmental hidden semi-Markov model (Ge and Smyth, KDD 2000)
• Each waveform segment corresponds to a state in the model
[figure: waveform divided into segments, mapped to states S1, S2, …, ST]
© Padhraic Smyth, Dec 2000: 73
Pattern-Based End-Point Detection
[figure: original pattern (top) and detected pattern (bottom) over 0–400 seconds]
End-Point Detection in Semiconductor Manufacturing
© Padhraic Smyth, Dec 2000: 74
Heterogeneous Data Modeling
• Clustering objects (sequences, curves, etc)
– probabilistic approach: define a mixture of models (Cadez, Gaffney, and Smyth, KDD 2000)
– unified framework for clustering objects of different dimensions
– applications:
• curve clustering
– e.g., mixtures of regression models (Gaffney and Smyth, KDD ’99)
– video movement, gene expression data, storm trajectories
• sequence clustering
– e.g., mixtures of Markov models
– clustering of MSNBC Web data (Cadez et al., KDD ’00)
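The building block of the mixtures-of-Markov-models approach is the maximum-likelihood transition matrix for a set of symbol sequences; fitting several such matrices with EM, one per cluster, gives the mixture. Only the single-component estimate is sketched here, on invented sequences:

```python
from collections import defaultdict

# Estimate first-order Markov transition probabilities from sequences:
# P(b | a) = count(a followed by b) / count(a followed by anything).

def transition_probs(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive symbol pairs
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs

P = transition_probs(["AAB", "ABB", "AAA"])
# A is followed by A 3 times and by B 2 times -> P(A|A) = 0.6
```

In the mixture version, EM soft-assigns each whole sequence to a cluster and re-estimates one transition matrix per cluster from the weighted counts.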
© Padhraic Smyth, Dec 2000: 75
[figure: trajectories of centroids of a moving hand in video streams (position vs. time), with two estimated cluster trajectories]
© Padhraic Smyth, Dec 2000: 76
4. (Un)structured Data (e.g., Text, Web)

• Applications
– classification of text documents
• automatic classification of emails as junk/non-junk
• automatic creation of taxonomies for Web page portals such as Yahoo
– discovery of authoritative documents
• search engines (Yahoo)
• citation rankings
• Techniques
– “vector-space” model
– adaptations of simple classification and clustering algorithms
– graph-based techniques
• Challenges for statistics
– scale of problem: huge documents, huge Web
– structure, semantics of documents, Web
© Padhraic Smyth, Dec 2000: 77
Document Clustering
• Techniques
– model-based mixture clustering
– mixtures of multinomials
– mixtures of conditional-independence models
• Connections to statistics
– probabilistic models are well known (latent class models)
– EM algorithm for training
• Differences from statistics:
– scale and nature of applications
– e.g., Hofmann (Brown U.), “probabilistic PCA”
– e.g., Lafferty (CMU), maxent models for text prediction
© Padhraic Smyth, Dec 2000: 78
Example of a Document-Term Matrix
[figure: pixel view of the ~500 × 200 document-term matrix, repeated]
© Padhraic Smyth, Dec 2000: 79
Example of a Document Cluster

Most likely terms in component 5 (weight = 0.08):
TERM      p(t|k)
write     0.571
drive     0.465
problem   0.369
mail      0.364
articl    0.332
hard      0.323
work      0.319
system    0.303
good      0.296
time      0.273

Highest-lift terms in component 5 (weight = 0.08):
TERM      LIFT   p(t|k)  p(t)
scsi      7.7    0.13    0.02
drive     5.7    0.47    0.08
hard      4.9    0.32    0.07
card      4.2    0.23    0.06
format    4.0    0.12    0.03
softwar   3.8    0.21    0.05
memori    3.6    0.14    0.04
install   3.6    0.14    0.04
disk      3.5    0.12    0.03
engin     3.3    0.21    0.06
© Padhraic Smyth, Dec 2000: 80
Example of a Document Cluster

Most likely terms in component 1 (weight = 0.11):
TERM      p(t|k)
articl    0.684
good      0.368
dai       0.363
fact      0.322
god       0.320
claim     0.294
apr       0.279
fbi       0.256
christian 0.256
group     0.239

Highest-lift terms in component 1 (weight = 0.11):
TERM      LIFT   p(t|k)  p(t)
fbi       8.3    0.26    0.03
jesu      5.5    0.16    0.03
fire      5.2    0.20    0.04
christian 4.9    0.26    0.05
evid      4.8    0.24    0.05
god       4.6    0.32    0.07
gun       4.2    0.17    0.04
faith     4.2    0.12    0.03
kill      3.8    0.22    0.06
bibl      3.7    0.11    0.03
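The lift column in these cluster summaries is p(term | cluster) / p(term): how much more likely a term is inside the cluster than in the corpus overall. A hedged re-computation on an invented two-cluster toy corpus (the real tables come from a fitted mixture model, not hard counts):

```python
from collections import Counter

# Compute lift(term, cluster) = p(term | cluster) / p(term) from raw
# token counts, using hard cluster assignments for simplicity.

def term_lift(clusters):
    """clusters: {name: list of token lists} -> {(name, term): lift}."""
    overall, per_cluster, total = Counter(), {}, 0
    for name, docs in clusters.items():
        c = Counter()
        for doc in docs:
            c.update(doc)
        per_cluster[name] = c
        overall.update(c)
        total += sum(c.values())
    lifts = {}
    for name, c in per_cluster.items():
        n = sum(c.values())
        for t, cnt in c.items():
            lifts[(name, t)] = (cnt / n) / (overall[t] / total)
    return lifts

clusters = {
    "hardware": [["scsi", "drive"], ["drive", "disk"]],
    "talk": [["god", "claim"], ["claim", "fbi"]],
}
L = term_lift(clusters)
# "drive" is twice as frequent in the hardware cluster as overall -> lift 2.0
```

High-lift terms (like “scsi” above) characterize a cluster even when common words dominate the most-likely-terms list.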
© Padhraic Smyth, Dec 2000: 81
Pixel Representation of Mixture Components
[figure: pixel view of 10 mixture-component models over ~100 terms]
© Padhraic Smyth, Dec 2000: 82
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
…

Sessions encoded as page-category sequences, one per user (Users 1–5):
5115
11111151511151
77777777
111333
3333131113332232
…
Example: Web Log Mining
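The preprocessing this slide illustrates, raw server-log lines reduced to a page-category symbol sequence per user, can be sketched as follows. Field positions follow the IIS-style log shown above; the category map is invented for illustration:

```python
# Fold raw server-log lines into per-user sequences of page-category
# symbols, the input format for sequence clustering.

# Hypothetical page -> category-symbol map (for illustration only).
category = {"/top.html": "1", "/spt/main.html": "5", "/spt/images/bk1.jpg": "7"}

def user_sequences(log_lines):
    seqs = {}
    for line in log_lines:
        fields = [f.strip() for f in line.split(",")]
        user, page = fields[0], fields[13]     # client IP and requested URL
        seqs.setdefault(user, []).append(category.get(page, "?"))
    return {u: "".join(s) for u, s in seqs.items()}

log = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
]
seqs = user_sequences(log)
# -> {"128.195.36.195": "15"}
```

These categorical sequences are then clustered, e.g., with the mixtures of Markov models mentioned on the previous slides.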
© Padhraic Smyth, Dec 2000: 83
Clusters of Dynamic Behavior

[figure: three clusters, each drawn as a transition graph over page categories A, B, C, D]
© Padhraic Smyth, Dec 2000: 84
© Padhraic Smyth, Dec 2000: 85

WebCanvas: Cadez, Heckerman, et al., KDD 2000
© Padhraic Smyth, Dec 2000: 86
Application: Product Purchasing Data
[figure: transactions × product categories matrix, ~350 transactions by 50 categories]
© Padhraic Smyth, Dec 2000: 87
Application: Recommender Systems
[figure: sparse products × transactions matrix; a new customer’s row has two known purchases (x) and many unknowns (?)]
© Padhraic Smyth, Dec 2000: 88
Application: Recommender Systems
[figure repeated from the previous slide]
– high-dimensional inference/prediction problem
– sparse data
– recommendations must be made in real time!
© Padhraic Smyth, Dec 2000: 89
Approaches to Recommender Systems
• Collaborative filtering
– “infer your interests from people with similar behavior”
– essentially a nearest-neighbor algorithm
– considerable commercial interest (e.g., NetPerceptions, Firefly)
– scalability problems
• Model-based recommenders
– model the joint distribution of the products explicitly
– example: dependency networks from Microsoft Research
• decision-tree models/MRFs
• extremely fast, shipping in Microsoft products
• Heckerman et al. (2000), Journal of Machine Learning Research, www.jmlr.org
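The nearest-neighbor character of collaborative filtering can be sketched directly: score unseen products by votes from the most similar users. The similarity measure (Jaccard) and the toy users are invented; real systems use rating correlations and heavy indexing, which is exactly where the scalability problems arise:

```python
# Minimal user-based collaborative filtering: find the k users most
# similar to the target (by Jaccard overlap of purchase sets) and
# recommend the highest-voted product the target has not bought.

def recommend(target, others, k=1):
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    neighbours = sorted(others, key=lambda u: jaccard(target, u), reverse=True)[:k]
    votes = {}
    for u in neighbours:
        for p in u - target:                   # products the target lacks
            votes[p] = votes.get(p, 0.0) + jaccard(target, u)
    return max(votes, key=votes.get) if votes else None

users = [{"beer", "chips", "salsa"}, {"milk", "bread"}, {"beer", "wine"}]
pick = recommend({"beer", "chips"}, users)
# nearest user also bought salsa -> recommend "salsa"
```

The naive version scans all users per query; the model-based recommenders on this slide trade that scan for an explicit fitted model, which is why they can answer in real time.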
© Padhraic Smyth, Dec 2000: 90
Final Comments
• Successful data mining requires integration/understanding of
– statistics
– computer science
– the application discipline
• Current practice of data mining
– computer scientists focused on business applications
– relatively little statistical sophistication, but some new ideas
– considerable “hype” factor
• Opportunities for statisticians
– new problems: e.g., statistical scalability
– new applications: e.g., inference from Web and text data
– ready audience for statistical techniques
• need better marketing!
© Padhraic Smyth, Dec 2000: 91
Pointers
• Papers:
– www.ics.uci.edu/~datalab
– e.g., “Data mining: data analysis on a grand scale?”, P. Smyth (2000), Statistical Methods in Medical Research
• Web resources:
– www.kdnuggets.com
• Interface ’01
– data mining and bioinformatics themes
– June 13–16, 2001, Costa Mesa, CA
• Text (forthcoming)
– Principles of Data Mining
• D. J. Hand, H. Mannila, P. Smyth
• MIT Press, May 2001?