© ELCA - 01 - 2004 CGCVT 1.0
There is no Free Lunch, but you’d be surprised how far a few well invested dollars can stretch…
Provo, UT, 15 January 2004
CS Graduate Colloquium Series
A Frog's Leaps

1989, 1994, 2001
Inductive Machine Learning

Design and implement algorithms that exhibit inductive capabilities.
Induction "involves intellectual leaps from the particular to the general".
From observations:
Construct a model that explains the (labelled) observations and generalises to new, previously unobserved situations, or
Extract rules/critical features that characterize the observations
Learning Tasks

Prediction: learn a function that associates a data item with the value of a prediction/response variable (e.g., credit worthiness)
Clustering / Segmentation: identify a set of (meaningful) categories or clusters to describe the data (e.g., a customer DB)
Dependency Modeling: find a model that describes significant dependencies, associations or affinities among variables (e.g., market baskets)
Change / Anomaly Detection: discover the most significant changes in the data from previously measured or normative values (e.g., fraud)
Classification Learning
Prediction where the response variable is discrete
Wide range of applications; the most-used DM technique (?)

Learning algorithms:
  Decision trees: ID3, C5.0, OC1, etc.
  Neural networks: FF/BP, RBF, ASOCS, etc.
  Rule induction: CN2, ILP, etc.
  Instance-based learning: kNN, NGE, etc.
The Challenge

I am faced with a specific classification task (e.g., am I looking at a rock or at a mine?), and I want a model with high accuracy.
Which classification algorithm should I use? Any of them? All of them? One in particular?
Assumptions

Attribute-value language: objects are represented by vectors of attribute-value pairs, where each attribute takes on only a finite number of values. There is a finite set of m possible attribute vectors.
Binary classification into {0,1}: the relationship between attribute vectors and classes may be specified by an m-component class probability vector C, where C_i is the probability that an object with attribute vector A_i is of class 1.
Data generation: attribute vectors are sampled with replacement according to an arbitrary distribution D, and a class is assigned to each object using C (i.e., class 1 with probability C_i and class 0 with probability 1 - C_i for an object with attribute vector A_i).
Data is generated the same way for both training and testing.
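As a minimal sketch (function and parameter names are hypothetical, not from the talk), the data-generation assumption transcribes directly into Python:

```python
import random

def generate(D, C, vectors, n, rng):
    """Sample n labelled objects: draw attribute vector vectors[i] with
    probability D[i], then assign class 1 with probability C[i], else 0."""
    data = []
    for _ in range(n):
        i = rng.choices(range(len(vectors)), weights=D)[0]
        label = 1 if rng.random() < C[i] else 0
        data.append((vectors[i], label))
    return data

# Two attribute vectors; a noise-free C (components in {0,1}) makes the
# labels deterministic given the vector
rng = random.Random(0)
sample = generate(D=[0.5, 0.5], C=[1.0, 0.0], vectors=[(0,), (1,)], n=10, rng=rng)
```

With a noise-free C as above, every object with vector (0,) is labelled 1 and every object with vector (1,) is labelled 0, regardless of the sampling seed.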
Definitions

A learning situation S is a triple (D, C, n), where D and C specify how data will be generated and n is the size of the training sample.
Generalization accuracy is the expected prediction performance on objects whose attribute vectors were not presented in the training sample.
The generalization accuracy of a random guesser is 0.5 for every D and C.
Generalization performance is generalization accuracy minus 0.5.
GP_L(S) denotes the generalization performance of a learner L in learning situation S.
Schaffer's Law of Conservation

For any learner L, and for every D and n:

    Σ_S GP_L(S) = 0

where the sum ranges over all learning situations S = (D, C, n) with the given D and n.
Verification of the Law (I)
The theorem holds for any arbitrary choice of D and n, but these are fixed when the sum is computed. Hence, summing over S is equivalent to summing over C
If we exclude the possibility of noise, the components of C are drawn from {0,1}
There are 2^m class probability vectors, and the theorem involves a sum, as written, of the 2^m corresponding terms
Verification of the Law (II)

Restricting to binary attributes, we have m = 2^a (where a is the number of attributes). Generate all 2^m = 2^(2^a) noise-free binary classification tasks
Compute each task's generalization performance using a leave-one-out procedure
Sum the generalization performances
Using ID3 as the learner, we performed the experiment for both a=3 and a=4
The results were … as expected!
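For intuition, the experiment can be reproduced in miniature. The sketch below is a hypothetical stand-in, substituting a 1-nearest-neighbour learner for ID3 and using a = 3; it enumerates all noise-free tasks and checks that the generalization performances sum to zero:

```python
from itertools import product

def one_nn(train, x):
    # Predict the label of the Hamming-nearest training vector
    hamming = lambda u, v: sum(ui != vi for ui, vi in zip(u, v))
    return min(train, key=lambda ex: hamming(ex[0], x))[1]

a = 3
vectors = list(product([0, 1], repeat=a))   # the m = 2^a attribute vectors
m = len(vectors)

total = 0.0
for labels in product([0, 1], repeat=m):    # all 2^m noise-free tasks
    correct = 0
    for i in range(m):                       # leave-one-out over the vectors
        train = [(vectors[j], labels[j]) for j in range(m) if j != i]
        correct += one_nn(train, vectors[i]) == labels[i]
    total += correct / m - 0.5               # generalization performance

print(total)  # 0.0: performances over all tasks cancel exactly
```

The cancellation is exact for any learner: the prediction on a held-out vector depends only on the other labels, so pairing tasks that differ only in that label yields one correct and one incorrect prediction.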
There is No Free Lunch

Some learners are impossible; in particular, there is no universal learner.
Every demonstration that the generalization performance of an algorithm is better than chance on a set of learning tasks is an implied demonstration that it is worse than chance on an alternative set of tasks.
Every demonstration that the generalization performance of an algorithm is better than that of another on a set of learning tasks is an implied demonstration that it is worse on an alternative set of tasks.
Averaged over all learning tasks, the generalization performance of any two algorithms is exactly identical.
Practical Implication

Either one shows that all "real-world" learning tasks come from some subset of the universe on which some (or all) learning algorithms perform well, or one must have some way of determining which algorithms will perform well on one's learning task.
Since it is difficult to know a priori all "real-world" learning tasks, we focus on the second alternative.
Note: "overfitting" UCI assumes the first option.
Going back to the Challenge
Which learner should I use?
Any of them? Too hazardous – you may select the wrong one
All of them? Too onerous – it will take too long
One in particular? Yes – but, how do I know which one is best?
Note: ML/KDD practitioners often narrow in on a subset of algorithms, based on experience
Finding a Mapping…

Classification tasks  →  ?  →  Classification algorithms
…through Meta-learning

Basic idea: learn a selection or ranking function for learning tasks.
Prerequisite: a description language for tasks.
Several approaches have been proposed. The most popular one relies on extracting statistical and information-theoretic measures from a data set. We have developed an alternative approach to task description, called landmarking.
Conjecture
THE PERFORMANCE OF A LEARNER ON A TASK UNCOVERS
INFORMATION ABOUT THE NATURE OF THAT TASK
Landmarking the Expertise Space

Each learner has an area of expertise, i.e., the class of tasks on which it performs particularly well, under some reasonable measure of performance.
A task can be described by the collection of areas of expertise to which it belongs.
A landmarker is a learning mechanism whose performance is used to describe tasks.
Landmarking is the use of landmarkers to locate tasks in the expertise space, i.e., the space of all areas of expertise.
Illustration

Labelled areas are areas of expertise of learners. Assume the landmarkers are i1, i2 and i3.
Possible inference: problems on which both i1 and i3 perform well, but on which i2 performs poorly, are likely to be in i4's area of expertise
Meta-learning with Landmarking

Landmarking concentrates on "cartographic" considerations: learners are used to signpost learners.
In principle, every learner's performance can signpost the location of a problem with respect to other learners' expertise.
The landmarkers' performance values are used as task descriptors, or meta-attributes, for meta-learning.
Exploring the meta-learning potential of landmarking amounts to investigating how well landmarkers' performances hint at the location of learning tasks in the expertise space.
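A toy sketch of that pipeline follows. Everything here is synthetic: the landmarker scores, the decision rule and the learner names are hypothetical stand-ins for real cross-validation results; only the shape of the computation (landmarker performances as meta-attributes, a meta-learner on top) comes from the talk.

```python
import random

rng = random.Random(0)

# Each meta-example: landmarker performance values as meta-attributes,
# labelled with the target learner that (in this toy world) performs best.
def make_meta_example():
    lin = rng.random()      # e.g., a linear-discriminant-like landmarker
    stump = rng.random()    # e.g., a single-decision-node landmarker
    nn = rng.random()       # e.g., an elite 1-NN landmarker
    best = "LinDiscr" if lin > max(stump, nn) else "C5.0"
    return (lin, stump, nn), best

meta_train = [make_meta_example() for _ in range(200)]

def recommend(scores):
    # 1-NN meta-learner in landmarker space: suggest the learner that was
    # best on the most similar previously seen task
    sqdist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(meta_train, key=lambda ex: sqdist(ex[0], scores))[1]

print(recommend((0.95, 0.30, 0.25)))  # a task that looks linearly separable
```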
Selecting Landmarkers

Two main considerations:
Computational complexity
  Statistical tests are expensive (up to O(n^3) for some): poor scalability, and the CPU time could have been allotted to a sophisticated learner of equal or better complexity.
  We limit ourselves to O(n log n) and, in any case, do not exceed the time needed to run the target learners.
Bias
  To adequately chart the learning space, we need landmarkers to measure different properties, at least implicitly (the set of target learners may guide the choice of bias).
Target Learners

The set of target learners consists of the following 10 popular learners:

  C5.0 decision tree
  C5.0 decision rules
  C5.0 decision tree with boosting
  Naive Bayes (MLC++ implementation)
  Instance-Based Learning (MLC++ implementation)
  Clementine's Multi-Layer Perceptron
  Clementine's Radial Basis Function Network
  RIPPER
  Linear Discriminant
  LTREE
Landmarkers

Typical landmarkers include:

Decision node: a single, most-informative decision node (based on information gain ratio). Aims to establish closeness to linear separability.
Randomly chosen node: a single decision node chosen at random. Informs, together with the next one, about irrelevant attributes.
Worst node: a single, least-informative decision node. Further informs on linear separability (if neither the best nor the worst attribute produces a single well-performing separation, it is likely that linear separation is not an adequate learning strategy).
Elite 1-Nearest Neighbour: standard 1-NN with the nearest neighbour computed on an elite subset of attributes (based on information gain ratio). Attempts to establish whether the task is relational, that is, whether it involves parity-like relationships between the attributes. In relational tasks, no single attribute is considerably more informative than all the others.
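The first landmarker can be sketched as follows. This is a minimal version under stated assumptions: binary attributes, and plain information gain in place of the gain ratio used in the talk; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def decision_node_landmarker(data):
    """Accuracy of the single most-informative decision node.
    data: list of (attribute_tuple, label) with binary attributes.
    Uses plain information gain (the talk uses gain ratio)."""
    labels = [y for _, y in data]
    base = entropy(labels)
    best_gain, best_j, best_leaves = -1.0, 0, {}
    for j in range(len(data[0][0])):
        split = {}
        for x, y in data:
            split.setdefault(x[j], []).append(y)
        gain = base - sum(len(s) / len(labels) * entropy(s) for s in split.values())
        if gain > best_gain:
            # majority label on each side of the split
            leaves = {v: Counter(s).most_common(1)[0][0] for v, s in split.items()}
            best_gain, best_j, best_leaves = gain, j, leaves
    return sum(best_leaves[x[best_j]] == y for x, y in data) / len(data)

# XOR-like (relational) task: no single node helps -> chance-level accuracy
xor = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)] * 4
# Linearly separable task: one attribute decides the class -> perfect accuracy
lin = [((a, b), a) for a in (0, 1) for b in (0, 1)] * 4
print(decision_node_landmarker(xor), decision_node_landmarker(lin))  # 0.5 1.0
```

The two toy tasks illustrate the intent: a high score hints that the task is close to linearly separable, while a chance-level score hints at a relational, parity-like task.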
Statistical Meta-attributes

Typical (simple) statistical meta-attributes include:

  Class entropy
  Average entropy of the attributes
  Mutual information
  Joint entropy
  Equivalent number of attributes
  Signal-to-noise ratio
Training Set I

Artificial datasets: 320 randomly generated Boolean datasets with between 5 and 12 attributes.
The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation.
Each dataset is labeled as:
  Learner L_k, where GP_Lk = max{GP_Li}
  Tie, if max{GP_Li} - min{GP_Li} < 0.1
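The labelling rule transcribes directly (the function name is hypothetical; the rule itself is from the slide):

```python
def label_dataset(gp):
    """gp: mapping learner name -> generalization performance on one dataset.
    Label with the winning learner, or "Tie" when the spread between the
    best and worst learner is under 0.1."""
    if max(gp.values()) - min(gp.values()) < 0.1:
        return "Tie"
    return max(gp, key=gp.get)

print(label_dataset({"C5.0": 0.32, "NB": 0.12, "IB1": 0.25}))  # C5.0
print(label_dataset({"C5.0": 0.30, "NB": 0.28, "IB1": 0.25}))  # Tie
```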
Landmarking vs Standard DC

Meta-learner   Landmarking   Standard DC   Combined
Majority       0.460         0.460         0.460
C5.0Tree       0.242         0.342         0.314
C5.0Rules      0.239         0.333         0.301
MLP            0.301         0.317         0.320
RBFN           0.289         0.323         0.304
LinDiscr       0.335         0.311         0.301
LTree          0.270         0.317         0.286
IB1            0.329         0.366         0.342
NB             0.429         0.407         0.363
Ripper         0.292         0.314         0.295
Average        0.303         0.337         0.314
Training Set II

Artificial datasets: 222 randomly generated Boolean datasets with 20 attributes.
The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation.
Three model classes: NN = {MLP, RBFN}, R = {RIPPER, C5.0 Rules}, DT = {C5.0, C5.0 Boosting, LTREE}.
Each dataset is labeled as:
  Model class M_k, if there is an L_i ∈ M_k with GP_Li > 1.1 · avg{GP_Lj}
  NoDiff, otherwise
Plus 18 UCI datasets.
Predicting Model Classes

Meta-learner   NN      R       DT
Majority       0.440   0.370   0.470
C5.0Tree       0.358   0.233   0.371
C5.0Rules      0.367   0.229   0.371
MLP            0.413   0.392   0.454
RBFN           0.333   0.225   0.375
LinDiscr       0.371   0.379   0.467
LTree          0.396   0.221   0.346
IB1            0.388   0.258   0.354
NB             0.433   0.421   0.421
Ripper         0.363   0.221   0.363
Measuring Performance Level

Experiments show that landmarking meta-learns. They do not, however, reflect the overall performance of a system whose end result is the accuracy of the selected learning model. We estimate it as follows:
  Train on the artificial datasets
  Test on the UCI datasets
  Report the average error difference between the actual best choice and the meta-learner's selected choice
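The per-dataset loss can be computed as follows (a sketch; the function name and the error values are hypothetical):

```python
def selection_loss(test_errors, selected):
    """Extra test error incurred by the meta-learner's choice on one dataset,
    relative to the actual best choice on that dataset."""
    return test_errors[selected] - min(test_errors.values())

# Hypothetical per-dataset test errors for the three model classes
errors = {"NN": 0.14, "R": 0.11, "DT": 0.17}
print(round(selection_loss(errors, "NN"), 3))  # 0.03
```

Averaging this quantity over the UCI test datasets gives the Loss column of the next slide; picking the worst class instead of the best gives the Max Loss bound.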
Performance Level

Model Class   Loss    Max Loss
NN            0.031   0.081
R             0.036   0.088
DT            0.021   0.096
METAL Home Page
Uploading Data
Characterising Data
Ranking
Parameter Setting
Results
Running Algorithms
Site Statistics
[Chart: registered users, split into Industry and Academia, May 2002 – Nov 2003]
[Chart: number of countries and number of datasets (excl. METAL), May 2002 – Dec 2003]
Conclusions

There is no universal learner.
Work on meta-learning holds promise: landmarking shows that learner preferences may be learned.
Open questions:
  Training data generation (incremental, …)
  Choice of meta-features (landmarking, structure, …)
  Choice of meta-learner (higher-order, …)
MLJ Special Issue on Meta-learning (2004), to appear
If There is Time Left…
Process View
Business Problem Formulation → Domain & Data Understanding → Data Pre-processing → Model Building → Interpretation & Evaluation → Dissemination & Deployment

Data flow: Raw Data → Selected Data → Pre-processed Data → Patterns/Models

Example (credit scoring):
  Business problem formulation: determine credit worthiness
  Domain & data understanding: learn about loans, repayments, etc.; collect data about past performance
  Data pre-processing: aggregate individual incomes into household income
  Model building: build a decision tree
  Interpretation & evaluation: check against a hold-out set
DM Tools

Three types:

Research tools: generally open source, no GUI, expert-driven.
Dedicated tools: commercial-strength, restricted to one type of task/algorithm, limited process support.
Packaged tools: commercial-strength, rich GUI, support for different types of tasks, rich collection of algorithms, DM process support.
A plethora of DM tools has emerged…
Tool Selection

The situation: increasing interest; many tools with the same basis but different content; research talk vs business talk.
Goal: assist business users in selecting a DM package based on high-level business criteria.
Journal of Intelligent Data Analysis, Vol. 7, No. 3 (2003)
Schema Definition

Data Mining Tool Description:

Business Goal: Customer Acquisition; Cross-/Up-selling; Product Development; Churn Prediction; Fraud Detection; Market-basket Analysis; Risk Assessment; Prediction/Forecasting; Outcomes Measurement; Condition Monitoring; Health Monitoring; Discovery

Model Types: Prediction (Classification; Regression); Description (Clustering/Segmentation; Dependency/Association; Anomaly/Change Detection; Time-series Analysis)

Process-dependent Features:
  Input Formats: Flat File; ODBC/JDBC; XML; Other (e.g., SAS); DBMS Embedded
  Pre-processing: Data Visualisation; Data Cleaning; Record Selection; Attribute Selection; Data Characterisation; Data Transformation
  Modelling Algorithms: Decision Trees; Rule Learning; Neural Networks; Linear/Logistic Regression; Association Learning; Text Mining; Instance-based Learning; Unsupervised Learning; Probabilistic Learning
  Interpretation/Evaluation: Cross-validation; Lift/Gain Charts; ROC Analysis; Independent Test Set; Summary Reports; Model Visualisation
  Dissemination/Deployment: Save/Reload Models; Comment Fields; Produce Executable; PMML/XML Export
  Scalability: Dataset Size Limits; Support for Parallelisation; Incrementality
  Miscellaneous: Expert Options; Batch Processing

User Interface Features: Layout (GUI vs Command Line); Visual Programming (Drag & Drop); On-line Help

System Requirements:
  Hardware Platform: PC Windows; Unix/Solaris
  Software Requirements: DB2; SAS Base; Oracle DB; Java/JRE
  Software Architecture: Standalone; Client-server; Thin Client

Vendor Information: Contact details (Address, URL, Email, etc.); Technical Support Level; Price Range (Free/Open Source; Low Cost; Medium Cost; Expensive); Free Demo/Evaluation Version; Market Penetration; Longevity; Integration/Validation
Comprehensive Survey

59 of the most popular tools.
Commercial: AnswerTree, CART / MARS, Clementine, Enterprise Miner, GainSmarts, GhostMiner, Insightful Miner, Intelligent Miner, KnowledgeSTUDIO, KXEN, MATLAB Neural Network Toolbox, NeuralWorks Predict, NeuroShell, Oracle Data Mining Suite, PolyAnalyst, See5 / Cubist / Magnum Opus, SPAD, SQL Server 2000, STATISTICA Data Miner, Teradata Warehouse Data Mining, etc.
Freeware: WEKA, Orange, YALE, SwissAnalyst.
Dynamic DB: updated regularly
Recipients
[Chart: number of recipients, 26/02/2002 – 26/12/2003]
Comments

The above dimensions characterize Data Mining tools, NOT Data Mining algorithms.
With a standard schema and corresponding database, users are able to select a DM software package with respect to its ability to meet high-level business objectives.
Automatic advice strategies such as METAL's Data Mining Advisor (see http://www.metal-kdd.org) or IDEA (see http://www.ifi.unizh.ch/ddis/Research/idea.htm) can then be used to assist users further in selecting the most appropriate algorithms/models for their specific tasks.
How About Combining Learners?
An obvious solution: Combine learners, as in boosting, bagging, stacked generalisation, etc.
However, no matter how elaborate the method, any algorithm that implements a fixed mapping from training sets to prediction models is subject to the limitations imposed by the law of conservation