© ELCA - 01 - 2004 CGCVT 1.0
There is no Free Lunch, but you’d be surprised how far a few well invested dollars can stretch…
Provo, UT, 15 January 2004
CS Graduate Colloquium Series
A Frog's Leaps

1989, 1994, 2001
Inductive Machine Learning

Design and implement algorithms that exhibit inductive capabilities.
Induction "involves intellectual leaps from the particular to the general".
From observations:
Construct a model that explains the (labelled) observations and generalises to new, previously unobserved situations, or
Extract rules/critical features that characterize the observations
Learning Tasks

Prediction: learn a function that associates a data item with the value of a prediction/response variable (e.g., credit worthiness)
Clustering / Segmentation: identify a set of (meaningful) categories or clusters to describe the data (e.g., a customer DB)
Dependency Modeling: find a model that describes significant dependencies, associations or affinities among variables (e.g., market baskets)
Change / Anomaly Detection: discover the most significant changes in the data from previously measured or normative values (e.g., fraud)
Classification Learning
Prediction where the response variable is discrete
Wide range of applications; the most-used DM technique (?)

Learning algorithms:
  Decision trees: ID3, C5.0, OC1, etc.
  Neural networks: FF/BP, RBF, ASOCS, etc.
  Rule induction: CN2, ILP, etc.
  Instance-based learning: kNN, NGE, etc.
The Challenge

I am faced with a specific classification task (e.g., am I looking at a rock or at a mine?), and I want a model with high accuracy.
Which classification algorithm should I use? Any of them? All of them? One in particular?
Assumptions

Attribute-value language: objects are represented by vectors of attribute-value pairs, where each attribute takes on only a finite number of values. There is a finite set of m possible attribute vectors.
Binary classification into {0,1}: the relationship between attribute vectors and classes may be specified by an m-component class probability vector C, where C_i is the probability that an object with attribute vector A_i is of class 1.
Data generation: attribute vectors are sampled with replacement according to an arbitrary distribution D, and a class is assigned to each object using C (i.e., class 1 with probability C_i and class 0 with probability 1 - C_i for an object with attribute vector A_i).
Data is generated the same way for both training and testing.
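As a minimal sketch (function and parameter names are hypothetical, not from the talk), the data-generation assumption transcribes directly into Python:

```python
import random

def generate(D, C, vectors, n, rng):
    """Sample n labelled objects: draw attribute vector vectors[i] with
    probability D[i], then assign class 1 with probability C[i], else 0."""
    data = []
    for _ in range(n):
        i = rng.choices(range(len(vectors)), weights=D)[0]
        label = 1 if rng.random() < C[i] else 0
        data.append((vectors[i], label))
    return data

# Two attribute vectors; a noise-free C (components in {0,1}) makes the
# labels deterministic given the vector
rng = random.Random(0)
sample = generate(D=[0.5, 0.5], C=[1.0, 0.0], vectors=[(0,), (1,)], n=10, rng=rng)
```

With a noise-free C as above, every object with vector (0,) is labelled 1 and every object with vector (1,) is labelled 0, regardless of the sampling seed.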
Definitions

A learning situation S is a triple (D, C, n), where D and C specify how data will be generated and n is the size of the training sample.
Generalization accuracy is the expected prediction performance on objects whose attribute vectors were not presented in the training sample.
The generalization accuracy of a random guesser is 0.5 for every D and C.
Generalization performance is generalization accuracy minus 0.5.
GP_L(S) denotes the generalization performance of a learner L in learning situation S.
Schaffer's Law of Conservation

For any learner L, and for every D and n:

    Σ_S GP_L(S) = 0

where the sum ranges over all learning situations S = (D, C, n) with the given D and n.
Verification of the Law (I)
The theorem holds for any arbitrary choice of D and n, but these are fixed when the sum is computed. Hence, summing over S is equivalent to summing over C
If we exclude the possibility of noise, the components of C are drawn from {0,1}
There are 2^m class probability vectors, and the theorem involves a sum, as written, of the 2^m corresponding terms
Verification of the Law (II)

Restricting to binary attributes, we have m = 2^a (where a is the number of attributes). Generate all 2^m = 2^(2^a) noise-free binary classification tasks
Compute each task's generalization performance using a leave-one-out procedure
Sum the generalization performances
Using ID3 as the learner, we performed the experiment for both a=3 and a=4
The results were … as expected!
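For intuition, the experiment can be reproduced in miniature. The sketch below is a hypothetical stand-in, substituting a 1-nearest-neighbour learner for ID3 and using a = 3; it enumerates all noise-free tasks and checks that the generalization performances sum to zero:

```python
from itertools import product

def one_nn(train, x):
    # Predict the label of the Hamming-nearest training vector
    hamming = lambda u, v: sum(ui != vi for ui, vi in zip(u, v))
    return min(train, key=lambda ex: hamming(ex[0], x))[1]

a = 3
vectors = list(product([0, 1], repeat=a))   # the m = 2^a attribute vectors
m = len(vectors)

total = 0.0
for labels in product([0, 1], repeat=m):    # all 2^m noise-free tasks
    correct = 0
    for i in range(m):                       # leave-one-out over the vectors
        train = [(vectors[j], labels[j]) for j in range(m) if j != i]
        correct += one_nn(train, vectors[i]) == labels[i]
    total += correct / m - 0.5               # generalization performance

print(total)  # 0.0: performances over all tasks cancel exactly
```

The cancellation is exact for any learner: the prediction on a held-out vector depends only on the other labels, so pairing tasks that differ only in that label yields one correct and one incorrect prediction.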
There is No Free Lunch

Some learners are impossible; in particular, there is no universal learner.
Every demonstration that the generalization performance of an algorithm is better than chance on a set of learning tasks is an implied demonstration that it is worse than chance on an alternative set of tasks.
Every demonstration that the generalization performance of an algorithm is better than that of another on a set of learning tasks is an implied demonstration that it is worse on an alternative set of tasks.
Averaged over all learning tasks, the generalization performance of any two algorithms is exactly identical.
Practical Implication

Either one shows that all "real-world" learning tasks come from some subset of the universe on which some (or all) learning algorithms perform well, or one must have some way of determining which algorithms will perform well on one's learning task.
Since it is difficult to know a priori all "real-world" learning tasks, we focus on the second alternative.
Note: "overfitting" UCI assumes the first option.
Going back to the Challenge
Which learner should I use?
Any of them? Too hazardous – you may select the wrong one
All of them? Too onerous – it will take too long
One in particular? Yes – but, how do I know which one is best?
Note: ML/KDD practitioners often narrow in on a subset of algorithms, based on experience
Finding a Mapping…

Classification tasks  →  ?  →  Classification algorithms
…through Meta-learning

Basic idea: learn a selection or ranking function for learning tasks.
Prerequisite: a description language for tasks.
Several approaches have been proposed. The most popular one relies on extracting statistical and information-theoretic measures from a data set. We have developed an alternative approach to task description, called landmarking.
Conjecture
THE PERFORMANCE OF A LEARNER ON A TASK UNCOVERS
INFORMATION ABOUT THE NATURE OF THAT TASK
Landmarking the Expertise Space

Each learner has an area of expertise, i.e., the class of tasks on which it performs particularly well, under some reasonable measure of performance.
A task can be described by the collection of areas of expertise to which it belongs.
A landmarker is a learning mechanism whose performance is used to describe tasks.
Landmarking is the use of landmarkers to locate tasks in the expertise space, i.e., the space of all areas of expertise.
Illustration

Labelled areas are areas of expertise of learners. Assume the landmarkers are i1, i2 and i3.
Possible inference: problems on which both i1 and i3 perform well, but on which i2 performs poorly, are likely to be in i4's area of expertise
Meta-learning with Landmarking

Landmarking concentrates on "cartographic" considerations: learners are used to signpost learners.
In principle, every learner's performance can signpost the location of a problem with respect to other learners' expertise.
The landmarkers' performance values are used as task descriptors, or meta-attributes, for meta-learning.
Exploring the meta-learning potential of landmarking amounts to investigating how well landmarkers' performances hint at the location of learning tasks in the expertise space.
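A toy sketch of that pipeline follows. Everything here is synthetic: the landmarker scores, the decision rule and the learner names are hypothetical stand-ins for real cross-validation results; only the shape of the computation (landmarker performances as meta-attributes, a meta-learner on top) comes from the talk.

```python
import random

rng = random.Random(0)

# Each meta-example: landmarker performance values as meta-attributes,
# labelled with the target learner that (in this toy world) performs best.
def make_meta_example():
    lin = rng.random()      # e.g., a linear-discriminant-like landmarker
    stump = rng.random()    # e.g., a single-decision-node landmarker
    nn = rng.random()       # e.g., an elite 1-NN landmarker
    best = "LinDiscr" if lin > max(stump, nn) else "C5.0"
    return (lin, stump, nn), best

meta_train = [make_meta_example() for _ in range(200)]

def recommend(scores):
    # 1-NN meta-learner in landmarker space: suggest the learner that was
    # best on the most similar previously seen task
    sqdist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(meta_train, key=lambda ex: sqdist(ex[0], scores))[1]

print(recommend((0.95, 0.30, 0.25)))  # a task that looks linearly separable
```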
Selecting Landmarkers

Two main considerations:
Computational complexity
  Statistical tests are expensive (up to O(n^3) for some): poor scalability, and the CPU time could have been allotted to a sophisticated learner of equal or better complexity.
  We limit ourselves to O(n log n) and, in any case, do not exceed the time needed to run the target learners.
Bias
  To adequately chart the learning space, we need landmarkers to measure different properties, at least implicitly (the set of target learners may guide the choice of bias).
Target Learners

The set of target learners consists of the following 10 popular learners:

  C5.0 decision tree
  C5.0 decision rules
  C5.0 decision tree with boosting
  Naive Bayes (MLC++ implementation)
  Instance-Based Learning (MLC++ implementation)
  Clementine's Multi-Layer Perceptron
  Clementine's Radial Basis Function Network
  RIPPER
  Linear Discriminant
  LTREE
Landmarkers

Typical landmarkers include:

Decision node: a single, most-informative decision node (based on information gain ratio). Aims to establish closeness to linear separability.
Randomly chosen node: a single decision node chosen at random. Informs, together with the next one, about irrelevant attributes.
Worst node: a single, least-informative decision node. Further informs on linear separability (if neither the best nor the worst attribute produces a single well-performing separation, it is likely that linear separation is not an adequate learning strategy).
Elite 1-Nearest Neighbour: standard 1-NN with the nearest neighbour computed on an elite subset of attributes (based on information gain ratio). Attempts to establish whether the task is relational, that is, whether it involves parity-like relationships between the attributes. In relational tasks, no single attribute is considerably more informative than all the others.
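The first landmarker can be sketched as follows. This is a minimal version under stated assumptions: binary attributes, and plain information gain in place of the gain ratio used in the talk; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def decision_node_landmarker(data):
    """Accuracy of the single most-informative decision node.
    data: list of (attribute_tuple, label) with binary attributes.
    Uses plain information gain (the talk uses gain ratio)."""
    labels = [y for _, y in data]
    base = entropy(labels)
    best_gain, best_j, best_leaves = -1.0, 0, {}
    for j in range(len(data[0][0])):
        split = {}
        for x, y in data:
            split.setdefault(x[j], []).append(y)
        gain = base - sum(len(s) / len(labels) * entropy(s) for s in split.values())
        if gain > best_gain:
            # majority label on each side of the split
            leaves = {v: Counter(s).most_common(1)[0][0] for v, s in split.items()}
            best_gain, best_j, best_leaves = gain, j, leaves
    return sum(best_leaves[x[best_j]] == y for x, y in data) / len(data)

# XOR-like (relational) task: no single node helps -> chance-level accuracy
xor = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)] * 4
# Linearly separable task: one attribute decides the class -> perfect accuracy
lin = [((a, b), a) for a in (0, 1) for b in (0, 1)] * 4
print(decision_node_landmarker(xor), decision_node_landmarker(lin))  # 0.5 1.0
```

The two toy tasks illustrate the intent: a high score hints that the task is close to linearly separable, while a chance-level score hints at a relational, parity-like task.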
Statistical Meta-attributes

Typical (simple) statistical meta-attributes include:

  Class entropy
  Average entropy of the attributes
  Mutual information
  Joint entropy
  Equivalent number of attributes
  Signal-to-noise ratio
Training Set I

Artificial datasets: 320 randomly generated Boolean datasets with between 5 and 12 attributes.
The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation.
Each dataset is labeled as:
  Learner L_k, where GP_Lk = max{GP_Li}
  Tie, if max{GP_Li} - min{GP_Li} < 0.1
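The labelling rule transcribes directly (the function name is hypothetical; the rule itself is from the slide):

```python
def label_dataset(gp):
    """gp: mapping learner name -> generalization performance on one dataset.
    Label with the winning learner, or "Tie" when the spread between the
    best and worst learner is under 0.1."""
    if max(gp.values()) - min(gp.values()) < 0.1:
        return "Tie"
    return max(gp, key=gp.get)

print(label_dataset({"C5.0": 0.32, "NB": 0.12, "IB1": 0.25}))  # C5.0
print(label_dataset({"C5.0": 0.30, "NB": 0.28, "IB1": 0.25}))  # Tie
```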
Landmarking vs Standard DC

Meta-learner   Landmarking   Standard DC   Combined
Majority       0.460         0.460         0.460
C5.0Tree       0.242         0.342         0.314
C5.0Rules      0.239         0.333         0.301
MLP            0.301         0.317         0.320
RBFN           0.289         0.323         0.304
LinDiscr       0.335         0.311         0.301
LTree          0.270         0.317         0.286
IB1            0.329         0.366         0.342
NB             0.429         0.407         0.363
Ripper         0.292         0.314         0.295
Average        0.303         0.337         0.314
Training Set II

Artificial datasets: 222 randomly generated Boolean datasets with 20 attributes.
The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation.
Three model classes: NN = {MLP, RBFN}, R = {RIPPER, C5.0 Rules}, DT = {C5.0, C5.0 Boosting, LTREE}.
Each dataset is labeled as:
  Model class M_k, if there is an L_i ∈ M_k with GP_Li > 1.1 · avg{GP_Lj}
  NoDiff, otherwise
Plus 18 UCI datasets.
Predicting Model Classes

Meta-learner   NN      R       DT
Majority       0.440   0.370   0.470
C5.0Tree       0.358   0.233   0.371
C5.0Rules      0.367   0.229   0.371
MLP            0.413   0.392   0.454
RBFN           0.333   0.225   0.375
LinDiscr       0.371   0.379   0.467
LTree          0.396   0.221   0.346
IB1            0.388   0.258   0.354
NB             0.433   0.421   0.421
Ripper         0.363   0.221   0.363
Measuring Performance Level

Experiments show that landmarking meta-learns. They do not, however, reflect the overall performance of a system whose end result is the accuracy of the selected learning model. We estimate it as follows:
  Train on the artificial datasets
  Test on the UCI datasets
  Report the average error difference between the actual best choice and the meta-learner's selected choice
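The per-dataset loss can be computed as follows (a sketch; the function name and the error values are hypothetical):

```python
def selection_loss(test_errors, selected):
    """Extra test error incurred by the meta-learner's choice on one dataset,
    relative to the actual best choice on that dataset."""
    return test_errors[selected] - min(test_errors.values())

# Hypothetical per-dataset test errors for the three model classes
errors = {"NN": 0.14, "R": 0.11, "DT": 0.17}
print(round(selection_loss(errors, "NN"), 3))  # 0.03
```

Averaging this quantity over the UCI test datasets gives the Loss column of the next slide; picking the worst class instead of the best gives the Max Loss bound.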
Performance Level

Model Class   Loss    Max Loss
NN            0.031   0.081
R             0.036   0.088
DT            0.021   0.096
METAL Home Page
Uploading Data
Characterising Data
Ranking
Parameter Setting
Results
Running Algorithms
Site Statistics
[Chart: registered users, split into Industry and Academia, May 2002 – Nov 2003]
[Chart: number of countries and number of datasets (excl. METAL), May 2002 – Dec 2003]
Conclusions

There is no universal learner.
Work on meta-learning holds promise: landmarking shows that learner preferences may be learned.
Open questions:
  Training data generation (incremental, …)
  Choice of meta-features (landmarking, structure, …)
  Choice of meta-learner (higher-order, …)
MLJ Special Issue on Meta-learning (2004), to appear
If There is Time Left…
Process View
Business Problem Formulation → Domain & Data Understanding → Data Pre-processing → Model Building → Interpretation & Evaluation → Dissemination & Deployment

Data flow: Raw Data → Selected Data → Pre-processed Data → Patterns/Models

Example (credit scoring):
  Business problem formulation: determine credit worthiness
  Domain & data understanding: learn about loans, repayments, etc.; collect data about past performance
  Data pre-processing: aggregate individual incomes into household income
  Model building: build a decision tree
  Interpretation & evaluation: check against a hold-out set
DM Tools

Three types:

Research tools: generally open source, no GUI, expert-driven.
Dedicated tools: commercial-strength, restricted to one type of task/algorithm, limited process support.
Packaged tools: commercial-strength, rich GUI, support for different types of tasks, rich collection of algorithms, DM process support.
A plethora of DM tools has emerged…
Tool Selection

The situation: increasing interest; many tools with the same basis but different content; research talk vs business talk.
Goal: assist business users in selecting a DM package based on high-level business criteria.
Journal of Intelligent Data Analysis, Vol. 7, No. 3 (2003)
Schema Definition

Data Mining Tool Description:

Business Goal: Customer Acquisition; Cross-/Up-selling; Product Development; Churn Prediction; Fraud Detection; Market-basket Analysis; Risk Assessment; Prediction/Forecasting; Outcomes Measurement; Condition Monitoring; Health Monitoring; Discovery

Model Types: Prediction (Classification; Regression); Description (Clustering/Segmentation; Dependency/Association; Anomaly/Change Detection; Time-series Analysis)

Process-dependent Features:
  Input Formats: Flat File; ODBC/JDBC; XML; Other (e.g., SAS); DBMS Embedded
  Pre-processing: Data Visualisation; Data Cleaning; Record Selection; Attribute Selection; Data Characterisation; Data Transformation
  Modelling Algorithms: Decision Trees; Rule Learning; Neural Networks; Linear/Logistic Regression; Association Learning; Text Mining; Instance-based Learning; Unsupervised Learning; Probabilistic Learning
  Interpretation/Evaluation: Cross-validation; Lift/Gain Charts; ROC Analysis; Independent Test Set; Summary Reports; Model Visualisation
  Dissemination/Deployment: Save/Reload Models; Comment Fields; Produce Executable; PMML/XML Export
  Scalability: Dataset Size Limits; Support for Parallelisation; Incrementality
  Miscellaneous: Expert Options; Batch Processing

User Interface Features: Layout (GUI vs Command Line); Visual Programming (Drag & Drop); On-line Help

System Requirements:
  Hardware Platform: PC Windows; Unix/Solaris
  Software Requirements: DB2; SAS Base; Oracle DB; Java/JRE
  Software Architecture: Standalone; Client-server; Thin Client

Vendor Information: Contact details (Address, URL, Email, etc.); Technical Support Level; Price Range (Free/Open Source; Low Cost; Medium Cost; Expensive); Free Demo/Evaluation Version; Market Penetration; Longevity; Integration/Validation
Comprehensive Survey

59 of the most popular tools.
Commercial: AnswerTree, CART / MARS, Clementine, Enterprise Miner, GainSmarts, GhostMiner, Insightful Miner, Intelligent Miner, KnowledgeSTUDIO, KXEN, MATLAB Neural Network Toolbox, NeuralWorks Predict, NeuroShell, Oracle Data Mining Suite, PolyAnalyst, See5 / Cubist / Magnum Opus, SPAD, SQL Server 2000, STATISTICA Data Miner, Teradata Warehouse Data Mining, etc.
Freeware: WEKA, Orange, YALE, SwissAnalyst.
Dynamic DB: updated regularly
Recipients
[Chart: number of recipients, 26/02/2002 – 26/12/2003]
Comments

The above dimensions characterize Data Mining tools, NOT Data Mining algorithms.
With a standard schema and corresponding database, users are able to select a DM software package with respect to its ability to meet high-level business objectives.
Automatic advice strategies such as METAL's Data Mining Advisor (see http://www.metal-kdd.org) or IDEA (see http://www.ifi.unizh.ch/ddis/Research/idea.htm) can then be used to assist users further in selecting the most appropriate algorithms/models for their specific tasks.
How About Combining Learners?
An obvious solution: Combine learners, as in boosting, bagging, stacked generalisation, etc.
However, no matter how elaborate the method, any algorithm that implements a fixed mapping from training sets to prediction models is subject to the limitations imposed by the law of conservation