The Class Imbalance Problem in Learning Classifier Systems: A Preliminary Study

Albert Orriols Puig, Ester Bernadó Mansilla

Enginyeria i Arquitectura La Salle, Ramon Llull University

IWLCS, June 25th, 2005

IWLCS'05: The Class Imbalance Problem in Learning Classifier Systems: A Preliminary Study


OUTLINE

1. Introduction
2. Description of UCS
3. Dataset Design
4. UCS on Unbalanced Datasets
5. Dealing with Imbalances
6. UCS in the Chk Problem
7. Contrasting Results with the Pos Problem
8. Conclusions

INTRODUCTION

Real-world domains: class imbalances in the samples taken

• Does this affect the learning performance of some well-known systems?
• If it does, how can we deal with imbalances?
• Do class imbalances affect the performance of UCS?

Supervised Learning Scheme

[Diagram: Input (example + class) → Population → match set → correct set (classifiers that predict the correct action) → Genetic Algorithm (discovery component)]

Parameters update:
- Classifier's Acc = #Correct / experience
- Fitness = acc^ν
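The parameter update on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Classifier` fields, the `update` helper and the default exponent `nu=10.0` are assumptions introduced here; the slide only gives the two formulas.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    experience: int = 0    # times the rule appeared in a match set
    correct: int = 0       # times it also appeared in the correct set
    accuracy: float = 0.0
    fitness: float = 0.0


def update(cl: Classifier, in_correct_set: bool, nu: float = 10.0) -> None:
    """Update one match-set classifier after a labelled example (nu is assumed)."""
    cl.experience += 1
    if in_correct_set:
        cl.correct += 1
    cl.accuracy = cl.correct / cl.experience   # Acc = #Correct / experience
    cl.fitness = cl.accuracy ** nu             # Fitness = acc^nu


cl = Classifier()
for hit in [True, True, False, True]:
    update(cl, hit)
print(cl.accuracy, cl.fitness)
```

Because fitness is a power of accuracy, a rule that is right three times out of four ends up with a much lower fitness than its accuracy suggests, which is what penalizes inaccurate rules.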

Chk Problem

- Two real attributes x, y ∈ [0,1]
- Two classes
- Permits varying complexity along:
  a. Concept complexity (c)
  b. Dataset size (s)
  c. Imbalance level (i)

Example: s=4096, c=4, i=2

#inst. maj. class = s/c^2 = 4096/16 = 256
#inst. min. class = s/(c^2 · 2^i) = 4096/(16·4) = 64
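The two counts on this slide follow directly from s, c and i. The helper below is a small sketch that reproduces the arithmetic; the function name `chk_counts` is hypothetical, and it assumes the two formulas are exactly s/c^2 and s/(c^2 · 2^i) as written on the slide.

```python
def chk_counts(s: int, c: int, i: int) -> tuple[int, int]:
    """Return (#inst. maj. class, #inst. min. class) using the slide's formulas."""
    majority = s // c**2
    minority = s // (c**2 * 2**i)
    return majority, minority


# Sweep the imbalance levels used in the experiments (i = 0..7).
for level in range(8):
    print(level, chk_counts(4096, 4, level))
```

For s=4096 and c=4 the majority count stays at 256 while the minority count halves with every increment of i, from 256 at i=0 down to 2 at i=7.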

We ran UCS in chk with s=4096, c=4 and i=[0..7]

[Figure: Training datasets for chk problem]

Obtaining the following results

[Figure: Boundaries evolved by UCS in the chk problem with imbalance levels from 0 to 7]

Analyzing the population evolved in higher imbalance levels

Rules for imbalance level i=4

Id   Condition                        Class  Acc   F     Num
1    [0.509, 0.750] [0.259, 0.492]    1      1.00  1.00  39
2    [0.000, 0.231] [0.252, 0.492]    1      1.00  1.00  38
3    [0.000, 0.248] [0.755, 1.000]    1      1.00  1.00  35
4    [0.761, 1.000] [0.000, 0.249]    1      1.00  1.00  34
5    [0.255, 0.498] [0.520, 0.730]    1      1.00  1.00  33
6    [0.751, 1.000] [0.514, 0.737]    1      1.00  1.00  31
7    [0.259, 0.498] [0.000, 0.244]    1      1.00  1.00  27
8    [0.501, 0.743] [0.751, 1.000]    1      1.00  1.00  18
9    [0.500, 0.743] [0.751, 1.000]    1      1.00  1.00  9
10   [0.751, 1.000] [0.531, 0.737]    1      1.00  1.00  8
…
18   [0.509, 0.750] [0.246, 0.492]    1      0.64  0.01  1
19   [0.000, 1.000] [0.000, 1.000]    0      0.94  0.54  20
20   [0.000, 1.000] [0.000, 0.990]    0      0.94  0.54  13
21   [0.012, 1.000] [0.000, 0.990]    0      0.94  0.54  10
…
64   [0.012, 1.000] [0.038, 0.973]    0      0.94  0.54  1

- 18 rules predicting the under-sized class
- 47 rules predicting the over-sized class

As the imbalance level increases, the accuracy of the over-general classifiers increases too, so they become stronger in the population.
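A short calculation shows why over-general rules gain accuracy as the imbalance grows. The sketch assumes the majority-to-minority ratio is 2^i : 1 (consistent with the counts given for the chk problem) and considers a rule that covers the whole input space and predicts the majority class; the function name is hypothetical.

```python
def overgeneral_accuracy(i: int) -> float:
    """Accuracy of a rule covering everything and predicting the majority class,
    assuming 2**i majority instances for every minority instance."""
    ratio = 2 ** i
    return ratio / (ratio + 1)


for level in range(8):
    print(f"i={level}: acc={overgeneral_accuracy(level):.2f}")
```

At i=4 this gives 16/17, about 0.94, which agrees with the accuracy listed for the rules predicting class 0 in the table for imbalance level i=4.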

Methods to deal with imbalances

• In the literature there are several methods to deal with imbalances
• We have considered 3 methods:
  – Random over-sampling [Jap02]
  – Adaptive sampling
  – Class-sensitive accuracy
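As an illustration of the first method, here is a generic random over-sampling sketch. The slide does not describe the exact procedure from [Jap02], so this is an assumption: minority instances are replicated at random until every class has as many instances as the largest one. The function name and the `(features, label)` data layout are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(data, seed=0):
    """data: list of (features, label) pairs.
    Returns a new list where every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in data:
        by_class[label].append((features, label))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # Draw the missing instances with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


data = [((0.1, 0.2), 0), ((0.3, 0.4), 0), ((0.5, 0.6), 0), ((0.7, 0.8), 0),
        ((0.9, 0.9), 1)]
balanced = random_oversample(data)
print(len(balanced))
```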

Adaptive Sampling

• Inspired by over-sampling and boosting
• It maintains a weight for each training instance. The weight is the probability of sampling this instance
• Each time an instance is selected for exploit, its weight is updated in the following way:

$$w_i \leftarrow \begin{cases} w_i\,(1 - \alpha) & \text{if correct} \\ w_i\,(1 + \alpha) & \text{otherwise} \end{cases}$$
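The weight update can be sketched as below. This is a minimal illustration under stated assumptions: the class name `AdaptiveSampler`, the initial weight of 1.0, the default `alpha=0.1` and the normalization of weights at sampling time are all choices made here, not details taken from the slide.

```python
import random


class AdaptiveSampler:
    def __init__(self, n_instances: int, alpha: float = 0.1, seed: int = 0):
        self.weights = [1.0] * n_instances   # assumed uniform start
        self.alpha = alpha
        self.rng = random.Random(seed)

    def sample(self) -> int:
        """Pick an instance index with probability proportional to its weight."""
        indices = range(len(self.weights))
        return self.rng.choices(indices, weights=self.weights, k=1)[0]

    def update(self, index: int, correct: bool) -> None:
        """Shrink the weight after a correct prediction, grow it otherwise."""
        factor = 1 - self.alpha if correct else 1 + self.alpha
        self.weights[index] *= factor


sampler = AdaptiveSampler(3, alpha=0.5)
sampler.update(0, correct=True)
sampler.update(1, correct=False)
print(sampler.weights, sampler.sample())
```

Instances that keep being misclassified accumulate weight and are drawn more often, which is the boosting-like behaviour the slide alludes to.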

Class-sensitive accuracy

• We compute accuracy for each class:

$$acc_i = \frac{C_i}{exp_i}$$

  C_i = number of examples of class i correctly classified
  exp_i = number of examples of class i covered by the rule

• The compound accuracy:

$$acc = \begin{cases} \dfrac{1}{C_e} \sum_{i \mid exp_i > 0} acc_i & \text{if } \forall i: exp_i \ge \theta_{acc} \\[2ex] \sum_{i \mid exp_i > 0} w_i \, acc_i & \text{otherwise} \end{cases}$$

  C_e = number of different classes that a rule covers

• Where

$$w_i = \begin{cases} \dfrac{exp_i}{C_e \cdot \theta_{acc}} & \text{if } 0 < exp_i < \theta_{acc} \\[2ex] \dfrac{1}{C_{ee}} \left( 1 - \sum_{j \mid 0 < exp_j < \theta_{acc}} \dfrac{exp_j}{C_e \cdot \theta_{acc}} \right) & \text{if } exp_i \ge \theta_{acc} \end{cases}$$

  C_ee = number of experienced classes
  θ_acc = threshold below which a class is inexperienced
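A sketch of how a compound, class-sensitive accuracy could be computed for one rule. It rests on an assumed reading of the slide's formula: classes with experience below θ_acc get weight exp_i / (C_e · θ_acc), experienced classes share the remaining weight equally, and when every covered class is experienced the result is the plain mean of the per-class accuracies. The function and argument names are hypothetical.

```python
def compound_accuracy(correct, exp, theta_acc):
    """correct[i], exp[i]: per-class correct and covered counts for one rule."""
    covered = [i for i, e in enumerate(exp) if e > 0]
    if not covered:
        return 0.0
    acc = {i: correct[i] / exp[i] for i in covered}   # acc_i = C_i / exp_i
    c_e = len(covered)                                # classes the rule covers
    experienced = [i for i in covered if exp[i] >= theta_acc]
    if len(experienced) == c_e:
        return sum(acc.values()) / c_e
    # Inexperienced classes: weight grows linearly with their experience.
    w = {i: exp[i] / (c_e * theta_acc) for i in covered if exp[i] < theta_acc}
    if experienced:
        share = (1.0 - sum(w.values())) / len(experienced)
        for i in experienced:
            w[i] = share
    return sum(w[i] * acc[i] for i in covered)


# An over-general rule: right on 94 majority examples, wrong on 6 minority ones.
print(compound_accuracy([94, 0], [94, 6], theta_acc=5))
```

With both classes experienced, the over-general rule scores 0.5 instead of the 0.94 it would get from the raw ratio of correct predictions, so it no longer looks strong.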

[Figure: Oversampling]

[Figure: Adaptive sampling]

[Figure: Class-sensitive accuracy]

Pos Problem

- Multiple classes and different imbalance levels
- Condition = binary string of length L
- Class = Position of the leftmost one-valued bit

Condition  Action
00000      0
00001      1
0001#      2
001##      3
01###      4
1####      5

Optimal ruleset for the pos5 problem
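The class definition and the resulting imbalance can be checked with a small sketch. It assumes, following the pos5 ruleset, that positions are counted from the right starting at 1 and that the all-zeros string belongs to class 0; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def pos_class(bits: str) -> int:
    """Position of the leftmost one-valued bit, counted from the right (0 if none)."""
    first_one = bits.find("1")
    return 0 if first_one == -1 else len(bits) - first_one


def class_sizes(length: int) -> dict[int, int]:
    """Number of distinct binary strings of the given length in each class."""
    counts = Counter(pos_class("".join(b)) for b in product("01", repeat=length))
    return dict(sorted(counts.items()))


print(class_sizes(5))
```

For pos5 the class sizes are 1, 1, 2, 4, 8 and 16, so each additional bit doubles the largest class while classes 0 and 1 keep a single instance. This is where the different imbalance levels between classes come from.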

Running pos8 – pos15 with raw UCS

[Figure: Percentage of optimal population achieved]

As the imbalance level increases, the system has difficulty discovering the most specific rules.

Contrasting results with Pos problem

Oversampling

Condition  Class
00000000   0
00000001   1
0000001#   2
000001##   3
00001###   4
0001####   5
001#####   6
01######   7
1#######   8

Wrong rule: #0000000:0
Example = 00000000:0 (128)
Counter-example = 10000000:8 (1)

The counter-examples for rules that over-generalize the most specific optimal rules are sampled at a very low rate.
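The effect of the low sampling rate on the wrong rule can be illustrated numerically. This sketch assumes that the numbers in parentheses on the slide are how many times each instance is presented (128 for the example, 1 for the counter-example), which the slide does not state explicitly. The wrong rule is written with eight symbols so that it matches the eight-bit inputs, and the helper names are hypothetical.

```python
def matches(condition: str, bits: str) -> bool:
    """Ternary matching: '#' matches either bit value."""
    return len(condition) == len(bits) and all(
        c in ("#", b) for c, b in zip(condition, bits)
    )


def rule_accuracy(condition: str, predicted: int, samples) -> float:
    """samples: list of (bits, label, copies) triples."""
    covered = [(label, n) for bits, label, n in samples if matches(condition, bits)]
    total = sum(n for _, n in covered)
    right = sum(n for label, n in covered if label == predicted)
    return right / total if total else 0.0


samples = [("00000000", 0, 128), ("10000000", 8, 1)]
acc = rule_accuracy("#0000000", 0, samples)
print(f"{acc:.3f}")
```

Under this assumption the over-general rule is correct 128 times out of 129, so its accuracy stays above 0.99 and the single counter-example has little chance of exposing it.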

Contrasting results with Pos problem

[Figure: Adaptive Sampling]

Contrasting results with Pos problem

[Figure: Class-sensitive accuracy]

Conclusions

• The class imbalance problem has proved to be a real problem in UCS.
• All tested strategies to deal with class imbalances improve the results of raw UCS in the Chk problem.
• The analysis in Pos revealed many drawbacks of the oversampling method. This leads us to discard this method for real-world problems.

Further Work

• Enhance the study with other LCS (preliminary experiments made with GAssist [Bac04] and Hider [Agu04])
• Extend this analysis to other classifier schemes: C4.5 and SVM
• Extend the analysis with other artificial and real problems

Thanks for your attention