Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems


Page 1: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Evolutionary Computation

Genetic Algorithms

Genetic Programming

Learning Classifier Systems

Page 2: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Genetic Algorithms

• Population-based technique for discovery of knowledge structures

• Based on idea that evolution represents search for optimum solution set

• Massively parallel

Page 3: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The Vocabulary of GAs

• Population
  – Set of individuals, each represented by one or more strings of characters

• Chromosome
  – The string representing an individual

Page 4: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The Vocabulary of GAs, contd.

• Gene
  – The basic informational unit on a chromosome

• Allele
  – The value of a specific gene

• Locus
  – The ordinal place on a chromosome where a specific gene is found

Page 5: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Thus...

Chromosome: 011010

Gene (Allele = "0")

Locus = 5

Page 6: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Genetic operators

• Reproduction
  – Increase representations of strong individuals

• Crossover
  – Explore the search space

• Mutation
  – Recapture "lost" genes due to crossover

Page 7: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Genetic operators illustrated...

Simple reproduction
  Parent 1:    011010
  Parent 2:    000110
  Offspring 1: 011010
  Offspring 2: 000110

Reproduction with crossover at locus 3
  Parent 1:    011010
  Parent 2:    000110
  Offspring 1: 011110
  Offspring 2: 000010

Simple reproduction with mutation at locus 3 for offspring 1
  Parent 1:    011010
  Parent 2:    000110
  Offspring 1: 010010
  Offspring 2: 000110
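The crossover and mutation examples on this slide can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, loci are counted from 1 at the left end of the string, and "crossover at locus 3" is taken to mean that the offspring swap everything after the third gene.

```python
def crossover(parent1: str, parent2: str, locus: int) -> tuple[str, str]:
    """Single-point crossover: the offspring exchange the genes after `locus` (1-based)."""
    child1 = parent1[:locus] + parent2[locus:]
    child2 = parent2[:locus] + parent1[locus:]
    return child1, child2


def mutate(chromosome: str, locus: int) -> str:
    """Flip the binary allele found at `locus` (1-based)."""
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


parent1, parent2 = "011010", "000110"
print(crossover(parent1, parent2, 3))  # ('011110', '000010')
print(mutate(parent1, 3))              # 010010
```

Simple reproduction needs no function at all: each offspring is a verbatim copy of its parent.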

Page 8: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

GAs rely on the concept of “fitness”

• Ability of an individual to survive into the next generation

• “Survival of the fittest”

• Usually calculated in terms of an objective fitness function
  – Maximization
  – Minimization
  – Other functions

Page 9: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Genetic Programming

• Based on adaptation and evolution

• Structures undergoing adaptation are computer programs of varying size and shape

• Computer programs are genetically “bred” over time

Page 10: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The Learning Classifier System

• Rule-based knowledge discovery and concept learning tool

• Operates by means of evaluation, credit assignment, and discovery applied to a population of “chromosomes” (rules) each with a corresponding “phenotype” (outcome)

Page 11: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Components of a Learning Classifier System

• Performance
  – Provides interaction between environment and rule base
  – Performs matching function

• Reinforcement
  – Rewards accurate classifiers
  – Punishes inaccurate classifiers

• Discovery
  – Uses the genetic algorithm to search for plausible rules

Page 12: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The Learning Classifier System

• Rule-based knowledge discovery and concept learning tool

• EpiCS
  – First Learning Classifier System designed for use in epidemiologic surveillance
  – Supervised learning environment

Page 13: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Knowledge Representation

• Classifiers
  – IF-THEN rules
    • Condition = "genotype"
    • Action = "phenotype"
  – Strength metric
  – Encoded as bit strings or numerics

• Population
  – Fixed-size collection of classifiers

Page 14: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Low-level knowledge representation: The Classifier

0111*00011*111:0   34.9

  Taxon:      0111*00011*111
  Action bit: 0
  Strength:   34.9

• Taxon is analogous to a condition (LHS) of an IF-THEN rule

• Action bit is analogous to an action (RHS) of an IF-THEN rule

• Strength is an internal fitness function
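The classifier layout above maps naturally onto a small record type. The following is a minimal sketch, not the EpiCS implementation: the `Classifier` class and `parse_classifier` helper are hypothetical names, and the "taxon:action strength" text layout is assumed from the example on the slide. A `*` in the taxon is treated as a wildcard that matches either bit.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    taxon: str       # condition over {'0', '1', '*'}; '*' matches either bit
    action: str      # single action bit
    strength: float  # internal fitness measure

    def matches(self, case: str) -> bool:
        """True if every non-wildcard position of the taxon equals the case bit."""
        return len(case) == len(self.taxon) and all(
            t == "*" or t == c for t, c in zip(self.taxon, case)
        )


def parse_classifier(text: str) -> Classifier:
    """Parse the assumed 'taxon:action strength' layout."""
    rule, strength = text.split()
    taxon, action = rule.split(":")
    return Classifier(taxon, action, float(strength))


clf = parse_classifier("0111*00011*111:0 34.9")
print(clf.matches("01110000110111"))  # True: wildcards absorb positions 5 and 11
print(clf.matches("11110000110111"))  # False: first bit differs
```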

Page 15: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

High-level knowledge representation: Macrostate Population

Page 16: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Components of a learning classifier system

• Performance
  – Provides interaction between environment and classifier population
  – Performs matching function

• Reinforcement
  – Rewards accurate classifiers
  – Punishes inaccurate classifiers

• Discovery
  – Uses the genetic algorithm to search for plausible knowledge structures

Page 17: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Generic Machine Learning Model

Page 18: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

A Generic Learning Classifier System

[Figure: block diagram of a generic learning classifier system. Labeled elements: Input, Performance component, Classifier population, Reinforcement component, Discovery component, Output.]

Page 19: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

EpiCS: A Learning Classifier System

[Figure: EpiCS architecture diagram. Labeled elements: Environment, Detectors, Effector, Population [P], Match Set [M], Correct Set [C], Not-correct Set Not[C], Performance Component, Reinforcement Component (Reinforcement/Penalty Regime), Discovery Component (Covering, Genetic Algorithm).]

Example shown in the figure:
  Input from detectors: 10010
  [P]:    01001:1  10010:0  1*010:1  ---  **110:1  1*001:0
  [M]:    10**0:1  1**1*:1  1001*:1  ***10:0  100*0:0
  [C]:    10**0:1  1**1*:1  1001*:1   (strengths 0.560, 0.334, 0.871)
  Not[C]: ***10:0  100*0:0            (strengths 0.560, 0.334)
  Decision (=1)

Page 20: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

EpiCS: Performance Component

[Figure: EpiCS architecture diagram, as on the slide "EpiCS: A Learning Classifier System".]

Page 21: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Performance component

• Creates a subset (the matchset, [M]) of all classifiers in population [P] whose conditions match a string received from the environment

• From [M], a single classifier is selected, based on its strength as a proportion of the sum of all strengths in [M]

• The action of this classifier is then used as the output of the system
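The three steps above (build [M], pick one classifier in proportion to strength, emit its action) can be sketched as follows. This is an illustration under stated assumptions rather than the EpiCS code: classifiers are plain dictionaries, the function names are hypothetical, and the taxa and strengths are example values loosely based on the architecture diagram, with the non-matching classifier invented for contrast.

```python
import random


def matches(taxon: str, case: str) -> bool:
    """A '*' in the taxon matches either bit."""
    return len(taxon) == len(case) and all(t == "*" or t == c for t, c in zip(taxon, case))


def select_action(population, case, rng=random):
    """Return (action, match_set) for one input string from the environment."""
    match_set = [clf for clf in population if matches(clf["taxon"], case)]
    if not match_set:
        return None, match_set  # a full system would invoke covering here
    # Roulette-wheel selection: probability = strength / sum of strengths in [M].
    weights = [clf["strength"] for clf in match_set]
    chosen = rng.choices(match_set, weights=weights, k=1)[0]
    return chosen["action"], match_set


population = [
    {"taxon": "10**0", "action": "1", "strength": 0.560},
    {"taxon": "1**1*", "action": "1", "strength": 0.334},
    {"taxon": "1001*", "action": "1", "strength": 0.871},
    {"taxon": "***10", "action": "0", "strength": 0.560},
    {"taxon": "100*0", "action": "0", "strength": 0.334},
    {"taxon": "01001", "action": "1", "strength": 0.500},  # does not match 10010
]
action, m = select_action(population, "10010", random.Random(0))
print(len(m), action)
```

Because selection is stochastic, a weak classifier can still be chosen occasionally, which keeps the system from committing too early to a single rule.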

Page 22: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

EpiCS: Reinforcement Component

[Figure: EpiCS architecture diagram, as on the slide "EpiCS: A Learning Classifier System".]

Page 23: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Reinforcement component

• Correct set [C] is created from classifiers in [M] advocating correct decisions

• Remaining classifiers in [M] form Not[C]

• Tax is deducted from the strengths of all classifiers in [C]

• Reward is added to the strengths of all classifiers in [C], biased for generality

• Penalty is deducted from the strengths of all classifiers in Not[C]
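A rough sketch of this reinforcement step is given below. The slide does not state the tax rate, the reward size, the penalty size, or how the generality bias is computed, so everything numeric here is an assumption made purely for illustration: tax and penalty are taken as fractions of current strength, the reward is shared equally among [C], and the generality bias scales the reward by the fraction of wildcards in the taxon.

```python
def reinforce(match_set, correct_action, tax=0.1, reward=1.0, penalty=0.5):
    """Split [M] into [C] and Not[C], then adjust strengths in place.

    All rates and the form of the generality bias are hypothetical.
    """
    correct = [c for c in match_set if c["action"] == correct_action]
    not_correct = [c for c in match_set if c["action"] != correct_action]

    for clf in correct:
        clf["strength"] -= tax * clf["strength"]                  # tax on [C]
        generality = clf["taxon"].count("*") / len(clf["taxon"])  # share of wildcards
        clf["strength"] += reward * (1 + generality) / len(correct)

    for clf in not_correct:
        clf["strength"] -= penalty * clf["strength"]              # penalty on Not[C]

    return correct, not_correct


m = [
    {"taxon": "10**0", "action": "1", "strength": 0.5},
    {"taxon": "***10", "action": "0", "strength": 0.4},
]
c, not_c = reinforce(m, correct_action="1")
print([round(x["strength"], 2) for x in m])  # [1.85, 0.2]
```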

Page 24: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

EpiCS: Discovery Component

[Figure: EpiCS architecture diagram, as on the slide "EpiCS: A Learning Classifier System".]

Page 25: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Discovery component

• Genetic algorithm invoked once per iteration

• One new offspring is created, from parents deterministically selected based on strength

• The single offspring replaces weakest classifier in the population
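One iteration of the discovery component described above might look like the sketch below. Several details are assumptions, since the slide does not give them: "deterministically selected based on strength" is read as "take the two strongest classifiers", the offspring is produced by single-point crossover of the parents' taxa, it inherits the first parent's action, and it starts with the mean of the parents' strengths.

```python
def discovery_step(population, crossover_locus):
    """Create one offspring from the two strongest classifiers and
    overwrite the weakest classifier in the population (in place)."""
    ranked = sorted(population, key=lambda c: c["strength"], reverse=True)
    parent1, parent2 = ranked[0], ranked[1]

    child = {
        "taxon": parent1["taxon"][:crossover_locus] + parent2["taxon"][crossover_locus:],
        "action": parent1["action"],                                   # assumed
        "strength": (parent1["strength"] + parent2["strength"]) / 2,  # assumed
    }

    weakest = min(range(len(population)), key=lambda i: population[i]["strength"])
    population[weakest] = child  # population size stays fixed
    return child


population = [
    {"taxon": "10**0", "action": "1", "strength": 0.9},
    {"taxon": "*1011", "action": "1", "strength": 0.8},
    {"taxon": "***10", "action": "0", "strength": 0.1},
]
print(discovery_step(population, crossover_locus=2)["taxon"])  # 10011
```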

Page 26: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Features of EpiCS

• Object-oriented implementation

• Stimulus-response architecture

• Payoff/Penalty reinforcement regime

• Syntactic control of overgeneralization

• Differential penalty control of undergeneralization

• Ability to compute risk of outcome

Page 27: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Discovering risk with EpiCS

• Output decision of the learning classifier system is probability of disease (CSPD), rather than dichotomous decision

• CSPD determined from proportion of classifiers matching a given input case’s taxon

Page 28: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Discovering risk with EpiCS: The specifics

\[
\mathrm{CSPD} = \frac{\sum (\text{probabilities of classifiers associated with disease})}{\sum (\text{probabilities of all classifiers with matching taxa})} = \frac{0.95}{1.00} = 0.95
\]
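As a sketch of how such a proportion could be computed, the function below sums a per-classifier weight over the matching classifiers that advocate disease and divides by the sum over all matching classifiers. The `weight` field, the function names, and the example values are assumptions for illustration; the slide does not say how each classifier's contribution is defined.

```python
def matches(taxon: str, case: str) -> bool:
    """A '*' in the taxon matches either bit."""
    return len(taxon) == len(case) and all(t == "*" or t == c for t, c in zip(taxon, case))


def cspd(population, case, disease_action="1"):
    """Classifier-system-derived probability of disease for one input case."""
    matching = [c for c in population if matches(c["taxon"], case)]
    total = sum(c["weight"] for c in matching)
    if total == 0:
        return None  # no matching classifier: the case is indeterminate
    disease = sum(c["weight"] for c in matching if c["action"] == disease_action)
    return disease / total


population = [
    {"taxon": "10**0", "action": "1", "weight": 0.60},
    {"taxon": "1001*", "action": "1", "weight": 0.35},
    {"taxon": "***10", "action": "0", "weight": 0.05},
    {"taxon": "0****", "action": "0", "weight": 0.90},  # does not match the case
]
print(round(cspd(population, "10010"), 2))  # 0.95
```

Returning a probability rather than a 0/1 decision is what allows the output to be thresholded and compared with logistic regression on an ROC curve.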

Page 29: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Discovery of Predictive Models in an Injury Surveillance Database:

An Application of Data Mining in Clinical Research

Page 30: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Partners for Child Passenger Safety Information Infrastructure

[Figure: organizations in the PCPS information infrastructure]
• State Farm Insurance Companies
• CHOP
• University of PA
• Dynamic Science, Inc.
• Response Analysis Corporation

Page 31: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Why data mining is needed for PCPS

• Large number of raw and derived variables renders traditional "manual" methods for discovering patterns in data unwieldy

• Hypothesis-driven (biased) analyses may lead to missed associations

• Constantly changing patterns in prospective data require constantly changing analytic approaches that can be informed by data mining

Page 32: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Candidate Predictors

• Demographics

• Kinematics

• Characteristics of crash

• Restraint use

Page 33: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Outcome: Head Injury

• Major burns involving the head

• Skull fracture

• Evidence of brain injury reported by respondent
  – Excessive sleepiness
  – Difficulty in arousing
  – Unresponsiveness
  – Amnesia after accident

Page 34: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Data Preparation

• Pool of 8,334 records

• 20 separate datasets created
  – All cases of head injury included (N=415)
  – Equal number of non-head injury cases randomly drawn from pool

• Each dataset randomly sampled to create mutually exclusive training and testing sets of equal size

Page 35: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Comparison Methods: Logistic Regression

• Variables from training sets stepped into model to determine significant terms

• Significant terms used to create new risk model:

\[
\hat{P}(y) = \frac{1}{1 + e^{-(\alpha + \beta_1 x_1 + \dots + \beta_n x_n)}}
\]

• Risk model applied to cases in testing set

• Risk estimates categorized by deciles and used to construct ROC curves
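For readers who want to see the risk model as code, here is a minimal sketch of the standard logistic form, with an intercept plus one coefficient per significant term. The names `alpha`, `betas`, and `logistic_risk` are illustrative, and the coefficient values in the example are made up; they are not the fitted values from the study.

```python
import math


def logistic_risk(alpha: float, betas: list[float], x: list[float]) -> float:
    """Estimated probability of the outcome for one case.

    alpha : intercept
    betas : coefficients of the significant terms
    x     : the case's values for those terms, in the same order
    """
    linear = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical two-term model applied to one testing case.
print(round(logistic_risk(-1.0, [0.5, 0.25], [2.0, 0.0]), 2))  # 0.5
```

Applying the function to every case in a testing set yields the risk estimates that are then binned by decile for the ROC curve.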

Page 36: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Comparison Methods: Decision Tree Induction

• C4.5 used to create decision trees from training sets

• 10-fold cross-validation used to optimize trees

• Optimized trees used by C4.5RULES to classify cases in testing set

Page 37: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Experimental Procedure

Trial

  Training Phase
    for a=1 to maximum number of training epochs

      Training Epoch
        for x=1 to 100
          present randomly selected training case

      Interim Evaluation Phase
        for x=1 to number of training cases
          evaluate training case x

  Testing Epoch (genetic algorithm inactive)
    for x=1 to number of testing cases
      evaluate testing case x

Page 38: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Results: Training

[Figure: AUC and Indeterminate Rate (y-axis, 0 to 1) plotted against Iterations (x-axis, 0 to 10,000).]

Page 39: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Results: Training

• EpiCS
  – 5,000 unique classifiers reduced to 2,314 by the end of training

• Logistic regression
  – Single model with eight significant terms, no significant interactions

• C4.5
  – 11 rules created for each training set, most with single conjuncts

Page 40: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Results: Prediction

Area under the ROC curve obtained on testing, averaged over the 20 separate studies

Page 41: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

And now for something a little different

The XCS model

Page 42: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

XCS: A little history

• Wilson, SW: Evolutionary Computation, 2(1), 1-18 (1994)
  – ZCS

• Wilson, SW: Evolutionary Computation, 3(2), 149-175 (1995)
  – The seminal work on XCS

• Many papers by Lanzi, Barry, Butz, and others

• Butz, M and Wilson, SW: Advances in Learning Classifier Systems. Third International Workshop (IWLCS-2000), Lecture Notes in Artificial Intelligence (LNAI-1996). Berlin: Springer-Verlag (2001)
  – The algorithm paper

Page 43: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

What is XCS?

• An LCS that differs from the traditional Holland model
  – Classifier fitness is based on the accuracy of the classifier's payoff prediction, rather than the prediction itself
  – The genetic algorithm is restricted to niches in the action set, rather than applied to the classifier population as a whole

• The major feature is graceful, accurate generalization

Page 44: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

XCS in a nutshell

Source: Wilson, XCS tutorial

((43*99)+(27*3))/102

Action: 00

Action: 01
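The expression on this slide reads as a fitness-weighted average of two payoff predictions for one action, which is how XCS fills in its prediction array. Treating 43 and 27 as the predictions and 99 and 3 as the corresponding fitnesses is an interpretation of the figure, and the function name below is hypothetical.

```python
def system_prediction(classifiers):
    """Fitness-weighted average of payoff predictions.

    classifiers: list of (prediction, fitness) pairs advocating the same action.
    """
    weighted = sum(prediction * fitness for prediction, fitness in classifiers)
    total_fitness = sum(fitness for _, fitness in classifiers)
    return weighted / total_fitness


# ((43*99)+(27*3))/102
print(round(system_prediction([(43, 99), (27, 3)]), 2))  # 42.53
```

The high-fitness classifier dominates: the result sits close to 43 because its fitness (99) far outweighs that of the classifier predicting 27.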

Page 45: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

EpiXCS: An XCS-Based Learning

Classifier System for Epidemiologic Research

Page 46: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Outline

• What is it?

• EpiXCS architecture
  – Data encoding
  – Evaluation metrics
  – Reinforcement
  – Missing values handling
  – Classifier ranking
  – Risk assessment

• Test case: Pima Indians Diabetes Data

Page 47: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

What is EpiXCS?

• Learning classifier system based on the XCS paradigm
  – Uses the Lanzi C++ kernel

• Designed for use in epidemiologic research, specifically mining disease surveillance databases in supervised learning environments
  – Visualization by non-LCS users
  – Sensitive to demands of clinical data

Page 48: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Data Encoding in EpiXCS

• All numeric data formats permissible
  – Binary
  – Categorical
  – Ordinal
  – Real

• Non-binary data represented using "center-spread" approach
  – Two genes per feature

• Actions are limited to binary (for now)

Page 49: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Sample input data format (Wisconsin Breast Cancer Database)

ATTRIBUTE 0 <WILD "99"><REAL><STRING "Clump Thickness">
ATTRIBUTE 1 <WILD "99"><REAL><STRING "Uniformity of Cell Size">
ATTRIBUTE 2 <WILD "99"><REAL><STRING "Uniformity of Cell Shape">
ATTRIBUTE 3 <WILD "99"><REAL><STRING "Marginal Adhesion">
ATTRIBUTE 4 <WILD "99"><REAL><STRING "Single Epithelial Cell Size">
ATTRIBUTE 5 <WILD "99"><REAL><STRING "Bare Nuclei">
ATTRIBUTE 6 <WILD "99"><REAL><STRING "Bland Chromatin">
ATTRIBUTE 7 <WILD "99"><REAL><STRING "Normal Nucleoli">
ATTRIBUTE 8 <WILD "99"><REAL><STRING "Mitoses">
ACTION 9 <STRING "Malignant">
5 4 4 5 7 10 3 2 1 0
3 1 1 1 2 2 3 1 1 0
8 10 10 8 7 10 9 7 1 1
…

Page 50: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Classifier Population Initialization

• Minima and maxima for each attribute determined automatically at start of run

• Center values can be initialized by user
  – Mean
  – Median
  – Random value between spread

• Spread values can be initialized by user
  – Standard deviation
  – Quantile

Page 51: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Sample Macroclassifiers

/5.5,5.5/107.5,51.5/64.0,21.0/#/316.0,160.0/16.55,16.55/#/#/:1
/3.0,3.0/#/91.5,30.5/#/#/#/#/26.0,5.0/:1
/2.5,2.5/119.5,63.5/#/#/#/#/#/#/:1
/2.5,2.5/107.5,51.5/64.0,21.0/#/#/#/#/49.0,28.0/:1
/2.5,2.5/#/66.0,22.0/#/317.5,301.5/#/1.0040,0.9260/49.0,28.0/:1
/2.5,2.5/107.5,51.5/64.0,21.0/#/#/#/1.0735,0.9955/#/:1
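A macroclassifier in this notation can be read as one "center,spread" pair (or a `#` wildcard) per attribute, followed by the action. The sketch below parses that layout and tests whether a case matches, assuming a gene matches when the attribute value lies within center ± spread, bounds included. The function names and the example case are hypothetical, and the inclusive-interval rule is an assumption about center-spread matching rather than something stated on the slides.

```python
def parse_macroclassifier(text: str):
    """Split '/c,s/c,s/#/.../:action' into a list of genes and an action.

    Each gene is a (center, spread) tuple, or None for the '#' wildcard.
    """
    body, action = text.strip().rsplit(":", 1)
    genes = []
    for field in body.strip("/").split("/"):
        if field == "#":
            genes.append(None)
        else:
            center, spread = (float(v) for v in field.split(","))
            genes.append((center, spread))
    return genes, int(action)


def matches(genes, case) -> bool:
    """A case matches if every non-wildcard value lies in [center - spread, center + spread]."""
    return len(genes) == len(case) and all(
        g is None or abs(value - g[0]) <= g[1] for g, value in zip(genes, case)
    )


genes, action = parse_macroclassifier("/2.5,2.5/119.5,63.5/#/#/#/#/#/#/:1")
case = [3, 120, 70, 20, 80, 30.0, 0.5, 40]  # made-up values for the 8 attributes
print(len(genes), action, matches(genes, case))  # 8 1 True
```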

Page 52: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Evaluation Metrics

• Sensitivity

• Specificity

• Area under the ROC curve

• Predictive values

• Accuracy

• Learning rate

Page 53: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

A Fast Primer on Test Evaluation

Page 54: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Sensitivity

• Prior probability of a test-positive: the probability that a truly positive case tests positive

• If it's high, then one would want to use the test to diagnose (classify positive)

• If a classifier's Se is high, then that classifier should be more likely to be used in defining a Correct Set when a training case is known positive

Page 55: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Specificity

• Prior probability of a test-negative: the probability that a truly negative case tests negative

• If it's high, then one would want to use the test to rule out (classify negative)

• If a classifier's Sp is high, then that classifier should be more likely to be used in defining a Correct Set when a training case is known negative

Page 56: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The Predictive Values

• Posterior probability of a test-positive or negative

• If a PPV is high, then once one has the test result in hand, and it predicts positive, it would be considered to be accurate

• If an NPV is high, then once one has the test result in hand, and it predicts negative, it would be considered to be accurate
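The four quantities in this primer all come from the same 2×2 table of test result against true status. The sketch below uses the standard definitions; the function name and the counts in the example are made up for illustration.

```python
def evaluate_test(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard test-evaluation metrics from a 2x2 confusion table.

    tp, fn: truly positive cases that test positive / negative
    tn, fp: truly negative cases that test negative / positive
    """
    return {
        "Se": tp / (tp + fn),    # P(test positive | truly positive)
        "Sp": tn / (tn + fp),    # P(test negative | truly negative)
        "PPV": tp / (tp + fp),   # P(truly positive | test positive)
        "NPV": tn / (tn + fn),   # P(truly negative | test negative)
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


metrics = evaluate_test(tp=90, fp=20, fn=10, tn=80)
print({name: round(value, 3) for name, value in metrics.items()})
# {'Se': 0.9, 'Sp': 0.8, 'PPV': 0.818, 'NPV': 0.889, 'Accuracy': 0.85}
```

Se and Sp condition on the true status, whereas PPV and NPV condition on the test result, which is why the latter depend on how common the positive class is in the data.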

Page 57: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

How these metrics are used in EpiXCS

• To evaluate classification performance
  – Training
    • Se, Sp, AUC, Accuracy, and Indeterminate Rate are plotted every 100th iteration
  – Testing
    • Se, Sp, AUC, Accuracy, and Indeterminate Rate are obtained for the testing set

Page 58: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

How these metrics are used in EpiXCS

• To evaluate learning

  – Shoulder is the iteration at which 95% of the maximum AUC obtained during training is first attained
  – AUCShoulder is the AUC obtained at the shoulder
  – The two are combined into a single measure of learning and classification performance:

\[
\frac{\mathrm{AUC}_{\mathrm{Shoulder}}}{\mathrm{Shoulder}} \times 1000
\]
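Locating the shoulder from a training curve is a short search. The sketch below assumes the AUC has been recorded at regular intervals as (iteration, AUC) pairs; the function names and the sample curve are invented for illustration. The combined measure at the end assumes the slide's fragments denote AUCShoulder divided by Shoulder, multiplied by 1000.

```python
def find_shoulder(auc_curve, fraction=0.95):
    """Return (iteration, AUC) at the first point reaching `fraction` of the maximum AUC.

    auc_curve: list of (iteration, auc) pairs in increasing iteration order.
    """
    target = fraction * max(auc for _, auc in auc_curve)
    for iteration, auc in auc_curve:
        if auc >= target:
            return iteration, auc
    raise ValueError("empty AUC curve")


def learning_measure(shoulder_iteration, auc_at_shoulder):
    """Assumed combined measure: (AUC at shoulder / shoulder iteration) * 1000."""
    return auc_at_shoulder / shoulder_iteration * 1000


curve = [(100, 0.50), (200, 0.70), (300, 0.86), (400, 0.90), (500, 0.89)]
shoulder, auc_shoulder = find_shoulder(curve)
print(shoulder, auc_shoulder)                             # 300 0.86
print(round(learning_measure(shoulder, auc_shoulder), 2))  # 2.87
```

A system that reaches a high AUC early gets a large value, while one that learns slowly or plateaus at a low AUC gets a small one.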

Page 59: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Reinforcement in EpiXCS

• Done the usual way, but…

• User can bias the reward depending on the class distribution
  – Give disproportionately less "negative" reward to False Negative classifiers in data with <50% positives (where the Se is low)
  – Give disproportionately less "negative" reward to False Positive classifiers in data with <50% negatives (where the Sp is low)

Page 60: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Missing Values Handling during Covering

• Four possible ways to cover missing data in a non-matching input σ that needs to be covered
  – Wild-to-wild
    • Missing attributes covered as #s
  – Random within range
    • Random value within the range for the attribute
  – Population average
    • Population average for the attribute
  – Population standard deviation
    • Random value within the standard deviation for the attribute

Page 61: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Classifier Ranking

• After training, classifiers ranked according to their predictive values
  – Classifiers predicting positive ranked by PPV
  – Classifiers predicting negative ranked by NPV

• Classifier ranking used for rule visualization

Page 62: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Risk Assessment

• Based on risk assessment module used in EpiCS

• Risk estimates determined on testing based on proportional prevalence in match sets for each testing case

• Provides risk assessment analogous to that obtained by logistic regression

Page 63: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Test case: Pima Indians Diabetes Data

• 768 cases
  – 268 positive, 500 negative

• 8 attributes
  – Gravidity
  – Plasma glucose
  – Diastolic blood pressure
  – Skin-fold thickness
  – Serum insulin
  – Body mass index
  – Pedigree function
  – Age

• Class: Diabetes Yes/No

Page 64: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Experimental procedure

• Training and testing sets created
  – 134 positives/250 negatives in each

• EpiXCS
  – Results averaged over 20 runs, 50,000 iterations each

• See5
  – Boosting at 10 trials
  – 10-fold cross-validation

• Logistic regression
  – Relaxed stepwise model built on training set and evaluated on testing set

Page 65: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Rules: EpiXCS

If Number of times pregnant is 7.0 ± 7.0
and plasma glucose concentration after 2 hours is 67.5 ± 11.5
and triceps skinfold thickness is 33.0 ± 26.0
and 2-hour serum insulin is 326.5 ± 66.5
and age is 48.0 ± 27.0
Then not diabetes

If Number of times pregnant is 7.5 ± 7.5
and triceps skinfold thickness is 35.5 ± 24.5
and 2-hour serum insulin is 811.5 ± 34.5
and body mass index is 48.9 ± 6.1
and pedigree function is 0.97 ± 0.89
Then diabetes

Page 66: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Rules: See5

Rule 9/1: (20.5, lift 1.8)
    pedigree <= 0.179
    age <= 34
    -> class 0 [0.955]

Rule 9/2: (62.1/2.6, lift 1.8)
    plasmaglu <= 103
    pedigree <= 0.787
    -> class 0 [0.944]

Rule 9/3: (9.3, lift 1.7)
    serumins <= 156
    bmi <= 35.3
    age > 34
    age <= 37
    -> class 0 [0.912]

Rule 9/9: (12, lift 2.0)
    plasmaglu > 135
    serumins <= 185
    bmi > 33.7
    pedigree <= 1.096
    age > 37
    -> class 1 [0.928]

Rule 9/10: (37.1/2.5, lift 2.0)
    plasmaglu > 103
    bmi > 35.3
    pedigree <= 1.096
    age > 34
    -> class 1 [0.909]

Page 67: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

The logistic model

Risk of diabetes =
    1.34
  + 0.19 * Gravidity
  + 0.04 * Post-prandial glucose
  - 0.01 * Diastolic blood pressure
  + 0.01 * Skinfold thickness
  - 0.01 * Serum insulin
  + 0.05 * Body mass index
  + 0.72 * Pedigree function

Page 68: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Classification accuracy on testing

Page 69: Evolutionary Computation Genetic Algorithms Genetic Programming Learning Classifier Systems

Conclusions

• EpiXCS incorporates features of EpiCS into the XCS paradigm

• Facilitates analysis of epidemiologic data

• Uses metrics understood by clinical researchers

• Discovers knowledge comparably to See5 and logistic regression