Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets
Hisao Ishibuchi, Osaka Prefecture University, Japan
Workshop on Grand Challenges of Computational Intelligence (Cyprus, September 14, 2012)

Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Embed Size (px)

Citation preview

Page 1: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets

Hisao Ishibuchi, Osaka Prefecture University, Japan

Workshop on Grand Challenges of Computational Intelligence (Cyprus, September 14, 2012)

Page 2: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Contents of This Presentation

Introduction
1. Basic Idea of Evolutionary Computation
2. Genetics-Based Machine Learning
3. Parallel Distributed Implementation
4. Computational Experiments
5. Conclusion

Page 4: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Basic Idea of Evolutionary Computation

Environment

Population

Individual

Page 5: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Basic Idea of Evolutionary Computation

Environment

Population

Individual

(1)

(1) Natural selection in a tough environment.

Environment

Population

Good Individual

Page 6: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Basic Idea of Evolutionary Computation

Environment

Population

Individual

(1) (2)

Environment

Population

Environment

Population

New Individual

(1) Natural selection in a tough environment.
(2) Reproduction of new individuals by crossover and mutation.

Good Individual

Page 7: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Basic Idea of Evolutionary Computation

Environment

Population

Environment

Population

Iteration of the generation update many times:
(1) Natural selection in a tough environment.
(2) Reproduction of new individuals by crossover and mutation.
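To make the loop above concrete, here is a minimal generational evolutionary algorithm in Python. It is only an illustrative sketch: the bit-string encoding, the one-max fitness function, tournament selection, and all parameter values are assumptions for the example, not the algorithm used later in this talk.

```python
import random

def fitness(individual):
    # Toy fitness function (one-max): count the ones in a bit string.
    return sum(individual)

def select(population):
    # (1) Natural selection: the fitter of two random individuals survives.
    return max(random.sample(population, 2), key=fitness)

def crossover(parent1, parent2):
    # (2) Reproduction: one-point crossover of two selected parents.
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(individual, rate=0.02):
    # (2) Reproduction: bit-flip mutation.
    return [1 - bit if random.random() < rate else bit for bit in individual]

def evolve(pop_size=50, length=30, generations=200):
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):  # Iterate the generation update many times.
        population = [mutate(crossover(select(population), select(population)))
                      for _ in range(pop_size)]
    return max(population, key=fitness)

print(fitness(evolve()))  # Usually close to 30 after enough generations.
```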

Page 8: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Applications of Evolutionary Computation: Design of High-Speed Trains
Environment

Population

Individual = Design

Page 9: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Applications of Evolutionary Computation: Design of Stock Trading Algorithms
Environment

Population

Individual = Trading Algorithm

Page 10: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Contents of This Presentation

Introduction
1. Basic Idea of Evolutionary Computation
2. Genetics-Based Machine Learning
3. Parallel Distributed Implementation
4. Computational Experiments
5. Conclusion

Page 11: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Genetics-Based Machine Learning: Knowledge Extraction from Numerical Data

If (condition) then (result)
If (condition) then (result)
If (condition) then (result)
If (condition) then (result)
If (condition) then (result)

Training data
Evolutionary Computation

Design of Rule-Based Systems

Page 12: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment

Population

Design of Rule-Based Systems

[Figure: the population is a set of rule-based systems; each individual is a list of "If ... then ..." rules]

Individual = Rule-Based System

Page 13: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment

Population

Design of Decision Trees

Individual = Decision Tree

Page 14: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment

Population

Design of Neural Networks

Individual = Neural Network

Page 15: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Multi-Objective Evolution: Minimization of Errors and Complexity

[Plot: error (small to large) vs. complexity (simple to complicated)]

Initial Population
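The trade-off shown here can be handled with standard Pareto dominance: each individual is scored by an (error, complexity) pair, both minimized, and the evolution keeps the non-dominated solutions instead of a single best one. A minimal sketch follows; the particular objective values in the usage example are made up for illustration.

```python
def dominates(a, b):
    # Both objectives (error, complexity) are minimized: a dominates b if it is
    # no worse in every objective and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(objective_values):
    # Keep only the solutions on the error-complexity trade-off front.
    return [a for a in objective_values
            if not any(dominates(b, a) for b in objective_values if b != a)]

# Usage: (error rate, number of rules) for five candidate rule-based systems.
candidates = [(0.20, 3), (0.15, 5), (0.15, 9), (0.05, 20), (0.30, 2)]
print(non_dominated(candidates))  # [(0.2, 3), (0.15, 5), (0.05, 20), (0.3, 2)]
```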

Page 16: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Multi-Objective Evolution: Minimization of Errors and Complexity

[Plot: error (small to large) vs. complexity (simple to complicated)]

Page 17: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Multi-Objective Evolution: A Number of Different Neural Networks

[Plot: error vs. complexity trade-off among different neural networks]

Page 18: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Multi-Objective Evolution: A Number of Different Decision Trees

[Plot: error vs. complexity trade-off among different decision trees]

Page 19: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Multi-Objective Evolution: A Number of Fuzzy Rule-Based Systems

[Plot: error vs. complexity trade-off among different fuzzy rule-based systems]

Page 20: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Contents of This Presentation

Introduction
1. Basic Idea of Evolutionary Computation
2. Genetics-Based Machine Learning
3. Parallel Distributed Implementation
4. Computational Experiments
5. Conclusion

Page 21: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment

Population

Difficulty in Applications to Large Data: Computation Load for Fitness Evaluation

[Figure: every rule set in the population is evaluated against the training data]

Training data

Fitness Evaluation: Evaluation of each individual using the given training data

Individual = Rule-Based System
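To illustrate why this step dominates the computation load, the sketch below evaluates one rule-based individual by classifying every training pattern; its fitness is the resulting accuracy. The interval-based rule encoding and the first-match inference are illustrative assumptions, not the exact (fuzzy) GBML representation used in this work.

```python
# A rule is (condition, predicted_class); a condition maps an attribute index
# to an allowed (low, high) interval. An individual is a list of such rules.

def matches(rule, pattern):
    condition, _ = rule
    return all(low <= pattern[i] <= high for i, (low, high) in condition.items())

def classify(rule_set, pattern):
    # First-match inference (a real system may instead use weighted voting).
    for rule in rule_set:
        if matches(rule, pattern):
            return rule[1]
    return None  # No rule matches: counted as a misclassification.

def fitness(rule_set, training_data):
    # The whole training data set is scanned for every individual in the
    # population at every generation, which is what becomes expensive.
    correct = sum(1 for pattern, label in training_data
                  if classify(rule_set, pattern) == label)
    return correct / len(training_data)

# Tiny usage example with two numeric attributes and two classes.
training_data = [((0.2, 0.3), "A"), ((0.8, 0.7), "B"), ((0.1, 0.9), "A")]
rule_set = [({0: (0.0, 0.5)}, "A"), ({0: (0.5, 1.0)}, "B")]
print(fitness(rule_set, training_data))  # 1.0
```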

Page 23: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment
Population

Difficulty in Applications to Large Data: Computation Load for Fitness Evaluation

[Figure: every rule set in the population is evaluated against the training data]

Training data
Fitness Evaluation

Individual = Rule-Based System

Page 24: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment
Population

[Figure: every rule set in the population is evaluated against the training data]

Training data
Fitness Evaluation

The Main Issue in This Presentation: How to Decrease the Computation Load

Individual = Rule-Based System

Page 25: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Environment
Population

[Figure: every rule set in the population is evaluated against the training data]

Training data
Fitness Evaluation

Fitness Calculation: Standard Non-Parallel Model

Individual = Rule-Based System

Page 29: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

A Popular Approach for Speed-Up: Parallel Computation of Fitness Evaluation

Training data

[Figure: the rule sets in the population are distributed among several CPUs for fitness evaluation]

If we use n CPUs, the computation load on each CPU is reduced to 1/n of the single-CPU case (e.g., to 25% with four CPUs).

CPU 1

CPU 2

CPU 3

CPU 4

Fitness Evaluation
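Since the fitness values of different individuals are independent, this approach simply spreads the evaluations over several worker processes while keeping a single population. Below is a minimal sketch using Python's standard multiprocessing pool; the stand-in fitness function only mimics an expensive evaluation over the training data.

```python
from multiprocessing import Pool

def fitness(rule_set):
    # Stand-in for the expensive step: evaluating one rule set against
    # the whole training data set.
    return sum(rule_set)

def evaluate_population(population, n_cpus=4):
    # Each of the n_cpus workers evaluates a share of the population, so the
    # wall-clock time of this step ideally drops to about 1/n_cpus.
    with Pool(processes=n_cpus) as pool:
        return pool.map(fitness, population)

if __name__ == "__main__":
    population = [[i % 3 for i in range(20)] for _ in range(210)]
    print(evaluate_population(population)[:5])
```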

Page 30: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Another Approach for Speed-Up: Training Data Reduction

Training data

[Figure: each rule set in the population is evaluated against a subset of the training data]

If we use x% of the training data, the computation load is reduced to x% of that with the full training data.

Training data Subset
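Implementing the reduction itself is straightforward: pick a random x% of the training data once and evaluate fitness only on that subset. The sketch below assumes the data set is a list of (pattern, label) pairs; as the next slide points out, the population then tends to overfit the fixed subset.

```python
import random

def sample_subset(training_data, fraction=0.1, seed=0):
    # Keep x% of the training data (here 10%), chosen once before evolution;
    # every fitness evaluation then costs roughly `fraction` of the full cost.
    rng = random.Random(seed)
    size = max(1, int(len(training_data) * fraction))
    return rng.sample(training_data, size)

# Usage: subset = sample_subset(training_data, fraction=0.1)
#        fitness is then computed on `subset` instead of `training_data`.
```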

Page 34: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Another Approach for Speed-Up: Training Data Reduction

Training data

[Figure: each rule set in the population is evaluated against a subset of the training data]

Difficulty: How to choose a training data subset
The population will overfit to the selected training data subset.

Training data Subset

Page 35: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Idea of Windowing in J. Bacardit et al.: Speeding-up Pittsburgh learning classifier systems: Modeling time and accuracy. PPSN 2004.

Training data

Training data are divided into data subsets.

Page 36: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Idea of Windowing in J. Bacardit et al.: Speeding-up Pittsburgh learning classifier systems: Modeling time and accuracy. PPSN 2004.

1st Generation

At each generation, a different data subset (i.e., different window) is used.

[Figure: the population is evaluated against the data subset (window) of the current generation]

Fitness Evaluation

Page 37: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Idea of Windowing in J. Bacardit et al.: Speeding-up Pittsburgh learning classifier systems: Modeling time and accuracy. PPSN 2004.

[Figure: the population is evaluated against the data subset (window) of the current generation]

2nd Generation

At each generation, a different data subset (i.e., different window) is used.

Fitness Evaluation

Page 38: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Idea of Windowing in J. Bacardit et al.: Speeding-up Pittsburgh learning classifier systems: Modeling time and accuracy. PPSN 2004.

[Figure: the population is evaluated against the data subset (window) of the current generation]

3rd Generation

At each generation, a different data subset (i.e., different window) is used.

Fitness Evaluation
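The windowing scheme sketched on these slides can be written in a few lines: split the training data into disjoint subsets once, then cycle through them so that each generation evaluates fitness on a different window. The even striped split and the cyclic order below are illustrative assumptions.

```python
def make_windows(training_data, n_windows):
    # Split the training data into n disjoint, roughly equal-sized subsets.
    return [training_data[i::n_windows] for i in range(n_windows)]

def window_for(windows, generation):
    # A different subset (window) is used at each generation, cycling round.
    return windows[generation % len(windows)]

# Inside the evolutionary loop:
#   window = window_for(windows, generation)
#   fitness_values = [accuracy(rule_set, window) for rule_set in population]
# so the population is never evaluated on one fixed subset for long.
```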

Page 39: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Visual Image of Windowing
A population is moving around in the training data.

Training Data = Environment

[Figure: the population (a set of rule-based systems) sits inside one window of the training data]

Population

Page 44: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Visual Image of Windowing
A population is moving around in the training data.

Training Data = Environment

[Figure: the population (a set of rule-based systems) sits inside one window of the training data]

Population

After enough evolution with a moving window:
The population does not overfit to any particular training data subset.
The population may have high generalization ability.

Page 45: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Training data

CPU

Population

Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

Page 46: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

CPU CPU CPU CPU CPU

Training data

CPU

Population

Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

(1) A population is divided into multiple subpopulations (as in an island model).

Our Parallel Distributed Model

Page 47: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

CPU CPU CPU CPU CPU

Training data

CPU

Population

Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

Training data

(1) A population is divided into multiple subpopulations.
(2) Training data are also divided into multiple subsets (as in the windowing method).

Our Parallel Distributed Model

Page 48: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

CPU CPU CPU CPU CPU

Training data

CPU

Population

Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

Training data

(1) A population is divided into multiple subpopulations.
(2) Training data are also divided into multiple subsets.
(3) An evolutionary algorithm is locally performed at each CPU (as in an island model).

Our Parallel Distributed Model

Page 49: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

CPU CPU CPU CPU CPU

Training data

CPU

Population

Our Parallel Distributed Model
Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

Training data

(1) A population is divided into multiple subpopulations.
(2) Training data are also divided into multiple subsets.
(3) An evolutionary algorithm is locally performed at each CPU.
(4) Training data subsets are periodically rotated (e.g., every 100 generations).

Page 50: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

CPU CPU CPU CPU CPU

Training data

CPU

Population

Our Parallel Distributed Model
Non-parallel Non-distributed

Our Idea: Parallel Distributed Implementation
H. Ishibuchi et al.: Parallel Distributed Hybrid Fuzzy GBML Models with Rule Set Migration and Training Data Rotation. TFS (in press)

Training data

(1) A population is divided into multiple subpopulations.
(2) Training data are also divided into multiple subsets.
(3) An evolutionary algorithm is locally performed at each CPU.
(4) Training data subsets are periodically rotated.
(5) Migration is also periodically performed.
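Putting the five steps together, the overall loop of the parallel distributed model can be sketched as below. Process-level parallelism is omitted, and the striped split of population and data, the rotation direction, the migration policy (copy the best rule set of each island into a random slot of its neighbour), and all interval values are illustrative assumptions; the exact model is the one described in the cited TFS paper.

```python
import random

def parallel_distributed_gbml(population, training_data, evolve_one_generation,
                              best_of, n_islands=7, generations=50000,
                              rotation_interval=100, migration_interval=100):
    # evolve_one_generation(subpop, data_subset): one local GBML generation.
    # best_of(subpop, data_subset): best rule set of a subpopulation.
    islands = [population[i::n_islands] for i in range(n_islands)]      # (1)
    subsets = [training_data[i::n_islands] for i in range(n_islands)]   # (2)

    for g in range(1, generations + 1):
        # (3) The evolutionary algorithm runs locally on each island, using
        #     only the training data subset currently assigned to it.
        islands = [evolve_one_generation(isl, sub)
                   for isl, sub in zip(islands, subsets)]
        if g % rotation_interval == 0:                                   # (4)
            subsets = subsets[-1:] + subsets[:-1]   # rotate the data subsets
        if g % migration_interval == 0:                                  # (5)
            migrants = [best_of(isl, sub) for isl, sub in zip(islands, subsets)]
            for i, migrant in enumerate(migrants):
                target = islands[(i + 1) % n_islands]
                target[random.randrange(len(target))] = migrant
    return [individual for island in islands for individual in island]
```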

Page 51: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

CPU

Computation Load = EC Part + Fitness Evaluation Part

Page 52: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

CPU

Computation Load = EC Part + Fitness Evaluation Part

EC

EC = Evolutionary Computation

= { Selection, Crossover, Mutation, Generation Update }

Page 53: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

EC

Standard Parallel Model (Parallel Fitness Evaluation)

CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU

Computation Load = EC Part + Fitness Evaluation Part (1/7)

Page 54: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

EC

Standard Parallel Model (Parallel Fitness Evaluation)

EC

Windowing Model (Reduced Training Data Set)

CPU CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU

Computation Load = EC Part + Fitness Evaluation Part (1/7)

Page 55: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

EC

Standard Parallel Model (Parallel Fitness Evaluation)

EC

Windowing Model (Reduced Training Data Set)

Parallel Distributed Model (Divided Population & Data Set)

CPU CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU CPU

CPU

CPU

CPU

CPU

CPU

CPU

Page 56: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Load of Four Models

EC

Fitness Evaluation

Standard Non-Parallel Model

EC

Standard Parallel Model (Parallel Fitness Evaluation)

EC

Windowing Model (Reduced Training Data Set)

Parallel Distributed Model (Divided Population & Data Set)

CPU CPU

CPU

CPU

CPU

CPU

CPU

CPU

CPU CPU

CPU

CPU

CPU

CPU

CPU

CPU

EC Part (1/7)

Fitness Evaluation Part (1/7)

Page 57: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Contents of This Presentation

Introduction
1. Basic Idea of Evolutionary Computation
2. Genetics-Based Machine Learning
3. Parallel Distributed Implementation
4. Computational Experiments
5. Conclusion

Page 58: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Our Model in Computational Experiments with Seven Subpopulations and Seven Data Subsets

Page 59: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Our Model in Computational Experiments with Seven Subpopulations and Seven Data Subsets

Seven CPUs

Seven Subpopulations of Size 30
Population Size = 30 x 7 = 210

Seven Data Subsets

Page 60: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Standard Non-Parallel Non-Distributed Model with a Single Population and a Single Data Set

Population

Page 61: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Standard Non-Parallel Non-Distributed Model with a Single Population and a Single Data Set

Single CPU

Single Population of Size 210

Whole Data Set

Page 62: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Standard Non-Parallel Non-Distributed Model with a Single Population and a Single Data Set

Single CPU

Single Population of Size 210

Whole Data Set

Termination Condition: 50,000 Generations
Computation Load: 210 x 50,000 = 10,500,000 Evaluations

(more than ten million evaluations)

Page 63: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Comparison of Computation Load

Standard Model: Evaluation of 210 rule sets using all the training data

Parallel Distributed Model: Evaluation of 30 rule sets using one of the seven data subsets

Computation Load on a Single CPU per Generation

Page 64: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Comparison of Computation Load

Standard Model: Evaluation of 210 rule sets using all the training data

Parallel Distributed Model: Evaluation of 30 rule sets using one of the seven data subsets

Computation Load on a Single CPU per Generation

1/7

1/7

Page 65: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Comparison of Computation Load

Standard Model: Evaluation of 210 rule sets using all the training data

Parallel Distributed Model: Evaluation of 30 rule sets using one of the seven data subsets

Computation Load on a Single CPU per Generation

1/7

1/7

Computation Load ==> 1/7 x 1/7 = 1/49 (about 2%)

Page 66: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Data Sets in Computational Experiments: Nine Pattern Classification Problems

Data Set      Patterns   Attributes   Classes
Segment          2,310           19         7
Phoneme          5,404            5         2
Page-blocks      5,472           10         5
Texture          5,500           40        11
Satimage         6,435           36         6
Twonorm          7,400           20         2
Ring             7,400           20         2
PenBased        10,992           16        10
Magic           19,020           10         2

Page 67: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Time for 50,000 Generations
Computation time was decreased to about 2%

Data Set      Standard: A (min)   Our Model: B (min)   B/A (%)
Segment                  203.66                 4.69     2.30%
Phoneme                  439.18                13.19     3.00%
Page-blocks              204.63                 4.74     2.32%
Texture                  766.61                15.72     2.05%
Satimage                 658.89                15.38     2.33%
Twonorm                  856.58                 7.84     0.92%
Ring                    1015.04                22.52     2.22%
PenBased                1520.54                35.56     2.34%
Magic                    771.05                22.58     2.93%

Page 68: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Computation Time for 50,000 Generations
Computation time was decreased to about 2%

[Same table as the previous slide]

Why? Because the population and the training data were each divided into seven subsets:
1/7 x 1/7 = 1/49 (about 2%)

Page 70: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Test Data Error Rates (Results of 3x10CV)
Test data accuracy was improved for six data sets

Data Set      Standard: A (%)   Our Model: B (%)   Improvement: A - B (%)
Segment                  5.99               5.90                     0.09
Phoneme                 15.43              15.96                   - 0.53
Page-blocks              3.81               3.62                     0.19
Texture                  4.64               4.77                   - 0.13
Satimage                15.54              12.96                     2.58
Twonorm                  7.36               3.39                     3.97
Ring                     6.73               5.25                     1.48
PenBased                 3.07               3.30                   - 0.23
Magic                   15.42              14.89                     0.53

Page 72: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Q. Why did our model improve the test data accuracy?
A. Because our model improved the search ability.

Data Set: Satimage (Standard 15.54%, Our Model 12.96%, Improvement 2.58%)

[Plot: training data error rate (%) vs. number of generations, comparing our parallel distributed model with the non-parallel non-distributed model]

Page 73: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Q. Why did our model improve the search ability?
A. Because our model maintained the diversity.

Data Set: Satimage (Standard 15.54%, Our Model 12.96%, Improvement 2.58%)

Training Data Rotation: Every 100 Generations
Rule Set Migration: Every 100 Generations

[Plot: training data error rate (%) vs. number of generations. The best and worst error rates in a particular subpopulation at each generation in a single run are shown for the parallel distributed model; in the non-parallel non-distributed model, best = worst (no diversity).]

Page 74: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals

Training Data Rotation: Every 100 Generations
Rule Set Migration: Every 100 Generations

Page 75: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals (Rotations in the opposite directions)

[Figure: training data subsets rotate among the seven CPUs in one direction while rule sets migrate among the subpopulations in the opposite direction]

Page 76: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals (Rotations in the opposite directions)

[Figure: training data subsets rotate among the seven CPUs in one direction while rule sets migrate in the opposite direction]

[3D bar chart: error rate (%), axis from 10 to 20, over rotation intervals (No, 500, 200, 100, 50) and migration intervals (No, 500, 200, 100, 50)]

Good Results

Page 77: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals (Rotations in the same direction)

[Figure: training data subsets and rule sets are both rotated in the same direction among the seven CPUs]

Page 78: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals (Rotations in the same direction)

[Figure: training data subsets and rule sets are both rotated in the same direction among the seven CPUs]

[3D bar chart: error rate (%), axis from 10 to 20, over rotation intervals (No, 500, 200, 100, 50) and migration intervals (No, 500, 200, 100, 50)]

Good Results

Page 79: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals

[Two 3D bar charts: error rate (%) over rotation intervals (No, 500, 200, 100, 50) and migration intervals (No, 500, 200, 100, 50); left: rotations in the opposite directions, right: rotations in the same direction]

Page 80: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals

[Two 3D bar charts: error rate (%) over rotation intervals (No, 500, 200, 100, 50) and migration intervals (No, 500, 200, 100, 50); left: rotations in the opposite directions, right: rotations in the same direction]

Interaction between rotation and migration

Page 81: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Effects of Rotation and Migration Intervals

[Two 3D bar charts: error rate (%) over rotation intervals (No, 500, 200, 100, 50) and migration intervals (No, 500, 200, 100, 50); left: rotations in the opposite directions, right: rotations in the same direction]

Interaction between rotation and migration

Negative Effects

Page 82: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Contents of This Presentation

Introduction
1. Basic Idea of Evolutionary Computation
2. Genetics-Based Machine Learning
3. Parallel Distributed Implementation
4. Computational Experiments
5. Conclusion

Page 83: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Conclusion

1. We explained our parallel distributed model.
2. It was shown that the computation time was decreased to about 2%.
3. It was shown that the test data accuracy was improved.
4. We explained negative effects of the interaction between the training data rotation and the rule set migration.

Page 87: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Conclusion

5. A slightly different model may also be possible for learning from locally located databases.

Subpopulation Rotation

[Figure: four sites (University A, B, C, D), each with its own CPU and local database; the subpopulations rotate among the sites]
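A rough sketch of this variation follows: the local databases never move, and only the subpopulations rotate from site to site. The function signature, the `evolve` callback, and the interval value are assumptions made for illustration.

```python
def rotate_subpopulations(site_databases, subpopulations, evolve,
                          generations=1000, rotation_interval=100):
    # site_databases: one local database per site (these stay where they are).
    # subpopulations: one subpopulation per site (these are what rotate).
    # evolve(subpop, database): runs one generation of GBML locally at a site.
    for g in range(1, generations + 1):
        subpopulations = [evolve(subpop, data)
                          for subpop, data in zip(subpopulations, site_databases)]
        if g % rotation_interval == 0:
            # Move every subpopulation to the next site; the data stays put.
            subpopulations = subpopulations[-1:] + subpopulations[:-1]
    return subpopulations
```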

Page 88: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Conclusion

6. This model can also be used for learning from changing databases.

Subpopulation Rotation

[Figure: four sites (University A, B, C, D), each with its own CPU and local database; the subpopulations rotate among the sites]

Page 93: Hisao Ishibuchi: "Scalability Improvement of Genetics-Based Machine Learning to Large Data Sets"

Conclusion

Thank you very much!