22
Project 2: Project 2: Classification Using Genetic Classification Using Genetic Programming Programming 2008. 10. 27 Kim, MinHyeok [email protected] Biointelligence laboratory Artificial Intelligence

Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok [email protected] [email protected] Biointelligence laboratory Artificial

Embed Size (px)

Citation preview

Page 1: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Project 2: Project 2: Classification Using Genetic Programming Classification Using Genetic Programming

2008. 10. 27

Kim, MinHyeok

[email protected] Biointelligence laboratory

Artificial Intelligence

Page 2: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

ContentsContents

Project outline Description on the data set Genetic Programming

Brief overview Fitness function & Selection methods Classification with GP (in this project)

Guide to writing reports Style & contents

Submission guide / Marking scheme

2(C) 2008, SNU Biointelligence Laboratory

Page 3: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

3(C) 2008, SNU Biointelligence Laboratory

OutlineOutline

Goal Understand the Genetic Programming (GP) deeper Practice researching and writing a paper

Forest Fires problem (classification) To predict whether a fire occurs or not Using Genetic Programming Estimating several statistics on the dataset

Data set Variation of the ‘Forest Fires data set’

http://archive.ics.uci.edu/ml/datasets/Forest+Fires

Page 4: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Forest Fires Forest Fires Data SetData Set

Description Database of 517 samples

You can use at most 500 samples for training 17 samples for prediction

12 attributes X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,label Integer or real value

Label (Class) Two classes

– 0 : a fire does not occur

– 1 : a fire occurs

4(C) 2008, SNU Biointelligence Laboratory

Page 5: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Brief Summary of GPBrief Summary of GP

A kind of evolutionary algorithms It is represented with a tree structure You need to set up following elements for GP run

The set of terminals (input attributes, the class variable, constants)

The set of functions (numerical / condition operators)

The fitness measure The algorithm parameters

population size, maximum number of generations crossover rate and mutation rate maximum depth of GP trees etc.

The method for designating a result and the criterion for terminating a run.

5(C) 2008, SNU Biointelligence Laboratory

Page 6: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

6

GP FlowchartGP Flowchart

GA loop GP loop

Page 7: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

InitializationInitialization

Maximum initial depth of trees Dmax is set.

Full method (each branch has depth = Dmax): nodes at depth d < Dmax randomly chosen from function set F

nodes at depth d = Dmax randomly chosen from terminal set T

Grow method (each branch has depth Dmax): nodes at depth d < Dmax randomly chosen from F T

nodes at depth d = Dmax randomly chosen from T

Common GP initialisation: ramped half-and-half, where grow and full method each deliver half of initial population

7(C) 2008, SNU Biointelligence Laboratory

Page 8: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Fitness FunctionsFitness Functions

Relative squared error

The number of outputs that are within % of the correct value

And you can try other fitness functions which are well-defined to solve problems

n

i i

ii

y

xfyFitness

1

2

)(ˆ

Page 9: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Selection methods (1/2)Selection methods (1/2)

Fitness proportional (roulette wheel) selection The roulette wheel can be constructed as follows.

Calculate the total fitness for the population.

Calculate selection probability pk for each chromosome vk.

Calculate cumulative probability qk for each chromosome vk.

SIZEPOP

kkifF

_

1

)(

SIZEPOPkF

ifp kk _,...,2,1 ,

)(

SIZEPOPkpqk

jjk _,...,2,1 ,

1

Page 10: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Procedure: Proportional_Selection Generate a random number r from the range [0,1]. If r q1, then select the first chromosome v1; else, select the kth

chromosome vk (2 k pop_size) such that qk-1 < r qk.

pk qk

1 0.082407 0.082407

2 0.110652 0.193059

3 0.131931 0.324989

4 0.121423 0.446412

5 0.072597 0.519009

6 0.128834 0.647843

7 0.077959 0.725802

8 0.102013 0.827802

9 0.083663 0.911479

10 0.088521 1.000000

0.036441)(_

1

sizepop

kkifF

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1

Page 11: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Selection methods (2/2)Selection methods (2/2)

Tournament selection Tournament size q

Ranking-based selection

2 POP_SIZE 1 + 2 and - = 2 - +

Elitism To preserve n good solutions until the next generation

1

1)(

1

i

pi

Page 12: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Classification with GP (in this project)Classification with GP (in this project)

Function Regression Search a function f(x) s.t.

f(x) ≥ threshold t when y=1 f(x) < threshold t when y=0

Converting to Boolean value∧

¬ ∨

= > <

0rain RH 50 wind +

FFMC ISI

IF

> 1 0

f(x) t

Page 13: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

What to do for the experiment?What to do for the experiment?

Select a library that implements GP You can find various libraries written in C++/Java/Matlab See the list of recommended libraries on the next page

Build up your own code for the experiment Check sample codes and tutorials of libraries for quick start Add comments to explain the flow of your program

Caution Running GP may take much time

13(C) 2008, SNU Biointelligence Laboratory

Page 14: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Recommended Libraries for GPRecommended Libraries for GP

C++ GPLib: http://www.cs.bham.ac.uk/~cmf/GPLib/index.html

Java JGAP: http://jgap.sourceforge.net/ ECJ: http://cs.gmu.edu/~eclab/projects/ecj/

Matlab toolbox GPLAB: http://gplab.sourceforge.net/

More References Implementations section in Wiki – Genetic Programming:

http://en.wikipedia.org/wiki/Genetic_programming

14(C) 2008, SNU Biointelligence Laboratory

Page 15: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Reports StyleReports Style

English only!! Scientific journal-style

How to Write A Paper in Scientific Journal Style and Format http://abacus.bates.edu/~ganderso/biology/resources/writing/HTWsections.html

15(C) 2008, SNU Biointelligence Laboratory

 Experimental process  Section of Paper

What did I do in a nutshell?  Abstract

 What is the problem? Introduction

 How did I solve the problem?  Materials and Methods

 What did I find out?  Results

 What does it mean?  Discussion

 Who helped me out?  Acknowledgments (optional)

 Whose work did I refer to?  Literature Cited

 Extra Information Appendices (optional)

Page 16: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Report Contents (1/3)Report Contents (1/3)

System description Used programming language and running environments

Result tables

Analysis & discussion (Very Important!!)

16(C) 2008, SNU Biointelligence Laboratory

Training

Average SD

Best Worst

Setting 1

% % % %

Setting 2

% % % %

Setting 3

% % % %

Your prediction

1 2 … 16 17 Equation

Page 17: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Report Contents (2/3)Report Contents (2/3)

Graph Avg., Max. Fitness versus Generation

Tree size versus Generation

17(C) 2008, SNU Biointelligence Laboratory

Page 18: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Report Contents (3/3)Report Contents (3/3)

Basic experiments Changing parameters for the crossover and mutation Various function sets: arithmetic, numerical

Optional experiments Various selection methods Depth limitation Population size, generation numbers Comparison to Neural Network …

References

18(C) 2008, SNU Biointelligence Laboratory

Page 19: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

19(C) 2008, SNU Biointelligence Laboratory

Submission GuideSubmission Guide

Due date: Nov. 19 (Wed) 18:00 Submit both ‘hardcopy’ and ‘email’

Hardcopy submission to the office (301-417 ) E-mail submission to [email protected]

Subject : [AI Project1 Report] Student number, Name Report + your source code with comments + executable file(s)

Length: report should be summarized within 12 pages. We are NOT interested in the accuracy and your programming

skill, but your creativity and research ability.

If your major is not a C.S, team project with a C.S major student is possible (Use the class board to find your partner and notice the information of your team to TA ([email protected]) by Nov. 5)

Page 20: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Marking SchemeMarking Scheme

5 points for programming 5 points for result prediction 30 points for experiment & analysis

15 pts for experiments, 15pts for analysis

10 points for report Late work

- 10% per one day Maximum 7 days

20(C) 2008, SNU Biointelligence Laboratory

Page 21: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

QnAQnA

21(C) 2008, SNU Biointelligence Laboratory

Page 22: Project 2: Classification Using Genetic Programming 2008. 10. 27 Kim, MinHyeok mhkim@bi.snu.ac.kr mhkim@bi.snu.ac.kr Biointelligence laboratory Artificial

Test DataTest Data

  X Y monthda

yFFMC DMC DC ISI temp RH wind rain

Data01 6 5 9 3 92.9 133.3 699.6 9.2 26.4 21 4.5 0

Data02 6 3 11 2 79.5 3 106.7 1.1 11.8 31 4.5 0

Data03 4 3 7 4 93.2 114.4 560 9.5 30.2 22 4.9 0

Data04 6 5 6 1 90.4 93.3 298.1 7.5 19.1 39 5.4 0

Data05 6 3 4 7 91 14.6 25.6 12.3 13.7 33 9.4 0

Data06 5 4 4 7 91 14.6 25.6 12.3 17.6 27 5.8 0

Data07 4 3 5 5 89.6 25.4 73.7 5.7 18 40 4 0

Data08 7 5 10 1 91.7 48.5 696.1 11.1 16.1 44 4 0

Data09 8 6 3 5 91.7 33.3 77.5 9 8.3 97 4 0.2

Data10 7 5 8 2 96.1 181.1 671.2 14.3 27.3 63 4.9 6.4

Data11 6 5 9 6 91.2 94.3 744.4 8.4 15.4 57 4.9 0

Data12 8 6 8 1 92.1 207 672.6 8.2 21.1 54 2.2 0

Data13 7 4 9 5 88.2 55.2 732.3 11.6 15.2 64 3.1 0

Data14 4 3 9 2 91.9 111.7 770.3 6.5 15.9 53 2.2 0

Data15 3 6 9 7 92.4 124.1 680.7 8.5 17.2 58 1.3 0

Data16 3 6 9 1 90.9 126.5 686.5 7 15.6 66 3.1 0

Data17 9 9 7 2 85.8 48.3 313.4 3.9 18 42 2.7 0