
APPLICATION OF GENETIC PROGRAMMING TOWARDS WORD ALIGNERS

BENJAMIN HEILERS

Department of Electrical Engineering and Computer ScienceUniversity of California, Berkeley

December 2004

CS 294-5

keywords: Genetic Programming, Word Aligners, Machine Translation, Machine Learning, Genetic Algorithms, Natural Language Processing, Artificial Intelligence


1. Introduction

This paper details the (as-yet-unfruitful, and thus determinedly ongoing)

research into the application of genetic programming towards optimizing word aligners.

Popular belief holds the use of genetic programming in machine translation to be

infeasible. Regardless, it is the goal of the author, admittedly due to an infatuation with

machine learning in general, to convince himself personally that such a widespread

sentiment is either well-chosen or wholly amiss.

A word aligner is a program coupling words in sentence pairs, in effect

constructing a bilingual dictionary [1:484, 2]. Genetic Programming, a specific branch of

Genetic Algorithms, denotes search across a function space in which the natural

reproductive methods (selection, mutation, crossover) are mimicked in the hope that the

great successes of evolution on living creatures may be repeated on programs [3:47-56, 4].

Genetic Algorithms deal more broadly with evolving all types of functions. Here,

Genetic Programming takes a set of programs and filters out those most resembling word

aligners, subjecting them to alterations in the hope of finding yet better candidates. The

process by which programs are selected tests each program on a subset

of the sentence-pair corpus, thus qualifying this approach as supervised learning.
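To make the aligner's task concrete, the positional baseline that recurs later in the paper (score each pairing by 1 / (1 + |e - f|) and take the best English position for each French word) can be sketched as follows. Class and method names are illustrative, not part of the actual implementation.

```java
import java.util.Arrays;

// Sketch of a positional baseline aligner: each French word f is paired with
// the English position e maximizing 1 / (1 + |e - f|), i.e. the nearest index.
public class DiagonalAligner {

    // Returns, for each French position, the index of the chosen English word.
    public static int[] align(int numEnglish, int numFrench) {
        int[] alignment = new int[numFrench];
        for (int f = 0; f < numFrench; f++) {
            int best = -1;
            double bestScore = -1;
            for (int e = 0; e < numEnglish; e++) {
                double score = 1.0 / (1 + Math.abs(e - f)); // nearer positions score higher
                if (score > bestScore) {
                    bestScore = score;
                    best = e;
                }
            }
            alignment[f] = best;
        }
        return alignment;
    }

    public static void main(String[] args) {
        // A 6-word French sentence against a 5-word English one: the extra
        // French word falls back to the last English position.
        System.out.println(Arrays.toString(align(5, 6)));
    }
}
```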

Another concept appearing in this paper and warranting a definition is that of the Abstract

Syntax Tree (AST), a representation which renders computer code in a format

particularly useful for genetically reproductive processes [5:9]. Abstract Syntax

Trees are preferable for two reasons:

• it is many times simpler, conceptually, to apply crossover and mutation to a tree

representation than to program code in string form;

• a design pattern, the Visitor Pattern, suggests an easily implementable approach

to traversing this representation of program code [9, 10].

Note in Figure 1 that the Eclipse AST, the package used in this research, maintains some

information within nodes (such as the operator in infix expressions), whereas some AST

representations place this information in a child node.

Figure 1: An Eclipse Abstract Syntax Tree in Graphical and Textual Forms.

2. Literature Review

There is little literature on the application of genetic algorithms to word aligners.

Instead, we turn to the literature on genetic programming, where suggestions to

counteract various results-limiting phenomena abound.

There are myriad decisions to make in implementing a genetic algorithm.

Fortunately, the literature provides enough detailed discussion to allow for preparations

against the most common problems with genetic programming. Franz Rothlauf was the first to

write a book on the pros and cons of various representations in genetic algorithms [7].

Like many of his colleagues, he strongly recommends tree representations for genetic

programming. This representation eases the implementation of mutation and crossover

tremendously compared with the traditional representation as bit strings, where the

chances that a mutated string still resembles working code are less than slim.


The phenomenon of bloat is widely mentioned: each successive

generation displays a much larger file size than the previous, yet most of the added code

contributes little to no added functionality. With high rates of mutation, I have seen 450

lines (nine pages) of code introduced into an initially twenty-line file after fewer than ten

generations. Several mechanisms are in place to cope with bloat, as discussed later.

Another commonly observed phenomenon to cope with is over-fitting, which occurs when

the genetic process is allowed to run for too long. For example, the corpus used in this

research consists of 447 sentence pairs, pairing English and French sentences. If we were

public Alignment alignSentencePair(SentencePair sentencePair) {
    MISSING = 2092010418 <= -1198423683;
    alignment = new Alignment();
    I4 = addAlignment(alignment, I4, I4, B3);
    B4 = false;
    I2 = numEnglishWordsInSentence(sentencePair);
    if (I2 < -594586326) {
        D2 = getDouble(L3, I1);
    } else {
        addInt(L2, I1, I3);
        getInt(L5, I2);
        addBoolean(L3, I1, B2);
        while (I2 < 1564864814) {
            addDouble(L5, I5, D1);
            MISSING = 664939021;
            alignment = getString(L1, I2);
            MISSING = 311599999 * 1197784289;
            D1 = -287916828;
        }
    }
    I2 = numFrenchWordsInSentence(sentencePair);
    for (I3 = 0; I3 < I1; I3++) {
        I4 = -1;
        D1 = 0;
        for (I5 = 0; I5 < I2; I5++) {
            D2 = 50 / (1 + abs(I3 - I5));
            if (D2 >= D1) {
                D1 = D2;
                I4 = I5;
            }
        }
        addAlignment(alignment, I4, I3, true);
    }
    return alignment;
}

Figure 2: Example of Bloat. In the original document, lines resembling the original file are in bold, and lines added as a result of the bloat phenomenon are colored red.


to choose the first ten and evolve randomly generated programs to return alignments of

these, then at some point we might find a reasonable solution which not only

achieves superb results on the ten training sentence pairs, but on all 447 sentence

pairs as well. However, if we continue to evolve past this point, chances are that our

population will become over-fit to these ten sentences. This is similar, for example, to

hoping to find the equation y = x^2, but instead arriving at y = 1, given training data of only

(-1, 1) and (1, 1).
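The x^2 example can be checked numerically: on the two training points, y = 1 and y = x^2 are indistinguishable (both have zero error), and only held-out data exposes the difference. A small illustrative sketch:

```java
import java.util.function.DoubleUnaryOperator;

// On training data {(-1, 1), (1, 1)} the constant y = 1 fits exactly, just
// like the desired y = x^2; held-out points reveal the over-fit.
public class OverfitDemo {

    // Sum of squared errors of f over (x, y) pairs.
    public static double sqError(double[][] data, DoubleUnaryOperator f) {
        double err = 0;
        for (double[] p : data) {
            double d = f.applyAsDouble(p[0]) - p[1];
            err += d * d;
        }
        return err;
    }

    public static void main(String[] args) {
        double[][] train = { { -1, 1 }, { 1, 1 } };
        double[][] heldOut = { { 2, 4 }, { 0, 0 } };
        System.out.println(sqError(train, x -> 1.0));     // zero: training cannot tell the functions apart
        System.out.println(sqError(heldOut, x -> 1.0));   // nonzero: the constant fails off the training set
        System.out.println(sqError(heldOut, x -> x * x)); // zero: the desired function generalizes
    }
}
```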


Figure 3: After fitness is reached, over-fitting to the training data may occur.


Figure 4: Example of Over-Fitting. The solid black line is y = x^2, the red dotted line is y = |x|, and the blue dashed line is y = 1. The training data is { (1, 1), (-1, 1) }, but the desired function is { (x, y) : y = x^2 }.

The literature is also helpful in suggesting approximate values for the frequencies

at which to apply mutation and crossover to members of the population, though the

perfect values are apparently learned only by trial and error.

3. General Overview of Algorithm

In general, instead of searching across the solution space, we utilize GP to aid in

search across the function space. As the function space is of immense proportions, we

randomly sample it, and then search through not only these functions but

others similar to them. In the graph above, we may have a function y = x + 5. This

would lead us to search similar functions such as y = x + 6, y = 3x + 5, y = x^2 + 5, etc.

Since it is infeasible to evaluate every possible function with a form similar to y = x + 5,

we must again find a method with which to decide which functions to search. This is the

basic concept of genetic programming: a desired set of (input, output) pairs is

known, and we search for the function (or possibly one of many functions) which produces

these outputs. Doing this search across program code is many times more complex than

across mathematical equations.


The flow chart shown here is exactly the order in which genetic programming is

implemented in this research. An initial population is created by taking files such as

random.java in the Appendix and sending them through several generations of high

mutation. Since the current version of this GP process is still prone to producing

erroneous code, many more programs are generated than asked for. Each is then

evaluated according to the fitness function, and those which have compile or runtime

errors (at this point mostly due to invalid arguments, incorrect casting, and undeclared

variable names; see Results) are filtered out and thrown away. Thus the GP process

begins with only valid programs in its initial population. From here, the population

undergoes a number of iterations wherein each member is evaluated, the next generation

is selected, and then crossover and mutation are allowed to occur. The rates are currently

an 80% chance for 3-point crossover to occur (see Section 4), and 0.05% for mutation, as

suggested by most of the literature.


Figure 5: Flow Chart of GP
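The flow chart's loop can be condensed into code. The sketch below runs the same evaluate / select / reproduce cycle on 64-bit strings rather than ASTs, a toy stand-in so the whole loop fits on a page; the 80% crossover and 0.05% mutation rates are the ones quoted above, while the bit-count fitness and all names are illustrative, not the actual implementation.

```java
import java.util.Random;

// The generational loop from the flow chart, run on 64-bit strings as toy
// stand-ins for programs: evaluate, select by fitness, crossover, mutate.
public class GenerationalLoop {

    static final double CROSSOVER_RATE = 0.80;   // 80%, as in the text
    static final double MUTATION_RATE = 0.0005;  // 0.05%, as in the text

    public static int fitness(long genome) {
        return Long.bitCount(genome);            // toy objective: maximize set bits
    }

    // Dartboard selection: probability proportional to fitness.
    static long select(long[] pop, Random rnd) {
        double total = 0;
        for (long p : pop) total += fitness(p);
        double dart = rnd.nextDouble() * total;
        for (long p : pop) {
            dart -= fitness(p);
            if (dart <= 0) return p;
        }
        return pop[pop.length - 1];
    }

    public static long evolve(int popSize, int generations, long seed) {
        Random rnd = new Random(seed);
        long[] pop = new long[popSize];
        for (int i = 0; i < popSize; i++) pop[i] = rnd.nextLong();
        long best = pop[0];
        for (long p : pop) if (fitness(p) > fitness(best)) best = p;

        for (int g = 0; g < generations; g++) {
            long[] next = new long[popSize];
            next[0] = best;                             // elitist strategy: keep the best unmodified
            for (int i = 1; i < popSize; i++) {
                long a = select(pop, rnd);
                long b = select(pop, rnd);
                long child = a;
                if (rnd.nextDouble() < CROSSOVER_RATE) {
                    int point = 1 + rnd.nextInt(63);    // crossover point in 1..63
                    long mask = -1L >>> (64 - point);   // low `point` bits come from a
                    child = (a & mask) | (b & ~mask);
                }
                if (rnd.nextDouble() < MUTATION_RATE) {
                    child ^= 1L << rnd.nextInt(64);     // mutation: flip one random bit
                }
                next[i] = child;
            }
            pop = next;
            for (long p : pop) if (fitness(p) > fitness(best)) best = p;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(fitness(evolve(20, 60, 12345L)));
    }
}
```

Because of the elitist slot, the best fitness in a run can never decrease from one generation to the next.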


4. Details of Implementation Decisions

Since the design decisions are largely independent of one another, and many need to be presented

simultaneously, I have chosen to format this section by discussing each one on its own, as

orderly as possible.

AST representation: The members of the population could be represented in a

myriad of ways. Many people implementing genetic programming choose to create an

original representation of their functions. This, unfortunately, is due to the newness of the

field, and it hinders the progress of future work by allowing researchers to get

bogged down in minor details which could already be settled. I admire Franz

Rothlauf's efforts to correct this problem, and agree with him that the best representation

for my purposes is an AST. This allows for an easy crossover implementation, and

only necessitates moderate work to implement mutation.

Generational model: The generational model of a population allows for the

lifespan of each population member to be only a single generation, as opposed to the

Steady-State model, which not only selects members for reproduction but also selects

which member they will be replacing as well [8:134]. Both models use a constant-sized

population. I have chosen to go with the generational model here because of simplicity in

design.

Initial Population: The usual approach is to start with a completely random set of

programs. This seems unreasonable: why create programs which construct strings and

draw websites when we are looking for a program to add two numbers together? I have

taken an approach for which I could not find any literature, starting with a population of

programs similar to the one under random.java in the Appendix. In some cases, I have even

placed some initial code within the for-loop, to make alignments based on superficial

traits. I do not rule out the possibility that this second strategy may in effect steer the

search in the wrong direction, which is why I intend to run the process both with random.java

and with the other versions (entitled superficial.java). It is my hope that by providing some

base code, completely random results will be avoided and thus there is a better chance of

finding an optimal solution.

Fitness function: The fitness function, the measure by which we decide which

members yield the most desirable results, and thus have the most potential as

prototypes of our desired word aligner, seems obvious. The goal is to maximize

precision and recall while minimizing AER, as defined in [11:1]. Thus the fitness

function calls alignSentencePair of each population member on a small subset of the full

corpus, and returns the weighted sum of these numbers (where w1, w2, w3 are the

weights):

10 * [ w1 * P + w2 * R + w3 * (1 - AER) ]
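As a sanity check, the formula is easy to compute directly. Notably, the generation-0 row of the results table in Section 5 (P = 0.3658, R = 0.2258, AER = 0.6864, fitness 9.0520) is reproduced exactly with unit weights, suggesting w1 = w2 = w3 = 1 in the reported runs. The class name below is illustrative.

```java
// The fitness measure from the text: 10 * [ w1*P + w2*R + w3*(1 - AER) ].
public class FitnessFunction {

    public static double fitness(double p, double r, double aer,
                                 double w1, double w2, double w3) {
        return 10.0 * (w1 * p + w2 * r + w3 * (1.0 - aer));
    }

    public static void main(String[] args) {
        // Generation-0 numbers from the results table, with unit weights.
        System.out.println(fitness(0.3658, 0.2258, 0.6864, 1, 1, 1));
    }
}
```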

Countering Over-Fitting: An easy fix for over-fitting to the training

data used in the fitness function is to keep the training set dynamic. I have implemented

this by choosing a random set each time. Thus there is no worry of over-fitting to a

specific set of sentence pairs being learned on, since there is no specific set of sentence

pairs.
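The dynamic training set amounts to drawing a fresh random sample of sentence pairs each generation. A minimal sketch (names are mine; integer IDs stand in for sentence pairs):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Each generation draws a fresh random subset of the corpus for the fitness
// function, so there is no fixed training set to over-fit.
public class DynamicTrainingSet {

    public static <T> List<T> randomSubset(List<T> corpus, int size, Random rnd) {
        List<T> shuffled = new ArrayList<>(corpus);
        Collections.shuffle(shuffled, rnd);
        return shuffled.subList(0, Math.min(size, shuffled.size()));
    }

    public static void main(String[] args) {
        List<Integer> corpus = new ArrayList<>();
        for (int i = 0; i < 447; i++) corpus.add(i); // 447 sentence-pair IDs, as in the text
        Random rnd = new Random();
        System.out.println(randomSubset(corpus, 10, rnd)); // a fresh 10-pair sample each call
    }
}
```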


Figure 6: The Proportionate Fitness Selection Process is Akin to a Game of Darts.

Fitness-Proportionate Selection: There are two methods of selection in

widespread use: tournament and fitness-proportionate selection [3:37]. In tournament

selection, several tournaments are held in which fitness is calculated, and the winner of

each tournament is selected for reproduction. As the fitness function here requires no

small amount of time, for matters of efficiency I chose the less computationally

expensive selection process, fitness-proportionate selection. Here each member of the

population is evaluated once, and then the new generation is randomly selected, with

probability proportional to the fitness of each member [see Figure 6]. The general risk is

that with a wide variety of fitness values, those with the lower fitness values will be

excluded from selection and the diversity of the population will disappear prematurely,

leading to premature convergence. It is my hope that with a non-random initial

population, the disparity in fitness values will not be as dangerous as if the population

had been truly initialized randomly.
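The dartboard analogy of Figure 6 translates directly into code: the dart lands in [0, total fitness), and the slice it hits determines the parent. An illustrative sketch, with the dart position passed in explicitly so the mapping from dart to member is visible:

```java
import java.util.Random;

// Fitness-proportionate ("dartboard") selection: each member owns a slice of
// the board proportional to its fitness; a random dart in [0, 1) picks one.
public class RouletteSelection {

    public static int select(double[] fitness, double dart) {
        double total = 0;
        for (double f : fitness) total += f;
        double target = dart * total;
        double running = 0;
        for (int i = 0; i < fitness.length; i++) {
            running += fitness[i];
            if (target < running) return i;   // the dart landed in slice i
        }
        return fitness.length - 1;            // guard against rounding at the edge
    }

    public static void main(String[] args) {
        double[] fit = { 9.0, 3.0, 6.0 };     // slices of 50%, ~17%, ~33% of the board
        System.out.println(select(fit, new Random().nextDouble()));
    }
}
```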

Elitist Strategy: This is the decision to leave the best fit member of the population

in the next generation, unmodified, although this does not rule out the possibility of

including genetic reproductions of this member as well.

Copying the best of each generation to file: Looking back to the chart in Section 2,

Literature Review, we acknowledge that quality declines when one

allows the GP process to run for too many generations. To alleviate this, a copy of the

member with the highest fitness value in each generation is written to file in a separate

folder. The end process, where we test the evolved word aligners, tests the best-fit

member of each generation, not solely that of the final generation.

N-point crossover: The two common methods of crossover are single-point and n-point.

In single-point crossover, a single point in the genome is chosen and the two members

have their code swapped at this point. N-point crossover allows this to happen at

multiple points, and is much more suitable to crossover on trees, where we are not

dealing with the traditional fixed-length representation as in bit strings.
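On the traditional fixed-length representation the mechanics are easy to show. The sketch below performs n-point crossover on equal-length strings (illustrative names; cut points must be given sorted): between consecutive cut points the children draw from alternating parents.

```java
import java.util.Arrays;

// N-point crossover on fixed-length strings: toggle which parent contributes
// at each cut point, producing two complementary children.
public class NPointCrossover {

    // cutPoints must be sorted (binarySearch is used to detect them).
    public static String[] crossover(String a, String b, int[] cutPoints) {
        StringBuilder childA = new StringBuilder();
        StringBuilder childB = new StringBuilder();
        boolean swapped = false;
        for (int i = 0; i < a.length(); i++) {
            if (Arrays.binarySearch(cutPoints, i) >= 0) swapped = !swapped;
            childA.append(swapped ? b.charAt(i) : a.charAt(i));
            childB.append(swapped ? a.charAt(i) : b.charAt(i));
        }
        return new String[] { childA.toString(), childB.toString() };
    }

    public static void main(String[] args) {
        // 2-point crossover with cuts at positions 2 and 5.
        System.out.println(Arrays.toString(
                crossover("AAAAAAAA", "bbbbbbbb", new int[]{2, 5})));
    }
}
```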

Protected functions and variable types: To alleviate casting problems and protect

against null pointer exceptions, all functions made available to evolved programs are

protected versions which guard against invalid arguments. Refer to the WordAligner class

in the Appendix, which is the superclass of all other word aligner classes.
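As a sketch of what such protected primitives look like (modeled on the getString helper shown in the Appendix and the division-by-zero guard mentioned there, but with illustrative names): every operation is total, returning a safe default instead of throwing.

```java
import java.util.List;

// Protected primitives in the spirit of the WordAligner superclass: every
// call is total, so evolved code cannot crash on bad arguments.
public class ProtectedOps {

    // Division guarded against divide-by-zero; zero is the safe default.
    public static float div(float a, float b) {
        return b == 0 ? 0 : a / b;
    }

    // List access with the (rounded) index clamped into range; empty or null
    // lists yield an empty string instead of an exception.
    public static String getString(List<String> L, float i) {
        if (L == null || L.isEmpty()) return "";
        int idx = Math.round(i);
        if (idx >= L.size()) idx = L.size() - 1;
        if (idx < 0) idx = 0;
        return L.get(idx);
    }

    public static void main(String[] args) {
        System.out.println(div(1f, 0f));         // 0.0 instead of Infinity
        System.out.println(getString(null, 3f)); // "" instead of a NullPointerException
    }
}
```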

Confine alterations: To minimize the number of off-track members of the

population, alterations to the code are kept in the area where they matter the most. The

basic information needed by all word aligners is as is shown in the appendix for

random.java. Every word aligner should align each French word to some word in

English. Thus, only the body of the for loop is altered.

Halting problem: It may occur through mutation or crossover that infinite loops

are created [8:293-294]. To counter this, the fitness function makes use of threads and

halts after a reasonable amount of time has elapsed. This also serves as an additional

measure against excessive bloat.
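A hedged sketch of such a guard using java.util.concurrent (the paper does not show the actual implementation; the names and the zero-fitness default are mine): the evaluation runs in a worker thread, and a timed-out member simply scores nothing.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Fitness evaluation with a time limit: a member stuck in an infinite loop is
// abandoned after `millis` and scores zero, which also curbs extreme bloat.
public class TimedFitness {

    public static double evaluate(Callable<Double> fitnessFn, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Double> result = pool.submit(fitnessFn);
        try {
            return result.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);   // interrupt the runaway evaluation
            return 0.0;            // timed-out members score nothing
        } catch (Exception e) {
            return 0.0;            // members with runtime errors are filtered out too
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A well-behaved member finishes in time...
        System.out.println(evaluate(() -> 9.05, 1000));
        // ...while a looping member is cut off after 100 ms.
        System.out.println(evaluate(() -> {
            while (!Thread.currentThread().isInterrupted()) { }
            return 1.0;
        }, 100));
    }
}
```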

5. Results


As can be seen from the example code, mutation never worked as desired. Nearly

every mutation results in erroneous code. It seems the large majority are casting issues

(in the figure here, D3 is a double and S1 is a string, while H5 is a HashMap). Those

mutated programs which do compile include useless code, such as incrementing variables

used nowhere else in the program.


Without mutation, the rest of the genetic programming process is little more than a

search over the different orderings of the statements within the for loop, which holds no

possible word aligners not already visible at first glance.


Figure 7: Some Mutation Errors Still Mishandled.

Figure 8: Initial Code and Code with "Better" Result.


Multiple runs each proved futile. As can be seen in the figure above, the best

word aligner returned is only slightly improved in terms of performance. Looking at the

code, it appears to be a fluke: using the number of English words in place of the number

of French words tends to give better recall results (since English sentences are generally

shorter than their French counterparts, fewer proposed alignments allow for fewer

erroneous guesses).

Indeed, the generations proved of no use. In effect, the process merely stares at

effectively the same program each generation, since crossover alone is not enough to

introduce variety, and the initial population is not truly random (not with the mutation

process in its current state). The table below shows the evaluation results of the member

of each generation with the highest fitness.

Figure 9: Table of Fitness Function Results on a GP Run

Gen.   Precision   Recall   AER      Fitness
0      0.3658      0.2258   0.6864   9.0520
1      0.3658      0.2258   0.6864   9.0520
2      0.3535      0.2909   0.6678   9.7660
3      0.3658      0.2258   0.6864   9.0520
4      0.3658      0.2258   0.6864   9.0520
5      0.3658      0.2258   0.6864   9.0520
6      0.3935      0.1889   0.6966   8.8580
7      0.3658      0.2258   0.6864   9.0520
8      0.3658      0.2258   0.6864   9.0520
9      0.3658      0.2258   0.6864   9.0520
10     0.3658      0.2250   0.6864   9.0520
11     0.3658      0.2250   0.6864   9.0520
12     0.3658      0.2250   0.6864   9.0520
13     0.3658      0.2250   0.6864   9.0520

6. Conclusion


In addition to deciphering the process of mutating code in a meaningful way,

there are a few other tricks which I did not have time to experiment with, but which may

prove useful.

With regards to premature convergence, it would be interesting to add a feature

whereby the mutation rate is raised greatly for a generation to promote an increase in

diversity, triggered by a low standard deviation in the fitnesses.
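That trigger is straightforward to sketch: measure the spread of the current fitness values and, when it collapses, boost the mutation rate for the next generation. The threshold and boost factor below are illustrative guesses, not tuned values.

```java
// Raise the mutation rate for a generation when fitness diversity collapses.
public class DiversityTrigger {

    public static double stdDev(double[] fitnesses) {
        double mean = 0;
        for (double f : fitnesses) mean += f;
        mean /= fitnesses.length;
        double var = 0;
        for (double f : fitnesses) var += (f - mean) * (f - mean);
        return Math.sqrt(var / fitnesses.length);
    }

    // Illustrative policy: below a small spread threshold, boost 100x
    // (e.g. the 0.05% base rate becomes 5%) for the next generation.
    public static double nextMutationRate(double[] fitnesses, double baseRate) {
        return stdDev(fitnesses) < 0.01 ? baseRate * 100 : baseRate;
    }

    public static void main(String[] args) {
        double[] converged = { 9.052, 9.052, 9.052, 9.052 };
        double[] diverse = { 9.052, 9.766, 8.858, 7.5 };
        System.out.println(nextMutationRate(converged, 0.0005));
        System.out.println(nextMutationRate(diverse, 0.0005));
    }
}
```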

Since I was not able to get mutation working, I never actually started off with

random.java. I instead started off with the likeness of the file seen in Figure 8. From

what I have seen, it seems that initializing from this file yields a population

lacking in diversity, even with higher mutation rates in the initial generations. It would

be interesting to run a comparison between initializing from this file and from

random.java.

It is widely acknowledged that finding the correct values for the mutation

and crossover rates, the number of generations, and the population size, as well as choosing

a heuristic function, are all decisions still made by trial and error. Studying the

exact effects of raising and lowering each of these values will consume a considerable amount

of time, but is vital before much more work can be done in the area of genetic

programming in general.


Bibliography

1. Manning and Schütze. Foundations of Statistical Natural Language Processing,

pg. 484

2. Automatic Construction of a Bilingual Lexicon:

wwwhome.cs.utwente.nl/~irgroup/align/

3. Ghanea-Hercock. Applied Evolutionary Algorithms in Java

4. Genetic-Programming.Org: www.genetic-programming.org/

5. Grune, Bal, Jacobs, and Langendoen. Modern Compiler Design, pg. 9, 22, 52-55

6. Langdon and Poli. Foundations of Genetic Programming

7. Rothlauf. Representations for Genetic and Evolutionary Algorithms

8. Banzhaf, Nordin, Keller, and Francone. Genetic Programming: An Introduction

9. Gamma, Helm, Johnson, and Vlissides. Design Patterns

10. Visitor Pattern: http://en.wikipedia.org/wiki/Visitor_pattern

11. Assignment 4: Word Alignment Models: www.cs.berkeley.edu/~klein/cs294-5/cs294-5%20assignment%204.pdf

Figures

(original unless otherwise noted)

1. Abstract Syntax Tree in graphical and textual forms.

2. Example of Bloat.

3. After fitness is reached, overfitting to the training data may occur. [source:

Schmiedle F, Drechsler N, Grosse D, Drechsler R. “Heuristic learning based on

genetic programming.” Genetic Programming & Evolvable Machines, Vol. 3,

Dec. 2002, pg 376]

4. Example of Over-Fitting

5. Flow Chart of GP Approach. [chart source: Sette S, Boullart L. “Genetic

programming: principles and applications.” Engineering Applications of

Artificial Intelligence, Vol. 14, Dec. 2001, pg 728]


6. Proportionate Fitness Selection

7. Some Mutation Errors Still Mishandled

8. Initial Code and Code with "Better" Result.

9. Table of Fitness Function Results

Appended Code

Crossover: Takes two statements whose parents are of the same type (for/for, while/while, etc.).

The Eclipse AST toolkit requires that each node belong to a particular tree, so simply

switching trees is not possible; instead we clone the subtrees under their new owner with

the static copySubtree(targetAST, sourceNode) method.

private void crossover(int index1, Statement switch1, int index2, Statement switch2) {
    CompilationUnit cu1 = newPop[index1];
    CompilationUnit cu2 = newPop[index2];
    AST ast1 = cu1.getAST();
    AST ast2 = cu2.getAST();
    ASTNode p1 = switch1.getParent();
    ASTNode p2 = switch2.getParent();
    Statement switch1_under_ast2 = (Statement) ASTNode.copySubtree(ast2, switch1);
    Statement switch2_under_ast1 = (Statement) ASTNode.copySubtree(ast1, switch2);
    switch (p1.getNodeType()) {
    case ASTNode.BLOCK:
        List m1 = ((Block) p1).statements();
        List m2 = ((Block) p2).statements();
        m1.set(m1.indexOf(switch1), switch2_under_ast1);
        m2.set(m2.indexOf(switch2), switch1_under_ast2);
        break;
    case ASTNode.IF_STATEMENT:
        if (switch1.getLocationInParent().getId().equals("elseStatement")) {
            ((IfStatement) p2).setElseStatement(switch1_under_ast2);
            ((IfStatement) p1).setElseStatement(switch2_under_ast1);
        } else {
            ((IfStatement) p2).setThenStatement(switch1_under_ast2);
            ((IfStatement) p1).setThenStatement(switch2_under_ast1);
        }
        break;
    case ASTNode.WHILE_STATEMENT:
        ((WhileStatement) p2).setBody(switch1_under_ast2);
        ((WhileStatement) p1).setBody(switch2_under_ast1);
        break;
    case ASTNode.FOR_STATEMENT:
        ((ForStatement) p2).setBody(switch1_under_ast2);
        ((ForStatement) p1).setBody(switch2_under_ast1);
        break;
    default:
        throw new RuntimeException("unhandled crossover for nodeType: "
                + p1.getNodeType());
    }
}

Mutation: Uses the Visitor Pattern, extending org.eclipse.jdt.internal.corext.dom.GenericVisitor

with Mutator to implement mutation. Mutator is a file much too long to display here. The

essentials are that it randomly changes register names and values in the code, as well as

occasionally inserting newly generated lines of code and making calls to safely defined

methods (that is to say, a divide that checks for division by zero, etc.).

public void mutate(int index) {
    random.nextFloat() * interchangeableTable.size();
    Mutator mutator = new Mutator(seed, numRegisters);
    CompilationUnit cu = newPop[index];
    AST ast = cu.getAST();
    cu.accept(mutator);
}

WordAligner parent class: The following is edited due to length; redundant and

obvious methods have been abbreviated. The Statistics object contains data from an

initial pass over the corpus beforehand, gathering data such as is used in unsupervised

learning: Pr(f), Pr(e), Pr(f, e).

public class WordAligner {
    protected WordAligner(Statistics s) {
        statistics = s;
    }

    public Alignment alignSentencePair(SentencePair s) {
        return null;
    }

    public float prob_f(String f) {
        return (float) statistics.prob_f(f);
    }

    public float prob_e(String e) {
        return (float) statistics.prob_e(e);
    }

    public float prob_e_and_f(SentencePair s, String f, String e) {
        return (float) statistics.prob_f_and_e(s, f, e);
    }

    public List getFrenchWords(SentencePair s) {
        return s.getFrenchWords();
    }

    public List getEnglishWords(SentencePair s) {
        return s.getEnglishWords();
    }

    public float abs(float i) {
        return Math.abs(i);
    }

    public float numFrenchWordsInSentence(SentencePair s) {
        return s.getFrenchWords().size();
    }

    public float numEnglishWordsInSentence(SentencePair s) {
        return s.getEnglishWords().size();
    }

    public float getSentenceID(SentencePair s) {
        return s.getSentenceID();
    }

    public boolean addAlignment(float englishPosition, float frenchPosition, boolean sure) {
        int e = Math.round(englishPosition);
        int f = Math.round(frenchPosition);
        alignment.addAlignment(e, f, sure);
        return true;
    }

    /** GET methods **/
    public String getString(List L, float i) {
        if (L == null || L.size() == 0)
            return "";
        if (i >= L.size())
            i = L.size() - 1;
        if (i < 0)
            i = 0;
        return (String) L.get(Math.round(i));
    }

    // Also: getBoolean, getNumber

    /** ADD methods **/
    public boolean addString(List L, float i, String o) {
        if (L == null)
            L = new ArrayList();
        if (i >= L.size())
            i = L.size() - 1;
        if (i < 0)
            i = 0;
        L.add(Math.round(i), o);
        return true;
    }

    // Also: addBoolean, addNumber

    /** FIELDS **/
    public LinkedList L1 = new LinkedList();
    public LinkedList L2 = new LinkedList();
    public LinkedList L3 = new LinkedList();
    public LinkedList L4 = new LinkedList();
    public LinkedList L5 = new LinkedList();

    public float N1 = 0; public float N2 = 0;
    public float N3 = 0; public float N4 = 0;
    public float N5 = 0; public float N6 = 0;
    public float N7 = 0; public float N8 = 0;
    public float N9 = 0; public float N0 = 0;

    public boolean B1 = true; public boolean B2 = true;
    public boolean B3 = true; public boolean B4 = true;
    public boolean B5 = true;

    public String S1 = ""; public String S2 = "";
    public String S3 = ""; public String S4 = "";
    public String S5 = "";

    public Alignment alignment = new Alignment();
    public static Statistics statistics;
}

An extension of the WordAligner class: This is the base class for the random initialization.

Several instances of this class are made, then subjected to many generations at a higher-

than-normal mutation rate. Mutation occurs within the for-loop.

public class random extends WordAligner {
    public Alignment alignSentencePair(SentencePair sentencePair) {
        alignment = new Alignment();
        N1 = numEnglishWordsInSentence(sentencePair);
        N2 = numFrenchWordsInSentence(sentencePair);
        for (N3 = 0; N3 < N2; N3++) {
            B5 = addAlignment(N4, N3, true);
        }
        return alignment;
    }

    public random(Statistics s) {
        super(s);
    }
}
