

Memetic Comp. (2013) 5:19–33
DOI 10.1007/s12293-012-0092-0

REGULAR RESEARCH PAPER

A genotype-phenotype-fitness assessment protocol for evolutionary self-assembly Wang tiles design

Germán Terrazas · Natalio Krasnogor

Received: 5 March 2012 / Accepted: 9 September 2012 / Published online: 5 October 2012
© Springer-Verlag 2012

Abstract In a previous work we reported on the evolutionary design optimisation of self-assembling Wang tiles capable of arranging themselves into a target structure. Apart from the significant findings on how self-assembly is achieved, nothing has yet been said about the efficiency with which individuals were evolved, especially given that the mapping from genotype to phenotype and from phenotype to fitness is clearly a complex, stochastic and non-linear relationship. One of the most common procedures would be to run many experiments for different configurations followed by a fitness comparison, which is not only time-consuming but also inaccurate for such intricate mappings. In this paper we report on a complementary dual assessment protocol to analyse whether our genetic algorithm, using morphological image analyses as fitness function, is an effective methodology. Thus, we present fitness distance correlation to measure how effectively the fitness of an individual correlates to its genotypic distance to a known optimum, and introduce clustering as a mechanism to verify how the objective function can effectively differentiate between dissimilar phenotypes and classify similar ones for the purpose of selection.

Keywords Self-assembly · Evolutionary design · Genetic algorithm · Fitness distance correlation · Wang tiles · Evolutionary optimisation design

G. Terrazas (B) · N. Krasnogor
Interdisciplinary Computing and Complex Systems (ICOS) Research Group, School of Computer Science, University of Nottingham, Nottingham, UK
e-mail: [email protected]
URL: http://icos.cs.nott.ac.uk

N. Krasnogor
e-mail: [email protected]

1 Introduction

Self-assembly systems are characterized by inorganic or living entities that achieve global order as the result of local interactions within a particular closed environment [16]. Self-assembly is a key cooperative mechanism in nature. Surprisingly, it has received relatively little attention in computer science [28]. In [23], we defined the self-assembly Wang tiles system T_sys as a computational model of self-assembly. This system comprises a set of square tiles with labelled edges that move randomly across the Euclidean plane, forming aggregates or bouncing off as a result of their interaction (see Fig. 1). Cooperativity is an emergent feature of this system, where the combination of a certain number of tiles is required to initiate self-assembly [17,21].

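Formally speaking, T_sys is defined as follows:

$$
\begin{aligned}
T_{sys} &= (T, \Sigma, g, L, \tau) \\
T &= \{\, t \mid t = (c_0, c_1, c_2, c_3) \,\} \quad \text{where } c_0, c_1, c_2, c_3 \in \Sigma \\
g &:: \Sigma^2 \rightarrow \mathbb{Z}^+ \cup \{0\} \\
\tau &\in \mathbb{Z}^+ \cup \{0\}
\end{aligned}
\tag{1}
$$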

In this system, T is a non-empty finite set of Wang tile types t, each defined as a 4-tuple (c_0, c_1, c_2, c_3) that indicates the "glue" type labels at the north, east, south and west edges of a tile, Σ is a set of symbols representing glue type labels, g is called the glue function, L is a lattice and τ is a threshold modelling the kinetic energy of the system. The glue function g computes the strength associated with a given pair of glue types. The lattice L is a two-dimensional surface of size W × H composed of a finite set of interconnected unit square cells where tiles belonging to T are randomly located and perform random walks. Thus, when two or more tiles collide, the strength between the glue types at their colliding edges is calculated and compared to τ. If the resulting strength is greater than τ, the tiles self-assemble forming aggregates; otherwise they bounce off and keep moving.

Fig. 1 A step-by-step (a–h) aggregate formed as the result of the interaction between five self-assembly Wang tiles performing a random walk across a lattice
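The binding rule above can be illustrated with a minimal Python sketch. It covers only the pairwise case (two tiles meeting along one edge), and it assumes a symmetric glue function; the names `make_glue_function` and `sticks`, and the example glue strengths, are hypothetical and not taken from [23].

```python
from typing import Callable, Dict, Tuple

# A tile is a 4-tuple of glue labels at (north, east, south, west).
Tile = Tuple[str, str, str, str]

NORTH, EAST, SOUTH, WEST = range(4)
OPPOSITE = {NORTH: SOUTH, EAST: WEST, SOUTH: NORTH, WEST: EAST}


def make_glue_function(strengths: Dict[Tuple[str, str], int]) -> Callable[[str, str], int]:
    """Build g: Sigma^2 -> non-negative integers (assumed symmetric here)."""
    def g(a: str, b: str) -> int:
        # Unlisted glue pairs are given strength 0.
        return strengths.get((a, b), strengths.get((b, a), 0))
    return g


def sticks(t1: Tile, side: int, t2: Tile, g: Callable[[str, str], int], tau: int) -> bool:
    """True if t2, colliding with `side` of t1, binds: strength must exceed tau."""
    return g(t1[side], t2[OPPOSITE[side]]) > tau


# Hypothetical glue strengths and tiles, for illustration only.
g = make_glue_function({("a", "b"): 2, ("c", "c"): 1})
t1 = ("a", "c", "x", "x")
t2 = ("x", "x", "b", "c")
print(sticks(t1, NORTH, t2, g, tau=1))  # glue a meets glue b: strength 2 > 1
print(sticks(t1, EAST, t2, g, tau=1))   # glue c meets glue c: strength 1 is not > 1
```

When several tiles collide at once, the strengths along all colliding edges would have to be combined before the comparison with τ; that cooperative case is left out of the sketch.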

Finding the appropriate combination of autonomous entities capable of arranging themselves into a target configuration is a challenging open problem for the design and development of distributed cooperative systems. In [23] we addressed the self-assembly Wang tiles designability problem by means of artificial evolution. Our interest in combining self-assembly Wang tiles with evolutionary algorithms lies in the use of a method for the automated construction of supra-structures that emerge as the result of tile interactions. In particular, we seek to answer the following:

Given a collective target configuration, is it possible to automatically design, e.g. with an evolutionary algorithm, the local interactions so that they self-assemble into the desired target? That is, we want to design a set T = {t | t = (c_0, c_1, c_2, c_3)} such that tiles of type t self-assemble into a given target shape.

In order to address this question, our research centred on the use of a genetic algorithm (GA) to evolve a population of self-assembly Wang tile families. Broadly speaking, a self-assembly Wang tile family is a descriptor comprising a set of glue types, each of which is associated with one of the four sides of a self-assembly Wang tile. Each tile family is instantiated with an equal number of tiles, which are randomly located in a simulation environment where they drift and interact with one another, self-assembling into aggregates until the simulation runs its course. Once the simulation finishes, aggregates are compared for similarity to a user-defined (target) structure employing the Minkowski functionals [18,19]. This assembly assessment returns a numerical value that is taken as the fitness (Fitness_i) of each individual. Thus, individuals capable of creating aggregates similar to the specified target are better ranked and become the most likely to survive across generations. This process, together with one-point crossover and bitwise mutation operators, is applied to the entire population and repeated for a certain number of generations. In particular, our experiments comprised four increasingly rich/complex simulation environments: deterministic, probabilistic, deterministic with rotation and probabilistic with rotation. An illustration summarising our approach and its components is shown in Fig. 2.
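As a rough illustration of the two genetic operators named above, the following Python sketch applies one-point crossover and per-gene mutation to a genotype stored as a flat list of glue labels. This representation is an assumption made for readability: the text speaks of bitwise mutation, and the exact encoding used in [23] is not reproduced here. The function names and the mutation rate are hypothetical.

```python
import random


def one_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length genotypes (length >= 2) at a random cut."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]


def mutate(genome, alphabet, rate, rng):
    """Resample each gene from `alphabet` with probability `rate`."""
    return [rng.choice(alphabet) if rng.random() < rate else gene for gene in genome]


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
child1, child2 = one_point_crossover(list("aaaaaaaa"), list("bbbbbbbb"), rng)
print(child1, child2)
print(mutate(child1, "abc", rate=0.1, rng=rng))
```

Crossover only recombines existing glue labels, so new labels enter the population through mutation alone.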


Fig. 2 Evolutionary approach for the design optimisation of self-assembly Wang tiles. A population of self-assembling Wang tile families (genotypes) is randomly initialised. After that, each individual is set up in a tile simulator, from which the emerging self-assembled aggregates (phenotypes) are compared against a target structure for similarity. This comparison returns the fitness of the individual. Later on, the genetic operators are applied, where the best ranked individuals are likely to pass through the selection, crossover and mutation stages

The achieved results supported our evolutionary design approach as a successful engineering mechanism for the computer-aided design of self-assembly Wang tiles. Early evidence of our research on this topic is available in [22], where we employed a very simple evaluation mechanism composed of a lattice scanner fitness function, and later in [23], where morphological image analyses brought a more accurate and efficient way to collectively assess the assembled aggregates against the target. Since the mapping from genotype to phenotype and from phenotype to fitness is clearly a complex, stochastic and non-linear relationship, would it be possible to analyse the effectiveness of Minkowski functionals as fitness function? The aim of this paper is to address this question with a complementary dual assessment protocol which tells us whether the employed genetic algorithm is an effective design optimisation method for our problem. Thus, in the following section we introduce fitness distance correlation and cluster analysis, which set up the foundations of our novel protocol. Next we present how fitness distance correlation and cluster analysis are applied to our evolutionary design approach, together with experiments and results. Finally, conclusions and discussions follow.

2 The genotype-phenotype-fitness assessment protocol

The evolutionary design of self-assembly Wang tiles is characterised as a problem in which the mapping from genotype to phenotype and then from phenotype to fitness is a highly complex, non-linear and in some cases stochastic relationship. It is non-linear because genotypes (tile sets) with small differences may lead to widely diverging phenotypes, and stochastic because the same genotype, due to random effects, might produce a variety of end-products. This intricate relationship (see Fig. 3) makes the assessment of the genotype very difficult, since the same (different) fitness value could be assigned to different (the same) genotypes. Hence, in order to analyse the efficiency with which individuals were evolved, we present in this section the fundamental components of a complementary dual assessment protocol: fitness distance correlation and cluster analysis. We have previously used this approach for the analysis of complex meta-cellular automata [24]. The former measures how effectively the fitness of an individual correlates to its genotypic distance to a known optimum. The latter verifies how the objective function can effectively differentiate between dissimilar phenotypes and classify similar ones for the purpose of selection.

2.1 Fitness distance correlation

Fig. 3 The highly complex, non-linear and stochastic relationship taking place across the mapping from genotype to phenotype and then from phenotype to fitness

Fitness distance correlation (FDC) [10,11] is a summary statistic that performs a correlation analysis in terms of a known optimum and samples taken from the search space, predicting whether a GA will be effective for solving an optimisation problem. Thus, when facing a minimisation (maximisation) problem, a large positive (negative) correlation indicates that a GA may successfully treat the problem or that the problem is straightforward, whereas a large negative (positive) value suggests that employing a GA may not be effective or that the problem is misleading. However, a correlation around zero, i.e. −0.15 ≤ FDC ≤ 0.15, suggests that further analysis of fitness versus distance to the optimum, perhaps including scatter plots, should be done and, in general, the problem is categorized as difficult. The formula for computing the FDC is shown in Eq. 2, where n is the number of samples, f_i is the fitness of sample i with distance d_i to the known optimum, f̄ and S_F are the mean and standard deviation of the fitness values, and d̄ and S_D are the mean and standard deviation of the distances.

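$$
\begin{aligned}
FDC &= \frac{(1/n)\sum_{i=1}^{n} (f_i - \bar{f})(d_i - \bar{d})}{S_F\, S_D} \\
\bar{f} &= \frac{1}{n}\sum_{i=1}^{n} f_i \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i \\
S_F &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (f_i - \bar{f})^2} \\
S_D &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2}
\end{aligned}
\tag{2}
$$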
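Eq. 2 translates directly into a few lines of Python. The sketch below follows the formula as written, including the 1/n factor in the covariance and the 1/(n − 1) factor in the standard deviations, so a perfectly linear relationship yields (n − 1)/n rather than exactly 1. The helper `categorise` applies the ±0.15 band described above; both function names are illustrative choices rather than part of the original work.

```python
import math


def fdc(fitnesses, distances):
    """Fitness distance correlation as in Eq. 2."""
    n = len(fitnesses)
    if n != len(distances) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    f_mean = sum(fitnesses) / n
    d_mean = sum(distances) / n
    cov = sum((f - f_mean) * (d - d_mean) for f, d in zip(fitnesses, distances)) / n
    s_f = math.sqrt(sum((f - f_mean) ** 2 for f in fitnesses) / (n - 1))
    s_d = math.sqrt(sum((d - d_mean) ** 2 for d in distances) / (n - 1))
    if s_f == 0 or s_d == 0:
        raise ValueError("FDC is undefined when fitness or distance is constant")
    return cov / (s_f * s_d)


def categorise(r, minimising=True, band=0.15):
    """Map an FDC value to 'easy', 'difficult' or 'misleading'."""
    if abs(r) <= band:
        return "difficult"
    helpful = r > 0 if minimising else r < 0
    return "easy" if helpful else "misleading"


# Fitness grows linearly with distance: ideal for a minimisation problem.
r = fdc([1.0, 2.0, 3.0, 4.0], [1, 2, 3, 4])
print(r, categorise(r))  # 0.75, i.e. (n - 1) / n with n = 4
```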

A study of whether FDC predicts GA behaviour, and whether it detects differences in encoding and representation, for a number of well-studied maximisation problems is given in [11]. When predicting GA behaviour, FDC confirmed that Deb & Goldberg's 6-bit fully deceptive function and Whitley's 4-bit (F2 and F3) fully deceptive functions are indeed misleading, since the correlation values were 0.30, 0.51 and 0.36 respectively and the fitnesses tended to increase with the distance from the global optimum. In addition, FDC also confirmed that problems like Ackley's One Max, Two Max, Porcupine and NK landscape problems for K ≤ 3 are easy, since the correlation values were −0.83, −0.55, −0.88 and −0.35 respectively. Nevertheless, FDC indicated that the NK(12,11) landscape, Walsh polynomials on 32 bits with 32 terms each of order 8, Royal Road functions R1 and R2, as well as some of De Jong's functions like F2(12), are difficult, since the resulting correlation values were close to 0.0. When the differences in encoding and representation were considered, experiments using Gray and binary coding led to the conclusion that the superiority of one over the other depends on the number of bits used to encode the numeric values. For instance, De Jong's F2 with binary coding is likely to make the search easier than with Gray coding when using 8 bits. In contrast, the correlation value of F12 indicated that Gray coding works better than binary when using 12 or 24 bits. Despite its successful application to a wide benchmarking set of problems, FDC is still not considered to be a very accurate predictor for some other problems. For instance, a case where FDC failed as a difficulty predictor was presented when studying a GA maximising Ridge functions [20].

In summary, although FDC cannot be expected to be a perfect predictor of performance, previous work reported in [1,13,25–27] suggests that it is indeed a good indicator of performance. Our goal is then to assess how effectively the fitness of an individual correlates to its genotype when using Minkowski functionals as fitness function for the GA presented in [23].

2.2 Cluster analysis

One of the characteristics of our problem is non-linearity, i.e. several different genotypes can encode the same phenotype and hence introduce severe noise in the FDC analysis. Since selection is ultimately based on fitness, which in turn depends on the phenotype, studying the phenotype-fitness mapping could shed light on why the GA worked quite well. For this reason, this section introduces clustering as a method for analysing the phenotype-fitness relationship in self-assembly Wang tiles design optimisation.

Cluster analysis or clustering is a technique for grouping objects according to their similarities [2,3,6,9]. In contrast to classification, clustering is an unsupervised task in which a set of objects is partitioned into groups, called clusters, according to their proximities such that those belonging to a cluster are more similar to each other than to objects in a different cluster. A clustering procedure comprises four basic stages: feature selection or extraction, clustering algorithm selection, cluster validation and results interpretation.

In the first stage, the features by which the objects will be distinguished are chosen. The resulting pairwise affinity is then used to compute a proximity matrix to which a clustering strategy is applied. The resulting partition of the data is subject to subsequent testing criteria in order to validate the clustering process. Finally, a visualization and interpretation of the achieved clusters closes the procedure with the hope of providing meaningful insights into the original data. The inter-relation among these tasks can be seen as a sequential procedure, as detailed in Fig. 4.

Clustering algorithms are classified as hierarchical, partitioning, density-based partitioning, grid-based, evolutionary methods and so forth in [2,7,14]. Although there are slight differences among the proposed taxonomies, they share many common features [9]. For instance, according to their structure and operation, a clustering algorithm is agglomerative if clusters arise from singletons (bottom-up) or divisive if one super cluster is split into several ones (top-down). The sequential or simultaneous use of object features in the clustering process also plays an important role, as it defines whether the algorithm is monothetic or polythetic. Additionally, some methodologies allow objects to belong to a single cluster (hard classification) or to multiple clusters (fuzzy classification). Besides these three characterizations, a clustering algorithm can also be deterministic, stochastic, incremental or non-incremental if there are constraints on execution time or memory affecting the architecture of the algorithm.

Fig. 4 Clustering procedure comprising feature selection, inter-object comparison, clustering, validation of data partition and results interpretation

The clustering methods appearing in the literature are mainly variants of hierarchical agglomerative clustering. Among them, single-link (SLINK), complete-link (CLINK) and minimum-variance [12] are the best known; they differ in the way they characterise the similarity between pairs of clusters. In addition, the Unweighted Pair Group Method using arithmetic Averages (UPGMA), the Weighted Pair Group Method using arithmetic Averages (WPGMA), the Unweighted Pair Group Method using Centroids (UPGMC) and the Weighted Pair Group Method using Centroids (WPGMC) are also broadly employed in many applications [5,8,15].

3 Correlating the self-assembly Wang tiles design

Since different sets of tiles may self-assemble into aggregates similar in shape to the target structure, it is of interest to study how effectively the fitness of an individual correlates to its genotypic distance to a known optimum (see Fig. 5). In the rest of this section, we carry out an FDC analysis in order to study the effectiveness of Minkowski functionals as fitness function for the evaluation of the achieved self-assembled aggregates.

3.1 Experiments and results

Fig. 5 Diagram of mappings from genotype onto phenotype and from phenotype onto numerical fitness value, and relationship to the fitness distance correlation

In order to perform an FDC analysis, we first choose the best individual found among the four simulation environments. In this case, the best individual belongs to the results achieved when using the probabilistic simulator without rotation. Next, a data set comprising 500 individuals at different Hamming distances from the best individual is created. In particular, given two individuals Ind_i and Ind_j of the same length, their Hamming distance H is defined as in Eq. 3.

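$$
\begin{aligned}
H(Ind_i, Ind_j) &= \sum_{k=1}^{n} diff(T^i_k, T^j_k) \\
diff(T_i, T_j) &= \sum_{l=0}^{3} c^i_l \ominus c^j_l \\
c_a \ominus c_b &=
\begin{cases}
1 & \text{if } c_a \neq c_b \\
0 & \text{otherwise}
\end{cases}
\qquad c_a, c_b \in \Sigma
\end{aligned}
\tag{3}
$$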
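Eq. 3 amounts to counting, tile by tile and edge by edge, the positions at which two individuals carry different glue labels. A minimal Python sketch follows, assuming an individual is stored as a sequence of 4-tuples of glue labels; the names `tile_diff` and `hamming` are illustrative.

```python
def tile_diff(tile_a, tile_b):
    """Number of edges (out of four) whose glue labels differ."""
    return sum(1 for a, b in zip(tile_a, tile_b) if a != b)


def hamming(ind_a, ind_b):
    """Hamming distance between two individuals with the same number of tiles."""
    if len(ind_a) != len(ind_b):
        raise ValueError("individuals must have the same length")
    return sum(tile_diff(a, b) for a, b in zip(ind_a, ind_b))


ind_1 = [("a", "b", "c", "d"), ("a", "a", "a", "a")]
ind_2 = [("a", "b", "c", "x"), ("b", "a", "a", "b")]
print(hamming(ind_1, ind_2))  # 1 differing edge in the first tile + 2 in the second = 3
```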

Thus, this data set of 500 individuals comprises all the possible chromosomes at a Hamming distance of 1 plus some other randomly generated individuals at greater distance, all of them systematically generated following the pseudocode described in Algorithm 1, where DuplicateReplacing(T_k, c_l, c_new) duplicates tile T_k replacing glue type c_l with c_new, DuplicateReplacing(Ind_i, T_k, T_new) duplicates individual Ind_i replacing tile T_k with T_new, TileAt(Ind_i, k) returns the tile at position k of an individual, and Replace(T_k, c_l, c_new) replaces glue type c_l in tile T_k with glue type c_new.

In total, each of the generated individuals was simulated 5 times, giving as a result a group with an equal number of final configurations. Each configuration in turn was considered as a target (Conf_T) against which the remaining configurations of all the groups (Conf_i) were evaluated on fitness (f_i) and on the distance (d_i) between their associated genotypes (see Eq. 4).

$$
\begin{aligned}
d_i &= H(ind_i, ind_T) \\
f_i &= f(Conf_i) = Eval(Conf_i, Conf_T) = \sqrt{(\Delta A)^2 + (\Delta P)^2 + (\Delta N)^2} \\
\Delta A &= \max\{A^T_1, \ldots, A^T_m\} - \max\{A^i_1, \ldots, A^i_n\} \\
\Delta P &= \sum_{k=1}^{m} P^T_k - \sum_{k=1}^{n} P^i_k \\
\Delta N &= m - n
\end{aligned}
\tag{4}
$$

Algorithm 1 GenerateIndividuals
Require: Ind, an individual
Ensure: S, a set of individuals
for all tiles T_k in Ind do
    for all glue types c_l in T_k do
        for all glue types c_new ∈ Σ do
            T_new ← DuplicateReplacing(T_k, c_l, c_new)
            Ind_new ← DuplicateReplacing(Ind, T_k, T_new)
            Insert(S, Ind_new)
        end for
    end for
end for
while |S| < 500 do
    Ind_new ← Duplicate(randomly chosen Ind_i ∈ S)
    n ← Random(0, |Ind_new|)
    for all k up to n do
        T_k ← TileAt(Ind_new, k)
        m ← Random(0, 3)
        for all l up to m do
            c_new ← Random(Σ \ {c_l})
            Replace(T_k, c_l, c_new)
            Insert(S, Ind_new)
        end for
    end for
end while
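The sampling idea behind Algorithm 1 can be sketched in Python as follows. This is a simplified reading, not the authors' implementation: it enumerates every individual at Hamming distance 1, then pads the sample with random multi-edit variants, changing one glue label per selected tile rather than up to four. Individuals are assumed to be tuples of 4-tuples, the glue alphabet needs at least two labels, and the requested size must not exceed the number of distinct individuals, otherwise the loop never ends.

```python
import random


def neighbours_at_distance_one(ind, alphabet):
    """All individuals differing from `ind` in exactly one glue label."""
    result = []
    for k, tile in enumerate(ind):
        for l, glue in enumerate(tile):
            for new in alphabet:
                if new != glue:
                    new_tile = tile[:l] + (new,) + tile[l + 1:]
                    result.append(ind[:k] + (new_tile,) + ind[k + 1:])
    return result


def generate_individuals(ind, alphabet, size, rng):
    """Distance-1 neighbours plus random variants until `size` is reached."""
    pool = neighbours_at_distance_one(ind, alphabet)
    sample = set(pool)
    while len(sample) < size:
        candidate = list(rng.choice(pool))
        n_tiles = rng.randint(1, len(candidate))
        # Change one randomly chosen glue label in each selected tile.
        for k in rng.sample(range(len(candidate)), n_tiles):
            l = rng.randrange(4)
            tile = candidate[k]
            options = [c for c in alphabet if c != tile[l]]
            candidate[k] = tile[:l] + (rng.choice(options),) + tile[l + 1:]
        candidate = tuple(candidate)
        if candidate != ind:  # keep the reference individual out of the sample
            sample.add(candidate)
    return sample


best = (("a", "b", "a", "b"), ("b", "b", "a", "a"))  # hypothetical two-tile family
sample = generate_individuals(best, "abc", size=30, rng=random.Random(1))
print(len(sample))
```

With two tiles and three glue labels there are 2 × 4 × 2 = 16 neighbours at distance 1, so the remaining 14 individuals in this example come from the random padding step.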

Since a configuration comprises a collection of aggregates, a way is needed to perform an evaluation involving all its aggregates collectively. For this reason, considering a target configuration Conf_T and an arbitrary one Conf_i with aggregates {A^T_1, ..., A^T_m} and {A^i_1, ..., A^i_n} respectively, Conf_i is evaluated against Conf_T in terms of the difference in areas, perimeters and number of aggregates, as shown in Eq. 4. As an example, Fig. 6 shows both a target configuration Conf_T and an arbitrary configuration Conf_i, the fitness of which is calculated in terms of areas and perimeters in Eq. 5.

Fig. 6 A target configuration (Conf_T) towards which the fitness of an arbitrary configuration (Conf_i) is calculated in terms of areas and perimeters. Tags A^T_j, P^T_j and A^i_k, P^i_k label area and perimeter values for aggregates in the Conf_T and Conf_i configurations respectively

Fig. 7 Proportion of FDC values falling into the difficult, misleading and easy-to-solve categories. From the 2,500 analyses performed over 500 individuals, only 4.68 % reveal that a GA may successfully treat the problem

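$$
\begin{aligned}
\Delta A &= \max\{A^T_1, \ldots, A^T_{13}\} - \max\{A^i_1, \ldots, A^i_{11}\} = 57 - 46 = 11 \\
\Delta P &= \sum_{k=1}^{13} P^T_k - \sum_{k=1}^{11} P^i_k = 260 - 255 = 5 \\
\Delta N &= m - n = 13 - 11 = 2 \\
f_i &= \sqrt{(\Delta A)^2 + (\Delta P)^2 + (\Delta N)^2} = \sqrt{11^2 + 5^2 + 2^2} = 12.24745
\end{aligned}
\tag{5}
$$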
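The evaluation of Eq. 4 and the worked example of Eq. 5 can be reproduced with the short Python sketch below. It assumes a configuration is represented as a non-empty list of (area, perimeter) pairs, one per aggregate. The per-aggregate values are made up: only the totals (largest areas 57 and 46, perimeter sums 260 and 255, and 13 and 11 aggregates) are taken from Eq. 5, since the individual values of Fig. 6 are not listed in the text.

```python
import math


def evaluate(conf_i, conf_t):
    """Eval(Conf_i, Conf_T) from Eq. 4; each configuration is a list of
    (area, perimeter) pairs, one per aggregate."""
    d_area = max(a for a, _ in conf_t) - max(a for a, _ in conf_i)
    d_perimeter = sum(p for _, p in conf_t) - sum(p for _, p in conf_i)
    d_count = len(conf_t) - len(conf_i)
    return math.sqrt(d_area ** 2 + d_perimeter ** 2 + d_count ** 2)


# Hypothetical aggregates chosen only to match the totals used in Eq. 5.
conf_t = [(57, 20)] + [(10, 20)] * 12   # 13 aggregates, perimeters sum to 260
conf_i = [(46, 25)] + [(10, 23)] * 10   # 11 aggregates, perimeters sum to 255
print(round(evaluate(conf_i, conf_t), 5))  # 12.24745
```

Because every difference is squared, the measure is symmetric in its two arguments, which is what allows it to fill a symmetric proximity matrix later on.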

After performing the calculations, the findings show that the FDC values range from −0.331444 to 0.281457. Since Eq. 4 defines a minimisation, 50.60 % of the FDC values indicate that using a GA may not be effective, 44.72 % that the problem is difficult to solve and 4.68 % that the GA may successfully treat the problem (see Fig. 7).

In particular, visual inspection of the scatter plots obtained from the values falling into the smallest percentage shows good correlation for some individuals. A representative sample of these is depicted in Fig. 8, and the rest of the experiments are provided at http://www.cs.nott.ac.uk/~gzt/fdcminkowski. For each plot, ind_ij identifies the j-th simulation of individual i, where j ∈ {a, b, c, d, e} and i ∈ {1, ..., 500}. Hence, from the sampling of 500 individuals and 2,500 simulations subject to FDC analyses, it emerges that employing Minkowski functionals as fitness function offers a relatively satisfactory correlation on the genotype-fitness relationship only for half of the putative samples.

Fig. 8 Scatter plots and correlation coefficients for the self-assembly Wang tiles model, showing that the Minkowski functionals have a relatively satisfactory correlation with the genotype for some of the self-assembly Wang tile families

Contrary to the interpretations given by some of the FDC figures seen in this section, the results reported in [23] reveal that using Minkowski functionals as the evaluation method of a GA has positively addressed the self-assembly Wang tiles design problem. Hence, analyses of the phenotype-fitness relationship are expected to shed light on the reasons why this evolutionary approach has been effective.

4 Clustering the self-assembly Wang tiles design

In this section, a clustering procedure is applied at the phenotype level with the aim of obtaining a better insight into, and verifying, the phenotype-fitness relationship (see Fig. 9). To this end, the cluster analysis is carried out over the final configurations obtained by the individuals created with the algorithm of Sect. 3.1, and the resulting findings are used as a complementary assessment of the Minkowski functionals as fitness function.

Fig. 9 Diagram of mappings from genotype onto phenotype and from phenotype onto numerical fitness value, and relationship to clustering analysis

4.1 Experiments and results

In order to perform the clustering experiments, the whole set of final configurations obtained from the self-assembly Wang tile simulations performed in Sect. 3.1 is employed. As these configurations comprise the actual clustering input data, the features by which the objects are expected to be distinguished are the number of aggregates, their perimeters and the biggest aggregate area. The measure of affinity between each pair of configurations (Conf_i, Conf_j) is computed in terms of the evaluation function (see Eq. 4) and stored in the similarity matrix M_s defined in Eq. 6.

$$
M_s[i, j] =
\begin{cases}
Eval(Conf_i, Conf_j) & \text{if } i \neq j \\
0 & \text{otherwise}
\end{cases}
\tag{6}
$$

where M_s is a proximity matrix such that M_s[i, j] = M_s[j, i], Conf_i and Conf_j are final configurations, and 1 ≤ i, j ≤ 2,500.
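A proximity matrix of this kind can be filled by evaluating each unordered pair once and mirroring the result. The Python sketch below takes the pairwise measure as a parameter, so any symmetric function can be plugged in; the name `proximity_matrix` and the toy measure in the example are hypothetical.

```python
def proximity_matrix(configs, evaluate):
    """Symmetric matrix with evaluate(configs[i], configs[j]) off the diagonal
    and 0 on the diagonal, as in Eq. 6. Assumes `evaluate` is symmetric."""
    n = len(configs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = evaluate(configs[i], configs[j])
    return matrix


# Toy example: "configurations" are plain numbers, the measure is |a - b|.
m = proximity_matrix([1, 4, 6], lambda a, b: abs(a - b))
for row in m:
    print(row)
```

For the 2,500 configurations considered here this requires 2,500 × 2,499 / 2 = 3,123,750 pairwise evaluations.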

Although a number of different clustering methods and representations are available, the unweighted pair-group method using arithmetic averages (UPGMA) [4] has been chosen along with a logarithmic dendrogram representation to visualize and interpret the data partition. The pseudocode for the UPGMA clustering algorithm is outlined in Algorithm 2, where MergeRows joins the content of row i and row j, MergeColumns joins the content of column i and column j, and MakeNode associates i and j in a node that InsertNode adds to the hierarchical structure T.
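For readers who prefer running code, the following is a small pure-Python sketch of average-linkage agglomeration over a proximity matrix. It is not a transcription of Algorithm 2: when two clusters are merged, their distances to every other cluster are averaged with weights proportional to the cluster sizes, which is the usual definition of UPGMA, whereas Algorithm 2 as printed averages the two rows directly. The function name and the nested-tuple tree format are illustrative choices, and the matrix must be non-empty.

```python
def upgma(matrix):
    """Average-linkage clustering of a symmetric proximity matrix.

    Returns a nested tuple (left, right, height); leaves are row indices.
    """
    n = len(matrix)
    clusters = {i: (i, 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters (keys are ordered pairs).
        (a, b), height = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        # Size-weighted average distance from the merged cluster to the rest.
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, height), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Three points on a line at positions 0, 1 and 10.
print(upgma([[0, 1, 10], [1, 0, 9], [10, 9, 0]]))
```

The sketch rescans all pairwise distances at every merge, so it is only meant for small inputs; for 2,500 configurations an optimised library routine would be the practical choice.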

The resultant data partition, showing eight clusters labelled A, B, C, D, E, F, G and H, is depicted in Fig. 10. By sampling representative configurations from each of these clusters, it is possible to observe that the data has been well partitioned, since the distribution and morphology of the aggregates are mostly similar within each cluster.

Algorithm 2 MakeCluster
Require: M_s, a symmetric proximity matrix
Ensure: T, a hierarchical structure
while dimension(M_s) ≥ 3 × 3 do
    for all i, j such that i ≠ j do
        minimum ← min(M_s[i, j], minimum)
    end for
    MergeRows(M_s, i, j)
    MergeColumns(M_s, i, j)
    M_s[k, ij] ← avg(M_s[k, i], M_s[k, j])
    M_s[ij, k] ← avg(M_s[i, k], M_s[j, k])
    node ← MakeNode(i, j)
    InsertNode(T, node, minimum)
end while

Fig. 10 Illustration of the logarithmic cluster tree for self-assembly Wang tiles configurations, individuals of which were obtained with Algorithm 1

On the one hand, by analysing the partitions located at the top part of the dendrogram, the configurations of cluster H reveal scattered tiles with no or very few small aggregates, like those appearing in the snapshot of Fig. 11a. Conversely, representatives of cluster G reveal large aggregates with very few unassembled tiles, as in the configuration depicted in Fig. 11b. Close to these two types of partitions, the configurations of cluster F have achieved assemblies which are either large and merged with some small others, as shown in Fig. 11c, or medium-size aggregates combined with a few others of minor area, as shown in the configuration in Fig. 11d.

On the other hand, from an analysis of some representatives belonging to the bottom right partitions labelled A and B, it is evident that the morphology of the aggregates is contrary to the ones described before. For instance, the configurations observed in A comprise either aggregates with dendritic shape along with scattered tiles, as shown in Fig. 12a, or small strips also sharing the lattice with a few unconnected tiles, as depicted in Fig. 12b. Similarly, the aggregates found in cluster B also comprise small rectangular aggregates, although mixed with T-shaped and L-shaped structures in most cases, as appears in the configurations of Fig. 12c, d.

Fig. 11 Representatives of three clusters: a scattered tiles and small size aggregates characterise partition H; b large aggregates feature in partition G; c–d large and small size aggregates as well as medium and small size aggregates characterise partition F

The three partitions located at the central part of the dendrogram seem to represent a transition between the two groups analysed above. For instance, the configurations belonging to cluster D mostly show medium-size strips together with a vast number of scattered tiles distributed across the lattice, as shown in Fig. 13a, b. Configurations with similar morphology and a reduced number of scattered tiles are among those observed in cluster E (see Fig. 13c). Moreover, the same partition also includes another type of configuration where aggregates are usually large and, in many cases, surrounded by scattered tiles (see Fig. 13d). Finally, the occurrence of large- and medium-size aggregates combined with scattered tiles is the common feature that identifies the configurations observed in partition C (see Fig. 14).

To summarise, the findings achieved after applying cluster analyses over the 2,500 final configurations, i.e. the product of simulating 500 individuals, support the view that there is in fact an acceptably high correlation between the phenotypes and their fitness values. In other words, these findings verify that Minkowski functionals can effectively differentiate between dissimilar phenotypes and classify similar ones for the purpose of selection.
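As an illustration of the kind of analysis summarised above, the following is a minimal sketch and not the implementation used in this work. It assumes the standard pixel-counting form of the three two-dimensional Minkowski functionals (area, perimeter and Euler characteristic) for a binary lattice, and groups the resulting feature vectors with a plain average-linkage agglomerative procedure. The function names, the toy lattices and the choice of Euclidean distance are all hypothetical.

```python
import math

def minkowski_functionals(grid):
    """Return (area, perimeter, euler) for the '#' cells of a list of strings.

    Assumed formulas: area = n_s, perimeter = -4*n_s + 2*n_e,
    euler = n_s - n_e + n_v, where n_s, n_e and n_v count the distinct
    unit squares, edges and vertices covered by the occupied cells.
    """
    squares, edges, vertices = 0, set(), set()
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != "#":
                continue
            squares += 1
            # Four corners of the unit square at (r, c).
            vertices.update({(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)})
            # Top, bottom, left and right edges; shared edges collapse in the set.
            edges.update({("h", r, c), ("h", r + 1, c), ("v", r, c), ("v", r, c + 1)})
    perimeter = -4 * squares + 2 * len(edges)
    euler = squares - len(edges) + len(vertices)
    return squares, perimeter, euler

def agglomerate(points, k):
    """Average-linkage agglomerative clustering; returns k lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters

# Hypothetical final configurations: two with scattered tiles, two with one aggregate.
configurations = [
    ["#.#..", ".....", "#.#..", ".....", "....."],
    ["#.#.#", ".....", "#.#..", ".....", "....."],
    ["##...", "##...", ".....", ".....", "....."],
    ["###..", "###..", ".....", ".....", "....."],
]
features = [minkowski_functionals(g) for g in configurations]
clusters = sorted(sorted(c) for c in agglomerate(features, 2))
print(features)
print(clusters)
```

In this toy setting the scattered configurations have a high Euler characteristic (one per isolated tile) and a long perimeter, whereas the compact aggregates have an Euler characteristic of one, which is what lets the feature vectors separate the two morphologies.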

5 Summary

As a general conclusion, the combination of FDC plus cluster analysis indicates that Minkowski functionals are suitable for the evolutionary design optimisation of self-assembly Wang tiles. On the one hand, the results obtained with the morphological image analyses support the use of Minkowski functionals as fitness function, although only 5 % of the FDC analyses applied to the systematically obtained individuals revealed that the use of a GA may successfully solve the problem. On the other hand, the cluster analyses have accurately classified the configurations according to their morphological features, supporting the way in which the fitness function evaluates the self-assembled aggregates.

Hence, the application of our dual methodology, shown along Sects. 3 and 4, reveals that employing a fitness function in terms of Minkowski functionals for the evolutionary design optimisation of self-assembly Wang tiles results in a complex mechanism of evaluation where, although its success as phenotype evaluator seems to be appropriate, a different type of analysis is needed for an assessment of how effectively an individual correlates to its genotypic distance to a known optimum.
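For readers unfamiliar with FDC, the quantity referred to above can be sketched as the Pearson correlation between the fitness of sampled individuals and their genotypic distance to a known optimum, following the definition of Jones and Forrest [11]. The code below is a minimal illustration under stated assumptions: genotypes are represented as equal-length bit strings, the distance is the Hamming distance, and the fitness is a toy count of ones. None of these choices is taken from the encoding used in this work, and all names are hypothetical.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def fitness_distance_correlation(fitnesses, distances):
    """Pearson correlation between fitness values and distances to the optimum."""
    n = len(fitnesses)
    if n != len(distances) or n < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    mean_f = sum(fitnesses) / n
    mean_d = sum(distances) / n
    cov = sum((f - mean_f) * (d - mean_d) for f, d in zip(fitnesses, distances)) / n
    sd_f = math.sqrt(sum((f - mean_f) ** 2 for f in fitnesses) / n)
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in distances) / n)
    if sd_f == 0 or sd_d == 0:
        raise ValueError("FDC is undefined when fitness or distance is constant")
    return cov / (sd_f * sd_d)

# Toy data (hypothetical): fitness is the number of ones, to be maximised.
optimum = "1111"
sample = ["0000", "1000", "1100", "0111", "1111", "0101"]
fitnesses = [g.count("1") for g in sample]
distances = [hamming(g, optimum) for g in sample]
r = fitness_distance_correlation(fitnesses, distances)
print(f"FDC r = {r:.3f}")
```

Under maximisation, a coefficient close to -1 means that fitness increases steadily as individuals approach the optimum, which is the favourable case for a GA; for a stochastic, many-to-many mapping such as the one studied here, each fitness value would itself be an estimate over repeated simulations.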


Fig. 12 Representatives of two clusters: a–b dendritic aggregates along with scattered tiles and small strips with few unconnected tiles characterise partition A; c–d the appearance of T-shaped and L-shaped structures characterise partition B

Considering the results presented here, we can conclude that the combination of clustering and FDC is a dual assessment that gives an accurate indication of the quality of the encoding, i.e. the genotype, of its mapping to phenotype, and of Minkowski functionals as fitness function. Therefore, the application of this methodology before starting long and expensive evolutionary runs should be considered in any problem where the genotype–phenotype–fitness mapping is complex, stochastic, many-to-many and computationally expensive. Thus, this analysis protocol is a contribution of general interest beyond automated self-assembly design.

6 Conclusions and future work

This paper has presented a dual assessment protocol to study the effectiveness of a GA employed as a method for the design optimisation of self-assembly Wang tiles. Such is the complexity of the genotype–phenotype–fitness mapping that FDC alone cannot be guaranteed to give a completely accurate picture. Indeed, the objective function itself is also only an approximation of two individuals' phenotypic similarity. For these reasons, relying on only FDC or only clustering to validate complex problems would not be adequate, and we therefore combined both methods with the aim of showing whether a given fitness function is a suitable evaluation mechanism for the evolutionary design problem addressed in [23].

Overall, we have contributed a complementary dual assessment for validating complex, stochastic, non-linear genotype–phenotype–fitness relationships. The proposed methodology combines FDC and clustering validation techniques, the results of which revealed that Minkowski functionals are suitable fitness functions for the evaluation of self-assembled structures. In addition, these indicators have contributed a more general result, which is a method for assessing the quality of the encoding and the accuracy of its mapping to phenotype in evolutionary systems where genotype–phenotype–fitness is an intricate relationship.

Self-assembly can be seen as an information-driven process and hence be better exploited by directly linking it to computation. Taken as an operational hypothesis, this assumption implies that desired emergent phenomena could in principle be programmed into self-assembling nanosystems. Thus, we are currently focused on employing the self-assembly Wang tiles model for the design and exploitation of molecular self-assembly, in particular porphyrins-based nanotiles. Porphyrins are suitable molecules since they have four-fold symmetry, can be synthesized with different functional groups, and are planar, which makes them ideal for surface deposition on solid substrates (see Fig. 15). In this way, given a set of self-assembly Wang tiles that self-assemble into specific patterns, our goal is to manufacture porphyrin molecules which would be able to self-assemble into similar structures by exploiting preferential intermolecular values.

Fig. 13 Representatives of two clusters: a–b medium strips surrounded by a vast number of scattered tiles distributed across the lattice identify partition D; c–d a number of scattered tiles approaching nil and some big aggregates characterise partition E

Fig. 14 Two representative configurations of partition C showing large- and medium-size aggregates combined with unassembled tiles distributed across the lattice


Fig. 15 Design and exploitation of porphyrins-based nanotiles. Self-assembly Wang tiles that create aggregates are seen as blueprints for the manufacture of functionalised porphyrins. These molecules would then be deposited on a solid substrate where intermolecular binding strengths give origin to porphyrins self-assembly

Acknowledgments The research reported here is funded by EPSRC grant EP/H010432/1 Evolutionary Optimisation of Self Assembling Nano-Designs (ExIStENcE) and a Leadership Fellowship (Natalio Krasnogor) EP/J004111/1.

References

1. Altenberg L (1997) Fitness distance correlation analysis: an instructive counterexample. In: 7th International conference on genetic algorithms, Morgan Kaufmann, San Francisco, CA, USA, pp 57–64

2. Berkhin P (2002) Survey of Clustering Data Mining Techniques. Tech. rep, Accrue Software, San Jose, CA, USA

3. Berkhin P (2006) A survey of clustering data mining techniques. In: Nicholas C, Teboulle M, Kogan J (eds) Grouping multidimensional data. Springer, Berlin, pp 25–71

4. Brzustowski J Clustering Calculator, http://www2.biology.ualberta.ca/jbrzusto/cluster.php

5. Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge

6. Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. Intell Inf Syst 17(2–3):107–145

7. Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79(1–3):191–215

8. Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC (2005) Whole-genome prokaryotic phylogeny. Bioinformatics 21(10):2329–2335

9. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

10. Jones T (1995) Evolutionary algorithms, fitness landscapes and search. PhD thesis, University of New Mexico

11. Jones T, Forrest S (1995) Fitness distance correlation as a measure of problem difficulty for genetic algorithms. In: 6th International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 184–192

12. Kamvar SD, Klein D, Manning CD (2002) Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In: 19th International conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 283–290

13. Koljonen J (2006) On fitness distance distributions and correlations, GA performance, and population size of fitness functions with translated optima. In: Honkela T, Kortela J, Raiko T, Valpola H (eds) 9th Scandinavian conference on artificial intelligence. Finnish Artificial Intelligence Society, Espoo, pp 68–74

14. Kotsiantis S, Pintelas P (2004) Recent advances in clustering: a brief survey. Trans Inf Sci Appl 1(1):73–81

15. Krasnogor N, Pelta DA (2004) Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20(7):1015–1021

16. Krasnogor N, Gustafson S, Pelta D, Verdegay J (2008) Systems Self-Assembly: Multidisciplinary Snapshots. In: Studies in multidisciplinarity, vol 5. Elsevier, Amsterdam

17. Li L, Siepmann P, Smaldon J, Terrazas G, Krasnogor N (2008) Automated self-assembling programming. In: Krasnogor N, Gustafson S, Pelta D, Verdegay JL (eds) Systems self-assembly: multidisciplinary snapshots. Elsevier, Amsterdam

18. Michielsen K, Raedt HD (2000) Morphological image analysis. Comput Phys Commun 1:94–103

19. Michielsen K, Raedt HD (2001) Integral-geometry morphological image analysis. Phys Rep 347:461–538

20. Quick RJ, Rayward-Smith VJ, Smith GD (1998) Fitness distance correlation and Ridge Functions. In: 5th International Conference on parallel problem solving from nature, Springer, London, pp 77–86

21. Rothemund PWK, Winfree E (2000) The program-size complexity of self-assembled squares (extended abstract). In: 32nd ACM symposium on theory of computing, ACM, New York, NY, USA, pp 459–468

22. Terrazas G, Krasnogor N, Kendall G, Gheorghe M (2005) Automated tile design for self-assembly conformations. In: IEEE congress on evolutionary computation, vol 2. IEEE Press, New York, pp 1808–1814

23. Terrazas G, Gheorghe M, Kendall G, Krasnogor N (2007) Evolving tiles for automated self-assembly design. In: IEEE congress on evolutionary computation. IEEE Press, New York, pp 2001–2008

24. Terrazas G, Siepmann P, Kendall G, Krasnogor N (2007) An evolutionary methodology for the automated design of cellular automaton-based complex systems. J Cell Autom 2(1):77–102

25. Tomassini M, Vanneschi L, Collard P, Clergue M (2005) A study of fitness distance correlation as a difficulty measure in genetic programming. Evol Comput 13(2):213–239

26. Vanneschi L, Tomassini M (2003) Pros and cons of fitness distance correlation in genetic programming. In: Barry AM (ed) Bird of a Feather Workshops. In: Genetic and evolutionary computation conference, AAAI, Chicago, pp 284–287

27. Vanneschi L, Tomassini M, Collard P, Clergue M (2003) Fitness distance correlation in structural mutation genetic programming. In: Ryan C, Soule T, Keijzer M, Tsang E, Poli R, Costa E (eds) Genetic programming. Proceedings of EuroGP, Springer, Essex, Lecture Notes in Computer Science, vol 2610. pp 455–464

28. Winfree E, Yang X, Seeman NC (1996) Universal computation via self-assembly of DNA: Some theory and experiments. In: DNA based computers II. American Mathematical Society, vol 44, pp 191–213
