
Contextual image labelling with a neural network


Following the updating rule of the synaptic weights used in classical backpropagation,

$$\omega_{ij}^{(t+1)} = \omega_{ij}^{(t)} + \Delta\omega\, f(i_t)$$

the average value of the new estimations is

$$\bar{\omega}_i = \frac{1}{T}\sum_{t=1}^{T}\left[\omega_{it} + \Delta\omega\, f(i_t)\right], \qquad \sigma_i^2 = \frac{1}{T-1}\sum_{t=1}^{T}\left(\omega_{it} - \bar{\omega}_i\right)^2$$

Taking into consideration the convergence of the network to a stable optimum, the sensibility test $t_{\omega_i}$ of the synaptic weight $\omega_i$ is

$$t_{\omega_i} = \frac{\bar{\omega}_i}{\sigma_i/\sqrt{T}}$$

This is a sensibility test because the network has already reached a stable state.


Secondly, the method has been generalised to use colour image data. Thirdly, a completely new database of 80 24-bit colour images has been used. This more extensive and better understood database has made it possible to perform a more detailed quantitative performance evaluation. It is intended that these results will form the first step of a comprehensive evaluation of these methods.

The following Sections discuss the images, segmentation, labels, features, and neural network used in this work.

2 Images

The work presented here has used a sample of 80 colour images of outdoor scenes from the Sowerby Image Database. The database was originally designed for research on methods of road recognition and consists of colour images of a wide range of urban and rural scenes containing roads or rough tracks. The images were digitised using a calibrated digitiser from small-grain 35 mm colour transparency film to produce high-quality 24-bit colour images. Examples of two of these images are shown in Figs. 1 and 2.

Fig. 1 Urban road image from database

Fig. 2 Rural road image from database

The statistics of the image content and acquisition conditions were specified for all images in the database. Of the 80 images in the sample, 40 are of urban scenes and 40 are of rural scenes. Approximately one third of the images are taken from an off-road viewpoint and the remainder from an on-road viewpoint. For on-road viewpoints, the camera was placed in the middle of one of the driving lanes, with the line-of-sight parallel to the road. For off-road viewpoints, the camera was placed between 1 m and 5 m from the edge of the road, with a uniform distribution over the position and over the angle between the road and the line-of-sight, such that the full width of the road is visible. Approximately one third of the images contain at least one vehicle. Environmental conditions at image capture were overcast, with no sunlight and no rain or recently fallen rain. For all images, the camera was focussed at infinity and an aperture no larger than f/8 was used in order to ensure an adequate depth-of-field. The declination of the line-of-sight was approximately 4°. These conditions were uniformly and independently distributed over each other.

3 Segmentation and labels

In this work, the purpose of the segmentation process is to divide the image into a number of regions which correspond to objects and/or parts of objects. Current segmentation techniques cannot achieve this goal reliably; it is nearly always the case that some regions will be under-segmented (too large), and some regions over-segmented (too small). The quality of the segmentation is important since the features are computed from each region.

The concept of an 'ideal' object-oriented segmentation is introduced. It is defined as the set of regions which correspond to the objects and/or parts of objects in a predefined set of labels. This set of labels corresponds to the prior knowledge of the content of images from a particular domain, in the present case, road scenes. A set of labels is defined that is sufficient to describe all objects larger than a certain size which are likely to be visible in images within the domain. The main reason for using 'ideal' segmentations is that it makes it possible to separate the two problems of segmentation and recognition, which allows one to assess the performance of a recognition method independently of the effects of using an imperfect real segmentation process.

An approximation to the 'ideal' segmentation is constructed from a hand-labelled nonideal segmentation by merging adjacent over-segmented regions which are assigned the same label, and splitting under-segmented regions which are assigned more than one label. The problem of finding the optimal split boundary is, in general, underdetermined. A simplification may be made here by attempting to split only those under-segmented regions which contain exactly two objects and for which the number of differently-labelled nearest-neighbour regions does not exceed two (excluding any unlabelled regions). In practice, this assumption is not unduly restrictive since it is found to hold for the majority of the under-segmented regions. Endpoints of the split boundary are determined by the positions of pairs of adjacent labelled regions that have different labels. The split boundary is then determined by linear interpolation of the endpoints. This scheme is simple but generally works well. Under-segmented regions are rare (<0.5%) and also typically rather small (<100 pixels). Hence, whatever processing is applied to them is assumed not to have a very significant effect on the segmentation. An example segmentation of the image in Fig. 1 is shown in Fig. 3.
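As an illustration of the merge step, the sketch below applies union-find to a hypothetical region adjacency graph, merging every adjacent pair of regions that carries the same hand-assigned label. The data representation and function names are assumptions for illustration, not the authors' implementation.

```python
def merge_same_label_regions(labels, adjacency):
    """labels: {region_id: label}; adjacency: {region_id: set of neighbour ids}.
    Returns a mapping from each region to its merged-group representative."""
    parent = {r: r for r in labels}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Merge every adjacent pair of regions that carries the same label.
    for r, neighbours in adjacency.items():
        for n in neighbours:
            if labels[r] == labels[n]:
                union(r, n)
    return {r: find(r) for r in labels}

# Example: regions 1 and 2 are adjacent 'tree' fragments; region 3 is 'road'.
groups = merge_same_label_regions(
    {1: 'tree', 2: 'tree', 3: 'road'},
    {1: {2, 3}, 2: {1}, 3: {1}})
assert groups[1] == groups[2] != groups[3]
```

The splitting step is harder to sketch generically, since it depends on the region geometry; the linear interpolation of endpoint pairs described above is the simplification the authors adopt.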

A set of labelled examples is drawn from the regions in the 80 images. Only regions larger than a size threshold of 80 pixels have been selected, resulting in a total of 4780 examples. The sample contains the set of labels shown in Table 1.


Fig. 3 Segment boundaries for urban image

Table 1: List of labels

Label                      Category
Cloud/mist                 Cloud/mist
Vegetation                 Vegetation
Vegetation tree            Vegetation
Vegetation grass           Vegetation
Vegetation bush            Vegetation
Vegetation other           Vegetation
Road-surface marking       Road-surface marking
Road-surface gravel        Road-surface other
Road-surface tarmac        Road-surface other
Road-surface stone/rock    Road-surface other
Road-border grass          Vegetation
Road-border gravel         Road-border misc
Road-border tarmac         Road-border misc
Road-border kerbstone      Road-border misc
Road-border stone/rock     Road-border misc
Road-border mud            Road-border misc
Building wall              Building
Building roof              Building
Building chimney           Building
Building window            Building
Building other             Building
Bounding-object wall       Bounding-object
Bounding-object fence      Bounding-object
Bounding-object gate       Bounding-object
Road-sign                  Road-sign
Telegraph-pole             Telegraph-pole
Illumination shadow        Illumination shadow
Mobile-object car window   Mobile-object car
Mobile-object car body     Mobile-object car

4 Features

The classification performance of the neural network is strongly influenced by the choice of features. Ideally, since an object should be identified with equal confidence from different views, one should seek features which are viewpoint-invariant. However, many common features are not strictly viewpoint-invariant. In this work, we use features which are approximately invariant.

A feature vector F = (f_1, f_2, ..., f_N) composed of N = 29 features is defined as shown in Table 2 below. As in Wright [10], two types of feature are used: features that describe the internal characteristics of a region, and features that describe the local contextual relations of a region. However, there are some differences in the design of the features used here, most notably the addition of colour, topological, and multiscale shape features, and also the choice of unary contextual features, which are defined here in terms of the region R and the set R_N of the four largest nearest-neighbours of R. The following geometric and intensity features are used, defined in terms of R and R_N: the angles of the vectors between the centroid of R and the centroids of the regions in R_N, the ratios of the area of R to the areas of the regions in R_N, and the ratios of the intensity of R to the intensities of the regions in R_N.

Table 2: List of features

Feature no.   Type         Property measured
1             internal     intensity
2-3           internal     colour (hue angle)
4             internal     hollowness
5             internal     Euler number
6             internal     compactness
7-10          internal     shape descriptors
11-13         internal     texture descriptors
14-21         contextual   inter-centroid angles
22-25         contextual   size ratios
26-29         contextual   intensity ratios

Feature f_1 measures the mean intensity of the region. Features f_2, f_3 measure the colour of the region by the hue angle of the mean colour vector (see normalisation, below). Feature f_4 is the percentage of the area occupied by holes. Feature f_5 is the logarithm of 2 minus the Euler number of the region. Feature f_6 measures the compactness as 4π × area/perimeter². Features f_7, ..., f_10 are multiscale shape descriptors. These features are based on the decay of curvature extrema over multiple scales of a contour [11]. Essentially, they measure irregularity and linearity at several scales of a contour. Features f_11, ..., f_13 are texture descriptors based on the intensity co-occurrence matrices. The eigenvalues and eigenvectors are computed from the mean co-occurrence matrix C(k) for a single-pixel displacement at each of the orientations k = 0, π/4, π/2, 3π/4. The mean of the pair of co-occurrence matrices at opposite orientations is used in order to obtain less noisy results. The largest eigenvalue e_1(k) measures the total amount of smooth variation of intensity, while the next largest eigenvalue e_2(k) relates to the 'amount' of texture. The two eigenvalues are found to be positively correlated; the best results have been obtained with f_11 as the ratio of the two largest eigenvalues e_2(k)/e_1(k) averaged across the C(k). Feature f_12 is a measure of the polarisation of the texture based on the functions f(k) = e_2(k)/e_1(k) and g = argmax_k f(k). f_12 is 0 for either diagonal polarisation or e_2(k) below a sensitivity threshold of 0.15, 0.5 for horizontal polarisation, and 1.0 for vertical polarisation. Feature f_13 = Σ_k [e_1(k) + e_2(k)], the total variance.
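To make these texture descriptors concrete, the following sketch computes symmetrised grey-level co-occurrence matrices for single-pixel displacements at the four orientations and derives f_11 and f_13 from the two largest eigenvalues. The quantisation to 16 grey levels and the helper names are assumptions; the authors' exact construction may differ.

```python
import numpy as np

def cooccurrence(img, dy, dx, levels=16):
    """Co-occurrence matrix for displacement (dy, dx); averaging a matrix with
    its transpose equals the mean of the two opposite orientations."""
    q = (img * (levels - 1)).astype(int)  # quantise intensities in [0, 1]
    C = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            C[q[y, x], q[y + dy, x + dx]] += 1
    return (C + C.T) / 2

def texture_features(img):
    # Orientations k = 0, pi/4, pi/2, 3pi/4 as unit pixel displacements.
    displacements = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]
    e1, e2 = [], []
    for dy, dx in displacements:
        C = cooccurrence(img, dy, dx)
        vals = np.sort(np.linalg.eigvalsh(C))[::-1]  # symmetric: real spectrum
        e1.append(vals[0])
        e2.append(vals[1])
    f11 = np.mean(np.array(e2) / np.array(e1))  # texture-to-smoothness ratio
    f13 = np.sum(np.array(e1) + np.array(e2))   # total variance
    return f11, f13

print(texture_features(np.random.rand(32, 32)))
```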

Normalisation: All feature values are normalised to the interval [0, 1]. An obvious representation for the angular features is simply to normalise the values to the unit interval with F' = F/2π. The problem with this representation is that although the underlying property has no discontinuities, the normalised variable has two discontinuities at its upper and lower bounds, corresponding to 0 and 2π radians. The solution adopted here is to encode an angle θ as the vector (sin θ, cos θ), a representation that is continuous on the closed interval [0, 2π].


The distributions of some features with wide dynamic ranges are far from uniform, good examples being the size and intensity ratios used as contextual features. These features are normalised using the nonlinear function F' = arctan(F)/2π, which approximately flattens the distributions.
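Both normalisations might be sketched as follows. Rescaling the (sin θ, cos θ) components into [0, 1] is an assumption made here so that all features land in the unit interval, since the text does not spell out that final step.

```python
import numpy as np

def encode_angle(theta):
    """Encode an angle as (sin, cos), rescaled to [0, 1] per component,
    so the representation has no wrap-around discontinuity."""
    return (np.sin(theta) + 1) / 2, (np.cos(theta) + 1) / 2

def flatten_ratio(f):
    """Nonlinear squashing for wide-dynamic-range ratio features."""
    return np.arctan(f) / (2 * np.pi)

print(encode_angle(0.0))    # (0.5, 1.0)
print(flatten_ratio(10.0))  # large ratios saturate smoothly
```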

5 Neural network

Since labelled examples are available, it is possible to use a supervised neural network. The architecture of the network is a multilayer perceptron with one hidden layer, as shown in Fig. 4. The network has a 29-vector of input feature values; its task is to produce the correct output label for each feature vector. An important question is what kind of output encoding to use for the labels. In this work, a local 1-out-of-M code is used in which each unit represents a different label. This code has the advantages that the trained classifier can be decomposed very simply into M separate classifiers and, furthermore, the outputs have a simple interpretation as approximations to the a posteriori probabilities [12]. It is also more robust than the maximally compact 5-bit code. The classification is determined by the maximum output value.

Fig. 4 Multilayer perceptron architecture
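A minimal sketch of such a network follows, with the 29-input, one-hidden-layer, 1-out-of-M output structure described here; the activation functions, the random initialisation, and the use of NumPy are assumptions, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HIDDEN, N_OUT = 29, 14, 29  # layer sizes from the text

# Weights and biases for one hidden layer; 29*14 + 14 + 14*29 + 29 = 855
# free parameters, matching the degrees of freedom quoted below.
W1, b1 = rng.normal(0, 0.1, (N_HIDDEN, N_IN)), np.zeros(N_HIDDEN)
W2, b2 = rng.normal(0, 0.1, (N_OUT, N_HIDDEN)), np.zeros(N_OUT)

def forward(f):
    """Map a 29-feature vector to 29 outputs; each output unit represents
    one label (local 1-out-of-M code)."""
    h = np.tanh(W1 @ f + b1)
    return 1 / (1 + np.exp(-(W2 @ h + b2)))  # sigmoid outputs in [0, 1]

def classify(f):
    return int(np.argmax(forward(f)))  # the maximum output value wins

print(classify(rng.random(N_IN)))
```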

Many techniques exist for solving the problem of finding optimal values for the network parameters. In this work, a conjugate gradient descent algorithm with batch training was used due to its superior convergence properties compared to simple gradient-based methods such as backpropagation. A cross-validation paradigm was used for the training in order to avoid overfitting. Ten networks with different numbers of hidden units were tested in order to find the optimal number of degrees of freedom with respect to the test data. The minimum test error E achieved by each of these networks as a function of the number of hidden units N_h is shown in Fig. 5.

Fig. 5 Minimum test error E against number of hidden units N_h

The optimal network is assumed to be the simplest network which achieves the minimum error on the test set. From the available data, it is determined that the network with 14 hidden units is the optimum. The optimal network has 855 degrees of freedom. Assuming the training examples are independent and random samples from the pattern populations, the total information content of the training set, which contains 3825 examples, is approximately 20 000 bits. The training set should, therefore, adequately constrain the degrees of freedom in the network, whose information content is approximately 14 000 bits. An alternative argument can be advanced based on the concept of the dichotomisation capacity due to Cover [13]. Here, the number of training examples comfortably exceeds the total dichotomisation capacity of our network (870).
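The arithmetic behind these figures can be reconstructed as follows; the figure of roughly 16 bits per network parameter and the per-output-unit capacity 2(N_h + 1) are assumptions chosen to be consistent with the totals quoted in the text, not values stated explicitly by the authors.

```latex
\begin{align*}
\text{degrees of freedom} &= \underbrace{29 \times 14 + 14}_{\text{hidden layer}}
  + \underbrace{14 \times 29 + 29}_{\text{output layer}} = 855\\
\text{training-set information} &\approx 3825 \log_2 29
  \approx 3825 \times 4.86 \approx 1.9 \times 10^4 \text{ bits}\\
\text{network information} &\approx 855 \times 16 \text{ bits}
  \approx 1.4 \times 10^4 \text{ bits}\\
\text{dichotomisation capacity} &= 29 \times 2(14 + 1) = 870 \text{ examples}
\end{align*}
```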

6 Results and discussion

In this Section, the performance of the neural network recognition system is analysed. There are several possible methods for estimating the error rate of the network [14]. The leave-one-out method of error rate estimation is theoretically attractive compared to other techniques, such as resubstitution, because it is asymptotically unbiased. However, in the present case, leave-one-out would be impracticable due to the computational complexity of retraining the network many times. By contrast, the holdout method requires much less computation, although its error rate estimates are slightly pessimistically biased. Therefore, the holdout method is used in this work. The available labelled examples are divided randomly into a training set and a test set in the ratio 80:20. The performance of the neural network is then measured in terms of the expected accuracy of its classifications of the examples in the test set. This is a useful figure-of-merit because it represents the probability that the network will correctly classify a randomly chosen region.
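A sketch of the 80:20 holdout split, assuming a simple random permutation; the exact rounding is an assumption (the text elsewhere reports 3825 training examples).

```python
import numpy as np

def holdout_split(n_examples, test_fraction=0.2, seed=0):
    """Random 80:20 split into training and test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_test = int(round(test_fraction * n_examples))
    return idx[n_test:], idx[:n_test]  # train, test

train_idx, test_idx = holdout_split(4780)  # 4780 labelled examples in total
print(len(train_idx), len(test_idx))
```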

6.1 Definitions

In statistical pattern recognition theory, the confusion matrix of a classifier is defined as

$$C_{ij} = \sum_{k} \delta\big(\hat{\omega}(x_k), \omega_i\big)\,\delta\big(\omega^{*}(x_k), \omega_j\big)$$

where x_k is the kth pattern, ω̂(x_k) is the true class of pattern x_k, ω*(x_k) is the class predicted by the classifier for pattern x_k, and i, j are indexes of the ith and jth classes, respectively.

In this work, each label defines a class ω_i, i = 1, 2, 3, ..., 29. The classifier predicts that a pattern in feature space from class ω̂_i belongs to class ω*_j, which may or may not be the same as the true class ω̂_i. The confusion matrix may be divided by the total number of patterns, yielding an array of probabilities, where the i, jth entry is the probability P(ω*_j | ω̂_i) that a pattern from class ω̂_i is classified into class ω*_j. The accuracy for a true class ω̂_i is given by P(ω*_i | ω̂_i), and represents the conditional probability that the classifier predicts the correct class given that an example is known to belong to a particular class. The expected accuracy for the classifier is defined in terms of a set of unseen test data

$$A = \sum_{i} P(\hat{\omega}_i)\, P(\omega^{*}_i \mid \hat{\omega}_i) \qquad (3)$$
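These definitions translate directly into code. The sketch below builds the count form of the confusion matrix and the expected accuracy of eqn. 3, which for raw counts reduces to the trace divided by the total.

```python
import numpy as np

def confusion_matrix(true, pred, n_classes):
    """C[i, j] counts patterns whose true class is i and predicted class is j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, pred):
        C[t, p] += 1
    return C

def expected_accuracy(C):
    """sum_i P(w_i) P(w*_i | w_i), i.e. the overall fraction correct."""
    return np.trace(C) / C.sum()

true = [0, 0, 1, 2, 2, 2]
pred = [0, 1, 1, 2, 2, 0]
C = confusion_matrix(true, pred, 3)
print(C)
print(expected_accuracy(C))  # 4/6, about 0.667
```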

6.2 Performance of the network

The expected accuracy of the network is 51.1%. This figure suggests the network is doing reasonably well on what is certainly a difficult multiclass classification problem. However, since this estimate of the expected accuracy is pessimistically biased (see above), it is expected that the network would perform significantly better if it were retrained on an enlarged dataset. A picture of the confusion matrix C (rows are the true classes, columns are the predicted classes), shown in Fig. 6, emphasises this performance more clearly as a concentration of values on the diagonal.

Fig. 6 Confusion matrix

6.3 Clustering transformation

A detailed examination of the 'confusion matrix'* indicates that when the network makes mis-classifications it often makes 'sensible' mistakes. For example, 'Vegetation tree' is sometimes confused with 'Vegetation bush' and vice versa. Also, 'Building roof' is sometimes confused with 'Building chimney' (although not so commonly vice versa). Furthermore, there are relatively few mis-classifications between incompatible pairs of classes such as 'Vegetation tree' and 'Building wall'. These types of error suggest an improvement in the performance of the classifier might be obtained by applying a clustering transformation to the output labels generated by the network. Consequently, a clustering transformation is defined as shown in the second column of Table 1. The performance of the network after applying the clustering transformation is shown in the clustered confusion matrix in Fig. 7. Note that due to round-off errors, not all rows have a sum equal to 100%. The expected accuracy is 64.1%, which is a significant improvement (P < 0.05) over the expected accuracy of the classifier when operating without the clustering transformation (51.1%).

* The full 'confusion matrix' is not shown here due to lack of space.
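A sketch of the clustering transformation applied at scoring time, using a small illustrative subset of the Table 1 mapping; unlisted labels map to themselves.

```python
# Fine labels are mapped to their Table 1 category before scoring.
# This mapping is a small illustrative subset, not the full 29-label table.
CLUSTER = {
    'Vegetation tree': 'Vegetation',
    'Vegetation bush': 'Vegetation',
    'Building roof': 'Building',
    'Building chimney': 'Building',
}

def clustered_accuracy(true_labels, predicted_labels):
    hits = sum(CLUSTER.get(t, t) == CLUSTER.get(p, p)
               for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# A tree misread as a bush becomes correct after clustering:
print(clustered_accuracy(['Vegetation tree'], ['Vegetation bush']))  # 1.0
```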

6.4 Contribution of the contextual features

In this experiment, the effect on the network performance of removing the contextual features is studied. The experiment is intended to answer two questions: first, 'Do the contextual features contain useful classification information?' and, secondly, 'Is the information sufficient to balance the increased complexity of the classifier?'

To answer the first question, the performance of the original network is compared with the performance of a network with an identical architecture but which uses a different feature vector where the value of each of the contextual features is held constant, thus effectively deleting those features. In this experiment, the values of the constants are the mean values of the contextual features in the test set. Mean values are also substituted for the values of the contextual features in the training set. The network is retrained on the modified training set and the performance is measured on the modified test set.
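The mean-substitution deletion might look as follows; the placement of the contextual features in positions 14-29 follows Table 2, while the array layout and per-set means are assumptions.

```python
import numpy as np

def delete_features(X, feature_idx):
    """Hold the given columns constant at their mean over this set, which
    removes them as a source of variation (effective deletion)."""
    X = X.copy()
    X[:, feature_idx] = X[:, feature_idx].mean(axis=0)
    return X

# Contextual features occupy positions 14-29 of the feature vector (Table 2),
# i.e. zero-based column indices 13-28.
ctx = np.arange(13, 29)
X_train, X_test = np.random.rand(100, 29), np.random.rand(25, 29)
X_train_mod = delete_features(X_train, ctx)
X_test_mod = delete_features(X_test, ctx)
assert np.allclose(X_test_mod[:, ctx].std(axis=0), 0)  # held constant
```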

The confusion matrix for the retrained network tested on the modified test set is shown in Fig. 8. The expected accuracies for the original network (row A) and the retrained network (row B) are summarised in Table 3. The expected accuracies of the two networks operating with and without the label clustering transformation are shown in columns 2 and 3 of the Table.

The decrease in the expected accuracy for the retrained network is mainly due to changes in the performance on three labels: 'Road-border misc', 'Road-sign' and 'Mobile-object car'.

Table 3: Expected classification accuracies

Network   Clustered labels   Unclustered labels
A         0.6411             0.5105
B         0.6261             0.4686

Fig. 7 Clustered confusion matrix (row and column categories: Cloud/mist, Vegetation, Road-surface marking, Road-surface other, Road-border misc, Building, Bounding-object, Road-sign, Telegraph-pole, Illumination shadow, Mobile-object car)


I ’ -

I I I I I 1 I I I I I 1 I I I I I I I I

94 1 0 78 6 0 0 35 0 30 3 5 0 18

11 11 0 8 0 28 3 4

I I I I I I I I I 0 4

72 7 4 0 1 0 0 0 0

6 1 2 6 0 0 36 15 7 0 0

8 32 23 1 0 1 3 7 0 6 0 0 3 32 32 0 0 0 55 22 0 0 4 1 8 0 0 0 0 0 7 0 0 0 3 0 0 3

Fig. 8 Confusion matrixlor network with deleted contextualfeatures

Road-sign’ and ‘Mobile-object car’. An examination of the unclustered confusion matrices suggests that this is largely because the contextual features are helping to dis- tinguish car windows from the windows of buildings, and also objects in the road border from the walls of build- ings. It is also noted that the application of the label clus- tering transformation makes the classifier more robust against the removal of the contextual features.

To answer the second question, two network architectures with different feature vectors are studied: the first is the original network and feature vector (with contextual features, as discussed in Section 4), and the second is a network with a reduced feature vector which omits the contextual features. The network with contextual features included has approximately twice the number of input units as the network with no contextual features, and it therefore has a greater complexity. The performance of the no-contextual-features network was estimated for different numbers of hidden units on the test set. The optimal network architecture is chosen as the one which achieves the minimum test error. The optimal network with no contextual features achieved a test error some 10% worse than the original network with contextual features. It is therefore concluded that the contextual features do contain sufficient information to balance the extra complexity required for the network architecture.

6.5 Area-based performance

An important performance evaluation is to measure the mean proportion of the area of the images that is correctly labelled; this is not necessarily equal to the mean proportion of the total number of regions that is correctly labelled. The mean proportion of the total area that has been correctly labelled by the network, which was trained using the original data and a full feature vector, is listed for each different unclustered label in Table 4.
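The distinction between the two measures might be computed as follows, with per-region correctness flags and pixel areas as hypothetical inputs.

```python
import numpy as np

def region_and_area_accuracy(correct, areas):
    """correct: boolean per region; areas: pixel count per region.
    Returns (fraction of regions correct, fraction of total area correct)."""
    correct = np.asarray(correct, dtype=float)
    areas = np.asarray(areas, dtype=float)
    return correct.mean(), (correct * areas).sum() / areas.sum()

# Two small wrong regions and one large correct one: the area-based figure
# exceeds the region-based one, as reported for the full system.
print(region_and_area_accuracy([True, False, False], [1000, 80, 80]))
```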

It can be seen that there is a wide variation in the (area-based) accuracy over the different classes. This is caused by the network classification performance being adversely affected by the existence of classes with relatively small a priori probabilities. In fact, there is a significant correlation (r = 0.57, P < 0.05) between the a priori probability and the probability of correct classification.

Nonetheless, the expected accuracy of the network as measured in terms of region area is high: 71.6%. This is greater than the expected accuracy of the network as measured in terms of numbers of regions, which is 51.1%.


Table 4: Area-based accuracy

Rank  Label                                 % Area correct   Prior probability
1     Cloud/mist                            91.3             0.0816
2     Mobile-object car body                90.6             0.0157
3     Road-object road-surface tarmac       87.9             0.0243
4     Road-object road-surface stone/rock   86.1             0.0259
5     Mobile-object car window              85.4             0.0167
6     Landscape vegetation tree             85.3             0.0612
7     Road-object road-surface marking      84.7             0.0191
8     Road-object road-border grass         82.2             0.0682
9     Illumination shadow                   81.4             0.0152
10    Building window                       78.3             0.0826
11    Building wall                         76.4             0.0873
12    Landscape vegetation bush             57.7             0.0580
13    Road-object road-border stone/rock    54.0             0.0340
14    Landscape vegetation other            53.3             0.0476
15    Electrical-object telegraph-pole      52.7             0.0340
16    Building roof                         52.3             0.0476
17    Bounding-object wall                  34.8             0.0507
18    Building chimney                      29.8             0.0873
19    Road-object road-border tarmac        28.4             0.0193
20    Road-object road-border kerb-stone    20.5             0.0327
21    Road-object road-sign                 5.66             0.0094
22    Landscape vegetation grass            4.46             0.0146
23    Landscape vegetation                  1.83             0.0363
24    Road-object road-surface gravel       0.00             0.0084
25    Road-object road-border gravel        0.00             0.0089
26    Bounding-object fence                 0.00             0.0308
27    Building other                        0.00             0.0094
28    Bounding-object gate                  0.00             0.0230
29    Road-object road-border mud           0.00             0.0089

The reason is that larger regions are more likely to be classified correctly than smaller regions because the reliability of some of the features used by the neural network increases with the area of the region being considered. The reliabilities of shape and texture descriptors, in particular, increase with region area. For example, the multiscale shape descriptors have a maximum scale range of seven octaves, although a subrange of only four octaves is used for the smallest regions, whose perimeters are approximately 30 pixels. Examples of labellings generated by the neural network for the images in Figs. 1 and 2 are shown in Figs. 9 and 10, respectively, for which a key to the colours is provided in Table 5.

It follows from these data that the probability of a randomly chosen pixel from one of these images being correctly labelled by the system is approximately 72%. On the assumption that our sample of images of road scenes is unbiased, a similar level of performance by the system may be expected if it is applied to new sets of images of road scenes that are chosen in accordance with the statistics given in Section 2.


Table 5: Key to colours in Figs. 9 and 10

Colour         Label                      Example
light green    Vegetation tree            tree at top-left
mid green      Vegetation grass           grass patches to lower-left
olive green    Vegetation bush            hedges to left
bright green   Vegetation other           patch above-left of car
purple         Road-surface tarmac        road in foreground
deep purple    Road-surface stone/rock    road surface to lower-left
bright blue    Road-border tarmac         footpath in background
red-1          Road-border stone/rock     stone at lower-left of road
orange         Building roof              rooftops at top-right
pink-1         Building window            building windows at far right
salmon         Building wall              wall at far left
red-2          Telegraph-pole             telegraph pole above-left of car
brown          Illumination shadow        underneath car to right
pink-2         Mobile-object car window   rear car window to right
pink-3         Mobile-object car body     car body to right
pale yellow    Bounding-object wall       low wall to left under hedge

Fig. 9 Urban image labelled by the network. Key to colours provided in Table 5

Fig. 10 Rural image labelled by the network. Key to colours provided in Table 5


7 Conclusions and future work

A neural network with a multilayer perceptron architecture has been shown to be capable of labelling the visible objects in colour images of outdoor scenes. The two problems of segmentation and recognition were separated by using 'ideal' segmentations, allowing the performance of the recognition method to be studied independently of the effects of using an imperfect real segmentation process. A network architecture with 14 hidden units was shown to be optimal. A label clustering transformation was proposed which results in a significant increase in the expected accuracy of the network. The deletion of the contextual features from the feature vector was shown to degrade the performance of the network. Measurements of the generalisation performance on unseen test data show that, on average, the system correctly recognises approximately 72% of the area of these images.

It is difficult to make a quantitative comparison of these results with other vision systems. Although there has been a significant amount of work using related statistical methods, it mostly deals with different types of image, for example, on IR images, the work of Haddon and Boyce [15], and Priddy et al. [16]; and on remote-sensed images, the work of Goodman and Greenspan [17]. Other work on outdoor scene analysis has used knowledge-based approaches [7, 8]. Unfortunately, these publications do not include any quantitative evaluation of the performance of their methods, results being demonstrated on only a few images. It is hoped that the database used in this work will enable further experimentation to pursue the issues discussed here. In the future, it is hoped to extend this work by testing the performance of the neural network model on nonideal segmentations, and also to investigate the effects of regularisations of the model parameters and the use of a cross-entropy objective function.

8 References

1 'DARPA image understanding workshop' (Morgan-Kaufmann, New York, 1993)

2 Proceedings of the 1st IFAC international workshop on intelligent autonomous vehicles (Pergamon Press, Oxford, 1993)

3 GRIMSON, W.E.L., and LOZANO-PEREZ, T.: 'Localizing overlapping parts by searching the interpretation tree', IEEE Trans. Pattern Anal. Mach. Intell., 1984, 9, (4), pp. 469-482

4 GOAD, C.: 'Fast 3D model-based vision', in PENTLAND, A.P. (Ed.): 'From pixels to predicates' (Norwood, NJ, 1985), pp. 371-391

5 LOWE, D.G.: 'The viewpoint consistency constraint', Int. J. Comput. Vis., 1987, 1, pp. 51-72

6 GRIMSON, W.E.L., and LOZANO-PEREZ, T.: 'Model-based recognition and localisation from sparse range or tactile data', Int. J. Robotics Res., 1987, 3, (3), pp. 3-35

7 DRAPER, B.A., COLLINS, R.T., BROLIO, J., HANSON, A.R., and RISEMAN, E.M.: 'The schema system', Int. J. Comput. Vis., 1989, 2, pp. 209-250

8 OHTA, Y.: 'Knowledge-based interpretation of outdoor natural scenes' (Pitman, London, 1985)

9 HAMPSHIRE, J.B., and PEARLMUTTER, B.A.: 'Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function'. Proceedings 1990 connectionist models summer school (Morgan Kaufmann, New York, 1990), pp. 34-45

10 WRIGHT, W.A.: 'Image labelling with a neural network'. Proceedings 5th Alvey vision conference, Reading, UK, 1989, pp. 227-232

11 MACKEOWN, W.P.J.: Thesis, in preparation

12 LOWE, D., and WEBB, A.R.: 'Optimised feature extraction and the Bayes decision in feedforward classifier networks', IEEE Trans. Pattern Anal. Mach. Intell., 1991, 13, (4), pp. 355-364

13 COVER, T.M.: 'Geometrical and statistical properties of systems of linear inequalities ...', IEEE Trans. Electron. Comput., 1965, EC-14, (3), pp. 326-334

14 TOU, J.T., and GONZALEZ, R.C.: 'Pattern recognition principles' (Addison-Wesley, Reading, MA, 1977)

15 HADDON, J.F., BOYCE, J.F., PROTHEROE, S., and HESKETH, S.: 'Neural networks for the texture classification of segmented regions in forward looking infrared images', in ILLINGWORTH, J. (Ed.): Proceedings of the 4th British machine vision conference, Guildford, UK, 1993, pp. 197-206

16 PRIDDY, K.L., ROGERS, S.K., RUCK, W., TARR, G.L., and KABRISKY, M.: 'Bayesian selection of important features for feedforward neural networks', Neurocomputing, 1993, 5, (2-3), pp. 91-103

17 GREENSPAN, H.K., and GOODMAN, R.: 'Remote sensing image analysis via a texture classification neural network', Adv. Neural Inf. Process. Syst., 1993, 5, pp. 425-432
