
[Draft: Work In Progress]

Deep Challenges of Natural Vision for AI

Aaron Sloman (with help from Michael Zillich)

September 22, 2018

Abstract

Much research on vision assumes that the functions of vision are obvious and attempts to investigate, or build, mechanisms that those functions require. However, there are many functions of biological vision that go unnoticed, or are mis-described, and that holds back both science and engineering. Some generally unrecognized or poorly characterised functions of vision are described, including uses of vision to provide information about what is not the case, the variety of requirements for integrating visual input across time and space, and the roles of vision in mathematical discoveries leading up to Euclid's Elements, for instance reasoning about what is and is not possible in the environment and why. Some visual functions are important for "online intelligence", e.g. in visual servo-control, others for "offline intelligence", e.g. in planning, designing or explaining. Vision can be important in "social intelligence", including requesting or providing shareable information, and acquiring and using information about other individuals (e.g. what they can or cannot reach, or do, or see, or their states of mind). Exhaustively specifying functions of vision is impossible, since they can change. However, we can start with a set of unnoticed requirements relevant to research on vision in intelligent animals and machines.

Keywords: Functions of vision, Architectures, Spatial reasoning, Euclidean geometry, Topology, Animal vision, Robot vision, Stereo vision, Motion perception.

Contents

1 What's missing in current artificial vision systems?
2 A lesson from biological stereo perception?
3 Detecting pseudo-questions and bad theories
4 There's no fixed set of functions of vision
5 Domain-specific visual functions and learning modes
6 Meta-cognitive visual functions
7 More on the history of functions of vision
8 Good and bad sources of research targets
9 Human vision and mathematics
10 Concluding remarks


1 What's missing in current artificial vision systems?

This paper illustrates an example-driven approach to finding out what the functions of biological vision are and what has not yet been achieved in artificial vision systems, despite the many impressive achievements of AI and robotics in the last half century, recently accelerated by astounding advances in hardware, making possible real-time performances that would have been unthinkable for most of the history of AI.[1] At present, the speed, precision and reliability of machines far exceed human performance in many tasks, e.g. catching a moving object (https://www.youtube.com/watch?v=M413lLWvrbI), or welding car bodies (https://www.youtube.com/watch?v=0L7Xk5_s3QQ). But each such robot has only a narrow range of competences, and very little understanding of what it is doing and how it works, leaving huge gaps between machines and animals with a broad range of abilities.

A recent issue of the IAPR Newsletter mentioned a workshop on Unsolved Problems in Pattern Recognition and Computer Vision Kuijper (2013). It is commendable that a research community attempts to characterise what it does not yet know, as a step toward agreeing on what the major unsolved problems are against which progress can be evaluated. In particular this can counter the (pernicious) over-use of sets of benchmarks to define research problems. The infamous but accurate Rumsfeld pronouncement, that there are known unknowns and unknown unknowns, is as true of science and engineering as it is of military intelligence. This paper describes functions of biological vision that are relevant to long term goals of AI/Robotics, psychology, cognitive science and related fields (including philosophy of mind and philosophy of mathematics) but which have not yet been generally noticed, or have been noticed, but remain unexplained.

A collection of unsolved problems can provide criteria for evaluating progress, and help researchers identify gaps that are worthy of investigation. Researchers should be able to agree on what they don't understand even if they don't agree on other important issues. The result of such a survey should not be an unstructured list. For example, it would be useful to identify dependency relations between the problems to help guide research choices. Whether two visual capabilities have a dependency relation may itself be a research problem – e.g. the extent to which natural vision depends on detection of low level image features. The list of unsolved problems may never be completed: it can grow indefinitely. The biological uses of perception have expanded over millions of years. There is no reason to assume that that expansion has ended. For example, uses of human vision in reading text, in using mathematical symbols, in sight-reading music, in understanding computer programs, and in controlling movements of a computer mouse pointer, are all relatively recent in human history.

Proposed solutions can generate new problems. If several researchers agree that X is an important problem, and some think that Y is a solution to the problem while others do not, then there is a new problem: how to decide whether Y is or is not a solution to X. That may require agreeing on a new set of tests to be passed or on stricter criteria for passing old tests.

[1] In 1973, when CPU speeds were measured in kHz not GHz, and memory in kilobytes, not gigabytes, the vision system of Freddy, the Edinburgh robot, took about 20 minutes to find the rim of a cup in full view. So observing its own actions in real time was completely out of the question. http://www.aiai.ed.ac.uk/project/freddy/

Another, more satisfying, possibility would be to agree on a mathematical characterisation of the requirements and then search for a mathematical proof that the proposed design meets the requirements, as is increasingly a requirement for mission critical engineering design – and is common in other sciences. However, the requirements for sub-systems (or competences) embedded in a rich, multi-component architecture (as in a future human-like domestic helper robot) are very varied and can vary dynamically – as uses of vision vary in humans, e.g. between controlling delicate manipulation, recognising a face peering around a door, searching for a tool on a cluttered desk, reading text, reading music, enjoying a view, and preparing to jump over a wall.

Much research has been focused on the need for mechanisms to "scale up" (e.g. to cope with increasingly large and detailed inputs, or with more rapidly changing input streams). There are very different requirements for mechanisms to "scale out", i.e. interact with varied parts of a multi-functional information processing architecture, where the required interactions can change quickly. This is implicit in some well known robot challenges, e.g. robot soccer, robot rescue, robot home help. We need to make explicit the architectural challenges for vision in multi-functional robots dealing with complex and varied requirements.
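To make the contrast concrete, here is a minimal sketch, with invented names, of what "scaling out" might mean in software terms: a vision module whose set of consumers, and hence whose required interactions, changes at run time. Nothing here is claimed about biological implementation.

```python
# Illustrative sketch (not from the paper): "scaling out" as dynamic routing
# of visual information to whichever subsystems currently need it.

class VisualBroker:
    """Routes visual percepts to a changing set of consumer subsystems."""
    def __init__(self):
        self.consumers = {}          # name -> callback

    def subscribe(self, name, callback):
        self.consumers[name] = callback

    def unsubscribe(self, name):
        self.consumers.pop(name, None)

    def publish(self, percept):
        for callback in self.consumers.values():
            callback(percept)

broker = VisualBroker()
broker.subscribe("grasp_control", lambda p: print("servoing on", p))
broker.subscribe("planner", lambda p: print("updating plan with", p))
broker.publish({"object": "cup", "bearing_deg": 12.0})
broker.unsubscribe("planner")        # task changed: vision now feeds fewer subsystems
broker.publish({"object": "cup", "bearing_deg": 9.5})
```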

David Marr famously wrote "... the quintessential fact of human vision – that it tells about shape and space and spatial arrangement" Marr (1982). Many researchers were persuaded that the problems of vision are all concerned with reversing the 3D to 2D projection process. But there is far more information that can be acquired using vision than information about physical things and their current spatial properties and relations. Gibson (1966, 1979) drew attention to additional functions of vision, including controlling actions,[2] anticipating ideas about "enactive" cognition. Simon (1969) noted uses of environmental information during action, though ethologists had been collecting examples earlier. Many insects and plants have converged on mutually beneficial designs using vision in the insect and colour and shape in the plant. If a bee uses visual patterns in the optic array to control its approach to a good feeding location, both the bee and the flower benefit, though the bee's brain need not use any explicit information about the 3-D shape of the flower Lunau (1992). Compare the use of a lighthouse in navigation to a harbour. These are examples of visual servo-control: behaviour that could in another situation be controlled by a physical funnel is instead controlled by what might be called "an information funnel". Additional functions of vision will be discussed below, including perception of impossibilities and perception of mental states, including emotional states.

[2] A useful but brief discussion of Gibson's ideas and how they relate to AI can be found in section 7.v of Boden (2006), pp. 465–472.
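The "information funnel" idea can be illustrated with a deliberately minimal sketch: a control loop driven only by an offset measured in the image, with no 3-D reconstruction of the target. The camera model, gain and numbers are invented for illustration.

```python
# Hedged illustration of visual servo-control ("an information funnel"):
# steering is driven by the target's offset in the image, with no explicit
# 3-D model of the target.

def image_offset_of_target(target_bearing_deg, heading_deg):
    """Crude camera model: the target's offset in the image is the bearing error."""
    return heading_deg - target_bearing_deg

def servo_step(heading_deg, image_offset, gain=0.5):
    """One control step: turn in proportion to the target's image offset."""
    return heading_deg - gain * image_offset

heading = 40.0                       # degrees; the target sits at bearing 0
for step in range(8):
    offset = image_offset_of_target(0.0, heading)
    heading = servo_step(heading, offset)
    print(f"step {step}: heading = {heading:.2f} deg")
# The heading converges on the target, like a funnel guiding a ball.
```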


2 A lesson from biological stereo perception?

Many animals lack forward facing eyes. For them, getting more information may be more important than accurate distance estimation. Yet they see and act in a 3-D world, when feeding, mating, tending offspring, or escaping predators. How do they use barely overlapping visual inputs to perceive a 3-D environment? The process may be more like merging slightly overlapping maps than like triangulation. For mobile animals in a world with moving objects, perceptual integration is needed because motion of a perceiver or perceived object can cause information about an object to come first through one eye then the other. Integration of information across time and space may be more effective if precise details of rapidly changing retinal contents are discarded after use, leaving amodal information about the environment. Information that X was seen to move to where it is from another place need not store all the previous retinal image information – just enough to guide the search in a predicted location. Likewise if something is seen first using one eye then the other, because the perceiver has rotated. Such mechanisms will also cope with temporary occlusion, e.g. by a thick tree trunk, and can simplify tracking of a moving object, as demonstrated in Hogg (1983).
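As a rough illustration of that integration strategy, the sketch below (all values invented) keeps only an amodal record of identity, position and velocity, discards retinal detail, and re-acquires a temporarily occluded object by searching near its predicted location.

```python
# Sketch of the integration idea in this section: keep an amodal record
# (identity, position, velocity) rather than retinal detail, and use the
# predicted location to re-acquire the object after occlusion.

class TrackedObject:
    def __init__(self, label, pos, vel):
        self.label, self.pos, self.vel = label, pos, vel

    def predict(self, dt):
        """Constant-velocity prediction of where to look next."""
        return (self.pos[0] + self.vel[0] * dt,
                self.pos[1] + self.vel[1] * dt)

    def reacquire(self, detections, dt, radius=1.0):
        """Search only near the predicted location, not the whole scene."""
        px, py = self.predict(dt)
        for (dx, dy) in detections:
            if (dx - px) ** 2 + (dy - py) ** 2 <= radius ** 2:
                self.pos = (dx, dy)
                return True
        return False                 # still occluded: keep the prediction

obj = TrackedObject("conspecific", pos=(0.0, 0.0), vel=(1.0, 0.5))
print(obj.reacquire(detections=[], dt=1.0))            # occluded behind a trunk
print(obj.reacquire(detections=[(2.1, 1.0)], dt=2.0))  # re-found near prediction
```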

Many vision researchers assume that random-dot stereograms need the same mechanisms as normal stereo vision, even though normal scenes have much monocularly detectable structure not found in random dots. Suppose monocular 3D vision evolved first, with integration mechanisms sketched above. Then binocular vision, without much stereo overlap, could combine results of monocular processing, in the ways suggested above. Such "abstract fusion" mechanisms using amodal information structures could have multiple uses. For example, when using a hand to determine the structure of an object in the dark, it is typically necessary for the hand to move around the object sampling different parts of its surface. Decoupling derived environmental information from sensory input streams would allow two hands to be used in parallel, feeding information into a common amodal information structure shared between different modes of perception: e.g. seeing, feeling and hearing the same thing. Combining monocular input streams would then be a special application, though requiring maintenance of changing mappings between the moving input source and the enduring, growing structure containing information about the environment. (Garbage collection issues would also arise.)

If binocular fusion mechanisms (with appropriate new facial structure) evolved relatively late, then older mechanisms for merging monocular information could be used, with different levels of interpretation constructed in parallel. (Something like this idea is used for visual SLAM in Smith et al (2006).) Later, improved spatial precision could use triangulation at pixel levels. But that would require new mechanisms for finding and using pixel correspondences. Perhaps random dot pictures reveal only relatively recent additions to depth-perception mechanisms. Some people (including the vision researcher who invited me to write this paper) seem to lack the mechanisms required for detecting structure in random dot stereograms, yet survive in a 3-D world and see a lot of 3-D structure. An important challenge for vision research is to investigate the relative merits of the two sorts of stereo vision (based on pixel correspondences and based on merging results of monocular processing), and what natural vision systems actually do. There are related questions about detection and use of highlights and specular reflections to discern 3-D structure. Investigating the options also requires a study of information processing architectures required by the different approaches. An important problem for vision research is to identify the variety of complete architectures and functional roles in which various types of visual system may need to perform, and to understand all the many design tradeoffs – rather than finding the one "right" design – which may not exist!
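For comparison, the pixel-correspondence route reduces, for a rectified pinhole stereo pair, to the standard triangulation relation: depth = focal length × baseline / disparity. A minimal sketch, with invented numbers; the alternative discussed above would instead merge two already-interpreted monocular scenes.

```python
# The pixel-correspondence route to depth, in its textbook form: for a
# rectified pinhole stereo pair, depth = focal_length * baseline / disparity.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    if disparity_px <= 0:
        raise ValueError("point at or beyond infinity for this camera pair")
    return focal_px * baseline_m / disparity_px

# A feature seen 8 pixels apart in the two images of a 6.5 cm baseline rig:
print(depth_from_disparity(focal_px=700.0, baseline_m=0.065,
                           disparity_px=8.0))   # ~5.7 metres
```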

3 Detecting pseudo-questions and bad theories

Sameness of experience: Some apparently hard questions can turn out to be pseudo-questions. For example, it may appear that there is a well defined question of the form: is this machine conscious of seeing something? Or, does this machine have the same visual qualia as humans? Questions that arise in philosophical discussions can sometimes appear to be intelligible when they are in fact disguised nonsense. For example, before Einstein it appeared to be sensible to ask what the distance is between the location of your left foot at 12 noon and the location of the same foot five minutes later. Now we know that the question makes sense only relative to a frame of reference, and answers will be different depending on the frame chosen. If the frame of reference is fixed in the railway carriage that you occupy, and your foot does not move relative to the floor, the two locations will be the same. Relative to a frame fixed on the railway track, the two locations could be kilometers apart. A partly analogous argument can show that questions about sameness and difference of information contents (e.g. qualia) in different information users can turn out to have no answers in some contexts because of the "causal indexicality" (Campbell 1994) of the concepts used, e.g. if the colour discriminations are results of a self-organising learning mechanism, and the colour labels assigned by each perceiver therefore implicitly refer back to the private mechanism within that individual, as explained in Sloman and Chrisley (2003) and Sloman (2010).

Confusions about perception of affect: Other errors are related to the variety of functions of vision, including social functions. Humans use vision to make inferences about the state of mind of another human, including states such as anger, embarrassment, guilt or fear. Despite centuries of research on this, there is no single generally accepted theory of mind, or of emotions, that can be taken as a framework for developing robots able to detect, make inferences about, and respond appropriately to, humans (or other animals). But robot researchers sometimes consult a local expert, or a famous author's writings, then choose design requirements unaware that they are ignoring rival theories, or important phenomena. A good theory of perception of affective states needs to address: motives, preferences, likes, ideals, values, enjoyment, anxiety, fear, apprehension, hope, despair, anger, exasperation, impatience, grief, jealousy (of various kinds), hunger, pain, embarrassment, shame, guilt, regret, elation, depression, sorrow, sympathy, pity, group solidarity, patriotism, and many more, including combinations (Wright et al (1996)). Attempts to teach robots to detect and react to emotions while ignoring most of this complexity and subtlety are doomed to superficiality and possibly serious error.

Errors about symbol grounding: Some of the information produced by perceptual processes can use complex notions: not just labels for types of thing, or perceivable relations between things, but also more subtle and abstract information. E.g. X is trying to prevent something, X is fragile, X belongs to Y, and many more. Where do the concepts used to express the information come from? An old answer in philosophy is concept empiricism, i.e. all concepts are developed "by abstraction from experience of instances". Kant, in 1781, objected that experience already requires use of concepts, e.g. "above", "inside", "near", etc. So not all concepts can come from experience, and that raises the question: where do the concepts used by very young children come from? The answer must lie somewhere in the genome and the way it specifies various types of developmental trajectory. Concept empiricism has recently been reinvented with a new name, "Symbol Grounding" Harnad (1990), and many vision researchers implicitly or explicitly use those ideas in designing visual learners. This topic has been discussed in some depth by philosophers. I shall not go into it here, as my aim is only to mention that uninformed vision researchers may unwittingly follow in the footsteps of bad philosophers.

One of the consequences of the fact that animals can benefit from inherited frameworks for acquiring visual expertise, benefitting from experience of previous generations, as discussed in Chappell and Sloman (2007), is that AI/Robotics researchers concerned with developing machines with visual abilities that have to be learnt should not rely only on general-purpose learning mechanisms. There may be inheritable packages of competences to start the process that would be just as useful for beginner robots as for infant animals. Not all species of infants necessarily start from the same learning prejudices. However, as the just-referenced paper demonstrated, the inherited packages need not be fully specified: rather, if they use parametric polymorphism, they may be abstractions that can be instantiated differently in different development environments, as is clearly the case with inherited abilities to learn human languages.

In some cases this developmental flexibility can lead to developmental abnormality, e.g. if some part of the brain does not develop normally that could prevent other parts from developing normally. E.g. Sloman (2013) attempts to explain some of the visual phenomena observed in an autistic child described in Selfe (1977). The explanation uses hypothesised abnormal development of some visual functions leading to re-deployment of the underlying mechanisms, rather than linking autism with "lack of a theory of mind", a theory that does not do justice to the variety of observed phenomena.

4 There's no fixed set of functions of vision

We can never be sure that we have found all the existing functions of vision, since such functions can be totally unobvious to individual perceivers, like the use of optical flow to control posture, with the consequent ability to affect control of balance – a type of unconscious optical illusion reported in Lee and Lishman (1975).

Moreover, what appears to be a single function may be revealed by closer investigation to be two (or several) separate functions performed in parallel. For example, we think of vision as providing information about the sizes of objects. But it is not always realised that there are different uses of information about size, in different sub-systems of the information-processing architecture, using different derivations of information about size, as shown by the Ebbinghaus illusion (Figure 1).

Figure 1: Ebbinghaus Illusion: are the two middle circles the same size? Try measuring them.

Information about size can be used in many different ways, including selecting explanations of properties of things in the environment, or even simply storing the information for future use. A quite different, non-descriptive, use of information about size is in visual servo-control, for example controlling the gap between finger and thumb used to pick up a circular object (Sloman (1983)). The perception of size in that context need not assign a globally relevant measure. So someone who judges one of the central circles in Figure 1 as larger than the other when there's no practical application driving the judgement, may implicitly judge them as the same size when grasping circular objects because of the use of servo-control to reduce the gaps between the rim of the central disc and the inward facing surfaces of finger and thumb. On the other hand, comparing objects at different places and different times (e.g. buying shelves to fit in a fixed gap) requires a notion of size that allows comparisons across time and space, relating the size of objects not to action control requirements, but to a variety of different other objects in the environment: a multi-purpose type of information, which makes sense only if the information processing architecture can use a uniform distance metric for all objects perceived or thought about, no matter where they are or what they are, or why the information is required. Humans have relatively developed standard technology to support this. It's not clear what other animals do. I suspect most brains use mainly relative lengths in a context-sensitive, task-relevant, manner. Those mechanisms might turn out to work well in some contexts, while producing errors or illusions in others, e.g. Fig 1. Similar remarks could be made about all the metrics relevant to perception of the environment, including height, curvature, velocity, weight, pressure, and others. I suspect that in many cases instead of metrics, animals use many types of partial ordering (e.g. based on comparatives, like bigger, wider, further, more symmetrical, etc.). However some ballistic actions require at least approximate absolute metrics, e.g. a cat jumping from the ground to the top of a narrow wall.
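The two uses of size information can be contrasted in a toy sketch (all values invented): the servo use needs only a relative error signal between grip and seen rim, while the descriptive use needs measurements in shared units that survive across places and times.

```python
# Sketch contrasting the two uses of size information discussed above.

def grasp_step(grip_mm, seen_gap_mm, gain=0.5):
    """Servo use: shrink the finger-thumb gap toward the seen rim width.
    Only the *difference* matters; no globally calibrated size is needed."""
    return grip_mm + gain * (seen_gap_mm - grip_mm)

def fits(shelf_mm, alcove_mm):
    """Descriptive use: compare two objects measured in shared units,
    possibly at different places and times."""
    return shelf_mm <= alcove_mm

grip = 90.0
for _ in range(6):
    grip = grasp_step(grip, seen_gap_mm=62.0)
print(round(grip, 1))                          # converges on the rim width
print(fits(shelf_mm=598.0, alcove_mm=600.0))   # True
```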

5 Domain-specific visual functions and learning modes

People do not always see what is in plain view before their eyes, e.g. when proof-reading. In part that is because vision is not a unitary biological function based on a set of mathematical principles relating retinal stimuli to things in the environment, but has many different uses, in which the relationships between sensory input and what is seen can differ in important ways. Detecting something edible is a different task from seeing which way your prey is likely to move.

Forest dwellers have to learn to see many sorts of things that are specific to their environment, including detection of evidence of prey they are searching for, and learning to recognise varieties of edible and inedible plant matter. Other examples include learning to recognise cloud patterns that indicate likely or possible changes in the weather, and these may be different in different geographical regions. The ability to develop so many geography specific or culture specific (domain specific) visual competences could be summed up as the creativity and versatility of human vision.

Humans can acquire new visual competences throughout life, though age and previous learning can make a difference to the level of competence achievable. Learning to read a new script involves several levels: learning the characters (and their lower level components), the morphology, the words, the syntax, and in some cases acquiring concepts used in an unfamiliar culture. Variants of that process are required for learning mathematical notations, learning to read music, learning computer programming languages, learning to read chemical formulae, learning to understand various kinds of maps, and learning to see tissue sections in a microscope (Abercrombie (1960)). In most cases there is not a mathematical projective relationship between the 2D image contents and the semantic contents, contrary to a popular interpretation of Marr's views on the functions of vision, in Marr (1982).

It is often assumed that some general purpose learning mechanism will suffice to explain all these cases. But closer investigation may reveal that there are different subtypes corresponding to different stages in our evolutionary history, and that different animals with visual creativity, instead of all being able to learn the same things, have different abilities that will require different explanations. This has important implications for designers of visual mechanisms for use in robots. Successful design may depend on discovering the right "innate" learning prejudices, or bootstrapping mechanisms, to suit each type of robot's environment and expected roles in that environment. In particular, some human uses of vision are closely related to meta-cognitive mechanisms in a layered information processing architecture, as we'll see.


6 Meta-cognitive visual functions

A special subtype of advanced visual capability is "reading" mental states of others, within and across species. Human vision supports a variety of meta-cognitive functions, concerned with information about information-processing, whether in oneself or in others. For example, self cognition includes information about what one has seen and how it looked at the time, as well as what was not seen. An artist needs to be aware of aspect ratios and curvatures in the visible projection of a scene, which will be different from the properties of the things perceived. Humans are capable of attending to such intermediate information structures, though that requires learning. In other organisms the intermediate information structures may be used then immediately discarded so that there is nothing to inspect (as in some AI systems).
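One way to picture the architectural difference is a pipeline that can retain its viewpoint-specific intermediate structure for meta-level inspection instead of discarding it. This is only a schematic sketch with invented names, not a proposal about how brains do it.

```python
# Schematic sketch of the point about intermediate structures: retain the
# viewpoint-specific ("how it looks") layer alongside the scene description
# ("how it is"), so a meta-level can inspect it.

def interpret(image_ellipse_aspect):
    """A tilted circular plate projects to an ellipse; interpretation
    recovers 'circular plate', discarding the projected aspect ratio."""
    return {"shape": "circular plate"}

def interpret_with_introspection(image_ellipse_aspect):
    percept = interpret(image_ellipse_aspect)
    # Keep the intermediate, viewer-specific structure instead of discarding it:
    percept["as_projected"] = {"aspect_ratio": image_ellipse_aspect}
    return percept

p = interpret_with_introspection(0.4)
print(p["shape"])                         # what is perceived: the 3-D object
print(p["as_projected"]["aspect_ratio"])  # what an artist must attend to
```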

The meta-cognitive ability to attend to viewpoint specific or viewer specific features of how things are experienced, as opposed to how they are perceived to be, can be useful in many contexts, including guiding others as to where or how to detect something unobvious.

But the very same facts, once reflected on, can generate deeply confused theories about qualia and consciousness (which may occur in future intelligent robots, as science fiction writers have noted). As far as I know, no AI vision research has addressed this, and neuroscience still lacks theories about how brains combine normal and meta-cognitive visual functioning. Perhaps an exception is Trehub (1991).

Meta-cognitive visual functions include abilities to detect that information required for a task is not available but could be made available by a change of location, then used later. For example an individual wishing to cross a stream without getting wet may set out into a forest hunting for a fallen tree to use as a bridge. A more intelligent individual would realise the importance of first checking the width of the stream and using that as a criterion for selecting a tree.

Sometimes children show that they have not yet developed that sort of ability, e.g. when choosing the next cup to add to a partly built stack of inverted cups of decreasing size. Later they learn to use the diameter of the current top cup when selecting the next cup by trying to fit cups on the existing stack top. A more sophisticated child can avoid that by somehow storing information about the size of the top cup and comparing unstacked cups with that stored information. That requires a new visual function: comparing the size of something visible with the size of something remembered.
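The most sophisticated strategy described above is easy to state as an algorithm, which makes clear what new machinery it presupposes: a stored measurement that can be compared with currently visible objects. A sketch with invented diameters:

```python
# Sketch of the third strategy: remember the diameter of the current top cup
# and compare unstacked cups against the remembered value, rather than
# trying each cup physically.

def next_cup(remembered_top_diameter, unstacked):
    """Pick the largest cup that still fits inside the remembered top cup."""
    candidates = [d for d in unstacked if d < remembered_top_diameter]
    return max(candidates) if candidates else None

remembered = 80.0                    # mm, stored when the top cup was seen
print(next_cup(remembered, [95.0, 70.0, 60.0, 85.0]))   # -> 70.0
```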

Different biological mechanisms are likely to be required for comparing size, or shape, or orientation, or other features, for different sorts of objects and for different contexts. Some of the mechanisms may be more reliable than others. This is illustrated in research on change blindness, though often the wrong question is asked: "Why don't people see the changes?" That presupposes a theory of which changes can be seen, and how, a theory that is challenged by the "blindness" examples. Too often, researchers fail to ask how familiar and apparently easy things are possible. Without such explanations of what does occur, asking why something does not occur is pointless, since any deep answer should identify the part of the explanation that failed. If two complex mostly similar pictures or scenes differ, what mechanisms are required to identify the difference? Different mechanisms may be required for different sorts of differences: e.g. something has changed colour vs a new object has been inserted between two others vs a word has been replaced by a synonym, etc.

Yet another meta-cognitive ability related to vision is seeing that required information is not available, but other information is available about a possible action that will provide access to the missing information: e.g. rotating an object to see another part of it, moving to a new location to see something not visible, moving an intervening object to see something not visible, climbing to see something not visible, and at a much later stage of evolution making telescopes to see something not visible, a task that itself requires sophisticated visual and other capabilities! Being able to work out how what you see in the next room will change as you move in relation to an open doorway is another aspect of visual meta-cognition. These are all examples of perception of epistemic affordances: perception of opportunities to acquire new or different information from the environment.

All the meta-cognitive visual capabilities mentioned so far are concerned with what the perceiver has or has not seen or can or cannot see. It is also possible to use other-related meta-cognitive capabilities: e.g. using vision to work out what another individual can see, or to select a location at which to hide, so as not to be visible to another individual.
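A toy grid-world sketch of such other-related reasoning, using the straight-line assumption discussed below: a location is hidden from an observer if some opaque cell lies on the straight line between them. The scene and the coarse line-sampling scheme are invented for illustration.

```python
# Illustrative sketch of "other-related" visual meta-cognition on a grid.

def line_cells(a, b, samples=50):
    """Cells crossed by the straight segment from a to b (coarse sampling)."""
    (x0, y0), (x1, y1) = a, b
    return {(round(x0 + (x1 - x0) * t / samples),
             round(y0 + (y1 - y0) * t / samples)) for t in range(samples + 1)}

def visible(observer, target, opaque):
    """True if no opaque cell interrupts the line of sight."""
    between = line_cells(observer, target) - {observer, target}
    return not (between & opaque)

observer = (0, 0)
opaque = {(2, 2)}                          # a tree trunk
print(visible(observer, (4, 4), opaque))   # False: (2, 2) blocks the line
print(visible(observer, (4, 0), opaque))   # True: a good place NOT to hide
```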

Many of the details of these processes depend on an implicit assumption that visual information travels in straight lines: an assumption that may be built into the mechanisms used (either innately or as a result of learning?). Exactly what that notion of straightness is and how it relates to the later development of Euclidean geometry, and what mechanisms and forms of representation are used, are all topics for further research. At a later stage of development that assumption may be fully articulated and used explicitly.

Meta-cognitive abilities to "see into" other minds. Many perceptual capabilities use ontologies that extend beyond the contents of sensory data, for example seeing something as flexible, elastic, hard or fragile on the basis of observed patterns of behaviour. These are cases of "seeing inside" objects insofar as our percepts refer to properties of materials inferred from behaviours (Arriola-Rios and Wyatt (2011)). Many other animals seem to have this ability. What mechanisms support it, and how are they acquired? How can machines acquire such competences?

Visual "back projection". Perception of spatial structure, motion and physical properties of objects sometimes seems to produce information held in registration with the relevant portions of the visual field, perhaps to support inferences about consequences of forces or movements. Something similar also seems to happen when intentions and emotional states are perceived: the characterisations are somehow in registration with contributing portions of the image and in some cases hallucinated onto portions that could contribute even when they don't. This is illustrated in Figure 2. Compare Kanizsa's illusory contours.[3]

[3] http://en.wikipedia.org/wiki/Illusory_contours

Figure 2: Stare at each face for a while. Do the eyes look different? How? Many people experience the eyes as different: happy in one face and sad (or neutral) in the other, even though the eye images are geometrically indistinguishable. Compare seeing the famous "duck-rabbit" picture in two ways with parts of the picture experienced differently after a "flip" even though no physical feature changes.

Another example is perception of "biological motion" in Johansson's movies with moving point lights. When stationary they just look like isolated lights. When they move they appear to fuse into humans or animals moving in characteristic ways: walking, dancing, climbing, fighting, etc. In some cases, even the effort seems to be visible in the moving lights (Johansson (1973)). Why does that happen? How does it happen?

These examples seem to support a theory that treats visual systems, at least in humans, as having different levels of capability that produce interpretations all "in registration" with the optic array: the perceived flexibility, rigidity, elasticity, hardness, etc. of physical materials is perceived as being in the space occupied by the materials, not merely as some abstract description applied in a logical expression. A biological extension of that capability is "projection" of the happiness and sadness inferred by meta-cognitive sub-systems back into the spatial information records, and likewise the effort seen in moving lights demonstrations, the 3-D slope information in the Necker cube, the "direction of looking" of the duck-rabbit in each interpretation as well as the re-located body parts (ears, bill, mouth), and many examples of perceived causation e.g. when an impact is perceived as causing bending. What are our robots missing?

However broad our survey, we cannot assume that we have identified all the uses of vision, since new examples may turn up in previously unstudied animals, or humans in unusual situations – e.g. gazing out of a spaceship, remotely controlling a blind robot or other device, or interacting with new kinds of interactive visual or multi-modal interaction technology. That's in addition to functions that researchers simply fail to notice, such as vision researchers who ignored the role of vision in servo-control.

A deeply important collection of functions that most vision researchers ignore is connected with abilities involved in making mathematical discoveries about spatial structures and processes, such as the discoveries that led to Euclid's Elements.[4]

[4] There are some examples here, and in pages linked from this, all still under development: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/triangle-theorem.html


7 More on the history of functions of vision

One way to try to expand our understanding of the varieties of possible functions and possible mechanisms of vision is to attempt to trace the various requirements for perceptual systems at various stages in the evolutionary history of species that now use vision, and also developmental and cultural changes in individuals of different species. Different patterns of individual development can give clues as to which visual competences are innate, e.g. differences between animals born blind or helpless and animals that are highly competent soon after birth, e.g. deer whose fawns run with the herd.

As the history of biological evolution shows clearly, the range and variety of functions of vision can change over time. One of many major developments was the invention of human sign languages (which I've argued preceded spoken languages, Sloman (2008)) and later the invention of reading and writing. For very simple signs, e.g. a child holding out a hand to be grasped by a parent, thereby guiding the parent's hand to the location, the parent's use of vision is just a special case of visual servo-control, normally followed by control based on haptic and tactile information to guide final adjustments. The fact that the object being grasped is cooperative helps. Some domesticated animals use vision for various functions not encountered in their ancestors – e.g. interpreting human hand signals.

In some cases it is not enough to see where things are and their direction and speed of motion. Some tasks require use of the intended trajectory of another animal to select a target for interception. Compare using motion of a physical object simply obeying physical laws. More complex interactive abilities, such as the ability to herd sheep, may build on and modify previously acquired abilities to detect the trajectory of a prey animal and use it to intercept and kill the prey. Such uses of vision will overlap with, but be different from, the use in heading off an errant sheep in order to steer it back to the herd. More subtle and complex abilities are required for steering several sheep through a gate, including greater parallelism in perception.
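The difference between pursuing where the target is and intercepting where it will be can be shown in a few lines, assuming straight-line, constant-speed motion of the target (all numbers invented):

```python
# Sketch of interception: aim at where the target *will be*, not where it is.

import math

def intercept_point(target_pos, target_vel, my_pos, my_speed, horizon=20.0):
    """Earliest future target position reachable in the same travel time
    (found by scanning t forward in small steps)."""
    t = 0.0
    while t <= horizon:
        tx = target_pos[0] + target_vel[0] * t
        ty = target_pos[1] + target_vel[1] * t
        if math.hypot(tx - my_pos[0], ty - my_pos[1]) <= my_speed * t:
            return (tx, ty), t
        t += 0.01
    return None                      # too fast to intercept within the horizon

result = intercept_point(target_pos=(10.0, 0.0), target_vel=(0.0, 1.0),
                         my_pos=(0.0, 0.0), my_speed=2.0)
print(result)                        # aim well ahead of the target, not at it
```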

What made it possible for our relatively recent ancestors to develop a host of new mathematical and technical uses of vision, including abilities to construct and read geometric proofs, flow charts, architecture diagrams, inheritance diagrams, and syntax diagrams, apparently without the need for any supporting genetic changes? New functions of vision emerged during and after development of mathematical notations such as reasoning with diagrams in geometry, topology and logic, in Euclid's proofs and Venn diagrams. More sophisticated visual parsing was required for Fregean notations using hierarchically nested function-argument structures, including bound variables Sloman (1971). Flow-charts, inheritance diagrams, dependency diagrams, and architecture diagrams arose from developments in computing, operations research and other activities. Understanding all those changes in visual competences may require us to understand older visual functions and the mechanisms they use, and how they needed to change, for new ways of seeing.

New graphical mechanisms and functions relevant to interaction with computing devices are now emerging rapidly. Use of a mouse pointer in a text editor is partly analogous to use of a pointing finger: the mouse pointer shows where in the text the next action (e.g. mouse click or key press) will have its effect. But that is not reference to location in a geometric space: it is location in a stream of text with variable display format. The relationship between the controlled motion and the motor signals is totally different from the case where a pointing hand or finger is being moved. Controlling the motion of the far end of a held stick is an intermediate case, and that has clearly evolved in several different species, including primates and some birds. Compare moving something in order to control the location of its shadow.

8 Good and bad sources of research targets

Much AI and robotics research is based on targets deemed valuable by funding agencies, by companies with research budgets, or targets chosen because they are assumed to be representative. An alternative source could be an interdisciplinary team attempting to identify aspects of natural intelligence that we still cannot model or explain. A collection of well analysed knowledge gaps could be the basis of a research strategy. For a while the EU Cognitive Systems initiative, in 2003, took such a broad view.

"Pure" research has many different triggers and motives, including curiosity triggered by biological examples. Birds of different species build nests with very different structures using very different materials, requiring very different physical movements, both to fetch the materials and to add them to the incomplete nest. Do their visual systems all have the same competences, or do they, like humans who read very different languages, need to develop different sorts of visual expertise, with different subsystems using different ontologies, different forms of representation, different algorithms, and even different information processing architectures?

Modelling or replicating visual capabilities of birds that weave a nest from a very large number of leaves (e.g. weaver birds), birds that assemble rigid or nearly rigid twigs to form a nest (e.g. crows), and birds that use lumps of mud (e.g. swallows) may require different, partly overlapping, combinations of visual, motor and cognitive competences. That could include using different abilities to perceive and make use of affordances when finding materials, detaching them, carrying them to the nest site, inserting them into a partly built nest (or starting a new one) and using what has been constructed to decide what to do next.

Much current research assumes that innate general purpose learning mechanisms explain all information processing capabilities in mature, competent individuals, and would also suffice for production of future intelligent robots. But that is not obvious. Examples of successful learning in narrowly constrained benchmark tests or action domains do not show that the mechanisms used in those learning experiments can achieve the wide range of visual competences found in intelligent animals. The only learning mechanism that has demonstrated its ability to produce all known forms of intelligence is the combination of natural selection and its chemical substrate – and that requires a whole planet supporting diverse, mutually influencing, experiments over billions of years – including violent experiments such as volcanic eruptions and asteroid impacts.[5] Are there alternatives to general purpose fixed innate learning mechanisms? Perhaps the forms of perception and forms of learning in different animals provide clues.

[5] http://www.cs.bham.ac.uk/research/projects/cogaff/misc/entropy-evolution.html

Could biological evolution have produced a more general common biological framework for learning and development (CBFLD)? Over time, this could have produced species specific frameworks for learning and development (SSFLD), which determine the adult competences in combination with individual learning and development trajectories (ILDT). The ILDT for each individual would be the joint product of the SSFLD and influences from both environment and changing physiology and needs during development. The environment in some species would include both conspecifics and other species, e.g. prey, predators, and competitors for resources. A single CBFLD can produce different SSFLDs which in turn produce different ILDTs. The individual trajectories within a species vary in their diversity between (a) mostly genetically determined adult forms, based on the SSFLD, in precocial species, and (b) widely varying sets of competences produced by a common SSFLD via different ILDTs, in altricial species.

Each includes a spectrum of cases (Chappell and Sloman (2007)) with a space of learning and development trajectories partly summarised by Fig 3, indicating, at a high level of abstraction, possible developmental routes for different mechanisms or competences in highly intelligent animals, allowing two members of the same species to acquire very different adult competences. This could explain diversity in visual and other competences in adult organisms both within species and across species. Similar diversity is to be expected in "adult" robots developed in different environments.
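The proposed layering can be rendered schematically in code, purely to fix the idea of abstractions instantiated level by level; the parameter names are invented and nothing here models real development.

```python
# A schematic (not a model) of the CBFLD -> SSFLD -> ILDT layering proposed
# above: each level is an abstraction whose parameters are filled in at the
# next level down.

class CBFLD:
    """Common biological framework: fixes what *kinds* of parameters exist."""
    def __init__(self, sensory_channels, learning_rate_range):
        self.sensory_channels = sensory_channels
        self.learning_rate_range = learning_rate_range

class SSFLD:
    """Species-specific framework: instantiates the common framework."""
    def __init__(self, common, channels, precocial):
        assert set(channels) <= set(common.sensory_channels)
        self.channels, self.precocial = channels, precocial

def ildt(species, environment):
    """Individual trajectory: joint product of framework and environment."""
    learned = [] if species.precocial else list(environment)
    return {"channels": species.channels, "learned": learned}

common = CBFLD(["vision", "touch", "smell"], (0.001, 0.1))
deer = SSFLD(common, ["vision", "smell"], precocial=True)
crow = SSFLD(common, ["vision", "touch"], precocial=False)
print(ildt(deer, ["meadow"]))          # competent almost from birth
print(ildt(crow, ["city", "tools"]))   # competences shaped by environment
```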

Perhaps we can accommodate that diversity in a common theory of vision that is loosely analogous to the theory of a common abstract physical design shared by vertebrates, instantiated in a wide variety of shapes and sizes of adult forms, produced by a variety of evolutionary and developmental processes from a common evolutionary ancestor (or a small number of ancestors that independently evolved precursors of a vertebrate design). The variety of vertebrate species shows that natural selection somehow "discovered" that there can be a generic physical design that can be instantiated in different ways, meeting different sets of requirements. Perhaps it also discovered similar facts about designs for information-processing mechanisms, such as visual systems, and learning systems? Evolution seems to have "discovered" a visual version of something like the common abstract genetic framework that allows a human infant to learn any one of several thousand very different languages, with differences at many levels, including phonetic, morphemic, syntactic and semantic differences. Many visual information processing mechanisms obviously evolved earlier, and are spread across more species.

Figure 3: Developmental trajectories for individual competences. Based on a figure in Chappell and Sloman (2007). Chris Miall helped with the original version of this figure.

We know from half a century of AI and software engineering that information processing systems can be specified at different levels of abstraction, where more abstract specifications can be instantiated in different ways, with various classes of instances inheriting common high level properties in combination with application-specific details: "parametric polymorphism". Some programming systems allow multiple inheritance, where two or more different sorts of abstract specification are combined to produce subclasses whose instances inherit features of more than one high level abstraction.[6]

Could evolution have "discovered" the power of parametric polymorphism and multiple inheritance millions of years ago and used common designs for biological information processing systems at various levels of abstraction, in combination with species specific details, further instantiated by individual instances under different environmental and morphological influences during learning and development? In that case visual systems in different species might share some design features, e.g. use of information in streams of photons hitting photo-receptors, some common feature extraction mechanisms and common architectural features, while varying in many species-specific details, some of which are further instantiated in different ways during individual development. Those differences in morphology and function indicate different requirements for explanations of what vision is used for and how it works. Other differences come from differences in physical environments, modes of locomotion, types of prey, types of predator, types of behaviour required for dealing with food (e.g. peeling bananas, cracking nuts, opening up a carcass to get at meat), climate and many more.

[6] http://en.wikipedia.org/wiki/Multiple_inheritance
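For readers unfamiliar with the two software notions the analogy borrows, here is a minimal rendering of both in one sketch; the class names are invented for illustration.

```python
# Parametric polymorphism and multiple inheritance, in minimal Python form.
# This is only a reminder of the software notions, not a biological model.

from typing import Generic, TypeVar

Material = TypeVar("Material")

class Mud:
    pass

class NestBuilder(Generic[Material]):
    """Parametric polymorphism: one abstract nest-building competence,
    instantiated with different materials in different species."""
    def build(self, pieces: list) -> str:
        return f"nest of {len(pieces)} pieces"

class Flier:
    def travel(self) -> str:
        return "flies to the site"

class MudUser:
    def fetch(self) -> str:
        return "fetches a blob of mud"

class Swallow(NestBuilder[Mud], Flier, MudUser):
    # Multiple inheritance: one class inherits several abstract specifications.
    pass

s = Swallow()
print(s.travel(), "/", s.fetch(), "/", s.build([Mud()] * 30))
```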


9 Human vision and mathematics

James Gibson introduced the idea of perceived affordances, involving possibilities for action by the perceiver, relevant to the current or possible goals or needs of the perceiver. But perception of what's possible or impossible is far more general than that. In humans there seem to be deep connections between abilities related to perception of affordances and mathematical abilities that have not been replicated in machines. These use information about possible process fragments that can be combined, in sequence or in parallel, in action or in hypothetical reasoning, to form new complex processes. There are also constraints making certain structures or processes impossible, or consequences necessary.

Figure 4: In 1934 the Swedish artist Oscar Reutersvärd produced the picture on the right. Each picture shows coloured cubes, in a configuration that makes a variety of processes possible, including a hand moving in the spaces between the cubes, or a cube being swapped with another cube. But normal adults also see that the configuration on the right is impossible, unlike the configuration on the left. How?

At what age can a child see the impossibility in Fig 4? What is needed in a brain to support that ability? Why/how did it evolve? Affordances can interact in complex ways when combined, because of changing spatial relationships of the objects involved in the processes. On detecting that the chair is too wide to go through a door you may be able to see the possibility of rotating it first about a horizontal axis to get it on its side, then about a vertical axis to allow two legs to go through the doorway first. What changes as a child acquires abilities to reason about such combinations of possibilities and constraints on the possibilities in a scene? How can future robots discover the possibilities and necessities (not probabilities) unaided?
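A crude sketch of the chair-and-doorway reasoning, modelling the chair as a rigid box and asking which axis-aligned orientations present a cross-section that fits the door. Real chair geometry is of course not a box; dimensions are invented.

```python
# Crude sketch: which rotations of a box let it pass through a doorway?

from itertools import permutations

def passes(chair_dims, door_w, door_h):
    """Try all axis-aligned orientations of the box; return one that fits.
    The cross-section presented to the door is (width, height); the third
    dimension passes through lengthwise."""
    for w, h, _depth in permutations(chair_dims):
        if w <= door_w and h <= door_h:
            return (w, h)
    return None

chair = (95.0, 45.0, 110.0)          # cm: width, depth, height of the box
print(passes(chair, door_w=80.0, door_h=200.0))
# -> (45.0, 95.0): upright it is too wide, but rotated onto its side
#    (one rotation about a horizontal axis) the chair goes through.
```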

Suppose you are building a hut before the invention of measuring devices and you want to make a door frame as high as possible, with two upright pillars and a horizontal bar on top. You go to the local wood-merchant, and see a row of vertical pillars. Which two should you choose? A non-mathematical mind could keep trying pairs of pillars with a bar on top until the bar is horizontal. An alternative is to hold a (real or imagined!) bar horizontally above and in front of the pillars, and gradually move it down until it is level with two (or more) tops. How can you be sure that process will find the tallest pair of pillars of the same height (if such a pair exists)? What mechanisms, in brains or computers, could support such reasoning? Perhaps problems like this led to the discovery of the notion of measurement? And perhaps from there to the understanding of space as metrical – as in Euclid? Grasping the structure of a space of possibilities provides procedures for deriving information that you can tell must work: there's no need to collect statistical evidence that it does work. (Though errors are possible – and can be corrected Lakatos (1976).)
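The imagined-bar procedure is itself an algorithm, and writing it down shows why it must work: lowering the level is the same as scanning heights in descending order, so the first pair of equal tops encountered is necessarily the tallest equal pair. A sketch with invented heights:

```python
# The imagined-bar procedure as an algorithm: lower a horizontal level until
# it first rests on two (or more) pillar tops at once.

def tallest_equal_pair(heights):
    level = sorted(heights, reverse=True)     # sweep the bar downward
    for a, b in zip(level, level[1:]):
        if a == b:                            # the bar rests on two tops
            return a
    return None                               # no two pillars are equal

print(tallest_equal_pair([310, 290, 310, 275, 290]))   # -> 310
```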

Long before there were mathematics teachers, and long before discovery of Cartesian coordinate representations of geometry, and formal logical and algebraic methods of reasoning, our ancestors must have begun to notice facts about geometrical shapes that were later codified in Euclidean geometry. Here's a proof by Mary Pardoe (unpublished) that the angles of ANY (planar) triangle sum to half a rotation (180 degrees): slide a pencil along one side of the triangle, rotate it at each vertex through the interior angle there, and continue round the triangle; after the third rotation it lies on the starting side pointing in the opposite direction. Compare this with measuring the angles and adding the numbers. Can you be sure that there are no exceptions to the proof?
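One way of putting the rotation argument in symbols (my reconstruction, not Pardoe's own notation):

```latex
% Reconstruction of the rotation argument, not Pardoe's own notation.
% The pencil's heading changes by each interior angle in turn, and it ends
% up reversed, i.e. rotated through half a turn:
\theta_{\text{final}} - \theta_{\text{initial}}
  = \alpha + \beta + \gamma = \pi
  \quad\Longrightarrow\quad
  \alpha + \beta + \gamma = 180^{\circ}
```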

Early mathematicians made and proved discoveries: but the representational and reasoning mechanisms used are not known. Logical forms powerful enough to serve the purposes of mathematicians were not discovered/created by Frege, Russell and others until the late 19th and early 20th centuries.

Extending Gibson's notion of "perception of affordance" we can see that some roots of mathematical cognition, used in discovery and proof of theorems in Euclidean geometry and topology, may have developed from ancient animal abilities to perceive and reason about affordances required for selecting complex goals and plans in novel situations. Similar processes can be observed in the discovery of "toddler theorems" by young children, though they normally go unnoticed.[7] These proto-mathematical and mathematical discoveries are closely related to the ideas about "Representational Redescription" in Karmiloff-Smith (1992).

[7] http://www.cs.bham.ac.uk/research/projects/cogaff/misc/toddler-theorems.html

Animals that see mostly live in worlds with motion, and their own movements cause what is seen to change. Moreover, even when nothing is actually changing, vision contributes information about changes that can and cannot occur, and what the consequences of those changes would be. Some of the most important consequences of possible changes involve changing what would be possible or impossible if they occurred. For example if block B were to move closer to block A then block E could no longer pass between A and B, whereas it would become able to pass between B and C. Moreover, the change would make it possible for E to rest on A and B, spanning the gap between them.
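Reduced to one-dimensional interval arithmetic, the block example reads as below (positions and widths invented); the point is that a single change to B's position flips which passages are possible.

```python
# The block example above as interval arithmetic: whether E can pass
# between two blocks depends on the gap between them, so moving B changes
# which passages are possible. Blocks lie on a line as (left_edge, width).

def gap(left_block, right_block):
    (l_edge, l_width), (r_edge, _) = left_block, right_block
    return r_edge - (l_edge + l_width)

A, B, C = (0.0, 2.0), (5.0, 2.0), (10.0, 2.0)
width_E = 2.5

print(gap(A, B) > width_E, gap(B, C) > width_E)   # True True: E passes both
B = (3.0, 2.0)                                    # B moves closer to A
print(gap(A, B) > width_E, gap(B, C) > width_E)   # False True: as in the text
```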

So the task of giving machines human-like vision is nothing like the task of attaching labels or descriptions to static images, or portions of images. It often includes perception of possible processes, and perception of limitations or constraints on sets of possibilities. This is closely related to mathematical capabilities manifested in the theorems and proofs in Euclid's Elements long before modern logic and the logic-based axiomatic method had been developed. I believe current machine vision systems are not even close to having these important capabilities, even when they have superficially similar abilities, including abilities to run simulations with great precision: something humans can't do.

Comparisons between humans and robots, or between animals and current robots, are not as important as comparisons of requirements for different sorts of tasks.

Examples are provided by nest-building technologies used by birds. Some repeatedly fetch blobs of mud and press them against the unfinished structure, requiring the builder to see how the current structure can be extended towards the desired shape. For a nest made of stiff twigs the challenges include finding twigs of the required shape and size, breaking them off the main plant, then selecting a gap in the unfinished nest into which the twig can be inserted to extend the nest towards the required shape. The process may depend on the twigs being slightly bendable. A weaver bird’s nest made of over a thousand long thin flexible leaves requires a far more complex process to attach each new leaf to the structure, including tying knots to hold the early leaves firmly to the branch from which the nest will hang.

10 Concluding remarks

We can look at a varied range of human and animal abilities still unmatched by computers and robots and ask whether the gaps exist because filling them requires major advances in AI. However, many aspects of human and animal competences are invisible. During the first few years, young humans display an extraordinary variety of changes of size, shape, strength, and forms of behaviour, including major landmarks, like grasping, crawling, walking, and uttering a two-word sentence. But many details go unnoticed. Moreover, there are wide individual and cultural differences, as well as developmental transitions. Attempts to discover regularities supported by evidence may ignore most of the interesting changes because they are idiosyncratic – enabled by a tangled mixture of shared and individual genetic mechanisms (Fig 3) operating in environments with subtle commonalities and differences. Normal laboratory experiments cannot cope with the variety and time-scales. Many developments in young humans seem to be mathematical, concerned not with numbers or logic or formal proofs but with intuitive reasoning about topology and non-metrical geometry, which seems to have been ignored by most AI researchers and most psychologists (with a few exceptions, e.g. Sauvy and Sauvy (1974)).


Characterising such changes requires a sophisticated mathematical background.

The actions of many other species – such as feeding, catching prey, avoiding predators, caring for offspring, building nests, solving practical problems, and in some cases interacting with humans – suggest that fairly rich subsets of human visual capabilities are shared with other species, though it is not easy to determine the extent of overlap. Moreover individual humans, and cultures, differ in what they use vision for, e.g. musical sight-reading, understanding mathematical formulae, tracking forest animals, diagnosing medical conditions; and visual capabilities of individual humans change over time.

Many people (e.g. Brooks (1991)) criticise early work in AI because of its concern with allegedly trivial tasks in toy worlds, e.g. stacking blocks, and the assumption that symbolic reasoning, such as explicit planning, is relevant to how robots need to act. One might have expected such critics to go on and demonstrate how their proposed methods could work in more demanding situations, such as transforming the configuration of objects in Figure 5(a) into the stack shown in Figure 5(b) and vice versa. Moreover, in addition to being able to make suitable plans and carry them out, an intelligent machine should be able to think about the 3-D spatial orientation its fingers would need in order to grasp one of the objects at a specified location. That orientation keeps changing around the rim of the cup, or the saucer. As far as I know that remains beyond the state of the art of current vision systems.

[Two images, (a) and (b), showing the objects in different configurations.]

Figure 5: Most people see the 3-D scenes depicted here fairly clearly, though not with great precision, as the images are noisy and have low resolution (160x120 pixels). A human can easily visualise a sequence of actions that would transform the configuration on the left to the configuration shown on the right, and vice versa, though not with perfect precision. What we can see in such images demonstrates that AI vision lacks something deeper than better camera technology.
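The rim example can be made concrete with a little geometry. The following sketch (my illustration with made-up numbers, not anything from the paper) treats the rim as a circle of radius r centred at (cx, cy); for a pinch grasp across the rim, the grasp axis at each rim point lies along the local outward radial direction, so the required orientation varies continuously around the rim.

    import math

    def rim_grasp_axis(cx, cy, r, theta):
        """Grasp point on the rim and the radial grasp axis (unit vector)."""
        point = (cx + r * math.cos(theta), cy + r * math.sin(theta))
        axis = (math.cos(theta), math.sin(theta))  # outward radial direction
        return point, axis

    for deg in (0, 45, 90):
        point, axis = rim_grasp_axis(0.0, 0.0, 4.0, math.radians(deg))
        print(deg, point, axis)
    # The axis at 90 degrees is perpendicular to the axis at 0 degrees:
    # no single stored orientation fits all rim points.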

This example is one of many that illustrate how simple everyday competences of humans and other animals defeat current working models of vision. The main point of this paper is that not enough research has been done on identifying requirements for human-like or animal-like vision systems, and this is just another example. The requirements for a vision system in a particular situation depend not only on what is seen but on what vision is being used for. If the use is to control the action of grasping a mug at a particular location, then relatively precise information is needed about whether the hand is moving in the right direction, rotating to the correct angle of approach, and moving the fingers in an appropriate trajectory to the selected grasping point. This usually does not require millimetre precision, since getting within a few millimetres and then closing the grasp suffices, whereas higher precision is required when threading a fine needle.

Much lower precision may be required for seeing that something is not graspable, e.g. because it is too far away, too big, has no suitable protrusions, is moving too fast, etc. Intermediate levels of precision may suffice for thinking about how an action could be done, what the consequences would be, and what might go wrong if a mistake is made or if some potential obstacle is moved during the action.
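A minimal sketch of this graded-precision idea (the function and thresholds are hypothetical, not a proposal from the paper): coarse qualitative tests, tolerant of large errors in the estimates, can often settle the “not graspable” question before any precise metrical reconstruction is attempted.

    def obviously_ungraspable(dist, size, speed,
                              reach=1.0, max_aperture=0.1, max_track=0.5):
        """Cheap, low-precision reasons to reject grasping altogether."""
        reasons = []
        if dist > reach:
            reasons.append("too far away")
        if size > max_aperture:
            reasons.append("too big for the hand")
        if speed > max_track:
            reasons.append("moving too fast")
        return reasons  # an empty list means: not *obviously* ungraspable

    print(obviously_ungraspable(dist=3.0, size=0.05, speed=0.0))
    # ['too far away']  (no precise shape model was needed)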

For some tasks it may be enough to discretise possibilities, for instance when planning future actions, explaining past events, or giving someone else verbal instructions: ‘Put the cup on the table, then put the saucer on the cup, and then the spoon on the saucer’.
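For illustration, such a discretised plan might be represented as an ordered list of symbolic relations, with all metrical detail deferred until execution. This is a schematic sketch, not a claim about any particular planner:

    plan = [
        ("put_on", "cup", "table"),
        ("put_on", "saucer", "cup"),
        ("put_on", "spoon", "saucer"),
    ]

    state = set()
    for action, obj, support in plan:
        state.add((obj, "on", support))  # coarse effect: obj rests on support

    print(sorted(state))
    # [('cup', 'on', 'table'), ('saucer', 'on', 'cup'), ('spoon', 'on', 'saucer')]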

All of this implies that a visual system capable of representing shape and surface structure in only one way will not be adequate, even if it has great mathematical precision. We need generalisations of work in qualitative representation and reasoning that are applicable to 3-D features and relationships. A more detailed analysis requires further research, but it is to be expected that different kinds of learning process will provide both topological and metrical ontologies, allowing different ways of chunking continuous spaces, reducing resolution, and developing measurements of varying precision relative to features of the perceiver, such as proportions of the visual field, proportions of the various kinds of stretch the agent is capable of, and proportions of lengths of the agent’s body parts, as well as size and angle representations relative to other things in the scene.
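One way to picture such a multi-representation system is sketched below (names invented here for illustration): a single scene entity carries precise metrical coordinates, a coarse agent-relative re-description (distance in arm-lengths), and qualitative relations, each serving different tasks.

    from dataclasses import dataclass, field

    @dataclass
    class SceneEntity:
        name: str
        position_m: tuple                             # metrical (metres)
        relations: set = field(default_factory=set)   # qualitative/topological

        def in_arm_lengths(self, arm_length_m):
            """Coarse re-description of distance in agent-relative units."""
            dist = sum(c * c for c in self.position_m) ** 0.5
            return round(dist / arm_length_m, 1)

    mug = SceneEntity("mug", (0.4, 0.3, 0.0),
                      {("on", "table"), ("left_of", "jug")})
    print(mug.in_arm_lengths(arm_length_m=0.7))  # 0.7 arm-lengths: reachable
    print(sorted(mug.relations))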

There is much still to be done on the forms of representation, mechanisms, architectures and varieties of learning and reasoning present in animals and needed in future robots. These are all contributions to the Meta-Morphogenesis project, investigating changes in forms of biological information processing on many time scales.

References

Abercrombie M (1960) The Anatomy of Judgement. Basic Books, New York

Arriola-Rios VE, Wyatt J (2011) 2D mass-spring-like model for prediction of a sponge’s behaviour upon robotic interaction. In: Bramer M, Petridis M, Nolle L (eds) Research and Development in Intelligent Systems XXVIII, Springer, London, pp 195–208, DOI 10.1007/978-1-4471-2318-7_14, URL http://dx.doi.org/10.1007/978-1-4471-2318-7_14

Boden MA (2006) Mind As Machine: A history of Cognitive Science (Vols 1–2). Oxford University Press, Oxford

Brooks RA (1991) Intelligence without representation. Artificial Intelligence 47:139–159, URL http://people.csail.mit.edu/brooks/papers/representation.pdf

Campbell J (1994) Past, Space and Self. MIT Press, Cambridge

Chappell J, Sloman A (2007) Natural and artificial meta-configured altricial information-processing systems. International Journal of Unconventional Computing 3(3):211–239, URL http://www.cs.bham.ac.uk/research/projects/cogaff/07.html#717

Gibson JJ (1966) The Senses Considered as Perceptual Systems. Houghton Mifflin, Boston

Gibson JJ (1979) The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, MA

Harnad S (1990) The Symbol Grounding Problem. Physica D 42:335–346

Hogg D (1983) Model-based vision: A program to see a walking person. Image and Vision Computing 1(1):5–20

Johansson G (1973) Visual perception of biological motion and a model for its analysis. Perception and Psychophysics 14:201–211

Kant I (1781) Critique of Pure Reason. Macmillan, London, translated (1929) by Norman Kemp Smith

Karmiloff-Smith A (1992) Beyond Modularity: A Developmental Perspective on Cognitive Science. MIT Press, Cambridge, MA

Kuijper A (2013) From the Editor’s Desk: Unsolved Problems... The International Association for Pattern Recognition Newsletter 35(4), URL http://www.iapr.org/docs/newsletter-2013-04.pdf

Lakatos I (1976) Proofs and Refutations. Cambridge University Press, Cambridge, UK

Lee D, Lishman J (1975) Visual proprioceptive control of stance. Journal of Human Movement Studies 1:87–95

Lunau K (1992) A new interpretation of flower guide colouration: Absorption of ultraviolet light enhances colour saturation. Plant Systematics and Evolution 183(1-2):51–65, DOI 10.1007/BF00937735, URL http://dx.doi.org/10.1007/BF00937735

Marr D (1982) Vision. W.H.Freeman, San Francisco

Sauvy J, Sauvy S (1974) The Child’s Discovery of Space: From hopscotch to mazes – an introduction to intuitive topology. Penguin Education, Harmondsworth, translated from the French by Pam Wells


Selfe L (1977) Nadia – A Case of Extraordinary Drawing Ability in an Autistic Child. Academic Press, London

Simon HA (1969) The Sciences of the Artificial. MIT Press, Cambridge, MA (Second edition 1981)

Sloman A (1971) Interactions between philosophy and AI: The role of intuition and non-logical reasoning in intelligence. In: Proc 2nd IJCAI, William Kaufmann, London, pp 209–226, URL http://www.cs.bham.ac.uk/research/cogaff/04.html#200407, reprinted in Artificial Intelligence, vol 2 (3-4), pp 209–225, 1971

Sloman A (1983) Image interpretation: The way ahead? In: Braddick O, Sleigh A (eds) Physical and Biological Processing of Images (Proceedings of an international symposium organised by The Rank Prize Funds, London, 1982), Springer-Verlag, Berlin, pp 380–401, URL http://www.cs.bham.ac.uk/research/projects/cogaff/06.html#0604

Sloman A (2008) Evolution of minds and languages. What evolved first and develops first in children: Languages for communicating, or languages for thinking (Generalised Languages: GLs)? URL http://www.cs.bham.ac.uk/research/projects/cosy/papers/#pr0702

Sloman A (2010) Phenomenal and Access Consciousness and the “Hard” Problem: A View from the Designer Stance. Int J of Machine Consciousness 2(1):117–169, URL http://www.cs.bham.ac.uk/research/projects/cogaff/09.html#906

Sloman A (2013) Autistic Information Processing: Steps toward a generative theory of information-processing abnormalities. Research note, School of Computer Science, The University of Birmingham, URL http://www.cs.bham.ac.uk/research/projects/cogaff/misc/autism.html

Sloman A, Chrisley R (2003) Virtual machines and consciousness. Journal of Consciousness Studies 10(4-5):113–172, URL http://www.cs.bham.ac.uk/research/projects/cogaff/03.html#200302

Smith P, Reid I, Davison A (2006) Real-Time Monocular SLAM with Straight Lines. In: British Machine Vision Conf., Vol. 1, pp 17–26

Trehub A (1991) The Cognitive Brain. MIT Press, Cambridge, MA, URL http://people.umass.edu/trehub/

Wright I, Sloman A, Beaudoin L (1996) Towards a design-based analysis of emotional episodes. Philosophy, Psychiatry, and Psychology 3(2):101–126, URL http://www.cs.bham.ac.uk/research/projects/cogaff/96-99.html#2
