

Bio-inspired Visual Motion Estimation
Garrick Orchard† and Ralph Etienne-Cummings⋆, Fellow, IEEE

†Singapore Institute for Neurotechnology (SINAPSE), National University of Singapore, Singapore
⋆Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore MD, USA

Email: [email protected], [email protected]

Abstract—Visual motion estimation is a computationally intensive, but important task for sighted animals. Replicating the robustness and efficiency of biological visual motion estimation in artificial systems would significantly enhance the capabilities of future robotic agents. 25 years ago, in this very journal, Carver Mead outlined his argument for replicating biological processing in silicon circuits. His vision served as the foundation for the field of neuromorphic engineering, which has experienced a rapid growth in interest over recent years as the ideas and technologies mature. Replicating biological visual sensing was one of the first tasks attempted in the neuromorphic field. In this paper we focus specifically on the task of visual motion estimation. We describe the task itself, present the progression of works from the early first attempts through to the modern day state-of-the-art, and provide an outlook for future directions in the field.

I. INTRODUCTION

Visual sensing is a computationally intensive, but crucial task for sighted animals [1]. Although motion estimation is just one aspect of visual sensing, its importance is easily understood when observing its wide range of uses in biology [2], including depth perception, ego-motion estimation, collision avoidance and triggering escape reflexes, time-to-contact estimation (landing control), prey detection and identification, segmentation by motion, and visual odometry.

The ability to reliably estimate visual motion in artificial systems would find applications ranging from surveillance and tracking, to visual flight control [2], to video compression, image stabilization, and even the computer mouse [3]. However, the most relevant application of bio-inspired visual motion estimation is for embedded sensing onboard robotic agents capable of moving through and interacting with their environment, since this is precisely the function for which biological visual motion systems have evolved. Key characteristics which distinguish this application from others are: the sensor must operate in real-time, the sensor must operate under egomotion, the sensor must be physically carried by the agent, and the sensor should provide information which is relevant for enabling the agent to interact meaningfully with its environment. In fact, Wolpert argues in a recent TED talk that control of motion is the primary purpose of the brain [4].

There are significant differences between biological and artificial systems regarding how visual information is acquired and processed. State-of-the-art modern visual motion estimation methods still rely on capturing sequences of images (frames) in rapid succession, even though the majority of data in these images is redundant [5]. The problem of storing and transmitting this redundant information is partially overcome by using dedicated video compression ASICs, or in the case of standalone visual motion sensors, by computing on chip [6]–[9]. Nevertheless, these artificial approaches capture frames at pre-determined discrete time points regardless of the visual scene. On the other hand, biological retinae continuously capture data and perform a combination of compression and pre-processing in analog (using graded potentials) at the focal plane itself, with the visual scene largely driving when and where data is transmitted as spikes (voltage pulses) down the optic tract. Spikes are similar to digital pulses in artificial systems in that their signal amplitude can be restored and they are therefore particularly useful for communicating over longer distances, such as along the optic tract.

Processing also differs significantly between biological and artificial systems. Similarly to how artificial systems typically capture data at a constant rate, they must also compute at a constant rate to ensure all the captured data is processed, thus processing of visual information continues even if the scene is static. On the other hand, computation in biological systems is driven by the sparse captured data (spikes), in turn ensuring that neuron activation is sparse [10] (since neuron activation is driven by the sparse incoming data). This sparsity combined with the low power consumption of neurons which are not computing [11] results in significant energy savings. Modern biologically inspired sensors generate sparse data (events) in response to activity in the scene [12], [13] and this data can be used to drive sparse computation on modern neural simulator platforms [14]. Together, these sensors and neural simulators allow both data capture and computation to scale with scene activity.

The architectures used for processing also differ. Typical artificial systems compute on a small number of parallel processors, each of which performs sequential operations in a precise repeatable manner, and operates at a timescale on the order of nanoseconds. On the other hand, biology relies on massively parallel processing using a very large number of imprecise computing elements (neurons), each of which operates on a timescale on the order of milliseconds [15]. However, parallelism in artificial systems is increasing, particularly for visual processing (GPUs) and in emerging custom neural hardware platforms [14], [16]. State-of-the-art ASICs dedicated to visual motion estimation are also optimized to perform processing in as parallel a manner as possible.

Despite the imprecise nature of individual neurons, biological systems perform robustly and continue to do so even after the death of individual neurons. The same is not true of artificial systems, where a single fault can cause catastrophic failure of the entire system. Similar fault tolerance is highly desirable in artificial systems, especially as the number of transistors per device continues to increase, and as the size limits of silicon technology continue to be pushed. Even looking past silicon to nanotechnology, device yield continues to be a major challenge [17].

Fig. 1. Definition of axes used in (1) for a pinhole camera approximation with unit focal length. The z axis points out towards the scene perpendicular to the image plane, while the x and y axes are parallel to the image plane.

Beyond handling minor faults, biological systems are also able to learn and adapt to changes in the visual system itself [18], as well as to different environments through visual experience [19], allowing them to operate effectively under a wide range of visual conditions. Such self-contained on-line learning and adaptation would also prove valuable for artificial systems, removing any need for manual tuning of parameters for operation in different environments.

Biology's robust processing, low power consumption, and ability to learn and adapt all present desirable characteristics for artificial systems and drive the field of bio-inspired sensing and computation.

In this paper we provide a brief introduction to the visual motion estimation problem and provide background on the methods used in traditional computer vision versus biological systems, before reviewing advances in bio-inspired visual motion estimation for artificial systems, presenting our own approach to the problem, and discussing future directions.

II. THE VISUAL MOTION ESTIMATION PROBLEM

When relative motion is present between an observer (eye or camera) and objects in a scene, the projections of these objects onto the image plane (retina or pixel sensor) will move. Visual motion estimation is the task of estimating how the projections of these objects move on the image plane.

Assuming a pinhole camera approximation, with the image plane at unit focal length, visual motion can be described as a function of the image plane co-ordinates (x, y), the relative rotation (ωx, ωy, ωz) and translation (Tx, Ty, Tz) between the camera and the object being viewed, and the depth of the object (z) [20] as shown in Fig. 1 and (1) below.

\[
\frac{\delta x}{\delta t} = \frac{T_z x - T_x}{z} - \omega_y + \omega_z y + \omega_x x y - \omega_y x^2
\]
\[
\frac{\delta y}{\delta t} = \frac{T_z y - T_y}{z} + \omega_x - \omega_z x - \omega_y x y + \omega_x y^2 \tag{1}
\]

Fig. 2. The aperture problem illustrated with a triangle moving between an initial position (grey) and a final position (black), while viewed only through three apertures (blue circles). For the leftmost aperture, the image only varies in the horizontal direction, and therefore only the horizontal component of motion can be estimated. Similarly, for the middle aperture, only the vertical component of motion can be estimated. For the rightmost aperture, the viewed image varies along both the horizontal and vertical directions and therefore motion can be uniquely determined.

Visual motion is constrained to lie in the image plane and therefore has no z-direction component. The first term in each equation describes the visual motion due to translation, which is depth dependent, while the remaining terms describe visual motion due to rotation, which is independent of depth. In other words, visual motion due to rotation does not depend on the structure of the scene, while visual motion due to translation does depend on scene structure. The rotations and translations in the equation above are for motion of the camera relative to the origin as depicted in Fig. 1.
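To make (1) concrete, the minimal sketch below (our own illustrative Python/numpy example; the function name and numbers are not from the paper) evaluates both flow components for a given relative motion and depth. It also demonstrates the point above: the purely rotational case is unchanged by depth, while the translational case scales with 1/z.

```python
import numpy as np

def flow_from_motion(x, y, T, omega, z):
    """Evaluate (1): image-plane velocity at image coordinates (x, y) for
    camera translation T = (Tx, Ty, Tz), rotation omega = (wx, wy, wz) and
    scene depth z, assuming a pinhole camera with unit focal length."""
    Tx, Ty, Tz = T
    wx, wy, wz = omega
    dx_dt = (Tz * x - Tx) / z - wy + wz * y + wx * x * y - wy * x ** 2
    dy_dt = (Tz * y - Ty) / z + wx - wz * x - wy * x * y + wx * y ** 2
    return np.array([dx_dt, dy_dt])

# Pure rotation about the y axis: flow is independent of depth z.
print(flow_from_motion(0.1, 0.0, T=(0, 0, 0), omega=(0, 0.01, 0), z=1.0))
print(flow_from_motion(0.1, 0.0, T=(0, 0, 0), omega=(0, 0.01, 0), z=5.0))
# Pure translation along z: flow halves when depth doubles.
print(flow_from_motion(0.1, 0.0, T=(0, 0, 0.5), omega=(0, 0, 0), z=1.0))
print(flow_from_motion(0.1, 0.0, T=(0, 0, 0.5), omega=(0, 0, 0), z=2.0))
```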

The relationship described in (1) also shows that multiple different combinations of scene structure and relative motion can result in identical visual motion. Thus visual motion alone is not enough to infer relative motion or scene structure and additional information is required. For example, if the scene is static and the rotational motion of the sensor is known, then the translational motion direction can be determined, and a relationship between scene depth and camera translation speed can be obtained. Thus, measuring or even eliminating rotational motion allows additional valuable information to be derived from visual motion.

Visual motion can only be estimated in the presence of an intensity gradient. A shape of uniform colour moving against a background of identical colour will have no intensity gradient and will therefore not elicit a visual motion stimulus. More specifically, for motion to be detected, the intensity gradient must be non-zero in the direction of motion. This is an example of the aperture problem [21], to which all visual systems are prone, and is illustrated in Fig. 2.

When considering only a small image region which has no intensity gradient in a particular direction, the magnitude of image velocity in that direction cannot be determined unless additional information is available. The component of motion in the direction of the maximum image gradient can be determined, and is known as the “normal flow”, since it is perpendicular (normal) to the edge orientation. The larger the image region under consideration, the more likely it will contain gradients in different directions, helping to alleviate the aperture problem. A common approach is to simultaneously consider multiple neighbouring image regions and assume their motion to be either consistent [22] or smoothly varying [23], thus providing the additional constraint required to uniquely determine motion.

III. APPROACHES TO VISUAL MOTION ESTIMATION

For the purpose of providing background for later sections, we introduce here the basic theory underlying each of the three main classes of visual motion estimation approaches: correlation methods, gradient methods, and frequency methods.

Underlying all three of these methods is the assumption of brightness constancy, known as the brightness constancy constraint, which states that the brightness of a point remains constant after moving a small distance on the image plane [∆x, ∆y], within a small period of time, ∆t. Formally this can be written as:

\[
I(x, y, t) \approx I(x + \Delta x, y + \Delta y, t + \Delta t) \tag{2}
\]

where I(x, y, t) is the intensity of the point located at (x, y) on the image plane at time t.

A. Correlation Methods

Correlation methods for motion estimation rely on detecting the same visual feature at different points in time as it moves across the image plane. Correlation is used to determine whether two feature signals detected at different points in time relate to the same or different features. The feature signals on which correlation is computed can take the form of continuous-time analog signals, discrete-time analog signals, discrete digital signals, or even single bit binary tokens indicating only the presence or absence of a feature (“token methods”). The change in feature location can be combined with the change in time between detections to determine the feature's motion. The simplest features to use are brightness patterns, or derivatives thereof, the appearance of which remains constant over small time periods as described in (2).

Most state of the art commercially available motion estimation ASICs rely on correlation methods [6]–[8]. These devices capture frames at high frame rates and detect correlations in local pixel intensity patterns between images to determine motion. One very common approach is the Sum of Absolute Differences (SAD) block-matching algorithm [9], which matches “blocks” of pixels between frames by computing the SAD for pixel intensities. This process is repeated for many different blocks and the estimates from all of these blocks are combined to determine motion.
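As a rough illustration (our own minimal Python/numpy sketch for a single block; the block size, search range, and exhaustive search are illustrative choices, not those of any particular ASIC), the core of SAD block matching is simply taking the candidate displacement that minimises the sum of absolute pixel differences between consecutive frames:

```python
import numpy as np

def sad_block_match(prev, curr, cx, cy, block=8, search=4):
    """Return the (dx, dy) displacement of the block centred at (cx, cy)
    that minimises the Sum of Absolute Differences between frames."""
    h = block // 2
    ref = prev[cy - h:cy + h, cx - h:cx + h].astype(np.int32)
    best_sad, best = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[cy - h + dy:cy + h + dy,
                        cx - h + dx:cx + h + dx].astype(np.int32)
            sad = np.abs(ref - cand).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best  # displacement in pixels between the two frames

# A textured patch shifted right by 2 pixels between frames.
rng = np.random.default_rng(0)
prev = rng.integers(0, 255, size=(32, 32), dtype=np.uint8)
curr = np.roll(prev, 2, axis=1)
print(sad_block_match(prev, curr, cx=16, cy=16))   # (2, 0)
```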

B. Gradient Methods

Gradient methods rely on the Taylor series expansion of (2), which for a first order expansion can be rearranged into the form:

\[
\frac{\delta I(x,y,t)}{\delta x}\frac{\delta x}{\delta t} + \frac{\delta I(x,y,t)}{\delta y}\frac{\delta y}{\delta t} + \frac{\delta I(x,y,t)}{\delta t} = 0 \tag{3}
\]

Fig. 3. Frequency representation of the motion described in (4). The left plot (red) shows the relationship between time and location for a point moving with constant velocity. The right plot (blue) shows the velocity dependent relationship between spatial and temporal frequency for the same moving point.

where δx/δt and δy/δt are the visual motion values that must be estimated, while δI(x,y,t)/δx, δI(x,y,t)/δy and δI(x,y,t)/δt are intensity derivatives which can be obtained from captured frames.

Notice that if the intensity derivative in either spatial direction is zero, then motion in that direction is removed from the equation and cannot be estimated. This is the aperture problem discussed in Section II. When these spatial derivatives are non-zero, but small, they are sensitive to noise and can still result in erroneous motion measurements. Even if accurate non-zero intensity derivatives are available, (3) is a single equation with two unknowns and thus does not provide a unique solution.

To arrive at a unique solution additional constraints must be imposed, such as that the motion of all points in an image patch will be equal (as is used in the Lucas-Kanade algorithm [22]), or that motion varies smoothly across image locations (as is used in the Horn-Schunck algorithm [23]).
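A minimal sketch of the Lucas-Kanade case (our own numpy formulation, assuming the spatial and temporal derivatives over the patch have already been estimated from the frames): writing (3) once per pixel of the patch gives an over-determined linear system whose least-squares solution is the single velocity assigned to the patch.

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Solve the stacked brightness constancy constraints (3) for one patch,
    assuming every pixel in the patch shares the same velocity [22].
    Ix, Iy, It are arrays of spatial and temporal derivatives over the patch."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # one row of (3) per pixel
    b = -It.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)        # least-squares velocity
    return v                                         # (vx, vy) in pixels/frame

# Synthetic patch whose gradients are consistent with motion of (1, 0.5)
# pixels per frame, since (3) requires It = -(Ix*vx + Iy*vy).
rng = np.random.default_rng(0)
Ix, Iy = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
It = -(Ix * 1.0 + Iy * 0.5)
print(lucas_kanade_patch(Ix, Iy, It))   # ~[1.0, 0.5]
```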

C. Frequency Methods

Frequency based methods rely on the observation that there is a relationship between temporal frequency, spatial frequency, and velocity [24]. For simplicity consider a point (Dirac Delta function [25]) moving in the x direction. This point will trace out a line in the space-time plot on the left of Fig. 3 with slope equal to the velocity. Taking the Fourier transform results in a line in frequency space with slope equal to the inverse of velocity, as shown on the right of Fig. 3.

\[
I(x, t) = \delta(x - v_x t), \qquad v_x = \frac{\Delta x}{\Delta t}
\]
\[
I_\omega(\omega_x, \omega_t) = \delta(\omega_t + v_x \omega_x), \qquad \frac{1}{v_x} = -\frac{\Delta \omega_x}{\Delta \omega_t} \tag{4}
\]

where I(x, t) is the intensity at location x at time t, vx is the velocity in the x direction, δ is the Dirac Delta function, and Iω(ωx, ωt) is the Fourier transform [26] of I(x, t).

The case described above is an ideal case where the stimulus is a point (Dirac Delta function) and therefore has equal energy at all spatial frequencies. In the more general case of a stimulus with an arbitrary distribution of spatial frequency content, the energy will still be constrained to lie along the line shown on the right of Fig. 3.
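A quick numerical check of (4) (our own example, assuming only numpy): the two-dimensional Fourier transform of a drifting sinusoid peaks at a spatial and temporal frequency pair whose ratio recovers the speed.

```python
import numpy as np

N, T, v = 64, 64, 3.0            # pixels, frames, speed in pixels per frame
k = 2                            # spatial frequency: k cycles across N pixels
x = np.arange(N)
t = np.arange(T)[:, None]
I = np.sin(2 * np.pi * k * (x - v * t) / N)      # drifting sinusoid, I[t, x]

F = np.fft.fftshift(np.abs(np.fft.fft2(I)))
wt = np.fft.fftshift(np.fft.fftfreq(T))          # temporal freq, cycles/frame
wx = np.fft.fftshift(np.fft.fftfreq(N))          # spatial freq, cycles/pixel
ti, xi = np.unravel_index(np.argmax(F), F.shape)
print(abs(wt[ti] / wx[xi]))                      # ~3.0, the recovered speed
```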


Fig. 4. An example placement of four different quadrature pairs of spatiotemporal filters in frequency space. Each quadrature pair is indicated by a different color and is symmetric about the ωt axis and the plane ωt = 0 (shaded gray). The red and dark blue quadrature pairs are sensitive to different velocities in the y-direction. The green quadrature pair is sensitive to motion of a specific speed in the x-direction. The light blue quadrature pair is most sensitive to a particular velocity which has both x and y components.

Visual motion can be estimated by finding the slope of the line in Fig. 3, which can be achieved by tiling frequency space with spatiotemporal filters as shown in Fig. 4 and combining their responses to find the location of the energy peak.
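The sketch below is a simplified frequency-domain version of this idea (our own toy example, not an implementation of any of the models discussed later): instead of an explicit bank of quadrature-pair filters, each candidate speed v is scored by how much of the spectral energy of an x-t signal lies near its line ωt = −vωx, and the best-scoring speed is returned.

```python
import numpy as np

def dominant_speed(I, speeds, sigma=0.02):
    """Score each candidate speed v by the spectral energy of the (time, x)
    signal I lying near the line wt = -v*wx, and return the best candidate."""
    T, N = I.shape
    F = np.abs(np.fft.fft2(I)) ** 2
    F[0, 0] = 0.0                              # ignore the DC component
    wt = np.fft.fftfreq(T)[:, None]            # temporal freq, cycles/frame
    wx = np.fft.fftfreq(N)[None, :]            # spatial freq, cycles/pixel
    scores = [np.sum(F * np.exp(-(wt + v * wx) ** 2 / (2 * sigma ** 2)))
              for v in speeds]
    return speeds[int(np.argmax(scores))]

# A Gaussian bar drifting rightward at 3 pixels/frame (wrapping around).
T, N, v_true = 64, 64, 3.0
t, x = np.arange(T)[:, None], np.arange(N)[None, :]
I = np.exp(-(((x - v_true * t) % N) - N / 2) ** 2 / 8.0)
print(dominant_speed(I, speeds=[0.5, 1.0, 2.0, 3.0, 4.0]))   # 3.0
```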

D. Implementation

Typical artificial approaches to motion estimation rely on a frame-based camera to capture snapshots (frames) of the scene at fixed intervals. A processor is then used to apply one of the three methods described above to estimate visual motion. In computer vision typically correlation or gradient based methods are used, with frequency methods regarded as a bio-inspired approach as will be discussed in Section IV.

In all three of the methods outlined above, increasing the frame rate improves the accuracy of the algorithm because the brightness constancy constraint relies on the assumption of a small time period between observations (frames).

For correlation methods, increased frame-rate also decreases the distance a feature can move between frames, thereby helping to restrict the search region for that feature in subsequent frames. For example, typical optical mouse algorithms operate at thousands of frames per second (albeit small frames), and the search can be in increments of fractions of a pixel. For gradient and frequency methods, increased frame-rate is equivalent to a higher temporal sampling rate, reducing aliasing and allowing for higher order digital filters to be used when estimating the temporal derivative or frequency.

However, increasing frame-rate also increases the computing power required to sustain real-time operation, since more frames must be processed within the same time period. The additional computing needs are typically met by using more powerful hardware, such as GPUs, FPGAs, and custom ASICs. Tight coupling of a frame-based sensor and ASIC is often used to reduce communication costs for embedded applications, such as for a stand-alone motion estimation unit relying on a high frame rate, or in-camera video compression relying on block matching.

Some artificial approaches to motion estimation do not rely on frames, but instead operate on continuous-time analog signals derived from CMOS photodiodes. Notable examples include [3], [27].

For all three of the approaches described, motion estimates within a local image region can be computed independently of motion estimates for other image regions. This opens up the possibility of simultaneously computing motion for different image regions in parallel. GPUs, FPGAs, and ASICs all take advantage of this.

The computational complexity of each approach varies. A major disadvantage of gradient methods is that they typically require the expensive computation of a matrix inverse (or pseudo-inverse) in order to find the solution which best satisfies (with the least square error) the brightness constancy (3) and secondary constraints. However, gradient methods have the advantage of computing on instantaneous gradient values and therefore require very little memory (only enough to estimate temporal gradients).

Computing correlations for correlation methods is computationally simpler, but requires the system to have memory of previously observed features, whether by delaying feature signals or by explicitly storing them.

Digital implementations of frequency methods require even more memory because many time points are required to detect temporal frequencies, particularly if very low frequencies are present. The spatial and temporal frequency content of the scene is typically not known in advance, so frequency methods require a large number of spatiotemporal filters in order to accurately detect the frequency content of different motion stimuli. Implementing these filters digitally is costly both in terms of computation and memory.

The robustness of the approaches also varies. When computing the correlation between two signals, it is not always clear whether a large output has resulted from a strong correlation or from large input signals. Signals can be normalized before correlation to overcome this ambiguity. However, the typical signals on which correlation is computed arise from a combination of image motion and image spatial contrast and it is not always possible to disambiguate the effects of motion versus contrast in the final output.

Gradient methods rely on the ratio of the temporal gradient to the spatial gradient (3) and are therefore very sensitive to noise, particularly when spatial gradient signals are weak.

Spatiotemporal frequency models are far more robust to noise, but as discussed above, the computational and memory requirements for estimating frequencies are far higher.

IV. VISUAL MOTION ESTIMATION IN BIOLOGY

The animal kingdom is incredibly diverse and the visual systems of many creatures have evolved independently [29] (although they may share a common origin), resulting in variations in size, number, shape, location, wavelength sensitivity, and acuity of eyes across families [30]. Similarly to how eyes vary by family, so does the process of visual motion estimation. For the sake of discussion we will focus specifically on Drosophila (fruit flies) and macaque (monkeys), which are both well studied genera, but possess very different visual systems. Properties of the vision systems of Drosophila and macaque generalise to many other insects and mammals respectively. This section is intended only as a brief introduction. For more details, the reader is directed to neuroscience reviews covering Drosophila [31] and primate vision [32].

Fig. 5. A comparison of the Hassenstein-Reichardt (left), Barlow-Levick (center), and Adelson-Bergen (right) models, similar to that in [28]. Spatial filters (dark blue) can either differ in location (as shown) or in phase. Two temporal filters (red) are used, with the second (ωt2) having a longer delay than the first (ωt1). The Reichardt model detects correlation between the signal at one location and the delayed signal from a neighbouring location. The correlation can be modelled as a multiplication (brown), or as a logical AND if using single bit signals. The Barlow-Levick model instead uses the signal to block response to motion in the null direction. This can be modelled as a summation (green) followed by half wave rectification (light blue) to prevent negative intermediate responses. If using 1-bit signals, the inhibition can be modelled as an AND gate with the inhibiting (delayed) input negated. The Adelson-Bergen model combines separable spatiotemporal filter responses incorporating a squaring (light blue) non-linearity to compute motion energy. The output of the Adelson-Bergen model is formally equivalent to that of the Hassenstein-Reichardt model [24].

Despite the differences between macaque and Drosophila vision systems, there are also many similarities. In both systems initial computation is performed at the focal plane by neurons which communicate using a combination of spiking and graded responses. These neurons respond to intensity changes, with responses to intensity increases (ON) and decreases (OFF) processed independently by parallel pathways [33]. In Drosophila, motion estimation can still be performed in the L1 (ON) pathway if the L2 (OFF) pathway is blocked, and vice versa. In macaque, direction selective Starburst Amacrine Cells (SACs) [34] can be found in the retina itself, and separate SACs are used to process ON and OFF responses in parallel. Although direction selective, these cells show very limited speed sensitivity, with true velocity sensitive neurons only found in higher visual areas.

In Drosophila, as with many other animals, a fast escape reflex triggered by visual motion aids in evading approaching predators. Low latency motion detection is critical for this task and is achieved by keeping the motion processing circuitry relatively simple and located close to the photoreceptors [35]. These motion processing circuits rely on correlation methods, which are fast, and allow for compact implementation which can be realised in the limited space available near the photoreceptors [36]. Furthermore, these circuits are tuned to detect stimuli characteristics indicative of an approaching predator, rather than to accurately measure a wide range of complex motion stimuli.

On the other hand, the macaque visual system is more concerned with accurately estimating motion for a wide range of speeds and visual stimuli than detecting approaching predators. Accurate motion estimates help to achieve a deeper scene understanding, which can then be used for action planning. This link between motion estimation and scene understanding requires interaction between motion detectors and cortex, and motion sensitive neurons are therefore found in various cortical areas [37].

Low latency detection is also desirable for macaque, but is not as important as when triggering escape reflexes, allowing the luxury of using more complex motion estimation methods and taking advantage of the computational resources available in cortex. The macaque visual system is therefore not restricted to using simple correlation methods. The presence of spatiotemporal frequency sensitive neurons in the Middle Temporal (MT) visual area, which plays an important role in motion estimation [38], suggests that motion estimation in primates relies on frequency methods [39].

In both macaque and Drosophila the visual motion estimation system is tightly coupled with the motor system. In macaque, the vestibular ocular reflex is important for visual perception [40]. Visual inputs can trigger saccades, and saccades can suppress visual responses [41]. In Drosophila, the visual system can trigger motor responses through the optomotor reflex, and flight control is heavily reliant on visual motion estimation [2].

It is therefore important when considering biological vision systems to note that they do not exist in isolation. The visual system is part of an embodied system capable of moving through, and interacting with the environment. The visual and motor systems are tightly coupled and deficits in either system can affect the other, as documented in both primates [42] and Drosophila [43]. The optomotor response is so strong in many insects that motor outputs in response to visual stimuli can provide insight into the visual system.

It was through an investigation of the optomotor response of the beetle Chlorophanus in the 1950s that Hassenstein and Reichardt arrived at their seminal model of the Elementary Motion Detector (EMD) [36], shown in Fig. 5a. The Hassenstein-Reichardt EMD computes the correlation between the signal of one photoreceptor and the time-delayed signal of a neighbouring photoreceptor. The delay is typically modelled as a low pass filter and correlation performed as multiplication. Strong correlation indicates the presence of motion in the preferred direction. A mirror symmetric circuit detects motion in the opposite direction, and the difference between the circuit outputs indicates motion direction.
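A minimal simulation of this scheme (our own sketch with illustrative signals and time constants, not a model of any specific neuron): the low-pass filtered (delayed) signal from each photoreceptor is multiplied with the undelayed signal from its neighbour, and the two mirror-symmetric products are subtracted, so a stimulus crossing the pair in the preferred direction produces a positive mean output.

```python
import numpy as np

def reichardt_emd(left, right, dt=1.0, tau=10.0):
    """Hassenstein-Reichardt EMD on two photoreceptor signals sampled every
    dt. The delay is a first-order low-pass filter (time constant tau),
    correlation is a multiplication, and the mirrored outputs are subtracted."""
    alpha = dt / (tau + dt)
    left_lp = np.zeros_like(left)
    right_lp = np.zeros_like(right)
    for i in range(1, len(left)):              # first-order low-pass (delay)
        left_lp[i] = left_lp[i - 1] + alpha * (left[i] - left_lp[i - 1])
        right_lp[i] = right_lp[i - 1] + alpha * (right[i] - right_lp[i - 1])
    return left_lp * right - right_lp * left   # opponent (signed) output

# A bright feature passes the left photoreceptor 15 samples before the right
# one (the preferred direction), giving a positive mean opponent output.
t = np.arange(0.0, 200.0, 1.0)
left = np.exp(-((t - 100.0) ** 2) / 50.0)
right = np.exp(-((t - 115.0) ** 2) / 50.0)
print(reichardt_emd(left, right).mean() > 0)   # True
```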

The Hassenstein-Reichardt EMD does not provide a direct measure of the stimulus velocity. Instead, it provides an indication of how well the motion stimulus matches the EMD's preferred combination of speed and spatial frequency. Multiple combinations of speed and spatial frequency can result in the same EMD output magnitude, so speed cannot be uniquely determined. Even if the spatial frequency is known, there are speeds greater than and less than the EMD's preferred speed for which the response magnitude would be equal, so speed would still not be uniquely determined.

These observations led to many interesting predictions which were later verified. Evidence of the existence of the Hassenstein-Reichardt model has since been found in many other visual systems, including that of Drosophila.

By the 1960s Hubel and Wiesel had isolated directionally selective units in the cat cortex [44]. Later similar responses were observed in the tectum of pigeons and frogs. Barlow and Levick found such directional responses even earlier in the visual pathway of the rabbit, in the retina itself. Based on their recordings, they proposed what is now known as the Barlow-Levick model [45] shown in Fig. 5b. The Barlow-Levick model relies on inhibition instead of excitation as the underlying mechanism, with null direction motion inhibiting a motion unit's response. Although the presence of Hassenstein-Reichardt and Barlow-Levick motion models has been ruled out in macaque, a similar mechanism relying on unbalanced inhibition is thought to underlie directional selectivity of starburst amacrine cells in primate retina [34].

As mentioned earlier, the presence of cells in MT sensitive to specific spatiotemporal frequencies indicates that frequency methods likely underlie motion perception in macaque. In 1985, Adelson and Bergen [24] and Watson and Ahumada [46] proposed similar architectures for motion estimation based on spatiotemporal frequency filtering. The Adelson-Bergen motion energy model is shown in Fig. 5c. Each unit computes the energy at a specific spatiotemporal frequency, with the relationship between spatial and temporal frequency being indicative of speed as outlined in Section III. Adelson and Bergen also showed how the opponent energy output by their model was equivalent to the output of the Hassenstein-Reichardt model. The Adelson-Bergen model has since been used to explain many visual illusions [47], [48] and Simoncelli and Heeger [49] have proposed a model describing how the spatiotemporal responses in MT may be computed in biology.

Simoncelli [50] also took the brightness constancy constraint (3) relied upon by gradient methods and used it to develop a probabilistic Bayesian framework capable of explaining responses of MT neurons, although the framework does not describe how such responses are computed physiologically.

In the abovementioned models, individual motion units do not encode velocity directly, but rather are selective to specific spatiotemporal stimuli. To infer velocity, the responses of motion units must be combined as described in [51].

V. REVIEW OF BIO-INSPIRED WORKS

In this section we discuss integrated real-time bio-inspired visual motion estimation works, which are compared in Table I. The discussion is divided into three subsections, one for each of the three methods described in Section III, and within each subsection, works are described in chronological order.

A. Gradient Methods

In his seminal 1986 thesis, Tanner [3] presented two VLSI implementations of visual motion estimation. The first made use of the correlation method and will be discussed in the next subsection. Tanner's second implementation used a 2D analog VLSI gradient based approach relying on a feedback loop to arrive at a minimum-error solution which simultaneously satisfies both the brightness constancy constraint (3) and a local smoothness constraint. This gradient based approach was later adopted and extended in other works by Stocker [90] and Mehta [86].

B. Correlation Methods

1) Block Matching: Tanner's first implementation uses a correlation method. It comprises a linear array of 16 pixels which is sampled and binarized in a single step to produce a binary image. These binary images are captured and their correlation to shifted versions of an initial image is computed (the shifts are 1 pixel left, no shift, and 1 pixel right). Digital pulses indicate when leftward or rightward displacement by a single pixel has occurred, at which point a new initial image is captured and the process is repeated. Velocity is encoded by the digital pulse rate at the output of the sensor. Gottardi [60] would later present a similar approach using CCD pixels coupled with CMOS circuitry, but capable of detecting motion of up to 5 pixels per image.

Yakovleff [63] presented a local matching approach which uses the sign of temporal gradients as the feature to be matched by digital circuits.

Arreguit [64] presented the first 2D array for block matching, which used spatial edges as features and was designed for use as a pointing device (computer mouse). Pixels computed motion locally, and local estimates were combined to estimate global motion.


TABLE I
SUMMARY OF BIO-INSPIRED VLSI VISUAL MOTION ESTIMATION WORKS

Author Year Process Array Motion Method Feature
Tanner [3] 1986 2µ 1D (1x16) Global Block-Matching Intensity variation
Tanner [3] 1986 1.5µ 2D (8x8) Global Gradient -
Franceschini [52] 1989 - 1D (1x100) Local Token Temporal edge
Andreou [53] 1991 2µ 1D (1x25) Global Reichardt ON-center OFF-surround
Etienne-Cummings [54] 1992 2µ 2D (5x5) Global Token (TI) Temporal edge of center surround
Horiuchi [55], [56] 1992 2µ 1D (1x17) Local Token (TS) Temporal edge
Delbruck [57] 1993 2µ 2D (25x25) Local Reichardt Temporal contrast
Sarpeshkar [58], [59] 1993 2µ NA NA NA Token Temporal edge of spatial contrast
Gottardi [60] 1995 2µ 1D (1x115) Global Block-Matching Intensity values
Kramer [61], [62] 1995 2µ 1D (1x8) Local Token (FS) Temporal edge
Yakovleff [63] 1996 2µ 1D (1x61) Local Block-Matching Sign of spatiotemporal gradients
Arreguit [64] 1996 2µ 2D (9x9) Local Block-Matching Spatial edge
Etienne-Cummings [65], [66] 1997 2µ 2D (9x9) Global Token Temporal edge of center surround
Moini [67], [68] 1997 1.2µ 2x1D (2x64) Local Block-Matching Spatiotemporal templates
Harrison [69] 1998 2µ 2D (1x2) Local Reichardt Temporal contrast
Higgins [70] 1999 1.2µ 2D (14x13) Local Token (ITI, FS) Temporal edge
Indiveri [71] 1999 1.2µ 2D (8x8) Global Token (FS) Temporal edge
Jiang [72] 1999 0.6µ 2D (32x32) Global Token (ISI) Temporal edge of spatial contrast
Etienne-Cummings [51] 1999 2µ 2x1D (2x18) Global Adelson-Bergen Spatiotemporal energy of edge map
Barrows [73] 2000 1.2µ 2x1D (2x4) Global Token (FS) Spatial features
Liu [74] 2000 1.2µ 1D (1x37) Global Reichardt Temporal contrast
Pant [75] 2000 1.6µ 2D (13x6) Local Reichardt Temporal contrast
Higgins [76] 2000 1.2µ 2D (13x15) Local Token (FS) Temporal edge
Harrison [77] 2000 1.2µ 1D (1x22) Global Reichardt Temporal contrast
Higgins [78] 2002 1.2µ 2D (27x29) Local Token (ITI) Temporal edge
Yamada [79] 2003 1.5µ 2D (2x10) Local Token (FS) Spatial edge
Ozalevi [80] 2003 1.5µ 2D (6x6) Global Adelson-Bergen Spatiotemporal energy
Massie [81] 2003 0.5µ 12x1D (12x90) Yaw/Pitch/Roll Token (FS) Temporal edge of spatial contrast
Stocker [27] 2004 0.8µ 2D (30x30) Local Gradient -
Ozalevi [82], [83] 2005 1.6µ 2D (6x7) Global Reichardt Temporal contrast
Harrison [84] 2005 0.5µ 2D (16x16) Global Reichardt Temporal contrast
Shoemaker [85] 2005 0.35µ 1D (1x7) Global Reichardt Temporal contrast
Mehta [86] 2006 0.5µ 2D (95x52) Local Gradient -
Moeckel [87] 2007 1.5µ 1D (1x24) Local Token (FS) Temporal contrast
Bartolozzi [88] 2011 0.6µ 1D (1x64) Local Token (FS) Temporal edge
Roubieu [89] 2013 - 1D (1x5) Global Token (ISI) Temporal contrast

2) Hassenstein-Reichardt and Barlow-Levick Models: At a high level, Tanner's implementation can be seen as sequentially implementing Reichardt detectors with increasing time delays until sufficient correlation is detected. In 1991 Andreou [53] reported an analog implementation of the Reichardt detector which instead outputs the correlation value itself. The sensor computes ON-center OFF-surround features in analog and uses an all-pass filter to implement the delay. Outputs from all the Reichardt detectors along a linear array are summed and output as a differential current.

Delbruck [57] later presented a 2D analog variation of the Reichardt detector. Temporal contrast was used as the input signal and was delayed using a multi-tap analog delay line capable of propagating a signal across multiple pixels. At each pixel the correlation is computed between the propagating signal and the local signal before being propagated to the next pixel. The output therefore incorporates signals from many pixels and increases if motion is sustained across multiple pixels. However, the magnitude of the output is highly dependent on contrast.

Harrison [77] presented another analog Reichardt implementation which also computes correlation based on temporal contrast, but exhibits decreased dependence on contrast magnitude. The detectors in the chip are either accessed individually, or combined using a linear summation to obtain a global response. Harrison also used the approach to generate an artificial optomotor response (torque signal) and compared it to measurements of Drosophila in experiments [77].

Pant [75] presented an analog Reichardt implementation in which the responses of multiple Reichardt detectors are combined on-chip in a non-linear fashion. Pant showed how the output can be used to generate a torque signal to control the gaze direction of a robot during visual tracking, even though adjusting the gaze of the robot induces optical flow through egomotion (1).

Liu [74] also presented an analog Reichardt implementation with non-linear on-chip integration of detector outputs, and showed that the frequency response of the chip is similar to the frequency response of Horizontal System (HS) neurons in Drosophila [43].

Shoemaker [85] presented an analog Reichardt implementation based on visual motion processing in the fly. The sensor makes use of non-linearities in the processing chain to reduce the dependence of the sensor output on scene contrast and spatial frequencies.

Harrison [84] later presented a time-to-contact sensor for collision avoidance based on detecting 2D looming motion fields with Reichardt detectors.

Although partially addressed in the latter of the works mentioned thus far, one of the shortcomings of the Reichardt detector is the dependence of the output on stimulus contrast. This can be overcome by using so-called “token” methods, where thresholding provides a 1 bit token to signal the presence or absence of a stimulus.

Fig. 6. Popular token methods. The left image (a) shows how tokens are typically generated by thresholding (light blue) the output of a temporal bandpass filter (red) to detect changes in pixel intensities. Spatial filtering (blue) is optionally performed as a first step. The Trigger and Inhibit (TI), Facilitate and Trigger (FT), Facilitate Trigger and Inhibit (FTI), and Facilitate and Sample (FS) methods are shown in (b-e) respectively for a rightward moving stimulus, which will trigger responses from left (first) to right (last) when moving across the array. Red arrows indicate preferred direction of motion in each case.

3) Token Methods: Horiuchi [55] developed a linear array which used detection of a sufficient temporal intensity derivative (edge) as a digital token. Tokens from neighbouring pixels propagate down a digital delay line in opposite directions until they cross. The location in the delay lines at which they cross indicates both speed and direction. The delay line between each pixel pair contributes a “vote” into a Winner-Take-All circuit which outputs the location with the most votes.

Etienne-Cummings [54] developed a chip for 2D motion detection which uses the temporal derivative of a center-surround feature as the token. The chip encoded speed using the width of a pulse initiated by departure of a token from one pixel and terminated by arrival of the token at either neighbour. A voting scheme was used to determine direction.

Sarpeshkar [58] also took a token approach in which tokens trigger digital pulses. The pulse from one pixel would be delayed and correlated against its neighbour. The output provided a measure of how well the observed stimulus matched the circuit's optimal stimulus, but the response to non-optimal stimuli is ambiguous because both increasing and decreasing the stimulus speed causes the response to decrease. In the same paper, Sarpeshkar proposed a facilitate and trigger approach to overcome the speed ambiguity.

Similar approaches to that proposed by Sarpeshkar soon became popular. Fig. 6 outlines some of these token methods. The Trigger and Inhibit (TI) mechanism (Fig. 6b) triggers a pulse when a token is detected at a pixel, and ends (inhibits) the pulse when the token is detected at the next pixel, thereby providing a pulse with width inversely proportional to speed. However, for motion in the null direction, the inhibition occurs before the trigger, and the pulse can continue indefinitely. The shortest of two pulses from directionally opposing circuits is typically assumed to be the correct one.

The Facilitate and Trigger (FT) method (Fig. 6c) only triggers a pulse if a facilitation signal generated by the previous pixel is present. The pulse ends when the facilitation signal ends, thereby limiting the maximum pulse width to the width of the facilitation pulse. For motion in the null direction, the trigger occurs before the facilitation signal and no pulse is generated. Unlike the TI case, the pulse width from the FT approach is directly proportional to speed.

The Facilitate, Trigger, and Inhibit (FTI) method (Fig. 6d) combines the outputs of three pixels, using a facilitation signal for directional selectivity, followed by a Trigger and Inhibit mechanism to generate a pulse inversely proportional to the stimulus speed.

In the TI, FT, and FTI methods (Fig. 6b-d) the duration of a pulse must be measured in order to infer speed. The Facilitate and Sample (FS) method (Fig. 6e) overcomes this by using a shaped facilitation pulse. Instead of triggering an output pulse, the facilitation pulse is sampled, providing a value proportional to speed. For motion in the null direction, the facilitation signal will be zero when it is sampled.
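A toy illustration of the FS idea (our own sketch; the exponential facilitation shape, time constant, and validity window are illustrative, not taken from any of the chips discussed here): a token at one pixel launches a decaying facilitation signal, and the value sampled when its neighbour's token arrives is larger for faster motion, while motion in the null direction samples nothing.

```python
import numpy as np

def fs_sample(t_facilitate, t_trigger, tau=20.0, window=100.0):
    """Facilitate and Sample for one direction-selective pixel pair: a token
    at time t_facilitate starts a decaying facilitation signal, and a token at
    the neighbouring pixel at time t_trigger samples it (times in ms)."""
    dt = t_trigger - t_facilitate
    if dt < 0 or dt > window:        # null direction, or facilitation expired
        return 0.0
    return float(np.exp(-dt / tau))  # larger sample for shorter transit times

print(fs_sample(0.0, 5.0))    # ~0.78: fast motion in the preferred direction
print(fs_sample(0.0, 40.0))   # ~0.14: slower motion, smaller sampled value
print(fs_sample(10.0, 0.0))   # 0.0:  motion in the null direction
```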

Other methods have been proposed which take a similar approach, but rely on inhibition to suppress response in the null direction rather than facilitation to enable response in the preferred direction.

Kramer [61], [62] presented the first FTI and the first FS token implementations, each using an 8 pixel linear array with temporal contrast edges as the token.

Etienne-Cummings [65] developed a foveated sensor for tracking and stabilization, consisting of a 19x17 pixel array for detecting onset and offset of spatial edges, with the middle 5x5 pixels replaced by a 9x9 array of smaller motion estimating pixels which output motion direction only, thereby realising a “bang-bang” output.

Higgins [70] later presented the first Inhibit, Trigger, and Inhibit (ITI) implementation as well as the first 2D FS implementation, both using temporal contrast edges for the token. His FS implementation also subtracted the samples of opposite direction circuits on-chip to provide a signed velocity. He later further developed the concept and extended it to a larger array [78]. Jiang [72] meanwhile presented the first 2D FTI implementation, using temporal edges of spatial contrast as the token, while Yamada [79] demonstrated an FS token implementation using 1D arrays and reported on its possible application to traffic flow measurement and monitoring blind corners while driving.

As neuromorphic front end sensors for temporal contrast detection improved, Higgins [76] and Indiveri [71] both adopted multi-chip approaches to motion-estimation, relying on a stand-alone specialized front end sensor for temporal contrast detection, and a separate chip for the FS token algorithm implementation. The multi-chip approach carries the disadvantage of requiring additional power for off-chip communication between the front-end sensor and the motion computation chip. However, moving the in-pixel motion computation circuits to a separate chip reduces pixel size on the front-end chip, allowing a denser pixel array at the front-end.

Barrows [73] implemented a multi-chip token method which combined a microcontroller for postprocessing with a front-end sensor for extracting programmable spatial features. Different spatial features were found to work well for different stimuli, and the combined use of multiple spatial features was used to improve performance. The chips were used to control the rudder of a Micro Aerial Vehicle (MAV) to help it avoid obstacles.

The multi-chip approach also allows signals to be remapped between the front-end sensor and back-end motion computation chip, allowing mapping from cartesian to polar co-ordinates, which can be useful for measuring looming motion fields as Higgins and Indiveri both demonstrated [71], [76]. Furthermore, signals can easily be copied and routed to multiple motion processing chips. Higgins demonstrated such an approach for simultaneously computing motion in cartesian and polar co-ordinates using two motion processing chips in parallel. Higgins also demonstrated how two front-end sensors can feed into a single motion processing chip to compute motion only at a specific disparity (depth in z direction) [76].

Massie [81] presented a combination imager and motion estimation chip for roll, pitch, and yaw estimation. The chip consists of 12 linear 90 pixel arrays (2 for yaw, 2 for pitch, and 8 for roll) relying on the token based FS method. Integrated into the same chip was a 128x128 pixel variable acuity imager capable of providing maximum resolution on objects of interest, while conserving bandwidth by combining pixel responses from “uninteresting” regions.

Ozalevi [82] presented a multi-chip approach which used a separate front end sensor to generate temporal edge tokens, but a low-pass filter was used to convert these tokens back into analog signals which were processed by separate chips implementing Hassenstein-Reichardt and Barlow-Levick models. The low pass filter also serves to create the delays required by these models (see Fig. 5). Thus, an analog implementation of the Hassenstein-Reichardt and Barlow-Levick models is realised, but the intermediate “token” stage serves to normalize signal amplitude, largely removing the dependence on stimulus contrast.

Moeckel [87] presented a linear array relying on the FS token method with improved robustness to noise, allowing the chip to extract motion over 2 decades of speeds.

Bartolozzi [88] recently presented a prototype linear array motion tracking chip which relies on the FS token method using temporal contrast edges as the token. Temporal derivatives are computed as part of the token generation process, but the chip also computes spatial derivatives in parallel. Both these derivatives as well as the motion estimates themselves are fed into a WTA circuit with programmable input weights, allowing the user to track the most salient feature in the array. The programmability of the WTA allows the most salient feature to be defined as a weighted summation of the spatial contrast, temporal contrast, and motion features.

Roubieu [89] presented a 23.3mm × 12.3mm sensor weighing under 1 gram (including optics). The sensor consists of 5 pairs of 1D motion sensors which use tokens to measure the time for a feature to travel between neighbouring locations, similar to the Trigger and Inhibit (TI) token method.

The authors [91] proposed an algorithm in which simple spiking neurons with pre-programmed synaptic delays can be combined with a silicon retina [13] to implement motion sensitive receptive fields. Similarly to Fig. 3a, where a point with motion in 1 spatial dimension traces out a line in a 2D space-time plot, an edge with motion in 2 spatial dimensions will trace out a plane in a 3D space-time plot, with the slope of the plane encoding the local motion velocity. This can be seen as a Reichardt detector, where the inter-pixel delays uniquely describe a plane, although the algorithm still computes on temporal contrast tokens provided by the front end sensor. This approach is elaborated on in Section VI.

In the authors' implementation each neuron is designed to detect the presence of a particular local space-time plane (i.e. a specific inter-pixel delay) but combines responses from 5x5 pixels per motion unit to simultaneously determine both the direction and speed of the normal flow. The responses of multiple motion units are then combined in a second layer of the neural network to attenuate errors due to the aperture problem. However, this algorithm has not yet been implemented on embedded hardware or in real-time.

At a similar time, Benosman [92] proposed an approach which also relies on detection of local planes in data provided by a silicon retina. Instead of using multiple receptive fields tuned to detect different motions (planes), the best fit for a single local plane is mathematically computed, with the normal of the computed fit indicating the normal flow locally. The algorithm runs in Java on a host computer.
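
A minimal sketch of this plane-fitting idea, under stated assumptions, is given below: a plane t = a·x + b·y + c is fit to the event timestamps in a small neighbourhood by least squares, and the normal flow is read off from the plane's spatial gradient. This is only illustrative of the principle; the actual algorithm in [92] includes additional refinement steps, and the neighbourhood size and event format here are assumptions.

```python
import numpy as np

def local_plane_flow(events):
    """Least-squares fit of a plane t = a*x + b*y + c to a small neighbourhood of
    events (x, y, t). The spatial gradient (a, b) of the fitted plane points
    perpendicular to the edge; its inverse magnitude gives the normal-flow speed."""
    x, y, t = (np.asarray(v, dtype=float) for v in zip(*events))
    A = np.column_stack([x, y, np.ones_like(x)])
    (a, b, _), *_ = np.linalg.lstsq(A, t, rcond=None)
    grad_norm = np.hypot(a, b)        # milliseconds per pixel along the normal
    if grad_norm == 0:
        return None                   # degenerate fit (e.g. all events simultaneous)
    speed = 1.0 / grad_norm           # pixels per millisecond
    direction = np.arctan2(b, a)      # direction of the normal flow (radians)
    return speed, direction

# Events from a vertical edge moving at 0.2 px/ms in +x: t = 5*x.
events = [(x, y, 5.0 * x) for x in range(3) for y in range(3)]
print(local_plane_flow(events))       # approximately (0.2, 0.0)
```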

C. Frequency Methods

Others have focused on Adelson-Bergen type models [24] relying on spatiotemporal filtering for motion detection. The first such implementation was reported by Etienne-Cummings [51], using a front-end silicon retina to compute a binary map of spatial edges, thereby providing an input signal of normalized amplitude. Subsequent processing using a multi-chip reconfigurable neural processor implements pairs of spatial and temporal filters to extract the oriented energy at a particular spatiotemporal frequency, thereby implementing an Adelson-Bergen motion unit. The oriented energy from multiple motion units of different frequencies is computed in parallel and their outputs are combined (as described in [51]) to obtain an estimate of image motion.
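
For readers unfamiliar with the motion-energy idea, the heavily simplified sketch below computes opponent Adelson-Bergen style energy for a 1D (space × time) signal using global oriented quadrature filters rather than the localized separable filters of [24] or the hardware filters of [51]; the filter frequencies and the test stimulus are illustrative assumptions.

```python
import numpy as np

def motion_energy(stimulus, f_s=0.1, f_t=0.05):
    """Minimal 1D (space x time) opponent motion-energy unit. Two quadrature pairs
    of oriented spatiotemporal filters tuned to spatial frequency f_s (cycles/pixel)
    and temporal frequency f_t (cycles/ms) are correlated with the stimulus;
    opponent energy = rightward energy - leftward energy."""
    n_t, n_x = stimulus.shape
    t = np.arange(n_t)[:, None]
    x = np.arange(n_x)[None, :]
    def oriented_energy(sign):
        # Quadrature pair: a drifting sinusoid and its 90-degree shifted copy
        even = np.cos(2 * np.pi * (f_s * x + sign * f_t * t))
        odd = np.sin(2 * np.pi * (f_s * x + sign * f_t * t))
        return np.sum(stimulus * even) ** 2 + np.sum(stimulus * odd) ** 2
    return oriented_energy(-1) - oriented_energy(+1)   # > 0 for rightward motion

# A rightward drifting grating gives positive opponent energy.
tt = np.arange(100)[:, None]
xx = np.arange(50)[None, :]
grating = np.cos(2 * np.pi * (0.1 * xx - 0.05 * tt))
print(motion_energy(grating) > 0)   # True
```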

Ozalevli [80], in an approach similar to his implementation of the Hassenstein-Reichardt and Barlow-Levick models, described a multi-chip Adelson-Bergen model. Although successful, this model only implemented a single motion energy unit per pixel, therefore indicating the presence of a preferred motion stimulus and direction without indicating speed.

Modern computing technologies allow for processing on a larger scale than ever before. The authors [93] demonstrated how an FPGA can be used to implement and combine 720 Adelson-Bergen motion energy units per pixel in real-time for a 128x128 pixel array running at 30 FPS.

Fig. 7. Construction of a 3×3 pixel receptive field sensitive to motion of an edge parallel to the y-axis travelling in the positive x-direction with speed vx = 1/5 pixels per millisecond. The location of the edge can be described by x = vx·t, where x is measured in pixels and t is measured in milliseconds. This equation describes a spatiotemporal plane (shown in gray). Red crosses are located where the plane crosses pixel locations and indicate which pixels are expected to respond when (blue circles indicate actual recorded data). The length of the green arrows above each pixel location indicates the synaptic delay for the synapse connecting from that pixel, and the green plane indicates the time at which the neuron would respond to this stimulus.

VI. A SPIKING NEURAL NETWORK FOR VISUAL MOTION ESTIMATION

As mentioned in the previous section, the authors have developed a spiking neural network architecture for visual motion estimation [91] which relies on synaptic delays to create motion sensitive receptive fields.

Discrete temporal contrast tokens from a separate front end sensor [13] are used as input spikes to the architecture, and Leaky Integrate and Fire (LIF) neurons with a linear decay are used for computation. Such neurons are good at detecting temporal coincidence of their inputs, but motion signals are inherently spread over time, as modelled in (5).
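
The sketch below shows one way such an event-driven LIF neuron with linear decay can be written in software: state is touched only when an input spike arrives, and nearly coincident inputs push the membrane over threshold while temporally spread inputs do not. The threshold, decay rate, and weights are illustrative assumptions rather than the values used in [91].

```python
class LinearDecayLIF:
    """Event-driven Leaky Integrate-and-Fire neuron with linear decay.
    State is only updated when an input spike arrives."""
    def __init__(self, threshold=1.0, decay_per_ms=0.1):
        self.v = 0.0
        self.t_last = 0.0
        self.threshold = threshold
        self.decay_per_ms = decay_per_ms

    def spike_in(self, t_ms, weight):
        # Apply linear decay for the time elapsed since the last input spike
        self.v = max(0.0, self.v - self.decay_per_ms * (t_ms - self.t_last))
        self.t_last = t_ms
        self.v += weight
        if self.v >= self.threshold:      # fire and reset
            self.v = 0.0
            return True
        return False

# Three nearly coincident input spikes push the neuron over threshold,
# while the same spikes spread out in time do not.
n = LinearDecayLIF(threshold=1.0, decay_per_ms=0.1)
print([n.spike_in(t, 0.4) for t in (0.0, 1.0, 2.0)])    # [False, False, True]
n = LinearDecayLIF(threshold=1.0, decay_per_ms=0.1)
print([n.spike_in(t, 0.4) for t in (0.0, 10.0, 20.0)])  # [False, False, False]
```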

As stated in [91] and repeated here for convenience, a motion sensitive unit in the architecture relies on the assumption that if we consider a small enough spatial region, a moving edge can be approximated as being a straight edge moving with constant velocity. The equation below shows how a motion stimulus is modelled.

I(x, y, t) = H(x − vx·t)
dI(x, y, t)/dt = δ(t − x/vx)
E(x, y, t) = δ(t − x/vx) · III_1(x) · III_1(y)          (5)

where x and y describe a location on the image plane, I is intensity, t is time (milliseconds), H is the Heaviside step function, vx is the x-component of velocity in pixels per millisecond, δ is the Dirac delta function, dI(x, y, t)/dt is the temporal derivative of image intensity, E(x, y, t) is the sensor output, and III_1 is a sampling comb with period 1 pixel. Multiplying by the sampling combs converts the continuous space signal into a discrete space signal which only has values at integer pixel locations.

Fig. 8. The multilayer architecture detecting the motion of a blue box. (a) shows a set of neurons tuned to detect different speeds and directions of motion. A full set of such neurons is present at every image location. (b) shows multiple image locations as they detect the motion of a stimulus (blue box moving upwards to the left). Black circles show locations of activated neurons, with arrows indicating which neuron (speed and direction) was activated. (c) shows a layer 2 neuron which determines the correct motion by combining layer 1 outputs to alleviate the aperture problem.

Fig. 7 shows how a receptive field sensitive to a specific motion stimulus (in this case a speed of 1/5 pixels per millisecond in the x-direction) can be constructed. The underlying concept relies on using synaptic delays (green arrows) to convert a temporal sequence of spikes (red crosses) into a group of spikes coincident in time (green plane). The delayed spikes serve as input to a LIF neuron, which is good at detecting temporal coincidence of its inputs. In practice there will not be perfect temporal coincidence because the actual spikes received from the front end sensor (blue circles) will not perfectly match the spike times predicted by our model (red crosses).

Lowering the threshold voltage of the LIF neuron will cause it to still respond when its inputs are slightly spread in time, and the threshold value can be used to control how much time spreading can be tolerated before the neuron stops responding.
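
The sketch below illustrates the delay-and-coincide construction of Fig. 7 under stated assumptions: per-pixel synaptic delays are chosen so that tokens from an edge moving at the target velocity all arrive together, and a simple time-window test stands in for the LIF neuron's threshold-based tolerance. The window width, token count, and jitter model are illustrative and not taken from [91].

```python
import numpy as np

def delays_for_velocity(vx_px_per_ms, xs, t_max_ms):
    """Synaptic delays that align tokens from an edge moving at vx: a pixel at
    position x is expected to fire at t = x / vx, so it is delayed by
    t_max - x/vx so that all delayed tokens coincide at t_max
    (the green plane in Fig. 7)."""
    return {x: t_max_ms - x / vx_px_per_ms for x in xs}

def coincidence_response(token_times, delays, window_ms=2.0, min_count=4):
    """Crude stand-in for the LIF neuron: respond if enough delayed tokens land
    within window_ms of each other (the window plays the role of the threshold)."""
    delayed = np.sort([token_times[x] + d for x, d in delays.items() if x in token_times])
    for i in range(len(delayed) - min_count + 1):
        if delayed[i + min_count - 1] - delayed[i] <= window_ms:
            return True
    return False

# Edge moving at 1/5 px/ms over 5 pixels: tokens at t = 5*x, all aligned by the delays.
xs = range(5)
tokens = {x: 5.0 * x + np.random.uniform(-0.5, 0.5) for x in xs}   # jittered sensor tokens
delays = delays_for_velocity(0.2, xs, t_max_ms=20.0)
print(coincidence_response(tokens, delays))   # True (jitter stays within the tolerance window)
```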

Our approach can be seen as a Reichardt detector covering multiple pixels, since it is effectively delaying the signal from neighbouring pixels while the LIF neuron detects multi-pixel correlations in the delayed spikes.

As with the Reichardt detector, multiple detectors (in our case neurons) are required in order to detect different speeds and directions of motion. Our architecture uses 8×8 neurons per pixel location to detect all possible combinations of 8 different directions and 8 different speeds as shown in Fig. 8a. Directions vary from 0 to 315 degrees in steps of 45 degrees, while speeds vary from √2/50 to (√2)^8/50 pixels/ms in factors of √2.

Fig. 9. Motion of a moving bus as detected by the network. The left pane shows temporal change tokens (black for decrease, white for increase, grey for no change). The middle pane shows outputs of Layer 1, which tend to be perpendicular to edges. The right pane shows the output of Layer 2, which uses the data from Layer 1 to detect the actual flow. In each pane only 3.3 ms worth of data is shown. Grayscale values are obtained using the exposure measurement function of the Asynchronous Time-based Image Sensor (ATIS) [13].

The equations in (5) are independent of motion parallel to the edge direction, presenting a form of the aperture problem where only motion perpendicular to an edge (the normal flow) can be detected. An edge moving in a direction perpendicular to its orientation at speed s would look identical to the same edge moving in a direction 45 degrees to its orientation at speed s√2 (see dotted arrows in Figs. 2 and 8), since in both cases the perpendicular component of motion is just s. This relationship gives rise to the √2 factor used between different speeds.

A key feature which sets this work apart from previous token and Reichardt works is the use of a second stage of processing to overcome the aperture problem. The second stage of processing is implemented by a second layer of neurons, with each neuron receiving inputs over a wider spatial region than first layer neurons (Fig. 8c). A layer 2 neuron sensitive to speed s and direction d would incorporate inputs from layer 1 neurons sensitive to the same speed and direction, but also from layer 1 neurons sensitive to speed s/√2 and directions d±45°. This multi-layer approach bears resemblance to gradient based methods such as Lucas-Kanade [22] and Horn-Schunck [23], which compute normal flow locally in a first step before incorporating the normal flow from other nearby locations to more accurately approximate the true optical flow.
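
The tuning grid and the layer 2 pooling rule described above can be written down compactly; the sketch below enumerates the 8 speeds and 8 directions and lists which layer 1 tunings a given layer 2 neuron would pool, under the assumption that pooling follows exactly the s, s/√2, d±45° rule stated in the text (the actual connection weights used in [91] are not specified here).

```python
import math

SPEEDS = [math.sqrt(2) ** (k + 1) / 50 for k in range(8)]   # sqrt(2)/50 ... (sqrt(2))^8/50 px/ms
DIRECTIONS = [45 * k for k in range(8)]                      # 0, 45, ..., 315 degrees

def layer2_inputs(speed_index, direction_index):
    """Layer-1 tunings pooled by a layer-2 neuron tuned to (speed s, direction d):
    same (s, d), plus (s/sqrt(2), d - 45) and (s/sqrt(2), d + 45), reflecting the
    sqrt(2) relation between an edge's true motion and its 45-degree normal flow."""
    s, d = SPEEDS[speed_index], DIRECTIONS[direction_index]
    inputs = [(s, d)]
    if speed_index > 0:                 # s/sqrt(2) only exists above the slowest tuning
        s_diag = SPEEDS[speed_index - 1]
        inputs += [(s_diag, (d - 45) % 360), (s_diag, (d + 45) % 360)]
    return inputs

print([round(s, 3) for s in SPEEDS])    # [0.028, 0.04, 0.057, 0.08, 0.113, 0.16, 0.226, 0.32]
print(layer2_inputs(3, 0))              # approximately [(0.08, 0), (0.057, 315), (0.057, 45)]
```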

Fig. 8 shows the system architecture. At each pixel location there are 8×8 neurons sensitive to different speeds and directions of motion (Fig. 8a), but mutual inhibition ensures only one neuron (shown in black) can respond to a stimulus (produce an output spike) at any time.

Fig. 8b shows a stimulus covering 3×3 pixel locations. Dark circles indicate locations where neurons are responding, with solid arrows indicating the speed and direction selectivity of the neuron responding at that location. Dotted arrows indicate other speeds and directions consistent with the aperture problem as discussed above. Fig. 8c shows a layer 2 neuron determining the correct motion from the inputs it receives from the layer 1 neurons of Fig. 8b.

Fig. 10. Output of the architecture when viewing a rotating spiral. Top images show speed (left) and direction (right) outputs accumulated over multiple rotations. The lower image shows 10 ms of output data while the spiral is spinning. The motion vectors all point outwards from the center, creating a looming field. (Colour legend: speeds of 0.03, 0.04, 0.06, 0.08, 0.11, 0.16, 0.23, and 0.32 pixels/ms; directions of 0 to 315 degrees in steps of 45 degrees.)

Fig. 9 shows actual outputs from each layer for a real world scene of a bus crossing a bridge. Layer 1 outputs tend to be perpendicular to edges, while layer 2 outputs more accurately describe the actual motion of the bus by incorporating data over a larger spatial region.

Fig. 10 presents the output of the architecture for a controlled stimulus consisting of a spinning spiral. The figure shows how it can reliably detect different speeds and directions of motion. The top images show data accumulated over multiple rotations of the spiral. Colour is used to encode speed (left) and direction (right) according to the legend provided above the images. Speed varies as a function of distance to the axis of rotation, while direction varies with angle. Near the axis of rotation, the motion is slower than the slowest receptive field and therefore elicits no responses. The lower image shows the motion architecture's output accumulated over a period of 10 ms. The colour of arrows helps to encode the motion direction, while their length encodes speed. The spiral stimulus used has been superimposed on the image as a yellow dashed line.

Fig. 11 shows another example of the architecture operating on real world data. The top part of the image shows a full scene captured with the ATIS. There are two moving objects in the scene, a rightward moving car in the lower part, and a bird in the top right. The rest of the images in Fig. 11 show motion responses elicited by the flying bird. Inset numbers indicate the time (in milliseconds) at which each image was captured. Each image shows 3.3 ms of motion output data. In the first frame (t = 0) rightward motion of the bird's body is detected, while separate motion is detected for each wing. In the second frame (t = 7 ms) retraction of the left wing is detected, followed by retraction of the right wing at 37 ms, as shown by the opposite motion estimates for each wing while the body continues motion upward to the right. 10 ms later extension of the left wing is detected (red arrows at t = 47 ms). By 90 ms, the bird has returned to a similar pose to that seen at 0 ms. At 117 ms, both wings have been pulled in front of the body, causing motion upwards. After 150 ms, the bird exits the scene.

Fig. 11. An example real world scene (top) consisting of a moving car and a startled bird. The lower subimages show a cropped region around the bird during flight at different points in time. The time in milliseconds is indicated at the top left of each image. Notably, the motion of each wing is detected as well as the motion of the body.

The bird example is particularly tricky for conventional motion estimation techniques due to the bird significantly and rapidly changing its appearance while moving.

The model as presented here has not yet been integrated into a real-time implementation. However, the model is computationally efficient. Neurons are only updated when an input spike arrives, and each neuron update only requires 4 additions, 2 greater-than comparisons, and 1 multiplication. Sparsity of the incoming spikes from the ATIS front-end sensor keeps the required number of operations per second low and easily achievable with commercially available hardware.

The challenge in achieving real-time implementation does not lie in the computation rate, but rather in the implementation of synaptic delays. Incoming spikes must be delayed and stored, which drives up memory requirements. These memory requirements can be reduced by observing that only a small percentage of synapses are active at any one time, so memory can be shared between synapses. However, different synapses have different delays, so a simple First In First Out (FIFO) buffer will not work because the order in which spikes are placed in the buffer will not be the same as the order in which they must be read out.
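
In software, one workaround for this ordering problem is to key the shared buffer on delivery time rather than arrival time, for example with a min-heap, as in the sketch below. This is only an illustration of the idea and not the memory architecture envisaged for an embedded implementation.

```python
import heapq

class DelayLine:
    """Shared event queue for heterogeneous synaptic delays. A FIFO fails because
    delivery order differs from arrival order; a min-heap keyed on delivery time
    always pops the next spike that is due, regardless of when it was pushed."""
    def __init__(self):
        self._heap = []

    def push(self, t_now_ms, synapse_id, delay_ms):
        heapq.heappush(self._heap, (t_now_ms + delay_ms, synapse_id))

    def pop_due(self, t_now_ms):
        """Return all (delivery_time, synapse_id) pairs that are due by t_now_ms."""
        due = []
        while self._heap and self._heap[0][0] <= t_now_ms:
            due.append(heapq.heappop(self._heap))
        return due

dl = DelayLine()
dl.push(0.0, synapse_id=0, delay_ms=20.0)   # arrives first, delivered last
dl.push(1.0, synapse_id=1, delay_ms=5.0)    # arrives second, delivered first
print(dl.pop_due(10.0))                      # [(6.0, 1)]
print(dl.pop_due(25.0))                      # [(20.0, 0)]
```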

VII. FUTURE DIRECTIONS

In the previous sections we have summarized past and present works. In this section we outline directions for future work.

Although great progress has been made, the organisation of computational circuits in artificial approaches still differs significantly from biology. In the retina, neurons are arranged in interconnected layers stacked on top of each other and lying above the photoreceptor layer. Similarly, in visual cortex neurons are arranged in 3D layers, with both short local connections and longer range axonal connections between more distant regions of cortex.

In silicon, photoreceptors and computational circuits mimicking different layers of biological processing are restricted to lying side by side within the same plane, which limits both the photoreceptor size and spacing, which in turn affect the signal strength and spatial resolution respectively. When mapping a 3D biological structure onto 2D silicon, short vertical connections are often mapped to long lateral connections, increasing line capacitance and energy consumption, and occupying valuable space. This is overcome in many artificial implementations by only considering motion in one lateral direction, then stacking circuits in the other lateral direction instead of vertically [3], [53], [55], [58], [62], [63], [74], [77], [88], [94].

As 3D stacked silicon technology matures, it can be used to alleviate the wiring problem, allow for larger photodiode fill-factors, and achieve a more biologically realistic organisation of neural circuits. This compact 3D organisation is typical of insect vision and primate retina; however, the early stages of primate retina and visual cortex are located far from each other, and thus compact integration of cortical circuits and photoreceptors is not necessarily accurate to biology.

Some of the described works have relied on an approach in which a spiking “silicon retina” and neural processing are implemented in separate chips [12], [13] (although 3D integration is useful for both chips). Implementing retinal and cortical processing as two different components provides advantages during system development. First, an improvement in either component can be achieved without re-fabricating the other, and second, data can be recorded as it is transmitted between the two, allowing for in depth off-line analysis which can provide insights into how neural algorithms for processing can be further improved.

The last decade has seen silicon retinae mature to the point where they are now commercially available and used by many labs around the world. A number of works are emerging (see other papers in this special issue) which argue for the superiority of these sensors for high speed visual tasks which must be executed in real-time on a limited power budget. Reconfigurable neural processing platforms are also rapidly maturing, spurred by the dramatic increase in interest and funding the neuromorphic field has experienced in the last few years.

It is an exciting time for the neuromorphic area, with major companies including Qualcomm, Samsung, Intel, and IBM coming on board and launching their own research projects in the field. There are also major projects in the US (the $200 million BRAIN initiative [95]) and Europe (the €1 billion Human Brain Project [96]) which are incorporating neuromorphic aspects. The US Defense Advanced Research Projects Agency (DARPA) has also been taking notice, funding projects such as the Unconventional Processing Of Signals For Intelligent Data Exploitation (UPSIDE) and Systems of Neuromorphic Adaptive Plastic Scalable Electronics (SYNAPSE) projects.

Modern CMOS technology is quite different from biological “wetware”. CMOS typically operates at frequencies ranging from megahertz to gigahertz, while a general rule of thumb is that biological neurons do not operate at frequencies above 1 kHz. This massive speed difference is not necessarily an advantage for silicon. In fact, slowing silicon neural circuits down to biologically realistic time-scales can prove quite challenging, and often requires extra design effort and cost to implement.

Biology leverages parallel processing, and the speed of CMOS can be useful when one wants to approximate multiple parallel units from biology using a single high speed sequential unit in silicon. However, this approach comes at a disproportionate power cost. Higher operating speeds require higher operating voltages, and power scales proportionally to voltage squared. It is therefore preferable to have many low speed, low voltage processors (like biology) than a few high speed, high voltage processors (like modern CMOS). Hence, biology provides a road map for the future, where the scaling of CMOS will allow the realization of ultra-low voltage (hence low-power) circuits performing massively parallel computation in very small and three dimensionally stacked dies. As technology moves in this direction, CMOS can learn about 3D connectivity, massively parallel computation, density of computational elements, and stochastic circuits from biology.

Despite technological improvements, wiring remains an issue in the connectivity which can currently be achieved between artificial neurons. Even though 3D integration can help, inter-neuron connectivity with present technologies is constrained to remain far sparser than in biology. The combined use of silicon circuits and carbon nanotube crossbar arrays has been proposed to improve physical connectivity, with memristor devices capable of learning proposed for use as synaptic connections between nanotubes [97].

The development of new online learning algorithms and architectures, whether relying on memristor devices or conventional silicon, is likely to play an increasingly important role. Using learning through visual experience to help configure and organise a neural architecture can improve fault tolerance (and therefore device yield) and save man hours spent on manual configuration. This learning is especially important if copies of the same device are to adapt to operation under very different visual conditions, such as in urban versus forested environments, or onboard flying versus ground vehicles.

As mentioned in the Introduction, biological sensors are embodied, and have evolved in conjunction with motor systems. The interplay between motor and sensory systems can be useful for sensing. Examples of this include the peering behaviour used by many animals to induce motion parallax for depth perception [98], the optomotor response in insects [99], and the vestibular ocular reflex in humans [40]. Recent studies also suggest that micro-saccades during fixation play an important role in perception, particularly for object recognition in humans [100]. In Drosophila, motion is found to also play a role in motion perception [52].

An embodied biological sensor also serves a particular purpose: to provide information relevant to the agent for self-preservation and meaningful interaction with the environment. The value metric of biological motion estimates is therefore not directly assessed by how accurately motion is perceived, but rather by how motion estimates improve the effectiveness of the agent's behaviour (although to a degree, more accurate motion estimates will be more effective in affecting behaviour).

It is therefore important to keep in mind the intended use of the system being constructed. Inspiration from biology is useful, but at some stage the design must deviate from precise bio-mimicry. A micro-aerial vehicle may benefit from an artificial version of the Drosophila vision system, but for the vehicle to be of value to the operator, it will be expected to execute a goal oriented task rather than simply behave like Drosophila. Also, at some point making a system more biologically accurate will come at a performance cost rather than a benefit due to the inherent differences between silicon circuits and biological neurons. It was Carver Mead who first developed the concept of imitating neural processing in silicon circuits by noting the similarities between the two [101], but it was also Carver Mead who said “Listen to the technology; find out what it's telling you”.

Nevertheless, we are still a long way from matching the power efficient performance of biology in artificial systems, so for the foreseeable future, continued research into bio-inspired visual motion estimation techniques will reap rewards for artificial systems.

REFERENCES

[1] D. N. Lee and H. Kalmus. The optic flow field: The foundation of vision. Philosophical Trans. of the Royal Society of London. B, Biological Sciences, 290(1038):169–179, 1980.

[2] M. V. Srinivasan and S. Zhang. Visual motor computations in insects. Annu. Rev. Neurosci., 27:679–696, 2004.

[3] J. E. Tanner. Integrated optical motion detection. PhD thesis, California Institute of Technology, CA, USA, 1986.

[4] D. Wolpert. The real reason for brains. TED Talks, July 2011.

[5] A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer Publishing Company, Incorporated, 1st edition, 2009.

[6] M. E. Adan, T. Aoyagi, T. E. Holmdahl, T. M. Lipscomb, and T. Miura. Operator input device, Jan 2001. US Patent 6,172,354.

[7] G. B. Gordon, D. L. Knee, R. Badyal, and J. T. Hartlove. Seeing eye mouse for a computer system, Aug 2002. US Patent 6,433,780.

[8] C. A. Mead and J. E. Tanner. Correlating optical motion detector, Dec 1986. US Patent 4,631,400.

[9] D. Honegger, L. Meier, P. Tanskanen, and M. Pollefeys. An open source and open hardware embedded metric optical flow CMOS camera for indoor and outdoor applications. In Robotics and Automation, IEEE Int. Conf. on (ICRA), pages 1736–1741, 2013.

[10] B. A. Olshausen and D. J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.

[11] N. K. Logothetis and B. A. Wandell. Interpreting the BOLD signal. Annu. Rev. Physiol., 66:735–769, 2004.

[12] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128x128 120 dB 15 us latency asynchronous temporal contrast vision sensor. Solid-State Circuits, IEEE J. of, 43(2):566–576, Feb 2008.

[13] C. Posch, D. Matolin, and R. Wohlgenannt. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. Solid-State Circuits, IEEE J. of, 46(1):259–275, Jan 2011.

[14] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen. Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations. Proc. of the IEEE, 102(5):699–716, May 2014.

[15] S. Panzeri, N. Brunel, N. K. Logothetis, and C. Kayser. Sensory neural codes using multiplexed temporal scales. Trends in Neurosciences, 33(3):111–120, 2010.

[16] E. Painkras, L. A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D. R. Lester, A. D. Brown, and S. B. Furber. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. Solid-State Circuits, IEEE J. of, 2013.

[17] G. A. Silva. Neuroscience nanotechnology: progress, opportunities and challenges. Nature Reviews Neuroscience, 7(1):65–74, 2006.

[18] T. N. Wiesel and D. H. Hubel. Single-cell responses in striate cortex of kittens deprived of vision in one eye. J. Neurophysiol., 26(6):1003–1017, 1963.

[19] T. N. Wiesel. The postnatal development of the visual cortex and the influence of environment. Bioscience Reports, 2(6):351–377, 1982.

[20] B. Jähne. Computer vision and applications: a guide for students and practitioners. Academic Press, 2000.

[21] K. Nakayama and G. H. Silverman. The aperture problem: Perception of nonrigidity and motion direction in translating sinusoidal lines. Vision Research, 28(6):739–746, 1988.

[22] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. Proc. DARPA Image Understanding Workshop, pages 121–130, Apr 1981.

[23] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.

[24] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2):284–299, 1985.

[25] P. A. M. Dirac. The Principles of Quantum Mechanics. Oxford University Press, 1981.

[26] R. N. Bracewell. The Fourier Transform and its Applications. McGraw-Hill Education, New York, 1980.

[27] A. Stocker. Analog integrated 2-D optical flow sensor. Analog Integrated Circuits and Signal Processing, 46:121–138, 2006.

[28] C. M. Higgins, V. Pant, and R. Deutschmann. Analog VLSI implementation of spatio-temporal frequency tuned visual motion algorithms. Circuits and Systems I: Regular Papers, IEEE Trans. on, 52(3):489–502, March 2005.

[29] D.-E. Nilsson. Eye ancestry: old genes for new eyes. Current Biology, 6(1):39–42, 1996.

[30] M. F. Land and D.-E. Nilsson. Animal Eyes. Oxford University Press, 2012.

[31] A. Borst, J. Haag, and D. F. Reiff. Fly motion vision. Annual Review of Neuroscience, 33:49–70, 2010.

[32] P. R. Martin and S. G. Solomon. Information processing in the primate visual system. The J. of Physiology, 589(1):29–31, 2011.

[33] M. Joesch, F. Weber, H. Eichner, and A. Borst. Functional specialization of parallel motion detection circuits in the fly. J. of Neuroscience, 33(3):902–905, 2013.

[34] W. R. Taylor and R. G. Smith. The role of starburst amacrine cells in visual signal processing. Visual Neuroscience, 29(01):73–81, 2012.

[35] M. Joesch, B. Schnell, S. V. Raghu, D. F. Reiff, and A. Borst. On and off pathways in Drosophila motion vision. Nature, 468(7321):300–304, 2010.

[36] W. Reichardt. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. Sensory Communication, pages 303–317, 1961.

[37] J. H. Maunsell and D. C. Van Essen. Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiol., 49(5):1127–1147, 1983.

[38] W. T. Newsome and E. B. Pare. A selective impairment of motion perception following lesions of the middle temporal visual area (MT). The J. of Neuroscience, 8(6):2201–2211, 1988.

[39] A. Borst and T. Euler. Seeing things in motion: models, circuits, and mechanisms. Neuron, 71(6):974–994, 2011.

[40] D. E. Angelaki and K. E. Cullen. Vestibular system: the many facets of a multimodal sense. Annu. Rev. Neurosci., 31:125–150, 2008.

[41] B. G. Breitmeyer and L. Ganz. Implications of sustained and transient channels for theories of visual pattern masking, saccadic suppression, and information processing. Psychological Review, 83(1):1, 1976.

[42] A. D. Milner, M. A. Goodale, and A. J. Vingrys. The Visual Brain in Action, volume 2. Oxford University Press, 2006.

[43] K. Hausen. Motion sensitive interneurons in the optomotor system of the fly. Biological Cybernetics, 45(2):143–156, 1982.

[44] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol. (Lond.), 160:106–154, 1962.

[45] H. B. Barlow and W. R. Levick. The mechanism of directionally selective units in rabbit's retina. The J. of Physiology, 178(3):477, 1965.

[46] A. B. Watson and A. J. Ahumada, Jr. Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2(2):322–341, 1985.

[47] C. Fermüller, H. Ji, and A. Kitaoka. Illusory motion due to causal time filtering. Vision Research, 50(3):315–329, 2010.

[48] Y. Weiss, E. P. Simoncelli, and E. H. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, 2002.

[49] E. P. Simoncelli and D. J. Heeger. Representing retinal image speed in visual cortex. Nature Neuroscience, pages 461–462, 2001.

[50] E. P. Simoncelli. Local analysis of visual motion. The Visual Neurosciences, pages 1616–1623, 2003.

[51] R. Etienne-Cummings, J. Van der Spiegel, and P. Mueller. Hardware implementation of a visual-motion pixel using oriented spatiotemporal neural filters. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Trans. on, 46(9):1121–1136, Sep 1999.

[52] N. Franceschini. Small brains, smart machines: From fly vision to robot vision and back again. Proc. of the IEEE, 102(5):751–781, May 2014.

[53] A. G. Andreou, K. Strohbehn, and R. E. Jenkins. Silicon retina for motion computation. In Circuits and Systems, IEEE Int. Symposium on (ISCAS), pages 1373–1376 vol. 3, Jun 1991.

[54] R. Etienne-Cummings, S. A. Fernando, J. Van der Spiegel, and P. Mueller. Real-time 2D analog motion detector VLSI circuit. In Neural Networks, Int. Joint Conf. on (IJCNN), volume 4, pages 426–431, Jun 1992.

[55] T. Horiuchi, J. Lazzaro, A. Moore, and C. Koch. A delay-line based motion detection chip. In Adv. Neural Inf. Proc. Systems, pages 406–412, 1991.

[56] T. Horiuchi, W. Bair, B. Bishofberger, A. Moore, C. Koch, and J. Lazzaro. Computing motion using analog VLSI vision chips: An experimental comparison among different approaches. Int. J. of Computer Vision, 8(3):203–216, 1992.

[57] T. Delbruck. Silicon retina with correlation-based, velocity-tuned pixels. Neural Networks, IEEE Trans. on, 4(3):529–541, May 1993.

[58] R. Sarpeshkar, W. Bair, and C. Koch. Visual motion computation in analog VLSI using pulses. In Advances in Neural Information Processing Systems, volume 5, pages 781–788, 1993.

[59] R. Sarpeshkar, J. Kramer, G. Indiveri, and C. Koch. Analog VLSI architectures for motion processing: From fundamental limits to system applications. In Proc. IEEE, pages 969–987, 1996.

[60] M. Gottardi and W. Yang. A CCD/CMOS image motion sensor. In Solid-State Circuits Conf., Digest of Technical Papers, IEEE Int. (ISSCC), pages 194–195, Feb 1993.

[61] J. Kramer. Compact integrated motion sensor with three-pixel interaction. Pattern Analysis and Machine Intelligence, IEEE Trans. on, 18(4):455–460, Apr 1996.

[62] J. Kramer and C. Koch. Pulse-based analog VLSI velocity sensors. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Trans. on, 44(2):86–101, Feb 1997.

[63] A. J. S. Yakovleff and A. Moini. Motion perception using analog VLSI. Analog Integrated Circuits and Signal Processing, 15(2):183–200, 1998.

[64] X. Arreguit, F. A. Van Schaik, F. V. Bauduin, M. Bidiville, and E. Raeber. A CMOS motion detector system for pointing devices. Solid-State Circuits, IEEE J. of, 31(12):1916–1921, 1996.

[65] R. Etienne-Cummings, J. Van der Spiegel, P. Mueller, and M.-Z. Zhang. A foveated visual tracking chip. In Solid-State Circuits Conf., Digest of Technical Papers, IEEE Int. (ISSCC), pages 38–39, Feb 1997.

[66] R. Etienne-Cummings, J. Van der Spiegel, and P. Mueller. A focal plane visual motion measurement sensor. IEEE Trans. Circuits Syst. II, pages 55–66, 1997.

[67] A. Moini, A. Bouzerdoum, K. Eshraghian, A. Yakovleff, X. T. Nguyen, A. Blanksby, R. Beare, D. Abbott, and R. E. Bogner. An insect vision-based motion detection chip. Solid-State Circuits, IEEE J. of, 32(2):279–284, 1997.

[68] A. Moini. Vision Chips. Springer, 2000.

[69] R. R. Harrison and C. Koch. An analog VLSI model of the fly elementary motion detector. In Advances in Neural Information Processing Systems 10, pages 880–886. MIT Press, 1998.

[70] C. M. Higgins, R. A. Deutschmann, and C. Koch. Pulse-based 2-D motion sensors. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Trans. on, 46(6):677–687, Jun 1999.

[71] G. Indiveri, A. M. Whatley, and J. Kramer. A reconfigurable neuromorphic VLSI multi-chip system applied to visual motion computation. In Proc. of the Seventh Int. Conf. on Microelectronics for Neural, Fuzzy and Bio-inspired Systems (Microneuro'99), pages 37–44, Los Alamitos, CA, April 1999. IEEE Computer Society.

[72] H.-C. Jiang and C.-Y. Wu. A 2-D velocity- and direction-selective sensor with BJT-based silicon retina and temporal zero-crossing detector. Solid-State Circuits, IEEE J. of, 34(2):241–247, Feb 1999.

[73] G. L. Barrows and C. Neely. Mixed-mode VLSI optic flow sensors for in-flight control of a micro air vehicle. Proc. SPIE, 4109:52–63, 2000.

[74] S.-C. Liu. A neuromorphic aVLSI model of global motion processing in the fly. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Trans. on, 47(12):1458–1467, Dec 2000.

[75] V. Pant and C. M. Higgins. A biomimetic VLSI architecture for small target tracking. In Circuits and Systems, Proc. of the Int. Symp. on (ISCAS), volume 3, pages 5–8, May 2004.

[76] C. M. Higgins and C. Koch. A modular multi-chip neuromorphic architecture for real-time visual motion processing. Analog Integr. Circuit. Signal, pages 195–211, 2000.

[77] R. R. Harrison and C. Koch. A robust analog VLSI Reichardt motion sensor. Analog Integrated Circuits and Signal Processing, 24(3):213–229, 2000.

[78] C. M. Higgins and S. A. Shams. A biologically inspired modular VLSI system for visual measurement of self-motion. Sensors J., IEEE, 2(6):508–528, Dec 2002.

[79] K. Yamada and M. Soga. A compact integrated visual motion sensor for ITS applications. Intelligent Transportation Systems, IEEE Trans. on, 4(1):35–42, March 2003.

[80] E. Ozalevli and C. M. Higgins. Multi-chip implementation of a biomimetic VLSI vision sensor based on the Adelson-Bergen algorithm. In Proc. of the Joint Int. Conf. on Artificial Neural Networks and Neural Information Processing, ICANN/ICONIP, pages 433–440, Berlin, Heidelberg, 2003. Springer-Verlag.

[81] M. Massie, C. Baxter, J. P. Curzan, P. McCarley, and R. Etienne-Cummings. Vision chip for navigating and controlling micro unmanned aerial vehicles. In Circuits and Systems, Proc. of the Int. Symp. on (ISCAS), volume 3, pages 786–789, May 2003.

[82] E. Ozalevli and C. M. Higgins. Reconfigurable biologically inspired visual motion systems using modular neuromorphic VLSI chips. Circuits and Systems I: Regular Papers, IEEE Trans. on, 52(1):79–92, Jan 2005.

[83] E. Ozalevli, P. Hasler, and C. M. Higgins. Winner-take-all-based visual motion sensors. Circuits and Systems II: Express Briefs, IEEE Trans. on, 53(8):717–721, Aug 2006.

[84] R. R. Harrison. A biologically inspired analog IC for visual collision detection. Circuits and Systems I: Regular Papers, IEEE Trans. on, 52(11):2308–2318, Nov 2005.

[85] P. A. Shoemaker and D. C. O'Carroll. Insect-based visual motion detection with contrast adaptation. In Defense and Security, pages 292–303. International Society for Optics and Photonics, 2005.

[86] S. Mehta and R. Etienne-Cummings. A simplified normal optical flow measurement CMOS camera. Circuits and Systems I: Regular Papers, IEEE Trans. on, 53(6):1223–1234, June 2006.

[87] R. Moeckel and S.-C. Liu. Motion detection circuits for a time-to-travel algorithm. In Circuits and Systems, IEEE Int. Symp. on (ISCAS), pages 3079–3082, May 2007.

[88] C. Bartolozzi, N. K. Mandloi, and G. Indiveri. Attentive motion sensor for mobile robotic applications. In Circuits and Systems, IEEE Int. Symp. on (ISCAS), pages 2813–2816, May 2011.

[89] F. L. Roubieu, F. Expert, G. Sabiron, and F. Ruffier. Two-directional 1-g visual motion sensor inspired by the fly's eye. Sensors J., IEEE, 13(3):1025–1035, 2013.

[90] A. A. Stocker. Analog VLSI Circuits for the Perception of Visual Motion. John Wiley & Sons, 2006.

[91] G. Orchard, R. Benosman, R. Etienne-Cummings, and N. V. Thakor. A spiking neural network architecture for visual motion estimation. In IEEE Biomedical Circuits and Systems Conf. (BioCAS), pages 298–301. IEEE, 2013.

[92] R. Benosman, C. Clercq, X. Lagorce, S.-H. Ieng, and C. Bartolozzi. Event-based visual flow. Neural Networks and Learning Systems, IEEE Trans. on, 25(2):407–417, Feb 2014.

[93] G. Orchard, N. V. Thakor, and R. Etienne-Cummings. Real-time motion estimation using spatiotemporal filtering in FPGA. In IEEE Biomedical Circuits and Systems Conf. (BioCAS), pages 306–309. IEEE, 2013.

[94] R. Moeckel, R. Jaeggi, and S.-C. Liu. Steering with an aVLSI motion detection chip. In Circuits and Systems, IEEE Int. Symp. on (ISCAS), pages 1036–1039, May 2008.

[95] A. P. Alivisatos, M. Chun, G. M. Church, R. J. Greenspan, M. L. Roukes, and R. Yuste. The brain activity map project and the challenge of functional connectomics. Neuron, 74(6):970–974, 2012.

[96] H. Markram, K. Meier, T. Lippert, S. Grillner, R. Frackowiak, S. Dehaene, A. Knoll, H. Sompolinsky, K. Verstreken, and J. DeFelipe. Introducing the Human Brain Project. Procedia Computer Science, 7:39–42, 2011.

[97] T. Serrano-Gotarredona, T. Prodromakis, and B. Linares-Barranco. A proposal for hybrid memristor-CMOS spiking neuromorphic learning systems. Circuits and Systems Magazine, IEEE, 13(2):74–88, 2013.

[98] E. C. Sobel. The locust's use of motion parallax to measure distance. J. of Comparative Physiology A, 167(5):579–588, 1990.

[99] H. Reichardt. Über die Geschwindigkeitsverteilung in einer geradlinigen turbulenten Couetteströmung. ZAMM-J. of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 36(S1):S26–S29, 1956.

[100] S. Martinez-Conde, S. L. Macknik, and D. H. Hubel. The role of fixational eye movements in visual perception. Nature Reviews Neuroscience, 5(3):229–240, 2004.

[101] C. Mead. Neuromorphic electronic systems. Proc. of the IEEE, 78(10):1629–1636, 1990.


Garrick Orchard received the B.Sc. degree in electrical engineering from the University of Cape Town, South Africa in 2006 and the M.S.E. and Ph.D. degrees in electrical and computer engineering from Johns Hopkins University, Baltimore in 2009 and 2012 respectively. He was named a Paul V. Renoff fellow in 2007 and a Virginia and Edward M. Wysocki, Sr. fellow in 2011, was a recipient of the JHUAPL Hart Prize for Best Research and Development Project, and won the best live demonstration prize at the IEEE BioCAS 2012 conference.

He is currently a Postdoctoral Research Fellow at the Singapore Institute for Neurotechnology (SINAPSE) at the National University of Singapore, where his research focuses on developing neuromorphic vision algorithms and systems for real-time sensing on aerial platforms. His other research interests include mixed-signal very large scale integration (VLSI) design, compressive sensing, spiking neural networks, visual motion perception, and legged locomotion.

Ralph Etienne-Cummings received his B.Sc. in physics, 1988, from Lincoln University, Pennsylvania. He completed his M.S.E.E. and Ph.D. in electrical engineering at the University of Pennsylvania in 1991 and 1994, respectively. He is currently a professor of electrical and computer engineering, and computer science at Johns Hopkins University (JHU). He is the former Director of Computer Engineering at JHU and the Institute of Neuromorphic Engineering. He is also the Associate Director for Education and Outreach of the National Science Foundation (NSF) sponsored Engineering Research Centers on Computer Integrated Surgical Systems and Technology at JHU. He has served as Chairman of the IEEE Circuits and Systems (CAS) Technical Committee on Sensory Systems and on Neural Systems and Applications. He was also the General Chair of the IEEE BioCAS 2008 Conference. He was also a member of the Imagers, MEMS, Medical and Displays Technical Committee of the ISSCC Conference from 1999 to 2006. He is the recipient of the NSF's CAREER and Office of Naval Research Young Investigator Program Awards. In 2006, he was named a Visiting African Fellow and a Fulbright Fellowship Grantee for his sabbatical at the University of Cape Town, South Africa. He was invited to be a lecturer at the National Academies of Science Kavli Frontiers Program in 2007. He has won publication awards including the 2003 Best Paper Award of the EURASIP Journal of Applied Signal Processing and Best Ph.D. in a Nutshell at the IEEE BioCAS 2008 Conference, and has been recognized for his activities in promoting the participation of women and minorities in science, technology, engineering and mathematics. His research interests include mixed signal VLSI systems, computational sensors, computer vision, neuromorphic engineering, smart structures, mobile robotics, legged locomotion and neuroprosthetic devices.