
Positioning and Scene Analysis from Visual Motion Cues within Water Column

S. Negahdaripour and M.D. Aykin
ECE Department
University of Miami
Coral Gables, FL 33146
(nshahriar{m.aykin}@{u}miami.edu)

M. Babaee
Department of Computer Science
Technical University of Munich
Munich, Germany
([email protected])

S. Sinnarajah and A. Perez
Gulliver Preparatory
Pinecrest, FL 33155
(sinn022{pere051}@students.gulliverschools.org)

Abstract—Over the last dozen or more years, many applications of vision-based positioning and navigation near the sea bottom and surface have been explored. Mid-water operations have primarily relied on traditional positioning systems, namely INS, DVL, gyros, etc. This paper investigates the application of a vision system for mid-water operations by exploiting stationary features within the water column. The unique nature of this environment, namely the abundance of randomly distributed targets over a wide field of view and range of depth, is ideal for the application of well-known motion vision methods for 3-D motion estimation and scene analysis. We demonstrate through experiments with water tank and ocean data how various visual motion cues may be used for passive navigation, environmental assessment, and target/habitat classification based on visual motion behavior.

I. INTRODUCTION

The ability to determine one's position and (or) achieve point-to-point navigation with precision is an important capability in the operation of robotic and (or) sensor platforms underwater. Subsea platforms have traditionally deployed various sensors, including gyros, DVL, INS, etc. Many such positioning systems have their advantages, but also limitations and drawbacks. For example, off-the-shelf sensors are cheap, but are slow, inaccurate, and subject to significant drift. More precise sensors are often pricey and (or) bulky, making them impractical for deployment on small low-cost platforms. The shortcomings, mainly inaccuracy and drift issues, can be reduced to some extent by fusing information (e.g., integrating and fusing estimates from a number of sensors of the same or different modalities).

Over the last 15 years, applications of computer vision techniques exploiting the visual cues in underwater (video) imagery have provided new approaches, with many being implemented through integration with information from traditional devices (e.g., [6], [9]–[11], [14]–[18]). More precisely, it has been shown that a vehicle's trajectory and (or) motion can be estimated by tracking a number of stationary environmental/target features in image sequences. The visual cue comprises either the correspondences among the positions of a number of features over the imaged scene surfaces or the apparent image-to-image variations and optical flow. The key advantage of the computer vision techniques is that they can work well under scenarios where auxiliary positioning systems perform worse, mainly when the platform undergoes slow motions and drifts. This is because these other devices typically estimate motion and (or) position by integrating the vehicle's acceleration or velocity, which cannot be determined with accuracy at low speeds. In contrast, visual motion can be determined with the same accuracy more or less independent of the motion size. For example, if the motion is too small to be reliably detected in a video sequence, every N-th frame may be processed.

A positioning method employing visual motion estimation also suffers from the drift problem, which becomes significant over an extended period of operation, particularly where the frame-to-frame motions, as the main source of information, are integrated to determine the vehicle's position and (or) trajectory. However, fusion with auxiliary devices that measure instantaneous orientation (e.g., pitch/roll sensor, magnetometer) can significantly enhance accuracy [15]. Therefore, vision-based systems can complement and enhance the performance of an integrated positioning system.

Earlier work has primarily explored near-bottom or near-surface operations, developing the capabilities that enable: 1) mapping of benthic habitats, e.g., reefs as well as shipwrecks, in the form of large-area photo-mosaics, e.g., [9], [11], [14], [17], [18]; 2) autonomous positioning and local navigation for inspection of man-made structures (e.g., ship hulls, bridge pilings, and off-shore oil structures) [16]. Here, the surfaces of the target scene(s) to be imaged, mapped, documented, and (or) inspected offer abundant visual cues for establishing feature tracks and correspondences, the fundamental problem for self-motion detection and estimation.

In this paper, we investigate the potential application of well-known visual motion estimation methods in support of operations within the water column. Here, the stationary suspended particles within the water column play the role of points on natural and man-made object surfaces (e.g., surface texture of natural objects, marine growth, structural features and markings on ship hulls), and can actually provide much stronger and often ideal visual cues for the application of the vision-based techniques. The advantages come from the fact that 1) these particles are randomly distributed; 2) they extend over the entire sphere of viewing directions; and 3) they lie at a large range of distances from the camera (vehicle).


These advantages can be captured and utilized effectively by one or more cameras that cover a relatively large field of view, while employing various intelligent image capture strategies to enhance image contrast and information content, and to simplify the fundamental feature matching problem. To identify and discard non-stationary objects from the computations, a formulation based on well-known robust estimation methods can be readily employed (e.g., RANSAC [7]).

We should emphasize that most, if not all, of the methods and underlying technical material discussed in the paper have been previously applied in some terrestrial-domain application. The main contribution of this paper is to demonstrate (perhaps in contrast to common belief) that the water column is an environment very rich in visual motion cues, often more so than near the sea floor or surface. Thus, it is very suitable for applying visual motion/stereo methods for 3-D scene analysis and reconstruction, as well as environmental assessment and target/habitat classification based on motion behavior and the size and distribution of suspended particles.

II. TECHNICAL BACKGROUND

The coordinates of a 3-D point P in the optical camera coordinate system at some reference (e.g., initial) position are denoted P = [X, Y, Z]^T. For convenience, we often make use of homogeneous coordinates, represented by P̂ = λ[X, Y, Z, 1]^T. The image of P, formed by the perspective projection model, has coordinates p = (x, y, f):

\[ \mathbf{p} = \frac{f}{Z}\,\mathbf{P} \tag{1} \]

where f is the effective focal length of the camera, and Z is the so-called depth of P. Using homogeneous coordinates, it is convenient to express the perspective projection model in the linear form

\[ \mathbf{p} \simeq \mathbf{C}\,\hat{\mathbf{P}}; \qquad \mathbf{C} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2} \]

where ≃ denotes equality up to scale. The real coordinates p (typically in [mm] units) are related to the computer coordinates representing the (column, row) position pc = (c, r, 1) in an image as follows:

\[ \mathbf{p}_c = \begin{bmatrix} -1/s_x & 0 & c_x \\ 0 & -1/s_y & c_y \\ 0 & 0 & 1/f \end{bmatrix}\mathbf{p} = \mathbf{M}\,\mathbf{p} \tag{3} \]

where (s_x, s_y) [mm] are the horizontal and vertical pixel sizes, and (c_x, c_y) [pix] are the coordinates of the image center. The internal camera parameters (s_x, s_y, c_x, c_y, f) can be determined by calibration, and are necessary to transform from image measurements to 3-D world measurements. When only the images from an uncalibrated camera are available (e.g., as for the ocean data in our experiments), the 3-D information can be determined up to a projective transformation only.
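
As a concrete illustration of (1)-(3), the following sketch (with hypothetical intrinsic values rather than actual calibration results, and illustrative function names) projects a 3-D point and converts between the real image coordinates and the computer (column, row) coordinates:

```python
import numpy as np

# Hypothetical internal parameters; placeholders only, not calibration results.
f = 8.0                      # effective focal length [mm]
sx, sy = 0.01, 0.01          # horizontal/vertical pixel size [mm]
cx, cy = 320.0, 240.0        # image center [pix]

def project(P):
    """Perspective projection p = (f/Z) P of a 3-D point P = (X, Y, Z), eq. (1)."""
    X, Y, Z = P
    return np.array([f * X / Z, f * Y / Z, f])

def image_to_pixels(p):
    """Component-wise form of the image-to-computer coordinate mapping in (3)."""
    x, y, _ = p
    return cx - x / sx, cy - y / sy          # (column, row)

def pixels_to_image(c, r):
    """Inverse mapping, needed to go from pixel measurements back to image coordinates."""
    return np.array([(cx - c) * sx, (cy - r) * sy, f])

P = np.array([0.5, -0.2, 3.0])               # hypothetical 3-D point in the camera frame
c, r = image_to_pixels(project(P))
print(round(c, 1), round(r, 1))
```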

The transformation between the coordinate systems at any two viewing positions can be expressed in terms of 3 translation parameters t and 3 rotation parameters, expressed in the form of a 3×3 rotation matrix R satisfying the orthogonality constraint RR^T = R^T R = I (I: 3×3 identity matrix). The point P in the new view has coordinates P′:

\[ \mathbf{P}' = \mathbf{R}\,\mathbf{P} + \mathbf{t} \tag{4} \]

and maps onto the new position p′:

\[ \mathbf{p}' \simeq \mathbf{C}'\,\hat{\mathbf{P}}; \qquad \mathbf{C}' = \mathbf{C}\,[\mathbf{R}\,|\,\mathbf{t}] \tag{5} \]

Thus, we obtain

\[ x' = \frac{\mathbf{c}'_1 \cdot \hat{\mathbf{P}}}{\mathbf{c}'_3 \cdot \hat{\mathbf{P}}}, \qquad y' = \frac{\mathbf{c}'_2 \cdot \hat{\mathbf{P}}}{\mathbf{c}'_3 \cdot \hat{\mathbf{P}}} \tag{6} \]

where c′_i (i = 1, ..., 3) denote the rows of C′. The displacement vector v = p′ − p is the image motion of the 3-D point P. For small motions, it can be represented by [5]

\[ \mathbf{v} = \mathbf{v}_r + \mathbf{v}_t = \mathbf{A}_r\,\boldsymbol{\omega} + \frac{1}{Z}\,\mathbf{A}_t\,\mathbf{t} \tag{7} \]

\[ \mathbf{A}_r = \begin{bmatrix} xy/f & -(f + x^2/f) & y \\ f + y^2/f & -xy/f & -x \\ 0 & 0 & 0 \end{bmatrix}, \qquad \mathbf{A}_t = \begin{bmatrix} -f & 0 & x \\ 0 & -f & y \\ 0 & 0 & 0 \end{bmatrix} \tag{8} \]

where the 3-D vectors ω = (ω_x, ω_y, ω_z)^T and t = (t_x, t_y, t_z)^T are the rates of rotational and translational motion, and v_r = (v_rx, v_ry, 0)^T and v_t = (v_tx, v_ty, 0)^T are the rotational and translational components of the image motion, respectively. It readily follows that only the translational component v_t encodes information (in terms of the depth Z) about the scene structure, that is, the relative 3-D positions of spatial features. There is no structural cue in the absence of translation. Furthermore, the image motion v remains unchanged if Z and t are scaled by the same constant k. This well-known scale-factor ambiguity of monocular vision confirms that translational motion and scene structure can be determined up to a scale only. Without loss of generality, the depth of a particular feature can be fixed (as unit length), and then the translational motion and all other feature positions can be expressed in terms of this distance.
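
The following sketch evaluates the displacement model (7)-(8) for a hypothetical feature and motion, and verifies the scale-factor ambiguity numerically; the focal length and sample values are illustrative only:

```python
import numpy as np

f = 8.0   # effective focal length (illustrative, image-coordinate units)

def A_r(x, y):
    """Rotational flow matrix of eq. (8)."""
    return np.array([[x * y / f, -(f + x**2 / f),  y],
                     [f + y**2 / f, -x * y / f,   -x],
                     [0.0,           0.0,          0.0]])

def A_t(x, y):
    """Translational flow matrix of eq. (8)."""
    return np.array([[-f, 0.0, x],
                     [0.0, -f,  y],
                     [0.0, 0.0, 0.0]])

def image_motion(x, y, Z, omega, t):
    """v = v_r + v_t = A_r omega + (1/Z) A_t t, eq. (7)."""
    v_r = A_r(x, y) @ omega
    v_t = A_t(x, y) @ t / Z
    return v_r + v_t, v_r, v_t

omega = np.array([0.01, -0.02, 0.005])   # (pitch, yaw, roll) rates, hypothetical
t = np.array([0.1, 0.0, 0.5])            # translation rate (up to scale), hypothetical
v, v_r, v_t = image_motion(x=1.0, y=-0.5, Z=20.0, omega=omega, t=t)

# Scaling Z and t by the same constant leaves v unchanged: the monocular
# scale-factor ambiguity discussed above.
v_scaled, _, _ = image_motion(x=1.0, y=-0.5, Z=40.0, omega=omega, t=2 * t)
assert np.allclose(v, v_scaled)
```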

III. MOTION ESTIMATION

In monocular vision, the estimation of the translation is limited to its direction due to the well-known scale-factor ambiguity. Without loss of generality, we typically determine the direction of translation t̂ = t/‖t‖.


A. Translation Estimation

When the camera/vehicle simply translates in the water column, the image motion vectors simplify to

\[ \mathbf{v} = \mathbf{v}_t = \frac{1}{Z}\,\mathbf{A}_t\,\mathbf{t} = \frac{1}{Z}\begin{bmatrix} -f & 0 & x \\ 0 & -f & y \\ 0 & 0 & 0 \end{bmatrix}\mathbf{t} \tag{9} \]

The displacement vectors intersect at a common point:

\[ \mathbf{x}_{foe} = \begin{bmatrix} x_{foe} \\ y_{foe} \\ f \end{bmatrix} = \frac{f}{t_z}\,\mathbf{t}, \qquad t_z \neq 0 \tag{10} \]

The point x_foe is known as the focus of expansion/contraction (FOE/FOC) for forward/backward motion. Given that the camera translation is along the FOE/FOC vector, the estimation of the translation vector reduces to that of locating the FOE [4], [8], [12], [13]. Whether the motion is forward or backward can be readily established by whether the image motion vectors point away from or towards the FOE/FOC. If t_z = 0, then the image motion vectors become parallel to the direction (t_x, t_y, 0), with the FOE/FOC (intersection point) moving towards infinity.

The process involves first determining the computer coordinates (r_foe, c_foe) of the FOE based on the image motion of a few stationary features, transforming to the image coordinates (x_foe, y_foe) based on the camera calibration parameters, and finally establishing the direction of motion from t̂ = (x_foe, y_foe, f). It is noted that we require the camera internal parameters; otherwise the estimation is limited to the location of the FOE/FOC in terms of computer coordinates.
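
A minimal least-squares sketch of the FOE localization step is given below; it is not the authors' implementation (dedicated FOE estimators are developed in [4], [8], [12], [13]), and the function name estimate_foe and synthetic data are ours for illustration:

```python
import numpy as np

def estimate_foe(pts, vecs):
    """Least-squares intersection of the image-motion lines of stationary features.

    pts: (N, 2) feature positions; vecs: (N, 2) their displacement vectors,
    both in the same (image or computer) coordinate frame.
    """
    # Each stationary feature constrains the FOE to lie on the line through its
    # position along its displacement:  n_i . (foe - p_i) = 0, n_i = (v_y, -v_x).
    n = np.column_stack([vecs[:, 1], -vecs[:, 0]])
    b = np.sum(n * pts, axis=1)
    foe, *_ = np.linalg.lstsq(n, b, rcond=None)
    return foe

# Synthetic check: features expanding away from a known FOE (pure forward motion).
true_foe = np.array([310.0, 255.0])
rng = np.random.default_rng(0)
pts = rng.uniform(0, 640, size=(30, 2))
vecs = 0.05 * (pts - true_foe)
print(estimate_foe(pts, vecs))           # recovers ~[310, 255]
```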

It readily follows from (9) that the up-to-scale depth of each feature can be determined from its image displacement. A solution based on a least-squares formulation is given by [5]

\[ Z = \frac{v_{tx}(x - x_{foe}) + v_{ty}(y - y_{foe})}{\|\mathbf{v}_t\|^2}, \qquad t_z \neq 0 \tag{11} \]

Alternatively, we can use

\[ Z = \frac{(x - x_{foe})^2 + (y - y_{foe})^2}{v_{tx}(x - x_{foe}) + v_{ty}(y - y_{foe})}, \qquad t_z \neq 0 \tag{12} \]

For t_z = 0, where the FOE is at infinity, we simply use the up-to-scale solution for the FOE direction in the image motion equation.
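
A corresponding sketch of the up-to-scale depth recovery of (11), with illustrative variable names and synthetic values, is:

```python
import numpy as np

def depth_from_foe(pt, v_t, foe):
    """Up-to-scale depth of a stationary feature from its translational displacement, eq. (11)."""
    dx, dy = pt[0] - foe[0], pt[1] - foe[1]
    return (v_t[0] * dx + v_t[1] * dy) / (v_t @ v_t)      # valid for t_z != 0

foe = np.array([310.0, 255.0])
pt = np.array([400.0, 300.0])
v_t = 0.05 * (pt - foe)                  # synthetic displacement of a feature
print(depth_from_foe(pt, v_t, foe))      # relative depth (here 20), up to the global scale
```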

B. Pure Rotation

The pure rotation of the camera can be readily determined from

\[ \mathbf{v} = \begin{bmatrix} xy/f & -(f + x^2/f) & y \\ f + y^2/f & -xy/f & -x \\ 0 & 0 & 0 \end{bmatrix}\boldsymbol{\omega} \;\approx\; \begin{bmatrix} 0 & -f & y \\ f & 0 & -x \\ 0 & 0 & 0 \end{bmatrix}\boldsymbol{\omega} \tag{13} \]

where the approximation is valid for cameras with an average field of view (say, up to about 50-60 [deg] in water), since x << f and y << f over most of the image, but not for larger fields of view (e.g., in the periphery for a fish-eye lens). The above equation comprises two linear constraints in terms of the three unknown rotational motion components ω. The image motions at a minimum of two feature points are sufficient to compute the rotational motion. A single point is sufficient if the rotation is limited to the pitch and roll of the vehicle/camera. Furthermore, the approximation is not necessary for the motion computation. It simply allows for the observation that the image motion induced by the pitch and yaw motion of the camera is roughly constant for features in the central region of the image (where x << f and y << f).
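
A least-squares sketch of the rotation estimation from (13) follows; the focal length and synthetic feature data are illustrative, and the exact (unapproximated) rotational flow matrix is used:

```python
import numpy as np

f = 8.0   # illustrative focal length

def rotation_rows(x, y):
    """First two rows of the exact rotational flow matrix in (13)."""
    return np.array([[x * y / f, -(f + x**2 / f),  y],
                     [f + y**2 / f, -x * y / f,   -x]])

def estimate_rotation(pts, vecs):
    """Least-squares rotation rate from two or more feature displacements."""
    A = np.vstack([rotation_rows(x, y) for x, y in pts])
    b = np.ravel(vecs)
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega

# Synthetic check with a known rotation and zero translation.
true_omega = np.array([0.01, -0.02, 0.005])
pts = np.array([[1.0, 0.5], [-0.8, 1.2], [0.3, -0.9]])
vecs = np.array([rotation_rows(x, y) @ true_omega for x, y in pts])
print(estimate_rotation(pts, vecs))      # recovers ~[0.01, -0.02, 0.005]
```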

C. Arbitrary Motion

The water column, as in deep-space imaging [1], [2], is an ideal environment in which to readily estimate arbitrary camera motions with good accuracy. Here, objects extend over a large depth range within the f.o.v., with some points lying at "infinite" distance from the camera (Z → ∞). Examination of (7) reveals that their image motions consist solely of the rotation component. Identifying such points and using their image motions, we first estimate ω, compute the rotation field v_r over the entire image, and subtract it out, with the remainder giving the displacements due to camera translation. The method in Section III-A can then be applied to compute the camera translation.

An alternative multi-step strategy can be adopted by noting that, over most of the central part of the image, the pitch and yaw motions ω_x and ω_y, respectively (scaled by f >> x and f >> y), contribute more significantly to the image feature displacements than the roll ω_z does; the latter is significant primarily in the image periphery (larger x and y). This can be exploited to identify suitable distant features in the central region of the image (undergoing nearly constant image displacements) for the computation of the pitch and yaw components. Subtracting out their induced image motion (v_rx, v_ry) = (−fω_y, fω_x), the remaining rotation ω_z can be determined by utilizing distant features in the image periphery. Next, by subtracting the induced image motion (v_rx, v_ry) = (yω_z, −xω_z), we can finally compute the translational components from the image motion vectors that intersect at the FOE/FOC (corresponding to stationary scene features).
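
A sketch of this multi-step strategy is given below. It assumes that the distant central and peripheral features have already been identified, uses the approximate rotational field of (13), and reuses the estimate_foe() routine sketched in Section III-A; the function and variable names are ours, not the authors':

```python
import numpy as np

f = 8.0   # illustrative focal length

def derotate_multistep(central_pts, central_vecs, periph_pts, periph_vecs,
                       all_pts, all_vecs):
    # Step 1: pitch/yaw from distant central features, where v ~ (-f*wy, f*wx).
    mean_v = central_vecs.mean(axis=0)
    wy, wx = -mean_v[0] / f, mean_v[1] / f

    # Step 2: roll from distant peripheral features after removing the
    # pitch/yaw contribution; the residual is ~ (y*wz, -x*wz).
    resid = periph_vecs - mean_v
    num = np.sum(resid[:, 0] * periph_pts[:, 1] - resid[:, 1] * periph_pts[:, 0])
    den = np.sum(periph_pts[:, 0]**2 + periph_pts[:, 1]**2)
    wz = num / den

    # Step 3: subtract the (approximate) rotational field of (13); the remainder
    # is the translational field, whose FOE gives the translation direction.
    rot_field = np.column_stack([-f * wy + all_pts[:, 1] * wz,
                                  f * wx - all_pts[:, 0] * wz])
    trans_field = all_vecs - rot_field
    return (wx, wy, wz), trans_field      # feed trans_field to estimate_foe()
```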

D. Moving Objects

Moving targets have image motions that are different, typically both in magnitude and direction, from those of stationary objects, which are constrained to move towards/away from the FOE/FOC. By computing the camera motion and removing the induced image motion, each independent object's motion can be analyzed.

A particular implementation involves the following steps:

• Camera Motion Estimation:
1) Utilize distant features in the central image region to estimate the pitch and yaw components. Compute and subtract out the image displacement due to these two components.
2) Estimate the roll ω_z (rotation about the optical axis) from the distant features in the periphery, and subtract out its contribution to the image motion.
3) Compute the translational motion components using a number of features with image motion displacements that intersect at a common point (FOE/FOC).

• Depth Estimation for Stationary Features: Points with image motion magnitude ‖v_t‖ below a certain threshold are considered as points at infinity. The depths of the other stationary features (with image motion vectors v_t intersecting at the FOE) are computed from (11).

• Moving Object Detection: The image motion of each object, if stationary, must go through the FOE. This constraint is violated when the object has a non-zero motion. Thus, we can identify and segment out these objects in order to analyze their motion behaviors.

At first glance, it appears that we are haunted by the chicken-and-egg nature of motion estimation (by utilizing stationary objects) based on identifying stationary objects (by discriminating between moving and stationary features). However, the nature of the water-column environment, namely the presence of stationary targets that are randomly distributed over a large range of distances, comes to our rescue. In particular, steps 1 and 2 are critical: locating some (minimum of 2) stationary targets (at infinity) with image motions induced by camera rotation. It goes without saying that when these points do not move in the image, there is no camera rotation. We can classify objects based on their image motions and employ a RANSAC-based implementation: we randomly select 2 points for the two-step process to compute the camera rotation, and perform image de-rotation to be left with translation-induced image motions only. If the majority of motion vectors intersect at a common point (FOE/FOC), then the estimated rotation is accurate (and so is the translational motion based on the FOE). Otherwise, we repeat with a different sample.
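
The following RANSAC-style sketch illustrates this procedure, assuming the estimate_rotation(), rotation_rows(), and estimate_foe() helpers sketched earlier; the thresholds and iteration count are illustrative, not values from the paper:

```python
import numpy as np

def ransac_motion(pts, vecs, n_iter=200, inlier_tol=2.0, rng=None):
    """Separate rotation and translation by hypothesizing two 'points at infinity'."""
    rng = rng or np.random.default_rng()
    best = None
    for _ in range(n_iter):
        # Hypothesize two distant points whose image motion is purely rotational.
        idx = rng.choice(len(pts), size=2, replace=False)
        omega = estimate_rotation(pts[idx], vecs[idx])

        # De-rotate all displacements; stationary features should then point
        # towards/away from a common FOE/FOC.
        rot = np.array([rotation_rows(x, y) @ omega for x, y in pts])
        trans = vecs - rot
        foe = estimate_foe(pts, trans)

        # Count features whose de-rotated displacement is radial about the FOE.
        d = pts - foe
        cross = np.abs(trans[:, 0] * d[:, 1] - trans[:, 1] * d[:, 0])
        inliers = cross / (np.linalg.norm(d, axis=1) + 1e-9) < inlier_tol
        if best is None or inliers.sum() > best[0]:
            best = (inliers.sum(), omega, foe, inliers)
    return best   # (inlier count, rotation rate, FOE, inlier mask)
```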

IV. FEATURE MATCHING

The correspondence problem is the primary complexity of feature-based motion and (or) stereo methods for 3-D motion estimation and scene reconstruction. Many existing techniques (e.g., SIFT) incorporate the estimation of a homography (projective transformation), or an affine or similarity transformation, to map points from one view to the next, in order to confine the search for the correct match to the vicinity of the projected position. To determine the transformation, a robust estimation method is applied. For example, RANSAC uses a random sample from some initial matches, assuming they comprise 50% or more inliers.

These transformations are valid for simple motion models (e.g., camera rotation) and (or) points/objects lying roughly on a single plane. With randomly distributed features within the water column at different distances, many such planes can be defined, each passing through a small number of features; however, there is typically no dominant plane. Taking any one plane and treating the features (nearly) lying on it as the inliers for matching, these are significantly outnumbered by outliers, leading to no more than a handful of correct matches. Other complexities arise from significant variations in feature appearances.

In our problem, certain imaging strategies can be applied to simplify the correspondence problem: 1) A camera with a small depth of field is focused at some desired distance d. Thus, only a smaller number of features (namely, those at and around distance d) appear in focus, simplifying the matching problem. 2) The camera is set at a low shutter speed (longer exposure time), allowing the feature dynamics to be recorded in one frame. The relatively continuous feature tracks on a blurred-out background can be utilized to establish feature matches. Both strategies are also useful in facilitating the identification of distant objects, which enables the estimation of the rotational motion.

V. EXPERIMENTS

We present the results of various experiments on two data sets: 1) calibrated water-tank data with small spherical targets of different sizes, showing motion estimation and estimation of the camera trajectory; 2) uncalibrated ocean data, used to demonstrate different technical aspects of this contribution.

A. Tank Scene

We simulate the water-column environment with a water-tank scene comprising several small spherical balls of different sizes at varying distances from the camera.

The data consist of a total of 23 images, taken along 3 parallel tracks. On the first track, the camera moves away from the scene over 8 images. After translating right to the next track (roughly parallel to the scene), the camera then moves forward over the next 8 images. Following a NE-direction motion (forward and right) to reach the next track, the movement is backwards over the last 7 images. The motions are about 5 [in] along the track and 2.5 [in] from one track to the next. As selected results, we have depicted 3 sample images in Fig. 1(a), from the beginning and end of the first track, and the middle of the last track. In (a), we have also superimposed in one image the estimated "relative depth" of the features (in meters) computed from (11) (recall that we can only compute relative depth from monocular motion cues). We have visually confirmed the consistency with the relative spatial arrangements of the features.

In (a), we have also shown the features (crosses) with their matches from the next frame (circles), which establish the frame-to-frame displacements. The red lines depict the directions of the image displacement vectors, which (roughly) intersect at a common point; this being near the image center, the camera motion is determined to be mostly in the forward/backward direction. For the cases where the camera moves from one track to the next (middle image), the image motion vectors are (roughly) parallel. The errors are significant for distant targets with low SNR (small image motion relative to the inaccuracies of feature localization in one of the two views).


Fig. 1. (a) Three selected images with features (+) and match positions (o) from the next view. The red lines are the image motion directions, showing intersection at the FOE (encircled +). These lines are parallel for side-way motion, where the camera moves from one track to the next. (b) Reconstructed camera trajectory for the water-tank experiment.

These errors do not impact the results, since we apply a RANSAC-based implementation to reject the outlier image motion lines.

For completeness, we have estimated the camera trajectory by integrating the frame-to-frame motions; see (b). As stated, we can determine the translational motion and the distances of scene targets up to a scale only (from monocular cues). However, once the initial motion scale is set (here we use our knowledge of the motion size for the experiment), the scaling can be established for subsequent images. A small motion in the y direction along each track is not unusual, since the estimated camera motions (and track positions) are expressed in the camera coordinate system at the initial position. With a slight camera tilt (downward in this case), the positions along the trajectory will retain a non-zero Y component. Based on the estimated positions, this tilt is roughly tan⁻¹(2/35) ≈ 3 [deg], small enough not to be noticed without precise calibration.

VI. OCEAN DATA

Fig. 2 shows the results from various experiments with one ocean data set. This was recorded by a camera mounted on the side of a submersible platform as it moves through the water column. Therefore, the dominant camera motion is sideways, with some occasional rotational effects due to changes in vehicle pitch and heading (yaw).

In (a), we have depicted two images that are 9 frames apart, starting at frame number 600 in the sequence. The circles show some features, and the diamonds are their positions in the next image (a'). The image motions are defined by the vectors between them. Next, adding all 11 intermediate frames, as depicted in (a"), we can readily recognize the parallel motion tracks of these stationary features, all pointing in the NW direction (dots showing positions of the features from two earlier views mark the start and end of each track). Because the image in (a") comprises many parallel tracks, the image gradients are perpendicular to these contours, and the large energy in the gradient direction can be readily observed in the Fourier domain (frequency content of this image). This property is exploited by the frequency-based methods for optical flow computation; see [3]. The Fourier transform of (a"), depicted in (a"'), confirms this fact. The motion vector directions, and consequently the vehicle motion, can be deduced from either (a) and (a'), (a"), or (a"').

In (b), we have shown the same results for a different part of the sequence, frames 1170 to 1180, where the camera motion is more horizontal (the image motion vectors are nearly horizontal). Comparing the motion in these two cases, (a-a"') and (b-b"'), we deduce that the vehicle moves at a higher pitch angle in the former case.

The image in (c) is the first view of a sequence defined by frame numbers 1210 and 1216.


Fig. 2. Various motion computation results. (a-a"') Pure camera translation with various stationary features. (b-b"') Pure camera translation with 4 stationary and 3 moving targets. (c,c') Features in one frame (circle) and matches from the other frame (dot) during combined translation and rotation of the camera, with 8 integrated in-between frames. (d,d") Both views with features in each frame (circle) and the other frame (dot), with (d) also showing the image motion vectors. (d') Rotation-compensated view, with new feature positions (circle), matches (dot), and image motion lines intersecting at a finite FOE. Features 1 and 2 are distant points used for estimation of the rotation (inducing a displacement of roughly (-29,-5) pixels).

Circles depict features in each view, while dots are the matches from the other view. The vehicle undergoes rolling motion while changing heading to make a turn. This induces camera pitch and yaw motions (vehicle roll becomes camera pitch, and heading change becomes camera yaw).


Fig. 3. Another sequence (a,a'), with image displacements comprising a rotation component of roughly (-25,-5) pixels (b), and the displacements after rotation compensation (b").


Fig. 4. Various water-column objects and tracks depicting their speed and motion behavior.


The integration of every frame from 1210 to 1216 in (c'), showing the feature tracks, depicts the rotational motion. With motion comprising both rotation and translation, the image motion vectors no longer intersect at a common point (the FOE), as depicted in (d). Using the distant points 1 and 2, we have estimated the image motion of (-29,-5) pixels due to the rotation. After rotation compensation, we construct the image in (d'), which is simply translated relative to the first view. The image motions now intersect at a finite FOE. Note that points 1 and 2 become stationary after "rotation stabilization."

Fig. 3 is another example from nearby frames (1200 and 1205), with image motions containing both rotational and translational components; see (a). The squares show the positions of each feature, the dots are the matches from the other view, and the dashed lines are the motion vectors. The integrated image, including all 6 frames from the sequence in (b"), depicts the feature tracks. We have marked 3 selected "points at infinity" in the central region of the image, where the image motion is dominantly induced by the pitch and yaw camera motions. From these points, we estimate the induced image displacement of roughly (-25,-5) pixels. After compensating for the estimated rotation, the new image in (a") depicts parallel image motions intersecting at infinity (corresponding to camera motion parallel to the scene); this view is obtained by shifting the entire image by the constant image motion due to rotation. Distant features, shown by cyan diamonds in (a"), now have nearly zero motion after rotation compensation.

In Fig. 4, we have depicted 3 integrated views starting from frames 129 (80 frames), 370 (55 frames), and 1362 (10 frames) of the sequence. From the Fourier transform (the large central portion is depicted to visually exaggerate it), one can still deduce the average translation direction of the vehicle (which is perpendicular to the oriented yellow blob near the center). Furthermore, the motion tracks of the different water-column habitats provide rich cues about their motion behaviors and speeds.

VII. SUMMARY

Environmental features, namely suspended particles and habitats, can be exploited for vision-based positioning and navigation within the water column. While any imaging system with sensitivity to emitted energy from particular suspended particles can be employed, the visual motion cues in optical images have been studied here. We have aimed to demonstrate that such environments can be better suited for motion vision techniques than most near-bottom or near-surface applications, which have been previously explored.

The development of a robust vision system can benefit from the effective use of active imaging, including lighting and imaging sensor arrays that are tuned to specific visual motion cues, e.g., focus at different ranges, shutter speed variation, directional lighting, etc. We are currently investigating a particular design for real-time processing. Furthermore, while we have focused on optical imaging in the visible band, the same advantages extend to other imaging devices that can record any form of emitted energy from the natural water-column particles.

ACKNOWLEDGMENT

We are deeply grateful to Drs. Scott Reed and Ioseba Tena from SeaByte Ltd, Edinburgh, Scotland (UK), who provided the ocean data. Murat D. Aykin is a first-year Ph.D. student supported by the Department of Electrical and Computer Engineering at the University of Miami. Mohammadreza Babaee is an M.S. student at the Technical Univ. of Munich, carrying out his dissertation at the Underwater Vision and Imaging Lab (UVIL), University of Miami. He is funded in part by research Grant No. 2006384 from the US-Israel Binational Science Foundation (BSF). Shayanth Sinnarajah, who completed his junior year at Gulliver Preparatory in June '11, is completing his third academic year of research internship at UVIL. Alejandro Perez, who graduated from Gulliver Preparatory in June '11 (attending Harvard University in Fall '11), has done roughly one month of summer internship at UVIL during the last 3 summers.

REFERENCES

[1] http://www.spacetelescope.org/news/heic0701/
[2] http://www.dailygalaxy.com/my weblog/2009/04/78-billion–a-h.html
[3] S.S. Beauchemin and J.L. Barron, "The computation of optical flow," ACM Computing Surveys, 27(3), September 1995.
[4] A. Branca, E. Stella, G. Attolico, and A. Distante, "Focus of expansion estimation by an error backpropagation neural network," Neural Computing and Applications, pp. 142-147, 1997.
[5] A.R. Bruss and B.K.P. Horn, "Passive navigation," Computer Vision, Graphics, and Image Processing (CVGIP), 21(1), pp. 3-20, 1983.
[6] N. Gracias, M. Mahoor, S. Negahdaripour, and A. Gleason, "Fast image blending using watersheds and graph cuts," Image and Vision Computing, 27(5), pp. 597-607, 2009.
[7] M.A. Fischler and R.C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Comm. of the ACM, vol. 24, pp. 381-395, June 1981.
[8] R. Jain, "Direct computation of the focus of expansion," IEEE Trans. Pattern Analysis and Machine Intelligence, 5(1), pp. 58-64, January 1983.
[9] M. Ludvigsen, B. Sortland, G. Johnsen, and H. Singh, "Applications of geo-referenced underwater photo mosaics in marine biology and archeology," Oceanography, 20(4), pp. 140-149, 2007.
[10] R.L. Marks, H.H. Wang, M.J. Lee, and S.M. Rock, "Automatic visual station keeping of an underwater robot," Proc. IEEE Oceans '94, 1994.
[11] R.L. Marks, S.M. Rock, and M.J. Lee, "Real-time video mosaicking of the ocean floor," IEEE J. Oceanic Engineering, 20(3), pp. 229-241, 1995.
[12] S. Negahdaripour and B.K.P. Horn, "A direct method for locating the focus of expansion," Computer Vision, Graphics, and Image Processing, 46(3), pp. 303-326, June 1989.
[13] S. Negahdaripour, "Direct computation of FOE with confidence measures," Computer Vision and Image Understanding, 64(3), pp. 323-350, November 1996.
[14] S. Negahdaripour, X. Xu, and A. Khamene, "A vision system for real-time positioning, navigation and video mosaicing of sea floor imagery in the application of ROVs/AUVs," IEEE Workshop on Applications of Computer Vision, pp. 248-249, 1998.
[15] S. Negahdaripour, C. Barufaldi, and A. Khamene, "Integrated system for robust 6 DOF positioning utilizing new closed-form visual motion estimation methods for planar terrains," IEEE J. Oceanic Engineering, 31(3), pp. 462-469, July 2006.
[16] S. Negahdaripour and P. Firoozfam, "An ROV stereovision system for ship hull inspection," IEEE J. Oceanic Engineering, 31(3), pp. 551-564, July 2006.
[17] T. Nicosevici, N. Gracias, S. Negahdaripour, and R. Garcia, "Efficient 3D scene modeling and mosaicing," J. Field Robotics, 26(10), pp. 757-862, October 2009.
[18] Y. Rzhanov, L.M. Linnett, and R. Forbes, "Underwater video mosaicing for seabed mapping," Int. Conference on Image Processing, 2000.