
International Journal of Computer Information Systems and Industrial Management Applications. ISSN 2150-7988 Volume 3 (2011) pp. 846–861. © MIR Labs, www.mirlabs.net/ijcisim/index.html

A Framework for Real-Time Scene Modelling based on Visual-Inertial Cues

Dominik Aufderheide 1,2 and Werner Krybus 1

1 South Westphalia University of Applied Sciences, Division Soest,
Institute for Computer Science, Vision and Computational Intelligence (CV&CI)
Luebecker Ring 2, 59494 Soest, Germany
{aufderheide, krybus}@fh-swf.de

2 The University of Bolton, School of the Built Environment and Engineering
Deane Road, Bolton BL3 5AB, U.K.
[email protected]

Abstract: The self-acting generation of three-dimensional models by analysing monocular image streams from standard cameras is one fundamental problem in the field of computer vision. A prerequisite for the scene modelling is the prior computation of camera poses for the different frames of the sequence. Several techniques and methodologies have been introduced during recent decades to solve this classical Structure from Motion (SfM) problem, which incorporates camera egomotion estimation and subsequent recovery of scene structure. Nevertheless, the applicability of those systems in real-world devices and applications is still limited due to unsatisfactory properties in terms of computational costs, accuracy and robustness. This paper suggests a novel framework for visual-inertial scene reconstruction (VISrec!) based on ideas from multi-sensor data fusion (MSDF). The integration of additional modalities (here: inertial measurements) is useful to compensate for typical problems of systems which rely only on visual information.
Keywords: Structure from Motion (SfM), Inertial sensing, Dynamic World Modeling (DWM), Multi-Sensor Data Fusion (MSDF), Kalman Filter

I. Introduction

The automatic generation of three-dimensional models has been one of the fundamental problems in computer vision for decades. Even though methods for image-based modelling have already been introduced, the usage of active 3D scanners is still the dominant technology in this field. Thus it is highly desirable to use monoscopic image streams, which can be captured by standard digital camera devices, as a basis for non-invasive scene modelling. Those cameras are available at low cost and are easy to handle even for a single non-professional user. The simultaneous estimation of the camera motion and the scene structure is widely known as the Structure from Motion (SfM) problem, for which numerous solutions have been proposed and implemented. Even if the potentials of those methods are worthy for many different application fields, such as Augmented Reality (AR), robot navigation or Unmanned Vehicles (UV), the applicability in real-world applications is limited due to

some unsolved issues. One of these drawbacks is the missing ability to run those classical SfM systems in real-time due to high computational costs (see [41]) or necessary batch-type computations¹, as for classical factorisation methods (see [45]). On the other hand, most SfM methods suffer from missing robustness of the feature detection and tracking procedures which generate the necessary input data for the recovery of shape and motion. In [6] numerous problems (namely: occlusions, depth discontinuities, low texture, repetitive patterns, etc.) of image registration are listed and analysed in the context of stereo matching. All of these problems are also relevant for SfM, and many algorithms suffer from non-robust feature registration between subsequent frames of a monocular image stream. Also [43] stated that even the tracking of a subset of features is unstable in nature. Thus two different problems have to be solved by SfM methods: on the one hand the inaccurate localisation of matches, and on the other hand a non-negligible number of completely wrong matches (outliers). Many algorithms are also restricted to constrained types of camera movements or only a subset of possible scene types. In this context, especially the necessity to reinitialise the whole system if the feature track is lost once within the sequence is a major drawback for manually operated camera systems.
In the field of mobile robotics, novel methodologies and concepts for simultaneous localisation and mapping (SLAM) have recently crossed the border to real-time processing. So [21] presented the MonoSLAM approach, which is based on a single camera mounted on a moving robot. Similar ideas were used in the parallel tracking and mapping approach (PTAM) suggested by [33].
Due to these problems with classical vision-based SfM methodologies, and inspired by recent developments in mobile robotics in general and SLAM in particular, the general

¹ Batch-type methods are composed in such a way that the whole image sequence has to be available for the estimation of camera egomotion and scene structure. In those systems structure and motion are recovered simultaneously by solving a large-scale optimisation problem.



concept of aided Structure from Motion (aSfM) was developed. Due to the fact that the estimation of camera egomotion is a major step for 3D scene recovery, the state of the art for Integrated Navigation Systems (INS) is a major influence for the suggested visual-inertial approach to aSfM.
The integration of visual and inertial information was recently proposed as a methodology for full six degree of freedom (6DoF) tracking of an object's egomotion (position and orientation) for AR applications (see [29]), navigation of UVs or mobile robot navigation [32]. Research in the field of SfM by combining visual and inertial cues is an open topic, since recently published work lacks the ability for real-time operation [19], modelling of unconstrained, dense scene reconstruction or rapid sensor movements. This paper describes a multi-modal approach for aSfM incorporating visual measurements from a single standard camera and inertial measurements from gyroscopes, accelerometers and magnetometers. The main focus lies on a naive implementation based on a two-track architecture which consists of several fusion nodes for MSDF. This paper is an extended version of the work presented in [4] with the same title.
The remainder of this paper is organised as follows: Section II gives a general introduction into the motivation for using visual-inertial cues for aSfM. Section III covers the conceptual design of the system's general architecture. In this context the idea of a two-track design is introduced and described in detail. The subsequent Sections IV and V cover the implementation of the processing of the inertial and the visual measurements, respectively, while Section VI gives an idea about the interfaces between the visual and the inertial route of the system. Finally Section VII concludes the work and gives an overview of intended work packages for future research.

II. Visual-Inertial Scene Reconstruction (VISrec!) - A Motivation

Any kind of sensor measurement is uncertain, and the physical property which should be determined can only be estimated with a limited level of confidence. Especially for optical measurement systems there are many possible sources of error besides the typical random noise. Some of the typical problems of relying only on images for estimating the motion of a camera during the acquisition of a sequence were already mentioned in Section I.
The general concept of MSDF has been successfully applied, for example, in the field of mobile robotics in recent years (see e.g. [37, 50]). One reason for the attention MSDF has gained in a wide branch of applications and scientific disciplines is the fact that a sound mathematical and formal background has been developed since the mid 1990s. In this context the works of [39, 12, 24, 34, 13] are examples of particular overviews and surveys.
The major objective of applying MSDF in the field of SfM and 3D modelling is the compensation, or at least attenuation, of the described drawbacks of classical SfM methods. So the incorporation of the inertial modalities should improve

the overall system performance² in terms of:

• Temporal coverage - Typical frame rates of an image processing system lie between 5 and 50 frames per second, so an update of the camera's egomotion is only available every 20 to 200 ms.

• Accuracy - The recovered scene structure is determined from a previously estimated relative camera pose, which is itself based on feature correspondences in successive frames of a sequence that are corrupted by noise and other uncertainties (see [2]). The accuracy of typical SfM methods is therefore limited.

• Certainty - Typically the certainty of SfM algorithms is mainly influenced by the quality of the used homologous image features. Here especially the handling of outliers is an important aspect, because all the desired information is directly related to the quality of the used matches.

• Computational costs - All states of the system (motion, observed scene structure) are not directly measured by a vision system, but have to be recovered from image data by adequate algorithms. As mentioned before, the corresponding computational complexity often leads to the lack of real-time capability.

Besides these specific objectives of the MSDF approach there are also general targets which are indirectly derived from the disadvantages of currently available 3D scanning devices, such as high costs, missing mobility and time-consuming measurements. Thus the final system should mainly integrate standard low-cost components into a mobile, easy-to-operate device.
As suggested by [39], the implementation of an MSDF system which relies on fusion across sensors (see [59]) starts with a conceptual design based on a prior identification of adequate additional modalities and information channels. For this we follow the classification of relational sensor properties as given in [22, 10]. Table 1 gives an overview of the sensor-sensor relationships between visual and inertial measurements and clarifies the adequateness of inertial cues towards the realisation of an aSfM system which is able to fulfil the objectives defined above.³

As indicated in Table 1, there is an asynchronous property of the different sensors which should lead to an increased temporal coverage of the overall system. This reduces the danger of wrong or inaccurate feature matching, because the stability of the feature tracking is influenced in a positive manner by the higher update rate of a possible motion prediction step, which is especially important for constant velocity (CV) or constant acceleration (CA) motion models (see [53] for a definition and description of motion models in feature tracking). In this context the robustness of the feature tracking can be increased. The heterogeneous characteristic leads to a higher coverage of possible motion patterns of the camera. Furthermore the redundancy of the involved signals provides the possibility to achieve a higher accuracy of the

² The different categories are based on the definition of a generic notion of the qualified gain of a data fusion process in [10].

³ The table is taken from a former publication of the author given in [5].



Table 1: Relational properties of visual and inertial sensing

Visual sensing | Inertial sensing | Property
Sensing spatial derivatives of order 0 (position) | Sensing spatial derivatives of order 1 (gyroscopes - angular velocities) and order 2 (accelerometers - translational accelerations) | Complementary
Long-term estimation for slow and smooth motion | Short-term estimation for rapid and unpredicted movements | Heterogeneous
Operating frequency: 5-30 Hz | Operating frequency: 50-1000 Hz | Asynchronous
Pose estimation from corresponding image features between successive frames | Gyroscopes: attitude estimation from integrated rotational velocities; Accelerometers: attitude estimation (roll and pitch) from the gravitational field; Magnetometers: attitude estimation from sensing the earth's magnetic field | Redundant

motion estimate, and as a consequence the accuracy of the reconstruction of the scene can be increased as well. Thus the integration of inertial measurements into a visual system is an adequate way to compensate for typical drawbacks of optical SfM.
Besides these specific characteristics, the recent developments in the field of Micro-machined Electro-Mechanical Systems (MEMS) make it possible to integrate inertial sensors such as accelerometers, gyroscopes or magnetometers into a handheld device at low cost. At this point it should be mentioned that the usage of MEMS sensory units incorporates a bunch of problems caused by immense drifting errors and variable biases. Those problems will be covered in Section IV in detail.
The integration of visual and inertial information was recently proposed as a methodology for full six degrees of freedom (6 DoF) tracking of an object's pose (including position and orientation) for Augmented Reality (AR) applications (see [28]), navigation of unmanned vehicles (UV) or mobile robot navigation ([32]). Research in the field of SfM by combining visual and inertial cues was recently done by [35], [18] and [17], but none of these systems can be regarded as a complete solution for on-the-fly 3D scene modelling in real-time.
The following section of this work describes the conceptual design of the proposed VISrec! system based on actual definitions from MSDF.

III. VISrec!-architecture

The first milestone in the development of a complete implementation of a VISrec! system is the conceptual design and realisation of a dual-track architecture based on ideas presented in [18, 35]. Here two separate tracks (visual and inertial routes) are considered as almost independent fusion nodes. This strategy allows the generation of two different subsystems, each integrating only one specific form of data source (inertial or visual measurements). By this it is possible to compare the performances of the separate stages in a first step independently from each other, and in a second step to establish different interfaces between both subsystems

or to add another fusion node.
The following subsection describes the dual-track architecture as a parallel fusion network by following the definitions from [39], before the last subsection describes the used hardware prototype in detail.

A. Parallel fusion network

The dual-track architecture we suggest is mainly influenced by the works of [35, 18], but we choose a formulation more closely related to the scheme of MSDF. Here each track is considered as a fusion cell as suggested by [39]. So the easiest representation of the dual-track system would be the structure shown in Figure 1. It can be seen that each fusion cell (FC) has at least one input which collects all sensory data. Here the inertial fusion cell (IFC) collects data from a 9-DoF inertial measuring unit which consists of a 3-DoF accelerometer unit, a 3-DoF gyroscope unit and a 3-DoF magnetometer. The visual fusion cell (VFC) collects the frames from a single camera. FCs also have additional inputs for auxiliary information (AI), which can be derived from other sources in the network, and external knowledge (EK). External knowledge collects all those additional data sources which are known a priori and help to derive the higher level of abstraction which should be delivered at the output of each FC.

Figure 1: Dual-track system design in a representation as a parallel arrangement of fusion cells (the inertial fusion cell (IFC) collects the 3-DoF accelerometer, gyroscope and magnetometer data, the visual fusion cell (VFC) collects the camera frames, and both feed a visual-inertial fusion cell (IVFC); each cell has additional AI and EK inputs)

The suggested structure of course contains different interfaces between the VFC and the IFC, which are indicated in Figure 1 by the connections between the outputs of the two FCs and the AI inputs of the opposite FC.
The output of the IFC will contain information about the camera's movements at a higher granularity than the raw input values (e.g. the camera pose). The VFC will deliver scene structure estimates. Both signals can be collected in an additional visual-inertial fusion cell (IVFC) which realises a final refinement of structure and motion. Noteworthy, there is of course the possibility that the separate FCs also contain sub-FCs for the realisation of their function.
Details about the implementation of the IFC and the VFC can be found in Sections IV and V respectively, while ideas for the realisation of the IVFC are collected in Section VI.



B. Hardware prototype

The current hardware platform used for the implementation of the shown architecture consists of a visual-inertial sensory unit built from a greyscale Unibrain Fire-i digital camera and a 9-DoF inertial unit, as shown in Figure 2.

Figure 2: Hardware prototype of the visual-inertial sensory unit

The inertial unit is inspired by the standard configuration of a multi-sensor orientation system (MODS) as defined in [49]. The used system consists of an LY530AL single-axis gyro and an LPR530AL dual-axis gyro, both from STMicroelectronics, which measure the rotational velocities about the three main axes of the inertial coordinate system ICS (see Figure 2). The accelerations for translational movements are measured by a triple-axis accelerometer ADXL345 from Analog Devices. Finally a 3-DoF magnetometer from Honeywell (HMC5843) is used to measure the earth's magnetic field. All IMU sensors are connected to a microcontroller (ATmega328) which is responsible for initialisation, signal conditioning and communication. The data from all sensors are transferred from the MODS to a standard PC. The digital camera is connected to the PC by using a standard Firewire interface (IEEE 1394).
The whole implementation of the different FCs is realised on the standard PC, as described in the subsequent sections.

IV. Inertial Fusion Cell (IFC)

The inertial route contains all the steps which are necessary to determine the position and orientation of the MODS (which is rigidly attached to the camera). As already indicated in Section III-B, the used MODS consists of three orthogonally arranged accelerometers measuring a three-dimensional acceleration a^b = [a_x a_y a_z]^T, normalised with the gravitational acceleration constant g. Here b indicates the actual body coordinate system in which the entities are measured. In addition, three gyroscopes measure the corresponding angular velocities ω^b = [ω_x ω_y ω_z]^T around the sensitivity axes of the accelerometers. Also, magnetometers with three perpendicular sensitivity axes are used to sense the earth's magnetic field m^b = [m_x m_y m_z]^T.
Classical approaches to inertial navigation are stable-platform systems which are isolated from any external rotational motion by specialised mechanical platforms. In comparison to those classical stable-platform systems, the MEMS sensors are mounted rigidly to the device (here: the camera).

Figure 3: Computational elements of an INS (the conditioned accelerometer, gyroscope and magnetometer signals a^b, ω^b, m^b of the MODS are used for orientation estimation, coordinate transformation with C_b^n, gravity correction and position estimation, starting from an initial orientation, position and speed)

In such a strapdown system it is necessary to transform the measured quantities of the accelerometers into a global coordinate system by using known orientations computed from the gyroscope measurements. In general, the mechanisation of a strapdown inertial navigation system (INS) can be described by the computational elements indicated in Figure 3.
The main problem with this classical framework is that the location is determined by integrating measurements from gyros (orientation) and accelerometers (position). Due to superimposed sensor drift and noise, which is not negligible especially for MEMS devices, the errors of the egomotion estimation tend to grow unbounded. Besides that, there is a danger of ambiguities during the initialisation of the initial conditions. It was shown, e.g. by [16], that a combination with magnetometers can help to reduce the drift error.
The calibration of IMUs can be realised by moving the IMU with specialised mechanical platforms or industrial robots to known orientations with precisely controlled accelerations and rotational velocities. This provides a possibility for the determination of calibration parameters for a given sensor model and allows a signal correction.
So the final framework for pose estimation considers two steps: an orientation estimation and a position estimation, as shown in Figure 4. In terms of FCs the whole procedure can again be described as a sub-network of FCs which are located inside the inertial fusion cell of the overall system design, as indicated in Figure 1. In comparison to the classical strapdown mechanisation as described e.g. in [51, 60], the approach suggested here also incorporates the accelerometers for orientation estimation. The suggested fusion network is visualised in Figure 4; the different sub-fusion processes are described in subsections IV-A and IV-B.

A. Fusion for orientation

The estimation of the orientation of the MODS is realised in most approaches based only on information from the magnetometer and the gyroscopes. The simplest approach is an implementation based on a single integrator. Due to the fact that the MEMS implementation of the gyroscopes suffers from an immense drift error, such a system is only stable for short-term sequences.



Figure 4: System design of the inertial fusion cell (IFC): the 3-DoF accelerometer, gyroscope and magnetometer data are combined in a fusion-for-orientation cell (OFC) and a fusion-for-position cell (PFC), which deliver orientation, speed and position

Figure 5 gives an indication of the accumulated drift error over time: on the left-hand side a comparison between the true and the determined angle (here: roll) is shown, and on the right-hand side the corresponding residual. It can easily be seen that the error accumulates over time.

Figure 5: Drift error of the gyroscope measurements (left: gyro-integrated roll angle vs. real angle; right: gyro error over the sampling time in units of 0.01 s)

The general idea for compensating the drift error of the gyroscopes is to use the accelerometer as an additional attitude sensor for generating redundant information. Due to the fact that the 3-DoF accelerometer measures not only (external) translational motion but also the influence of gravity, it is possible to calculate the attitude based on the single components of the measured acceleration. This is of course only true if no external force is accelerating the sensor. So there are two questions which have to be answered: 1. How is it possible to calculate the attitude from accelerometer measurements? and 2. How can external translational motion be handled? Both problems can be solved by following a two-stage switching behaviour inspired by the work presented in [47]. At this point it should be pointed out that the measurements from the accelerometers can only provide the roll and pitch angle; the heading angle has to be derived by using the magnetometer instead.
Figure 6 illustrates the geometrical relations between the measured accelerations due to gravity and the roll and pitch angle of the attitude. From this it follows that the angles can be determined by the following relations:

θ = arctan2(a_x, √(a_y^2 + a_z^2))    (1)

φ = arctan2(a_y, √(a_x^2 + a_z^2))    (2)

The missing heading angle can be recovered by using the readings from the magnetometer and the already determined roll and pitch angles. Here it is important to consider that the measured components of the earth's magnetic field have to be transformed to the local horizontal plane (tilt compensation). Figure 7 indicates the corresponding relations, as shown in [16]:

Figure 6: Geometrical relations between the measured accelerations due to gravity and the roll and pitch angles of the attitude

X_h = m_x·cφ + m_y·sθ·sφ − m_z·cθ·sφ
Y_h = m_y·cθ + m_z·sθ
ψ = arctan2(Y_h, X_h)    (3)
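As an illustration of Equations 1-3, the following minimal Python sketch computes roll and pitch from a gravity-dominated accelerometer reading and a tilt-compensated heading from the magnetometer. It is not taken from the authors' implementation; the axis conventions and signs are assumptions that depend on the mounting of the MODS.

```python
import numpy as np

def attitude_from_accelerometer(a):
    """Attitude angles from an accelerometer reading a = [ax, ay, az] (Eqs. 1 and 2).

    Only valid while the sensor is not externally accelerated, i.e. ||a|| is close to 1 g.
    """
    ax, ay, az = a
    theta = np.arctan2(ax, np.sqrt(ay**2 + az**2))
    phi = np.arctan2(ay, np.sqrt(ax**2 + az**2))
    return theta, phi

def tilt_compensated_heading(m, theta, phi):
    """Heading psi from a magnetometer reading m = [mx, my, mz] projected onto the
    local horizontal plane using the previously determined angles (Eq. 3)."""
    mx, my, mz = m
    xh = mx * np.cos(phi) + my * np.sin(theta) * np.sin(phi) - mz * np.cos(theta) * np.sin(phi)
    yh = my * np.cos(theta) + mz * np.sin(theta)
    return np.arctan2(yh, xh)
```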

Based on these findings, a discrete Kalman filter bank (DKF bank) is implemented which is responsible for the estimation of all three angles of Ξ. For the pitch and the roll angle the same DKF architecture is used, as indicated in Figure 8-(a). In comparison to that, the heading angle is estimated by an alternative architecture as shown in Figure 8-(b).
All DKFs are mainly based on the classical structure of a Kalman filter (see [12]), which consists of a first prediction of the states and a subsequent correction, where the two states are the unknown angle ξ and the bias of the gyroscope b_gyro. The Kalman filtering itself is composed of the following classical steps, whereat the descriptions are simplified to a single angle ξ.

1) Computation of an a priori state estimate x^-_{k+1}

As already mentioned, the hidden states of the system are x = [ξ, b_gyro]^T. The a priori estimates are computed from the following relations:

ω̄_{k+1} = ω_{k+1} − b_{gyro,k}
ξ^-_{k+1} = ξ_k + ∫ ω̄_{k+1} dt
b^-_{gyro,k+1} = b_{gyro,k}    (4)

Here the actual measurements from the gyroscopes ω_{k+1} are corrected by the currently estimated bias b_{gyro,k} from the former iteration, before the actual angle ξ^-_{k+1} is computed.

2) Computation of the a priori error covariance matrix P^-_{k+1}

The a priori covariance matrix is calculated by incorporating the Jacobian matrix A of the states and the process noise covariance matrix Q_k as follows:

P^-_{k+1} = A · P_k · A^T + Q_k    (5)

Figure 7: The local horizontal plane as a reference for the tilt compensation (the magnetometer readings are projected onto X_h and Y_h using the roll and pitch angles)



Figure 8: (a) Discrete Kalman filter (DKF) for the estimation of the roll and pitch angles based on gyroscope and accelerometer measurements, including an external acceleration detector; (b) DKF for the estimation of the yaw (heading) angle from gyroscope and magnetometer measurements with tilt compensation

The two steps 1) and 2) are the elements of the prediction step as indicated in Figure 8.

3) Computation of the Kalman gain K_{k+1}

As a prerequisite for computing the a posteriori state estimate, the Kalman gain K_{k+1} has to be determined by following Equation 6:

K_{k+1} = P^-_{k+1} · H^T_{k+1} · (H_{k+1} · P^-_{k+1} · H^T_{k+1} + R_{k+1})^{-1}    (6)

4) Computation of the a posteriori state estimate x^+_{k+1}

The state estimate can now be corrected by using the calculated Kalman gain K_{k+1}. Instead of incorporating the actual measurements as in the classical Kalman structure, the suggested approach is based on the computation of an angle difference Δξ. The difference is a comparison of the angle calculated from the gyroscope measurements and the corresponding attitude as derived from the accelerometers, respectively the heading angle from the magnetometer, as already introduced at the beginning of this section. So the relation for x^+_{k+1} can be formulated as:

x^+_{k+1} = x^-_{k+1} − K_{k+1} · Δξ    (7)

At this point it is important to consider the fact that the attitude measurements from the accelerometers are only reliable if there is no external translational motion. For this reason an external acceleration detection mechanism is also part of the fusion procedure, and the following condition (see [47]) is evaluated continuously:

‖a‖ = √(a_x^2 + a_y^2 + a_z^2) ≟ 1    (8)

If the relation is fulfilled, there is no external acceleration and the estimation of the attitude from the accelerometers is more reliable than the one computed from the rotational velocities provided by the gyroscopes. Noteworthy, for real sensors an adequate threshold ε_g is introduced to define an allowed variation from this ideal case. If the camera is not at rest, the observation variance for the gyroscope data σ_g^2 is set to zero. So, by incorporating the magnitude of the acceleration measurements ‖a‖ and the earth's gravitational field g = [0, 0, −g]^T, the observation variance can be defined by Equation 9:

σ_g^2 = { σ_g^2,   if ‖a‖ − ‖g‖ < ε_g
          0,       otherwise }    (9)

A similar approach is chosen to overcome the problems with the magnetometer measurements in magnetically distorted environments for the DKF for the heading angle. Instead of gravity g, the magnitude of the earth's magnetic field m is evaluated, as shown in the following relation⁴:

σ_g^2 = { σ_g^2,   if ‖m‖ − m_des < ε_m
          0,       otherwise }    (10)

5) Computation of the a posteriori error covariance matrix P^+_{k+1}

Finally the error covariance matrix is updated in the following way:

P^+_{k+1} = P^-_{k+1} − K_{k+1} · H_{k+1} · P^-_{k+1}    (11)
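To make the filter cycle concrete, the following Python sketch runs the prediction and correction steps of Equations 4-11 for a single angle ξ with the state x = [ξ, b_gyro]. It is only an illustration under assumed noise parameters, not the authors' implementation; in particular, the external-acceleration test of Equations 8 and 9 is realised here by inflating the observation variance of the accelerometer-derived angle, which has the same practical effect of ignoring the correction when the camera is not at rest.

```python
import numpy as np

class SingleAngleDKF:
    """Discrete Kalman filter for one angle xi and the gyro bias b_gyro (x = [xi, b_gyro])."""

    def __init__(self, dt, q_angle=1e-4, q_bias=1e-6, r_obs=1e-2, eps_g=0.1):
        self.dt = dt
        self.x = np.zeros(2)                          # state estimate [xi, b_gyro]
        self.P = np.eye(2)                            # error covariance
        self.A = np.array([[1.0, -dt], [0.0, 1.0]])   # Jacobian of the state transition
        self.Q = np.diag([q_angle, q_bias])           # process noise covariance
        self.H = np.array([[1.0, 0.0]])               # only the angle is observed
        self.r_obs = r_obs                            # nominal observation variance
        self.eps_g = eps_g                            # allowed deviation of ||a|| from 1 g

    def step(self, omega, xi_obs, a_norm):
        """omega: gyro rate, xi_obs: angle from accelerometer/magnetometer, a_norm: ||a|| in g."""
        # 1) a priori state estimate (Eq. 4): integrate the bias-corrected rate
        xi, b = self.x
        self.x = np.array([xi + (omega - b) * self.dt, b])
        # 2) a priori error covariance (Eq. 5)
        self.P = self.A @ self.P @ self.A.T + self.Q
        # external-acceleration detector (Eqs. 8, 9): distrust xi_obs if ||a|| deviates from 1 g
        r = self.r_obs if abs(a_norm - 1.0) < self.eps_g else 1e6
        # 3) Kalman gain (Eq. 6)
        S = self.H @ self.P @ self.H.T + np.array([[r]])
        K = self.P @ self.H.T @ np.linalg.inv(S)
        # 4) a posteriori state estimate (Eq. 7): correct with the angle difference
        delta_xi = self.x[0] - xi_obs
        self.x = self.x - (K * delta_xi).ravel()
        # 5) a posteriori error covariance (Eq. 11)
        self.P = self.P - K @ self.H @ self.P
        return self.x[0]
```

For the roll and pitch filters, xi_obs would come from the accelerometer attitude of Equations 1 and 2; for the heading filter it is replaced by the tilt-compensated magnetometer heading of Equation 3.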

B. Fusion for position

At this point the orientation of the camera is known, and by following the classical strapdown mechanisation, as shown in Figure 4, the next steps for the position estimation consist of the transformation from the body coordinate frame to the global navigation coordinate system and the double integration of the accelerometer measurements.
In the actual configuration of the system all measurements are resolved in a body coordinate frame, rather than a global inertial system. Hence, the position p can only be obtained by double integration of the body accelerations a when a known orientation Ξ = [φ θ ψ]^T is available that allows a rotation from the body frame B to the reference (or navigation) frame N by using the direction cosine matrix (DCM) C_b^n, defined as follows⁵:

⁴ m_des describes the magnitude of the earth's magnetic field (e.g. 48 µT in Western Europe).

⁵ For simplification: sα = sin(α) and cβ = cos(β).



C_b^n(q) = 1/√(q_4^2 + ‖e‖^2) ·
  [ q_1^2 − q_2^2 − q_3^2 + q_4^2    2(q_1 q_2 + q_3 q_4)              2(q_1 q_3 − q_2 q_4)
    2(q_1 q_2 − q_3 q_4)             −q_1^2 + q_2^2 − q_3^2 + q_4^2    2(q_2 q_3 + q_1 q_4)
    2(q_1 q_3 + q_2 q_4)             2(q_2 q_3 − q_1 q_4)              −q_1^2 − q_2^2 + q_3^2 + q_4^2 ]    (12)

C_b^n = [ cθcψ    sφsθcψ − cφsψ    cφsθcψ + sφsψ
          cθsψ    sφsθsψ + cφcψ    cφsθsψ − sφcψ
          −sθ     sφcθ             cφcθ ]    (13)

The DCM can also be expressed in terms of an orientation quaternion q = [e^T, q_4]^T, where e = [q_1, q_2, q_3]^T describes the vector part and q_4 is the scalar part of q. Equation 12 shows the relation between C_b^n and a computed q. A detailed introduction to quaternions for representing rotations can be found in [52].
The actual position is computed by double integration of the accelerometer measurements.
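The remaining position computation can be summarised in a few lines. The sketch below is an illustration only, not the authors' implementation: it converts the quaternion into the DCM of Equation 12, rotates the measured body acceleration into the navigation frame, removes gravity and integrates twice with a simple Euler scheme. The units (m/s² instead of multiples of g) and the exact rotation convention are assumptions of this example.

```python
import numpy as np

G_NAV = np.array([0.0, 0.0, -9.81])   # assumed navigation-frame gravity vector [m/s^2]

def dcm_from_quaternion(q):
    """DCM of Eq. 12 for q = [q1, q2, q3, q4] with vector part e = [q1, q2, q3]."""
    q1, q2, q3, q4 = q / np.linalg.norm(q)
    return np.array([
        [q1*q1 - q2*q2 - q3*q3 + q4*q4, 2*(q1*q2 + q3*q4),              2*(q1*q3 - q2*q4)],
        [2*(q1*q2 - q3*q4),             -q1*q1 + q2*q2 - q3*q3 + q4*q4, 2*(q2*q3 + q1*q4)],
        [2*(q1*q3 + q2*q4),             2*(q2*q3 - q1*q4),              -q1*q1 - q2*q2 + q3*q3 + q4*q4],
    ])

def strapdown_position_step(p, v, a_body, q, dt):
    """One integration step: rotate the measured specific force, remove gravity, integrate twice."""
    a_nav = dcm_from_quaternion(q) @ a_body + G_NAV   # free acceleration in the navigation frame
    p_new = p + v * dt + 0.5 * a_nav * dt**2
    v_new = v + a_nav * dt
    return p_new, v_new
```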

C. Evaluation

The evaluation of the orientation estimation was realised by attaching the VISrec! prototype to an industrial robot platform. An ABB IRB1400 industrial robot, as shown in Figure 9, was used to generate different motion patterns for which the ground truth data is known.

Figure 9: ABB industrial robot for the determination of ground truth motion data

Figure 10 gives an impression of the performance of the DKF approach in comparison to the usage of gyroscopes alone, whereat the roll angle is shown for vibration without motion (Figure 10-(a)) and for a specified rotation pattern (Figure 10-(b)). It can be clearly seen that the DKF improves the situation enormously in terms of long-term stability and accuracy by the incorporation of the accelerometer attitude measurements.
The same experiment was also done for the yaw angle, using the magnetometer as an additional information source. By observation of the residuals (Figure 11, right) it can be determined that the accuracy of the orientation estimation can also be increased for this case. Noteworthy, the performance for the heading angle is of course dependent on the environmental magnetic disturbances during the measurements, which were one severe problem during the data acquisition near the moving industrial robot.

Figure 10: Results of the orientation estimation for the roll angle (ground truth vs. fused results vs. gyroscopes alone) for (a) no rotation, only noise and vibration; (b) a specified rotation pattern

Due to the definition in Equation 10, the system relies completely on the gyroscope measurements if the magnetic disturbance exceeds a specified threshold.

Figure 11: Results of the orientation estimation for the yaw angle for a specified motion pattern (left: estimated angle, Kalman-filtered yaw vs. gyro yaw vs. robot yaw; right: residual)

V. Visual Fusion Cell (VFC)

For the VFC, classical SfM algorithms have to be reconsidered and evaluated for their applicability in the given context, but most of those methods are fundamentally offline in nature (see e.g. [46]) due to their structure based on batch computation. Especially those methods proposed for 3D model generation are mostly based on analysing a complete given image sequence and not successive frames. An example of such an approach can be found in [23]. Recent approaches for SLAM, such as the one proposed by [21], are highly focused on the ability for high frame-rate real-time performance, motivated by the intended usage in mobile robotics, but their focus is not a dense and accurate 3D reconstruction of the scene but rather a robust localisation. Thus this methodology is also labelled as visual odometry (VO). For the implementation of a mobile on-the-fly scene acquisition device, the recently developed methods for SfM and SLAM have to be combined, due to the goal of a sequentially growing scene structure model which consists of reliable 3D feature points



acquired in real-time during the camera motion. Figure 12 illustrates the main stages of the VFC as described in the remainder of this section.

Figure 12: Overview of the elements of the visual fusion cell (an initialisation of the structure model followed by sequential SfM, which maintains a trajectory model and a structure model and exchanges information with the IFC)

A. Initialisation of structure model

The VFC of the two-track system design consists of two separate steps: the initialisation of the structure model and the sequential SfM. The initial structure model is generated at the beginning of the data acquisition and can be used during the sequential SfM phase to estimate the absolute pose of the camera.

Figure 13: Elements of the initialisation phase (feature tracking, automatic keyframe selection, relative pose estimation (R, t), preliminary stereo triangulation and optimisation of the initial scene model)

Figure 13 gives an overview of the different elements of the initialisation phase as described in the following subsections, starting with the acquisition of the initial sequence in Section V-A.1, followed by the recovery of the motion information for the initial keyframes in Sections V-A.2 and V-A.3, and the subsequent generation of an initial model, as described in Sections V-A.4 and V-A.5. Finally, a bundle adjustment scheme, as described in subsection V-A.6, is used for the optimisation of the initial model.

1) Acquisition of the initial sequence

Due to the fact that the usage of the five-point relative pose algorithm as proposed by [42] in 2004 leads to a scale ambiguity for the translational motion, it is necessary for the generation of the initial structure model to capture an initial sequence where the translational motion between the first and the last frame is approximately known. In the final scheme of the two-track system design this can be done by using position information from the IFC. For the first tests, as explained here, a fixed translational motion of 600 mm is assumed and the operator of the camera has to manually start and finish the acquisition of the initial sequence, e.g. by pressing a button. By incorporating this initial guess of the translational motion it is possible to get a more adequate initial reconstruction of the feature points, which is an important factor for the final bundle adjustment, because only "good" initial values guarantee an optimal convergence of the nonlinear optimisation routine.
The initial sequence is acquired while the camera is moved in front of the object by approximately 600 mm in one direction. Figure 14 illustrates the acquisition of the initial sequence, which consists of n frames. The overall translation between the first frame of the sequence I_1 and the last one (I_n) is assumed as t_{13} = [t_init, 0, 0]^T, where t_init represents a fixed, known translation between the first and the last frame of the initial video stream.

Figure 14: Acquisition of the initial sequence (the keyframes Q1, Q2 and Q3 are selected from the frames I_1 ... I_n captured while moving in front of the object; the relative motions are denoted R_{12}, t_{12}, R_{23}, t_{23} and R_{13}, t_{13})

From the frames of the initial sequence, three keyframes Q1, Q2 and Q3 are selected. The three keyframes are used subsequently for the estimation of the relative pose and the partial stereo reconstruction of the observed object, as described in the following sections.

2) Relative pose estimation between key frames

The first three keyframes of the initialisation sequence are used to generate two relative pose estimates by following general five-point relative pose algorithms such as the one proposed by [42]. For this, at least five points (P_i) have to be matched successfully between two of the three keyframes. The general problem of relative pose estimation based on a set of 2D/2D correspondences can be formulated as the recovery of the time-varying parameters of a camera's egomotion R_k, t_k from corresponding image feature coordinates [u_{i,k}, v_{i,k}]^T. In this context it is necessary to distinguish two different setups: the calibrated or the uncalibrated camera setup. The relative pose parameters R_k, t_k are directly related to the essential matrix E, defined as follows:

E_k = R_k [t_k]_×    (14)

The essential matrix describes the general epipolar relations for a stereo image pair. Here X_i describes a point in the world coordinate system which is imaged on the two image planes Π and Π′, so that the two corresponding image feature points are localised at x_i and x′_i, respectively. In general, for an image point in homogeneous coordinates x = [u v 1]^T in image I and a corresponding image point



x′ = [u′ v′ 1]^T in image I′, the simplified epipolar constraint shown in the following equation holds:

q′^T E q = 0    (15)

Here q and q′ are computed by multiplication of the image points with the inverse of the predetermined calibration matrices K and K′ of the camera. Those coordinates are called camera-normalised coordinates:

q = K^{-1} x and q′ = K′^{-1} x′    (16)
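Equations 14-16 translate directly into a few lines of code. The sketch below is an illustration, not part of the original framework: it builds E from a known relative motion and evaluates the epipolar residual |q′^T E q| for a pixel correspondence after normalisation with the calibration matrices.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_motion(R, t):
    """Essential matrix E = R [t]_x of Eq. 14."""
    return R @ skew(t)

def epipolar_residual(E, K, K_prime, x, x_prime):
    """|q'^T E q| for homogeneous pixel points x, x', normalised as in Eqs. 15 and 16."""
    q = np.linalg.inv(K) @ x
    q_prime = np.linalg.inv(K_prime) @ x_prime
    return abs(q_prime @ E @ q)
```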

The intrinsic calibration matrices K and K′ are determined within a prior calibration routine following the procedure of the Camera Calibration Toolbox of Bouguet. K is in general composed as shown in Equation 17, where the parameters u_0 and v_0 describe a translation along the image plane and α_u, α_v and γ describe scale changes along the image axes and a rotation in the image plane (see [9]):

K = [ α_u  γ    u_0
      0    α_v  v_0
      0    0    1 ]    (17)

The definition in Equation 16 also shows the relation between the essential and the fundamental matrix F:

F = K^{-T} E K′^{-1}    (18)

F can be used to define the general epipolar constraint as shown in Equation 19:

x′^T F x = 0    (19)

One important constraint for the estimation of both the essential and the fundamental matrix is the fact that both matrices are singular, so their determinants are zero:

det(F) = 0 and det(E) = 0    (20)

During the last decades many different algorithms have dealt with estimating both the essential and the fundamental matrix from point correspondences. The approach which has been used mostly in the literature over the years is the so-called eight-point algorithm, which is widely used to estimate F and subsequently derive E by following Equation 18. A detailed description can be found e.g. in [26].
By using the additional constraint from Equation 20 it is possible to reduce the minimal number of points for estimating E to seven. As indicated by [26], it is necessary to normalise the point correspondences due to the dependency of the estimation techniques on the range of the measured values. For this case Hartley suggested an isotropic scaling, which can be summarised as: translate all points (in inhomogeneous coordinates) so that their mean coordinate is at the origin, and scale the points so that the average distance from the origin is equal to √2.
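The isotropic normalisation described above can be implemented as follows. This is a generic sketch; the array shapes and the returned similarity transform T are assumptions of this illustration, not prescribed by the paper.

```python
import numpy as np

def normalise_points(x):
    """Isotropic normalisation of N inhomogeneous image points x of shape (N, 2):
    translate the centroid to the origin and scale so that the mean distance is sqrt(2).
    Returns the homogeneous normalised points (N, 3) and the 3x3 similarity T used."""
    centroid = x.mean(axis=0)
    mean_dist = np.linalg.norm(x - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    x_h = np.hstack([x, np.ones((x.shape[0], 1))])
    return (T @ x_h.T).T, T
```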

It was shown by [44] that an additional property of the essential matrix, shown in Equation 21, which can be derived from the fact that the two non-zero singular values of E are equal, can be used to reduce the sufficient number of points to six (see [44]), respectively five (see [42]):

E E^T E − (1/2) · trace(E E^T) · E = 0    (21)

It was shown in an experimental evaluation by [48] that the usage of five-point algorithms outperforms other techniques, especially for noisy data. Even if [14] suggested a combination of an eight-point and a five-point estimator as the optimal solution for robust relative pose estimation, the current approach considers the five-point relative pose estimator as suggested by [42]. An experimental evaluation of different techniques is provided in [3]. Each pair of corresponding points in the images leads to one equation following the constraint shown in Equation 15. [42] suggests the formulation q̃^T Ẽ = 0, with q̃ and Ẽ as defined in Equation 22. For all five point correspondences, the following 5x9 data matrix Q can be obtained (each row is the vector q̃ of Equation 22 for one correspondence):

Q = [ q̃_1[1]  ···  q̃_1[9]
      ⋮             ⋮
      q̃_5[1]  ···  q̃_5[9] ]    (23)

The solution for E is found by first decomposing Q by singular value decomposition (SVD) (see [14]) or QR factorisation (see [42]) to compute its null space. The null space leads to the vectors A, B, C and D. Then the following linear combination leads to the essential matrix:

E = a · A + b · B + c · C + d · D    (24)

It should be stated here that the four scalar values a, b, c and d are only defined up to a common scale, so it can be assumed that d = 1. Substituting Equation 24 into the constraints shown in Equations 20 and 21, the problem can be formulated as ten polynomials of third degree. Nister suggested an algorithm for solving this problem to recover the unknowns of the system and the essential matrix E, whereat up to ten solutions are possible. In recent years different methods for the final estimation of E were suggested in the literature. The original algorithm proposed by Nister in [42] uses Sturm sequences to solve a univariate formulation of the problem. Later [55] proposed a more efficient procedure based on Groebner bases. It was suggested by [61] that a formulation as a polynomial eigenvalue problem is more straightforward and leads to solutions which are numerically more stable. The different methods were evaluated in terms of accuracy and robustness against noise for the current project (see [3]).
In most cases the feature detection and matching routine will produce more than the minimum set of five correct point correspondences. In those cases the "best" solution can be found by evaluating a defined error metric. Different kinds of error metrics are defined in the literature. So [48] suggests the usage of the Sampson error metric d_e over all matches ℓ, which should be minimal for the correct solution of E and can be defined as follows:

d_e = Σ_{k=1}^{ℓ} (x′_k^T E x_k)^2 / ([E x_k]_x^2 + [E x_k]_y^2 + [E^T x′_k]_x^2 + [E^T x′_k]_y^2)    (25)

[26] uses the classic algebraic error based on the simplified epipolar constraint as already defined in Equation 15. Another error metric is the symmetric squared geometric error, as suggested by [14]:



q̃ = ( x[1]x′[1]  x[2]x′[1]  x[3]x′[1]  x[1]x′[2]  x[2]x′[2]  x[3]x′[2]  x[1]x′[3]  x[2]x′[3]  x[3]x′[3] )^T
Ẽ = ( E[1,1]  E[1,2]  E[1,3]  E[2,1]  E[2,2]  E[2,3]  E[3,1]  E[3,2]  E[3,3] )^T    (22)

d_ssg = (x′_k^T E x_k)^2 / ([E x_k]_x^2 + [E x_k]_y^2) + (x′_k^T E x_k)^2 / ([E^T x′_k]_x^2 + [E^T x′_k]_y^2)    (26)
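As a vectorised illustration of the error metric used for model selection, the following sketch evaluates the summed Sampson error of Equation 25 for a candidate E over all correspondences; it assumes that the homogeneous, normalised points are stacked row-wise.

```python
import numpy as np

def sampson_error(E, x, x_prime):
    """Summed Sampson error d_e of Eq. 25 for correspondences x, x_prime of shape (N, 3)."""
    Ex = (E @ x.T).T               # E x_k for every match
    Etxp = (E.T @ x_prime.T).T     # E^T x'_k for every match
    num = np.einsum('ij,ij->i', x_prime, Ex) ** 2
    den = Ex[:, 0]**2 + Ex[:, 1]**2 + Etxp[:, 0]**2 + Etxp[:, 1]**2
    return float(np.sum(num / den))
```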

3) Recovering motion parameters

Once the essential matrix is known, the egomotion of the camera between two successive frames can be retrieved from E. It has to be stated here that E can only be recovered up to scale. There is also an ambiguity, such that there are four possible solutions for the rotation matrix and the translation vector.
The first step in determining R and t from E is the computation of the singular value decomposition (SVD) of the essential matrix:

E ∼ U Σ V^T    (27)

As shown in [58, 26], the four possible solutions for R and t can be composed from two different solutions for the rotation matrix, R_a and R_b, and two different solutions for the translation, t_a and t_b, as follows: {R_a, t_a}, {R_b, t_b}, {R_a, t_b} and {R_b, t_a}. The definition of the solutions is based on the following definitions for t_a and t_b:

t_a ≡ [ U[1,3]  U[2,3]  U[3,3] ]^T
t_b ≡ −1 · [ U[1,3]  U[2,3]  U[3,3] ]^T    (28)

R_a and R_b are defined as follows:

R_a = U D V^T ;  R_b = U D^T V^T    (29)

with

D = [ 0  1  0
     −1  0  0
      0  0  1 ]

This four-fold ambiguity can be solved by using the cheirality constraint, which states that the observed feature points have to be located in front of both cameras. For this it is necessary to reconstruct the three-dimensional coordinates of at least one feature point by using standard triangulation methods and the four possible solutions for the motion parameters. Only in one of those cases does the reconstructed point lie in front of both cameras, which means that its z-coordinate is bigger than zero. [42] suggested a more efficient method to test the cheirality constraint which uses just one triangulation and a subsequent testing of additional properties, which can lead directly to the correct configuration.
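The decomposition of Equations 27-29 and the cheirality test can be sketched as follows. This is an illustration under the P = R [I | −t] convention of Equation 31 with normalised coordinates (K = I); it is not the more efficient single-triangulation test of [42], and sign conventions may differ from the authors' implementation.

```python
import numpy as np

def decompose_essential(E):
    """The four candidate solutions {Ra,ta}, {Rb,tb}, {Ra,tb}, {Rb,ta} of Eqs. 27-29."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:  U = -U      # enforce proper rotations
    if np.linalg.det(Vt) < 0: Vt = -Vt
    D = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    Ra, Rb = U @ D @ Vt, U @ D.T @ Vt
    ta, tb = U[:, 2], -U[:, 2]
    return [(Ra, ta), (Rb, tb), (Ra, tb), (Rb, ta)]

def triangulate(P1, P2, q1, q2):
    """Linear (DLT) triangulation of one normalised correspondence q1 <-> q2."""
    A = np.vstack([q1[0] * P1[2] - P1[0], q1[1] * P1[2] - P1[1],
                   q2[0] * P2[2] - P2[0], q2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def select_by_cheirality(E, q1, q2):
    """Return the (R, t) candidate for which the triangulated point lies in front of both cameras."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    for R, t in decompose_essential(E):
        P2 = R @ np.hstack([np.eye(3), -t.reshape(3, 1)])   # R [I | -t]
        X = triangulate(P1, P2, q1, q2)
        if X[2] > 0 and (R @ (X - t))[2] > 0:
            return R, t
    return None
```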

4) Guided-RanSaC for handling outliers

Usually the feature detection and matching routine will provide more than five corresponding points between two successive frames of the image sequence. However, it is very likely that the set of point matches also contains a non-negligible number of wrong matches (outliers). So there is the open question of choosing the optimal point correspondences for the relative pose estimation.
Thus, in the literature the calculation of the essential matrix is realised by following Random Sample Consensus (RanSaC), which randomly selects a minimal subset of the data (here: five point matches) and generates an estimate for E based on those points. Finally all other points are tested against the actual estimate of the essential matrix (e.g. by checking the simplified epipolar constraint from Equation 15). If a sufficient number of point matches follow the estimated model it is assumed to be a correct estimate; otherwise the next minimal subset of points is sampled and the procedure starts again.
Due to the iterative character of the RanSaC approach, a purely random sampling is not adopted within this framework. Instead of a random sampling which treats all samples equally, a guided sampling based on a priori known measures from the feature detection and matching procedure is used here. Similar ideas are described by [38] and [56] within their GOODSaC and GuiSaC procedures.
Most feature detection methods lead to a score which can be interpreted as a kind of distinctiveness measure⁶ ξ, and the matching procedure also leads to a similarity measure ρ. For the experiments incorporating Harris features, the distinctiveness V[u,v] at the corner positions defines ξ. These information sources are weighted by factors w_ξ and w_ρ to compute an indicator τ which can be interpreted as the likelihood of being a correct or wrong match.
For the estimation of E at least five matches are necessary. Hence, the minimal sample sets (MSSs) consist of five matches which are sampled from the set of matches presorted with respect to τ. An iterative procedure generates estimates for E by Nister's five-point algorithm until a test of the actual configuration produces a Sampson error d_e over all matches ℓ below a specified threshold d_lim. The definition of d_e can be found in Equation 25. Besides that, the number of inliers produced by the actual configuration of E is evaluated for the stop criterion. The whole procedure for estimating the relative camera pose can be described by Algorithm 1.
The whole routine is used to generate an estimate for E, whereat it is of course also necessary to apply the concept of the guided-RanSaC scheme for handling the outliers in the set of matched point features. Furthermore, the rotation matrix R and the translation t are extracted by SVD and subsequent evaluation of the cheirality constraint. The arbitrary scale of t is determined by incorporating the assumption for t_init as the translational movement during the acquisition of the initial sequence.

⁶ It should be stated that the general term distinctiveness describes different properties for different feature detectors. So the distinctiveness for a corner detector would be labelled more exactly as "cornerness", while the features extracted by the Fast Radial Symmetry Transform (FRST) (see [54]) are selected based on their "roundness".



Algorithm 1 Guided-RanSaC procedure for camera egomotion estimation

1: Detect n features in I and m features in I′ and compute ξi : i ∈ 1...n and ξ′j : j ∈ 1...m
2: Find ℓ corresponding points qk and q′k and compute ρk with k ∈ 1...ℓ
3: for all found matches ℓ do
4:   Calculate the likelihood of being a correct match
5:   τk = wξ ξk + wρ ρk
6: end for
7: Sort all found matches x and x′ by τ
8: Transform x and x′ to normalised coordinates q and q′
9: Sample N MSSs from the sorted matches
10: while (de < dlim) ∧ (g ≤ N) ∧ (h > hlim) do
11:   Estimate E with MSS g : g ∈ 1...N
12:   Calculate de over ℓ matches
13:   Calculate the number of inliers h with the actual E
14: end while
15: Extract Ra, Rb and ta, tb from E by SVD
16: Choose the correct solution for R and t by the cheirality constraint
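A minimal sketch of the guided sampling loop of Algorithm 1 is given below (Python/NumPy). The five-point solver and the Sampson distance are passed in as callables (estimate_E_5pt, sampson_error); these names, the use of consecutive groups of five presorted matches as MSSs, and the interpretation of hlim as an inlier ratio are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def guided_ransac_E(q, q_prime, xi, rho, w_xi=0.5, w_rho=0.5,
                    N=50, d_lim=1e-3, h_lim=0.8,
                    estimate_E_5pt=None, sampson_error=None):
    """Guided sampling for the essential matrix (sketch).

    q, q_prime : (l, 2) normalised image coordinates of l matches
    xi, rho    : per-match distinctiveness and similarity scores
    estimate_E_5pt, sampson_error : placeholder callables for the
    five-point solver and the Sampson distance (not provided here).
    """
    l = q.shape[0]
    # Likelihood of a match being correct (weighted score tau)
    tau = w_xi * xi + w_rho * rho
    order = np.argsort(-tau)              # best matches first

    best_E, best_inliers = None, -1
    for g in range(min(N, l - 4)):
        # Minimal sample set: five consecutive matches from the sorted list
        mss = order[g:g + 5]
        E = estimate_E_5pt(q[mss], q_prime[mss])
        d = sampson_error(E, q, q_prime)  # per-match Sampson distances
        inliers = int(np.sum(d < d_lim))
        if inliers > best_inliers:
            best_E, best_inliers = E, inliers
        # Stop once the error is small and enough matches agree with the model
        if np.mean(d) < d_lim and inliers / l > h_lim:
            break
    return best_E
```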

The following figure gives an overview of the whole procedure, where xQj-k describes the matched 2D feature point coordinates between the keyframes Qj and Qk.

Figure. 15: Relative pose estimation based on three keyframes (feature tracking over Q1, Q2 and Q3 with pairwise five-point relative pose estimation)

5) Preliminary Stereo Triangulation

The generated estimates for R and t are used subsequently to determine the preliminary scene model. For this the observed point features, which were successfully tracked during the acquisition of the initial sequence, are reconstructed in 3D by standard triangulation techniques (see e.g. [26]). Due to the fact that the translation t can only be recovered up to an arbitrary scale by Nister's algorithm, and the used procedure involving tinit is only an assumption about the translational motion between the first and the last frame, the unknown scale between the different two-frame reconstructions has to be resolved. For this the procedure of [27] is used to estimate the scale s by minimising the term shown in Equation 30, where ${}^C\mathbf{X}_i^{Q_{j-k}} = [x_i \; y_i \; z_i]^T$ describes the 3D reconstruction of the i-th feature point found in both keyframes Qj and Qk. The minimisation of Equation 30 is realised in a least-squares sense.

$$\sum_i \left( {}^C\mathbf{X}_i^{Q_{1-2}} - s \cdot {}^C\mathbf{X}_i^{Q_{1-3}} \right) \qquad (30)$$
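Interpreting Equation 30 as a least-squares problem over the single scalar s leads to a closed-form estimate. The sketch below (Python/NumPy) illustrates this; it is an illustration of the least-squares idea under that assumption, not the exact routine of [27].

```python
import numpy as np

def estimate_relative_scale(X_q12, X_q13):
    """Least-squares scale s between two two-frame reconstructions.

    X_q12, X_q13 : (m, 3) arrays of the same m feature points,
    reconstructed from the keyframe pairs Q1-Q2 and Q1-Q3.
    Minimises sum_i || X_q12_i - s * X_q13_i ||^2 in closed form.
    """
    num = np.sum(X_q12 * X_q13)   # sum of pointwise dot products
    den = np.sum(X_q13 * X_q13)   # sum of squared norms
    return num / den
```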

6) Optimisation of initial scene model

The initial reconstruction of the scene structure is used as a base for a further refinement by classical Bundle Adjustment (BA). BA performs a simultaneous optimisation of 3D structure and camera egomotion by minimising the difference between estimated and measured image feature locations ${}^P\mathbf{x}_i^k = [u_i^k \; v_i^k]^T$. In this context the camera or projection matrix of the k-th frame Pk is used to compute the estimated projections of the 3D structure following the projection shown in Equation 31, where ∼ indicates equality up to scale. Here ${}^P\mathbf{x}_i^k$ describes the i-th 2D point in pixel coordinates for the k-th frame of the sequence in homogeneous coordinates, Kk is the corresponding intrinsic camera matrix, and Rk and tk are the corresponding extrinsic parameters of the rigid transformation.

$${}^P\mathbf{x}_i^k \sim \mathbf{K}_k \mathbf{R}_k \left[ {}^C\mathbf{X}_i^k - \mathbf{t}_k \right] \qquad (31)$$

In general this projection can be formulated by using the projection or camera matrix $\mathbf{P}_k = \mathbf{K}_k \left[ \mathbf{R}_k \,|\, -\mathbf{t}_k \right]$ as follows:

$${}^P\mathbf{x}_i^k \sim \mathbf{P}_k \, {}^C\mathbf{X}_i^k \qquad (32)$$
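A minimal sketch of this projection (Python/NumPy), written directly from Equation 31 and assuming t denotes the camera position used there:

```python
import numpy as np

def project_point(K, R, t, X):
    """Project a 3D point X (shape (3,)) following Equation 31:
    x ~ K R (X - t), then dehomogenise to pixel coordinates (u, v)."""
    x_h = K @ R @ (X - t)      # homogeneous image coordinates
    return x_h[:2] / x_h[2]    # pixel coordinates u, v
```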

The procedure of BA consists of an interleaving approach based on ideas in [27] and [57] which decouples structure and motion optimisation. The following subsections describe the structure and motion estimation with BA in detail. Besides that it is shown which data is used as initial estimates for both scene structure and camera egomotion, because the provision of adequate initial estimates is crucial for the success of BA algorithms.

Optimisation of scene structure

The scene structure optimisation is based on the minimisation of the difference between estimated and measured image feature locations. For this the projection in Equation 32 is used as a reference. The optimisation incorporates all m features which could be tracked through the whole initialisation sequence of n frames. The optimal 3D point location for each feature can then be computed by minimising the following term:

$$\sum_{k=1}^{n} \left( u_i^k - \frac{\mathbf{P}_{k,1}^T \, {}^C\mathbf{X}_i}{\mathbf{P}_{k,3}^T \, {}^C\mathbf{X}_i} \right)^{2} + \left( v_i^k - \frac{\mathbf{P}_{k,2}^T \, {}^C\mathbf{X}_i}{\mathbf{P}_{k,3}^T \, {}^C\mathbf{X}_i} \right)^{2} \qquad (33)$$

The minimisation is realised in MATLAB by using the Nelder-Mead method as described in [7], where the reconstructed 3D points from the two stereo pairs are used as the initial estimate of the scene structure.
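The paper's implementation is in MATLAB; the sketch below illustrates the same per-point minimisation of Equation 33 with SciPy's Nelder-Mead implementation, assumed here only as an equivalent stand-in.

```python
import numpy as np
from scipy.optimize import minimize

def refine_point(X0, P_list, uv_list):
    """Refine one 3D point by minimising the reprojection error of
    Equation 33 with the Nelder-Mead simplex method.

    X0      : (3,) initial 3D estimate from the stereo triangulation
    P_list  : list of 3x4 camera matrices P_k
    uv_list : list of measured pixel coordinates (u_k, v_k)
    """
    def reprojection_error(X):
        X_h = np.append(X, 1.0)                   # homogeneous 3D point
        err = 0.0
        for P, (u, v) in zip(P_list, uv_list):
            u_hat = (P[0] @ X_h) / (P[2] @ X_h)   # P_k,1^T X / P_k,3^T X
            v_hat = (P[1] @ X_h) / (P[2] @ X_h)   # P_k,2^T X / P_k,3^T X
            err += (u - u_hat) ** 2 + (v - v_hat) ** 2
        return err

    res = minimize(reprojection_error, X0, method='Nelder-Mead')
    return res.x
```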

Optimisation of camera egomotion

The initial estimates for the camera movement are generated by interpolating the calculated rotations and translations between Q1 and Q2, respectively Q1 and Q3, obtained from Nister's algorithm.



The minimisation is based on a nested optimisation procedure which runs one optimisation of the scene structure for each iteration of the minimisation of the following error term:

$$\sum_{i=1}^{m} \min_{{}^C\mathbf{X}_i} \sum_{k=1}^{n} \left( u_i^k - \frac{\mathbf{P}_{k,1}^T \, {}^C\mathbf{X}_i}{\mathbf{P}_{k,3}^T \, {}^C\mathbf{X}_i} \right)^{2} + \left( v_i^k - \frac{\mathbf{P}_{k,2}^T \, {}^C\mathbf{X}_i}{\mathbf{P}_{k,3}^T \, {}^C\mathbf{X}_i} \right)^{2} \qquad (34)$$

It should be stated that it is necessary to update the elements of Pk for each new iteration of the Nelder-Mead method. The following figure gives an example of an initial scene model for a planar object (a checkerboard on a wall) and the corresponding camera egomotion.

Figure. 16: Example of an initial scene model and the corresponding camera egomotion
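As an illustration of the nested scheme in Equation 34, the sketch below wraps the per-point refinement from the earlier refine_point sketch in an outer Nelder-Mead run over the camera parameters. The helper pose_to_P, which assembles the camera matrices P_k from a parameter vector, is a hypothetical placeholder and not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def refine_motion(theta0, pose_to_P, X0_list, uv_per_point):
    """Outer optimisation of camera egomotion (Equation 34, sketch).

    theta0       : initial stacked camera parameters for all n frames
    pose_to_P    : callable mapping the parameter vector to a list of
                   3x4 camera matrices P_k (hypothetical helper)
    X0_list      : initial 3D points from the stereo triangulation
    uv_per_point : per-point lists of measured (u_k, v_k) observations
    """
    def motion_error(theta):
        P_list = pose_to_P(theta)                 # rebuild P_k each iteration
        err = 0.0
        for X0, uv_list in zip(X0_list, uv_per_point):
            X_opt = refine_point(X0, P_list, uv_list)   # inner structure step
            X_h = np.append(X_opt, 1.0)
            for P, (u, v) in zip(P_list, uv_list):
                u_hat = (P[0] @ X_h) / (P[2] @ X_h)
                v_hat = (P[1] @ X_h) / (P[2] @ X_h)
                err += (u - u_hat) ** 2 + (v - v_hat) ** 2
        return err

    return minimize(motion_error, theta0, method='Nelder-Mead').x
```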

B. Sequential SfM

The initial scene model is then used to estimate the camera pose based on 2D/3D correspondences between image features and a scene model which contains calibrated feature positions. For this the 4-point algorithm suggested by [15] is used, because this procedure also works for an unknown focal length of the camera. This is important to consider if the focal length of the camera can change during the scene acquisition. Figure 17 gives an impression of the general configuration of the four-point problem, whereat the relation between the object model and the image features in the actual frame is visualised. One major issue in this context is the successful detection and tracking of at least four feature points between the acquired frames and the initial scene model. Of course it is also necessary to add adequate points to the 3D world model to guarantee the possibility to move around the object, which is necessary to acquire a complete 3D representation from all sides of the scene. The different parts of the sequential SfM are described in detail in the following paragraphs.

1) Feature detection and matching

The reconstruction of a 3D scene or object from 2D image sequences and the estimation of the camera trajectories are always based on generating correspondences between extracted features from two or more successive frames. For this reason the task can be subdivided into the detection of visual landmarks and their matching in successive frames (image registration) of the sequence. Those features could be of various appearances, whereat most SfM algorithms are based on point features, because the identification of distinctive points (corners, junctions, etc.) is a well-studied field in image processing (see [40]).

Figure. 17: General configuration of the four-point absolute pose problem (world, camera and pixel coordinate systems)

Also recently published methodologies such as SIFT (see [36]) and SURF (see [8]) have drawn the attention of researchers due to their tolerance to scale, illumination and pose variations, which can considerably increase the robustness of the registration procedure. For the first experiments the Harris corner detector as described by [25] was used for finding distinctive features in the images. The Harris features are located at the maxima of the local image autocorrelation function A, as shown in the following equation:

$$\mathbf{A} = \begin{bmatrix} \sum_{(i,j)\in\Omega} f_x(i,j)^2 & \sum_{(i,j)\in\Omega} f_x(i,j)\, f_y(i,j) \\ \sum_{(i,j)\in\Omega} f_x(i,j)\, f_y(i,j) & \sum_{(i,j)\in\Omega} f_y(i,j)^2 \end{bmatrix}$$

The distinctiveness V[u,v] of the points is computed by evaluating A at the image position [u, v] in the following manner:

$$V_{[u,v]} = \det\left(\mathbf{A}_{[u,v]}\right) - k \cdot \operatorname{trace}\left(\mathbf{A}_{[u,v]}\right)^2 \qquad (35)$$
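A compact sketch of this response computation (Python/SciPy) is given below. The window sum over Ω is approximated by an averaging filter, and the value k = 0.04 is a commonly used choice rather than a parameter taken from the paper.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def harris_response(img, k=0.04, window=5):
    """Harris corner response V (Equation 35) for a greyscale image.

    The structure tensor A is built from image gradients accumulated
    over a local window Omega; V = det(A) - k * trace(A)^2.
    """
    img = img.astype(np.float64)
    fx = sobel(img, axis=1)                  # horizontal gradient
    fy = sobel(img, axis=0)                  # vertical gradient
    # Elements of A accumulated (here: averaged) over the window Omega
    a11 = uniform_filter(fx * fx, size=window)
    a12 = uniform_filter(fx * fy, size=window)
    a22 = uniform_filter(fy * fy, size=window)
    det_A = a11 * a22 - a12 * a12
    trace_A = a11 + a22
    return det_A - k * trace_A ** 2          # corner response V[u, v]
```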

2) Feature tracking

The feature tracking procedure is based on ideas mainly developed by [53] and [31], whereat the combination of Markov chains (Kalman filter) and graphical models is used for robust tracking in 3D. The predicted feature positions are reprojected, and based on the reprojected 2D image coordinates a search region for the feature matching procedure can be defined. This is realised by an area-based approach which compares intensity patches around the feature positions from two successive frames following the non-parametric ordinal measure as described by [11]. A comparative study of area-based matching techniques in [2] has shown that the ordinal measure outperforms other classical approaches in terms of robustness. Nevertheless there are many possible reasons for wrong matches, as shown e.g. by [6].
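The exact ordinal measure of [11] is not reproduced here; the sketch below only illustrates the underlying idea of comparing two intensity patches by the rank order of their pixels, which is invariant to monotonic intensity changes. It is a simplified rank-based comparison, not the measure used in the paper.

```python
import numpy as np

def rank_distance(patch_a, patch_b):
    """Simplified rank-based (ordinal) patch comparison.

    Both patches are reduced to the rank order of their intensities,
    so the comparison is invariant to monotonic intensity changes;
    smaller values indicate more similar patches.
    """
    ra = np.argsort(np.argsort(patch_a.ravel()))   # rank of each pixel
    rb = np.argsort(np.argsort(patch_b.ravel()))
    return np.mean(np.abs(ra - rb))                # mean rank displacement
```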

3) Absolute pose estimation

As already mentioned above, the estimation of the absolute pose is realised by following the algorithm proposed in [15]. The suggested method uses a Gröbner basis technique which solves a system of algebraic equations derived from



the number of 2D/3D correspondences generated by the feature tracking and matching routine.
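The 4-point solver of [15] additionally handles an unknown focal length and is not reimplemented here. As a simpler stand-in that illustrates the interface of this 2D/3D absolute-pose step (assuming a known intrinsic matrix K), OpenCV's RANSAC-based PnP solver can be used:

```python
import numpy as np
import cv2

def absolute_pose_from_model(object_points, image_points, K):
    """Estimate the camera pose from 2D/3D correspondences (sketch).

    object_points : (m, 3) calibrated 3D feature positions of the scene model
    image_points  : (m, 2) tracked 2D feature positions in the current frame
    K             : 3x3 intrinsic camera matrix (assumed known here)
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points.astype(np.float32),
        image_points.astype(np.float32),
        K.astype(np.float32), None)       # None: no lens distortion assumed
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> rotation matrix
    return ok, R, tvec, inliers
```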

VI. Inertial-Visual Fusion Cell (IVFC)

As mentioned before, one motivation for the implementation of a loosely coupled system as a first stage in the development process is the possibility to run both routes independently. This allows a deep analysis and evaluation of both tracks. By this it is easily possible to give evidence for the aiding character of the IMU to the SfM procedure by comparing results of the SfM with and without inference of the inertial track. For the combination of both tracks two unidirectional interfaces are established between the two routes, as shown in Figure 18.
It should be pointed out that the visual and inertial routes operate at different frequencies due to the implied sensor devices and the computational elements of both tracks. So for the implementation of the interfaces it is important to consider the multi-rate character of the different measurements. As already stated, one major problem of the visual measurements is the missing robustness and the computational complexity of the feature extraction and tracking. By integrating the pose predictions of the IMU it is possible to considerably limit the search space for feature tracking, because there is an expectation where those features are positioned in the new frame. For this, each new feature point inserted into the scene graph is described by a Hidden Markov Model (HMM) (see [20]) for tracking the new feature position based on egomotion estimates from visual and inertial cues.
Moreover it is possible to pre-warp the extracted patches for feature matching based on the estimated camera pose. For this the patches are assumed to be locally planar, as visualised in Fig. 18. Thus a homography as shown in Equation 36

Figure. 18: Locally planar patches for pre-warping

can be computed which relates the patch appearance from one frame to another. Here K is the intrinsic camera matrix, R and t describe the rotation and translation between camera poses, n is the surface normal, which can be estimated by following the algorithm described in [21], and xp is the centre of the patch in the image.

$$\mathbf{H} = \mathbf{K}\mathbf{R}\left[ \mathbf{n}^T\mathbf{x}_p \, \mathbf{I} - \mathbf{t}\,\mathbf{n}^T \right]\mathbf{K}^{-1} \qquad (36)$$
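A direct transcription of Equation 36 (Python/NumPy); the resulting H can then be applied to the patch, e.g. with an image warping routine, to predict its appearance in the new frame.

```python
import numpy as np

def patch_prewarp_homography(K, R, t, n, x_p):
    """Homography of Equation 36 for pre-warping a locally planar patch.

    K    : 3x3 intrinsic camera matrix
    R, t : rotation and translation between the two camera poses
    n    : (3,) estimated surface normal of the patch
    x_p  : (3,) homogeneous image coordinates of the patch centre
    """
    n = n.reshape(3, 1)
    t = t.reshape(3, 1)
    x_p = x_p.reshape(3, 1)
    inner = float(n.T @ x_p) * np.eye(3) - t @ n.T   # n^T x_p * I - t n^T
    return K @ R @ inner @ np.linalg.inv(K)
```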

During times of rapid camera movement strong motion blur appears in the imaged frames, so there is the danger that vision-based estimation of egomotion is not possible. In such periods the inertial route is able to fill the gaps and lead the system while the vision module is almost not in operation. It was shown by [30] that such a strategy is able to compensate for missing measurements from the vision sensor. On the other hand, the estimated camera motion from SfM can be used to bound the drift error of the IMU, which is a logical consequence of the necessary double integration. This is possible due to the generation of position measurements by the visual route and provides a possible solution for extending accurate inertial pose predictions to long-term sequences. The realisation of fusing both pose estimates is based on an additional Kalman filter scheme which incorporates the uncertainties of the separate tracks. It should be pointed out here that the suggested two-track system fuses pose estimates from both routes in the Kalman stage, which allows a relatively simple system and measurement model. It was shown by [17] that it is also possible to fuse the measurements from the MEMS IMU (e.g. 3D acceleration and rotational velocity) directly with pose estimates from the visual route.
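A minimal sketch of the covariance-weighted combination at the heart of such a filter stage is given below (Python/NumPy). The full filter in the framework additionally carries a system model and handles the multi-rate timing of the two routes; the sketch only shows the static measurement-update core, with the state and covariance names chosen here for illustration.

```python
import numpy as np

def fuse_pose_estimates(x_vis, P_vis, x_imu, P_imu):
    """Covariance-weighted fusion of two pose estimates (sketch).

    x_vis, x_imu : pose vectors from the visual and inertial routes
    P_vis, P_imu : the corresponding covariance matrices
    The inertial prediction is corrected by the visual estimate,
    weighted by a Kalman-type gain derived from both uncertainties.
    """
    K = P_imu @ np.linalg.inv(P_imu + P_vis)   # Kalman-like gain
    x_fused = x_imu + K @ (x_vis - x_imu)
    P_fused = (np.eye(P_imu.shape[0]) - K) @ P_imu
    return x_fused, P_fused
```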

VII. Conclusions and future work

The paper proposed a framework for visual-inertial scene reconstruction, whereat the main focus lies on the two distinct fusion cells for visual and inertial information alone. The actual configuration consists of a parallel fusion network as indicated in Figure 1. The authors also developed the scheme of a monolithic system design which combines the measurements of all four sensory units in a single FC. The following figure indicates the general architecture of such an entity.

Figure. 19: Monolithic System Design (IMU and camera as a single visual-inertial measuring unit with a feature detection and matching processor, feeding an SfM estimator with separate system models for motion and structure state estimation)

In such a monolithic or tightly-coupled approach the different sensor units are no longer handled as two separate modules. The camera and the MODS are interpreted as a single visual-inertial sensing device which provides typical inertial measurements (3D acceleration, angular velocities, earth magnetic field) and feature correspondences as visual measures. Thus the feature detection and matching processor is formally included in the single measuring unit in this approach. Therefore the definition and implementation of this routine is one major task in this field. Based on the findings from the first stage of the project, an enhancement and possible expansion of the feature handling is planned at this stage. Furthermore, a strategy for handling multi-rate signals has to be considered and implemented based on the sensor devices used for inertial and visual sensing. Here especially the work of [1], which suggests a multi-rate Unscented Kalman



Filter (UKF) for camera egomotion estimation, shows the potential of multi-rate (MR) sensor fusion.

Acknowledgments

The authors acknowledge support from all the other fellows and students at the Laboratory for Image Processing Soest (LIPS), the Institute for Computer Science, Vision and Computational Intelligence (CV&CI) and all people at the University of Bolton who are involved in this project, especially Dr. Dennis Dodds.

References

[1] L. Armesto, S. Chroust, M. Vincze, and J. Tornero, “Multi-rate fusion with vision and inertial sensors,” in Robotics and Automation, 2004. Proceedings. ICRA '04, 2004 IEEE International Conference on, vol. 1, 2004, pp. 193–199.

[2] D. Aufderheide, “Spatial Reconstruction of Head-Geometry by Stereo Vision,” Master thesis, The University of Bolton, 2008.

[3] ——, “VISrec! - Progress reports 2010,” Technical report, The University of Bolton, South Westphalia University of Applied Sciences, 2010.

[4] D. Aufderheide and W. Krybus, “A Framework for real time Scene Modeling based on Visual-Inertial Cues,” in IADIS International Conference Computer Graphics, Visualization, Computer Vision and Image Processing 2010 (part of MCCSIS 2010), Yingcai Xiao, Tomaz Amon, and Piet Kommers, Eds. IADIS, 2010, pp. 385–390.

[5] ——, “Towards real-time camera egomotion estimation and three-dimensional scene acquisition from monocular image streams,” in 2010 International Conference on Indoor Positioning and Indoor Navigation (IPIN), R. Mautz, M. Kunz, and H. Ingensand, Eds. Zurich, Switzerland: IEEE, 2010, pp. 1–10.

[6] D. Aufderheide, M. Steffens, S. Kieneke, W. Krybus, C. Kohring, and D. Morton, “Detection of salient regions for stereo matching by a probabilistic scene analysis,” in Proceedings of the 9th Conference on Optical 3-D Measurement Techniques, Wien, 2009, pp. 328–331.

[7] M. Avriel, Nonlinear Programming: Analysis and Methods. Dover Publications, 2003.

[8] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, 2008.

[9] E. Bayro-Corrochano and B. Rosenhahn, “A geometric approach for the analysis and computation of the intrinsic camera parameters,” 2002.

[10] D. Bellot, A. Boyer, and F. Charpillet, “A new definition of qualified gain in a data fusion process: application to telemedicine,” in Information Fusion, 2002. Proceedings of the Fifth International Conference on, vol. 2, 2002, pp. 865–872.

[11] D. N. Bhat and S. K. Nayar, “Ordinal measures for visual correspondence,” in IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 351–357.

[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.

[13] R. S. Blum, Multi-Sensor Image Fusion and Its Applications (Signal Processing and Communications). CRC Press, 2005.

[14] M. Bruckner, F. Bajramovic, and J. Denzler, “Experimental Evaluation of Relative Pose Estimation Algorithms,” in VISAPP 2008: Proceedings of the 3rd International Conference on Computer Vision Theory and Applications, vol. 2, 2008, pp. 431–438.

[15] M. Bujnak, Z. Kukelova, and T. Pajdla, “A general solution to the P4P problem for camera with unknown focal length,” in CVPR 2008, Anchorage, Alaska, USA, 2008.

[16] M. Caruso, “Applications of magnetic sensors for low cost compass systems,” 2000, pp. 177–184.

[17] S. G. Chroust and M. Vincze, “Fusion of Vision and Inertial Data for Motion and Structure Estimation,” Journal of Robotic Systems, vol. 21, no. 2, 2004.

[18] P. Corke, J. Lobo, and J. Dias, “An Introduction to Inertial and Visual Sensing,” International Journal of Robotics Research, vol. 26, no. 6, 2007.

[19] K. Cornelis, M. Pollefeys, M. Vergauwen, L. V. Gool, and K. U. Leuven, “Augmented Reality using Uncalibrated Video Sequences,” in Lecture Notes in Computer Science, 2001.

[20] A. Davison, Real-time simultaneous localisation and mapping with a single camera. IEEE, 2003.

[21] A. Davison, I. Reid, N. Molton, and O. Stasse, “MonoSLAM: real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.

[22] H. F. Durrant-Whyte, “Sensor Models and Multisensor Integration,” The International Journal of Robotics Research, vol. 7, no. 6, pp. 97–113, 1988. [Online]. Available: http://ijr.sagepub.com/content/7/6/97.abstract

[23] O. Faugeras and Q.-T. Luong, The Geometry of Multiple Images: The Laws That Govern the Formation of Multiple Images of a Scene and Some of Their Applications. The MIT Press, 2004.

[24] D. L. Hall, Multisensor Data Fusion (Electrical Engineering & Applied Signal Processing Series). CRC Press, 2001.


[25] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of The Fourth Alvey Vision Conference, 1988, pp. 147–151.

[26] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.

[27] S. Heung-Yeung, K. Qifa, and Z. Zhengyou, “Efficient bundle adjustment with virtual key frames: a hierarchical approach to multi-frame structure from motion,” in Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. PR00149). IEEE Comput. Soc, pp. 538–543.

[28] J. Hol, T. Schon, F. Gustafsson, and P. Slycke, “Sensor Fusion for Augmented Reality,” in 9th International Conference on Information Fusion, Florence, Italy, 2006.

[29] J. Hol and P. Slycke, “2D-3D Model Correspondence for Camera Pose Estimation using Sensor Fusion,” in Proc. of the InerVis workshop at the IEEE International Conference on Robotics and Automation, 2005.

[30] J. D. Hol, “Pose Estimation and Calibration Algorithms for Vision and Inertial Sensors,” Master thesis, Linkoping University, 2008.

[31] S. Kieneke, M. Steffens, D. Aufderheide, W. Krybus, C. Kohring, and D. Morton, “Spatio-Temporal Scene Analysis Based on Graph Algorithms to Determine Rigid and Articulated Objects,” in Lecture Notes in Computer Science, vol. 5496, 2009.

[32] J. Kim and S. Sukkarieh, “Real-time implementation of airborne inertial-SLAM,” Robotics and Autonomous Systems, vol. 55, no. 1, pp. 62–71, 2007.

[33] G. Klein and D. Murray, “Parallel tracking and mapping on a camera phone,” in Proc. Eighth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'09), Orlando, October 2009.

[34] J. Llinas, Handbook of Multisensor Data Fusion: Theory and Practice, Second Edition (Electrical Engineering & Applied Signal Processing Series). CRC Press, 2008.

[35] J. Lobo, “Inertial Sensor Data Integration in Computer Vision Systems,” Master thesis, University of Coimbra, 2002.

[36] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[37] F. Mata and A. Jimenez, “Multisensor fusion: An autonomous mobile robot,” Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 22, no. 2, pp. 129–141, 1998.

[38] E. Michaelsen, W. v. Hansen, M. Kirchhof, J. Meidow, and U. Stilla, “Estimating the essential matrix: GOODSAC versus RANSAC,” 2006.

[39] H. Mitchell, Multi-Sensor Data Fusion: An Introduction. Berlin-Heidelberg: Springer Verlag, 2007.

[40] P. Moreels and P. Perona, “Evaluation of features detectors and descriptors based on 3D objects,” Int. J. Comput. Vision, vol. 73, no. 3, pp. 263–284, 2007.

[41] T. Morita and T. Kanade, “A sequential factorization method for recovering shape and motion from image streams,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 8, pp. 858–867, 1997.

[42] D. Nister, “An efficient solution to the five-point relative pose problem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.

[43] J.-S. Park, J.-H. Yoon, and C. Kim, “Stable 2D Feature Tracking for Long Video Sequences,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 1, no. 1, pp. 39–46, 2008.

[44] J. Philip, “A Non-Iterative Algorithm for Determining All Essential Matrices Corresponding to Five Point Pairs,” The Photogrammetric Record, vol. 15, no. 88, pp. 589–599, Oct. 1996.

[45] C. Poelman and T. Kanade, “A paraperspective factorization method for shape and motion recovery,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 206–218, 1997.

[46] M. Pollefeys, R. Koch, M. Vergauwen, and L. V. Gool, “Metric 3D surface reconstruction from uncalibrated image sequences,” in 3D Structure from Multiple Images of Large Scale Environments, LNCS Series. Springer-Verlag, 1998, pp. 138–153.

[47] H. Rehbinder and X. Hu, “Drift-free attitude estimation for accelerated rigid bodies,” Automatica, vol. 40, no. 4, pp. 653–659, 2004.

[48] V. Rodehorst, M. Heinrichs, and O. Hellwich, “Evaluation of Relative Pose Estimation Methods for Multi-camera Setups.”

[49] A. Sabatini, “Quaternion-based extended Kalman filter for determining orientation by inertial and magnetic sensing,” IEEE Transactions on Biomedical Engineering, vol. 53, no. 7, pp. 1346–1356, Jul. 2006.

[50] J. Sasiadek and Q. Wang, “Sensor fusion based on fuzzy Kalman filtering for autonomous robot vehicle,” vol. 4, 1999, pp. 2970–2975.

[51] P. G. Savage, “Computational Elements for Strapdown Systems,” 2009.

[52] J. Schmidt and H. Niemann, “Using Quaternions for Parametrizing 3-D Rotations in Unconstrained Nonlinear Optimization,” in Vision, Modeling, and Visualization 2001, T. Ertl, B. Girod, G. Greiner, H. Niemann, and H.-P. Seidel, Eds., Berlin, Amsterdam, 2001, pp. 399–406.


[53] M. Steffens, “Dynamic World Modelling by Dichotomic Information Sets and Graphical Inference with Focus on Facial Pose Tracking,” Dissertation, The University of Bolton, 2010.

[54] M. Steffens, S. Kieneke, D. Aufderheide, W. Krybus, C. Kohring, and D. Morton, “Stereo Tracking of Faces for Driver Observation,” in Lecture Notes in Computer Science, vol. 5575, 2009.

[55] H. Stewenius, C. Engels, and D. Nister, “Recent developments on direct relative orientation,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 60, no. 4, pp. 284–294, Jun. 2006.

[56] B. Tordoff and R. Cipolla, “Uncertain ransac.”

[57] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle Adjustment - A Modern Synthesis,” Lecture Notes in Computer Science, vol. 1883, 1999.

[58] R. Y. Tsai and T. S. Huang, “Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 1, pp. 13–27, 1984.

[59] A. Weckenmann, X. Jiang, K.-D. Sommer, U. Neuschaefer-Rube, J. Seewig, L. Shaw, and T. Estler, “Multisensor data fusion in dimensional metrology,” CIRP Annals - Manufacturing Technology, vol. 58, no. 2, pp. 701–721, 2009.

[60] J. Wendel, Integrierte Navigationssysteme. Oldenbourg Wissensch.Vlg, 2007.

[61] Z. Kukelova, M. Bujnak, and T. Pajdla, “Polynomial eigenvalue solutions to the 5-pt and 6-pt relative pose problems,” in Proceedings BMVC 2008, 2008.

Author Biographies

Dominik Aufderheide is an active researcher in the area of multi-sensor image processing and computer vision. He studied Electrical Engineering with a focus on embedded systems and signal processing at the South Westphalia University of Applied Sciences in Soest, Germany, and graduated in 2007 with the German diploma. Afterwards he joined an international master course at the University of Bolton, U.K., focused on electronic system design and engineering management, where he received an M.Sc. with Distinction in 2009. Currently he is a research fellow at the Institute for Computer Science, Vision and Computational Intelligence (CV&CI) in Soest, where he is working towards a Ph.D. degree in cooperation with the University of Bolton. His research interests are mainly focused on multi-sensor data fusion, computer vision, machine learning and smart sensor systems. Dominik Aufderheide has been an IEEE student member since 2007.

Werner Krybus is professor at the South Westphalia University of Applied Sciences in Soest, Germany. He studied Electrical Engineering at RWTH Aachen University and graduated in 1984. He received his Ph.D. from RWTH Aachen University on a topic about computer-assisted surgery. In 1996 W. Krybus joined the South Westphalia University of Applied Sciences as a professor for data systems engineering and signal processing. Dr. Krybus is founder of the Laboratory for Image Processing Soest within the Institute for Computer Science, Vision and Computational Intelligence. His primary research interests include embedded systems, signal processing and sensor fusion.
