
A Real Time Augmented Reality System Using GPU Acceleration

David Chi Chung Tam, Computer Science, Ryerson University, Toronto, Canada, [email protected]

Mark Fiala, Computer Science, Ryerson University, Toronto, Canada, [email protected]

Abstract—Augmented Reality (AR) is an application of computer vision that is processor intensive and typically suffers from a trade-off between robust view alignment and real time performance. Real time AR that can function robustly in variable environments is difficult to achieve on a PC (personal computer), let alone on the mobile devices where AR will likely be adopted as a consumer application. Despite the availability of high quality feature matching algorithms such as SIFT and SURF and robust pose estimation algorithms such as EPnP, practical AR systems today rely on older methods such as Harris/KLT corners and template matching for performance reasons. SIFT-like algorithms are typically used only to initialize tracking by these methods. We demonstrate a practical system with real time performance using only SURF, without the need for tracking. We achieve this with extensive use of the Graphics Processing Unit (GPU) now prevalent in PCs. As mobile devices become equipped with GPUs, we believe this architecture will lead to practical, robust AR.

Keywords-augmented reality, feature detection, real-time, GPU, CUDA

I. INTRODUCTION

Figure 1. Example of virtual content aligned with real time video.

Augmented Reality (AR) gives a user the impression of computer generated objects coexisting in the environment around them, such as video game characters running across their newspaper, or walking around architectural previews of proposed buildings seen in an outdoor environment, without looking at a conventional computer screen. An outward facing video camera on a mobile device provides video shown on the screen of the device, or in a Head Mounted Display (HMD), with overlaid virtual computer generated objects that appear to belong in the scene. This illusion is achieved by calculating the location of the real video camera, setting the virtual computer graphics camera to the same pose, rendering Computer Generated (CG) graphics, and compositing the two. An example from our system is shown in Fig. 1.

[Figure 2 block diagram: Camera Input → Image Unwarping (Camera Matrix, Distortion Coefficients) → Feature Detection → Descriptor Matching (3D Point-Descriptor Map) → Pose Recovery → Project Virtual Objects → Display]

Figure 2. Execution flow for a single video frame in our AR system

AR systems rely on this pose estimation, which can be accomplished by various means, such as the solid state gyroscope in modern mobile devices used in many 'apps' such as 'Layar' (www.layar.com). Users can hold up their smartphone and look around as if the phone was a piece of 'virtual glass' to see arrows pointing to restaurants, subway entrances, etc. However, properly aligning the virtual camera requires rotational accuracy better than 0.1 degrees, which is not feasible with the orientation sensors in mobile devices. In addition, the position accuracy is not sufficient to render virtual objects that are close to the user, such as a virtual statue the user can walk around. AR applications relying on these sensors do not provide a satisfactory user experience due to the drift and inaccuracy of both the orientation and position estimates. However, computer vision can process the video input itself and calculate the necessary pose. Modern algorithms have made this possible using 'natural features' in the environment.

The prospect of using the video imagery itself to calculate pose is a potentially very efficient solution that can provide the necessary pixel (or indeed sub-pixel) pose accuracy, if only it can be made robust enough and able to run at a fast enough frame rate with minimal latency. This is a difficult prospect even with today's (2012) personal computer processors and algorithms; however, much of the necessary processing is of a type that can be off-loaded to a processing unit originally intended for graphics: the GPU. The GPU has increasingly been used for non-graphics work, to the point that graphics chip producers have created tools to allow general purpose (GPGPU) processing on these devices. CUDA [1], created by NVIDIA (www.nvidia.com) to run C-like code on its GPUs, is one example and is used in the system presented in this paper.

Figure 2 shows a typical AR system that uses computer vision as the pose calculation component. Along with some details specific to a system similar to ours, Figure 2 shows a feature detection stage, a matching stage, and a pose recovery stage. This is the flow desired in most architectures, but it is usually not achieved because of the non-real time performance typically encountered in almost all of the stages.

We demonstrate a real time AR system using mixed CPU and GPU processing that does achieve this goal, without resorting to secondary tracking methods as other systems do, and provide details of its architecture and implementation.

Typical feature detectors used (or desired to be used) are SIFT [2] and SURF [3]. SIFT was a revolutionary algorithm that allowed a patch of pixels corresponding to an object to be recognized despite changes in lighting, rotation, object size in the image (zoom, distance) and even the camera used. In particular, the ability to recognize matching keypoints despite rotation and scale differences enables many computer vision systems to this day. SIFT involves finding salient, stable points that are likely to be found in other images and computing an abstract descriptor vector for each. Matching points between two images then reduces to finding a list of SIFT keypoints for each image and matching the descriptor vectors, usually with a dot product or SSD (Sum of Squared Differences) comparison. Unfortunately, due to its processing requirements SIFT is relegated to offline processes, and the research community has focused on designing algorithms that compare to SIFT in matching reliability but run in real time. SURF [4] is one popular example, relying on integral images to approximate integration with Gaussian and other kernels.

Our system relies on a previously created map file of 3D points, each with an associated descriptor. The matching process is performed rapidly in real time, also on the GPU, providing a list of 3D to 2D correspondences for each image which is then input to a pose recovery stage. As with feature extraction, descriptor matching usually consumes a lot of CPU processing, and complex tree based search algorithms are usually employed to make this stage run at real time speeds. Due to the highly parallel nature of GPUs, however, a full match can be done on a large set of points without the reliability losses usually incurred by the grouping approximations necessary for fast matching.

II. RELATED WORK

Real time AR systems often rely on planar target patterns such as fiducial marker arrays [5] or highly textured surfaces such as artwork [6]. This is useful for augmentations such as advertisements coming out of magazine pages, but is less applicable to AR in three-dimensional environments. The vision system has to first identify a pattern from a database and then calculate a set of 2D to 2D matches, after which typically a homography matrix can be found from which pose, or just a projection matrix, can be determined. Arth et al. [7] demonstrate a similar system for outdoor use on a mobile phone, recovering poses using panorama images aligned to GPS coordinates.

For 3D environments a large set of descriptor vectors is typically needed, often with several descriptors per 3D point to capture its appearance as seen from widely varying viewpoints. Real time systems, especially on mobile devices, usually do not run full feature extraction on each frame, but rather use it only when the system is 'lost', and otherwise use 2D feature tracking to approximate following the 3D points. The system of [8] is an example of this. Typically the 2D tracker does not follow the same image points as the features associated with the 3D point set, and so the pose estimation breaks down after a sizable translation. However, 2D features can be tracked much faster, and the extracted pose is sufficient for nearby positions.

Lee et al. [9] proposed such a method. In the online stage it uses the Lucas-Kanade (LK) tracker [10] to track the target surface. It is initialized by pointing the camera toward a planar reference image and using a Shi-Tomasi corner detector to find interest points on it. The LK tracker then tracks the movement of the interest points for the calculation of the camera's pose. Only when the tracked surface leaves the camera view is feature detection performed again. This allowed their method to achieve 30 frames per second, but at a moderately low image resolution of 320×240 pixels.

Some research, including this paper, has removed the restriction that all processing occur on a single processor, as in the traditional paradigm. The presence of multiple processing cores (so-called multi-core systems, e.g. 'quad-core') in modern computers allows the work to be divided into two or more processes. The multi-core approach uses a small number of similar high performance CPU cores, typically two, to handle different tasks in different threads. The PTAM system [11] tracks 2D FAST features [12] for each frame in one thread and calculates pose using bundle adjustment slowly over several frames in a second thread, occasionally correcting the pose calculated by the first thread. Lee and Hollerer [13] presented another multi-threaded solution. The user's hand is tracked in one thread while feature detection with SIFT runs in a separate thread. Yet another thread is responsible for using optical flow to track the movement of the SIFT features, along with pose recovery supported by RANSAC to remove outliers and off-plane features.

In contrast to using multiple CPU cores, a system can use a single main CPU and the large array of simpler processors found in the GPU typically used for graphics rendering. The real time Speeded Up SURF library created by the UTIAS group [14] allows SURF feature extraction at high frame rates on computers with a relatively modern NVIDIA GPU. We use this library for the feature extraction step in Fig. 2.

Matching of descriptor vectors is another stage which often slows down an AR system, and it is usually performed on the CPU. SIFT descriptors are 128-element vectors; SURF descriptors are vectors of 64 floating point numbers. Two descriptor vectors are said to match if their Euclidean distance is below a threshold; this can be evaluated with a sum of squared differences (SSD) or a dot product operation on vectors of equal length. If there are N features found in a video frame and M features in a database, then there are N×M comparisons to perform, each of which is a 64 element multiply-accumulate (MAC) operation for SURF. For a typical 640×480 image there are N = 500 to 2000 SURF features in a complex scene, and possibly a few thousand features in a database. Even for small environments this quickly leads to hundreds of millions of floating point operations per frame. To alleviate this load some researchers have preprocessed the database into a tree structure, where each feature from a camera image becomes a query into a tree already sorted to reduce computations. K-means clustering and k-d trees [15], and more rapid but less accurate systems such as ANN [16] and FLANN [17], are used for fast matching. The benefit of these approaches is reduced computation for a single CPU, but at the cost of greater complexity and some missed matches.
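
As a rough illustration, using assumed representative values rather than measured ones: with N = 1000 image descriptors and M = 3000 map descriptors, brute force matching requires 1000 × 3000 = 3 × 10^6 dot products of 64 elements each, about 1.9 × 10^8 multiply-accumulates per frame, or roughly 6 × 10^9 per second at 30 frames per second.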

GPUs were designed to handle a large number of math operations such as MACs in order to render complex scenes rapidly, for example in video gaming. GPUs today contain a large array of separate MAC units that can run in parallel, so it is possible to perform a full comparison of all the vectors. An efficient way to do this is to use the fast matrix multiply operations created for GPUs, placing the descriptors from the image as rows of the first matrix and the descriptors from the database as columns of the second matrix. The best match for each video frame feature is then determined by the largest element in its corresponding row or column. The CUBLAS library [18], a library of optimized GPU-accelerated matrix operations distributed with CUDA, is used for this purpose in our system.

III. REAL TIME GPU BASED AUGMENTED REALITY SYSTEM

Our system uses the architecture of a single CPU and a GPU, with the bulk of the processing occurring on the GPU. Feature detection, matching, and the RANSAC reprojection portion of pose recovery are performed on the GPU, with the CPU performing only the initial image correction, the remainder of pose recovery, and system management. We use the CUDA programming language on a laptop computer with a CUDA capable NVIDIA GPU.

Video frames are captured and unwarped to remove lens distortion as in Fig. 2. This image correction is performed by the OpenCV implementation of unwarping using Zhang's calibration parameters [19].

The unwarped image is then moved by the CPU into GPU memory, where feature detection is performed by the Speeded Up SURF library [14]. Robust interest points are extracted from the camera image and a 64-dimensional SURF descriptor is generated for each.

Matching is performed with matrix multiplication using the CUBLAS matrix library and our own GPU code to search the output matrix for the best matches. The code applies a minimum threshold to decide whether a match is found at all, and also applies a uniqueness criterion, declaring the match invalid if the ratio of the best to the second best match score is not above a threshold ratio.

After the 2D-3D correspondences are made, the camera's position and orientation in the real world are calculated using OpenCV's PnP solver [20], with a GPU-accelerated RANSAC [21] loop to improve the stability of the recovered pose by reducing the effect of incorrect 2D-3D correspondences (outliers). Finally, with the calculated pose, virtual objects defined in real-world coordinates are projected onto the display using the recovered position and orientation of the camera.


Figure 3. Interest points detected with Speeded Up SURF and matched using their descriptors.

The resulting CG graphics are drawn over the camera image and the result is displayed on the screen. The stages of the system shown in Fig. 2 are described below.

A. Image Acquisition and Unwarping

A Firefly IEEE-1394 (FireWire) camera provides a 60 Hz stream of greyscale images at a resolution of 640×480 pixels into CPU memory. The camera has a low cost, small lens with radial distortion that should not be neglected for accurate pose estimation. An offline calibration routine was performed to obtain the unwarping parameters, which are applied using the OpenCV lookup table unwarp function.
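
A minimal sketch of this correction step, assuming OpenCV's standard undistortion functions (cv::initUndistortRectifyMap and cv::remap) with placeholder variable names; the actual calibration values come from the offline Zhang procedure [19]:

    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>

    // Build the unwarping lookup tables once at start-up.
    void buildUndistortMaps(const cv::Mat& cameraMatrix,
                            const cv::Mat& distCoeffs,
                            cv::Size imageSize,
                            cv::Mat& map1, cv::Mat& map2)
    {
        // Identity rectification; keep the same camera matrix for the output.
        cv::initUndistortRectifyMap(cameraMatrix, distCoeffs, cv::Mat(),
                                    cameraMatrix, imageSize,
                                    CV_16SC2, map1, map2);
    }

    // Lookup-table based unwarp applied to every captured frame.
    void unwarpFrame(const cv::Mat& raw, cv::Mat& undistorted,
                     const cv::Mat& map1, const cv::Mat& map2)
    {
        cv::remap(raw, undistorted, map1, map2, cv::INTER_LINEAR);
    }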

B. Feature Detection

Feature detection involves finding interest points and ascribing a descriptor vector to each. The interest points have an image location, as well as a rotation and scale measurement. The coloured circles in Figure 3 show some examples of interest points detected in the images: the size of each circle indicates the scale of the interest point, and the line inside the circle indicates its computed orientation.

With the Lenovo W520 laptop used for development of this AR system, for 640×480 images and default parameters, Speeded Up SURF is capable of performing keypoint detection and descriptor generation in less than 20 milliseconds.

The Speeded Up SURF parameters used are as follows:

• 8 octaves, the maximum possible for Speeded Up SURF.

• 9 intervals per octave, the maximum possible for Speeded Up SURF. This is assumed to mean how many scale levels each octave is subdivided into, as described in the original SURF paper [3].

• Interest operator threshold of 0.2.

• First octave scale of 2, the same as the default. Any smaller will result in too many keypoints with scales too small to be useful for keypoint matching.

• Descriptor computation enabled, which is necessary for matching.

• Orientation computation enabled. SURF can either be 'upright SURF' for cases where the camera stays level (such as on a mobile robot) or can have the extra processing for rotational invariance enabled. A human user can move the camera freely, so we need the rotational invariance.

The parameters were chosen from experimentation in indoor and outdoor environments. For the interest operator threshold, increasing the value causes fewer interest points to be detected. To improve matching performance over different distances between the camera and the 2D mapped surfaces, the number of octaves and intervals are set to the maximum supported.

C. Descriptor Matching

After the interest points in the camera image have been detected, they can be matched to the 3D map's points using their descriptors. Figure 3 illustrates two examples of descriptor matching, where differently coloured lines represent different matches of interest points in each of the image pairs. As the images show, a few of the matches are incorrect, but in the pose recovery step these outliers are taken care of by RANSAC [21].

In this AR system, 'brute force' keypoint matching is used, with the dot product as the comparison criterion. This is a highly data-parallel operation well suited to GPUs. Recursive tree-based matching algorithms such as k-d trees [15] may be more efficient on a CPU but are not a good fit for the parallel nature of GPU implementations.

1) Step 1: Dot product calculation: By putting the two sets of descriptors into matrices, as rows in the first matrix and as columns in the second, their product is a matrix containing the dot product of every descriptor pair. This can be computed with fast matrix multiplication using the function cublasSgemm from the CUBLAS library (CUDA Basic Linear Algebra Subprograms) [18].

The descriptor sets are as follows:

• The first descriptor set, Dp, represents keypoints from the image captured by the camera, arranged as an N × 64 matrix where N is the number of keypoints in the camera's image. This descriptor set is copied to the GPU memory every frame.

D_p = \begin{bmatrix}
  dp_{0,0} & dp_{0,1} & \cdots & dp_{0,63} \\
  dp_{1,0} & dp_{1,1} & \cdots & dp_{1,63} \\
  dp_{2,0} & dp_{2,1} & \cdots & dp_{2,63} \\
  \vdots   & \vdots   & \ddots & \vdots    \\
  dp_{N,0} & dp_{N,1} & \cdots & dp_{N,63}
\end{bmatrix}


• The second descriptor set, DmT, comes from the map file, arranged as a transposed 64 × M matrix where M is the number of 3D points in the database. This descriptor set is copied to the GPU memory during program initialization only.

D_m^T = \begin{bmatrix}
  dm_{0,0}  & dm_{1,0}  & dm_{2,0}  & \cdots & dm_{M,0}  \\
  dm_{0,1}  & dm_{1,1}  & dm_{2,1}  & \cdots & dm_{M,1}  \\
  dm_{0,2}  & dm_{1,2}  & dm_{2,2}  & \cdots & dm_{M,2}  \\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
  dm_{0,63} & dm_{1,63} & dm_{2,63} & \cdots & dm_{M,63}
\end{bmatrix}

The matrices are copied from the computer's memory to the GPU's device memory in the column-major format required by CUBLAS. The two matrices are multiplied together using the function cublasSgemm, which returns the resulting N × M matrix S containing the dot products of every keypoint combination:

S = D_p D_m^T

The rows of S correspond to the keypoint indices in the camera image and the columns to the indices of 3D points in the map, stored in column-major order:

S = \begin{bmatrix}
  s_{p0,m0} & s_{p0,m1} & s_{p0,m2} & \cdots & s_{p0,mM} \\
  s_{p1,m0} & s_{p1,m1} & s_{p1,m2} & \cdots & s_{p1,mM} \\
  s_{p2,m0} & s_{p2,m1} & s_{p2,m2} & \cdots & s_{p2,mM} \\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
  s_{pN,m0} & s_{pN,m1} & s_{pN,m2} & \cdots & s_{pN,mM}
\end{bmatrix}

where s_{px,my} is the dot product score between the xth 2D point from the picture and the yth 3D point from the map.

2) Step 2: Picking best matches: The second step of keypoint matching is to select the best match between each keypoint in the camera image and the 3D points in the map file, using the dot product results calculated in the previous step. The resulting matrix from the previous step remains in the GPU's memory for this step, avoiding time-consuming memory transfers between host and device.

A CUDA kernel is launched with N threads on the GPU, where N is the number of keypoints in the camera's image. Each keypoint in the camera's image is assigned a thread, with the ith thread responsible for the ith keypoint. Each thread performs an exhaustive search through its keypoint's row of the dot product matrix, looking for the best and second best matches. Two thresholds are then checked before a chosen match is accepted:

• The relative threshold is the minimum ratio between the best and second best match scores.

• The absolute threshold is the minimum best-match score; weaker matches are discarded.

The column-major order of the dot product matrix allows 'coalesced' memory access, so all memory operations in this kernel are performed on global memory. The result of this step is a one-dimensional array of matching indices, where the ith element contains the ID of the 3D point that the ith keypoint from the camera image is matched to. For keypoints without a matching 3D point, the matching index is set to -1. A kernel following this scheme is sketched below.
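
A minimal sketch of such a kernel, assuming the score matrix is stored column-major with leading dimension N as in the CUBLAS call above; the kernel, variable names, and threshold values are illustrative rather than the exact implementation:

    // One thread per camera-image keypoint; each scans one row of the N x M
    // score matrix (column-major, leading dimension N) for its best and second
    // best dot product scores, then applies the two thresholds.
    __global__ void pickBestMatches(const float* d_S, int N, int M,
                                    float absThreshold, float ratioThreshold,
                                    int* d_matchIndex)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float best = -1.0f, second = -1.0f;
        int bestIdx = -1;

        for (int j = 0; j < M; ++j) {
            float s = d_S[j * N + i];       // element (i, j) in column-major order
            if (s > best) {
                second = best;
                best = s;
                bestIdx = j;
            } else if (s > second) {
                second = s;
            }
        }

        // Reject weak matches and matches that are not sufficiently unique.
        bool strongEnough = best >= absThreshold;
        bool uniqueEnough = (second <= 0.0f) || (best / second >= ratioThreshold);
        d_matchIndex[i] = (strongEnough && uniqueEnough) ? bestIdx : -1;
    }

A launch such as pickBestMatches<<<(N + 255) / 256, 256>>>(...) assigns one thread per keypoint, as described above.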

3) Step 3: Eliminating multiple matches to the same 3D point: The last step of keypoint matching is to scan through the array of matching indices and eliminate duplicate matches to the same 3D point, keeping the one with the best matching score. This ensures that only one keypoint from the camera image is matched to any given 3D point in the map.

This step is implemented as a CUDA kernel executed on the GPU with the same number of threads as the previous step, i.e. the number of keypoints in the camera's image. Each thread, if its keypoint has a match, searches through the array concurrently for any other keypoint that has the same matching target, and compares the matching scores. If the thread's own keypoint has a better matching score than the other keypoint, that keypoint's matching index is set to -1 and the loop continues for this thread. Otherwise, if the other keypoint has a better matching score, the thread's own keypoint has its matching index set to -1 and the loop ends for this thread. All memory operations in this kernel are performed on global memory.
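
A sketch of this de-duplication pass, continuing the assumptions of the previous sketch (scores read back from the column-major matrix d_S, match indices in d_matchIndex); the tie-break on keypoint index is an illustrative addition so that exactly one of two equally scored duplicates survives:

    // One thread per camera-image keypoint. For every 3D map point, keeps only
    // the camera keypoint with the highest matching score.
    __global__ void removeDuplicateMatches(const float* d_S, int N,
                                           int* d_matchIndex)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        int target = d_matchIndex[i];
        if (target < 0) return;                        // this keypoint has no match

        float myScore = d_S[target * N + i];

        for (int j = 0; j < N; ++j) {
            if (j == i || d_matchIndex[j] != target) continue;

            float otherScore = d_S[target * N + j];
            bool otherWins = (otherScore > myScore) ||
                             (otherScore == myScore && j < i);  // index tie-break
            if (otherWins) {
                d_matchIndex[i] = -1;                  // drop our own match and stop
                return;
            } else {
                d_matchIndex[j] = -1;                  // drop the weaker duplicate
            }
        }
    }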

D. Pose Recovery

After the descriptor matching step a number of the interest points in the camera image have correspondences to 3D world points, allowing the estimation of the camera's pose in the real world.

Pose recovery is performed by solvePnPRansac from the GPU module of OpenCV 2.3.1 [20]. This function calculates pose from sets of four 2D-3D correspondences and the camera matrix within a RANSAC loop. In our system 64 RANSAC iterations were chosen empirically.

E. GPU Acceleration

CUDA is a C-like language created by NVIDIA for programming their GPUs for general purpose computing.

In CUDA the smallest scheduled unit of execution is a warp, consisting of 32 parallel threads. Each multiprocessor is capable of executing 24 to 48 warps concurrently, depending on the architecture of the GPU, giving 768 to 1536 concurrent threads per multiprocessor [22]. Each GPU chip contains multiple multiprocessors: high end GPUs contain a large number of them, while budget GPUs at the bottom end may contain as few as one multiprocessor on the die. For instance, the NVIDIA Quadro 2000M, a mid-range laptop GPU as of this writing, contains 4 multiprocessors, each capable of 48 concurrent warps. This allows a grand total of 4 × 48 × 32 = 6144 concurrent threads to be executed on this GPU. A high-end laptop GPU of the same architecture, such as a GeForce 580M, contains twice as many multiprocessors and can thus execute twice as many concurrent threads.
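
To illustrate how the per-keypoint kernels of Section III-C map onto warps, the helper below (a hypothetical function, not from the paper) rounds the number of keypoints up to whole 256-thread blocks, i.e. whole groups of eight 32-thread warps:

    // Illustrative launch configuration: one thread per keypoint, rounded up to
    // whole 256-thread blocks (8 warps of 32 threads each).
    inline dim3 gridForKeypoints(int numKeypoints, int threadsPerBlock = 256)
    {
        return dim3((numKeypoints + threadsPerBlock - 1) / threadsPerBlock);
    }

    // Example: for N = 1500 keypoints this launches 6 blocks of 256 threads
    // (1536 threads), which a single Fermi-class multiprocessor can hold.
    // pickBestMatches<<<gridForKeypoints(N), 256>>>(d_S, N, M,
    //                                               absThreshold, ratioThreshold,
    //                                               d_matchIndex);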

F. Offline Phase - 3D Map Creation

The 3D map file for this AR system is a collection of 3D points in real-world space, and each 3D point has an associated 64-dimensional Speeded Up SURF descriptor, allowing the matching between 2D image points and 3D world points that is used for pose recovery. As with [23], the creation of 3D maps is an offline process. The map files used in this paper were of planar surfaces captured from images.
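
For illustration, a map entry can be thought of as a 3D position paired with its descriptor; the structure below is a hypothetical sketch of such a map in memory, not the actual file format used by the system:

    #include <vector>
    #include <opencv2/core/core.hpp>

    // Hypothetical in-memory form of one map entry: a world-space point and the
    // 64-element Speeded Up SURF descriptor captured for it offline.
    struct MapPoint {
        cv::Point3f position;    // 3D location in real-world coordinates
        float descriptor[64];    // SURF descriptor used for matching
    };

    // The whole map is a list of such entries; at program start the descriptors
    // are packed into the 64 x M matrix DmT and uploaded once to the GPU.
    typedef std::vector<MapPoint> Map3D;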

IV. EXPERIMENTS

For these experiments a Lenovo W520 laptop was used, equipped with an Intel Core i7 2720QM CPU and a mid-range NVIDIA Quadro 2000M GPU (footnote 3), which has approximately half the parallel processing power of the top-end laptop GPUs available as of this writing. A single IEEE-1394 camera was used for video input, with a resolution of 640×480 pixels.

For pose recovery we chose 64 iterations for the RANSAC loop as a balance between computation speed and stability of the recovered pose. For the interest operator threshold, increasing the value causes fewer interest points to be detected.

A. Indoor experiments

The first three images in the top row of Figure 4 show the ideal scenario (set 1): a planar and feature-rich wall that allows a very large number of Speeded Up SURF interest points to be detected and mapped, giving stable recovered poses. The scene in the rightmost image of the same row, half a flight up a staircase, is not perfectly planar but appears to be so unless the camera is extremely close to the railings, allowing that surface to be recorded in the 3D map and a pose to be recovered.

The 3D map of the lab set shown in the middle row of Figure 4 (set 2) is an example where 3D points from multiple locally planar surfaces are appended to create a more complex 3D map. In the first image, two planar surfaces from the map are tracked, one on the silver numbered boxes and another on the bricks of the wall, to give more pose stability. In the second and third images the feature-rich wall is tracked, showing successful pose recovery with the camera far away from and close to it, respectively. In the last image of the row, the surface the doors lie on allowed only limited matches, and the pose is less stable than in the other three images of the row.

Footnote 3: http://www.nvidia.com/page/quadrofx go.html

Timing results for the indoor experiments:

                                  Set 1    Set 2
    Number of points in 3D map    1987     2328
    Image unwarping (ms)          5        5
    Keypoint detection (ms)       30       28
    Descriptor matching (ms)      6        6
    Pose recovery (ms)            33       37
    Graphic rendering (ms)        3        4
    Overall frame rate (fps)      11.2     10.7

B. Outdoor experiments

The scene in the top row of Figure 5 yielded steady recovered poses; the irregular brick pattern on the front surface of the arched brick wall worked well with the Speeded Up SURF feature detector. However, the maximum distance from which a pose can be recovered from this surface is limited, as shown in the first image, because these bricks did not allow large-scale interest points to be detected. When the camera tilted upwards in the third image, the camera reduced its exposure and increased its shutter speed, causing the image to darken; no interest points were available to be matched to the 3D map and a pose could not be computed.

The scene in the bottom row of Figure 5 resulted in translation errors in the z-position throughout the sequence, with the virtual objects constantly bobbing up and down. Future work will concentrate on pose filtering to reduce the unpleasant vibration in the augmentations. The recovered x and y positions, however, are accurate.

Timing results for the outdoor experiments:

                                  Set 1    Set 2
    Number of points in 3D map    3024     1925
    Image unwarping (ms)          5        5
    Keypoint detection (ms)       40       35
    Descriptor matching (ms)      15       8
    Pose recovery (ms)            32       34
    Graphic rendering (ms)        3        3
    Overall frame rate (fps)      9.6      10.2

V. CONCLUSION

A real time augmented reality system was created that performs true feature extraction and pose estimation on every frame, without relying on 2D tracking. SURF feature points are found in each frame, matched to a database of 3D points, and the pose is extracted using OpenCV's PnP solver. An architecture of a single CPU thread and a large array of GPU processors was used, with the majority of the processing performed on the GPU. Due to the inclusion of GPU units in mobile devices, we expect this to be the architecture of choice in future consumer AR systems.

ACKNOWLEDGMENT

This work was supported by NSERC Discovery Grant 356072.


Figure 4. Indoor experiments, panels (a)-(h).

Figure 5. Outdoor experiments, panels (a)-(h).

REFERENCES

[1] www.nvidia.com/object/cuda home.html.

[2] D. Lowe, "Object recognition from local scale-invariant features," in International Conference on Computer Vision, Sept. 1999, pp. 1150–1157.

[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346–359, 2008.

[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, 2008, pp. 346–359.

[5] M. Fiala, "ARTag, a fiducial marker system using digital techniques," in CVPR'05, vol. 1, 2005, pp. 590–596.

[6] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, Jun. 2006, pp. 2161–2168, oral presentation.

[7] C. Arth, M. Klopschitz, G. Reitmayr, and D. Schmalstieg, "Real-time self-localization from panoramic images on mobile devices," Mixed and Augmented Reality, IEEE/ACM International Symposium on, vol. 0, pp. 37–46, 2011.

[8] F. Wientapper, H. Wuest, and A. Kuijper, "Composing the feature map retrieval process for robust and ready-to-use monocular tracking," in Computers and Graphics, vol. 35, 2011, pp. 778–788.

[9] A. Lee, J.-Y. Lee, S.-H. Lee, and J.-S. Choi, "Markerless augmented reality system based on planar object tracking," in Frontiers of Computer Vision (FCV), 2011 17th Korea-Japan Joint Workshop on, Feb. 2011, pp. 1–4.

[10] J. Shi and C. Tomasi, "Good features to track," in CVPR, Seattle, Washington, USA, June 1994, pp. 593–600.

[11] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in ISMAR, 2007, pp. 225–234.

[12] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in IEEE International Conference on Computer Vision, vol. 2, Oct. 2005, pp. 1508–1511.

[13] T. Lee and T. Hollerer, "Hybrid feature tracking and user interaction for markerless augmented reality," in Virtual Reality Conference, 2008. VR '08. IEEE, March 2008, pp. 145–152.

[14] http://asrl.utias.utoronto.ca/code/gpusurf/.

[15] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, Sept. 1975.

[16] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching fixed dimensions," J. ACM, vol. 45, pp. 891–923, Nov. 1998.

[17] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Applications (VISAPP'09), INSTICC Press, 2009, pp. 331–340.

[18] "CUBLAS library." [Online]. Available: http://developer.nvidia.com/cublas

[19] Z. Zhang, "Flexible camera calibration by viewing a plane from unknown orientations," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 1, 1999, pp. 666–673.

[20] OpenCV 2.3.1 documentation: camera calibration and 3D reconstruction. [Online]. Available: http://opencv.itseez.com/modules/gpu/doc/camera calibration and 3d reconstruction.html#gpu-solvepnpransac

[21] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, 1981, pp. 381–395.

[22] CUDA C Programming Guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA C Programming Guide.pdf

[23] F. Wientapper, H. Wuest, and A. Kuijper, "Reconstruction and accurate alignment of feature maps for augmented reality," in 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2011 International Conference on, May 2011, pp. 140–147.
