A Real Time Augmented Reality System Using GPU Acceleration
David Chi Chung Tam
Computer Science
Ryerson University
Toronto, Canada
Mark Fiala
Computer Science
Ryerson University
Toronto, Canada
Abstract—Augmented Reality (AR) is an application of computer vision that is processor intensive and typically suffers from a trade-off between robust view alignment and real time performance. Real time AR that can function robustly in variable environments is difficult to achieve on a PC (personal computer), let alone on the mobile devices where AR will likely be adopted as a consumer application. Despite the availability of high quality feature matching algorithms such as SIFT and SURF, and robust pose estimation algorithms such as EPnP, practical AR systems today rely on older methods such as Harris/KLT corners and template matching for performance reasons. SIFT-like algorithms are typically used only to initialize tracking by these methods. We demonstrate a practical system with real time performance using only SURF, without the need for tracking. We achieve this with extensive use of the Graphics Processing Unit (GPU) now prevalent in PCs. As mobile devices become equipped with GPUs, we believe that this architecture will lead to practical, robust AR.
Keywords-augmented reality, feature detection, real-time, GPU, CUDA
I. INTRODUCTION
Figure 1. Example of virtual content aligned with real time video.
Augmented Reality (AR) gives a user the impression of computer generated objects coexisting in the environment around them, such as video game characters running across their newspaper, or architectural previews of proposed buildings seen in an outdoor environment, without looking at a conventional computer screen. An outward facing video camera provides video shown on the screen of a mobile device, or in a Head Mounted Display (HMD), with overlaid virtual computer generated objects that appear to belong in the scene. This illusion is achieved by calculating the location of the real video camera, setting the virtual computer graphics camera to the same pose, rendering Computer Generated (CG) graphics, and compositing the two. An example from our system is shown in Fig. 1.
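The camera-alignment step above lends itself to a short illustration. Below is a minimal sketch, assuming the recovered pose is given as a rotation matrix R and translation vector t with Xc = R*X + t (our notation; the paper does not specify a representation), of building an OpenGL-style modelview matrix for the virtual camera:

    // Minimal sketch (not the authors' code): build a 4x4 OpenGL-style
    // modelview matrix from a camera pose recovered by computer vision.
    #include <array>

    using Mat4 = std::array<float, 16>; // column-major, as OpenGL expects

    Mat4 modelviewFromPose(const float R[3][3], const float t[3]) {
        Mat4 m{};
        for (int col = 0; col < 3; ++col)
            for (int row = 0; row < 3; ++row)
                m[col * 4 + row] = R[row][col];   // rotation block
        m[12] = t[0]; m[13] = t[1]; m[14] = t[2]; // translation block
        m[15] = 1.0f;
        // Note: OpenGL's camera looks down -Z while most vision conventions
        // look down +Z, so a real system would also flip the Y and Z axes.
        return m;
    }

Rendering the CG content with this matrix, then compositing it over the video frame, produces the effect shown in Fig. 1.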
[Figure 2. Execution flow for a single video frame in our AR system: Camera Input → Image Unwarping (using the camera matrix and distortion coefficients) → Feature Detection → Descriptor Matching (against the 3D point-descriptor map) → Pose Recovery → Project Virtual Objects → Display.]
AR systems rely on this pose estimation, which can be accomplished by various means, such as the solid state gyroscope in modern mobile devices used in many 'apps' such as 'Layar' (www.layar.com). Users can hold up their smartphone and
look around as if the phone were a piece of 'virtual glass' to see arrows pointing to restaurants, subway entrances, etc. However, properly aligning the virtual camera requires rotational accuracy better than 0.1 degrees, which is not feasible with the orientation sensors in mobile devices. In addition, the position accuracy is not sufficient to render virtual objects that are close to the user, such as a virtual statue the user can walk around. AR applications relying on these sensors do not provide a satisfactory user experience due to the drift and inaccuracy of both the orientation and position sensors. However, computer vision can process the video input itself and calculate the necessary pose. Modern algorithms have made this possible using 'natural features' in the environment.
The prospect of using the video imagery itself to calculate pose is a potentially very efficient solution that can provide the necessary pixel (or indeed sub-pixel) pose accuracy, if only it can be made robust enough and able to run at a fast enough frame rate with minimal latency. This is difficult even with today's (2012) personal computer processors and algorithms; however, much of the processing necessary is of a type that can be off-loaded to a processing unit originally intended for graphics: the GPU. The GPU has increasingly been used for non-graphics work, to the point that graphics chip producers have created tools to allow general purpose (GPGPU) processing on these devices. CUDA [1], created to run C-like code on GPUs made by NVIDIA (www.nvidia.com), is one example, and is used in the system presented in this paper.
Figure 2 shows a typical AR system that uses computer vision as the pose calculation component, along with some details specific to a system similar to ours. It comprises a feature detection stage, a descriptor matching phase, and a pose recovery stage. This is the flow desired in most architectures, but it is usually not achieved because of the non-real time performance typically encountered in almost all of the stages.

We demonstrate a real time system for AR using mixed CPU and GPU processing that does achieve this goal, without resorting to secondary tracking methods as in other systems, and provide details of its architecture and implementation.
Typical feature detectors used (or desired to be used) are SIFT [2] and SURF [3]. SIFT was a revolutionary algorithm that allowed a patch of pixels corresponding to an object to be recognized despite changes in lighting, rotation, object size in the image (zoom, distance) and even the camera used. Specifically, the ability to recognize matching keypoints despite rotation and scale differences has enabled many computer vision systems to this day. SIFT involves finding salient stable points that are likely to be found in other images, and computing an abstract descriptor vector for each. Matching points between two images then reduces to finding a list of SIFT keypoints for each image and matching the descriptor vectors, usually with a dot product or SSD (Sum of Squared Differences) comparison. Unfortunately, due to its processing requirements SIFT is relegated to offline processes, and the research community has focused on designing algorithms that approach SIFT's matching reliability but run in real time. SURF [4] is one popular example, relying on integral images to approximate convolution with Gaussian and other kernels.
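For concreteness, here is a minimal sketch of the two comparison measures mentioned above, for 64-element SURF-style descriptors (illustrative code, not taken from the paper):

    // Sum of squared differences: smaller means more similar.
    #include <cstddef>

    float ssd(const float* a, const float* b, std::size_t n = 64) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            float d = a[i] - b[i];
            acc += d * d;
        }
        return acc;
    }

    // Dot product: for unit-length descriptors, larger means more similar,
    // since ||a - b||^2 = 2 - 2*(a.b) when ||a|| = ||b|| = 1.
    float dot(const float* a, const float* b, std::size_t n = 64) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

The identity in the second comment is why a dot product comparison, used later in this paper, is interchangeable with Euclidean distance for normalized descriptors.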
Our system relies on a previously created map file of 3D points, each with an associated descriptor. The matching process is performed rapidly in real time, also on the GPU, providing a list of 3D to 2D correspondences for each image, which is then input to a pose recovery stage. As with the feature extraction, the descriptor matching process usually consumes a lot of CPU processing, and complex tree based search algorithms are usually employed to make this stage run at real time speeds. But due to the highly parallel nature of GPUs, a full match can be done on a large set of points without the reliability losses usually incurred by the grouping approximations necessary for fast matching.
II. RELATED WORK
Real time AR systems often rely on planar target patterns such as fiducial marker arrays [5] or highly textured surfaces such as artwork [6]. This is useful for augmentations such as advertisements coming out of magazine pages, but less applicable for AR in three-dimensional environments. The vision system has to first identify a pattern from a database and then calculate a set of 2D to 2D matches, after which typically a homography matrix can be found, from which a pose or just a projection matrix can be determined. Arth et al. [7] demonstrate a similar system for outdoor use on a mobile phone, recovering poses using panorama images aligned to GPS coordinates.
For 3D environments a large set of descriptor vectors is typically needed, often with several descriptors for a single 3D point to capture its appearance as seen from widely varying viewpoints. Real time systems, especially on mobile devices, usually do not perform full feature extraction for each frame, but rather use it only when the system is 'lost', and use 2D feature tracking to approximate following the 3D points. The system of [8] is an example of this. Typically the 2D tracker does not follow the same image points as the features associated with the 3D point set, and so the pose estimation breaks down after a sizable translation motion. However, 2D features can be tracked much faster, and the extracted pose is sufficient for nearby positions.
Lee et al. [9] proposed such a method. In the online stage, it uses the Lucas Kanade (LK) tracker [10] to track the target surface. It is initialized by pointing the camera toward a planar reference image and using a Shi-Tomasi corner detector to find interest points in it. The LK tracker is used to track the movement of the interest points for the calculation of the camera's pose. Only when the tracked surface leaves the camera view is feature detection performed again. This allowed their method to achieve 30 frames per second, but with a moderately low image resolution of 320×240 pixels.
Some research, including this paper, has removed the restriction that all processing occur on a single processor, as in the traditional paradigm. The presence of multiple processing cores (so-called multi-core systems, e.g. 'quad-core') in modern computers allows the process to be divided into two or more processes. The multi-core approach uses a small number of similar high performance CPUs, typically two, to handle different tasks in different threads. The PTAM system [11] performs tracking of 2D FAST features [12] for each frame in one thread and calculates pose using bundle adjustment slowly over several frames in a second thread, occasionally correcting the pose calculated by the first thread. Lee and Hollerer [13] presented another multi-threaded solution. The user's hand is tracked in one thread while feature detection with SIFT runs in a separate thread. Yet another thread is responsible for using optical flow to track movement of the SIFT features, along with pose recovery supported by RANSAC to remove outliers and off-plane features.
In contrast to using multiple cores, systems can use a single main CPU and the large array of simpler processors found in the GPU typically used for graphics rendering. The real time Speeded Up SURF library created by the UTIAS group [14] is a useful system that allows SURF feature extraction at high frame rates on computers with a relatively modern NVIDIA GPU. We used this library for the feature extraction step in Fig. 2.
Matching of descriptor vectors is another stage which often slows down an AR system, and it is usually performed on the CPU. SIFT descriptors are 128-element vectors; SURF features have a vector of 64 floating point numbers. Two descriptor vectors are said to match if their Euclidean distance is below a threshold; this can be evaluated with a sum of squared differences (SSD) or dot product operation on vectors of the same length. If there are N features found in a video frame, and there are M features in a database, then there are N×M comparisons to perform, each of which is a 64 element multiply accumulate (MAC) operation. For a typical 640×480 image there are N = 500 to 2000 SURF features in a complex scene, and possibly a few thousand features in a database. Even for small environments this quickly leads to tens of millions of floating point operations per frame. To alleviate this load some researchers have preprocessed the database into a tree structure, where each feature from a camera image is a query into the tree, which is already sorted to reduce computations. K-means clustering and K-D trees [15], and more rapid but less accurate systems such as ANN [16] and FLANN [17], are used for fast matching. The benefit of these approaches is reduced computation for a single CPU, but at the cost of greater complexity and some missed matches.
GPUs were designed to handle a large number of math operations such as MACs in order to render complex scenes rapidly, such as for video gaming. GPUs today contain a large array of separate MAC units that can run in parallel, so it is possible to perform a full comparison of all the vectors. An efficient way to do this is to use the fast matrix multiply operations created for GPUs, putting the descriptors for the image as rows of one matrix and the descriptors for the database as columns of the other. The best match for each video frame feature is then determined by the largest element in its corresponding row or column. The CUBLAS library in CUDA [18], a library of optimized GPU-accelerated matrix operations included with the CUDA toolkit, is the math library we use in our system.
III. REAL TIME GPU BASED AUGMENTED REALITY SYSTEM
Our system uses the architecture of a single CPU and a GPU, with the bulk of the processing occurring on the GPU. The feature detection, matching, and the RANSAC reprojection portion of pose recovery are performed on the GPU, with the CPU performing only an initial image correction, pose recovery, and system management. We use the CUDA programming language on a laptop computer with a CUDA capable NVIDIA GPU.
Video frames are captured and unwarped to remove lens distortion as in Fig. 2. This image correction is performed by the OpenCV implementation of unwarping using Zhang's calibration parameters [19].

The unwarped image is then moved by the CPU into GPU memory, where feature detection is performed by the Speeded Up SURF library [14]. Robust interest points are extracted from the camera image and a 64-dimensional SURF descriptor is generated for each.
Matching is performed with matrix multiplication using the CUBLAS matrix library, with our own GPU code searching the output matrix for the best matches. The code applies a minimum threshold to determine whether a match is deemed found, and also applies a uniqueness criterion, declaring the match invalid if the ratio of the first best to the second best match is not above a threshold.
After the 2D-3D correspondences are made, the camera's position and orientation in the real world are calculated using OpenCV's PnP [20], with a GPU-accelerated RANSAC [21] loop to improve the stability of the recovered pose by reducing the effect of incorrect 2D-3D correspondences (outliers). Finally, with the calculated pose, virtual objects defined in real-world coordinates are projected onto the display using the recovered position and orientation of the camera.
Figure 3. Interest points detected with Speeded Up SURF and matched using their descriptors.
The resulting CG graphics are drawn over the camera image
and the result is displayed on the screen.
The system shown in Fig. 2 is described below.
A. Image Acquisition and Unwarping
A Firefly IEEE-1394 (Firewire) camera provides a 60 Hz stream of greyscale images of resolution 640×480 pixels into CPU memory. The lens is a small, low cost lens with radial distortion that should not be neglected for accurate pose estimation. An offline calibration routine was performed to obtain the unwarping parameters, which are applied using the OpenCV lookup table unwarp function.
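A plausible reading of this step in code, using OpenCV's standard lookup-table undistortion (the calibration file name and capture source are stand-ins; the paper does not list its exact calls):

    #include <opencv2/opencv.hpp>

    int main() {
        cv::Mat K, dist;  // camera matrix and distortion coefficients
        cv::FileStorage fs("calib.yml", cv::FileStorage::READ); // hypothetical file
        fs["camera_matrix"] >> K;
        fs["distortion_coefficients"] >> dist;

        // Build the unwarping lookup tables once, at startup.
        cv::Mat map1, map2;
        cv::initUndistortRectifyMap(K, dist, cv::Mat(), K,
                                    cv::Size(640, 480), CV_16SC2, map1, map2);

        cv::VideoCapture cap(0);  // stand-in for the Firewire camera
        cv::Mat frame, unwarped;
        while (cap.read(frame)) {
            // Per-frame correction is then a cheap remap through the tables.
            cv::remap(frame, unwarped, map1, map2, cv::INTER_LINEAR);
            // ... unwarped image proceeds to feature detection ...
        }
        return 0;
    }

Precomputing the tables keeps the per-frame cost to a single remap, consistent with the 5 ms unwarping times reported in Section IV.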
B. Feature Detection
Feature detection involves finding interest points and ascribing a descriptor vector to each. The interest points have an image location, as well as a rotation and scale measurement. The colored circles in Figure 3 show some examples of interest points detected in the images. The size of each circle indicates the scale of the interest point, and the line inside the circle indicates its computed orientation.
With the Lenovo W520 laptop used for development of
this AR system, for 640×480 sized images and default
parameters, Speeded Up SURF is capable of performing
keypoint detection and descriptor generation in less than 20
milliseconds.
The Speeded Up SURF parameters used are as follows:
• 8 octaves, the maximum possible for Speeded Up SURF.
• 9 intervals per octave, the maximum possible for Speeded Up SURF. This is assumed to mean how many scale levels each octave is subdivided into, as described in the original SURF paper [3].
• Interest operator threshold of 0.2.
• First octave scale of 2, the same as the default. Any smaller value results in too many keypoints with scales too small to be useful for keypoint matching.
• Descriptor computation enabled, which is necessary for matching.
• Orientation computation enabled. SURF can either be 'upright SURF' in cases when the camera stays level (such as on a mobile robot) or can have the extra processing for rotational invariance enabled. The human user is able to move the camera freely, so we need the rotational invariance.
The parameters were chosen from experimentation in indoor and outdoor environments. For the interest operator threshold, increasing the value causes fewer interest points to be detected. To improve matching performance over different distances between the camera and 2D mapped surfaces, the number of octaves and intervals are set to the maximum supported.
C. Descriptor Matching
After the interest points from the camera image have been detected, they can be matched to the 3D map's points using their descriptors. Figure 3 illustrates two examples of descriptor matching, where different coloured lines represent different matches of interest points in each of the image pairs. As the images show, a few of the matches are incorrect, but in the pose recovery step these outliers are taken care of by RANSAC [21].

In this AR system, 'brute force' keypoint matching is used, with the dot product as the comparison criterion. This is a highly data-parallel operation well suited to GPUs. Recursive tree-based matching algorithms such as K-D trees [15] may be more efficient but are not a good match for the parallel nature of GPU implementations.
1) Step 1: Dot product calculation: By putting the two sets of descriptors into matrices, as rows of the first matrix and as columns of the second, the resulting product matrix contains the dot product for each descriptor comparison. This can be done using fast matrix multiplication with the function cublasSgemm from the CUBLAS library (CUDA Basic Linear Algebra Subprograms) [18].

The descriptor sets are as follows:
• The first descriptor set, Dp, represents keypoints from the image captured by the camera, arranged as an N × 64 matrix where N is the number of keypoints in the camera's image. This descriptor set is copied to the GPU memory every frame.
D_p =
\begin{bmatrix}
dp_{0,0} & dp_{0,1} & \cdots & dp_{0,63} \\
dp_{1,0} & dp_{1,1} & \cdots & dp_{1,63} \\
dp_{2,0} & dp_{2,1} & \cdots & dp_{2,63} \\
dp_{3,0} & dp_{3,1} & \cdots & dp_{3,63} \\
\vdots & \vdots & \ddots & \vdots \\
dp_{N,0} & dp_{N,1} & \cdots & dp_{N,63}
\end{bmatrix}
• The second descriptor set, Dm^T, comes from the map file, arranged as a transposed 64 × M matrix where M is the number of 3D points in the database. This descriptor set is copied to the GPU memory during program initialization only.
D_m^T =
\begin{bmatrix}
dm_{0,0} & dm_{1,0} & dm_{2,0} & \cdots & dm_{M,0} \\
dm_{0,1} & dm_{1,1} & dm_{2,1} & \cdots & dm_{M,1} \\
dm_{0,2} & dm_{1,2} & dm_{2,2} & \cdots & dm_{M,2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
dm_{0,63} & dm_{1,63} & dm_{2,63} & \cdots & dm_{M,63}
\end{bmatrix}
The matrices are copied from the computer's memory to the GPU's device memory in a column-major format as required by CUBLAS. These two matrices are multiplied together using the function cublasSgemm, which returns the resulting N × M matrix S, containing the dot products for each of the keypoint combinations in the pair:

S = D_p D_m^T

The rows are the keypoint indices for the camera image and the columns are the indices of the 3D points in the map, stored in column-major order:
S =
\begin{bmatrix}
s_{p_0,m_0} & s_{p_0,m_1} & s_{p_0,m_2} & \cdots & s_{p_0,m_M} \\
s_{p_1,m_0} & s_{p_1,m_1} & s_{p_1,m_2} & \cdots & s_{p_1,m_M} \\
s_{p_2,m_0} & s_{p_2,m_1} & s_{p_2,m_2} & \cdots & s_{p_2,m_M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{p_N,m_0} & s_{p_N,m_1} & s_{p_N,m_2} & \cdots & s_{p_N,m_M}
\end{bmatrix}
where s_{p_x,m_y} is the dot product score between the xth 2D point from the picture and the yth 3D point from the map.
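A sketch of this call with the cuBLAS v2 API (the variable names are ours; the paper specifies only that cublasSgemm is used):

    #include <cublas_v2.h>

    // d_Dp holds the N x 64 frame descriptors and d_DmT the 64 x M map
    // descriptors, both already in GPU memory in column-major layout as
    // cuBLAS requires. d_S receives the N x M score matrix.
    void scoreMatrix(cublasHandle_t handle,
                     const float* d_Dp,   // N x 64, lda = N
                     const float* d_DmT,  // 64 x M, ldb = 64
                     float* d_S,          // N x M result, ldc = N
                     int N, int M)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // S = 1.0 * Dp * DmT + 0.0 * S:
        // one dot product per (keypoint, map point) pair.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, M, 64,
                    &alpha, d_Dp, N,
                            d_DmT, 64,
                    &beta,  d_S, N);
    }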
2) Step 2: Picking best matches: The second step of keypoint matching is to select the best match between each keypoint in the camera image and the 3D points in the map file, using the dot product results calculated in the previous step. The resultant matrix from the previous step remains in GPU memory for this step, avoiding time-consuming memory transfers between host and device.

A CUDA kernel function is launched with N threads on the GPU, where N is the number of keypoints in the camera's image. Each keypoint in the camera's image is assigned a thread, with the ith thread responsible for the ith keypoint. The threads perform an exhaustive search through their keypoint's row of the dot product matrix, searching for the best and second best matches. Two thresholds are applied to decide whether a chosen match is accepted or rejected:
• The relative threshold is the minimum ratio between the best and second best match.
• The absolute threshold is the minimum score required, used to discard weak matches.
The column-major order of the dot product matrix allows 'coalesced' optimization of memory access, so all memory operations in this kernel are performed directly on global memory. The result of this step is a one-dimensional array of matching indices, where the ith element contains the ID of the 3D point that the ith keypoint from the camera image is matched to. For keypoints without a matching 3D point, the matching index is set to -1.
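A hedged reconstruction of such a kernel (the paper does not list its code; the parameter names and tie-breaking details are our assumptions):

    // One thread scans one row of the N x M score matrix S.
    __global__ void pickBestMatches(const float* S, int N, int M,
                                    float absThresh, float ratioThresh,
                                    int* matchIdx, float* matchScore)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float best = -1.0f, second = -1.0f;
        int bestJ = -1;
        for (int j = 0; j < M; ++j) {
            // Column-major layout: element (i, j) lives at S[j * N + i],
            // so adjacent threads touch adjacent addresses (coalesced).
            float s = S[j * N + i];
            if (s > best)        { second = best; best = s; bestJ = j; }
            else if (s > second) { second = s; }
        }
        // Absolute threshold discards weak matches; the ratio test discards
        // ambiguous ones whose runner-up is nearly as good.
        bool ok = (best >= absThresh) &&
                  (second <= 0.0f || best / second >= ratioThresh);
        matchIdx[i]   = ok ? bestJ : -1;
        matchScore[i] = best;
    }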
3) Step 3: Eliminating multiple matches to the same 3D point: The last step of keypoint matching is to scan through the array of matching indices to eliminate duplicate matches to the same 3D point, keeping the one with the best matching score. This ensures that only one keypoint from the camera image is matched to any given 3D point in the map.

This step is implemented as a CUDA kernel executed on the GPU, with the same number of threads as the previous step, i.e. the number of keypoints in the camera's image. Each thread whose keypoint has a match searches through the array concurrently for any other keypoint that has the same matching target, and compares their matching scores. If the thread's own keypoint has a better matching score than the other keypoint, then that keypoint's matching index is set to -1 and the loop continues for this thread. Otherwise, if the other keypoint has a better matching score than the thread's own keypoint, then the thread's own keypoint has its matching index set to -1 and the loop ends for this thread. All memory operations in this kernel are performed on global memory.
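A sketch of this kernel, following the description above (our reconstruction; matchIdx and matchScore are the per-keypoint outputs of Step 2, and the concurrent reads and writes of matchIdx mirror the paper's description):

    __global__ void removeDuplicates(int* matchIdx, const float* matchScore,
                                     int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N || matchIdx[i] < 0) return;

        int myTarget = matchIdx[i];
        for (int j = 0; j < N; ++j) {
            if (j == i || matchIdx[j] != myTarget) continue;
            if (matchScore[i] > matchScore[j]) {
                matchIdx[j] = -1;   // the other keypoint loses; keep scanning
            } else {
                matchIdx[i] = -1;   // this keypoint loses; stop scanning
                return;
            }
        }
    }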
D. Pose Recovery
After the descriptor matching step, a number of the interest points in the camera image have correspondences to 3D world points, allowing the estimation of the camera's pose in the real world.

Pose recovery is performed by OpenCV 2.3.1's solvePnPRansac from the GPU module [20]. This function calculates pose from four 2D-3D correspondences and the camera matrix within a RANSAC loop. In our system, 64 RANSAC iterations were chosen empirically.
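For illustration, an equivalent call through OpenCV's CPU-side solvePnPRansac (the paper uses the GPU-module variant [20]; all parameter values except the 64 iterations are placeholders):

    #include <opencv2/core/core.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <vector>

    void recoverPose(const std::vector<cv::Point3f>& worldPts, // matched 3D map points
                     const std::vector<cv::Point2f>& imagePts, // matched 2D keypoints
                     const cv::Mat& K,                         // camera matrix
                     cv::Mat& rvec, cv::Mat& tvec)
    {
        // Distortion is zero here because frames were already unwarped.
        cv::Mat noDist = cv::Mat::zeros(4, 1, CV_64F);
        cv::solvePnPRansac(worldPts, imagePts, K, noDist, rvec, tvec,
                           false,  // no extrinsic guess
                           64,     // RANSAC iterations, as chosen empirically
                           4.0f);  // reprojection error threshold (pixels, assumed)
    }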
E. GPU Acceleration
CUDA is a C-like language created by NVIDIA to program their GPU units for general purpose computing.

In CUDA, the smallest executable unit is a warp, consisting of 32 parallel threads. Each microprocessor is capable of executing 24 to 48 warps concurrently, depending on the architecture of the GPU. This gives 768 to 1536 concurrent threads per microprocessor [22]. Each GPU chip contains multiple microprocessors; high end GPUs contain a large number of them, while budget GPUs at the bottom end may contain as few as one microprocessor on the die.
For instance, the NVIDIA Quadro 2000M, a mid-range laptop GPU as of this writing, contains 4 microprocessors on its die, each capable of 48 concurrent warps. This allows a grand total of 4 × 48 × 32 = 6144 concurrent threads to be executed on this GPU. A high-end laptop GPU with the same architecture, such as a GeForce 580M, contains twice as many microprocessors and can thus execute twice as many concurrent threads.
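This arithmetic can be reproduced at runtime from the properties the CUDA runtime reports; a small sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // maxThreadsPerMultiProcessor is 768-1536 depending on architecture,
        // i.e. 24-48 warps of 32 threads, matching the figures quoted above.
        int total = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
        std::printf("%s: %d multiprocessors x %d threads = %d concurrent threads\n",
                    prop.name, prop.multiProcessorCount,
                    prop.maxThreadsPerMultiProcessor, total);
        return 0;
    }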
F. Offline phase - 3D Maps Creation
The 3D map file for this AR system is a collection of 3D points in real-world space, and each 3D point has an associated 64-dimensional Speeded Up SURF descriptor, to allow the matching between 2D image points and 3D world points used for pose recovery. As with [23], the creation of 3D maps is an offline process. The map files used in this paper were of planar surfaces captured from images.
IV. EXPERIMENTS
For these experiments, a Lenovo W520 laptop was used, equipped with an Intel Core i7 2720QM CPU and a mid-range NVIDIA Quadro 2000M GPU (http://www.nvidia.com/page/quadrofx_go.html), which has approximately half the parallel processing power of the top-end laptop GPUs available as of this writing. A single IEEE-1394 camera was used for video input, with a resolution of 640×480 pixels.
For pose recovery, we chose 64 iterations of the RANSAC loop as a balance between computation speed and stability of the recovered pose.
A. Indoor experiments
The first three images in the top row of Figure 4 show the ideal scenario (set 1): a planar and feature-rich wall that allows a very large number of Speeded Up SURF interest points to be detected and mapped, allowing stable recovered poses. The rightmost image in the same row shows a scene half a flight upstairs which, while not perfectly planar, appears to be so unless the camera is extremely close to the railings, allowing that surface to be recorded in the 3D map and a pose to be recovered.
The 3D map of the lab set shown in the middle row of Figure 4 (set 2) is an example where 3D points from multiple locally planar surfaces are appended to create a more complex 3D map. In the first image, two planar surfaces from the map are tracked, one on the silver numbered boxes and another on the bricks on the wall, to allow more pose stability. In the second and third images, the feature-rich wall is tracked, showing successful pose recovery with the camera far away from and close to it, respectively. In the last image of the row, the surface the doors lie on allowed only limited matches, and the pose is less stable than in the other three images of the row.
Timing results for the indoor experiments:

                                 Set 1    Set 2
    Number of points in 3D map    1987     2328
    Image unwarping (ms)             5        5
    Keypoint detection (ms)         30       28
    Descriptor matching (ms)         6        6
    Pose recovery (ms)              33       37
    Graphic rendering (ms)           3        4
    Overall frame rate (fps)      11.2     10.7
B. Outdoor experiments
The scene in the top row of Figure 5 yielded steady recovered poses: the irregular brick patterns on the arched brick wall's front surface worked well with the Speeded Up SURF feature detector. However, the maximum distance from which a pose can be recovered from this surface is limited, as shown in the first image; these bricks did not allow large-scale interest points to be detected. When the camera tilted upwards in the third image, the camera reduced its exposure and increased its shutter speed, causing the image to darken; no interest points were available to be matched to the 3D map, and a pose could not be computed.
The scene in the bottom row of Figure 5 resulted in translation errors in the z-position throughout the sequence, with the virtual objects constantly bobbing up and down; the x and y positions recovered are accurate, however. Future work will concentrate on pose filtering to reduce the unpleasant vibration in the augmentations.
Timing results for the outdoor experiments:

                                 Set 1    Set 2
    Number of points in 3D map    3024     1925
    Image unwarping (ms)             5        5
    Keypoint detection (ms)         40       35
    Descriptor matching (ms)        15        8
    Pose recovery (ms)              32       34
    Graphic rendering (ms)           3        3
    Overall frame rate (fps)       9.6     10.2
V. CONCLUSION
A real time augmented reality system was created that performs true feature extraction and pose estimation on every frame, without relying on 2D tracking. SURF feature points were found in each frame and matched to a database of 3D points, and the pose was extracted using OpenCV's PnP. An architecture of a single CPU thread and a large array of GPU processors was used, with the majority of the processing performed on the GPU. Due to the inclusion of GPU units in mobile devices, we expect this to be the architecture of choice in future consumer AR systems.
ACKNOWLEDGMENT
This work was supported by NSERC Discovery Grant
356072.
[Figure 4. Indoor experiments (images (a)-(h)).]
[Figure 5. Outdoor experiments (images (a)-(h)).]
REFERENCES
[1] www.nvidia.com/object/cuda_home.html
[2] D. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE International Conference on Computer Vision (ICCV), Sept. 1999, pp. 1150–1157.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346–359, 2008.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, 2008, pp. 346–359.
[5] M. Fiala, "ARTag, a fiducial marker system using digital techniques," in CVPR'05, vol. 1, 2005, pp. 590–596.
[6] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, Jun. 2006, pp. 2161–2168.
[7] C. Arth, M. Klopschitz, G. Reitmayr, and D. Schmalstieg, "Real-time self-localization from panoramic images on mobile devices," in IEEE/ACM International Symposium on Mixed and Augmented Reality, 2011, pp. 37–46.
[8] F. Wientapper, H. Wuest, and A. Kuijper, "Composing the feature map retrieval process for robust and ready-to-use monocular tracking," in Computers and Graphics, vol. 35, 2011, pp. 778–788.
[9] A. Lee, J.-Y. Lee, S.-H. Lee, and J.-S. Choi, "Markerless augmented reality system based on planar object tracking," in Frontiers of Computer Vision (FCV), 2011 17th Korea-Japan Joint Workshop on, Feb. 2011, pp. 1–4.
[10] J. Shi and C. Tomasi, "Good features to track," in CVPR, Seattle, Washington, USA, June 1994, pp. 593–600.
[11] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in ISMAR, 2007, pp. 225–234.
[12] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in IEEE International Conference on Computer Vision, vol. 2, Oct. 2005, pp. 1508–1511.
[13] T. Lee and T. Hollerer, "Hybrid feature tracking and user interaction for markerless augmented reality," in Virtual Reality Conference, 2008. VR '08. IEEE, March 2008, pp. 145–152.
[14] http://asrl.utias.utoronto.ca/code/gpusurf/
[15] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, Sept. 1975.
[16] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, pp. 891–923, Nov. 1998.
[17] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Applications (VISSAPP'09). INSTICC Press, 2009, pp. 331–340.
[18] "CUBLAS library." [Online]. Available: http://developer.nvidia.com/cublas
[19] Z. Zhang, "Flexible camera calibration by viewing a plane from unknown orientations," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 1, 1999, pp. 666–673.
[20] OpenCV 2.3.1 documentation: Camera calibration and 3D reconstruction. [Online]. Available: http://opencv.itseez.com/modules/gpu/doc/camera_calibration_and_3d_reconstruction.html#gpu-solvepnpransac
[21] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, 1981, pp. 381–395.
[22] CUDA C programming guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[23] F. Wientapper, H. Wuest, and A. Kuijper, "Reconstruction and accurate alignment of feature maps for augmented reality," in 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2011 International Conference on, May 2011, pp. 140–147.