
Decoupled Localization and Sensing with HMD-based AR for Interactive Scene Acquisition

Søren Skovsen∗
Aarhus University
[email protected]

Harald Haraldsson
Cornell Tech
[email protected]

Abe Davis
Cornell Tech
[email protected]

Henrik Karstoft
Aarhus University
[email protected]

Serge Belongie
Cornell Tech
[email protected]

Abstract

Real-time tracking and visual feedback make interactive AR-assisted capture systems a convenient and low-cost alternative to specialized sensor rigs and robotic gantries. We present a simple strategy for decoupling localization and visual feedback in these applications from the primary sensor being used to capture the scene. Our strategy is to use an AR HMD and 6-DOF controller for tracking and feedback, synchronized with a separate primary sensor for capturing the scene. In this extended abstract, we present a prototype implementation of this strategy and investigate the accuracy of decoupled tracking by comparing runtime pose estimates to the results of high-resolution offline SfM.

1. Introduction & Related WorkThere are many ways to capture a scene, spanning a va-

riety of different imaging modalities that target a range ofdifferent applications. In the past decade, we have seentremendous improvement in capture scenarios where thesensing device is hand-held and able to localize itself withinits own environment—for example, using devices equippedwith vision-based Simultaneous Localization and Mapping(SLAM) (e.g., [10, 11]) or ICP-based tracking with RGBDinput (e.g., KinectFusion [14]). A critical advantage ofthese settings is the ability to give real-time feedback andguidance to human users about where to move the sens-ing device. In this work, we explore a simple and flexi-ble way to extend this type of guidance to more arbitrarysensing devices. Our strategy is to use dedicated hardwarefor tracking and user feedback—namely, a head-mountedAR device and 6-DOF controller—and to incorporate the

∗This work was conducted during a research stay at Cornell Tech.

[Figure 1 annotations: flash trigger used to relay the image capture signal to the HMD; sensor/camera; custom mount used to attach the sensor for tracking; 6-DOF controller (Magic Leap 1).]

Figure 1: We evaluate the sensor tracking accuracy by using it with an MILC as the attached sensor, and comparing runtime tracking data with high-resolution SfM computed offline after capture.

Our strategy is to use dedicated hardware for tracking and user feedback—namely, a head-mounted AR device and 6-DOF controller—and to incorporate the primary sensing device separately by fixing it to a custom mount built for the hand-held controller. This solution makes the sensing device interchangeable, providing a flexible platform for experimenting with various types of capture. However, decoupling localization from the primary sensing device can potentially complicate the analysis of measurement error. In this extended abstract, we present an implementation of our controller-mounted tracking strategy using current off-the-shelf hardware (augmented with our custom controller mount), and investigate the tracking accuracy of the mounted sensor by comparing runtime pose estimation with offline estimates based on Structure from Motion (SfM).

1.1. The Value of Real-Time Tracking and Feedback

Regardless of imaging modality and target application, most capture scenarios can be formulated as a coverage problem—that is, one where the goal is to record measurements that cover a target domain of interest.


(a) Point cloud color guidance. (b) Opaque surface color guidance. (c) Post-processed 3D-reconstruction.

Figure 2: The acquisition system in use. The perceived user interfaces during image acquisition are shown in (a) and (b). Image coverage is visualized in colors from red to green, red meaning no coverage. Each sampled camera pose is logged and presented by a cyan 6-DOF indicator. (c) shows an offline 3D-reconstruction from the camera-equipped sensor platform.

For example, in panoramic image capture the target domain is the 2D space of pixels that share a common center of projection; in 3D shape acquisition, it is Euclidean 3-space; and in light field capture, it is 4D ray space. In each case, we can describe the requirements of capture in terms of some coverage criterion that determines when and how well each part of the domain has been captured. As the dimensionality of the target domain increases, or as the criterion for coverage becomes more complex, the precise set of measurements needed to acquire a scene becomes more difficult to navigate by hand. For this reason, domains with more than two dimensions have often been captured with fixed sensor arrays [15, 16] or robotic gantries [7, 12, 13]. However, with the development of real-time tracking and mapping on hand-held devices (e.g., [10, 11, 14]), several works have shown that well-designed visual feedback through AR can be used to reliably guide human users through otherwise impractical capture procedures, making mobile and user-driven acquisition feasible even in high-dimensional domains [2, 3, 5, 8, 9].
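To make the notion of a coverage criterion concrete, consider a toy multi-view example (not from the paper): the target domain is discretized into cells, each capture marks the cells it observes, and a cell counts as covered once it has been seen from at least k viewpoints. A minimal sketch, with the grid size and threshold chosen arbitrarily:

```python
import numpy as np

def update_coverage(counts, observed):
    """Increment the view counter for every domain cell observed by a new measurement."""
    counts[observed] += 1
    return counts

def coverage_fraction(counts, k=2):
    """Fraction of the discretized target domain seen from at least k viewpoints."""
    return float(np.mean(counts >= k))

# Toy target domain: a 10x10 grid of cells.
counts = np.zeros((10, 10), dtype=int)
counts = update_coverage(counts, (slice(0, 5), slice(None)))  # one view sees the top half
counts = update_coverage(counts, (slice(0, 5), slice(None)))  # a second view sees it again
print(coverage_fraction(counts, k=2))                         # -> 0.5
```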

1.2. Decoupling Localization & Sensing

Most AR-based capture systems use the same imaging modality (most commonly, the same physical sensor) for localization and for obtaining scene coverage. This is convenient because 1) it does not require additional hardware, and 2) it reduces potential sources of measurement error and simplifies their analysis. However, coupling localization with the primary sensor limits interactive capture to devices that offer real-time tracking. We explore a simple strategy for decoupling the tracking and AR aspects of assisted capture systems from the primary sensor used for capture in order to accommodate a wider variety of capture scenarios. We do this by attaching the primary sensor to a 6-DOF controller for tracking, and using an AR HMD for user feedback.

Closely related to our work is the strategy used in recent work by Andersen et al. [3], which combines an HMD and fiducial-based tracking to guide the interactive capture of images for photometric reconstructions. This work generalizes and extends the strategy they employ for tracking their primary camera, and focuses on the analysis of tracking accuracy rather than a specific capture application.

1.3. Implementation & Evaluation

Our implementation is based on the Magic Leap 1 (ML1), a self-contained wearable AR system that uses an optical see-through HMD and a hand-held controller. To render spatial digital content, the ML1 relies on real-time 6-DOF tracking based on a black-box implementation of SLAM using interest points (e.g., [6]) and provides a persistent coordinate frame system in indoor environments. The 6-DOF tracking of the controller is based on an electromagnetic tracking system, relative to the Lightwear headset [4]. Our strategy for evaluating the accuracy of this tracking applied to a separate sensor is to attach a high-resolution mirrorless interchangeable-lens camera (MILC) as our primary sensor for the application, and compare the localization provided by the ML1 at runtime to pose estimates computed with offline SfM based on the high-resolution MILC frames. To this end, we implemented an example scene acquisition system for photometric reconstruction that provides users with real-time feedback about coverage in the scene being captured (described in Section 3).

2. Tracked Acquisition System

The tracked sensor platform is equipped with a Sony a7R II full-frame MILC with a Zeiss Batis 25mm f/2 lens. The 3D-printed adapter is mounted on the bottom of the camera in the tripod mount using a threaded rod. The adapter is designed to fit tightly around the ML1 controller while aligning the rod with the controller's z-axis.


                Position [cm]           Orientation [°]
Room   Images   X       Y       Z       ω       φ       κ
L1     206      0.88    0.85    0.64    3.0     1.3     2.3
L2     150      1.02    0.96    0.66    3.5     1.3     2.6
K1     130      0.77    0.88    0.96    4.5     1.7     3.6
K2     105      0.88    1.20    0.43    2.5     1.3     1.4
B1      77      0.82    0.75    0.75    4.0     1.1     3.2
B2     110      0.75    0.93    0.75    2.2     1.1     1.4
Average         0.85    0.93    0.70    3.2     1.3     2.4

Table 1: Mean absolute error of the 6-DOF tracking on the sensor platform. The ground plane is spanned by X and Y, and Z points upwards in the world coordinate system. ω, φ, and κ refer to rotations around the X, Y, and Z axes, respectively.

This extends the built-in 6-DOF tracking abilities of the controller to the camera. A camera-mounted Godox X1T flash trigger provides an output PC-sync interface for a battery-powered Adafruit Feather 32u4. The Feather forwards trigger events to the ML1 system over a Bluetooth connection. When the camera is triggered, the ML1 system logs the camera pose.
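Since the camera is rigidly mounted to the controller, each logged camera pose amounts to composing the tracked controller pose with a fixed camera-to-controller offset. The sketch below illustrates that composition; the offset values are placeholders (the real transform depends on the mount geometry and would be calibrated), not figures from the paper.

```python
import numpy as np

# Fixed camera-to-controller transform given by the 3D-printed mount.
# Placeholder values: identity rotation and an offset along the controller z-axis.
R_CAM_IN_CTRL = np.eye(3)
T_CAM_IN_CTRL = np.array([0.0, 0.0, 0.12])  # metres

def camera_pose_from_controller(R_ctrl, t_ctrl):
    """Compose the mount offset with the tracked controller pose (rotation R_ctrl,
    position t_ctrl in world coordinates) to get the camera pose logged at each
    shutter event relayed by the flash trigger."""
    R_cam = R_ctrl @ R_CAM_IN_CTRL
    t_cam = R_ctrl @ T_CAM_IN_CTRL + t_ctrl
    return R_cam, t_cam

# Example: controller pose reported at the moment the trigger fires.
R_cam, t_cam = camera_pose_from_controller(np.eye(3), np.array([1.0, 0.5, 1.4]))
```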

3. Guided Acquisition

Image acquisition is carried out in AR, looking through the ML1 HMD as shown in Figure 2. In our prototype implementation for photometric capture, feedback is provided by coloring the geometric proxy provided by the ML1 to indicate scene coverage. Starting out, a non-occluded point cloud on the scene surfaces lights up in red, indicating no captured image coverage. As the user captures images, the points within each camera view frustum change color. Orange indicates points that appear in a single view but are not yet triangulated in multiple captured views. Green indicates points that are visible in multiple views and therefore sufficiently triangulated. View frustums are also displayed at the poses from which previous images were captured. At any time, the user may switch between the point cloud visualization and a more opaque shaded surface visualization, which makes coverage clearer at the expense of lowering the visible contrast in the scene.
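As a rough sketch of this coloring logic (a simplification, not the prototype's actual implementation), the proxy points can be classified by counting the capture frusta they fall inside, here reduced to a field-of-view and range test with illustrative parameters:

```python
import numpy as np

def in_frustum(points, R, t, fov_deg=60.0, max_range=5.0):
    """Boolean mask of points inside a simplified conical frustum for a camera with
    camera-to-world rotation R and position t (camera looks along its local +z axis)."""
    rel = (points - t) @ R            # world -> camera coordinates
    depth = rel[:, 2]
    ang = np.degrees(np.arctan2(np.linalg.norm(rel[:, :2], axis=1), depth))
    return (depth > 0) & (depth < max_range) & (ang < fov_deg / 2)

def coverage_colors(points, poses):
    """Red: unseen, orange: seen in one view, green: seen in >= 2 views (triangulated)."""
    views = np.zeros(len(points), dtype=int)
    for R, t in poses:
        views += in_frustum(points, R, t)
    colors = np.empty((len(points), 3))
    colors[views == 0] = [1.0, 0.0, 0.0]   # red
    colors[views == 1] = [1.0, 0.5, 0.0]   # orange
    colors[views >= 2] = [0.0, 1.0, 0.0]   # green
    return colors
```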

4. Decoupled Tracking Evaluation

Two bedrooms, two living rooms, and two kitchens, denoted B1, B2, L1, L2, K1, and K2, were densely sampled using the camera-equipped sensor platform, guided within AR. The acquisition was performed during daytime to ensure high levels of ambient lighting. Using Agisoft Metashape [1], each set of hand-held camera images was aligned in 3D using SfM. The alignments for each room were then scaled and oriented to minimize the distance to the 3D positions recorded during acquisition. A reconstruction example is shown in Figure 3.
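The paper does not name the solver used for this scaling and orienting step, but one standard choice is a closed-form least-squares similarity transform (Umeyama-style) between the SfM camera centers and the logged 3D positions. A sketch under that assumption:

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src points onto dst
    (Umeyama 1991). src, dst: (N, 3) arrays of corresponding camera centers."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                        # guard against reflections
    R = U @ S @ Vt
    var_s = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s    # isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t

# Usage: map SfM camera centers (arbitrary scale/orientation) onto logged positions:
# s, R, t = similarity_align(sfm_centers, logged_positions)
# aligned = (s * (R @ sfm_centers.T)).T + t
```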

Figure 3: Top-down view of a reconstructed room (L2) used for evaluation.

The 6-DOF accuracy of the sensor platform, evaluated as the difference between real-time logged and offline-derived image poses, is shown in Table 1. The mean absolute error in positioning is less than 1 cm in each dimension in 16 out of the 18 measurements, with an average error of 0.83 cm. A considerable amount of the deviation is caused by a few samples in each room. These anomalies are possibly caused by large reflective surfaces or nearby conductive materials, interfering with the SLAM or the electromagnetic tracking, respectively. The orientation error is consistently distributed unevenly across the three axes, but averages between 1.3 and 3.2 degrees per axis.
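In principle, the entries of Table 1 correspond to averaging, per image, the absolute differences between logged and aligned SfM poses in position and in the ω/φ/κ angles. A minimal sketch of that computation, assuming an X-Y-Z Euler decomposition (the exact angle convention is not specified in the paper):

```python
import numpy as np

def euler_xyz_deg(R):
    """Decompose a rotation matrix into (omega, phi, kappa) = rotations about X, Y, Z,
    in degrees, assuming R = Rz(kappa) @ Ry(phi) @ Rx(omega)."""
    phi = np.arcsin(-R[2, 0])
    omega = np.arctan2(R[2, 1], R[2, 2])
    kappa = np.arctan2(R[1, 0], R[0, 0])
    return np.degrees([omega, phi, kappa])

def mean_absolute_errors(logged, aligned):
    """logged / aligned: lists of (R, t) pose pairs for the same images.
    Returns per-axis MAE for position (cm) and orientation (degrees)."""
    pos_err, ang_err = [], []
    for (R_l, t_l), (R_a, t_a) in zip(logged, aligned):
        pos_err.append(np.abs(t_l - t_a) * 100.0)   # metres -> cm
        R_rel = R_l.T @ R_a                         # relative rotation
        ang_err.append(np.abs(euler_xyz_deg(R_rel)))
    return np.mean(pos_err, axis=0), np.mean(ang_err, axis=0)
```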

Our results indicate that this strategy could be quite promising as a way to extend interactive capture to sensors not capable of self-localization. The pattern of outliers in tracking accuracy suggests that this goal may be aided by online confidence estimates provided by the HMD and controller. We anticipate that systems like ours will open up exciting new directions for AR and multi-modal imaging.

5. Conclusion

We presented a simple approach for HMD-guided scene acquisition that decouples localization from the primary sensor used to obtain coverage. Based on our analysis of tracking accuracy over six scenes captured with our prototype, the decoupled tracking had on average less than 1 cm localization error in each dimension, and an orientation error lower than 3.3 degrees per axis. This suggests that our solution could enable new types of scene acquisition based on sensors that cannot do their own localization. For example, in the case of our prototype, it allows for high-resolution image capture that is incompatible with real-time tracking on current devices. One could also imagine using this system to record other modalities—an audio field, for example—in the future. The ability to attach and track arbitrary new types of sensors poses an exciting opportunity for using AR to explore new sensing modalities.


6. Acknowledgments

This work was partly funded by Stibofonden and Innovation Fund Denmark.

References

[1] Agisoft. Metashape. https://www.agisoft.com. Accessed: 2020-04-24.
[2] Daniel Andersen and Voicu Popescu. HMD-guided image-based modeling and rendering of indoor scenes. In 15th EuroVR International Conference, EuroVR 2018, London, UK, October 22–23, 2018, Proceedings, pages 73–93, 2018.
[3] D. Andersen, P. Villano, and V. Popescu. AR HMD guidance for controlled hand-held 3D acquisition. IEEE Transactions on Visualization and Computer Graphics, 25(11):3073–3082, 2019.
[4] Brian Bucknor, Christopher Lopez, Michael Janusz Woods, Aly H. M. Aly, James William Palmer, and Evan Francis Rynk. Electromagnetic tracking with augmented reality systems. U.S. Patent 2017/0307891, Oct. 26, 2017.
[5] Abe Davis, Marc Levoy, and Fredo Durand. Unstructured light fields. Comput. Graph. Forum, 31:305–314, 2012.
[6] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. CoRR, abs/1712.07629, 2017.
[7] Sing Choong Foo. A gonioreflectometer for measuring the bidirectional reflectance of material for use in illumination computation, 1997.
[8] A. Hartl, J. Grubert, D. Schmalstieg, and G. Reitmayr. Mobile interactive hologram verification. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 75–82, 2013.
[9] J. Jachnik, R. A. Newcombe, and A. J. Davison. Real-time surface light-field capture for augmentation of planar specular surfaces. In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 91–97, Nov 2012.
[10] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.
[11] Georg Klein and David Murray. Parallel tracking and mapping on a camera phone. In Proc. Eighth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'09), Orlando, October 2009.
[12] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 31–42, New York, NY, USA, 1996. Association for Computing Machinery.
[13] Marc Levoy, Kari Pulli, Brian Curless, Szymon Rusinkiewicz, David Koller, Lucas Pereira, Matt Ginzton, Sean Anderson, James Davis, Jeremy Ginsberg, Jonathan Shade, and Duane Fulk. The Digital Michelangelo Project: 3D scanning of large statues. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 131–144, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[14] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE ISMAR. IEEE, October 2011.
[15] Ren Ng. Digital Light Field Photography. PhD thesis, Stanford, CA, USA, 2006.
[16] Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High performance imaging using large camera arrays. ACM Trans. Graph., 24(3):765–776, July 2005.