A keynote talk given by Mark Billinghurst at the ICMI 2013 conference, December 12th 2013. The talk is about how to use speech and gesture interaction with Augmented Reality interfaces.
Hands and Speech in Space: Multimodal Interaction for AR
Mark Billinghurst
The HIT Lab NZ, University of Canterbury
December 12th 2013
1977 – Star Wars
Augmented Reality Definition: Defining Characteristics
Combines Real and Virtual Images - Both can be seen at the same time
Interactive in real-time - The virtual content can be interacted with
Registered in 3D - Virtual objects appear fixed in space
Azuma, R. T. (1997). A survey of augmented reality. Presence, 6(4), 355-385.
Augmented Reality Today
Key Question: How should a person interact with Augmented Reality content?
Connecting the physical and virtual through interaction
AR Interface Components
- Physical Elements
- Virtual Elements
- Interaction Metaphor
- Input
- Output
AR Interaction Metaphors
- Information Browsing: view AR content
- 3D AR Interfaces: 3D UI interaction techniques
- Augmented Surfaces: Tangible UI techniques
- Tangible AR: Tangible UI input + AR output
VOMAR Demo (Kato 2000): AR Furniture Arranging
Elements + Interactions
Book:
- Turn over the page
Paddle:
- Push, shake, incline, hit, scoop
Kato, H., Billinghurst, M., et al. 2000. Virtual Object Manipulation on a Table-Top AR Environment. In Proceedings of the International Symposium on Augmented Reality (ISAR 2000), Munich, Germany, 111-119.
Opportunities for Multimodal Input
Multimodal interfaces are a natural fit for AR:
- Need for non-GUI interfaces
- Natural interaction with the real world
- Natural support for body input
- Previous work has shown the value of multimodal input for 3D graphics
Related Work
Related work in 3D graphics/VR:
- Interaction with 3D content [Chu 1997]
- Navigating through virtual worlds [Krum 2002]
- Interacting with virtual characters [Billinghurst 1998]
Little earlier work in AR:
- Requires additional input devices
- Few formal usability studies
- E.g. Olwal et al. [2003] SenseShapes
Examples
SenseShapes [2003] Kolsch [2006]
Marker Based Multimodal Interface
Add speech recognition to VOMAR Paddle + speech commands
Irawati, S., Green, S., Billinghurst, M., Duenser, A., & Ko, H. (2006, October). In Proceedings of the IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR 2006), pp. 183-186. IEEE.
Commands Recognized
- Create Command ("Make a blue chair"): create a virtual object and place it on the paddle.
- Duplicate Command ("Copy this"): duplicate a virtual object and place it on the paddle.
- Grab Command ("Grab table"): select a virtual object and place it on the paddle.
- Place Command ("Place here"): place the attached object in the workspace.
- Move Command ("Move the couch"): attach a virtual object in the workspace to the paddle so that it follows the paddle movement.
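A minimal sketch of how such commands could be fused with the tracked paddle; the data structures, command handling, and values here are hypothetical illustrations, not the original VOMAR code:

```python
# Hypothetical sketch: a recognized speech phrase selects the action, the
# tracked paddle supplies the spatial argument (not the actual VOMAR code).
from dataclasses import dataclass

@dataclass
class PaddleState:
    position: tuple      # paddle position in workspace coordinates
    hovered_object: str  # id of the virtual object nearest the paddle, or ""

def interpret(speech: str, paddle: PaddleState, scene: dict) -> str:
    """Map a recognized phrase plus paddle context to a scene action."""
    words = speech.lower().split()
    if words[0] == "make":                       # "Make a blue chair"
        scene[words[-1]] = paddle.position       # create the object on the paddle
        return f"created {words[-1]} on paddle"
    if words[0] in ("copy", "grab"):             # "Copy this" / "Grab table"
        return f"attached {paddle.hovered_object or words[-1]} to paddle"
    if words[0] == "place":                      # "Place here"
        return f"placed attached object at {paddle.position}"
    if words[0] == "move":                       # "Move the couch"
        return f"{words[-1]} now follows the paddle"
    return "unrecognized command"

print(interpret("Make a blue chair", PaddleState((0.2, 0.0, 0.4), ""), {}))
```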
System Architecture
Object Relationships
"Put chair behind the table” Where is behind?
View specific regions
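One illustrative way to resolve a view-dependent relation such as "behind" is to test candidate positions against the line of sight from the camera through the reference object; this sketch and its tolerance value are assumptions, not the system's actual spatial reasoning:

```python
import numpy as np

def is_behind(candidate, reference, camera):
    """True if 'candidate' lies behind 'reference' as seen from 'camera':
    farther along the viewing direction and roughly on the line of sight."""
    view_dir = reference - camera
    view_dir /= np.linalg.norm(view_dir)
    offset = candidate - reference
    along = np.dot(offset, view_dir)              # distance past the reference
    lateral = np.linalg.norm(offset - along * view_dir)
    return along > 0 and lateral < 0.5            # 0.5 m lateral tolerance (assumed)

camera = np.array([0.0, 1.5, 0.0])
table  = np.array([0.0, 0.7, 2.0])
chair  = np.array([0.1, 0.5, 2.8])
print(is_behind(chair, table, camera))            # True: chair is behind the table
```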
User Evaluation
- Performance time: speech + static paddle significantly faster
- Gesture-only condition less accurate for position/orientation
- Users preferred speech + paddle input
Subjective Surveys
2010 – Iron Man 2
To Make the Vision Real...
Hardware/software requirements:
- Contact lens displays
- Free-space hand/body tracking
- Speech/gesture recognition
- Etc.
Most importantly: usability/user experience
Natural Interaction
- Environmental awareness: automatically detecting the real environment, physically based interaction
- Gesture interaction: free-hand interaction
- Multimodal input: speech and gesture interaction
- Intelligent interfaces: implicit rather than explicit interaction
Environmental Awareness
AR MicroMachines
- AR experience with environment awareness and physically-based interaction
- Based on MS Kinect RGB-D sensor
- Augmented environment supports occlusion, shadows, and physically-based interaction between real and virtual objects
Clark, A., & Piumsomboon, T. (2011). A realistic augmented reality racing game using a depth-sensing camera. In Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry (pp. 499-502). ACM.
Operating Environment
Architecture
Our framework uses five libraries:
- OpenNI
- OpenCV
- OPIRA
- Bullet Physics
- OpenSceneGraph
System Flow
The system flow consists of three sections:
- Image Processing and Marker Tracking
- Physics Simulation
- Rendering
Physics Simulation
- Create a virtual mesh over the real world
- Updated at 10 fps – real objects can be moved
- Used by the physics engine for collision detection (virtual/real)
- Used by OpenSceneGraph for occlusion and shadows
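The mesh-building step could look roughly like the sketch below; this is an illustrative Python/pybullet approximation of the C++ pipeline described above, with grid resolution, scale, and the placeholder depth frame invented:

```python
# Illustrative sketch (not the original implementation): build a triangle
# mesh from a depth image and register it as a static collision body.
# pybullet stands in for the Bullet Physics library used in the talk.
import numpy as np
import pybullet as p

def depth_to_mesh(depth, scale=0.01, step=8):
    """Convert a depth image (metres) to vertices/triangles on a coarse grid."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    verts = np.stack([xs * scale, ys * scale, depth[ys, xs]], axis=-1).reshape(-1, 3)
    rows, cols = ys.shape
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            tris += [[i, i + 1, i + cols], [i + 1, i + cols + 1, i + cols]]
    return verts.tolist(), [k for t in tris for k in t]

p.connect(p.DIRECT)
depth = np.ones((120, 160), dtype=np.float32)        # placeholder depth frame
verts, inds = depth_to_mesh(depth)
shape = p.createCollisionShape(p.GEOM_MESH, vertices=verts, indices=inds,
                               flags=p.GEOM_FORCE_CONCAVE_TRIMESH)
p.createMultiBody(baseMass=0, baseCollisionShapeIndex=shape)  # static real-world mesh
```

In the actual system this mesh would be rebuilt at roughly 10 fps so that moved real objects keep participating in collisions.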
Rendering
- Occlusion
- Shadows
Gesture Interaction
Natural Hand Interaction
- Using bare hands to interact with AR content
- MS Kinect depth sensing
- Real-time hand tracking
- Physics-based simulation model
Hand Interaction
- Represent models as collections of spheres
- Bullet physics engine for interaction with the real world
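A minimal sketch of the sphere-proxy idea (pybullet stands in for the C++ Bullet engine used in the talk; the radius and update logic are assumptions):

```python
# Sketch of the sphere-proxy idea (assumed details, not the original code):
# each tracked hand point becomes a small kinematic sphere so the physics
# engine can resolve contacts between the hand and virtual objects.
import pybullet as p

p.connect(p.DIRECT)
SPHERE_RADIUS = 0.01  # 1 cm proxies (assumed)

def make_hand_proxies(points):
    """Create one static/kinematic sphere body per tracked hand point."""
    shape = p.createCollisionShape(p.GEOM_SPHERE, radius=SPHERE_RADIUS)
    return [p.createMultiBody(baseMass=0, baseCollisionShapeIndex=shape,
                              basePosition=pt) for pt in points]

def update_hand_proxies(bodies, points):
    """Move the proxies to the latest tracked positions each frame."""
    for body, pt in zip(bodies, points):
        p.resetBasePositionAndOrientation(body, pt, [0, 0, 0, 1])

proxies = make_hand_proxies([(0.0, 0.0, 0.5), (0.02, 0.0, 0.5)])
update_hand_proxies(proxies, [(0.0, 0.01, 0.5), (0.02, 0.01, 0.5)])
```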
Scene Interaction
- Render AR scene with OpenSceneGraph
- Use depth map for occlusion
- Shadows yet to be implemented
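The occlusion step amounts to a per-pixel depth test between the real and virtual scenes; the small numpy sketch below is only illustrative (the actual system performs this inside OpenSceneGraph, and the frames here are placeholders):

```python
# Per-pixel depth-test sketch of how real-world depth gives occlusion:
# virtual pixels behind the real surface are discarded.
import numpy as np

def composite(camera_rgb, real_depth, virtual_rgb, virtual_depth):
    """Overlay virtual content only where it is closer than the real scene."""
    visible = virtual_depth < real_depth          # virtual in front of real
    out = camera_rgb.copy()
    out[visible] = virtual_rgb[visible]
    return out

h, w = 4, 4
frame = composite(np.zeros((h, w, 3), np.uint8),
                  np.full((h, w), 2.0),           # real surface 2 m away
                  np.full((h, w, 3), 255, np.uint8),
                  np.full((h, w), 1.0))           # virtual object 1 m away
print(frame[0, 0])                                # virtual pixel wins here
```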
Architecture
5. Gesture
• Static Gestures
• Dynamic Gestures
• Context-based Gestures
4. Modeling
• Hand recognition/modeling
• Rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
1. Hardware Interface
o Supports PCL, OpenNI, OpenCV, and Kinect SDK.
o Provides access to depth, RGB, and XYZRGB data.
o Usage: capturing color image, depth image, and concatenated point clouds from a single camera or multiple cameras.
o For example: Kinect for Xbox 360, Kinect for Windows, Asus Xtion Pro Live.
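As an illustration, a similar capture step can be reproduced with OpenCV's OpenNI2 backend; this is one possible setup sketched under assumptions, not the framework's actual hardware-interface code:

```python
# Sketch of a hardware-interface layer using OpenCV's OpenNI2 capture backend.
import cv2

cap = cv2.VideoCapture(cv2.CAP_OPENNI2)           # Kinect / Xtion via OpenNI2
if not cap.isOpened():
    raise RuntimeError("No OpenNI-compatible depth camera found")

while cap.grab():
    ok_d, depth = cap.retrieve(None, cv2.CAP_OPENNI_DEPTH_MAP)   # uint16, millimetres
    ok_c, color = cap.retrieve(None, cv2.CAP_OPENNI_BGR_IMAGE)   # 8-bit BGR image
    if not (ok_d and ok_c):
        break
    cv2.imshow("color", color)
    cv2.imshow("depth", (depth / 16).astype("uint8"))            # rough visualisation
    if cv2.waitKey(1) == 27:                                     # Esc to quit
        break
```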
2. Segmentation
o Segment images and point clouds based on color, depth and space.
o Usage: Segmenting images or point clouds using color models, depth, or spatial properties such as location, shape and size.
o For example:
Skin color segmentation
Depth threshold
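A rough sketch of how skin colour and a depth threshold might be combined; the threshold values and colour range below are assumptions, not the tuned parameters of the framework:

```python
# Illustrative segmentation sketch: skin-colour model AND depth threshold.
import cv2
import numpy as np

def segment_hand(bgr, depth_mm, max_depth_mm=800):
    """Return a binary mask of skin-coloured pixels closer than max_depth_mm."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))   # rough skin range
    near = cv2.inRange(depth_mm, 1, max_depth_mm)              # ignore 0 = no data
    mask = cv2.bitwise_and(skin, near)
    # Clean up speckle with a small morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

bgr = np.zeros((480, 640, 3), np.uint8)            # placeholder frames
depth = np.full((480, 640), 600, np.uint16)
mask = segment_hand(bgr, depth)
```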
3. Classification/Tracking
o Identify and track objects between frames based on XYZRGB.
o Usage: Identifying current position/orientation of the tracked object in space.
o For example:
Training set of hand poses, colors represent unique regions of the hand.
Raw output (without cleaning) classified on real hand input (depth image).
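The per-pixel labelling idea can be sketched as follows; the depth-difference features, classifier choice, and synthetic training data are placeholders, not the framework's actual method:

```python
# Toy sketch of per-pixel hand-region labelling (the real system trains on a
# coloured hand-pose set; everything below is an assumption for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(depth, ys, xs, offset=5):
    """Simple depth-difference features around each pixel (assumed)."""
    d = depth[ys, xs].astype(np.float32)
    up    = depth[np.clip(ys - offset, 0, depth.shape[0] - 1), xs]
    right = depth[ys, np.clip(xs + offset, 0, depth.shape[1] - 1)]
    return np.stack([d, up - d, right - d], axis=1)

rng = np.random.default_rng(0)
depth = rng.uniform(400, 800, (120, 160)).astype(np.float32)   # fake depth frame
ys, xs = np.nonzero(depth < 700)                   # pretend hand mask
labels = rng.integers(0, 5, ys.size)               # 5 hand regions (training colours)

clf = RandomForestClassifier(n_estimators=20).fit(pixel_features(depth, ys, xs), labels)
pred = clf.predict(pixel_features(depth, ys, xs))  # raw, uncleaned per-pixel labels
```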
4. Modeling
o Hand Recognition/Modeling
  - Skeleton based (for a low-resolution approximation)
  - Model based (for a more accurate representation)
o Object Modeling (identification and tracking of rigid-body objects)
o Physical Modeling (physical interaction)
  - Sphere proxy, model based, or mesh based
o Usage: for general spatial interaction in AR/VR environments
5. Gesture
o Static (hand pose recognition)
o Dynamic (meaningful movement recognition)
o Context-based gesture recognition (gestures with context, e.g. pointing)
o Usage: issuing commands, anticipating user intention, and high-level interaction.
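For the static case, a minimal pose classifier might count convexity defects in the segmented hand mask; the thresholds and pose names below are assumptions, not the framework's recognizer:

```python
# Minimal static-pose sketch: classify open hand / pointing / fist from a
# binary hand mask by counting deep convexity defects (thresholds assumed).
import cv2
import numpy as np

def classify_pose(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return "no hand"
    hand = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(hand, returnPoints=False)
    deep = 0
    if len(hull) > 3:
        defects = cv2.convexityDefects(hand, hull)
        if defects is not None:
            # Deep valleys between extended fingers indicate an open hand.
            deep = int(np.sum(defects[:, 0, 3] > 8000))
    if deep >= 3:
        return "open hand"
    if deep in (1, 2):
        return "pointing"
    return "closed hand"

mask = np.zeros((200, 200), np.uint8)
cv2.circle(mask, (100, 100), 40, 255, -1)          # fist-like blob
print(classify_pose(mask))                          # -> "closed hand"
```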
Skeleton Based Interaction
3 Gear Systems
- Kinect/PrimeSense sensor
- Two-hand tracking
- http://www.threegear.com
Skeleton Interaction + AR
- HMD AR view, viewpoint tracking
- Two-hand input: skeleton interaction, occlusion
What Gestures do People Want to Use?
Limitations of previous work in AR:
- Limited range of gestures
- Gestures designed for optimal recognition
- Gestures studied as an add-on to speech
Solution – elicit desired gestures from users:
- E.g. gestures for surface computing [Wobbrock]
- Previous work on unistroke gestures, mobile gestures
User-Defined Gesture Study
- Use AR view: HMD + AR tracking
- Present AR animations: 40 tasks in six categories (editing, transforms, menu, etc.)
- Ask users to produce gestures causing the animations
- Record gestures (video, depth)
Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. (2013, April). User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems (pp. 955-960). ACM.
Data Recorded
- 20 participants
- Gestures recorded (video, depth data): 800 gestures from 40 tasks
- Subjective rankings: Likert ratings of goodness and ease of use
- Think-aloud transcripts
Typical Gestures
Results – Gestures
- Gestures grouped according to similarity: 320 groups
- 44 consensus groups (62% of all gestures); 276 low-similarity groups (discarded)
- 11 hand poses seen
- Degree of consensus (A) measured using the guessability score [Wobbrock]
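For reference, a Wobbrock-style agreement score for one task can be computed from the sizes of its identical-gesture groups; the group sizes in this sketch are invented, not the study data:

```python
# Sketch of Wobbrock-style agreement: sum of squared proportions of
# identical-gesture groups for a task (example numbers are invented).
def agreement(group_sizes):
    """group_sizes: sizes of identical-gesture groups proposed for one task."""
    total = sum(group_sizes)
    return sum((s / total) ** 2 for s in group_sizes)

# E.g. 20 participants: 12 made the same gesture, 5 another, 3 a third.
print(round(agreement([12, 5, 3]), 3))          # 0.445 -> fairly high consensus

# Overall score: mean agreement across all tasks.
tasks = [[12, 5, 3], [8, 8, 4], [20]]
print(round(sum(agreement(t) for t in tasks) / len(tasks), 3))
```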
Results – Agreement Scores
Red line: proportion of two-handed gestures
Usability Results
Significant difference between consensus and discarded gesture sets (p < 0.0001)
Gestures in consensus set better than discarded gestures in perceived performance and goodness
                      Consensus   Discarded
Ease of Performance      6.02        5.50
Good Match               6.17        5.83
Likert Scale [1-7], 7 = Very Good
Lessons Learned
- AR animation can elicit desired gestures
- For some tasks there is a high degree of similarity in user-defined gestures, especially command gestures (e.g. Open, Select)
- Less agreement in manipulation gestures: move (40%), rotate (30%), grouping (10%)
- Small proportion of two-handed gestures (22%): scaling, group selection
Multimodal Input
Multimodal Interaction
- Combined speech and gesture input; the two modalities are complementary
  - Speech: modal commands, quantities
  - Gesture: selection, motion, qualities
- Previous work found multimodal interfaces intuitive for 2D/3D graphics interaction
Wizard of Oz Study
- What speech and gesture input would people like to use?
- Wizard performs speech recognition and command interpretation
- Domain: 3D object interaction/modelling
Lee, M., & Billinghurst, M. (2008, October). A Wizard of Oz study for an AR multimodal interface. In Proceedings of the 10th international conference on Multimodal interfaces (pp. 249-256). ACM.
System Architecture
Hand Segmentation
System Set Up
Experiment
- 12 participants
- Two display conditions (HMD vs. Desktop)
- Three tasks
  - Task 1: Change object color/shape
  - Task 2: 3D positioning of objects
  - Task 3: Scene assembly
Key Results
- Most commands were multimodal: Multimodal (63%), Gesture (34%), Speech (4%)
- Most spoken phrases were short: 74% of phrases averaged 1.25 words; sentences (26%) averaged 3 words
- Main gestures were deictic (65%) and metaphoric (35%)
- In multimodal commands the gesture was issued first: 94% of the time the gesture began before speech
- Multimodal time window about 8 s, with speech on average 4.5 s after the gesture
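These timings motivate a simple time-window fusion strategy; the sketch below takes the 8-second window from the study but invents the surrounding data structures:

```python
# Hedged sketch of time-window multimodal fusion: a gesture is held pending
# and fused with speech that arrives within the window.
import time

FUSION_WINDOW_S = 8.0          # speech within 8 s of a gesture is fused (from the study)

class Fusion:
    def __init__(self):
        self.pending_gesture = None    # (name, target, timestamp)

    def on_gesture(self, name, target):
        self.pending_gesture = (name, target, time.time())

    def on_speech(self, phrase):
        if self.pending_gesture:
            name, target, t0 = self.pending_gesture
            if time.time() - t0 <= FUSION_WINDOW_S:
                self.pending_gesture = None
                return f"multimodal: '{phrase}' applied to {target} (via {name})"
        return f"speech only: '{phrase}'"

f = Fusion()
f.on_gesture("point", "chair")         # the gesture usually comes first
print(f.on_speech("make it red"))      # fused if within the window
```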
Free-Hand Multimodal Input
- Use bare hands to interact with AR content
- Recognize simple gestures: open hand, closed hand, pointing
- Gestures map to Point, Move, and Pick/Drop actions
Lee, M., Billinghurst, M., Baek, W., Green, R., & Woo, W. (2013). A usability study of multimodal input in an augmented reality environment. Virtual Reality, 17(4), 293-305.
Speech Input
- MS Speech + MS SAPI (> 90% accuracy)
- Single-word speech commands
Multimodal Architecture
Multimodal Fusion
Hand Occlusion
Experimental Setup
Change object shape and colour
User Evaluation
- 25 subjects, 10 task trials × 3, 3 conditions
- Task: change object shape, colour and position
- Conditions: speech only, gesture only, multimodal
- Measures: performance time, errors (system/user), subjective survey
Results – Performance
Average performance time:
- Gesture: 15.44 s
- Speech: 12.38 s
- Multimodal: 11.78 s
Significant difference across conditions (p < 0.01); the difference is between gesture and speech/MMI
Errors
- User errors (per task): Gesture (0.50), Speech (0.41), MMI (0.42) – no significant difference
- System errors: speech accuracy 94%, gesture accuracy 85%, MMI accuracy 90%
Subjective Results (Likert 1-7)
- Gesture rated significantly worse; MMI and speech rated the same
- MMI perceived as most efficient
- Preference: 70% MMI, 25% speech only, 5% gesture only

                  Gesture   Speech   MMI
Naturalness         4.60     5.60    5.80
Ease of Use         4.00     5.90    6.00
Efficiency          4.45     5.15    6.05
Physical Effort     4.75     3.15    3.85
Observations
- Significant difference in number of commands: Gesture (6.14), Speech (5.23), MMI (4.93)
- MMI simultaneous vs. sequential commands: 79% sequential, 21% simultaneous
- Reaction to system errors: users almost always repeated the same command; in MMI they rarely changed modalities
Lessons Learned
- Multimodal interaction significantly better than gesture alone in AR interfaces for 3D tasks: shorter task time, more efficient
- Users felt that MMI was more natural, easier, and more effective than gesture/speech only
- Simultaneous input rarely used
- More studies need to be conducted
Intelligent Interfaces
Intelligent Interfaces
Most AR systems are "stupid":
- Don't recognize user behaviour
- Don't provide feedback
- Don't adapt to the user
Especially important for training:
- Scaffolded learning
- Moving beyond check-lists of actions
Intelligent Interfaces
- AR interface + intelligent tutoring system
- ASPIRE constraint-based system (from UC)
- Constraints: relevance condition, satisfaction condition, feedback (sketched below)
Westerfield, G., Mitrovic, A., & Billinghurst, M. (2013). Intelligent Augmented Reality Training for Assembly Tasks. In Artificial Intelligence in Education (pp. 542-551). Springer Berlin Heidelberg.
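A constraint in this style pairs a relevance condition with a satisfaction condition and a feedback message; the assembly-domain example below is invented for illustration and is not taken from ASPIRE:

```python
# Minimal sketch of a constraint-based check (relevance condition,
# satisfaction condition, feedback); domain details are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Constraint:
    relevance: Callable[[Dict], bool]      # does this constraint apply to the state?
    satisfaction: Callable[[Dict], bool]   # is it satisfied in the current state?
    feedback: str                          # shown in AR when relevant but violated

constraints = [
    Constraint(relevance=lambda s: "motor" in s["placed"],
               satisfaction=lambda s: "bracket" in s["placed"],
               feedback="Attach the bracket before mounting the motor."),
]

def check(state):
    """Return corrective feedback for every relevant but unsatisfied constraint."""
    return [c.feedback for c in constraints
            if c.relevance(state) and not c.satisfaction(state)]

print(check({"placed": {"motor"}}))              # -> corrective feedback
print(check({"placed": {"bracket", "motor"}}))   # -> []
```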
Domain Ontology
Intelligent Feedback
- Actively monitors user behaviour: implicit vs. explicit interaction
- Provides corrective feedback
Evaluation Results
- 16 subjects, with and without the ITS
- Improved task completion
- Improved learning
Intelligent Agents
- AR characters: virtual embodiment of the system
- Multimodal input/output
- Examples: AR Lego, Welbo, etc.
- Mr Virtuoso
  - AR character more real, more fun
  - On-screen 3D and AR similar in usefulness
Wagner, D., Billinghurst, M., & Schmalstieg, D. (2006). How real should virtual characters be?. In Proceedings of the 2006 ACM SIGCHI international conference on Advances in computer entertainment technology (p. 57). ACM.
Looking to the Future
What’s Next?
Directions for Future Research
- Mobile gesture interaction: tablet, phone interfaces
- Wearable systems: Google Glass
- Novel displays: contact lens
- Environmental understanding: semantic representation
Mobile Gesture Interaction
Motivation:
- Richer interaction with handheld devices
- Natural interaction with handheld AR
- 2D tracking: fingertip tracking
- 3D tracking: hand tracking
[Hurst and Wezel 2013]
[Henrysson et al. 2007]
Henrysson, A., Marshall, J., & Billinghurst, M. (2007). Experiments in 3D interaction for mobile phone AR. In Proceedings of the 5th international conference on Computer graphics and interactive techniques in Australia and Southeast Asia (pp. 187-194). ACM.
Fingertip Based Interaction
System Setup Running System
Bai, H., Gao, L., El-Sana, J., & Billinghurst, M. (2013). Markerless 3D gesture-based interaction for handheld augmented reality interfaces. In SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications (p. 22). ACM.
Mobile Client + PC Server
System Architecture
3D Prototype System
- 3 Gear + Vuforia: hand tracking + phone tracking
- Freehand interaction on the phone
- Skeleton model, 3D interaction
- 20 fps performance
Google Glass
User Experience
- Truly wearable computing: less than 46 grams
- Hands-free information access: voice interaction, ego-vision camera
- Intuitive user interface: touch, gesture, speech, head motion
- Access to all Google services: Maps, Search, Location, Messaging, Email, etc.
Contact Lens Display
- Babak Parviz, University of Washington
- MEMS components: transparent elements, micro-sensors
- Challenges: miniaturization, assembly, eye safety
Contact Lens Prototype
Environmental Understanding
Semantic understanding of the environment:
- What are the key objects?
- What are their relationships?
- Can they be represented in a form suitable for multimodal interaction?
Conclusion
Conclusions
- AR experiences need new interaction methods
- Enabling technologies are advancing quickly: displays, tracking, depth-capture devices
- Natural user interfaces are possible: free-hand gesture, speech, intelligent interfaces
- Important research for the future: mobile, wearable, displays
More Information
• Mark Billinghurst – Email: [email protected]
– Twitter: @marknb00
• Website – http://www.hitlabnz.org/