
Hands and Speech in Space: Multimodal Input for Augmented Reality


A keynote talk given by Mark Billinghurst at the ICMI 2013 conference, December 12th 2013. The talk covers how speech and gesture input can be used to interact with Augmented Reality interfaces.


Page 1: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hands and Speech in Space: Multimodal Interaction for AR

Mark Billinghurst

[email protected]

The HIT Lab NZ, University of Canterbury

December 12th 2013

Page 2: Hands and Speech in Space: Multimodal Input for Augmented Reality

1977 – Star Wars

Page 3: Hands and Speech in Space: Multimodal Input for Augmented Reality

Augmented Reality Definition

Defining characteristics:
- Combines real and virtual images: both can be seen at the same time
- Interactive in real time: the virtual content can be interacted with
- Registered in 3D: virtual objects appear fixed in space

Azuma, R. T. (1997). A survey of augmented reality. Presence, 6(4), 355-385.

Page 4: Hands and Speech in Space: Multimodal Input for Augmented Reality

Augmented Reality Today

Page 5: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR Interface Components

Key question: how should a person interact with Augmented Reality content? Interaction connects the physical and the virtual.

Components: Physical Elements, Virtual Elements, Interaction Metaphor, Input, Output

Page 6: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR Interaction Metaphors

- Information Browsing: view AR content
- 3D AR Interfaces: 3D UI interaction techniques
- Augmented Surfaces: tangible UI techniques
- Tangible AR: tangible UI input + AR output

Page 7: Hands and Speech in Space: Multimodal Input for Augmented Reality

VOMAR Demo (Kato 2000)

AR furniture arranging. Elements and interactions:
- Book: turn over the page
- Paddle: push, shake, incline, hit, scoop (detection of two of these paddle gestures is sketched below)

Kato, H., Billinghurst, M., et al. (2000). Virtual Object Manipulation on a Table-Top AR Environment. In Proceedings of the International Symposium on Augmented Reality (ISAR 2000), Munich, Germany, 111-119.
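As a rough illustration of how paddle gestures like these might be detected from the tracked paddle pose, here is a hedged Python sketch; the pose format, thresholds and function names are assumptions, not the original VOMAR implementation.

```python
import numpy as np

# Hedged sketch of detecting two paddle gestures (incline, shake) from the
# tracked paddle pose. Thresholds and pose format are illustrative only.

def is_inclined(paddle_normal, threshold_deg=35.0):
    """'Incline' when the paddle normal tilts away from vertical."""
    up = np.array([0.0, 1.0, 0.0])
    n = np.asarray(paddle_normal, float)
    angle = np.degrees(np.arccos(np.clip(np.dot(n, up) / np.linalg.norm(n), -1, 1)))
    return angle > threshold_deg

def is_shaking(recent_positions, min_travel=0.3):
    """'Shake' when the paddle travels a lot but ends near where it began."""
    p = np.asarray(recent_positions, float)
    travel = np.sum(np.linalg.norm(np.diff(p, axis=0), axis=1))
    return travel > min_travel and np.linalg.norm(p[-1] - p[0]) < 0.05

print(is_inclined([0.5, 0.7, 0.0]))
print(is_shaking([[0, 0, 0], [0.1, 0, 0], [-0.1, 0, 0], [0.01, 0, 0]]))
```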

Page 8: Hands and Speech in Space: Multimodal Input for Augmented Reality

Opportunities for Multimodal Input

Multimodal interfaces are a natural fit for AR:
- Need for non-GUI interfaces
- Natural interaction with the real world
- Natural support for body input
- Previous work has shown the value of multimodal input for 3D graphics

Page 9: Hands and Speech in Space: Multimodal Input for Augmented Reality

Related Work

Related work in 3D graphics/VR:
- Interaction with 3D content [Chu 1997]
- Navigating through virtual worlds [Krum 2002]
- Interacting with virtual characters [Billinghurst 1998]

Little earlier work in AR:
- Requires additional input devices
- Few formal usability studies
- e.g. Olwal et al. [2003] SenseShapes

Page 10: Hands and Speech in Space: Multimodal Input for Augmented Reality

Examples

SenseShapes [Olwal 2003], Kölsch [2006]

Page 11: Hands and Speech in Space: Multimodal Input for Augmented Reality

Marker Based Multimodal Interface

- Add speech recognition to VOMAR
- Paddle + speech commands

Irawati, S., Green, S., Billinghurst, M., Duenser, A., & Ko, H. (2006). "Move the couch where?": Developing an augmented reality multimodal interface. In Mixed and Augmented Reality, 2006 (ISMAR 2006), IEEE/ACM International Symposium on (pp. 183-186). IEEE.

Page 12: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 13: Hands and Speech in Space: Multimodal Input for Augmented Reality

Commands Recognized

- Create command ("Make a blue chair"): create a virtual object and place it on the paddle.
- Duplicate command ("Copy this"): duplicate a virtual object and place it on the paddle.
- Grab command ("Grab table"): select a virtual object and place it on the paddle.
- Place command ("Place here"): place the attached object in the workspace.
- Move command ("Move the couch"): attach a virtual object in the workspace to the paddle so that it follows the paddle movement.
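To illustrate how such speech commands could be combined with the paddle pose, here is a minimal, hypothetical dispatcher sketch; the command table, handler output and scene lookup are invented for illustration and are not the original system's grammar.

```python
# Minimal sketch of a speech-command dispatcher for the paddle interface.
# Command names follow the slide; the handler output and scene objects are
# hypothetical placeholders, not the original VOMAR implementation.

COMMANDS = {
    "make":  "create",     # "Make a blue chair"
    "copy":  "duplicate",  # "Copy this"
    "grab":  "grab",       # "Grab table"
    "place": "place",      # "Place here"
    "move":  "move",       # "Move the couch"
}

def interpret(utterance, paddle_pose, scene):
    """Map a recognized phrase plus the current paddle pose to an action."""
    words = utterance.lower().split()
    verb = next((COMMANDS[w] for w in words if w in COMMANDS), None)
    if verb is None:
        return None
    # Deictic words ("this", "here") are resolved with the paddle pose,
    # object names ("chair", "table", "couch") with a scene lookup.
    target = next((w for w in words if w in scene), None)
    return {"action": verb, "target": target, "paddle_pose": paddle_pose}

scene = {"chair": (0, 0, 0), "table": (1, 0, 0), "couch": (2, 0, 0)}
print(interpret("Make a blue chair", paddle_pose=(0.2, 0.0, 0.1), scene=scene))
```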

Page 14: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 15: Hands and Speech in Space: Multimodal Input for Augmented Reality

Object Relationships

"Put chair behind the table” Where is behind?

View specific regions
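One way to resolve a view-dependent relation like "behind" is to offset the placed object away from the viewer along the line from the camera to the reference object. The sketch below assumes simple Cartesian positions and an illustrative offset; it is a simplified stand-in, not the region-based approach of the actual system.

```python
import numpy as np

# Hypothetical sketch: place an object "behind" a reference object relative
# to the user's viewpoint by moving it along the viewing direction.

def place_behind(reference_pos, camera_pos, offset=0.5):
    """Return a position 'behind' reference_pos as seen from camera_pos."""
    view_dir = np.asarray(reference_pos, float) - np.asarray(camera_pos, float)
    view_dir[1] = 0.0                      # keep the object on the table plane
    view_dir /= np.linalg.norm(view_dir)   # unit vector pointing away from the viewer
    return np.asarray(reference_pos, float) + offset * view_dir

print(place_behind(reference_pos=[1.0, 0.0, 0.0], camera_pos=[0.0, 0.0, -2.0]))
```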

Page 16: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Evaluation

- Performance time: speech + static paddle significantly faster
- Gesture-only condition less accurate for position/orientation
- Users preferred speech + paddle input

Page 17: Hands and Speech in Space: Multimodal Input for Augmented Reality

Subjective Surveys

Page 18: Hands and Speech in Space: Multimodal Input for Augmented Reality

2010 – Iron Man 2

Page 19: Hands and Speech in Space: Multimodal Input for Augmented Reality

To Make the Vision Real...

Hardware/software requirements:
- Contact lens displays
- Free-space hand/body tracking
- Speech/gesture recognition
- Etc.

Most importantly: usability / user experience

Page 20: Hands and Speech in Space: Multimodal Input for Augmented Reality

Natural Interaction

- Automatically detecting the real environment: environmental awareness, physically-based interaction
- Gesture interaction: free-hand interaction
- Multimodal input: speech and gesture interaction
- Intelligent interfaces: implicit rather than explicit interaction

Page 21: Hands and Speech in Space: Multimodal Input for Augmented Reality

Environmental Awareness

Page 22: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR MicroMachines

- AR experience with environment awareness and physically-based interaction
- Based on the MS Kinect RGB-D sensor
- Augmented environment supports occlusion, shadows, and physically-based interaction between real and virtual objects

Clark, A., & Piumsomboon, T. (2011). A realistic augmented reality racing game using a depth-sensing camera. In Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry (pp. 499-502). ACM.

Page 23: Hands and Speech in Space: Multimodal Input for Augmented Reality

Operating Environment

Page 24: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture

Our framework uses five libraries:
- OpenNI
- OpenCV
- OPIRA
- Bullet Physics
- OpenSceneGraph

Page 25: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Flow

The system flow consists of three sections:
- Image processing and marker tracking
- Physics simulation
- Rendering
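A per-frame loop tying the three sections together might look like the sketch below; the capture, tracker, physics and renderer objects are placeholders standing in for the real OpenCV/OPIRA, Bullet and OpenSceneGraph components.

```python
# Illustrative main loop for the three stages named above. All objects are
# placeholder abstractions, not the original system's classes.

def run_frame(capture, tracker, physics, renderer, dt=1.0 / 30.0):
    rgb, depth = capture.read()               # 1. image acquisition
    camera_pose = tracker.update(rgb)          #    marker tracking (e.g. OPIRA)
    physics.step(depth, dt)                    # 2. physics simulation
    renderer.draw(rgb, depth, camera_pose,     # 3. rendering with occlusion/shadows
                  physics.virtual_objects())
```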

Page 26: Hands and Speech in Space: Multimodal Input for Augmented Reality

Physics Simulation

- Create a virtual mesh over the real world
- Updated at 10 fps, so real objects can be moved
- Used by the physics engine for collision detection (virtual/real)
- Used by OpenSceneGraph for occlusion and shadows
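A minimal sketch of the mesh-creation step, assuming a pinhole camera model with illustrative intrinsics; in the real system a mesh like this is handed to Bullet for collision detection between real and virtual objects.

```python
import numpy as np

# Hedged sketch: convert a depth image into a triangle mesh that a physics
# engine could use for real/virtual collisions. Intrinsics are illustrative.

def depth_to_mesh(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5, step=8):
    """Back-project a (H, W) depth image (metres) into vertices + triangles."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    z = depth[ys, xs]
    verts = np.stack([(xs - cx) * z / fx, (ys - cy) * z / fy, z], axis=-1)
    rows, cols = verts.shape[:2]
    idx = np.arange(rows * cols).reshape(rows, cols)
    quads = np.stack([idx[:-1, :-1], idx[1:, :-1], idx[1:, 1:], idx[:-1, 1:]], -1)
    tris = np.concatenate([quads[..., [0, 1, 2]], quads[..., [0, 2, 3]]]).reshape(-1, 3)
    return verts.reshape(-1, 3), tris

verts, tris = depth_to_mesh(np.full((480, 640), 1.5))
print(verts.shape, tris.shape)  # mesh would be refreshed roughly 10 times per second
```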

Page 27: Hands and Speech in Space: Multimodal Input for Augmented Reality

Rendering

Occlusion and shadows

Page 28: Hands and Speech in Space: Multimodal Input for Augmented Reality

Gesture Interaction

Page 29: Hands and Speech in Space: Multimodal Input for Augmented Reality

Natural Hand Interaction

- Using bare hands to interact with AR content
- MS Kinect depth sensing
- Real-time hand tracking
- Physics-based simulation model

Page 30: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Interaction

- Represent hand models as collections of spheres
- Bullet physics engine for interaction with the real world
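The sphere-proxy idea can be sketched as follows; the joint positions, radii and the box intersection test are illustrative stand-ins rather than the actual Bullet setup.

```python
import numpy as np

# Sketch of sphere proxies: approximate the tracked hand as a set of spheres
# so a physics engine can resolve contacts with virtual objects.

def hand_to_spheres(joint_positions, radius=0.012):
    """Return (centre, radius) proxies for each tracked hand joint."""
    return [(np.asarray(p, float), radius) for p in joint_positions]

def sphere_hits_box(centre, radius, box_min, box_max):
    """Simple sphere vs. axis-aligned box test, e.g. for picking/pushing."""
    closest = np.clip(centre, box_min, box_max)
    return np.linalg.norm(centre - closest) <= radius

joints = [(0.0, 0.0, 0.5), (0.02, 0.0, 0.5), (0.04, 0.01, 0.5)]
for c, r in hand_to_spheres(joints):
    print(sphere_hits_box(c, r, np.array([-0.05, -0.05, 0.45]),
                          np.array([0.05, 0.05, 0.55])))
```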

Page 31: Hands and Speech in Space: Multimodal Input for Augmented Reality

Scene Interaction

- Render the AR scene with OpenSceneGraph
- Depth map used for occlusion
- Shadows yet to be implemented

Page 32: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture

5. Gesture: static gestures, dynamic gestures, context-based gestures
4. Modeling: hand recognition/modeling, rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
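A toy composition of the five stages, with each stage as a placeholder callable, shows how data would flow from hardware capture up to a recognized gesture; all stage implementations here are dummies, not the real library components.

```python
# Illustrative composition of the five pipeline stages listed above.

class GesturePipeline:
    def __init__(self, hardware, segmenter, tracker, modeller, recognizer):
        self.stages = [hardware, segmenter, tracker, modeller, recognizer]

    def process(self, frame=None):
        data = frame
        for stage in self.stages:         # 1. hardware ... 5. gesture
            data = stage(data)
        return data                        # recognized gesture / command

pipeline = GesturePipeline(
    hardware=lambda _: {"rgb": None, "depth": None},
    segmenter=lambda d: {**d, "hand_mask": None},
    tracker=lambda d: {**d, "hand_blobs": []},
    modeller=lambda d: {**d, "skeleton": None},
    recognizer=lambda d: {**d, "gesture": "open_hand"},
)
print(pipeline.process()["gesture"])
```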

Page 33: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 1. Hardware Interface

- Supports PCL, OpenNI, OpenCV, and the Kinect SDK
- Provides access to depth, RGB, and XYZRGB data
- Usage: capturing colour images, depth images and concatenated point clouds from a single camera or multiple cameras
- Example devices: Kinect for Xbox 360, Kinect for Windows, Asus Xtion Pro Live

Page 34: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 2. Segmentation

- Segments images and point clouds based on colour, depth and space
- Usage: segmenting images or point clouds using colour models, depth, or spatial properties such as location, shape and size
- Examples: skin-colour segmentation, depth thresholding
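A hedged OpenCV sketch combining the two cues named above (skin colour and a depth threshold); the HSV bounds and the depth range are illustrative values, not those used in the actual system.

```python
import cv2
import numpy as np

# Sketch: HSV skin-colour mask combined with a depth slice, then cleaned
# with a morphological opening. Thresholds are illustrative assumptions.

def segment_hand(bgr, depth_mm, near=400, far=900):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))    # skin-colour mask
    close = cv2.inRange(depth_mm, near, far)                 # depth threshold
    mask = cv2.bitwise_and(skin, close)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

bgr = np.zeros((480, 640, 3), np.uint8)
depth = np.full((480, 640), 600, np.uint16)
print(segment_hand(bgr, depth).shape)
```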

Page 35: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 3. Classification/Tracking

- Identifies and tracks objects between frames based on XYZRGB data
- Usage: identifying the current position/orientation of the tracked object in space
- Example: a training set of hand poses, with colours representing unique regions of the hand, and raw (uncleaned) classifier output on real hand input from the depth image
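Frame-to-frame tracking can be illustrated with a simple nearest-centroid association; this is a stand-in for the real classifier/tracker stage, and the distance threshold is an assumption.

```python
import numpy as np

# Minimal sketch of frame-to-frame association: match each previous blob
# centroid to the closest blob in the new frame, within a distance limit.

def associate(prev_centroids, new_centroids, max_dist=0.15):
    """Return {prev_index: new_index} matches within max_dist metres."""
    matches = {}
    for i, p in enumerate(prev_centroids):
        d = np.linalg.norm(np.asarray(new_centroids) - np.asarray(p), axis=1)
        j = int(np.argmin(d))
        if d[j] <= max_dist:
            matches[i] = j
    return matches

print(associate([(0.0, 0.0, 0.6)], [(0.02, 0.01, 0.61), (0.4, 0.0, 0.8)]))
```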

Page 36: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 4. Modeling

- Hand recognition/modeling: skeleton based (for a low-resolution approximation) or model based (for a more accurate representation)
- Object modeling: identification and tracking of rigid-body objects
- Physical modeling (physical interaction): sphere proxy, model based, or mesh based
- Usage: general spatial interaction in AR/VR environments

Page 37: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 5. Gesture

- Static: hand pose recognition
- Dynamic: meaningful movement recognition
- Context-based: gesture recognition with context, e.g. pointing
- Usage: issuing commands, anticipating user intention, and high-level interaction
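The three gesture types might be distinguished roughly as in the sketch below; the pose rules, swipe threshold and selection logic are invented for illustration, not the library's actual recognizers.

```python
import numpy as np

# Illustrative classifiers for the three gesture types on the slide.

def static_gesture(extended_fingers):
    """Static: label a hand pose from a finger count (illustrative rule)."""
    return {0: "closed_hand", 1: "pointing"}.get(extended_fingers, "open_hand")

def dynamic_gesture(palm_trajectory, min_dist=0.25):
    """Dynamic: label a swipe when the palm travels far enough."""
    traj = np.asarray(palm_trajectory, float)
    if len(traj) < 2 or np.linalg.norm(traj[-1] - traj[0]) < min_dist:
        return None
    dx, dy = traj[-1][:2] - traj[0][:2]
    return "swipe_right" if abs(dx) > abs(dy) and dx > 0 else "swipe"

def context_gesture(pose, pointed_object):
    """Context-based: a pointing pose plus a hit-tested object is a selection."""
    return ("select", pointed_object) if pose == "pointing" and pointed_object else None

print(static_gesture(1), dynamic_gesture([(0, 0, 0.6), (0.3, 0.02, 0.6)]))
print(context_gesture("pointing", "chair"))
```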

Page 38: Hands and Speech in Space: Multimodal Input for Augmented Reality

Skeleton Based Interaction

- 3Gear Systems
- Kinect/PrimeSense sensor
- Two-hand tracking
- http://www.threegear.com

Page 39: Hands and Speech in Space: Multimodal Input for Augmented Reality

Skeleton Interaction + AR

- HMD AR view with viewpoint tracking
- Two-hand input
- Skeleton interaction, occlusion

Page 40: Hands and Speech in Space: Multimodal Input for Augmented Reality

What Gestures Do People Want to Use?

Limitations of previous work in AR:
- Limited range of gestures
- Gestures designed for optimal recognition
- Gestures studied as an add-on to speech

Solution: elicit the desired gestures from users
- e.g. gestures for surface computing [Wobbrock]
- Previous work on unistroke gestures, mobile gestures

Page 41: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Defined Gesture Study

- Use an AR view: HMD + AR tracking
- Present AR animations: 40 tasks in six categories (editing, transforms, menu, etc.)
- Ask users to produce gestures that would cause the animations
- Record each gesture (video, depth)

Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. (2013). User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems (pp. 955-960). ACM.

Page 42: Hands and Speech in Space: Multimodal Input for Augmented Reality

Data Recorded

- 20 participants
- Gestures recorded (video, depth data): 800 gestures from 40 tasks
- Subjective rankings: Likert ratings of goodness and ease of use
- Think-aloud transcripts

Page 43: Hands and Speech in Space: Multimodal Input for Augmented Reality

Typical Gestures

Page 44: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Gestures

- Gestures grouped according to similarity: 320 groups
- 44 consensus groups (62% of all gestures)
- 276 low-similarity groups (discarded)
- 11 hand poses seen
- Degree of consensus (A) measured with the guessability agreement score [Wobbrock] (sketched below)
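The agreement (guessability) score from Wobbrock et al. sums, for each task, the squared proportions of gestures that fall into the same group, then averages over tasks. A small sketch, with invented example numbers:

```python
# Agreement score: for each referent (task), sum the squared proportions of
# proposals in each group, then average over all referents.

def agreement(groups_per_referent):
    """groups_per_referent: list of lists of group sizes, one list per task."""
    scores = []
    for sizes in groups_per_referent:
        total = sum(sizes)
        scores.append(sum((s / total) ** 2 for s in sizes))
    return sum(scores) / len(scores)

# Example: one task where 15 of 20 participants proposed the same gesture
# (plus a group of 3 and two singletons), and one task with no consensus.
print(agreement([[15, 3, 1, 1], [1] * 20]))
```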

Page 45: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Agreement Scores

(Chart of agreement scores per task; the red line shows the proportion of two-handed gestures.)

Page 46: Hands and Speech in Space: Multimodal Input for Augmented Reality

Usability Results

- Significant difference between the consensus and discarded gesture sets (p < 0.0001)
- Gestures in the consensus set rated better than discarded gestures in perceived performance and goodness

                      Consensus   Discarded
Ease of performance   6.02        5.50
Good match            6.17        5.83

Likert scale [1-7], 7 = very good

Page 47: Hands and Speech in Space: Multimodal Input for Augmented Reality

Lessons Learned

- AR animation can elicit desired gestures
- For some tasks there is a high degree of similarity in user-defined gestures, especially command gestures (e.g. open) and selection
- Less agreement for manipulation gestures: move (40%), rotate (30%), grouping (10%)
- Small proportion of two-handed gestures (22%): scaling, group selection

Page 48: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Input

Page 49: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Interaction

- Combined speech and gesture input; the two are complementary
- Speech: modal commands, quantities
- Gesture: selection, motion, qualities
- Previous work found multimodal interfaces intuitive for 2D/3D graphics interaction

Page 50: Hands and Speech in Space: Multimodal Input for Augmented Reality

Wizard of Oz Study

- What speech and gesture input would people like to use?
- Wizard: performs speech recognition and command interpretation
- Domain: 3D object interaction/modelling

Lee, M., & Billinghurst, M. (2008). A Wizard of Oz study for an AR multimodal interface. In Proceedings of the 10th International Conference on Multimodal Interfaces (pp. 249-256). ACM.

Page 51: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 52: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Segmentation

Page 53: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Set Up

Page 54: Hands and Speech in Space: Multimodal Input for Augmented Reality

Experiment

- 12 participants
- Two display conditions (HMD vs. desktop)
- Three tasks:
  - Task 1: change object colour/shape
  - Task 2: 3D positioning of objects
  - Task 3: scene assembly

Page 55: Hands and Speech in Space: Multimodal Input for Augmented Reality

Key Results

- Most commands were multimodal: multimodal (63%), gesture only (34%), speech only (4%)
- Most spoken phrases were short: 74% of phrases averaged 1.25 words; sentences (26%) averaged 3 words
- Main gestures were deictic (65%) and metaphoric (35%)
- In multimodal commands the gesture was issued first: 94% of the time the gesture began before the speech; multimodal time window of 8 s, with speech on average 4.5 s after the gesture
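These timing results suggest a time-window fusion strategy: buffer the gesture and fuse it with any speech command that arrives within the window. The sketch below uses the 8 s window from the slide; the class and event format are invented for illustration, not the study's actual fusion module.

```python
import time

# Hedged sketch of time-window multimodal fusion: gestures usually arrive
# first, so a gesture is buffered and fused with speech that follows soon.

class MultimodalFusion:
    def __init__(self, window_s=8.0):
        self.window_s = window_s
        self.pending_gesture = None     # (timestamp, gesture, target)

    def on_gesture(self, gesture, target):
        self.pending_gesture = (time.time(), gesture, target)

    def on_speech(self, command):
        if self.pending_gesture is not None:
            t0, gesture, target = self.pending_gesture
            self.pending_gesture = None
            if time.time() - t0 <= self.window_s:
                return {"command": command, "gesture": gesture, "target": target}
        return {"command": command}      # speech-only fallback

fusion = MultimodalFusion()
fusion.on_gesture("pointing", target="chair")
print(fusion.on_speech("red"))           # fused "make this red"-style command
```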

Page 56: Hands and Speech in Space: Multimodal Input for Augmented Reality

Free Hand Multimodal Input

- Use the free hand to interact with AR content
- Recognize simple gestures: open hand, closed hand, pointing
- Gestures mapped to point, move, and pick/drop actions

Lee, M., Billinghurst, M., Baek, W., Green, R., & Woo, W. (2013). A usability study of multimodal input in an augmented reality environment. Virtual Reality, 17(4), 293-305.

Page 57: Hands and Speech in Space: Multimodal Input for Augmented Reality

Speech Input

- MS Speech + MS SAPI (> 90% accuracy)
- Single-word speech commands

Page 58: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Architecture

Page 59: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Fusion

Page 60: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Occlusion

Page 61: Hands and Speech in Space: Multimodal Input for Augmented Reality

Experimental Setup

Change object shape and colour

Page 62: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Evaluation

- 25 subjects, 10 task trials x 3, 3 conditions
- Task: change object shape, colour and position
- Conditions: speech only, gesture only, multimodal
- Measures: performance time, errors (system/user), subjective survey

Page 63: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Performance

- Average performance time: gesture 15.44 s, speech 12.38 s, multimodal 11.78 s
- Significant difference across conditions (p < 0.01)
- Difference between gesture and speech/MMI

Page 64: Hands and Speech in Space: Multimodal Input for Augmented Reality

Errors

- User errors (errors per task): gesture (0.50), speech (0.41), MMI (0.42); no significant difference
- System errors: speech accuracy 94%, gesture accuracy 85%, MMI accuracy 90%

Page 65: Hands and Speech in Space: Multimodal Input for Augmented Reality

Subjective Results (Likert 1-7)

- User subjective survey: gesture rated significantly worse; MMI and speech rated the same
- MMI perceived as most efficient
- Preference: 70% MMI, 25% speech only, 5% gesture only

                  Gesture   Speech   MMI
Naturalness       4.60      5.60     5.80
Ease of Use       4.00      5.90     6.00
Efficiency        4.45      5.15     6.05
Physical Effort   4.75      3.15     3.85

Page 66: Hands and Speech in Space: Multimodal Input for Augmented Reality

Observations

- Significant difference in number of commands: gesture (6.14), speech (5.23), MMI (4.93)
- MMI simultaneous vs. sequential commands: 79% sequential, 21% simultaneous
- Reaction to system errors: users almost always repeated the same command; in MMI they rarely changed modalities

Page 67: Hands and Speech in Space: Multimodal Input for Augmented Reality

Lessons Learned

- Multimodal interaction is significantly better than gesture alone in AR interfaces for 3D tasks: shorter task times, more efficient
- Users felt that MMI was more natural, easier, and more effective than gesture-only or speech-only input
- Simultaneous input was rarely used
- More studies need to be conducted

Page 68: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

Page 69: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

Most AR systems are not intelligent:
- They don't recognize user behaviour
- They don't provide feedback
- They don't adapt to the user

This is especially important for training:
- Scaffolded learning
- Moving beyond checklists of actions

Page 70: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

- AR interface + intelligent tutoring system
- ASPIRE constraint-based system (from the University of Canterbury)
- Constraints: relevance condition, satisfaction condition, feedback

Westerfield, G., Mitrovic, A., & Billinghurst, M. (2013). Intelligent Augmented Reality Training for Assembly Tasks. In Artificial Intelligence in Education (pp. 542-551). Springer Berlin Heidelberg.
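A constraint in this style pairs a relevance condition and a satisfaction condition with a feedback message; the sketch below uses an invented assembly-task state and conditions, and is not the actual ASPIRE implementation.

```python
# Minimal sketch of a constraint for a constraint-based tutor:
# relevance condition, satisfaction condition, feedback message.

class Constraint:
    def __init__(self, relevance, satisfaction, feedback):
        self.relevance, self.satisfaction, self.feedback = relevance, satisfaction, feedback

    def check(self, state):
        """Return feedback if the constraint is relevant but violated."""
        if self.relevance(state) and not self.satisfaction(state):
            return self.feedback
        return None

constraints = [
    Constraint(
        relevance=lambda s: s.get("step") == "attach_gear",       # hypothetical step
        satisfaction=lambda s: s.get("base_fixed", False),
        feedback="Fix the base plate before attaching the gear.",
    ),
]

state = {"step": "attach_gear", "base_fixed": False}
print([c.check(state) for c in constraints if c.check(state)])
```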

Page 71: Hands and Speech in Space: Multimodal Input for Augmented Reality

Domain Ontology

Page 72: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Feedback

- Actively monitors user behaviour
- Implicit vs. explicit interaction
- Provides corrective feedback

Page 73: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 74: Hands and Speech in Space: Multimodal Input for Augmented Reality

Evaluation Results

- 16 subjects, with and without the ITS
- Improved task completion
- Improved learning

Page 75: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Agents

- AR characters: virtual embodiment of the system, multimodal input/output
- Examples: AR Lego, Welbo, Mr Virtuoso
  - The AR character was rated more real and more fun
  - On-screen 3D and AR were similar in usefulness

Wagner, D., Billinghurst, M., & Schmalstieg, D. (2006). How real should virtual characters be? In Proceedings of the 2006 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (p. 57). ACM.

Page 76: Hands and Speech in Space: Multimodal Input for Augmented Reality

Looking to the Future

What’s Next?

Page 77: Hands and Speech in Space: Multimodal Input for Augmented Reality

Directions for Future Research

- Mobile gesture interaction: tablet, phone interfaces
- Wearable systems: Google Glass
- Novel displays: contact lenses
- Environmental understanding: semantic representation

Page 78: Hands and Speech in Space: Multimodal Input for Augmented Reality

Mobile Gesture Interaction

Motivation:
- Richer interaction with handheld devices
- Natural interaction with handheld AR

- 2D tracking: fingertip tracking [Hurst and Wezel 2013]
- 3D tracking: hand tracking [Henrysson et al. 2007]

Henrysson, A., Marshall, J., & Billinghurst, M. (2007). Experiments in 3D interaction for mobile phone AR. In Proceedings of the 5th International Conference on Computer Graphics and Interactive Techniques in Australia and Southeast Asia (pp. 187-194). ACM.

Page 79: Hands and Speech in Space: Multimodal Input for Augmented Reality

Fingertip Based Interaction

Mobile client + PC server (system setup and running system shown).

Bai, H., Gao, L., El-Sana, J., & Billinghurst, M. (2013). Markerless 3D gesture-based interaction for handheld augmented reality interfaces. In SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications (p. 22). ACM.

Page 80: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 81: Hands and Speech in Space: Multimodal Input for Augmented Reality

3D Prototype System

- 3Gear + Vuforia: hand tracking + phone tracking
- Freehand interaction on the phone
- Skeleton model
- 3D interaction
- 20 fps performance

Page 82: Hands and Speech in Space: Multimodal Input for Augmented Reality

Google Glass

Page 83: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 84: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Experience

- Truly wearable computing: less than 46 ounces
- Hands-free information access: voice interaction, ego-vision camera
- Intuitive user interface: touch, gesture, speech, head motion
- Access to all Google services: Maps, Search, Location, Messaging, Email, etc.

Page 85: Hands and Speech in Space: Multimodal Input for Augmented Reality

Contact Lens Display

- Babak Parviz, University of Washington
- MEMS components: transparent elements, micro-sensors
- Challenges: miniaturization, assembly, eye safety

Page 86: Hands and Speech in Space: Multimodal Input for Augmented Reality

Contact Lens Prototype

Page 87: Hands and Speech in Space: Multimodal Input for Augmented Reality

Environmental Understanding

Semantic understanding of the environment:
- What are the key objects?
- What are their relationships?
- Can they be represented in a form suitable for multimodal interaction?
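One possible shape for such a representation: labelled objects with poses plus named spatial relations that both speech ("behind the table") and pointing gestures can resolve against. The object names and the simplified "behind" rule below are illustrative assumptions, not a proposed standard.

```python
from dataclasses import dataclass, field

# Hedged sketch of a semantic scene representation for multimodal input.
# "behind" is simplified here to larger depth along the camera's viewing axis.

@dataclass
class SceneObject:
    label: str
    position: tuple                      # (x, y, z) in metres, camera frame
    relations: dict = field(default_factory=dict)

def build_relations(objects):
    """Attach a simple view-dependent 'behind' relation between object pairs."""
    for a in objects:
        for b in objects:
            if a is not b and a.position[2] > b.position[2]:
                a.relations.setdefault("behind", []).append(b.label)
    return {o.label: o for o in objects}

scene = build_relations([SceneObject("table", (0.0, 0.0, 1.0)),
                         SceneObject("chair", (0.0, 0.0, 1.5))])
print(scene["chair"].relations)   # {'behind': ['table']}
```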

Page 88: Hands and Speech in Space: Multimodal Input for Augmented Reality

Conclusion

Page 89: Hands and Speech in Space: Multimodal Input for Augmented Reality

Conclusions

- AR experiences need new interaction methods
- Enabling technologies are advancing quickly: displays, tracking, depth-capture devices
- Natural user interfaces are now possible: free-hand gesture, speech, intelligent interfaces
- Important research directions for the future: mobile, wearable, displays

Page 90: Hands and Speech in Space: Multimodal Input for Augmented Reality

More Information

- Mark Billinghurst
  - Email: [email protected]
  - Twitter: @marknb00
- Website: http://www.hitlabnz.org/