
Hands and Speech in Space: Multimodal Input for Augmented Reality


A keynote talk given by Mark Billinghurst at the ICMI 2013 conference, December 12th 2013. The talk covers how speech and gesture input can be used to interact with Augmented Reality interfaces.


Page 1: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hands and Speech in Space: Multimodal Interaction for AR

Mark Billinghurst

[email protected]

The HIT Lab NZ, University of Canterbury

December 12th 2013

Page 2: Hands and Speech in Space: Multimodal Input for Augmented Reality

1977 – Star Wars

Page 3: Hands and Speech in Space: Multimodal Input for Augmented Reality

Augmented Reality Definition

Defining characteristics:
- Combines real and virtual images: both can be seen at the same time
- Interactive in real time: the virtual content can be interacted with
- Registered in 3D: virtual objects appear fixed in space

Azuma, R. T. (1997). A survey of augmented reality. Presence, 6(4), 355-385.

Page 4: Hands and Speech in Space: Multimodal Input for Augmented Reality

Augmented Reality Today

Page 5: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR Interface Components

Key question: how should a person interact with Augmented Reality content? Interaction connects the physical and the virtual.

Components: Physical Elements, Virtual Elements, Interaction Metaphor, Input, Output

Page 6: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR Interaction Metaphors

- Information Browsing: view AR content
- 3D AR Interfaces: 3D UI interaction techniques
- Augmented Surfaces: tangible UI techniques
- Tangible AR: tangible UI input + AR output

Page 7: Hands and Speech in Space: Multimodal Input for Augmented Reality

VOMAR Demo (Kato 2000)

AR furniture arranging. Elements and interactions:
- Book: turn over the page
- Paddle: push, shake, incline, hit, scoop (detection of two of these paddle gestures is sketched below)

Kato, H., Billinghurst, M., et al. (2000). Virtual Object Manipulation on a Table-Top AR Environment. In Proceedings of the International Symposium on Augmented Reality (ISAR 2000), Munich, Germany, 111-119.
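As a rough illustration of how paddle gestures like these might be detected from the tracked paddle pose, here is a hedged Python sketch; the pose format, thresholds and function names are assumptions, not the original VOMAR implementation.

```python
import numpy as np

# Hedged sketch of detecting two paddle gestures (incline, shake) from the
# tracked paddle pose. Thresholds and pose format are illustrative only.

def is_inclined(paddle_normal, threshold_deg=35.0):
    """'Incline' when the paddle normal tilts away from vertical."""
    up = np.array([0.0, 1.0, 0.0])
    n = np.asarray(paddle_normal, float)
    angle = np.degrees(np.arccos(np.clip(np.dot(n, up) / np.linalg.norm(n), -1, 1)))
    return angle > threshold_deg

def is_shaking(recent_positions, min_travel=0.3):
    """'Shake' when the paddle travels a lot but ends near where it began."""
    p = np.asarray(recent_positions, float)
    travel = np.sum(np.linalg.norm(np.diff(p, axis=0), axis=1))
    return travel > min_travel and np.linalg.norm(p[-1] - p[0]) < 0.05

print(is_inclined([0.5, 0.7, 0.0]))
print(is_shaking([[0, 0, 0], [0.1, 0, 0], [-0.1, 0, 0], [0.01, 0, 0]]))
```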

Page 8: Hands and Speech in Space: Multimodal Input for Augmented Reality

Opportunities for Multimodal Input

Multimodal interfaces are a natural fit for AR:
- Need for non-GUI interfaces
- Natural interaction with the real world
- Natural support for body input
- Previous work has shown the value of multimodal input for 3D graphics

Page 9: Hands and Speech in Space: Multimodal Input for Augmented Reality

Related Work

Related work in 3D graphics/VR:
- Interaction with 3D content [Chu 1997]
- Navigating through virtual worlds [Krum 2002]
- Interacting with virtual characters [Billinghurst 1998]

Little earlier work in AR:
- Requires additional input devices
- Few formal usability studies
- e.g. Olwal et al. [2003] SenseShapes

Page 10: Hands and Speech in Space: Multimodal Input for Augmented Reality

Examples

SenseShapes [Olwal 2003], Kölsch [2006]

Page 11: Hands and Speech in Space: Multimodal Input for Augmented Reality

Marker Based Multimodal Interface

- Add speech recognition to VOMAR
- Paddle + speech commands

Irawati, S., Green, S., Billinghurst, M., Duenser, A., & Ko, H. (2006). "Move the couch where?": Developing an augmented reality multimodal interface. In Mixed and Augmented Reality, 2006 (ISMAR 2006), IEEE/ACM International Symposium on (pp. 183-186). IEEE.

Page 12: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 13: Hands and Speech in Space: Multimodal Input for Augmented Reality

Commands Recognized

- Create command ("Make a blue chair"): create a virtual object and place it on the paddle.
- Duplicate command ("Copy this"): duplicate a virtual object and place it on the paddle.
- Grab command ("Grab table"): select a virtual object and place it on the paddle.
- Place command ("Place here"): place the attached object in the workspace.
- Move command ("Move the couch"): attach a virtual object in the workspace to the paddle so that it follows the paddle movement.
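To illustrate how such speech commands could be combined with the paddle pose, here is a minimal, hypothetical dispatcher sketch; the command table, handler output and scene lookup are invented for illustration and are not the original system's grammar.

```python
# Minimal sketch of a speech-command dispatcher for the paddle interface.
# Command names follow the slide; the handler output and scene objects are
# hypothetical placeholders, not the original VOMAR implementation.

COMMANDS = {
    "make":  "create",     # "Make a blue chair"
    "copy":  "duplicate",  # "Copy this"
    "grab":  "grab",       # "Grab table"
    "place": "place",      # "Place here"
    "move":  "move",       # "Move the couch"
}

def interpret(utterance, paddle_pose, scene):
    """Map a recognized phrase plus the current paddle pose to an action."""
    words = utterance.lower().split()
    verb = next((COMMANDS[w] for w in words if w in COMMANDS), None)
    if verb is None:
        return None
    # Deictic words ("this", "here") are resolved with the paddle pose,
    # object names ("chair", "table", "couch") with a scene lookup.
    target = next((w for w in words if w in scene), None)
    return {"action": verb, "target": target, "paddle_pose": paddle_pose}

scene = {"chair": (0, 0, 0), "table": (1, 0, 0), "couch": (2, 0, 0)}
print(interpret("Make a blue chair", paddle_pose=(0.2, 0.0, 0.1), scene=scene))
```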

Page 14: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 15: Hands and Speech in Space: Multimodal Input for Augmented Reality

Object Relationships

"Put chair behind the table” Where is behind?

View specific regions
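One way to resolve a view-dependent relation like "behind" is to offset the placed object away from the viewer along the line from the camera to the reference object. The sketch below assumes simple Cartesian positions and an illustrative offset; it is a simplified stand-in, not the region-based approach of the actual system.

```python
import numpy as np

# Hypothetical sketch: place an object "behind" a reference object relative
# to the user's viewpoint by moving it along the viewing direction.

def place_behind(reference_pos, camera_pos, offset=0.5):
    """Return a position 'behind' reference_pos as seen from camera_pos."""
    view_dir = np.asarray(reference_pos, float) - np.asarray(camera_pos, float)
    view_dir[1] = 0.0                      # keep the object on the table plane
    view_dir /= np.linalg.norm(view_dir)   # unit vector pointing away from the viewer
    return np.asarray(reference_pos, float) + offset * view_dir

print(place_behind(reference_pos=[1.0, 0.0, 0.0], camera_pos=[0.0, 0.0, -2.0]))
```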

Page 16: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Evaluation

- Performance time: speech + static paddle significantly faster
- Gesture-only condition less accurate for position/orientation
- Users preferred speech + paddle input

Page 17: Hands and Speech in Space: Multimodal Input for Augmented Reality

Subjective Surveys

Page 18: Hands and Speech in Space: Multimodal Input for Augmented Reality

2010 – Iron Man 2

Page 19: Hands and Speech in Space: Multimodal Input for Augmented Reality

To Make the Vision Real...

Hardware/software requirements:
- Contact lens displays
- Free-space hand/body tracking
- Speech/gesture recognition
- Etc.

Most importantly: usability / user experience

Page 20: Hands and Speech in Space: Multimodal Input for Augmented Reality

Natural Interaction

- Automatically detecting the real environment: environmental awareness, physically-based interaction
- Gesture interaction: free-hand interaction
- Multimodal input: speech and gesture interaction
- Intelligent interfaces: implicit rather than explicit interaction

Page 21: Hands and Speech in Space: Multimodal Input for Augmented Reality

Environmental Awareness

Page 22: Hands and Speech in Space: Multimodal Input for Augmented Reality

AR MicroMachines

- AR experience with environment awareness and physically-based interaction
- Based on the MS Kinect RGB-D sensor
- Augmented environment supports occlusion, shadows, and physically-based interaction between real and virtual objects

Clark, A., & Piumsomboon, T. (2011). A realistic augmented reality racing game using a depth-sensing camera. In Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry (pp. 499-502). ACM.

Page 23: Hands and Speech in Space: Multimodal Input for Augmented Reality

Operating Environment

Page 24: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture

Our framework uses five libraries:
- OpenNI
- OpenCV
- OPIRA
- Bullet Physics
- OpenSceneGraph

Page 25: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Flow

The system flow consists of three sections:
- Image processing and marker tracking
- Physics simulation
- Rendering
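A per-frame loop tying the three sections together might look like the sketch below; the capture, tracker, physics and renderer objects are placeholders standing in for the real OpenCV/OPIRA, Bullet and OpenSceneGraph components.

```python
# Illustrative main loop for the three stages named above. All objects are
# placeholder abstractions, not the original system's classes.

def run_frame(capture, tracker, physics, renderer, dt=1.0 / 30.0):
    rgb, depth = capture.read()               # 1. image acquisition
    camera_pose = tracker.update(rgb)          #    marker tracking (e.g. OPIRA)
    physics.step(depth, dt)                    # 2. physics simulation
    renderer.draw(rgb, depth, camera_pose,     # 3. rendering with occlusion/shadows
                  physics.virtual_objects())
```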

Page 26: Hands and Speech in Space: Multimodal Input for Augmented Reality

Physics Simulation

- Create a virtual mesh over the real world
- Updated at 10 fps, so real objects can be moved
- Used by the physics engine for collision detection (virtual/real)
- Used by OpenSceneGraph for occlusion and shadows
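A minimal sketch of the mesh-creation step, assuming a pinhole camera model with illustrative intrinsics; in the real system a mesh like this is handed to Bullet for collision detection between real and virtual objects.

```python
import numpy as np

# Hedged sketch: convert a depth image into a triangle mesh that a physics
# engine could use for real/virtual collisions. Intrinsics are illustrative.

def depth_to_mesh(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5, step=8):
    """Back-project a (H, W) depth image (metres) into vertices + triangles."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    z = depth[ys, xs]
    verts = np.stack([(xs - cx) * z / fx, (ys - cy) * z / fy, z], axis=-1)
    rows, cols = verts.shape[:2]
    idx = np.arange(rows * cols).reshape(rows, cols)
    quads = np.stack([idx[:-1, :-1], idx[1:, :-1], idx[1:, 1:], idx[:-1, 1:]], -1)
    tris = np.concatenate([quads[..., [0, 1, 2]], quads[..., [0, 2, 3]]]).reshape(-1, 3)
    return verts.reshape(-1, 3), tris

verts, tris = depth_to_mesh(np.full((480, 640), 1.5))
print(verts.shape, tris.shape)  # mesh would be refreshed roughly 10 times per second
```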

Page 27: Hands and Speech in Space: Multimodal Input for Augmented Reality

Rendering

Occlusion and shadows

Page 28: Hands and Speech in Space: Multimodal Input for Augmented Reality

Gesture Interaction

Page 29: Hands and Speech in Space: Multimodal Input for Augmented Reality

Natural Hand Interaction

- Using bare hands to interact with AR content
- MS Kinect depth sensing
- Real-time hand tracking
- Physics-based simulation model

Page 30: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Interaction

- Represent hand models as collections of spheres
- Bullet physics engine for interaction with the real world
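The sphere-proxy idea can be sketched as follows; the joint positions, radii and the box intersection test are illustrative stand-ins rather than the actual Bullet setup.

```python
import numpy as np

# Sketch of sphere proxies: approximate the tracked hand as a set of spheres
# so a physics engine can resolve contacts with virtual objects.

def hand_to_spheres(joint_positions, radius=0.012):
    """Return (centre, radius) proxies for each tracked hand joint."""
    return [(np.asarray(p, float), radius) for p in joint_positions]

def sphere_hits_box(centre, radius, box_min, box_max):
    """Simple sphere vs. axis-aligned box test, e.g. for picking/pushing."""
    closest = np.clip(centre, box_min, box_max)
    return np.linalg.norm(centre - closest) <= radius

joints = [(0.0, 0.0, 0.5), (0.02, 0.0, 0.5), (0.04, 0.01, 0.5)]
for c, r in hand_to_spheres(joints):
    print(sphere_hits_box(c, r, np.array([-0.05, -0.05, 0.45]),
                          np.array([0.05, 0.05, 0.55])))
```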

Page 31: Hands and Speech in Space: Multimodal Input for Augmented Reality

Scene Interaction

- Render the AR scene with OpenSceneGraph
- Depth map used for occlusion
- Shadows yet to be implemented

Page 32: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture

5. Gesture: static gestures, dynamic gestures, context-based gestures
4. Modeling: hand recognition/modeling, rigid-body modeling
3. Classification/Tracking
2. Segmentation
1. Hardware Interface
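A toy composition of the five stages, with each stage as a placeholder callable, shows how data would flow from hardware capture up to a recognized gesture; all stage implementations here are dummies, not the real library components.

```python
# Illustrative composition of the five pipeline stages listed above.

class GesturePipeline:
    def __init__(self, hardware, segmenter, tracker, modeller, recognizer):
        self.stages = [hardware, segmenter, tracker, modeller, recognizer]

    def process(self, frame=None):
        data = frame
        for stage in self.stages:         # 1. hardware ... 5. gesture
            data = stage(data)
        return data                        # recognized gesture / command

pipeline = GesturePipeline(
    hardware=lambda _: {"rgb": None, "depth": None},
    segmenter=lambda d: {**d, "hand_mask": None},
    tracker=lambda d: {**d, "hand_blobs": []},
    modeller=lambda d: {**d, "skeleton": None},
    recognizer=lambda d: {**d, "gesture": "open_hand"},
)
print(pipeline.process()["gesture"])
```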

Page 33: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 1. Hardware Interface

- Supports PCL, OpenNI, OpenCV, and the Kinect SDK
- Provides access to depth, RGB, and XYZRGB data
- Usage: capturing colour images, depth images and concatenated point clouds from a single camera or multiple cameras
- Example devices: Kinect for Xbox 360, Kinect for Windows, Asus Xtion Pro Live

Page 34: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 2. Segmentation

- Segments images and point clouds based on colour, depth and space
- Usage: segmenting images or point clouds using colour models, depth, or spatial properties such as location, shape and size
- Examples: skin-colour segmentation, depth thresholding
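A hedged OpenCV sketch combining the two cues named above (skin colour and a depth threshold); the HSV bounds and the depth range are illustrative values, not those used in the actual system.

```python
import cv2
import numpy as np

# Sketch: HSV skin-colour mask combined with a depth slice, then cleaned
# with a morphological opening. Thresholds are illustrative assumptions.

def segment_hand(bgr, depth_mm, near=400, far=900):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))    # skin-colour mask
    close = cv2.inRange(depth_mm, near, far)                 # depth threshold
    mask = cv2.bitwise_and(skin, close)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

bgr = np.zeros((480, 640, 3), np.uint8)
depth = np.full((480, 640), 600, np.uint16)
print(segment_hand(bgr, depth).shape)
```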

Page 35: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 3. Classification/Tracking

- Identifies and tracks objects between frames based on XYZRGB data
- Usage: identifying the current position/orientation of the tracked object in space
- Example: a training set of hand poses, with colours representing unique regions of the hand, and raw (uncleaned) classifier output on real hand input from the depth image
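Frame-to-frame tracking can be illustrated with a simple nearest-centroid association; this is a stand-in for the real classifier/tracker stage, and the distance threshold is an assumption.

```python
import numpy as np

# Minimal sketch of frame-to-frame association: match each previous blob
# centroid to the closest blob in the new frame, within a distance limit.

def associate(prev_centroids, new_centroids, max_dist=0.15):
    """Return {prev_index: new_index} matches within max_dist metres."""
    matches = {}
    for i, p in enumerate(prev_centroids):
        d = np.linalg.norm(np.asarray(new_centroids) - np.asarray(p), axis=1)
        j = int(np.argmin(d))
        if d[j] <= max_dist:
            matches[i] = j
    return matches

print(associate([(0.0, 0.0, 0.6)], [(0.02, 0.01, 0.61), (0.4, 0.0, 0.8)]))
```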

Page 36: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 4. Modeling

- Hand recognition/modeling: skeleton based (for a low-resolution approximation) or model based (for a more accurate representation)
- Object modeling: identification and tracking of rigid-body objects
- Physical modeling (physical interaction): sphere proxy, model based, or mesh based
- Usage: general spatial interaction in AR/VR environments

Page 37: Hands and Speech in Space: Multimodal Input for Augmented Reality

Architecture: 5. Gesture

- Static: hand pose recognition
- Dynamic: meaningful movement recognition
- Context-based: gesture recognition with context, e.g. pointing
- Usage: issuing commands, anticipating user intention, and high-level interaction
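The three gesture types might be distinguished roughly as in the sketch below; the pose rules, swipe threshold and selection logic are invented for illustration, not the library's actual recognizers.

```python
import numpy as np

# Illustrative classifiers for the three gesture types on the slide.

def static_gesture(extended_fingers):
    """Static: label a hand pose from a finger count (illustrative rule)."""
    return {0: "closed_hand", 1: "pointing"}.get(extended_fingers, "open_hand")

def dynamic_gesture(palm_trajectory, min_dist=0.25):
    """Dynamic: label a swipe when the palm travels far enough."""
    traj = np.asarray(palm_trajectory, float)
    if len(traj) < 2 or np.linalg.norm(traj[-1] - traj[0]) < min_dist:
        return None
    dx, dy = traj[-1][:2] - traj[0][:2]
    return "swipe_right" if abs(dx) > abs(dy) and dx > 0 else "swipe"

def context_gesture(pose, pointed_object):
    """Context-based: a pointing pose plus a hit-tested object is a selection."""
    return ("select", pointed_object) if pose == "pointing" and pointed_object else None

print(static_gesture(1), dynamic_gesture([(0, 0, 0.6), (0.3, 0.02, 0.6)]))
print(context_gesture("pointing", "chair"))
```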

Page 38: Hands and Speech in Space: Multimodal Input for Augmented Reality

Skeleton Based Interaction

- 3Gear Systems
- Kinect/PrimeSense sensor
- Two-hand tracking
- http://www.threegear.com

Page 39: Hands and Speech in Space: Multimodal Input for Augmented Reality

Skeleton Interaction + AR

- HMD AR view with viewpoint tracking
- Two-hand input
- Skeleton interaction, occlusion

Page 40: Hands and Speech in Space: Multimodal Input for Augmented Reality

What Gestures Do People Want to Use?

Limitations of previous work in AR:
- Limited range of gestures
- Gestures designed for optimal recognition
- Gestures studied as an add-on to speech

Solution: elicit the desired gestures from users
- e.g. gestures for surface computing [Wobbrock]
- Previous work on unistroke gestures, mobile gestures

Page 41: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Defined Gesture Study

- Use an AR view: HMD + AR tracking
- Present AR animations: 40 tasks in six categories (editing, transforms, menu, etc.)
- Ask users to produce gestures that would cause the animations
- Record each gesture (video, depth)

Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. (2013). User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems (pp. 955-960). ACM.

Page 42: Hands and Speech in Space: Multimodal Input for Augmented Reality

Data Recorded

- 20 participants
- Gestures recorded (video, depth data): 800 gestures from 40 tasks
- Subjective rankings: Likert ratings of goodness and ease of use
- Think-aloud transcripts

Page 43: Hands and Speech in Space: Multimodal Input for Augmented Reality

Typical Gestures

Page 44: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Gestures

- Gestures grouped according to similarity: 320 groups
- 44 consensus groups (62% of all gestures)
- 276 low-similarity groups (discarded)
- 11 hand poses seen
- Degree of consensus (A) measured with the guessability agreement score [Wobbrock] (sketched below)
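The agreement (guessability) score from Wobbrock et al. sums, for each task, the squared proportions of gestures that fall into the same group, then averages over tasks. A small sketch, with invented example numbers:

```python
# Agreement score: for each referent (task), sum the squared proportions of
# proposals in each group, then average over all referents.

def agreement(groups_per_referent):
    """groups_per_referent: list of lists of group sizes, one list per task."""
    scores = []
    for sizes in groups_per_referent:
        total = sum(sizes)
        scores.append(sum((s / total) ** 2 for s in sizes))
    return sum(scores) / len(scores)

# Example: one task where 15 of 20 participants proposed the same gesture
# (plus a group of 3 and two singletons), and one task with no consensus.
print(agreement([[15, 3, 1, 1], [1] * 20]))
```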

Page 45: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Agreement Scores

(Chart of agreement scores per task; the red line shows the proportion of two-handed gestures.)

Page 46: Hands and Speech in Space: Multimodal Input for Augmented Reality

Usability Results

- Significant difference between the consensus and discarded gesture sets (p < 0.0001)
- Gestures in the consensus set rated better than discarded gestures in perceived performance and goodness

                      Consensus   Discarded
Ease of performance   6.02        5.50
Good match            6.17        5.83

Likert scale [1-7], 7 = very good

Page 47: Hands and Speech in Space: Multimodal Input for Augmented Reality

Lessons Learned

- AR animation can elicit desired gestures
- For some tasks there is a high degree of similarity in user-defined gestures, especially command gestures (e.g. open) and selection
- Less agreement for manipulation gestures: move (40%), rotate (30%), grouping (10%)
- Small proportion of two-handed gestures (22%): scaling, group selection

Page 48: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Input

Page 49: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Interaction

- Combined speech and gesture input; the two are complementary
- Speech: modal commands, quantities
- Gesture: selection, motion, qualities
- Previous work found multimodal interfaces intuitive for 2D/3D graphics interaction

Page 50: Hands and Speech in Space: Multimodal Input for Augmented Reality

Wizard of Oz Study

- What speech and gesture input would people like to use?
- Wizard: performs speech recognition and command interpretation
- Domain: 3D object interaction/modelling

Lee, M., & Billinghurst, M. (2008). A Wizard of Oz study for an AR multimodal interface. In Proceedings of the 10th International Conference on Multimodal Interfaces (pp. 249-256). ACM.

Page 51: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 52: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Segmentation

Page 53: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Set Up

Page 54: Hands and Speech in Space: Multimodal Input for Augmented Reality

Experiment

- 12 participants
- Two display conditions (HMD vs. desktop)
- Three tasks:
  - Task 1: change object colour/shape
  - Task 2: 3D positioning of objects
  - Task 3: scene assembly

Page 55: Hands and Speech in Space: Multimodal Input for Augmented Reality

Key Results

- Most commands were multimodal: multimodal (63%), gesture only (34%), speech only (4%)
- Most spoken phrases were short: 74% of phrases averaged 1.25 words; sentences (26%) averaged 3 words
- Main gestures were deictic (65%) and metaphoric (35%)
- In multimodal commands the gesture was issued first: 94% of the time the gesture began before the speech; multimodal time window of 8 s, with speech on average 4.5 s after the gesture
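These timing results suggest a time-window fusion strategy: buffer the gesture and fuse it with any speech command that arrives within the window. The sketch below uses the 8 s window from the slide; the class and event format are invented for illustration, not the study's actual fusion module.

```python
import time

# Hedged sketch of time-window multimodal fusion: gestures usually arrive
# first, so a gesture is buffered and fused with speech that follows soon.

class MultimodalFusion:
    def __init__(self, window_s=8.0):
        self.window_s = window_s
        self.pending_gesture = None     # (timestamp, gesture, target)

    def on_gesture(self, gesture, target):
        self.pending_gesture = (time.time(), gesture, target)

    def on_speech(self, command):
        if self.pending_gesture is not None:
            t0, gesture, target = self.pending_gesture
            self.pending_gesture = None
            if time.time() - t0 <= self.window_s:
                return {"command": command, "gesture": gesture, "target": target}
        return {"command": command}      # speech-only fallback

fusion = MultimodalFusion()
fusion.on_gesture("pointing", target="chair")
print(fusion.on_speech("red"))           # fused "make this red"-style command
```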

Page 56: Hands and Speech in Space: Multimodal Input for Augmented Reality

Free Hand Multimodal Input

- Use the free hand to interact with AR content
- Recognize simple gestures: open hand, closed hand, pointing
- Gestures mapped to point, move, and pick/drop actions

Lee, M., Billinghurst, M., Baek, W., Green, R., & Woo, W. (2013). A usability study of multimodal input in an augmented reality environment. Virtual Reality, 17(4), 293-305.

Page 57: Hands and Speech in Space: Multimodal Input for Augmented Reality

Speech Input

- MS Speech + MS SAPI (> 90% accuracy)
- Single-word speech commands

Page 58: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Architecture

Page 59: Hands and Speech in Space: Multimodal Input for Augmented Reality

Multimodal Fusion

Page 60: Hands and Speech in Space: Multimodal Input for Augmented Reality

Hand Occlusion

Page 61: Hands and Speech in Space: Multimodal Input for Augmented Reality

Experimental Setup

Change object shape and colour

Page 62: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Evaluation

- 25 subjects, 10 task trials x 3, 3 conditions
- Task: change object shape, colour and position
- Conditions: speech only, gesture only, multimodal
- Measures: performance time, errors (system/user), subjective survey

Page 63: Hands and Speech in Space: Multimodal Input for Augmented Reality

Results - Performance

- Average performance time: gesture 15.44 s, speech 12.38 s, multimodal 11.78 s
- Significant difference across conditions (p < 0.01)
- Difference between gesture and speech/MMI

Page 64: Hands and Speech in Space: Multimodal Input for Augmented Reality

Errors

- User errors (errors per task): gesture (0.50), speech (0.41), MMI (0.42); no significant difference
- System errors: speech accuracy 94%, gesture accuracy 85%, MMI accuracy 90%

Page 65: Hands and Speech in Space: Multimodal Input for Augmented Reality

Subjective Results (Likert 1-7)

- User subjective survey: gesture rated significantly worse; MMI and speech rated the same
- MMI perceived as most efficient
- Preference: 70% MMI, 25% speech only, 5% gesture only

                  Gesture   Speech   MMI
Naturalness       4.60      5.60     5.80
Ease of Use       4.00      5.90     6.00
Efficiency        4.45      5.15     6.05
Physical Effort   4.75      3.15     3.85

Page 66: Hands and Speech in Space: Multimodal Input for Augmented Reality

Observations

- Significant difference in number of commands: gesture (6.14), speech (5.23), MMI (4.93)
- MMI simultaneous vs. sequential commands: 79% sequential, 21% simultaneous
- Reaction to system errors: users almost always repeated the same command; in MMI they rarely changed modalities

Page 67: Hands and Speech in Space: Multimodal Input for Augmented Reality

Lessons Learned

- Multimodal interaction is significantly better than gesture alone in AR interfaces for 3D tasks: shorter task times, more efficient
- Users felt that MMI was more natural, easier, and more effective than gesture-only or speech-only input
- Simultaneous input was rarely used
- More studies need to be conducted

Page 68: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

Page 69: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

Most AR systems are not intelligent:
- They don't recognize user behaviour
- They don't provide feedback
- They don't adapt to the user

This is especially important for training:
- Scaffolded learning
- Moving beyond checklists of actions

Page 70: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Interfaces

- AR interface + intelligent tutoring system
- ASPIRE constraint-based system (from the University of Canterbury)
- Constraints: relevance condition, satisfaction condition, feedback

Westerfield, G., Mitrovic, A., & Billinghurst, M. (2013). Intelligent Augmented Reality Training for Assembly Tasks. In Artificial Intelligence in Education (pp. 542-551). Springer Berlin Heidelberg.
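A constraint in this style pairs a relevance condition and a satisfaction condition with a feedback message; the sketch below uses an invented assembly-task state and conditions, and is not the actual ASPIRE implementation.

```python
# Minimal sketch of a constraint for a constraint-based tutor:
# relevance condition, satisfaction condition, feedback message.

class Constraint:
    def __init__(self, relevance, satisfaction, feedback):
        self.relevance, self.satisfaction, self.feedback = relevance, satisfaction, feedback

    def check(self, state):
        """Return feedback if the constraint is relevant but violated."""
        if self.relevance(state) and not self.satisfaction(state):
            return self.feedback
        return None

constraints = [
    Constraint(
        relevance=lambda s: s.get("step") == "attach_gear",       # hypothetical step
        satisfaction=lambda s: s.get("base_fixed", False),
        feedback="Fix the base plate before attaching the gear.",
    ),
]

state = {"step": "attach_gear", "base_fixed": False}
print([c.check(state) for c in constraints if c.check(state)])
```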

Page 71: Hands and Speech in Space: Multimodal Input for Augmented Reality

Domain Ontology

Page 72: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Feedback

- Actively monitors user behaviour
- Implicit vs. explicit interaction
- Provides corrective feedback

Page 73: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 74: Hands and Speech in Space: Multimodal Input for Augmented Reality

Evaluation Results

- 16 subjects, with and without the ITS
- Improved task completion
- Improved learning

Page 75: Hands and Speech in Space: Multimodal Input for Augmented Reality

Intelligent Agents

- AR characters: virtual embodiment of the system, multimodal input/output
- Examples: AR Lego, Welbo, Mr Virtuoso
  - The AR character was rated more real and more fun
  - On-screen 3D and AR were similar in usefulness

Wagner, D., Billinghurst, M., & Schmalstieg, D. (2006). How real should virtual characters be? In Proceedings of the 2006 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (p. 57). ACM.

Page 76: Hands and Speech in Space: Multimodal Input for Augmented Reality

Looking to the Future

What’s Next?

Page 77: Hands and Speech in Space: Multimodal Input for Augmented Reality

Directions for Future Research

- Mobile gesture interaction: tablet, phone interfaces
- Wearable systems: Google Glass
- Novel displays: contact lenses
- Environmental understanding: semantic representation

Page 78: Hands and Speech in Space: Multimodal Input for Augmented Reality

Mobile Gesture Interaction

Motivation:
- Richer interaction with handheld devices
- Natural interaction with handheld AR

- 2D tracking: fingertip tracking [Hurst and Wezel 2013]
- 3D tracking: hand tracking [Henrysson et al. 2007]

Henrysson, A., Marshall, J., & Billinghurst, M. (2007). Experiments in 3D interaction for mobile phone AR. In Proceedings of the 5th International Conference on Computer Graphics and Interactive Techniques in Australia and Southeast Asia (pp. 187-194). ACM.

Page 79: Hands and Speech in Space: Multimodal Input for Augmented Reality

Fingertip Based Interaction

Mobile client + PC server (system setup and running system shown).

Bai, H., Gao, L., El-Sana, J., & Billinghurst, M. (2013). Markerless 3D gesture-based interaction for handheld augmented reality interfaces. In SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications (p. 22). ACM.

Page 80: Hands and Speech in Space: Multimodal Input for Augmented Reality

System Architecture

Page 81: Hands and Speech in Space: Multimodal Input for Augmented Reality

3D Prototype System

- 3Gear + Vuforia: hand tracking + phone tracking
- Freehand interaction on the phone
- Skeleton model
- 3D interaction
- 20 fps performance

Page 82: Hands and Speech in Space: Multimodal Input for Augmented Reality

Google Glass

Page 83: Hands and Speech in Space: Multimodal Input for Augmented Reality
Page 84: Hands and Speech in Space: Multimodal Input for Augmented Reality

User Experience

- Truly wearable computing: less than 46 ounces
- Hands-free information access: voice interaction, ego-vision camera
- Intuitive user interface: touch, gesture, speech, head motion
- Access to all Google services: Maps, Search, Location, Messaging, Email, etc.

Page 85: Hands and Speech in Space: Multimodal Input for Augmented Reality

Contact Lens Display

- Babak Parviz, University of Washington
- MEMS components: transparent elements, micro-sensors
- Challenges: miniaturization, assembly, eye safety

Page 86: Hands and Speech in Space: Multimodal Input for Augmented Reality

Contact Lens Prototype

Page 87: Hands and Speech in Space: Multimodal Input for Augmented Reality

Environmental Understanding

Semantic understanding of the environment:
- What are the key objects?
- What are their relationships?
- Can they be represented in a form suitable for multimodal interaction?
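One possible shape for such a representation: labelled objects with poses plus named spatial relations that both speech ("behind the table") and pointing gestures can resolve against. The object names and the simplified "behind" rule below are illustrative assumptions, not a proposed standard.

```python
from dataclasses import dataclass, field

# Hedged sketch of a semantic scene representation for multimodal input.
# "behind" is simplified here to larger depth along the camera's viewing axis.

@dataclass
class SceneObject:
    label: str
    position: tuple                      # (x, y, z) in metres, camera frame
    relations: dict = field(default_factory=dict)

def build_relations(objects):
    """Attach a simple view-dependent 'behind' relation between object pairs."""
    for a in objects:
        for b in objects:
            if a is not b and a.position[2] > b.position[2]:
                a.relations.setdefault("behind", []).append(b.label)
    return {o.label: o for o in objects}

scene = build_relations([SceneObject("table", (0.0, 0.0, 1.0)),
                         SceneObject("chair", (0.0, 0.0, 1.5))])
print(scene["chair"].relations)   # {'behind': ['table']}
```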

Page 88: Hands and Speech in Space: Multimodal Input for Augmented Reality

Conclusion

Page 89: Hands and Speech in Space: Multimodal Input for Augmented Reality

Conclusions

- AR experiences need new interaction methods
- Enabling technologies are advancing quickly: displays, tracking, depth-capture devices
- Natural user interfaces are now possible: free-hand gesture, speech, intelligent interfaces
- Important research directions for the future: mobile, wearable, displays

Page 90: Hands and Speech in Space: Multimodal Input for Augmented Reality

More Information

- Mark Billinghurst
  - Email: [email protected]
  - Twitter: @marknb00
- Website: http://www.hitlabnz.org/