
Page 1

We acknowledge support from:

• NSF-STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse Segmentation”

• NSF-KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research”

• NSF-ITR program, Grant No. IIS-0219875, “Beyond The Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring”

• ARDA-VACE II program, “From Video to Information: Cross-Modal Analysis of Planning Meetings”

VACE Multimodal Meeting Corpus

Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqiang Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang

Francis Quek

Professor of Computer Science

Director, Center for Human Computer Interaction

Virginia Tech

Page 2

Corpus Rationale

• A quest for meaning: embodied cognition and language production drive our research

• Analysis of ‘natural’ [human-human*] meetings

• Resource in support of research in:
  - Multimodal language analysis
  - Speech recognition and analysis
  - Vision-based communicative behavior analysis

Page 3

Why Multimodal Language Analysis?

S1 you know like those ¿fireworks?

S2 well if we're trying to drive'em / out her<r>e # we need to put'em up her<r>e

S1 yeah well what I'm saying is we should*

S2 in front

S1 we should do it* we should make it a lin<n>e through the room<m>s / so that they explode like here then here then here then here

Page 4

Multimodal Language Example

[Embedded video clip]

Page 5

Embodied Communicative Behavior

• Constructed dynamically at the moment of speaking (thinking for speaking)
• Dependent on cultural, personal, social, and cognitive differences
• Speaker is often unwitting of gestures
• Reveals the contrastive foci of the language stream (Hajičová, Halliday et al.)
• Is co-expressive (co-temporal) with speech
• Is multiply determined
• Temporal synchrony is critical for analysis

Page 6

In a Nutshell

Gesture/Speech Framework (McNeill 1992, 2000, 2001; Quek et al. 1999–2003)

[Diagram linking Discourse Production, Mental Imagery, Embodied Imagery, Video Access, and computable ‘image-bearing’ features, with inference drawn from imagistic gesture features and their coherence with conceptual discourse units]

Page 7

ARDA/VACE Program

• ARDA is to the intelligence community what DARPA is to the military

• Interest is in the exploitation of video data (Video Analysis and Content Extraction)

• A key VACE challenge: meeting analysis

• Our key theme: multimodal communication analysis

Page 8

From Video to Information: Cross-Modal Analysis for Planning Meetings

Page 9

Team

Multimodal Meeting Analysis: A Cross-Disciplinary Enterprise

Page 10

Overarching Approach

• Coordinated multidisciplinary research

• Corpus assembly
  - Data is transcribed and coded for relevant speech/language structure
  - War-gaming (planning) scenarios are captured to provide real planning behavior in a controlled experimental context (reducing many ‘unknowns’)
  - Meeting room is multiply instrumented with cross-calibrated video, synchronized audio/video, and motion tracking
  - All data components are time-aligned across the dataset

• Multimodal video processing research
  - Research on posture, head position/orientation, gesture tracking, hand-shape recognition, and multimodal integration

• Research in tools for analysis, coding, and interpretation

• Speech analysis research in support of multimodality

Page 11

Scenarios

• Each scenario to have five participants
• Roles tailored to available participant expertise
• Five initial scenarios:
  - Delta II Rocket Launch
  - Foreign Material Exploitation
  - Intervention to Support Democratic Movement
  - Humanitarian Assistance
  - Scholarship Selection

Page 12

Scenarios (cont’d)

Planned scenarios (to be developed):
• Lost Aircraft
• Crisis Response
• Hostage Rescue
• Downed Pilot Search & Rescue
• Bomb Shelter Design

Page 13

Scenario Development

Humanitarian Assistance Walkthrough

• Purpose: develop a plan for immediate military support to Dec 04 Asian tsunami victims

• Considerable open-source information available from the Internet for scenario development

• Roles: Medical Officer, Task Force Commander, Intel Officer, Operations Officer, Weather Officer

• Mission goals & priorities provided for each role

Page 14

[Satellite images: before and after the tsunami]

“As intelligence officer, your role is to provide intelligence support to OPERATION UNIFIED ASSISTANCE. While the extent of damage is still unknown, early reporting indicates that coastal areas throughout South Asia have been affected. Communications have been lost with entire towns. Currently, the only means of determining the magnitude of destruction is from overhead assets. Data from the South Asia and Sri Lanka region has already been received from civilian remote sensing satellites. Although the US military will be operating in the region on a strictly humanitarian mission, the threat still exists of hostile action to US personnel by terrorist factions opposed to the US. As intel officer, you are responsible for briefing the nature of the terrorist threat in the region.”

Meulaboh, Indonesia

Page 15

Corpus Assembly

Page 16

Multi-modal Elicitation Experiment

Time-Aligned Multimedia Transcription

Data Acquisition & Processing

• 10-camera video & digital audio capture
• 3D Vicon extraction
• Video processing: 10-camera calibration, vector extraction, hand tracking, gaze tracking, head modeling, head tracking, body tracking
• Motion capture interpretation
• Speech & psycholinguistic coding: speech transcription, psycholinguistic coding
• Speech & audio processing: automatic transcript word/syllable alignment to audio, audio feature extraction

Page 17

Meeting Room and Camera Configuration

[Room layout diagram: positions A–H around the table, cameras 1–10 around the room]

Position → covering cameras:
A: 4, 6, 8
B: 6, 7, 8, 10
C: 7, 10
D: 1, 7, 9, 10
E: 4, 6, 10
F: 1, 2, 3, 5
G: 2, 5
H: 2, 4, 5, 6

Camera → positions in view:
C1: D, E, F     C6: B, A, H
C2: H, G, F     C7: D, C, B
C3: F, E        C8: B, A
C4: H, A        C9: D, E
C5: F, G, H     C10: B, C, D

Stereo camera pairs (T1, T2):
1: C9, C3     7: C7, C10
2: C1, C3     8: C2, C5
3: C9, C1     9: C2, C4
4: C4, C8    10: C3, C5
5: C4, C6    11: C7, C9
6: C6, C8    12: C8, C10

Page 18

[Video frame: Cam1]

Page 19

48 Calibration Dots

18 Vicon Markers for Coordinate System Transformation

Y = RX + T

Global & Pairwise Camera Calibration
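The transformation Y = RX + T (rotation R, translation T) between the Vicon coordinate system and a camera coordinate system can be estimated in a least-squares sense from corresponding marker positions. A minimal sketch using the standard SVD-based (Kabsch) solution; this is an illustration of the technique, not the project's actual calibration code:

```python
import numpy as np

def fit_rigid_transform(X, Y):
    """Least-squares R, T such that Y ~ R @ x + T for corresponding 3D points.

    X, Y: (N, 3) arrays, e.g. Vicon marker positions and the same
    markers expressed in the camera coordinate system.
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    # Cross-covariance of the centered point clouds
    H = (X - cx).T @ (Y - cy)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = cy - R @ cx
    return R, T

# Synthetic check: recover a known 30-degree rotation about Z plus a translation
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
T_true = np.array([1.0, -2.0, 0.5])
X = np.random.default_rng(0).normal(size=(18, 3))  # e.g. 18 marker positions
Y = X @ R_true.T + T_true
R_est, T_est = fit_rigid_transform(X, Y)
```

With noisy marker observations the same closed form gives the least-squares estimate, and comparing the transformed points against Y yields per-axis residual statistics of the kind reported for the meeting-room area.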

Page 20

Error Distributions in Meeting Room Area (camera pairs 5–12)

X direction: mean 0.4755 mm, minimum 0.4 mm, maximum 0.5886 mm
Y direction: mean 0.4529 mm, minimum 0.3077 mm, maximum 0.6925 mm
Z direction: mean 0.4317 mm, minimum 0.3804 mm, maximum 0.5064 mm

Page 21

VICON Motion Capture

• Motion capture technology: near-IR cameras, retro-reflective markers, Datastation + PC workstation

• Vicon modes of operation: individual points (as seen in calibration), kinematic models, individual objects

[Embedded image]

Page 22

VICON Motion Capture

• Learning about MoCap
  - 11/03: initial installation
  - 6/04: pilot scenario, using kinematic models
  - 10/04: follow-up training using object models
  - 11/04: rehearsed using Vicon with object models
  - 1/05: data captured for FME scenario

• Export position information for each participant’s head, hand, and body position & orientation

• Post-processing of motion capture data: ~1 hour per minute for a 5-participant meeting

• Incorporating MoCap into workflow: labeling of point clusters is labor intensive; 3 work-study students @ 20 hours/wk = ~60 minutes of data (1 dataset) processed per week

Page 23

Speech Processing Tasks

• Formulate an audio workflow to support the efficient and effective construction of a large, high-quality multimodal corpus

• Implement support tools to achieve this goal

• Package time-aligned word transcriptions into appropriate data formats that can be efficiently shared and used

Page 24

Audio Processing

Audio Recording & Meeting Metadata Annotation
→ Audio Segmentation
→ Manual Transcription
→ Forced Alignment (with OOV Word Resolution)
→ Corpus Integration
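The end product of this pipeline is a set of time-aligned word transcriptions that must be shared across the project. A minimal sketch of one way to package them; the field names, file format, and sample values here are illustrative assumptions, not the corpus's actual schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AlignedWord:
    """One time-aligned word token produced by forced alignment."""
    speaker: str
    word: str
    start: float  # seconds from session start
    end: float

def package_transcript(words, path):
    # One JSON object per line (JSON Lines), sorted by start time, so
    # downstream tools can stream the file and merge it with video and
    # motion-capture streams on the shared time base.
    with open(path, "w", encoding="utf-8") as f:
        for w in sorted(words, key=lambda w: w.start):
            f.write(json.dumps(asdict(w)) + "\n")

# Toy tokens (hypothetical values, not corpus data)
words = [
    AlignedWord("F2", "post", 46.58, 46.90),
    AlignedWord("F2", "washington", 46.11, 46.58),
]
package_transcript(words, "session.jsonl")
```

Keeping one token per line makes the format easy to diff, concatenate, and filter with ordinary text tools, which matters when many sites must exchange the same data.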

Page 25

VACE Metadata Approach

Page 26

Data Collection Status

• Pilot: June 04
  - Low audio volume → sound mixer purchased
  - Video frame drop-out → purchased high-grade DV tapes

• AFIT 02-07-05 Democratic movement assistance (sessions 1 and 2)
  - Audio clipping in close-in mics; may be able to salvage data using the desktop mics

• AFIT 02-24-05 Humanitarian Assistance (Tsunami)
• AFIT 03-04-05 Humanitarian Assistance (Tsunami)
• AFIT 03-18-05 Scholarship selection
• AFIT 04-08-05 Humanitarian Assistance (Tsunami)
• AFIT 04-08-05 Card game
• AFIT 04-25-05 Problem-solving task (cause of deterioration of Lincoln Memorial)
• AFIT 06-??-05 Problem-solving task

Page 27

Some Multimodal Meeting Room Results

Page 28

F2 & F1 Lance Armstrong Episode

[Embedded video: NIST Microcorpus, July 29, 2003. Meeting Dynamics: F1 vs F2]

Page 29

M1: [do you wanna pull up a site here to]
  Gaze/orientation (head; instrumental): F1, M1, and M2 gaze at screen off camera, or might be looking at F2, who is now off camera walking toward the board

00:00:46:11  F2: [washington post]
  Gaze/orientation (head; interactive): 36.000, F1 directs gaze at F2
  Action: F1 twists in chair toward F2

M1: [/]  F2: [/]
  Gaze/orientation (head; interactive): M1 and F1 gaze directed at F2; M2 gaze remains directed at screen

F1: [yeah what do you what do you like]

00:00:47:14  F1: yeah washington
  Action: F2 facing whiteboard

M2: [we should put them in and just vote]
  Gaze/orientation (head; interactive): M2 turns head to direct gaze to M1 (?). NB: it is possible that M2’s gaze is following F2, who is walking around the table at this point, up until M2 begins speaking

M1: [/]  F2: [/]
  Gaze/orientation (head; interactive): F1 and M1 gaze toward M2

M2: %chuckle

M1: vote [<uhh>]
  M1 turns head left to direct gaze at screen (possibly)

M2: we all follow the news right
  Action: M2 pulls back in chair away from table

F2: [/]
  Gaze/orientation (head; instrumental): F1 gaze directed down at paper on table
  Action: F1 tears off a piece of paper from her pad on the table

Gaze direction tracks social patterns (interactive gaze) and engagement of objects (instrumental gaze), which may be a form of pointing as well as perception

Interactive gaze occurrences, 5 min. sample:

Source \ Target    F2    F1    M1    M2     ∑
F2                  -           1     2     3
F1                  3     -     4     3    10
M1                  6     1     -     2     9
M2                  3           2     -     5
∑                  12     1     7     7

Gaze - NIST July 29, 2003 Data
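Source-by-target tallies like the interactive-gaze table above can be produced mechanically from coded gaze events. A minimal sketch, using a toy event stream rather than the corpus data:

```python
from collections import Counter

def gaze_matrix(events, participants):
    """Tally coded (source, target) gaze events into a source-by-target
    count matrix plus row (gazer) and column (gazee) totals."""
    counts = Counter(events)
    matrix = {s: {t: counts[(s, t)] for t in participants if t != s}
              for s in participants}
    row_totals = {s: sum(matrix[s].values()) for s in participants}
    col_totals = {t: sum(matrix[s].get(t, 0) for s in participants if s != t)
                  for t in participants}
    return matrix, row_totals, col_totals

# Hypothetical coding output, not the corpus data
events = [("F1", "F2"), ("F1", "F2"), ("M1", "F2"), ("M2", "M1")]
matrix, rows, cols = gaze_matrix(events, ["F1", "F2", "M1", "M2"])
```

The row totals say how often each participant initiates interactive gaze, and the column totals how often each participant is gazed at, which is how summaries like the one above are read.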

Page 30

Gaze - AFIT data

Gazer (rows) × gazee (columns B–H); dashes mark the diagonal:

B: ---- 1 3 1 3        (total 8)
C: ---- 6 1 5          (total 12)
D: 4 ---- 3 5 1        (total 13)
E: 10 ---- 1 3         (total 14)
F: 5 5 ---- 3          (total 13)
G: 7 3 1 ----          (total 11)
H: 10 1 10 1 8 ----    (total 30)

Column (gazee) totals: 37, 1, 30, 5, 27, 1

Roles: CO (Moderator), General’s Rep., Engineering Lead

Page 31

F-formation analysis

“An F-formation arises when two or more people cooperate together to maintain a space between them to which they all have direct and exclusive [equal] access.” (A. Kendon 1977).

An F-formation is discovered from tracking gaze direction in a social group. It is not only about shared space. It reveals common ground and has an associated meaning. The cooperative property is crucial.

It is useful for detecting units of thematic content being jointly developed in a conversation.
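As a crude illustration of the gaze-tracking side only (a real F-formation also requires the spatial and cooperative properties noted above), a sketch that flags frames where coded gaze tracks converge on a shared target; the track data here is hypothetical:

```python
def shared_gaze_spans(tracks, min_members=2):
    """Given per-participant gaze-target tracks sampled on a common time
    base (equal-length lists of target labels), return the frames where
    two or more participants attend to the same target -- a rough cue
    for a candidate F-formation around that target."""
    names = list(tracks)
    length = len(next(iter(tracks.values())))
    spans = []
    for i in range(length):
        by_target = {}
        for name in names:
            by_target.setdefault(tracks[name][i], []).append(name)
        for target, members in by_target.items():
            if len(members) >= min_members:
                spans.append((i, target, sorted(members)))
    return spans

# Hypothetical coded gaze tracks for three participants over four frames
tracks = {
    "F1": ["Screen", "F2", "Screen", "M1"],
    "M1": ["Screen", "F2", "F1", "F1"],
    "M2": ["Notes",  "F2", "Screen", "Notes"],
}
spans = shared_gaze_spans(tracks)
```

Runs of consecutive flagged frames with the same membership would then be merged into the joint-attention units used for thematic segmentation.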

Page 32

[Figure: F-formation coding timeline for F1, F2, M1, M2. Utterances: “this week I believe”, “You're good”, “I don't know I was thinking about the sports story”, “That sports one's”, “You thinking Coby Bryant”, “Click on sports right here”, “I'm thinking”, “on the left”, “Lance Armstrong”, “I prefer that story”, “I'm I'm a huge Actually I'm a huge Tour fan so”, “I think Lance Armstrong's”, “I was thinkin' Lance Armstrong”, “Yeah I just didn't like the bad news aspect of the”, “I'm sorry”, “Yeah the the”. Gaze states: F2@M2; M1@F1; F2-M1 mutual; F2@Screen; M1@F2; F2@M1; M1@Screen; F1@Screen; F1@M1; F1@F2; F2@F1; M1-F1 mutual.]


NIST-F-Formation Coding (76.11s–92.27s)

Page 33

NIST-F-Formation Coding (92.27s–108.97s)

[Figure: F-formation coding timeline for F1, F2, M1, M2. Utterances: “bike guy Lance Yeah Yeah I think so too”, “Yeah Lance Armstrong”, “Yeah Lance Armstrong Tour de France”, “Lance okay”, “Yeah”, “Where were you”, “Where did you wanna go Laura”, “Well there's sports right up there right near the metro”, “Oooo-Kaaaaay Yeah that's ...”, “No Never mind”, [laughs], [laughs], “Let's go to your page What is it you like CNN?”, “CNN”, “CNN”, “Yeah let's look there”, “Yeah you don't”. Gaze states: M1-F1 mutual; F2@Whiteboard; F2@Notes; F2@F1; F2@Screen; M1@F1; M1@Screen; F1@F2; F1@Screen; F1@M1; F1@Whiteboard; F1@M2; M1+F1 shared.]

Page 34

Summary

• Corpus collection based on sound scientific foundations

• Data includes audio, video, motion capture, speech transcription, and manual codings

• A suite of tools for visualizing and coding the co-temporal data has been developed

• Research results demonstrate multimodal discourse segmentation and meeting dynamics analysis