An introduction to the multimodal corpus
HuComTech and its annotation
László Hunyadi, University of Debrecen
Nijmegen, MPI, 19 December 2012
Wednesday, April 24, 13
research modules involved:
• computational linguistics
• communication theory
• psychology
• digital image processing
• engineering (robotics)
Purpose of the corpus
to identify elements of human-human communication and structural relations between them that are
- relevant for HCI
- technologically implementable
furthermore, to
- learn the multimodal nature of human communication (both its verbal and nonverbal aspects)
- describe human communication in a multimodal, holistic model
The corpus is intended to represent sufficient data in proper arrangement for purposes of
- linguistics
- language technology (the training and testing of speech recognition software)
- behavioral psychology
- robotics, and more
Corpus:
• approx. 60 hours of video recordings of 111 speakers aged 18-29, including
• ≈ 450,000 word tokens
• 15 read sentences
• 10-minute guided dialogues (job interviews)
• 15-minute free dialogues
[Pie chart: distribution of subjects by sex — male vs. female, 45.5% / 54.5%]
[Bar chart: distribution of subjects by age (19-30), number of subjects per year]
Annotation
serves the study of multimodality through the study of unimodality and the fusion of aligning markers
Markers to be annotated are determined by a theoretical-technological model of communication
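The "fusion of aligning markers" can be pictured as pairing annotation intervals from two unimodal tiers that overlap in time. A minimal sketch in Python; the tier contents and the 0.1 s minimum-overlap threshold are illustrative assumptions, not the project's actual tooling or values:

```python
# Fuse two unimodal annotation tiers by temporal overlap.
# Tier contents and the min_overlap threshold are illustrative only.

def overlap(a, b):
    """Overlap duration (s) of two (start, end) intervals; 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def fuse(tier_a, tier_b, min_overlap=0.1):
    """Pair labels whose intervals overlap by at least min_overlap seconds."""
    pairs = []
    for (s1, e1, lab1) in tier_a:
        for (s2, e2, lab2) in tier_b:
            if overlap((s1, e1), (s2, e2)) >= min_overlap:
                pairs.append((lab1, lab2, max(s1, s2), min(e1, e2)))
    return pairs

audio = [(0.0, 1.2, "turn-take"), (1.2, 3.0, "speech")]
video = [(0.1, 0.9, "gaze:forward"), (2.5, 3.5, "nod")]
print(fuse(audio, video))
```

Each fused pair records the two aligning labels and the span they share, which is the unit a multimodal study can then count or correlate.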
We assume that human communication has a two-way mechanism: speakers and listeners rely on the same mechanism to communicate.
Therefore, in order to properly represent human communication for technology, we need a model to follow this two-way mechanism: to serve both synthesis and analysis.
We assume that the approach of a generative model proposed for syntax (especially Chomsky 1981) can be useful in building such a two-way model of communication for technology.
Theoretical considerations
• Each module has a characteristic finite set of primitives; by way of the Operational component, these primitives are combined into an infinite set of non-primitives and further structures
• The basic structure generates all and only those structures (configurations of primitives) that are formally possible (‘grammatical’) in any communicative event.
E.g. the ‘start’ of an event can be followed by the ‘end’ of an event, but the inverse order is not possible (‘ungrammatical’)
• The functional extension assigns to any given structure generated by the basic structure all possible communicative functions, and only those
E.g. the restart node in the basic structure can be associated with the continuity function, but the function turn-taking cannot be assigned to it.
Characteristics of the constituent modules
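The grammaticality constraints above can be sketched as a toy transition table: ‘start’ may be followed by ‘end’ but not the reverse, and the functional extension licenses ‘continuity’ but not ‘turn-taking’ on a restart node. The event and function inventories below are illustrative stand-ins, not the model's actual primitives:

```python
# Toy stand-in for the basic structure: allowed transitions between
# event primitives ('grammatical' configurations only).
ALLOWED = {
    ("start", "restart"),
    ("start", "end"),
    ("restart", "end"),
}

# Toy stand-in for the functional extension: functions assignable per node.
FUNCTIONS = {
    "start":   {"turn-taking"},
    "restart": {"continuity"},   # 'turn-taking' is NOT assignable here
    "end":     {"turn-giving"},
}

def grammatical(seq):
    """True iff every adjacent pair of events is a licensed transition."""
    return all(pair in ALLOWED for pair in zip(seq, seq[1:]))

print(grammatical(["start", "end"]))              # True
print(grammatical(["end", "start"]))              # False ('ungrammatical')
print("turn-taking" in FUNCTIONS["restart"])      # False
```

The point of the encoding is the "all and only" property: any sequence or function assignment outside the tables is rejected.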
• The pragmatic extension actualizes the input from the functional extension for the given actual communicative event by selecting the appropriate markers and their appropriate values based on the given scenario and ontology behind the event.
E.g. the function ‘happiness’ is expressed by the appropriate value of some modal marker(s): facial, gestural, audio, lexical, or some/all of them.
Characteristics of the constituent modules
Interface to technology: application of the model
The pragmatic extension is the interface to technology [diagram: Pragmatic extension → Technology]:
• Functions are translated into their technological counterparts as parameters through data fusion
• The pragmatic extension selects the modalities and their markers to represent the given function
• Actual occurrences of markers are represented by the corresponding parameter values
• Technology receives these parameter values as input and operates on them.
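The selection step can be sketched as a small mapping from a communicative function to modal markers, filtered by the scenario. All marker names and values here are hypothetical illustrations, not the corpus's actual parameter inventory:

```python
# Toy pragmatic extension: realize a communicative function as parameter
# values for the modalities available in the given scenario.
# Marker names/values are hypothetical, not the corpus schema.

MARKERS = {
    "happiness": {
        "facial":  "happy",
        "audio":   "pitch_range:wide",
        "lexical": "positive",
    },
}

def realize(function, scenario):
    """Return {modality: parameter value} for the scenario's modalities."""
    selected = MARKERS.get(function, {})
    return {m: v for m, v in selected.items() if m in scenario["modalities"]}

params = realize("happiness", {"modalities": {"facial", "audio"}})
print(params)  # {'facial': 'happy', 'audio': 'pitch_range:wide'}
```

The returned dictionary is what "technology receives as input": one parameter value per selected modality.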
Annotation
unimodal (video, audio) / multimodal (video + audio)
manual / automatic
description of physical properties (esp. video) / interpretative annotations
with focus on emotions and the multimodal alignment of video and audio
special features of annotation: pragmatic, syntactic, prosodic
Levels of annotation and attributes (labels)
Audio
IP-level: HC, SC, EM, IN, BC, HE, RE, IT, SL, V
discourse-level: TT, TK, BC, SL
emotion-level: neutral, sad, happy/laughing, surprised, recall, tensed (and degrees of: strong, moderate, reduced), other, silence
Levels of annotation and attributes (labels)
Video
comevent: start, end
deictic: addressee, self, measure, object, shape; left, right, both
emblems: attention, agree, block, disagree, doubt, doubt-shrug, refusal, surprise, more-or-less, number, finger-ring, hands up, one hand other hand, other
Levels of annotation and attributes (labels)
Video
headshift: lower, turn, raise, shake, nod; sideways, left, right
touchmotion: hair, leg, arm, face, eye, ear, chin, mouth, neck, bust, forehead, nose, glasses; tap, scratch; left, right
emotions: natural, happy, recall, sad, surprise, tense; (and degrees of: strong, moderate, reduced)
Levels of annotation and attributes (labels)
Video
posture: crossing arm, holding head, lean back, lean forward, lean left, lean right, rotate right, rotate left, shoulder up, upright
handshape: breaking, fist, crossing fingers, open flat, open spread, thumb out, index out; left, right, both
Levels of annotation and attributes (labels)
Video
facial expressions: natural, happy, recall, sad, surprise, tense (and degrees of: moderate, reduced, strong)
eyebrows: scowl, up; left, right, both
gaze: blink, up, down, left, right, forwards, left-up, left-down, right-up, right-down
Levels of annotation and attributes (labels)
Syntax
structural segmentation:
clause boundaries
hierarchical arrangement of clauses
internal structure of clauses (esp. missing elements)
Levels of annotation and attributes (labels)
Syntax vs. prosody (prosody of clauses)
pitch movement: rise, fall, stagnant + finer distinctions
intensity: increase, decrease, stagnant + finer distinctions
pause/duration: increase, decrease, stagnant + finer distinctions
Levels of annotation and attributes (labels)
Pragmatics - multimodal
Annotation: DiAMSL for text-based events, on several layers, esp. audio: turn management, discourse
Multimodality - the complex of audio + video multimodal communicative act (annotation based on Bach-Harnish)
Levels of annotation and attributes (labels)
Pragmatics - multimodal
communicative act types: constatives, directives, commissives, acknowledgements, none
supporting events of communicative acts: backchannel, politeness markers, corrections, no support
Levels of annotation and attributes (labels)
Pragmatics - multimodal
thematic control: topic initiation, elaboration, topic change (contextual, non-contextual)
information structure: given vs. new information
Levels of annotation and attributes (labels)
Pragmatics - unimodal
agreement: uninterested, disagree, block, uncertainty; full, partial
attention: calling, paying
deixis
Levels of annotation and attributes (labels)
Pragmatics - unimodal
information: received novelty
turn-management: intending to start speaking, start speaking successfully, end speaking, breaking in
Sample data for multimodal alignments
Turn management
turn-give: forward, blink, down, left-down, right-down
turn-take: forward, blink, down, left-down, right-down
break-in_turn-keep: forwards, blink, up, down, left-down, right-down
Sample data for multimodal alignments
Emotions vs. gestures
uncertainty is mostly found to be associated with the hand gesture open spread, less frequently with crossing fingers
agreement is also associated with open spread and crossing fingers
Sample data for multimodal alignments
Emotions vs. gestures
doubt is found to be associated with open spread, crossing fingers and sideways as well
Video annotation (manual vs. automatic)
annotation method | physical values | interpretative values
manual            | -               | +
automatic         | +               | -

Essential difference between the two: automatic annotation is ‘digital’, making framewise judgements across a predefined number of frames, whereas manual annotation is ‘analog’.
Video analysis state log - Face Model: General - Calibration: Continuous
Start time: 2012.06.15. 9:35:15
Filename: C:\Users\MACMINI\Documents\Noldus test\007_J_C2.trimmed.mov
Frame rate: 3.33333333333333

Video Time    Emotion
0:00:03.000   Unknown
0:00:19.800   Neutral
0:00:34.800   Unknown
0:00:35.699   Neutral
0:00:39.300   Sad
0:00:42.900   Neutral
0:00:49.500   Scared
0:00:51.000   Sad
0:00:51.900   Neutral
0:01:02.400   Sad
0:01:05.100   Disgusted
0:01:07.800   Neutral
0:01:13.500   Angry
[... log continues with Neutral/Sad/Angry/Scared/Surprised/Disgusted/Happy/Unknown entries ...]
0:10:14.400   Unknown
0:10:15.900   END

Values are assigned to single frames, hence only the begin time of each value is logged.
Settings and sample output
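Since only onset times are logged, each label's duration has to be recovered by pairing consecutive entries. A minimal sketch (times converted to seconds; values taken from the opening entries of the sample log; not the Noldus tooling itself):

```python
# Convert a frame-wise state log (onset times only) into labeled intervals,
# recovering each emotion's duration. Times are in seconds.

def log_to_intervals(events):
    """events: list of (time, label); the final entry marks the end of the log."""
    return [(t0, t1, lab) for (t0, lab), (t1, _) in zip(events, events[1:])]

# Opening entries of the sample log, converted to seconds:
log = [(3.0, "Unknown"), (19.8, "Neutral"), (34.8, "Unknown"),
       (35.699, "Neutral"), (39.3, "Sad"), (42.9, "Neutral"),
       (49.5, "Scared"), (51.0, "END")]

for start, end, label in log_to_intervals(log):
    print(f"{start:8.3f}-{end:8.3f}  {label}")
```

Note that this only works where labels are continuous; as discussed below, the offset of an emotion cannot be recovered this way if frames between two entries were left unlabeled.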
Comparison: automatic vs. manual emotion recognition
[Pie chart — manual annotation: happy, natural, recall, tense, surprise; two labels dominate at 45% and 42%, the rest at 7%, 4%, 3%]
Comparison: automatic vs. manual emotion recognition
Although both systems are based on the FACS model of emotions, they recognise different categories (emotions).
Whereas both systems assign interpretative values, manual annotation selects ‘more difficult’ ones.
Manual annotation offers subjectively observed degrees of emotions (strong, moderate, reduced); for automatic annotation, thresholds for being ‘happy’, ‘angry’, etc. are determined statistically, so smaller degrees are left out.
Comparison: automatic vs. manual emotion recognition
Occasional unrealistic values in automatic annotation are the result of the single-frame approach.
Duration is not marked, so the offset of an emotion cannot be determined in the case of non-continuous labels.
Most agreement between the two approaches: happy, natural
Annotation of spoken syntax and its relation to prosody in the HuComTech corpus
• aims: language technology (speech-to-text)
• communication studies (alignment of multimodal markers for communicative acts and emotions)
• linguistics (the syntax-prosody interface)
Syntactic data from our annotation
Spoken language vs. written language
Grammar: same or different?
Same underlying principles:
- grouping of elements
- hierarchical organisation of groups
Difference: two additional dimensions of spoken language:
- time
- grouping has language-specific means
[Bar chart: number of clauses per sentence (1-22), informal vs. formal dialogs]
Structural relations (hierarchy)
clause sequences with no structural relation:
- has no subordinate clause
- has no coordinate clause
- has neither subordinate, nor coordinate clause
embeddings, insertions, multiple subordination (recursion)
Type of missing element according to syntactic code

Type of missing element             Informal dialogs   %       Formal dialogs   %
1. nothing missing                  2664               35.59   758              34.6
2. main clause                      37                 0.49    15               0.69
3. preceding clause                 58                 0.77    6                0.27
4. relative pronoun                 89                 1.19    22               1.01
5. conjunction                      22                 0.29    4                0.18
6. subject (grammatical)            3178               42.37   1167             54.14
7. subject (logical)                274                3.66    113              5.17
8. predicate                        214                2.87    72               3.29
9. object                           102                1.36    45               2.06
10. adverb                          11                 0.15    4                0.18
11. attribute                       0                  0       0                0
12. verb                            10                 0.13    0                0
13. unfinished clause               728                9.7     167              13.1
14. missing element not relevant    3375               45.05   769              35.21
Sum:                                                   143.62                   149.9

Type of missing element by frequency: the same rows sorted by informal-dialog frequency — 14 (not relevant), 6 (grammatical subject), 1 (nothing missing), 13 (unfinished clause), 7 (logical subject), 8 (predicate), 9 (object), 4 (relative pronoun), 3 (preceding clause), 2 (main clause), 5 (conjunction), 10 (adverb), 12 (verb), 11 (attribute)
Syntactic types vs. gestures
Alignment of syntactic type and gestures can offer an insight into certain cognitive processes in communication:
- speech dynamics
- error detection
- gesturing the “untold”
Syntactic types vs. gestures
A very interesting finding: pause (silence) and gaze can be mutually supplementary:
We found very few instances of gazing at the right overlap of an unfinished clause followed by a pause, but there was frequent gazing if there was no pause in a similar position.
Also, looking up as gazing direction was specific to alignment with the end of an unfinished clause but was quite rare at the end of a finished/complete clause.
[Pie chart — informal dialogs: gaze aligned with unfinished clauses (type 13) with no SL; directions forwards, left-down, right-down, right-up, left-up, up, down, blink; one direction dominates at 49%, the rest at 14%, 11%, 10%, 6%, 5%, 2%, 2%]
• the “IP-level” (based on F0-contour and pause, manual)
• pitch movement (automatic)
• intensity change (automatic, in progress)
• accent/stress detection (automatic, in progress)
• aim: to generate data on pitch movement trends (actual movement type, tone range)
• capture F0-properties of syntactic types
• assign communicative functions (including emotions)
Detection of pitch movement
• based on Praat (algorithm and scripts)
• stylization on the syllable level (P. Mertens’ Prosogram: ‘perceived pitch’ in semitones, http://bach.arts.kuleuven.be/pmertens/prosogram/)
• trend of syllable based stylization (Szekrényes, 2012)
• classification
The calculation of the trend of pitch movement
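The classification step can be sketched as follows. The real procedure works on Prosogram's syllable-level stylization and on Szekrényes (2012)'s trend calculation; the 1.5-semitone threshold and the direct start/end comparison below are illustrative assumptions. The slope function reproduces the "Change across time (Hz/msec)" column of the sample table further down:

```python
import math

# Toy classifier for the trend of a pitch movement from start/end F0 (Hz).
# The 1.5-semitone threshold is an assumption, not the project's value.

def semitones(f_start, f_end):
    """Pitch interval between two F0 values, in semitones."""
    return 12.0 * math.log2(f_end / f_start)

def classify(f_start, f_end, threshold_st=1.5):
    st = semitones(f_start, f_end)
    if st > threshold_st:
        return "rise"
    if st < -threshold_st:
        return "fall"
    return "stagnant"

def slope_hz_per_ms(f_start, f_end, duration_s):
    """Absolute F0 change per millisecond ('Change across time' column)."""
    return abs(f_end - f_start) / (duration_s * 1000.0)

# First row of the sample table: 236.31 Hz -> 172.81 Hz over 0.57 s
print(classify(236.31, 172.81))                       # fall
print(round(slope_hz_per_ms(236.31, 172.81, 0.57), 6))
```

With this threshold the toy classifier matches the movement labels of all rows in the sample table (fall, rise, stagnant), though the corpus's actual classification criteria may differ.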
[Prosogram sample (v2.80, file 06mc22 F shure, 150 Hz): Hungarian utterance “{p} %o hát %o pályakezdő vagyok, úgyhogy legfeljebb most a tanulmányaimról tudok mesélni” (“well, I’m a fresh graduate, so at most I can talk about my studies for now”), syntactic codes 1.0.2.0.0.0.6. and 2.3.1.0.0.0.6.; syllable-level stylization labeled upward / stagnant / rise / fall / stagnant, all at pitch level L1, with F0 values between 108.28 and 155.21 Hz]
Calculations associated with syntactic type but not based on it
Filename | StartTime | EndTime | Duration | StartValue | EndValue | AbsoluteDifference | Change across time (Hz/msec) | Movement | ActualF0Range | Sentence # | Clause analysis
057fc30_I_shure 136.44 137.01 0.57 236.31 172.81 63.5 0.111411 fall MM s30 1.0.0.0.0.0.6.
057fc30_I_shure 137.01 137.06 0.05 172.81 221.4 48.59 1.079783 rise MM s30 1.0.0.0.0.0.6.
057fc30_I_shure 137.06 137.29 0.23 221.4 255.44 34.04 0.14798 rise MH2 s30 1.0.0.0.0.0.6.
057fc30_I_shure 137.29 137.78 0.49 255.44 179.93 75.5 0.153578 fall H1M s30 1.0.0.0.0.0.6.
057fc30_I_shure 137.78 138.32 0.54 179.93 186.13 6.2 0.011479 stagnant MM s31 1.2.0.0.0.0.4,6,9.
057fc30_I_shure 138.32 138.42 0.1 186.13 229.14 43.01 0.452767 rise MM s31 1.2.0.0.0.0.4,6,9.
057fc30_I_shure 138.42 139.03 0.61 229.14 168.1 61.05 0.10008 fall ML1 s31 1.2.0.0.0.0.4,6,9.
057fc30_I_shure 139.03 139.08 0.05 168.1 198.17 30.07 0.6014 rise L1M s31 2.0.0.1.0.0.6.
057fc30_I_shure 139.08 139.2 0.12 198.17 165.49 32.68 0.272326 fall ML1 s31 2.0.0.1.0.0.6.
057fc30_I_shure 139.2 139.56 0.36 165.49 169.56 4.07 0.011147 stagnant L1M s31 2.0.0.1.0.0.6.
057fc30_I_shure 139.56 139.63 0.07 169.56 201.24 31.68 0.452586 rise MM s31 2.0.0.1.0.0.6.
057fc30_I_shure 139.63 140.2 0.56 201.24 208.74 7.5 0.013276 stagnant MM s31 2.0.0.1.0.0.6.
5 pitch levels: L2, L1, M, H1, H2; slope in Hz/ms; absolute F0 values in Hz
Syntactic types vs. F0
[Pie chart — formal, type 13 (unfinished clause): fall / rise / stagnant, shares 70%, 23%, 6%]
[Pie chart — informal, type 13 (unfinished clause): fall / rise / stagnant, shares 61%, 25%, 14%]
Syntactic types vs. F0
[Pie chart — formal, type 2 (main clause missing): fall / rise / stagnant, shares 40%, 16%, 44%]
[Pie chart — informal, type 2 (main clause missing): fall / rise / stagnant, shares 56%, 29%, 16%]
Syntactic types vs. F0
[Pie chart — formal, type 3 (subord. clause missing): fall / rise / stagnant, shares 70%, 23%, 8%]
[Pie chart — informal, type 3 (subord. clause missing): fall / rise / stagnant, shares 56%, 29%, 16%]
methodology based on the calculation of the trend of pitch movement (currently being implemented)
Detection of intensity change
• based on Hunyadi 2002
• PET: pitch and energy over time
• accent/stress is the result of the interaction of pitch and intensity: relative prominence
• absolute PET-value + duration
Detection of accent/stress
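The idea of accent as relative prominence arising from the interaction of pitch and energy can be sketched as below. The combination formula (syllable pitch and energy normalized to their means, weighted by duration) is only an illustrative stand-in; Hunyadi (2002)'s actual PET definition is not reproduced here:

```python
# Hedged sketch of accent/stress detection as pitch-energy interaction:
# prominence of each syllable relative to the utterance's mean values.
# The scoring formula is an illustrative assumption, not the PET method.

def prominence(syllables):
    """syllables: list of (f0_hz, energy_db, duration_s) -> relative scores."""
    mean_f0 = sum(s[0] for s in syllables) / len(syllables)
    mean_en = sum(s[1] for s in syllables) / len(syllables)
    return [(f0 / mean_f0) * (en / mean_en) * dur
            for f0, en, dur in syllables]

# Hypothetical three-syllable word: the second syllable is higher, louder,
# and longer, so it should come out as the accented one.
sylls = [(180, 60, 0.12), (240, 70, 0.20), (170, 58, 0.10)]
scores = prominence(sylls)
print(scores.index(max(scores)))  # 1
```

The point of the sketch is the "relative prominence" idea: no absolute threshold decides accent; each syllable is scored against its neighbours.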
THANK YOU!
http://hucomtech.unideb.hu/hucomtech