Data collection and experimentation


Why should we talk about data collection?

• It is a central part of most, if not all, aspects of current speech technology

• The higher grades (A, B; as tested in the home exam assignments and the project) require a measure of data collection


What is data collection?

• In speech technology, the gathering of human communicative behaviours that can be used for implementation of e.g. spoken dialogue systems

• What do we gather?
  - Speech
  - Text
  - Voices
  - Gestures
  - Patterns!


All vs one?

• Recognition: we want to have seen all possibilities
• Synthesis: we want one, consistent behaviour


Group exercise

• Same groups as before
• Design one or more data collection(s) that will become the basis for a spoken dialogue system intended to inform users of the television program
• Take note of why you make your design choices
• We’ll talk about it here in 30 minutes


• Application
  - Remote control
    • Select programme
    • Menu options - tree
  - TV guide
    • More free speech
    • But connected to GUI options (e.g. for lists)
• Data
  - Room environment
  - Age recognition data
    • Recognize age
    • Recognize identity of a specific mother
  - Usage probabilities
  - Asking people - ratings
  - Language? Programmes are English, Swedish
  - Read TV guide
  - But people speak differently (“trean”)
  - Monitor corpus (updated)
  - “Beta” version – iterative process (h/h, WoZ, beta)
  - Demography: adults, elderly, kids?
  - Keywords
    • Cloud
    • Times
    • Some commands


Multimodal corpus work: manual annotation, validation and computer driven analysis. Jens Edlund, 2012-09-01-05


What is a corpus?

• Wikipedia:
  - A collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.


Why collect a corpus?

- ”[...] for the purpose of studying linguistic structures, frequencies, etc.”

- Sample - cannot analyze all

- Training data for duplicating behaviours

- Analysis of how humans do things

- Generalisability, representativeness
  • Same results in different corpora
  • Use constraints, standards, theories to form the corpus
  • If findings are expected - corroborate theory - we're better off
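The point about getting the same results in different corpora can be illustrated with a small sketch. It compares relative word frequencies in two toy samples; the sentences, variable names and the `relative_frequencies` helper are made up for illustration and are not taken from any real corpus or from the lecture.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Two tiny, invented samples standing in for separately collected corpora.
corpus_a = "what is on channel three tonight what is on now".split()
corpus_b = "what is on tonight is there a film on channel three".split()

freq_a = relative_frequencies(corpus_a)
freq_b = relative_frequencies(corpus_b)

# Words seen in both samples: similar shares suggest the finding generalises.
shared = sorted(set(freq_a) & set(freq_b))
for word in shared:
    print(f"{word:10s} {freq_a[word]:.2f} {freq_b[word]:.2f}")
```

With samples this small the numbers mean little; the sketch only shows the shape of the comparison.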


How is a corpus collected?

• Often high formal demands:
  - Structure
  - Balance

• Audio, visual, audiovisual - choice of modalities
  - Requires equipment
  - Silent lab
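One way to make the balance demand concrete is to check how speakers are distributed over a demographic attribute. The following is a minimal sketch under stated assumptions: the speaker records, the `age_group` field, both helper functions and the 10 percentage point tolerance are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def balance_report(speakers, attribute):
    """Return the share of speakers for each value of the given attribute."""
    counts = Counter(speaker[attribute] for speaker in speakers)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def is_balanced(shares, tolerance=0.10):
    """True if every group's share is within `tolerance` of an equal split."""
    target = 1 / len(shares)
    return all(abs(share - target) <= tolerance for share in shares.values())


# Invented speaker metadata; a real collection would load this from its records.
speakers = [
    {"id": "s01", "age_group": "adult"},
    {"id": "s02", "age_group": "adult"},
    {"id": "s03", "age_group": "elderly"},
    {"id": "s04", "age_group": "child"},
]

shares = balance_report(speakers, "age_group")
print(shares, is_balanced(shares))
```

An equal split is only one possible target; a collection might instead aim to match the demography of the intended user group.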


Where are corpora collected?


When are corpora collected?

• Often collected once, then static
  - But monitor corpora exist
  - And the web is, as always, changing things


Examples of corpora?


Thank you! Questions?