www.ntnu.no 2008-05-29 Rundkast at LREC 2008, Marrakech

LREC 2008

Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen

RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus


Overview

• Purpose of Rundkast
• An overview of the database Rundkast
• Structure of annotation
• Orthographic transcription
• Broad phonetic annotation


Purpose of Rundkast

Databases of broadcast news can be used for a number of research topics in speech technology such as:

• Supplement to existing databases of read speech for training and testing automatic speech recognition and speaker adaptation.

• Research on recognition of spontaneous speech.
• Research on automatic indexing of audio data.
• Research on topic and/or speaker segmentation.
• Research on speech/non-speech detection (e.g. background music).
• International research cooperation involving speech technology for broadcast news applications.

A corpus of this kind is necessary for language technology research, but has not been available for Norwegian.


Overview of Rundkast
http://www.iet.ntnu.no/projects/rundkast/

Database of 77 hours of radio broadcast news from the Norwegian Broadcasting Corporation (NRK):

• Read and spontaneous speech, as well as spontaneous dialogs and multiparty discussions

• Large variation between speakers, speaking styles, and topics

• Speaker turns may be rapid, and several speakers may talk simultaneously

• Recording quality includes both studio and telephone (mobile, satellite, etc.)

• Frequent occurrences of background noise, jingles, music, and audio illustrations

Funded by the Norwegian University of Science and Technology (NTNU)


Structure of annotation

Rundkast is hierarchically organized and orthographically annotated:

• Name of programme, type, and date
• Name of speaker (if known) and dialect (5 regions)
• Type of speech: spontaneity, channel, recording quality
• Segmented into speaker turns of approx. 2-5 seconds
• Orthographic transcription (standard Norwegian)
• Labels for noise (speaker noise, background noise, etc.)
• Labels for pronunciation mistakes, foreign words, unintelligible speech, etc.

• ~70 hours of work per hour of recording

Annotation tool: Transcriber, the ”standard” tool


Hierarchy of annotation levels

[Figure: one episode file, divided into three annotation levels]

• Level 1, sections: report, filler, nontrans, ..., report
• Level 2, speaker turns: speaker 1, speaker 2, no speaker, ..., speaker 1
• Level 3, segments: "[i] blah blah ...", "more blah ...[lp]", "[b-]noisy blah[-b] ..."
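
As a rough sketch of how the three levels nest, the snippet below reads a tiny Transcriber-style XML fragment into section, speaker-turn, and segment records. The element and attribute names (Section, Turn, Sync, type, speaker, time) are assumed to follow the usual Transcriber .trs layout; the sample text, speaker id, and times are invented, and the bracket codes are kept as plain text for simplicity, so the actual Rundkast files may look different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Invented sample in an assumed Transcriber-like layout (not real Rundkast data).
SAMPLE = """
<Trans>
  <Episode>
    <Section type="report" startTime="0.0" endTime="6.0">
      <Turn speaker="spk1" startTime="0.0" endTime="6.0">
        <Sync time="0.0"/>
        [i] blah blah
        <Sync time="3.2"/>
        more blah [lp]
      </Turn>
    </Section>
    <Section type="filler" startTime="6.0" endTime="8.0">
      <Turn startTime="6.0" endTime="8.0">
        <Sync time="6.0"/>
        [b-] noisy blah [-b]
      </Turn>
    </Section>
  </Episode>
</Trans>
"""

@dataclass
class Segment:                      # level 3
    start: float
    text: str

@dataclass
class Turn:                         # level 2
    speaker: Optional[str]          # None when the turn has no speaker
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Section:                      # level 1
    kind: str                       # e.g. "report", "filler", "nontrans"
    turns: List[Turn] = field(default_factory=list)

def parse_episode(xml_text: str) -> List[Section]:
    """Build the section > turn > segment hierarchy for one episode."""
    root = ET.fromstring(xml_text.strip())
    sections = []
    for sec_el in root.iter("Section"):
        section = Section(kind=sec_el.get("type", ""))
        for turn_el in sec_el.iter("Turn"):
            turn = Turn(speaker=turn_el.get("speaker"))
            for sync in turn_el.iter("Sync"):
                # The words of a segment follow its <Sync/> marker as tail text.
                text = " ".join((sync.tail or "").split())
                turn.segments.append(Segment(float(sync.get("time")), text))
            section.turns.append(turn)
        sections.append(section)
    return sections

for section in parse_episode(SAMPLE):
    for turn in section.turns:
        for seg in turn.segments:
            print(section.kind, turn.speaker, seg.start, seg.text)
```

Keeping the three levels as separate record types makes it easy to select, for example, only "report" sections or only turns with a known speaker.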


Orthographic transcription

• Segments, the lowest level in the annotation hierarchy, are transcribed orthographically.

• Orthographic transcription of spoken language is a challenge, especially for Norwegian. Using dialect in official settings is increasingly accepted.

• The majority of RUNDKAST does not conform to any standard pronunciation.

• The aim of the conventions for the orthographic transcription in RUNDKAST is to minimize uncertainty about pronunciations and facilitate consistency.


Orthographic transcription: Main conventions

• Words are transcribed with the written forms closest to actual pronunciations. A limited number of interjections are allowed.

• Text codes are used to mark mispronunciations, truncations, and unknown words.

• Numbers and symbols are written out as words.
• Abbreviations are not used.
• Punctuation marks are restricted to comma, period, and question mark.
• Space is used between spelled letters, also when acronyms have spelled pronunciation.
• Capital letters are used in proper names, spellings, and acronyms, but not at the start of sentences.
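
Some of these conventions lend themselves to an automatic check. The sketch below covers only two of them (numbers written out as words, punctuation restricted to comma, period, and question mark). The bracket syntax for text codes is an assumption, and treating every other non-alphanumeric character as disallowed may be stricter than the actual Rundkast conventions (hyphens inside words, for instance). The function name and messages are illustrative.

```python
import re
from typing import List

ALLOWED_PUNCT = {",", ".", "?"}

# Text codes such as [i] or [lp] are skipped; the bracket syntax is an assumption.
CODE = re.compile(r"\[[^\]]*\]")

def check_segment(text: str) -> List[str]:
    """Return a list of convention problems found in one transcribed segment."""
    problems = []
    plain = CODE.sub(" ", text)
    if re.search(r"\d", plain):
        problems.append("digits present: numbers should be written out as words")
    bad = sorted({ch for ch in plain
                  if not ch.isalnum() and not ch.isspace() and ch not in ALLOWED_PUNCT})
    if bad:
        problems.append("disallowed characters: " + " ".join(bad))
    return problems

print(check_segment("han er tjuefem år, tror jeg."))
print(check_segment("[i] han er 25 år"))
print(check_segment("hva! sa du"))
```

Conventions that depend on pronunciation, such as spacing between spelled letters, cannot be verified from the text alone and are left out of the sketch.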


Example annotation in Transcriber


Broad phonetic annotation

• Part of the data were to be phonetically annotated
– Use for low-level experiments in ASR (new methods), smaller Norwegian counterpart to TIMIT
– Auto-segmentation for e.g. unit selection TTS

• Annotation to be based on existing standards
– with necessary adjustments

• Exploit experience and specifications from development of Norwegian speech synthesis databases

• ”Suitable” level of detail: Acoustic boundaries should be labeled, but more phonemic than phonetic

• Consistency of utmost importance!


Broad phonetic annotation: Selected data

• 10 speakers (5 male and 5 female)

• Amount of speech per speaker:
– approx. 5 min ”planned” speech and 1 min spontaneous speech

– discard noisy parts (as far as possible)

– from more than one programme

– use turn segmentation from orthographic annotation

• All in all 1 hour of speech
• Approximately 1000 hours of work
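
The totals can be cross-checked with a little arithmetic. The figures below are taken from the slides (10 speakers, about 5 + 1 minutes each, about 1000 hours of work, about 70 hours of work per hour for the orthographic annotation, 77 hours of recordings). The orthographic total is only an extrapolation that assumes the ~70:1 ratio holds for the whole database.

```python
speakers = 10
planned_min, spontaneous_min = 5, 1

# Amount of phonetically annotated speech, in hours.
speech_hours = speakers * (planned_min + spontaneous_min) / 60

# Work hours per hour of speech for the broad phonetic annotation.
phonetic_ratio = 1000 / speech_hours

# Orthographic annotation: ~70 work hours per recorded hour, 77 hours in total.
orthographic_ratio = 70
orthographic_total = orthographic_ratio * 77   # extrapolation, not a reported figure

print(speech_hours)                                    # 1.0
print(phonetic_ratio)                                  # 1000.0
print(orthographic_total)                              # 5390
print(round(phonetic_ratio / orthographic_ratio, 1))   # 14.3
```

Under these assumptions, the phonetic annotation costs roughly 14 times more work per hour of speech than the orthographic annotation.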


Broad phonetic annotation: Main principles

• The annotation is mainly phonemic using the phoneme symbols closest to the perceived sound

• Acoustic boundaries should be marked; some acoustically motivated symbols are included

• A transcription as close as possible to the citation form is preferred

• Norwegian standard SAMPA is preferred
– Some English phonemes included as well as dialect variants

– Example: 3 variants of the /r/-sound:
/r/ (tap/trill)
/R/ (uvular fricative)
/r\/ (approximant)


Broad phonetic annotation: Annotation procedure

1. Conversion of orthographic transcription to a format suitable for automatic transcription.

2. Automatic segmentation with a phonotypical transcription using a speech recognizer.

3. Manual correction of both segments and labels by four phonetics students using Praat.

4. Format check.

5. Review of all annotation by one supervisor.
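
The slides do not say what the format check in step 4 covered. As one plausible sketch, the function below verifies that a phoneme tier, given as a list of (start, end, label) intervals, has positive durations, no gaps or overlaps, and only known symbols. The symbol set is a tiny illustrative subset (including a hypothetical "sil" label for silence), not the actual Rundkast inventory, and reading the intervals from Praat files is left out.

```python
from typing import List, Set, Tuple

Interval = Tuple[float, float, str]   # (start, end, label)

# Tiny illustrative inventory; the real symbol set is larger and "sil" is hypothetical.
SYMBOLS = {"r", "R", "r\\", "A:", "n", "t", "sil"}

def format_errors(tier: List[Interval],
                  symbols: Set[str] = SYMBOLS,
                  tol: float = 1e-6) -> List[str]:
    """Return human-readable problems found in one phoneme tier."""
    errors = []
    for i, (start, end, label) in enumerate(tier):
        if end <= start:
            errors.append(f"interval {i}: non-positive duration")
        if label not in symbols:
            errors.append(f"interval {i}: unknown symbol {label!r}")
        # Each interval should start where the previous one ended.
        if i > 0 and abs(start - tier[i - 1][1]) > tol:
            errors.append(f"interval {i}: gap or overlap with previous interval")
    return errors

good = [(0.0, 0.1, "sil"), (0.1, 0.18, "r"), (0.18, 0.3, "A:")]
bad = [(0.0, 0.1, "sil"), (0.12, 0.1, "x")]
print(format_errors(good))
for message in format_errors(bad):
    print(message)
```

Running such a check before the supervisor's review keeps mechanical errors out of the manual pass.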


Broad phonetic annotation: Comments on deviations

Always cases of uncertainty, need a log for these.

Problem: will the log be read?

Solution: Codes for deviations!

• Additional Praat tier for deviations
• Synchronous with the phoneme tier
• Easy to utilize automatically
• Examples:

– creaky voice

– unexpected voiced/unvoiced

– uncertain boundary or symbol

• ... in addition, a log file for any remaining deviations
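
Because the deviation tier is synchronous with the phoneme tier, it can be used automatically with very little code. The sketch below keeps only the phoneme intervals that carry no deviation code, for example to select clean training material. Both tiers are represented as lists of (start, end, label) tuples, and the code names ("creaky") are invented for illustration; the actual Rundkast codes are not given in the slides.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]   # (start, end, label)

def clean_phones(phones: List[Interval], deviations: List[Interval]) -> List[Interval]:
    """Keep phoneme intervals whose synchronous deviation interval is empty."""
    if len(phones) != len(deviations):
        raise ValueError("tiers are not synchronous")
    kept = []
    for (p_start, p_end, label), (d_start, d_end, code) in zip(phones, deviations):
        if (p_start, p_end) != (d_start, d_end):
            raise ValueError("tier boundaries differ")
        if not code:                  # empty label means no deviation was marked
            kept.append((p_start, p_end, label))
    return kept

# Invented example data; "creaky" is a hypothetical deviation code.
phones = [(0.0, 0.1, "t"), (0.1, 0.2, "A:"), (0.2, 0.3, "r")]
devs = [(0.0, 0.1, ""), (0.1, 0.2, "creaky"), (0.2, 0.3, "")]
print(clean_phones(phones, devs))
```

The same pairing can be inverted to collect only the marked intervals, for instance to study creaky voice.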


Example annotation in Praat


Concluding remarks

• Availability:
– Planned to be included for non-commercial use in a future Norwegian language bank

– Will complement other corpora also intended to be included

• To be validated by Spex
• Planned use at NTNU: SIRKUS project

– Investigation into new paradigms for ASR
– Low-level phone recognition experiments initially

• multilinguality aspects

– Spoken information retrieval