Forced Alignment of Spoken Audio Josef Fruehwald 19 April 2016



Why Forced Alignment?


What we had

Data - Static


What we wanted:

Data - Dynamic

[Figure: vowel plot with axes f2 and f1, each running from -1.2 to 1.2, showing the vowels æ, ɛ, ɪ, i:, ɑ, ʊ, ʌ. A dob scale runs from 1900 to 1990 in nine-year steps, and the legend distinguishes voicing: nasal, voiced, voiceless.]


Getting from what we have to what we want

1. Convert analogue recordings to digital format.
2. Identify where in the audio speech sounds of interest are.
3. Automate the acoustic analysis of the speech sounds.
4. Apply statistical analysis to the acoustic analysis for inferences.

· Preserve the most important metadata


Identifying where in the audio speech sounds of interest are.

"Forced Alignment"


Finding words in audio


Forced Alignment

Rest of the presentation:

· What some of the necessary bits and pieces are for doing forced alignment.
· What some of the tools out there are for doing alignment as an end user.


Bits and Pieces and Issues for doing forced alignment


Piece 1: A pronouncing dictionary

word   pronunciation
well   W EH1 L
there  DH EH1 R
was    W AH0 Z
one    W AH1 N
time   T AY1 M
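As a rough illustration of how such a dictionary might be held in memory, here is a minimal Python sketch. It assumes a plain-text layout with one entry per line: the word first, then space-separated phone labels. The function name `parse_pronouncing_dictionary` is made up for this example and is not part of any aligner.

```python
from collections import defaultdict


def parse_pronouncing_dictionary(lines):
    """Map each word to a list of pronunciations (each a list of phone labels).

    Assumed layout (hypothetical): "word PHONE PHONE ..." on each line.
    A word may appear on several lines, one line per pronunciation.
    """
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, *phones = line.split()
        entries[word.lower()].append(phones)
    return dict(entries)


# The five entries shown on the slide.
SAMPLE = """\
well W EH1 L
there DH EH1 R
was W AH0 Z
one W AH1 N
time T AY1 M
"""

lexicon = parse_pronouncing_dictionary(SAMPLE.splitlines())
print(lexicon["was"])
```

Storing a list of pronunciations per word, rather than a single one, leaves room for words that have more than one entry.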


Issue 1: Pronunciation Variants

What do you do for multiple pronunciations? e.g. Bailey (2016)

word     pronunciation
walking  W AO1 L K IH0 N
walking  W AO1 L K IH0 NG
walking  W AO1 L K IH0 NG G


Issue 1: Pronunciation Variants

Option 1: Include all options

Let the aligner figure out which option to use.

· Pros
  - You'll get more accurate timing.
· Cons
  - In choosing pronunciation variants, some aligners have a lower rate of agreement with human coders than human coders do with each other (Bailey 2015).
  - It can be tricky to identify which pronunciations are variants of each other.


Issue 1: Pronunciation Variants

Option 2: Only include one option

Only allow the aligner to choose one option

· Pros
  - It'll be easier to identify all instances of potential pronunciation variation.
· Cons
  - The timing information will be less accurate.
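The difference between the two options can be seen in a small Python sketch. It uses the three "walking" variants from the earlier slide. The helper names `restrict_to_one` and `words_with_variants`, and the choice of which variant to keep, are assumptions made for illustration only.

```python
# Option 1: the dictionary lists every variant and the aligner picks one.
ALL_VARIANTS = {
    "walking": [
        ["W", "AO1", "L", "K", "IH0", "N"],
        ["W", "AO1", "L", "K", "IH0", "NG"],
        ["W", "AO1", "L", "K", "IH0", "NG", "G"],
    ],
}


def restrict_to_one(lexicon, preferred_index=0):
    """Option 2: keep a single pronunciation per word.

    preferred_index is a hypothetical knob; which variant to keep is a
    research decision, not something this sketch can settle.
    """
    return {word: [prons[preferred_index]] for word, prons in lexicon.items()}


def words_with_variants(lexicon):
    """List the words for which the aligner would have a choice to make."""
    return sorted(word for word, prons in lexicon.items() if len(prons) > 1)


option_1 = ALL_VARIANTS
option_2 = restrict_to_one(ALL_VARIANTS, preferred_index=1)

print(words_with_variants(option_1))
print(option_2["walking"])
```

Under Option 2 every token of "walking" receives the same phone sequence, so all of them can be found by searching for that sequence, at the cost of boundaries that may not match what was actually said.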


Issue 2: Out of Dictionary Words

No matter how large a pronouncing dictionary you're working with, there will always be some words in free-flowing speech that aren't in the dictionary.

word       pronunciation
Fruehwald  F R UW1 W AO0 L D
hoagie     HH OW1 G IY0

These either need to be added to the dictionary when the aligner is run, or a separate piece of software needs to try to guess the pronunciation based on the spelling.
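Finding the missing words before running an aligner can be sketched in a few lines of Python. The function `out_of_dictionary`, the toy lexicon, and the example sentence are all hypothetical. The tokenization here (letters and apostrophes only) is a simplifying assumption.

```python
import re


def out_of_dictionary(transcript, lexicon):
    """Return the distinct transcript words that have no dictionary entry,
    in order of first appearance."""
    seen = set()
    missing = []
    for token in re.findall(r"[A-Za-z']+", transcript):
        word = token.lower()
        if word not in lexicon and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


# A toy lexicon; a real pronouncing dictionary would be far larger.
lexicon = {
    "well": [["W", "EH1", "L"]],
    "there": [["DH", "EH1", "R"]],
    "was": [["W", "AH0", "Z"]],
    "one": [["W", "AH1", "N"]],
    "time": [["T", "AY1", "M"]],
    "ate": [["EY1", "T"]],
    "a": [["AH0"]],
}

transcript = "Well, there was one time Fruehwald ate a hoagie."
missing = out_of_dictionary(transcript, lexicon)
print(missing)

# Adding the entries by hand, using the pronunciations from the slide.
lexicon["fruehwald"] = [["F", "R", "UW1", "W", "AO0", "L", "D"]]
lexicon["hoagie"] = [["HH", "OW1", "G", "IY0"]]
still_missing = out_of_dictionary(transcript, lexicon)
```

The alternative mentioned above, guessing a pronunciation from the spelling, needs a separate grapheme-to-phoneme tool and is not attempted here.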


Piece 2: An acoustic model

[w] = [Figure: the acoustic model's representation of the sound [w]]


Piece 3: A transcript

Outside of the original fieldwork, this is the most time-consuming and expensive part.


How it works

[Figure: the phones ð and ə, each divided into three states: beginning, middle, end]


if p(beginning) > p(middle), beginning
if p(beginning) < p(middle), middle
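The idea of assigning each stretch of audio to the beginning, middle, or end of a phone can be sketched as a small dynamic-programming search. The Python below is a toy, not the method of any particular aligner: it assumes the audio has already been cut into frames, that some acoustic model has supplied a log-likelihood for every frame under every state, and that the states must be visited in order with no skipping. Transition probabilities are ignored, and all names and numbers are invented for illustration.

```python
import math


def force_align(loglik):
    """Return the best state index for each frame.

    loglik[t][s] is the (hypothetical) log-likelihood of frame t under state s.
    The path starts in the first state, ends in the last, and at each frame
    either stays in the current state or advances to the next one.
    """
    n_frames, n_states = len(loglik), len(loglik[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    best = [[-math.inf] * n_states for _ in range(n_frames)]
    advanced = [[False] * n_states for _ in range(n_frames)]
    best[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1][s]
            advance = best[t - 1][s - 1] if s > 0 else -math.inf
            advanced[t][s] = advance > stay
            best[t][s] = max(stay, advance) + loglik[t][s]
    # Trace back from the last state at the last frame.
    s = n_states - 1
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][s]:
            s -= 1
        path.append(s)
    return path[::-1]


def to_intervals(path, labels):
    """Collapse a per-frame path into (label, first_frame, end_frame) spans,
    where end_frame is exclusive."""
    intervals = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            intervals.append((labels[path[start]], start, t))
            start = t
    return intervals


LABELS = ["ð-beginning", "ð-middle", "ð-end",
          "ə-beginning", "ə-middle", "ə-end"]

# Invented scores: each frame strongly prefers one state.
intended = [0, 0, 1, 2, 3, 3, 4, 5]
loglik = [[-1.0 if s == target else -5.0 for s in range(len(LABELS))]
          for target in intended]

path = force_align(loglik)
intervals = to_intervals(path, LABELS)
for label, first, end in intervals:
    print(label, first, end)
```

Multiplying the frame indices by the frame duration would turn these spans into time stamps. The alignment is "forced" in the sense that the order of states is fixed in advance by the transcript and the dictionary, so the only thing left to decide is where each boundary falls.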



Concerns about forced alignment

It'll make mistakes

· It is easier and faster (read: cheaper) to manually correct the output of automated systems than to create the annotations from scratch.
· Humans make mistakes too! And the kinds of mistakes automated systems make are usually systematic, so they're easier to identify and locate.


Concerns about forced alignment

It's a black box!


You are a black box


Concerns about forced alignment

Automation removes me from the data


Doing Forced Alignment at Home


FAVE

The FAVE suite is actually two pieces of software: an aligner and a Bayesian formant analyzer.

· Aligner based on p2fa, trained on 25 hours of US Supreme Court oral arguments.
· Fairly good time accuracy.


FAVE Benefits

· Developed assuming that multiple talkers in the audio was the default case.
· Developed in the open, trying to be as cross-platform friendly as possible.
· Written in Python, which is a very widely understood programming language.
· The system is relatively simple and flexible (although its acoustic models are not).
· The primary developer is friendly and responsive!


FAVE Cons

· Based on North American acoustic models, although MacKenzie & Turton have found it compares favorably to other aligners on British data.

        Median          Mean            Max
        Onset   Offset  Onset   Offset  Onset   Offset
FAVE    0.009   0.009   0.019   0.021   0.583   0.588
PLA     0.015   0.019   0.267   0.252   55.473  55.488
SPPAS   0.150   0.155   0.504   0.480   68.903  67.408
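Summary figures like the median, mean, and max in the comparison of FAVE, PLA, and SPPAS could be produced along the following lines. This is only a sketch under an assumption the slide does not state: that each figure summarizes the absolute difference between an aligner's boundary and a hand-placed one. The boundary times below are invented and are not MacKenzie & Turton's data.

```python
from statistics import mean, median


def boundary_errors(manual, automatic):
    """Absolute differences between paired boundary times."""
    if len(manual) != len(automatic):
        raise ValueError("boundary lists must be the same length")
    return [abs(m - a) for m, a in zip(manual, automatic)]


def summarize(errors):
    """Median, mean, and max of a list of boundary errors."""
    return {
        "median": median(errors),
        "mean": mean(errors),
        "max": max(errors),
    }


# Invented onset times: hand-corrected versus aligner output.
manual_onsets = [0.10, 0.50, 1.00, 1.50]
aligner_onsets = [0.11, 0.48, 1.00, 1.60]

onset_summary = summarize(boundary_errors(manual_onsets, aligner_onsets))
print(onset_summary)
```

Reporting the max alongside the median is useful because a single badly misplaced boundary barely moves the median but dominates the max and pulls up the mean.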


Recommended FAVE Usage

Download and install locally

Extensive documentation online, written assuming minimal familiarity with command-line interfaces.


What FAVE needs as input

· Audio
· Transcriptions
  - Partially time aligned
  - Multiple speakers annotated separately


Prosodylab Aligner

Developed at McGill University, Montreal

Pros & Cons

· Much the same as FAVE, but re-training of the acoustic models is built in.
· No streamlined facility yet for multiple talkers


Prosodylab Aligner

Recommended Usage

· Download & Install


webMAUS

Developed in association with CLARIN-D

· Web-based platform
  - Easy to use
  - Less easy to adapt to task-specific purposes
  - May be tricky if there are ethics restrictions on where and how your data is stored.
· No multiple talkers yet


webMAUS

Recommended Usage


DARLA

System developed at Dartmouth College

· Pros
  - Includes an automatic speech recognition system.
· Cons
  - So far, just a web-based service, with servers in the US


DARLA

Recommended Usage


The End