Towards large(r) -scale cross- linguistic study of speechpeople.linguistics.mcgill.ca/~morgan/nu2018_slides.pdf · Polyglot-Speech Corpus Tools Datasets (speech corpora, lexicons)

Towards large(r)-scale cross-linguistic study of speech:

prosodic case studies

Morgan SondereggerMcGill University

Northwestern University 6/1/2018

Introduction

• Speech is highly variable

• Structure and sources of variability– Central Qs in linguistics/speech sciences– Decades of work → much known– Scale: mostly handful of cues (VOT, formants),

languages (English), hand measurement

• Most of what we know is from fine-grained studies

Introduction

• Huge amount of annotated speech data exists– Corpora– Academic labs–Web

• Across– Languages– Speech styles– Time

Buckeye

Corpus of Spontaneous Japanese

TIMIT

Kiel corpora(German)

NIKL Korean Corpus

Switchboard

At least orthography + audio

English

Diachronic

Introduction• Large(r)-scale studies– corpora + (semi) automatic analysis + statistical modeling– Scale up– Less careful

• Today: two case studies

• Enabled by software facilitating large-scale studies• Claims:– New insights– Complementary to fine-grained studies

Polyglot-Speech Corpus Tools

Datasets (speech corpora, lexicons)

Databaseimport

enrichment

queryingSet of

linguistic objects

Data file (CSV)

export

• Implementation• Python module• Graphical interface

(under redevelopment)

McAuliffe et al. (2017) Interspeech

montrealcorpustools.github.io/speechcorpustools/

Polyglot-SCT: Goals

1. Scalable

2. Require minimal technical skill from user

3. Abstraction away from dataset format

4. Querying dataset without access to raw data

• Aim to address barriers to large-scale corpus studies

Polyglot-SCT: Import

• Speech, text datasets → queryable databases

Speech datasets

Globalphone French

TIMIT

Lexique

Subtlex-US

force-aligned textgrids

idiosyncraticformat

CSV

Lexicons

Datasets DatabasesFrench graph database

English graph DB

v i t

English SQL DB

French SQL DB

très vite

t r E

words

phones

utteranceil est très vite

vite: frequency = 100, POS = adj, …CSV importer

TIMITimporter

TextGrid importer

Linguistic objectsInannotation graph (Bird & Liberman, 2001)

Properties of objects(e.g. word frequencies, features)

Polyglot-SCT: query

• Find subset of linguistic objects

DatabasesFrench graph database

English graph DB

v i t

English SQL DB

French SQL DB

très vite

t r E

words

phones


vite: frequency = 100, POS = adj, …

Query

all utterance final words in French, English

first syllable of utt-final words in French

vowels before low-frequency word-final obstruents in English

Polyglot-SCT: export

• Properties of objects → spreadsheet– (→ R, Excel)

phonefirst vowel of utterance-final words in French, English

duration, identity, F0

word

speech rate

duration,identity,

# syllables

speakerID,

gender

Data file(CSV)

Study 1: intrinsic F0 effects

Introduction

• Where does sound change come from?• Most common:

phonetic effect → phonological pattern

• Ex: tones–Often (e.g. Chinese):

F0 perturbations → lexical tone

“phonetic precursors”

Introduction

But: most phonetic precursors never lead to sound change!

What kind of precursor can be a source of change?

• robust– Across speakers, languages

• … but variable– Individual differences, language-specific phonetics

(e.g. Hombert et al. 1979, Ohala 19XX; Baker et al., 2011; Labov, 1967; Kingston, 2007; Yu, 2013)

tension

Introduction

How robust/variable is each phonetic precursor, across languages and individuals?

Introduction

• Methodologically hard– Need: big and comparable data: many languages, speakers– small effects, big confounds

• Approach:cross-linguistic corpora + automatic analysis + statistical modeling

• Q1: can a “phonetic precursor” be detected in corpus data across languages & speakers?

Influences on vowel F0

Preceding C class(ba < pa)

Vowel height(ba < bi)

Vowel F0

Speaker(gender, age, ..)

Intonation

Tone

CF0VF0

Intrinsic F0 effects

Macro

Micro

(e.g. Chen, 2011; Connell 2002; Fischer-Jørgenson, 1990; Hanson, 2009; Hoole & Honda, 2011; House & Fairbanks, 1953; Kingston & Diehl, 1994; Kirby & Ladd, 2016; Kingston, 2007; Ladd & Silverman, 1994; Meyer, 1896; Whalen & Levitt, 1995)

Intrinsic F0• Huge literature– primarily: small n, lab speech– focus: mechanism (automatic vs. controlled)

Across languages:• CF0 – “voiced”<“voiceless”:

most languages • VF0– [-high] < [+high] :

(near-)universal

• Effect size: variable– Tonal ⇒ smaller effect?

Q2: How much variability in IF0 across 14 languages?

Intrinsic F0

• Strongly affected by:– “Intonation”– Gender (VF0)…

• Interspeaker variability:–Often noted

• Relationship to sound change:– CF0 ⇒ sound change (“tonogenesis”)– VF0 ⇒ sound change–Why?

Q3: How much variability in IF0 across speakers?

Non-tonal Tonal

English Russian Hausa

French Polish Mandarin

German Spanish Thai

Korean Turkish Vietnamese

Pitch-accent

CroatianSwedish

Datasets

• Read sentence corpora• ~20 hours each• Force-aligned

GlobalPhone (Schulz et al., 2013), Librispeech (Panyatov et al., 2015)

Montreal Forced Aligner: trainable for different

languages

• “Utterance-initial” C V

• vowel F0 (Praat)– F0 histogram → speaker min, max → re-extract F0

• Controls : info about– Speaker– Utterance– Context– Word

Datasets

obstruent /a/, /i/, /u/

Polygot-Speech Corpus

Tools

Datasets• Data cleaning: minimize F0 errors, reduced vowels

• Exclusions:– “bad” speakers– “bad” tokens (e.g. too short)

• Data per language:– 1.9-9.5k tokens (~2000)– ~100 speakers

CF0: Analysis

C1 V XResponse: mean F0 in first 50 ms

consonant“voicing”*

* Ex: French p/b, Mandarin p/ph

• One linear mixed effects model / language• Main terms:

fixed effect by-speaker random slope

overall effect + interspeaker variability

CF0: analysis

• Other terms: extra slides

• Conservative model structure

● ● ● ● ● ● ● ● ●● ●

●

●●

0

1

2

MAN THA VIE RUS SWE CRO SPA KOR TUR GER HAU POL FRE ENGLanguage

Est.

CF0

effe

ct (s

t)CF0: across languages

• “most voiceless” – “most voiced” effect:

• Robust across languages• Variable effect size– Non-tonal ⇒ larger effect

Mandarin: 0.32 stp/ph diff.(p=0.08) English:

1.92 st“b”/”p” diff. (p<0.001) tonal pitch accent non-tonal

0

1

2

3

4

MAN THA VIE RUS SWE CRO SPA KOR TUR GER HAU POL FRE ENGLanguage

Est.

spea

ker C

F0 e

ffect

s (s

t)CF0: across speakers

• Predicted effects for 95% of individuals:

• Common: large interspeaker variability

English: ~0-3.5 st

Mandarin:~0-0.73 st

VF0: Analysis

C1 V XResponse: mean F0

Vowel identityHeight (a vs. i/u)+ i vs. u

• One linear mixed effects model / language• Main terms:

fixed effect by-speaker random slope

overall effect + interspeaker variability

VF0: analysis

• + various controls


• High – low vowel effect:

• mostly robust across languages• variable effect size– Non-tonal ⇒ generally larger effect

VF0: across languages

●● ● ● ●

● ● ●● ● ● ● ●

●

0.0

0.5

1.0

1.5

2.0

2.5

VIE RUS THA CRO ENG HAU TUR MAN FRE POL KOR SPA SWE GERLanguage

Est.

VF0

effe

ct (s

t)

Vietnamese: 0.39 st(p=0.26)

German: 2.13 st(p<0.001)

tonal pitch accent non-tonal

Average effect across gender, tone, etc.

• Sentences vs. lab speech?• (or artifact of methodology?)

VF0: across languages

●● ● ● ●

● ● ●● ● ● ● ●

●

0.0

0.5

1.0

1.5

2.0

2.5

VIE RUS THA CRO ENG HAU TUR MAN FRE POL KOR SPA SWE GERLanguage

Est.

VF0

effe

ct (s

t)

Whalen & Levitt (1995) average over 31 langs:

1.65 st

−1

0

1

2

3

VIE CRO RUS THA ENG HAU TUR MAN FRE SWE POL KOR SPA GERLanguage

Est.

spea

ker V

F0 e

ffect

s (s

t)VF0: across speakers

• Predicted effects for 95% of individuals

• Common: large interspeaker variability

opposite effectnull effect

big effect

Discussion

• IF0 effects can be detected using– Corpus data– Fully automatic analysis– Basic statistical controls– n =~2-4k

• Not obvious!

• Demonstrates feasibility of large-scale studies of phonetic precursors (involving F0)

Discussion

• Robust group-level IF0 effects across languages– same direction– “universality” (Whalen & Levitt, 1995)

• Very different effect sizes–One reason: tonal/pitch accent language ⇒ smaller IF0 more likely(hypothesized for VF0: Connell 2002)

• Fits with automatic + controlled mechanism(c.f. Hoole & Honda, 2011)

Discussion• Large interspeaker variability in IF0 magnitude

common, within language–⇒ there are some speakers with null/large effects– Still, most speakers show effect in same direction

• Overall: IF0 effects– robust across languages– variable across speakers

• Both important for sound change

• Related to actuation: why sound changes from IF0 possible, but rare? (Kingston, 2007)

Study 2: duration compression

effects

With: Michael McAuliffe Michael Wagner

Introduction

• Major aspect of speech timing: longer linguistic unit ⇒ compressed sub-parts

• Ex: stick, sticky, stickiness (Lehiste, 1972)

• cover term: duration compression effects

Introduction

• Menzerath’s Law (Menzerath, 1928, 1954)

– ‘The longer the whole, the shorter the parts’– Domain-general (not just speech)– longer words ⇒ shorter average syllable duration

– phonetic Menzerath effect

• Related: Polysyllabic shortening (Lehiste, 1972)

– Syllable/V durations shorter in bigger words/prosodic domains

Introduction

• Extensive work on DCEs – individual languages, controlled settings

• Unclear:– Are DCEs universal?

(Siddins et al., 2014; Suomi, 2007; White & Turk, 2010)

Q1: can we observe duration compression effects across typologically-diverse languages?

• Today: test for phonetic Menzerath effect

Introduction

• Unclear: are DCEs just reducible to other factors?

• Fewer segments per second:– Speech rate– Longer words ⇒ fewer segments/syllable

(“Structural Menzerath effect”)

• Prosodic effects on syllable duration:– Accent– Initial position– Final position

(e.g. Sluijter, 1995; Fougeron & Keating, 1997; Oller, 1973; Klatt, 1973, 1975; White & Turk, 2010; Windmann et al., 2015)

Introduction

• Q2: can DCEs be reduced to fewer segments/second?

• Q3: can DCEs be reduced to a prosodic lengthening effect?

Datasets

CroatianSwedish

• Read sentence corpora• ~20 hours each• Force-aligned

GlobalPhone (Schulz et al., 2013), TIMIT (Garofolo et al., 1993)Montreal Forced

Aligner

EnglishGermanRussianSwahiliUkrainianBulgarian

MandarinVietnameseThai

HausaPolishPortugueseSpanishSwahiliTurkish

FrenchKorean

Diverse (word) prosody

Datasets

• “Utterance” final “words”

• Measures–Word length (# syllables)– Mean syllable duration

• Controls– Speech rate– Expected syllable duration– Speaker, word ID– etc.

Polygot-Speech Corpus

Tools

Given segmental content,for individual speaker (Ernestus, Gahl)

Datasets

• Pruning–Words above language-specific cutoff length

Results: Mean syllable duration

200

300

400

500

1 2 3 4 5 6 7 8Word length (# syllables)

Mea

n sy

llabl

e du

ratio

n (m

sec)

French

German

Hausa

Korean

Mandarin

Polish

Portuguese

Russian

Spanish

Swahili

Swedish

Thai

Clear effect across languages

Analysis 1: controlling for segments/second

• Does mean syllable duration ~ word length, beyond effects of– Speech rate– Expected syllable duration–Who’s talking– Particular words– Utterance length

?

Enhanced “# segments in syllable”

Analysis 1

• Menzerath-Altmann law:

• Our case:

• Linear mixed-effects model

y = axbecx

Mean syll duration # syllables

log(y) = �1x + �2 log(x) + ...

Effect of speech rate, expected syllduration, etc.

Phonetic ME alone

Results

Thai Turkish Ukrainian Vietnamese

Portuguese Russian Spanish Swahili Swedish

German Hausa Korean Mandarin Polish

Bulgarian Croatian Czech English French

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

2 4 6 8

200300400500

200300400500

200300400500

200300400500

Word length (# syllables)

Mea

n sy

llabl

e du

ratio

n in

wor

d (s

)

Solid: phonetic Menzerath effect

Dashed: empirical(withoutControls)

Results: analysis 1

• Clear phonetic Menzerath effect across languages

• Q2: are DCEs reducible to segments/second?– No

• Empirical relationship “steeper”:– Phonetic M effect (compressed syllables), plus– Structural M effect (compressed # segments)

Analysis 2: prosodic lengthening effects

• Observed effect due to– Initial strengthening– Final lengthening– Accentual lengthening –… across languages?

• Check:– Same analysis for initial syllable duration only, etc.

(White & Turk, 2010; Windmann et al., 2015)

Results: initial syllables





2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

2 4 6 8

100

200300400500

100

200300400500

100

200300400500

100

200300400500


Initi

al s

ylla

ble

dura

tion

(ms)

Some effect across languages






2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

2 4 6 8

100

200300400500

100

200300400500

100

200300400500

100

200300400500


Initi

al s

ylla

ble

dura

tion

(ms)

~ initial prominence Final prominence

Penult prominence

Lexical stress






2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

2 4 6 8

100

200300400500

100

200300400500

100

200300400500

100

200300400500


Initi

al s

ylla

ble

dura

tion

(ms)

complications


• Consistent compression effect– (at least: 1-3 syllables)

• Very different prosodic systems

• Can’t be just– Accentual lengthening– Initial strengthening– PSS on accented syllables only

Results: final syllables





2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

2 4 6 8

200300400500

200300400500

200300400500

200300400500


Fina

l syl

labl

e du

ratio

n (m

s)

No clear phonetic effect across languages

Results: final syllables

• No consistent phonetic compression effect– = phonetic ME

• Overridden by other factors?– final lengthening, language-specific prosody

• Aside: Much of empirical effect is actually due to fewer segments/syllable– = structural ME

Discussion

1. Duration compression effects may be universal– At least phonetic Menzerath effect

2. DCEs not reducible to (some) other factors

• Not obvious!

• (1)+(2) ⇒ DCEs reflect something deep about processing/planning– Mechanism?

Thanks

• Michael McAuliffe, Elias Stengel-Eskin, Arlie Coles

• Comments: James Kirby, Simon King, Montreal Language Modeling Lab members

• Funding:

Questions

Extra slides

Barriers to large-scale corpus studies

• Speech datasets:– Large– Complex– Diverse formats

• Access to many speech datasets– Costly or ethically restricted

• Result: requires lots of specialized code, $$, effort

CF0: analysis

• Other terms– “Voicing” interactions: gender , (tone , V length)– Controls:• Speaker gender, mean F0• Utterance length• V identity (incl. height)• Speaker, word, preceding/following phone• (Tone, V length)


VF0: analysis

• Other terms– V height interactions: gender , (tone, V length)– Controls:• Speaker gender, mean F0• C1 “voicing”• Utterance length• V identity• Speaker, word, preceding/following phone• (Tone, V length)


Extra: VF0 vs. CF0

• Asymmetry between IF0 effects w.r.t. sound change: – CF0: many attested changes– VF0: ~none

• Why? – VF0/CF0 magnitude roughly similar? (Hombert et al., 1979)

– Perhaps perception is different (Hombert, 1979)

– VF0 effects show more variability? (Kingston, 2011)

• Q4: Relative magnitude, variability of CF0 & VF0 across languages?

●● ● ●

● ●● ● ●

●●

●

●

●

0

1

2

VIE THA RUS CRO MAN TUR SWE KOR SPA HAU POL ENG FRE GERLanguage

Est.

VF0/

CF0

effe

ct (s

t)

type● VF0

CF0

VF0 vs. CF0: effect size

• No clear pattern• CF0, VF0 of ~comparable size

CF0 > VF0

CF0 < VF0

−1

0

1

2

3

4

VIE THA RUS CRO MAN TUR SWE KOR SPA HAU POL ENG FRE GERLanguage

Est.

spea

ker V

F0/C

F0 e

ffect

s (s

t)

typeVF0

CF0

VF0 vs. CF0: speaker variability

CF0 >> VF0VF0 >> CF0

Minority of speakers show reverse effects

• Overall: no obvious pattern• But: some evidence that VF0 “more variable” than CF0

Mean syllable duration





1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

200300400500

200300400500

200300400500

200300400500


Mea

n sy

llabl

e du

ratio

n (m

sec)

SCT: representation & enrichment• DBs: contains properties of objects, relationships

between them:– Positional:

• Ex: Utterance position– Hierarchical

• Ex: containing word– Temporal

• Begin, end, duration

• Enrich with additional information:– Suprasegmental: pauses, speech rate, ..– Acoustic: F0, formants..

French graph database

v i t

très vite

t r E

words

phones


utterance final

t = 5 10 12 15 20 25 …. msec

Documents

Towards large(r) -scale cross- linguistic study of speechpeople.linguistics.mcgill.ca/~morgan/nu2018_slides.pdf · Polyglot-Speech Corpus Tools Datasets (speech corpora, lexicons)