From Here to Utility
Melding Phonetic Insight With Speech Technology

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704

http://www.icsi.berkeley.edu/~steveng
steveng@icsi.berkeley.edu

Acknowledgements and Thanks

Automatic Feature Classification and Analysis
Joy Hollenback, Shawn Chang, Leah Hitchcock

Research Funding
U.S. National Science Foundation
U.S. Department of Defense

Road Map of the Presentation

What is Truth?
• The story of Rashomon, a film by Akira Kurosawa
• Its application to spoken language

The Varieties of Scientific Experience
• The Fundamental Duality
• The Eternal Pentangle
• The Inner Triangle

The Importance of Being Phonetically Annotated
• A Corpus-Centric Perspective on Spoken Language
• Phonetic Annotation of Spontaneous American English Discourse

Phonetic Dissection of Automatic Speech Recognition Systems
• Stress Accent and Word Error Rate
• Syllable Structure and Word Error Rate

The Relation Between Stress Accent and Vocalic Identity
• The Relation Between Segmental Duration and Vowel Height
• Durational Differences Between Stressed and Unstressed Vowels
• The Relation Between Vowel Height and Stress Accent

Spoken Language – What is Truth?
• Fundamental Questions Remain Unanswered

Part One

WHAT IS TRUTH?

The Story of Rashomon

Its Moral for the Study of Spoken Language

Rashomon – What is Truth?

It is twelfth-century Japan, and a nobleman has died ….

This we learn from a conversation between a woodcutter, a priest and a peasant under a gate in the ancient city of Kyoto ….

The woodcutter and the priest have just come from a judicial inquest into the death, and are telling the peasant what they have heard

The woodcutter testified at the inquest, having witnessed the sequence of events resulting in the nobleman’s death

The story begins with the capture of the notorious bandit, Tajomaru, who is the accused in the nobleman’s death ….

The nobleman and his wife had been traveling through the forest ….

When, all of a sudden, they are confronted by Tajomaru, who halts their progress ….

The nobleman and bandit go off alone into a thicket, where the former winds up being subdued by the latter

The nobleman is tied to a tree and forced to watch as his wife is violated by the bandit

The wife, at first, resists ….

But eventually drops the dagger and submits

So far, all parties concerned agree (roughly) as to the course of events, but from this point on the picture becomes murky, with each participant telling a somewhat different version of the story

In two versions (Tajomaru’s and the woodcutter’s) the wife insists that her husband and the bandit fight for her honor. The nobleman’s death results from losing the duel.

In the wife’s version, the bandit departs, with the husband still tied to the tree. The husband proceeds to taunt his wife, telling her how ashamed he is – of her!

She cuts the rope binding her husband to the tree and asks to be killed! The wife promptly faints and when she awakens, finds the dagger in the chest of her (now very dead) husband

In yet another version (the husband’s through a spirit medium) his wife betrays him and tries to convince the bandit to kill the husband

However, the bandit is repulsed by this suggestion and quickly departs ….

The nobleman, still tied to the tree, picks up the dagger and plunges it into his chest, thus taking his own life

Some time later the (now very dead) nobleman is aware of someone (it is not clear who) removing the dagger from his chest

The film ends as the priest, woodcutter and peasant mull over the significance of the disparate accounts of the nobleman’s death, seeking some kernel of truth in the morass of ambiguity and uncertainty

It is unclear whether ANY witness has been entirely truthful (probably not)

The story of Rashomon is cited often in philosophical discussions of “truth”

As nothing is known (or knowable) with absolute certainty, all knowledge is relative (and hence ephemeral)

The concept of truth is a chimera and therefore unworthy of pursuit

Yet, there is an alternative interpretation, one that questions not the concept of truth itself, but rather the capacity of its assimilation through a single vantage point

Perhaps the “true” message of Rashomon is that deep and everlasting knowledge can only be gained through exposure to a variety of perspectives, no single source providing sufficient depth and detail to comprehend a situation as complex (and as tragic) as the murder of a man

Spoken Language – What is Truth?

Can an intellectual domain as complex as spoken language be fully understood through the testimony of a single perspective?

Or must orthogonal varieties of evidence be sought with which to reconstruct the “truth”?

How does true insight proceed from “objective” study of spoken language?

Is it possible to fully comprehend the multivocal nature of a scientific domain from the sole vantage point of a laboratory?

Or does the spirit of Rashomon compel us to seek testimony from other sources in the pursuit of objective knowledge?

Part Two

THE VARIETIES OF SCIENTIFIC EXPERIENCE

The Fundamental Duality

The Eternal Pentangle

The Inner Triangle

The Fundamental Duality

Technology and science appear to oppose each other in perspective

• Technology is concerned with what works (and can sell)
• Science is concerned with what is (and can be published)

The Art of the Workable
The Art of the Sellable
The Art of the Soluble
The Art of the Publishable

The Fundamental Duality

There is an essential “tension” between Science and Technology

• Science is often deemed “pure”
• Technology is usually perceived as “applied” (and therefore not quite as pure)

The Eternal Pentangle

Speech Research Provides an Excellent Example of the Tension between Science and Technology

• “Phonetic insight” is on the side of the angels (a.k.a. “science”)
• While “speech technology” is on the side of the apes (a.k.a. “the real world”)

Phonetic Insight
The Real World

The Inner Triangle

The Inner Triangle of the Eternal Pentangle Can Potentially Shed Light on this Philosophical (and Methodological) Conundrum

• Manual annotation provides the empirical foundation with which to train machine algorithms
• Statistical characterization of the annotated material provides the basis for structuring the machine learning regime
• Machine learning provides a method for evaluating phonetic knowledge
• Phonetic knowledge can be used to efficiently train machine algorithms
• Statistical characterization can serve as a “reality check” on phonetic knowledge

The Inner Triangle

Thus, the three apices of the Inner Triangle feed into each other and provide insight and perspective difficult to achieve from a single vantage point

• In a manner analogous to Rashomon, insight may be gained from this multi-dimensional perspective that deepens our knowledge of spoken language
• And thus enables the development of superior technology that truly works in the “real world”
• The development of sterling technology provides (in principle) a means to fund further basic technology-driven research
• And that, in turn, results in further technological advances
• And so on (forever after)

Part Three

THE IMPORTANCE OF BEING PHONETICALLY ANNOTATED

A Corpus-Centric Perspective on Spoken Language

Phonetic Annotation of Spontaneous American English Discourse

Phonetic Annotation is Useful, Because …

Many Properties of Spontaneous Spoken Language Differ from Those of Laboratory and Citation Speech
• There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization

Such Corpora Provide Empirical Material for the Study of Spoken Language
• Such data provide an important basis for scientific insight and understanding
• And facilitate development of new models of spoken language

They Also Provide Training Material for Technology Applications in:
• Automatic speech recognition, particularly pronunciation models
• Speech synthesis, in pronunciation models as well as in
• Cross-linguistic transfer of technology algorithms, etc.

They Promote Development of NOVEL Algorithms for Speech Technology
• Including pronunciation models and lexical representations for automatic speech recognition and speech synthesis, as well as
• Multi-tier representations of spoken language

All of Which Can be Used for Gaining Further Insight into Spoken Language

Corpus-Centric View of Spoken Language

Each Tier of Linguistic Organization Provides a Unique Perspective
• However, integrating the annotated material across levels is tricky ….
• And a lot of work!!

Let’s Focus on a Specific Aspect of Linguistic Organization in Order to Exemplify the Concepts Involved
• In order to do so, we first consider the nature of the transcription material used

Phonetic Transcription of Spontaneous EnglishTelephone Dialogues of 5-10 minutes duration, from the SWITCHBOARD

CORPUS, have been phonetically annotated (labeled and segmented)

Most of this Material has been Manually Annotated     4 hours labeled at the phone level and segmented at the syllabic level 1 hour labeled and segmented at the phonetic-segment levelThe remaining material has been segmented at the phonetic-segment level using

automatic methods45 minutes of stress-accent-labeled materialAn additional four hours of material automatically labeled with respect to accent

(this latter material not used in the current analysis, but will be available soon)  

There is a Lot of Diversity in the Material Transcribed

Phonetic Transcription of Spontaneous EnglishTelephone Dialogues of 5-10 minutes duration, from the SWITCHBOARD

CORPUS, have been phonetically annotated (labeled and segmented)

Most of this Material has been Manually Annotated     4 hours labeled at the phone level and segmented at the syllabic level 1 hour labeled and segmented at the phonetic-segment levelThe remaining material has been segmented at the phonetic-segment level using

automatic methods45 minutes of stress-accent-labeled materialAn additional four hours of material automatically labeled with respect to accent

(this latter material not used in the current analysis, but will be available soon)  

There is a Lot of Diversity in the Material TranscribedSpans speech of both genders (ca. 50/50%), reflecting a wide range of American

dialectal variation, speaking rate and voice quality

Phonetic Transcription of Spontaneous English

Telephone Dialogues of 5-10 minutes duration, from the SWITCHBOARD CORPUS, have been phonetically annotated (labeled and segmented)

Most of this Material has been Manually Annotated
• 4 hours labeled at the phone level and segmented at the syllabic level
• 1 hour labeled and segmented at the phonetic-segment level
• The remaining material has been segmented at the phonetic-segment level using automatic methods
• 45 minutes of stress-accent-labeled material
• An additional four hours of material automatically labeled with respect to accent (this latter material not used in the current analysis, but will be available soon)

There is a Lot of Diversity in the Material Transcribed
• Spans speech of both genders (ca. 50/50%), reflecting a wide range of American dialectal variation, speaking rate and voice quality

Transcription System
• A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd
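To make the idea of a base symbol plus diacritics concrete, here is a minimal Python sketch that splits a label into its two parts. It assumes, purely for illustration, that each diacritic is attached to the base symbol as an underscore-delimited suffix (for example `t_gl`); the corpus files may encode this differently. The function name `split_label` and the example labels are hypothetical.

```python
# Diacritic suffixes listed on the slide (the slide says "such as",
# so this set is probably not exhaustive).
LISTED_DIACRITICS = {"gl", "cr", "fr", "n", "vl", "vd"}


def split_label(label: str) -> tuple[str, list[str]]:
    """Split a label like 't_gl' into its base symbol and diacritics.

    Assumption: diacritics follow the base symbol, each introduced
    by an underscore. This is a guess at the format, not a spec.
    """
    base, *diacritics = label.split("_")
    return base, diacritics


def unlisted_diacritics(label: str) -> list[str]:
    """Return any diacritics on the label that the slide does not list."""
    _, diacritics = split_label(label)
    return [d for d in diacritics if d not in LISTED_DIACRITICS]


# Hypothetical labels, for illustration only.
for example in ("t_gl", "ae", "ih_n_cr"):
    print(example, split_label(example))
```

Keeping the base symbol separate from its diacritics lets the same annotation serve analyses at the phonetic-segment level (base only) and at a finer level (base plus diacritics).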

Phonetic Transcription of Spontaneous English

The Data are Available at ….

http://www.icsi.berkeley.edu/real/stp

This Means there is Phonetically Validated Material at the Level of the:

WORD, SYLLABLE, PHONETIC SEGMENT, ARTICULATORY-ACOUSTIC FEATURE and STRESS ACCENT

(as well as at the utterance level)

The Eternal Pentangle (Redux)

Let’s re-examine the eternal triangle from the perspective of manual annotation for three linguistic tiers ….

Phonetic Transcription

How was the Labeling and Segmentation Performed?

VERY carefully …. by UC-Berkeley linguistics students

Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (automatic estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations

Additionally, automatic segmentation and labeling of articulatory manner was used as a guide for phonetic labeling and segmentation in the current year

Phonetic Transcription

In addition to phonetic labels and syllabic segmentation, 45 minutes of this material was labeled with respect to stress accent for each syllable

Three levels of stress were marked - FULLY Stressed, Unstressed and Intermediate Stress

Phonetic Transcription

Such material can be used to perform statistical characterization of spontaneous speech as well as train machine algorithms to label and segment additional material

In addition, the transcription material can be used to evaluate the performance of automatic speech recognition systems

Let’s first consider how this transcription can be used for ASR evaluation

We’ll focus on stress-accent, but then relate this to syllable structure

Part Four

PHONETIC DISSECTION OF AUTOMATIC SPEECH RECOGNITION SYSTEMS

A Case Study

Stress Accent and Word Error Rate

Syllable Structure and Word Error Rate

In Collaboration with Shawn Chang

The Eternal Pentangle (Redux)

Let’s re-examine the eternal triangle from the perspective of automatic speech recognition ….

Generation of Evaluation Data - 1

A complex sequence of data formatting was required to place the speech recognition data of 8 separate sites into register with the transcription material (and vice versa)

Generation of Evaluation Data - 2

But, let’s not sweat the details during this presentation

Interested parties may consult the relevant papers (Greenberg, Hollenback and Chang, 2000; Greenberg and Chang, 2000) at www.icsi.berkeley.edu/~steveng

Generation of Evaluation Data - 3

Recognition performance was analyzed with reference to ca. 50 separate acoustic, linguistic and structural parameters

Summary of Corpus Acoustic Properties

• LEXICAL PROPERTIES
– Lexical Identity
– Unigram Frequency
– Number of Syllables in Word
– Number of Phones in Word
– Word Duration
– Speaking Rate
– Prosodic Prominence
– Energy Level
– Lexical Compounds
– Non-Words
– Word Position in Utterance

• SYLLABLE PROPERTIES
– Syllable Structure
– Syllable Duration
– Syllable Energy
– Prosodic Prominence
– Prosodic Context

• PHONE PROPERTIES
– Phonetic Identity
– Phone Frequency
– Position within the Word
– Position within the Syllable
– Phone Duration
– Speaking Rate
– Phonetic Context
– Contiguous Phones Correct
– Contiguous Phones Wrong
– Phone Segmentation
– Articulatory Features
– Articulatory Feature Distance
– Phone Confusion Matrices

• OTHER PROPERTIES
– Speaker (Dialect, Gender)
– Utterance Difficulty
– Utterance Energy
– Utterance Duration

What is (usually) Meant by Stress Accent?

Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal

The pattern of variation over a sequence of SYLLABLES pertaining to: syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time

But, the plot thickens (considerably) …. as we’ll shortly see

Stress Accent and Word Error Rate

The effect of stress accent is most discernible among word-deletion errors

There is no essential relation between accent and word-substitution errors

[Figure: word error rate by stress-accent level (Unstressed, Intermediate Stress, Fully Stressed). Data are averaged across all eight sites]
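A tabulation of this kind can be sketched in a few lines of Python. The sketch below is not the procedure used in the study; it assumes that each reference word has already been aligned with the recognizer output and tagged with a stress-accent level and an outcome. The record layout, the level names and the toy data are all invented for illustration.

```python
from collections import Counter, defaultdict

ERROR_TYPES = ("deletion", "substitution")


def error_rates_by_stress(records):
    """Return, for each stress level, the proportion of words deleted
    and the proportion substituted.

    records: iterable of (stress_level, outcome) pairs, where outcome
    is 'correct', 'deletion' or 'substitution' (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for stress, outcome in records:
        counts[stress][outcome] += 1

    rates = {}
    for stress, tally in counts.items():
        total = sum(tally.values())
        # A Counter returns 0 for outcomes that never occurred.
        rates[stress] = {err: tally[err] / total for err in ERROR_TYPES}
    return rates


# Invented toy records; they do not reproduce the reported results.
toy_records = [
    ("unstressed", "deletion"),
    ("unstressed", "correct"),
    ("intermediate", "correct"),
    ("full", "substitution"),
    ("full", "correct"),
]
for level, rates in error_rates_by_stress(toy_records).items():
    print(level, rates)
```

Averaging across recognition sites would then amount to computing these proportions per site and taking the mean for each stress level.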

Syllable Structure and Word Error Rate

Let’s now consider syllable structure with respect to ASR word error

There is a certain similarity with the pattern observed for stress accent ….

Syllable Structure and Word Error Rate

Vowel-initial forms show the greatest error, particularly for word deletions

Polysyllabic forms manifest the lowest error, especially for word deletions

The vowel-initial forms tend to be unstressed, so ….

Perhaps the similarity in pattern is not so surprising after all

[Figure: word error rate by syllable structure (C = Consonant, V = Vowel). Data are averaged across all eight sites]
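The C/V coding of syllable structure can be illustrated with a short Python sketch that maps a syllable's phone labels to a pattern such as `CVC`. The vowel inventory below is the standard Arpabet vowel set; the corpus uses a variant of Arpabet, so its inventory may differ. Syllabic consonants are treated as consonants here, which is a simplification, and the function names are hypothetical.

```python
# Standard Arpabet vowel symbols (assumption: the corpus variant
# may add or rename some of these).
ARPABET_VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ax", "axr", "ay", "eh", "er",
    "ey", "ih", "ix", "iy", "ow", "oy", "uh", "uw", "ux",
}


def cv_pattern(phones):
    """Map a syllable's phone labels to a C/V string, e.g. 'CVC'.

    Any underscore-delimited suffix (assumed to be a diacritic)
    is ignored when deciding between consonant and vowel.
    """
    pattern = []
    for phone in phones:
        base = phone.split("_")[0].lower()
        pattern.append("V" if base in ARPABET_VOWELS else "C")
    return "".join(pattern)


def is_vowel_initial(phones):
    """True if the syllable begins with a vowel."""
    return cv_pattern(phones).startswith("V")


# Hypothetical transcriptions, for illustration only.
print(cv_pattern(["ae", "n", "d"]))            # VCC
print(cv_pattern(["dh", "ax"]))                # CV
print(cv_pattern(["s", "t", "r", "iy", "t"]))  # CCCVC
print(is_vowel_initial(["ae", "n", "d"]))      # True
```

With a pattern assigned to every syllable, the word error tallies can be grouped by pattern in the same way as they are grouped by stress-accent level.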

The Plot … So Far

The Proportion of Word (Deletion) Errors is Much Higher Among Unstressed Syllables (Relative to Fully and even Partially Stressed Syllables)

The Proportion of Word (Deletion) Errors is Much Higher Among Syllables that Begin with a Vowel
• The exception being words composed of more than a single syllable

Polysyllabic Words Exhibit the Lowest Word Deletion Error Rate
• Such words usually have at least one syllable that is highly stressed
• Suggesting that deletion errors reflect the general stress pattern within the word

Syllable Structure and Stress Accent are not Salient Properties in (most) ASR Systems
• As ASR systems know about phones and words, but not syllables and stress (at least in American English)

Could There Therefore be a Link Between Syllable Structure, Stress Accent and Some Other Linguistic Properties that ASR Systems “Know About”?

Let’s Find Out ….

Part Five

The Relation Between Stress Accent and Vocalic Identity

Yet Another Case Study

The Relation Between Segmental Duration and Vowel Height

Durational Differences Between Stressed and Unstressed Vowels

The Relation Between Vowel Height and Stress Accent

In Collaboration with Leah Hitchcock

The Eternal Pentangle (Redux)

Let’s re-examine the eternal triangle from the perspective of statistical characterization of the annotated Switchboard corpus

These data were originally collected to improve the quality of speech recognition systems, but are now being pressed into service for SCIENCE

But first ….

A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a function primarily of two articulatory properties – both related to the motion of the tongue

• The front-back plane is most closely associated with the second formant frequency (or more precisely F2 - F1) and the volume of the front-cavity resonance

• The height parameter is closely linked to the frequency of F1

In the classic vowel “triangle,” segments are positioned in terms of the tongue positions associated with their production, as follows:

[Figure: the classic vowel “triangle”]

Spatial Patterning of Duration and Amplitude

Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data

The duration will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain a constant throughout the plots to follow)

The y-axis will serve as the dependent measure expressed in terms of duration or the proportion of fully stressed (or unstressed) nuclei

Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is systematic with respect to duration

Low vowels, be they diphthongs or monophthongs, are longer (on average) than high vowels

Thus, duration appears to be highly correlated with vowel height

But … the situation is a little more complicated than first appearances would suggest

[Figure panels: All nuclei, Diphthongs, Monophthongs]

Durational Differences - Stressed/Unstressed

There is a large dynamic range in duration between stressed and unstressed nuclei

Moreover, diphthongs and tense, low monophthongs tend to exhibit a larger dynamic range than the lax monophthongs

[Figure annotation: Lax monophthongs]
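One way to quantify such a dynamic range is to take, for each vowel, the difference between the mean duration of its fully stressed tokens and the mean duration of its unstressed tokens. The Python sketch below does this under stated assumptions: "dynamic range" is read here as a difference of means (a ratio would be an equally plausible reading), and the token layout, the stress labels `full` and `none`, and the durations are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def duration_range(tokens):
    """Return {vowel: mean stressed duration - mean unstressed duration}.

    tokens: iterable of (vowel, stress, duration_ms) triples, where
    stress is 'full' or 'none' (hypothetical layout; tokens with any
    other stress label are ignored).
    """
    durations = defaultdict(list)
    for vowel, stress, duration_ms in tokens:
        if stress in ("full", "none"):
            durations[(vowel, stress)].append(duration_ms)

    ranges = {}
    for vowel in {v for v, _ in durations}:
        stressed = durations.get((vowel, "full"))
        unstressed = durations.get((vowel, "none"))
        # A range is only defined when both conditions were observed.
        if stressed and unstressed:
            ranges[vowel] = mean(stressed) - mean(unstressed)
    return ranges


# Invented durations in milliseconds, not measurements from the corpus.
toy_tokens = [
    ("ay", "full", 200), ("ay", "full", 180), ("ay", "none", 100),
    ("ih", "full", 80), ("ih", "none", 60),
    ("uw", "full", 90),
]
print(duration_range(toy_tokens))
```

Vowels observed in only one stress condition are left out rather than assigned a range of zero, so that a missing comparison is not mistaken for an absence of contrast.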

Vocalic Identity Among Unstressed Nuclei

The high, lax monophthongs are almost always unstressed

The low vowels, be they monophthongs or diphthongs, are rarely unstressed

The high diphthongs and high/mid, tense monophthongs occupy an intermediate position

Vocalic Identity Among Fully Stressed Nuclei

The high vowels are rarely fully stressed

The low vowels, be they monophthongs or diphthongs, are far more likely to be fully stressed

An intermediate degree of stress accounts for the other vocalic instances (but will not be addressed here)

Is It Stress? Vocalic Identity? Or What?

Duration Appears to Play An Important (but certainly not exclusive) Role in Stress Accent for Spontaneous American English Discourse

For any given vocalic class, stressed segments are longer (on average)The durational disparity is most pronounced among the low vowels and the

diphthongs

Low Vowels Tend to be Much Longer in Duration than High VowelsThis is the case even for diphthongs

Low Vowels are Rarely without Some Measure of Stress AccentThis is true for monophthongs as well as diphthongs

High Vowels are Fully Stressed Extremely RarelyThis is particularly so for monophthongs, but also applies to diphthongs

Thus, Stress Accent Appears to Be Intricately Involved with Vocalic Identity

Is It Stress? Vocalic Identity? Or What?

Duration Appears to Play An Important (but certainly not exclusive) Role in Stress Accent for Spontaneous American English Discourse

For any given vocalic class, stressed segments are longer (on average)The durational disparity is most pronounced among the low vowels and the

diphthongs

Low Vowels Tend to be Much Longer in Duration than High VowelsThis is the case even for diphthongs

Low Vowels are Rarely without Some Measure of Stress AccentThis is true for monophthongs as well as diphthongs

High Vowels are Fully Stressed Extremely RarelyThis is particularly so for monophthongs, but also applies to diphthongs

Thus, Stress Accent Appears to Be Intricately Involved with Vocalic IdentityThis relation is likely to have an important impact on pronunciation variation

Is It Stress? Vocalic Identity? Or What?

Duration Appears to Play An Important (but certainly not exclusive) Role in Stress Accent for Spontaneous American English Discourse

For any given vocalic class, stressed segments are longer (on average)
• The durational disparity is most pronounced among the low vowels and the diphthongs

Low Vowels Tend to be Much Longer in Duration than High Vowels
• This is the case even for diphthongs

Low Vowels are Rarely without Some Measure of Stress Accent
• This is true for monophthongs as well as diphthongs

High Vowels are Fully Stressed Extremely Rarely
• This is particularly so for monophthongs, but also applies to diphthongs

Thus, Stress Accent Appears to Be Intricately Involved with Vocalic Identity
• This relation is likely to have an important impact on pronunciation variation

And Thus Could be Useful for Modeling Pronunciation Variation for BOTH Scientific and Technological Applications

Part Six

SPOKEN LANGUAGE – WHAT IS TRUTH?

Fundamental Questions Remain Unanswered

The Current Story Raises More Questions than it Answers …

Spoken Language – What is Truth?

Is It Possible to Dissociate Vocalic Identity from Stress Accent?

Is Duration an Essential Component of Stress Accent and Vowel Height?

How Should Words (and other organizational units) be Represented in ASR Lexicons to Exploit Such Interrelations?

Can Speech Technology Afford to View Language as a Mere Concatenation of Phones and Words (or analogous units)?

Perhaps No Single Perspective Can Truly Capture the Essence of Spoken Language, or Portray It with the Depth and Clarity Required to Produce “Flawless” Technology and Enduring Scientific Insight

That’s All, Folks

Many Thanks for Your Time and Attention
