An Elitist Approach to Articulatory-Acoustic Feature Classification in English and in Dutch

Steven Greenberg, Shawn Chang and Mirjam Wester

International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng [email protected]
http://www.icsi.berkeley.edu/~shawnc [email protected]

Mirjam Wester
A2RT, Department of Language and Speech
Nijmegen University, Netherlands
http://www.lands.let.kun.nl/Tspublic/wester [email protected]

Acknowledgements and Thanks

Automatic Feature Classification and Analysis: Joy Hollenback, Lokendra Shastri, Rosaria Silipo

Research Funding: U.S. National Science Foundation, U.S. Department of Defense

Motivation for Automatic Transcription

• Many Properties of Spontaneous Spoken Language Differ from Those of Laboratory and Citation Speech
  – There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization
• Phonetic and Prosodic Annotation Material is of Limited Quantity
  – Phonetic and prosodic material is important for understanding spoken language and developing superior technology for recognition and synthesis
• Manual Annotation of Phonetic and Prosodic Material is a Pain in the Butt to Produce
  – Hand labeling and segmentation are time-consuming and expensive
  – It is difficult to find qualified transcribers and training can be arduous
• Automatic Alignment Systems (used in speech recognition) are Inaccurate both in Terms of Labeling and Segmentation
  – Forced-alignment-based segmentation is poor - ca. 40% off on phone boundaries
  – Phone classification error is ca. 30-50%
  – Speech recognition systems do not currently deal with prosody
• Automatic Transcription is Likely to Aid in the Development of Speech Recognition and Synthesis Technology
  – And therefore is worth the effort to develop

Road Map of the Presentation

• Introduction
  – Motivation for developing automatic phonetic transcription systems
  – Rationale for the current focus on articulatory-acoustic features (AFs)
  – The development corpus - NTIMIT
  – Justification for using NTIMIT for development of AF classifiers
• The ELITIST Approach and Its Application to English
  – The baseline system
  – The ELITIST approach
  – Manner-specific classification for place of articulation features
• Application of the ELITIST Approach to Dutch
  – The training and testing corpus - VIOS
  – The nature of cross-linguistic transfer of articulatory-acoustic features
  – The ELITIST approach to frame selection as applied to the VIOS corpus
  – Improvement of place-of-articulation classification using manner-specific training in Dutch
• Conclusions and Future Work
  – Development of fully automatic phonetic and prosodic transcription systems
  – An empirically oriented discipline based on annotated corpora

Part One

INTRODUCTION

Motivation for Developing Automatic Phonetic Transcription Systems

Rationale for the Current Focus on Articulatory-Acoustic Features

Description of the Development Corpus – NTIMIT

Justification for Using the NTIMIT Corpus

Corpus Generation - Objectives

• Provides Detailed, Empirical Material for the Study of Spoken Language
  – Such data provide an important basis for scientific insight and understanding
  – Facilitates development of new models for spoken language
• Provides Training Material for Technology Applications
  – Automatic speech recognition, particularly pronunciation models
  – Speech synthesis, ditto
  – Cross-linguistic transfer of technology algorithms
• Promotes Development of NOVEL Algorithms for Speech Technology
  – Pronunciation models and lexical representations for automatic speech recognition and speech synthesis
  – Multi-tier representations of spoken language

Corpus-Centric View of Spoken Language

Our Focus in Today’s Presentation is on Articulatory Feature Classification

Other levels of linguistic representation are also extremely important to annotate

Rationale for Articulatory-Acoustic Features

• Articulatory-Acoustic Features (AFs) are the “Building Blocks” of the Lowest (i.e., phonetic) Tier of Spoken Language
  – AFs can be combined in a multitude of ways to specify virtually any speech sound found in the world’s languages
  – AFs are therefore more appropriate for cross-linguistic transfer than phonetic segments
• AFs are Systematically Organized at the Level of the Syllable
  – Syllables are a basic articulatory unit in speech
  – The pronunciation patterns observed in casual conversation are systematic at the AF level, but not at the phonetic-segment level, and therefore can be used to develop more accurate and flexible pronunciation models than phonetic segments
• AFs are Potentially More Effective in Speech Recognition Systems
  – More accurate and flexible pronunciation models (tied to syllabic and lexical units)
  – Are generally more robust under acoustic interference than phonetic segments
  – The relatively small number of alternative features in each AF dimension makes classification inherently more robust than for phonetic segments
• AFs are Potentially More Effective in Speech Synthesis Systems
  – More accurate and flexible pronunciation models (tied to syllabic and lexical units)

Primary Development Corpus – NTIMIT

• Sentences Read by Native Speakers of American English
  – Quasi-phonetically balanced set of materials
  – Wide range of dialect variability, both genders, variation in speaker age
  – Relatively low semantic predictability

  “She washed his dark suit in greasy wash water all year”

• Corpus Manually Labeled and Segmented at the Phonetic-Segment Level
  – The precision of phonetic annotation provides an excellent training corpus
  – Corpus was annotated at MIT
• A Large Amount of Annotated Material
  – Over 2.5 hours of material used for training the classifiers
  – 20 minutes of material used for testing
• Relatively Canonical Pronunciation Ideal for Training AF Classifiers
  – Formal pronunciation patterns provide a means of deriving articulatory features from phonetic-segment labels via mapping rules (cf. Proceedings paper for details)
• NTIMIT is a Telephone Pass-band Version of the TIMIT Corpus
  – Sentential material passed through a channel between 0.3 and 3.4 kHz
  – Provides capability of transfer to other telephone corpora (such as VIOS)

Part Two

THE ELITIST APPROACH

The Baseline System for Articulatory-Acoustic Feature Classification

The ELITIST Approach to Systematic Frame Selection for AF Classification

Improving Place-of-Articulation Classification Using Manner-Specific Training

The Baseline System for AF Classification

• Spectro-Temporal Representation of the Speech Signal
  – Derived from logarithmically compressed, critical-band energy pattern
  – 25-ms analysis windows (i.e., a frame)
  – 10-ms frame-sampling interval (i.e., 60% overlap between adjacent frames)
• Multilayer Perceptron (MLP) Neural Network Classifiers
  – Single hidden layer of 200-400 units, trained with back-propagation
  – Nine frames of context used in the input
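
The framing arithmetic and the nine-frame context window can be sketched roughly as follows. This is a minimal illustration, not the presenters' front end: the function names, the 8 kHz sample rate in the usage note, the centered context (four frames on each side) and the edge-clamping rule are all assumptions made for the example.

```python
def frame_starts(n_samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Return the start index (in samples) of every complete analysis window."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if n_samples < win:
        return []
    return list(range(0, n_samples - win + 1, hop))


def overlap_fraction(win_ms=25.0, hop_ms=10.0):
    """Fraction of one window that is shared with the next window."""
    return (win_ms - hop_ms) / win_ms


def stack_context(frames, n_context=9):
    """Concatenate each frame with its neighbours to form the classifier input.

    Assumption: the context is centered on the current frame and the first
    and last frames are repeated at the utterance edges.
    """
    half = n_context // 2
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(t - half, t + half + 1):
            k = min(max(k, 0), len(frames) - 1)  # clamp at the edges
            window.extend(frames[k])
        stacked.append(window)
    return stacked
```

With a hypothetical 8 kHz signal, `frame_starts(8000, 8000)` yields windows of 200 samples advanced by 80 samples, and `overlap_fraction()` gives 15 ms / 25 ms = 0.6, matching the 60% overlap quoted on the slide.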

The Baseline System for AF Classification

• An MLP Network for Each Articulatory Feature (AF) Dimension
  – A separate network trained on voicing, place and manner of articulation, etc.
  – Training targets were derived from hand-labeled phonetic transcripts and a fixed phone-to-AF mapping
  – “Silence” was a feature included in the classification of each AF dimension
  – All of the results reported are for FRAME accuracy (not segmental accuracy)
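
Deriving frame-level training targets from a phone-to-AF mapping might look like the sketch below. The table is a small hypothetical excerpt written for illustration (the mapping actually used is the one referred to in the Proceedings paper), and `frame_targets` and its segment format are assumed names, not part of the original system.

```python
# Hypothetical excerpt of a phone-to-AF table, for illustration only.
# "h#" stands for silence, which appears as a feature in every AF dimension.
PHONE_TO_AF = {
    "p":  {"voicing": "unvoiced", "manner": "stop",      "place": "bilabial"},
    "b":  {"voicing": "voiced",   "manner": "stop",      "place": "bilabial"},
    "s":  {"voicing": "unvoiced", "manner": "fricative", "place": "alveolar"},
    "m":  {"voicing": "voiced",   "manner": "nasal",     "place": "bilabial"},
    "h#": {"voicing": "silence",  "manner": "silence",   "place": "silence"},
}


def frame_targets(segments, dimension, hop_ms=10.0):
    """Expand (phone, start_ms, end_ms) segments into one AF label per frame.

    Each hand-labeled segment contributes one target per 10-ms frame step,
    looked up in the fixed table for the requested AF dimension.
    """
    labels = []
    for phone, start_ms, end_ms in segments:
        n_frames = int(round((end_ms - start_ms) / hop_ms))
        labels.extend([PHONE_TO_AF[phone][dimension]] * n_frames)
    return labels
```

Because the targets are produced per frame, a separate target sequence can be generated for each AF dimension from the same transcript, which fits the one-network-per-dimension design described above.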

The Baseline System for AF Classification

• Focus on Articulatory Feature Classification Rather than Phone Identity
  – Provides a more accurate means of assessing MLP-based classification systems

Baseline System Performance Summary

NTIMIT Corpus

• Classification of Articulatory Features Exceeds 80% – Except for Place
• Objective – Improve Classification across All AF Dimensions, but Particularly on Place-of-Articulation

Not All Frames are Created Equal

• Correlation Between Frame Position and Classification Accuracy for MANNER of articulation features:
  – The 20% of the frames closest to the segment BOUNDARIES are 73% correct
  – The 20% of the frames closest to the segment CENTER are 90% correct
• Correlation between frame position within a segment and classifier output for MANNER features:
  – The 20% of the frames closest to the segment BOUNDARIES have a mean maximum output (“confidence”) level of 0.797
  – The 20% of the frames closest to the segment CENTER have a mean maximum output (“confidence”) level of 0.892
  – This dynamic range of ca. 0.1 (in absolute terms) is HIGHLY significant

Not All Frames are Created Equal

• Manner Classification is Best for Frames in the Phonetic-Segment Center
• MLP Network Confidence Level is Highly Correlated with Frame Accuracy
• The Most Confidently Classified Frames are Generally More Accurate

Selecting a Threshold for Frame Selection

• The Correlation Between Neural Network Confidence Level and Frame Position within the Phonetic Segment Can Be Exploited to Enhance Articulatory Feature Classification
  – This insight provides the basis for the “Elitist” approach

Selecting a Threshold for Frame Selection

• The Most Confidently Classified Frames are Generally More Accurate
• Frames with Confidence Levels Below “Threshold” are Discarded
  – Setting the threshold to 0.7 filters out ca. 20% of the frames
  – Boundary frames are twice as likely to be discarded as central frames
• Primary Drawback of Using This Threshold for Frame Selection
  – 6% of the phonetic segments have most of their frames discarded
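
The thresholding step described on this slide can be illustrated with a short sketch. It assumes each frame comes with a list of per-class network outputs and treats the maximum output as the confidence level. The function names and data layout are hypothetical, and only the 0.7 threshold comes from the slide.

```python
def select_frames(posteriors, threshold=0.7):
    """Keep frames whose maximum network output reaches the threshold.

    Returns (frame_index, winning_class_index, confidence) triples for the
    retained frames; frames below the threshold are discarded.
    """
    kept = []
    for t, outputs in enumerate(posteriors):
        confidence = max(outputs)
        if confidence >= threshold:
            kept.append((t, outputs.index(confidence), confidence))
    return kept


def mostly_discarded(segments, kept_frame_indices):
    """Return indices of segments that lost more than half of their frames.

    `segments` is a list of lists of frame indices, one list per phonetic
    segment (a hypothetical layout chosen for this example).
    """
    kept = set(kept_frame_indices)
    flagged = []
    for i, frames in enumerate(segments):
        n_kept = sum(1 for f in frames if f in kept)
        if frames and n_kept < len(frames) / 2:
            flagged.append(i)
    return flagged
```

The second helper corresponds to the drawback noted above: once low-confidence frames are dropped, some segments retain only a minority of their frames, so any segment-level decision for them rests on very little evidence.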

The Elitist Approach to Manner Classification

• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases overall from 85% to 93%
• Certain Manner Classes Improve Markedly with Frame Selection
– Nasals, stops, fricatives and flaps all show strong improvement in performance

Page 56: An Elitist Approach to Articulatory-Acoustic Feature

Manner-Dependency for Place of Articulation

• Objective – Reduce the Number of Place Features to Classify for Any Single Manner Class
– Although there are NINE distinct place of articulation features overall ...
– For any single manner class there are only three or four place features
– The specific PLACES of articulation for stops differ from those for fricatives, etc.
– HOWEVER, the SPATIAL PATTERNING of the constriction loci is SIMILAR
• Because Classification Accuracy for Manner Features is High, Manner-Specific Training for Place of Articulation is Feasible (as we’ll show you)

Page 58: An Elitist Approach to Articulatory-Acoustic Feature

Manner-Specific Place Classification

NTIMIT (telephone) Corpus

Thus, Each Manner Class can be Trained on Comparable Relational Place Features: ANTERIOR, CENTRAL, POSTERIOR

Classifying Place of Articulation in Manner-Specific Fashion Can Improve the Classification Accuracy of this Feature Dimension
– The training material is far more homogeneous under this regime and is thus more reliable and robust
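A minimal sketch of the manner-specific routing idea follows. It is not the authors' implementation: the slides do not give the per-manner place inventories or the classifier interface, so the mapping in `PLACE_BY_MANNER` is a hypothetical illustration, and a stub that picks the highest of three scores stands in for each manner-specific network.

```python
from typing import Callable, Dict, Optional, Sequence

# Relational place features shared by all manner classes (from the slide).
RELATIONAL = ("anterior", "central", "posterior")

# Hypothetical mapping from relational features back to concrete places.
# The real inventories used in the talk are not given on the slides.
PLACE_BY_MANNER: Dict[str, Dict[str, str]] = {
    "stop": {"anterior": "bilabial", "central": "alveolar", "posterior": "velar"},
    "fricative": {"anterior": "labiodental", "central": "alveolar", "posterior": "palatal"},
}


def make_stub_classifier() -> Callable[[Sequence[float]], str]:
    """Stand-in for a manner-specific network: picks the highest of three scores."""
    def classify(scores: Sequence[float]) -> str:
        best = max(range(len(RELATIONAL)), key=lambda i: scores[i])
        return RELATIONAL[best]
    return classify


# One place classifier per manner class, each trained (here: stubbed) separately.
PLACE_CLASSIFIERS = {manner: make_stub_classifier() for manner in PLACE_BY_MANNER}


def classify_place(manner: str, scores: Sequence[float]) -> Optional[str]:
    """Route a frame to the place classifier of its (already decided) manner class."""
    classifier = PLACE_CLASSIFIERS.get(manner)
    if classifier is None:
        return None  # no manner-specific classifier in this sketch
    return PLACE_BY_MANNER[manner][classifier(scores)]


print(classify_place("stop", [0.1, 0.2, 0.7]))
print(classify_place("fricative", [0.8, 0.1, 0.1]))
```

The design point is that each classifier only has to separate three relational categories, so the same output layout can be reused across manner classes while the training material stays specific to one class.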

Page 59: An Elitist Approach to Articulatory-Acoustic Feature

Manner-Specific Classification – Vowels

NTIMIT (telephone) Corpus

• Knowing the “Manner” Improves “Place” Classification for Vowels as Well
• Also Improves “Height” Classification

Page 60: An Elitist Approach to Articulatory-Acoustic Feature

Manner-Specific Place Classification - Overall

NTIMIT (telephone) Corpus

• Overall, Performance Improves Between 5% and 14% (in absolute terms)
• Improvement is Greatest for Stops, Nasals and Flaps

Page 68: An Elitist Approach to Articulatory-Acoustic Feature

Summary – ELITIST Approach

• A Principled Method of Frame Selection (the ELITIST approach) can be Used to Improve the Accuracy of Articulatory Feature Classification
• The ELITIST Approach is Based on the Observation that Frames in the Center of Phonetic Segments are More Accurately Classified than Those at Segment Boundaries
• Frame Classification Accuracy is Highly Correlated with MLP Network Confidence Level and can be Used to Systematically Discard Frames
• Discarding such Low-Confidence Frames Improves AF Classification
• Manner Classification is Sufficiently Improved as to be Capable of Performing Manner-Specific Training for Place-of-Articulation Features
• Place of Articulation Feature Classification Improves using Manner-Specific Training
• This Performance Enhancement is Probably the Result of:
– Fewer features to classify for any given manner class
– More homogeneous place-of-articulation training material
• Such Improvements in AF Classification Accuracy Can Be Used to Improve the Quality of Automatic Phonetic Annotation

Page 69: An Elitist Approach to Articulatory-Acoustic Feature

Part Three

THE ELITIST APPROACH GOES DUTCH

Description of the Development Corpus - VIOS

The Nature of Cross-Linguistic Transfer of Articulatory Features

Application of the ELITIST Approach to Dutch

Manner-Specific, Place-of-Articulation Classification for Dutch

Page 72: An Elitist Approach to Articulatory-Acoustic Feature

Dutch Development Corpus – VIOS

• Extemporaneous, Prompted Human-Machine Telephone Dialogues
– Human speakers querying an automatic system for Dutch Railway timetables
– Wide range of dialect variability, both genders, variation in speaker age
• A Portion of the Corpus Manually Labeled at the Phonetic-Segment Level
– Material labeled by speech science students at Nijmegen University
– This component of the corpus served as the testing material
– There were 18 minutes of material in this portion of the corpus
• The Major Portion of the Corpus Automatically Labeled and Segmented
– The automatic method incorporated a certain degree of pronunciation-model knowledge derived from language-specific phonological rules
– This part of the corpus served as the training material
– There were 60 minutes of material in this portion of the corpus

Page 77: An Elitist Approach to Articulatory-Acoustic Feature

How Dutch Differs from English

• Dutch and English are Genetically Closely Related Languages
– Perhaps 1500 years of time depth separating the languages
– They share some (but not all - see below) phonetic properties
• The “Dental” Place of Articulation is Present in English, but not in Dutch
• The Manner “Flap” is Present in English, but not in Dutch
• Certain Manner/Place Combinations in Dutch are not Found in English
– For example, the velar fricative associated with orthographic “g”
• The Vocalic System (particularly diphthongs) Differs Between Dutch and English

Page 81: An Elitist Approach to Articulatory-Acoustic Feature

Cross-Linguistic Classification

• Classification Accuracy on the VIOS Corpus
– Results depend on whether the classifiers were trained on VIOS (Dutch) or NTIMIT (English) material
– Voicing and manner classification is comparable between the two training corpora
– Place classification is significantly worse when training on NTIMIT
– Other feature dimensions exhibit only slightly worse performance training on NTIMIT

Page 85: An Elitist Approach to Articulatory-Acoustic Feature

The Elitist Approach Applied to Dutch

For VIOS-trained Classifiers
• Frames with Confidence Levels Below “Threshold” are Discarded
– Setting the threshold to 0.7 filters out ca. 15% of the frames, corresponding to 6% of the segments
• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases from 85% to 91%

For NTIMIT-trained Classifiers (but classifying VIOS material)
• Frames with Confidence Levels Below “Threshold” are Discarded
– Setting the threshold to 0.7 filters out ca. 19% of the frames
• The Accuracy of MANNER Frame Classification Improves
– Frame-level classification accuracy increases from 73% to 81%
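To compare these gains on a common footing, one option is to express them as relative reductions in frame error rate. The short sketch below does this arithmetic on the accuracies quoted in the slides; the relative-error-reduction figure itself is a derived quantity that does not appear in the talk, and the function name is an assumption of this example.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction in frame error rate, given accuracies in percent."""
    err_before = 100 - acc_before
    err_after = 100 - acc_after
    return (err_before - err_after) / err_before


# Manner accuracies (percent) before and after frame selection, as quoted in the slides.
conditions = {
    "NTIMIT-trained, NTIMIT test": (85, 93),
    "VIOS-trained, VIOS test": (85, 91),
    "NTIMIT-trained, VIOS test": (73, 81),
}

for name, (before, after) in conditions.items():
    rer = relative_error_reduction(before, after)
    print(f"{name}: {before}% -> {after}%, relative error reduction {rer:.0%}")
```

Under this measure the cross-linguistic condition gains as many absolute points as the matched English condition (8), but the relative error reduction is smaller because it starts from a higher error rate.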

Page 88: An Elitist Approach to Articulatory-Acoustic Feature

Place of Articulation is Manner-Dependent

• Although There are Nine Distinct Place of Articulation Features Overall
• For Any Single Manner Class There are Only Three Place Features
• The Locus of Articulation Constriction Differs Among Manner Classes

Page 91: An Elitist Approach to Articulatory-Acoustic Feature

Place of Articulation is Manner-Dependent

VIOS (telephone) Corpus

• Thus, if the Manner is Classified Correctly, this Information can be Exploited to Enhance Place of Articulation Classification
• Thus, Each Manner Class can be Trained on Comparable Relational Place Features: ANTERIOR, CENTRAL, POSTERIOR
• Knowing the “Manner” Improves “Place” Classification for both Consonants and Vowels in DUTCH

Page 93: An Elitist Approach to Articulatory-Acoustic Feature

Manner-Specific Place Classification – Dutch

VIOS (telephone) Corpus

• Knowing the “Manner” Improves “Place” Classification for the “Approximant” Segments in DUTCH
• Approximants are Classified as “Vocalic” Rather Than as “Consonantal”

Page 103: An Elitist Approach to Articulatory-Acoustic Feature

Summary – ELITIST Goes Dutch

• Cross-linguistic Transfer of Articulatory Features
– Classifiers are more than 80% correct on all AF dimensions except for “place” when trained and tested on VIOS
– Voicing and manner classification is comparable between VIOS and NTIMIT
– Place classification (for VIOS) is much worse when trained on NTIMIT
– Other AF dimensions are only slightly worse when trained on NTIMIT
• Application of the ELITIST Approach to the VIOS Corpus
– Results improve when the ELITIST approach is used
– Training on VIOS: frame-level classification accuracy increases from 85% to 91% (15% of the frames discarded)
– Training on NTIMIT: frame-level classification accuracy increases from 73% to 81% (19% of frames discarded)
• Manner-Specific Classification for Place of Articulation Features
– Knowing the “manner” improves “place” classification for vowels and for consonants
– Accuracy increases between 10 and 20% (absolute) for all “place” features
– Approximants are classified as “vocalic” not “consonantal” – knowing the “manner” improves “place” classification for “approximant” segments

Page 104: An Elitist Approach to Articulatory-Acoustic Feature

Part Four

INTO THE FUTURE

Towards Fully Automatic Transcription Systems

An Empirically Oriented Discipline Based on Annotated Corpora

Page 107: An Elitist Approach to Articulatory-Acoustic Feature

The Eternal Pentangle

• Phonetic and Prosodic Annotation is Limited in Quantity
– This material is important for understanding spoken language and developing superior technology for recognition and synthesis

Page 124: An Elitist Approach to Articulatory-Acoustic Feature

I Have a Dream, That One Day ….• There will be Annotated Corpora for All Major Languages of the World

(generated by automatic means, but based on manual annotation)• That Each of These Corpora will Contain Detailed Information About:

– Articulatory-acoustic features– Phonetic segments– Pronunciation variation– Syllable units– Lexical representations– Prosodic information pertaining to accent and intonation– Morphological patterns, as well as syntactic and grammatical material– Semantics and its relation to the lower tiers of spoken language– Audio and video detail pertaining to all aspects of spoken language

• That a Science of Spoken Language will be Empirically Based– Using these annotated corpora to perform detailed statistical analyses– Generating hypotheses about the organization and function of spoken language– Performing experiments based on insights garnered from such corpora

Page 125: An Elitist Approach to Articulatory-Acoustic Feature

I Have a Dream, That One Day ….• There will be Annotated Corpora for All Major Languages of the World

(generated by automatic means, but based on manual annotation)• That Each of These Corpora will Contain Detailed Information About:

– Articulatory-acoustic features– Phonetic segments– Pronunciation variation– Syllable units– Lexical representations– Prosodic information pertaining to accent and intonation– Morphological patterns, as well as syntactic and grammatical material– Semantics and its relation to the lower tiers of spoken language– Audio and video detail pertaining to all aspects of spoken language

• That a Science of Spoken Language will be Empirically Based– Using these annotated corpora to perform detailed statistical analyses– Generating hypotheses about the organization and function of spoken language– Performing experiments based on insights garnered from such corpora

• That Such Corpora will be Used to Develop Wonderful Technology

Page 126: An Elitist Approach to Articulatory-Acoustic Feature

I Have a Dream, That One Day ….• There will be Annotated Corpora for All Major Languages of the World

(generated by automatic means, but based on manual annotation)• That Each of These Corpora will Contain Detailed Information About:

– Articulatory-acoustic features– Phonetic segments– Pronunciation variation– Syllable units– Lexical representations– Prosodic information pertaining to accent and intonation– Morphological patterns, as well as syntactic and grammatical material– Semantics and its relation to the lower tiers of spoken language– Audio and video detail pertaining to all aspects of spoken language

• That a Science of Spoken Language will be Empirically Based– Using these annotated corpora to perform detailed statistical analyses– Generating hypotheses about the organization and function of spoken language– Performing experiments based on insights garnered from such corpora

• That Such Corpora will be Used to Develop Wonderful Technology– To create “flawless” speech recognition

Page 127: An Elitist Approach to Articulatory-Acoustic Feature

I Have a Dream, That One Day …

• There will be Annotated Corpora for All Major Languages of the World (generated by automatic means, but based on manual annotation)
• That Each of These Corpora will Contain Detailed Information About:
– Articulatory-acoustic features
– Phonetic segments
– Pronunciation variation
– Syllable units
– Lexical representations
– Prosodic information pertaining to accent and intonation
– Morphological patterns, as well as syntactic and grammatical material
– Semantics and its relation to the lower tiers of spoken language
– Audio and video detail pertaining to all aspects of spoken language
• That a Science of Spoken Language will be Empirically Based
– Using these annotated corpora to perform detailed statistical analyses
– Generating hypotheses about the organization and function of spoken language
– Performing experiments based on insights garnered from such corpora
• That Such Corpora will be Used to Develop Wonderful Technology
– To create “flawless” speech recognition
– And “perfect” speech synthesis

Page 128: An Elitist Approach to Articulatory-Acoustic Feature

That’s All, Folks

Many Thanks for Your Time and Attention