Conversational Technologies 1 Natural Language Processing August 23, 2007 SpeechTEK University Deborah Dahl Conversational Technologies


Description of the Tutorial

An introduction to the principles of natural language processing and the role of natural language processing in current and future speech applications

9:00-9:15 Introduction: What is natural language?
9:15-10:15 Part 1: Overview and Principles
10:15-10:45 Break (30 minutes)
10:45-12:00 Part 2: Detailed Examples


Attendees

Backgrounds and goals


Audience and Background

A general technical background. No natural language processing background will be assumed, but experience developing speech applications would be helpful.


What is Natural Language?

Natural language is the kind of language that’s used to communicate between people

Can be spoken, written or gestural (in the case of Sign Languages)

There are several thousand currently spoken human languages


Why are We Interested in Natural Language?

Support for more natural and effective computer-human interactions by accommodating the ways that people already communicate


Natural Language Processing

Natural language understanding
Natural language generation
Machine translation


Part 1: Overview and Principles


Goals

Understand what natural language is
Learn about the most common techniques for processing natural language
Their strengths and weaknesses
Understand where natural language processing technology is headed in the future
Focus is on commercial applications


Topics

What is natural language?
Issues in spoken natural language and how to handle them
Statistical Language Models (SLM's)
Speech grammars with semantic tags
Variability in expression, pronouns, and filling multiple slots from a single utterance
How emerging standards such as EMMA will contribute to more sophisticated future applications

Recent topics in natural language research and how this research may eventually be utilized in future applications


Natural Language Understanding

The task of automatically assigning meaning to language


What natural language processing isn’t

Speech recognition, which turns the sounds of spoken language into the words of written language

Dialog management, which manages a natural language interaction between a user and a computer

Artificial intelligence, which studies how to provide intelligent capabilities to computers


Assigning Meaning to Language

In most applications, the developer decides what the set of possible meanings is

Meanings can be simple or complex
Language can be simple or complex
Current commercial techniques can:
Assign simple meanings to simple language
Assign simple meanings to complex language

Research systems can handle more complex meanings and language, but no existing system can handle all meanings and all language for even one human language


Examples of Complex Language

Shakespeare
Religious texts
The United States Constitution
We don’t have to worry about assigning meaning to these texts!


Simple to Slightly More Complex Language

“yes”
“New York”
“call home”
“a red t-shirt, size large”
“I want to go from Philadelphia to New York on Sunday, August 19”
As language becomes more complex, we increasingly need special techniques to process it


Human Communication Process?

Person A (Thought) → language → Person B (Thought)


More Realistic Communication Process

Person A (Thought 1) → language → Person B (a thought somewhat similar to Thought 1)

Person A:
How should I express this?
Is this something I really need to say?
What does B already know?
Why do I want to express this thought?
Do I want to impress B?
Might I offend B by saying this?
What language should I use?

Person B:
Should I believe this?
Could A be lying or lacking credibility?
If I think A is lying should I say so?
Did I hear it right?
Did I understand it?
Why did Person A say that?


Issues in Natural Language

Variability of expression
Infinite number of meanings that can be expressed
Infinite number of possible sentences in a language
Many ways to say the same thing
The same thing can have different meanings in different contexts


What is a Meaning?

Many approaches to representing meanings in traditional linguistics and philosophy of language

Most widely used commercial representation is as a token or as a set of slot/value pairs (also called “key/value” or “attribute/value” pairs)

Often structured into a set of related slot/value pairs (for example, the fields of a VoiceXML <form>, or a traditional frame)


Tokens

“my printer is printing horizontal bands and everything is printing in blue” → “printer problem”

“I can’t connect to the internet” → “internet problem”


What is a Meaning? Slot/Value Pairs

“I want to go from Chicago to New York on August 19 midafternoon on United”

Form/frame – airline reservation
Destination: New York
Departure city: Chicago
Departure date: August 19
Departure time: midafternoon
Airline: United
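A minimal Python sketch of the slot/value idea, assuming the frame is held in a plain dictionary. The slot names and the `missing_slots` helper are illustrative choices, not part of any standard or of the original slides.

```python
# Hypothetical frame for the airline reservation example.
# Slot names are illustrative, not taken from any standard.
REQUIRED_SLOTS = ["destination", "departure_city", "departure_date",
                  "departure_time", "airline"]

frame = {
    "destination": "New York",
    "departure_city": "Chicago",
    "departure_date": "August 19",
    "departure_time": "midafternoon",
    "airline": "United",
}


def missing_slots(candidate):
    """Return the required slots that still have no value."""
    return [slot for slot in REQUIRED_SLOTS if not candidate.get(slot)]


print(missing_slots(frame))                        # nothing left to ask for
print(missing_slots({"destination": "New York"}))  # slots a dialog would still prompt for
```

A dialog manager could use the list of missing slots to decide which question to ask next.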


Information Available for Extracting Meaning

Used by today’s commercial systems:
Words of the utterance
Word order
Grammatical endings
Specific grammar for the application
Information about what previous instances of that utterance have meant

Used by research systems and people:
Prosody (intonation, pauses, loudness, stress, timing)
General information about the language itself (dictionaries, grammars, thesauri)
Context of the utterance
Information about the topic
Facial expressions, gestures


Traditional Tasks in Natural Language Understanding

(Recognition – speech, handwriting, OCR…)

Lexical lookup
Part of speech tagging
Sense disambiguation
Syntactic parsing
Semantic analysis
Pragmatic analysis


Problems with Traditional Approaches

Try to describe the full language and a broad set of meanings

For practical applications, it’s much easier to just write a small grammar for a specific application


Natural Language Tasks in Commercial Speech Systems

(Recognition – speech, handwriting, OCR…)
Lexical lookup (part of recognition)
Part of speech tagging – parts of speech not used
Sense disambiguation – not needed, constrained application
Syntactic parsing – syntactic structure used indirectly
Semantic analysis
Pragmatic analysis
Done in parallel


Extracting Meaning in Commercial Applications

Filling slots by using semantically tagged grammars (CFG’s)

Mapping complex utterances to categories (SLM’s)


Semantically Tagged Grammars

A grammar defines what the recognizer can recognize (recognized strings)

Tags define return values for different recognized strings

Information used: words of the utterance and a special-purpose grammar


Context-Free Grammar Formats

Represent what a speech recognizer can recognize

Example: Request → PoliteWord + Action + Item (please open the door)
Speech Recognition Grammar Specification (SRGS) (ABNF and XML formats)
Java Speech Grammar Format (JSGF)
Nuance GSL
Microsoft Speech Application Programming Interface (SAPI)


Semantic Tags

Reduce variability of expression
Assign return values to recognized strings
W3C Semantic Interpretation for Speech Recognition (SISR)
JSGF tags
SAPI tags
IBM ECMAScript tags
Nuance GSL


Capabilities of Tag Formats

Assign tokens to strings (JSGF)
“Yeah” → yes

Create key-value pairs (SAPI)
“to chicago” → <destination>ord</destination>

Perform computations (SISR, IBM, GSL)
“three days from now” → August 26, 2007
“two medium and three large pizzas” → 5 pizzas
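The two computations above can be sketched in Python. This is only an illustration of what a tag might compute, not SISR, IBM, or GSL syntax. The `NUMBER_WORDS` table and both function names are assumptions, and the tutorial date (August 23, 2007) stands in for "now".

```python
from datetime import date, timedelta

# Small illustrative vocabulary; a real grammar would cover more numbers.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def days_from_now(number_word, today):
    """Resolve '<number> days from now' relative to the given date."""
    return today + timedelta(days=NUMBER_WORDS[number_word])


def count_pizzas(utterance):
    """Add up the number words in an order such as
    'two medium and three large pizzas'."""
    return sum(NUMBER_WORDS[w] for w in utterance.lower().split()
               if w in NUMBER_WORDS)


print(days_from_now("three", date(2007, 8, 23)))           # 2007-08-26
print(count_pizzas("two medium and three large pizzas"))   # 5
```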


SISR Tags for “yes” and “no”

<rule id="yes">
  <one-of>
    <item>yes</item>
    <item>yeah<tag>yes</tag></item>
    <item><token>you bet</token><tag>yes</tag></item>
    <item xml:lang="fr-CA">oui<tag>yes</tag></item>
  </one-of>
</rule>
<rule id="no">
  <one-of>
    <item>no</item>
    <item>nope</item>
    <item><token>no way</token></item>
  </one-of>
  <tag>no</tag>
</rule>
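The effect of the "yes" and "no" rules can be imitated with a lookup table. This is a rough Python sketch that assumes the recognizer hands back the recognized string as plain text. The `TAGS` table and the `interpret` function are hypothetical names, not part of SRGS or SISR.

```python
# Recognized phrase -> return value, mirroring the "yes" and "no" rules.
TAGS = {
    "yes": "yes", "yeah": "yes", "you bet": "yes", "oui": "yes",
    "no": "no", "nope": "no", "no way": "no",
}


def interpret(recognized):
    """Return the token for a recognized string, or None if the string
    is not covered by the grammar."""
    normalized = " ".join(recognized.lower().split())
    return TAGS.get(normalized)


print(interpret("You bet"))
print(interpret("no way"))
print(interpret("maybe"))
```

The application then only has to handle two return values, however many ways of saying yes or no the grammar accepts.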


GSL Token

DigitValue [ ([zero oh] one) { return (01) } ...]

“oh one” → 01


SISR Slot/Value

"I would like a small coca cola and three large pizzas with pepperoni and mushrooms.”

<rule id="order">
  I would like a
  <ruleref uri="#drink"/>
  <tag>out.drink = new Object();
       out.drink.liquid=rules.drink.type;
       out.drink.drinksize=rules.drink.drinksize;</tag>
  and
  <ruleref uri="#pizza"/>
  <tag>out.pizza=rules.pizza;</tag>
</rule>


GSL Slot/Value

;GSL 2.0;
ColoredObject:public (Color Object)
Color [
  [red pink] { <color red> }
  [yellow canary] { <color yellow> }
  [green khaki] { <color green> }
]
Object [
  [truck car] { <object vehicle> }
  [ball block] { <object toy> }
  [shirt blouse] { <object clothing> }
]
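To make the slot filling concrete, here is a small Python imitation of the ColoredObject grammar. It assumes the input is already a recognized text string and accepts exactly a color word followed by an object word. The function name and dictionaries are illustrative only and are not GSL.

```python
# Word -> slot value, copied from the Color and Object rules.
COLORS = {"red": "red", "pink": "red",
          "yellow": "yellow", "canary": "yellow",
          "green": "green", "khaki": "green"}
OBJECTS = {"truck": "vehicle", "car": "vehicle",
           "ball": "toy", "block": "toy",
           "shirt": "clothing", "blouse": "clothing"}


def parse_colored_object(utterance):
    """Match '<Color> <Object>' and return the filled slots, or None
    when the utterance is outside the grammar."""
    words = utterance.lower().split()
    if len(words) != 2:
        return None
    color, obj = words
    if color not in COLORS or obj not in OBJECTS:
        return None
    return {"color": COLORS[color], "object": OBJECTS[obj]}


print(parse_colored_object("pink blouse"))
print(parse_colored_object("blue car"))
```

Note how "pink blouse" and "red shirt" produce the same slot values, which is the reduction in variability that semantic tags are meant to provide.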


SAPI Slot-Value

<RULE name="elvis">
  <L PROPNAME="artist">
    elvis <O>presley</O>
    the king
  </L>
</RULE>


Problems with Tagged Grammars

Hard to maintain when complex
Hard to anticipate all the variations in how someone might say something
Can use wildcards/garbage to ignore parts of utterance
Speech recognition suffers when grammars are too complex
Speech recognition suffers when wildcards are used


Statistical Language Models (SLM’s)

Speech recognition is based on statistical models, not grammars

In commercial systems, natural language processing is a process of classification, relatively coarse meaning extraction

Works well if goal is to extract very simple meanings


Stages in SLM Processing

Ngram speech recognition: probabilities of word sequences, usually 2-3 words

Much more flexible (but less accurate) than a grammar

However, accuracy is not as critical with SLM’s because you don’t have to get every single word right

Text classification: given a text, assign it to categories based on training from previous texts

There are many algorithms for classification
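The n-gram stage can be illustrated with a toy bigram model. This Python sketch estimates word-pair probabilities by simple counting (maximum likelihood, no smoothing) from three made-up training sentences. Real recognizers train on far more data and use smoothing, so treat this only as an illustration of the idea.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count histories and bigrams, padding each sentence with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        histories.update(words[:-1])          # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return histories, bigrams


def bigram_probability(histories, bigrams, previous, word):
    """Maximum-likelihood estimate of P(word | previous); 0.0 if unseen."""
    if histories[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / histories[previous]


corpus = ["could you say that again",
          "say that again please",
          "repeat that please"]
histories, bigrams = train_bigrams(corpus)

# "that" is followed by "again" in two of its three occurrences.
print(bigram_probability(histories, bigrams, "that", "again"))
```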


Problems with SLM’s

Less accurate than CFG’s
Expensive to implement and maintain
Require a lot of data for good performance


Tagged Grammars or SLM’s?

Deeply nested menus → SLM’s
Complex applications with many slots to fill and precise meanings needed → grammars
Can combine both approaches in one application
Front-end SLM followed by grammar
Prompt asks specific question to catch most common tasks but has “other” category


Other Combination Approaches

Use SLM technology to recognize but grammar to interpret

Rules combined with SLM’s
Robust parsing
Rules combined with wildcard

“I want um make that a large pizza with pepperoni and onions”
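One very simple form of robust parsing is keyword spotting: pick out the words that fill slots and ignore everything else, including hesitations. The Python sketch below assumes a tiny pizza vocabulary and works on the recognized text. It is an illustration of the idea, not the technique of any particular product.

```python
# Illustrative vocabularies; a real application would define its own.
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushrooms", "onions", "sausage"}


def spot_pizza_slots(utterance):
    """Fill size and toppings from known words; all other words
    (fillers, false starts, politeness) act as wildcards."""
    slots = {"size": None, "toppings": []}
    for word in utterance.lower().split():
        if word in SIZES:
            slots["size"] = word              # a later size overrides an earlier one
        elif word in TOPPINGS:
            slots["toppings"].append(word)
    return slots


print(spot_pizza_slots(
    "I want um make that a large pizza with pepperoni and onions"))
```

The price of this flexibility is precision: "not large" would still fill the size slot with "large", which is one reason precise applications fall back on grammars.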


Emerging Standards: EMMA

EMMA (Extensible Multi-Modal Annotation)

Developed by the World Wide Web Consortium Multimodal Interaction Working Group

An XML format for representing users’ inputs and the results of processing them


How does EMMA relate to natural language understanding?

EMMA represents the results of a natural language understanding process


EMMA Benefits (1)

EMMA’s standard format lets all kinds of EMMA producers (multimodal modality components) exchange results:
handwriting recognizers
speech recognizers
text classifiers
face recognizers
speaker identification and verification
…


EMMA Benefits (2)

Through “<derived-from>”, provides a way for “specialist” processing components to cooperate in processing a single input

Speech recognition → Lexical lookup → Part of speech tagging → Parsing → Semantic analysis

Ngram speech recognition → Classification


EMMA Example – (1) Annotation Elements

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma/">
  <emma:info>
    <application>airline</application>
  </emma:info>
  <emma:model>
    <model class="airline">
      <source></source>
      <destination></destination>
      <days></days>
      <meals></meals>
    </model>
  </emma:model>
  <emma:model>

from philadelphia to boston and i want a vegetarian meal


EMMA Example – (2) Annotation Attributes

<emma:interpretation
  id="interp5"
  emma:start="1186519245101"
  emma:mode="speech"
  emma:end="1186519248391"
  emma:confidence="0.03"
  emma:function="dialog"
  emma:duration="3290"
  emma:uninterpreted="false"
  emma:lang="en-US"
  emma:verbal="true"
  emma:dialog-turn="1"
  emma:tokens="from philadelphia to boston and i want a vegetarian meal "
  emma:medium="acoustic"
  emma:process="file://Microsoft Speech Recognizer 8.0 for Windows (English - US), SAPI5, Microsoft" >

/>


EMMA Example (3) Application Semantics

<source>philadelphia </source>
<destination>boston</destination>
<meal>vegetarian</meal>
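A consumer of EMMA, such as a dialog manager, mainly needs to read the annotations and the application semantics back out of the XML. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed, well-formed version of the example. The document string is assembled here for illustration (only two of the annotation attributes are kept), and the `read_interpretation` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma/"

# Trimmed version of the example, assembled for illustration.
DOCUMENT = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma/">
  <emma:interpretation id="interp5"
      emma:confidence="0.03"
      emma:tokens="from philadelphia to boston and i want a vegetarian meal">
    <source>philadelphia</source>
    <destination>boston</destination>
    <meal>vegetarian</meal>
  </emma:interpretation>
</emma:emma>
"""


def read_interpretation(xml_text):
    """Return (confidence, slots) for the first emma:interpretation."""
    root = ET.fromstring(xml_text.strip())
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    confidence = float(interp.get(f"{{{EMMA_NS}}}confidence"))
    # Child elements carry the application semantics (slot/value pairs).
    slots = {child.tag: (child.text or "").strip() for child in interp}
    return confidence, slots


confidence, slots = read_interpretation(DOCUMENT)
print(confidence, slots)
```

With a confidence this low, an application would probably confirm the slots with the user before acting on them.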


Part 2: Detailed Examples


SAPI XML Grammar Examples

Windows Speech Recognition (Vista)
Office 2003 Speech Recognition
Example – music player interface:
“I’d like to hear Beethoven’s 5th”
“Please play Brandenburg Concertos by Bach”
“Play something by Elvis”


Canonicalizing Forms

<RULE name="elvis">
  <L PROPNAME="artist">
    elvis <O>presley</O>
    the king
  </L>
</RULE>


Canonicalizing Forms (2)

<RULE name="name">
  <L PROPNAME="name">
    ninth <O>symphony</O>
    seventh <O>symphony</O>
    fifth <O>symphony</O>
    Brandenburg Concertos
    third symphony
    hound dog
    something
    anything
    symphony in d major <O>opus 3</O>
  </L>
</RULE>


Disambiguating

<RULE name="jsbach">
  <O>
    <L>
      J S
      Johann Sebastian
    </L>
  </O>
  Bach
</RULE>
<RULE name="jcbach">
  <L>
    J C
    Johann Christian
  </L>
  Bach
</RULE>
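The disambiguation logic in these two rules is that the first name is optional for J. S. Bach, so a bare "Bach" resolves to him, while J. C. Bach must be named explicitly. A Python sketch of the same logic using regular expressions on recognized text follows. The function name and return values are illustrative.

```python
import re

# Optional "J S" / "Johann Sebastian": bare "Bach" defaults to J. S. Bach.
JS_BACH = re.compile(r"^(?:(?:j s|johann sebastian) )?bach$")
# "J C" / "Johann Christian" is required for J. C. Bach.
JC_BACH = re.compile(r"^(?:j c|johann christian) bach$")


def which_bach(phrase):
    """Return the name of the matching rule, or None if neither matches."""
    text = " ".join(phrase.lower().split())
    if JC_BACH.match(text):
        return "jcbach"
    if JS_BACH.match(text):
        return "jsbach"
    return None


for phrase in ["Bach", "J S Bach", "Johann Christian Bach", "Mozart"]:
    print(phrase, "->", which_bach(phrase))
```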


SLM Examples

Meta-utterances for channel control:
“I’m confused”
“Speak louder please”
“Could you say that again?”


Training Data

Find out how people ask these questions
Manually tag them with their categories

Category: repeat
could you say that again please
i didn't catch that
sorry
pardon me?
repeat that please
say that again
what?

Category: operator
I need to speak to a human
are there any humans I can talk to?
please get me an operator
I want an operator
operator please
I need an agent


Use NGram Speech Grammar

Ngrams are sets of two or three words and the probabilities that they’ll occur together in that order

Much less constrained than CFG’s
Less accurate
Used in “How may I help you?” applications, dictation systems, and research


Use Text Classification Software

Uses training data to develop probabilities that a new text is in one of the training categories

Many algorithms and approaches to text classification

Similar to the technology used in spam filters, but input is speech
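As a concrete illustration, here is a very small text classifier in Python. It assumes a bag-of-words representation and cosine similarity between the input and each category's training text. That is one simple choice among the many algorithms available, and it is not claimed to be the method used for the examples in this tutorial. The training utterances are the "repeat" and "operator" examples from the training data slide.

```python
import math
import re
from collections import Counter

TRAINING = {
    "repeat": ["could you say that again please", "i didn't catch that",
               "sorry", "pardon me?", "repeat that please",
               "say that again", "what?"],
    "operator": ["I need to speak to a human",
                 "are there any humans I can talk to?",
                 "please get me an operator", "I want an operator",
                 "operator please", "I need an agent"],
}


def bag_of_words(text):
    """Lowercase, drop punctuation, and count the words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# One bag of words per category, built from all its training utterances.
CATEGORY_BAGS = {name: bag_of_words(" ".join(examples))
                 for name, examples in TRAINING.items()}


def classify(text):
    """Return (best_category, scores) for a recognized text."""
    query = bag_of_words(text)
    scores = {name: cosine(query, bag) for name, bag in CATEGORY_BAGS.items()}
    return max(scores, key=scores.get), scores


# Misrecognized input still lands in the intended category.
print(classify("party may i didn't catch that"))
```

Because the classifier looks at overall word overlap, it tolerates recognition errors such as "party may" for "pardon me", as long as enough of the remaining words match the right category.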


Example

User says: Pardon me, I didn’t catch that
Speech recognizer hears: party may i didn't catch that
Classifier classifies:
increase_volume 0.4595725150090289
decrease_volume 0.0
slower 0.0
faster 0.0
confused 0.4447495899966607
repeat 0.567774973957669
operator 0.5163977794943222


EMMA Text Input Example

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma/">
  <emma:interpretation id="interp4"
      emma:duration="3038"
      emma:confidence="1.0"
      emma:process="file://Microsoft Speech Recognizer 8.0 for Windows (English - US), SAPI5, Microsoft"
      emma:medium="tactile"
      emma:verbal="true"
      emma:mode="keys"
      emma:start="1187040519583"
      emma:uninterpreted="false"
      emma:function="dialog"
      emma:dialog-turn="4"
      emma:end="1187040737446"
      emma:lang="en-US"
      emma:tokens="i'd like to go from boston to philadelphia on tuesday " >
    <source>boston</source>
    <destination>philadelphia</destination>
    <day>Tuesday</day>
  </emma:interpretation>
</emma:emma>


EMMA: Classification Example

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma/">
  <emma:interpretation id="interp4"
      emma:duration="3038"
      emma:confidence=".5"
      emma:process="tech-support-slm"
      emma:medium="acoustic"
      emma:verbal="true"
      emma:mode="voice"
      emma:start="1187040519583"
      emma:uninterpreted="false"
      emma:function="dialog"
      emma:dialog-turn="4"
      emma:end="1187040737446"
      emma:lang="en-US"
      emma:tokens="my internet connection keeps going off " >
    <problem>internet connectivity</problem>
  </emma:interpretation>
</emma:emma>


Natural Language Research

Natural language processing is an active area of academic and industrial research

Topics studied include spoken dialog processing, text understanding, natural language generation, automatic translation, acquisition of natural language information such as words and grammars, information extraction, summarization and support for search


Natural Language Research

Most interesting to this audience are topics such as

Broadening domains (sense disambiguation and parsing disambiguation)

Handling spoken dialog phenomena such as pronouns and ellipses

Handling speech errors such as hesitations, false starts

Multimodal communication, such as integrating speech and gestures

Extracting information provided by prosody and other suprasegmentals

The main academic organization is The Association for Computational Linguistics (www.aclweb.org)


More Information: Websites

W3C Voice Browser WG (SISR): http://www.w3.org/TR/semantic-interpretation/
W3C Multimodal Interaction WG (EMMA): http://www.w3.org/TR/emma/
Association for Computational Linguistics: www.aclweb.org
Loquendo Café (for testing SISR grammars): http://www.loquendocafe.com
Voxeo Prophecy Platform (for testing Nuance grammars): www.voxeo.com
SAPI XML grammars (test with Windows Speech Recognition or Office 2003 Microsoft 6.1 recognizer): http://www.microsoft.com/speech/SDK/51/sapi.chm
Conversational Technologies: http://www.conversational-technologies.com


More Information: Books, Journals, Articles

“Natural Language Processing: the Next Steps” (September 2006): http://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=29474
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Daniel Jurafsky and James H. Martin (2000)
Computational Linguistics
Natural Language Engineering
