Speech Recognition

Yonglei Tao

Voice-Activated GPS

Voice User Interface (VUI)

A VUI allows human interaction with computers through a voice/speech platform.

- Basic components
  - Speech recognition
  - Meaning extraction
  - Response generation
  - Speech output
- Benefits
  - Loosens some physical constraints
  - Provides tools for universal design (disability and situational impairments)
  - Intuitive and efficient

System Architecture

Components
- Endpointing: speech to endpointed utterance
- Feature extraction: endpointed utterance to feature vectors
- Recognition: feature vectors to word string(s)
- Natural language understanding: word string(s) to meaning(s)
- Dialog management: meaning to actions

Typical Recognition Components

Examples of similar-sounding words and phrases
- Book, boot
- Write, right
- Flew, flu, flue
- Eight books, ate books
- I scream, ice cream

Components
- Acoustic models: an internal representation of each basic sound
- Dictionary: a list of words and their pronunciations
- Grammar
  - Defines all possible strings of words the recognizer can handle
  - Allows a meaning to be associated with those strings
  - Either rule-based or statistical

Recognition

- Recognition search
  - A recognizer searches the recognition model to find the best-matching word string
- Confidence measures
  - A quantitative measure of how confident the recognizer is in the best-matching string
  - VUI developers can use those measures in several ways
- N-best processing
  - A recognizer returns several results with a confidence measure for each
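One way a VUI developer might use confidence measures with an N-best list is sketched below. It is not tied to any particular recognizer API: the `Hypothesis` class, the `Action` values, and both threshold constants are assumptions made up for illustration, and real thresholds would be tuned per application.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical N-best handling; names and thresholds are illustrative assumptions. */
class NBestDemo {

    /** One recognition hypothesis: a word string plus a confidence in [0, 1]. */
    static final class Hypothesis {
        final String text;
        final double confidence;

        Hypothesis(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    enum Action { ACCEPT, CONFIRM, REJECT }

    // Assumed thresholds, not values from any real engine.
    static final double ACCEPT_AT = 0.80;
    static final double CONFIRM_AT = 0.50;

    /** Returns the hypothesis with the highest confidence, or null for an empty list. */
    static Hypothesis best(List<Hypothesis> nBest) {
        Hypothesis top = null;
        for (Hypothesis h : nBest) {
            if (top == null || h.confidence > top.confidence) {
                top = h;
            }
        }
        return top;
    }

    /** Accept confident results, ask the user to confirm middling ones, reject the rest. */
    static Action decide(Hypothesis h) {
        if (h == null || h.confidence < CONFIRM_AT) {
            return Action.REJECT;
        }
        return h.confidence >= ACCEPT_AT ? Action.ACCEPT : Action.CONFIRM;
    }

    public static void main(String[] args) {
        List<Hypothesis> nBest = Arrays.asList(
                new Hypothesis("ate books", 0.35),
                new Hypothesis("eight books", 0.62));
        Hypothesis top = best(nBest);
        System.out.println(top.text + " -> " + decide(top)); // eight books -> CONFIRM
    }
}
```

A middle band between the two thresholds lets the dialog ask "Did you say eight books?" instead of silently acting on a doubtful result.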

Speech Recognition Engines

- Microsoft Visual Studio & CMU Sphinx
  - Grammar
- Android
  - Language model: free form for dictation, or web search for short phrases
- Google Web Speech API for Web applications

BNF (Backus-Naur Form)

- Notation for context-free grammars
- Often used to describe the syntax of programming languages
- Also used to specify the words and patterns of words to be listened for by a speech recognizer
- Variants: EBNF (Extended Backus-Naur Form), ABNF (Augmented Backus-Naur Form)
- Basis for speech grammar specifications
  - ABNF for .NET
  - Regular grammar for Java

Basics

Symbols
  ::=        means "is defined as"
  |          means "or"
  < >        enclose a category name
  Terminal   a basic component

Patterns
  <X> ::= a b c         a sequence
  <Y> ::= a | b | c     a choice among alternatives
  <Z> ::= a | a <Z>     one or more (defined recursively)
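To make the three patterns concrete, here is a minimal sketch of hand-written matchers, one per rule, over a list of string tokens. The class and method names are hypothetical; a real recognizer compiles grammars rather than hard-coding them like this.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-written matchers for the three BNF patterns; tokens are plain strings. */
class BnfPatterns {

    /** <X> ::= a b c  (a sequence: exactly a, then b, then c). */
    static boolean matchesX(List<String> tokens) {
        return tokens.equals(Arrays.asList("a", "b", "c"));
    }

    /** <Y> ::= a | b | c  (a choice: exactly one of the alternatives). */
    static boolean matchesY(List<String> tokens) {
        return tokens.size() == 1
                && Arrays.asList("a", "b", "c").contains(tokens.get(0));
    }

    /** <Z> ::= a | a <Z>  (one or more a's, expressed recursively). */
    static boolean matchesZ(List<String> tokens) {
        if (tokens.isEmpty() || !tokens.get(0).equals("a")) {
            return false;
        }
        // First alternative: a single "a". Second: "a" followed by another <Z>.
        return tokens.size() == 1 || matchesZ(tokens.subList(1, tokens.size()));
    }

    public static void main(String[] args) {
        System.out.println(matchesX(Arrays.asList("a", "b", "c"))); // true
        System.out.println(matchesY(Arrays.asList("b")));           // true
        System.out.println(matchesZ(Arrays.asList("a", "a", "a"))); // true
        System.out.println(matchesZ(Arrays.asList("a", "b")));      // false
    }
}
```

The recursion in `matchesZ` mirrors the rule itself: the base case is the first alternative and the recursive call is the second.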

An Example Grammar for a speech recognition calculator

Reference: Grammar creation in C#
https://msdn.microsoft.com/en-us/library/hh538495%28v=office.14%29.aspx

Speech to Text in C#

using System;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Threading;

static ManualResetEvent _completed = null;

static void Main(string[] args) {
    _completed = new ManualResetEvent(false);
    SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

    // Each grammar recognizes a single phrase
    _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("test")) { Name = "testGrammar" });
    _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("exit")) { Name = "exitGrammar" });
    _recognizer.SpeechRecognized += _recognizer_SpeechRecognized; // add an event handler

    _recognizer.SetInputToDefaultAudioDevice();
    _recognizer.RecognizeAsync(RecognizeMode.Multiple);
    …
    _completed.WaitOne();   // wait until speech recognition is completed
    _recognizer.Dispose();  // dispose the speech recognition engine
}

Speech to Text in C# (Cont.)

// Handlers are static so that the static Main method can subscribe them
static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
    if (e.Result.Text == "test") {
        Console.WriteLine("The test was successful!");
    }
    else if (e.Result.Text == "exit") {
        _completed.Set();  // release the wait in Main
    }
}

static void _recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e) {
    if (e.Result.Alternates.Count == 0) {
        Console.WriteLine("Speech rejected. No candidate phrases found.");
        return;
    }
    Console.WriteLine("Speech rejected. Did you mean:");
    foreach (RecognizedPhrase r in e.Result.Alternates) {
        Console.WriteLine(" " + r.Text);
    }
}

Text to Speech in C#

SpeechSynthesizer synthesizer = new SpeechSynthesizer();
synthesizer.Speak("Now the computer is speaking to you.");
...
synthesizer.Dispose(); // dispose the SpeechSynthesizer

References

- SpeechRecognitionEngine Class
  https://msdn.microsoft.com/en-us/library/system.speech.recognition.speechrecognitionengine%28v=vs.110%29.aspx?cs-save-lang=1&cs-lang=vb#code-snippet-1
- Speech recognition, speech to text, text to speech, and speech synthesis in C#
  http://www.codeproject.com/Articles/483347/Speech-recognition-speech-to-text-text-to-speech-a

Visual Studio Speech Recognizer

- Speech Recognition with Visual Studio Examples
  http://www.phon.ucl.ac.uk/courses/spsci/compmeth/speech/recognition.html
  http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
- Grammar Class
  http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
- GrammarBuilder Class
  http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx

Speech Recognition for Java

- Sphinx 4
  - A speech recognition engine written entirely in Java
  - Created by CMU, Sun, Mitsubishi, HP, …
  - Open source
  - Compliant with JSpeech Grammar Format
  - Platform- and vendor-independent
- Programmer’s guide
  http://cmusphinx.sourceforge.net/sphinx4/
- An example
  https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld

A Sample Grammar in Java

#JSGF V1.0;

public <workProgram> = <ask> <action> <program>;
<ask> = please | could you;
<action> = start | open | stop | close | kill | shut down;
<program> = word | excel | out look | note pad;
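Because this grammar is finite, the set of utterances it accepts can be listed exhaustively. The sketch below does that in plain Java, without Sphinx 4, purely to show what the rules cover; the class name and the whitespace/case normalization are assumptions for illustration, not part of JSGF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Enumerates the sentences of the sample <workProgram> grammar (illustration only). */
class WorkProgramGrammar {
    static final List<String> ASK = Arrays.asList("please", "could you");
    static final List<String> ACTION =
            Arrays.asList("start", "open", "stop", "close", "kill", "shut down");
    static final List<String> PROGRAM =
            Arrays.asList("word", "excel", "out look", "note pad");

    /** Every sentence of <workProgram> = <ask> <action> <program>. */
    static List<String> sentences() {
        List<String> out = new ArrayList<>();
        for (String ask : ASK) {
            for (String action : ACTION) {
                for (String program : PROGRAM) {
                    out.add(ask + " " + action + " " + program);
                }
            }
        }
        return out;
    }

    /** True if the utterance, after lowercasing and collapsing spaces, is in the grammar. */
    static boolean inGrammar(String utterance) {
        String normalized =
                utterance.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return sentences().contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println(sentences().size());             // 48 (2 * 6 * 4)
        System.out.println(inGrammar("Please start word")); // true
        System.out.println(inGrammar("Start word"));        // false: <ask> is required
    }
}
```

Note that a bare command such as "Start word" is out of grammar here, because `<ask>` is not optional in the rule as written.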

Android Speech Recognition

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.view.Menu;
import android.view.View;
import android.widget.Button;
import android.widget.TextView;
import android.widget.Toast;
import java.util.ArrayList;
import java.util.Locale;

public class MainActivity extends Activity {
    private static final int VOICE_RECOGNITION = 1;
    Button speakButton;
    TextView spokenWords;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        speakButton = (Button) findViewById(R.id.button1);
        spokenWords = (TextView) findViewById(R.id.textView1);
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(R.menu.main, menu);
        return true;
    }

    // Click handler for the speak button: launches the system recognizer
    public void btnSpeak(View view) {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        // Specify free form input
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
        intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
        // EXTRA_LANGUAGE expects a language tag string rather than a Locale object
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH.toString());
        startActivityForResult(intent, VOICE_RECOGNITION);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
            ArrayList<String> results =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            // TODO Do something with the recognized voice strings
            Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
            spokenWords.setText(results.get(0));
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}

Android and Web Speech Recognition

- Android Voice Recognition Tutorial
  http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html
  http://code4reference.com/2012/07/tutorial-android-voice-recognition/
- Google Web Speech Recognition Examples
  http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/
  http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api
  http://apprentice.craic.com/tutorials/37

Challenges for VUI Design

- People have very little patience for a “machine that does not understand”
- VUIs need to respond to input reliably, or they will be rejected by their users
- Designing a usable VUI requires interdisciplinary talents in computer science, linguistics, and human factors
- The closer the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction

Natural Language Understanding

Ambiguity
- Refers to phrases that look distinct in print but sound similar when spoken, for example, “Wreck a nice beach” and “Recognize speech”
- As the vocabulary and grammar get larger, the potential for ambiguity increases
- Short words and phrases are harder to recognize than longer ones

Language Understanding (Cont.)

Deviation
- The user deviates from what the developer expects
- For example, an issue with the question “Is that correct?”
  - The developer expects a simple response like “Yes”, “No”, or “Correct”
  - Southern speakers would respond with “Yes, ma’am” or “No, ma’am”
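One way to tolerate this kind of deviation is to strip polite filler before interpreting the answer. The sketch below is an assumption-laden illustration: the class name, the word lists, and the rule "exactly one content word must remain" are all invented here, and a deployed system would more likely widen the grammar itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Interprets yes/no answers while ignoring polite filler (illustrative word lists). */
class YesNoParser {
    enum Answer { YES, NO, UNKNOWN }

    static final Set<String> YES_WORDS = new HashSet<>(Arrays.asList("yes", "correct"));
    static final Set<String> NO_WORDS = new HashSet<>(Arrays.asList("no"));
    // Assumed filler words; extend for the expected user population.
    static final Set<String> FILLERS = new HashSet<>(Arrays.asList("ma'am", "sir"));

    static Answer parse(String utterance) {
        // Lowercase and map the typographic apostrophe to a plain one.
        String text = utterance.toLowerCase(Locale.ROOT).replace('\u2019', '\'');
        String content = null;
        for (String token : text.split("[^a-z']+")) {
            if (token.isEmpty() || FILLERS.contains(token)) {
                continue;
            }
            if (content != null) {
                return Answer.UNKNOWN; // more than one content word: do not guess
            }
            content = token;
        }
        if (content == null) {
            return Answer.UNKNOWN;
        }
        if (YES_WORDS.contains(content)) {
            return Answer.YES;
        }
        return NO_WORDS.contains(content) ? Answer.NO : Answer.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(parse("Yes, ma'am")); // YES
        System.out.println(parse("No, ma'am"));  // NO
        System.out.println(parse("Maybe"));      // UNKNOWN
    }
}
```

Returning `UNKNOWN` instead of guessing leaves room for the dialog to re-prompt the user.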

Discussion

What would you expect if the user asks to start Microsoft Word?
- Please start word
- Could you start word
- Start word
- Please open word
- Could you open word
- Open word

Language Understanding (Cont.)

Keyword Extraction
- Important for applications built with a speech recognizer that returns a string containing the actual words spoken by the user
  - This leaves the application to interpret their semantic meaning
- One might say “Computer, find me some information about the flooding in Detroit recently”
  - Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI
  - The other words are filler
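A minimal sketch of keyword extraction by filler-word removal follows, using the Detroit sentence as input. The filler list is an assumption chosen by hand for this one sentence; a real application would need a much broader list or a statistical approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Keeps the words of an utterance that are not in a filler list (illustration only). */
class KeywordExtractor {
    // Assumed filler words, tailored by hand to the example sentence.
    static final Set<String> FILLERS = new HashSet<>(Arrays.asList(
            "computer", "me", "some", "information", "about", "the", "in", "recently"));

    /** Returns the non-filler words in their spoken order, lowercased. */
    static List<String> keywords(String utterance) {
        List<String> result = new ArrayList<>();
        for (String token : utterance.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !FILLERS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String said = "Computer, find me some information about the flooding in Detroit recently";
        System.out.println(keywords(said)); // [find, flooding, detroit]
    }
}
```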

Dialog Management

- Multimodality
  - Interaction can occur through different mediums
  - Need to consider when, and in which parts of the application, to allow multimodal interaction
- Grammar
  - There is a close relationship between what a prompt says and what the caller ends up saying to the system, especially the words used
- Configuration files
  - You may choose the confidence level at which the recognizer will reject the input rather than return the answer
  - You may also choose parameters for the endpointer, that is, how long it should listen before timing out

Dialog Management (Cont.)

- Error handling
  - Allow the user to recover after errors and get the dialog back on track
  - Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application.
- Voice recognition accuracy
  - In-grammar data
  - Out-of-grammar data

Error Handling

- In-grammar data
  - Correct Accept: the recognizer returned the correct answer
  - False Accept: the recognizer returned the wrong answer
  - False Reject: the recognizer could not find a match and gave up
- Out-of-grammar data
  - Correct Reject: the recognizer correctly rejected the input
  - False Accept: the recognizer returned a value that is wrong because the input is not in the grammar

How should each category be handled?
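Before deciding how to handle each category, it helps to be able to label test utterances consistently. The sketch below classifies one recognition attempt into the categories above, assuming a reference transcript is available for each test utterance; the class, method, and parameter names are hypothetical.

```java
/** Labels a recognition attempt with one of the outcome categories (illustration only). */
class RecognitionOutcome {
    enum Category { CORRECT_ACCEPT, FALSE_ACCEPT, FALSE_REJECT, CORRECT_REJECT }

    /**
     * @param inGrammar whether what the user actually said is covered by the grammar
     * @param spoken    what the user actually said (the reference transcript)
     * @param returned  the recognizer's result, or null if it rejected the input
     */
    static Category classify(boolean inGrammar, String spoken, String returned) {
        if (returned == null) {
            // A rejection is right only when the input was out of grammar.
            return inGrammar ? Category.FALSE_REJECT : Category.CORRECT_REJECT;
        }
        if (inGrammar && returned.equals(spoken)) {
            return Category.CORRECT_ACCEPT;
        }
        // Wrong answer for in-grammar input, or any answer for out-of-grammar input.
        return Category.FALSE_ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, "open word", "open word"));   // CORRECT_ACCEPT
        System.out.println(classify(true, "open word", "open excel"));  // FALSE_ACCEPT
        System.out.println(classify(true, "open word", null));          // FALSE_REJECT
        System.out.println(classify(false, "launch word", null));       // CORRECT_REJECT
        System.out.println(classify(false, "launch word", "open word")); // FALSE_ACCEPT
    }
}
```

Counting these labels over a test set shows whether a rejection threshold is too strict (many false rejects) or too lenient (many false accepts).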

Error Handling in Android

Documents
