Speech Recognition

Yonglei Tao


Voice-Activated GPS


Voice User Interface (VUI)

- A VUI allows human interaction with computers through a voice/speech platform
- Basic components
  - Speech recognition
  - Meaning extraction
  - Response generation
  - Speech output
- Benefits
  - Loosens some physical constraints
  - Provides tools for universal design (disability and situational impairments)
  - Intuitive and efficient


System Architecture


Components

- Endpointing
  - Speech to endpointed utterance
- Feature extraction
  - Endpointed utterance to feature vectors
- Recognition
  - Feature vectors to word string(s)
- Natural language understanding
  - Word string(s) to meaning(s)
- Dialog management
  - Meaning to actions


Typical Recognition Components


Examples

- Book, boot
- Write, right
- Flew, flu, flue
- Eight books / Ate books
- I scream / Ice cream


Components

- Acoustic models
  - Internal representation of each basic sound
- Dictionary
  - A list of words and pronunciations
- Grammar
  - Defines all possible strings of words the recognizer can handle
  - Allows a meaning to be associated with those strings
  - Either rule-based or statistical


Recognition

- Recognition search
  - A recognizer searches the recognition model to find the best-matching word string
- Confidence measures
  - A quantitative measure of how confident the recognizer is in the best-matching string
  - VUI developers can use those measures in several ways
- N-best processing
  - A recognizer returns several results with a confidence measure for each
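
One way a VUI developer might use confidence measures together with an N-best list is sketched below in plain Java. It picks the highest-confidence hypothesis, then accepts it, asks the user to confirm it, or rejects it. The `Hypothesis` class, the `Decision` enum, and the thresholds 0.8 and 0.5 are assumptions made for illustration; they are not part of any particular recognizer's API, and real thresholds would need tuning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical N-best handling; all names and thresholds are illustrative assumptions.
final class NBestPicker {

    /** One recognition hypothesis: a word string plus a confidence in [0, 1]. */
    static final class Hypothesis {
        final String text;
        final double confidence;

        Hypothesis(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    enum Decision { ACCEPT, CONFIRM, REJECT }

    // Assumed thresholds; a real application would tune these on recorded data.
    static final double ACCEPT_AT = 0.8;
    static final double CONFIRM_AT = 0.5;

    /** Returns the hypothesis with the highest confidence, or null for an empty list. */
    static Hypothesis best(List<Hypothesis> nBest) {
        Hypothesis top = null;
        for (Hypothesis h : nBest) {
            if (top == null || h.confidence > top.confidence) {
                top = h;
            }
        }
        return top;
    }

    /** Accept a confident result, confirm a borderline one ("Did you say ...?"), else reject. */
    static Decision decide(List<Hypothesis> nBest) {
        Hypothesis top = best(nBest);
        if (top == null || top.confidence < CONFIRM_AT) {
            return Decision.REJECT;
        }
        return top.confidence >= ACCEPT_AT ? Decision.ACCEPT : Decision.CONFIRM;
    }

    public static void main(String[] args) {
        List<Hypothesis> nBest = Arrays.asList(
                new Hypothesis("ice cream", 0.62),
                new Hypothesis("I scream", 0.31));
        System.out.println(best(nBest).text + " -> " + decide(nBest));
    }
}
```

The middle "confirm" band is a design choice: it trades one extra dialog turn for fewer wrong actions when the recognizer is unsure.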


Speech Recognition Engines

- Microsoft Visual Studio & CMU Sphinx
  - Grammar
- Android
  - Language model: free form for dictation, or web search for short phrases
- Google Web Speech API for Web Applications


BNF (Backus-Naur Form)

- Notation for context-free grammars
  - Often used to describe the syntax of programming languages
  - Also used to specify the words and patterns of words to be listened for by a speech recognizer
- EBNF (Extended Backus-Naur Form)
- ABNF (Augmented Backus-Naur Form)
- Basis for speech grammar specifications
  - ABNF for .NET
  - Regular grammar for Java


Basics

- ::= meaning "is defined as"
- | meaning "or"
- < > enclose a category name (non-terminal)
- Terminal: a basic component

    <X> ::= a b c        a sequence
    <Y> ::= a | b | c    a choice among alternatives
    <Z> ::= a | a <Z>    one or more (recursion)
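
The three rule shapes above can be made concrete with a small matcher. The Java sketch below treats the terminals a, b, and c as literal whitespace-separated tokens and checks a token list against each rule. The class and method names are hypothetical, and hand-written matchers like these are only an illustration of what the notation means, not how a recognizer compiles a grammar.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative matchers for the three BNF rule shapes; terminals are the literal tokens a, b, c.
final class BnfBasics {

    /** <X> ::= a b c  (a sequence: exactly these tokens, in this order). */
    static boolean matchesX(List<String> tokens) {
        return tokens.equals(Arrays.asList("a", "b", "c"));
    }

    /** <Y> ::= a | b | c  (a choice: exactly one of the alternatives). */
    static boolean matchesY(List<String> tokens) {
        return tokens.size() == 1 && Arrays.asList("a", "b", "c").contains(tokens.get(0));
    }

    /** <Z> ::= a | a <Z>  (recursion: one or more a's). */
    static boolean matchesZ(List<String> tokens) {
        if (tokens.isEmpty() || !tokens.get(0).equals("a")) {
            return false;
        }
        // Either the single token "a", or "a" followed by another <Z>.
        return tokens.size() == 1 || matchesZ(tokens.subList(1, tokens.size()));
    }

    /** Splits an utterance into whitespace-separated tokens. */
    static List<String> tokenize(String utterance) {
        return Arrays.asList(utterance.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(matchesX(tokenize("a b c"))); // true
        System.out.println(matchesY(tokenize("b")));     // true
        System.out.println(matchesZ(tokenize("a a a"))); // true
        System.out.println(matchesZ(tokenize("a b")));   // false
    }
}
```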


An Example

- Grammar for a speech recognition calculator
- Reference: Grammar creation in C#
  https://msdn.microsoft.com/en-us/library/hh538495%28v=office.14%29.aspx


Speech to Text in C#

    using System;
    using System.Speech.Recognition;
    using System.Speech.Synthesis;
    using System.Threading;

    static ManualResetEvent _completed = null;

    static void Main(string[] args) {
        _completed = new ManualResetEvent(false);
        SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

        // Load one single-phrase grammar per command
        _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("test")) { Name = "testGrammar" });
        _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("exit")) { Name = "exitGrammar" });

        _recognizer.SpeechRecognized += _recognizer_SpeechRecognized; // add an event handler
        _recognizer.SpeechRecognitionRejected += _recognizer_SpeechRecognitionRejected; // handle rejected input

        _recognizer.SetInputToDefaultAudioDevice();
        _recognizer.RecognizeAsync(RecognizeMode.Multiple); // keep recognizing until stopped
        …
        _completed.WaitOne();  // wait until speech recognition is completed
        _recognizer.Dispose(); // dispose the speech recognition engine
    }


Speech to Text in C# (Cont.)

    // Called when the recognizer matches one of the loaded grammars
    static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e) {
        if (e.Result.Text == "test") {
            Console.WriteLine("The test was successful!");
        }
        else if (e.Result.Text == "exit") {
            _completed.Set(); // signal Main to stop waiting
        }
    }

    // Called when the input could not be matched with sufficient confidence
    static void _recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e) {
        if (e.Result.Alternates.Count == 0) {
            Console.WriteLine("Speech rejected. No candidate phrases found.");
            return;
        }
        Console.WriteLine("Speech rejected. Did you mean:");
        foreach (RecognizedPhrase r in e.Result.Alternates) {
            Console.WriteLine(" " + r.Text);
        }
    }


Text to Speech in C#

    SpeechSynthesizer _synthesizer = new SpeechSynthesizer();
    _synthesizer.Speak("Now the computer is speaking to you.");
    ...
    _synthesizer.Dispose(); // dispose the SpeechSynthesizer


Visual Studio Speech Recognizer


Speech Recognition for Java

- Sphinx 4
  - A speech recognition engine written entirely in Java
  - Created by CMU, Sun, Mitsubishi, HP, …
  - Open source
  - Compliant with JSpeech Grammar Format
  - Platform- and vendor-independent
- Programmer’s guide
  http://cmusphinx.sourceforge.net/sphinx4/
- An example
  https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld


A Sample Grammar in Java

    #JSGF V1.0;

    public <workProgram> = <ask> <action> <program>;
    <ask> = please | could you;
    <action> = start | open | stop | close | kill | shut down;
    <program> = word | excel | out look | note pad;
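
To see which utterances the sample grammar above covers, the following Java sketch enumerates every sentence that `<workProgram>` can produce and checks an utterance against that set. It is plain Java and does not use the Sphinx 4 API; the class name `WorkProgramGrammar` and its methods are hypothetical, and brute-force enumeration is only practical because this grammar is tiny.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hand-rolled illustration of the sample grammar; not how a recognizer evaluates JSGF.
final class WorkProgramGrammar {

    static final String[] ASK = {"please", "could you"};
    static final String[] ACTION = {"start", "open", "stop", "close", "kill", "shut down"};
    static final String[] PROGRAM = {"word", "excel", "out look", "note pad"};

    /** Expands <workProgram> = <ask> <action> <program> into every sentence it allows. */
    static Set<String> sentences() {
        Set<String> all = new LinkedHashSet<>();
        for (String ask : ASK) {
            for (String action : ACTION) {
                for (String program : PROGRAM) {
                    all.add(ask + " " + action + " " + program);
                }
            }
        }
        return all;
    }

    /** True when the utterance, ignoring case and extra spaces, is in the grammar. */
    static boolean accepts(String utterance) {
        String normalized = utterance.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return sentences().contains(normalized);
    }

    public static void main(String[] args) {
        System.out.println(sentences().size());           // 48 (2 x 6 x 4)
        System.out.println(accepts("Please start word")); // true
        System.out.println(accepts("Start word"));        // false: <ask> is not optional
    }
}
```

Because `<ask>` is a required part of the rule, a bare command such as "start word" is out of grammar; making the politeness phrase optional would be a change to the grammar itself.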


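Android Speech Recognition

    import android.app.Activity;
    import android.content.Intent;
    import android.os.Bundle;
    import android.speech.RecognizerIntent;
    import android.view.Menu;
    import android.view.View;
    import android.widget.Button;
    import android.widget.TextView;
    import android.widget.Toast;
    import java.util.ArrayList;
    import java.util.Locale;

    public class MainActivity extends Activity {

        private static final int VOICE_RECOGNITION = 1; // request code for the recognizer activity
        Button speakButton;
        TextView spokenWords;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            speakButton = (Button) findViewById(R.id.button1);
            spokenWords = (TextView) findViewById(R.id.textView1);
        }

        @Override
        public boolean onCreateOptionsMenu(Menu menu) {
            // Inflate the menu; this adds items to the action bar if it is present.
            getMenuInflater().inflate(R.menu.main, menu);
            return true;
        }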


        public void btnSpeak(View view) {
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            // Specify free form input
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
            intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
            // EXTRA_LANGUAGE expects a language tag string such as "en"
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH.toString());
            startActivityForResult(intent, VOICE_RECOGNITION);
        }

        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
                ArrayList<String> results =
                        data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                // TODO Do something with the recognized voice strings
                if (results != null && !results.isEmpty()) {
                    Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
                    spokenWords.setText(results.get(0));
                }
            }
            super.onActivityResult(requestCode, resultCode, data);
        }
    }


Android and Web Speech Recognition

- Android Voice Recognition Tutorial
  - http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html
  - http://code4reference.com/2012/07/tutorial-android-voice-recognition/
- Google Web Speech Recognition Examples
  - http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/
  - http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api
  - http://apprentice.craic.com/tutorials/37


Challenges for VUI Design

- People have very little patience for a “machine that does not understand”
  - VUIs need to respond to input reliably, or they will be rejected by their users
- Designing a usable VUI requires interdisciplinary talents in computer science, linguistics, and human factors
- The closer the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction


Natural Language Understanding

- Ambiguity
  - Refers to phrases that look distinct in print but sound similar when spoken, for example, “Wreck a nice beach” and “Recognize speech”
  - As the vocabulary and grammar get larger, the potential for ambiguity increases
  - Short words and phrases are harder to recognize than longer ones


Language Understanding (Cont.)

- Deviation
  - Deviating from what the developer expects
  - For example, an issue with the question “Is that correct?”
    - Expecting a simple response like “Yes”, “No”, or “Correct”
    - Southern speakers would respond with “Yes, ma’am” or “No, ma’am”


Discussion

- What would you expect if the user asks to start Microsoft Word?
  - Please start word
  - Could you start word
  - Start word
  - Please open word
  - Could you open word
  - Open word


Language Understanding (Cont.)

- Keyword extraction
  - Important for applications built with a speech recognizer that returns a string containing the actual words spoken by the user
    - Leaving the application to interpret their semantic meaning
  - One might say “Computer, find me some information about the flooding in Detroit recently”
  - Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI
    - Others are filler words
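
A very simple form of keyword extraction is to drop filler words using a stop list, as in the Java sketch below. The `FILLER` set is an assumption hand-picked so that the Detroit example sentence reduces to the three keywords named on the slide; a real application would need a much broader list or a statistical approach, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stop-list keyword extraction; the filler list is an illustrative assumption.
final class KeywordExtractor {

    static final Set<String> FILLER = new HashSet<>(Arrays.asList(
            "computer", "me", "some", "information", "about", "the", "in", "recently"));

    /** Returns the words of the utterance that are not in the filler list, in order. */
    static List<String> keywords(String utterance) {
        List<String> kept = new ArrayList<>();
        // Split on anything that is not a letter, so punctuation is discarded.
        for (String token : utterance.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !FILLER.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String said = "Computer, find me some information about the flooding in Detroit recently";
        System.out.println(keywords(said)); // [find, flooding, Detroit]
    }
}
```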


Dialog Management

- Multimodality
  - Interaction can occur through different mediums
  - Need to consider when, and in which parts of the application, to allow multimodal interaction
- Grammar
  - There is a close relationship between what a prompt says and what the caller ends up saying to the system
    - Especially the words used
- Configuration files
  - You may choose the confidence level at which the recognizer will reject the input rather than return the answer
  - You may also choose parameters for the endpointer, that is, how long it should listen before timing out


Dialog Management (Cont.)

- Error handling
  - Allow the user to recover after errors and get the dialog back on track
  - Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application.
- Voice recognition accuracy
  - In-grammar data
  - Out-of-grammar data


Error Handling

- In-grammar data
  - Correct Accept: the recognizer returned the correct answer
  - False Accept: the recognizer returned the wrong answer
  - False Reject: the recognizer could not find a match and gave up
- Out-of-grammar data
  - Correct Reject: the recognizer correctly rejected the input
  - False Accept: the recognizer returned a value that is wrong because the input is not in the grammar
- How should each category be handled?
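
When measuring accuracy on test utterances whose true content is known, each recognition attempt can be sorted into one of the categories above. The Java sketch below shows that bookkeeping; the class, the method signature, and the convention that a rejected input is represented by `null` are all assumptions for illustration, not part of any recognizer's API.

```java
// Sorting one recognition attempt into the slide's categories; names are hypothetical.
final class RecognitionOutcome {

    enum Category { CORRECT_ACCEPT, FALSE_ACCEPT, FALSE_REJECT, CORRECT_REJECT }

    /**
     * @param spoken    what the user actually said (known from a transcribed test set)
     * @param inGrammar whether the spoken phrase is covered by the grammar
     * @param returned  the recognizer's answer, or null if it rejected the input (assumed convention)
     */
    static Category classify(String spoken, boolean inGrammar, String returned) {
        if (returned == null) {
            // Rejecting is wrong for in-grammar input and right for out-of-grammar input.
            return inGrammar ? Category.FALSE_REJECT : Category.CORRECT_REJECT;
        }
        if (inGrammar && returned.equals(spoken)) {
            return Category.CORRECT_ACCEPT;
        }
        // Either the wrong in-grammar answer, or any answer at all for out-of-grammar input.
        return Category.FALSE_ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(classify("please open word", true, "please open word"));  // CORRECT_ACCEPT
        System.out.println(classify("please open word", true, "please open excel")); // FALSE_ACCEPT
        System.out.println(classify("please open word", true, null));                // FALSE_REJECT
        System.out.println(classify("play some music", false, null));                // CORRECT_REJECT
        System.out.println(classify("play some music", false, "please open word"));  // FALSE_ACCEPT
    }
}
```

Counting these categories over a test set shows the trade-off behind the rejection threshold: raising it tends to turn false accepts into rejects, at the cost of more false rejects.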


Error Handling in Android