Speech Recognition on Mobile Devices

  • Published on
    15-Nov-2015


DESCRIPTION

This presentation is based on an IEEE paper describing recent research on improving speech recognition on mobile devices.

Transcript

Rethinking Speech Recognition on Mobile Devices

Anuj Kumar (1), Anuj Tewari (2), Seth Horrigan (2), Matthew Kam (1), Florian Metze (1), John Canny (2)
(1) Carnegie Mellon University, USA; (2) University of California, Berkeley, USA

Mobile devices have widely penetrated the market

Mobile phones have proliferated both in developing nations and in the low-socioeconomic communities of the developed world, even more widely than personal computers.

Text input vs. speech

Besides being more intuitive, speech offers several advantages:

  • With speech, interaction becomes independent of device size.
  • If accurately recognized, speech is three times faster than QWERTY input (Basapur et al. '07).
  • Speech is the only plausible input modality for 800 million non-literate users.

[Chart: typing speeds compared across speech, handwriting, QWERTY, predictive text, and multi-tap input.]

Given the wide availability of mobile phones and the effectiveness of speech as a communication medium, this project proposes several automatic speech recognition (ASR) models for mobile devices.

Automatic Speech Recognition (ASR): framework

ASR converts spoken words into text. The pipeline is: speech signal, feature extraction, decoder (combining the acoustic model, language model, and phonetic dictionary), recognition result; the models are learned from training data (speech utterances) via machine learning. Its components:

  • Feature extractor: extracts feature vectors from the speech signal.
  • Acoustic model (AM): models the acoustic properties of the speech signal.
  • Phonetic dictionary: maps words to phones.
  • Language model (LM): defines which words can follow the previously recognized words, narrowing the word search.
  • Decoder: integrates the AM and LM with the phonetic dictionary to generate hypotheses for the input speech.

ASR on mobile: challenges

  • Limited storage and a small cache (8-32 KB).
  • Cheap, variable microphones.
  • No hardware support for floating-point calculations.
  • Low processor clock frequency.
  • Energy constraints.
  • Challenging acoustic environments, such as heavy traffic noise in the background and reverberation from multiple speakers talking simultaneously.

Embedded mobile speech recognition

All stages (feature extraction, acoustic and language models, and the ASR search) run on the device.

Pros:
  • No network required.
  • No performance drop from data loss in transmission.
  • No data-transmission cost.
  • No latency.

Cons:
  • Mobile hardware is not as capable as a central server in speed or memory.

Speech recognition in the cloud

The device encodes the user's speech with a speech coder and transmits it; the server extracts features from the codec parameters and runs the ASR search.

Pros:
  • Better speed and accuracy in ASR, owing to the server's superior configuration.
  • Updating the central system is enough to update every client on the network.
  • Even cheap low-end phones work well.

Cons:
  • Performance degrades due to loss during data transmission.
  • Acoustic models on the central server must account for large variation across the different channels.
  • Each data transfer over the telephone network can cost the end user money.

Distributed speech recognition

Features are extracted and compressed on the device, then reconstructed on the server, which runs the ASR search.

Pros:
  • All the advantages of the cloud model, with less data loss, since features rather than coded speech are transmitted and decoded at low bit rates.

Cons:
  • Cost.
  • Requires a continuous, reliable cellular connection.
  • Requires a standardized feature-extraction process to handle variation across channels (microphone, audio card) and accents.

Shared speech recognition with user-based adaptation (proposed)

The device extracts and compresses features and runs a local ASR search against user-adapted acoustic and language models. Only when a network is detected does it transmit data (features plus extracted metadata) to the server, which performs user-context-dependent analysis and sends updated models back to the device.

Pros:
  • Takes the best of both worlds: server-based systems (high accuracy, easy updates and maintenance, moderate demands on mobile hardware) and local ASR (works without a network).
  • A given phone is used by few people, so the device itself can cater to them.
  • The central server only performs adaptation, which need not happen in real time, so it can be provisioned for average rather than peak traffic.

What to adapt for?

  • Device-specific pre-processing, for robustness and signal normalization.
  • User-specific vocabularies and language models, to capture dialectal and idiolectal variation as well as accented speech.
  • Speaker-specific acoustic models, via feature- and/or model-based adaptation techniques.
  • Tuning of parameters.

Results in the laboratory: operating systems

  • Desktop (resource-rich): a PC running Ubuntu Linux 8.04 in VMware Server 1.0.6 on Microsoft Windows Server 2003, with an Intel Pentium D clocked at 2.8 GHz and 4 GB of RAM.
  • Mobile (resource-constrained): a Nokia N800 Internet Tablet running Maemo Linux OS2008, with a TI OMAP 2420 ARM processor clocked at 330 MHz and 128 MB of RAM.

Results in the laboratory: datasets

  • DARPA Resource Management (RM-1) corpus: 1600 training utterances; 1000 tied Gaussian mixture models (senones); 30080 context-dependent triphones; bigram statistical language model with a 993-word vocabulary and a language weight of 9.5.
  • ICSI Meeting Recorder (ICSI) corpus: 90314 training utterances; 1000 senones; 104082 context-dependent triphones; trigram statistical language model with an 11908-word vocabulary and a language weight of 9.5.
  • Test data: 365 random utterances from RM-1 and 400 from ICSI.

Results in the laboratory: toolkits

  • PocketSphinx
  • SphinxTiny
  • Sphinx-3.7

Results in the laboratory: RM-1 dataset

  Desktop         Speed (xRT)   WER     SER
  PocketSphinx    0.05          6.0%    28.2%
  Sphinx-3.7      0.19          7.3%    34.2%
  SphinxTiny      0.37          7.3%    35.1%

  Mobile          Speed (xRT)   WER     SER
  PocketSphinx    0.53          6.0%    28.2%
  Sphinx-3.7      24.34         7.3%    34.2%
  SphinxTiny      2.58          7.3%    35.1%

Results in the laboratory: ICSI dataset

  Desktop         Speed (xRT)   WER     SER
  PocketSphinx    1.20          55.1%   72.8%
  Sphinx-3.7      0.75          39.6%   69.5%
  SphinxTiny      1.23          39.4%   70.3%

  Mobile          Speed (xRT)   WER     SER
  PocketSphinx    10.65         55.1%   72.8%
  Sphinx-3.7      68.49         39.6%   69.5%
  SphinxTiny      8.91          39.4%   70.3%

Results in the laboratory: findings

  • On a small-vocabulary task, PocketSphinx outperforms SphinxTiny on both accuracy and speed.
  • As the complexity of the acoustic and language models increases, SphinxTiny's accuracy surpasses PocketSphinx's.
  • PocketSphinx is superior with small acoustic and language models for real-time recognition; for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice.

Conclusion

The results highlight the feasibility of speech recognition on mobile devices and its considerable potential in education and other fields, in developing nations and in the low-socioeconomic communities of developed countries.

Thank you! Questions?

Presented by: ABHISHEK GARG (208/CO/11), Netaji Subhas Institute of Technology
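The result tables report word error rate (WER), sentence error rate (SER), and speed as a real-time factor (xRT). As an illustration of how these metrics are conventionally computed (not code from the presentation itself), here is a minimal Python sketch: WER is the word-level edit distance between reference and hypothesis divided by the reference length, and xRT is processing time divided by audio duration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """xRT: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds
```

For example, one substitution against a three-word reference gives a WER of 1/3, and decoding 10 s of audio in 5 s gives 0.5 xRT, which explains why Sphinx-3.7's 68.49 xRT on the mobile ICSI task is unusable interactively despite its lower WER.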
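The experiments mention bigram and trigram language models with a "language weight" of 9.5. To show the role that weight plays in decoding, here is a toy sketch (the bigram probabilities are invented for illustration and this is not the Sphinx implementation): the decoder ranks hypotheses by acoustic log-probability plus the language weight times the LM log-probability, so a larger weight pushes recognition toward word sequences the LM considers likely.

```python
import math

# Toy bigram LM: P(word | previous word). These probabilities are made up
# purely for illustration; a real LM is estimated from a training corpus.
BIGRAM = {
    ("<s>", "turn"): 0.4, ("turn", "left"): 0.5, ("turn", "right"): 0.5,
    ("<s>", "stop"): 0.6,
}

def lm_logprob(words, floor=1e-6):
    """Sum of log bigram probabilities, with a small floor for unseen pairs."""
    total, prev = 0.0, "<s>"
    for w in words:
        total += math.log(BIGRAM.get((prev, w), floor))
        prev = w
    return total

def combined_score(acoustic_logprob, words, language_weight=9.5):
    # Rank hypotheses by AM score plus LM score scaled by the language
    # weight (9.5 in the experiments reported above).
    return acoustic_logprob + language_weight * lm_logprob(words)
```

With this scoring, a hypothesis like "turn left" beats an acoustically slightly better but linguistically implausible one like "left turn", which is exactly how the LM "reduces the word search" described in the framework slide.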
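The front end's feature extractor operates on short overlapping frames of the waveform before computing features. A minimal sketch of that first framing step, assuming a 16 kHz sample rate and the common 25 ms window with 10 ms hop (values not stated in the presentation):

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split a 16 kHz signal into 25 ms frames with a 10 ms hop
    (400 and 160 samples respectively, typical ASR front-end values)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Log frame energy, one of the simplest per-frame acoustic features;
    the small constant avoids log(0) on silent frames."""
    return math.log(sum(s * s for s in frame) + 1e-10)
```

In the distributed and proposed architectures above, it is such per-frame features, rather than the coded waveform, that are compressed and sent to the server, which is why their transmitted bit rate can be kept low.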
