Speech Recognition on Mobile Devices

  • Published on
    15-Nov-2015


DESCRIPTION

This presentation is based on an IEEE paper describing recent research on improving speech recognition on mobile devices.

Transcript

Rethinking Speech Recognition on Mobile Devices

Anuj Kumar (1), Anuj Tewari (2), Seth Horrigan (2), Matthew Kam (1), Florian Metze (1), John Canny (2)
(1) Carnegie Mellon University, USA; (2) University of California, Berkeley, USA

Mobile devices have widely penetrated the market

Mobile phones have proliferated both in developing nations and in the low-socioeconomic communities of the developed world, even more widely than personal computers.

Text input vs. speech

Besides being more intuitive, speech offers several advantages:

  • With speech, interaction becomes independent of device size.
  • If accurately recognized, speech is three times faster than QWERTY input (Basapur et al. '07).
  • Speech is the only plausible input modality for 800 million non-literate users.

[Chart: typing speeds compared across speech, handwriting, QWERTY, predictive text, and multi-tap input.]

Given the wide availability of mobile phones and the effectiveness of speech as a communication medium, this project proposes several automatic speech recognition (ASR) models for mobile devices.

Automatic Speech Recognition (ASR): framework

ASR converts spoken words into text. The pipeline is: speech signal, feature extraction, decoder (combining the acoustic model, language model, and phonetic dictionary), recognition result; the models are learned from training data (speech utterances) via machine learning. Its components:

  • Feature extractor: extracts feature vectors from the speech signal.
  • Acoustic model (AM): models the acoustic properties of the speech signal.
  • Phonetic dictionary: maps words to phones.
  • Language model (LM): defines which words can follow the previously recognized words, narrowing the word search.
  • Decoder: integrates the AM and LM with the phonetic dictionary to generate hypotheses for the input speech.

ASR on mobile: challenges

  • Limited storage and a small cache (8-32 KB).
  • Cheap, variable microphones.
  • No hardware support for floating-point calculations.
  • Low processor clock frequency.
  • Energy constraints.
  • Challenging acoustic environments, such as heavy traffic noise in the background and reverberation from multiple speakers talking simultaneously.

Embedded mobile speech recognition

All stages (feature extraction, acoustic and language models, and the ASR search) run on the device.

Pros:
  • No network required.
  • No performance drop from data loss in transmission.
  • No data-transmission cost.
  • No latency.

Cons:
  • Mobile hardware is not as capable as a central server in speed or memory.

Speech recognition in the cloud

The device encodes the user's speech with a speech coder and transmits it; the server extracts features from the codec parameters and runs the ASR search.

Pros:
  • Better speed and accuracy in ASR, owing to the server's superior configuration.
  • Updating the central system is enough to update every client on the network.
  • Even cheap low-end phones work well.

Cons:
  • Performance degrades due to loss during data transmission.
  • Acoustic models on the central server must account for large variation across the different channels.
  • Each data transfer over the telephone network can cost the end user money.

Distributed speech recognition

Features are extracted and compressed on the device, then reconstructed on the server, which runs the ASR search.

Pros:
  • All the advantages of the cloud model, with less data loss, since features rather than coded speech are transmitted and decoded at low bit rates.

Cons:
  • Cost.
  • Requires a continuous, reliable cellular connection.
  • Requires a standardized feature-extraction process to handle variation across channels (microphone, audio card) and accents.

Shared speech recognition with user-based adaptation (proposed)

The device extracts and compresses features and runs a local ASR search against user-adapted acoustic and language models. Only when a network is detected does it transmit data (features plus extracted metadata) to the server, which performs user-context-dependent analysis and sends updated models back to the device.

Pros:
  • Takes the best of both worlds: server-based systems (high accuracy, easy updates and maintenance, moderate demands on mobile hardware) and local ASR (works without a network).
  • A given phone is used by few people, so the device itself can cater to them.
  • The central server only performs adaptation, which need not happen in real time, so it can be provisioned for average rather than peak traffic.

What to adapt for?

  • Device-specific pre-processing, for robustness and signal normalization.
  • User-specific vocabularies and language models, to capture dialectal and idiolectal variation as well as accented speech.
  • Speaker-specific acoustic models, via feature- and/or model-based adaptation techniques.
  • Tuning of parameters.

Results in the laboratory: operating systems

  • Desktop (resource-rich): a PC running Ubuntu Linux 8.04 in VMware Server 1.0.6 on Microsoft Windows Server 2003, with an Intel Pentium D clocked at 2.8 GHz and 4 GB of RAM.
  • Mobile (resource-constrained): a Nokia N800 Internet Tablet running Maemo Linux OS2008, with a TI OMAP 2420 ARM processor clocked at 330 MHz and 128 MB of RAM.

Results in the laboratory: datasets

  • DARPA Resource Management (RM-1) corpus: 1600 training utterances; 1000 tied Gaussian mixture models (senones); 30080 context-dependent triphones; bigram statistical language model with a 993-word vocabulary and a language weight of 9.5.
  • ICSI Meeting Recorder (ICSI) corpus: 90314 training utterances; 1000 senones; 104082 context-dependent triphones; trigram statistical language model with an 11908-word vocabulary and a language weight of 9.5.
  • Test data: 365 random utterances from RM-1 and 400 from ICSI.

Results in the laboratory: toolkits

  • PocketSphinx
  • SphinxTiny
  • Sphinx-3.7

Results in the laboratory: RM-1 dataset

  Desktop         Speed (xRT)   WER     SER
  PocketSphinx    0.05          6.0%    28.2%
  Sphinx-3.7      0.19          7.3%    34.2%
  SphinxTiny      0.37          7.3%    35.1%

  Mobile          Speed (xRT)   WER     SER
  PocketSphinx    0.53          6.0%    28.2%
  Sphinx-3.7      24.34         7.3%    34.2%
  SphinxTiny      2.58          7.3%    35.1%

Results in the laboratory: ICSI dataset

  Desktop         Speed (xRT)   WER     SER
  PocketSphinx    1.20          55.1%   72.8%
  Sphinx-3.7      0.75          39.6%   69.5%
  SphinxTiny      1.23          39.4%   70.3%

  Mobile          Speed (xRT)   WER     SER
  PocketSphinx    10.65         55.1%   72.8%
  Sphinx-3.7      68.49         39.6%   69.5%
  SphinxTiny      8.91          39.4%   70.3%

Results in the laboratory: findings

  • On a small-vocabulary task, PocketSphinx outperforms SphinxTiny on both accuracy and speed.
  • As the complexity of the acoustic and language models increases, SphinxTiny's accuracy surpasses PocketSphinx's.
  • PocketSphinx is superior with small acoustic and language models for real-time recognition; for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice.

Conclusion

The results highlight the feasibility of speech recognition on mobile devices and its considerable potential in education and other fields, in developing nations and in the low-socioeconomic communities of developed countries.

Thank you! Questions?

Presented by: ABHISHEK GARG (208/CO/11), Netaji Subhas Institute of Technology
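The result tables report word error rate (WER), sentence error rate (SER), and speed as a real-time factor (xRT). As an illustration of how these metrics are conventionally computed (not code from the presentation itself), here is a minimal Python sketch: WER is the word-level edit distance between reference and hypothesis divided by the reference length, and xRT is processing time divided by audio duration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """xRT: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds
```

For example, one substitution against a three-word reference gives a WER of 1/3, and decoding 10 s of audio in 5 s gives 0.5 xRT, which explains why Sphinx-3.7's 68.49 xRT on the mobile ICSI task is unusable interactively despite its lower WER.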
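The experiments mention bigram and trigram language models with a "language weight" of 9.5. To show the role that weight plays in decoding, here is a toy sketch (the bigram probabilities are invented for illustration and this is not the Sphinx implementation): the decoder ranks hypotheses by acoustic log-probability plus the language weight times the LM log-probability, so a larger weight pushes recognition toward word sequences the LM considers likely.

```python
import math

# Toy bigram LM: P(word | previous word). These probabilities are made up
# purely for illustration; a real LM is estimated from a training corpus.
BIGRAM = {
    ("<s>", "turn"): 0.4, ("turn", "left"): 0.5, ("turn", "right"): 0.5,
    ("<s>", "stop"): 0.6,
}

def lm_logprob(words, floor=1e-6):
    """Sum of log bigram probabilities, with a small floor for unseen pairs."""
    total, prev = 0.0, "<s>"
    for w in words:
        total += math.log(BIGRAM.get((prev, w), floor))
        prev = w
    return total

def combined_score(acoustic_logprob, words, language_weight=9.5):
    # Rank hypotheses by AM score plus LM score scaled by the language
    # weight (9.5 in the experiments reported above).
    return acoustic_logprob + language_weight * lm_logprob(words)
```

With this scoring, a hypothesis like "turn left" beats an acoustically slightly better but linguistically implausible one like "left turn", which is exactly how the LM "reduces the word search" described in the framework slide.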
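The front end's feature extractor operates on short overlapping frames of the waveform before computing features. A minimal sketch of that first framing step, assuming a 16 kHz sample rate and the common 25 ms window with 10 ms hop (values not stated in the presentation):

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split a 16 kHz signal into 25 ms frames with a 10 ms hop
    (400 and 160 samples respectively, typical ASR front-end values)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Log frame energy, one of the simplest per-frame acoustic features;
    the small constant avoids log(0) on silent frames."""
    return math.log(sum(s * s for s in frame) + 1e-10)
```

In the distributed and proposed architectures above, it is such per-frame features, rather than the coded waveform, that are compressed and sent to the server, which is why their transmitted bit rate can be kept low.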
