Expert System Voice Assistant



  • 8/10/2019 Expert System Voice Assistant


    Expert System Voice Assistant



    Submitted For The Partial Fulfilment Of The Requirement

    For The Award Of Degree Of




    Submitted by:

    1. Aakash Shrivastava (0101CS101001)

    2. Ashish Kumar Namdeo (0101CS101024)

    3. Avinash Dongre (0101CS101026)

    4. Chitransh Surheley (0101CS101031)

    Guided by: Prof. Shikha Agarwal





    2013-2014








    This is to certify that Aakash Shrivastava, Ashish Kumar Namdeo, Avinash Dongre and Chitransh Surheley of B.E. fourth year, Computer Science & Engineering, have completed their major project Expert System Voice Assistant during the academic year 2013-14 under our guidance and supervision.


    We approve the project for submission in partial fulfillment of the requirement for the award of the degree in Computer Science & Engineering.

    Prof. Shikha Agarwal (Project Guide)

    Dr. Sanjay Silakari (Head, CSE Dept.)

    Dr. V. K. Sethi (Director, UIT-RGPV)





    We hereby declare that the work presented in the major project Expert System Voice Assistant, submitted in partial fulfillment of the requirement for the award of the Bachelor's Degree in Computer Science & Engineering, was carried out at the University Institute of Technology, RGPV, Bhopal, and is an authentic record of our work carried out under the guidance of Prof. Shikha Agrawal, Department of Computer Science & Engineering, UIT-RGPV, Bhopal.

    The matter written in this project has not been submitted by us for the award of any other degree.


    Aakash Shrivastava(0101CS101001)

    Ashish Kumar Namdeo(0101CS101024)

    Avinash Dongre(0101CS101026)

    Chitransh Surheley(0101CS101031)





    We take this opportunity to express our cordial gratitude and deep sense of indebtedness to our guide, Prof. Shikha Agrawal, Department of Computer Science and Engineering, for her valuable guidance and inspiration throughout the project. We are thankful to her for the innovative ideas that led to the successful completion of this project work. She always welcomed our problems and helped us clear our doubts. We will always be grateful to her for providing us moral support and sufficient time.

    We owe our sincere thanks to Dr. Sanjay Silakari (HOD, CSE) for his timely help during our project work in the Department.

    At the same time, we would like to thank all the other faculty members and the non-teaching staff of the Computer Science and Engineering Department for their valuable co-operation.

    Aakash Shrivastava(0101CS101001)

    Ashish Kumar Namdeo(0101CS101024)

    Avinash Dongre(0101CS101026)

    Chitransh Surheley(0101CS101031)





    A speech interface to the computer is the next big step that computer science needs to take for general users, and speech recognition will play an important role in bringing technology to them. Our goal is to create speech recognition software that can recognise spoken words. This report takes a brief look at the basic building blocks of speech recognition, speech synthesis and the overall interaction between human and computer. The most important purpose of this project is to understand the interface between a person and a computer. The traditional ways of interacting with a computer are the keyboard, the mouse and other input devices, but computing has become far more sophisticated and complex. This gives us the resources to think about building a more modern interface that allows a more natural interaction. In this project we have therefore tried to develop an application that makes human-computer interaction more interesting and user friendly. The application, called the Expert System Voice Assistant, takes human voice as input, processes it, performs the requested task and responds at the end. It is a digital life assistant that mainly uses human means of communication such as Twitter, instant messaging and voice to create a two-way connection between a person and his or her computer, controlling power, documents, social media and much more. Because the project mainly uses voice for communication, it is essentially a speech recognition application.

    The concept of speech technology encompasses two technologies: the synthesizer and the recognizer. A speech synthesizer takes text as input and produces an audio stream as output. A speech recognizer does the opposite: it takes an audio stream as input and turns it into a text transcription. The voice signal carries a great deal of information, which makes analysing and synthesizing it directly difficult. Digital signal processing steps such as feature extraction and feature matching are therefore introduced to represent the voice signal compactly. In this project we directly use a speech engine whose feature extraction technique is based on mel-frequency cepstral coefficients (MFCCs). MFCCs, derived from Fourier transform and filter bank analysis, are perhaps the most widely used front-ends in state-of-the-art speech recognition systems. Our aim is to keep adding functionality that assists people in their daily lives and reduces their effort.




    Table of Contents

    1.0 INTRODUCTION
    1.1 EXISTING SYSTEMS
    1.2 SPEECH RECOGNITION
    1.3 SPEECH SYNTHESIS
    2.3 AVAILABILITY OF RESOURCES
    2.4 RELATED WORK
    3.0 PROPOSED WORK
    3.1 PROBLEM DESCRIPTION
    3.2 ARCHITECTURE OF THE PROJECT
    3.3 WORKING OF THE PROJECT
    4.0 DESIGN AND DEVELOPMENT
    4.1 MICROSOFT VISUAL STUDIO
    4.2 SPEECH SYNTHESIS
    5.1 POST QUERY DESIGN
    5.2 PROTOTYPE AND INCEPTION
    5.3 DEFAULT COMMANDS.TXT
    6.0 RESULTS
    6.1 SNAPSHOT OF THE GUI
    6.2 FLOWCHARTS
    REFERENCES






    Chapter 1

    1. Introduction

    Speech is an effective and natural way for people to interact with applications, complementing or even replacing the use of mice, keyboards, controllers and gestures. As a hands-free yet accurate way to communicate with applications, speech lets people be productive and stay informed in a variety of situations where other interfaces cannot. Speech recognition is useful in many applications and environments in our daily life. In general, a speech recognizer is a machine that understands humans and their spoken words in some way and can act on them. Another aspect of speech recognition is helping people with functional disabilities or other kinds of handicap: voice control could make their daily chores easier, since they could operate the system with their voice alone. This leads to the idea of intelligent homes, where these operations can be made available to the common man as well as to the handicapped. Voice-activated systems and gesture control systems have taken the experience of naive end-users to the next level: present-day users are able to access or control a system without physically interacting with the computer. The proposed model presents a new approach to voice-activated control systems that improves response time and user experience by looking beyond the steps of speech recognition and focusing on the post-processing step of natural language processing. The proposed method conceives the system as a deterministic finite state automaton, where each state is allowed a finite set of keywords that the speech recognition system listens for. This is achieved by introducing a new mechanism for handling the finite automaton, called the Switch State Mechanism. Natural language processing is used to regularly update the keywords of each state and to give the user a lifelike interaction with the computer.
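    The Switch State Mechanism is described here only in prose. The following Python sketch shows one way a deterministic finite automaton with a per-state keyword set could be modelled, assuming a plain transition table; the class name `SwitchStateMachine`, the states and the keywords are hypothetical and are not taken from the project's code.

```python
# Hypothetical sketch of a "switch state" keyword automaton; the states,
# keywords and transitions below are illustrative assumptions.
from typing import Dict, Optional, Set


class SwitchStateMachine:
    """Deterministic finite automaton in which each state listens for a finite keyword set."""

    def __init__(self, transitions: Dict[str, Dict[str, str]], start: str) -> None:
        self.transitions = transitions
        self.state = start

    def active_keywords(self) -> Set[str]:
        # Only these keywords would be loaded into the recognizer in the current state.
        return set(self.transitions[self.state])

    def feed(self, keyword: str) -> Optional[str]:
        """Follow the transition for `keyword`; unknown keywords leave the state unchanged."""
        next_state = self.transitions[self.state].get(keyword)
        if next_state is not None:
            self.state = next_state
        return next_state


TRANSITIONS = {
    "idle": {"wake up": "listening"},
    "listening": {"open": "choose_program", "sleep": "idle"},
    "choose_program": {"notepad": "listening", "calculator": "listening", "cancel": "listening"},
}

if __name__ == "__main__":
    fsm = SwitchStateMachine(TRANSITIONS, start="idle")
    for word in ["open", "wake up", "open", "notepad"]:
        print(f"{word!r} -> {fsm.feed(word)}; listening for {sorted(fsm.active_keywords())}")
```

    Only the keywords returned by `active_keywords()` would need to be given to the recognizer at any moment, which is how a small per-state vocabulary can keep recognition fast and accurate.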

    With the input functionality of speech recognition, an application can monitor the state, level and format of the input signal, and receive notification about problems that might interfere with successful recognition. Grammars can be created programmatically using the constructors and methods of the GrammarBuilder and Choices classes, and an application can dynamically modify programmatically created grammars while it is running. The structure of grammars authored using these classes is independent of the Speech Recognition Grammar Specification (SRGS).
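    The GrammarBuilder and Choices classes belong to the .NET System.Speech API. As a rough illustration of what such a grammar expresses, the Python sketch below treats a grammar as a sequence of fixed words and lists of alternatives and expands it into the set of phrases it accepts. The helper names `expand_grammar` and `accepts` are hypothetical; the real classes compile a grammar for the recognizer rather than enumerating phrases.

```python
# Rough Python analogue of a grammar built from fixed words and Choices.
# It does not use System.Speech; the helper names are hypothetical.
from itertools import product
from typing import List, Sequence, Set, Union

Element = Union[str, Sequence[str]]  # a fixed word/phrase or a list of alternatives


def expand_grammar(elements: List[Element]) -> Set[str]:
    """Return every phrase the grammar accepts (the cartesian product of the slots)."""
    slots = [[e] if isinstance(e, str) else list(e) for e in elements]
    return {" ".join(words) for words in product(*slots)}


def accepts(grammar: Set[str], utterance: str) -> bool:
    """Check a recognised utterance against the grammar, ignoring case and extra spaces."""
    return " ".join(utterance.lower().split()) in grammar


# "open" followed by one program name, similar in spirit to appending
# a Choices object to a GrammarBuilder.
grammar = expand_grammar(["open", ["notepad", "calculator", "paint"]])
```

    Adding a program name while the application is running is then a set union, e.g. `grammar | expand_grammar(["open", ["chrome"]])`, loosely mirroring the dynamic modification of programmatically created grammars.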


    Voice recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:

    1. Transform the PCM digital audio into a better acoustic representation

    2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A

    grammar could be anything from a context-free grammar to full-blown English.

    3. Figure out which phonemes are spoken.

    4. Convert the phonemes into words.
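    The first pipeline step starts from raw PCM samples. The sketch below, assuming signed 16-bit little-endian mono PCM (a common sound-card and WAV format), decodes the bytes into floating-point samples and cuts them into overlapping frames; a real front end would go on to compute features such as MFCCs for each frame. The function names are illustrative only.

```python
import struct
from typing import List


def pcm16_to_floats(data: bytes) -> List[float]:
    """Decode signed 16-bit little-endian mono PCM into floats in [-1.0, 1.0)."""
    count = len(data) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack("<%dh" % count, data[: count * 2])
    return [s / 32768.0 for s in samples]


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split samples into overlapping frames of frame_len samples, starting every hop samples."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

    At a 16 kHz sampling rate, `frame_len=400` and `hop=160` would give 25 ms frames every 10 ms, a common but not mandatory choice.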

    1.1 Existing Systems

    Although some promising solutions are available for speech synthesis and recognition, most of them are tuned to English: their acoustic and language models are built for the English language, and most of them require a lot of configuration before they can be used. ISIP and Sphinx are two well-known open-source speech recognition packages. gives a comparison of public domain software tools for speech recognition. Commercial software such as IBM's ViaVoice is also available.

    1.1.1 SIRI

    SIRI is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services. Apple claims that the software adapts to the user's individual preferences over time and personalizes results. The name Siri is Norwegian, meaning "beautiful woman who leads you to victory", and comes from the intended name for the original developer's first child.

    Siri was originally introduced as an iOS application available in the App Store by Siri, Inc., which was acquired by Apple on April 28, 2010. Siri, Inc. had announced that their software would be available for BlackBerry and for phones running Android, but all development efforts for non-Apple platforms were cancelled after the acquisition by Apple.

    Siri has been an integral part of iOS since iOS 5 and was introduced as a feature of the iPhone 4S on October 14, 2011. Siri was added to the third-generation iPad with the release of iOS 6 in September 2012, and has been included on all iOS devices released during or after October 2012. With Siri you can call or text someone, search for anything or open any app using only your voice, which is very helpful indeed.

    1.1.2 S-VOICE

    S Voice is an intelligent personal assistant and knowledge navigator which is only available as a built-in application for Samsung Galaxy smartphones. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services. It is based on the Vlingo personal assistant.

    Some of the capabilities of S Voice include making appointments, opening apps, setting alarms, updating social network websites such as Facebook or Twitter, and navigation. S Voice also offers efficient multitasking as well as automatic activation features, for example when the car engine is started.

    S Voice offers largely the same features as Siri.

    1.1.3 GOOGLE NOW

    Google Now is an intelligent personal assistant developed by Google. It is available within the

    Google Search mobile application for the Android and iOS operating systems, as well as the

    Google Chrome web browser on personal computers. Google Now uses a natural language

    user interface to answer questions, make recommendations, and perform actions by delegating

    requests to a set of web services. Along with answering user-initiated queries, Google Now

    passively delivers information to the user that it predicts they will want, based on their search

    habits. It was first included in Android 4.1 ("Jelly Bean"), which launched on July 9, 2012,

    and was first supported on the Galaxy Nexus smartphone. The service was made available for

    iOS on April 29, 2013 in an update to the Google Search app, and later for Google Chrome on

    March 24, 2014.




    The expert system voice assistant is based on a combination of three major operations:

    Speech Recognition

    Intermediate Operations and result creation

    Speech Synthesis

    1.2 Speech Recognition

    Speech recognition refers to the ability to listen to spoken words (input in audio format), identify the various sounds present in them, and recognise them as words of some known language. In the computer domain, speech recognition may then be defined as the ability of computer systems to accept spoken words in an audio format, such as WAV or raw, and generate their content in text format. Speech recognition in the computer domain involves several steps, each with its own issues: voice recording, word boundary detection, feature extraction, and recognition with the help of knowledge models. Word boundary detection is the process of identifying the start and the end of a spoken word in the given sound signal. While analysing the sound signal, it can at times be difficult to identify the word boundary. This can be attributed to the various accents and speaking habits people have, such as the duration of the pause they leave between words. Feature extraction refers to the process of converting the sound signal into a form suitable for the following stages to use. Feature extraction may include extracting parameters such as the amplitude of the signal, the energy in different frequencies, etc. Recognition involves mapping the given input (in the form of various features) to one of the known sounds. This may involve the use of various knowledge models for precise identification and ambiguity removal. Knowledge models refer to models such as phone acoustic models, language models, etc., which help the recognition system. To generate a knowledge model, one needs to train the system: during the training period one shows the system a set of inputs and the outputs they should map to. This is called supervised learning.
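    Word boundary detection can be illustrated with a simple short-time energy threshold, one of several possible endpointing techniques and not necessarily what the speech engine used in this project does. In this Python sketch the threshold and the `min_gap` value, which bridges short pauses inside a word, are assumptions that would normally be tuned to the background noise level.

```python
from typing import List, Optional, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Short-time energy (sum of squared samples) of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def detect_words(energies: List[float], threshold: float, min_gap: int) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) pairs of regions whose energy reaches `threshold`.

    Runs of fewer than `min_gap` quiet frames are bridged, so a short pause
    inside a word does not split it into two words.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    silence = 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i  # a new word begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:
                # The word ended at the last loud frame before this silence.
                segments.append((start, i - silence))
                start, silence = None, 0
    if start is not None:
        segments.append((start, len(energies) - 1 - silence))
    return segments
```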




    Structure of a standard speech recognition system.




    How Speech Recognition Works

    A speech recognition engine (or speech recognizer) takes an audio stream as input and turns it

    into a text transcription. The speech recognition process can be thought of as having a front end

    and a back end.

    Convert Audio Input

    The front end processes the audio stream, isolating segments of sound that are probably speech and converting them into a series of numeric values that characterize the vocal sounds in the signal.

    Match Input to Speech Models

    The back end is a specialized search engine that takes the output produced by the front end and

    searches across three databases: an acoustic model, a lexicon, and a language model.

    The acoustic model represents the acoustic sounds of a language, and can be trained to recognize the characteristics of a particular user's speech patterns and acoustic environments.

    The lexicon lists a large number of the words in the language, and provides information on how to pronounce each word.

    The language model represents the ways in which the words of a language are combined.

    For any given segment of sound, there are many things the speaker could potentially be saying.

    The quality of a recognizer is determined by how good it is at refining its search, eliminating the

    poor matches, and selecting the more likely matches. This depends in large part on the quality of

    its language and acoustic models and the effectiveness of its algorithms, both for processing

    sound and for searching across the models.
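    The following Python sketch illustrates how a back end might combine two of these knowledge sources when ranking candidate transcriptions: each candidate's acoustic log-likelihood is added to a weighted log-probability from a bigram language model. The candidates, probabilities, floor value and `lm_weight` are made-up assumptions, and the lexicon and the actual search are omitted.

```python
import math
from typing import Dict, List, Tuple

Bigrams = Dict[Tuple[str, str], float]


def lm_logprob(words: List[str], bigrams: Bigrams, floor: float = 1e-4) -> float:
    """Log-probability of a word sequence under a bigram model ('<s>' = sentence start).

    Unseen bigrams get a small floor probability instead of zero.
    """
    total, prev = 0.0, "<s>"
    for w in words:
        total += math.log(bigrams.get((prev, w), floor))
        prev = w
    return total


def best_hypothesis(candidates: Dict[str, float], bigrams: Bigrams, lm_weight: float = 1.0) -> str:
    """Pick the text maximising acoustic log-likelihood + lm_weight * LM log-probability."""
    return max(candidates,
               key=lambda text: candidates[text] + lm_weight * lm_logprob(text.split(), bigrams))


# Invented numbers: acoustically the two candidates are almost equally likely.
CANDIDATES = {"recognise speech": -10.0, "wreck a nice beach": -9.5}
BIGRAMS: Bigrams = {
    ("<s>", "recognise"): 0.1, ("recognise", "speech"): 0.5,
    ("<s>", "wreck"): 0.01, ("wreck", "a"): 0.1, ("a", "nice"): 0.1, ("nice", "beach"): 0.1,
}
```

    With `lm_weight=0.0` only the acoustic score counts and "wreck a nice beach" wins; with the language model included, "recognise speech" is preferred, which is the kind of refinement of the search described above.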


    While the built-in language model of a recognizer is intended to represent a comprehensive

    language domain (such as everyday spoken English), a speech application will often need to

    process only certain utterances that have particular semantic meaning to that application. Rather

    than using the general purpose language model, an application should use a grammar that

    constrains the recognizer to listen only for speech that is meaningful to the application. This

    provides the following benefits:

    Increases the accuracy of recognition




    Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE). Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).

    A possible improvement to decoding is to keep a set of good candidates instead of just the best candidate, and to use a better scoring function (rescoring) to rate these candidates so that the best one can be picked according to this refined score. The set of candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Rescoring is usually done by trying to minimize the Bayes risk (or an approximation thereof): instead of taking the source sentence with maximal probability, we take the sentence that minimizes the expectation of a given loss function with regard to all possible transcriptions (i.e., the sentence that minimizes the average distance to other possible sentences, weighted by their estimated probability). The loss function is usually the Levenshtein distance, though different distances can be used for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to rescore lattices represented as weighted finite state transducers, with the edit distances themselves represented as a finite state transducer satisfying certain assumptions.
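    Minimum Bayes risk rescoring over an N-best list can be sketched in a few lines of Python. The sketch assumes each hypothesis comes with a posterior probability, renormalises those probabilities over the list (the pruning mentioned above), and uses the word-level Levenshtein distance as the loss; the example hypotheses and scores are invented.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance where insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (wa != wb)))  # substitution or match
        prev = cur
    return prev[-1]


def mbr_pick(nbest: List[Tuple[str, float]]) -> str:
    """Choose the hypothesis with minimum expected word edit distance to the N-best list."""
    total = sum(p for _, p in nbest)

    def expected_loss(hyp: str) -> float:
        return sum(p / total * levenshtein(hyp.split(), other.split()) for other, p in nbest)

    return min((h for h, _ in nbest), key=expected_loss)


NBEST = [("open the file", 0.40), ("open a file", 0.35), ("open a files", 0.25)]
```

    Here the most probable hypothesis is "open the file", but "open a file" has the lowest expected distance to the whole list (0.65 against 0.85), so MBR rescoring selects it.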

    Dynamic time warping (DTW)-based speech recognition

    Dynamic time warping is an approach that was historically used for speech recognition but has

    now largely been displaced by the more successful HMM-based approach.

    Dynamic time warping is an algorithm for measuring similarity between two sequences that

    may vary in time or speed. For instance, similarities in walking patterns would be detected,

    even if in one video the person was walking slowly and if in another he or she were walking

    more quickly, or even if there were accelerations and decelerations during the course of one

    observation. DTW has been applied to video, audio and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.

    A well-known application has been automatic speech recognition, to cope with different

    speaking speeds. In general, it is a method that allows a computer to find an optimal match

    between two given sequences (e.g., time series) with certain restrictions. That is, the

    sequences are "warped" non-linearly to match each other. This sequence alignment method is

    often used in the context of hidden Markov models.
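    A minimal Python sketch of DTW, in the classic isolated-word template-matching setting, is given below. It compares scalar sequences for brevity; real systems would compare sequences of feature vectors by passing a vector distance as `dist`. The templates are invented examples.

```python
from typing import Callable, Dict, Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float],
                 dist: Callable[[float, float], float] = lambda a, b: abs(a - b)) -> float:
    """Cost of the cheapest monotonic alignment between two sequences."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x advances, y element repeats
                                 cost[i][j - 1],      # y advances, x element repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognise(sample: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the stored template closest to the sample under DTW."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


TEMPLATES = {"yes": [0, 1, 2, 1, 0], "no": [0, 3, 0, 3, 0]}
```

    A slowly spoken "yes" such as `[0, 0, 1, 1, 2, 2, 1, 1, 0, 0]` aligns with the "yes" template at zero cost, because DTW lets each template value stretch over two samples.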

    Neural networks

    Neural networks emerged as an attractive acoustic modeling approach in ASR in the late

    1980s. Since then, neural networks have been used in many aspects of speech recognition such

    as phoneme classification, isolated word recognition, and speaker adaptation.

    In contrast to HMMs, neural networks make no assumptions about feature statistical

    properties and have several qualities making them attractive recognition models for speech

    recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner, and few assumptions are made about the statistics of the input features. However, in spite of their effectiveness in classifying short-time units such as individual phones and isolated words, neural networks are rarely successful for continuous recognition tasks, largely because of their limited ability to model temporal dependencies. Thus, one alternative approach is to use neural networks for pre-processing, e.g. feature transformation or dimensionality reduction, ahead of HMM-based recognition.

    1.3 Speech Synthesis

    Speech synthesis is the artificial production of human speech. A computer system used for

    this purpose is called a speech synthesizer, and can be implemented in software or hardware

    products. A text-to-speech (TTS) system converts normal language text into speech; other

    systems render symbolic linguistic representations like phonetic transcriptions into speech.

    Synthesized speech can be created by concatenating pieces of recorded speech that are stored

    in a database. Systems differ in the size of the stored speech units; a system that stores phones

    or diphones provides the largest output range, but may lack clarity. For specific usage

    domains, the storage of entire words or sentences allows for high-quality output.

    Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice

    characteristics to create a completely "synthetic" voice output.

    The quality of a speech synthesizer is judged by its similarity to the human voice and by its

    ability to be understood clearly. An intelligible text-to-speech program allows people with

    visual impairments or reading disabilities to listen to written works on a home computer.

    Many computer operating systems have included speech synthesizers since the early 1990s.
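    The concatenative approach can be sketched as joining stored waveforms for each unit and smoothing the joins. The Python example below assumes a hypothetical in-memory unit database of sample lists (whole words, in this case) and uses a short linear cross-fade at each boundary; production systems also select among many candidate units and adjust pitch and duration, which is not shown.

```python
from typing import Dict, List


def concatenate_units(units: List[str], database: Dict[str, List[float]], fade: int) -> List[float]:
    """Join stored unit waveforms, cross-fading up to `fade` samples at each joint."""
    out: List[float] = []
    for name in units:
        wave = database[name]  # raises KeyError if the unit was never recorded
        n = min(fade, len(out), len(wave))
        # Linearly blend the tail of the output so far with the head of the new unit.
        for k in range(n):
            w = (k + 1) / (n + 1)
            idx = len(out) - n + k
            out[idx] = (1 - w) * out[idx] + w * wave[k]
        out.extend(wave[n:])
    return out


UNITS = {"hel": [1.0, 1.0, 1.0, 1.0], "lo": [0.0, 0.0, 0.0]}  # toy "recordings"
```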




    A typical TTS system




    A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
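    The two front-end tasks can be illustrated with a toy Python sketch: text normalization that expands a few abbreviations and reads numbers digit by digit, followed by dictionary-based grapheme-to-phoneme lookup. The abbreviation table and the three-word ARPAbet-style lexicon are hypothetical examples, unknown words are simply spelled out letter by letter, and prosody is not handled.

```python
import re
from typing import Dict, List

DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
ABBREVIATIONS: Dict[str, str] = {"dr.": "doctor", "prof.": "professor"}  # illustrative only

# Tiny hypothetical pronunciation dictionary using ARPAbet-style phone symbols.
LEXICON: Dict[str, List[str]] = {
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "open": ["OW1", "P", "AH0", "N"],
    "two": ["T", "UW1"],
}


def normalise(text: str) -> List[str]:
    """Text normalization: lower-case, expand abbreviations, spell numbers digit by digit."""
    words: List[str] = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.extend(ABBREVIATIONS[token].split())
            continue
        token = token.strip(".,!?;:")
        if re.fullmatch(r"[0-9]+", token):
            words.extend(DIGITS[int(d)] for d in token)
        elif token:
            words.append(token)
    return words


def to_phonemes(words: List[str]) -> List[List[str]]:
    """Grapheme-to-phoneme conversion by dictionary lookup.

    Unknown words are spelled letter by letter, a crude stand-in for the
    letter-to-sound rules a real system would apply.
    """
    return [LEXICON.get(w, [c.upper() for c in w]) for w in words]
```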

    1.4 Intermediate Operations

    After the computer recognizes the speech, it is able to convert the spoken words into the corresponding text. That text can then be used as a command: whatever we speak is converted into a command, and that command is handled by various system references. Using these commands we can operate, manage and manipulate any system attribute or element. We can also use various RSS feeds to provide weather, e-mail and other social media services.

    Results are created as products of these intermediate operations. The results are then fed into the speech synthesis engine, which is responsible for responding to all the events, so the user gets better feedback from the computer.
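    The intermediate step can be pictured as a dispatcher that maps recognised text to a handler and returns the text to be spoken back. The Python sketch below is hypothetical: the command words and handlers are invented, and in the project's .NET setting the returned string would presumably be passed to a synthesizer such as `System.Speech.Synthesis.SpeechSynthesizer`.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str], str]:
    """Return a function that turns recognised text into the text to be spoken back."""
    def dispatch(recognised: str) -> str:
        text = " ".join(recognised.lower().split())
        # Try longer command phrases first so the most specific command wins.
        for command in sorted(handlers, key=len, reverse=True):
            if text == command or text.startswith(command + " "):
                return handlers[command](text[len(command):].strip())
        return "Sorry, I did not understand " + recognised
    return dispatch


# Hypothetical commands; real handlers would manipulate the system or read an RSS feed.
dispatch = make_dispatcher({
    "open": lambda program: "Opening " + program,
    "lock": lambda _: "Locking the workstation",
    "weather": lambda _: "Fetching the weather feed",
})
```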




    1.5 Architecture of the project




    Chapter 2

    2. Literature Survey and Related Work

    2.1 Microsoft Speech recognition Engine

    Windows Speech Recognition is a speech recognition application included in Windows Vista, Windows 7 and Windows 8. It allows the user to control the computer by giving specific voice commands. The program can also be used for the dictation of text, so that the user can enter text using their voice on their Vista or Windows 7 computer. Applications that do not present obvious "commands" can still be controlled by asking the system to overlay numbers on top of interface elements; the number can subsequently be spoken to activate that function. Programs needing mouse clicks in arbitrary locations can also be controlled through speech: when asked to do so, the system displays a "mousegrid" of nine zones, with numbers inside each. The user speaks a number, and another grid of nine zones is placed inside the chosen zone. This continues until the interface element to be clicked is within the chosen zone.
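    The recursive mousegrid refinement is easy to model. The Python sketch below assumes the nine zones are numbered 1 to 9 row by row from the top-left corner and that the click lands at the centre of the final zone; both are assumptions for illustration rather than a description of Windows internals.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def refine(rect: Rect, number: int) -> Rect:
    """Return zone `number` (1-9) of a 3x3 grid laid over `rect`, counted row by row."""
    if not 1 <= number <= 9:
        raise ValueError("zone number must be between 1 and 9")
    left, top, width, height = rect
    row, col = divmod(number - 1, 3)
    w, h = width / 3, height / 3
    return (left + col * w, top + row * h, w, h)


def mousegrid(screen: Rect, spoken: Iterable[int]) -> Tuple[float, float]:
    """Apply each spoken number in turn and return the centre of the final zone."""
    rect = screen
    for n in spoken:
        rect = refine(rect, n)  # each number places a new 3x3 grid inside the chosen zone
    left, top, width, height = rect
    return (left + width / 2, top + height / 2)
```

    Each spoken number shrinks the target area by a factor of nine, so a few numbers are enough to reach a small interface element.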

    Windows Speech Recognition has fairly high recognition accuracy and provides a set of commands that assist in dictation. A brief speech-driven tutorial is included to help familiarize a user with speech recognition commands, and training can also be completed to improve the accuracy of speech recognition.

    Currently, the application supports several languages, including English (U.S. and British), Spanish, German, French, Japanese and Chinese (traditional and simplified).

    Windows Speech Recognition plays an important role in the development of the expert system voice assistant: the speech recognition phase is carried out with the help of the Windows speech recognition engine.



    There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4

    are all similar to each other, with extra features in each newer version. SAPI 5 however was a

    completely new interface, released in 2000. Since then several sub-versions of this API have

    been released.

    2.3.2 .NET Application Architecture




    2.4 Related Work

    It would be inappropriate to not mention Siri or Google Now, when discussing voice activated

    systems, though they are for mobile devices. Siri relies on web services and hence facilitates

    the learning of user preferences over time. However, all the intelligent personal assistants

    including Samsung's S Voice, Iris and others use natural language processing following speech recognition. The use of state machines is limited to context storage and evaluation. One

    of the best attempts to create expert system voice assistant was achieved recognition system

    followed by natural language processing. From the videos posted of his Project Jarvis we can

    infer that the response time of the system is not real time. However, his project was able to capture the entirety of a digital life assistant. Individual projects such as Project Alpha and others have tried to utilize state systems through the use of Windows Speech Recognition Macros. Other small projects, such as Project Rita, rely on a state system for concocting responses to a command spoken by the user. The scope of these projects, however, is limited due to improper management of macros or keywords.




    Chapter 3

    3. Problem Description

    A voice assistant is not a traditional or orthodox application, and such applications are not yet widely available. Moreover, not everyone can interact with a computer through conventional input methods such as the keyboard or mouse. People with physical disabilities or visual impairments may find it very difficult to operate a computer, but with the help of this application they can operate it as smoothly as other users do. The problem, therefore, is to combine the features of speech recognition, interpretation, system manipulation, command generation and speech synthesis.

    We want the computer to recognize our spoken words and to perform the spoken operation. After that, we want the application to respond through text-to-speech or some other synthetic voice feedback.

    We have to make sure that the application understands every supported command and reports the results with feedback.
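    The problem statement above chains recognition, interpretation, command execution and spoken feedback. The sketch below shows one way the interpretation step could work once the speech engine has returned plain text: the phrase is normalized and looked up in a table of handlers, each of which returns the sentence to speak back. It is a minimal illustration, not the project's code; the class `CommandRouter`, its methods and the normalization rule (lower-casing and dropping punctuation so that "What's the date?" and "Whats the date" match) are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical router: maps a recognized phrase to a handler that returns
// the sentence the assistant should speak back as feedback.
public class CommandRouter
{
    private readonly Dictionary<string, Func<string>> _handlers =
        new Dictionary<string, Func<string>>();

    // Lower-cases the phrase, drops punctuation and collapses whitespace, so
    // "What's the date?" and "Whats the date" produce the same key.
    public static string Normalize(string phrase)
    {
        char[] kept = phrase.ToLowerInvariant()
            .Where(c => char.IsLetterOrDigit(c) || char.IsWhiteSpace(c))
            .ToArray();
        string[] words = new string(kept).Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public void Register(string phrase, Func<string> handler)
    {
        _handlers[Normalize(phrase)] = handler;
    }

    // Interpretation and execution: runs the matching handler and returns
    // its reply, or a fallback sentence when the phrase is unknown.
    public string Handle(string recognizedText)
    {
        Func<string> handler;
        if (_handlers.TryGetValue(Normalize(recognizedText), out handler))
            return handler();
        return "Sorry, I did not understand that command.";
    }

    // Supports a "what can you do" request by listing the registered phrases.
    public IEnumerable<string> KnownCommands()
    {
        return _handlers.Keys.OrderBy(k => k, StringComparer.Ordinal);
    }
}
```

    Exact lookup after normalization keeps interpretation cheap, at the cost of accepting only phrases that have been registered in advance.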




    12. killtask - Kills a specified task. You have to specify vocally which running task is to be killed.

    13. CMD - Starts a new command prompt window.

    14. Start or Close any Program or Directory - You can start any program by saying its name. You can open or close any directory by saying its name, and you can switch from one to another by voice as well. The start and termination can be confirmed vocally.

    15. Tasklist - Views current running processes.

    16. lock - Locks the workstation.

    17. Screen off - Turns off the monitor; it can also dim the brightness of the screen.

    18. System specific tasks - You can control your computer's regular operations via voice commands. For example, you can turn the computer off by saying turn off, put it to sleep by saying sleep, or open and close the disk tray.

    19. Open any website - You can open a specific website by saying its name. This includes many well-known websites.

    20. What is there to offer: The first thing a user needs is to know the potential and capabilities of the project, so if the user says what can you do or commands, the application shows the list of commands and operations it can perform.

    21. Print this page: This command prints the current page. The application takes the spoken word print as input, and the status of the task is reported back by voice.

    22. Screenshot anything: You can take a screenshot of any page or window by saying the

    23. Play music or video locally: You can simply instruct the assistant to play a local music or video file on the basis of its name, artist, genre, etc.

    24. Multimedia Control: You can control the volume, select the playlist and go to the next or previous track by voice commands.

    25. Manage your Email: You can check for and manage new emails by saying something like check mail. The system responds vocally to the command and can read your emails to you.

    26. Presentation control: You can start a presentation, go to the previous or next slide and end the presentation.

    27. Delete file: You can delete any selected file by saying this command.

    28. Cut/Copy/Paste: You can perform these operations on any selected file or text.

    29. Select all: Say it and it will select the whole document or all the files.
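    Several of the commands above correspond to standard Windows command-line tools. The following sketch only builds `ProcessStartInfo` objects, without starting anything, so the mapping can be inspected and tested separately. The `SystemCommands` class and the choice of `cmd.exe`, `tasklist`, `taskkill` and `rundll32.exe user32.dll,LockWorkStation` as the underlying commands are assumptions, since the report does not show how these commands are executed.

```csharp
using System;
using System.Diagnostics;

// Hypothetical mapping from spoken system commands to Windows programs.
// Nothing is started here; the caller passes the result to Process.Start.
public static class SystemCommands
{
    public static ProcessStartInfo Build(string command, string argument = null)
    {
        switch (command.Trim().ToLowerInvariant())
        {
            case "cmd":
                // A new command prompt window.
                return new ProcessStartInfo("cmd.exe");
            case "tasklist":
                // /K keeps the console open so the process list stays visible.
                return new ProcessStartInfo("cmd.exe", "/K tasklist");
            case "lock":
                // Locks the workstation through the user32 LockWorkStation entry point.
                return new ProcessStartInfo("rundll32.exe", "user32.dll,LockWorkStation");
            case "killtask":
                if (string.IsNullOrWhiteSpace(argument))
                    throw new ArgumentException("killtask needs the name of a running task");
                // /IM selects the process by image name, /F forces termination.
                return new ProcessStartInfo("taskkill.exe", "/IM \"" + argument + "\" /F");
            default:
                throw new ArgumentException("Unknown system command: " + command);
        }
    }
}
```

    A caller would then run, for example, `Process.Start(SystemCommands.Build("lock"))`.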

    Program Options

    Start Automatically - If checked, this program will be added to your start-up folder so that it

    will start automatically each time you start Windows.

    Show Progress Bars - The program can monitor your usage of the mouse and keyboard and show you the progress you are making at using your voice instead of the mouse and keyboard. Progress is measured on several dimensions, including mouse clicks, mouse movement, keyboard letters, and navigation/function keys.

    General options -

    1. Open and Close Programs

    2. Navigate Programs/Folders

    3. Switch or Minimize Windows

    4. Change Settings




    Chapter 5

    5. Design and Development

    5.1 Required:

    Hardware: Pentium Processor, 512MB of RAM, 10GB HDD.



    Tools: .NET Framework 4.5, Microsoft Visual Studio 2010, voice macros.

    The speech signal and all its characteristics can be represented in two different domains: the time domain and the frequency domain. A speech signal is a slowly time-varying signal in the sense that, when examined over a short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is not the case if we look at a speech signal over a longer time span (approximately T > 0.5 s): the signal's characteristics are then non-stationary, meaning that they change to reflect the different sounds spoken by the talker. To use a speech signal and interpret its characteristics properly, some kind of representation of the speech signal is therefore preferred.
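    Because speech is only short-time stationary, it is typically analyzed in short, overlapping frames. The sketch below splits a sampled signal into such frames; the 25 ms frame length and 10 ms hop used in the example are common choices inside the 5 to 100 ms range above, but they are assumptions and not parameters stated in this report, and `SpeechFraming` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class SpeechFraming
{
    // Splits samples into frames of frameMs milliseconds that start every
    // hopMs milliseconds. A trailing partial frame is dropped.
    public static List<double[]> Frame(double[] samples, int sampleRate, double frameMs, double hopMs)
    {
        int frameLen = (int)(sampleRate * frameMs / 1000.0);
        int hop = (int)(sampleRate * hopMs / 1000.0);
        if (frameLen <= 0 || hop <= 0)
            throw new ArgumentException("Frame length and hop must be at least one sample.");

        var frames = new List<double[]>();
        for (int start = 0; start + frameLen <= samples.Length; start += hop)
        {
            var frame = new double[frameLen];
            Array.Copy(samples, start, frame, 0, frameLen);
            frames.Add(frame);
        }
        return frames;
    }
}
```

    At 16 kHz, one second of audio gives 98 frames of 400 samples each, and each frame is short enough to be treated as stationary.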

    5.2 Microsoft Visual Studio

    Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. It is

    used to develop computer programs for Microsoft Windows superfamily of operating systems,

    as well as web sites, web applications and web services. Visual Studio uses Microsoft

    software development platforms such as Windows API, Windows Forms, Windows

    Presentation Foundation, Windows Store and Microsoft Silverlight. It can produce both native

    code and managed code.

    Visual Studio includes a code editor supporting IntelliSense as well as code refactoring. The

    integrated debugger works both as a source-level debugger and a machine-level debugger.

    Other built-in tools include a forms designer for building GUI applications, web designer,

    class designer, and database schema designer. It accepts plug-ins that enhance the functionality at almost every level, including adding support for source-control systems (like Subversion) and adding new toolsets like editors and visual designers for domain-specific languages or toolsets for other aspects of the software development lifecycle (like the Team Foundation Server client: Team Explorer).

    Visual Studio supports different programming languages and allows the code editor and

    debugger to support (to varying degrees) nearly any programming language, provided a

    language-specific service exists. Built-in languages include C, C++ and C++/CLI (via Visual

    C++), VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual Studio

    2010). Support for other languages such as M, Python, and Ruby among others is available via

    language services installed separately. It also supports XML/XSLT, HTML/XHTML,

    JavaScript and CSS.

    Microsoft provides "Express" editions of its Visual Studio at no cost. Commercial versions of Visual Studio, along with select past versions, are available for free to students via Microsoft's DreamSpark program.

    5.3 Speech Synthesis

    The most important qualities of a speech synthesis system are naturalness and intelligibility.

    Naturalness describes how closely the output sounds like human speech, while intelligibility is

    the ease with which the output is understood. The ideal speech synthesizer is both natural and

    intelligible. Speech synthesis systems usually try to maximize both characteristics.

    The two primary technologies generating synthetic speech waveforms are concatenative

    synthesis and formant synthesis. Each technology has strengths and weaknesses, and the

    intended uses of a synthesis system will typically determine which approach is used.

    Create TTS Content

    The content that a TTS engine speaks is called a prompt. Creating a prompt can be as simple as typing a string.

    For greater control over speech output, you can create prompts programmatically using the methods of the PromptBuilder class to assemble content for prompts from text, Speech Synthesis Markup Language (SSML), files containing text or SSML markup, and prerecorded audio files. PromptBuilder also allows you to select a speaking voice and to control attributes of the voice such as rate and volume.
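    The sketch below assembles a prompt with PromptBuilder instead of a single string: a greeting, a short pause, and then a status message read more slowly and loudly. It assumes a reference to the System.Speech assembly (Windows only); the `Prompts` class, the method name and the message text are hypothetical.

```csharp
using System;
using System.Speech.Synthesis;

public static class Prompts
{
    // Builds a prompt that greets the user, pauses briefly and then reads a
    // status message more slowly and loudly than the greeting.
    public static PromptBuilder BuildStatusPrompt(string userName, string status)
    {
        var builder = new PromptBuilder();
        builder.AppendText("Hello " + userName + ".");
        builder.AppendBreak(TimeSpan.FromMilliseconds(500));

        var style = new PromptStyle();
        style.Rate = PromptRate.Slow;
        style.Volume = PromptVolume.Loud;
        builder.StartStyle(style);
        builder.AppendText(status);
        builder.EndStyle();
        return builder;
    }
}
```

    Calling `synth.Speak(prompt)` speaks the result, and `prompt.ToXml()` returns the equivalent SSML, which is useful for checking what will be sent to the engine.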

    Initialize and Manage the Speech Synthesizer

    The SpeechSynthesizer class provides access to the functionality of a TTS engine in Windows Vista, Windows 7, and Windows Server 2008. Using the SpeechSynthesizer class, you can

    select a speaking voice, specify the output for generated speech, create handlers for events that

    the speech synthesizer generates, and start, pause, and resume speech generation.
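    A minimal sketch of initializing the synthesizer, assuming the System.Speech assembly is referenced (Windows only). It lists the installed voices, requests a voice by gender and sends output to the default audio device; the `SynthesizerSetup` class and the female-voice preference are illustrative assumptions.

```csharp
using System.Collections.Generic;
using System.Speech.Synthesis;

public static class SynthesizerSetup
{
    // Returns the names of the voices installed on this machine.
    public static List<string> InstalledVoiceNames(SpeechSynthesizer synth)
    {
        var names = new List<string>();
        foreach (InstalledVoice voice in synth.GetInstalledVoices())
            names.Add(voice.VoiceInfo.Name);
        return names;
    }

    // Creates a synthesizer that asks for a female voice (an example
    // preference) and plays through the default audio device.
    public static SpeechSynthesizer Create()
    {
        var synth = new SpeechSynthesizer();
        synth.SelectVoiceByHints(VoiceGender.Female);
        synth.SetOutputToDefaultAudioDevice();
        return synth;
    }
}
```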

    Generate Speech

    Using methods on the SpeechSynthesizer class, you can generate speech as either a

    synchronous or an asynchronous operation from text, SSML markup, files containing text or

    SSML markup, and prerecorded audio files.
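    The difference between the two modes can be sketched as follows (System.Speech, Windows only; the class and method names are hypothetical). The synchronous call renders the whole prompt into an in-memory WAV stream before returning, while the asynchronous call returns immediately, which keeps a Windows Forms UI responsive while the assistant is talking.

```csharp
using System.IO;
using System.Speech.Synthesis;

public static class SpeechGeneration
{
    // Synchronous: Speak blocks until the whole prompt has been rendered,
    // here into an in-memory WAV stream instead of the speakers.
    public static byte[] RenderToWav(string text)
    {
        using (var synth = new SpeechSynthesizer())
        using (var buffer = new MemoryStream())
        {
            synth.SetOutputToWaveStream(buffer);
            synth.Speak(text);
            synth.SetOutputToNull(); // detach the stream from the synthesizer before reading it
            return buffer.ToArray();
        }
    }

    // Asynchronous: SpeakAsync returns at once and speaking continues in the
    // background; the returned Prompt identifies the queued speech.
    public static Prompt SpeakInBackground(SpeechSynthesizer synth, string text)
    {
        return synth.SpeakAsync(text);
    }
}
```

    A prompt started with `SpeakAsync` can be stopped early with `SpeakAsyncCancelAll()`.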

    Respond to Events

    When generating synthesized speech, the SpeechSynthesizer raises events that inform a

    speech application about the beginning and end of the speaking of a prompt, the progress of a

    speak operation, and details about specific features encountered in a prompt. EventArgs

    classes provide notification and information about events raised and allow you to write

    handlers that respond to events as they occur.
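    The event flow can be observed with a short sketch (System.Speech, Windows only; `SynthesisEvents` is a hypothetical name). It subscribes to SpeakStarted, SpeakProgress and SpeakCompleted, discards the audio with `SetOutputToNull()`, and waits for the completion event with a timeout.

```csharp
using System;
using System.Speech.Synthesis;
using System.Threading.Tasks;

public static class SynthesisEvents
{
    // Speaks text asynchronously, printing each word as SpeakProgress reports it,
    // and returns true if SpeakCompleted was raised (not cancelled) before the timeout.
    public static bool SpeakAndTrack(string text, TimeSpan timeout)
    {
        var completed = new TaskCompletionSource<bool>();
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToNull(); // discard the audio; only the events matter here
            synth.SpeakStarted += (s, e) => Console.WriteLine("Started speaking.");
            synth.SpeakProgress += (s, e) => Console.WriteLine("Word: " + e.Text);
            synth.SpeakCompleted += (s, e) => completed.TrySetResult(!e.Cancelled);

            synth.SpeakAsync(text);
            return completed.Task.Wait(timeout) && completed.Task.Result;
        }
    }
}
```

    Call it from a console program or a background thread: blocking a Windows Forms UI thread with `Wait` could prevent events that are posted to that thread from being delivered.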

    Control Voice Characteristics

    To control the characteristics of speech output, you can select a voice with specific attributes

    such as language or gender, modify properties of the SpeechSynthesizer such as rate and

    volume, or add instructions either in prompt content or in separate lexicon files that guide

    the pronunciation of specified words or phrases.
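    Rate and volume are plain integer properties on SpeechSynthesizer, with accepted ranges of -10 to 10 and 0 to 100 respectively. The helper below clamps requested values into those ranges before applying them; the `VoiceControls` class is hypothetical and the code assumes a reference to System.Speech (Windows only).

```csharp
using System;
using System.Speech.Synthesis;

public static class VoiceControls
{
    // Applies a speaking rate and volume, clamped to the ranges that
    // SpeechSynthesizer accepts (Rate: -10..10, Volume: 0..100).
    public static void Apply(SpeechSynthesizer synth, int rate, int volume)
    {
        synth.Rate = Math.Max(-10, Math.Min(10, rate));
        synth.Volume = Math.Max(0, Math.Min(100, volume));
    }
}
```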

    Apart from this, some manually written scripts can answer the most common questions without the overhead of creating a process.
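    A scripted answer can be as small as a lookup table from a known question to a text template. The sketch below answers three of the phrases from the default command list using the current time; the `ScriptedAnswers` class is hypothetical, and passing the time in as a parameter is a design choice that makes the answers easy to test.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical table of scripted answers: common questions are answered
// directly from a template instead of starting an external process.
public static class ScriptedAnswers
{
    private static readonly Dictionary<string, Func<DateTime, string>> Answers =
        new Dictionary<string, Func<DateTime, string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "What time is it", now => "It is " + now.ToString("h:mm tt", CultureInfo.InvariantCulture) },
            { "What day is it", now => "Today is " + now.ToString("dddd", CultureInfo.InvariantCulture) },
            { "Whats the date", now => "Today is " + now.ToString("MMMM d, yyyy", CultureInfo.InvariantCulture) },
        };

    // Returns the scripted answer, or null when the question has none.
    public static string TryAnswer(string question, DateTime now)
    {
        Func<DateTime, string> answer;
        return Answers.TryGetValue(question.Trim(), out answer) ? answer(now) : null;
    }
}
```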



    Chapter 6

    6. Implementation and Coding

    6.1 Post Query Design

    In Visual C#, you can use either the Windows Forms Designer or the Windows Presentation Foundation (WPF) Designer to quickly and conveniently create user interfaces. Building a UI with either designer consists of three steps:

    Adding controls to the design surface.

    Setting initial properties for the controls.

    Writing handlers for specified events.

    Although you can also create your UI by manually writing your own code, designers enable you to do this work much faster.

    Adding Controls

    In either designer, you use the mouse to drag controls, which are components with a visual representation such as buttons and text boxes, onto a design surface. As you work visually, the Windows Forms Designer translates your actions into C# source code and writes it into a project file named name.designer.cs, where name is the name that you gave to the form. Similarly, the WPF designer translates actions on the design surface into Extensible Application Markup Language (XAML) code and writes it into a project file named Window.xaml. When your application runs, that source code (Windows Forms) or XAML (WPF) positions and sizes your UI elements so that they appear just as they do on the design surface.

    Setting Properties

    After you add a control to the design surface, you can use the Properties window to set its

    properties, such as background color and default text.




    In the Windows Form designer, the values that you specify in the Properties window are the

    initial values that will be assigned to that property when the control is created at run time. In

    the WPF designer, the values that you specify in the Properties window are stored as attributes

    in the window's XAML file.

    In many cases, those values can be accessed or changed programmatically at run time by

    getting or setting the property on the instance of the control class in your application. The

    Properties window is useful at design time because it enables you to browse all the properties,

    events, and methods supported on a control.

    Handling Events

    Programs with graphical user interfaces are primarily event-driven. They wait until a user does

    something such as typing text into a text box, clicking a button, or changing a selection in a

    listbox. When that occurs, the control, which is just an instance of a .NET Framework class,

    sends an event to your application. You have the option of handling an event by writing a

    special method in your application that will be called when the event is received.

    You can use the Properties window to specify which events you want to handle in your code.

    Select a control in the designer and click the Events button, with the lightning bolt icon, on the

    Properties window toolbar to see its events.

    When you add an event handler through the Properties window, the designer automatically

    writes the empty method body. You must write the code to make the method do something

    useful. Most controls generate many events, but frequently an application will only have to

    handle some of them, or even only one. For example, you probably have to handle a button's Click event, but you do not have to handle its SizeChanged event unless you want to do something when the size of the button changes.
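    The wiring the designer generates follows the ordinary .NET event pattern. To show that pattern without depending on Windows Forms, the sketch below uses a hypothetical `ListenButton` class in place of a real control and a hypothetical `AssistantForm` in place of a form; in a real Windows Forms project the designer writes the equivalent `+=` line into name.designer.cs.

```csharp
using System;

// Hypothetical stand-in for a control: it exposes a Click event with the
// same EventHandler signature that Windows Forms controls use.
public class ListenButton
{
    public event EventHandler Click;

    // Simulates a user click by raising the Click event.
    public void PerformClick()
    {
        EventHandler handler = Click;
        if (handler != null)
            handler(this, EventArgs.Empty);
    }
}

// Hypothetical stand-in for a form that owns the button.
public class AssistantForm
{
    public readonly ListenButton Button = new ListenButton();
    public string Status { get; private set; }

    public AssistantForm()
    {
        Status = "Idle";
        // The kind of line a designer generates when a Click handler is added.
        Button.Click += new EventHandler(Button_Click);
    }

    // Runs only when the button raises its Click event.
    private void Button_Click(object sender, EventArgs e)
    {
        Status = "Listening";
    }
}
```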

    6.2 Prototype And Inception

    The project is coded in C#. Speech recognition is the very first step in this process, so we start with that.



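    Initialize the Speech Recognizer

    To initialize an instance of the shared recognizer in Windows, we use the SpeechRecognizer class from the System.Speech.Recognition namespace:

    using System.Speech.Recognition;

    SpeechRecognizer sr = new SpeechRecognizer();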

    Create a Speech Recognition Grammar

    One way to create a speech recognition grammar is to use the constructors and methods of the GrammarBuilder class.

    Load the Grammar into the Speech Recognizer

    After the grammar is created, it must be loaded into the speech recognizer. This is done by calling the LoadGrammar(Grammar) method, passing the grammar created in the previous step.
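    A minimal sketch of building such a grammar from a list of phrases with the `Choices` and `GrammarBuilder` classes of System.Speech.Recognition (Windows only). The `CommandGrammar` helper and the grammar name are hypothetical; the phrases in the usage line are taken from the default command list in section 6.4.

```csharp
using System.Speech.Recognition;

public static class CommandGrammar
{
    // Builds a grammar that matches exactly one of the given phrases.
    public static Grammar Build(string[] phrases)
    {
        var choices = new Choices();
        choices.Add(phrases);

        var builder = new GrammarBuilder();
        builder.Append(choices);

        var grammar = new Grammar(builder);
        grammar.Name = "Commands";
        return grammar;
    }
}
```

    The grammar can then be loaded with `sr.LoadGrammar(CommandGrammar.Build(new[] { "Hello Jarvis", "What time is it", "Play music" }));`, where `sr` is the shared recognizer created above.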


    Register for Speech Recognition Event Notification

    The speech recognizer raises a number of events during its operation, including the SpeechRecognized event, which it raises when it matches a user utterance with a grammar. An application registers for notification of this event by appending an EventHandler instance, as shown in the following example. The argument to the EventHandler constructor, sr_SpeechRecognized, is the name of the developer-written event handler.

    sr.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sr_SpeechRecognized);


    Create a Speech Recognition Event Handler

    When you register a handler for a particular event, the IntelliSense feature in Microsoft Visual Studio creates a skeleton event handler if you press the TAB key. This process ensures that parameters of the correct type are used. The handler for the SpeechRecognized event shown in the following example displays the text of the recognized word or phrase using the Result property of the SpeechRecognizedEventArgs parameter, e.

    void sr_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Show the text of the phrase that matched a loaded grammar
        // (MessageBox comes from System.Windows.Forms).
        MessageBox.Show("Speech recognized: " + e.Result.Text);
    }





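    The System.Speech.Synthesis namespace is used to synthesize speech. A basic console program looks like this:

    using System;
    using System.Speech.Synthesis;

    class Program
    {
        static void Main(string[] args)
        {
            // Create a synthesizer and send its output to the default audio device.
            SpeechSynthesizer synth = new SpeechSynthesizer();
            synth.SetOutputToDefaultAudioDevice();

            // Speak synchronously: the call returns when speaking has finished.
            synth.Speak("This example demonstrates a basic use of Speech Synthesizer");

            Console.WriteLine("Press any key to exit...");
            Console.ReadKey();
        }
    }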

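    System.Diagnostics.Process.Start(Name) can be used to execute the commanded text, for example to start a program whose name the user has spoken. Process.Start also has overloads that accept a user name and a SecureString password, so that a process can be started under a different account.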
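    How Process.Start treats a bare program name or web address depends on ProcessStartInfo.UseShellExecute, which defaults to true on .NET Framework but to false on .NET Core and later. The sketch below sets it explicitly so that the Windows shell resolves names such as notepad or an http address; the `Launcher` class is hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class Launcher
{
    // Builds start information for a spoken program name or web address.
    // UseShellExecute = true lets the Windows shell resolve the target, as
    // Process.Start(name) does by default on .NET Framework.
    public static ProcessStartInfo ForTarget(string nameOrUrl)
    {
        if (string.IsNullOrWhiteSpace(nameOrUrl))
            throw new ArgumentException("Nothing to start.", "nameOrUrl");
        return new ProcessStartInfo(nameOrUrl.Trim()) { UseShellExecute = true };
    }

    // Starts the target. Process.Start returns null when no new process is
    // created, for example when an already running application is reused.
    public static Process Run(string nameOrUrl)
    {
        return Process.Start(ForTarget(nameOrUrl));
    }
}
```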



    6.4 Default Commands.TXT:


    Hello Jarvis


    Goodbye Jarvis

    Close Jarvis


    Stop talking

    What's my name?

    What time is it

    What day is it

    Whats todays date

    Whats the date

    Hows the weather

    Whats the weather like

    Whats it like outside

    What will tomorrow be like

    Whats tomorrows forecast

    Whats tomorrow like

    Whats the temperature




    Whats the temperature outside

    Play music

    Play a random song

    You decide



    Turn Shuffle On

    Turn Shuffle Off

    Next Song

    Previous Song

    Fast Forward

    Stop Music

    Turn Up

    Turn Down



    What song is playing


    Exit Fullscreen

    Play video

    next window

    select all



    print this page

    Close window

    Out of the way

    Come back

    Show default commands

    Show shell commands

    Show web commands




    Show social commands

    Show Music Library

    Show Video Library

    Show Email List

    Show listbox

    Hide listbox


    Log off



    I want to add custom commands

    I want to add a custom command

    I want to add a command

    Update commands

    Set the alarm

    What time is the alarm

    Clear the alarm

    Stop listening

    JARVIS Come Back Online

    Refresh libraries

    Change video directory

    Change music directory

    Check for new emails

    Read the email

    Open the email

    Next email

    Previous email

    Clear email list

    Change Language

    Check for new updates







    new folder

    take screenshot


    go up

    go down


    save as





    start presentation

    next slide

    previous slide

    end presentation

    zoom in

    hold control
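    The report does not show how this file is read. Assuming each line holds one phrase, a loader could trim whitespace, skip blank lines and drop repeated phrases before handing the list to a grammar. The `CommandFile` class below is a hypothetical sketch of that step.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CommandFile
{
    // Keeps one phrase per non-blank line, trimmed, in file order; repeated
    // phrases (compared case-insensitively) are kept only once.
    public static List<string> Parse(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var phrases = new List<string>();
        foreach (string line in lines)
        {
            string phrase = line.Trim();
            if (phrase.Length > 0 && seen.Add(phrase))
                phrases.Add(phrase);
        }
        return phrases;
    }

    // Convenience overload for a file on disk such as "Default Commands.txt".
    public static List<string> Load(string path)
    {
        return Parse(File.ReadLines(path));
    }
}
```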

    6.5 RSS_Reader

    using System;
    using System.Linq;
    using System.Text;
    using CustomizeableJarvis.Properties;
    using System.Xml;
    using System.Xml.Linq;
    using System.Net;

    namespace CustomizeableJarvis
    {
        class RSSReader
        {
            // Reads the Gmail Atom feed, stores a spoken summary of every unread
            // message in frmMain.MsgList and announces how many new emails there are.
            public static void CheckForEmails()
            {
                string GmailAtomUrl = "";
                XmlUrlResolver xmlResolver = new XmlUrlResolver();
                xmlResolver.Credentials = new NetworkCredential(Settings.Default.GmailUser,
                XmlTextReader xmlReader = new XmlTextReader(GmailAtomUrl);
                xmlReader.XmlResolver = xmlResolver;
                try
                {
                    XNamespace ns = XNamespace.Get("");
                    XDocument xmlFeed = XDocument.Load(xmlReader);
                    var emailItems = from item in xmlFeed.Descendants(ns + "entry")
                                     select new
                                     {
                                         Author = item.Element(ns + "author").Element(ns + "name").Value,
                                         Title = item.Element(ns + "title").Value,
                                         Link = item.Element(ns + "link").Attribute("href").Value,
                                         Summary = item.Element(ns + "summary").Value
                                     };
                    frmMain.MsgList.Clear(); frmMain.MsgLink.Clear();
                    foreach (var item in emailItems)
                    {
                        if (item.Title == String.Empty)
                        {
                            frmMain.MsgList.Add("Message from " + item.Author + ", There is no subject and the summary reads, " + item.Summary);
                        }
                        else
                        {
                            frmMain.MsgList.Add("Message from " + item.Author + ", The subject is " + item.Title + " and the summary reads, " + item.Summary);
                        }
                    }
                    if (emailItems.Count() > 0)
                    {
                        if (emailItems.Count() == 1)
                        {
                            frmMain.Jarvis.SpeakAsync("You have 1 new email");
                        }
                        else { frmMain.Jarvis.SpeakAsync("You have " + emailItems.Count() + " new emails"); }
                    }
                    else if (frmMain.QEvent == "Checkfornewemails" && emailItems.Count() == 0)
                    { frmMain.Jarvis.SpeakAsync("You have no new emails"); frmMain.QEvent = String.Empty; }
                }
                // Any failure to load the feed is reported as a login problem.
                catch { frmMain.Jarvis.SpeakAsync("You have submitted invalid log in information"); }
            }

            // Weather lookup for the location in Settings.Default.WOEID; values are
            // copied into fields of frmMain and QEvent records success or failure.
            public static void GetWeather()
            {
                try
                {
                    string query = String.Format("" + Settings.Default.WOEID.ToString() + "&u=" + Settings.Default.Temperature);
                    XmlDocument wData = new XmlDocument();
                    XmlNamespaceManager man = new XmlNamespaceManager(wData.NameTable);
                    man.AddNamespace("yweather", "");
                    XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
                    XmlNodeList nodes = wData.SelectNodes("/rss/channel/item/yweather:forecast", man);
                    frmMain.Humidity = channel.SelectSingleNode("yweather:atmosphere",
                    frmMain.WinSpeed = channel.SelectSingleNode("yweather:wind",
                    frmMain.Town = channel.SelectSingleNode("yweather:location",
                    frmMain.QEvent = "connected";
                }
                catch { frmMain.QEvent = "failed"; }
            }

            // Checks the project's Blogger feed for a newer release and, if one
            // exists, asks the user whether to download it.
            public static void CheckBloggerForUpdates()
            {
                if (frmMain.QEvent == "UpdateYesNo")
                {
                    frmMain.Jarvis.SpeakAsync("There is a new update available. Shall I start the
                }
                String UpdateMessage;
                String UpdateDownloadLink;
                string AtomFeedURL = "";
                XmlUrlResolver xmlResolver = new XmlUrlResolver();
                XmlTextReader xmlReader = new XmlTextReader(AtomFeedURL);
                xmlReader.XmlResolver = xmlResolver;
                XNamespace ns = XNamespace.Get("");
                XDocument xmlFeed = XDocument.Load(xmlReader);
                var blogPosts = from item in xmlFeed.Descendants(ns + "entry")
                                select new
                                {
                                    Post = item.Element(ns + "content").Value
                                };
                foreach (var item in blogPosts)
                {
                    // Each post holds the update message, a separator and the download link.
                    // The separator appears only as a line break in the source; an HTML
                    // line-break tag is assumed here.
                    string[] separator = new string[] { "<br />" };
                    string[] data = item.Post.Split(separator, StringSplitOptions.None);
                    UpdateMessage = data[0];
                    UpdateDownloadLink = data[1];
                    if (UpdateDownloadLink == Properties.Settings.Default.RecentUpdate)
                    {
                        frmMain.QEvent = String.Empty;
                        frmMain.Jarvis.SpeakAsync("No new updates have been posted");
                    }
                    else
                    {
                        frmMain.Jarvis.SpeakAsync("A new update has been posted. The description says, " + UpdateMessage + ".");
                        System.Windows.Forms.MessageBox.Show(UpdateMessage, "Update
                        frmMain.Jarvis.SpeakAsync("Would you like me to download the update?");
                        frmMain.QEvent = "UpdateYesNo";
                        Properties.Settings.Default.RecentUpdate = UpdateDownloadLink;
                    }
                }
            }
        }
    }




    Chapter 7

    7.1 Conclusion and Future work:

    In this project, we implemented a simple mechanism that can eliminate the excessive use of Natural Language Processing. This takes us another step closer to the ideal expert voice assistant. However, there is still a lot of scope for research on this topic: the Switch State Mechanism offers only a partial solution, addressing the responsiveness issue, that is, the computation time needed to understand a command.
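    The code of the Switch State Mechanism is not listed, but the RSSReader listing in section 6.5 hints at it: frmMain.QEvent holds values such as "UpdateYesNo", so that a short reply is interpreted according to the question just asked. A minimal sketch of that idea follows; the enum, the class and the reply texts are hypothetical.

```csharp
// Hypothetical pending-question states, in the spirit of frmMain.QEvent.
public enum PendingQuestion { None, UpdateYesNo }

public class ConversationState
{
    public PendingQuestion Pending { get; private set; }

    // Returns the question to speak and remembers which answer is now expected.
    public string Ask(PendingQuestion question, string prompt)
    {
        Pending = question;
        return prompt;
    }

    // Returns a reply when the utterance answers the pending question, or null
    // so that the caller can route it as an ordinary command instead.
    public string TryAnswer(string utterance)
    {
        if (Pending != PendingQuestion.UpdateYesNo)
            return null;
        string text = utterance.Trim().ToLowerInvariant();
        if (text != "yes" && text != "no")
            return null;
        Pending = PendingQuestion.None;
        return text == "yes" ? "Starting the download." : "Okay, I will not download the update.";
    }
}
```

    Because the state decides how "yes" is read, no language analysis is needed to resolve such short replies.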

    In this project, the expert voice assistant uses mainly human communication means such as Twitter, instant messages and voice to create a two-way connection between a person and their computer, controlling the computer and its applications and notifying the user of breaking news, Facebook notifications and much more. In our project we mainly use voice as the means of communication, so the ESVA is basically a speech recognition application. The concept of speech technology really encompasses two technologies: the synthesizer and the recognizer. A speech synthesizer takes text as input and produces an audio stream as output. A speech recognizer, on the other hand, does the opposite: it takes an audio stream as input and turns it into a text transcription. The voice signal carries a very large amount of information, so analyzing and synthesizing the complex voice signal directly is difficult. Therefore, digital signal processing steps such as Feature Extraction and Feature Matching are introduced to represent the voice signal. In this project we directly use a speech engine that uses Mel-scaled frequency cepstral coefficients for feature extraction. The Mel-scaled frequency cepstral coefficients (MFCCs), derived from Fourier transform and filter bank analysis, are perhaps the most widely used front ends in state-of-the-art speech recognition systems. Our aim is to create more and more functionality that can assist people in their daily lives and reduce their effort. In our tests we checked that all of this functionality works properly.
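    For reference, a common textbook form of the two MFCC steps named above is given below; the report does not state which variant the Windows engine uses, so the constants are the usual ones rather than values taken from this project. A frequency f in hertz is first mapped to the mel scale, and the n-th cepstral coefficient is then the discrete cosine transform of the log energies S_k of K mel-spaced filters:

```latex
m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

c_n = \sum_{k=1}^{K} \log(S_k) \cos\left[ n \left(k - \tfrac{1}{2}\right) \frac{\pi}{K} \right], \qquad n = 1, \dots, N
```

    With these constants, 700 Hz maps to 2595 log10(2), or about 781 mel.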

    In the future, this is likely to become one of the most prominent technologies in the technical world. This application might not support every command a user wants, but in the future the commands can be extended in range and form, and language support can be extended as well.




    Chapter 8

    8.1 Snapshot of the GUI


