simon

The simon Handbook

by Peter H. Grasch

Copyright © 2008-2010 Peter Grasch

simon is an open source speech recognition solution.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Table of Contents

1. Introduction
2. Overview
    Architecture
    Speech Recognition: Background
        Language Model
        Acoustic Model
    Scenarios
    Base models
        Where to get base models
        Types of base models
            Static base model
            Adapted base model
            User generated model
        Phoneme set issues
3. Guidelines
    Recordings
        Volume
            simon Calibration
            Audacity Calibration
        Silence
        Microphone
        Sample Quality Assurance
4. Using simon
    The simon Main Window
    Required Resources for a Working simon Setup
        Language Model
        Acoustic Model
    First run wizard
        Scenarios
        Base models
        Server
        Sound configuration
        Volume calibration
    Scenarios
        Using scenarios
        Managing scenarios
            Adding a new Scenario
            Edit Scenario
            Delete Scenario
            Import Scenario
            Export Scenario
    Base models
    Vocabulary
        General
            Active Dictionary
            Shadow Dictionary
        Maintaining the Vocabulary
        Adding Words
            Defining the Word
                Manually Selecting a Terminal
                Manually Providing the Phonetic Transcription
            Training the Word
        Editing a word
        Removing a word
        Special Training
        Importing a Dictionary
            HADIFIX Dictionary
            HTK Dictionary
            PLS Dictionary
            SPHINX Dictionary
            Julius Dictionary
    Grammar
        General
        Defining your Grammar
        Import a Grammar
        Renaming Terminals
        Merging Terminals
    Training
        Storage Directories
        Adding Texts
            Add trainings-texts
            Local text files
        On The Fly Training
        Importing Trainings Samples
    Commands
        Executable Commands
            Importing Programs
        Place Commands
            Importing Places
        Shortcut Commands
        Text-Macro Commands
        List Commands
            List Command Display
            Configuring list elements
        Composite Commands
        Desktopgrid
        Input Number
        Dictation
        Artificial Intelligence
        Calculator
        Filter
        Pronunciation Training
        Keyboard
    Configuration
        General Configuration
        Sound Configuration
            Device Configuration
            Voice Activity Detection
            Training settings
            Postprocessing
        Speech Model
        Model Settings
            General
            Extensions
        Recognition
            Server
                General
                Network
            Synchronization and Model Backup
        Actions
            Recognition
            Plugin base font
            Lists
        Social desktop
        Adjusting the recognition parameters manually
5. Questions and Answers
6. Credits and License
A. Installation

List of Tables

2-1. Sample Vocabulary
4-1. Sample Vocabulary
4-2. Sample Vocabulary
4-3. Improved Sample Vocabulary
4-4. Improved Sample Vocabulary
4-5. Julius Configuration Files

Chapter 1. Introduction

simon is the main front end for the simon open source speech recognition solution. It is a client of the simond server and provides a graphical user interface for managing the speech model and the commands. simon also executes commands based on the recognition results it receives from simond.

In contrast to existing commercial offerings, simon takes a do-it-yourself approach to speech recognition. It does not ship with predefined, pre-trained speech models. Instead, it provides an easy-to-use interface for creating language and acoustic models from scratch.

In addition, end users can easily download use cases created by other users and share their own.

The current release can be used to set up command-and-control solutions that are especially suitable for disabled people. However, because of the amount of training required, continuous, free dictation is neither supported nor practical with current versions of simon.

Because of its architecture, the same version of simon can be used with all languages and dialects. One can even mix languages within one model if necessary.

Chapter 2. Overview

Architecture

The main recognition architecture of simon consists of three applications.

• simon

This is the main graphical interface.

It acts as a client to the simond server.

• simond

The recognition server.

• ksimond

A graphical front-end for simond.

These three components form a true client / server solution for the recognition: one server (simond) serves one or more clients (simon, this application). ksimond is just a front end for simond; it adds no functionality to the system but provides a way to interact with simond graphically.

In addition to simon, simond and ksimond, other, more specialized applications are also part of the integrated simon distribution.

• sam

Provides more in-depth control over your speech model and allows you to test the acoustic model.

• ssc / sscd

These two applications make it easier to collect large amounts of speech samples from different people.

Please refer to the individual handbooks of those applications for more details.

simon is used to create and maintain a representation of your pronunciation and language. This representation is then sent to the server simond which compiles it into a usable speech model.

simon then records sound from the microphone and transmits it to the server which runs the recognition on the received input stream. simond sends the recognition result back to the client (simon).

simon then uses this recognition result to execute commands like opening programs, following links, etc.

simond identifies its connections with a user / password combination which is completely independent from the underlying operating system and its users. By default a standard user is set up in both simon and simond so the typical use case of one simond server per simon client will work "out of the box".

Every simon client logs onto the server with a user / password combination which identifies a unique user and thus a unique speech model. Every user maintains their own speech model but may use it from different computers (different physical simon instances) simply by accessing the same simond server. One simond instance can of course also serve multiple users.

If you want to open up the server to the internet or use multiple users on one server, you will have to configure simond. Please see the simond manual (help:/simond) for details.
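The division of work described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class `ToySimond`, its methods and the user names are hypothetical, recognition is reduced to a text lookup, and nothing here reflects the real simond network protocol. It merely shows that accounts are kept by the server and that each account owns exactly one speech model, regardless of which computer the client runs on.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToySimond:
    """Toy stand-in for simond: accounts and one speech model per user."""

    accounts: dict[str, str] = field(default_factory=dict)     # user -> password
    models: dict[str, set[str]] = field(default_factory=dict)  # user -> known words

    def add_user(self, user: str, password: str) -> None:
        self.accounts[user] = password
        self.models[user] = set()

    def login(self, user: str, password: str) -> bool:
        # Accounts are independent of the operating system's users.
        return self.accounts.get(user) == password

    def compile_model(self, user: str, words: list[str]) -> None:
        # The real server compiles a speech model; this toy only stores words.
        self.models[user] = {word.lower() for word in words}

    def recognize(self, user: str, heard: str) -> str | None:
        # Real recognition works on audio; this toy matches plain text.
        word = heard.lower()
        return word if word in self.models[user] else None


server = ToySimond()
server.add_user("alice", "secret")
server.add_user("bob", "hunter2")

# Client on computer A: log in and send the model description.
logged_in = server.login("alice", "secret")
server.compile_model("alice", ["Computer", "Internet", "Mail"])

# Client on computer B: same account, therefore the same speech model.
result_alice = server.recognize("alice", "Mail")

# A different user has a separate (here still empty) model.
result_bob = server.recognize("bob", "Mail")
print(logged_in, result_alice, result_bob)
```

In this sketch the client only hands over a word list and receives a result; executing a command for the returned word would remain the client's job, as in the description above.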

Speech Recognition: Background

Speech recognition systems take voice input (often from a microphone) and try to translate it into written text. To do that, they rely on statistical representations of the human voice. Put simply, the computer learns how words, or more precisely the sounds that make up those words, sound.

A speech model consists of two distinct parts:

• Language Model

• Acoustic Model

Language Model

The language model defines the vocabulary and the grammar you want to use.

For more information see the vocabulary section and the grammar section.

Acoustic Model

The acoustic model represents your pronunciation in a machine readable format.

Let's look at the following sample vocabulary:

Table 2-1. Sample Vocabulary

Word      Terminal  Pronunciation
Computer  Noun      k ax m p y uw t er
Internet  Noun      ih n t er n eh t
Mail      Noun      m ey l
close     Verb      k l ow s

The pronunciation of each word is composed of individual sounds which are separated by spaces. For example, the word "close" consists of the following sounds:

• k

• l

• ow

• s

The acoustic model uses the fact that spoken words are composed of sounds much like written words are composed of letters. Using this knowledge, we can segment words into sounds (represented by the pronunciation) and assemble them back when recognizing. These building blocks are called "phonemes".

Because the acoustic model represents how you speak the phonemes of the words, training material is shared among all words that use the same phonemes.

That means that if you add the word "clothes" to the language model, your acoustic model already has an idea of how the "clo" part is going to sound, because "close" and "clothes" share the same phonemes ("k", "l", "ow") at the beginning.
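The sharing of training material between words can be made concrete with a short sketch. It uses the sample vocabulary from Table 2-1; the transcription given for "clothes" ("k l ow dh z") is a hypothetical one chosen only for this illustration, and the helper names are not part of simon.

```python
# Sample vocabulary from Table 2-1: word -> phonetic transcription.
vocabulary = {
    "Computer": "k ax m p y uw t er",
    "Internet": "ih n t er n eh t",
    "Mail": "m ey l",
    "close": "k l ow s",
}

# Hypothetical transcription, chosen only for this illustration.
new_word, new_transcription = "clothes", "k l ow dh z"


def phonemes(transcription):
    """Split a transcription into its individual phonemes."""
    return transcription.split()


def shared_prefix(first, second):
    """Return the leading phonemes that two transcriptions have in common."""
    common = []
    for a, b in zip(phonemes(first), phonemes(second)):
        if a != b:
            break
        common.append(a)
    return common


# Phonemes that the existing words already cover.
trained = {p for t in vocabulary.values() for p in phonemes(t)}

prefix = shared_prefix(vocabulary["close"], new_transcription)
untrained = [p for p in phonemes(new_transcription) if p not in trained]

print(prefix)     # phonemes "clothes" shares with "close" at the beginning
print(untrained)  # phonemes no existing word has provided training data for
```

Under these assumptions, only the phonemes in `untrained` are entirely new to the acoustic model; everything else in the new word benefits from recordings of the other words.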

To train the acoustic model (in other words, to tell it how you pronounce the phonemes) you have to "train" words from your language model. That means that simon displays a word which you read out loud. Because the word is listed in your vocabulary, simon already knows what phonemes it contains and can thus "learn" from your pronunciation of the word.

Scenarios

A scenario makes up one complete use case of simon. To control Firefox, for example, the user just installs the Firefox scenario.

Each scenario consists of the following components:

• Vocabulary

• Grammar

• Training texts

• Commands

Scenarios only cover the language model of the recognition system; the acoustic model is completely independent.

However, in most cases scenarios are tailored to work best with a specific base model to avoid issues with the phoneme set.

Because scenarios are not bound to a specific acoustic model, they can be shared and exchanged between simon users without problems. To support this community-based exchange, a category for simon scenarios has been created on kde-files.org (http://kde-files.org/index.php?xcontentmode=692). Scenarios are simple text files in XML format, so they are easy to exchange.

For information on how to use scenarios in simon, please refer to the Scenarios section in the Using simon chapter.

Base models

Base models are pre-generated acoustic models, most often speaker-independent, that can be used with simon.

Using base models, the user can greatly reduce or even eliminate the need for personalized training. When using a static base model (see below), installation of the HTK is not necessary.

A base model usable by simon consists of four files, which you will find in the archive when you download a base model from its website.

• hmmdefs

• tiedlist

• macros

• stats

The latter two files, macros and stats, are not required when using a static base model and might, in that case, be replaced with empty files if they are not available.
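The file requirements for a static base model can be checked with a few lines of code. The following is a minimal sketch, not part of simon: the function name `prepare_static_base_model` is hypothetical, and the example runs on a temporary directory with placeholder files rather than a real model.

```python
import tempfile
from pathlib import Path

# File names as listed above.
REQUIRED = ("hmmdefs", "tiedlist")
OPTIONAL_FOR_STATIC = ("macros", "stats")


def prepare_static_base_model(directory):
    """Check the required files and create empty stand-ins for missing optional ones.

    Returns the names of the files that had to be created.
    """
    directory = Path(directory)
    missing = [name for name in REQUIRED if not (directory / name).is_file()]
    if missing:
        raise FileNotFoundError("missing base model files: " + ", ".join(missing))

    created = []
    for name in OPTIONAL_FOR_STATIC:
        path = directory / name
        if not path.exists():
            path.touch()  # an empty file is enough for a static base model
            created.append(name)
    return created


# Demonstration on a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "hmmdefs").write_text("placeholder")
    (model_dir / "tiedlist").write_text("placeholder")
    created = prepare_static_base_model(model_dir)
    present = sorted(p.name for p in model_dir.iterdir())

print(created, present)
```

Empty stand-ins are only appropriate for static base models; adapted base models are not covered by this sketch.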

Where to get base models

An up-to-date list of sources is maintained in our online wiki (http://www.simon-listens.org/wiki/index.php/English:_Base_models#Where_to_get_base_models).

Types of base models

There are three types of base models:

• Static base model

• Adapted base model

• User generated model

For information on how to use base models in simon, please refer to the Base models section in the Using simon chapter.

Static base model

Static base models simply use a pre-compiled acoustic model without modifying it.

Any training data collected through simon will not be used to improve the recognition accuracy.

This type of model does not require the HTK to be installed.

Adapted base model

Adapting a pre-compiled acoustic model to your voice can improve accuracy.

Collected training data will be compiled into an adaptation matrix which will then be applied to the selected base model.

This type of model does require the HTK to be installed.

User generated model

When using user generated models, the user is responsible for training their own model. No base model will be used.

The training data will be used to compile your own acoustic model allowing you to create a system which directly reflects your voice.

This type of model does require the HTK to be installed.
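The differences between the three model types can be summarized as data. The sketch below only restates the properties described above; the `ModelType` class and the `available_types` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelType:
    name: str
    uses_base_model: bool
    uses_training_data: bool
    requires_htk: bool


MODEL_TYPES = [
    ModelType("static base model", uses_base_model=True,
              uses_training_data=False, requires_htk=False),
    ModelType("adapted base model", uses_base_model=True,
              uses_training_data=True, requires_htk=True),
    ModelType("user generated model", uses_base_model=False,
              uses_training_data=True, requires_htk=True),
]


def available_types(htk_installed, have_base_model):
    """List the model types that are possible with the given prerequisites."""
    return [
        m.name
        for m in MODEL_TYPES
        if (htk_installed or not m.requires_htk)
        and (have_base_model or not m.uses_base_model)
    ]


without_htk = available_types(htk_installed=False, have_base_model=True)
without_base = available_types(htk_installed=True, have_base_model=False)
everything = available_types(htk_installed=True, have_base_model=True)
print(without_htk, without_base, everything)
```

Without the HTK, only the static base model remains; without a base model, only the user generated model does.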

Phoneme set issues

Because the statistical comparison happens at the phoneme level, the base models describe how these phonemes sound.

Your scenarios (language model) on the other hand describe what phonemes a word is composed of.

In order for this association to work, both your scenarios and your base model need to use the same set of phonemes.

If you design a new scenario, it is therefore a good idea to use the dictionary that was used to create the base model as the shadow dictionary. This way, simon will automatically suggest the "correct" phonemes when you add words.

If you try to use scenarios designed for a different phoneme set (a different base model), you will get an error when starting the recognition that lists the affected phonemes and words. To fix this, either transcribe the words according to the base model's phoneme set, use a different base model, or use a user generated model.
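The compatibility check described above amounts to a set comparison. The following sketch is an illustration only: the function `find_unsupported` is a hypothetical name, and the phoneme set assigned to the base model is made up so that one word fails the check.

```python
def find_unsupported(vocabulary, model_phonemes):
    """Map each affected word to the phonemes the base model does not know."""
    problems = {}
    for word, transcription in vocabulary.items():
        unknown = [p for p in transcription.split() if p not in model_phonemes]
        if unknown:
            problems[word] = unknown
    return problems


# Words and transcriptions taken from the sample vocabulary (Table 2-1).
scenario_vocabulary = {
    "Computer": "k ax m p y uw t er",
    "Mail": "m ey l",
    "close": "k l ow s",
}

# Made-up phoneme set of a base model that lacks "ax" and "er".
base_model_phonemes = {"k", "m", "p", "y", "uw", "t", "l", "ow", "s", "ey"}

problems = find_unsupported(scenario_vocabulary, base_model_phonemes)
print(problems)
```

An empty result means every word can be mapped onto the base model; any entry identifies a word that would have to be transcribed again with the base model's phoneme set.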

Chapter 3. Guidelines

This chapter lists some general guidelines that are relevant for different parts of simon.

Recordings

If you are using user generated or adapted models, simon builds its acoustic model from transcribed samples of the user's voice. Because of this, the recorded samples are of vital importance for the recognition performance.

Volume

It is important that you check your microphone volume before recording any samples.

simon Calibration

The current version of simon includes a simple way of ensuring that your volume is configured correctly.

By default the volume calibration is displayed before starting any recording in simon.

To calibrate, simply read the displayed text aloud.

The calibration monitors the current volume and tells you to either raise or lower it, but you have to make the change manually in your system's audio mixer. Once you have changed the volume in any way (while the calibration is running), press the "Volume changed" button next to the affected device. This will reset the volume status.

During calibration, try to talk normally. Don't yell, but don't be overly quiet either. Take into account that you should generally use the same volume setting for all your training and for the recognition too. You might unconsciously speak a little louder when you are upset or at another time of the day, so try to raise your voice slightly to anticipate this. Samples that are a little too quiet are much better than samples that clip.

In the simon settings, both the text displayed and the levels considered correct can be changed. If you leave the text empty, the default text will be displayed. In the options you can also deactivate the calibration completely. See the training section for more details.

Audacity Calibration

Alternatively you can use an audio editing tool like the free Audacity (http://audacity.sourceforge.net) to monitor the recording volume.

[Figures omitted: recordings that are too quiet, too loud, and at the perfect volume.]

Silence

To help simon with the automatic segmentation it is recommended to leave about one or two seconds of silence on the recording before and after reading the prompted text.

Current simon versions display a visual cue during recording that shows when to speak. The message first tells the user to wait for one second before telling the user to speak.

This method of visual feedback proved especially valuable when recording with people who can't read the prompted text for themselves and therefore need someone to tell them what they have to say. The colorful visual cue tells them when to start repeating what the facilitator said without the need of unreliable hand gestures.

Microphone

For simon to work well, a high quality microphone is recommended.

However, even relatively cheap headsets (around 30 Euros) achieve very good results, far better than internal microphones.

For maximum compatibility we recommend USB headsets: they usually support the necessary sample rate of 16 kHz, are very well supported by both Microsoft Windows and GNU/Linux, and normally don't require special, proprietary drivers to operate.
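One way to verify that a setup really records at 16 kHz is to inspect a recorded WAV file. The sketch below uses Python's standard `wave` module; the helper names are hypothetical, and the two test files are generated silence rather than real recordings. It checks a file after recording and says nothing about what a device could support.

```python
import tempfile
import wave
from pathlib import Path

REQUIRED_RATE_HZ = 16000  # sample rate named above


def sample_rate(path):
    """Return the sample rate stored in the header of a WAV file."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def write_silence(path, rate_hz, seconds=1):
    """Write a mono, 16-bit WAV file containing only silence (test data)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.wav"
    bad = Path(tmp) / "bad.wav"
    write_silence(good, 16000)
    write_silence(bad, 8000)
    good_ok = sample_rate(good) == REQUIRED_RATE_HZ
    bad_ok = sample_rate(bad) == REQUIRED_RATE_HZ

print(good_ok, bad_ok)
```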

Sample Quality Assurance

simon will check each recording against certain criteria to ensure that the recorded samples are not erroneous or of poor quality.

If simon detects a problematic sample, it will warn the user to re-record the sample.

Currently, simon checks the following criteria:

• Sample peak volume

If the volume is too loud and the microphone starts to "clip" (see Clipping on Wikipedia (http://en.wikipedia.org/wiki/Clipping_%28audio%29)), simon will display a warning message urging the user to lower the microphone volume.

• Signal to noise ratio (SNR)

simon will automatically determine the signal to noise ratio of each recording. If the ratio is below a configurable threshold, a warning message will be displayed.

The default value of 5000 % means that for simon to accept a sample as correctly recorded, the peak volume has to be 50 times louder than the noise baseline (lowest average over 50 ms).

A ratio below the threshold is often the result of a very low quality microphone, high levels of ambient noise, or a low microphone gain coupled with a "microphone boost" option in the system mixer.

Figure: SNR warning message triggered by an empty sample. This information dialog is displayed when clicking on the "More information" button visible in the background.
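The two criteria can be sketched in a few lines. This is not simon's implementation: it assumes 16-bit samples given as a list of integers, takes the noise baseline to be the lowest mean absolute amplitude over consecutive 50 ms windows, and reads the threshold as peak divided by baseline, expressed in percent. All function and parameter names are hypothetical.

```python
def noise_baseline(samples, rate_hz=16000, window_ms=50):
    """Lowest mean absolute amplitude over consecutive windows (assumed definition)."""
    if not samples:
        return 0.0
    size = max(1, rate_hz * window_ms // 1000)
    if len(samples) <= size:
        windows = [samples]
    else:
        windows = [samples[i:i + size]
                   for i in range(0, len(samples) - size + 1, size)]
    return min(sum(abs(s) for s in w) / len(w) for w in windows)


def check_sample(samples, rate_hz=16000, min_snr_percent=5000, full_scale=32767):
    """Return a list of warnings for one recording."""
    warnings = []
    peak = max((abs(s) for s in samples), default=0)
    if peak >= full_scale:
        warnings.append("clipping")

    baseline = noise_baseline(samples, rate_hz)
    if baseline > 0:
        snr_percent = 100 * peak / baseline
    else:
        # No measurable noise: fine if there is a signal, useless if empty.
        snr_percent = float("inf") if peak > 0 else 0.0
    if snr_percent < min_snr_percent:
        warnings.append("low SNR")
    return warnings


# Synthetic one-second segments at 16 kHz.
quiet = [10, -10] * 8000
noise = [500, -500] * 8000
speech = [8000, -8000] * 8000
loud = [32767, -32767] * 8000

good = check_sample(quiet + speech + quiet)
noisy = check_sample(noise + speech + noise)
clipped = check_sample(quiet + loud + quiet)
empty = check_sample([0] * 16000)
print(good, noisy, clipped, empty)
```

In this sketch an empty recording fails the SNR check because its peak is zero, which matches the situation shown in the figure above.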

Chapter 4. Using simon

The following sections describe how to use simon.

The simon Main Window

The simon main window provides quick access to most of its features through the main toolbar.

There are 9 main actions listed:

• simond connection

This menu item has several states:

• Connect

When simon is not connected to simond the option says "Connect". When activated, simon will start to connect to simond and change to the "Connecting" state.

When simon connects to the server for the first time, you might be prompted for a username and a password. If you haven't done so already, set up a user for simond (see the simond manual (help:/simond) for details) before continuing, then enter the same username and the corresponding password in simon's login dialog. If you choose to store the password, you can still change it in the server configuration at any time.


• Connecting

When simon is currently connecting to the configured simond server(s) the option says "Connecting" and is pressed down. When activated, simon will stop trying to connect to simond and go back to the "Connect" state.

• Activate

When simon has established a connection to the server, the option says "Activate" and is not pressed down. When activated (or automatically, if simon is configured to start the recognition as soon as it is available), simon will try to start the recognition.

An option to close the connection to simond ("Disconnect") is available through the small down-arrow next to it.

• Activated

When simon has established a connection to the server and has successfully started the recognition, the option says "Activated" and is pressed down. When activated, simon will deactivate the recognition but not close the connection to simond; it changes back to the previous "Activate" state.

An option to close the connection to simond ("Disconnect") is available through the small down-arrow next to it.

• Add Word

Displays the add word wizard.

• Vocabulary

Displays the vocabulary.

• Grammar

Displays the grammar.

• Training

Displays the training.

• Commands

Displays the commands.

• Synchronize

This option is available when simon is connected to simond.

simon creates the speech input files which are then compiled and used by the simond server (see the section architecture for more details).

The process of sending the speech input files, compiling them and receiving the compiled versions is called "synchronization". By default, simon will initiate a synchronization immediately after the connection has been established and whenever the model changes (please see the configure synchronization section for information on how to change that).

Using this menu option the synchronization can be triggered manually at any time.

• Scenario selection

This selection box allows you to select the currently displayed scenario. Each subsection (vocabulary, grammar, commands, training) will then adapt to the currently displayed scenario. Selecting a different scenario here does not affect the recognition.

• Manage scenarios

Shows the scenario management dialog. There you can manage your scenarios and change the options of the scenario selection box.
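The four states of the simond connection action described earlier in this list behave like a small state machine. The following sketch only illustrates that reading: the event names ("click", "connected", "disconnect") are hypothetical, it assumes that starting the recognition succeeds, and it assumes that "Disconnect" leads back to the "Connect" state.

```python
# (state, event) -> next state; event names are invented for this sketch
TRANSITIONS = {
    ("Connect", "click"): "Connecting",       # start connecting to simond
    ("Connecting", "click"): "Connect",       # abort the connection attempt
    ("Connecting", "connected"): "Activate",  # connection established
    ("Activate", "click"): "Activated",       # recognition started (assumed success)
    ("Activated", "click"): "Activate",       # recognition stopped, still connected
    ("Activate", "disconnect"): "Connect",    # via the small down-arrow
    ("Activated", "disconnect"): "Connect",
}


def next_state(state, event):
    """Return the follow-up state; unknown combinations leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Automatic activation after connecting is not modelled here; it would correspond to a "click" event issued immediately after "connected".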

The simon main window can be hidden at any time by clicking on the simon logo in the system tray (usually next to the system clock in the task bar) which will minimize simon to the tray. Click it again to show the main window again.

Required Resources for a Working simon Setup

Note: For more information about speech models, please refer to the Speech Recognition: Background section in the Overview chapter.

To get simon to recognize speech and react to it you need a speech model.

Speech models describe how your voice sounds, what words exist, how they sound and what word combinations ("sentences" or "structures") exist.

A speech model basically consists of two parts:

• Language model: Describes all existing words and what sentences are grammatically correct

• Acoustic model: Describes how words sound

You need both these components to get simon to recognize your voice.

Language Model

In most cases you only need to install the appropriate scenario for your use case to set up your language model.

To create your own language model, you can use simon to add / edit / remove words and grammar structures.

To make the adding of words easier, you can import a shadow dictionary.

Acoustic Model

To create your own acoustic model you can simply read the trainings-texts that come with your selected scenarios a couple of times.

If you are creating your own scenario you can easily create trainings-texts yourself.

You can, however, use static or adapted base models to avoid using the HTK or to improve the recognition rate.

First run wizard

On the first start of simon, this wizard is displayed to guide you through the initial configuration of simon.



The configuration consists of five easy steps which are outlined below. You can skip each step and even the whole wizard if you want to - in that case, the system will be set up with default values.

Scenarios

In this step you can download scenarios from the internet and import them into simon.


Pressing Get scenarios displays the download dialog.



If you import some scenarios here (or later on in the scenario management dialog) you don't need to set up the vocabulary, grammar, commands, etc. yourself. Especially for new users it is recommended to try some scenarios first to see how the system works before diving into configuring it exactly for your use case.

Base models

In this step you can set up simon to use base models.


The configuration page opened is the same one that is described in the base model usage section.



After completing or aborting the first run wizard you can change configuration options defined here in the simon configuration.

Server

Internally, simon is a server / client application. If you want to take advantage of a network based installation, you can provide the server address here.


The default configuration is sufficient for a "normal" installation and assumes that you use a local simond server that is automatically started and stopped with simon.

After completing or aborting the first run wizard you can change configuration options defined here in the server configuration.

Sound configuration

Because simon recognizes sound from one or more microphones, you have to tell simon which devices you want to use for recognition and training.



simon can use one or more input and output devices for different tasks. You can find more information about simon's multiple device capabilities in the simon sound configuration section.

If you don't set at least one input device to be used for recognition, you will not be able to activate simon.

When the option Default to power training is selected, simon will, when training, automatically start and stop the recording when displaying and hiding (respectively) the recording prompt. This option only sets the default value; the user can change it at any time before beginning a training session.

After completing or aborting the first run wizard you can change configuration options defined here in the sound configuration.

Volume calibration

For simon to work correctly, you need to configure your microphone's volume to a sensible level.


For more details on this, please see the Volume Calibration section in the Guidelines chapter.

Scenarios

This section describes how to use scenarios from within simon. For general information about scenarios, please refer to the chapter background.

Using scenarios

Beginning with simon 0.3, each word you add will be added to the currently active scenario. The same goes for grammar sentences, commands, etc.

Using scenarios is therefore no different from using simon as you did in 0.2.

Per default, simon ships with an empty scenario named "Standard", so your configuration will be stored in this scenario.

To select which of your currently active scenarios should be changed (for example before adding new words), just select it from the drop down list in the upper right corner of the simon main window.



To change the available options, click on the Manage scenarios button right next to it or use the menu entry Scenarios > Manage scenarios.

Managing scenarios

The scenario management dialog allows you to load scenarios from your scenario pool as well as to import and export scenarios to files or directly from / to an online repository.

To load or unload a scenario you can use the arrow buttons between the two lists or simply double click the option you want to load / unload.

More information about individual scenarios can be found in the tooltips of the list items.


Adding a new Scenario

To add a new scenario, select the Add button. A new dialog will be displayed.

When creating a new scenario, please give it a descriptive name. For the later upload on KDE files (http://kde-files.org/index.php?xcontentmode=692) we would kindly ask you to follow a certain naming scheme, although this is of course not a requirement: "[<language>/<base model>] <name>". If, for example, you create a scenario in English that works with the Voxforge base model and controls Mozilla Firefox this becomes: "[EN/VF] Firefox". If your scenario is not specifically tailored to one phoneme set (base model), just omit the second tag like this: "[EN] Firefox".
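The suggested naming scheme can be expressed as a small helper. The function and its parameter names are hypothetical and only restate the scheme; simon itself does not enforce it.

```python
def scenario_name(language, name, base_model=None):
    """Build "[<language>/<base model>] <name>", omitting the base model tag if unset."""
    tags = language if base_model is None else f"{language}/{base_model}"
    return f"[{tags}] {name}"


print(scenario_name("EN", "Firefox", base_model="VF"))
print(scenario_name("EN", "Firefox"))
```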

The scenario version is just an incremental version number that makes it easier to distinguish between different revisions of a scenario.

If your scenario needs a specific feature of simon (for example because you use a new plugin), you can define minimum and maximum version numbers of simon here.

The license of your scenario can be set through the drop down. You can of course also add an arbitrary license text directly in the input field.

You can then add your name (or alias) to the list of scenario authors. There you will also be asked for contact information. This field is purely provided as a convenient way to contact a scenario author for changes, problems, fanmail etc. If you don't feel comfortable providing your e-Mail address you can simply enter a dash "-" denoting that you are not willing to divulge this information.



Edit Scenario

To edit scenarios, just select "Edit" from the "Manage scenarios" dialog.

The dialog works exactly the same as the add scenario dialog.

Delete Scenario

To delete a scenario, select the scenario and click the "Delete" button.

Because scenarios are synchronized with the recognition server, you can restore deleted scenarios through the model synchronization backup.

Import Scenario

Scenarios can be imported from a local file in simon's XML scenario file format but can also be directly downloaded and imported from the internet.

When downloading scenarios, the list of scenarios is retrieved from the simon Scenarios (http://kde-files.org/index.php?xcontentmode=692) subsection of the OpenDesktop site KDE-files.org (http://kde-files.org).

If you create a scenario that might be valuable for other simon users, please consider uploading it to this online repository and help other simon users.

Export Scenario

Scenarios can be exported to a local file in simon's XML scenario file format and directly uploaded to the simon Scenarios (http://kde-files.org/index.php?xcontentmode=692) subsection of the OpenDesktop site KDE-files.org (http://kde-files.org).

To upload to OpenDesktop sites, you need an account on the site. Registration (http://opendesktop.org/usermanager/new.php) is very easy and of course free of charge.

simon allows you to upload new content directly from within simon (Export > Publish).



To use this functionality, simply enter your account credentials in the social desktop configuration in the simon configuration.

Base models

This section describes how to use base models from within simon. For general information about base models, please refer to the chapter background.

To configure simon to use base models, simply select the base model type you want to use and point simon to the valid files in simon's configuration: Settings > Configure simon > Model Settings


Load the files with the appropriate "Load" button next to the file you want to set. The files will be copied to an internal location so the source files can be removed once you have selected them here.

For static models you don't need macros or stats but simon will not start a model compilation (which is needed even for setups using static base models to generate the language model) without them. If your base model doesn't provide them you can simply point simon to empty files instead.

Vocabulary

The vocabulary lets the user manage the available words.

General

The vocabulary defines what words the recognition process should recognize. Every word you want to be able to use with simon should be contained in your vocabulary.

One entry in the vocabulary defines exactly one "word". In contrast to the common use of the word "word", in simon "word" means one unique combination of the following:

• Wordname

(The written word itself)

• Terminal

(Grammatical category; For example: "Noun", "Verb", etc.)

• Pronunciation



(How the word is pronounced; simon accepts any kind of phonetics as long as it does not use special characters or numbers)

That means that plurals or even different cases are different "words" to simon. This is an important design decision to allow more control when using a sophisticated grammar.

In general, it is advisable to keep your vocabulary as sleek as possible. The more words, the higher the chance that simon might misunderstand you.

Example vocabulary (please note that the terminals here are deliberately set to Noun / Verb to aid understanding; please refer to the grammar section for why this might not be the best idea):


Word      Terminal  Pronunciation
Computer  Noun      k ax m p y uw t er
Internet  Noun      ih n t er n eh t
Mail      Noun      m ey l
close     Verb      k l ow s
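The idea that a "word" is one unique combination of word name, terminal and pronunciation can be modelled as a simple value type. This is only an illustrative data structure, not simon's internal representation; the class and field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One vocabulary entry; equality covers all three fields."""
    name: str
    terminal: str
    pronunciation: str


# The example vocabulary from the table above
vocabulary = {
    Word("Computer", "Noun", "k ax m p y uw t er"),
    Word("Internet", "Noun", "ih n t er n eh t"),
    Word("Mail", "Noun", "m ey l"),
    Word("close", "Verb", "k l ow s"),
}

# Same spelling and pronunciation, different terminal: a separate "word"
vocabulary.add(Word("close", "Noun", "k l ow s"))
# An exact duplicate does not create a new entry
vocabulary.add(Word("Mail", "Noun", "m ey l"))
```

Using a set of immutable entries makes the "unique combination" rule explicit: two entries are the same word only if all three fields match.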

Active Dictionary

The vocabulary used for the recognition is referred to as active dictionary or active vocabulary.

Shadow Dictionary

As said above, the user should keep his vocabulary / dictionary as lean as possible. However, as a word in your vocabulary also has to have information about its pronunciation, it would also be good to have a large dictionary where you could look up the pronunciation and other characteristics of the words.

simon provides this functionality. We refer to this large reference dictionary as "shadow dictionary". This shadow dictionary is not created by the user but can be imported from various sources.

As simon is a multi-language solution we do not ship shadow dictionaries with simon. However, it is very easy to import them yourself using the import dictionary wizard. This is described in the Import Dictionary section.

Maintaining the Vocabulary

simon provides a "Vocabulary" menu which lists the current vocabulary.


Per default, the active vocabulary is shown. To display the shadow vocabulary select the tab Shadow Vocabulary.

Every word states its "recognition rate" which at the moment is just a counter of how often the word has been recorded (alone or together with other words).

When this number is only one or zero the word entry is colored red (1: light red; 0: dark red). This is a visual warning. When a word contains a phoneme combination that is not covered by any other word and the word with this unusual phoneme combination is never recorded (recognition rate = 0), the model will fail to compile. However, simon will display an appropriate error message when the compilation of the model fails because of such an issue. In general it is a good idea to record each word at least once or twice (at best when adding the word) to avoid such problems.

Because the shadow dictionary is not used for the recognition there are of course no trainings-samples for words in the shadow dictionary. So don't be alarmed if all the entries in the shadow dictionary are colored dark red - this is perfectly normal.



Adding Words

To add new words to the active vocabulary, use the add word wizard.

Adding words to simon is basically a two step procedure:

• Defining the word

• Initial training

Defining the Word

Firstly, the user is asked which word he wants to add.


When the user proceeds to the next page, simon automatically tries to find as much information about the word in the shadow dictionary as possible.

If the word is listed in the shadow dictionary, simon automatically fills out all the needed fields (Terminal and Pronunciation).



All suggestions from the shadow dictionary are listed in the table "Similar words". Per default only exact word matches are shown. However, this can be changed by checking the "Also show non-perfect matches" checkbox below the suggestion table. Using similar words you can quickly deduce the correct pronunciation of the word you are actually trying to add. See below for details.

Of course this really depends on your shadow dictionary. If the shadow dictionary does not contain the word you are trying to add, the required fields have to be filled out manually.

Some dictionaries that can be imported with simon (SPHINX, HTK) do not differentiate between upper and lower case. Suggestions based on those dictionaries will always be uppercase. You are of course free to change these suggestions to the correct case.

Some dictionaries that can be imported with simon (SPHINX, PLS and HTK) provide no grammatical information at all. These will assign all the words to the terminal "Unknown". You should change this to something appropriate when adding those words.

Manually Selecting a Terminal

The terminal of the word is defined as the grammatical category the word belongs to. This might be "Noun", "Verb" or completely new categories like "Command". For more information see the grammar section.

The list contains all terminals used in both your active and your shadow lexicon and in your grammar.

You can add new terminals to the drop-down menu by using the green plus sign next to it.

Manually Providing the Phonetic Transcription

The pronunciation is a bit trickier. simon does not need a certain type of phonetics so you are free to use any method as long as it uses only ASCII characters and no numbers. However, if you want to use a shadow dictionary and want to use it to its full potential you should use the same phonetics as the shadow dictionary.
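A quick way to check a transcription against this rule is sketched below. The check is stricter than the text requires: it assumes that phonemes are separated by spaces and consist of ASCII letters only, which matches the examples in this handbook but is an assumption about what counts as a "special character".

```python
def is_valid_transcription(transcription):
    """True if every space-separated phoneme consists of ASCII letters only."""
    phonemes = transcription.split()
    if not phonemes:
        return False
    return all(p.isascii() and p.isalpha() for p in phonemes)


print(is_valid_transcription("f ay r f ao k s"))  # letters only
print(is_valid_transcription("f ay1 r"))          # contains a digit
```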

If you don't know how to transcribe a word yourself you can easily use your shadow dictionary to help you with the transcription - even if the word is not listed in it. Let's say we want to add the word "Firefox" (to launch firefox) which is of course not listed in our shadow dictionary.

(I imported the English voxforge HTK lexicon available from voxforge (http://voxforge.org/home/downloads) as a shadow dictionary.)

"Firefox" is not listed in our shadow dictionary so we don't get any suggestion at all.

However, we know that firefox sounds like "fire" and "fox" put together. So let's just open the vocabulary (you can keep the wizard open) by selecting "Vocabulary" from your simon main toolbar.

Switch to the shadow vocabulary by clicking on the tab Shadow Vocabulary.

Use the "Filter"-Box above the list to search for "Fire":



We can see that the word "Fire" is transcribed as "f ay r". Now filter for "fox" instead of "Fire" and we can see that "Fox" is transcribed as "f ao k s". We can assume that "Firefox" should be transcribed as "f ay r f ao k s".

Using this approach of deducing the pronunciation from parts of the word has the distinct advantage that we not only get a high quality transcription but also automatically use the same phoneme set as the other words which were correctly pulled out of the shadow dictionary.
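The manual lookup above amounts to concatenating the transcriptions of the parts. The sketch below mimics that with a plain mapping standing in for the shadow dictionary; the function name, the upper-case keys and the two entries are illustrative assumptions, and simon does not offer this as an automatic feature.

```python
# Stand-in for a shadow dictionary; entries taken from the example above
shadow_dictionary = {
    "FIRE": "f ay r",
    "FOX": "f ao k s",
}


def compose_pronunciation(parts, dictionary):
    """Join the transcriptions of the given word parts, in order."""
    return " ".join(dictionary[part.upper()] for part in parts)


print(compose_pronunciation(["fire", "fox"], shadow_dictionary))
```

A part that is missing from the dictionary raises a KeyError, which mirrors the situation where you would have to look for a different similar word.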

We can now enter the pronunciation and change the terminal to something appropriate.


Training the Word

To complete the wizard we can now train the word twice. If you don't want to do this or for example use a static base model, you can skip these two pages.

Because you are about to record some training samples, simon will display the volume calibration to make sure that your microphone is set up correctly. For more information please refer to the volume calibration section.

simon will try to prompt you for real-world examples. To do that, simon will automatically fetch grammar structures using the terminal of the word and substitute the generic terminals with example words from your active lexicon.

For example: You have the grammar structure "Trigger Command" and have the word "Computer" of the terminal "Trigger" in your vocabulary. You then add a new word "Firefox" of the terminal "Command". simon will now automatically prompt you for "Computer Firefox" as it is - according to your grammar - a valid sentence.

If simon is unable to find appropriate sentences using the word (i.e.: No grammar, not enough words in your active lexicon, etc.) it will just prompt you for the word alone.
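The prompt generation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not simon's algorithm: grammar structures are space-separated terminal names, the lexicon maps each terminal to a list of words, and the first available word is always used as the example.

```python
def training_prompts(word, terminal, grammar, lexicon):
    """Suggest sentences containing `word`, falling back to the word alone."""
    prompts = []
    for structure in grammar:
        terminals = structure.split()
        if terminal not in terminals:
            continue
        position = terminals.index(terminal)  # slot taken by the new word
        sentence = []
        for i, t in enumerate(terminals):
            if i == position:
                sentence.append(word)
            elif lexicon.get(t):
                sentence.append(lexicon[t][0])  # example word for this terminal
            else:
                sentence = None  # no word available for this terminal
                break
        if sentence:
            prompts.append(" ".join(sentence))
    return prompts or [word]


grammar = ["Trigger Command"]
lexicon = {"Trigger": ["Computer"]}
print(training_prompts("Firefox", "Command", grammar, lexicon))
```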

Although simon ensures that the automatically generated examples are valid, you can always override its suggestions. Just switch to the "Examples" tab on the "Define Word" page.



You are free to change those examples to anything you like. You can even go so far and use words that are not yet in your active lexicon as long as you add them before you synchronize the model, although this is not recommended.

All that is left is to record the examples.


Make sure you follow the guidelines listed in the recording section.

Editing a word

To edit a word, simply select it from the vocabulary, and click on Edit word.

There you can change name, terminal and pronunciation of the selected word.

Removing a word

To remove a word from your language model, select it in the vocabulary view and click on "Remove selected word".

The dialog offers four choices:

• Move the word to the "Unused" terminal.

Because you (hopefully) don't use the terminal "Unused" in your grammar, the word will no longer be considered for recognition. In fact, it will be removed from the active vocabulary before compiling the model because no grammar sentence references it.

If you want to use the terminal "Unused" in your grammar, you can of course use a different terminal for unused words. Just set the terminal through the Edit word dialog.

To use the word again, just set the right terminal again. No data will be lost.

• Move the word to the shadow lexicon

This will remove the selected word from the active lexicon (and thus from the recognition) but will keep a copy in the shadow vocabulary. All the recordings containing the word will be preserved.

To use the word again, add it again to the active vocabulary. When adding a "new" word with the same name the values of the moved word will be suggested to you. Therefore, no data will be lost.

• Delete the word but keep the samples

Removes the word completely but keeps the associated samples. Whenever you add another word with the same word name the samples will be re-associated.

Be careful with this option as the new word you add again might be transcribed differently and this difference can not be taken into account automatically (simon will then try to force the new transcription on the old recordings during the model compilation).

Do not use this option if the samples you recorded for this word were erroneous.

• Remove the word completely

Just remove the word. All the recordings containing the word will be removed too.

This option leaves no trace of either the word itself or the associated samples.

Because samples are global (not assigned to scenarios), even samples recorded in training sessions of other scenarios might be removed as well if they contain the word. Use this option carefully.

Special Training

Please see the special training section in the training section.

Importing a Dictionary

simon provides the functionality to import large dictionaries as a reference. This reference dictionary is called shadow dictionary.

When the user adds a new word to the model, he has to define the following characteristics to define this word:

• Wordname

• Terminal

• Phonetic definition

These characteristics are taken out of the shadow dictionary if it contains the word in question. A large, high quality shadow dictionary can thus help the user to easily add new words to the model without keeping track of the phoneme set or, in many cases, even let him forget that a phonetic transcription is needed at all.



Since version 0.3 you can also import dictionaries directly to the active dictionary. This option is mostly there to make it easier to move to simon from custom solutions and to encourage importing of older models (for example one used with simon 0.2). You will almost never want to import a very large dictionary as active dictionary.

You can find a list of available dictionaries that work with simon on the simon wiki (http://www.simon-listens.org/wiki/index.php/English:_Shadow_dictionary).

simon is able to import five different types of dictionaries:

• HADIFIX

• HTK

• PLS

• SPHINX

• Julius

HADIFIX Dictionary

simon can import HADIFIX dictionaries.

One example of a HADIFIX dictionary is the German HADIFIX BOMP (http://www.sk.uni-bonn.de/forschung/phonetik/sprachsynthese/bomp).

HADIFIX dictionaries provide both terminals and pronunciation.

Due to a special exemption in their license the simon listens team is proud to be able to offer you the download of the excellent HADIFIX BOMP directly from within simon.

Using the automatic BOMP import you can, after providing name and e-Mail address for the team of the University of Bonn, directly download and import the dictionary from the simon listens server.

HTK Dictionary

simon can import HTK lexica.

One example of a HTK lexicon is the English Voxforge dictionary (http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Lexicon/).

HTK lexica provide pronunciation information but no terminals. All words will be assigned to the terminal "Unknown".

PLS Dictionary

simon can import PLS dictionaries.

One example of a PLS dictionary is the German GPL dictionary from Voxforge (http://www.repository.voxforge1.org/downloads/de/Trunk/Lexicon/).

PLS dictionaries provide pronunciation information but no terminals. All words will be assigned to the terminal "Unknown".

SPHINX Dictionary

simon can import SPHINX dictionaries.



One example of a SPHINX dictionary is this dictionary for Mexican Spanish (http://speech.mty.itesm.mx/~jnolazco/proyectos.htm).

SPHINX dictionaries provide pronunciation information but no terminals. All words will be assigned to the terminal "Unknown".

Julius Dictionary

simon can import Julius vocabularies.

One example of a Julius vocabulary is the word lists of simon 0.2.

Julius dictionaries provide pronunciation information as well as terminal information.

Grammar

The grammar defines which combinations of words are correct.

General

Let's look at an example: You want to use simon to launch programs and close those windows when you are done. You would like to use the following commands:

• "Computer, Internet"

To open a browser

• "Computer, Mail"

To open a mail client

• "Computer, close"

To close the current window

Following English grammar, your vocabulary would contain the following:


Word      Terminal
Computer  Noun
Internet  Noun
Mail      Noun
close     Verb

To allow the sentences defined above simon would need the following grammar:

• "Noun Noun" for sentences like "Computer Internet"

• "Noun Verb" for sentences like "Computer close"

While this would work, it would also allow the combinations "Computer Computer", "Internet Computer", "Internet Internet", etc. which are obviously bogus. To improve the recognition accuracy, we can try to create a grammar that better reflects what we are trying to do with simon.

It is important to remember that you define your own "language" when using simon. That means that you are not bound to grammar rules that exist in whatever language you want to use simon with. For a simple command and control use-case it would for example be advisable to invent new grammatical rules to eliminate the differences between different commands imposed by grammatical information not relevant for this use case.

In the example above it is for example not relevant that "close" is a verb or that "Computer" and "Internet" are nouns. Instead, why not define them as something that better reflects what we want them to be:

Table 4-3. Improved Sample Vocabulary

Word      Terminal
Computer  Trigger
Internet  Command
Mail      Command
close     Command

Now we change the grammar to the following:

• "Trigger Command"

This allows all the combinations described above. However, it also limits the possibilities to exactly those three sentences. Especially in larger models a well thought-out grammar and vocabulary can make a huge difference in recognition results.
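The difference between the two grammars can be made concrete by enumerating every sentence each of them allows. The helper below is only an illustration of that counting argument; its name and the data layout (a list of word / terminal pairs) are assumptions.

```python
from itertools import product


def expand(grammar, vocabulary):
    """List every sentence allowed by the grammar for the given vocabulary."""
    by_terminal = {}
    for word, terminal in vocabulary:
        by_terminal.setdefault(terminal, []).append(word)
    sentences = []
    for structure in grammar:
        slots = [by_terminal.get(t, []) for t in structure.split()]
        sentences.extend(" ".join(words) for words in product(*slots))
    return sentences


naive_vocabulary = [("Computer", "Noun"), ("Internet", "Noun"),
                    ("Mail", "Noun"), ("close", "Verb")]
improved_vocabulary = [("Computer", "Trigger"), ("Internet", "Command"),
                       ("Mail", "Command"), ("close", "Command")]

naive = expand(["Noun Noun", "Noun Verb"], naive_vocabulary)
improved = expand(["Trigger Command"], improved_vocabulary)
print(len(naive), len(improved))
```

With the Noun / Verb terminals the grammar admits twelve sentences, including "Computer Computer", whereas "Trigger Command" admits only the three intended ones.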

Defining your Grammar

simon provides an easy to use text based interface to change the grammar. You can simply list all the allowed sentences (without any punctuation marks, obviously) as described above.



When selecting a sentence on the left, the right pane will automatically show possible real sentences built from the words of your vocabulary.

The example section will list at most 35 examples so if more than that amount of sentences match the selected grammar entry, the list might not be complete.

Import a Grammar

In addition to simply entering your desired grammar sentence by sentence, simon is able to automatically deduce allowed grammar structures by reading plain text using the Import Grammar wizard.


simon can read and import text files but also provides an input field if you want to simply type the text into simon.

Say we have a vocabulary like in the general section above:

Table 4-4. Improved Sample Vocabulary

Word      Terminal
Computer  Trigger
Internet  Command
Mail      Command
close     Command

We want simon to recognize the sentence "Computer Internet!". So we either enter the text using the Import text option or create a simple text file with this content "Computer Internet!" (any punctuation mark would work) and save it as "simongrammar.txt" to use the Import files option.


simon will then read the entered text or all the given text files (in this case the only given text file is "simongrammar.txt") and look up every single word in both active and shadow dictionary (the definition in the active dictionary has more importance if the word is available in both). It will then replace the word with its terminal.

In our example this would mean that simon would find the sentence "Computer Internet". simon would find out that "Computer" is of the terminal "Trigger" and "Internet" of the terminal "Command". Because of this simon would "learn" that "Trigger Command" is a valid sentence and add it to its grammar.

The import automatically segments the input text by punctuation marks (".", " - ", "!", etc.) so any natural text should work. The importer will automatically merge duplicate sentence structures (even across different files) and add multiple sentences (all possible combinations) when a word has multiple terminals assigned to it.

The import will ignore sentences where one or more words could not be found in the language model unless you tick the "Also import unknown sentences" checkbox in which case those words are replaced with "Unknown".
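The import steps described in this section can be sketched as follows. This is a simplified model rather than simon's importer: the set of punctuation marks used for segmentation, the exact-match (case-sensitive) lookup and the function name are assumptions, and each dictionary is represented as a mapping from a word to its list of terminals.

```python
import re
from itertools import product


def import_grammar(text, active, shadow, import_unknown=False):
    """Derive grammar structures (terminal sequences) from plain text."""
    structures = []
    # Segment by punctuation marks; " - " counts as a separator
    for sentence in re.split(r"[.!?;:]+|\s-\s", text):
        words = re.findall(r"\w+", sentence)
        if not words:
            continue
        options = []
        for word in words:
            # The active dictionary takes precedence over the shadow dictionary
            terminals = active.get(word) or shadow.get(word)
            if not terminals:
                if not import_unknown:
                    options = None  # ignore the whole sentence
                    break
                terminals = ["Unknown"]
            options.append(terminals)
        if options is None:
            continue
        # One structure per combination of terminals; duplicates are merged
        for combination in product(*options):
            structure = " ".join(combination)
            if structure not in structures:
                structures.append(structure)
    return structures


active = {"Computer": ["Trigger"], "Internet": ["Command"]}
print(import_grammar("Computer Internet! Computer Internet.", active, {}))
```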

Renaming Terminals

The rename terminal wizard allows you to rename terminals in your active vocabulary, your shadow dictionary and the grammar.



Merging Terminals

The merge terminal wizard allows you to merge two terminals into one new terminal in your active vocabulary, your shadow dictionary and the grammar.


This functionality is especially useful if you want to simplify your grammar structures.

Training

Using the Training-module, you can improve your acoustic model.

The interface lists all installed trainings-texts in a table consisting of three columns:

• Name

A descriptive name for the text.

• Pages

The number of "pages" the text consists of. Each page represents one recording.

• Recognition Rate

Analogous to the vocabulary; represents how likely simon is to recognize the words (higher is better). The recognition rate of the trainings-text is the average recognition rate of all the words in the text.
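The recognition rate of a trainings-text is a plain average, which can be written down in a few lines. The sketch assumes that every occurrence of a word in the text counts towards the average and that unknown words count as zero; both points are assumptions, as is the function name.

```python
def text_recognition_rate(text_words, word_rates):
    """Average recognition rate over all words of a trainings-text."""
    if not text_words:
        return 0.0
    return sum(word_rates.get(word, 0) for word in text_words) / len(text_words)


rates = {"Computer": 4, "Internet": 2}
print(text_recognition_rate(["Computer", "Internet"], rates))
```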



To improve the acoustic model - and thus the recognition rate - you have to record trainings-texts. This means that simon gets essentially two needed parts:

• Samples of your speech

• Transcriptions of those samples

The active dictionary is used to transcribe the words (mapping them from the actual word to its phonetic transcription) that make up the text so every word contained in the trainings-text you want to read (train) has to be contained in your active dictionary. simon will warn you if this is not the case and provide you with the possibility to add all the missing words in one go.

The procedure is the same as when adding a single word, but the wizard will automatically prompt you for details and recordings for all the missing words. This procedure can be aborted at any time and simon will provide both a way to add the already completely defined words and to undo all changes done so far. When the user has added all the words he is prompted for (all the words missing), the changes to the active dictionary / vocabulary are saved and the training of the previously selected text starts automatically.

The training (reading) of the trainings-text works exactly the same as the initial training when adding a new word.

Make sure you follow the guidelines listed in the recording section.

Storage Directories

Trainings-texts are stored at two different locations:

• Linux: ~/.kde/share/apps/simon/texts

Windows: %appdata%\.kde\share\apps\simon\texts

The texts of the current user. Can be deleted and added with simon (see below).

• Linux: `kde4-config --prefix`/share/apps/simon/texts

Windows: (install directory)\share\apps\simon\texts

System-wide texts. They will appear on every user account using simon on this machine and cannot be deleted from within simon because of the permission restrictions on system-wide files.

This folder can be used by system administrators to provide a common set of trainings-texts for all the users on one system.

The XML files (one for each text) can just be moved from one location to the other but this will most likely require admin privileges.

Adding Texts

The add texts wizard provides a simple way to add new trainings-texts to simon.

When importing text files, simon will automatically try to recognize individual sentences and split the text into appropriate "pages" (recordings). The algorithm treats text between "normal" punctuation (".", "!", "?", "...", """,...) and line breaks as "sentences". Each "sentence" will be on its own page.
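The splitting described above can be approximated with a short Python sketch. The regular expression and the function name are assumptions made for this illustration only; simon's own algorithm may treat quotation marks and other punctuation differently.

```python
import re

# Assumed approximation: a "sentence" ends after ".", "!" or "?" (including
# runs such as "...") followed by whitespace, or at a line break.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+|\n+')


def split_into_pages(text):
    """Return one string per page (recording), dropping empty fragments."""
    pages = [part.strip() for part in _SENTENCE_END.split(text)]
    return [page for page in pages if page]


if __name__ == "__main__":
    for number, page in enumerate(split_into_pages("Hello world. How are you?\nFine!"), 1):
        print(number, page)
```

Each returned string corresponds to one page, that is, one recording during training.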

simon supports two different sources for new trainings-texts.

Add trainings-texts

Simply enter the trainings-text in an input field.

Local text files

simon can import normal text files to use them as trainings-texts.

On The Fly Training

In addition to trainings-texts, simon also allows you to train individual words or word combinations from your dictionary on the fly.

This feature is located in the vocabulary-menu of simon.

Select the words to train from the vocabulary on the left and simply drag them to the selection list on the right (you could also select them in the table on the left and add them by clicking "Add to Training").

Start the training by selecting "Train selected words". The training itself is exactly the same as if it were a pre-composed trainings-text.

If there are more than 9 words to train simon will automatically split the text evenly across multiple pages.
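How the words are distributed is not specified in detail here. One plausible reading of "evenly" is sketched below in Python: use as few pages as possible and let page sizes differ by at most one word. The function name and this exact distribution rule are assumptions.

```python
import math

MAX_WORDS_PER_PAGE = 9  # limit taken from the text above


def split_words_evenly(words, max_per_page=MAX_WORDS_PER_PAGE):
    """Spread words over the fewest pages so that page sizes differ by at most one."""
    if not words:
        return []
    page_count = math.ceil(len(words) / max_per_page)
    base, extra = divmod(len(words), page_count)
    pages, start = [], 0
    for index in range(page_count):
        # The first `extra` pages take one additional word.
        size = base + (1 if index < extra else 0)
        pages.append(words[start:start + size])
        start += size
    return pages
```

Under this assumption, ten words would end up on two pages of five words each rather than on one page of nine and one page with a single word.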

Of course you are free to add words from the shadow lexicon to the list of words to train, but simon will prompt you to add the words before the training starts, just as it would if you trained a text that contains unknown words (see above).

Importing Trainings Samples

Using the import trainings-data field one can import previously gathered trainings-samples from previous simon versions or manual trainings without copying the whole dictionary.

This feature is very specific. Please use it with caution and make sure that you know exactly what you are doing before you continue.

You can either provide a separate prompts file or let simon extract the transcriptions from the filenames.

When using prompts based transcriptions your prompts file (UTF-8) needs to contain lines of the following content: "[filename] [content]". Filenames are without file extensions and the content has to be uppercase. For example: "demo_2007_03_20 DEMO" to import the file "demo_2007_03_20.wav" containing the spoken word "Demo".

Because prompts files do not contain a file extension, simon will try wav, mp3, ogg and flac (in that order). If one of those matches, no other extension will be tested and only the first file will be imported (in contrast to file based transcription where all files would be imported).
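The prompts format and the extension lookup can be illustrated with a small Python sketch. The function names are hypothetical and the code only mirrors the rules stated above; it is not taken from simon itself.

```python
import os

# Extensions in the order given in the text above.
EXTENSIONS = ("wav", "mp3", "ogg", "flac")


def parse_prompts_line(line):
    """Split a '[filename] [content]' line into (filename, content)."""
    filename, _, content = line.strip().partition(" ")
    return filename, content


def find_sample(directory, filename):
    """Return the first existing file among the known extensions, or None."""
    for extension in EXTENSIONS:
        candidate = os.path.join(directory, f"{filename}.{extension}")
        if os.path.isfile(candidate):
            return candidate  # later extensions are not tested
    return None


if __name__ == "__main__":
    print(parse_prompts_line("demo_2007_03_20 DEMO"))
```

If both demo.mp3 and demo.ogg exist in the folder, this lookup returns only demo.mp3, matching the "first match wins" behaviour described above.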

When using file based transcriptions, a file called this_is_a_test.wav must contain "This is a test" and nothing else. Numbers and special characters (".", "-",...) in the filename are ignored and stripped.
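A rough Python sketch of the file name extraction follows. It assumes that underscores separate words and that the result is uppercased like the content of a prompts file; both points, as well as the function name, are assumptions for illustration.

```python
import os
import re


def transcription_from_filename(path):
    """Derive a transcription from a sample's file name (illustrative only)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    words = stem.replace("_", " ")             # assumed word separator
    words = re.sub(r"[^\w\s]|\d", "", words)   # strip special characters and digits
    return " ".join(words.split()).upper()     # collapse whitespace, uppercase


if __name__ == "__main__":
    print(transcription_from_filename("this_is_a_test.wav"))
    print(transcription_from_filename("demo_2007_03_20.wav"))
```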

Files recorded by simon 0.2 will follow this naming scheme so you can safely import them using the file name extraction method. Files generated by previous simon versions should not be imported using this function but you can use the prompts based import for that.

Imported files and their transcription are then added to the trainings corpus.

To import a directory containing trainings-samples just select the folder to import and depending on your import type also the prompts file.

The folder will be scanned recursively. This means that the given folder and all its subfolders will be searched for .wav, .flac, .mp3 and .ogg files. All files found will be imported.

When importing the sound files, all configured post processing filters are applied.

If you import anything other than WAV files you are responsible for decoding them during the import process (for example through post processing filters) or the model creation will fail.

Commands

When simon is active and recognizes something, the recognition result is given to the loaded command plug-ins (in order) for processing.

The command system can be compared with a group of factory workers. Each one of them knows how to perform one task (e.g. "Karl" knows how to start a program and "Joe" knows how to open a folder, etc.). Whenever simon recognizes something it is given to "Karl" who then checks if this instruction is meant for him. If he doesn't know what to do with it, it is handed over to "Joe" and so on. If none of the loaded plugins know how to process the input it is ignored. The order in which the recognition result is given to the individual commands (people) is configurable in the command options (Commands > Manage plugins).

Each plugin can be associated with a "trigger". Using triggers, the responsibility of each plugin can easily be divided.

Using the factory workers abstraction from above it could be compared to stating the name of who you mean to process your request. So instead of "Open my home folder" you say "Joe, open my home folder" and "Joe" (the plugin responsible for opening folders) will instantly know that the request is meant for him.

In practice you could have commands like the executable command "Firefox" to open the popular browser and the place command "Google" to open the web search engine. If you assign the trigger "Start" to the executable plugin and the trigger "Open" to the place command you would have to say "Start Firefox" (instead of just "Firefox" if you don't use a trigger for the executable plugin) and "Open Google" to open the search engine (instead of just "Google").

Triggers are of course not required and you can easily use simon without defining any plugin triggers (although many plugins come with a default trigger of "Computer" set, which you would have to remove). But even if you use just one trigger for all your commands (like "Computer", to say "Computer, Firefox" and "Computer, Google"), it has the advantage of greatly limiting the number of false positives.
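The interplay of plugin order and triggers can be sketched in a few lines of Python. The class and function names below are hypothetical and do not correspond to simon's plugin API; the sketch only models the behaviour described above.

```python
class Plugin:
    """Hypothetical stand-in for a command plugin with an optional trigger."""

    def __init__(self, name, trigger, commands):
        self.name = name
        self.trigger = trigger      # e.g. "Start"; empty string for no trigger
        self.commands = commands    # command name -> description of the action

    def process(self, result):
        """Return the action for this result, or None if it is not meant for us."""
        if self.trigger:
            prefix = self.trigger + " "
            if not result.startswith(prefix):
                return None
            result = result[len(prefix):]
        return self.commands.get(result)


def dispatch(plugins, result):
    """Hand the result to each plugin in order; unhandled input is ignored."""
    for plugin in plugins:
        action = plugin.process(result)
        if action is not None:
            return plugin.name, action
    return None


if __name__ == "__main__":
    plugins = [
        Plugin("executable", "Start", {"Firefox": "launch firefox"}),
        Plugin("place", "Open", {"Google": "open the web search engine"}),
    ]
    print(dispatch(plugins, "Start Firefox"))
    print(dispatch(plugins, "Firefox"))
```

In this model "Firefox" on its own is ignored because the executable plugin expects its trigger "Start" first.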

simon's command dialog displays the complete phrase associated with a command in the upper right corner of the command configuration.

You can load multiple instances of one plugin even in one scenario. Each instance can of course also have a different plugin trigger.

Each Command has a name (which will trigger its invocation), an icon and more fields depending on the type of the plugin (see below).

Some command plugins might provide a configuration of the plugin itself (not the commands it contains). These configuration pages will be plugged directly into the action configuration dialog (below the General menu item) when you load the associated plugin.

Plugins that provide a graphical user interface (like for example the input number command plugin) can be configured by configuring "Voice commands". You can change the associated word that will trigger the button, for example, but also change the displayed icon, etc. If you remove all voice interface commands from a graphical element, the element will be hidden automatically.

Voice interface commands are added just like normal commands through the command configuration.

To add a new interface command to a function, just select the action you want to associate with a command, click Create from template and adapt the resulting command to your needs.

Some plugins (for example the desktopgrid or the calculator) might also provide a menu item in the Commands menu.

Executable Commands

Executable commands are associated with an executable file ("Program") which is started when the command is invoked.

Arguments to the commands are supported. If either the path to the executable or the parameters contain spaces, they must be wrapped in quotes.

Given the executable file C:\Program Files\Mozilla Firefox\firefox.exe and the local HTML file C:\test file.html, the correct line for the "Executable" would be: "C:\Program Files\Mozilla Firefox\firefox.exe" "C:\test file.html".

The working directory defines where the process should be launched from. Given the working directory C:\folder, the command "C:\Program Files\Mozilla Firefox\firefox.exe" file.html would cause firefox to search for the file C:\folder\file.html.
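To see why the quotes matter, the following Python sketch splits such a line with the standard library's shlex module. This only illustrates quoting in general: shlex follows POSIX shell rules (an unquoted backslash, for instance, would be treated as an escape character), and simon's own parsing of the "Executable" line may differ. The function name is hypothetical.

```python
import shlex


def split_command_line(line):
    """Split an "Executable" line into program and arguments, honouring double quotes."""
    return shlex.split(line)


if __name__ == "__main__":
    line = r'"C:\Program Files\Mozilla Firefox\firefox.exe" "C:\test file.html"'
    program, *arguments = split_command_line(line)
    print(program)
    print(arguments)
```

Without the quotes, the same line would be split at every space and the program path would end at "C:\Program".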

The working directory does not normally need to be set and can be left blank most of the time.

Importing Programs

For even easier configuration simon provides an import dialog which allows you to select programs directly from the KDE menu.

Note: This option is not available on Microsoft Windows.

The dialog will list all programs that have an entry in your KDE menu in their respective category.

Sub-Categories are not supported and are thus listed on the same level as top-level categories.

Just select the program you wish to start with simon and press Ok. The correct values for the executable and the working directory as well as an appropriate command name and description will automatically be filled out for you.

Place Commands

With place commands you can allow simon to open any given URL. Because simon just hands the address over to the platform's URL handler, special protocols like "remote:/" (on Linux/KDE) or even KDE's "Web-Shortcuts" are supported.

Instead of folders, files can also be set as the command's URL, which will cause the file to be opened with its associated application when the command is invoked.

To associate a specific URL with the command you can manually enter it in the URL field (select Manual first) or import it with the import place wizard.

Importing Places

The import place dialog allows you to easily create the correct URL for the command.

To add a local folder, select Local Place and choose the folder or file with the file selector.

To add a remote URL (HTTP, FTP, etc.) choose Remote URL.

Please note that for URLs with authentication information the password will be stored in clear text.

Shortcut Commands

Using shortcut commands the user can associate commands with key-combinations.

The command will simulate keyboard input to "press" shortcuts like "Ctrl+C" or "Alt+F4".

To select the shortcut you wish to simulate just toggle the shortcut button and press the key combination on your keyboard.

simon will capture the shortcut and associate it with the command.

Due to technical limitations there are several shortcuts on Microsoft Windows that cannot be captured by simon (this includes e.g. Ctrl+Alt+Del and Alt+F4). These special shortcuts can be selected from a list below the aforementioned shortcut button.

Note: This selection box is not visible in the screenshot above as the list is only displayed in the Microsoft Windows version of simon.

Text-Macro Commands

Using text-macro commands, the user can associate text with a command. When the command is invoked, the associated text will be "written" by simulating keystrokes.

List Commands

The list command is designed to combine multiple commands (all types of commands are supported) into one list. The user can then select the n-th entry by saying the associated number (1-9).

This is very useful to limit the amount of training required and provides the possibility to keep the vocabulary to a minimum.

List commands are especially useful when using commands with difficult triggers or commands that can be grouped under a general theme. A typical example would be a command "Startmenu" to present a list of programs to launch. That way the specific executable commands can still retain very descriptive names (like "OpenOffice.org Writer 3.1") without the user having to include these words in his vocabulary and consider them in the grammar just to trigger them.

Commands of different types can of course be mixed.

List Command Display

When invoked, the command will display the list centered on the screen. The list will automatically expand to accommodate its items.

The user can invoke the commands contained in the list by simply saying their associated number (In this example: "One" to launch Mozilla Firefox).

While a list command is active (displayed), all input that is not directed at the list itself (other commands, etc.) will be rejected. The process can be canceled by pressing the "Cancel" button or by saying "Cancel".

If there are more than 9 items simon will add "Next" and "Back" options to the list ("Zero" will be associated with "Back" and "Nine" with "Next").

Configuring list elements

By default the list command uses the following trigger words. To use list commands to their full potential, make sure that your language and acoustic model contain and allow for the following "sentences":

• "Zero"

• "One"

• "Two"

• "Three"

• "Four"

• "Five"

• "Six"

• "Seven"

• "Eight"

• "Nine"

• "Cancel"

Of course you can also configure these words in your simon configuration:

• Commands > Manage plugins > General > Lists for the scenario wide list configuration.

• Settings > Configure simon... > Actions > Lists for the global configuration. When creating a new scenario, the scenario configuration will be initialized with a copy of this list configuration.

List commands are also used internally by other plugins, for example the desktopgrid. The configuration of the triggers also affects their displayed lists.

Composite Commands

Composite commands allow the user to group multiple commands into a sequence.

When invoked the commands will be executed in order. Delays between commands can be inserted.

Using the composite command the user can compose complex "macros". The screenshot above - for example - does the following:

• Start Kopete (Executable Command)

• Wait 2000ms for Kopete to be started

• Type "Mathias" (Text-Macro Command) which will select Mathias in my contact list

• Press Enter (Shortcut Command)

• Wait 1000ms for the chat window to appear

• Write "Hi!" (Text-Macro Command); the text associated with this command contains a newline at the end so that the message will be sent.

• Press Alt+F4 (Shortcut Command) to close the chat window

• Press Alt+F4 (Shortcut Command) to close the Kopete main window

Desktopgrid

The desktopgrid allows the user to control his mouse with his voice.

The desktopgrid divides the screen into nine parts which are numbered from 1-9. Saying one of these numbers will divide the selected field into 9 fields, again numbered from 1-9, and so on. This is repeated 3 times. After the fourth time the desktopgrid will be closed and simon will click in the middle of the selected area.
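The repeated subdivision can be illustrated with a short Python sketch that computes the final click position from the spoken numbers. It assumes that the fields are numbered row by row starting at the top left; the actual layout and the function name are assumptions.

```python
def grid_click_point(width, height, digits):
    """Return the centre (x, y) of the field reached by the spoken digits (1-9)."""
    x, y, w, h = 0.0, 0.0, float(width), float(height)
    for digit in digits:
        if not 1 <= digit <= 9:
            raise ValueError("fields are numbered 1-9")
        row, col = divmod(digit - 1, 3)   # assumed row-major numbering
        w, h = w / 3, h / 3               # each step shrinks the area to one ninth
        x, y = x + col * w, y + row * h
    return x + w / 2, y + h / 2


if __name__ == "__main__":
    # Selecting the centre field twice on a 900x900 screen.
    print(grid_click_point(900, 900, [5, 5]))
```

Every additional number narrows the target area to a third of its previous width and height, so a handful of selections is enough to address a small region of the screen.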

The exact click action is configurable but defaults to asking the user. Therefore you will be presented with a list of possible click modes. When selecting Drag and Drop, the desktopgrid will be displayed again to select the drop point.

While the desktopgrid is active (displayed), all input that is not directed at the desktopgrid itself (other commands, etc.) will be rejected. Say "Cancel" at any time to abort the process.

The desktopgrid plugin registers a configuration screen right in the command configuration when it is loaded.

The trigger that invokes the desktopgrid is of course completely configurable. Moreover, the user can use "real" or "fake" transparency. If your graphical environment allows for compositing effects ("desktop effects") then you can safely use "real" transparency, which will make the desktopgrid transparent. If your platform does not support compositing, simon will simulate transparency by taking a screenshot of the screen before displaying the desktopgrid and displaying that picture behind the desktopgrid.

If the desktopgrid is configured to use real transparency and the system does not support compositing it will display a solid gray background.

However, nearly all up-to-date systems will support compositing (real transparency).

This includes:

• Microsoft Windows 2000 or higher (XP, Vista, 7)

• GNU/Linux using a composite manager like Compiz, KWin4, xcompmgr, etc.

By default the desktopgrid uses numbers to select the individual fields. To use the desktopgrid, make sure that your language and acoustic model contain and allow for the following "sentences":

• "One"

• "Two"

• "Three"

• "Four"

• "Five"

• "Six"

• "Seven"

• "Eight"

• "Nine"

• "Cancel"

To configure these triggers, just configure the commands associated with the plugin.

Input Number

Using the input-number plugin the user can input large numbers easily.

Using the Dictation or the Text-Macro plugin one could associate the numbers with their digits and use that as input method. However, to input larger numbers there are two ways that both have significant disadvantages:

• Adding the words "eleven", "twelve", etc.

While this seems like the most elegant solution as it would enable the user to say "fivehundredseventytwo" we can easily see that it would be quite a problem to add all these words - let alone train them. What about "twothousandninehundredtwo"? Where to stop?

• Spell out the number using the individual digits

While this is not as elegant as stating the complete number it is much more practical.

However, many applications (like the great mouseless browsing Firefox add-on) rely on the user to input large numbers without too much time passing between the individual keystrokes (mouseless browsing, for example, will wait exactly 500ms by default before it considers the input of the number complete). So if you want to enter 52 you would say "Five (pause) Two". Because of the needed pause, the application (like the mouseless browsing add-on) would consider the input of "Five" complete.

The input number plugin - when triggered - presents a calculator-like interface for inputting a number.

The input can be corrected by saying "Back". It features a decimal point accessible by saying "Comma". When saying "Ok" the number will be typed out. As all the voice input and the correction is handled by the plugin itself, the application that finally receives the input will only get a couple of milliseconds between the individual digits.

While the input number plugin is active (the user currently inputs a number), all input that is not directed at the input number plugin (other commands, etc.) will be rejected. Say "Cancel" at any time to abort the process.

As no command instances can be created for this plugin, it is not listed in the "New Command" dialog. However, the input number plugin registers a configuration screen right in the command configuration when it is loaded.

The trigger defines the word or phrase that will trigger the display of the interface.

By default the input number plugin uses numbers to select the individual digits and a couple of control words. To use the input number plugin, make sure that your language and acoustic model contain and allow for the following "sentences":

• "Zero"

• "One"

• "Two"

• "Three"

• "Four"

• "Five"

• "Six"

• "Seven"

• "Eight"

• "Nine"

• "Back"

• "Comma"

• "Ok"

• "Cancel"

To configure these triggers, just configure the commands associated with the plugin.

Dictation

The dictation plugin writes the recognition result it gets using simulated keystrokes.

Assuming you didn't define a trigger for the dictation plugin it will accept all recognition results and just write them out. The written input will be considered as "processed input" and thus not be relayed to other plugins. This means that if you loaded the dictation plugin and defined no trigger for it, all plugins below it in the "Selected Plug-Ins" list in the command configuration will never receive any input.

As no command instances can be created for this plugin, it is not listed in the "New Command" dialog.

The dictation plugin can be configured to append text after recognition results, for example to add a space after each recognized word.

Artificial Intelligence

The Artificial Intelligence is a just-for-fun plugin that emulates a human conversation.

Using the festival text to speech technology the computer can "talk" with the user and answer questions or chat about the weather.

The plugin uses AIML for the actual "intelligence". Most AIML sets should be supported. The popular A. L. I. C. E. bot (http://www.pandorabots.com/pandora/talk?botid=f5d922d97e345aa1) and a German version work and are shipped with the plugin.

The plugin registers a configuration screen in the command configuration menu where you can choose which AIML set to load.

simon will look for AIML sets in the following directory:

• GNU/Linux: `kde4-config --prefix`/share/apps/ai/aimls/

• Microsoft Windows: [installation folder (C:\Program Files\simon 0.2\ by default)]\share\apps\ai\aimls\

To add a new set just create a new directory with a descriptive name and copy the .aiml files into it.

To adjust your bot's personality, have a look at the bot.xml and vars.xml files in the following directory:

• GNU/Linux: `kde4-config --prefix`/share/apps/ai/util/

• Microsoft Windows: [installation folder (C:\Program Files\simon 0.2\ by default)]\share\apps\ai\util\

The plugin will use mbrola voices (http://tcts.fpms.ac.be/synthesis/mbrola.html) if they are installed.

As no command instances can be created for this plugin, it is not listed in the "New Command" dialog.

It is recommended to not use any trigger for this plugin to provide a more natural "feel" for the conversation. The AI plugin will pass any input through to the other plugins, even though it will react to any input given. This makes it possible to add a "conversation" to the command & control use-case by developing custom AIML sets (e.g.: User: "Computer, open Firefox"; Computer: "Certainly, Sir! Starting Firefox..."; Firefox opens).

Please keep in mind that the AI plugin will only work if festival is installed, set-up correctly and lies in your system path.

Calculator

The calculator plugin is a simple, voice controlled calculator.

The calculator extends the Input Number plugin by providing additional features.

When loading the plugin, a configuration screen is added to the plugin configuration.

There you can also configure the control mode of the calculator. Setting the mode to anything other than Full calculator will hide options from the displayed widget.

However, the hidden controls will, in contrast to simply removing all associated commands from the functions, still react to the configured voice commands.

When selecting Ok, the calculator will by default ask you what to do with the generated result. You can for example output the calculation, the result, both, etc. Besides always selecting this from the displayed list after selecting the Ok button, this can also be set in the configuration options.

Filter

Using the filter plugin, you can intercept recognition results and keep them from being passed on to further command plugins. Using this plugin you can for example disable the recognition by voice.

The filter has two states:

• Inactive

The default state. All recognition results will be passed through.

• Active

When activated, the filter will "eat" all results that match the configured pattern. By default this means every result that simon recognizes will be accepted by the filter and therefore not relayed to any of the plugins following the filter plugin.

The filter command plugin registers a configuration screen in the command configuration where you can change what results should be filtered.

The pattern is a regular expression that will be evaluated each time the plugin receives a recognition result for processing.

The plugin also registers voice interface commands for activating and deactivating the filter.
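The two states and the pattern check can be modelled with a few lines of Python. The class name, the default pattern ".*" and the use of a full match are assumptions for this sketch; the plugin's actual default and matching rules may differ.

```python
import re


class ResultFilter:
    """Hypothetical model of the filter plugin's two states."""

    def __init__(self, pattern=".*"):
        self.pattern = re.compile(pattern)   # assumed default: match everything
        self.active = False                  # the filter starts out inactive

    def swallows(self, result):
        """Return True if the result is eaten and must not reach later plugins."""
        return self.active and self.pattern.fullmatch(result) is not None


if __name__ == "__main__":
    result_filter = ResultFilter()
    print(result_filter.swallows("Computer Firefox"))   # inactive: passed through
    result_filter.active = True
    print(result_filter.swallows("Computer Firefox"))   # active: swallowed
```

With a narrower pattern such as "Computer .*", an active filter in this model would only swallow results that start with "Computer" and let everything else pass.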

Pronunciation Training

The pronunciation training, when combined with a good static base model, can be a powerful tool to improve your pronunciation of a new language.

Essentially, the plugin will prompt you to say specific words. The recognition will then recognize your pronunciation of the word and compare it to your speech model which should be a base model of native speakers for this to work correctly. Then simon will display the recognition rate (how similar your version was to the stored base model).

The closer to the native speaker, the higher the score.

The plugin adds an entry to your Commands menu to launch the pronunciation training dialog.

The training itself consists of multiple pages. Each page contains one word fetched from your active vocabulary. They are identified by a terminal which needs to be selected in the command configuration before starting the training.

Keyboard

The keyboard plugin displays a virtual, voice controlled keyboard.

The keyboard consists of multiple tabs, each possibly containing many keys. The entirety of tabs and keys is collected in "sets".

You can select sets in the configuration but also create new ones from scratch in the keyboard command configuration.

Keys are usually mapped to single characters but can also hold long texts and even shortcuts. Because of this, keyboard sets can contain special keys like a "select all" key or a "Password" key (typing your password).

Next to the tabs that hold the keys of your set, the keyboard may also show special keys like Ctrl, Shift, etc. Those keys are provided as voice interface commands and are displayed regardless of what tab of the set is currently active.

As with all voice triggers, removing the associated command hides the buttons as well.

Moreover, the keyboard provides a numpad that can be shown by selecting the appropriate option in the keyboard configuration.

Next to the number keys and the delete key for the number input field (Number backspace), the numpad provides two options on what to do with the entered number.

When selecting Write number, the entered number will be written out using simulated key presses. Selecting Select number tries to find a key or tab in the currently active set that has this number as a trigger. This way you can control a complete keyboard just using numbers.

The keys on the num pad are configurable voice interface commands.

Configuration

simon was designed with high configurability in mind.

General Configuration

The general configuration page lists some basic settings.

Please note that the option to start simon at login will work on both Microsoft Windows and when you are using KDE on Linux. Support for other desktop environments like Gnome, XFCE, etc. might require manually placing simon in the session autostart (please refer to the respective manuals of your desktop environment).

When the option to start simon minimized is selected, simon will minimize to the system tray immediately after starting.

Deselecting the option to warn when there are problems with samples deactivates the sample quality assurance.

Sound Configuration

simon uses QtMultimedia to record and play sound. QtMultimedia is also used to gather data from the microphone which is then sent to the simond server for recognition.

Device Configuration

The sound device configuration allows you to choose which sound device(s) to use, how many channels to use and at which samplerate to record.

Most of the time you will want to use 1 channel and 16kHz (which is also the default) because the recognition only works on mono input and works best at 16kHz (8kHz being the other option).

However, some low-cost sound cards might not support this particular mode in which case simon can in many cases work around this limitation by using postprocessing chains and 3rd party software. Please see the postprocessing section for more details.

Bottom line: Only change the channel and the samplerate if you really know what you are doing. Otherwise the recognition will most likely not work.

Use the selection boxes to change the device. Use the Refresh devices button if you have changed the sound configuration since you started simon.

You can use simon with more than one sound device at the same time. Use Add device to add a new device to the configuration and Remove device to remove it from your configuration.

The first device in your sound setup cannot be removed.

For each device you can determine what you want the device to be used for: training or recognition (the latter only applies to input devices).

If you use more than one device for training, you will create multiple sound files for each utterance. When using multiple devices for recognition each one feeds a separate sound input stream to the server resulting in recognition results for each stream.

If you use multiple output devices the playback of the trainings samples will play on all configured audio devices.

When using different sample rates for your input devices, the output will only play on matching output devices. If you for example have one input device configured to use 16kHz and the other to use 48kHz, the playback of samples generated by the first one will only play on 16kHz outputs, the other one only on 48kHz devices.

Voice Activity Detection

The recognition is done on the simond server. See the architecture section for more details.

The sound stream is not continuous but is segmented by the simon client. This is done by something called "voice activity detection".

Here you can configure this segmentation through the following parameters:

• Cutoff level

Everything below this level is considered "silence" (background noise).

• Head margin

simon caches the input for as long as the head margin before it starts to consider it a real sample. During this whole time the input level needs to be above the cutoff level.

• Tail margin

After the recording went below the cutoff level, simon will wait for as long as the tail margin before it considers the current recording a finished sample.

• Skip samples shorter than

Samples that are shorter than this value are not considered for recognition (coughs, etc.).
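How the four parameters interact can be sketched in Python. The sketch works on a list of per-frame input levels with a fixed frame length, which is an assumption made for illustration; the function name, the frame model and the choice to measure a sample from its first to its last loud frame are not taken from simon.

```python
def detect_samples(levels, frame_ms, cutoff, head_ms, tail_ms, min_ms):
    """Return (first_frame, last_frame) index pairs of the detected samples."""
    samples = []
    start = last_loud = None   # first and most recent loud frame of the candidate
    for index, level in enumerate(levels):
        loud = level > cutoff
        if start is None:
            if loud:
                start = last_loud = index
            continue
        if loud:
            last_loud = index
            continue
        length_ms = (last_loud - start + 1) * frame_ms
        if length_ms < head_ms:
            start = None       # went silent within the head margin: discard
        elif (index - last_loud) * frame_ms >= tail_ms:
            if length_ms >= min_ms:
                samples.append((start, last_loud))   # long enough to keep
            start = None
    # The stream ended while a candidate was still open.
    if start is not None:
        length_ms = (last_loud - start + 1) * frame_ms
        if length_ms >= max(head_ms, min_ms):
            samples.append((start, last_loud))
    return samples
```

For example, with 100ms frames, a head margin of 200ms, a tail margin of 300ms and a minimum length of 300ms, a single loud frame (such as a click) is discarded, while four consecutive loud frames followed by enough silence are reported as one sample.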

Training settings

When the option Default to power training is selected, simon will, when training, automatically start and stop the recording when displaying and hiding (respectively) the recording prompt. This option only sets the default value of the option; the user can change it at any time before beginning a training session.

The configurable font here refers to the text that is recorded to train the acoustic model (through explicit training or when adding a word).

This option has been introduced after we have worked with a few clients suffering spastic disability. While we used the mouse to control simon during the training, they had to read what was on the screen. At first this was very problematic as the regular font size is relatively small and they had trouble making out what to read. This is why we made the font and the font size of the recording prompt configurable.

Here you can also define the required signal to noise ratio for simon to consider a training sample to be correct. See the Sample Quality Assurance section for more details.

On this configuration page you can also set the parameters for the volume calibration.

The calibration can be deactivated for both the add word dialog and the training wizard by unchecking the group box itself. As long as the volume is not louder than the minimum volume, simon will prompt the user to raise the microphone volume. If the recording hits the maximum volume once, simon will tell the user to lower the volume.

Clipping (hitting the maximum amplitude) will always cause a "too loud" warning.
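The calibration rules can be sketched as a small decision function. This is only an illustration under assumptions: samples are taken to be normalized to the range -1.0 to 1.0, the minimum and maximum volume are expressed on the same scale, and the function name and return values are hypothetical.

```python
from typing import Sequence


def calibration_hint(samples: Sequence[float], minimum: float,
                     maximum: float, full_scale: float = 1.0) -> str:
    """Classify a calibration recording as 'too loud', 'too quiet' or 'ok'."""
    peak = max((abs(s) for s in samples), default=0.0)
    # Clipping always counts as too loud, whatever the configured maximum is.
    if peak >= full_scale:
        return "too loud"
    # Hitting the maximum volume once is enough to ask for a lower volume.
    if peak >= maximum:
        return "too loud"
    # Never getting louder than the minimum means the volume must be raised.
    if peak <= minimum:
        return "too quiet"
    return "ok"
```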

The prompted text can be configured by entering text in the input field below. If the input field is left empty, a default text will be used.

Postprocessing

All recorded (training) and imported (through the import training data wizard) samples can be processed using a series of postprocessing commands. Postprocessing chains are an advanced feature and shouldn't be needed by the average user.

The postprocessing commands can be seen as a chain of filters through which the recordings have to pass. Using these "filters" one could define commands to suppress background noise in the training data or to normalize the recordings.

Given the program "process_audio", which takes the input and output files as its arguments (e.g.: "process_audio in.wav out.wav"), the postprocessing command would be: "process_audio %1 %2". The two placeholders %1 and %2 will be replaced by the input filename and the output filename respectively.
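How such placeholder commands could be expanded and chained is sketched below. This is an assumption about the mechanics, not simon's code: the function names and the naming of the intermediate files are hypothetical, and the sketch only builds the argument lists without executing anything.

```python
import re
import shlex
from typing import List


def expand_command(template: str, infile: str, outfile: str) -> List[str]:
    """Split a postprocessing command and fill in %1 (input) and %2 (output)."""
    names = {"%1": infile, "%2": outfile}
    return [re.sub(r"%[12]", lambda match: names[match.group()], arg)
            for arg in shlex.split(template)]


def plan_chain(templates: List[str], source: str, target: str) -> List[List[str]]:
    """Build one argument list per filter; each filter reads the previous output."""
    commands = []
    current = source
    for index, template in enumerate(templates):
        is_last = index == len(templates) - 1
        # Hypothetical naming scheme for the intermediate files.
        output = target if is_last else f"{target}.step{index}.wav"
        commands.append(expand_command(template, current, output))
        current = output
    return commands


print(expand_command("process_audio %1 %2", "in.wav", "out.wav"))
```

Each resulting argument list could then be passed to subprocess.run to execute the filter.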

The switch to "apply filters to recordings recorded with simon" enables the postprocessing chains for samples recorded during the training (including the initial training while adding the word). If you don't select this switch, the postprocessing commands are only applied to imported samples (through the import training data wizard).

One common use-case of postprocessing chains would be the resampling of audio because of hardware limitations. Given a soundcard that does not support mono 16kHz recordings but only supports 44100Hz stereo ("CD") recordings, one could use the free command line sound processing utility SoX (http://sox.sourceforge.net) to resample the recorded files after the recording.



This example would require the following postprocessing command:

• sox %1 -c 1 -r 16000 %2

Using this command you can safely record in 44100Hz and 2 channels and, assuming the option to apply the filters to recordings recorded with simon is selected, simon will automatically downsample the files to 16000Hz and 1 channel after recording them. Make sure to adjust your sound device configuration accordingly.

Speech Model

Here you can adjust the parameters of the speech model.

The samplerate set here is the target samplerate of the acoustic model. It has nothing to do with the recording samplerate and it is the responsibility of the user to ensure that the samples are actually made available in that format (usually by recording in that exact samplerate or by defining postprocessing commands that resample the files; see the sound configuration section for more details).

Usually either 16kHz or 8kHz models are built / used. 16kHz models will have higher accuracy than 8kHz models. Going higher than 16kHz is not recommended as it is very CPU-intensive and in practice probably won't result in higher recognition rates.
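Since it is up to the user to provide samples in the target format, a quick check of existing WAV files can help. The following sketch uses only Python's standard wave module; the function names are hypothetical and the default of one channel assumes mono models.

```python
import wave
from typing import Tuple


def wav_format(path: str) -> Tuple[int, int]:
    """Return (sample rate in Hz, number of channels) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_resampling(path: str, model_rate: int, model_channels: int = 1) -> bool:
    """True if the file does not match the target format of the acoustic model."""
    return wav_format(path) != (model_rate, model_channels)
```

A file recorded in 44100Hz stereo would, for example, be reported as needing resampling for a 16000Hz model.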

Moreover, the path to the training samples can be adjusted. However, be sure that the previously gathered training samples are also moved to the new location. If you use automatic synchronization, simond would otherwise provide simon with the missing samples, but copying them manually is still recommended for performance reasons.


Model Settings

General

Please see the base model section.

Extensions

Here you can configure the base URL that is going to be used for the automatic bomp import. The default points to the copy on the simon listens server.

Recognition

Here you can configure the recognition and model synchronization with the simond server.

Server

Using the server configuration you can set parameters of the connection to simond.

General

The simon main application connects to the simond server (see the architecture section for more information).

To identify individual users of the system (one simond server can of course serve multiple simon clients), simon and simond use users. Every user has their own speech model. The username / password combination given here is used to log in to simond. If simond does not know the username or the password is incorrect, the connection will fail. See the simond manual (help:/simond) on how to set up users for simond.

The recognition itself, which is done by the server, might not be available at all times. For example, it is not possible to start the recognition as long as the user does not have a compiled acoustic and language model. These models have to be created first (during synchronization, once all the ingredients, i.e. vocabulary, grammar and training data, are present). With the option to start the recognition automatically once it is available, simon will request to start the recognition as soon as it receives the information that it is ready (acoustic and language model are available).

Using the "Connect automatically on simon start" option, simon will automatically start the connection to the configured simond servers after it has finished loading the user interface.

Network

simon connects to simond using TCP/IP.

As of simon 0.3, encryption is not yet supported.

The timeout setting specifies how long simon will wait for a first reply when contacting the hosts. If you are on a very slow network and/or use "connect on start" on a very slow machine, you may want to increase this value if you keep getting timeout errors that go away when you try again repeatedly.

simon can be configured to use more than one simond. This is very useful if, for example, you are going to use simon on a laptop which connects to a different server depending on where you are. You could for example add the server you use when you are home and the server used when you are at work. When connecting, simon will try to connect to each of the servers (in order) until it finds one server that accepts the connection.

To add a server, just enter the hostname or IP and the port (separated by ":") or use the dialog that appears when you select the blue arrow next to the input field.
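The fallback over several configured servers can be sketched as follows. This is not simon's networking code: it only shows the idea of trying "host:port" entries in order with a timeout. The function names, host names and port are hypothetical; the connect parameter exists so that the logic can be tried without a network.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple


def parse_server(entry: str) -> Tuple[str, int]:
    """Split a 'host:port' entry at the last colon."""
    host, separator, port = entry.rpartition(":")
    if not separator or not host:
        raise ValueError(f"expected host:port, got {entry!r}")
    return host, int(port)


def connect_first(servers: Iterable[str], timeout_s: float,
                  connect: Callable = socket.create_connection
                  ) -> Optional[Tuple[str, object]]:
    """Try each server in order; return (entry, connection) for the first success."""
    for entry in servers:
        address = parse_server(entry)
        try:
            return entry, connect(address, timeout_s)
        except OSError:
            # Refused, unreachable or timed out: move on to the next server.
            continue
    return None


def fake_connect(address, timeout):
    # Stand-in for a real connection attempt: only the work server answers.
    if address[0] != "work.example":
        raise OSError("unreachable")
    return address


print(connect_first(["home.example:4444", "work.example:4444"], 3.0, fake_connect))
```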

Synchronization and Model Backup

Here you can configure the model synchronization and restore older versions of your speech model.

Changes only take effect, and a new restore point is only set, after the speech model has been synchronized. This is why, by default, simon will always synchronize the model with the server when it changes. This is called "Automatic Synchronization" and is the recommended setting.

However, if you want more control you can instruct simon to ask you before starting the synchronization after the model has changed, or to rely on manual synchronization altogether. When selecting the manual synchronization you have to manually use the "Synchronization" menu item of the simon main window (also see the section simon main window) every time you want to compile the speech model.

The simon server will maintain a copy of the last five iterations of model files. This only includes the "source files" (the vocabulary, grammar, etc.), not the compiled model. However, the compiled model will be regenerated from the restored source files automatically.

After you have connected to the server, you can select one of the available models and restore it by clicking on "Restore Model".



Please note that the synchronization will only accept complete source models (containing a vocabulary, a grammar and some training samples), so incomplete models will not be stored on the server and thus not be backed up.
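The two backup rules (only complete source models are stored, and only the last five iterations are kept) can be modelled in a few lines. This is a conceptual sketch, not simond's storage code; the class name, the dictionary keys and the in-memory storage are all assumptions.

```python
from collections import deque
from typing import Dict, List

# Hypothetical names for the parts a complete source model must contain.
REQUIRED_PARTS = ("vocabulary", "grammar", "training_samples")


class ModelBackups:
    """Keeps only the most recent complete source models."""

    def __init__(self, keep: int = 5) -> None:
        # A bounded deque silently drops the oldest entry when it is full.
        self._versions = deque(maxlen=keep)

    def store(self, model: Dict[str, list]) -> bool:
        """Store the model if it is complete; report whether it was stored."""
        if not all(model.get(part) for part in REQUIRED_PARTS):
            return False
        self._versions.append(dict(model))
        return True

    def available(self) -> List[Dict[str, list]]:
        """Restore points, oldest first."""
        return list(self._versions)
```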

Actions

In the actions configuration you can configure the reactions to recognition results.

Recognition

The recognition of simon computes not only the most likely result but the ten most likely results.

Each of the results is assigned a confidence score between 0 and 1 (where 1 is 100% sure).

Using the Minimum confidence you can set a minimum confidence for recognition results to be considered valid.

If more than one recognition result is rated higher than the minimum confidence score, simon will provide a popup listing the most likely options for you to choose from.

This popup can be disabled using the Display selection popup for ambiguous results checkbox.
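The handling of the result list can be sketched as a simple filter. The sketch makes two assumptions that the text above does not spell out: a result exactly at the minimum confidence counts as valid, and with the popup disabled the most likely valid result is used. The function name and the returned labels are hypothetical.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (recognized text, confidence between 0 and 1)


def select_results(results: List[Result], minimum_confidence: float,
                   popup_enabled: bool = True) -> Tuple[str, List[Result]]:
    """Decide what to do with the recognition results of one utterance."""
    valid = sorted((r for r in results if r[1] >= minimum_confidence),
                   key=lambda r: r[1], reverse=True)
    if not valid:
        return "reject", []
    if len(valid) > 1 and popup_enabled:
        # Ambiguous: let the user choose from the most likely options.
        return "ask", valid
    return "accept", valid[:1]


print(select_results([("open", 0.8), ("close", 0.3)], 0.5))
```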

Plugin base font

Many plugins of simon have a graphical user interface.

The fonts of these interfaces can be configured centrally here, independent of the system's font settings.


Lists

Here you can find the global list element configuration. This serves as a template for new scenarios but is also directly used for the popup for ambiguous recognition results.

Social desktop

Scenarios can be uploaded and downloaded from within simon.

For this we use KDE's social desktop facilities and our own category for simon scenarios on kde-files.org (http://kde-files.org/index.php?xcontentmode=692).

If you already have an account on opendesktop.org (http://opendesktop.org) you can input the credentials there. If you don't, you can register directly in the configuration module.

The registration is of course free of charge.

Adjusting the recognition parameters manually

simon is targeted towards end-users. Its interface is designed to allow even users without any background in speech technology to design their own language and acoustic models by providing reasonable default values for simple uses.

In special cases (severe speech impairments for example), special configuration might be needed. This is why the raw configuration files for the recognition are also respected by simon and can of course be modified to suit your needs.

There are basically two parts of the Julius configuration that can be adjusted:

• adin.jconf

This file configures, on the simon client, the sound stream that is sent from simon to simond. It is read directly by the adinstreamer.

simon ships with a default adin.jconf without any special parameters. You can change this system wide configuration, which will affect all users if there are different user accounts on your machine who all use simon. To change the configuration of just one of those users, copy the file to the user path (see below) and edit this copy.

• julius.jconf

This is a configuration of the simond server and directly influences the recognition. This file is parsed by libjulius and libsent directly.

simond ships with a default julius.jconf. Whenever a new user is added to the simond database, simond will automatically copy this system wide configuration to the new user. After that the user is of course free to change it, but it won't affect the other users. This way the "template" (the system wide configuration) can also be changed without affecting existing users.

The path to the Julius configuration files will depend on your platform:

Table 4-5. Julius Configuration Files

adin.jconf (system)
Microsoft Windows: (installation path)\share\apps\simon\adin.jconf
GNU/Linux: `kde4-config --prefix`/share/apps/simon/adin.jconf

adin.jconf (user)
Microsoft Windows: %appdata%\.kde\share\apps\simon\adin.jconf
GNU/Linux: ~/.kde/share/apps/simon/adin.jconf

julius.jconf (template)
Microsoft Windows: (installation path)\share\apps\simond\default.jconf
GNU/Linux: `kde4-config --prefix`/share/apps/simond/default.jconf

julius.jconf (user)
Microsoft Windows: %appdata%\.kde\share\apps\simond\models\(user)\active\julius.jconf
GNU/Linux: ~/.kde/share/apps/simond/models/(user)/active/julius.jconf
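The relationship between the system wide files and the per-user copies can be sketched as follows. This only illustrates the behaviour described above and is not taken from the simon or simond sources: it assumes that a user copy of adin.jconf takes precedence when it exists, and the function names and directory parameters are hypothetical.

```python
import shutil
from pathlib import Path


def effective_adin_config(user_dir: Path, system_dir: Path) -> Path:
    """Prefer the user's adin.jconf; fall back to the system wide file."""
    user_file = user_dir / "adin.jconf"
    return user_file if user_file.is_file() else system_dir / "adin.jconf"


def provision_user(template: Path, models_dir: Path, user: str) -> Path:
    """Copy the julius.jconf template for a new user; keep an existing copy."""
    target = models_dir / user / "active" / "julius.jconf"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(template, target)
    return target
```

Because the template is only copied once per user, later changes to it leave the existing copies untouched.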


Chapter 5. Questions and Answers

In an effort to keep this section always up-to-date it is available at our online wiki (http://www.cyber-byte.at/wiki/index.php/English:_Troubleshooting).


Chapter 6. Credits and License

simon

Program copyright 2006-2009 Peter Grasch <[email protected]>, Phillip Goriup, Tschernegg Susanne, Bettina Sturmann, Martin Gigerl

Documentation Copyright (c) 2009 Peter Grasch <[email protected]>

This documentation is licensed under the terms of the GNU Free Documentation License (common/fdl-license.html).

This program is licensed under the terms of the GNU General Public License (common/gpl-license.html).


Appendix A. Installation

Please see our wiki (http://www.cyber-byte.at/wiki/index.php/English:_Setup) for install instructions.
