doc 1 mobility



Addressing Project 2012/2013


    Mansoura University

    Faculty of Computer and Information Science

    Dept. Information Technology

    Graduation Project 2012/2013

Supervisor:

    Dr. Mohammed Elmogy

Team:

    Abeer Ramadan Shams Eldein

    Fatma El-zahraa Alaa Ali

    Fayza Fawzy Abdo Elbrawy

Heba Allah Zain

    Mahmoud Mohamed Abdelsalam

    Mohamed Aziz Rashad

    Mohamed Shalby Elayouti

    Mostafa Abdelhamid Elasey

    Nada Mohamed Mohammed

    Nada Mohamed Sobhi Nada

Osama Hesham Hagras


Index

Chapter 1: Introduction

    Chapter 2: Speech-to-Text

Chapter 3: Text-to-Speech

Chapter 4: System Analysis and Design

    Chapter 5: Framework


Chapter 1: Introduction


Sometimes it becomes apparent that previous approaches to a problem haven't quite worked the way you anticipated. Perhaps you just need to clear away the smoky residue of the past, take a deep breath, and try again with a new attitude and fresh ideas. In golf, it's known as a mulligan; in schoolyard sports, it's called a do-over; and in the computer industry, we say it's a reboot.

So our problem is how to enable two groups of people to communicate: people who are not able to communicate with each other or with the computer world.

In the past, the traditional idea was to have a person acting as the middleware between them.

With fresh ideas, we can create a system that lets these people communicate with each other and with the computer world.

We are speaking about deaf and mute people and blind people. This is our problem, and our system will solve it easily.

But how? Let's think... let's go.

How can we make blind, deaf and mute people deal with the computer?


The blind person will speak what he wants; our system will take his voice and, using an NLP system, convert it into a statement and then into words. This process is called Speech-to-Text conversion.

On the other hand, we need another process in which, using the Poser program, we convert the statements or words into signs that are displayed to the deaf and mute people so they can understand what has been said.

On the other side, the deaf and mute people will use a sign keyboard system which lets them easily write what they want. Not only that: our system also supports a large database of familiar words and statements that they can use in daily life, represented as sign images; these images are displayed in a dictionary form that is easy to use.

After they write what they want, our system will convert the text to voice so that the blind person can hear it; this process is called Text-to-Speech conversion, and our system supports it as well.

Our goal for this system is to make life easier for deaf, mute and blind people and to use the newest technologies, such as web development and mobile development, efficiently.
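To make the intended flow concrete, here is a very rough, hypothetical sketch of the two directions of communication; the function names and bodies are placeholders for illustration only and are not the project's actual modules.

```python
# A very high-level, illustrative sketch of the two-way flow described above: speech
# from a blind user becomes text (which could then be shown as sign images to deaf
# users), and text typed by a deaf or mute user becomes speech for the blind user.
# Both converter functions are hypothetical placeholders, not the project's real modules.
def speech_to_text(audio: bytes) -> str:
    """Placeholder for the NLP-based Speech-to-Text module."""
    return "how are you today"            # illustrative recognized text

def text_to_speech(text: str) -> bytes:
    """Placeholder for the Text-to-Speech module."""
    return text.encode("utf-8")           # stands in for synthesized audio samples

def relay_spoken_message(audio: bytes) -> str:
    """Blind user speaks -> text that can be rendered as sign images for deaf users."""
    return speech_to_text(audio)

def relay_typed_message(text: str) -> bytes:
    """Deaf or mute user types (via the sign keyboard) -> audio the blind user hears."""
    return text_to_speech(text)

if __name__ == "__main__":
    print(relay_spoken_message(b"<recorded audio>"))
    print(relay_typed_message("good morning"))
```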


We can implement this system on web and mobile platforms, as they are the newest technologies all over the world, and also use them for the communication process between the two groups.

Why this project? We find that these two groups of people suffer from a serious problem in dealing with the computer; they cannot use the technology well and they have difficulty in dealing with others, so we thought of this system to solve this problem.


Chapter 2: Speech-to-Text


Contents

The need for real-time speech-to-text conversion
The challenges of speech-to-text conversion in real time
Methods of real-time speech-to-text conversion
Text adaptation
Presentation format
Perspectives


Abstract

Intralingual speech-to-text conversion is a useful tool for integrating people with hearing impairments in oral communication settings, e.g. counselling interviews or conferences. However, the transfer of speech into written language in real time requires special techniques, as it must be very fast and almost 100% correct to be understandable. The paper introduces and discusses different techniques for intralingual speech-to-text conversion.

The need for real-time speech-to-text conversion

Language is a very fast and effective way of communicating. To use language means to express an unlimited amount of ideas, thoughts and practical information by combining a limited amount of words with the help of a limited amount of grammatical rules. The result of language production processes is a series of words and structures. Series of words are produced, i.e. spoken or signed, in a very rapid and effective way. Any person can follow such language production processes and understand what the person wants to express if two preconditions are fulfilled; the recipients must:

1. know the words and grammatical rules the speaker uses, and

2. be able to receive and process the physical signal.

Most people use oral language for everyday communication, i.e. they speak to other people and hear what other people say. People who are deaf or hard of hearing do not have equal access to spoken language; for them, precondition 2 is not fulfilled: their ability to receive speech is impaired. If people who are severely impaired in their hearing abilities want to take part in oral communication, they need a way to compensate for their physical impairment. Hearing aids are sufficient for many hearing-impaired people. However, if hearing aids are insufficient, spoken language has to be transferred into a modality which is accessible without hearing, e.g. into the visual domain. There are two main methods to transfer auditory information into a visible format. The translation into sign language is one method, and it is best for people who use sign language as a preferred language, as


e.g. many Deaf people do. However, for people with a hearing disability who do not know sign language, sign language interpreting is not an option, as for many hard-of-hearing people and people who became hearing impaired later in their life, or elderly people with various degrees of hearing loss. They prefer their native oral language given in a visible modality. For them, a transfer of spoken words into written text is the method of choice; in other words, they need an intralingual speech-to-text conversion. Speech-to-text translation (audiovisual translation) of spoken language into written text is an upcoming field, since movies on DVDs are usually sold with subtitles in various languages. While the original language is given auditorily, subtitles provide a translated version in another language at the same time visually. The audiovisual transfer from the spoken original language into other languages which are presented in the subtitles can be called an interlingual audiovisual translation. Interlingual translation aims at transferring messages from one language into another language. This translation process combines classical interpreting with a transfer from spoken language patterns into written text patterns. Auditory events which are realized as noises or speech melodies would often not be transferred, because normally hearing people can interpret them by themselves. Interlingual translation primarily addresses the lack of knowledge of the original language, i.e. the first precondition for understanding language. The intralingual audiovisual transfer differs in many aspects from the interlingual audiovisual translation between two languages. First of all, intralingual audiovisual transfer for people with hearing impairments primarily addresses precondition 2, i.e. the physical ability to perceive the speech signals. The aim of an intralingual audiovisual transfer is to provide all auditory information which is important for the understanding of an event or action. Words as well as non-language sounds like noises, or hidden messages which are part of the intonation of the spoken words (e.g. irony or sarcasm), need to be transmitted into the visual (or haptic) channel. How this can best be achieved is a question of present and future research and development (cf. Neves, in this book). Moreover, people with hearing impairment may insist on a word-by-word transfer of spoken into written language because they do not want a third person to decide which parts of a message are important (and will therefore be transferred) and which parts are not. As a result, intralingual audiovisual transfer for people with hearing impairment might mean that every spoken word of a speech has to be written down and that all relevant auditory events from outside of the speech have to be described, too (interruptions,


noises). In the latter case, the intralingual audiovisual transfer would exclusively satisfy the physical ability to perceive the speech signal (precondition 2). The classical way to realize an intralingual speech-to-text transfer is to stenotype a protocol, or to record the event and transfer it into a readable text subsequently. This post-event transfer process is time-consuming and often difficult, since auditory events easily become ambiguous outside of the actual context. Moreover, the time shift involved in the transfer into a readable text means delayed access to the spoken words, i.e. it does not help people with hearing impairments in the actual communication situation. However, for counselling interviews, at the doctor's or at conferences, access to spoken information must be given in real time. For these purposes, the classical methods do not work.

The challenges of speech-to-text conversion in real time

Real-time speech-to-text conversion aims at transferring spoken language into written text (almost) simultaneously. This gives people with a hearing impairment access to the contents of spoken language in a way that they become able, for example, to take part in a conversation within the normal time frame of conversational turn taking. Another scenario for real-time speech-to-text transfer is a live broadcast of a football match where the spoken comments of the reporter are so rapidly transferred into subtitles that they still correspond to the scene the reporter comments on. An example from the hearing world would be a parliamentary debate which ends with the electronic delivery of the exact word protocol presented to the journalists immediately after the end of the debate (cf. Eugeni, forthcoming). This list could easily be continued. However, most people with a hearing disability do not receive real-time speech-to-text services at counselling interviews, at conferences or when watching a sports event live on TV. Most parliamentary protocols are tape recorded or stenotyped and subsequently transferred into readable text. What are the challenges of real-time speech-to-text conversion that make its use so rare?

- Time

A good secretary can type about 300 key strokes (letters) per minute. Since the average speaking rate is about 150 words per minute (with some variance between speakers and languages), even the professional typing rate is certainly not


high enough to transfer a stream of spoken words into a readable form in real time. As a consequence, the speed of typing has to be increased for a sufficient real-time speech-to-text transfer. Three different techniques will be discussed in the following section on methods.

- Message transfer

The main aim of speech-to-text transfer is to give people access to spoken words and auditory events almost simultaneously with the realization of the original sound event. However, for people with limited access to spoken language at a young age, a 1:1 transfer of spoken words into written text may sometimes not be very helpful. If children are not sufficiently exposed to spoken language, their oral language system may develop more slowly and less effectively compared with their peers. As a result, many people with an early hearing impairment are, as adults, less used to the grammatical rules applied in oral language and have a less elaborated mental lexicon compared with normally hearing people (Schlenker-Schulte, 1991; see also Perfetti et al. 2000 with respect to reading skills among deaf readers). If words are unknown or if sentences are too complex, the written form does not help their understanding. The consequence for intralingual speech-to-text conversion is that precondition 1, the language proficiency of the audience, also has to be addressed, i.e. the written transcript has to be adapted to the language abilities of the audience while the speech goes on. Speech-to-text service providers not only need to know their audience, they also have to understand how grammatical complexity can be reduced. They need to know techniques of how to make the language in itself more accessible while the information transferred is preserved. Aspects of how language can be made more accessible will be discussed in the following section on text adaptation.

- Real-time presentation of the written text

Reading usually means that words are already written down. Presented with a written text, people will read at their individual reading speed. This, however, is not possible in real-time speech-to-text conversion. Here, the text is written and read almost simultaneously, and the control of the reading speed shifts at least partly over to the speaker and the speech-to-text provider. The text is not fixed in advance; instead, new words are produced continuously and readers must follow


this word production process very closely if they want to use the real-time abilities of speech-to-text transfer. Because of this interaction of writing and reading, the presentation of the written text must be optimally adapted to the reading needs of the audience. This issue will be discussed at the end of the paper in the section on presentation format.

The challenges of real-time speech-to-text conversion can now be summarized as follows:

1. It must be fast enough in producing written language.

2. It must be possible to meet the expectations of the audience with respect to the characteristics of a written text: word-by-word transfer, enhanced by a description of auditory events from the surroundings, as well as adaptations of the original wording into easier forms of language, must be possible.

3. A successful real-time presentation must match the reading abilities of the audience, i.e. the written words must be presented in a way that is optimally recognizable and understandable for the readers.

Methods of real-time speech-to-text conversion

There are three methods that are feasible when realizing (almost) real-time speech-to-text transfer: speech recognition, computer-assisted note taking (CAN) and communication access (or computer-aided) real-time translation (CART). The methods differ

1. in their ability to generate exact real-time transcripts,

2. with respect to the conditions under which these methods can be properly applied, and


3. with respect to the amount of training which is needed to become a good speech-to-text service provider.

Speech recognition

Automatic speech recognition on its own is not yet an option for speech-to-text transfer, since phrase and sentence boundaries are not recognized. However, speech recognition can be used for real-time speech-to-text conversion if a person re-speaks the original words. Re-speaking is primarily necessary for including punctuation and speaker identification, but also for adapting the language to the language proficiency of the audience. Apart from intensive and permanent training of the speech recognition engine, no special training is required; a sound-shielded environment is useful. Linguistic knowledge, however, is necessary for the chunking of the words and for adaptations of the wording.
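As a rough illustration only, the following sketch shows how a re-speaking setup might drive a recognition engine; it assumes the third-party Python package SpeechRecognition (imported as speech_recognition), a microphone, and the Google Web Speech backend, none of which are prescribed by the text above.

```python
# A minimal sketch of re-speaking-based transcription, assuming the third-party
# "SpeechRecognition" package (import name: speech_recognition) and a working
# microphone. The recognizer backend (Google Web Speech API) is only illustrative.
import speech_recognition as sr

def respeak_chunk() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # a sound-shielded environment keeps this calibration step short
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # the re-speaker dictates punctuation and speaker names explicitly,
        # e.g. "speaker two colon I do not agree full stop"
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # nothing recognizable in this chunk

if __name__ == "__main__":
    print(respeak_chunk())
```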

Text adaptation

Spoken and written forms of language rely on different mechanisms to transfer messages. Speech, for instance, is less grammatical and less chunked than text. A real-time speech-to-text conversion, even if it is a word-for-word service, has to chunk the continuous stream of spoken words into sentences and phrases with respect to punctuation and paragraphs in order for the text to be comprehensible. A correction of grammatical slips might be necessary, too, for word-for-word conversions, and even more corrections may be necessary for an audience with less language proficiency. While intonation may alleviate incongruencies in spoken language, congruency errors easily cause misinterpretation in reading. The transfer from spoken into written language patterns is only one method of text adaptation. As discussed earlier, the speech-to-text provider might also be asked to adapt the written text to the language proficiency of the audience. Here, the challenge of word-for-word transfer shifts to the challenge of message transfer with a reduced set of language material. A less skilled audience might be overstrained especially by complex syntactical structures and low-frequency words and phrases. The speech-to-text provider therefore needs to know whether a word or phrase can be well understood or should better be exchanged with some more frequent


equivalents. S/he also has to know how to split long and complex sentences into simpler structures to make them easier to understand.
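The following minimal sketch illustrates two of the adaptations just described, splitting long sentences and flagging low-frequency words; the word list and the length threshold are invented for the example.

```python
# A minimal sketch of two adaptations discussed above: splitting long sentences into
# simpler chunks and flagging words that are not on a frequency list, so the provider
# can substitute a more common equivalent. The word list and the 12-word limit are
# illustrative assumptions, not values from the report.
import re

FREQUENT_WORDS = {"the", "a", "an", "is", "are", "we", "will", "meet", "at",
                  "nine", "tomorrow", "and", "then", "talk", "about", "plan"}
MAX_WORDS_PER_CHUNK = 12

def split_long_sentence(sentence: str) -> list[str]:
    """Break a sentence at commas and conjunctions when it exceeds the word limit."""
    if len(sentence.split()) <= MAX_WORDS_PER_CHUNK:
        return [sentence]
    parts = re.split(r",|;| and | but ", sentence)
    return [p.strip() for p in parts if p.strip()]

def flag_rare_words(sentence: str) -> list[str]:
    """Return words the audience may not know, as candidates for substitution."""
    words = re.findall(r"[a-zA-Z']+", sentence.lower())
    return [w for w in words if w not in FREQUENT_WORDS]

if __name__ == "__main__":
    s = "We will meet tomorrow at nine, and then we will deliberate about the itinerary"
    print(split_long_sentence(s))
    print(flag_rare_words(s))   # -> ['deliberate', 'itinerary']
```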

Presentation format

The last challenge of real-time speech-to-text transfer is the presentation of the text on the screen in a way that optimally supports reading. The need to think about the presentation format arises because the text on the screen is moving, which is a problem for the reading process. We usually read a fixed text, and our eyes are trained to move in saccades (rapid eye movements) on the basis of a kind of preview calculation with respect to the next words (cf. Sereno et al. 1998). But in real-time speech-to-text systems, the text appears consecutively on the screen and new text replaces older text when the screen is filled. A word-by-word presentation, as a consequence of word-for-word transcription, could result in less precise saccades, which subsequently decreases the reading speed. Reading might be less hampered by a line-by-line presentation, as it is used e.g. in C-Print (cf. the online presentation at http://www.rit.edu/~techsym/detail.html#T11C). However, for slower readers, line-by-line presentation might also be problematic, since the whole old text moves upwards whenever a new line is presented. As a consequence, the word which was actually fixated by the eyes moves out of the fovea and becomes unreadable. The eyes have to look for the word and restart reading it. The optimal presentation of real-time text for as many potential readers as possible is an issue which is worth further research, not only from the perspective of real-time transcription but also for subtitling purposes.
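A small sketch of the line-by-line presentation idea mentioned above: words are buffered and only released as complete lines, so the display does not move word by word. The line width is an arbitrary assumption.

```python
# Sketch of line-by-line presentation: incoming words are buffered and only flushed
# to the display as complete lines, so the reader's eyes are not chasing a word-by-word
# stream. The 40-character line width is an illustrative assumption.
LINE_WIDTH = 40

class LineBuffer:
    def __init__(self):
        self.current = []

    def add_word(self, word: str):
        """Add a word; return a finished line when the buffer is full, else None."""
        candidate = " ".join(self.current + [word])
        if len(candidate) > LINE_WIDTH:
            line = " ".join(self.current)
            self.current = [word]
            return line
        self.current.append(word)
        return None

if __name__ == "__main__":
    buf = LineBuffer()
    for w in "the control of the reading speed shifts over to the speaker".split():
        line = buf.add_word(w)
        if line:
            print(line)   # a display client would append this line to the screen
```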

Perspectives

Real-time speech-to-text transfer is already a powerful tool which provides people with a hearing impairment access to oral communication. However, elaborated dictionaries, as they are needed for efficient CAN or CART systems, have not yet been developed for many languages. Without those dictionaries, the systems cannot be used. Linguistic research has to find easy but efficient strategies for the real-time adaptation of the wording in order to make a message understandable also for an


audience with limited language proficiency. Finally, the optimal presentation of moving text to an audience with diverging reading abilities is a fascinating research field, not only for real-time speech-to-text services but also with respect to the presentation of movable text in general.

Speech analysis

The speech analysis block is itself composed of:

- A pre-processing module, which organizes the input speech into manageable lists of letters.

- A morphological analysis module, the task of which is to propose all possible part-of-speech categories for each letter taken individually.


Chapter 3: Text-to-Speech

General Overview

Text-to-speech is technology that converts digital text to audible speech; in other words, it allows a device to "talk" to the user through its speaker.

Converting text into voice output uses speech synthesis techniques. It was initially used by the blind to listen to written material, for example through MSAA (Microsoft Active Accessibility), a software interface that lets a Windows application be designed for the visually impaired: it allows each object (window, dialog box, etc.) in the user interface to identify itself so that a screen reader can be used. A screen reader is software for the visually impaired that reads the contents of a computer screen,


converting the text to speech. Screen readers are designed for specific operating systems and generally work with most applications.

Text-to-speech is now used extensively to convey financial data and e-mail messages. In a phone, text-to-speech might announce a caller's name when they call instead of playing a ringtone. It might also read a text message aloud, or speak the names of menu items as you scroll through them. This feature can be useful for hands-free use of a phone while driving, allowing the driver to keep eyes on the road. It is also very useful for the vision-impaired, and it is used in GPS units to announce street names when giving directions.

Early text-to-speech (TTS) systems had a very robotic sound; however, with the advent of high-speed chips and advanced software techniques, text-to-speech has become much more natural.

Text-to-speech (TTS) is a type of speech synthesis application that is used to create a spoken sound version of the text in a computer document, such as a help file or a Web page. TTS can enable the reading of computer display information for the visually challenged person, whether it was directly introduced into the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system, or it may simply be used to augment the reading of a text message. Current TTS applications include voice-enabled e-mail and spoken prompts in voice response systems. TTS is often used with voice recognition programs. There are numerous TTS products available, including Read Please 2000, Proverbe Speech Unit, and Next Up Technology's TextAloud. Lucent, Elan, and AT&T each have products called "Text-to-Speech". In addition to TTS software, a number of vendors offer products involving hardware, including the Quick Link Pen from WizCom Technologies, a pen-shaped device that can scan and read words; the Road Runner from Ostrich Software, a handheld device that reads ASCII text; and DecTalk TTS from Digital Equipment, an external hardware device that substitutes for a sound card and which includes an internal software device that works in conjunction with the PC's own sound card.
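As a minimal, hedged example of driving an installed speech engine from code (in the spirit of the screen-reader and phone scenarios above), the sketch below uses the third-party pyttsx3 package; the package choice and the settings are assumptions for illustration, not part of this project.

```python
# A minimal text-to-speech sketch. It assumes the third-party pyttsx3 package, which
# drives the platform's installed speech engine (SAPI5 on Windows, NSSpeechSynthesizer
# on macOS, eSpeak on Linux); it is offered as an illustration, not as the system
# described in this report.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)          # roughly the average speaking rate in words per minute
engine.say("You have one new message.")  # e.g. announcing a caller or reading a text aloud
engine.runAndWait()
```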

A brief history of speech synthesis (text to speech)

The history of speech synthesis:

Over the last few years there has been great development in the quality of the speech produced with text-to-speech. Many people think that synthetic speech, as it is also called, sounds like robots from older movies. The truth, though, is that some voices almost sound like recorded speech, and because of that we have seen very strong growth in the user groups for these services over the last years.

    When we invented the talking web in 2001 the target group was people withreading difficulties but now we see that the user group is much broader.What you maybe dont know is that the first synthetic speech was produced asearly as in the late 18th century. The machine was built in wood and leather andwas very complicated to use generating audible speech. It was constructedby Wolfgang von Kempelen and had great importance in the early studies ofPhonetics. The picture to down is the original construction as it can be seen at theDeutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) inMunich, Germany.

    (First there is a human that says a sentence and then the machine tries to say thesame. This was made by a re-construction of Kempelens machine.)In the early 20th century when it was possible to use electricity to create syntheticspeech, the first known electric speech synthesis was Voder and its creatorHomer Dudley showed it to a broader audience in 1939 on the world fair in NewYork.One of the pioneers of the development of speech synthesis in Sweden was GunnarFant. During the 1950s he was responsible for the development of the first Swedishspeech synthesis OVE (Orator VerbisElectris.) By that time it was only Walter

    Lawrences Parametric Artificial Talker (PAT) that could compete with OVE inspeech quality

    Speech synthesis becomes more human-like

    The greatest improvements when it comes to natural speech were during the last 10years. The first voices we used for ReadSpeaker back in 2001 were produced usingDiphone synthesis. The voices are sampled from real recorded speech and split into


phonemes, small units of human speech. This was the first example of concatenation synthesis. However, these voices still have an artificial, synthetic sound. We still use diphone voices for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones due to their limited resource consumption, both memory and CPU. It wasn't until the introduction of a technique called unit selection that voices became very natural sounding. This is still concatenation synthesis, but the units used are larger than phonemes, sometimes a complete sentence. We use different providers for different languages to always ensure we can offer the best voices available for each language.

How does a machine read?

At first sight, this task does not look too hard to perform. After all, is not the human being potentially able to correctly pronounce an unknown sentence, even from his childhood? We all have, mainly unconsciously, a deep knowledge of the reading rules of our mother tongue. They were transmitted to us, in a simplified form, at primary school, and we improved them year after year. However, it would be a bold claim indeed to say that it is only a short step before the computer is likely to equal the human being in that respect. Despite the present state of our knowledge and techniques and the progress recently accomplished in the fields of Signal Processing and Artificial Intelligence, we would have to express some reservations. As a matter of fact, the reading process draws from the furthest depths, often unthought of, of the human intelligence.

Figure 1 introduces the functional diagram of a very general TTS synthesizer. As for human reading, it comprises a Natural Language Processing module (NLP), capable of producing a phonetic transcription of the text read, together with the desired intonation and rhythm (often termed prosody), and a Digital Signal Processing module (DSP), which transforms the symbolic information it receives into speech. But the formalisms and algorithms applied often manage, thanks to a judicious use of the mathematical and linguistic knowledge of developers, to short-circuit certain processing steps. This is occasionally achieved at the expense of some restrictions on the text to pronounce.


    Figure 1: A simple but general functional diagram of a TTS system.

NLP COMPONENT

Figure 2 introduces the skeleton of a general NLP module for TTS purposes. One immediately notices that, in addition to the expected letter-to-sound and prosody generation blocks, it comprises a morpho-syntactic analyser, underlying the need for some syntactic processing in a high-quality Text-To-Speech system. Indeed, being able to reduce a given sentence into something like the sequence of its parts of speech, and to further describe it in the form of a syntax tree, which unveils its internal structure, is required for at least two reasons:

1) Accurate phonetic transcription can only be achieved provided the part-of-speech category of some words is available, as well as if the dependency relationship between successive words is known.

2) Natural prosody heavily relies on syntax. It also obviously has a lot to do with semantics and pragmatics, but since very little data is currently available on the generative aspects of this dependence, TTS systems merely concentrate on syntax. Yet few of them are actually provided with full disambiguation and structuration capabilities.
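A tiny example of reason 1: the heteronym "record" needs its part-of-speech category before it can be transcribed phonetically. The sketch assumes the NLTK package with its standard tokenizer and tagger data installed; it is only an illustration of the idea, not the analyser described above.

```python
# Illustration of why part-of-speech information matters for phonetic transcription,
# using NLTK's off-the-shelf tokenizer and tagger (assumes the nltk package and its
# 'punkt' and 'averaged_perceptron_tagger' data have been downloaded). The heteronym
# "record" is pronounced differently as a verb and as a noun.
import nltk

for sentence in ["Please record the lecture.", "She broke the world record."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))  # the tag for 'record' differs between the two readings
```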


Fig 2. The NLP module of a general Text-To-Speech conversion system.

Text analysis

The text analysis block is itself composed of:

A pre-processing module, which organizes the input sentences into manageable lists of words. It identifies numbers, abbreviations, acronyms and idiomatic expressions and transforms them into full text when needed. An important problem is encountered as soon as the character level: that of punctuation ambiguity (including the critical case of sentence end detection). It can be solved, to some extent, with elementary regular grammars.


A morphological analysis module, the task of which is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling. Inflected, derived, and compound words are decomposed into their elementary graphemic units.

A contextual analysis module, which considers words in their context; this allows it to reduce the list of their possible part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighbouring words. This can be achieved either with n-grams, which describe local syntactic dependences in the form of probabilistic finite state automata (i.e. as a Markov model), to a lesser extent with multi-layer perceptrons (i.e. neural networks) trained to uncover contextual rewrite rules, as in [Benello], or with local, non-stochastic grammars provided by expert linguists or automatically inferred from a training data set with classification and regression tree (CART) techniques.

Finally, a syntactic-prosodic parser, which examines the remaining search space and finds the text structure (i.e. its organization into clause- and phrase-like constituents) which more closely relates to its expected prosodic realization.
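To illustrate the pre-processing module described above, the following sketch expands a few abbreviations and spells out digits; the lookup tables are illustrative assumptions, and a real module would also resolve punctuation ambiguity.

```python
# A sketch of the pre-processing step: abbreviations and digits are expanded into full
# words before phonetization. The lookup tables are illustrative assumptions; a real
# module would verbalize whole numbers and handle acronyms and idiomatic expressions.
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # spell out digits one by one (a full system would read "221" as a number)
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Dr. Smith lives at 221 Baker St."))
# -> "Doctor Smith lives at two two one Baker Street"
```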


Database preparation

Figure 3. A general concatenation-based synthesizer. The upper left hatched block corresponds to the development of the synthesizer (i.e. it is processed once and for all). Other blocks correspond to run-time operations. Language-dependent operations and data are indicated by a flag.

Segments are then often given a parametric form, as a temporal sequence of vectors of parameters collected at the output of a speech analyzer and stored in a parametric segment database. The advantage of using a speech model originates in the following facts:

- Well-chosen speech models allow data size reduction, an advantage which is hardly negligible in the context of concatenation-based synthesis given the amount


of data to be stored. Consequently, the analyzer is often followed by a parametric speech coder.

- A number of models explicitly separate the contributions of, respectively, the source and the vocal tract, an operation which remains helpful for the pre-synthesis operations: prosody matching and segment concatenation.

Speech synthesis:

A sequence of segments is first deduced from the phonemic input of the synthesizer, in a block termed segment list generation in Fig. 3, which interfaces the NLP and DSP modules. Once prosodic events have been correctly assigned to individual segments, the prosody matching module queries the synthesis segment database for the actual parameters, adequately uncoded, of the elementary sounds to be used, and adapts them one by one to the required prosody. The segment concatenation block is then in charge of dynamically matching segments to one another by smoothing discontinuities. Here again, an adequate modelization of speech is highly profitable, provided simple interpolation schemes performed on its parameters approximately correspond to smooth acoustical transitions between sounds. The resulting stream of parameters is finally presented at the input of a synthesis block, the exact counterpart of the analysis one. Its task is to produce speech.
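The toy sketch below illustrates the concatenation-and-smoothing step with a simple linear cross-fade between two stored segments; the miniature "segment database" and the overlap length are invented for the example.

```python
# Toy illustration of segment concatenation with smoothing: each segment is fetched
# from a synthesis segment database and joined to its neighbour with a short linear
# cross-fade. The fake waveforms and the 4-sample overlap are illustrative assumptions.
OVERLAP = 4

def crossfade(a: list[float], b: list[float]) -> list[float]:
    """Join two segments, blending the last OVERLAP samples of a with the first of b."""
    head, tail = a[:-OVERLAP], a[-OVERLAP:]
    mixed = [
        t * (1 - i / OVERLAP) + s * (i / OVERLAP)
        for i, (t, s) in enumerate(zip(tail, b[:OVERLAP]))
    ]
    return head + mixed + b[OVERLAP:]

segment_db = {            # stand-in for the parametric segment database
    "h-e": [0.0, 0.2, 0.4, 0.6, 0.5, 0.4, 0.3, 0.2],
    "e-l": [0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1],
}

utterance = crossfade(segment_db["h-e"], segment_db["e-l"])
print(utterance)
```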

Segmental quality

The efficiency of concatenative synthesizers in producing high-quality speech is mainly subordinated to:

1. The type of segments chosen.
2. Segments should obviously exhibit some basic properties:
- They should allow accounting for as many co-articulatory effects as possible.
- Given the restricted smoothing capabilities of the concatenation block, they should be easily connectable.
- Their number and length should be kept as small as possible.


Examples of Text-to-Speech software:

1- Text Speaker
2- Alive Text to Speech
3- DSpeech Text-to-Speech software
4- Talking Clipboard Text-to-Speech software
5- Text-To-Voice by Caltrox Educational Software
6- NaturalReader, a text-to-speech software
7- ClaroRead, a professional Text-to-Speech software
8- Balabolka, a Text-To-Speech (TTS) program
9- TextSpeech Pro, a professional set of Text-to-Speech software
10- NeoSpeech Text-to-Speech software: NeoSpeech is privately held, headquartered in Santa Clara, California, and backed by the resources of Voiceware Co., Ltd. of Korea and HOYA of Japan.

Text-to-speech software reads text in any application and converts text to MP3, WAV, OGG or VOX files.


Chapter 4: System Analysis and Design


    Overview

    "My experience has shown that many people find it hard to make their design ideasprecise. They are willing to express their ideas in loose, general terms, but are

    unwilling to express them with the precision needed to make them into patterns.Above all, they are unwilling to express them as abstract spatial relations amongwell-defined spatial parts. I have also found that people aren't always very good atit; it is hard to do..... If you can't draw a diagram of it, it isn't a pattern. If you thinkyou have a pattern, you must be able to draw a diagram of it. This is a crude, butvital rule. A pattern defines a field of spatial relations, and it must always bepossible to draw a diagram for every pattern. In the diagram, each part will appearas a labeled or colored zone, and the layout of the parts expresses the relationwhich the pattern specifies. If you can't draw it, it isn't a pattern".

    Christopher Alexander (1979) in The Timeless Way of Building.

System Analysis of Speech-to-Text:

Here we are going to present the system analysis and design of the Speech-to-Text part, where system analysis is the key to starting the implementation and the way to go on in the project. Systems are created to solve problems. We need to see all sides of a problem to come up with an acceptable solution. Analysis involves studying the system and seeing how it interacts with the entities outside as well as inside the system. We then come out with detailed specifications of what the system will accomplish based on the user requirements.

Systems design will take the requirements and analysis into consideration and come out with a high-level and low-level design that will form the blueprint for the actual solution to the problem at hand.

In this dynamic world, analysis and design have to look into making systems that are flexible enough to accommodate changes, as changes are inevitable in any system. Systems study and analysis is very important; most projects fail because of shortcomings in this phase. The problem is that most customers have some hazy requirements, or end results they would like to achieve. This phase really defines, in technically implementable terms, what the requirements are. Systems analysis and design is necessary because it helps you first to identify the problem that


you're trying to find a solution for. The analysis part has a lot to do with "diagnosis." You want to make sure that you understand the problem in depth, and that you understand what the end users need. If you provide a solution that does not meet users' needs, your system is useless.

Now, to look at our system and its analysis and design, there are many forms and types of this analysis, but there is a most commonly used format that will be used here. We start with a use case diagram to establish the main actors and main processes in our part. Use cases are important because they are in a tracking format. Hence they make it easy to comprehend the functional requirements of the system and also make it easy to identify the various interactions between the users and the system within an environment. They are descriptive and hence clearly represent the value of an interaction between actors and the system. They clarify system requirements very categorically and systematically, making it easier to understand the system and its interactions with the users. During the analysis phase of the project's System Development Life Cycle, use cases help to understand the system's functionality.

See figure (1).

If we look at this figure, we see that this system has several actors that are responsible for the main operations of the system. Here we have three main actors: DEVELOPER, USER, and SYSTEM. Each is associated with its own part of the operation. The DEVELOPER has many main operations to do. As mentioned in the Speech-to-Text framework part, the developer has to save all the data used along the process in a database; this database may contain, for example, common words and letters, and also numbers and abbreviations to be used. Not only that, but the database may also contain the words and letters of the sounds to be used (speech to letters or words!). The developer also has to create a special database for the sound-to-letter dictionary associated with the previously saved letters and words. The USER is the second main actor in our system; the user is responsible for entering the speech to be written, as without speech there is no project! The SYSTEM is the third actor; there are many operations associated with the system, as it is the core of the project and our goal is to deal with the system. The system has to divide the speech received from the user into letters and, if a word is not in our common dictionary, to use the sound-to-letter dictionary. Once the statement is divided into letters, the system has to check whether the words


are common and present in the database, ready to be written out; if not, the system has to divide the words into letters, go to the sound-to-letter dictionary, and search for each sound letter to determine whether it is a default character or has a special pronunciation depending on the position of the character in the statement, and so on. The last operation is to write the letters and concatenate them together to produce the resulting statement, which also depends on determining the beginning and the end of the statement.

    Figure (1)


Now we step to another diagram to explain the role of the sequence diagram and its steps in our project, and how this part works. Sequence diagrams are the interaction diagrams which show all the operations carried out between the elements of your system in their time sequence. They are written in the Unified Modeling Language, which means they have a standard way to show how things will be. A sequence diagram covers all of the objects in your system; it shows their operations, the messages they send to each other, and their behavior under certain conditions.

Since sequence diagrams show the basic behavior scheme of every class in your project, drawing them is one of the most important activities while designing a system. It simplifies the process of coding the classes and lets us distribute functions to our classes so that the system operates properly. To start explaining the steps of the sequence diagram, we look at figure (2). This process depends greatly on the use case operations: first, the developer saves words, symbols and abbreviations in the database and also creates the sound-to-letter database and the letters of the sounds to be used; then the user enters the speech to be written. The system then starts to read the speech and divide it into letters, and it searches the letters for numbers and dates, which are read directly from their dictionary. In the next step, the system determines whether the word is common; if yes, the system reads it from the dictionary containing the sounds of commonly used words. If it is not a common word, the sound-to-letter rule is applied.


    Figure (2)
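One step of the sequence above, detecting numbers and dates so they are read from their own dictionary, can be sketched as follows; the patterns and categories are illustrative assumptions, not the project's actual rules.

```python
# Sketch of one step in the sequence above: after the speech is divided into tokens,
# numbers and dates are detected so they can be handled by their own dictionary
# rather than by the general sound-to-letter fallback. The patterns are illustrative.
import re

DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
NUMBER_PATTERN = re.compile(r"^\d+$")

def classify_tokens(text: str):
    for token in text.split():
        if DATE_PATTERN.match(token):
            yield token, "date"       # read from the date dictionary
        elif NUMBER_PATTERN.match(token):
            yield token, "number"     # read from the number dictionary
        else:
            yield token, "word"       # common-word lookup, then sound-to-letter fallback

print(list(classify_tokens("meeting on 29/7/2019 at 9")))
```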


System Analysis of Text-to-Speech:

Here we are going to present the system analysis and design of the Text-to-Speech part, where system analysis is the key to starting the implementation and the way to go on in the project. Systems are created to solve problems. We need to see all sides of a problem to come up with an acceptable solution. Analysis involves studying the system and seeing how it interacts with the entities outside as well as inside the system. We then come out with detailed specifications of what the system will accomplish based on the user requirements.

Systems design will take the requirements and analysis into consideration and come out with a high-level and low-level design that will form the blueprint for the actual solution to the problem at hand.

In this dynamic world, analysis and design have to look into making systems that are flexible enough to accommodate changes, as changes are inevitable in any system. Systems study and analysis is very important; most projects fail because of shortcomings in this phase. The problem is that most customers have some hazy requirements, or end results they would like to achieve. This phase really defines, in technically implementable terms, what the requirements are. Systems analysis and design is necessary because it helps you first to identify the problem that you're trying to find a solution for. The analysis part has a lot to do with "diagnosis." You want to make sure that you understand the problem in depth, and that you understand what the end users need. If you provide a solution that does not meet users' needs, your system is useless.

Now, to look at our system and its analysis and design, there are many forms and types of this analysis, but there is a most commonly used format that will be used here. We start with a use case diagram to establish the main actors and main processes in our part. Use cases are important because they are in a tracking format. Hence they make it easy to comprehend the functional requirements of the system and also make it easy to identify the various interactions between the users and the system within an environment. They are descriptive and hence clearly represent the value of an interaction between actors and the system. They clarify system requirements very categorically and systematically, making it easier to understand the system and its interactions with the users. During the analysis phase


of the project's System Development Life Cycle, use cases help to understand the system's functionality.

See figure (1).

If we look at this figure, we see that this system has several actors that are responsible for the main operations of the system. Here we have three main actors: DEVELOPER, USER, and SYSTEM. Each is associated with its own part of the operation. The DEVELOPER has many main operations to do. As mentioned in the Text-to-Speech framework part, the developer has to save all the data used along the process in a database; this database may contain, for example, common words and letters used in the speech process, and also numbers and abbreviations, and it may contain the sounds of words and letters to be used (word or letter to speech!). The developer also has to create a special database for the letter-to-sound dictionary associated with the previously saved sounds. The USER is the second main actor in our system; the user is responsible for entering the statement to be pronounced, as without a statement there is no project! The SYSTEM is the third actor; there are many operations associated with the system, as it is the core of the project and our goal is to deal with the system. The system has to divide the statement received from the user into words and, if a word is not in our common dictionary, to divide it into letters and use the letter-to-sound dictionary. Determining the type of the statement is also the responsibility of the system, as the semantics of the statement determine its type and therefore how it is correctly pronounced. After dividing the statement into words, the system has to search for any symbols and abbreviations in the statement; if found, it looks up their sounds in their dictionary so they can be pronounced directly. Once the statement is divided into words, the system has to check whether the words are common and present in the database, ready to be pronounced; if not, the system has to divide the words into letters, go to the letter-to-sound dictionary, and search for each letter's sound to determine whether it is a default character or has a special pronunciation depending on the position of the character in the statement, and so on. The last operation is the pronunciation of the words, concatenating the words or letters together to produce the resulting statement; the pronunciation also depends on determining the beginning and the end of the statement. Also, the tone of the pronunciation depends on the statement, starting with a high tone at the beginning and going low.


    Figure (1)

Now we step to another diagram to explain the role of the sequence diagram and its steps in our project, and how this part works. Sequence diagrams are the interaction diagrams which show all the operations carried out between the elements of your system in their time sequence. They are written in the Unified Modeling Language, which means they have a standard way to show how things will be. A sequence diagram covers all of the objects in your system; it shows their operations, the messages they send to each other, and their behavior under


certain conditions.

Since sequence diagrams show the basic behavior scheme of every class in your project, drawing them is one of the most important activities while designing a system. It simplifies the process of coding the classes and lets us distribute functions to our classes so that the system operates properly. To start explaining the steps of the sequence diagram, we look at figure (2). This process depends greatly on the use case operations: first, the developer saves words, symbols and abbreviations in the database and also creates the letter-to-sound database and the sounds of the letters to be used; then the user enters the statement to be pronounced. The system then starts to read the statement and divide it into words, searches the words for numbers and dates, which are read directly from their dictionary, and also determines the type of the statement depending on its semantics. In the next step, the system determines whether the word is common; if yes, the system reads it from the dictionary containing the sounds of commonly used words. If it is not a common word, applying the letter-to-sound rule is the next step: the system checks whether the letter has a special pronunciation depending on its position in the word; if the letter has no special case, it gets the default pronunciation of the letter from the default letter-to-sound table in its database and pronounces it. Then all the words or letters are concatenated and pronounced. The tone of the statement and its words depends on the type of the statement, and so does the resulting speech. This is the process in detail, and all the details are in the framework chapter.
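A hedged sketch of the lookup strategy described above: a common-word sound dictionary is tried first, and unknown words fall back to letter-to-sound rules that may depend on a letter's position. All dictionary contents and rules below are invented for illustration.

```python
# Sketch of the lookup strategy: try the common-word sound dictionary first, and only
# fall back to letter-to-sound rules for unknown words, where a letter's pronunciation
# may depend on its position in the word. All table contents are illustrative assumptions.
WORD_SOUNDS = {"hello": "heh-LOH", "world": "WURLD"}     # common-word dictionary
DEFAULT_LETTER_SOUNDS = {"a": "ah", "c": "k", "e": "eh", "t": "t", "y": "y"}

def letter_to_sound(word: str) -> list[str]:
    """Apply simple positional rules, falling back to the default letter sounds."""
    sounds = []
    for i, ch in enumerate(word):
        if ch == "c" and i + 1 < len(word) and word[i + 1] in "ei":
            sounds.append("s")                      # soft c, as in "cent"
        elif ch == "y" and i == len(word) - 1:
            sounds.append("ee")                     # word-final y, as in "city"
        else:
            sounds.append(DEFAULT_LETTER_SOUNDS.get(ch, ch))
    return sounds

def pronounce(word: str) -> str:
    word = word.lower()
    if word in WORD_SOUNDS:                         # common word: stored sound used directly
        return WORD_SOUNDS[word]
    return "-".join(letter_to_sound(word))          # unknown word: letter-to-sound fallback

print(pronounce("hello"))   # dictionary hit
print(pronounce("cyte"))    # made-up word handled by the fallback rules
```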

    Keyboard implementation Analysis


    Figure (2)

Figure (3)


Chapter 5: Framework


1. Speech-to-Text Recognition

    INTRODUCTION:

Real-time speech-to-text has been defined as the accurate transcription of words that make up spoken language into text momentarily after their utterance (Stuckless, 1994). This report will describe and discuss several applications of new computer-based technologies which enable deaf and hard of hearing students to read the text of the language being spoken by the instructor and fellow students, virtually in real time. In its various technological forms, real-time speech-to-text is a growing classroom option for these students.

This report is intended to complement several other reports in this series which focus on note taking (Hastings, Brecklein, Cermack, Reynolds, Rosen, & Wilson, 1997), assistive listening devices (Warick, Clark, Dancer, & Sinclair, 1997), and interpreting (Sanderson, Siple, & Lyons, 1999). It is notable that the Department of Justice has interpreted the Americans with Disabilities Act (P.L. 101-336) to include computer-aided transcription services under appropriate auxiliary aids and services (28 CFR 36.303). It should be emphasized at the outset that the real-time speech-to-text services described and discussed in this report are intended to complement, not replace, the options that are already available.

    DEVELOPMENT OF REAL-TIME SPEECH-TO-TEXT SYSTEMS:

Over the past 20 years, several developments have made it possible to use real-time speech-to-text transcription services as we know them today. These began with the development of smaller, more powerful computer systems, including their capability of converting stenotypic phonetic abbreviations electronically into understandable words. These parallel developments led to the earliest applications of steno-based systems both to the classroom and to real-time captioning in 1982. In the later 1980s, laptop computers became widely available. This enhanced portability led to the use of computers for note taking, in which the notetaker


used a standard keyboard in the regular classroom. It was at this time that stenotype machines were also linked to laptop computers, enhancing their portability. In the late 1980s, abbreviation software became available for regular keyboards (Stinson & Stuckless, 1998). Currently, both steno-based and standard keyboard approaches are being used with deaf and hard of hearing students in many mainstream secondary and postsecondary settings. Although the full extent of their usage nationwide remains to be documented, over the past 10 years there clearly has been an increased demand for speech-to-print transcription services in the classroom (Cuddihy, Fisher, Gordon, & Shumaker, 1994; Haydu & Patterson, 1990; James & Hammersley, 1993; McKee, Stinson, Everhart, & Henderson, 1995; Messerly & Youdelman, 1994; Moore, Bolesky, & Bervinchak, 1994; Smith & Rittenhouse, 1990; Stinson, Stuckless, Henderson, & Miller, 1988; Virvan, 1991).

    TWO CURRENT SPEECH-TO-TEXT OPTIONS:

Currently, two major options are available for providing real-time speech-to-text services to deaf and hard of hearing students. The first and second parts of this report will discuss these two options in order. But first, several general comments about the two systems should be made.

Steno-based systems. For these systems, a trained stenographer uses a 24-key machine to encode spoken English phonetically into a computer, where it is converted into English text and displayed on a computer screen or television monitor in real time. Generally, the text is produced verbatim. When used in schools, this system is often called CART (computer-aided real-time transcription), an apt acronym in view of the fact that stenotypists often transport their equipment from one classroom to another on wheels.

Computer-assisted note taking systems. For these systems, a typist with special training uses a standard keyboard to input words into a laptop/PC as they are being spoken. Sometimes these take the form of summary notes, sometimes almost verbatim text. These systems are often abbreviated as CAN (computer-assisted note taking).

Both types of systems provide a real-time text output that students can read on a computer or television screen in order to follow what is occurring in class. In addition, the text file can be examined by students, tutors, and instructors after class, either on the screen or as hard copy. These technologies offer receptive communication to deaf and hard of hearing students. However, they provide limited options for expressive communication on the part of these students, and service providers need to keep this in mind. We will begin by providing some basic nuts-and-bolts information that service providers need in order to


    implement a steno-based or computer assisted note taking (CAN) system. For eachof these systems, we address four major questions:

(1) How do these systems work?

(2) What major considerations need to be addressed with respect to their implementation as a support service in the classroom?

(3) Who is qualified to provide the service and what is his/her training?

(4) How can the system's effectiveness be evaluated, and what has been learned from evaluations to date?

In considering these systems, we will discuss aspects of particular speech-to-text systems with which we have had personal experience. Our focus on particular systems or associated college programs is not intended as an endorsement over other systems or college programs.

The third part of this report pertains to the use of speech-to-print services relative to other forms of support service, and the fourth part to the development of new speech-to-text systems, focusing on the status and potential of automatic speech recognition (ASR).

APPLICATIONS WITH DEAF AND HARD OF HEARING STUDENTS:

Steno-based systems provide a two-fold service that includes real-time speech-to-text transcription for deaf and hard of hearing students to read almost instantly in the classroom, and a written record of the class that they can use later for review. We will discuss these two applications in turn.

Real-time classroom implementation. Steno-based systems can be used to cover a variety of campus events, sometimes as real-time captioning where the text appears under the video image of a speaker. However, their primary application with deaf and hard of hearing students is in the classroom. Steno-based systems as used in the regular classroom provide a means for the deaf or hard of hearing student to replace listening with reading what the teacher and fellow students are discussing, in near real time. As indicated earlier, the steno typist sits near the front of the classroom, sometimes to the side, where he/she is in visual range of the teacher, students, the chalkboard, and other visual media that might be in use. Incidentally, the steno typist's equipment is silent and requires little space. So long as the text is legible to the deaf or hard of hearing student, it can be displayed in a number of ways. If the service is being provided for a single student, a second laptop can be used as a screen. However, if a number of deaf and/or hard of hearing students are using the service, a large TV or projection screen is in order.

From a classroom perspective, the presence of a steno-based system or a computer-assisted note taking system in the class is similar in some respects to having an interpreter there. More attention will be given to similarities and differences later in this report.

Hard copy text. Transcripts of lectures can be used as complete classroom notes, preserving the entire lecture and all students' comments for subsequent review by deaf and hard of hearing students taking the course. Typically, these transcripts are shared with these students and with the instructor. Some instructors welcome the transcripts as a way of tightening their lectures and reviewing their students' questions and comments. If the instructor chooses, he/she should be at liberty to share them with hearing members of the class also. The transcripts can also be of value in tutoring deaf and hard of hearing students, enabling tutors to organize tutoring sessions in close accord with course content. Also, interpreters sometimes use them to improve their signing of course-specific words and expressions.

Once the steno typist has completed the real-time transcription of a class for the deaf or hard of hearing student(s) enrolled in the course, he/she will edit the text. Depending on the particular class, a 50-minute class is likely to generate 25 to 30 pages of text. If the steno typist has a high accuracy rate in a given class, e.g., 98-99%, he/she may be able to correct errors and make the text more readable in one-half hour or less. Obviously, more errors (causes of which are discussed later under Accuracy) will require more editing time. Many students who use the text for review purposes prefer receiving an ASCII disk (edited or unedited) so they can organize their own format and decide for themselves what they want to retain or discard.

How Does Speech-to-Text Work?


A Speech-to-Text Reporter uses an electronic shorthand keyboard. They have undergone at least three years' intensive training in order to produce over 200 words per minute with an accuracy rate of around 98%. Several letter keys are pressed at once (a bit like piano chords), and each chord represents a syllable, a whole word, or sometimes a short phrase. The shorthand keyboard is connected to a laptop, where specialist software registers the chord strokes and finds a matching chord, or string of chords, which has an English counterpart. The software then displays the English counterpart on screen for someone to read. The text is displayed either on the screen of a laptop for a sole user, or projected onto a large screen or a series of plasma screens for a larger number of users.
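To make the chord-matching step concrete, here is a minimal sketch in C# (the project's implementation language). The chord strings and dictionary entries are invented purely for illustration and are not real steno theory; production steno software uses dictionaries with tens of thousands of entries and handles multi-stroke words.

```csharp
using System;
using System.Collections.Generic;

class ChordTranslator
{
    // Hypothetical chord-to-English dictionary; real steno dictionaries hold
    // tens of thousands of entries, including multi-stroke words and phrases.
    static readonly Dictionary<string, string> StenoDictionary = new Dictionary<string, string>
    {
        { "KAT", "cat" },          // invented strokes, purely for illustration
        { "STKPW", "is" },
        { "THANG/U", "thank you" }
    };

    // Look a stroke (or stroke string) up; show untranslated strokes in brackets
    // so the reporter can spot and correct them.
    static string Translate(string stroke) =>
        StenoDictionary.TryGetValue(stroke, out var english) ? english : "[" + stroke + "]";

    static void Main()
    {
        foreach (var stroke in new[] { "KAT", "STKPW", "XYZ" })
            Console.WriteLine(Translate(stroke));
    }
}
```

The important point is simply the lookup: each stroke (or string of strokes) is matched against a dictionary and its English counterpart is what appears on the screen.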

Why do some deaf people need to use Speech-to-Text and not others?

There are over seven million people in the UK with some degree of hearing loss, from mild to profound. The vast majority of these seven million cannot use British Sign Language but need access to communication in written English. The STTR profession was developed as a response to that need, and AVSTTR aims to continue to develop those skills through CPD and through sharing technological innovation amongst our members. Remote STT is a relatively new area and we are constantly looking for ways to improve the experience of the end user.

2. Text-to-Speech:-

    Overview

You might have already used text-to-speech in products, and maybe even incorporated it into your own application, but you still don't know how it works. This document will give you a technical overview of text-to-speech so you can understand how it works, and better understand some of the capabilities and limitations of the technology.

Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

1. Text normalization
2. Homograph disambiguation
3. Word pronunciation
4. Prosody
5. Concatenate wave segments

I'll cover each of these steps individually.
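As a rough sketch of how these five stages fit together, the pipeline can be pictured as a chain of functions. This is only an illustrative skeleton in C#, not a real engine; every method below is a stub and all of the names are invented.

```csharp
// Hypothetical skeleton of the five-stage pipeline; every stage below is a stub
// and all names are invented for illustration.
using System;
using System.Collections.Generic;

class TtsPipeline
{
    public byte[] Synthesize(string text)
    {
        List<string> words    = NormalizeText(text);           // 1. text normalization
        List<string> resolved = DisambiguateHomographs(words); // 2. homograph disambiguation
        List<string> phonemes = PronounceWords(resolved);      // 3. word pronunciation
        List<string> prosodic = ApplyProsody(phonemes);        // 4. prosody (pitch, duration, volume)
        return ConcatenateWaveSegments(prosodic);              // 5. concatenate wave segments
    }

    // Stubs standing in for the stages described in the sections that follow.
    List<string> NormalizeText(string text) => new List<string>(text.Split(' '));
    List<string> DisambiguateHomographs(List<string> words) => words;
    List<string> PronounceWords(List<string> words) => words;
    List<string> ApplyProsody(List<string> phonemes) => phonemes;
    byte[] ConcatenateWaveSegments(List<string> phonemes) => Array.Empty<byte>();

    static void Main()
    {
        byte[] audio = new TtsPipeline().Synthesize("John rode home.");
        Console.WriteLine($"Produced {audio.Length} bytes of audio (stubs only).");
    }
}
```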

    Text Normalization

    The "text normalization" component of text-to-speech converts any input text intoa series of spoken words. Trivially, text normalization converts a string like "Johnrode home." to a series of words, "john", "rode", "home", along with a markerindicating that a period occurred. However, this gets more complicated whenstrings like "John rode home at 23.5 mph", where "23.5 mph" is converted to"twenty three point five miles per hour". Heres how text normalization works:

First, text normalization isolates words in the text. For the most part this is as trivial as looking for a sequence of alphabetic characters, allowing for an occasional apostrophe and hyphen.

Text normalization then searches for numbers, times, dates, and other symbolic representations. These are analyzed and converted to words. (Example: "$54.32" is converted to "fifty four dollars and thirty two cents.") Someone needs to code up the rules for the conversion of these symbols into words, since they differ depending upon the language and context.
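As a toy illustration of one such rule, here is a sketch of the currency case mentioned above. The number-to-words helper is deliberately tiny (it only handles values below 100); a real normalizer covers the full numeric range plus dates, times, and so on.

```csharp
using System;

class CurrencyNormalizer
{
    // Very small number-to-words helper; a real normalizer covers the full range.
    static readonly string[] Ones =
        { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
          "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
          "seventeen", "eighteen", "nineteen" };
    static readonly string[] Tens =
        { "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" };

    static string NumberToWords(int n)
    {
        if (n < 20) return Ones[n];
        if (n < 100) return Tens[n / 10] + (n % 10 != 0 ? " " + Ones[n % 10] : "");
        return n.ToString(); // larger numbers omitted in this sketch
    }

    // "$54.32" -> "fifty four dollars and thirty two cents"
    static string NormalizeDollars(string token)
    {
        var parts = token.TrimStart('$').Split('.');
        int dollars = int.Parse(parts[0]);
        int cents = parts.Length > 1 ? int.Parse(parts[1]) : 0;
        return NumberToWords(dollars) + " dollars and " + NumberToWords(cents) + " cents";
    }

    static void Main() => Console.WriteLine(NormalizeDollars("$54.32"));
}
```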

Next, abbreviations are converted, such as "in." for "inches", and "St." for "street" or "saint". The normalizer will use a database of abbreviations and what they are expanded to. Some of the expansions depend upon the context of surrounding words, like "St. John" and "John St.".


The text normalizer might perform other text transformations, such as for internet addresses. "http://www.Microsoft.com" is usually spoken as "w w w dot Microsoft dot com".

Whatever remains is punctuation. The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. (Example: Periods at the end of sentences are not spoken, but a period in an Internet address is spoken as "dot.")

The rules will vary in complexity depending upon the engine. Some text normalizers are even designed to handle e-mail conventions like "You ***WILL*** go to the meeting. :-("

Once the text has been normalized and simplified into a series of words, it is passed on to the next module, homograph disambiguation.

    Homograph Disambiguation

The next stage of text-to-speech is called "homograph disambiguation." Often it's not a stage by itself, but is combined into the text normalization or pronunciation components. I've separated homograph disambiguation out since it doesn't fit cleanly into either.

In English and many other languages, there are hundreds of words that have the same text, but different pronunciations. A common example in English is "read," which can be pronounced "reed" or "red" depending upon its meaning. A "homograph" is a word with the same text as another word, but with a different pronunciation. The concept extends beyond just words, and into abbreviations and numbers. "Ft." has different pronunciations in "Ft. Wayne" and "100 ft.". Likewise, the digits "1997" might be spoken as "nineteen ninety seven" if the author is talking about the year, or "one thousand nine hundred and ninety seven" if the author is talking about the number of people at a concert.

Text-to-speech engines use a variety of techniques to disambiguate the pronunciations. The most robust is to try to figure out what the text is talking about and decide which meaning is most appropriate given the context. Once the right meaning is known, it's usually easy to guess the right pronunciation.


Text-to-speech engines figure out the meaning of the text, and more specifically of the sentence, by parsing the sentence and figuring out the part-of-speech for the individual words. This is done by guessing the part-of-speech based on the word endings, or by looking the word up in a lexicon. Sometimes a part of speech will be ambiguous until more context is known, such as for "read." Of course, disambiguation of the part-of-speech may require hand-written rules.
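A toy sketch of the idea for the "read" example: guess the tense from a nearby cue word and pick the pronunciation accordingly. The cue words and phoneme strings below are simplifications invented for illustration; real engines use full part-of-speech tagging.

```csharp
using System;

class HomographExample
{
    // Pick a pronunciation for "read" from a crude tense guess based on context.
    static string PronounceRead(string[] sentence, int index)
    {
        for (int i = 0; i < index; i++)
        {
            string w = sentence[i].ToLowerInvariant();
            if (w == "will" || w == "to" || w == "can")     // likely present/future -> "reed"
                return "r iy d";
            if (w == "had" || w == "was" || w == "already") // likely past -> "red"
                return "r eh d";
        }
        return "r iy d"; // default when no cue word is found
    }

    static void Main()
    {
        var s1 = new[] { "I", "will", "read", "the", "book" };
        var s2 = new[] { "I", "had", "read", "the", "book" };
        Console.WriteLine(PronounceRead(s1, 2)); // r iy d
        Console.WriteLine(PronounceRead(s2, 2)); // r eh d
    }
}
```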

Once the homographs have been disambiguated, the words are sent to the next stage to be pronounced.

    Word Pronunciation

The pronunciation module accepts the text, and outputs a sequence of phonemes, just like you see in a dictionary.

To get the pronunciation of a word, the text-to-speech engine first looks the word up in its own pronunciation lexicon. If the word is not in the lexicon, then the engine reverts to "letter to sound" rules.

Letter-to-sound rules guess the pronunciation of a word from the text. They're kind of the inverse of the spelling rules you were taught in school. There are a number of techniques for guessing the pronunciation, but the algorithm described here is one of the more easily implemented ones.

The letter-to-sound rules are "trained" on a lexicon of hand-entered pronunciations. The lexicon stores the word and its pronunciation, such as:

    hello h eh l oe

An algorithm is used to segment the word and figure out which letter "produces" which sound. You can clearly see that the "h" in "hello" produces the "h" phoneme, the "e" produces the "eh" phoneme, the first "l" produces the "l" phoneme, the second "l" nothing, and the "o" produces the "oe" phoneme. Of course, in other words the individual letters produce different phonemes. The "e" in "he" will produce the "ee" phoneme.

Once the words are segmented by phoneme, another algorithm determines which letter or sequence of letters is likely to produce which phonemes. The first pass figures out the most likely phoneme generated by each letter. "H" almost always generates the "h" sound, while "o" almost always generates the "ow" sound. A secondary list is generated, showing exceptions to the previous rule given the context of the surrounding letters. Hence, an exception rule might specify that an "o" occurring at the end of the word and preceded by an "l" produces an "oe" sound. The list of exceptions can be extended to include even more surrounding characters.

When the letter-to-sound rules are asked to produce the pronunciation of a word, they do the inverse of the training model. To pronounce "hello", the letter-to-sound rules first try to figure out the sound of the "h" phoneme. They look through the exception table for an "h" beginning the word and followed by "e"; since they can't find one, they use the default sound for "h", which is "h". Next, they look in the exceptions for how an "e" surrounded by "h" and "l" is pronounced, finding "eh". The rest of the characters are handled in the same way.
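Here is a minimal sketch of that lookup order: an exception table keyed on a letter and its neighbours, with a fall-back to a default phoneme per letter. The tables contain only the handful of entries needed to pronounce "hello"; a trained rule set would contain thousands.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class LetterToSound
{
    // Default phoneme per letter (only the letters needed for this example).
    static readonly Dictionary<char, string> Defaults = new Dictionary<char, string>
    {
        { 'h', "h" }, { 'e', "eh" }, { 'l', "l" }, { 'o', "ow" }
    };

    // Exceptions keyed on "previous|letter|next"; '#' marks a word boundary.
    static readonly Dictionary<string, string> Exceptions = new Dictionary<string, string>
    {
        { "l|l|o", "" },   // the second "l" in "hello" is silent
        { "l|o|#", "oe" }  // word-final "o" after "l" -> "oe"
    };

    static string Pronounce(string word)
    {
        var phonemes = new StringBuilder();
        for (int i = 0; i < word.Length; i++)
        {
            char prev = i > 0 ? word[i - 1] : '#';
            char next = i < word.Length - 1 ? word[i + 1] : '#';
            string key = $"{prev}|{word[i]}|{next}";
            // Check the exception table first, then fall back to the default sound.
            string p = Exceptions.TryGetValue(key, out var ex) ? ex : Defaults[word[i]];
            if (p.Length > 0) phonemes.Append(p).Append(' ');
        }
        return phonemes.ToString().Trim();
    }

    static void Main() => Console.WriteLine(Pronounce("hello")); // h eh l oe
}
```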

This technique can pronounce any word, even if it wasn't in the training set, and makes a very reasonable guess at the pronunciation, sometimes better than humans. It doesn't work too well for names, because most names are not of English origin and use different pronunciation rules. (Example: "Mejia" is pronounced as "meh-jee-uh" by anyone that doesn't know it is Spanish.) Some letter-to-sound rules first guess what language the word came from, and then use different sets of rules to pronounce each different language.

Word pronunciation is further complicated by people's laziness. People will change the pronunciation of a word based upon what words precede or follow it, just to make the word easier to speak. An obvious example is the way "the" can be pronounced as "thee" or "thuh." Other effects include the dropping or changing of phonemes. A commonly used phrase such as "What you doing?" sounds like "Wacha doin?"

Once the pronunciations have been generated, these are passed on to the prosody stage.

    Prosody

Prosody is the pitch, speed, and volume that syllables, words, phrases, and sentences are spoken with. Without prosody, text-to-speech sounds very robotic, and with bad prosody text-to-speech sounds like it's drunk.

The technique that engines use to synthesize prosody varies, but there are some general approaches.


First, the engine identifies the beginning and ending of sentences. In English, the pitch will tend to fall near the end of a statement, and rise for a question. Likewise, volume and speaking speed ramp up when the text-to-speech first starts talking, and fall off on the last word when it stops. Pauses are placed between sentences.

Engines also identify phrase boundaries, such as noun phrases and verb phrases. These will have similar characteristics to sentences, but will be less pronounced. The engine can determine the phrase boundaries by using the part-of-speech information generated during homograph disambiguation. Pauses are placed between phrases or where commas occur.

Algorithms then try to determine which words in the sentence are important to the meaning, and these are emphasized. Emphasized words are louder, longer, and will have more pitch variation. Words that are unimportant, such as those used to make the sentence grammatically correct, are de-emphasized. In a sentence such as "John and Bill walked to the store," the emphasis pattern might be "JOHN and BILL walked to the STORE." The more the text-to-speech engine "understands" what's being spoken, the better its emphasis will be.

Next, the prosody within a word is determined. Usually the pitch and volume rise on stressed syllables.

All of the pitch, timing, and volume information from the sentence level, phrase level, and word level is combined together to produce the final output. The output from the prosody module is just a list of phonemes with the pitch, duration, and volume for each phoneme.
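A sketch of the kind of output the prosody module hands on: each phoneme carries its own pitch, duration, and volume, here with a simple "final fall" applied at the end of a statement. All of the numbers are placeholders, not values from a real engine.

```csharp
using System;
using System.Collections.Generic;

// One entry in the prosody stage's output list.
struct ProsodicPhoneme
{
    public string Symbol;
    public double PitchHz;
    public int DurationMs;
    public double Volume; // 0.0 - 1.0
}

class ProsodyExample
{
    // Turn plain phonemes into prosodic ones, dropping the pitch and volume over
    // the last few phonemes so a statement "falls" at the end.
    static List<ProsodicPhoneme> ApplyStatementContour(List<string> phonemes)
    {
        var result = new List<ProsodicPhoneme>();
        for (int i = 0; i < phonemes.Count; i++)
        {
            bool nearEnd = i >= phonemes.Count - 3;
            result.Add(new ProsodicPhoneme
            {
                Symbol = phonemes[i],
                PitchHz = nearEnd ? 90 : 110, // placeholder values
                DurationMs = 80,
                Volume = nearEnd ? 0.8 : 1.0
            });
        }
        return result;
    }

    static void Main()
    {
        var output = ApplyStatementContour(
            new List<string> { "h", "eh", "l", "oe", "w", "er", "l", "d" });
        foreach (var p in output)
            Console.WriteLine($"{p.Symbol}: {p.PitchHz} Hz, {p.DurationMs} ms, vol {p.Volume}");
    }
}
```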

    Play Audio

The speech synthesis is almost done by this point. All the text-to-speech engine has to do is convert the list of phonemes and their duration, pitch, and volume into digital audio.

Methods for generating the digital audio will vary, but many text-to-speech engines generate the audio by concatenating short recordings of phonemes. The recordings come from a real person. In a simplistic form, the engine receives the phoneme to speak, loads the digital audio from a database, does some pitch, time, and volume changes, and sends it out to the sound card.

It isn't quite that simple, for a number of reasons.

Most noticeable is that one recording of a phoneme won't have the same volume, pitch, and sound quality at its end as the next phoneme has at its beginning. This causes a noticeable glitch in the audio. An engine can reduce the glitch by blending the edges of the two segments together so that at their intersection they both have the same pitch and volume. Blending the sound quality, which is determined by the harmonics generated by the voice, is more difficult, and can be addressed by the next step.
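A rough sketch of the blending idea: crossfade the last few samples of one recording into the first few of the next, so the join has no sudden jump in amplitude. This only smooths volume; matching pitch and spectral content, as noted above, is harder and is not attempted here.

```csharp
using System;

class SegmentBlender
{
    // Concatenate two PCM segments (samples in -1..1), crossfading over 'overlap' samples.
    static float[] Concatenate(float[] a, float[] b, int overlap)
    {
        var result = new float[a.Length + b.Length - overlap];

        // Copy the first segment up to the overlap region.
        Array.Copy(a, result, a.Length - overlap);

        // Linear crossfade: fade segment 'a' out while fading segment 'b' in.
        for (int i = 0; i < overlap; i++)
        {
            float t = (float)i / overlap; // 0 -> 1 across the overlap
            result[a.Length - overlap + i] =
                a[a.Length - overlap + i] * (1 - t) + b[i] * t;
        }

        // Copy the remainder of the second segment.
        Array.Copy(b, overlap, result, a.Length, b.Length - overlap);
        return result;
    }

    static void Main()
    {
        var seg1 = new float[] { 0.2f, 0.4f, 0.6f, 0.8f };
        var seg2 = new float[] { 0.1f, 0.3f, 0.5f, 0.7f };
        foreach (var s in Concatenate(seg1, seg2, 2))
            Console.Write(s + " ");
    }
}
```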

The sound that a person makes when he/she speaks a phoneme changes depending upon the surrounding phonemes. If you record "cat" in a sound recorder and then reverse it, the reversed audio doesn't sound like "tak", which has the reversed phonemes of "cat". Rather than using one recording per phoneme (about 50), the text-to-speech engine maintains thousands of recordings (usually 1000-5000). Ideally it would have all possible phoneme context combinations recorded, 50 * 50 * 50 = 125,000, but this would be too many. Since many of these combinations sound similar, one recording is used to represent the phoneme within several different contexts.

Even a database of 1000 phoneme recordings is too large, so the digital audio is compressed to a much smaller size, usually between 8:1 and 32:1 compression. The more compressed the digital audio, the more muted the voice sounds.

Once the digital audio segments have been concatenated, they're sent off to the sound card, making the computer talk.

    Generating a Voice

    You might be wondering, "How do you get thousands of recordings of phonemes?"

The first step is to select a voice talent. The voice talent then spends several hours in a recording studio reading a wide variety of text. The text is designed so that as many phoneme sequence combinations as possible are recorded. You at least want them to read enough text so there are several occurrences of each of the 1000 to 5000 recording slots.

After the recording session is finished, the recordings are sent to a speech recognizer, which then determines where the phonemes begin and end. Since the tool also knows the surrounding phonemes, it's easy to pull out the right recordings from the speech. The only trick is to figure out which recording sounds best. Usually an algorithm makes a guess, but someone must listen to the phoneme recordings just to make sure they're good.

The selected phoneme recordings are compressed and stored away in the database. The result is a new voice.

    Conclusion

This was a high-level overview of how text-to-speech works. Most text-to-speech engines work in a similar manner, although not all of them work this way. The overview doesn't give you enough detail to write your own text-to-speech engine, but now you know the basics. If you want more detail, you should purchase one of the numerous technical books on text-to-speech.

3. Keyboard implementation survey

    Introduction:


First, we want to make deaf people's lives easier and let them deal with the computer in an efficient manner like other people, so after researching on the web we have found several ways:

Hand recognition based on a camera:
This approach depends on hand movements: it captures the user's hand movements and matches them with images stored in a database. Thus we can recognize words and the user can easily write what they want.

But we find this approach somewhat hard for them, for several reasons:

1-In my opinion, this approach is somewhat difficult, since there is a camera that has to capture the moving hands, and the camera's position has to be lined up with them.

2-The distance between the camera and their hands has to be just right.

3-The lighting of the room has to be suitable for capturing the hands.

4-If we apply this approach on the web, it may be even more difficult, as processing might be quite slow.

So for these reasons we thought of another way, which we think is easier and more efficient.

    Signs Keyboard:


This approach depends on using deaf signs as image buttons to form the keyboard. Users can write characters and numbers, and beyond that they can use this tool to compose sentences and whatever they want. Compared to the other approach, we don't use any database or camera, we don't depend on the room lighting, and they can use the mouse to write what they want.

    This is a snapshot of the keyboard while working on the web:

Simply, the deaf people can press the image buttons and what they press will be written in the textbox.
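Behind the scenes, the idea is simply that each image button is tagged with the character it stands for, and a click appends that character to the textbox. The sketch below is not the project's actual code; the class name, image file names, and mappings are invented to show the idea in C#.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class SignsKeyboard
{
    // Each on-screen image button maps a sign image to the character it represents.
    // The image file names here are invented for illustration.
    static readonly Dictionary<string, char> SignButtons = new Dictionary<string, char>
    {
        { "sign_a.png", 'a' },
        { "sign_b.png", 'b' },
        { "sign_space.png", ' ' }
    };

    // Stands in for the textbox: whatever the user has composed so far.
    static readonly StringBuilder TextBox = new StringBuilder();

    // Called when an image button is clicked; appends its character to the textbox.
    static void OnSignButtonClick(string imageName)
    {
        if (SignButtons.TryGetValue(imageName, out char c))
            TextBox.Append(c);
    }

    static void Main()
    {
        OnSignButtonClick("sign_a.png");
        OnSignButtonClick("sign_b.png");
        OnSignButtonClick("sign_space.png");
        OnSignButtonClick("sign_a.png");
        Console.WriteLine(TextBox.ToString()); // "ab a"
    }
}
```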

This is a snapshot of the keyboard while writing on the web:


    Technology for making this tool:

    1-C# programming language for programming

    2-HTML5, CSS3 and Photoshop for designing

    3-Visual Studio 2010 as editor

    4-Windows 7 operating system


    Advantages of Signs Keyboard method:

1-Very simple, as we have seen.

2-Doesn't depend on any databases.

3-Processing will be fast.

4-It is simply easy to use.

5-Compared to the other approach there are no constraints: we don't use any cameras, we don't depend on the room lighting, and users can use the mouse to write what they want.

6-It is even simpler than we think.

7-There are more characters than we might think, but we can overcome this with two buttons, as we mentioned.

    Best wishes