doc 1 mobility



Addressing Project 2012/2013


    Mansoura University

    Faculty of Computer and Information Science

    Dept. Information Technology

    Graduation Project 2012/2013

Supervisor:

    Dr. Mohammed Elmogy

Team:

    Abeer Ramadan Shams Eldein

    Fatma El-zahraa Alaa Ali

    Fayza Fawzy Abdo Elbrawy

Heba Allah Zain

    Mahmoud Mohamed Abdelsalam

    Mohamed Aziz Rashad

    Mohamed Shalby Elayouti

    Mostafa Abdelhamid Elasey

    Nada Mohamed Mohammed

    Nada Mohamed Sobhi Nada

Osama Hesham Hagras


Index

Chapter 1: Introduction

    Chapter 2: Speech-to-Text

Chapter 3: Text-to-Speech

Chapter 4: System Analysis and Design

    Chapter 5: Framework


Chapter 1: Introduction


Sometimes it becomes apparent that previous approaches to a problem haven't quite worked the way you anticipated. Perhaps you just need to clear away the smoky residue of the past, take a deep breath, and try again with a new attitude and fresh ideas. In golf, it's known as a mulligan; in schoolyard sports, it's called a do-over; and in the computer industry, we say it's a reboot.

So our problem is how to enable two groups of people to communicate: people who are not able to communicate with each other or with the computer world.

In the past, the traditional idea was to have a person acting as the middleware between them.

With fresh ideas, we can create a system that lets these people communicate with each other and with the computer world.

We are speaking about deaf and mute people and blind people. This is our problem, and our system will solve it easily.

But how? Let's think... let's go.

How can we make blind, deaf and mute people deal with the computer?


The blind person will speak what he wants; our system will take his voice and, using an NLP system, convert it into a statement and then into words. This process is called Speech-to-Text conversion.

On the other hand, we need another process in which, using the Poser program, we convert the statements or words into signs that are displayed to the deaf and mute people so they can understand what has been said.

On the other side, the deaf and mute people will use a sign keyboard system which lets them easily write what they want. Not only that: our system also supports a large database of familiar words and statements that they can use in daily life, represented as sign images; these images are displayed in a dictionary form that is easy to use.

After they write what they want, our system will convert the text to voice so that the blind person can hear it; this process is called Text-to-Speech conversion, and our system supports it as well.

Our goal for this system is to make life easier for deaf, mute and blind people and to use the newest technologies, such as web development and mobile development, efficiently.
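To make the intended flow concrete, here is a very rough, hypothetical sketch of the two directions of communication; the function names and bodies are placeholders for illustration only and are not the project's actual modules.

```python
# A very high-level, illustrative sketch of the two-way flow described above: speech
# from a blind user becomes text (which could then be shown as sign images to deaf
# users), and text typed by a deaf or mute user becomes speech for the blind user.
# Both converter functions are hypothetical placeholders, not the project's real modules.
def speech_to_text(audio: bytes) -> str:
    """Placeholder for the NLP-based Speech-to-Text module."""
    return "how are you today"            # illustrative recognized text

def text_to_speech(text: str) -> bytes:
    """Placeholder for the Text-to-Speech module."""
    return text.encode("utf-8")           # stands in for synthesized audio samples

def relay_spoken_message(audio: bytes) -> str:
    """Blind user speaks -> text that can be rendered as sign images for deaf users."""
    return speech_to_text(audio)

def relay_typed_message(text: str) -> bytes:
    """Deaf or mute user types (via the sign keyboard) -> audio the blind user hears."""
    return text_to_speech(text)

if __name__ == "__main__":
    print(relay_spoken_message(b"<recorded audio>"))
    print(relay_typed_message("good morning"))
```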


We can implement this system on web and mobile platforms, as they are the newest technologies all over the world, and also use them for the communication process between the two groups.

Why this project? We find that these two groups of people suffer from a serious problem in dealing with the computer; they cannot use the technology well and they have difficulty in dealing with others, so we thought of this system to solve this problem.


Chapter 2: Speech-to-Text


Contents

The need for real-time speech-to-text conversion
The challenges of speech-to-text conversion in real time
Methods of real-time speech-to-text conversion
Text adaptation
Presentation format
Perspectives


Abstract

Intralingual speech-to-text conversion is a useful tool for integrating people with hearing impairments in oral communication settings, e.g. counselling interviews or conferences. However, the transfer of speech into written language in real time requires special techniques, as it must be very fast and almost 100% correct to be understandable. The paper introduces and discusses different techniques for intralingual speech-to-text conversion.

The need for real-time speech-to-text conversion

Language is a very fast and effective way of communicating. To use language means to express an unlimited amount of ideas, thoughts and practical information by combining a limited amount of words with the help of a limited amount of grammatical rules. The result of language production processes is a series of words and structures. Series of words are produced, i.e. spoken or signed, in a very rapid and effective way. Any person can follow such language production processes and understand what the person wants to express if two preconditions are fulfilled; the recipients must:

1. know the words and grammatical rules the speaker uses, and

2. be able to receive and process the physical signal.

Most people use oral language for everyday communication, i.e. they speak to other people and hear what other people say. People who are deaf or hard of hearing do not have equal access to spoken language; for them, precondition 2 is not fulfilled: their ability to receive speech is impaired. If people who are severely impaired in their hearing abilities want to take part in oral communication, they need a way to compensate for their physical impairment. Hearing aids are sufficient for many hearing-impaired people. However, if hearing aids are insufficient, spoken language has to be transferred into a modality which is accessible without hearing, e.g. into the visual domain. There are two main methods to transfer auditory information into a visible format. The translation into sign language is one method, and it is best for people who use sign language as a preferred language, as


e.g. many Deaf people do. However, for people with a hearing disability who do not know sign language, sign language interpreting is not an option, as for many hard-of-hearing people and people who became hearing impaired later in their life, or elderly people with various degrees of hearing loss. They prefer their native oral language given in a visible modality. For them, a transfer of spoken words into written text is the method of choice; in other words, they need an intralingual speech-to-text conversion. Speech-to-text translation (audiovisual translation) of spoken language into written text is an upcoming field, since movies on DVDs are usually sold with subtitles in various languages. While the original language is given auditorily, subtitles provide a translated version in another language at the same time visually. The audiovisual transfer from the spoken original language into other languages which are presented in the subtitles can be called an interlingual audiovisual translation. Interlingual translation aims at transferring messages from one language into another language. This translation process combines classical interpreting with a transfer from spoken language patterns into written text patterns. Auditory events which are realized as noises or speech melodies would often not be transferred, because normally hearing people can interpret them by themselves. Interlingual translation primarily addresses the lack of knowledge of the original language, i.e. the first precondition for understanding language. The intralingual audiovisual transfer differs in many aspects from the interlingual audiovisual translation between two languages. First of all, intralingual audiovisual transfer for people with hearing impairments primarily addresses precondition 2, i.e. the physical ability to perceive the speech signals. The aim of an intralingual audiovisual transfer is to provide all auditory information which is important for the understanding of an event or action. Words as well as non-language sounds like noises, or hidden messages which are part of the intonation of the spoken words (e.g. irony or sarcasm), need to be transmitted into the visual (or haptic) channel. How this can best be achieved is a question of present and future research and development (cf. Neves, in this book). Moreover, people with hearing impairment may insist on a word-by-word transfer of spoken into written language because they do not want a third person to decide which parts of a message are important (and will therefore be transferred) and which parts are not. As a result, intralingual audiovisual transfer for people with hearing impairment might mean that every spoken word of a speech has to be written down and that all relevant auditory events from outside of the speech have to be described, too (interruptions,


noises). In the latter case, the intralingual audiovisual transfer would exclusively satisfy the physical ability to perceive the speech signal (precondition 2). The classical way to realize an intralingual speech-to-text transfer is to stenotype a protocol, or to record the event and transfer it into a readable text subsequently. This post-event transfer process is time-consuming and often difficult, since auditory events easily become ambiguous outside of the actual context. Moreover, the time shift involved in the transfer into a readable text means delayed access to the spoken words, i.e. it does not help people with hearing impairments in the actual communication situation. However, for counselling interviews, at the doctor's or at conferences, access to spoken information must be given in real time. For these purposes, the classical methods do not work.

The challenges of speech-to-text conversion in real time

Real-time speech-to-text conversion aims at transferring spoken language into written text (almost) simultaneously. This gives people with a hearing impairment access to the contents of spoken language in a way that they become able, for example, to take part in a conversation within the normal time frame of conversational turn taking. Another scenario for real-time speech-to-text transfer is a live broadcast of a football match where the spoken comments of the reporter are so rapidly transferred into subtitles that they still correspond to the scene the reporter comments on. An example from the hearing world would be a parliamentary debate which ends with the electronic delivery of the exact word protocol presented to the journalists immediately after the end of the debate (cf. Eugeni, forthcoming). This list could easily be continued. However, most people with a hearing disability do not receive real-time speech-to-text services at counselling interviews, at conferences or when watching a sports event live on TV. Most parliamentary protocols are tape recorded or stenotyped and subsequently transferred into readable text. What are the challenges of real-time speech-to-text conversion that make its use so rare?

- Time

A good secretary can type about 300 key strokes (letters) per minute. Since the average speaking rate is about 150 words per minute (with some variance between speakers and languages), even the professional typing rate is certainly not


high enough to transfer a stream of spoken words into a readable form in real time. As a consequence, the speed of typing has to be increased for a sufficient real-time speech-to-text transfer. Three different techniques will be discussed in the following section on methods.

- Message transfer

The main aim of speech-to-text transfer is to give people access to spoken words and auditory events almost simultaneously with the realization of the original sound event. However, for people with limited access to spoken language at a young age, a 1:1 transfer of spoken words into written text may sometimes not be very helpful. If children are not sufficiently exposed to spoken language, their oral language system may develop more slowly and less effectively compared with their peers. As a result, many people with an early hearing impairment are, as adults, less used to the grammatical rules applied in oral language and have a less elaborated mental lexicon compared with normally hearing people (Schlenker-Schulte, 1991; see also Perfetti et al. 2000 with respect to reading skills among deaf readers). If words are unknown or if sentences are too complex, the written form does not help their understanding. The consequence for intralingual speech-to-text conversion is that precondition 1, the language proficiency of the audience, also has to be addressed, i.e. the written transcript has to be adapted to the language abilities of the audience while the speech goes on. Speech-to-text service providers not only need to know their audience, they also have to understand how grammatical complexity can be reduced. They need to know techniques of how to make the language in itself more accessible while the information transferred is preserved. Aspects of how language can be made more accessible will be discussed in the following section on text adaptation.

- Real-time presentation of the written text

Reading usually means that words are already written down. Presented with a written text, people will read at their individual reading speed. This, however, is not possible in real-time speech-to-text conversion. Here, the text is written and read almost simultaneously, and the control of the reading speed shifts at least partly over to the speaker and the speech-to-text provider. The text is not fixed in advance; instead, new words are produced continuously and readers must follow


this word production process very closely if they want to use the real-time abilities of speech-to-text transfer. Because of this interaction of writing and reading, the presentation of the written text must be optimally adapted to the reading needs of the audience. This issue will be discussed at the end of the paper in the section on presentation format.

The challenges of real-time speech-to-text conversion can now be summarized as follows:

1. It must be fast enough in producing written language.

2. It must be possible to meet the expectations of the audience with respect to the characteristics of a written text: word-by-word transfer, enhanced by a description of auditory events from the surroundings, as well as adaptations of the original wording into easier forms of language, must be possible.

3. A successful real-time presentation must match the reading abilities of the audience, i.e. the written words must be presented in a way that is optimally recognizable and understandable for the readers.

Methods of real-time speech-to-text conversion

There are three methods that are feasible when realizing (almost) real-time speech-to-text transfer: speech recognition, computer-assisted note taking (CAN) and communication access (or computer-aided) real-time translation (CART). The methods differ

1. in their ability to generate exact real-time transcripts,

2. with respect to the conditions under which these methods can be properly applied, and


3. with respect to the amount of training which is needed to become a good speech-to-text service provider.

Speech recognition

Automatic speech recognition on its own is not yet an option for speech-to-text transfer, since phrase and sentence boundaries are not recognized. However, speech recognition can be used for real-time speech-to-text conversion if a person re-speaks the original words. Re-speaking is primarily necessary for including punctuation and speaker identification, but also for adapting the language to the language proficiency of the audience. Apart from intensive and permanent training of the speech recognition engine, no special training is required; a sound-shielded environment is useful. Linguistic knowledge, however, is necessary for the chunking of the words and for adaptations of the wording.
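As a rough illustration only, the following sketch shows how a re-speaking setup might drive a recognition engine; it assumes the third-party Python package SpeechRecognition (imported as speech_recognition), a microphone, and the Google Web Speech backend, none of which are prescribed by the text above.

```python
# A minimal sketch of re-speaking-based transcription, assuming the third-party
# "SpeechRecognition" package (import name: speech_recognition) and a working
# microphone. The recognizer backend (Google Web Speech API) is only illustrative.
import speech_recognition as sr

def respeak_chunk() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # a sound-shielded environment keeps this calibration step short
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # the re-speaker dictates punctuation and speaker names explicitly,
        # e.g. "speaker two colon I do not agree full stop"
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # nothing recognizable in this chunk

if __name__ == "__main__":
    print(respeak_chunk())
```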

Text adaptation

Spoken and written forms of language rely on different mechanisms to transfer messages. Speech, for instance, is less grammatical and less chunked than text. A real-time speech-to-text conversion, even if it is a word-for-word service, has to chunk the continuous stream of spoken words into sentences and phrases with respect to punctuation and paragraphs in order for the text to be comprehensible. A correction of grammatical slips might be necessary, too, for word-for-word conversions, and even more corrections may be necessary for an audience with less language proficiency. While intonation may alleviate incongruencies in spoken language, congruency errors easily cause misinterpretation in reading. The transfer from spoken into written language patterns is only one method of text adaptation. As discussed earlier, the speech-to-text provider might also be asked to adapt the written text to the language proficiency of the audience. Here, the challenge of word-for-word transfer shifts to the challenge of message transfer with a reduced set of language material. A less skilled audience might be overstrained especially by complex syntactical structures and low-frequency words and phrases. The speech-to-text provider therefore needs to know whether a word or phrase can be well understood or should better be exchanged with some more frequent


equivalents. S/he also has to know how to split long and complex sentences into simpler structures to make them easier to understand.
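The following minimal sketch illustrates two of the adaptations just described, splitting long sentences and flagging low-frequency words; the word list and the length threshold are invented for the example.

```python
# A minimal sketch of two adaptations discussed above: splitting long sentences into
# simpler chunks and flagging words that are not on a frequency list, so the provider
# can substitute a more common equivalent. The word list and the 12-word limit are
# illustrative assumptions, not values from the report.
import re

FREQUENT_WORDS = {"the", "a", "an", "is", "are", "we", "will", "meet", "at",
                  "nine", "tomorrow", "and", "then", "talk", "about", "plan"}
MAX_WORDS_PER_CHUNK = 12

def split_long_sentence(sentence: str) -> list[str]:
    """Break a sentence at commas and conjunctions when it exceeds the word limit."""
    if len(sentence.split()) <= MAX_WORDS_PER_CHUNK:
        return [sentence]
    parts = re.split(r",|;| and | but ", sentence)
    return [p.strip() for p in parts if p.strip()]

def flag_rare_words(sentence: str) -> list[str]:
    """Return words the audience may not know, as candidates for substitution."""
    words = re.findall(r"[a-zA-Z']+", sentence.lower())
    return [w for w in words if w not in FREQUENT_WORDS]

if __name__ == "__main__":
    s = "We will meet tomorrow at nine, and then we will deliberate about the itinerary"
    print(split_long_sentence(s))
    print(flag_rare_words(s))   # -> ['deliberate', 'itinerary']
```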

Presentation format

The last challenge of real-time speech-to-text transfer is the presentation of the text on the screen in a way that optimally supports reading. The need to think about the presentation format arises because the text on the screen is moving, which is a problem for the reading process. We usually read a fixed text, and our eyes are trained to move in saccades (rapid eye movements) on the basis of a kind of preview calculation with respect to the next words (cf. Sereno et al. 1998). But in real-time speech-to-text systems, the text appears consecutively on the screen and new text replaces older text when the screen is filled. A word-by-word presentation, as a consequence of word-for-word transcription, could result in less precise saccades, which subsequently decreases the reading speed. Reading might be less hampered by a line-by-line presentation, as it is used e.g. in C-Print (cf. the online presentation at http://www.rit.edu/~techsym/detail.html#T11C). However, for slower readers, line-by-line presentation might also be problematic, since the whole old text moves upwards whenever a new line is presented. As a consequence, the word which was actually fixated by the eyes moves out of the fovea and becomes unreadable. The eyes have to look for the word and restart reading it. The optimal presentation of real-time text for as many potential readers as possible is an issue which is worth further research, not only from the perspective of real-time transcription but also for subtitling purposes.
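A small sketch of the line-by-line presentation idea mentioned above: words are buffered and only released as complete lines, so the display does not move word by word. The line width is an arbitrary assumption.

```python
# Sketch of line-by-line presentation: incoming words are buffered and only flushed
# to the display as complete lines, so the reader's eyes are not chasing a word-by-word
# stream. The 40-character line width is an illustrative assumption.
LINE_WIDTH = 40

class LineBuffer:
    def __init__(self):
        self.current = []

    def add_word(self, word: str):
        """Add a word; return a finished line when the buffer is full, else None."""
        candidate = " ".join(self.current + [word])
        if len(candidate) > LINE_WIDTH:
            line = " ".join(self.current)
            self.current = [word]
            return line
        self.current.append(word)
        return None

if __name__ == "__main__":
    buf = LineBuffer()
    for w in "the control of the reading speed shifts over to the speaker".split():
        line = buf.add_word(w)
        if line:
            print(line)   # a display client would append this line to the screen
```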

Perspectives

Real-time speech-to-text transfer is already a powerful tool which provides people with a hearing impairment access to oral communication. However, elaborated dictionaries, as they are needed for efficient CAN or CART systems, have not yet been developed for many languages. Without those dictionaries, the systems cannot be used. Linguistic research has to find easy but efficient strategies for the real-time adaptation of the wording in order to make a message understandable also for an


audience with limited language proficiency. Finally, the optimal presentation of moving text to an audience with diverging reading abilities is a fascinating research field, not only for real-time speech-to-text services but also with respect to the presentation of movable text in general.

Speech analysis

The speech analysis block is itself composed of:

- A pre-processing module, which organizes the input speech into manageable lists of letters.

- A morphological analysis module, the task of which is to propose all possible part-of-speech categories for each letter taken individually.


Chapter 3: Text-to-Speech

General Overview

Text-to-speech is technology that converts digital text to audible speech; in other words, it allows a device to "talk" to the user through its speaker.

Converting text into voice output uses speech synthesis techniques. It was initially used by the blind to listen to written material, for example through MSAA (Microsoft Active Accessibility), a software interface that lets a Windows application be designed for the visually impaired: it allows each object (window, dialog box, etc.) in the user interface to identify itself so that a screen reader can be used. A screen reader is software for the visually impaired that reads the contents of a computer screen,


converting the text to speech. Screen readers are designed for specific operating systems and generally work with most applications.

Text-to-speech is now used extensively to convey financial data and e-mail messages. In a phone, text-to-speech might announce a caller's name when they call instead of playing a ringtone. It might also read a text message aloud, or speak the names of menu items as you scroll through them. This feature can be useful for hands-free use of a phone while driving, allowing the driver to keep eyes on the road. It is also very useful for the vision-impaired, and it is used in GPS units to announce street names when giving directions.

Early text-to-speech (TTS) systems had a very robotic sound; however, with the advent of high-speed chips and advanced software techniques, text-to-speech has become much more natural.

Text-to-speech (TTS) is a type of speech synthesis application that is used to create a spoken sound version of the text in a computer document, such as a help file or a Web page. TTS can enable the reading of computer display information for the visually challenged person, whether it was directly introduced into the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system, or it may simply be used to augment the reading of a text message. Current TTS applications include voice-enabled e-mail and spoken prompts in voice response systems. TTS is often used with voice recognition programs. There are numerous TTS products available, including Read Please 2000, Proverbe Speech Unit, and Next Up Technology's TextAloud. Lucent, Elan, and AT&T each have products called "Text-to-Speech". In addition to TTS software, a number of vendors offer products involving hardware, including the Quick Link Pen from WizCom Technologies, a pen-shaped device that can scan and read words; the Road Runner from Ostrich Software, a handheld device that reads ASCII text; and DecTalk TTS from Digital Equipment, an external hardware device that substitutes for a sound card and which includes an internal software device that works in conjunction with the PC's own sound card.
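As a minimal, hedged example of driving an installed speech engine from code (in the spirit of the screen-reader and phone scenarios above), the sketch below uses the third-party pyttsx3 package; the package choice and the settings are assumptions for illustration, not part of this project.

```python
# A minimal text-to-speech sketch. It assumes the third-party pyttsx3 package, which
# drives the platform's installed speech engine (SAPI5 on Windows, NSSpeechSynthesizer
# on macOS, eSpeak on Linux); it is offered as an illustration, not as the system
# described in this report.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)          # roughly the average speaking rate in words per minute
engine.say("You have one new message.")  # e.g. announcing a caller or reading a text aloud
engine.runAndWait()
```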

A brief history of speech synthesis (text to speech)

The history of speech synthesis:

Over the last few years there has been great development in the quality of the speech produced with text-to-speech. Many people think that synthetic speech, as it is also called, sounds like robots from older movies. The truth, though, is that some voices almost sound like recorded speech, and because of that we have seen very strong growth in the user groups for these services over the last years.

    When we invented the talking web in 2001 the target group was people withreading difficulties but now we see that the user group is much broader.What you maybe dont know is that the first synthetic speech was produced asearly as in the late 18th century. The machine was built in wood and leather andwas very complicated to use generating audible speech. It was constructedby Wolfgang von Kempelen and had great importance in the early studies ofPhonetics. The picture to down is the original construction as it can be seen at theDeutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) inMunich, Germany.

    (First there is a human that says a sentence and then the machine tries to say thesame. This was made by a re-construction of Kempelens machine.)In the early 20th century when it was possible to use electricity to create syntheticspeech, the first known electric speech synthesis was Voder and its creatorHomer Dudley showed it to a broader audience in 1939 on the world fair in NewYork.One of the pioneers of the development of speech synthesis in Sweden was GunnarFant. During the 1950s he was responsible for the development of the first Swedishspeech synthesis OVE (Orator VerbisElectris.) By that time it was only Walter

    Lawrences Parametric Artificial Talker (PAT) that could compete with OVE inspeech quality

    Speech synthesis becomes more human-like

    The greatest improvements when it comes to natural speech were during the last 10years. The first voices we used for ReadSpeaker back in 2001 were produced usingDiphone synthesis. The voices are sampled from real recorded speech and split into


phonemes, small units of human speech. This was the first example of concatenation synthesis. However, these voices still have an artificial, synthetic sound. We still use diphone voices for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones due to their limited resource consumption, both memory and CPU. It wasn't until the introduction of a technique called unit selection that voices became very natural sounding. This is still concatenation synthesis, but the units used are larger than phonemes, sometimes a complete sentence. We use different providers for different languages to always ensure we can offer the best voices available for each language.

How does a machine read?

At first sight, this task does not look too hard to perform. After all, is not the human being potentially able to correctly pronounce an unknown sentence, even from his childhood? We all have, mainly unconsciously, a deep knowledge of the reading rules of our mother tongue. They were transmitted to us, in a simplified form, at primary school, and we improved them year after year. However, it would be a bold claim indeed to say that it is only a short step before the computer is likely to equal the human being in that respect. Despite the present state of our knowledge and techniques and the progress recently accomplished in the fields of Signal Processing and Artificial Intelligence, we would have to express some reservations. As a matter of fact, the reading process draws from the furthest depths, often unthought of, of the human intelligence.

Figure 1 introduces the functional diagram of a very general TTS synthesizer. As for human reading, it comprises a Natural Language Processing module (NLP), capable of producing a phonetic transcription of the text read, together with the desired intonation and rhythm (often termed prosody), and a Digital Signal Processing module (DSP), which transforms the symbolic information it receives into speech. But the formalisms and algorithms applied often manage, thanks to a judicious use of the mathematical and linguistic knowledge of developers, to short-circuit certain processing steps. This is occasionally achieved at the expense of some restrictions on the text to pronounce.


    Figure 1: A simple but general functional diagram of a TTS system.

NLP COMPONENT

Figure 2 introduces the skeleton of a general NLP module for TTS purposes. One immediately notices that, in addition to the expected letter-to-sound and prosody generation blocks, it comprises a morpho-syntactic analyser, underlying the need for some syntactic processing in a high-quality Text-To-Speech system. Indeed, being able to reduce a given sentence into something like the sequence of its parts of speech, and to further describe it in the form of a syntax tree, which unveils its internal structure, is required for at least two reasons:

1) Accurate phonetic transcription can only be achieved provided the part-of-speech category of some words is available, as well as if the dependency relationship between successive words is known.

2) Natural prosody heavily relies on syntax. It also obviously has a lot to do with semantics and pragmatics, but since very little data is currently available on the generative aspects of this dependence, TTS systems merely concentrate on syntax. Yet few of them are actually provided with full disambiguation and structuration capabilities.
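A tiny example of reason 1: the heteronym "record" needs its part-of-speech category before it can be transcribed phonetically. The sketch assumes the NLTK package with its standard tokenizer and tagger data installed; it is only an illustration of the idea, not the analyser described above.

```python
# Illustration of why part-of-speech information matters for phonetic transcription,
# using NLTK's off-the-shelf tokenizer and tagger (assumes the nltk package and its
# 'punkt' and 'averaged_perceptron_tagger' data have been downloaded). The heteronym
# "record" is pronounced differently as a verb and as a noun.
import nltk

for sentence in ["Please record the lecture.", "She broke the world record."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))  # the tag for 'record' differs between the two readings
```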


Fig 2. The NLP module of a general Text-To-Speech conversion system.

Text analysis

The text analysis block is itself composed of:

A pre-processing module, which organizes the input sentences into manageable lists of words. It identifies numbers, abbreviations, acronyms and idiomatic expressions and transforms them into full text when needed. An important problem is encountered as soon as the character level: that of punctuation ambiguity (including the critical case of sentence end detection). It can be solved, to some extent, with elementary regular grammars.


A morphological analysis module, the task of which is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling. Inflected, derived, and compound words are decomposed into their elementary graphemic units.

A contextual analysis module, which considers words in their context; this allows it to reduce the list of their possible part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighbouring words. This can be achieved either with n-grams, which describe local syntactic dependences in the form of probabilistic finite state automata (i.e. as a Markov model), to a lesser extent with multi-layer perceptrons (i.e. neural networks) trained to uncover contextual rewrite rules, as in [Benello], or with local, non-stochastic grammars provided by expert linguists or automatically inferred from a training data set with classification and regression tree (CART) techniques.

Finally, a syntactic-prosodic parser, which examines the remaining search space and finds the text structure (i.e. its organization into clause- and phrase-like constituents) which more closely relates to its expected prosodic realization.
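To illustrate the pre-processing module described above, the following sketch expands a few abbreviations and spells out digits; the lookup tables are illustrative assumptions, and a real module would also resolve punctuation ambiguity.

```python
# A sketch of the pre-processing step: abbreviations and digits are expanded into full
# words before phonetization. The lookup tables are illustrative assumptions; a real
# module would verbalize whole numbers and handle acronyms and idiomatic expressions.
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # spell out digits one by one (a full system would read "221" as a number)
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Dr. Smith lives at 221 Baker St."))
# -> "Doctor Smith lives at two two one Baker Street"
```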


Database preparation

Figure 3. A general concatenation-based synthesizer. The upper left hatched block corresponds to the development of the synthesizer (i.e. it is processed once and for all). Other blocks correspond to run-time operations. Language-dependent operations and data are indicated by a flag.

Segments are then often given a parametric form, as a temporal sequence of vectors of parameters collected at the output of a speech analyzer and stored in a parametric segment database. The advantage of using a speech model originates in the following facts:

- Well-chosen speech models allow data size reduction, an advantage which is hardly negligible in the context of concatenation-based synthesis given the amount


of data to be stored. Consequently, the analyzer is often followed by a parametric speech coder.

- A number of models explicitly separate the contributions of, respectively, the source and the vocal tract, an operation which remains helpful for the pre-synthesis operations: prosody matching and segment concatenation.

Speech synthesis:

A sequence of segments is first deduced from the phonemic input of the synthesizer, in a block termed segment list generation in Fig. 3, which interfaces the NLP and DSP modules. Once prosodic events have been correctly assigned to individual segments, the prosody matching module queries the synthesis segment database for the actual parameters, adequately uncoded, of the elementary sounds to be used, and adapts them one by one to the required prosody. The segment concatenation block is then in charge of dynamically matching segments to one another by smoothing discontinuities. Here again, an adequate modelization of speech is highly profitable, provided simple interpolation schemes performed on its parameters approximately correspond to smooth acoustical transitions between sounds. The resulting stream of parameters is finally presented at the input of a synthesis block, the exact counterpart of the analysis one. Its task is to produce speech.
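The toy sketch below illustrates the concatenation-and-smoothing step with a simple linear cross-fade between two stored segments; the miniature "segment database" and the overlap length are invented for the example.

```python
# Toy illustration of segment concatenation with smoothing: each segment is fetched
# from a synthesis segment database and joined to its neighbour with a short linear
# cross-fade. The fake waveforms and the 4-sample overlap are illustrative assumptions.
OVERLAP = 4

def crossfade(a: list[float], b: list[float]) -> list[float]:
    """Join two segments, blending the last OVERLAP samples of a with the first of b."""
    head, tail = a[:-OVERLAP], a[-OVERLAP:]
    mixed = [
        t * (1 - i / OVERLAP) + s * (i / OVERLAP)
        for i, (t, s) in enumerate(zip(tail, b[:OVERLAP]))
    ]
    return head + mixed + b[OVERLAP:]

segment_db = {            # stand-in for the parametric segment database
    "h-e": [0.0, 0.2, 0.4, 0.6, 0.5, 0.4, 0.3, 0.2],
    "e-l": [0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1],
}

utterance = crossfade(segment_db["h-e"], segment_db["e-l"])
print(utterance)
```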

Segmental quality

The efficiency of concatenative synthesizers in producing high-quality speech is mainly subordinated to:

1. The type of segments chosen.
2. Segments should obviously exhibit some basic properties:
- They should allow accounting for as many co-articulatory effects as possible.
- Given the restricted smoothing capabilities of the concatenation block, they should be easily connectable.
- Their number and length should be kept as small as possible.


Examples of Text-to-Speech software:

1- Text Speaker
2- Alive Text to Speech
3- DSpeech Text-to-Speech software
4- Talking Clipboard Text-to-Speech software
5- Text-To-Voice by Caltrox Educational Software
6- NaturalReader, a text-to-speech software
7- ClaroRead, a professional Text-to-Speech software
8- Balabolka, a Text-To-Speech (TTS) program
9- TextSpeech Pro, a professional set of Text-to-Speech software
10- NeoSpeech Text-to-Speech software: NeoSpeech is privately held, headquartered in Santa Clara, California, and backed by the resources of Voiceware Co., Ltd. of Korea and HOYA of Japan.

Text-to-speech software reads text in any application and converts text to MP3, WAV, OGG or VOX files.


Chapter 4: System Analysis and Design


    Overview

    "My experience has shown that many people find it hard to make their design ideasprecise. They are willing to express their ideas in loose, general terms, but are

    unwilling to express them with the precision needed to make them into patterns.Above all, they are unwilling to express them as abstract spatial relations amongwell-defined spatial parts. I have also found that people aren't always very good atit; it is hard to do..... If you can't draw a diagram of it, it isn't a pattern. If you thinkyou have a pattern, you must be able to draw a diagram of it. This is a crude, butvital rule. A pattern defines a field of spatial relations, and it must always bepossible to draw a diagram for every pattern. In the diagram, each part will appearas a labeled or colored zone, and the layout of the parts expresses the relationwhich the pattern specifies. If you can't draw it, it isn't a pattern".

    Christopher Alexander (1979) in The Timeless Way of Building.

System Analysis of Speech-to-Text:

Here we are going to present the system analysis and design of the Speech-to-Text part, where system analysis is the key to starting the implementation and the way to go on in the project. Systems are created to solve problems. We need to see all sides of a problem to come up with an acceptable solution. Analysis involves studying the system and seeing how it interacts with the entities outside as well as inside the system. We then come out with detailed specifications of what the system will accomplish based on the user requirements.

Systems design will take the requirements and analysis into consideration and come out with a high-level and low-level design that will form the blueprint for the actual solution to the problem at hand.

In this dynamic world, analysis and design have to look into making systems that are flexible enough to accommodate changes, as changes are inevitable in any system. Systems study and analysis is very important; most projects fail because of shortcomings in this phase. The problem is that most customers have some hazy requirements, or end results they would like to achieve. This phase really defines, in technically implementable terms, what the requirements are. Systems analysis and design is necessary because it helps you first to identify the problem that


you're trying to find a solution for. The analysis part has a lot to do with "diagnosis." You want to make sure that you understand the problem in depth, and that you understand what the end users need. If you provide a solution that does not meet users' needs, your system is useless.

Now, to look at our system and its analysis and design, there are many forms and types of this analysis, but there is a most commonly used format that will be used here. We start with a use case diagram to establish the main actors and main processes in our part. Use cases are important because they are in a tracking format. Hence they make it easy to comprehend the functional requirements of the system and also make it easy to identify the various interactions between the users and the system within an environment. They are descriptive and hence clearly represent the value of an interaction between actors and the system. They clarify system requirements very categorically and systematically, making it easier to understand the system and its interactions with the users. During the analysis phase of the project's System Development Life Cycle, use cases help to understand the system's functionality.

See figure (1).

If we look at this figure, we see that this system has several actors that are responsible for the main operations of the system. Here we have three main actors: DEVELOPER, USER, and SYSTEM. Each is associated with its own part of the operation. The DEVELOPER has many main operations to do. As mentioned in the Speech-to-Text framework part, the developer has to save all the data used along the process in a database; this database may contain, for example, common words and letters, and also numbers and abbreviations to be used. Not only that, but the database may also contain the words and letters of the sounds to be used (speech to letters or words!). The developer also has to create a special database for the sound-to-letter dictionary associated with the previously saved letters and words. The USER is the second main actor in our system; the user is responsible for entering the speech to be written, as without speech there is no project! The SYSTEM is the third actor; there are many operations associated with the system, as it is the core of the project and our goal is to deal with the system. The system has to divide the speech received from the user into letters and, if a word is not in our common dictionary, to use the sound-to-letter dictionary. Once the statement is divided into letters, the system has to check whether the words


are common and present in the database, ready to be written out; if not, the system has to divide the words into letters, go to the sound-to-letter dictionary, and search for each sound letter to determine whether it is a default character or has a special pronunciation depending on the position of the character in the statement, and so on. The last operation is to write the letters and concatenate them together to produce the resulting statement, which also depends on determining the beginning and the end of the statement.

    Figure (1)


Now we step to another diagram to explain the role of the sequence diagram and its steps in our project, and how this part works. Sequence diagrams are the interaction diagrams which show all the operations carried out between the elements of your system in their time sequence. They are written in the Unified Modeling Language, which means they have a standard way to show how things will be. A sequence diagram covers all of the objects in your system; it shows their operations, the messages they send to each other, and their behavior under certain conditions.

Since sequence diagrams show the basic behavior scheme of every class in your project, drawing them is one of the most important activities while designing a system. It simplifies the process of coding the classes and lets us distribute functions to our classes so that the system operates properly. To start explaining the steps of the sequence diagram, we look at figure (2). This process depends greatly on the use case operations: first, the developer saves words, symbols and abbreviations in the database and also creates the sound-to-letter database and the letters of the sounds to be used; then the user enters the speech to be written. The system then starts to read the speech and divide it into letters, and it searches the letters for numbers and dates, which are read directly from their dictionary. In the next step, the system determines whether the word is common; if yes, the system reads it from the dictionary containing the sounds of commonly used words. If it is not a common word, the sound-to-letter rule is applied.


    Figure (2)
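One step of the sequence above, detecting numbers and dates so they are read from their own dictionary, can be sketched as follows; the patterns and categories are illustrative assumptions, not the project's actual rules.

```python
# Sketch of one step in the sequence above: after the speech is divided into tokens,
# numbers and dates are detected so they can be handled by their own dictionary
# rather than by the general sound-to-letter fallback. The patterns are illustrative.
import re

DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
NUMBER_PATTERN = re.compile(r"^\d+$")

def classify_tokens(text: str):
    for token in text.split():
        if DATE_PATTERN.match(token):
            yield token, "date"       # read from the date dictionary
        elif NUMBER_PATTERN.match(token):
            yield token, "number"     # read from the number dictionary
        else:
            yield token, "word"       # common-word lookup, then sound-to-letter fallback

print(list(classify_tokens("meeting on 29/7/2019 at 9")))
```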


System Analysis of Text-to-Speech:

Here we are going to present the system analysis and design of the Text-to-Speech part, where system analysis is the key to starting the implementation and the way to go on in the project. Systems are created to solve problems. We need to see all sides of a problem to come up with an acceptable solution. Analysis involves studying the system and seeing how it interacts with the entities outside as well as inside the system. We then come out with detailed specifications of what the system will accomplish based on the user requirements.

Systems design will take the requirements and analysis into consideration and come out with a high-level and low-level design that will form the blueprint for the actual solution to the problem at hand.

In this dynamic world, analysis and design have to look into making systems that are flexible enough to accommodate changes, as changes are inevitable in any system. Systems study and analysis is very important; most projects fail because of shortcomings in this phase. The problem is that most customers have some hazy requirements, or end results they would like to achieve. This phase really defines, in technically implementable terms, what the requirements are. Systems analysis and design is necessary because it helps you first to identify the problem that you're trying to find a solution for. The analysis part has a lot to do with "diagnosis." You want to make sure that you understand the problem in depth, and that you understand what the end users need. If you provide a solution that does not meet users' needs, your system is useless.

Now, to look at our system and its analysis and design, there are many forms and types of this analysis, but there is a most commonly used format that will be used here. We start with a use case diagram to establish the main actors and main processes in our part. Use cases are important because they are in a tracking format. Hence they make it easy to comprehend the functional requirements of the system and also make it easy to identify the various interactions between the users and the system within an environment. They are descriptive and hence clearly represent the value of an interaction between actors and the system. They clarify system requirements very categorically and systematically, making it easier to understand the system and its interactions with the users. During the analysis phase


of the project's System Development Life Cycle, use cases help to understand the system's functionality.

See figure (1).

If we look at this figure, we see that this system has several actors that are responsible for the main operations of the system. Here we have three main actors: DEVELOPER, USER, and SYSTEM. Each is associated with its own part of the operation. The DEVELOPER has many main operations to do. As mentioned in the Text-to-Speech framework part, the developer has to save all the data used along the process in a database; this database may contain, for example, common words and letters used in the speech process, and also numbers and abbreviations, and it may contain the sounds of words and letters to be used (word or letter to speech!). The developer also has to create a special database for the letter-to-sound dictionary associated with the previously saved sounds. The USER is the second main actor in our system; the user is responsible for entering the statement to be pronounced, as without a statement there is no project! The SYSTEM is the third actor; there are many operations associated with the system, as it is the core of the project and our goal is to deal with the system. The system has to divide the statement received from the user into words and, if a word is not in our common dictionary, to divide it into letters and use the letter-to-sound dictionary. Determining the type of the statement is also the responsibility of the system, as the semantics of the statement determine its type and therefore how it is correctly pronounced. After dividing the statement into words, the system has to search for any symbols and abbreviations in the statement; if found, it looks up their sounds in their dictionary so they can be pronounced directly. Once the statement is divided into words, the system has to check whether the words are common and present in the database, ready to be pronounced; if not, the system has to divide the words into letters, go to the letter-to-sound dictionary, and search for each letter's sound to determine whether it is a default character or has a special pronunciation depending on the position of the character in the statement, and so on. The last operation is the pronunciation of the words, concatenating the words or letters together to produce the resulting statement; the pronunciation also depends on determining the beginning and the end of the statement. Also, the tone of the pronunciation depends on the statement, starting with a high tone at the beginning and going low.


    Figure (1)

Now we step to another diagram to explain the role of the sequence diagram and its steps in our project, and how this part works. Sequence diagrams are the interaction diagrams which show all the operations carried out between the elements of your system in their time sequence. They are written in the Unified Modeling Language, which means they have a standard way to show how things will be. A sequence diagram covers all of the objects in your system; it shows their operations, the messages they send to each other, and their behavior under


certain conditions.

Since sequence diagrams show the basic behavior scheme of every class in your project, drawing them is one of the most important activities while designing a system. It simplifies the process of coding the classes and lets us distribute functions to our classes so that the system operates properly. To start explaining the steps of the sequence diagram, we look at figure (2). This process depends greatly on the use case operations: first, the developer saves words, symbols and abbreviations in the database and also creates the letter-to-sound database and the sounds of the letters to be used; then the user enters the statement to be pronounced. The system then starts to read the statement and divide it into words, searches the words for numbers and dates, which are read directly from their dictionary, and also determines the type of the statement depending on its semantics. In the next step, the system determines whether the word is common; if yes, the system reads it from the dictionary containing the sounds of commonly used words. If it is not a common word, applying the letter-to-sound rule is the next step: the system checks whether the letter has a special pronunciation depending on its position in the word; if the letter has no special case, it gets the default pronunciation of the letter from the default letter-to-sound table in its database and pronounces it. Then all the words or letters are concatenated and pronounced. The tone of the statement and its words depends on the type of the statement, and so does the resulting speech. This is the process in detail, and all the details are in the framework chapter.
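A hedged sketch of the lookup strategy described above: a common-word sound dictionary is tried first, and unknown words fall back to letter-to-sound rules that may depend on a letter's position. All dictionary contents and rules below are invented for illustration.

```python
# Sketch of the lookup strategy: try the common-word sound dictionary first, and only
# fall back to letter-to-sound rules for unknown words, where a letter's pronunciation
# may depend on its position in the word. All table contents are illustrative assumptions.
WORD_SOUNDS = {"hello": "heh-LOH", "world": "WURLD"}     # common-word dictionary
DEFAULT_LETTER_SOUNDS = {"a": "ah", "c": "k", "e": "eh", "t": "t", "y": "y"}

def letter_to_sound(word: str) -> list[str]:
    """Apply simple positional rules, falling back to the default letter sounds."""
    sounds = []
    for i, ch in enumerate(word):
        if ch == "c" and i + 1 < len(word) and word[i + 1] in "ei":
            sounds.append("s")                      # soft c, as in "cent"
        elif ch == "y" and i == len(word) - 1:
            sounds.append("ee")                     # word-final y, as in "city"
        else:
            sounds.append(DEFAULT_LETTER_SOUNDS.get(ch, ch))
    return sounds

def pronounce(word: str) -> str:
    word = word.lower()
    if word in WORD_SOUNDS:                         # common word: stored sound used directly
        return WORD_SOUNDS[word]
    return "-".join(letter_to_sound(word))          # unknown word: letter-to-sound fallback

print(pronounce("hello"))   # dictionary hit
print(pronounce("cyte"))    # made-up word handled by the fallback rules
```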

    Keyboard implementation Analysis


    Figure (2)

Figure (3)


Chapter 5: Framework


1. Speech-to-Text Recognition

    INTRODUCTION:

Real-time speech-to-text has been defined as the accurate transcription of words that make up spoken language into text momentarily after their utterance (Stuckless, 1994). This report will describe and discuss several applications of new computer-based technologies which enable deaf and hard of hearing students to read the text of the language being spoken by the instructor and fellow students, virtually in real time. In its various technological forms, real-time speech-to-text is a growing classroom option for these students.

This report is intended to complement several other reports in this series which focus on note taking (Hastings, Brecklein, Cermack, Reynolds, Rosen, & Wilson, 1997), assistive listening devices (Warick, Clark, Dancer, & Sinclair, 1997), and interpreting (Sanderson, Siple, & Lyons, 1999). It is notable that the Department of Justice has interpreted the Americans with Disabilities Act (P.L. 101-336) to include computer-aided transcription services under appropriate auxiliary aids and services (28 CFR 36.303). It should be emphasized at the outset that the real-time speech-to-text services described and discussed in this report are intended to complement, not replace, the options that are already available.

    DEVELOPMENT OF REAL-TIME SPEECH-TO-TEXT SYSTEMS:

Over the past 20 years, several developments have made it possible to use real-time speech-to-text transcription services as we know them today. These began with the development of smaller, more powerful computer systems, including their capability of converting stenotypic phonetic abbreviations electronically into understandable words. These parallel developments led to the earliest applications of steno-based systems both to the classroom and to real-time captioning in 1982. In the later 1980s, laptop computers became widely available. This enhanced portability led to the use of computers for note taking, in which the notetaker


used a standard keyboard in the regular classroom. It was at this time that stenotype machines were also linked to laptop computers, enhancing their portability. In the late 1980s, abbreviation software became available for regular keyboards (Stinson & Stuckless, 1998). Currently, both steno-based and standard keyboard approaches are being used with deaf and hard of hearing students in many mainstream secondary and postsecondary settings. Although the full extent of their usage nationwide remains to be documented, over the past 10 years there clearly has been an increased demand for speech-to-print transcription services in the classroom (Cuddihy, Fisher, Gordon, & Shumaker, 1994; Haydu & Patterson, 1990; James & Hammersley, 1993; McKee, Stinson, Everhart, & Henderson, 1995; Messerly & Youdelman, 1994; Moore, Bolesky, & Bervinchak, 1994; Smith & Rittenhouse, 1990; Stinson, Stuckless, Henderson, & Miller, 1988; Virvan, 1991).

    TWO CURRENT SPEECH-TO-TEXT OPTIONS:

Currently, two major options are available for providing real-time speech-to-text services to deaf and hard of hearing students. The first and second parts of this report will discuss these two options in order. But first, several general comments about the two systems should be made.

Steno-based systems. For these systems, a trained stenographer uses a 24-key machine to encode spoken English phonetically into a computer, where it is converted into English text and displayed on a computer screen or television monitor in real time. Generally, the text is produced verbatim. When used in schools, this system is often called CART (computer-aided real-time transcription), an apt acronym in view of the fact that stenotypists often transport their equipment from one classroom to another on wheels.

Computer-assisted note taking systems. For these systems, a typist with special training uses a standard keyboard to input words into a laptop/PC as they are being spoken. Sometimes these take the form of summary notes, sometimes almost verbatim text. These systems are often abbreviated as CAN (computer-assisted note taking).

Both types of systems provide a real-time text output that students can read on a computer or television screen in order to follow what is occurring in class. In addition, the text file can be examined by students, tutors, and instructors after class, either on the screen or as hard copy. These technologies offer receptive communication to deaf and hard of hearing students. However, they provide limited options for expressive communication on the part of these students, and service providers need to keep this in mind. We will begin by providing some basic nuts-and-bolts information that service providers need in order to


    implement a steno-based or computer assisted note taking (CAN) system. For eachof these systems, we address four major questions:

(1) How do these systems work?

(2) What major considerations need to be addressed with respect to their implementation as a support service in the classroom?

(3) Who is qualified to provide the service and what is his/her training?

(4) How can the system's effectiveness be evaluated, and what has been learned from evaluations to date?

In considering these systems, we will discuss aspects of particular speech-to-text systems with which we have had personal experience. Our focus on particular systems or associated college programs is not intended as an endorsement over other systems or college programs.

The third part of this report pertains to the use of speech-to-print services relative to other forms of support service, and the fourth part to the development of new speech-to-text systems, focusing on the status and potential of automatic speech recognition (ASR).

APPLICATIONS WITH DEAF AND HARD OF HEARING STUDENTS:

Steno-based systems provide a two-fold service that includes real-time speech-to-text transcription for deaf and hard of hearing students to read almost instantly in the classroom, and a written record of the class that they can use later for review. We will discuss these two applications in turn.

Real-time classroom implementation. Steno-based systems can be used to cover a variety of campus events, sometimes as real-time captioning where the text appears under the video image of a speaker. However, their primary application with deaf and hard of hearing students is in the classroom. Steno-based systems as used in the regular classroom provide a means for the deaf or hard of hearing student to replace listening with reading what the teacher and fellow students are discussing, in near real time. As indicated earlier, the steno typist sits near the front of the classroom, sometimes to the side, where he/she is in visual range of the teacher, students, the chalkboard, and other visual media that might be in use. Incidentally, the steno typist's equipment is silent and requires little space. So long as the text is legible to the deaf or hard of hearing student, it can be displayed in a number of ways. If the service is being provided for a single student, a second laptop can be used as a screen. However, if a number of deaf and/or hard of hearing students are using the service, a large TV or projection screen is in order.

From a classroom perspective, the presence of a steno-based system or a computer-assisted note taking system in the class is similar in some respects to having an interpreter there. More attention will be given to similarities and differences later in this report.

Hard copy text. Transcripts of lectures can be used as complete classroom notes, preserving the entire lecture and all students' comments for subsequent review by deaf and hard of hearing students taking the course. Typically, these transcripts are shared with these students and with the instructor. Some instructors welcome the transcripts as a way of tightening their lectures and reviewing their students' questions and comments. If the instructor chooses, he/she should be at liberty to share them with hearing members of the class also. The transcripts can also be of value in tutoring deaf and hard of hearing students, enabling tutors to organize tutoring sessions in close accord with course content. Also, interpreters sometimes use them to improve their signing of course-specific words and expressions.

Once the steno typist has completed the real-time transcription of a class for the deaf or hard of hearing student(s) enrolled in the course, he/she will edit the text. Depending on the particular class, a 50-minute class is likely to generate 25 to 30 pages of text. If the steno typist has a high accuracy rate in a given class, e.g., 98-99%, he/she may be able to correct errors and make the text more readable in one-half hour or less. Obviously, more errors (causes of which are discussed later under Accuracy) will require more editing time. Many students who use the text for review purposes prefer receiving an ASCII disk (edited or unedited) so they can organize their own format and decide for themselves what they want to retain or discard.

How Does Speech-to-Text Work?


A Speech-to-Text Reporter uses an electronic shorthand keyboard. They have undergone at least three years' intensive training in order to produce over 200 words per minute with an accuracy rate of around 98%. Several letter keys are pressed at once (a bit like piano chords), and each chord represents a syllable, a whole word, or sometimes a short phrase. The shorthand keyboard is connected to a laptop, where specialist software registers the chord strokes and finds a matching chord, or string of chords, which has an English counterpart. The software then displays the English counterpart on screen for someone to read. The text is displayed either on the screen of a laptop for a sole user, or projected onto a large screen or a series of plasma screens for a larger number of users.
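To make the chord-matching step concrete, here is a minimal sketch in C# (the project's implementation language). The chord strings and dictionary entries are invented purely for illustration and are not real steno theory; production steno software uses dictionaries with tens of thousands of entries and handles multi-stroke words.

```csharp
using System;
using System.Collections.Generic;

class ChordTranslator
{
    // Hypothetical chord-to-English dictionary; real steno dictionaries hold
    // tens of thousands of entries, including multi-stroke words and phrases.
    static readonly Dictionary<string, string> StenoDictionary = new Dictionary<string, string>
    {
        { "KAT", "cat" },          // invented strokes, purely for illustration
        { "STKPW", "is" },
        { "THANG/U", "thank you" }
    };

    // Look a stroke (or stroke string) up; show untranslated strokes in brackets
    // so the reporter can spot and correct them.
    static string Translate(string stroke) =>
        StenoDictionary.TryGetValue(stroke, out var english) ? english : "[" + stroke + "]";

    static void Main()
    {
        foreach (var stroke in new[] { "KAT", "STKPW", "XYZ" })
            Console.WriteLine(Translate(stroke));
    }
}
```

The important point is simply the lookup: each stroke (or string of strokes) is matched against a dictionary and its English counterpart is what appears on the screen.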

Why do some deaf people need to use Speech-to-Text and not others?

There are over seven million people in the UK with some degree of hearing loss, from mild to profound. The vast majority of these seven million cannot use British Sign Language but need access to communication in written English. The STTR profession was developed as a response to that need, and AVSTTR aims to continue to develop those skills through CPD and through sharing technological innovation amongst our members. Remote STT is a relatively new area and we are constantly looking for ways to improve the experience of the end user.

2. Text-to-Speech:-

    Overview

You might have already used text-to-speech in products, and maybe even incorporated it into your own application, but you still don't know how it works. This document will give you a technical overview of text-to-speech so you can understand how it works, and better understand some of the capabilities and limitations of the technology.

Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

1. Text normalization
2. Homograph disambiguation
3. Word pronunciation
4. Prosody
5. Concatenate wave segments

I'll cover each of these steps individually.
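As a rough sketch of how these five stages fit together, the pipeline can be pictured as a chain of functions. This is only an illustrative skeleton in C#, not a real engine; every method below is a stub and all of the names are invented.

```csharp
// Hypothetical skeleton of the five-stage pipeline; every stage below is a stub
// and all names are invented for illustration.
using System;
using System.Collections.Generic;

class TtsPipeline
{
    public byte[] Synthesize(string text)
    {
        List<string> words    = NormalizeText(text);           // 1. text normalization
        List<string> resolved = DisambiguateHomographs(words); // 2. homograph disambiguation
        List<string> phonemes = PronounceWords(resolved);      // 3. word pronunciation
        List<string> prosodic = ApplyProsody(phonemes);        // 4. prosody (pitch, duration, volume)
        return ConcatenateWaveSegments(prosodic);              // 5. concatenate wave segments
    }

    // Stubs standing in for the stages described in the sections that follow.
    List<string> NormalizeText(string text) => new List<string>(text.Split(' '));
    List<string> DisambiguateHomographs(List<string> words) => words;
    List<string> PronounceWords(List<string> words) => words;
    List<string> ApplyProsody(List<string> phonemes) => phonemes;
    byte[] ConcatenateWaveSegments(List<string> phonemes) => Array.Empty<byte>();

    static void Main()
    {
        byte[] audio = new TtsPipeline().Synthesize("John rode home.");
        Console.WriteLine($"Produced {audio.Length} bytes of audio (stubs only).");
    }
}
```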

    Text Normalization

    The "text normalization" component of text-to-speech converts any input text intoa series of spoken words. Trivially, text normalization converts a string like "Johnrode home." to a series of words, "john", "rode", "home", along with a markerindicating that a period occurred. However, this gets more complicated whenstrings like "John rode home at 23.5 mph", where "23.5 mph" is converted to"twenty three point five miles per hour". Heres how text normalization works:

First, text normalization isolates words in the text. For the most part this is as trivial as looking for a sequence of alphabetic characters, allowing for an occasional apostrophe and hyphen.

Text normalization then searches for numbers, times, dates, and other symbolic representations. These are analyzed and converted to words. (Example: "$54.32" is converted to "fifty four dollars and thirty two cents.") Someone needs to code up the rules for the conversion of these symbols into words, since they differ depending upon the language and context.
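As a toy illustration of one such rule, here is a sketch of the currency case mentioned above. The number-to-words helper is deliberately tiny (it only handles values below 100); a real normalizer covers the full numeric range plus dates, times, and so on.

```csharp
using System;

class CurrencyNormalizer
{
    // Very small number-to-words helper; a real normalizer covers the full range.
    static readonly string[] Ones =
        { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
          "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
          "seventeen", "eighteen", "nineteen" };
    static readonly string[] Tens =
        { "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" };

    static string NumberToWords(int n)
    {
        if (n < 20) return Ones[n];
        if (n < 100) return Tens[n / 10] + (n % 10 != 0 ? " " + Ones[n % 10] : "");
        return n.ToString(); // larger numbers omitted in this sketch
    }

    // "$54.32" -> "fifty four dollars and thirty two cents"
    static string NormalizeDollars(string token)
    {
        var parts = token.TrimStart('$').Split('.');
        int dollars = int.Parse(parts[0]);
        int cents = parts.Length > 1 ? int.Parse(parts[1]) : 0;
        return NumberToWords(dollars) + " dollars and " + NumberToWords(cents) + " cents";
    }

    static void Main() => Console.WriteLine(NormalizeDollars("$54.32"));
}
```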

Next, abbreviations are converted, such as "in." for "inches", and "St." for "street" or "saint". The normalizer will use a database of abbreviations and what they are expanded to. Some of the expansions depend upon the context of surrounding words, like "St. John" and "John St.".


The text normalizer might perform other text transformations, such as for internet addresses. "http://www.Microsoft.com" is usually spoken as "w w w dot Microsoft dot com".

Whatever remains is punctuation. The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. (Example: Periods at the end of sentences are not spoken, but a period in an Internet address is spoken as "dot.")

The rules will vary in complexity depending upon the engine. Some text normalizers are even designed to handle e-mail conventions like "You ***WILL*** go to the meeting. :-("

Once the text has been normalized and simplified into a series of words, it is passed on to the next module, homograph disambiguation.

    Homograph Disambiguation

The next stage of text-to-speech is called "homograph disambiguation." Often it's not a stage by itself, but is combined into the text normalization or pronunciation components. I've separated homograph disambiguation out since it doesn't fit cleanly into either.

In English and many other languages, there are hundreds of words that have the same text, but different pronunciations. A common example in English is "read," which can be pronounced "reed" or "red" depending upon its meaning. A "homograph" is a word with the same text as another word, but with a different pronunciation. The concept extends beyond just words, and into abbreviations and numbers. "Ft." has different pronunciations in "Ft. Wayne" and "100 ft.". Likewise, the digits "1997" might be spoken as "nineteen ninety seven" if the author is talking about the year, or "one thousand nine hundred and ninety seven" if the author is talking about the number of people at a concert.

Text-to-speech engines use a variety of techniques to disambiguate the pronunciations. The most robust is to try to figure out what the text is talking about and decide which meaning is most appropriate given the context. Once the right meaning is known, it's usually easy to guess the right pronunciation.


Text-to-speech engines figure out the meaning of the text, and more specifically of the sentence, by parsing the sentence and figuring out the part-of-speech for the individual words. This is done by guessing the part-of-speech based on the word endings, or by looking the word up in a lexicon. Sometimes a part of speech will be ambiguous until more context is known, such as for "read." Of course, disambiguation of the part-of-speech may require hand-written rules.
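A toy sketch of the idea for the "read" example: guess the tense from a nearby cue word and pick the pronunciation accordingly. The cue words and phoneme strings below are simplifications invented for illustration; real engines use full part-of-speech tagging.

```csharp
using System;

class HomographExample
{
    // Pick a pronunciation for "read" from a crude tense guess based on context.
    static string PronounceRead(string[] sentence, int index)
    {
        for (int i = 0; i < index; i++)
        {
            string w = sentence[i].ToLowerInvariant();
            if (w == "will" || w == "to" || w == "can")     // likely present/future -> "reed"
                return "r iy d";
            if (w == "had" || w == "was" || w == "already") // likely past -> "red"
                return "r eh d";
        }
        return "r iy d"; // default when no cue word is found
    }

    static void Main()
    {
        var s1 = new[] { "I", "will", "read", "the", "book" };
        var s2 = new[] { "I", "had", "read", "the", "book" };
        Console.WriteLine(PronounceRead(s1, 2)); // r iy d
        Console.WriteLine(PronounceRead(s2, 2)); // r eh d
    }
}
```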

Once the homographs have been disambiguated, the words are sent to the next stage to be pronounced.

    Word Pronunciation

The pronunciation module accepts the text, and outputs a sequence of phonemes, just like you see in a dictionary.

To get the pronunciation of a word, the text-to-speech engine first looks the word up in its own pronunciation lexicon. If the word is not in the lexicon, then the engine reverts to "letter to sound" rules.

Letter-to-sound rules guess the pronunciation of a word from the text. They're kind of the inverse of the spelling rules you were taught in school. There are a number of techniques for guessing the pronunciation, but the algorithm described here is one of the more easily implemented ones.

The letter-to-sound rules are "trained" on a lexicon of hand-entered pronunciations. The lexicon stores the word and its pronunciation, such as:

    hello h eh l oe

An algorithm is used to segment the word and figure out which letter "produces" which sound. You can clearly see that the "h" in "hello" produces the "h" phoneme, the "e" produces the "eh" phoneme, the first "l" produces the "l" phoneme, the second "l" nothing, and the "o" produces the "oe" phoneme. Of course, in other words the individual letters produce different phonemes. The "e" in "he" will produce the "ee" phoneme.

Once the words are segmented by phoneme, another algorithm determines which letter or sequence of letters is likely to produce which phonemes. The first pass figures out the most likely phoneme generated by each letter. "H" almost always generates the "h" sound, while "o" almost always generates the "ow" sound. A secondary list is generated, showing exceptions to the previous rule given the context of the surrounding letters. Hence, an exception rule might specify that an "o" occurring at the end of the word and preceded by an "l" produces an "oe" sound. The list of exceptions can be extended to include even more surrounding characters.

When the letter-to-sound rules are asked to produce the pronunciation of a word, they do the inverse of the training model. To pronounce "hello", the letter-to-sound rules first try to figure out the sound of the "h" phoneme. They look through the exception table for an "h" beginning the word and followed by "e"; since they can't find one, they use the default sound for "h", which is "h". Next, they look in the exceptions for how an "e" surrounded by "h" and "l" is pronounced, finding "eh". The rest of the characters are handled in the same way.
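Here is a minimal sketch of that lookup order: an exception table keyed on a letter and its neighbours, with a fall-back to a default phoneme per letter. The tables contain only the handful of entries needed to pronounce "hello"; a trained rule set would contain thousands.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class LetterToSound
{
    // Default phoneme per letter (only the letters needed for this example).
    static readonly Dictionary<char, string> Defaults = new Dictionary<char, string>
    {
        { 'h', "h" }, { 'e', "eh" }, { 'l', "l" }, { 'o', "ow" }
    };

    // Exceptions keyed on "previous|letter|next"; '#' marks a word boundary.
    static readonly Dictionary<string, string> Exceptions = new Dictionary<string, string>
    {
        { "l|l|o", "" },   // the second "l" in "hello" is silent
        { "l|o|#", "oe" }  // word-final "o" after "l" -> "oe"
    };

    static string Pronounce(string word)
    {
        var phonemes = new StringBuilder();
        for (int i = 0; i < word.Length; i++)
        {
            char prev = i > 0 ? word[i - 1] : '#';
            char next = i < word.Length - 1 ? word[i + 1] : '#';
            string key = $"{prev}|{word[i]}|{next}";
            // Check the exception table first, then fall back to the default sound.
            string p = Exceptions.TryGetValue(key, out var ex) ? ex : Defaults[word[i]];
            if (p.Length > 0) phonemes.Append(p).Append(' ');
        }
        return phonemes.ToString().Trim();
    }

    static void Main() => Console.WriteLine(Pronounce("hello")); // h eh l oe
}
```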

This technique can pronounce any word, even if it wasn't in the training set, and makes a very reasonable guess at the pronunciation, sometimes better than humans. It doesn't work too well for names, because most names are not of English origin and use different pronunciation rules. (Example: "Mejia" is pronounced as "meh-jee-uh" by anyone that doesn't know it is Spanish.) Some letter-to-sound rules first guess what language the word came from, and then use different sets of rules to pronounce each different language.

Word pronunciation is further complicated by people's laziness. People will change the pronunciation of a word based upon what words precede or follow it, just to make the word easier to speak. An obvious example is the way "the" can be pronounced as "thee" or "thuh." Other effects include the dropping or changing of phonemes. A commonly used phrase such as "What you doing?" sounds like "Wacha doin?"

Once the pronunciations have been generated, these are passed on to the prosody stage.

    Prosody

Prosody is the pitch, speed, and volume that syllables, words, phrases, and sentences are spoken with. Without prosody, text-to-speech sounds very robotic, and with bad prosody text-to-speech sounds like it's drunk.

The technique that engines use to synthesize prosody varies, but there are some general approaches.


First, the engine identifies the beginning and ending of sentences. In English, the pitch will tend to fall near the end of a statement, and rise for a question. Likewise, volume and speaking speed ramp up when the text-to-speech first starts talking, and fall off on the last word when it stops. Pauses are placed between sentences.

Engines also identify phrase boundaries, such as noun phrases and verb phrases. These will have similar characteristics to sentences, but will be less pronounced. The engine can determine the phrase boundaries by using the part-of-speech information generated during homograph disambiguation. Pauses are placed between phrases or where commas occur.

Algorithms then try to determine which words in the sentence are important to the meaning, and these are emphasized. Emphasized words are louder, longer, and will have more pitch variation. Words that are unimportant, such as those used to make the sentence grammatically correct, are de-emphasized. In a sentence such as "John and Bill walked to the store," the emphasis pattern might be "JOHN and BILL walked to the STORE." The more the text-to-speech engine "understands" what's being spoken, the better its emphasis will be.

Next, the prosody within a word is determined. Usually the pitch and volume rise on stressed syllables.

All of the pitch, timing, and volume information from the sentence level, phrase level, and word level is combined together to produce the final output. The output from the prosody module is just a list of phonemes with the pitch, duration, and volume for each phoneme.
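A sketch of the kind of output the prosody module hands on: each phoneme carries its own pitch, duration, and volume, here with a simple "final fall" applied at the end of a statement. All of the numbers are placeholders, not values from a real engine.

```csharp
using System;
using System.Collections.Generic;

// One entry in the prosody stage's output list.
struct ProsodicPhoneme
{
    public string Symbol;
    public double PitchHz;
    public int DurationMs;
    public double Volume; // 0.0 - 1.0
}

class ProsodyExample
{
    // Turn plain phonemes into prosodic ones, dropping the pitch and volume over
    // the last few phonemes so a statement "falls" at the end.
    static List<ProsodicPhoneme> ApplyStatementContour(List<string> phonemes)
    {
        var result = new List<ProsodicPhoneme>();
        for (int i = 0; i < phonemes.Count; i++)
        {
            bool nearEnd = i >= phonemes.Count - 3;
            result.Add(new ProsodicPhoneme
            {
                Symbol = phonemes[i],
                PitchHz = nearEnd ? 90 : 110, // placeholder values
                DurationMs = 80,
                Volume = nearEnd ? 0.8 : 1.0
            });
        }
        return result;
    }

    static void Main()
    {
        var output = ApplyStatementContour(
            new List<string> { "h", "eh", "l", "oe", "w", "er", "l", "d" });
        foreach (var p in output)
            Console.WriteLine($"{p.Symbol}: {p.PitchHz} Hz, {p.DurationMs} ms, vol {p.Volume}");
    }
}
```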

    Play Audio

The speech synthesis is almost done by this point. All the text-to-speech engine has to do is convert the list of phonemes and their duration, pitch, and volume into digital audio.

Methods for generating the digital audio will vary, but many text-to-speech engines generate the audio by concatenating short recordings of phonemes. The recordings come from a real person. In a simplistic form, the engine receives the phoneme to speak, loads the digital audio from a database, does some pitch, time, and volume changes, and sends it out to the sound card.

It isn't quite that simple, for a number of reasons.

Most noticeable is that one recording of a phoneme won't have the same volume, pitch, and sound quality at its end as the next phoneme has at its beginning. This causes a noticeable glitch in the audio. An engine can reduce the glitch by blending the edges of the two segments together so that at their intersection they both have the same pitch and volume. Blending the sound quality, which is determined by the harmonics generated by the voice, is more difficult, and can be addressed by the next step.
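A rough sketch of the blending idea: crossfade the last few samples of one recording into the first few of the next, so the join has no sudden jump in amplitude. This only smooths volume; matching pitch and spectral content, as noted above, is harder and is not attempted here.

```csharp
using System;

class SegmentBlender
{
    // Concatenate two PCM segments (samples in -1..1), crossfading over 'overlap' samples.
    static float[] Concatenate(float[] a, float[] b, int overlap)
    {
        var result = new float[a.Length + b.Length - overlap];

        // Copy the first segment up to the overlap region.
        Array.Copy(a, result, a.Length - overlap);

        // Linear crossfade: fade segment 'a' out while fading segment 'b' in.
        for (int i = 0; i < overlap; i++)
        {
            float t = (float)i / overlap; // 0 -> 1 across the overlap
            result[a.Length - overlap + i] =
                a[a.Length - overlap + i] * (1 - t) + b[i] * t;
        }

        // Copy the remainder of the second segment.
        Array.Copy(b, overlap, result, a.Length, b.Length - overlap);
        return result;
    }

    static void Main()
    {
        var seg1 = new float[] { 0.2f, 0.4f, 0.6f, 0.8f };
        var seg2 = new float[] { 0.1f, 0.3f, 0.5f, 0.7f };
        foreach (var s in Concatenate(seg1, seg2, 2))
            Console.Write(s + " ");
    }
}
```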

The sound that a person makes when he/she speaks a phoneme changes depending upon the surrounding phonemes. If you record "cat" in a sound recorder and then reverse it, the reversed audio doesn't sound like "tak", which has the reversed phonemes of "cat". Rather than using one recording per phoneme (about 50), the text-to-speech engine maintains thousands of recordings (usually 1000-5000). Ideally it would have all possible phoneme context combinations recorded, 50 * 50 * 50 = 125,000, but this would be too many. Since many of these combinations sound similar, one recording is used to represent the phoneme within several different contexts.

Even a database of 1000 phoneme recordings is too large, so the digital audio is compressed to a much smaller size, usually between 8:1 and 32:1 compression. The more compressed the digital audio, the more muted the voice sounds.

Once the digital audio segments have been concatenated, they're sent off to the sound card, making the computer talk.

    Generating a Voice

    You might be wondering, "How do you get thousands of recordings of phonemes?"

The first step is to select a voice talent. The voice talent then spends several hours in a recording studio reading a wide variety of text. The text is designed so that as many phoneme sequence combinations as possible are recorded. You at least want them to read enough text so there are several occurrences of each of the 1000 to 5000 recording slots.

After the recording session is finished, the recordings are sent to a speech recognizer, which then determines where the phonemes begin and end. Since the tool also knows the surrounding phonemes, it's easy to pull out the right recordings from the speech. The only trick is to figure out which recording sounds best. Usually an algorithm makes a guess, but someone must listen to the phoneme recordings just to make sure they're good.

The selected phoneme recordings are compressed and stored away in the database. The result is a new voice.

    Conclusion

This was a high-level overview of how text-to-speech works. Most text-to-speech engines work in a similar manner, although not all of them work this way. The overview doesn't give you enough detail to write your own text-to-speech engine, but now you know the basics. If you want more detail, you should purchase one of the numerous technical books on text-to-speech.

3. Keyboard implementation survey

    Introduction:


First, we want to make deaf people's lives easier and let them deal with the computer in an efficient manner like other people, so after researching on the web we have found several ways:

Hand recognition based on a camera:
This approach depends on hand movements: it captures the user's hand movements and matches them with images stored in a database. Thus we can recognize words and the user can easily write what they want.

But we find this approach somewhat hard for them, for several reasons:

1-In my opinion, this approach is somewhat difficult, since there is a camera that has to capture the moving hands, and the camera's position has to be lined up with them.

2-The distance between the camera and their hands has to be just right.

3-The lighting of the room has to be suitable for capturing the hands.

4-If we apply this approach on the web, it may be even more difficult, as processing might be quite slow.

So for these reasons we thought of another way, which we think is easier and more efficient.

    Signs Keyboard:


This approach depends on using deaf signs as image buttons to form the keyboard. Users can write characters and numbers, and beyond that they can use this tool to compose sentences and whatever they want. Compared to the other approach, we don't use any database or camera, we don't depend on the room lighting, and they can use the mouse to write what they want.

    This is a snapshot of the keyboard while working on the web:

Simply, the deaf people can press the image buttons and what they press will be written in the textbox.
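Behind the scenes, the idea is simply that each image button is tagged with the character it stands for, and a click appends that character to the textbox. The sketch below is not the project's actual code; the class name, image file names, and mappings are invented to show the idea in C#.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class SignsKeyboard
{
    // Each on-screen image button maps a sign image to the character it represents.
    // The image file names here are invented for illustration.
    static readonly Dictionary<string, char> SignButtons = new Dictionary<string, char>
    {
        { "sign_a.png", 'a' },
        { "sign_b.png", 'b' },
        { "sign_space.png", ' ' }
    };

    // Stands in for the textbox: whatever the user has composed so far.
    static readonly StringBuilder TextBox = new StringBuilder();

    // Called when an image button is clicked; appends its character to the textbox.
    static void OnSignButtonClick(string imageName)
    {
        if (SignButtons.TryGetValue(imageName, out char c))
            TextBox.Append(c);
    }

    static void Main()
    {
        OnSignButtonClick("sign_a.png");
        OnSignButtonClick("sign_b.png");
        OnSignButtonClick("sign_space.png");
        OnSignButtonClick("sign_a.png");
        Console.WriteLine(TextBox.ToString()); // "ab a"
    }
}
```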

This is a snapshot of the keyboard while writing on the web:


    Technology for making this tool:

    1-C# programming language for programming

    2-HTML5, CSS3 and Photoshop for designing

    3-Visual Studio 2010 as editor

    4-Windows 7 operating system


    Advantages of Signs Keyboard method:

1-Very simple, as we have seen.

2-Doesn't depend on any databases.

3-Processing will be fast.

4-It is simply easy to use.

5-Compared to the other approach there are no constraints: we don't use any cameras, we don't depend on the room lighting, and users can use the mouse to write what they want.

6-It is even simpler than we think.

7-There are more characters than we might think, but we can overcome this with two buttons, as we mentioned.

    Best wishes