VoIP Text
A study into the effectiveness of speech-to-text as an assistive tool
in VoIP communications
Phase 2 Report Prepared by B. Shirley and I.Rattigan
Contents

Executive Summary of Outcomes
Introduction
Objective Evaluation of ASR Software
  Subject Selection
  Analysis
  Results
Qualitative Evaluation of VoIP Client
  Discussion with Hard of Hearing Users
  Questionnaire Results
  Discussion with Deaf Users and Representative Organisations
Conclusions
Appendix A: Calculation Method for Data Analysis
  1. Word Error Rate
  2. Word Recognition Rate
  3. Words Correct Rate
Appendix B: Objective Results
  B.1. Summary of Word Error Rate Results
  B.2. Subject 1 Results
  B.3. Subject 2 Results
  B.4. Subject 3 Results
  B.5. Subject 4 Results
  B.6. Subject 5 Results
  B.7. Subject 6 Results
Appendix C: Questionnaire Weighted Scoring
Appendix D: Subject Interviews
  D.1. Group Discussion
  D.2. Individual Discussion with Subject 6
  D.3. Individual Discussion with Subjects 7 & 8
Executive Summary of Outcomes

Full conclusions are presented later in the report; this section gives an overview of the key outcomes from this phase of the project.
• Windows Speech Recogniser 8.0 appears to have now reached parity of performance with leading-edge commercial ASR packages; market-leading speech recognition is now available to Windows users for free.
• A great deal of interest was shown in the software: seven of the eight focus-group participants who used it stated that they would use it for some telephone calls, and five of the eight stated that they would like to use it for all telephone calls.
• There was high demand for the software to be used for business calls (e.g. to bank managers).
• Although the word error rate for some people exceeded the 10% threshold indicated as acceptable by the draft of ETSI ES 202 975 V0.0.6r3, there were clear indications that longer usage would bring rates considerably below this. Error rates of less than 5% have been reliably demonstrated.
• A dialogue took place with the RNID, during which support was expressed for field trials of the software as being more indicative of true performance than lab-based focus groups.
Introduction

The spread of Voice over Internet Protocol (VoIP) services, equipment and clients is transforming telephony worldwide. In addition to providing very inexpensive, or even free, international telephone calls, there is potential additional benefit in using computer networks to facilitate telephony. Currently, hearing impaired and deaf users are excluded from VoIP services; unless the message is in text form to begin with, the hearing impaired user cannot access these services effectively. A current draft ETSI standard (ETSI ES 202 975 V0.0.6r3 (2008-09)) outlines suggested acceptable performance measures for such a system, but as yet these have not been verified.

This project looks at the feasibility of using speech-to-text software to generate text from natural speech over VoIP and so improve accessibility. The project assesses the accuracy of current state-of-the-art Automatic Speech Recognition (ASR) software, investigates the rationale behind the acceptable performance measure suggested by the draft ETSI document, and assesses whether this is a reasonable level of performance.

During phase 1 of this work, the University of Salford developed a software VoIP voice chat client, based on open source libraries, intended to assess the potential for accessible VoIP. The client was implemented using the Google-Talk open source libraries and was extensively debugged, providing a robust test bed for formal assessment of whether
the current state of the art in Automatic Speech Recognition (ASR) is sufficient to provide benefits for hearing impaired and deaf users. The scenario considered most useful, and the one least catered for currently, is informal contact with friends and family: calls that may be considered too personal for services involving a third-party transcriber and too trivial to justify the expense of such services. As part of the development, an informal assessment was carried out which yielded results indicating that the software may indeed be useful in this context.

Phase 2 of this work, documented here, builds on phase 1, during which software was developed to integrate speech-to-text translation into a voice chat client. This second phase compares the performance of two 'best in class' ASR engines with reference to the draft ETSI document ETSI ES 202 975 V0.0.6r3 (2008-09). The ETSI draft recommends that a word error rate of 10% should be considered acceptable for an application such as that developed here, and both Dragon Naturally Speaking 10 and Windows Speech Recogniser 8.0 (bundled with Windows 7) are assessed against this recommendation. In addition, user assessments have been carried out using semi-structured interviews and questionnaires. These sessions were designed to assess both the software's usability and how well the measured performance of the speech recognition software correlates with practical use of speech recognition in the context of IP telephony. To summarise, this phase comprises:
• a formal evaluation of Automatic Speech Recognition performance, objectively testing the word recognition rate (WRR) of the software with six subjects and examining the tolerance of error;
• a qualitative assessment in which the voice chat client was demonstrated to, and used by, members of the deaf and hard of hearing community, followed by discussion of the software's usability and performance;
• consultation with deaf and hard of hearing groups including TAG, the NADP and the RNID.
Objective Evaluation of ASR Software

Following on from the preliminary evaluation of the speech recognition engines performed in phase 1 of the project, a more thorough evaluation has been carried out to determine which software package is more suitable for the qualitative evaluation of the Salford/Ofcom VoIP client that has been developed. The speech recognition engines tested were Windows Speech Recogniser 8.0, a standard feature of Windows 7 (the Windows 7 RC1 build was used for testing), and Dragon Naturally Speaking 10.0. As with the preliminary analysis performed in phase 1, two genres of conversation were used:

Example 1: Directions

“To get to the train station, take your first right, then walk down this road for about 5 minutes. You'll see a large red building on your left and a police station to your right, head right past the police station, and then continue up this road for about 6 minutes. Stop when you come to a coffee shop on your left hand side. You are close to the station now, turn right, then take your second right and continue straight on, you'll be able to see the train station from here, just keep walking straight ahead.”

Example 2: Phone Conversation

“Hello Auntie how are you today? Sebastian is now five years old He starts school this year I hope dinner was nice We had chicken and rice, it was very nice Although the chicken was a little chewy it was still a very nice meal I think you can get a cheap flight on August 15th It will only cost you 300 pounds which I think is very cheap Hopefully the weather won't be too bad if you decide to come it's been raining all winter Last week it rained every day although, saying that, it's actually quite nice outside now Still, it's not as hot as where you live! Maybe we should come over there to visit instead It would be nice to see Gary and Steve again It was nice to talk to you again Hopefully we will see you soon Give my love to Hannah Goodbye!”
Subject Selection

Six subjects between the ages of 25 and 55 years used both ASR packages and their accuracy results were recorded. Of the six subjects selected, three were male and three were female, and only one of the subjects had used speech recognition frequently prior to the tests. No attempt was made to hand-pick subjects for particularly good diction, and as a result participants had a variety of regional accents. Each subject trained the software to an advanced level in order to gain better performance. The Dragon Naturally Speaking training required microphone configuration and optimization,
then approximately 20 minutes of training. This training consisted of reading a passage from a published book stored in its database. The training for Windows Speech Recogniser 8.0 also required microphone configuration and optimization, followed by approximately 10 minutes of training by reading a series of fragmented sentences displayed in sequence on the screen.
Analysis

Three methods were used to assess the performance of the speech recognition engines tested in this phase: Word Error Rate (WER), Word Recognition Rate (WRR) and Words Correct Rate (WCR) [1]. Word Error Rate is calculated from defined criteria that count errors in translation: errors are counted wherever a word is substituted for an incorrect word, wherever a word is missed out, and wherever an additional word is inserted. Word Recognition Rate and Words Correct Rate both present the opposite case, namely the number of words that have been recognised rather than the number of errors made. The main difference between them is that the Words Correct Rate does not consider insertion errors (where additional words are added), whereas the Word Recognition Rate deducts from the score where additional words are inserted. The calculation method for each of these measures is presented in Appendix A. For the purposes of this report, WER and WRR are shown in the results; detailed data for individual subjects are given in Appendix B.

Results

Results are presented here as WRR scores for each subject taking part in the quantitative assessments. The mean results are included, together with the p-value, which indicates the statistical significance of the result: a p-value of less than 0.05 indicates statistical significance at the 95% confidence level.

Example 1: Directions
Subject                          1     2     3     4     5     6
Dragon Naturally Speaking 10     96%   79%   79%   80%   81%   80%
Windows Speech Recogniser 8.0    95%   90%   85%   81%   88%   86%
This table indicates that the Windows Speech Recogniser ASR engine performed better in this assessment than Dragon Naturally Speaking; this result is statistically significant at the 95% confidence level.
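The three metrics defined in the Analysis section can be illustrated in code. The sketch below is an illustration, not the tool used in the study (Appendix A gives the formal definitions): it finds a minimum-edit-distance word alignment and derives WER, WRR and WCR from the substitution (S), deletion (D) and insertion (I) counts.

```python
def align_counts(ref, hyp):
    """Minimum-edit-distance alignment of two word lists.
    Returns (substitutions, deletions, insertions)."""
    # dp[i][j] holds (cost, S, D, I) for ref[:i] vs hyp[:j]
    R, H = len(ref), len(hyp)
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                cand = [(c, s, d, ins)]               # match, no cost
            else:
                cand = [(c + 1, s + 1, d, ins)]       # substitution
            c, s, d, ins = dp[i - 1][j]
            cand.append((c + 1, s, d + 1, ins))       # deletion
            c, s, d, ins = dp[i][j - 1]
            cand.append((c + 1, s, d, ins + 1))       # insertion
            dp[i][j] = min(cand)                      # lowest-cost option
    _, S, D, I = dp[R][H]
    return S, D, I

def asr_metrics(reference, hypothesis):
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    S, D, I = align_counts(ref, hyp)
    N = len(ref)
    wer = (S + D + I) / N          # Word Error Rate
    wrr = (N - S - D - I) / N      # Word Recognition Rate (penalises insertions)
    wcr = (N - S - D) / N          # Words Correct Rate (ignores insertions)
    return wer, wrr, wcr

# Hypothetical example: one substitution, one insertion, seven reference words
wer, wrr, wcr = asr_metrics(
    "take your first right then walk down",
    "take your first light then now walk down")
```

For this hypothetical example WER = 2/7, WCR = 6/7 and WRR = 5/7, showing how WRR deducts for the inserted word while WCR ignores it.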
Example 2: Phone Conversation

Subject                          1     2     3     4     5     6
Dragon Naturally Speaking 10     99%   80%   74%   84%   77%   88%
Windows Speech Recogniser 8.0    92%   89%   86%   74%   89%   75%
In this assessment the mean indicates that Windows Speech Recogniser 8.0 performed slightly better than Dragon Naturally Speaking, but the result is not statistically significant.

Overall Results
Subject                          1      2      3      4      5      6
Dragon Naturally Speaking 10     98.0%  79.7%  76.4%  82.5%  78.9%  85.0%
Windows Speech Recogniser 8.0    93.1%  89.4%  85.4%  76.8%  88.6%  79.3%
Over both assessments (directions and informal conversation) no statistical significance was shown. It is possible that further significance would have been found had more subjects been assessed; however, it is equally possible that the different speech engines simply performed better for different voices. It is clear, however, that in the directions example the Windows Speech Recogniser 8.0 ASR engine performed significantly better. This example is considered more demanding than the conversational example because the text has less sentence-like structure and could therefore be more difficult for an ASR engine that makes greater use of contextual information.
Figure 1: Mean Word Error Rate (WER) and Word Recognition Rate (WRR), shown with standard deviation.
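As an illustration of the significance testing referred to above, the paired t statistic for the Directions results can be recomputed directly from the table. This sketch assumes a two-tailed paired t-test (consistent with the t-Test mentioned in the Conclusions); it is not the original analysis script.

```python
import math

# WRR scores (%) for the six subjects on the Directions example
dragon  = [96, 79, 79, 80, 81, 80]
windows = [95, 90, 85, 81, 88, 86]

diffs = [w - d for w, d in zip(windows, dragon)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((x - mean) ** 2 for x in diffs) / (n - 1)   # sample variance
t = mean / math.sqrt(var / n)                          # paired t statistic

# Two-tailed critical value for df = 5 at the 95% level
T_CRIT = 2.571
significant = abs(t) > T_CRIT
```

With t ≈ 2.82 exceeding the 5% critical value of 2.571 for five degrees of freedom, the difference on the Directions material is significant, matching the report's claim; applying the same test to the phone-conversation row does not reach significance.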
Qualitative Evaluation of VoIP Client

During the sessions to evaluate the software, the talker (whose speech the ASR engine was required to recognise) was the Research Assistant on the project. He is experienced in using speech recognition software and had trained the software fully, ensuring that the evaluations were set in the context of a long-time user of the software.

In order to gather feedback about the potential use of the software, three methods of obtaining information were used. The first involved demonstrating the software to a number of local deaf and hard of hearing organisations and obtaining their views and opinions through discussion; participants were from the Salford NHS Sensory Centre and the Manchester Deaf Centre. The second involved deaf and hard of hearing volunteers using the software in a scripted conversation, followed by both individual and group discussions of their views and opinions; participants in this part of the testing were primarily members of Cicada, a group representing cochlear implantees. The third method was a questionnaire given to the volunteers who had used the software in the second method, to obtain structured information. Views were also sought from TAG, the National Association of Deafened People and the RNID.

In order to gather useful information, the speech recognition software was trained using the recommended Windows training package. To improve performance further, passages of text were also dictated into word processing software, where corrections can be made to erroneous translations; these corrections are then stored by the recognition software to improve recognition accuracy. The aim of this additional training was to replicate a user who utilises the software on a frequent basis.
Discussion with Hard of Hearing Users

A demonstration of the Salford/Ofcom VoIP client was given to six cochlear implantees, one deaf person, one hard of hearing person and one normal hearing person. Due to participant schedules, one group demonstration and three individual demonstrations were given. The hearing impaired participants were instructed to log in to the software using profiles that had been set up before the tests and then perform a number of tasks, giving them a chance to navigate around the software. The tasks included:
• adding a contact to their contact list;
• changing the display and colour scheme of the window;
• changing the size of the font;
• making a call to the added contact;
• having a scripted conversation with the added contact.
The demonstration was followed by a discussion of the software and its usability. Results from the discussion have been filtered into three main categories: usability, speech recognition performance and possible improvements. Comments and notes from all three discussions are
detailed in Appendix D. A wide variety of views and opinions were gathered throughout the discussions and these are detailed below.
Usability

All of the participants involved in the demonstration stated that they found the software easy to use, with the exception of one user who required some assistance. A number of participants commented on the simplicity of the display, with one participant stating that this was particularly useful for users with poor computer literacy.

When asked whether they would use the software to talk specifically with family and friends, for example in a telephone conversation, several of the participants said they would not: when they talk to family and friends, they are familiar enough with the sound of the person's voice to hear and understand what is being said. Some participants, however, did state that they would use the software for communication with family and friends who live overseas, one person stating specifically that it would increase her confidence in using the phone if the software were integrated into the telephone system. One participant stated that the software would be of most use at work (he is a deaf nurse); however, bearing in mind that the person at the other end of the call would have to have the software installed, he would use it with family and other people (such as his doctor) to whom he spoke regularly.

It was stated that problems with telephony often arise when the hard of hearing person is speaking to somebody whose voice they are unfamiliar with, such as in business calls. As a preference, therefore, many of the participants would prefer to use the software for business and customer service calls. One participant commented that if users were trained in how to speak into the software, i.e. slowly and concisely, this would allow the hard of hearing user to better hear what is being said, as well as providing the text on screen for better understanding.
Another concern among some participants was the need to have the software on the transmitting end of the conversation. One opinion put forward was that it could be difficult and time consuming to distribute the software among family members and friends; another was that family and friends would be understanding of the needs of the deaf user and therefore willing to put in the time and effort.
Speech Recognition Performance

All but one of the participants were impressed with the performance of the speech recognition, noting that in general there were no more than 5 words incorrect out of approximately 360. One participant who was impressed with the speech recognition performance nevertheless stated that he would not use the software in its current form, commenting that he would prefer to use this sort of technology on the receiving end of the conversation. He expressed the view that as long as the context of the message was maintained, some errors would not be significant. This participant, however, also stated that he did not think the technology at this
time was good enough to translate all speech with minimal or no training, but that he would definitely use the software if this were possible.

One issue that arose was how the text was displayed on the screen. There was a difference of opinion as to whether the text should appear as scrolling text, word by word, or be displayed in the form of sentences and phrases. In some instances the user was confused as to when they were supposed to reply to a question or statement, because they were unsure whether the other person had finished speaking. It was thought that scrolling text would give an indication of when the other person had finished speaking. Contrary to this, the majority of participants felt that scrolling text would be unnatural to read, and therefore preferred the text to be displayed in sentence form.
Improvements

Many of the participants felt that a video link should be included within the software, as is common with other voice chat clients, since it can be beneficial to be able to lip read the person on the other side of the conversation. It is also possible to see if a user has misunderstood or is confused by something they have heard, giving a visual cue to the other user to clarify what has been said. The participants concluded that including speech-to-text in internet messaging services could then act as an aid to complement what the user can see and hear from the client.

Some participants also felt that it was distracting to have the user name or email address of the sender appear with every message; it would only need to be shown at the beginning of the conversation. One possible solution would be to use different coloured text for each person involved in the conversation, including any third parties that may join.

A number of participants thought it would be useful to have the software in the form of a plug-in for voice chat clients already available, such as Skype and Google Talk, allowing them to use the speech recognition with services they are familiar with. In addition, it was mentioned that future work could include the software in mobile telephony, such as an iPhone or Windows Mobile application.

Regarding the comments made in the previous section, it was considered useful to include a check box allowing the user to toggle between scrolling text and block text display, so that the user can select a preference for how text is displayed. It was also suggested that the delay between the speech and text being transmitted could be confusing at times; although the delay is relatively short compared to Teletext subtitles, the users stated they would prefer the text and speech to arrive simultaneously.
Another comment, voiced by several participants, was that scrolling text would make it easier to know when a person has finished speaking, indicating to the deaf or hard of hearing user when they could reply; while using the software, some participants felt confused as to when they should reply to the displayed text. One other suggestion was to use standard textphone notation, such as "GA" for "Go Ahead", indicating to the deaf user that they can reply, or "SK" for "Stop Keying", indicating the conversation is over. It was also speculated whether this process could be automated using noise gate software, automatically showing that the deaf user could reply after a period of silence.
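The noise-gate idea suggested above could be prototyped with a simple RMS threshold and hold time: when the incoming audio stays below the threshold for long enough, the client would append a "GA" marker to the text display. The sketch below is a hypothetical illustration; the frame size, threshold and hold time are assumed values, not figures from the study.

```python
import math

def rms(frame):
    """Root-mean-square level of one audio frame (samples in -1.0..1.0)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class GoAheadGate:
    """Emit a 'GA' marker after `hold_frames` consecutive quiet frames."""
    def __init__(self, threshold=0.02, hold_frames=50):  # ~1 s of 20 ms frames
        self.threshold = threshold
        self.hold_frames = hold_frames
        self.quiet = 0
        self.armed = True   # ensures only one GA per pause

    def process(self, frame):
        """Return 'GA' when the talker appears to have finished, else None."""
        if rms(frame) < self.threshold:
            self.quiet += 1
            if self.armed and self.quiet >= self.hold_frames:
                self.armed = False
                return "GA"
        else:
            self.quiet = 0
            self.armed = True
        return None

# Demonstration with synthetic frames and a short hold time
gate = GoAheadGate(threshold=0.02, hold_frames=3)
speech = [[0.3, -0.2, 0.25]] * 4      # loud frames (talker speaking)
silence = [[0.001, -0.001, 0.0]] * 5  # quiet frames (pause)
events = [gate.process(f) for f in speech + silence]
# "GA" is emitted once, on the third consecutive quiet frame
```

In a real client the threshold and hold time would need tuning against background noise, and a voice-activity detector would likely be more robust than a bare RMS gate.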
Interestingly, the final participant in the research had not requested speech-to-text or BSL support, and the VoIPText software was used as a means of communication during much of the interview and test session. This was a real test of the ASR engine's capability, as the conversation was unscripted; although there were errors in translating some of the speech, the session went very well. It quickly became apparent that a means of signalling that an error had occurred was needed, and the word "correction" was quickly established as a way of letting the participant know that a mistake had been made; the sentence would then be repeated. Coupled with the use of "GA" (indicating 'go ahead') when a response was required, a level of communication sufficient to carry out the interview was established. This participant was very impressed with the software and, although there were translation errors, he was very keen to take part in field trials, stating that he thought his manager at work and his wife would both find the program useful.
Questionnaire Results

A number of specific questions were put to the participants of the software demonstration with the aim of determining the usability of the software. Of the eight participants who took part in the usability demonstration, all had between average and advanced computer knowledge, but only one had previously used speech recognition. The results were given an average weighted score (AWS) in order to indicate the quality for each question, 1 being high quality and 0 being low quality. The weighting method is detailed in Appendix C. The tallied answers from the questionnaire are given below, together with the average weighted score for each question:
When logging into the software did you find it: Average Weighted Score: 0.69
Very Easy: 1 Easy: 4 Neither Difficult nor Easy: 3 Difficult: 0 Very Difficult: 0
When adding a contact to your contact list did you find it: Average Weighted Score: 0.72
Very Easy: 2 Easy: 3 Neither Difficult nor Easy: 3 Difficult: 0 Very Difficult: 0
Did you find making a call: Average Weighted Score: 0.66
Very Easy: 2 Easy: 2 Neither Difficult nor Easy: 3 Difficult: 1 Very Difficult: 0
The speech recognition software performed: Average Weighted Score: 0.75
Very Well: 4 Well: 1 Adequately: 2 Poorly: 1 Very Poorly: 0
If an error occurred in the output text the message was still understandable: Average Weighted Score: 0.69
Always: 3 Almost Always: 0 Sometimes: 5 Almost Never: 0 Never: 0
In your opinion, the time in which the text appeared was: Average Weighted Score: 0.88
Never Too Long: 6 Sometimes Too Long: 2 Often Too Long: 0
Would you use speech recognition for some telephone calls: Average Weighted Score: 0.88
Yes: 7 No: 0 Not Applicable: 1
Would you use speech recognition for most telephone calls: Average Weighted Score: 0.62
Yes: 5 No: 2 Not Applicable: 1
Overall I was ___ by the speech recognition software: Average Weighted Score: 0.72
Very Impressed: 2 Impressed: 4 Satisfied: 1 Unimpressed: 1 Very Unimpressed: 0
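The weighting method itself is detailed in Appendix C; the scores above are reproduced (to within rounding) by a simple linear weighting, which the sketch below assumes: the best answer scores 1, the worst 0, with evenly spaced weights in between, and yes/no questions scored with Yes = 1, No = 0 and "Not Applicable" counted at weight 0. This weighting is an inference from the published scores, not the study's documented method.

```python
def average_weighted_score(tallies, weights=None):
    """tallies: answer counts ordered best-to-worst.
    weights defaults to an evenly spaced scale from 1.0 (best) to 0.0 (worst)."""
    k = len(tallies)
    if weights is None:
        weights = [1 - i / (k - 1) for i in range(k)]
    total = sum(tallies)
    return sum(w * n for w, n in zip(weights, tallies)) / total

# Graded scale: "When logging into the software did you find it:"
# Very Easy: 1, Easy: 4, Neither: 3, Difficult: 0, Very Difficult: 0
login = average_weighted_score([1, 4, 3, 0, 0])                    # ~0.69

# Yes/No question with N/A counted at weight 0:
# "Would you use speech recognition for most telephone calls:"
most_calls = average_weighted_score([5, 2, 1], weights=[1, 0, 0])  # ~0.62
```

Both values round to the figures reported above, which is what suggests the linear weighting.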
Discussion with Deaf Users and Representative Organisations

The Salford/Ofcom VoIP client was demonstrated to members of the Manchester Deaf Centre and the Sensory Team at the Salford NHS branch. The participants in this discussion were unable to travel to Salford University, and network access was unavailable at the only locations that could be used; this meant that participants did not have the opportunity to use the software themselves, although a thorough demonstration of its use was given. Members of both groups were extremely impressed with the software and the level of speech recognition demonstrated.

One point that came to light is the importance of training both the speech recognition software and the user speaking into it. As discussed in the objective analysis, it is important that the user knows how to speak into the software to maintain a low word error rate. If a user were to use the software without any training, a high word error rate would be expected, and the user would not be inclined to use the software for communication. Possible solutions highlighted were a mandatory training session provided when creating a profile, which automatically runs the microphone configuration and training session, and an audio or video tutorial indicating how the user is expected to speak into the software.

Another topic of discussion was that many of the completely deaf community frequently use SMS text, Bluetooth messaging and internet messaging services to communicate with friends and family. A significant advantage of internet messaging is the ability to see the person they are having a conversation with, particularly if the deaf user is lip reading or using sign language to communicate.
Providing automatic text subtitles using speech recognition software was considered an additional advantage, particularly for lip readers, who would be able to use the subtitled text to complement what they see over the video link. In comparison to the textphone and Minicom services currently used, which only show text for a short period of time, one advantage of the Salford/Ofcom VoIP client is the ability to scroll back through a conversation if something has been misunderstood. In addition, if the user were in a business call (for example to a bank manager), the text could be saved as a reference document for future conversations.

Demonstrations and discussions were carried out during sessions with three organisations representing deaf people: the National Association of Deafened People (NADP), TAG and the RNID. Members of TAG and the NADP visited Salford University, and a presentation was given on the aims of the project and the intended usage of the software that had been developed. Both organisations were extremely supportive of the project and enthusiastic about its potential. A demonstration of the software was given, during which a discussion was carried out between rooms using the VoIPText client. All those present from both TAG and the NADP were very supportive indeed and very interested in assisting with recruiting people to carry out further assessments. It is anticipated that these groups could be instrumental in identifying participants for any field trials that may take place in the future.
An additional visit took place to the office of the Royal National Institute for Deaf People (RNID), at which the issues surrounding the software application were discussed in full. In principle the RNID had a number of objections based on their perception of Ofcom policy in the area of assistive communications. Principal among these was the fact that the system relies on normal-hearing people installing special software in order for the system to function. The RNID consider that this does not make for true equality of access to services, because non-hearing-impaired people are required to take special measures. Concern was also expressed that relatively high scores for speech recognition under laboratory conditions may not necessarily reflect acceptable performance in a real-world environment. Frustration was expressed that, in the opinion of the RNID, policies were not being put in place to allow greater access to services such as TypeTalk, until recently operated by the RNID but now handed over to BT. Although these objections were noted, the case was made to the RNID that even access requiring special measures by non-hearing-impaired people would improve the situation for deaf people. The possibility of extending the study to include field trials of the software in people's homes was discussed at some length, and the RNID were broadly supportive of this suggestion, considering that it would provide much more useful data on whether speech recognition technology is yet sufficiently well developed for this use.
Conclusions

(Key findings are presented in bold type.)

The results from the objective test show that the Windows 7 ASR engine performs slightly better than Dragon Naturally Speaking for a majority group of subjects who have no experience of using speech recognition software. Although the t-test results for Example 1 show a significant difference in performance between the two software packages with some material, the overall results (word counts for Examples 1 and 2) show no significant difference. Either Windows 7 speech recognition or Dragon Naturally Speaking would therefore be suitable for extended assessments of the software. Market-leading speech recognition software is now available to Windows users free of charge in the shape of Windows Speech Recogniser, the first time that ASR bundled with an operating system has been able to compete with leading commercial packages.

In all of the results for the subjects with no experience of using ASR, the word error rate was higher than the 10% WER proposed in the current draft ETSI standard (ETSI ES 202 975 V0.0.6r3 (2008-09)). The subject who did have previous experience of ASR, however, achieved word error rates well below 10%, suggesting that training the user how to speak into the software, as well as training the software itself, may achieve significantly more accurate results. In a number of instances the one subject with experience of speech recognition software far exceeded the 90% recognition rate, achieving word recognition rates of 99.3% and 98% with Dragon Naturally Speaking.

One other possible influence on the results is the different approaches to training the software packages.
Although the Dragon Naturally Speaking training session takes longer to perform (and therefore more words are gathered), the Windows 7 training session appeared to be more beneficial in other ways, as it encourages the user to speak in fragmented sentences, allowing the software to process shorter sections of text. For the qualitative testing of the software it was therefore decided that the Windows 7 Speech Recognition package would be better suited. This was based not only on the overall results of the objective testing, but also on the fact that the software is free of charge (requiring no purchase of additional software such as Dragon Naturally Speaking). Although Windows 7 speech recognition was used for this qualitative assessment, the current scarcity of Windows 7 means that other ASR engines may be more practical for extended field trials.

In all discussions with members of the deaf and hard of hearing community, all but one of the participants were impressed with the Salford/Ofcom client and the performance of the speech recognition software. A number of participants stated, however, that in order to maintain the reliability of the speech recognition, and therefore the usefulness of the software client, training and ongoing support would need to be provided to potential users. Results from the questionnaire show that, on average, participants found the software easy to use. This is supported by the average weighted scores (AWS) for each of the questions, none of which were lower than 0.5. The questionnaire also indicates that 87% of the participants
would like to use the software for some telephone calls, and around 62% would use the software for all telephone calls. Many of the hard of hearing participants (who had some hearing ability) suggested that they would prefer to use the software for business and banking calls rather than with friends and family. The reason for this preference is that a hard of hearing person is familiar with the voices of friends and family and so finds them easier to understand, even over a phone line; the difficulty arises when speaking to someone with an unfamiliar voice or an unusual accent or dialect, hence the preference to use the software for these types of conversation.

Discussions with the Deaf community indicated that many already use internet messaging services for communication, due to the ability to lip read and sign via a video link. These members were nevertheless very impressed with the software and the performance of the speech recognition, indicating that speech to text would be an added benefit when using internet messaging to communicate with friends and family who were not familiar with deaf communications.

A number of improvements to the software client were suggested by the participants, including removing the username/email address of the user from every broadcast message. It was suggested that this can distract from the message, and that using different text colours would be a better way to view a conversation. Another suggestion was a tick-box option to select whether the text should appear scrolling word by word or as block sentences. This would allow users to select their own preference, as well as indicating when the speaker has finished. It was stressed as important to give some indication of when the deaf or hard of hearing user could speak, as this would remove confusion over when a reply could be made.
Every attempt has been made to report recommendations and comments accurately; however, there are clearly some practical limitations to implementing the suggestions made to improve the software. The flow of text (word by word or paragraph by paragraph) is largely dependent on the ASR engine used and how it operates: because the ASR engines use contextual parameters in identifying words, they require blocks of text for processing. The usefulness of a video link is not yet clear; research suggests that a video link without synchronisation to the voice and text may actually be detrimental to understanding.
Appendix A: Calculation Method for Data Analysis

For these definitions and calculation methods, terms are defined as follows:

Nr   the total number of words in the reference text
S    the number of substituted words in the output text
D    the number of deleted words
I    the number of inserted words

An example of input text (I/P) and output text (O/P) is given below, where substituted words are highlighted in blue, deleted words in red and inserted words in green:
I/P: “Sebastian is now 5 years old, he starts school this year” O/P: “So bastion is now 5 years hold, he school his”
1. Word Error Rate

The Word Error Rate is defined as:

    WER = (S + D + I) / Nr
2. Word Recognition Rate

The word recognition rate is defined as:

    WRR = 1 - WER = (H - I) / Nr

where H is the number of words correctly recognized:

    H = Nr - (S + D)
3. Words Correct Rate

The words correct rate, which does not consider insertion errors, is defined as:

    WCR = H / Nr
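As an illustration of these definitions, the error counts can be obtained automatically by aligning the reference and output word sequences with a word-level Levenshtein (dynamic programming) alignment. The sketch below (Python; an illustration, not the tooling used for the original analysis) reproduces the three rates for the example sentence above, with punctuation stripped for simplicity. Note that equally optimal alignments can classify individual errors differently, but the total error count S + D + I is always the same.

```python
def error_counts(reference, output):
    """Word-level Levenshtein alignment; returns (S, D, I, Nr)."""
    r, h = reference.split(), output.split()
    nr, nh = len(r), len(h)
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j output words
    d = [[0] * (nh + 1) for _ in range(nr + 1)]
    for i in range(nr + 1):
        d[i][0] = i
    for j in range(nh + 1):
        d[0][j] = j
    for i in range(1, nr + 1):
        for j in range(1, nh + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    # Backtrace along an optimal path to classify each edit
    S = D = I = 0
    i, j = nr, nh
    while i > 0 or j > 0:
        sub = 0 if (i > 0 and j > 0 and r[i - 1] == h[j - 1]) else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            S += sub
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, D, I, nr

# Example from this appendix, punctuation removed; Nr = 11 and the
# total error count is 6, so WER = 6/11
ref = "Sebastian is now 5 years old he starts school this year"
out = "So bastion is now 5 years hold he school his"
S, D, I, Nr = error_counts(ref, out)
H = Nr - (S + D)
wer = (S + D + I) / Nr
wrr = 1 - wer          # equivalently (H - I) / Nr
wcr = H / Nr
print(f"S={S} D={D} I={I}  WER={wer:.1%}  WRR={wrr:.1%}  WCR={wcr:.1%}")
```

The same routine applied to each subject's transcript yields the per-subject tables in Appendix B.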
Appendix B: Objective Results
B.1. Summary of Word Error Rate Results

In this section the mean results are given, together with the p-value, which indicates the statistical significance of each result. Any p-value below 0.05 indicates statistical significance at the 95% confidence level.

Example 1: Directions
Subject                      1       2       3       4       5       6       mean (WER)   Std. Dev
Dragon Naturally Speaking    4.1%    20.6%   20.6%   19.6%   18.6%   19.6%   17.2%        6.4%
Windows 7                    5.2%    10.3%   15.5%   18.6%   12.4%   14.4%   12.7%        4.6%
T-Test (Paired, Two-Tailed): P(T<=t) = 0.041, indicating that the Windows 7 ASR engine performed better in this assessment than Dragon Naturally Speaking with a 95% confidence level.

Example 2: Phone Conversation
Subject                      1       2       3       4       5       6       mean (WER)   Std. Dev
Dragon Naturally Speaking    0.7%    20.1%   25.5%   16.1%   22.8%   12.1%   16.2%        9.0%
Windows 7                    8.1%    10.7%   14.1%   26.2%   10.7%   24.8%   15.8%        7.8%
T-Test (Paired, Two-Tailed): P(T<=t) = 0.929. In this assessment the mean alone indicates that Windows 7 performed slightly better than Dragon Naturally Speaking, but the difference is not statistically significant and could be due to chance.
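The paired t-test reported above can be reproduced from the per-subject WER figures. The sketch below (Python; a re-derivation for illustration, not the original analysis script) computes the paired t statistic for Example 1 and compares it against 2.571, the two-tailed 5% critical value of the t distribution with 5 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired two-sample t statistic and degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Per-subject WER (%) for Example 1 (Directions), from the table above
dragon = [4.1, 20.6, 20.6, 19.6, 18.6, 19.6]
win7 = [5.2, 10.3, 15.5, 18.6, 12.4, 14.4]

t, df = paired_t(dragon, win7)
# |t| above the df = 5 critical value 2.571 corresponds to p < 0.05;
# here t is roughly 2.71, consistent with the reported p of about 0.04
significant = abs(t) > 2.571
print(f"t = {t:.3f}, df = {df}, significant at 95%: {significant}")
```

Running the same calculation on the Example 2 figures gives a t statistic close to zero, matching the non-significant result reported above.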
B.2. Subject 1 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             5                                 4
Insertions (I)                0                                 0
Deletions (D)                 0                                 0
Word Error Rate (WER)         5%                                4%
Words Recognised Rate (WRR)   95%                               96%
Word Correct Rate (WCR)       95%                               96%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             9                                 1
Insertions (I)                2                                 0
Deletions (D)                 1                                 0
Word Error Rate (WER)         8%                                1%
Words Recognised Rate (WRR)   92%                               99%
Word Correct Rate (WCR)       93%                               99%
B.3. Subject 2 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             7                                 9
Insertions (I)                0                                 3
Deletions (D)                 3                                 8
Word Error Rate (WER)         10%                               21%
Words Recognised Rate (WRR)   90%                               79%
Word Correct Rate (WCR)       90%                               82%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             10                                21
Insertions (I)                1                                 0
Deletions (D)                 5                                 9
Word Error Rate (WER)         11%                               20%
Words Recognised Rate (WRR)   89%                               80%
Word Correct Rate (WCR)       90%                               80%
B.4. Subject 3 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             13                                16
Insertions (I)                0                                 0
Deletions (D)                 2                                 4
Word Error Rate (WER)         15%                               21%
Words Recognised Rate (WRR)   85%                               79%
Word Correct Rate (WCR)       85%                               79%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             15                                24
Insertions (I)                4                                 0
Deletions (D)                 2                                 14
Word Error Rate (WER)         14%                               26%
Words Recognised Rate (WRR)   86%                               74%
Word Correct Rate (WCR)       89%                               74%
B.5. Subject 4 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             15                                12
Insertions (I)                0                                 0
Deletions (D)                 3                                 7
Word Error Rate (WER)         19%                               20%
Words Recognised Rate (WRR)   81%                               80%
Word Correct Rate (WCR)       81%                               80%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             24                                12
Insertions (I)                3                                 1
Deletions (D)                 12                                11
Word Error Rate (WER)         26%                               16%
Words Recognised Rate (WRR)   74%                               84%
Word Correct Rate (WCR)       76%                               85%
B.6. Subject 5 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             11                                15
Insertions (I)                0                                 0
Deletions (D)                 1                                 3
Word Error Rate (WER)         12%                               19%
Words Recognised Rate (WRR)   88%                               81%
Word Correct Rate (WCR)       88%                               81%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             10                                19
Insertions (I)                2                                 2
Deletions (D)                 4                                 13
Word Error Rate (WER)         11%                               23%
Words Recognised Rate (WRR)   89%                               77%
Word Correct Rate (WCR)       91%                               79%
B.7. Subject 6 Results

Example 1: Directions         Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   97                                97
Substitutions (S)             10                                13
Insertions (I)                1                                 0
Deletions (D)                 3                                 6
Word Error Rate (WER)         14%                               20%
Words Recognised Rate (WRR)   86%                               80%
Word Correct Rate (WCR)       87%                               80%

Example 2: Phone Conversation Microsoft Speech Recogniser 8.0   Dragon Naturally Speaking 10.0
Total Words                   149                               149
Substitutions (S)             25                                16
Insertions (I)                5                                 2
Deletions (D)                 7                                 0
Word Error Rate (WER)         25%                               12%
Words Recognised Rate (WRR)   75%                               88%
Word Correct Rate (WCR)       79%                               89%
Appendix C: Questionnaire Weighted Scoring

In order to obtain a single-number rating for the questionnaire completed by the test participants, a method was devised to score the answers to each question. The equation for calculating the average weighted score (AWS) is given by:

    AWS = Σ(T × S) / N
Where T is the number of “ticks” next to the particular answer, S is the designated score for that answer and N is the number of participants. The AWS gives an impression of the overall ease or difficulty all the users experienced whilst using the software.
Answer                       S
Very Easy                    1
Easy                         0.75
Neither Easy nor Difficult   0.5
Difficult                    0.25
Very Difficult               0

Answer                       S
Always                       1
Almost Always                0.75
Sometimes                    0.5
Almost Never                 0.25
Never                        0

Answer                       S
Very Impressed               1
Impressed                    0.75
Satisfied                    0.5
Unimpressed                  0.25
Very Unimpressed             0

Answer                       S
Never Too Long               1
Sometimes Too Long           0.5
Often Too Long               0

Answer                       S
Yes                          1
No                           0
Not Applicable               0
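The weighted-scoring calculation can be sketched as follows (Python; the tick counts used here are purely illustrative, not real questionnaire data):

```python
def average_weighted_score(ticks, scores, n_participants):
    """AWS: sum of (tick count x designated score) over all answer
    options, divided by the number of participants."""
    return sum(t * s for t, s in zip(ticks, scores)) / n_participants

# Scores for the ease-of-use question, from the first table above
scores = [1.0, 0.75, 0.5, 0.25, 0.0]  # Very Easy ... Very Difficult
ticks = [3, 3, 2, 0, 0]               # hypothetical answers from 8 participants
aws = average_weighted_score(ticks, scores, 8)  # 0.78125
```

An AWS of 1 would mean every participant ticked the most favourable answer, 0 the least favourable; the report's observation that no question scored below 0.5 therefore places all questions in the favourable half of the scale.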
Appendix D: Subject Interviews

Two forms of interview were used to obtain views and opinions regarding the voice chat software. The first was a group discussion with cochlear implantees; the second was an individual demonstration and discussion, scheduled for individuals who were unable to attend the group session, so that they could trial the software.
D.1. Group Discussion
Total Attending: 5 Subjects and Interviewer

How did you find using the software, and would you be likely to use it yourself?

SUBJECT 1: If I was able to pick up the phone and have text come to me rather than speech then I'd use the phone more often. I think it could be useful to integrate this into the telephone system, and if it could be used with something like Skype that would be wonderful.

SUBJECT 2: I'd definitely use it but I don't think I could make it work if I had to help the other person using the software. If it could all be done (speech to text) at my end rather than their end, picking up the speech and turning it into text. Even if it didn't get everything right, it would be a little more help to get the gist of what has been said, which would be brilliant. But in its current state I don't think I could get it to work.

SUBJECT 3: In SUBJECT 2's case, where he's running a business and he needs to interact with his people, this would be extremely difficult because he can't use the phone normally. Then it's unreasonable to expect all his clients, before they even contact him, to get all that software on. For me, if other people have it then in certain circumstances I can see it being helpful. But that is the downside, in that the other person who's calling you has to have the software on as well.

SUBJECT 4: I agree that (SUBJECT 3's comment) is a major problem. I however have a sister who lives in Greece who would probably be willing to use it if it was available.

SUBJECT 3: It would be brilliant to use with friends and family.

SUBJECT 2: How come it has to be at both ends? Why can't it just be at one end? If you decided that you didn't mind the odd mistake at your end, could it be made to work properly on your end only?

SUBJECT 3: After all, you have it in subtitles on the television, where there are obvious mistakes sometimes. But the people who do it regularly, the computer they're using has to get used to their voice; they've got to spend time getting it to recognize them and teaching it vocabulary.

SUBJECT 4: The point about that is that subtitles aren't a conversation, they're just one way. With a conversation the consequences of a mistake could be more serious. If I'd said "dad has just died" and it came out "dad has just dined", you wouldn't know.

SUBJECT 1: If it could be made to work at one end, if you thought the other person had made a mistake then you only have to say could you rephrase it. I think it could be very useful to get it to work only at one end ultimately. Then deaf people could use it with more people and not only with people who have the software.

Have any of you ever used speech recognition before?

SUBJECT 2: No, that's why I was impressed by how accurate it seemed. I was expecting it to be much more garbled.
SUBJECT 2: Is that because the software isn't currently advanced enough to decipher just anybody? It's got to be trained to the person speaking.

SUBJECT 1: But (Text Type) involves a lot of work, whereas with this software, once you've trained it, it doesn't need any more input other than speech.

SUBJECT 4: That's the whole point. The people who can use the keyboards don't come cheap and there's too few of them, only 25 in the whole country. One of the reasons we're here.

SUBJECT 5: I went to the cochlear implant meeting hosted by Red-Bee Media, and they spoke to us about subtitles. It made you realize why there are mistakes made; it depends on things like the computer and the person manning it, and even though the training is intense, the speed at which it has to be done means mistakes will be made.

SUBJECT 4: When SUBJECT 2 mentions that at the moment the speech recognition software isn't good enough for this type of communication, the implications are that it's going to get better... isn't it?

Do you think speech to text is good enough for communications in this way?

SUBJECT 1: If what we've seen is a good example then yes.

SUBJECT 4: Oh yes, yes, within the limitations that you've talked about.

SUBJECT 3: To me, I was confused as to when I could communicate and was unaware when it was a two-way conversation, and I found it was inaudible (laptop speakers used). I couldn't really hear it clearly enough to be able to reply, and there appeared to be a slight delay between what was heard and what appeared on the screen as text. I wasn't sure whether I should've been saying something back.

SUBJECT 4: I was trying not to listen to the speech; because of the delay I found it slightly confusing. If you've got the text, you don't really want to listen at the same time do you... do you?

SUBJECT 1: Not if there is a time delay. To be honest I could hear that you were speaking but couldn't hear what was being said, so I just focused on the text.

SUBJECT 2: I'm not sure if it's good enough for everyday use because the speaker's voice had to be trained into the software first. For me that's going to cause major problems; for instance, if I were to ring "Company A" the person's voice is going to have to be trained at that end. The way I'm thinking about using it (at the deaf user's end), speech recognition technology isn't ready for this purpose. (Would prefer it to be automatic for any voice.) Until it gets to that point (i.e. automatic translation with no training), is it really going to help as much?

SUBJECT 5: Especially since if you ring "Company A", sometimes you're going to get a peculiar accent, which doesn't worry us because we are used to listening to peculiar accents.

SUBJECT 3: It's quite difficult as well because you're going to have the problem of call centres; a lot of them are situated in Asia and I think their accents are difficult to understand even for normal hearing people.

SUBJECT 2: I agree. When I go out to customers and they are contacting call centres, they too find it difficult to understand some call centres, and they are normal hearing people. I come across this about 5 times a week and there are so many people who struggle to understand these call centres.

SUBJECT 4: Thinking though, if this software is only required on one end of the conversation, and for example banks were to use the software on their end, I think this could work very well. Based on the original question, if two people, one deaf and one hearing, are in regular contact and are accustomed to using this software, the deaf person wouldn't need to be trained; it's only the hearing person sending the message who would need to be.

SUBJECT 4: To me, I think typing a message over voice chat software would probably be just as quick as speaking into the software and waiting for the text to appear.

SUBJECT 5: Also you've got to arrange for people to be on their computer at that time, which you don't need with normal telephones.
SUBJECT 4: But some people do arrange times to call, don't they.

SUBJECT 5: It's very interesting though because our son is in Canada, so we'd have to arrange calls with him due to the time difference.

SUBJECT 4: My problem is that I don't have a land line and rely on a wireless modem, which is very slow. I think that is going to be more and more of a problem for me, because the more information you're sending, the more you rely on the speed of the transmission.

How would you improve the software?

SUBJECT 2: I'd like to be able to include webcam access so that you'd be able to see each other. I think this would be useful for lip reading, and most voice chat clients have that facility anyway.

SUBJECT 1: I think if you're not going to have the picture, then you're no better off; you might as well use something like instant internet messaging, so you need a picture really. You could then have the image and speech together with the subtitles.

SUBJECT 3: When using Skype we found it really useful to be able to see our relatives overseas; the quality of the video and audio enabled us to read her lips as well as listen to the speech. It was helpful to do that rather than reading text from the screen. Although there was no text appearing when using this, there was no problem in communicating; it was pretty good.

Who would use this software for telephone calls?

SUBJECT 1: If it was available and good enough, then yes, I'd use it all the time. At the moment I avoid using the phone and have no confidence in using it. The problem is, as SUBJECT 2 said before, you can't expect all the people who ring us to be using the software.

SUBJECT 4: My dream would be to have a little black box, like a mobile, that I could just hold in front of me and see what was being said. I would use this software if it works as it has today, definitely.

SUBJECT 5: I think we would use this software for overseas calls to family/friends.

SUBJECT 2: I think it needs automatic training to the user's voice; in its present form I wouldn't use this, and I would still use instant messaging to talk to my close friends and family. It would have to be truly groundbreaking for it to be useful for me. I do however think it is a good first step in developing this.
D.2. Individual Discussion with Subject 6

Do you use the internet for internet messaging, SMS, phone calls, etc?

SUBJECT 6: Just messaging, not for telephone calls.

Why do you prefer using the internet for messaging as opposed to, for example, SMS text messaging?

SUBJECT 6: Because of the ability to read what has been said in, for example, a chat room, Windows Live, Google Talk, MSN Messenger. If I was going to use any voice application (i.e. telephone, voice chat communication), not necessarily voice recognition, I may not understand what the person on the other line has said. This is what makes this voice chat client so good, the ability to read what has actually been said; it's fabulous.

Apart from the earlier exercise, have you ever used speech recognition before?

SUBJECT 6: No.

Or do you know of anybody who has used speech recognition for communications purposes, with for example yourself?

SUBJECT 6: No.

How did you find the speech recognition performed in the earlier conversation?

SUBJECT 6: Very good, very impressed by it.

Did you rely on only the text of the message to understand the conversation, or did you use a combination of the speech and the text displayed?

SUBJECT 6: The sound was only at the beginning, but I would require the text. If I had some speakers for the speech it would be useful, but I would probably always be reliant on the text as an aid in combination, just as we would use subtitles on the television... we would be listening but also reading to put the two together to understand. There is also the unfamiliarity of the voice, which is always something you are going to have when contacting businesses, for example a bank; it can be difficult to understand somebody you have never spoken to.

How did you find using the software?

SUBJECT 6: It was easy to navigate round the software; very happy with it, happy with the number of themes, once logged in with the email and password. I can appreciate that some people may need more themes for high contrast, but for me font 12, black on white was fine. Smashing!

SUBJECT 6: One thing: you've got the sentence and the repeating email. When you finish on TypeTalk you key in GA, which means Go Ahead, which lets the other person know when they can speak. It would be better if the email popped up only once while you were speaking. There is another one for when they want to disconnect, which is SK, meaning Stop Keying... e.g. SK SK SK.

Were there any specific thoughts or feelings you had during the conversation?

SUBJECT 6: Only the instance "bouquet"; I wondered where it came from, because it seemed out of place (NB "Okay" was spoken). I see your point about using GA and SK to indicate when I've finished speaking, as you have made a comment in the middle of one of my sentences. (This was as we were reviewing the text in the window.)

Do you think it could be worthwhile having the text appear word by word (scrolling text) as opposed to block sentences appearing?
SUBJECT 6: My own way is reading the sentence in one go, but a lot of the older people complain the text appears and disappears too fast in, for example, subtitles. It's a good question actually, because many people may have a different preference. You'll get different answers to that question.

Do you think it would be useful therefore, anticipating people's views on this, to have an option to have the text appear word by word or as a block sentence?

SUBJECT 6: Yes. One advantage this has, for example, is that whereas when you are reading subtitles the text disappears, here you can actually go back and read what has been said if you have missed something. This is why there can be a delay in, for example, TypeTalk... the deaf user is reading what is on their screen. Because the text is not disappearing, you have time to catch up with what has been said. Subtitles... they're gone and it's on to the next sentence. Some people say it can be exhausting watching television because they are constantly trying to keep up with the subtitles. So because of that I think the word by word or the block sentence may not be such a big issue, because the text is still there to read. That's the main thing.

Can you offer any suggestions that you think would improve the software?

SUBJECT 6: Nothing really beyond the GA and SK usage and limiting the number of times the email/name of the person appears during the conversation. Beyond that I think it works really well. I wouldn't have any difficulty if the other person was speaking into the system; it would be smashing. It would be really useful to use with businesses when trying to find out why a bill is so much when it should be some other amount. It would be nice to get a proper answer, so I think this is really fabulous. It's really good.

I have also given a demonstration of the software to the Salford Sensory Centre in Moorside, who were also very impressed with the software. One of the things they were concerned about was the need for a very strong arm of training for the user who will be speaking into the STT. The hearing user will have to go through the training, otherwise the word error rate will be much higher than what has been shown at the level of training I have just demonstrated. There is definitely some level of training needed with both the software and the user. Do you have any thoughts or opinions on this, as one of the worries was that a new user will not train the software, get a high WER, and think the software is not good enough for this use? The hearing person may also think it takes too long to train the software...

SUBJECT 6: A lot of companies like to say they have services for the deaf, for example a text line or text phone, but it's one phone and it's not guaranteed that the person on the other end will be trained in its use. Being able to get hold of people in businesses and training them in the use of the software would be two concerns. It can also be difficult when talking to, for example, communication providers, because very often you will be asking them a question about a recent bill and they may direct the conversation to a sales pitch, changing the nature of the conversation. This can be difficult to follow as the person on the other line can be under pressure to make a sale.

What about using the software with family and friends?

SUBJECT 6: That would be much better and easier because they will understand your needs, and I'd imagine they would be more willing to learn how to use the software. This software would rule out the difficulty in struggling to hear what people are saying.

Who would you use the software for telephone calls with?

SUBJECT 6: Primarily businesses, because they are the people I struggle to hear. I don't tend to struggle hearing family and friends, and I can get them to repeat what they said. That's because I have some level of hearing; if I was completely deaf then I would have to use it with everyone. Anyone who is deaf will probably say everyone. I would definitely use it with businesses.

So I guess the scenario would be that you would have a number of people trained in the use of this software in a business, who you could converse with, is that right?

SUBJECT 6: Yes. The training should be basic and should be included anyway, as they are already using headsets with microphones. With TypeTalk, for example, there will be a designated person to use this. If they are out for lunch you've had it; nobody else will pick up the phone because they are not trained. With speech recognition it could be more widespread within the company and could be part of the disability training that customer service
employees have to go through when they first start in call centres, so they would be trained in speech recognition. The training is much easier for this software than it would be for Typetalk. Imagine there were 100 computers in a call centre using this sort of technology; that would be fantastic. It would be interesting to see what the business attitude is towards that. Companies should be aware that the number of people with some form of hearing loss is growing (1 in 7, as published in an RNID article) and should therefore have some means of providing a service to those customers. Quite often, though, they might be in a couple, so the hearing person would be able to use the telephone on behalf of the deaf or hard of hearing person. There should be greater awareness now. The problem is that businesses want to get the calls done as quickly as possible for sales, and there has to be a change in attitude towards the customers.

How would you compare this software to MSN or Google Talk?

SUBJECT 6: When using just the text, I would say that it is the same as Google Talk. If both the speech and the text were transmitted, even though the speakers on the laptop weren't bad, I would probably still be reliant on the text. The software is easy to use and similar to other clients, but I was wondering what the need for this would be for chatting; then again, it would be really beneficial for business calls. If you were dealing purely with family you may not use this; you would just use a chat system. I know this girl who has just had a [cochlear] implant; she signs and lip reads, and she uses Skype for chat, which gives her the ability to communicate via video link. It is good that there could be a choice in the way people communicate with the deaf community, and you will get different answers from people with different levels of hearing loss. You will also probably find that it could be easier on the other side of the conversation, as it saves the person having to type...
they can simply speak, so that could be a major benefit. As I say, this could be an advantage to businesses as well.

So you would find this more useful with businesses rather than for calls with friends and family?

SUBJECT 6: Yes, but you have to remember that the further you go down in hearing loss, the more useful this would probably become: the worse the hearing is, the more reliant they become on the text, and the less important the sound becomes. That said, if I were talking to a relative in Australia, for example, then I would use this because of the reduced cost of telephone calls... I think it's free to talk computer to computer.

Can you think of any final comments?

SUBJECT 6: It's straightforward and it works very well... I think I only saw three errors in the whole conversation, which is a better strike rate than subtitles, and it's easy to use. I'm very impressed with it, and I'm realising its potential: how you could use it in different situations. I think it would be useful to show it to the business community, to see what their thoughts are and how they would adapt it. I can also see that you would be able to save conversations, so you can go back to see what was said in an earlier conversation, including names, times and emails; it would be good for record keeping to be able to print the conversation. I think that would be really useful.
D.3. Individual Discussion with Subjects 7 & 8

How did you find the speech recognition performed?

SUBJECT 7: You asked about the quality of the speech recognition and the accuracy. I think in the little passage we had just now we only had three errors, two of which were misidentified phonemes at the beginning of words, where "Thanks [SUBJECT 7]" became "Banks [erroneous text]", and the other instance was the word "of" being shown as "off", or maybe vice versa. As errors they were so insignificant that the context of the conversation was not lost.

Have you both used ASR before?

SUBJECT 7: I've not used it since about 10 years ago, and even then it was somebody else in the office who used it frequently, and at that stage it was disappointing. I used it not very successfully, and even with a degree of training the success rate was very low. One of my lab assistants persevered with it and put an awful lot of effort into the training, and she was the only one who used it to the point where the error rate was acceptable. The others who used it had such a poor success rate that they spent more time editing the text than speaking, or writing the text in the first place.

Do you use the internet for telephone calls or text messaging?

SUBJECT 7: We both use Skype very heavily in business.

Why do you prefer using that over, for example, picking up the telephone?

SUBJECT 7: Primarily, telephone calls to West Siberia are expensive. Also, especially for an employee who is working on her own, the ability to see who you are talking to gives her more of a feeling of belonging, rather than being an isolated voice on the other end of the telephone, and it provides better interaction.

SUBJECT 8: Also, she is working in a foreign language. You can quite clearly tell from somebody's face that they haven't understood what has been said, so if I can see that she is struggling with something I have said, I can explain more clearly.
SUBJECT 7: The conversations in this case are quite often bilingual, with English and Russian being used by both sides, so being able to see the other person helps; in my case it helps with lip reading.

Did you rely on the text of the message to understand the message, or was the speech used as well? (Note: SUBJECT 7 disabled his cochlear implants for the purpose of the demonstration.)

SUBJECT 7: Completely.

If you had not disabled your implants, do you think you would rely on the speech, the text, or a combination of both?

SUBJECT 7: It depends on the quality of the speaker. There are some people I talk to in business who don't know I'm deaf, because I talk to them on a regular basis and they speak clearly anyway. There are others whose manner of speech on the telephone is such that I don't generally bother trying to understand what is being said, and I either get somebody to interpret or somebody else to take the call. My guess is that this sort of software will also struggle with that sort of speaker.

SUBJECT 8: There are also some speakers who speak so fast that even I have to say "hold on a minute", and then you get the overseas call centres who have such a poor command of English that even I find it difficult to understand what they are saying.

An aside about WER from the previous objective tests.

SUBJECT 7: That sort of error rate (15 to 25%) would really degrade conversation.
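(The 15 to 25% figure refers to the word error rate from the objective tests reported earlier. For readers unfamiliar with the metric, WER is conventionally the word-level Levenshtein edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. The following is an illustrative sketch of that calculation, not the analysis code used in this study:)

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("Thanks" -> "Banks") in a four-word utterance gives a 25% WER.
print(wer("thanks for the call", "banks for the call"))  # 0.25
```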
It is mentioned that the user has to be trained to use the software too.

SUBJECT 7: Dragon, if I put on my Air Traffic Control voice, is virtually perfect: a 100% success rate with that.

SUBJECT 7: You talk of the need for training as a potential drawback, which it could be. It may also be a potential benefit, because it may enforce a discipline on the speaker. If you are using text as an aid to speech, the discipline enforced by the program may actually mean that the recipient does not need the text, because the program has forced the speaker to slow down and to speak more clearly, which is just the thing a deaf person needs if they are hearing impaired rather than completely without hearing. This apparent drawback may in fact work in your favour.

How did you find using the software?

SUBJECT 7: I think at the moment it's at the stage we would call proof of concept. The software shows that the concept works, and I'd imagine that you wouldn't want too many "bells and whistles". I'm assuming that you'd want to use this software with people who have poor computer literacy, so I think that you would have to keep your interface very simple.

How would you compare it to products like Skype and Google Talk?

SUBJECT 7: We're used to using Skype, and I normally use Skype with a video camera on. If I'm using it with an English speaker (a relative overseas), we'd have video on or off depending on what's going on, but if I was talking to somebody in Russia, having the video on means I can see the face, and it helps me understand the poorly spoken, mispronounced English of the Russian speaker.

SUBJECT 8: One thing you don't want is the user name appearing on every line. Once you've established that you're speaking to "Fred in Australia", then unless "Jimmy in New Zealand" wants to join in the conversation, you don't want the user name or email address appearing all the time. You don't need to keep saying "this is Fred... this is Fred... this is Fred".
SUBJECT 7: I don't know what the situation is now, but I do know that the BBC did a lot of work on Teletext subtitles and the understanding of text on screens. They found that by the time subtitles came into serious use, the reading speed of many of the subtitle users was not brilliant, especially those who were signers. So with a big long lump of address at the beginning, I don't know whether they would just disregard that as something they don't need to read, or whether it would distract from the conversation. Whether that still holds true today, I'm not sure.

Were there any specific thoughts or feelings you had during the conversation?

SUBJECT 8: We mentioned this comment where we didn't know if the person was still speaking...

SUBJECT 7: Yes, once or twice, because of where the line breaks occurred, I wondered whether the program was still processing something, whether you had finished speaking, or whether the end of the sentence was just missing. I think some indicator that your microphone is still live would be worth looking at.

What are your thoughts on how the text is displayed, i.e. word by word or as bulk text?

SUBJECT 7: I don't particularly have any views, but as I understand it, scrolling text is generally generated by ASR, which is turning a known announcer's voice into subtitles as something is read. Blocks of text are normally done by a text typist, so you get two different qualities of text. I find the chunks of text slightly more natural, because we are brought up to think of text as something which occurs on paper and therefore stays there while your eyes move to read it, but it could well be that as more and more people grow up with computers, scrolling text may become a more efficient way of getting the information across. Scrolling text to me doesn't seem like a natural eye scan.

Can you think of any suggestions you might have to improve the software?
SUBJECT 8: Would you be able to combine this with a webcam, so you could see the other person as you would in Skype?
Issues with bandwidth for audio and video transmission are discussed...

SUBJECT 8: We sometimes have issues, but that is because we live in a rural area; we do get a good image on occasions.

SUBJECT 7: I believe that there are now some relatively new compression algorithms for webcams around, which are possibly a little bit demanding of the sending processor. There is sometimes an indicator on Skype which tells you that you are receiving what Skype considers "high definition", which is really quite a respectable picture indeed. It is certainly fast enough to be able to lip read with no problems. I think it would be worthwhile to investigate whether this could be used as an add-in for voice chat services, because things like Skype provide the ability to develop add-ins as open source software. I certainly know there are a number of cochlear implantees who do communicate via Skype, on the grounds that they can lip read from the picture.

SUBJECT 8: The video quality is good enough that you can read a document when it is held up to the screen.

Any other improvements?

SUBJECT 7: At this stage, once you have something that's robust enough, then really you just have to try it out and see what the acceptance is.

SUBJECT 8: If you get to the stage where you could send us the software so that we can talk without coming to Manchester, we would be willing to spend some time on that.

Would you use this application for your telephone calls (if you were using it like Skype)?

SUBJECT 7: If I was using it like Skype, then probably yes. The vast majority of my calls are business calls, and the understanding of the call is very user dependent. We could have calls coming from accounting centres where the quality of English is so poor that I have instructed staff not to bother wasting time listening to abysmal English. Some of the calls really are that bad, and no software is going to be able to cope with those scenarios.
On the other hand, some of the companies have such a good manner of speaking and are so clear that they don't need to know that I'm deaf.

SUBJECT 8: There are also some British accents or dialects that would be difficult to hear over the phone, such as a Liverpool or Birmingham accent; the speaker could have the software on their end, so that you can concentrate on the text rather than the speech.

SUBJECT 7: But remember that person (and the software) would still have to be trained. You actually have the advantage that this is enforcing discipline on the person who is using it. Potentially, if you could get a lot of companies using this software it would be superb, because of the level of discipline it would enforce on telephony. Why not consider using this as a secondary-market tool just to teach people who use telephones in companies to speak in an acceptable fashion? They would see what they say on the screen, and you could say "that is what the deaf person perceives... can you understand what you have said from the text on the screen?" Some people on the phone talk far too fast.

What do you think about the level of effort that may be required to train the software? Would people be willing to put the time in?

SUBJECT 8: I think it depends on how desperate you are to communicate. When SUBJECT 7 was completely deaf, before receiving the implant, I was desperate to communicate. Our only communication was writing.

SUBJECT 7: If I was receiving a telephone call, we would need a second earpiece, and SUBJECT 8 would repeat while I lip read, which was very, very tiring, and it needs somebody at the other end who is willing to take part in that type of scenario. I've never used Typetalk, but I understand that on a number of occasions people have refused to use Typetalk because they find the delays unnatural. Whether this would occur using this software, I'm not sure. If the training only takes, for example, two 5-minute sessions, as in Microsoft Speech, then I'd imagine that the majority of people would be willing to do the training.
SUBJECT 8: Have you ever thought about putting this software onto mobile phones? The problem is that when I ring SUBJECT 7 on his mobile, for normal business conversation it's OK, but for other general conversation he has no clue what I'm on about.

The iPhone application mentioned in the meeting with Manchester Deaf Centre is discussed.

SUBJECT 7: I think it (the software) has promise; as with all modern technologies, it's difficult to see if it will be put into common usage.

SUBJECT 8: But saying that, there are mobile technologies available for the blind which improve accessibility, so from that point of view somebody has that technology sorted. It could be one idea to look at for future development.

Can you see any potential problems or issues in using the software?

SUBJECT 7: It's not really an issue, but if you look at the previous conversation, you have nearly a third of the screen taken up by the user name. Have you considered the possibility of just using a colour scheme for the received text to identify the person? You do away with the username appearing all the time, and it's not that easy to find my responses in all that text. My responses are there, but they don't stand out, so if you were reviewing the conversation, I think using two colours would be worth considering.

Of all the things we've talked about, what do you think would be very important to include with the software?

SUBJECT 7: If the camera were to be included, it would have to be included as an option, for instances of having no webcam, low bandwidth, or users who just want privacy. I think that as a voice chat add-in it could be a really valuable tool. Skype, for example, has the ability to use add-ins, like the ability to record conversations or video.

SUBJECT 7: I used to get involved in advisory panel meetings, and we found that even in well controlled meetings I couldn't follow the meeting, due to seating arrangements for example.
Something like this software would be invaluable for letting a deaf person follow a meeting. If you've got a meeting with 8 to 10 people every 3 or 6 months, then it would not take very long for the meeting participants to train the software to the required level and have the speech shown, colour coded for each participant, on a monitor in front of the deaf person. This would also mean you have verbatim minutes of the meeting.

Any further comments?

SUBJECT 7: In principle I think it has a lot of promise, and I find it very interesting.