1 Zöe Handley Learning Sciences Research Institute (LSRI), University of Nottingham and Marie-Josée Hamel Department of French, Dalhousie University Is

1

Zöe Handley, Learning Sciences Research Institute (LSRI), University of Nottingham

and

Marie-Josée Hamel, Department of French, Dalhousie University

Is Text-to-Speech Synthesis Ready for use in CALL?

CALL 2008, Antwerp, Belgium

2

Plan
– TTS synthesis in CALL
– Evaluation
– Requirements analysis
– Readiness of TTS synthesis for CALL
– Conclusions

3

TTS synthesis

What is TTS synthesis?

– Speech synthesis “systems … allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by re-combining shorter pre-stored units” (van Bezooijen and van Heuven, 1997: 709)

– Text-to-Speech Synthesis systems allow the automatic generation of speech from text

Why use TTS in CALL?
– There is a general need in language learning and teaching for “self-paced interactive learning environments” which provide “controlled interactive speaking practice outside the classroom” (Ehsani and Knodt, 1998: 45).

http://www.acapela-group.com/text-to-speech-interactive-demo.html




4

CALL applications

Reading machine

– Talking dictionaries, texts (de Pijper, 1997; Hamel, 2003), word processors and conjugators; dictations (Santiago-Oriola, 1999; Mercier et al., 1999); and grapheme-phoneme exercises

Pronunciation model
– Auditory discrimination and repetition (Hamel, 1998; Mercier et al., 2000)

Conversational partner
– In combination with automatic speech recognition and speech understanding, the generative power of TTS synthesis can be harnessed to provide learners with interactive speaking practice, i.e. a dialogue partner (Raux and Eskenazi, 2004; Seneff et al., 2004)

Oxford Hachette 4 French Dictionary on CD-ROM

5

Benefits of TTS synthesis

Improvements on other media
– Easy creation and editing of speech samples

– Simultaneous presentation of text and speech

– Low storage requirements

– Non-human and therefore perceived as non-judgemental

Adds value
– Generation of examples on demand (Sherwood, 1981), and therefore the automatic generation of feedback, conversational turns, and exercises with speech models
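Generation on demand means the text to be spoken is produced at run time instead of being recorded in advance. The hypothetical Python sketch below generates present-tense forms of a French -er verb, the kind of material a talking conjugator or dictation exercise might hand to a synthesiser; the synthesiser call itself is not shown, and the helper names are illustrative only. It assumes regular verbs: -ger/-cer verbs (manger, commencer) are rejected, stem-changing verbs (appeler) are not detected, and elision is applied before vowels only, so it does not produce j'habite.

```python
# Hypothetical sketch: regular French -er verbs only.
PRONOUNS = ["je", "tu", "il", "nous", "vous", "ils"]
ENDINGS = ["e", "es", "e", "ons", "ez", "ent"]
VOWELS = "aeiouâéèêîôû"  # elision before h muet is deliberately not handled


def present_tense(infinitive):
    """Return the six present-tense forms of a regular -er verb."""
    if not infinitive.endswith("er") or infinitive.endswith(("ger", "cer")):
        raise ValueError("this sketch only handles regular -er verbs")
    stem = infinitive[:-2]
    forms = []
    for pronoun, ending in zip(PRONOUNS, ENDINGS):
        verb = stem + ending
        if pronoun == "je" and verb[0] in VOWELS:
            forms.append("j'" + verb)  # elision, e.g. j'aime
        else:
            forms.append(f"{pronoun} {verb}")
    return forms


def speech_prompts(infinitive):
    """One capitalised sentence per form, ready to be passed to a TTS engine."""
    return [form[0].upper() + form[1:] + "." for form in present_tense(infinitive)]
```

For example, `speech_prompts("parler")` starts with "Je parle." and `present_tense("aimer")` starts with "j'aime"; any verb can be drilled without recording a single new sample.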

6

Why evaluation?

Few CALL applications integrating TTS synthesis are available on the market

Few evaluations of TTS synthesis for the purposes of CALL have been conducted

Since the failure of the language laboratory, teachers have been sceptical about unevaluated technologies

TTS synthesis is being used in CALL in roles it has not played in applications outside CALL; the most common, perhaps only, role that TTS synthesis assumes outside CALL is that of a reading machine

7

Framework for the evaluation of TTS synthesis for use in CALL (Handley and Hamel, 2005)

1. Basic research evaluation of TTS synthesis for use in CALL
– Viability and potential benefits of the use of TTS synthesis in CALL
2. Technology evaluation of TTS synthesis for use in CALL
– Adequacy of TTS synthesis for use in CALL
3. Judgemental evaluation of the CALL application
– Potential of the CALL program to provide ideal conditions for SLA
4. Judgemental evaluation of the teacher-planned activity
– Potential of the planned activity to provide ideal conditions for SLA
5. Usage evaluation of the teacher-planned activity
– Learners’ performance in the planned activity

This framework combines the levels of evaluation recommended by Chapelle (2001) for the evaluation of CALL activities and by ELSE (1999) for the evaluation of Speech and Language Technologies (SALT).

8

Evaluations of TTS synthesis for CALL

Technology evaluations of TTS synthesis for use in CALL
• Stratil et al. (1987)
– Evaluated the quality of a Spanish TTS chip for the presentation of grammar exercises in a language laboratory.

Usage evaluations of the teacher-planned activity
– Outcome-oriented
• Santiago-Oriola (1999)
– Evaluated the use of a French TTS synthesiser for the presentation of dictation exercises.

• Hincks (2002)
– Evaluated the use of a Swedish TTS synthesiser in combination with a speech editor (re-synthesis) for teaching English lexical stress to native speakers of Swedish.

– Process-oriented
• Cohen (1993)
– Evaluated the use of a talking word processor to support literacy activities, namely writing stories, for young learners of French as a second language.

9

Requirements analysis

The evaluation process
– ISO (1999) and EAGLES (1999) guidelines
– Establish the evaluation requirements
  • Establish the purpose of the evaluation
  • Identify the types of products to be evaluated
  • Specify the quality model
– Specify the evaluation
  • Select metrics
  • Establish rating levels for metrics
  • Establish criteria for assessment
– Design the evaluation
– Execute the evaluation
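To make the three "specify the evaluation" steps concrete, the Python sketch below maps a metric's mean rating onto rating levels and checks a system against assessment criteria. The 1-5 scale, the thresholds, the level names, and the `Criterion`/`assess` helpers are all hypothetical; neither the ISO nor the EAGLES guidelines prescribe these particular values.

```python
from dataclasses import dataclass

# Hypothetical rating levels for a mean score on a 1-5 Likert scale,
# listed from the highest threshold down.
RATING_LEVELS = [(4.0, "good"), (3.0, "fair"), (0.0, "poor")]
LEVEL_ORDER = {"poor": 0, "fair": 1, "good": 2}


def rating_level(mean_score):
    """Return the first rating level whose threshold the score reaches."""
    for threshold, label in RATING_LEVELS:
        if mean_score >= threshold:
            return label
    raise ValueError("score is below every rating level")


@dataclass
class Criterion:
    metric: str     # e.g. "comprehensibility"
    min_level: str  # lowest acceptable rating level for that metric


def assess(scores, criteria):
    """A system passes only if every metric reaches its required level."""
    return all(
        LEVEL_ORDER[rating_level(scores[c.metric])] >= LEVEL_ORDER[c.min_level]
        for c in criteria
    )
```

For instance, with criteria requiring "good" comprehensibility and at least "fair" naturalness, scores of 4.2 and 3.1 pass, while a naturalness score of 2.5 fails. A role with stricter needs, such as a pronunciation model, could simply be given a stricter list of criteria than a reading machine.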

CALL requirements
“When the language competence of the system begins to outstrip that of some of the better second language users, such systems become useful adjunct tools” (Keller and Zellner-Keller, 2000)

10

CALL requirements analysis

Ideal conditions for Second Language Acquisition (SLA) (Chapelle, 2001)
– Language learning potential
• Goals of SLA
  – Communicative competence
  – Quality of the output
    – Primary requirement: comprehensibility/intelligibility
    – Secondary requirements: accuracy and naturalness
    – At both the level of individual speech sounds and the prosodic level
• Focus on form
  – Flexibility
    – Speech rate, pitch
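Flexibility over speech rate and pitch is commonly requested from a synthesiser through markup. As an illustration only, the Python sketch below wraps a learner prompt in W3C SSML 1.1 `<prosody>` markup; whether a particular synthesiser (including the systems evaluated here) accepts SSML, and which `rate` and `pitch` values it honours, is an assumption to check against its documentation. The function name `prosody_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def prosody_ssml(text, rate="slow", pitch="default", lang="fr-FR"):
    """Wrap text in an SSML 1.1 document asking for a given rate and pitch.

    rate/pitch use SSML keyword values such as "x-slow", "slow", "low",
    "high"; the text is XML-escaped so learner input cannot break the markup.
    """
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody></speak>"
    )
```

For a repetition exercise, `prosody_ssml("Je voudrais un café.", rate="x-slow", pitch="low")` asks for a very slow, lower-pitched model of the sentence, while the same text at the default rate can serve as the target.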

11

Explorative investigation (Handley and Hamel, 2005)

Research questions
1. Do the different roles identified impose different requirements on the quality of speech synthesis?
2. Does comprehensibility account for acceptability for use in CALL?

Method
– 17 French teachers
– One research TTS system, FIPSvox from the University of Geneva
– 3 roles: (1) reading machine, (2) pronunciation model, and (3) conversational partner
– Likert scales: (1) comprehensibility, (2) acceptability, and (3) appropriateness
– Word pointing paradigm (van Santen, 1993)

Results
1. Most suitable as a dialogue partner; least suitable as a pronunciation model.
2. Comprehensibility is not the only requirement: accuracy and naturalness matter, as do register and expressiveness.

12

Is TTS synthesis ready for use in CALL?

Research questions
– Do the different roles identified impose different requirements on the quality of speech synthesis?
– Is TTS synthesis ready for use in CALL?

Design
– Within-subjects, N = 17 French teachers
– Dependent variables
  – Quality of the speech output
  – Acceptability
  – Adequacy
– Independent variables
  – Role of TTS in CALL: (1) Reading Machine (RM), (2) Pronunciation Model (PM) at the (a) segmental level and (b) suprasegmental level, and (3) Conversational Partner (CP)

– TTS synthesis system

13

Systems evaluated

1. http://www.research.att.com/~ttsweb/tts/demo.php#top (French, English)

2. http://212.8.184.250/tts/demo_login.jsp (French, English)

3. http://www.multitel.be/TTS/layout.php?page=eLite_demo (French, English)

4. http://www.acapela-group.com/text-to-speech-interactive-demo.html (French, English)



14

Questionnaire

MOS-CALL
– ITU-T Overall Quality Test
– MOS-X (Polkosky and Lewis, 2003)

On-line presentation of questionnaire
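As a minimal sketch of how category ratings from such a questionnaire can be turned into mean ratings per system and role, the Python below averages five-point scores. The label-to-score mapping (Excellent = 5 down to Bad = 1) is the familiar absolute-category scale and is assumed here; the actual MOS-CALL items and anchors are not reproduced on the slide, and `mean_opinion_scores` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Assumed five-category labels for an overall-quality item.
CATEGORY_SCORES = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}


def mean_opinion_scores(responses):
    """Average category scores per (system, role) pair.

    responses: iterable of (system, role, label) tuples, one per rating.
    """
    scores = defaultdict(list)
    for system, role, label in responses:
        scores[(system, role)].append(CATEGORY_SCORES[label.strip().lower()])
    return {key: mean(values) for key, values in scores.items()}
```

With hypothetical responses `[("System 4", "RM", "Good"), ("System 4", "RM", "Excellent"), ("System 4", "CP", "Fair")]`, the result maps `("System 4", "RM")` to 4.5 and `("System 4", "CP")` to 3.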

15

Is TTS synthesis ready for use in CALL?

Different TTS synthesis systems are most suitable for use in different roles

This reinforces the need to evaluate each TTS synthesis system

System 4 is ready for use in all applications where TTS synthesis adds value

Mean ratings of adequacy

Mean ratings of acceptability

16

System 1: AT&T Next-Gen (Alain)

Mean ratings of quality of output

17

System 2: Nuance Vocalizer (Julie)

Mean ratings of quality of output

18

System 3: eLite (Vincent)

Mean ratings of quality of output

19

Do the different roles have different requirements?

Differences in adequacy were statistically significant for systems 2 and 4 (χ²r = 8.010, df = 3, p = 0.046; χ²r = 8.063, df = 3, p = 0.045, respectively)

but not for systems 1 and 3 (χ²r = 2.352, df = 3, p = 0.503; χ²r = 3.467, df = 3, p = 0.325, respectively)

Differences in acceptability were not significant for any system (system 1: χ²r = 6.616, df = 3, p = 0.085; system 2: χ²r = 6.303, df = 3, p = 0.098; system 3: χ²r = 3.194, df = 3, p = 0.363; system 4: χ²r = 5.547, df = 3, p = 0.163)

Mean ratings of adequacy

Mean ratings of acceptability
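χ²r is the conventional symbol for Friedman's rank test, which suits this within-subjects design (each teacher rates every role, giving four conditions and df = 3); we assume that is the test reported here. The Python sketch below computes the statistic from a ratings matrix with a tie correction, since Likert ratings tie often, and obtains the p-value from the chi-square distribution using closed forms valid for integer df. Function names are illustrative, and whether the original analysis used the same tie correction is not stated.

```python
import math


def average_ranks(row):
    """Rank one participant's ratings (1 = lowest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = shared
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2
    if df % 2 == 0:  # even df: exp(-h) * sum of h^i / i! for i < df/2
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return math.exp(-h) * total
    # Odd df: erfc term plus half-integer gamma terms.
    total = math.erfc(math.sqrt(h))
    for i in range(1, (df + 1) // 2):
        total += math.exp(-h) * h ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def friedman(ratings):
    """ratings[s][c]: rating by participant s in condition c (one column per role).

    Returns (chi-square_r, df, p).
    """
    n, k = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[c] for row in ranked) for c in range(k)]
    expected = n * (k + 1) / 2
    between = sum((r - expected) ** 2 for r in rank_sums)
    # Tie-corrected denominator; equals n*k*(k+1)*(k-1)/12 when there are no ties.
    total = sum(r * r for row in ranked for r in row) - n * k * (k + 1) ** 2 / 4
    if total == 0:
        raise ValueError("every participant rated all conditions identically")
    stat = (k - 1) * between / total
    return stat, k - 1, chi2_sf(stat, k - 1)
```

Applied to a hypothetical 17 × 4 list of adequacy ratings (one row per teacher; columns RM, PM segmental, PM suprasegmental, CP), `friedman` returns the statistic, df = 3, and p. As a check on the p-value step, `chi2_sf(8.010, 3)` gives about 0.046, matching the value reported for system 2.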

20

Conclusions

Some French TTS synthesis systems are reaching readiness for use in CALL applications in which TTS synthesis adds value

To fully meet the requirements of CALL, more attention needs to be paid to accuracy and naturalness, in particular at the prosodic level, and to expressiveness
– Expressive speech synthesis is the focus of much current research (Campbell et al., 2006)

This readiness may not extend to all languages, since different languages pose different problems for TTS synthesis

It will not be long before learners are able to benefit from the support of an untiring, non-judgemental substitute native speaker, available 24/7 in CALL applications.
