
4. Speech Synthesis


MMP - M204 Information Design/Cross Media Publishing - Spoken Language Interfaces - Dr. Ingrid Kirschning (UDLA)

– Introduction to Speech synthesis
– TTS
– Unit Selection
– Animated Characters that speak


Introduction to Speech synthesis

The term Speech synthesis refers to the technologies that enable computers or other electronic systems to output simulated human speech.

Two qualities matter most: intelligibility and naturalness.

Naturalness is usually judged relative to the situation in which the speech is used.


TTS

Text-to-Speech (TTS) translates text into speech, using phonetic rules to transcribe the text and then speak it.

Requires information about:
– Abbreviations (Dr., Nr., etc., …)
– Specific readings of numbers and symbols ($, #, …)
– Reading of time formats (1:45, 13:45, …)
– Pronunciation of each letter in every context (“cat”, “tar”, “Jane”, …)
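The first items on this list (abbreviations, symbols, time formats) can be sketched as a text normalization step that runs before any pronunciation rules. This is only a minimal illustration: the lookup tables, the function names and the chosen readings (for example "one forty five" for 1:45) are assumptions, not part of the slides, and a real TTS front end would need far larger tables.

```python
import re

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "Nr.": "Number", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 59 (enough for clock times)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def expand_time(match: re.Match) -> str:
    """Read a clock time such as 1:45 as 'one forty five'."""
    hours, minutes = int(match.group(1)), int(match.group(2))
    if hours > 23 or minutes > 59:
        return match.group(0)  # not a valid time, leave it untouched
    if minutes == 0:
        return f"{number_to_words(hours)} o'clock"
    if minutes < 10:
        return f"{number_to_words(hours)} oh {number_to_words(minutes)}"
    return f"{number_to_words(hours)} {number_to_words(minutes)}"


def normalize(text: str) -> str:
    """Expand abbreviations, clock times and a few symbols into words."""
    for abbreviation, reading in ABBREVIATIONS.items():
        text = text.replace(abbreviation, reading)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", expand_time, text)
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)  # $5 -> 5 dollars
    text = re.sub(r"#(\d+)", r"number \1", text)    # #3 -> number 3
    # Other digits are left as they are in this sketch.
    return " ".join(text.split())


print(normalize("Dr. Smith arrives at 1:45"))
```

The last item on the list, letter pronunciation in context, is the hard part and is not covered here.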


Types of Speech Synthesis

Concatenative: based on human speech samples
  – Diphones
  – Words
  – Variable length units

Formant Synthesis: simulates human speech electronically using phonological rules
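As a rough illustration of the concatenative approach with diphones, the sketch below splits a phoneme sequence into overlapping pairs and joins the recorded samples for each pair. The phoneme symbols, the silence marker "_" and the toy inventory are assumptions made for the example; a real system would store audio recordings and smooth the joins.

```python
from typing import Dict, List


def to_diphones(phonemes: List[str]) -> List[str]:
    """Return the diphone units covering a phoneme sequence.

    The sequence is padded with silence ('_') so that the first and last
    phonemes also get a transition unit.
    """
    padded = ["_"] + phonemes + ["_"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def concatenate(units: List[str], inventory: Dict[str, List[float]]) -> List[float]:
    """Join the stored samples of each unit into one output signal."""
    signal: List[float] = []
    for unit in units:
        signal.extend(inventory[unit])
    return signal


units = to_diphones(["k", "ae", "t"])
print(units)
```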


Animated Characters that speak

– Baldi & Ms Gurney
– Tongue models
– Visemes
– Expressions: fear, anger, happiness, …
– Visible speech: vowels, consonants
– Real-time animation by moving from one target position to the next
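Moving from one target position to the next can be sketched as interpolation between two sets of face parameters. The slides do not say which interpolation is used, so the linear blend below is an assumption, as are the parameter names.

```python
from typing import Dict, List


def interpolate(start: Dict[str, float], end: Dict[str, float],
                steps: int) -> List[Dict[str, float]]:
    """Return `steps` frames that move linearly from `start` to `end`.

    Each frame maps a (hypothetical) face parameter to its value; the
    final frame equals the target position.
    """
    frames = []
    for i in range(1, steps + 1):
        t = i / steps
        frames.append({name: start[name] + t * (end[name] - start[name])
                       for name in start})
    return frames


neutral = {"jaw_open": 0.0, "lip_round": 0.0}
viseme_a = {"jaw_open": 1.0, "lip_round": 0.5}
for frame in interpolate(neutral, viseme_a, steps=4):
    print(frame)
```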


Studying Expressions (Viseme A)


3D Model (neutral face)


3D Model (angry)


5. Natural Language

– Introduction to Natural Language Understanding (NLU)
– Examples of structured queries and natural language
– Parsing
– Natural Language Generation (NLG)
– Interface Design Challenges by Candace Kamm, AT&T Labs


Introduction to Natural Language Understanding (NLU)


Natural Language (NL)

– This technology, which has found its way into current applications, is used for both input and output.
– Because of the multi-disciplinary community involved in speech user interface development, the term “natural language” has several connotations.
– Linguistic: it is the language spoken and written by a given culture (English, German, …). It can be discussed in terms of phonology, grammar, semantics and pragmatics.
– From the standpoint of interaction, we focus on pragmatics: we study how people use language and how this can be used to control computers. We also want a computer to respond consistently under different circumstances.


Natural Language

It is common to use special languages (artificial languages) to solve specific problems, for example: mathematics, music, chemistry, computer languages.

Command languages are specialized to control a computer. Examples:
– MS-DOS commands: dir, del, …
– UNIX console commands: ls, rm, …

In natural language a user would say: “list all the files in this directory” or “delete the file …”
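The gap between the two styles can be illustrated with a small mapping from natural-language requests to UNIX commands. The patterns and the function name below are hypothetical and cover only the two requests quoted above; anything else is rejected.

```python
import re
from typing import Optional

# Hypothetical patterns: each pairs a request shape with a command template.
PATTERNS = [
    (re.compile(r"^(list|show) (all )?(the )?files( in this directory)?$",
                re.IGNORECASE), "ls"),
    (re.compile(r"^(delete|remove) the file (?P<name>\S+)$",
                re.IGNORECASE), "rm {name}"),
]


def to_command(utterance: str) -> Optional[str]:
    """Translate a request into a command, or None if it is not understood."""
    text = utterance.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(**match.groupdict())
    return None


print(to_command("list all the files in this directory"))
print(to_command("Delete the file notes.txt"))
```

The sketch only returns the command string; it does not execute anything.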


Specialized Usage of NL

The goal is to achieve continuous natural language interaction with machines; for now, however, it can only be applied to more specialized uses.

Natural Language Database Queries
– Text database queries are increasingly popular and commercially available.
– They analyze natural language requests grammatically and apply additional linguistic analysis and business rules to create a request and return results.


Examples of queries

A structured query language may be faster than natural language queries, but it requires the user to be trained to use it:
– Altavista (www.altavista.com) is a search engine. In Altavista you can type in a question in natural language: what italian restaurants are in seattle?
– Yahoo (www.yahoo.com) is a directory. In Yahoo you would type: italian and restaurant and seattle


Parsing

– This is a technology that analyzes a text.
– It works similarly to word-spotting.
– It scans the text and maps each word against a grammar to find the important keywords.
– Advanced systems will search for synonyms, correct mistakes between singular and plural usage, etc.
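A minimal keyword-spotting sketch along these lines is shown below. It uses flat word lists in place of a real grammar, and the stop words, synonym table and plural rule are all assumptions chosen for the example.

```python
import re
from typing import List

# Hypothetical vocabulary standing in for a grammar.
STOP_WORDS = {"what", "which", "are", "is", "in", "the", "a", "of"}
SYNONYMS = {"eatery": "restaurant", "diner": "restaurant"}


def singular(word: str) -> str:
    """Very naive plural handling: drop a final 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keywords(text: str) -> List[str]:
    """Scan the text and keep the normalized content words."""
    found = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        word = singular(word)
        found.append(SYNONYMS.get(word, word))
    return found


query = keywords("what italian restaurants are in seattle?")
print(" and ".join(query))
```

Joining the keywords with "and" turns a natural language question into a keyword query of the kind a directory search expects.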


Natural Language Generation

Natural language is also used to generate understandable output for speech synthesis and for text output.

Examples: medical and legal reporting, generation of weather reports.
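Weather report generation is often the simplest case to picture: a structured record goes in and a sentence comes out. The template approach, the field names and the wording below are assumptions for illustration, not the method of any particular system.

```python
def weather_report(data: dict) -> str:
    """Turn a structured forecast record into an English sentence."""
    sentence = (f"{data['city']}: {data['condition']}, with a high of "
                f"{data['high_c']} and a low of {data['low_c']} degrees Celsius.")
    # Optional field: add advice only when rain is likely.
    if data.get("rain_chance", 0) >= 50:
        sentence += " Take an umbrella."
    return sentence


print(weather_report({"city": "Puebla", "condition": "sunny",
                      "high_c": 26, "low_c": 12}))
```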

However, although natural language seems like the natural way to interact, it has not been proven to be optimal. Which is faster and more accurate: typed NL, spoken NL or GUIs?


At present, spoken or written NL with a reduced vocabulary appears to work better in some situations, but it requires users to know its limitations.

A multi-modal approach is likely to be the best solution in many settings, where the user can decide the kind of interaction (mouse, speech or typing), and where these complement each other.


Interface Design Challenges by Candace Kamm, AT&T Labs

– People don’t wait their turn in an interaction.
– People use different words to reach the same goal (“yes”, “ok”, “sure”, “yep”, …).
– The interfaces need to handle the limitations of the technology, work around them or warn the user of what could happen, etc.
– Newer technologies pose new challenges (wireless telephones, etc.).


The problem is to teach people which words the system knows and which domain it can handle.

Collecting data on the exact utterances people are going to say is very important.

The systems require very robust error-correction and error-recovery capabilities.

If you want an application to work today, you have to focus it as narrowly as possible and try to guide the user through it.


Speech Portals

• Introduction

• Voice eXtensible Markup Language VXML

• VUI – Voice User Interface

• Web-GUI vs. VUI


Speech Portals

– Developed to provide information (banks, companies, events, …)
– These can be interactive or not
– Interactive portals can be programmed via various technologies; one of them is VXML.


VXML

– Voice eXtensible Markup Language, or VXML, is an XML-based language for developing speech interfaces.
– Users access a VXML application by dialing its phone number.
– A VXML Gateway accesses the internet to retrieve the web page associated with that number and interprets it.
– The Gateway manages the interaction using ASR and speech synthesis, and is the link between the telephony service and the internet.
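To give an idea of what such a page looks like, the sketch below builds a very small VoiceXML 2.0 document with Python's standard XML library. The element names (vxml, form, block, prompt) and the namespace come from the VoiceXML specification; the form id, the prompt text and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_greeting_document(prompt_text: str) -> str:
    """Return a minimal VoiceXML page that speaks one prompt."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "greeting"})  # hypothetical id
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = prompt_text  # rendered by the gateway's speech synthesis
    return ET.tostring(vxml, encoding="unicode")


print(build_greeting_document("Welcome to the speech portal."))
```

A web server would return this text when the gateway requests the page for the dialed number; the gateway then interprets it and plays the prompt to the caller.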


VXML Network Interactions

Device: VXML Gateway, Web Server

Transport: Telephony (PSTN or ISDN), Internet


The Voice Portal Components

Voice Portal
– Audio Resource
– Telephony Resource
– ASR Resource
– TTS Resource
– TCP/IP Resource
– VoiceXML Browser/Interpreter


VXML enabling Network

[Diagram labels: Voice Portal, Web Servers, Internet, Telephony Network, TCP/IP, TCP/IP]


VUI – Voice User Interface

A user interface is the part that a user interacts with. At a basic level a VUI should:
– Provide users with mental models of how the application works and what task it supports
– Collect user input in the form of spoken input or DTMF sounds (telephone keypad)
– Deliver system output to the telephone receiver
– Support users in task completion
– Support recovery from user or system errors
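These requirements can be sketched as a small menu loop that accepts either a spoken word or a DTMF key, re-prompts on errors and gives up gracefully after a few attempts. The menu options, prompts, attempt limit and the hand-off to an operator are assumptions for illustration; input arrives here as plain strings rather than audio.

```python
from typing import Callable, Iterable, Optional

# Hypothetical menu: DTMF keys and spoken words lead to the same choices.
DTMF_TO_CHOICE = {"1": "balance", "2": "transfer"}
SPOKEN_CHOICES = {"balance", "transfer"}
MENU = "Say balance or press 1. Say transfer or press 2."


def interpret(raw: str) -> Optional[str]:
    """Map a recognized word or a keypad digit to a menu choice."""
    token = raw.strip().lower()
    if token in DTMF_TO_CHOICE:
        return DTMF_TO_CHOICE[token]
    if token in SPOKEN_CHOICES:
        return token
    return None


def run_menu(inputs: Iterable[str], say: Callable[[str], None],
             max_attempts: int = 3) -> Optional[str]:
    """Prompt, collect input, confirm, and recover from errors."""
    say("Main menu. " + MENU)
    for attempt, raw in enumerate(inputs, start=1):
        choice = interpret(raw)
        if choice is not None:
            say(f"You chose {choice}.")
            return choice
        if attempt >= max_attempts:
            break
        say("Sorry, I did not understand. " + MENU)  # error recovery
    say("Transferring you to an operator.")
    return None


run_menu(["uh", "2"], print)
```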


VUI User Characteristics

Voice applications should be targeted to well-defined user groups, but some common characteristics of VUI users help to understand how they differ from PC GUI users:
– Limited PC/Internet experience
– Mobile environment
– Single I/O mode


Web-GUI vs. VUI: Web

– Back Button
– Home Button, Home link on the web page
– Screen layout with colors, theme graphics
– Pop-up windows to indicate errors and recovery
– Help link
– Links to other web pages
– Form input, selection lists, radio buttons
– In-progress/status indicator


Web-GUI vs. VUI: VUI

– Voice commands to return to previous step
– Voice commands to take the user to a known starting point (main menu)
– Recorded announcements and TTS voices, speaking style, gender and tone branding
– Tones, TTS, or recorded messages to indicate errors and recovery
– Help messages
– System functions programmed to prompt input
– VoiceXML forms associated with prompts
– Audio hourglass: tone, sound, music or message that indicates what the system is doing


7. Spoken Dialogue Systems

– Introduction
– Basic components
– CU Communicator
– Galaxy Architecture


Basic components

– Speech Synthesis
– Parser
– Dialogue Management
– Natural Language Generation
– Speech Recognition
– Application / Database
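One plausible way these components fit together in a single dialogue turn is sketched below: recognition, parsing, dialogue management with a database lookup, language generation and synthesis, in that order. Every function is a toy stand-in, and the flight data, the frame fields and the ordering itself are assumptions made for the example.

```python
import re

# Hypothetical application database.
FLIGHTS = {("boston", "denver"): ["UA 101", "DL 202"]}


def recognize(audio: str) -> str:
    """Stand-in for speech recognition: the 'audio' is already text."""
    return audio.lower()


def parse(text: str) -> dict:
    """Extract origin and destination, if the user mentioned them."""
    match = re.search(r"from (\w+) to (\w+)", text)
    if match is None:
        return {}
    return {"origin": match.group(1), "destination": match.group(2)}


def manage_dialogue(frame: dict) -> dict:
    """Decide the next system action and query the database if possible."""
    if "origin" not in frame:
        return {"act": "ask_route"}
    flights = FLIGHTS.get((frame["origin"], frame["destination"]), [])
    return {"act": "inform", "flights": flights, **frame}


def generate(response: dict) -> str:
    """Turn the system action into a sentence."""
    if response["act"] == "ask_route":
        return "Where are you flying from and to?"
    route = f"from {response['origin'].title()} to {response['destination'].title()}"
    if not response["flights"]:
        return f"I found no flights {route}."
    return f"Flights {route}: {', '.join(response['flights'])}."


def synthesize(text: str) -> str:
    """Stand-in for speech synthesis: tags the text instead of making audio."""
    return f"<speech>{text}</speech>"


def turn(audio: str) -> str:
    """Run one user utterance through all components."""
    return synthesize(generate(manage_dialogue(parse(recognize(audio)))))


print(turn("I want to fly from Boston to Denver"))
```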


CU Communicator

CSLR, University of Colorado

– Financed by DARPA (beginning April 1999)
– GALAXY Architecture
– Interface via telephone line
– Reservation of flights, hotels and car rental
– Robust parsing and event-driven dialogue management


Galaxy Architecture

HUB (central), connected to:
– NLG
– Speech Recognition
– Speech Synthesis
– NLP
– Dialogue Management
– Audio
– Reliability Server
– Database
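The hub idea can be sketched as a router: the servers never call each other, and every message passes through the hub. The class below is a toy model only. The server names, the dictionary used as a message and the fixed route list are assumptions; a real hub would take its routing from its own configuration.

```python
from typing import Callable, Dict, List

Message = dict
Server = Callable[[Message], Message]


class Hub:
    """Toy hub that passes a message through registered servers."""

    def __init__(self) -> None:
        self.servers: Dict[str, Server] = {}
        self.log: List[str] = []  # order in which servers were called

    def register(self, name: str, server: Server) -> None:
        self.servers[name] = server

    def run(self, message: Message, route: List[str]) -> Message:
        """Send the message to each server on the route, in order."""
        for name in route:
            self.log.append(name)
            message = self.servers[name](message)
        return message


hub = Hub()
hub.register("recognizer", lambda m: {**m, "text": m["audio"].lower()})
hub.register("nlp", lambda m: {**m, "intent": "greet" if "hello" in m["text"] else "unknown"})
print(hub.run({"audio": "HELLO"}, ["recognizer", "nlp"]))
```

Because the servers only know the hub, one of them can be swapped out without touching the others.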


The End

“By becoming a user you will be able to understand the tasks and appreciate the constraints involved”