
© 2007 IBM Corporation

Speech Transcription for Broadcast Activities:

The science, the art, and business realities

Sara H. Basson

Michael Picheny

Bhuvana Ramabhadran

IBM T.J. Watson Research Center


Agenda

Captioning and Transcription: The need

The options

Automated speech transcription: state of the art

Is it ready for prime time? – samples from network transcripts

Quality control

Near-term solutions

The future


Lack of Captioning and Transcription – The Problem

Proliferation of multimedia information

Audio: not always the medium of choice

– Violates accessibility

• 22,000,000 Americans listed as deaf or hard of hearing

• Aging users

US Federal Government: the 2001 amendment to Section 508 of the Rehabilitation Act mandates that information federal agencies provide to the public or to their employees be accessible.

Time for editing (= cost of captioning) decreases as speech recognition accuracy improves.


Transcription of Audio Material: It’s the Law

Telecommunications Act of 1996:

100% of new English-language programming must be captioned by 2006

100% of Spanish-language programming must be captioned by 2010


Transcription Contrasted with Other Speech Recognition

Transcription

• Closed captioning, general dictation, call center data mining, government intelligence applications

• Unconstrained speech; conversational; large vocabulary; high resource; telephone, broadcast, speeches

Transaction

• “For mortgage rates, say or press 1…”, “Please say your tracking number…”, name dialer

• More constrained; more directed; large vocabulary; lower resource; telephone

Embedded

• Direction giving in car, spoken commands in car, phrase translation on a PDA

• Most constrained; most directed; smaller vocabulary; lowest resource; embedded in a device


Audio requiring transcription/captioning

Webcasts

Podcasts

Television programming

Movies

Digitized lectures

e-Learning materials

Corporate training

Meetings

Conferences

Tourist information

Medical transcription

Legal transcription

Call center data

= Strong accessibility requirement (user demand, and corporate/legal mandates)


Speech Recognition Challenges Over Time

• Connected Digit Sequences (TI Digits)

• TIMIT Acoustic-Phonetic Continuous Speech Corpus

• Broadcast News (BN)

• Speech in Noisy Environments (SPINE)

• Switchboard (SWB): telephone conversations (about 70 topics)

• MALACH Corpus

Increasing complexity

[Figure: three charts of word error rate (%) over time. Panels: "Progress in Base Technology Research", "Progress in Conversational Speech", and "Progress in IBM Speech Products". Year axes span 1990–2010 and 2001–2005; word error rate axes are marked at 1, 10, and 100. Labeled series: IBM Superhuman Speech Project, NIST Benchmarks, IBM Embedded ViaVoice in Car, IBM WebSphere Voice Server (Telephony), Human Performance (Conversational Telephony).]

The NIST benchmark uses different test datasets each year, focusing on conversational speech.

Base speech recognition technology has improved steadily over the last 15 years. Current error rates are low enough for many practical applications.

Average error rates for 10 simple tasks (digits, name dialing, etc.). In-car tests are performed at several speed/noise levels.


MALACH: A challenging speech corpus

Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32 languages.

Goal: improved access to large multilingual spoken archives

Challenges:

• Emotional speech: young man they ripped his teeth and beard out they beat him

• Disfluencies: A- a- a- a- band with on- our- on- our- arm

• Frequent interruptions: CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN


Named Entity Detection in Segmentation

my dad was a traveling salesperson man and was a good provider we I cannot complain as a child we had a pretty good life and it started in nineteen thirty three Hitler came to power and started first with the communist started trouble then started with the Jews and I felt already in school when I went to school they put me in the last row of the class because I was Jewish how how old were you when you first noticed that you were treated differently I was seven seven years old this was my first second grade going to to school it started I looked I looked fairly dark I don't look like a real German blue eyes and blond I was beaten up in in school by the youngsters and I was afraid to go to school so my father decided my mother was born in Oswiecim this became Auschwitz later on the famous infamous place to go to Oswiecim to visit her grandmother per- a lot of family live in Oswiecim so our family went to Oswiecim we stayed there about a year and we picked up a little bit of the Polish language I started school kind of in the village and it was pretty nice we had a lot of family there cousins and and uncles and we stayed there till nineteen thirty four and my dad decided that it calmed down in Berlin we should come back we did not believe that really it will grow to something big this Hitler so we came back to Berlin and my parents put me in a a Jewish boys school was called Kaiserstrasser and we lived pretty much in the center of...

31 named entity tags:

Person, Location, Organization, Country, Cardinal number, Money, Date, Duration, Age, Ordinal number, Percentage, Animal, Plant, Substance, Occupation, Disease…


Captioning audio: What are the options?

Options and issues:

• Stenographers: cost, availability

• Automatic speech recognition: performance for speaker-independent, any-topic, multi-speaker audio with noisy backgrounds…


Captioning and transcribing audio material: Additional Advantages

Text-based search vs. audio-based search

Reading text: faster than listening to the auditory equivalent

Second language learners

Individuals with certain learning disabilities


Understandability….ASR vs. stenocaptioning: Manageable errors

ASR: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning IN ending a twelve day mission

TRUTH: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning ** ending a twelve day mission

ASR: since the diet drug combination FEN fen was pulled off the market some dieters **** been looking for something that would work as well we will see what's in the works

TRUTH: since the diet drug combination PHEN fen was pulled off the market some dieters HAVE been looking for something that would work as well we will see what's in the works
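The ASR/TRUTH pairs above are word-level alignments, and the usual summary of such an alignment is word error rate (substitutions, insertions, and deletions divided by the number of reference words). The following is a minimal sketch, not the scoring tool used for these slides; the function names are hypothetical, and it assumes the `**` tokens are alignment placeholders that should be dropped before scoring.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and drop '**' alignment placeholders."""
    return [w for w in text.lower().split() if w.strip("*")]


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Errors divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


truth = ("since the diet drug combination PHEN fen was pulled off the market "
         "some dieters HAVE been looking for something that would work as well "
         "we will see what's in the works")
asr = ("since the diet drug combination FEN fen was pulled off the market "
       "some dieters **** been looking for something that would work as well "
       "we will see what's in the works")

print(f"WER: {word_error_rate(truth, asr):.1%}")
```

For the diet drug example this counts 2 errors (one substitution, one deletion) against 31 reference words, roughly 6.5%. The metric weights every word equally, so it does not by itself separate manageable errors from distracting ones.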


Understandability….ASR vs. stenocaptioning: Distracting/confusing

ASR: ** TOOK IT makes a lot of FOLKS and also ** THAT e. mail volleys more than twice pick up the phone

TRUTH: O. K. THAT makes a lot of SENSE and also IF AN e. mail volleys more than twice pick up the phone

ASR: STAY connected through e. mail has become very common in a lot of homes IN on the job but ********* on how it's used it can be terrific FOR disastrous we will look at some e. mail problems THAT possible solutions

TRUTH: STAYING connected through e. mail has become very common in a lot of homes AND on the job but DEPENDING on how it's used it can be terrific OR disastrous we will look at some e. mail problems AND possible solutions

ASR: so they do not have to make their own interpretation makes a lot of THINGS another tip TO write an e. mail IS WHAT IT a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead

TRUTH: so they do not have to make their own interpretation makes a lot of SENSE another tip TOO write an e. mail AS YOU WOULD a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead


Text and punctuation


Quality control for broadcast captioning

Thursday, July 05, 2007

Closed Captions On Ohio TV: 24/7 Gibberish Dished To The Disabled


Quality control for Broadcast Captioning

Q: Do captions have to meet accuracy requirements, such as having only so many spelling errors per program?

A: At present, captions are not required to meet any particular quality or accuracy standards. The Federal Communications Commission concluded that program providers have incentives to offer high quality captions, in keeping with the overall quality of the programs they offer. The FCC also concluded that it would be difficult to develop and monitor quality standards at this time. However, viewers may let video providers know whether they are satisfied with the captions through purchases of advertised products, subscriptions to program services, or contacts with providers concerning the programs.

The above information has been excerpted from the FCC guidelines and the Captioned Media Program of the National Association of the Deaf.


Using ASR for captioning… incrementally… UK Media and re-speaking


Using ASR for Broadcast Captioning… incrementally… Protitle Live System

• Enables creation of subtitles in all major languages, using speech recognition

• Functions

– Correction in real time

– Validation in real time

• Timing: total cycle time between 2 and 7 seconds, 5 seconds on average

• Economics: re-speaking costs 1/10th as much as a real-time stenographer


Using ASR for Broadcast Captioning… incrementally… Real-time editing

Assume: speaker obtains 80 percent ASR accuracy when speaking at a rate of 150 words a minute (that is, 30 errors per minute)

Editor needs to correct 15 words in a minute to increase the accuracy to 90 percent.

– By choosing the 15 most important errors, some of the remaining 15 errors may not detract significantly from understanding.

In classrooms in the UK and in other countries, disabled students have people taking notes for them who are trying to type or write much faster than 15 words/minute to record as much as possible. If, instead of the note taker trying to record everything, the speaker used speech recognition, the note taker need only type the corrections.

People can read four or more times faster than somebody speaks. Therefore: possible to do ‘something else’ when reading words displayed at speaking speeds

Real-time editing can be separated into three activities:

– Finding the error and highlighting it

– Entering the correction

– Replacing the error with the correction

Using foot pedals to move the highlight to the exact position and to trigger the replacement could enable the hands to remain free for entering the corrections.

Source: Professor M. Wald, Southampton University
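The editing workload on this slide is simple arithmetic, sketched below for other speaking rates and accuracy targets. This is an illustrative calculation only; the function names are hypothetical, and it assumes each correction fixes exactly one word error and introduces none.

```python
def errors_per_minute(words_per_minute, accuracy_pct):
    """Word errors produced per minute at a given ASR accuracy (in percent)."""
    return words_per_minute * (100 - accuracy_pct) / 100


def corrections_per_minute(words_per_minute, current_pct, target_pct):
    """Corrections an editor must make each minute to reach the target accuracy.

    Assumes one correction removes exactly one word error.
    """
    current = errors_per_minute(words_per_minute, current_pct)
    allowed = errors_per_minute(words_per_minute, target_pct)
    return max(0, current - allowed)


# The slide's scenario: 150 words a minute at 80 percent accuracy, target 90 percent.
print(errors_per_minute(150, 80))           # errors before editing
print(corrections_per_minute(150, 80, 90))  # corrections needed per minute
```

With the slide's figures this gives 30 errors per minute before editing and 15 corrections per minute to reach 90 percent.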


Automated measures of accuracy

Proposal from the WGBH National Center for Accessible Media (NCAM)

Use language-processing tools to develop an automated caption accuracy assessment system for real-time captions on live news programming

Can text-based data mining and speech-to-text technologies produce meaningful data about stenocaption accuracy?

– Explore the capabilities of data mining software agents to identify discrepancies between errors contained within stenocaption data sets and speech-to-text data sets, and generate a caption accuracy analysis of the data set under review.

Through these methods, the goal is to:

Improve the ability of the television community to monitor and maintain the quality of live captioning they offer to viewers who are deaf or hard of hearing

Ease the current burden on caption viewers to document and advocate for comprehensible captions.
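One way to picture the discrepancy-finding step is a word-level diff between the stenocaption text and the speech-to-text output for the same audio. The sketch below uses Python's standard `difflib`. It is not NCAM's system, the function name is hypothetical, and the misrecognized word in the example is invented for illustration.

```python
from difflib import SequenceMatcher


def find_discrepancies(steno_text, asr_text):
    """List the word spans where stenocaption and ASR output disagree.

    Each entry is (tag, steno_words, asr_words), where tag is one of
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    steno = steno_text.lower().split()
    asr = asr_text.lower().split()
    matcher = SequenceMatcher(a=steno, b=asr, autojunk=False)
    return [
        (tag, steno[i1:i2], asr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical pair: the two sources disagree on a single word.
for item in find_discrepancies("do not bury the lead", "do not berry the lead"):
    print(item)
```

A disagreement only flags a span for review. It does not say which source is wrong, so a caption accuracy analysis would still need a way to decide between the two.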


Future vision…

Automatic Speech Transcription for less regulated arenas

– Captioning podcasts, lectures, meetings, presentations…

Easier tools to modify and customize

Easier and more cost-effective mechanisms to deliver

Understanding quality control issues: what is accuracy, and what is the cost of an error?

Back-up options

More pervasive usage

Higher quality deliverables