1 © 2007 IBM Corporation
Speech Transcription for Broadcast Activities:
The science, the art, and business realities
Sara H. Basson
Michael Picheny
Bhuvana Ramabhadran
IBM T.J. Watson Research Center
Agenda
Captioning and Transcription: The need
The options
Automated speech transcription: state of the art
Is it ready for prime time?
– samples from network transcripts
Quality control
Near-term solutions
The future
Lack of Captioning and Transcription – The Problem
Proliferation of multimedia information
Audio: not always the medium of choice
–Violates accessibility
•22,000,000 Americans listed as deaf or hard of hearing
•Aging users
US Federal Gov’t: the 2001 amendment to Section 508 of the Rehabilitation Act mandates that information federal agencies provide to the public or to their employees be accessible.
Time for editing (= cost of captioning) decreases as speech recognition accuracy improves.
Transcription of Audio Material: It’s the Law
Telecommunications Act of 1996:
100% of new English-language programming must be captioned by 2006
100% of Spanish-language programming must be captioned by 2010
Transcription Contrasted with Other Speech Recognition
Transcription:
Closed Captioning
General dictation
Call center data mining
Government intelligence applications
Unconstrained Speech
Conversational
Large Vocabulary
High Resource
Telephone, Broadcast, Speeches

Transaction:
“For mortgage rates, say or press 1…”
“Please say your tracking number…”
Name Dialer
More constrained
More directed
Large Vocabulary
Lower Resource
Telephone

Embedded:
Direction giving in car
Spoken commands in car
Phrase translation on a PDA
Most constrained
Most directed
Smaller Vocabulary
Lowest Resource
Embedded in a device
Audio requiring transcription/captioning
Webcasts
Podcasts
Television programming
Movies
Digitized lectures
e-Learning materials
Corporate training
Meetings
Conferences
Tourist information
Medical transcription
Legal transcription
Call center data
= Strong accessibility requirement (user demand, and corporate/legal mandates)
Speech Recognition Challenges Over Time
• Connected Digit Sequences (TI Digits)
• TIMIT Acoustic-Phonetic Continuous Speech Corpus
• Broadcast News (BN)
• Speech in Noisy Environments (SPINE)
• Switchboard (SWB): telephone conversations (about 70 topics)
• MALACH Corpus
Increasing complexity
Progress in Base Technology Research
[Figure: two panels plotting word error rate (%) against year. First panel, "Progress in Conversational Speech": years 1990 to 2010, error-rate axis from 1 to 100. Second panel, "Progress in IBM Speech Products": years 2001 to 2005, error-rate axis from 1 to 10. Legend entries: IBM Superhuman Speech Project, NIST Benchmarks, IBM Embedded ViaVoice in Car, IBM WebSphere Voice Server - Telephony.]
The NIST benchmark uses different test datasets each year, focusing on conversational speech.
Human Performance – Conversational Telephony
Base speech recognition technology has improved steadily over the last 15 years. Current error rates are low enough for many practical applications.
Average error rates for 10 simple tasks (digits, name dialing, etc.) In-car tests are performed at several speed/noise levels.
MALACH: A challenging speech corpus
Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32 languages.
Goal: improved access to large multilingual spoken archives
Challenges:
Emotional speech:
• young man they ripped his teeth and beard out they beat him
Disfluencies:
• A- a- a- a- band with on- our- on- our- arm
Frequent interruptions:
• CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN
Named Entity Detection in Segmentation
my dad was a traveling salesperson man and was a good provider we I cannot complain as a child we had a pretty good life and it started in nineteen thirty three Hitler came to power and started first with the communist started trouble then started with the Jews and I felt already in school when I went to school they put me in the last row of the class because I was Jewish how how old were you when you first noticed that you were treated differently I was seven seven years old this was my first second grade going to to school it started I looked I looked fairly dark I don't look like a real German blue eyes and blond I was beaten up in in school by the youngsters and I was afraid to go to school so my father decided my mother was born in Oswiecim this became Auschwitz later on the famous infamous place to go to Oswiecim to visit her grandmother per- a lot of family live in Oswiecim so our family went to Oswiecim we stayed there about a year and we picked up a little bit of the Polish language I started school kind of in the village and it was pretty nice we had a lot of family there cousins and and uncles and we stayed there till nineteen thirty four and my dad decided that it calmed down in Berlin we should come back we did not believe that really it will grow to something big this Hitler so we came back to Berlin and my parents put me in a a Jewish boys school was called Kaiserstrasser and we lived pretty much in the center of...
31 named entity tags: Person, Location, Organization, Country, Cardinal number, Money, Date, Duration, Age, Ordinal number, Percentage, Animal, Plant, Substance, Occupation, Disease…
Captioning audio: What are the options?
Option: Stenographers. Issues: cost, availability
Option: Automatic speech recognition. Issues: performance for speaker-independent recognition, any topic, multiple speakers, noisy backgrounds…
Captioning and transcribing audio material: Additional Advantages
Text-based search vs. audio-based search
Reading text: faster than listening to the auditory equivalent
Second language learners
Individuals with certain learning disabilities
Understandability….ASR vs. stenocaptioning: Manageable errors
ASR: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched
down at the kennedy space center in florida about six twenty one this morning IN ending a twelve day mission
TRUTH: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched
down at the kennedy space center in florida about six twenty one this morning ** ending a twelve day mission
ASR: since the diet drug combination FEN fen was pulled off the market some dieters ****
been looking for something that would work as well we will see what's in the works
TRUTH: since the diet drug combination PHEN fen was pulled off the market some dieters
HAVE been looking for something that would work as well we will see what's in the works
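The two ASR/TRUTH pairs above can be scored with a word error rate. The following is a minimal sketch, not the scoring tool used for these slides: it assumes case-insensitive matching, treats tokens made only of asterisks as alignment fillers, and counts substitutions, insertions, and deletions with equal weight. The function and constant names are illustrative.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and drop alignment fillers such as '**'."""
    return [w for w in text.lower().split() if w.strip("*")]


def word_errors(reference, hypothesis):
    """Return (minimum word edits, reference length) via Levenshtein distance."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word deleted by the recognizer
                curr[j - 1] + 1,         # extra word inserted by the recognizer
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1], len(ref)


def word_error_rate(reference, hypothesis):
    """Errors divided by the number of reference words."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    return errors / n_ref


# The two examples from this slide, with wrapped lines joined.
TRUTH_SHUTTLE = ("a picture perfect landing for the space shuttle atlantis this "
                 "morning the shuttle touched down at the kennedy space center in "
                 "florida about six twenty one this morning ** ending a twelve day mission")
ASR_SHUTTLE = ("a picture perfect landing for the space shuttle atlantis this "
               "morning the shuttle touched down at the kennedy space center in "
               "florida about six twenty one this morning IN ending a twelve day mission")
TRUTH_DIET = ("since the diet drug combination PHEN fen was pulled off the market "
              "some dieters HAVE been looking for something that would work as well "
              "we will see what's in the works")
ASR_DIET = ("since the diet drug combination FEN fen was pulled off the market "
            "some dieters **** been looking for something that would work as well "
            "we will see what's in the works")
```

Under these assumptions the shuttle example has one error (an inserted word) in 33 reference words, and the diet-drug example has two errors (one substitution, one deletion) in 31 reference words, which is consistent with calling these errors manageable.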
Understandability….ASR vs. stenocaptioning: Distracting/confusing
ASR: ** TOOK IT makes a lot of FOLKS and also ** THAT e. mail volleys more than twice pick up the phone
TRUTH: O. K. THAT makes a lot of SENSE and also IF AN e. mail volleys more than twice pick up the phone
ASR: STAY connected through e. mail has become very common in a lot of homes IN on the job but ********* on
how it's used it can be terrific FOR disastrous we will look at some e. mail problems THAT possible solutions
TRUTH: STAYING connected through e. mail has become very common in a lot of homes AND on the job but
DEPENDING on how it's used it can be terrific OR disastrous we will look at some e. mail problems AND possible solutions
ASR: so they do not have to make their own interpretation makes a lot of THINGS another tip TO write an e. mail IS
WHAT IT a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
TRUTH: so they do not have to make their own interpretation makes a lot of SENSE another tip TOO write an e. mail
AS YOU WOULD a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
Text and punctuation
Quality control for broadcast captioning
Thursday, July 05, 2007
Closed Captions On Ohio TV: 24/7 Gibberish Dished To The Disabled
Quality control for Broadcast Captioning
Q: Do captions have to meet accuracy requirements, such as having only so many spelling errors per program?
A: At present, captions are not required to meet any particular quality or accuracy standards. The Federal Communications Commission concluded that program providers have incentives to offer high quality captions, in keeping with the overall quality of the programs they offer. The FCC also concluded that it would be difficult to develop and monitor quality standards at this time. However, viewers may let video providers know whether they are satisfied with the captions through purchases of advertised products, subscriptions to program services, or contacts with providers concerning the programs.
The above information has been excerpted from the FCC guidelines and the Captioned Media Program of the National Association of the Deaf.
Using ASR for captioning….incrementally…UK Media and re-speaking
Using ASR for Broadcast Captioning…incrementally…Protitle Live System
• Enables creation of subtitles in all major languages, using speech recognition
• Functions
– Correction in real time
– Validation in real time
• Timing
– Total cycle time between 2 and 7 seconds
– 5 seconds on average
• Economics
– Re-speaking: 1/10th the cost of a real-time stenographer
Using ASR for Broadcast captioning…incrementally…Real-time editing
Assume: speaker obtains 80 percent ASR accuracy when speaking at a rate of 150 words a minute
Editor needs to correct 15 words in a minute to increase the accuracy to 90 percent.
– by choosing the 15 most important errors, some of the remaining 15 errors may not detract significantly from understanding.
In classrooms in the UK and in other countries, disabled students have people taking notes for them who are trying to type or write much faster than 15 words/minute to record as much as possible. If the speaker used speech recognition instead, the note taker would need only to type the corrections rather than trying to record everything.
People can read four or more times faster than somebody speaks.
Therefore: possible to do ‘something else’ when reading words displayed at speaking speeds
Real time editing can be separated into three activities:
– Finding the error and highlighting it
– Entering the correction
– Replacing the error with the correction
Using foot pedals to move the highlight to the exact position and triggering the replacement could enable the hands to remain free for entering the corrections.
Source: Professor M. Wald, Southampton University
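The arithmetic behind the real-time editing figures can be written out as a short sketch. It assumes that accuracy is simply the percentage of words recognized correctly and that each correction fixes exactly one word. The last helper is a simplification introduced here: it reads "four or more times faster" as meaning a reader needs only a quarter of the time to follow text shown at speaking speed. All function names are illustrative.

```python
def errors_per_minute(words_per_minute, accuracy_pct):
    """Misrecognized words per minute at a given accuracy (in percent)."""
    return words_per_minute * (100 - accuracy_pct) / 100


def corrections_per_minute(words_per_minute, start_pct, target_pct):
    """Corrections per minute needed to lift accuracy from start_pct to target_pct."""
    return words_per_minute * (target_pct - start_pct) / 100


def spare_time_fraction(reading_speedup):
    """Share of time left over when reading is `reading_speedup` times faster
    than speech (a simplifying assumption, not a measured result)."""
    return 1 - 1 / reading_speedup


# Figures from the slide: 150 words a minute at 80 percent accuracy, edited to 90 percent.
SLIDE_ERRORS = errors_per_minute(150, 80)
SLIDE_CORRECTIONS = corrections_per_minute(150, 80, 90)
SLIDE_REMAINING = SLIDE_ERRORS - SLIDE_CORRECTIONS
```

With the slide's numbers this gives 30 errors a minute, 15 corrections a minute, and 15 errors left uncorrected, matching the figures quoted above.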
Automated measures of accuracy
Proposal from the WGBH National Center for Accessible Media (NCAM)
Use language-processing tools to develop an automated caption accuracy assessment system for real-time captions on live news programming
Can text-based data mining and speech-to-text technologies produce meaningful data about stenocaption accuracy?
– Explore the capabilities of data mining software agents to identify discrepancies between errors contained within stenocaption data sets and speech-to-text data sets, and generate a caption accuracy analysis of the data set under review.
Through these methods, goal is to:
– Improve the ability of the television community to monitor and maintain the quality of live captioning they offer to viewers who are deaf or hard of hearing
– Ease the current burden on caption viewers to document and advocate for comprehensible captions.
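One way to picture the discrepancy-finding step is a word-level comparison of a caption stream against a speech-to-text stream. The sketch below is not NCAM's method. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for the data mining agents described in the proposal, and it borrows an ASR/TRUTH pair from the earlier e-mail example as sample data, with the TRUTH line playing the role of the stenocaption. Neither stream is treated as ground truth; the output is a list of spans for a person to review.

```python
from difflib import SequenceMatcher


def caption_discrepancies(stenocaption, asr_text):
    """Return (list of differing word spans, agreement ratio between 0 and 1)."""
    steno = stenocaption.lower().split()
    asr = asr_text.lower().split()
    # autojunk=False keeps frequent words from being ignored on long inputs.
    matcher = SequenceMatcher(a=steno, b=asr, autojunk=False)
    spans = [
        (steno[i1:i2], asr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / insert / delete regions
    ]
    return spans, matcher.ratio()


# Sample data adapted from the e-mail example earlier in the deck
# (alignment markers removed); a stand-in for real caption streams.
STENO_SAMPLE = ("O. K. that makes a lot of sense and also if an e. mail "
                "volleys more than twice pick up the phone")
ASR_SAMPLE = ("took it makes a lot of folks and also that e. mail "
              "volleys more than twice pick up the phone")

SAMPLE_SPANS, SAMPLE_AGREEMENT = caption_discrepancies(STENO_SAMPLE, ASR_SAMPLE)
```

On this sample the comparison flags three differing spans. A real assessment system would still need to decide which stream is wrong in each span and how much each error matters to the viewer.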
Future vision…
Automatic Speech Transcription for less regulated arenas
– Captioning podcasts, lectures, meetings, presentations…
Easier tools to modify and customize
Easier and more cost-effective mechanisms to deliver
Understanding quality control issues: what is accuracy, what is the cost of an error
Back-up options
More pervasive usage
Higher quality deliverables