1
DUTIE Speech: Determining Utility Thresholds for
Information Extraction from Speech
John Makhoul, Rich Schwartz, Alex Baron, Ivan Bulyko, Long Nguyen,
Lance Ramshaw, Dave Stallard, Bing Xiang
2
Objective
Estimate speech recognition accuracy required to support utility in the form of question answering (QA)
Follow-on to earlier DUTIE study on text
– Entities and relations extracted into a database, which was used by human subjects for a QA task
– Measured human QA performance as a function of information extraction (IE) scores
Extension to speech recognition
– Measure effect of speech recognition error on IE scores
– Assume the same relation between IE scores and QA; infer the effect of speech recognition on QA performance
3
Original DUTIE Study with Text Input
Databases of fully automated IE and manual annotation
– Populated with entities, relations, co-reference links
– 946 articles
The two databases were blended to produce a continuum of database qualities, as measured by
– Entity Value Score (EVS)
– Relation Value Score (RVS)
For each database, measured human performance
– QA performance
– Time taken to answer each question, in seconds
4
DUTIE Results
Need to reduce IE error rate by about half to achieve 70% QA performance

[Chart: QA performance (left axis, 0.50–1.00) and time per question in seconds (right axis, 135–210) vs. extraction blend from 0% to 100%, with each blend's (Entity, Relation) Value Scores marked: (56, 27), (64, 36), (69, 41), (76, 50), (84, 63), (100, 100). QA performance rises from 0.55 to 0.80 while time per question falls from 203 to 140 seconds. Polynomial fit to QA score: y = −0.1328x² + 0.38x + 0.553, R² = 0.9957. At the target accuracy of 0.70, the inferred IE requirement is a 46% blend, roughly (75, 48).]
5
Relative QA Performance vs. EVS
Same results, just scaled by QA with perfect IE scores
QA = −1.548 · EVS(ref)² + 3.26 · EVS(ref) − 0.715
6
DUTIE Speech Corpus
Data       #articles   #hours
TDT        578         15.5
Newswire   368         19.2
All        946         34.7

The DUTIE speech corpus consists of 946 articles with 34.7 hours of audio data in total
– Same articles as in the original DUTIE study
– 15.5 hours of TDT broadcast news data
• ABC, CNN, PRI, VOA (Jan. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000)
• MNB, NBC (Oct. 2000 ~ Dec. 2000)
– 19.2 hours of Newswire read speech recorded at LDC
• APW, NYT (Feb. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000)
7
DUTIE Speech Process
Speech Recognition
– Takes audio; outputs text in SNOR format
– Run at four different levels of accuracy
Punctuation
– Takes recognition output; adds periods/commas
– Two methods: forced alignment vs. automatic punctuation
Information Extraction (IE)
– Takes punctuated text and finds entities and relations
– Produces ACE Program Format (APF) XML
Scoring IE
– Compares test and reference APFs and computes Entity Value Score and Relation Value Score
8
Block Diagram
[Block diagram: Speech → Recognizer → Text → Forced Punctuation or Automatic Punctuation → Punctuated Text → Information Extraction → Entities and Relations (APFs) → APF Aligner → APFs → Scorer → Value Scores. Reference Text feeds the Forced Punctuation and APF Aligner steps; Reference APFs feed the Scorer.]
9
Speech Recognition

Four systems to produce a range of word error rates
– System I: BBN RT04 stand-alone 10xRT system, with heavily-weighted DUTIE text in language model training (cheating)
– System II: BBN RT04 stand-alone 10xRT system, with normally-weighted DUTIE text in language model training (some cheating)
– System III: BBN RT02 system (fair)
– System IV: BBN RT02 system, with decreased grammar weight in decoding (degraded)

System   Training (hrs)   WER on TDT (%)   WER on Newswire (%)   Average WER (%)
I        1700             8.9              6.6                   7.6
II       1700             11.7             10.0                  10.8
III      140              19.2             16.5                  17.7
IV       140              25.4             23.1                  24.1
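The word error rates above are the standard alignment-based measure. As a minimal sketch (not BBN's scoring tool), WER can be computed with edit-distance dynamic programming over word tokens:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) divided by
    the number of reference words, via Levenshtein dynamic programming."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

One substitution in three reference words, for example, gives a WER of 1/3.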
10
Sentence Boundary Detection Model
Sentence boundaries include periods, question marks, and exclamation points
Use a 3-gram LM to compute the probability of a sentence boundary at each word position [Stolcke 1996]
Training data
– TDT3 closed captions (12M words)
– HUB4 transcripts (120M words)
– Gigaword news articles from 2000 (100M words)
Use Viterbi to find the most likely sequence of tags
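The tagging step can be sketched as a small Viterbi decoder. This is an illustrative toy, not the BBN system: `toy_logp` and the `CUES` set stand in for the hidden-event 3-gram LM score, and the factored scoring interface is an assumption.

```python
import math

CUES = {"today"}   # toy stand-in for sentence-final evidence

def toy_logp(prev_tag, tag, left, right):
    """Made-up probabilities for illustration: a boundary ('B') is likely
    after a word in CUES.  A real hidden-event 3-gram LM score would also
    depend on prev_tag (the n-gram context) and the right-hand word."""
    p = 0.8 if (tag == "B") == (left in CUES) else 0.2
    return math.log(p)

def viterbi_boundaries(words, logp):
    """Choose a tag ('B' = sentence boundary, 'N' = none) for the gap after
    each word, maximizing the summed local log-scores with Viterbi."""
    tags = ("B", "N")
    n = len(words) - 1                     # one gap per adjacent word pair
    # delta[t]: best log-score of any tag sequence for gaps seen so far
    delta = {t: logp(None, t, words[0], words[1]) for t in tags}
    back = {t: [t] for t in tags}
    for i in range(1, n):
        new_delta, new_back = {}, {}
        for t in tags:
            scores = {p: delta[p] + logp(p, t, words[i], words[i + 1])
                      for p in tags}
            prev = max(scores, key=scores.get)
            new_delta[t], new_back[t] = scores[prev], back[prev] + [t]
        delta, back = new_delta, new_back
    best = max(delta, key=delta.get)
    return back[best]                      # one tag per gap
```

On `"he spoke today she left"` the toy scorer places a single boundary after "today".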
11
Automatic Punctuation Results
3-gram word LM gives near-state-of-the-art period error rate (state-of-the-art is 60% as reported at RT-04)
Punctuation performance is sensitive to WER (in part due to LM being trained on errorless text)
Further improvements possible with new models or prosodic features
WER (%)   Period Error Rate (%)
0.0       58
7.6       60
10.8      62
17.7      65
24.1      68

(Chart annotation: “State-of-the-art ASR”)
12
Reference Punctuation
Tokenize the reference into words labeled with punctuation triplets:
1) Punctuation attached to the beginning of the word
2) Punctuation attached to the end of the word
3) Unattached punctuation (e.g. hyphens) to the right of the word
Align reference and hypothesis words
Attach each reference word’s punctuation to the hypothesis word it is aligned to

Ref text: Hello, I’m looking for a size ten shoe. I prefer black, and don’t care about price.
ASR out:  JELLO I’M LOOKING FOR * SHOE I PREFER * AND DON’T CARE ABOUT PRICE
Output:   JELLO, I’M LOOKING FOR SHOE. I PREFER, AND DON’T CARE ABOUT PRICE.
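The attachment step can be sketched as follows. This toy assumes a one-to-one alignment in which each deleted word appears as its own `*` (so the multi-word deletion above becomes `* * *`), and it re-attaches a deleted word's trailing punctuation to the previous output word; the real system works from the recognizer's alignment.

```python
import re

def transfer_punct(ref_tokens, hyp_tokens):
    """Carry each reference word's attached punctuation over to the aligned
    hypothesis word.  Assumes a one-to-one alignment with '*' marking a
    deleted word (illustrative simplification)."""
    out = []
    for ref, hyp in zip(ref_tokens, hyp_tokens):
        # split reference token into leading punct / word / trailing punct
        lead, _, trail = re.match(r"^(\W*)(.*?)(\W*)$", ref).groups()
        if hyp == "*":
            # deleted word: keep its trailing punctuation on the prior word
            if trail and out:
                out[-1] += trail
            continue
        out.append(lead + hyp + trail)
    return " ".join(out)
```

Run on the slide's example (with straight quotes), this reproduces the output line, including the comma that survives the deletion of "black,".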
13
Information Extraction
Finds entities and the relations between them
Identifies entities by character-offset intervals in the input text file
– Character offset is defined literally: all whitespace and punctuation is included!
Produces an ACE Program Format (APF) XML expression

Example input:

MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and Boris Yeltsin of Russia signed an economic cooperation plan Friday
``We have covered the entire list of questions and discussed how we will be tackling them,'' Yeltsin was quoted as saying.

Example output (excerpt):

<entity ID="2" TYPE="GPE" SUBTYPE="Other">
  <entity_mention TYPE="NAM" ID="104-1">
    <extent>
      <charseq START="75" END="82"></charseq>
    </extent>
  </entity_mention>
</entity>
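A minimal sketch of reading those offsets back out with Python's standard XML parser. Treating the offsets as 0-based and inclusive is an assumption here; the actual ACE convention should be checked against the evaluation documentation.

```python
import xml.etree.ElementTree as ET

def mention_strings(apf_xml, source_text):
    """Pull each entity mention's surface string out of the raw source text
    by its charseq offsets.  Offsets are taken as 0-based and inclusive,
    counting every whitespace and punctuation character (assumed)."""
    root = ET.fromstring(apf_xml)
    return [source_text[int(cs.get("START")): int(cs.get("END")) + 1]
            for cs in root.iter("charseq")]
```

For instance, with a synthetic source text padded so that characters 75–82 spell a known token, the function returns exactly that 8-character span.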
14
Scoring IE, Part I
The IE scoring program compares the character-offset intervals of entities in reference and test APFs
– Requires 30% overlap
Problem #1: character offsets in reference APFs reflect all whitespace formatting in the original text file
– But recognizer output will have different character offsets, so the test offsets will be wrong
Solution:
1. Align words in reference and test
2. Based on this alignment, compute a character-offset mapping between reference and test
3. Change character positions in the test APF using the mapping
4. Compute IE scores
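Steps 1–3 can be sketched as follows, under two assumptions of this illustration (not BBN's interface): both sides are whitespace-joined token sequences, and the word alignment is given as (ref_index, test_index) pairs with None marking insertions and deletions.

```python
def char_spans(words):
    """Inclusive (start, end) character offsets of each word when the words
    are joined by single spaces."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w) - 1))
        pos += len(w) + 1                  # +1 for the separating space
    return spans

def offset_map(ref_words, test_words, alignment):
    """Map each aligned test word's character span to the corresponding
    reference span; entity offsets in the test APF can then be shifted
    into reference coordinates before scoring."""
    ref_spans, test_spans = char_spans(ref_words), char_spans(test_words)
    return {test_spans[t]: ref_spans[r]
            for r, t in alignment if r is not None and t is not None}
```

For example, aligning test "jello world" to reference "hello there world" maps the test span of "world" (6, 10) onto the reference span (12, 16), absorbing the deleted word "there".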
15
Scoring IE, Part II
Problem #2: the IE scoring program only compares character-offset intervals, not the words in them
– So it may ignore word errors in a name
• “George Hush” vs. “George Bush”
Solution: modify the scoring program to require a match of alphanumeric characters in the test and reference character intervals
– Modification courtesy of George Doddington
– Requires 50% content overlap
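The exact matching rule in Doddington's modification is not given here; as one plausible reading of "50% content overlap", the sketch below scores the longest common subsequence of alphanumeric characters against the longer string's length.

```python
def content_match(ref_span, test_span, threshold=0.5):
    """Check alphanumeric-content agreement between reference and test
    mention strings (one plausible reading of the 50% overlap rule, not
    the actual scorer's algorithm)."""
    a = [c.lower() for c in ref_span if c.isalnum()]
    b = [c.lower() for c in test_span if c.isalnum()]
    # longest common subsequence length by dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)] / max(len(a), len(b), 1) >= threshold
```

Under this reading, “George Hush” still matches “George Bush” (9 of 10 characters agree), while offset-coincident but unrelated words are rejected.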
16
Detailed Results
WER     Punctuation                 Entity Score (%)   Relation Score (%)
0.0%    All punctuation             59.7               27.3
0.0%    Correct period and comma    58.9               25.9
0.0%    Correct period only         53.4               21.3
0.0%    Automatic period            51.7               19.9
7.6%    All punctuation             49.3               22.9
7.6%    Correct period and comma    48.5               21.4
7.6%    Automatic period            42.7               17.8
10.8%   All punctuation             46.9               22.0
10.8%   Correct period and comma    46.0               20.5
10.8%   Correct period only         41.5               17.7
10.8%   Automatic period            40.5               17.4
17.7%   All punctuation             39.1               17.8
17.7%   Correct period and comma    38.0               16.6
17.7%   Automatic period            33.9               14.3
24.1%   All punctuation             31.0               15.2
24.1%   Correct period and comma    30.2               14.4
24.1%   Automatic period            26.1               11.8
17
Effect of Punctuation on Entity Value Score
Sentence boundaries are required, but their locations are not critical (loss is 2.8% relative with a 62% period error rate)
Loss of commas results in a 9.5% reduction in Entity score
– Shows the importance of appositives to IE (“George W. Bush, President of the United States, said this morning …”)

[Chart: Entity Value Score (30–65) vs. word error rate (0% and 10.8%) under four conditions: all punctuation; correct period and comma; correct period, no comma; automatic period, no comma.]
18
Entity Value Scores as Function of WER
Effect of WER on Entity score is linear
Loss for automatic punctuation relative to reference punctuation is 13.5% relative

[Chart: Entity Value Score vs. word error rate. Reference punctuation: 0.597, 0.493, 0.469, 0.391, 0.310 at WERs of 0%, 7.6%, 10.8%, 17.7%, 24.1%; linear fit y = −1.166x + 0.592 (R² = 0.996). Automatic punctuation: 0.517, 0.427, 0.405, 0.339, 0.261; linear fit y = −1.035x + 0.514 (R² = 0.996).]
19
Relation Value Scores as Function of WER
Loss for automatic punctuation relative to reference punctuation is 25% relative

[Chart: Relation Value Score vs. word error rate. Reference punctuation: 0.273, 0.229, 0.220, 0.178, 0.152 at WERs of 0%, 7.6%, 10.8%, 17.7%, 24.1%; linear fit y = −0.505x + 0.271 (R² = 0.994). Automatic punctuation: 0.199, 0.178, 0.174, 0.143, 0.118; linear fit y = −0.340x + 0.203 (R² = 0.979).]
20
Relation Between WER and IE Scores
Entity Value Score (EVS) and Relation Value Score (RVS) are linear functions of WER:

EVS = EVS(0) · (1 − 2 · WER)
RVS = RVS(0) · (1 − 1.7 · WER)

Automatic punctuation has a multiplicative effect on scores:

EVS(auto punc) = 0.865 · EVS(ref) · (1 − 2 · WER)
RVS(auto punc) = 0.75 · RVS(ref) · (1 − 1.7 · WER)

Relative QA as a function of EVS:

QA = −1.548 · EVS(ref)² + 3.26 · EVS(ref) − 0.715
21
Predicted Relative QA vs. WER and EVS(ref)
At 12% WER with today’s IE, we get 33% of maximum QA
– Near zero at 25% WER (e.g., for non-English languages)
With half the IE error rate, half the WER, and half the loss from punctuation, we estimate 72% of maximum QA
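Chaining the fitted relations reproduces the 33% figure. This is a sketch using coefficients read off the earlier slides (EVS(0) ≈ 0.597 for today's IE on text, the 0.865 automatic-punctuation factor, and the quadratic QA fit); the slope of 2 per unit WER for EVS follows the fits as reconstructed.

```python
def predicted_relative_qa(wer, evs0=0.597, punct_factor=0.865):
    """Predicted relative QA: EVS degrades linearly with WER, automatic
    punctuation scales it multiplicatively, and relative QA is a
    quadratic in EVS (coefficients from the fits on earlier slides)."""
    evs = punct_factor * evs0 * (1 - 2 * wer)   # EVS with auto punctuation
    return -1.548 * evs ** 2 + 3.26 * evs - 0.715

# ~0.33 at 12% WER; near zero at 25% WER
```

At wer = 0.12 this yields about 0.326 (33% of maximum QA), and at wer = 0.25 it drops to roughly 0.02, matching the bullet above.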
22
Conclusions
IE scores degrade linearly with WER
Sentence boundaries are required, but their locations are not critical
Commas are important for IE
With current technology (e.g., 12% WER and 60% EVS on text), we can achieve only 33% of maximum QA performance
If IE error and WER were each cut in half, and the loss due to commas also cut in half, QA performance could increase to over 70% of maximum