Available online at www.sciencedirect.com
SPECOM 2233 No. of Pages 19, Model 5+
5 July 2014
www.elsevier.com/locate/specom
ScienceDirect
Speech Communication xxx (2014) xxx–xxx
The Hearing-Aid Speech Perception Index (HASPI)
James M. Kates *, Kathryn H. Arehart
Department of Speech Language and Hearing Sciences, University of Colorado, Boulder, CO 80309, USA
Received 23 September 2013; received in revised form 5 May 2014; accepted 23 June 2014
Abstract
This paper presents a new index for predicting speech intelligibility for normal-hearing and hearing-impaired listeners. The Hearing-Aid Speech Perception Index (HASPI) is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine structure outputs of the auditory model for a reference signal to the outputs of the model for the signal under test. The auditory model for the reference signal is set for normal hearing, while the model for the test signal incorporates the peripheral hearing loss. The new index is compared to indices based on measuring the coherence between the reference and test signals and based on measuring the envelope correlation between the two signals. HASPI is found to give accurate intelligibility predictions for a wide range of signal degradations including speech degraded by noise and nonlinear distortion, speech processed using frequency compression, noisy speech processed through a noise-suppression algorithm, and speech where the high frequencies are replaced by the output of a noise vocoder. The coherence and envelope metrics used for comparison give poor performance for at least one of these test conditions.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Speech intelligibility; Intelligibility index; Auditory model; Hearing loss; Hearing aids
1. Introduction
Signal degradations, such as additive noise or nonlinear distortion, can reduce speech intelligibility for both normal-hearing and hearing-impaired listeners, even when hearing aids are used. Hearing aids, in particular, can present a wide range of signal modifications since the input signal may be noisy and the hearing aid may incorporate several nonlinear processing algorithms (Kates, 2008). Hearing aid processing includes dynamic-range compression, in which low-level portions of the signal receive greater amplification than the high-level portions, and the time-varying gain causes distortion of the signal envelope and introduces modulation sidebands. Noise suppression
http://dx.doi.org/10.1016/j.specom.2014.06.002
0167-6393/© 2014 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +1 720 226 1266. E-mail addresses: [email protected] (J.M. Kates), [email protected] (K.H. Arehart).
algorithms attenuate the noisier portions of the noisy speech signal, and like dynamic-range compression modify the signal envelope and introduce modulation sidebands. Frequency compression (Souza et al., in press), in which high-frequency portions of the spectrum are shifted to lower frequencies where a hearing-impaired listener may have better sound thresholds, is also implemented in several hearing aids. The frequency shifting causes inherent distortions including reduced spacing between harmonics, altered spectral peak levels, and modified spectral shape (McDermott, 2011).
Many of these degradation mechanisms simultaneously affect the signal envelope and the signal temporal fine structure (TFS). Additive noise, for example, reduces the envelope modulation depth by filling in the pauses in the speech and also corrupts the TFS of the speech by adding timing jitter corresponding to the random fluctuations of the noise. Peak clipping, which may be used to prevent
unacceptably loud sounds, reduces the signal modulation depth by removing the signal peaks and also modifies the TFS by introducing additional frequency components corresponding to the harmonic distortion products. Thus for many forms of signal degradation, changes to the signal envelope and to the TFS are closely related.
Changes to the signal TFS have been successfully used to predict speech intelligibility. The TFS changes are often measured using the coherence function (Carter et al., 1973; Shaw, 1981; Kates, 1992). In the time domain, the coherence is computed by taking the cross-correlation between a noise-free unprocessed reference signal and the noisy processed signal and dividing by the product of the root-mean-squared (RMS) intensities of the two signals. The magnitude-squared coherence is converted to a signal-to-distortion ratio (SDR), which can be used in a manner similar to the signal-to-noise ratio (SNR) in computing the Speech Intelligibility Index (SII) (ANSI, 1997) to produce the coherence SII (CSII) (Kates and Arehart, 2005).
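As a concrete sketch, the magnitude-squared coherence (MSC) between a reference and a processed signal can be estimated and converted to an SDR as below. The Welch-style estimator from `scipy.signal` and the MSC/(1 − MSC) conversion are illustrative assumptions, not the exact CSII procedure.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs                                  # 1 s of signal
reference = np.sin(2 * np.pi * 440 * t)                 # clean reference
processed = reference + 0.1 * rng.standard_normal(fs)   # noisy "processed" signal

# Magnitude-squared coherence between reference and processed signals
f, msc = coherence(reference, processed, fs=fs, nperseg=256)

# Convert MSC to a signal-to-distortion ratio in dB (assumed form
# SDR = MSC / (1 - MSC); the CSII applies further band weighting)
eps = 1e-10
sdr_db = 10 * np.log10(msc / (1 - msc + eps) + eps)
```

The SDR is high in frequency regions where the processed signal tracks the reference and low where noise or distortion dominates.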
Changes to the signal envelope have also been used to predict speech intelligibility. The original version of the Speech Transmission Index (STI) (Houtgast and Steeneken, 1971; Steeneken and Houtgast, 1980), for example, used bands of amplitude-modulated noise as the probe signals and measured the reduction in signal modulation depth. However, this original version of the STI is not accurate for hearing-aid processing such as dynamic-range compression (Hohmann and Kollmeier, 1995). Speech-based versions of the STI have been developed that are based on estimating the SNR from cross-correlations of the signal envelopes in each frequency band (Ludvigsen et al., 1990; Holube and Kollmeier, 1996; Goldsworthy and Greenberg, 2004; Payton and Shrestha, 2008). An intelligibility index based on averaging envelope correlations for 20-ms speech segments has been developed by Christiansen et al. (2010), and Taal et al. (2011b) have developed the short-time objective intelligibility measure (STOI), which uses envelope correlations computed for 384-ms speech segments. Changes in the envelope time–frequency modulation have also been used as the basis of a speech intelligibility index (Elhilali et al., 2003).
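A minimal sketch of a segment-based envelope correlation of this kind follows; the segment length and the band envelopes are placeholders, not the STOI's exact parameters.

```python
import numpy as np

def segment_envelope_correlation(ref_env, deg_env, seg_len):
    # Average the normalized correlation between reference and degraded
    # band envelopes over consecutive short segments (STOI-style).
    corrs = []
    for start in range(0, len(ref_env) - seg_len + 1, seg_len):
        r = ref_env[start:start + seg_len] - ref_env[start:start + seg_len].mean()
        d = deg_env[start:start + seg_len] - deg_env[start:start + seg_len].mean()
        denom = np.linalg.norm(r) * np.linalg.norm(d)
        if denom > 0:
            corrs.append(float(np.dot(r, d) / denom))
    return float(np.mean(corrs))
```

An undistorted envelope yields a correlation of 1; degradations that flatten or reshape the envelope pull the average toward 0.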
If intelligibility can be predicted using either signal coherence or envelope correlation, is there any reason to prefer one approach over the other? A procedure that combines coherence with changes in the signal envelope may be more robust than one that uses just the coherence because there are several situations where a coherence-based approach can fail. One example where coherence will perform poorly is frequency compression. Frequency compression (Aguilera Munoz et al., 1999; Simpson et al., 2005; Glista et al., 2009) is intended to improve the audibility of high-frequency speech sounds by shifting them to lower frequency regions where listeners with high-frequency hearing loss have better hearing thresholds. However, the cross-correlation between a sinusoid and a frequency-shifted version of the sinusoid will approach zero as the duration of the observation interval is
increased. Thus frequency compression will lead to predictions of lower intelligibility as the amount of frequency shift is increased even if the intelligibility has not actually been affected, and the predicted loss in intelligibility will depend on the size of the speech segments used in computing the intelligibility index.
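This decorrelation is easy to demonstrate numerically; the frequencies and durations below are arbitrary examples.

```python
import numpy as np

def normalized_xcorr(x, y):
    # Zero-lag cross-correlation normalized by the signal norms
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

fs = 16000
f_ref, f_shift = 3000.0, 2900.0   # original and frequency-shifted tones
corrs = {}
for dur in (0.002, 0.5):          # short vs. long observation interval
    t = np.arange(int(fs * dur)) / fs
    corrs[dur] = normalized_xcorr(np.sin(2 * np.pi * f_ref * t),
                                  np.sin(2 * np.pi * f_shift * t))
# The correlation is high over a short interval but collapses toward
# zero as the interval grows, even though the shifted tone may remain
# perfectly audible to a listener.
```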
A second situation where coherence has limitations is for some forms of noise suppression, specifically the ideal binary mask (IBM). In IBM processing, the speech is divided into frequency bands and each band further divided into time segments to produce time–frequency cells. If the SNR in a time–frequency cell is greater than a preset threshold (e.g. 0 dB) the gain for that cell is set to 1; otherwise the cell is attenuated (Wang et al., 2008; Kjems et al., 2009). High intelligibility is found for noisy speech when the ideal mask, computed from the speech and noise with the threshold set to the signal-to-noise ratio, is applied to a signal comprised of noise alone (Wang et al., 2008). The IBM output in this case is amplitude-modulated noise. The cross-correlation between the reference speech and modulated noise is therefore low, and a coherence-based procedure would predict low intelligibility. Poor correlation of the CSII with IBM-processed speech has been reported by Christiansen et al. (2010) and by Taal et al. (2011a, 2011b).
A third example is the noise vocoder (Dudley, 1939; Shannon et al., 1995), in which the speech is replaced by bands of noise having the same envelope modulation as the speech. Excellent intelligibility can be obtained even though the speech TFS has been replaced by the random fluctuations of the noise (Shannon et al., 1995; Stone et al., 2008; Souza and Rosen, 2009; Anderson, 2010). However, a coherence-based calculation will predict lower intelligibility because of the reduction in the cross-correlation between the original speech and the noise vocoder output. Poor correlation of the CSII with noise-vocoded speech has been reported by Cosentino et al. (2012), although Chen and Loizou (2011) found comparable performance between the CSII and an envelope-based version of the STI.
These weaknesses in the use of coherence to predict intelligibility suggest that a procedure that combines coherence with changes in the envelope modulation may be more accurate than one that is based on coherence alone. For example, the results of Gomez et al. (2012) show that combining the CSII with an envelope measurement improves the accuracy in comparison to the CSII alone when predicting speech intelligibility for normal-hearing listeners for speech corrupted by various forms of additive noise.
An additional concern is predicting speech intelligibility for hearing-impaired listeners. An accurate intelligibility index for hearing-aid users has to deal with noisy input signals, the distortion introduced by the hearing-aid processing, and the hearing loss. Hearing loss is most often modeled as a shift in auditory threshold, and this threshold shift has been represented as an increase in the internal auditory noise level in the SII calculation procedure
(Pavlovic et al., 1986; Humes et al., 1986; Payton et al., 1994; Holube and Kollmeier, 1996; Ching et al., 1998; Kates and Arehart, 2005). A similar modification of the hearing threshold has been applied to the STI (Humes et al., 1986; Payton et al., 1994; Holube and Kollmeier, 1996). Limitations in the accuracy of the predictions have led to empirical modifications of the SII, including a “desensitization factor” that increases with increasing hearing loss (Pavlovic et al., 1986) and a frequency-dependent proficiency factor that also depends on the hearing loss (Ching et al., 1998).
A more thorough model of peripheral hearing loss would be expected to yield more accurate intelligibility predictions. An auditory model (Dau et al., 1996) was used by Holube and Kollmeier (1996) for intelligibility predictions, and hearing loss was first implemented as a threshold shift based on the audiogram. Individual adjustments of the filter bandwidths and forward masking time constants were then incorporated into the model, which resulted in a small improvement in the accuracy of the intelligibility predictions for speech in noise. Hines and Harte (2010) also used a cochlear model (Zilany and Bruce, 2006) as an auditory front end for their intelligibility calculations. However, they only present simulation results, so the benefit of their approach in predicting intelligibility for hearing-impaired listeners has not been verified.
The purpose of this paper is to present a new intelligibility index that (1) combines measurements of coherence with measurements of envelope fidelity to give improved accuracy for a wide range of processing conditions, and (2) is accurate for hearing-impaired as well as normal-hearing listeners. The new index, the Hearing-Aid Speech Perception Index (HASPI), uses an auditory model that incorporates aspects of normal and impaired peripheral auditory function (Kates, 2013). The auditory coherence is computed from the modeled basilar-membrane vibration output in each frequency band, and provides a measurement sensitive to the changes in the speech temporal fine structure. The cepstral correlation is computed from the envelope output in each frequency band, and provides a measurement of the fidelity with which the envelope time–frequency modulation has been preserved.
The remainder of the paper starts with a description of the data used to train and evaluate the intelligibility indices. The datasets include noise and nonlinear distortion, frequency compression for speech in babble noise, noisy speech processed using an ideal binary mask noise-suppression algorithm, and speech partially replaced by the output of a noise vocoder; these data are described next. The auditory model used for the new index is then described, followed by a description of how the outputs of the auditory model are combined to produce the new HASPI index. The CSII and an envelope-based index based on the STOI are used as comparisons in the paper. A modified version of the STOI was derived because the STOI as published does not take auditory threshold or hearing loss into account. The revised CSII and modified STOI calculations
are then described. Results are presented for the four different datasets, followed by a discussion of the factors that influence the model accuracy.
2. Intelligibility data
The original CSII was fitted to speech corrupted by noise and distortion (Kates and Arehart, 2005), and those data are described below. The revised CSII and HASPI are fit to four datasets, which comprise the noise and distortion data used for the original CSII plus results from three additional experiments. These additional datasets comprise frequency compression, noise suppression, and noise vocoder data. For all experiments, subjects listened to speech presented monaurally over headphones in a sound booth. It is hypothesized that the CSII may not perform as well as HASPI for these additional datasets.
2.1. Noise and distortion
The noise and distortion data comprises the intelligibility scores reported by Kates and Arehart (2005). Thirteen adult listeners with normal hearing and nine adult listeners with hearing loss of presumed cochlear origin participated in the experiments. The test materials consisted of the Hearing-in-Noise-Test (HINT) sentences (Nilsson et al., 1994). The sentences were digitized at a 44.1 kHz sampling rate and down-sampled to 22.05 kHz to approximate the bandwidth typically found in hearing aids (Kates, 2008). Each test sentence was combined with additive noise, or was subjected to symmetric peak-clipping distortion or symmetric center-clipping distortion.
The additive noise was extracted from the opposite channel of the HINT test compact disc. The noise has the same long-term spectrum as the sentences. SNR values ranged from −5 to 30 dB, and an unprocessed condition was also included. The peak-clipping and center-clipping distortion thresholds were set as a percentage of the cumulative histogram of the magnitudes of the signal samples for each sentence. Peak-clipping thresholds ranged from infinite clipping to no clipping, and center-clipping thresholds ranged from 98% to no clipping. The stimuli were presented to the normal-hearing listeners at an equalized-RMS level of 65 dB SPL. The speech signals were amplified for the individual hearing loss, when present, using the National Acoustics Laboratories-Revised (NAL-R) linear prescriptive formula (Byrne and Dillon, 1986). During the sessions, listeners verbally repeated each sentence after it was presented. The tester then scored the proportion of complete HINT sentences that were correctly repeated by the listener.
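The two clipping operations can be sketched as follows, with the threshold taken from a percentile of the per-sentence magnitude distribution; the percentile values in the example are arbitrary, and the exact threshold settings are those of Kates and Arehart (2005).

```python
import numpy as np

def peak_clip(x, percentile):
    # Clip samples whose magnitude exceeds a threshold drawn from the
    # cumulative histogram (percentile) of the sample magnitudes
    thr = np.percentile(np.abs(x), percentile)
    return np.clip(x, -thr, thr)

def center_clip(x, percentile):
    # Zero out samples whose magnitude falls at or below the threshold
    thr = np.percentile(np.abs(x), percentile)
    return np.where(np.abs(x) <= thr, 0.0, x)
```

Peak clipping flattens the waveform peaks (reducing envelope modulation), while center clipping removes the low-level structure between them.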
2.2. Frequency compression
The frequency-compression data comprises the intelligibility scores for frequency-compressed speech reported by Souza et al. (in press) and Arehart et al. (2013a). Fourteen adult listeners with normal to near-normal hearing and 26
adult listeners with mild-to-moderate high-frequency loss participated in the experiments. The stimuli for the intelligibility tests consisted of low-context IEEE sentences (Rosenthal, 1969) spoken by a female talker. All of the stimuli were digitized at a 44.1 kHz sampling rate and downsampled to 22.05 kHz. The sentences were used in quiet and combined with multi-talker babble at SNRs ranging from 10 to −10 dB in steps of 5 dB. After the addition of the babble, the sentences were processed using frequency compression.
Frequency compression was implemented using sinusoidal modeling (McAulay and Quatieri, 1986). The signal was first divided into low-frequency and high-frequency bands. The low-frequency signal was used without further modification, and sinusoidal modeling was applied to the high-frequency signal. The ten highest peaks in the high-frequency band were selected, and the amplitude and phase of each peak were preserved while the frequencies were reassigned to lower values. Output sinusoids were then synthesized at the shifted frequencies (Quatieri and McAulay, 1986; Aguilera Munoz et al., 1999) and combined with the original low-frequency signal to produce the frequency-compressed output.
The frequency-compression parameters represented the range that might be available in wearable hearing aids, and included three frequency compression ratios (1.5:1, 2:1, and 3:1) and three frequency compression cutoff frequencies (1, 1.5, and 2 kHz). A control condition having no frequency compression was also included. The stimulus level for the normal-hearing subjects was 65 dB SPL. The speech signals were amplified for the individual hearing loss, when present, using NAL-R equalization (Byrne and Dillon, 1986).
Scoring in Souza et al. (in press) was based on keywords correct (5 per sentence for 50 words per condition per listener). For compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct: a correct sentence required that all five keywords be identified correctly; otherwise the sentence was scored as incorrect.
2.3. Ideal binary mask
Arehart et al. (2013b) measured intelligibility scores for noisy speech processed through an ideal binary mask noise-suppression algorithm. The data presented in their paper were obtained from thirty older subjects having mild to moderate hearing losses and from seven younger subjects having normal hearing. The stimuli for the intelligibility tests consisted of low-context IEEE sentences (Rosenthal, 1969) spoken by a female talker. All of the stimuli were digitized at a 44.1-kHz sampling rate and downsampled to 20 kHz for the noise-suppression processing. The sentences were combined with four-talker babble at signal-to-noise ratios of −18 to +12 dB in steps of 6 dB. The sentence level prior to noise suppression was set to 65 dB SPL.
The noisy speech stimuli were processed with a binarymask noise-reduction strategy (Kjems et al., 2009; Ng
et al., in press). The target speech signal, the masker signal, and the target-plus-masker mixture were each separated into frequency bands using analysis filterbanks consisting of 64 gammatone filters (Patterson et al., 1995) with center frequencies equally distributed on the equivalent rectangular bandwidth number (ERBN) scale (Moore and Glasberg, 1983) over the frequency range of 50 to 8000 Hz. This frequency scale corresponds to approximately uniform spacing along the cochlear partition. The processing was done in time frames each having a duration of 20 ms with fifty-percent overlap, resulting in a new frame every 10 ms.
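Equal spacing on the ERB-number scale can be reproduced with a frequency-to-ERB-number mapping and its inverse; the formula below is the widely used Glasberg and Moore version and is an assumption here, since the paper cites the earlier Moore and Glasberg (1983) scale.

```python
import numpy as np

def erb_space(f_low, f_high, n):
    # Center frequencies equally spaced on the ERB-number scale,
    # using E(f) = 21.4 * log10(1 + 0.00437 * f) and its inverse
    e_low = 21.4 * np.log10(1 + 0.00437 * f_low)
    e_high = 21.4 * np.log10(1 + 0.00437 * f_high)
    e = np.linspace(e_low, e_high, n)
    return (10 ** (e / 21.4) - 1) / 0.00437

center_freqs = erb_space(50.0, 8000.0, 64)   # 64 gammatone center frequencies
```

The resulting frequencies are densely packed at the low end and sparse at the high end, matching the approximately uniform spacing along the cochlear partition.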
A time–frequency cell consists of one frame in one frequency band. In each time–frequency cell, the local signal-to-noise ratio was determined from the clean target and separate masker signals. The local SNR was then compared to a local criterion (LC) of 0 dB, resulting in an ideal binary mask decision equal to 1 if the local SNR was above LC, and 0 otherwise. The data of Kjems et al. (2009) indicate that an LC of 0 dB is most effective for SNRs in the range of approximately +5 to −10 dB. Similar to the procedure in Li and Loizou (2008), errors were introduced into the ideal binary mask by randomly flipping a certain percentage (0%, 10%, and 30%) of the time–frequency units either from 0 to 1 or from 1 to 0. The binary patterns were then converted into gain values, where 1s were converted into 0 dB gain and 0s were converted into an attenuation of either 10 dB or 100 dB. The noisy speech signal was then multiplied by the binary gain values to give the processed signal in the frequency domain. The processed signal was then filtered through a time-reversed gammatone filterbank, thereby ensuring a constant processing group delay independent of frequency.
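A simplified sketch of the mask logic is given below; it operates on generic time–frequency magnitudes rather than the paper's 64-band gammatone analysis, and the function and parameter names are illustrative.

```python
import numpy as np

def ideal_binary_mask_gains(speech_tf, masker_tf, lc_db=0.0,
                            flip_frac=0.0, atten_db=100.0, rng=None):
    # Local SNR in each time-frequency cell from the clean target and masker
    rng = rng if rng is not None else np.random.default_rng()
    local_snr_db = 10 * np.log10((np.abs(speech_tf) ** 2 + 1e-12)
                                 / (np.abs(masker_tf) ** 2 + 1e-12))
    mask = (local_snr_db > lc_db).astype(float)   # 1 if local SNR above LC
    # Introduce mask errors by flipping a random fraction of the cells
    flips = rng.random(mask.shape) < flip_frac
    mask[flips] = 1.0 - mask[flips]
    # Convert the binary pattern to gains: 1 -> 0 dB, 0 -> attenuation
    return np.where(mask == 1.0, 1.0, 10 ** (-atten_db / 20))

speech = np.array([[10.0, 0.1]])    # toy magnitudes: 1 frame, 2 bands
masker = np.array([[0.1, 10.0]])
gains = ideal_binary_mask_gains(speech, masker, flip_frac=0.0)
processed = (speech + masker) * gains
```

Cells dominated by the target pass unchanged while masker-dominated cells are heavily attenuated, which is what produces the amplitude-modulated-noise character of IBM output.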
The IBM processing is not practical in a hearing aid since separate noise and speech files must be available. However, the envelope distortion introduced by the IBM algorithm will be similar to that resulting from other noise-suppression strategies that are used in hearing aids, such as spectral subtraction (Kates, 2008). In spectral subtraction, for example, noisy speech cells are also attenuated, and the amount of attenuation is based on the estimate of the instantaneous SNR. The errors added to the IBM processing are related to the errors in the amount of spectral-subtraction attenuation that are introduced by imperfect estimation of the speech and noise levels.
Following the noise-suppression processing, the speech signals were amplified for the individual hearing loss, when present, using NAL-R equalization (Byrne and Dillon, 1986). Scoring in Arehart et al. (2013b) was based on keywords correct; for compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct.
2.4. Noise vocoder
Anderson (2010) obtained intelligibility scores for speech where the high frequencies of noisy speech were replaced by the output of a noise vocoder. Intelligibility
was measured as the number of frequency bands subjected to noise vocoding was increased. Ten subjects with normal hearing and ten subjects with hearing loss participated in the experiments. The test materials were low-context sentences from the IEEE corpus (Rosenthal, 1969) spoken by a male and by a female talker. All stimuli were digitized at 44.1 kHz and were down-sampled to 22.05 kHz. The background noise was multi-talker babble.
The speech was processed without any noise and at SNRs of 18 and 12 dB. The sentences were passed through a bank of 32 band-pass filters with center frequencies distributed on an ERBN filter scale (Slaney, 1993). The signal envelope for each vocoded band was generated via the Hilbert transform. The Gaussian noise used for the noise vocoding was passed through the same linear-phase FIR filterbank as the speech. Two vocoded signals were produced. One signal was produced by multiplying the filtered noise by the speech envelope. For the second signal, the noise envelope fluctuations were first removed by dividing the filtered noise by its own envelope before multiplying it by the speech envelope. As a last processing step, both the speech and noise were passed through the same filters as in the first filtering stage to remove any out-of-band modulation products, and the RMS level of the vocoded signal in each frequency band was matched to that of the original speech.
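For a single band, the vocoding steps (Hilbert envelope, envelope imposition, refiltering, RMS matching) can be sketched as below; the Butterworth band-pass and the 1000–2000 Hz band stand in for one channel of the paper's 32-band linear-phase FIR filterbank.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 22050
t = np.arange(fs) / fs
rng = np.random.default_rng(1)

# One analysis band (illustrative 1000-2000 Hz Butterworth band-pass)
sos = butter(4, [1000, 2000], btype="bandpass", fs=fs, output="sos")
speech_band = sosfiltfilt(sos, np.sin(2 * np.pi * 1500 * t)
                          * (1 + 0.5 * np.sin(2 * np.pi * 4 * t)))
noise_band = sosfiltfilt(sos, rng.standard_normal(fs))

env = np.abs(hilbert(speech_band))   # Hilbert envelope of the speech band
vocoded = noise_band * env           # impose the speech envelope on the noise
vocoded = sosfiltfilt(sos, vocoded)  # refilter: remove out-of-band products
# Match the RMS level of the vocoded band to the original speech band
vocoded *= np.sqrt(np.mean(speech_band ** 2) / np.mean(vocoded ** 2))
```

The vocoded band keeps the speech envelope and level but replaces the speech TFS with the random fine structure of the noise carrier.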
Noise vocoding was applied to the noisy speech starting at the highest frequency bands and proceeding to lower frequencies. The amount of noise vocoding was increased in steps of two frequency bands from no bands vocoded to the 16 highest-frequency bands vocoded. The upper cutoff frequency of the 16-band vocoded condition was 1.6 kHz. The stimulus level for the normal-hearing listeners was 65 dB SPL, and NAL-R amplification (Byrne and Dillon, 1986) was provided for the hearing-impaired listeners. The stimuli were presented monaurally in a sound booth using headphones. Intelligibility was scored in terms of keywords correct. For compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct.
3. Auditory model
The approach to predicting speech intelligibility used in HASPI is to compare the output of an auditory model for a degraded test signal with the output for an unprocessed input signal. A detailed description of the auditory model is presented in Kates (2013) and is summarized here. The model is an extension of the Kates and Arehart (2010) auditory model; that model has been shown to give outputs that can be used to produce accurate predictions of speech quality for a wide variety of hearing losses and processing conditions.
The overall model block diagram is presented in Fig. 1. The comparison of the processed and reference signals requires that they be temporally aligned, so the model includes two alignment steps. The first step is a rough
alignment of the broadband signals that removes large delay differences. Each signal then goes through the middle ear and cochlear mechanics models. A second temporal alignment step then removes any remaining timing differences between the reference and processed signals in each frequency band by adjusting the signal delay to maximize the cross-correlation of the signals in each band. The separate signals then go through the inner hair-cell (IHC) model. The last temporal alignment step is compensation for the frequency-dependent group delay of the gammatone filters used in the auditory filterbank. This alignment step is independent of the signal properties since it is purely a function of the filters used for the frequency analysis. In the final processing step the auditory model outputs are converted into the descriptive signal characteristics (e.g. the cepstral correlation and auditory coherence described in Section 4.2) that are used to compare the processed signal with the reference signal.
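The per-band alignment amounts to a search over lags for the maximum cross-correlation; a minimal sketch follows (the lag search range and function name are assumptions, not the model's exact implementation).

```python
import numpy as np

def align_by_xcorr(reference, processed, max_lag):
    # Find the lag that maximizes the cross-correlation between the
    # two signals, then shift the processed signal to compensate
    def xcorr_at(lag):
        if lag >= 0:
            a, b = reference[lag:], processed[:len(processed) - lag]
        else:
            a, b = reference[:len(reference) + lag], processed[-lag:]
        n = min(len(a), len(b))
        return float(np.dot(a[:n], b[:n]))
    best = max(range(-max_lag, max_lag + 1), key=xcorr_at)
    return np.roll(processed, best)
```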
The processing for one signal is shown in the block diagram of Fig. 2. The figure shows the initial processing stages, followed by the processing associated with one frequency band at the outputs of the filter banks. The auditory model starts with sample-rate conversion to 24 kHz, followed by the middle ear filter. The next stage is a linear auditory filterbank, with the filter bandwidths adjusted to reflect the input signal intensity and the increase in filter bandwidth due to outer hair-cell (OHC) damage. Dynamic-range compression is then provided in each frequency band, with the compression controlled by the output in the corresponding frequency band from the control filter bank. The amount of compression is reduced with increasing OHC damage. Hearing loss due to IHC damage is represented as a subsequent attenuation stage, and IHC firing-rate adaptation is also included in the model. For moderate hearing losses, approximately 80% of the total loss given by the audiogram was ascribed to OHC damage (Moore et al., 1999), with the remainder ascribed to IHC damage.
The envelope output in each frequency band comprises the compressed envelope signal after conversion to dB above auditory threshold. The dynamic range of the basilar-membrane vibration signal in each frequency band is compressed using the same control function as for the envelope in that band, so the envelope of the vibration tracks the computed envelope output. The auditory threshold for the vibration signal is represented as a low-level additive white noise. Both the envelope and the vibration outputs are available for modeling speech intelligibility.
The primary purpose of the middle ear model is to reproduce the low-frequency and high-frequency attenuation observed in the equal-loudness contours at low signal levels (Suzuki and Takeshima, 2004). The filter is a 2-pole highpass at 350 Hz in cascade with a 1-pole lowpass at 5000 Hz (Kates, 1991).
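A sketch of such a cascade using Butterworth sections follows; the Butterworth shape is an assumption here, and Kates (1991) specifies the actual filter design.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 24000   # the auditory model's internal sampling rate

# 2-pole highpass at 350 Hz cascaded with a 1-pole lowpass at 5000 Hz
b_hp, a_hp = butter(2, 350, btype="highpass", fs=fs)
b_lp, a_lp = butter(1, 5000, btype="lowpass", fs=fs)

def middle_ear(x):
    # Apply the highpass and lowpass sections in cascade
    return lfilter(b_lp, a_lp, lfilter(b_hp, a_hp, x))
```

The cascade attenuates both very low and very high frequencies while leaving the mid-frequency speech region nearly unchanged.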
The parallel filter bank used for the auditory analysis consists of fourth-order gammatone filters (Cooke, 1991; Patterson et al., 1995; Immerseel and Peeters, 2003). A
Fig. 1. Block diagram showing the reference and processed signal comparison.
Fig. 2. Block diagram of the auditory model used to extract the signals in each frequency band.
total of 32 bands were used to cover the frequency range from 80 to 8000 Hz. Hearing loss due to OHC damage is incorporated into the filter bank as an increase in filter bandwidth (Moore et al., 1999); the bandwidth increase is small at low frequencies and increases with increasing frequency. The filter bandwidth at 8 kHz for maximum loss is four times the normal bandwidth.
The shape of the auditory filters depends on the intensity of the input signal as well as on the degree of hearing loss, with the filters becoming broader as the signal intensity increases. For normal hearing, the filter bandwidth is set to the ERB (Moore and Glasberg, 1983) for intensities below 50 dB SPL. For impaired hearing, the bandwidth at and below 50 dB SPL is set to the bandwidth computed for the amount of OHC damage related to the hearing loss. For both normal and impaired hearing, the bandwidth for intensities at or above 100 dB is set to the widest bandwidth used in the model, which corresponds to maximum OHC
Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearingdx.doi.org/10.1016/j.specom.2014.06.002
damage. Linear interpolation is used for intensities between50 and 100 dB SPL.
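The level-dependent bandwidth rule above can be sketched as follows. The function name and the ERB value used in the example are ours; only the 50/100-dB breakpoints and the 4x maximum come from the text.

```python
def analysis_bandwidth_hz(level_db_spl, bw_50_hz, bw_max_hz):
    """Level-dependent auditory-filter bandwidth (illustrative sketch).

    bw_50_hz : bandwidth at and below 50 dB SPL (the ERB for normal
               hearing, or the broadened value for the modeled OHC loss)
    bw_max_hz: widest bandwidth in the model (maximum OHC damage),
               used at and above 100 dB SPL
    Between 50 and 100 dB SPL the bandwidth is linearly interpolated.
    """
    frac = min(max((level_db_spl - 50.0) / 50.0, 0.0), 1.0)
    return bw_50_hz + frac * (bw_max_hz - bw_50_hz)

# Normal hearing at 8 kHz: the maximum bandwidth is 4x the normal ERB.
erb = 888.0  # approximate ERB at 8 kHz, rough value for illustration only
bw_soft = analysis_bandwidth_hz(40.0, erb, 4.0 * erb)   # quiet input
bw_mid = analysis_bandwidth_hz(75.0, erb, 4.0 * erb)    # halfway point
bw_loud = analysis_bandwidth_hz(110.0, erb, 4.0 * erb)  # intense input
```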
A separate gammatone filter bank controls the dynamic-range compression. The control filter bandwidths are set to correspond to the widest filters in the model, and the filter center frequencies are shifted upward by a small amount. The control filter bandwidths thus match the auditory analysis bandwidths for the maximum hearing loss, and are wider than the auditory analysis filters for reduced hearing loss and normal hearing. Signal power outside the passband of the analysis filter can still be within the passband of the control filter. The control filter will detect this signal power, and the compression rule, described next, will reduce the analysis filter gain. Therefore these wide control filters provide two-tone suppression in the cochlear model (Zhang et al., 2001; Bruce et al., 2003), in which a tone outside the normal filter bandwidth can reduce the output for a tone within the filter passband.
The control signal envelope is the input to the compression rule. The compression gain is then passed through an 800-Hz low-pass filter to approximate the compression time delay observed in the cochlea (Zhang et al., 2001). Inputs within 30 dB of normal auditory threshold (0 dB SPL) receive linear gain. Inputs between 30 and 100 dB SPL are compressed. The system reverts to linear gain for inputs above 100 dB SPL. The compression ratio in the model for normal hearing increases linearly with ERB number from a compression ratio of 1.25:1 at 80 Hz to a compression ratio of 3.5:1 at 8 kHz. This compression behavior is consistent with physiological measurements of compression in the cochlea (Cooper and Rhode, 1997) and with psychophysical estimates of compression in the human ear (Hicks and Bacon, 1999; Plack and Oxenham, 2000).
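The three-segment input/output rule can be sketched as below. The exact way the segments are joined is an assumption of this sketch: the regions and breakpoints come from the text, and the construction enforces the invariance point described in the next paragraph (an input of 100 dB SPL always produces the same output level). The normalized-ERB parameterization of the compression ratio is also ours.

```python
import numpy as np

def compressed_output_db(level_db, cr):
    """Illustrative input/output rule for the model's OHC compression.

    Inputs below 30 dB SPL receive linear gain, inputs between 30 and
    100 dB SPL are compressed with ratio cr, and the response is linear
    again above 100 dB SPL.  The segments are joined continuously, so
    an input of 100 dB SPL maps to a 100-dB output regardless of cr.
    """
    level_db = np.asarray(level_db, dtype=float)
    mid = 100.0 + (level_db - 100.0) / cr            # compressed, slope 1/cr
    low = (100.0 - 70.0 / cr) + (level_db - 30.0)    # linear, slope 1
    return np.where(level_db < 30.0, low,
                    np.where(level_db <= 100.0, mid, level_db))

def band_compression_ratio(frac_erb):
    """Compression ratio vs. normalized ERB number: 0 corresponds to the
    80-Hz band (1.25:1) and 1 to the 8-kHz band (3.5:1)."""
    return 1.25 + (3.5 - 1.25) * frac_erb
```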
OHC damage shifts the auditory threshold and reduces the compression ratio. As a result, the OHC damage produces output levels as a function of input signal intensity that show a pattern similar to the loudness recruitment found in hearing-impaired listeners (Kiessling, 1993). The shifted curves are constructed so that an input of 100 dB SPL in a given frequency band always produces the same output level independent of the amount of OHC damage. In the case of maximum OHC damage, the system is reduced to linear amplification. Intermediate amounts of OHC damage result in an intermediate shift of the compression behavior.
The envelope signal, after dynamic-range compression, is converted to dB above auditory threshold. Normal threshold is used since attenuation due to OHC damage has already been applied to the signals. The hearing loss due to IHC damage is applied as an additional attenuation after the dB SL conversion. The compressed average outputs in dB SL correspond to firing rates in the auditory nerve (Sachs and Abbas, 1974; Yates et al., 1990) averaged over the population of inner hair-cell synapses.
The IHC synapse provides the rapid and short-term adaptation observed in the neural firing rate (Harris and Dallos, 1979; Gorga and Abbas, 1981). The rapid adaptation time constant is 2 ms and the short-term time constant is 60 ms. Compensation for the group delay of the gammatone filters is then applied to the output of the IHC synapse model since adjustment for the filter delay appears to occur higher in the auditory pathway (Wojtczak et al., 2012). The envelope output in each frequency band comprises the compressed envelope signal after conversion to dB above auditory threshold; the BM vibration signal is centered at the carrier frequency for each band and is modified by the same amplitude modulation as the envelope signal.
4. Intelligibility indices
This paper compares HASPI to the CSII and to an envelope-based index based on the STOI. The CSII, which is based on coherence, is described first. This is followed by HASPI, which combines coherence and envelope. The final index described is an envelope-based index motivated by the STOI, but which is adapted for hearing-impaired as well as normal-hearing listeners.
4.1. CSII
The CSII (Kates and Arehart, 2005) estimated the fraction of sentences understood correctly for noisy and distorted speech. To calculate the CSII, the speech was first divided into 16-ms segments having a 50% overlap. Each segment was multiplied by a Hamming window. The power in each segment was computed, and the segments were assigned to one of three levels: low-level (−30 to −10 dB re: RMS), mid-level (−10 to 0 dB re: RMS), and high-level (greater than 0 dB re: RMS), where RMS is the RMS level averaged over the entire utterance. The short-time FFT was computed for each segment. The magnitude-squared coherence (MSC) was then computed in the frequency domain (Carter et al., 1973; Kates, 1992) over the segments in each of the low-, mid-, and high-level groups to give three sets of MSC values as a function of frequency. In the case of the signal being entirely attenuated by the processing, the resultant MSC was set to zero. Each MSC was converted to the signal-to-distortion ratio (SDR), and the SDR was converted to dB. The SII was then computed for each intensity region using the critical-band procedure for 21 bands (ANSI, 1997) to give the three CSII values. The intelligibility index I3 is given by the weighted combination of the CSII values followed by a logistic function transformation.
New weights were computed for the data in this paper. Like the HASPI data fit, the CSII weights were chosen to give a minimum root-mean-squared error fit of the model to the combined datasets. Equal weight was given to the normal-hearing and hearing-impaired listener results, and equal weight was given to each of the four datasets. The modified index is given by:
$$p = -2.623 + 0.0\,\mathrm{CSII}_{\mathrm{Low}} + 9.259\,\mathrm{CSII}_{\mathrm{Mid}} + 0.470\,\mathrm{CSII}_{\mathrm{High}}$$

$$I_3 = \frac{1}{1 + e^{-p}} \qquad (1)$$
4.2. HASPI
The HASPI computation combines an envelope modulation term with auditory coherence terms. Both terms are based on the outputs of the auditory model described in Section 3. The reference signal is the output of the model for normal hearing, with the input having no noise or other degradation. For normal-hearing listeners, the processed signal is the output of the normal-hearing model having the degraded signal as its input. For impaired hearing, the auditory model used for the processed signal is modified to incorporate the hearing loss and the model input includes the amplification used to compensate for the loss.
4.2.1. Cepstral correlation
The cepstral correlation computation is closely related to the procedure used by Kates and Arehart (2010) for predicting speech quality. The envelope samples output by the auditory model, when taken across frequency at a given time slot, constitute a short-time log magnitude spectrum on an auditory frequency scale. The inverse Fourier transform of this log spectrum produces a set of coefficients that are similar to the mel cepstrum (Imai, 1983). In the model, only a small number of cepstrum coefficients are needed, so the cepstrum computation is performed in the frequency domain by fitting the auditory model envelope outputs at each time sample with a set of half-cosine basis functions. These basis functions are very similar to the principal components for the short-time spectra of speech (Zahorian and Rothenberg, 1981) and have been used for accurate machine recognition of both consonants (Nossair and Zahorian, 1991) and vowels (Zahorian and Jagharghi, 1993). The basis functions are given by:
$$b_j(k) = \cos[(j-1)\pi k/(K-1)], \qquad (2)$$
where j is the basis function number and k is the gammatone filter index for frequency bands 0 through K − 1, with K = 32. The first six basis functions are illustrated in Fig. 3.
Let e_k(m) denote the sequence of smoothed sub-sampled envelope samples in frequency band k for the reference signal, and let d_k(m) be the envelope samples for the degraded signal. The envelope smoothing is provided by 16-ms von Hann windows having 50% overlap, giving a lowpass filter cutoff frequency of 62.5 Hz and a smoothed envelope sampling rate of 125 Hz. The reference-signal cepstral sequence p_j(m) and the degraded-signal sequence q_j(m) are then given by:
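The smoothing and sub-sampling step can be sketched as a windowed average. Whether the averaging is applied to the dB envelope values or to linear values is not specified here, so this sketch, with our own function name, simply averages whatever envelope samples it is given under a von Hann window.

```python
import numpy as np

def smooth_envelope(env, fs):
    """Smooth and sub-sample one band's envelope with 16-ms von Hann
    windows at 50% overlap (sketch).  The smoothed envelope sampling
    rate is 125 Hz, matching the text."""
    win_len = int(round(0.016 * fs))
    hop = win_len // 2
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(env) - win_len + 1, hop):
        seg = env[start:start + win_len]
        frames.append(np.sum(window * seg) / np.sum(window))
    return np.array(frames)

fs = 24000
one_second = np.full(fs, 40.0)        # constant 40-dB envelope for 1 s
smoothed = smooth_envelope(one_second, fs)
```

A constant envelope passes through unchanged, and one second of input yields roughly 125 smoothed samples.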
$$p_j(m) = \sum_{k=0}^{K-1} b_j(k)\, e_k(m), \qquad q_j(m) = \sum_{k=0}^{K-1} b_j(k)\, d_k(m). \qquad (3)$$
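Eqs. (2) and (3) can be sketched directly; the function names are ours. The example also illustrates the flat-spectrum limiting case discussed later: a perfectly flat auditory spectrum has no tilt, so its projection onto basis 2 is zero.

```python
import numpy as np

K = 32  # number of gammatone analysis bands

def basis(j, K=K):
    """Half-cosine basis function of Eq. (2): b_j(k) = cos[(j-1)*pi*k/(K-1)],
    with j = 1 the constant basis and k = 0..K-1 the band index."""
    k = np.arange(K)
    return np.cos((j - 1) * np.pi * k / (K - 1))

def cepstral_sequence(env, j):
    """Cepstral sequence of Eq. (3): project a (num_frames, K) array of
    smoothed envelope samples onto basis function j, giving one value
    per time frame."""
    return env @ basis(j)

# A flat spectrum carries no spectral tilt: projection onto basis 2 is zero.
flat = np.ones((3, K))
tilt = cepstral_sequence(flat, 2)
```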
Fig. 3. Cepstral correlation basis functions.
The cepstrum correlation is computed by taking the cross-correlation of the cepstral sequences for the reference and degraded signals. The sequence values corresponding to silences in the reference speech are removed from cepstral sequences p_j(m) and q_j(m), and the mean value is subtracted from each pruned sequence to yield the zero-mean edited sequences p̂_j(m) and q̂_j(m). The silences are detected by converting the log envelopes in each band to linear values and summing the linear values across frequency. The summed values are converted back to dB, and segments having an intensity less than 2.5 dB re: threshold are removed from the correlation calculation. A justification for this approach to silence detection is that the linear values in each frequency band correspond to specific loudness (Moore and Glasberg, 2004; Kates, 2013), and the sum across frequency is thus related to the loudness of the signal (Moore and Glasberg, 2004). Thus segments having a loudness near or below auditory threshold are removed from the calculation.
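The silence-pruning step can be sketched as follows. The 2.5-dB criterion comes from the text; the exact dB-to-linear mapping and the function name are assumptions of this sketch.

```python
import numpy as np

def prune_silences(ref_db, deg_db, thr_db=2.5):
    """Remove frames whose summed reference intensity is below the
    2.5 dB re:threshold criterion (sketch of the pruning step).

    ref_db, deg_db: (num_frames, K) envelopes in dB above threshold.
    The per-band log envelopes are converted to linear values, summed
    across bands, converted back to dB, and frames below thr_db are
    dropped from both signals."""
    linear = 10.0 ** (ref_db / 10.0)
    total_db = 10.0 * np.log10(np.sum(linear, axis=1))
    keep = total_db >= thr_db
    return ref_db[keep], deg_db[keep]

ref = np.array([[20.0] * 4, [-40.0] * 4])  # one loud frame, one silent
deg = np.array([[18.0] * 4, [-35.0] * 4])
ref_kept, deg_kept = prune_silences(ref, deg)
```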
The normalized correlation is then given by:
$$r(j) = \frac{\sum_{m \in \mathrm{Speech}} \hat{p}_j(m)\,\hat{q}_j(m)}{\left[\sum_{m \in \mathrm{Speech}} \hat{p}_j^2(m)\right]^{1/2} \left[\sum_{m \in \mathrm{Speech}} \hat{q}_j^2(m)\right]^{1/2}}. \qquad (4)$$
The average cepstrum correlation is given by the average of the normalized correlation values r(2) through r(6):

$$c = \frac{1}{5} \sum_{j=2}^{6} r(j). \qquad (5)$$
Kates and Arehart (2010) found that a similar calculation led to an index that accurately predicted speech quality ratings.
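Eqs. (4) and (5) together can be sketched as below; the function name is ours, and the sketch assumes the silent frames have already been pruned.

```python
import numpy as np

def cepstral_correlation(p, q):
    """Average normalized cepstral correlation, Eqs. (4)-(5).

    p, q: (num_frames, 6) arrays whose columns hold the reference and
    degraded cepstral sequences for basis functions j = 1..6.  Each
    column is made zero-mean, the normalized cross-correlation r(j) is
    computed for j = 2..6, and the five values are averaged."""
    r = []
    for j in range(1, 6):                     # columns 1..5 are j = 2..6
        ph = p[:, j] - np.mean(p[:, j])       # zero-mean edited sequences
        qh = q[:, j] - np.mean(q[:, j])
        denom = np.sqrt(np.sum(ph ** 2) * np.sum(qh ** 2))
        r.append(np.sum(ph * qh) / denom if denom > 0 else 0.0)
    return float(np.mean(r))

rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 6))
noisy = clean + 0.5 * rng.standard_normal((100, 6))
```

Identical sequences give c = 1; adding noise reduces the correlation, mirroring the reduced cross-covariance described for the noisy sentence.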
The application of the cepstral basis functions to the auditory model envelope output is illustrated in Figs. 4–6. The envelope outputs from the auditory model are plotted in Fig. 4 for the sentence “The boy got into trouble.” Black is the highest dB envelope value and white is auditory threshold. The same sentence, with additive stationary speech-shaped noise at a SNR of 6 dB, is plotted in Fig. 5. The noise fills in the silences in the speech, reduces the spectral contrast, and introduces random variations in the envelopes.
The results of fitting basis functions 2 and 3 to the noise-free and noisy speech are plotted in Fig. 6; the solid lines represent the clean speech and the dashed lines the noisy speech. The cepstral basis functions were fitted to the envelope values for each overlapping 16-ms windowed segment of the speech. Basis function 2 is in the upper panel, and basis function 3 is in the lower panel. Basis function 2 measures spectral tilt. A positive value indicates that the low frequencies of the signal have more energy than the high frequencies, and a negative value indicates the opposite. Thus large negative excursions are associated with the high-frequency bursts at 0.17, 0.52, 0.84, and 0.96 s, and large positive values are associated with the vowels. Basis function 3 measures the central spectral concentration of the signal. A positive value indicates that the energy is concentrated in the lower and higher frequency edges of the spectrum, and a negative value indicates that the energy is concentrated in the mid frequencies of the spectrum.

Fig. 4. Auditory spectrogram showing the envelope outputs from the auditory model for the sentence “The boy got into trouble.”

Fig. 5. Auditory spectrogram for the same sentence combined with speech-shaped stationary noise at a 6-dB SNR.

Fig. 6. Cepstral correlation basis functions 2 and 3 applied to each 16-ms speech segment and plotted as a function of time. The solid line is for the noise-free sentence and the dashed line is for the noisy sentence.
Additive noise flattens the noisy speech spectrum. In the limiting case of a perfectly flat auditory spectrum, all of the basis functions fit to the spectrum will return values of zero. For the 6-dB SNR used in this example, the noise greatly reduces the magnitude of the fluctuations in the curves for both basis functions as compared to the curves for the clean sentence. The result of the noise is therefore a reduction in the cross-covariance of the clean and noisy signals as compared to that for the clean signal with itself.
4.2.2. Auditory coherence
The auditory coherence term is related to the low-, mid-, and high-level signal coherence calculations used by Kates and Arehart (2005) for the CSII intelligibility computation. In the CSII calculation described in Section 4.1, the signal segments were assigned to one of three intensity regions based on the intensity of each segment. That procedure cannot be used for the auditory coherence calculation because the signal intensity at the output of the auditory model has been modified by the OHC dynamic-range compression. Thus the intensity regions used for the CSII are no longer valid at the output of the auditory model. The CSII used a frequency-domain procedure to calculate the coherence (Carter et al., 1973; Kates, 1992), but the coherence can also be calculated in the time domain; the magnitude of the coherence in a narrow frequency band is equivalent to the correlation coefficient (Shaw, 1981).
The basilar membrane output of the auditory model was divided into 16-ms segments having a 50% overlap, with each segment multiplied by a von Hann window. The intensity of the reference and degraded signals and the short-time normalized cross-correlation between them were computed for each segment in each auditory frequency band. The intensity of the vibration output from the auditory model was in dB SL. The intensity in each segment of the reference signal was converted from log to linear amplitude, and the segment intensities summed across frequencies to form a broadband intensity signal. The segments of the reference signal that correspond to silent intervals were identified, and the corresponding segments in the reference and degraded signals were discarded. A cumulative histogram of the intensities of the remaining segments was then created, with segments assigned to either the lowest third, middle third, or upper third of the histogram.

Fig. 7. Normalized BM vibration cross-correlation values computed for each auditory frequency band and 16-ms speech segment.
The short-time normalized cross-correlations for the low-level, mid-level, and high-level segments were then averaged across time and frequency to produce the low-, mid-, and high-level auditory coherence values. Let x_k(m, n) be the BM vibration for the reference signal and y_k(m, n) be the BM vibration for the degraded signal in frequency band k and segment m, with n the sample index within the segment. The signals after being windowed and converted to zero mean are given by x̂_k(m, n) and ŷ_k(m, n). The normalized cross-correlation for segment m in frequency band k is given by:
$$z(m,k) = \max_{s}\left\{ \frac{\sum_n \hat{x}_k(m,n)\,\hat{y}_k(m,n+s)}{\left[\sum_n \hat{x}_k^2(m,n)\right]^{1/2}\left[\sum_n \hat{y}_k^2(m,n)\right]^{1/2}} \right\}, \qquad (6)$$
where the delay s is chosen over the range of −1 to 1 ms to yield the maximum value of the cross-correlation. The values of z(m, k) for the low-intensity segments were averaged to produce the low-level auditory coherence, and the same procedure applied to the mid- and high-level segments produced the mid- and high-level auditory coherence values.
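Eq. (6) for a single segment can be sketched as follows; the function name is ours, and the handling of the segment edges under nonzero lag is an assumption of this sketch.

```python
import numpy as np

def segment_coherence(x, y, max_lag):
    """Short-time auditory coherence for one windowed BM-vibration
    segment, Eq. (6): the normalized cross-correlation between the
    zero-mean reference segment x and degraded segment y, maximized
    over lags up to +/-max_lag samples (about +/-1 ms)."""
    x = x - np.mean(x)
    y = y - np.mean(y)
    denom = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    if denom == 0.0:
        return 0.0
    n = len(x)
    best = -1.0
    for s in range(-max_lag, max_lag + 1):
        if s >= 0:
            num = np.sum(x[:n - s] * y[s:])
        else:
            num = np.sum(x[-s:] * y[:n + s])
        best = max(best, num / denom)
    return best

fs = 24000
t = np.arange(int(0.016 * fs)) / fs                    # one 16-ms segment
seg = np.sin(2 * np.pi * 1000 * t) * np.hanning(len(t))
z_same = segment_coherence(seg, seg, int(0.001 * fs))  # identical signals
```

Identical segments give a coherence of 1 at zero lag; an all-zero degraded segment (complete attenuation) gives 0.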
The normalized cross-correlation values given by z(m, k) are plotted in Fig. 7 for the noisy speech signal shown in Fig. 5 cross-correlated with the clean speech plotted in Fig. 4. Black represents a correlation of 1, and white represents a correlation of 0. The correlation tends towards 1 for the more intense portions of the speech, including the vowels and some onsets. The correlation is close to 0 for the speech silences and the less-intense portions of the sentence.
4.2.3. HASPI model

The intelligibility model is a linear weighting of the cepstrum correlation and the three auditory coherence values, followed by a logistic function transformation. The weights were chosen to give a minimum root-mean-squared error fit of the model to the combined datasets. Equal weight was given to the normal-hearing and hearing-impaired listener results, and equal weight was given to each of the four datasets: noise and distortion, frequency compression, noise suppression, and noise vocoder.
Let c be the computed cepstral correlation value given by Eq. (5). Let a_Low be the low-level auditory coherence value, a_Mid be the mid-level value, and a_High be the high-level value. The HASPI intelligibility index is given by:
$$p = -9.047 + 14.817\,c + 0.0\,a_{\mathrm{Low}} + 0.0\,a_{\mathrm{Mid}} + 4.616\,a_{\mathrm{High}}$$

$$H = \frac{1}{1 + e^{-p}} \qquad (7)$$
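Eq. (7) is a one-line computation once the four component values are available; the function name below is ours.

```python
import math

def haspi(c, a_low, a_mid, a_high):
    """HASPI intelligibility index, Eq. (7): a linear combination of the
    cepstral correlation c and the three auditory coherence values,
    mapped through a logistic function.  The fitted weights for the
    low- and mid-level coherence terms are zero, so only c and the
    high-level coherence actually contribute."""
    p = -9.047 + 14.817 * c + 0.0 * a_low + 0.0 * a_mid + 4.616 * a_high
    return 1.0 / (1.0 + math.exp(-p))
```

For near-perfect processing (c and a_High close to 1) the index approaches 1; with all inputs at zero it falls to about 1e-4.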
4.3. Short-time envelope correlation index (STECI)
The short-time envelope correlation index (STECI) is motivated by the short-time objective intelligibility measure (STOI) of Taal et al. (2011b). The STOI was designed to model the effects of additive noise and ideal binary mask noise suppression on speech. The STOI assumes normal hearing and conversational speech levels since it does not take the auditory threshold or the signal intensity into account. Thus the STOI calculation as presented by Taal et al. (2011b) cannot be used for hearing-impaired listeners because it cannot represent the reduction in audibility that accompanies hearing loss. To overcome this limitation, a new index, the STECI, has been derived based on the short-time averaging approach implemented in the STOI in combination with the auditory model used for HASPI.
To compute STECI, the reference and processed signal envelopes output by the auditory model are smoothed and sub-sampled using the same procedure as used for the cepstral correlation described in Section 4.2.1, giving envelope samples in each auditory band based on windowed 16-ms segments having 50% overlap. There are 32 frequency bands, with center frequencies spanning 80–8000 Hz. Let e_k(m) denote the sequence of smoothed sub-sampled envelope samples in frequency band k for the reference signal, and let d_k(m) be the envelope samples for the degraded signal. Segments corresponding to silences in the reference signal are then pruned from both the reference and processed envelope signals, giving envelopes ẽ_k(m) and d̃_k(m), respectively.
The pruned envelope sequences are grouped into short-time vectors, with each vector comprising the envelopes sampled over a 384-ms analysis interval and having a 50% segment overlap. The short-time vector for the reference signal is given by:
$$\mathbf{E}_k(m) = [\tilde{e}_k(m-N+1),\; \tilde{e}_k(m-N+2),\; \ldots,\; \tilde{e}_k(m)]^T, \qquad (8)$$
where N encompasses the 384-ms analysis length and T denotes transpose. A similar vector D_k(m) is formed for the processed signal. The intermediate intelligibility measure for the analysis interval is given by the normalized cross-correlation:
$$g_{k,m} = \frac{[\mathbf{E}_k(m) - \bar{\mathbf{E}}_k(m)]^T [\mathbf{D}_k(m) - \bar{\mathbf{D}}_k(m)]}{\|\mathbf{E}_k(m) - \bar{\mathbf{E}}_k(m)\|\; \|\mathbf{D}_k(m) - \bar{\mathbf{D}}_k(m)\|}, \qquad (9)$$
where the overbar denotes the average of the corresponding vector. The final step in computing STECI is to form the average of the intermediate intelligibility measures:
$$S = \frac{1}{KM} \sum_{k,m} g_{k,m}, \qquad (10)$$
where K is the number of frequency bands and M is the total number of analysis frames.
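Eqs. (8)–(10) can be sketched together. At the 125-Hz smoothed envelope rate, the 384-ms analysis interval corresponds to 48 envelope samples; the function name and the handling of a zero-variance vector are assumptions of this sketch.

```python
import numpy as np

def steci(ref_env, deg_env, n_frame):
    """STECI sketch following Eqs. (8)-(10): group the pruned envelope
    samples into overlapping short-time vectors (n_frame samples spans
    the 384-ms analysis interval, 50% overlap), compute the normalized
    cross-correlation g_{k,m} for each reference/degraded vector pair,
    and average over all bands and frames.

    ref_env, deg_env: (num_samples, K) pruned envelope sequences."""
    hop = n_frame // 2
    num_samples, K = ref_env.shape
    scores = []
    for k in range(K):
        for start in range(0, num_samples - n_frame + 1, hop):
            e = ref_env[start:start + n_frame, k]
            d = deg_env[start:start + n_frame, k]
            e = e - np.mean(e)
            d = d - np.mean(d)
            denom = np.linalg.norm(e) * np.linalg.norm(d)
            scores.append(float(e @ d) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
env = rng.standard_normal((250, 8))   # 2 s of 125-Hz envelopes, 8 bands
```

An undistorted signal gives S = 1; an inverted one gives S = −1.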
5. Results
Scatter plots for the index predictions are presented in Fig. 8. The open circles represent each processing condition averaged over the NH listeners, while the filled squares give each processing condition averaged over the HI listeners. The diagonal line represents perfect predictions; a point above the line indicates that the model prediction is less than the observed intelligibility, while a point below the line indicates that the model prediction is higher than the observed intelligibility. The correlation coefficient shown in each plot is for the combined NH and HI listener groups.
The plots for the noise and distortion data are presented in sub-plots (a–c) of Fig. 8 for HASPI, CSII, and STECI, respectively. In all three sub-plots, there are a large number of points clustered near (1,1), which indicates perfect intelligibility. These points contribute little to the overall correlation coefficient, which is therefore dominated by the accuracy for the poorer intelligibility conditions. The Pearson correlation coefficient for HASPI using all of the data points is 0.978. When just those points are used for which the HASPI predicted intelligibility is <0.9, the correlation coefficient is 0.971. Both the NH and HI listener predictions show about the same number of points above and below the diagonal line, so there is little apparent bias in the HASPI predictions. The CSII also does well, with a correlation coefficient of 0.972 when all of the data points are used. Most of the CSII predictions lie below the diagonal, which indicates that CSII has a tendency to overestimate the intelligibility. The performance of the STECI for these data is worse than for the other two approaches, with a correlation coefficient of 0.825. For the NH subjects, the STECI underestimates the intelligibility for additive stationary noise and overestimates the intelligibility for center clipping distortion. For the HI subjects, the STECI overestimates intelligibility for peak clipping and for center clipping distortion.
The plots for the frequency compression data are presented in sub-plots (d–f) of Fig. 8. For HASPI and STECI, the majority of the points for the NH listeners are plotted above the diagonal line, while the majority of points for the HI listeners are below the line. These trends in the predictions indicate a slight bias towards underestimating intelligibility for the NH listeners and overestimating intelligibility for the HI listeners. There are also some outliers in the lower right-hand corner of the plot for both HASPI and STECI. These points correspond to speech with no additive noise where the compression cutoff frequency has been set to 1 kHz. These outliers are not present in a scatter plot for the CSII, which suggests that signal changes in the vicinity of 1 kHz that are important for intelligibility are not detected by the cepstral correlation used in HASPI or the envelope correlation used in STECI, but are detected by the coherence; that is, there are important changes in the signal that affect the temporal fine structure but not the envelope. A possible candidate could be changes in the harmonic structure in the vicinity of the first and second formants.
The plots for the ideal binary mask noise suppression data are presented in sub-plots (g–i) of Fig. 8. For all three indices, the points for the NH listeners tend to lie on or below the diagonal line, indicating that all of the indices overestimate the intelligibility for these subjects. The low-intelligibility points for the HI listeners for all three indices are below the line, while the points for the high-intelligibility conditions are above the line, indicating that models designed to fit only the HI data would have a shifted offset and a steeper slope than the models presented here that are fit to all of the subjects. As expected, the STECI gives the highest correlation with the subject intelligibility scores since the STOI, on which STECI is based, was developed to fit this type of signal processing. The spread of the HI points in the sub-plots is less than the spread for the NH points, which is consistent with there being just seven NH subjects in the dataset.
Fig. 8. Intelligibility predictions for the HASPI, CSII, and STECI intelligibility indices for the noise and distortion, frequency compression, and ideal binary mask experiment data. The plotted points and indicated correlation coefficients are averaged over the subjects in the NH and HI groups.

Intelligibility index predictions for the noise vocoder are presented in Fig. 9 along with the NH subject data. The values are presented for people with normal hearing listening to IEEE sentences in a background of multi-talker babble at a SNR of 12 dB. This experimental condition was chosen to illustrate the benefits of including the envelope, as opposed to just the TFS, in formulating the intelligibility index. Similar behavior occurs for the other SNRs used in the experiment and for the hearing-impaired listeners. Intelligibility is plotted as the number of vocoded bands is increased from none to 16. The number of bands is indicated on the plot by the cutoff frequency of the vocoded high-frequency speech region; bands above the cutoff frequency have been noise-vocoded while those below the cutoff frequency contain the unprocessed speech. The intelligibility averaged over the NH subjects is indicated by the diamonds connected by the dot-dash line. There is no apparent trend in the intelligibility scores as the number of vocoded high-frequency bands is increased; the dominant effect is the subject variability. The HASPI prediction is consistent with the noise-vocoder subject results, with a minimal decrease in intelligibility as the noise vocoder cutoff frequency is moved from no vocoding down to 1.6 kHz. The STECI prediction also shows a minimal effect of cutoff frequency, although STECI for this experiment fails to predict the overall high degree of intelligibility achieved by the subjects. The CSII prediction, however, shows a substantial decrease in predicted intelligibility as the amount of vocoding is increased, starting with near-perfect intelligibility for no processing and decreasing to 81% correct when all of the frequency bands above 1.6 kHz are vocoded.
Fig. 9. Intelligibility scores and model predictions for the noise vocoder output at an input SNR of 12 dB, normal-hearing subjects.

The results of fitting the average subject intelligibility scores for the noise and distortion, frequency compression, and noise suppression datasets with the HASPI, CSII, and STECI indices are presented in Tables 1 and 2. Results for using the cepstral correlation alone are also presented. The entries in Table 1 are the Pearson correlation coefficients measured for the indicated combinations of subject group and dataset. The correlation coefficient indicates how well a straight line describes the relationship between the actual and predicted values, even if the line is offset and has a slope that differs from 1. The predictions and subject ratings were averaged over the subjects in each group before computing the correlation coefficients for the processing conditions. All correlation coefficients have p < 0.001. The root-mean-squared (RMS) error computed for each
combination of subject group and dataset are presented in Table 2. The RMS error indicates how closely the predicted values match the actual scores, but does not assume a linear relationship. Again, the values for each of the processing conditions were averaged over the subjects in the group before computing the error. The NH and HI values used in computing the entries in Tables 1 and 2 for the combined NH plus HI groups have been weighted to compensate for the different number of subjects in each group, thus giving equal importance to the NH and HI subjects in computing the average over the two groups. Likewise, the entries for the average over the three processing experiments were weighted to give equal importance to each experiment.

Table 1
Pearson correlation coefficients for perceptual models for the normal-hearing (NH), hearing-impaired (HI), and combined NH and HI subject groups averaged over the subjects. All of the model results are for a minimum mean-squared error (MMSE) fit of the model to the combined NH plus HI subjects using all the data.

Signal processing    Subject group       HASPI   CSII   STECI   Cep Corr
Noise and distort    Normal hearing      .936    .937   .645    .904
                     Hearing impaired    .962    .980   .916    .948
                     NH plus HI          .978    .972   .825    .952
Freq compress        Normal hearing      .964    .940   .904    .954
                     Hearing impaired    .967    .948   .949    .960
                     NH plus HI          .968    .946   .935    .961
Ideal binary mask    Normal hearing      .954    .947   .975    .950
                     Hearing impaired    .992    .982   .992    .985
                     NH plus HI          .978    .968   .988    .973
All 3 processing     NH plus HI          .972    .967   .940    .960

Table 2
RMS errors for perceptual models for the normal-hearing (NH), hearing-impaired (HI), and combined NH and HI subject groups averaged over the subjects. All of the model results are for a minimum mean-squared error (MMSE) fit of the model to the combined NH plus HI subjects using all the data.

Signal processing    Subject group       HASPI   CSII   STECI   Cep Corr
Noise and distort    Normal hearing      .118    .120   .258    .156
                     Hearing impaired    .095    .128   .165    .111
                     NH plus HI          .072    .111   .182    .101
Freq compress        Normal hearing      .107    .133   .158    .117
                     Hearing impaired    .147    .138   .175    .163
                     NH plus HI          .119    .123   .142    .128
Ideal binary mask    Normal hearing      .204    .180   .126    .191
                     Hearing impaired    .065    .142   .108    .082
                     NH plus HI          .121    .133   .091    .129
All 3 processing     NH plus HI          .100    .121   .136    .115

Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearing-Aid Speech Perception Index (HASPI), Speech Comm. (2014), http://dx.doi.org/10.1016/j.specom.2014.06.002

Correlation coefficients for the noise vocoder data are not included in the tables since the intelligibility scores were all very close to 1. The intelligibility scores plotted in Fig. 9 were for an SNR of 12 dB, the poorest SNR considered in the experiment, yet the sentence intelligibility was still above 90% for all vocoder conditions for the NH listeners. Thus neither the SNR nor the number of vocoded frequency bands had an impact on speech intelligibility. Since there are no processing trends in the data, the correlation coefficient would reflect only the subject variability and not the fit of the model to the data.

The Pearson correlation coefficients computed using the individual subject data rather than the averages over the
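The evaluation just described (Pearson correlation against scores averaged over subjects, RMS prediction error, and equal weighting of the groups regardless of subject counts) can be sketched as follows; the data values are hypothetical placeholders, not the paper's results.

```python
import numpy as np

def fit_metrics(predicted, actual):
    """Pearson correlation and RMS error between index predictions and
    subject intelligibility scores (both on a 0-1 proportion scale)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    r = np.corrcoef(predicted, actual)[0, 1]
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    return r, rmse

def equal_weight_average(metric_by_group):
    """Average a metric over groups (e.g. NH and HI), giving each group
    equal weight regardless of how many subjects it contains."""
    return float(np.mean([np.mean(g) for g in metric_by_group]))

# Hypothetical per-condition scores, already averaged over subjects:
pred = [0.95, 0.80, 0.55, 0.30]
act = [0.97, 0.75, 0.50, 0.35]
r, rmse = fit_metrics(pred, act)
```

Note the distinction drawn in the text: a high correlation indicates a linear relationship between predictions and scores, while a low RMS error additionally requires the predictions to lie near the diagonal.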
subject groups are presented in Table 3. The correlation coefficients are lower than for the subject averages presented in Table 1 since the values in Table 3 include the intersubject variability. Significant differences between the correlation coefficients are indicated in Table 4. The Williams t-test (Williams, 1959; Steiger, 1980) was used to test the correlation values for significant differences between pairs of indices. A zero indicates no difference, + indicates that the first index is significantly more accurate than the second at the 5% level, and ++ indicates significance at the 1% level. A − indicates that the first index is significantly less accurate than the second at the 5% level, and −− indicates a difference at the 1% level.

Table 3
Pearson correlation coefficients for perceptual models for the normal-hearing (NH) and hearing-impaired (HI) subject groups. The correlations are computed over the individual subjects.

Signal processing    Subject group       HASPI   CSII    STECI   Cep Corr
Noise and distort    Normal hearing      0.849   0.852   0.594   0.819
                     Hearing impaired    0.874   0.923   0.823   0.862
Freq compress        Normal hearing      0.917   0.895   0.863   0.907
                     Hearing impaired    0.866   0.847   0.833   0.851
Ideal binary mask    Normal hearing      0.929   0.926   0.952   0.926
                     Hearing impaired    0.911   0.833   0.893   0.921

Table 4
Significant differences in the Pearson correlation coefficients presented in Table 3. A zero indicates no difference, + indicates that the first index is significantly more accurate than the second at the 5% level, and ++ indicates significance at the 1% level. A − indicates that the first index is significantly less accurate than the second at the 5% level, and −− indicates a difference at the 1% level.

Signal processing      Subject group  HASPI–CSII  HASPI–STECI  HASPI–Cep Corr  CSII–STECI  CSII–Cep Corr  STECI–Cep Corr
Noise and distortion   NH             0           ++           ++              ++          0              −−
                       HI             0           0            ++              ++          ++             0
Frequency compression  NH             ++          ++           ++              ++          0              −−
                       HI             +           ++           ++              0           0              0
Ideal binary mask      NH             0           −            0               −−          0              ++
                       HI             ++          0            ++              −−          0              ++

In Table 3, the CSII has a higher correlation coefficient than HASPI for the NH and HI noise and distortion data, but the differences are not significant. In comparing the data of Tables 1 and 2, the CSII has a higher correlation coefficient than HASPI for the NH and HI noise and distortion data, but also has a higher RMS error. This difference in performance metrics suggests that the CSII predictions have a more linear relationship with the subject scores, but that the line lies somewhat off the diagonal, which increases the RMS error. The STECI, which is based on correlating the envelopes within each frequency band, has substantially worse performance than either the CSII or HASPI for the NH subjects. For the HI subjects, STECI is significantly worse than CSII, while the difference between STECI and HASPI approaches significance (p = 0.06). The cepstral correlation model without the auditory coherence, which thus depends only on the envelope correlations, is not as accurate as the complete HASPI for the NH and HI subjects, but is more accurate than STECI for the NH listeners.

The results for frequency compression show a consistent advantage for HASPI over the other indices in terms of the correlation coefficient. The HASPI results are significantly better than those for the CSII, STECI, and the cepstral correlation alone. While the CSII also gives high correlation coefficients, they are not as good as those found for HASPI or the cepstral correlation. However, there is no significant difference between the CSII and cepstral correlation accuracy. In terms of the average RMS error, HASPI has the lowest errors for the NH and combined subject groups, while the CSII has the lowest error for the HI group. The largest RMS error for the HI group was found for the STECI. Thus combining an envelope fidelity term with a small amount of temporal fine structure fidelity appears to be the most accurate approach, although the outliers in Fig. 8(a and c) indicate that there is still room for improvement.

The results for the ideal binary mask show a significant advantage for STECI over HASPI for the NH subjects, although there is no significant difference between these two indices for the HI subjects. The advantage for the NH listeners is not surprising since the STOI was designed and optimized for NH ideal binary mask data. HASPI is also significantly better than the CSII and cepstral correlation for the HI subjects, although there is no significant difference in performance for the NH subjects. STECI is also significantly better than the CSII for both groups of listeners.
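The Williams t-test used for the comparisons in Table 4 tests the difference between two correlations that share a variable (both indices are correlated against the same subject scores). Below is a minimal sketch following the Steiger (1980) formulation; the example values are hypothetical, not taken from the tables.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams (1959) t statistic for the difference between two
    dependent correlations r12 and r13 that share variable 1 (here,
    the intelligibility scores); r23 is the correlation between the
    two competing indices, and n is the number of observations.
    The statistic has n - 3 degrees of freedom."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    rbar = 0.5 * (r12 + r13)
    denom = 2.0 * det * (n - 1) / (n - 3) + rbar**2 * (1.0 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1.0 + r23) / denom)

# Hypothetical example: two indices correlated with the same scores
t = williams_t(r12=0.93, r13=0.85, r23=0.90, n=40)
```

The sign of the statistic follows the sign of r12 − r13, so the +/− entries in Table 4 correspond to the direction of the difference.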
When looking at the averaged data in Table 1, all of the indices do well for this dataset. This pattern is also reflected in the RMS errors shown in Table 2, where HASPI has the lowest average error for the HI listeners but STECI has the lowest average error for the NH and combined listener groups.
In Table 4, HASPI has significantly higher correlations than the CSII for three out of the six comparisons, is significantly better than STECI for three out of the six, and is significantly better than the cepstral correlation for five out of the six. The only condition for which HASPI was significantly worse than another index was for STECI applied to the NH ideal binary mask data, and here HASPI still gave good performance. The number of situations where HASPI is significantly better than the other indices, combined with the fact that it is significantly worse than another index in only one situation, makes HASPI a viable approach for estimating intelligibility.
The average index performance is summarized in the last lines of Tables 1 and 2, where the performance has been averaged over processing and subjects. Each of the three types of processing was given equal weight in computing the average, and the NH and HI subjects were also given equal weight. HASPI has the highest correlation coefficient and the lowest RMS error. The CSII has the second-highest average correlation coefficient, while the cepstral correlation has the second-lowest RMS error. HASPI also works well for all of the types of processing considered in this study, while the CSII is especially weak for the noise vocoder dataset, STECI is weak for both the noise and distortion and the frequency compression datasets, and the cepstral correlation is weak for the noise and distortion dataset. These results reinforce the idea that accurate intelligibility predictions can be based either on the envelope or on the temporal fine structure, but the best performance is achieved when both are incorporated into the intelligibility index.
An additional analysis was performed to compare HASPI fitted solely to NH listeners with the index fitted to HI listeners. The HASPI coefficients in Eq. (7) represent the optimum fit to the combined NH and HI listener data. Approximately three-quarters of the weight in the model of Eq. (7) is given to the cepstral correlation, and one-quarter to the high-level auditory coherence. When the modeling approach is applied to just the NH data, the resultant weights are approximately half for the cepstral correlation and half for the auditory coherence. Fitting the model to just the HI data results in full weight for the cepstral correlation and zero weight for the auditory coherence. Thus the parameters for the combined model represent an average of the values that would be used for either listener group alone. The NH-alone model, for correlations computed for the data averaged over the subjects in the group, does slightly better than the combined model for the NH listeners for the noise and distortion dataset (r = 0.948), and gives comparable accuracy for the frequency compression (r = 0.961) and ideal binary mask datasets (r = 0.956). The HI-alone model does better than the combined model
for the HI listeners for the ideal binary mask dataset (r = 0.971), but is not as accurate for the noise and distortion (r = 0.953) and frequency compression (r = 0.941) datasets. Overall, the combined model appears to be as effective as the separate models while being simpler to implement.
6. Discussion
It was proposed that an index based on coherence alone would not perform as well as one incorporating both coherence and envelope modulation when applied to frequency compression, noise suppression, and noise vocoder data. The results presented in this paper mainly support that hypothesis. Both HASPI and the CSII work well for the noise and distortion dataset, with HASPI having slightly better accuracy than the CSII for the predictions when averaged over all of the subjects. Both indices also work well for the ideal binary mask dataset, with HASPI having significantly better accuracy than the CSII for the HI listeners. In addition, for the frequency compression dataset and for the noise vocoder output, HASPI is substantially more accurate than the CSII.
The accuracy of HASPI also compares favorably with that of other intelligibility indices that use envelope modulation. A direct comparison between indices is difficult because HASPI has been fit to different datasets than those used for the other models. Previous experiments include additive noise and ideal binary mask noise suppression, with correlation coefficients for data averaged over the subjects ranging from r = 0.88 to r = 0.96 (Christiansen et al., 2010; Taal et al., 2011b; Gomez et al., 2012).
The combination of envelope and coherence was also compared to two indices based on the envelope alone for the data considered in this paper. HASPI is significantly more accurate than the cepstral correlation index for five out of the six comparisons presented in Table 4. The differences in the averaged correlation coefficients are larger for the noise and distortion data than for the frequency compression or ideal binary mask datasets, but for nearly all conditions adding the coherence information improved the index performance. This behavior indicates that while the envelope carries much of the information needed for speech intelligibility, it does not convey all of the information required.
The relative amounts of coherence and envelope information also appear to depend on the hearing loss. Envelope cues appear to carry a larger perceptual weight than the TFS cues for sentence materials (Smith et al., 2002; Fogerty, 2011) and are more readily available than TFS cues to hearing-impaired listeners (Hopkins et al., 2008; Hopkins and Moore, 2011). The higher relative weight placed on the cepstral correlation compared to the auditory coherence terms in the HASPI calculation is consistent with these experimental studies, as is the reduced weight placed on the coherence term for the version of HASPI fit to just the HI listeners as opposed to the NH listeners.
The noise vocoder predictions are also consistent with the interpretation that the envelope is more important than the TFS for speech materials. The CSII, which is based on the cross-correlation of the reference and degraded signals in each frequency band, is much more sensitive to changes in the TFS than is HASPI. Replacing a band of speech with the output of the noise vocoder preserves much of the envelope, but the cross-correlation of the speech and vocoder output is low due to the differences in the signal TFS. As a result, the CSII predicted a loss in intelligibility as the number of vocoded bands was increased, even though the subject data showed no reduction in intelligibility.
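The envelope-preserving, TFS-destroying character of noise vocoding described above can be sketched for a single band. This is an illustrative simplification (rectify-and-smooth envelope estimation, numpy only), not the vocoder implementation used in the experiment.

```python
import numpy as np

def noise_vocode_band(band, fs, env_cutoff=62.5, seed=0):
    """Replace a band-limited signal with noise carrying (roughly) the
    same temporal envelope: the envelope survives, but the temporal
    fine structure becomes that of the noise carrier."""
    env = np.abs(band)                            # rectify
    win = max(1, int(fs / (2.0 * env_cutoff)))    # crude smoothing window
    env = np.convolve(env, np.ones(win) / win, mode='same')
    rng = np.random.default_rng(seed)
    out = env * rng.standard_normal(band.size)    # modulate the carrier
    # restore the original band RMS level
    out *= np.sqrt(np.mean(band ** 2) / (np.mean(out ** 2) + 1e-12))
    return out

fs = 8000.0
t = np.arange(int(fs)) / fs
band = np.sin(2 * np.pi * 1000.0 * t)             # stand-in band signal
vocoded = noise_vocode_band(band, fs)
```

Because the envelope is retained, an envelope-based term such as the cepstral correlation stays high for such a signal, while a coherence (cross-correlation) measure like the CSII drops.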
When the auditory coherence terms are combined with the cepstral correlation to form HASPI, the optimum weights are 0 for the low- and mid-level coherence values. These auditory coherence weights are in contrast to the weights of the low-, mid-, and high-level CSII components, which give the greatest importance to the mid-level coherence. Kates and Arehart (2005) hypothesized that the mid-level segments primarily comprised consonant–vowel and vowel–consonant transitions, while the high-level segments primarily comprised vowels. Keeping with this interpretation, the inclusion of only the high-level auditory coherence value in HASPI suggests that the cepstral correlation conveys information about consonants and the vowel–consonant and consonant–vowel transitions, and that the high-level auditory coherence adds information about the vowels and formant transitions that may be missed by the cepstral correlation.
The poor performance of STECI for the NH noise and distortion data also suggests that the specific form of the envelope measurements is important for predicting speech intelligibility. STECI measures the accuracy in preserving the envelope modulations within each frequency band. The cepstral correlation used for HASPI is based on the principal components of the short-time spectrum, and measures the fidelity in reproducing the time–frequency modulations of the signal rather than just the temporal modulations within each band. STECI performs well for the ideal binary mask dataset, where the effect of the noise suppression is primarily changes in the envelope modulation within each band, but does poorly for the noise and distortion dataset, where the peak-clipping and center-clipping distortions change the shape of the short-time spectra and generate distortion products across a wide range of frequencies.
These results also indicate that the choice of experimental data is important in developing an index. For many types of signal modification, such as additive noise and nonlinear distortion, both the envelope and temporal fine structure are affected by the processing. Thus an accurate index for the effects of these signal modifications can be developed by measuring either the envelope or the TFS changes. Consistent with this signal behavior, there was no significant difference between the CSII and HASPI accuracy for the noise and distortion data. However, there are conditions, such as the frequency compression data, where the coherence is reduced more than the envelope correlation for the same reduction in intelligibility, and for these conditions the HASPI approach is significantly more accurate. For the ideal binary mask data, it was the STECI that was most accurate. Thus developing an index for just one type of processing does not guarantee that it will work well for all types of processing. The type and amount of noise and distortion in a hearing aid are generally not known a priori, and HASPI has the advantage of giving accurate predictions for all of the processing conditions considered in this study.
The noise-suppression studies cited above also found that the CSII is a poor predictor of intelligibility for binary-masked speech (Christiansen et al., 2010; Taal et al., 2011b). The processing used in those papers matched the binary mask local criterion to the SNR of the noisy signal, while the binary mask experiment used in this paper had a fixed local criterion of 0 dB independent of the SNR. In the limit of a −60 dB SNR, the processing used in the cited papers produced a binary-modulated noise that still approximated the envelope of the clean speech and yielded high intelligibility. However, the CSII, which is based on the signal coherence, gives a near-zero value for the −60 dB SNR condition since the temporal fine structure of the clean and modulated noisy signals is uncorrelated. In the experiment reported in this paper, on the other hand, the local criterion was kept constant at 0 dB independent of the SNR, so the number of 1s in the binary gain pattern decreases as the SNR decreases. In the case of −60 dB SNR and 100 dB attenuation, the processing used in this paper would attenuate the entire signal since all cells would have negative SNRs, giving an inaudible output having no intelligibility. This result is consistent with a CSII of zero. Because of these experimental differences, the CSII would be expected to be an accurate predictor of intelligibility for the binary mask processing used in this paper.
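The fixed local criterion described above can be made concrete with a small sketch on time-frequency power arrays. The 0-dB criterion and 100-dB attenuation follow the discussion, but the function itself is an illustrative simplification of ideal binary mask processing, not the experiment's implementation.

```python
import numpy as np

def ideal_binary_mask_gain(speech_power, noise_power,
                           lc_db=0.0, atten_db=100.0):
    """Binary gain for each time-frequency cell: keep the cell (gain 1)
    when its local SNR meets the local criterion, otherwise attenuate.
    With lc_db fixed at 0 dB, lowering the overall SNR leaves fewer
    cells above criterion, so the gain pattern contains fewer 1s."""
    local_snr_db = 10.0 * np.log10(
        (speech_power + 1e-30) / (noise_power + 1e-30))
    keep = local_snr_db >= lc_db
    return np.where(keep, 1.0, 10.0 ** (-atten_db / 20.0))

# One frame, two frequency cells: speech dominates the first cell,
# noise dominates the second.
speech = np.array([[1.0, 0.01]])
noise = np.array([[0.01, 1.0]])
gain = ideal_binary_mask_gain(speech, noise)
```

In the limiting case described in the text, where every cell has a negative local SNR, every gain equals the attenuation value and the output is inaudible, consistent with a CSII of zero.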
The envelope modulation in HASPI is lowpass filtered at 62.5 Hz. Many languages are characterized by envelope modulation frequencies below 20–30 Hz (Greenberg and Arai, 2004; Souza and Rosen, 2009), and the 62.5-Hz cutoff frequency used in HASPI produced accurate results for English sentences. However, tonal languages such as Mandarin may need higher modulation cutoff frequencies (Chen and Loizou, 2011). Furthermore, Chen et al. (2013) have shown that the CSII is more accurate than an envelope-based metric for predicting the intelligibility of Mandarin sentences corrupted by noise and two-talker interference. Thus extending HASPI to other languages may require further investigation of the envelope-modulation cutoff frequency and the ratio of envelope to TFS information used in the model.
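The role of the modulation cutoff can be inspected by examining the modulation spectrum of a band envelope. The sketch below, with a hypothetical 4-Hz "syllabic" envelope, illustrates why a 62.5-Hz cutoff comfortably retains the dominant modulations of English speech.

```python
import numpy as np

def modulation_spectrum(envelope, fs_env):
    """Magnitude spectrum of a (downsampled) band envelope, showing
    which modulation frequencies a lowpass cutoff would retain."""
    env = envelope - np.mean(envelope)            # drop the DC term
    spec = np.abs(np.fft.rfft(env * np.hanning(env.size)))
    freqs = np.fft.rfftfreq(env.size, d=1.0 / fs_env)
    return freqs, spec

fs_env = 250.0                                    # envelope sample rate, Hz
t = np.arange(1000) / fs_env
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 4.0 * t)  # 4-Hz modulation
freqs, spec = modulation_spectrum(envelope, fs_env)
peak_hz = freqs[np.argmax(spec)]                  # dominant modulation rate
```

For a tonal language, energy at the fundamental-frequency modulation rate would appear at much higher modulation frequencies, which is why a higher cutoff may be needed.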
7. Summary and conclusions
This paper has presented a new index for predicting speech intelligibility. HASPI compares the envelope and TFS outputs of an auditory model for a reference signal to the outputs of the model for a degraded signal. The
model for the reference signal is adjusted for normal hearing, while the model for the degraded signal incorporates the peripheral hearing loss. The auditory model includes the middle-ear transfer function, an auditory filterbank, outer hair-cell dynamic-range compression, two-tone suppression, and adaptation of the inner hair-cell firing rate. Hearing loss causes a shift in auditory threshold, broader auditory filters, a reduction in the dynamic-range compression ratio, and a reduction in the amount of two-tone suppression.
The comparison of the reference and degraded signal envelopes uses cepstral correlation. The short-time log spectrum is approximated using a set of half-cosine basis functions. The amplitude of each basis function fluctuates over time, and the loss of intelligibility is computed using the average of the cross-correlations of the basis functions for the reference and degraded signals. The comparison of the reference and degraded signal temporal behavior uses auditory coherence. The signals are divided into segments, and the normalized cross-correlation is calculated for each segment in each frequency band and averaged over all of the segments in each of three intensity regions. The high-intensity segments were found to be the most useful in combination with the cepstral correlation. The final model is a weighted combination of the cepstral correlation and high-intensity auditory coherence, followed by a logistic function transformation.
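The envelope term and the final combination can be sketched as follows. The half-cosine expansion and the logistic combination follow the description above, but the sketch operates directly on short-time log spectra rather than on the full auditory-model outputs, and the weights in haspi_combine are illustrative placeholders, not the fitted coefficients of Eq. (7).

```python
import numpy as np

def half_cosine_basis(nbands, nbasis):
    """Half-cosine basis functions across the frequency bands,
    a low-order cepstral-style expansion of the log spectrum."""
    k = np.arange(nbands)
    return np.array([np.cos(j * np.pi * k / (nbands - 1))
                     for j in range(nbasis)])

def cepstral_correlation(ref_logspec, deg_logspec, nbasis=6):
    """Average, over basis functions, of the cross-correlation across
    time of the basis-function amplitudes for the reference and
    degraded short-time log spectra (frames x bands arrays)."""
    basis = half_cosine_basis(ref_logspec.shape[1], nbasis)
    ref_amp = ref_logspec @ basis.T               # frames x nbasis
    deg_amp = deg_logspec @ basis.T
    corrs = []
    for j in range(1, nbasis):                    # skip the constant term
        r = ref_amp[:, j] - ref_amp[:, j].mean()
        d = deg_amp[:, j] - deg_amp[:, j].mean()
        denom = np.sqrt(np.sum(r ** 2) * np.sum(d ** 2))
        corrs.append(np.sum(r * d) / denom if denom > 0 else 0.0)
    return float(np.mean(corrs))

def haspi_combine(cep_corr, high_coherence,
                  bias=-5.0, w_cep=7.0, w_coh=2.5):
    """Weighted combination of the envelope and high-level coherence
    terms, followed by a logistic transformation to the 0-1 range.
    The weights here are placeholders, not the published values."""
    x = bias + w_cep * cep_corr + w_coh * high_coherence
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
ref = rng.standard_normal((200, 32))              # hypothetical log-spectra
cc_same = cepstral_correlation(ref, ref)          # identical inputs
p = haspi_combine(cc_same, 0.9)
```

An undegraded signal gives a cepstral correlation of 1, and the logistic mapping converts the weighted sum into a predicted intelligibility proportion.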
The HASPI index was trained using four datasets. An index trained on only one dataset may not work well on other data, so an objective of the present study was to use data from several different experiments to create a model with a greater chance of generalizing to unknown signal-processing or distortion mechanisms. HASPI was found to offer accuracy comparable to that of the CSII for a dataset comprising speech corrupted by additive noise and nonlinear distortion, and to be more accurate than indices based on the envelope alone. The accuracies of HASPI and the CSII were also found to be comparable for NH listeners for noisy speech processed through an ideal binary mask noise-suppression algorithm having a local criterion fixed at 0 dB, although the STECI envelope-modulation index was the most accurate for these data. The indices were also evaluated using a dataset comprising speech with additive babble and processed through frequency compression; HASPI was found to offer superior accuracy to the CSII and STECI for these data. Note that the results for STECI do not necessarily mean that the same results would be found for applying the STOI to data for normal-hearing listeners, given the differences in the auditory models used for the two indices. The indices were also compared for speech having varying amounts of its high-frequency content replaced by the output of a noise vocoder, and HASPI correctly predicted no loss of intelligibility while the CSII incorrectly predicted a substantial loss. When the prediction accuracy was averaged over all of the processing conditions and both subject groups, HASPI was found to have the highest correlation coefficient and the lowest RMS error.
HASPI does not claim to describe how humans process speech when listening through hearing aids. Instead, the index aims to provide a transformed representation space that accommodates as many sources of speech intelligibility degradation as possible. There is no guarantee that HASPI can be generalized to conditions beyond those used to train the model. For example, HASPI was derived using data for monaural headphone listening. Additional research is needed to determine the accuracy of the index for real hearing aids, as opposed to simulated processing, and also for the processing interactions that occur in hearing aids when more than one algorithm is in use. Further research is also needed to deal with the acoustic effects of the head and ear that occur in real-world hearing-aid use, the effects of binaural listening, and the impact of room reverberation. A final consideration is the relative importance of envelope and temporal fine structure for predicting intelligibility for tonal languages such as Mandarin.
8. Uncited references
ANSI (1989) and Kates (2008).
Acknowledgments
The authors thank Dr. Rosalinda Baca for providing the statistical analysis used in this paper. Author JMK was supported by a grant from GN ReSound. Author KHA was supported by an NIH grant (R01 DC60014) and by the grant from GN ReSound.
References
Aguilera Munoz, C.M., Nelson, P.B., Rutledge, J.C., Gago, A., 1999. Frequency lowering processing for listeners with significant hearing loss. In: Electronics, Circuits, and Systems: Proc. ICECS 1999, vol. 2, Pafos, Cyprus, pp. 741–744 (September 5–8, 1999).
Anderson, M.C., 2010. The Role of Temporal Fine Structure in Sound Quality Perception. Ph.D. Thesis, University of Colorado Dept. Speech Lang. Hear. Sciences, 2010.
ANSI S3.6-1989. American National Standard: Specification for Audiometers. American National Standards Institute, New York.
ANSI S3.5-1997. American National Standard: Methods for the Calculation of the Speech Intelligibility Index. American National Standards Institute, New York.
Arehart, K.H., Souza, P., Baca, R., Kates, J.M., 2013a. Working memory, age, and hearing loss: susceptibility to hearing aid distortion. Ear Hear. 34, 251–260.
Arehart, K.H., Souza, P.E., Lunner, T., Pedersen, M.S., Kates, J.M., 2013b. Relationship between distortion and working memory for digital noise-reduction processing in hearing aids. In: Proc. Mtgs. Acoust. (POMA) 19, 050084: Acoust. Soc. Am. 165th Meeting, Montreal, June 2–7, 2013.
Bruce, I.C., Sachs, M.B., Young, E.D., 2003. An auditory–periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113, 369–388.
Byrne, D., Dillon, H., 1986. The National Acoustics Laboratories' (NAL) new procedure for selecting gain and frequency response of a hearing aid. Ear Hear. 7, 257–265.
Carter, G.C., Knapp, C.H., Nuttall, A.H., 1973. Estimation of the magnitude-squared coherence function via overlapped fast Fourier transform processing. IEEE Trans. Audio Electroacoust. 21, 337–344.
Chen, F., Loizou, P.C., 2011. Predicting the intelligibility of vocoded speech. Ear Hear. 32, 331–338.
Chen, F., Guan, T., Wong, L.N., 2013. Effect of temporal fine structure on speech intelligibility modeling. In: Proc. 35th Annual Int. Conf. IEEE-EMBS, Osaka, July 3–7, 2013, pp. 4199–4202.
Ching, T.Y.C., Dillon, H., Byrne, D., 1998. Speech recognition of hearing-impaired listeners: predictions from audibility and the limited role of high-frequency amplification. J. Acoust. Soc. Am. 103, 1128–1140.
Christiansen, C., Pedersen, M.S., Dau, T., 2010. Prediction of speech intelligibility based on an auditory preprocessing model. Speech Commun. 52, 678–692.
Cooke, M., 1991. Modeling Auditory Processing and Organization. PhD Thesis, U. Sheffield, May, 1991.
Cooper, N.P., Rhode, W.S., 1997. Mechanical responses to two-tone distortion products in the apical and basal turns of the mammalian cochlea. J. Neurophysiol. 78, 261–270.
Cosentino, S., Marquardt, T., McAlpine, D., Falk, T.H., 2012. Towards objective measures of speech intelligibility for cochlear implant users in reverberant environments. In: Proc. 11th Int. Conf. on Info. Sci., Sig. Proc., and Their Appl. (ISSPA), Montreal, 2–5 July 2012, pp. 666–671.
Dau, T., Puschel, D., Kohlrausch, A., 1996. A quantitative model of the "effective" signal processing in the auditory system: I. Model structure. J. Acoust. Soc. Am. 99, 3615–3622.
Dudley, H., 1939. Remaking speech. J. Acoust. Soc. Am. 11, 169–177.
Elhilali, M., Chi, T., Shamma, S., 2003. A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Commun. 41, 331–348.
Fogerty, D., 2011. Perceptual weighting of individual and concurrent cues for sentence intelligibility: frequency, envelope, and fine structure. J. Acoust. Soc. Am. 129, 977–988.
Glista, D., Scollie, S., Bagatto, M., Seewald, R., Parsa, V., Johnson, A., 2009. Evaluation of nonlinear frequency compression: clinical outcomes. Int. J. Audiol. 48, 632–644.
Goldsworthy, R.L., Greenberg, J.E., 2004. Analysis of speech-based speech transmission index methods with implications for nonlinear operations. J. Acoust. Soc. Am. 116, 3679–3689.
Gomez, A.M., Schwerin, B., Paliwal, K., 2012. Improving objective intelligibility prediction by combining correlation and coherence based methods with a measure based on the negative distortion ratio. Speech Commun. 54, 503–515.
Gorga, M.P., Abbas, P.J., 1981. AP measurements of short-term adaptation in normal and acoustically traumatized ears. J. Acoust. Soc. Am. 70, 1310–1321.
Greenberg, S., Arai, T., 2004. What are the essential cues for understanding spoken language? IEICE Trans. Inf. and Syst. E87-D, 1059–1070.
Harris, D.M., Dallos, P., 1979. Forward masking of auditory nerve fiber responses. J. Neurophys. 42, 1083–1107.
Hicks, M.L., Bacon, S.P., 1999. Psychophysical measures of auditory nonlinearities as a function of frequency in individuals with normal hearing. J. Acoust. Soc. Am. 105, 326–338.
Hines, A., Harte, N., 2010. Speech intelligibility from image processing. Speech Commun. 52, 736–752.
Hohmann, V., Kollmeier, B., 1995. The effect of multichannel dynamic compression on speech intelligibility. J. Acoust. Soc. Am. 97, 1191–1195.
Holube, I., Kollmeier, B., 1996. Speech intelligibility predictions in hearing-impaired listeners based on a psychoacoustically motivated perception model. J. Acoust. Soc. Am. 100, 1703–1716.
Hopkins, K., Moore, B.C.J., 2011. The effects of age and cochlear hearing loss on temporal fine structure sensitivity, frequency sensitivity, and speech reception in noise. J. Acoust. Soc. Am. 130, 334–349.
Hopkins, K., Moore, B.C.J., Stone, M.A., 2008. Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech. J. Acoust. Soc. Am. 123, 1140–1153.
Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearing-Aid Speech Perception Index (HASPI), Speech Comm. (2014), http://dx.doi.org/10.1016/j.specom.2014.06.002
Houtgast, T., Steeneken, H.J.M., 1971. Evaluation of speech transmission channels by using artificial signals. Acustica 25, 355–367.
Humes, L.E., Dirks, D.D., Bell, T.S., Ahlstrom, C., Kincaid, G.E., 1986. Application of the Articulation Index and the Speech Transmission Index to the recognition of speech by normal-hearing and hearing-impaired listeners. J. Speech Hear. Res. 29, 447–462.
Imai, S., 1983. Cepstral analysis synthesis on the mel frequency scale. In: Proc. IEEE Int. Conf. Acoust. Speech and Sig. Proc., vol. 8, Boston, April 14–16, 1983, pp. 93–96.
Immerseel, L.V., Peeters, S., 2003. Digital implementation of linear gammatone filters: comparison of design methods. Acoust. Res. Lett. Online 4, 59–64.
Kates, J.M., 1991. A time domain digital cochlear model. IEEE Trans. Sig. Proc. 39, 2573–2592.
Kates, J.M., 1992. On using coherence to measure distortion in hearing aids. J. Acoust. Soc. Am. 91, 2236–2244.
Kates, J.M., 2008. Digital Hearing Aids. Plural Publishing, San Diego, CA, ISBN-13: 978-1-59756-317-8, pp. 1–16 (Chapter 1).
Kates, J.M., 2013. An auditory model for intelligibility and quality predictions. Proc. Mtgs. Acoust. (POMA) 19, 050184: Acoust. Soc. Am. 165th Meeting, Montreal, June 2–7, 2013.
Kates, J.M., Arehart, K.H., 2005. Coherence and the speech intelligibility index. J. Acoust. Soc. Am. 117, 2224–2237.
Kates, J.M., Arehart, K.H., 2010. The hearing aid speech quality index (HASQI). J. Audio Eng. Soc. 58, 363–381.
Kiessling, J., 1993. Current approaches to hearing aid evaluation. J. Speech-Lang. Path. Audiol. Monogr. Suppl. 1, 39–49.
Kjems, U., Boldt, J.B., Pedersen, M.S., Wang, D., 2009. Role of mask pattern in intelligibility of ideal binary-masked noisy speech. J. Acoust. Soc. Am. 126, 1415–1426.
Li, N., Loizou, P.C., 2008. Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123, 1673–1682.
Ludvigsen, C., Elberling, C., Keidser, G., Poulsen, T., 1990. Prediction of intelligibility of nonlinearly processed speech. Acta Otolaryngol. Suppl. 469, 190–195.
McAulay, R.J., Quatieri, T.F., 1986. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech and Sig. Proc. ASSP-34, 744–754.
McDermott, H.J., 2011. A technical comparison of digital frequency-lowering algorithms available in two current hearing aids. PLoS One 6 (7), e22358. http://dx.doi.org/10.1371/journal.pone.0022358.
Moore, B.C.J., Glasberg, B.R., 1983. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am. 74, 750–753.
Moore, B.C.J., Glasberg, B.R., 2004. A revised model of loudness perception applied to cochlear hearing loss. Hear. Res. 188, 70–88.
Moore, B.C.J., Vickers, D.A., Plack, C.J., Oxenham, A.J., 1999. Inter-relationship between different psychoacoustic measures assumed to be related to the cochlear active mechanism. J. Acoust. Soc. Am. 106, 2761–2778.
Ng, E.H., Rudner, M., Lunner, T., Pedersen, M.S., Ronnberg, J., 2013. Effects of noise and working memory capacity on memory processing of speech for hearing-aid users. Int. J. Audiol. (in press).
Nilsson, M., Soli, S.D., Sullivan, J., 1994. Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Am. 95, 1085–1099.
Nossair, Z.B., Zahorian, S.A., 1991. Dynamic spectral shape features as acoustic correlates for initial stop consonants. J. Acoust. Soc. Am. 89, 2978–2991.
Patterson, R.D., Allerhand, M.H., Giguere, C., 1995. Time-domain modeling of peripheral auditory processing: a modular architecture and a software platform. J. Acoust. Soc. Am. 98, 1890–1894.
Pavlovic, C.V., Studebaker, G.A., Sherbecoe, R.L., 1986. An articulation index based procedure for predicting the speech recognition performance of hearing-impaired individuals. J. Acoust. Soc. Am. 80, 50–57.
Payton, K., Shrestha, M., 2008. Analysis of short-time speech transmission index algorithms. Proc. Acoustics 2008, Paris, pp. 634–638.
Payton, K.L., Uchanski, R.M., Braida, L.D., 1994. Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. J. Acoust. Soc. Am. 95, 1581–1592.
Plack, C.J., Oxenham, A.J., 2000. Basilar-membrane nonlinearity estimated by pulsation threshold. J. Acoust. Soc. Am. 107, 501–507.
Quatieri, T.F., McAulay, R.J., 1986. Speech transformations based on a sinusoidal representation. IEEE Trans. Acoust. Speech and Sig. Proc. ASSP-34, 1449–1464.
Rosenthal, S., 1969. IEEE: recommended practices for speech quality measurements. IEEE Trans. Audio Electroacoust. 17, 227–246.
Sachs, M.B., Abbas, P.J., 1974. Rate versus level functions for auditory-nerve fibers in cats: tone-burst stimuli. J. Acoust. Soc. Am. 56, 1835–1847.
Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.
Shaw, J.C., 1981. An introduction to the coherence function and its use in EEG signal analysis. J. Med. Eng. Technol. 5, 279–288.
Simpson, A., Hersbach, A.A., McDermott, H.J., 2005. Improvements in speech perception with an experimental nonlinear frequency compression hearing device. Int. J. Audiol. 44, 281–292.
Slaney, M., 1993. An efficient implementation of the Patterson–Holdsworth auditory filter bank. Apple Computer Technical Report #35. Apple Computer Library, Cupertino, CA.
Smith, Z.M., Delgutte, B., Oxenham, A.J., 2002. Chimaeric sounds reveal dichotomies in auditory perception. Nature 416, 87–90.
Souza, P., Rosen, S., 2009. Effects of envelope bandwidth on the intelligibility of sine- and noise-vocoded speech. J. Acoust. Soc. Am. 126, 792–805.
Souza, P., Arehart, K.H., Kates, J.M., Croghan, N.B.H., Gehani, N., 2013. Exploring the limits of frequency lowering. J. Speech Lang. Hear. Res. (in press).
Steeneken, H.J.M., Houtgast, T., 1980. A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67, 318–326.
Steiger, J.H., 1980. Tests for comparing elements of a correlation matrix. Psychol. Bull. 87, 245–251.
Stone, M.A., Fullgrabe, C., Moore, B.C.J., 2008. Benefit of high-rate envelope cues in vocoder processing: effect of number of channels and spectral region. J. Acoust. Soc. Am. 124, 2272–2282.
Suzuki, Y., Takeshima, H., 2004. Equal-loudness-level contours for pure tones. J. Acoust. Soc. Am. 116, 918–933.
Taal, C.H., Hendriks, R.C., Heusdens, R., 2011a. An evaluation of objective measures for intelligibility prediction of time–frequency weighted noisy speech. J. Acoust. Soc. Am. 130, 3013–3027.
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J., 2011b. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Proc. 19, 2125–2136.
Wang, D.L., Kjems, U., Pedersen, M.S., Boldt, J.B., Lunner, T., 2008. Speech perception of noise with binary gains. J. Acoust. Soc. Am. 124, 2303–2307.
Williams, E.J., 1959. The comparison of regression variables. J. Royal Stat. Soc. Ser. B 21, 396–399.
Wojtczak, M., Biem, J.A., Micheyl, C., Oxenham, A.J., 2012. Perception of across-frequency asynchrony and the role of cochlear delay. J. Acoust. Soc. Am. 131, 363–377.
Yates, G.K., Winter, I.M., Robertson, D., 1990. Basilar membrane nonlinearity determines auditory nerve rate-intensity functions and cochlear dynamic range. Hear. Res. 45, 203–220.
Zahorian, S.A., Jagharghi, A.J., 1993. Spectral-shape features versus formants as acoustic correlates for vowels. J. Acoust. Soc. Am. 94, 1966–1982.
Zahorian, S.A., Rothenberg, M., 1981. Principal-components analysis for low-redundancy encoding of speech spectra. J. Acoust. Soc. Am. 69, 832–845.
Zhang, X., Heinz, M.G., Bruce, I.C., Carney, L.H., 2001. A phenomenological model for the response of auditory nerve fibers: I. Nonlinear tuning with compression and suppression. J. Acoust. Soc. Am. 109, 648–670.
Zilany, M., Bruce, I., 2006. Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. J. Acoust. Soc. Am. 120, 1446–1466.