Available online at www.sciencedirect.com
SPECOM 2233 No. of Pages 19, Model 5+
5 July 2014
www.elsevier.com/locate/specom
ScienceDirect
Speech Communication xxx (2014) xxx–xxx
The Hearing-Aid Speech Perception Index (HASPI)
James M. Kates *, Kathryn H. Arehart
Department of Speech Language and Hearing Sciences, University of Colorado, Boulder, CO 80309, USA
Received 23 September 2013; received in revised form 5 May 2014; accepted 23 June 2014
Abstract
This paper presents a new index for predicting speech intelligibility for normal-hearing and hearing-impaired listeners. The Hearing-Aid Speech Perception Index (HASPI) is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine structure outputs of the auditory model for a reference signal to the outputs of the model for the signal under test. The auditory model for the reference signal is set for normal hearing, while the model for the test signal incorporates the peripheral hearing loss. The new index is compared to indices based on measuring the coherence between the reference and test signals and based on measuring the envelope correlation between the two signals. HASPI is found to give accurate intelligibility predictions for a wide range of signal degradations including speech degraded by noise and nonlinear distortion, speech processed using frequency compression, noisy speech processed through a noise-suppression algorithm, and speech where the high frequencies are replaced by the output of a noise vocoder. The coherence and envelope metrics used for comparison give poor performance for at least one of these test conditions.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Speech intelligibility; Intelligibility index; Auditory model; Hearing loss; Hearing aids
1. Introduction
Signal degradations, such as additive noise or nonlinear distortion, can reduce speech intelligibility for both normal-hearing and hearing-impaired listeners, even when hearing aids are used. Hearing aids, in particular, can present a wide range of signal modifications since the input signal may be noisy and the hearing aid may incorporate several nonlinear processing algorithms (Kates, 2008). Hearing aid processing includes dynamic-range compression, in which low-level portions of the signal receive greater amplification than the high-level portions, and the time-varying gain causes distortion of the signal envelope and introduces modulation sidebands. Noise suppression
http://dx.doi.org/10.1016/j.specom.2014.06.002
0167-6393/© 2014 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +1 720 226 1266. E-mail addresses: [email protected] (J.M. Kates), [email protected] (K.H. Arehart).
algorithms attenuate the noisier portions of the noisy speech signal, and like dynamic-range compression modify the signal envelope and introduce modulation sidebands. Frequency compression (Souza et al., in press), in which high-frequency portions of the spectrum are shifted to lower frequencies where a hearing-impaired listener may have better sound thresholds, is also implemented in several hearing aids. The frequency shifting causes inherent distortions including reduced spacing between harmonics, altered spectral peak levels, and modified spectral shape (McDermott, 2011).
Many of these degradation mechanisms simultaneously affect the signal envelope and the signal temporal fine structure (TFS). Additive noise, for example, reduces the envelope modulation depth by filling in the pauses in the speech and also corrupts the TFS of the speech by adding timing jitter corresponding to the random fluctuations of the noise. Peak clipping, which may be used to prevent
unacceptably loud sounds, reduces the signal modulation depth by removing the signal peaks and also modifies the TFS by introducing additional frequency components corresponding to the harmonic distortion products. Thus for many forms of signal degradation, changes to the signal envelope and to the TFS are closely related.
Changes to the signal TFS have been successfully used to predict speech intelligibility. The TFS changes are often measured using the coherence function (Carter et al., 1973; Shaw, 1981; Kates, 1992). In the time domain, the coherence is computed by taking the cross-correlation between a noise-free unprocessed reference signal and the noisy processed signal and dividing by the product of the root-mean-squared (RMS) intensities of the two signals. The magnitude-squared coherence is converted to a signal-to-distortion ratio (SDR), which can be used in a manner similar to the signal-to-noise ratio (SNR) in computing the Speech Intelligibility Index (SII) (ANSI, 1997) to produce the coherence SII (CSII) (Kates and Arehart, 2005).
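As a concrete sketch, the magnitude-squared coherence (MSC) between a reference and a processed signal can be estimated and converted to an SDR as below. The Welch-style estimator from `scipy.signal` and the MSC/(1 − MSC) conversion are illustrative assumptions, not the exact CSII procedure.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs                                  # 1 s of signal
reference = np.sin(2 * np.pi * 440 * t)                 # clean reference
processed = reference + 0.1 * rng.standard_normal(fs)   # noisy "processed" signal

# Magnitude-squared coherence between reference and processed signals
f, msc = coherence(reference, processed, fs=fs, nperseg=256)

# Convert MSC to a signal-to-distortion ratio in dB (assumed form
# SDR = MSC / (1 - MSC); the CSII applies further band weighting)
eps = 1e-10
sdr_db = 10 * np.log10(msc / (1 - msc + eps) + eps)
```

The SDR is high in frequency regions where the processed signal tracks the reference and low where noise or distortion dominates.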
Changes to the signal envelope have also been used to predict speech intelligibility. The original version of the Speech Transmission Index (STI) (Houtgast and Steeneken, 1971; Steeneken and Houtgast, 1980), for example, used bands of amplitude-modulated noise as the probe signals and measured the reduction in signal modulation depth. However, this original version of the STI is not accurate for hearing-aid processing such as dynamic-range compression (Hohmann and Kollmeier, 1995). Speech-based versions of the STI have been developed that are based on estimating the SNR from cross-correlations of the signal envelopes in each frequency band (Ludvigsen et al., 1990; Holube and Kollmeier, 1996; Goldsworthy and Greenberg, 2004; Payton and Shrestha, 2008). An intelligibility index based on averaging envelope correlations for 20-ms speech segments has been developed by Christiansen et al. (2010), and Taal et al. (2011b) have developed the short-time objective intelligibility measure (STOI), which uses envelope correlations computed for 384-ms speech segments. Changes in the envelope time–frequency modulation have also been used as the basis of a speech intelligibility index (Elhilali et al., 2003).
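A minimal sketch of a segment-based envelope correlation of this kind follows; the segment length and the band envelopes are placeholders, not the STOI's exact parameters.

```python
import numpy as np

def segment_envelope_correlation(ref_env, deg_env, seg_len):
    # Average the normalized correlation between reference and degraded
    # band envelopes over consecutive short segments (STOI-style).
    corrs = []
    for start in range(0, len(ref_env) - seg_len + 1, seg_len):
        r = ref_env[start:start + seg_len] - ref_env[start:start + seg_len].mean()
        d = deg_env[start:start + seg_len] - deg_env[start:start + seg_len].mean()
        denom = np.linalg.norm(r) * np.linalg.norm(d)
        if denom > 0:
            corrs.append(float(np.dot(r, d) / denom))
    return float(np.mean(corrs))
```

An undistorted envelope yields a correlation of 1; degradations that flatten or reshape the envelope pull the average toward 0.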
If intelligibility can be predicted using either signal coherence or envelope correlation, is there any reason to prefer one approach over the other? A procedure that combines coherence with changes in the signal envelope may be more robust than one that uses just the coherence because there are several situations where a coherence-based approach can fail. One example where coherence will perform poorly is frequency compression. Frequency compression (Aguilera Munoz et al., 1999; Simpson et al., 2005; Glista et al., 2009) is intended to improve the audibility of high-frequency speech sounds by shifting them to lower frequency regions where listeners with high-frequency hearing loss have better hearing thresholds. However, the cross-correlation between a sinusoid and a frequency-shifted version of the sinusoid will approach zero as the duration of the observation interval is
increased. Thus frequency compression will lead to predictions of lower intelligibility as the amount of frequency shift is increased even if the intelligibility has not actually been affected, and the predicted loss in intelligibility will depend on the size of the speech segments used in computing the intelligibility index.
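This decorrelation is easy to demonstrate numerically; the frequencies and durations below are arbitrary examples.

```python
import numpy as np

def normalized_xcorr(x, y):
    # Zero-lag cross-correlation normalized by the signal norms
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

fs = 16000
f_ref, f_shift = 3000.0, 2900.0   # original and frequency-shifted tones
corrs = {}
for dur in (0.002, 0.5):          # short vs. long observation interval
    t = np.arange(int(fs * dur)) / fs
    corrs[dur] = normalized_xcorr(np.sin(2 * np.pi * f_ref * t),
                                  np.sin(2 * np.pi * f_shift * t))
# The correlation is high over a short interval but collapses toward
# zero as the interval grows, even though the shifted tone may remain
# perfectly audible to a listener.
```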
A second situation where coherence has limitations is for some forms of noise suppression, specifically the ideal binary mask (IBM). In IBM processing, the speech is divided into frequency bands and each band further divided into time segments to produce time–frequency cells. If the SNR in a time–frequency cell is greater than a preset threshold (e.g. 0 dB) the gain for that cell is set to 1; otherwise the cell is attenuated (Wang et al., 2008; Kjems et al., 2009). High intelligibility is found for noisy speech when the ideal mask, computed from the speech and noise with the threshold set to the signal-to-noise ratio, is applied to a signal comprised of noise alone (Wang et al., 2008). The IBM output in this case is amplitude-modulated noise. The cross-correlation between the reference speech and modulated noise is therefore low, and a coherence-based procedure would predict low intelligibility. Poor correlation of the CSII with IBM-processed speech has been reported by Christiansen et al. (2010) and by Taal et al. (2011a, 2011b).
A third example is the noise vocoder (Dudley, 1939; Shannon et al., 1995), in which the speech is replaced by bands of noise having the same envelope modulation as the speech. Excellent intelligibility can be obtained even though the speech TFS has been replaced by the random fluctuations of the noise (Shannon et al., 1995; Stone et al., 2008; Souza and Rosen, 2009; Anderson, 2010). However, a coherence-based calculation will predict lower intelligibility because of the reduction in the cross-correlation between the original speech and the noise vocoder output. Poor correlation of the CSII with noise-vocoded speech has been reported by Cosentino et al. (2012), although Chen and Loizou (2011) found comparable performance between the CSII and an envelope-based version of the STI.
These weaknesses in the use of coherence to predict intelligibility suggest that a procedure that combines coherence with changes in the envelope modulation may be more accurate than one that is based on coherence alone. For example, the results of Gomez et al. (2012) show that combining the CSII with an envelope measurement improves the accuracy in comparison to the CSII alone when predicting speech intelligibility for normal-hearing listeners for speech corrupted by various forms of additive noise.
An additional concern is predicting speech intelligibility for hearing-impaired listeners. An accurate intelligibility index for hearing-aid users has to deal with noisy input signals, the distortion introduced by the hearing-aid processing, and the hearing loss. Hearing loss is most often modeled as a shift in auditory threshold, and this threshold shift has been represented as an increase in the internal auditory noise level in the SII calculation procedure
(Pavlovic et al., 1986; Humes et al., 1986; Payton et al., 1994; Holube and Kollmeier, 1996; Ching et al., 1998; Kates and Arehart, 2005). A similar modification of the hearing threshold has been applied to the STI (Humes et al., 1986; Payton et al., 1994; Holube and Kollmeier, 1996). Limitations in the accuracy of the predictions have led to empirical modifications of the SII, including a “desensitization factor” that increases with increasing hearing loss (Pavlovic et al., 1986) and a frequency-dependent proficiency factor that also depends on the hearing loss (Ching et al., 1998).
A more thorough model of peripheral hearing loss would be expected to yield more accurate intelligibility predictions. An auditory model (Dau et al., 1996) was used by Holube and Kollmeier (1996) for intelligibility predictions, and hearing loss was first implemented as a threshold shift based on the audiogram. Individual adjustments of the filter bandwidths and forward masking time constants were then incorporated into the model, which resulted in a small improvement in the accuracy of the intelligibility predictions for speech in noise. Hines and Harte (2010) also used a cochlear model (Zilany and Bruce, 2006) as an auditory front end for their intelligibility calculations. However, they only present simulation results, so the benefit of their approach in predicting intelligibility for hearing-impaired listeners has not been verified.
The purpose of this paper is to present a new intelligibility index that (1) combines measurements of coherence with measurements of envelope fidelity to give improved accuracy for a wide range of processing conditions, and (2) is accurate for hearing-impaired as well as normal-hearing listeners. The new index, the Hearing-Aid Speech Perception Index (HASPI), uses an auditory model that incorporates aspects of normal and impaired peripheral auditory function (Kates, 2013). The auditory coherence is computed from the modeled basilar-membrane vibration output in each frequency band, and provides a measurement sensitive to the changes in the speech temporal fine structure. The cepstral correlation is computed from the envelope output in each frequency band, and provides a measurement of the fidelity with which the envelope time–frequency modulation has been preserved.
The remainder of the paper starts with a description of the data used to train and evaluate the intelligibility indices. The datasets include noise and nonlinear distortion, frequency compression for speech in babble noise, noisy speech processed using an ideal binary mask noise-suppression algorithm, and speech partially replaced by the output of a noise vocoder; these data are described next. The auditory model used for the new index is then described, followed by a description of how the outputs of the auditory model are combined to produce the new HASPI index. The CSII and an envelope-based index based on the STOI are used as comparisons in the paper. A modified version of the STOI was derived because the STOI as published does not take auditory threshold or hearing loss into account. The revised CSII and modified STOI calculations
are then described. Results are presented for the four different datasets, followed by a discussion of the factors that influence the model accuracy.
2. Intelligibility data
The original CSII was fitted to speech corrupted by noise and distortion (Kates and Arehart, 2005), and those data are described below. The revised CSII and HASPI are fit to four datasets, which comprise the noise and distortion data used for the original CSII plus results from three additional experiments. These additional datasets comprise frequency compression, noise suppression, and noise vocoder data. For all experiments, subjects listened to speech presented monaurally over headphones in a sound booth. It is hypothesized that the CSII may not perform as well as HASPI for these additional datasets.
2.1. Noise and distortion
The noise and distortion data comprises the intelligibility scores reported by Kates and Arehart (2005). Thirteen adult listeners with normal hearing and nine adult listeners with hearing loss of presumed cochlear origin participated in the experiments. The test materials consisted of the Hearing-in-Noise-Test (HINT) sentences (Nilsson et al., 1994). The sentences were digitized at a 44.1 kHz sampling rate and down-sampled to 22.05 kHz to approximate the bandwidth typically found in hearing aids (Kates, 2008). Each test sentence was combined with additive noise, or was subjected to symmetric peak-clipping distortion or symmetric center-clipping distortion.
The additive noise was extracted from the opposite channel of the HINT test compact disc. The noise has the same long-term spectrum as the sentences. SNR values ranged from −5 to 30 dB, and an unprocessed condition was also included. The peak-clipping and center-clipping distortion thresholds were set as a percentage of the cumulative histogram of the magnitudes of the signal samples for each sentence. Peak-clipping thresholds ranged from infinite clipping to no clipping, and center-clipping thresholds ranged from 98% to no clipping. The stimuli were presented to the normal-hearing listeners at an equalized-RMS level of 65 dB SPL. The speech signals were amplified for the individual hearing loss, when present, using the National Acoustics Laboratories-Revised (NAL-R) linear prescriptive formula (Byrne and Dillon, 1986). During the sessions, listeners verbally repeated each sentence after it was presented. The tester then scored the proportion of complete HINT sentences that were correctly repeated by the listener.
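The two clipping operations can be sketched as follows, with the threshold taken from a percentile of the per-sentence magnitude distribution; the percentile values in the example are arbitrary, and the exact threshold settings are those of Kates and Arehart (2005).

```python
import numpy as np

def peak_clip(x, percentile):
    # Clip samples whose magnitude exceeds a threshold drawn from the
    # cumulative histogram (percentile) of the sample magnitudes
    thr = np.percentile(np.abs(x), percentile)
    return np.clip(x, -thr, thr)

def center_clip(x, percentile):
    # Zero out samples whose magnitude falls at or below the threshold
    thr = np.percentile(np.abs(x), percentile)
    return np.where(np.abs(x) <= thr, 0.0, x)
```

Peak clipping flattens the waveform peaks (reducing envelope modulation), while center clipping removes the low-level structure between them.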
2.2. Frequency compression
The frequency-compression data comprises the intelligibility scores for frequency-compressed speech reported by Souza et al. (in press) and Arehart et al. (2013a). Fourteen adult listeners with normal to near-normal hearing and 26
adult listeners with mild-to-moderate high-frequency loss participated in the experiments. The stimuli for the intelligibility tests consisted of low-context IEEE sentences (Rosenthal, 1969) spoken by a female talker. All of the stimuli were digitized at a 44.1 kHz sampling rate and downsampled to 22.05 kHz. The sentences were used in quiet and combined with multi-talker babble at SNRs ranging from 10 to −10 dB in steps of 5 dB. After the addition of the babble, the sentences were processed using frequency compression.
Frequency compression was implemented using sinusoidal modeling (McAulay and Quatieri, 1986). The signal was first divided into low-frequency and high-frequency bands. The low-frequency signal was used without further modification, and sinusoidal modeling was applied to the high-frequency signal. The ten highest peaks in the high-frequency band were selected, and the amplitude and phase of each peak were preserved while the frequencies were reassigned to lower values. Output sinusoids were then synthesized at the shifted frequencies (Quatieri and McAulay, 1986; Aguilera Munoz et al., 1999) and combined with the original low-frequency signal to produce the frequency-compressed output.
The frequency-compression parameters represented the range that might be available in wearable hearing aids, and included three frequency compression ratios (1.5:1, 2:1, and 3:1) and three frequency compression cutoff frequencies (1, 1.5, and 2 kHz). A control condition having no frequency compression was also included. The stimulus level for the normal-hearing subjects was 65 dB SPL. The speech signals were amplified for the individual hearing loss, when present, using NAL-R equalization (Byrne and Dillon, 1986).
Scoring in Souza et al. (in press) was based on keywords correct (5 per sentence for 50 words per condition per listener). For compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct: a correct sentence required that all five keywords be identified correctly; otherwise the sentence was scored as incorrect.
2.3. Ideal binary mask
Arehart et al. (2013b) measured intelligibility scores for noisy speech processed through an ideal binary mask noise-suppression algorithm. The data presented in their paper were obtained from thirty older subjects having mild to moderate hearing losses and from seven younger subjects having normal hearing. The stimuli for the intelligibility tests consisted of low-context IEEE sentences (Rosenthal, 1969) spoken by a female talker. All of the stimuli were digitized at a 44.1-kHz sampling rate and downsampled to 20 kHz for the noise-suppression processing. The sentences were combined with four-talker babble at signal-to-noise ratios of −18 to +12 dB in steps of 6 dB. The sentence level prior to noise suppression was set to 65 dB SPL.
The noisy speech stimuli were processed with a binarymask noise-reduction strategy (Kjems et al., 2009; Ng
et al., in press). The target speech signal, the masker signal, and the target-plus-masker mixture were each separated into frequency bands using analysis filterbanks consisting of 64 gammatone filters (Patterson et al., 1995) with center frequencies equally distributed on the equivalent rectangular bandwidth number (ERBN) scale (Moore and Glasberg, 1983) over the frequency range of 50 to 8000 Hz. This frequency scale corresponds to approximately uniform spacing along the cochlear partition. The processing was done in time frames each having a duration of 20 ms with fifty-percent overlap, resulting in a new frame every 10 ms.
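Equal spacing on the ERB-number scale can be reproduced with a frequency-to-ERB-number mapping and its inverse; the formula below is the widely used Glasberg and Moore version and is an assumption here, since the paper cites the earlier Moore and Glasberg (1983) scale.

```python
import numpy as np

def erb_space(f_low, f_high, n):
    # Center frequencies equally spaced on the ERB-number scale,
    # using E(f) = 21.4 * log10(1 + 0.00437 * f) and its inverse
    e_low = 21.4 * np.log10(1 + 0.00437 * f_low)
    e_high = 21.4 * np.log10(1 + 0.00437 * f_high)
    e = np.linspace(e_low, e_high, n)
    return (10 ** (e / 21.4) - 1) / 0.00437

center_freqs = erb_space(50.0, 8000.0, 64)   # 64 gammatone center frequencies
```

The resulting frequencies are densely packed at the low end and sparse at the high end, matching the approximately uniform spacing along the cochlear partition.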
A time–frequency cell consists of one frame in one frequency band. In each time–frequency cell, the local signal-to-noise ratio was determined from the clean target and separate masker signals. The local SNR was then compared to a local criterion (LC) of 0 dB, resulting in an ideal binary mask decision equal to 1 if the local SNR was above LC, and 0 otherwise. The data of Kjems et al. (2009) indicate that an LC of 0 dB is most effective for SNRs in the range of approximately +5 to −10 dB. Similar to the procedure in Li and Loizou (2008), errors were introduced into the ideal binary mask by randomly flipping a certain percentage (0%, 10%, and 30%) of the time–frequency units either from 0 to 1 or from 1 to 0. The binary patterns were then converted into gain values, where 1s were converted into 0 dB gain and 0s were converted into an attenuation of either 10 dB or 100 dB. The noisy speech signal was then multiplied by the binary gain values to give the processed signal in the frequency domain. The processed signal was then filtered through a time-reversed gammatone filterbank, thereby ensuring a constant processing group delay independent of frequency.
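A simplified sketch of the mask logic is given below; it operates on generic time–frequency magnitudes rather than the paper's 64-band gammatone analysis, and the function and parameter names are illustrative.

```python
import numpy as np

def ideal_binary_mask_gains(speech_tf, masker_tf, lc_db=0.0,
                            flip_frac=0.0, atten_db=100.0, rng=None):
    # Local SNR in each time-frequency cell from the clean target and masker
    rng = rng if rng is not None else np.random.default_rng()
    local_snr_db = 10 * np.log10((np.abs(speech_tf) ** 2 + 1e-12)
                                 / (np.abs(masker_tf) ** 2 + 1e-12))
    mask = (local_snr_db > lc_db).astype(float)   # 1 if local SNR above LC
    # Introduce mask errors by flipping a random fraction of the cells
    flips = rng.random(mask.shape) < flip_frac
    mask[flips] = 1.0 - mask[flips]
    # Convert the binary pattern to gains: 1 -> 0 dB, 0 -> attenuation
    return np.where(mask == 1.0, 1.0, 10 ** (-atten_db / 20))

speech = np.array([[10.0, 0.1]])    # toy magnitudes: 1 frame, 2 bands
masker = np.array([[0.1, 10.0]])
gains = ideal_binary_mask_gains(speech, masker, flip_frac=0.0)
processed = (speech + masker) * gains
```

Cells dominated by the target pass unchanged while masker-dominated cells are heavily attenuated, which is what produces the amplitude-modulated-noise character of IBM output.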
The IBM processing is not practical in a hearing aid since separate noise and speech files must be available. However, the envelope distortion introduced by the IBM algorithm will be similar to that resulting from other noise-suppression strategies that are used in hearing aids, such as spectral subtraction (Kates, 2008). In spectral subtraction, for example, noisy speech cells are also attenuated, and the amount of attenuation is based on the estimate of the instantaneous SNR. The errors added to the IBM processing are related to the errors in the amount of spectral-subtraction attenuation that are introduced by imperfect estimation of the speech and noise levels.
Following the noise-suppression processing, the speech signals were amplified for the individual hearing loss, when present, using NAL-R equalization (Byrne and Dillon, 1986). Scoring in Arehart et al. (2013b) was based on keywords correct; for compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct.
2.4. Noise vocoder
Anderson (2010) obtained intelligibility scores for speech where the high frequencies of noisy speech were replaced by the output of a noise vocoder. Intelligibility
was measured as the number of frequency bands subjected to noise vocoding was increased. Ten subjects with normal hearing and ten subjects with hearing loss participated in the experiments. The test materials were low-context sentences from the IEEE corpus (Rosenthal, 1969) spoken by a male and by a female talker. All stimuli were digitized at 44.1 kHz and were down-sampled to 22.05 kHz. The background noise was multi-talker babble.
The speech was processed without any noise and at SNRs of 18 and 12 dB. The sentences were passed through a bank of 32 band-pass filters with center frequencies distributed on an ERBN filter scale (Slaney, 1993). The signal envelope for each vocoded band was generated via the Hilbert transform. The Gaussian noise used for the noise vocoding was passed through the same linear-phase FIR filterbank as the speech. Two vocoded signals were produced. One signal was produced by multiplying the filtered noise by the speech envelope. For the second signal, the noise envelope fluctuations were first removed by dividing the filtered noise by its own envelope before multiplying it by the speech envelope. As a last processing step, both the speech and noise were passed through the same filters as in the first filtering stage to remove any out-of-band modulation products, and the RMS level of the vocoded signal in each frequency band was matched to that of the original speech.
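For a single band, the vocoding steps (Hilbert envelope, envelope imposition, refiltering, RMS matching) can be sketched as below; the Butterworth band-pass and the 1000–2000 Hz band stand in for one channel of the paper's 32-band linear-phase FIR filterbank.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 22050
t = np.arange(fs) / fs
rng = np.random.default_rng(1)

# One analysis band (illustrative 1000-2000 Hz Butterworth band-pass)
sos = butter(4, [1000, 2000], btype="bandpass", fs=fs, output="sos")
speech_band = sosfiltfilt(sos, np.sin(2 * np.pi * 1500 * t)
                          * (1 + 0.5 * np.sin(2 * np.pi * 4 * t)))
noise_band = sosfiltfilt(sos, rng.standard_normal(fs))

env = np.abs(hilbert(speech_band))   # Hilbert envelope of the speech band
vocoded = noise_band * env           # impose the speech envelope on the noise
vocoded = sosfiltfilt(sos, vocoded)  # refilter: remove out-of-band products
# Match the RMS level of the vocoded band to the original speech band
vocoded *= np.sqrt(np.mean(speech_band ** 2) / np.mean(vocoded ** 2))
```

The vocoded band keeps the speech envelope and level but replaces the speech TFS with the random fine structure of the noise carrier.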
Noise vocoding was applied to the noisy speech starting at the highest frequency bands and proceeding to lower frequencies. The amount of noise vocoding was increased in steps of two frequency bands from no bands vocoded to the 16 highest-frequency bands vocoded. The upper cutoff frequency of the 16-band vocoded condition was 1.6 kHz. The stimulus level for the normal-hearing listeners was 65 dB SPL, and NAL-R amplification (Byrne and Dillon, 1986) was provided for the hearing-impaired listeners. The stimuli were presented monaurally in a sound booth using headphones. Intelligibility was scored in terms of keywords correct. For compatibility with the Kates and Arehart (2005) data, the keywords-correct results were converted to sentences correct.
3. Auditory model
The approach to predicting speech intelligibility used in HASPI is to compare the output of an auditory model for a degraded test signal with the output for an unprocessed input signal. A detailed description of the auditory model is presented in Kates (2013) and is summarized here. The model is an extension of the Kates and Arehart (2010) auditory model; that model has been shown to give outputs that can be used to produce accurate predictions of speech quality for a wide variety of hearing losses and processing conditions.
The overall model block diagram is presented in Fig. 1. The comparison of the processed and reference signals requires that they be temporally aligned, so the model includes two alignment steps. The first step is a rough
alignment of the broadband signals that removes large delay differences. Each signal then goes through the middle ear and cochlear mechanics models. A second temporal alignment step then removes any remaining timing differences between the reference and processed signals in each frequency band by adjusting the signal delay to maximize the cross-correlation of the signals in each band. The separate signals then go through the inner hair-cell (IHC) model. The last temporal alignment step is compensation for the frequency-dependent group delay of the gammatone filters used in the auditory filterbank. This alignment step is independent of the signal properties since it is purely a function of the filters used for the frequency analysis. In the final processing step the auditory model outputs are converted into the descriptive signal characteristics (e.g. the cepstral correlation and auditory coherence described in Section 4.2) that are used to compare the processed signal with the reference signal.
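The per-band alignment amounts to a search over lags for the maximum cross-correlation; a minimal sketch follows (the lag search range and function name are assumptions, not the model's exact implementation).

```python
import numpy as np

def align_by_xcorr(reference, processed, max_lag):
    # Find the lag that maximizes the cross-correlation between the
    # two signals, then shift the processed signal to compensate
    def xcorr_at(lag):
        if lag >= 0:
            a, b = reference[lag:], processed[:len(processed) - lag]
        else:
            a, b = reference[:len(reference) + lag], processed[-lag:]
        n = min(len(a), len(b))
        return float(np.dot(a[:n], b[:n]))
    best = max(range(-max_lag, max_lag + 1), key=xcorr_at)
    return np.roll(processed, best)
```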
The processing for one signal is shown in the block diagram of Fig. 2. The figure shows the initial processing stages, followed by the processing associated with one frequency band at the outputs of the filter banks. The auditory model starts with sample-rate conversion to 24 kHz, followed by the middle ear filter. The next stage is a linear auditory filterbank, with the filter bandwidths adjusted to reflect the input signal intensity and the increase in filter bandwidth due to outer hair-cell (OHC) damage. Dynamic-range compression is then provided in each frequency band, with the compression controlled by the output in the corresponding frequency band from the control filter bank. The amount of compression is reduced with increasing OHC damage. Hearing loss due to IHC damage is represented as a subsequent attenuation stage, and IHC firing-rate adaptation is also included in the model. For moderate hearing losses, approximately 80% of the total loss given by the audiogram was ascribed to OHC damage (Moore et al., 1999), with the remainder ascribed to IHC damage.
The envelope output in each frequency band comprises the compressed envelope signal after conversion to dB above auditory threshold. The dynamic range of the basilar-membrane vibration signal in each frequency band is compressed using the same control function as for the envelope in that band, so the envelope of the vibration tracks the computed envelope output. The auditory threshold for the vibration signal is represented as a low-level additive white noise. Both the envelope and the vibration outputs are available for modeling speech intelligibility.
The primary purpose of the middle ear model is to reproduce the low-frequency and high-frequency attenuation observed in the equal-loudness contours at low signal levels (Suzuki and Takeshima, 2004). The filter is a 2-pole highpass at 350 Hz in cascade with a 1-pole lowpass at 5000 Hz (Kates, 1991).
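A sketch of such a cascade using Butterworth sections follows; the Butterworth shape is an assumption here, and Kates (1991) specifies the actual filter design.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 24000   # the auditory model's internal sampling rate

# 2-pole highpass at 350 Hz cascaded with a 1-pole lowpass at 5000 Hz
b_hp, a_hp = butter(2, 350, btype="highpass", fs=fs)
b_lp, a_lp = butter(1, 5000, btype="lowpass", fs=fs)

def middle_ear(x):
    # Apply the highpass and lowpass sections in cascade
    return lfilter(b_lp, a_lp, lfilter(b_hp, a_hp, x))
```

The cascade attenuates both very low and very high frequencies while leaving the mid-frequency speech region nearly unchanged.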
The parallel filter bank used for the auditory analysis consists of fourth-order gammatone filters (Cooke, 1991; Patterson et al., 1995; Immerseel and Peeters, 2003). A
Fig. 1. Block diagram showing the reference and processed signal comparison.
Fig. 2. Block diagram of the auditory model used to extract the signals in each frequency band.
total of 32 bands were used to cover the frequency range from 80 to 8000 Hz. Hearing loss due to OHC damage is incorporated into the filter bank as an increase in filter bandwidth (Moore et al., 1999); the bandwidth increase is small at low frequencies and increases with increasing frequency. The filter bandwidth at 8 kHz for maximum loss is four times the normal bandwidth.
The shape of the auditory filters depends on the intensity of the input signal as well as on the degree of hearing loss, with the filters becoming broader as the signal intensity increases. For normal hearing, the filter bandwidth is set to the ERB (Moore and Glasberg, 1983) for intensities below 50 dB SPL. For impaired hearing, the bandwidth at and below 50 dB SPL is set to the bandwidth computed for the amount of OHC damage related to the hearing loss. For both normal and impaired hearing, the bandwidth for intensities at or above 100 dB is set to the widest bandwidth used in the model, which corresponds to maximum OHC
Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearingdx.doi.org/10.1016/j.specom.2014.06.002
damage. Linear interpolation is used for intensities between50 and 100 dB SPL.
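The level-dependent bandwidth rule above can be sketched as follows. The function name and the ERB value used in the example are ours; only the 50/100-dB breakpoints and the 4x maximum come from the text.

```python
def analysis_bandwidth_hz(level_db_spl, bw_50_hz, bw_max_hz):
    """Level-dependent auditory-filter bandwidth (illustrative sketch).

    bw_50_hz : bandwidth at and below 50 dB SPL (the ERB for normal
               hearing, or the broadened value for the modeled OHC loss)
    bw_max_hz: widest bandwidth in the model (maximum OHC damage),
               used at and above 100 dB SPL
    Between 50 and 100 dB SPL the bandwidth is linearly interpolated.
    """
    frac = min(max((level_db_spl - 50.0) / 50.0, 0.0), 1.0)
    return bw_50_hz + frac * (bw_max_hz - bw_50_hz)

# Normal hearing at 8 kHz: the maximum bandwidth is 4x the normal ERB.
erb = 888.0  # approximate ERB at 8 kHz, rough value for illustration only
bw_soft = analysis_bandwidth_hz(40.0, erb, 4.0 * erb)   # quiet input
bw_mid = analysis_bandwidth_hz(75.0, erb, 4.0 * erb)    # halfway point
bw_loud = analysis_bandwidth_hz(110.0, erb, 4.0 * erb)  # intense input
```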
A separate gammatone filter bank controls the dynamic-range compression. The control filter bandwidths are set to correspond to the widest filters in the model, and the filter center frequencies are shifted upward by a small amount. The control filter bandwidths thus match the auditory analysis bandwidths for the maximum hearing loss, and are wider than the auditory analysis filters for reduced hearing loss and normal hearing. Signal power outside the passband of the analysis filter can still be within the passband of the control filter. The control filter will detect this signal power, and the compression rule, described next, will reduce the analysis filter gain. Therefore these wide control filters provide two-tone suppression in the cochlear model (Zhang et al., 2001; Bruce et al., 2003), in which a tone outside the normal filter bandwidth can reduce the output for a tone within the filter passband.
The control signal envelope is the input to the compression rule. The compression gain is then passed through an 800-Hz low-pass filter to approximate the compression time delay observed in the cochlea (Zhang et al., 2001). Inputs within 30 dB of normal auditory threshold (0 dB SPL) receive linear gain. Inputs between 30 and 100 dB SPL are compressed. The system reverts to linear gain for inputs above 100 dB SPL. The compression ratio in the model for normal hearing increases linearly with ERB number from a compression ratio of 1.25:1 at 80 Hz to a compression ratio of 3.5:1 at 8 kHz. This compression behavior is consistent with physiological measurements of compression in the cochlea (Cooper and Rhode, 1997) and with psychophysical estimates of compression in the human ear (Hicks and Bacon, 1999; Plack and Oxenham, 2000).
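The three-segment input/output rule can be sketched as below. The exact way the segments are joined is an assumption of this sketch: the regions and breakpoints come from the text, and the construction enforces the invariance point described in the next paragraph (an input of 100 dB SPL always produces the same output level). The normalized-ERB parameterization of the compression ratio is also ours.

```python
import numpy as np

def compressed_output_db(level_db, cr):
    """Illustrative input/output rule for the model's OHC compression.

    Inputs below 30 dB SPL receive linear gain, inputs between 30 and
    100 dB SPL are compressed with ratio cr, and the response is linear
    again above 100 dB SPL.  The segments are joined continuously, so
    an input of 100 dB SPL maps to a 100-dB output regardless of cr.
    """
    level_db = np.asarray(level_db, dtype=float)
    mid = 100.0 + (level_db - 100.0) / cr            # compressed, slope 1/cr
    low = (100.0 - 70.0 / cr) + (level_db - 30.0)    # linear, slope 1
    return np.where(level_db < 30.0, low,
                    np.where(level_db <= 100.0, mid, level_db))

def band_compression_ratio(frac_erb):
    """Compression ratio vs. normalized ERB number: 0 corresponds to the
    80-Hz band (1.25:1) and 1 to the 8-kHz band (3.5:1)."""
    return 1.25 + (3.5 - 1.25) * frac_erb
```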
OHC damage shifts the auditory threshold and reduces the compression ratio. As a result, the OHC damage produces output levels as a function of input signal intensity that show a pattern similar to the loudness recruitment found in hearing-impaired listeners (Kiessling, 1993). The shifted curves are constructed so that an input of 100 dB SPL in a given frequency band always produces the same output level independent of the amount of OHC damage. In the case of maximum OHC damage, the system is reduced to linear amplification. Intermediate amounts of OHC damage result in an intermediate shift of the compression behavior.
The envelope signal, after dynamic-range compression, is converted to dB above auditory threshold. Normal threshold is used since attenuation due to OHC damage has already been applied to the signals. The hearing loss due to IHC damage is applied as an additional attenuation after the dB SL conversion. The compressed average outputs in dB SL correspond to firing rates in the auditory nerve (Sachs and Abbas, 1974; Yates et al., 1990) averaged over the population of inner hair-cell synapses.
The IHC synapse provides the rapid and short-term adaptation observed in the neural firing rate (Harris and Dallos, 1979; Gorga and Abbas, 1981). The rapid adaptation time constant is 2 ms and the short-term time constant is 60 ms. Compensation for the group delay of the gammatone filters is then applied to the output of the IHC synapse model since adjustment for the filter delay appears to occur higher in the auditory pathway (Wojtczak et al., 2012). The envelope output in each frequency band comprises the compressed envelope signal after conversion to dB above auditory threshold; the BM vibration signal is centered at the carrier frequency for each band and is modified by the same amplitude modulation as the envelope signal.
4. Intelligibility indices
This paper compares HASPI to the CSII and to an envelope-based index based on the STOI. The CSII, which is based on coherence, is described first. This is followed by HASPI, which combines coherence and envelope. The final index described is an envelope-based index motivated by the STOI, but which is adapted for hearing-impaired as well as normal-hearing listeners.
4.1. CSII
The CSII (Kates and Arehart, 2005) estimated the fraction of sentences understood correctly for noisy and distorted speech. To calculate the CSII, the speech was first divided into 16-ms segments having a 50% overlap. Each segment was multiplied by a Hamming window. The power in each segment was computed, and the segments were assigned to one of three levels: low-level (−30 to −10 dB re: RMS), mid-level (−10 to 0 dB re: RMS), and high-level (greater than 0 dB re: RMS), where RMS is the RMS level averaged over the entire utterance. The short-time FFT was computed for each segment. The magnitude-squared coherence (MSC) was then computed in the frequency domain (Carter et al., 1973; Kates, 1992) over the segments in each of the low-, mid-, and high-level groups to give three sets of MSC values as a function of frequency. In the case of the signal being entirely attenuated by the processing, the resultant MSC was set to zero. Each MSC was converted to the signal-to-distortion ratio (SDR), and the SDR was converted to dB. The SII was then computed for each intensity region using the critical-band procedure for 21 bands (ANSI, 1997) to give the three CSII values. The intelligibility index I3 is given by the weighted combination of the CSII values followed by a logistic function transformation.
New weights were computed for the data in this paper. Like the HASPI data fit, the CSII weights were chosen to give a minimum root-mean-squared error fit of the model to the combined datasets. Equal weight was given to the normal-hearing and hearing-impaired listener results, and equal weight was given to each of the four datasets. The modified index is given by:
$$p = -2.623 + 0.0\,\mathrm{CSII}_{\mathrm{Low}} + 9.259\,\mathrm{CSII}_{\mathrm{Mid}} + 0.470\,\mathrm{CSII}_{\mathrm{High}}$$

$$I_3 = \frac{1}{1 + e^{-p}} \qquad (1)$$
4.2. HASPI
The HASPI computation combines an envelope modulation term with auditory coherence terms. Both terms are based on the outputs of the auditory model described in Section 3. The reference signal is the output of the model for normal hearing, with the input having no noise or other degradation. For normal-hearing listeners, the processed signal is the output of the normal-hearing model having the degraded signal as its input. For impaired hearing, the auditory model used for the processed signal is modified to incorporate the hearing loss and the model input includes the amplification used to compensate for the loss.
4.2.1. Cepstral correlation
The cepstral correlation computation is closely related to the procedure used by Kates and Arehart (2010) for predicting speech quality. The envelope samples output by the auditory model, when taken across frequency at a given time slot, constitute a short-time log magnitude spectrum on an auditory frequency scale. The inverse Fourier transform of this log spectrum produces a set of coefficients that are similar to the mel cepstrum (Imai, 1983). In the model, only a small number of cepstrum coefficients are needed, so the cepstrum computation is performed in the frequency domain by fitting the auditory model envelope outputs at each time sample with a set of half-cosine basis functions. These basis functions are very similar to the principal components for the short-time spectra of speech (Zahorian and Rothenberg, 1981) and have been used for accurate machine recognition of both consonants (Nossair and Zahorian, 1991) and vowels (Zahorian and Jagharghi, 1993). The basis functions are given by:
$$b_j(k) = \cos[(j-1)\pi k/(K-1)], \qquad (2)$$
where j is the basis function number and k is the gammatone filter index for frequency bands 0 through K − 1, with K = 32. The first six basis functions are illustrated in Fig. 3.
Let e_k(m) denote the sequence of smoothed sub-sampled envelope samples in frequency band k for the reference signal, and let d_k(m) be the envelope samples for the degraded signal. The envelope smoothing is provided by 16-ms von Hann windows having 50% overlap, giving a lowpass filter cutoff frequency of 62.5 Hz and a smoothed envelope sampling rate of 125 Hz. The reference-signal cepstral sequence p_j(m) and the degraded-signal sequence q_j(m) are then given by:
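The smoothing and sub-sampling step can be sketched as a windowed average. Whether the averaging is applied to the dB envelope values or to linear values is not specified here, so this sketch, with our own function name, simply averages whatever envelope samples it is given under a von Hann window.

```python
import numpy as np

def smooth_envelope(env, fs):
    """Smooth and sub-sample one band's envelope with 16-ms von Hann
    windows at 50% overlap (sketch).  The smoothed envelope sampling
    rate is 125 Hz, matching the text."""
    win_len = int(round(0.016 * fs))
    hop = win_len // 2
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(env) - win_len + 1, hop):
        seg = env[start:start + win_len]
        frames.append(np.sum(window * seg) / np.sum(window))
    return np.array(frames)

fs = 24000
one_second = np.full(fs, 40.0)        # constant 40-dB envelope for 1 s
smoothed = smooth_envelope(one_second, fs)
```

A constant envelope passes through unchanged, and one second of input yields roughly 125 smoothed samples.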
$$p_j(m) = \sum_{k=0}^{K-1} b_j(k)\, e_k(m), \qquad q_j(m) = \sum_{k=0}^{K-1} b_j(k)\, d_k(m). \qquad (3)$$
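Eqs. (2) and (3) can be sketched directly; the function names are ours. The example also illustrates the flat-spectrum limiting case discussed later: a perfectly flat auditory spectrum has no tilt, so its projection onto basis 2 is zero.

```python
import numpy as np

K = 32  # number of gammatone analysis bands

def basis(j, K=K):
    """Half-cosine basis function of Eq. (2): b_j(k) = cos[(j-1)*pi*k/(K-1)],
    with j = 1 the constant basis and k = 0..K-1 the band index."""
    k = np.arange(K)
    return np.cos((j - 1) * np.pi * k / (K - 1))

def cepstral_sequence(env, j):
    """Cepstral sequence of Eq. (3): project a (num_frames, K) array of
    smoothed envelope samples onto basis function j, giving one value
    per time frame."""
    return env @ basis(j)

# A flat spectrum carries no spectral tilt: projection onto basis 2 is zero.
flat = np.ones((3, K))
tilt = cepstral_sequence(flat, 2)
```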
Fig. 3. Cepstral correlation basis functions.
The cepstrum correlation is computed by taking the cross-correlation of the cepstral sequences for the reference and degraded signals. The sequence values corresponding to silences in the reference speech are removed from cepstral sequences p_j(m) and q_j(m), and the mean value is subtracted from each pruned sequence to yield the zero-mean edited sequences p̂_j(m) and q̂_j(m). The silences are detected by converting the log envelopes in each band to linear values and summing the linear values across frequency. The summed values are converted back to dB, and segments having an intensity less than 2.5 dB re: threshold are removed from the correlation calculation. A justification for this approach to silence detection is that the linear values in each frequency band correspond to specific loudness (Moore and Glasberg, 2004; Kates, 2013), and the sum across frequency is thus related to the loudness of the signal (Moore and Glasberg, 2004). Thus segments having a loudness near or below auditory threshold are removed from the calculation.
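The silence-pruning step can be sketched as follows. The 2.5-dB criterion comes from the text; the exact dB-to-linear mapping and the function name are assumptions of this sketch.

```python
import numpy as np

def prune_silences(ref_db, deg_db, thr_db=2.5):
    """Remove frames whose summed reference intensity is below the
    2.5 dB re:threshold criterion (sketch of the pruning step).

    ref_db, deg_db: (num_frames, K) envelopes in dB above threshold.
    The per-band log envelopes are converted to linear values, summed
    across bands, converted back to dB, and frames below thr_db are
    dropped from both signals."""
    linear = 10.0 ** (ref_db / 10.0)
    total_db = 10.0 * np.log10(np.sum(linear, axis=1))
    keep = total_db >= thr_db
    return ref_db[keep], deg_db[keep]

ref = np.array([[20.0] * 4, [-40.0] * 4])  # one loud frame, one silent
deg = np.array([[18.0] * 4, [-35.0] * 4])
ref_kept, deg_kept = prune_silences(ref, deg)
```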
The normalized correlation is then given by:
$$r(j) = \frac{\sum_{m \in \mathrm{Speech}} \hat{p}_j(m)\,\hat{q}_j(m)}{\left[\sum_{m \in \mathrm{Speech}} \hat{p}_j^2(m)\right]^{1/2} \left[\sum_{m \in \mathrm{Speech}} \hat{q}_j^2(m)\right]^{1/2}}. \qquad (4)$$
The average cepstrum correlation is given by the average of the normalized correlation values r(2) through r(6):

$$c = \frac{1}{5} \sum_{j=2}^{6} r(j). \qquad (5)$$
Kates and Arehart (2010) found that a similar calculation led to an index that accurately predicted speech quality ratings.
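Eqs. (4) and (5) together can be sketched as below; the function name is ours, and the sketch assumes the silent frames have already been pruned.

```python
import numpy as np

def cepstral_correlation(p, q):
    """Average normalized cepstral correlation, Eqs. (4)-(5).

    p, q: (num_frames, 6) arrays whose columns hold the reference and
    degraded cepstral sequences for basis functions j = 1..6.  Each
    column is made zero-mean, the normalized cross-correlation r(j) is
    computed for j = 2..6, and the five values are averaged."""
    r = []
    for j in range(1, 6):                     # columns 1..5 are j = 2..6
        ph = p[:, j] - np.mean(p[:, j])       # zero-mean edited sequences
        qh = q[:, j] - np.mean(q[:, j])
        denom = np.sqrt(np.sum(ph ** 2) * np.sum(qh ** 2))
        r.append(np.sum(ph * qh) / denom if denom > 0 else 0.0)
    return float(np.mean(r))

rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 6))
noisy = clean + 0.5 * rng.standard_normal((100, 6))
```

Identical sequences give c = 1; adding noise reduces the correlation, mirroring the reduced cross-covariance described for the noisy sentence.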
The application of the cepstral basis functions to the auditory model envelope output is illustrated in Figs. 4–6. The envelope outputs from the auditory model are plotted in Fig. 4 for the sentence “The boy got into trouble.” Black is the highest dB envelope value and white is auditory threshold. The same sentence, with additive stationary speech-shaped noise at a SNR of 6 dB, is plotted in Fig. 5. The noise fills in the silences in the speech, reduces the spectral contrast, and introduces random variations in the envelopes.
The results of fitting basis functions 2 and 3 to the noise-free and noisy speech are plotted in Fig. 6; the solid lines represent the clean speech and the dashed lines the noisy speech. The cepstral basis functions were fitted to the envelope values for each overlapping 16-ms windowed segment of the speech. Basis function 2 is in the upper panel, and basis function 3 is in the lower panel. Basis function 2 measures spectral tilt. A positive value indicates that the low frequencies of the signal have more energy than the high frequencies, and a negative value indicates the opposite. Thus large negative excursions are associated with the high-frequency bursts at 0.17, 0.52, 0.84, and 0.96 s, and large positive values are associated with the vowels. Basis function 3 measures the central spectral concentration of the signal. A positive value indicates that the energy is concentrated in the lower and higher frequency edges of the spectrum, and a negative value indicates that the energy is concentrated in the mid frequencies of the spectrum.

Fig. 4. Auditory spectrogram showing the envelope outputs from the auditory model for the sentence “The boy got into trouble.”

Fig. 5. Auditory spectrogram for the same sentence combined with speech-shaped stationary noise at a 6-dB SNR.

Fig. 6. Cepstral correlation basis functions 2 and 3 applied to each 16-ms speech segment and plotted as a function of time. The solid line is for the noise-free sentence and the dashed line is for the noisy sentence.
Additive noise flattens the noisy speech spectrum. In the limiting case of a perfectly flat auditory spectrum, all of the basis functions fit to the spectrum will return values of zero. For the 6-dB SNR used in this example, the noise greatly reduces the magnitude of the fluctuations in the curves for both basis functions as compared to the curves for the clean sentence. The result of the noise is therefore a reduction in the cross-covariance of the clean and noisy signals as compared to that for the clean signal with itself.
4.2.2. Auditory coherence
The auditory coherence term is related to the low-, mid-, and high-level signal coherence calculations used by Kates and Arehart (2005) for the CSII intelligibility computation. In the CSII calculation described in Section 4.1, the signal segments were assigned to one of three intensity regions based on the intensity of each segment. That procedure cannot be used for the auditory coherence calculation because the signal intensity at the output of the auditory model has been modified by the OHC dynamic-range compression. Thus the intensity regions used for the CSII are no longer valid at the output of the auditory model. The CSII used a frequency-domain procedure to calculate the coherence (Carter et al., 1973; Kates, 1992), but the coherence can also be calculated in the time domain; the magnitude of the coherence in a narrow frequency band is equivalent to the correlation coefficient (Shaw, 1981).
The basilar membrane output of the auditory model was divided into 16-ms segments having a 50% overlap, with each segment multiplied by a von Hann window. The intensity of the reference and degraded signals and the short-time normalized cross-correlation between them were computed for each segment in each auditory frequency band. The intensity of the vibration output from the auditory model was in dB SL. The intensity in each segment of the reference signal was converted from log to linear amplitude, and the segment intensities summed across frequencies to form a broadband intensity signal. The segments of the reference signal that correspond to silent intervals were identified, and the corresponding segments in the reference and degraded signals were discarded. A cumulative histogram of the intensities of the remaining segments was then created, with segments assigned to either the lowest third, middle third, or upper third of the histogram.

Fig. 7. Normalized BM vibration cross-correlation values computed for each auditory frequency band and 16-ms speech segment.
The short-time normalized cross-correlations for the low-level, mid-level, and high-level segments were then averaged across time and frequency to produce the low-, mid-, and high-level auditory coherence values. Let x_k(m, n) be the BM vibration for the reference signal and y_k(m, n) be the BM vibration for the degraded signal in frequency band k and segment m, with n the sample index within the segment. The signals after being windowed and converted to zero mean are given by x̂_k(m, n) and ŷ_k(m, n). The normalized cross-correlation for segment m in frequency band k is given by:
$$z(m,k) = \max_{s}\left\{ \frac{\sum_n \hat{x}_k(m,n)\,\hat{y}_k(m,n+s)}{\left[\sum_n \hat{x}_k^2(m,n)\right]^{1/2}\left[\sum_n \hat{y}_k^2(m,n)\right]^{1/2}} \right\}, \qquad (6)$$
where the delay s is chosen over the range of −1 to 1 ms to yield the maximum value of the cross-correlation. The values of z(m, k) for the low-intensity segments were averaged to produce the low-level auditory coherence, and the same procedure applied to the mid- and high-level segments produced the mid- and high-level auditory coherence values.
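Eq. (6) for a single segment can be sketched as follows; the function name is ours, and the handling of the segment edges under nonzero lag is an assumption of this sketch.

```python
import numpy as np

def segment_coherence(x, y, max_lag):
    """Short-time auditory coherence for one windowed BM-vibration
    segment, Eq. (6): the normalized cross-correlation between the
    zero-mean reference segment x and degraded segment y, maximized
    over lags up to +/-max_lag samples (about +/-1 ms)."""
    x = x - np.mean(x)
    y = y - np.mean(y)
    denom = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    if denom == 0.0:
        return 0.0
    n = len(x)
    best = -1.0
    for s in range(-max_lag, max_lag + 1):
        if s >= 0:
            num = np.sum(x[:n - s] * y[s:])
        else:
            num = np.sum(x[-s:] * y[:n + s])
        best = max(best, num / denom)
    return best

fs = 24000
t = np.arange(int(0.016 * fs)) / fs                    # one 16-ms segment
seg = np.sin(2 * np.pi * 1000 * t) * np.hanning(len(t))
z_same = segment_coherence(seg, seg, int(0.001 * fs))  # identical signals
```

Identical segments give a coherence of 1 at zero lag; an all-zero degraded segment (complete attenuation) gives 0.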
The normalized cross-correlation values given by z(m, k) are plotted in Fig. 7 for the noisy speech signal shown in Fig. 5 cross-correlated with the clean speech plotted in Fig. 4. Black represents a correlation of 1, and white represents a correlation of 0. The correlation tends towards 1 for the more intense portions of the speech, including the vowels and some onsets. The correlation is close to 0 for the speech silences and the less-intense portions of the sentence.
4.2.3. HASPI model

The intelligibility model is a linear weighting of the cepstrum correlation and the three auditory coherence values, followed by a logistic function transformation. The weights were chosen to give a minimum root-mean-squared error fit of the model to the combined datasets. Equal weight was given to the normal-hearing and hearing-impaired listener results, and equal weight was given to each of the four datasets: noise and distortion, frequency compression, noise suppression, and noise vocoder.
Let c be the computed cepstral correlation value given by Eq. (5). Let a_Low be the low-level auditory coherence value, a_Mid be the mid-level value, and a_High be the high-level value. The HASPI intelligibility index is given by:
$$p = -9.047 + 14.817\,c + 0.0\,a_{\mathrm{Low}} + 0.0\,a_{\mathrm{Mid}} + 4.616\,a_{\mathrm{High}}$$

$$H = \frac{1}{1 + e^{-p}} \qquad (7)$$
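Eq. (7) is a one-line computation once the four component values are available; the function name below is ours.

```python
import math

def haspi(c, a_low, a_mid, a_high):
    """HASPI intelligibility index, Eq. (7): a linear combination of the
    cepstral correlation c and the three auditory coherence values,
    mapped through a logistic function.  The fitted weights for the
    low- and mid-level coherence terms are zero, so only c and the
    high-level coherence actually contribute."""
    p = -9.047 + 14.817 * c + 0.0 * a_low + 0.0 * a_mid + 4.616 * a_high
    return 1.0 / (1.0 + math.exp(-p))
```

For near-perfect processing (c and a_High close to 1) the index approaches 1; with all inputs at zero it falls to about 1e-4.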
4.3. Short-time envelope correlation index (STECI)
The short-time envelope correlation index (STECI) is motivated by the short-time objective intelligibility measure (STOI) of Taal et al. (2011b). The STOI was designed to model the effects of additive noise and ideal binary mask noise suppression on speech. The STOI assumes normal hearing and conversational speech levels since it does not take the auditory threshold or the signal intensity into account. Thus the STOI calculation as presented by Taal et al. (2011b) cannot be used for hearing-impaired listeners because it cannot represent the reduction in audibility that accompanies hearing loss. To overcome this limitation, a new index, the STECI, has been derived based on the short-time averaging approach implemented in the STOI in combination with the auditory model used for HASPI.
To compute STECI, the reference and processed signal envelopes output by the auditory model are smoothed and sub-sampled using the same procedure as used for the cepstral correlation described in Section 4.2.1, giving envelope samples in each auditory band based on windowed 16-ms segments having 50% overlap. There are 32 frequency bands, with center frequencies spanning 80–8000 Hz. Let e_k(m) denote the sequence of smoothed sub-sampled envelope samples in frequency band k for the reference signal, and let d_k(m) be the envelope samples for the degraded signal. Segments corresponding to silences in the reference signal are then pruned from both the reference and processed envelope signals, giving envelopes ẽ_k(m) and d̃_k(m), respectively.
The pruned envelope sequences are grouped into short-time vectors, with each vector comprising the envelopes sampled over a 384-ms analysis interval and having a 50% segment overlap. The short-time vector for the reference signal is given by:
$$\mathbf{E}_k(m) = [\tilde{e}_k(m-N+1),\; \tilde{e}_k(m-N+2),\; \ldots,\; \tilde{e}_k(m)]^T, \qquad (8)$$
where N encompasses the 384-ms analysis length and T denotes transpose. A similar vector D_k(m) is formed for the processed signal. The intermediate intelligibility measure for the analysis interval is given by the normalized cross-correlation:
$$g_{k,m} = \frac{[\mathbf{E}_k(m) - \bar{\mathbf{E}}_k(m)]^T [\mathbf{D}_k(m) - \bar{\mathbf{D}}_k(m)]}{\|\mathbf{E}_k(m) - \bar{\mathbf{E}}_k(m)\|\; \|\mathbf{D}_k(m) - \bar{\mathbf{D}}_k(m)\|}, \qquad (9)$$
where the overbar denotes the average of the corresponding vector. The final step in computing STECI is to form the average of the intermediate intelligibility measures:
$$S = \frac{1}{KM} \sum_{k,m} g_{k,m}, \qquad (10)$$
where K is the number of frequency bands and M is the total number of analysis frames.
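Eqs. (8)–(10) can be sketched together. At the 125-Hz smoothed envelope rate, the 384-ms analysis interval corresponds to 48 envelope samples; the function name and the handling of a zero-variance vector are assumptions of this sketch.

```python
import numpy as np

def steci(ref_env, deg_env, n_frame):
    """STECI sketch following Eqs. (8)-(10): group the pruned envelope
    samples into overlapping short-time vectors (n_frame samples spans
    the 384-ms analysis interval, 50% overlap), compute the normalized
    cross-correlation g_{k,m} for each reference/degraded vector pair,
    and average over all bands and frames.

    ref_env, deg_env: (num_samples, K) pruned envelope sequences."""
    hop = n_frame // 2
    num_samples, K = ref_env.shape
    scores = []
    for k in range(K):
        for start in range(0, num_samples - n_frame + 1, hop):
            e = ref_env[start:start + n_frame, k]
            d = deg_env[start:start + n_frame, k]
            e = e - np.mean(e)
            d = d - np.mean(d)
            denom = np.linalg.norm(e) * np.linalg.norm(d)
            scores.append(float(e @ d) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
env = rng.standard_normal((250, 8))   # 2 s of 125-Hz envelopes, 8 bands
```

An undistorted signal gives S = 1; an inverted one gives S = −1.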
5. Results
Scatter plots for the index predictions are presented in Fig. 8. The open circles represent each processing condition averaged over the NH listeners, while the filled squares give each processing condition averaged over the HI listeners. The diagonal line represents perfect predictions; a point above the line indicates that the model prediction is less than the observed intelligibility, while a point below the line indicates that the model prediction is higher than the observed intelligibility. The correlation coefficient shown in each plot is for the combined NH and HI listener groups.
The plots for the noise and distortion data are presented in sub-plots (a–c) of Fig. 8 for HASPI, CSII, and STECI, respectively. In all three sub-plots, there are a large number of points clustered near (1,1), which indicates perfect intelligibility. These points contribute little to the overall correlation coefficient, which is therefore dominated by the accuracy for the poorer intelligibility conditions. The Pearson correlation coefficient for HASPI using all of the data points is 0.978. When just those points are used for which the HASPI predicted intelligibility is <0.9, the correlation coefficient is 0.971. Both the NH and HI listener predictions show about the same number of points above and below the diagonal line, so there is little apparent bias in the HASPI predictions. The CSII also does well, with a correlation coefficient of 0.972 when all of the data points are used. Most of the CSII predictions lie below the diagonal, which indicates that CSII has a tendency to overestimate the intelligibility. The performance of the STECI for these data is worse than for the other two approaches, with a correlation coefficient of 0.825. For the NH subjects, the STECI underestimates the intelligibility for additive stationary noise and overestimates the intelligibility for center clipping distortion. For the HI subjects, the STECI overestimates intelligibility for peak clipping and for center clipping distortion.
The plots for the frequency compression data are presented in sub-plots (d–f) of Fig. 8. For HASPI and STECI, the majority of the points for the NH listeners are plotted above the diagonal line, while the majority of points for the HI listeners are below the line. These trends in the predictions indicate a slight bias towards underestimating intelligibility for the NH listeners and overestimating intelligibility for the HI listeners. There are also some outliers in the lower right-hand corner of the plot for both HASPI and STECI. These points correspond to speech with no additive noise where the compression cutoff frequency has been set to 1 kHz. These outliers are not present in a scatter plot for the CSII, which suggests that signal changes in the vicinity of 1 kHz that are important for intelligibility are not detected by the cepstral correlation used in HASPI or the envelope correlation used in STECI, but are detected by the coherence; that is, there are important changes in the signal that affect the temporal fine structure but not the envelope. A possible candidate could be changes in the harmonic structure in the vicinity of the first and second formants.
The plots for the ideal binary mask noise suppression data are presented in sub-plots (g–i) of Fig. 8. For all three indices, the points for the NH listeners tend to lie on or below the diagonal line, indicating that all of the indices overestimate the intelligibility for these subjects. The low-intelligibility points for the HI listeners for all three indices are below the line, while the points for the high-intelligibility conditions are above the line, indicating that models designed to fit only the HI data would have a shifted offset and a steeper slope than the models presented here that are fit to all of the subjects. As expected, the STECI gives the highest correlation with the subject intelligibility scores since the STOI, on which STECI is based, was developed to fit this type of signal processing. The spread of the HI points in the sub-plots is less than the spread for the NH points, which is consistent with there being just seven NH subjects in the dataset.
Fig. 8. Intelligibility predictions for the HASPI, CSII, and STECI intelligibility indices for the noise and distortion, frequency compression, and ideal binary mask experiment data. The plotted points and indicated correlation coefficients are averaged over the subjects in the NH and HI groups.

Intelligibility index predictions for the noise vocoder are presented in Fig. 9 along with the NH subject data. The values are presented for people with normal hearing listening to IEEE sentences in a background of multi-talker babble at a SNR of 12 dB. This experimental condition was chosen to illustrate the benefits of including the envelope, as opposed to just the TFS, in formulating the intelligibility index. Similar behavior occurs for the other SNRs used in the experiment and for the hearing-impaired listeners. Intelligibility is plotted as the number of vocoded bands is increased from none to 16. The number of bands is indicated on the plot by the cutoff frequency of the vocoded high-frequency speech region; bands above the cutoff frequency have been noise-vocoded while those below the cutoff frequency contain the unprocessed speech. The intelligibility averaged over the NH subjects is indicated by the diamonds connected by the dot-dash line. There is no apparent trend in the intelligibility scores as the number of vocoded high-frequency bands is increased; the dominant effect is the subject variability. The HASPI prediction is consistent with the noise-vocoder subject results, with a minimal decrease in intelligibility as the noise vocoder cutoff frequency is moved from no vocoding down to 1.6 kHz. The STECI prediction also shows a minimal effect of cutoff frequency, although STECI for this experiment fails to predict the overall high degree of intelligibility achieved by the subjects. The CSII prediction, however, shows a substantial decrease in predicted intelligibility as the amount of vocoding is increased, starting with near-perfect intelligibility for no processing and decreasing to 81% correct when all of the frequency bands above 1.6 kHz are vocoded.
Fig. 9. Intelligibility scores and model predictions for the noise vocoder output at an input SNR of 12 dB, normal-hearing subjects.

The results of fitting the average subject intelligibility scores for the noise and distortion, frequency compression, and noise suppression datasets with the HASPI, CSII, and STECI indices are presented in Tables 1 and 2. Results for using the cepstral correlation alone are also presented. The entries in Table 1 are the Pearson correlation coefficients measured for the indicated combinations of subject group and dataset. The correlation coefficient indicates how well a straight line describes the relationship between the actual and predicted values, even if the line is offset and has a slope that differs from 1. The predictions and subject ratings were averaged over the subjects in each group before computing the correlation coefficients for the processing conditions. All correlation coefficients have p < 0.001. The root-mean-squared (RMS) error computed for each
combination of subject group and dataset are presented in Table 2. The RMS error indicates how closely the predicted values match the actual scores, but does not assume a linear relationship. Again, the values for each of the processing conditions were averaged over the subjects in the group before computing the error. The NH and HI values used in computing the entries in Tables 1 and 2 for the combined NH plus HI groups have been weighted to compensate for the different number of subjects in each group, thus giving equal importance to the NH and HI subjects in computing the average over the two groups. Likewise, the entries for the average over the three processing experiments were weighted to give equal importance to each experiment.

Table 1
Pearson correlation coefficients for perceptual models for the normal-hearing (NH), hearing-impaired (HI), and combined NH and HI subject groups averaged over the subjects. All of the model results are for a minimum mean-squared error (MMSE) fit of the model to the combined NH plus HI subjects using all the data.

Signal processing    Subject group       HASPI   CSII   STECI   Cep Corr
Noise and distort    Normal hearing      .936    .937   .645    .904
                     Hearing impaired    .962    .980   .916    .948
                     NH plus HI          .978    .972   .825    .952
Freq compress        Normal hearing      .964    .940   .904    .954
                     Hearing impaired    .967    .948   .949    .960
                     NH plus HI          .968    .946   .935    .961
Ideal binary mask    Normal hearing      .954    .947   .975    .950
                     Hearing impaired    .992    .982   .992    .985
                     NH plus HI          .978    .968   .988    .973
All 3 processing     NH plus HI          .972    .967   .940    .960

Table 2
RMS errors for perceptual models for the normal-hearing (NH), hearing-impaired (HI), and combined NH and HI subject groups averaged over the subjects. All of the model results are for a minimum mean-squared error (MMSE) fit of the model to the combined NH plus HI subjects using all the data.

Signal processing    Subject group       HASPI   CSII   STECI   Cep Corr
Noise and distort    Normal hearing      .118    .120   .258    .156
                     Hearing impaired    .095    .128   .165    .111
                     NH plus HI          .072    .111   .182    .101
Freq compress        Normal hearing      .107    .133   .158    .117
                     Hearing impaired    .147    .138   .175    .163
                     NH plus HI          .119    .123   .142    .128
Ideal binary mask    Normal hearing      .204    .180   .126    .191
                     Hearing impaired    .065    .142   .108    .082
                     NH plus HI          .121    .133   .091    .129
All 3 processing     NH plus HI          .100    .121   .136    .115

Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearing-Aid Speech Perception Index (HASPI), Speech Comm. (2014), http://dx.doi.org/10.1016/j.specom.2014.06.002

Correlation coefficients for the noise vocoder data are not included in the tables since the intelligibility scores were all very close to 1. The intelligibility scores plotted in Fig. 9 were for an SNR of 12 dB, the poorest SNR considered in the experiment, yet the sentence intelligibility was still above 90% for all vocoder conditions for the NH listeners. Thus neither the SNR nor the number of vocoded frequency bands had an impact on speech intelligibility. Since there are no processing trends in the data, the correlation coefficient would reflect only the subject variability and not the fit of the model to the data.

The Pearson correlation coefficients computed using the individual subject data rather than the averages over the
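The evaluation just described (Pearson correlation against scores averaged over subjects, RMS prediction error, and equal weighting of the groups regardless of subject counts) can be sketched as follows; the data values are hypothetical placeholders, not the paper's results.

```python
import numpy as np

def fit_metrics(predicted, actual):
    """Pearson correlation and RMS error between index predictions and
    subject intelligibility scores (both on a 0-1 proportion scale)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    r = np.corrcoef(predicted, actual)[0, 1]
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    return r, rmse

def equal_weight_average(metric_by_group):
    """Average a metric over groups (e.g. NH and HI), giving each group
    equal weight regardless of how many subjects it contains."""
    return float(np.mean([np.mean(g) for g in metric_by_group]))

# Hypothetical per-condition scores, already averaged over subjects:
pred = [0.95, 0.80, 0.55, 0.30]
act = [0.97, 0.75, 0.50, 0.35]
r, rmse = fit_metrics(pred, act)
```

Note the distinction drawn in the text: a high correlation indicates a linear relationship between predictions and scores, while a low RMS error additionally requires the predictions to lie near the diagonal.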
subject groups are presented in Table 3. The correlation coefficients are lower than for the subject averages presented in Table 1 since the values in Table 3 include the intersubject variability. Significant differences between the correlation coefficients are indicated in Table 4. The Williams t-test (Williams, 1959; Steiger, 1980) was used to test the correlation values for significant differences between pairs of indices. A zero indicates no difference, + indicates that the first index is significantly more accurate than the second at the 5% level, and ++ indicates significance at the 1% level. A − indicates that the first index is significantly less accurate than the second at the 5% level, and −− indicates a difference at the 1% level.

Table 3
Pearson correlation coefficients for perceptual models for the normal-hearing (NH) and hearing-impaired (HI) subject groups. The correlations are computed over the individual subjects.

Signal processing    Subject group       HASPI   CSII    STECI   Cep Corr
Noise and distort    Normal hearing      0.849   0.852   0.594   0.819
                     Hearing impaired    0.874   0.923   0.823   0.862
Freq compress        Normal hearing      0.917   0.895   0.863   0.907
                     Hearing impaired    0.866   0.847   0.833   0.851
Ideal binary mask    Normal hearing      0.929   0.926   0.952   0.926
                     Hearing impaired    0.911   0.833   0.893   0.921

Table 4
Significant differences in the Pearson correlation coefficients presented in Table 3. A zero indicates no difference, + indicates that the first index is significantly more accurate than the second at the 5% level, and ++ indicates significance at the 1% level. A − indicates that the first index is significantly less accurate than the second at the 5% level, and −− indicates a difference at the 1% level.

Signal processing      Subject group  HASPI–CSII  HASPI–STECI  HASPI–Cep Corr  CSII–STECI  CSII–Cep Corr  STECI–Cep Corr
Noise and distortion   NH             0           ++           ++              ++          0              −−
                       HI             0           0            ++              ++          ++             0
Frequency compression  NH             ++          ++           ++              ++          0              −−
                       HI             +           ++           ++              0           0              0
Ideal binary mask      NH             0           −            0               −−          0              ++
                       HI             ++          0            ++              −−          0              ++

In Table 3, the CSII has a higher correlation coefficient than HASPI for the NH and HI noise and distortion data, but the differences are not significant. In comparing the data of Tables 1 and 2, the CSII has a higher correlation coefficient than HASPI for the NH and HI noise and distortion data, but also has a higher RMS error. This difference in performance metrics suggests that the CSII predictions have a more linear relationship with the subject scores, but that the line lies somewhat off the diagonal, which increases the RMS error. The STECI, which is based on correlating the envelopes within each frequency band, has substantially worse performance than either the CSII or HASPI for the NH subjects. For the HI subjects, STECI is significantly worse than CSII, while the difference between STECI and HASPI approaches significance (p = 0.06). The cepstral correlation model without the auditory coherence, which thus depends only on the envelope correlations, is not as accurate as the complete HASPI for the NH and HI subjects, but is more accurate than STECI for the NH listeners.

The results for frequency compression show a consistent advantage for HASPI over the other indices in terms of the correlation coefficient. The HASPI results are significantly better than those for the CSII, STECI, and the cepstral correlation alone. While the CSII also gives high correlation coefficients, they are not as good as those found for HASPI or the cepstral correlation. However, there is no significant difference between the CSII and cepstral correlation accuracy. In terms of the average RMS error, HASPI has the lowest errors for the NH and combined subject groups, while the CSII has the lowest error for the HI group. The largest RMS error for the HI group was found for the STECI. Thus combining an envelope fidelity term with a small amount of temporal fine structure fidelity appears to be the most accurate approach, although the outliers in Fig. 8(a and c) indicate that there is still room for improvement.

The results for the ideal binary mask show a significant advantage for STECI over HASPI for the NH subjects, although there is no significant difference between these two indices for the HI subjects. The advantage for the NH listeners is not surprising since the STOI was designed and optimized for NH ideal binary mask data. HASPI is also significantly better than the CSII and cepstral correlation for the HI subjects, although there is no significant difference in performance for the NH subjects. STECI is also significantly better than the CSII for both groups of listeners.
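The Williams t-test used for the comparisons in Table 4 tests the difference between two correlations that share a variable (both indices are correlated against the same subject scores). Below is a minimal sketch following the Steiger (1980) formulation; the example values are hypothetical, not taken from the tables.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams (1959) t statistic for the difference between two
    dependent correlations r12 and r13 that share variable 1 (here,
    the intelligibility scores); r23 is the correlation between the
    two competing indices, and n is the number of observations.
    The statistic has n - 3 degrees of freedom."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    rbar = 0.5 * (r12 + r13)
    denom = 2.0 * det * (n - 1) / (n - 3) + rbar**2 * (1.0 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1.0 + r23) / denom)

# Hypothetical example: two indices correlated with the same scores
t = williams_t(r12=0.93, r13=0.85, r23=0.90, n=40)
```

The sign of the statistic follows the sign of r12 − r13, so the +/− entries in Table 4 correspond to the direction of the difference.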
When looking at the averaged data in Table 1, all of the indices do well for this dataset. This pattern is also reflected in the RMS errors shown in Table 2, where HASPI has the lowest average error for the HI listeners but STECI has the lowest average error for the NH and combined listener groups.
In Table 4, HASPI has significantly higher correlations than the CSII for three out of the six comparisons, is significantly better than STECI for three out of the six, and is significantly better than the cepstral correlation for five out of the six. The only condition for which HASPI was significantly worse than another index was for STECI applied to the NH ideal binary mask data, and here HASPI still gave good performance. The number of situations where HASPI is significantly better than the other indices, combined with the fact that it is significantly worse than another index in only one situation, makes HASPI a viable approach for estimating intelligibility.
The average index performance is summarized in the last lines of Tables 1 and 2, where the performance has been averaged over processing and subjects. Each of the three types of processing was given equal weight in computing the average, and the NH and HI subjects were also given equal weight. HASPI has the highest correlation coefficient and the lowest RMS error. The CSII has the second-highest average correlation coefficient, while the cepstral correlation has the second-lowest RMS error. HASPI also works well for all of the types of processing considered in this study, while the CSII is especially weak for the noise vocoder dataset, STECI is weak for both the noise and distortion and the frequency compression datasets, and the cepstral correlation is weak for the noise and distortion dataset. These results reinforce the idea that accurate intelligibility predictions can be based either on the envelope or on the temporal fine structure, but the best performance is achieved when both are incorporated into the intelligibility index.
An additional analysis was performed to compare HASPI fitted solely to NH listeners with the index fitted to HI listeners. The HASPI coefficients in Eq. (7) represent the optimum fit to the combined NH and HI listener data. Approximately three-quarters of the weight in the model of Eq. (7) is given to the cepstral correlation, and one-quarter to the high-level auditory coherence. When the modeling approach is applied to just the NH data, the resultant weights are approximately half for the cepstral correlation and half for the auditory coherence. Fitting the model to just the HI data results in full weight for the cepstral correlation and zero weight for the auditory coherence. Thus the parameters for the combined model represent an average of the values that would be used for either listener group alone. The NH-alone model, for correlations computed for the data averaged over the subjects in the group, does slightly better than the combined model for the NH listeners for the noise and distortion dataset (r = 0.948), and gives comparable accuracy for the frequency compression (r = 0.961) and ideal binary mask datasets (r = 0.956). The HI-alone model does better than the combined model
for the HI listeners for the ideal binary mask dataset (r = 0.971), but is not as accurate for the noise and distortion (r = 0.953) and frequency compression (r = 0.941) datasets. Overall, the combined model appears to be as effective as the separate models while being simpler to implement.
6. Discussion
It was proposed that an index based on coherence alone would not perform as well as one incorporating both coherence and envelope modulation when applied to frequency compression, noise suppression, and noise vocoder data. The results presented in this paper mainly support that hypothesis. Both HASPI and the CSII work well for the noise and distortion dataset, with HASPI having slightly better accuracy than the CSII for the predictions when averaged over all of the subjects. Both indices also work well for the ideal binary mask dataset, with HASPI having significantly better accuracy than the CSII for the HI listeners. In addition, for the frequency compression dataset and for the noise vocoder output, HASPI is substantially more accurate than the CSII.
The accuracy of HASPI also compares favorably with that of other intelligibility indices that use envelope modulation. A direct comparison between indices is difficult because HASPI has been fit to different datasets than those used for the other models. Previous experiments include additive noise and ideal binary mask noise suppression, with correlation coefficients for data averaged over the subjects ranging from r = 0.88 to r = 0.96 (Christiansen et al., 2010; Taal et al., 2011b; Gomez et al., 2012).
The combination of envelope and coherence was also compared to two indices based on the envelope alone for the data considered in this paper. HASPI is significantly more accurate than the cepstral correlation index for five out of the six comparisons presented in Table 4. The differences in the averaged correlation coefficients are larger for the noise and distortion data than for the frequency compression or ideal binary mask datasets, but for nearly all conditions adding the coherence information improved the index performance. This behavior indicates that while the envelope carries much of the information needed for speech intelligibility, it does not convey all of the information required.
The relative amounts of coherence and envelope information also appear to depend on the hearing loss. Envelope cues appear to carry a larger perceptual weight than the TFS cues for sentence materials (Smith et al., 2002; Fogerty, 2011) and are more readily available than TFS cues to hearing-impaired listeners (Hopkins et al., 2008; Hopkins and Moore, 2011). The higher relative weight placed on the cepstral correlation compared to the auditory coherence terms in the HASPI calculation is consistent with these experimental studies, as is the reduced weight placed on the coherence term for the version of HASPI fit to just the HI listeners as opposed to the NH listeners.
The noise vocoder predictions are also consistent with the interpretation that the envelope is more important than the TFS for speech materials. The CSII, which is based on the cross-correlation of the reference and degraded signals in each frequency band, is much more sensitive to changes in the TFS than is HASPI. Replacing a band of speech with the output of the noise vocoder preserves much of the envelope, but the cross-correlation of the speech and vocoder output is low due to the differences in the signal TFS. As a result, the CSII predicted a loss in intelligibility as the number of vocoded bands was increased, even though the subject data showed no reduction in intelligibility.
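The envelope-preserving, TFS-destroying character of noise vocoding described above can be sketched for a single band. This is an illustrative simplification (rectify-and-smooth envelope estimation, numpy only), not the vocoder implementation used in the experiment.

```python
import numpy as np

def noise_vocode_band(band, fs, env_cutoff=62.5, seed=0):
    """Replace a band-limited signal with noise carrying (roughly) the
    same temporal envelope: the envelope survives, but the temporal
    fine structure becomes that of the noise carrier."""
    env = np.abs(band)                            # rectify
    win = max(1, int(fs / (2.0 * env_cutoff)))    # crude smoothing window
    env = np.convolve(env, np.ones(win) / win, mode='same')
    rng = np.random.default_rng(seed)
    out = env * rng.standard_normal(band.size)    # modulate the carrier
    # restore the original band RMS level
    out *= np.sqrt(np.mean(band ** 2) / (np.mean(out ** 2) + 1e-12))
    return out

fs = 8000.0
t = np.arange(int(fs)) / fs
band = np.sin(2 * np.pi * 1000.0 * t)             # stand-in band signal
vocoded = noise_vocode_band(band, fs)
```

Because the envelope is retained, an envelope-based term such as the cepstral correlation stays high for such a signal, while a coherence (cross-correlation) measure like the CSII drops.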
When the auditory coherence terms are combined with the cepstral correlation to form HASPI, the optimum weights are 0 for the low- and mid-level coherence values. These auditory coherence weights are in contrast to the weights of the low-, mid-, and high-level CSII components, which give the greatest importance to the mid-level coherence. Kates and Arehart (2005) hypothesized that the mid-level segments primarily comprised consonant–vowel and vowel–consonant transitions, while the high-level segments primarily comprised vowels. Keeping with this interpretation, the inclusion of only the high-level auditory coherence value in HASPI suggests that the cepstral correlation conveys information about consonants and the vowel–consonant and consonant–vowel transitions, and that the high-level auditory coherence adds information about the vowels and formant transitions that may be missed by the cepstral correlation.
The poor performance of STECI for the NH noise and distortion data also suggests that the specific form of the envelope measurements is important for predicting speech intelligibility. STECI measures the accuracy in preserving the envelope modulations within each frequency band. The cepstral correlation used for HASPI is based on the principal components of the short-time spectrum, and measures the fidelity in reproducing the time–frequency modulations of the signal rather than just the temporal modulations within each band. STECI performs well for the ideal binary mask dataset, where the effect of the noise suppression is primarily changes in the envelope modulation within each band, but does poorly for the noise and distortion dataset, where the peak-clipping and center-clipping distortions change the shape of the short-time spectra and generate distortion products across a wide range of frequencies.
These results also indicate that the choice of experimental data is important in developing an index. For many types of signal modification, such as additive noise and nonlinear distortion, both the envelope and temporal fine structure are affected by the processing. Thus an accurate index for the effects of these signal modifications can be developed by measuring either the envelope or the TFS changes. Consistent with this signal behavior, there was no significant difference between the CSII and HASPI accuracy for the noise and distortion data. However, there are conditions, such as the frequency compression data, where the coherence is reduced more than the envelope correlation for the same reduction in intelligibility, and for these conditions the HASPI approach is significantly more accurate. For the ideal binary mask data, it was the STECI that was most accurate. Thus developing an index for just one type of processing does not guarantee that it will work well for all types of processing. The type and amount of noise and distortion in a hearing aid are generally not known a priori, and HASPI has the advantage of giving accurate predictions for all of the processing conditions considered in this study.
The noise-suppression studies cited above also found that the CSII is a poor predictor of intelligibility for binary-masked speech (Christiansen et al., 2010; Taal et al., 2011b). The processing used in those papers matched the binary mask local criterion to the SNR of the noisy signal, while the binary mask experiment used in this paper had a fixed local criterion of 0 dB independent of the SNR. In the limit of a −60 dB SNR, the processing used in the cited papers produced a binary-modulated noise that still approximated the envelope of the clean speech and yielded high intelligibility. However, the CSII, which is based on the signal coherence, gives a near-zero value for the −60 dB SNR condition since the temporal fine structure of the clean and modulated noisy signals is uncorrelated. In the experiment reported in this paper, on the other hand, the local criterion was kept constant at 0 dB independent of the SNR, so the number of 1s in the binary gain pattern decreases as the SNR decreases. In the case of −60 dB SNR and 100 dB attenuation, the processing used in this paper would attenuate the entire signal since all cells would have negative SNRs, giving an inaudible output having no intelligibility. This result is consistent with a CSII of zero. Because of these experimental differences, the CSII would be expected to be an accurate predictor of intelligibility for the binary mask processing used in this paper.
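The fixed local criterion described above can be made concrete with a small sketch on time-frequency power arrays. The 0-dB criterion and 100-dB attenuation follow the discussion, but the function itself is an illustrative simplification of ideal binary mask processing, not the experiment's implementation.

```python
import numpy as np

def ideal_binary_mask_gain(speech_power, noise_power,
                           lc_db=0.0, atten_db=100.0):
    """Binary gain for each time-frequency cell: keep the cell (gain 1)
    when its local SNR meets the local criterion, otherwise attenuate.
    With lc_db fixed at 0 dB, lowering the overall SNR leaves fewer
    cells above criterion, so the gain pattern contains fewer 1s."""
    local_snr_db = 10.0 * np.log10(
        (speech_power + 1e-30) / (noise_power + 1e-30))
    keep = local_snr_db >= lc_db
    return np.where(keep, 1.0, 10.0 ** (-atten_db / 20.0))

# One frame, two frequency cells: speech dominates the first cell,
# noise dominates the second.
speech = np.array([[1.0, 0.01]])
noise = np.array([[0.01, 1.0]])
gain = ideal_binary_mask_gain(speech, noise)
```

In the limiting case described in the text, where every cell has a negative local SNR, every gain equals the attenuation value and the output is inaudible, consistent with a CSII of zero.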
The envelope modulation in HASPI is lowpass filtered at 62.5 Hz. Many languages are characterized by envelope modulation frequencies below 20–30 Hz (Greenberg and Arai, 2004; Souza and Rosen, 2009), and the 62.5-Hz cutoff frequency used in HASPI produced accurate results for English sentences. However, tonal languages such as Mandarin may need higher modulation cutoff frequencies (Chen and Loizou, 2011). Furthermore, Chen et al. (2013) have shown that the CSII is more accurate than an envelope-based metric for predicting the intelligibility of Mandarin sentences corrupted by noise and two-talker interference. Thus extending HASPI to other languages may require further investigation of the envelope-modulation cutoff frequency and the ratio of envelope to TFS information used in the model.
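The role of the modulation cutoff can be inspected by examining the modulation spectrum of a band envelope. The sketch below, with a hypothetical 4-Hz "syllabic" envelope, illustrates why a 62.5-Hz cutoff comfortably retains the dominant modulations of English speech.

```python
import numpy as np

def modulation_spectrum(envelope, fs_env):
    """Magnitude spectrum of a (downsampled) band envelope, showing
    which modulation frequencies a lowpass cutoff would retain."""
    env = envelope - np.mean(envelope)            # drop the DC term
    spec = np.abs(np.fft.rfft(env * np.hanning(env.size)))
    freqs = np.fft.rfftfreq(env.size, d=1.0 / fs_env)
    return freqs, spec

fs_env = 250.0                                    # envelope sample rate, Hz
t = np.arange(1000) / fs_env
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 4.0 * t)  # 4-Hz modulation
freqs, spec = modulation_spectrum(envelope, fs_env)
peak_hz = freqs[np.argmax(spec)]                  # dominant modulation rate
```

For a tonal language, energy at the fundamental-frequency modulation rate would appear at much higher modulation frequencies, which is why a higher cutoff may be needed.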
7. Summary and conclusions
This paper has presented a new index for predicting speech intelligibility. HASPI compares the envelope and TFS outputs of an auditory model for a reference signal to the outputs of the model for a degraded signal. The
model for the reference signal is adjusted for normal hearing, while the model for the degraded signal incorporates the peripheral hearing loss. The auditory model includes the middle-ear transfer function, an auditory filterbank, outer hair-cell dynamic-range compression, two-tone suppression, and adaptation of the inner hair-cell firing rate. Hearing loss causes a shift in auditory threshold, broader auditory filters, a reduction in the dynamic-range compression ratio, and a reduction in the amount of two-tone suppression.
The comparison of the reference and degraded signal envelopes uses cepstral correlation. The short-time log spectrum is approximated using a set of half-cosine basis functions. The amplitude of each basis function fluctuates over time, and the loss of intelligibility is computed using the average of the cross-correlations of the basis functions for the reference and degraded signals. The comparison of the reference and degraded signal temporal behavior uses auditory coherence. The signals are divided into segments, and the normalized cross-correlation is calculated for each segment in each frequency band and averaged over all of the segments in each of three intensity regions. The high-intensity segments were found to be the most useful in combination with the cepstral correlation. The final model is a weighted combination of the cepstral correlation and high-intensity auditory coherence, followed by a logistic function transformation.
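The envelope term and the final combination can be sketched as follows. The half-cosine expansion and the logistic combination follow the description above, but the sketch operates directly on short-time log spectra rather than on the full auditory-model outputs, and the weights in haspi_combine are illustrative placeholders, not the fitted coefficients of Eq. (7).

```python
import numpy as np

def half_cosine_basis(nbands, nbasis):
    """Half-cosine basis functions across the frequency bands,
    a low-order cepstral-style expansion of the log spectrum."""
    k = np.arange(nbands)
    return np.array([np.cos(j * np.pi * k / (nbands - 1))
                     for j in range(nbasis)])

def cepstral_correlation(ref_logspec, deg_logspec, nbasis=6):
    """Average, over basis functions, of the cross-correlation across
    time of the basis-function amplitudes for the reference and
    degraded short-time log spectra (frames x bands arrays)."""
    basis = half_cosine_basis(ref_logspec.shape[1], nbasis)
    ref_amp = ref_logspec @ basis.T               # frames x nbasis
    deg_amp = deg_logspec @ basis.T
    corrs = []
    for j in range(1, nbasis):                    # skip the constant term
        r = ref_amp[:, j] - ref_amp[:, j].mean()
        d = deg_amp[:, j] - deg_amp[:, j].mean()
        denom = np.sqrt(np.sum(r ** 2) * np.sum(d ** 2))
        corrs.append(np.sum(r * d) / denom if denom > 0 else 0.0)
    return float(np.mean(corrs))

def haspi_combine(cep_corr, high_coherence,
                  bias=-5.0, w_cep=7.0, w_coh=2.5):
    """Weighted combination of the envelope and high-level coherence
    terms, followed by a logistic transformation to the 0-1 range.
    The weights here are placeholders, not the published values."""
    x = bias + w_cep * cep_corr + w_coh * high_coherence
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
ref = rng.standard_normal((200, 32))              # hypothetical log-spectra
cc_same = cepstral_correlation(ref, ref)          # identical inputs
p = haspi_combine(cc_same, 0.9)
```

An undegraded signal gives a cepstral correlation of 1, and the logistic mapping converts the weighted sum into a predicted intelligibility proportion.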
The HASPI index was trained using four datasets. An index trained on only one dataset may not work well on other data, so an objective of the present study was to use data from several different experiments to create a model with a greater chance of generalizing to unknown signal-processing or distortion mechanisms. HASPI was found to offer accuracy comparable to that of the CSII for a dataset comprising speech corrupted by additive noise and nonlinear distortion, and to be more accurate than indices based on the envelope alone. The accuracies of HASPI and the CSII were also found to be comparable for NH listeners for noisy speech processed through an ideal binary mask noise-suppression algorithm having a local criterion fixed at 0 dB, although the STECI envelope-modulation index was the most accurate for these data. The indices were also evaluated using a dataset comprising speech with additive babble and processed through frequency compression; HASPI was found to offer superior accuracy to the CSII and STECI for these data. Note that the results for STECI do not necessarily mean that the same results would be found for applying the STOI to data for normal-hearing listeners, given the differences in the auditory models used for the two indices. The indices were also compared for speech having varying amounts of its high-frequency content replaced by the output of a noise vocoder, and HASPI correctly predicted no loss of intelligibility while the CSII incorrectly predicted a substantial loss. When the prediction accuracy was averaged over all of the processing conditions and both subject groups, HASPI was found to have the highest correlation coefficient and the lowest RMS error.
HASPI does not claim to describe how humans process speech when listening through hearing aids. Instead, the index aims to provide a transformed representation space that accommodates as many sources of speech intelligibility degradation as possible. There is no guarantee that HASPI can be generalized to conditions beyond those used to train the model. For example, HASPI was derived using data for monaural headphone listening. Additional research is needed to determine the accuracy of the index for real hearing aids, as opposed to simulated processing, and also for the processing interactions that occur in hearing aids when more than one algorithm is in use. Further research is also needed to deal with the acoustic effects of the head and ear that occur in real-world hearing-aid use, the effects of binaural listening, and the impact of room reverberation. A final consideration is the relative importance of envelope and temporal fine structure for predicting intelligibility for tonal languages such as Mandarin.
8. Uncited references
ANSI (1989) and Kates (2008).
Acknowledgments
The authors thank Dr. Rosalinda Baca for providing the statistical analysis used in this paper. Author JMK was supported by a grant from GN ReSound. Author KHA was supported by an NIH grant (R01 DC60014) and by the grant from GN ReSound.
References
Aguilera Munoz, C.M., Nelson, P.B., Rutledge, J.C., Gago, A., 1999. Frequency lowering processing for listeners with significant hearing loss. In: Electronics, Circuits, and Systems: Proc. ICECS 1999, vol. 2, Pafos, Cyprus, pp. 741–744 (September 5–8, 1999).
Anderson, M.C., 2010. The Role of Temporal Fine Structure in Sound Quality Perception. Ph.D. Thesis, University of Colorado Dept. Speech Lang. Hear. Sciences, 2010.
ANSI S3.6-1989. American National Standard: Specification for Audiometers. American National Standards Institute, New York.
ANSI S3.5-1997. American National Standard: Methods for the Calculation of the Speech Intelligibility Index. American National Standards Institute, New York.
Arehart, K.H., Souza, P., Baca, R., Kates, J.M., 2013a. Working memory, age, and hearing loss: susceptibility to hearing aid distortion. Ear Hear. 34, 251–260.
Arehart, K.H., Souza, P.E., Lunner, T., Pedersen, M.S., Kates, J.M., 2013b. Relationship between distortion and working memory for digital noise-reduction processing in hearing aids. In: Proc. Mtgs. Acoust. (POMA) 19, 050084: Acoust. Soc. Am. 165th Meeting, Montreal, June 2–7, 2013.
Bruce, I.C., Sachs, M.B., Young, E.D., 2003. An auditory–periphery model of the effects of acoustic trauma on auditory nerve responses. J. Acoust. Soc. Am. 113, 369–388.
Byrne, D., Dillon, H., 1986. The National Acoustics Laboratories' (NAL) new procedure for selecting gain and frequency response of a hearing aid. Ear Hear. 7, 257–265.
Carter, G.C., Knapp, C.H., Nuttall, A.H., 1973. Estimation of the magnitude-squared coherence function via overlapped fast Fourier transform processing. IEEE Trans. Audio Electroacoust. 21, 337–344.
Chen, F., Loizou, P.C., 2011. Predicting the intelligibility of vocoded speech. Ear Hear. 32, 331–338.
Chen, F., Guan, T., Wong, L.N., 2013. Effect of temporal fine structure on speech intelligibility modeling. In: Proc. 35th Annual Int. Conf. IEEE-EMBS, Osaka, July 3–7, 2013, pp. 4199–4202.
Ching, T.Y.C., Dillon, H., Byrne, D., 1998. Speech recognition of hearing-impaired listeners: predictions from audibility and the limited role of high-frequency amplification. J. Acoust. Soc. Am. 103, 1128–1140.
Christiansen, C., Pedersen, M.S., Dau, T., 2010. Prediction of speech intelligibility based on an auditory preprocessing model. Speech Commun. 52, 678–692.
Cooke, M., 1991. Modeling Auditory Processing and Organization. PhD Thesis, U. Sheffield, May, 1991.
Cooper, N.P., Rhode, W.S., 1997. Mechanical responses to two-tone distortion products in the apical and basal turns of the mammalian cochlea. J. Neurophysiol. 78, 261–270.
Cosentino, S., Marquardt, T., McAlpine, D., Falk, T.H., 2012. Towards objective measures of speech intelligibility for cochlear implant users in reverberant environments. In: Proc. 11th Int. Conf. on Info. Sci., Sig. Proc., and Their Appl. (ISSPA), Montreal, 2–5 July 2012, pp. 666–671.
Dau, T., Puschel, D., Kohlrausch, A., 1996. A quantitative model of the "effective" signal processing in the auditory system: I. Model structure. J. Acoust. Soc. Am. 99, 3615–3622.
Dudley, H., 1939. Remaking speech. J. Acoust. Soc. Am. 11, 169–177.
Elhilali, M., Chi, T., Shamma, S., 2003. A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Commun. 41, 331–348.
Fogerty, D., 2011. Perceptual weighting of individual and concurrent cues for sentence intelligibility: frequency, envelope, and fine structure. J. Acoust. Soc. Am. 129, 977–988.
Glista, D., Scollie, S., Bagatto, M., Seewald, R., Parsa, V., Johnson, A., 2009. Evaluation of nonlinear frequency compression: clinical outcomes. Int. J. Audiol. 48, 632–644.
Goldsworthy, R.L., Greenberg, J.E., 2004. Analysis of speech-based speech transmission index methods with implications for nonlinear operations. J. Acoust. Soc. Am. 116, 3679–3689.
Gomez, A.M., Schwerin, B., Paliwal, K., 2012. Improving objective intelligibility prediction by combining correlation and coherence based methods with a measure based on the negative distortion ratio. Speech Commun. 54, 503–515.
Gorga, M.P., Abbas, P.J., 1981. AP measurements of short-term adaptation in normal and acoustically traumatized ears. J. Acoust. Soc. Am. 70, 1310–1321.
Greenberg, S., Arai, T., 2004. What are the essential cues for understanding spoken language? IEICE Trans. Inf. and Syst. E87-D, 1059–1070.
Harris, D.M., Dallos, P., 1979. Forward masking of auditory nerve fiber responses. J. Neurophys. 42, 1083–1107.
Hicks, M.L., Bacon, S.P., 1999. Psychophysical measures of auditory nonlinearities as a function of frequency in individuals with normal hearing. J. Acoust. Soc. Am. 105, 326–338.
Hines, A., Harte, N., 2010. Speech intelligibility from image processing. Speech Commun. 52, 736–752.
Hohmann, V., Kollmeier, B., 1995. The effect of multichannel dynamic compression on speech intelligibility. J. Acoust. Soc. Am. 97, 1191–1195.
Holube, I., Kollmeier, B., 1996. Speech intelligibility predictions in hearing-impaired listeners based on a psychoacoustically motivated perception model. J. Acoust. Soc. Am. 100, 1703–1716.
Hopkins, K., Moore, B.C.J., 2011. The effects of age and cochlear hearing loss on temporal fine structure sensitivity, frequency sensitivity, and speech reception in noise. J. Acoust. Soc. Am. 130, 334–349.
Hopkins, K., Moore, B.C.J., Stone, M.A., 2008. Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech. J. Acoust. Soc. Am. 123, 1140–1153.
Please cite this article in press as: Kates, J.M., Arehart, K.H., The Hearing-Aid Speech Perception Index (HASPI), Speech Comm. (2014), http://dx.doi.org/10.1016/j.specom.2014.06.002
Houtgast, T., Steeneken, H.J.M., 1971. Evaluation of speech transmission channels by using artificial signals. Acustica 25, 355–367.
Humes, L.E., Dirks, D.D., Bell, T.S., Ahlstrom, C., Kincaid, G.E., 1986. Application of the Articulation Index and the Speech Transmission Index to the recognition of speech by normal-hearing and hearing-impaired listeners. J. Speech Hear. Res. 29, 447–462.
Imai, S., 1983. Cepstral analysis synthesis on the mel frequency scale. In: Proc. IEEE Int. Conf. Acoust. Speech and Sig. Proc., vol. 8, Boston, April 14–16, 1983, pp. 93–96.
Immerseel, L.V., Peeters, S., 2003. Digital implementation of linear gammatone filters: comparison of design methods. Acoust. Res. Lett. Online 4, 59–64.
Kates, J.M., 1991. A time domain digital cochlear model. IEEE Trans. Sig. Proc. 39, 2573–2592.
Kates, J.M., 1992. On using coherence to measure distortion in hearing aids. J. Acoust. Soc. Am. 91, 2236–2244.
Kates, J.M., 2008. Digital Hearing Aids. Plural Publishing, San Diego, CA, ISBN-13: 978-1-59756-317-8, pp. 1–16 (Chapter 1).
Kates, J.M., 2013. An auditory model for intelligibility and quality predictions. Proc. Mtgs. Acoust. (POMA) 19, 050184: Acoust. Soc. Am. 165th Meeting, Montreal, June 2–7, 2013.
Kates, J.M., Arehart, K.H., 2005. Coherence and the speech intelligibility index. J. Acoust. Soc. Am. 117, 2224–2237.
Kates, J.M., Arehart, K.H., 2010. The hearing aid speech quality index (HASQI). J. Audio Eng. Soc. 58, 363–381.
Kiessling, J., 1993. Current approaches to hearing aid evaluation. J. Speech-Lang. Path. Audiol. Monogr. Suppl. 1, 39–49.
Kjems, U., Boldt, J.B., Pedersen, M.S., Wang, D., 2009. Role of mask pattern in intelligibility of ideal binary-masked noisy speech. J. Acoust. Soc. Am. 126, 1415–1426.
Li, N., Loizou, P.C., 2008. Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123, 1673–1682.
Ludvigsen, C., Elberling, C., Keidser, G., Poulsen, T., 1990. Prediction of intelligibility of nonlinearly processed speech. Acta Otolaryngol. Suppl. 469, 190–195.
McAulay, R.J., Quatieri, T.F., 1986. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech and Sig. Proc. ASSP-34, 744–754.
McDermott, H.J., 2011. A technical comparison of digital frequency-lowering algorithms available in two current hearing aids. PLoS One 6 (7), e22358. http://dx.doi.org/10.1371/journal.pone.0022358.
Moore, B.C.J., Glasberg, B.R., 1983. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. J. Acoust. Soc. Am. 74, 750–753.
Moore, B.C.J., Glasberg, B.R., 2004. A revised model of loudness perception applied to cochlear hearing loss. Hear. Res. 188, 70–88.
Moore, B.C.J., Vickers, D.A., Plack, C.J., Oxenham, A.J., 1999. Inter-relationship between different psychoacoustic measures assumed to be related to the cochlear active mechanism. J. Acoust. Soc. Am. 106, 2761–2778.
Ng, E.H., Rudner, M., Lunner, T., Pedersen, M.S., Ronnberg, J., 2013. Effects of noise and working memory capacity on memory processing of speech for hearing-aid users. Int. J. Audiol. (in press).
Nilsson, M., Soli, S.D., Sullivan, J., 1994. Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Am. 95, 1085–1099.
Nossair, Z.B., Zahorian, S.A., 1991. Dynamic spectral shape features as acoustic correlates for initial stop consonants. J. Acoust. Soc. Am. 89, 2978–2991.
Patterson, R.D., Allerhand, M.H., Giguere, C., 1995. Time-domain modeling of peripheral auditory processing: a modular architecture and a software platform. J. Acoust. Soc. Am. 98, 1890–1894.
Pavlovic, C.V., Studebaker, G.A., Sherbecoe, R.L., 1986. An articulation index based procedure for predicting the speech recognition performance of hearing-impaired individuals. J. Acoust. Soc. Am. 80, 50–57.
Payton, K., Shrestha, M., 2008. Analysis of short-time speech transmission index algorithms. Proc. Acoustics 2008, Paris, pp. 634–638.
Payton, K.L., Uchanski, R.M., Braida, L.D., 1994. Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. J. Acoust. Soc. Am. 95, 1581–1592.
Plack, C.J., Oxenham, A.J., 2000. Basilar-membrane nonlinearity estimated by pulsation threshold. J. Acoust. Soc. Am. 107, 501–507.
Quatieri, T.F., McAulay, R.J., 1986. Speech transformations based on a sinusoidal representation. IEEE Trans. Acoust. Speech and Sig. Proc. ASSP-34, 1449–1464.
Rosenthal, S., 1969. IEEE: recommended practices for speech quality measurements. IEEE Trans. Audio Electroacoust. 17, 227–246.
Sachs, M.B., Abbas, P.J., 1974. Rate versus level functions for auditory-nerve fibers in cats: tone-burst stimuli. J. Acoust. Soc. Am. 56, 1835–1847.
Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.
Shaw, J.C., 1981. An introduction to the coherence function and its use in EEG signal analysis. J. Med. Eng. Technol. 5, 279–288.
Simpson, A., Hersbach, A.A., McDermott, H.J., 2005. Improvements in speech perception with an experimental nonlinear frequency compression hearing device. Int. J. Audiol. 44, 281–292.
Slaney, M., 1993. An efficient implementation of the Patterson–Holdsworth auditory filter bank. Apple Computer Technical Report #35. Apple Computer Library, Cupertino, CA.
Smith, Z.M., Delgutte, B., Oxenham, A.J., 2002. Chimaeric sounds reveal dichotomies in auditory perception. Nature 416, 87–90.
Souza, P., Rosen, S., 2009. Effects of envelope bandwidth on the intelligibility of sine- and noise-vocoded speech. J. Acoust. Soc. Am. 126, 792–805.
Souza, P., Arehart, K.H., Kates, J.M., Croghan, N.B.H., Gehani, N., 2013. Exploring the limits of frequency lowering. J. Speech Lang. Hear. Res. (in press).
Steeneken, H.J.M., Houtgast, T., 1980. A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am. 67, 318–326.
Steiger, J.H., 1980. Tests for comparing elements of a correlation matrix. Psychol. Bull. 87, 245–251.
Stone, M.A., Fullgrabe, C., Moore, B.C.J., 2008. Benefit of high-rate envelope cues in vocoder processing: effect of number of channels and spectral region. J. Acoust. Soc. Am. 124, 2272–2282.
Suzuki, Y., Takeshima, H., 2004. Equal-loudness-level contours for pure tones. J. Acoust. Soc. Am. 116, 918–933.
Taal, C.H., Hendriks, R.C., Heusdens, R., 2011a. An evaluation of objective measures for intelligibility prediction of time–frequency weighted noisy speech. J. Acoust. Soc. Am. 130, 3013–3027.
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J., 2011b. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Proc. 19, 2125–2136.
Wang, D.L., Kjems, U., Pedersen, M.S., Boldt, J.B., Lunner, T., 2008. Speech perception of noise with binary gains. J. Acoust. Soc. Am. 124, 2303–2307.
Williams, E.J., 1959. The comparison of regression variables. J. Royal Stat. Soc. Ser. B 21, 396–399.
Wojtczak, M., Biem, J.A., Micheyl, C., Oxenham, A.J., 2012. Perception of across-frequency asynchrony and the role of cochlear delay. J. Acoust. Soc. Am. 131, 363–377.
Yates, G.K., Winter, I.M., Robertson, D., 1990. Basilar membrane nonlinearity determines auditory nerve rate-intensity functions and cochlear dynamic range. Hear. Res. 45, 203–220.
Zahorian, S.A., Jagharghi, A.J., 1993. Spectral-shape features versus formants as acoustic correlates for vowels. J. Acoust. Soc. Am. 94, 1966–1982.
Zahorian, S.A., Rothenberg, M., 1981. Principal-components analysis for low-redundancy encoding of speech spectra. J. Acoust. Soc. Am. 69, 832–845.
Zhang, X., Heinz, M.G., Bruce, I.C., Carney, L.H., 2001. A phenomenological model for the response of auditory nerve fibers: I. Nonlinear tuning with compression and suppression. J. Acoust. Soc. Am. 109, 648–670.
Zilany, M., Bruce, I., 2006. Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. J. Acoust. Soc. Am. 120, 1446–1466.