
CHAPTER 2

Audition

JOSH H. MCDERMOTT

INTRODUCTION

Sound is created when matter in the world vibrates and takes the form of pressure waves that propagate through the air, containing clues about the environment around us. Audition is the process by which organisms utilize these clues to derive information about the world.

Audition is a crucial sense for most animals. Humans use sound to infer a vast number of important things—what someone said, their emotional state when they said it, and the whereabouts and nature of objects we cannot see, to name but a few. When hearing is impaired (via congenital conditions, noise exposure, or aging), the consequences can be devastating, such that a large industry is devoted to the design of prosthetic hearing devices.

As listeners, we are largely unaware of the computations underlying our auditory system’s success, but they represent an impressive feat of engineering. The computational

This chapter is an updated version of a chapter written for The Oxford Handbook of Cognitive Neuroscience. I thank Dana Boebinger, Alex Kell, Wiktor Mlynarski, and Kevin Woods for helpful comments on earlier drafts of this chapter. Supported by a McDonnell Scholar Award and a National Science Foundation CAREER Award.

challenges of everyday audition are reflected in the gap between biological and machine hearing systems—machine systems for interpreting sound currently fall short of human abilities. At present, smart phones and other machine systems recognize speech reasonably well in quiet conditions, but in a noisy restaurant they are all but useless. Understanding the basis of our success in perceiving sound will hopefully help us to replicate it in machine systems and restore it in biological auditory systems when their function becomes impaired.

The goal of this chapter is to provide a bird’s-eye view of contemporary hearing research. The chapter is an updated version of one I wrote a few years ago (McDermott, 2013). I provide brief overviews of classic areas of research as well as some central themes and advances from the past 10 years. The first section describes the sensory transduction of the cochlea. The second section discusses modulation and its measurement by subcortical and cortical regions of the auditory system, a key research focus of the last few decades. The third and fourth sections describe some of what is known about the primary and nonprimary auditory cortex, respectively. The fifth section discusses the perception of sound source properties,

Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience, Fourth Edition, edited by John T. Wixted. Copyright © 2018 John Wiley & Sons, Inc. DOI: 10.1002/9781119170174.epcn202


focusing on location, loudness, and pitch. The sixth section presents an overview of auditory scene analysis. I conclude with a discussion of where hearing research is headed.

THE PROBLEM

Just by listening, we can routinely apprehend many aspects of the world around us: the size of a room in which we are talking, whether it is windy or raining outside, the speed of an approaching car, or whether the surface someone is walking on is gravel or marble. This ability is nontrivial because the properties of the world that are of interest to a listener are generally not explicit in the acoustic input—they cannot be easily recognized or discriminated from the sound waveform itself. The brain must process the sound entering the ear to generate representations in which the properties of interest are more evident. One of the main objectives of hearing science is to understand the nature of this transformation and its instantiation in the brain.

Like other senses, audition is further complicated by a second challenge—that of scene analysis. Although listeners are generally interested in the properties of individual objects or events, the ears are rarely presented with the sounds from isolated sources. Instead, the sound signal that reaches the ear is typically a mixture of sounds from different sources. Such mixtures of sound sources occur frequently in natural auditory environments, for example in social settings, where a single speaker of interest may be talking among many others, and in music. From the mixture it receives as input, the brain must derive representations of the individual sound sources of interest, as are needed to understand someone’s speech, recognize a melody, or otherwise guide behavior. Known as the “cocktail party problem” (Cherry, 1953), or

“auditory scene analysis” (Bregman, 1990), this problem has analogs in other sensory modalities, but the nature of sound presents the auditory system with unique challenges.

SOUND MEASUREMENT—THE PERIPHERAL AUDITORY SYSTEM

The transformation of the raw acoustic input into representations that are useful for behavior is apparently instantiated over many brain areas and stages of neural processing, spanning the cochlea, midbrain, thalamus, and cortex (Figure 2.1). The early stages of this cascade are particularly intricate in the auditory system relative to other sensory systems, with many processing stations occurring prior to the cortex. The sensory organ of the cochlea is itself a complex multicomponent system, whose investigation remains a considerable challenge—the mechanical nature of the cochlea renders it much more difficult to probe (e.g., with electrodes) than the retina or olfactory epithelium, for instance. Peripheral coding of sound is also unusual relative to that of other senses in its degree of clinical relevance. Unlike vision, for which the most common forms of dysfunction are optical in nature, and can be fixed with glasses, hearing impairment typically involves altered peripheral neural processing, and its treatment has benefited from a detailed understanding of the processes that are altered. Much of hearing research has accordingly been devoted to understanding the nature of the measurements made by the auditory periphery, and they provide a natural starting point for any discussion of how we hear.

Frequency Selectivity and the Cochlea

Hearing begins with the ear, where sound is transduced into action potentials that are sent to the brain via the auditory nerve.


[Figure 2.1: schematic of the ascending auditory pathway, labeling the cochlea, auditory nerve, cochlear nucleus, superior olivary nucleus, inferior colliculus, medial geniculate nucleus, and auditory cortex.]
Figure 2.1 The auditory system. Sound is transduced by the cochlea, processed by an interconnected set of subcortical areas, and then fed into the core regions of auditory cortex. Source: From Goldstein (2007). © 2007 South-Western, a part of Cengage, Inc. Reproduced with permission. www.cengage.com/permissions

The transduction process is marked by several distinctive signal transformations, the most obvious of which is produced by frequency tuning.

The key components of sound transduction are depicted in Figure 2.2. Sound induces vibrations of the eardrum, which are then transmitted via the bones of the middle ear to the cochlea, the sensory organ of the auditory system. The cochlea is a coiled, fluid-filled tube. Several membranes extend through the tube and vibrate in response to sound. Transduction of this mechanical vibration into an electrical signal occurs in the organ of Corti, a mass of cells attached to the basilar membrane. The organ of Corti in particular contains what are known as hair cells,

named for the stereocilia that protrude from them. The inner hair cells are responsible for sound transduction. When the section of membrane on which they lie vibrates, stereocilia shear against the membrane above, opening mechanically gated ion channels and inducing a voltage change within the body of the cell. Neurotransmitter release is triggered by the change in membrane potential, generating action potentials in the auditory nerve fibers that the hair cell synapses with. This electrical signal is carried by the auditory nerve fibers to the brain.

The frequency tuning of the transduction process occurs because different parts of the basilar membrane vibrate maximally in response to different frequencies. This is


[Figure 2.2: labeled panels showing the ear, the inner ear, a cross section of the cochlea, and the organ of Corti; labels include the inner and outer hair cells, stereocilia, basilar, tectorial, and Reissner’s membranes, vestibular, middle, and tympanic canals, oval and round windows, tunnel of Corti, spiral ganglion, auditory and vestibular nerves, and afferent and efferent fibers.]
Figure 2.2 Structure of the peripheral auditory system. (Top left) Diagram of ear. The eardrum transmits sound to the cochlea via the middle ear bones (ossicles). (Top middle) Inner ear. The semicircular canals abut the cochlea. Sound enters the cochlea via the oval window and causes vibrations along the basilar membrane, which runs through the middle of the cochlea. (Top right) Cross section of cochlea. The organ of Corti, containing the hair cells that transduce sound into electrical potentials, sits on top of the basilar membrane. (Bottom) Schematic of section of the organ of Corti. The shearing that occurs between the basilar and tectorial membranes when they vibrate (in response to sound) causes the hair cell stereocilia to deform. The deformation causes a change in the membrane potential of the inner hair cells, transmitted to the brain via afferent auditory nerve fibers. The outer hair cells, which are 3 times more numerous than the inner hair cells, serve as a feedback system to alter the basilar membrane motion, tightening its tuning and amplifying the response to low-amplitude sounds. Source: From Wolfe (2006, Chapter 9). Reproduced with permission of Oxford University Press.

partly due to mechanical resonances—the thickness and stiffness of the membrane vary along its length, producing a different resonant frequency at each point. The mechanical resonances are actively enhanced via a feedback process, believed to be mediated largely by a second set of cells, called the outer hair cells. The outer hair cells abut the inner hair cells on the organ of Corti and serve to alter the basilar membrane vibration

rather than transduce it. They expand and contract in response to sound (Ashmore, 2008; Dallos, 2008; Hudspeth, 2008). Their motion alters the passive mechanics of the basilar membrane, amplifying the response to low-intensity sounds and tightening the frequency tuning of the resonance. The upshot is that high frequencies produce vibrations at the basal end of the cochlea (close to the eardrum), while low frequencies produce


vibrations at the apical end (far from the eardrum), with frequencies in between stimulating intermediate regions. The auditory nerve fibers that synapse onto individual inner hair cells are thus frequency-tuned—they fire action potentials in response to a local range of frequencies, collectively providing the rest of the auditory system with a frequency decomposition of the incoming waveform. As a result of this behavior, the cochlea is often described functionally as a set of bandpass filters—filters that each pass frequencies within a particular range, and eliminate those outside of it. Collectively the filters span the audible spectrum.

The frequency decomposition of the cochlea is conceptually similar to the Fourier transform, but differs in important respects. Whereas the Fourier transform uses linearly spaced frequency bins, each separated by the same number of Hz, the tuning bandwidth of auditory nerve fibers increases with their preferred frequency. This characteristic is evident in Figure 2.3A, in which the frequency response of a set of auditory nerve fibers is plotted on a logarithmic frequency scale. Although the lowest frequency fibers are broader on a log scale than the high frequency fibers, in absolute terms their bandwidths are much lower—several hundred Hz instead of several thousand. The distribution of best frequency along the cochlea also follows a roughly logarithmic function, apparent in Figure 2.3B, which plots the best frequency of a large set of nerve fibers against the distance along the cochlea of the hair cell that they synapse with. These features of frequency selectivity are present in most biological auditory systems. It is partly for this reason that a log scale is commonly used for frequency.
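To make the contrast with the Fourier transform concrete, the sketch below implements a toy cochlea-like filter bank in Python: center frequencies are log-spaced and each filter’s bandwidth grows with its center frequency, here using the equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore (1990). The function names and parameter values are illustrative, not those of any particular published model.

```python
# Toy cochlea-like filter bank: log-spaced center frequencies with bandwidths
# that grow with center frequency (unlike the linear spacing of the Fourier
# transform). All names and parameter values are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def erb(cf_hz):
    """Approximate equivalent rectangular bandwidth (Hz) at a center frequency."""
    return 24.7 * (4.37 * cf_hz / 1000.0 + 1.0)

def cochlear_filterbank(signal, fs, center_freqs):
    """Return one bandpass-filtered version of the signal per center frequency."""
    subbands = []
    for cf in center_freqs:
        half_bw = erb(cf) / 2.0
        lo, hi = max(cf - half_bw, 1.0), min(cf + half_bw, fs / 2.0 - 1.0)
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        subbands.append(sosfiltfilt(sos, signal))
    return np.array(subbands)              # shape: (num_channels, num_samples)

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
tone = np.sin(2 * np.pi * 1000 * t)                # 1 kHz pure tone
cfs = np.geomspace(100.0, 6000.0, 30)              # log-spaced, like the cochlea
subbands = cochlear_filterbank(tone, fs, cfs)      # only channels near 1 kHz respond
```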

Cochlear frequency selectivity has a host of perceptual consequences—for instance, our ability to detect a particular frequency is limited largely by the signal-to-noise

ratio of the cochlear filter centered on the frequency. There are many treatments of frequency selectivity and perception (Moore, 2003), as it is perhaps the most studied aspect of hearing.

Although the frequency tuning of the cochlea is uncontroversial, the teleological question of why the cochlear transduction process is frequency tuned remains less settled. How does frequency tuning aid the brain’s task of recovering useful information about the world from its acoustic input? Over the last two decades, a growing number of researchers have endeavored to explain properties of sensory systems as optimal for the task of encoding natural sensory stimuli, initially focusing on coding questions in vision, and using notions of efficiency as the optimality criterion (Field, 1987; Olshausen & Field, 1996). Lewicki and his colleagues have applied similar concepts to hearing, using algorithms that derive efficient and sparse representations of sounds (Lewicki, 2002; Smith & Lewicki, 2006), properties believed to be desirable of early sensory representations. They report that for speech, or sets of environmental sounds and animal vocalizations, efficient representations for sound look much like the representation produced by auditory nerve fiber responses—sounds are represented with filters whose tuning is localized in frequency. Interestingly, the resulting representations share the dependence of bandwidth and frequency found in biological hearing—bandwidths increase with frequency as they do in the ear. Moreover, representations derived in the same way for “unnatural” sets of sounds, such as samples of white noise, do not exhibit frequency tuning, indicating that the result is at least somewhat specific to the sorts of sounds commonly encountered in the world. These results suggest that frequency tuning of the sort found in the ear provides an efficient means to encode the sounds that were likely of importance when the


[Figure 2.3: (A) threshold (dB SPL) versus frequency (kHz); (B) percent distance from base versus frequency (kHz).]
Figure 2.3 Frequency selectivity. (A) Threshold tuning curves of auditory nerve fibers from a cat ear, plotting the level that was necessary to evoke a criterion increase in firing rate for a given frequency (Miller, Schilling, Franck, & Young, 1997). (B) The tonotopy of the cochlea. The position along the basilar membrane at which auditory nerve fibers synapse with a hair cell (determined by dye injections) is plotted versus their best frequency (Liberman, 1982). Source: Both parts of this figure are courtesy of Eric Young (Young, 2010), who replotted data from the original sources. Reproduced with permission of Oxford University Press.

auditory system evolved, possibly explaining its ubiquitous presence in auditory systems as an optimal distribution of limited neural coding resources. It remains to be seen whether this framework can explain potential variation in frequency tuning bandwidths across species—humans have recently been

claimed to possess narrower tuning than other species (Joris et al., 2011; Shera, Guinan, & Oxenham, 2002)—or the broadening of frequency tuning with increasing sound intensity (Rhode, 1978), but it provides one means by which to understand the origins of peripheral auditory processing.
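The flavor of this approach can be conveyed with a toy dictionary-learning sketch: learn a sparse basis for short excerpts of sound and inspect the learned basis functions. This is only a stand-in for the algorithms used by Lewicki and colleagues; with natural sounds the learned functions tend to be localized in time and frequency, whereas white noise is used below purely so the example runs on its own.

```python
# Toy stand-in for efficient/sparse coding analyses of sound: learn a sparse
# dictionary for short waveform excerpts and inspect the learned basis functions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def excerpts(signal, length=128, count=2000, seed=0):
    """Draw random fixed-length snippets from a 1-D sound waveform."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(signal) - length, size=count)
    return np.stack([signal[s:s + length] for s in starts])

waveform = np.random.default_rng(1).standard_normal(100_000)  # substitute real audio here
X = excerpts(waveform)
X = X - X.mean(axis=1, keepdims=True)

learner = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
learner.fit(X)
basis_functions = learner.components_   # each row is a learned "filter" over time
```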


Amplitude Compression

A second salient transformation that occurs in the cochlea is that of amplitude compression. Compression is reflected in the fact that the mechanical response of the cochlea to a soft sound (and thus the neural response that results) is larger than would be expected given the response to an intense sound. The response elicited by a sound is not proportional to the sound’s amplitude (as it would be if the response were linear), but rather to a compressive nonlinear function of amplitude. The dynamic range of the response to sound is thus “compressed” relative to the dynamic range of the acoustic input. Whereas the range of audible sounds covers five orders of magnitude, or 100 dB, the range of cochlear response covers only one or two orders of magnitude (Ruggero, Rich, Recio, & Narayan, 1997).

Compression appears to serve to map the range of amplitudes that the listener needs to hear (i.e., those commonly encountered in the environment) onto the physical operating range of the cochlea. Without compression, it would have to be the case that either sounds low in level would be inaudible, or sounds high in level would be indiscriminable (for they would fall outside the range that could elicit a response change). Compression permits very soft sounds to produce a physical response that is (just barely) detectable, while maintaining some discriminability of higher levels.

The compressive nonlinearity is often approximated as a power function with an exponent of 0.3 or so. It is not obvious why the compressive nonlinearity should take the particular form that it does. Many different functions could in principle serve to compress the output response range. It remains to be seen whether compression can be explained in terms of optimizing the encoding of the input, as has been proposed for frequency

tuning (though see Escabi, Miller, Read, and Schreiner (2003)). Most machine hearing applications also utilize amplitude compression prior to analyzing sound, however, and it is widely agreed to be useful to amplify low amplitudes relative to high ones when processing sound.
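The arithmetic behind the compressive mapping is easy to verify: a power law with an exponent of roughly 0.3 (the approximation mentioned above) maps a 100 dB range of input amplitudes onto about 30 dB of response range. A minimal sketch:

```python
# Illustration of compression: a power-law nonlinearity with exponent ~0.3 maps
# five orders of magnitude of input amplitude (100 dB) onto ~30 dB of output.
import numpy as np

amplitudes = np.logspace(0, 5, 6)                 # five orders of magnitude (100 dB)
responses = amplitudes ** 0.3                     # compressive nonlinearity
output_range_db = 20 * np.log10(responses[-1] / responses[0])
print(output_range_db)                            # 30 dB: about 1.5 orders of magnitude
```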

Amplitude compression was first noticed in measurements of the physical vibrations of the basilar membrane (Rhode, 1971; Ruggero, 1992), but it is also apparent in auditory nerve fiber responses (Yates, 1990) and is believed to account for a number of perceptual phenomena (Moore & Oxenham, 1998). The effects of compression are related to cochlear amplification, in that compression results from response amplification that is limited to low-intensity sounds. Compression is achieved in part via the outer hair cells, whose motility modifies the motion of the basilar membrane in response to sound (Ruggero & Rich, 1991). Outer hair cell function is frequently altered in hearing impairment, one consequence of which is a loss of compression, something that hearing aids attempt to mimic.

Neural Coding in the Auditory Nerve

Although frequency tuning and amplitude compression are at this point uncontroversial and relatively well understood, several other empirical questions about peripheral auditory coding remain unresolved. One important issue involves the means by which the auditory nerve encodes frequency information. As a result of the frequency tuning of the auditory nerve, the spike rate of a nerve fiber contains information about frequency (a large firing rate indicates that the sound input contains frequencies near the center of the range of the fiber’s tuning). Collectively, the firing rates of all nerve fibers could thus be used to estimate the instantaneous spectrum of a sound. However, spike timings


also carry frequency information. At least for low frequencies, the spikes that are fired in response to sound do not occur randomly, but rather tend to occur at the peak displacements of the basilar membrane vibration. Because the motion of a particular section of the membrane mirrors the bandpass filtered sound waveform, the spikes occur at the waveform peaks (Rose, Brugge, Anderson, & Hind, 1967). If the input is a single frequency, spikes thus occur at a fixed phase of the frequency cycle (Figure 2.4A). This behavior is known as “phase locking” and produces spikes at regular intervals corresponding to the period of the frequency. The spike timings thus carry information that could potentially augment or supersede that conveyed by the rate of firing.
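The logic can be illustrated with a schematic simulation: if spikes are generated with probability proportional to the half-wave rectified stimulus, they cluster at one phase of a low-frequency tone, and the interspike intervals fall near integer multiples of the stimulus period. This is purely illustrative and is not a biophysical model of the auditory nerve.

```python
# Schematic simulation of phase locking (not a biophysical auditory nerve model):
# spikes are drawn with probability proportional to the half-wave rectified tone,
# so interspike intervals cluster at multiples of the stimulus period.
import numpy as np

fs, freq = 20000, 200.0                                # sampling rate (Hz), tone frequency (Hz)
t = np.arange(0, 5.0, 1.0 / fs)
drive = np.maximum(np.sin(2 * np.pi * freq * t), 0)    # half-wave rectified tone
rate = 100.0 * drive                                   # instantaneous firing rate (spikes/s)
rng = np.random.default_rng(0)
spike_times = t[rng.random(t.size) < rate / fs]        # Poisson-like spike generation
intervals = np.diff(spike_times)
print(np.round(intervals * freq, 2)[:10])              # mostly near whole numbers of periods
```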

Phase locking degrades in accuracy as frequency is increased (Figure 2.4B) due to limitations in the temporal fidelity of the hair cell membrane potential (Palmer & Russell, 1986), and is believed to be largely absent for frequencies above 4 kHz in most mammals, though there is some variability across species (Johnson, 1980; Palmer &

Russell, 1986; Sumner & Palmer, 2012). The appeal of phase locking as a code for sound frequency is partly due to features of rate-based frequency selectivity that are unappealing from an engineering standpoint. Although frequency tuning in the auditory system (as measured by auditory nerve spike rates or psychophysical masking experiments) is narrow at low stimulus levels, it broadens considerably as the level is raised (Glasberg & Moore, 1990; Rhode, 1978). Phase locking, by comparison, is robust to sound level—even though a nerve fiber responds to a broad range of frequencies when the level is high, the time intervals between spikes continue to convey frequency-specific information, as the peaks in the bandpass-filtered waveform tend to occur at integer multiples of the periods of the component frequencies.

Our ability to discriminate frequency is impressive, with thresholds on the order of 1% (Moore, 1973), and there has been longstanding interest in whether this ability in part depends on fine-grained spike timing information (Heinz, Colburn, & Carney, 2001).

[Figure 2.4: (A) stimulus waveform and overlaid spike traces; (B) phase-locking index versus frequency (Hz).]
Figure 2.4 Phase locking. (A) A 200 Hz pure tone stimulus waveform aligned in time with several overlaid traces of an auditory nerve fiber’s response to the tone. Note that the spikes are not uniformly distributed in time, but rather occur at particular phases of the sinusoidal input. (B) A measure of phase locking for each of a set of nerve fibers in response to different frequencies. Phase locking decreases at high frequencies. Source: Reprinted from Javel and Mott (1988). Reproduced with permission of Elsevier.


Although phase locking remains uncharacterized in humans due to the unavailability of human auditory nerve recordings, it is presumed to occur in much the same way as in nonhuman auditory systems. Moreover, several psychophysical phenomena are consistent with a role for phase locking in human hearing. For instance, frequency discrimination becomes much poorer for frequencies above 4 kHz (Moore, 1973), roughly the point at which phase locking declines in nonhuman animals. The fundamental frequency of the highest note on a piano is also approximately 4 kHz; this is also the point above which melodic intervals between pure tones (tones containing a single frequency) are much less evident (Attneave & Olson, 1971; Demany & Semal, 1990). These findings provide some circumstantial evidence that phase locking is important for deriving precise estimates of frequency, but definitive evidence remains elusive. It remains possible that the perceptual degradations at high frequencies reflect a lack of experience with such frequencies, or their relative unimportance for typical behavioral judgments, rather than a physiological limitation.

The upper limit of phase locking is also known to decrease markedly at each successive stage of the auditory system (Wallace, Anderson, & Palmer, 2007). By primary auditory cortex, the upper cutoff is in the neighborhood of a few hundred Hz. It would thus seem that the phase locking that occurs robustly in the auditory nerve would need to be rapidly transformed into a spike rate code if it were to benefit processing throughout the auditory system. Adding to the puzzle is the fact that frequency tuning is not thought to be dramatically narrower at higher stages in the auditory system. Such tightening might be expected if the frequency information provided by phase-locked spikes was transformed to yield improved rate-based frequency tuning at subsequent stages (though

see Bitterman, Mukamel, Malach, Fried, and Nelken (2008)).

Feedback

Like other sensory systems, the auditory system can be thought of as a processing cascade, extending from the sensory receptors to cortical areas believed to mediate auditory-based decisions. This “feed-forward” view of processing underlies much auditory research. As in other systems, however, feedback from later stages to earlier ones is ubiquitous and substantial, and in the auditory system is perhaps even more pronounced than elsewhere in the brain. Unlike the visual system, for instance, the auditory pathways contain feedback extending all the way back to the sensory receptors. The function of much of this feedback remains poorly understood, but one particular set of projections—the cochlear efferent system—has been the subject of much discussion.

Efferent connections to the cochlea originate primarily from the superior olivary nucleus, an area of the midbrain a few synapses removed from the cochlea (Figure 2.1, though the efferent pathways are not shown). The superior olive is divided into two subregions, medial and lateral, and to first order, these give rise to two efferent projections: one from the medial superior olive to the outer hair cells, called the medial olivocochlear (MOC) efferents, and one from the lateral superior olive to the inner hair cells (the LOC efferents) (Elgoyhen & Fuchs, 2010). The MOC efferents have been more thoroughly studied than their LOC counterparts. Their activation (by electrical stimulation, for instance) is known to reduce the basilar membrane response to low-intensity sounds and to cause the frequency tuning of the response to broaden. This is probably because the MOC efferents inhibit the outer hair cells, which are crucial to


amplifying the response to low-intensity sounds, and to sharpening frequency tuning.

The MOC efferents may serve a protective function by reducing the response to loud sounds (Rajan, 2000), but their most commonly proposed function is to enhance the response to transient sounds in noise (Guinan, 2006). When the MOC fibers are severed, for instance, performance on tasks involving discrimination of tones in noise is reduced (May & McQuone, 1995). Noise-related MOC effects are proposed to derive from the MOC’s influence on adaptation, which when induced by background noise reduces the detectability of transient foreground sounds by decreasing the dynamic range of the auditory nerve’s response. Because MOC activation reduces the response to ongoing sound, adaptation induced by continuous background noise is reduced, thus enhancing the response to transient tones that are too brief to trigger the MOC feedback themselves (Kawase, Delgutte, & Liberman, 1993; Winslow & Sachs, 1987). Another interesting but controversial proposal is that the MOC efferents play a role in auditory attention. One study, for instance, found that patients whose vestibular nerve (containing the MOC fibers) had been severed were better at detecting unexpected tones after the surgery, suggesting that selective attention had been altered so as to prevent the focusing of resources on expected frequencies (Scharf, Magnan, & Chays, 1997). See Guinan (2006) for a recent review of these and other ideas about MOC efferent function.

SOUND MEASUREMENT—MODULATION

Subcortical Auditory Pathways

The auditory nerve feeds into a cascade of interconnected subcortical regions that lead up to the auditory cortex, as shown in Figure 2.1. The subcortical auditory pathways

have complex anatomy, only some of which is depicted in Figure 2.1. In contrast to the subcortical pathways of the visual system, which are less complex and largely preserve the representation generated in the retina, the subcortical auditory areas exhibit a panoply of interesting response properties not found in the auditory nerve, many of which remain active topics of investigation. Several subcortical regions will be referred to in the sections that follow in the context of other types of acoustic measurements or perceptual functions. One of the main features that emerges in subcortical auditory regions is tuning to amplitude modulation, the subject of the next section.

Amplitude Modulation and the Envelope

The cochlea decomposes the acoustic input into frequency channels, but much of the important information in sound is conveyed by the way that the output of these frequency channels is modulated in amplitude. Consider Figure 2.5A, which displays in blue the output of one such frequency channel for a short segment of a speech signal. The blue waveform oscillates at a rapid rate, but its amplitude waxes and wanes at a much lower rate (evident in the close-up view of Figure 2.5B). This waxing and waning is known as “amplitude modulation” and is a common feature of many modes of sound production (e.g., vocal articulation). The amplitude is captured by what is known as the “envelope” of a signal, shown in red for the signal of Figures 2.5A and B. The envelopes of a set of bandpass filters can be stacked vertically and displayed as an image, generating a spectrogram (referred to as a cochleagram when the filters mimic the frequency tuning of the cochlea, as in Figure 2.5C). Figure 2.5D shows the spectra of the signal and its envelope. The signal spectrum is bandpass (because it is the output of a


[Figure 2.5: (A, B) pressure versus time (ms); (C) frequency (Hz) versus time (ms); (D) power (dB) versus frequency (Hz).]
Figure 2.5 Amplitude modulation. (A) The output of a bandpass filter (centered at 340 Hz) for a recording of speech, plotted in blue, with its envelope plotted in red. (B) Close-up of part of (A) (corresponding to the black rectangle in (A)). Note that the filtered sound signal (like the unfiltered signal) fluctuates around zero at a high rate, whereas the envelope is positive valued, and fluctuates more slowly. (C) Cochleagram of the same speech signal formed from the envelopes of a set of filters mimicking the frequency tuning of the cochlea (one of which is plotted in (A)). The cochleagram is produced by plotting each envelope horizontally in grayscale. (D) Power spectra of the filtered speech signal in (A) and its envelope. Note that the envelope contains power only at low frequencies (modulation frequencies), whereas the filtered signal has power at a restricted range of high frequencies (audio frequencies).

bandpass filter), with energy at frequencies in the audible range. The envelope spectrum, in contrast, is lowpass, with most of the power below 10 Hz, corresponding to the slow rate at which the envelope changes. The frequencies that compose the envelope are typically termed “modulation frequencies,” distinct from the “audio frequencies” that compose the signal that the envelope is derived from.

The information carried by a cochlear channel can thus be viewed as the product of an amplitude envelope—that varies slowly—and its “fine structure”—a waveform that

varies rapidly, at a rate close to the center frequency of the channel (Rosen, 1992). The envelope and fine structure have a clear relation to common signal processing formulations in which the output of a bandpass filter is viewed as a single sinusoid varying in amplitude and frequency—the envelope describes the amplitude variation, and the fine structure describes the frequency variation. The envelope of a frequency channel is straightforward to extract from the auditory nerve—the envelope results from lowpass filtering a spike train, as the envelope is reflected in relatively slow changes in the


rectified sound signal. Despite the fact that envelope and fine structure are not completely independent (Ghitza, 2001), there has been much interest in the last decade in distinguishing their roles in different aspects of hearing (Smith, Delgutte, & Oxenham, 2002) and its impairment (Lorenzi, Gilbert, Carn, Garnier, & Moore, 2006).
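The decomposition is commonly operationalized with the analytic signal (Hilbert transform), sketched below under illustrative assumptions; the modulated tone stands in for the output of a single cochlear channel, and stacking such envelopes across channels (often after downsampling and compression) yields a cochleagram like the one in Figure 2.5C.

```python
# Sketch of the envelope/fine-structure decomposition of one channel using the
# analytic signal. The modulated tone is a stand-in for a cochlear filter output.
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
subband = (1 + np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 1000 * t)

analytic = hilbert(subband)
envelope = np.abs(analytic)                      # slow amplitude modulation (~8 Hz)
fine_structure = np.cos(np.angle(analytic))      # rapid carrier variation (~1 kHz)
```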

Perhaps surprisingly, the temporal information contained in amplitude envelopes can be sufficient for speech comprehension even when spectral information is severely limited. In a classic paper, Shannon and colleagues isolated the information contained in the amplitude envelopes of speech signals with a stimulus known as “noise-vocoded speech” (Shannon et al., 1995). Noise-vocoded speech is generated by filtering a speech signal and a noise signal into frequency bands, multiplying the frequency bands of the noise by the envelopes of the speech, and then summing the modified noise bands to synthesize a new sound signal. By using a small number of broad frequency bands, spectral information can be greatly reduced, leaving amplitude variation over time (albeit smeared across a broader than normal range of frequencies) as the primary signal cue. Examples are shown in Figure 2.6 for two, four, and eight bands. Shannon and colleagues found that the resulting stimulus was intelligible even when just a few bands were used (i.e., with much broader frequency tuning than is present in the cochlea), indicating that the temporal modulation of the envelopes contains much information about speech content.
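The vocoding procedure just described can be sketched in a few lines; the band edges, filter order, and the Hilbert-envelope step below are illustrative choices rather than those of the original study.

```python
# Sketch of noise vocoding: filter speech and a noise carrier into the same broad
# bands, impose the speech-band envelopes on the noise bands, and sum.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, fs):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def noise_vocode(speech, fs, band_edges):
    noise = np.random.default_rng(0).standard_normal(len(speech))
    out = np.zeros(len(speech))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        envelope = np.abs(hilbert(bandpass(speech, lo, hi, fs)))  # speech-band envelope
        out += envelope * bandpass(noise, lo, hi, fs)             # imposed on noise band
    return out / np.max(np.abs(out))

# Example: a four-band vocoder spanning 80 Hz to 6 kHz (log-spaced band edges).
# vocoded = noise_vocode(speech_waveform, fs, np.geomspace(80.0, 6000.0, 5))
```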

Modulation Tuning

Amplitude modulation has been proposed to be analyzed by dedicated banks of filters operating on the envelopes of cochlear filter outputs rather than the sound waveform itself (Dau, Kollmeier, & Kohlrausch, 1997). Early evidence for such a notion came from masking and adaptation experiments, which found

that the detection of a modulated signal was impaired by a masker or adapting stimulus modulated at a similar frequency (Bacon & Grantham, 1989; Houtgast, 1989; Tansley & Suffield, 1983). There is now considerable evidence from neurophysiology that single neurons in the midbrain, thalamus, and cortex exhibit some degree of tuning to modulation (Depireux, Simon, Klein, & Shamma, 2001; Joris, Schreiner, & Rees, 2004; Miller, Escabi, Read, & Schreiner, 2001; Rodriguez, Chen, Read, & Escabi, 2010; Schreiner & Urbas, 1986; Schreiner & Urbas, 1988; Woolley, Fremouw, Hsu, & Theunissen, 2005), loosely consistent with the idea of a modulation filter bank (Figure 2.7A).

Modulation tuning in single neurons is often studied by measuring spectrotemporal receptive fields (STRFs) (Depireux et al., 2001), conventionally estimated using techniques such as spike-triggered averaging (Theunissen et al., 2001). To compute a STRF, neuronal responses to a long, random stimulus are recorded, after which the stimulus spectrogram segments preceding each spike are averaged to yield the STRF—the stimulus, described in terms of audio frequency content over time, that on average preceded a spike. Alternatively, a linear model can be fit to the neuronal response given the stimulus (Willmore & Smyth, 2003). In Figure 2.7B, for instance, the STRF consists of a decrease in power followed by an increase in power in the range of 10 kHz; the neuron would thus be likely to respond well to a rapidly modulated 10 kHz tone, and less so to a tone whose amplitude was constant. This STRF can be viewed as a filter that passes modulations in a certain range of rates, that is, modulation frequencies. Modulation tuning functions (e.g., those shown in Figure 2.7A) can be obtained via the Fourier transform of the STRF. Note, though, that the sample STRF in Figure 2.7B is also tuned in audio frequency (the dimension on the


[Figure 2.6: four cochleagrams (frequency in Hz versus time in s) of the utterance “She argues with her sister.”]
Figure 2.6 Noise-vocoded speech. (A) Cochleagram of a speech utterance, generated as in Figure 2.5C. (B–D) Cochleagrams of noise-vocoded versions of the utterance from (A), generated with eight (B), four (C), or two (D) channels. To generate the noise-vocoded speech, the amplitude envelope of the original speech signal was measured in each of the frequency bands in (B), (C), and (D). A white noise signal was then filtered into these same bands and the noise bands were multiplied by the corresponding speech envelopes. These modulated noise bands were then summed to generate a new sound signal. It is visually apparent that the sounds in (B)–(D) are spectrally coarser versions of the original utterance. Good speech intelligibility is usually obtained with only four channels, indicating that patterns of amplitude modulation can support speech recognition in the absence of fine spectral detail.

y-axis), responding only to modulations of fairly high audio frequencies. Such receptive fields are commonly observed in subcortical auditory regions such as the inferior colliculus and medial geniculate nucleus.
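A bare-bones version of the spike-triggered averaging computation described above can be sketched as follows; the inputs (a spectrogram or cochleagram and binned spike counts) are assumed, and no correction for stimulus correlations is applied.

```python
# Bare-bones spike-triggered average: average the spectrogram segments that
# precede each spike, weighted by the spike count in each time bin.
import numpy as np

def strf_spike_triggered_average(spectrogram, spike_counts, n_history):
    """spectrogram: (n_freq, n_time); spike_counts: (n_time,) in matching bins."""
    n_freq, n_time = spectrogram.shape
    sta = np.zeros((n_freq, n_history))
    total = 0
    for t in range(n_history, n_time):
        if spike_counts[t] > 0:
            sta += spike_counts[t] * spectrogram[:, t - n_history:t]
            total += spike_counts[t]
    return sta / max(total, 1)          # frequency x time-preceding-spike
```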

The signal processing effects of subcortical auditory circuitry are encapsulated in the

modulation filter bank model, as shown in Figure 2.7C (Dau et al., 1997; McDermott & Simoncelli, 2011). The sound waveform is passed through a set of bandpass filters that simulate cochlear frequency selectivity. The envelopes of the filter outputs are extracted and passed through a compressive


[Figure 2.7: (A) Modulation Tuning Curves, response (dB) versus temporal modulation frequency (Hz); (B) Example STRF (Inferior Colliculus), frequency (kHz) versus time preceding spike (ms); (C) Modulation Filter Model of Subcortical Signal Processing: cochlear filtering, envelope and compressive nonlinearity, modulation filtering.]
Figure 2.7 Modulation tuning. (A) Example temporal modulation tuning curves for neurons in the medial geniculate nucleus of the thalamus. (B) Example spectrotemporal receptive field (STRF) from a thalamic neuron. Note that the modulation in the STRF is predominantly along the temporal dimension, and that this neuron would thus be sensitive primarily to temporal modulation. (C) Diagram of modulation filter bank model of peripheral auditory processing. The sound waveform is filtered by a simulated cochlear filter bank, the envelopes of which are passed through a compressive nonlinearity before being filtered by a modulation filter bank. Source: From Miller, Escabi, Read, and Schreiner (2002). Reproduced with permission of the American Physiological Society. Diagram modified from McDermott and Simoncelli (2011). Reproduced with permission of Elsevier.


nonlinearity, simulating cochlear compression. These envelopes are then passed through a modulation filter bank. Because the modulation filters operate on the envelope of a particular cochlear channel, they are tuned both in audio frequency (courtesy of the cochlea) and modulation frequency, like the sample STRF in Figure 2.7B. It is important to note that the model discards the fine structure of each cochlear subband. The fine structure is reflected in the phase locking evident in auditory nerve fibers, but is neglected in envelope-based models of auditory processing (apart from being implicitly captured to some extent by the envelopes of adjacent filters, which jointly constrain their fine structure). This model is often conceived as capturing the signal processing that occurs between the ear and the thalamus (McDermott & Simoncelli, 2011), although it is clearly only a first-pass approximation.
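The stages of this model can be sketched end to end; the filter shapes, bandwidths, and the compressive exponent below are illustrative stand-ins for the specific choices made in published implementations (e.g., Dau et al., 1997; McDermott & Simoncelli, 2011).

```python
# End-to-end sketch of a modulation filter bank model: cochlear bandpass
# filtering, envelope extraction, compressive nonlinearity, then modulation
# filters applied to each envelope. All parameter values are illustrative.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, fs, order=2):
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def modulation_features(signal, fs, audio_cfs, mod_cfs):
    features = {}
    for cf in audio_cfs:
        subband = bandpass(signal, 0.9 * cf, 1.1 * cf, fs)        # crude cochlear channel
        envelope = np.abs(hilbert(subband)) ** 0.3                # envelope + compression
        envelope = envelope - envelope.mean()
        for mf in mod_cfs:
            mod_band = bandpass(envelope, 0.7 * mf, 1.4 * mf, fs)     # modulation channel
            features[(cf, mf)] = np.sqrt(np.mean(mod_band ** 2))      # modulation power
    return features   # each feature is tuned jointly in audio and modulation frequency
```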

One inadequacy of the modulation filter bank model of auditory processing is that the full range of modulation frequencies that are perceptually relevant does not appear to be represented at any single stage of auditory processing. Neurophysiological studies in nonhuman animals have generally found subcortical neurons to prefer relatively high modulation rates (up to 100–200 Hz) (Miller et al., 2002), with lower modulation rates being represented preferentially in the cortex (Schreiner & Urbas, 1986; Schreiner & Urbas, 1988). Neuroimaging results in humans have similarly found that the auditory cortex responds preferentially to low modulation frequencies (in the range of 4–8 Hz) (Boemio, Fromm, Braun, & Poeppel, 2005; Giraud et al., 2000; Schonwiesner & Zatorre, 2009). It seems that the range of preferred modulation frequencies decreases as one ascends the auditory pathway.

Based on this, it is intriguing to speculate that successive stages of the auditory system might process structure at progressively

longer (slower) timescales, analogous to the progressive increase in receptive field size that occurs in the visual system from V1 to inferotemporal cortex (Lerner, Honey, Silbert, & Hasson, 2011). Within the cortex, however, no hierarchy is clearly evident as of yet, at least in the response to simple patterns of modulation (Giraud et al., 2000; Boemio et al., 2005). Moreover, there is considerable variation within each stage of the pathway in the preferred modulation frequency of individual neurons (Miller et al., 2001; Rodriguez et al., 2010). There are several reports of topographic organization for modulation frequency in the inferior colliculus, in which a gradient of preferred modulation frequency is observed orthogonal to the gradient of preferred audio frequency (Baumann et al., 2011; Langner, Sams, Heil, & Schulze, 1997). Similar topographic organization has been proposed to exist in the cortex, though the issue remains unsettled (Barton, Venezia, Saberi, Hickok, & Brewer, 2012; Herdener et al., 2013; Nelken et al., 2008).

As with the frequency tuning of the auditory nerve (Lewicki, 2002; Smith & Lewicki, 2006), modulation tuning has been proposed to be consistent with an efficient coding strategy. Modulation tuning bandwidths in the inferior colliculus tend to increase with preferred modulation frequency (Rodriguez et al., 2010), as would be predicted if the lowpass modulation spectra of most natural sounds (Attias & Schreiner, 1997; McDermott, Wrobleski, & Oxenham, 2011; Singh & Theunissen, 2003) were to be divided into channels conveying equal power. Auditory neurons have also been found to convey more information about sounds whose amplitude distribution follows that of natural sounds rather than that of white noise (Escabi et al., 2003). Studies of STRFs in the bird auditory system also indicate that neurons are tuned to the properties of bird song and other natural sounds, maximizing discriminability


of behaviorally important sounds (Hsu, Woolley, Fremouw, & Theunissen, 2004; Woolley et al., 2005). Similar arguments have been made about the coding of binaural cues to sound localization (Harper & McAlpine, 2004).

PRIMARY AUDITORY CORTEX

The auditory nucleus of the thalamus directs most of its projections to one region of the auditory cortex, defined on this basis as primary auditory cortex. Other cortical regions also receive thalamic projections, but they are substantially sparser. Primary auditory cortex is also often referred to as the “core” auditory cortex. In humans, primary auditory cortex occupies Heschl’s gyrus, also known as the transverse temporal gyrus, located within the lateral sulcus. The rare cases in which humans have bilateral lesions of primary auditory cortex produce profound hearing impairment, termed “cortical deafness” (Hood, Berlin, & Allen, 1994). The structure and functional properties of primary auditory cortex are relatively well established compared to the rest of the auditory cortex, and it is the last stage of auditory processing for which computational models exist.

Spectrotemporal Modulation Tuning

Particularly in the auditory cortex, neurons often exhibit tuning for spectral modulation in addition to the tuning for temporal modulation discussed in the previous section. Spectral modulation is variation in power that occurs along the frequency axis. Spectral modulation is frequently evident in natural sounds such as speech, both from individual frequency components, and from formants—the broad peaks in the instantaneous spectra produced by vocal tract resonances that characterize vowel sounds

(e.g., Figure 2.5C). Tuning to spectral modulation is generally less pronounced than to amplitude modulation, but is an important feature of cortical responses (Barbour & Wang, 2003). Examples of cortical STRFs with spectral modulation sensitivity are shown in Figure 2.8A. Observations of complex spectrotemporal modulation tuning in cortical neurons underlie what is arguably the standard model of cortical auditory processing (Figure 2.8B), in which a cochleagram-like representation is passed through a bank of filters tuned to temporal and spectral modulations of various rates (Chi, Ru, & Shamma, 2005).

The STRF approximates a neuron’s output as a linear function of the cochlear input—the result of convolving the spectrogram of the acoustic input with the STRF. However, particularly in the cortex, it is clear that linear models are inadequate to explain neuronal responses (Christianson, Sahani, & Linden, 2008; Machens, Wehr, & Zador, 2004; Rotman, Bar Yosef, & Nelken, 2001; Theunissen, Sen, & Doupe, 2000). Understanding the nonlinear contributions is an important direction of future research (Ahrens, Linden, & Sahani, 2008; David, Mesgarani, Fritz, & Shamma, 2009), but at present much analysis is restricted to linear receptive field estimates. There are established methods for computing STRFs, and they exhibit many interesting properties even though it is clear that they are not the whole story.
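For concreteness, the linear prediction implied by an STRF can be written in a few lines: the predicted response at each time point is the recent cochleagram history weighted by the STRF and summed over frequency and time lag. The inputs below are assumed arrays of matching frequency resolution; this is the linear model that the studies cited above find to be incomplete.

```python
# Rough sketch of the linear response prediction implied by an STRF.
import numpy as np

def predict_linear_response(cochleagram, strf):
    """cochleagram: (n_freq, n_time); strf: (n_freq, n_history)."""
    n_freq, n_time = cochleagram.shape
    n_history = strf.shape[1]
    response = np.zeros(n_time)
    for t in range(n_history, n_time):
        response[t] = np.sum(strf * cochleagram[:, t - n_history:t])
    return response
```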

Tonotopy

Although many of the functional properties of cortical neurons are distinct from what is found in auditory nerve responses, frequency tuning persists. Many cortical neurons have a preferred frequency, although they are often less responsive to pure tones (relative to sounds with more complex spectra) and often have broader tuning than neurons in


[Figure 2.8: (A) Example STRFs (Cortex), frequency (kHz) versus time preceding spike (ms); (B) Spectrotemporal Filter Model of Auditory Cortical Processing, frequency versus time, with filters indexed by scale (cyc/oct) and rate (Hz).]
Figure 2.8 STRFs. (A) Example STRFs from cortical neurons. Note that the STRFs feature spectral modulation in addition to temporal modulation, and as such are selective for more complex acoustic features. Cortical neurons typically have longer latencies than subcortical neurons, but this is not evident in the STRFs, probably because of nonlinearities in the cortical neurons that produce small artifacts in the STRFs (Stephen David, personal communication, 2011). (B) Spectrotemporal filter model of auditory cortical processing, in which a cochleagram-like representation of sound is filtered by a set of linear spectrotemporal filters tuned to scale (spectral modulation) and rate (temporal modulation). Source: (A): Mesgarani, David, Fritz, and Shamma (2008). Reproduced with permission of AIP Publishing LLC. (B): Chi, Ru, and Shamma (2005); Mesgarani and Shamma (2011). Reproduced with permission of AIP Publishing LLC and IEEE.

peripheral stages (Moshitch, Las, Ulanovsky, Bar Yosef, & Nelken, 2006). Moreover, neurons tend to be spatially organized to some extent according to their best frequency, forming “tonotopic” maps. Cortical frequency maps were one of the first reported findings in single-unit neurophysiology studies of the auditory cortex in animals, and have since been found using functional magnetic resonance imaging (fMRI) in humans (Formisano et al., 2003; Humphries,

Liebenthal, & Binder, 2010; Talavage et al., 2004) as well as monkeys (Petkov, Kayser, Augath, & Logothetis, 2006). Tonotopic maps are also present in subcortical auditory regions. Although never formally quantified, it seems that tonotopy is less robust than the retinotopy found in the visual system (evident, for instance, in two-photon imaging studies [Bandyopadhyay, Shamma, & Kanold, 2010; Rothschild, Nelken, & Mizrahi, 2010]).


[Figure 2.9: map of best frequency (Hz) on the cortical surface, color scale from 200 to 6400 Hz.]
Figure 2.9 Tonotopy. Best frequency of voxels in the human auditory cortex, measured with fMRI, plotted on the flattened cortical surface. Note that the best frequency varies quasi-smoothly over the cortical surface, and is suggestive of two maps that are approximately mirror images of each other. Source: From Humphries, Liebenthal, and Binder (2010). Reproduced courtesy of Elsevier.

Although the presence of some degree of tonotopy in the cortex is beyond question, its functional importance remains unclear. Frequency selectivity is not the end goal of the auditory system, and it does not obviously bear much relevance to behavior, so it is unclear why tonotopy would be a dominant principle of organization throughout the auditory system. At present, however, tonotopy remains a staple of textbooks and review chapters such as this. Practically, tonotopy is useful to auditory neuroscientists because it provides a simple functional signature of the primary auditory cortex. Figure 2.9 shows an example tonotopic map obtained in a human listener with fMRI. Humans exhibit a stereotyped high-low-high gradient of preferred frequency, typically interpreted as two mirror-reversed maps. These two maps are sometimes referred to as Te 1.0 and Te 1.2 in humans (Morosan et al., 2001). The macaque

exhibits similar organization, although additional fields are typically evident (Baumann, Petkov, & Griffiths, 2013). Tonotopy remains the primary functional criterion by which auditory cortical regions are distinguished.

NONPRIMARY AUDITORY CORTEX

Largely on grounds of anatomy and connectivity, the mammalian auditory cortex is standardly divided into three sets of regions (Figure 2.10): a core region receiving direct input from the thalamus, a “belt” region surrounding it, and a “parabelt” region beyond that (Kaas & Hackett, 2000; Sweet, Dorph-Petersen, & Lewis, 2005). Within these areas tonotopy is often used to delineate distinct “fields.” The core region is divided in this way into areas A1, R (for rostral), and RT (for rostrotemporal) in nonhuman primates,


[Figure 2.10: labeled fields include A1, R, RT, RM, RTM, RTL, MM, CM, CL, ML, AL, RPB, and CPB.]
Figure 2.10 Anatomy of the auditory cortex. (A) Lateral view of a macaque’s cortex. The approximate location of the parabelt region is indicated with dashed red lines. (B) View of the brain from (A) after removal of the overlying parietal cortex. Approximate locations of the core (solid red line), belt (dashed yellow line), and parabelt (dashed orange line) regions are shown. Abbreviations: superior temporal gyrus (STG), superior temporal sulcus (STS), lateral sulcus (LS), central sulcus (CS), arcuate sulcus (AS), insula (INS). (C) Connectivity between A1 and other auditory cortical areas. Solid lines with arrows denote dense connections; dashed lines with arrows denote less dense connections. RT (the rostrotemporal field), R (the rostral field), and A1 comprise the core; all three subregions receive input from the thalamus. The areas surrounding the core comprise the belt, and the two regions outlined with dashed lines comprise the parabelt. The core has few direct connections with the parabelt or more distant cortical areas. Source: From Kaas and Hackett (2000).

Page 20: Audition' in: Stevens’ Handbook of Experimental ...mcdermottlab.mit.edu/...Audition_Stevens_Handbook.pdf · The fifth section discusses the perception of sound source properties,

20 Audition

with A1 and R receiving direct input from themedial geniculate nucleus of the thalamus.There are also multiple belt areas (Petkovet al., 2006), each receiving input from thecore areas. Functional imaging reveals manyadditional areas that respond to sound inthe awake primate, including parts of theparietal and frontal cortex (Poremba et al.,2003). There are some indications that thethree core regions have different properties(Bendor & Wang, 2008), and that stimulusselectivity increases in complexity from thecore to surrounding areas (Kikuchi, Hor-witz, & Mishkin, 2010; Rauschecker & Tian,2004; Tian & Rauschecker, 2004), suggestiveof a hierarchy of processing. However, atpresent there is not a single widely acceptedframework for auditory cortical organization.Several principles of organization have beenproposed with varying degrees of empiricalsupport.

Some of the proposed organizational principles clearly derive inspiration from the visual system. For instance, selectivity for vocalizations and selectivity for spatial location have been found to be partially segregated, each being most pronounced in a different part of the lateral belt (Tian, Reser, Durham, Kustov, & Rauschecker, 2001; Woods, Lopez, Long, Rahman, & Recanzone, 2006). These regions have thus been proposed to constitute the beginning of ventral "what" and dorsal "where" pathways analogous to those in the visual system, perhaps culminating in the same parts of the prefrontal cortex as the analogous visual pathways (Cohen et al., 2009; Romanski et al., 1999). Functional imaging results in humans have also been viewed as supportive of this framework (Ahveninen et al., 2006; Alain, Arnott, Hevenor, Graham, & Grady, 2001; Warren, Zielinski, Green, Rauschecker, & Griffiths, 2002). Additional evidence for a "what"/"where" dissociation comes from a recent study in which sound localization and temporal pattern discrimination in cats were selectively impaired by reversibly deactivating different regions of nonprimary auditory cortex (Lomber & Malhotra, 2008). However, other studies have found less evidence for segregation of tuning properties in early auditory cortex (Bizley, Walker, Silverman, King, & Schnupp, 2009). Moreover, the properties of the "what" stream remain relatively undefined (Recanzone, 2008); at this point it has been defined mainly by reduced selectivity to location.

There have been further attempts to extend the characterization of a ventral auditory pathway by testing for specialization for the analysis of particular types of sounds, potentially analogous to what has been found in the ventral visual system (Kanwisher, 2010). The most widely proposed specialization is for speech and/or for vocalizations more generally. Responses to speech have been reported in the superior temporal gyrus (STG) of humans for over a decade (Binder et al., 2000; Hickok & Poeppel, 2007; Liebenthal, Binder, Spitzer, Possing, & Medler, 2005; Obleser, Zimmermann, Van Meter, & Rauschecker, 2007; Scott, Blank, Rosen, & Wise, 2000). Recent fMRI results indicate that the STG is involved in an analysis of speech that is at least partly distinct from linguistic processing, in that its response is driven by speech structure even when the speech is foreign and thus unintelligible (Overath, McDermott, Zarate, & Poeppel, 2015). The extent of naturalistic speech structure was manipulated using "quilts" that concatenate speech segments of some length in random order. As the quilt segment length increases, the stimulus becomes increasingly similar to natural speech. The response of Heschl's gyrus (primary auditory cortex in humans) was found to be similar irrespective of quilt segment length. By contrast, the response of regions of the STG increased with segment length, indicating sensitivity to the temporal structure of speech (Overath et al., 2015). These results complement recent findings of phonemic selectivity in the STG, measured using recordings from the surface of the cortex in epilepsy patients (Mesgarani, Cheung, Johnson, & Chang, 2014). The relationship of these speech-selective responses to the representation of voice identity (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000) remains unclear. Voice-selective regions appear to be present in macaque nonprimary auditory cortex (Petkov et al., 2008) and could be homologous to voice- or speech-selective regions in humans.

Traditionally, segregation of function has been explored by testing whether particular brain regions respond more to one class of sound than to a small set of other classes, with the sound classes typically linked to particular prior hypotheses. The approach is limited by the ability of the experimenter to construct relevant hypotheses, and by the small sets of stimulus conditions used to establish selectivity. One recent study from my lab has attempted to circumvent these limitations by measuring responses throughout the auditory cortex to a very large set of natural sounds (Norman-Haignere, Kanwisher, & McDermott, 2015). We measured fMRI responses of "voxels" (small volumes of brain tissue) to 165 natural sounds intended to be representative of the sounds we encounter in daily life, including speech, music, and many types of environmental sounds. We then inferred tuning functions across this stimulus set whose linear combination could best explain the voxel responses. This "voxel decomposition analysis" yielded six components, each characterized by a response profile across the stimulus set and a weight for each voxel in auditory cortex. Four of the components had responses that were largely explained by frequency and modulation tuning, and thus were not strongly selective for the category of the sounds (Figures 2.11A and B). The most frequency-selective components (numbered 1 and 2 in Figure 2.11) had weights that were strongest in the low- and high-frequency portions of the tonotopic map, respectively (Figure 2.11C), as one would expect. The last two components were strongly selective for speech and music, responding strongly to every speech or music sound, respectively, and much less to other types of sounds. The speech-selective component localized lateral to primary auditory cortex, in the STG, consistent with other recent work on speech selectivity (Overath et al., 2015). By contrast, the music-selective component was largely localized anterior to primary auditory cortex. The results thus provide evidence for distinct pathways for music and speech processing in nonprimary auditory cortex. This apparent functional segregation raises many questions about the role of these regions in speech and music perception, about their evolutionary history, and about their dependence on auditory experience and expertise.
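The core idea of the voxel decomposition can be illustrated with a generic matrix factorization: a voxels-by-sounds response matrix is approximated as the product of voxel weights and component response profiles. The sketch below is only a minimal stand-in, not the method of Norman-Haignere et al. (2015), which used a custom decomposition that optimizes the statistical properties of the component weights; here scikit-learn's NMF is assumed as an illustrative substitute, and the data matrix is synthetic rather than real fMRI responses.

```python
# Toy illustration of a voxel decomposition: factor a (voxels x sounds) response
# matrix into component voxel weights and component response profiles.
# NOTE: NMF is a stand-in for the actual decomposition algorithm of
# Norman-Haignere et al. (2015); the data here are synthetic.
import numpy as np
from sklearn.decomposition import NMF

n_voxels, n_sounds, n_components = 5000, 165, 6
rng = np.random.default_rng(0)
responses = rng.random((n_voxels, n_sounds))  # stand-in for measured fMRI responses

model = NMF(n_components=n_components, init="nndsvda", max_iter=500, random_state=0)
voxel_weights = model.fit_transform(responses)   # (n_voxels, n_components)
response_profiles = model.components_            # (n_components, n_sounds)

# Each voxel's response is approximated as a weighted sum of the component profiles:
#   responses ~= voxel_weights @ response_profiles
reconstruction = voxel_weights @ response_profiles
print("reconstruction error:", np.linalg.norm(responses - reconstruction))
```

In the actual analysis, each component's response profile can then be inspected for selectivity (e.g., responding to all speech sounds and little else), and its voxel weights can be projected back onto the cortical surface, as in Figure 2.11C.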

One obvious feature of the component weight maps in Figure 2.11C is a strong degree of bilaterality. This symmetry contrasts with several prior proposals for functional segregation between hemispheres. One proposal is that the left and right auditory cortices are specialized for different aspects of signal processing, with the left optimized for temporal resolution and the right for frequency resolution (Zatorre, Belin, & Penhune, 2002). The evidence for hemispheric differences comes mainly from functional imaging studies that manipulate spectral and temporal stimulus characteristics (Samson, Zeffiro, Toussaint, & Belin, 2011; Zatorre & Belin, 2001) and neuropsychology studies that find pitch perception deficits associated with right temporal lesions (Johnsrude, Penhune, & Zatorre, 2000; Zatorre, 1985). A related alternative idea is that the two hemispheres are specialized to analyze distinct timescales, with the left hemisphere more responsive to short-scale temporal variation (e.g., tens of milliseconds) and the right hemisphere more responsive to long-scale variation (e.g., hundreds of milliseconds) (Boemio et al., 2005; Poeppel, 2003). Such asymmetries are not obvious in our fMRI results, but might become evident with measurements that have better temporal resolution.

Figure 2.11 Functional organization of the nonprimary auditory cortex. (A) Results of decomposing voxel responses to 165 natural sounds into six components. Each component is described by its response to each of the sounds, here ordered by the response magnitude and color coded with the sound category. The first four components are well described by selectivity to established acoustic properties, and are not strongly category selective. By contrast, the last two components are selective for speech and music, respectively. (B) The average response of each of the six components to each category of sound. (C) The weights for each component plotted on an inflated brain. White and black outlines mark the high- and low-frequency fields of the tonotopic map, commonly identified with primary auditory cortex. Components 1–4 are primarily localized in and around primary auditory cortex, whereas the speech- and music-selective components localize to distinct regions of nonprimary auditory cortex.
Source: From Norman-Haignere, Kanwisher, and McDermott (2015). Reprinted with permission of Elsevier.


SOUND SOURCE PERCEPTION

Ultimately, we wish to understand not only what acoustic measurements are made by the auditory system, as were characterized in the previous sections, but also how they give rise to perception—what we hear when we listen to sound. Following Helmholtz, we might suppose that the purpose of audition is to infer something about the events in the world that produce sound. We can often identify sound sources with a verbal label, for instance, and realize that we heard a finger snap, a flock of birds, or construction noise. Even if we cannot determine the object(s) that caused the sound, we may nonetheless know something about what happened: that something fell onto a hard floor, or into water (Gaver, 1993). Despite the richness of these aspects of auditory recognition, remarkably little is known at present about them (speech recognition stands alone as an exception), mainly because they are rarely studied (though see Gygi, Kidd, & Watson, 2004; Lutfi, 2008; and McDermott & Simoncelli, 2011).

Perhaps because they are more easily linked to peripheral processing than are our recognition abilities, researchers have been more inclined to instead study the perception of isolated properties of sounds or their sources (e.g., location, intensity, rate of vibration, or temporal pattern). Much research has concentrated in particular on three well-known properties of sound: spatial location, pitch, and loudness. This focus is on the one hand unfortunate, as auditory perception is much richer than the hegemony of these three attributes in hearing science would indicate. However, their study has given rise to rich lines of research that have yielded many useful insights about hearing.

Localization

Localization is less precise in hearing than in vision, but enables us to localize objects that we may not be able to see. Human observers can judge the location of a source to within a few degrees if conditions are optimal. The processes by which this occurs are among the best understood in hearing.

Spatial location is not made explicit on the cochlea, which provides a map of frequency rather than of space, and instead must be derived from three primary sources of information. Two of these are binaural, resulting from differences in the acoustic input to the two ears. Sounds to one side of the vertical meridian reach the two ears at different times and with different intensities. This is due to the difference in path length from the source to the ears, and to the acoustic shadowing effect of the head. These interaural time and level differences vary with direction and thus provide a cue to a sound source's location. Binaural cues are primarily useful for deriving the location of a sound in the horizontal plane, because changes in elevation do not change interaural time or intensity differences much. To localize sounds in the vertical dimension, or to distinguish sounds coming from in front of the head from those from in back, listeners rely on a third source of information: the filtering of sounds by the body and ears. This filtering is direction specific, such that a spectral analysis can reveal peaks and valleys in the frequency spectrum that are signatures of location in the vertical dimension (Figure 2.12; discussed further below).

Figure 2.12 Head-related transfer functions. Sample HRTF for the left ear of one human listener. The gray level represents the amount by which a frequency originating at a particular elevation is attenuated or amplified by the torso, head, and ear of the listener. Sounds are filtered differently depending on their elevation, and the spectrum that is registered by the cochlea thus provides a localization cue. Note that most of the variation in elevation-dependent filtering occurs at high frequencies (above 4 kHz).
Source: From Zahorik, Bangayan, Sundareswaran, Wang, and Tam (2006). Reproduced with permission of AIP Publishing LLC.

Interaural time differences (ITD) are typically a fraction of a millisecond, and just-noticeable differences in ITD (which determine spatial acuity) can be as low as 10 μs (Klump & Eady, 1956). This is striking, given that neural refractory periods (which determine the minimum interspike interval for a single neuron) are on the order of a millisecond, which one might think would put a limit on the temporal resolution of neural representations. Typical interaural level differences (ILD) for a single sound source in a quiet environment can be as large as 20 dB, with a just-noticeable difference of about 1 dB. ILDs result from the acoustic shadow cast by the head, and although the relationship between ILD and location is complex (Culling & Akeroyd, 2010), to first order, ILDs are more pronounced for high frequencies, as low frequencies are less affected by the acoustic shadow (because their wavelengths are comparable to the dimensions of the head). ITDs, in contrast, support localization most effectively at low frequencies, when the time difference between individual cycles of sinusoidal sound components can be detected via phase-locked spikes from the two ears (phase locking, as we discussed earlier, degrades at high frequencies). That said, ITDs between the envelopes of high-frequency sounds can also produce percepts of localization. The classical "duplex" view that localization is determined by either ILDs or ITDs, depending on the frequency (Rayleigh, 1907), is thus not fully appropriate for realistic natural sounds, which in general produce perceptible ITDs across the spectrum. It must also be noted that ITDs and ILDs recorded in natural conditions (i.e., with multiple sound sources and background noise) exhibit values and frequency dependence that are distinct from those expected from classical considerations of single sound sources in quiet (Mlynarski & Jost, 2014). More generally, localization in real-world conditions with multiple sources is understudied and remains poorly understood. See Middlebrooks and Green (1991) for a review of much of the classic behavioral work on sound localization.
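To give a rough sense of the magnitudes involved, the classic Woodworth spherical-head approximation expresses the ITD as a function of source azimuth and head radius. The sketch below is a back-of-the-envelope illustration of that approximation, with nominal values assumed for head radius and the speed of sound; it is not a model of how the auditory system itself computes location.

```python
# Woodworth spherical-head approximation for interaural time difference (ITD):
#   ITD(theta) = (a / c) * (theta + sin(theta)),  theta = azimuth in radians
# Assumed values: head radius a ~ 8.75 cm, speed of sound c ~ 343 m/s.
import numpy as np

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    theta = np.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + np.sin(theta))

for az in [0, 15, 45, 90]:
    print(f"azimuth {az:3d} deg -> ITD ~ {itd_woodworth(az) * 1e6:5.0f} microseconds")

# Even at 90 degrees the ITD is only ~650 microseconds; a just-noticeable ITD of
# ~10 microseconds corresponds to roughly a degree of azimuth near the midline.
```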

The binaural cues to sound location are extracted in the superior olive, a subcortical region where inputs from the two ears are combined. There appears to be an elegant segregation of function, with ITDs being extracted in the medial superior olive (MSO) and ILDs being extracted in the lateral superior olive (LSO). In both cases, accurate coding of interaural differences is made possible by neural signaling with unusually high temporal precision. This precision is needed to encode both submillisecond ITDs and ILDs of brief transient events, for which the inputs from the ears must be aligned in time. Brain structures subsequent to the superior olive largely inherit its ILD and ITD sensitivity. See Yin and Kuwada (2010) for a recent review of the physiology of binaural localization.


Binaural cues are of little use in distinguishing sounds at different locations on the vertical dimension (relative to the head), or in distinguishing front from back, as interaural time and level differences are largely unaffected by changes across these locations. Instead, listeners rely on spectral cues provided by the filtering of a sound by the torso, head, and ears of a listener. The filtering results from the reflection and absorption of sound by the surfaces of a listener's body, with sound from different directions producing different patterns of reflection. The effect of these interactions on the sound that reaches the eardrum can be described by a linear filter known as the head-related transfer function (HRTF). The overall effect is that of amplifying some frequencies while attenuating others. A broadband sound entering the ear will thus be endowed with peaks and valleys in its frequency spectrum (Figure 2.12).

Compelling sound localization can be perceived when these peaks and valleys are artificially induced. The effect of the filtering is obviously confounded with the spectrum of the unfiltered sound source, and the listener must make some assumptions about the source spectrum. When these assumptions are violated, as with narrowband sounds whose spectral energy occurs at a peak in the HRTF of a listener, sounds are mislocalized (Middlebrooks, 1992). For broadband sounds, however, HRTF filtering produces signatures that are sufficiently distinct as to support localization in the vertical dimension to within 5 degrees or so in some cases, though some locations are more accurately perceived than others (Makous & Middlebrooks, 1990; Wightman & Kistler, 1989).
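Artificially imposing these direction-specific spectral cues is, in essence, what virtual-acoustics simulations do: because the HRTF is a linear filter, a sound can be spatialized for headphone presentation by convolving it with the measured left- and right-ear impulse responses for the desired direction. The sketch below assumes a pair of head-related impulse responses (hrir_left, hrir_right) has been obtained from some measured HRTF dataset; the arrays and sampling rate here are placeholders.

```python
# Minimal sketch of HRTF-based spatialization: convolve a mono source with the
# left- and right-ear head-related impulse responses (HRIRs) for one direction.
# hrir_left / hrir_right would normally come from a measured HRTF dataset; here
# they are placeholder arrays.
import numpy as np
from scipy.signal import fftconvolve

fs = 44100                                  # sampling rate (Hz), assumed
t = np.arange(0, 0.5, 1 / fs)
source = np.random.randn(t.size)            # broadband noise burst as the source

hrir_left = np.random.randn(256) * 0.01     # placeholder for a measured left HRIR
hrir_right = np.random.randn(256) * 0.01    # placeholder for a measured right HRIR

left = fftconvolve(source, hrir_left, mode="full")
right = fftconvolve(source, hrir_right, mode="full")
binaural = np.stack([left, right], axis=1)  # two-channel signal for headphones
print(binaural.shape)
```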

The bulk of the filtering occurs in the outer ear (the pinna), the folds of which produce a distinctive pattern of reflections. Because pinna shapes vary across listeners, the HRTF is listener specific as well as location specific, with spectral peaks and valleys that are in different places for different listeners. Listeners appear to learn the HRTFs for their set of ears. When ears are artificially modified with plastic molds that change their shape, localization initially suffers considerably, but over a period of weeks, listeners regain the ability to localize with the modified ears (Hofman, Van Riswick, & van Opstal, 1998). Listeners thus learn at least some of the details of their particular HRTF through experience, although sounds can be localized even when the peaks and valleys of the pinna filtering are somewhat blurred (Kulkarni & Colburn, 1998). Moreover, compelling spatialization is often evident even if a generic HRTF is used.

The physiology of HRTF-related cues for localization is not as developed as it is for binaural cues, but there is evidence that midbrain regions may again be important. Many inferior colliculus neurons, for instance, show tuning to sound elevation (Delgutte, Joris, Litovsky, & Yin, 1999). The selectivity for elevation presumably derives from tuning to particular spectral patterns (peaks and valleys in the spectrum) that are diagnostic of particular locations (May, Anderson, & Roos, 2008).

Although the key cues for sound localization are extracted subcortically, lesion studies reveal that the cortex is essential for localizing sound behaviorally. Ablating the auditory cortex typically produces large deficits in localizing sounds (Heffner & Heffner, 1990), with unilateral lesions producing deficits specific to locations contralateral to the side of the lesion (Jenkins & Masterton, 1982). Consistent with these findings, tuning to sound location is widespread in auditory cortical neurons, with the preferred location generally positioned in the contralateral hemifield (Middlebrooks, 2000). Topographic representations of space have not been found within individual auditory cortical areas, though one recent report argues that such topography may be evident across multiple areas (Higgins, Storace, Escabi, & Read, 2010). See Grothe, Pecka, and McAlpine (2010) for a recent review of the physiological basis of sound localization.

Pitch

Although the word "pitch" is often used colloquially to describe the perception of sound frequency, in hearing research it has a more specific meaning—pitch is defined as the perceptual correlate of periodicity. Vocalizations, instrument sounds, and some machine sounds are all often produced by periodic physical processes. Our vocal cords open and close at regular intervals, producing a series of clicks separated by regular temporal intervals. Instruments produce sounds via strings that oscillate at a fixed rate, or via tubes in which the air vibrates at particular resonant frequencies, to give two examples. Machines frequently feature rotating parts, which often produce sounds at every rotation. In all these cases, the resulting sounds are periodic—the sound pressure waveform consists of a single shape that repeats at a fixed rate (Figure 2.13A).

Figure 2.13 Periodicity and pitch. Waveform, spectrum, and autocorrelation function for a note (the A above middle C, with an F0 of 440 Hz) played on an oboe. (A) Excerpt of waveform. Note that the waveform repeats every 2.27 ms, which is the period. (B) Spectrum. Note the peaks at integer multiples of the F0, characteristic of a periodic sound. In this case the F0 is physically present, but the second, third, and fourth harmonics actually have higher amplitude. (C) Autocorrelation. The correlation coefficient is always 1 at a lag of 0 ms, but because the waveform is periodic, correlations close to 1 are also found at integer multiples of the period (2.27, 4.55, 6.82, and 9.09 ms, in this example).
Source: From McDermott and Oxenham (2008a). Reprinted with permission of Elsevier.


Perceptually, such sounds are heard as having a pitch that can vary from low to high, proportional to the frequency at which the waveform repeats (the fundamental frequency, that is, the F0).

Pitch is important because periodicity is important—the period is often related to properties of the source that are useful to know, such as its size, or tension. Pitch is also used for communicative purposes, varying in speech prosody, for instance, to convey meaning or emotion. Pitch is a centerpiece of music, forming the basis of melody, harmony, and tonality. Listeners also use pitch to track sound sources of interest in auditory scenes.

Many physically different sounds—all those with a particular period—have the same pitch. The periodicity is unrelated to whether a sound's frequencies fall in high or low regions of the spectrum, for instance, though in practice periodicity and the center of mass of the spectrum are sometimes correlated. Historically, pitch has been a focal point of hearing research because it is an important perceptual property with a nontrivial relationship to the acoustic input. Debates on pitch and related phenomena date back at least to Helmholtz, and continue to occupy researchers today (Plack, Oxenham, Popper, & Ray, 2005).

One central debate concerns whether pitch is derived from an analysis of frequency or time. Periodic waveforms produce spectra whose frequencies are harmonically related—they form a harmonic series, being integer multiples of the fundamental frequency, whose period is the period of the waveform (Figure 2.13B). Pitch could thus conceivably be detected with harmonic templates applied to an estimate of a sound's spectrum obtained from the cochlea (Goldstein, 1973; Shamma & Klein, 2000; Terhardt, 1974; Wightman, 1973). Alternatively, periodicity could be assessed in the time domain, for instance via the autocorrelation function (Cariani & Delgutte, 1996; de Cheveigne & Kawahara, 2002; Meddis & Hewitt, 1991). The autocorrelation measures the correlation of a signal with a delayed copy of itself. For a periodic signal that repeats with some period, the autocorrelation exhibits peaks at multiples of the period (Figure 2.13C).
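As a concrete illustration of the time-domain approach, the sketch below estimates the F0 of a periodic signal by locating the best peak of its autocorrelation within a plausible pitch range. This is only a bare-bones illustration of the principle; actual autocorrelation models of pitch (e.g., Meddis & Hewitt, 1991) operate on simulated auditory nerve responses within individual frequency channels rather than on the raw waveform.

```python
# Toy autocorrelation-based F0 estimate: find the lag (within a plausible pitch
# range) at which the signal best correlates with a delayed copy of itself.
import numpy as np

fs = 16000                                   # sampling rate (Hz), assumed
t = np.arange(0, 0.2, 1 / fs)
f0_true = 440.0                              # e.g., the oboe note of Figure 2.13
# A crude "complex tone": several harmonics of the fundamental.
signal = sum(np.sin(2 * np.pi * f0_true * k * t) / k for k in range(1, 6))

ac = np.correlate(signal, signal, mode="full")[signal.size - 1:]  # lags >= 0
ac /= ac[0]                                  # normalize so lag 0 has value 1

lo, hi = int(fs / 500.0), int(fs / 50.0)     # search lags for F0 between 50-500 Hz
best_lag = lo + np.argmax(ac[lo:hi])
# Estimate is quantized to integer sample lags, hence only approximately 440 Hz.
print(f"estimated F0: {fs / best_lag:.1f} Hz (true F0: {f0_true} Hz)")
```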

Such analyses are in principle functionally equivalent: The power spectrum is related to the autocorrelation via the Fourier transform, and detecting periodicity in one domain versus the other might simply seem a question of implementation. In the context of the auditory system, however, the two concepts diverge, because information is limited by distinct factors in the two domains. Time-domain models are typically assumed to utilize fine-grained spike timing (i.e., phase locking) with concomitant temporal resolution limits (because phase locking is absent for high frequencies). In contrast, frequency-based models (often known as "place models," in reference to the frequency-place mapping that occurs on the basilar membrane) rely on the pattern of excitation along the cochlea, which is limited in resolution by the frequency tuning of the cochlea (Cedolin & Delgutte, 2005). Cochlear frequency selectivity is present in time-domain models of pitch as well, but its role is typically not to estimate the spectrum but simply to restrict an autocorrelation analysis to a narrow frequency band (Bernstein & Oxenham, 2005), one consequence of which might be to improve its robustness in the presence of multiple sound sources. Reviews of the current debates and their historical origins are available elsewhere (de Cheveigne, 2004; Plack & Oxenham, 2005), and we will not discuss them exhaustively here.

Research on pitch has provided many important insights about hearing even though a conclusive account of pitch remains elusive. One contribution of pitch research has been to reveal the importance of the resolvability of individual frequency components by the cochlea, a principle that has importance in other aspects of hearing as well. Because the frequency resolution of the cochlea is approximately constant on a logarithmic scale, whereas the components of a harmonic tone are equally spaced on a linear scale (separated by a fixed number of Hz, equal to the fundamental frequency of the tone; Figure 2.14A), multiple high-numbered harmonics fall within a single cochlear filter (Figure 2.14B). Because of the nature of the log scale, this is true regardless of whether the fundamental is low or high. As a result, the excitation pattern induced by a tone on the cochlea (of a human with normal hearing) is believed to contain resolvable peaks for only the first 10 or so harmonics (Figure 2.14).
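The arithmetic behind this resolvability limit can be sketched with a standard approximation to cochlear filter bandwidths, the equivalent rectangular bandwidth (ERB) formula of Glasberg and Moore (1990). Harmonics are spaced by the F0, while filter bandwidths grow with center frequency, so the number of harmonics falling within one filter grows with harmonic number. The code below is an illustrative calculation, not a model from the chapter; the resolvability criterion (roughly one harmonic per ERB) is a common rule of thumb, and the exact cutoff depends on the criterion used.

```python
# Rough illustration of harmonic resolvability using the Glasberg & Moore (1990)
# equivalent rectangular bandwidth (ERB) approximation to cochlear filter widths:
#   ERB(f) = 24.7 * (4.37 * f / 1000 + 1)   [f in Hz]
# Harmonics are spaced by F0, so harmonics-per-ERB = ERB(n * F0) / F0.
def erb_hz(f_hz):
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

f0 = 100.0  # fundamental frequency in Hz (illustrative value)
for n in [1, 2, 5, 8, 10, 15, 20]:
    f = n * f0
    print(f"harmonic {n:2d} ({f:6.0f} Hz): {erb_hz(f) / f0:4.1f} harmonics per ERB")

# Around the 7th-10th harmonic, the spacing between harmonics falls below one ERB,
# so higher harmonics no longer produce separate peaks of cochlear excitation;
# this roughly matches the ~10 resolved harmonics cited in the text.
```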

There is now abundant evidence that resolvability places strong constraints on pitch perception. For instance, human pitch perception is determined predominantly by low-numbered harmonics (harmonics 1–10 or so in the harmonic series), presumably owing to the peripheral resolvability of these harmonics. Moreover, pitch discrimination is much worse for tones synthesized with only high-numbered harmonics than for tones containing only low-numbered harmonics, an effect not accounted for simply by the frequency range in which the harmonics occur (Houtsma & Smurzynski, 1990; Shackleton & Carlyon, 1994). This might be taken as evidence that the spatial pattern of excitation, rather than the periodicity that could be derived from the autocorrelation, underlies pitch perception, but variants of autocorrelation-based models have also been proposed to account for the effect of resolvability (Bernstein & Oxenham, 2005). Resolvability has since been demonstrated to constrain sound segregation as well as pitch (Micheyl & Oxenham, 2010b); see below.

The past decade has seen considerable interest in the neural mechanisms of pitch perception in both humans and nonhuman animals. One question is whether pitch is analyzed in a particular part of the brain. If so, one might expect the region to respond more to stimuli with pitch than to those lacking it, other things being equal. Although initially controversial (Hall & Plack, 2009), it is now reasonably well established that a region of the human auditory cortex exhibits this response signature, responding more to harmonic tones than to spectrally matched noise when measured with fMRI (Norman-Haignere, Kanwisher, & McDermott, 2013; Patterson, Uppenkamp, Johnsrude, & Griffiths, 2002; Penagos, Melcher, & Oxenham, 2004; Schonwiesner & Zatorre, 2008). The region appears to be present in every normal hearing listener, and exhibits a stereotypical location, overlapping the low-frequency portion of primary auditory cortex and extending anterior into nonprimary cortex (Figures 2.15A and B). Moreover, the region is driven by resolved harmonics, mirroring their importance to pitch perception (Figures 2.15C and D) (Norman-Haignere et al., 2013; Penagos et al., 2004). It is also noteworthy that something similar to pitch selectivity emerges from decomposing responses to natural sounds into their underlying components (see Component 4 in Figure 2.11), indicating that it is one of the main selectivities present in the auditory cortex (Norman-Haignere et al., 2015).

Although it remains unclear whether a region with similar characteristics is present in nonhuman animals, periodicity-tuned neurons are present in a similarly located region of the marmoset auditory cortex (Bendor & Wang, 2005). It is thus conceivable that homologous regions exist for pitch processing in the two species (Bendor & Wang, 2006). Comparable neurophysiology results have yet to be reported in other species (Fishman, Reser, Arezzo, & Steinschneider, 1998), however, and some have argued that pitch is encoded by ensembles of neurons with broad tuning rather than single neurons selective for particular fundamental frequencies (Bizley, Walker, King, & Schnupp, 2010). In general, pitch-related responses can be difficult to disentangle from artifactual responses to distortions introduced by the nonlinearities of the cochlea (de Cheveigne, 2010; McAlpine, 2004), though such distortions cannot account for the pitch responses evident with fMRI in humans (Norman-Haignere & McDermott, 2016). See Walker, Bizley, King, and Schnupp (2011) and Winter (2005) for recent reviews of the brain basis of pitch.

Figure 2.14 Resolvability. (A) Spectrum of a harmonic complex tone composed of 35 harmonics of equal amplitude. The fundamental frequency is 100 Hz—the lowest frequency in the spectrum and the amount by which adjacent harmonics are separated. (B) Frequency responses of auditory filters, each of which represents a particular point on the cochlea. Note that because a linear frequency scale is used, the filters increase in bandwidth with center frequency (compare to Figure 2.3a), such that many harmonics fall within the passband of the high-frequency filters. (C) The resulting pattern of excitation along the cochlea in response to the tone in (A). The excitation is the amplitude of vibration of the basilar membrane as a function of characteristic frequency (the frequency to which a particular point on the cochlea responds best, that is, the center frequency of the auditory filter representing the response properties of the cochlea at that point). Note that the first 10 or so harmonics produce resolvable peaks in the pattern of excitation, but that higher-numbered harmonics do not. The latter are thus said to be "unresolved." (D) The pattern of vibration that would be observed on the basilar membrane at several points along its length. When harmonics are resolved, the vibration is dominated by the harmonic close to the characteristic frequency, and is thus sinusoidal. When harmonics are unresolved, the vibration pattern is more complex, reflecting the multiple harmonics that stimulate the cochlea at those points.
Source: Reprinted from Plack (2005). © Chris Plack.

Figure 2.15 Pitch-responsive regions in humans. (A) Anatomical distribution of pitch-sensitive voxels, defined as responding significantly more to harmonic tones than to spectrally matched noise. (B) Tonotopic map obtained by plotting the frequency yielding the maximum response in each voxel (averaged across 12 listeners). Black outline replotted from (A) indicates the location of pitch-responsive voxels, which overlap the low-frequency lobe of the tonotopic map and extend anteriorly. (C) Average response of pitch-sensitive voxels to tones varying in harmonic resolvability and to spectrally matched noise. Responses are lower for tones containing only high-numbered (unresolved) harmonics. (D) Pitch discrimination thresholds for the tones used in the fMRI experiment. Thresholds are higher for tones containing only high-numbered (unresolved) harmonics.
Source: From Norman-Haignere, Kanwisher, and McDermott (2013). Reproduced with permission of the Society for Neuroscience.

It is important to note that hearing research has tended to equate pitch perception with the problem of estimating the F0 of a sound. However, in many real-world contexts (e.g., the perception of music or speech intonation), the changes in pitch over time are arguably more important than the absolute value of the F0. Pitch increases or decreases are what capture the identity of a melody, or the intention of a speaker. Less is known about how this relative pitch information is represented in the brain, but the right temporal lobe has been argued to be important, in part on the basis of brain-damaged patients with apparently selective deficits in relative pitch (Johnsrude et al., 2000). See McDermott and Oxenham (2008a) for a review of the perceptual and neural basis of relative pitch.

Loudness

Loudness is perhaps the most immediate perceptual property of sound. To first order, loudness is the perceptual correlate of sound intensity. In real-world listening scenarios, loudness exhibits additional influences that suggest it serves to estimate the intensity of a sound source, as opposed to the intensity of the sound entering the ear (which changes with distance and the listening environment). However, loudness models that capture exclusively peripheral processing nonetheless have considerable predictive power. The ability to predict perceived loudness is important in many practical situations, and is a central issue in the fitting of hearing aids. The altered compression in hearing-impaired listeners affects the perceived loudness of sounds, and amplification runs the risk of making sounds uncomfortably loud unless compression is introduced artificially. There has thus been longstanding interest in quantitative models of loudness.

For a sound with a fixed spectral profile, such as a pure tone or a broadband noise, the relationship between loudness and intensity can be approximated via the classic Stevens power law (Stevens, 1955). However, the relation between loudness and intensity is not as simple as one might imagine. For instance, loudness increases with increasing bandwidth—a sound whose frequencies lie in a broad range will seem louder than a sound whose frequencies lie in a narrow range, even when their physical intensities are equal.

Standard models of loudness thus posit something somewhat more complex than a simple power law of intensity: that loudness is linearly related to the total amount of neural activity elicited by a stimulus at the level of the auditory nerve (ANSI, 2007; Moore & Glasberg, 1996). The effect of bandwidth on loudness is explained via the compression that occurs in the cochlea: Loudness is determined by the neural activity summed across nerve fibers, the spikes of which are generated after the output of a particular place on the cochlea is nonlinearly compressed. Because compression boosts low responses relative to high responses, the sum of several compressed responses to low amplitudes (produced by the several frequency channels stimulated by a broadband sound) is greater than a single compressed response to a large amplitude (produced by a single frequency channel responding to a narrowband sound of equal intensity). Loudness also increases with duration for durations up to half a second or so (Buus, Florentine, & Poulsen, 1997), suggesting that it is computed from neural activity integrated over some short window.
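The bandwidth effect follows directly from summing compressed channel responses, and can be checked with a few lines of arithmetic. The sketch below is a deliberately simplified caricature (a single compressive exponent applied to the intensity in each channel, with an illustrative exponent of 0.3), not the ANSI (2007) or Moore and Glasberg (1996) loudness model.

```python
# Toy illustration of why broadband sounds are louder than narrowband sounds of
# equal total intensity, under a compressive channel nonlinearity. This is a
# caricature of standard loudness models, with an illustrative exponent of 0.3.
def summed_compressed_response(total_intensity, n_channels, exponent=0.3):
    # Split the total intensity evenly across channels, compress each channel's
    # response, then sum across channels (a stand-in for summing neural activity).
    per_channel = total_intensity / n_channels
    return n_channels * per_channel ** exponent

total_intensity = 1.0
narrowband = summed_compressed_response(total_intensity, n_channels=1)
broadband = summed_compressed_response(total_intensity, n_channels=4)
print(f"narrowband: {narrowband:.2f}, broadband: {broadband:.2f}")
# With 4 channels the summed response is 4**(1 - 0.3), about 2.6 times larger,
# even though the total stimulus intensity is identical.
```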

Loudness is also influenced in interesting ways by the apparent distance of a sound source. Because intensity attenuates with distance from a sound source, the intensity of a sound at the ear is determined conjointly by the intensity and distance of the source. The auditory system appears to use loudness as a perceptual estimate of a source's intensity (i.e., the intensity at the point of origin): Sounds that appear more distant seem louder than those that appear closer but have the same overall intensity. Visual cues to distance have some influence on perceived loudness (Mershon, Desaulniers, Kiefer, Amerson, & Mills, 1981), but the cue provided by the amount of reverberation also seems to be important. The more distant a source, the weaker the direct sound from the source to the listener, relative to the reverberant sound that reaches the listener after reflection off of surfaces in the environment. This ratio of direct to reverberant sound appears to be used both to judge distance and to calibrate loudness perception (Zahorik & Wightman, 2001), though how the listener estimates this ratio from the sound signal remains unclear at present. Loudness thus appears to function somewhat like size or brightness perception in vision, in which perception is not based exclusively on retinal size or light intensity (Adelson, 2000).

AUDITORY SCENE ANALYSIS

Thus far we have discussed how the auditory system represents single sounds in isolation, as might be produced by a note played on an instrument, or a word uttered by someone talking. The simplicity of such isolated sounds renders them convenient objects of study, yet in many auditory environments isolated sounds are not the norm. It is often the case that lots of things make sound at the same time, causing the ear to receive a mixture of multiple sources as its input. Consider Figure 2.16, which displays cochleagrams of a single "target" speaker along with that of the mixture that results from adding to it the utterances of one, three, and seven additional speakers, as might occur in a social setting. The brain's task in this case is to take such a mixture as input and recover enough of the content of a target sound source to allow speech comprehension or otherwise support behavior. This is a nontrivial task. In the example of Figure 2.16, for instance, it is apparent that the structure of the target utterance is progressively obscured as more speakers are added to the mixture. The presence of competing sounds greatly complicates the computational extraction of just about any sound source property, from pitch (de Cheveigne, 2006) to location (Mandel, Weiss, & Ellis, 2010). Human listeners, however, parse auditory scenes with a remarkable degree of success. In the example of Figure 2.16, the target remains largely audible to most listeners even in the mixture of eight speakers. This is the classic "cocktail party problem" (Bee & Micheyl, 2008; Bregman, 1990; Bronkhorst, 2000; Cherry, 1953; Carlyon, 2004; Darwin, 1997; McDermott, 2009).

Historically, the "cocktail party problem" has referred to two conceptually distinct problems that in practice are closely related. The first, known as sound segregation, is the problem of deriving representations of individual sound sources from a mixture of sounds. The second, selective attention, entails the task of directing attention to one source among many, as when listening to a particular speaker at a party. These two problems are related because the ability to segregate sounds is probably dependent on attention (Carlyon, Cusack, Foxton, & Robertson, 2001; Shinn-Cunningham, 2008; Woods & McDermott, 2015), though the extent and nature of this dependence remains an active area of study (Macken, Tremblay, Houghton, Nicholls, & Jones, 2003; Masutomi, Barascud, Kashino, McDermott, & Chait, 2016). Here we will focus on the first problem, of sound segregation, which is typically studied under conditions where listeners pay full attention to a target sound. Al Bregman, a Canadian psychologist, is typically credited with drawing interest to this problem and pioneering its study (Bregman, 1990).

Figure 2.16 The cocktail party problem. Cochleagrams of a single "target" utterance (top row), and the same utterance mixed with one, three, and seven additional speech signals from different speakers. The mixtures approximate the signal that would enter the ear if the additional speakers were talking as loudly as the target speaker, but were standing twice as far away from the listener (to simulate cocktail party conditions). The grayscale denotes attenuation from the maximum energy level across all of the signals (in dB), such that gray levels can be compared across cochleagrams. Cochleagrams in the right column are identical to those on the left except for the superimposed color masks. Pixels labeled green are those where the original target speech signal is more than –50 dB but the mixture level is at least 5 dB higher, thus masking the target speech. Pixels labeled red are those where the target had less and the mixture had more than –50 dB energy. Cochleagrams were computed from a filter bank with bandwidths and frequency spacing similar to those in the ear. Each pixel is the rms amplitude of the signal within a frequency band and time window.
Source: From McDermott (2009). Reprinted with permission of Elsevier.
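A cochleagram of the kind shown in Figure 2.16 can be approximated with a bank of bandpass filters whose center frequencies and bandwidths mimic cochlear tuning, followed by a measure of the energy in each band over short time windows. The sketch below is a rough approximation using ERB-spaced Butterworth filters and windowed rms values; it is not the filter bank used to generate the figure, and the channel count and window size are illustrative choices.

```python
# Rough cochleagram: ERB-spaced bandpass filters + windowed rms in each band.
# This approximates cochlear frequency analysis; it is not the exact filter bank
# used for Figure 2.16.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def erb_space(f_lo, f_hi, n):
    # Center frequencies equally spaced on the ERB-number scale (Glasberg & Moore, 1990).
    erb_num = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv_erb(np.linspace(erb_num(f_lo), erb_num(f_hi), n))

def cochleagram(signal, fs, n_channels=30, win_s=0.01):
    centers = erb_space(80.0, 6000.0, n_channels)
    bandwidths = 24.7 * (4.37 * centers / 1000.0 + 1.0)   # ERB in Hz
    win = int(win_s * fs)
    n_frames = signal.size // win
    out = np.zeros((n_channels, n_frames))
    for i, (fc, bw) in enumerate(zip(centers, bandwidths)):
        sos = butter(2, [fc - bw / 2, fc + bw / 2], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)
        frames = band[: n_frames * win].reshape(n_frames, win)
        out[i] = np.sqrt(np.mean(frames ** 2, axis=1))     # rms per time window
    return 20 * np.log10(out + 1e-10)                      # in dB

fs = 16000
noise = np.random.randn(fs)            # 1 s of noise as a stand-in for a recording
print(cochleagram(noise, fs).shape)    # (channels, time frames)
```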

Sound Segregation and Acoustic Grouping Cues

Sound segregation is a classic example of an ill-posed problem in perception. Many different sets of sounds are physically consistent with the mixture that enters the ear (in that their sum is equal to the mixture). The auditory system must infer the set of sounds that actually occurred. As in other ill-posed problems, this inference is only possible with the aid of assumptions that constrain the solution. In this case the assumptions concern the nature of sounds in the world, and are presumably learned from experience with natural sounds (or perhaps hardwired into the auditory system via evolution).

Grouping cues (i.e., sound properties that dictate whether sound elements are heard as part of the same sound) are examples of these assumptions. For instance, natural sounds that have pitch, such as vocalizations, contain frequencies that are harmonically related, evident as banded structures in the cochleagram of the target speaker in Figure 2.16. Harmonically related frequencies are unlikely to occur from the chance alignment of multiple different sounds, and thus when they are present in a mixture, they are likely to be due to the same sound, and are generally heard as such (de Cheveigne, McAdams, Laroche, & Rosenberg, 1995). Moreover, a component that is mistuned (in a tone containing otherwise harmonic frequencies) segregates from the rest of the tone (Hartmann, McAdams, & Smith, 1990; Moore, Glasberg, & Peters, 1986; Roberts & Brunstrom, 1998). Understanding sound segregation requires understanding the acoustic regularities, such as harmonicity, that characterize natural sound sources, and that are used by the auditory system. It is my view that these regularities should be revealed by analysis of natural sounds, but for now research in this area is mostly being driven by intuitions about sound properties that might be important for segregation.

Perhaps the most important generic acoustic grouping cue is common onset: Frequency components that begin and end at the same time are likely to belong to the same sound. Onset differences, when manipulated experimentally, cause frequency components to perceptually segregate from each other (Cutting, 1975; Darwin, 1981). Interestingly, a component that has an earlier or later onset than the rest of a set of harmonics has reduced influence over the perceived pitch of the entire tone (Darwin & Ciocca, 1992), suggesting that pitch computations operate on frequency components that are deemed likely to belong together, rather than on the raw acoustic input.

Onset may be viewed as a special case of co-modulation—amplitude modulation that is common to different spectral regions. In some cases relatively slow co-modulation promotes grouping of different spectral components (Hall, Haggard, & Fernandes, 1984), though abrupt onsets seem to be most effective. Common offset also promotes grouping, though it is less effective than common onset (Darwin, 1984), perhaps because abrupt offsets are less common in natural sounds (Cusack & Carlyon, 2004).

Not every intuitively plausible grouping cue produces a robust effect when assessed psychophysically. For instance, frequency modulation (FM) that is shared ("coherent") across multiple frequency components, as in voiced speech, has been proposed to promote their grouping (Bregman, 1990; McAdams, 1989). However, listeners are poor at discriminating coherent from incoherent FM if the component tones are not harmonically related, indicating that sensitivity to FM coherence may simply be mediated by the deviations from harmonicity that occur when harmonic tones are incoherently modulated (Carlyon, 1991).

Failures of segregation are often referred to as "informational masking," so-called because they often manifest as masking-like effects on the detectability of a target tone, but cannot be explained in terms of classical "energetic masking" (in which the response to the target is swamped by a masker that falls within the same peripheral channel). Demonstrations of informational masking typically present a target tone along with other tones that lie outside a "protected region" of the spectrum, such that they do not stimulate the same filters as the target tone. These "masking" tones nonetheless often elevate the detection threshold for the target, sometimes quite dramatically (Durlach et al., 2003; Lutfi, 1992; Neff, 1995; Watson, 1987). The effect is presumably due to impairments in the ability to segregate the target tone from the masker tones.

One might also think that the task of segregating sounds would be greatly aided by the tendency of distinct sound sources in the world to originate from distinct locations. In practice, spatial cues are indeed of some benefit, for instance in hearing a target sentence from one direction amid distracting utterances from other directions (Bronkhorst, 2000; Hawley, Litovsky, & Culling, 2004; Ihlefeld & Shinn-Cunningham, 2008; Kidd, Arbogast, Mason, & Gallun, 2005). However, spatial cues are surprisingly ineffective at segregating one frequency component from a group of others (Culling & Summerfield, 1995), especially when pitted against other grouping cues such as onset or harmonicity (Darwin & Hukin, 1997). The benefit of listening to a target with a distinct location (Bronkhorst, 2000; Hawley et al., 2004; Ihlefeld & Shinn-Cunningham, 2008; Kidd et al., 2005) may thus be due to the ease with which the target can be attentively tracked over time amid competing sound sources, rather than to a facilitation of auditory grouping per se (Darwin & Hukin, 1999). Moreover, humans are usually able to segregate monaural mixtures of sounds without difficulty, demonstrating that spatial separation is often not necessary for high performance. For instance, much popular music of the 20th century was released in mono, and yet listeners have no trouble distinguishing many different instruments and voices in any given recording. Spatial cues thus contribute to sound segregation, but their presence or absence does not seem to fundamentally alter the problem.

The weak effect of spatial cues on segregation may reflect their fallibility in complex auditory scenes. Binaural cues can be contaminated when sounds are combined or degraded by reverberation (Brown & Palomaki, 2006), and can even be deceptive, as when caused by echoes (whose direction is generally different from the original sound source). It is possible that the efficacy of different grouping cues in general reflects their reliability in natural conditions. Evaluating this hypothesis will require statistical analysis of natural auditory scenes, an important direction for future research.

Sequential Grouping

Because the cochleagram approximates the input that the cochlea provides to the rest of the auditory system, it is common to view the problem of sound segregation as one of deciding how to group the various parts of the cochleagram (Bregman, 1990). However, the brain does not receive an entire spectrogram at once—sound arrives gradually over time. Many researchers thus distinguish between the problem of simultaneous grouping (determining how the spectral content of a short segment of the auditory input should be segregated) and sequential grouping (determining how the groups from each segment should be linked over time, for instance to form a speech utterance, or a melody) (Bregman, 1990).

Although most of the classic grouping cues (onset/comodulation, harmonicity, ITD, etc.) are quantities that could be measured over short timescales, the boundary between what is simultaneous or sequential is unclear for most real-world signals, and it may be more appropriate to view grouping as being influenced by processes operating at multiple timescales rather than two cleanly divided stages of processing. There are, however, contexts in which the bifurcation into simultaneous and sequential grouping stages is natural, as when the auditory input consists of discrete sound elements that do not overlap in time. In such situations interesting differences are sometimes evident between the grouping of simultaneous and sequential elements. For instance, spatial cues, which are relatively weak as a simultaneous cue, have a stronger influence on sequential grouping of tones (Darwin & Hukin, 1997).

Another clear case of sequential processing can be found in the effects of sound repetition. Sounds that occur repeatedly in the acoustic input are detected by the auditory system as repeating, and are inferred to be a single source. Perhaps surprisingly, this is true even when the repeating source is embedded in mixtures with other sounds, and is never presented in isolation (McDermott et al., 2011). In such cases the acoustic input itself does not repeat, but the source repetition induces correlations in the input that the auditory system detects and uses to extract the repeating sound. The informativeness of repetition presumably results from the fact that mixtures of multiple sounds tend not to occur repeatedly, such that when a structure does repeat, it is likely to be a single source.

Effects of repetition are also evident in classic results on "informational masking," in that the detectability of a target tone amid masking tones can be increased when the target is repeatedly presented (Kidd, Mason, Deliwala, & Woods, 1994; Kidd, Mason, & Richards, 2003). Similarly, when repeating patterns of tones appear in a random background of tones, they reliably "pop out" and are detected (Teki, Chait, Kumar, Shamma, & Griffiths, 2013). Moreover, segregation via repetition seems to be robust to inattention—listeners are able to make judgments about target sources embedded in mixtures even when the repetitions that enable target detection and discrimination occur while the listeners perform a difficult second concurrent task (Masutomi et al., 2016). Although repeating structure is rarely present in speech, it is common in music, and in many animal vocalizations, which often consist of a short call repeated several times in quick succession, perhaps facilitating their segregation from background sounds.
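The statistical logic behind these repetition effects can be illustrated with a toy simulation: if a target spectro-temporal pattern recurs across mixtures while the other sounds in each mixture vary, structure that is consistent across the mixtures increasingly resembles the target alone. The sketch below is only a caricature of this idea (averaging feature vectors across mixtures), not the psychophysical procedure or the model of McDermott et al. (2011).

```python
# Toy illustration of why repetition aids segregation: a fixed "target" pattern
# occurs in several mixtures, each time summed with a different random "distractor".
# Structure that is consistent across mixtures converges toward the target.
import numpy as np

rng = np.random.default_rng(0)
n_features = 200                       # stand-in for spectrotemporal feature bins
target = rng.random(n_features)        # the repeating source (fixed across mixtures)

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

for n_mixtures in [1, 2, 5, 20]:
    mixtures = [target + rng.random(n_features) for _ in range(n_mixtures)]
    average = np.mean(mixtures, axis=0)
    print(f"{n_mixtures:2d} mixtures: correlation of average with target = "
          f"{correlation(average, target):.2f}")

# With more occurrences of the target, the distractors average out and the
# consistent structure (the repeating source) dominates.
```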

Streaming

One type of sequential segregation effect has particularly captured the imagination of the hearing community, and merits special mention. When two pure tones of different frequency are repeatedly presented in alternation (Figure 2.17A), one of two perceptual states is commonly reported by listeners: one in which the two repeated tones are heard as a single “stream” whose pitch varies over time, and one in which two streams are heard, one with the high tones and one with the low tones (Bregman & Campbell, 1971). If the frequency separation between the two tones is small, and if the rate of alternation is slow, one stream is generally heard.


[Figure 2.17 graphic omitted. Axes include Frequency (kHz), Time (s), Power (dB), and semitones from 200 Hz (F0), 500 Hz (F1), and 1500 Hz (F2); panel E legend distinguishes Speaker 1, Speaker 2, and (μ +/– 2σ) of the means of distributions for all female speakers in TIMIT.]

Figure 2.17 Streaming. (A) Spectrogram of “A-B-A” alternating tone stimulus commonly used to investigate streaming. Over time the two repeating tones segregate into separate streams. (B) Spectrogram of two concurrent utterances by two female speakers. (C) Power spectrum of a short excerpt from one of the utterances in (B). The spectrum exhibits peaks corresponding to harmonics of the fundamental frequency (i.e., F0) that determines the pitch, as well as peaks in the spectral envelope, known as formants, that determine the vowel being produced. (D) Pitch and formant trajectories for the two utterances in (B). The yellow line plots the trajectory for the utterance in (C). Open and closed circles denote the beginning and end of the trajectories, respectively. (E) Marginal distributions of F0, F1, and F2 measured across 10 sentences for the two speakers in (B). Red bars show the range of mean values of F0, F1, and F2 for the 53 female speakers in the TIMIT speech database. Note that differences between the average features of speakers are small relative to the variability produced by a single speaker, such that most pairs of speakers of the same gender will have overlapping trajectories.
Source: Created by Kevin Woods; panels D–E from Woods and McDermott (2015). Reprinted with permission of Elsevier.


When the frequency separation is larger or the rate is faster, two streams tend to be heard, in which case “streaming” is said to occur (van Noorden, 1975).
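For readers who wish to generate the stimulus, the following Python sketch synthesizes a basic A-B-A “galloping” sequence. The sample rate, tone duration, frequencies, and ramp length are arbitrary illustrative choices, not the parameters of any particular study.

import numpy as np

fs = 16000                         # sample rate (Hz)
tone_dur = 0.1                     # duration of each tone (s)
f_a, f_b = 500.0, 700.0            # A and B tone frequencies (about 6 semitones apart)
n_triplets = 10

t = np.arange(int(fs * tone_dur)) / fs
ramp = np.minimum(1.0, np.minimum(t, t[::-1]) / 0.01)    # 10-ms onset/offset ramps
tone_a = np.sin(2 * np.pi * f_a * t) * ramp
tone_b = np.sin(2 * np.pi * f_b * t) * ramp
gap = np.zeros_like(tone_a)

triplet = np.concatenate([tone_a, tone_b, tone_a, gap])  # A-B-A followed by a silence
sequence = np.tile(triplet, n_triplets)
# Increasing the B-A frequency separation, or shortening tone_dur (a faster
# rate), makes the percept of two separate streams more likely.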

An interesting hallmark of this phenomenon is that when two streams are perceived, judgments of the temporal order of elements in different streams are impaired (Bregman & Campbell, 1971; Micheyl & Oxenham, 2010a). This latter finding provides compelling evidence for a substantive change in the representation underlying the two percepts. Subsequent research has demonstrated that separation along most dimensions of sound can elicit streaming (Moore & Gockel, 2002). The streaming effects in these simple stimuli may be viewed as a variant of grouping by similarity—elements are grouped together when they are similar along some dimension, and segregated when they are sufficiently different, presumably because this similarity reflects the likelihood of having been produced by the same source.

Although two-tone streaming continues to be widely studied, its relevance to real-world streaming is unclear. The canonical (and arguably most important) real-world streaming problem is that of following one voice amid others, and mixtures of speech utterances are different in almost every respect from the A-B-A streaming stimulus. As shown in Figure 2.17B, mixtures of speakers physically overlap in both time and frequency. Moreover, even when represented in terms of perceptually relevant features such as pitch and the first two formants (which help define vowels; Figure 2.17C), two speakers of the same gender follow highly intertwined trajectories (Figure 2.17D). It is thus not obvious that insights from A-B-A streaming will translate straightforwardly to the streaming of speech and other natural sounds, though there are occasions in classical music in which alternating pitches stream (Bregman, 1990). One alternative approach is to synthesize stimuli that bear more similarity to the mixtures of sources we encounter in the world. For instance, when presented with mixtures of synthetic stimuli that evolve over time like spoken utterances, human listeners can track one of two “voices” with the aid of selective attention (Woods & McDermott, 2015).

Sound Texture

Although most work on scene analysis has focused on the perception of individual sound sources occurring concurrently with a few others, many natural scenes feature large numbers of similar sound elements, as produced by rain, fire, or groups of insects or animals (Figure 2.18A). The superposition of such similar acoustic events collectively gives rise to aggregate statistical properties, and the resulting sounds are referred to as “textures” (Saint-Arnaud & Popat, 1995; McDermott & Simoncelli, 2011). Sound textures are ubiquitous in the world, and commonly form the background for “foreground” sounds we want to recognize, such as someone talking. Textures also convey information themselves about the surrounding environment. Textures have only recently begun to be studied in audition, but are an appealing starting point for understanding auditory representation—they have rich, behaviorally relevant structure, and yet their properties do not change over time, simplifying the representational requirements.

Human listeners can reliably identify many different textures (McDermott & Simoncelli, 2011), raising the question of how they do so. Motivated in part by the observation that textures tend to be temporally homogeneous, we have proposed that they are represented with statistics: time averages of acoustic measurements made in the early auditory system.


[Figure 2.18 graphic omitted. Panel A shows cochleagrams (Cochlear Channel (Hz) versus Time (s)) labeled Stream, Fire, Insects, Applause, Geese, and Waves; panel B plots Proportion Correct against Excerpt Duration (ms) for texture and exemplar discrimination, n = 12.]

Figure 2.18 Sound texture. (A) Cochleagrams of six sample sound textures. Note the high degree of temporal homogeneity. (B) Texture and exemplar discrimination. Listeners were presented with three sound signals. In the texture discrimination task, two of the excerpts were from one texture, and one was from a different texture, and listeners identified the excerpt that was produced by a distinct source. In the exemplar discrimination task, two of the excerpts were identical, and the third was a distinct excerpt from the same texture. Listeners identified the excerpt that was different from the other two.
Source: (A) From McDermott and Simoncelli (2011). (B) From McDermott, Schemitsch, and Simoncelli (2013). Reproduced with permission of Macmillan.

One piece of evidence for this idea comes from sound texture synthesis, in which signals are synthesized to have statistics matched to those of particular real-world sounds (McDermott, Oxenham, & Simoncelli, 2009; McDermott & Simoncelli, 2011). The logic of this procedure is that if such statistics underlie our perception, then synthetic signals that share the statistical properties of real-world textures should sound like them (Heeger & Bergen, 1995; Portilla & Simoncelli, 2000). We found that realistic-sounding examples of many textures (water, fire, insects, etc.) can be synthesized from relatively simple statistics (moments and correlations) computed from the auditory model of Figure 2.7, suggesting that such statistics could underlie our perception.
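A schematic version of this kind of statistical measurement is sketched below in Python. The "subband envelopes" are random stand-ins and the statistic set is greatly reduced; the actual model operates on cochlear and modulation filter outputs and includes additional statistics.

import numpy as np

rng = np.random.default_rng(1)
n_bands, n_samples = 8, 20000
envelopes = np.abs(rng.normal(size=(n_bands, n_samples)))   # stand-in subband envelopes

def texture_statistics(env):
    # Time-averaged summary statistics of a set of subband envelopes.
    mean = env.mean(axis=1)
    var = env.var(axis=1)
    skew = ((env - mean[:, None]) ** 3).mean(axis=1) / var ** 1.5   # envelope skewness
    corr = np.corrcoef(env)                                         # cross-band correlations
    return mean, var, skew, corr[np.triu_indices_from(corr, k=1)]

stats = texture_statistics(envelopes)
# Synthesis proceeds by iteratively adjusting a noise signal until its
# measured statistics match those of a recorded real-world texture.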

Further evidence for statistical representations of texture comes from psychophysical experiments (Figure 2.18B). When human listeners are asked to discriminate excerpts from different textures, their performance improves with the excerpt duration, as expected (longer excerpts provide more information on which to base discrimination). However, when asked to discriminate different excerpts of the same texture, performance declines with duration, even though longer excerpts again provide more information with which to tell the excerpts apart (McDermott et al., 2013). This result is what one would expect if the texture excerpts were represented with “summary” statistics that are averaged over time: As duration increases, the summary statistics of different excerpts of the same texture converge to the same value, rendering the excerpts difficult to tell apart if only their statistics are retained. The results suggest that the acoustic details that compose a texture are accumulated into statistical summaries but then become inaccessible. These statistical summaries permit distinct textures to be differentiated, but limit the ability to distinguish temporal detail.
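The logic of this convergence can be illustrated with a toy simulation (the numbers below are arbitrary; the point is only that a time-averaged statistic becomes less variable as the averaging window grows).

import numpy as np

rng = np.random.default_rng(2)

def excerpt_statistic(true_mean, n_samples):
    # Time average of a noisy "envelope" drawn from a texture with a fixed mean.
    return rng.normal(true_mean, 1.0, size=n_samples).mean()

for n in (100, 1000, 10000):                                          # excerpt length in samples
    same = abs(excerpt_statistic(0.0, n) - excerpt_statistic(0.0, n))  # same texture
    diff = abs(excerpt_statistic(0.0, n) - excerpt_statistic(0.5, n))  # different textures
    print(n, round(same, 3), round(diff, 3))
# The same-texture difference shrinks toward zero as excerpt duration grows,
# while the different-texture difference converges to the true gap in means.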

Because texture is a relatively new topic of study, many interesting questions remain open. Texture statistics are time averages, and the mechanisms for computing the averages remain uncharacterized, as does their neural locus. It also remains unclear how textures and the other sound sources that are often superimposed on them are segregated, and whether the averaging process for computing texture statistics is selective, averaging only those acoustic details that are likely to belong to the texture rather than to other sources.

Filling In

Although it is common to view sound segregation as the problem of grouping the spectrogram-like output of the cochlea across frequency and time, this cannot be the whole story. That is because large swathes of a sound’s time-frequency representation are often physically obscured (masked) by other sources, and are thus not physically available to be grouped. Masking is evident in the green pixels of Figure 2.16, which represent points where the target source has substantial energy, but where the mixture exceeds it in level. If these points are simply assigned to the target, or omitted from its representation, its level at those points will be misconstrued, and the sound potentially misidentified. To recover an accurate estimate of the target source, it is necessary to infer not just the grouping of the energy in the cochleagram, but also the structure of the target source in the places where it is masked.
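The situation can be made concrete with a small sketch that flags the masked cells of a spectrogram-like representation. The two arrays below are random stand-ins for the target and background energy, and the "substantial energy" threshold is an arbitrary choice, not the criterion used to construct Figure 2.16.

import numpy as np

rng = np.random.default_rng(3)
target = rng.exponential(size=(64, 100))       # target energy, frequency x time
background = rng.exponential(size=(64, 100))   # energy of the competing sources

has_energy = target > 0.5 * target.mean()      # cells where the target is appreciable
masked = has_energy & (background > target)    # ...but the mixture is dominated by others
print(masked.sum() / has_energy.sum())         # fraction of target cells that are masked
# Energy at the masked cells cannot simply be read off the mixture;
# recovering the target requires inferring its structure there.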

There is considerable evidence that the auditory system in many cases infers “missing” portions of sound sources when evidence suggests that they are likely to have been masked. For instance, tones that are interrupted by noise bursts are “filled in” by the auditory system, such that they are heard as continuous in conditions in which physical continuity is likely given the stimulus (Warren, Obusek, & Ackroff, 1972). Known as the continuity effect, it occurs only when the interrupting noise bursts are sufficiently intense in the appropriate part of the spectrum to have masked the tone should it have been present continuously. Continuity is also heard for frequency glides (Ciocca & Bregman, 1987; Kluender & Jenison, 1992), as well as oscillating frequency- or amplitude-modulated tones (Carlyon, Micheyl, Deeks, & Moore, 2004; Lyzenga, Carlyon, & Moore, 2005). The perception of continuity across intermittent maskers was actually first reported for speech signals interrupted by noise bursts (Warren, 1970). For speech the effect is often termed “phonemic restoration,” and likely indicates that knowledge of speech acoustics (and perhaps of other types of sounds as well) influences the inference of the masked portion of sounds. Similar effects occur for spectral gaps in sounds—they are perceptually filled in when evidence indicates they are likely to have been masked (McDermott & Oxenham, 2008b). Filling in effects in hearing are conceptually similar to completion under and over occluding surfaces in vision, though the ecological constraints provided by masking (involving the relative intensity of two sounds) are distinct from those provided by occlusion (involving the relative depth of two surfaces). Neurophysiological evidence indicates that the representation of tones in primary auditory cortex reflects the perceived continuity, responding as though the tone were continuously present despite being interrupted by noise (Petkov, O’Connor, & Sutter, 2007; Riecke, van Opstal, Goebel, & Formisano, 2007).

Separating Sound Sources From the Environment

Thus far we have mainly discussed how the auditory system segregates the signals from multiple sound sources, but listeners face a second important scene-analysis problem. The sound that reaches the ear from a source is almost always altered to some extent by the surrounding environment, and these environmental influences complicate the task of recognizing the source. Typically the sound produced by a source reflects off multiple surfaces prior to reaching the ears, such that the ears receive some sound directly from the source, but also many reflected copies (Figure 2.19). These reflected copies (echoes) are delayed, as their path to the ear is lengthened, but generally also have altered frequency spectra, as reflective surfaces absorb some frequencies more than others. Because each reflection can be well described with a linear filter applied to the source signal, the signal reaching the ear, which is the sum of the direct sound along with all the reflections, can be described simply as the result of applying a single composite linear filter to the source (Gardner, 1998). Significant filtering of this sort occurs in almost every natural listening situation, such that sound produced in anechoic conditions (in which all surfaces are minimally reflective) sounds noticeably strange and unnatural.
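In signal-processing terms this is simply a convolution, as in the following sketch (the impulse response here is a crude stand-in of exponentially decaying noise, not a measured room response; the durations and decay constant are arbitrary assumptions of the example).

import numpy as np

fs = 16000
rng = np.random.default_rng(4)

source = rng.normal(size=fs)                        # one second of "source" signal
t = np.arange(int(0.4 * fs)) / fs                   # 400-ms impulse response
impulse_response = rng.normal(size=t.size) * np.exp(-t / 0.1)   # decaying echoes
impulse_response[0] = 1.0                           # the direct sound

signal_at_ear = np.convolve(source, impulse_response)   # reverberant signal at the ear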

Listeners are often interested in the properties of sound sources, and one might think of the environmental effects as a nuisance that should simply be discounted. However, environmental filtering imbues the acoustic input with useful information; for instance, about the size of a room where sound is produced and the distance of the source from the listener (Bronkhorst & Houtgast, 1999). It may thus be more appropriate to think of separating source and environment, at least to some extent, rather than simply recovering the source (Traer & McDermott, 2016). Reverberation is commonly used in music production, for instance, to create a sense of space, or to give a different feel to particular instruments or voices.

The loudness constancy phenomena discussed earlier (Zahorik & Wightman, 2001) are one example where the brain appears to infer the properties of the sound source separately from those of the environment, but there are several others. One of the most interesting involves the treatment of echoes in sound localization. The echoes that are common in most natural environments pose a problem for localization, as they generally come from directions other than that of the source (Figure 2.19B). The auditory system appears to solve this problem by perceptually fusing similar impulsive sounds that occur within a short duration of each other (on the order of 10 ms or so), and using the sound that occurs first to determine the perceived location. This “precedence effect,” so-called because of the dominance of the sound that occurs first, was described and named by Hans Wallach (Wallach, Newman, & Rosenzweig, 1949), one of the great Gestalt psychologists, and has since been the subject of an interesting literature.


[Figure 2.19 graphic omitted. Panel A plots Pressure (Arbitrary Units) against Time (ms), with arrows marking the direct sound and an early reflection.]

Figure 2.19 Reverberation. (A) Impulse response for a classroom. This is the sound waveform recorded in this room in response to a click (impulse) produced at a particular location in the room. The top arrow indicates the impulse that reaches the microphone directly from the source (that thus arrives first). The lower arrow indicates one of the subsequent reflections (i.e., echoes). After the early reflections, a gradually decaying reverberation tail is evident (cut off at 250 ms for clarity). The sound signal resulting from a different kind of source could be produced by convolving the sound from the source with this impulse response. (B) Schematic diagram of the sound reflections that contribute to the signal that reaches a listener’s ears in a typical room. The brown box in the upper right corner depicts the speaker producing sound. The green lines depict the path taken by the direct sound to the listener’s ears. Blue and green lines depict sound reaching the ears after one and two reflections, respectively. Sound reaching the ear after more than two reflections is not shown.
Source: Part B from Culling and Akeroyd (2010). Reproduced with permission of Oxford University Press.

For instance, the maximum delay at which echoes are perceptually suppressed increases as two pairs of sounds are repeatedly presented (Freyman, Clifton, & Litovsky, 1991), presumably because the repetition provides evidence that the second sound is indeed an echo of the first, rather than being due to a distinct source (in which case it would not occur at a consistent delay following the first sound). Moreover, reversing the order of presentation can cause an abrupt breakdown of the effect, such that two sounds are heard rather than one, each with a different location. See Brown, Stecker, and Tollin (2015) and Litovsky, Colburn, Yost, and Guzman (1999) for reviews of the precedence-effect literature.

Reverberation also poses a challenge for sound recognition, because different environments alter the sound from a source in different ways. Large amounts of reverberation (with prominent echoes at very long delays), as are present in some large auditoriums, can in fact greatly reduce the intelligibility of speech. Moderate amounts of reverberation, however, as are present in most spaces, typically have minimal effect on our ability to recognize speech and other sounds. Our robustness to everyday reverberation appears to be due in part to implicit knowledge of the regularities of real-world environmental acoustics. A recent large-scale study of impulse responses of everyday spaces found that they typically exhibit stereotyped properties (Traer & McDermott, 2016), presumably due to regularities in the way that common materials and environmental geometries reflect sound. Reverberant sound almost always decays exponentially, for instance, and at rates that depend on frequency in a consistent way (mid frequencies decay slowest and high frequencies the fastest). When these regularities are violated in synthetic reverberation, the resulting sound does not resemble reverberation to human listeners, and the properties of the underlying sound source are more difficult to separate from the effects of the environment (Traer & McDermott, 2016).
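These regularities are simple enough to build into synthetic impulse responses. The sketch below generates band-limited noise whose exponential decay rate differs across frequency bands; the band edges and decay times are illustrative guesses, not the measured values reported by Traer and McDermott (2016).

import numpy as np

fs, dur = 16000, 1.0
n = int(fs * dur)
t = np.arange(n) / fs
rng = np.random.default_rng(5)

bands = [(50, 500), (500, 2000), (2000, 8000)]   # band edges (Hz)
decay_times = [0.5, 0.7, 0.3]                    # time to decay by 1/e in each band (s)

freqs = np.fft.rfftfreq(n, 1.0 / fs)
impulse_response = np.zeros(n)
for (lo, hi), tau in zip(bands, decay_times):
    spectrum = np.fft.rfft(rng.normal(size=n))
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0          # band-limit the noise
    band_noise = np.fft.irfft(spectrum, n)
    impulse_response += band_noise * np.exp(-t / tau)     # band-specific exponential decay

impulse_response /= np.abs(impulse_response).max()
# Replacing the exponential envelopes with, say, linear ramps violates the
# regularity described above and yields sound that listeners do not hear
# as ordinary reverberation.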

Part of our robustness to reverberation also likely derives from a process that adapts to the history of echo stimulation. In reverberant conditions, the intelligibility of a speech utterance has been found to be higher when preceded by another utterance than when not, an effect that does not occur in anechoic conditions (Brandewie & Zahorik, 2010; Watkins, 2005). Such results, like those of the precedence effect, are consistent with the idea that listeners construct a model of the environment’s contribution to the acoustic input and use it to partially discount it when judging properties of a source. Analogous effects have been found with nonspeech sounds. When listeners hear instrument sounds preceded by speech or music that has been passed through a filter that “colors” the spectrum, the instrument sound is identified differently, as though listeners internalize the filter, assume it to be an environmental effect, and discount it to some extent when identifying the sound (Stilp, Alexander, Kiefte, & Kluender, 2010).

THE FUTURE OF HEARING RESEARCH

Hearing science is one of the oldest areas of psychology and neuroscience, with a strong research tradition dating back over 100 years, yet there remain many important open questions. Historically, hearing science tended to focus on the periphery—sound transduction and “early” sensory processing. This focus can be explained in part by the challenge of understanding the cochlea, the considerable complexity of the early auditory system, and the clinical importance of peripheral audition. However, the focus on the periphery has left many central aspects of audition underexplored, and recent trends in hearing research reflect a shift toward the study of these neglected mid- and high-level questions.

One important set of questions concerns the interface of audition with the rest of cognition, via attention and memory. Attention research ironically flourished in hearing early on (with the classic dichotic listening studies of Cherry [1953]), but then largely moved to the visual domain. Recent years have seen renewed interest, but much is still unclear about the role of attention in perceptual organization, about the representation of sound outside of focused attention, and about the mechanisms of attentional selection in the auditory system.

Another promising research area involves working memory. Auditory short-term memory may have some striking differences with its visual counterpart (Demany, Trost, Serman, & Semal, 2008), and appears closely linked to attention and perhaps auditory scene analysis (Conway, Cowan, & Bunting, 2001). Studies of these topics in audition also hold promise for informing us more generally about the structure of cognition, as the similarities and differences with respect to visual cognition will reveal much about whether attention and memory mechanisms are domain general (perhaps exploiting central resources) or specific to particular sensory systems.

Interactions between audition and the other senses are also attracting increased interest. Information from other sensory systems likely plays a crucial role in hearing, given that sound on its own often provides ambiguous information. The sounds produced by rain and applause, for instance, can in some cases be quite similar, such that multisensory integration (using visual, somatosensory, or olfactory input) may help to correctly recognize the sound source. Cross-modal interactions in localization (Alais & Burr, 2004) are similarly powerful. Understanding cross-modal effects within the auditory system (Bizley, Nodal, Bajo, Nelken, & King, 2007; Ghazanfar, 2009; Kayser, Petkov, & Logothetis, 2008) and their role in behavior will be a significant direction of research going forward.

In addition to the uncharted territory in perception and cognition, there remain important open questions about peripheral processing. Some of these unresolved issues, such as the mechanisms of outer hair cell function, have great importance for understanding hearing impairment. Others may dovetail with higher-level function. For instance, the role of efferent connections to the cochlea is still uncertain, with some hypothesizing a role in attention or segregation (Guinan, 2006). The role of phase locking in frequency encoding and pitch perception is another basic issue that remains controversial and debated, and that has widespread relevance to mid-level audition.

As audition continues to evolve as a field, I believe useful guidance will come from a computational analysis of the inference problems the auditory system must solve (Marr, 1982). This necessitates thinking about the behavioral demands of real-world listening situations, as well as the constraints imposed by the way that information about the world is encoded in a sound signal. Many of these issues are becoming newly accessible with recent advances in computational power and signal-processing techniques.

For instance, one of the most important tasks a listener must perform with sound is surely that of recognition—determining what it was in the world that caused a sound, be it a particular type of object or a type of event, such as something falling on the floor (Gaver, 1993; Lutfi, 2008). Recognition is computationally challenging because the same type of occurrence in the world typically produces a different sound waveform each time it occurs. A recognition system must generalize across the variation that occurs within categories, but not the variation that occurs across categories (DiCarlo & Cox, 2007). Realizing this computational problem allows us to ask how the auditory system solves it. One place where these issues have been explored to some extent is speech perception (Holt & Lotto, 2010). The ideas explored there—about how listeners achieve invariance across different speakers and infer the state of the vocal apparatus along with the accompanying intentions of the speaker—could perhaps be extended to audition more generally (Rosenblum, 2004).


The inference problems of audition can also be better appreciated by examining real-world sound signals, and formal analysis of these signals seems likely to yield valuable clues. As discussed in previous sections, statistical analysis of natural sounds has been a staple of recent computational auditory neuroscience (Carlson, Ming, & DeWeese, 2012; Harper & McAlpine, 2004; Mlynarski, 2015; Rodriguez et al., 2010; Smith & Lewicki, 2006), where natural sound statistics have been used to explain the mechanisms observed in the peripheral auditory system. However, sound analysis seems likely to provide insight into mid- and high-level auditory problems as well. For instance, the acoustic grouping cues used in sound segregation are almost surely rooted to some extent in natural sound statistics, and examining such statistics could reveal unexpected cues. Similarly, because sound recognition must generalize across the variability that occurs within sounds produced by a particular type of source, examining this variability in natural sounds may provide clues to how the auditory system achieves the appropriate invariance in this domain.

The study of real-world auditory competence will also necessitate measuring auditory abilities and physiological responses with more realistic sound signals. The tones and noises that have been the staple of classical psychoacoustics and auditory physiology have many uses, but also have little in common with many everyday sounds. One challenge of working with realistic signals is that actual recordings of real-world sounds are often uncontrolled, and may introduce confounds associated with their familiarity. Methods of synthesizing novel sounds with naturalistic properties (Cavaco & Lewicki, 2007; McDermott & Simoncelli, 2011) are thus likely to be useful experimental tools. Considering more realistic sound signals will in turn necessitate more sophisticated models, particularly of cortical neural responses. The modulation filterbank models of Figures 2.7C and 2.8B have served hearing researchers well, but are clearly inadequate as models of complex auditory behavior and of cortical neural responses to natural sounds (Norman-Haignere et al., 2015).

We must also consider more realistic auditory behaviors. Hearing does not normally occur while we are seated in a quiet room, listening over headphones, and paying full attention to the acoustic stimulus, but rather in the context of everyday activities in which sound is a means to some other goal. The need to respect this complexity while maintaining sufficient control over experimental conditions presents a challenge, but not one that is insurmountable. For instance, neurophysiology experiments involving naturalistic behavior are becoming more common, with preparations being developed that will permit recordings from freely moving animals engaged in vocalization (Eliades & Wang, 2008) or locomotion—ultimately, perhaps a real-world cocktail party.

REFERENCES

Adelson, E. H. (2000). Lightness perception andlightness illusions. In M. S. Gazzaniga (Ed.),The new cognitive neurosciences (2nd ed.,pp. 339–351). Cambridge, MA: MIT Press.

Ahrens, M. B., Linden, J. F., & Sahani, M. (2008).Influences in auditory cortical responses mod-eled with multilinear spectrotemporal methods.Journal of Neuroscience, 28(8), 1929–1942.

Ahveninen, J., Jääskeläinen, I. P., Raij, T., Bon-massar, G., Devore, S., Hämäläinen, M., . . .Belliveau, J. W. (2006). Task-modulated “what”and “where” pathways in human auditory cor-tex. Proceedings of the National Academy ofSciences, USA, 103(39), 14608–14613.

Alain, C., Arnott, S. R., Hevenor, S., Graham, S., & Grady, C. L. (2001). “What” and “where” in the human auditory system. Proceedings of the National Academy of Sciences, USA, 98, 12301–12306.

Alais, D., & Burr, D. E. (2004). The ventriloquisteffect results from near-optimal bimodal inte-gration. Current Biology, 14, 257–262.

ANSI. (2007). American National Standard Pro-cedure for the Computation of Loudness ofSteady Sounds (ANSI S3.4). Retrieved fromhttps://global.ihs.com/doc_detail.cfm?&csf=ASA&document_name=ANSI%2FASA%20S3%2E4&item_s_key=00009561

Ashmore, J. (2008). Cochlear outer hair cell motil-ity. Physiological Review, 88, 173–210.

Attias, H., & Schreiner, C. E. (1997). Tempo-ral low-order statistics of natural sounds. InM. Mozer, M. Jordan, & T. Petsche (Eds.),Advances in neural information processing(pp. 27–33). Cambridge, MA: MIT Press.

Attneave, F., & Olson, R. K. (1971). Pitch as amedium: A new approach to psychophysicalscaling. American Journal of Psychology, 84(2),147–166.

Bacon, S. P., & Grantham, D. W. (1989). Modula-tion masking: Effects of modulation frequency,depth, and phase. Journal of the Acoustical Soci-ety of America, 85, 2575–2580.

Bandyopadhyay, S., Shamma, S. A., & Kanold,P. O. (2010). Dichotomy of functional orga-nization in the mouse auditory cortex. NatureNeuroscience, 13(3), 361–368.

Barbour, D. L., & Wang, X. (2003). Contrast tun-ing in auditory cortex. Science, 299, 1073–1075.

Barton, B., Venezia, J. H., Saberi, K., Hickok,G., & Brewer, A. A. (2012). Orthogonal acousticdimensions define auditory field maps in humancortex. Proceedings of the National Academy ofSciences, USA, 109(50), 20738–20743.

Baumann, S., Griffiths, T. D., Sun, L., Petkov,C. I., Thiele, A., & Rees, A. (2011). Orthogo-nal representation of sound dimensions in theprimate midbrain. Nature Neuroscience, 14(4),423–425.

Baumann, S., Petkov, C. I., & Griffiths, T. D.(2013). A unified framework for the organiza-tion of the primate auditory cortex. Frontiers inSystems Neuroscience, 7, 1–8.

Bee, M. A., & Micheyl, C. (2008). The cocktail party problem: What is it? How can it be solved? And why should animal behaviorists study it? Journal of Comparative Psychology, 122(3), 235–251.

Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., &Pike, B. (2000). Voice-selective areas in humanauditory cortex. Nature, 403, 309–312.

Bendor, D., & Wang, X. (2005). The neuronal rep-resentation of pitch in primate auditory cortex.Nature, 426, 1161–1165.

Bendor, D., & Wang, X. (2006). Cortical represen-tations of pitch in monkeys and humans. CurrentOpinion in Neurobiology, 16, 391–399.

Bendor, D., & Wang, X., (2008). Neural responseproperties of primary, rostral, and rostrotempo-ral core fields in the auditory cortex of marmosetmonkeys. Journal of Neurophysiology, 100(2),888–906.

Bernstein, J. G. W., & Oxenham, A. J. (2005). Anautocorrelation model with place dependence toaccount for the effect of harmonic number onfundamental frequency discrimination. Journalof the Acoustical Society of America, 117(6),3816–3831.

Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgo-wan, P. S. F., Springer, J. A., Kaufman, J. N., &Possing, E. T. (2000). Human temporal lobeactivation by speech and nonspeech sounds.Cerebral Cortex, 10(5), 512–528.

Bitterman, Y., Mukamel, R., Malach, R., Fried,I., & Nelken, I. (2008). Ultra-fine frequency tun-ing revealed in single neurons of human audi-tory cortex. Nature, 451(7175), 197–201.

Bizley, J. K., Nodal, F. R., Bajo, V. M., Nelken,I., & King, A. J. (2007). Physiological andanatomical evidence for multisensory interac-tions in auditory cortex. Cerebral Cortex, 17,2172–2189.

Bizley, J. K., Walker, K. M. M., King, A. J., &Schnupp, J. W. (2010). Neural ensemble codesfor stimulus periodicity in auditory cortex. Jour-nal of Neuroscience, 30(14), 5078–5091.

Bizley, J. K., Walker, K. M. M., Silverman, B. W.,King, A. J., & Schnupp, J. W. (2009). Interde-pendent encoding of pitch, timbre, and spatiallocation in auditory cortex. Journal of Neuro-science, 29(7), 2064–2075.

Boemio, A., Fromm, S., Braun, A., & Poeppel, D. (2005). Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nature Neuroscience, 8, 389–395.

Brandewie, E., & Zahorik, P. (2010). Prior lis-tening in rooms improves speech intelligibility.Journal of the Acoustical Society of America,128, 291–299.

Bregman, A. S. (1990). Auditory scene anal-ysis: The perceptual organization of sound.Cambridge, MA: MIT Press.

Bregman, A. S., & Campbell, J. (1971). Primaryauditory stream segregation and perception oforder in rapid sequences of tones. Journal ofExperimental Psychology, 89, 244–249.

Bronkhorst, A. W. (2000). The cocktail partyphenomenon: A review of research on speechintelligibility in multiple-talker conditions. ActaAcustica united with Acustica, 86, 117–128.

Bronkhorst, A. W., & Houtgast, T. (1999). Audi-tory distance perception in rooms. Nature, 397,517–520.

Brown, A. D., Stecker, G. C., & Tollin, D. J.(2015). The precedence effect in sound localiza-tion. Journal of the Association for Research inOtolaryngology, 16, 1–28.

Brown, G. J., & Palomaki, K. J. (2006). Reverber-ation. In D. Wang & G. J. Brown (Eds.), Com-putational auditory scene analysis: Principles,algorithms, and applications (pp. 209–250).Hoboken, NJ: Wiley.

Buus, S., Florentine, M., & Poulsen, T. (1997).Temporal integration of loudness, loudness dis-crimination, and the form of the loudnessfunction. Journal of the Acoustical Society ofAmerica, 101, 669–680.

Cariani, P. A., & Delgutte, B. (1996). Neural corre-lates of the pitch of complex tones. I. Pitch andpitch salience. Journal of Neurophysiology, 76,1698–1716.

Carlson, N. L., Ming, V. L., & DeWeese, M. R.(2012). Sparse codes for speech predict spec-trotemporal receptive fields in the inferior col-liculus. PLoS Computational Biology, 8(7),e1002594.

Carlyon, R. P. (1991). Discriminating betweencoherent and incoherent frequency modulationof complex tones. Journal of the AcousticalSociety of America, 89, 329–340.

Carlyon, R. P. (2004). How the brain separatessounds. Trends in Cognitive Sciences, 8(10),465–471.

Carlyon, R. P., Cusack, R., Foxton, J. M., &Robertson, I. H. (2001). Effects of attentionand unilateral neglect on auditory stream seg-regation. Journal of Experimental Psychology:Human Perception and Performance, 27(1),115–127.

Carlyon, R. P., Micheyl, C., Deeks, J. M., &Moore, B. C. (2004). Auditory processing ofreal and illusory changes in frequency modula-tion (FM) phase. Journal of the Acoustical Soci-ety of America, 116(6), 3629–3639.

Cavaco, S., & Lewicki, M. S. (2007). Statisti-cal modeling of intrinsic structures in impactsounds. Journal of the Acoustical Society ofAmerica, 121(6), 3558–3568.

Cedolin, L., & Delgutte, B. (2005). Pitch of com-plex tones: Rate-place and interspike intervalrepresentations in the auditory nerve. Journal ofNeurophysiology, 94, 347–362.

Cherry, E. C. (1953). Some experiments on therecognition of speech, with one and two ears.Journal of the Acoustical Society of America,25(5), 975–979.

Chi, T., Ru, P., & Shamma, S. A. (2005). Mul-tiresolution spectrotemporal analysis of com-plex sounds. Journal of the Acoustical Societyof America, 118(2), 887–906.

Christianson, G. B., Sahani, M., & Linden, J. F.(2008). The consequences of response nonlin-earities for interpretation of spectrotemporalreceptive fields. Journal of Neuroscience, 28(2),446–455.

Ciocca, V., & Bregman, A. S. (1987). Perceivedcontinuity of gliding and steady-state tonesthrough interrupting noise. Perception & Psy-chophysics, 42, 476–484.

Cohen, Y. E., Russ, B. E., Davis, S. J., Baker, A. E.,Ackelson, A. L., & Nitecki, R. (2009). A func-tional role for the ventrolateral prefrontal cortexin non-spatial auditory cognition. Proceedingsof the National Academy of Sciences, USA, 106,20045–20050.

Conway, A. R. A., Cowan, N., & Bunting, M. F. (2001). The cocktail party phenomenon revisited: The importance of working memory capacity. Psychonomic Bulletin & Review, 8, 331–335.

Culling, J. F., & Akeroyd, M. A. (2010).Spatial hearing. In C. J. Plack (Ed.), TheOxford handbook of auditory science: Hearing(pp. 123–144). New York, NY: Oxford Univer-sity Press.

Culling, J. F., & Summerfield, Q. (1995).Perceptual separation of concurrent speechsounds: Absence of across-frequency groupingby common interaural delay. Journal of theAcoustical Society of America, 98(2), 785–797.

Cusack, R., & Carlyon, R. P. (2004). Auditoryperceptual organization inside and outside thelaboratory. In J. G. Neuhoff (Ed.), Ecologicalpsychoacoustics (pp. 15–48). San Diego, CA:Elsevier/Academic Press.

Cutting, J. E. (1975). Aspects of phonologi-cal fusion. Journal of Experimental Psychol-ogy: Human Perception and Performance, 104:105–120.

Dallos, P. (2008). Cochlear amplification, outerhair cells and prestin. Current Opinion in Neu-robiology, 18, 370–376.

Darwin, C. (1984). Perceiving vowels in the pres-ence of another sound: Constraints on formantperception. Journal of the Acoustical Society ofAmerica, 76(6), 1636–1647.

Darwin, C. J. (1981). Perceptual grouping ofspeech components differing in fundamentalfrequency and onset-time. Quarterly Journalof Experimental Psychology Section A, 33(2),185–207.

Darwin, C. J. (1997). Auditory grouping. Trends inCognitive Sciences, 1, 327–333.

Darwin, C. J., & Ciocca, V. (1992). Grouping inpitch perception: Effects of onset asynchronyand ear of presentation of a mistuned compo-nent. Journal of the Acoustical Society of Amer-ica, 91, 3381–3390.

Darwin, C. J., & Hukin, R. W. (1997). Perceptualsegregation of a harmonic from a vowel by inter-aural time difference and frequency proximity.Journal of the Acoustical Society of America,102(4), 2316–2324.

Darwin, C. J., & Hukin, R. W. (1999). Auditory objects of attention: The role of interaural time differences. Journal of Experimental Psychology: Human Perception and Performance, 25(3), 617–629.

Dau, T., Kollmeier, B., & Kohlrausch, A. (1997).Modeling auditory processing of amplitudemodulation. I. Detection and masking withnarrow-band carriers. Journal of the AcousticalSociety of America, 102(5), 2892–2905.

David, S. V., Mesgarani, N., Fritz, J. B., &Shamma, S. A. (2009). Rapid synaptic depres-sion explains nonlinear modulation of spectro-temporal tuning in primary auditory cortexby natural stimuli. Journal of Neuroscience,29(11), 3374–3386.

de Cheveigne, A. (2004). Pitch perception models.In C. J. Plack & A. J. Oxenham (Eds.), Pitch(pp. 169–233). New York, NY: Springer Verlag.

de Cheveigne, A. (2006). Multiple F0 estimation.In D. Wang & G. J. Brown (Eds.), Computa-tional auditory scene analysis: Principles, algo-rithms, and applications (pp. 45–80). Hoboken,NJ: Wiley.

de Cheveigne, A. (2010). Pitch perception. In C. J.Plack (Ed.), The Oxford handbook of auditoryscience: Hearing. New York, NY: Oxford Uni-versity Press.

de Cheveigne, A., & Kawahara, H. (2002). YIN,a fundamental frequency estimator for speechand music. Journal of the Acoustical Society ofAmerica, 111, 1917–1930.

de Cheveigne, A., McAdams, S., Laroche, J., &Rosenberg, M. (1995). Identification of con-current harmonic and inharmonic vowels: Atest of the theory of harmonic cancellation andenhancement. Journal of the Acoustical Societyof America, 97(6), 3736–3748.

Delgutte, B., Joris, P. X., Litovsky, R. Y., & Yin,T. C. T. (1999). Receptive fields and binauralinteractions for virtual-space stimuli in the catinferior colliculus. Journal of Neurophysiology,81, 2833–2851.

Demany, L., & Semal, C. (1990). The upperlimit of “musical” pitch. Music Perception, 8,165–176.

Demany, L., Trost, W., Serman, M., & Semal, C.(2008). Auditory change detection: Simplesounds are not memorized better than complexsounds. Psychological Science, 19, 85–91.

Page 49: Audition' in: Stevens’ Handbook of Experimental ...mcdermottlab.mit.edu/...Audition_Stevens_Handbook.pdf · The fifth section discusses the perception of sound source properties,

References 49

Depireux, D. A., Simon, J. Z., Klein, D. J., &Shamma, S. A. (2001). Spectro-temporalresponse field characterization with dynamicripples in ferret primary auditory cortex. Jour-nal of Neurophysiology, 85(3), 1220–1234.

DiCarlo, J. J., & Cox, D. D. (2007). Untanglinginvariant object recognition. Trends in CognitiveSciences, 11, 333–341.

Durlach, N. I., Mason, C. R., Kidd, G., Arbogast,T. L., Colburn, H. S., & Shinn-Cunningham,B. G. (2003). Note on informational masking.Journal of the Acoustical Society of America,113(6), 2984–2987.

Elgoyhen, A. B., & Fuchs, P. A. (2010). Efferentinnervation and function. In P. A. Fuchs (Ed.),The Oxford handbook of auditory science: Theear (pp. 283–306). New York, NY: Oxford Uni-versity Press.

Eliades, S. J., & Wang, X. (2008). Neuralsubstrates of vocalization feedback monitor-ing in primate auditory cortex. Nature, 453,1102–1106.

Escabi, M. A., Miller, L. M., Read, H. L., &Schreiner, C. E. (2003). Naturalistic auditorycontrast improves spectrotemporal coding in thecat inferior colliculus. Journal of Neuroscience,23, 11489–11504.

Field, D. J. (1987). Relations between the statisticsof natural images and the response profiles ofcortical cells. Journal of the Optical Society ofAmerica A, 4(12), 2379–2394.

Fishman, Y. I., Reser, D. H., Arezzo, J. C., &Steinschneider, M. (1998). Pitch vs. spectralencoding of harmonic complex tones in primaryauditory cortex of the awake monkey. BrainResearch, 786, 18–30.

Formisano, E., Kim, D., Di Salle, F., van deMoortele, P., Ugurbil, K., & Goebel, R. (2003).Mirror-symmetric tonotopic maps in human pri-mary auditory cortex. Neuron, 40(4), 859–869.

Freyman, R. L., Clifton, R. K., & Litovsky, R. Y.(1991). Dynamic processes in the precedenceeffect. Journal of the Acoustical Society ofAmerica, 90, 874–884.

Gardner, W. G. (1998). Reverberation algorithms.In M. Kahrs & K. Brandenburg (Eds.), Applica-tions of digital signal processing to audio andacoustics. Norwell, MA: Kluwer Academic.

Gaver, W. W. (1993). What in the world do wehear? An ecological approach to auditory sourceperception. Ecological Psychology, 5(1), 1–29.

Ghazanfar, A. A. (2009). The multisensory rolesfor auditory cortex in primate vocal communi-cation. Hearing Research, 258, 113–120.

Ghitza, O. (2001). On the upper cutoff frequencyof the auditory critical-band envelope detectorsin the context of speech perception. Journalof the Acoustical Society of America, 110(3),1628–1640.

Giraud, A., Lorenzi, C., Ashburner, J., Wable,J., Johnsrude, I. S., Frackowiak, R., & Klein-schmidt, A. (2000). Representation of the tem-poral envelope of sounds in the human brain.Journal of Neurophysiology, 84(3), 1588–1598.

Glasberg, B. R., & Moore, B. C. J. (1990). Deriva-tion of auditory filter shapes from notched-noisedata. Hearing Research, 47, 103–138.

Goldstein, J. L. (1973). An optimum processor the-ory for the central formation of the pitch of com-plex tones. Journal of the Acoustical Society ofAmerica, 54, 1496–1516.

Grothe, B., Pecka, M., & McAlpine, D. (2010).Mechanisms of sound localization in mammals.Physiological Review, 90, 983–1012.

Guinan, J. J. (2006). Olivocochlear efferents:Anatomy, physiology, function, and the mea-surement of efferent effects in humans. Ear andHearing, 27(6), 589–607.

Gygi, B., Kidd, G. R., & Watson, C. S. (2004).Spectral-temporal factors in the identification ofenvironmental sounds. Journal of the AcousticalSociety of America, 115(3), 1252–1265.

Hall, D. A., & Plack, C. J. (2009). Pitch process-ing sites in the human auditory brain. CerebralCortex, 19(3), 576–585.

Hall, J. W., Haggard, M. P., & Fernandes, M. A.(1984). Detection in noise by spectro-temporalpattern analysis. Journal of the Acoustical Soci-ety of America, 76, 50–56.

Harper, N. S., & McAlpine, D. (2004). Optimalneural population coding of an auditory spatialcue. Nature, 430, 682–686.

Hartmann, W. M., McAdams, S., & Smith, B. K.(1990). Hearing a mistuned harmonic in an oth-erwise periodic complex tone. Journal of theAcoustical Society of America, 88, 1712–1724.


Hawley, M. L., Litovsky, R. Y., & Culling, J. F.(2004). The benefit of binaural hearing in acocktail party: Effect of location and type ofinterferer. Journal of the Acoustical Society ofAmerica, 115(2), 833–843.

Heeger, D. J., & Bergen, J. (1995). Pyramid-basedtexture analysis/synthesis. In S. G. Mair &R. Cook (Eds.), SIGGRAPH 95 Visual Pro-ceedings: 22nd International ACM Conferenceon Computer Graphics and Interactive Tech-niques (Computer Graphics: Annual Confer-ence Series) (pp. 229–238). New York, NY:ACM.

Heffner, H. E., & Heffner, R. S. (1990). Effect ofbilateral auditory cortex lesions on sound local-ization in Japanese macaques. Journal of Neu-rophysiology, 64(3), 915–931.

Heinz, M. G., Colburn, H. S., & Carney, L. H.(2001). Evaluating auditory performance limits:I. One-parameter discrimination using a com-putational model for the auditory nerve. NeuralComputation, 13, 2273–2316.

Herdener, M., Esposito, F., Scheffler, K., Schnei-der, P., Logothetis, N. K., Uludag, K., & Kayser,C. (2013). Spatial representations of temporaland spectral sound cues in human auditory cor-tex. Cortex, 49(10), 2822–2833.

Hickok, G., & Poeppel, D. (2007). The corti-cal organization of speech processing. NatureReviews Neuroscience, 8(5), 393–402.

Higgins, N. C., Storace, D. A., Escabi, M. A., &Read, H. L. (2010). Specialization of binauralresponses in ventral auditory cortices. Journalof Neuroscience, 30(43), 14522–14532.

Hofman, P. M., Van Riswick, J. G. A., & vanOpstal, A. J. (1998). Relearning sound localiza-tion with new ears. Nature Neuroscience, 1(5),417–421.

Holt, L. L., & Lotto, A. J. (2010). Speech per-ception as categorization. Attention, Perception,and Psychophysics, 72(5), 1218–1227.

Hood, L. J., Berlin, C. I., & Allen, P. (1994).Cortical deafness: A longitudinal study. Jour-nal of the American Academy of Audiology, 5,330–342.

Houtgast, T. (1989). Frequency selectivity inamplitude-modulation detection. Journal of theAcoustical Society of America, 85, 1676–1680.

Houtsma, A. J. M., & Smurzynski, J. (1990).Pitch identification and discrimination for com-plex tones with many harmonics. Journal of theAcoustical Society of America, 87(1), 304–310.

Hsu, A., Woolley, S. M., Fremouw, T. E., & The-unissen, F. E. (2004). Modulation power andphase spectrum of natural sounds enhance neu-ral encoding performed by single auditory neu-rons. Journal of Neuroscience, 24, 9201–9211.

Hudspeth, A. J. (2008). Making an effort to lis-ten: Mechanical amplification in the ear. Neu-ron, 59(4), 530–545.

Humphries, C., Liebenthal, E., & Binder, J. R.(2010). Tonotopic organization of human audi-tory cortex. NeuroImage, 50(3), 1202–1211.

Ihlefeld, A., & Shinn-Cunningham, B. (2008).Spatial release from energetic and informationalmasking in a divided speech identification task.Journal of the Acoustical Society of America,123(6), 4380–4392.

Javel, E., & Mott, J. B. (1988). Physiological andpsychophysical correlates of temporal processesin hearing. Hearing Research, 34, 275–294.

Jenkins, W. M., & Masterton, R. B. (1982). Soundlocalization: Effects of unilateral lesions in cen-tral auditory system. Journal of Neurophysiol-ogy, 47, 987–1016.

Johnson, D. H. (1980). The relationship betweenspike rate and synchrony in responses ofauditory-nerve fibers to single tones. Jour-nal of the Acoustical Society of America, 68,1115–1122.

Johnsrude, I. S., Penhune, V. B., & Zatorre, R. J.(2000). Functional specificity in the right humanauditory cortex for perceiving pitch direction.Brain, 123(1), 155–163.

Joris, P. X., Bergevin, C., Kalluri, R., McLaugh-lin, M., Michelet, P., van der Heijden, M., &Shera, C. A. (2011). Frequency selectiv-ity in Old-World monkeys corroborates sharpcochlear tuning in humans. Proceedings of theNational Academy of Sciences, USA, 108(42),17516–17520.

Joris, P. X., Schreiner, C. E., & Rees, A. (2004).Neural processing of amplitude-modulatedsounds. Physiological Review, 84, 541–577.

Kaas, J. H., & Hackett, T. A. (2000). Subdivisions of auditory cortex and processing streams in primates. Proceedings of the National Academy of Sciences, USA, 97, 11793–11799.

Kanwisher, N. (2010). Functional specificity inthe human brain: A window into the func-tional architecture of the mind. Proceedings ofthe National Academy of Sciences, USA, 107,11163–11170.

Kawase, T., Delgutte, B., & Liberman, M. C.(1993). Anti-masking effects of the olivo-cochlear reflex, II: Enhancement of auditory-nerve response to masked tones. Journal ofNeurophysiology, 70, 2533–2549.

Kayser, C., Petkov, C. I., & Logothetis, N. K.(2008). Visual modulation of neurons in audi-tory cortex. Cerebral Cortex, 18(7), 1560–1574.

Kidd, G., Arbogast, T. L., Mason, C. R., & Gallun,F. J. (2005). The advantage of knowing whereto listen. Journal of the Acoustical Society ofAmerica, 118(6), 3804–3815.

Kidd, G., Mason, C. R., Deliwala, P. S., & Woods,W. S. (1994). Reducing informational maskingby sound segregation. Journal of the AcousticalSociety of America, 95(6), 3475–3480.

Kidd, G., Mason, C. R., & Richards, V. M. (2003).Multiple bursts, multiple looks, and streamcoherence in the release from informationalmasking. Journal of the Acoustical Society ofAmerica 114(5), 2835–2845.

Kikuchi, Y., Horwitz, B., & Mishkin, M. (2010).Hierarchical auditory processing directed ros-trally along the monkey’s supratemporal plane.Journal of Neuroscience, 30(39), 13021–13030.

Kluender, K. R., & R. L. Jenison (1992). Effectsof glide slope, noise intensity, and noiseduration on the extrapolation of FM glidesthrough noise. Perception & Psychophysics, 51,231–238.

Klump, R. G., & Eady, H. R. (1956). Some mea-surements of interural time difference thresh-olds. Journal of the Acoustical Society ofAmerica, 28, 859–860.

Kulkarni, A., & Colburn, H. S. (1998). Roleof spectral detail in sound-source localization.Nature, 396, 747–749.

Langner, G., Sams, M., Heil, P., & Schulze, H. (1997). Frequency and periodicity are represented in orthogonal maps in the human auditory cortex: Evidence from magnetoencephalography. Journal of Comparative Physiology, 181, 665–676.

Lerner, Y., Honey, C. J., Silbert, L. J., & Has-son, U. (2011). Topographic mapping of a hier-archy of temporal receptive windows using anarrated story. Journal of Neuroscience, 31(8),2906–2915.

Lewicki, M. S. (2002). Efficient coding of naturalsounds. Nature Neuroscience, 5(4), 356–363.

Liberman, M. C. (1982). The cochlear frequencymap for the cat: labeling auditory-nerve fibers ofknown characteristic frequency. Journal of theAcoustical Society of America, 72, 1441–1449.

Liebenthal, E., Binder, J. R., Spitzer, S. M., Poss-ing, E. T., & Medler, D. A. (2005). Neuralsubstrates of phonemic perception. CerebralCortex, 15(10), 1621–1631.

Litovsky, R. Y., Colburn, H. S., Yost, W. A., &Guzman, S. J. (1999). The precedence effect.Journal of the Acoustical Society of America,106, 1633–1654.

Lomber, S. G., & Malhotra, S. (2008). Doubledissociation of “what” and “where” processingin auditory cortex. Nature Neuroscience, 11(5),609–616.

Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., &Moore, B. C. J. (2006). Speech perception prob-lems of the hearing impaired reflect inabilityto use temporal fine structure. Proceedings ofthe National Academy of Sciences, USA, 103,18866–18869.

Lutfi, R. A. (1992). Informational processing ofcomplex sounds. III. Interference. Journal of theAcoustical Society of America, 91, 3391–3400.

Lutfi, R. A. (2008). Sound source identifica-tion. In W. A. Yost & A. N. Popper (Eds.),Springer handbook of auditory research: Audi-tory perception of sound sources. New York,NY: Springer-Verlag.

Lyzenga, J., Carlyon, R. P., & Moore, B. C. J.(2005). Dynamic aspects of the continuity illu-sion: Perception of level and of the depth, rate,and phase of modulation. Hearing Research,210, 30–41.

Machens, C. K., Wehr, M. S., & Zador, A. M. (2004). Linearity of cortical receptive fields measured with natural sounds. Journal of Neuroscience, 24, 1089–1100.

Macken, W. J., Tremblay, S., Houghton, R. J., Nicholls, A. P., & Jones, D. M. (2003). Does auditory streaming require attention? Evidence from attentional selectivity in short-term memory. Journal of Experimental Psychology: Human Perception and Performance, 29, 43–51.

Makous, J. C., & Middlebrooks, J. C. (1990). Two-dimensional sound localization by human listeners. Journal of the Acoustical Society of America, 87, 2188–2200.

Mandel, M. I., Weiss, R. J., & Ellis, D. P. W. (2010). Model-based expectation maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing, 18(2), 382–394.

Marr, D. C. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York, NY: Freeman.

Masutomi, K., Barascud, N., Kashino, M., McDermott, J. H., & Chait, M. (2016). Sound segregation via embedded repetition is robust to inattention. Journal of Experimental Psychology: Human Perception and Performance, 42(3), 386–400.

May, B. J., Anderson, M., & Roos, M. (2008). The role of broadband inhibition in the rate representation of spectral cues for sound localization in the inferior colliculus. Hearing Research, 238, 77–93.

May, B. J., & McQuone, S. J. (1995). Effects of bilateral olivocochlear lesions on pure-tone discrimination in cats. Auditory Neuroscience, 1, 385–400.

McAdams, S. (1989). Segregation of concurrent sounds: I. Effects of frequency modulation coherence. Journal of the Acoustical Society of America, 86, 2148–2159.

McAlpine, D. (2004). Neural sensitivity to periodicity in the inferior colliculus: Evidence for the role of cochlear distortions. Journal of Neurophysiology, 92, 1295–1311.

McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19, R1024–R1027.

McDermott, J. H. (2013). Audition. In K. Ochsner & S. Kosslyn (Eds.), The Oxford handbook of cognitive neuroscience. Oxford, United Kingdom: Oxford University Press.

McDermott, J. H., & Oxenham, A. J. (2008a). Music perception, pitch, and the auditory system. Current Opinion in Neurobiology, 18, 452–463.

McDermott, J. H., & Oxenham, A. J. (2008b). Spectral completion of partially masked sounds. Proceedings of the National Academy of Sciences, USA, 105(15), 5939–5944.

McDermott, J. H., Oxenham, A. J., & Simoncelli, E. P. (2009, October). Sound texture synthesis via filter statistics. In 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 297–300), New Paltz, NY.

McDermott, J. H., Schemitsch, M., & Simoncelli, E. P. (2013). Summary statistics in auditory perception. Nature Neuroscience, 16(4), 493–498.

McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71, 926–940.

McDermott, J. H., Wrobleski, D., & Oxenham, A. J. (2011). Recovering sound sources from embedded repetition. Proceedings of the National Academy of Sciences, USA, 108(3), 1188–1193.

Meddis, R., & Hewitt, M. J. (1991). Virtual pitch and phase sensitivity of a computer model of the auditory periphery: Pitch identification. Journal of the Acoustical Society of America, 89, 2866–2882.

Mershon, D. H., Desaulniers, D. H., Kiefer, S. A., Amerson, T. L. J., & Mills, J. T. (1981). Perceived loudness and visually-determined auditory distance. Perception, 10, 531–543.

Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006–1010.

Mesgarani, N., David, S. V., Fritz, J. B., & Shamma, S. A. (2008). Phoneme representation and classification in primary auditory cortex. Journal of the Acoustical Society of America, 123(2), 899–909.

Mesgarani, N., & Shamma, S. A. (2011, May). Speech processing with a cortical representation of audio. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5872–5875). Prague, Czech Republic.

Micheyl, C., & Oxenham, A. J. (2010a). Objective and subjective psychophysical measures of auditory stream integration and segregation. Journal of the Association for Research in Otolaryngology, 11(4), 709–724.

Micheyl, C., & Oxenham, A. J. (2010b). Pitch, harmonicity and concurrent sound segregation: Psychoacoustical and neurophysiological findings. Hearing Research, 266, 36–51.

Middlebrooks, J. C. (1992). Narrow-band sound localization related to external ear acoustics. Journal of the Acoustical Society of America, 92(5), 2607–2624.

Middlebrooks, J. C. (2000). Cortical representations of auditory space. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 425–436). Cambridge, MA: MIT Press.

Middlebrooks, J. C., & Green, D. M. (1991). Sound localization by human listeners. Annual Review of Psychology, 42, 135–159.

Miller, L. M., Escabi, M. A., Read, H. L., & Schreiner, C. E. (2002). Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. Journal of Neurophysiology, 87, 516–527.

Miller, R. L., Schilling, J. R., Franck, K. R., & Young, E. D. (1997). Effects of acoustic trauma on the representation of the vowel /e/ in cat auditory nerve fibers. Journal of the Acoustical Society of America, 101(6), 3602–3616.

Mlynarski, W. (2015). The opponent channel population code of sound location is an efficient representation of natural binaural sounds. PLoS Computational Biology, 11(5), e1004294.

Mlynarski, W., & Jost, J. (2014). Statistics of natural binaural sounds. PLoS ONE, 9(10), e108968.

Moore, B. C., & Glasberg, B. R. (1996). A revision of Zwicker’s loudness model. Acta Acustica united with Acustica, 82(2), 335–345.

Moore, B. C. J. (1973). Frequency difference limens for short-duration tones. Journal of the Acoustical Society of America, 54, 610–619.

Moore, B. C. J. (2003). An introduction to the psychology of hearing. San Diego, CA: Academic Press.

Moore, B. C. J., Glasberg, B. R., & Peters, R. W. (1986). Thresholds for hearing mistuned partials as separate tones in harmonic complexes. Journal of the Acoustical Society of America, 80, 479–483.

Moore, B. C. J., & Gockel, H. (2002). Factors influencing sequential stream segregation. Acta Acustica united with Acustica, 88, 320–332.

Moore, B. C. J., & Oxenham, A. J. (1998). Psychoacoustic consequences of compression in the peripheral auditory system. Psychological Review, 105(1), 108–124.

Morosan, P., Rademacher, J., Schleicher, A., Amunts, K., Schormann, T., & Zilles, K. (2001). Human primary auditory cortex: Cytoarchitectonic subdivisions and mapping into a spatial reference system. NeuroImage, 13, 684–701.

Moshitch, D., Las, L., Ulanovsky, N., Bar Yosef, O., & Nelken, I. (2006). Responses of neurons in primary auditory cortex (A1) to pure tones in the halothane-anesthetized cat. Journal of Neurophysiology, 95(6), 3756–3769.

Neff, D. L. (1995). Signal properties that reduce masking by simultaneous, random-frequency maskers. Journal of the Acoustical Society of America, 98, 1909–1920.

Nelken, I., Bizley, J. K., Nodal, F. R., Ahmed, B., King, A. J., & Schnupp, J. W. (2008). Responses of auditory cortex to complex stimuli: Functional organization revealed using intrinsic optical signals. Journal of Neurophysiology, 99(4), 1928–1941.

Norman-Haignere, S., Kanwisher, N., & McDermott, J. H. (2013). Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. Journal of Neuroscience, 33(50), 19451–19469.

Norman-Haignere, S., Kanwisher, N., & McDermott, J. H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron, 88, 1281–1296.

Norman-Haignere, S., & McDermott, J. H. (2016). Distortion products in auditory fMRI research: Measurements and solutions. NeuroImage, 129, 401–413.

Obleser, J., Zimmermann, J., Van Meter, J., & Rauschecker, J. P. (2007). Multiple stages of auditory speech perception reflected in event-related fMRI. Cerebral Cortex, 17(10), 2251–2257.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Overath, T., McDermott, J. H., Zarate, J. M., & Poeppel, D. (2015). The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nature Neuroscience, 18, 903–911.

Palmer, A. R., & Russell, I. J. (1986). Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hearing Research, 24, 1–15.

Patterson, R. D., Uppenkamp, S., Johnsrude, I. S., & Griffiths, T. D. (2002). The processing of temporal pitch and melody information in auditory cortex. Neuron, 36(4), 767–776.

Penagos, H., Melcher, J. R., & Oxenham, A. J. (2004). A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. Journal of Neuroscience, 24(30), 6810–6815.

Petkov, C. I., Kayser, C., Augath, M., & Logothetis, N. K. (2006). Functional imaging reveals numerous fields in the monkey auditory cortex. PLoS Biology, 4(7), 1213–1226.

Petkov, C. I., Kayser, C., Steudel, T., Whittingstall, K., Augath, M., & Logothetis, N. K. (2008). A voice region in the monkey brain. Nature Neuroscience, 11, 367–374.

Petkov, C. I., O’Connor, K. N., & Sutter, M. L. (2007). Encoding of illusory continuity in primary auditory cortex. Neuron, 54, 153–165.

Plack, C. J. (2005). The sense of hearing. Mahwah, NJ: Lawrence Erlbaum.

Plack, C. J., & Oxenham, A. J. (2005). The psychophysics of pitch. In C. J. Plack, A. J. Oxenham, R. R. Fay, & A. N. Popper (Eds.), Pitch: Neural coding and perception (pp. 7–25). New York, NY: Springer-Verlag.

Plack, C. J., Oxenham, A. J., Popper, A. N., & Fay, R. R. (Eds.) (2005). Pitch: Neural coding and perception. New York, NY: Springer.

Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as “asymmetric sampling in time.” Speech Communication, 41, 245–255.

Poremba, A., Saunders, R. C., Crane, A. M., Cook, M., Sokoloff, L., & Mishkin, M. (2003). Functional mapping of the primate auditory system. Science, 299, 568–572.

Portilla, J., & Simoncelli, E. P. (2000). A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1), 49–71.

Rajan, R. (2000). Centrifugal pathways protect hearing sensitivity at the cochlea in noisy environments that exacerbate the damage induced by loud sound. Journal of Neuroscience, 20, 6684–6693.

Rauschecker, J. P., & Tian, B. (2004). Processing of band-passed noise in the lateral auditory belt cortex of the rhesus monkey. Journal of Neurophysiology, 91, 2578–2589.

Rayleigh, L. (1907). On our perception of sound direction. Philosophical Magazine, 3, 456–464.

Recanzone, G. H. (2008). Representation of conspecific vocalizations in the core and belt areas of the auditory cortex in the alert macaque monkey. Journal of Neuroscience, 28(49), 13184–13193.

Rhode, W. S. (1971). Observations of the vibration of the basilar membrane in squirrel monkeys using the Mossbauer technique. Journal of the Acoustical Society of America, 49, 1218–1231.

Rhode, W. S. (1978). Some observations on cochlear mechanics. Journal of the Acoustical Society of America, 64, 158–176.

Riecke, L., van Opstal, J., Goebel, R., & Formisano, E. (2007). Hearing illusory sounds in noise: Sensory-perceptual transformations in primary auditory cortex. Journal of Neuroscience, 27(46), 12684–12689.

Roberts, B., & Brunstrom, J. M. (1998). Perceptual segregation and pitch shifts of mistuned components in harmonic complexes and in regular inharmonic complexes. Journal of the Acoustical Society of America, 104(4), 2326–2338.

Rodriguez, F. A., Chen, C., Read, H. L., & Escabi, M. A. (2010). Neural modulation tuning characteristics scale to efficiently encode natural sound statistics. Journal of Neuroscience, 30, 15969–15980.

Romanski, L. M., Tian, B., Fritz, J. B., Mishkin, M., Goldman-Rakic, P. S., & Rauschecker, J. P. (1999). Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience, 2(12), 1131–1136.

Rose, J. E., Brugge, J. F., Anderson, D. J., & Hind, J. E. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. Journal of Neurophysiology, 30, 769–793.

Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society B: Biological Sciences, 336, 367–373.

Rosenblum, L. D. (2004). Perceiving articulatory events: Lessons for an ecological psychoacoustics. In J. G. Neuhoff (Ed.), Ecological psychoacoustics (pp. 219–248). San Diego, CA: Elsevier Academic Press.

Rothschild, G., Nelken, I., & Mizrahi, A. (2010). Functional organization and population dynamics in the mouse primary auditory cortex. Nature Neuroscience, 13(3), 353–360.

Rotman, Y., Bar Yosef, O., & Nelken, I. (2001). Relating cluster and population responses to natural sounds and tonal stimuli in cat primary auditory cortex. Hearing Research, 152, 110–127.

Ruggero, M. A. (1992). Responses to sound of the basilar membrane of the mammalian cochlea. Current Opinion in Neurobiology, 2, 449–456.

Ruggero, M. A., & Rich, N. C. (1991). Furosemide alters organ of Corti mechanics: Evidence for feedback of outer hair cells upon the basilar membrane. Journal of Neuroscience, 11, 1057–1067.

Ruggero, M. A., Rich, N. C., Recio, A., & Narayan, S. S. (1997). Basilar-membrane responses to tones at the base of the chinchilla cochlea. Journal of the Acoustical Society of America, 101, 2151–2163.

Saint-Arnaud, N., & Popat, K. (1995). Analysis and synthesis of sound texture. In IJCAI Workshop on Computational Auditory Scene Analysis (pp. 293–308). Montreal, Canada.

Samson, F., Zeffiro, T. A., Toussaint, A., & Belin, P. (2011). Stimulus complexity and categorical effects in human auditory cortex: An activation likelihood estimation meta-analysis. Frontiers in Psychology, 1, 1–23.

Scharf, B., Magnan, J., & Chays, A. (1997). On the role of the olivocochlear bundle in hearing: 16 case studies. Hearing Research, 103, 101–122.

Schonwiesner, M., & Zatorre, R. J. (2008). Depth electrode recordings show double dissociation between pitch processing in lateral Heschl’s gyrus and sound onset processing in medial Heschl’s gyrus. Experimental Brain Research, 187, 97–105.

Schonwiesner, M., & Zatorre, R. J. (2009). Spectro-temporal modulation transfer function of single voxels in the human auditory cortex measured with high-resolution fMRI. Proceedings of the National Academy of Sciences, USA, 106(34), 14611–14616.

Schreiner, C. E., & Urbas, J. V. (1986). Representation of amplitude modulation in the auditory cortex of the cat. I. Anterior auditory field. Hearing Research, 21, 227–241.

Schreiner, C. E., & Urbas, J. V. (1988). Representation of amplitude modulation in the auditory cortex of the cat. II. Comparison between cortical fields. Hearing Research, 32, 49–64.

Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(12), 2400–2406.

Shackleton, T. M., & Carlyon, R. P. (1994). The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. Journal of the Acoustical Society of America, 95(6), 3529–3540.

Shamma, S. A., & Klein, D. (2000). The case of the missing pitch templates: How harmonic templates emerge in the early auditory system. Journal of the Acoustical Society of America, 107, 2631–2644.

Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.

Shera, C. A., Guinan, J. J., & Oxenham, A. J. (2002). Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. Proceedings of the National Academy of Sciences, USA, 99(5), 3318–3323.

Shinn-Cunningham, B. G. (2008). Object-based auditory and visual attention. Trends in Cognitive Sciences, 12(5), 182–186.

Singh, N. C., & Theunissen, F. E. (2003). Modulation spectra of natural sounds and ethological theories of auditory processing. Journal of the Acoustical Society of America, 114(6), 3394–3411.

Smith, E. C., & Lewicki, M. S. (2006). Efficient auditory coding. Nature, 439, 978–982.

Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87–90.

Stevens, S. S. (1955). The measurement of loudness. Journal of the Acoustical Society of America, 27(5), 815–829.

Stilp, C. E., Alexander, J. M., Kiefte, M., & Kluender, K. R. (2010). Auditory color constancy: Calibration to reliable spectral properties across nonspeech context and targets. Attention, Perception, and Psychophysics, 72(2), 470–480.

Sumner, C. J., & Palmer, A. R. (2012). Auditory nerve fibre responses in the ferret. European Journal of Neuroscience, 36, 2428–2439.

Sweet, R. A., Dorph-Petersen, K., & Lewis, D. A. (2005). Mapping auditory core, lateral belt, and parabelt cortices in the human superior temporal gyrus. Journal of Comparative Neurology, 491, 270–289.

Talavage, T. M., Sereno, M. I., Melcher, J. R., Ledden, P. J., Rosen, B. R., & Dale, A. M. (2004). Tonotopic organization in human auditory cortex revealed by progressions of frequency sensitivity. Journal of Neurophysiology, 91, 1282–1296.

Tansley, B. W., & Suffield, J. B. (1983). Time course of adaptation and recovery of channels selectively sensitive to frequency and amplitude modulation. Journal of the Acoustical Society of America, 74, 765–775.

Teki, S., Chait, M., Kumar, S., Shamma, S. A., & Griffiths, T. D. (2013). Segregation of complex acoustic scenes based on temporal coherence. eLife, 2, e00699.

Terhardt, E. (1974). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55, 1061–1069.

Theunissen, F. E., David, S. V., Singh, N. C., Hsu, A., Vinje, W. E., & Gallant, J. L. (2001). Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network, 12(3), 289–316.

Theunissen, F. E., Sen, K., & Doupe, A. J. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. Journal of Neuroscience, 20, 2315–2331.

Tian, B., & Rauschecker, J. P. (2004). Processing of frequency-modulated sounds in the lateral auditory belt cortex of the rhesus monkey. Journal of Neurophysiology, 92, 2993–3013.

Tian, B., Reser, D., Durham, A., Kustov, A., & Rauschecker, J. P. (2001). Functional specialization in rhesus monkey auditory cortex. Science, 292, 290–293.

Traer, J., & McDermott, J. H. (2016). Statistics of natural reverberation enable perceptual separation of sound and space. Proceedings of the National Academy of Sciences, USA, 113, E7856–E7865.

van Noorden, L. P. A. S. (1975). Temporal coherence in the perception of tone sequences (Doctoral dissertation). Eindhoven University of Technology, The Netherlands.

Walker, K. M. M., Bizley, J. K., King, A. J., & Schnupp, J. W. (2011). Cortical encoding of pitch: Recent results and open questions. Hearing Research, 271(1–2), 74–87.

Wallace, M. N., Anderson, L. A., & Palmer, A. R. (2007). Phase-locked responses to pure tones in the auditory thalamus. Journal of Neurophysiology, 98(4), 1941–1952.

Wallach, H., Newman, E. B., & Rosenzweig, M. R. (1949). The precedence effect in sound localization. American Journal of Psychology, 42, 315–336.

Warren, J. D., Zielinski, B. A., Green, G. G. R., Rauschecker, J. P., & Griffiths, T. D. (2002). Perception of sound-source motion by the human brain. Neuron, 34, 139–148.

Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392–393.

Warren, R. M., Obusek, C. J., & Ackroff, J. M. (1972). Auditory induction: Perceptual synthesis of absent sounds. Science, 176, 1149–1151.

Watkins, A. J. (2005). Perceptual compensation for effects of reverberation in speech identification. Journal of the Acoustical Society of America, 118, 249–262.

Watson, C. S. (1987). Uncertainty, informational masking and the capacity of immediate auditory memory. In W. A. Yost & C. S. Watson (Eds.), Auditory processing of complex sounds (pp. 267–277). Hillsdale, NJ: Erlbaum.

Wightman, F. (1973). The pattern-transformation model of pitch. Journal of the Acoustical Society of America, 54, 407–416.

Wightman, F., & Kistler, D. J. (1989). Headphone simulation of free-field listening. II: Psychophysical validation. Journal of the Acoustical Society of America, 85(2), 868–878.

Willmore, B. D. B., & Smyth, D. (2003). Methods for first-order kernel estimation: Simple-cell receptive fields from responses to natural scenes. Network, 14, 553–577.

Winslow, R. L., & Sachs, M. B. (1987). Effect of electrical stimulation of the crossed olivocochlear bundle on auditory nerve response to tones in noise. Journal of Neurophysiology, 57(4), 1002–1021.

Winter, I. M. (2005). The neurophysiology of pitch. In C. J. Plack, A. J. Oxenham, R. R. Fay, & A. N. Popper (Eds.), Pitch: Neural coding and perception. New York, NY: Springer-Verlag.

Woods, K. J. P., & McDermott, J. H. (2015). Attentive tracking of sound sources. Current Biology, 25, 2238–2246.

Woods, T. M., Lopez, S. E., Long, J. H., Rahman, J. E., & Recanzone, G. H. (2006). Effects of stimulus azimuth and intensity on the single-neuron activity in the auditory cortex of the alert macaque monkey. Journal of Neurophysiology, 96(6), 3323–3337.

Woolley, S. M., Fremouw, T. E., Hsu, A., & Theunissen, F. E. (2005). Tuning for spectro-temporal modulations as a mechanism for auditory discrimination of natural sounds. Nature Neuroscience, 8(10), 1371–1379.

Yates, G. K. (1990). Basilar membrane nonlinearity and its influence on auditory nerve rate-intensity functions. Hearing Research, 50, 145–162.

Yin, T. C. T., & Kuwada, S. (2010). Binaural localization cues. In A. Rees & A. R. Palmer (Eds.), The Oxford handbook of auditory science: The auditory brain (pp. 271–302). New York, NY: Oxford University Press.

Young, E. D. (2010). Level and spectrum. In A. Rees & A. R. Palmer (Eds.), The Oxford handbook of auditory science: The auditory brain (pp. 93–124). New York, NY: Oxford University Press.

Zahorik, P., Bangayan, P., Sundareswaran, V., Wang, K., & Tam, C. (2006). Perceptual recalibration in human sound localization: Learning to remediate front-back reversals. Journal of the Acoustical Society of America, 120(1), 343–359.

Zahorik, P., & Wightman, F. L. (2001). Loudness constancy with varying sound source distance. Nature Neuroscience, 4(1), 78–83.

Zatorre, R. J. (1985). Discrimination and recognition of tonal melodies after unilateral cerebral excisions. Neuropsychologia, 23(1), 31–41.

Zatorre, R. J., & Belin, P. (2001). Spectral and temporal processing in human auditory cortex. Cerebral Cortex, 11, 946–953.

Zatorre, R. J., Belin, P., & Penhune, V. B. (2002). Structure and function of auditory cortex: Music and speech. Trends in Cognitive Sciences, 6(1), 37–46.