This article was downloaded by: [University of Toronto Libraries] On: 29 April 2014, At: 13:00 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Journal of New Music Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/nnmr20 Optimizing auditory images and distance metrics for selforganizing timbre maps Petri Toiviainen a a Department of Musicology , University of Jyväskylä , PL 35, Jyväskylä, FIN40351, Finland Phone: +358 41 601 353 Fax: +358 41 601 353 E- mail: Published online: 03 Jun 2008. To cite this article: Petri Toiviainen (1996) Optimizing auditory images and distance metrics for selforganizing timbre maps , Journal of New Music Research, 25:1, 1-30, DOI: 10.1080/09298219608570695 To link to this article: http://dx.doi.org/10.1080/09298219608570695 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. 
Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is



expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Journal of New Music Research, Vol. 25 (1996), pp. 1-30. 0929-8215/96/2501-001$12.00 © Swets & Zeitlinger

Optimizing Auditory Images and Distance Metrics for Self-Organizing Timbre Maps*

Petri Toiviainen

ABSTRACT

The effect of using different auditory images and distance metrics on the final configuration of a self-organized timbre map is examined by comparing distance matrices, obtained from simulations, with a similarity rating matrix, obtained using the same set of stimuli as in the simulations. Gradient images, which are intended to represent idealizations of physiological gradient maps in the auditory pathway, are constructed. The optimal auditory image and distance metric, with respect to the similarity rating data, are searched for using the gradient method.

INTRODUCTION

Timbre is defined by the American Standards Association as "that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar" (American Standard Acoustical Terminology S1.1-1960, p. 45). Aptly named "the psychoacoustician's multidimensional wastebasket category" by McAdams and Bregman (1985), it is very difficult to analyze in physical or mathematical terms. Any acoustic characteristic that does not contribute solely to the perception of pitch, loudness, or duration could contribute to the perception of timbre.

There have been a number of studies aiming at extracting the most salient acoustic attributes affecting the perception of timbre (e.g., Berger 1964; Saldanha & Corso 1964; Plomp & Steeneken 1971; Wedin & Goude 1972; Grey 1975; Plomp 1976; Grey 1977; Grey & Gordon 1978; Wessel 1979; Iverson & Krumhansl 1993). A method widely used in these studies is similarity rating (SR): subjects are asked to rate, on a given scale, the similarity of all possible pairs in the set of stimuli. Multidimensional scaling (MDS) (Kruskal 1964a, 1964b) is then used to map the tones into a low-dimensional space, frequently referred to as the timbre space. The mutual distances of the tones in the constructed timbre space approximate the respective similarity ratings. By examining MDS maps it has been found (Wedin & Goude 1972; Plomp 1976; Grey 1977; Wessel 1979; Iverson & Krumhansl 1993) that the spectral energy distribution in the steady-state portion of a tone is one of the main contributors to the perception of timbre: more specifically, one dimension in the timbre space corresponds to the "brightness" of tones.

*Sound examples are available in the JNMR Electronic Appendix (EA), which can be found on the WWW at http://www.swets.nl/jnmr/jnmr.html

Apart from the spectral energy distribution, dynamic attributes, mostly in the onset portion of a tone, have been shown to have a considerable influence on the perception of timbre (Berger 1964; Saldanha & Corso 1964; Grey 1977; Wessel 1979; Iverson & Krumhansl 1993). According to Grey (1977), the most salient dynamic attributes were the presence of synchronicity in the onset period of higher harmonics, and the presence of low-amplitude, high-frequency noise during the onset period. Wessel (1979) found that similarity ratings related to the quality of "bite" in the onset. Iverson and Krumhansl (1993) found that differences in amplitude envelopes contribute to similarity judgements; furthermore, the dynamic attributes are not only present at the onset, but also throughout the tones.

A timbre space can also be constructed from a set of acoustical signals by means of connectionist models; for this purpose, the Kohonen self-organizing map (KSOM) has been used by a number of researchers (Feiten, Frank & Ungvary 1991; De Poli, Prandoni & Tonella 1993; Toiviainen 1992; Cosi, De Poli & Lauzzana 1994; Feiten & Günzel 1994; Toiviainen, Kaipainen & Louhivuori 1995). In all these studies the tones were preprocessed, using a series of Short-Time Fourier Transforms (STFT) or advanced auditory modelling (AM), to convert the input signal from the time domain to the frequency domain. In this way a set of spectral image vectors was obtained; each component of a spectral image vector represents the spectral energy of the corresponding tone within a given frequency band and a given interval of time. The set of spectral images obtained in this way was then used as input to a KSOM, to obtain a bidimensional projective mapping of the set of stimuli.

Feiten et al. (1991) used as input 102 synthetic random sounds of six different categories. They found that the KSOM was able to map the sounds in agreement with the predefined categories. De Poli et al. (1993) used a three-dimensional version of the KSOM. As input they used the sound stimuli of Grey (1975), in order to reconstruct Grey's timbre space. According to them, the analogies with Grey's results were encouraging. Cosi et al. (1994) used an auditory model and the KSOM to map the sounds of 12 acoustic instruments in both clean and noisy conditions. The obtained map showed a topological organization which was found to agree with subjective classification of those sounds. Moreover, the KSOM was able to recognize noisy versions of the sounds. Toiviainen (1992) and Feiten and Günzel (1994) employed two hierarchical KSOMs. In both studies the dynamic sounds were treated as sequences of steady-state components. The first KSOM mapped the steady-state spectra; the obtained trajectories were then used as input to the second KSOM.

In a connectionist model aiming at timbre classification or recognition, it is necessary to extract the most significant parameters of the incoming sound signal by means of a preprocessing stage. In timbre and speech research, this has traditionally been based on methods like Short-Time Fourier Transform, Cepstrum, or Linear Predictive Coding (Rabiner & Schafer 1978; Feiten et al. 1991). All these methods are based on the analysis of a series of successive frames, and a quasi-periodic model of the signal, i.e., on the assumption that the properties of the signal do not change significantly within an analysis frame. This assumption may cause subtle dynamic phenomena (e.g., frequency sweeps, which may be important in the perception of timbre) to be discarded. More recent multiresolution analysis approaches, such as Wavelet Transform (Kronland-Martinet & Grossmann 1991), seem to alleviate this shortcoming, but they are still based on a mathematical transformation of the signal and thus are not adequate to represent a model of human auditory processing.

Recently, knowledge about the functioning of the human auditory system, including the dynamics of the basilar membrane, the mechanical response of hair cells, and the electrical response of auditory nerve fibres, has become more accurate. On the basis of this knowledge, a number of computational models of the auditory periphery have been developed (Meddis 1986; Ghitza 1986; Cohen 1989; Van Immerseel & Martens 1992; Brown 1992; Brown & Cooke 1994; Cosi et al. 1994). Van Immerseel and Martens (1992) used their auditory model for phonetic classification and segmentation of speech utterances; according to them, the use of an auditory model clearly improved the performance of the system, compared with traditional preprocessing strategies.

According to a common view, single auditory nerve fibres have qualitatively uniform properties. At later stages of the auditory pathway, many different cell types and regions can, however, be detected (see, e.g., Pickles 1982). Many experiments have demonstrated that, in the brain stem nuclei, certain features of the auditory signal are selectively extracted. For instance, cells have been found in the cochlear nuclei and the inferior colliculus which give particularly strong responses to certain dynamic features of time-varying signals.

There is evidence, psychological as well as physiological, that the human auditory system encodes frequency transitions (Kay & Matthews 1972; Gardner & Wilson 1979; Steiger & Bregman 1981) and uses the obtained frequency transition maps for processing incoming signals (Bregman & Dannenbring 1973; Ciocca & Bregman 1987). Physiological studies on other mammals (Suga 1965; Watanabe & Ohgushi 1968; Møller 1972; Mendelson & Cynader 1985; Shamma, Vranic & Wiser 1992) also support this assumption. Using the psychophysical paradigm of selective adaptation, Gardner and Wilson (1979) found evidence of channels in the human auditory system which are responsive to the direction of frequency modulation. Suga (1965) studied inhibitory areas of single auditory neurons in echo-locating bats using pure and frequency modulated (FM) tone pulses. He was able to identify areas which respond to FM-tone pulses in both the inferior colliculus and the cochlear nucleus.

Besides frequency transitions, further dynamic characteristics of sound are encoded in the low levels of the auditory system: in the cochlear nucleus there are cells responsive to onsets and offsets (Pickles 1982; Brown 1992). Brown has proposed that the onset and offset cells form a two-dimensional map, one dimension being frequency and the other excitatory or inhibitory delay, for onset and offset, respectively. Nelson, Erulkar and Bryan (1966) found neurons in the inferior colliculus which were sensitive to amplitude modulation and were specifically responsive to a certain speed or direction of modulation. In the cochlear nucleus, cells with similar response properties have been found by Møller (1978).

The auditory system has been found to represent the shape of the instantaneous acoustic spectrum in various ways. For instance, by measuring the distribution of inhibitory and excitatory responses to a set of acoustic stimuli across the surface of the primary auditory cortex of the ferret, Shamma et al. (1992) found that the distribution of responses encodes the locally averaged gradient of the acoustic spectrum. By a series of discrimination experiments done on human subjects, they found that the detection of a change in peak symmetry is both sensitive and independent of peak shape; this result can be viewed as a perceptual correlate of the physiological gradient map.

The above-mentioned studies support the general view that the auditory system specializes at an early stage of processing: it detects various features of the incoming signal already in the cochlear nucleus and inferior colliculus. It is unclear, however, which of these features, and to what extent, are important in timbre perception and identification. Connectionist modelling may offer one approach to this problem. For instance, the KSOM, described below, is able to extract the most salient features of a given input set. Using different kinds of auditory images as input to a self-organizing network, a set of various timbre maps can be obtained; these can then be compared with SR data obtained using the same set of timbres. From the results of these comparisons one might then conclude which features are essential in the perception of timbre.

During the self-organizing process of the KSOM, the input vectors are compared with the synaptic vectors of the network using a distance metric. Using different metrics, different aspects of the input vectors can be emphasized or de-emphasized. This then causes different timbre maps to emerge. For obvious reasons, the prevalent choice for distance metric in the KSOM simulations is the Euclidean metric: it corresponds to the way distances are measured in everyday life. The Euclidean metric, however, does not necessarily correspond to how the auditory system compares stimuli. In fact, using steady-state, harmonic tones, Plomp (1976, 93-96) found that the city-block distance (see below) of spectra yielded a higher correlation with perceived timbral distances than the Euclidean distance; the tones were represented as sound-pressure levels in 15 frequency bands. Feiten and Günzel (1993), again using steady-state, harmonic tones, utilized tone discrimination experiments in order to find a distance measure which would correlate with human auditory perception. They found that when the tones were represented as sound-pressure levels of the partials, the optimal distance measure was the city-block distance. Representing the tones as loudnesses (in phons) of the partials, and using the Minkowski metric with exponent $\lambda = 5$ (see below) yielded, however, the best results. Neither of the studies mentioned above examined tones with dynamic spectra; it may be that the optimum distance metric for such tones is different. Comparing self-organized timbre maps, obtained using various metrics, with the respective SR data could provide hints about timbral features that are emphasized or de-emphasized by the auditory system.

The goals of the present study were (1) to examine the effect of different auditory images and distance metrics on the final configuration of a self-organized timbre map; and (2) to explore whether the KSOM is able to project the used auditory images into two dimensions while maintaining the metrical relations between the images. The main method was that of calculating correlation coefficient values between distance matrices obtained from the simulations and a SR matrix obtained using the same set of tones. The overall aim was to find the auditory image and distance metric which would yield the highest correlation with the SR data. First, sets of auditory images were constructed by varying the degree of emphasis on the onset of tones, and combining spectral and gradient¹ images, i.e., images which were supposed to qualitatively represent responses of FM- and AM-sensitive neurons as well as neurons encoding the spectral gradient. Secondly, matrices of inter-image distances were calculated using various Minkowski metrics; the obtained distance matrices were compared with the SR matrix by calculating Pearson correlation coefficients. Thirdly, a series of KSOM simulations was carried out with the obtained sets of auditory images. Finally, matrices of response distances on the KSOM were calculated using two types of distance metrics; these were then compared with the SR matrix by means of the correlation coefficient. All computations were performed on a Macintosh IIci, using the Think C compiler.

MATERIALS

The tone material used in the experiments consisted of 27 tones. The tones were produced by additive synthesis, with a duration of 500 ms, a sampling rate of 20,000 Hz, a fundamental frequency of 440 Hz, and 16 partials with a harmonic frequency series. The synthesis algorithm was controlled by six variable parameters, chosen so that they would correspond to the most salient acoustic attributes of timbre (Grey 1977; Iverson & Krumhansl 1993). These parameters were brightness, attack time, attack asynchronity, sustain time, decay time, and time scaling. The values of the parameters were chosen so that they produced a wide range of different tones, including tones resembling piano, strings, brass, and woodwind instruments. A more thorough description of the synthesis algorithm, with the values of the parameters for each tone, can be found in Toiviainen et al. (1995).

SIMILARITY RATING EXPERIMENT

Nine subjects participated in the similarity rating experiment: seven students of musicology and two researchers at the University of Jyväskylä. The group of subjects consisted of five males and four females, all of whom had at least moderate skill on some musical instrument. The tones were reproduced, using 16-bit resolution, by a DigiDesign Sound Accelerator card, which was controlled by a Macintosh Quadra 700 computer. Two Audiowell MP 120 speakers were used. The computer controlled the presentation of stimuli and recorded the responses of subjects.

During the experiment, subjects heard each possible pair of tones, presented in a random order that differed for each subject. They were asked to rate the similarity of each pair on a scale between zero (for completely similar) and twelve (for very dissimilar). Since the order in which tones of a pair are presented in similarity rating experiments has been found to have little effect (Grey 1977; Iverson & Krumhansl 1993), it was ignored. The experiment thus yielded a 27 × 27 triangular SR matrix for each subject. In this study, the SR matrix obtained by averaging the SR matrices of all subjects was used; the standard deviations of the subjects' ratings for each pair of stimuli ranged from zero to 3.41, the mean of those values being 1.94. A more thorough description of the SR experiment can be found in Toiviainen et al. (1995).
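The averaging step can be illustrated with a minimal Python sketch. The function name `average_matrices` and the three-tone ratings are hypothetical and serve only as an illustration; they are not data from the experiment.

```python
def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices (lists of lists)."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)]
            for i in range(n)]

# Hypothetical ratings from two subjects for three tones (scale 0..12);
# only the upper triangle (i < j) carries data, the rest is zero.
subject_a = [[0, 4, 10], [0, 0, 7], [0, 0, 0]]
subject_b = [[0, 6, 12], [0, 0, 5], [0, 0, 0]]

mean_sr = average_matrices([subject_a, subject_b])
print(mean_sr[0][1], mean_sr[0][2], mean_sr[1][2])  # 5.0 11.0 6.0
```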

AUDITORY IMAGES

The timbre maps were constructed by a two-stage process (see Fig. 1). In the first stage, the auditory images were computed using an auditory model. In the second stage, the obtained auditory images were fed into a KSOM, which then projected them onto a two-dimensional map.


[Figure 1: sound signal → auditory model → auditory image → KSOM]

Fig. 1. The method used for constructing the timbre maps from sound signals.

Auditory model

To obtain auditory images of the tones, the stimuli were preprocessed using the peripheral part of an auditory model by Van Immerseel and Martens (1992), modified by Leman (1994) for musical purposes. The model takes into account the filtering of the outer and middle ear, the dynamics of the basilar membrane, the mechanical response of hair cells, and the electrical response of auditory nerve fibres.

• The sound transmission by the outer and middle ear chain is represented by a second-order low-pass filter with a resonance frequency of 4 kHz.


• The bandpass filtering in the cochlea is implemented as a bank of 20 asymmetric bandpass filters, the center frequencies of which are one critical band apart, in the range of 220 to 7075 Hz.

• The mechanical-to-neural transduction in the hair cells is described by a hair cell model in each of the 20 channels. This model includes the following features: (1) half-wave rectification: only the positive phase of the signal is captured by the hair cells; and (2) dynamic range compression: the model has a transition zone of 50 dB, with values for spontaneous and saturated firing rate being 0.05 and 0.15 spikes/msec, respectively.

• The auditory nerve transmission properties are modelled by a third-order low-pass filter; this is supposed to model the loss of phase-locking capabilities of the auditory nerve fibres at high frequencies. The original model (Van Immerseel & Martens 1991) used a filtering frequency of 250 Hz to allow a considerable down-sampling, making real-time speech analysis possible. In this study, a modified version of the model is used (Leman 1994), with a filtering frequency of 1250 Hz. This tallies better with what is known about the synchronization properties of the auditory neurons (e.g., Javel, McGee, Horst & Farley 1988).

The output of the auditory model is a 20-component vector, updated every 0.4 ms; each component is supposed to represent the probability of neural firing since the last update.

Minkowski metrics

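The Minkowski distance of two real-valued vectors, $\mathbf{p} = (\xi_1, \ldots, \xi_n)$ and $\mathbf{q} = (\eta_1, \ldots, \eta_n)$, is defined by

$$\|\mathbf{p}-\mathbf{q}\|_\lambda = \left( \sum_{i=1}^{n} |\xi_i - \eta_i|^{\lambda} \right)^{1/\lambda}. \tag{1}$$

By varying the value of the Minkowski exponent $\lambda$, a wide range of different metrics can be achieved. The value $\lambda = 2$ yields the Euclidean distance:

$$\|\mathbf{p}-\mathbf{q}\|_2 = \left( \sum_{i=1}^{n} (\xi_i - \eta_i)^2 \right)^{1/2}, \tag{2}$$

while the value $\lambda = 1$ yields the city-block distance:

$$\|\mathbf{p}-\mathbf{q}\|_1 = \sum_{i=1}^{n} |\xi_i - \eta_i|. \tag{3}$$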


The value of the Minkowski exponent is not restricted to integers; it can have any positive value. When $\lambda \gg 1$, the mutual distance of two vectors is mainly determined by the largest difference present in any component; when $\lambda \ll 1$, it is mainly determined by the number of components for which the vectors differ significantly, irrespective of how large these differences might be.
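The effect of the exponent can be checked numerically. The following Python sketch is an illustration only (the function name `minkowski_distance` and the test vectors are assumptions, not part of the study): it evaluates the Minkowski distance for a vector pair that differs strongly in one component and for a pair that differs slightly in several components.

```python
def minkowski_distance(p, q, lam):
    """Minkowski distance between equal-length sequences p and q."""
    if lam <= 0:
        raise ValueError("the Minkowski exponent must be positive")
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** lam for a, b in zip(p, q)) ** (1.0 / lam)

p = [0.0, 0.0, 0.0, 0.0]
q_one_large = [3.0, 0.0, 0.0, 0.0]   # one large difference
q_many_small = [1.0, 1.0, 1.0, 0.0]  # three small differences

for lam in (0.5, 1.0, 2.0, 8.0):
    print(lam,
          minkowski_distance(p, q_one_large, lam),
          minkowski_distance(p, q_many_small, lam))
```

For the small exponent the pair with many small differences comes out farther apart, and for the large exponent the pair with one large difference does. Note that for exponents below 1 the triangle inequality no longer holds, so the quantity is then a dissimilarity measure rather than a metric in the strict sense.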

Auditory images: continuous domain

This section contains a mathematical formulation, in continuous domain, of how the onsets of the spectral images can be emphasized, and how the gradient images can be constructed. The auditory model used in this study in fact produces a discrete output; the aim of this section is, however, to give a general description, which in principle could be applied to an auditory model producing a continuous output.

The spectral image $S$ of a tone having a duration $T$ can be expressed as a scalar-valued function on the subset $[f_0, f_1] \times [0, T]$ of the frequency-time plane:

$$S: s(f,t), \quad f \in [f_0, f_1], \; t \in [0, T]. \tag{4}$$

Here $f_0$ and $f_1$ denote the minimum and maximum of the frequencies included in the image, respectively; $s(f,t)$ is obtained by temporally integrating the output of the auditory model:

(5)

Here $p(f,t)$ denotes the neural firing probability of a neuron with a best frequency $f$ within a time interval $[t - dt, t]$; $\tau$ denotes the time constant. The integration is necessary in order to smooth the amplitude modulations which are present in the neural firing probabilities.

The onset of the spectral image is emphasized by defining an exponential mapping of time:

$$\tau_k(t) = \begin{cases} T\,\dfrac{e^{kt}-1}{e^{kT}-1}, & k > 0 \\[1ex] t, & k = 0 \end{cases} \tag{6}$$

and a mapping on the frequency-time plane

$$\sigma_k: \sigma_k(f,t) = (f, \tau_k(t)). \tag{7}$$

The exponential parameter $k$ defines the degree of emphasis on the onset; subsequently, it is referred to as the emphasis parameter. When $k = 0$, there is no emphasis on the onset; the greater the value of $k$, the stronger the onset is emphasized (see Fig. 2). The onset-emphasized spectral image, $S_k$, is now defined to be

$$S_k: s \circ \sigma_k(f,t), \quad f \in [f_0, f_1], \; t \in [0, T]. \tag{8}$$

The physiological gradient maps are modelled by means of directional derivatives. For a given direction vector $\mathbf{e} = (e_f, e_t)$ on the frequency-time plane, the rate of change of the function $s \circ \sigma_k$ along the direction of $\mathbf{e}$ is expressed by its directional derivative, defined by

$$\gamma_k^{\mathbf{e}} = \mathbf{e} \cdot \nabla (s \circ \sigma_k). \tag{9}$$

When $\mathbf{e}$ is parallel to the $f$-axis, the value of the corresponding directional derivative yields the frequency derivative of the spectrum at a given point of the frequency-time plane; $\gamma_k^{\mathbf{e}}$ can, thus, be regarded as an idealization of the physiological spectral gradient map (cf. Shamma et al. 1992). When $\mathbf{e}$ is parallel to the $t$-axis, $\gamma_k^{\mathbf{e}}$ yields the temporal change of the image at a given frequency and can, thus, be seen as an idealization of responses of AM-sensitive cochlear nucleus cells (cf. Nelson et al. 1966; Møller 1978). When $\mathbf{e}$ is parallel to neither of the axes, the corresponding directional derivative can be seen to represent the response of FM-sensitive neurons.

Fig. 2. Emphasizing the onset in the auditory images by means of an exponential mapping of time.

For a given $\mathbf{e}$ and $k$, the corresponding gradient image, $\Gamma_k^{\mathbf{e}}$, is now defined to be

$$\Gamma_k^{\mathbf{e}}: \gamma_k^{\mathbf{e}}(f,t), \quad f \in [f_0, f_1], \; t \in [0, T]. \tag{10}$$

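An illustration of gradient images is presented in Fig. 3.

For a set $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n\}$ of direction vectors on the frequency-time plane, a composite auditory image can be constructed as a vector-valued function, defined by

$$\Phi_k: \varphi_k(f,t) = \left( \alpha_s\, s \circ \sigma_k(f,t),\; \alpha_1 \gamma_k^{\mathbf{e}_1}(f,t),\; \ldots,\; \alpha_n \gamma_k^{\mathbf{e}_n}(f,t) \right), \quad f \in [f_0, f_1], \; t \in [0, T]. \tag{11}$$

The $\alpha$'s are the weighting factors of the corresponding images.

The Minkowski distance of two (vector-valued) auditory images $\Phi_1$ and $\Phi_2$ is defined, analogously to Eq. (1), by

$$\|\Phi_1 - \Phi_2\|_\lambda = \left( \iint \|\varphi_1(f,t) - \varphi_2(f,t)\|_\lambda^{\lambda} \, df\, dt \right)^{1/\lambda}, \tag{12}$$

where $\|\varphi_1(f,t) - \varphi_2(f,t)\|_\lambda$ is the Minkowski norm of vector $\varphi_1(f,t) - \varphi_2(f,t)$, and the integration is carried out over $[f_0, f_1] \times [0, T]$.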

Auditory images: discrete domain

The spectral images which are presented to the KSOM are finite-dimensional vectors. Therefore, each component of a spectral image vector can be considered to be a sample of a continuous auditory image, defined in the previous section. This section describes how the auditory image vectors were formed.

Analogously to Eq. (5), the neural firing patterns $\mathbf{f}_i$, obtained from the auditory model, were temporally integrated using a leaky integrator defined by the equation

$$\mathbf{s}_{i+1} = c\,\mathbf{s}_i + (1-c)\,\mathbf{f}_{i+1}. \tag{13}$$


Here $\mathbf{s}_i$ is the state of the integrator at time $i$, and the decay parameter $c$ is defined by

$$c = e^{-\Delta T/\tau}, \tag{14}$$

where $\Delta T$ is the updating interval of the neural firing patterns, and $\tau$ the time constant; the value used for $\tau$ was 20 ms. The state of the integrator, which is, thus, a 20-dimensional vector, was sampled 20 times during a period of 500 ms. This gave rise to a 400-component vector, representing the spectral image of the stimulus. The sampling instants $\upsilon_i$ were defined by the equation

$$\upsilon_i = \operatorname{int}\!\left( \frac{\tau_k(iT/N)}{\Delta T} \right), \tag{15}$$

where $\tau_k(\cdot)$ is the exponential mapping of time defined in Eq. (6); $N$ is the total number of samples taken; $T$ is the duration of tones; and $\operatorname{int}(\cdot)$ returns the integer part of the argument. Similarly to Eq. (6), parameter $k$ defines the degree of emphasis on the onset. When $k = 0$, there is no emphasis on the onset; the greater the value of $k$, the stronger the onset is emphasized (see Fig. 2).
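The construction of a spectral image vector can be sketched in Python as follows. This is an illustration under stated assumptions, not the original Think C implementation: the exponential mapping is assumed to have the form T(e^{kt} - 1)/(e^{kT} - 1), the integrator is assumed to start from a zero state, sample positions are treated as zero-based indices into the sequence of integrator states, and an index that falls one past the last state is clamped. All function names, the constant test input, and the value k = 0.005 per ms are hypothetical; times are in milliseconds.

```python
import math

def exp_time_map(t, k, T):
    """Assumed exponential mapping of time; reduces to t when k == 0."""
    if k == 0:
        return t
    return T * (math.exp(k * t) - 1.0) / (math.exp(k * T) - 1.0)

def sampling_indices(k, T, N, dT):
    """Integer sample positions: int(mapped time / updating interval)."""
    return [int(exp_time_map(i * T / N, k, T) / dT) for i in range(1, N + 1)]

def leaky_integrate(frames, dT, tau):
    """Leaky integrator s[i+1] = c*s[i] + (1-c)*f[i+1], c = exp(-dT/tau)."""
    c = math.exp(-dT / tau)
    state = [0.0] * len(frames[0])  # assumption: zero initial state
    states = []
    for frame in frames:
        state = [c * s + (1.0 - c) * f for s, f in zip(state, frame)]
        states.append(state)
    return states

def spectral_image(frames, k, T=500.0, N=20, dT=0.4, tau=20.0):
    """Concatenate N sampled integrator states into one image vector."""
    states = leaky_integrate(frames, dT, tau)
    last = len(states) - 1
    image = []
    for idx in sampling_indices(k, T, N, dT):
        image.extend(states[min(idx, last)])  # clamp an out-of-range index
    return image

# Synthetic stand-in for the auditory model output: 1250 frames of
# 20 channels (500 ms at one frame per 0.4 ms), constant input.
frames = [[1.0] * 20 for _ in range(1250)]
image = spectral_image(frames, k=0.005)
print(len(image))  # 400
```

With k = 0 the sample positions are evenly spaced; with k > 0 they crowd towards the beginning of the tone, which is how the onset receives more weight in the resulting 400-component vector.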

The obtained spectral image vector can be described as a matrix

$$S = (s_{ij}) \equiv (s(f_j, \upsilon_i)), \tag{16}$$

where $s(f_j, \upsilon_i)$ denotes the value of component $j$ of the integrator at sampling instant $i$. From matrix $S$, approximations of four directional derivatives were constructed by calculating matrices as follows:

Fig. 3. An illustration of spectral and gradient images of tone 24 of the set of stimuli used (see Toiviainen, Kaipainen & Louhivuori 1995). In each of the subfigures the horizontal axis stands for time and the vertical axis for frequency. a) The spectral image; the gradient images for direction vectors b) $\mathbf{e} = (0,1)$ (parallel to the $f$-axis); c) $\mathbf{e} = (1,0)$ (parallel to the $t$-axis); d) $\mathbf{e} = (1/\sqrt{2}, 1/\sqrt{2})$ (45 degrees counterclockwise from the $t$-axis); and e) $\mathbf{e} = (1/\sqrt{2}, -1/\sqrt{2})$ (45 degrees clockwise from the $t$-axis). Dark areas denote high values.


(20)

matrix components for which the corresponding $s_{ij}$'s are undefined were set to zero. These matrices were considered idealizations of the spectral gradient, amplitude modulation, and upward and downward FM maps, respectively; they will subsequently be referred to as the f-gradient, the t-gradient, the upward-gradient, and the downward-gradient, respectively.
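As a rough illustration of how such gradient matrices can be formed, the Python sketch below uses plain first differences between neighbouring matrix components. The difference scheme, the function name `gradient_matrices`, and the dictionary keys are assumptions made for this example; the scheme actually used in the study may differ (for instance in normalization of the diagonal differences). Rows are taken to index sampling instants and columns to index frequency channels, and components whose neighbour falls outside the matrix are set to zero.

```python
def gradient_matrices(S):
    """Four first-difference matrices of S (rows: time, columns: frequency)."""
    n_t, n_f = len(S), len(S[0])

    def diff(di, dj):
        G = [[0.0] * n_f for _ in range(n_t)]
        for i in range(n_t):
            for j in range(n_f):
                ii, jj = i + di, j + dj
                if 0 <= ii < n_t and 0 <= jj < n_f:
                    G[i][j] = S[ii][jj] - S[i][j]
                # otherwise the neighbour is undefined: leave zero
        return G

    return {
        "f": diff(0, 1),      # change along frequency
        "t": diff(1, 0),      # change along time
        "up": diff(1, 1),     # later in time, higher in frequency
        "down": diff(1, -1),  # later in time, lower in frequency
    }

# Small synthetic image: value rises by 2 per time step and 3 per channel.
S = [[2.0 * i + 3.0 * j for j in range(4)] for i in range(3)]
G = gradient_matrices(S)
print(G["f"][0][0], G["t"][0][0], G["up"][1][1], G["down"][0][1])
```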

Following (11), the composite auditory image C was defined by

C S (c,) = («v^/,a^,a^,a2gf)); (21),)

$C$ can, thus, be described as a third-order matrix. The Minkowski distance between two auditory images, $C_1$ and $C_2$, was defined, analogously to Eq. (12), by

(22)

where $\|c^{1}_{ij} - c^{2}_{ij}\|_\lambda$ is the Minkowski norm of vector $c^{1}_{ij} - c^{2}_{ij}$.
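As an illustration of how such a distance can be evaluated, the following sketch stacks weighted sub-images into a third-order array and takes the Minkowski norm of the difference of two such arrays. It assumes the usual form of the Minkowski distance, in which the $\lambda$-th powers of all component differences are summed and the $\lambda$-th root is taken; the function names and the array layout are hypothetical.

```python
import numpy as np

def composite_image(S, gradient_images, alphas):
    """Stack weighted sub-images along a third axis.

    alphas[0] weights the spectral image S; alphas[1:] weight the
    gradient images in the order given.
    """
    layers = [alphas[0] * np.asarray(S, dtype=float)]
    for alpha, G in zip(alphas[1:], gradient_images):
        layers.append(alpha * np.asarray(G, dtype=float))
    return np.stack(layers, axis=-1)

def minkowski_distance(C1, C2, lam):
    """Minkowski distance with exponent lam between two composite images.

    Summing |difference|**lam over every component is the same as summing
    the lam-th powers of the per-pixel Minkowski norms.
    """
    diff = np.abs(np.asarray(C1, dtype=float) - np.asarray(C2, dtype=float))
    return float(np.sum(diff ** lam) ** (1.0 / lam))
```

With `lam=1` this gives the city-block distance and with `lam=2` the Euclidean distance.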

Comparing the auditory images with similarity ratings

In order to compare the auditory images with the SR data, a triangular distance matrix $D = (d_{ij})$ was constructed, for each set of 27 auditory images, using the definition

$$d_{ij} = \begin{cases} \|C_i - C_j\|_\lambda, & i < j \\ 0, & i \ge j \end{cases} \qquad (23)$$

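The similarity of the obtained distance matrix with the SR data was then evaluated by calculating the Pearson correlation coefficient corr(D,S) according to the formula

$$\mathrm{corr}(D,S) = \frac{\sum_{i<j} (d_{ij} - \bar{d})(s_{ij} - \bar{s})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \, \sum_{i<j} (s_{ij} - \bar{s})^2}}. \qquad (24)$$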


Here the summations are carried out over matrix components with $i < j$, and $\bar{d}$ and $\bar{s}$ denote the means of those components.
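A minimal sketch of this comparison, restricted to the components with $i < j$ as stated above, might look as follows. The function name `triangular_corr` is hypothetical, and both arguments are assumed to be square arrays of the same size.

```python
import numpy as np

def triangular_corr(D, S):
    """Pearson correlation between the i < j components of two matrices."""
    D = np.asarray(D, dtype=float)
    S = np.asarray(S, dtype=float)
    upper = np.triu_indices(D.shape[0], k=1)  # index pairs with i < j
    d = D[upper] - D[upper].mean()            # centre on the mean of those components
    s = S[upper] - S[upper].mean()
    return float(np.sum(d * s) / np.sqrt(np.sum(d ** 2) * np.sum(s ** 2)))
```

Because both variables are centred and normalized, the value is unchanged when either matrix is scaled by a positive constant or shifted by a constant offset.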

Since the components of matrix $D$ depend on the values of the Minkowski exponent $\lambda$ as well as the parameters $k$, $\alpha_s$, $\alpha_f$, $\alpha_t$, $\alpha_{c1}$, and $\alpha_{c2}$, the correlation coefficient corr(D,S) can be described as a function of those parameters. To find the optimal auditory image and distance metrics, the problem is to find parameter values $p^* = (\lambda^*, k^*, \alpha_s^*, \alpha_f^*, \alpha_t^*, \alpha_{c1}^*, \alpha_{c2}^*)$ which maximize the value of this function. Obviously, corr(D,S) is continuous and differentiable with respect to all the parameters. Its maximum value can, thus, be found using the gradient ascent method (see, e.g., Kohonen 1989): starting with an arbitrary initial point in the parameter space, $p_0 = (\lambda^0, k^0, \alpha_s^0, \alpha_f^0, \alpha_t^0, \alpha_{c1}^0, \alpha_{c2}^0)$, the local maximum of corr(D,S) can be found by constructing a sequence $(p_i)$ of parameter values:

$$p_{i+1} = p_i + g \, \nabla_p \, \mathrm{corr}(D,S) \big|_{p = p_i}; \qquad (25)$$

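here $g$ is the gain factor and $\nabla_p$ the mathematical gradient operator in the parameter space, defined by

$$\nabla_p = \left( \frac{\partial}{\partial \lambda}, \frac{\partial}{\partial k}, \frac{\partial}{\partial \alpha_s}, \frac{\partial}{\partial \alpha_f}, \frac{\partial}{\partial \alpha_t}, \frac{\partial}{\partial \alpha_{c1}}, \frac{\partial}{\partial \alpha_{c2}} \right). \qquad (26)$$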

With a suitable choice for the initial point, the sequence defined by Eq. (25) converges to the point in parameter space which yields the global maximum of corr(D,S).

The gradient method described above was used in order to find the set of parameter values $(\lambda^*, k^*, \alpha_s^*, \alpha_f^*, \alpha_t^*, \alpha_{c1}^*, \alpha_{c2}^*)$ which would yield the highest correlation between the auditory image distance matrix and the similarity rating matrix. Since it would have been computationally very expensive to recalculate the spectral images every time the value of the exponential parameter $k$ was changed, this parameter was kept constant in the calculations; to compensate for this, the gradient method was applied with several discrete values of $k$, which formed a dense grid.
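The iteration of Eq. (25) can be sketched as below. How the derivatives were evaluated in the study is not stated; the central finite differences, the step size `h`, the fixed number of iterations, and the function names are assumptions of this sketch. The objective `f` stands for any function of the parameter vector, such as corr(D,S) evaluated for fixed $k$.

```python
import numpy as np

def numerical_gradient(f, p, h=1e-5):
    """Central-difference estimate of the gradient of f at point p."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for n in range(p.size):
        step = np.zeros_like(p)
        step[n] = h
        grad[n] = (f(p + step) - f(p - step)) / (2.0 * h)
    return grad

def gradient_ascent(f, p0, gain=0.1, n_iter=200):
    """Iterate p <- p + gain * grad f(p), starting from p0."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(n_iter):
        p = p + gain * numerical_gradient(f, p)
    return p
```

Holding $k$ on a grid, as described above, would correspond to running `gradient_ascent` once per grid value with an objective in which $k$ is fixed, and keeping the run with the highest final value. Like any gradient method, the sketch finds a local maximum only; the outcome depends on the initial point.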

KOHONEN MAP

In the cerebral cortex, several kinds of ordered feature maps (for instance, somatosensory maps connected with the sense of touch (e.g., Knudsen et al. 1987) and the movement of muscles; tonotopic mapping in the primary auditory cortex (e.g., Lauter et al. 1985); and retinotectal mapping onto primary visual cortex (e.g., Pearlman 1985)) have been found. These maps are two- or three-dimensional projective representations of the received stimuli, containing information about the most salient features and their interrelationships. It is commonly believed that the cortical feature maps originate from self-organization. There exists a fairly well understood and demonstrated theory of self-organization (Kohonen 1989); according to Kohonen, lateral inhibition and redistribution of synaptic resources are responsible for self-organization in biological systems.

Kohonen has formalized the process of self-organization into a simple, yet effective, numerical algorithm. Given a set of input vectors in a multidimensional vector space, the Kohonen self-organizing map (KSOM) identifies the most salient features, i.e., the dimensions with highest variance, of the input set, and maps those features onto a 2-dimensional space, while retaining the topological relationships of the input vectors.

The Kohonen network consists of (1) $n$ input neurons, each having a specified activation level $a_i$. The input to the network is, thus, an $n$-dimensional vector $a = (a_1, a_2, ..., a_n)$; (2) $m$ output neurons receiving activation from the input neurons. The output neurons usually form a planar array. Each output neuron is identified by its coordinates in the array; and (3) connections from each input neuron to each output neuron. A weight $w_{ij}$ is associated with the connection from input neuron $i$ to output neuron $j$. The connections to output neuron $j$ can thus be represented by an $n$-dimensional vector $w_j = (w_{1j}, w_{2j}, ..., w_{nj})$.

There are several variants of the Kohonen learning algorithm; the one used in this study is identical to that described in Toiviainen et al. (1995), except that the Minkowski distance was used when comparing the input and weight vectors.

Distance metrics on KSOM

During the ordering of the KSOM, there are two distance metrics involved. One is defined in the space of input and weight vectors; by means of this metric, the similarity of the input and weight vectors is evaluated. The other is defined on the surface of the map; each neuron is associated with a position vector, and the interneuronal distances are calculated using this metric. For instance, the neighbourhood function, utilized when adapting the weight vectors, makes use of the latter metric.

One of the goals of this study is to find out to what extent the KSOM is able to project auditory images onto a lower-dimensional space without distorting their metrical relationships. Therefore, the final configuration of the KSOM is compared with the SR data by calculating distance matrices obtained from the map's responses to the input vectors rather than from the weight vectors.

To obtain distance matrices, two different distance metrics are used. First, the locus of response to each input vector is defined to be the position of the neuron whose weight vector is closest to the input vector in question; the vectors are compared using the Minkowski metric. The distance between two responses is then defined as being the Euclidean distance between the loci of response. In this paper, this is referred to as the response focus metric (RFM).
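The response focus metric can be sketched as follows, assuming that the weight vectors are stored in an array of shape (rows, columns, n) and that the position of a neuron is its (row, column) index in the planar array; the function names and this storage layout are illustrative only.

```python
import numpy as np

def best_matching_unit(v, weights, lam):
    """Grid position of the neuron whose weight vector is closest to v."""
    diff = np.abs(np.asarray(weights, dtype=float) - np.asarray(v, dtype=float))
    dist = np.sum(diff ** lam, axis=-1) ** (1.0 / lam)  # Minkowski distance per neuron
    return np.unravel_index(np.argmin(dist), dist.shape)

def rfm_distance(v1, v2, weights, lam):
    """Euclidean distance on the map between the responses to v1 and v2."""
    r1 = np.array(best_matching_unit(v1, weights, lam), dtype=float)
    r2 = np.array(best_matching_unit(v2, weights, lam), dtype=float)
    return float(np.linalg.norm(r1 - r2))
```

Two inputs that excite the same neuron have an RFM distance of zero, so this metric cannot resolve differences finer than the grid spacing.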

The second metric takes into account the configuration of the whole map. This is done by defining an activation function and calculating the centroid of the activation pattern obtained in this way. For a given input vector $v$, the activation level of neuron $i$, having a weight vector $w_i$, is given by

$$a_i = \begin{cases} 1 - \dfrac{\|v - w_i\|}{\|w_i\|}, & \|v - w_i\| \le \|w_i\| \\ 0, & \|v - w_i\| > \|w_i\| \end{cases} \qquad (27)$$

Here $\|\cdot\|$ denotes the Minkowski norm with the same exponent that was used in training. According to Eq. (27), an input vector which is identical with the weight vector yields the maximum activation value, $a_i = 1$. The activation value decreases linearly when the distance between $v$ and $w_i$ is increased and is zero when $\|v - w_i\| > \|w_i\|$. The locus of response $c$ to an input vector is defined to be the centroid of the activation pattern:

$$c = \frac{\sum_i a_i r_i}{\sum_i a_i}, \qquad (28)$$

where $r_i$ is the position vector of neuron $i$, and $a_i$ is its activation level. The response distances are then defined as Euclidean distances between the loci of response. In this paper, this is referred to as the centroid-of-activation metric (CAM).
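A sketch of the centroid-of-activation computation is given below. It assumes an activation that equals 1 for an input identical with the weight vector, falls linearly with the Minkowski distance, and is clipped at zero once the distance exceeds the norm of the weight vector, which is the behaviour described in the text. The storage layout (rows, columns, n), the treatment of zero weight vectors, and the function names are assumptions.

```python
import numpy as np

def minkowski_norm(x, lam):
    """Minkowski norm along the last axis."""
    return np.sum(np.abs(x) ** lam, axis=-1) ** (1.0 / lam)

def activation(v, weights, lam):
    """Activation of every neuron for input v, clipped to be non-negative."""
    weights = np.asarray(weights, dtype=float)
    dist = minkowski_norm(weights - np.asarray(v, dtype=float), lam)
    w_norm = minkowski_norm(weights, lam)
    # Neurons with a zero weight vector get an infinite ratio, hence zero activation.
    ratio = np.divide(dist, w_norm, out=np.full_like(dist, np.inf), where=w_norm > 0)
    return np.clip(1.0 - ratio, 0.0, None)

def centroid_of_activation(v, weights, lam):
    """Activation-weighted mean of the neuron positions, or None if no neuron is active."""
    a = activation(v, weights, lam)
    total = a.sum()
    if total == 0.0:
        return None
    rows, cols = np.indices(a.shape)
    return np.array([np.sum(a * rows), np.sum(a * cols)]) / total
```

The CAM distance between two stimuli is then the Euclidean distance between their centroids, for example `np.linalg.norm(c1 - c2)`. Unlike the response focus, the centroid can fall between grid positions.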

Comparing timbre maps with SR's

The similarity of the obtained matrices of inter-response distances and the SR data was then evaluated in the same manner as that of the auditory image distance matrices, i.e., by calculating the Pearson correlation coefficient between the two matrices. The correlations obtained using the two distance metrics defined on the KSOM will subsequently be referred to as the RFM and CAM correlations, whereas the correlations obtained using the auditory image distance matrices will be referred to as the AIM correlations.


RESULTS

Auditory images

To begin with, the value of corr(D,S) was calculated with parameter values $\lambda = 2$, $k = 0$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$. Thus, the Euclidean metric was used for calculating the distances between auditory images; the onsets of the tones were not emphasized; and no gradient images were included. This yielded the value corr(D,S) = 0.677. To examine the effect of the metric used, the correlation values were calculated with Minkowski exponents $\lambda = 1, 2, ..., 10$, while keeping the values of the other parameters unchanged. The result is presented graphically in Fig. 4. The graph shows that the maximum correlation, corr(D,S) = 0.716, was obtained with $\lambda = 5$. This is concordant with Feiten and Günzel's (1993) results; they found that representing the partial tones as loudnesses and using the Minkowski metric with $\lambda = 5$ yielded the best correspondence with results from tone discrimination experiments.

Fig. 4. The value of corr(D,S) as a function of the Minkowski exponent $\lambda$. The other parameters have the values $k = 0$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$.

Next, the effect of emphasizing the onset was explored. This was done by computing the value of corr(D,S) with emphasis parameter values $0 \le k \le 0.48$, using steps of 0.02; the other parameters had constant values $\lambda = 2$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$. The Euclidean metric was thus used, and no gradient images were included. The result of this calculation is depicted in Fig. 5. As is evident from this graph, emphasizing the onset to the right extent increased the correlation significantly; the maximum value, corr(D,S) = 0.866, was achieved with $k = 0.20$. This result is readily understandable in the light of the studies which have demonstrated the importance of onset in timbre classification and recognition.

Fig. 5. The value of corr(D,S) as a function of the emphasis parameter $k$. The other parameters have the values $\lambda = 2$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$.

At the next stage, both the Minkowski exponent $\lambda$ and the emphasis parameter $k$ were varied. The ranges of their values were $1 \le \lambda \le 10$ and $0 \le k \le 0.48$, with steps of 1 and 0.02, respectively. The other parameters were still kept constant, $\alpha_s = 1$, $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$. The values of corr(D,S) as a function of parameters $\lambda$ and $k$ are presented in Fig. 6. This graph shows that varying the emphasis parameter affected the correlation qualitatively in the same way, regardless of the value of the Minkowski exponent; the overall correlation values, however, tended to decrease with increasing $\lambda$. The maximum correlation value, corr(D,S) = 0.868, was obtained with $\lambda = 1$ and $k = 0.20$. Replacing the Euclidean metric with the city-block metric yielded, thus, a slightly higher correlation.

Fig. 6. The value of corr(D,S) as a function of the Minkowski exponent $\lambda$ and the emphasis parameter $k$. The other parameters have the values $\alpha_s = 1$ and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$.

The significance of gradient images in timbre perception was first studied by constructing four sets of auditory images, each consisting of one gradient image alone. In other words, for each set of images, the weighting factor of one gradient image was set to 1, while the other weighting factors were set to zero. With these sets of auditory images, the values of corr(D,S) were computed using the parameter values $1 \le \lambda \le 5$ and $0 \le k \le 0.48$. It was found that in each case corr(D,S) decreased when either the Minkowski exponent or the emphasis parameter was increased. Moreover, the overall correlation values were significantly lower than those in Fig. 6. The maximum values of corr(D,S) for each gradient image, with the respective values of $\lambda$ and $k$, are presented in Table 1. For the sake of comparison, those values are also presented for the spectral image. The correlation value for each of the gradient images is significantly lower than that for the spectral image; in the light of this result, none of the gradient images is the chief factor in the perception of timbre.

Table 1. Maximum values of corr(D,S) for five auditory images, with corresponding values for $\lambda$ and $k$. $S$ denotes the spectral image, while $G_f$, $G_t$, $G_{c1}$, and $G_{c2}$ denote the gradient images.

Auditory image    corr(D,S)    $\lambda$    $k$
$S$               0.868        1            0.20
$G_f$             0.788        1            0.30
$G_t$             0.683        1            0.28
$G_{c1}$          0.800        1            0.30
$G_{c2}$          0.740        1            0.12

Finally, the gradient method was used in order to find the optimal combined auditory image and distance metric; that is, a combination of parameter values $p^* = (\lambda^*, k^*, \alpha_s^*, \alpha_f^*, \alpha_t^*, \alpha_{c1}^*, \alpha_{c2}^*)$ which would yield the maximum value of corr(D,S). Since in the calculation of the Pearson correlation coefficient the variances of the two variables are normalized, multiplying the $\alpha$'s by a constant does not alter the correlation. Therefore, one of the $\alpha$'s can be kept constant. The spectral image being undoubtedly the most important factor in the perception of timbre, a natural choice was to fix $\alpha_s$ to the value 1. Moreover, the search for the maximum was started with zero values for the other $\alpha$'s. In a series of calculations, the initial values for the Minkowski exponent and the emphasis parameter were chosen from the neighbourhood of the point $\lambda = 1$, $k = 0.20$, i.e., the point of maximum correlation in Fig. 6. The maximum value achieved was corr(D,S) = 0.882; this was obtained with the parameter values $\lambda = 0.690$, $k = 0.21$, $\alpha_f = 1.510$, $\alpha_t = 0.010$, $\alpha_{c1} = 0.262$, and $\alpha_{c2} = 0.000$. As a result of adding gradient images and allowing the Minkowski exponent to change continuously, a slight increase in correlation was thus achieved, the main contributors to this being parameters $\lambda$ and $\alpha_f$.
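The invariance that allows one weight to be fixed can be written out in a few steps. The following is a sketch of the argument, in which $c > 0$ is a common scaling factor introduced here only for illustration; it relies on the homogeneity of the Minkowski norm, so that scaling every weight scales every distance by the same factor.

```latex
\begin{align*}
\alpha \mapsto c\,\alpha
  &\;\Rightarrow\; C_i \mapsto c\,C_i
  \;\Rightarrow\; d_{ij} = \|C_i - C_j\|_\lambda \mapsto c\,d_{ij},
  \qquad \bar{d} \mapsto c\,\bar{d}, \\
\mathrm{corr}(cD, S)
  &= \frac{\sum_{i<j} c\,(d_{ij} - \bar{d})(s_{ij} - \bar{s})}
          {\sqrt{c^{2} \sum_{i<j} (d_{ij} - \bar{d})^{2} \, \sum_{i<j} (s_{ij} - \bar{s})^{2}}}
   = \mathrm{corr}(D, S).
\end{align*}
```

Only the ratios of the weights therefore matter, which is why the search can proceed with $\alpha_s = 1$ held fixed.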

Figure 7 presents a summary of the correlation values between the auditory image distance matrices and the SR matrix, achieved in the earlier stages of this study. This graph indicates that, compared with the initial value 0.677, an increase of correlation by 0.205 was achieved, the main contributor to this being the emphasizing of onset.

Timbre maps

Using various sets of auditory images as input, a series of KSOM simulations was carried out. In each simulation, the input data consisted of 27 vectors, whose dimension was 400, unless otherwise indicated. For comparing the input and weight vectors during the training procedure, the Minkowski metric was used. In each simulation, a 12 × 12 Kohonen network was trained for 50,000 cycles, using a constant learning rate of 0.05. In the beginning, the radius of the topological neighbourhood was 6, and it was linearly decreased after every cycle so as to reach zero at the end of the training. From the obtained timbre maps, distance matrices were calculated using both the response focus metric (RFM) and the centroid-of-activation metric (CAM). These matrices were then, similarly to those obtained from the auditory images, compared with the SR data by calculating the Pearson correlation.
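A training loop with the settings quoted above can be sketched as follows. The learning rule itself is defined in Toiviainen et al. (1995) and is not reproduced here, so several details are assumptions of this sketch: the random initialization of the weights, the random order of presentation, and a neighbourhood in which all neurons within the current radius are updated equally. The function name `train_ksom` is hypothetical.

```python
import numpy as np

def train_ksom(inputs, rows=12, cols=12, n_cycles=50_000,
               rate=0.05, radius0=6.0, lam=1.0, seed=0):
    """Train a planar Kohonen map; returns weights of shape (rows, cols, n)."""
    rng = np.random.default_rng(seed)
    inputs = np.asarray(inputs, dtype=float)
    # Assumed initialization: uniform random values within the input range.
    weights = rng.uniform(inputs.min(), inputs.max(),
                          size=(rows, cols, inputs.shape[1]))
    grid_r, grid_c = np.indices((rows, cols))
    for cycle in range(n_cycles):
        v = inputs[rng.integers(len(inputs))]  # assumed: random presentation order
        # The sum of lam-th powers is monotone in the Minkowski distance,
        # so the root can be skipped when only the minimum is needed.
        dist = np.sum(np.abs(weights - v) ** lam, axis=-1)
        br, bc = np.unravel_index(np.argmin(dist), dist.shape)
        # Radius decreases linearly from radius0 towards zero.
        radius = radius0 * (1.0 - cycle / n_cycles)
        hood = np.hypot(grid_r - br, grid_c - bc) <= radius
        weights[hood] += rate * (v - weights[hood])
    return weights
```

For a quick experiment, a call such as `train_ksom(data, rows=4, cols=4, n_cycles=300, radius0=2.0)` runs in a fraction of a second; the default arguments correspond to the 12 × 12 map, 50,000 cycles, learning rate 0.05, and initial radius 6 mentioned in the text.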


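Fig. 7. Summary of the correlation values between the auditory image distance matrices and the SR matrix: (a) $\lambda = 2$, $k = 0$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$; (b) $\lambda = 5$, $k = 0$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$; (c) $\lambda = 2$, $k = 0.20$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$; (d) $\lambda = 1$, $k = 0.20$, $\alpha_s = 1$, and $\alpha_f = \alpha_t = \alpha_{c1} = \alpha_{c2} = 0$; and (e) $\lambda = 0.690$, $k = 0.21$, $\alpha_s = 1$, $\alpha_f = 1.510$, $\alpha_t = 0.010$, $\alpha_{c1} = 0.262$, and $\alpha_{c2} = 0.000$.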

The KSOM was first trained with auditory images which did not include any gradient images. In four sets of simulations, the values of the parameters $\lambda$ and $k$ were set equal to those used in calculating the correlations a-d of Fig. 7. The obtained correlation values are presented in Fig. 8. For comparison, the respective pre-Kohonen AIM correlations (that is, correlations obtained from the auditory image distance matrices) are also presented. Three essential facts can be observed in Fig. 8. First, in each set of simulations, the AIM correlation was the highest, followed by the CAM and RFM correlations, in that order. That the post-Kohonen correlations are lower than the pre-Kohonen correlations might indicate that it is not possible to reduce timbre to two dimensions without compromising the metrical relationships between the input stimuli. Secondly, the post-Kohonen correlation values are high when the AIM correlation is high, and vice versa; this shows that the chosen preprocessing strategy is an essential factor in this kind of modelling. Lastly, the CAM correlations are regularly higher than the respective RFM correlations; taking into account the whole activation pattern of the KSOM seems, thus, to yield a better metric than paying attention only to the unit with the highest response. The highest CAM and RFM correlations, 0.748 and 0.685, respectively, were obtained with parameter values $\lambda = 1$ and $k = 0.21$.


Fig. 8. The values of corr(D,S) obtained using the centroid of activation (CAM) and response focus (RFM) metrics. The parameter values used for groups a-d are the same as those used for bars a-d of Fig. 7, respectively. For comparison, the respective AIM correlations are included.

At the next stage, the KSOM was trained with sets of auditory images which included gradient images. Although the optimal Minkowski exponent value was found, using the gradient method, to be $\lambda = 0.690$, the value $\lambda = 1$ was instead used in the subsequent simulations. This was done for two reasons. First, changing the Minkowski exponent from 1 to 0.690 did not increase the AIM correlation significantly. Secondly, using an integer $\lambda$ in the training of the KSOM reduces the computation time considerably. The value of the emphasis parameter in these simulations was $k = 0.21$. In the first four experiments, the auditory images in each simulation consisted of one gradient image alone. In other words, for each set of images, the weighting factor of one gradient image was set to 1, while the other weighting factors were set to zero. The obtained CAM and RFM correlation values are presented in Fig. 9, bars b-e. For comparison, the respective AIM correlations are also presented. Next, the KSOM was trained with combined auditory images. To reduce computational load, the f-gradient was the only gradient image to be included; the calculations utilizing the gradient method showed it to be the main factor responsible for the correlation increase. Two network architectures were used at this stage (see Fig. 10). In the first approach, a single KSOM was trained with auditory images, each of which consisted of a 400-dimensional spectral image and a 400-dimensional f-gradient image; the components of the latter were multiplied by 1.510, corresponding to the optimal combination of parameter values found by the gradient method. In the second approach, two KSOMs were first trained, one with the spectral images and the other with the f-gradient images. For each input stimulus, the coordinates of the centroids of activation on these maps, thus forming a four-dimensional vector, were used as input to a third KSOM. The CAM and RFM correlation values, obtained using the combined auditory images, are presented in Fig. 9, bars f-g.

Fig. 9. The AIM, CAM, and RFM correlation values obtained using (a) the spectral image (same as group d of Fig. 8); (b) the f-gradient image; (c) the t-gradient image; (d) the upward-gradient image; (e) the downward-gradient image; combined spectral and f-gradient images with (f) one KSOM (see Fig. 10.a); (g) three KSOMs (see Fig. 10.b).

Fig. 10. Two network architectures used in constructing timbre maps from combined auditory images: (a) one KSOM; (b) three KSOMs; the input to the uppermost KSOM is composed of the coordinates of the centroids of activation on the two lower KSOMs.

The graph of Fig. 9 shows that these simulations also yielded CAM and RFM correlations which were lower than the respective AIM correlations. High AIM correlations tended to imply high CAM and RFM correlations, and vice versa, although this correspondence does not seem to be as clear as in Fig. 8. The CAM correlations were higher than the respective RFM correlations, except when using the upward-gradient images. The reason for this anomaly remained unresolved. Using combined auditory images with the one-network architecture (Fig. 10.a) yielded the highest CAM correlation value, 0.768; the highest RFM correlation value, 0.709, was obtained using combined auditory images with the three-network architecture (Fig. 10.b).

SUMMARY AND DISCUSSION

In this study, the effect of using different auditory images and distance metrics on the final configuration of a self-organized timbre map was examined. This was done by calculating correlation coefficient values between distance matrices obtained from the simulations and a similarity rating matrix obtained using the same set of stimuli. In comparing auditory images and weight vectors of the KSOM, Minkowski metrics were used; this provided a wide choice of metrics, controllable by one parameter. Distance matrices were constructed both from the auditory images, obtained from an auditory model, and from the responses of a KSOM to them. The auditory image and Minkowski metric which would maximize the correlation with the SR data were sought using the gradient method.

In the first stage of the study, the correlations between the SR matrix and distance matrices calculated from the auditory images were examined. To begin with, the effect of the metric used on the obtained correlations was examined; it was found that, for spectral images with no emphasis on the onset, the Minkowski metric with the exponent value $\lambda = 5$ yielded the highest correlation. This result was found to be in accordance with that of Feiten and Günzel (1993). Next, the onset period of tones was emphasized in the spectral images by means of an exponential mapping of time; the amount of emphasis was controlled by a parameter. It was found that emphasizing the onset by the appropriate amount yielded correlations which were significantly higher than those obtained without emphasizing the onset. It was also found that in this case the city-block metric, i.e., the Minkowski metric with $\lambda = 1$, yielded the highest correlation. The role of gradient maps in timbre perception was studied by constructing auditory images which were composed of the spectral image and four idealized gradient images; the weight of each sub-image was controllable by a parameter. By applying the gradient method, the optimal auditory image and Minkowski metric were sought. It was found that adding the gradient images to the auditory images yielded a somewhat higher correlation than using spectral images alone. With the methods mentioned above, a correlation increase from 0.677 to 0.882 was achieved; the main contributor to this was found to be the emphasizing of onset.

In order to study the correspondence between the SR data and the final configuration of the KSOM, two distance metrics were defined on the Kohonen map. The first one, referred to as the RFM, considers the neuron with the highest activation only, while the other, referred to as the CAM, takes into account the spatial distribution of the whole activation pattern. The simulations revealed that the correlations obtained using the RFM and CAM were regularly lower than those obtained using the distances between the input auditory images. This is probably due to the fact that it is not possible to project the timbre space onto two dimensions without significantly distorting the metrical relationships between the timbres. This statement is supported, for instance, by Grey's (1977) results: he reported that, in multidimensional scaling of timbres, the obtained correlations between the actual and projected distances were 0.78 for four dimensions, 0.75 for three dimensions, and 0.68 for two dimensions.


The post-Kohonen correlations were found to depend on the respective pre-Kohonen correlations: high values of the latter tended to imply high values of the former, and vice versa. The CAM correlations were in general higher than the respective RFM correlations; a metric which takes into account the global distribution of activation was, thus, found to be in better agreement with the psychological data. In the last stage of the study, the KSOM was trained with combined spectral and f-gradient images. This was done by making use of two network architectures: (1) a single KSOM, and (2) a two-level network, where the spectral and f-gradient images were classified separately. The former was found to yield a higher CAM correlation, while the latter yielded a higher RFM correlation.

Although the emphasizing of onset was found to be the main contributor to the increase of correlation between the SR data and the distance matrices, the way it was modelled certainly lacks neurophysiological relevance. It is difficult to imagine any process in the auditory pathway which would be responsible for such an exponential scaling of time. Further work is needed in order to convincingly model the processes involved in auditory attention and sensory memory. Since the onset of a tone seems to be the most important factor in timbre perception, higher correlations would undoubtedly have been obtained simply by including only a short initial period of the tone in the auditory image, and using a linear time scale. However, such an approach, which completely disregards the final period of the tone, would be an even more implausible one.

The onset period of certain musical tones, for instance those produced by brass instruments, contains rapid frequency modulations. Their role in timbre perception has not yet been carefully examined. This study failed to clarify this question: although the upward- and downward-gradients, defined earlier, can be regarded as idealizations of physiological FM maps, their frequency and time resolutions are certainly insufficient for extracting such quick transitions. Further study is needed in order to elucidate this problem.

The tone stimuli used in this study were synthetic; they were used in order to acquire a wide variety of timbres with an optimally economical set of variables and to keep the number of stimuli reasonable. This is especially important, since we have planned to use these stimuli in EMG measurements in order to try to localize responses to them on the auditory cortex; because of the length of this experiment, only a subset of these stimuli can be used, and knowing what each stimulus is composed of helps in choosing a representative subset.

The Pearson correlation coefficient was used in comparing the psychological and simulation data. While the use of this method can be defended on the plea that it is a widely used tool in data analysis, arguments against using it in analyzing this kind of data can certainly be stated. For instance, when the Pearson correlation is used, it is supposed that the two variables are linearly related; it is not, however, self-evident that this kind of relationship exists between the variables analyzed in this study. It would possibly be more plausible to use the methods of non-parametric statistics (see, e.g., Conover 1971). This would include the use of correlation measures such as the Spearman rank correlation, the calculation of which involves replacing the absolute values of the variables with their respective positions in the ordered set of the values.
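The rank-based alternative mentioned above can be sketched in a few lines. This is an illustration only and was not part of the analysis reported here; the function names are hypothetical, and tied values are given their average rank, which is one common convention.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman_triangular(D, S):
    """Spearman rank correlation between the i < j components of D and S."""
    D = np.asarray(D, dtype=float)
    S = np.asarray(S, dtype=float)
    upper = np.triu_indices(D.shape[0], k=1)
    rank_d = average_ranks(D[upper])
    rank_s = average_ranks(S[upper])
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_d, rank_s)[0, 1])
```

Because only the ordering of the values enters, any monotonically increasing transformation of the distances leaves this coefficient unchanged, which removes the assumption of a linear relationship.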

ACKNOWLEDGEMENTS

I am indebted to Marc Leman for many useful discussions concerning this study and valuable comments on an earlier draft. I am also grateful to Topi Jarvinen, Mauri Kaipainen, and Jukka Louhivuori for helpful suggestions. Finally, I express my gratitude to John Richardson for proof-reading the text.

NOTE

1. It must be noted that the term 'gradient' in this paper does not generally mean the mathematical derivative operator

but rather refers to the derivative of some quantity. When the aforementioned operator is meant, it will be referred to as the 'mathematical gradient'.

REFERENCES

American Standard Acoustical Terminology (1960). SI.1-1960. American Standards Association:New York.

Berger, K.W. (1964). Some factors in the recognition of timbre. Journal of the Acoustical Societyof America, 36, 1888-1891.

Bregman, A.S. & Dannenbring, G.L. (1973). The effect of continuity on auditory streamsegregation. Perception and Psychophysics, 13 (2), 308—312.

Brown, G.J. (1992). Computational auditory scene analysis: A representational approach.Doctoral Thesis. Univ. of Sheffield, Dept. of Computer Science.

Brown, G.J. & Cooke, M. (1994). Perceptual grouping of musical sounds: A computationalmodel. Journal of New Music Research, 23(2), 107-132.

Ciocca, V. & Bregman, A. S. (1987). Perceived continuity of gliding and steady-state tonesthrough interrupting noise. Perception and Psychophysics, 42 (5), 476—484.

Cohen, J. (1989). Application of an auditory model to speech recognition. Journal of theAcoustical Society of America, 85, 2623—2629.

Conover, W.J. (1971). Practical nonparametric statistics. New York: John Wiley & Sons.Cosi, P., De Poli, G. & Lauzzana, G. (1994). Auditory modelling and self-organizing neural

networks for timbre classification. Journal of New Music Research, 23, 71—98.De Poli, G., Prandoni, P. & Tonella, P. (1993). Timbre clustering by self-organizing neural

networks. Proc. of X Colloquium on Musical Informatics. Milan: University of Milan.Feiten, B., Frank, R. & Ungvary, T. (1991). Organisation of sounds with neural nets. Proc. of

1991 ICMC. San Francisco: ICMA.

Dow

nloa

ded

by [

Uni

vers

ity o

f T

oron

to L

ibra

ries

] at

13:

00 2

9 A

pril

2014

OPTIMIZING AUDITORY IMAGES 29

Feiten, B. & Günzel, S. (1993). Distance measure for the organization of sounds. Acustica, 78,181-184.

Feiten, B. & Giinzel, S. (1994). Automatic Indexing of a sound database using self-organizingneural nets. Computer Music Journal, 18(3), 53—65.

Gardner, R.B. & Wilson, J.P. (1979). Evidence for direction-specific channels in the processingof frequency modulation. Journal of the Acoustical Society of America, 66(3), 704—709.

Ghitza, O. (1986). Auditory nerve representation as a front-end for speech recognition in a noisyenvironment. Comput. Speech Language, 1, 109—130.

Grey, J.M. (1975). An exploration of musical timbre. Ph.D. dissertation, Stanford University,Stanford, California.

Grey, J.M. (1977). Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61, 1270-1277.

Grey, J.M. & Gordon, J.W. (1978). Perceptual effects of spectral modifications of musical timbres. Journal of the Acoustical Society of America, 63, 1493-1500.

Iverson, P. & Krumhansl, C.L. (1993). Isolating the dynamic attributes of musical timbre. Journal of the Acoustical Society of America, 94, 2595-2603.

Javel, E., McGee, J., Horst, J.W. & Farley, G.R. (1988). Temporal mechanisms in auditory stimulus encoding. In G.M. Edelman, W.E. Gall & W.M. Cowan (eds.), Auditory function: neurobiological bases of hearing. New York: John Wiley & Sons.

Kay, R.H. & Matthews, D.R. (1972). On the existence in human auditory pathways of channels selectively tuned to the modulation present in frequency-modulated tones. Journal of Physiology (London), 225, 657-677.

Knudsen, E.I., du Lac, S. & Esterly, S.D. (1987). Computational maps in the brain. Annual Review of Neuroscience, 10, 41-45.

Kohonen, T. (1989). Self-organization and associative memory. (2nd Edn.) Berlin etc.: Springer-Verlag.

Kronland-Martinet, R. & Grossman, A. (1991). Application of time-frequency and time-scale methods (wavelet transform) to the analysis, synthesis and transformation of natural sounds. In G. De Poli, A. Piccialli & C. Roads (eds.), Representations of Musical Signals. Cambridge: MIT Press.

Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27.

Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115-129.

Lauter, J.L., Herscovitch, P., Formby, C. & Raichle, M.R. (1985). Tonotopic organization in the human auditory cortex revealed by positron emission tomography. Hearing Research, 20, 199-205.

Leman, M. (1994). Schema-based tone center recognition of musical signals. Journal of New Music Research, 23(2), 169-204.

McAdams, S. & Bregman, A. (1985). Hearing musical streams. Computer Music Journal, 3(4), 26-43.

Meddis, R. (1986). Simulation of mechanical to neural transduction in the auditory receptor. Journal of the Acoustical Society of America, 79, 702-711.

Mendelson, J.R. & Cynader, M.S. (1985). Sensitivity of cat primary auditory cortex (AI) neurons to the direction and rate of frequency modulation. Brain Research, 327, 331-335.

Møller, A.R. (1972). Coding of amplitude and frequency modulated sounds in the cochlear nucleus of the rat. Acta Physiologica Scandinavica, 86, 223-238.

Møller, A.R. (1978). Coding of time-varying sounds in the cochlear nucleus. Audiology, 17, 446-468.

Nelson, P.G., Erulkar, S.D. & Bryan, J.S. (1966). Responses of units of the inferior colliculus to time-varying acoustic stimuli. J. Neurophysiol., 29, 834-860.

Pearlman, A.G. (1985). The visual cortex of the normal mouse and the reeler mutant. In Peters, A. & Jones, E.G. (eds.), Cerebral Cortex. Volume 3: Visual Cortex. New York: Plenum Press.

Pickles, J.O. (1982). An introduction to the physiology of hearing. London: Academic Press.

Plomp, R. & Steeneken, H.J.M. (1971). Pitch versus timbre. Proc. 7th International Congress on Acoustics, 3, 377-380. Budapest.

Plomp, R. (1976). Aspects of tone sensation. London: Academic Press.

Rabiner, L.R. & Schafer, R.W. (1978). Digital processing of speech signals. Englewood Cliffs, New Jersey: Prentice-Hall.

Saldanha, E.L. & Corso, J.F. (1964). Timbre cues and the identification of musical instruments. Journal of the Acoustical Society of America, 36, 2021-2026.

Shamma, S.A., Vranic, S. & Wiser, P. (1992). Spectral gradient columns in primary auditory cortex: physiological and psychoacoustical correlates. Advances in the Biosciences, 83, 397-404.

Steiger, H. & Bregman, A.S. (1981). Capturing frequency components of glided tones: Frequency separation, orientation, and alignment. Perception and Psychophysics, 30(5), 425-435.

Suga, N. (1965). Analysis of frequency-modulated sounds by auditory neurons of echo-locating bats. Journal of Physiology (London), 179, 26-53.

Toiviainen, P. (1992). The organisation of timbres: a two-stage neural network model. 10th European Conference on Artificial Intelligence / Workshop on Artificial Intelligence and Music - Workshop notes. Vienna: ECCAI/OGAI.

Toiviainen, P., Kaipainen, M. & Louhivuori, J. (1995). Musical timbre: similarity ratings correlate with computational feature space distances. Journal of New Music Research, 24(3), 282-298.

Van Immerseel, L.M. & Martens, J.-P. (1992). Pitch and voiced/unvoiced determination with an auditory model. Journal of the Acoustical Society of America, 91, 3511-3526.

Watanabe, T. & Ohgushi, K. (1968). FM sensitive auditory neuron. Proc. Japan Acad., 44, 968-973.

Wedin, L. & Goude, G. (1972). Dimension analysis of the perception of instrumental timbre. Scandinavian Journal of Psychology, 13, 228-240.

Wessel, D.L. (1979). Timbre space as a musical control structure. Computer Music Journal, 3, 45-52.

Petri Toiviainen
University of Jyväskylä
Department of Musicology
PL 35
FIN-40351 Jyväskylä
Finland
Tel.: +358 41 601 353
Fax: +358 41 601 331
E-mail: [email protected]

Born in 1959, in Jyväskylä, Finland, he studied at the University of Jyväskylä, and graduated in Theoretical Physics in 1987. Since 1990 he has been working as a research scientist in a research project on cognitive musicology, supported by the Finnish Academy of Sciences. His research interests include connectionist modelling of cognitive processes of music, particularly improvisation and timbre perception. As a jazz piano player, he has performed at jazz festivals in Europe and America. His hobbies include beer brewing and winter bathing.
