

Natural cross-modal mappings between visual and auditory features

Karla K. Evans, Brigham and Women's Hospital and Harvard Medical School, Cambridge, MA, USA

Anne Treisman, Psychology Department, Princeton University, Princeton, NJ, USA

The brain may combine information from different sense modalities to enhance the speed and accuracy of detection of objects and events, and the choice of appropriate responses. There is mounting evidence that perceptual experiences that appear to be modality-specific are also influenced by activity from other sensory modalities, even in the absence of awareness of this interaction. In a series of speeded classification tasks, we found spontaneous mappings between the auditory feature of pitch and the visual features of vertical location, size, and spatial frequency but not contrast. By dissociating the task variables from the features that were cross-modally related, we find that the interactions happen in an automatic fashion and are possibly located at the perceptual level.

Keywords: cross-modal integration, auditory, visual, perceptual level

Citation: Evans, K. K., & Treisman, A. (2010). Natural cross-modal mappings between visual and auditory features. Journal of Vision, 10(1):6, 1–12, http://journalofvision.org/10/1/6/, doi:10.1167/10.1.6.

Introduction

Objects and events in the environment typically produce a correlated input to several sensory modalities at once. The perceptual system may disambiguate and maximize the incoming information by combining inputs from several modalities. To do this, it must determine which signals originate from the same source by detecting correspondences or mismatches between the different sensory streams. Spatial and temporal coincidence are the usual cues for integration, but the brain may also rely on a feature correspondence between the different sensory inputs (see Welch & Warren, 1986). The most dramatic examples occur in synesthetes, who often experience stimuli both in the appropriate modality and in another. For example, a letter may produce an additional sensation of color, or a visual shape may evoke a taste or a smell. In non-synesthetes, cross-modal metaphors are also common, for example a bitter smile, a loud tie, a heavy odor, suggesting that some mappings may be natural to most humans.

Over the past 40 years, studies in adults and children have investigated how both static and dynamic stimulations in one modality can modulate the perception of information in another modality. Pitch is involved in a number of cross-modal correspondences, and this is also the auditory feature that we investigated. In 1883, Stumpf concluded that all world languages apply the labels "high" and "low" to the pitch of tones. Bernstein and Edelstein (1971) were the first to measure the cross-modal congruence between pitch and visual vertical position, subsequently replicated by Melara and O'Brien (1987), Ben-Artzi and Marks (1995), and Patching and Quinlan (2002). Participants classified visual stimuli as high and low more quickly when the visual stimulus was accompanied by a tone that was congruent rather than incongruent (e.g., high pitch with high position rather than low). Even 6-month-old babies exhibit this cross-modal correspondence (Braaten, 1993), and young infants "match" visual arrows pointing up or down with tones sweeping up or down in frequency (Wagner, Winner, Cicchetti, & Gardner, 1981). Auditory pitch is also matched to other visual attributes.

One is the lightness or darkness of a surface, with participants showing faster responses to a higher pitch presented with a lighter surface and a lower pitch with a darker surface (Marks, 1987; Martino & Marks, 1999; Melara, 1989). Another is size: In a perceptual matching paradigm, 9-year-olds matched high pitch tones with small sizes and low pitch with large sizes (Marks, Hammeal, & Bornstein, 1987). Three-year-olds reliably matched high pitch tones with small bouncing balls and low pitch tones with large ones (Mondloch & Maurer, 2004), and adults judged the size of a variable visual stimulus relative to a standard more rapidly when the irrelevant sound frequency presented simultaneously was congruent with the size of the visual stimulus (e.g., high pitch tone with a small visual stimulus; Gallace & Spence, 2006). Finally, pitch can also generate cross-modal correspondences with visual shape: high pitch is matched to angular rather than rounded shapes (Marks, 1987). In addition to auditory pitch, loud sounds can improve the perception of bright

Journal of Vision (2010) 10(1):6, 1–12 http://journalofvision.org/10/1/6/

doi:10.1167/10.1.6. Received February 2, 2009; published January 12, 2010. ISSN 1534-7362 © ARVO


lights and large objects, whereas soft sounds facilitate the perception of dim lights and small objects (Marks, 1987; Smith & Sera, 1992).

At what level might such cross-modal correspondences arise? Some cross-modal interactions may be based on similar neural codes, for example for dimensions like loudness and brightness that can be described in terms of more or less: an increase in stimulation produces an increase in firing in each modality. Melara (1989) used multidimensional scaling to look for changes in the perceptual similarity relations between tones or lightness induced by concurrent stimuli in the other modality. He found none and concluded that the interactions result from failures of separability at the decision rather than the perceptual level. Some correspondences may arise through frequent associations in everyday experience. For example, larger objects resonate at lower pitch. The image of a cat is associated with the sound "meow" (Long, 1977). Pre-linguistic children as well as adults may show these learned perceptual associations (Braaten, 1993; Wagner et al., 1981). Verbal labels may also mediate cross-modal links. Melara and Marks (1990) presented the visual bigrams "LO" and "HI" simultaneously with sounds of different pitch and found effects that are unlikely to reflect inherent perceptual similarities. For stimuli such as these, the representations may share only a more abstract polarity (Garner, 1974, 1976; Martino & Marks, 1999). It seems then that cross-modal correspondences can arise at almost any level.

The speeded classification paradigm is often used to probe for interactions between different dimensions or modalities. The task requires the rapid identification of a stimulus in one modality (or on one dimension of a unimodal stimulus), while an irrelevant stimulus in another modality (or another dimension of a unimodal stimulus) is ignored. If the two modalities are automatically registered, it may be difficult to attend selectively to one of the pair, and variation in the irrelevant modality may affect latencies to the relevant one. Any overall increase in latencies induced by variation on the irrelevant dimension is known as Garner interference, but if there is a correspondence between the polarities on the two dimensions, specific pairings may also show interference when the mapping is incongruent and facilitation when it is congruent.

In our studies, we selected four cross-modal mappings of auditory pitch to visual position, size, spatial frequency, and contrast. We used speeded classification to compare performance between unimodal visual and auditory presentations and bimodal simultaneous presentations in which the pairings were either congruent or incongruent. Having identified which pairs of visual dimensions show congruency effects with concurrent tones and can therefore be presumed to interact at some level, we used a new experimental paradigm to test whether these interactions arose automatically and possibly suggest an interaction at the perceptual level rather than being mediated by shared labels or responses. For each pair of dimensions, we used both a "direct" task in which, as in previously published experiments, the task was to discriminate the stimuli on the dimensions on which the hypothesized correspondence would be shown, and also an "indirect" task, in which the task was to discriminate the stimuli on some other orthogonal dimension (see Figure 1).

The direct tasks allow the corresponding cross-modal features to interact at any level between sensory registration and response. In the indirect tasks, the potentially corresponding dimensions are orthogonal both to the task and to the correct response keys. The congruent features are no more connected with one value than with the other on the task-relevant dimension, so any congruence effects cannot be attributed to convergence at the decision or response level. There is still, however, a possibility that irrelevant coding of both stimuli occurs in parallel with the relevant coding, for example in the language area, and that if there is any conflict between the two outcomes of this irrelevant coding, it could slow down the response to the relevant dimension. So if, for example, the high position and low tone are automatically labeled as such, the disagreement between these task-irrelevant verbal labels could conceivably slow down the response to the relevant orientation or instrument discriminations. This could apply to any condition in which verbal labels are automatically generated and are mismatched across modalities. Examples are pitch and position or pitch and contrast, which may share the labels "high" and "low". However, it is less plausible with pitch and size (where the natural labels are "small" and "large") and with spatial frequency, which most naive participants would label wide and narrow or thick and thin, not high and low. If we find any interference in these cases, it is likely to reflect a mismatch or conflict between the sensory representations.

Methods

Participants

After giving informed consent, eighty-five students (42 males) from Princeton University, aged 18 to 30 years, with normal or corrected-to-normal vision, participated in the experiments for course credit (8 each in Experiments 1, 3, and 9; 12 in Experiment 2; 10 each in Experiments 4, 5, 6, and 7; and 9 in Experiment 8). Every aspect of this study was carried out in accordance with the regulations of Princeton University's Institutional Review Panel for Human Subjects.

Apparatus and stimuli

The stimuli were presented using MATLAB (MathWorks, Natick, MA) with Psychophysics Toolbox



Figure 1. The timeline for (a) Experiment 1 and (b) Experiment 2. (a) and (b) represent the bimodal stimuli (both auditory and visual) and participant responses in Experiments 1 and 2. One-third of the trials in each experiment were presented as unimodal stimuli for comparison. Not drawn to scale.



extensions (Brainard, 1997; Pelli, 1997) running on a Macintosh G3. The visual displays were presented on a 17-inch Apple screen at a viewing distance of 57 cm and a monitor refresh rate of 75 Hz. The sound wave files were played through speakers positioned to the left and right of the computer screen, with a center-to-center distance of 51 cm between the speakers.

The stimuli used in Experiments 1 to 8 were pairs with one auditory and one visual stimulus, played with the same onset and a shared duration of 120 ms. The auditory stimulus in the direct tasks (Experiments 1, 3, 5, and 7) was one of two pure tones of different pitches, 1500 Hz (high pitch) and 1000 Hz (low pitch), presented at an intensity of about 75 dB. The auditory stimuli in the indirect tasks (Experiments 2, 4, 6, and 8) were computer-generated tones simulating a violin or a piano, each with either high or low pitch (G2 or C2) presented at an intensity of 75 dB.

The visual stimulus in Experiments 1 to 8 consisted of a black and white sinusoidal grating (luminance of 8 cd/m² for black and 320 cd/m² for white) presented on a gray background (80 cd/m²). With the exceptions described below, the grating was positioned in the center of a gray screen. Its size was 3° of visual angle, its frequency 4 cycles/deg, it was oriented to the left (45°), and it had high contrast. In Experiments 1 and 2, the grating varied in position: It was presented 4.5° either above or below a central fixation cross. In Experiments 3 and 4, it varied in size: It could be either small (subtending 3° of visual angle) or large (subtending 6°). In Experiments 5 and 6, it varied in spatial frequency: It could be either 6 cycles/deg (high spatial frequency) or 2 cycles/deg (low spatial frequency). In Experiments 7 and 8, it varied in contrast: The bars of the grating could be either high contrast (0.95) or low contrast (0.56). In the indirect tasks (Experiments 2, 4, 6, and 8), on half the trials the grating had bars oriented at 45° and on half the trials at 135°. In Experiment 9, the two stimuli were in the same modality (vision). The grating was presented 4.5° above or below fixation, or at the center. It could be small or large (1.5° or 6°) or of intermediate size (3°). It was high contrast and oriented to the left (45°).
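Stimuli of this kind are easy to synthesize. The following NumPy sketch generates a pure tone and an oriented sinusoidal grating using the parameter values given above (the function names, the 44.1 kHz sample rate, and the pixels-per-degree scale are our illustrative assumptions, not part of the original setup):

```python
import numpy as np

def pure_tone(freq_hz, duration_s=0.120, sample_rate=44100):
    """Sine tone at freq_hz (e.g., 1500 Hz = high pitch, 1000 Hz = low pitch)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

def sine_grating(cycles_per_deg, size_deg=3.0, contrast=0.95,
                 orientation_deg=45.0, pixels_per_deg=40):
    """Oriented sinusoidal luminance grating on a mid-gray background.

    Returns values in [0, 1] with mean ~0.5; `contrast` is Michelson contrast.
    """
    n = int(size_deg * pixels_per_deg)
    xs = (np.arange(n) - n / 2) / pixels_per_deg  # pixel coordinates in degrees
    x, y = np.meshgrid(xs, xs)
    theta = np.deg2rad(orientation_deg)
    # Coordinate along the grating's modulation axis
    u = x * np.cos(theta) + y * np.sin(theta)
    return 0.5 + 0.5 * contrast * np.sin(2 * np.pi * cycles_per_deg * u)

high_tone = pure_tone(1500)              # "high pitch" in Experiments 1, 3, 5, 7
grating = sine_grating(4, size_deg=3.0)  # default grating: 4 cycles/deg, 3°
```

Converting the [0, 1] grating to the cd/m² range quoted in the text would be done by the display's gamma-corrected lookup table, which is omitted here.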

Procedure

Participants performed the speeded classification task in 9 different experiments, separately judging the auditory or the visual stimuli (Experiments 1–8), or judging the visual stimuli on each of two dimensions (Experiment 9). Each experiment consisted of 3 conditions, of which one was unimodal (i.e., stimuli were presented only in one modality, or in Experiment 9 on only one dimension) and the other two were bimodal, one consisting of a congruent and the other of an incongruent combination of stimuli. In any block of the experiment, participants responded only to one modality. Stimuli in the other modality were irrelevant to the task.

Experiments 1, 3, 5, 7, and 9 employed the direct task (i.e., respond to the same features on which the correspondence was tested) and Experiments 2, 4, 6, and 8 employed the indirect task (i.e., respond to different features from those with the hypothesized cross-modal correspondence). In the direct tasks, participants discriminated either between the auditory stimuli (the two tone pitches) or the visual stimuli (two visual spatial positions, or two sizes, or two spatial frequencies, or two levels of contrast of the gratings). In the indirect tasks, participants discriminated either between the auditory stimuli (two different instruments, violin or piano) or the visual stimuli (the tilt of two differently oriented gratings, 45° or 135°). The auditory stimuli were still high or low in pitch and the visual gratings were high or low in visual position, large or small in size, high or low in spatial frequency, or high or low in contrast.

“Direct” tasks

In the auditory task, participants heard a pure tone (high or low pitch) played for 120 ms through the speakers. They responded by pressing the "s" key if they heard a high and the "k" key if they heard a low pitch tone. In the visual task, participants on each trial saw a fixation cross for 500 ms, followed by a grating with bars oriented 45 degrees to the left that appeared for 120 ms. In Experiment 1, it appeared either below or above the fixation cross, and participants made a choice response to the location, pressing "s" for below fixation and "k" for above (Figure 1a). By choosing the opposite response mapping in the two modalities, we hoped to avoid direct priming of the response keys in the congruent pairings.1

The procedure was the same for the other direct task experiments (3, 5, and 7) except that the visual task in Experiment 3 was to discriminate the size of the grating (large or small, see Figure 2a); in Experiment 5 it was to judge the width of the bars in the grating (wide or narrow, see Figure 2b); and in Experiment 7 it was to judge the contrast of the grating (high contrast or low, see Figure 2c). High visual position, small size, high spatial frequency, and high contrast gratings paired with high pitch were pairings that we predicted might be congruent, as were low visual position, large size, low spatial frequency, and low contrast gratings paired with low pitch. The converse pairings were considered incongruent.
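The predicted pairings can be summarized as a small lookup. This sketch encodes the congruence mapping exactly as stated above; the table and helper names are ours, not part of the original experiment code:

```python
# Visual feature value predicted to be congruent with HIGH pitch, per dimension
# (the opposite value is predicted congruent with LOW pitch).
CONGRUENT_WITH_HIGH_PITCH = {
    "position": "high",           # grating above fixation
    "size": "small",              # note the inversion: small pairs with high pitch
    "spatial_frequency": "high",  # narrow bars
    "contrast": "high",           # (no congruence effect was found for contrast)
}

def is_congruent(dimension, visual_value, pitch):
    """True if this visual value / pitch pairing is a predicted-congruent one."""
    high_partner = CONGRUENT_WITH_HIGH_PITCH[dimension]
    if pitch == "high":
        return visual_value == high_partner
    opposite = {"high": "low", "low": "high", "small": "large", "large": "small"}
    return visual_value == opposite[high_partner]
```

For example, `is_congruent("size", "small", "high")` is True (high pitch with a small grating), while the converse pairing of a large grating with high pitch is incongruent.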

“Indirect” tasks

Participants responded to different auditory and visual features from those expected to show a cross-modal correspondence. The auditory task was to indicate whether the tone was produced by a violin or a piano (pressing "s" for violin and "k" for piano). The visual task was to



Figure 2. The timeline for (a) Experiments 3 (direct task) and 4 (indirect task) for pitch and size; (b) Experiments 5 (direct task) and 6 (indirect task) for pitch and spatial frequency; and (c) Experiments 7 (direct task) and 8 (indirect task) for pitch and contrast. Not drawn to scale.



indicate whether the grating they saw was left- or right-oriented (pressing "s" for left and "k" for right). The tones were still equally often high or low in pitch, and the grating still appeared equally often above and below fixation (Figure 1b). The same pairings were considered congruent and incongruent as in the direct task.

In Experiment 9, we tested two dimensions within the visual modality. In the single-dimension position task, the grating was always 3° in size, and participants indicated whether it was presented above or below the fixation cross. In the single-dimension size task, the grating varied in both size and position. However, as with the bimodal conditions in previous experiments, participants only responded to a single dimension (size or position) and ignored the other dimension.

Results and discussion

The primary dependent measure in these studies was reaction time (RT), measured from stimulus onset, on correct trials only. We also recorded the accuracy of the responses. The mean RTs are shown for auditory pitch together with visual position, size, and spatial frequency in Figure 3. Note that the considerable baseline differences across experiments and for direct versus indirect tasks are for different groups of participants, whereas the differences between unimodal and bimodal congruent or incongruent, as well as the differences between visual and auditory tasks, are within participants.

There was no evidence of a speed-accuracy trade-off: a significant congruency effect (faster RTs for bimodal congruent than incongruent) was always associated with either a significantly lower error rate for congruent pairs or no statistically significant difference. Significantly faster RTs for visual tasks were always associated either with significantly lower error rates for the visual task or no difference.

Figure 3 shows the mean response times and the accuracy in each of the six conditions in Experiments 1 to 6, separately for the direct and indirect tasks. Table 1 shows the results of four mixed-model ANOVAs run separately on the RTs with visual position (Experiments 1 and 2), size (Experiments 3 and 4), spatial frequency (Experiments 5 and 6), and contrast (Experiments 7 and 8). The within-participant factors were attended modality (visual or auditory) and condition (bimodal congruent, bimodal incongruent, and unimodal), and the between-participant factor was task (direct or indirect). The significant effects for each ANOVA are summarized in Table 1. The general pattern is similar across the three visual dimensions of position, size, and spatial frequency but not contrast, and there were also some informative differences between position, size, and spatial frequency.
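The dependent-measure computation described here (mean RT on correct trials per condition, plus error rates) is a simple per-condition aggregation. A sketch with made-up trial records, not values from the paper:

```python
from statistics import mean

def summarize(trials):
    """Mean RT (ms) on correct trials and error rate, per condition.

    Each trial is a dict: {"condition": ..., "rt": ms, "correct": bool}.
    """
    out = {}
    for cond in {t["condition"] for t in trials}:
        sub = [t for t in trials if t["condition"] == cond]
        correct = [t for t in sub if t["correct"]]
        out[cond] = {
            "mean_rt": mean(t["rt"] for t in correct),  # correct trials only
            "error_rate": 1 - len(correct) / len(sub),
        }
    return out

# Illustrative data only
trials = [
    {"condition": "congruent", "rt": 410, "correct": True},
    {"condition": "congruent", "rt": 430, "correct": True},
    {"condition": "incongruent", "rt": 455, "correct": True},
    {"condition": "incongruent", "rt": 500, "correct": False},
]
stats = summarize(trials)
congruence_effect = stats["incongruent"]["mean_rt"] - stats["congruent"]["mean_rt"]
```

Reporting the error rate alongside the RT effect is what allows the speed-accuracy trade-off check described above: a congruency effect is only trusted if the faster condition is not also the more error-prone one.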

Cross-modal congruence effects

Our main interest was in the effects of cross-modal congruence and how it depends on whether the task is direct or indirect. Table 2 shows the differences between the unimodal conditions and the congruent or incongruent conditions in Experiments 1 to 6, as well as the overall effect of congruence (incongruent minus congruent). We ran an ANOVA on the RT differences between incongruent and congruent RTs (i.e., the data in the 3rd and 6th columns of Table 2) with task and visual dimension as between-participant factors and modality and congruence as within-participant factors. The direct task (mean 19.7 ms) showed a larger overall congruence effect than the indirect task (mean 11.4 ms; F(1, 52) = 13.6, prep = 0.99, ηp² = 0.21). This is not too surprising, since two additional factors could contribute to the direct task effects: (1) the fact that the dimensions on which the mapping occurs are those that determine the responses and they therefore receive more attention; (2) the fact that, in the direct task, at least for the position dimension, there could be convergence on the shared verbal labels "high" and "low". These had to be remapped to opposite response keys for auditory and for visual stimuli, but they might nevertheless have had some effect at a level before the final response selection.

The important finding is that there is a significant effect of congruence in the indirect as well as the direct task on three visual dimensions (position, size, and spatial frequency; see Table 1). The cross-modal mapping between auditory and visual stimuli occurs automatically and affects performance even when it is completely irrelevant to the task.

Visual dimensions: Position, size, and spatial frequency

The differences in average congruence effects between the three visual dimensions were significant (F(2, 52) = 13.2, prep = 0.99, ηp² = 0.34), with a closer mapping between spatial position and pitch (in Experiments 1 and 2, mean 23.8 ms) than between either pitch and size (in Experiments 3 and 4, mean 11.3 ms) or pitch and spatial frequency (in Experiments 5 and 6, mean 11.7 ms). This could be due to a stronger contribution from the matching verbal labels, "high" and "low", for both position and pitch. The labels do not apply so naturally to size or to spatial frequency (where the obvious labels would be large/small and wide/narrow or thick/thin).

The interaction between visual dimension and direct versus indirect task was also significant (F(2, 52) = 3.69, prep = 0.94, ηp² = 0.13), reflecting the fact that the difference between direct and indirect tasks was considerably larger for spatial position (Experiments 1 and 2) than for size (Experiments 3 and 4), with the difference disappearing completely for spatial frequency (Experiments 5



and 6). Again, the larger effect on the direct task for visual position could result from the verbal labeling high/low when those are the attended dimensions. It is striking that for spatial frequency the direct and the indirect tasks show the same effect of congruence whether participants are judging the spatial frequency of the gratings (the dimension on which the congruence with pitch is varied) or judging their orientation. The congruence relation seems to be registered automatically, as though the stimuli are taken in holistically before being separately categorized.

Table 2 divides the congruence effects into facilitation (shown by the difference between unimodal and congruent conditions) and interference (shown by the difference between unimodal and incongruent conditions). In the ANOVAs shown in Table 1, the difference between unimodal and congruent conditions showed significant

Figure 3. Mean response time in milliseconds with standard error of the mean, averaged over participants in different conditions on different judgment tasks. (a, b) Pitch and position correspondence (Experiments 1 and 2); (c, d) pitch and size correspondence (Experiments 3 and 4); (e, f) pitch and spatial frequency correspondence (Experiments 5 and 6). The mean percentage of errors for each condition is shown at the foot of each column. A star indicates a significant difference between the RTs.



                           Direct task                              Indirect task
                           Congruent    Incongruent  Incongruent    Congruent    Incongruent  Incongruent
                           − unimodal   − unimodal   − congruent    − unimodal   − unimodal   − congruent
Pitch–position
  Auditory                 −27.0        18.1         45.1           −7.5         15.2         22.7
  Visual                   −13.1         5.5         18.6            3.7         12.4          8.7
  Mean                     −20.0        11.8         31.8           −1.9         13.8         15.7
Pitch–size
  Auditory                  −2.1        13.5         15.6           10.3         17.6          7.3
  Visual                    −4.0        10.4         14.4            2.1          9.9          7.9
  Mean                      −3.0        12.0         15.0            6.2         13.8          7.6
Pitch–spatial frequency
  Auditory                   3.7        18.3         14.6           15.9         30.7         14.8
  Visual                     9.4        19.7         10.3            6.6         13.7          7.1
  Mean                       6.6        19.0         12.4           11.2         22.2         11.0

Table 2. Effects of particular pairings in speeded classification: Differences in milliseconds between bimodal congruent, bimodal incongruent, and unimodal stimuli. Facilitation is shown by negative numbers and interference by positive numbers.
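The three columns per task in Table 2 are not independent: the overall congruence effect (incongruent minus congruent) equals the interference score minus the facilitation score, since (incongruent − unimodal) − (congruent − unimodal) = incongruent − congruent. A quick consistency check against the pitch–position rows, using the values from Table 2:

```python
# (congruent − unimodal, incongruent − unimodal, incongruent − congruent), in ms,
# pitch–position rows of Table 2 (direct task).
pitch_position_direct = {
    "auditory": (-27.0, 18.1, 45.1),
    "visual":   (-13.1,  5.5, 18.6),
}

for channel, (facilitation, interference, total) in pitch_position_direct.items():
    # Total congruence effect decomposes into interference minus facilitation
    assert abs((interference - facilitation) - total) < 1e-9, channel
```

The same identity holds for every row of the table, which is a useful sanity check when transcribing such difference scores.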

                               Experiments 1 and 2:    Experiments 3 and 4:    Experiments 5 and 6:    Experiments 7 and 8:
Significant effects            Position                Size                    Spatial frequency       Contrast
Modality (visual/auditory)     F(1, 18) = 39.6         F(1, 16) = 61.5         F(1, 18) = 18           F(1, 17) = 48
                               prep = 0.99, ηp² = 0.69 prep = 0.99, ηp² = 0.79 prep = 0.99, ηp² = 0.50 prep = 0.99, ηp² = 0.74
Task (direct/indirect)         F(1, 18) = 33.8         F(1, 16) = 4.4          F < 1                   F < 1
                               prep = 0.99, ηp² = 0.65 prep = 0.91, ηp² = 0.21
Modality × task                F(1, 18) = 6.4          F(1, 16) = 33.5         F(1, 18) = 8.5          F < 1
                               prep = 0.94, ηp² = 0.26 prep = 0.99, ηp² = 0.68 prep = 0.97, ηp² = 0.32
Condition (congruent,          F(2, 36) = 23.8         F(2, 32) = 18.2         F(2, 36) = 19.6         F < 1
incongruent, unimodal)         prep = 0.99, ηp² = 0.57 prep = 0.99, ηp² = 0.53 prep = 0.99, ηp² = 0.52
Congruent vs. incongruent      F(1, 18) = 167.6        F(1, 16) = 32.1         F(1, 18) = 33.7         F < 1
                               prep = 0.99, ηp² = 0.90 prep = 0.99, ηp² = 0.67 prep = 0.99, ηp² = 0.65
Congruent vs. unimodal         F(1, 18) = 6.24         F < 1                   F < 1                   F < 1
                               prep = 0.95, ηp² = 0.26
Incongruent vs. unimodal       F(1, 18) = 7.4          F(1, 16) = 21.0         F(1, 18) = 32.8         F < 1
                               prep = 0.96, ηp² = 0.29 prep = 0.99, ηp² = 0.57 prep = 0.99, ηp² = 0.65
Condition × task               F(2, 36) = 3.1          F < 1                   F < 1                   F < 1
                               prep = 0.92, ηp² = 0.15
Condition × modality           F(2, 36) = 7.3          F < 1                   F < 1                   F < 1
                               prep = 0.99, ηp² = 0.29

Table 1. Summary of ANOVAs on response times (from Figure 3) in Experiments 1 to 8.

Journal of Vision (2010) 10(1):6, 1–12 Evans & Treisman 8


facilitation only for position and pitch, whereas the difference between unimodal and incongruent showed significant interference for all three visual dimensions. There may be some baseline interference simply from having a simultaneous stimulus in another modality, which is overridden by facilitation from congruence only when the latter is very strong. Earlier experiments on Garner interference have typically used as a baseline a neutral stimulus on the irrelevant dimension rather than the absence of any irrelevant stimulus. We wanted to avoid any interaction between the modalities in the baseline condition. However, in relating our results to those of others, it may be more meaningful simply to compare congruent and incongruent pairings. To the extent that these differ, we can conclude that there is a natural correspondence between polarities in the two modalities.

The two remaining conditions, pitch with visual contrast and visual position with visual size, were analyzed separately since they showed a different pattern of results.

Correspondence between pitch and contrast

Interestingly, there was no evidence for a congruency mapping between pitch and contrast, despite the fact that the verbal labels might apply better to contrast than to size or spatial frequency.

Within-modality correspondence between position and size (Experiment 9)

Using the “direct” task only, we probed the interaction between the two visual dimensions of position and size and found no congruence effect. Thus there was no interaction between position (congruent 366 ms, SEM 6 ms; incongruent 367 ms, SEM 7 ms) and size (congruent 492 ms, SEM 11 ms; incongruent 484 ms). The only significant effect was that of visual dimension (p_rep = 0.99, ηp² = 0.92), showing that participants were faster (367 ms, SEM 7 ms) and more accurate (3% error rate, SEM 1%) in classifying the stimuli by their position than by their size (487 ms, SEM 8 ms; 8% error rate, SEM 2%).

Auditory versus visual task

RTs were faster in the visual than in the auditory modality. This was not our main interest, and we made no attempt to equate discriminability across the auditory and visual features. However, the faster responses to the visual stimuli here may have been due in part to the more natural stimulus-response mapping that we chose in this experiment for vision relative to audition. Participants have been shown to respond faster with the key on the left to lower locations and with the key on the right to higher locations, an effect sometimes called the “orthogonal Simon effect” (Lu & Proctor, 1995). We used the opposite mapping for audition in order to avoid cross-modal priming at the response level. Response compatibility may also have speeded responses to the visual stimuli in the indirect task, where we used the left key for left tilt of the Gabors and the right key for right tilt. There was no obvious compatible mapping between the auditory instruments, violin and piano, and the left or right keys. Although these different stimulus-response compatibility effects were not our main interest, we followed up on their possible role in an experiment described in Appendix A.

Overall, the auditory judgments (mean 20.0 ms) were also more strongly affected by congruence than the visual ones (mean 11.2 ms; F(1, 52) = 22.88, p_rep = 0.99, ηp² = 0.31). The results suggest some bias in selective attention toward the visual modality, making it harder to ignore when it is irrelevant. The visual classifications were always made faster than the auditory ones, which may have made them less vulnerable to interference. If the auditory stimuli are discriminated more slowly, they may become available too late to affect the visual classifications. There was no modality by task interaction in the congruence effects, meaning that the difference between direct and indirect tasks was the same for both modalities.

General discussion

Our experiments have demonstrated a familiar cross-modal correspondence between auditory pitch and visual position and further explored a bidirectional correspondence between pitch and size that had previously been reported mainly in children. The findings replicate within our paradigm those described in the Introduction section (Ben-Artzi & Marks, 1995; Bernstein & Edelstein, 1971; Melara & O’Brien, 1987; and Patching & Quinlan, 2002, for pitch with visual position; Gallace & Spence, 2006, and Marks et al., 1987, for pitch and size). They also provide the first experimental evidence of a cross-modal correspondence between pitch and spatial frequency and show no evidence of a correspondence between pitch and contrast. The absence of any effect with contrast is perhaps surprising given that Marks (1974, 1989) found interactions between pitch and brightness. Our manipulation of contrast left the average intensity constant, whereas average brightness may be what was relevant in the experiments by Marks. There were some differences in the strength of the interactions that we observed with different dimensions. Pitch with spatial position gave the largest effects. Most of the correspondences appeared to be asymmetrical, with generally stronger effects on pitch than on the visual dimensions. The asymmetries may have been due in part to differences in the speed of responses.



Faster responses would allow more time for the congruency or incongruency to affect the slower responses.

In our studies, the indirect, orthogonal task sometimes showed effects that were as large as those in the direct task, despite the fact that attention was directed to the relevant dimensions in the direct case. For both pitch with visual size and pitch with spatial frequency, the effects were as large in the indirect as in the direct task. The larger effect in the direct task with visual position may reflect the fact that the verbal labels are shared and may actively conflict when there is a mismatch. As we argued earlier, it is not clear where the congruence effects could arise in the case of spatial frequency and size, since the verbal codes do not correspond for the visual and the auditory stimuli. They are certainly automatic and independent of attention. It appears that the stimuli are represented holistically, with mismatches between the “expected” correspondences on irrelevant dimensions detracting from discriminations on the attended dimensions. Because no direct convergence between the auditory and the visual tasks exists in the indirect condition, it seems most plausible that they reflect an intrinsic correspondence at the perceptual level.

We tested for corresponding polarity matches within a single modality, between the visual dimensions of size and position. The finding that the corresponding polarity matches are not present is important because it suggests that each shares something different with auditory pitch, as if there are two or more independent mappings rather than a single representation at a more abstract level that is shared between pitch and all the visual dimensions that have cross-modal correspondences.

Other papers have reported effects of an irrelevant stimulus dimension on the speed or accuracy of response and have used them as evidence for automatic processing of the irrelevant dimension, but not as a tool to locate the effect in the system. For example, Rusconi, Kwan, Giordano, Umilta, and Butterworth (2006) showed that low tones are responded to more quickly and more accurately when the response key is low or to the left of space, and high tones when the key is high or to the right. In a version of the experiment that seems initially to parallel our indirect task, they showed that the same stimulus-response mapping benefit also occurs when the response is made to a separate and supposedly irrelevant dimension: the instrument that played the tone (wind or percussion). In this case, the mapping was orthogonal to the pitch; for example, a wind instrument could be mapped to the left key and a percussion instrument to the right key. However, when the pitch of the tone happened to be low, pressing the left key to a wind instrument was faster than when it happened to be high. A similar finding was reported earlier by Dehaene, Bossini, and Giraux (1993) for numbers, which also appear to be mapped onto response space, with small numbers giving faster responses to a key on the left and large numbers to a key on the right. Again, this spatial mapping effect was present even when the task was to judge the parity of the numbers rather than their magnitude. Thus the spatial correspondence seems to be detected automatically, but it still reflected priming of the response location by the stimulus even when the response was made to a different dimension of the stimulus.

Our indirect tasks differ from these in that there was no association between the potentially corresponding stimuli and the responses: both responses were made equally often to both stimuli. The only association was between the two stimuli in a simultaneously presented pair, for example, a high tone and a high position. The experiments of Rusconi et al. and Dehaene et al. showed priming only of responses whose locations were associated with the preferred mapping of pitch or number size. The critical difference in our experiments is not that the priming dimension is irrelevant to the task, nor that it is detected automatically rather than intentionally, although both are the case, but the fact that the responses were orthogonal to the stimuli and therefore cannot be the site of the interference or facilitation. The IAT (Greenwald, McGhee, & Schwartz, 1998) offers another example of priming from an irrelevant dimension. For example, in one version of the test, whatever response key is associated with positive words is also associated with white faces, and whatever response key is associated with negative words is also associated with black faces. The association here is between a particular semantic or emotional valence and a particular response key. Clearly the priming here is not at the perceptual level, but it does depend on a consistent mapping of stimuli to responses.

Our method also improves upon other designs previously used to understand possible levels of audio-visual congruence interactions. For example, Gallace and Spence (2006) designed an experiment in which they eliminated the possibility that the effects of an alternating irrelevant sound on disk size discrimination could be attributed to a bias in response selection by having participants report on whether two disks matched in size or not. However, in their experiment, even though participants were not labeling the size of the object as smaller or larger, they were still judging the size of the object (i.e., the feature whose potential congruence interaction was being studied) rather than an orthogonal dimension. This leaves open the possibility of verbal or semantic mediation, which was ruled out with our indirect tasks.

Why should these cross-modal interactions exist? Some of the apparently arbitrary mappings seen in synesthesia have been attributed to cross-activation between brain areas that happen to be close together (such as the color-grapheme synesthesia, linking color area V4 and the number-grapheme area in the fusiform gyrus; Ramachandran & Hubbard, 2001). Ramachandran and Hubbard attribute the many and varied correspondences found in individual synesthetes to multisensory convergence zones in the region of the temporal-parietal-occipital (TPO) junction, and might well locate our congruency effects between auditory pitch and the



various visual dimensions of position, lightness, brightness, shape, size, and spatial frequency in the same areas. Our data suggest that the natural mappings we observe between auditory pitch and the visual features of position, size, and spatial frequency may be the result of multisensory areas modulating modality-specific sensory systems’ responses to heighten the perceptual salience of congruent pairs. There is some evidence that this type of multisensory interaction can be used successfully by visual-to-auditory sensory substitution devices (e.g., vOICe; see www.seeingwithsound.com), which use the pitch of a soundscape representing a visual object to convey to blind users the position of that object in the vertical plane.

Appendix A

We conducted an additional experiment to make sure that stimulus-response compatibility did not interact with our main effect of congruency and in some way foster the results we observed. We reran Experiment 1, which showed both the largest congruence effects and the most likely potential interactions with location compatibility, counterbalancing both the response allocations and the compatibility across modalities. In total, twenty-four participants took part, completing 960 trials each. Response allocations were as follows.

Group 1: Visual compatible and auditory incompatible. Left key for visual low and auditory high; right key for visual high and auditory low.

Group 2: Visual incompatible and auditory compatible. Left key for visual high and auditory low; right key for visual low and auditory high.

Group 3: Visual and auditory compatible. Left key for visual and auditory low; right key for visual and auditory high.

Group 4: Visual and auditory incompatible. Left key for visual and auditory high; right key for visual and auditory low.

We entered the RT data into a mixed-model ANOVA with modality (auditory and visual) and condition (bimodal congruent, bimodal incongruent, and unimodal) as within-participant factors, and response allocation (Group 1, Group 2, Group 3, Group 4) as a between-participant factor. We still found the main effects of modality and congruency (F(1, 3) = 134.44, p < 0.01, and F(1, 3) = 35.27, p < 0.01, respectively) but no main effect of response allocation (compatibility), nor any interactions of compatibility with either modality or condition.
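As a sketch of how trial-level RTs are collapsed into the cells of such a mixed design before the ANOVA (one mean per participant per within-participant cell), the following uses only made-up records; the participant IDs, RT values, and record layout are hypothetical illustrations, not the study's data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records: modality and condition are
# within-participant factors; response-allocation group is
# between-participant (each participant belongs to one group).
trials = [
    # (participant, group, modality, condition, rt_ms)
    ("p01", 1, "auditory", "congruent", 512.0),
    ("p01", 1, "auditory", "incongruent", 561.0),
    ("p01", 1, "visual", "unimodal", 402.0),
    ("p02", 3, "visual", "congruent", 388.0),
    ("p02", 3, "visual", "congruent", 396.0),
]

# Collapse trials to one mean RT per participant per cell, the unit
# of analysis that a repeated-measures/mixed ANOVA operates on.
cells = defaultdict(list)
for participant, group, modality, condition, rt in trials:
    cells[(participant, group, modality, condition)].append(rt)

cell_means = {key: mean(rts) for key, rts in cells.items()}
print(cell_means[("p02", 3, "visual", "congruent")])  # prints 392.0
```

The resulting per-participant cell means would then be submitted to a statistics package for the 2 (modality) × 3 (condition) × 4 (group) mixed-model analysis reported above.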

Acknowledgments

This research was supported by NIH Grant 2004 2RO1 MH 058383-04A1, Visual Coding and the Deployment of Attention, by Grant # 1000274 from the Israeli Binational Science Foundation, and by NIH Grant 1RO1MH062331, Spatial Representations and Attention. We are grateful to Jeremy Wolfe for his helpful suggestions.

Commercial relationships: none.
Corresponding author: Karla K. Evans.
Email: [email protected].
Address: Visual Attention Lab, 64 Sidney Street, Suite 170, Cambridge, MA 02139, USA.

Footnote

1. In case there were any interactions between the visual and the auditory stimuli in Experiment 1 and the spatial locations of the response keys, such as the “orthogonal Simon” effect (Lu & Proctor, 1995), we ran a supplementary experiment (see Appendix A) to check how the response allocations affected the congruency effects that were our main topic of interest.

References

Ben-Artzi, E., & Marks, L. E. (1995). Visual–auditory interaction in speeded classification: Role of stimulus difference. Perception & Psychophysics, 57, 1151–1162. [PubMed]

Bernstein, I. H., & Edelstein, B. A. (1971). Effects of some variations in auditory input upon visual choice reaction time. Journal of Experimental Psychology, 87, 241–247. [PubMed]

Braaten, R. (1993). Synesthetic correspondence between visual location and auditory pitch in infants. Paper presented at the Annual Meeting of the Psychonomic Society.

Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433–436. [PubMed]

Dehaene, S., Bossini, S., & Giraux, P. (1993). The mental representation of parity and number magnitude. Journal of Experimental Psychology: General, 122, 371–396.

Gallace, A., & Spence, C. (2006). Multisensory synesthetic interactions in the speeded classification of visual size. Perception & Psychophysics, 68, 1191–1203. [PubMed]

Garner, W. R. (1974). The processing of information and structure. Potomac, MD: L. Erlbaum Associates; distributed by Halsted Press, New York.

Garner, W. R. (1976). Interaction of stimulus dimensions in concept and choice processes. Cognitive Psychology, 8, 98–123.



Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480. [PubMed]

Long, J. (1977). Contextual assimilation and its effect on the division of attention between nonverbal signals. Quarterly Journal of Experimental Psychology, 29, 397–414.

Lu, C. H., & Proctor, R. W. (1995). The influence of irrelevant location information on performance: A review of the Simon and spatial Stroop effects. Psychonomic Bulletin & Review, 2, 174–207.

Marks, L. E. (1974). On associations of light and sound: The mediation of brightness, pitch, and loudness. American Journal of Psychology, 87, 173–188. [PubMed]

Marks, L. E. (1987). On cross-modal similarity: Auditory–visual interactions in speeded discrimination. Journal of Experimental Psychology: Human Perception and Performance, 13, 384–394. [PubMed]

Marks, L. E. (1989). On cross-modal similarity: The perceptual structure of pitch, loudness, and brightness. Journal of Experimental Psychology: Human Perception and Performance, 15, 586–602. [PubMed]

Marks, L. E., Hammeal, R. J., & Bornstein, M. H. (1987). Perceiving similarity and comprehending metaphor. Monographs of the Society for Research in Child Development, 52, 1–102. [PubMed]

Martino, G., & Marks, L. E. (1999). Perceptual and linguistic interactions in speeded classification: Tests of the semantic coding hypothesis. Perception, 28, 903–923. [PubMed]

Melara, R. D. (1989). Dimensional interaction between color and pitch. Journal of Experimental Psychology: Human Perception and Performance, 15, 69–79. [PubMed]

Melara, R. D., & Marks, L. E. (1990). Processes underlying dimensional interactions: Correspondences between linguistic and nonlinguistic dimensions. Memory & Cognition, 18, 477–495. [PubMed]

Melara, R. D., & O’Brien, T. P. (1987). Interaction between synesthetically corresponding dimensions. Journal of Experimental Psychology: General, 116, 323–336.

Mondloch, C. J., & Maurer, D. (2004). Do small white balls squeak? Pitch–object correspondences in young children. Cognitive, Affective, & Behavioral Neuroscience, 4, 133–136. [PubMed]

Patching, G. R., & Quinlan, P. T. (2002). Garner and congruence effects in the speeded classification of bimodal signals. Journal of Experimental Psychology: Human Perception and Performance, 28, 755–775. [PubMed]

Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437–442. [PubMed]

Ramachandran, V. S., & Hubbard, E. M. (2001). Psychophysical investigations into the neural basis of synaesthesia. Proceedings of the Royal Society B: Biological Sciences, 268, 979–983. [PubMed]

Rusconi, E., Kwan, B., Giordano, B. L., Umilta, C., & Butterworth, B. (2006). Spatial representation of pitch height: The SMARC effect. Cognition, 99, 113–129. [PubMed]

Smith, L. B., & Sera, M. D. (1992). A developmental analysis of the polar structure of dimensions. Cognitive Psychology, 24, 99–142. [PubMed]

Wagner, S., Winner, E., Cicchetti, D., & Gardner, H. (1981). “Metaphorical” mapping in human infants. Child Development, 52, 728–731.

Welch, R. B., & Warren, D. H. (1986). Intersensory interactions. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance (Vol. 1, pp. 251–256). New York: John Wiley and Sons.
