Audio in VEs
Ruth Aylett

Lectures on Virtual Environment Development L6



Virtual Reality development is taking the world by storm. Follow all 16 lecture notes to learn how to build your own VR experiences. - By Ruth Aylett, Professor of Computer Science at Heriot-Watt University


Page 1: Lectures on Virtual Environment Development L6

Audio in VEs

Ruth Aylett

Page 2

Use of audio in VEs

Important but still under-utilised channel for HCI, including virtual environments.

Speech recognition for hands-free input
All computers now have sound output
– At least a beep
– Usually CD-quality stereo sound

Conventional stereo places a sound in any spot between the left and right loudspeakers.
– In true 3D sound, the source can be placed in any location: right or left, up or down, near or far.

Page 3

Potential uses

Associate sounds with particular events
Associate sounds with static objects
Associate sound with the motion of an object
Use localised sound to attract attention to an object
Use ambient sounds to add to the feeling of immersion
Use sounds to add to the feeling of realism
Use speech to communicate with devices or avatars
Use sound as a warning or alarm signal

Page 4

Overall impact

High quality audio provides:
– Increased realism
• Reinforces visuals
– A strong immersive sense
• The world exists beyond the part that is seen
– Strong positional cues
– Extra information about the environment
– The shape of the world

What does ‘High Quality’ mean?

Page 5

VR sound environment

VR equipment creates a difficult sound environment.

CAVE
– Stand in a glass box
– Pretend you can’t hear the echoes
– Also hard to place speakers

Semi-immersive VR theatre
– Sit in a big cylinder
– Reflects sound in a very strange way.

Page 6

Workbench

Not so bad, but still...
– Big flat screen 1 metre in front of you
– Sound coming from surround speakers
– Creates echoes inappropriate for the scene

Much VR audio is based on very high quality headphones
– Use head tracking to get position and orientation
– Play to the user
– Problem solved! Well, no

Page 7

Stereo sound

[Diagram: a monaural source sent to the L and R speakers produces a phantom source between them]

In the entertainment industry, stereo was the first successful commercial product involving spatial sound.

To place a sound on the left, send its signal to the left loudspeaker; to place it on the right, send its signal to the right loudspeaker.

Page 8

Stereo techniques

If the same signal is sent to both speakers*, the phantom source seems to originate from a point midway between them.

Crossfading the signal from one speaker to the other gives the impression of the source moving continuously between the two positions.

Simple crossfading cannot create the impression of a source outside the line segment between the speakers.

Can also shift the location of the phantom source by exploiting the precedence effect (delay).

*and if the speakers are wired "in phase", the listener is more or less midway between the speakers, and the room is not too acoustically irregular
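Crossfading can be sketched in a few lines of code (an illustrative example, not from the lecture; the function name and the constant-power pan law are my own choices). A constant-power law keeps the perceived loudness roughly steady as the phantom source sweeps between the speakers:

```python
import math

def constant_power_pan(pan):
    """Constant-power pan law.

    pan is in [-1, 1]: -1 = hard left, +1 = hard right, 0 = centre.
    Returns (left_gain, right_gain) to apply to the same mono signal.
    Equal gains at pan = 0 place the phantom source midway between
    the speakers; sweeping pan crossfades it between them.
    """
    angle = (pan + 1.0) * math.pi / 4.0   # map [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)

# Centre: equal signal to both speakers gives a phantom source midway
left, right = constant_power_pan(0.0)
```

As the slide notes, no choice of gains here can place the source outside the segment between the two speakers.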

Page 9

A world of sound

You are surrounded by sound all the time
– Silence is unheard of!

The environment affects (shapes?) the sound you hear
– Size, shape, materials

Page 10

Rendering sound: auralisation

To generate correct echoes we must model sound behaviour in the space
– Rooms are complex
– Filled with different materials
• Reflective, absorbent, frequency-filtering

Just like rendering light

Page 11

Simulation of Room Characteristics

[Diagram: direct sound, early reflections, and the reverberant field]
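The three components can be illustrated with a toy impulse response (a hypothetical Python sketch, not a real auralisation model; all delay times and gains are invented): an initial impulse for the direct sound, a few discrete early reflections, and a decaying noise tail for the reverberant field:

```python
import random

def toy_room_impulse_response(fs=8000, length_s=0.5):
    """Toy room impulse response with the three components shown:
    direct sound, discrete early reflections, reverberant field.
    Delay times and gains are made up purely for illustration."""
    n = int(fs * length_s)
    h = [0.0] * n
    h[0] = 1.0                                 # direct sound
    for delay_ms, gain in [(12, 0.6), (23, 0.45), (31, 0.3)]:
        h[int(fs * delay_ms / 1000)] += gain   # early reflections
    rng = random.Random(0)
    start = int(fs * 0.05)                     # diffuse tail begins ~50 ms in
    for i in range(start, n):                  # reverberant field: decaying noise
        h[i] += rng.gauss(0.0, 0.2) * 0.5 ** (i / (fs * 0.1))
    return h
```

Convolving a dry source signal with such a response is what auralisation does: it adds the room's echoes and reverberation to the sound.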

Page 12

Over time

Page 13

Putting sound into a VE

What sound?

‘Ambient’ sounds
– ‘Surround’ sound
– Often use recorded sounds

Positional sounds
– Designed to give a strong sense of something happening in a particular place
– Also often provided by using recorded sounds

Page 14

Positional sound

Using sound to create the sense of active things in the environment
– Enhances presence
– Enhances immersion

Need to deal with many components
– Reflections (echoes)
– Diffraction effects

Page 15

City models

VisClim: scene in Linköping’s Storatorget

Surrounding environment
– Vehicles?
• Several roads nearby
– People?
• Many people in the square
– Weather noise effects
• Rainfall
• Snowfall (no sound, but a damping effect)

Page 16

VisClim

Page 17

Air Traffic control

Simulation
– No ‘ambient’ sound required
– No aircraft noises
– No realism wanted?

Positional warnings?
– Designed to draw the user’s attention to the location of a problem
– Which may be out of the field of view

Page 18

What it looked like

Page 19

Creating positional sound

Amplitude– (or more)

Synchronisation– Audio delays

Frequency– Head-Related Transfer Function (HRTF)

Page 20

Amplitude

Generate audio from positioned sources
Calculate amplitude from distance
Include damping factors
– Air conditions
– Snow
– Directional effect of the ears
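A minimal distance-amplitude sketch (Python; the function name and the simple 1/d law are my own illustrative choices, with an optional per-metre damping term standing in for effects like air conditions or snow):

```python
def distance_gain(distance_m, extra_damping_db_per_m=0.0):
    """Amplitude gain for a source at the given distance.

    Uses the inverse-distance law (amplitude falls as 1/d, i.e. 6 dB
    per doubling of distance) relative to a 1 m reference, plus an
    optional per-metre damping term in dB for media such as snowy
    ground or humid air. Values are invented for illustration.
    """
    d = max(distance_m, 1.0)   # clamp inside the 1 m reference distance
    gain = 1.0 / d             # 1/d amplitude falloff
    gain *= 10.0 ** (-extra_damping_db_per_m * d / 20.0)
    return gain
```

A directional ear model would multiply in a further angle-dependent factor; the slide lists it as one more damping term.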

Page 21

Synchronisation

Ears are very precise instruments
Very good at hearing when something happens after something else
– Sound travels slowly (c. 340 m/s in air): a different distance to each ear
Use this to help define direction
– Difference in amplitude gives only very approximate direction information

Page 22

Speed effect

30 centimetres ≈ 0.0009 seconds (0.3 m ÷ 340 m/s)

Humans can hear differences of ≤ 700 µs
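The interaural time difference behind these numbers can be sketched as follows (Python; the simple d·sin(azimuth) path-difference model is an assumption for illustration and ignores the head's actual shape):

```python
import math

SPEED_OF_SOUND = 340.0  # m/s in air, as quoted in the lecture

def interaural_time_difference(azimuth_deg, ear_separation_m=0.3):
    """Interaural time difference for a distant source.

    azimuth_deg: 0 = straight ahead, +90 = directly to the right.
    The extra path to the far ear is roughly d * sin(azimuth), so
    the delay is that path difference over the speed of sound.
    """
    path_diff_m = ear_separation_m * math.sin(math.radians(azimuth_deg))
    return path_diff_m / SPEED_OF_SOUND   # seconds; sign encodes the side
```

A source directly to one side of a 30 cm head gives about 0.00088 s, comfortably above the roughly 700 µs resolution quoted on the slide.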

Page 23

What is 3D sound?

Able to position sounds all around a listener.
Sounds created by loudspeakers/headphones are perceived as coming from arbitrary points in space.

Conventional stereo systems generally cannot position sounds to the side, rear, above, or below.

Some commercial products claim 3D capability, e.g. stereo multimedia systems marketed as having “3D technology”. But this is usually untrue.

Page 24

3D positional sound

Humans have stereo ears
Two sound pulse impacts
– One difference in amplitude
– One difference in time of arrival

How is it that a human can resolve sound in 3D?
– Should only be possible in 2D?

Page 25

Frequency

Frequency responses of the ears change in different directions
– Role of the pinnae
– You hear a different frequency filtering in each ear
– Use that data to work out 3D position information

Page 26

Head-Related Transfer Function

Unconscious use of time delay, amplitude difference, and tonal information at each ear to determine the location of the sound.
– Known as sound localisation cues.
– Sound localisation by human listeners has been studied extensively.

The transformation of sound from a point in space to the ear canal can be measured accurately
– Head-Related Transfer Functions (HRTFs).

Measurements are usually made by inserting miniature microphones into the ear canals of a human subject or a manikin.

Page 27

HRTFs

HRTFs are 3D
– Depend on ear shape (pinnae) and the resonant qualities of the head!
– Allows positional sound to be 3D

Computationally difficult
– Originally done in special hardware (the Convolvotron)
– Can now be done in real time using DSP

Page 28

HRTFs

First series of HRTF measurement experiments in 1994 by Bill Gardner and Keith Martin, Machine Listening Group at MIT Media Lab.

Data from these experiments made available for free on the web.

Picture shows Gardner and Martin with the dummy used for the experiment, called a KEMAR dummy.

A measurement signal is played by a loudspeaker and recorded by the microphones in the dummy head.

Page 29

HRTFs

Recorded signals are processed by computer, which derives two HRTFs (left and right ears) corresponding to the sound source location.
– An HRTF typically consists of several hundred numbers
– describes the time delay, amplitude, and tonal transformation for a particular sound source location to the left and right ears of the subject.

The measurement procedure is repeated for many locations of the sound source relative to the head
– giving a database of hundreds of HRTFs describing the sound transformation characteristics of a particular head.

Page 30

HRTFs

Mimic the process of natural hearing
– reproducing the sound localisation cues at the ears of the listener.

Use a pair of measured HRTFs as the specification for a pair of digital audio filters.

Sound signal processed by the digital filters and listened to over headphones
– Reproduces the sound localisation cues for each ear
– the listener should perceive the sound at the location specified by the HRTFs.

This process is called binaural synthesis (binaural signals are defined as the signals at the ears of a listener).
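A minimal sketch of binaural synthesis (Python; the toy two- and three-tap "HRIRs" are invented for illustration, whereas real measured HRTFs have several hundred coefficients per ear):

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution (an FIR digital filter)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binaural_synthesis(mono, hrir_left, hrir_right):
    """Filter one mono source with the left- and right-ear impulse
    responses (time-domain HRTFs) measured for the desired source
    location; the resulting pair is played over headphones."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy example: for a source on the listener's left, the right ear
# hears a delayed, quieter copy (filter values are invented).
left, right = binaural_synthesis([1.0, 0.5], [1.0], [0.0, 0.0, 0.6])
```

In practice the convolution is done with DSP hardware or FFT-based filtering; the direct loop above is only to show what the filters compute.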

Page 31

HRTFs

You should be able to describe the process involved ingenerating true 3D audio using HRTFs.

Page 32

The problem

Rendering audio is really, really hard
Much bigger problem than lighting
Material properties are more complex
– Can’t fake it as easily
– Properties are always a problem

Good methods exist, but the problem is too computationally hard for these to be in general use at present

Page 33

What is possible now

The constraint is real-time audio rendering
– Must adapt to a dynamic user who moves unpredictably

Simple (reflectionless) stereo positional sound
– Using amplitude
– Using synchronisation
– Using HRTF frequency filtering

Useful for audio cues and simple environmental sounds

Page 34

What about surround sound?

The principal format for digital discrete surround is the "5.1 channel" system.

The 5.1 name stands for five channels (in front: left, right and centre; behind: left surround and right surround) of full-bandwidth audio (20 Hz to 20 kHz)
– a sixth channel at times contains additional bass information to maximise the impact of scenes such as explosions, etc.
– This channel has a narrow frequency response (3 Hz to 120 Hz), and is thus sometimes referred to as the ".1" channel.

Page 35

What about surround sound?

Surround sound systems are NOT true 3D audio systems - just a collection of more speakers.

Various commercial surround sound formats - for home entertainment, Dolby is the big name.
– Dolby Surround Digital.

Lots of other proprietary approaches - e.g. the BattleChair (pictured).

Page 36

Dolby Headphone

Dolby Headphone: based on a proprietary algorithm, presumably similar to HRTFs, originally developed by the Australian company Lake Technology
– attempts to produce convincing surround-sound effects through ordinary stereo headphones.

Technology originally developed for VR and tele-conferencing applications but not marketed for consumer applications.

A more genuine 3D audio system was developed by the UK company Sensaura.

Page 37

Voice interaction

Voice input for control– Continuous? Discrete?

Voice output for information– Positional - alerts– Non-positional - ‘voice over’– Character-based - social channel

Page 38

Voice output

Voice synthesis
– Computer strings together a set of phonemes (basic language sound units)
– Problems with articulation: sounds robotic

Unit selection voices
– Uses a large database created from real voices
– Plus a sophisticated algorithm for putting the bits together
– Good results, but needs very large memory (1 gigabyte) to hold the database
– Takes lots of time and expertise to create a ‘voice’