Richard Swales U1157958

Richard Swales – Edit Point ADR Sync

An Investigation Into The Importance and Necessity of Automated Dialogue Replacement Synchronisation Across Edit Points in Film.

By

Richard Swales

U1157958

Project 1: Literature Review and Organisation

A Report

Submitted in Requirement for

The Degree of BSc (Hons) Popular Music Production

University of Huddersfield

1st Supervisor Mr. Braham Hughes

2nd Supervisor Mr. Austin Moore

05/12/14

NHE 2440



Abstract:

Automated dialogue replacement (ADR) is a time-consuming process. One of its main goals, after capturing an actor’s performance, is to reproduce accurate audio-to-video synchronisation. Technologies exist to aid this, and many published tolerance values indicate how far out of sync the audio can be before the error becomes unacceptable. A camera edit point, or cut point, is a point at which the published synchronisation tolerances may not apply, because of the abrupt change to a new scene or camera angle, and so the tolerance values may need to be adjusted accordingly. This project presents current synchronisation tolerances as published by leading broadcast institutions and examines how synchronisation errors are created and dealt with in the film and television broadcasting industries. This research leads into a number of tests, introduced in this report, to find out whether the current tolerances apply to edit points and to present new tolerances for edit point ADR synchronisation.



Contents

Chapter 1: An Introduction to the Project
1-1: Automated Dialogue Replacement
1-2: A Brief Introduction to Synchronisation Tolerances
1-3: Structural Overview of the Project

Chapter 2: Synchronisation Tolerances and Incorporating them into Film and Television
2-1: Sync/Delay Problems in the Broadcast Chain
2-2: Current Sync Tolerances and Acceptance Rates
2-3: Tolerances at Work in the Film and Television Industries
2-4: Working Outside the Synchronisation Limits
2-5: Tools to Aid ADR Synchronisation
2-6: Dialogue Specific Research

Chapter 3: Synchronisation Error Detection Tests
3-1: Methods used for Synchronisation Error Detection Tests
3-2: Testing Stimuli
3-3: Test 1
3-4: Test 2
3-5: Test 3

Appendices

Reference List



Chapter 1 – An Introduction to the Project

1-1 Automated Dialogue Replacement.

The process of automated dialogue replacement, or ADR (known by several names; see Appendix 1), can easily introduce synchronisation errors in postproduction. There are numerous reasons for recording ADR, such as technical problems, perspective and voice quality, acting problems and line changes (Purcell, 2007, p.278), but usually it is needed when the on-set sound is deemed unusable because background noise masks the dialogue line. On-set sound is always preferred to ADR as it captures the actor’s emotions and performance on set, which can be difficult to reproduce in an ADR studio, and ADR can “kill the charm” (Purcell, 2007, p.278) of an actor’s performance. The original sound does not require sync editing to the same extent as ADR, because sync points, such as clapperboards, are used to ensure the audio is in sync with the video. ADR is also used to record narration and non-dialogue sounds, such as heavy breathing, but these do not require sync editing and so are disregarded in this project. Reproducing the sync is one of ADR’s main goals after capturing the actor’s performance. It is difficult to achieve, and often the re-recorded dialogue has to be edited to match the originally filmed mouth movements. This is a time-consuming process which could be re-evaluated by looking at sync tolerances and the point at which viewers can spot sync errors. Sometimes a sync offset is purposely introduced to film to accentuate styles of language. For instance, Michel Chion (1994, p.65) explains that in French cinema there is a “tight and narrow synchronization”, whereas Italian film sync is “off by a tenth of a second or so” to take “into consideration the totality of the speaking body” (Chion, 1994, p.65). Generally sync is tight to the movements, creating a more believable scene for the audience, and this is key as “the moment the sound and picture appear out of sync, the audience is immediately taken out of the fantasy” (Viers, 2012, p.159).

1-2 A Brief Introduction to Synchronisation Tolerances.

“Some sound editors claim that they can spot sync errors at half a frame” (Linear Acoustics, 2004), and some at even less than that, but generally there are wider tolerances, tested by leading authorities, which give a window of acceptance. The number of bodies that have researched sync tolerances and acceptance levels gives an idea of the scale of the problem caused by sync errors. Much of this research has been conducted in television and broadcasting, which shows that the problem extends across the film and television industries. The television and broadcasting research is still relevant here, as human sync perception does not change across formats and the limits set are still adhered to in the film industry. This project aims to look at these sync tolerances around edit points in films, as this is an area where time may be saved by not having to edit sync closely. Some sync errors across edit points are unavoidable because editors in the cutting room do not always pay close attention to the action across a cut. This means that there are sometimes double movements, which sound editors have to smooth over or else accept some questionable sync.

1-3 Structural Overview of the Project.

Chapter 2 presents the current sync tolerances given by a number of leading broadcast institutions, outlining which are regarded as the standard and which are referred to in the film and television industries.

Chapter 3 outlines the tests and experiments to be carried out to determine whether these synchronisation tolerances are wider or narrower around edit points in film. The results will be presented in the research report.



Chapter 2 – Synchronisation Tolerances and Incorporating them into Film and Television

2-1 Sync/Delay Problems in the Broadcast Chain.

As previously mentioned, a number of sync tolerances have been proposed by authorities and leading bodies associated with television broadcasting. These bodies aim their research at synchronisation errors caused in the broadcast chain, rather than errors introduced in postproduction. The broadcast chain involves a number of steps to get programmes from the TV studio to television sets for audiences to view. These steps are as follows:

Camera

Studio/Outside Broadcast

Codec

Compilation Station

Codec

Local Station

Emission Codec

Local Transmitter

Television Set

Audience

(Waddell, Jones, Goldberg, undated)

Editing itself also introduces latency to audio and video. When audio is processed with tools such as EQ and compressors, the delay imposed on the signal is negligible: latency usually stays “under 1ms, in the digital domain, falling to microseconds in the analog domain” and so “no compensating video delay needs to be added” (Linear Acoustics, 2004). Video, on the other hand, takes more time to process and “delay is inevitable” (Linear Acoustics, 2004). This means that delay compensation has to be incorporated in video devices to keep the sync error under control. “Each digital audio and video component in the chain from production to reception imposes some degree of latency on the signals passing through it” (Advanced Television Systems Committee [ATSC], 2003). At each step in the chain, the relative delay between audio and video must be attended to so that the overall timing error is kept to a minimum. The limits on this error are explored in this project.

2-2 Current Sync Tolerances and Acceptance Rates.


The European Broadcasting Union (EBU, 2007) carried out “subjective tests of the relative delays at which failure of the synchronism between lip movements and speech becomes perceptible to 50% of observers”. The results of these tests showed a tolerance of +40ms to -60ms, where the positive figure is audio before the visual and the negative figure is audio after the visual. The EBU also states that “the accuracy of A/V synchronization at each stage should lie within the range of audio 5ms early to 15ms late” (European Broadcasting Union [EBU], 2007), but this applies to individual stages and not to detection by the audience at the end of the chain. The Advanced Television Systems Committee (ATSC) has produced figures for the “end to end DTV audio-video production, distribution and broadcast system” (ATSC, 2003). It states that “the sound program should never lead the video program by more than 15 ms, and should never lag the video program by more than 45ms” (ATSC, 2003), which is +15ms to -45ms under the sign convention used here. However, these figures are aimed at digital television (DTV) broadcasting, and the ATSC dismisses the results given by the International Telecommunication Union (ITU, 1998), which are regarded as the standard figures for synchronisation detection and acceptability.

These figures from the International Telecommunication Union (ITU, 1998) (see Figure 1) are widely referenced in papers on synchronisation detection; however, little is known about the tests carried out to obtain them. The publication explains that the research consisted of “subjective evaluation undertaken in Japan, Switzerland and Australia” (International Telecommunication Union [ITU], 1998), but this is the only information given on the tests. The results of these subjective tests show a detectability threshold of “+45ms to -125ms” and an acceptability threshold of “+90ms to -185ms on the average” (ITU, 1998).
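The two ITU ranges quoted above can be read as three bands. The following is a minimal sketch of that reading, using the sign convention of this report (positive is audio before the visual, negative is audio after the visual). The function name `classify_offset` and the choice to treat the boundary values as inside each band are assumptions made for illustration and are not taken from the ITU publication.

```python
# Thresholds quoted in this report from ITU (1998), in milliseconds.
# Sign convention of this report: positive = audio before the visual,
# negative = audio after the visual.
DETECT_EARLY_MS = 45.0
DETECT_LATE_MS = -125.0
ACCEPT_EARLY_MS = 90.0
ACCEPT_LATE_MS = -185.0


def classify_offset(offset_ms: float) -> str:
    """Place an audio offset in one of three bands (hypothetical helper).

    Boundary values are treated as inside the band, which is an assumption.
    """
    if DETECT_LATE_MS <= offset_ms <= DETECT_EARLY_MS:
        return "undetectable"
    if ACCEPT_LATE_MS <= offset_ms <= ACCEPT_EARLY_MS:
        return "detectable but acceptable"
    return "unacceptable"


for offset_ms in (0, 60, -100, -150, -200):
    print(f"{offset_ms:+5d} ms: {classify_offset(offset_ms)}")
```

The asymmetry of the ranges is visible in use: an offset of 60ms is detectable when the audio is early, whereas an offset of 100ms is still undetectable when the audio is late.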



Figure 1: Detectability and Acceptability Thresholds. (International Telecommunication Union [ITU], 1998)

The figure of -185ms has been reproduced in a separate set of tests published by the Institute of Electrical and Electronics Engineers (IEEE). Younkin and Corriveau (2008) published a figure of “-185.19ms” from research aimed at lip-sync detection. This casts doubt on the tolerances given by the ATSC, which are much narrower, but the authors go on to say that “a direct comparison does not lend to include differences in methodology, processing, or specific conditions” (Younkin and Corriveau, 2008). This in turn counters the ATSC’s dismissal of the ITU results, as the two sets of tests cannot be compared directly. The IEEE research clearly states that “The goal is to establish the detection threshold of lip-sync errors” (Younkin and Corriveau, 2008), which separates it from the ATSC findings and makes it appear a clearer, more trustworthy tolerance. Another reason the figures from Younkin and Corriveau (2008) are trusted is that they match those presented by the ITU in 1998, being within 0.19ms of the negative value. Throughout all of these published results, the tolerance for audio before visual (the positive value) is always smaller than the tolerance for audio after visual (the negative value), and the range is therefore asymmetric. This is a “consequence of human acclimation to the laws of physics, which set the speed of light and sound to be widely different” (Cugnini, 2010). Put simply, “we are accustomed to sound arriving a bit late. It goes against nature to hear something before we see it.” (Purcell, 2007). Light travels at approximately 299,792,458 metres per second and sound at approximately 340.29 metres per second, which is why, when we watch fireworks, we often see the explosion before we hear it. On this point, Mason and Salmon (2009) add that “sound caused by an event will always reach an observer later than light from that event” and go on to say that “correct synchronisation is achieved by presenting the sound later than the image”, although this depends on a number of factors, such as acquisition equipment, frame rate and camera types (Mason and Salmon, 2009).
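The size of this natural delay can be estimated from the two speeds quoted above. The following is a rough sketch only: the function name `arrival_gap_ms` and the example distances are illustrative assumptions, and the speed of sound is taken as the fixed 340.29 metres per second used in this report, whereas in practice it varies with air temperature.

```python
SPEED_OF_SOUND_M_PER_S = 340.29         # value quoted in this report
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # value quoted in this report


def arrival_gap_ms(distance_m: float) -> float:
    """Time by which sound trails light over distance_m, in milliseconds."""
    sound_s = distance_m / SPEED_OF_SOUND_M_PER_S
    light_s = distance_m / SPEED_OF_LIGHT_M_PER_S
    return (sound_s - light_s) * 1000.0


# Example distances in metres, chosen purely for illustration.
for distance_m in (1, 5, 15, 340.29):
    print(f"{distance_m:>7} m -> {arrival_gap_ms(distance_m):8.2f} ms")
```

Under these assumptions, a source 15 metres away is heard roughly 44ms after it is seen, so audio that lags the picture by a few tens of milliseconds resembles everyday experience, while audio that leads the picture has no natural equivalent.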

2-3 Tolerances at Work in the Film and Television Industries

In response to the ITU research figures, Linear Acoustics, a maker of professional-grade broadcast equipment used by networks such as NBC, claims that “this range is probably far too wide for truly acceptable performance, and tighter tolerances are generally obeyed” (Linear Acoustics, 2004). It does not say why the range is too wide, and the claim has not been broadly considered in work on synchronisation tolerances. It may, however, hold some truth. The BBC released a technical recommendation containing standards agreed by the BBC, BSkyB, BT Sport, Channel 4, Channel 5, ITV and S4C (British Broadcasting Corporation [BBC], 2014). It contains technical specifications, picture and sound quality requirements, and delivery requirements deemed acceptable by the broadcasters listed. Page 21 of this specification reads, “the relative timing of sound to vision should not exhibit any perceptible error. Sound must not lead or lag the vision by more than 5ms.” (BBC, 2014), which is minuscule in comparison to the figures presented by the ITU. However, if these figures are compared with the “range of audio 5ms early to 15ms late” (European Broadcasting Union [EBU], 2007), a value given for each stage of the broadcast chain, then they do match up to a degree. Sound and sync editing is one of the first stages in the broadcast chain, and so the ±5ms working limit fits in here.

2-4 Working Outside the Synchronisation Limits

There are some situations where it is hard to work within these very tight specifications. Tom Hobbs, Delivery Assistant at Films at 59 in Bristol, explained that delivered films have to include “a record report that accompanies the final master to the broadcasters. Any noticeable cheated sync (dialogue used from other takes etc.) or ADR will be added to this report with relevant timecodes to let them know we are aware of it and that it cannot be improved.” (Hobbs, personal communication, November 3rd, 2014). Keeping to this industry requirement is therefore sometimes hard to do. Vanessa Theme Ament, author of The Foley Grail, says, “the editor will only be concerned if the sync is noticeably early or late” (Ament, 2009, p.136), which gives a sense of ‘if it looks OK then it is OK’.



2-5 Tools to Aid ADR Synchronisation

Tom Hobbs, from Films at 59, also spoke about a software plugin called VocAlign by Synchro Arts. VocAlign aids editors in syncing ADR recordings to the original dialogue captured on set. It takes the waveform of the original audio, which might be distorted or corrupted in some way, and matches the ADR recording’s waveform to it, helping to bring phrasing, timing and sentence flow closer to the original. However, this only works if the original audio is largely intact and only partly unusable; if the track has a lot of background noise, VocAlign will struggle to read the waveform. There is therefore always a need for manual sync editing, and so a need for sync tolerances.

2-6 Dialogue Specific Research

A study into television sync error detection was conducted by Byron Reeves and David Voelker (1993), in which they presented three versions of six television segments with varying amounts of sync error: 0, 2.5 and 5 frames. After each segment, viewers evaluated the speaker’s dialogue and were asked whether they could spot sync errors (Reeves and Voelker, 1993). The figures were not presented in milliseconds, so it is assumed that they refer to a standard television frame rate, although this varies with the video standard. “NTSC video is usually said to run at approximately 30 fps, and PAL runs at 25 fps.” (Adobe, 2011). The figures from the research conducted by Reeves and Voelker (1993) are presented in milliseconds here:

2.5 fields (frames) at NTSC (approximately 33.3ms per frame) = approximately 83ms

5 fields (frames) at NTSC (approximately 33.3ms per frame) = approximately 167ms

2.5 fields (frames) at PAL (40ms per frame) = 100ms

5 fields (frames) at PAL (40ms per frame) = 200ms
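The conversions above follow from dividing the number of frames by the frame rate. The following is a minimal sketch of that arithmetic; the function name `frames_to_ms` is an assumption for illustration. It also includes the 30000/1001 (about 29.97) frames per second rate at which NTSC colour video actually runs, to show that the rounded results are the same as for the nominal 30 frames per second used in this report.

```python
def frames_to_ms(frames: float, fps: float) -> float:
    """Convert a sync error in frames to milliseconds at a given frame rate."""
    return frames * 1000.0 / fps


# Frame rates to compare; the labels are descriptive only.
RATES_FPS = {
    "NTSC (nominal 30 fps)": 30.0,
    "NTSC (30000/1001 fps)": 30000 / 1001,
    "PAL (25 fps)": 25.0,
}

for label, fps in RATES_FPS.items():
    for frames in (2.5, 5):
        print(f"{frames:>3} frames at {label}: {frames_to_ms(frames, fps):6.1f} ms")
```

As in the list above, each unit is treated as a whole frame; this is an assumption of this report about the Reeves and Voelker (1993) figures.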

It is assumed that this experiment was carried out under the NTSC standard, as it took place at Stanford University in California, USA, which is in the NTSC region. The conclusion of their results shows that “Viewers can accurately tell when a television segment is in perfect synch, and when it is 5 fields out of synch. Viewers cannot accurately tell the same segments are 2.5 fields out of synch” (Reeves and Voelker, 1993), with a field treated in this report as equivalent to one frame. Converted into milliseconds, these results show that viewers can spot sync errors at 167ms but not at 83ms. No values were tested between these two points, so it is hard to find the cut-off at which viewers can and cannot detect errors. The paper by Reeves and Voelker (1993) states that sync errors are not detected at 100ms, which is under the 125ms detection value given by the ITU (1998). These results are of interest because they are based on the detection of sync error in dialogue specifically, rather than the general sync assumed in the research conducted by the ITU, EBU, ATSC and IEEE. This is the only source found that looks into the specific area of dialogue sync detection, an area in which many gaps remain. This project aims to fill some of these gaps in dialogue-specific sync detection limits.



Chapter 3 – Synchronisation Error Detection Tests

3-1 Methods used for Synchronisation Error Detection Tests

There will be a number of separate tests conducted to determine whether the synchronisation tolerances discussed apply

to ADR sync editing around edit points in films. These will consist of:

Test 1: Testing current sync tolerances to see if they still apply to the stimuli and testing conditions provided.

Test 2: Using the thresholds from Test 1 and applying them to edit point sync to gauge whether the tolerance ranges still

apply around edit points.

Test 3: Taking sync error past the thresholds around edit points and testing if there is a new detectability limit for edit

points.

3-2 Testing Stimuli

A short film will be created for the purposes of testing. It will have a very simple setting of two people having a conversation. A camera will be focused on both of them, showing their mouth movements in detail. Directional microphones will be placed in front of them and connected straight into the camera equipment. This arrangement is known as a single system, in which “sound sync is achieved as the image is recorded” (Viers, 2012, p.159), and so provides an in-sync starting point from which sync errors can be introduced. The only downside to using a single system is that “cameras are manufactured to produce great images, not necessarily great sound” (Viers, 2012, p.164), but the choice of camera will be explored to ensure the audio quality does not suffer too much. The video will be edited in professional-standard video editing software and then imported into a professional digital audio workstation, where the audio will be separated in preparation for introducing sync errors. All of the tests in the series will use the staircase method, as seen in Younkin and Corriveau’s (2008) research on lip-sync detection. Starting at 0ms, or perfect sync, the error will be gradually increased from film to film until the viewer detects a synchronisation error. The error will then be decreased until the viewer reports that it is no longer apparent. The mean of these two values will be taken as the threshold of detection. This will be repeated with a number of viewers to collect enough results to find an accurate average detection threshold. The placement of the errors will change between tests to switch the area of concentration from general dialogue sync error to edit point sync error.
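The staircase procedure described above can be sketched as follows. This is an illustration under stated assumptions and not the procedure used by Younkin and Corriveau (2008): the function name `staircase_threshold`, the 5ms step, the 300ms ceiling and the simulated viewers are all hypothetical, and a real viewer’s responses would replace the `detects` callback. Only the size of the error is modelled, so audio-early and audio-late runs would be carried out separately.

```python
from statistics import mean
from typing import Callable


def staircase_threshold(
    detects: Callable[[float], bool],
    step_ms: float = 5.0,   # hypothetical step size
    max_ms: float = 300.0,  # hypothetical ceiling for the ascending run
) -> float:
    """Estimate one viewer's detection threshold in milliseconds."""
    # Ascending run: increase the error from perfect sync until it is detected.
    error_ms = 0.0
    while not detects(error_ms):
        error_ms += step_ms
        if error_ms > max_ms:
            raise ValueError("no detection within the tested range")
    first_detected_ms = error_ms

    # Descending run: decrease the error until it is no longer reported.
    while error_ms > 0.0 and detects(error_ms):
        error_ms -= step_ms
    no_longer_detected_ms = error_ms

    # The threshold is the mean of the two turning points.
    return (first_detected_ms + no_longer_detected_ms) / 2.0


# Simulated viewers who report any error at or above a fixed limit.
# These limits are made-up values, not experimental results.
viewer_limits_ms = [90.0, 110.0, 130.0]
thresholds = [
    staircase_threshold(lambda error, limit=limit: error >= limit)
    for limit in viewer_limits_ms
]
print(thresholds, mean(thresholds))
```

With simulated viewers of this kind the two turning points are always one step apart, so each estimate falls half a step below the viewer’s limit; real viewers are expected to be less consistent, which is why results are averaged across several viewers.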



3-3 Test 1

This test is to be conducted to determine whether the sync tolerances discussed still apply in the context of the testing medium and environment. The detectability thresholds given by the ITU (1998) and the IEEE (2008) will be adhered to for the duration of testing, because these are the only two sources found whose figures closely match, and because the ITU research is the most commonly referenced in work on synchronisation tolerances. These are as follows:

Audio before visual: 45ms

Audio after visual: 125ms

Testing will occur within these boundaries, as the research shows that sync errors become detectable outside them. The boundaries give start and end values for testing; going beyond them would be pointless, as the outcome is already known from the research conducted. This will provide a new detectability range to be used for the following tests. Subjects will indicate where they think they can see an error in dialogue synchronisation, and the timecode will be logged to see whether they picked up on the sync errors in the correct places. The amount of audio advance (pre-delay) and audio delay will be varied to test the full range of values presented in the research.
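One way the logged timecodes could be checked against the induced errors is sketched below. This is a hypothetical scoring scheme and not part of the published methods: the function name `score_responses`, the use of seconds for timecodes and the 2 second response window are all assumptions made for illustration.

```python
from typing import List, Tuple


def score_responses(
    induced_s: List[float],
    reported_s: List[float],
    window_s: float = 2.0,  # hypothetical allowance for viewer reaction time
) -> Tuple[int, int, int]:
    """Return (hits, misses, false_alarms) for one viewing."""
    matched = set()
    false_alarms = 0
    for report in reported_s:
        # Find the first unmatched induced error that started no more than
        # window_s seconds before this report.
        match = next(
            (
                index
                for index, start in enumerate(induced_s)
                if index not in matched and 0.0 <= report - start <= window_s
            ),
            None,
        )
        if match is None:
            false_alarms += 1
        else:
            matched.add(match)
    hits = len(matched)
    misses = len(induced_s) - hits
    return hits, misses, false_alarms


induced = [10.0, 42.0, 75.0]   # made-up timecodes of induced errors, in seconds
reported = [11.2, 30.0, 76.5]  # made-up timecodes logged for one viewer
print(score_responses(induced, reported))
```

Counting false alarms separately from misses would show whether a viewer is genuinely detecting the induced errors or simply reporting errors frequently.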

3-4 Test 2

The re-evaluated thresholds from Test 1 will be applied to specific edit points in the film. The same method of testing will be used to find out whether the sync thresholds change at edit points. Crucially, viewers in this test will not be told that all the induced sync errors are around the edit points. This keeps the test fair and prevents viewers from concentrating more closely on the dialogue sync around the edit points.

3-5 Test 3

This test will only take place if results from Test 2 show that edit sync detection tolerances give similar results to Test 1.

If Test 2 shows these results, then the detection tolerance limits will be expanded to find out the value of sync error

detection for dialogue around edit points in film.



Appendices

Appendix 1: ADR can be commonly referred to as: Automated Dialogue Replacement, Additional Dialogue Recording

or Looping.



Reference List

Adobe. (2011). Video Learning Guide for Flash: NTSC and PAL video standards. Retrieved from http://www.adobe.com/devnet/flash/learning_guide/video/part06.html

Advanced Television Systems Committee. (2003). IS-191: Relative Timing of Sound and Vision for Broadcast Operations.

Ament, V.T. (2009). The Foley Grail: The Art of Performing Sound for Film, Games, and Animation. Oxford: Focal Press.

British Broadcasting Corporation. (2014). Technical Standards for Delivery of Television Programs to BBC: Version 4.2.

Chion, M. (1994). Audio-Vision: Sound On Screen. (C. Gorbman, Trans.). New York: Columbia University Press.

Cugnini, A. (2010). Correcting Lip Sync Errors. Retrieved from http://www.tvtechnology.com/media-systems/0191/correcting-lip-sync-errors/255400#sthash.c82qgEEi.dpuf

European Broadcasting Union. (2007). R37-2007: The Relative Timing of Sound and Vision Components of a Television Signal.

International Telecommunication Union. (1998). ITU-R BT.1359: Relative Timing of Sound and Vision for Broadcasting.

Linear Acoustics. (2004). Audio and Video Synchronization: Defining the Problem and Implementing Solutions.

Purcell, J. (2007). Dialogue Editing for Motion Pictures: A Guide to the Invisible Art (2nd ed.). Oxford: Focal Press.

Reeves, B., & Voelker, D. (1993). Effects of Audio-Video Asynchrony on Viewer’s Memory, Evaluation of Content and Detection Ability. Stanford University.

Viers, R. (2012). The Location Sound Bible: How to Record Professional Dialog for Film and TV. California: Michael Wiese Productions.

Waddell, P., Jones, G., & Goldberg, A. (undated). Audio/Video Synchronization Standards and Solutions, A Status Report [PowerPoint slides]. Retrieved from http://www.atsc.org/cms/pdf/audio_seminar/12%20-%20JONES%20-%20Audio%20and%20Video%20synchronization-Status.pdf



Younkin, A.C., & Corriveau, P.J. (2008). Determining the Amount of Audio-Video Synchronization Errors Perceptible to the Average End-User. IEEE Transactions on Broadcasting, 54(3), 623-627.
