Transcript
Page 1: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

TOWARDS MEASURING INTONATION QUALITY OF CHOIRRECORDINGS: A CASE STUDY ON BRUCKNERā€™S LOCUS ISTE

Christof WeiƟ, Sebastian J. Schlecht, Sebastian Rosenzweig, Meinard MĆ¼llerInternational Audio Laboratories Erlangen, Germany

[email protected]

ABSTRACT

Unaccompanied vocal music is a central part of West-ern art music, yet it requires excellent skills for singers toachieve proper intonation. In this paper, we analyze into-nation deficiencies by introducing an intonation cost mea-sure that can be computed from choir recordings and mayhelp to assess the singersā€™ intonation quality. With our ap-proach, we measure the deviation between the recordingā€™slocal salient frequency content and an adaptive referencegrid based on the equal-tempered scale. The adaptivityintroduces invariance of the local intonation measure toglobal intonation drifts. In our experiments, we computethis measure for several recordings of Anton Brucknerā€™schoir piece Locus Iste. We demonstrate the robustness ofthe proposed measure by comparing scenarios of differentcomplexity regarding the availability of aligned scores andmulti-track recordings, as well as the number of singers perpart. Even without using score information, our cost mea-sure shows interesting trends, thus indicating the potentialof our method for real-world applications.

1. INTRODUCTION

Unaccompanied vocal music constitutes the nucleus ofWestern art music and the starting point of polyphonyā€™sevolution. Despite an increasing number of studies [1,4,5,8ā€“11,16,17,21] dating back to the 1930s [29], many facetsof polyphonic a cappella singing are yet to be explored andunderstood. A central challenge of a cappella singing is theadjustment of pitch in order to stay in tune relative to thefellow singers. Even if choirs achieve locally good intona-tion, they may suffer from intonation drifts slowly evolvingover time [8ā€“10,15ā€“17,21,23]. Thus, one has to deal withdifferent intonation issues that refer to harmonic (or verti-cal) and melodic (or horizontal) intonation. In Fig. 1, weshow how these different aspects of intonation quality maybe visualized separately. Our schematic example illustratesthe assessment of note-wise pitch deviations (color-coded)in the presence of a global pitch drift. Fig. 1a shows the

cĀ© Christof WeiƟ, Sebastian J. Schlecht, Sebastian Rosen-zweig, Meinard MĆ¼ller. Licensed under a Creative Commons Attribution4.0 International License (CC BY 4.0). Attribution: Christof WeiƟ, Se-bastian J. Schlecht, Sebastian Rosenzweig, Meinard MĆ¼ller. ā€œTowardsMeasuring Intonation Quality of Choir Recordings: A Case Study onBrucknerā€™s Locus Isteā€, 20th International Society for Music InformationRetrieval Conference, Delft, The Netherlands, 2019.

Dev

iatio

n (c

ents

)

pitc

hpi

tch

pitc

h

a) b)

c) d)

pitc

htime time

Dev

iatio

n (c

ents

)

Figure 1. Note-wise analysis for polyphonic music.(a) Global intonation matching a fixed reference grid.(b) Global intonation higher than a fixed reference grid.(c) Intonation drift shown against a fixed reference grid.(d) Intonation drift with deviations from an adaptive grid.

idealized situation, where each note-wise pitch lies on afixed reference grid. In Fig. 1b, all pitches are sharp (toohigh) compared to the same reference grid. The intonationquality, however, should be considered equivalent in bothcases. Fig. 1c illustrates a different situation with down-ward intonation drift. Using a fixed reference grid, the de-viations gradually accumulate. To compensate for this, onecan use an adaptive reference grid [12, 15, 31] so that thevertical intonation quality is separated from the horizontalintonation drift. The residual pitch deviationsā€”relative tothe adaptive gridā€”only refer to vertical intonation prob-lems (Fig. 1d), which are in the focus of our analysis.

In this paper, we propose an intonation cost measurethat can be computed from choir recordings and may helpto assess the singersā€™ intonation quality. Developing sucha measure encompasses two central challenges: (i) accu-rate estimation of the local salient frequency content froma choir recording, and, (ii) reliable measurement of intona-tion quality corresponding to human perception on the onehand and to music theory on the other hand.

Concerning the estimation of the local salient frequencycontent (i), recordings of polyphonic choir pieces con-stitute extremely difficult scenarios. Often, the different

276

Page 2: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

parts of a musical composition are highly correlated bothin rhythm (joint on- and offsets) and harmony (overlap ofpartials). In the case of a mix recording (single-track), thisleads to overlaps in timeā€“frequency representations, whichmakes the estimation of fundamental frequencies (F0s) [3]or partial tracking [25] hard. On the other hand, captur-ing intonation at the sub-semitone level requires high fre-quency resolutions. One can leverage these problems usingdedicated recording scenarios where singers are isolatedacoustically [5] or recorded sequentially [4]. Alternatively,special devices such as Larynx microphones [6, 16, 28] oradditional score information [9] can help to simplify theF0-estimation problem [18, 27]. In Section 3.2, we detailthe strategy used for this paperā€™s experiments.

For assessing intonation quality (ii), we aim towards de-veloping a robust intonation cost measure. Ideally, a highvalue of this measure indicates passages of low intonationquality. Hereby, intonation quality may relate to humanperception such as the measure proposed in [30] based onpsychometric curves [26]. On the other hand, intonationquality may be guided by music theory or historical per-formance practice [9]. In particular, choirs often aim forjust intonation, which involves complex adjustment strate-gies according to the harmonic context [10]. In contrast tosuch ideas, we follow a simplified approach based on 12-tone equal temperament (12-TET). Even though 12-TETis not considered to be the ideal intonation practice forWestern choir performance, it is a first approximation andcan provide useful feedback [15]. As a major advantage,our strategy estimates the sub-semitone intonation qualityfor any type of notated chord irrespective of its harmonicconsonanceā€”in contrast to other methods [30] that mea-sure a mixture of harmonic consonance and intonation.

As our main contribution, we propose a 12-TET-basedintonation cost measure (Section 2). Inspired by priorwork [15, 24], we accumulate the overall deviation offrequency components from an adaptive 12-TET gridweighted by their corresponding amplitudes. For testingthis measure, we compiled a small but diverse dataset ofAnton Brucknerā€™s Graduale Locus Iste using different per-formances (Section 3). We evaluate the robustness of ourmethod by comparing scenarios of varying complexity re-garding the availability of aligned scores and multi-trackrecordings (Section 4.1). Finally, we apply our method todifferent performances and show its benefit for assessingthe overall intonation quality of a recording (Section 4.2).Section 5 concludes the paper and gives an outlook on fu-ture work and practical applications.

2. MEASURING INTONATION QUALITY

In the following, we describe the computation of an into-nation cost measure based on frequency deviations from a12-tone equal-temperament (12-TET) grid.

2.1 Intonation Cost Measure

The proposed measure operates on a set of N frequencycomponents P := {(f1, a1), . . . , (fN , aN )}, where each

š¹

āˆ† š‘“š¹ š‘“ , š‘Ž

š¹ š‘“ , š‘Ž

š¹ š‘“ , š‘Ž

āˆ† š‘“

āˆ† š‘“cent

cent

cent

cent

š¹ āˆ† š‘“š¹ š‘“ , š‘Ž

š¹ š‘“ , š‘Ž

š¹ š‘“ , š‘Ž

āˆ† š‘“

āˆ† š‘“cent

cent

cent

cent

Figure 2. Grid deviations āˆ†Ļ„ (fn) in cents of frequencycomponents fn (solid gray lines) from a 12-TET grid(dashed blue lines) shifted by Ļ„ . The corresponding am-plitudes an are indicated by the grayscale colors.

tuple (fn, an) āˆˆ P denotes the frequency fn and ampli-tude an of an individual component.

First, we convert a given frequency f in Hertz (Hz) tocents by

Fcent(f) := 1200 Ā· log2

(f

f0

), (1)

where f0 := 55 Hz is an arbitrary but fixed reference fre-quency. We compute the deviation of the frequency com-ponent f from a 12-TET grid

āˆ†Ļ„ (f) := miniāˆˆZ|Fcent(f)āˆ’ Ļ„ āˆ’ 100i| , (2)

where Ļ„ āˆˆ [āˆ’50, 50[ specifies the overall grid shift incents, see Fig. 2. Applying a Gaussian-like function to thegrid deviation āˆ†Ļ„ (f), we define the intonation cost (IC)Ī˜Ļ„ as

Ī˜Ļ„ (P) :=

āˆ‘(f,a)āˆˆP a

(1āˆ’ exp

(āˆ’āˆ†2

Ļ„ (f)2Ļƒ2

))āˆ‘

(f,a)āˆˆP a, (3)

where the deviations are weighted and then normalized bythe corresponding amplitudes. Due to the normalization,we obtain Ī˜Ļ„ (P) āˆˆ [0, 1] where Ī˜Ļ„ (P) = 1 indicates themaximal IC. The parameter Ļƒ adjusts the cost for deviatingfrom the grid. As suggested by [24], we choose a value ofĻƒ = 16 cents. To obtain invariance to pitch drifts, i.e.,variation of the reference frequency, we choose the gridshift Ļ„ in an adaptive way so that the IC is minimized:

Ļ„āˆ— := arg minĻ„āˆˆ[āˆ’50,50[

Ī˜Ļ„ (P). (4)

We then define the intonation cost Ī˜ as

Ī˜(P) := Ī˜Ļ„āˆ—(P). (5)

For instance, in a scenario where a choir performs withgood local intonation but is affected by a pitch drift, Ļ„āˆ—

slowly changes over time while Ī˜ is constantly small.

2.2 Example with Synthetic Signals

In the following, we illustrate the properties of the IC mea-sure Ī˜(P) by means of synthetic examples. To this end,we define a harmonic tone xf : R ā†’ R with K partialsand fundamental frequency f as

xf (t) :=

Kāˆ‘k=1

ak Ā· sin (2Ļ€kft) , (6)

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

277

Page 3: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

Figure 3. ICs Ī˜Ļ„ for two harmonic tones xduad with aninterval of size If1,f2 in cents and three grid shifts: adap-tive grid Ļ„āˆ—, fixed grid Ļ„ = 0, and, fixed grid Ļ„ = āˆ’25.The F0 of the lower tone is fixed at f1 = 220 Hz.

where t denotes time and

ak := skāˆ’1 (7)

denotes the geometrically decaying partial amplitudes forsome s āˆˆ [0, 1] and k = 1, . . . ,K. Thus, the set of fre-quency components of signal xf is

P[xf ] := {(f, a1), (2f, a2), . . . , (Kf, aK)}. (8)

Let xduad := xf1 + xf2 be the sum of two harmonic toneswhose fundamental frequencies differ by the interval

If1,f2 := |Fcent(f2)āˆ’ Fcent(f1)| (9)

given in cents. Consequently, P[xduad] = P[xf1 ]āˆŖP[xf2 ].Fig. 3 shows Ī˜(P[xduad]) for two harmonic tones with

K = 16 and s = 0.6 for different interval sizes If1,f2 . Thelower fundamental frequency is set to f1 = 220 Hz suchthat āˆ†0(f1) = 0. Ī˜(P[xduad],) is minimal for If1,f2 beingan integer multiple of 100 cents. However, Ī˜(P[xduad])does not reach zero as some partial frequencies k Ā· f ofa harmonic tone do not lie on the 12-TET grid even ifthe fundamental frequency f does. For example, the thirdpartial 3f1 = 660 Hz leads to āˆ†0(3f1) ā‰ˆ 2 cents and thefifth partial 5f1 = 1100 Hz leads to āˆ†0(5f1) ā‰ˆ 14 cents.Since the minimal values are close to zero, this effectis small for a partial decay of s = 0.6. On the otherhand, even a quarter tone interval If1,f2 = 50 cents doesnot lead to the maximal IC of 1, since the grid deviationāˆ†Ļ„āˆ—(k Ā· f1) ā‰ˆ āˆ†Ļ„āˆ—(k Ā· f2) ā‰ˆ 25 cents for Ļ„āˆ— = 25 cents.Fig. 3 further shows that the IC of a fixed grid shift Ļ„ = 0 issimilar to the adaptive grid Ļ„āˆ—, while a fixed shift Ļ„ = āˆ’25significantly increases the overall IC. It is important to notethat the IC with adaptive grid shift Ļ„āˆ— is invariant to thechoice of f1 while a fixed grid shift is not. Further, the min-imal and maximal values depend on the amplitude decay sand on the Gaussian width Ļƒ. Since the IC measure relatesto a 12-TET grid, the IC curve is periodic in interval sizewith a period of 100 cents. As a consequence, each musicalinterval is only evaluated by its deviation from the 12-TETscaleā€”regardless of its harmonic consonance quality.

In Fig. 4, we expand the previous example to three har-monic tones xtriad := xf1 +xf2 +xf3 with f1 ā‰¤ f2 ā‰¤ f3.For instance, the intervals If1,f2 = 400 cents, If2,f3 =300 cents describe an equal-tempered major triad. Thecolors in Fig. 4 indicate Ī˜(P[xtriad]) with respect to the

Figure 4. IC Ī˜ for three harmonic tones xtriad. The plotaxes indicate the size of the lower and upper interval incents, If1,f2 and If2,f3 respectively.

lower and upper intervals, If1,f2 and If2,f3 , respectively.Similar to Fig. 3, we observe a periodic structure with pe-riod 100 cents as the IC is invariant to the musical intervals.Thus, the measure is equally suited for estimating the into-nation quality of both consonant and dissonant triads.

3. EXPERIMENTAL SCENARIO

This section describes the experimental scenario for apply-ing our intonation cost measure to choir recordings.

3.1 Dataset

We compiled a small but diverse dataset of performancesof Anton Brucknerā€™s Gradual Locus iste WAB 23 (seeFig. 5). This choir piece is in Latin and lasts approximatelythree minutes. It is musically interesting, contains sev-eral melodic and harmonic challengesā€”such as the highlychromatic middle partā€”but also harmonically clear pas-sages, and covers a large part of each voiceā€™s tessitura.

Central to this dataset is a publicly available 1 multi-track recording from the Choral Singing Dataset (CSD)[4]. The performance of 16 singers from the semi-professional Anton Bruckner choir (Barcelona) wasrecorded in a studio setting. The four musical partsā€”soprano, alto, tenor, and bassā€”were recorded sequen-tially using directional hand-held microphones. Rhyth-mic and harmonic synchronization was ensured by a con-ducting video and an acoustic reference (MIDI version ofthe piece). Due to this recording scenario, the individualsingersā€™ tracks exhibit a small amount of bleeding fromother singers of the same part (e. g., soprano 2, 3, and, 4slightly bleed into soprano 1 track). Interactive intonationor adaptation across musical parts was not possible sincethe parts were recorded in isolation and each singer lis-tened to the reference MIDI signal while singingā€”this alsoprevented substantial pitch drifts. The restricted interac-tion limits the usability of the data to study intonation and

1 https://zenodo.org/record/1319597#.XJor8ShKhaR

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

278

Page 4: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

Figure 5. Anton Bruckner, Locus iste WAB 23, beginning.

adaptation phenomena in choir performances. Neverthe-less, since this paper focuses on measurement strategies,the multi-track data provides an excellent resource. To ad-dress subsets of the multi-track recording, we refer to thesignals of the four soprano voices as xS1, xS2, xS3, andxS4, and to the first voices of the alto, tenor, and bass asxA1, xT1, and xB1, respectively. We denote the mixed sig-nal of the first voices of each part as xSATB1 and the mixedsignal of all 16 voices as xSATB. We further manipulatedthe original tracks xorig with the digital audio correctionsoftware Melodyne 2 to augment the dataset: xnote is gen-erated by quantizing the median pitch of each note eventonto a 12-TET grid with reference frequency f0 = 55 Hz.xfine is generated by quantizing the complete F0-trajectoryonto the 12-TET grid. To assist our analysis, we use addi-tional score information from the aligned MIDI file.

In addition to this multi-track recording, we collectedseveral commercial 3 and freely available 4 performancesof the piece. As a reference for the real audio measure-ments, we sonified the piece with harmonic tones (Sec-tion 2.2) using random pitch deviations sampled fromGaussian distributions with different standard deviations.

3.2 Measuring Frequency Content

As discussed in Section 1, the extraction of salient fre-quency content from choir recordings is challenging. Inthe case of a mix recording, we have to blindly estimate allpartial frequencies using a partial tracking algorithm suchas [25]. For choir music, partial tracking can be simplifiedas the singing voiceā€™s partial frequencies are located quiteprecisely at integer multiples of an estimated F0. To esti-mate F0-trajectories for the CSD, we can use the individualtracks of the multi-track recording. Due to the bleeding ofother voices from the same part, traditional F0-estimationtechniques [7, 20] may have problems. Therefore, we usea salience-based method similar to Melodia [27]. We com-pute a log-frequency representation using instantaneousfrequency estimation [2, 14, 27] and binning with a reso-lution of 1 cent. Subsequently, we estimate F0-trajectoriesusing dynamic programming [22]. As post-processing, weapply median filtering with a filter length of 101 framesand downsample by a factor of 50 obtaining trajectorieswith a time resolution of 290 ms.

2 https://www.celemony.com3 Philharmonia Vocalensemble Stuttgart (Profil Medien 2006), Chor

des Bayerischen Rundfunks (Decca 2012), Choir of St Johnā€™s CollegeCambridge (Classic Mania 2007), NDR Chor Hamburg (Carus 2015)

4 Internet Archive, https://archive.org/details/LocusIste

In the case of the CSD, we can exploit additional scoreinformation from the aligned MIDI file [13]. We restrictthe F0-estimation to rectangular timeā€“frequency regions(ā€œconstraint regionsā€) derived from onsets, durations andcenter frequencies of the aligned MIDI notes, including afrequency tolerance of Ā±60 cents around the center fre-quencies (according to 12-TET with reference 440 Hz).Such additional information is particularly helpful whenestimating F0-trajectories from a mix recording. 5 Espe-cially, the constraint regions prevent the common confu-sion of the F0 with higher partials.

4. RESULTS

In our experiments, we investigate the robustness of theproposed intonation measure for different scenarios of theCSD where either score information or multi-track record-ings are not used (Section 4.1). Furthermore, we com-pare the measureā€™s behaviour for different synthetic andreal performances of Locus Iste (Section 4.2).

4.1 Local Analysis and Visualization

As a visual orientation, we show in Fig. 6a a piano roll rep-resentation of the entire piece, generated from the alignedMIDI file. Considering the subset of each partā€™s first voicexS1, xA1, xT1, and xB1, we estimate F0-trajectories as de-scribed in Section 3.2 and compute the local, sub-semitonedeviation of the F0-estimate from the corresponding MIDIreference. The deviations are color-coded with a range ofĀ±60 cents. While there seems to be no significant globaldrift (which is not surprising because of the CSDā€™s record-ing scenario, see Section 3.1), we observe a slight domi-nance of notes sung flat (negative deviation) except for thealto part, which is sharp more often.

The main results are shown in Figures 6bā€“g, which indi-cate IC values throughout each performance. We computethe IC measure Ī˜(P) of the set P comprising the frequen-cies (F0 and higher partials) and amplitudes from all partsthat are, according to the aligned score, active in a frame.Assuming that the human voiceā€™s partials are harmonic, wecalculate the first 16 partial frequencies from the measuredF0-trajectories and extract the corresponding amplitudesfrom the log-frequency spectrogram. For computing Ī˜,we use the adaptive grid shift proposed in Section 2.1. Toremove local outliers, we post-process the IC curves usinga moving median filter with a length of 21 frames.

The blue, solid line in Fig. 6b shows the resulting ICcurve for xorig, computed from the individual tracks forthe first voice of each part xS1, xA1, xT1, and xB1 usingscore constraints. For silent regions (e. g., after 160 sec),the IC is zero since no constraint region is active. Dueto the adaptive grid shift, the IC is small for monophonicpassages where only one singer is active (see, e. g., the pas-sage at 80 sec). For some of the consonant chords (e. g., at110 sec), we observe low IC values of about 0.2. Duringthe highly chromatic three-part passage (80ā€“110 sec), the

5 We do not use harmonic summation as in [27] to avoid smearing ofother partsā€™ partials into the constraint regions for mix recordings.

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

279

Page 5: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

Figure 6. Intonation cost (IC) curves for different CSD versions of Locus Iste. (a) Piano roll representation with a log-frequency axis where C2 corresponds to 1200 cents, C3 to 2400 cents etc. The colors encode the deviation of the first voiceof each part (xS1, xA1, xT1, xB1) from the MIDI reference. (b) IC curves for xS1, xA1, xT1, xB1 with score constraintsfrom four individual tracks. The blue curve corresponds to xorig, the red curve to xnote, and the green dashed curve to xfine.(c) IC curves as in (b) without score constraints. (d) IC curves for mixed signal xSATB1 with score constraints. (e) IC curvefor mixed signal of all voices xSATB with score constraints. (f) IC curve computed from all 16 individual tracks xS1, xS2,. . . , xB4 with score constraints. (g) Individual IC curves for the four tracks of each part with score constraints.

IC increases, thus indicating that the singers have difficul-ties to stay in tune in this passage.

As a sanity check, we compare the results for xorig tothe the pitch-corrected versions xnote and xfine. As we ex-pected, the note-wise corrections xnote (red curve) obtain

lower values than xorigā€”with only minor peaks that maybe caused by pitch fluctuations during a note event. If wecorrect such local fluctuations as in xfine (green, dashedcurve), the IC is almost constantly zero.

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

280

Page 6: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

In Fig. 6c, we repeat the experiment of Fig. 6b with-out using score constraints. In this case, F0-estimation isless reliable and, in particular, confusions between F0 andhigher partials may occur. However, since many partialslie on the same 12-TET grid as the F0 (octave-related par-tials) or very close to that (2 cents for fifth-related partials),the IC measure is largely invariant to such confusions. Thehigh similarity between Fig. 6b and Fig. 6c confirms theIC measureā€™s robustness to F0-extraction errors. Only forsilent passages (e. g., after 160 sec or at the beginning), weobtain higher IC values due to erroneously extracted fre-quency components. 6 We conclude that our strategy is,in principle, applicable for any multi-track recording anddoes not necessarily require score information.

To test the applicability in absence of multi-trackrecordings, we compute the IC curve from the mix signalxSATB1 using score constraints (Fig. 6d). Due to the con-straints, confusions with higher partials cannot occur, butpartials of other parts may leak into the same constraintregions, thus affecting the estimated F0s and partialsā€™ am-plitudes. The comparison of Fig. 6b and Fig. 6d indicatesthat such phenomena only slightly affect the IC measure.

Next, we measure the IC from the CSDā€™s full record-ing xSATB (all 16 voices). F0-estimation is more chal-lenging since the four singers of each part contribute tothe same constraint regions. The resulting ICs (Fig. 6e)exhibit slightly lower values than in previous cases. Weassume that having several voices per part stabilizes theF0-estimation to some degree. Overall, we see a similartrend between Fig. 6d and Fig. 6e. This might be an ef-fect of mutual influence between the singers of each part.To further investigate the multiple-singer effect, we showin Fig. 6f the IC curves computed from all 16 individualtracks xS1, xS2, . . . , xB4. We obtain higher IC values thanin Fig. 6e, especially for the monophonic passages (e. g., at48 sec). Due to the score constraints, this must be causedby deviations between the singers of a part, sometimes de-noted as dispersion [4]. To analyze this, we repeat the ex-periment for each part separately (Fig. 6g). The resultingcurve supports our hypothesis since. For instance, the highvalue at around 48 sec (Fig. 6f) is mainly caused by thebassesā€™ dispersion (dotted curve in Fig. 6g).

4.2 Global Analysis of Different Performances

Since our experiments on the CSD have shown the robust-ness of our method even for mix recordings, we finallycompare the global IC values of multiple synthetic andreal performances of Locus Iste. We align the MIDI fileto all performances [13], use the resulting constraint re-gions for extracting F0-trajectories [27], and compute theIC curves. In Fig. 7, we show the statistics of the en-tire curves over time (median, mean, and standard devi-ation). First, we report values for sonifications using har-monic tones as defined in Eq. (6). To simulate detuning,we shifted each noteā€™s fundamental frequency by a randomvalue sampled from a Gaussian distribution with variance

6 This problem could be leveraged using a suitable algorithm for voiceactivity detection [19] or on-/offset detection in vocal music [6].

Harmonic tones, detuned š‘‘š‘‘ = 0 cents

Harmonic tones, detuned š‘‘š‘‘ = 15 cents

Harmonic tones, detuned š‘‘š‘‘ = 30 cents

Choir Samples SibeliusSounds

First voices š‘„š‘„SATB1, corrected š‘„š‘„First voices š‘„š‘„SATB1, corrected š‘„š‘„

First voices š‘„š‘„SATB1, original š‘„š‘„All voices š‘„š‘„SATB, original š‘„š‘„

All voices š‘„š‘„SATB, with strong reverb

Internet Archive 2013

Philharmonia Vocalensemble 1979

Chor des Bayerischen Rundfunks 2012

Choir of St John's College 1996

NDR Chor Hamburg 2000

Intonation Cost Ī˜

Mean, St. Dev.Median

Com

mer

cial

Re

cord

ings

Chor

al S

ingi

ngDa

tase

tSy

nthe

ticEx

ampl

es

0 0.1 0.2 0.3

fine

note

orig

orig

Figure 7. Statistics of the intonation cost measure Ī˜ overfull recordings of different type.

d2. For d = 0 cents, the mean IC is almost zero. For d =15 cents and d = 30 cents, the IC gradually increases asexpected. The average IC of a sample-based sonification 7

is moderately higher than the rendition with ideal harmonictones, which suggests that some choir effects such as dis-persion are also synthesized. Both corrected versions ofthe CSD show very low IC values, with xfine even lowerthan xnote. In contrast, the versions based on the origi-nal recording xorig have high IC values, where the differ-ence between first voices only (xSATB1) and the full mix(xSATB) as well as the effect of adding artificial reverbera-tion is marginal. All commercial recordings performed byprofessional choirs obtain lower IC than the CSD record-ing. Several effects might contribute to this observation.Besides the singersā€™ level of training, commercial record-ings often feature larger choirs (60 singers or more) andlong natural reverberation (recorded in churches). In linewith the authorsā€™ subjective judgment, the recording byNDR Chor Hamburg exhibits the best intonation qualityaccording to the IC measure. 8 Overall, this experimentindicates that the proposed IC measure may serve as a firstindicator for a choir recordingā€™s global intonation quality.

5. CONCLUSIONS

In this paper, we proposed a strategy for measuring the in-tonation quality of choir recordings. Although robust ex-traction of salient frequencies is challenging, the measureproduced meaningful and reliable results once multi-trackrecordings are available or score information can be uti-lized. Even though the insights from such a measure arelimited, it might be a first indicator for the overall intona-tion quality and, thus, could be useful for choir singers orchoir directors in performances and rehearsals.

7 Using Sibelius Sounds, see https://www.avid.com/sibelius8 A 30-seconds mp3 thumbnail of this recording is available at

https://www.carusmedia.com/images-intern/medien/80/8346600/8346600.010s.t1_010.mp3

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

281

Page 7: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

Acknowledgments: This work was supported by the Ger-man Research Foundation (DFG MU 2686/12-1). The In-ternational Audio Laboratories Erlangen are a joint in-stitution of the Friedrich-Alexander-UniversitƤt Erlangen-NĆ¼rnberg (FAU) and Fraunhofer Institut fĆ¼r IntegrierteSchaltungen IIS. The authors want to thank Helena Cuestaand colleagues from UPF Barcelona for creating and pub-lishing the Choral Singing Dataset.

6. REFERENCES

[1] Per-Gunnar Alldahl. Choral Intonation. GehrmansMusikfƶrlag, 1990.

[2] FranƧois Auger and Patrick Flandrin. Improving thereadability of time-frequency and time-scale represen-tations by the reassignment method. IEEE Transac-tions on Signal Processing, 43(5):1068ā€“1089, 1995.

[3] Mert Bay, Andreas F. Ehmann, and J. Stephen Downie.Evaluation of multiple-f0 estimation and tracking sys-tems. In Proceedings of the International Societyfor Music Information Retrieval Conference (ISMIR),pages 315ā€“320, Kobe, Japan, 2009.

[4] Helena Cuesta, Emilia GĆ³mez, AgustĆ­n Martorell, andFelipe LoĆ”iciga. Analysis of intonation in unison choirsinging. In Proceedings of the International Confer-ence of Music Perception and Cognition (ICMPC),pages 125ā€“130, Graz, Austria, 2018.

[5] Jiajie Dai and Simon Dixon. Analysis of interactive in-tonation in unaccompanied SATB ensembles. In Pro-ceedings of the International Society for Music Infor-mation Retrieval Conference (ISMIR), pages 599ā€“605,Suzhou, China, 2017.

[6] Sara Dā€™Amario, Helena Daffern, and Freya Bailes. Anew method of onset and offset detection in ensemblesinging. Logopedics Phoniatrics Vocology, 2018.

[7] Alain de CheveignĆ© and Hideki Kawahara. YIN, a fun-damental frequency estimator for speech and music.Journal of the Acoustical Society of America (JASA),111(4):1917ā€“1930, 2002.

[8] Johanna Devaney. An Empirical Study of the Influenceof Musical Context on Intonation Practices in SoloSingers and SATB Ensembles. PhD thesis, McGill Uni-versity, Montreal, Canada, 2011.

[9] Johanna Devaney and Daniel P. W. Ellis. An em-pirical approach to studying intonation tendencies inpolyphonic vocal performances. Journal of Interdisci-plinary Music Studies.

[10] Johanna Devaney, Michael I. Mandel, and Ichiro Fuji-naga. A study of intonation in three-part singing usingthe automatic music performance analysis and compar-ison toolkit (AMPACT). In Proceedings of the Interna-tional Society for Music Information Retrieval Confer-ence (ISMIR), pages 511ā€“516, Porto, Portugal, 2012.

[11] Johanna Devaney, Jonathan Wild, and Ichiro Fujinaga.Intonation in solo vocal performance: A study of semi-tone and whole tone tuning in undergraduate and pro-fessional sopranos. In Proceedings of the InternationalSymposium on Performance Science, pages 219ā€“224,Toronto, Canada, 2011.

[12] Karin Dressler and Sebastian Streich. Tuning fre-quency estimation using circular statistics. In Proceed-ings of the International Society for Music InformationRetrieval Conference (ISMIR), pages 357ā€“360, Vienna,Austria, 2007.

[13] Sebastian Ewert, Meinard MĆ¼ller, and Peter Grosche.High resolution audio synchronization using chromaonset features. In Proceedings of IEEE InternationalConference on Acoustics, Speech, and Signal Process-ing (ICASSP), pages 1869ā€“1872, Taipei, Taiwan, April2009.

[14] J. L. Flanagan and R. M. Golden. Phase vocoder. BellSystem Technical Journal, 45:1493ā€“1509, 1966.

[15] Volker Gnann, Markus Kitza, Julian Becker, and Mar-tin Spiertz. Least-squares local tuning frequency es-timation for choir music. In Proceedings of the Au-dio Engineering Society (AES) Convention, New YorkCity, USA, 2011.

[16] David M. Howard. Intonation drift in a capella soprano,alto, tenor, bass quartet singing with key modulation.Journal of Voice, 21(3):300ā€“315, 2007.

[17] David M. Howard, Helena Daffern, and Jude Brereton.Four-part choral synthesis system for investigating in-tonation in a cappella choral singing. Logopedics Pho-niatrics Vocology, 38(3):135ā€“142, 2013.

[18] Jong Wook Kim, Justin Salamon, Peter Li, andJuan Pablo Bello. Crepe: A convolutional representa-tion for pitch estimation. In Proceedings of the IEEEInternational Conference on Acoustics, Speech andSignal Processing (ICASSP), pages 161ā€“165, Calgary,Canada, 2018.

[19] Bernhard Lehner, Jan SchlĆ¼ter, and Gerhard Widmer.Online, loudness-invariant vocal detection in mixedmusic signals. IEEE/ACM Transactions on Audio,Speech & Language Processing, 26(8):1369ā€“1380,2018.

[20] Matthias Mauch and Simon Dixon. pYIN: A funda-mental frequency estimator using probabilistic thresh-old distributions. In IEEE International Conference onAcoustics, Speech and Signal Processing (ICASSP),pages 659ā€“663, Florence, Italy, 2014.

[21] Matthias Mauch, Klaus Frieler, and Simon Dixon. In-tonation in unaccompanied singing: Accuracy, drift,and a model of reference pitch memory. Journal of theAcoustical Society of America, 136(1):401ā€“411, 2014.

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

282

Page 8: TOWARDS MEASURING INTONATION QUALITY OF ......choir piece Locus Iste . We demonstrate the robustness of the proposed measure by comparing scenarios of different complexity regarding

[22] Meinard MĆ¼ller. Fundamentals of Music Processing.Springer Verlag, 2015.

[23] Meinard MĆ¼ller, Peter Grosche, and Frans Wier-ing. Automated analysis of performance variations infolk song recordings. In Proceedings of the Inter-national Conference on Multimedia Information Re-trieval (MIR), pages 247ā€“256, Philadelphia, Pennsyl-vania, USA, 2010.

[24] Tomoyasu Nakano, Masataka Goto, and Yuzuru Hi-raga. An automatic singing skill evaluation method forunknown melodies using pitch interval accuracy andvibrato features. In Proceedings of the Annual Confer-ence of the International Speech Communication As-sociation (INTERSPEECH), pages 1706ā€“1709, Pitts-burgh, PA, USA, 2006.

[25] Julian Neri and Philippe Depalle. Fast partial trackingof audio with real-time capability through linear pro-gramming. In Proceedings of the International Confer-ence on Digital Audio Effects (DAFx), pages 326ā€“333,Aveiro, Portugal, 2018.

[26] Reinier Plomp and Willem Johannes Maria Levelt.Tonal consonance and critical bandwidth. Journal ofthe Acoustical Society of America, 38(4):548ā€“560,1965.

[27] Justin Salamon and Emilia GĆ³mez. Melody extrac-tion from polyphonic music signals using pitch contourcharacteristics. IEEE Transactions on Audio, Speech,and Language Processing, 20(6):1759ā€“1770, 2012.

[28] Frank Scherbaum. On the benefit of larynx-microphone field recordings for the documentationand analysis of polyphonic vocal music. Proceedingsof the International Workshop Folk Music Analysis,pages 80ā€“87, 2016.

[29] Carl Emil Seashore. Objective Analysis of Musical Per-formance, volume 4 of Studies in the Psychology ofMusic. University of Iowa Press, Iowa City, USA,1936.

[30] William A. Sethares. Local consonance and the re-lationship between timbre and scale. Journal ofthe Acoustical Society of America, 94(3):1218ā€“1228,1993.

[31] Simon Waloschek and Aristotelis Hadjakos. Driftinā€™down the scale: Dynamic time warping in the presenceof pitch drift and transpositions. In Proceedings of theInternational Society for Music Information RetrievalConference (ISMIR), pages 630ā€“636, Paris, France,2018.

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

283


Recommended