A System for Hybridizing Vocal Performance By Kim Hang Lau


Parameters of the singing voice

Parameters of the singing voice can be loosely classified as:
– Timbre
– Pitch contour
– Time contour (rhythm)
– Amplitude envelope (projection)

Vocal Modification

Vocal modification refers to the signal processing of live or recorded singing to achieve a different inflection and/or timbre

Commercially available units include:
– Intonation corrector
– Pitch/formant processor
– Harmonizer
– Vocoder

Objectives

Prototype a system for vocal modification

Modify a source vocal sample to match the time evolution, pitch contour and amplitude envelope of a similarly sung target vocal sample

Simulate a transfer of singing techniques from a target vocalist to a source vocalist, hence a hybridized vocal performance

Order of Presentation

System overview

Individual components

System evaluation

System limitations

Conclusions and recommendations

System Overview

Three components:
– Pitch-marking
– Time-alignment
– Time/pitch/amplitude modification engine

Inspired by Verhelst’s prototype system for the post-synchronization of speech utterances

Targeted System Specifications

Vocal performance: commercial singing
Vocal pitch range: 60-1200 Hz
Detection accuracy/resolution: 10 cents
Detection dynamic range: 40 dB
Sampling rate: 44.1 kHz and 48 kHz
Time-scale modification: ±20%
Pitch-scale modification: ±600 cents
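The cent-based figures in the specification above can be related to frequency ratios and pitch periods with a short calculation. The sketch below is illustrative only; the function names are hypothetical and not part of the presented system.

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for an interval in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def period_in_samples(f0_hz: float, fs_hz: float) -> float:
    """Length of one pitch period, in samples, at sampling rate fs_hz."""
    return fs_hz / f0_hz


# Pitch-scale range of +/-600 cents (half an octave up or down).
up = cents_to_ratio(600)     # ratio of about 1.414
down = cents_to_ratio(-600)  # ratio of about 0.707

# A 10-cent resolution corresponds to a frequency ratio of about 1.0058.
resolution = cents_to_ratio(10)

# Longest and shortest pitch periods implied by the 60-1200 Hz range.
longest = period_in_samples(60.0, 44100.0)     # 735 samples at 44.1 kHz
shortest = period_in_samples(1200.0, 48000.0)  # 40 samples at 48 kHz
```

At 1200 Hz a period spans only a few dozen samples, so a 10-cent resolution implies pitch-period estimates finer than one sample at the top of the range.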

Component No. 1: Pitch-marking

Pitch-marking and Glottal Closure Instants (GCIs)

Information generated from pitch-marking:
– Pitch period
– Amplitude envelope
– Voiced/unvoiced segment boundaries
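As a rough illustration of how the three items above could be derived from a list of pitch-marks, consider the sketch below. It assumes pitch-marks are given as increasing sample indices and that a gap longer than the longest expected period (60 Hz, the lower bound of the targeted pitch range) separates voiced segments. The function names and this segmentation rule are assumptions for illustration, not the method used in the presented system.

```python
def pitch_contour(marks, fs_hz):
    """Fundamental frequency (Hz) for each pair of consecutive pitch-marks."""
    return [fs_hz / (b - a) for a, b in zip(marks, marks[1:])]


def amplitude_envelope(signal, marks):
    """Peak absolute amplitude within each pitch period."""
    return [max(abs(x) for x in signal[a:b]) for a, b in zip(marks, marks[1:])]


def voiced_segments(marks, fs_hz, min_f0_hz=60.0):
    """Group pitch-marks into (start, end) voiced segments.

    Assumption: a gap between marks longer than one period of min_f0_hz
    means the signal was unvoiced (or silent) in between.
    """
    if not marks:
        return []
    max_gap = fs_hz / min_f0_hz
    segments = []
    start = prev = marks[0]
    for m in marks[1:]:
        if m - prev > max_gap:
            if prev > start:
                segments.append((start, prev))
            start = m
        prev = m
    if prev > start:
        segments.append((start, prev))
    return segments
```

For example, marks spaced 100 samples apart at 44.1 kHz give a pitch of 441 Hz per period.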

Pitch-marks

[Figure: pitch-marks P and P’, with two 5 ms intervals marked]

Pitch-marking applying Dyadic Wavelet Transform (DyWT)

Kadambe adapted Mallat’s algorithm for edge detection in image signals to the detection of GCIs in speech signals

The adaptation assumes a correspondence between edges in an image signal and GCIs in a speech signal

DyWT computation for dyadic scales 2^3 to 2^5 was sufficient for pitch-marking

If a peak detected in the DyWT coincides at two consecutive scales, starting from the lower scale, that time instant is taken as a GCI
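The matching rule can be sketched as follows, assuming the DyWT peaks have already been located at each scale and are supplied as lists of sample indices ordered from the lower scale to the higher one. The tolerance parameter and the function name are hypothetical; the slides do not say how closely peaks must coincide.

```python
def match_gcis(peaks_by_scale, tolerance=2):
    """Accept a peak as a GCI if it also appears at the next higher scale.

    peaks_by_scale: list of peak-position lists, lowest scale first
                    (for example the scales 2^3, 2^4, 2^5).
    tolerance:      assumed maximum offset, in samples, for two peaks to
                    count as the same event.
    """
    gcis = []
    for lower, upper in zip(peaks_by_scale, peaks_by_scale[1:]):
        for p in lower:
            matched = any(abs(p - q) <= tolerance for q in upper)
            already_found = any(abs(p - g) <= tolerance for g in gcis)
            if matched and not already_found:
                gcis.append(p)
    return sorted(gcis)


peaks = [
    [100, 150, 300],  # scale 2^3
    [101, 300, 420],  # scale 2^4
    [101, 421],       # scale 2^5
]
gcis = match_gcis(peaks)  # the isolated peak at 150 is rejected
```

The position reported for each GCI is taken from the lower of the two matching scales, which has the finer time resolution.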

[Figure: Mallat and Kadambe. Panels: original signal, DyWT at scales 2^1, 2^2, 2^3, 2^4 and 2^5, and base-band]

The proposed pitch-marking scheme

Detection principle:
– Detection of the scale that contains the fundamental period
– Starting from a higher scale (of lower frequency), there is a considerable jump in frame power when this scale is encountered

Features:
– 4X decimation to support high sampling rates
– Frame-based processing and error correction for possible quasi-real-time detection
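The detection principle above might be sketched as a scan over per-scale frame powers. The jump threshold (`jump_ratio`) and the function names are assumptions for illustration; the slides do not state how large a jump counts as "considerable".

```python
def frame_power(frame):
    """Mean squared amplitude of one analysis frame."""
    return sum(x * x for x in frame) / len(frame)


def select_fundamental_scale(powers_by_scale, jump_ratio=4.0):
    """Scan from the highest scale downward and return the first scale whose
    frame power exceeds jump_ratio times that of the scale visited before it.

    powers_by_scale: dict mapping the scale exponent j (scale 2^j) to the
                     frame power of the DyWT output at that scale.
    Returns None when no jump is found.
    """
    scales = sorted(powers_by_scale, reverse=True)
    for higher, current in zip(scales, scales[1:]):
        if powers_by_scale[current] > jump_ratio * powers_by_scale[higher]:
            return current
    return None


# Hypothetical frame powers: the jump occurs on reaching scale 2^3.
powers = {5: 0.001, 4: 0.002, 3: 0.5, 2: 0.6}
scale = select_fundamental_scale(powers)
```

This simplified scan cannot report the highest scale itself, since there is no earlier scale to compare it with; a fuller implementation would need a separate rule for that case.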

The proposed pitch-marking system

Comparisons of results with Auto-Tune

[Figure: pitch-marking results of the proposed system and of Auto-Tune]

Component No. 2: The Modification Engine

Modification parameters, each a function of n:
– Time-modification factor
– Pitch-modification factor
– Amplitude-modification factor
– D(n): time-warping function

Time/pitch/amplitude modification engine

TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add)

Time-domain splicing overlap-add method

Used in prosodic modification of speech
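A minimal TD-PSOLA sketch is given below to make the overlap-add idea concrete. It assumes constant time and pitch factors, pitch-marks supplied as increasing sample indices, and two-period Hann-windowed grains; the presented engine uses time-varying factors and amplitude modification, which this sketch omits. The function name and signature are hypothetical.

```python
import math


def td_psola(signal, marks, time_factor=1.0, pitch_factor=1.0):
    """Return (output, synthesis_marks) for constant modification factors."""
    if len(marks) < 2 or time_factor <= 0 or pitch_factor <= 0:
        raise ValueError("need at least two pitch-marks and positive factors")
    n_out = int(round(len(signal) * time_factor))
    out = [0.0] * n_out
    synth_marks = []
    t = marks[0] * time_factor
    while t <= marks[-1] * time_factor:
        # Analysis mark closest to the time-warped synthesis position.
        src = t / time_factor
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - src))
        # Local pitch period from the neighbouring mark.
        if i + 1 < len(marks):
            period = marks[i + 1] - marks[i]
        else:
            period = marks[i] - marks[i - 1]
        centre = int(round(t))
        # Copy a two-period Hann-windowed grain to the synthesis position.
        for k in range(-period, period + 1):
            a, b = marks[i] + k, centre + k
            if 0 <= a < len(signal) and 0 <= b < n_out:
                w = 0.5 * (1.0 + math.cos(math.pi * k / period))
                out[b] += w * signal[a]
        synth_marks.append(centre)
        # Synthesis marks are spaced by the modified pitch period.
        t += period / pitch_factor
    return out, synth_marks
```

With both factors equal to 1 the overlapping Hann windows sum to one, so the signal between the first and last pitch-marks is reproduced unchanged. A pitch factor of 2 halves the spacing of the synthesis marks, while a time factor of 2 repeats grains to double the duration at the original pitch. The output depends entirely on the accuracy of the supplied pitch-marks.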

Evaluation of the modification engine

[Comparison: original, TD-PSOLA, Auto-Tune]

Component No. 3: Time-alignment

Time-alignment

Based on Verhelst’s prototype system, which applies Dynamic Time Warping (DTW)

Verhelst claimed that the basic local constraint produces the most accurate time-warping path

Computation increases steeply as the length of comparison increases (the DTW cost grid grows with the product of the two sequence lengths)

Accuracy deteriorates as the length of comparison increases
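For reference, DTW with the basic local constraint (each step moves one frame forward in the source, the target, or both) can be sketched as below. The sketch assumes one scalar feature per frame and an absolute-difference distance; the features and distance measure used in the presented system are not stated in the slides.

```python
def dtw(source, target):
    """Return (total_cost, path) aligning two sequences of scalar features.

    path is a list of (source_index, target_index) pairs from (0, 0) to the
    last frame of both sequences.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(source[i] - target[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Basic local constraint: predecessors are (i-1, j), (i, j-1), (i-1, j-1).
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = d + best
    # Backtrack from the end to recover the warping path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


total, path = dtw([1, 2, 3], [1, 2, 2, 3])
# total is 0 and the path repeats source frame 1 to cover the held note.
```

The nested loops fill an n-by-m grid, which is why the cost of the comparison rises quickly with segment length.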

Adaptations from Verhelst’s method

Proposed to perform time-alignment on a voiced/unvoiced segmental basis:
– DTW for voiced segments
– Linear Time Warping (LTW) for unvoiced segments
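Linear time warping for an unvoiced segment amounts to a uniform stretch or compression of the frame indices. A minimal sketch, assuming segments are described by their frame counts and that the function name is hypothetical:

```python
def linear_time_warp(n_source, n_target):
    """For each target frame, return the index of the source frame it maps to.

    The first and last frames of the two segments are pinned together and the
    frames in between are spaced uniformly (rounded half up).
    """
    if n_source < 1 or n_target < 1:
        raise ValueError("segments must contain at least one frame")
    if n_target == 1:
        return [0]
    den = 2 * (n_target - 1)
    return [(2 * j * (n_source - 1) + (n_target - 1)) // den for j in range(n_target)]


compress = linear_time_warp(5, 3)  # [0, 2, 4]
stretch = linear_time_warp(3, 5)   # [0, 1, 1, 2, 2]
```

Unlike DTW, this mapping needs no feature comparison, which suits unvoiced segments where there is no pitch structure to align.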

Global constraints are introduced to further reduce computations
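One common form of global constraint restricts the warping path to a band around the diagonal of the cost grid (a Sakoe-Chiba band). The slides do not say which global constraint was used, so the band shape and the `radius` parameter below are assumptions, shown only to illustrate the saving in computation.

```python
def in_band(i, j, n, m, radius):
    """True if grid cell (i, j) lies within `radius` frames of the diagonal
    of an n-by-m DTW cost grid."""
    diagonal_j = i * (m - 1) / (n - 1) if n > 1 else 0.0
    return abs(j - diagonal_j) <= radius


def band_cells(n, m, radius):
    """Number of grid cells a band-limited DTW would have to evaluate."""
    return sum(in_band(i, j, n, m, radius) for i in range(n) for j in range(m))


full = 100 * 100                   # 10000 cells without a global constraint
banded = band_cells(100, 100, 5)   # 1070 cells with a band of radius 5
```

Cells outside the band are simply skipped when filling the cost grid, so the work grows roughly linearly with segment length for a fixed radius.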

Synchronization of voiced/unvoiced segments is required; it is edited manually in the current implementation

Manipulation of modification parameters

Simple smoothing of the modification-factor contours using linear-phase FIR low-pass filters is performed before feeding them to the modification engine
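The smoothing step might look like the sketch below, which uses a moving average as the simplest symmetric (and therefore linear-phase) FIR low-pass filter. The filter type, its length, and the edge handling are assumptions; the slides do not specify the filter design.

```python
def moving_average_taps(length):
    """Coefficients of a moving-average FIR filter of odd length."""
    if length % 2 == 0:
        raise ValueError("use an odd length so the delay is a whole sample count")
    return [1.0 / length] * length


def smooth_fir(values, taps):
    """Filter `values` with symmetric FIR taps, compensating the group delay.

    A symmetric filter of odd length L delays every frequency by (L - 1) / 2
    samples, so centring the taps on each output sample keeps the smoothed
    contour time-aligned with the input. Edges are handled by repeating the
    first and last values.
    """
    if len(taps) % 2 == 0 or taps != taps[::-1]:
        raise ValueError("taps must be symmetric with odd length")
    half = len(taps) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = min(max(i + k - half, 0), n - 1)
            acc += h * values[idx]
        out.append(acc)
    return out


# A single outlier in a contour is spread over five samples.
contour = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
smoothed = smooth_fir(contour, moving_average_taps(5))
```

Linear phase matters here because the smoothed contours must stay synchronized with the pitch-marks; a filter with frequency-dependent delay would shift parts of the contour relative to the signal.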

The Prototype System

System Evaluation: case 1

System Evaluation: case 2

System Limitations

Segmentation:
– Lack of a reliable technique for voiced/unvoiced segmentation
– Segmentation and classification of different vocal sounds is the key to devising rules for modification

Modification engine:
– Lacks the capability to handle pitch transitions; depends entirely on the pitch-marking stage

System Limitations

Pitch-marking:
– The proposed system lacks robustness
– Despite the desirable time response of the wavelet filter bank, its frequency response cannot isolate harmonics effectively and efficiently

Time-alignment:
– The DTW basic local constraint allows unbounded time expansion and compression
– This often causes distortions in the synthesized vocal sample

Conclusions and Recommendations

The current system works well for slow and continuous singing

Further improvements on the individual components are recommended to handle greater dynamic changes of the vocal signal, thereby extending the current good results to a wider range of singing styles

Questions & Answers

Wavelet filter bank

Dyadic Spline Wavelet

Wide-band analysis

DTW local constraints

Calculation of pitch-marks

DyWT
