
Chapter 7 FEATURE CALCULATION


7.1 Basic Prosodic Attributes

In the present section, the calculations and procedures employed to obtain the basic feature contours are explained. These essential attributes (i.e. pitch and energy) are the starting point for deriving more complex features, which contain valuable information for our purposes. The software used in section 7.1 is part of the Verbmobil long-term project of the Federal Ministry of Education, Science, Research and Technology.

In order to achieve feasible estimations and avoid the difficulties caused by the non-stationary nature of speech, it is assumed that the properties of the signal change relatively slowly with time. This allows examination of a short-time window of speech to extract relevant parameters that are presumed to be fixed within the duration of the window.

Most techniques yield parameters averaged over the course of the time window. Thus, if

dynamic parameters are to be modelled, the signal must be divided into successive

windows or analysis frames so that the parameters can be calculated often enough to

follow relevant changes. Consequently, in order to obtain F0 and energy contours, smaller

fragments of speech, called frames, are considered.

For each frame, the F0 and energy values are computed. There is a single value per frame, and for its calculation a longer analysis window is employed. All speech signal values inside the analysis window are considered, so that successive analysis windows always overlap. Frame durations of 10 ms and 20 ms are commonly used in speech processing, while window lengths for F0 and energy calculations are usually


established between 25 ms and 40 ms. The analysis performed in the present work

considers frame durations of 10 ms and analysis window lengths of 40 ms.

Since the voiced/unvoiced decision is the basis of the F0 computation, it is the first algorithm described within this section. The decision is frame-based, and F0 is estimated only over voiced frames.

7.1.1 Voiced/unvoiced decision.

Voiced speech involves the vibration of the vocal folds in response to airflow from

the lungs. This vibration is periodic and can be examined independently of the properties of the vocal tract. Its periodicity determines the fundamental frequency of the vibration and the resulting periodicity in the speech signal, also called "pitch".

Figure 7.1. Waveform of the glottal source.

In unvoiced speech the sound source is not a regular vibration but rather vibrations

caused by turbulent airflow due to a constriction in the vocal tract. The sound created as a

result of the constriction is described as a noise source. It contains no dominating periodic

component and has a relatively flat spectrum meaning that every frequency component is

represented equally (in fact for some sounds the noise spectrum may slope down at

around 6 dB/octave). In the time waveform of a noise source, only a random pattern of movement around the zero axis is observed. In such a context, without any periodicity, pitch estimation makes no sense.


Figure 7.2. Different sources in speech production.

Therefore, for F0 estimation it is essential to define which frames are considered voiced and which unvoiced. In contrast with the F0 and energy calculations, non-overlapping windows are employed for the voiced/unvoiced decision: the algorithm uses only signal values contained within a single frame.

Voiced frames are distinguished from unvoiced frames by high amplitude values, a relatively low zero-crossing rate and high energy values. The zero-crossing rate is understood as the number of zero-crossings per time unit, defined from now on as the frame length, i.e. 10 ms. Several procedures to decide between voiced and unvoiced frames are introduced in [Hes83]. The algorithm used here applies thresholds, which are presented in [Hes83], and is described in [Kie97]. As a result of that work, the following quantities proved to be appropriate for the voiced/unvoiced decision:

Zero-crossing rate in Hz:

$$n\_cross_{Hz} = \frac{n\_cross \cdot f_s}{N} \qquad (7.1)$$

Normalised energy of the signal:

$$EneNorm = \frac{1}{N} \sum_{n=1}^{N} \left(\frac{s_n}{MaxRange}\right)^2 \qquad (7.2)$$

Normalised absolute maximum:

$$MaxNorm = \frac{Range}{MaxRange} \qquad (7.3)$$


Where

fs Sampling frequency in Hz (here 16000)

N Frame length in samples (here 160)

sn Value of the nth sample of the signal

n_cross Amount of zero-crossings during a frame

Range Difference between maximum and minimum value in the signal

MaxRange Maximum feasible range, dependent on the quantisation (here 16 bit: MaxRange = 65536)

Normalisation in (7.2) and (7.3) comes from the fact that the speaker may verbalise at different energy levels at different times.

The decision rule is achieved through the comparison of theoretically based thresholds (n_cross_thr, EneNorm_thr, MaxNorm_thr) with the vector whose components result from equations (7.1) to (7.3):

If
    n_cross < n_cross_thr and
    EneNorm > EneNorm_thr and          (7.4)
    MaxNorm > MaxNorm_thr
then
    Voiced
else
    Unvoiced

Where

(n_cross, EneNorm, MaxNorm) = Vector obtained from (7.1) – (7.3); the subscript thr denotes the corresponding threshold.
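As an illustration, the measurements (7.1)–(7.3) and the decision rule (7.4) can be sketched in C, the language used for the feature calculations in this work. The threshold constants below are placeholders, not the tuned Verbmobil values:

```c
#define FS        16000.0   /* sampling frequency in Hz           */
#define MAXRANGE  65536.0   /* 16-bit quantisation range          */

/* Placeholder thresholds; the tuned Verbmobil values are not given here. */
#define ZCR_THR   3000.0    /* zero-crossing rate threshold in Hz */
#define ENE_THR   1.0e-6    /* normalised energy threshold        */
#define MAX_THR   0.01      /* normalised maximum threshold       */

/* Returns 1 if the frame s[0..N-1] is classified as voiced, 0 otherwise. */
int is_voiced(const short *s, int N)
{
    int n, n_cross = 0;
    double ene = 0.0, smin = s[0], smax = s[0];

    for (n = 0; n < N; n++) {
        if (n > 0 && ((s[n-1] < 0) != (s[n] < 0)))
            n_cross++;                               /* sign change */
        ene += (double)s[n] * s[n];
        if (s[n] < smin) smin = s[n];
        if (s[n] > smax) smax = s[n];
    }

    double zcr      = n_cross * FS / N;                  /* (7.1) */
    double ene_norm = ene / (N * MAXRANGE * MAXRANGE);   /* (7.2) */
    double max_norm = (smax - smin) / MAXRANGE;          /* (7.3) */

    return zcr < ZCR_THR && ene_norm > ENE_THR && max_norm > MAX_THR; /* (7.4) */
}
```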

The definition of appropriate thresholds was optimised in order to reach the best algorithm performance for the various speech samples available, according to some theoretical background. The thresholds were selected through experiments made during the development of the Verbmobil project [Hag95]. After some simple experiments based on trial and error, experiments were also conducted using Neural Networks as classifiers for the voiced/unvoiced decision at the frame level. It was observed that this


procedure provided thresholds whose values yielded better results. Detailed information and additional data about voiced/unvoiced decision methods can be found in [Rab78],

[Hes83] and [Kie97].

Since the speech signal conditions in this Diploma Thesis are similar, these thresholds are kept for the calculations computed during it. Before these values were adopted, it was verified that they could classify voiced and unvoiced frames efficiently. The Praat program was employed to compare the regions selected as voiced. Both programs coincided consistently in which regions were classified as voiced. However, the Verbmobil program seemed to yield more accurate boundaries for voiced regions, while Praat created, in certain cases, overly long regions, which included some undesirable unvoiced sounds.

7.1.2 Fundamental Frequency Contour.

7.1.2.1 Previous remark.

This section deals with the fundamental frequency (F0 or pitch) of a periodic signal, which is the inverse of its period, T (see figure 7.1). The period is defined as the smallest positive member of the set of time shifts that leave the signal invariant, and it only makes sense for a perfectly periodic signal. The speech signal results from a combination of a source of sound modulated by a transfer (filter) function (see figure 7.3) determined by the shape of the supra-laryngeal vocal tract, according to the source-filter theory described in section 3.3.2. This theory stems from the experiments of Johannes Müller (1848), who tested a functional theory of phonation by blowing air through larynges excised from human cadavers.

Obviously, a signal cannot be switched on or off or modulated without losing its perfect periodicity, and this combination causes the speech signal to be only quasi-periodic, due to small period-to-period variations either in the vocal cord vibration or in the vocal tract shape. Therefore, the art of fundamental frequency estimation is to deal with this information in a consistent and useful way.


7.1.2.2 Difficulties in estimating pitch contour.

F0 is considered one of the most important features for the characterisation of emotions and is the acoustic correlate of the perceived pitch. Its perception by the human ear is non-linear and depends on the frequency. In addition, the human voice is not a pure sinusoid, but a complex combination of diverse frequencies.

Estimating the pitch of voiced speech sounds has always been a complex problem. Though it appears to be a rather simple task on the surface, there are many subtleties that need to be kept in mind. F0 is usually defined, for voiced speech, as the rate of vibration of the vocal folds. Periodic vibration at the glottis may produce speech that is less than perfectly periodic, due to the changes in the shape of the vocal tract that filters the glottal source waveform, making it hard to estimate the fundamental periodicity from the speech waveform.

Therefore, F0 estimation involves a huge number of considerations; it can be

influenced by many factors such as phone intrinsic parameters or coarticulation.

Furthermore, the excitation signal itself is not truly periodic, but it shows small variations

in period duration (jitter) and in periodic amplitude (shimmer). These aperiodicities, in

the form of relatively smooth changes in amplitude, rate or glottal waveform shape (for

example the duty cycle of open and closed phases), or intervals where the vibration seems

to reflect several superimposed periodicities (diplophony), or where glottal pulses occur

without obvious regularity of time interval or amplitude (glottalizations, vocal creak or fry), do not contribute to speech intelligibility, but rather to the naturalness of human speech.

Figure 7.3. Source filter model for voiced speech.

Therefore, the mapping between physical acoustics and perceived prosody is neither

linear nor one-to-one; as we said, variations in F0 are the most direct cause of pitch

perception, but amplitude and duration also affect pitch and make its estimation more

intricate.

While there are many successfully implemented pitch estimation algorithms (s. [Che01, Hes83]), none of them works without making certain assumptions about the sound being analysed, and every one has to face many difficulties and admit certain failures. The next paragraphs give a brief historical overview of the different methods tried; it can be seen how, starting from the first method ever employed, they meet diverse limitations.

The first method tried was simply to low-pass-filter the speech signal in order to remove all harmonics and then measure the fundamental frequency by any convenient means. This method faced two difficulties. First, the filter had to be adaptive, because pitch can easily cover a 2-to-1 range, and it always had to pass the fundamental and reject the second

harmonic. The filter frequency was set by tracking the pitch and predicting the

forthcoming pitch value; hence any error in one frame of speech could cause the filter to

select the wrong cut-off frequency in the next frame and so lose track of the pitch

altogether. The second difficulty arose from the fact that in many cases pitch had to be estimated from speech in which the fundamental frequency was missing. For instance, in telephone speech the frequency response drops off rapidly below 300 Hz; hence for many male voices the fundamental frequency is absent or so weak as to be lost in the system

noise.

In the absence of the fundamental, it is common to search for periodicities in a signal

by examining its autocorrelation function. In a periodic function, the autocorrelation will

show a maximum at a lag equal to the period of the function. A first problem is that

speech is not exactly periodic, because of changes in pitch and in formant frequencies.

Therefore, the maximum may be lower and broader than expected, causing problems in

setting the decision threshold. Another problem arises from the possibility that the first

formant frequency is equal to or below the pitch frequency. If its amplitude is

particularly high, this situation can yield a peak in the autocorrelation function that is


comparable to the peak belonging to the fundamental. As a result, a pitch tracking process is used; such a process can usually ride out a single error, but not a string of errors.

Pitch can be determined either from periodicity in the time domain or from regularly

spaced harmonics in the frequency domain. Consequently, pitch estimation techniques

can be classified into two main groups:

period-synchronous procedures: These methods try to follow the periodic

characteristics of the signal, e.g. positive zero-crossings, and estimate the signal

period from this information.

short-term analysis procedures (window based). The short-term variety of

estimators operates on a block (short-time frame) of speech samples and, for each

one of these frames, one pitch value is estimated. The series of estimated values yield

the fundamental frequency contour of the signal. There are different short-time

analysis procedures e.g. cross- or autocorrelation or algorithms that operate in the

frequency domain. Spectral procedures transform frames spectrally to enhance the

periodicity information in the signal. Periodicity appears as peaks in the spectrum at

the fundamental and its harmonics.

Period-synchronous procedures have the advantage of being generally faster and

present an adequate performance in most applications. Short-term methods are considered

more accurate and robust, due to the higher precision of calculating one changing

attribute in a shorter time interval. In addition, they are less affected by noise and do not require complex post-processing. Consequently, a short-term analysis procedure is used

in this thesis for F0 calculation.

7.1.2.3 Description of the algorithm.

The program used for F0 and energy contour calculations is a part of the prosodic

module employed in the second phase prototype of the Verbmobil project. The procedure

was developed in previous works at the Chair for Pattern Recognition of the Friedrich-Alexander-University Erlangen-Nuremberg and is widely detailed in manifold works (s. [Kom89, Not91, Pen93, Har94, Kie97]). Consequently, only a brief description is given here.


Fundamental frequency estimation through a window-based procedure

This procedure performs a short-term analysis, which works in the spectral domain and provides sequential F0 computation. As already clarified, since F0 only makes sense for voiced frames, the voiced/unvoiced decision must be the first step when the F0 estimation problem is faced. The way this decision is made was detailed in section 7.1.1.

For the prosodic analysis of the human voice, F0 is usually expected to lie in the interval between 35 Hz and 550 Hz. According to the Shannon theorem [Sha49], an analog signal must be sampled at no less than twice its highest frequency to be recoverable without losses. In order to respect this theorem, voiced regions are low-pass-filtered with a cut-off frequency of 1100 Hz. Through this limitation of the F0 maximum to 550 Hz, noise and errors affect the algorithm less. Then, the low-pass-filtered signal is resampled at a lower sampling frequency (downsampling) in order to reduce the number of signal values that must be computed. Consequently, the F0 estimation process is accelerated. For the resulting frames, the short-time spectrum is calculated through the Fast Fourier Transform (FFT, s. [Nie83]).

The procedure is based on the assumption that the absolute maximum of the short-time spectrum corresponds to one harmonic of the F0. The main difficulty of the algorithm is to find a proper decision rule that chooses the correct maximum of the spectrum inside a voiced frame. This decision is made here indirectly through an implemented Dynamic Programming (DP) procedure. For every F0 value to be estimated (one per voiced frame), several absolute decision values (dividers) are allowed. The dividers of all the frames in one voiced region hence yield a matrix, which is used by DP to compose a specific low-cost function, employed to find the optimal F0 path. This cost function takes into account the distance to adjacent candidates and the distance to a known target value. The target value is calculated, for reasons of robustness, from the voiced frames with the maximum of the energy signal using a multi-channel procedure. Different possible candidates are calculated for every target value using correlation methods (periodic AMDF procedures, s. [Ros74]) and frequency domain procedures (Seneff procedures, s. [Sen78]), and the median of these values results in the target value of the voiced interval. The arithmetic mean of all the target values of the speech signal is


the reference point R, which is applied for the divider determination within every voiced frame. For each frame t, the spectrum from start-frame S to end-frame E of the voiced interval is considered, and the frequency Ft with maximum energy in this spectrum is calculated. With the help of the divisors Kt = Ft/R, the matrix J, containing diverse F0 candidates, is defined:

$$J_{t,k} = \frac{F_t}{\operatorname{round}(K_t) + k}, \qquad k = -n, \dots, n, \quad S \le t \le E \qquad (7.5)$$

Preliminary tests showed that the correct F0 value is usually among the candidates when five candidates are considered (n = 2). Now, with the help of a recursive cost function and by means of DP, the best path through the matrix J can be found, which finally yields the F0 contour of the voiced region.
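A minimal sketch of how the candidate matrix J of (7.5) might be filled in C, assuming the rounded-divisor reading given above; the DP search and its cost function are omitted (see [Pen93, Kie97]):

```c
#include <math.h>

#define NCAND 2   /* n = 2, i.e. five candidates per frame */

/* Fills J[t-S][k+NCAND] with the candidates Ft / (round(Kt) + k) for the
 * frames t = S..E of one voiced region, given the reference value R. */
void fill_candidates(const double *Ft, int S, int E, double R,
                     double J[][2 * NCAND + 1])
{
    for (int t = S; t <= E; t++) {
        double Kt = Ft[t] / R;               /* divisor of frame t      */
        for (int k = -NCAND; k <= NCAND; k++) {
            double d = floor(Kt + 0.5) + k;  /* integer divisor near Kt */
            J[t - S][k + NCAND] = (d > 0.0) ? Ft[t] / d : 0.0;
        }
    }
}
```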

In addition, the procedure has some other advantages. On the one hand, F0 values are not estimated in isolation for every frame. Instead, the cost function establishes a relation with the nearest neighbours, so that their spectral characteristics are also taken into account. On the other hand, proceeding this way, short irregular periods produce no perturbation of the results. An additional benefit is that the computational expense for every single frame, for which the estimated value is calculated, is limited. For a further description of the cost function see [Pen93] and [Kie97].

Post-processing of the F0 Contour

Independently of the F0 calculation method employed, post-processing is undoubtedly favourable, since direct application of the raw F0 values to further prosodic feature calculations would be definitely inadequate. Post-processing of the F0 values is motivated by several reasons:

Automatic algorithms for F0 extraction generate errors.

Values of F0 are not calculated for every single frame of the signal.

Fluctuations between adjacent F0 values are disturbing under certain conditions.


Calculations from the F0 contour are dependent on the voice reference (e.g.

maximum).

Several possibilities for post-processing the fundamental frequency contour can be

found in [Hes83]. In the framework of this work, post-processing is accomplished in

different steps, as follows:

Smoothing of the F0 curve through a median filter.

Zero-setting of all the F0 values between 35 Hz and 60 Hz (before interpolation)

Interpolation of the unvoiced interval.

Semitone transformation and mean value subtraction.

Small failures of the algorithm can yield some undesirable noise. Smoothing of the F0 curve through a median filter is employed in order to remove some of these small failures. Smoothing increases the signal-to-noise ratio and allows the signal characteristics (peak position, height, width, area, etc.) to be measured more accurately.

Figure 7.4. Smoothing. The right peak is the result of smoothing the left noisy peak.
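A sliding median filter of this kind can be sketched in C as follows; the window length is left as a parameter, since the thesis does not state the value used:

```c
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Median-filters f0[0..n-1] into out[0..n-1] with an odd window length
 * w (w <= 63); the window is truncated at the contour boundaries. */
void median_smooth(const double *f0, double *out, int n, int w)
{
    double win[64];
    int h = w / 2;

    for (int i = 0; i < n; i++) {
        int lo = (i - h < 0) ? 0 : i - h;
        int hi = (i + h >= n) ? n - 1 : i + h;
        int m = hi - lo + 1;
        memcpy(win, f0 + lo, m * sizeof(double));
        qsort(win, m, sizeof(double), cmp_double);
        out[i] = win[m / 2];                 /* median of the window */
    }
}
```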

The zero-setting of all the values between 35 and 60 Hz before the interpolation is mainly adequate when recordings are carried out by means of WOZ dialogues. Usually, the start and end points of the uttered expression are classified as voiced due to the system response and the contribution of the human voice also present in these parts. Values of F0 contained in such intervals habitually fall in the range between 35 and 60 Hz. By zero-setting them, the influence of the system response on the utterance is removed.

Though F0 values are not computed over unvoiced frames, a continuous F0 contour would be desirable for further feature calculation. Therefore, interpolation over the unvoiced frames is absolutely required. Numerous alternatives exist for interpolation over intervals whose F0 cannot be calculated. In the present Diploma Thesis, as proposed in [Kie97], linear interpolation is applied, with extrapolation used exclusively at the beginning and at the end of the phrase.

In addition, in order to reproduce the human ear response, a semitone transformation is performed over the resulting interpolated F0 contour using the following function:

$$ST(F0) = c \cdot \ln(F0) \qquad (7.6)$$

By choosing c = 12/ln(2), semitones relate to 1 Hz as reference value. For normalisation of the F0 contour, its mean value is subtracted from each F0 value.
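The remaining steps (linear interpolation over unvoiced frames, the semitone transformation (7.6) and mean subtraction) could look as follows in C; treating frames with F0 = 0 as unvoiced is an assumption of this sketch, and the extrapolation at the phrase boundaries is omitted:

```c
#include <math.h>

/* Post-processes a smoothed F0 contour of n frames in place:
 * 1. linear interpolation over zero (unvoiced) frames,
 * 2. semitone transformation ST = (12 / ln 2) * ln(F0), eq. (7.6),
 * 3. subtraction of the contour mean. */
void postprocess_f0(double *f0, int n)
{
    const double c = 12.0 / log(2.0);
    int i, j, k;

    /* 1. interpolate each unvoiced gap between its voiced neighbours */
    for (i = 0; i < n; i++) {
        if (f0[i] != 0.0) continue;
        for (j = i; j < n && f0[j] == 0.0; j++)
            ;                                /* find end of the gap */
        if (i > 0 && j < n)
            for (k = i; k < j; k++)
                f0[k] = f0[i-1] + (f0[j] - f0[i-1]) * (k - i + 1) / (j - i + 1);
        i = j;
    }

    /* 2. semitone transformation relative to 1 Hz */
    double mean = 0.0;
    for (i = 0; i < n; i++) {
        f0[i] = (f0[i] > 0.0) ? c * log(f0[i]) : 0.0;
        mean += f0[i];
    }
    mean /= n;

    /* 3. mean subtraction */
    for (i = 0; i < n; i++)
        f0[i] -= mean;
}
```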

7.1.3 Energy Contour.

The coupling of loudness perception with its acoustic measurement is as complex as the coupling of perceived pitch with the computable F0. The sensation of loudness depends both on the frequency of the sound and on its duration and, conversely, pitch perception depends on loudness (s. [Zwi67]). This complex interdependence is not directly taken into account by the following algorithm; energy and F0 values are stored together in a feature vector and, consequently, an implicit standardisation takes place.

The basic calculation procedures used for the computation of energy as the acoustic correlate of perceived loudness are based on relations between physical acoustic pressure magnitudes ps, measured in Pascal (1 Pa = 1 N/m²), and the acoustic intensity Is, whose unit is W/m². It can be stated that Is is proportional to ps². With the help of the acoustic intensity reference value, I0 = 1 pW/m², and the acoustic pressure reference value, p0 = 20 µPa, which represent the human auditory threshold at mid-range frequencies, the absolute acoustic pressure level in decibels (dB) is given by:

$$L_s = 10 \log_{10}\!\left(\frac{I_s}{I_0}\right) = 20 \log_{10}\!\left(\frac{p_s}{p_0}\right) \qquad (7.7)$$

The acoustic magnitude loudness quantifies the perceived intensity ratio between two tones; a 1 kHz tone with a loudness of 40 phon (acoustic pressure level of 40 dB) is applied as reference. In addition, loudness varies proportionally to the third root of the intensity.

Automatic computation of the energy contour can be achieved through different methods. In this Diploma Thesis a general method is employed, using the following formula:

$$E_i = \sum_{n} T[s_n] \, w_{i-n} \qquad (7.8)$$

T[.] represents a convenient transformation of the signal values sn, and wn corresponds to an adequate window function used to obtain precise segments of the signal. Values outside the used window are usually set to 0, in order to keep the procedures finite.

There are many possibilities for the choice of the transformation and the windowing function. In the loudness calculation process, a Hamming window wnH (figure 7.5) has been used, with the form:

$$w_n^H = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \qquad (7.9)$$

Figure 7.5. The Hamming Window.


There, N represents the window size in samples. The rectangular window is proved to give maximum sharpness but large side-lobes (ripples), while the Hamming window blurs in frequency but produces much less leakage.

For the loudness calculation, the reference value I0 is needed, which can no longer be extracted from digitised signals. For a 16-bit quantisation and a maximum acoustic pressure level of 60 dB, which represents a standard value during normal conversation, I0 is computed with equation 7.7 as follows:

$$I_0 = \frac{(2^{15})^2}{10^{60/10}} \approx 1074 \qquad (7.10)$$

Using Hamming windows wnH of 40 ms duration, thus with 0.040 s · 16000 Hz = 640 samples (N = 640), the intensity value Is of the frame i can be estimated through the following expression:

$$I_{s,i} = \frac{1}{N} \sum_{n=0}^{N-1} \bigl(w_n^H \, s_{n + i \cdot 160}\bigr)^2 \qquad (7.11)$$

The effective loudness value Lhi of the frame i can therefore be estimated through its relation to the intensity as follows:

$$Lh_i = \sqrt[3]{\frac{I_{s,i}}{I_0}} \qquad (7.12)$$
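Under the values stated above (16 kHz sampling, 640-sample Hamming window, 160-sample frame shift) and the reconstructed I0 of (7.10), the loudness of one frame could be computed along these lines:

```c
#include <math.h>

#define PI     3.14159265358979323846
#define WIN    640                           /* 40 ms analysis window       */
#define SHIFT  160                           /* 10 ms frame shift           */
#define I0     (32768.0 * 32768.0 / 1.0e6)   /* reference value, eq. (7.10) */

/* Computes the loudness Lh of frame i of the signal s (len samples).
 * Returns 0 if the analysis window does not fit into the signal. */
double frame_loudness(const short *s, long len, long i)
{
    long start = i * SHIFT;
    if (start + WIN > len) return 0.0;

    double Is = 0.0;
    for (int n = 0; n < WIN; n++) {
        double w = 0.54 - 0.46 * cos(2.0 * PI * n / (WIN - 1)); /* (7.9) */
        double x = w * s[start + n];
        Is += x * x;
    }
    Is /= WIN;                               /* mean intensity, eq. (7.11) */

    return cbrt(Is / I0);                    /* third root, eq. (7.12)     */
}
```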

Throughout this Diploma Thesis, both loudness and energy describe this magnitude and are utilised as synonyms. For further details on different energy calculation procedures or windowing functions, refer to the corresponding section in [Kie97].


7.2 Prosodic Features

Previous research on feature extraction for emotion recognition has focused on prosodic features, based on different linguistic units such as utterance vectors [Bat00], word vectors [Hub98] or intervals [Ami01]. In the present work we attempt to recognise emotions from the speech signal given a short command (approximately 2 to 4 seconds), without drawing any profit from context or linguistic information. In the long term, the goal of the investigation initiated during this thesis is a speaker- and language-independent emotion classifier. Such a challenging purpose leads us to deal only with global acoustic features, computed for a whole utterance or command, which seem to have the favour of many recent studies (s. [Del96, Pet00]).

The term prosody, previously introduced in section 3.1, comprises a number of

attributes that can be classified into basic or compound characteristics.

The main basic prosodic attributes are loudness, pitch and duration-related attributes such as segment duration, speaking rate and pauses. Compound attributes derive from them: intonation, accentuation, prosodic phrases, rhythm and hesitation.

With the aim to map emotions onto the activation axis (see Chapter 2), we make a classification depending on prosodic characteristics, since most studies point to them as the attributes most related to emotions that differ in the activation dimension. With this aim we extracted features that model logarithmic F0, energy and durational aspects. Here we will mainly deal with acoustic prosodic features that are computed for the whole utterance.

During this work, different kinds of prosodic features have been used, mainly divided into two groups:

P1- Features related to basic prosodic attributes (i.e. energy and pitch) and the pitch derivative. Most features are rooted in statistics computed over all the frames in a sentence and in linear regression coefficients of the contour. These parameters derive

from studies by [Bat00] and [Del96].

P2- Features related to prosodic compound attributes, which are more relational and provide information closer to the intonation and to changes in the P1 features. These parameters

are based on the features proposed in [Tis93].

The calculations of both sets of features were written in the C programming language, and the description of their extraction method is given below.


7.2.1 P1

In this section, features of the first set are presented. Each feature is referenced with a

number that corresponds to its index within the output vector from the C program which

computes this set of features (ppal.c).

7.2.1.1 Energy based features.

These features derive from the estimated energy contour. For every frame i an energy

value Ei exists. For further information about how this curve is obtained, see section

7.1.3.

P1.0 - ENER_MAX: Short-term energy maximum.

Maximum value of the energy curve in the whole utterance. The value is achieved by

inspection of the energy values of all the frames within one utterance and selecting the

maximum numeric value among them.

P1.1 - ENER_MAX_POS: Position of short-term-energy maximum.

Relative time position of the maximum energy value in the utterance. The

maximum energy value is P1.0 and its temporal position in the sentence is divided by the

utterance overall length. Calculations are made in frames:

$$ENER\_MAX\_POS = \frac{i_{E\max}}{N} \qquad (7.13)$$

Where

iEmax= frame position of the maximum energy value on the time axis.

N= number of frames in the whole utterance.

P1.2 - ENER_MIN: Short-term-energy minimum.

Minimum value of the energy curve in the whole utterance. The value is achieved

by inspection of the energy values of all the frames within one utterance and selecting the

minimum numeric value among them.


P1.3 - ENER_MIN_POS: Position of short-term-energy minimum.

Relative time position of the minimum energy value in the utterance. The minimum

energy value is P1.2 and its temporal position in the sentence is divided by the utterance

overall length. Calculations are made in frames:

$$ENER\_MIN\_POS = \frac{i_{E\min}}{N} \qquad (7.14)$$

Where

iEmin = frame position of the minimum energy value on the time axis.

N = number of frames in the whole utterance.

P1.4 - ENER_REG_COEF: Regression coefficient for short-term-energy.

Slope coefficient of the regression line for the energy curve values in the utterance.

$$b_E = \frac{\sum_{i=1}^{N} (i - \mu_i)(E_i - \mu_E)}{\sum_{i=1}^{N} (i - \mu_i)^2} \qquad (7.15)$$

With

$$\mu_i = \frac{1}{N} \sum_{i=1}^{N} i \qquad (7.16)$$

$$\mu_E = \frac{1}{N} \sum_{i=1}^{N} E_i \qquad (7.17)$$

Where

i = frame position on the time axis.

Ei = Estimated energy in the ith frame according to the algorithm described in

section 7.1.3.

N = Number of frames in the whole utterance.

P1.5 - ENER_SQR_ERR: Mean square error of the regression line for short-term-energy.

Mean square error between the regression line and the real energy curve.

$$MSE_E = \frac{1}{N} \sum_{i=1}^{N} \bigl(E_i - \hat{E}_i\bigr)^2 \qquad (7.18)$$

With

$$\hat{E}_i = a_E + b_E \, i \qquad (7.19)$$

$$a_E = \mu_E - b_E \, \mu_i \qquad (7.20)$$

Where

i = frame position on the time axis.

Ei = Estimated energy in the ith frame according to the algorithm described in

section 7.1.3.

N = Number of frames in the whole utterance.
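The regression slope (7.15) and its mean square error (7.18) can be obtained in a few passes over the contour; a sketch following the reconstructions above:

```c
/* Least-squares regression of the energy contour E over the frame index:
 * returns the slope b (P1.4) and stores the mean square error between the
 * regression line and the contour in *mse (P1.5). Assumes N > 1. */
double energy_regression(const double *E, int N, double *mse)
{
    double mi = 0.0, mE = 0.0, num = 0.0, den = 0.0;
    int i;

    for (i = 0; i < N; i++) { mi += i; mE += E[i]; }
    mi /= N;  mE /= N;                        /* eqs. (7.16), (7.17) */

    for (i = 0; i < N; i++) {
        num += (i - mi) * (E[i] - mE);
        den += (i - mi) * (i - mi);
    }
    double b = num / den;                     /* slope, eq. (7.15)     */
    double a = mE - b * mi;                   /* intercept, eq. (7.20) */

    *mse = 0.0;
    for (i = 0; i < N; i++) {
        double r = E[i] - (a + b * i);        /* regression line (7.19) */
        *mse += r * r;
    }
    *mse /= N;                                /* eq. (7.18) */
    return b;
}
```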

P1.6 - ENER_MEAN: Mean of short-term-energy.

Mean energy value calculated over the whole utterance. Energy values of all the

frames in a sentence are summed and then divided by the total number of frames.

$$\mu_E = \frac{1}{N} \sum_{i=1}^{N} E_i \qquad (7.21)$$

P1.7 - ENER_VAR: Variance of short-term-energy.

Variance of the energy values over the whole utterance.

$$\sigma_E^2 = \frac{1}{N} \sum_{i=1}^{N} \bigl(E_i - \mu_E\bigr)^2 \qquad (7.22)$$

Where

µE = Energy mean (P1.6).

7.2.1.2 Fundamental frequency based features.

These features are extracted from the estimated F0 curve, i.e. the logarithmic and interpolated F0 curve described above. F0i represents the F0 value of the ith frame. For a further description of how this curve is obtained, see section 7.1.2.


Since the fundamental frequency only exists inside voiced frames, all the outcomes related to F0 are confined to voiced regions, where 'voiced region' is understood as a speech interval containing more than three successive voiced frames. For further information about the voiced/unvoiced decision see section 7.1.1.

P1.8 - F0_MAX: F0 maximum.

Maximum value of the F0 curve in the voiced parts of the utterance. The value is

achieved by inspection of the pitch values of all the frames labelled as voiced in the

utterance and selecting the maximum numeric value among them.

P1.9 - F0_MAX_POS: Position of F0 maximum on time axis.

Relative time position of the maximum F0 value in the utterance. The maximum

pitch value is P1.8 and its temporal position in the sentence is divided by the utterance

overall length. Calculations are made in frames:

$$F0\_MAX\_POS = \frac{i_{F0\max}}{N} \qquad (7.23)$$

Where

iF0max = frame position of the maximum F0 value on the time axis.

N= number of frames in the whole utterance.

P1.10 - F0_MIN: F0 minimum.

Minimum value of the F0 curve in the voiced parts of the utterance. The value is

achieved by inspection of the pitch values of all the frames labelled as voiced in the

utterance and selecting the minimum numeric value among them.

P1.11 - F0_MIN_POS: Position of F0 minimum on time axis.

Relative time position of the minimum F0 value in the utterance. The minimum

pitch value is P1.10 and its temporal position in the sentence is divided by the utterance

overall length. Calculations are made in frames:

$$F0\_MIN\_POS = \frac{i_{F0\min}}{N} \qquad (7.24)$$


Where

iF0min = frame position of the minimum pitch value on the time axis.

P1.12 - F0_REG_COEF: Regression coefficient for F0.

Slope coefficient of the regression line for the F0 curve values in the utterance.

$$b_{F0} = \frac{\sum_{i=1}^{N} (i - \mu_i)(F0_i - \mu_{F0})}{\sum_{i=1}^{N} (i - \mu_i)^2} \qquad (7.25)$$

With

$$\mu_i = \frac{1}{N} \sum_{i=1}^{N} i \qquad (7.26)$$

$$\mu_{F0} = \frac{1}{N} \sum_{i=1}^{N} F0_i \qquad (7.27)$$

Where

i = frame position on the time axis.

F0i = Estimated pitch in the ith frame according to the algorithm described in

7.1.2.

N = Number of frames in the whole utterance.

P1.13 - F0_SQR_ERR: Mean square error for regression coefficient.

Mean square error between the regression line and the real F0 curve.

$$MSE_{F0} = \frac{1}{N} \sum_{i=1}^{N} \bigl(F0_i - \hat{F0}_i\bigr)^2 \qquad (7.28)$$

With

$$\hat{F0}_i = a_{F0} + b_{F0} \, i \qquad (7.29)$$

$$a_{F0} = \mu_{F0} - b_{F0} \, \mu_i \qquad (7.30)$$

Where


i = frame position on the time axis.

F0i = Estimated pitch in the ith frame according to the algorithm described in

section 7.1.2.

N = Number of frames in the whole utterance.

P1.14 - F0_MEAN: F0 mean.

Mean F0 value calculated over the voiced regions of the utterance. Pitch values of all

the voiced frames in a sentence are summed and then divided by the total number of

voiced frames.

$$\mu_{F0} = \frac{1}{N_v} \sum_{i \in V} F0_i \qquad (7.31)$$

where V is the set of voiced frames and Nv their number.

P1.15 - F0_VAR: F0 variance.

Variance of the F0 values over the voiced regions in the utterance.

$$\sigma_{F0}^2 = \frac{1}{N_v} \sum_{i \in V} \bigl(F0_i - \mu_{F0}\bigr)^2 \qquad (7.32)$$

Where

µF0 = Pitch mean (P1.14).

P1.36 - Jitter.

Periodic jitter is defined as the relative mean absolute third-order difference of the

point process. This feature is exceptionally calculated using Praat and then included in the

feature vector. The value is computed through the following formula:

$$\text{Jitter} = \frac{\dfrac{1}{N-2} \sum_{i=2}^{N-1} \bigl|2T_i - T_{i-1} - T_{i+1}\bigr|}{\dfrac{1}{N} \sum_{i=1}^{N} T_i} \qquad (7.33)$$

Where


Ti = ith interval (period).

N = number of intervals.

For its computation, two arguments are required:

- Shortest period: Shortest possible interval that will be considered. For intervals Ti shorter than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero. This argument is set to a very small value, 0.1 ms.

- Longest period: Longest possible interval that will be considered. For intervals Ti

longer than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero.

Establishing the minimum frequency of periodicity as 50 Hz, the value for this parameter

is 20 ms; intervals longer than that will be considered unvoiced.
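Following the reconstruction of (7.33), the jitter of a period sequence could be computed as below; the handling of the two period limits mirrors the arguments described above:

```c
#include <math.h>

/* Relative mean absolute third-order difference of the periods T[0..N-1]
 * (in seconds), eq. (7.33), with N > 2. Terms involving periods outside
 * [tmin, tmax] (here 0.1 ms and 20 ms) are taken as zero. */
double jitter(const double *T, int N, double tmin, double tmax)
{
    double num = 0.0, den = 0.0;
    int i;

    for (i = 1; i < N - 1; i++) {
        if (T[i-1] < tmin || T[i-1] > tmax ||
            T[i]   < tmin || T[i]   > tmax ||
            T[i+1] < tmin || T[i+1] > tmax)
            continue;                         /* term taken as zero */
        num += fabs(2.0 * T[i] - T[i-1] - T[i+1]);
    }
    for (i = 0; i < N; i++)
        den += T[i];

    return (num / (N - 2)) / (den / N);       /* relative to the mean period */
}
```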

7.2.1.3 Voiced/unvoiced regions based features.

These features have their roots in the voiced/unvoiced information, which is obtained through an algorithm that assigns 1 to voiced frames and 0 to unvoiced ones. For a further description of the decision algorithm, see 7.1.1.

P1.16 - F0_FIRST_VCD_FRAME.

F0 value for the first voiced frame in the utterance.

P1.17 - F0_LAST_VCD_FRAME.

F0 value for the last voiced frame in the utterance.

P1.18 - NUM_VOICED_REGIONS.

Number of regions containing more than three successive voiced frames. Regions containing three or fewer voiced frames are not taken into consideration, although their frames are still counted as voiced.

P1.19 - NUM_UNVCD_REGIONS.

Number of regions with more than three successive unvoiced frames. Same

considerations as P1.18 are used to define regions.

P1.20 - NUM_VOICED_FRAMES.


Amount of voiced frames in the utterance. Isolated voiced frames as well as frames

belonging to a voiced region are counted.

P1.21 - NUM_UNVCD_FRAMES.

Number of unvoiced frames in the utterance. Isolated unvoiced frames as well as

frames belonging to an unvoiced region are counted.

P1.22 - LGTH_LNGST_V_REG.

Length of the longest voiced region. The number of frames for each voiced region is

counted and the highest amount is taken as feature P1.22.

P1.23 - LGTH_LNGST_UV_REG.

Length of longest unvoiced region. The number of frames for each unvoiced region is

counted and the highest amount is taken as feature P1.23.

P1.24 - RATIO_V_UN_FRMS.

Ratio of number of voiced frames and number of unvoiced frames.

$$RATIO\_V\_UN\_FRMS = \frac{NUM\_VOICED\_FRAMES}{NUM\_UNVCD\_FRAMES} \qquad (7.33)$$

P1.25 - RATIO_V_UN_REG.

Ratio of number of voiced regions and number of unvoiced regions.

$$RATIO\_V\_UN\_REG = \frac{NUM\_VOICED\_REGIONS}{NUM\_UNVCD\_REGIONS} \qquad (7.34)$$

P1.26 - RATIO_V_ALL_FRMS.

Ratio of number of voiced frames and number of all frames.

$$RATIO\_V\_ALL\_FRMS = \frac{NUM\_VOICED\_FRAMES}{N} \qquad (7.35)$$

P1.27 - RATIO_UV_ALL_FRMS.


Ratio of number of unvoiced frames and number of all frames.

$$RATIO\_UV\_ALL\_FRMS = \frac{NUM\_UNVCD\_FRAMES}{N} \qquad (7.36)$$
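The region and frame counts behind P1.18 to P1.27 follow directly from the 0/1 frame labels; a sketch (the unvoiced counterparts are obtained by inverting the labels):

```c
/* Counts voiced frames (P1.20), voiced regions, i.e. runs of more than
 * three successive voiced frames (P1.18), and the longest region (P1.22)
 * from the 0/1 voicing labels v[0..N-1]. */
void voiced_stats(const int *v, int N,
                  int *n_frames, int *n_regions, int *longest)
{
    int run = 0;
    *n_frames = *n_regions = *longest = 0;

    for (int i = 0; i <= N; i++) {
        if (i < N && v[i]) {                 /* extend the current run     */
            (*n_frames)++;
            run++;
        } else {                             /* run ended or end of signal */
            if (run > 3) {                   /* "more than three" frames   */
                (*n_regions)++;
                if (run > *longest) *longest = run;
            }
            run = 0;
        }
    }
}
```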

7.2.1.4 Pitch contour derivative based features.

The derivative of the F0 contour is computed and similar operations are performed on it. The calculations follow procedures identical to the F0 case, and therefore the features are only listed.

P1.28 - F0_DER_MAX.

F0 derivative maximum.

P1.29 - F0_DER_MAX_POS.

Relative position of F0 derivative maximum.

P1.30 - F0_DER_MIN.

F0 derivative minimum.

P1.31 - F0_DER_MIN_POS.

Relative position of F0 derivative minimum.

P1.32 - F0_DER_REG_COEF.

Regression coefficient for F0 derivative.

P1.33 - F0_DER_SQR_ERR.

Mean square error for regression coefficient for F0 derivative.

P1.34 - F0_DER_MEAN.

F0 derivative mean.

P1.35 - F0_DER_VAR.


F0 derivative variance.

7.2.2 P2.

This section introduces the features included in the second set. The program used to

calculate them is called complex_calcs.c (see chapter 10).

In order to obtain information associated with changes in the signal, the following features result from relations among signal parameters, instead of being direct measurements. In this section, N corresponds to the number of voiced regions in the utterance.

P2.0: Mean of the pitch means in every voiced region.

$$F0AbsMean = \frac{1}{N} \sum_{n=1}^{N} \mu_{F0,n} \qquad (7.37)$$

Where

µF0,n = Mean of the pitch values in the voiced region n.

P2.1: Variance of the pitch means in every region.

$$F0AbsVar = \frac{1}{N} \sum_{n=1}^{N} \bigl(\mu_{F0,n} - F0AbsMean\bigr)^2 \qquad (7.38)$$

Where

µF0,n = Mean of the pitch values in the voiced region n.

F0AbsMean = P2.0.

P2.2: Mean of the maximum pitch values in every region.

$$F0MaxMean = \frac{1}{N} \sum_{n=1}^{N} F0_{\max,n} \qquad (7.39)$$

Where

F0max,n = Maximum of the pitch values within the voiced region n.


P2.3: Variance of the maximum pitch values in every region.

$$F0MaxVar = \frac{1}{N} \sum_{n=1}^{N} \bigl(F0_{\max,n} - F0MaxMean\bigr)^2 \qquad (7.40)$$

Figure 7.6. F0 contour and points selected for calculations of P2.4 and P2.5.

P2.4: Pitch increasing per voiced region.

This feature takes four points into account inside each voiced part of the utterance (see

figure 7.6):

1. Beginning of the voiced region.

2. End of the voiced region.

3. Maximum pitch value.

4. Second maximum pitch value.

The sum of all pitch differences between two successive increasing points, divided by

their respective time difference is computed. The final value for this feature results from

the arithmetic mean of this calculation over all voiced parts contained in the utterance.

$$F0Incr = \frac{1}{N} \sum_{n=1}^{N} \; \sum_{(i,j)} \frac{F0_j - F0_i}{t_j - t_i} \qquad (7.41)$$

Where

i , j = represent one of the four points considered, where i appears before j

ti <tj

F0i < F0j


P2.5: Pitch decreasing per voiced region.

The same points as in P2.4 are taken into account (figure 7.6). In this case, the sum of all pitch differences between two successive decreasing points, divided by their respective time difference, is calculated. The final value results from the arithmetic mean of this calculation over all voiced parts contained in the utterance.

$$F0Decr = \frac{1}{N} \sum_{n=1}^{N} \; \sum_{(i,j)} \frac{F0_j - F0_i}{t_j - t_i} \qquad (7.42)$$

Where

i and j represent one of the four points considered, where i appears before j

ti <tj

F0i > F0j

P2.6: Mean of the pitch ranges in every voiced region.

$$F0Range = \frac{1}{N} \sum_{n=1}^{N} \bigl(F0_{\max,n} - F0_{\min,n}\bigr) \qquad (7.43)$$

P2.7: Flatness.

Mean of the flatness (mean/max) of the pitch for every voiced region multiplied by

100.

$$F0Flatness = \frac{100}{N} \sum_{n=1}^{N} \frac{\mu_{F0,n}}{F0_{\max,n}} \qquad (7.44)$$

P2.8: Mean of the relative duration from the beginning of the voiced part to the position

of the pitch maximum in every voiced region multiplied by 100.

$$F0MaxPos = \frac{100}{N} \sum_{n=1}^{N} \frac{t_{F0\max,n} - t_{start,n}}{t_{end,n} - t_{start,n}} \qquad (7.45)$$


P2.9: Peaks increasing for the whole utterance.

The maximum of each voiced region is considered. Sum of all pitch differences

between two successive increasing points, divided by their respective time difference is

calculated. This feature is similar to P2.4 but generalised to the whole sentence.

$$F0PeaksIncr = \sum_{(i,j)} \frac{F0_{\max,j} - F0_{\max,i}}{t_j - t_i} \qquad (7.46)$$

Where

ti , tj = positions of the maximum value for regions i and j (ti <tj).

F0max = the maximum pitch value in every region. Maximum of region j must

be higher than maximum of region i.

P2.10: Peaks decreasing for the whole utterance.

The maximum of each voiced region is considered. Sum of all pitch differences

between two successive decreasing points, divided by their respective time difference is

calculated. This feature is similar to P2.5 but generalised to the whole sentence.

$$F0PeaksDecr = \sum_{(i,j)} \frac{F0_{\max,j} - F0_{\max,i}}{t_j - t_i} \qquad (7.47)$$

Where

ti ,tj = positions of the maximum value for regions i and j (ti <tj).

F0max = the maximum value in every region. Maximum of region j must be

lower than maximum of region i.

P2.11: Mean of the voiced region duration.

$$VcdDurMean = \frac{1}{N} \sum_{n=1}^{N} \bigl(t_{end,n} - t_{start,n}\bigr) \qquad (7.48)$$

P2.12: Global energy mean.


Mean of the energy means in every voiced region multiplied by 100 and divided by

the absolute energy maximum of the whole utterance.

$$EMeanGlobal = \frac{100}{N \cdot E_{\max}} \sum_{n=1}^{N} \mu_{E,n} \qquad (7.49)$$

P2.13: Mean of the relative duration from the beginning of the voiced region to the

position of the energy maximum in every voiced region. Multiplied by 100 and divided

by the absolute energy maximum of the whole utterance.

$$\frac{100}{N \cdot E_{\max}} \sum_{n=1}^{N} \bigl(t_{\max,n} - t_{start,n}\bigr) \qquad (7.50)$$

Where

tstart = starting point of the voiced region.

tmax = energy maximum position of the region.

P2.14: Mean of the relative duration from the position of the energy maximum in every

voiced region to the end of the voiced region. Multiplied by 100 and divided by the

absolute energy maximum of the whole utterance.

$$\frac{100}{N \cdot E_{\max}} \sum_{n=1}^{N} \bigl(t_{end,n} - t_{\max,n}\bigr) \qquad (7.51)$$

Where

tend = end point of the voiced region.

tmax = energy maximum position of the region.

P2.15: Mean of the vehemence (mean/min) of the energy in every voiced region.

$$EVehemence = \frac{1}{N} \sum_{n=1}^{N} \frac{\mu_{E,n}}{E_{\min,n}} \qquad (7.52)$$


P2.16: Mean of the flatness (mean/max) of the energy in every voiced region multiplied

by 100.

$$EFlatness = \frac{100}{N} \sum_{n=1}^{N} \frac{\mu_{E,n}}{E_{\max,n}} \qquad (7.53)$$

P2.17: Relation between the maximum energy value of the whole utterance and its

position.

$$\frac{E_{\max}}{t_{E\max}} \qquad (7.54)$$

P2.17: Relation between the maximum of the voiced region and the maximum of the

utterance, divided by the position of the voiced region maximum and multiplied

by 100. Arithmetic mean of this calculation for all the voiced regions in the utterance.

$$\frac{100}{N} \sum_{n=1}^{N} \frac{E_{\max,n} / E_{\max}}{t_{E\max,n}} \qquad (7.55)$$

P2.18: Mean of the energy tremor in every voiced region.

Tremor refers to a regular variation in the signal and is computed as the number of zero-crossings of the derivative of the energy curve over a window.

$$ETremor = \frac{1}{N} \sum_{n=1}^{N} n\_cross\bigl(E'\bigr)_n \qquad (7.56)$$
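Under the reading of (7.56) given above, the tremor of one voiced region could be computed as the number of sign changes of the first difference of the energy curve; a sketch:

```c
/* Counts the zero-crossings of the first difference (discrete derivative)
 * of the energy curve E[0..n-1] of one voiced region (P2.18); n >= 2. */
int energy_tremor(const double *E, int n)
{
    int count = 0;
    double prev = E[1] - E[0];

    for (int i = 2; i < n; i++) {
        double d = E[i] - E[i-1];
        if ((prev < 0.0) != (d < 0.0))
            count++;                          /* derivative changed sign */
        prev = d;
    }
    return count;
}
```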

7.3 Quality Features

The classification of emotions using voice quality features is a brand new field of investigation, referred to in many recent studies concerned with emotion recognition (s. [Joh99, Alt00]). Since this approach faces different obstacles, due to the difficulty of estimating this kind of attributes, diverse sets of features and


methods were tried during this Diploma Thesis. Some of the described features have been used in just a few experiments and others are employed more frequently, but all of them are introduced here.

The software employed for quality feature extraction is PRAAT1, a shareware program developed by Dr. Paul Boersma of the University of Amsterdam.

This section makes use of two different methods for the calculation of the mean value

of a given parameter within a voiced region:

- Mean1: Arithmetic mean of the parameter values over all the frames inside a

voiced region.

$$Mean1_n = \frac{1}{nframes} \sum_{i=1}^{nframes} f_i \qquad (7.57)$$

Where

nframes = number of frames inside a voiced region.

fi = feature value in the frame i.

- Mean2: First, the Mean1 of the parameter within a voiced region is computed. Then, the single values of this parameter for every frame are checked, and the one closest to the computed Mean1 is considered the mean (Mean2) of this region. This way, we assume that this value comes from the most representative part inside the voiced region, since the plain mean is also influenced by the voiced region boundaries. It was experimentally checked that the chosen frame normally matches the core of the vowel.

$$Mean2_n = f_k, \qquad k = \arg\min_{1 \le i \le nframes} \bigl|f_i - Mean1_n\bigr| \qquad (7.58)$$

Where

nframes = number of frames inside a voiced region

n = index of the region

fi = feature value in the frame i

Mean1_n = Mean1 of the feature in region n

1 Further information can be found under www.praat.org.


From now on, they are referred to as Mean1 and Mean2 in the subsequent feature

calculation description.
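A sketch of the two averaging criteria in C; Mean2 returns the frame value closest to Mean1, as in (7.58):

```c
#include <math.h>

/* Mean1: arithmetic mean of the feature values f[0..nframes-1]
 * of one voiced region, eq. (7.57). */
double mean1(const double *f, int nframes)
{
    double sum = 0.0;
    for (int i = 0; i < nframes; i++)
        sum += f[i];
    return sum / nframes;
}

/* Mean2: the single frame value closest to Mean1, eq. (7.58). */
double mean2(const double *f, int nframes)
{
    double m1 = mean1(f, nframes);
    double best = f[0];

    for (int i = 1; i < nframes; i++)
        if (fabs(f[i] - m1) < fabs(best - m1))
            best = f[i];
    return best;
}
```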

7.3.1 Harmonicity based features.

Since the harmonic to noise ratio is clearly related to voice quality (see Chapter 3), this voice quality attribute has been said to provide valuable information about the speaker's emotional state (s. [Alt99, Alt00]). The harmonic to noise ratio estimation can be considered an acoustic correlate of breathiness and roughness, in agreement with [Alt00]. Therefore, voice quality cues, which help us to infer assumptions about the speaker's emotional state, can be extracted from this attribute.

For its calculation, as well as for the remaining voice quality features, the Praat program is utilised. Harmonicity is here expressed in dB: if 99% of the energy of the signal is in the periodic part, and 1% is noise, the HNR is 10·log10(99/1) = 20 dB. A HNR of 0 dB

means that there is equal energy in the harmonics and in the noise. The algorithm

performs acoustic periodicity detection on the basis of an accurate autocorrelation

method, as described in [Boe93]. Harmonicity values are given for individual frames and

the concrete features are calculated to be employed as classification features. The Praat

program requires four different parameters to calculate the harmonicity:

1. Time step (default: 0.01 s): the measurement interval (frame duration), in seconds.

2. Minimum pitch (default: 75 Hz): determines the length of the analysis window.

3. Silence threshold (default: 0.1): frames that do not contain amplitudes above this

threshold (relative to the global maximum amplitude), are considered silent.

4. Number of periods per window (default: 1): determines the level up to which the HNR is guaranteed to be detected. More periods per window raise the detection ceiling, but the algorithm becomes more sensitive to dynamic changes in the signal.

QH.0a: Harmonic to noise ratio maximum. Mean2. Default values.

Maximum of the Mean2 values for all the regions in the sentence when the

harmonicity is computed setting all the parameters in Praat to their default value.


$$h_{2,\max} = \max_{1 \le n \le N} h_{n,2} \qquad (7.59)$$

Where

h_{n,2} = harmonic to noise ratio Mean2 value in region n

N = number of regions inside a sentence

QH.0b: Harmonic to noise ratio maximum. Mean2. 4.5 periods per window.

Maximum of the Mean2 values for all the regions in the sentence when the

harmonicity is computed setting the number of periods per window to 4.5, which is

considered an optimal value for speech: HNR values up to 37 dB are guaranteed to be

detected reliably. When the number of periods per window increases, the minimum pitch parameter also has to be changed and, following the recommendations of the Praat software, it is set to:

$$\text{minimum pitch} = \frac{4.5}{\text{length}} \qquad (7.60)$$

Where

length = length of the speech segment where the harmonicity is computed.

This feature follows the same formula (7.59) but taking into account the new values

of the harmonicity.

QH.0c: Harmonic to noise ratio maximum. Mean1. 4.5 periods per window.

This feature is identical to QH.0b with the exception that it uses the Mean1 instead of the Mean2 procedure to calculate the mean value of the HNR in the analysed region. It

follows therefore also equation (7.59) with the new values of harmonicity, by just

substituting the term Mean2 for its analogous Mean1.

QH.0d: Harmonic to noise ratio maximum within a voiced region.

Each frame inside a voiced region contains a value of the HNR. The maximum of

these values within the given region is the feature QH.0d.

QH.1a: Harmonic to noise ratio range. Mean2. Default Values.


Once the Mean2 values are calculated for every single voiced region of the sentence,

the difference between the maximum and the minimum of these values in a sentence is the

feature QH.1a. When there is one unique region, this value becomes zero.

$$h_{2,range} = h_{2,\max} - h_{2,\min} \qquad (7.61)$$

Where

h_{2,max} = harmonic to noise ratio maximum value in the sentence.

h_{2,min} = harmonic to noise ratio minimum value in the sentence.

QH.1b: Harmonic to noise ratio range. Mean2. 4.5 periods per window.

Once the Mean2 values are calculated for every single voiced region of the sentence,

the difference between the maximum and the minimum of these values in a sentence is the

feature QH.1b. When there is one unique region, this value becomes zero. The only

difference with QH.1a is that the parameter number of periods per window is set to 4.5

and, consequently, the minimum pitch comes from equation 7.60 (see QH.0b).

QH.1c: Harmonic to noise ratio range. Mean1. 4.5 periods per window.

Once the Mean1 values are calculated for every single voiced region of the sentence,

the difference between the maximum and the minimum of these values in a sentence is the feature QH.1c. This feature is identical to QH.1b except for the criterion used to calculate the mean value in the analysed region; the Mean1 criterion is employed as a replacement for Mean2.

QH.2: Harmonic to noise ratio mean. Mean1. Default settings.

Arithmetic mean (Mean1) of all the HNR values calculated by frame within a voiced

region.

QH.3: Harmonic to noise ratio standard deviation within a voiced region. Default

settings.

Standard deviation of the HNR values within a voiced region.

7.3.2 Formant frequency based features.


The algorithm followed by Praat first resamples the signal to a sample rate of twice the value of the Maximum formant frequency parameter (approx. 5000 Hz). After this, pre-emphasis is applied. The pre-emphasis factor is computed as a = exp(-2π·F·Δt), where F is the pre-emphasis frequency (50 Hz by default) and Δt is the sampling period of the sound. Each sample xi of the sound except x1 is then changed, going down from the last sample: xi ← xi - a·xi-1.

For each analysis window, Praat applies a Gaussian-like window and computes the LPC coefficients with the Burg algorithm. The Burg algorithm is a recursive estimator for auto-regressive models, where each step is estimated using the results of the previous step. The implementation of the Burg algorithm is based on the routines memcof and zroots in [Pre93]. This algorithm can initially find formants at very low or very high frequencies. From the values obtained for each frame, several measures are computed to be used as input for the emotional classification.
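For illustration, the following is a generic sketch of the Burg recursion for a single windowed frame; it is not the memcof/zroots implementation cited above, and the model order is left to the caller:

```python
import numpy as np

def burg_lpc(frame: np.ndarray, order: int) -> np.ndarray:
    """Generic Burg recursion: estimate the AR polynomial
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order of a windowed frame
    by minimising the forward plus backward prediction error at each
    step, reusing the previous step's results."""
    f = frame.astype(float).copy()   # forward prediction errors
    b = f.copy()                     # backward prediction errors
    a = np.zeros(order + 1)
    a[0] = 1.0
    for m in range(1, order + 1):
        fm, bm = f[m:], b[m - 1:-1]
        denom = np.dot(fm, fm) + np.dot(bm, bm)
        k = -2.0 * np.dot(fm, bm) / denom          # reflection coefficient
        a[1:m + 1] = a[1:m + 1] + k * a[m - 1::-1] # Levinson-type update
        f[m:], b[m:] = fm + k * bm, bm + k * fm    # update error sequences
    return a
```

The roots of the resulting AR polynomial give candidate formant frequencies and bandwidths (from the angles and radii of the complex roots), which is the role played by the zroots routine in the cited implementation.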

QF.0a: Minimum of f2Mean2 – f1Mean2 for all the voiced regions.

Difference between the Mean2 of the second and the first formant frequency for each

voiced region in a sentence. The minimum value of this difference among all the voiced

regions is taken as QF.0a.

This feature is used in some cases to select just one region and make it representative of the sentence. This way, features are calculated over similar regions and their differences will be more influenced by changes in the speaker's emotional state than by the nature of the vowel. The choice of the minimum difference between first and second formant is based on the formant structure of an /a/, in which the first and second formant frequencies are closest; this vowel is appropriate for extracting quality features due to the shape of the vocal tract when it is uttered.

\[ f_{21} = \min_{1 \le n \le N} \left( f_{2,n} - f_{1,n} \right) \qquad (7.62) \]

Where

$f_{2,n}$ = Mean2 of the second formant frequency in the voiced region $n$.

$f_{1,n}$ = Mean2 of the first formant frequency in the voiced region $n$.

$N$ = number of voiced regions within the utterance.
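A sketch of this region selection, assuming the per-region Mean2 formant frequencies are already available as arrays (variable names hypothetical):

```python
import numpy as np

def select_region(f1_mean2: np.ndarray, f2_mean2: np.ndarray) -> int:
    """Return the index of the voiced region with the smallest f2 - f1
    distance (equation 7.62), i.e. the most /a/-like region from which
    the subsequent QF.* features are computed."""
    return int(np.argmin(f2_mean2 - f1_mean2))

# Hypothetical per-region Mean2 values in Hz:
f1 = np.array([310.0, 700.0, 450.0])
f2 = np.array([2200.0, 1150.0, 1700.0])
print(select_region(f1, f2))  # -> 1 (the region closest to an /a/)
```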


QF.0b: Minimum of (f2–f1)Mean1 for all the voiced regions.

Mean1 of the difference between all the values of the second and the first formant

frequency within each voiced region in a sentence. The minimum value of this difference

among all the voiced regions is taken as QF.0b. This feature is obviously equivalent to QF.0a, simply substituting Mean2 with Mean1 (for the arithmetic mean, the mean of the per-frame differences equals the difference of the means); the description given here, however, follows exactly the process as implemented in the software. The same equation (7.62) applies, changing the Mean2 terms to Mean1.

QF.1a: First formant frequency. Mean2.

Frequency of the first formant in the region from which QF.0a is extracted, calculated as the Mean2 within the voiced region.

QF.1b: First formant frequency. Mean1.

Frequency of the first formant in the region from which QF.0b is extracted, calculated as the Mean1 within the voiced region.

QF.2a: Second formant frequency. Mean2.

Frequency of the second formant in the region selected by QF.0a, calculated as the

Mean2 within the voiced region.

QF.2b: Second formant frequency. Mean1.

Frequency of the second formant in the region selected by QF.0b, calculated as the

Mean1 within the voiced region.

QF.3a: Third formant frequency. Mean2.

Frequency of the third formant in the region selected by QF.0a, calculated as the

Mean2 within the voiced region.

QF.3b: Third formant frequency. Mean1.


Frequency of the third formant in the region selected by QF.0b, calculated as the

Mean1 within the voiced region.

QF.4a: Second formant ratio. Mean2.

Frequency of the second formant (QF.2a) divided by the difference between second

and first formants (QF.0a). All the formants are calculated through the Mean2 and belong

to the selected region (see QF.0a).

\[ \mathrm{QF.4a} = \frac{\mathrm{QF.2a}}{\mathrm{QF.0a}} = \frac{f_{2}}{f_{2} - f_{1}} \qquad (7.63) \]

QF.4b: Second formant ratio. Mean1.

Frequency of the second formant (QF.2b) divided by the difference between second

and first formants (QF.0b). All the formants are calculated through the Mean1 and belong

to the selected region (see QF.0b).

\[ \mathrm{QF.4b} = \frac{\mathrm{QF.2b}}{\mathrm{QF.0b}} = \frac{f_{2}}{f_{2} - f_{1}} \qquad (7.64) \]

QF.5: Maximum of the second formant ratio.

The maximum value of the second formant ratio calculated by frame within the region

selected by QF.0b.

\[ r_{2,\max} = \max_{1 \le i \le \mathit{nframes}} \frac{f_{2,i}}{f_{2,i} - f_{1,i}} \qquad (7.65) \]

Where

$f_{1,i}$ = value of the first formant frequency in the frame $i$.

$f_{2,i}$ = value of the second formant frequency in the frame $i$.

$\mathit{nframes}$ = number of frames within the voiced region selected by QF.0b.

QF.6: Range of the second formant ratio.

Difference between the maximum and the minimum of the second formant ratio calculated

by frame within the region selected by QF.0b.

\[ r_{2,\mathrm{range}} = \max_{1 \le i \le \mathit{nframes}} \frac{f_{2,i}}{f_{2,i} - f_{1,i}} \;-\; \min_{1 \le i \le \mathit{nframes}} \frac{f_{2,i}}{f_{2,i} - f_{1,i}} \qquad (7.66) \]


Where

$f_{1,i}$ = value of the first formant frequency in the frame $i$.

$f_{2,i}$ = value of the second formant frequency in the frame $i$.

$i$ = frame index.

$\mathit{nframes}$ = number of frames within the voiced region selected by QF.0b.
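A sketch of the per-frame ratio statistics of equations (7.65) and (7.66), assuming the frame-level formant tracks of the selected region are given (variable names hypothetical):

```python
import numpy as np

def second_formant_ratio_stats(f1: np.ndarray, f2: np.ndarray):
    """Per-frame second formant ratio f2 / (f2 - f1); returns its
    maximum (QF.5) and its max-minus-min range (QF.6) over the region."""
    ratio = f2 / (f2 - f1)
    return float(ratio.max()), float(ratio.max() - ratio.min())

# Hypothetical frame-level formant tracks (Hz) of the selected region:
f1 = np.array([690.0, 710.0, 700.0])
f2 = np.array([1180.0, 1150.0, 1160.0])
qf5, qf6 = second_formant_ratio_stats(f1, f2)
```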

QF.7a: Bandwidth of the first formant. Mean2.

Mean of all the Mean2 first formant bandwidth values in a sentence.

\[ \bar{b}_{1} = \frac{1}{N} \sum_{n=1}^{N} b_{1,n} \qquad (7.67) \]

Where

$b_{1,n}$ = first formant bandwidth Mean2 in region $n$.

$N$ = number of regions inside a sentence.

QF.7b: Bandwidth of the first formant. Mean1.

Mean of all the Mean1 first formant bandwidth values in a sentence. Substituting

Mean2 by Mean1, equation 7.67 is employed.

QF.7c: Bandwidth mean of the first formant within a region. Mean1.

Arithmetic mean (Mean1) of all the first formant bandwidth values calculated by

frame, inside a voiced region.

QF.8a: Bandwidth of the second formant. Mean2.

Mean of all the Mean2 second formant bandwidth values in a sentence.

\[ \bar{b}_{2} = \frac{1}{N} \sum_{n=1}^{N} b_{2,n} \qquad (7.68) \]

Where

$b_{2,n}$ = second formant bandwidth Mean2 in region $n$.

$N$ = number of regions inside a sentence.


QF.8b: Bandwidth of the second formant. Mean1.

Mean of all the Mean1 second formant bandwidth values in a sentence. Substituting

Mean2 by Mean1, equation 7.68 is employed.

QF.8c: Bandwidth mean of the second formant within a region. Mean1.

Arithmetic mean (Mean1) of all the second formant bandwidth values calculated by

frame, inside a voiced region.

QF.9a: Bandwidth of the third formant. Mean2.

Mean of all the Mean2 third formant bandwidth values in a sentence.

\[ \bar{b}_{3} = \frac{1}{N} \sum_{n=1}^{N} b_{3,n} \qquad (7.69) \]

Where

$b_{3,n}$ = third formant bandwidth Mean2 in region $n$.

$N$ = number of regions inside a sentence.

QF.9b: Bandwidth of the third formant. Mean1.

Mean of all the Mean1 third formant bandwidth values in a sentence. Substituting

Mean2 by Mean1, equation 7.69 is employed.

QF.9c: Bandwidth mean of the third formant within a region. Mean1.

Arithmetic mean (Mean1) of all the third formant bandwidth values calculated by

frame, inside a voiced region.

QF.10: Maximum of the first formant frequency in the selected region.

Maximum value of the first formant frequency in the region selected by QF.0b.

\[ f_{1,\max} = \max_{1 \le i \le \mathit{nframes}} f_{1,i} \qquad (7.70) \]

Where

$f_{1,i}$ = value of the first formant frequency in the frame $i$.

$\mathit{nframes}$ = number of frames within the selected region.


QF.11: Maximum of the second formant frequency in the selected region.

Maximum value of the second formant frequency in the region selected by QF.0b.

Same equation (7.70) for the second formant frequency case.

QF.12: Maximum of the third formant frequency in the selected region.

Maximum value of the third formant frequency in the region selected by QF.0b.

Same equation (7.70) for the third formant frequency case.

QF.13: Range of the first formant frequency in the selected region.

Difference between the maximum and the minimum of the first formant frequency for

the region selected by QF.0b.

\[ f_{1,\mathrm{range}} = \max_{1 \le i \le \mathit{nframes}} f_{1,i} \;-\; \min_{1 \le i \le \mathit{nframes}} f_{1,i} \qquad (7.71) \]

Where

$f_{1,i}$ = value of the first formant frequency in the frame $i$.

$\mathit{nframes}$ = number of frames within the voiced region selected by QF.0b.

QF.14: Range of the second formant frequency in the selected region.

Difference between the maximum and the minimum of the second formant frequency

for the region selected by QF.0b. Same equation (7.71) for the second formant frequency

case.

QF.15: Range of the third formant frequency in the selected region.

Difference between the maximum and the minimum of the third formant frequency

for the region selected by QF.0b. Same equation (7.71) for the third formant frequency

case.

QF.16: Standard deviation of the first formant frequency in the selected region.


Standard deviation of all the first formant frequency values within the region selected

by QF.0b.

QF.17: Standard deviation of the second formant frequency in the selected region.

Standard deviation of all the second formant frequency values within the region

selected by QF.0b.

QF.18: Standard deviation of the third formant frequency in the selected region.

Standard deviation of all the third formant frequency values within the region selected

by QF.0b.

7.3.3 Energy based features.

QE.0 – QE.3: Energy band distribution.

The energy is calculated within four different frequency bands in order to decide whether the band contains mainly harmonics of the fundamental frequency or turbulent noise. The frequency band distribution is taken from a study [Kla97] focused on the

perceptual importance of several voice quality parameters. The four frequency bands

proposed are:

1. From 0 Hz to F0 Hz (where F0 is the fundamental frequency).

2. From 0 Hz to 1 kHz.

3. From 2.5 kHz to 3.5 kHz.

4. From 4 kHz to 5 kHz.

From each band, the following features are calculated:

QE.0a – QE.3a: The energy contained in the corresponding band is calculated for all the voiced parts of the utterance. Then, these values are divided by the energy over all frequencies of the voiced parts of the utterance.

\[ \mathrm{EneBand}_{j} = \frac{\displaystyle \sum_{n=1}^{N} \mathit{ene}_{j,n}}{\displaystyle \sum_{n=1}^{N} \mathit{ene}_{n}}, \qquad j = 1, 2, 3, 4 \qquad (7.72) \]


Where

$\mathit{ene}_{j,n}$ = energy contained in band $j$ within voiced region $n$.

$\mathit{ene}_{n}$ = energy over all frequencies within voiced region $n$.

$j$ = index corresponding to each one of the energy bands (1, 2, 3 or 4).

$N$ = number of voiced regions within the utterance.
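A sketch of the band-energy computation behind QE.0a – QE.3a, assuming each voiced region is available as a waveform array with sample rate fs and a per-region F0 estimate (all variable names hypothetical):

```python
import numpy as np

def band_energy(region: np.ndarray, fs: float, f_lo: float, f_hi: float) -> float:
    """Energy of the region's spectrum between f_lo and f_hi (Hz)."""
    spec = np.fft.rfft(region)
    freqs = np.fft.rfftfreq(len(region), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return float(np.sum(np.abs(spec[mask]) ** 2))

def qe_a_features(regions, fs, f0s):
    """QE.0a-QE.3a: summed band energies over all voiced regions,
    normalised by the summed full-band energy (equation 7.72)."""
    bands = lambda f0: [(0, f0), (0, 1000), (2500, 3500), (4000, 5000)]
    num, den = np.zeros(4), 0.0
    for region, f0 in zip(regions, f0s):
        for j, (lo, hi) in enumerate(bands(f0)):
            num[j] += band_energy(region, fs, lo, hi)
        den += band_energy(region, fs, 0, fs / 2)
    return num / den
```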

QE.0b – QE.3b: The energy values are calculated only in one region. Energy in each

band is divided by the energy over all frequencies within the given region.

\[ \mathrm{EneBand}_{j,n} = \frac{\mathit{ene}_{j,n}}{\mathit{ene}_{n}}, \qquad j = 1, 2, 3, 4 \qquad (7.73) \]

Where

j = index corresponding to each one of the energy bands (1, 2, 3 or 4)

n = index of the region.

QE.4: Voiced energy ratio, sentence based.

Ratio of the energy contained in the voiced regions to the energy of the whole utterance.

\[ \mathrm{EneRate} = \frac{\displaystyle \sum_{n=1}^{N} \mathit{ene}_{n}}{\mathrm{AbsEne}} \qquad (7.74) \]

Where

$\mathit{ene}_{n}$ = energy contained in voiced region $n$.

AbsEne = energy contained in the whole utterance.

$N$ = number of voiced regions within the utterance.

QE.5: Relative energy of one voiced region.

Energy of one voiced region divided by the energy of the whole utterance.

\[ \mathrm{EneRate}_{n} = \frac{\mathit{ene}_{n}}{\mathrm{AbsEne}} \qquad (7.75) \]

Where

$n$ = index corresponding to one voiced region.

$\mathit{ene}_{n}$ = energy contained in voiced region $n$.

AbsEne = energy contained in the whole utterance.
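A sketch of both ratios, assuming the utterance waveform and the (start, end) sample indices of the voiced regions are given:

```python
import numpy as np

def energy(x: np.ndarray) -> float:
    return float(np.sum(x.astype(float) ** 2))

def voiced_energy_ratios(utterance: np.ndarray, voiced_spans):
    """QE.4: summed voiced energy over total energy (eq. 7.74);
    QE.5 per region: region energy over total energy (eq. 7.75)."""
    abs_ene = energy(utterance)
    region_ene = [energy(utterance[s:e]) for s, e in voiced_spans]
    qe4 = sum(region_ene) / abs_ene
    qe5 = [ene / abs_ene for ene in region_ene]
    return qe4, qe5
```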

7.3.4 Spectral measurements.


The algorithm used by Praat to calculate the spectrum is the continuous interpretation of the Fast Fourier Transform (see [Bra65, Wea89, Lat92]). If the sound is expressed in pascals (Pa), the spectrum is expressed in Pa·s, or Pa/Hz. The frequency integral over the spectrum equals the time integral over the sound.

For some features concerning spectral measurements, inverse filtering of the speech

signal is performed. Inverse filtering can be seen as the inverse computation of the speech

production model depicted in figure 7.7. Praat obtains the filter with the help of the

technique of linear prediction. This technique tries to approximate a given frequency

spectrum with a small number of peaks, for which it finds the mid-frequencies and the

bandwidths. Doing this for an overlapping sequence of windowed parts of a sound signal

(i.e. a short-term analysis), we get a quasi-stationary approximation of the signal's

spectral characteristics as a function of time. For a speech signal, the peaks are identified

with the resonances (formants) of the vocal tract. Since the spectrum of a vowel spoken

by an average human being falls off with approximately 6 dB per octave, pre-emphasis is

applied to the signal before the linear-prediction analysis, so that the algorithm will not

try to match only the lower parts of the spectrum.

Figure 7.7. Mathematical model of speech production.


For an average human voice, tradition assumes five formants in the range between 0

and 5500 Hertz. This number comes from a computation of the formants of a straight

tube, which has resonances at wavelengths of four tube lengths, four thirds of a tube

length, four fifths, and so on. For a straight tube 16 centimetres long, the shortest

wavelength is 64 cm, which, with a sound velocity of 352 m/s, means a resonance

frequency of 352/0.64 = 550 Hertz. The other resonances will be at 1650, 2750, 3850, and

4950 Hertz. For the linear prediction in Praat, this have to implement this 5500-Hz band

limiting by resampling the original speech signal to 11 kHz. Then, a linear-prediction

analysis on the resampled sound is performed. Analysis is done with 16 linear-prediction

parameters (which will yield at most eight formant-bandwidth pairs in each time frame),

with an analysis window effectively 10 milliseconds long, with time steps of 5

milliseconds (so that the windows will overlap), and with a pre-emphasis frequency of 50

Hz (which is the point above which the sound will be amplified by 6 dB/octave prior to

the analysis proper). This analysis will provide the filter (figure 7.9), which applied to the

original speech sample (figure 7.8a), yields the source signal (figure 7.8b). Since the LPC

analysis was designed to yield a spectrally flat filter (through the use of pre-emphasis),

the source signal will represent everything in the speech signal that cannot be attributed to


Figure 7.8. Spectrum of the /a/ vowel uttered in the sentence “Kick den Ball”, extracted from the speaker A commands database. Panel (a) shows the original spectrum of the uttered vowel, whereas panel (b) shows the source of the sound obtained after inverse filtering.


the resonating cavities. Thus, the "source signal" will consist of the glottal volume-

velocity source (with an expected spectral slope of -12 dB/octave for vowels) and the

radiation characteristics at the lips, which cause a 6 dB/octave spectral rise, so that the

resulting spectrum of the "source signal" is actually the derivative of the glottal flow, with

an expected spectral slope of -6 dB/octave.
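As a sketch of the chain just described (pre-emphasis, LPC, inverse filtering), using the generic burg_lpc routine sketched in section 7.3.2 and scipy; this treats one quasi-stationary segment rather than Praat's overlapping short-term analysis, and assumes the signal has already been resampled to 11 kHz:

```python
import numpy as np
from scipy.signal import lfilter

def inverse_filter(speech_11k: np.ndarray, fs: float = 11000.0) -> np.ndarray:
    """Estimate the vocal tract filter by LPC on the pre-emphasised
    signal and apply its inverse to recover the source signal."""
    # Pre-emphasis above 50 Hz: x[i] -= a * x[i-1], a = exp(-2*pi*F*dt).
    a_pre = np.exp(-2.0 * np.pi * 50.0 / fs)
    emphasised = np.append(speech_11k[0], speech_11k[1:] - a_pre * speech_11k[:-1])
    # 16 LPC coefficients (at most eight formant-bandwidth pairs).
    a = burg_lpc(emphasised, order=16)
    # A(z) is the inverse of the all-pole vocal tract model, so
    # filtering the speech with A(z) as an FIR filter yields the source.
    return lfilter(a, [1.0], speech_11k)
```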

QS.0: Open quotient related features.

Open quotient is a spectral measurement whose variations have been associated with changes in the glottal source quality. Therefore, in line with the ideas presented in Chapter 3, it could be a useful parameter for determining the emotional state of the speaker. Following the hypothesis that the amplitude difference between the first and second harmonics of the inverse-filtered voice signal (H1*-H2*) is a reliable spectral indicator of the relative length of the opening phase, and therefore a spectral correlate of the open quotient (see [Dov97, Hen01]), two open quotient related features, with and without inverse filtering, are computed.

QS.0a: Difference between first and second harmonic amplitudes of the spectrum of the

speech signal, within the selected region.


Figure 7.9. Filter of the vocal tract when the /a/ vowel is uttered in the sentence “Kick den Ball”, extracted from the speaker A commands database. The filter is obtained through LPC analysis.



QS.0b: Difference between first and second harmonic amplitudes of the spectrum of the

speech signal after inverse filtering, within the selected region.
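A sketch of the H1-H2 measurement underlying QS.0a/QS.0b, assuming the region's F0 is known and taking the strongest spectral peak near each harmonic (the F0/4 search half-width is an illustrative choice, not a value taken from the text):

```python
import numpy as np

def harmonic_amplitude(x: np.ndarray, fs: float, f_target: float) -> float:
    """Peak spectral amplitude (dB) near f_target in the signal x."""
    spec_db = 20.0 * np.log10(np.abs(np.fft.rfft(x)) + 1e-12)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = np.abs(freqs - f_target) <= f_target / 4  # illustrative half-width
    return float(spec_db[mask].max())

def h1_minus_h2(region: np.ndarray, fs: float, f0: float) -> float:
    """QS.0a on the speech signal; QS.0b would pass the inverse-filtered
    source signal instead of the raw region."""
    return harmonic_amplitude(region, fs, f0) - harmonic_amplitude(region, fs, 2 * f0)
```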

QS.1 Spectral Tilt related features.

Spectral tilt has also been related to glottal source variations. It is one of the major acoustic parameters that reliably differentiate phonation types in many languages, and it can be understood as the degree to which intensity drops off as frequency increases.

Spectral tilt can be quantified when comparing the amplitude of the fundamental to that of

higher frequency harmonics, e.g. the second harmonic, the harmonic closest to the first

formant, or the harmonic closest to the second formant. Spectral tilt is characteristically

most steeply positive for creaky vowels and most steeply negative for breathy vowels.

The amplitude of the first harmonic (H1) compared to the amplitude of the second

formant (A2), which acts as an indicator of the spectral tilt at the mid formant

frequencies, is here used as a voice quality feature for emotion classification. The

parameter is expected to be large and positive for breathy voices and small and/or

negative for creaky voices.

QS.1a: Difference between the first harmonic amplitude and the spectral amplitude at the

second formant frequency. Calculated over the spectrum of the speech signal in the

selected region.

QS.1b: Difference between the first harmonic amplitude and the spectral amplitude at the second formant frequency. Calculated over the spectrum of the speech source obtained by means of inverse filtering of the original speech segment.
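The H1-A2 variant can reuse the harmonic_amplitude helper from the QS.0 sketch, replacing the second harmonic by the spectral amplitude at the second formant frequency (f2 here is the region's measured second formant, a hypothetical input):

```python
def h1_minus_a2(region, fs: float, f0: float, f2: float) -> float:
    """QS.1a on the speech signal; QS.1b would pass the inverse-filtered
    source instead. f2 is the second formant frequency of the region."""
    return harmonic_amplitude(region, fs, f0) - harmonic_amplitude(region, fs, f2)
```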
