Thesis 1

INTRODUCTION

1.1 Overview

Recently there has been growing interest to improve Human-computer interaction (HCI)

means computers should interact to the humans in day to day life .In this context recognizing

people emotional state and giving suitable feedback may play a crucial role. As a

consequence, emotion recognition represents a hot research area in both industry and

academic field. Usually emotion recognition based on facial or voice features. This proposes

a solution, designed to be employed in a smart phone Environment able to capture emotional

state of a person starting from registration of speech signals in the surrounding obtained by

mobile devices such as smartphones.

This system presents the implementation of a voice-based emotion detection system which is

suitable to recognize four emotions (anger, sadness, joy and neutral) as widely used for

emotion recognition .The classification task for speech signals is done by using Support

Vector Machine (SVM) approach. The main contributions of this is concern: i) a system able

to recognize people emotions composed of two sub-systems, Gender Recognition and

Emotion Recognition . Gender recognition algorithm, based on pitch extraction, and aimed at

providing a priori information about the gender of the speaker; SVM-based emotion

classifier, which employs the gender information as input.

In order to train and test the mentioned SVM-based emotion classifier, a widely used

emotional database called (polish emotional database ED) has been employed. The overall

system reliability on the database adopted for training and testing phases: the use of a

simulated database (i.e., a collection of emotion vocal expressions played by actors) allows

obtaining a higher level of correctly identified emotions.

.

M.Tech thesis submitted to Jawaharlal Nehru Technological University, Hyderabad-2015 Page 1

1.2 Proposed method

The proposed method based on the employment of audio signals consists of four principal parts which are elaborated bellow:

• Feature Extraction: it involves the elaboration of the speech signal in order to obtain a certain number of variables, called features, useful for speech emotion recognition.

• Feature Selection: it selects the more appropriate features in order to reduce the computational load and the time required to recognize an emotion.

• Database: it is the memory of the classifier; it contains sentences divided according to the emotions to be recognized.

• Classification: it assigns a label representing the recognized emotion by using the features selected by the Feature Selection block and the sentences in the Database.

1.3 Objectives

The objectives of the project are illustrated below.

1. For the speech signal we have to do framing and windowing first.

2. By using auto correlation method we have to extract the pitch values from

speech signal.

3. By taking average of pitch values for different samples of male and female is

taken.

4. A threshold level of pitch values is considered for separation of male and

female in gender recognition processes.

5. The principal emotional features for speech signal are formats; MFCC and centre of gravity are extracted.

6. A SVM classifier is used for the train with different speeches in different

emotions.

7. Finally the SVM classifies the required emotion of the speech signal in the

testing phase.

.


1.4 Block diagram

Figure 1.1: Block diagram

1.5 Procedure:

Initially, the speech signal is taken and is passed through front end block which converts the continuous time speech signal in to discrete time signal with a rate of 16 kHz is done. After that it is given to feature extraction block in which the features are extracted in which pitch can be find by using Autocorrelation method is used. After finding pitch values a threshold level is considered for the pitch values versus frames in speech sample is considered in order to find Gender recognition. After that from speech sample the formats can be estimated from LPC coefficients along with MFCC coefficients and centre of gravity for the speech spectrum is considered. All this features along with gender recognition output is given to SVM. Thus the SVM act as a classifier in recognition of emotion of the speech sample. Thus SVM need a database of Polish emotional database is required in order to training the sentences in different emotions at the testing phase the SVM classifies the emotion by using optimization function.


Pitch estimation

2.1 Introduction

Pitch is an important feature of audio signals, especially for quasi-periodic signals such as voiced sounds from human speech/singing and monophonic music from most music instruments. Intuitively speaking, pitch represent the vibration frequency of the sound source of audio signals. In other words, pitch is the fundamental frequency of audio signals, which is equal to the reciprocal of the fundamental period. Thus the speech signal exhibits relative periodicity of its fundamental frequency called pitch.

Conceptually, the most obvious sample point within a fundamental period is often referred to as the pitch mark. Usually pitch marks are selected as the local maxima or minima of the audio waveform.

Pitch detection algorithms can be divided into methods which operate in the time domain, frequency domain, or both.

One group of pitch detection methods uses the detection and timing of sometime domain feature. Other time domain methods use autocorrelation functions or difference norms to detect similarity between the waveform and a time lagged version of itself.

Another family of methods operates in the frequency domain, locating sinusoidal peaks in the frequency transform of the input signal. Other methods use combinations of time and frequency domain techniques to detect pitch.

Frequency domain methods call for the signal to be frequency transformed, then the frequency domain rep presentation is inspected for the first harmonic, the greatest common divisor of all harmonics, or other such indications of the period.

Windowing of the signal is recommended to avoid spectral smearing, and depending on the type of window, a minimum number of periods of the signal must be analyzed to enable accurate location of harmonic peaks .

Various linear pre-processing steps can be used to make the process of locating frequency domain features easier, such as performing linear prediction on the signal and using the residual signal for pitch detection. Performing nonlinear operations such as peak limiting also simplifies the location of harmonics.

The best method used for pitch estimation used in this project is Autocorrelation method which is time domain is used .


2.2 Autocorrelation method of Pitch estimation:

The correlation between two waveforms is a measure of their similarity. The waveforms are compared at different time intervals, and their “sameness” is calculated at each interval. The result of a correlation is a measure of similarity as a function of time lag between the beginnings of the two waveforms. The autocorrelation function is the correlation of a waveform with itself. One would expect exact similarity at a time lag of zero, with increasing dissimilarity as the time lag increases.

The mathematical definition of the autocorrelation function is shown in figure

Figure 2.1: auto correlation of pitch estimation

where τ is the time lag in terms of sample points. The value of τ that maximizes acf(τ) over a specified range is selected as the pitch period in sample points.

Periodic waveforms exhibit an interesting autocorrelation characteristic: the autocorrelation function itself is periodic. As the time lag increases to half of the period of the waveform, the correlation decreases to a minimum. This is because the waveform is out of phase with its time-delayed copy. As the time lag increases again to the length of one period, the autocorrelation again increases back to a maximum, because the waveform and its time-


delayed copy are in phase. The first peak in the autocorrelation indicates the period of the waveform.

Formats estimation

3.1 Introduction Estimation of formant frequencies is generally more difficult than estimation of fundamental frequency. The problem is that formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal rather than just measured. The spectral shape of the vocal tract excitation strongly influences the observed spectral envelope, such that we cannot guarantee that all vocal tract resonances will cause peaks in the observed spectral envelope, nor that all peaks in the spectral envelope are caused by vocal tract resonances.

The dominant method of formant frequency estimation is based on modelling the speech signal as if it were generated by a particular kind of source and filter:

This type of analysis is called source-filter separation, and in the case of formant frequency estimation, we are interested only in the modelled system and the frequencies of its resonances. To find the best matching system we use a method of analysis called Linear Prediction. Linear prediction models the signal as if it were generated by a signal of minimum energy being passed through a purely-recursive IIR filter. We will demonstrate the idea by using LPC to find the best IIR filter from a section of speech signal and then plotting the filter's frequency response.

3.2 LPC method for format estimation:

Speech signal is produced by the convolution of excitation source and time varying vocal tract system components. These excitation and vocal tract components are to be separated from the available speech signal to study these components independent. For deconvolving the given speech into excitation and vocal tract system components, methods based on homomorphic analysis like cepstral analysis are developed. As the cepstral analysis does the deconvolution of speech into source and system components by traversing through frequency domain, the deconvolution task becomes computational intensive process.


To reduce such type of computational complexity and finding the source and system components from time domain itself, the Linear Prediction analysis is developed.

3.2.1 LPC analysis:

The redundancy in the speech signal is exploited in the LP analysis. The prediction of current sample as a linear combination of past p samples form the basis of linear prediction analysis where p is the order of prediction. The predicted sample s^(n) can be represented as follows,

where aks are the linear prediction coefficients and s(n) is the windowed speech sequence obtained by multiplying short time speech frame with a hamming or similar type of window which is given by,

where ω(n) is the windowing sequence. The prediction error e(n) can be computed by the difference between actual sample s(n)and the predicted sample s^(n) which is given by,

The primary objective of LP analysis is to compute the LP coefficients which minimized the prediction error e(n). The popular method for computing the LP coefficients by least squares auto correlation method. This achieved by minimizing the total prediction error. The total prediction error can be represented as follows,


This can be expanded using the equation (5) as follows,

The values of aks which minimize the total prediction error E can be computed by finding

and equating to zero for k=0,1,2,...p.

for each ak give p linear equations with p unknowns. The solution of which gives the LP coefficients. This can be represented as follows,

The differentiated expression can be written as,

where i=1, 2, 3...p. The equation (9) can be written in terms of autocorrelation sequence R(i) as follows,


for i=1,2,3...p.

Where the autocorrelation sequence used in equation (10) can be written as follows,

for i= 1,2,3...p and N is the length of the sequence.

This can be represented in the matrix form as follows,

where R is the pXp symmetric matrix of elements R(i, k) = R(|i-k|), (1<=i, k<=p), r is a column vector with elements (R(1),R(2), ...R(P)) and finally A is the column vector of LPC coefficients (a(1), a(2), ....a(p)). It can be shown that R is toeplitz matrix which can be represented as,

The LP coefficients can be computed as shown,


where R-1 is the inverse of the matrix R

3.2.2 computation of LP residual:

LP residual is the prediction error e(n) obtained as the difference between the predicted speech sample s^(n) and the current sample s(n). This is shown in equation (4).

In the frequency domain, the equation (16) can be represented as,

i.e.,

So LP residual can be obtained filtering the speech signal with A(z) as indicated in figure 1. Similarly it can be shown that the LP spectrum H(z) as,

As A(z) is the reciprocal of H(z), LP residual is obtained by the inverse filtering of speech.


Figure 3.2.2: Computing the LP residual by inverse filtering 3.2.3 Determination of formats frequencies:

LP analysis separates the given short term sequence of speech into its slowly varying vocal tract component represented by LP filter (H(z)) and fast varying excitation component given by the LP residual (e(n)). The LP filter (H(z)) induces the desired spectral shape for the shape on the flat spectrum (E(z)) of the noise like excitation sequence as given in equation (20). As the LP spectrum provides the vocal tract characteristics, the vocal tract resonances (formants) can be estimated from the LP spectrum. Various formant locations can be obtained by picking the peaks from the magnitude LP spectrum (|H(z)|). The figure 3.2.3 shows the first (F1), second (F2) and third formant (F3) frequencies estimated from the peaks in the LP magnitude spectrum.

where S(z) is the spectrum of the given short time speech signal.


Figure 3.2.3: Formant locations corresponding to peaks in LP magnitude spectrum. MFCC (mel frequency cepstral coefficients) estimation

4.1 Introduction:MFCC’s are based on the known variation of the human ear’s critical bandwidths with

frequency, filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. Normally the MFCC represents the short term power spectrum of the speech signal .

4.2 MFCC implementation:

A block diagram of the structure of an MFCC processor is given in Figure 4.2. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which cover most energy of sounds that are generated by humans. As been discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the


speech waveforms themselves, MFFC’s are shown to be less susceptible to mentioned variations.

Figure 4.2: Block diagram of the MFCC processor

4.2.1Frame Blocking :

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples and so on. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~ 30 msec windowing and facilitate the fast radix-2 FFT) and M = 100.

4.2.2 Windowing :

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w (n) , 0≤n≤N−1 , where N is the number of samples in each frame, then the result of windowing is the signal

y l( n)=xl( n)w( n) , 0≤n≤N−1

Typically the Hamming window is used, which has the form:

w (n)=0 .54−0 . 46 cos ( 2 πnN−1 ) , 0≤n≤N−1


melcepstrum

melspectrum

framecontinuousspeech

FrameBlocking

Windowing FFT spectrum

Mel-frequencyWrapping

Cepstrum

4.2.3 Fast fourier transform(FFT) :

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {xn}, as follow:

X k=∑n=0

N−1

xne− j 2π kn /N , k=0,1,2 ,. . ., N−1

In general Xk’s are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence {Xk} is interpreted as follow: positive

frequencies 0≤f <F s /2 correspond to values 0≤n≤N /2−1 , while negative frequencies −F s /2< f <0 correspond to N /2+1≤n≤N−1 . Here, Fs denotes the sampling frequency.

The result after this step is often referred to as spectrum or periodogram.

4.2.4Mel-frequency wrapping :

As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the ‘mel’ scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.


Figure 4.2.4: An example of mel-spaced filterbank

One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel-scale (see Figure 4.2.4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain, thus it simply amounts to applying the triangle-shape windows as in the Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.

4.2.5 Cepstrum :

In this final step, we convert the log mel spectrum back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis.


K-1nK

knScK

kkn ,...,1,0,

21cos)~(log ~

1

Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore if we denote those mel power spectrum coefficients that are the result of the last step are ~S 0 , k=0,2 ,. . ., K−1 , we can calculate the MFCC's,

~c n , as

Note that we exclude the first component, ~c 0 , from the DCT since it represents the mean

value of the input signal, which carried little speaker specific information.

By applying the procedure described above, for each speech frame of around 30msec with

overlap, a set of mel-frequency cepstrum coefficients is computed. These are result of a

cosine transform of the logarithm of the short-term power spectrum expressed on a mel-

frequency scale. This set of coefficients is called an acoustic vector. Therefore each input

utterance is transformed into a sequence of acoustic vectors.


Centre of gravity(COG)

5.1Introduction:The spectral Centre Of Gravity (COG) is a measure of how high the frequencies in a spectrum are. For this reason the COG gives an average indication of the spectral distribution of the speech signal under observation. Given the considered discrete signal s(n) and its DFT S(k), the COG has been computed by:

, fk = k /N , k = 0, ...,N −1 represents the k-th frequency composing the DFT.

In order to determine exact power levels in the speech signal these spectral shaping features are required .5.2 spectral central moments:The m-th central spectral moment of the considered sequence s(n) has been computed by:


5.2.1 Standard Deviation(SD):

The standard deviation of a spectrum is defined as the measure of how much the frequencies in a spectrum can deviate from the centre of gravity. SD corresponds to the square root of the second central moment µ2:

5.2.2 Skewness:The skewness of a spectrum is a measure of symmetry and it is defined as the third central moment of the considered sequence s(n), divided by the 1.5 power of the second central moment:


\

These are the spectral moments in order to determine how are the frequencies distributed in the spectrum for the speech signal obtained by doing this we can determine exact power values in the spectrum of required speech signal. Those these are best central moments can be find in this project and we go for higher moments in order to determine more accuracy normally up to 3rd central moments are sufficient to describe the power levels in the spectrum we have done up to skewness for the spectrum which is a 3rd central moment .

Emotions Classifier6.1 Introduction:

Usually, in the literature of the field, a Support Vector Machine (SVM) is used to classify sentences. SVM is a relatively new machine learning algorithm introduced by Vapnik and derived from statistical learning theory in the 90s. The main idea is to transform the original input set into a high dimensional feature space by using a kernel function and, then, to achieve optimum classification in this new feature space, where a clear separation among features obtained by the optimal placement of a separation hyperplane under the precondition of linear separability.

6.2 SVM classifier: Differently from the previously proposed approaches, two different classifiers, both kernel-based Support Vector Machines (SVMs), have been employed in this project. The first one (called Male-SVM) is used if a male speaker is recognized by the Gender Recognition block. The other SVM (Female-SVM) is employed in case of female speaker. Male-SVM and Female-SVM classifiers have been trained by using speech signals of the employed reference Database (DB) generated, respectively, by male and female speakers. Being g = {1, −1} the label of the gender as defined in above the two SVMs have been trained by the traditional


Quadratic Programming (QP) as done . In more detail, the following problem has been solved for each gender g:

CLASS A Support vector

Hyperplane SUPPORT VECTOR

CLASSB

Figure 6.2: A linear Support Vector Machine

where represents the well-known Lagrangian Multipliers vector of the QP problem written in dual form.


Vectors are features vectors while scalars

are related labels (i.e., the emotions in this paper).

They represent the vectors of the training set for the g-th gender.

is the related association, also called observation, between the

u-th input features vector and its label . The quantity `gis the total amount of observations composing the training set.

The quantity C (C > 0) is the Complexity constant which determines the trade-off between the flatness (i.e., the sensitivity of the prediction to perturbations in the features) and the tolerant level for misclassified samples.

Higher value of C means that is more important minimising the degree of misclassification. C

= 1 is used in the project represents a non-linear SVM and the function is the

Kernel function that, in this paper, is . Coherently with , the QP problems (one for each gender) in equation are solved by the Sequential Minimal

Optimization (SMO) approach that provides an optimal point, not necessarily unique and isolated, of if and only if Karush-Kuhn-Tucker (KKT) conditions are verified and matrices

are positive semi-definite. Details about the KKT conditions and the SMO approach employed

6.2.1 Polish Emotional database:In this database consists of 4 actors of 2 male and 2 female with 4different emotions was taken. Recordings for every speaker were made during a single session. Each speaker utters four different sentences.


The uttrence code are

1 - Oni kupili dzisiaj nowy samochód.

2 - Jego dziewczyna przylatuje dzisiaj samolotem.

3 - Janek by³ dzisiaj u fryzjera.

4 - Ta lampa dzisiaj jest na biurku.

In this first two utterances are given to training phase for the SVM and the other two are given for the testing phase.

Finally with all the features extracted are given to feature vector was given input to SVM in order to perform the emotion recognition. Thus SVM performs better classification in the process of gender recognition

RESULTS

7.1Introduction:

In this project we have used MATLAB coding. MATLAB (“Matrix Laboratory”) is a tool for

numerical computation and visualization. The basic data element is a matrix, so if you need a

program that manipulates array-based data it is generally fast to write and run in MATLAB

Matlab is widely used in all areas of applied mathematics, in education and research at

universities, and in the industry. MATLAB has powerful graphic tools and can produce nice

pictures in both 2D and 3D. it is also a programming language, and is one of the easiest

programing languages processing, image processing, optimization. Etc.

Typical uses include:


i. Math and computation.

ii. Algorithm development.

iii. Modeling, simulation, and prototyping.

iv. Data analysis, exploration, and visulization.

v. Scientific and engineering graphics.

vi. Application development, including Graphical User Interface building

This is high level matrix/array language with control flow statements. Function, data

structures,input/output, and object-oriented programming features. This is a vast collection of

computational algorithms ranging from elementry functions like sum, sine, cosine, and

complex arithametic. To more sophisticate function like matrix inverse, matrix eignevalues,

Bessel functions, and fast Fourier transforms.In this project we have used MATLAB 8.3

version for programming


Figure 7.1: Gender recognition as female with mean pitch values higher than threshold.

Analasis: Here we have taken speech sample and by using auto correlation function we have find pitch vaules for every frame finaly a threshold level of 250 is taken in order to separete male and female samples.


Figure 7.2: Gender recognition as male with mean pitch values less than threshold.

Analasis: Here we have taken speech sample and by using auto correlation function we have find pitch vaules for every frame finaly a threshold level of 250 is taken in order to separete male and female samples.

Figure 7.3: Formats for speech signal in neutral sate.


Figure 7.4: Formats for speech signal in anger sate.


Figure 7.5: Formats for speech signal in joy.

Figure 7.6: Formats for speech signal in sad state.

Analysis: The formats frequencies in the above figures gives resonant frequencies of vocal tract at different emotions are used to construct the vocal tract system for particular speaker.


Figure 7.7: MFCC for speech signal in anger state male and female.

Figure 7.8: Formats for speech signal in sad state.


Figure 7.9: MFCC for speech signal in joy state male and female.

Analysis: MFCC at different emotions gives the short term power levels of the speech signal that are useful in recognition of speech.


Figure 7.10: Power spectrum for the for speech signal in anger state male

Analysis: The power spectrum represents the power of the particular male speaker in particular emotion such that it is used recognition of particular word.


Figure 7.11: Power spectrum for the for speech signal in anger state female

Analysis: The power spectrum represents the power of the particular female speaker in particular emotion such that it is used recognition of particular word.


Figure 7.12: confusion matrix for four speakers on gender recognition.

Analysis: By performing the sytem with 4 speakers in four different emotions this matrix represents recognition of one particular emotion versus different emotions


conclusion:

Thus the system can be able to dectect the 4emotions such that by using the particular data base of polish emotinal database in which actors speak 4 utterences in different emotions improve the accuracy in desinging the emotion recognition system.

The gender reconition used in this system are useful in reduceing the time delay in form of classfying stage at classifier.By using this also the accuracy can be improved in the system of recognition of emotion.


References

Reference A.

[1]. Igor Bisio, Alessandro Delfino, Fabio Lavagetto, Mario Marchese, and

Andrea Sciarrone, ”Gender-driven Emotion Recognition Through Speech

Signals For Ambient Intelligence Applications”, IEEE 2013.

Reference B :

• [1] F. Burkhardt, M. van Ballegooy, R. Englert, and R. Huber, “An emotion-aware voice portal,” Proc. Electronic Speech Signal Processing ESSP, pp. 123–131, 2005.

[2] J. Luo, Affective computing and intelligent interaction. Springer, 2012, vol. 137





Figure 6.1: input and synthesized out from LPC and LSP for JUSTICE (LPC=8)


Figure 6.2: power spectral density for input

Figure 6.3: power spectral density for output

****** input ******

• Sampling rate (Hz): 48000

• Input length (samples): 42788

• Input length (seconds): 0.891417

****** Compression using Only LPC ******

• Compression ratio: 26.64

• psnr with LPC: 28.721563

• Mahalanobis Distance with LPC: 0.179952

****** compression using LSP as well ******


• psnr with LSP: 28.653103

• Mahalanobis Distance with LSP: 0.227074

2. Speech spelt=JUSTICE


Frame size=20msec

Order of LPC=12

Figure 6.4: input and synthesized out from LPC and LSP for input JUSTICE (LPC=12)

****** input ******











• psnr with LSP :28.616134

• Mahalanobis Distance with LSP :0.185969


Frame size=20msec

Order of LPC=45


Figure 6.5: input and synthesized out from LPC and LSP for input JUSTICE (LPC=45)

****** input ******






• psnr without LSP 28.494923

• Mahalanobis Distance without LSP 0.173408



• psnr with LSP 28.647524

• Mahalanobis Distance with LSP 0.203702


LPC order=

8

LPC order=

12

LPC order=

14

LPC order=

20

LPC order=

40

LPC order=

60

29.4

29.5

29.6

29.7

29.8

29.9

30

LPCLSPPS

NR

Figure 6.6: Calculation of PSNR for input justice of various order of LPC

ORDER=8

ORDER=10

ORDER=12

ORDER=14

ORDER=16

ORDER=18

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

INPUT&LPCINPUT&LSP

dist

ance

Figure 6.7: Calculation of mehalanobis distance for input justice of various order of LPC



Frame size=30msec

Order of LPC=12

Figure 6.8: input and synthesized out from LPC and LSP for input JUSTICE (30msec)

****** input ******














Frame size=40msec

Order of LPC=12



****** input ******






• psnr without LSP :31.15005







Frame size=50msec

Order of LPC=12



****** input ******













frame size= 20msec

frame size= 30msec

frame size=40msec

frame size=50msec

frame size=60msec

262728293031323334

lpclsp

PSN

R

Figure 6.11: Calculation of PSNR for input justice of various frame size

frame size=20msec

frame size=30msec

frame size=40msec

frame size=50msec

00.10.20.30.40.50.6

Input&LPCInput&LSP

Figure 6.12: Calculation of mehalanobis distance for input justice of various frame size

7. Speech spelt=It is simple to be happy (male)

Frame size=30msec

Order of LPC=8


Figure 6.13: input and synthesized out from LPC and LSP (LPC=8)

****** input ******














Frame size=20msec

Order of LPC=12

Figure 6.14: input and synthesized out from LPC and LSP (LPC=12)

Power spectral density:


Figure 6.15: Power spectral density of input and output signals

• ****** input ******




• ****** Compression using Only LPC ******



• Mahalanobis Distance with LSP:0.166563

• ****** compression using LSP as well ******





order=8 order=10 order=12 order=14 order=1630.5

30.6

30.7

30.8

30.9

31

31.1

31.2

Input&LPCInput&LSP

Figure 6.16: Calculation of PSNR for various order of LPC

order =8 order =10 order =12 order =14 order=160

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Input&LPCInput&LSP

Figure 6.17: Calculation of mehalanobis distance for various order of LPC


Frame size=30msec

Order of LPC=12


Figure 6.18: input and synthesized out from LPC and LSP (30msec)

****** input ******






• psnr without LSP: 30.838957








Frame size=40msec

Order of LPC=12


****** input ******



• Input length (seconds): 2.428125M.Tech thesis submitted to Jawaharlal Nehru Technological University, Hyderabad-2015 Page 54









frame size=20msec

frame size=30msec

frame size=40msec

frame size=50msec

29

30

31

32

33

34

35

LPCLSP

Figure 6.20: Calculation of PSNR for various frame size

frame size=20msec frame size=30msec frame size=40msec frame size=50msec0

0.050.1

0.150.2

0.250.3

0.350.4

0.450.5

LPCLSP

Figure 6.21: Calculation of mehalanobis distance for various frame size


11. Speech spelt=Time is precious don’t waste it

Frame size=20msec

Order of LPC=12


****** input ******








• Mahalanobis Distance with LSP : 0.280322






Frame size=30msec

Order of LPC=12



• ****** input ******













Frame size=40msec

Order of LPC=12



****** input ******














33.6

33.65

33.7

33.75

33.8

33.85

33.9

33.95

34

LPCLSP



0.37

0.38

0.39

0.4

0.41

0.42

0.43

0.44

LPCLSP

Dist

ance


Figure 6.26: Calculation of mehalanobis distance for various order of LPC


32.5

33

33.5

34

34.5

35

35.5

36

36.5

37

LPCLSP



0.1

0.2

0.3

0.4

0.5

0.6

0.7

LPCLSP


13. Speech spelt=make in India

Frame size=20msec

Order of LPC=8



****** input ******














26

26.5

27

27.5

28

28.5

LPCLSPPS

NR


order=8 order=10 order=12 order=14 order=160

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

LPCLSP

Dist

ance

Figure 6.31: Calculation of mehalanobis distance for various order of LPCM.Tech thesis submitted to Jawaharlal Nehru Technological University, Hyderabad-2015 Page 63

frame size=20msec frame size=30msec frame size=40msec frame size=50msec24.5

25

25.5

26

26.5

27

27.5

28

28.5

LPCLSP



0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

LPCLSP



MBSD (modified bark spectral distortion)The Average_MBSD=0.0151 for input&LPCSignal= JUSTICE

speech1=justice speech2=Time is precious

speech3=make in India

0

0.002

0.004

0.006

0.008

0.01

0.012

0.014

0.016

Input&LPCInput&LSP

Avg_

MBS

D

Figure 6.34: Calculation of MBSD for various input signals


Compression Ratio

Speech signal Order of LPC LPC LSP

Justice 8 26.64 35.40

10 22.54 27.95

12 19.54 23.17

14 17.24 19.83

It is simple to be happy 8 26.36 35.69

10 22.30 28.03

12 19.33 23.17

14 17.05 19.78

Time is precious don’t waste it

8 26.39 35.57

10 22.33 27.97

12 19.35 23.13

14 17.08 19.77

Table 6.1: compression ratio for various inputs of different LPC orders


Compression ratio

Speech signal Frame size LPC LSP

Justice 20msec 19.54 23.17

30msec 19.81 23.56

40msec 20.09 23.91

50msec 20.23 24.13

It is simple to be happy 20msec 19.33 23.17

30msec 19.43 23.65

40msec 19.47 23.83

50msec 19.57 24.02

Time is precious don’t waste it

20msec 19.35 23.13

30msec 19.46 23.38

40msec 19.51 23.46

50msec 19.62 23.64

Table 6.2: compression ratio for various inputs of different frame size


Conclusion

The synthesis of LPC and LSP depends on two parameters

1. Order of LPC

2. Frame size

• For the given order of LPC and Frame size, the compression ratio is good for LSP compared for LPC.

• After calculating PSNR and mehalanobis distance , mostly the LPC is dominating the LSP but the difference between them is very small.

• So, for better compression it is better to opt LSP rather than LPC, but implementing LSP is expensive .


References

Reference A:

[1] Sara Grassi,”Optimized Implementation of Speech Processing Algorithms”, Imprimatur Pour La These

Reference B:

[1] F.Itakura, “Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals”,J.of the Acoustical Society of America,Vol.57,pp.S35,1975

[2] K.Paliwal and B.Atal, “Efficient Vector Quantization of LPC parameters at 24bits/frame”, IEEE Trans on Speech and Audio Processing, Vol.1, No.1, pp.3-14, 1993

[3] P.Kabal and P.Ramachandran,”The Computation of Line Spectral Frequencies Using Chebyshev Polynomials”,IEEE Trans.on acoustics,Speech and Signal Processing,Vol.34,No.6,pp.1419-1426,1986


Documents

Thesis 1