
Feasibility study of a voice identification approach using

Fourier-Transform based spectral matching

xxxx and xxxx (authors' information has been removed), submitted by 6043038

Mathematical Foundations of Data Analysis I, WS 2018/19,

Mathematical Institute,

University of Cologne

Speech recognition and especially voice identification is a strong research field nowadays

and it will potentially grow even more in the near future. Spectral analysis is a fundamental

operation in speech recognition. This project presents an approach based on the discrete Fourier

transformation to analyze spoken words in the frequency spectrum for speaker recognition. By

using several mathematical methods regarding Fourier transformation, audio signals will

be evaluated not only in the time- but also in the frequency-domain. The final results will

show that the use of the fast Fourier transform for extracting features from the audio input

is a numerically more efficient method for voice identification. In addition to that, it will be

examined how noise influences the operations and in which way it is possible to remove this

noise for an acceptable reconstruction of a template signal.

I. INTRODUCTION

Fourier Analysis forms the foundation for signal pro-

cessing. Applications can be found, e.g. in data

sampling, data compression or imaging. One of the

most fascinating and present field of interest is audio

signal processing. With Fourier-based data analysis it is

possible to compare speech patterns for the identification

of a template speaker. We aim to compare audio files

from different people to a template audio file. As a

result, from using Fourier based spectral comparison in

MATLAB, we then are supposed to be able to identify

and match the speaker. Fourier analysis makes it fairly

easy to denoise audio files and filter out unusable and

undesirable data. The necessity and importance of voice

and speech analysis is obvious: the ability to identify

a person by their voice and to recognize exactly what

one says. Uses in public surveillance and smart

speakers are just some of many aspects that emphasize

the rising relevance of voice and speech identification,

and recognition respectively.

One’s voice is the most important tool for human

communication. From a technological point of view,

the study of voices and sounds hence offers many

opportunities for us. With the rapid and constant tech-

nological development in today’s world, it is essential

to understand how voice identification in general but

also specific topics like speech recognition work. With

this project, we want to analyze the basic mathematical

foundation of voice identification without stepping too

deep into the subject of artificial intelligence and speech

recognition. Our main goal is to show the idea behind

the advanced technological tools we have access to

nowadays, and to attempt a simple voice identification.

The focus of our work is on the principles of audio signal

processing, including spectral comparison, the filtering


of sound, denoising methods and the computational

implementation of some Fourier transformations.

In a world guided by technology and progress, speech

recognition is a fundamental research area, yet to be

completely understood. The basic concepts of speech

analysis go a long way back, already beginning in the

late 18th century. Here, however, we are mainly interested in

the current process of speech recognition. The benefits

that come from the understanding of how sounds and

speech are produced and perceived are immense. Audio

signal processing has led to the development of speech

synthesizers as well as automatic speech recognition

systems, such as Siri, Alexa or Google Home, with the

aim to simplify our everyday life. Nevertheless, this

field is still not fully explored. The automatic speech

recognition systems, for example, are often susceptible

to errors. Unexpected variability in acoustics can lead

to poor performance and user frustration. Therefore, this

research area is still at the beginning of its development

and has undoubtedly huge potential for the future.

II. MATHEMATICAL METHODS

The basis for signal-processing applications is formed

by the discrete Fourier transform (DFT). The discrete

Fourier transform is essential for finding the spectrum of

a finite-duration signal. In addition, it is also

necessary to discuss the more efficient computational al-

gorithm, the so-called fast Fourier transform (FFT). For

our considerations, the numerical efficiency is important

for a stable implementation. Therefore, we mainly want

to focus on the use of FFT.

A. Discrete Fourier Transform

In general, the discrete Fourier transform (DFT) of a

finite duration sequence f(n), 0 ≤ n ≤ N − 1, denoted

by f̂ or Ff , is calculated by following formula

f̂(ω) =

N−1∑n=0

f(n)e− i2πωn

N

.
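The definition above can be checked directly against a library FFT. The following is a minimal Python sketch (not part of the original MATLAB project), implementing the sum term by term and comparing it to NumPy's reference implementation:

```python
import numpy as np

def dft(f):
    """Naive DFT: f_hat(w) = sum_n f(n) * exp(-i*2*pi*w*n/N)."""
    N = len(f)
    n = np.arange(N)
    # One output coefficient per frequency index w = 0, ..., N-1
    return np.array([np.sum(f * np.exp(-2j * np.pi * w * n / N)) for w in range(N)])

# A short real-valued test sequence
f = np.array([1.0, 2.0, 0.0, -1.0])
assert np.allclose(dft(f), np.fft.fft(f))
```

The double loop (one sum per output coefficient) is exactly what makes the naive DFT an O(N²) computation.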

Remark 1. There also exists a similar method of

calculating the n-point DFT of a vector x. Let n ≥ 2

be any given integer. The n-point discrete Fourier

transform (DFT) is the n × n matrix F_n of complex

numbers, defined by

F_n = [(z_n)^{jk}]_{0≤j,k≤n−1} =

⎡ 1    1         ⋯   1                ⎤
⎢ 1    z_n^1     ⋯   z_n^{n−1}        ⎥
⎢ ⋮    ⋮         ⋱   ⋮                ⎥
⎣ 1    z_n^{n−1} ⋯   z_n^{(n−1)(n−1)} ⎦ ,

where z_n = e^{−i2π/n}. For a vector x ∈ C^n, the DFT is given by x̂ = F_n x [2].

For completeness, we also want to mention the inverse

discrete Fourier transform (IDFT) of a given sequence

f̂(ω), 0 ≤ ω ≤ N − 1, which is defined by

f(n) = (1/N) Σ_{ω=0}^{N−1} f̂(ω) e^{i2πωn/N}.

The inverse of a Fourier transformation can be used to

reconstruct a signal from its frequency components.
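This round trip — transform, then inverse transform — can be verified numerically. A minimal Python sketch (illustrative only, not part of the original MATLAB project):

```python
import numpy as np

def idft(f_hat):
    """Naive IDFT: f(n) = (1/N) * sum_w f_hat(w) * exp(i*2*pi*w*n/N)."""
    N = len(f_hat)
    w = np.arange(N)
    return np.array([np.sum(f_hat * np.exp(2j * np.pi * w * n / N))
                     for n in range(N)]) / N

f = np.array([0.5, -1.0, 2.0, 3.0])
f_hat = np.fft.fft(f)
# The IDFT reconstructs the original real signal from its spectrum
assert np.allclose(idft(f_hat).real, f)
```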

Remark 2. Analogous to the DFT, one can derive the

n-point version of the inverse discrete Fourier transform.

For a given sequence f̂(ω), 0 ≤ ω ≤ N − 1, its IDFT is

then defined by

F̃_n = (1/n) F_n^∗,

where F_n^∗ is the adjoint of F_n [2].

In general, the DFT can be taken of a sequence of

complex numbers. In our case, for simplicity, we only

considered real-valued numbers.

The main purpose of measuring the spectrum of a signal is to learn about the covered frequency range and the number of frequencies measured. One has to consider that a real-life speech signal must have sufficient duration. Furthermore, spectral analysis by DFT requires information

about the duration of the DFT, the time between successive applications of the DFT, and the sampling rate

of the discretized signal [1]. For a useful evaluation, it

is necessary to choose an appropriate window to cover

possible speech parameter changes. The parameter N is

determined by the preferred spectrum resolution.

Remark 3. The sampling of an analog signal in

the time-domain results in a periodic function in the

frequency-domain. For the DFT both f(n) and f̂(ω)

are periodic with period N. Therefore, the IDFT of the

signal will also be N-periodic.
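The N-periodicity stated in the remark follows directly from e^{−i2π(ω+N)n/N} = e^{−i2πωn/N}, and can be observed numerically. A small Python check (illustrative only):

```python
import numpy as np

f = np.array([1.0, -2.0, 0.5, 3.0])
N = len(f)
n = np.arange(N)

def dft_at(w):
    """Evaluate the DFT sum at an arbitrary (not necessarily integer-range) frequency index w."""
    return np.sum(f * np.exp(-2j * np.pi * w * n / N))

# Evaluating the DFT formula at w and w + N gives the same value: period N
assert np.isclose(dft_at(1), dft_at(1 + N))
```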

From a numerical point of view, the DFT certainly has serious deficiencies regarding its implementation. The direct computation of the DFT takes up to N² operations, which means that for large data sets a complexity of O(N²) occurs. Hence, the use of the DFT will probably be highly inefficient. Due to that, we want to compare the DFT to the fast Fourier transform (FFT).

The FFT is supposed to be a faster version of the DFT

and should save computational time.

B. Fast Fourier Transform

With the FFT algorithm, the computational time

reduces to O(N log₂ N) in the worst case. This is done

by separating the calculations of even and odd indices:

When the index is even, the terms n and n + N/2 can be grouped. One gets

f̂(2ω) = Σ_{n=0}^{N/2−1} (f(n) + f(n + N/2)) e^{−i2πnω/(N/2)}.

For odd indices, we hence have

f̂(2ω + 1) = Σ_{n=0}^{N/2−1} e^{−i2πn/N} (f(n) − f(n + N/2)) e^{−i2πnω/(N/2)}.
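The even/odd split can be validated numerically: grouping the terms as above and taking two half-length DFTs must reproduce the even- and odd-indexed coefficients of the full DFT. A Python sketch (using NumPy's FFT as the half-length reference; not part of the original MATLAB project):

```python
import numpy as np

N = 8
f = np.random.default_rng(0).standard_normal(N)
F = np.fft.fft(f)

h = N // 2
n = np.arange(h)
even = f[:h] + f[h:]                                   # grouped terms for even outputs
odd = (f[:h] - f[h:]) * np.exp(-2j * np.pi * n / N)    # twiddle factors for odd outputs

# Two half-length DFTs give the even and odd coefficients of the full DFT
assert np.allclose(np.fft.fft(even), F[0::2])
assert np.allclose(np.fft.fft(odd), F[1::2])
```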

Similarly, for DFT matrices, one can derive a full factorization. For n = 2^m, with m ≥ 1 an integer, we get the following FFT scheme:

F_n = F_{2^m} = G_0^m G_1^m ⋯ G_{m−1}^m P̃_{2^m},

where P̃ is a permutation matrix and G_k^m is defined by

G_k^m = diag{E_{2^{m−k}}, ..., E_{2^{m−k}}}

with

E_{2n} = ⎡ I_n    D_n ⎤
         ⎣ I_n   −D_n ⎦ .

Further, D_n is defined as

D_n = diag{1, e^{−iπ/n}, ..., e^{−i(n−1)π/n}}

and I_n is the n × n identity matrix [2]. This full matrix factorization formula for F_n with n = 2^m is supposed to decrease the computa-

tional complexity of F_{2^m}.

A signal of size N is then calculated with two discrete

Fourier transforms of size N/2 plus O(N) operations. A

similar argument can also be applied to the number

of additions required. All in all, O(N log₂ N) time is

needed for the calculation of the FFT.
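Applying the even/odd split recursively yields the full O(N log₂ N) algorithm. The following Python sketch (illustrative only, assuming a power-of-two length) implements this decimation recursion and checks it against NumPy's FFT:

```python
import numpy as np

def fft_recursive(f):
    """Radix-2 decimation FFT for len(f) a power of two; O(N log N)."""
    N = len(f)
    if N == 1:
        return np.asarray(f, dtype=complex)
    h = N // 2
    n = np.arange(h)
    # Recurse on the grouped even/odd terms of the split
    even = fft_recursive(f[:h] + f[h:])
    odd = fft_recursive((f[:h] - f[h:]) * np.exp(-2j * np.pi * n / N))
    out = np.empty(N, dtype=complex)
    out[0::2] = even   # even-indexed coefficients
    out[1::2] = odd    # odd-indexed coefficients
    return out

x = np.arange(8.0)
assert np.allclose(fft_recursive(x), np.fft.fft(x))
```

Each of the log₂ N recursion levels performs O(N) additions and multiplications, which is exactly where the O(N log₂ N) bound comes from.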

In comparison to the two methods mentioned above, we

also want to take a closer look at a real-valued version

of the DFT: the discrete Cosine transform (DCT). The

idea behind the DCT is to avoid a dimension extension

in case of the transformation of a real signal into a

complex signal.

C. Discrete Cosine Transform

Recall the definition of the DFT. The discrete Cosine

transform (DCT) of a sequence f(n) of N points is de-

fined by the equation

DCT(ω) = (2/N) Σ_{n=0}^{N−1} f(n) cos((2n + 1)ωπ / (2N)),

with ω = 1, 2, ..., (N − 1). For ω = 0, the factor 2/N changes to √2/N [3].

Similarly, we also want to give the definition of the

inverse discrete Cosine transform (IDCT).

We have the same assumptions as in the definition for


the DCT. Then,

IDCT(n) = (2/N) Σ_{ω=0}^{N−1} a(ω) DCT(ω) cos((2n + 1)ωπ / (2N)),

n = 1, 2, ..., (N − 1), is called the inverse discrete Cosine transform (IDCT), with a(ω) = √(1/N) for ω = 0 and a(ω) = √(2/N) for ω ≠ 0. Again, for n = 0, the factor 2/N changes to √2/N [3].

Alternatively, one can also define the DCT and the IDCT

via a matrix formulation.

Remark 4. For each n ≥ 2, the unitary matrix C_n = [c_0, ..., c_{n−1}] ∈ R^{n×n}, with

c_k = [ 1/√n, √(2/n) cos((k + 1/2)π/n), ..., √(2/n) cos((n−1)(k + 1/2)π/n) ]^T,

is called the n-point discrete Cosine transform (DCT), and its transpose C_n^T is called the n-point inverse discrete Cosine transform (IDCT) [2].

These normalization factors turn the matrix C_n into an orthogonal matrix. Therefore, the inverse DCT is given by the transpose of C_n.
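The orthogonality of C_n is easy to confirm numerically. A minimal Python sketch (not part of the original MATLAB project) builds the matrix from the column definition in Remark 4 and checks C_n C_n^T = I:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal n-point DCT matrix from Remark 4:
    entry (j, k) is sqrt(1/n) for j = 0, and sqrt(2/n)*cos(j*(k+1/2)*pi/n) otherwise."""
    j = np.arange(n)[:, None]   # row (frequency) index
    k = np.arange(n)[None, :]   # column (sample) index
    C = np.sqrt(2.0 / n) * np.cos(j * (k + 0.5) * np.pi / n)
    C[0, :] = np.sqrt(1.0 / n)  # constant first row
    return C

C = dct_matrix(8)
# C_n is orthogonal, so its transpose is the inverse DCT
assert np.allclose(C @ C.T, np.eye(8))
```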

With the DCT, it is possible to consider audio signals as

sparse signals in the frequency-domain. For further con-

siderations we refer to [3]. The most interesting aspect

of the DCT is that the DCT speech signal representation

is able to compress input data into as few coefficients

as possible. Coefficients with relatively small amplitudes

can be discarded without any misrepresentation of information in the reconstructed signal.
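This compression property can be demonstrated with a short Python sketch (illustrative only; the smooth test signal is a hypothetical example, chosen so that its DCT energy sits in a few coefficients): keep only the largest-magnitude DCT coefficients, reconstruct with the transpose, and compare.

```python
import numpy as np

n = 64
t = np.arange(n)
# Smooth test signal: a sum of two DCT basis rows, so only two coefficients are nonzero
x = (np.cos(np.pi * (2 * t + 1) * 3 / (2 * n))
     + 0.25 * np.cos(np.pi * (2 * t + 1) * 7 / (2 * n)))

# Orthonormal DCT-II matrix (rows as in Remark 4)
j = np.arange(n)[:, None]
k = np.arange(n)[None, :]
C = np.sqrt(2.0 / n) * np.cos(j * (k + 0.5) * np.pi / n)
C[0, :] = np.sqrt(1.0 / n)

coeffs = C @ x
# Discard all but the 4 largest-magnitude coefficients
kept = np.zeros_like(coeffs)
idx = np.argsort(np.abs(coeffs))[-4:]
kept[idx] = coeffs[idx]
x_rec = C.T @ kept              # inverse DCT = transpose

assert np.allclose(x_rec, x, atol=1e-8)
```

For real speech the reconstruction is of course only approximate, but the same few-large-coefficients structure is what makes the DCT attractive for compression.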

Remark 5. The computational time of the DCT is

comparable to the DFT. There are also algorithms in

O(n log n) time, but they are rarely used in practice [3].

D. Realization and Application

Our voice identification system focuses on the determi-

nation of which signal pattern belongs to a prespecified

speaker. Therefore, we recorded voices from different

people saying the words “Data Analysis”. To be capable

of distinguishing between the template voice and the

rest of the recordings, we will analyze our data regarding

their time-domain and frequency-domain representation.

After this analysis it should be fairly easy to allocate

the signals. We also want to compare the patterns of

the same person recording different words, e.g. “Frohe

Weihnachten”, or a longer sentence. As a first step,

the samples, including male and female voices, will be

read into MATLAB.

In Part I of this project, our objective is to identify if

a certain voice belongs to the predetermined person.

To analyze and compare these samples, we will first

transform them with the above-mentioned methods to

obtain the transformed signal in the frequency-domain.

These spectra will then be compared with the master

spectrum. With our application, it should then be pos-

sible to differentiate between speakers and to backtrack

if the same words were spoken.

In Part II, we focus on removing unwanted noise. For this

purpose, a new signal is created by mixing the original

voice with a noise signal. By applying a simple filter, fol-

lowed by an inverse transformation, it should be possible

to reconstruct the template.

III. NUMERICAL STUDY AND EVALUATION

A. Input Data

The foundation of our project, as already mentioned,

is built by audio recordings. A total of nine audio

samples have been recorded. For further information

and details on each sample, see Appendix.

For a closer look at the audio samples, they were

plotted in the time-domain first.

For simplicity, the four most important and relevant

samples are shown in Figure 1. Graph (a) is the signal

of a female voice saying “Data Analysis”. (b) displays

the audio signal of the template male saying “Frohe

Weihnachten”. The next one, graph (c), will be our

master audio signal. It displays the template male voice

of the words “Data Analysis”. Finally, the last graph (d)

shows the same signal from (c) but with a background

noise. All original input signals are plotted as amplitude over time in seconds.

By comparing the signals from (c) and (d), one


Figure 1: Four different audio signals in the time-domain

can easily see the difference. In the mixing process,

we have to make sure that both have the same length.

For mixing purposes, we used simple splines. One can

change the factor noiseIntensity, which determines the

strength or intensity of the selected noise. In our sample

we used noiseIntensity = 1.

% mixing audio sample S1 with noise sample S9
noiseIntensity = 1.0;
[S10, Fs10] = mixingSamples(S1, Fs1, S9, Fs9, noiseIntensity, Fs1);
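The mixing step itself is simple: bring the noise to the length of the voice sample and add it with the chosen intensity. A hypothetical Python sketch of what a function like mixingSamples could do (the original uses splines for length matching; here, as an assumption, plain linear interpolation is used instead):

```python
import numpy as np

def mix_samples(voice, noise, noise_intensity=1.0):
    """Hypothetical sketch of mixingSamples: resample the noise to the
    voice's length (linear interpolation, not the splines used in the
    original project) and add it scaled by noise_intensity."""
    t_old = np.linspace(0.0, 1.0, len(noise))
    t_new = np.linspace(0.0, 1.0, len(voice))
    noise_resampled = np.interp(t_new, t_old, noise)
    return voice + noise_intensity * noise_resampled

voice = np.zeros(8)
noise = np.ones(4)
mixed = mix_samples(voice, noise, noise_intensity=0.5)
assert mixed.shape == voice.shape
assert np.allclose(mixed, 0.5)
```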

The foundation and basis for our further analysis is set.

In the next step, we will transform these signals into

spectra in the frequency-domain.

B. Fast Fourier Transformation of the Audio

Signals

It is practicable to transform the signal with the fast

Fourier transformation. We used the following code in

MATLAB to call and perform the different kinds of

Fourier transformation of our input signals.

[Y, f, fSPE] = computeTransformation(sample, samplingRate, typeOfTransformation);

One can change the type of transformation to ’FFT’,

’DFT’ or ’DCT’. We used the built-in MATLAB function for

FFT and our own source code for DFT and DCT. The

sampling rate in our project is always the same, since the

same recorder has been used. The output Y is our trans-

formed data. Because Y is a two-sided spectrum and has

complex values, we perform calculateSpectrum to obtain

the one-sided spectrum with real values.


function spe = calculateSpectrum(Y)
    n = length(Y);
    amplitude = abs(Y) / n;                       % amplitude
    amplitude_oneSide = amplitude(1:floor(n/2));  % two-sided to one-sided
    amplitude_oneSide(2:end-1) = 2 * amplitude_oneSide(2:end-1);
    spe = amplitude_oneSide;
end
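The same one-sided amplitude spectrum can be sketched in Python (illustrative only, mirroring the MATLAB function above): halve the two-sided FFT output and double the interior bins to fold the negative frequencies onto the positive ones.

```python
import numpy as np

def calculate_spectrum(Y):
    """One-sided amplitude spectrum of a two-sided complex FFT output Y."""
    n = len(Y)
    amplitude = np.abs(Y) / n
    one_side = amplitude[: n // 2].copy()
    one_side[1:-1] *= 2          # fold negative frequencies onto positive bins
    return one_side

# A pure 50 Hz tone sampled at 1 kHz for 1 s shows a single peak of amplitude 1
fs = 1000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)
spe = calculate_spectrum(np.fft.fft(x))
assert np.argmax(spe) == 50
assert np.isclose(spe[50], 1.0, atol=1e-6)
```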

Important: As mentioned and shown in Section II A, the DFT is very slow for large data. In practice, this fact is confirmed and can be extracted from the code: it takes considerably longer to perform our own DFT and DCT compared to the built-in FFT. For further analysis,

we will only display FFT results, since the results from

DFT and DCT are nearly identical. The DFT and DCT

code can be found in the Appendix.

The following plots demonstrate the results of the

transformed signals from Figure 1. Instead of contemplating the signals over time, we can now analyze them in the frequency-domain. The corresponding graphs are plotted as amplitude over frequency in Hz. The results are given in Figure 2.

We want to evaluate the outcome of our applications.

Comparing the plots, one can easily extract following re-

sults:

1. The frequency spectrum of (a) is broadly spread.

The peak of this voice can be found around 1000Hz,

which is considerably higher than the peak of the

voice in (c), which is around 500Hz. Moreover, re-

garding (a), there are higher frequencies, especially

higher than 4000 Hz, being represented. Since the courses of these two graphs are similar, it is conceivable to suggest that the intonations are alike

and the same words were probably spoken but with

a different frequency spectrum. Considering that

both are saying the same, it is plausible to sug-

gest that higher frequencies are a sign for a higher

pitched voice, i.e. from a female.

2. Graph (b) has clearly a different frequency spec-

trum compared to the other three samples. There

are plenty of peaks between 100Hz to 400Hz. The

maximum frequency can be found around 220Hz.

Frequencies of 2000 Hz and higher are barely present. The absence of higher frequencies points to a

male voice. Since the trend of this spectrum heav-

ily differs from our template (c), we can assume

that other words were spoken.

3. Finally, (d) is very similar to (c). The trend of

both graphs nearly correspond to each other. Also,

they have similar peaks and range of frequencies.

One can directly guess that the signals are from

the same person. Considering the results above, the probability that the same words were said is extremely high. The effect of the additional noise can be seen in the intervals [500, 1500] and [4000, 5000],

where the amplitudes in (d) are slightly higher.

In addition to these simple observations, we want to com-

pare these spectra with each other more theoretically.

Therefore, we want to determine the correlation between

some spectra, i.e. the degree of similarity. Given two

transformed signals Y1 and Y2, the correlation of these

two is given by

c = |⟨Y₁, Y₂⟩| / (‖Y₁‖ · ‖Y₂‖).
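This correlation is a one-liner in Python (sketch only, not part of the original MATLAB project): c = 1 for spectra pointing in the same direction, c = 0 for orthogonal ones.

```python
import numpy as np

def spectral_correlation(Y1, Y2):
    """c = |<Y1, Y2>| / (||Y1|| * ||Y2||)."""
    return np.abs(np.vdot(Y1, Y2)) / (np.linalg.norm(Y1) * np.linalg.norm(Y2))

Y = np.array([1.0, 2.0, 3.0])
assert np.isclose(spectral_correlation(Y, 2 * Y), 1.0)   # parallel spectra
assert np.isclose(spectral_correlation(np.array([1.0, 0.0]),
                                       np.array([0.0, 1.0])), 0.0)  # orthogonal
```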

The basic idea of this correlation is to determine the

orthogonality between the two signals. In Table I, the

correlations between several spectra are shown. In addition to the previous four samples (a)-(d), we have added

the following samples:

• (e): template male voice saying a longer sentence

• (f): another male voice saying ”Data Analysis”

      a       b       c       d       e       f
a     1       0.4262  0.4135  0.5710  0.4598  0.4599
b     0.4262  1       0.6265  0.6861  0.7368  0.5117
c     0.4135  0.6265  1       0.9455  0.7176  0.3555
d     0.5710  0.6861  0.9455  1       0.7434  0.4914
e     0.4598  0.7368  0.7176  0.7434  1       0.4652
f     0.4599  0.5117  0.3555  0.4914  0.4652  1

Table I: Comparing different spectra using simple correlation

The correlation results show exactly what we anticipated.

One can see that samples (b),(d) and (e) have most simi-


Figure 2: Audio signals in the frequency-domain using FFT

larity with the template sample (c), which is not surpris-

ing. Another fact is that (c) and (d) have a correlation of

0.9455, which is very high. Even with additional noise,

we can easily identify the same person. We also notice

that it is irrelevant what is said or how long a person

speaks, the correlation with samples of the same person

is always considerably high. From these results, we see

that the frequency spectrum delivers a lot of important

information, i.e. helping us

• to identify if it is the same person speaking,

• to differentiate between female and male voices

and/ or

• to recognize if the same words were spoken.

Comparing signals (c) and (d), we notice that the noise

makes the correlation with other signals higher. In prac-

tice, we often encounter audio samples that are not clean.

In order to remove this unwanted noise and to obtain a

clean voice, we need a process of denoising, e.g. filtering.

C. Filtering

There are many ways to denoise a signal. The most com-

mon techniques are filters which cut off specific frequency

ranges or remove frequencies with low amplitudes. Fre-

quently used filters are

• Low-pass filters,

• High-pass filters and

• Band-pass filters.

We go back to samples (c) and (d), see Figure 2. In

this project, cutting off frequency ranges or removing

amplitudes do not provide the desired result. We have

to consider another customized filter for this problem,

since both spectra are very similar.

The frequency spectrum of the original, the mixed and

the filtered signal are displayed in Figure 3. Our filter is

based on the intention of removing parts and the influ-

ence of noise from the mixed signal as much as possible.

Therefore, we consider two frequency spectra Ymixed and


Figure 3: Overlay of frequency spectra

Ynoise. To be able to compare these two, we have to nor-

malize them first. Then we can calculate the difference

in their amplitude:

amp_diff = Ymixed,norm − Ynoise,norm.

The filter is constructed as simply as follows, keeping only the frequencies with amp_diff > 0:

function filter = designDeNoisingFilter(fSample, fNoise)
% normalized amplitude spectra of the mixed signal and the noise
fSample_amp = abs(fSample)/length(fSample);
fNoise_amp = abs(fNoise)/length(fNoise);

% keep only the bins where the mixed signal dominates the noise
amp_diff = fSample_amp/sum(fSample_amp) - fNoise_amp/sum(fNoise_amp);
filter = amp_diff > 0;
end
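The same amplitude-difference mask can be sketched in a few lines of pure Python (an illustrative translation under our own names, not the project's MATLAB function):

```python
def design_denoising_filter(f_sample, f_noise):
    """Boolean mask: keep a bin when the mixed signal's normalized
    magnitude exceeds the noise reference's normalized magnitude."""
    sample_amp = [abs(v) / len(f_sample) for v in f_sample]
    noise_amp = [abs(v) / len(f_noise) for v in f_noise]
    s_total, n_total = sum(sample_amp), sum(noise_amp)
    return [s / s_total - n / n_total > 0
            for s, n in zip(sample_amp, noise_amp)]

# toy spectra: bin 1 is signal-dominated, bin 2 is noise-dominated
mixed = [0.0, 5.0, 2.0, 1.0]
noise = [0.0, 0.5, 3.0, 1.0]
mask = design_denoising_filter(mixed, noise)
filtered = [m if keep else 0.0 for m, keep in zip(mixed, mask)]
```

Applying the mask corresponds to the element-wise product S10_FFT.*filter_FFT used below.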

To apply this filter, we have to multiply it with the transformed mixed signal. After adjusting it to the one-sided spectrum, we obtain the filtered frequency spectrum in Figure 3. We notice that the filtered spectrum comes very close to the original signal. The effect of this filter can be seen clearly, especially in the frequency intervals [500, 1500] and [4000, 5000].

filter_FFT = designDeNoisingFilter(S10_FFT, S9_FFT);
S10_FFT_filtered = S10_FFT.*filter_FFT;
S10_FFT_filtered_SPE = calculateSpectrum(S10_FFT_filtered);
S10_FFT_filtered_Inversed = real(computeInverseTransformation(S10_FFT_filtered, 'FFT'));

As shown in Figure 3, the filtered spectrum is mostly lower than the original spectrum in amplitude. Since we used a very basic filter, there is the possibility of over-filtering. The frequency spectrum alone is not conclusive enough; it is very important to see the impact on the samples in the time-domain. For that purpose, we performed an inverse transformation, depending on the previously used method.


Figure 4: Overlay of signal (c) and (d) plus filtered version of signal (d)

In Figure 4, the inverse-transformed signals from Figure 3 are displayed. We notice that there is still some unwanted noise in the filtered signal. Still, the filter does work and clearly improves the signal. To prove this with numbers, we can calculate the correlation between the filtered signal and the original signal (c). As a result, we obtain a correlation of 0.9866, an improvement of 4.3% over the 0.9455 obtained previously, see Table I.
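The similarity indicator used here can be reproduced with the standard Pearson correlation coefficient; a minimal sketch (our own helper with the hypothetical name `correlation`, not the report's MATLAB code):

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# identical signals correlate perfectly; inverted ones perfectly negatively
x = [0.3, 1.2, -0.7, 0.5, -1.1]
r_same = correlation(x, x)
r_inv = correlation(x, [-v for v in x])
```

A value near 1 therefore indicates a close match between the filtered and the original signal.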

IV. DISCUSSION

A. Conclusion

In this project, we have successfully used the DFT, FFT and DCT to obtain the frequency spectra of several audio samples. With these we have built the foundation of spectral analysis. In order to compare and identify these samples, we studied the similarity and correlation of each spectrum. As a result, we can now identify different voices and match them to a template speaker.
Clearly, we have to be aware of the fact that this is only a small sample size, and that we used just a very simple indicator of similarity. There are certainly more complex and better indicators for a complete spectral analysis, which would go beyond the scope of this project. Nevertheless, we have demonstrated the basic idea and have noticed that one's frequency spectrum is so distinctive in itself that a distinction is fairly easy to make.

Furthermore, we have also looked into denoising and have created our own filter. Our filter does improve the similarity to the original template. The fact that a little noise still remains indicates that the filter could be improved in accuracy and precision.

In addition to all that, we performed different (inverse) Fourier transformations to reconstruct our input signals. We can confirm that the results are almost equal, but we also realize the huge difference in computational complexity. Since we only considered a very small set of data, it would be interesting to see how big the actual difference between the methods is, especially thinking of complex speech recognition systems.
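The complexity gap can be made concrete with a small pure-Python sketch (our own illustration, not the MATLAB appendix): the naive O(n²) sum next to the O(n log n) Cooley-Tukey recursion, checked for identical output:

```python
import cmath

def dft(x):
    """Naive DFT, O(n^2): Y_k = sum_j x_j * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT, O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

x = [0.0, 1.0, 0.5, -1.0, 0.25, 1.0, -0.5, -1.0]
max_err = max(abs(a - b) for a, b in zip(dft(x), fft(x)))
```

Both routines return the same spectrum; only the operation count differs, which is exactly the difference observed between our DFT loop and MATLAB's fft.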


B. Outlook

The next step after voice identification is a speaker recognition system (SRS). The goal of an SRS is to present an (almost) exact scheme to distinguish the individual properties of each speaker. The basis for an SRS is the similarities in individual speech elements. Speaker recognition requires precise analysis of the speech signal: it has to be clarified whether the representation carries suitable signal features or not. Moreover, artificial intelligence, e.g. deep learning, plays a significant role in training speech patterns in order to improve their accuracy.
For a further implementation, one can derive the so-called short-time Fourier transform, also known as the windowed Fourier transform. The short-time Fourier transform is localized in time and frequency. This means that it provides information about the point in time at which a certain frequency occurs.
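The windowing idea amounts to one DFT per frame. In this toy pure-Python fragment (our own sketch with a plain rectangular window, not part of the project), a frequency change halfway through the signal shows up in different bins of the early and late frames:

```python
import cmath
import math

def dft(frame):
    """Naive DFT of a single frame."""
    n = len(frame)
    return [sum(frame[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]

def stft(x, frame_len, hop):
    """Short-time Fourier transform with a rectangular window:
    one spectrum per frame, hence localization in time."""
    return [dft(x[start:start + frame_len])
            for start in range(0, len(x) - frame_len + 1, hop)]

# bin-1 sine for the first half, bin-3 sine for the second half
x = ([math.sin(2 * math.pi * j / 8) for j in range(32)] +
     [math.sin(2 * math.pi * 3 * j / 8) for j in range(32)])
frames = stft(x, frame_len=8, hop=8)
```

The early frames peak at bin 1 and the late frames at bin 3, which a single full-length transform could not localize in time.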

As a possible comparison, it is convenient to consider wavelet transforms. They are supposed to tell which frequencies are present and, in contrast to ordinary Fourier transforms, where (or at what scale). Wavelets might be better in areas such as compression or filtering. It would be interesting to see whether any advantages come up by using wavelet transforms and, if so, which. Based on the time-frequency multi-resolution property of wavelet transforms, the input speech signal is decomposed into various frequency channels. The major challenge in creating a wavelet-based speech recognition system is the choice of suitable wavelets for speech signals and the selection of the right feature vectors from the wavelet coefficients. A possible implementation could be the evaluation of the functionality of a wavelet-based feature extraction system and its performance on word recognition problems.
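To give a flavor of the wavelet decomposition mentioned above, a single step of the orthonormal Haar transform (a toy sketch of our own, not part of the project) splits a signal into a coarse channel and a detail channel while preserving its energy:

```python
from math import sqrt

def haar_step(x):
    """One level of the orthonormal Haar wavelet transform:
    pairwise scaled sums (approximation) and differences (detail)."""
    approx = [(x[i] + x[i + 1]) / sqrt(2) for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / sqrt(2) for i in range(0, len(x), 2)]
    return approx, detail

# a piecewise-constant signal: the jump is visible only in the detail channel
a, d = haar_step([1.0, 1.0, 3.0, 3.0])
energy_in = sum(v * v for v in [1.0, 1.0, 3.0, 3.0])
energy_out = sum(v * v for v in a + d)
```

Recursing on the approximation channel yields the multi-resolution frequency channels described above.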

ACKNOWLEDGEMENT

I would like to show my gratitude to xxxxxxxx for his engagement. Without the help of the other author, this project could not have been realized. His significant support, especially regarding his implementation and study with respect to the DFT and DCT, made this project possible in the first place. Only with his full commitment and dedication did this project become complete. I am already looking forward to working together on the next project.



APPENDIX

Main Script: Project.m

%% Step 0: Initializing and loading audio samples or files
%
clear;

projectPath = uigetdir();
if isdir(projectPath)
    cd(projectPath);
    if exist('Input.mat', 'file') == 2
        load('Input.mat');
    else % read audio files
        [S1,Fs1] = audioread('Ichkurz.m4a');
        [S2,Fs2] = audioread('Ichhochkurz.m4a');
        [S3,Fs3] = audioread('Mamahochkurz.m4a');
        [S4,Fs4] = audioread('Papa.m4a');
        [S5,Fs5] = audioread('Ich2.m4a');
        [S6,Fs6] = audioread('Ich3lang.m4a');
        [S7,Fs7] = audioread('IchRauschen.m4a');
        [S8,Fs8] = audioread('IchRauschen2.m4a');
        [S9,Fs9] = audioread('Rauschen.m4a');

        save('Input.mat');
    end
end

%% Step 1: Performing DFT/DCT for each audio sample and calculating one-sided frequency spectrum
%
if exist('Output_step1.mat', 'file') == 2
    load('Output_step1.mat');
end

numberOfSignals = 9;
typeOfTransformation = 'FFT';
for i = 1:numberOfSignals
    eval(['sample = S' num2str(i) ';']);
    eval(['samplingRate = Fs' num2str(i) ';']);

    [Y, f, fSPE] = computeTransformation(sample, samplingRate, typeOfTransformation); %#ok<ASGLU>

    eval(['S' num2str(i) '_' typeOfTransformation ' = Y;']);
    eval(['S' num2str(i) '_' typeOfTransformation '_freq = f;']);
    eval(['S' num2str(i) '_' typeOfTransformation '_SPE = fSPE;']);
end

clearvars i numberOfSignals sample samplingRate Y f fSPE;
save('Output_step1.mat');

%% Step 2: De-noising a mixed audio sample
%
clear;
if exist('Output_step2.mat', 'file') == 2
    load('Output_step2.mat');
else
    load('Output_step1.mat');
end

% mixing audio sample S1 with noise sample S9
noiseIntensity = 1.0;
[S10, Fs10] = mixingSamples(S1, Fs1, S9, Fs9, noiseIntensity, Fs1);

% performing DFT/DCT for mixed sample S10 and calculating its frequency spectrum
typeOfTransformation = 'FFT';
[Y, f, fSPE] = computeTransformation(S10, Fs10, typeOfTransformation);
eval(['S10_' typeOfTransformation ' = Y;']);


eval(['S10_' typeOfTransformation '_freq = f;']);
eval(['S10_' typeOfTransformation '_SPE = fSPE;']);

% designing and applying filter in frequency space
if strcmp(typeOfTransformation, 'DFT')
    filter_DFT = designDeNoisingFilter(S10_DFT, S9_DFT);
    S10_DFT_filtered = S10_DFT.*filter_DFT;
    S10_DFT_filtered_SPE = calculateSpectrum(S10_DFT_filtered);
    S10_DFT_filtered_Inversed = real(computeInverseTransformation(S10_DFT_filtered, 'DFT'));
elseif strcmp(typeOfTransformation, 'DCT')
    filter_DCT = designDeNoisingFilter(S10_DCT, S9_DCT);
    S10_DCT_filtered = S10_DCT.*filter_DCT;
    S10_DCT_filtered_SPE = calculateSpectrum(S10_DCT_filtered);
    S10_DCT_filtered_Inversed = computeInverseTransformation(S10_DCT_filtered, 'DCT');
else
    filter_FFT = designDeNoisingFilter(S10_FFT, S9_FFT);
    S10_FFT_filtered = S10_FFT.*filter_FFT;
    S10_FFT_filtered_SPE = calculateSpectrum(S10_FFT_filtered);
    S10_FFT_filtered_Inversed = real(computeInverseTransformation(S10_FFT_filtered, 'FFT'));
end

clearvars Y f fSPE;
save('Output_step2.mat');

%% Plotting: selected samples
%
%load('Output_step2.mat');
samples = { S3, S5, S1, S10 };
samplingRates = { Fs3, Fs5, Fs1, Fs10 };
titles = {'(a)', '(b)', '(c)', '(d)'};
h = plotSamplesInSubplots(samples, samplingRates, titles);
pause;

print(h, '-dpng', '-r600', '-noui', fullfile(projectPath, 'samples_amplitude.png'));
close(h);

%% Plotting: frequency spectra for selected samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'(a)', '(b)', '(c)', '(d)'};
if strcmp(typeOfTransformation, 'DFT')
    freqs = { S3_DFT_freq, S5_DFT_freq, S1_DFT_freq, S10_DFT_freq };
    freqSPE = { S3_DFT_SPE, S5_DFT_SPE, S1_DFT_SPE, S10_DFT_SPE };
    xlimits = [0 5000];
    ylimits = [0 0.1];
elseif strcmp(typeOfTransformation, 'DCT')
    freqs = { S3_DCT_freq, S5_DCT_freq, S1_DCT_freq, S10_DCT_freq };
    freqSPE = { S3_DCT_SPE, S5_DCT_SPE, S1_DCT_SPE, S10_DCT_SPE };
    xlimits = [0 6000];
    ylimits = [0 0.0004];
else
    freqs = { S3_FFT_freq, S5_FFT_freq, S1_FFT_freq, S10_FFT_freq };
    freqSPE = { S3_FFT_SPE, S5_FFT_SPE, S1_FFT_SPE, S10_FFT_SPE };
    xlimits = [0 5000];
    ylimits = [0 0.1];
end
h = plotSpectraInSubplots(freqs, freqSPE, titles, xlimits, ylimits);
pause;

file = fullfile(projectPath, ['samples_freqSpectrum_' typeOfTransformation '.png']);
print(h, '-dpng', '-r600', '-noui', file);
close(h);

%% Plotting: comparing spectra of mixed with filtered and original samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'mixed sample', 'filtered sample', 'original sample'};
if strcmp(typeOfTransformation, 'DFT')


    freqSPE = { S10_DFT_SPE, S10_DFT_filtered_SPE, S1_DFT_SPE };
    freqs = S10_DFT_freq;
elseif strcmp(typeOfTransformation, 'DCT')
    freqSPE = { S10_DCT_SPE, S10_DCT_filtered_SPE, S1_DCT_SPE };
    freqs = S10_DCT_freq;
else
    freqSPE = { S10_FFT_SPE, S10_FFT_filtered_SPE, S1_FFT_SPE };
    freqs = S10_FFT_freq;
end
h = compareSpectra(freqSPE, freqs, titles, typeOfTransformation);
pause;

file = fullfile(projectPath, ['comparison_filteredFreqSpectrum_' typeOfTransformation '.png']);
print(h, '-dpng', '-r600', '-noui', file);
close(h);

%% Plotting: comparing mixed sample with filtered and original samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'mixed sample', 'filtered sample', 'original sample'};
samplingRates = {Fs10, Fs10, Fs1};
if strcmp(typeOfTransformation, 'DFT')
    samples = {S10, S10_DFT_filtered_Inversed, S1};
elseif strcmp(typeOfTransformation, 'DCT')
    samples = {S10, S10_DCT_filtered_Inversed, S1};
else
    samples = {S10, S10_FFT_filtered_Inversed, S1};
end
h = compareSamples(samples, samplingRates, titles, typeOfTransformation);
pause;

file = fullfile(projectPath, ['comparison_filteredSample_' typeOfTransformation '.png']);
print(h, '-dpng', '-r600', '-noui', file);
close(h);

Implementation of DFT, DCT, FFT: computeTransformation.m

function [Y, f, fSPE] = computeTransformation(X, Fs, typeOfTransform)
%
m = length(X); % sample length
n = pow2(nextpow2(m)); % length used for transformation

X = cat(1, reshape(X, m, 1), zeros(n-m, 1)); % zero-pad to length n
Y = zeros(n, 1);

if strcmp(typeOfTransform, 'DFT') % perform DFT
    for k = 0:n-1
        Y(k+1) = exp(-2*pi*1i*(0:n-1)/n*k)*X;
    end
elseif strcmp(typeOfTransform, 'DCT') % perform DCT
    for k = 0:n-1
        Y(k+1) = cos(pi/n*(0.5:n)*k)*X;
    end
    Y = [Y(1)/sqrt(n); Y(2:end)/sqrt(n/2)]; % normalization factor for DCT
else
    Y = fft(X, n); % perform built-in FFT
end

f = Fs/n *(0:(n-1)/2)'; % frequency vector
fSPE = calculateSpectrum(Y); % frequency spectrum
%
end


Implementation of IDFT, IDCT, IFFT: computeInverseTransformation.m

function X = computeInverseTransformation(Y, typeOfTransform)
%
n = length(Y); % length used for transformation
X = zeros(n, 1);

if strcmp(typeOfTransform, 'DFT') % perform inverse DFT
    for k = 0:n-1
        X(k+1) = 1/n*exp(2*pi*1i*(0:n-1)/n*k)*Y;
    end
elseif strcmp(typeOfTransform, 'DCT') % perform inverse DCT
    Y = [Y(1)/sqrt(n); Y(2:end)/sqrt(n/2)];
    for k = 0:n-1
        X(k+1) = cos(pi/n*(0.5:n)*k)*Y;
    end
else
    X = ifft(Y, n); % perform built-in inverse FFT
end
%
end

Mixing two Samples: mixingSamples.m

function [x, sr] = mixingSamples(x1, sr1, x2, sr2, ratio, sr)
%
m1 = length(x1);
m2 = length(x2);
m = min(m1, m2);

if sr > min(sr1, sr2)
    sr = min(sr1, sr2);
end

t1 = 0:1/sr1:((m1-1)/sr1);
t2 = 0:1/sr2:((m2-1)/sr2);
t = 0:1/sr:((m-1)/sr);

x = spline(t1, x1, t') + spline(t2, x2, t') * ratio;
end

Self-designed Filter: designDeNoisingFilter.m

function filter = designDeNoisingFilter(fSample, fNoise)
%
fSample_amp = abs(fSample)/length(fSample);
fNoise_amp = abs(fNoise)/length(fNoise);

amp_diff = fSample_amp/sum(fSample_amp) - fNoise_amp/sum(fNoise_amp);
filter = amp_diff > 0;
%
end
