VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM USING VOWEL AND NON VOWEL REGION

04/15/2023

Under the guidance of Dr. G. Pradhan, NIT Patna (ECE Dept.)

Presented by:
Niranjan Kumar (1104087)
Kamlesh Kalvaniya (1104080)
Piyush Kumar (1104091)
B.Tech 4th year, ECE Dept., N.I.T. Patna

Development of a voice password based speaker verification system using vowel and non-vowel like regions


OUTLINE

1. Introduction
2. Literature Review
3. Work done till last presentation
4. Work done after last presentation
   • Dynamic Time Warping
   • Empirical Mode Decomposition
   • VLROP Detection
5. Experimental Result
6. Summary & Future Plan


INTRODUCTION

• Speaker verification is the computing task of validating a person's identity claim from his/her voice.

• A voice password based speaker verification system is a special case of a text-dependent system in which the utterance remains the same in the training and testing phases for a particular speaker.


Literature Review: Existing Methods of Speaker Verification

Model Based Approach
o Gaussian Mixture Model (GMM)
  A Gaussian mixture model assumes the feature vectors follow a mixture of Gaussian distributions, characterized by mean vectors, covariance matrices and weights.
o UBM-GMM (Adapted GMM)
  The speaker model is adapted from a model trained on a large set of background speakers.
o Hidden Markov Model (HMM)
  A hidden Markov model is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data.

Template Matching Approach
o Dynamic Time Warping (DTW)
  The input and the template are sequences of feature vectors. The aim is to find the distortion between the input and the template.
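For reference, the GMM density mentioned above can be written as follows. The notation (M mixture components, weights w_i, means \mu_i, covariance matrices \Sigma_i, feature vector x, model \lambda) is the usual textbook notation and is not taken from the slides.

```latex
p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\right),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0
```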


Motivation

• GMM and adapted GMM are probabilistic models that require a large amount of training data, so they are not suitable for a password based system, which has only a few seconds of training data.
• Adapted GMM requires a huge set of background speakers.
• HMM also requires a large amount of data.
• DTW is a simple approach, but it requires exact knowledge of the start and end points of the utterance, which is frequently difficult to obtain under noisy conditions.
• Objective of the work: to devise a system with less complexity that also gives better performance under noisy conditions.


Work done up to last semester

GMM based speaker verification system using NIST database

Equal error rate (%)

Gaussian size   Test 15 sec,    Test full,      Test 15 sec,   Test full,
                Train 15 sec    Train 15 sec    Train full     Train full
8               34.90           34.24           33.18          27.70
16              33.05           32.28           30.50          25.67
32              32.46           32.94           28.78          23.67
64              32.82           33.06           27.42          22.05


Summary of Last presentation

Performance is sensitive to duration of training and testing data.

Performance is more sensitive to duration of training data compared to testing data.

A GMM based speaker verification system may not be suitable for limited data.


Development of Dynamic Time Warping (DTW) based speaker verification system

• DTW is a template matching technique.
• The test features and the template (model) are sequences of feature vectors.
• The aim is to find the distortion between the test features and the template.
• They may have different lengths.
• DTW uses dynamic programming to find the optimal path for normalizing the length variation.


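Why Dynamic Time Warping?

[Figure: two time series plotted against time, comparing point-to-point alignment (i with i) and elastic alignment (i with i+2)]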

Any distance (Euclidean, Manhattan, …) which aligns the i-th point on one time series with the i-th point on the other will produce a poor similarity score.

A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis.
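The elastic alignment described above can be sketched with a small dynamic programming routine. This is a minimal illustration, not the system used in the experiments: the Euclidean local cost, the three-way step pattern, the normalization by the summed lengths and the name `dtw_distance` are all assumptions.

```python
import numpy as np

def dtw_distance(test, template):
    """Length-normalized DTW distortion between two feature sequences.

    test, template: arrays of shape (n_frames, n_dims); frame counts may differ.
    """
    test = np.asarray(test, dtype=float)
    template = np.asarray(template, dtype=float)
    n, m = len(test), len(template)
    # Local cost: Euclidean distance between every test/template frame pair.
    local = np.linalg.norm(test[:, None, :] - template[None, :, :], axis=2)
    # acc[i, j]: minimum accumulated cost of aligning the first i test
    # frames with the first j template frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # test frame advances alone
                acc[i, j - 1],      # template frame advances alone
                acc[i - 1, j - 1],  # both advance
            )
    # Normalize by the summed lengths (an assumed convention).
    return acc[n, m] / (n + m)
```

A sequence and a time-stretched copy of it receive zero distortion under this measure, whereas a rigid i-to-i comparison is not even defined when the lengths differ.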



Development of DTW based baseline System

• Speaker verification data collection
  • Data from 100 speakers was collected.
  • Each speaker uttered his/her full name or roll number as the voice password, which was recorded over the phone.
  • Number of male speakers: 81; number of female speakers: 19
  • Duration of data: 2-5 sec
  • Number of training sessions: 3; number of testing sessions: 5, with a minimum gap of one day between sessions
  • During the verification task each speaker was compared with his/her own template and with 19 other impostor speakers.
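A minimal sketch of how an equal error rate (EER) could be computed from the genuine and impostor trials of such a protocol. It assumes distance-like scores (lower means a better match, as with DTW) and a simple threshold sweep over the observed scores; the slides do not state how the EER was actually computed, and `equal_error_rate` is a hypothetical helper name.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate EER (%) from distance scores (lower = better match)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 0.0
    for t in thresholds:
        frr = np.mean(genuine > t)    # genuine claims rejected
        far = np.mean(impostor <= t)  # impostor claims accepted
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer
```

With well separated score sets the function returns 0; the more the genuine and impostor distances overlap, the higher the returned value.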


Experimental results for DTW based system for Voice password database

Equal error rate (%)

                Train session 1                Train session 2                Train session 3
              Start to end      VAD          Start to end      VAD          Start to end      VAD
Test session  13     39     13     39      13     39     13     39      13     39     13     39
1             25     28     14     14.6    25     27.9   14.7   15      25.2   26.3   14.7   15.7
2             31     34     17     18.9    30     33.6   18     19.3    29.4   32.6   20     21.05
3             28     29     18     19      29     31     18.7   20      31.5   32.6   20     21.94
4             31     32     15     16      32     32.3   16.1   17.5    30.5   32.6   16.8   18.94
5             31     33     17     18      32     34     18.2   20.7    34.7   35.7   18.9   21.05


Experimental results for GMM based system for Voice password database

Equal error rate (%)

Test session   Train session 1   Train session 2   Train session 3
1              17.9              19.7              21.2
2              18.34             18.1              20.3
3              18.69             19.6              18.7
4              19.8              20.1              18.9
5              20.6              20.6              20


DTW using only mean vector of GMM

Equal error rate (%)

Test session   Train session 1   Train session 2   Train session 3
1              15.9              19.7              21.2
2              16.26             18.1              20.3
3              18.69             19.6              18.7
4              19.8              20.1              18.9
5              20.6              20.6              20


Verification result comparison and discussion

• DTW based system best EER: 14%
• GMM based system best EER: 17.9%
• DTW using mean vectors of GMM best EER: 15.9%
• The best result was obtained for DTW.
• Performance of the DTW based system depends on the detection of end points.
• Performance of the DTW based system may be improved by robust end point detection and by emphasizing more speaker-specific regions.

Hence the motivation for the present work.


Development of a speaker verification system using only vowel regions to improve performance

Speech is produced as a sequence of changes, known as events:
• Time-varying nature of the excitation source
• Time-varying nature of the vocal tract system

The vowel is the most important event in the speech signal.
• Vowel regions: high SNR, nearly periodic, long duration, lower zero crossing rate, quasi-stationary.

Important information can be extracted using knowledge of these events:
• Vocal tract information
• Excitation information

VOP (vowel onset point): the instant at which the onset of a vowel takes place in the speech signal.
• Vowel regions can be selected using VOPs.
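Two of the vowel-region cues listed above, high energy (used here as a rough stand-in for high SNR) and low zero crossing rate, can be measured per frame as in the sketch below. The frame length of 160 samples and hop of 80 samples (20 ms and 10 ms at 8 kHz) and the name `frame_features` are assumptions for illustration, not values from the slides.

```python
import numpy as np

def frame_features(signal, frame_len=160, hop=80):
    """Short-time energy and zero crossing rate for each frame of a signal."""
    signal = np.asarray(signal, dtype=float)
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energies.append(np.mean(frame ** 2))
        # Fraction of adjacent sample pairs whose signs differ.
        positive = frame >= 0
        zcrs.append(np.mean(positive[1:] != positive[:-1]))
    return np.array(energies), np.array(zcrs)
```

Frames from a strong, nearly periodic low-frequency segment show high energy and a low zero crossing rate, while weak noise-like frames show the opposite pattern.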


VOP and VEP events

VOP (circle) and VEP (vowel end point, arrowhead) events for the utterance /the sea/


VOP Events

Typical cases in which a VOP occurs:
• Isolated vowel
• Consonant vowel and consonant cluster vowel (Cn V, where n > 1)
• Vowel consonant and vowel consonant cluster (V Cm, where m > 1)
• In the form Cn V Cm, where n ≥ 1 and m ≥ 1
• In the form of a diphthong, Cn Vk Cm, where n ≥ 1, k ≥ 1 and m ≥ 1


Issues with existing VOP detection methods
• Performance of the feature based approach depends on
  • Robustness of the feature to the recording conditions
  • Performance varies depending on the recording conditions
  • The literature suggests that excitation based features are more robust than vocal tract based features
• Performance of the statistical model based approach depends on
  • Choice of classifier
  • Selection of features
  • Availability of similar labelled speech data
• Since it is difficult to obtain speech data from similar recording conditions, most of the existing methods use the feature based approach.
• VOP detection methods fail for most vowel-semivowel clusters
  • Due to the similarity in speech production mechanism between vowels and semivowels.
• Objective of this work:
  • Explore a method to enhance the detection of VOPs in the case of semivowel-vowel clusters and diphthongs.


Empirical Mode Decomposition

• Empirical Mode Decomposition (EMD)
  • Data-driven, multi-scale, robust to non-stationary signals
  • A signal is treated as fast oscillations superimposed on slow oscillations
  • The local mean of each decomposed signal is zero and the signal is symmetric about its local mean
  • The impact of noise on the signal can be reduced
• A decomposed signal is defined as an Intrinsic Mode Function (IMF) if it satisfies the following conditions:
  • The number of extrema and the number of zero crossings differ by at most one.
  • The local average is zero, i.e. the mean of the upper envelope and the lower envelope is zero.

EMD Algorithm
• For a given input signal X to decompose:
  1. Identify the local extrema of the signal X.
  2. Construct the upper envelope Emax and the lower envelope Emin by interpolating the maxima and the minima, respectively.
  3. Approximate the local average by the envelope mean Em, the average of the two envelopes Emax and Emin.
  4. Compute the candidate mode h1 = X - Em.
  5. If h1 is an IMF, take imf = h1 and the residue signal r = X - imf. Otherwise repeat the above steps with h1 as the input.
• If r still contains an oscillatory mode, set r as the input signal and repeat the steps.
• A signal S(n) can be represented through its IMFs as
  S(n) = \sum_{k=1}^{K} imf_k(n) + r(n)
  where K is the number of IMFs and r(n) is the residue.
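The sifting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: envelopes are built with linear interpolation (`np.interp`) rather than the cubic splines usually used for EMD, sifting stops after a fixed number of iterations instead of testing the IMF conditions, and the names `local_extrema`, `sift` and `emd` are hypothetical.

```python
import numpy as np

def local_extrema(x):
    """Indices of the strict local maxima and minima of a 1-D signal."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

def sift(x, n_sift=10):
    """Extract one candidate IMF by repeatedly removing the envelope mean."""
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(n_sift):  # fixed iteration count (assumed stopping rule)
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        e_max = np.interp(t, maxima, h[maxima])  # upper envelope (linear)
        e_min = np.interp(t, minima, h[minima])  # lower envelope (linear)
        h = h - (e_max + e_min) / 2.0            # subtract envelope mean Em
    return h

def emd(x, max_imfs=5):
    """Decompose x into a stack of IMFs plus a residue."""
    x = np.asarray(x, dtype=float)
    imfs, residue = [], x.copy()
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema left to build envelopes
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf  # the residue becomes the next input
    return np.array(imfs), residue
```

By construction the extracted components and the final residue add back to the input, which mirrors the representation of S(n) as a sum of IMFs plus r(n).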


MOTIVATION FOR USE OF EMD

• Environmental effect on the speech data can be deemphasized

• Excitation information present in different frequency range can be analyzed separately.

• Weak transitions in the case of nasal-vowel and semivowel-vowel clusters and diphthongs can be emphasized.


Flowchart for VOP detection


VOP EVIDENCE PLOT


Experiment

Speech data
• Complete TIMIT database
• Number of male speakers: 438
• Number of female speakers: 192
• Sampling frequency: 8 kHz
• The VOP experiment was performed on 100 speakers.


Performance measure

• Identification rate (IR): percentage of reference VOPs (VEPs) that are matched by detected VOPs (VEPs) within vowel regions

• Spurious rate (SR): percentage of detected VOPs (VEPs) that are detected outside vowel regions
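One way these two measures could be computed is sketched below. The matching rule (a reference VOP counts as identified when some detection lies within a tolerance of it), the interval representation of vowel regions and the name `vop_scores` are assumptions; the slides do not give the exact scoring procedure. All times are in milliseconds.

```python
def vop_scores(reference_vops, vowel_regions, detected_vops, tolerance):
    """Return (identification rate %, spurious rate %).

    reference_vops: reference onset times
    vowel_regions:  list of (start, end) times of the vowel regions
    detected_vops:  detected onset times
    tolerance:      maximum |detected - reference| that counts as a match
    """
    # Reference VOPs that have at least one detection within the tolerance.
    matched = sum(
        any(abs(d - r) <= tolerance for d in detected_vops)
        for r in reference_vops
    )
    # Detections that fall outside every vowel region.
    spurious = sum(
        not any(start <= d <= end for start, end in vowel_regions)
        for d in detected_vops
    )
    ir = 100.0 * matched / len(reference_vops) if reference_vops else 0.0
    sr = 100.0 * spurious / len(detected_vops) if detected_vops else 0.0
    return ir, sr
```

Widening the tolerance can only raise the identification rate, while the spurious rate in this sketch does not depend on the tolerance.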


Performance of proposed VOP detection method

                Detection rate (%)                 Spurious rate (%)
Method      10 ms   20 ms   30 ms   40 ms
Baseline    47      74      78      88         15
Proposed    62      83      90      96         13

Observation:
• Performance of the proposed method is better than the baseline in terms of both detection rate and spurious rate.
• 83% detection is achieved within a 20 ms window, which is beneficial when comparing strings of vowel regions.


Contribution Till Now

• Significance of EMD for speech analysis is demonstrated.
• A new method for detection of VOPs is proposed.
• Performance of the proposed method is compared with the best method available in the literature.
• Experimental results show that the proposed method performs better in both detection rate and spurious rate.


Future work

• Development of DTW based speaker verification system using only vowel regions.

• Exploration for a comparison technique to replace DTW.


THANK YOU
