04/15/2023 1
Development of a voice password based speaker verification system using vowel and non-vowel like regions
Under the guidance of Dr. G. Pradhan, NIT Patna (ECE Dept.)
Presented by:
Niranjan Kumar (1104087)
Kamlesh Kalvaniya (1104080)
Piyush Kumar (1104091)
B.Tech 4th year (ECE Dept.), N.I.T. Patna
04/15/2023 N.I.T. PATNA ECE, DEPTT. 2
OUTLINE
1. Introduction
2. Literature Review
3. Work done till last presentation
4. Work done after last presentation
• Dynamic Time Warping
• Empirical Mode Decomposition
• VLROP Detection
5. Experimental Result
6. Summary & Future Plan
INTRODUCTION
• Speaker verification is the computing task of validating a person's identity claim from his/her voice.
• A voice password based speaker verification system is a special case of a text dependent system, in which the utterance remains the same in the training and testing phases for a particular speaker.
Literature Review
Existing methods of speaker verification
Model based approach
o Gaussian Mixture Model (GMM): assumes the feature vectors follow a mixture of Gaussian distributions, characterized by mean vectors, covariance matrices and weights.
o UBM-GMM (adapted GMM): the speaker model is adapted from a background model built from a large set of background speakers.
o Hidden Markov Model (HMM): a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable data.
Template matching approach
o Dynamic Time Warping (DTW): the input and the template are sequences of feature vectors; the aim is to find the distortion between the input and the template.
Motivation
• GMM and adapted GMM are probabilistic models that require a large amount of training data, so they are not suitable for a password based system that has only a few seconds of training data.
• Adapted GMM requires a huge set of background speakers.
• HMM also requires a huge amount of data.
• DTW is a simple approach, but it requires exact knowledge of the start and end points of the utterance, which is frequently difficult to obtain under noisy conditions.
• Objective of the work: to devise a system with lower complexity that also gives better performance under noisy conditions.
Work done up to last semester
GMM based speaker verification system using the NIST database: equal error rate (%)

Gaussian size   Test 15 s / Train 15 s   Test full / Train 15 s   Test 15 s / Train full   Test full / Train full
8               34.90                    34.24                    33.18                    27.70
16              33.05                    32.28                    30.50                    25.67
32              32.46                    32.94                    28.78                    23.67
64              32.82                    33.06                    27.42                    22.05
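The equal error rate (EER) reported above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below shows one simple way it could be estimated from trial scores. It assumes that a higher score means a better match; the function name and the sample scores are hypothetical and are not taken from the experiments on the slides.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER (in %) by sweeping a threshold over all observed scores.

    Assumption: a higher score means the claim is more likely genuine.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


# Hypothetical scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

For a distance-based system such as DTW, where a lower value means a better match, the distances could be negated before being passed to this function.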
Summary of Last presentation
• Performance is sensitive to the duration of the training and testing data.
• Performance is more sensitive to the duration of the training data than to that of the testing data.
• A GMM based speaker verification system may not be suitable for limited data.
Development of Dynamic Time Warping (DTW) based speaker verification system
• DTW is a template matching technique.
• The test features and the template (model) are sequences of feature vectors.
• The aim is to find the distortion between the test features and the template.
• They may have different lengths.
• DTW uses dynamic programming to find the optimal path that normalizes the length variation.
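The dynamic programming recursion behind DTW can be sketched as follows. This is a minimal illustration, not the implementation used for the reported results: it assumes a Euclidean distance between frames, the basic three-way step pattern and normalization by the summed sequence lengths, none of which is specified on the slides. The name `dtw_distance` and the toy sequences are hypothetical.

```python
import math


def dtw_distance(test, template):
    """Length-normalized DTW distortion between two sequences of feature vectors."""
    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distortion aligning test[:i] with template[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distortion: Euclidean distance between the two frames.
            local = math.dist(test[i - 1], template[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Normalize so that sequences of different length are comparable (assumed choice).
    return cost[n][m] / (n + m)


# Toy one-dimensional "features": the template is a delayed copy of the test.
test_seq = [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
template_seq = [(0.0,), (0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
print(dtw_distance(test_seq, template_seq))
```

Because the warping path may repeat frames, the delayed copy aligns with zero distortion even though the two sequences differ in length, which a frame-by-frame comparison could not achieve.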
Why Dynamic Time Warping?
Any distance (Euclidean, Manhattan, …) which aligns the i-th point on one time series with the i-th point on the other will produce a poor similarity score.
A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis.
Development of DTW based baseline System
• Speaker verification data collection
  • Data of 100 speakers was collected.
  • Each speaker uttered his/her full name or roll number as the voice password, which was recorded over the phone.
  • Number of male speakers: 81; number of female speakers: 19
  • Duration of data: 2-5 s
  • Number of training sessions: 3; number of testing sessions: 5, with a minimum gap of one day between sessions
  • During the verification task each speaker was compared against himself/herself and 19 other impostor speakers.
Experimental results for DTW based system for voice password database (EER, %)

Train session 1
Test   Start to end        VAD
       13       39         13       39
1      25       28         14       14.6
2      31       34         17       18.9
3      28       29         18       19
4      31       32         15       16
5      31       33         17       18

Train session 2
Test   Start to end        VAD
       13       39         13       39
1      25       27.9       14.7     15
2      30       33.6       18       19.3
3      29       31         18.7     20
4      32       32.3       16.1     17.5
5      32       34         18.2     20.7

Train session 3
Test   Start to end        VAD
       13       39         13       39
1      25.2     26.3       14.7     15.7
2      29.4     32.6       20       21.05
3      31.5     32.6       20       21.94
4      30.5     32.6       16.8     18.94
5      34.7     35.7       18.9     21.05
Experimental results for GMM based system for voice password database (EER, %)

Test   Train 1   Train 2   Train 3
1      17.9      19.7      21.2
2      18.34     18.1      20.3
3      18.69     19.6      18.7
4      19.8      20.1      18.9
5      20.6      20.6      20
DTW using only mean vector of GMM (EER, %)

Test   Train 1   Train 2   Train 3
1      15.9      19.7      21.2
2      16.26     18.1      20.3
3      18.69     19.6      18.7
4      19.8      20.1      18.9
5      20.6      20.6      20
Verification result comparison and discussion
• DTW based system, best EER: 14%
• GMM based system, best EER: 17.9%
• DTW using mean vector of GMM, best EER: 15.9%
• The best result was obtained for DTW.
• Performance of the DTW based system depends on the detection of end points.
• Performance of the DTW based system may be improved by robust end point detection and by emphasizing the more speaker specific regions.
• Hence the motivation for the present work.
Development of a speaker verification system using only vowel regions to improve performance
Speech is produced as a sequence of changes, known as events
• Time varying nature of the excitation source
• Time varying nature of the vocal tract system
The vowel is the most important event in the speech signal
• Vowel regions: high SNR, nearly periodic, long duration, lower zero crossing rate, quasi-stationary
Important information can be extracted using knowledge of these events
• Vocal tract
• Excitation information
VOP (vowel onset point): the instant at which the onset of a vowel takes place in the speech signal
• Vowel regions can be selected using VOPs
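Two of the vowel-region properties listed above, high energy (linked to high SNR) and a lower zero crossing rate, can be measured frame by frame. The sketch below only illustrates these two measurements; it is not the VOP detection method of this work. The function name, the frame settings and the synthetic signals are assumptions made for the example.

```python
import math


def frame_features(signal, frame_len, hop):
    """Return a list of (short-time energy, zero crossing rate) per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of consecutive sample pairs whose signs differ.
        crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


# Synthetic stand-ins: a strong slow oscillation versus a weak, rapidly alternating one.
vowel_like = [math.sin(2 * math.pi * n / 40) for n in range(80)]
noise_like = [0.1 * (-1) ** n for n in range(80)]
print(frame_features(vowel_like, 80, 80)[0])
print(frame_features(noise_like, 80, 80)[0])
```

On these synthetic signals the vowel-like frame shows the higher energy and the lower zero crossing rate, in line with the properties listed on the slide.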
VOP and VEP events
Figure: VOP (circle) and VEP (arrowhead) events for the utterance /the sea/
VOP Events
Typical cases in which a VOP occurs
• Isolated vowel
• Consonant vowel and consonant cluster vowel (C^n V, where n > 1)
• Vowel consonant and vowel consonant cluster (V C^m, where m > 1)
• In the form C^n V C^m, where n ≥ 1 and m ≥ 1
• In the form of a diphthong, C^n V^k C^m, where n ≥ 1, k ≥ 1 and m ≥ 1
Issues with existing VOP detection methods
• Performance of the feature based approach depends on
  • Robustness of the feature to the recording conditions
  • Performance varies depending on the recording conditions
  • The literature suggests that excitation based features are more robust than vocal tract based features
• Performance of the statistical model based approach depends on
  • Choice of classifier
  • Selection of features
  • Availability of similar labelled speech data
• Since it is difficult to obtain speech data from similar recording conditions, most of the existing methods use the feature based approach.
• VOP detection methods fail for most vowel-semivowel clusters
  • Due to the similarity in speech production mechanism between vowels and semivowels.
• Objective of this work:
  • Explore a method to enhance the detection of VOPs in the case of semivowel-vowel clusters and diphthongs.
Empirical Mode Decomposition
• Empirical Mode Decomposition (EMD)
  • Data-driven, multi-scale, applicable to non-stationary signals
  • A signal is treated as fast oscillations superimposed on slow oscillations
  • The local mean of each decomposed signal is zero and the signal is symmetric about its local mean
  • The impact of noise on the signal can be reduced
• A decomposed signal is called an Intrinsic Mode Function (IMF) if it satisfies the following conditions
  • The number of extrema and the number of zero crossings are equal or differ by at most one
  • The local average is zero, i.e. the mean of the upper envelope and the lower envelope is zero.
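The first IMF condition can be checked directly on a sampled signal. The sketch below is a minimal illustration, assuming the usual reading that the two counts may be equal or differ by at most one. It does not test the second condition (zero envelope mean), and the function names are hypothetical.

```python
import math


def count_extrema(x):
    """Count strict local maxima and minima of a sampled signal."""
    return sum(
        (x[i - 1] < x[i] > x[i + 1]) or (x[i - 1] > x[i] < x[i + 1])
        for i in range(1, len(x) - 1)
    )


def count_zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(a * b < 0 for a, b in zip(x, x[1:]))


def satisfies_extrema_condition(x):
    """First IMF condition: extrema and zero crossing counts differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1


# A zero-mean sine passes; the same sine riding on an offset does not.
sine = [math.sin(2 * math.pi * n / 20 + 0.1) for n in range(100)]
shifted = [s + 2.0 for s in sine]
print(satisfies_extrema_condition(sine), satisfies_extrema_condition(shifted))
```

The shifted signal keeps all its extrema but never crosses zero, which is the situation the sifting process is meant to remove by subtracting the envelope mean.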
EMD Algorithm
• For a given input signal X to decompose:
  • Identify the local extrema of the signal X.
  • Construct the upper envelope Emax and the lower envelope Emin by interpolating the maxima and the minima, respectively.
  • Approximate the local average by the envelope mean Em, the average of the two envelopes Emax and Emin.
  • Compute the candidate intrinsic mode h1 = X - Em.
  • If h1 is an IMF, take imf = h1 and compute the residue signal r = X - imf. Otherwise repeat the above steps with h1 as the input.
• If r still contains an oscillation mode, set r as the input signal and repeat the steps.
• A signal S(n) can be represented through its IMFs as
  S(n) = \sum_{i=1}^{N} \mathrm{imf}_i(n) + r(n)
  where N is the number of IMFs and r(n) is the residue.
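A single pass of the sifting steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the envelopes are built by linear interpolation and held flat beyond the outermost extrema, purely to keep the example dependency-free (the slides do not say which interpolation is used; cubic splines are a common choice). The helper names are hypothetical, and the outer loops that repeat sifting and extract further IMFs are omitted.

```python
import math


def local_extrema(x):
    """Indices of strict local maxima and minima of x."""
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima


def envelope(x, knots):
    """Piecewise-linear envelope through x at the knot indices, flat at both ends."""
    env = []
    for n in range(len(x)):
        if n <= knots[0]:
            env.append(x[knots[0]])
        elif n >= knots[-1]:
            env.append(x[knots[-1]])
        else:
            # Locate the knots on either side of n and interpolate linearly.
            k = max(j for j in range(len(knots)) if knots[j] <= n)
            left, right = knots[k], knots[k + 1]
            w = (n - left) / (right - left)
            env.append((1 - w) * x[left] + w * x[right])
    return env


def sift_once(x):
    """One sifting step: h1 = X - Em, where Em is the mean of Emax and Emin."""
    maxima, minima = local_extrema(x)
    if not maxima or not minima:
        return None  # no oscillation left: x would be treated as the residue
    e_max = envelope(x, maxima)
    e_min = envelope(x, minima)
    e_mean = [(u + l) / 2 for u, l in zip(e_max, e_min)]
    return [xi - m for xi, m in zip(x, e_mean)]


# A sine riding on a constant offset: one sifting step removes the offset.
x = [2.0 + math.sin(2 * math.pi * n / 20) for n in range(100)]
h1 = sift_once(x)
print(max(abs(h - math.sin(2 * math.pi * n / 20)) for n, h in enumerate(h1)))
```

In a full decomposition, h1 would be tested against the IMF conditions and sifted again if it fails; once an IMF is accepted, it is subtracted from the signal and the procedure restarts on the residue.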
MOTIVATION FOR USE OF EMD
• Environmental effects on the speech data can be de-emphasized.
• Excitation information present in different frequency ranges can be analyzed separately.
• Weak transitions can be emphasized in the case of nasal-vowel clusters, semivowel-vowel clusters and diphthongs.
Flowchart for VOP detection
VOP EVIDENCE PLOT
Experiment
Speech data
• Complete TIMIT database
• Number of male speakers: 438
• Number of female speakers: 192
• Sampling frequency: 8 kHz
• The VOP experiment was performed on 100 speakers.
Performance measure
• Identification rate (IR): percentage of reference VOPs (VEPs) that are matched by detected VOPs (VEPs) within vowel regions
• Spurious rate (SR): percentage of detected VOPs (VEPs) that fall outside vowel regions
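The two measures can be computed from labelled vowel regions and a list of detected VOPs. The slides do not spell out the exact matching rule, so the sketch below makes two assumptions: a reference VOP counts as identified if some detection lies within a tolerance of the vowel onset, and a detection counts as spurious if it lies outside every vowel region (each region extended backwards by the same tolerance). The function name and the sample times are hypothetical.

```python
def vop_scores(vowel_regions, detected, tol):
    """Return (identification rate %, spurious rate %).

    vowel_regions: list of (onset, end) times in seconds; onset is the reference VOP.
    detected: list of detected VOP times in seconds.
    tol: matching tolerance in seconds (for example 0.020 for a 20 ms window).
    """
    # Reference VOPs that have at least one detection within the tolerance.
    matched = sum(
        any(abs(d - onset) <= tol for d in detected) for onset, _ in vowel_regions
    )
    # Detections that fall outside every (tolerance-extended) vowel region.
    spurious = sum(
        not any(onset - tol <= d <= end for onset, end in vowel_regions)
        for d in detected
    )
    ir = 100.0 * matched / len(vowel_regions)
    sr = 100.0 * spurious / len(detected)
    return ir, sr


# Hypothetical labels and detections, for illustration only.
regions = [(0.10, 0.25), (0.50, 0.70)]
detections = [0.11, 0.53, 0.90]
print(vop_scores(regions, detections, 0.02))
print(vop_scores(regions, detections, 0.04))
```

Widening the tolerance raises the identification rate in this toy case, which mirrors how the reported detection rates grow from the 10 ms to the 40 ms window.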
Performance of proposed VOP detection method
Method     Detection rate (%) within              Spurious rate (%)
           10 ms    20 ms    30 ms    40 ms
Baseline   47       74       78       88          15
Proposed   62       83       90       96          13

Observation:
• The proposed method performs better than the baseline in terms of both detection rate and spurious rate.
• 83% detection is achieved within a 20 ms window, which is beneficial when comparing strings of vowel regions.
Contribution till now
• The significance of EMD for speech analysis is demonstrated.
• A new method for the detection of VOPs is proposed.
• The performance of the proposed method is compared with the best method available in the literature.
• Experimental results show that the proposed method provides better performance in all respects.
Future work
• Development of DTW based speaker verification system using only vowel regions.
• Exploration for a comparison technique to replace DTW.
THANK YOU