Text Prompted Remote Speaker Authentication : Joint Speech and Speaker Recognition/Verification System Mid-term Project Presentation

DESCRIPTION

Joint speech and speaker recognition, using a Hidden Markov Model with Vector Quantization (HMM/VQ) for speaker-independent speech recognition and a Gaussian Mixture Model (GMM) for text-independent speaker recognition. MFCC (Mel-Frequency Cepstral Coefficients) are used for feature extraction (with delta, delta-delta and energy: 39 coefficients). Developed in Java with a client/server architecture; the web interface was developed in Adobe Flex. This project was done at TU, IOE - Pulchowk Campus, Nepal. For more details visit http://ganeshtiwaridotcomdotnp.blogspot.com

ABSTRACT OF PROJECT

A biometric is a physical characteristic unique to each individual. It has a very useful application in authentication and access control.

The designed system is a text-prompted version of a voice biometric. It incorporates a text-independent speaker verification system and a speaker-independent speech verification system, implemented independently. The foundation for this joint system is that the speech signal conveys both the speech content and the speaker identity. Such systems are more secure against playback attacks, since the word to speak during authentication is not set in advance.

During the course of the project, various digital signal processing and pattern classification algorithms were studied. Short-time spectral analysis was performed to obtain MFCC, energy and their deltas as features. The feature extraction module is the same for both systems. Speaker modeling was done by GMM, and a left-to-right discrete HMM with VQ was used for isolated word modeling. The results of both systems were combined to authenticate the user.

The speech model for each word was pre-trained using utterances of 45 English words. The speaker model was trained on utterances of about 2 minutes each by 15 speakers. For individual words, the recognition rate of the speech recognition system is 92% and that of the speaker recognition system is 66%. For longer utterances (>5 sec), the recognition rate of the speaker recognition system improves to 78%.

MAJOR PROJECT MID-TERM PRESENTATION :

SPEAKER VERIFICATION FOR REMOTE AUTHENTICATION

Members:

Ganesh Tiwari (063BCT510)

Madhav Pandey (063BCT514)

Manoj Shrestha (063BCT518)

Supervisor :

Dr. Subarna Shakya

Associate Professor

INTRODUCTION

Voice biometric system for user login

Text-prompted system: the claimant is asked to speak a prompted text

Speech and speaker recognition/verification

More secure against playback attacks

Web application

Client (Adobe Flex): voice capture, preprocessing and feature extraction

Server (Java): training / classification

BlazeDS RPC for Java-Flex connectivity

BLOCK DIAGRAM OF SPEAKER / SPEECH RECOGNITION SYSTEM

Signal Capture and Pre-Processing

CAPTURE AND PREPROCESSING

Acquire the audio signal (analog-to-digital conversion)

Make it suitable for feature extraction

Capture

PCM Extract

Silence Removal

Pre-Emphasis

Framing

Windowing

CAPTURE AND PREPROCESSING : CAPTURE

Sampling rate 22050 Hz, 16-bit signed, little-endian, mono, uncompressed PCM

CAPTURE AND PREPROCESSING : PCM EXTRACT
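A minimal sketch of this step, assuming the captured audio reaches the code as a raw byte array in the format given on the capture slide (16-bit signed, little-endian, mono). The class and method names are illustrative and are not taken from the project.

```java
final class PcmDecoder {
    /** Converts 16-bit signed little-endian mono PCM bytes to samples in [-1, 1). */
    static double[] toSamples(byte[] pcm) {
        int n = pcm.length / 2;              // two bytes per sample; a trailing odd byte is ignored
        double[] samples = new double[n];
        for (int i = 0; i < n; i++) {
            int lo = pcm[2 * i] & 0xFF;      // low byte comes first and is treated as unsigned
            int hi = pcm[2 * i + 1];         // high byte keeps its sign
            samples[i] = ((hi << 8) | lo) / 32768.0;
        }
        return samples;
    }
}
```

Scaling by 32768 is one common convention; keeping the raw integer values would work equally well for the later stages.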

CAPTURE AND PREPROCESSING : SILENCE REMOVAL

Algorithm described in the paper ‘A new method for silence removal and endpoint detection’ †

† G. Saha, Sandipan Chakroborty and Suman Senapati, Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India

[Figure: two waveform plots, amplitude (-1 to 1) vs. sample index; the first spans about 9 x 10^4 samples, the second about 4 x 10^4 samples]

CAPTURE AND PREPROCESSING : PRE-EMPHASIS

Boosting the high frequency energy

In time domain, y[n] = x[n]−αx[n−1], 0.9 ≤ α ≤ 1.0
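The filter above can be sketched as follows. The value of α is passed in by the caller, and treating the sample before the first one as zero is an assumption made here, not something the slide states.

```java
final class PreEmphasis {
    /** Applies y[n] = x[n] - alpha * x[n-1] to the whole signal. */
    static double[] apply(double[] x, double alpha) {
        double[] y = new double[x.length];
        if (x.length == 0) {
            return y;
        }
        y[0] = x[0];                         // assumption: x[-1] is taken as 0
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - alpha * x[n - 1];
        }
        return y;
    }
}
```

A constant (low-frequency) input is almost cancelled, while rapid sample-to-sample changes pass through, which is the high-frequency boost the slide refers to.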

[Figure: two magnitude spectra, |Y(f)| vs. frequency (Hz), 0 to 12000 Hz]

CAPTURE AND PREPROCESSING : FRAMING

The speech signal is approximately stationary (in its statistical properties) over 10-30 ms

Frames of 23 ms each, overlapping by 50%, are used
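A sketch of frame blocking under these numbers: at 22050 Hz, 23 ms is about 507 samples, and a 50% overlap means each frame starts half a frame after the previous one. Dropping trailing samples that do not fill a whole frame is an assumption of this sketch; the names are illustrative.

```java
import java.util.Arrays;

final class Framer {
    /** Frame length in samples for a given sample rate and frame duration. */
    static int frameLength(double sampleRateHz, double frameMs) {
        return (int) Math.round(sampleRateHz * frameMs / 1000.0);
    }

    /** Splits the signal into frames that overlap by half a frame. */
    static double[][] split(double[] signal, int frameLength) {
        int hop = frameLength / 2;           // 50% overlap
        if (hop == 0 || signal.length < frameLength) {
            return new double[0][];
        }
        int count = 1 + (signal.length - frameLength) / hop;
        double[][] frames = new double[count][];
        for (int f = 0; f < count; f++) {
            int start = f * hop;
            frames[f] = Arrays.copyOfRange(signal, start, start + frameLength);
        }
        return frames;
    }
}
```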

CAPTURE AND PREPROCESSING : WINDOWING

Windowing is applied to each frame of the frame-blocked signal

A Hamming window is used
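For reference, a sketch using the usual Hamming definition w[n] = 0.54 - 0.46 cos(2πn / (N - 1)). The slide does not give the coefficients, so this standard form is an assumption; the class name is illustrative.

```java
final class HammingWindow {
    /** Returns the Hamming window coefficients for a frame of the given length. */
    static double[] coefficients(int length) {
        double[] w = new double[length];
        if (length == 1) {
            w[0] = 1.0;                      // degenerate case: avoid division by zero
            return w;
        }
        for (int n = 0; n < length; n++) {
            w[n] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * n / (length - 1));
        }
        return w;
    }

    /** Multiplies a frame element-wise by the window, returning a new array. */
    static double[] apply(double[] frame) {
        double[] w = coefficients(frame.length);
        double[] out = new double[frame.length];
        for (int n = 0; n < frame.length; n++) {
            out[n] = frame[n] * w[n];
        }
        return out;
    }
}
```

The taper towards the frame edges reduces the spectral leakage that an abrupt cut would introduce before the Fourier transform.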

[Figure: plot titled "Hamming Window", followed by two signal plots over sample indices 0 to 1200]

Feature Extraction

FEATURE EXTRACTION

Transform the input audio signal into a sequence of acoustic feature vectors

MFCC : Mel-Frequency Cepstral Coefficients as features

Perceptual approach: the human ear processes audio signals on the Mel scale

Mel scale: approximately linear up to 1 kHz and logarithmic above 1 kHz

MFCC gives the distribution of energy in the Mel frequency bands

Calculated for each frame

Fourier Transform

Mel Filter

Log

IFT : DCT

Cepstral Mean Subtraction

Energy and Deltas

FEATURE EXTRACTION : FOURIER TRANSFORM

Gives information about the amount of energy in each frequency band

The FFT is used

[Figure: time-domain waveform (amplitude vs. sample index) and a magnitude spectrum, |Y(f)| vs. frequency (Hz)]

FEATURE EXTRACTION : MEL FILTER

We used a filter bank of triangular filters spaced on the Mel scale
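A sketch of how the filter positions could be derived. It assumes the widely used conversion mel = 2595 log10(1 + f/700) and equal spacing of the filter edges on the Mel scale; the number of filters and the frequency range are parameters chosen by the caller, not values from the slides.

```java
final class MelScale {
    static double hzToMel(double hz) {
        return 2595.0 * Math.log10(1.0 + hz / 700.0);
    }

    static double melToHz(double mel) {
        return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
    }

    /**
     * Returns numFilters + 2 edge frequencies in Hz, equally spaced on the Mel scale.
     * Triangular filter i rises from edges[i] to its peak at edges[i + 1]
     * and falls back to zero at edges[i + 2].
     */
    static double[] filterEdgesHz(double lowHz, double highHz, int numFilters) {
        double lowMel = hzToMel(lowHz);
        double highMel = hzToMel(highHz);
        double[] edges = new double[numFilters + 2];
        for (int i = 0; i < edges.length; i++) {
            edges[i] = melToHz(lowMel + i * (highMel - lowMel) / (numFilters + 1));
        }
        return edges;
    }
}
```

Because the spacing is uniform in Mel, the filters come out narrow at low frequencies and progressively wider at high frequencies.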

FEATURE EXTRACTION : MEL FILTER (CONTD.)

Mel Filter

[The filter-bank equation and its symbol definitions were not recovered from the slide]

FEATURE EXTRACTION : LOG, IFT(DCT)

Log -> DCT -> MFCC

FEATURE EXTRACTION : CEPSTRAL MEAN SUBTRACTION

CMS: minimizes the channel effect
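A sketch of CMS in its basic form: the mean of each cepstral coefficient over all frames of an utterance is subtracted from every frame. Computing the mean per utterance is an assumption here, and the names are illustrative.

```java
final class CepstralMeanSubtraction {
    /** Subtracts the per-coefficient mean (taken over all frames) from each frame. */
    static double[][] apply(double[][] cepstra) {
        if (cepstra.length == 0) {
            return new double[0][];
        }
        int dims = cepstra[0].length;
        double[] mean = new double[dims];
        for (double[] frame : cepstra) {
            for (int d = 0; d < dims; d++) {
                mean[d] += frame[d] / cepstra.length;
            }
        }
        double[][] out = new double[cepstra.length][dims];
        for (int t = 0; t < cepstra.length; t++) {
            for (int d = 0; d < dims; d++) {
                out[t][d] = cepstra[t][d] - mean[d];
            }
        }
        return out;
    }
}
```

A fixed channel multiplies the spectrum and so becomes an additive constant in the cepstral domain; removing the mean removes that constant.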

FEATURE EXTRACTION : ENERGY AND DELTAS

Added for completeness of the feature vector and to achieve a high recognition rate

An energy feature

A delta (velocity) feature and a double-delta (acceleration) feature

Deltas are calculated by linear regression over a regression window M
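A sketch of the regression-based delta, assuming the common form d_t = Σ_{m=1..M} m (c_{t+m} - c_{t-m}) / (2 Σ_{m=1..M} m²). Repeating the first and last frames at the edges is an assumption of this sketch; M must be at least 1.

```java
final class DeltaFeatures {
    /** Computes delta features by linear regression over a window of +/- m frames. */
    static double[][] compute(double[][] features, int m) {
        int frames = features.length;
        if (frames == 0) {
            return new double[0][];
        }
        int dims = features[0].length;
        double norm = 0.0;
        for (int k = 1; k <= m; k++) {
            norm += 2.0 * k * k;
        }
        double[][] delta = new double[frames][dims];
        for (int t = 0; t < frames; t++) {
            for (int d = 0; d < dims; d++) {
                double sum = 0.0;
                for (int k = 1; k <= m; k++) {
                    int ahead = Math.min(t + k, frames - 1);   // clamp at the last frame
                    int behind = Math.max(t - k, 0);           // clamp at the first frame
                    sum += k * (features[ahead][d] - features[behind][d]);
                }
                delta[t][d] = sum / norm;
            }
        }
        return delta;
    }
}
```

Applying `compute` to its own output gives the double-delta (acceleration) features.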

COMPOSITION OF FEATURE VECTOR

12 MFCC features

12 delta MFCC features

12 delta-delta MFCC features

1 energy feature

1 delta energy feature

1 delta-delta energy feature

Total: 39 features from each frame

Speaker Recognition/Verification by GMM

GAUSSIAN MIXTURE MODEL

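Parametric probability density function

Based on a clustering technique

M Gaussian components: $p(x \mid \lambda) = \sum_{m=1}^{M} w_m \, b_m(x)$

$x$: a k-dimensional random vector

$w_m$: mixture weight of the m-th component

$b_m(x)$: k-dimensional Gaussian function (pdf)

$b_m(x) = \mathcal{N}(x; \mu_m, \Sigma_m)$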
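A sketch of evaluating such a mixture for scoring. It assumes diagonal covariance matrices (a common simplification that the slides do not confirm) and works in the log domain to avoid underflow. All names are illustrative.

```java
final class DiagonalGmm {
    private final double[] weights;      // mixture weights, one per component
    private final double[][] means;      // [component][dimension]
    private final double[][] variances;  // [component][dimension], diagonal covariance

    DiagonalGmm(double[] weights, double[][] means, double[][] variances) {
        this.weights = weights;
        this.means = means;
        this.variances = variances;
    }

    /** Log of the mixture density at x, computed with the log-sum-exp trick. */
    double logDensity(double[] x) {
        double[] terms = new double[weights.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int m = 0; m < weights.length; m++) {
            double s = Math.log(weights[m]);
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - means[m][d];
                double var = variances[m][d];
                s -= 0.5 * (Math.log(2.0 * Math.PI * var) + diff * diff / var);
            }
            terms[m] = s;
            max = Math.max(max, s);
        }
        double sum = 0.0;
        for (double term : terms) {
            sum += Math.exp(term - max);
        }
        return max + Math.log(sum);
    }

    /** Average per-frame log-likelihood of a sequence of feature vectors. */
    double averageLogLikelihood(double[][] frames) {
        double total = 0.0;
        for (double[] frame : frames) {
            total += logDensity(frame);
        }
        return total / frames.length;
    }
}
```

Averaging over frames makes scores from utterances of different lengths comparable.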

GMM TRAINING

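Goal: estimate the parameters $\lambda$

Method: Maximum Likelihood estimation

Input: $X = \{x_1, \dots, x_T\}$

$P(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)$

Maximize with the Expectation Maximization algorithm

Iterative process: from an initial model $\lambda$, find a new model $\bar{\lambda}$ such that $P(X \mid \bar{\lambda}) \ge P(X \mid \lambda)$

Convergence condition: $P(X \mid \bar{\lambda}) - P(X \mid \lambda) < \epsilon$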

VERIFICATION

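Decision: hypothesis test

H0: the speaker is the claimed speaker

H1: the speaker is an imposter

Based on the likelihood ratio $\Lambda(X) = P(X \mid H_0) / P(X \mid H_1)$

Decision by threshold $\theta$:

$\Lambda(X) < \theta$: reject the identity claim

$\Lambda(X) > \theta$: accept the identity claim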

Speech Recognition by HMM/VQ

HIDDEN MARKOV MODEL : DEFINITION

A Hidden Markov Model (HMM) is a statistical model

An HMM is an extension of a Markov process

An HMM has hidden states, each of which emits observable symbols

HMM model: $\lambda = (A, B, \pi)$

Observed data: feature vectors

Hidden states: phonemes

CODEBOOK GENERATION

K-means clustering: clustering the whole database and generating the codebook

VQ: Vector Quantization is used for mapping each input feature vector to a discrete quantized symbol

A codebook is built; then, for each incoming feature vector:

Compare it to each of the prototype vectors in the codebook

Select the one which is closest (by some distance metric)

Replace the input vector by the index of this prototype vector

The resulting indices form the observation sequence
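The quantization steps can be sketched as below, assuming the codebook has already been trained (for example by k-means) and that the distance metric is squared Euclidean distance. The slides only say "some distance metric", so that choice and the names are assumptions.

```java
final class VectorQuantizer {
    private final double[][] codebook;   // prototype vectors, one per row

    VectorQuantizer(double[][] codebook) {
        this.codebook = codebook;
    }

    /** Index of the prototype vector closest to v (squared Euclidean distance). */
    int nearestIndex(double[] v) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < codebook.length; k++) {
            double dist = 0.0;
            for (int d = 0; d < v.length; d++) {
                double diff = v[d] - codebook[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    /** Maps a sequence of feature vectors to a sequence of codebook indices. */
    int[] quantize(double[][] frames) {
        int[] symbols = new int[frames.length];
        for (int t = 0; t < frames.length; t++) {
            symbols[t] = nearestIndex(frames[t]);
        }
        return symbols;
    }
}
```

The returned index sequence is what a discrete HMM consumes as its observations.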

SPEECH RECOGNITION SYSTEM BY HMM / VQ

HIDDEN MARKOV MODEL : TRAINING

Training by the forward-backward (Baum-Welch) algorithm

The forward-backward algorithm iteratively re-estimates the parameters, improving the probability that the given observations are generated by the new parameters

Three parameters need to be re-estimated:

Initial state distribution: $\pi_i$

Transition probabilities: $a_{i,j}$

Emission probabilities: $b_i(o_t)$

Input is the observation sequence given by VQ

HIDDEN MARKOV MODEL : VERIFICATION/MATCHING

The Viterbi algorithm is used

Input: the observation sequence given by VQ, and the HMM model of each word

The best-matched word is returned
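A sketch of this matching step for discrete HMMs, assuming one trained model per word and a non-empty observation sequence. Scores are kept in the log domain, and the word whose model gives the highest best-path score is returned. Class and method names are illustrative.

```java
final class DiscreteHmm {
    private final double[] pi;    // initial state distribution
    private final double[][] a;   // a[i][j]: transition probability from state i to state j
    private final double[][] b;   // b[i][k]: probability of emitting symbol k in state i

    DiscreteHmm(double[] pi, double[][] a, double[][] b) {
        this.pi = pi;
        this.a = a;
        this.b = b;
    }

    /** Log-probability of the single best state path for the observations (Viterbi). */
    double viterbiLogScore(int[] obs) {
        int n = pi.length;
        double[] delta = new double[n];
        for (int i = 0; i < n; i++) {
            delta[i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < n; i++) {
                    best = Math.max(best, delta[i] + Math.log(a[i][j]));
                }
                next[j] = best + Math.log(b[j][obs[t]]);
            }
            delta = next;
        }
        double score = Double.NEGATIVE_INFINITY;
        for (double v : delta) {
            score = Math.max(score, v);
        }
        return score;
    }
}

final class WordRecognizer {
    /** Returns the index of the word model with the highest Viterbi score. */
    static int bestWord(DiscreteHmm[] models, int[] obs) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int w = 0; w < models.length; w++) {
            double score = models[w].viterbiLogScore(obs);
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }
}
```

Zero probabilities become negative infinity under `Math.log`, so forbidden transitions in a left-to-right model are excluded automatically.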

PROBLEMS FACED

Learning curve

Complex mathematics

Flex & Java connectivity (initially)

Data conversion

REMAINING TASKS

Speech Training Data Collection

Model Training (HMM, GMM)

Module Integration

Testing

Thanks