A Talking Elevator, WS2006 UdS, Speaker Recognition 1
A Talking Elevator
An introduction to the main concepts
of speaker recognition
© Jacques Koreman, NTNU
Q What is the problem?
A Two types of biometrics:
◘ behavioral
◘ physical
Q What causes the problem?
A Variability
◘ Repetitions
◘ Sessions
◘ Channel
◘ Background noise
(Variability across speakers is good; variability within speakers is not.)
Q How does variability affect
◘ speech recognition?
◘ speaker recognition?
A The structure of the acoustic space enhances decoding of the linguistic content of a message.
[Figure: schematic representation of the distribution of phones (fill colours) and speakers (border colours) in the acoustic space]
Q What is the difference between
◘ speaker identification and
◘ speaker verification?
A (Closed-set) speaker identification selects the most likely speaker from a given set. Speaker verification is concerned with ascertaining a claimed identity. Open-set speaker identification combines the two.
Q What is the difference between
◘ text-independent (TI) and
◘ text-dependent (TD)
speaker recognition?
A In TI recognition, the speaker can produce any speech, while in TD recognition the speaker must pronounce a fixed or prompted phrase.
(Dis)advantages? User-friendliness, variability, cooperativeness.
Q What is the best way to select a prompt in TD recognition?
A The prompt can be
◘ fixed … good for finding consistent speaker differences, but impostors know the prompt too.
◘ self-selected … more secret, but users may choose a short, easy-to-guess prompt.
◘ variable … but you need a lot of enrollment data to model all possible contexts.
Q How are the training data selected?
A
◘ Training data should reflect test (= operation) conditions to prevent training-test mismatch.
◘ More data are needed for TI than for TD models.
◘ More training data give better speaker models, but are less user-friendly.
Q How can we deal with noise in the recordings?
A Two ways:
◘ Pre-processing: normalize the signals, e.g. by cepstral mean subtraction (CMS).
◘ Modelling: create multi-condition speaker models based on signals recorded in different environments.
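The CMS step can be sketched in plain Python; this is a toy example with made-up cepstral frames (a real system would operate on MFCC vectors), showing that a constant channel offset vanishes after normalization:

```python
# Cepstral mean subtraction (CMS): subtract the per-coefficient mean
# over all frames. A stationary channel adds a constant offset in the
# cepstral domain, so subtracting the mean removes it.

def cepstral_mean_subtraction(frames):
    """frames: list of cepstral vectors (lists of equal length)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]

# A constant channel offset (+2.0 on coefficient 0) disappears after CMS:
clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.0]]
noisy = [[c0 + 2.0, c1] for c0, c1 in clean]
assert cepstral_mean_subtraction(clean) == cepstral_mean_subtraction(noisy)
```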
Q How does a speaker (speech) recognizer work?
A Two parts:
◘ enrollment/training
◘ testing
Steps:
◘ microphone recording
◘ preprocessing
◘ modeling/testing
Q What are these speaker models?
A Statistical models of the enrollment data:
◘ hidden Markov models (HMMs)
◘ Gaussian mixture models (GMMs)
Q Why use statistical models?
A Because of variation in the signal (behavioral biometric), which is
◘ often not noticed by human listeners, but
◘ detrimental to computer performance if not modelled appropriately.
Q What is an HMM?
A This question needs several slides to answer. Let’s start with a simple Markov model, which represents a sequence of observations (feature vectors from preprocessing) by states and transitions.
Q What is a MM?
◘ Stochastic model of a sequence of events.
◘ Start at container (state) S, which is empty.
◘ Go to container (state) 1 (with p = 1) and take out a black ball (observation).
[Figure: left-to-right model with states S → 1 → 2 → 3 → E; transitions S→1 = 1, 1→2 = 0.4, 2→3 = 0.5, 3→E = 0.3; self-loops 1→1 = 0.6, 2→2 = 0.5, 3→3 = 0.7]
◘ Go to state 2 (with p = 0.4) and take a red ball, or
◘ stay in state 1 and take another black ball out of the container.
◘ … and so on, until you get to state E and have a row of colored balls (cf. feature vectors obtained from the speech signal).
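The ball-drawing walk above can be sampled in code. The transition probabilities follow the diagram; assigning black, red and yellow to states 1–3 is the toy assumption that, in a plain (non-hidden) MM, each state emits exactly one colour:

```python
import random

# Toy Markov model from the slides: S -> 1 -> 2 -> 3 -> E with
# self-loops on states 1-3. Each state deterministically emits one
# ball colour, so the state sequence is visible in the observations.
TRANSITIONS = {
    "S": [("1", 1.0)],
    "1": [("1", 0.6), ("2", 0.4)],
    "2": [("2", 0.5), ("3", 0.5)],
    "3": [("3", 0.7), ("E", 0.3)],
}
EMISSION = {"1": "black", "2": "red", "3": "yellow"}

def sample_sequence(rng):
    state, balls = "S", []
    while state != "E":
        nxt, probs = zip(*TRANSITIONS[state])
        state = rng.choices(nxt, weights=probs)[0]
        if state != "E":
            balls.append(EMISSION[state])
    return balls

rng = random.Random(0)
print(sample_sequence(rng))  # a run of black, then red, then yellow balls
```

Because the model is left-to-right, every sampled sequence is a run of black balls, then red, then yellow.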
Q What is an HMM?
A The only difference with a MM: the same observations (colored balls) can be emitted by different states (containers). In the example, every container holds balls of all the colors; the different proportions of each color are modeled by the emission probabilities.
◘ Start at state S, which is empty.
◘ Go to state 1 (with p = 1) and take out a ball, which can be black, red or yellow.
◘ Go to state 2 (with p = 0.4) and take out a ball, or
◘ stay in state 1 and take another ball out,
◘ until you get to state E.
◘ … and so on, until you get to state E and have a sequence of colored balls.
◘ Notice the left-to-right nature: the order of sounds in a word is fixed.
◘ You now have a sequence of colored balls,
◘ but cannot tell from the sequence of balls which containers they were taken from (unlike in a MM): the states are "hidden".
Possible state sequences:
1 1 1 1 1 2 2 2 2 3 3 3
1 1 1 2 2 2 2 2 3 3 3 3
etc.
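A small sketch of why the states are hidden: under assumed emission probabilities (the slides only say that each container holds the colours in different proportions), both of the state sequences above assign non-zero probability to the very same sequence of balls:

```python
# Two different state sequences can both have produced the same
# observation sequence. Transition probabilities follow the toy model;
# the emission probabilities below are illustrative assumptions.
TRANS = {("S", "1"): 1.0, ("1", "1"): 0.6, ("1", "2"): 0.4,
         ("2", "2"): 0.5, ("2", "3"): 0.5, ("3", "3"): 0.7, ("3", "E"): 0.3}
EMIT = {"1": {"black": 0.8, "red": 0.1, "yellow": 0.1},
        "2": {"black": 0.1, "red": 0.8, "yellow": 0.1},
        "3": {"black": 0.1, "red": 0.1, "yellow": 0.8}}

def joint_prob(states, balls):
    """P(state sequence, ball sequence) for the left-to-right HMM."""
    p, prev = 1.0, "S"
    for s, b in zip(states, balls):
        p *= TRANS.get((prev, s), 0.0) * EMIT[s][b]
        prev = s
    return p * TRANS.get((prev, "E"), 0.0)  # must end by moving to E

balls = ["black"] * 4 + ["red"] * 4 + ["yellow"] * 4
path_a = ["1"] * 5 + ["2"] * 4 + ["3"] * 3  # the two state sequences
path_b = ["1"] * 3 + ["2"] * 5 + ["3"] * 4  # from the slide
assert joint_prob(path_a, balls) > 0 and joint_prob(path_b, balls) > 0
```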
◘ HMM for speech:
◘ state = (part of a) phone
◘ colored ball = feature vector which represents the frequency spectrum of the speech signal.
◘ Task of the HMM (Viterbi algorithm): find the most likely state sequence to model the observation sequence.
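A minimal Viterbi sketch over the same toy ball model; the transition probabilities follow the diagram, while the emission probabilities are again illustrative assumptions:

```python
# Viterbi: dynamic programming over states. delta[s] holds the best
# probability of any path ending in state s; back[t][s] remembers the
# predecessor on that best path, so the path can be read off backwards.
STATES = ["1", "2", "3"]
TRANS = {("S", "1"): 1.0, ("1", "1"): 0.6, ("1", "2"): 0.4,
         ("2", "2"): 0.5, ("2", "3"): 0.5, ("3", "3"): 0.7, ("3", "E"): 0.3}
EMIT = {"1": {"black": 0.8, "red": 0.1, "yellow": 0.1},
        "2": {"black": 0.1, "red": 0.8, "yellow": 0.1},
        "3": {"black": 0.1, "red": 0.1, "yellow": 0.8}}

def viterbi(balls):
    delta = {s: TRANS.get(("S", s), 0.0) * EMIT[s][balls[0]] for s in STATES}
    back = []
    for b in balls[1:]:
        new, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: delta[p] * TRANS.get((p, s), 0.0))
            new[s] = delta[best] * TRANS.get((best, s), 0.0) * EMIT[s][b]
            ptr[s] = best
        delta, back = new, back + [ptr]
    # the path must end by leaving the last state for E
    last = max(STATES, key=lambda s: delta[s] * TRANS.get((s, "E"), 0.0))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["black", "black", "red", "yellow"]))  # -> ['1', '1', '2', '3']
```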
◘ In the example, the observations were discrete (colors).
◘ Usually, Gaussian mixtures (normal distributions) are used to describe the (continuous) observations.
Q What is a GMM, and when is it used instead of an HMM?
A An HMM that consists of only one state. It can be used if we do not need any information about the linguistic content (time structure) of the speech.
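Scoring frames with a GMM can be sketched as follows; the one-dimensional two-component mixture and its parameters are made up for illustration:

```python
import math

# A GMM scores each frame as a weighted sum of Gaussian densities,
# sum_k w_k * N(x; mu_k, sigma_k); frames are scored independently,
# since the single-state model carries no time structure.
def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_log_likelihood(frames, weights, mus, sigmas):
    total = 0.0
    for x in frames:
        total += math.log(sum(w * gaussian(x, m, s)
                              for w, m, s in zip(weights, mus, sigmas)))
    return total

# Frames near the component means score higher than distant ones:
ll_near = gmm_log_likelihood([0.1, 2.9], [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
ll_far = gmm_log_likelihood([10.0, -10.0], [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
assert ll_near > ll_far
```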
Q How much enrollment data is needed to train an HMM/GMM?
A A balance between
◘ representativeness of the speaker and the operation conditions (as much as possible)
◘ user-friendliness (as little as possible)
Solution: adaptation from a universal background model (UBM).
Q How is a UBM used?
A Two ways:
◘ in training: to initialize the client speaker models (cf. previous slide)
◘ in testing (normally only in verification, not identification): to compare the likelihood of the client model with that of the UBM (normalisation by taking the likelihood ratio)
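Verification with a UBM can be sketched as a log-likelihood-ratio test; the single-Gaussian "models" and their parameters below are illustrative stand-ins for trained GMMs:

```python
import math

# Verification score = log p(X | client model) - log p(X | UBM).
# Accept the identity claim if the score exceeds a threshold.
def log_likelihood(frames, mu, sigma):
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for x in frames)

def verify(frames, client, ubm, threshold=0.0):
    score = log_likelihood(frames, *client) - log_likelihood(frames, *ubm)
    return score, score > threshold

# Frames near the client mean are accepted; frames near the
# population (UBM) mean are rejected.
_, accepted = verify([1.9, 2.1, 2.0], client=(2.0, 1.0), ubm=(0.0, 1.0))
assert accepted
_, accepted = verify([0.1, -0.1, 0.0], client=(2.0, 1.0), ubm=(0.0, 1.0))
assert not accepted
```

Dividing by the UBM likelihood normalizes away effects (channel, phrase difficulty) that raise or lower the likelihood for every speaker alike.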
Q How good is a system?
A Evaluation on test data:
◘ for identification: percentage of correct identifications
◘ for verification: comparison of the number of false acceptances with false rejections.
◘ DET instead of ROC curve
◘ The selected operating point depends on the required security level: r = 1 (EER), but also r = 0.1 or 10.
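The trade-off can be sketched by sweeping a decision threshold over toy verification scores and locating the point where the false-acceptance and false-rejection rates meet:

```python
# False-acceptance (FA) and false-rejection (FR) rates as the decision
# threshold sweeps over the scores; the equal error rate (EER) is where
# the two curves cross. Scores below are toy values, not system output.
def error_rates(client_scores, impostor_scores, threshold):
    fr = sum(s <= threshold for s in client_scores) / len(client_scores)
    fa = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def equal_error_rate(client_scores, impostor_scores):
    best = None
    for t in sorted(client_scores + impostor_scores):
        fa, fr = error_rates(client_scores, impostor_scores, t)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2)
    return best[1]

clients = [2.0, 1.5, 1.2, 0.4]     # genuine-speaker scores
impostors = [0.9, 0.5, 0.2, -0.3]  # impostor scores
print(equal_error_rate(clients, impostors))  # -> 0.25
```

Raising the threshold trades false acceptances for false rejections, which is exactly the choice of operating point on the DET curve.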
[Figure: ROC curve (receiver operating characteristic) and DET curve (detection error tradeoff)]
Alvin Martin et al. (1997). The DET curve in assessment of detection task performance. www.nist.gov/speech/publications/
Summary
This lecture has familiarized you with
◘ the main concepts in speaker recognition
◘ speaker modeling at a conceptual level
We should now
◘ take a closer look at the signal which is modelled in speaker recognition