A Talking Elevator, WS2006 UdS, Speaker Recognition 1


Page 1

Page 2

A Talking Elevator

An introduction to the main concepts

of speaker recognition

© Jacques Koreman, NTNU

Page 3

Q What is the problem?
A Two types of biometrics:
◘ behavioral
◘ physical

Page 4

Q What causes the problem?
A Variability
◘ Repetitions
◘ Sessions
◘ Channel
◘ Background noise

(Variability across speakers is good; variability within speakers is not.)

Page 5

Q How does variability affect
◘ speech recognition?
◘ speaker recognition?

A The structure of the acoustic space enhances decoding of the linguistic content of a message.

Schematic representation of the distribution of phones (fill colours) and speakers (border colours) in the acoustic space

Page 6

Q What is the difference between
◘ speaker identification and
◘ speaker verification?

A (Closed set) speaker identification selects the most likely speaker from a given set. Speaker verification is concerned with ascertaining a claimed identity. Open set speaker identification combines the two.
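As a sketch, the two decision rules can be written out directly, assuming each enrolled speaker's model already yields a similarity score for the test utterance (the names, scores and threshold below are hypothetical placeholders, not from the lecture):

```python
# Closed-set identification vs. verification, given per-speaker scores
# (e.g. log-likelihoods). All values here are illustrative.

def identify(scores):
    """Closed-set identification: pick the most likely enrolled speaker."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold):
    """Verification: accept or reject a claimed identity by thresholding."""
    return scores[claimed_id] >= threshold

scores = {"alice": -12.3, "bob": -9.8, "carol": -15.1}  # hypothetical scores
best = identify(scores)            # most likely speaker in the set
ok = verify(scores, "bob", -10.0)  # is the claimed identity accepted?
```

Open-set identification would combine the two: first pick the best-matching speaker, then verify that its score clears the threshold.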

Page 7

Q What is the difference between
◘ text-independent (TI) and
◘ text-dependent (TD)
speaker recognition?

A In TI recognition, the speaker can produce any speech, while in TD recognition the speaker must pronounce a fixed or prompted phrase.

(Dis)advantages? User-friendliness, variability, cooperativeness.

Page 8

Q What is the best way to select a prompt in TD recognition?

A The prompt can be
◘ fixed … good for finding consistent speaker differences, but impostors know the prompt too.

◘ self-selected … more secret, but users may choose a short, easy-to-guess prompt.

◘ variable … but you need a lot of enrollment data to model all possible contexts.

Page 9

Q How are the training data selected?

A ◘ Training data should reflect test (=operation) conditions to prevent training-test mismatch.

◘ More data needed for TI than for TD models.

◘ Better speaker models with more training data, but less user-friendly.

Page 10

Q How can we deal with noise in the recordings?

A Two ways:
◘ Pre-processing: normalize the signals, e.g. by cepstral mean subtraction (CMS).

◘ Modelling: create multi-condition speaker models based on signals recorded in different environments.
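Cepstral mean subtraction itself is simple enough to sketch: subtract the per-coefficient mean over all frames, so that a constant channel offset cancels out (the frame values below are illustrative):

```python
# Sketch of cepstral mean subtraction (CMS). Each frame is a list of
# cepstral coefficients; a stationary channel adds roughly the same
# offset to every frame, so removing the mean removes the channel.

def cepstral_mean_subtraction(frames):
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # illustrative values
normalized = cepstral_mean_subtraction(frames)
# each coefficient (column) of the result now has zero mean
```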

Page 11

Q How does a speaker (speech) recognizer work?

A Two parts:
◘ enrollment/training
◘ testing

Steps:
◘ microphone recording
◘ preprocessing
◘ modeling/testing

Page 12

Q What are these speaker models?

A Statistical models of the enrollment data:

◘ hidden Markov models (HMMs)

◘ Gaussian mixture models (GMMs)

Page 13

Q Why use statistical models?
A Because of variation in the signal (behavioral biometric)
◘ often not noticed by human listeners, but
◘ detrimental to computer performance if not modelled appropriately

Page 14

Q What is an HMM?
A This question needs several slides to answer. Let’s start with a simple Markov model, which represents a sequence of observations (feature vectors from preprocessing) by states and transitions.

Page 15

Q What is a MM?
◘ Stochastic model of a sequence of events.
◘ Start at container (state) S, which is empty.
◘ Go to container (state) 1 (with p=1) and take out a black ball (observation).

[Diagram: left-to-right model with states S, 1, 2, 3, E; self-loop probabilities 0.6, 0.5, 0.7; forward transitions S-1 = 1, 1-2 = 0.4, 2-3 = 0.5, 3-E = 0.3 (one reading of the original figure)]

Page 16

Q What is a MM?
◘ Go to state 2 (with p=0.4) and take a red ball, or
◘ stay in state 1 and take another black ball out of the container.

[Diagram as on the previous slide]

Page 17

Q What is a MM?
◘ … and so on, until you get to state E and have a row of colored balls (cf. feature vectors obtained from the speech signal).

[Diagram as on the previous slides]
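The ball-drawing walk above can be sketched as a tiny simulation. In a plain Markov model each state emits a fixed observation (state 1: black, state 2: red, state 3: yellow); the transition probabilities below follow one reading of the slide diagram (self-loops 0.6/0.5/0.7, forward moves 0.4/0.5/0.3):

```python
import random

# Sample an observation sequence from the toy left-to-right Markov model.
# Probabilities are one interpretation of the slide diagram.

SELF = {1: 0.6, 2: 0.5, 3: 0.7}                  # self-loop probabilities
COLOR = {1: "black", 2: "red", 3: "yellow"}      # each state emits one color

def sample_sequence(rng):
    state, balls = 1, []        # S -> state 1 with p=1
    while state <= 3:
        balls.append(COLOR[state])
        if rng.random() >= SELF[state]:
            state += 1          # move on; past state 3 means we reached E
    return balls

seq = sample_sequence(random.Random(0))
# e.g. some run of blacks, then reds, then yellows
```

Because the model is left-to-right, every sampled sequence is a run of blacks, then reds, then yellows, with lengths governed by the self-loop probabilities.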

Page 18

Q What is an HMM?
A The only difference with a MM: the same observations (colored balls) can be emitted by different states (containers). In the example: all containers contain balls of different colors. Different percentages of each color are modeled by their emission probabilities.

Page 19

Q What is an HMM?
◘ Start at state S, which is empty.
◘ Go to state 1 (with p=1) and take out a ball, which can be black, red or yellow.

[Diagram as on the previous slides]

Page 20

Q What is an HMM?
◘ Go to state 2 (with p=0.4) and take out a ball, or
◘ stay in state 1 and take another ball out,
◘ until you get to state E.

[Diagram as on the previous slides]

Page 21

Q What is an HMM?
◘ … And so on, until you get to state E and have a sequence of colored balls.
◘ Notice the left-to-right nature: the order of sounds in a word is fixed.

Page 22

Q What is an HMM?
◘ You now have a sequence of colored balls,
◘ but cannot tell from the sequence of balls which containers they were taken from (unlike in a MM): the states are „hidden“.

1 1 1 1 1 2 2 2 2 3 3 3
1 1 1 2 2 2 2 2 3 3 3 3
etc.
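That several hidden paths can explain the same ball sequence can be made concrete: each candidate path has a joint probability, the product of its transition and emission probabilities. The transition values follow one reading of the slide diagram; the emission probabilities are invented for illustration:

```python
# Joint probability of one hidden state path and an observed ball
# sequence, for the toy left-to-right HMM. Emission probabilities
# are hypothetical; transitions are one reading of the diagram.

TRANS = {(1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.5, (2, 3): 0.5,
         (3, 3): 0.7, (3, "E"): 0.3}
EMIT = {1: {"black": 0.7, "red": 0.2, "yellow": 0.1},
        2: {"black": 0.2, "red": 0.6, "yellow": 0.2},
        3: {"black": 0.1, "red": 0.2, "yellow": 0.7}}

def path_probability(states, balls):
    """P(path, observations): emissions times transitions, ending at E."""
    p = 1.0                                   # S -> state 1 has probability 1
    for state, ball in zip(states, balls):
        p *= EMIT[state][ball]
    for a, b in zip(states, states[1:]):
        p *= TRANS[(a, b)]
    return p * TRANS[(states[-1], "E")]

balls = ["black", "red", "red", "yellow"]
# three different hidden paths all explain the same balls:
p1 = path_probability([1, 1, 2, 3], balls)
p2 = path_probability([1, 2, 2, 3], balls)
p3 = path_probability([1, 2, 3, 3], balls)
```

Here p2 turns out largest, so [1, 2, 2, 3] is the most plausible explanation of this particular sequence, even though all three paths are possible.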

Page 23

Q What is an HMM?
◘ HMM for speech:
◘ state = (part of a) phone
◘ colored ball = feature vector which represents the frequency spectrum of the speech signal.

◘ Task of the HMM (Viterbi algorithm): find the most likely state sequence to model the observation sequence.
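The Viterbi search can be sketched for the toy ball HMM: instead of scoring all paths, keep only the best path into each state at every step. Transitions follow one reading of the slide diagram; emission probabilities are invented for illustration:

```python
# Minimal Viterbi sketch for the toy left-to-right ball HMM.
# Transitions: one reading of the diagram; emissions: hypothetical.

TRANS = {(1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.5, (2, 3): 0.5,
         (3, 3): 0.7, (3, "E"): 0.3}
EMIT = {1: {"black": 0.7, "red": 0.2, "yellow": 0.1},
        2: {"black": 0.2, "red": 0.6, "yellow": 0.2},
        3: {"black": 0.1, "red": 0.2, "yellow": 0.7}}

def viterbi(balls):
    """Return the most likely hidden state path and its probability."""
    # S -> state 1 with p=1, so every path starts in state 1.
    delta = {1: (EMIT[1][balls[0]], [1])}     # state -> (best prob, best path)
    for ball in balls[1:]:
        new = {}
        for s, (p, path) in delta.items():
            for t in (s, s + 1):              # left-to-right: stay or move on
                if (s, t) not in TRANS:
                    continue
                q = p * TRANS[(s, t)] * EMIT[t][ball]
                if t not in new or q > new[t][0]:
                    new[t] = (q, path + [t])
        delta = new
    p, path = delta[3]                        # only state 3 can reach E
    return path, p * TRANS[(3, "E")]

path, prob = viterbi(["black", "red", "red", "yellow"])
```

Keeping only the best hypothesis per state at each step is what makes Viterbi efficient: the work grows linearly with the number of observations instead of exponentially with the number of paths.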

Page 24

Q What is an HMM?
◘ In the example, the observations were discrete (colors).
◘ Usually, Gaussian mixtures (normal distributions) are used to describe the (continuous) observations.

Page 25

Q What is a GMM, and when is it used instead of an HMM?

A An HMM that consists of only one state. It can be used if we do not need any information about the linguistic content (time structure) of the speech.
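Scoring with such a single-state model can be sketched in one dimension: each frame is scored by a weighted sum of Gaussian densities, and the frame scores are summed in the log domain. Because there is only one state, the order of the frames does not matter. The weights, means and variances below are illustrative:

```python
import math

# Sketch: scoring frames with a 1-D Gaussian mixture model (GMM),
# i.e. a one-state HMM. All parameter values are illustrative.

def gaussian(x, mean, var):
    """Normal density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_log_likelihood(frames, components):
    """Sum of per-frame log densities; frame order is irrelevant."""
    total = 0.0
    for x in frames:
        total += math.log(sum(w * gaussian(x, m, v) for w, m, v in components))
    return total

gmm = [(0.6, 0.0, 1.0), (0.4, 3.0, 2.0)]   # (weight, mean, variance)
frames = [0.2, 2.8, -0.5]                  # illustrative 1-D features
score = gmm_log_likelihood(frames, gmm)
```

Shuffling the frames leaves the score unchanged, which is exactly why a GMM discards the time structure of the speech.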

Page 26

Q How much enrollment data is needed to train an HMM/GMM?

A A balance between
◘ representativeness of the speaker and operation conditions (as much as possible)
◘ user-friendliness (as little as possible)

Solution: adaptation from a universal background model (UBM)

Page 27

Q How is a UBM used?
A Two ways:
◘ in training: to initialize the client speaker models (cf. previous slide)

◘ in testing (normally only in verification, not identification): to compare the likelihood of the client model with that of the UBM (normalisation by taking the likelihood ratio)
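The test-time use can be sketched as a log-likelihood ratio decision: the claim is accepted when the client model explains the utterance sufficiently better than the UBM. The scores and threshold below are illustrative placeholders:

```python
# Sketch of UBM score normalisation in verification: threshold the
# log-likelihood ratio between the claimed speaker model and the UBM.
# All values are illustrative.

def accept(loglik_client, loglik_ubm, threshold):
    """Accept the identity claim iff the log-likelihood ratio is high enough."""
    return (loglik_client - loglik_ubm) >= threshold

genuine_trial = accept(-120.0, -130.0, 5.0)   # ratio 10.0: accept
impostor_trial = accept(-128.0, -130.0, 5.0)  # ratio 2.0: reject
```

Dividing by (in the log domain: subtracting) the UBM score makes the decision robust against utterances that are simply hard or easy to model for every speaker.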

Page 28

Q How good is a system?
A Evaluation of test data:
◘ for identification: percentage of correct identifications

◘ for verification: comparison of the number of false acceptances with false rejections.
◘ DET instead of ROC curve
◘ The selected operating point depends on the required security level: r=1 (EER), but also r=0.1 or 10.
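The verification trade-off can be sketched by sweeping a decision threshold over genuine and impostor score sets: a higher threshold lowers the false acceptance rate (FAR) but raises the false rejection rate (FRR), and the equal error rate (EER) is where the two meet. The scores below are illustrative:

```python
# Sketch: FAR/FRR trade-off and an approximate equal error rate (EER)
# from genuine and impostor trial scores. All scores are illustrative.

def rates(genuine, impostor, threshold):
    """False acceptance and false rejection rates at one threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds; return rates where |FAR - FRR| is smallest."""
    candidates = sorted(set(genuine) | set(impostor))
    best = min(candidates,
               key=lambda t: abs(rates(genuine, impostor, t)[0]
                                 - rates(genuine, impostor, t)[1]))
    return rates(genuine, impostor, best)

genuine = [7.1, 6.4, 8.0, 5.9, 7.5]    # client trials (hypothetical)
impostor = [4.2, 5.1, 6.0, 3.8, 4.9]   # impostor trials (hypothetical)
far, frr = equal_error_rate(genuine, impostor)
```

Plotting FAR against FRR over all thresholds gives exactly the DET curve of the next slide; choosing r=0.1 or r=10 instead of r=1 just moves the operating point along that curve.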

Page 29

Q How good is a system?

ROC curve (receiver operating characteristic) vs. DET curve (detection error tradeoff)

Alvin Martin et al. (1997). The DET curve in assessment of detection task performance. www.nist.gov/speech/publications/

Page 30

Summary

This lecture has familiarized you with
◘ the main concepts in speaker recognition
◘ speaker modeling at a conceptual level

We should now
◘ take a closer look at the signal which is modelled in speaker recognition

Page 31

A Talking Elevator

An introduction to the main concepts

of speaker recognition

Jacques Koreman, NTNU
