A Talking Elevator, WS2006 UdS, Speaker Recognition 1
A Talking Elevator
An introduction to the main concepts
of speaker recognition
© Jacques Koreman, NTNU
Q What is the problem?
A Two types of biometrics:
◘ behavioral
◘ physical
Q What causes the problem?
A Variability
◘ Repetitions
◘ Sessions
◘ Channel
◘ Background noise
(Variability across speakers is good; variability within speakers is not.)
Q How does variability affect
◘ speech recognition?
◘ speaker recognition?
A The structure of the acoustic space enhances decoding of the linguistic content of a message.
[Figure: schematic representation of the distribution of phones (fill colours) and speakers (border colours) in the acoustic space]
Q What is the difference between
◘ speaker identification and
◘ speaker verification?
A (Closed-set) speaker identification selects the most likely speaker from a given set. Speaker verification is concerned with ascertaining a claimed identity. Open-set speaker identification combines the two.
Q What is the difference between
◘ text-independent (TI) and
◘ text-dependent (TD)
speaker recognition?
A In TI recognition, the speaker can produce any speech, while in TD recognition the speaker must pronounce a fixed or prompted phrase.
(Dis)advantages? User-friendliness, variability, cooperativeness.
Q What is the best way to select a prompt in TD recognition?
A The prompt can be
◘ fixed … good for finding consistent speaker differences, but impostors know the prompt too.
◘ self-selected … more secret, but users may choose a short, easy-to-guess prompt.
◘ variable … but you need a lot of enrollment data to model all possible contexts.
Q How are the training data selected?
A
◘ Training data should reflect test (= operation) conditions to prevent training-test mismatch.
◘ More data are needed for TI than for TD models.
◘ More training data give better speaker models, but are less user-friendly.
Q How can we deal with noise in the recordings?
A Two ways:
◘ Pre-processing: normalize the signals, e.g. by cepstral mean subtraction (CMS).
◘ Modelling: create multi-condition speaker models based on signals recorded in different environments.
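The CMS step can be sketched in plain Python; this is a toy example with made-up cepstral frames (a real system would operate on MFCC vectors), showing that a constant channel offset vanishes after normalization:

```python
# Cepstral mean subtraction (CMS): subtract the per-coefficient mean
# over all frames. A stationary channel adds a constant offset in the
# cepstral domain, so subtracting the mean removes it.

def cepstral_mean_subtraction(frames):
    """frames: list of cepstral vectors (lists of equal length)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]

# A constant channel offset (+2.0 on coefficient 0) disappears after CMS:
clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.0]]
noisy = [[c0 + 2.0, c1] for c0, c1 in clean]
assert cepstral_mean_subtraction(clean) == cepstral_mean_subtraction(noisy)
```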
Q How does a speaker (speech) recognizer work?
A Two parts:
◘ enrollment/training
◘ testing
Steps:
◘ microphone recording
◘ preprocessing
◘ modeling/testing
Q What are these speaker models?
A Statistical models of the enrollment data:
◘ hidden Markov models (HMMs)
◘ Gaussian mixture models (GMMs)
Q Why use statistical models?
A Because of variation in the signal (behavioral biometric), which is
◘ often not noticed by human listeners, but
◘ detrimental to computer performance if not modelled appropriately.
Q What is an HMM?
A This question needs several slides to answer. Let’s start with a simple Markov model, which represents a sequence of observations (feature vectors from preprocessing) by states and transitions.
Q What is a MM?
◘ Stochastic model of a sequence of events.
◘ Start at container (state) S, which is empty.
◘ Go to container (state) 1 (with p = 1) and take out a black ball (observation).
[Figure: left-to-right model with states S → 1 → 2 → 3 → E; transitions S→1 = 1, 1→2 = 0.4, 2→3 = 0.5, 3→E = 0.3; self-loops 1→1 = 0.6, 2→2 = 0.5, 3→3 = 0.7]
◘ Go to state 2 (with p = 0.4) and take a red ball, or
◘ stay in state 1 and take another black ball out of the container.
◘ … and so on, until you get to state E and have a row of colored balls (cf. feature vectors obtained from the speech signal).
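The ball-drawing walk above can be sampled in code. The transition probabilities follow the diagram; assigning black, red and yellow to states 1–3 is the toy assumption that, in a plain (non-hidden) MM, each state emits exactly one colour:

```python
import random

# Toy Markov model from the slides: S -> 1 -> 2 -> 3 -> E with
# self-loops on states 1-3. Each state deterministically emits one
# ball colour, so the state sequence is visible in the observations.
TRANSITIONS = {
    "S": [("1", 1.0)],
    "1": [("1", 0.6), ("2", 0.4)],
    "2": [("2", 0.5), ("3", 0.5)],
    "3": [("3", 0.7), ("E", 0.3)],
}
EMISSION = {"1": "black", "2": "red", "3": "yellow"}

def sample_sequence(rng):
    state, balls = "S", []
    while state != "E":
        nxt, probs = zip(*TRANSITIONS[state])
        state = rng.choices(nxt, weights=probs)[0]
        if state != "E":
            balls.append(EMISSION[state])
    return balls

rng = random.Random(0)
print(sample_sequence(rng))  # a run of black, then red, then yellow balls
```

Because the model is left-to-right, every sampled sequence is a run of black balls, then red, then yellow.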
Q What is an HMM?
A The only difference with a MM: the same observations (colored balls) can be emitted by different states (containers). In the example, every container holds balls of all the colors; the different proportions of each color are modeled by the emission probabilities.
◘ Start at state S, which is empty.
◘ Go to state 1 (with p = 1) and take out a ball, which can be black, red or yellow.
◘ Go to state 2 (with p = 0.4) and take out a ball, or
◘ stay in state 1 and take another ball out,
◘ until you get to state E.
◘ … and so on, until you get to state E and have a sequence of colored balls.
◘ Notice the left-to-right nature: the order of sounds in a word is fixed.
◘ You now have a sequence of colored balls,
◘ but cannot tell from the sequence of balls which containers they were taken from (unlike in a MM): the states are "hidden".
Possible state sequences:
1 1 1 1 1 2 2 2 2 3 3 3
1 1 1 2 2 2 2 2 3 3 3 3
etc.
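A small sketch of why the states are hidden: under assumed emission probabilities (the slides only say that each container holds the colours in different proportions), both of the state sequences above assign non-zero probability to the very same sequence of balls:

```python
# Two different state sequences can both have produced the same
# observation sequence. Transition probabilities follow the toy model;
# the emission probabilities below are illustrative assumptions.
TRANS = {("S", "1"): 1.0, ("1", "1"): 0.6, ("1", "2"): 0.4,
         ("2", "2"): 0.5, ("2", "3"): 0.5, ("3", "3"): 0.7, ("3", "E"): 0.3}
EMIT = {"1": {"black": 0.8, "red": 0.1, "yellow": 0.1},
        "2": {"black": 0.1, "red": 0.8, "yellow": 0.1},
        "3": {"black": 0.1, "red": 0.1, "yellow": 0.8}}

def joint_prob(states, balls):
    """P(state sequence, ball sequence) for the left-to-right HMM."""
    p, prev = 1.0, "S"
    for s, b in zip(states, balls):
        p *= TRANS.get((prev, s), 0.0) * EMIT[s][b]
        prev = s
    return p * TRANS.get((prev, "E"), 0.0)  # must end by moving to E

balls = ["black"] * 4 + ["red"] * 4 + ["yellow"] * 4
path_a = ["1"] * 5 + ["2"] * 4 + ["3"] * 3  # the two state sequences
path_b = ["1"] * 3 + ["2"] * 5 + ["3"] * 4  # from the slide
assert joint_prob(path_a, balls) > 0 and joint_prob(path_b, balls) > 0
```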
◘ HMM for speech:
◘ state = (part of a) phone
◘ colored ball = feature vector which represents the frequency spectrum of the speech signal.
◘ Task of the HMM (Viterbi algorithm): find the most likely state sequence to model the observation sequence.
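A minimal Viterbi sketch over the same toy ball model; the transition probabilities follow the diagram, while the emission probabilities are again illustrative assumptions:

```python
# Viterbi: dynamic programming over states. delta[s] holds the best
# probability of any path ending in state s; back[t][s] remembers the
# predecessor on that best path, so the path can be read off backwards.
STATES = ["1", "2", "3"]
TRANS = {("S", "1"): 1.0, ("1", "1"): 0.6, ("1", "2"): 0.4,
         ("2", "2"): 0.5, ("2", "3"): 0.5, ("3", "3"): 0.7, ("3", "E"): 0.3}
EMIT = {"1": {"black": 0.8, "red": 0.1, "yellow": 0.1},
        "2": {"black": 0.1, "red": 0.8, "yellow": 0.1},
        "3": {"black": 0.1, "red": 0.1, "yellow": 0.8}}

def viterbi(balls):
    delta = {s: TRANS.get(("S", s), 0.0) * EMIT[s][balls[0]] for s in STATES}
    back = []
    for b in balls[1:]:
        new, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: delta[p] * TRANS.get((p, s), 0.0))
            new[s] = delta[best] * TRANS.get((best, s), 0.0) * EMIT[s][b]
            ptr[s] = best
        delta, back = new, back + [ptr]
    # the path must end by leaving the last state for E
    last = max(STATES, key=lambda s: delta[s] * TRANS.get((s, "E"), 0.0))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["black", "black", "red", "yellow"]))  # -> ['1', '1', '2', '3']
```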
◘ In the example, the observations were discrete (colors).
◘ Usually, Gaussian mixtures (normal distributions) are used to describe the (continuous) observations.
Q What is a GMM, and when is it used instead of an HMM?
A An HMM that consists of only one state. It can be used if we do not need any information about the linguistic content (time structure) of the speech.
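Scoring frames with a GMM can be sketched as follows; the one-dimensional two-component mixture and its parameters are made up for illustration:

```python
import math

# A GMM scores each frame as a weighted sum of Gaussian densities,
# sum_k w_k * N(x; mu_k, sigma_k); frames are scored independently,
# since the single-state model carries no time structure.
def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_log_likelihood(frames, weights, mus, sigmas):
    total = 0.0
    for x in frames:
        total += math.log(sum(w * gaussian(x, m, s)
                              for w, m, s in zip(weights, mus, sigmas)))
    return total

# Frames near the component means score higher than distant ones:
ll_near = gmm_log_likelihood([0.1, 2.9], [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
ll_far = gmm_log_likelihood([10.0, -10.0], [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
assert ll_near > ll_far
```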
Q How much enrollment data is needed to train an HMM/GMM?
A A balance between
◘ representativeness of the speaker and the operation conditions (as much as possible)
◘ user-friendliness (as little as possible)
Solution: adaptation from a universal background model (UBM).
Q How is a UBM used?
A Two ways:
◘ in training: to initialize the client speaker models (cf. previous slide)
◘ in testing (normally only in verification, not identification): to compare the likelihood of the client model with that of the UBM (normalisation by taking the likelihood ratio)
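Verification with a UBM can be sketched as a log-likelihood-ratio test; the single-Gaussian "models" and their parameters below are illustrative stand-ins for trained GMMs:

```python
import math

# Verification score = log p(X | client model) - log p(X | UBM).
# Accept the identity claim if the score exceeds a threshold.
def log_likelihood(frames, mu, sigma):
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for x in frames)

def verify(frames, client, ubm, threshold=0.0):
    score = log_likelihood(frames, *client) - log_likelihood(frames, *ubm)
    return score, score > threshold

# Frames near the client mean are accepted; frames near the
# population (UBM) mean are rejected.
_, accepted = verify([1.9, 2.1, 2.0], client=(2.0, 1.0), ubm=(0.0, 1.0))
assert accepted
_, accepted = verify([0.1, -0.1, 0.0], client=(2.0, 1.0), ubm=(0.0, 1.0))
assert not accepted
```

Dividing by the UBM likelihood normalizes away effects (channel, phrase difficulty) that raise or lower the likelihood for every speaker alike.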
Q How good is a system?
A Evaluation on test data:
◘ for identification: percentage of correct identifications
◘ for verification: comparison of the number of false acceptances with false rejections.
◘ DET instead of ROC curve
◘ The selected operating point depends on the required security level: r = 1 (EER), but also r = 0.1 or 10.
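The trade-off can be sketched by sweeping a decision threshold over toy verification scores and locating the point where the false-acceptance and false-rejection rates meet:

```python
# False-acceptance (FA) and false-rejection (FR) rates as the decision
# threshold sweeps over the scores; the equal error rate (EER) is where
# the two curves cross. Scores below are toy values, not system output.
def error_rates(client_scores, impostor_scores, threshold):
    fr = sum(s <= threshold for s in client_scores) / len(client_scores)
    fa = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def equal_error_rate(client_scores, impostor_scores):
    best = None
    for t in sorted(client_scores + impostor_scores):
        fa, fr = error_rates(client_scores, impostor_scores, t)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2)
    return best[1]

clients = [2.0, 1.5, 1.2, 0.4]     # genuine-speaker scores
impostors = [0.9, 0.5, 0.2, -0.3]  # impostor scores
print(equal_error_rate(clients, impostors))  # -> 0.25
```

Raising the threshold trades false acceptances for false rejections, which is exactly the choice of operating point on the DET curve.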
[Figure: ROC curve (receiver operating characteristic) and DET curve (detection error tradeoff)]
Alvin Martin et al. (1997). The DET curve in assessment of detection task performance. www.nist.gov/speech/publications/
Summary
This lecture has familiarized you with
◘ the main concepts in speaker recognition
◘ speaker modeling at a conceptual level
We should now
◘ take a closer look at the signal which is modelled in speaker recognition