Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu
Based on a True Story …
T. Kinnunen: Spectral Features for Automatic Text-Independent Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of Computer Science, University of Joensuu, 2004.
Downloadable in PDF from:
http://cs.joensuu.fi/pages/tkinnu/research/index.html
Introduction
Why Study Feature Extraction ?
• As the first component in the recognition chain, feature extraction strongly determines the accuracy of the classification
Why Study Feature Extraction ? (cont.)
• Typical feature extraction methods are "borrowed" directly from the speech recognition task
Quite contradictory, considering the "opposite" nature of the two tasks
• In general, it seems that we are currently at best guessing what might be individual in our speech!
• Because it is interesting & challenging!
Principle of Feature Extraction
1. FFT-implemented filterbanks (subband processing)
2. FFT-cepstrum
3. LPC-derived features
4. Dynamic spectral features (delta features)
Studied Features
Speech Material & Evaluation Protocol
• Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)
• Each segment is classified by vector quantization
• Speaker models are constructed from the training data by RLS clustering algorithm
• Performance measure = classification error rate (%)
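The classification step described above can be sketched as follows. This is an illustrative reconstruction, not code from the thesis: function names are mine, and Euclidean distance is assumed for the vector quantization.

```python
import numpy as np

def avg_quantization_distortion(segment, codebook):
    """Mean Euclidean distance from each feature vector in a segment
    to its nearest code vector in one speaker's codebook."""
    # segment: (T, d) feature vectors, codebook: (K, d) code vectors
    dists = np.linalg.norm(segment[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def classify_segment(segment, codebooks):
    """Assign the segment to the speaker whose codebook gives the
    smallest average distortion."""
    distortions = [avg_quantization_distortion(segment, cb) for cb in codebooks]
    return int(np.argmin(distortions))
```

The speaker models (codebooks) themselves would come from clustering the training vectors, e.g. with the RLS algorithm mentioned above.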
1. Subband Features
Computation of Subband Features
Windowed speech frame
Magnitude spectrum by FFT
Smoothing by a filterbank
Nonlinear mapping of the filter outputs
Compressed filter outputs f = (f1, f2, …, fM)^T
Parameters of the filterbank:
• Number of subbands
• Filter shapes & bandwidths
• Type of frequency warping
• Filter output nonlinearity
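The pipeline above can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions: linearly spaced triangular filters (no warping) and cubic compression; none of the names come from the thesis.

```python
import numpy as np

def triangular_filterbank(n_filters, n_fft, sample_rate):
    """Linearly spaced triangular filters covering 0..Nyquist."""
    edges = np.linspace(0, sample_rate / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def subband_features(frame, fbank, nonlinearity=np.cbrt):
    """Windowed frame -> |FFT| -> filterbank smoothing -> compression."""
    spectrum = np.abs(np.fft.rfft(frame, n=2 * (fbank.shape[1] - 1)))
    return nonlinearity(fbank @ spectrum)
```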
Frequency Warping… What’s That?!
[Figure: filter gain vs. frequency in Hz; shape: triangular, warping: Bark]
[Figure: frequency in kHz mapped to frequency in Bark]
• “Real” frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
Bark scale; a 24-channel Bark-warped filterbank
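As a concrete example of a warping function, the Hz-to-Bark mapping can be computed with Zwicker's well-known approximation (the thesis may use a different analytic form; this one is a common choice):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's approximation of the Bark critical-band scale.
    Monotonic (bijective on [0, inf)), so it is a valid warping function."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```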
Discrimination of Individual Subbands (F-ratio)
[Figure: F-ratio vs. frequency for the Helsinki and TIMIT corpora]
Low-end (~0-200 Hz) and mid/high frequencies (~2-4 kHz) are important; the region ~200-2000 Hz is less important. (However, not consistently!)
(Fixed parameters: 30 linearly spaced triangular filters)
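The F-ratio used above can be sketched for a single subband feature; this is the standard between-to-within variance ratio, with a layout (one array per speaker) that is my own illustrative choice:

```python
import numpy as np

def f_ratio(values_by_speaker):
    """F-ratio of one feature: variance of the speaker means divided by
    the average within-speaker variance. Higher = more discriminative."""
    means = np.array([np.mean(v) for v in values_by_speaker])
    within = np.mean([np.var(v) for v in values_by_speaker])
    return np.var(means) / within
```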
Subband Features : The Effect of the Filter Output Nonlinearity
Helsinki TIMIT
Consistent ordering (!) : cubic < log < linear
Fixed parameters: 30 linearly spaced triangular filters
1. Linear: f(x) = x
2. Logarithmic: f(x) = log(1 + x)
3. Cubic: f(x) = x^(1/3)
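The three compression functions compared here are trivial to implement; a small sketch (the dictionary layout is mine):

```python
import numpy as np

# The three filter-output nonlinearities compared above
nonlinearities = {
    "linear": lambda x: x,
    "log":    lambda x: np.log1p(x),   # log(1 + x)
    "cubic":  lambda x: np.cbrt(x),    # x ** (1/3)
}

def compress(filter_outputs, kind="cubic"):
    """Apply the chosen nonlinearity to (non-negative) filter outputs."""
    return nonlinearities[kind](np.asarray(filter_outputs, dtype=float))
```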
Subband Features : The Effect of the Filter Shape
Helsinki TIMIT
The differences are small, no consistent ordering
probably the filter shape is not as crucial as the other parameters
Fixed parameters: 30 linearly spaced filters, log-compression
1. Rectangular
2. Triangular
3. Hanning
Subband Features : The Number of Subbands (1)
Helsinki TIMIT
Observation: error rates decrease monotonically with increasing number of subbands (in most cases) …
Fixed parameters: linearly spaced / triangular-shaped filters, log-compression
Experiment 1: From 5 to 50
Subband Features : The Number of Subbands (2)
Fixed parameters: linearly spaced / triangular-shaped filters, log-compression
Experiment 2: From 50 to 250
Helsinki: (almost) monotonic decrease in errors with increasing number of subbands
TIMIT: the optimum number of bands is in the range 50..100
Differences between corpora are (partly) explained by the discrimination curves
Discussion of the Subband Features
• (Typically used) log-compression should be replaced with cubic compression or some better nonlinearity
• Number of subbands should be relatively high (at least 50 based on these experiments)
• The shape of the filter does not seem to be important
• Discriminative information is not evenly distributed along the frequency axis
• The relative discriminative power of a subband depends on the selected speaker population, language, and speech content…
2. FFT-Cepstral Features
Computation of FFT-Cepstrum
Windowed speech frame
Magnitude spectrum by FFT
Smoothing by a filterbank
Nonlinear mapping of the filter outputs
c = (c1,…,cM)T
Decorrelation by DCT
Coefficient selection
Cepstrum vector
Processing is very similar to “raw” subband processing
Common steps
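The extra steps relative to raw subband processing, decorrelation by DCT and coefficient selection, can be sketched as follows. An illustrative reconstruction: function names are mine, and the DCT is written out explicitly rather than taken from a library.

```python
import numpy as np

def dct_ii(x):
    """Type-II DCT: the decorrelating transform applied to the
    compressed filterbank outputs."""
    M = len(x)
    n = np.arange(M)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * M)))
                     for k in range(M)])

def fft_cepstrum(compressed_outputs, n_coeffs=15):
    """DCT of the filterbank outputs; keep the n_coeffs lowest
    coefficients, excluding c[0] (the overall energy term)."""
    c = dct_ii(np.asarray(compressed_outputs, dtype=float))
    return c[1:n_coeffs + 1]
```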
FFT-Cepstrum : Type of Frequency Warping
Helsinki TIMIT
Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0]
Helsinki: Mel-frequency warped cepstrum gives the best results on average
TIMIT: Linearly warped cepstrum gives the best results on average
Same explanation as before: discrimination curves
1. Linear warping
2. Mel-warping
3. Bark-warping
4. ERB-warping
FFT-Cepstrum : Number of Cepstral Coefficients
( Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0], codebook size = 64)
Helsinki TIMIT
Minimum number of coefficients around ~ 10, rather independent of the number of filters
Discussion About the FFT-Cepstrum
• Same performance as with the subband features, but with a smaller number of features
For computational and modeling reasons, the cepstrum is the preferred method of the two in automatic recognition
• The commonly used mel-warped filterbank is not the best choice in the general case!
There is no reason to assume that it would be, since the mel-cepstrum is based on modeling human hearing and was originally meant for speech recognition purposes
• I prefer and recommend linear frequency warping, since:
It is easier to control the amount of resolution on desired subbands (e.g. by linear weighting); in nonlinear warping, the relationship between the "real" and "warped" frequency axes is more complicated
3. LPC-Derived Features
What Is Linear Predictive Coding (LPC)?
• In the time domain, the current sample is approximated as a linear combination of the past p samples:
    ŝ[n] = Σ_{k=1}^{p} a[k] s[n−k]
• The objective is to determine the LPC coefficients a[k], k = 1, …, p, such that the squared prediction error is minimized
• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum
[Figure: an LPC pole at a local maximum of the magnitude spectrum]
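A minimal Levinson-Durbin recursion, which solves the Yule-Walker equations from the autocorrelation sequence, might look like this. A sketch, not the thesis implementation; sign conventions vary between texts, and this one uses A(z) = 1 + Σ a[k] z^-k.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker AR equations for LPC coefficients.

    r: autocorrelation sequence r[0..order].
    Returns (a, k): a = [1, a1, ..., ap], and the reflection
    coefficients k[1..p] produced as a by-product of the recursion.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        # Prediction error of the order-(i-1) model against r[i]
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k_i = -acc / err
        k[i - 1] = k_i
        a_prev = a.copy()
        for j in range(1, i):      # update a[1..i-1]
            a[j] = a_prev[j] + k_i * a_prev[i - j]
        a[i] = k_i
        err *= (1.0 - k_i ** 2)    # residual energy shrinks each step
    return a, k
```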
Computation of LPC and LPC-Based Features
Windowed speech frame
Autocorrelation computation
Levinson-Durbin algorithm (solving the Yule-Walker AR equations)
• LPC coefficients (LPC)
• Reflection coefficients (REFL)
  – asin(·) → Arcus sine coefficients (ARCSIN)
  – LAR conversion → Log area ratios (LAR)
• Complex polynomial expansion + root-finding algorithm → Line spectral frequencies (LSF)
• LPC pole finding → Formants (FMT)
• Atal's recursion → Linear predictive cepstral coefficients (LPCC)
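The simpler conversions in this chain can be sketched directly from the standard textbook formulas. An illustrative sketch: the LPC-to-cepstrum recursion below is the one commonly attributed to Atal, with `a = [1, a1, ..., ap]`; sign conventions differ between sources.

```python
import numpy as np

def refl_to_arcsin(k):
    """Arcus sine coefficients from reflection coefficients."""
    return np.arcsin(k)

def refl_to_lar(k):
    """Log area ratios from reflection coefficients (requires |k| < 1)."""
    return np.log((1 + k) / (1 - k))

def lpc_to_cepstrum(a, n_coeffs):
    """LPCC via the recursion c[n] = -a[n] - sum_{k=1}^{n-1} (k/n) c[k] a[n-k]."""
    p = len(a) - 1
    c = np.zeros(n_coeffs + 1)
    for n in range(1, n_coeffs + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```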
Linear Prediction (LPC) : Number of LPC coefficients
Helsinki TIMIT
• Minimum number around ~ 15 coefficients (not consistent, however)
• Error rates surprisingly small in general !
• LPC coefficients were used directly in a Euclidean-distance-based classifier, even though the literature usually warns: "Do not ever use LPCs directly, at least with the Euclidean metric."
Comparison of the LPC-Derived Features
• Overall performance is very good
• Raw LPC coefficients give the worst performance on average
• Differences between the feature sets are rather small
Other factors to be considered:
• Computational complexity
• Ease of implementation
Fixed parameters: LPC predictor order p = 15
Helsinki TIMIT
A programming bug???
LPC-Derived Formants
Fixed parameters: Codebook size = 64
Helsinki TIMIT
• Formants give comparable, and surprisingly good, results!
• Why "surprisingly good"?
1. The analysis procedure was very simple (it produces spurious formants)
2. Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, by contrast, pick only a small, discrete number of maximum peak amplitudes from the spectrum
Discussion About the LPC-Derived Features
• In general, results are promising, even for the raw LPC coefficients
• The differences between feature sets were small
– From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR and ARCSIN
• Formants also give (surprisingly) good results, which indirectly indicates that:
– regions of the spectrum with high amplitude might be important for speaker recognition
[Figure: magnitude (dB) vs. frequency (Hz) of a speech spectrum]
An idea for future study :
How about selecting subbands around local maxima?
4. Dynamic Features
Dynamic Spectral Features
• Dynamic feature: an estimate of the time derivative of the feature
• Can be applied to any feature
Time trajectory of the original feature
Estimate of the 1st time derivative (Δ-feature)
Estimate of the 2nd time derivative (ΔΔ-feature)
• Two widely used estimation methods are the differentiator and the linear regression method:
    Differentiator: Δc[t] = c[t+M] − c[t−M]
    Regression: Δc[t] = Σ_{m=1}^{M} m·(c[t+m] − c[t−m]) / (2 Σ_{m=1}^{M} m²)
(M = number of neighboring frames, typically M = 1..3)
• Typical phrase : “Don’t use differentiator, it emphasizes noise”
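The two estimators can be sketched for a single feature trajectory. A generic implementation (normalization conventions for the differentiator vary; this sketch leaves it unnormalized):

```python
import numpy as np

def delta_differentiator(c, M=1):
    """Differentiator: d[t] = c[t+M] - c[t-M]; edges left at zero."""
    c = np.asarray(c, dtype=float)
    d = np.zeros_like(c)
    d[M:-M] = c[2 * M:] - c[:-2 * M]
    return d

def delta_regression(c, M=2):
    """Linear-regression delta:
    d[t] = sum_m m*(c[t+m] - c[t-m]) / (2 * sum_m m^2)."""
    c = np.asarray(c, dtype=float)
    d = np.zeros_like(c)
    denom = 2 * sum(m * m for m in range(1, M + 1))
    for m in range(1, M + 1):
        d[M:-M] += m * (np.roll(c, -m) - np.roll(c, m))[M:-M]
    return d / denom
```

On a linear trajectory, the regression estimate recovers the slope exactly, while the unnormalized differentiator scales it by 2M.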
Delta Features: Comparison of the Two Estimation Methods
[Figure: error rates on Helsinki (left) and TIMIT (right) for the differentiator and regression methods]
Differentiator: best Helsinki result Δ-LSF (7.0 %), M=1; best TIMIT result Δ-ARCSIN (8.1 %), M=4
Regression: best Helsinki result Δ-LSF (10.6 %), M=2; best TIMIT result Δ-ARCSIN (8.8 %), M=1
Delta Features: Comparison with the Static Features
Discussion About the Delta Features :
• Optimum order is small (In most cases M=1,2 neighboring frames)
• The differentiator method is better in most cases (surprising result, again!)
• Delta features are worse than static features but might provide uncorrelated extra information (for multiparameter recognition)
• The commonly used delta-cepstrum gives quite poor results !
Towards Concluding Remarks ...
FFT-Cepstrum Revisited
Question: Is Log-Compression / Mel-Cepstrum Best?
Helsinki TIMIT
Answer: NO !
Please note: the segment length is now reduced to T=100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis…)
FFT- vs. LPC-Cepstrum
Question: Is it really true that "FFT-cepstrum is more accurate"?
Helsinki TIMIT
Answer: NO ! (TIMIT shows this quite clearly)
The Essential Difference Between the FFT- and LPC-Cepstra ?
• FFT-cepstrum approximates the spectrum by linear combination of cosine functions (non-parametric model)
• LPC makes a least-squares fit of the all-pole filter to the spectrum (parametric model)
• FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum
LPC captures more “details”
FFT-cepstrum represents “smooth” spectrum
However, one might argue that we could drop the filterbank from the FFT-cepstrum…
General Summary and Discussion
• Number of subbands should be high (30-50 for these corpora)
• Number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)
• In particular, number of subbands, coefficients, and LPC order are clearly higher than in speech recognition generally
• Formants give (surprisingly) good performance
• Number of formants should be high (≥ 8)
• In most cases, the differentiator method outperforms the regression method in delta-feature computation
All of these indicate indirectly the importance of spectral details and rapid spectral changes
“Philosophical Discussion”
• The current knowledge of speaker individuality is far from perfect :
• Engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what is individual in speech
• Phoneticians try to find the “individual code” in the speech signal, but they don’t (necessarily) know how to apply engineers’ methods
• Why do we believe that speech would be any less individual than e.g. fingerprints ?
• Compare the histories of the "fingerprint" and the "voiceprint":
• Fingerprints have been studied systematically since the 17th century (1684)
• The spectrograph wasn't invented until 1946! How could we possibly claim to know what speech is after less than 60 years of research?
• Why do we believe that human beings are optimal speaker discriminators? Our ear can be fooled already (e.g. MP3 encoding).
That’s All, Folks !