Speech Processing Speaker Recognition

Veton Këpuska, 18 February 2016





Speaker Recognition: Definitions
- Speaker Identification: given a set of models obtained for a number of known speakers, decide which voice model best characterizes a speaker.
- Speaker Verification: decide whether a speaker corresponds to a particular known voice or to some other unknown voice.
- Claimant: an individual who is correctly posing as one of the known speakers.
- Impostor: an unknown speaker who is posing as a known speaker.
- Error types: false acceptance and false rejection.

Steps in Speaker Recognition
- Model building: build a model for each target speaker (claimant) and for a number of background (impostor) speakers.
- Speaker-dependent features: oral and nasal tract length and cross-section during different sounds, vocal fold mass and shape, and the location and size of the false vocal folds; ideally these would be accurately measured from the speech waveform.
- Training data + model-building procedure ⇒ generated models.
- In practice it is difficult to derive speech-anatomy features from the speech waveform, so conventional feature-extraction methods are used instead: a constant-Q filter bank and spectral-based features.

Speaker Recognition System (block diagram)
- Training: training speech data (e.g., Linda, Kay, Joe) → feature extraction → training → target and background speaker models.
- Testing: testing speech data (e.g., Tom) → feature extraction → recognition against the models → decision ("Tom" / "Not Tom").

Spectral Features for Speaker Recognition
- Attributes of the human voice:
  - High-level attributes (difficult to extract from the speech waveform): clarity, roughness, magnitude, animation, and prosody (pitch intonation, articulation rate, and dialect).
  - Low-level attributes (easy to extract from the speech waveform): vocal tract spectrum, instantaneous pitch, glottal flow excitation, source event onset times, and modulations in formant trajectories.
- We want the feature set to reflect the unique characteristics of a speaker.
- The short-time Fourier transform (STFT) magnitude captures the vocal tract resonances; vocal tract anti-resonances are also important for speaker identifiability, and the general trend of the envelope of the STFT magnitude is influenced by the coarse component of the glottal flow derivative.
- The fine structure of the STFT is characterized by speaker-dependent features: pitch, glottal flow, and distributed acoustic effects.
- Speaker recognition systems use a smooth representation of the STFT magnitude: vocal tract resonances and spectral tilt.
- Auditory-based features are superior to the conventional features: the all-pole LPC spectrum, the homomorphically filtered spectrum, homomorphic prediction, etc.

Mel-Cepstrum (Davis & Mermelstein)

Short-Time Fourier Analysis (Time-Dependent Fourier Transform)
- (Figure-only slides: Rectangular Window; Hamming Window; Comparison of Windows; Comparison of Windows, cont'd; A Wideband Spectrogram; A Narrowband Spectrogram. A short code sketch of this kind of analysis follows below.)
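As a concrete illustration of the short-time analysis above, the following is a minimal NumPy sketch (not from the original slides) of a Hamming-windowed STFT magnitude; the window length controls whether the resulting spectrogram is wideband (short window) or narrowband (long window). The function and signal names are illustrative only.

```python
import numpy as np

def stft_magnitude(x, frame_len, hop, n_fft):
    """Hamming-windowed short-time Fourier transform magnitude."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame, n_fft)))  # zero-padded DFT of the windowed frame
    return np.array(frames)  # shape: (num_frames, n_fft // 2 + 1)

# Toy signal: 1 s of two harmonics at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)

wideband = stft_magnitude(x, frame_len=48, hop=16, n_fft=256)     # ~6 ms window
narrowband = stft_magnitude(x, frame_len=240, hop=80, n_fft=512)  # ~30 ms window
```

Smooth spectral representations such as mel-cepstral coefficients are then computed from this magnitude.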
Discrete Fourier Transform
- In general, the number of input points, N, and the number of frequency samples, M, need not be the same.
- If M > N, we must zero-pad the signal; if M < N, we must time-alias the signal.

Bayes Decision Rule
- Minimum-error rule: choose ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i.
- For a two-class problem this decision rule means: choose ω_1 if P(ω_1|x) > P(ω_2|x), else choose ω_2.
- This rule can be expressed as a likelihood ratio: choose ω_1 if p(x|ω_1)/p(x|ω_2) > P(ω_2)/P(ω_1).

Bayes Risk
- Define a cost function λ_ij and the conditional risk R(ω_i|x):
  - λ_ij is the cost of classifying x as ω_i when it is really ω_j.
  - R(ω_i|x) is the risk of classifying x as class ω_i: R(ω_i|x) = Σ_j λ_ij P(ω_j|x).
- The Bayes risk is the minimum risk which can be achieved: choose ω_i if R(ω_i|x) < R(ω_j|x) for all j ≠ i.
- The Bayes risk corresponds to the minimum P(error|x) when all errors have equal cost (λ_ij = 1 for i ≠ j) and there is no cost for being correct (λ_ii = 0).

Discriminant Functions
- An alternative formulation of the Bayes decision rule: define a discriminant function g_i(x) for each class ω_i, and choose ω_i if g_i(x) > g_j(x) for all j ≠ i.
- Functions yielding identical classification results: g_i(x) = P(ω_i|x), g_i(x) = p(x|ω_i)P(ω_i), or g_i(x) = log p(x|ω_i) + log P(ω_i).
- The choice of function impacts computation costs.
- Discriminant functions partition the feature space into decision regions, separated by decision boundaries.

Density Estimation
- Used to estimate the underlying PDF p(x|ω_i).
- Parametric methods: assume a specific functional form for the PDF and optimize its parameters to fit the data.
- Non-parametric methods: determine the form of the PDF from the data; the parameter set grows with the amount of data.
- Semi-parametric methods: use a general class of functional forms for the PDF, vary the parameter set independently of the data, and use unsupervised methods to estimate the parameters.

Parametric Classifiers
- Gaussian distributions, maximum likelihood (ML) parameter estimation, multivariate Gaussians, and Gaussian classifiers.

Gaussian Distributions
- Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference.
- Simple estimation procedures exist for the model parameters, and classification is often reduced to simple distance metrics.
- Gaussian distributions are also called Normal distributions.

Gaussian Distributions: One Dimension
- A one-dimensional Gaussian PDF can be expressed as p(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)).
- The PDF is centered around the mean μ, and its spread is determined by the variance σ².

Maximum Likelihood Parameter Estimation
- ML parameter estimation determines an estimate θ̂ of the parameter θ by maximizing the likelihood L(θ) of observing the data X = {x_1, ..., x_n}.
- Assuming independent, identically distributed data, L(θ) = p(X|θ) = Π_i p(x_i|θ).
- ML solutions can often be obtained via the derivative: ∂L(θ)/∂θ = 0.
- For Gaussian distributions, log L(θ) is easier to solve.

Gaussian ML Estimation: One Dimension
- The maximum likelihood estimate of μ is the sample mean: μ̂ = (1/n) Σ_i x_i.
- The maximum likelihood estimate of σ² is σ̂² = (1/n) Σ_i (x_i − μ̂)².
- (Figure-only slides: Gaussian ML estimation in one dimension; ML estimation for alternative distributions.)

Gaussian Distributions: Multiple Dimensions (Multivariate)
- A multi-dimensional Gaussian PDF can be expressed as p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)^t Σ^(−1) (x − μ)), where
  - d is the number of dimensions,
  - x = {x_1, ..., x_d} is the input vector,
  - μ = E(x) = {μ_1, ..., μ_d} is the mean vector,
  - Σ = E((x − μ)(x − μ)^t) is the covariance matrix with elements σ_ij, inverse Σ^(−1), and determinant |Σ|,
  - σ_ij = σ_ji = E((x_i − μ_i)(x_j − μ_j)) = E(x_i x_j) − μ_i μ_j.
- (A short code sketch of these ML estimates and of the Gaussian density follows below.)
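The following is a hedged NumPy sketch of the ML estimates μ̂ and Σ̂ and of the multivariate Gaussian log-density defined above; it is not code from the lecture, and all names are illustrative.

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates: sample mean and (1/n) sample covariance of the rows of X."""
    mu = X.mean(axis=0)
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]
    return mu, sigma

def gaussian_log_pdf(x, mu, sigma):
    """log N(x; mu, sigma) for a d-dimensional observation x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (diff @ np.linalg.solve(sigma, diff) + d * np.log(2 * np.pi) + logdet)

# Toy check on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, sigma_hat = ml_gaussian(X)
print(mu_hat, sigma_hat, gaussian_log_pdf(np.zeros(2), mu_hat, sigma_hat))
```

The one-dimensional estimates μ̂ = (1/n) Σ_i x_i and σ̂² = (1/n) Σ_i (x_i − μ̂)² are the special case d = 1.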
Gaussian Distributions: Multi-Dimensional Properties
- If the i-th and j-th dimensions are statistically or linearly independent, then E(x_i x_j) = E(x_i)E(x_j) and σ_ij = 0.
- If all dimensions are statistically or linearly independent, then σ_ij = 0 for all i ≠ j, and Σ has non-zero elements only on the diagonal.
- If the underlying density is Gaussian and Σ is a diagonal matrix, then the dimensions are statistically independent and p(x) = Π_i p(x_i), with p(x_i) ~ N(μ_i, σ_ii).
- (Figure-only slides: diagonal covariance matrix Σ = σ²I; diagonal covariance matrix with σ_ij = 0 for i ≠ j; general covariance matrix with σ_ij ≠ 0.)

Multivariate ML Estimation
- The ML estimates of the parameters θ = {θ_1, ..., θ_l} are determined by maximizing the joint likelihood L(θ) of a set of i.i.d. data X = {x_1, ..., x_n}.
- To find θ̂ we solve ∇_θ L(θ) = 0, or ∇_θ log L(θ) = 0.
- The ML estimates of μ and Σ are μ̂ = (1/n) Σ_i x_i and Σ̂ = (1/n) Σ_i (x_i − μ̂)(x_i − μ̂)^t.

Multivariate Gaussian Classifier
- Requires a mean vector μ_i and a covariance matrix Σ_i for each of the M classes {ω_1, ..., ω_M}.
- The minimum-error discriminant functions are of the form g_i(x) = log p(x|ω_i) + log P(ω_i) = −½ (x − μ_i)^t Σ_i^(−1) (x − μ_i) − ½ log |Σ_i| − (d/2) log(2π) + log P(ω_i).
- Classification can be reduced to simple distance metrics in many situations.

Gaussian Classifier: Σ_i = σ²I
- Each class has the same covariance structure: statistically independent dimensions with variance σ².
- The equivalent discriminant functions are g_i(x) = −‖x − μ_i‖²/(2σ²) + log P(ω_i).
- If each class is equally likely, this is a minimum-distance classifier, a form of template matching.
- The discriminant functions can be replaced by the linear expression g_i(x) = w_i^t x + w_i0, where w_i = μ_i/σ² and w_i0 = −μ_i^t μ_i/(2σ²) + log P(ω_i).
- For distributions with a common covariance structure, the decision regions are separated by hyper-planes.

Gaussian Classifier: Σ_i = Σ
- Each class has the same covariance structure Σ.
- The equivalent discriminant functions are g_i(x) = −½ (x − μ_i)^t Σ^(−1) (x − μ_i) + log P(ω_i).
- If each class is equally likely, the minimum-error decision rule is the squared Mahalanobis distance.
- The discriminant functions remain linear expressions g_i(x) = w_i^t x + w_i0, where w_i = Σ^(−1) μ_i and w_i0 = −½ μ_i^t Σ^(−1) μ_i + log P(ω_i).

Gaussian Classifier: Σ_i Arbitrary
- Each class has a different covariance structure Σ_i.
- The equivalent discriminant functions are g_i(x) = −½ (x − μ_i)^t Σ_i^(−1) (x − μ_i) − ½ log |Σ_i| + log P(ω_i).
- The discriminant functions are inherently quadratic: g_i(x) = x^t W_i x + w_i^t x + w_i0, where W_i = −½ Σ_i^(−1), w_i = Σ_i^(−1) μ_i, and w_i0 = −½ μ_i^t Σ_i^(−1) μ_i − ½ log |Σ_i| + log P(ω_i).
- For distributions with arbitrary covariance structures, the decision boundaries are hyperquadrics (quadratic surfaces) rather than hyper-planes.

3-Class Classification (Atal & Rabiner, 1976)
- Distinguish between silence, unvoiced, and voiced sounds.
- Uses 5 features: zero-crossing count, log energy, normalized first autocorrelation coefficient, first predictor coefficient, and normalized prediction error.
- Multivariate Gaussian classifier with ML estimation; decision by the squared Mahalanobis distance.
- Trained on four speakers (2 sentences per speaker), tested on 2 speakers (1 sentence per speaker).
- (A code sketch of such a Gaussian classifier follows below.)
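A minimal sketch of a multivariate Gaussian classifier along the lines described above, assuming per-class ML parameter estimates and the quadratic discriminant g_i(x); with equal priors and a covariance matrix shared across classes, the decision reduces to the squared Mahalanobis distance. This is not code from the original slides, and all names are illustrative.

```python
import numpy as np

def train_gaussian_classifier(X, y):
    """Per-class ML mean, covariance, and prior for a quadratic Gaussian classifier."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        sigma = diff.T @ diff / len(Xc)
        models[c] = (mu, sigma, len(Xc) / len(X))
    return models

def discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * diff @ np.linalg.solve(sigma, diff) - 0.5 * logdet + np.log(prior)

def classify(x, models):
    """Choose the class whose discriminant value is largest."""
    return max(models, key=lambda c: discriminant(x, *models[c]))
```

For a task like the Atal & Rabiner classifier, X would hold the 5 features per frame and y the silence/unvoiced/voiced labels.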
Maximum A Posteriori Parameter Estimation
- Bayesian estimation approaches assume that the form of the PDF p(x|θ) is known, but that the value of θ is not.
- Knowledge of θ is contained in an initial a priori PDF p(θ) and in a set of i.i.d. data X = {x_1, ..., x_n}.
- The desired PDF for x is of the form p(x|X) = ∫ p(x|θ) p(θ|X) dθ.
- The value θ̂ that maximizes the posterior p(θ|X) is called the maximum a posteriori (MAP) estimate of θ.

Gaussian MAP Estimation: One Dimension
- For a Gaussian distribution with unknown mean μ, the MAP estimates of μ and of x are obtained from the posterior p(μ|X).
- As n increases, p(μ|X) concentrates around μ̂, and p(x|X) converges to the ML estimate, ~ N(μ̂, σ²).

References
- Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall.
- Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons.
- Atal and Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Trans. ASSP, 24(3), 1976.

Speaker Recognition Algorithm
- Minimum-distance classifier.
- Vector quantization with a class-dependent distance.

Gaussian Mixture Model (GMM)
- Speech production is not deterministic: phones are never produced by a speaker with exactly the same vocal tract shape and glottal flow, due to variations in context, coarticulation, anatomy, and fluid dynamics.
- One of the best ways to represent this variability is through multi-dimensional Gaussian PDFs; in general, a mixture of Gaussians is used to represent a class PDF.

Mixture Densities
- The PDF is composed of a mixture of m component densities {ω_1, ..., ω_m}: p(x) = Σ_{j=1}^{m} p(x|ω_j) P(ω_j).
- The component PDF parameters and the mixture weights P(ω_j) are typically unknown, making parameter estimation a form of unsupervised learning.
- Gaussian mixtures assume Normal components: p(x|ω_j) ~ N(μ_j, Σ_j).

Gaussian Mixture Example: One Dimension
- p(x) = 0.6 p_1(x) + 0.4 p_2(x), where p_1 and p_2 are Gaussian components with equal variance σ², the first with a negative mean and the second with mean 1.5.
- (Figure-only slides: Gaussian PDF fit to the first 9 MFCCs of [s]; independent mixtures and mixture components for [s] with 2 Gaussian mixture components per dimension; ML parameter estimation of 1-D Gaussian mixture means.)

Gaussian Mixtures: ML Parameter Estimation
- The maximum likelihood solutions are of the form P̂(ω_k) = (1/n) Σ_i P(ω_k|x_i), μ̂_k = Σ_i P(ω_k|x_i) x_i / Σ_i P(ω_k|x_i), and Σ̂_k = Σ_i P(ω_k|x_i)(x_i − μ̂_k)(x_i − μ̂_k)^t / Σ_i P(ω_k|x_i).
- The ML solutions are typically solved iteratively: select a set of initial estimates for P(ω_k), μ_k, Σ_k, then use the set of n samples to re-estimate the mixture parameters until some kind of convergence is found.
- Clustering procedures, similar to the K-means clustering procedure, are often used to provide the initial parameter estimates.

Example: 4 Samples, 2 Densities
1. Data: X = {x_1, x_2, x_3, x_4} = {2, 1, −1, −2}.
2. Initialization: p(x|ω_1) ~ N(1, 1), p(x|ω_2) ~ N(−1, 1), P(ω_i) = 0.5.
3. Estimate the posteriors P(ω_1|x_i) and P(ω_2|x_i) for each sample; up to the Gaussian normalization constants, p(X) = (e^(−0.5) + e^(−4.5))(e^0 + e^(−2))(e^0 + e^(−2))(e^(−0.5) + e^(−4.5))(0.5)^4.
4. Recompute the mixture parameters (shown only for ω_1).
5. Repeat steps 3 and 4 until convergence (sketched in code below).
- (Figure-only slides: [s] duration modeled with 2 densities; Gaussian mixture example in two dimensions; two-dimensional mixtures and their components.)
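A small NumPy sketch of the iterative re-estimation on the 4-sample example above, assuming the slide's initialization is N(1, 1) and N(−1, 1) with equal weights (the extracted text is garbled at that point); all names are illustrative, not code from the lecture.

```python
import numpy as np

def em_1d_gmm(x, mu, var, w, iters=10):
    """Iterative re-estimation of means, variances, and weights for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    mu, var, w = np.array(mu, float), np.array(var, float), np.array(w, float)
    for _ in range(iters):
        # E-step: posteriors P(omega_k | x_i) for each sample and component (step 3)
        like = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = like * w
        post /= post.sum(axis=1, keepdims=True)
        # M-step: posterior-weighted ML re-estimates (step 4)
        nk = post.sum(axis=0)
        w = nk / len(x)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, w

# The 4-sample, 2-density example: X = {2, 1, -1, -2}
print(em_1d_gmm([2, 1, -1, -2], mu=[1.0, -1.0], var=[1.0, 1.0], w=[0.5, 0.5]))
```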
Mixture of Gaussians: Implementation Variations
- Diagonal Gaussians are often used instead of full-covariance Gaussians: they reduce the number of parameters and can potentially model the underlying PDF just as well if enough components are used.
- Mixture parameters are often constrained to be the same in order to reduce the number of parameters that need to be estimated:
  - Richter Gaussians share the same mean in order to better model the PDF tails.
  - Tied mixtures share the same Gaussian parameters across all classes; only the mixture weights P(ω_i) are class-specific (also known as semi-continuous).
- (Figure-only slide: [s] log duration modeled with 2 Richter Gaussians.)

Expectation-Maximization (EM)
- Used for determining the parameters θ for incomplete data X = {x_i} (i.e., unsupervised learning problems).
- Introduces a variable Z = {z_j} to make the data complete, so that θ can be solved for using conventional ML techniques.
- In reality, z_j can only be estimated by P(z_j|x_i, θ), so we can only compute the expectation of log L(θ).
- EM solutions are computed iteratively until convergence:
  1. Compute the expectation of log L(θ).
  2. Compute the values θ_j that maximize this expectation.

EM Parameter Estimation: 1-D Gaussian Mixture Means
- Let z_i be the component identity ω_j to which x_i belongs; convert to mixture-component notation and differentiate with respect to μ_k.

EM Properties
- Each iteration of EM will increase the likelihood of X (shown using Bayes' rule and the Kullback-Leibler distance metric).
- Since θ̂ was determined to maximize E(log L(θ)), combining these two properties gives p(X|θ̂) ≥ p(X|θ).

Dimensionality Reduction
- Given a training set, PDF parameter estimation becomes less robust as dimensionality increases, and increasing dimensions can make it more difficult to obtain insight into any underlying structure.
- Analytical techniques exist that can transform a sample space to a different set of dimensions:
  - If the original dimensions are correlated, the same information may require fewer dimensions.
  - The transformed space will often have a more Normal distribution than the original space.
  - If the new dimensions are orthogonal, it can be easier to model the transformed space.

Principal Components Analysis
- Linearly transforms a d-dimensional vector x to a d′-dimensional vector y via orthonormal vectors W: y = W^t x, with W = {w_1, ..., w_d′} and W^t W = I.
- If d′ < d, the transformation reduces the dimensionality of the feature space (a sketch follows below).
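A brief NumPy sketch of PCA as described above, assuming the conventional mean-removal step and using the leading eigenvectors of the sample covariance as the orthonormal basis W; names are illustrative, not from the original slides.

```python
import numpy as np

def pca(X, d_prime):
    """Project the rows of X onto the d' leading eigenvectors of the sample covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d_prime]  # keep the d' largest
    W = eigvecs[:, order]                        # orthonormal columns: W^t W = I
    return Xc @ W, W                             # y = W^t (x - mu)

# Example: reduce synthetic, correlated 5-D data to 2 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Y, W = pca(X, d_prime=2)
```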