University of Joensuu Dept. of Computer Science P.O. Box 111 FIN- 80101 Joensuu Tel. +358 13 251 7959 fax +358 13 251 7955 www.cs.joensuu.fi Automatic Speaker Recognition for Series 60 Mobile Devices University of Joensuu, Department of Computer Science Specom’2004, Sep 20, 2004 Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, and Pasi Fränti


Automatic Speaker Recognition for Series 60 Mobile Devices

University of Joensuu, Department of Computer Science

Specom’2004, Sep 20, 2004

Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, and Pasi Fränti


Background

• Project in National FENIX programme
  – New Methods and Applications in Speech Technology
• 7 research institutes
• Project partners: NRC, Lingsoft, National Bureau of Investigation, etc.
• Joensuu: Speaker Recognition
• http://cs.joensuu.fi/pages/pums


Research Group

• Pasi Fränti, Professor
• Juhani Saastamoinen, Project manager
• Evgeny Karpov, Project researcher
• Ville Hautamäki, Project researcher
• Tomi Kinnunen, Researcher
• Ismo Kärkkäinen, Clustering algorithms

PUMS project


Application Scenarios

Speaker Recognition
• Speaker Verification: “Is this Bob’s voice?” (speech + claim); possible verdict: “Imposter!”
• Speaker Identification: “Whose voice is this?”


Project Goal

Port speaker recognition to a Series 60 mobile phone


Symbian Phones

• Series 60 phone features:
  – 16 MB ROM
  – 8 MB RAM
  – 176 x 208 display
  – ARM processor
  – No floating-point unit!!!

[Figure: Symbian phone platforms: Series 80, Series 60, UIQ]


Symbian OS

• Defined by Symbian consortium
• Based on EPOC
• Operating system for mobile phones
  – Real-time system
  – Long uptime required
• Multitasking, multithreading


Problems of Porting

• Usual considerations when porting to a phone
  – GUI event-driven program(ming)
  – Platform-specific programming model
  – Real-time system, exceptions
• Application-specific porting problems
  – Number crunching without a floating-point unit!!!
  – Signal processing is numerically challenging


Identification System

Speech audio → Signal processing / feature extraction → Feature vectors

• Speaker modelling: create a speaker profile
  – Add speaker profiles to the speaker profile database during training
• Speaker recognition: classify input speech based on existing profiles
  – Read and use all profiles during recognition
  – Output: decision
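The recognition step can be sketched as follows. This is a minimal illustration, assuming (as the later VQ/GLA remark suggests) that each speaker profile is a vector-quantization codebook and that the decision picks the profile with the smallest average quantization distortion. The function names, the toy two-dimensional vectors, and the speaker names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def avg_distortion(features, codebook):
    """Mean squared distance from each feature vector to its nearest code vector."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in features) / len(features)

def identify(features, profiles):
    """Return the name of the profile (codebook) that best matches the features."""
    return min(profiles, key=lambda name: avg_distortion(features, profiles[name]))

# Toy profile database: speaker name -> codebook (assumed data, for illustration only).
profiles = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
}
test_features = [(5.2, 4.9), (5.9, 4.1), (5.1, 5.3)]
print(identify(test_features, profiles))  # bob
```

In a real system the feature vectors would be the MFCC vectors produced by the signal-processing stage, and the codebooks would be trained during enrolment.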


MFCC Signal Processing

Digital speech signal frame → Pre-emphasis → Time windowing → DFT → Abs → Filter bank → Log → DCT → Feature vector

• Pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, output 12 MFCCs
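A floating-point sketch of this chain may help make the stages concrete. It uses the parameters listed on the slide (0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, 12 coefficients). The 8 kHz sample rate, the 256-sample frame, the mel formula, the small floor before the logarithm, and all function names are assumptions, and the naive DFT is chosen for brevity, not speed. It is not the authors' implementation.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [f / nyquist * (n_bins - 1) for f in edges_hz]  # fractional bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate=8000, n_filters=30, n_coeffs=12, pre=0.97):
    n = len(frame)
    # Pre-emphasis: y[i] = x[i] - 0.97 * x[i-1]
    x = [frame[0]] + [frame[i] - pre * frame[i - 1] for i in range(1, n)]
    # Time windowing with a Hamming window
    x = [v * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))) for i, v in enumerate(x)]
    # DFT and absolute value for bins 0..n/2 (naive O(n^2) transform)
    n_bins = n // 2 + 1
    mag = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
           for k in range(n_bins)]
    # Mel filter bank followed by base-2 logarithm (floored to avoid log of zero)
    bank = mel_filterbank(n_filters, n_bins, sample_rate)
    log_e = [math.log2(max(sum(w * m for w, m in zip(row, mag)), 1e-10)) for row in bank]
    # DCT-II; coefficient 0 is skipped, coefficients 1..n_coeffs are returned
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters) for j, e in enumerate(log_e))
            for c in range(1, n_coeffs + 1)]

frame = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(256)]
print(len(mfcc(frame)))  # 12
```

Every stage here uses floating-point arithmetic; the fixed-point slides that follow are about replacing exactly these operations with integer ones.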


Fixed-Point Implementation

• Numerical analysis needed for fixed-point arithmetic implementation

• Truncation and re-scaling to avoid overflows in the converted algorithm

• Minimize information loss caused by computation in fixed-point arithmetic
  – Minimize relative error


FFT, Fixed-Point

• Frequency spectrum of speech
  – Biggest source of numerical error
  – Butterflies have multiplications
  – Layers repeat truncation errors
• Fixed number of bits per element
  – 32, native integer size in many systems
• Reference implementation: FFTGEN
  – http://www.jjj.de/fft/fftgen.tgz


FFTGEN (16/16)

• Multiplication: the result of a 32 x 32-bit multiplication must fit in 32 bits, so the inputs are truncated
• FFTGEN: truncate inputs to 16/16 bits

FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer)
→ 32-bit multiplication result: 16 used bits, 16 crop-off bits
→ FFT layer output (part of it), 16-bit integer; crop-off for next layer: 16 bits!
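The bit-width argument can be checked with a few lines of integer arithmetic. This is only an illustration of why 16-bit operands are safe in a 32-bit register; it is not FFTGEN code, and the helper names are hypothetical.

```python
def bits_needed(v):
    """Bits required to hold v as a signed two's-complement integer."""
    return (v if v >= 0 else ~v).bit_length() + 1

def fits(v, bits):
    """True if v is representable as a signed integer with the given width."""
    return -(1 << (bits - 1)) <= v < (1 << (bits - 1))

def mul_16_16(layer_input, twiddle):
    """Multiply two 16-bit operands and crop the 16 low bits of the product."""
    assert fits(layer_input, 16) and fits(twiddle, 16)
    product = layer_input * twiddle
    assert fits(product, 32)      # always holds for 16-bit operands
    return product >> 16          # 16 used bits go on to the next layer

worst_32 = (-2**31) * (-2**31)    # largest product of two 32-bit operands
worst_16 = (-2**15) * (-2**15)    # largest product of two 16-bit operands
print(bits_needed(worst_32))      # 64
print(bits_needed(worst_16))      # 32
```

The cost of this safety is that every layer discards the 16 low bits of the product, which is the truncation error the following slides try to reduce.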


Info Preserving FFT (22/10)

• Approximate DFT operator F with G
• Increase ||F-G||, preserve more signal information
  – Minimize the maximum relative error in scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
  – Truncate multiplication inputs to 22/10 bits (signal/op)

FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used)
→ 32-bit multiplication result: 22 used bits, 10 crop-off bits
→ FFT layer output (part of it), 32-bit integer; crop-off for next layer: 10 bits
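The trade-off between the two bit splits can be illustrated with a single multiplication. The sketch below is a rough model under stated assumptions, not the authors' code: the signal is taken to be a 22-bit signed value, the 16/16 variant drops its 6 low bits and scales the twiddle factor by 32767, and the 22/10 variant keeps all signal bits and scales the twiddle factor by 980. The scaling conventions, function names, and test values are all hypothetical.

```python
import math

def mul_fixed(signal, w, sig_drop, op_scale, crop):
    """Estimate signal * w with truncated integer operands.

    sig_drop: low bits dropped from the signal before multiplying
    op_scale: integer scale applied to the twiddle factor w
    crop:     low bits cropped from the 32-bit product
    The result is rescaled to the units of `signal` so it can be compared
    with the exact product.
    """
    s = signal >> sig_drop
    t = round(op_scale * w)
    product = s * t
    assert -2**31 <= product < 2**31       # must fit a 32-bit register
    out = product >> crop
    return out * (2**crop / op_scale) * 2**sig_drop

def rel_error(signal, w, **fmt):
    exact = signal * w
    return abs(mul_fixed(signal, w, **fmt) - exact) / abs(exact)

FMT_16_16 = dict(sig_drop=6, op_scale=32767, crop=16)   # assumed 16/16 convention
FMT_22_10 = dict(sig_drop=0, op_scale=980, crop=10)     # assumed 22/10 convention

w = math.sin(1.0)
for signal in (1000, 30000, 2**21 - 1):
    print(signal, rel_error(signal, w, **FMT_16_16), rel_error(signal, w, **FMT_22_10))
```

In this model the 22/10 split is far more accurate for small-magnitude values, because no signal bits are thrown away before the multiplication. Near full scale the coarser twiddle factor can make it slightly less accurate, which is the price named on the slide as an increased ||F-G||.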


FFT Spectrum, Fixed-Point

[Figure: scatter plots of 16/16 and 22/10 abs values, for the original TIMIT signal and for the TIMIT signal x 4]

• x-axis: fixed-point FFT element abs. values

• y-axis: correct FFT element abs. values


Scale of Error in Proposed FFT

[Figure: log10 of relative error in FFT elements; panels: 16/16, 22/10]

Log10 of relative error in FFT elements:

                     16/16     22/10
average              -0.775    -2.118
standard deviation    0.797     0.590


Magnitude Spectrum, Fixed-Point

• Compute complex absolute values using the maximum coordinate and the coordinate ratio
• Suppose |x| > |y| for z = x + i y, then

  $|z| = \sqrt{x^2 + y^2} = |x| \sqrt{1 + (y/x)^2}$

• Denote the squared ratio (y/x)^2 by t
• Approximate the square root by a polynomial P(t)
• Constant-time algorithm (vs. Newton iteration)
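An integer-only sketch of this idea is shown below. The slides do not give the polynomial, so the quadratic used here (interpolating sqrt(1 + t) at t = 0, 0.5 and 1) is an assumption, as are the Q15 format and the function name. Overflow limits for a real 32-bit target are not checked in this sketch.

```python
import math

Q = 15                      # fractional bits of the fixed-point format (assumed)
ONE = 1 << Q

# Quadratic P(t) = 1 + b*t + c*t^2 matching sqrt(1 + t) at t = 0, 0.5, 1.
# The coefficients are computed once in floating point and stored as Q15 constants.
_s1, _s2 = math.sqrt(1.5), math.sqrt(2.0)
_b = 4.0 * (_s1 - 1.0) - (_s2 - 1.0)
_c = (_s2 - 1.0) - _b
B = round(_b * ONE)
C = round(_c * ONE)

def fixed_magnitude(x, y):
    """Approximate |x + iy| for integers x, y with integer operations only."""
    ax, ay = abs(x), abs(y)
    mx, mn = max(ax, ay), min(ax, ay)
    if mx == 0:
        return 0
    r = (mn << Q) // mx                     # ratio min/max in Q15, 0..ONE
    t = (r * r) >> Q                        # squared ratio in Q15
    p = ONE + ((t * (B + ((C * t) >> Q))) >> Q)   # P(t) in Q15, Horner form
    return (mx * p) >> Q

print(fixed_magnitude(30000, 40000), math.hypot(30000, 40000))
```

The work per element is fixed (one division, a few multiplications and shifts), in contrast to a Newton iteration whose number of steps depends on the required accuracy.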


Logarithm, Fixed-Point

• Use base 2 instead of base 10
  – changing the base corresponds to multiplying the output by a constant
• Standard technique:
  – Reduce the problem to the interval [1,2)
  – Use linear interpolation between values stored in a look-up table
  – 8 bits used for indexing the look-up table values
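The standard technique can be sketched as follows. The 8-bit table index comes from the slide; the Q16 output format, the 257-entry table layout, and the function name are assumptions made for this illustration.

```python
import math

FRAC = 16   # fractional bits of the result (assumed Q16 format)

# log2(1 + i/256) for i = 0..256, stored as Q16 integers. The extra last
# entry lets the interpolation read TABLE[idx + 1] without a special case.
TABLE = [round(math.log2(1.0 + i / 256.0) * (1 << FRAC)) for i in range(257)]

def fixed_log2(v):
    """Return log2(v) in Q16 for a positive integer v."""
    if v <= 0:
        raise ValueError("log2 is defined for positive values only")
    e = v.bit_length() - 1                 # integer part: v = m * 2**e, m in [1, 2)
    if e >= FRAC:
        m = v >> (e - FRAC)                # mantissa in Q16
    else:
        m = v << (FRAC - e)
    frac = m - (1 << FRAC)                 # m - 1 in Q16, range 0..65535
    idx = frac >> 8                        # top 8 bits index the table
    rem = frac & 0xFF                      # low 8 bits drive the interpolation
    lo, hi = TABLE[idx], TABLE[idx + 1]
    return (e << FRAC) + lo + (((hi - lo) * rem) >> 8)

print(fixed_log2(8) / (1 << FRAC))   # 3.0
```

Only shifts, one table look-up pair, and one small multiplication are needed at run time; the table itself would be precomputed and stored as constants on a device without a floating-point unit.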


Rest of System, Fixed-Point

• No improvement needed in VQ/GLA
• Should apply a similar technique as with FFT to other signal processing
  – Pre-emphasis: utilize full 32 bits
  – Time windowing: use fewer bits in the windowing function
  – Filter bank (FB): use fewer bits in frequency responses
  – DCT: use fewer bits for the cosines


Effect of Signal Processing

• TIMIT data sets, varying number of speakers (N)
• For each N, repeat (6x, 5x, 2x) train/recognize cycles (to eliminate the randomness of the GLA initial solution)
• FFTGEN: FFT with 16/16 multiplication
• Fixed-point: use proposed 22/10 FFT
• Mixed: floating-point DSP, fixed-point GLA/VQ

                 N=10 (6x)   N=20 (5x)   N=100 (2x)
FFTGEN           93.3%       68.0%       59.5%
Fixed-point      98.3%       95.0%       82.5%
Mixed            100.0%      100.0%      100.0%
Floating-point   100.0%      100.0%      100.0%


Effect of Signal Quality

• GSM/PC data: 16 aligned dual recordings

• All computations in floating-point arithmetic

• Signal recorded with laptop and PC mic gives average recognition rate 100%

• Signal recorded with Nokia 3660 results in average recognition rate 84.9%

                13/16   14/16   15/16   16/16
Symbian audio   1       3       3       10
PC audio        0       0       0       17


Conclusion

• Speaker identification was ported to a Symbian Series 60 mobile phone

• 22/10 bit usage in multiplication proposed instead of the “standard” 16/16

• Experiments indicate that recognition accuracy improves from 68% to 95% (TIMIT, N=20)