AdvAIR

An Advanced Audio Information Retrieval System

Supervised by Prof. Michael R. Lyu
Prepared by Alex Fok, Shirley Ng

2002 Fall

Outline

• Introduction

• System Overview

• Applications

• Experiment

• Future Work

• Q&A

Introduction

Motivation

• Rapid expansion of audio information due to the growth of the Internet

• Little attention has been paid to audio mining

• Lack of a framework for generic audio information processing

Targets

• Open platform that provides a basis for various voice-oriented applications

• Improve audio information retrieval performance while guaranteeing accuracy

• Generic speech analysis tools for data mining

Approaches

• Robust low-level sound information preprocessing module

• Speed-oriented but accurate algorithms

• Generalized model concept for various uses

• A visual framework for presentation

System Design

System Flow Chart

[Figure: system flow chart. Core Platform: Audio Signal, Preprocessing, Features Extraction, Segmentation and Clustering, Training and Modeling. Extended tools: Scene Cutting, Speaker Identification, Linguistic Identification, Video Scene Change and Speaker Tracking. A Database Storage block is also shown.]

Implements

Features Extraction

• Energy Measurement

• Zero Crossing Rate

• Pitch

• Humans resolve frequencies non-linearly across the audio spectrum

• MFCC approach

• Simulates the vocal tract shape
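The first two features in the list above can be sketched in a few lines. This is a minimal illustration, not the AdvAIR implementation: the function names, the frame length and the hop size are all assumptions chosen for the example, and energy is taken here as the mean squared amplitude of a frame.

```python
def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames; a trailing partial frame is dropped."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy signal: a slow square wave followed by a fast one (same amplitude).
signal = [1, 1, 1, 1, -1, -1, -1, -1] + [1, -1, 1, -1, 1, -1, 1, -1]
frames = frame_signal(signal, frame_len=8, hop=8)
features = [(short_time_energy(f), zero_crossing_rate(f)) for f in frames]
```

Both frames have the same energy, but the second has a much higher zero crossing rate, which is why the two measurements are usually used together.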

Features Extraction (cont'd)

• The idea of a filter bank, which approximates the non-linear frequency resolution

• Bins hold a weighted sum representing the spectral magnitude of channels

• Lower and upper frequency cut-offs

[Figure: filter bank, magnitude vs. frequency]
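A filter bank of this kind can be sketched as a set of triangular filters spaced evenly on the mel scale. The slides do not give the filter shapes or parameters, so everything below is an assumption for illustration: the common mel formula 2595·log10(1 + f/700), triangular weights, and the hypothetical parameters `n_filters`, `n_fft_bins`, `f_low` and `f_high`.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft_bins, sample_rate, f_low, f_high):
    """Return n_filters triangular weight vectors over n_fft_bins magnitude bins.
    Bin k is assumed to sit at k * (sample_rate / 2) / (n_fft_bins - 1) Hz."""
    mel_lo, mel_hi = hz_to_mel(f_low), hz_to_mel(f_high)
    # n_filters + 2 edge frequencies, evenly spaced on the mel scale.
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * j / (n_filters + 1))
             for j in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_fft_bins - 1) for k in range(n_fft_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for f in bin_hz:
            if left < f <= center:
                weights.append((f - left) / (center - left))    # rising slope
            elif center < f < right:
                weights.append((right - f) / (right - center))  # falling slope
            else:
                weights.append(0.0)                             # outside the cut-offs
        bank.append(weights)
    return bank

def apply_filter_bank(bank, magnitudes):
    """Each output bin is the weighted sum of the spectral magnitudes."""
    return [sum(w * m for w, m in zip(weights, magnitudes)) for weights in bank]

bank = mel_filter_bank(n_filters=8, n_fft_bins=129, sample_rate=8000,
                       f_low=100.0, f_high=3800.0)
energies = apply_filter_bank(bank, [1.0] * 129)
```

The filters widen toward higher frequencies, which is how the bank approximates the non-linear frequency resolution mentioned above.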

Segmentation

• Segmentation cuts the audio stream at the acoustic change points

• BIC (Bayesian Information Criterion) is used

• It is threshold-free and robust

• The input audio stream is modeled as Gaussians

[Figure: a Gaussian distribution with its mean marked]

Segmentation

• Notation for an audio stream:
  – $N$ : number of frames
  – $X = \{x_i : i = 1, 2, \dots, N\}$ : a set of feature vectors
  – $\mu$ is the mean
  – $\Sigma$ is the full covariance matrix

Segmentation for a single change point

• Assume the change point is at frame $i$

• $H_0$, $H_1$ : two different models

• $H_0$ models the data as one Gaussian
  – $x_1 \dots x_N \sim N(\mu, \Sigma)$

• $H_1$ models the data as two Gaussians
  – $x_1 \dots x_i \sim N(\mu_1, \Sigma_1)$
  – $x_{i+1} \dots x_N \sim N(\mu_2, \Sigma_2)$

[Figure: audio stream from frame 1 to frame N, with the change point at frame i]

Segmentation for a single change point (cont'd)

• The maximum likelihood ratio statistic is

$$R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2|$$

  where $N_1 = i$ and $N_2 = N - i$ are the numbers of frames before and after the change point

Segmentation for a single change point (cont'd)

• $\mathrm{BIC}(i) = R(i) - \lambda P$, where $P$ is the penalty for model complexity and $\lambda$ is the penalty weight

• $\mathrm{BIC}(i)$ is positive: $i$ is the change point

• $\mathrm{BIC}(i)$ is negative: $i$ is not the change point

• Which model fits the data better, a single Gaussian ($H_0$) or two Gaussians ($H_1$)?

[Figure: the same data fitted by model $H_0$ and by model $H_1$]

Segmentation for a single change point (cont'd)

• To detect a single change point, we need to calculate BIC(i) for all i = 1, 2, …, N

• The frame i with the largest BIC value is the change point, provided that value is positive

• O(N) BIC evaluations to detect a single change point
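The single change point test can be sketched as follows. This is a simplified illustration under stated assumptions, not the AdvAIR engine: features are scalar (d = 1), so |Σ| reduces to the variance; the penalty is taken as P = ½(d + ½d(d+1)) log N, the usual choice for this criterion, which the slides do not spell out; and `min_seg` and `var_floor` are hypothetical safeguards against degenerate one-frame segments. Prefix sums make each BIC(i) an O(1) computation, so the whole scan is O(N).

```python
import math

def bic_change_point(x, lam=1.0, min_seg=2, var_floor=1e-6):
    """Return (i, BIC(i)) for the best split of x into x[:i] and x[i:],
    or (None, best score) when no split has a positive BIC."""
    n = len(x)
    if n < 2 * min_seg:
        return None, -math.inf
    # Prefix sums of x and x*x give any segment's variance in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def n_log_var(lo, hi):
        """(hi - lo) * log(variance of x[lo:hi]); the scalar form of N log|Sigma|."""
        m = hi - lo
        mean = (s1[hi] - s1[lo]) / m
        var = (s2[hi] - s2[lo]) / m - mean * mean
        return m * math.log(max(var, var_floor))

    d = 1                                              # assumed feature dimension
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)
    whole = n_log_var(0, n)
    best_i, best_bic = None, -math.inf
    for i in range(min_seg, n - min_seg + 1):
        r = whole - n_log_var(0, i) - n_log_var(i, n)  # R(i)
        bic = r - lam * penalty                        # BIC(i) = R(i) - lambda * P
        if bic > best_bic:
            best_i, best_bic = i, bic
    return (best_i, best_bic) if best_bic > 0 else (None, best_bic)

# Toy stream: 50 frames around 0 followed by 50 frames around 5.
quiet = [0.1, -0.1] * 25
loud = [5.1, 4.9] * 25
i, score = bic_change_point(quiet + loud)
```

Here `i` counts the frames in the first segment, matching the slide's convention that frames 1 to i belong to the first Gaussian.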

Segmentation for multiple change points

• Step 1: Initialize interval [a, b]: set a = 1, b = 2
• Step 2: Detect a change point in interval [a, b] with the BIC single change point detection algorithm
• Step 3: If there is no change point in interval [a, b], set b = b + 1; otherwise let t be the detected change point and set a = t + 1, b = t + 2
• Step 4: Go to Step 2
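The four steps above can be sketched as a growing-window loop. This is an illustrative sketch with several assumptions: scalar features, a penalty of log N (the d = 1 case of the usual BIC penalty), 0-based indices, and hypothetical parameters `start`, `step` and `min_seg`. Setting `start=2` and `step=1` corresponds to the steps as written; the defaults below are larger because variance estimates from only a handful of frames are unreliable in this toy setting.

```python
import math

def detect_one(x, lam=1.0, min_seg=5, var_floor=1e-6):
    """BIC test for one change point in scalar features x.
    Returns the size of the first segment, or None if no split scores above 0."""
    n = len(x)
    s1, s2 = [0.0], [0.0]                  # prefix sums of x and x*x
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def n_log_var(lo, hi):                 # (hi - lo) * log(variance of x[lo:hi])
        m = hi - lo
        mean = (s1[hi] - s1[lo]) / m
        var = (s2[hi] - s2[lo]) / m - mean * mean
        return m * math.log(max(var, var_floor))

    best_i, best_bic = None, 0.0
    for i in range(min_seg, n - min_seg + 1):
        bic = n_log_var(0, n) - n_log_var(0, i) - n_log_var(i, n) - lam * math.log(n)
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i

def detect_all(x, start=20, step=10, lam=1.0):
    """Grow the window [a, b) until a change point appears, then restart after it."""
    changes = []
    a, b = 0, min(start, len(x))           # Step 1
    while True:
        t = detect_one(x[a:b], lam)        # Step 2
        if t is not None:                  # Step 3, change found: restart after it
            a += t
            changes.append(a)
            b = min(a + start, len(x))
        elif b >= len(x):                  # whole stream examined
            break
        else:                              # Step 3, no change: widen the window
            b = min(b + step, len(x))
    return changes

quiet = [0.1, -0.1] * 20                   # 40 frames around 0
loud = [5.1, 4.9] * 20                     # 40 frames around 5
changes = detect_all(quiet + loud + quiet)
```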

Enhanced Implementation Algorithm

• Original multiple change point detection algorithm:
  – Starts detecting change points within 2 frames
  – Increases the investigation interval by 1 frame each time

• Enhanced implementation algorithm:
  – The minimum processing interval used in our engine is 100 frames
  – Increases the investigation interval by 100 frames each time

Enhanced Implementation Algorithm (cont'd)

• Why do we choose to increase the interval by 100 frames?

• If the increase is too large, a scene change may be missed

• It must be smaller than 170 frames because there are around 170 frames in 1 second

• If the increase is too small, processing is too slow
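A rough count shows why the step size matters. The sketch below is a simplified cost model, not a measurement of the AdvAIR engine: it assumes no change point is ever found (the worst case for one segment), counts one BIC evaluation per candidate split in each window, and uses a hypothetical function name `scan_cost`. The clip length of 9 seconds and the rate of about 170 frames per second are taken from the slides.

```python
def scan_cost(n_frames, start, step):
    """Count the windows tried, and the candidate splits scored, while a window
    grows from `start` frames by `step` frames until it covers n_frames."""
    windows, candidates = 0, 0
    size = min(start, n_frames)
    while True:
        windows += 1
        candidates += max(size - 1, 0)     # split points i = 1 .. size - 1
        if size >= n_frames:
            break
        size = min(size + step, n_frames)
    return windows, candidates

frames = 9 * 170                           # a 9-second clip, about 1530 frames
original = scan_cost(frames, start=2, step=1)
enhanced = scan_cost(frames, start=100, step=100)
speed_up = original[1] / enhanced[1]
```

Under this model the original scheme scores 1,169,685 candidate splits across 1,529 windows, while the 100-frame scheme scores 13,514 across 16 windows, roughly an 86-fold reduction.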

Enhanced Implementation Algorithm (cont'd)

• Advantage: speed-up

• Trade-off: the detected change point is not very accurate

• To compensate:
  – Investigate the frames around the change point again
  – Increment the investigation interval by 1 to locate a more accurate change point
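One possible reading of this compensation step is a local re-scan at single-frame resolution around the coarse estimate. The slides do not specify the procedure, so the sketch below is only an assumption: the names `refine`, `radius`, `context` and `min_seg` are hypothetical, features are scalar, and the penalty is log N.

```python
import math

def bic_at(x, i, lam=1.0, var_floor=1e-6):
    """BIC score for splitting scalar features x into x[:i] and x[i:]."""
    def n_log_var(seg):
        mean = sum(seg) / len(seg)
        var = sum((v - mean) ** 2 for v in seg) / len(seg)
        return len(seg) * math.log(max(var, var_floor))
    return n_log_var(x) - n_log_var(x[:i]) - n_log_var(x[i:]) - lam * math.log(len(x))

def refine(x, coarse, radius=10, context=30, min_seg=5):
    """Re-score every frame within `radius` of a coarse change point, using
    `context` frames on each side as the local window, and return the best frame."""
    lo = max(0, coarse - context)
    hi = min(len(x), coarse + context)
    window = x[lo:hi]
    first = max(coarse - radius, lo + min_seg)
    last = min(coarse + radius, hi - min_seg)
    if first > last:                       # window too short to re-score
        return coarse
    return max(range(first, last + 1), key=lambda t: bic_at(window, t - lo))

quiet = [0.1, -0.1] * 20                   # 40 frames around 0
loud = [5.1, 4.9] * 20                     # 40 frames around 5
refined = refine(quiet + loud, coarse=45)  # coarse estimate is 5 frames late
```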

Training and Modeling

• Before doing the various identification tasks, training and modeling are needed
• Probability-based model: a Gaussian Mixture Model (GMM) is used
• GMM is used for language identification, gender identification and speaker identification
• A GMM is composed of many different Gaussian distributions
• A Gaussian distribution is represented by its mean and variance

Gaussian Mixture Model (GMM)

[Figure: model for speaker i, drawn as a set of Gaussian components, each labeled with its own parameters and a weight p]

• To train a model is to calculate the mean, variance and weight (together, λ) for each of the Gaussian distributions
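Written out, the model described above is a weighted sum of Gaussian densities. The notation below is the standard form rather than something taken from the slides: $M$ (the number of components) is an assumed symbol, and $p_k$ follows the weight labels in the figure.

$$p(x \mid \lambda) = \sum_{k=1}^{M} p_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad \sum_{k=1}^{M} p_k = 1, \qquad \lambda = \{p_k, \mu_k, \Sigma_k\}_{k=1}^{M}$$

Training a speaker model then amounts to choosing $\lambda$ so that the speaker's training frames have high likelihood under $p(x \mid \lambda)$.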

Training of speaker GMMs

• Collect sound clips that are long enough for each speaker (e.g. 20 minutes of sound clips)
• Steps for training one speaker model:
  – Step 1. Start with an initial model λ
  – Step 2. Calculate new mean, variance and weighting (new λ) by training
  – Step 3. Use the new λ if it fits the training data better than the old λ
  – Step 4. Repeat Step 2 to Step 3
• Finally, we get a λ that can represent the model
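The training loop above matches the shape of the expectation-maximization (EM) procedure commonly used for GMMs. The slides do not name the update rule, so the sketch below is an assumption: it uses EM on scalar features with a two-component model, and "represents better" is interpreted as a higher log-likelihood on the training data. The names `em_step`, `train`, `var_floor` and `tol` are hypothetical.

```python
import math

def gauss(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, model):
    """model is a list of (weight, mean, variance) tuples."""
    return sum(math.log(sum(w * gauss(x, m, v) for w, m, v in model)) for x in data)

def em_step(data, model, var_floor=1e-3):
    """Step 2: one EM update of weight, mean and variance for every component."""
    resp = []                                   # soft assignment of samples to components
    for x in data:
        p = [w * gauss(x, m, v) for w, m, v in model]
        total = sum(p)
        resp.append([pk / total for pk in p])
    new_model = []
    for k in range(len(model)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        new_model.append((nk / len(data), mean, max(var, var_floor)))
    return new_model

def train(data, model, max_iter=50, tol=1e-6):
    best = log_likelihood(data, model)          # Step 1: score the initial model
    for _ in range(max_iter):                   # Step 4: repeat
        candidate = em_step(data, model)
        score = log_likelihood(data, candidate)
        if score <= best + tol:                 # Step 3: keep the old model if no real gain
            break
        model, best = candidate, score
    return model, best

data = [-0.3, -0.1, 0.0, 0.2, 0.1, 3.8, 4.1, 4.0, 3.9, 4.3]
initial = [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)]    # arbitrary starting guess
model, score = train(data, initial)
```

On this toy data the two components settle on the two clusters of values; real speaker models would use multi-dimensional feature vectors and many more components.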

Applications

• Video scene change and speaker tracking

• Speaker Identification

• Telephony message notification

Video scene change and speaker tracking

[Figure: block diagram with the labels Video Clip, AdvAIR core Segmentation, Speakers Index Information, Timing and Speaker Information, Multimedia Presentation, Video Playing Mechanism]

Usage

• Speaker tracking enhances data mining about a particular person (e.g. a politician at a conference)

• Audio information indexing and sorting for audio library storage

• Serves as an auxiliary tool for video cutting and editing applications

Screenshot

[Screenshot callouts: input clip, multimedia player, time information and indexing]

Speaker Identification

[Figure: block diagram covering a Training Stage and a Testing Stage, with the labels Sound source, Preprocessed speaker clip, GMM Model Training, Speaker Models Database, Speaker Comparison Mechanism, Speaker Identity]
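The comparison step in the testing stage can be sketched as picking the enrolled speaker whose model gives the clip the highest likelihood. This is an illustration under assumptions, since the slides do not describe the comparison mechanism: features are scalar, models are lists of (weight, mean, variance) tuples, and the two speaker models and their names are made up for the example.

```python
import math

def log_gmm(x, model):
    """Log density of scalar x under a GMM given as (weight, mean, variance) tuples."""
    return math.log(sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                        for w, m, v in model))

def score(clip, model):
    """Average per-frame log-likelihood, so clips of different lengths are comparable."""
    return sum(log_gmm(x, model) for x in clip) / len(clip)

def identify(clip, speaker_models):
    """Closed-set decision: return the enrolled speaker whose model scores highest."""
    return max(speaker_models, key=lambda name: score(clip, speaker_models[name]))

# Hypothetical pre-trained models; real ones would come from the training stage.
speaker_models = {
    "speaker_a": [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)],
    "speaker_b": [(0.5, -2.0, 1.0), (0.5, 5.0, 1.0)],
}
clip = [0.2, -0.4, 2.9, 3.1, 0.1]
who = identify(clip, speaker_models)
```

Averaging the log-likelihood per frame is one way to support comparisons on clips of flexible length.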

Usage

• Security authentication

• Speaker identification for telephone-based systems

• Criminal investigation (for example, used in a way similar to fingerprints)

Screenshot

[Screenshot callouts: input source, flexible length comparison, speaker identity, media player for visual verification]

Telephony Message Notification

[Figure: block diagram with the labels Caller phone, User can't listen, Record the message left by the caller, AdvAIR segmentation, GMM model comparison, Desired group model database, Desired group, Non-desired group, Messaging API, Short Message System, E-mail system]

Experiment Results

Threshold-free BIC criterion

Test | Wave length | Actual turning points | False alarms | Missed points | Time used
1    | 9 seconds   | 2                     | 0            | 0             | 2 seconds

Background noise affects accuracy

Enhanced Implementation

Test | Method | Wave length | Actual turning points | False alarms | Missed points | Time used
1    | Old    | 9 seconds   | 2                     | 0            | 0             | 10 seconds
1    | New    | 9 seconds   | 2                     | 0            | 0             | 2 seconds
     | New    |             |                       | 0            | 0             | 4 seconds
     | New    |             |                       | 2            | 0             | 8 seconds
4    | Old    | 540 seconds | 18                    | 7            | 2             | Over 1 day
4    | New    | 540 seconds | 18                    | 8            | 2             | 1200 seconds

The speed-up depends on the number of change points relative to the clip length

GMM model closed-set speaker identification

Training Stage

10 speakers

5 males, 5 females

20 minutes for each speaker

Testing Stage

50 sound clips of 5 seconds each

48 sound clips are identified correctly, i.e. 96%

GMM model open-set speaker identification

• The result is Accept or Reject

• Same setting as closed-set
  – i.e. 10 speakers, with 20 minutes each

• Correct: 45/50 = 90%

• False reject: 3/50 = 6%

• False accept: 2/50 = 4%
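An open-set decision can be sketched as a closed-set match followed by a threshold test. The slides do not describe the actual decision function, so this is only one plausible form: the threshold value, the scalar toy models and all function names are assumptions. The `rates` helper simply reproduces the percentages above from outcome counts.

```python
import math

def score(clip, model):
    """Average per-frame log-likelihood of a clip under a scalar GMM."""
    def log_gmm(x):
        return math.log(sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                            for w, m, v in model))
    return sum(log_gmm(x) for x in clip) / len(clip)

def open_set_identify(clip, speaker_models, threshold):
    """Return the best-matching enrolled speaker, or None to reject the clip."""
    best = max(speaker_models, key=lambda name: score(clip, speaker_models[name]))
    if score(clip, speaker_models[best]) >= threshold:
        return best                    # accept
    return None                        # reject: treated as an unknown speaker

def rates(outcomes):
    """Share of each outcome label in a list of test results."""
    labels = ("correct", "false_reject", "false_accept")
    return {k: outcomes.count(k) / len(outcomes) for k in labels}

# Hypothetical models and threshold, for illustration only.
speaker_models = {
    "speaker_a": [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)],
    "speaker_b": [(0.5, -2.0, 1.0), (0.5, 5.0, 1.0)],
}
known = open_set_identify([0.2, -0.4, 2.9, 3.1, 0.1], speaker_models, threshold=-3.0)
unknown = open_set_identify([10.0, 11.0, 9.5], speaker_models, threshold=-3.0)
summary = rates(["correct"] * 45 + ["false_reject"] * 3 + ["false_accept"] * 2)
```

Raising the threshold trades false accepts for false rejects, and short clips give noisier scores, which makes the threshold harder to set.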

Problems and limitations

• Accuracy is affected by background noise

• Some speakers have very similar sound features

• The open-set speaker identification decision function is not very accurate if the duration is short

• Segmentation is still a time-consuming process

Future Work

• Speaker gender identification

• Robust open-set speaker identification

• Speech content recognition

• Music pattern matching

• Distributed system for segmentation

Q & A
