AdvAIR
An Advanced Audio Information Retrieval System
Supervised by Prof. Michael R. Lyu
Prepared by Alex Fok, Shirley Ng
2002 Fall
Outline
• Introduction
• System Overview
• Applications
• Experiment
• Future Work
• Q&A
Introduction

Motivation
• Rapid expansion of audio information due to the blooming of the Internet
• Little attention has been paid to audio mining
• Lack of a framework for generic audio information processing
Targets
• An open platform that can provide a basis for various voice-oriented applications
• Enhance audio information retrieval performance with guaranteed accuracy
• Generic speech analysis tools for data mining
Approaches
• A robust low-level sound information preprocessing module
• Speed-oriented yet accurate algorithms
• A generalized model concept for various usages
• A visual framework for presentation
System Design

System Flow Chart
[Flow chart: the audio signal enters the core platform, which performs preprocessing, features extraction, segmentation and clustering, and training and modeling over a database storage layer; extended tools built on the core include scene cutting, speaker identification, linguistic identification, and video scene change and speaker tracking.]

Implementation
Features Extraction
• Energy measurement
• Zero crossing rate
• Pitch
• Humans resolve frequencies non-linearly across the audio spectrum
• MFCC approach
  – Simulates the vocal tract shape
Features Extraction (con't)
• The idea of a filter bank, which approximates the non-linear frequency resolution
• Bins hold a weighted sum representing the spectral magnitude of channels
• Lower and upper frequency cut-offs
[Figure: triangular filter bank, magnitude vs. frequency]
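As a concrete illustration of the filter-bank idea above, the sketch below builds a triangular mel filter bank with numpy. The filter count, FFT size, and sample rate are assumed illustrative values; the slides do not specify the system's actual parameters.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, fmin=0.0, fmax=None):
    """Triangular filter bank on the mel scale: bin centers are spaced
    linearly in mel (so non-linearly in Hz, like human hearing), and each
    bin holds a weighted sum of spectral magnitudes between its lower and
    upper frequency cut-offs."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters triangles need n_filters + 2 edge points, evenly spaced in mel
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):            # rising edge of the triangle
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):            # falling edge
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb
```

Multiplying this matrix by a frame's magnitude spectrum yields the per-bin weighted sums; taking logs and a DCT of those bins gives the MFCCs.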
Segmentation
• Segmentation is to cut the audio stream at the acoustic change points
• BIC (Bayesian Information Criterion) is used
• It is threshold-free and robust
• The input audio stream is modeled as Gaussians
[Figure: a Gaussian distribution with its mean marked]
Segmentation
• Notations for an audio stream:
  – N : number of frames
  – X = {xi : i = 1,2,…,N} : a set of feature vectors
  – μ is the mean
  – Σ is the full covariance matrix
Segmentation for single change pt.
• Assume the change point is at frame i
• H0, H1 : two different models
• H0 models the data as one Gaussian
  – x1 … xN ~ N( μ , Σ )
• H1 models the data as two Gaussians
  – x1 … xi ~ N( μ1 , Σ1 )
  – xi+1 … xN ~ N( μ2 , Σ2 )
[Figure: audio stream from frame 1 to frame N, change point at frame i]
Segmentation for single change pt. (con't)
• The maximum likelihood ratio statistic is
  R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
[Figure: audio stream from frame 1 to frame N, change point at frame i]
Segmentation for single change pt. (con't)
• BIC(i) = R(i) − λ·P
• BIC(i) is +ve: i is the change point
• BIC(i) is −ve: i is not the change point
• Which model fits the data better, a single Gaussian (H0) or two Gaussians (H1)?
[Figure: model H0 vs. model H1]
Segmentation for single change pt. (con't)
• To detect a single change point, we need to calculate BIC(i) for all i = 1,2,…,N
• The frame i with the largest BIC value is the change point
• O(N) to detect a single change point
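A minimal numpy sketch of this single-change-point search, assuming full-covariance Gaussians estimated per side. The penalty term P follows the standard BIC formulation for this model comparison, and λ defaults to 1.0 since the slides do not give a value.

```python
import numpy as np

def bic_single_change_point(X, lam=1.0):
    """Compute BIC(i) = R(i) - lam*P for every candidate frame i and return
    the frame with the largest BIC if that BIC is positive, else None.
    R(i) = N log|S| - N1 log|S1| - N2 log|S2|, with S the full covariance
    of all frames and S1, S2 the covariances of the two sides."""
    N, d = X.shape
    # P: extra parameters of H1 over H0 (one more mean + full covariance)
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False) + 1e-6 * np.eye(d))[1]
    full = N * logdet(X)
    best_i, best_bic = None, -np.inf
    # keep both sides large enough to estimate a covariance matrix
    for i in range(d + 2, N - d - 2):
        r = full - i * logdet(X[:i]) - (N - i) * logdet(X[i:])
        bic = r - lam * P
        if bic > best_bic:
            best_i, best_bic = i, bic
    return (best_i, best_bic) if best_bic > 0 else (None, best_bic)
```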
Segmentation for multiple change pt.
• Step 1: Initialize the interval [a,b]; set a = 1, b = 2
• Step 2: Detect a change point in the interval [a,b] through the BIC single change point detection algorithm
• Step 3: If there is no change point in [a,b], set b = b+1; else let t be the change point detected and set a = t+1, b = t+2
• Step 4: Go to Step 2
Enhanced Implementation Algorithm
• Original multiple change point detection algorithm:
  – Starts detecting change points within a 2-frame interval
  – Increases the investigation interval by 1 each time
• Enhanced implementation algorithm:
  – The minimum processing interval used in our engine is 100 frames
  – Increases the investigation interval by 100 each time
Enhanced Implementation Algorithm (con't)
• Why do we choose to increase the interval by 100 frames?
• If the increment is too large, a scene change may be missed
• It must be smaller than 170 frames because there are around 170 frames in 1 second
• If the increment is too small, processing is too slow
Enhanced Implementation Algorithm (con't)
• Advantage: speed-up
• Trade-off: the detected change point is not very accurate
• To compensate:
  – Investigate the frames around the change point again
  – The investigation interval is incremented by 1 to locate a more accurate change point
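A sketch of this compensation step, assuming the coarse pass reports a point t that may be off by up to the coarse step size (`radius` is assumed to match that step):

```python
def refine_change_point(X, t, detect_single, radius=100):
    """After the coarse search (window grown 100 frames at a time) reports a
    change near frame t, re-run the fine single-change-point detector on
    the frames around t to pin the point down."""
    a = max(0, t - radius)
    b = min(len(X), t + radius)
    fine = detect_single(X[a:b])
    return a + fine if fine is not None else t
```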
Training and Modeling
• Before doing the various identifications, training and modeling is needed
• A probability-based model, the Gaussian Mixture Model (GMM), is used
• The GMM is used for language identification, gender identification and speaker identification
• A GMM is modeled by many different Gaussian distributions
• A Gaussian distribution is represented by its mean and variance
Gaussian Mixture Model (GMM)
[Figure: the model for speaker i is a weighted sum of Gaussian components, each with its own mean, variance and weight p]
• To train a model is to calculate the mean, variance and weight (λ) for each of the Gaussian distributions
Training of speaker GMMs
• Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips)
• Steps for training one speaker model:
  – Step 1. Start with an initial model λ
  – Step 2. Calculate a new mean, variance and weighting (a new λ) by training
  – Step 3. Use the new λ if it represents the model better than the old λ
  – Step 4. Repeat Steps 2 to 3
• Finally, we get a λ that can represent the model
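The four training steps are essentially the EM algorithm for a GMM. A minimal 1-D sketch follows; real speaker models are trained on multi-dimensional MFCC vectors, and the component and iteration counts here are illustrative, not the system's settings.

```python
import numpy as np

def train_gmm(x, n_components=2, n_iter=50):
    """EM training of a 1-D GMM: start from an initial model λ, repeatedly
    re-estimate the mean, variance and weight of each Gaussian, and keep
    the improved λ."""
    # Step 1: initial model λ (means at data quantiles, pooled variance, uniform weights)
    means = np.quantile(x, (np.arange(n_components) + 1) / (n_components + 1))
    vars_ = np.full(n_components, np.var(x))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each sample
        diff = x[:, None] - means[None, :]
        dens = weights * np.exp(-0.5 * diff ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step (Steps 2-3): new mean, variance and weight for each Gaussian
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        vars_ = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return means, vars_, weights
```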
Applications
• Video scene change and speaker tracking
• Speaker identification
• Telephony message notification
Video Scene Change and Speaker Tracking
[Flow chart: a video clip is passed to the AdvAIR core for segmentation; the resulting speaker index information, together with timing and speaker information, drives a multimedia presentation and video playing mechanism.]
Usage
• Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference)
• Audio information indexing and sorting for audio library storage
• An auxiliary tool for video cutting and editing applications
Screenshot
[Screenshot: input clip, multimedia player, time information and indexing]
Speaker Identification
[Flow chart: in the training stage, preprocessed speaker clips from the sound source feed GMM model training, which populates a speaker models database; in the testing stage, the speaker comparison mechanism matches a clip against the database and outputs the speaker identity.]
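The testing stage of the diagram reduces to scoring the clip's features against every stored model and keeping the best match. A 1-D sketch with hypothetical speaker names and single-component models:

```python
import numpy as np

def gmm_avg_loglik(x, means, vars_, weights):
    """Average per-sample log-likelihood of 1-D features x under a GMM."""
    diff = x[:, None] - means[None, :]
    dens = weights * np.exp(-0.5 * diff ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
    return float(np.log(dens.sum(axis=1)).mean())

def identify_speaker(x, models):
    """Closed-set identification: return the name of the speaker whose GMM
    (means, variances, weights) scores highest on the clip."""
    scores = {name: gmm_avg_loglik(x, *params) for name, params in models.items()}
    return max(scores, key=scores.get)
```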
Usage
• Security authentication
• Speaker identification for telephone-based systems
• Criminal investigation (a voiceprint used similarly to a fingerprint)
Screenshot
[Screenshot: input source, flexible-length comparison, speaker identity, media player for visual verification]
Telephony Message Notification
[Flow chart: when the user cannot listen, the caller's message is recorded, run through AdvAIR segmentation, and compared against a desired-group model database via GMM model comparison; messages classified into the desired group trigger the messaging API (short message system or e-mail system), while non-desired-group messages do not.]
Experiment Results

Threshold-free BIC criterion

Test | Wave length | Actual turning points | False alarms | Missed points | Time used
-----|-------------|-----------------------|--------------|---------------|-------------
1    | 9 seconds   | 2                     | 0            | 0             | 2 seconds
2    | 12 seconds  | 4                     | 0            | 0             | 4 seconds
3    | 25 seconds  | 3                     | 0            | 0             | 8 seconds
4    | 120 seconds | 8                     | 1            | 0             | 134 seconds
5    | 540 seconds | 12                    | 8            | 0             | 1200 seconds

Background noise affects accuracy.
Enhanced Implementation

Test | Method | Wave length | Actual turning points | False alarms | Missed points | Time used
-----|--------|-------------|-----------------------|--------------|---------------|-------------
1    | Old    | 9 seconds   | 2                     | 0            | 0             | 10 seconds
1    | New    | 9 seconds   | 2                     | 0            | 0             | 2 seconds
2    | Old    | 12 seconds  | 4                     | 0            | 0             | 40 seconds
2    | New    | 12 seconds  | 4                     | 0            | 0             | 4 seconds
3    | Old    | 25 seconds  | 3                     | 1            | 0             | 1300 seconds
3    | New    | 25 seconds  | 3                     | 2            | 0             | 8 seconds
4    | Old    | 540 seconds | 18                    | 7            | 2             | over 1 day
4    | New    | 540 seconds | 18                    | 8            | 2             | 1200 seconds

The speed enhancement is determined by the number of change points relative to the length.
GMM model closed-set speaker identification
• Training stage:
  – 10 speakers (5 males, 5 females)
  – 20 minutes for each speaker
• Testing stage:
  – 50 sound clips of 5 seconds duration
  – 48 sound clips are correct, i.e. 96%
GMM model open-set speaker identification
• Accept or reject as the result
• Same setting as the closed set
  – i.e. 10 speakers, each with 20 minutes
• Correct: 45/50 = 90%
• False reject: 3/50 = 6%
• False accept: 2/50 = 4%
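Open-set identification differs from the closed set only in a final accept/reject decision. A sketch, assuming a simple average-log-likelihood threshold (the threshold value is illustrative; as noted in the limitations, this decision is unreliable for short clips):

```python
import numpy as np

def open_set_identify(x, models, threshold):
    """Return the best-matching speaker only if that speaker's GMM clears
    the log-likelihood threshold on the clip; otherwise reject (None)."""
    def avg_loglik(params):
        means, vars_, weights = params
        diff = x[:, None] - means[None, :]
        dens = weights * np.exp(-0.5 * diff ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
        return np.log(dens.sum(axis=1)).mean()
    scores = {name: avg_loglik(p) for name, p in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```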
Problems and Limitations
• Accuracy is affected by background noise
• Some speakers have very similar sound features
• The open-set speaker identification decision function is not very accurate when the duration is short
• Segmentation is still a time-consuming process
Future Work
• Speaker gender identification
• Robust open-set speaker identification
• Speech content recognition
• Music pattern matching
• A distributed system for segmentation

Q & A