Cascade classifiers trained on gammatonegrams for reliably detecting audio events

P. Foggia, A. Saggese, N. Strisciuglio, M. Vento

University of Salerno - Italy

Machine Intelligence lab for Video, Image and Audio processing

"Cascade classifiers trained on gammatonegrams for reliably detecting audio events," 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 50-55, 26-29 Aug. 2014. doi: 10.1109/AVSS.2014.6918643

State of the art

Single-layer representation or classification

Vacher et al. (2004), Clavel et al. (2005): GMM classifier

Valenzise et al. (2007): GMM for background modeling

Rabaoui et al. (2008): OC-SVM with a novel dissimilarity measure

Complex classification architecture or representation

Rouas et al. (2006): GMM + SVM

Ntalampiras et al. (2009): two-stage GMM classifier

Conte et al. (2012): two classifiers with different time resolutions

Chin and Burred (2012): sub-sequence matching through the Genetic Motif Discovery technique

Proposed Architecture

Image Representation -> Features Extraction (Haar) -> Cascade Classifiers

Audio representation

Biologically inspired representation of audio streams, modeled on the response of the membrane of the cochlea in the human auditory system (Gammatone filter bank)

Example gammatonegrams: Scream, Gun shot, Glass breaking
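To make the idea of a gammatonegram concrete, here is a minimal sketch in plain Python. It assumes the textbook 4th-order gammatone impulse response with the Glasberg and Moore ERB bandwidth, and frame-wise log energy as the pixel value. The function names, the four centre frequencies, the 25 ms filter length and the frame length are illustrative assumptions, not parameters taken from the slides.

```python
import math

def erb(fc):
    """Equivalent rectangular bandwidth in Hz at centre frequency fc (Glasberg & Moore)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Unit-energy impulse response of a gammatone filter centred at fc (assumed parameters)."""
    b = 1.019 * erb(fc)
    ir = []
    for k in range(int(duration * fs)):
        t = k / fs
        ir.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                  * math.cos(2 * math.pi * fc * t))
    norm = math.sqrt(sum(v * v for v in ir))
    return [v / norm for v in ir]

def fir_filter(signal, ir):
    """Direct-form convolution of the signal with the impulse response."""
    out = []
    for i in range(len(signal)):
        taps = min(len(ir), i + 1)
        out.append(sum(ir[k] * signal[i - k] for k in range(taps)))
    return out

def gammatonegram(signal, fs, centre_freqs, frame_len):
    """Rows are frequency channels, columns are time frames, values are log energies."""
    image = []
    for fc in centre_freqs:
        filtered = fir_filter(signal, gammatone_ir(fc, fs))
        row = []
        for start in range(0, len(filtered) - frame_len + 1, frame_len):
            energy = sum(v * v for v in filtered[start:start + frame_len])
            row.append(math.log10(energy + 1e-12))
        image.append(row)
    return image

# Toy input: 0.1 s of a 1 kHz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
centre_freqs = [250, 500, 1000, 2000]
image = gammatonegram(tone, fs, centre_freqs, frame_len=200)
strongest = max(range(len(centre_freqs)), key=lambda c: image[c][-1])
print(centre_freqs[strongest])
```

For the pure tone, the channel centred nearest to the tone frequency carries the most energy, which is what makes the resulting image readable as a time-frequency picture of the sound.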



Haar features

Haar wavelets describe local variations of energy in the gammatonegram image; for instance, an abrupt variation of the energy distribution along time is effectively described by a vertical Haar basis function

Efficiently computed from the Integral Image of the Gammatonegram
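The two points above can be sketched as follows: an integral image (summed-area table) gives any rectangle sum in four lookups, and a two-rectangle Haar feature split by a vertical line responds to an abrupt change of energy along the time axis. This is a minimal illustration, not the authors' implementation; the function names and the toy image are assumptions.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

def haar_vertical_edge(ii, top, left, height, width):
    """Two-rectangle feature: right half minus left half (width must be even)."""
    half = width // 2
    return (box_sum(ii, top, left + half, height, half)
            - box_sum(ii, top, left, height, half))

# Toy gammatonegram: 3 frequency rows, 6 time columns, energy switches on at column 3.
toy = [[0, 0, 0, 1, 1, 1] for _ in range(3)]
ii = integral_image(toy)
onset_response = haar_vertical_edge(ii, 0, 0, 3, 6)  # window straddles the onset
flat_response = haar_vertical_edge(ii, 0, 4, 3, 2)   # window inside the constant region
print(onset_response, flat_response)
```

The feature is large where the window straddles the onset and zero where the energy is constant, and its cost does not depend on the rectangle size.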



Cascade Classifiers

Events of interest can occur at every position in time

Classification through an n x m sliding window

Multi-stage cascade classifier learned with the AdaBoost algorithm (inspired by the Viola-Jones face detector)

Smaller and simpler classifiers in the first stages of the cascade

Speed-up from the early rejection of negative windows

Input image -> cascade stages -> event detected; a window that fails a stage is rejected (no event)
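The control flow of such a cascade can be sketched in a few lines. This is only an illustration of early rejection with a window sliding along time: the two toy features, all thresholds and all weights are hand-set assumptions, whereas in the method described here they would be learned with AdaBoost, and the window below spans every frequency row for simplicity.

```python
def cascade_accepts(window, stages):
    """Accept a window only if it passes every stage; stop at the first failure."""
    for weak_classifiers, stage_threshold in stages:
        score = sum(alpha for feature, thr, alpha in weak_classifiers
                    if feature(window) >= thr)
        if score < stage_threshold:
            return False  # early rejection: later, costlier stages are never run
    return True

def detect(image, stages, win_w, step=1):
    """Slide a window along the time axis; return start columns of accepted windows."""
    n_cols = len(image[0])
    hits = []
    for start in range(0, n_cols - win_w + 1, step):
        window = [row[start:start + win_w] for row in image]
        if cascade_accepts(window, stages):
            hits.append(start)
    return hits

# Hypothetical features standing in for Haar responses.
def total_energy(window):
    return sum(sum(row) for row in window)

def onset_contrast(window):
    half = len(window[0]) // 2
    return sum(sum(row[half:]) - sum(row[:half]) for row in window)

stages = [
    ([(total_energy, 10, 1.0)], 1.0),    # cheap first stage
    ([(onset_contrast, 15, 1.0)], 1.0),  # more selective second stage
]
image = [[0, 0, 0, 0, 5, 5, 0, 0],
         [0, 0, 0, 0, 5, 5, 0, 0]]
hits = detect(image, stages, win_w=4)
print(hits)
```

The silent window at column 0 is discarded by the first stage alone, so the second stage is evaluated only on windows that already look promising.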

Data Set (http://mivia.unisa.it)

4 classes of sounds

Glass breaking (GB), Gun shot (GS), Screams (S), Background sound (BG)

2500 events for each class

1000 for training and 1500 for testing

The events are created by superimposing abnormal sounds on several background sounds

Originally 173 background sounds + 278 sounds from the classes of interest
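As a rough illustration of superimposing an event on a background recording, the sketch below adds a scaled copy of the event samples at a chosen offset. The function name, the `offset` and `gain` parameters and the sample values are hypothetical; the slides do not state how levels or positions were chosen.

```python
def superimpose(background, event, offset, gain=1.0):
    """Return a copy of the background with gain * event added from sample `offset`."""
    if offset < 0 or offset + len(event) > len(background):
        raise ValueError("event must fit inside the background")
    mixed = list(background)
    for i, sample in enumerate(event):
        mixed[offset + i] += gain * sample
    return mixed

# Toy signals: a constant background and a two-sample event.
background = [0.25] * 6
event = [1.0, -1.0]
mixed = superimpose(background, event, offset=2, gain=0.5)
print(mixed)
```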

Experimental Evaluation

Recognition Rate

Correct detection/classification of events of interest

False Positive Rate (False alarms)

Detection of events of interest when only background sound is present
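One way to read the two measures above is sketched below, assuming one true label and one predicted label per test sample, with "BG" marking background-only samples. The function names and the toy labels are illustrative; the exact evaluation protocol is not given in the slides.

```python
def recognition_rate(true_labels, predicted_labels, background="BG"):
    """Fraction of events of interest that are assigned their correct class."""
    pairs = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != background]
    return sum(t == p for t, p in pairs) / len(pairs)

def false_positive_rate(true_labels, predicted_labels, background="BG"):
    """Fraction of background-only samples on which an event is reported."""
    preds = [p for t, p in zip(true_labels, predicted_labels) if t == background]
    return sum(p != background for p in preds) / len(preds)

# Toy labels using the class abbreviations of the data set.
truth = ["GB", "GS", "S", "S", "BG", "BG", "BG", "BG"]
predicted = ["GB", "GS", "S", "BG", "BG", "S", "BG", "BG"]
rr = recognition_rate(truth, predicted)
fpr = false_positive_rate(truth, predicted)
print(rr, fpr)
```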

Comparison with two other methods from the literature, based on an LVQ classifier [1] and a Bag of Aural Words (BoAW) classifier [2]

[1] Conte et al. - An ensemble of rejecting classifiers for anomaly detection of audio events, AVSS 2012

[2] Carletti et al. - Audio surveillance using a bag of aural words classifier, AVSS 2013

Experimental Evaluation (2)

Recognition Rate

Avg. Rec. Rate = 95.89%

Avg. Rec. Rate = 79.87%

Avg. Rec. Rate = 95.67%

Compared methods: [1], [2]

Experimental Evaluation (3)

False Positive Rate (comparison with [1] and [2])

Qualitative analysis

Many false scream detections occur on background sounds that contain loud cheering crowds or whistles

Examples: Scream, Whistle, Cheering baby

Conclusions

Innovative approach to audio analysis and event detection based on Computer Vision techniques

High detection capabilities

Low processing time: complex features are computed only for windows that are more likely to contain an event of interest

Detection of sounds of interest with low energy

References

P. Foggia, A. Saggese, N. Strisciuglio, M. Vento

"Cascade classifiers trained on gammatonegrams for reliably detecting audio events," 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 50-55, 26-29 Aug. 2014. doi: 10.1109/AVSS.2014.6918643

Web: http://mivia.unisa.it

Email: nstrisciuglio[at]unisa.it
