35
16-18 November, 2020 VIRTUAL INTERNATIONAL CONFERENCE IMPROVING VIDEO ANALYSIS FOR A SAFER EUROPE

VIRTUAL INTERNATIONAL CONFERENCE IMPROVING VIDEO …

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

PowerPoint PresentationVIRTUAL INTERNATIONAL CONFERENCE
2Virtual Conference – 16 – 18 November 2020
Audio Data Aspects of the VAP
Alexander Schindler
3
Energy Health & Bioresources
Digital Safety & Security
Vision, Automation & Control
• Dynamic Transportation Systems • Transportation Infrastructure Technologies
• Electric Drive Technologies • Light Metals Technologies Ranshofen
• Experience Contexts and Tools • Experience Business Transformation
• Digital Innovation • Foresight & Institutional Change • Policies for Change
4Virtual Conference – 16 – 18 November 2020
Audio Data Aspects of the VAP
Alexander Schindler, Anahid Jalali, Andrew Lindley, Martin Boyer,
Sergiu Gordea, Ross King
AIT Austrian Institute Of Technology Gmbh [email protected]
Incident Detection Alarms, Threats Emergencies
Music Analysis Classification, Segmentation Similarity Estimation
Domestic Audio Home automation Home security
Bio-Acoustics Environmental Sounds
Speech Recognition Identifying Speech
Smart Cities
“There was a loud noise.”
„There was an explosion.”
Search with reference to sound
„There was a loud sound and then someone jumped into a car and disappeared“
Search by sound
(e.g. a person screaming somthing)
6
Virtual Final Conference – 16-18/11/2020
• Large Scale Audio-Visual Video Analytics Platform (VAP) • Integrates four analytic modules
• Audio Event Detection (AIT) • Audio Similarity Retrieval (AIT) • Audio Synchronization (AIT) • Vehicle Sound Detection (IRIT)
8Virtual Final Conference – 16-18/11/2020
• Gunshots, explosions, emergency vehicles, scream, speech, Alarm
• Use-Case • Content filter in mass video-data • Example: attack with firearms
• => initiate search by filtering all videos which contain Gunshots (sorted by confidence)
Gunshot Gunshot Scream Gunshot …
[talking, gunshot]
SED
Generalization Challenge
Difference: Music Studio / Sound Event Detection Studio
Microphone close to instrument. Avoid room acoustic / record and mix extra
Sound Scenes Microphone in open spaces (rooms, outdoors) Room acoustic (reverb, echo, noise, etc.)
From:https://commons.wikimedia.org/ wiki/File:Various-microphones.JPG
Indoors: reflecting walls vs. Curtains Outdoors: reflecting facades vs. Open space Weather:
clear vs. Fog, Rain, Wind, etc. Dry vs. Wet road
Also:
Different Sound Scapes: Vienna vs. Paris (ambulance, police, subway, tramway, etc.)
From:https://pixabay.com/photos/micr ophone-artichoke-nature-1535932/
From:https://upload.wikimedia.org/wiki pedia/commons/thumb/e/e6/Micropho ne_in_hall.jpg
From:https://c.pxhere.com/photos/ad/6 1/microphone_voice_over_music- 731669.jpg!d
From:https://de.wikipedia.org/wiki/Dat ei:Dense_Seattle_Fog.jpg
Traffic noise (cars, airplanes, etc.) Fauna: birds vs. crickets
Summer vs. Winter
Gravel / Snow on asphalt Leaves rustling vs. No leaves in Winter
From:https://www.flickr.com/photos/45 774886@N00/248949123
Flattening / Disentangling of the hirarchy Replacing child labels with parent labels
Machine Gun => Gunshot Fusillade (salvo) => Gunshot
Resolve ambiguities / overlaps Gunshot != Atillery Fire
Remove non-project-relevant labels Cap Gun, Speech Synthesizer
New grouping of classes Atillery Fire => Explosion Ambulance, Police Car, Fire Truck => Emergency vehicle
Selection of project-relevant parent labels Gunshot, Explosion, Emergency vehicle,
Audioset – pre-processing
Final Acoustic Concepts • Alarm • Explosion • Screaming • Speech • Wind Noise (Microphone) • Emergency Vehicle • Gunshot
• Strong Labels – High precission (~0.5s) – Per class labelling – Expensive – Datasets usually small
• Weak labels – Low precission (~10s) – Multi class labelling – „cheap“ – Large Datasets
Harb, Robert, and Franz Pernkopf. "Sound event detection using weakly-labeled semi-supervised data with GCRNNS, VAT and Self-Adaptive Label Refinement." arXiv preprint arXiv:1810.06897 (2018).
Dog, Wind, Speech
2. Convolutional Neural Network Block (CNN) • Learn audio embeddings
3. Recurrent Neural Network Block (RNN) • Learn Temporal dependencies of embeddings
4. Array of Fully Connected Layers • One Layer per temporal dimension (Time-
Distributed) • Dimensionality of Layer = Number of classes
5. Outputs • Strong Labels – Training & Inference
• Output of Time-Distributed Fully Connected Layers
• Weak Labels - Training • Output Layer aggregation (e.g. avg, max) • Multi label prediction
19 Recurrent Convolutional Neural Networks
Attention Layer
Convolution Layer
Recurrent Layer
Input Layer
High signal to noise ratio Due to attention layer
Tick off hints by eye-witnesses
20
Task Searching for video-segments with similar audio-signature Sub-Segment video-search
Use-Case Suspect could not be identified in one video Select segment and search for others using audio-signature Instant localization (videos close to audio source)
Solution Select range in video Retrieve a list of similar sounding video segments Sorted by simlarity
1 1 / 1 2 / 2 0 1 8 – R e v i e w R P 1
24
6s 6s 6s 6s 6s
6s 6s 6s 6s 6s 6s
6s 6s 6s 6s 6s 6s 6s
Audio features extracted for each 6s segment
Rhythm Patterns (repetitiveness in audio)
Statistical Spectrum Descriptors
Nearest Neighbor search using late fusion in a normalized feature space
Audio Similarity – Dallas Protest Shooting
25Virtual Final Conference – 16-18/11/2020
vid1
Vid4
vid5
vid2
vid3
vid6vid6
28
1 1 / 1 2 / 2 0 1 8 – R e v i e w R P 1 OFFSET ?
Task Synchronize various video files with unreliable time metadata Use audio-signature to relatively align video files
Technology Audio-fingerprints (chromaprint) Noise invariant
Audio Fingerprints
• To synchronize two audio signals, we use fingerprints to identify each audio landmarks
• Each audio fingerprint is a unambiguous hash value of two local frequency peaks with their offsets
Audio Wave
Audio Spectrogram
Audio Synchronization
Each two matching recordings contain large amount of matching hash values
The mode of difference between each of their offset determine the overall offset between the two recordings
Illustration of difference between the offsets of two different recordings of the same concert
Illustration of difference between the offsets of two different recordings of two different concerts
Audio Synchronization
Advantage of this approach is Accurate identification of matching pairs Avoiding false matches Fast computation of matches
Illustration of difference between the offsets of two different recordings of the same concert
Illustration of difference between the offsets of two different recordings of two different concerts
32
Thank you and Next: Vehicle Sound Detection (IRIT)
Conclusions
34Virtual Conference – 16 – 18 November 2020
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No740754
35Virtual Conference – 16 – 18 November 2020
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No740754
www.victoria-project.eu
@H2020Victoria
https://www.linkedin.com/groups/12091342
https://www.victoria-project.eu/join-the-
victoria-community
Foliennummer 4
Audio Analysis
Developed Audio Modules
Audio Similarity – Instant Localization (Without GPS)
Foliennummer 27
Audio-based Video-Synchronization
Audio Fingerprints
Audio Synchronization
Audio Synchronization