Neural Architectures for Music Representation Learning
Sanghyuk Chun, Clova AI Research
Contents
- Understanding audio signals
- Front-end and back-end framework for audio architectures
- Powerful front-end with Harmonic filter banks
  - [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN
  - [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning
  - [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models
- Interpretable back-end with self-attention mechanism
  - [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging
  - [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention
- Conclusion
Understanding audio signals: raw audio, spectrogram, Mel filter bank
Understanding audio signals in the time domain
[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]
A “waveform” shows the “magnitude” of the input signal across time. How can we capture “frequency” information?
Understanding audio signals in the frequency domain
Time-amplitude => frequency-amplitude
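As a concrete illustration, the time-amplitude to frequency-amplitude mapping can be sketched with NumPy's FFT; the 440 Hz tone and the 11025 Hz sampling rate are illustrative choices (the same sampling rate appears in later slides):

```python
import numpy as np

sr = 11025                       # sampling rate (Hz), as used in later slides
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)  # a 1-second 440 Hz tone (A4)

mag = np.abs(np.fft.rfft(y))     # frequency-amplitude
freqs = np.fft.rfftfreq(len(y), d=1 / sr)
print(freqs[mag.argmax()])       # 440.0
```

The FFT peak lands exactly on the tone's frequency, which is what a waveform alone cannot show.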
Understanding audio signals in the time-frequency domain
Types of audio inputs:
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant-Q transform (CQT)
Human perception of audio is log-scale
An octave doubles the frequency: 220 Hz → 440 Hz → 880 Hz (A); 146.83 Hz → 293.66 Hz → 587.33 Hz (D).
Mel filter banks: a log-scale filter bank
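A mel filter bank can be built from triangular filters whose centers are equally spaced on the mel scale; the sketch below uses the common 2595·log10(1 + f/700) mel formula and the sizes from these slides (sr = 11025, n_fft = 2048, 128 mel bins):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=11025, n_fft=2048, n_mels=128):
    # n_mels triangular filters, centers equally spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (fft_freqs - left) / (center - left)
        fall = (right - fft_freqs) / (right - center)
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return fb

fb = mel_filter_bank()
print(fb.shape)  # (128, 1025)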
Mel-spectrogram vs. other input representations

Raw audio:
- Input shape: 1D, sampling rate × audio length = 11025 × 30 ≈ 330K samples
- Information density: very sparse along the time axis (1 sec = sampling-rate samples), so a very large receptive field is needed
- Loss: none (if the sampling rate is above the Nyquist rate)

Linear spectrogram:
- Input shape: 2D, (# FFT / 2 + 1) × # frames = (2048 / 2 + 1) × 1255 = [1025, 1255]; tied to many hyperparameters (hop size, window size, …)
- Information density: sparse along the frequency axis (each time bin has 1025 dims), so a large receptive field is needed
- Loss: “resolution”

Mel-spectrogram:
- Input shape: 2D, (# mel bins) × # frames = [128, 1255]
- Information density: less sparse along the frequency axis (each time bin has 128 dims)
- Loss: “resolution” + Mel filter
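The 1025-bin spectrogram shape can be checked with a minimal NumPy STFT sketch; the hop size below is an arbitrary choice, so only the number of frequency bins (n_fft / 2 + 1 = 1025) is fixed by the slide's settings:

```python
import numpy as np

sr, seconds = 11025, 30
y = np.random.randn(sr * seconds)  # stand-in for a 30-second clip

n_fft, hop = 2048, 256             # hop size is an arbitrary choice here
starts = range(0, len(y) - n_fft + 1, hop)
frames = np.stack([y[i:i + n_fft] * np.hanning(n_fft) for i in starts])
spec = np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft/2 + 1, n_frames)
print(spec.shape[0])  # 1025 frequency bins
```

Changing the hop or window size changes the number of frames, which is why the spectrogram shape depends on so many hyperparameters.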
Bonus: MFCC (Mel-Frequency Cepstral Coefficients)
Mel-spectrogram (2D: (# mel bins) × # frames = [128, 1255]) → DCT (Discrete Cosine Transform) → MFCC (2D: (# MFCCs) × # frames = [20, 1255])
Frequently used in the speech domain. Too lossy a representation for high-level music representation learning (cf. SIFT).
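The mel-to-MFCC step is just a DCT over the (log-compressed) mel axis; here is a self-contained sketch with a random stand-in mel-spectrogram:

```python
import numpy as np

def dct_ii(x, n_out):
    # DCT-II along the first axis, keeping the first n_out coefficients
    n = x.shape[0]
    basis = np.cos(np.pi * np.outer(np.arange(n_out), np.arange(n) + 0.5) / n)
    return basis @ x

mel_spec = np.random.rand(128, 1255)        # stand-in mel-spectrogram
mfcc = dct_ii(np.log(mel_spec + 1e-6), 20)  # log-compress, then DCT
print(mfcc.shape)  # (20, 1255)
```

Keeping only 20 of 128 coefficients is exactly where the lossiness comes from.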
Front-end and back-end framework
- Fully convolutional neural network baseline
- Rethinking the convolutional neural network
- Front-end and back-end framework
Fully convolutional neural network baseline for automatic music tagging
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler
Rethinking the CNN as a feature extractor and a non-linear classifier
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler
- Early convolutional layers extract low-level features (timbre, pitch)
- Deeper layers extract high-level features (rhythm, tempo)
- The final layers act as a non-linear classifier
Front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output

Instances of the framework (filter banks / front-end / back-end):
- Mel-filter banks / CNN / RNN
  [ICASSP 2017] “Convolutional recurrent neural networks for music classification”, Choi, et al.
- Mel-filter banks / CNN / CNN
  [ISMIR 2018] “End-to-end learning for music audio tagging at scale”, Pons, et al.
- Fully-learnable / CNN / CNN
  [SMC 2017] “Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms”, Lee, et al.
- Partially-learnable / CNN / CNN
  [ICASSP 2020] “Data-driven Harmonic Filters for Audio Representation Learning”, Won, et al.
- Mel-filter banks / CNN / self-attention
  [ArXiv 2019] “Toward Interpretable Music Tagging with Self-attention”, Won, et al.
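The modular split can be sketched in PyTorch; all layer sizes and the GRU back-end configuration below are illustrative, not taken from any of the cited papers:

```python
import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Local feature extraction over a (batch, 1, mel, time) spectrogram."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        h = self.net(x)                       # (batch, ch, mel', time')
        return h.mean(dim=2).transpose(1, 2)  # (batch, time', ch): a sequence

class RNNBackEnd(nn.Module):
    """Temporal summarization & classification (CRNN-style back-end)."""
    def __init__(self, in_ch=64, n_tags=50):
        super().__init__()
        self.rnn = nn.GRU(in_ch, 64, batch_first=True)
        self.head = nn.Linear(64, n_tags)

    def forward(self, seq):
        _, h = self.rnn(seq)                     # final hidden state summarizes time
        return torch.sigmoid(self.head(h[-1]))   # multi-label tag probabilities

front, back = FrontEnd(), RNNBackEnd()
x = torch.randn(2, 1, 128, 256)                  # (batch, 1, mel bins, frames)
tags = back(front(x))
print(tags.shape)  # torch.Size([2, 50])
```

Because the pieces only meet at a (batch, time, channels) sequence, the back-end can be swapped (CNN, RNN, self-attention) without touching the front-end, which is the point of the framework.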
Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models
Motivation: data-driven, but human-guided
Within the framework (Input → Filter banks → Front-end → Back-end → Output):
- Traditional methods: Mel-filter banks → MFCC → SVM classification (hand-crafted features with a strong human prior)
- Recent methods: fully-learnable filters → CNN → CNN (a data-driven approach without any human prior)
Data-driven harmonic filters
Data-driven filter banks:
- In a Mel filter bank, f(m) is a set of pre-defined frequency values that depends on the sampling rate, FFT size, number of mel bins, …
- The proposed data-driven filter is instead parameterized by fc (center frequency) and BW (bandwidth).
- The bandwidth is derived from the equivalent rectangular bandwidth (ERB), with a trainable Q!
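One plausible NumPy sketch of such a filter, with the bandwidth tied to the ERB of the center frequency and scaled by a (trainable) Q; this parameterization is an assumption for illustration, and the exact formula in the ICASSP 2020 paper may differ:

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore), in Hz
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def band_filter(fft_freqs, fc, Q):
    # Triangular band-pass filter centered at fc; bandwidth = ERB(fc) scaled
    # by Q. Hedged sketch: the paper's exact parameterization may differ.
    bw = erb(fc) / Q
    return np.clip(1.0 - np.abs(fft_freqs - fc) / bw, 0.0, None)

fft_freqs = np.linspace(0.0, 11025 / 2, 1025)
filt = band_filter(fft_freqs, fc=440.0, Q=0.5)
print(fft_freqs[filt.argmax()])  # the FFT bin nearest 440 Hz
```

Making fc and Q gradient-trainable parameters (instead of the fixed mel grid f(m)) is what makes the filter bank data-driven.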
Harmonic filters
For each learned center frequency fc, filters are stacked at its harmonics n·fc (n = 1, 2, 3, 4, …): the first channel responds to the fundamental frequency, the second to the 2nd harmonic, the third to the 3rd harmonic, and so on. Stacking the filter outputs yields a harmonic tensor, which the Harmonic CNN front-end processes before the back-end.
Experiments: three tasks
- Music tagging: 21k audio clips, 50 classes, multi-labeled.
  Example tags: “techno”, “beat”, “no voice”, “fast”, “dance”, … Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, mood, …
- Keyword spotting: 106k audio clips, 35 classes, single-labeled.
  Example labels: “wow”, “yes”, “no”, “one”, “four”, … Harmonic characteristics are a well-known important feature for speech recognition (cf. MFCC).
- Sound event detection: 53k audio clips, 17 classes, multi-labeled.
  Example labels: “Ambulance (siren)”, “Civil defense siren”, “train horn”, “Car”, “Fire truck”, … Non-music, non-verbal audio signals are expected to have “inharmonic” features.
Experiments: compared models (filters / front-end / back-end)
- Mel-spectrogram / CNN / CNN
- Fully-learnable / CNN / CNN
- Partially-learnable / CNN / CNN
- Mel-spectrogram / CNN / Attention
- Linear or Mel spectrogram, MFCC / Gated-CRNN (RNN back-end)
Effect of harmonic filters
Harmonic CNN is an efficient and effective architecture for music representation learning, and it is more generalizable to realistic noise than the other methods.
All models can be reproduced from the following repository: https://github.com/minzwon/sota-music-tagging-models
Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention
Recall: front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output
What is happening in the back-end?
A CNN-based back-end cannot capture “temporal properties”: convolutions have loose locality.
[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler
A self-attention back-end can capture long-term relationships.
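A minimal PyTorch sketch of a self-attention back-end: one attention layer relates every frame to every other frame, so long-term structure is captured in a single hop. Sizes are illustrative, not the configuration of the cited paper:

```python
import torch
import torch.nn as nn

class SelfAttentionBackEnd(nn.Module):
    """Sketch of a self-attention back-end over front-end features."""
    def __init__(self, dim=64, n_tags=50, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_tags)

    def forward(self, seq):                      # seq: (batch, time, dim)
        ctx, weights = self.attn(seq, seq, seq)  # weights: (batch, time, time)
        pooled = ctx.mean(dim=1)                 # summarize across time
        return torch.sigmoid(self.head(pooled)), weights

backend = SelfAttentionBackEnd()
feats = torch.randn(2, 64, 64)                   # (batch, frames, channels)
tags, attn = backend(feats)
print(tags.shape, attn.shape)
```

Unlike a CNN back-end, the returned attention weights say which frames the prediction attended to, which is what enables the visualizations on the following slides.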
Experiments (music tagging)
Self-attention back-end for better interpretability
Attention score visualization
My network says this clip has a “quiet” tag. But why?
Back-end as a “sound” detector
- Positive tags: “Piano”, “Drum”
- Negative tags: “No vocal”, “Quiet”
Tag-wise heatmap contribution
Set the selected attention weight to 1 while all others are set to 0, then measure how each tag’s confidence changes, e.g., the confidence of “Quiet” vs. “Loud”, or of “Piano” vs. “Flute”.
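The probing step above can be sketched in NumPy: keep one attention weight, zero the rest, and feed the masked attention back through the classifier head; function names here are illustrative, not from the paper's code:

```python
import numpy as np

def mask_attention(weights, keep):
    # Keep only the selected attention weight (set to 1); zero all others.
    masked = np.zeros_like(weights)
    masked[keep] = 1.0
    return masked

w = np.random.rand(64)           # attention scores over 64 time steps
w /= w.sum()
m = mask_attention(w, keep=10)   # probe the contribution of time step 10
print(m.sum(), m[10])  # 1.0 1.0
```

Re-scoring tags with each masked attention in turn yields a per-time-step contribution map for every tag.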
Conclusion
- Powerful front-end with Harmonic CNN
- Interpretable back-end with self-attention
References
- [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra