Music Classification Using Neural Networks
Leo Y. Liu, Lu Wang, Yang Yu
UNC STOR
GTZAN Dataset
- 1000 audio tracks, each about 30 seconds long
- 10 genres: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock
- 22,050 Hz mono 16-bit audio files in .wav format
- https://drive.google.com/open?id=0BzPvXAjSgVbXLUxsSWc0c2k1MXM
GTZAN Dataset
5 songs are picked from each of the 4 genres: Blues, Classical, Country, Disco.
Figure 1: Correlation Plot
GTZAN Dataset
Figure 2: Periodogram
GTZAN Dataset
Figure 3: Gabor Transformation
Model: Neural Network
A neural network is a two-stage regression or classification model, typically represented by a network diagram as follows.
Z_m = \sigma(b_{0m} + w_m^T X), \quad m = 1, \ldots, M,
T_k = b'_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K,
Y_k = f_k(X) = g_k(T), \quad k = 1, \ldots, K.
Model: Neural Network
- Activation function \sigma(v): sigmoid, \sigma(v) = 1/(1 + e^{-v}).
- Output function g_k(T): for regression, the identity function; in K-class classification, g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l} (softmax).
- Unknowns: biases and weights \{b_{0m}, w_m; m = 1, \ldots, M\} and \{b'_{0k}, \beta_k; k = 1, \ldots, K\}. In total, M(p + 1) + K(M + 1) unknowns.
- Measure of fit: the sum-of-squared error

R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( Y_k^{(i)} - f_k(X^{(i)}) \right)^2

can be used for both regression and classification; the cross-entropy

R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} Y_k^{(i)} \log f_k(X^{(i)})

is used for classification, and the corresponding classifier is G(x) = \arg\max_k f_k(x).
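As an illustrative sketch (not the code used in this project), the forward pass and both fit criteria above can be written in a few lines of NumPy; the weights here are random, and `c0`/`B` stand in for b'_{0k} and \beta_k:

```python
import numpy as np

rng = np.random.default_rng(0)
p, M, K, N = 4, 6, 3, 5                            # inputs, hidden units, classes, samples

# biases and weights {b_0m, w_m} and {b'_0k, beta_k}
b0, W = rng.normal(size=M), rng.normal(size=(M, p))
c0, B = rng.normal(size=K), rng.normal(size=(K, M))

def forward(X):
    """One hidden layer: Z = sigma(b0 + W X), T = c0 + B Z, Y = softmax(T)."""
    Z = 1.0 / (1.0 + np.exp(-(X @ W.T + b0)))      # sigma(v) = 1/(1 + e^-v)
    T = Z @ B.T + c0
    e = np.exp(T - T.max(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(N, p))
Y = np.eye(K)[rng.integers(0, K, size=N)]          # one-hot targets Y^(i)
F = forward(X)

sse = np.sum((Y - F) ** 2)                         # sum-of-squared error R(theta)
xent = -np.sum(Y * np.log(F))                      # cross-entropy R(theta)
G = F.argmax(axis=1)                               # classifier G(x) = argmax_k f_k(x)
```

Training would then minimize R(\theta) over the M(p + 1) + K(M + 1) unknowns, typically by back-propagation.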
Model: Deep Neural Network / Deep Learning
- Deep neural network / deep learning: a neural network with more than one hidden layer;
- Many variations, including the recurrent neural network (RNN), the auto-encoder (AE), and the convolutional neural network (CNN);
- The variations are generally modifications of the layer structure, the activation function, and the input-output flow.
Model: Convolutional Neural Network
It is a deep network with special types of hidden layers: the convolutional layer, the pooling layer, and the fully-connected layer (the same hidden layer as in regular neural networks).
CNN: Convolutional Layer
- Apply the element-wise product between a convolutional kernel (a matrix) and the corresponding region in the input matrix, sum the products up, and add a bias term; the result is the input of the next layer.
- Move the kernel along a certain direction with a certain stride size.
- Zero padding may be needed.
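The steps above can be sketched in NumPy; this is a minimal illustration with a hypothetical 2 × 2 kernel, not code from the project:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1, pad=0):
    """Element-wise product of the kernel with each local region of x,
    summed, plus a bias; the kernel moves by `stride`, with zero padding."""
    x = np.pad(x, pad)
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel) + bias
    return out

x = np.arange(16.0).reshape(4, 4)            # toy 4x4 input
k = np.array([[1.0, 0.0], [0.0, -1.0]])      # toy 2x2 kernel
out = conv2d(x, k)                           # 3x3 output; every entry is -5.0
```

With `pad=1` the output grows to 5 × 5, and with `stride=2` it shrinks to 2 × 2, matching the usual CNN bookkeeping.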
CNN: Convolutional Layer Continued...
The same weights and bias are used for each of the 3 × 3 hidden neurons.
See http://cs231n.github.io/convolutional-networks/ for an animated illustration.
CNN: ReLU
Rectified linear unit:
- Activation layer with max(0, x).
- Sparsity and feature selection.
CNN: Pooling Layer
- Down-samples the input layer;
- Max pooling (most popular) and average pooling;
- Applies a filter over local regions.
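A minimal NumPy sketch of ReLU followed by max pooling (illustrative values, not project code):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Down-sample by taking the max over each size x size local region."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., -2., 3., 0.],
              [4., -5., -6., 7.],
              [8., 9., -1., -3.],
              [0., 2., 5., 6.]])
pooled = max_pool(relu(x))     # [[4., 7.], [9., 6.]]
```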
Application: Regression
When time series data show nonlinearity, we can use a neural network to build a neural network autoregression (NNAR) instead of an AR model. An NNAR(p, K) is a neural network with X_{t-1}, \ldots, X_{t-p} as inputs, K neurons in the hidden layer, and X_t as the output. The following is an NNAR(4, 6).
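How the lagged inputs of an NNAR(p, K) are assembled can be sketched as follows (`nnar_design` is an illustrative helper, not part of the project code):

```python
import numpy as np

def nnar_design(x, p):
    """Build NNAR(p, K) training pairs: inputs (x_{t-1}, ..., x_{t-p}),
    output x_t, for each t with a full window of p lags available."""
    X = np.column_stack([x[p - lag : len(x) - lag] for lag in range(1, p + 1)])
    y = x[p:]
    return X, y

x = np.arange(10.0)            # a toy series
X, y = nnar_design(x, p=4)     # as in NNAR(4, 6): 4 lagged inputs per target
# first row: inputs (x_3, x_2, x_1, x_0), target x_4
```

The resulting (X, y) pairs are then fed to an ordinary single-hidden-layer network with K neurons.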
Application: Regression
Application: Classification
- Handwriting recognition
- Music classification
- and so on...
GTZAN Music Genre Classification
- n = 1000;
- T = 22,050 × 25 = 551,250;
- K = 10;
- 80% for training, 20% for testing;
- Preprocessing;
- Classification.
Mel-frequency cepstrum coefficients (MFCC)
MFCCs characterize the short-term power spectrum of a sound:
1. Take the Fourier transform of a windowed excerpt of a signal.
2. Map the powers of the spectrum obtained above onto the mel scale,using triangular overlapping kernel weights.
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) = 1127 \ln\left(1 + \frac{f}{700}\right),
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if itwere a signal.
5. Extract MFCCs as the amplitudes of the resulting spectrum.
- Apply triangle kernel weights at the given frequencies to compute the power spectrum.
- The bandwidth is equal on the mel scale but varies on the original scale (small at low frequencies and large at high frequencies).
Note: windows can be either overlapping or non-overlapping; we used a window width of 100 ms with a stride size of 25 ms.
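The mel mapping and its inverse follow directly from the formula above (`hz_to_mel` and `mel_to_hz` are illustrative helper names):

```python
import math

def hz_to_mel(f):
    """Map frequency in Hz to the mel scale: m = 2595 log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Band edges equally spaced in mel are narrow at low frequencies
# and wide at high frequencies when mapped back to Hz.
edges = [mel_to_hz(m) for m in [0, 500, 1000, 1500, 2000]]
```

This is why the triangular filters have equal bandwidth on the mel scale but unequal bandwidth on the original frequency scale.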
Advantages:
- Approximates the human auditory system's response. Demo at http://www.apronus.com/music/flashpiano.htm
- Downsamples the raw data by sampling at a few frequencies (20 Hz-8000 Hz). Demo at https://en.wikipedia.org/wiki/Audio_frequency
- Utilizes local information in both the time domain and the frequency domain.
Final Model
- We attempted a typical CNN, but got disappointing results:
  - the MFCC matrix has few image-like features;
  - the algorithm did not converge;
  - Li et al. (2010) took 2 hours to train a CNN that classifies only 3 genres.
- Our final model is a deep fully connected network, implemented in MATLAB in only a few lines of code and trained in less than 2 minutes.
layers = [imageInputLayer([21 997 1])   % input: 21 x 997 MFCC matrix per track
    fullyConnectedLayer(2000)           % fully connected layers of decreasing width
    fullyConnectedLayer(1000)
    fullyConnectedLayer(500)
    fullyConnectedLayer(250)
    fullyConnectedLayer(n_class)        % n_class = 10 genres
    softmaxLayer                        % class probabilities
    classificationLayer()];             % cross-entropy loss
Binary results
           blues classical country disco hiphop  jazz  metal   pop  reggae  rock
blues     100.0%  97.5%   72.5%  82.5%  70.0%  87.5%  80.0%  90.0%  72.5%  72.5%
classical  97.5% 100.0%   92.5%  95.0% 100.0%  87.5% 100.0% 100.0%  97.5%  97.5%
country    72.5%  92.5%  100.0%  82.5%  77.5%  77.5%  95.0%  85.0%  85.0%  65.0%
disco      82.5%  95.0%   82.5% 100.0%  70.0%  95.0%  92.5%  80.0%  72.5%  70.0%
hiphop     70.0% 100.0%   77.5%  70.0% 100.0%  82.5%  90.0%  82.5%  72.5%  65.0%
jazz       87.5%  87.5%   77.5%  95.0%  82.5% 100.0%  97.5%  97.5%  80.0%  87.5%
metal      80.0% 100.0%   95.0%  92.5%  90.0%  97.5% 100.0%  92.5% 100.0%  92.5%
pop        90.0% 100.0%   85.0%  80.0%  82.5%  97.5%  92.5% 100.0%  90.0%  92.5%
reggae     72.5%  97.5%   85.0%  72.5%  72.5%  80.0% 100.0%  90.0% 100.0%  77.5%
rock       72.5%  97.5%   65.0%  70.0%  65.0%  87.5%  92.5%  92.5%  77.5% 100.0%
- Overall above 80%.
- Lower accuracies: country vs. blues (72.5%), hiphop vs. blues (70%), hiphop vs. disco (70%), rock vs. blues (72.5%), country vs. rock (65%), and hiphop vs. rock (65%).
- Classical, metal and pop are the three most distinguishable genres;
- Blues, country and rock are the three least distinguishable genres.
Multi-category results
Confusion matrix on 3 genres (rows: output class, columns: target class; 20 tracks per genre):

           classical  metal  pop
classical       19      0     1
metal            1     20     2
pop              0      0    17

Per-genre accuracy: classical 95.0%, metal 100%, pop 85.0%; overall 93.3%.
Confusion matrix on 4 genres (rows: output class, columns: target class; 20 tracks per genre):

           classical  jazz  metal  pop
classical       18      1     0     1
jazz             2     16     0     0
metal            0      1    20     2
pop              0      2     0    17

Per-genre accuracy: classical 90.0%, jazz 80.0%, metal 100%, pop 85.0%; overall 88.8%.
Confusion matrix on 5 genres (rows: output class, columns: target class; 20 tracks per genre):

           blues  classical  disco  metal  pop
blues        13       0        3      1     2
classical     3      19        0      0     2
disco         1       0       11      2     4
metal         2       1        1     17     2
pop           1       0        5      0    10

Per-genre accuracy: blues 65.0%, classical 95.0%, disco 55.0%, metal 85.0%, pop 50.0%; overall 70.0%.
Confusion matrix on all 10 genres (rows: output class, columns: target class; 20 tracks per genre):

           blues classical country disco hiphop jazz metal pop reggae rock
blues        8      0        7      0      1     0     0    1     3     3
classical    0     17        1      0      1     1     0    0     1     0
country      5      1        7      1      3     1     0    1     2     0
disco        2      0        1     10      3     3     1    0     2     5
hiphop       0      0        0      3      4     0     0    0     3     1
jazz         0      1        1      0      0     8     0    0     0     0
metal        4      0        0      2      6     1    18    2     1     5
pop          0      0        1      3      1     0     1   16     0     2
reggae       0      0        1      0      1     4     0    0     5     2
rock         1      1        1      1      0     2     0    0     3     2

Per-genre accuracy: blues 40.0%, classical 85.0%, country 35.0%, disco 50.0%, hiphop 20.0%, jazz 40.0%, metal 90.0%, pop 80.0%, reggae 25.0%, rock 10.0%; overall 47.5%.
Conclusion
- The deep neural network yields competitive classification accuracy.
- Advantage: predictive power;
- Disadvantage: interpretability;
- The MFCCs capture the key features in the musical audio signals.
Thank You!