GCT634: Musical Applications of Machine Learning Tonal Analysis Hidden Markov Model Graduate School of Culture Technology, KAIST Juhan Nam


  • GCT634: Musical Applications of Machine Learning
    Tonal Analysis

    Hidden Markov Model

    Graduate School of Culture Technology, KAIST
    Juhan Nam

  • Outline

    • Introduction
    - Tonality
    - Perceptual Distance of Two Tones
    - Chords and Scales

    • Tonal Analysis
    - Key Estimation
    - Chord Recognition

    • Hidden Markov Model

  • Introduction

    Examples: Bach's chorale harmonizations, jazz "Real Book" lead sheets, pop music

  • Tonality

    • Tonal music has a tonal center called the key
    - one of 12 pitch classes (C, C#, D, …, B)

    • Tonal music uses a major or minor scale built on the key, and the notes in the scale play different roles

    • Notes in tonal music are harmonized by chords

    (C major scale)

  • Tonality

    • A sequence of notes or a chord progression conveys a certain degree of stability or instability
    - e.g., cadence (V-I, IV-I), tension (sus2, sus4)

    • How is tonality formed?
    - In other words, how do we perceive different degrees of stability or tension from notes?

  • Tonality

    • Consonance and Dissonance
    - If two sinusoidal tones are within 3 semitones (a minor 3rd) in frequency, they sound dissonant
    - Dissonance is strongest when the tones are about one quarter of the critical band apart
    - Critical bands become wider below 500 Hz, so two low notes can sound dissonant (e.g., two piano notes in the low register)

    • Consonance of two harmonic tones
    - Determined by how many closely located overtones the two tones share within critical bands

  • Consonance Rating of Intervals in Music

    • The perceptual distance between two notes is different from the semitone distance between them.

  • Chords

    • The basic units of tonal harmony
    - Triads, 7th, 9th, 11th, … chords

    • Triads are formed by choosing three notes that make the most consonant (or "most harmonized") sound
    - This amounts to stacking major or minor 3rds
    - 7th and 9th chords are obtained by stacking additional 3rds

    • The quality of consonance becomes more sophisticated as more notes are added
    - Music theory is largely about how to create tension and resolve it with different qualities of consonance

  • Scales in Tonal Harmony

    • Major scale
    - Formed by spreading the notes of three major chords

    • Minor scale
    - Formed by spreading the notes of three minor chords (natural minor scale)
    - The harmonic or melodic minor scale can be formed by mixing minor and major chords

  • Automatic Chord Recognition

    • Identifying the chord progression of tonal music

    • It is a challenging task (even for humans)
    - Chords are not explicit in the music
    - Non-chord or passing notes
    - Key changes and chromaticism require in-depth knowledge of music theory
    - In audio, multiple musical instruments are mixed
      - Relevant: harmonically arranged notes
      - Irrelevant: percussive sounds (though they can help detect chord changes)

    • What kind of audio features can be extracted to recognize chords in a robust way?

  • Chroma Features: FFT-based approach

    • Compute a spectrogram and a mapping matrix
    - Convert each frequency to the musical pitch scale and take its pitch class
    - Set the entry for the corresponding pitch class to one and all others to zero
    - Adjust the non-zero values so that low-frequency content gets more weight
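A minimal sketch of this FFT-bin-to-pitch-class mapping in Python (the function names and the nearest-pitch rounding are my own simplifications; the low-frequency weighting step mentioned above is omitted):

```python
import numpy as np

def chroma_mapping_matrix(n_bins, sr, n_fft):
    """Build a 12 x n_bins binary matrix that maps FFT bins to pitch classes.

    Each bin's center frequency is converted to MIDI pitch and folded to one
    of the 12 pitch classes.
    """
    M = np.zeros((12, n_bins))
    for k in range(1, n_bins):               # skip the DC bin
        freq = k * sr / n_fft
        midi = 69 + 12 * np.log2(freq / 440.0)
        pitch_class = int(round(midi)) % 12
        M[pitch_class, k] = 1.0
    return M

def chroma(spectrogram, sr, n_fft):
    """Map a magnitude spectrogram (n_bins x n_frames) to a 12 x n_frames chromagram."""
    M = chroma_mapping_matrix(spectrogram.shape[0], sr, n_fft)
    C = M @ spectrogram
    norm = C.sum(axis=0, keepdims=True)      # normalize each frame to sum to one
    norm[norm == 0] = 1.0
    return C / norm
```

A single spectral peak near 440 Hz then lands in pitch class A (index 9) of the chroma vector.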

  • Chroma Features: Filter-bank approach

    • A filter bank can be used to obtain a log-scale time-frequency representation
    - Center frequencies are aligned with the 88 piano notes
    - Bandwidths are set to be constant-Q and robust to +/- 25 cent detuning

    • The outputs that belong to the same pitch class are wrapped and summed.

    (Müller, 2011)

  • Beat-Synchronous Chroma Features

    • Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)

    (From Ellis’ slides)

  • Key Estimation Overview

    • Estimate the musical key from music data
    - One of 24 keys: 12 pitch classes (C, C#, D, …, B) x major/minor

    • General framework (Gomez, 2006)

    (Block diagram: Chroma Features → Average → Similarity Measure against Key Templates → Key Strength → estimated key, e.g., G major)

  • Key Template

    • Probe tone profile (Krumhansl and Kessler, 1982)
    - Relative stability or weight of tones
    - Listeners rated how well each probe tone completed the first seven notes of a major scale
    - For example, in C major: C, D, E, F, G, A, B, … what?

    Probe Tone Profile - Relative Pitch Ranking

  • Key Estimation

    • Similarity by cross-correlation between chroma features and templates

    • Find the key that produces the maximum correlation
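This correlation-based search can be sketched in a few lines (a minimal illustration using the published Krumhansl-Kessler profile values, rotated to all 12 tonics; the helper name `estimate_key` is my own):

```python
import numpy as np

# Krumhansl-Kessler probe tone profiles (index 0 = tonic)
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(mean_chroma):
    """Return (tonic pitch class, mode) with the highest correlation
    between the averaged chroma vector and the 24 rotated key templates."""
    best, best_r = (None, None), -np.inf
    for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
        for tonic in range(12):
            template = np.roll(profile, tonic)   # shift tonic to this pitch class
            r = np.corrcoef(mean_chroma, template)[0, 1]
            if r > best_r:
                best_r, best = r, (tonic, mode)
    return best
```

Feeding in a chroma vector shaped like the G major profile (tonic at pitch class 7) should recover G major.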

  • Chord Recognition

    • Estimate chords from music data
    - Typically one of 24 chords: 12 pitch classes x major/minor
    - Often diminished chords are added (36 chords)

    • General Framework

    (Block diagram: Audio/Transform → Chroma Features → Decision Making with Chord Templates or Models (template matching, HMM, SVM) → Chords)

  • Template-Based Approach

    • Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches

    • Chord Templates

    (from Bello’s Slides)

  • Template-Based Approach

    • Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value in each frame

    (from Bello’s Slides)
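The template matching step can be sketched as follows (a simplified illustration using binary triad templates and inner products rather than normalized cross-correlation; the function names and chord-label format are my own):

```python
import numpy as np

def chord_templates():
    """24 binary triad templates: 12 major then 12 minor."""
    pcs = "C C# D D# E F F# G G# A A# B".split()
    names, T = [], []
    for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
        for root in range(12):
            t = np.zeros(12)
            for iv in intervals:
                t[(root + iv) % 12] = 1.0    # mark the chord tones
            names.append(pcs[root] + ":" + quality)
            T.append(t)
    return names, np.array(T)

def recognize(chroma_frames):
    """Per frame, pick the template with the maximum match score."""
    names, T = chord_templates()
    scores = T @ chroma_frames               # (24, n_frames)
    return [names[i] for i in scores.argmax(axis=0)]
```

A frame with energy at pitch classes {C, E, G} matches the C major template; {A, C, E} matches A minor.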

  • Review

    • The template approach is too simplistic
    - The binary templates are hard assignments

    • We can use a multi-class classifier instead
    - The output is one of the target chords
    - However, the local estimates tend not to be temporally smooth

    • We need an algorithm that considers the temporal dependency between chords
    - The majority of tonal music follows certain types of chord progressions

  • Hidden Markov Model (HMM)

    • A probabilistic model for time-series data
    - Speech, gesture, DNA sequences, financial data, weather data, …

    • Assumes that the time-series data are generated from hidden states, and that the hidden states follow a Markov model

    • Learning-based approach
    - Needs training data annotated with labels
    - The labels usually correspond to the hidden states

  • Markov Model

    • A random variable $q$ takes one of $N$ states $(S_1, S_2, \dots, S_N)$, and at each time step one of the states is randomly chosen: $q_t \in \{S_1, S_2, \dots, S_N\}$

    • The probability distribution of the current state is determined by the previous state(s)
    - First-order: $P(q_t \mid q_1, q_2, \dots, q_{t-1}) = P(q_t \mid q_{t-1})$
    - Second-order: $P(q_t \mid q_1, q_2, \dots, q_{t-1}) = P(q_t \mid q_{t-1}, q_{t-2})$

    • The first-order Markov model is widely used for simplicity

  • Markov Model

    • Example: chord progression
    - $q_t \in \{C, F, G\}$
    - The transition probability matrix is 3 by 3

    (State diagram over C, F, G with start and end states)

    $P(q_t = C \mid q_{t-1} = C) = 0.7$
    $P(q_t = F \mid q_{t-1} = C) = 0.1$
    $P(q_t = G \mid q_{t-1} = C) = 0.2$

    $P(q_t = C \mid q_{t-1} = F) = 0.2$
    $P(q_t = F \mid q_{t-1} = F) = 0.6$
    $P(q_t = G \mid q_{t-1} = F) = 0.2$

    $P(q_t = C \mid q_{t-1} = G) = 0.3$
    $P(q_t = F \mid q_{t-1} = G) = 0.1$
    $P(q_t = G \mid q_{t-1} = G) = 0.6$

  • Markov Model

    • The joint probability of a sequence of states is simple under the Markov model

    $P(q_1, q_2, \dots, q_t)$
    $= P(q_1, q_2, \dots, q_{t-1})\, P(q_t \mid q_1, q_2, \dots, q_{t-1})$
    $= P(q_1, q_2, \dots, q_{t-1})\, P(q_t \mid q_{t-1})$
    $= P(q_1, q_2, \dots, q_{t-2})\, P(q_{t-1} \mid q_1, q_2, \dots, q_{t-2})\, P(q_t \mid q_{t-1})$
    $= P(q_1, q_2, \dots, q_{t-2})\, P(q_{t-1} \mid q_{t-2})\, P(q_t \mid q_{t-1})$
    $= P(q_1)\, P(q_2 \mid q_1) \cdots P(q_{t-1} \mid q_{t-2})\, P(q_t \mid q_{t-1})$

  • What Can We Do with the Markov Model?

    • Generate a chord sequence
    - e.g., C - C - C - C - F - F - C - C - G - G - C - C - …
    - We can also generate a melody if we define a transition probability matrix among notes

    • Evaluate whether a specific chord progression is more likely than others
    - For example, C-G-C is more likely than C-F-C (assuming $P(q_1 = C) = 1$)

    $P(q = C, G, C) = P(q_1 = C)\, P(q_2 = G \mid q_1 = C)\, P(q_3 = C \mid q_2 = G) = 0.2 \times 0.3 = 0.06$
    $P(q = C, F, C) = P(q_1 = C)\, P(q_2 = F \mid q_1 = C)\, P(q_3 = C \mid q_2 = F) = 0.1 \times 0.2 = 0.02$
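These two evaluations can be checked with a few lines of Python (a minimal sketch; the names `sequence_prob`, `TRANS`, and `INIT` are my own, with the transition values taken from the example above):

```python
def sequence_prob(seq, trans, init):
    """Probability of a state sequence under a first-order Markov model:
    P(q_1) * prod_t P(q_t | q_{t-1})."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

# Transition matrix from the chord-progression example (rows: previous chord)
TRANS = {"C": {"C": 0.7, "F": 0.1, "G": 0.2},
         "F": {"C": 0.2, "F": 0.6, "G": 0.2},
         "G": {"C": 0.3, "F": 0.1, "G": 0.6}}
INIT = {"C": 1.0, "F": 0.0, "G": 0.0}     # assume we always start on C
```

With these numbers, C-G-C evaluates to 0.06 and C-F-C to 0.02, matching the computation above.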

  • What Can We Do with a Markov Model?

    • Compute the probability that the chord at time $T$ is C (or F or G)
    - Naive method: sum over all paths that have the C chord at time $T$: exponential!
    - Clever method: use recursive induction

    $P(q_T = C) = P(q_T = C \mid q_{T-1} = C)\, P(q_{T-1} = C)$
    $\quad + P(q_T = C \mid q_{T-1} = F)\, P(q_{T-1} = F)$
    $\quad + P(q_T = C \mid q_{T-1} = G)\, P(q_{T-1} = G)$

    - Repeat this for $P(q_i = C)$, $P(q_i = F)$, $P(q_i = G)$ for $i = T-1, T-2, \dots, 1$
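The recursion above makes the computation linear in $T$ rather than exponential; a sketch (reusing the transition matrix from the earlier example; the helper name `state_marginals` is my own):

```python
def state_marginals(T_steps, trans, init):
    """P(q_t = s) for t = 1..T_steps via the recursion
    P(q_t = s) = sum_r P(q_t = s | q_{t-1} = r) P(q_{t-1} = r)."""
    probs = [dict(init)]
    states = list(init)
    for _ in range(T_steps - 1):
        prev = probs[-1]
        probs.append({s: sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    return probs

TRANS = {"C": {"C": 0.7, "F": 0.1, "G": 0.2},
         "F": {"C": 0.2, "F": 0.6, "G": 0.2},
         "G": {"C": 0.3, "F": 0.1, "G": 0.6}}
INIT = {"C": 1.0, "F": 0.0, "G": 0.0}
```

Starting from C with probability 1, the marginals at $t = 2$ are simply the first row of the transition matrix, and each step's marginals sum to one.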

  • Chord Recognition from Audio

    • What we observe are not chords but audio features (e.g. chroma)

    • We want to infer a chord sequence from audio feature sequences

    Chords (hidden): $q_1, q_2, \dots$
    Audio features (observed): $O_1, O_2, \dots$

  • Hidden Markov Model (HMM)

    • The hidden states follow the Markov model

    • Given a state, the corresponding observation distribution is independent of previous states or observations
    - Each state has an emission distribution

    (Graphical model: $\dots \rightarrow q_{t-1} \rightarrow q_t \rightarrow q_{t+1} \rightarrow \dots$ with emissions $O_{t-1}, O_t, O_{t+1}$; each chord state C, F, G has an emission distribution $P(O \mid q_t = C)$, $P(O \mid q_t = F)$, $P(O \mid q_t = G)$)

  • Hidden Markov Model (HMM)

    • Model parameters
    - Initial state probabilities: $P(q_1) \rightarrow \pi_i$
    - Transition probability matrix: $P(q_t \mid q_{t-1}) \rightarrow a_{ij}$
    - Observation distribution given a state: $P(O \mid q_j) \rightarrow b_j$ (e.g., Gaussian)

    • How can we learn the parameters from data?

  • Training HMM for Chord Recognition

    • If chord labels are time-aligned with the audio, estimate the parameters directly from the data
    - Initial state probabilities and transition probability matrix: count chords and chord-to-chord transitions
    - Observation distribution: fit a Gaussian model to the audio features separately for each chord
    - Easy to train, but time-aligned data are very expensive to obtain

    • If chord labels are not aligned with the audio, we should perform maximum-likelihood estimation
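For the time-aligned case, the counting step can be sketched as follows (a hypothetical helper; the additive smoothing is my own addition to avoid zero probabilities for unseen transitions, and the Gaussian fitting of the observations is omitted):

```python
import numpy as np

def train_supervised(label_seqs, smoothing=1.0):
    """Estimate initial and transition probabilities by counting
    aligned chord-label sequences."""
    states = sorted({s for seq in label_seqs for s in seq})
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    init = np.full(n, smoothing)
    trans = np.full((n, n), smoothing)
    for seq in label_seqs:
        init[idx[seq[0]]] += 1                 # count starting chords
        for a, b in zip(seq, seq[1:]):
            trans[idx[a], idx[b]] += 1         # count chord-to-chord transitions
    init /= init.sum()                         # normalize to probabilities
    trans /= trans.sum(axis=1, keepdims=True)  # each row sums to one
    return states, init, trans
```

With smoothing disabled, a single training sequence C-C-G-C yields one C→C, one C→G, and one G→C transition, so the C row is [0.5, 0.5] and the G row is [1.0, 0.0].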

  • Training HMM: EM algorithm

    • If chord labels are not aligned with the audio, use the EM algorithm (the Baum-Welch method)

    • E-step: evaluate the probability of transitioning from state $S_i$ at time $t$ to state $S_j$ at time $t+1$, given the observations

    $\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$

    - Then the probability of being in state $S_i$ at time $t$ can also be derived:

    $\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j)$

  • Training HMM: EM algorithm

    • M-step: update the parameters so that they maximize the log-likelihood given the E-step quantities
    - $\sum_{t=1}^{T-1} \gamma_t(i)$: expected number of transitions from $S_i$ (i.e., how many times state $S_i$ is visited from $1$ to $T-1$)
    - $\sum_{t=1}^{T-1} \xi_t(i, j)$: expected number of transitions from $S_i$ to $S_j$

    • We can use the labels to constrain the model (e.g., for initialization)

    $\pi_i = \gamma_1(i)$ = expected frequency in state $S_i$ at time $t = 1$

    $a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \dfrac{\text{expected number of transitions from } S_i \text{ to } S_j}{\text{expected number of transitions from } S_i}$

    $b_j(k) = \dfrac{\sum_{t=1}^{T} \gamma_t(j) \text{ s.t. } O_t = v_k}{\sum_{t=1}^{T} \gamma_t(j)} = \dfrac{\text{expected number of times in state } S_j \text{ observing } v_k}{\text{expected number of times in state } S_j}$

  • Evaluating HMM for Chord Recognition

    • Find the most likely sequence of hidden states given the observations and the HMM parameters

    • Viterbi algorithm
    - Define a probability variable:

    $\delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1, q_2, \dots, q_t = S_i, O_1, O_2, \dots, O_t \mid \lambda)$

    - Initialization (from the start state):

    $\delta_1(i) = \pi_i\, b_i(O_1), \quad \psi_1(i) = 0, \quad 1 \le i \le N$

    - Recursion:

    $\delta_t(j) = \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\, b_j(O_t)$
    $\psi_t(j) = \operatorname{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \quad 2 \le t \le T,\ 1 \le j \le N$

    - Termination (to the end state):

    $P^* = \max_{1 \le i \le N} \delta_T(i)$
    $q_T^* = \operatorname{argmax}_{1 \le i \le N} \delta_T(i)$

  • The Viterbi Trellis

    • Recall Dynamic Programming!

    (Trellis diagram: states C, F, G at each time step $t = 1, 2, 3, \dots, T-1, T$, with start and end nodes; $v_t(j)$ denotes the node values)

  • Chord Recognition Result

    • The HMM produces a smoother chord recognition output

    (From Ellis’ E4896 practicals)

  • References

    • P. R. Cook (Ed.), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
    • C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
    • M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
    • E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive Versus Machine Learning Modeling Strategies", 2004
    • M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
    • T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999
    • A. Sheh and D. Ellis, "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models", 2003
    • L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", 1989