
  • Dan Ellis Audio Signal Recognition 2003-11-13 - 1 / 25

    Audio Signal Recognition for Speech, Music, and Environmental Sounds

    1. Pattern Recognition for Sounds
    2. Speech Recognition
    3. Other Audio Applications
    4. Observations and Conclusions

    Dan Ellis <[email protected]>
    Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
    Columbia University, New York
    http://labrosa.ee.columbia.edu/

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 2 / 25

    Pattern Recognition for Sounds

    Pattern recognition is abstraction
    - continuous signal → discrete labels
    - an essential part of understanding? information extraction

    Sound is a challenging domain
    - sounds can be highly variable
    - human listeners are extremely adept

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 3 / 25

    Pattern classification

    Classes are defined as distinct regions in some feature space

    - e.g. formant frequencies to define vowels

    Issues

    - finding segments to classify

    - transforming to an appropriate feature space

    - defining the class boundaries

    [Figure: spectrogram of "ay ao" with formant tracks (f/Hz vs. time/s), and a scatter
    plot of Pols vowel formants in F2/Hz vs. F1/Hz space: "u" (x), "o" (o), "a" (+),
    with a new observation x to be classified]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 4 / 25

    Classification system parts

    Sensor → signal
    Pre-processing/segmentation (e.g. STFT, locate vowels) → segment
    Feature extraction (e.g. formant extraction) → feature vector
    Classification → class
    Post-processing (e.g. context constraints, costs/risk)

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 5 / 25

    Feature extraction

    Feature choice is critical to performance
    - make important aspects explicit, remove irrelevant details
    - equivalent representations can perform very differently in practice
    - major opening for domain knowledge (cleverness)

    Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
    - DCT of log spectrum on an auditory frequency scale
    - approximately decorrelated ...
    (a short code sketch follows below)

    [Figure: Mel spectrogram (freq. channel vs. time/sec) and the corresponding MFCCs
    (cepstral coef. vs. time/sec)]
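
    A minimal sketch of the MFCC computation described above (log mel spectrum
    followed by a DCT), assuming Python with the librosa library; the file name
    and frame parameters are illustrative and not taken from the slides.

    # Sketch: MFCC feature extraction (librosa assumed; "speech.wav" is a placeholder)
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)            # waveform at 16 kHz
    # mel spectrogram -> log -> DCT, i.e. "DCT of log spectrum on auditory scale"
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,      # 13 cepstral coefficients
                                n_fft=400, hop_length=160)  # ~25 ms windows, 10 ms hop
    print(mfcc.shape)   # (13, num_frames): one feature vector per 10 ms frame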

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 6 / 25

    Statistical Interpretation

    Observations are random variables whose distribution depends on the class:
    source distributions p(x|i)
    - reflect variability in the feature
    - reflect noise in observation
    - generally have to be estimated from data (rather than known in advance)

    [Figure: generative view - a hidden, discrete class i produces a continuous
    observation x via p(x|i); inference goes back from x to Pr(i|x); plot of
    p(x|i) vs. x for classes 1-4]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 7 / 25

    Priors and posteriors

    Bayesian inference can be interpreted as updating prior beliefs with
    new information, x:

        Pr(i|x) = p(x|i) Pr(i) / Σj p(x|j) Pr(j)

        posterior probability = likelihood × prior probability / evidence,
        where evidence = p(x) = Σj p(x|j) Pr(j)

    Posterior is prior scaled by likelihood & normalized by evidence
    (so Σi Pr(i|x) = 1)

    Minimize the probability of error by choosing the maximum a posteriori
    (MAP) class:  argmaxi Pr(i|x)
    (sketched numerically below)
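
    A small numerical sketch of the Bayes-rule update and MAP decision above;
    the class count and the likelihood/prior values are made up for illustration.

    # Sketch: posterior update and MAP decision (illustrative numbers, not from the slides)
    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])          # Pr(i) for 3 classes
    likelihood = np.array([0.01, 0.08, 0.02])  # p(x|i) evaluated at the observed x

    evidence = np.sum(likelihood * prior)      # p(x) = sum_j p(x|j) Pr(j)
    posterior = likelihood * prior / evidence  # Pr(i|x); sums to 1
    map_class = int(np.argmax(posterior))      # MAP decision

    print(posterior, map_class)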

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 8 / 25

    Practical implementation

    Optimal classifier is argmaxi Pr(i|x), but we don't know Pr(i|x)

    So, model the conditional distributions p(x|i),
    then use Bayes' rule to find the MAP class

    [Diagram: labeled training examples {xn} → sort according to class →
    estimate conditional pdf for class 1, p(x|1)]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 9 / 25

    Gaussian models

    Model data distributions via a parametric model
    - assume a known form, estimate a few parameters

    E.g. Gaussian in 1 dimension:

        p(x|i) = [1 / (√(2π) σi)] · exp( -1/2 · ((x - μi) / σi)² )

        (the 1 / (√(2π) σi) factor is the normalization)

    For higher dimensions, need a mean vector μi and a d x d covariance matrix Σi

    Fit more complex distributions with mixtures...
    (see the fitting sketch below)

    [Figure: 2-D Gaussian shown as a surface plot and as contours in the plane]
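
    A minimal sketch of the recipe from the last two slides: estimate a Gaussian
    per class, then pick the MAP class. NumPy/SciPy and the synthetic 2-D
    "formant-like" data are my assumptions; none of the numbers come from the
    Pols vowel data.

    # Sketch: per-class Gaussian fit + MAP classification (synthetic data, not the Pols vowels)
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    # Two made-up vowel classes in a 2-D (F1, F2)-like feature space
    train = {0: rng.normal([300, 800], [40, 80], size=(100, 2)),
             1: rng.normal([600, 1200], [50, 100], size=(100, 2))}

    # Estimate mean vector and covariance matrix for each class
    models = {i: multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))
              for i, X in train.items()}
    prior = {i: 0.5 for i in models}             # assume equal priors

    def map_class(x):
        # argmax_i p(x|i) Pr(i)  (the evidence cancels in the argmax)
        scores = {i: models[i].pdf(x) * prior[i] for i in models}
        return max(scores, key=scores.get)

    print(map_class([350, 850]))   # -> 0 for a point near the first class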

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 10 / 25

    Gaussian models for formant data

    Single Gaussians are a reasonable fit for this data

    Extrapolation of decision boundaries can be surprising

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 11 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition
       - How it's done
       - What works, and what doesn't

    3. Other Audio Applications

    4. Observations and Conclusions

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 12 / 25

    How to recognize speech?

    Cross correlate templates?
    - waveform?
    - spectrogram?
    - time-warp problems

    Classify short segments as phones (or ...), handle time-warp later
    - model with slices of ~10 ms
    - pseudo-piecewise-stationary model of words:

    [Figure: spectrogram (freq/Hz vs. time/s) of a word labeled with the phone
    sequence "sil g w eh n sil"]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 13 / 25

    Speech Recognizer Architecture

    Almost all current systems are the same:

    sound → Feature calculation → feature vectors
    → Acoustic classifier (acoustic model parameters) → phone probabilities
    → HMM decoder (word models, language model) → phone / word sequence
    → Understanding / application...

    - word models: e.g. "sat" → s ah t
    - language model: e.g. p("sat"|"the","cat"), p("saw"|"the","cat")

    Biggest source of improvement is increase in training data
    - .. along with algorithms to take advantage
    (a toy decoding sketch follows below)
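
    The HMM decoder box above is, at its core, a Viterbi search for the most
    likely state (phone) sequence given per-frame acoustic probabilities. A
    minimal sketch with made-up transition and observation probabilities, not
    anything trained:

    # Sketch: Viterbi decoding over HMM states (toy numbers, not a trained acoustic model)
    import numpy as np

    def viterbi(log_trans, log_obs, log_init):
        """log_trans: (S,S) state transitions; log_obs: (T,S) per-frame state
        log-likelihoods (e.g. from an acoustic classifier); log_init: (S,) priors."""
        T, S = log_obs.shape
        delta = log_init + log_obs[0]            # best score ending in each state
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # (prev state, next state)
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_obs[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):            # trace back the best path
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    # 3 toy states, 4 frames
    lt = np.log(np.array([[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]) + 1e-12)
    lo = np.log(np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]))
    print(viterbi(lt, lo, np.log(np.array([1.0, 1e-6, 1e-6]))))  # most likely state sequence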

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 14 / 25

    Speech: Progress

    Annual NIST evaluations
    - steady progress (?), but still order(s) of magnitude worse than human listeners

    [Figure: word error rate vs. year, 1990-2005, with axis marks at 30% and 3%]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 15 / 25

    Speech: Problems

    Natural, spontaneous speech is weird!
    - coarticulation
    - deletions
    - disfluencies → is word transcription even a sensible approach?

    Other major problems
    - speaking style, rate, accent
    - environment / background...

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 16 / 25

    Speech: What works, what doesn't

    What works:
    Techniques:
    - MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
    - Using lots of training data
    Domains:
    - Controlled, low-noise environments
    - Constrained, predictable contexts
    - Motivated, co-operative users

    What doesn't work:
    Techniques:
    - rules based on insight
    - perceptual representations (except when they do...)
    Domains:
    - spontaneous, informal speech
    - unusual accents, voice quality, speaking style
    - variable, high-noise background / environment

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 17 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition

    3. Other Audio Applications
       - Meeting recordings
       - Alarm sounds
       - Music signal processing

    4. Observations and Conclusions

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 18 / 25

    Other Audio Applications: ICSI Meeting Recordings corpus

    Real meetings, 16-channel recordings, 80 hrs
    - released through NIST/LDC

    Classification example: detecting emphasized utterances based on f0 contour
    (Kennedy & Ellis 03)
    - per-speaker normalized f0 as a one-dimensional feature → simple threshold
      classification (see the sketch below)

    [Figure: f0/Hz contours for Speaker 1 and Speaker 2, axis marks from 55 to 1760 Hz]
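
    A minimal sketch of the per-speaker f0 normalization and threshold rule
    described above; the f0 values, the use of the median, and the threshold are
    placeholders of my own, not the parameters used by Kennedy & Ellis.

    # Sketch: emphasis detection by thresholding per-speaker normalized f0
    # (placeholder values; not the actual Kennedy & Ellis parameters)
    import numpy as np

    def is_emphasized(utt_f0, speaker_f0_history, threshold=1.5):
        """utt_f0: f0 track (Hz) for one utterance; speaker_f0_history: that
        speaker's f0 samples so far. Flags the utterance if its median f0 is
        more than `threshold` times the speaker's median f0."""
        norm = np.median(utt_f0) / np.median(speaker_f0_history)
        return norm > threshold

    history = np.array([110.0, 120.0, 115.0, 105.0])                 # speaker's usual pitch
    print(is_emphasized(np.array([180.0, 190.0, 175.0]), history))   # True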

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 19 / 25

    Personal Audio

    LifeLog / MyLifeBits / Remembrance Agent:
    - easy to record everything you hear

    Then what?
    - prohibitive to review
    - applications if access easier?

    Automatic content analysis / indexing...
    - find features to classify into e.g. locations

    [Figure: long-duration spectrograms (freq/Hz and freq/Bark) of several hours of
    personal audio, time axis in minutes and clock time from 14:30 to 18:30]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 20 / 25

    Alarm sound detection

    Alarm sounds have particular structure
    - clear even at low SNRs
    - potential applications...

    Contrast two systems: (Ellis 01)
    - standard, global features, P(X|M)
    - sinusoidal model, fragments, P(M,S|Y)
    - error rates high, but interesting comparisons...

    [Figure: spectrogram of "Restaurant + alarms (snr 0 ns 6 al 8)" (freq/kHz vs.
    time/sec) with MLP classifier output and sound object classifier output]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 21 / 25

    Music signal modeling

    Use a machine listener to navigate large music collections
    - e.g. unsigned bands on MP3.com

    Classification to label:
    - notes, chords, singing, instruments
    - .. information to help cluster music

    Artist models based on feature distributions
    - measure similarity between users' collections and new music?
      (Berenzweig & Ellis 03) - see the sketch below

    [Figure: cepstral-feature distributions (third vs. fifth cepstral coef) for
    madonna and bowie, and an artist space spanning Country to Electronica]
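
    A rough sketch of the "artist models based on feature distributions" idea:
    fit a mixture model to each artist's frame-level features and score new
    material under each model. The use of scikit-learn's GaussianMixture and the
    random stand-in features are my assumptions, not the actual Berenzweig &
    Ellis setup.

    # Sketch: artist similarity via per-artist feature-distribution models
    # (random stand-in features; GaussianMixture is an assumption, not the paper's exact model)
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    # Stand-ins for per-frame cepstral features from two artists' catalogs
    artist_frames = {"artistA": rng.normal(0.0, 1.0, size=(2000, 13)),
                     "artistB": rng.normal(0.5, 1.2, size=(2000, 13))}

    models = {name: GaussianMixture(n_components=8, covariance_type="diag",
                                    random_state=0).fit(X)
              for name, X in artist_frames.items()}

    new_track = rng.normal(0.0, 1.0, size=(500, 13))    # frames of an unseen track
    # Average log-likelihood of the new track under each artist model
    scores = {name: m.score(new_track) for name, m in models.items()}
    print(max(scores, key=scores.get))                  # most similar artist model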

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 22 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition

    3. Other Audio Applications

    4. Observations and Conclusions
       - Model complexity
       - Sound mixtures

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 23 / 25

    Observations and Conclusions: Training and test data

    Balance model/data size to avoid overfitting:

    Diminishing returns from more data:
    (see the toy example below)

    [Figure 1: error rate vs. training data or parameters, with training-data error
    falling while test-data error turns up in the overfitting region]

    [Figure 2: "WER for PLP12N-8k nets vs. net size & training data" - word error
    rate (%) vs. hidden-layer units (500-4000) and training set hours (9.25-74),
    showing an optimal parameter/data ratio and constant-training-time contours]
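
    A toy illustration of the overfitting point above, using synthetic 1-D
    regression data and polynomial degree as a stand-in for "model size"; this is
    my own illustration, not the WER experiment on the slide.

    # Sketch: compare training and test error as model size grows
    # (synthetic data; the growing train/test gap is the overfitting symptom)
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)          # training set
    x_te = np.linspace(0, 1, 100)
    y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, size=x_te.size)  # test set

    for degree in (1, 3, 6, 9):                      # increasing "model size"
        coef = np.polyfit(x, y, degree)
        tr = np.mean((np.polyval(coef, x) - y) ** 2)
        te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
        print(f"degree {degree}: train MSE {tr:.3f}  test MSE {te:.3f}")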

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 24 / 25

    Beyond classification

    No free lunch: classifier can only do so much
    - always need to consider other parts of the system

    Features
    - impose a ceiling on system performance
    - improved features allow simpler classifiers

    Segmentation / mixtures
    - e.g. speech-in-noise: only a subset of feature dimensions is available
      → missing-data approaches...

    [Figure: mixture Y(t) composed of sources S1(t) and S2(t)]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 25 / 25

    Summary

    Statistical Pattern Recognition
    - exploit training data for probabilistically-correct classifications

    Speech recognition
    - successful application of statistical PR
    - .. but many remaining frontiers

    Other audio applications
    - meetings, alarms, music
    - classification is information extraction

    Current challenges
    - variability in speech
    - acoustic mixtures

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 26 / 25

    Extra slides

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 27 / 25

    Neural network classifiers

    Instead of estimating p(x|i) and using Bayes' rule,
    can also try to estimate the posteriors Pr(i|x) directly (the decision boundaries)

    Sums over nonlinear functions of sums give a large range of decision surfaces...

    e.g. Multi-layer perceptron (MLP):

        yk = F[ Σj wjk · F[ Σi wij · xi ] ]

    Problem is finding the weights wij ... (training)
    (a forward-pass sketch follows below)

    [Figure: MLP with input layer (x1, x2, x3), hidden layer (h1, h2), output layer
    (y1, y2), weights wij and wjk, nonlinearity F[]]
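
    A minimal NumPy sketch of the MLP forward pass yk = F[Σj wjk F[Σi wij xi]];
    the sigmoid nonlinearity and the random weights are illustrative assumptions
    (the slide leaves F and the training method unspecified beyond "finding the
    weights").

    # Sketch: MLP forward pass y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ]
    # (random weights and a sigmoid F chosen for illustration)
    import numpy as np

    def F(a):                        # nonlinearity; sigmoid assumed here
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    w_ij = rng.normal(size=(3, 2))   # input (3) -> hidden (2) weights
    w_jk = rng.normal(size=(2, 2))   # hidden (2) -> output (2) weights

    x = np.array([0.2, -0.5, 1.0])   # input feature vector (x1, x2, x3)
    h = F(x @ w_ij)                  # hidden activations h_j
    y = F(h @ w_jk)                  # output activations y_k, e.g. class posteriors
    print(y)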

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 28 / 25

    Neural net classifier

    Models boundaries, not density p(x|i)

    Discriminant training
    - concentrate on boundary regions
    - needs to see all classes at once

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 29 / 25

    Why is Speech Recognition hard?

    Why not match against a set of waveforms?
    - waveforms are never (nearly!) the same twice
    - speakers minimize information/effort in speech

    Speech variability comes from many sources:
    - speaker-dependent (SD) recognizers must handle within-speaker variability
    - speaker-independent (SI) recognizers must also deal with variation between speakers
    - all recognizers are afflicted by background noise, variable channels

    Need recognition models that:
    - generalize, i.e. accept variations in a range, and
    - adapt, i.e. tune in to a particular variant

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 30 / 25

    Within-speaker variability

    Timing variation:
    - word duration varies enormously
    - fast speech reduces vowels

    Speaking style variation:
    - careful/casual articulation
    - soft/loud speech

    Contextual effects:
    - speech sounds vary with context, role: "How do you do?"

    [Figure: spectrogram (frequency vs. time, 0-3.0 s) of "SO I THOUGHT ABOUT THAT
    I THINK IT'S STILL POSSIBLE AND" with its reduced phone transcription]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 31 / 25

    Between-speaker variability

    Accent variation
    - regional / mother tongue

    Voice quality variation
    - gender, age, huskiness, nasality

    Individual characteristics
    - mannerisms, speed, prosody

    [Figure: spectrograms (freq/Hz vs. time/s) of the same material from two
    speakers, mbma0 and fjdm2]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 32 / 25

    Environment variability

    Background noise
    - fans, cars, doors, papers

    Reverberation
    - boxiness in recordings

    Microphone channel
    - huge effect on relative spectral gain

    [Figure: spectrograms (freq/Hz vs. time/s) of the same utterance from a close
    mic and a tabletop mic]
