The SPL-IT Query by Example Search on Speech system for MediaEval 2014

Jorge Proença 1,2

Arlindo Veiga 1,2

Fernando Perdigão 1,2


The 2014 Query by Example Search on Speech (QUESST)

1 Instituto de Telecomunicações, Coimbra, Portugal

2 Electrical and Computer Eng. Department, University of Coimbra, Portugal


SPL-IT system

MediaEval 2014 | October 16-17, 2014, Barcelona, Spain

Overview of the system:

Fuses Dynamic Time Warping (DTW) modifications

Fuses results from systems with phonetic recognizers for 3 languages

Phonetic Recognizer


Hard to extract good posteriorgrams with an HMM system (our in-house system).

Used 3 systems/languages (for 8 kHz) based on long temporal context and neural networks from Brno University of Technology (BUT):

Czech

Hungarian

Russian

Output: posteriorgrams (3 states per phoneme).

Leading and trailing silence/noise removed
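The slides do not say how silence/noise frames were detected. A minimal sketch of the trimming step, assuming the posteriorgram is a NumPy array of shape (frames, states), that the silence/noise state indices (`sil_idx`, hypothetical) are known for the recognizer, and that a frame counts as silence when those states hold at least half of the probability mass (the 0.5 threshold is also an assumption):

```python
import numpy as np

def trim_silence(post, sil_idx, threshold=0.5):
    """Drop leading/trailing frames dominated by silence/noise states.

    post: (n_frames, n_states) posteriorgram, each row sums to 1.
    sil_idx: indices of silence/noise states (hypothetical, recognizer-specific).
    threshold: assumed decision level, not taken from the slides.
    """
    sil_prob = post[:, sil_idx].sum(axis=1)
    speech = np.flatnonzero(sil_prob < threshold)
    if speech.size == 0:
        return post[:0]                      # no speech frame found
    # Keep everything between the first and last speech frame,
    # including any silence inside the utterance.
    return post[speech[0]:speech[-1] + 1]
```

Only the edges are trimmed here; pauses inside the query are left for the DTW to handle.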

Figure: State posteriorgram example for one query (axes: Frame vs. Phoneme State)

Dynamic Time Warping


Local Distance matrix:

Dot product of Query and Audio posterior probability vectors;

Back-off with $\lambda = 10^{-4}$

$D(q, x) = -\log(q \cdot x)$

Distance Matrix of Query vs Audio
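As an illustration, the local distance matrix can be computed for all query/audio frame pairs at once. This is a sketch under assumptions: the slides give the back-off value but not its form, so the code below mixes each posterior vector with a uniform distribution, which is one common choice and not necessarily the one used in the system.

```python
import numpy as np

def local_distance(query, audio, lam=1e-4):
    """Return the (n_q, n_a) matrix of -log dot products.

    query: (n_q, n_states) posteriorgram; audio: (n_a, n_states) posteriorgram.
    lam: back-off weight. The mixing with a uniform distribution is an
         assumed form of the back-off; it keeps every dot product > 0.
    """
    n_states = query.shape[1]
    q = (1.0 - lam) * query + lam / n_states
    x = (1.0 - lam) * audio + lam / n_states
    return -np.log(q @ x.T)
```

Without some back-off, frames whose posterior vectors share no active state would give a dot product of zero and an infinite distance.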

Dynamic Time Warping


Basic DTW strategy (A1):

Smallest distance in identically weighted unitary jumps.

Distance Matrix (top) and accumulated Distance matrix (bottom) of Query vs Audio
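A minimal sketch of the accumulation behind the basic strategy (A1), assuming the three unit jumps are vertical, horizontal and diagonal, and that the query may start and end at any audio frame. The boundary conditions and the absence of path-length normalization are assumptions, since the slides do not state them.

```python
import numpy as np

def subsequence_dtw(dist):
    """Accumulate a (n_q, n_a) local distance matrix with unit-weight jumps.

    Assumption: free start and free end along the audio axis.
    Returns the accumulated matrix and the best end frame in the audio.
    """
    n_q, n_a = dist.shape
    acc = np.full((n_q, n_a), np.inf)
    acc[0, :] = dist[0, :]                       # free start along the audio axis
    for i in range(1, n_q):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]   # only a vertical jump is possible
        for j in range(1, n_a):
            best_prev = min(acc[i - 1, j],       # vertical jump
                            acc[i, j - 1],       # horizontal jump
                            acc[i - 1, j - 1])   # diagonal jump
            acc[i, j] = dist[i, j] + best_prev
    return acc, int(np.argmin(acc[-1, :]))
```

The minimum of the last row gives the detection score for the audio file; backtracking from that cell would recover the matched segment.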

DTW Modifications


4 additional approaches:

(A2) – Cutting up to 250 ms at the end of the query, keeping the segment above 500 ms

(A3) – Cutting up to 250 ms at the beginning of the query, keeping the segment above 500 ms

Query vs. Audio posterior distance matrix (top) and the best path from A2 (bottom)
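The cut limits of (A2) and (A3) reduce to a small piece of arithmetic on the query length. The sketch below assumes 10 ms frames and reads "above 500 ms" as "at least 500 ms"; both are assumptions, and `max_cut_frames` is a hypothetical helper name.

```python
def max_cut_frames(n_frames, frame_ms=10, max_cut_ms=250, min_keep_ms=500):
    """Largest number of frames that may be cut from one end of a query.

    frame_ms is an assumed frame step; the remaining segment must keep
    at least min_keep_ms, and no more than max_cut_ms is ever removed.
    """
    spare_ms = n_frames * frame_ms - min_keep_ms
    cut_ms = max(0, min(max_cut_ms, spare_ms))
    return cut_ms // frame_ms

print(max_cut_frames(100))  # 1 s query: full 250 ms cut allowed
print(max_cut_frames(60))   # 600 ms query: only 100 ms may be cut
print(max_cut_frames(40))   # 400 ms query: no cut
```

One possible use for (A2) is to let the DTW path end at any of the last `max_cut_frames` rows of the accumulated matrix rather than only at the final row; for (A3) the same allowance would apply to the first rows.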

DTW Modifications


(A4) – Allowing one jump in the path of up to ½ of the query’s length,

cannot occur in the initial and final 250 ms of the query

cannot occur for queries shorter than 800 ms

Figure: Query vs. Audio (two panels)
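The restrictions on the jump in (A4) can be expressed as a helper that reports where in the query a jump may occur and how long it may be. This is only a sketch of the constraints, assuming 10 ms frames; `jump_constraints` is a hypothetical name, and the code does not implement the modified DTW search itself.

```python
def jump_constraints(n_frames, frame_ms=10, guard_ms=250, min_query_ms=800):
    """Return (first_row, last_row, max_jump) in frames, or None.

    None means the query is too short for a jump to be allowed.
    first_row/last_row bound the query frames where the jump may occur,
    excluding the initial and final guard_ms. frame_ms is an assumption.
    """
    if n_frames * frame_ms < min_query_ms:
        return None
    guard = guard_ms // frame_ms
    max_jump = n_frames // 2          # up to half of the query's length
    return guard, n_frames - guard - 1, max_jump
```

For a 1 s query (100 frames) this gives rows 25 to 74 and a maximum jump of 50 frames.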


DTW Modifications


(A5) – Swaps: accounting for re-ordering of words.

Backtrack the best 5 candidates from (A1) from the end,

Find the best path for the beginning of the query, ahead of the end of the first one, with restrictions similar to (A4).

Figure: Query vs. Audio (two panels)


Fusing systems


Different approaches:

Minimum of the approaches – not the best.

Harmonic mean found to be a good compromise.

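Per-query normalization (standard score): $z = (x - \mu_q) / \sigma_q$, where $\mu_q$ and $\sigma_q$ are the mean and standard deviation of the scores obtained for query $q$.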

Different languages:

Arithmetic mean of the 3 scores.
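The fusion steps listed above can be sketched as follows. The order of operations and the quantities being fused are assumptions: the code takes the harmonic mean of the (positive) DTW distances from the different approaches, negates it so that higher means better, applies the standard score per query, and then averages the three languages. The function names are hypothetical.

```python
import numpy as np

def harmonic_mean(dists):
    """Harmonic mean across approaches; dists: (n_approaches, n_docs), all > 0."""
    return dists.shape[0] / np.sum(1.0 / dists, axis=0)

def standard_score(scores):
    """Per-query z-normalization over all audio documents."""
    return (scores - scores.mean()) / scores.std()

def fuse(dists_per_language):
    """Fuse one query's results.

    dists_per_language: list of (n_approaches, n_docs) distance arrays,
    one per language. Returns one score per audio document.
    """
    per_lang = [standard_score(-harmonic_mean(d)) for d in dists_per_language]
    return np.mean(per_lang, axis=0)   # arithmetic mean over languages
```

The harmonic mean always lies between the minimum and the arithmetic mean of its inputs, which is consistent with the slide calling it a compromise.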

Submissions and Results


Primary: fusing (A1) and (A2) (basic and cutting the end)

Late: fusing the 5 approaches.

Late provided worse overall results

Metric                  primary            late
Cnxe, MinCnxe - Dev     0.6797, 0.5438     0.7106, 0.5881
Cnxe, MinCnxe - Eval    0.6588, 0.5080     0.6708, 0.5240
ATWV, MTWV - Dev        0.4494, 0.4494     0.4051, 0.4052
ATWV, MTWV - Eval       0.4399, 0.4423     0.3918, 0.4218

Submissions and Results (cont.)

Cnxe for isolated approaches on Eval:

A1: 0.6823, A2: 0.6721, A3: 0.6947, A4: 0.6957, A5: 0.6999

For Type 3 queries, the late system was better:

Cnxe improved from 0.8049 (primary) to 0.7865 (late)


Conclusions


Although this year’s task has an added difficulty, a simple DTW still works well for most cases.

Cutting queries at the end (A2) proved to be the best single strategy, and fusing it with A1 was even better.

Including the possibility of jumps and re-orders increased False Positives overall, since these special cases are a small part of the database.

We lacked an optimization method for Cnxe, which would greatly improve the results.

END – Thank You


Processing Speed:

Hardware – CRAY CX1 Cluster, running Windows Server 2008 HPC, using 16 of 56 cores (7 nodes with dual Intel Xeon 5520 2.27 GHz quad-core and 24 GB RAM per node).

Indexing Speed Factor – 1.4

Searching Speed Factor – 0.0029 per sec and per language

Peak Memory – 0.098 GB

Software
