GTM-UVigo Systems for the Query-by-Example
Search on Speech Task at MediaEval 2015
Paula López Otero, Laura Docío Fernández, Carmen García Mateo
López Otero et al. | GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015 1/9
Main contributions
Neural networks for phoneme posteriorgram extraction
Phoneme unit selection
Neural networks
We used two Kaldi ASR recipes for phoneme posteriorgram extraction:
LSTM → everything went fine
DNN → very slow and highly memory-consuming!

System  minCnxe (Dev)                 ISF   PMUI
        GA     ES     EN     CZ
LSTM    0.895  0.879  0.915  0.904    0.34  0.22
DNN     0.898  0.897  0.915  0.922    2.93  6
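As a reminder of what is being extracted: a phoneme posteriorgram has one row per frame and one column per phoneme unit, and each row holds the posterior probabilities of the units for that frame. The sketch below is only an illustration, not the Kaldi recipe. It assumes the network produces unnormalized per-frame scores, and the names `softmax`, `posteriorgram` and `toy_logits` are hypothetical.

```python
import math

def softmax(logits):
    """Turn one frame of network outputs into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def posteriorgram(frame_logits):
    """Stack the per-frame posteriors into a T x U matrix (list of lists)."""
    return [softmax(frame) for frame in frame_logits]

# Hypothetical network outputs: 3 frames, 4 phoneme units.
toy_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.2, 1.5, 0.3, 0.1],
    [0.1, 0.2, 0.1, 3.0],
]
pg = posteriorgram(toy_logits)

# Most likely unit per frame, read directly from the posteriorgram rows.
best_units = [row.index(max(row)) for row in pg]
```

Each row of `pg` sums to one, so queries and documents can be compared frame by frame regardless of which network (LSTM or DNN) produced the scores.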
Neural networks: Untangling the DNN recipe

System                  minCnxe  ISF   PMUI
Initial version         0.897    2.63  6
Phoneme network         0.896    9.96  1.73
One ASR pass            0.896    4.48  1.73
No lattice determinize  0.895    1.66  2.04
No fMLLR                0.867    0.62  0.48

What works for ASR doesn’t always work for QbESTD
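To put the first and last rows of the table side by side, the following short sketch computes the relative reduction of each metric between the initial version and the "No fMLLR" version. The numbers are copied from the table. The helper `relative_reduction` is a hypothetical name introduced only for this illustration.

```python
# Values copied from the table above: (minCnxe, ISF, PMUI).
initial = (0.897, 2.63, 6.0)
no_fmllr = (0.867, 0.62, 0.48)

def relative_reduction(before, after):
    """Relative reduction in percent; positive means the value decreased."""
    return 100.0 * (before - after) / before

names = ("minCnxe", "ISF", "PMUI")
reductions = {
    name: relative_reduction(b, a)
    for name, b, a in zip(names, initial, no_fmllr)
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower than the initial version")
```

With these figures, minCnxe drops by about 3%, ISF by about 76% and PMUI by about 92%.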
Phoneme unit selection
Phoneme unit selection: Cross-lingual search on speech
[Figure: Venn diagram of the Spanish and English phoneme sets. Shared phonemes include /a/ /e/ /i/ /o/ /u/ /b/ /d/ /g/ /p/ /t/ /k/ /m/ /n/ /s/ /tʃ/ /j/ /l/ /f/; the English-only side includes /ð/ /dʒ/ /h/ /ŋ/ /θ/ /r/ /v/ /w/ /z/ /æ/ /i:/ /a:/ /u:/; the Spanish-only side includes /x/.]
Many phonemes are not common to both languages ⇒ Do they really contribute somehow?
But we are working with unknown languages! ⇒ automatic selection of the most suitable phonetic units
Phoneme unit selection
Computation of the best alignment path $P(Q, D)$ of length $K$
Every step of the path has a cost $c_{i,j}$
Relevance of phoneme $\alpha$: $R(P(Q, D), \alpha) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k,j_k}(\alpha)$
Relevance of phoneme $\beta$: $R(P(Q, D), \beta) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k,j_k}(\beta)$
Relevance of phoneme $\gamma$: $R(P(Q, D), \gamma) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k,j_k}(\gamma)$
$R(P(Q, D), \gamma) > R(P(Q, D), \alpha) > R(P(Q, D), \beta)$
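The relevance computation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the system's implementation. It uses plain full-sequence DTW (a real QbE-STD search would typically use a subsequence variant), and it assumes the local cost is a squared difference per phoneme unit, so that $c_{i,j}$ is the sum of the per-unit terms $c_{i,j}(u)$. All function names and the toy posteriorgrams are hypothetical.

```python
def local_costs(q_frame, d_frame):
    """Per-unit cost terms c_{i,j}(u) for one pair of frames.

    Assumption: squared difference per phoneme unit, so the frame-level
    cost c_{i,j} is the sum of these terms.
    """
    return [(q - d) ** 2 for q, d in zip(q_frame, d_frame)]

def best_path(query, doc):
    """Plain DTW over full sequences; returns the path as (i, j) pairs."""
    n, m = len(query), len(doc)
    acc = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = sum(local_costs(query[i], doc[j]))
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            # Allowed predecessors: vertical, horizontal, diagonal.
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            prev_cost, prev = min(candidates)
            acc[i][j] = cost + prev_cost
            back[i][j] = prev
    # Trace back from the last cell to the first one.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return path

def relevance(query, doc):
    """R(P(Q, D), u) = (1/K) * sum over the path of c_{i_k,j_k}(u)."""
    path = best_path(query, doc)
    totals = [0.0] * len(query[0])
    for i, j in path:
        for u, c in enumerate(local_costs(query[i], doc[j])):
            totals[u] += c
    return [t / len(path) for t in totals]

# Toy posteriorgrams over three units (alpha, beta, gamma).
units = ["alpha", "beta", "gamma"]
query = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
doc = [[0.7, 0.0, 0.3], [0.2, 0.8, 0.0], [0.1, 0.8, 0.1]]

path = best_path(query, doc)
scores = dict(zip(units, relevance(query, doc)))
ranking = sorted(units, key=scores.get, reverse=True)
```

In this toy example the path has length $K = 3$ and the resulting ordering is gamma, alpha, beta, matching the shape of the inequality on the slide.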
Phoneme unit selection: Performance using different phoneme posteriorgrams
[Figure: minCnxe (y-axis, 0.88 to 1) versus number of phoneme units (x-axis, 20 to 80) for the posteriorgrams CZtraps, HUtraps, RUtraps, CZdnn, CZlstm, GAdnn, GAlstm, ESdnn, ESlstm, ENdnn and ENlstm.]