Comparing Audio Signals
What makes it difficult?
• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance
Review: Minimum Distance Algorithm
        E  X  E  C  U  T  I  O  N
     0  1  2  3  4  5  6  7  8  9
I    1  1  2  3  4  5  6  6  7  8
N    2  2  2  3  4  5  6  7  7  7
T    3  3  3  3  4  5  5  6  7  8
E    4  3  4  3  4  5  6  6  7  8
N    5  4  4  4  4  5  6  7  7  7
T    6  5  5  5  5  5  5  6  7  8
I    7  6  6  6  6  6  6  5  6  7
O    8  7  7  7  7  7  7  6  5  6
N    9  8  8  8  8  8  8  7  6  5
Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}
Pseudo Code (minDistance(target, source))
n = characters in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n
    distance[r,0] = r
FOR c=0 TO m
    distance[0,c] = c
FOR each row r=1 TO n
    FOR each column c=1 TO m
        IF source[r]=target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of (
            distance[r-1,c] + 1,       //deletion
            distance[r,c-1] + 1,       //insertion
            distance[r-1,c-1] + cost)  //substitution
Result is in distance[n,m]
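The pseudo code can be sketched in Python roughly as follows. This is a minimal illustration, assuming unit costs for every edit as in the table; the function name min_distance and the 0-based string indexing are choices made for this sketch.

```python
def min_distance(target: str, source: str) -> int:
    """Minimum edit distance between source and target with unit costs."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of turning the first r source characters
    # into the first c target characters
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r
    for c in range(m + 1):
        distance[0][c] = c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            # Python strings are 0-based, so character r is source[r - 1]
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # step down one row
                distance[r][c - 1] + 1,         # step right one column
                distance[r - 1][c - 1] + cost,  # diagonal step
            )
    return distance[n][m]


print(min_distance("EXECUTION", "INTENTION"))  # prints 5
```

The printed value matches the bottom-right cell of the table.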
Is Minimum Distance Applicable?
• Maybe?
  – The optimal distance at indices [a,b] is a function of the costs at smaller indices.
  – This suggests that a dynamic programming approach may work.
• Problems
  – The cost function is more complex. A binary equal-or-not-equal test doesn’t work.
  – We need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
  – Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.
Complexity of Minimum Distance
• The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other. If m=n, the algorithm is O(n^2). Why? Count the number of cells that need to be filled in.
• O(n^2) may be too slow. Alternate solutions have been devised.
  – Don’t fill in all of the cells.
  – Use a multi-level approach.
• Question: Are the faster approaches needed for our purposes? Perhaps not!
Don’t Fill in all of the Cells
Problem: May miss the optimal minimum distance path
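One way to skip cells is to fill in only a band around the main diagonal. The sketch below applies that idea to the string version of the algorithm. The band parameter and the function name banded_min_distance are assumptions made for illustration, since the slides do not say which cells to skip.

```python
import math


def banded_min_distance(target: str, source: str, band: int):
    """Edit distance computed only for cells with |row - column| <= band."""
    n, m = len(source), len(target)
    # Cells outside the band keep the value infinity and are never filled in
    distance = [[math.inf] * (m + 1) for _ in range(n + 1)]
    distance[0][0] = 0
    for r in range(n + 1):
        for c in range(max(0, r - band), min(m, r + band) + 1):
            if r == 0 and c == 0:
                continue
            best = math.inf
            if r > 0:
                best = min(best, distance[r - 1][c] + 1)
            if c > 0:
                best = min(best, distance[r][c - 1] + 1)
            if r > 0 and c > 0:
                cost = 0 if source[r - 1] == target[c - 1] else 1
                best = min(best, distance[r - 1][c - 1] + cost)
            distance[r][c] = best
    # The result is infinity if the band is too narrow to reach the last cell
    return distance[n][m]


# The best alignment of these strings needs a shift of two positions
print(banded_min_distance("cdefghij", "abcdefgh", band=2))  # prints 4
print(banded_min_distance("cdefghij", "abcdefgh", band=1))  # prints 8
```

With band=1 the optimal path lies outside the filled cells, so the answer is too large. This is the problem noted above.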
The Multilevel Approach
Concept
1. Down sample to coarsen the array
2. Run the algorithm
3. Refine the array (up sample)
4. Adjust the solution
5. Repeat steps 3-4 until the original sample rate is restored
Notes
• The multilevel approach is a common technique for reducing the complexity of many algorithms from O(n^2) to O(n lg n)
• An example is partitioning a graph to balance workloads among threads or processors
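The five steps can be sketched as a coarse-to-fine alignment of two one-dimensional sequences. This is only an illustration under several assumptions: the local distance is the absolute difference between samples, down sampling averages pairs of samples, and "adjust the solution" means re-running the alignment inside a small neighborhood (radius) of the coarse path. All function names are hypothetical.

```python
def constrained_dtw(a, b, window=None):
    """Align a and b using only the cells in window; returns (cost, path)."""
    n, m = len(a), len(b)
    if window is None:
        window = {(i, j) for i in range(n) for j in range(m)}
    total, back = {}, {}
    for i, j in sorted(window):  # row by row, so predecessors come first
        d = abs(a[i] - b[j])
        if i == 0 and j == 0:
            total[(i, j)] = d
            continue
        candidates = [(total[p], p)
                      for p in ((i - 1, j), (i - 1, j - 1), (i, j - 1))
                      if p in total]
        if not candidates:
            continue  # cell cannot be reached inside the window
        best, prev = min(candidates)
        total[(i, j)] = d + best
        back[(i, j)] = prev
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    path.reverse()
    return total[(n - 1, m - 1)], path


def coarsen(x):
    """Step 1: halve the length by averaging neighboring samples."""
    return [sum(x[k:k + 2]) / len(x[k:k + 2]) for k in range(0, len(x), 2)]


def multilevel_dtw(a, b, radius=1, min_size=4):
    if len(a) <= min_size or len(b) <= min_size:
        return constrained_dtw(a, b)  # step 2 at the coarsest level
    _, coarse_path = multilevel_dtw(coarsen(a), coarsen(b), radius, min_size)
    # Step 3: project each coarse cell onto the finer grid, plus a margin
    window = set()
    for ci, cj in coarse_path:
        for i in range(2 * ci - radius, 2 * ci + 2 + radius):
            for j in range(2 * cj - radius, 2 * cj + 2 + radius):
                if 0 <= i < len(a) and 0 <= j < len(b):
                    window.add((i, j))
    # Step 4: adjust the solution by re-aligning inside the window only
    return constrained_dtw(a, b, window)


a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
cost, path = multilevel_dtw(a, a)
print(cost)  # prints 0
```

Because only a window of cells is examined at each level, the result can be larger than the true minimum, but never smaller.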
Singularities
• Assumption
  – The minimum distance comparing two signals only depends on the previous adjacent entries
  – The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due to mismatched alignments
• Possible solutions:
  – Compare based on the change of feature values between windows instead of the values themselves
  – Pre-process to eliminate the causes of the mismatches
Possible Preprocessing
• Remove the phase from the audio:
  – Compute the Fourier transform
  – Perform discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
  – Compute the energy of both signals
  – Scale the higher-energy signal down so that the two energies match
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick Wall Normalize the peaks and valleys:
  – Find the average peak and valley value
  – Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Auto correlate frames at pitch points
• Remove noise from the signal: implement a noise removal algorithm
• Normalize the speed of the speech:
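Two of the simpler steps, DC offset removal and energy normalization, might look like the sketch below. It assumes each signal is a plain list of sample amplitudes and that energy means the sum of squared samples; the function names are hypothetical.

```python
def remove_dc_offset(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def energy(samples):
    """Energy taken as the sum of squared amplitudes (an assumption)."""
    return sum(s * s for s in samples)


def match_energy(a, b):
    """Scale the higher-energy signal down so both energies are equal."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return list(a), list(b)  # nothing sensible to scale
    if ea > eb:
        gain = (eb / ea) ** 0.5  # square root: energy grows with amplitude squared
        return [s * gain for s in a], list(b)
    gain = (ea / eb) ** 0.5
    return list(a), [s * gain for s in b]


loud = remove_dc_offset([3.0, 1.0, -1.0])  # mean is 1.0
quiet = [1.0, 0.0, -1.0]
loud, quiet = match_energy(loud, quiet)
print(loud)  # prints [1.0, 0.0, -1.0]
```

The gain is the square root of the energy ratio, because scaling the amplitudes by g scales the energy by g squared.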
Which Audio Features?
• Cepstrals: They are statistically independent and phase differences are removed
• ΔCepstrals or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
• Energy: Distinguishes the frames that are voiced versus those that are unvoiced
• Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers.
These are the popular features used for speech recognition
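As a rough illustration of the Δ features, the sketch below takes the simple difference between consecutive frames. It assumes each frame is a list of feature values (for example cepstral coefficients) that has already been computed; practical systems often smooth the difference over several frames, which is not shown here.

```python
def deltas(frames):
    """Frame-to-frame change of every feature (one fewer frame than the input)."""
    return [
        [cur - prev for cur, prev in zip(frames[t], frames[t - 1])]
        for t in range(1, len(frames))
    ]


frames = [[1, 2], [3, 5], [6, 9]]  # made-up values, two features per frame
delta = deltas(frames)
delta_delta = deltas(delta)
print(delta)        # prints [[2, 3], [3, 4]]
print(delta_delta)  # prints [[1, 1]]
```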
Which Distance Metric?
• General Formula:
  array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance Formula:
  – Euclidean: sum the squares of the differences between corresponding features
  – Linear: sum the absolute values of the differences between corresponding features
• Weighting the features: Multiply each metric’s difference by a weighting factor to give greater/lesser emphasis to certain features
• Example of a distance metric using linear distance:
  ∑ w[i] · |fa[i] - fb[i]|, where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
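Putting the general formula and the weighted linear distance together gives a sketch like the following. It assumes each signal is a list of frames and each frame a list of feature values; the extra row and column of infinities are an implementation convenience, and the function names are hypothetical.

```python
def weighted_linear_distance(fa, fb, weights):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of two frames."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))


def signal_distance(a, b, weights):
    """Accumulated distance between two frame sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # Row 0 and column 0 are padding so that array[i-1] and array[j-1] always exist
    array = [[inf] * (m + 1) for _ in range(n + 1)]
    array[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_linear_distance(a[i - 1], b[j - 1], weights)
            # No extra cost for duplicate or eliminated frames
            array[i][j] = d + min(array[i - 1][j],
                                  array[i - 1][j - 1],
                                  array[i][j - 1])
    return array[n][m]


a = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
b = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # first frame duplicated
print(signal_distance(a, b, weights=[1.0, 1.0]))  # prints 0.0
```

The duplicated frame in b adds nothing to the total, which is the assumption stated above.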