
Comparing Audio Signals

What makes it difficult?

• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance

Review: Minimum Distance Algorithm

         E  X  E  C  U  T  I  O  N
      0  1  2  3  4  5  6  7  8  9
   I  1  1  2  3  4  5  6  6  7  8
   N  2  2  2  3  4  5  6  7  7  7
   T  3  3  3  3  4  5  5  6  7  8
   E  4  3  4  3  4  5  6  6  7  8
   N  5  4  4  4  4  5  6  7  7  7
   T  6  5  5  5  5  5  5  6  7  8
   I  7  6  6  6  6  6  6  5  6  7
   O  8  7  7  7  7  7  7  6  5  6
   N  9  8  8  8  8  8  8  7  6  5

Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}

Pseudo Code (minDistance(target, source))

n = characters in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n
    distance[r,0] = r
FOR c=0 TO m
    distance[0,c] = c
FOR each row r=1 TO n
    FOR each column c=1 TO m
        IF source[r]=target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of
            distance[r-1,c] + 1,        //deletion
            distance[r,c-1] + 1,        //insertion
            distance[r-1,c-1] + cost    //substitution
Result is in distance[n,m]
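The pseudo code can be sketched in Python as follows. This is a minimal illustration, assuming unit cost for every edit and 0-based strings (hence the `r - 1` and `c - 1` when reading characters); the name `min_distance` and its argument order are choices made for this example.

```python
def min_distance(source: str, target: str) -> int:
    """Edit distance with unit cost for every insertion, deletion, and substitution."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of turning source[:r] into target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r
    for c in range(m + 1):
        distance[0][c] = c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # skip one source character
                distance[r][c - 1] + 1,         # skip one target character
                distance[r - 1][c - 1] + cost,  # match or substitute
            )
    return distance[n][m]


print(min_distance("intention", "execution"))  # 5, the bottom-right cell of the table
```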

Is Minimum Distance Applicable?

• Maybe?
  – The optimal distance at indices [a,b] is a function of the costs at smaller indices.
  – This suggests that a dynamic programming approach may work.
• Problems
  – The cost function is more complex. A binary equal-or-not-equal test doesn’t work.
  – Need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
  – Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.

Complexity of Minimum Distance

• The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other. If m=n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.

• O(n²) may be too slow. Alternate solutions have been devised.
  – Don’t fill in all of the cells.
  – Use a multi-level approach.

• Question: Are the faster approaches needed for our purposes? Perhaps not!

Don’t Fill in all of the Cells

Problem: May miss the optimal minimum-distance path
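One way to skip cells is to fill only a diagonal band of the array. The sketch below is an illustration under that assumption (the slide does not say which cells are skipped); `banded_distance` and its `band` parameter are hypothetical names. It also shows the problem: when the band is too narrow, the best path lies outside it and the result is too large.

```python
import math


def banded_distance(source: str, target: str, band: int) -> float:
    """Edit distance that only fills cells with |r - c| <= band."""
    n, m = len(source), len(target)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for r in range(n + 1):
        # Cells outside the band stay at infinity and are never chosen.
        for c in range(max(0, r - band), min(m, r + band) + 1):
            if r == 0 and c == 0:
                continue
            best = math.inf
            if r > 0:
                best = min(best, d[r - 1][c] + 1)
            if c > 0:
                best = min(best, d[r][c - 1] + 1)
            if r > 0 and c > 0:
                cost = 0 if source[r - 1] == target[c - 1] else 1
                best = min(best, d[r - 1][c - 1] + cost)
            d[r][c] = best
    return d[n][m]


print(banded_distance("xxabcdef", "abcdefyy", band=2))  # 4, the true distance
print(banded_distance("xxabcdef", "abcdefyy", band=1))  # 8, the optimal path was cut off
```

The two strings only line up when one is shifted by two positions, so a band of width 1 never sees a matching pair.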

The Multilevel Approach

Concept
1. Downsample to coarsen the array
2. Run the algorithm
3. Refine the array (upsample)
4. Adjust the solution
5. Repeat steps 3-4 until the original sample rate is restored

Notes
• The multilevel approach is a common technique for reducing many algorithms’ complexity from O(n²) to O(n lg n)
• An example is partitioning a graph to balance workloads among threads or processors
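Steps 1 and 3 might look like the sketch below. It assumes coarsening by averaging adjacent pairs and a path stored as a list of (row, column) cells; both are choices made for this illustration, and `coarsen` and `refine_path` are hypothetical helper names. Step 4 would then rerun the algorithm restricted to the cells that `refine_path` returns.

```python
def coarsen(signal):
    """Halve the resolution by averaging adjacent pairs (a trailing odd sample is kept)."""
    pairs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    if len(signal) % 2:
        pairs.append(signal[-1])
    return pairs


def refine_path(coarse_path, n, m):
    """Map each coarse cell (i, j) to the fine cells it covers, clipped to an n x m array."""
    window = set()
    for i, j in coarse_path:
        for r in (2 * i, 2 * i + 1):
            for c in (2 * j, 2 * j + 1):
                if r < n and c < m:
                    window.add((r, c))
    return window


print(coarsen([1, 3, 5, 7]))                        # [2.0, 6.0]
print(sorted(refine_path([(0, 0), (1, 1)], 4, 3)))  # fine cells to search next
```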

Singularities

• Assumption
  – The minimum distance comparing two signals depends only on the previous adjacent entries
  – The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined

• Problem: The algorithm can compute incorrectly due to mismatched alignments

• Possible solutions:
  – Compare based on the change of feature values between windows instead of the values themselves
  – Pre-process to eliminate the causes of the mismatches

Possible Preprocessing

• Remove the phase from the audio:
  – Compute the Fourier transform
  – Perform discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
  – Compute the energy of both signals
  – Multiply the larger by the percentage difference
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick Wall Normalize the peaks and valleys:
  – Find the average peak and valley value
  – Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Autocorrelate frames at pitch points
• Remove noise from the signal: Implement a noise removal algorithm
• Normalize the speed of the speech:
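Two of the simpler steps, DC offset removal and energy normalization, can be sketched as below. The sketch assumes that energy means the sum of squared samples, so the amplitude gain is the square root of the energy ratio; the function names are hypothetical.

```python
import math


def remove_dc_offset(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def energy(samples):
    """Signal energy, taken here as the sum of squared samples (an assumption)."""
    return sum(s * s for s in samples)


def match_energy(a, b):
    """Scale the higher-energy signal down so both signals have equal energy."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return list(a), list(b)  # a silent signal cannot be matched by scaling
    if ea > eb:
        gain = math.sqrt(eb / ea)
        return [s * gain for s in a], list(b)
    gain = math.sqrt(ea / eb)
    return list(a), [s * gain for s in b]


print(remove_dc_offset([1, 2, 3]))       # [-1.0, 0.0, 1.0]
print(match_energy([2, -2], [1, -1]))    # ([1.0, -1.0], [1, -1])
```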

Which Audio Features?

• Cepstrals: They are statistically independent and phase differences are removed

• ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next

• Energy: Distinguishes the frames that are voiced versus those that are unvoiced

• Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers.

These are the popular features used for speech recognition
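The Δ features can be illustrated with a simple first difference between consecutive frames. This is a sketch under that assumption (practical systems often fit a slope over several frames instead); `delta` is a hypothetical helper and each frame is a list of feature values.

```python
def delta(frames):
    """First difference between each pair of consecutive feature vectors."""
    return [
        [cur - prev for prev, cur in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 7.0]]
print(delta(frames))          # [[1.0, 2.0], [2.0, 3.0]]
print(delta(delta(frames)))   # [[1.0, 1.0]], the delta-delta features
```

Each application shortens the sequence by one frame, which a comparison of two signals would need to account for.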

Which Distance Metric?

• General Formula:
  array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance Formula:
  – Euclidean: sum the squares of the differences between features, then take the square root
  – Linear: sum the absolute values of the differences between features
• Weighting the features: Multiply each metric’s difference by a weighting factor to give greater/lesser emphasis to certain features
• Example of a distance metric using linear distance:
  ∑ w[i] |fa[i] – fb[i]|, where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
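The general formula and the weighted linear distance can be combined in a short sketch. It assumes each signal is a non-empty list of frames and each frame is a list of feature values; `weighted_linear_distance` and `align_cost` are hypothetical names chosen for this example.

```python
def weighted_linear_distance(fa, fb, weights):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of two frames."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb))


def align_cost(a, b, weights):
    """Fill array[i][j] = distance(i, j) + min of the three previous neighbours.

    Assumes both a and b contain at least one frame.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    array = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = weighted_linear_distance(a[i], b[j], weights)
            if i == 0 and j == 0:
                prev = 0.0  # the first cell has no predecessor
            else:
                prev = min(
                    array[i - 1][j] if i > 0 else inf,
                    array[i - 1][j - 1] if i > 0 and j > 0 else inf,
                    array[i][j - 1] if j > 0 else inf,
                )
            array[i][j] = d + prev
    return array[n - 1][m - 1]


a = [[1.0], [2.0], [3.0]]
b = [[1.0], [2.0], [2.0], [3.0]]   # same shape as a, with one frame duplicated
print(align_cost(a, b, [1.0]))     # 0.0, the duplicate frame costs nothing
```

Because no step cost is added for moving down or right, the duplicated frame in `b` is absorbed for free, which is the assumption stated on the slide.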
