Page 1

Comparing Audio Signals

• Phase misalignment
• Deeper peaks and valleys
• Pitch misalignment
• Energy misalignment
• Embedded noise
• Length of vowels
• Phoneme variance

What makes it difficult?

Page 2

Review: Minimum Distance Algorithm

          E  X  E  C  U  T  I  O  N
       0  1  2  3  4  5  6  7  8  9
    I  1  1  2  3  4  5  6  6  7  8
    N  2  2  2  3  4  5  6  7  7  7
    T  3  3  3  3  4  5  5  6  7  8
    E  4  3  4  3  4  5  6  6  7  8
    N  5  4  4  4  4  5  6  7  7  7
    T  6  5  5  5  5  5  5  6  7  8
    I  7  6  6  6  6  6  6  5  6  7
    O  8  7  7  7  7  7  7  6  5  6
    N  9  8  8  8  8  8  8  7  6  5

Array[i,j] = min{1 + Array[i-1,j], cost(i,j) + Array[i-1,j-1], 1 + Array[i,j-1]}

Page 3

Pseudo Code (minDistance(target, source))

n = characters in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n
    distance[r,0] = r
FOR c=0 TO m
    distance[0,c] = c
FOR each row r from 1 TO n
    FOR each column c from 1 TO m
        IF source[r]=target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of (
            distance[r-1,c] + 1,        //deletion
            distance[r,c-1] + 1,        //insertion
            distance[r-1,c-1] + cost)   //substitution
Result is in distance[n,m]
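The pseudo code can be made runnable with very little change. The following is a minimal Python sketch, assuming unit costs for insertion, deletion, and substitution, as in the review table; the function name `min_distance` simply mirrors the slide's `minDistance`.

```python
def min_distance(target: str, source: str) -> int:
    """Minimum edit distance between source and target (unit costs)."""
    n, m = len(source), len(target)
    # distance[r][c] = cost of turning source[:r] into target[:c]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n + 1):
        distance[r][0] = r          # delete r source characters
    for c in range(m + 1):
        distance[0][c] = c          # insert c target characters
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = 0 if source[r - 1] == target[c - 1] else 1
            distance[r][c] = min(
                distance[r - 1][c] + 1,         # deletion
                distance[r][c - 1] + 1,         # insertion
                distance[r - 1][c - 1] + cost,  # substitution
            )
    return distance[n][m]


print(min_distance("EXECUTION", "INTENTION"))  # 5, the bottom-right cell of the review table
```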

Page 4

Is Minimum Distance Applicable?

• Maybe?
– The optimal distance at indices [a,b] is a function of the costs at smaller indices.
– This suggests that a dynamic programming approach may work.
• Problems
– The cost function is more complex. A binary equal or not equal doesn’t work.
– Need to define a distance metric. But what should that metric be? Answer: It depends on which audio features we use.
– Longer vowels may still represent the same speech. The classical solution is not to apply a cost when going from index [i-1,j] or [i,j-1] to [i,j]. Unfortunately, this assumption can lead to singularities, which result in incorrect comparisons.

Page 5

Complexity of Minimum Distance

• The basic algorithm is O(m*n), where m is the length (in samples) of one audio signal and n is the length of the other. If m=n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.

• O(n²) may be too slow. Alternate solutions have been devised.
– Don’t fill in all of the cells.
– Use a multi-level approach.

• Question: Are the faster approaches needed for our purposes? Perhaps not!

Page 6

Don’t Fill in all of the Cells

Problem: May miss the optimal minimum-distance path
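One common way to skip cells is to fill only a diagonal band of the array. The sketch below assumes one-dimensional feature values, an absolute-difference frame cost, and a hypothetical `band` parameter giving the half-width of the band; none of these choices come from the slides. It illustrates the problem stated above: the narrow band returns a larger distance than the full array does.

```python
import math


def banded_dtw(a, b, band):
    """Warping distance that only fills cells with |i - j| <= band."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - band)           # first column inside the band
        hi = min(m, i + band)           # last column inside the band
        for j in range(lo, hi + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


a = [0, 0, 0, 0, 1, 1]
b = [0, 1, 1, 1, 1, 1]
print(banded_dtw(a, b, band=len(a)))  # 0.0: the full array finds a perfect alignment
print(banded_dtw(a, b, band=1))       # 2.0: the narrow band misses it
```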

Page 7

The Multilevel Approach

Concept
1. Down sample to coarsen the array
2. Run the algorithm
3. Refine the array (up sample)
4. Adjust the solution
5. Repeat steps 3-4 till the original sample rate is restored

Notes
• The multilevel approach is a common technique for reducing many algorithms’ complexity from O(n²) to O(n lg n)
• An example is partitioning a graph to balance work loads among threads or processors
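The five steps can be sketched as a recursion, similar in spirit to the published FastDTW scheme. Everything specific here is an assumption made for illustration: coarsening by averaging adjacent pairs, one-dimensional features with an absolute-difference cost, and the hypothetical parameters `radius` (how far around the coarse path to search) and `min_size` (when to stop coarsening).

```python
import math


def coarsen(x):
    """Step 1: halve the resolution by averaging adjacent pairs."""
    return [sum(x[i:i + 2]) / len(x[i:i + 2]) for i in range(0, len(x), 2)]


def project(path, n, m, radius=1):
    """Step 3: expand a coarse path into the fine-grid cells worth evaluating."""
    window = set()
    for i, j in path:
        for ci in range(i - radius, i + radius + 1):
            for cj in range(j - radius, j + radius + 1):
                for fi in (2 * ci, 2 * ci + 1):
                    for fj in (2 * cj, 2 * cj + 1):
                        if 0 <= fi < n and 0 <= fj < m:
                            window.add((fi, fj))
    return window


def dtw_in_window(a, b, window):
    """Steps 2 and 4: warping restricted to `window`; returns (cost, path)."""
    total, back = {}, {}
    for i, j in sorted(window):     # row-major order: predecessors come first
        d = abs(a[i] - b[j])
        if (i, j) == (0, 0):
            total[(0, 0)] = d
            continue
        prev = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda c: total.get(c, math.inf))
        if prev in total:           # skip cells unreachable inside the window
            total[(i, j)] = d + total[prev]
            back[(i, j)] = prev
    cell = (len(a) - 1, len(b) - 1)
    path = [cell]
    while cell != (0, 0):           # walk the back-pointers to recover the path
        cell = back[cell]
        path.append(cell)
    path.reverse()
    return total[path[-1]], path


def multilevel_dtw(a, b, radius=1, min_size=4):
    """Step 5: recurse until the signals are small, then refine level by level."""
    n, m = len(a), len(b)
    if n <= min_size or m <= min_size:
        full = {(i, j) for i in range(n) for j in range(m)}
        return dtw_in_window(a, b, full)
    _, coarse_path = multilevel_dtw(coarsen(a), coarsen(b), radius, min_size)
    return dtw_in_window(a, b, project(coarse_path, n, m, radius))
```

Because each level only searches near the path found at the coarser level, the result is an approximation: it can never be smaller than the distance found by filling the whole array, but it may be larger.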

Page 8

Singularities

• Assumption
– The minimum distance comparing two signals only depends on the previous adjacent entries
– The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due to mismatched alignments
• Possible solutions:
– Compare based on the change of feature values between windows instead of the values themselves
– Pre-process to eliminate the causes of the mismatches
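The first possible solution, comparing changes rather than values, can be illustrated with a first difference between consecutive windows. This is only a sketch under the assumption of one feature value per window; the helper name `deltas` is hypothetical.

```python
def deltas(frames):
    """Change in a feature value from each window to the next."""
    return [nxt - cur for cur, nxt in zip(frames, frames[1:])]


quiet = [1, 3, 6, 4]
loud = [11, 13, 16, 14]              # same shape, shifted by a constant offset
print(deltas(quiet))                 # [2, 3, -2]
print(deltas(quiet) == deltas(loud)) # True: the offset no longer matters
```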

Page 9

Possible Preprocessing

• Remove the phase from the audio:
– Compute the Fourier transform
– Perform discrete cosine transform on the amplitudes
• Normalize the energy of voiced audio:
– Compute the energy of both signals
– Scale the larger signal down so that the two energies match
• Remove the DC offset: Subtract the average amplitude from all samples
• Brick Wall Normalize the peaks and valleys:
– Find the average peak and valley value
– Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Autocorrelate frames at pitch points
• Remove noise from the signal: Implement a noise removal algorithm
• Normalize the speed of the speech:
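Two of the simpler steps in this list, DC offset removal and energy normalization, can be sketched in a few lines. The function names are hypothetical, and the choice to scale amplitudes by the square root of the energy ratio is an assumption about how the energies are meant to be matched (energy is taken here as the sum of squared samples).

```python
import math


def remove_dc(samples):
    """Subtract the average amplitude from every sample."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def energy(samples):
    """Signal energy, taken here as the sum of squared samples."""
    return sum(s * s for s in samples)


def match_energy(a, b):
    """Scale the higher-energy signal down so both have equal energy."""
    ea, eb = energy(a), energy(b)
    if ea == 0 or eb == 0:
        return list(a), list(b)     # nothing sensible to scale against
    if ea > eb:
        gain = math.sqrt(eb / ea)   # amplitudes scale with the square root
        return [s * gain for s in a], list(b)
    gain = math.sqrt(ea / eb)
    return list(a), [s * gain for s in b]
```

For example, `remove_dc([1, 2, 3])` gives `[-1.0, 0.0, 1.0]`, and `match_energy([2, 4], [1, 1])` scales the first signal so that both have an energy of about 2.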

Page 10

Which Audio Features?

• Cepstrals: They are statistically independent and phase differences are removed

• ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next

• Energy: Distinguishes the frames that are voiced versus those that are unvoiced

• Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers.

These are the popular features used for speech recognition
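To see why cepstral features discard phase, here is a small sketch of the textbook real cepstrum: the inverse transform of the log magnitude spectrum. It is not necessarily the exact feature pipeline these slides have in mind. A slow pure-Python DFT is used only to keep the example self-contained, and `eps` is a hypothetical guard against taking the log of zero.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a short frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(x, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum; the phase is thrown away."""
    n = len(x)
    log_mag = [math.log(abs(c) + eps) for c in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


frame = [0.9 ** t for t in range(32)]
shifted = frame[5:] + frame[:5]     # circular time shift: only the phase changes
same = all(abs(p - q) < 1e-9
           for p, q in zip(real_cepstrum(frame), real_cepstrum(shifted)))
print(same)  # True
```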

Page 11

Which Distance Metric?

• General Formula: array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance Formula:
– Euclidean: sum the squared differences between corresponding features
– Linear: sum the absolute values of the differences between corresponding features
• Weighting the features: Multiply each metric’s difference by a weighting factor to give greater/lesser emphasis to certain features
• Example of a distance metric using linear distance: ∑ w[i] · |fa[i] – fb[i]|, where fa[i] and fb[i] are the values of a particular audio feature for signals a and b, and w[i] is that feature’s weight
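The general formula and the weighted linear distance fit together as follows. This is a minimal sketch, assuming each frame is a list of feature values and that both signals use the same feature order; the function names and the sample weights are made up for illustration.

```python
def weighted_linear_distance(fa, fb, w):
    """Sum of w[i] * |fa[i] - fb[i]| over the features of two frames."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, fa, fb))


def signal_distance(frames_a, frames_b, w):
    """array[i,j] = distance(i,j) + min of the three adjacent earlier cells."""
    n, m = len(frames_a), len(frames_b)
    inf = float("inf")
    array = [[inf] * (m + 1) for _ in range(n + 1)]
    array[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_linear_distance(frames_a[i - 1], frames_b[j - 1], w)
            # No extra cost for duplicate or eliminated frames, only the frame distance
            array[i][j] = d + min(array[i - 1][j], array[i - 1][j - 1], array[i][j - 1])
    return array[n][m]


weights = [1.0, 0.5]                              # hypothetical feature weights
long_vowel = [[0, 0], [1, 1], [1, 1], [2, 2]]     # middle frame is duplicated
short_vowel = [[0, 0], [1, 1], [2, 2]]
print(signal_distance(long_vowel, short_vowel, weights))  # 0.0
```

The duplicated frame costs nothing, which is the stated assumption at work: a longer vowel is still treated as the same speech.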