Automatic speech recognition What is the task? What are the main difficulties? How is it approached? How good is it? How much better could it be? 2/34


Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?


What is the task?
• Getting a computer to understand spoken language
• By “understand” we might mean
  • React appropriately
  • Convert the input speech into another medium, e.g. text
• Several variables impinge on this


How do humans do it?

Articulation produces sound waves which the ear conveys to the brain for processing


How might computers do it?

• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation

[Figure labels: Acoustic waveform; Acoustic signal; Speech recognition]


Basic Block Diagram

[Block diagram with blocks: Speech Recognition (MATLAB); Parallel Port Control (P2 to P9); ATmega32 microcontroller; LCD Display]


What’s hard about that?
• Digitization
  • Converting analogue signal into digital representation
• Signal processing
  • Separating speech from background noise
• Phonetics
  • Variability in human speech
• Phonology
  • Recognizing individual sound distinctions (similar phonemes)
• Lexicology and syntax
  • Disambiguating homophones
  • Features of continuous speech
• Syntax and pragmatics
  • Interpreting prosodic features
• Pragmatics
  • Filtering of performance errors (disfluencies)


Digitization
• Analogue to digital conversion
• Sampling and quantizing
• Use filters to measure energy levels for various points on the frequency spectrum
• Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
  • E.g. high frequency sounds are less informative, so can be sampled using a broader bandwidth (log scale)
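Sampling and quantizing can be illustrated with a minimal sketch. The 440 Hz test tone, the 8 kHz sampling rate and the 8-bit resolution are assumptions chosen for illustration, not values from the slides.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, n_bits):
    """Sample a continuous signal (a function of time in seconds with values
    in [-1, 1]) and round each sample to one of 2**n_bits uniform levels."""
    n_levels = 2 ** n_bits
    step = 2.0 / (n_levels - 1)              # spacing between adjacent levels
    n_samples = round(duration_s * sample_rate_hz)
    codes = []
    for n in range(n_samples):
        x = signal(n / sample_rate_hz)       # sampling: evaluate at t = n / fs
        x = max(-1.0, min(1.0, x))           # clip to the converter's input range
        codes.append(round((x + 1.0) / step))  # quantizing: nearest level index
    return codes, step

def reconstruct(codes, step):
    """Map integer codes back to amplitudes in [-1, 1]."""
    return [c * step - 1.0 for c in codes]

def tone(t):
    """Hypothetical 440 Hz test tone standing in for a speech signal."""
    return math.sin(2 * math.pi * 440.0 * t)

codes, step = sample_and_quantize(tone, 0.01, 8000, 8)
approx = reconstruct(codes, step)
```

Each reconstructed sample differs from the original by at most half a quantization step, which is the price paid for storing only `n_bits` per sample.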


Separating speech from background noise
• Noise cancelling microphones
  • Two mics, one facing speaker, the other facing away
  • Ambient noise is roughly same for both mics
• Knowing which bits of the signal relate to speech
  • Spectrograph analysis


Variability in individuals’ speech
• Variation among speakers due to
  • Vocal range (f0, and pitch range – see later)
  • Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc)
  • ACCENT !!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
  • Health, emotional state
  • Ambient conditions
• Speech style: formal read vs spontaneous


Speaker-(in)dependent systems
• Speaker-dependent systems
  • Require “training” to “teach” the system your individual idiosyncrasies
    • The more the merrier, but typically nowadays 5 or 10 minutes is enough
    • User asked to pronounce some key words which allow computer to infer details of the user’s accent and voice
    • Fortunately, languages are generally systematic
  • More robust
  • But less convenient
  • And obviously less portable
• Speaker-independent systems
  • Language coverage is reduced to compensate for the need to be flexible in phoneme identification
  • Clever compromise is to learn on the fly


(Dis)continuous speech
• Discontinuous speech much easier to recognize
  • Single words tend to be pronounced more clearly
• Continuous speech involves contextual coarticulation effects
  • Weak forms
  • Assimilation
  • Contractions


Performance errors
• Performance “errors” include
  • Non-speech sounds
  • Hesitations
  • False starts, repetitions

Filtering implies handling at syntactic level or above

Some disfluencies are deliberate and have pragmatic effect – this is not something we can handle in the near future


Template-based approach
• Store examples of units (words, phonemes), then find the example that most closely fits the input

Extract features from speech signal, then it’s “just” a complex similarity matching problem, using solutions developed for all sorts of applications

OK for discrete utterances, and a single user


Template-based approach
• Hard to distinguish very similar templates
• And quickly degrades when input differs from templates
• Therefore needs techniques to mitigate this degradation:
  • More subtle matching techniques
  • Multiple templates which are aggregated

Taken together, these suggested …


Neural Network based approach


Statistics-based approach
• Collect a large corpus of transcribed speech recordings
• Train the computer to learn the correspondences (“machine learning”)
• At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one


Statistics-based approach
• Acoustic and Lexical Models
• Analyse training data in terms of relevant features
• Learn from large amount of data different possibilities
  • different phone sequences for a given word
  • different combinations of elements of the speech signal for a given phone/phoneme
• Combine these into a Hidden Markov Model expressing the probabilities


• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning


SPEECH RECOGNITION BLOCK DIAGRAM


BLOCK DIAGRAM DESCRIPTION


Speech Acquisition Unit
• It consists of a microphone to obtain the analog speech signal
• The acquisition unit also consists of an analog to digital converter

Speech Recognition Unit
• This unit is used to recognize the words contained in the input speech signal.
• The speech recognition is implemented in MATLAB with the help of a template matching algorithm

Device Control Unit
• This unit consists of a microcontroller, the ATmega32, to control the various appliances
• The microcontroller is connected to the PC via the PC parallel port
• The microcontroller then reads the input word and controls the device connected to it accordingly.


SPEECH RECOGNITION

[Block diagram: Digitized speech X(n) → End Point Detection → XF(n) → Feature Extraction → MFCC → Dynamic Time Warping → Recognized word]


END-POINT DETECTION


• The accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum by processing only the parts of the input corresponding to speech.

• We will use the endpoint detection algorithm proposed by Rabiner and Sambur. This algorithm is based on two simple time-domain measurements of the signal: the energy and the zero crossing rate.

The algorithm should tackle the following cases:
1. Words which begin with or end with a low energy phoneme
2. Words which end with a nasal
3. Speakers ending words with a trailing off in intensity or short breath
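The two time-domain measurements can be sketched as follows. This is an illustration only: it assumes non-overlapping frames and uses the sum of sample magnitudes as the "energy" measure (a sum of squares is another common choice). The function names and the toy signal are hypothetical.

```python
def short_time_magnitude(frame):
    """Energy-like measure: sum of absolute sample values in the frame."""
    return sum(abs(x) for x in frame)

def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count

def frame_measurements(samples, frame_len):
    """Return one (energy, zero_crossings) pair per non-overlapping frame."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append((short_time_magnitude(frame), zero_crossing_count(frame)))
    return out

# Toy signal: one frame of silence followed by one frame of a fast oscillation.
silence = [0.0] * 8
burst = [0.5, -0.5] * 4
measurements = frame_measurements(silence + burst, frame_len=8)
```

Low-energy sounds such as weak fricatives may barely raise the energy measure but usually raise the zero crossing count, which is why the two measurements are used together.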


Steps for EPD


• Removal of noise by subtracting the noise values from the signal values
• Word extraction steps use three thresholds:
  1. ITU [Upper energy threshold]
  2. ITL [Lower energy threshold]
  3. IZCT [Zero crossing rate threshold]
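A hedged sketch of how the three thresholds might be derived from a stretch of leading silence. The constants (0.03, 4, 5, a cap of 25 crossings, and two standard deviations) are the values commonly quoted for the Rabiner and Sambur method; they are assumptions here and should be checked against the original paper. The function name and inputs are hypothetical.

```python
import statistics

def endpoint_thresholds(silence_energy, silence_zcr, peak_energy, zcr_cap=25):
    """Compute (ITU, ITL, IZCT) from per-frame measurements.

    silence_energy / silence_zcr: values measured on frames assumed to be
    background silence; peak_energy: largest frame energy in the recording.
    """
    imn = statistics.mean(silence_energy)   # background (silence) energy
    imx = peak_energy                       # peak energy
    i1 = 0.03 * (imx - imn) + imn           # 3% of the way from silence to peak
    i2 = 4.0 * imn                          # four times the silence energy
    itl = min(i1, i2)                       # lower energy threshold
    itu = 5.0 * itl                         # upper energy threshold
    izct = min(zcr_cap,
               statistics.mean(silence_zcr) + 2 * statistics.pstdev(silence_zcr))
    return itu, itl, izct

itu, itl, izct = endpoint_thresholds([1.0, 1.0, 1.0], [2, 2, 2], peak_energy=101.0)
```

In this scheme a word is first located where energy exceeds ITU, the boundaries are widened outwards while energy stays above ITL, and IZCT is then used to extend them over low-energy sounds with a high zero crossing rate.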


Feature Extraction
• Input data to the algorithm is usually too large to be processed
• Input data is highly redundant
• Raw analysis requires high computational power and large amounts of memory
• Thus, the redundancies are removed and the data is transformed into a set of features
• DCT based Mel Cepstrum


DCT Based MFCC

• Take the Fourier transform of a signal.

• Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.

• Take the logs of the powers at each of the mel frequencies.

• Take the discrete cosine transform of the list of mel log powers, as if it were a signal.

• The MFCCs are the amplitudes of the resulting spectrum.
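The steps above can be sketched for a single frame using only the standard library. This is an illustration, not the deck's MATLAB implementation: the mel formula (2595 log10(1 + f/700)), the 20 filters, the unnormalised DCT-II, the small floor before the logarithm and the omission of windowing and pre-emphasis are all assumptions.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: squared magnitude of the DFT for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(abs(s) ** 2)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 2: triangular overlapping windows equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]  # fractional bins
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= centre:
                row.append((k - left) / (centre - left))    # rising edge
            elif centre < k < right:
                row.append((right - k) / (right - centre))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Step 3: log of the power in each mel band (floored to avoid log(0)).
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    # Steps 4 and 5: DCT-II of the log powers; the outputs are the MFCCs.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(n_coeffs)]

# Hypothetical test frame: 64 samples of a 1 kHz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(64)]
coeffs = mfcc(frame, 8000)
```

The naive DFT keeps the sketch dependency-free; a practical implementation would use an FFT and apply a window to each frame first.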


MFCC Computation

As the log magnitude is real and symmetric, the IDFT reduces to the DCT. The DCT produces highly uncorrelated features y_t^(m)(k). The zero-order MFCC coefficient y_t^(0)(k) is approximately equal to the log energy of the frame.

The number of MFCC coefficients chosen was 13.


Feature extraction by MFCC Processing


Dynamic Time Warping and Minimum Distance Paths measurement

Isolated word recognition:

• Task:

• Want to build an isolated word recogniser

• Method:

1. Record, parameterise and store vocabulary of reference words.

2. Record test word to be recognised and parameterize.

3. Measure distance between test word and each reference word.

4. Choose reference word ‘closest’ to test word.
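The four-step method can be sketched as a nearest-reference search. The distance function is passed in as a parameter; the vocabulary, the feature values and the simple equal-length distance used here are hypothetical placeholders.

```python
def recognise(test_features, references, distance):
    """Return the reference word whose stored features are closest to the test.

    references: dict mapping each vocabulary word to its feature sequence.
    distance:   function(test_features, reference_features) -> number.
    """
    return min(references, key=lambda word: distance(test_features, references[word]))

def equal_length_distance(a, b):
    """Placeholder distance for sequences with the same number of frames."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Step 1: stored vocabulary (made-up one-dimensional features).
references = {"open": [1, 2, 3], "close": [5, 5, 5]}
# Step 2: parameterised test word.
test_word = [1, 2, 4]
# Steps 3 and 4: measure distances and choose the closest reference.
best = recognise(test_word, references, equal_length_distance)
```

Keeping the distance measure pluggable means the same recogniser works unchanged once a measure that copes with different utterance lengths is substituted.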


Words are parameterised on a frame-by-frame basis

Choose frame length, over which speech remains reasonably stationary

Overlap frames e.g. 40ms frames, 10ms frame shift

We want to compare frames of test and reference words i.e. calculate distances between them
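A minimal sketch of the framing step, using the 40 ms frame length and 10 ms frame shift mentioned above. The 8 kHz sampling rate and the function name are assumptions.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=40, shift_ms=10):
    """Cut a list of samples into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    shift = sample_rate_hz * shift_ms // 1000       # samples between frame starts
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, shift)]

# One second of placeholder samples at an assumed 8 kHz sampling rate.
one_second = list(range(8000))
frames = split_into_frames(one_second, 8000)
```

At 8 kHz this gives 320-sample frames that start every 80 samples, so consecutive frames share three quarters of their samples.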

[Figure: overlapping frames, labelled 40ms and 20ms]


Calculating Distances

• Easy:

Sum differences between corresponding frames

• Hard:

Number of frames won’t always correspond


• Solution 1: Linear Time Warping

Stretch shorter sound

• Problem?

Some sounds stretch more than others
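Linear time warping can be sketched by resampling the shorter sequence with linear interpolation and then summing frame differences. Frames are treated as single numbers here, which is a simplifying assumption; the function names are hypothetical.

```python
def linear_time_warp(frames, target_len):
    """Stretch (or shrink) a sequence uniformly to target_len frames."""
    if target_len == 1 or len(frames) == 1:
        return [frames[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (len(frames) - 1) / (target_len - 1)  # position in the original
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append((1 - frac) * frames[lo] + frac * frames[hi])
    return out

def linear_warp_distance(test, reference):
    """Stretch both sequences to the longer length, then sum differences."""
    n = max(len(test), len(reference))
    a = linear_time_warp(test, n)
    b = linear_time_warp(reference, n)
    return sum(abs(x - y) for x, y in zip(a, b))

stretched = linear_time_warp([4, 7, 4], 5)
distance = linear_warp_distance([5, 3, 9, 7, 3], [4, 7, 4])
```

Every part of the sound is stretched by the same factor, which is exactly the weakness noted above: in real speech some sounds lengthen far more than others.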


• Solution 2: Dynamic Time Warping (DTW)

Test:      5 3 9 7 3

Reference: 4 7 4

Using a dynamic alignment, make most similar frames correspond

Find distances between two utterances using these corresponding frames


Dynamic Programming


Waveforms showing the utterance of the word “Open” at two different instants. The signals are not time aligned.


DTW Process

Place distance between frame r of Test and frame c of Reference in cell(r,c) of distance matrix

Test value  Test frame    Ref 1   Ref 2   Ref 3
    3           5          1 x     4 x     1 x
    7           4          3 x     0 x     3 x
    9           3          5 x     2 x     5 x
    3           2          1 x     4 x     1 x
    5           1          1 x     2 x     1 x
Reference frame            1       2       3
Reference value            4       7       4
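The DTW process can be sketched with dynamic programming on the slide's toy example (Test 5 3 9 7 3, Reference 4 7 4). The absolute difference as local distance matches the numbers in the matrix; the choice of three allowed moves (up, east, diagonal) with equal weights is an assumption.

```python
def dtw_distance(test, reference):
    """Return (total_distance, cumulative_matrix) for two 1-D sequences.

    cost[r][c] is the cheapest accumulated distance of any path that starts
    at cell (0, 0) and ends at cell (r, c), moving up, east or diagonally.
    """
    n, m = len(test), len(reference)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for r in range(n):
        for c in range(m):
            local = abs(test[r] - reference[c])   # distance matrix entry
            if r == 0 and c == 0:
                best_prev = 0
            else:
                best_prev = min(
                    cost[r - 1][c] if r > 0 else inf,                # from below
                    cost[r][c - 1] if c > 0 else inf,                # from the left
                    cost[r - 1][c - 1] if r > 0 and c > 0 else inf,  # diagonal
                )
            cost[r][c] = local + best_prev
    return cost[n - 1][m - 1], cost

total, cumulative = dtw_distance([5, 3, 9, 7, 3], [4, 7, 4])
```

For this example the cheapest alignment pairs test frames 1 and 2 with reference frame 1, frames 3 and 4 with reference frame 2, and frame 5 with reference frame 3, for a total distance of 5.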


Constraints

• Global
  • Endpoint detection
  • Path should be close to diagonal
• Local
  • Must always travel upwards or eastwards
  • No jumps
  • Slope weighting
  • Consecutive moves upwards/eastwards
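A hedged sketch of how the "close to diagonal" global constraint might be enforced: cells further than `max_offset` columns from the diagonal are simply never visited (this is often called a band constraint). The parameter name, the way the diagonal is defined for unequal lengths and the example sequences are assumptions; slope weighting is not modelled.

```python
def dtw_with_band(test, reference, max_offset):
    """DTW distance restricted to a band around the diagonal.

    Local constraints: each step moves up, east or diagonally by one cell,
    so there are no jumps. Returns infinity if no path fits in the band.
    """
    n, m = len(test), len(reference)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for r in range(n):
        # Column where the straight diagonal passes through row r.
        diag_c = r * (m - 1) / (n - 1) if n > 1 else 0.0
        for c in range(m):
            if abs(c - diag_c) > max_offset:
                continue                          # outside the band: not allowed
            local = abs(test[r] - reference[c])
            if r == 0 and c == 0:
                cost[r][c] = local
                continue
            best_prev = min(
                cost[r - 1][c] if r > 0 else inf,
                cost[r][c - 1] if c > 0 else inf,
                cost[r - 1][c - 1] if r > 0 and c > 0 else inf,
            )
            cost[r][c] = local + best_prev
    return cost[n - 1][m - 1]

# Hypothetical sequences whose best unconstrained alignment is far off-diagonal.
wide = dtw_with_band([1, 1, 1, 5], [1, 5, 5, 5], max_offset=10)
narrow = dtw_with_band([1, 1, 1, 5], [1, 5, 5, 5], max_offset=1)
```

With a wide band the two sequences align perfectly by stretching heavily; the narrow band forbids that much stretching and so reports a larger distance, which helps stop very different words from being warped into a false match.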


            SONY  SUVARNA  GEMINI  HBO  CNN  NDTV  IMAGINE  ZEE CINEMA
SONY           9        0       1    0    0     0        0           0
SUVARNA        0       10       0    0    0     0        0           0
GEMINI         0        0       8    0    0     0        2           0
HBO            0        0       0   10    0     0        0           0
CNN            0        0       0    0    8     0        2           0
NDTV           0        0       0    0    0    10        0           0
IMAGINE        0        0       0    0    0     0       10           0
ZEE CINEMA     0        0       0    0    0     0        1           9
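The table above can be summarised as an overall recognition rate. The sketch assumes each row corresponds to one target word and each column to the word that was recognized, which the slide does not state explicitly; the overall rate itself does not depend on that orientation.

```python
labels = ["SONY", "SUVARNA", "GEMINI", "HBO", "CNN", "NDTV", "IMAGINE", "ZEE CINEMA"]
matrix = [
    [9, 0, 1, 0, 0, 0, 0, 0],
    [0, 10, 0, 0, 0, 0, 0, 0],
    [0, 0, 8, 0, 0, 0, 2, 0],
    [0, 0, 0, 10, 0, 0, 0, 0],
    [0, 0, 0, 0, 8, 0, 2, 0],
    [0, 0, 0, 0, 0, 10, 0, 0],
    [0, 0, 0, 0, 0, 0, 10, 0],
    [0, 0, 0, 0, 0, 0, 1, 9],
]

def overall_accuracy(m):
    """Fraction of trials on the diagonal (recognized word equals target)."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def per_word_accuracy(m, names):
    """Diagonal count divided by the row total, for each word."""
    return {name: m[i][i] / sum(m[i]) for i, name in enumerate(names)}

accuracy = overall_accuracy(matrix)
by_word = per_word_accuracy(matrix, labels)
```

For this table 74 of the 80 trials lie on the diagonal, an overall rate of 92.5%.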


            SONY  SUVARNA  GEMINI  HBO  CNN  NDTV  IMAGINE  ZEE CINEMA
SONY           8        0       1    0    0     0        1           0
SUVARNA        0        8       0    0    0     0        0           2
GEMINI         1        0       8    0    0     0        1           0
HBO            0        0       0   10    0     0        0           0
CNN            1        0       0    0    8     0        2           0
NDTV           0        0       0    0    0    10        0           0
IMAGINE        0        0       0    0    0     0       10           0
ZEE CINEMA     0        2       0    0    0     0        0           8


Applications
• Medical Transcription
• Military
• Telephony and other domains
• Serving the disabled

Further Applications
• Home automation
• Automobile audio systems
• Telematics


Where from here?


Evolution of ASR

1975
• Noise environment: quiet room, fixed high-quality mic
• Speech style: careful reading
• User population: speaker-dep.
• Complexity: application-specific speech and language

1985
• Noise environment: normal office, various microphones, telephone
• Speech style: planned speech
• User population: speaker independent and adaptive
• Complexity: expert years to create app-specific language model

1995
• Noise environment: vehicle noise, radio, cell phones
• Speech style: natural human-machine dialog (user can adapt)
• User population: regional accents, native speakers, competent foreign speakers
• Complexity: some application-specific data and one engineer year

2015
• Noise environment: wherever speech occurs
• Speech style: all styles including human-human (unaware)
• User population: all speakers of the language including foreign
• Complexity: application independent or adaptive


Concluding remarks

[Figure panels: Recorded Speech; Noise Padded; Gain Adjustment; DC offset elimination; Spectral Subtraction; End Point Detection]
