

Dr. G. Bharadwaja Kumar

Indian Language Speech Processing

Speech recognition

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into the corresponding orthographic representation.

Speech recognition systems can be characterized by many parameters:

Parameter: Range
Speaking mode: Isolated words to continuous speech
Speaking style: Read speech to spontaneous speech
Enrollment: Speaker dependent to speaker independent
Vocabulary: Small (<20 words) to large (>20,000 words)
Language model: Finite state to context sensitive
Perplexity: Small (<10) to large (>100)
Signal-to-noise ratio (SNR): High (>30 dB) to low (<10 dB)
Transducer: Noise-cancelling microphone to telephone

Speaking Style

Read speech

– Planned or read speech may not contain disfluencies

– News

Spontaneous speech

– extemporaneously generated speech

– Disfluencies (hesitations and fillers)

– ‘less well-formed’ (or ungrammatical)

Vocabulary size

As the vocabulary increases, the number of input-template comparisons which must be made before a best match can be determined also increases.

Enrollment

Some systems require speaker enrollment -- a user must provide samples of his or her speech before using the system -- whereas other systems are said to be speaker-independent, in that no enrollment is necessary.

External parameters

In addition, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.

The modifications to pronunciation once isolated words are embedded in continuous speech include

Assimilation

Elision

Vowel reduction

Strong and weak forms

Liaison

Contractions

Juncture

Ref: http://cristiancuesta512.blogspot.in/

Source Channel Model

If A represents the acoustic feature sequence extracted from a speech sample, the speech recognition system should yield the word sequence W that best matches A:

W* = argmax_W P(W|A)

Using Bayes' rule, we can rewrite this as

P(W|A) = P(A|W) P(W) / P(A)

Here, P(A|W) is the likelihood of the feature sequence A given the acoustic model of word sequence W, and P(W) is the prior probability of the word sequence given by the language model. Since P(A) does not depend on W, it can be ignored during the maximization.
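A minimal sketch of this decision rule in Python, assuming the acoustic and language model scores for a small set of candidate word sequences are already available (the hypothesis strings, scores, and function names below are hypothetical, for illustration only):

```python
import math

def decode(hypotheses, acoustic_logprob, lm_logprob, lm_weight=1.0):
    """Pick the word sequence W maximizing log P(A|W) + lm_weight * log P(W).

    P(A) is the same for every hypothesis, so it drops out of the argmax.
    """
    best_w, best_score = None, -math.inf
    for w in hypotheses:
        score = acoustic_logprob[w] + lm_weight * lm_logprob[w]
        if score > best_score:
            best_w, best_score = w, score
    return best_w

# Hypothetical log-domain scores for two competing transcriptions.
acoustic = {"recognize speech": -120.0, "wreck a nice beach": -118.5}
lm = {"recognize speech": -8.0, "wreck a nice beach": -14.0}
print(decode(acoustic, acoustic, lm))   # -> "recognize speech"
```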

Pronunciation Lexicon

Provides pronunciations of words, so the decoder knows which HMMs to use for a given word.

Also provides a list of words, which limits the language model complexity and the decoder's search space.

As a result, an ASR system can only recognize the limited set of words present in the dictionary; this is normally known as closed-vocabulary speech recognition.
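As a rough illustration, a lexicon can be thought of as a mapping from words to phone sequences; the entries and phone symbols below are hypothetical, not taken from any particular dictionary:

```python
# A toy pronunciation lexicon: word -> list of phones (CMU-style symbols assumed).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def phones_for(word):
    """Return the phone sequence for a word, or None if it is out of vocabulary."""
    return LEXICON.get(word.lower())

for w in ["speech", "recognition", "sandhi"]:
    print(w, "->", phones_for(w) or "OOV (not in lexicon)")
```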

Out-of-vocabulary (OOV)

Words that appear in the test data but are not in the vocabulary, so their phonetic sequence is unknown to the recognizer.

They cannot be recognized and also affect the recognition accuracy of the surrounding in-vocabulary (IV) words.

Four challenges with OOV:

Detecting the presence of the word

Determining its location within the utterance

Recognizing the underlying phonetic sequence

Identifying the spelling of the word

Acoustic Modeling

Sampling: measuring amplitude of signal at time t

Microphone ("wideband"): 16,000 Hz (samples/sec)

Telephone: 8,000 Hz (samples/sec)

Why?
– Need at least 2 samples per cycle
– Max measurable frequency is half the sampling rate (the Nyquist frequency)
– Human speech < 10,000 Hz, so at most 20 kHz is needed
– Telephone speech is filtered at 4 kHz, so 8 kHz is enough
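A quick numeric check of the Nyquist argument above (plain arithmetic, no ASR toolkit assumed):

```python
def nyquist(sample_rate_hz):
    """Highest frequency that can be represented at a given sampling rate."""
    return sample_rate_hz / 2

for name, sr in [("wideband microphone", 16_000), ("telephone", 8_000)]:
    print(f"{name}: {sr} Hz sampling -> frequencies up to {nyquist(sr):.0f} Hz")
# Telephone channels are band-limited to about 4 kHz, so 8 kHz sampling suffices.
```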

Why Frequency Domain

The frequency content of a sound is one of its most important physical properties.

It can be easily observed by converting the signal from the time domain to the frequency domain using the FFT.

Cepstral coefficients are typically used in speech recognition to characterize spectral envelopes, capturing primarily the formants of speech.

Mobile Recorded Speech

Mel-Frequency Cepstral Coefficient (MFCC)

Most widely used spectral representation in ASR

Why is MFCC so popular?

Efficient to compute

Incorporates a perceptual Mel frequency scale

Separates the source and filter

IDFT (DCT) decorrelates the features
– Improves the diagonal covariance assumption in HMM modeling
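A minimal MFCC extraction sketch; librosa is an assumption here (the slides do not prescribe a toolkit), and the file path is hypothetical:

```python
import librosa

# Load wideband speech and compute 13 MFCCs per frame.
y, sr = librosa.load("utterance.wav", sr=16_000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # -> (13, number_of_frames)
```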

Acoustic Modeling

- Mporas, Iosif, et al. "Comparison of speech features on the speech recognition task." Journal of Computer Science 3.8 (2007): 608-616.

HMM/GMM Models

Approaches to Speaker Adaptation

Language models

Help a speech recognizer figure out how likely a word sequence is, independent of the acoustics.

Play a paramount role in guiding and constraining the search among the large number of alternative word hypotheses in continuous speech recognition.

Continuous speech recognition suffers from difficulties such as variation due to sentence structure (prosody), interaction between adjacent words (cross-word co-articulation), and the lack of clear acoustic markers to delineate word boundaries.

Play a vital role in resolving acoustic confusions that arise from co-articulation, assimilation and homophones during decoding.

Perplexity can be roughly interpreted as the average branching factor of the test data with respect to the language model.

Lower perplexity correlates with better recognition performance, since the recognizer has fewer branches to consider during decoding.
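A small sketch of how perplexity is computed from per-word model probabilities (the probabilities below are made up for illustration):

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical per-word probabilities assigned by a language model to a test sentence.
print(perplexity([0.1, 0.05, 0.2, 0.1]))   # -> 10.0, i.e. an average branching factor of about 10
```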

N-Gram Language Models

The intuition of the N-gram model is that, instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.
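For example, under a bigram (N = 2) approximation the probability of a sentence factorizes over adjacent word pairs; the counts below come from a tiny made-up corpus:

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [["<s>", "i", "like", "speech", "</s>"],
          ["<s>", "i", "like", "telugu", "speech", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent[:-1])
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# P(<s> i like speech </s>) ~ P(i|<s>) * P(like|i) * P(speech|like) * P(</s>|speech)
print(p_bigram("i", "<s>") * p_bigram("like", "i") *
      p_bigram("speech", "like") * p_bigram("</s>", "speech"))   # -> 0.5
```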

Smoothing Techniques

Add-1 smoothing or Good-Turing: OK for text categorization

For language modeling, the most commonly used method: extended interpolated Kneser-Ney

For very large N-grams like the Web: backoff
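Continuing the toy bigram example, a sketch of add-1 (Laplace) smoothing, shown only because it is the simplest of the methods above; Kneser-Ney is preferred for language modeling in practice:

```python
from collections import Counter

# Counts from the tiny made-up corpus used in the bigram sketch above.
unigrams = Counter({"<s>": 2, "i": 2, "like": 2, "speech": 2, "telugu": 1})
bigrams = Counter({("<s>", "i"): 2, ("i", "like"): 2, ("like", "speech"): 1,
                   ("like", "telugu"): 1, ("telugu", "speech"): 1, ("speech", "</s>"): 2})
V = len(unigrams) + 1  # vocabulary size, counting </s>

def p_add1(w, prev):
    """Add-1 smoothed estimate: (count(prev, w) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("speech", "like"))   # seen bigram: probability shrinks slightly
print(p_add1("water", "like"))    # unseen bigram: small but non-zero probability
```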

Domain Sensitivity

Language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained.

A language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period.

Rosenfeld, Ronald. "Two decades of statistical language modeling: Where do we go from here?." Proceedings of the IEEE 88.8 (2000): 1270-1278.

Given a background model P_B(w|h) and a topic-based model P_T(w|h), it is possible to obtain a final model P_I(w|h), to be used in the second decoding pass, as follows:
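The interpolation formula itself did not survive the text extraction; a common choice (an assumption here, not necessarily the exact form on the original slide) is linear interpolation with a weight λ:

P_I(w|h) = λ · P_B(w|h) + (1 − λ) · P_T(w|h)

```python
def interpolate(p_background, p_topic, lam=0.7):
    """Linearly interpolate background and topic LM probabilities; lam is a tunable weight (assumed here)."""
    return lam * p_background + (1 - lam) * p_topic

# Hypothetical probabilities for the same word and history under the two models.
print(interpolate(0.02, 0.08))   # -> 0.038
```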

Complexity

ASR systems often have complexity that is linear in the number of tokens and polynomial in the number of types (e.g., decoding using a trigram language model with a size-N vocabulary has, in the worst case, a complexity of at least O(N^3)).

-- Lin, Hui, and Jeff Bilmes. "Optimal selection of limited vocabulary speech corpora." Twelfth Annual Conference of the International Speech Communication Association. 2011.

Notable speech recognition software engines

Ref- https://en.wikipedia.org/wiki/List_of_speech_recognition_software

System Name    Open Source    Acoustic Modeling
CMU Sphinx     Yes            GMM/HMM
HTK            No             GMM/HMM
RWTH ASR       Yes            LSTM
Kaldi          Yes            Deep Neural Network
Julius         Yes            GMM/HMM

Challenges in Indian Language Speech Processing

Dravidian Languages

Major: Telugu, Tamil, Kannada, Malayalam

Very rich in morphology and complex Sandhi rules

Relatively free word order languages

Pronunciation Lexicon

Most of the Indian languages are phonetic in nature, i.e., there exists a one-to-one correspondence between the orthography and the pronunciation in these languages.

For Telugu, a simple rule-based G2P is enough (see the sketch below).

The Tamil script does not distinguish between voiced and voiceless stops and lacks separate symbols for voiced and aspirated stops.
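A minimal sketch of what such a rule-based G2P can look like, using a hypothetical romanized grapheme-to-phone table rather than real Telugu script rules:

```python
# Toy rule-based G2P for a largely phonetic orthography.
# The grapheme table is hypothetical and illustrative, not a real Telugu mapping.
G2P_TABLE = {"ka": ["k", "a"], "ta": ["t", "a"], "la": ["l", "a"], "mu": ["m", "u"]}

def g2p(word):
    """Greedy longest-match conversion of a romanized word into phones."""
    phones, i = [], 0
    while i < len(word):
        for length in (2, 1):                      # try longer graphemes first
            chunk = word[i:i + length]
            if chunk in G2P_TABLE:
                phones.extend(G2P_TABLE[chunk])
                i += length
                break
        else:
            phones.append(word[i])                 # fall back to the raw character
            i += 1
    return phones

print(g2p("kamula"))   # -> ['k', 'a', 'm', 'u', 'l', 'a']
```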

Morph Based Language Models

In two Finnish recognition tasks, relative error rate reductions between 12% and 31% are obtained.

Word fragments obtained using grammatical rules do not outperform fragments discovered automatically from text.

Hirsimäki, Teemu, et al. "Unlimited vocabulary speech recognition with morph language models applied to Finnish." Computer Speech & Language 20.4 (2006): 515-541.

Phoneme list

Tamil Phonetic Mapping

Grapheme to Phoneme Mapping Softwares

Sequitur G2P

https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html

Sequence-to-Sequence G2P toolkit

https://github.com/cmusphinx/g2p-seq2seq

Phonetisaurus G2P

https://github.com/AdolfVonKleist/Phonetisaurus

Morphology

Application of extensive Sandhi changes sometimes results in telescoping of several words into long strings.

English Sentence: Do you say that there is no hot water?

Telugu Sentence: vEdinILLu levu aNtavu A?

After Sandhi: vENNILLEvaNtAvA (one word)

– Reference: P. Bhaskara Rao, "Telugu", Concise Encyclopedia of Languages of the World, Elsevier, pp. 1055-1060.

Type-Token Analysis

G. Bharadwaja Kumar et al., "Statistical Analyses of Telugu Text Corpora", IJDL, Vol. 36, No. 2 (2007)

BNC Corpus for English (100 Million Word Corpus)

UoH Corpus for Telugu (40 Million Word Corpus)

One of the main problems with LVCSR systems is that the words spoken may not always exist within the system’s vocabulary.

These are called out-of-vocabulary (OOV) words.

This is a predominant problem for languages with very rich and complex morphology, such as the Dravidian languages.

Thank You

Presentation is available at

http://bharadwajakumar.wordpress.com
