Acoustic Space of Bangla Vowelswseas.us/e-library/conferences/2005corfu/c1/papers/498-459.pdfAcoustic Space of Bangla Vowels Abstract: - Acoustic space of Bangla Vowel based on articulatory

Acoustic Space of Bangla Vowels

M LUTFAR RAHMANDept of Computer Science andEngineeringUniversity of Dhaka, DhakaBANGLADESH

FARRUK AHMEDDept of Computer Science andEngineeringNorth South University, DhakaBANGLADESH

Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)

SYED AKHTER HOSSAINDept of Computer Science andEngineeringEast West University43 Mohakhali C/A, Dhaka 1212BANGLADESH

Abstract: - Acoustic space of Bangla Vowel based on articulatory properties of vocal tract plays a significant role inBangla speech synthesis and recognition. A vowel system is essentially a way of dividing up the "vowel space" intodistinct vowel phonemes. In general, languages make the most efficient use of the vowel space. This papersynthesizes vowel space illustration by graphically showing where a speech sound for Bangla language, such as avowel, is located in both "acoustic" and "articulatory" space of the vocal tract tube. Measurements of the spectralcharacteristics and the formant frequencies were made for each vowel from isolated Bangla word spoken by bothmale and female speakers. All these measurements are tested in synthesis of isolated utterance.

Key-Words:- Speech Processing, Formants, Energy, Voiced, Speech Synthesis, Phoneme

1 IntroductionSpeech signals are composed of a sequence of

sounds. These sounds and the transitions between themserve as a symbolic representation of information. Inspeech production process the sub-glottal systemcomposed of lungs, bronchi and trachea serves as asource of energy for the production of speech. Speechis simply the acoustic wave that is radiated from thissystem when air is expelled from the lungs and theresulting flow of air is perturbed by a constrictionsomewhere in the vocal tract [1,2,3].

Speech sounds are classified into three distinctclasses according to their mode of excitation. Voicedsounds are produced by forcing air through the glottiswith the tension of the vocal chords adjusted so thatthey vibrate in a relaxation oscillator, therebyproducing quasi-periodic pulses of air which excite thevocal tract. In English language, voiced segments arelabeled /U/, /d/, /w/, /i/ and /e/. Forming a constrictionat some point in the vocal tract, and forcing air throughthe constriction at a high enough velocity to produceturbulence generates fricative or unvoiced sounds.Plosive sounds result from making a complete closurein front of the vocal tract, building up pressure behindthe closure, and abruptly releasing it [1,2,3].

1.1 Vowel Space IllustrationVowels are produced by exciting a fixed vocal tractwith quasi-periodic pulses of air caused by vibration ofthe vocal chords. The resonant frequency of the tract(formants) varies with the cross sectional area and thedependence of cross-sectional area upon distance longthe tract, which is known as area function isdetermined primarily by the position of the tongue, butthe positions of the jaw, lips, and, to a small extent, thevelum also influences the resulting sound.

The chief characteristic of the vowels is the freedomwith which the air stream, once out of the glottis,passes through the speech organs. The supra-glottalresonators do not cut off or constrict the air; they causeonly resonance, that is to say, the reinforcement ofcertain frequency ranges. A vowel's timbre (or quality)depends on the following elements:

• the number of active resonators (among the oral,labial, and nasal cavities);

• the shape of the oral cavity;• the size of the oral cavity.

There are three possible resonators involved in thearticulation of a vowel: the oral cavity, the labialcavity, and the nasal cavity. If the soft palate is raised,

the air does not enter the nasal cavity, and passesexclusively through the oral cavity; if the soft palate islowered, however, air can pass through nose andmouth simultaneously. If the lips are pushed forwardand rounded, a third, labial resonator is formed; if, onthe other hand, the lips are spread sideways or pressedagainst the teeth, no labial resonator is formed.

The sound of a vowel is governed by many factors,of which the position of the tongue and rounding of thelips are the most important. The "vowel space" is thusdefined as the total area over which the tongue positionranges, along the two axes of height and backness.

Thus, each vowel sound can be characterized bythe vocal tract configuration (area function) that is usedin its production.

a gsu"avo(fotheThfoga

a ind--

2 Analysis Method

The speech data used in this study is captured in anoise free environment using the sound capture card inSpeech Filing System environment. For each utteranceof Bengali vowels, one of the most commonly usedwords were captured for different male and femalespeakers and the Table-1 shows the detail of therecording for each of the following utterance:

BengaliWord

Corresponding Vowel

Avg. Duration(msec)

/ / 132/ / 83

/ / 72

/ / 92


Fig.1 Vowel Space representation of English

The vowel space illustration shown above providesraphical method of showing where a speech sound,

ch as a vowel, is located in both "acoustic" andrticulatory" space. The illustration shows an acousticwel space based on the first two formants for vowelsrmants are the bands of energy that correspond to resonance of the vocal tract for particular shapes).e vertical axis represents the frequency of the first

rmant (F1). The horizontal axis shows the frequencyp between the first two formant (F2-F1)[2,5,6].

This 2-dimensional representation corresponds, tocertain degree, to tongue body position, withications of high vs. low and front vs. back positions

an articulatory space [3].

The digitized speech was filtered to remove anyadditional noise and was segmented visually to applyformant extract procedure through Fast FourierTransform (FFT) and Linear Predictive Coding (LPC).

There are many approaches to formant extraction,which are based on various (typically LPC-based)signal-processing techniques. Such approaches toformant extraction are normally fully automatedalthough provision is often made for intervention afterthe process has been completed especially editing offormant tracks [8,9,10].

A further limitation of most formant trackingprocedures is that formant gain and bandwidthextraction is either absent or very rudimentary. It isextremely difficult, for example, to extract meaningfulbandwidth information as the extracted values aredependant on both the actual formant bandwidth andalso on the extraction process (e.g. variations in thenumber of LPC coefficients result in variations inbandwidth).

It is also difficult to define gain in a useful way.Again, in LPC extraction procedures, the relative peakheight of the formants can vary with the number ofcoefficients used.

/ / 129/ / 115/ / 134/ / 172

/ / 208

/ / 169Table 1. Recording duration of utterance

Further, is it desirable for raw formant peak heightbe measured or is it more meaningful to measureformant gain after removing the effect of sourcespectral slope. For many applications are may not benecessary to accurately measure formant bandwidthand gain. For example, many speech analysis tasks ofrelevance to acoustic phonetic research only requireaccurate formant frequency information. Such taskstypically most often examine formants in vowels and(to a lesser extent) vowel-like consonants.

In vowels, gains and bandwidths are fairlypredictable (Fant, 1960). Vowel formant bandwidth isalso readily calculated utilizing a simple relationshipbetween formant center frequency and formantbandwidth [11,12,13].

Fig.2 Perceptual Linear Prediction

In Perceptual Linear Prediction method as shownin Fig. 2, the typical window length is 20ms. For 10kHz sampling frequency, 200 speech samples are used,padded with 56 zeros, hamming windowed, FFT andconverted to a power spectral density. After applyingthe pre-emphasis filtration, the IFFT was applied forcepstrum calculation and a suitable peak detectionalgorithm was used to determine formants.

For all the processing of speech samples, the firstthree formants are recorded and the correspondingformants are used to produce synthetic speech for theaccurate estimation of the formant frequencies. Allthese formants are grouped together to plot in two-dimensional space to produce vowel spacerepresentation considering the first two formants [5].

3 Results and Discussion

The speech corpus was conditioned and requiredpre emphasis and de-emphasis were carried out andlabeled to segment the phoneme boundary of interest.The extracted speech segment was taken for LinearPredictive Coding (LPC) scheme with 14 as number ofpoles. The maximum formant considered was 5000 Hzand the analysis width of the window was 25msec withan overlap of 10msec for the accurate resolution of theformants. Among the existing methods, the Burgmethod was applied and along with the formantextraction, the vocal tract response characteristics wereextracted for each of the phoneme in the vowel space.The results are postulated here in the following figures:

Fig.3 (a) Vowel Space representation of / / in

Fig

F

111111

F1 (Hz)0 3000

0

3000

22 222222

2

2222222

F1 (Hz)0 3000

0

3000

333333

F1 (Hz)0 3000

0

3000

4444444444444

F1 (Hz)0 3000

0

3000

555555555

F1 (Hz)0 3000

0

3000

0

428.6

857.1

1286

1714

2143

2571

3000

0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000

F1/F2 Formant Map

30003000300030003000F1/F2 Formant Map


. 3 (b) Vowel Space representation of / / in

11

F1 (Hz)0 3000

0

22222

F1 (Hz)0 3000

0

3

3

333333

F1 (Hz)0 3000

0

44

F1 (Hz)0 3000

00

428.6

857.1

1286

1714

2143

2571

0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000

400040004000400040004000F1/F2 Formant Map

ig.3 (c) Vowel Space representation of / / in

1111111111

F1 (Hz)100 3000

300

2222222222

F1 (Hz)100 3000

300

33333333

F1 (Hz)100 3000

300

4444444444

F1 (Hz)100 3000

300

5555

F1 (Hz)100 3000

300300

828.6

1357

1886

2414

2943

3471

100 363.636 627.273 890.909 1154.55 1418.18 1681.82 1945.45 2209.09 2472.73 2736.36 3000

F

F

400040004000400040004000F1/F2 Formant Map


Fig.3 (d) Vowel Space representation of / / in

i

F

The vowel space distribution of the vocal tractresponses for Bangla vowel utterance is shown in theabove figures from 3(a) through 3(f) which indicatesthe spectral property as well as the vocal tract responseduring the production of the vowel [4].

The vowel space representation indicates a rangefor the formants in order to be recognized or for thepurpose of synthesis but the vocal tract response asshown in case of / / and / / and a similar pair ofphoneme / / and / / seems very muchindistinguishable.

111111111111111111111111111111111

F1 (Hz)0 3000

0

2222222222222222

F1 (Hz)0 3000

0

33333333333333333

33

F1 (Hz)0 3000

0

4444444444444444444444

4

F1 (Hz)0 3000

0

555555555

F1 (Hz)0 3000

00

571.4

1143

1714

2286

2857

3429

0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000

400040004000400040004000F1/F2 Formant Map

g.3 (e) Vowel Space representation of / / in

4 ConclusionBengali Vowels synthesis and recognition with

respect to vowel space requires more methodical andcareful study of the linguistic phonetics. The vowelspace representation postulated in this paper is used toproduce synthetic vowel sounds. The synthesis resulthas shown good degree of correlation only with theexception of the phoneme boundary detection issues,which would require further enhancement of the

11

11

11111111111

111

F1 (Hz)0 3000

0

2222

22

2222222

F1 (Hz)0 3000

0

3333333333333333

F1 (Hz)0 3000

0

444444444444444

F1 (Hz)0 3000

0

555555

5

555

F1 (Hz)0 3000

00

571.4

1143

1714

2286

2857

3429

0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000

400040004000400040004000F1/F2 Formant Map

ig.3 (f) Vowel Space representation of / / in

algorithm used for the feature extraction.

References:[1] Rabiner L. R., Schafer R. W., "Digital Processing

of Speech Signals", Trentice-Hall Inc, EnglewoodCliffs, 1978

[2] John R Deller, John G Proakis, John H L Hansen,"Discrete-Time Processing of Speech Signals",Macmillan Publishing Company, 1993

[3] G.Fant, Acoustic Theory of Speech Production, ‘s-Gravenhage, The Netherlands:

11111111111111

1111

111

F1 (Hz)0 3000

0

2222222222222222222

F1 (Hz)0 3000

0

33333333333

33

F1 (Hz)0 3000

0

4444444444

444444444444

F1 (Hz)0 3000

0

5555555

55

F1 (Hz)0 3000

00

571.4

1143

1714

2286

2857

3429

0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000

ig.4 Combined Vowel Space representation of / /through / /

Mounton and Co., 1960.[4] Muhammad Abdul Hai, Dhvani Vijnan O Bangla

Dhvani-Tattwa, Mullick Brothers, 2000.[5] M. Berouti, R. Schwartz and J. Makhoul,

“Enhancement of speech corrupted by acousticnoise,” Proc. IEEE Int. Conf. on Acoust., Speech,Signal Procs., pp. 208-211, Apr. 1979.

[6] S.Blumstein and K.Stevens, “Acoustic invariancein speech production,” J. Acoust. Soc. Am., vol. 66,pp. 1001-1017, 1979.

[7] S. Blumstein and K. Stevens, “Perceptualinvariance and onset spectra for stopconsonants in different vowel environments,” J.Acoust. Soc. Am., vol. 67, pp. 648-662, 1980.

******

F1 (Hz)0 5000

0

5000

** **************

F1 (Hz)0 5000

0

5000

******

F1 (Hz)0 5000

0

5000

*************

F1 (Hz)0 5000

0

5000

F1 (Hz)0 5000

0

5000

F1 (Hz)0 5000

0

5000

F1 (Hz)0 5000

0

5000

F1 (Hz)0 5000

0

5000

eeeeeeeeee

F1 (Hz)0 5000

0

5000

eeeeeeeeee

F1 (Hz)0 5000

0

5000

eeeeeeee

F1 (Hz)0 5000

0

5000

eeeeeeeeee

F1 (Hz)0 5000

0

5000

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

F1 (Hz)0 5000

0

5000

aeae

aeae

aeaeaeaeaeaeaeaeaeaeaeaeaeae

F1 (Hz)0 5000

0

5000

aeaeaeae

aeae

aeaeaeaeaeaeae

F1 (Hz)0 5000

0

5000

oioioioioioioioioioioioioioioi

oioioioioioi

F1 (Hz)0 5000

0

5000

oooooooooooooooooooooooooooooooo

F1 (Hz)0 5000

0

5000

uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu

F1 (Hz)0 5000

0

5000

0

1000

2000

3000

4000

5000

0 1000 2000 3000 4000 5000

[8] S.Boll, “Suppression of acoustic noise in speechusing spectral subtraction,” IEEETrans. Acoust., Speech, Signal Process., vol.27,pp. 113-120, Apr. 1979.

[9] B. Gold and N. Morgan, Speech and audio signalprocessing, Wiley, 2000.

[10] S Akhter Hossain, M Lutfar Rahman , FarrukAhmed “Vowel Space Identification of BanglaSpeech”, Dhaka University Journal of Science,51(1): 31-38 2003(January)

[11] S Akhter Hossain, Md Faruk Ahmed, MozammelHuq Azad Khan, M A Sobhan, and Md LutfarRahman, “Analysis by Synthesis of Bangla

Vowels”, 5th International Conference onComputer and Iinformation TechnologyProceeding, 2002, pp. 272-276

[12] S Akhter Hossain, M A Sobhan, Mozammel HuqAzad Khan, “Acoustic Vowel Space of Bangla Speech”, International Conference onComputer and Iinformation Technology 2001Proceeding, pp. 312-316

[13] S Akhter Hossain & M Abdus Sobhan,“Fundamental Frequency Tracking of BanglaVoiced Speech” –1st National Conference onComputer and Information System Proceeding1997, pp. 302-306


Acoustic Space of Bangla Vowels1 Introduction1.1 Vowel Space Illustration2 Analysis Method3 Results and Discussion4 ConclusionReferences:

Documents

Acoustic Space of Bangla Vowelswseas.us/e-library/conferences/2005corfu/c1/papers/498-459.pdfAcoustic Space of Bangla Vowels Abstract: - Acoustic space of Bangla Vowel based on articulatory