Upload
others
View
18
Download
0
Embed Size (px)
Citation preview
Acoustic Space of Bangla Vowels
M LUTFAR RAHMANDept of Computer Science andEngineeringUniversity of Dhaka, DhakaBANGLADESH
FARRUK AHMEDDept of Computer Science andEngineeringNorth South University, DhakaBANGLADESH
Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)
SYED AKHTER HOSSAINDept of Computer Science andEngineeringEast West University43 Mohakhali C/A, Dhaka 1212BANGLADESH
Abstract: - Acoustic space of Bangla Vowel based on articulatory properties of vocal tract plays a significant role inBangla speech synthesis and recognition. A vowel system is essentially a way of dividing up the "vowel space" intodistinct vowel phonemes. In general, languages make the most efficient use of the vowel space. This papersynthesizes vowel space illustration by graphically showing where a speech sound for Bangla language, such as avowel, is located in both "acoustic" and "articulatory" space of the vocal tract tube. Measurements of the spectralcharacteristics and the formant frequencies were made for each vowel from isolated Bangla word spoken by bothmale and female speakers. All these measurements are tested in synthesis of isolated utterance.
Key-Words:- Speech Processing, Formants, Energy, Voiced, Speech Synthesis, Phoneme
1 IntroductionSpeech signals are composed of a sequence of
sounds. These sounds and the transitions between themserve as a symbolic representation of information. Inspeech production process the sub-glottal systemcomposed of lungs, bronchi and trachea serves as asource of energy for the production of speech. Speechis simply the acoustic wave that is radiated from thissystem when air is expelled from the lungs and theresulting flow of air is perturbed by a constrictionsomewhere in the vocal tract [1,2,3].
Speech sounds are classified into three distinctclasses according to their mode of excitation. Voicedsounds are produced by forcing air through the glottiswith the tension of the vocal chords adjusted so thatthey vibrate in a relaxation oscillator, therebyproducing quasi-periodic pulses of air which excite thevocal tract. In English language, voiced segments arelabeled /U/, /d/, /w/, /i/ and /e/. Forming a constrictionat some point in the vocal tract, and forcing air throughthe constriction at a high enough velocity to produceturbulence generates fricative or unvoiced sounds.Plosive sounds result from making a complete closurein front of the vocal tract, building up pressure behindthe closure, and abruptly releasing it [1,2,3].
1.1 Vowel Space IllustrationVowels are produced by exciting a fixed vocal tractwith quasi-periodic pulses of air caused by vibration ofthe vocal chords. The resonant frequency of the tract(formants) varies with the cross sectional area and thedependence of cross-sectional area upon distance longthe tract, which is known as area function isdetermined primarily by the position of the tongue, butthe positions of the jaw, lips, and, to a small extent, thevelum also influences the resulting sound.
The chief characteristic of the vowels is the freedomwith which the air stream, once out of the glottis,passes through the speech organs. The supra-glottalresonators do not cut off or constrict the air; they causeonly resonance, that is to say, the reinforcement ofcertain frequency ranges. A vowel's timbre (or quality)depends on the following elements:
• the number of active resonators (among the oral,labial, and nasal cavities);
• the shape of the oral cavity;• the size of the oral cavity.
There are three possible resonators involved in thearticulation of a vowel: the oral cavity, the labialcavity, and the nasal cavity. If the soft palate is raised,
the air does not enter the nasal cavity, and passesexclusively through the oral cavity; if the soft palate islowered, however, air can pass through nose andmouth simultaneously. If the lips are pushed forwardand rounded, a third, labial resonator is formed; if, onthe other hand, the lips are spread sideways or pressedagainst the teeth, no labial resonator is formed.
The sound of a vowel is governed by many factors,of which the position of the tongue and rounding of thelips are the most important. The "vowel space" is thusdefined as the total area over which the tongue positionranges, along the two axes of height and backness.
Thus, each vowel sound can be characterized bythe vocal tract configuration (area function) that is usedin its production.
a gsu"avo(fotheThfoga
a ind--
2 Analysis Method
The speech data used in this study is captured in anoise free environment using the sound capture card inSpeech Filing System environment. For each utteranceof Bengali vowels, one of the most commonly usedwords were captured for different male and femalespeakers and the Table-1 shows the detail of therecording for each of the following utterance:
BengaliWord
Corresponding Vowel
Avg. Duration(msec)
/ / 132/ / 83
/ / 72
/ / 92
Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)
Fig.1 Vowel Space representation of English
The vowel space illustration shown above providesraphical method of showing where a speech sound,
ch as a vowel, is located in both "acoustic" andrticulatory" space. The illustration shows an acousticwel space based on the first two formants for vowelsrmants are the bands of energy that correspond to resonance of the vocal tract for particular shapes).e vertical axis represents the frequency of the first
rmant (F1). The horizontal axis shows the frequencyp between the first two formant (F2-F1)[2,5,6].
This 2-dimensional representation corresponds, tocertain degree, to tongue body position, withications of high vs. low and front vs. back positions
an articulatory space [3].
The digitized speech was filtered to remove anyadditional noise and was segmented visually to applyformant extract procedure through Fast FourierTransform (FFT) and Linear Predictive Coding (LPC).
There are many approaches to formant extraction,which are based on various (typically LPC-based)signal-processing techniques. Such approaches toformant extraction are normally fully automatedalthough provision is often made for intervention afterthe process has been completed especially editing offormant tracks [8,9,10].
A further limitation of most formant trackingprocedures is that formant gain and bandwidthextraction is either absent or very rudimentary. It isextremely difficult, for example, to extract meaningfulbandwidth information as the extracted values aredependant on both the actual formant bandwidth andalso on the extraction process (e.g. variations in thenumber of LPC coefficients result in variations inbandwidth).
It is also difficult to define gain in a useful way.Again, in LPC extraction procedures, the relative peakheight of the formants can vary with the number ofcoefficients used.
/ / 129/ / 115/ / 134/ / 172
/ / 208
/ / 169Table 1. Recording duration of utterance
Further, is it desirable for raw formant peak heightbe measured or is it more meaningful to measureformant gain after removing the effect of sourcespectral slope. For many applications are may not benecessary to accurately measure formant bandwidthand gain. For example, many speech analysis tasks ofrelevance to acoustic phonetic research only requireaccurate formant frequency information. Such taskstypically most often examine formants in vowels and(to a lesser extent) vowel-like consonants.
In vowels, gains and bandwidths are fairlypredictable (Fant, 1960). Vowel formant bandwidth isalso readily calculated utilizing a simple relationshipbetween formant center frequency and formantbandwidth [11,12,13].
Fig.2 Perceptual Linear Prediction
In Perceptual Linear Prediction method as shownin Fig. 2, the typical window length is 20ms. For 10kHz sampling frequency, 200 speech samples are used,padded with 56 zeros, hamming windowed, FFT andconverted to a power spectral density. After applyingthe pre-emphasis filtration, the IFFT was applied forcepstrum calculation and a suitable peak detectionalgorithm was used to determine formants.
For all the processing of speech samples, the firstthree formants are recorded and the correspondingformants are used to produce synthetic speech for theaccurate estimation of the formant frequencies. Allthese formants are grouped together to plot in two-dimensional space to produce vowel spacerepresentation considering the first two formants [5].
3 Results and Discussion
The speech corpus was conditioned and requiredpre emphasis and de-emphasis were carried out andlabeled to segment the phoneme boundary of interest.The extracted speech segment was taken for LinearPredictive Coding (LPC) scheme with 14 as number ofpoles. The maximum formant considered was 5000 Hzand the analysis width of the window was 25msec withan overlap of 10msec for the accurate resolution of theformants. Among the existing methods, the Burgmethod was applied and along with the formantextraction, the vocal tract response characteristics wereextracted for each of the phoneme in the vowel space.The results are postulated here in the following figures:
Fig.3 (a) Vowel Space representation of / / in
Fig
F
111111
F1 (Hz)0 3000
0
3000
22 222222
2
2222222
F1 (Hz)0 3000
0
3000
333333
F1 (Hz)0 3000
0
3000
4444444444444
F1 (Hz)0 3000
0
3000
555555555
F1 (Hz)0 3000
0
3000
0
428.6
857.1
1286
1714
2143
2571
3000
0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000
F1/F2 Formant Map
30003000300030003000F1/F2 Formant Map
Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)
. 3 (b) Vowel Space representation of / / in
11
F1 (Hz)0 3000
0
22222
F1 (Hz)0 3000
0
3
3
333333
F1 (Hz)0 3000
0
44
F1 (Hz)0 3000
00
428.6
857.1
1286
1714
2143
2571
0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000
400040004000400040004000F1/F2 Formant Map
ig.3 (c) Vowel Space representation of / / in
1111111111
F1 (Hz)100 3000
300
2222222222
F1 (Hz)100 3000
300
33333333
F1 (Hz)100 3000
300
4444444444
F1 (Hz)100 3000
300
5555
F1 (Hz)100 3000
300300
828.6
1357
1886
2414
2943
3471
100 363.636 627.273 890.909 1154.55 1418.18 1681.82 1945.45 2209.09 2472.73 2736.36 3000
F
F
400040004000400040004000F1/F2 Formant Map
Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)
Fig.3 (d) Vowel Space representation of / / in
i
F
The vowel space distribution of the vocal tractresponses for Bangla vowel utterance is shown in theabove figures from 3(a) through 3(f) which indicatesthe spectral property as well as the vocal tract responseduring the production of the vowel [4].
The vowel space representation indicates a rangefor the formants in order to be recognized or for thepurpose of synthesis but the vocal tract response asshown in case of / / and / / and a similar pair ofphoneme / / and / / seems very muchindistinguishable.
111111111111111111111111111111111
F1 (Hz)0 3000
0
2222222222222222
F1 (Hz)0 3000
0
33333333333333333
33
F1 (Hz)0 3000
0
4444444444444444444444
4
F1 (Hz)0 3000
0
555555555
F1 (Hz)0 3000
00
571.4
1143
1714
2286
2857
3429
0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000
400040004000400040004000F1/F2 Formant Map
g.3 (e) Vowel Space representation of / / in
4 ConclusionBengali Vowels synthesis and recognition with
respect to vowel space requires more methodical andcareful study of the linguistic phonetics. The vowelspace representation postulated in this paper is used toproduce synthetic vowel sounds. The synthesis resulthas shown good degree of correlation only with theexception of the phoneme boundary detection issues,which would require further enhancement of the
11
11
11111111111
111
F1 (Hz)0 3000
0
2222
22
2222222
F1 (Hz)0 3000
0
3333333333333333
F1 (Hz)0 3000
0
444444444444444
F1 (Hz)0 3000
0
555555
5
555
F1 (Hz)0 3000
00
571.4
1143
1714
2286
2857
3429
0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000
400040004000400040004000F1/F2 Formant Map
ig.3 (f) Vowel Space representation of / / in
algorithm used for the feature extraction.
References:[1] Rabiner L. R., Schafer R. W., "Digital Processing
of Speech Signals", Trentice-Hall Inc, EnglewoodCliffs, 1978
[2] John R Deller, John G Proakis, John H L Hansen,"Discrete-Time Processing of Speech Signals",Macmillan Publishing Company, 1993
[3] G.Fant, Acoustic Theory of Speech Production, ‘s-Gravenhage, The Netherlands:
11111111111111
1111
111
F1 (Hz)0 3000
0
2222222222222222222
F1 (Hz)0 3000
0
33333333333
33
F1 (Hz)0 3000
0
4444444444
444444444444
F1 (Hz)0 3000
0
5555555
55
F1 (Hz)0 3000
00
571.4
1143
1714
2286
2857
3429
0 428.571 857.143 1285.71 1714.29 2142.86 2571.43 3000
ig.4 Combined Vowel Space representation of / /through / /
Mounton and Co., 1960.[4] Muhammad Abdul Hai, Dhvani Vijnan O Bangla
Dhvani-Tattwa, Mullick Brothers, 2000.[5] M. Berouti, R. Schwartz and J. Makhoul,
“Enhancement of speech corrupted by acousticnoise,” Proc. IEEE Int. Conf. on Acoust., Speech,Signal Procs., pp. 208-211, Apr. 1979.
[6] S.Blumstein and K.Stevens, “Acoustic invariancein speech production,” J. Acoust. Soc. Am., vol. 66,pp. 1001-1017, 1979.
[7] S. Blumstein and K. Stevens, “Perceptualinvariance and onset spectra for stopconsonants in different vowel environments,” J.Acoust. Soc. Am., vol. 67, pp. 648-662, 1980.
******
F1 (Hz)0 5000
0
5000
** **************
F1 (Hz)0 5000
0
5000
******
F1 (Hz)0 5000
0
5000
*************
F1 (Hz)0 5000
0
5000
F1 (Hz)0 5000
0
5000
F1 (Hz)0 5000
0
5000
F1 (Hz)0 5000
0
5000
F1 (Hz)0 5000
0
5000
eeeeeeeeee
F1 (Hz)0 5000
0
5000
eeeeeeeeee
F1 (Hz)0 5000
0
5000
eeeeeeee
F1 (Hz)0 5000
0
5000
eeeeeeeeee
F1 (Hz)0 5000
0
5000
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
F1 (Hz)0 5000
0
5000
aeae
aeae
aeaeaeaeaeaeaeaeaeaeaeaeaeae
F1 (Hz)0 5000
0
5000
aeaeaeae
aeae
aeaeaeaeaeaeae
F1 (Hz)0 5000
0
5000
oioioioioioioioioioioioioioioi
oioioioioioi
F1 (Hz)0 5000
0
5000
oooooooooooooooooooooooooooooooo
F1 (Hz)0 5000
0
5000
uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
F1 (Hz)0 5000
0
5000
0
1000
2000
3000
4000
5000
0 1000 2000 3000 4000 5000
[8] S.Boll, “Suppression of acoustic noise in speechusing spectral subtraction,” IEEETrans. Acoust., Speech, Signal Process., vol.27,pp. 113-120, Apr. 1979.
[9] B. Gold and N. Morgan, Speech and audio signalprocessing, Wiley, 2000.
[10] S Akhter Hossain, M Lutfar Rahman , FarrukAhmed “Vowel Space Identification of BanglaSpeech”, Dhaka University Journal of Science,51(1): 31-38 2003(January)
[11] S Akhter Hossain, Md Faruk Ahmed, MozammelHuq Azad Khan, M A Sobhan, and Md LutfarRahman, “Analysis by Synthesis of Bangla
Vowels”, 5th International Conference onComputer and Iinformation TechnologyProceeding, 2002, pp. 272-276
[12] S Akhter Hossain, M A Sobhan, Mozammel HuqAzad Khan, “Acoustic Vowel Space of Bangla Speech”, International Conference onComputer and Iinformation Technology 2001Proceeding, pp. 312-316
[13] S Akhter Hossain & M Abdus Sobhan,“Fundamental Frequency Tracking of BanglaVoiced Speech” –1st National Conference onComputer and Information System Proceeding1997, pp. 302-306
Proceedings of the 5th WSEAS Int. Conf. on SIGNAL, SPEECH and IMAGE PROCESSING, Corfu, Greece, August 17-19, 2005 (pp138-142)
Acoustic Space of Bangla Vowels1 Introduction1.1 Vowel Space Illustration2 Analysis Method3 Results and Discussion4 ConclusionReferences: