

A Dutch Jaw + An American Tongue + A Kiwi Neck = British English

A Dynamic Model of the Human Vocal Tract

By Xiao Bo, Lu
Supervised by: William Thorpe and Peter Bier

Introduction

The human voice box has long been a puzzle. How do we relate what we hear to how we speak? Why does every person’s voice sound different? We know that the quality of the voice is largely determined by the shape of the vocal tract (VT). Moving the vocal organs changes the physical properties of the vocal tract, and with them the sound of the voice. This project aims to synthesise speech from a physiological model of the human vocal tract.

The “Talking Head”

First, the vocal organs surrounding the vocal tract were modelled to reproduce the voice box. Most of these models were based on human anatomical measurements; however, the measurements come from different sources.

The Kinematics Data

To simulate the movements of these vocal organs, we used data collected from a number of electrical coils attached to a British man while he was recorded speaking a short English sentence.

The Moving Lips and Tongue

Based on the kinematic data, the movements of the lips, tongue and jaw were simulated by deforming the model so that the resting positions of the coils (green) match their corresponding target positions (red) at each time step.

By doing so, we reproduced the physical movements of the speech organs in the ‘Talking Head’.
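
The poster does not state how the deformation is computed. As a minimal sketch of the idea, the code below moves mesh nodes so that each coil's resting position lands exactly on its target, using Gaussian radial basis interpolation of the coil displacements. The function name `rbf_deform`, the kernel and the width `sigma` are illustrative assumptions, not the method used in the project.

```python
import numpy as np

def rbf_deform(nodes, rest_coils, target_coils, sigma=2.0):
    """Displace mesh nodes so that rest coil positions map onto their targets.

    Assumption: a Gaussian radial basis interpolant of the coil displacements
    (hypothetical choice; the poster does not name its deformation scheme).
    """
    nodes = np.asarray(nodes, dtype=float)
    rest = np.asarray(rest_coils, dtype=float)
    disp = np.asarray(target_coils, dtype=float) - rest

    def kernel(a, b):
        # Pairwise squared distances between the rows of a and the rows of b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    # Weights chosen so the interpolated displacement is exact at every coil.
    weights = np.linalg.solve(kernel(rest, rest), disp)
    return nodes + kernel(nodes, rest) @ weights

def animate(nodes, rest_coils, coil_frames, sigma=2.0):
    """Apply the deformation once per time step of the kinematic data."""
    return [rbf_deform(nodes, rest_coils, frame, sigma) for frame in coil_frames]
```

Nodes close to a coil follow it, while nodes far from every coil are left almost untouched, which is the behaviour one would want for a jaw or tongue mesh driven by a handful of tracked points.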

Fit VT Models

The surfaces of the vocal organs were digitised to give a discrete representation of the enclosed vocal space (Left). An initial mesh served as the starting point (Middle). The mesh parameters were then optimised to minimise the distances between the data points and the mesh surface (Right).
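
The fitting step can be illustrated with a deliberately small least-squares example. The sketch below fits the four nodal positions of a single bilinear surface element to digitised data points. It assumes that each data point's element coordinates `xi` are already known, whereas a real fit would project the points onto the surface and iterate, and would probably use many higher-order elements. The names `bilinear_basis` and `fit_element` are hypothetical.

```python
import numpy as np

def bilinear_basis(xi):
    """Basis functions of one bilinear element at coordinates xi in [0, 1]^2."""
    xi = np.asarray(xi, dtype=float)
    s, t = xi[:, 0], xi[:, 1]
    return np.column_stack([(1 - s) * (1 - t), s * (1 - t), (1 - s) * t, s * t])

def fit_element(xi, data_points):
    """Nodal positions minimising the summed squared point-to-surface distance.

    Assumption: xi (the element coordinates of each data point) is fixed,
    so the problem is linear in the nodal positions.
    """
    phi = bilinear_basis(xi)
    nodes, *_ = np.linalg.lstsq(phi, np.asarray(data_points, dtype=float), rcond=None)
    return nodes  # shape (4, number of spatial dimensions)
```

Because the surface is linear in its nodal parameters, one linear solve gives the optimum for fixed projections; alternating between re-projecting the data and re-solving is a common way to handle the full nonlinear problem.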

We can further approximate the speech signal at the lips, l(nT), as a glottal input g(nT) passed through a high-order digital filter.
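
The filter equation itself did not survive in this copy of the poster. A common form for a tube-model vocal tract filter is all-pole, and the sketch below assumes that form: l[n] = G g[n] - a_1 l[n-1] - ... - a_p l[n-p]. The coefficient vector `a`, the gain `G` and the function name `all_pole_filter` are assumptions made for illustration.

```python
import numpy as np

def all_pole_filter(g, a, gain=1.0):
    """Filter a glottal input g with an all-pole filter (assumed form).

    Difference equation: l[n] = gain * g[n] - sum_{k=1..p} a[k-1] * l[n-k],
    where each sample index n stands for time nT.
    """
    g = np.asarray(g, dtype=float)
    lips = np.zeros_like(g)
    for n in range(len(g)):
        acc = gain * g[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * lips[n - k]
        lips[n] = acc
    return lips

# Usage: a crude pulse-train glottal source through a second-order resonator.
source = np.zeros(200)
source[::50] = 1.0
speech = all_pole_filter(source, a=[-1.6, 0.81])
```

For a time-varying vocal tract the coefficients would be updated as the area function changes from one time step to the next.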

Results

The time-varying VT area function for the short English sentence ‘where were you while we were away’ was produced. The synthetic voice is partially recognisable. Compared with the traditional method, this articulatory model shows the ability to generate more realistic speech sounds.

A subject in an EMA experiment (adapted from Mulooly 2004)

Summary

We present a 3D finite element model of the vocal tract that is derived from a physiologically based model of speech. The model is constructed so that its shape varies as the surrounding organs move. For the short English sentences tested, the resulting time-varying area functions are compared with area functions computed from the speech audio by LPC analysis in order to validate the model. These results show that the model is able to provide a realistic representation of the time-varying vocal tract.
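
For readers unfamiliar with the LPC comparison, the sketch below shows one generic way to get an area function from audio: the Levinson-Durbin recursion on a frame's autocorrelation yields reflection coefficients, and an area-ratio recursion turns these into relative tube areas. The sign convention, the ordering of sections (lips to glottis or the reverse) and any pre-emphasis differ between texts, so the ratio used here is an assumption and not necessarily the one used in this project. All function names are hypothetical.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns the prediction-error filter a (with a[0] = 1), the reflection
    coefficients k, and the final prediction error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = float(r[0])
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k_i = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k_i * prev[i - 1:0:-1]
        a[i] = k_i
        err *= 1.0 - k_i ** 2
        k[i - 1] = k_i
    return a, k, err

def lpc_reflection(frame, order):
    """Reflection coefficients of one Hamming-windowed speech frame."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    full = np.correlate(x, x, mode="full")
    r = full[len(x) - 1:len(x) + order]
    return levinson_durbin(r, order)[1]

def areas_from_reflection(k, first_area=1.0):
    """Relative tube areas, assuming A[i+1] = A[i] * (1 - k[i]) / (1 + k[i])."""
    areas = [first_area]
    for k_i in k:
        areas.append(areas[-1] * (1.0 - k_i) / (1.0 + k_i))
    return np.array(areas)
```

Only area ratios are recovered this way, so the LPC-derived areas have to be scaled before they can be set against areas measured on the mesh.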

[Figure: time-varying VT area function. Axes: distance from glottis (4 to 18 cm), time (0 to 2000 ms), area (0 to 8 cm²).]

Extract cross-sectional areas at regular intervals along the VT mesh
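
A minimal sketch of the area extraction, assuming the mesh has already been sliced into closed planar contours at regular intervals along the tract: the area of each slice is computed with the shoelace formula. Slicing the 3D mesh itself is not shown, and the names `polygon_area` and `area_function` are hypothetical.

```python
import numpy as np

def polygon_area(contour):
    """Area of a closed planar contour given as ordered (x, y) vertices in cm."""
    pts = np.asarray(contour, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Shoelace formula; np.roll pairs each vertex with the next one around the loop.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def area_function(contours):
    """Cross-sectional area (cm^2) of each slice, ordered from glottis to lips."""
    return np.array([polygon_area(c) for c in contours])
```

Repeating this at every time step of the deforming mesh gives the time-varying area function.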

Approximate VT by concatenated uniform tubes
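
In the standard lossless-tube picture, each junction between adjacent uniform sections is described by a reflection coefficient that depends only on the two areas, and the section length ties the number of sections to the sampling rate. The sketch below uses one common sign convention, r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k), and a speed of sound of 35000 cm/s. Both are assumptions, as are the function names.

```python
import numpy as np

def reflection_coefficients(areas):
    """Junction reflection coefficients for concatenated uniform tubes.

    Assumed convention: r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k).
    """
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def sampling_rate(tract_length_cm, n_sections, c_cm_per_s=35000.0):
    """Sampling rate at which one sample period equals the two-way travel
    time through a single section: fs = c * N / (2 * L)."""
    return c_cm_per_s * n_sections / (2.0 * tract_length_cm)
```

A perfectly uniform tube gives reflection coefficients of zero at every junction; the larger the jump in area between neighbouring sections, the closer the coefficient gets to plus or minus one.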

[Figure: plot with horizontal axis 0 to 3 x 10^4 and vertical axis -8 to 6 (axis names not recoverable); labels /i/, /e/, /З/.]