Transcript

Implementation of Speaking SoftMan in Game

Xiangyang Huang, Yixin Yin, Guangping Zeng, Xuyan Tu
University of Science and Technology Beijing
Beijing 100083, China
E-mail: [email protected]

Abstract - Unlike traditional NPCs (Non-player Characters), which are just "smart" programs, this paper considers SoftMans as the NPC model: they live in a simulated world and have a synthetic body with autonomous behaviors. Based on this idea, we study a text-driven lip-motion synthesis system for the SoftMan in a game. First, we divide Chinese phonemes into 11 basic static visemes in which the whole 3D virtual face is considered. Secondly, text scripts are divided into phoneme sequences. Thirdly, proper intermediate animation frames are created by a morphing technique. Finally, those frames are rendered by the vertex shader technique.

I. INTRODUCTION

The game industry evolves very fast. The quality of commercial computer games is directly related to their entertainment value.

Historically, traditional Non-player Characters (NPCs) are just "smart" programs. They are generally given direct access to the game data and are free to extract whatever they need; logically, they may know everything and never forget anything they have learned. But this is unfair to Player Characters (PCs). NPCs themselves may explore the same world as PCs do, obtaining knowledge and achieving their own goals. Intelligent Animation based on artificial intelligence, artificial life and digital technology should be considered for AI NPCs [1]. SoftMans (SMs) and Generalized SoftMans (GSMs) are a kind of virtual robot that is subject to the constraints of its environment [2][3]. GSMs may have a virtual body, so it is natural to consider GSMs as NPCs.

The added feeling of reality increases the gamer's potential enjoyment of each game. With emotions, all non-player character behaviors seem more realistic and generally increase the immersiveness of the game environment. In addition to helping display emotion, the mouth is essential for communicating. The lips are powerful little muscle machines that can contort themselves into many shapes, and the mouth changes shape to help create the sounds that compose speech.

This paper proposes a method to implement speaking SMs (generally, they are GSMs). The flow is depicted in Fig. 1. First, text scripts are divided into phoneme sequences; then each phoneme is played according to its recorded voice, and at the same time the proper facial animation of that phoneme is rendered. Obviously, creating visemes, building phoneme sequences and realizing animations are the three main problems, which are discussed in the following sections.

Fig. 1. Flow Diagram of Text-driven Lip-motion (text -> word -> syllable -> phoneme, with each phoneme mapped to a recorded voice segment and a mouth contour)
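As a rough illustration of this flow, the sketch below strings the stages of Fig. 1 together. All names here are illustrative stubs, not the paper's actual implementation; the phoneme splitting is the dictionary step described in Section III.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical outline of the Fig. 1 flow.
    std::vector<std::string> textToPhonemes(const std::string& script)
    {
        // A real system would use the phoneme dictionary of Section III;
        // here "nihao" is simply returned as its phoneme sequence.
        return {"n", "i", "h", "ao"};
    }

    void playPhonemeVoice(const std::string& ph)  { std::cout << "play voice: "    << ph << '\n'; }
    void renderVisemeFrame(const std::string& ph) { std::cout << "render viseme: " << ph << '\n'; }

    int main()
    {
        for (const std::string& ph : textToPhonemes("nihao"))
        {
            playPhonemeVoice(ph);    // audio for this phoneme
            renderVisemeFrame(ph);   // matching facial animation for the same phoneme
        }
    }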

II. BASIC STATIC VISEMES OF CHINESE

The unique sounds we use to create words are called phonemes. The shapes of the mouth and the positions of the tongue that create these sounds are called visemes. To create lip-synced animation, we need to construct a set of visemes our game can use to match the phonemes in a recorded voice.

Although visemes are related to the shape of the SM's mouth and the position of its tongue, we should consider the whole facial animation. Many models have been proposed to synthesize lip motion and facial images, such as the elastic deformable model and the skeleton model, but these models are very complicated, the animations they produce are rigid, and they are difficult to implement in a game [4][5]. Here we adopt a 3D mesh technique to construct the SoftMan's face (see Fig. 2) and the 3D vertex shader technique to render animations.

The physical model of SMs is created by using the 3D mesh technique. The majority of the meshes constructed are the phoneme shapes the SM mesh's mouth can form. Depending on how realistic the lip-syncing animation should look, lower-quality animations use as few as four viseme shapes, while high-quality animations can use 30 or more viseme shapes. Chinese phonemes consist of 19 initials (the initial of a Chinese syllable) and 39 vowels (the vowel of a Chinese syllable). In our game, we divide these phonemes into 11 basic visual classes (see Table I); these are the phonemes for which we want to create matching facial meshes.

Fig. 2. Facial Mesh of SoftMan

TABLE I
BASIC STATIC VISEMES OF CHINESE

Record number    Representative phonemes
1                a, ao
2                p, b, m
3                d, t, n, l
4                i, j
5                f, v
6                g, k, h
7                ai, ei
8                e
9                o, u
10               r
11               z, c, s, x, q

(The third column of the original table, "Lip shapes", shows an image of the mouth shape for each record.)
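For playback, each phoneme only needs to resolve to the record number of its viseme mesh. A minimal lookup built from Table I might look like the following; the map name and the use of an int record number are illustrative choices, not prescribed by the paper.

    #include <string>
    #include <unordered_map>

    // Phoneme -> viseme record number, following Table I.
    // (Assumed data structure; the paper does not specify a container.)
    const std::unordered_map<std::string, int> kVisemeOfPhoneme = {
        {"a", 1}, {"ao", 1},
        {"p", 2}, {"b", 2}, {"m", 2},
        {"d", 3}, {"t", 3}, {"n", 3}, {"l", 3},
        {"i", 4}, {"j", 4},
        {"f", 5}, {"v", 5},
        {"g", 6}, {"k", 6}, {"h", 6},
        {"ai", 7}, {"ei", 7},
        {"e", 8},
        {"o", 9}, {"u", 9},
        {"r", 10},
        {"z", 11}, {"c", 11}, {"s", 11}, {"x", 11}, {"q", 11},
    };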

III. BUILDING PHONEME SEQUENCES

The SM's lips change shape to match each phoneme in its speech. Creating lip-synced animation sequences means using scripts (spoken and written) in combination with a phoneme dictionary to break every spoken (and written) word down into its phoneme sequence. A script contains the exact sequence that is to be spoken and lip-synced. For example, suppose the SM wants to say "nihao." Using a text editor, enter that exact phrase and save it as a text file. Next, using a sound-editing program of some sort, record that phrase, then save it as a standard .WAV file. After having done this, we can break the phrase up into its phoneme sequence. We accomplish this by using a phoneme dictionary, which is a data file that contains the phoneme sequence for each word contained in the dictionary. By piecing together the sequences for each word, we can create sequences to match any script. For instance, the phoneme dictionary definition for the word "hao" would be the phonemes h and ao. Each of those two phonemes has a matching facial mesh associated with it. As those various phonemes are processed during animation playback using the sound-editing program, the facial mesh morphs to match the sounds. For the entire "nihao" sequence, we would use the phonemes n, i, h, and ao.

Take a look at the entire process of writing, recording, and processing the data that will eventually become the lip-sync animation. Start by writing a script file (as a text file). This script should contain everything that is to be spoken by SMs. Using the phoneme dictionary, take the written script and convert every word to a sequence of phonemes.
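A minimal sketch of this word-to-phoneme step, assuming the dictionary has been loaded into an in-memory map (the container choice and function names are illustrative, not from the paper):

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical in-memory phoneme dictionary: word -> phoneme sequence.
    using PhonemeDict = std::unordered_map<std::string, std::vector<std::string>>;

    // Convert a script (already split into words) into one flat phoneme sequence
    // by concatenating each word's dictionary entry.
    std::vector<std::string> scriptToPhonemes(const std::vector<std::string>& words,
                                              const PhonemeDict& dict)
    {
        std::vector<std::string> phonemes;
        for (const std::string& word : words)
        {
            auto it = dict.find(word);
            if (it == dict.end())
                continue;                    // unknown word: simply skipped in this sketch
            phonemes.insert(phonemes.end(), it->second.begin(), it->second.end());
        }
        return phonemes;
    }

    // Example from the text: {"ni", "hao"} with dict {"ni": {n, i}, "hao": {h, ao}}
    // yields the sequence n, i, h, ao.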

Next, using a high-quality recording format, record the script word for word. Be sure to speak slowly and clearly, adding brief silences between words to isolate each word. The silence marks the end of one word and the beginning of another, so we can determine each word's length, which in turn determines the animation speed for each phoneme sequence. Based on the sound's playback frequency and position, we can go back to our phoneme sequence and use the time values to animate it. We can use two structures to describe the phoneme and the phoneme sequence. The first structure, Phoneme, stores information about a single phoneme: the phoneme identification number (the phoneme mesh number), as well as the beginning and ending time (in milliseconds) for the phoneme to animate. The Phoneme structure is actually an animation key frame. The second structure, PhonemeSequence, stores an array of Phoneme objects that define an entire animation sequence. The number of key frames in the animation is given by the first value in PhonemeSequence, followed by the array of key frames.
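Following that description, the two structures could be declared as below; the exact field names and types are assumptions, since the paper only lists the fields conceptually.

    #include <cstdint>
    #include <vector>

    // One animation key frame: which viseme mesh to show and when (times in ms).
    struct Phoneme
    {
        int      meshId;     // phoneme identification number = viseme mesh number
        uint32_t startMs;    // time at which this phoneme starts animating
        uint32_t endMs;      // time at which this phoneme stops animating
    };

    // A whole lip-sync animation: the key frame count followed by the key frames.
    struct PhonemeSequence
    {
        uint32_t             numKeyFrames;  // first value, per the layout described above
        std::vector<Phoneme> keyFrames;     // array of key frames
    };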

IV. MORPHING ANIMATION


When an SM is speaking, the corresponding .WAV files are played, and at the same time the appropriate facial meshes are rendered by inserting intermediate frames between two key frames (two viseme meshes). These intermediate frames are created by a morphing technique.



Morphing is the technique of changing one shape into another. Here, those shapes are meshes (that is, animation frames). The process of morphing a mesh involves gradually changing the coordinates of the mesh vertices (normals are treated the same way as vertex coordinates), starting at one mesh's shape and progressing to another. The mesh that contains the orientation of the vertices at the beginning of the morphing cycle is called the source mesh. The second mesh, which contains the orientation of the vertices at the end of the morphing cycle, is called the target mesh. The two meshes must share the same number of vertices, and each vertex in the source mesh must have a matching vertex (that is, a matching index number) in the target mesh.

The morphing operation gradually moves vertices from the source mesh positions to the target mesh positions. We track the motion of the vertices from the source mesh coordinates to the target mesh coordinates with a scalar value ranging from 0 to 1. With a scalar value of 0 the vertices are positioned at the source mesh coordinates, whereas with a scalar value of 1 the vertices are positioned at the target mesh coordinates. It is quite simple to calculate the coordinates at which to position a vertex between the source mesh coordinates and the target mesh coordinates. Take a vertex from the source mesh and multiply its coordinates by the inverse scalar value (1.0 - scalar). Using the inverse scalar means that the original vertex coordinates contribute 100 percent of the vertex's position when the scalar is 0.0 and zero percent when the scalar is 1.0. Next, using the same indexed vertex's coordinates from the target mesh, multiply those coordinates by the scalar value. Adding the two resulting values gives the final coordinates to use for the vertex during the morphing animation.
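In code, the blend described above is one multiply-add per component. A minimal sketch, assuming a simple three-float vector type (not defined in the paper):

    // Minimal 3-component vector for the sketch.
    struct Vec3 { float x, y, z; };

    // Linear blend between the source and target position (or normal) of one
    // vertex: scalar = 0 gives the source, scalar = 1 gives the target.
    Vec3 morphVertex(const Vec3& src, const Vec3& tgt, float scalar)
    {
        float inv = 1.0f - scalar;                 // the inverse scalar from the text
        return { src.x * inv + tgt.x * scalar,
                 src.y * inv + tgt.y * scalar,
                 src.z * inv + tgt.z * scalar };
    }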

A morphing mesh is composed of interpolated position and normal coordinates that are calculated from a source mesh and a target mesh. Since those position and normal coordinates are part of the vertex stream, we can create a vertex shader that takes two vertex streams at a time plus the scalar value and calculates the blended vertex values using a few simple instructions [6]. It all happens in line with the vertex shader; there is no more locking and rebuilding of a morphing mesh each frame. The vertex shader transforms the attributes of vertices (the points of a triangle), such as color, texture coordinates, position and direction, from the original space to the display space. It allows the original objects to be distorted or reshaped in any manner. The vertex shader is a programmable function in display adapters that offers the graphics programmer flexibility in rendering an image, and it can be trivially parallelized.
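Putting the pieces together, the scalar fed to the morph each frame can be derived from the .WAV playback position and the phoneme key frame times. The sketch below assumes the Phoneme/PhonemeSequence structures sketched in Section III; the interpretation that each key frame morphs toward the next key frame's viseme, and the field names, are our assumptions rather than the paper's specification.

    #include <cstddef>

    // Given the current playback time (ms), find the active key frame and the
    // morph scalar in [0, 1] describing progress from that key frame's viseme
    // mesh (source) toward the next key frame's viseme mesh (target).
    // Assumes endMs > startMs for every key frame.
    bool currentMorph(const PhonemeSequence& seq, uint32_t nowMs,
                      int& srcMesh, int& tgtMesh, float& scalar)
    {
        for (std::size_t i = 0; i + 1 < seq.keyFrames.size(); ++i)
        {
            const Phoneme& cur  = seq.keyFrames[i];
            const Phoneme& next = seq.keyFrames[i + 1];
            if (nowMs >= cur.startMs && nowMs < cur.endMs)
            {
                srcMesh = cur.meshId;
                tgtMesh = next.meshId;
                scalar  = float(nowMs - cur.startMs) / float(cur.endMs - cur.startMs);
                return true;
            }
        }
        return false;   // playback time falls outside the sequence
    }

Each frame, the resulting scalar together with the source and target viseme meshes is exactly what the morph blend (or, equivalently, the vertex shader operating on two vertex streams) consumes.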

V. CONCLUSIONS

Based on introducing SMs into games, we have studied a text-driven lip-motion synthesis system for SMs. First, we divide Chinese phonemes into 11 basic static visemes in which the whole facial expression is considered by using the 3D mesh technique; then we divide text into phoneme sequences; and finally proper animation frames are created by the morphing technique and rendered by the vertex shader technique.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation under grant no. 60375038.

REFERENCES

[1] Xuyan Tu, "Intelligent Animation, Intelligent Game, Intelligent Film & Television," Proceedings of the 2004 Sino-Japan Symposium on KANSEI & Artificial Life, Beijing, July 2004.

[2] Guangping Zeng and Xuyan Tu, "SoftMan," Proceedings of the 10th CAAI National Conference, Beijing: Beijing University of Posts and Telecommunications Publishing House, 2003, pp. 677-682.

[3] Xuyan Tu, "Generalized Soft-Man and its Application," Proceedings of the 2nd CMIA Symposium on Digital Human Body, Beijing, 2004.

[4] P. Ekman and W. V. Friesen, Facial Action Coding System (FACS): Manual, Consulting Psychologists Press, 1978.

[5] K. Waters, "A Muscle Model for Animating Three-Dimensional Facial Expression," SIGGRAPH '87 Conference Proceedings, vol. 21, pp. 17-24, ACM SIGGRAPH, July 1987.

[6] D. Gosselin, "Character Animation with Direct3D Vertex Shaders," ShaderX, Wordware Inc., 2002.



