FiliText: A Filipino Hands-Free Text Messaging Application
Jerrick Chua, Unisse Chua, Cesar de Padua, Janelle Isis Tan, Mr. Danny Cheng
College of Computer Studies, De La Salle University - Manila
1401 Taft Avenue, Manila (63) 917 5120271
ABSTRACT
This research aims to create a hands-free text messaging application capable of recognizing the Filipino language, allowing users to send text messages using speech. Other developers may build on this research to further study Filipino speech recognition and its application to Filipino text messaging.
Keywords
Speech Recognition, Filipino Language, Text Messaging
1. INTRODUCTION
Texting while driving has been a problem in most countries since the late 2000s. The Philippine National Police (PNP) reported about 15,000 traffic accidents in 2006, averaging 41 accidents per day. Most of these accidents were attributed to driver error. Additionally, traffic accidents caused by cellphone use while driving represented the highest increase among the causes of traffic accidents [2]. According to Bonabente (2008), 'The Automobile Association Philippines (AAP) has called for an absolute ban on use of mobile phones while driving, saying it was the 12th most common cause of traffic accidents in the country in 2006.' The AAP said that using cell phones while driving, even with hands-free sets, could impair the driver's attention and lead to accidents.
One existing application that lets people command their phones by voice is Vlingo, an intelligent voice application capable of much more than letting users text while driving [3]. It is also a multi-platform application available for Apple, Android, Nokia, BlackBerry, and Windows Mobile devices. Another application, VoiceText, was developed by Clemson University; it allows drivers to send text messages while keeping their eyes on the road. Drivers using VoiceText put their mobile phones in Bluetooth mode and connect them to their car. Through the car's speaker system or a Bluetooth headset, drivers can give a voice command and deliver a text message. StartTalking, from AdelaVoice, allows the user to initiate, compose, review, edit, and send a text message entirely by voice command; however, it is only available for Android 2.0 and above [4]. Other similar applications exist, but they all share the same purpose - to help lessen car accidents caused by distracted driving.
2. SIGNIFICANCE OF RESEARCH
Western countries have started to develop hands-free texting applications that have helped reduce the number of car accidents caused by texting while driving, and some of these can understand Chinese. However, the Philippines, considered the text capital of the world, still has no such application that prevents drivers from texting while driving, mainly because existing applications have no Filipino language capability.
There are party-lists and organizations that support legislation against this practice. The Buhay party-list filed a bill seeking to penalize persons caught using their mobile phones on the road. The cities of Manila, Makati, and Cebu have banned the practice on paper; however, the ban has not been properly enforced [1]. Even where a law bans Filipinos from using their mobile phones while driving, it has not been strictly implemented, and there are only a few ways of knowing whether a person is actually following it. The development of a local version of an existing hands-free text messaging application, Vlingo InCar, offers an alternative for Filipinos. This service aims to keep the driver's hands on the wheel and his or her concentration solely on the surroundings. The Philippines, known as the texting capital of the world, may see fewer traffic accidents once drivers can use their mobile phones hands-free on the road rather than glancing down to read a text message from one of their contacts. Hands-free text messaging is not only helpful in keeping drivers from texting with their hands while driving; it may also be used by physically disabled individuals with normal speech, by people who are used to multitasking, or in situations where a person needs both hands to perform an activity. One application would be for busy businessmen and women who need to accomplish many tasks in a short amount of time: with such an application, they no longer need to use their hands to send an urgent message to colleagues while attending to other urgent matters.
Alongside this application, this research will shed light on speech recognition of the Filipino language, a topic that lacks research and in-depth analysis. Once deeper studies have been conducted, this work could serve as a stepping stone for future studies involving Filipino speech recognition.
3. RELATED LITERATURE
3.1 Filipino Text Messaging Language
Manila's lingua franca was used as the base for the Philippines' national language, Filipino, which is commonly used in the urban areas of the country and is spreading fast across it [6]. Tagalog is the structural base of the Filipino language and is commonly spoken in Manila and the provinces of Rizal, Cavite, Laguna, and others.
According to a study by Shane Snow, the Philippines is still considered the text capital of the world. With the Short Message Service (SMS) constraint of 160 characters per message, people learned how to shorten what they wanted to say, in what is now referred to as 'text speak'. One simple way of shortening a message is to take out all the vowels; however, this does not work for all words because words with similar consonant sequences become ambiguous. Phonetics, or how a word sounds, also plays a role in how Filipino texters shorten messages [7]. For example, 'Dito na kami' becomes 'D2 n kmi' and 'Kumain na ako' becomes 'Kumain n ko'.
3.2 Speech to Text Libraries
Speech-to-text systems are already available as desktop applications, and some of them expose APIs and/or libraries for those who want to use the system in a new desktop application. Among these systems are CMUSphinx, Android Speech Input, the Java Speech API, and SpinVox Create. Of the APIs and libraries available, CMUSphinx is the most appropriate. CMUSphinx distributes the CMUSphinx Toolkit, which includes a recognizer library, a support library, language model and acoustic model training tools, and a decoder for speech recognition research, among other tools. It also has a library for mobile support called PocketSphinx. CMUSphinx can also generate its own pronunciation dictionary using an existing dictionary as a basis, but the pronunciation generation code only supports English and Mandarin.
3.3 CMU Sphinx-4
Sphinx-4 is a Java-based, open-source automatic speech recognition system [8]. It is made up of three core modules: the FrontEnd, the Linguist, and the Decoder. The Decoder is the central module; it takes in the output of the FrontEnd and Linguist modules, generates its results from them, and passes these results to the calling application. The Decoder has a single sub-module, the SearchManager, which it uses to recognize a set of frames of features. The SearchManager is not limited to any single search algorithm, and its capabilities are further extended by the design of the FrontEnd module.

The FrontEnd module is responsible for digital signal processing. It takes in one or more input signals and parameterizes them into features, which it then passes to the Decoder module.
Finally, the Linguist module is responsible for generating the SearchGraph; the Decoder compares the features from the FrontEnd against this SearchGraph to generate its results. The Linguist module is made up of three sub-modules: the AcousticModel, the Dictionary, and the LanguageModel. The AcousticModel is responsible for mapping units of speech to their respective hidden Markov models. The LanguageModel provides implementations that represent the word-level language structure. The Dictionary dictates how words in the LanguageModel are pronounced.
CMU Sphinx can also use a language model created by the user. With such a language model, users can define a grammar suited to their own language and, together with a matching acoustic model, fully utilize a language model patterned after their native language.
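To make the module interaction concrete, the following is a minimal sketch of how a calling application can drive a Sphinx-4 recognizer wired together through an XML configuration file. The configuration file name, the component names looked up ("recognizer", "audioFileDataSource"), and the audio file are illustrative assumptions in the style of the Sphinx-4 demos, not FiliText's actual setup.

import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import java.net.URL;

public class TranscribeSketch {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration that wires the FrontEnd, Linguist, and Decoder together.
        ConfigurationManager cm = new ConfigurationManager(
                TranscribeSketch.class.getResource("filitext.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Point the configured data source at a 16 kHz, 16-bit mono WAV recording.
        AudioFileDataSource dataSource =
                (AudioFileDataSource) cm.lookup("audioFileDataSource");
        dataSource.setAudioFile(new URL("file:recording.wav"), null);

        // Each call to recognize() returns the Decoder's result for one utterance.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(result.getBestFinalResultNoFiller());
        }
        recognizer.deallocate();
    }
}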
4. SYSTEM DESIGN
4.1 Overview
FiliText is a desktop application designed especially for Filipinos. It serves as a stepping stone for future developers to create a hands-free texting application for mobile phones. FiliText accepts audio files, specifically in the Waveform Audio File Format (WAV), as input and processes them through a speech recognition API to convert the message into text. The conversion is performed after the user acknowledges that the voice input is done. The application produces two outputs. The first is the converted message with proper and complete spelling in Filipino. As an option, the user may choose to compress the text output into an SMS, since most cell phone carriers allow only up to 160 characters per message.
4.2 Architecture
The system begins by gathering input, a spoken message, through the input module. The input module passes the unprocessed spoken message to the Sphinx-4 module, configured for recognizing Filipino. Sphinx-4 then passes the now text-based message to the message shrinking module, which applies common methods of reducing word length. Finally, the shrunken, text-based message is passed to the output module, which displays it to the user.
Figure 2. Architectural Design of FiliText
The system will rely on Sphinx-4 in order to convert the spoken
message into its respective text format. Because Sphinx-4 is
highly configurable, a speech recognition module would not need
to be coded from scratch. Instead, the Sphinx-4 module will be
trained and configured to recognize informal Filipino.
The input will first pass through the FrontEnd module, which will
handle the cleaning and normalizing of the input. Little effort will
be placed into configuring and optimizing the FrontEnd as it deals with digital signal processing.
The Linguist module will create the SearchGraph, which the
Decoder module will use to compare the input against in order to generate its results. The Linguist module will be the most
configured of the three as it contains the hidden Markov models,
the list of possible words and their respective pronunciations, and
the acoustic models of phonemes. Sphinx-4 does not have any of the necessary files to understand Filipino, so the dictionary, the
acoustic models, and the language models will be created by the
proponents using the tools provided by the CMU Sphinx group.
The language model will be created using a compilation of Filipino text messages, newscast transcripts, Facebook posts, and Twitter feeds. These will be placed into a text file with the format <S> text n </S>. A vocabulary file, a text file listing all the Filipino words used, will also be created and used to generate the language model; it will not include names and numbers. These two files will be fed to the CMU-Cambridge Language Modeling Toolkit (CMUCLMTK) to create an n-gram language model. Aside from its use in the implementation, the created language model is also needed to create the acoustic model.
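For illustration, a fragment of such a corpus text file might look like the lines below; the sentences are hypothetical examples patterned after those in Section 3.1, not entries from the actual compilation.

<S> kumain na ako </S>
<S> nandito na kami </S>
<S> tumatawag na naman siya </S>

The corresponding vocabulary file would then list each distinct word on its own line, for example: ako, kami, kumain, na, naman, nandito, siya, tumatawag.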
The AcousticModel sub-module of the Linguist will be trained using SphinxTrain, a tool also created by the CMU Sphinx group for generating the acoustic models a system will use. To train the acoustic model, recordings from different speakers will be compiled and each audio file transcribed. Each recording will be a WAV file sampled at 16 kHz, 16-bit mono, and segmented to between 5 and 30 seconds long [5]. The set of speakers will include males and females aged 16 years or older.
The Decoder module compares the processed input against the SearchGraph produced by the Linguist to produce its results. The Decoder's SearchManager sub-module will be configured to use SimpleBreadthFirstSearchManager, an implementation of the frame-synchronous Viterbi algorithm.
The message shrinker module will use word alignment, a statistical machine translation technique, to shorten the output of the Sphinx-4 module while keeping it understandable. The output of this module will be a text of at most 160 characters, unless the text cannot possibly be shortened to fit within 160 characters. This shortened message is then sent to the output module, which displays it to the user.
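Although the shrinker is planned around word alignment, the conclusion (Section 7) notes that the implemented shortening is rule-based and uses regular expressions. The sketch below illustrates that style of shortening with a few ordered substitution rules in the spirit of the 'text speak' examples from Section 3.1; the specific rules are assumptions for illustration, not the module's actual rule set.

import java.util.LinkedHashMap;
import java.util.Map;

public class MessageShrinker {
    // Ordered substitution rules; earlier rules are applied first.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\\bdito\\b", "d2");   // phonetic shortening: "dito" -> "d2"
        RULES.put("\\bna\\b", "n");      // drop the vowel of the particle "na"
        RULES.put("\\bako\\b", "ko");    // "ako" -> "ko"
    }

    public static String shrink(String message) {
        String out = message.toLowerCase();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            out = out.replaceAll(rule.getKey(), rule.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints "d2 n kami" under these illustrative rules.
        System.out.println(shrink("Dito na kami"));
    }
}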
4.3 Customization of Sphinx-4
Since the Sphinx-4 documentation specifies the steps for creating the language model and acoustic model of a new language, creating a prototype was relatively easy. The challenge in customizing Sphinx-4 for an entirely different language is gathering all the recordings and having them trained with SphinxTrain.

The initial task was to gather audio recordings from different speakers that together cover all the phonemes of Filipino. The audio files must be 16-bit mono at 16 kHz and must not be shorter than 5 seconds, to aid the accuracy of the acoustic model training. After gathering all the speech recordings, they are placed in a folder and run through CMUCLMTK so that it can create the dictionary of words used and the different phonemes it was able to detect. After this pass through the language modeling toolkit, the data is ready for training under SphinxTrain to create the acoustic model that the application will use to understand the words uttered by the end user.
4.4 Data Collection
The corpus the proponents will be using is the Filipino Speech Corpus (FSC) created by Guevarra, Co, Espina, Garcia, Tan, Ensomo, and Sagum of the University of the Philippines - Diliman Digital Signal Processing Laboratory in 2002. The corpus contains audio files and matching transcription files, which will be used to train the acoustic model and the language model for the Filipino language; these trained models will then be used for recognition. For the language model, contemporary sentences were also gathered from social networking sites such as Facebook and Twitter.
4.5 Training the Acoustic and Language Models
To create a new acoustic model for Sphinx-4 to use, the CMU Sphinx project provides tools to aid in building these models for speech recognition. The required files are placed in a single folder, which is considered the speech database.
Sound recordings will be placed in the directory wav/speaker_n. The dictionary, the mapping of words to their respective pronunciations, used for SphinxTrain will be the same as the one used in the Sphinx-4 implementation. A single text file will be created to hold the transcription of each recording; each entry in this file must follow the format <S>transcription of file n</S> (file_n). A file listing the filenames of all the sound recordings must also be present, as this is used to map the files to their transcriptions. The ARPA language model will also be used by SphinxTrain to generate the acoustic models. A phoneset file, a file listing each phoneme tag, is needed as well. The filler dictionary will be created to include only silences. All of these will be used by SphinxTrain to generate the acoustic model. All of the files mentioned, other than the audio recordings, are placed inside a folder labeled etc.
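A hypothetical layout of such a speech database, using placeholder file names, is shown below; only the file roles are taken from the description above.

filitext/
  etc/
    filitext.dic                  (phonetic dictionary)
    filitext.filler               (filler dictionary, silences only)
    filitext.phone                (phoneset file listing each phoneme tag)
    filitext.lm                   (ARPA language model)
    filitext_train.fileids        (filenames of all sound recordings)
    filitext_train.transcription  (one <S>...</S> (file_n) entry per recording)
  wav/
    speaker_1/
      file_1.wav
      file_2.wav
    speaker_2/
      file_1.wav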
Because the FSC recordings are 25 to 35 minutes long each, the proponents had to segment each file into the specified 5- to 30-second lengths. They automated the process using SoX, an open-source audio manipulation tool, which segmented the sound files according to the existing transcriptions that came with the speech corpus. After segmenting the sound files, the filenames were written to a fileids file and the transcriptions of the sound files were compiled into a single file, ready for training.
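A minimal sketch of how this segmentation could be driven from Java is given below. It assumes the segment start times and durations have already been read from the corpus transcriptions, and it calls SoX's trim effect through a subprocess; the file names are illustrative.

import java.io.IOException;

public class Segmenter {
    // Equivalent to running: sox <input> <output> trim <startSec> <durationSec>
    public static void cutSegment(String inputWav, String outputWav,
                                  double startSec, double durationSec)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sox", inputWav, outputWav,
                "trim", String.valueOf(startSec), String.valueOf(durationSec));
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sox exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: extract a 10-second segment starting 25 seconds into the recording.
        cutSegment("speaker_1_full.wav", "speaker_1_seg_001.wav", 25.0, 10.0);
    }
}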
The language model needed for the etc folder was created using the transcription file and CMUCLMTK. The phonetic dictionary was also created with the aid of the language modeling toolkit: in the process of creating the language model itself, the toolkit first creates a dictionary file containing all the words in the transcription file.
The phonetic dictionary that the proponents created uses the letters of each Filipino word as its phones, one phone per letter. According to the SphinxTrain acoustic model training documentation, this approach is used when no phoneme book is available and it "gives very good results" [5].
Figure 3. Phonetic Dictionary Sample
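Since the figure itself is not reproduced here, the entries below illustrate what the letters-as-phones scheme would produce for a few of the words used earlier in this paper; the actual entries in the proponents' dictionary may differ.

AKO      A K O
DITO     D I T O
KUMAIN   K U M A I N
NANDITO  N A N D I T O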
The folder structure must be followed because the training process is controlled by Perl scripts that set up the rest of the training binaries and configuration files.
Before starting the training, the training configuration file (sphinx_train.cfg) must be edited according to the size of the speech database to be trained. The model parameters that must be considered before training are the number of tied states (senones) and the number of densities.
Figure 4. Approximation of Senones and Number of Densities
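As an illustration only, these two parameters correspond to variables in sphinx_train.cfg along the lines of the excerpt below; the values shown are placeholders and should be chosen from the approximation table according to the size of the database rather than copied as-is.

$CFG_N_TIED_STATES = 200;        # number of tied states (senones)
$CFG_FINAL_NUM_DENSITIES = 8;    # number of densities per state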
Training internals include, but are not limited to, computing the features of the audio files, training the context-independent models and the context-dependent untied models, building decision trees, pruning the decision trees, and finally training the context-dependent tied models.
4.6 Mobile Application
A desktop application is very different from a mobile application because of the limitations of mobile devices in storage capacity, processing speed, and other resources. When moving the application to a mobile device, the whole application should ideally be smaller while still performing similarly to the desktop version. This is a challenge, since the application needs the acoustic model, the language model, and the other linguistic models required to recognize the spoken text.

Sphinx-4 also has a mobile counterpart called PocketSphinx, which is commonly used for mobile applications that require speech recognition. It has previously been used to develop applications for the Apple iPhone [5].
5. TEST RESULTS
The proponents trained three sets of acoustic and language models using 20, 40, and 60 speakers from the FSC. The trainings were split this way to see whether accuracy would improve as the training data increased. The proponents conducted two types of testing, controlled and uncontrolled. Controlled testing used existing recordings from the speech corpus, while uncontrolled testing was done with speakers who were not part of the corpus.
In determining the accuracy of the system, the result text
generated is compared to the correct transcription of the
recording. The following formula is used to attain the accuracy rate of the system:
Accuracy Rate = (matching_words / total_words_in_transcription) × 100
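A small sketch of how this rate can be computed is given below, assuming a simple position-by-position comparison of the recognized words against the reference transcription; other alignment schemes are possible, and the proponents' exact matching procedure is not specified.

public class AccuracyRate {
    // Accuracy in percent: matching words over total words in the reference transcription, times 100.
    public static double accuracy(String recognized, String reference) {
        String[] hyp = recognized.trim().toLowerCase().split("\\s+");
        String[] ref = reference.trim().toLowerCase().split("\\s+");
        int matching = 0;
        for (int i = 0; i < ref.length && i < hyp.length; i++) {
            if (ref[i].equals(hyp[i])) {
                matching++;
            }
        }
        return 100.0 * matching / ref.length;
    }

    public static void main(String[] args) {
        // "Pauwi kalaban" against "Pauwi ka na ba": 1 of 4 words match, so 25.0 is printed.
        System.out.println(accuracy("Pauwi kalaban", "Pauwi ka na ba"));
    }
}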
Controlled testing was done for all three trained sets and the
accuracy for each speaker per set is shown in Figure 5.
Figure 5. Accuracy Comparison for 3 Sets
The accuracy for the 40-speaker set dropped, but this was because it lacked training. For the 60-speaker set, the training variables were adjusted to fit the size of the training data, which in turn gave better results than the 20-speaker set. Figure 6 shows the average accuracy rate of each set for the controlled testing; the accuracy of the 60-speaker set clearly increased compared with the 20-speaker set. The mean accuracy rate of controlled testing was 45% for the 20-speaker set, 43.25% for the 40-speaker set, and 58.32% for the 60-speaker set.
Figure 6. Average Accuracy Rate for Controlled Test
For the uncontrolled testing, the proponents built the language model from the UP Diliman transcriptions and contemporary sentences gathered from different social networking sites. The proponents gathered 20 speakers, 10 male and 10 female, to test the system with sentences commonly used in daily conversations. Using the data trained on a 120-speaker set and the new language model, the system attained an average accuracy of 69.67% and an error rate of 30.33%.
Figure 7. Accuracy and Error Rate for Uncontrolled Test - Male
Figure 8. Accuracy and Error Rate for Uncontrolled Test - Female
For better results, speakers are advised to speak in a clear, loud voice and to avoid mispronouncing words. The speaker should also speak at a slower pace to keep each word distinct from the next and avoid two different words running together. The system's accuracy also drops when too much background noise is present. Table 1 shows actual results produced by the system:
Table 1. Sample Output
Expected Output           | Actual Output
Nandito ako               | Nandito ako
Maganda ba yun palabas    | Maganda bayong palabas
Tumatawag na naman siya   | Tumatawag nanaman siya
Pauwi ka na ba            | Pauwi kalaban
5.1 Creating the Mobile Application
Attempting to port the existing desktop speech recognition application to an Android device proved to be a challenge. Since the mobile version of Sphinx-4, PocketSphinx, is not yet well documented for Android, the proponents had a hard time installing the required software and creating the application for the mobile device. Some demo-level sample applications were available online, but they were tricky to install on the mobile device.

Another challenge was the limitations of the phones the proponents had. The demo application that was downloaded and modified was too heavy for the HTC Hero: when the application was opened, it would close itself without warning. Another mobile phone available for testing was a Samsung Galaxy Ace; however, the proponents have yet to test on that device.
5.2 Improving Performance
As mentioned above, the accuracy for sentences not found in the language model was very low. The proponents are continuing research on how to improve the performance of the system. They will tackle two approaches: building a new language model from everyday conversational Filipino sentences, and training the acoustic model to be phoneme dependent.
The new language model would be built with the help of the Department of Filipino of De La Salle University. The department would advise the researchers on which sentences are considered conversational Filipino, and these sentences would be included in the language model. An additional resource of sentences that could be added to the language model is a collection of existing text messages sent in Filipino. After this language model is completed, the system would be retrained on it and tested to see whether improvement occurred.
Training the acoustic model to be phoneme dependent would help the system use letter-to-sound rules to guess the pronunciation of unknown words. These unknown words are not found in the dictionary or the transcription files, which means the system was not trained to understand them. The letter-to-sound rule would attempt to guess the pronunciation of an unknown word based on the existing phonemes and words in the dictionary. Again, testing would be conducted on the different sets of trained models to see whether improvement occurred as the training data was increased. The test results would also be compared with existing test results to see whether unknown words are really recognized.
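Under the letters-as-phones scheme already used for the dictionary (Section 4.5), one simple letter-to-sound fallback is to spell an unknown word out into its letters. The sketch below shows this idea; it is an assumption about how such a rule might be written, not the trained system's actual behavior, and it ignores Filipino digraphs such as "ng".

public class LetterToSound {
    // Guesses a pronunciation for an out-of-dictionary word by using each letter as a phone.
    public static String guessPronunciation(String word) {
        StringBuilder phones = new StringBuilder();
        for (char c : word.toUpperCase().toCharArray()) {
            if (Character.isLetter(c)) {
                if (phones.length() > 0) {
                    phones.append(' ');
                }
                phones.append(c);
            }
        }
        return phones.toString();
    }

    public static void main(String[] args) {
        // An unknown word such as "pauwi" is mapped to "P A U W I".
        System.out.println(guessPronunciation("pauwi"));
    }
}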
6. CHALLENGES ENCOUNTERED
Sphinx-4 is a very flexible system capable of performing many types of recognition tasks, and a lot of documentation is publicly available. However, since the tool was not made specifically for the Filipino language, many modifications had to be made.
The demonstration programs provided with Sphinx-4 had low accuracy. This was due to the noise and echo present during testing. The problem was remedied by switching off noise reduction and acoustic echo cancellation in the microphone's settings.
The Sphinx documentation also specifies that the recorded wave files used for training should be 16-bit, 16 kHz mono. However, the first set of recordings did not follow these specifications, and changing the sample rate of a given sound file only resulted in it being slowed down. The issue was resolved by changing the sample rate of the recording project itself instead of the individual files.
Although Sphinx-4 has been built and tested on the Solaris Operating Environment, Mac OS X, Linux, and Win32 operating systems, CMUCLMTK requires UNIX. This was remedied by using Cygwin, as recommended by the Sphinx team. The toolkit also requires line breaks to be in UNIX format; this was resolved by converting the files to UNIX line breaks using Notepad++.
The recorded messages used to train the system had little background noise. When the application is used in a normal environment with more noise and echo, the system's accuracy could drop.
The issues in creating a hands-free texting application also involve the users' styles of texting and speaking. The type of keypad a phone has, whether QWERTY or T9, is a factor in how users type their text messages. There is also a difference between how a person composes a text and the way he or she speaks in a conversation.
Another issue arising from the results is the lack of relevance of the trained language model to the messages actually sent in text messaging. Given the low accuracy rate of the uncontrolled tests, the proponents believe the language model is the main contributor to the drop in accuracy. The language model was patterned after the speech uttered by the speakers in the FSC, which consists of stories and isolated words. These are not sentences often used in everyday texting, which is why the sentences uttered by the speakers in the uncontrolled test were barely recognized by the system.
7. CONCLUSION
CMU Sphinx-4 is an effective tool for developing a desktop application that can recognize speech in the Filipino language and produce its text equivalent. The system is also able to apply simple rule-based text shortening using regular expressions to provide users with the 'text speak' equivalent of the output produced.
There are also several recommendations future developers may follow to improve the system. First, increase the data in the language model; these data may include English words, since most Filipinos do not text in plain Tagalog but mix English and Tagalog. Developers may also allow the user to place punctuation marks in sentences for better readability of the result. Other commands, such as starting and ending the recording for speech recognition, may also be added as feature enhancements. Lastly, it is recommended that the application be ported to mobile devices running different operating systems such as Android and iOS.
8. ACKNOWLEDGMENTS
The researchers would like to thank the following: (1) Mr. Danny Cheng, for being an adviser and guiding the group throughout the research; (2) the entire panel, namely Mr. Allan Borra and Mr. Clement Ong, for the remarks and suggestions given to further improve the research; and (3) the EEE Department of the University of the Philippines - Diliman, for allowing the use of their Filipino Speech Corpus (FSC) in this research.
9. REFERENCES
[1] Bonabente, C. L. (2008). Total ban sought on cell phone use while driving. Retrieved from http://newsinfo.inquirer.net/breakingnews/metro/view/20080920-
[2] CarAccidents. (2010). Philippines Crash Accidents, Driving, Car, Manila Auto Crashes Pictures, Statistics, Info. Retrieved from http://www.car-accidents.com/country-car-accidents/philippines-car-accidents.html
[3] Vlingo Incorporated. (2010). Voice to text applications powered by intelligent voice recognition. Retrieved from http://www.vlingo.com
[4] AdelaVoice Corporation. (2010). StartTalking. Retrieved from http://www.adelavoice.com/starttalking.php
[5] CMU Sphinx. (2010). CMU Sphinx - Speech Recognition Toolkit. Retrieved from http://cmusphinx.sourceforge.net
[6] Gonzalez, A. (1998). The Language Planning Situation in the Philippines. Journal of Multilingual and Multicultural Development, 55, 5-6.
[7] BBC h2g2. (2002). Writing Text Messages. Retrieved from http://www.bbc.co.uk/dna/h2g2/A70091
[8] CMU Sphinx Group. (2011). CMU Sphinx. Retrieved from http://cmusphinx.sourceforge.net/sphinx4/
[9] Cu, J., Ilao, J., Ong, E. (2010). O-COCOSDA 2010 Philippine Country Report. Retrieved from http://desceco.org/O-COCOSDA2010/o-cocosda2010-abstract.pdf