Tech Enthusiast Magazine
April 2007
Is Worf a native speaker of English?
SPEAKING IN [BINARY] TONGUES
How speech recognition can let us
talk to our computers (or at least try to)
By Matthew Bruchon
I grew up watching Star Trek, and I didn't concern myself too much with the faster-than-light travel, the loud, fiery explosions in the vacuum of space, or many of the show's other scientific impossibilities. One question always did bother me, though: how did the Klingons and Ferengis and Vulcans all come to speak English so well? Eventually, I discovered that the show's writers had created a device called the Universal Translator, a tiny computer that processes speech from all languages, known or unknown, and converts it to the user's native tongue.
Back on modern-day Earth, our computers can't master the Universal Translator's simplest aspect: the ability to hear a person's voice in a familiar language, and to figure out what words are being said. That is the fundamental goal of the speech recognition systems we have today. We don't yet know how to make computers understand 100% of our speech, and much of the time, we might as well be speaking in tongues. Until the human race learns how to speak in binary machine code, there will be a need for improved speech recognition systems.
THE PROMISE OF TOMORROW
The reasons for wanting our computers to be able to understand our voices are seemingly endless. Many of those reasons stem from the fact that all of us, except maybe the very most skilled typists, can speak more quickly than we can type. Nuance Technologies, a company specializing in speech recognition, estimates in its marketing materials that most people speak more than 120 words per minute but type fewer than 40 words per minute.
I recently learned just how realistic that number is. I was trying to transcribe the recording of an interview I had just conducted with Dr. David McAllister, a computer science professor at North Carolina State University in Raleigh. Dr. McAllister is, among many other things, part of a research team doing work in computerized speech processing. When transcribing the interview, I found myself needing to pause the recording every ten seconds or so, sometimes rewinding to re-listen to words I'd missed. My fingers simply could not keep up with the pace of his voice.
The problem wasn't the speed or the clarity of his voice. His evenly measured baritone was no more rapidly spoken than the average person's, and his syllables were clearly articulated. And I'd like to think the problem wasn't my typing abilities. In high school I took a typing class, and my keyboard proficiency has been shaped by years of instant messaging and web surfing. The issue was the basic fact that our hands are a clumsy way to convert our thoughts into a readable form. Our voices, on the other hand, are like a wormhole leap straight from Star Trek, a direct portal from our brains to the outside world. If a computer could have automatically converted Dr. McAllister's voice into text for me, the process would have taken much less time on my part. Looking at society as a whole, similar scenarios are plentiful. Transcriptions of medical and legal information, for example, are currently very time-consuming, and could be made much more efficient with the use of speech recognition. And time is money, of course.
Many of us are familiar with at least a few everyday conveniences provided by speech recognition. Most cell phones include features like voice dialing and the ability to answer calls with a voice command, even my Kyocera KX1, the cheapest phone available with my wireless plan. Many telephone menus for things like customer service can now be navigated by voice command as an alternative to button presses. And most in-car GPS systems can be commanded by voice.
But imagine being able to use your voice not only to ask for directions, but also to actually drive your car. (If nothing else, it would mean people finally would stop talking on their cell phones while they drove!) That's probably out of the question for the near, and not-so-near, future. But there are some pretty neat advances that aren't so far down the pipeline. This is perhaps most evident in the area of personal computing. Imagine being able to control your computer's every action by voice command, for example. You wouldn't have to use the keyboard and mouse, forever the bane of heavy computer users and carpal tunnel syndrome sufferers, two groups that go hand in hand (pun intended). You'd also be freed from your desk, and could get things done from the other side of the room if you wanted.
THE REALITY OF TODAY
In fact, speech recognition systems are already being used for personal computing. One group that relies on these systems is the population of disabled people who can't use their hands to type or to move the mouse. Dr. McAllister's own neighbor, for example, suffers from hand muscle atrophy and uses speech recognition
software regularly. "He talks to his computer and has it do things for him," says McAllister. "He uses it to create email and other messages, and stuff like that works very well. It's not always perfect, but it's much better than you would think."
His neighbor uses a standalone program called Dragon NaturallySpeaking, produced by Nuance Technologies. It's the world's bestselling speech recognition software for
professional use. It has existed in various forms since 1990, when a DOS-based version was made available for $9,000. That version required the user to pause between every word, to help it identify word boundaries. The latest version of NaturallySpeaking retails for $99.99, allows the user to speak in their normal, casual fashion, and advertises up to 99% accuracy.
I decided to try the program for myself. Luckily, the N.C. State library's Assistive Technologies Center had a copy of the program available for me to try out. Getting started with the program was a very simple process: I just put on a headset with a microphone attached, opened the program, and started talking. There is an option to set up a new profile and train the program to understand your voice, a process that takes roughly 30 minutes depending on how thorough you choose to be. I chose to skip that step, because one of the latest version's selling points is that NaturallySpeaking requires no training, so you can get started dictating right away. Armed with several pages' worth of test materials ranging from tongue twisters to Shakespeare monologues, I began to recite in a natural, perhaps slightly more carefully articulated voice.
Dragon NaturallySpeaking's Accuracy: A Sampling

When I said: "Peter Piper picked a peck of pickled peppers"
NaturallySpeaking recorded: "Haircut or effect of takeover tactics"

When I said: "One small step for man, one giant leap for mankind"
NaturallySpeaking recorded: "Was offset from them, when I believe in mankind"

When I said: "To be or not to be: that is the question"
NaturallySpeaking recorded: "To be order not the: man is the question"

When I said: "NaturallySpeaking is the greatest piece of software"
NaturallySpeaking recorded: "NaturallySpeaking is the greatest piece of software"

As the figure above shows, the results of my trial were decidedly mixed. I measured my average voice dictation speed to be roughly 200 words per minute (I average about 60 when typing), but I can't say the improved speed fully made up for the errors. To be fair, the examples I chose are some of the worst. Realistically, the dictation averaged about one or two errors per sentence. And I could see a moderate amount of improvement as my trial progressed.
Closing a browser window with the Mousegrid:
Say "Mousegrid" to show a 3x3 grid.
Say "One" to pick the upper left box.
Say "Four" to move to the File menu.
Say "Click" to click at that position.
Saying "Close" closes the window.
I was learning how to use the program (using keywords to dictate commas and periods, for example), and as I corrected its errors, it was beginning to train itself to my voice. It's probably safe to say the results would have been much more agreeable if I'd trained the program for a period of days or weeks, just as any serious user of the program would. (McAllister's neighbor had done this, of course.)
Another feature of NaturallySpeaking is the ability to control the mouse by voice. This is accomplished by something called the Mousegrid, which divides the screen into increasingly small numbered rectangles and moves the mouse into the rectangle you command it to. The figure above demonstrates how I used the Mousegrid to close a browser window. It was easy enough to use, and for someone who can't use a mouse it would be an essential feature. However, it takes the computer a moment to render each grid onto the screen, and it was necessary to pause a bit between words. It took a total of approximately five seconds for me to close the window. That may not sound very long, but closing a window using the mouse itself takes under a second.
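To make the idea concrete, here is a rough sketch in Python of how a Mousegrid-style sequence of spoken numbers narrows in on a screen position. The screen size, the function name, and the "click the center of the final cell" behavior are my own assumptions for illustration, not Nuance's actual code.

```python
# A rough sketch (not Nuance's implementation) of Mousegrid-style targeting.
# Each spoken number picks one cell of a 3x3 grid, and the grid shrinks to
# that cell; repeating the process homes in on a precise screen position.

def mousegrid(commands, width=1280, height=1024):
    """Return the (x, y) pixel the pointer ends up on.

    commands: spoken grid numbers, 1-9, read left to right and top to
              bottom, so 1 is the upper-left cell and 9 the lower-right.
    """
    left, top = 0.0, 0.0
    right, bottom = float(width), float(height)
    for number in commands:
        row, col = divmod(number - 1, 3)       # which of the 9 cells was named
        cell_w = (right - left) / 3
        cell_h = (bottom - top) / 3
        left, top = left + col * cell_w, top + row * cell_h
        right, bottom = left + cell_w, top + cell_h
    # Click in the middle of whatever cell we ended on.
    return (left + right) / 2, (top + bottom) / 2

# "Mousegrid... One... Four" from the figure above: the upper-left cell,
# then the left-hand cell one row down, roughly where a File menu sits.
print(mousegrid([1, 4]))
```

Each extra spoken number shrinks the target area by a factor of nine, which is why only a few commands are needed, but also why each one costs a pause while the new grid is drawn.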
NaturallySpeaking is the most
widely used standalone speech
recognition program, but many
personal computers are sold with a
speech recognition system built in. Microsoft Office XP is bundled with a speech recognition engine (though it isn't installed by default), and it is a standard feature in Microsoft's Vista and Apple's Tiger OS.
Given that Vista was just released at the end of last year, the jury is still out on the quality of its built-in speech recognition. Extremetech.com reviewed it thoroughly, and concluded that while it isn't perfect, it becomes so accurate that it's a joy to use given enough training.

The technology's rollout at a public demo was, for the most part, successful, but the public's perception of it was largely shaped by one embarrassing moment that spread virally throughout the blogosphere and even network TV news. When the presenter of the demo, who had presumably trained with the speech recognition software in advance, tried to write a "Dear Mom" letter by voice, the speech engine produced "Dear aunt," and his repeated attempts to delete the error were misunderstood. The final product was a pathetic "Dear aunt, let's set so double the killer delete select all." "I think it's picking up a little bit of echo," the flustered presenter said, to the audience's laughter. Recovering from that PR nightmare may take a while.
The speech engine in Tiger OS is, for the most part, unchanged from previous releases of Apple's OS X. A blogger at systemsboy.blogspot.com said the speech engine often froze, and that it was overly sensitive to noise: "Heaven help you if you're eating a burrito while you want to use speech control." One at crunchgear.com reported that Apple's voice recognition is "an afterthought at best and cripple-ware at worst." The same blogger pointed out that, as shown below, setting up the OS X speech engine isn't practical without using a mouse, which would be a problem for the disabled.

A mouse is needed to configure the Mac OS speech engine setup screen.
In defense of the speech engines found in Vista and Tiger, the bulk of the complaints seem to deal more with their initial setup and with controlling applications through voice. I came across relatively few frustrated users of the basic dictation feature, which is still the most widely used feature of the speech engines and their bread and butter. For that purpose, at the very least, the speech engines perform well given enough training.
RESEARCHING SPEECH
I spoke with Dr. McAllister to learn more about the science behind speech processing and what's holding it back from working perfectly.
McAllister's research career was already well underway when he entered the area of speech processing. Since early in his career, much of his research has dealt with stereo computer graphics and three-dimensional imaging. One of his projects, for instance, was to help the Defense Mapping Agency process its warehouses full of high-altitude photographs, McAllister says, and "provide elevation values for every place on the earth." Over time, he became a highly regarded expert in the field, publishing two books in the area. His involvement in 3-D imaging continues to this day.
Dr. David McAllister in his office at N.C. State University

McAllister became involved in speech processing during a project related to lip synching, the matching of lip movements to speech. The project, he says, used "filtering, sophisticated techniques and signal processing
which had not been applied to tell what a person was saying." These complex methods were used to process speech signals and produce a computer animation of them being spoken. Such a method was of interest to video game and movie animation companies, for example. New to the area of signal processing at the time, McAllister played the role of graduate student for a while.
After that, McAllister and his research partners realized their new signal processing techniques could be used for an entirely different type of speech processing, called speaker recognition. Unlike speech recognition, which seeks to identify the words being spoken, speaker recognition is concerned with identifying the speaker. Many of the underlying problems are shared between the two areas, but the majority of McAllister's speech processing experience is in speaker recognition. There are many uses for speaker recognition technology, including criminal justice and security.
The plots above are from a 2002 paper written by McAllister and four colleagues at N.C. State. The paper uses a complex mathematical technique to model the speaker's voice in two dimensions, as shown on the plots. Even without understanding exactly what the plots mean, it's easy to see that the two left plots are much more similar than the other two, because they're the same speaker.
Much of the research being done in speaker recognition deals with criminal justice, and is being subsidized by the government. "It is of interest for the FBI, for instance, to be able to identify people who have issued bomb threats over the telephone," says McAllister, "and lawyers would like to be able to establish that either a person did or didn't say certain things on the telephone." In cases in which it's known for a fact that the speaker is a member of a given group of people, a situation called a closed set problem, the speaker can be identified at a forensic quality of 95% or more, given enough
voice samples. But in many criminal justice situations, where the speaker (or the suspect, as the case may be) could be a member of that group or not, an open set problem, there has been much less success in determining the speaker. "There's a lot of trouble in making such conclusions with enough accuracy that it would stand up in court," says McAllister.
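The distinction is easy to see in a toy sketch. The Python below is my own illustration, not the N.C. State team's method, and the "voiceprints" in it are made-up two-number summaries: in the closed set case the system only has to pick the nearest known speaker, while in the open set case it also has to decide whether that nearest match is close enough to trust.

```python
# A toy illustration of closed-set versus open-set speaker identification.
# Real systems compare rich acoustic features; here each speaker is just a
# made-up point so the logic of the two problems stands out.

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical "voiceprints" for a known group of suspects.
known_speakers = {
    "speaker_A": [1.0, 4.2],
    "speaker_B": [3.5, 0.8],
    "speaker_C": [5.1, 5.0],
}

def identify_closed_set(sample):
    # Closed set: the caller is definitely one of the known speakers,
    # so simply pick whoever is closest.
    return min(known_speakers, key=lambda name: distance(sample, known_speakers[name]))

def identify_open_set(sample, threshold=1.5):
    # Open set: the caller may be nobody on the list, so the closest
    # match must also be close *enough*, which is the hard part.
    best = identify_closed_set(sample)
    if distance(sample, known_speakers[best]) > threshold:
        return None  # unknown speaker
    return best

print(identify_closed_set([3.3, 1.0]))  # -> speaker_B
print(identify_open_set([9.0, 9.0]))    # -> None: too far from everyone
```

Choosing that threshold, and defending it in court, is exactly where the open set problem gets difficult.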
Many of the obstacles that plague speaker recognition are shared by speech recognition. For example, McAllister says two common problems in speaker recognition are a shortage of samples and a speech signal with a lot of noise. Similarly, it's common that a voice is disguised, either intentionally or by accident: a bomb threat caller might speak in falsetto or in a fake accent, for example, or the speaker could have laryngitis. And one perpetual problem in speech processing is the finite amount of computing power available. As McAllister puts it, "The machines are becoming faster and we can crunch numbers faster, and the algorithms can get more complicated. The problem is, you want to be able to operate in real time. You could do lots of things if you aren't in a hurry that you can't do if you want information now."
One unique aspect of the speech processing field is its multidisciplinary nature. McAllister's specialty is mathematics; he's "a flunky numerical analyst," he jokes. Dr. Robert Rodman, one of his closest research partners, is a computational linguist, and Dr. Donald Bitzer, another member of the speech processing team, is a signal processing expert. They're three very different specialties, but McAllister says that all three of them fit together quite nicely.
The future of speech processing holds plenty of room for more research, and more progress. Aside from the issues of identifying words and speakers, for example, there's the problem of dividing sentences properly. Beyond that, there's an even more complex issue, one that McAllister says still needs a lot of research: how a computer can figure out what a sequence of words means, and whether it's gibberish or not.
To address these complex problems, some of the same methods can be used. For instance, one approach computers use is to look at common acoustic features of voices and sounds; a similar approach could be used by analyzing common features of words and sentences. "Feature extraction is a problem," McAllister says. "What are the features that you want? Can you reduce the number of features that matter? And how do you use the features to group individuals into categories?"
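As a rough illustration of those questions, the sketch below is my own toy example, using synthetic tones rather than real speech: it computes two candidate features of a signal and shows that only one of them actually helps tell a lower voice from a higher one, which is the kind of pruning McAllister describes.

```python
# A toy sketch of the feature extraction problem: pick a few numbers that
# summarize a signal, then see which of them actually separate categories.
# The "audio" is a synthetic sine tone; real systems use richer features.

import math

def make_tone(freq, seconds=1.0, rate=8000):
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(int(rate * seconds))]

def extract_features(signal, rate=8000):
    # Average energy of the signal.
    energy = sum(x * x for x in signal) / len(signal)
    # Zero-crossing rate rises with pitch, so it roughly tracks frequency.
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return {"energy": energy, "zero_crossing_rate": crossings / (len(signal) / rate)}

low_voice = extract_features(make_tone(120))   # roughly a low speaking pitch
high_voice = extract_features(make_tone(220))  # roughly a higher speaking pitch

# Energy comes out nearly identical for both tones, so it would not separate
# these two "speakers"; the zero-crossing rate does.
print(low_voice)
print(high_voice)
```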
But until these kinds of high-level problems are solved, he says, we might have to rephrase something repeatedly until the computer understands what is being asked of it. These problems are large enough to make some of the current bugs and inconveniences in speech recognition systems seem trivial by comparison.
WHAT TO EXPECT
It's clear that some uses of speech recognition are more realistic in the near future than others. We probably can expect more speech systems that help make our lives more convenient, as in the case of hands-free computer use. It's been demonstrated that under the right conditions, that sort of thing can be done at a high level of reliability.

But until that reliability goes from high to perfect, we can't expect to see things that rely on speech processing, only ones that use it as a supplement. Imagine if your voice were used to log into your computer instead of a password. What if you had a sore throat and couldn't log in at all? It's safe to say we'll all own keyboards for the foreseeable future, even if we might not be typing on them quite as often.
If the Universal Translator only worked 90% (or even 99%) of the time, the Star Trek shows would be more dramatic, to say the least. It's probably safe to say at least a few intergalactic wars would've been caused when a word or two got misinterpreted. Fortunately, it should be a while before we start running into Klingons or Ferengis, and there's plenty of time to get our Universal Translators ready for that day.