Tech Enthusiast Magazine
April 2007
Is Worf a native speaker of English?
SPEAKING IN [BINARY] TONGUES
How speech recognition can let us
talk to our computers (or at least try to)
By Matthew Bruchon
I grew up watching Star Trek, and I didn't concern myself too much with the faster-than-light travel, the loud, fiery explosions in the vacuum of space, or many of the show's other scientific impossibilities. One question always did bother me, though: how did the Klingons and Ferengis and Vulcans all come to speak English so well? Eventually, I discovered that the show's writers had created a device called the Universal Translator, a tiny computer that processes speech from all languages, known or unknown, and converts it to the user's native tongue.
Back on modern-day Earth, our computers can't master the Universal Translator's simplest aspect: the ability to hear a person's voice in a familiar language, and to figure out what words are being said. That is the fundamental goal of the speech recognition systems we have today. We don't yet know how to make computers understand 100% of our speech, and much of the time, we might as well be speaking in tongues. Until the human race learns how to speak in binary machine code, there will be a need for improved speech recognition systems.
THE PROMISE OF TOMORROW
The reasons for wanting our computers to be able to understand our voices are seemingly endless. Many of those reasons stem from the fact that all of us, except maybe the very most skilled typists, can speak more quickly than we can type. Nuance Technologies, a company specializing in speech recognition, estimates in its marketing materials that most people speak more than 120 words per minute but type fewer than 40 words per minute.
I recently learned just how realistic that number is. I was trying to transcribe the recording of an interview I had just conducted with Dr. David McAllister, a computer science professor at North Carolina State University in Raleigh. Dr. McAllister is, among many other things, part of a research team doing work in computerized speech processing. When transcribing the interview, I found myself needing to pause the recording every ten seconds or so, sometimes rewinding to re-listen to words I'd missed. My fingers simply could not keep up with the pace of his voice.
The problem wasn't the speed or the clarity of his voice. His evenly measured baritone was no more rapidly spoken than the average person's, and his syllables were clearly articulated. And I'd like to think the problem wasn't my typing abilities. In high school I took a typing class, and my keyboard proficiency has been shaped by years of instant messaging and web surfing. The issue was the basic fact that our hands are a clumsy way to convert our thoughts into a readable form. Our voices, on the other hand, are like a wormhole leap straight from Star Trek, a direct portal from our brains to the outside world. If a computer could have automatically converted Dr. McAllister's voice into text for me, the process would have taken much less time on my part. Looking at society as a whole, similar scenarios are plentiful. Transcriptions of medical and legal information, for example, are currently very time-consuming, and could be made much more efficient with the use of speech recognition. And time is money, of course.
Many of us are familiar with at least a few everyday conveniences provided by speech recognition. Most cell phones include features like voice dialing and the ability to answer calls with a voice command, even my Kyocera KX1, the cheapest phone available with my wireless plan. Many telephone menus for things like customer service can now be navigated by voice command as an alternative to button presses. And most in-car GPS systems can be commanded by voice.
But imagine being able to use your voice not only to ask for directions, but also to actually drive your car. (If nothing else, it would mean people finally would stop talking on their cell phones while they drove!) That's probably out of the question for the near, and not-so-near, future. But there are some pretty neat advances that aren't so far down the pipeline. This is perhaps most evident in the area of personal computing. Imagine being able to control your computer's every action by voice command, for example. You wouldn't have to use the keyboard and mouse, forever the bane of heavy computer users and carpal tunnel syndrome sufferers, two groups that go hand in hand (pun intended). You'd also be freed from your desk, and could get things done from the other side of the room if you wanted.
THE REALITY OF TODAY
In fact, speech recognition systems are already being used for personal computing. One group that relies on these systems is the population of disabled people who can't use their hands to type or to move the mouse. Dr. McAllister's own neighbor, for example, suffers from hand muscle atrophy and uses speech recognition
software regularly. "He talks to his computer and has it do things for him," says McAllister. "He uses it to create email and other messages, and stuff like that works very well. It's not always perfect, but it's much better than you would think."
His neighbor uses a standalone program called Dragon NaturallySpeaking, produced by Nuance Technologies. It's the world's bestselling speech recognition software for
professional use. It has existed in various forms since 1990, when a DOS-based version was made available for $9,000. That version required the user to pause between every word, to help it identify word boundaries. The latest version of NaturallySpeaking retails for $99.99, allows the user to speak in their normal, casual fashion, and advertises up to 99% accuracy.
I decided to try the program for myself. Luckily, the N.C. State library's Assistive Technologies Center had a copy of the program available for me to try out. Getting started with the program was a very simple process: I just put on a headset with a microphone attached, opened the program, and started talking. There is an option to set up a new profile and train the program to understand your voice, a process that takes roughly 30 minutes depending on how thorough you choose to be. I chose to skip that step, because one of the latest version's selling points is that NaturallySpeaking requires no training, so you can get started dictating right away. Armed with several pages' worth of test materials ranging from tongue twisters to Shakespeare monologues, I began to recite in a natural, perhaps slightly more carefully articulated voice.
Dragon NaturallySpeaking's Accuracy: A Sampling

When I said: "Peter Piper picked a peck of pickled peppers"
NaturallySpeaking recorded: "Haircut or effect of takeover tactics"

When I said: "One small step for man, one giant leap for mankind"
NaturallySpeaking recorded: "Was offset from them, when I believe in mankind"

When I said: "To be or not to be: that is the question"
NaturallySpeaking recorded: "To be order not the: man is the question"

When I said: "NaturallySpeaking is the greatest piece of software"
NaturallySpeaking recorded: "NaturallySpeaking is the greatest piece of software"

As the figure above shows, the results of my trial were decidedly mixed. I measured my average voice dictation speed to be roughly 200 words per minute (I average about 60 when typing), but I can't say the improved speed fully made up for the errors. To be fair, the examples I chose are some of the worst. Realistically, the dictation averaged about one or two errors per sentence. And I could see a moderate amount of improvement as my trial progressed.
Closing a browser window with the Mousegrid:
Say "Mousegrid" to show a 3x3 grid.
Say "One" to pick the upper left box.
Say "Four" to move to the File menu.
Say "Click" to click at that position.
Saying "Close" closes the window.
I was learning how to use the program (using keywords to dictate commas and periods, for example), and as I corrected its errors, it was beginning to train itself to my voice. It's probably safe to say the results would have been much more agreeable if I'd trained the program for a period of days or weeks, just as any serious user of the program would. (McAllister's neighbor had done this, of course.)
Another feature of NaturallySpeaking is the ability to control the mouse by voice. This is accomplished by something called the Mousegrid, which divides the screen into increasingly small numbered rectangles and moves the mouse into the rectangle you command it to. The figure above demonstrates how I used the Mousegrid to close a browser window. It was easy enough to use, and for someone who can't use a mouse it would be an essential feature. However, it takes the computer a moment to render each grid onto the screen, and it was necessary to pause a bit between words. It took a total of approximately five seconds for me to close the window. That may not sound very long, but closing a window using the mouse itself takes under a second.
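To make the idea concrete, here is a rough sketch in Python of how a Mousegrid-style sequence of spoken numbers narrows in on a screen position. The screen size, the function name, and the "click the center of the final cell" behavior are my own assumptions for illustration, not Nuance's actual code.

```python
# A rough sketch (not Nuance's implementation) of Mousegrid-style targeting.
# Each spoken number picks one cell of a 3x3 grid, and the grid shrinks to
# that cell; repeating the process homes in on a precise screen position.

def mousegrid(commands, width=1280, height=1024):
    """Return the (x, y) pixel the pointer ends up on.

    commands: spoken grid numbers, 1-9, read left to right and top to
              bottom, so 1 is the upper-left cell and 9 the lower-right.
    """
    left, top = 0.0, 0.0
    right, bottom = float(width), float(height)
    for number in commands:
        row, col = divmod(number - 1, 3)       # which of the 9 cells was named
        cell_w = (right - left) / 3
        cell_h = (bottom - top) / 3
        left, top = left + col * cell_w, top + row * cell_h
        right, bottom = left + cell_w, top + cell_h
    # Click in the middle of whatever cell we ended on.
    return (left + right) / 2, (top + bottom) / 2

# "Mousegrid... One... Four" from the figure above: the upper-left cell,
# then the left-hand cell one row down, roughly where a File menu sits.
print(mousegrid([1, 4]))
```

Each extra spoken number shrinks the target area by a factor of nine, which is why only a few commands are needed, but also why each one costs a pause while the new grid is drawn.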
NaturallySpeaking is the most
widely used standalone speech
recognition program, but many
personal computers are sold with a
speech recognition system built in. Microsoft Office XP is bundled with a speech recognition engine (though it isn't installed by default), and it is a standard feature in Microsoft's Vista and Apple's Tiger OS.
Given that Vista was just released at the end of last year, the jury is still out on the quality of its built-in speech recognition. Extremetech.com reviewed it thoroughly, and concluded that while it isn't perfect, it becomes so accurate that it's a joy to use given enough training.

The technology's rollout at a public demo was, for the most part, successful, but the public's perception of it was largely shaped by one embarrassing moment that spread virally throughout the blogosphere and even network TV news. When the presenter of the demo, who had presumably trained with the speech recognition software in advance, tried to write a "Dear Mom" letter by voice, the speech engine produced "Dear aunt," and his repeated attempts to delete the error were misunderstood. The final product was a pathetic "Dear aunt, let's set so double the killer delete select all." "I think it's picking up a little bit of echo," the flustered presenter said, to the audience's laughter. Recovering from that PR nightmare may take a while.
The speech engine in Tiger OS is, for the most part, unchanged from previous releases of Apple's OS X. A blogger at systemsboy.blogspot.com said the speech engine often froze, and that it was overly sensitive to noise: "Heaven help you if you're eating a burrito while you want to use speech control." One at crunchgear.com reported that Apple's voice recognition is "an afterthought at best and cripple-ware at worst." The same blogger pointed out that, as shown below, setting up the OS X speech engine isn't practical without using a mouse, which would be a problem for the disabled.

A mouse is needed to configure the Mac OS speech engine setup screen.
In defense of the speech engines found in Vista and Tiger, the bulk of the complaints seem to deal more with their initial setup and with controlling applications through voice. I came across relatively few frustrated users of the basic dictation feature, which is still the most widely used feature of the speech engines and their bread and butter. For that purpose, at the very least, the speech engines perform well given enough training.
RESEARCHING SPEECH
I spoke with Dr. McAllister to learn more about the science behind speech processing and what's holding it back from working perfectly.
McAllister's research career was already well underway when he entered the area of speech processing. Since early in his career, much of his research has dealt with stereo computer graphics and three-dimensional imaging. One of his projects, for instance, was to help the Defense Mapping Agency process its warehouses full of high-altitude photographs, McAllister says, and "provide elevation values for every place on the earth." Over time, he became a highly regarded expert in the field, publishing two books in the area. His involvement in 3-D imaging continues to this day.
Dr. David McAllister in his office at N.C. State University

McAllister became involved in speech processing during a project related to lip synching, the matching of lip movements to speech. The project, he says, used "filtering, sophisticated techniques and signal processing
which had not been applied to tell what a person was saying." These complex methods were used to process speech signals and produce a computer animation of them being spoken. Such a method was of interest to video game and movie animation companies, for example. New to the area of signal processing at the time, McAllister played the role of graduate student for a while.
After that, McAllister and his research partners realized their new signal processing techniques could be used for an entirely different type of speech processing, called speaker recognition. Unlike speech recognition, which seeks to identify the words being spoken, speaker recognition is concerned with identifying the speaker. Many of the underlying problems are shared between the two areas, but the majority of McAllister's speech processing experience is in speaker recognition. There are many uses for speaker recognition technology, including criminal justice and security.
The plots above are from a 2002 paper written by McAllister and four colleagues at N.C. State. The paper uses a complex mathematical technique to model the speaker's voice in two dimensions, as shown on the plots. Even without understanding exactly what the plots mean, it's easy to see that the two left plots are much more similar than the other two, because they're the same speaker.
Much of the research being done in speaker recognition deals with criminal justice, and is being subsidized by the government. "It is of interest for the FBI, for instance, to be able to identify people who have issued bomb threats over the telephone," says McAllister, "and lawyers would like to be able to establish that either a person did or didn't say certain things on the telephone." In cases in which it's known for a fact that the speaker is a member of a given group of people, a situation called a closed set problem, the speaker can be identified at a forensic quality of 95% or more, given enough
voice samples. But in many criminal justice situations, where the speaker (or the suspect, as the case may be) could be a member of that group or not, an open set problem, there has been much less success in determining the speaker. "There's a lot of trouble in making such conclusions with enough accuracy that it would stand up in court," says McAllister.
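The distinction is easy to see in a toy sketch. The Python below is my own illustration, not the N.C. State team's method, and the "voiceprints" in it are made-up two-number summaries: in the closed set case the system only has to pick the nearest known speaker, while in the open set case it also has to decide whether that nearest match is close enough to trust.

```python
# A toy illustration of closed-set versus open-set speaker identification.
# Real systems compare rich acoustic features; here each speaker is just a
# made-up point so the logic of the two problems stands out.

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical "voiceprints" for a known group of suspects.
known_speakers = {
    "speaker_A": [1.0, 4.2],
    "speaker_B": [3.5, 0.8],
    "speaker_C": [5.1, 5.0],
}

def identify_closed_set(sample):
    # Closed set: the caller is definitely one of the known speakers,
    # so simply pick whoever is closest.
    return min(known_speakers, key=lambda name: distance(sample, known_speakers[name]))

def identify_open_set(sample, threshold=1.5):
    # Open set: the caller may be nobody on the list, so the closest
    # match must also be close *enough*, which is the hard part.
    best = identify_closed_set(sample)
    if distance(sample, known_speakers[best]) > threshold:
        return None  # unknown speaker
    return best

print(identify_closed_set([3.3, 1.0]))  # -> speaker_B
print(identify_open_set([9.0, 9.0]))    # -> None: too far from everyone
```

Choosing that threshold, and defending it in court, is exactly where the open set problem gets difficult.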
Many of the obstacles that plague speaker recognition are shared by speech recognition. For example, McAllister says two common problems in speaker recognition are a shortage of samples and a speech signal with a lot of noise. Similarly, it's common that a voice is disguised, either intentionally or by accident: a bomb threat caller might speak in falsetto or in a fake accent, for example, or the speaker could have laryngitis. And one perpetual problem in speech processing is the finite amount of computing power available. As McAllister puts it, "The machines are becoming faster and we can crunch numbers faster, and the algorithms can get more complicated. The problem is, you want to be able to operate in real time. You could do lots of things if you aren't in a hurry that you can't do if you want information now."
One unique aspect of the speech processing field is its multidisciplinary nature. McAllister's specialty is mathematics; he's "a flunky numerical analyst," he jokes. Dr. Robert Rodman, one of his closest research partners, is a computational linguist, and Dr. Donald Bitzer, another member of the speech processing team, is a signal processing expert. They're three very different specialties, but McAllister says that all three of them fit together quite nicely.
The future of speech processing holds plenty of room for more research, and more progress. Aside from the issues of identifying words and speakers, for example, there's the problem of dividing sentences properly. Beyond that, there's an even more complex issue, one that McAllister says still needs a lot of research: how a computer can figure out what a sequence of words means, and whether it's gibberish or not.
To address these complex problems, some of the same methods can be used. For instance, one approach computers use is to look at common acoustic features of voices and sounds; a similar approach could be used by analyzing common features of words and sentences. "Feature extraction is a problem," McAllister says. "What are the features that you want? Can you reduce the number of features that matter? And how do you use the features to group individuals into categories?"
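As a rough illustration of those questions, the sketch below is my own toy example, using synthetic tones rather than real speech: it computes two candidate features of a signal and shows that only one of them actually helps tell a lower voice from a higher one, which is the kind of pruning McAllister describes.

```python
# A toy sketch of the feature extraction problem: pick a few numbers that
# summarize a signal, then see which of them actually separate categories.
# The "audio" is a synthetic sine tone; real systems use richer features.

import math

def make_tone(freq, seconds=1.0, rate=8000):
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(int(rate * seconds))]

def extract_features(signal, rate=8000):
    # Average energy of the signal.
    energy = sum(x * x for x in signal) / len(signal)
    # Zero-crossing rate rises with pitch, so it roughly tracks frequency.
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return {"energy": energy, "zero_crossing_rate": crossings / (len(signal) / rate)}

low_voice = extract_features(make_tone(120))   # roughly a low speaking pitch
high_voice = extract_features(make_tone(220))  # roughly a higher speaking pitch

# Energy comes out nearly identical for both tones, so it would not separate
# these two "speakers"; the zero-crossing rate does.
print(low_voice)
print(high_voice)
```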
But until these kinds of high-level problems are solved, he says, we might have to rephrase something repeatedly until the computer understands what is being asked of it. These problems are large enough to make some of the current bugs and inconveniences in speech recognition systems seem trivial by comparison.
WHAT TO EXPECT
It's clear that some uses of speech recognition are more realistic in the near future than others. We probably can expect more speech systems that help make our lives more convenient, as in the case of hands-free computer use. It's been demonstrated that under the right conditions, that sort of thing can be done at a high level of reliability.

But until that reliability goes from high to perfect, we can't expect to see things that rely on speech processing, only ones that use it as a supplement. Imagine if your voice were used to log into your computer instead of a password. What if you had a sore throat and couldn't log in at all? It's safe to say we'll all own keyboards for the foreseeable future, even if we might not be typing on them quite as often.
If the Universal Translator only worked 90% (or even 99%) of the time, the Star Trek shows would be more dramatic, to say the least. It's probably safe to say at least a few intergalactic wars would've been caused when a word or two got misinterpreted. Fortunately, it should be a while before we start running into Klingons or Ferengis, and there's plenty of time to get our Universal Translators ready for that day.