Speech Recognition Article (2007)

Tech Enthusiast Magazine
April 2007

Is Worf a native speaker of English?

SPEAKING IN [BINARY] TONGUES

How speech recognition can let us talk to our computers (or at least try to)

By Matthew Bruchon

I grew up watching Star Trek, and I didn't concern myself too much with the faster-than-light-speed travel, the loud, fiery explosions in the vacuum of space, or many of the show's other scientific impossibilities. One question always did bother me, though: how did the Klingons and Ferengis and Vulcans all come to speak English so well? Eventually, I discovered that the show's writers had created a device called the Universal Translator, a tiny computer that processes speech from all languages, known or unknown, and converts it to the user's native tongue.

Back on modern-day Earth, our computers can't master the Universal Translator's simplest aspect: the ability to hear a person's voice in a familiar language, and to figure out what words are being said. That is the fundamental goal of the speech recognition systems we have today. We don't yet know how to make computers understand 100% of our speech, and much of the time, we might as well be speaking in tongues. Until the human race learns how to speak in binary machine code, there will be a need for improved speech recognition systems.


    THE PROMISE OF TOMORROW

The reasons for wanting our computers to be able to understand our voices are seemingly endless. Many of those reasons stem from the fact that all of us, except maybe the very most skilled typists, can speak more quickly than we can type. Nuance Communications, a company specializing in speech recognition, estimates in its marketing materials that most people speak more than 120 words per minute, but type fewer than 40 words per minute.

I recently learned just how realistic that number is. I was trying to transcribe the recording of an interview I had just conducted with Dr. David McAllister, Computer Science Professor at North Carolina State University in Raleigh. Dr. McAllister is, among many other things, part of a research team doing work in computerized speech processing. When transcribing the interview, I found myself needing to pause the recording every ten seconds or so, sometimes rewinding to re-listen to words I'd missed. My fingers simply could not keep up with the pace of his voice.

The problem wasn't the speed or the clarity of his voice. His evenly-measured baritone was no more rapidly spoken than the average person's voice, and his syllables were clearly articulated. And I'd like to think the problem wasn't my typing abilities. In high school I took a typing class, and my keyboard proficiency has been shaped by years of instant messaging and web surfing. The issue was the basic fact that our hands are a clumsy way to convert our thoughts into a readable form. On the other hand, our voices are like a wormhole leap straight from Star Trek, a direct portal from our brains to the outside world.

If a computer could have automatically converted Dr. McAllister's voice into text for me, the process would have taken much less time on my part. Looking at society as a whole, similar scenarios are plentiful. Transcriptions of medical and legal information, for example, currently are very time-consuming, and can be made much more efficient with the use of speech recognition. And time is money, of course.


Many of us are familiar with at least a few everyday conveniences provided by speech recognition. Most cell phones include features like voice dialing and the ability to answer calls with a voice command (even my Kyocera KX1, the cheapest phone available with my wireless plan). Many telephone menus for things like customer service can now be navigated by voice command as an alternative to button presses. And most in-car GPS systems can be commanded by voice.

But imagine being able to use your voice not only to show directions, but also to actually drive your car. (If nothing else, it would mean people finally would stop talking on their cell phones while they drove!) That's probably out of the question for the near, and not-so-near, future. But there are some pretty neat advances that aren't so far down the pipeline. This is perhaps most evident in the area of personal computing. Imagine being able to control your computer's every action by voice command, for example. You wouldn't have to use the keyboard and mouse, forever the bane of heavy computer users and carpal tunnel syndrome sufferers, two groups that go hand in hand (pun intended). You'd also be freed from your desk, and could get things done from the other side of the room if you wanted.

    THE REALITY OF TODAY

In fact, speech recognition systems are already being used for personal computing. One group that relies on these systems is the population of disabled people who can't use their hands to type or to move the mouse. Dr. McAllister's own neighbor, for example, suffers from hand muscle atrophy and uses NaturallySpeaking 9 Standard speech recognition software regularly. "He talks to his computer and has it do things for him," says McAllister. "He uses it to create email and other messages, and stuff like that works very well. It's not always perfect, but it's much better than you would think."

His neighbor uses a standalone program called Dragon NaturallySpeaking, produced by Nuance Communications. It's the world's bestselling speech recognition for


professional use. It's existed in various forms since 1990, when a DOS-based version was made available for $9,000. That version required that the user pause between every word, to help it identify word boundaries. The latest version of NaturallySpeaking retails for $99.99, allows the user to speak in their normal, casual fashion, and advertises up to 99% accuracy.
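That early requirement to pause between words points at the core difficulty of continuous dictation: finding where one word ends and the next begins. As a rough, hypothetical sketch (nothing like Dragon's actual algorithm), a segmenter for pause-separated speech only needs an energy threshold to split the sample stream on runs of silence:

```python
# Toy word-boundary detection for pause-separated dictation (an
# illustration only, not Dragon's method): samples louder than a
# threshold belong to a word; a long enough quiet run ends the word.

def split_on_silence(samples, threshold=50, min_gap=3):
    """Group samples into words, separated by runs of >= min_gap quiet samples."""
    words, current, quiet = [], [], 0
    for s in samples:
        if abs(s) >= threshold:
            current.append(s)
            quiet = 0
        else:
            quiet += 1
            if quiet >= min_gap and current:
                words.append(current)   # quiet run long enough: word is done
                current = []
    if current:
        words.append(current)
    return words

# Two bursts of "speech" separated by a pause of zeros.
stream = [90, 120, 80] + [0, 0, 0, 0] + [200, 150]
print(len(split_on_silence(stream)))  # → 2
```

Continuous speech has no such reliable gaps between words, which is why later versions needed far more sophisticated models to find boundaries.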

I decided to try the program for myself. Luckily, the N.C. State library's Assistive Technologies Center had a copy of the program available for me to try out. Getting started with the program was a very simple process: I just put on a headset with a microphone attached, opened the program, and started talking. There is an option to set up a new profile and train the program to understand your voice, a process that takes roughly 30 minutes depending on how thorough you choose to be. I chose to skip that step, because one of the latest version's selling points is that NaturallySpeaking requires no training, so you can get started dictating right away. Armed with several pages' worth of test materials ranging from tongue twisters to Shakespeare monologues, I began to recite in a natural, perhaps slightly more carefully articulated voice.

Dragon NaturallySpeaking's Accuracy: A Sampling

When I said...                                        NaturallySpeaking recorded...
Peter Piper picked a peck of pickled peppers          Haircut or effect of takeover tactics
One small step for man, one giant leap for mankind    Was offset from them, when I believe in mankind
To be or not to be: that is the question              To be order not the: man is the question
NaturallySpeaking is the greatest piece of software   NaturallySpeaking is the greatest piece of software

As the table above shows, the results of my trial were decidedly mixed. I measured my average voice dictation speed to be roughly 200 words per minute (I average about 60 when typing), but I can't say the improved speed fully made up for the errors. To be fair, the examples I chose are some of the worst. Realistically, the dictation averaged about one or two errors per sentence. And I could see a moderate amount of improvement as my trial progressed: I was learning


how to use the program (using keywords to dictate commas and periods, for example) and, as I corrected its errors, it was beginning to train itself to my voice. It's probably safe to say the results would have been much more agreeable if I'd trained the program for a period of days or weeks, just as any serious user of the program would. (McAllister's neighbor had done this, of course.)

Say "Mousegrid" to show a 3x3 grid.
Say "One" to pick the upper left box.
Say "Four" to move to the File menu.
Say "Click" to click at that position.
Saying "Close" closes the window.

Another feature of NaturallySpeaking is the ability to control the mouse by voice. This is accomplished by something called the Mousegrid, which divides the screen into increasingly small numbered rectangles and moves the mouse into the rectangle you command it to. The figure above demonstrates how I used the Mousegrid to close a browser window. It was easy enough to use, and for someone who can't use a mouse it would be an essential feature. However, it takes the computer a moment to render each grid onto the screen, and it was necessary to pause a bit between words. It took a total of approximately 5 seconds for me to close the window. This may not sound very long, but closing a window using the mouse itself takes under a second.
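The Mousegrid's divide-and-conquer idea is simple enough to sketch in a few lines. The Python below is only an illustration of the concept, not Nuance's code, and the row-major 1-to-9 cell numbering is an assumption:

```python
# Sketch of the Mousegrid idea: each spoken digit 1-9 picks one cell of a
# 3x3 grid, and the grid then shrinks to that cell, homing in on a screen
# position in just a few voice commands.

def mousegrid(width, height, digits):
    """Return the (x, y) position after a sequence of digit picks."""
    left, top = 0.0, 0.0
    w, h = float(width), float(height)
    for d in digits:            # assume cells numbered left-to-right, top-to-bottom
        row, col = divmod(d - 1, 3)
        w, h = w / 3, h / 3     # the grid shrinks to the chosen cell
        left, top = left + col * w, top + row * h
    # the click lands at the center of the final cell
    return (round(left + w / 2), round(top + h / 2))

print(mousegrid(1920, 1080, [1]))     # center of the upper-left ninth
print(mousegrid(1920, 1080, [1, 5]))  # then the middle of that cell
```

Because each command cuts the remaining area by a factor of nine, a handful of digits pins down any pixel; the cost, as noted above, is the pause while each new grid is drawn.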

NaturallySpeaking is the most widely used standalone speech recognition program, but many personal computers are sold with a


speech recognition system built-in. Microsoft Office XP is bundled with a speech recognition engine (though it isn't installed by default), and it is a standard feature in Microsoft's Vista and Apple's Tiger OS.

Given that Vista was just released at the end of last year, the jury is out on the quality of its built-in speech recognition. Extremetech.com reviewed it thoroughly, and concluded that while it isn't perfect, it "becomes so accurate that it's a joy to use" given enough training.

The technology's rollout at a public demo was, for the most part, successful, but the public's perception of it was largely shaped by one embarrassing moment that spread virally throughout the blogosphere and even network TV news. When the presenter of the demo (presumably trained with the speech recognition software in advance) tried to write a "Dear Mom" letter by voice, the speech engine produced "Dear aunt," and his repeated attempts to delete the error were misunderstood. The final product was a pathetic "Dear aunt, let's set so double the killer delete select all." "I think it's picking up a little bit of echo," the flustered presenter said, to the audience's laughter. Recovering from that PR nightmare may take a while.

The speech engine in Tiger OS is, for the most part, unchanged from previous releases of Apple's OS X. A blogger at systemsboy.blogspot.com said the speech engine often froze, and that it was overly sensitive to noise: "Heaven help you if you're eating a burrito while you want to use speech control." One at crunchgear.com reported that Apple's voice recognition is "an afterthought at best and cripple-ware at worst." The same blogger pointed out that, as shown below, setting up the OS X speech engine isn't practical without using a mouse, which would be a problem for the disabled.

A mouse is needed to configure the Mac OS speech engine setup screen.


In defense of the speech engines found in Vista and Tiger, the bulk of the complaints seem to deal more with their initial setup and with controlling applications through voice. I came across relatively few frustrated users of the basic dictation feature, which is still the most widely used feature of the speech engines and their bread and butter. For that purpose, at the very least, the speech engines perform well given enough training.

RESEARCHING SPEECH

I spoke with Dr. McAllister to learn more about the science behind speech processing and what's holding it back from working perfectly.

McAllister's research career was already well underway when he entered the area of speech processing. Since early in his career, much of his research has dealt with the area of stereo computer graphics and three-dimensional imaging. One of his projects, for instance, "was to help the Defense Mapping Agency process its warehouses full of high altitude photographs," McAllister says, "and provide elevation values for every place on the earth." Over time, he became a highly regarded expert in the field, publishing two books in the area. His involvement in 3-D imaging continues to this day.

Dr. David McAllister in his office at N.C. State University

McAllister became involved in speech processing during a project related to lip synching, the matching of lip movements to speech. The project, he says, used filtering, sophisticated techniques and signal processing


which had not been applied to tell what a person was saying. These complex methods were used to process speech signals and produce a computer animation of them being spoken. Such a method was of interest to video game and movie animation companies, for example. New to the area of signal processing at the time, McAllister played the role of graduate student for a while.

After that, McAllister and his research partners realized their new signal processing techniques could be used for an entirely different type of speech processing, called speaker recognition. Unlike speech recognition, which seeks to identify the words being spoken, speaker recognition is concerned with identifying the speaker. Many of the underlying problems are shared between the two areas, but the majority of McAllister's speech processing experience is in speaker recognition. There are many uses for speaker recognition technology, including criminal justice and security.

The plots above are from a 2002 paper written by McAllister and four colleagues at N.C. State. It uses a complex mathematical technique to model the speaker's voice in two dimensions, as shown on the plots. Even without understanding exactly what the plots mean, it's easy to see that the two left plots are much more similar than the other two, because they're the same speaker.

Much of the research being done in speaker recognition deals with criminal justice, and is being subsidized by the government. "It is of interest for the FBI, for instance, to be able to identify people who have issued bomb threats over the telephone," says McAllister, "and lawyers would like to be able to establish that either a person did or didn't say certain things on the telephone." In cases in which it's known for a fact that the speaker is a member of a given group of people (called a closed set problem), the speaker can be chosen at a forensic quality of 95% or more, given enough


voice samples. But in many criminal justice situations, where the speaker (or the suspect, as the case may be) could be a member of that group or not (an open set problem), there has been much less success determining the speaker. "There's a lot of trouble in making such conclusions with enough accuracy that it would stand up in court," says McAllister.
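The gap between the two settings is easy to see in a toy sketch. The nearest-match classifier below is purely illustrative (McAllister's actual techniques are far more sophisticated, and the feature vectors are made up): a closed-set decision only has to rank the enrolled speakers, while an open-set decision also needs a rejection threshold, and choosing that threshold is where the trouble lies.

```python
# Toy closed-set vs. open-set speaker identification. Each enrolled speaker
# is summarized by a feature vector; closed-set picks the nearest speaker,
# open-set must additionally decide whether the best match is close enough.

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closed_set_id(sample, enrolled):
    """The speaker is known to be in `enrolled`: return the nearest one."""
    return min(enrolled, key=lambda name: distance(sample, enrolled[name]))

def open_set_id(sample, enrolled, threshold):
    """The speaker may be unknown: reject if even the best match is far."""
    best = closed_set_id(sample, enrolled)
    return best if distance(sample, enrolled[best]) <= threshold else None

# Hypothetical per-speaker feature vectors (say, averaged pitch statistics).
enrolled = {"alice": (120.0, 2.1), "bob": (95.0, 3.4)}
print(closed_set_id((118.0, 2.3), enrolled))               # close to alice
print(open_set_id((160.0, 9.0), enrolled, threshold=5.0))  # nobody close: None
```

In the closed-set case some speaker always wins, so 95%-plus accuracy is plausible given enough samples; in the open-set case every threshold trades false accusations against missed matches, which is exactly what makes courtroom-grade conclusions hard.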

Many of the obstacles that plague speaker recognition are shared by speech recognition. For example, McAllister says two common problems in speaker recognition are a shortage of samples and a speech signal with a lot of noise. Similarly, it's common that a voice is disguised, either intentionally or by accident. A bomb threat caller might speak in falsetto or in a fake accent, for example, or the speaker could have laryngitis. And one perpetual problem in speech processing is the finite amount of computing power available. As McAllister puts it, "The machines are becoming faster and we can crunch numbers faster, and the algorithms can get more complicated. The problem is, you want to be able to operate in real time. You could do lots of things if you aren't in a hurry that you can't do if you want information now."

One unique aspect of the speech processing field is its multidisciplinary nature. McAllister's specialty is mathematics (he's "a flunky numerical analyst," he jokes). Dr. Robert Rodman, one of his closest research partners, is a computational linguist, and Dr. Donald Bitzer, another member of the speech processing team, is a signal processing expert. They're three very different specialties, but McAllister says that all three of them fit together quite nicely.

The future of speech processing will have plenty of room for more research, and more progress. Aside from the issues of identifying words and speakers, for example, there's the problem of dividing sentences properly. Beyond that, there's an even more complex issue, one that McAllister says still needs "a lot of research": how a computer can figure out what a sequence of words means, and whether it's gibberish or


not. To address these complex problems, some of the same methods can be used. For instance, one approach computers use is to look at common acoustic features of voices and sounds. A similar approach could be used by analyzing common features of words and sentences. "Feature extraction is a problem," McAllister says. "What are the features that you want? Can you reduce the number of features that matter? And how do you use the features to group individuals into categories?"

But until these kinds of high-level problems are solved, he says, we might have to rephrase something repeatedly until the computer understands what is being asked of it. These problems are large enough to make some of the current bugs and inconveniences in speech recognition systems seem trivial by comparison.
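The acoustic-feature approach McAllister describes can be illustrated with two classic, easy-to-compute features: frame energy and zero-crossing rate. This sketch is only a toy (real recognizers use much richer features, such as cepstral coefficients), but it shows the basic move of reducing a raw waveform to a handful of comparable numbers:

```python
# Minimal acoustic feature extraction: split a waveform into fixed-size
# frames and summarize each frame by its energy (loudness) and its
# zero-crossing count (a crude measure of how noisy or high-pitched it is).

def frame_features(samples, frame_size=160):
    """Return a list of (energy, zero_crossings) tuples, one per frame."""
    features = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((energy, crossings))
    return features

# A fake two-frame "waveform": a loud alternating buzz, then near-silence.
signal = [100, -100] * 80 + [1, 0] * 80
for energy, crossings in frame_features(signal):
    print(energy, crossings)
```

Every question in McAllister's quote applies even to this toy: which features to keep, how few can suffice, and how to use them to sort frames (or speakers, or words) into categories.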

    WHAT TO EXPECT

It's clear that some uses of speech recognition are more realistic in the near future than others. We probably can expect more speech systems that help make our lives more convenient, as in the case of hands-free computer use. It's been demonstrated that under the right conditions, that sort of thing can be done at a high level of reliability.

But until that reliability goes from high to perfect, we can't expect to see things that rely on speech processing, only ones that use it as a supplement. Imagine if voice were used to log into your computer instead of a password. What if you had a sore throat and couldn't log in at all? It's safe to say we'll all own keyboards for the foreseeable future, even if we might not be typing on them quite as often.

If the Universal Translator only worked 90% (or even 99%) of the time, the Star Trek shows would be more dramatic, to say the least. It's probably safe to say at least a few intergalactic wars would've been caused when a word or two got misinterpreted. Fortunately, it should be a while before we start running into Klingons or Ferengis, and there's plenty of time to get our Universal Translators ready for that day.