

ITR: Listen and Learn – Artificial Intelligence in Auditory Environments

L. K. Saul (University of Pennsylvania)    D. P. W. Ellis (Columbia University)
J. J. Hopfield (Princeton University)    C. L. Isbell (Georgia Institute of Technology)
D. D. Lee (University of Pennsylvania)    S. P. Singh (University of Michigan)

Project Summary

Most academic departments in computer science and electrical engineering provide some sort of intellectual umbrella to integrate research in machine learning, computer vision, robotics, and artificial intelligence (AI). For largely historical and outmoded reasons, however, research in auditory computation and machine listening has not been included in these efforts. We hope to rectify this situation.

Our proposal involves an interdisciplinary collaboration of researchers from departments of computer science, electrical engineering, and molecular biology—all sharing a common mathematical background in machine learning. Our goal is to build autonomous agents that perceive and navigate the world through sound. These agents will operate in real-time, in both natural and artificial environments, where overlapping sounds arise from a myriad of possible sources (e.g., speech, music, wind, thunder, alarms, explosions, engines, etc.). We intend these agents to learn from extended human-computer interaction and to equip them with highly engaging behaviors that encourage, rather than stifle, this interaction.

The investigators have a track record of published work in real time audio processing, pitch tracking for voice-driven agents, binaural localization in mobile robots, biologically inspired architectures for speech processing, bottom-up and top-down models of computational auditory scene analysis, and reinforcement learning for communicative agents. The proposed work will bring all these areas of expertise to bear on the problem of AI in auditory environments.

Projects will address fundamental questions in science and engineering. How can machines use sound to infer the number and location of objects in their auditory environment? How can affective vocalizations supplement or replace the use of highly structured natural language in models of human-computer interaction? Some specific projects include: (i) real time pitch tracking of overlapping sources, combining novel methods in signal processing with unsupervised learning algorithms previously applied in the visual domain; (ii) active localization, in which a mobile robot learns how to use cues from head motion to disambiguate otherwise equivocal cues from binaural timing differences; (iii) reinforcement learning for real time communicative agents that respond to affective vocalizations and natural sounds in the auditory environment.

The proposed work will have a broad impact beyond its specialized area of concentration. The investigators are partnered with local educational initiatives and efforts to increase the participation of underrepresented groups in computing. They are also committed to providing useful software in the public domain, as well as pedagogical tools for classroom instruction. Existing prototypes of voice-driven agents have already generated inquiries from a diverse mix of potential consumers, including audio engineers, museum artists, toy manufacturers, electronic musicians, foreign language instructors, and neuroscientists studying learning in songbirds. Numerous media outlets have used related work by the investigators as a vehicle for educating the general public about grand challenges in science and engineering. Ultimately, we believe that the proposed research will have a profound impact on the organization of departments in computer science and engineering, leading to the same integration of researchers in AI and auditory computation as has previously occurred in computer vision and robotics.


ITR: Listen and Learn – Artificial Intelligence in Auditory Environments

L. K. Saul¹, D. P. W. Ellis², J. J. Hopfield³, C. L. Isbell⁴, D. D. Lee⁵, and S. P. Singh⁶

¹Department of Computer and Information Science, University of Pennsylvania
²Department of Electrical Engineering, Columbia University
³Department of Molecular Biology, Princeton University
⁴College of Computing, Georgia Institute of Technology
⁵Department of Electrical and Systems Engineering, University of Pennsylvania
⁶Department of Electrical Engineering and Computer Science, University of Michigan

1 Introduction

Over a decade ago, in a chapter recounting the history of research in auditory and visual perception, the distinguished experimental psychologist Al Bregman wrote [9]:

If you were to pick up a general textbook on perception written before 1965 and leaf through it, you would not find any great concern with the perceptual or ecological questions about audition. By a perceptual question I mean one that asks how our auditory systems could build a picture of the world around us through their sensitivity to sound, whereas by an ecological one I am referring to one that asks how our environment tends to create and shape the sound around us ... The situation would be quite different in the treatment of vision ... Why should there be such a difference?

The remainder of the chapter advances the idea that auditory phenomena are equally deserving of study as their visual counterparts, and the remainder of his book—entitled Auditory Scene Analysis—distills an enormous body of experimental evidence in support of this thesis. Bregman argues that we perceive and navigate the world through sound in ways as profound and nuanced as sight; that our auditory systems are exquisitely well adapted to gather information about our physical surroundings; and that much of our knowledge of these processes can be summarized by a few basic principles with far-reaching consequences. These ideas have come to be widely accepted.

A similar story is now playing out—and a similar book waiting to be written—in the field of artificial intelligence (AI), where computer scientists and engineers face many of the same challenges as researchers in perception. Today, Bregman would no doubt be amused to pick up the current leading textbook in AI [58] and discover a discrepancy of coverage along the same historical lines as his field. There, he would find definitive summaries of computational research in visual perception, robotics, and natural language processing, but few words on the emerging area of machine listening [14, 22, 36, 40, 41, 77, 88, 91]. Automatic speech recognition (ASR) is treated as an impressive application of AI, but its particular focus on transcribing speech to text largely sidesteps the more general problem of machine awareness in auditory environments. In its most widely accepted formulation [58], “the problem of AI is to describe and build agents that receive percepts from the environment and perform actions.” In this view, a machine that could respond as intelligently to sound as a dog—not necessarily understanding sentences or even words, but recognizing its owner’s voice in a crowded room, sensing emotions from changes in intonation and rhythm, judging distances from binaural cues, inferring motion from acoustic gradients—such a machine would certainly be hailed as a milestone in AI. Yet, while many academic departments provide some sort of intellectual umbrella to integrate research in AI, computer vision, and robotics, research in auditory computation and machine listening has not typically been included in these efforts. One is led again to Bregman’s question: “Why should there be such a difference?”

Figure 1: Interaction of an intelligent agent with its auditory environment. From microphone input, the agent must infer the state of the world and take appropriate action.

We can begin to answer that question by returning to the underlying problem of AI—how to describe agents that perceive and act in complex environments—and asking what an agent with auditory awareness would look like. Fig. 1 illustrates the interaction between such an agent and its environment. It is a continual process, in which the agent listens to incoming sounds, makes inferences about the world, performs actions, receives feedback in the form of rewards and punishments, and adapts its behavior correspondingly. Let us further stipulate that the agent operates in real-time, in both natural and artificial environments, where overlapping sounds arise from a myriad of possible sources (e.g., speech, music, wind, thunder, alarms, explosions, engines, etc.). The different components in Fig. 1 reflect the many areas of expertise that would be required to build such an agent: speech and signal processing, pattern recognition, optimal control, robotics, and human-computer interaction, to name just a few. It is largely a historical accident that these areas of expertise have evolved in different communities, with barriers of distance and language separating researchers with similar ambitions. Our proposal is a collaboration of researchers from these different communities, united by the overarching goal of AI in auditory environments, and sharing a common mathematical background in machine learning.

2 Background

There are three main areas of research that provide context for the proposed work: AI of autonomous agents, automatic speech recognition, and auditory computation. These areas do not have sharp boundaries, and indeed one goal of the proposed work is to blur the boundaries that do exist. Nevertheless, these areas provide a useful starting point for developing the goals of AI in auditory environments.

AI of autonomous agents

From its inception, AI has concerned itself with building autonomous agents that perceive and act intelligently in complex real-world environments. Early progress in AI was achieved by focusing primarily on issues of knowledge representation and abstract reasoning. In the last decade or so, however, there has been renewed interest in building complete systems that “close the loop”, with tightly linked modules for perception and action. These systems take various forms—from embedded software agents in virtual environments to embodied agents or robots in the real world.


Machine learning has transformed the way in which these agents are built. In the past, researchers often preprogrammed the behaviors of agents based on oversimplified models of their environment. Now we understand that agents can learn from experience—that is, from repeated interactions with the world around them. The field of reinforcement learning is centered around the problem depicted in Fig. 1: how can autonomous agents with sensors and actuators learn policies that maximize their cumulative payoff over time? Most problems with autonomous agents can be formulated as problems in reinforcement learning because there are very few assumptions on the payoffs; they can be noisy, infrequent, and delayed in time.

Reinforcement learning has made an impact on many domains of AI. In the area of game playing, for example, one need only contrast the grandmaster standing of Deep Blue [11]—the culmination of decades of programming and insights specific to the game of chess—with the world-class play at backgammon achieved by Tesauro's TD-Gammon [82]. The latter was trained by a few months of self-play, using a basic technique from reinforcement learning. In robotics, machine learning has made it possible to move away from elaborate model-based approaches, whose success seems limited to carefully controlled environments, to robots which learn from experience. Robots have learned how to juggle [63], navigate [78], and even give guided tours [83]. As a final example, spoken dialogue systems—which for years have relied on hand-crafted scripts by experts in interface design—are now being improved using methods from reinforcement learning [33], yielding systems that support more effective dialogues than traditional approaches.

Despite continuing progress in the AI of autonomous agents, if one criticism could be made, it would be that researchers in this area have focused on action to the neglect of perception. The most frequent applications of reinforcement learning are not to agents with sensorimotor systems, but to planning problems in operations research (e.g., elevator control [15], job shop scheduling [92], warehouse placement [4]). Building agents that perceive the world through sight and sound is arguably the most difficult challenge facing researchers in reinforcement learning and AI.

Automatic speech recognition (ASR)

State-of-the-art ASR relies heavily on statistical methods in machine learning, and perhaps for that reason, ASR is often viewed as the foremost problem bridging AI to the world of sound. Certainly, recognizers have become faster and more accurate over the last twenty years. These advances have come not only from faster computers and larger training corpora, but also from more efficient search algorithms [37] and more sophisticated statistical models [90]. In particular, many extensions of Markov models and hidden Markov models have been investigated to address the acoustic variability arising from coarticulation [42, 45] and speaker differences [71], as well as to incorporate ever-increasing amounts of linguistic context [53]. State-of-the-art ASR is highly reliable for clean speech in situations where constraints on the subject matter (e.g., airline travel reservations [7]) can be used to prune the combinatorial search space of possible sentences. In such restricted domains, ASR currently supports a great deal of commercial enterprise.

To say that ASR has been successful, however, is not to say that the problem is solved, particularly if we judge machines by the ultimate benchmark of human performance. Commercial speech recognizers are far less robust to noise and distortion than human listeners, and their architectures do not reflect our most basic understanding of the peripheral auditory system and its role in speech perception [1]. For relatively clean telephone speech, word error rates on large vocabulary speaker-independent tasks are typically over 25% [90], and even on simpler tasks, machine performance is typically orders of magnitude worse than human performance [34]. Arguably, the narrow focus on transcription has attached an overrated importance to word error rates; a relatively small fraction of researchers are seriously investigating alternative architectures inspired by biological considerations [8, 14]. The view of ASR as a self-contained problem—that of transcribing speech to text—has also distanced the field from researchers in AI, who aspire to build machines that not only recognize patterns of speech, but also experience the world as embodied agents [56]. Thus, whether viewed as an end in itself, or as a vehicle for the AI of autonomous agents, ASR stands to gain from a better understanding of auditory processes.


Figure 2: Traditional bottom-up view of auditory scene analysis in listeners. (Block diagram: frequency analysis of the incoming sound yields elements tagged by onset, harmonicity, and spatial maps; a grouping mechanism assembles these elements into sources and their properties.)


Auditory computation

Bregman's seminal contribution was to view auditory perception as a process of scene analysis [9]. His work has been followed up by extensive research in both experimental psychology and computer modeling. The dominant paradigm is illustrated in Fig. 2 (based on [16]): the soundfield reaching the ears is broken up into a set of atomic regions, fragments of time-frequency with locally coherent properties that belong to only a single source. These fragments are constructed based on several special-purpose cue detectors, responsive to the properties known to govern the formation of auditory percepts. The most important of these properties are: common onset (energy in different frequency bands that appears simultaneously), harmonicity (simple ratios between the frequencies of resolved harmonics, and, in the unresolved portion of the spectrum, amplitude modulation in different frequency bands by a common fundamental frequency), and spatial origin (interaural level and timing differences pointing to a common azimuth and elevation of different time-frequency segments).

These elements are then assembled via some grouping process into sets of energy that are interpreted as relating to a single origin or "auditory source". Grouping has usually been described in terms of a set of rules, e.g., energy with onset times within a few milliseconds will be "perceptually fused" into a single source. However, experiments have shown that these rules have complex context dependencies; in particular, grouping principles can be put into conflict (such as two tones that share a common onset but whose harmonic relationship indicates a different grouping organization) with consequences that are difficult to describe.
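
As a tiny illustration of this kind of rule, the sketch below (Python, purely for exposition) fuses time-frequency elements whose onset times fall within a fixed tolerance; the 10 ms tolerance and the (onset, frequency) representation are illustrative assumptions, and real grouping must also weigh harmonicity and spatial cues, which is exactly where the simple rule-based account breaks down.

```python
def group_by_onset(elements, tolerance_s=0.010):
    """Fuse time-frequency elements (onset_time, frequency) whose onsets lie within tolerance_s."""
    groups = []
    for onset, freq in sorted(elements):
        if groups and onset - groups[-1][-1][0] <= tolerance_s:
            groups[-1].append((onset, freq))   # close enough in time: same perceptual group
        else:
            groups.append([(onset, freq)])     # otherwise start a new group
    return groups

# elements with onsets at 0.000 s and 0.005 s fuse; the element at 0.500 s starts a new group
print(group_by_onset([(0.000, 440.0), (0.005, 880.0), (0.500, 330.0)]))
```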

The final stage of Fig. 2 draws attention to the fact that the perceptually fused sources grouped from the basic elements will themselves have properties, such as pitch, perceived onset time, and spatial extent, that reflect the same low-level attributes involved in source formation. Thus a set of harmonically related partials will usually fuse into a complex tone with a distinct pitch, mainly determined by their fundamental frequency. However, it is only once the fused object has been found that the pitch can be determined; it is different from the pitch that would be experienced for most if not all of the component elements.

Computational modelers have been strongly influenced by this paradigm, and several systems have implemented it more or less directly [10, 40]. This approach, known as computational auditory scene analysis (CASA), has worked quite well on examples where the correct interpretation is unambiguous. However, such systems also reveal that in many situations a single interpretation cannot be directly inferred from the local signal properties. Noise and interaction between sources complicate the formation of the initial elements, and broader contextual constraints and/or prior experience are needed to make the correct interpretation of many real-world sound scenes. These issues were addressed in work by Ellis [19], described below.


Figure 3: Left: quadruped robot constructed by LEE [29, 30] that performs auditory localization, audiovisual saliency tracking, vestibulo-ocular reflex gain adaptation, and legged locomotion. Right: legged robots playing soccer on the University of Pennsylvania RoboCup team.

3 Previous work

The investigators have an extensive track record of work in both academic and industrial settings. The examples below illustrate our collective expertise in real time audio processing, biologically inspired architectures for speech processing, bottom-up and top-down models of computational auditory scene analysis, and reinforcement learning for communicative agents. The proposed work in section 4 will bring all these areas of expertise to bear on the problem of AI in auditory environments.

Binaural localization in mobile robots

LEE [29, 30] has built quadruped robots that localize sounds and perform other sensorimotor tasks. Figure 3 (left) shows a picture of one of these robots, approximately 25 cm in length and 1 kg in weight, and powered by a 9.6 V rechargeable battery pack. Fourteen hobby servo motors provide three degrees of freedom for each of the four legs, as well as two degrees of freedom for the head. Attached to the head assembly is a small board-level CCD camera for vision, and two directional electret microphones for hearing. The two head servo motors allow the camera and microphones to pan and tilt rapidly over a wide range of angles. In addition to audiovisual input, a two-axis gyroscopic rate sensor is used to provide vestibular input for stabilization.

The robot is able to localize sounds by detecting interaural time differences. The audio signals from the left and right microphones are digitized at 16000 Hz. The waveforms in these audio channels are then passed through a rectification nonlinearity to extract their envelopes. Next, the envelopes from different channels are cross-correlated to estimate the interaural delay. Different delays correspond to different azimuthal positions of auditory sources, assuming the elevation angle is close to the horizontal plane of the microphones.
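
To make this pipeline concrete, here is a minimal sketch (not the robot's actual implementation) of estimating the interaural delay by cross-correlating rectified, smoothed envelopes of the two microphone channels; the 16 kHz sampling rate follows the description above, while the one-pole smoothing cutoff and the ±1 ms search range are illustrative assumptions.

```python
import numpy as np

def envelope(x, fs=16000, cutoff_hz=400.0):
    """Rectify and smooth (one-pole lowpass) to extract a coarse amplitude envelope."""
    alpha = np.exp(-2.0 * np.pi * cutoff_hz / fs)   # one-pole smoothing coefficient
    env, acc = np.empty(len(x)), 0.0
    for n, v in enumerate(np.abs(x)):
        acc = alpha * acc + (1.0 - alpha) * v
        env[n] = acc
    return env

def interaural_delay(left, right, fs=16000, max_lag_s=0.001):
    """Estimate the delay (in seconds) of the right channel relative to the left
    by locating the peak of the cross-correlation of their envelopes."""
    el, er = envelope(left, fs), envelope(right, fs)
    el, er = el - el.mean(), er - er.mean()
    max_lag = int(max_lag_s * fs)
    lags = range(-max_lag, max_lag + 1)
    scores = [np.dot(el[max(0, -k):len(el) - max(0, k)],
                     er[max(0, k):len(er) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(scores))] / fs
```

The sign of the recovered delay indicates which side of the midline the source lies on; converting the delay to an azimuth uses the plane-wave relation discussed under active spatialization in section 4.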

Real time pitch tracking for voice-driven agents

Like localization, pitch perception is a fundamental component of auditory scene analysis. It is especially relevant to human-computer interaction because pitch is one of the most easily and rapidly controlled acoustic attributes of the human voice. Not only does pitch play a central role in speech production and perception [79], but it also conveys a great deal of information about the speaker's emotional state [47, 77]. As a first step toward building voice-driven agents that respond to the pitch contours of human speech, SAUL et al [60] implemented a real time front end for detecting voiced speech and estimating its fundamental frequency.


Figure 4: Left: real-time front end for detecting voiced speech and estimating its fundamental frequency without FFTs or autocorrelation at the pitch period, by SAUL et al [60]. (Block diagram: a two-octave filterbank with bands spanning 25–100, 50–200, 100–400, and 200–800 Hz, each followed by a pointwise nonlinearity, lowpass filter, and sinusoid detector; the sharpest estimate across bands is reported as the pitch of voiced speech.) Right: real-time pitch tracker with audiovisual feedback; the pitch scrolls across the screen and (optionally) drives an electronic instrument.

The algorithm for pitch detection (diagrammed in Fig. 4) is unusual in several respects: it does not involve FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range—in real time—without any postprocessing. The algorithm contrasts with standard approaches to pitch tracking, based on locating peaks in the windowed autocorrelation function [48] or power spectrum [43, 44, 64] and using postprocessing procedures [65], such as dynamic programming or median filtering, to obtain smooth contours. Experiments with the real time front end showed that it performed as well as or better than many standard algorithms without real time constraints.
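
For illustration only, one textbook way to estimate the frequency of a locally sinusoidal, band-limited signal without FFTs, autocorrelation lags, or peak picking is to fit the three-sample identity x[n] + x[n-2] = 2 cos(ω) x[n-1] by least squares. The sketch below uses that identity; it is a stand-in for exposition, not the actual detector of SAUL et al [60]. In a front end like Fig. 4, an estimator of this kind would run on the output of each filterbank band, with the sharpest (most self-consistent) band selected as the pitch.

```python
import numpy as np

def sinusoid_frequency(x, fs):
    """Least-squares estimate of cos(omega) from x[n] + x[n-2] = 2*cos(omega)*x[n-1]."""
    num = np.dot(x[1:-1], x[2:] + x[:-2])       # sum over n of x[n-1] * (x[n] + x[n-2])
    den = 2.0 * np.dot(x[1:-1], x[1:-1])        # 2 * sum over n of x[n-1]^2
    c = np.clip(num / den, -1.0, 1.0)           # cos(omega), clipped for numerical safety
    return np.arccos(c) * fs / (2.0 * np.pi)    # convert omega (radians/sample) to Hz

# a 220 Hz tone sampled at 16 kHz is recovered essentially exactly
fs = 16000
t = np.arange(1024) / fs
print(sinusoid_frequency(np.sin(2 * np.pi * 220.0 * t), fs))
```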

The front end for real time pitch tracking was implemented in two interactive voice-driven agents that provide continuous audiovisual feedback [60]. The first was a voice-to-MIDI player that synthesizes electronic music from vocalized melodies. The second was a multimedia Karaoke machine with audiovisual feedback, voice-driven key selection, and performance scoring. In both applications, the user's pitch is displayed in real time, scrolling across the screen as he or she sings into the computer; see Fig. 4 (right). Interestingly, the real time audiovisual feedback provided by these agents creates a profoundly different user experience than current systems in automatic speech recognition [49]. The delays and inaccuracies of speech recognizers frustrate many users, highlighting the gap between human and machine performance. By contrast, most users are highly engaged by the voice-to-MIDI and audiovisual Karaoke machines, allowing for the possibility (see section 4) that these machines—or others like them—could learn interesting behaviors over the course of extended human-computer interaction.

Biologically inspired front ends for speech processing

The real time front ends at work in Figs. 3 and 4 are based on highly simplified models of localization and pitch perception. Implicitly, the algorithms assume that the auditory environment consists of a single speaker with no background noise. Several of the investigators have explored biologically inspired front ends, with the aim of matching the robustness of human listeners in more complex auditory environments. These front ends begin with a bank of bandpass filters modeled after the peripheral auditory system. SAUL investigated front ends of this nature for pitch perception [59] and phonetic feature detection [61], in both cases focusing on cues arising from the periodicity of the acoustic waveform. HOPFIELD and Brody [24] explored how the relative timing of onsets detected in narrow frequency bands could be used for robust speech recognition.


Figure 5: Block diagram of the prediction-driven approach to computational auditory scene analysis. (A front end extracts signal features from the input sound; hypotheses of periodic and noise components are predicted and combined into predicted features; a compare-and-reconcile stage turns prediction errors into updates handled by the hypothesis-management module.)

A biologically plausible model was implemented for recognition of spoken digits and found to exhibit a surprising degree of robustness to overlapping speakers, perhaps the most difficult form of speech interference. The model also incorporated a natural mechanism for 'time-warp' invariance [23], so that both fast and slow speech were recognized in the same underlying manner.

Computational auditory scene analysis

The biologically inspired front ends mentioned above focused on either narrowband signatures of pitch or onsets. Of course, human listeners naturally combine both these cues (and many others) to recognize speech and more generally analyze the auditory environment. ELLIS has worked on architectures for the general problem of auditory scene analysis [14], involving multiple, distributed cues that are sometimes complementary, sometimes conflicting, and sometimes ambiguous. As mentioned in section 2, the conventional understanding of human auditory scene analysis has difficulty accounting for the perception of ill-defined or ambiguous stimuli, such as sources embedded in noisy backgrounds, or the numerous 'auditory illusions' described in the literature. Such stimuli rely on broader constraints for their resolution, such as prior biases due to past experiences or other expectations. This kind of high-level information becomes still more troublesome for a system that is intended to work incrementally in real time.

ELLIS proposed a prediction-driven approach to computational auditory scene analysis to handle these difficulties [19]. Illusions such as Warren's 'restored phoneme', in which a speech sound that is deleted but replaced by a loud cough is perceived as affirmatively present by listeners [85], are viewed not as curiosities but as evidence of critical auditory capabilities. The ability of the auditory system to infer the presence of sound sources or features that are not directly observable is clearly very useful in noisy, dense sound scenes, as long as it is accurate.

Fig. 5 illustrates the block diagram of such a system. In contrast to the traditional bottom-up view of Fig. 2, the system's 'perceptions' of sound scene elements are represented as a set of hypotheses constructed from intermediate-level primitives (tonal fragments, noise patches, etc.) that are organized into more complete sources. Simple predictions for the expected evolution of these sources are combined into a single composite expectation which is compared to the actual sound received; discrepancies between observation and prediction trigger modifications to the 'world model' hypotheses, but, critically, predictions that are not inconsistent with observations may continue unimpeded. This shift from requiring direct evidence to requiring only no contradictory evidence provides the opening for inferred perceptions.
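
The following toy sketch shows the shape of the predict/compare/reconcile loop described above. Its choices—constant-continuation predictions, energy attribution in proportion to each prediction, and a fixed threshold for spawning a new hypothesis—are illustrative assumptions, far simpler than the rule base of the original system [19].

```python
import numpy as np

class SourceHypothesis:
    """A hypothesized source that predicts its next magnitude-spectrum frame
    (here, trivially, by continuing its current spectrum)."""
    def __init__(self, spectrum):
        self.spectrum = np.asarray(spectrum, dtype=float)

    def predict(self):
        return self.spectrum

    def reconcile(self, explained):
        # pull the hypothesis toward the energy actually attributed to it
        self.spectrum = 0.9 * self.spectrum + 0.1 * explained

def prediction_driven_step(hypotheses, observed, new_source_threshold=1.0):
    """One step: combine predictions, compare with the observation, reconcile each
    hypothesis, and spawn a new hypothesis if too much energy remains unexplained."""
    combined = sum((h.predict() for h in hypotheses), np.zeros_like(observed))
    for h in hypotheses:
        h.reconcile(observed * h.predict() / (combined + 1e-9))
    residual = np.clip(observed - combined, 0.0, None)
    if residual.sum() > new_source_threshold:
        hypotheses.append(SourceHypothesis(residual))
    return hypotheses
```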

The correct implementation of hypothesis-based predictions and the reconciliation of prediction errors remain a research issue. In the original system, predictions consisted of the simplest extensions of existing periodic and noisy components; reconciliation was accomplished with a hand-crafted rule base that hypothesized new sound elements to account for unanticipated energy increases.


On a small test set of environmental sound scenes (street noise, airport ambience, construction site), the model achieved a high level of agreement with human subjects asked to press a button when a new sound source appeared. More complex inference, however, such as Warren's phonemic restoration, requires more complex representations of past experience and the 'nature of the world'. The most promising approach for building such models is to acquire them directly from experience, as we propose in section 4.

Communicative agents and reinforcement learning

The investigators have also fielded software agents that engage in extended forms of human-computer interaction through natural language. Though the proposed work in section 4 will not focus on agents that communicate through natural language, we expect similar methods in reinforcement learning—and similar flights of creativity—to be useful in their acquisition of interesting, engaging behaviors.

Spoken dialogue systems

SINGH et al [76] showed that a spoken dialogue system could learn from experience to improve the quality of its human-computer interaction. Users spoke to the system using free-form natural language with the goal of retrieving information from a database of activities in New Jersey. The user's speech was interpreted by an automatic speech recognizer, and the system's responses (in natural language) were conveyed back to the user by a text-to-speech synthesizer. The dialogue policy determines what the system says to the user at any given point in the dialogue. It is difficult, even for experts in human-computer interaction, to design dialogue policies by hand; the combinatorial space of possible dialogues is simply too large. Thus, a natural question arises: can such a system learn from experience how best to serve its users?

This problem was studied in the framework of reinforcement learning. The dialogue system was viewed as an autonomous agent of the sort depicted in Fig. 1, whose "observations" correspond to recognized utterances and whose "actions" correspond to possible replies. Dialogue policies were modeled as mappings from estimated states to actions. The system was initialized with a hand-crafted policy and rewarded whenever human users completed an assigned task of information retrieval. SINGH et al [76] showed that the percentage of correctly completed tasks increased from 52% to 64% as a result of reinforcement learning. The resulting policy also outperformed more traditional design choices for spoken dialogue systems.
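
For readers less familiar with the formulation, the sketch below shows a generic tabular Q-learning agent for this setting—states summarizing the recognized dialogue so far, actions corresponding to candidate system replies, and a reward of 1 when the user's task is completed. It illustrates the framework only; it is not the algorithm or state representation used by SINGH et al [76].

```python
import random
from collections import defaultdict

class DialoguePolicyLearner:
    """Tabular Q-learning over (dialogue state, system action) pairs."""
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)                 # (state, action) -> estimated value
        self.actions = list(actions)                # candidate system replies / strategies
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        """Epsilon-greedy choice of the next system action."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One-step Q-learning backup; reward is 1 if the task was completed, else 0."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```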

Cobot

ISBELL led the team at AT&T Labs that created and deployed the software agent Cobot [25, 26, 27] in the online world of LambdaMOO. LambdaMOO is one of the oldest continuously operating online worlds, with roots in text-based multi-user role-playing games. These worlds—known as MUDs (multi-user dungeons)—differ from most chat and gaming systems in their use of a persistent virtual space. The mechanisms of social interaction in MUDs are designed to reinforce the illusion that the user is present in the virtual space. LambdaMOO appears as a series of interconnected rooms, populated by users and objects who may move from room to room. Each room provides a chat channel shared by its current inhabitants. The rooms and objects in LambdaMOO are created by users themselves, who devise descriptions and control access by other users. Users can also create objects with methods (or actions) that can be invoked by other players. LambdaMOO represents a long-standing, ongoing experiment in collective programming and creation, with a long history of objects that can be viewed as experiments in AI.

Cobot is a denizen of LambdaMOO and an active member of its social community, which hosts hundreds of regular users. Cobot's goal is to interact with others and to become a vital, useful and accepted part of its social fabric. Toward this end, Cobot tracks actions taken by users, building statistics on who invokes which actions, and on whom they invoke them.


Figure 6: Left: plot of the cumulative number of actions directed over several months at four denizens of LambdaMOO: Cobot, the Cockatoo, and the two most active human users (HFh and EyeBalls). Each dashed vertical line indicates the introduction of a major new feature on Cobot (labeled statistics, emote, chat, and events in the plot); in each case, note the corresponding increase in the rate of its interactions. Right: two examples of conversations with Cobot:

HFh [to cobot]: relate me to Buster
cobot whispers: Here are your relationships with Buster. You like to use: − (62%), poke (7%), hug (3%), eye (3%), nod (2%), hi5, h5, zap, comfort, and grin on each other. Buster is ranked 14 on your list of playmates. You are ranked 1 on Buster's list. Your socializing overlap is 75.4% and your playmate overlap is 33.7%.

cobot [to cyberat]: The Roman system of urban sanitation was forgotten, so that not until rather recent times did the sanitation of European cities equal that of Ancient Rome.
Fawn Guest [to HFh]: cobot's conversation topics just get weirder and weirder
cobot [to cyberat]: In spite of every sign I'm an intelligent being meaning no harm, you continue to hate and fear me.

Using his chatting interface, Cobot can answer queries about these statistics, and describe the statistical similarities and differences between characters; see Fig. 6. Cobot's social statistics provide him with external feedback that can be used for learning and imitation. Cobot's behaviors were optimized by a policy gradient algorithm [81] for reinforcement learning with linear function approximation. Note that Cobot maintains a model of each user in LambdaMOO based on features extracted from the user's social statistics. Policies were learned by balancing multiple sources of reward [69] in a slightly more complicated setting than usual problems in reinforcement learning. Using this approach, Cobot was able to learn non-trivial, state-dependent preferences for a number of users in LambdaMOO. In fact, it was ultimately observed that Cobot became more popular than the human users in LambdaMOO.
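
The sketch below is a generic policy-gradient (REINFORCE-style) learner with a linear softmax policy over user features, included to make the Cobot setup concrete; it is a simplification in the spirit of the algorithm of [81], not the code that was deployed, and the feature vector and learning rate are illustrative.

```python
import numpy as np

class LinearSoftmaxPolicy:
    """Softmax policy over discrete actions with linear scores w[a] . phi(state)."""
    def __init__(self, n_features, n_actions, lr=0.01):
        self.w = np.zeros((n_actions, n_features))
        self.lr = lr

    def probs(self, phi):
        scores = self.w @ phi
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def sample(self, phi, rng=None):
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(self.w), p=self.probs(phi)))

    def reinforce(self, episode):
        """REINFORCE update: for each (features, action, return) triple, raise the
        log-probability of the chosen action in proportion to the return it earned."""
        for phi, action, ret in episode:
            p = self.probs(phi)
            grad = -np.outer(p, phi)        # d log pi(a|phi) / dw, softmax term ...
            grad[action] += phi             # ... plus the indicator term for the chosen action
            self.w += self.lr * ret * grad
```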

Results from prior support

ELLIS is currently involved as a co-PI on the NSF project NSF IIS-0121396 ($1,402,851), Title: ITR/PE+SY: Mapping Meetings: Language Technology to make Sense of Human Interaction, Award period: 2001-09-01 to 2005-08-31, PI: Nelson Morgan, International Computer Science Institute. This project investigates the use of ASR and other methods in signal processing to extract information from audio recordings of natural, unconstrained meetings between human participants. Important goals are automatic event detection and segmentation in meeting audio. Several publications arising from this work are currently under review; the initial data collection and general goals are described in [39].

4 Proposed work

The proposed research will build on the collective expertise sketched in the previous section. Our goal is to build autonomous agents that perceive and navigate the world through sound in complex auditory environments. As we intend these agents to learn from human-computer interaction, it is important that they have engaging behaviors that encourage, rather than stifle, this interaction. Some examples of proposed work are given below. (Many other projects are also planned, but omitted for brevity.)


Real time pitch tracking of overlapping sources (LEE, SAUL)

As discussed in section 2, pitch is one of the strongest cues for separating overlapping sources of sound. One project of this proposal is to extend the real-time front end for pitch tracking in SAUL et al [60] to auditory environments with overlapping sources. This is a classic problem in auditory scene analysis, but a definitive real time solution has yet to emerge. Our goal is to replace the program shown in Fig. 4 by one that displays the pitches of two or more auditory objects (e.g., voices, instruments) being tracked in real time. This will involve the introduction of an auditory filterbank that decomposes incoming sound into narrow frequency bands and a back end for pattern matching that groups resolved harmonics from different sources.

The pattern matching in the back end will be handled by a variant of non-negative matrix factorization (NMF). NMF is an automatic method for discovering the parts of complex objects, recently proposed by LEE and Seung [31, 32]. It was initially shown to discover the component features (eyes, noses, mouths) of human faces, represented as pixel images. Because NMF only allows non-negative combinations of features, it is often more appropriate for "parsing" complex objects than purely linear methods in unsupervised learning, such as principal component analysis [5]. The latter method, applied to images of faces, does not discover eyes or mouths, but "eigenfaces" [84] that do not correspond to obvious percepts and must be combined in a less intuitive way using both positive and negative coefficients.

Tracking the pitches of overlapping sound sources presents a similar problem, but in the auditory domain. To first order, human hearing is not sensitive to phase differences across different auditory channels [38]. Thus, it is only the (non-negative) magnitudes of pitch-related events in different channels that must be parsed into a description of the auditory scene. NMF will be used to help recover the individual components of sound mixtures—for example, the individual musical lines in an ensemble of voices and instruments. This ability is necessary for voice-driven agents that can engage humans in complex auditory environments.
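
For concreteness, here is a minimal sketch of NMF with the multiplicative updates of LEE and Seung [31, 32] (Euclidean objective), applied to a non-negative frequency-by-time magnitude array. In the intended use, each column of W would act as a spectral template for one source and each row of H as that template's activation over time; the harmonic constraints and real-time, incremental updates the proposed system would need are omitted here.

```python
import numpy as np

def nmf(V, rank, iterations=200, seed=0):
    """Factor a non-negative (frequency x time) array V into W @ H with W, H >= 0,
    using multiplicative updates for the Euclidean objective ||V - W H||^2."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + 1e-3    # spectral basis vectors (columns)
    H = rng.random((rank, n_time)) + 1e-3    # activations of each basis vector over time
    for _ in range(iterations):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

For two overlapping voices, V could be the rectified output of the auditory filterbank; reading off which harmonic template dominates each frame would then amount to a rough multi-pitch track.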

The project has several intellectual merits. For the signal processing in the front end, it will investigate a novel mechanism for pitch detection, relying neither on peak picking in the frequency domain [46] nor autocorrelation in the time domain [86], and building on earlier work by SAUL et al [59, 60] in this direction. For the pattern matching in the back end, the application of NMF to source separation can be viewed as part of a larger trend in unsupervised learning, in which powerful new algorithms (e.g., spectral clustering [70], belief propagation [89], locally linear embedding [55]) are being used to process and organize sensory inputs. Most previous applications of these methods, however, have been to problems in computer vision. Their remarkable success in that domain suggests (in line with the general theme of this proposal) that these methods—appropriately tailored—could have many useful applications in machine listening.

Active spatialization (ELLIS, LEE)

Simple spatial cues, such as binaural timing difference, underspecify source location: time difference alone results in a 'cone of confusion' around the interaural axis. Presented with artificial stimuli via headphones, human listeners can make mistakes around this cone such as front/back reversals [87]. In natural circumstances, however, such mistakes are very unusual, even in darkness where visual corroboration is not possible. A major additional cue available in the natural situation comes from head motion: moving the receivers (i.e., rotating the 'head' on which the acoustic sensors are mounted) alters the relative orientation of all external sound sources in a systematic and coherent manner. The change in interaural time difference is specific to the original position, but has a quite different pattern of ambiguity.

The resolution of ambiguities in this way provides a concrete example of active perception in the auditory (as opposed to visual) domain. A simple plane-wave approximation gives a dependence of interaural time difference τ on azimuth θ (relative to the median plane) of τ = d sin θ, where d is the distance between the two 'ears'. Here we can see why front/back reversals occur, since sin θ = sin(π − θ). However, the change in this timing difference as the relative angle between ears and source changes is given by dτ/dθ = d cos θ, which is ambiguous only for the sign of θ (i.e., left/right confusions). Taken together, interaural timing difference and its change resulting from head motion provide a very robust indication of azimuth (even in the face of competing noise sources), but the latter cue is only available when the head is not fixed.
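
A small numerical sketch of this argument, using the plane-wave relations above with d expressed directly in units of travel time: a measured τ is consistent with both θ and π − θ, but the sign of the change in τ under a small, known head rotation selects the correct interpretation. The interaural separation, rotation step, and noise-free measurements are idealizations for illustration.

```python
import numpy as np

d = 0.0005   # interaural separation expressed as travel time (seconds); illustrative value

def itd(theta):
    """Interaural time difference for a plane wave arriving from azimuth theta (radians)."""
    return d * np.sin(theta)

true_theta = np.radians(130.0)                    # a rear source, 130 degrees from the median plane
tau = itd(true_theta)
front, back = np.arcsin(tau / d), np.pi - np.arcsin(tau / d)   # the front/back pair sharing this ITD

delta = np.radians(5.0)                           # rotate the head by a small, known angle
d_tau = itd(true_theta - delta) - tau             # observed change in ITD after the rotation

# each candidate predicts a change d(tau) ~ -d*cos(theta)*delta; pick the better match
candidates = [front, back]
predicted = [-d * np.cos(c) * delta for c in candidates]
best = candidates[int(np.argmin([abs(p - d_tau) for p in predicted]))]
print(np.degrees(best))                           # ~130 degrees: the rear interpretation wins
```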

We will use these ideas in active spatialization to enable the robots in Fig. 3 to perform more complex tasks in auditory localization. Biological organisms can reliably and rapidly use binaural cues to determine the three dimensional location of sound sources [28, 66, 67, 68]; the same ability has yet to be demonstrated in machines equipped with only two microphones. Much of the previous work in binaural analysis has assumed stationary receivers without making use of the head motion cue [12]. The robots at our disposal have the ability to reorient themselves, and thus to collect this information. Note that head motions may be made specifically to sharpen acoustic spatial cues, but motion for any other reason is just as useful, since the timing difference variations can be gathered opportunistically. The important idea is that some self-motion occurs while interaural cues are being measured.

Algorithms in machine learning will play an important role here. In practice, the specific interaural cues indicating source location are delicately related to the frequency-dependent characteristics of the acoustic receiver. These in turn are related in a complicated way to the shapes of the head and ears. While it might be possible to preset a constructed system to reasonable, approximate values, the best way to calibrate the system is on-line, where linked sets of interaural time and level differences, as well as their rates of change under head motion, can be gathered and modeled. In short, successful agents will need to listen and learn. Ground-truth data on actual location (perhaps from vision, or even by dead-reckoning of head motion away from a known reference such as face-on) could also be incorporated into the learning process. This should allow agents to learn an accurate relationship between received acoustic cues and actual object location.
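
As a sketch of the kind of on-line calibration meant here, the fragment below fits a recursive least-squares map from a measured cue vector (say, interaural time difference, level difference, and their rates of change under head motion) to ground-truth azimuth, updating one observation at a time. The cue layout, linear form, and absence of forgetting are assumptions for illustration.

```python
import numpy as np

class OnlineCueCalibrator:
    """Recursive least squares mapping a cue vector (e.g., ITD, ILD, dITD/dt) to azimuth."""
    def __init__(self, n_cues, prior=1e3):
        self.w = np.zeros(n_cues)           # linear map from cues to azimuth (radians)
        self.P = np.eye(n_cues) * prior     # (scaled) inverse covariance of the estimate

    def update(self, cues, azimuth):
        """Incorporate one (cue vector, ground-truth azimuth) observation."""
        cues = np.asarray(cues, dtype=float)
        Pc = self.P @ cues
        gain = Pc / (1.0 + cues @ Pc)
        self.w += gain * (azimuth - cues @ self.w)
        self.P -= np.outer(gain, Pc)

    def predict(self, cues):
        return float(np.asarray(cues, dtype=float) @ self.w)
```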

We also plan to investigate active uses of auditory feedback that facilitate collaboration among multiple, spatially distributed robots. Robots that share auditory inferences about their surrounding environment should be able to perform more complex tasks as teams; see Fig. 3 (right). In particular, improved abilities in auditory localization should help the robots to track each other's location. This would be of incredible benefit for tasks such as search and rescue, occurring in environments where vision is often unreliable.

Unifying frameworks for scene analysis (HOPFIELD, LEE, SAUL)

Vision, audition, and olfaction are all remote senses. Animals rely on one or another of these to understand "what lies out there", to divide the world up into separate objects with individual properties, and to understand where these objects are in space. While these senses seem very different, they are related in goals and algorithms. For example, perhaps the most powerful general grouping idea in animal vision is "what moves together is an object". Precisely this same idea can be seen in Bregman's demonstration [9] of the fusing of synthesized tones into a human singing voice by adding a vibrato. Humans also use this principle to separate overlapping speakers, exploiting the fact that the vibrato from a single voice is coherent across frequencies, while there is no such coherence between voices.

The same idea is also at work in olfaction. It is clear that highly olfactory animals can parse the distant world into objects, even though odors arrive mixed at the nose. Roughly speaking, the vagaries of air turbulence cause odor currents from the same source to fluctuate coherently, like a vibrato. As in audition, this provides enough information over time to parse the mixture into its fundamental components. This is an olfactory form of the "blind source separation problem", one that can be solved by a simple neural network [23], whose synapses, changing with time, do the dynamical part of the computation.


Other similarities between olfactory and auditory processes are also worth noting. Both have a problem with a non-trivial scalar invariant: in the speech case, "local time warp", and in the olfactory case, odor intensity invariance. Both must be able to divide the world into multiple objects, not knowing the objects ab initio, and to identify previously known patterns in the presence of competing sources of signal.

As described in section 3, HOPFIELD and Brody [24] exploited these ideas to build a network of neural-like elements (in simulation) that recognized overlapping utterances of spoken digits. We propose to extend this work in several directions: to incorporate pitch-related events, in addition to onset-related events, into the narrowband feature detectors of such networks; to investigate the ability to recognize smaller units of speech (e.g., phonemes, syllables) as well as non-speech sounds in this framework; and to implement these ideas in a real time embodied agent.

Learning the sounds of natural objects (ELLIS, SINGH)

Research into both psychological and computational audition has repeatedly pointed to the importance of prior knowledge and expectations in the successful analysis of complex sound mixtures [20]. How can an agent with auditory awareness acquire this "world knowledge"? We will work towards a completely autonomous system able to learn classes of interesting sounds and to employ these classes for auditory scene analysis. Although this is clearly the approach employed by biological systems, no automatic system of this kind has been attempted for real-world sounds, and the intellectual merit of this study will likely extend well beyond the auditory modality.

Building on recent work by Reyes-Gomez and ELLIS [50], sound classes will be described by hidden Markov models, or related dynamic-state systems. Broad initial classes will be successively refined as more and more training examples are accumulated. This could be done in an unsupervised way, using criteria such as entropy reduction or likelihood gain to control when new classes are formed. The ultimate goal, however, is to derive these sound classes in the framework of reinforcement learning, tying them to the environmental rewards of an embedded or embodied agent. Investigations in this direction will build on earlier work by SINGH [72, 73, 74, 75, 80] on the problem of abstraction in reinforcement learning.

Given the current set of acquired classes, incoming sound will be analyzed as the best-fitting combination of these models. It is possible to do this directly by fitting the combined output of multiple, simultaneous models to the observations [51]; inference here rapidly becomes intractable, however, as the number of sources modeled exceeds two or three [21]. Recognition is greatly simplified if observation features are based on limited time-frequency regions and the assumption is made that a single source dominates within each region [54]; then, rather than generating combined models on the fly, recognition can be reduced to the simpler task of allocating feature cells to different models, each of which then returns a likelihood of match based on missing-data techniques [2, 3]. This decomposition requires only a single 'clean' model for each source, and also allows for a term modeling the likelihood of any particular pattern of visible and obscured observations.
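
To illustrate the missing-data idea, the sketch below scores a 'clean' source model against one spectral frame using only the time-frequency cells allocated to that source; for the diagonal Gaussian assumed here, marginalizing out the obscured cells amounts to dropping their terms from the log-likelihood. A single Gaussian stands in for the HMM state distributions an actual system would use, and the toy frame and models are synthetic, for the example only.

```python
import numpy as np

def missing_data_loglik(frame, mask, mean, var):
    """Log-likelihood of the visible cells of `frame` under a diagonal Gaussian source model;
    cells where mask is False are treated as obscured and marginalized away (i.e., dropped)."""
    x, m, v = frame[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (x - m) ** 2 / v)

# toy usage: decide which of two clean models better explains the cells allocated to a source
rng = np.random.default_rng(1)
frame = rng.random(32)                     # one magnitude-spectrum frame (32 frequency cells)
mask = rng.random(32) > 0.4                # cells allocated to the candidate source
models = {"speech": (np.full(32, 0.6), np.full(32, 0.05)),
          "noise":  (np.full(32, 0.3), np.full(32, 0.20))}
scores = {name: missing_data_loglik(frame, mask, m, v) for name, (m, v) in models.items()}
print(max(scores, key=scores.get))
```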

Automatic acquisition of training data comprises another unprecedented area of study. The simplest, but weakest, approach is simply to rely on occasional exposure to 'clear views' of particular sound events—instances in which, fortuitously, other sound sources are silent or negligible. Given limitless training exposure, this approach will eventually succeed, requiring only some measure of signal 'simplicity' to distinguish between isolated and overlapped exemplars. Spectral continuity could be used here, in that an unobstructed sound source will have a spectrum that evolves smoothly through its duration, whereas the superposition of several sounds has abrupt spectral transitions as components begin and end.
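
A short sketch of the spectral-continuity heuristic: flag frames of a magnitude spectrogram as candidate 'clear views' when the frame-to-frame spectral change (spectral flux) stays below a threshold. The normalization and threshold are illustrative assumptions.

```python
import numpy as np

def clear_view_frames(spectrogram, flux_threshold=0.1):
    """Return a boolean mask over frames whose normalized spectral flux is low, i.e., whose
    spectrum evolves smoothly; `spectrogram` is a non-negative (frequency x time) array."""
    S = spectrogram / (spectrogram.sum(axis=0, keepdims=True) + 1e-9)   # per-frame normalization
    flux = np.abs(np.diff(S, axis=1)).sum(axis=0)                       # change between frames
    return np.r_[False, flux < flux_threshold]                          # first frame has no history
```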


A more powerful approach would be to use missing-data techniques, invoked above to help with recognition, in the training stage also. Given a mechanism to divide the spectrum into regions belonging to an individual source—such as direct modeling or inference from the most likely model decomposition—these partial observations could be used for subsequent training, using methods in probabilistic inference to fill in the missing data (as in the Expectation-Maximization algorithm [17]). The strength of this approach is that class modeling can continue in every acoustic condition, rather than relying on infrequent, highly favorable conditions. The disadvantage is that some of the inferred training information might be incorrect, but we expect this to be easily outweighed by the benefits of much larger training sets.

Communicative agents (ISBELL, SINGH)

We intend to build on our previous successes using reinforcement learning to improve the quality of human-computer interaction. Rather than focusing on the medium of natural language, however, we will investigate voice-driven agents that respond to non-linguistic cues. These agents will engage humans in real time, exploiting the low-level algorithms developed by other investigators, and in turn suggesting ideas for other useful forms of auditory feature extraction. A major innovation will be the use of auditory features (e.g., pitch, intensity, binaural differences) to communicate state and/or payoff information in real time embedded agents, trained by reinforcement learning. These features can be used as components of a state feature vector, or as special observations that directly communicate payoff information.

As a motivating example, consider the problem of training a dog to perform various "tricks", such as sitting down, wagging its tail, or fetching a ball. Voice modulation in this setting is part of the command (state), as well as the reinforcement signal (payoff), with features such as pitch and intensity carrying at least as much information as the trainer's particular choice of words. A specific goal of the proposal is to equip an Aibo robotic dog so that it can be trained in this way, without placing any restrictions on the vocabulary, language, or manner of the trainer's speech. We believe that this exercise addresses the main obstacle to the current use of reinforcement learning in human-centered environments. Many systems that ask human users to provide sustained feedback are unnatural and disruptive to the point that users decline the effort. Humans have little tolerance for half-second delays, restricted vocabularies, or close-talking microphones. Though we make the problem of machine communication harder by working with real-time constraints and natural speech, we make it easier by focusing on features such as prosody and stress as opposed to natural language. The exploration of this tradeoff is vital for the future of voice-driven agents in AI [47, 77].
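
As a sketch of how voice modulation could carry the payoff signal, the fragment below maps two crude prosodic features of a detected utterance—mean pitch and RMS intensity—to a scalar reward for the learner. The normalizations and weights are illustrative assumptions, not a mapping specified by the proposal.

```python
import numpy as np

def prosodic_reward(utterance, mean_pitch_hz):
    """Map an utterance (waveform samples) and its estimated mean pitch to a reward in (-1, 1):
    loud, high-pitched speech (praise) maps toward +1; low, quiet speech maps toward -1."""
    rms = np.sqrt(np.mean(np.asarray(utterance, dtype=float) ** 2))
    pitch_z = (mean_pitch_hz - 180.0) / 60.0                 # rough pitch normalization
    loud_z = (20.0 * np.log10(rms + 1e-9) + 20.0) / 10.0     # rough loudness normalization (dB)
    return float(np.tanh(0.7 * pitch_z + 0.3 * loud_z))

# the resulting reward would then drive whatever reinforcement-learning update the agent uses,
# e.g. Q[s, a] += alpha * (prosodic_reward(u, f0) + gamma * max(Q[s2]) - Q[s, a])
```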

Other agents will also be investigated. One goal is to build a wearable agent [13, 35, 52, 57, 62], analogous to Cobot (in Fig. 6), that listens to the sounds of daily activities and constructs an auditory profile of each day. Such an agent would be able to report summary features, such as the fraction of the day spent in conversations, the fraction of quiet time, and the number of different people encountered. A longer-term goal is for the agent to construct an emotional profile of each day; this would require the agent to identify (for example) periods of laughter, loud speech, and other distinctive exchanges. The intellectual challenge of this project is to develop agents that acquire an understanding of emotional and physical state through continual voice processing [47, 77]. Efforts in this direction will complement a concurrent project (funded by Intel) on emotion-aware cognitive orthotic agents, which is being pursued by SINGH and a colleague, Martha Pollack, at the University of Michigan. Cognitive orthotic agents are needed to issue timely and effective reminders to elderly people who suffer from mild cognitive impairments, occasionally forgetting to take their medicine, use the restroom, drink enough water, etc. Current plans for these agents rely on voice interaction with human users. The ability to sense emotional and physical states will improve the quality of human-computer interaction and, ultimately, help the reminder system meet its caregiving objectives.
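
A rough sketch of the aggregation step for the wearable agent appears below: per-frame sound labels (from some upstream classifier, not specified here) are reduced to daily summary statistics of the kind mentioned above. The label set is hypothetical.

```python
# Sketch of turning a stream of per-frame sound labels into a daily summary.
# Only the aggregation step is shown; the upstream classifier is assumed.
from collections import Counter

def daily_profile(frame_labels, frame_sec=1.0):
    """frame_labels: e.g. ['speech', 'quiet', 'laughter', ...], one per frame."""
    counts = Counter(frame_labels)
    total = max(len(frame_labels), 1) * frame_sec
    return {
        "hours_logged": total / 3600.0,
        "fraction_in_conversation": counts["speech"] * frame_sec / total,
        "fraction_quiet": counts["quiet"] * frame_sec / total,
        "laughter_episodes": counts["laughter"],  # crude proxy for the emotional profile
    }
```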

5 Broader impact

Integration of education and research
The work in this proposal, involving real-time embedded and embodied agents, is extremely well-suited to educating the public at large about the rewards and challenges of scientific research. Many of the investigators have already used existing projects to these ends, and the proposed research should support further efforts in these areas. General media outlets such as the New York Times have also featured previous work by the investigators on sensory computation [6] and autonomous agents [18].

Initiatives for the general public
LEE has a long track record of involvement in broader educational initiatives, especially those benefiting his local community. As a scientist at Bell Labs, he gave public lectures at New Providence High School and participated in "World of Science" days that brought NJ high school students to weekend seminars at Bell Labs. As a faculty member at the University of Pennsylvania, he has started a collaboration with the Franklin Institute, located in Philadelphia, as part of their program for Partnerships for Achieving Careers in Technology and Science (PACTS). In January 2003, as part of this collaboration, he led hands-on demonstrations with legged robots (see Fig. 3) at the Institute's Annual Robot Day. The proposed research should lead to even more impressive demonstrations of this sort.

Synergies with undergraduate and graduate coursework
The investigators teach several core courses in computer science and electrical engineering. The proposed work dovetails very well with these courses: Artificial Intelligence (SAUL, SINGH), Digital Signal Processing (ELLIS, LEE), Machine Learning (SINGH), and Speech and Audio Processing (ELLIS). More generally, we believe the proposed research will have a profound impact on the organization of computer science and engineering departments, ultimately leading to the same integration of researchers in AI and auditory computation as has previously occurred in computer vision and robotics.

Pedagogical tools for the classroom
Experience has shown that real-time front ends and hands-on demonstrations are valuable pedagogical tools. The investigators are teaching several specialized courses that lend themselves to this style of teaching. SAUL frequently uses his real-time pitch tracking work to demonstrate ideas in voice processing, both in his current seminar on Machine Listening and in guest lectures for classes outside his department. Next year, LEE is spearheading a fast-paced undergraduate laboratory course integrating elements of microprocessor design, interfacing protocols, and real-time algorithms; he is also leading his institution's RoboCup team (see Fig. 3). These forums will provide an excellent opportunity for students to learn more about artificial sensorimotor systems, including the voice-driven agents in this proposal. ISBELL is currently developing and co-teaching a course on Adaptive and Personalized Information Environments. This is a project-driven course that brings together graduate students in machine learning and human-computer interaction. Students are taught how to build systems that adapt to the needs of their users based on cues from speech events, online activities, and physical locations. The proposed research could seed many projects in this course.

Reaching under-represented communities
ISBELL has long been interested in ways to increase the number of underrepresented minorities in computing. As the first African-American Ph.D. candidate in the AI Lab at M.I.T., he was invited to participate in an NSF-sponsored panel on increasing minority participation in computing. This led (indirectly) to the creation of the Coalition to Diversify Computing and the Institute of African American E-Culture, in which he has remained an active member. At the University of Michigan, SINGH is chairing the department committee charged with improving the retention of female undergraduates.

Research has shown that female and minority students are more likely to pursue fields of study that address social needs and present significant opportunities for collaboration with their peers. As voice is a primary medium of social interaction, and as our proposed work is inherently collaborative (requiring many different areas of expertise), we believe it could serve as a vehicle for attracting female and minority students to careers in computing.

Public domain software
The investigators are committed to providing useful software in the public domain. ELLIS is the leading author of numerous public domain software tools for audio processing, including a toolkit of audio extensions to the Tcl language that were eventually incorporated into the standard Snack package, and a complete large-vocabulary neural-net-based speech recognizer distributed by the International Computer Science Institute (Berkeley). Roweis and SAUL have previously made source code available for nonlinear dimensionality reduction [55], and SAUL and ISBELL are currently working to release free software for real-time pitch tracking with audiovisual feedback. The proposed research should lead to several additional tools in speech and audio processing. These tools will enable interested researchers to develop real-time interactive audio applications without having to re-invent optimized, low-level routines. We also intend to distribute applets based on these tools that can be used as educational aids in the classroom.

Applications of benefit to society at large
Though the proposed research focuses on fundamental questions in science and engineering, we expect it to seed many applications. For example, our work [60] in real-time pitch tracking with audiovisual feedback (see Fig. 4) has already generated inquiries from a diverse mix of "consumers", including audio engineers, museum artists, toy manufacturers, electronic musicians, foreign language instructors, and neuroscientists studying learning in songbirds. As there already exists a large literature on pitch tracking (as well as many public domain implementations), it was specifically the real-time audiovisual feedback that generated this interest. For example, the language instructors felt that visual feedback might be useful for teaching prosody; the neuroscientists wondered if mistuned audio feedback could be used to probe the songbird's sensorimotor development. The proposed research is aimed at producing far more powerful devices: robust, real-time auditory front ends for source separation, speaker tracking, and emotion detection. Such innovations would lead to correspondingly greater commercial and scientific opportunities.

6 Plan

We envision a high degree of collaboration between the institutions participating in this proposal. The low-level work in auditory processing will be carried out by the investigators at Columbia University (ELLIS), Princeton University (HOPFIELD), and the University of Pennsylvania (LEE, SAUL). As these institutions are in close proximity, we plan to schedule joint meetings four times per year on rotating campuses, involving not only the PIs, but also students and postdocs. Local investigators will attend all these meetings, while SINGH at the University of Michigan and ISBELL at the Georgia Institute of Technology will participate in at least one meeting per year. The high-level work on behavior in autonomous agents will be carried out mainly by the labs of ISBELL, LEE, and SINGH. The first agents will be equipped with the real-time front ends for localization and pitch tracking already developed by LEE [30] and SAUL et al. [60]. As more sophisticated front ends emerge, they will be shipped immediately to the high-level investigators.

We plan to organize a special workshop on Auditory Intelligence at the midway point of the proposal's five-year span. There will also be regular opportunities for personal contact at the annual winter conference on Neural Information Processing Systems, which most of the PIs attend. Over the five-year period of the proposal, we expect the collaboration to yield a progression of software and hardware agents that operate in increasingly complex auditory environments and gradually approach the robustness of human listeners.

References

[1] J. B. Allen. How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2:567–577, 1994.

[2] J. Barker, M. Cooke, and D. P. W. Ellis. Combining bottom-up and top-down constraints for robust ASR: the multisource decoder. In Proceedings of the CRAC-2001 Workshop, 2001.

[3] J. Barker, M. Cooke, and D. P. W. Ellis. Decoding speech in the presence of other sources. Speech Communication, 2003. Submitted.

[4] D. P. Bertsekas and D. A. Castanon. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5:89–108, 1999.

[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1996.

[6] S. Blakeslee. A rule of thumb that unscrambles the brain. New York Times, page F3, October 3, 2000.

[7] E. Bocchieri, G. Riccardi, and J. Anantharaman. The 1994 AT&T ATIS CHRONUS recognizer. In Proceedings of the 1995 ARPA Spoken Language Technology Workshop, pages 265–268, January 1995.

[8] H. Bourlard, H. Hermansky, and N. Morgan. Towards increasing speech recognition error rates. Speech Communication, 18(3):205–231, June 1996.

[9] A. S. Bregman. Auditory Scene Analysis: the Perceptual Organization of Sound. M.I.T. Press, Cambridge, MA, 1994.

[10] G. J. Brown. Computational auditory scene analysis: A representational approach. PhD thesis, CS dept., Univ. of Sheffield, 1992.

[11] M. S. Campbell, A. J. Hoane, and F. H. Hsu. Deep Blue. Artificial Intelligence, 134(1–2):57–83, 2002.

[12] W. Chau and R. O. Duda. Combined monaural and binaural localization of sound sources. In Proceedings of the 29th Asilomar Conference on Signals, Systems, and Computers, November 1995.

[13] B. Clarkson, N. Sawhney, and A. Pentland. Auditory context awareness in wearable computing. In Proceedings of the Perceptual User Interfaces Workshop, 1998.

[14] M. Cooke and D. P. W. Ellis. The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35:141–177, 2001.

[15] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 8. MIT Press, 1996.

[16] C. J. Darwin and R. P. Carlyon. Auditory grouping. In B. C. J. Moore, editor, The Handbook of Perception and Cognition, Vol. 6, Hearing, pages 387–424. Academic Press, 1995.

[17] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

[18] A. Eisenberg. Find me a file, cache me a catch. New York Times, page G1, February 10, 2000.

[19] D. P. W. Ellis. Prediction-driven Computational Auditory Scene Analysis. PhD thesis, MIT Dept. of Electrical Engineering and Computer Science, 1996.

[20] D. P. W. Ellis. Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis, and its application to speech/nonspeech mixtures. Speech Communication, 27(3–4):281–298, 1999.

[21] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.

[22] M. Goto. A predominant-F0 estimation method for CD recordings: MAP estimation using EM algorithm for adaptive tone models. In Proceedings of the 2001 International Conference on Acoustics, Speech, and Signal Processing, 2001.

[23] J. J. Hopfield. Pattern recognition computation using action potential timing for stimulus representation. Nature, 376:33–66, 1995.

[24] J. J. Hopfield and C. D. Brody. What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. Proceedings of the National Academy of Sciences, 98:1282–1287, 2001.

[25] C. L. Isbell, M. Kearns, D. Kormann, S. Singh, and P. Stone. Cobot in LambdaMOO: A Social Statistics Agent. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, 2000.

[26] C. L. Isbell, C. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. In Proceedings of the Fifth International Conference on Autonomous Agents, 2001.

[27] M. Kearns, C. L. Isbell, S. Singh, D. Litman, and J. Howe. CobotDS: A spoken dialogue system for chat. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, 2002.

[28] M. Konishi. Listening with two ears. Scientific American, 268:66–73, 1993.

[29] D. D. Lee. Dimensionality reduction for sensorimotor learning in mobile robotics. In Proceedings of the International Society for Optical Engineering: Applications and science of neural networks, fuzzy systems, and evolutionary computation, volume V, 2002.

[30] D. D. Lee and H. S. Seung. Learning in intelligent embedded systems. In Usenix Workshop on Embedded Systems, 1999.

[31] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788–791, 1999.

[32] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[33] E. Levin and R. Pieraccini. A stochastic model of computer-human interaction for learning dialogue strategies. In Proceedings of the Eurospeech Conference, pages 1883–1886, Rhodes, Greece, 1997.

[34] R. P. Lippmann. Speech recognition by machines and humans. Speech Communication, 22:115, 1997.

[35] S. Mann. Wearable computing: A first step toward personal imaging. IEEE Computer, 30(2):25–32, 1997.

[36] K. D. Martin. Sound-source recognition: A theory and computational model. PhD thesis, MIT Media Lab, 1999.

[37] M. Mohri, F. C. N. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88, 2002.

[38] B. C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, 1997.

[39] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke. The meeting project at ICSI. In Proceedings of the Human Language Technology Conference, 2001.

[40] T. Nakatani and H. Okuno. Harmonic sound stream segregation using localization and its applications to speech stream segregation. Speech Communication, 27(3–4):299–310, 1999.

[41] M. R. Naphade, T. T. Kristjansson, B. J. Frey, and T. S. Huang. Probabilistic multimedia objects (multijects): A novel approach to video indexing and retrieval in multimedia systems. In Proceedings of the 1998 IEEE International Conference on Image Processing, 1998.

[42] H. Nock and S. Young. Modelling asynchrony in automatic speech recognition using loosely coupled HMMs. Cognitive Science, 26(3):283–301, 2002.

[43] A. M. Noll. Cepstrum pitch determination. Journal of the Acoustical Society of America, 41(2):293–309, 1967.

[44] A. M. Noll. Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate. In Proceedings of the Symposium on Computer Processing in Communication, pages 779–798, April 1969.

[45] M. Ostendorf, V. Digalakis, and O. Kimball. From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378, September 1996.

[46] T. W. Parsons. Separation of speech from interfering speech by means of harmonic selection. Journal of the Acoustical Society of America, 60(4):911–918, 1976.

[47] R. W. Picard. Affective Computing. MIT Press, 1997.

[48] L. R. Rabiner. On the use of autocorrelation analysis for pitch determination. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25:22–33, 1977.

[49] L. R. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.

[50] M. J. Reyes-Gomez and D. P. W. Ellis. Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling. In Proceedings of the IEEE International Conference on Multimedia & Expo, 2003. Submitted.

[51] M. J. Reyes-Gomez, B. Raj, and D. P. W. Ellis. Multi-channel source separation by factorial HMMs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2003. To appear.

[52] B. J. Rhodes. The wearable remembrance agent: A system for augmented memory. Personal Technologies, 1:218–224, 1997.

[53] R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 2000.

[54] S. T. Roweis. One microphone source separation. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.

[55] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[56] D. Roy and A. Pentland. Learning words from sights and sounds: A computational model. Cognitive Science, 26(1):113–146, 2002.

[57] D. Roy, N. Sawhney, C. Schmandt, and A. Pentland. Wearable audio computing: A survey of interaction techniques. Vision and Modeling Technical Report 434, MIT Media Lab, 1997.

[58] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.

[59] L. K. Saul and J. B. Allen. Periodic component analysis: an eigenvalue method for representing periodic structure in speech. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[60] L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun. Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[61] L. K. Saul, M. G. Rahim, and J. B. Allen. A statistical model for robust integration of narrowband cues in speech. Computer Speech and Language, 15(2):175–194, 2001.

[62] N. Sawhney and C. Schmandt. Speaking and listening on the run: Design for wearable audio computing. In Proceedings of the International Symposium on Wearable Computing, 1998.

[63] S. Schaal, D. Sternad, and C. Atkeson. One-handed juggling: A dynamical approach to a rhythmic movement task. Journal of Motor Behavior, 28(2):165–183, 1996.

[64] M. R. Schroeder. Period histogram and product spectrum: new methods for fundamental frequency measurement. Journal of the Acoustical Society of America, 43(4):829–834, 1968.

[65] B. G. Secrest and G. R. Doddington. An integrated pitch tracking algorithm for speech systems. In Proceedings of the 1983 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1352–1355, Boston, 1983.

[66] S. A. Shamma. On the role of space and time in auditory processing. Trends in Cognitive Sciences, 5:340–348, 2001.

[67] S. A. Shamma, N. Shen, and P. Gopalaswamy. Stereausis: binaural processing without neural delays. Journal of the Acoustical Society of America, 86:989–1006, 1989.

[68] S. A. Shamma, X. Yang, and K. Wang. Auditory representation of acoustic spectrum. IEEE Transactions on Information Theory, 38:824–839, 1992.

[69] C. R. Shelton. Balancing multiple sources of reward in reinforcement learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.

[70] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 888–905, August 2000.

[71] K. Shinoda and C.-H. Lee. A structural Bayes approach to speaker adaptation. IEEE Transactions on Speech and Audio Processing, 9(3):276–287, March 2001.

[72] S. Singh. Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 202–207, San Jose, CA, July 1992. AAAI Press/MIT Press.

[73] S. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth Machine Learning Conference, pages 406–415, Aberdeen, Scotland, July 1992. Morgan Kaufmann.

[74] S. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proceedings of the Eleventh International Conference, pages 284–292, New Brunswick, New Jersey, July 1994. Morgan Kaufmann.

[75] S. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 7, pages 703–710. Morgan Kaufmann, 1995.

[76] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.

[77] M. Slaney and G. McRoberts. BabyEars: A recognition system for affective vocalizations. Speech Communication, 39(3–4):367–384, 2003.

[78] W. D. Smart and L. P. Kaelbling. Effective reinforcement learning for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, 2002.

[79] K. Stevens. Acoustic Phonetics. M.I.T. Press, Cambridge, MA, 1999.

[80] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[81] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.

[82] G. J. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, March 1995.

[83] S. Thrun, W. Burgard, and D. Fox. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, 31(1–3):29–53, 1998.

[84] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71–86, 1991.

[85] R. M. Warren. Perceptual restoration of missing speech sounds. Science, 167:392–393, 1970.

[86] M. Weintraub. A theory and computational model of auditory monaural sound separation. PhD thesis, Department of Electrical Engineering, Stanford University, 1985.

[87] F. L. Wightman and D. J. Kistler. Resolution of front-back ambiguity in spatial hearing by listener and source movement. Journal of the Acoustical Society of America, 105:2841–2853, 1999.

[88] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search, and retrieval of audio. IEEE Multimedia, 3:27–36, 1996.

[89] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[90] S. Young. Statistical modeling in continuous speech recognition. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 562–571, August 2001.

[91] T. Zhang and C.-C. J. Kuo. Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4):441–457, 2001.

[92] W. Zhang and T. G. Dietterich. High-performance job-shop scheduling with a time delay TD(λ) network. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 8, pages 703–710. MIT Press, 1995.
