Audient: An Acoustic Search Engine Student: Ted Leath Supervisor: Prof. Paul Mc Kevitt School of Computing and Intelligent Systems Faculty of Engineering

Audient: An Acoustic Search Engine

Student: Ted Leath

Supervisor: Prof. Paul Mc Kevitt

School of Computing and Intelligent Systems

Faculty of Engineering

University of Ulster, Magee

Aims and Objectives

• Development of Audient as a speech-centric, non-lexical search engine capable of handling multimodal queries for retrieving spoken audio information

• Explore the efficacy of using standards-based phonogrammic streams as an internal data representation for storing, indexing, searching and retrieving spoken audio information

• Compare the performance of optional compound strategies for the abstraction and refinement of standards-based phonogrammic streams

• Design, implement, refine and test Audient

• Demonstrate research results, comparing Audient with other existing system architectures

Literature Review

• Information Retrieval

• Automatic Speech Recognition and Spoken Document Retrieval

• Current and previous research in SDR systems

• Public access SDR systems

• Commercial ASR and audio mining products

• Sub-word based approaches to SDR

• Transcripts, annotation and phonogrammic streams

• Speech and non-speech audio

Information Retrieval

• Typical Information Retrieval (IR) tasks involve the retrieval of relevant information items from various types of documents by matching a user request or query.

• IR covers documents in media other than text, such as images, video and audio. Audio recordings of speech can be referred to as spoken documents.

Automatic Speech Recognition

• ASR attempts to mimic the human capacity for recognising speech by enabling a computer to identify spoken words and/or sub-word units. Most current ASR systems are lexical in nature, and conceptually follow the processes of encoding and decoding introduced in the figure below:

(adapted from Young et al., 2002)

Spoken Document Retrieval

• A significant amount of research has been conducted in SDR, and performance evaluations like the Text REtrieval Conference (TREC) have encouraged development and the sharing of information. A diagram representing a typical TREC SDR process is reproduced below:

(Garofolo et al., 2000)

SDR Systems

• CMU Informedia I, Informedia II and Sphinx Projects (Hauptmann and Witbrock, 1997)

• Video Mail Retrieval and Multimedia Document Retrieval projects (Jones et al., 1997, Spärck Jones et al., 2001)

• SCAN (Choi et al., 1998 and Choi et al., 1999)

• THISL and Abbot (Abberley et al., 1998, Abbot, 1999)

• Taiscéalaí (Smeaton et al., 1998)

Public Access SDR Systems

• SpeechBot (Quinn, 2000, Van Thong et al., 2001)

• National Public Radio (NPR) Online (NPR, 2000, NPR Archives, 2004)

• SpeechFind and The National Gallery of the Spoken Word (Hansen et al., 2004, Zhou and Hansen, 2002)

Commercial ASR and Audio Mining Products

• BBN Rough ‘n’ Ready (Kubala et al., 1999)

• Nexidia Fast-Talk and Convera RetrievalWare (Clements et al., 2001a, Clements et al., 2001b)

• ScanSoft (Network Speech, 2004, Embedded Speech, 2004, MediaIndexer, 2004, NaturallySpeaking, 2005, AudioMining, 2005, Xmode, 2004)

• Virage AudioLogger (Virage, 2004)

• Nuance (Nuance, 2005)

• AT&T SCANMail (Hirschberg et al., 2001 and SCANMail, 2003)

• Microsoft Speech Server (MSS, 2005)

Sub-word Based Approaches to SDR

• Wechsler (Wechsler, 1998)

• Ng, K. (Ng, 2000)

• Glavitsch and Schäuble (Glavitsch and Schäuble, 1992)

• Ng, C. (Ng, 2001)

Other sub-word research efforts include Larson (2001) and Moreau et al. (2004).
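As an illustration of the general sub-word idea, and not the method of any system cited above, spoken documents can be indexed by overlapping phoneme n-grams so that a query matches without a word-level lexicon. The sketch below assumes phoneme streams are already available as lists of ARPAbet-style symbols; the names ngrams, build_index, search and the sample clips are hypothetical.

```python
from collections import defaultdict


def ngrams(phones, n=3):
    """Return the overlapping n-grams of a phoneme sequence as tuples."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(documents, n=3):
    """Map each phoneme n-gram to the (document id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, phones in documents.items():
        for position, gram in enumerate(ngrams(phones, n)):
            index[gram].add((doc_id, position))
    return index


def search(index, query_phones, n=3):
    """Rank documents by how many distinct query n-grams they contain."""
    scores = defaultdict(int)
    for gram in set(ngrams(query_phones, n)):
        for doc_id in {doc_id for doc_id, _ in index.get(gram, ())}:
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical phoneme streams (ARPAbet-style symbols, stress marks omitted).
DOCUMENTS = {
    "clip_a": "SH IY S EH L Z S IY SH EH L Z".split(),  # "she sells sea shells"
    "clip_b": "B AY DH AH S IY SH AO R".split(),        # "by the seashore"
}
INDEX = build_index(DOCUMENTS)
QUERY = "S IY SH EH L Z".split()                         # "sea shells"
print(search(INDEX, QUERY))
```

Storing the position alongside the document id is a design choice made here so that a match can be mapped back to a location in the audio; a real system would need approximate matching to tolerate recognition errors.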

Phonogrammic Streams

Orthographical representations of phonemic streams. This abstraction is ancient, and partially inherent in the English alphabet.

Egyptian hieroglyphs with semantic and phonetic value. Ref. http://www.omniglot.com/writing/egyptian.htm

Transcription

• 1-best transcriptions

• N-best transcriptions

• Lattices or graphs

[Figures: example recognition output for "SILENCE HARD ROCK SILENCE" (Fundamentals, 2005)]
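The relationship between the three output forms can be sketched in code: a lattice is a graph of competing word hypotheses, the N-best list is its N most probable paths, and the 1-best transcription is the first of these. The lattice below is a made-up example around the phrase shown on the slide; the alternative words HEART and RACK, the log probabilities and the function name n_best are all assumptions for illustration.

```python
import heapq

# Hypothetical lattice: node -> list of (word, log probability, next node).
LATTICE = {
    "start": [("SILENCE", -0.1, "n1")],
    "n1": [("HARD", -0.4, "n2"), ("HEART", -1.2, "n2")],
    "n2": [("ROCK", -0.3, "n3"), ("RACK", -1.5, "n3")],
    "n3": [("SILENCE", -0.1, "end")],
}


def n_best(lattice, n, start="start", end="end"):
    """Return up to n (words, log probability) paths, most probable first."""
    # Heap entries are (cost, words, node), where cost = -log probability.
    # Costs never decrease along a path, so complete paths are popped in
    # order of increasing cost (uniform-cost search).
    heap = [(0.0, [], start)]
    results = []
    while heap and len(results) < n:
        cost, words, node = heapq.heappop(heap)
        if node == end:
            results.append((words, -cost))
            continue
        for word, log_prob, next_node in lattice.get(node, []):
            heapq.heappush(heap, (cost - log_prob, words + [word], next_node))
    return results


for words, log_prob in n_best(LATTICE, 2):
    print(" ".join(words), round(log_prob, 2))
```

Calling n_best with n=1 gives the 1-best transcription; larger n gives the N-best list.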

Annotation - Markup Languages and MPEG-7

• SSML

• VoiceXML

• SALT

• XHTML+Voice profile

All of the above markup languages contain SSML as a subset

• MPEG-7 and spoken content

Non-Speech Audio Retrieval

• MELDEX

• Musipedia (Melodyhound/Tuneserver)

• Sonoda

• Super MBox

• MIRACLE

• SMILE

• Shazam

• Name That Clip

• The Humdrum Toolkit

• Themefinder

• Boogeebot

• Muscle Fish

Humans process speech differently from non-speech acoustic information.

Project Proposal

Audient Architecture

Audient Core Modules

[Figure: Audient core modules and the data flows between them]

Modules shown:

• Queries and Table Input

• Phonemic Recognition and Abstraction

• Stream to Speech

• Text to Stream

• Create Translation Table

• Phonogrammic Streams, Location, Temporal Information and Indexing

Data flow labels: Text Query; Speech Query; Digitised Audio Stream; Phonogrammic Stream; Digitised Audio Stream and Location; Phonogrammic Match Request; Phonogrammic Match Answer; Synthetic Speech; Text Table Component; Phonogrammic Table Component; Text; Converted Phonogrammic Stream; Phonogrammic Query Result; Text Translation Information; Audio Stream Replay; Location and Temporal Reference; Phonogrammic Stream, Location and Temporal Information; Text Translation Table; Text for Translation; Phonogrammic Translation
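The Text to Stream and Create Translation Table labels above suggest a lookup from text to phonogrammic symbols. A minimal sketch of that idea follows, assuming a small hand-written table with approximate ARPAbet-style pronunciations; TRANSLATION_TABLE and text_to_stream are hypothetical names, and the entries are illustrative only and not copied from any pronouncing dictionary.

```python
import string

# Hypothetical translation table: word -> phonogrammic symbols.
# Pronunciations are approximate and stress marks are omitted.
TRANSLATION_TABLE = {
    "she": ["SH", "IY"],
    "sells": ["S", "EH", "L", "Z"],
    "cells": ["S", "EH", "L", "Z"],
    "sea": ["S", "IY"],
    "c": ["S", "IY"],
    "shells": ["SH", "EH", "L", "Z"],
}


def text_to_stream(text, table=TRANSLATION_TABLE):
    """Convert a text query to a flat list of phonogrammic symbols."""
    stream = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        if not word:
            continue
        if word not in table:
            # A fuller system would need a fallback such as letter-to-sound rules.
            raise KeyError(f"no translation table entry for {word!r}")
        stream.extend(table[word])
    return stream


print(text_to_stream("She sells sea shells"))
print(text_to_stream("she cells C shells"))
```

Under this table the two queries produce the same stream, which is the sense in which a phonogrammic representation is indifferent to spelling.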

Audient Parrots

Phonetic and Temporal Abstraction

[Figure: composition of an Audient Parrot. Labels: Audio Speech File 1; Speech Recognition Engine; Compound Strategies (or none); Phonogrammic Stream; Text to Speech; Audio Speech File 2]

[Figure: Functional diagram for an Audient Parrot. Labels: Document 1; Audient Parrot; Document 2]

• Reader speaks text to Audient Parrot

• Writer records text from Audient Parrot

• Document 1 and Document 2 are compared

Determining recognition differences

She sells sea shells by the seashore.

She cells C shels bye the sea shore
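One simple way to quantify the difference between two such texts is a word-level edit distance. The sketch below is an assumption about how a comparison could be scored, not the evaluation method of the project; tokenise and edit_distance are hypothetical names.

```python
import string


def tokenise(text):
    """Lower-case the text, strip punctuation and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists (unit costs)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


REFERENCE = "She sells sea shells by the seashore."
RECOGNISED = "She cells C shels bye the sea shore"

ref_words = tokenise(REFERENCE)
hyp_words = tokenise(RECOGNISED)
errors = edit_distance(ref_words, hyp_words)
print(errors, "word edits against", len(ref_words), "reference words")
```

At the word level most of the sentence counts as wrong even though the two texts sound alike, which is why a comparison on phonogrammic symbols may be more informative than a comparison on spelling.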

Comparison with Previous Work

Software Analysis

• Hidden Markov Model Toolkit (HTK)

• LVCSR and CSLU Toolkit

• Sphinx-2, Sphinx-3, Sphinx-4

• TIMIT

• Linux and C++

• Perl and PHP

• Festival

• The CMU Pronouncing Dictionary

• SSML, VoiceXML, SALT and X+V

• The Apache Web Server

Possible IR and Monitoring Applications

• Indexing, search and retrieval of Internet audio files

• Indexing, search and retrieval of broadcast media

• Services for the blind

• Library services

• Surveillance and intelligence gathering

• Voice mail

• Audio mining and trend analysis (topic detection and tracking)

Possible Philosophical and Cognitive Research Applications

• Artificial self-learning systems

• Philosophical investigations of speech-centric versus text-centric methods

• Research models for cognitive science and consciousness theories

• Examination of behaviourist versus cognitive semantic recognition of speech

Project Schedule

Conclusion

• The introduction of standards-based phonogrammic streams as a fundamental internal data structure

• Support for unconstrained multimodal queries

• The development of new mimetic means for comparative evaluation and demonstration

• The provision of contextual strategies for the refinement of phonogrammic streams

• Movement of the man-machine boundary to allow more effective partitioning of tasks between the human and the machine portions of the system

• Design, implementation and testing of the Audient acoustic search engine

