Vroom!: A Search Engine for Sounds by Vocal Imitation Queries

Yichi Zhang, University of Rochester, Rochester, NY, [email protected]
Junbo Hu, University of Rochester, Rochester, NY, [email protected]
Yiting Zhang, Georgia Institute of Technology, Atlanta, GA, [email protected]
Bryan Pardo, Northwestern University, Evanston, IL, [email protected]
Zhiyao Duan, University of Rochester, Rochester, NY, [email protected]

ABSTRACT
Traditional search through collections of audio recordings compares a text-based query to text metadata associated with each audio file and does not address the actual content of the audio. Text descriptions do not describe all aspects of the audio content in detail. Query by vocal imitation (QBV) is a kind of query by example that lets users imitate the content of the audio they seek, providing an alternative search method to traditional text search. Prior work proposed several neural networks, such as TL-IMINET, for QBV; however, previous systems have not been deployed in an actual search engine nor evaluated by real users. We have developed a state-of-the-art QBV system (Vroom!) and a baseline query-by-text search engine (TextSearch). We deployed both systems in an experimental framework to perform user experiments with Amazon Mechanical Turk (AMT) workers. Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text. Results also showed a better overall ease-of-use rating for Vroom! than TextSearch on the sound library used in our experiments. These findings suggest that QBV, as a complementary search approach to existing text-based search, can improve both search results and user experience.

CCS CONCEPTS
• Information systems → Search interfaces; Speech / audio search; Similarity measures; Novelty in information retrieval.

KEYWORDS
vocal imitation, sound search, subjective evaluation, Siamese style convolutional recurrent neural networks, text description

ACM Reference Format:
Yichi Zhang, Junbo Hu, Yiting Zhang, Bryan Pardo, and Zhiyao Duan. 2020. Vroom!: A Search Engine for Sounds by Vocal Imitation Queries. In 2020 Conference on Human Information Interaction and Retrieval (CHIIR'20), March 14–18, 2020, Vancouver, BC, Canada. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3343413.3377963

1 INTRODUCTION
Designing methods to access and manage multimedia documents such as audio recordings is an important information retrieval task. Traditional search engines for audio files use text labels as queries. However, this is not always effective. First, it requires users to be familiar with the audio library taxonomy and text labels, which is unrealistic for many users with little or no audio engineering background. Second, text descriptions or metadata are abstract and do not describe the audio content in detail. Third, many sounds, such as those generated by computer synthesizers, lack commonly accepted semantic meanings and text descriptions.

Vocal imitation is commonly known as using the voice to mimic sounds. It is widely used in our daily conversations, as it is an effective way to convey sound concepts that are difficult to describe by language. For example, when referring to the "Christmas tree" dog barking sound (i.e., a barking sound with overtones fading out rapidly, showing a Christmas-tree-shaped spectrogram) [32], vocal imitation is more intuitive than a text description. Hence, designing computational systems that allow users to search sounds through vocal imitation [4] goes beyond current text-based search and enables novel human-computer interactions. It has natural advantages over text-based search, as it does not require users to be familiar with the labels of an audio taxonomy, and it indexes the detailed audio content instead of abstract text descriptions that not everyone agrees on. Regarding applications, sound search by vocal imitation can be useful in many fields, including movie and music production, multimedia retrieval, and security and surveillance.

Recently, a deep learning based model called TL-IMINET [43] was proposed for sound search by vocal imitation. It addresses two main technical challenges: 1) feature learning: what feature representations are appropriate for the vocal imitation and the reference sound, and 2) metric learning: how to design the similarity between a vocal imitation and each sound candidate. Experiments on the VocalSketch Data Set [5] have shown promising retrieval performance for this model; however, no user studies have been conducted to validate the model as part of a user-facing search engine, or the sound-search-by-vocal-imitation approach in general at the system level. In this paper, we seek to answer the following questions: 1) Is vocal-imitation-based search an acceptable approach to sound search for ordinary users without an extensive audio engineering background? 2) How does vocal-imitation-based search compare with traditional text-based search for different kinds of sounds in terms of search effectiveness, efficiency, and user satisfaction?

To answer the above questions, in this work we conduct a user study to compare sound search by vocal imitation and by text description on Amazon Mechanical Turk. Specifically, we designed a web-based search engine called Vroom!. The frontend GUI allows a user to record a vocal imitation as a query to search sounds in a sound library, and the backend uses a pre-trained deep learning model as the search algorithm. We also designed a baseline system called TextSearch for comparison. It allows a user to search a collection of sounds by comparing a user's text query to the keywords for each sound. We further developed a software experimental framework to record user behaviors and ratings on a cloud database using MongoDB Atlas.

In this study, each system was used by 100 workers, each of whom was asked to search for 10 sounds randomly selected from a large sound library that contains 3,602 sounds from eight sound categories with an average length of 4 seconds. Each search was done within the sound's category. Analyses of search results and user behaviors show that subjects gave significantly higher overall ease-of-use scores to Vroom! than to TextSearch on this sound library. Results also show significant advantages of Vroom! over TextSearch on categories that were difficult to describe by text.

The rest of the paper is organized as follows. We first review related work in Section 2. We then introduce the sound-search-by-vocal-imitation system Vroom! in Section 3 and the baseline sound-search-by-text-description system TextSearch in Section 4. In Section 5, we describe the subjective evaluation, including the FreeSoundIdeas dataset collected for user evaluation, the experimental framework, subject recruitment, and result analysis. Finally, we conclude the paper in Section 6.

2 RELATED WORK
Using text descriptions to search the metadata (e.g., filenames, text tags) in collections of audio files is already widely deployed. For example, Freesound [13] is an online community-generated sound database with more than 420,000 sounds. Each audio file in the collection is tagged with text descriptions for text-based search. SoundCloud [34] is another community-based online audio distribution platform that enables users to search sounds by text description.

On the other hand, sound query by vocal imitation (QBV) is drawing increasing attention from the research community to address limitations of text-based search. It is one kind of Query by Example (QBE) [44]. There are numerous QBE applications in the audio domain, such as content-based sound search and retrieval [12, 37], audio fingerprinting of the exact match [36] or live versions [28, 35], cover song detection [3], and spoken document retrieval [6]. Vocal imitation of a sound was first proposed for music retrieval, such as finding songs by humming the melody as a query [9, 15] or beatboxing the rhythm [16, 18]. Recently, it has been extended to general sound retrieval, as summarized below.

Roma and Serra [29] designed a system that allows users to search sounds on Freesound by recording audio with a microphone, but no formal evaluation was reported. Blancas et al. [4] built a supervised system using hand-crafted features from the Timbre Toolbox [26] and an SVM classifier. Helén and Virtanen [17] designed a query-by-example system for generic audio. Hand-crafted frame-level features were extracted from both query and sound samples, and the query-sample pairwise similarity was measured by the probability distributions of the features.

In our previous work, we first proposed a supervised system using a Stacked Auto-Encoder (SAE) for automatic feature learning followed by an SVM for imitation classification [38]. We then proposed an unsupervised system called IMISOUND [39, 40] that uses an SAE to extract features for both imitation queries and sound candidates and calculates their similarity using various measures [10, 22, 30]. IMISOUND learns feature representations independently of the distance metric used to compare sounds in the representation space. Later, we proposed an end-to-end Siamese style convolutional neural network named IMINET [41] to integrate learning of the distance metric and the features. This model was improved by transfer learning from other relevant audio tasks, leading to the state-of-the-art model TL-IMINET [43]. The benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embeddings were investigated in [21]. To understand what such a neural network actually learns, visualization and sonification of the input patterns in Siamese style convolutional neural networks using activation maximization [11] were discussed in [42, 43].

To date, research on sound search by vocal imitation has only been conducted at the algorithm development level. No usable search engines have been deployed based on these algorithms, nor have any user studies been conducted to assess the effectiveness of the new search approach in practice. This paper conducts a large-scale user study along this line: evaluating the performance of a vocal-imitation-based search engine built on a best-performing deep learning algorithm, and comparing it with a traditional text-based search engine.

3 PROPOSED SEARCH ENGINE: VROOM!
We designed a web-based sound search engine by vocal imitation, called Vroom!, which can be accessed via https://vocalimitation.com. It includes a frontend design and a backend implementation.

3.1 Frontend GUI Design
The frontend GUI is designed using Javascript, HTML, and CSS. It allows a user to record a vocal imitation of a sound that he/she is looking for from one of the categories described in Subsection 5.2, using the recorder.js Javascript library [25]. It also allows the user to listen to the recording, inspect the waveform, and re-record imitations. By clicking on the "Go Search!" button, the user can initiate the search request. The recording is then uploaded to the backend server and compared with each sound within the specified category using the CR-IMINET algorithm described later. The top five sound candidates with the highest similarity scores are first returned to the user, and more candidates, up to 20, can be returned by clicking on "Show more results". The user can play the returned sounds and make a selection to complete the search. Only sounds that have been played become available for selection. If not satisfied with any of the returned sounds, the user can re-record an imitation and re-do the search. The frontend GUI is shown in Figure 1.

Figure 1: Frontend GUI of the vocal-imitation-based search engine Vroom!.

3.2 Backend Search Algorithm: CR-IMINET
Hosted on a Ubuntu system, the backend server is built with the Node.js Express framework, with Keras v2.2.4 [7] and GPU acceleration supported. It receives the user's vocal imitation from the frontend, pre-processes the audio, and then runs a Siamese style convolutional recurrent neural network model called CR-IMINET to calculate the similarity between the vocal imitation and candidate sounds in the sound library. It responds to each frontend search request and retrieves the sounds most similar to each imitation query within the specified sound category of the sound library.
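As an illustration of the pre-processing step, the sketch below turns a recorded imitation into the 40-band by 340-frame log-mel spectrogram that CR-IMINET expects (see Figure 2). The sample rate, FFT size, and hop length are assumptions; the paper does not specify them.

```python
# A minimal pre-processing sketch, assuming librosa is available. The chosen
# sample rate, n_fft, and hop length are illustrative; only the 40 mel bands
# and 340 frames come from the figure in the paper.
import librosa
import numpy as np

def to_log_mel(path, sr=16000, n_fft=1024, hop=256, n_mels=40, n_frames=340):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, frames)
    # Pad or truncate along time so every clip has exactly n_frames frames.
    if log_mel.shape[1] < n_frames:
        pad = n_frames - log_mel.shape[1]
        log_mel = np.pad(log_mel, ((0, 0), (0, pad)), mode="constant")
    else:
        log_mel = log_mel[:, :n_frames]
    # Return (frames, mel bands, 1), matching the tower inputs sketched below.
    return log_mel.T[..., np.newaxis].astype("float32")
```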

3.2.1 Architecture. As shown in Figure 2, CR-IMINET contains two identical Convolutional Recurrent Deep Neural Network (CRDNN) towers for feature extraction: one tower receives a vocal imitation (the query) as input; the other receives a sound from the library (the candidate) as input. Each tower outputs a feature embedding.

Figure 2: Architecture of CR-IMINET. The two input spectrograms (40 mel bands by 340 frames) are of the same size. The two towers for feature extraction have the same structure with shared weights. In each tower a convolutional layer is followed by a bi-directional GRU layer. Layer configuration: Conv1: 24 filters, receptive field 6 x 6; Pool1: max pooling, stride 6 x 6; Bidirectional GRU: 40 x 2 neurons; FC1: 96 units, ReLU activation; FC2: 1 unit, sigmoid activation. The two tower embeddings are concatenated and passed through FC1 and FC2, trained with a binary cross-entropy loss.

These embeddings are then concatenated and fed into a Fully Connected Network (FCN) for similarity calculation. The final single-neuron output from FC2 is the similarity between the query and the candidate. The feature learning and metric learning modules are trained jointly on positive (i.e., related) and negative (i.e., non-related) query-candidate pairs. Through joint optimization, the feature embeddings learned by the CRDNNs are better tuned for the FCN's metric learning, compared with the isolated feature and metric learning in [40].

The Siamese (two-tower) and integrated feature and metric learning architecture of the proposed CR-IMINET is inspired by the previous state-of-the-art architecture, TL-IMINET [43]. Differently, TL-IMINET only uses convolutional layers in the feature extraction towers, while CR-IMINET uses both a convolutional layer and a bi-directional GRU (Gated Recurrent Unit) [8] layer. This configuration can better model temporal dependencies in the input log-mel spectrograms. Another difference is that TL-IMINET pre-trains the imitation and recording towers on environmental sound classification [31] and spoken language recognition [24] tasks, respectively, while CR-IMINET does not adopt this pre-training, for simplicity and thanks to its much smaller model size (shown in Table 1).
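To make the two-tower design concrete, the following is a minimal Keras sketch of such a Siamese convolutional recurrent similarity model, built from the layer sizes reported in Figure 2. The conv activation, padding, and the reshape between the pooling layer and the GRU are assumptions rather than the authors' released implementation.

```python
# A sketch of a CR-IMINET-style model, assuming TensorFlow/Keras.
# Layer sizes follow Figure 2; other details (padding, activations) are guesses.
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, N_MELS = 340, 40  # log-mel spectrogram input size from Figure 2

def make_tower():
    """Shared feature-extraction tower: Conv1 -> Pool1 -> Bidirectional GRU."""
    inp = layers.Input(shape=(N_FRAMES, N_MELS, 1))
    x = layers.Conv2D(24, (6, 6), padding="same", activation="relu")(inp)  # Conv1
    x = layers.MaxPooling2D(pool_size=(6, 6))(x)                           # Pool1
    # Collapse frequency and channel axes so the GRU runs over time frames.
    x = layers.Reshape((N_FRAMES // 6, (N_MELS // 6) * 24))(x)
    x = layers.Bidirectional(layers.GRU(40))(x)  # 40 units per direction -> 80-d embedding
    return models.Model(inp, x, name="crdnn_tower")

tower = make_tower()  # a single instance, so weights are shared across both inputs

imitation = layers.Input(shape=(N_FRAMES, N_MELS, 1), name="vocal_imitation")
recording = layers.Input(shape=(N_FRAMES, N_MELS, 1), name="sound_recording")

merged = layers.Concatenate()([tower(imitation), tower(recording)])
h = layers.Dense(96, activation="relu")(merged)        # FC1
similarity = layers.Dense(1, activation="sigmoid")(h)  # FC2: similarity in [0, 1]

cr_iminet = models.Model([imitation, recording], similarity)
cr_iminet.compile(optimizer="adam", loss="binary_crossentropy")
```

Applying one tower model to both inputs is what gives the weight sharing described in the Figure 2 caption; the concatenated embeddings feed the metric-learning FCN.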

Table 1: Model size (# trainable parameters) and retrieval performance (MRR) comparisons between the proposed CR-IMINET and the previous state of the art, TL-IMINET.

Config. | Model Size | MRR (mean ± std)
TL-IMINET [43] | 799.9k | 0.325 ± 0.03
CR-IMINET | 55.1k | 0.348 ± 0.03

3.2.2 Training. We use the VimSketch dataset [19] to train the proposed CR-IMINET model. It is a combination of the VocalSketch Data Set v1.0.4 [5] and the Vocal Imitation Set [20]. The VocalSketch Data Set v1.0.4 contains 240 sounds with distinct concepts and 10 vocal imitations for each sound, collected from different Amazon Mechanical Turk workers. These sounds are from four categories: Acoustic Instruments (AI), Everyday (ED), Single Synthesizer (SS), and Commercial Synthesizers (CS). The number of sounds in these categories is 40, 120, 40, and 40, respectively. The Vocal Imitation Set is curated based on Google's AudioSet ontology [14], containing seven categories of sounds in the first layer of the AudioSet ontology tree: Animal, Channel, Human Sounds, Music, Natural Sounds, Sounds of Things, and Source Ambiguous Sounds. The number of sounds in these categories is 31, 4, 38, 65, 10, 134, and 20, respectively. Considering the relatively small number of sound recordings in the Channel and Natural Sounds categories (i.e., 4 and 10, respectively) and the non-obvious association between the recordings and their imitations after listening to them, we removed these two categories from training.

We used in total 528 sounds from the remaining categories of the VimSketch dataset for training. These sounds are of distinct concepts, and each has 10 to 20 vocal imitations from different people. All sounds and imitations are trimmed to about 4 seconds each. We used 10-fold cross validation to train and validate the CR-IMINET model. In each fold, the numbers of sound concepts for training and validation are 476 and 52, respectively. Sounds from each concept are paired with imitations of the same concept as positive pairs, and with imitations of other concepts as negative pairs.

The ground-truth similarity labels are 1 for positive pairs and 0 for negative pairs. The loss function to minimize is the binary cross-entropy between the network output and the binary labels. Adam is used as the optimizer. The batch size is 128. Model training terminates after 20 epochs, as the validation loss begins to increase afterwards.
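Continuing the model sketch above, the snippet below shows how positive and negative pairs with binary labels could be formed and the model fit with the reported settings (Adam, batch size 128, up to 20 epochs). The pairing helper and the placeholder arrays are illustrative assumptions, not the authors' data pipeline.

```python
# A sketch of pair construction and training; dummy arrays stand in for the
# real spectrograms, and `cr_iminet` comes from the architecture sketch above.
import numpy as np

def make_pairs(imitations, imitation_concepts, concept_sounds, rng):
    """Pair each imitation with its own concept's sound (label 1) and with a
    randomly drawn different concept's sound (label 0)."""
    xa, xb, y = [], [], []
    n_concepts = len(concept_sounds)
    for imi, cid in zip(imitations, imitation_concepts):
        xa.append(imi); xb.append(concept_sounds[cid]); y.append(1.0)   # positive pair
        neg = (cid + rng.integers(1, n_concepts)) % n_concepts          # any other concept
        xa.append(imi); xb.append(concept_sounds[neg]); y.append(0.0)   # negative pair
    return np.stack(xa), np.stack(xb), np.asarray(y, dtype="float32")

rng = np.random.default_rng(0)
# Placeholder spectrograms standing in for the real training concepts.
concept_sounds = np.random.rand(8, 340, 40, 1).astype("float32")
imitations = np.random.rand(32, 340, 40, 1).astype("float32")
imitation_concepts = rng.integers(0, 8, size=32)

xa, xb, y = make_pairs(imitations, imitation_concepts, concept_sounds, rng)
cr_iminet.fit([xa, xb], y, batch_size=128, epochs=20)
```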

3.2.3 Performance. We use Mean Reciprocal Rank (MRR) [27] to evaluate the retrieval performance:

\mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i}, \qquad (1)

where $\mathrm{rank}_i$ is the rank of the target sound among all sounds in the library available for retrieval for the i-th imitation query, and $Q$ is the number of imitation queries. MRR ranges from 0 to 1, with a higher value indicating better sound retrieval performance. We report the average MRR across 10 folds, with 52 sound recordings in each fold to search from. The results are shown in Table 1.
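For reference, Eq. (1) amounts to averaging the reciprocal of the target's rank over all queries, as in this small helper:

```python
# Mean Reciprocal Rank as defined in Eq. (1): the average of 1/rank of the
# target sound over all imitation queries (ranks are 1-based).
def mean_reciprocal_rank(target_ranks):
    ranks = list(target_ranks)
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. targets ranked 1st, 4th, and 2nd across three queries:
print(mean_reciprocal_rank([1, 4, 2]))  # 0.5833...
```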

It can be seen that CR-IMINET outperforms TL-IMINET in terms of MRR. An unpaired t-test shows that this improvement is statistically significant at the significance level of 0.05 (p = 4.45e-2). An MRR of 0.348 suggests that, within the 52 sound candidates in each fold, the target sound is on average ranked within the top 3 candidates of the returned list. This suggests that CR-IMINET is the new state-of-the-art algorithm for sound search by vocal imitation.
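The unpaired t-test above can be reproduced in spirit as below, assuming one mean MRR value per cross-validation fold for each model (10 values each); the numbers here are placeholders, not the actual per-fold results.

```python
# A hedged sketch of the unpaired (two-sample) t-test on per-fold MRR values.
from scipy import stats

mrr_tl_iminet = [0.30, 0.33, 0.31, 0.36, 0.32, 0.34, 0.29, 0.33, 0.35, 0.32]  # placeholders
mrr_cr_iminet = [0.34, 0.36, 0.33, 0.38, 0.35, 0.37, 0.32, 0.35, 0.37, 0.31]  # placeholders

t_stat, p_value = stats.ttest_ind(mrr_cr_iminet, mrr_tl_iminet)  # two-sided, equal variances
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```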

4 BASELINE SEARCH ENGINE: TEXTSEARCH
To evaluate the performance of the proposed Vroom! search engine, we also designed a web-based sound search engine by text description as the baseline system, called TextSearch. It also includes a frontend design and a backend implementation, described as follows.

Figure 3: Frontend GUI of the text-description-based search engine TextSearch.

4.1 Frontend GUI Design
Similar to Vroom!, the frontend GUI of TextSearch is also designed using Javascript, HTML, and CSS. As shown in Figure 3, the GUI provides the user with a text box to enter one or multiple keywords as the query for a sound. The user can then click on the "Go Search!" button to search.

The query string is uploaded to the backend server and then compared with the keyword list associated with each sound candidate to find matches in the Solr database described in Subsection 4.2. Returned sound candidates are ranked based on the internal matching mechanism of Solr. In order to have an experimental setup comparable with Vroom!, only the file names, but not the keyword lists, of the returned sounds are presented to the user. By clicking on the "Show more results" button, up to 20 sound candidates can be returned. The user can play the returned sounds and make a selection to complete the search, and only sounds that have been played can be selected. If not satisfied with any of the returned sounds, the user can re-type the query keywords and re-do the search.

4.2 Backend Search Algorithm
The backend is realized by designing a main server that receives and responds to requests from the frontend, and by utilizing a separate search engine service called Solr [1], specialized in text search and matching. The entire backend is hosted on a Ubuntu system. The overall process is as follows. A query request from the frontend user is first received by the main server. The main server then resends the query request to Solr for text search and ranking. The ranked list is returned from Solr to the main server, and finally returned to the frontend user.

Specifically, Solr is an open-source enterprise search platform built upon Apache Lucene. Each sound in FreeSoundIdeas has the following information: the sound filename, descriptive tags provided by the original Freesound.org uploader (i.e., keywords), and a unique sound ID. We organize this information for all sounds within each category into a separate JSON-format file. Solr reads in the JSON file for each sound category and organizes the sound keywords into a tree structure for fast indexing given a query input. The retrieval similarity is calculated by Lucene's Practical Scoring Function [2]. Query parsing and partial word matching are also supported by Solr for a more user-friendly search experience.
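As a rough illustration of this setup, the sketch below writes a per-category JSON file and queries a Solr core from Python. The core name, field names, and use of the pysolr client are illustrative assumptions; the paper only states that per-category JSON files (filename, Freesound tags, sound ID) are indexed by Solr and ranked with Lucene's scoring function.

```python
# A minimal sketch, assuming a local Solr instance with one core per sound
# category and a schemaless (or matching) field configuration.
import json
import pysolr

# One JSON document per sound, grouped into a file per category (hypothetical data).
animal_docs = [
    {"id": "417323", "filename": "dog_bark_close.wav",
     "keywords": ["dog", "bark", "animal", "pet"]},
    {"id": "231045", "filename": "cat_meow_short.wav",
     "keywords": ["cat", "meow", "animal"]},
]
with open("animal.json", "w") as f:
    json.dump(animal_docs, f, indent=2)

solr = pysolr.Solr("http://localhost:8983/solr/animal", timeout=10)
solr.add(animal_docs)   # index the category's documents
solr.commit()

results = solr.search("keywords:bark OR filename:bark", rows=20)
for doc in results:     # results are ranked by Lucene's scoring function
    print(doc["id"], doc["filename"])
```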

4.3 Baseline Validity
Our baseline search engine TextSearch is comparable to other text-based search engines such as Freesound.org in the following two aspects. First, the workflow of TextSearch is the same as that of Freesound.org. Both search engines provide the user with a text box to type in query strings, and return a list of sounds ranked by relevance that are available for playback over multiple pages. Second, the backend Solr-based search algorithm used in TextSearch is also used by Freesound.org. This guarantees the search effectiveness and efficiency of the baseline.

TextSearch still has its limitations. For example, the searchable space is within each category of sounds in FreeSoundIdeas. This may prevent the user from finding sounds that are keyword-relevant to the query string but belong to other categories, while Freesound.org does not have this constraint. In addition, Freesound.org displays the associated keywords, the user quality rating, and the uploader information of the returned sounds, while TextSearch does not. Such metadata may help the user find the target high-quality sound more efficiently. Nevertheless, we believe that TextSearch implements the key features of a text-based sound search engine and serves as a sufficient baseline for assessing the feasibility of vocal-imitation-based search.

5 SUBJECTIVE EVALUATION
5.1 Research Questions
Having designed the proposed search engine Vroom! and the baseline TextSearch, we would like to understand and answer the following questions: 1) Is vocal-imitation-based search an acceptable approach to sound search for ordinary users without an extensive audio engineering background? 2) How does vocal-imitation-based search compare with traditional text-based search for different kinds of sounds in terms of search effectiveness and efficiency? In this section, we present a large-scale user study on Amazon Mechanical Turk to answer these questions.

Figure 4: FreeSoundIdeas dataset ontology. Numbers in parentheses indicate the number of sound concepts in each category at the first level of the ontology tree: FreeSoundIdeas (3,602); Animal (230), Human (300), Music (521), Natural Sounds (86), Office and Home (819), Synthesizers (762), Tools and Miscellaneous (660), and Transportation (224). The Transportation category is expanded to leaf nodes (Auto, Motorcycles, Other Ground Vehicles, Airplanes, Watercrafts; Auto further expands to General Car Sounds, Car Brands, Formula, Drag Race, and individual files such as "Car wash.wav" and "Car Horns.wav").

5.2 Evaluation Dataset Collection
To carry out the subjective evaluation, we created a new dataset called FreeSoundIdeas as our sound library. The sounds of this dataset are from Freesound.org [13], while we reference the sound descriptions and the structure in which sounds are organized in Sound Ideas [33] to form the FreeSoundIdeas ontology. Specifically, the ontology has a multi-level tree structure and is derived from two libraries of Sound Ideas: the "General Series 6000 - Sound Effect Library" and the "Series 8000 Science Fiction Sound Effects Library", where the former has more than 7,500 sound effects covering a large scope, and the latter has 534 sound effects created by Hollywood's best science fiction sound designers. We copied the indexing keywords of 837 relatively distinct sounds in these two libraries and formed eight categories of sound concepts, namely, Animal (ANI), Human (HUM), Music (MSC), Natural Sounds (NTR), Office and Home (OFF), Synthesizers (SYN), Tools and Miscellaneous (TOL), and Transportation (TRA).

We do not use sounds from Sound Ideas because of copyright issues; instead, we use the keywords of each sound track from the abovementioned ontology as queries to search for similar sounds on Freesound.org. For each query, the first 5 to 30 returned sounds from Freesound.org are downloaded and stored as elements of our FreeSoundIdeas dataset. The keywords of these sounds from Freesound.org, rather than the query keywords used to find them, are stored together with the sounds for a more accurate description. Note that this FreeSoundIdeas dataset has no overlap with the VimSketch dataset, which is used to train the search algorithm of Vroom!.

In total, the FreeSoundIdeas dataset includes 3,602 sounds. There are 230, 300, 521, 86, 819, 762, 660, and 224 sound concepts in the categories of ANI, HUM, MSC, NTR, OFF, SYN, TOL, and TRA, respectively. Its ontology is shown in Figure 4, with the Transportation category expanded to leaf nodes to illustrate the granularity.

5.3 Experimental Framework
To quantify search behaviors and user experiences and to make quantitative comparisons between Vroom! and TextSearch, we designed an experimental framework that wraps around each search engine. The experimental framework is another web application.


The framework guides each subject to make 10 searches, rate their satisfaction with each search, and rate the ease-of-use of the search engine after completing all 10 searches. For each search, it guides the subject through three steps. In Step 1, the subject listens to a reference sound randomly chosen from a category of the sound library. The category name is visible, while the keywords of the sound are not provided. This sound is the target sound to search for in the following steps. In Step 2, the reference sound is hidden from the subject, and the subject uses the search engine (Vroom! or TextSearch) to search for the reference sound in the specified category of the sound library. In Step 3, the reference sound appears again. The user compares it with their retrieved sound to rate their satisfaction with the search. These three steps, for the Vroom! search engine, are shown in Figure 5 for illustration.

The experimental framework tries to mimic search processes in practice as much as possible. For example, searches are conducted within each of the eight root categories instead of over the entire library to reduce complexity, as the root-level categories show clear distinctions in their semantics. However, certain modifications still have to be made to allow quantitative analysis. In practice, a user rarely listens to the exact target sound before a search; they usually only have a rough idea about the target sound in their mind when they cast their query (imitation or text). In our experimental framework, however, the subject listens to the target sound before each search and then casts their query. While this may positively bias the quality of the query (especially for the imitation query), it is necessary to control what sound is searched for by the subjects. For example, the library might simply not contain the target sound if we allowed subjects to search freely. To reduce this positive bias, we hide the target sound during the search (Step 2).

The backend of this experimental framework records statistics of important search behaviors, listed below; a sketch of one such record is shown after the list. These statistics are then sent to the MongoDB Atlas cloud database for storage and analysis.

• User satisfaction rating for each search
• Ease-of-use rating for the search engine
• Number of "Go Search!" button clicks
• Number of returned candidate sounds played
• Total time spent for each search
• Rank of the target sound in the returned list for each search
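The sketch below shows how one search's statistics might be logged to MongoDB Atlas with pymongo. The connection string, database and collection names, and field names are illustrative assumptions chosen to match the list above, not the authors' actual schema.

```python
# A hedged logging sketch, assuming pymongo and a MongoDB Atlas cluster.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
searches = client["vroom_experiments"]["search_logs"]  # hypothetical names

searches.insert_one({
    "worker_id": "AMT_WORKER_123",      # anonymized subject identifier (placeholder)
    "engine": "Vroom!",                 # or "TextSearch"
    "category": "SYN",
    "satisfaction_rating": 4,           # per-search rating (Step 3)
    "ease_of_use_rating": None,         # filled once after all 10 searches
    "num_go_search_clicks": 2,
    "num_candidates_played": 7,
    "total_time_sec": 85.4,
    "target_rank": 3,                   # rank of the target in the returned list
    "timestamp": datetime.now(timezone.utc),
})
```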

5.4 Subject Recruitment
We recruited a total of 200 workers (i.e., 100 for each search engine) from the crowdsourcing platform Amazon Mechanical Turk (AMT) as our subjects. Our sound search tasks were released through cloudresearch.com [23], as it provides several more advanced and convenient features compared with the task releasing mechanism in AMT. Our recruiting criteria are summarized as follows.

(1) AMT worker's historical HIT performance. We required that each worker had a HIT approval rate higher than 97% and a number of HITs higher than 50, and that the worker was located in the United States.

(2) Duplicate submission prevention. We blocked multiple submissions from the same IP address, and verified that each worker's IP was consistent with the United States region setting. Finally, we strictly ensured that there was no overlap between workers in the two groups. This was to prevent a worker from becoming familiar with the sound library, which might cause a positive bias toward the second search engine that the worker tested.

(3) Equipment requirement. We asked workers to use Chrome or Firefox to complete the task, as comprehensive internal tests were conducted on these two browsers. Submissions made with other browsers (i.e., IE, Safari, etc.) were rejected to rule out any unexpected issues. We also asked workers to sit in a quiet environment and make sure their speaker and microphone were on.

The demographic information of the recruited subjects is summarized in Figure 6. We can see that the gender distribution is quite even, and that a large portion of subjects were born in the 1980s and 1990s. For the race distribution, most subjects are White/Caucasian, followed by Black/African American and Asian.

Before starting, each worker is welcomed with a task portal including instructions, an external link to our web-based experimental framework hosting Vroom! and TextSearch, and a text box for entering the completion code, which becomes available to copy and paste on the last page of the experimental framework after the worker finishes the sound search task.

The two groups of subjects were asked to perform 10 sound searches using Vroom! and TextSearch, respectively. Subjects were informed about the collection of their search behaviors and ratings before the experiments. After a subject finished the 10 sound searches and provided the ease-of-use score and general feedback, the completion code became available for the subject to paste into the text box on the task portal. Finally, we verified the submitted completion code from each subject to approve his/her job.

Our internal pilot trials showed that each experiment took about 25 and 15 minutes for Vroom! and TextSearch, respectively. Therefore, we paid each subject 1.5 US dollars for Vroom! and 1 US dollar for TextSearch. To encourage subjects to treat the experiments more seriously, we made an extra 50% bonus payment based on the worker's performance. Subjects were informed about this compensation policy, including the bonus payment, before they started the experiments.

5.5 Experimental Results
5.5.1 User Feedback. Figure 7 compares two types of user ratings between Vroom! and TextSearch: 1) the user's satisfaction rating (SAT) indicates how satisfied a user is with each search, obtained by comparing the finally retrieved sound to the reference sound (collected in Step 3 in Figure 5); 2) the ease-of-use rating evaluates a user's overall experience with each search engine upon completion of all 10 searches.

Figure 5: Experimental framework hosting the proposed vocal-imitation-based search engine Vroom! (Step 1: Listen; Step 2: Search; Step 3: Rate). The framework hosting the text-description-based search engine TextSearch is exactly the same except that Step 2 is replaced with the TextSearch engine.

Figure 6: Demographic information of the subjects, collected from cloudresearch.com (which may not have every demographic for every subject). (a) Gender: Female 109, Male 118. (b) Birth decade: 1940s 1, 1950s 8, 1960s 22, 1970s 37, 1980s 85, 1990s 65, 2000s 1. (c) Race: White/Caucasian 172, Black/African American 25, Asian 12, Multiracial 6, Pacific Native 2, Native American or Alaska Native 1.

It can be seen that Vroom! shows a statistically significantly higher ease-of-use rating than TextSearch at the significance level of 0.05 (p = 0.0324, unpaired t-test). This suggests a positive answer to the first research question raised in Section 5.1, i.e., vocal-imitation-based search can be accepted by ordinary users without an extensive audio engineering background. The average satisfaction rating across all categories shows slightly better performance of Vroom! than TextSearch. However, a further inspection reveals that the average satisfaction rating varies considerably from one category to another. For the MSC, NTR, SYN, and TOL categories, Vroom! receives a statistically significantly higher satisfaction rating than TextSearch does, at the significance levels of 0.1 (MSC p = 9.8e-2), 0.1 (NTR p = 6.1e-2), 0.001 (SYN p = 8.22e-10), and 0.1 (TOL p = 8.79e-2), respectively, under unpaired t-tests. This is because many subjects could not recognize sounds from these categories or find appropriate keywords to search with in TextSearch. This is especially pronounced for the SYN category, as many sounds simply do not have semantically meaningful or commonly agreed-upon keywords, while imitating such sounds was not too difficult for many subjects. Also note that, when conducting the Vroom! experiments with AMT workers, the TOL category was named with a much broader concept, "Sounds of Things", while in TextSearch this category was renamed to the current "Tools and Miscellaneous" to provide more information and help workers better understand the sounds they heard. This may slightly bias the experimental results toward TextSearch in the TOL category. Nevertheless, in the figure we see that Vroom! still outperforms TextSearch in the TOL category in terms of user satisfaction rating.

On the other hand, for the ANI, HUM, and OFF categories, TextSearch outperforms Vroom! significantly in terms of satisfaction rating, at the significance levels of 0.005 (ANI p = 1.2e-3), 0.005 (HUM p = 1.2e-3), and 0.05 (OFF p = 2.1e-2), respectively, under unpaired t-tests. Subjects were more familiar with these sounds, which can be easily identified in everyday environments, and knew how to describe them with keywords, while some of these sounds could be difficult to imitate, e.g., shuffling cards, toilet flushing, and cutting a credit card.

For the remaining TRA category, the average satisfaction rating of TextSearch is slightly better than that of our proposed Vroom!; however, this difference is not statistically significant.


Figure 7: Average user ratings of sound search by text description (TextSearch) and vocal imitation (Vroom!). Ratings include the overall ease-of-use rating of the two search engines, and search satisfaction (SAT) within each sound category and across all categories. Error bars show standard deviations.

5.5.2 User Behaviors. Figure 8 further compares user behaviors between Vroom! and TextSearch. First, for both search engines, there is a clear trend of positive correlation among the number of search trials, the number of sounds played, and the total time spent in one sound search.

Second, Vroom! requires significantly fewer search trials than TextSearch in all categories except HUM, as shown in Table 2. Note that in the HUM category the mean number of search trials for Vroom! is still lower than for TextSearch, although the difference is not statistically significant. Considering that user satisfaction ratings for Vroom! in the MSC, NTR, SYN, and TOL categories are significantly higher than those for TextSearch, this suggests that fewer search trials may lead to a better search experience. Higher search efficiency can be achieved by requiring fewer queries from the user. This answers the second research question in Subsection 5.1 in terms of search efficiency.

On the other hand, the average number of sound candidates played in each search using Vroom! is much larger than that using TextSearch. As the file names of returned sound candidates often carry semantic meaning, we believe that users can often skip listening to sound candidates whose file names seem irrelevant to the text query in TextSearch. For Vroom!, such a "short cut" is not available, and listening is often the only way to assess relevance.

Finally, the overall time spent on each search in Vroom! is significantly longer than that in TextSearch. This can be explained by the larger number of sounds played in Vroom!, as well as the additional time spent recording and playing back vocal imitations compared to typing in keywords.

Figure 8: User behavior statistics for (a) Vroom! and (b) TextSearch. Each point in the plot is the mean value (standard deviations omitted for better visualization) of the number of search trials, the number of sounds played, and the total time spent, per category and averaged over all categories.

5.5.3 Ranking of Target Sound. We visualize the distribution of the target sound's rank for both the proposed Vroom! and the baseline TextSearch across different categories. Complementary to the user feedback in Section 5.5.1, this is an objective evaluation measure to compare the performance of the two search engines.

Please note that in Vroom! the target sound is always in the returned candidate list. In TextSearch, however, given the user's query keywords, the target sound may or may not be in the returned candidate list. If the target sound is not in the candidate list, we treat its rank as 999, which is greater than the number of sounds in each category. As shown in Figure 9, black and white bars indicate rank counts for Vroom! and TextSearch, respectively. The overlapping portion between the two systems is shown in grey. Some interesting findings can be observed as follows.

First, the target ranks in Vroom! are distributed relatively evenly within the range of the number of sounds in each category, whereas TextSearch shows a more polarized result: either the target sound ranks very high or it does not appear in the returned list at all. This indicates that users either chose keywords that matched the target sound well or could not come up with any relevant description for the sound. This is most obvious in the MSC, OFF, and SYN categories, by comparing the leftmost and rightmost white bars in the figure. For example, in the SYN category, for users without a music or audio engineering background, describing a synthesizer sound effect by text is very challenging (e.g., a sound effect named "eerie-metallic-noise.mp3" annotated with the keywords "alien", "eerie", "glass", and "metallic").

Second, in the HUM, MSC, OFF, and TOL categories, Vroom! shows a smaller proportion of high ranks compared with TextSearch, while in other categories such as NTR, SYN, and TRA, Vroom! shows a higher proportion of high ranks. In both cases the target sound in Vroom! can be ranked low, around several hundred out of the entire list of returned candidate sounds. We argue that this is still different from the out-of-scope situation in TextSearch where no matched keywords can be found: in practice, Vroom! could return more than 20 sounds for the user to choose from, and could work together with text-based search to reduce the candidate pool, so target sounds with low ranks could still be discovered. Furthermore, as the returned candidate sounds are ranked based on their content similarity with the vocal imitation query, even if the target sound is ranked low, other highly ranked candidate sounds may still align well with the user's taste.

Table 2: P-values of unpaired t-tests verifying whether the hypotheses in the first column are statistically significant, at the significance level of 0.05.

User behavior hypothesis | Avg | ANI | HUM | MSC | NTR | OFF | SYN | TOL | TRA
No. search trials (TextSearch > Vroom!) | 7.66e-24 | 2.78e-2 | Not significant | 3.79e-8 | 4.76e-2 | 1.60e-3 | 2.64e-8 | 6.74e-8 | 5.40e-3
No. sounds played (Vroom! > TextSearch) | 5.87e-27 | 6.86e-4 | 3.86e-5 | 7.14e-4 | 4.81e-2 | 9.86e-12 | 7.24e-7 | 6.79e-4 | 1.10e-2
Total time (Vroom! > TextSearch) | 1.69e-33 | 6.03e-4 | 2.58e-6 | 3.80e-3 | 4.26e-2 | 2.55e-14 | 9.84e-7 | 2.22e-6 | 3.70e-3

Figure 9: Comparisons of the target sound's rank in the returned sound list between Vroom! and TextSearch within each category (ANI, HUM, MSC, NTR, OFF, SYN, TOL, and TRA).

The SAT ratings and the target sound rankings provide subjective and objective evaluations, respectively, of the sound search effectiveness of Vroom! compared with TextSearch. This answers the second research question in terms of search effectiveness.

6 CONCLUSIONS
This paper presented Vroom!, a search engine for sounds by vocal imitation queries. Its frontend GUI allows the user to record his/her vocal imitation of a sound concept and search for the sound in a library. Its backend hosts a Siamese convolutional recurrent neural network model called CR-IMINET to calculate the similarity between the user's vocal imitation and the sound candidates in the library. We conducted a comprehensive subjective study on Amazon Mechanical Turk with 200 workers to evaluate the performance of the vocal-imitation-based search engine and to compare it with a text-based sound search engine, TextSearch, as the baseline. We developed an experimental framework wrapped around Vroom! and TextSearch to conduct this user study. User ratings and behavioral data collected from the workers showed that vocal-imitation-based search has significant advantages over text-based search for certain categories of sounds (e.g., Synthesizers, Music, Natural Sounds, and Tools and Miscellaneous) in our collected FreeSoundIdeas sound effect library. The ease-of-use rating of the vocal-imitation-based engine is also significantly higher than that of the text-based engine. Nonetheless, text-based sound search remains beneficial for categories that users are familiar with (e.g., Animal, Human, and Office and Home). For future work, we would like to further improve the performance of the Vroom! search algorithm by incorporating an attention mechanism, and to design search paradigms that combine vocal-imitation-based and text-based search.

ACKNOWLEDGMENTS
This work is funded by the National Science Foundation grants No. 1617107 and No. 1617497. We also acknowledge NVIDIA's GPU donation for this research.


REFERENCES
[1] [n.d.]. Apache Solr. Retrieved September 30, 2019 from http://lucene.apache.org/solr/
[2] [n.d.]. Class Similarity. Retrieved September 30, 2019 from http://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html
[3] Thierry Bertin-Mahieux and Daniel PW Ellis. 2011. Large-scale cover song recognition using hashed chroma landmarks. In Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on. 117–120. https://doi.org/10.1109/ASPAA.2011.6082307
[4] David S. Blancas and Jordi Janer. 2014. Sound retrieval from voice imitation queries in collaborative databases. In Proc. Audio Engineering Society 53rd International Conference on Semantic Audio. 1–6.
[5] Mark Cartwright and Bryan Pardo. 2015. VocalSketch: Vocally imitating audio concepts. In Proc. the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI). 43–46. https://doi.org/10.1145/2702123.2702387
[6] Tee Kiah Chia, Khe Chai Sim, Haizhou Li, and Hwee Tou Ng. 2008. A lattice-based approach to query-by-example spoken document retrieval. In Proc. the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 363–370. https://doi.org/10.1145/1390334.1390397
[7] François Chollet et al. 2015. Keras. https://github.com/fchollet/keras
[8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[9] Roger B Dannenberg, William P Birmingham, Bryan Pardo, Ning Hu, Colin Meek, and George Tzanetakis. 2007. A comparative evaluation of search techniques for query-by-humming using the MUSART testbed. Journal of the Association for Information Science and Technology 8, 5 (2007), 687–701. https://doi.org/10.1002/asi.20532
[10] Najim Dehak, Reda Dehak, James R Glass, Douglas A Reynolds, and Patrick Kenny. 2010. Cosine similarity scoring without score normalization techniques. In Odyssey. 1–5.
[11] Dumitru Erhan, Aaron Courville, and Yoshua Bengio. 2010. Understanding representations learned in deep architectures. Department d'Informatique et Recherche Operationnelle, University of Montreal, QC, Canada, Tech. Rep 1355 (2010), 1–25.
[12] Jonathan T. Foote. 1997. Content-based retrieval of music and audio. In Proc. Multimedia Storage and Archiving Systems II, International Society for Optics and Photonics, Vol. 3229. 138–147. https://doi.org/10.1117/12.290336
[13] Freesound.Org. 2005. Freesound. Retrieved September 30, 2019 from http://www.freesound.org/
[14] Jort F. Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. 776–780. https://doi.org/10.1109/ICASSP.2017.7952261
[15] Asif Ghias, Jonathan Logan, David Chamberlin, and Brian C. Smith. 1995. Query by humming: musical information retrieval in an audio database. In Proc. the 3rd ACM International Conference on Multimedia. 231–236. https://doi.org/10.1145/217279.215273
[16] Oliver Gillet and Gaël Richard. 2005. Drum loops retrieval from spoken queries. Journal of Intelligent Information Systems 24 (2005), 159–177. https://doi.org/10.1007/s10844-005-0321-9
[17] Marko Helén and Tuomas Virtanen. 2009. Audio query by example using similarity measures between probability density functions of features. EURASIP Journal on Audio, Speech, and Music Processing 2010, 1 (2009), 179303. https://doi.org/10.1155/2010/179303
[18] Ajay Kapur, Manj Benning, and George Tzanetakis. 2004. Query-by-beat-boxing: Music retrieval for the DJ. In Proc. International Society for Music Information Retrieval Conference (ISMIR). 170–177.
[19] Bongjun Kim, Mark Cartwright, and Bryan Pardo. 2019. VimSketch Dataset. https://doi.org/10.5281/zenodo.2596911
[20] Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan. 2018. Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology. In Proc. Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE). 1–5.
[21] Bongjun Kim and Bryan Pardo. 2019. Improving content-based audio retrieval by vocal imitation feedback. In Proc. Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. 4100–4104. https://doi.org/10.1109/ICASSP.2019.8683461
[22] Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86. https://doi.org/10.1214/aoms/1177729694
[23] Leib Litman, Jonathan Robinson, and Tzvi Abberbock. 2017. TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods 49, 2 (2017), 433–442. https://doi.org/10.3758/s13428-016-0727-z
[24] Gregoire Montavon. 2009. Deep learning for spoken language identification. In Proc. NIPS Workshop on Deep Learning for Speech Recognition and Related Applications. 1–4.
[25] Octavian Naicu. 2018. Using Recorder.js to capture WAV audio in HTML5 and upload it to your server or download locally. Retrieved September 30, 2019 from https://blog.addpipe.com/using-recorder-js-to-capture-wav-audio-in-your-html5-web-site/
[26] Geoffroy Peeters, Bruno L. Giordano, Patrick Susini, Nicolas Misdariis, and Stephen McAdams. 2011. The timbre toolbox: Extracting audio descriptors from musical signals. The Journal of the Acoustical Society of America 130, 5 (2011), 2902–2916. https://doi.org/10.1121/1.3642604
[27] Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. 2002. Evaluating web-based question answering systems. Ann Arbor 1001 (2002), 48109.
[28] Zafar Rafii, Bob Coover, and Jinyu Han. 2014. An audio fingerprinting system for live version identification using image processing techniques. In Proc. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. 644–648. https://doi.org/10.1109/ICASSP.2014.6853675
[29] Gerard Roma and Xavier Serra. 2015. Querying Freesound with a microphone. In Proc. the 1st Web Audio Conference (WAC).
[30] Hiroaki Sakoe and Seibi Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on 26, 1 (1978), 43–49. https://doi.org/10.1109/TASSP.1978.1163055
[31] Justin Salamon and Juan Pablo Bello. 2017. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24, 3 (2017), 279–283. https://doi.org/10.1109/LSP.2017.2657381
[32] John P. Scott. 1976. Genetic variation and the evolution of communication. In Communicative Behavior and Evolution (1st ed.), Martin E. Hahn and Edward C. Simmel (Eds.). ACM Press, New York, NY, 39–58.
[33] Sound-Ideas.Com. 1978. Sound Ideas. Retrieved September 30, 2019 from https://www.sound-ideas.com/
[34] SoundCloud.Com. 2008. SoundCloud. Retrieved September 30, 2019 from https://www.soundcloud.com/
[35] Timothy J. Tsai, Thomas Prätzlich, and Meinard Müller. 2016. Known artist live song ID: A hashprint approach. In Proc. International Society for Music Information Retrieval Conference (ISMIR). 427–433.
[36] Avery Wang. 2003. An industrial strength audio search algorithm. In Proc. International Society for Music Information Retrieval Conference (ISMIR). 7–13.
[37] Erling Wold, Thom Blum, Douglas Keislar, and James Wheaten. 1996. Content-based classification, search, and retrieval of audio. IEEE Multimedia 3, 3 (1996), 27–36. https://doi.org/10.1109/93.556537
[38] Yichi Zhang and Zhiyao Duan. 2015. Retrieving sounds by vocal imitation recognition. In Proc. Machine Learning for Signal Processing (MLSP), 2015 IEEE International Workshop on. 1–6. https://doi.org/10.1109/MLSP.2015.7324316
[39] Yichi Zhang and Zhiyao Duan. 2016. IMISOUND: An unsupervised system for sound query by vocal imitation. In Proc. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. 2269–2273. https://doi.org/10.1109/ICASSP.2016.7472081
[40] Yichi Zhang and Zhiyao Duan. 2016. Supervised and unsupervised sound retrieval by vocal imitation. Journal of the Audio Engineering Society 64, 7/8 (2016), 533–543. https://doi.org/10.17743/jaes.2016
[41] Yichi Zhang and Zhiyao Duan. 2017. IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation. In Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on. 304–308.
[42] Yichi Zhang and Zhiyao Duan. 2018. Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation. In Proc. Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. 2406–2410. https://doi.org/10.1109/ICASSP.2018.8461729
[43] Yichi Zhang, Bryan Pardo, and Zhiyao Duan. 2019. Siamese style convolutional neural networks for sound search by vocal imitation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 2 (2019), 429–441. https://doi.org/10.1109/TASLP.2018.2868428
[44] Moshe M. Zloof. 1977. Query-by-example: A data base language. IBM Systems Journal 16, 4 (1977), 324–343.
