A UTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON … · 2014-03-06 · A UTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON-VERBAL ACOUSTIC FEATURES Tomasz Jacykiewicz 1

AUTOMATIC RECOGNITION OF LAUGHTER

USING VERBAL AND NON-VERBAL ACOUSTIC

FEATURES

Tomasz Jacykiewicz1 Dr. Fabien Ringeval2

JANUARY, 2014

DEPARTMENT OF INFORMATICS - MASTER PROJECT REPORT

Département d’Informatique - Departement für Informatik • Université de Fribourg - Universität Freiburg •Boulevard de Pérolles 90 • 1700 Fribourg • Switzerland

[email protected], BENEFRI Master Student, University of [email protected], Supervisor, University of Fribourg

Abstract

Laughter is a fundamental social event, yet our knowledge about it is incomplete. Several researches have

been conducted so far for the automatic recognition of laughter. The majority of them were especially

focused on non-verbal spectral-related features. In this Master Thesis we investigate three classifica-

tion problems in two approaches. We consider the discrimination of the speech from (1) laughter, (2)

speech-laugh and (3) the two previous types of laughter plus an acted one. All experiments were con-

ducted on the MAHNOB Laughter Database. We applied the leave-one-subject-out cross-validation as

evaluation framework to achieve both language and speaker independence during the automatic laugh-

ter recognition. We evaluated the scores using weighted accuracy (WA). The first approach is based on

non-verbal features. We tested four feature sets prepared for the INTERSPEECH (IS) Challenges be-

tween 2010 and 2013 and we proposed another four sets of features based on formant values (F). We

obtained a very high performance ((1) WAIS = 98%,WAF = 92%; (2) WAIS = 86%,WAF = 75%;

(3) WAIS = 92%,WAF = 86%) for both group of feature sets, with a domination of IS feature sets.

The feature-level fusion was investigated for the best sets of each group and it improved the scores in

one classification problem (1), even though the results were already very high. The second approach

is based on verbal features. We tested Bag-of-Words and n-grams modeling based on acoustic events

detected automatically, i.e. voiced/unvoiced segments, pseudo-vowels/pseudo-consonants and acoustic

landmarks. The feature sets based on acoustic landmarks (AL) achieved the best scores in all experi-

ments ((1) WAAL = 78%; (2) WAAL = 73%; (3) WAAL = 74%), though the results were not as

good as in the first approach. We observed that n-grams perform generally better than Bag-of-Words,

which shows that the sequencing of units based on acoustic landmarks is more pertinent for automatic

laughter recognition than their distribution. The results obtained in this thesis are very promising, since

the state-of-the-art performance in automatic recognition of laughter from speech signal was significantly

improved, i.e. from 77% to 98%.

Acknowledgment

I am grateful to Pr. Rolf Ingold, the head of the DIVA research group, for giving me the opportunity of

writing the Master Thesis in this group. I am also very thankful to my supervisor, Dr. Fabien Ringeval,

for his assistance in my approach to the machine learning domain and professional explanations, for his

pertinent advices and constructive feedbacks and for our fruitful discussions. Finally, I want to express

my gratitude to those who supported me mentally during the time of writing this Master Thesis.

Contents

1 Introduction 1

1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Laughter in human interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3.1 Eliciting laughter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.2 Categories of laughter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Psychoacoustic models of laughter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Automatic speech processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.6 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 State-of-the-art in laughter recognition 9

2.1 Available databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Relevant acoustic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Spectral-related features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.2 Prosodic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.3 Voicing-related features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.4 Modulation spectrum-related features . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.5 Acoustic-based verbal features . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Popular classification methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.1 Support Vector Machines (SVMs) . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2 Gaussian Mixture Model (GMMs) . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.3 Artificial Neural Network (ANN) . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 Actual performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.1 Audio-only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

CONTENTS

2.4.2 Video-only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.3 Audio-visual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Automatic laughter recognition based on acoustic features 19

3.1 MAHNOB database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.2 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2 Non-verbal acoustic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.1 INTERSPEECH Challenge feature sets . . . . . . . . . . . . . . . . . . . . . . 21

3.2.2 Formants and vocalization triangle . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.1 Speech vs Laughter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.2 Speech vs Speech-Laugh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.3 Speech vs All types of laughter . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Automatic laughter recognition based on verbal features 37

4.1 Acoustic events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.1 Voiced/unvoiced/silence segments . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.2 Pseudo-vowel/pseudo-consonant/silence segments . . . . . . . . . . . . . . . . 39

4.1.3 Phonetic based landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1.4 Perceptual centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Results: acoustic events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


4.2.2 Speech vs Speech-Laugh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


4.3 Direct fusion of acoustic events and p-centers . . . . . . . . . . . . . . . . . . . . . . . 45

4.4 Results: acoustic events & p-center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


4.4.2 Speech vs Speech-laugh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

CONTENTS

5 Conclusions 49

CONTENTS

List of Figures

1.1 A waveform (top) and the corresponding frequency spectrum (bottom) of a typical laugh,

composed of six vowel-like notes, showing the regularities. Adapted from [84] . . . . . 5

1.2 Four generations of speech and speaker recognition research; taken from [93] and adapted

from [37]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 The vocalization triangle (in red). Adapted from [93]. . . . . . . . . . . . . . . . . . . . 25

3.2 Spectrograms of speech (above) and laughter (bottom) segments with values of formants

marked in color: red - F1, green - F2, blue - F3 and yellow - F4. . . . . . . . . . . . . . 27

3.3 A waveform (above) and the corresponding spectrogram (bottom) of a segment annotated

as speech-laughter; between 0.0 and 1.0 second the segment contain a normal speech and

between approximately 1.05 and 1.4 second the speech is interfered by a laughter. . . . . 29

3.4 Redistibution of features after applying the CFS. . . . . . . . . . . . . . . . . . . . . . 31

3.5 Redistribution of features after applying the CFS for the Set LE. . . . . . . . . . . . . . 34

4.1 A speech signal (blue) with its energy (black), voiced (darkgreen) and unvoiced (red)

segments. The black dashed line is the silence threshold. Adapted from [107]. . . . . . . 38

4.2 A waveform (top) of a short two-syllable (“ha-ha”) segment of laughter produced by a

subject with landmarks indicated and the corresponding spectrum (bottom). . . . . . . . 40

4.3 Energy waveforms of 5 frequency and one voicing (bottom) bands. Adapted from [14]. . 40

4.4 A rhythmic envelope extracted on a speech signal, adapted from [93]. . . . . . . . . . . 42

4.5 A rhythmic envelope extracted on a speech signal (red) and perception levels of p-centers

with threshold 1/3, 1/4, 1/6 of the amplitude (gray scale); adapted from [93]. . . . . . . . 42

Chapter 1

Introduction

This chapter unveils the subject and the aim of this Master thesis. First, the motivations that conducted

this study are given and the thesis structure is uncovered. Then, the notion of laughter in human inter-

actions is explained and the psycho-acoustical model of laughter is presented. A brief description of the

concept of automatic speech processing and automatic laughter recognition is also given. At the end, the

contributions brought with this work are listed.

1.1 Motivations

The laughter, among with other non-lexical utterances, like moan or cry, was reported to appear before

the development of speech and to be used as an expressive-communicative social signal [96]. However,

the knowledge about laughter is incomplete and it lacks of empirical studies [52]. Knowing how to auto-

matically recognize the laughter, can be useful in analyzing the context and circumstances that provoked

it. That could have practical application in more efficient exploration of neuro-behavioral topics [85]

or better perception of human affects [97]. That would also improve human-computer interfaces (i.e.

human-centered computing) by sensing human behavioral signals [71] and social attitudes, like agree-

ment or disagreement [13]. Moreover, recognizing laughter more correctly as non-speech segments of

signal would augment the performance of automatic speech recognition systems [75]. Automatic recog-

nition of laughter could also be helpful in automatic analysis of nonverbal communication in groups [39].

At last, automatic tagging of multimedia data and, in consequence, their retrieval [122] could be improved

while using a user’s laughter as feedback.

1

2 CHAPTER 1. INTRODUCTION

1.2 Thesis structure

This thesis is split into five chapters.

The chapter 1 describes briefly what laughter is, what is its role in human interactions, what are its

characteristics and what could be the benefits of recognizing the laughter automatically. It also gives an

introduction to automatic signal processing.

In the chapter 2, an introduction of the state-of-the-art in automatic laughter recognition is given, with

an overview of available databases containing laughter, a summary of types of features utilized for that

task, classification methods and results of accomplished experiments.

The chapter 3 presents our first approach in laughter recognition, which is based on acoustic features

of a speech signal. We describe in detail the MAHNOB database [75] that we used for the experiments,

the feature sets that we tested and our methodology.

In the chapter 4 we present our second approach in laughter recognition, which is based on verbal

features extracted from acoustic events of a speech signal. We describe the types of acoustic events we

used and the corresponding results.

In the chapter 5 we express our conclusions and we discuss the perspectives on future works.

1.3 Laughter in human interactions

Human communication is a base of human existence. It is fundamentally a social phenomenon and is

important for many social processes. Humans communicate by exchanging information with others. The

word “communicate” 1 comes from Latin communicationem, noun of the action of communicare - "to

share, divide out; communicate, impart, inform; join, unite, participate in". A piece of information can be

transmitted with the use of verbal codes (i.e. words) or non-verbal codes. In both cases, the information is

valid as long as it is comprehensible for both, the sender and the receiver of the message. However, unlike

verbal, the non-verbal communication can be unintentional, i.e. speaker / listener may not be aware of

it. Such type of communication includes postures, proximity, gestures, eye contact or sounds like groan,

sigh or laughter. The last one is of huge interest, since it can bring some context-related information,

which can help to better understand the meaning of an utterance.

Reflections about the laughter has been reported in the past in different ways by many important

personages in the domain of science and philosophy, namely Aristotle, Kant, Darwin, Bergson and Freud

[84]. Laughter is estimated to be about 7 million years old [63], which means that when we laugh

1http://www.etymonline.com/index.php?term=communication

1.3. LAUGHTER IN HUMAN INTERACTIONS 3

spontaneously, we use a capacity rooted in our most primitive biology. We are born with this innate skill,

since the laughter is observable even among deaf-blind children [96]. Laughter was already identified in

5-week-old infants [95], far before they say their first word. It is a direct appeal for mutuality [32] and,

in opposite to speech, indicates positive feedback when done synchronously with others. Its evolution

assisted the progress of developing positive social relations in groups, as suggested in [68]. Since laughter

is associated with release of tension and some stress physiology issues [10], it was clinically used for

patients relaxation [112]. Further works in that field could help medicine to take advantage of these

positive effects of laughter in treatment of psychiatric disorders like depression [70].

1.3.1 Eliciting laughter

Laughter can be elicited in different ways, generally depending on age. Among infants, the laughter oc-

curs during tickling or surprising sounds, sights, or movements as well as due to motor accomplishments

like standing up for the first time [95]. Among kids, the laughter was reported in response to energetic

social games like the chasing and running activities of rough-and-tumble play [108]. Among adults, the

laughter occurs mostly during friendly social interactions, like greetings or remarks and not in response

to explicit verbal jokes [70]. Only less than 20% of the conversational laughter is elicited by humorous

comments or stories [84]. Laughter is a social signal, so the stimulus that elicits it the most is another

person rather than something funny [86]. That explains why people laugh about 30 times less often when

they are alone and without any stimulus, like television or book, than they do when they are in a social

situation - we are more likely to smile and talk to ourselves than laugh while being alone [84]. Moreover,

teasing or criticizing ironically our relatives and those we admire, may also provoke laughter - the more

important person is to us, the more mirth it provokes [70]. The superiority theory suggests that we laugh

at someone else’s mistakes or misfortune, because we feel superior to this person [16]. On the other hand,

persistent one-sided laughter can signal seeking of dominance [70]. The laughter can be elicited also in

response to a relief of tension - this kind of manipulation is being used in movies: moments of suspense

are often followed by a side, comic comment [16]. People are so keen to laughing that a humorist became

a real profession. Since 1998, in United States, the greatest humorists have been annually awarded for

their contribution to the American humor and their impact on American society by the John F. Kennedy

Center for the Performing Arts - The Mark Twain Prize for American Humor, named after the 19th cen-

tury novelist, essayist and satirist Mark Twain [5]. So far, such great comedians as Richard Pryor (1998),

Whoopi Goldberg (2001), George Carlin (2008) or Bill Cosby (2009) were awarded. However, we have

to be aware and keep in mind that in some cultures laughing or smiling to others is disapproved [99].


1.3.2 Categories of laughter

Laughter is not a stereotyped signal and it can be categorized in different dimensions. A standard laughter

does not exist - there is a plenty of varieties of laughter [9]. Laughter can be voiced (about 30% of all

analyzed laughs), unvoiced (about 50%) or mixed (the remaining 20%). Laughter can be categorized

according to the number of laugh syllables: comment laugh - 1 syllable, a quiet laugh (chuckle) - 2 syl-

lables, rhythmical (3 and more syllables) and very high pitched laugh (squeal) [64]. Laughter can be

categorized according to its emotional content [46, 55]. Humans are capable of intuitively interpreting a

wide range of meanings contained in the laughter, such us: sincerity, nervousness, hysteria, embarrass-

ment, raillery or strength of character [32]. It is also interesting to note that a variety of synonyms of the

term “laughter” are often onomatopoeic, e.g. giggle, cackle or titter [32]. Speech-synchronous forms of

laughter, i.e. speech-laugh - laughing simultaneously with articulation, seem to constitute another cate-

gory of non-verbal vocalization, since the segments of laughter are not just superimposed on articulation,

but are nested, while articulation configuration is preserved [117]. Smiled-speech can as well be consid-

ered as a laughter, since it can be distinguished from non-smiled speech only by listening [113]. Since

Darwin [24], there is a discussion whether smiling and laughing are extremes of the same continuum.

1.4 Psychoacoustic models of laughter

To represent the laughter, RR. Provine [84], took an approach of a visiting extraterrestrial who meets a

group of laughing human beings: “What would the visitor make of the large bipedal animals emitting

paroxysms of sound from a toothy vent in their faces?”. He proposed to describe physical characteristics

of that “noisy behavior”, the mechanism that produces it and rules that control it. Characteristics of the

creature producing the sound (such as gender or age) and information about other species that emit similar

sounds might as well be useful. R.R. Provine tried to describe sonic structure of human laughter, but he

found it was difficult, since humans do not have as much conscious control over laughter as we have

over speech - he asked people in public places to laugh, but about a half of subjects reported that they

could not laugh on command [84]. Moreover, although a laughter of a given person can be identifiable,

it is not invariant - typically every human uses several patterns of laughter. The good news, regarding

computational models of laughter, is that, there is no significant difference in the sounds used in laughter

between various cultures [32].

By the means of a sound spectrograph2, Provine [84] analyzed the sonic properties of the laughter. A

sound of laughter can be represented as a series of notes (syllables) that resemble a vowel. Typically, a2A device that captures a sound and represent it visually as a variations of the frequencies and intensities of the sound over time.

1.4. PSYCHOACOUSTIC MODELS OF LAUGHTER 5

note is about 75 milliseconds long and is repeated regularly in 210-millisecond intervals. The figure 1.1

presents a waveform and the corresponding spectrogram of a typical laughter.

Figure 1.1: A waveform (top) and the corresponding frequency spectrum (bottom) of a typical laugh, composed of

six vowel-like notes, showing the regularities. Adapted from [84]

A sequence of laughter syllables in one exhalation phase is called a bout and a laughter episode is

defined as a sequence of bouts separated by one or more inhalation phases [117]. The production of

the sound of laughter depends on openness of the mouth (closed/half open/fully open) and glottalization

[32]. The aspiration /h/ is a central sound feature of laughter and it can be repeated or combined with

any vowel or vocalic nasal, e.g. /m/ or /n/ [32] . There is no specific vowel sounds that define laughter,

however, a particular laugh is typically composed of similar vowel sounds, e.g. "ha-ha-ha" or "hi-hi-hi"

are possible structures of laugh, but "ha-hi-ha-hi" is not. This is due to some intrinsic constraints of our

vocal apparatus that prevent us from producing such sounds [84]. If a variation happens, it is generally in

the first or the last note in a sequence, i.e. “ha-ha-hi” is a possible structure of laughs. Another constraint

prevents us from producing abnormally long laughter notes [84]. That’s why we rarely hear laughters like

“haaa-haaa-haaa”; and when we actually do hear it, we might suspect that it is a faked one, such as the

laughter of Nelson from The Simpsons series. Abnormally short notes, i.e. who last much less than 75

milliseconds, are as well contrary to our nature. Similarly, too long or too short inter-note intervals are

rare.

Laughter has a harmonic structure [84]. Each harmonic is a multiple of a low, fundamental frequency.


The fundamental frequency of the laughter of females is higher than the laughter of males. However, all

human laughter is a variation of this basic form. This allows us to recognize the laughter, no matter how

much we differ from each other.

Laugh notes are temporarily symmetrical [83], which means that a short bout played backward will sound

similarly to its original version. This can be seen as well on the sound spectrum - its form is very similar

while reading in both directions. However, not every aspect of laughter is reversible. The loudness of a

segment of laughter declines gradually over time (probably because of the lack of air) and thus can be

described by a decrescendo3.

At last, the placement of laughter segments in the flow of speech is not random, which makes it a very

important feature. In more than 99% of cases (1192 out of 1200 samples) the laughter appeared during

the pauses at the end of phrases [84]. This is called the punctuation effect [83] and it may suggest that

a neurologically based process gives more priority for accessing the single vocalization channel to the

speech than to the laughter.

1.5 Automatic speech processing

The research in automatic speech processing has been evolving for over 50 years now [36]. Various types

of systems have been proposed so far: from a simple isolated digit recognition system for a single speaker

[25] up to complex systems recognizing speakers and their utterances in multi-party meetings [34]. The

figure 1.2 presents the evolution of the research. The general schema of automatic speech recognition

(ASR) is composed of three steps: (1) retrieving raw data from sensors, (2) pulling characteristic details

(i.e. features extraction) and (3) detecting patterns based on predefined models, i.e. classification. While

the first step is more linked to the environment, the second and third steps depend more on the specificity

of a task. There are two major categories of modeling methods for speech data: verbal, e.g. Bag of

Words [124], n-grams [28] or maximum likelihood estimates [27], and non-verbal, e.g. perceptual linear

predictive (PLP) technique [44] or mel-frequency cepstral features [43], as well as several classification

methods, e.g. Hidden Markov Models (HMM) [87], Artificial Neural Networks (ANN) [57], Gaussian

Mixture Models (GMM) [91] or Support Vector Machines (SVM) [38].

However, the human communication consists not only of speaking but also of other wordless cues.

One of them, the para-language, use the same communication channel as speech does, i.e. the voice.

Knowing how to recognize these types of signals, not only can improve the performance of speech recog-

nizers, but also can help to retrieve other important informations, perhaps relevant for understanding the

3A gradual decrease in volume of a musical passage; taken from Merriam-Webster Dictionary.

1.6. THESIS CONTRIBUTIONS 7

Figure 1.2: Four generations of speech and speaker recognition research; taken from [93] and adapted from [37].

context. One of the branches that explores the paralinguistic signals is a research of laughter recognition

systems. The main goal in this research area is aimed to detect any kind of laughter in spontaneous speech

data. The main difficulty of laughter recognition is the definition of laughter itself and the variability of

its forms. The general process of automatic laughter recognition is the same as for the automatic speech

recognition. There has already been several studies in this research field. Earlier attempts aimed solely to

detect whether a particular segment of signal, generally of a duration of 1 second, contained laughter, e.g.

[49, 118]. More recent works, e.g. [53, 54], are more precise in their predictions and are able to detect the

start and the end of a laughter segment, though there still remain some challenges to reach the ultimate

goal, i.e. the human-level performance. More on the state-of-the-art experiments in laughter recognition

is presented in the chapter 2.

1.6 Thesis contributions

The purpose of this Master thesis is to develop a system of laughter recognition using two different

approaches: non-verbal and verbal methods. Non-verbal approach is especially used in automatic speech

and emotion recognition, but it showed also a good performance in laughter recognition, e.g. in [77].


Our motivation to investigate this approach is conducted by a desire to improve the performance by using

other types of features, such as new models of voice quality. The use of a verbal model was motivated

by the lack of experiments of this method. The only attempt, to our best knowledge, in using a verbal

approach in laughter recognition was done by Pammi et al. [69] and presented promising results. Thus,

we use the Bag-of-words and n-grams as features extraction methods, that we applied on automatically

detected acoustic event of different nature.

Since a standard laughter does not exist [9], we introduce three classification problems: discrimination

of speech (1) from laughter, (2) from speech-laugh and (3) from all types of laughter (including two

previous types of laughter plus an acted one). All experiments are done on multilingual speech corpus

containing spontaneous and forced laughters as well as speech-laughs, i.e. MAHNOB Laughter Database

[75], to investigate both speaker and language independent recognition of laughter.

Our results for non-verbal approach showed that the classification of acoustic-based features can

obtain very high scores and that formant-related features can also be pertinent, especially when using a

logarithmic scale of formant values normalized by their energies.

The results obtained using verbal approach, although not as good as those obtained with the non-

verbal approach, also showed a good performance and great potential. Since this method has been little

exploited in laughter recognition, a lot of perspectives wait to be unfold.

Chapter 2

State-of-the-art in laughter recognition

This chapter presents the state-of-the-art in laughter recognition. It includes brief descriptions of the

databases which contain laughter episodes, popular types of features, commonly used classifiers and

obtained performance for its automatic recognition.

2.1 Available databases

There are several audiovisual databases, available for download, that contain laughter events. An overview

is presented in Table 2.1.

ILHAIRE Laughter Database [59] is an ensemble of laughter episodes extracted from five existing

databases. (1) The Belfast Naturalistic Database is composed of video materials, drawn from television

programmes, talk shows, religious and factual programmes, that contain positive and negative emotions.

53 out of 127 clips contain laughter and were included to the ILHAIRE database. (2) The HUMAINE

Database [30] is composed of 50 audiovisual clips with diverse examples of emotional content, drawn

from various sources like TV interviews or reality shows. Although their quality is variable, they present

a variety of situations in which laughter occurs. 46 laughter episodes were extracted to include in the

ILHAIRE database. (3) The Green Persuasive Database [30] contains 8 interactions between a University

Professor and his students, who tries to persuade them to use more environmentally friendly lifestyle. 280

instances of conversational or social laughter were extracted to include in the ILHAIRE database. (4) The

Belfast Induced Natural Emotion Database [109] is composed of 3 sets of audiovisual clips containing

emotionally coloured naturalistic responses to a series of laboratory based tasks or to emotional video

clips. 289 laughter episodes were extracted from a total of 565 clips of the Set 1. Ongoing works aim

9

10 CHAPTER 2. STATE-OF-THE-ART IN LAUGHTER RECOGNITION

to add laughter episodes from the Sets 2 and 3. (5) The SEMAINE Database [60] is composed of high-

quality audiovisual clips recorded during an emotionally coloured interaction with an avatar, known as

a Sensitive Artificial Listener (SAL). 443 instances of conversational and social laughter were extracted

from 345 video clips to include in the ILHAIRE database.

AMI Meeting Corpus [19] (stands for Augmented Multi-party Interaction) is a multi-modal data set

consisting of 100 hours of meeting recordings. About two-third of meetings are scenarios played by four

people - design team members who take off and finish a design project. The rest consists of naturally

occurring meetings, such as a discussion between four colleagues about selecting a movie for a fictitious

movie club or a debate of three linguistics students who plan a postgraduate workshop [6]. The corpus

provides orthographic transcription and annotations for many different phenomena like dialogs, hand / leg

/ head movements etc. Laughter annotations are also provided but only approximately, i.e. no start neither

end time is given, but only a time stamp about occurring laughter. Although the language spoken in the

meeting is English, most of the participants are non-native speakers with therefore a variety of accents.

This database was used, inter alia, in [77, 76, 73] (only close-up video recordings of the subject’s face

and the related individual headset audio recordings).

ICSI Meeting Corpus [62] is a collection of 75 meetings (approximately 72 hours of speech) col-

lected at the International Computer Science Institute in Berkeley. In comparison to AMI Meeting Cor-

pus, ICSI corpus contains real-life meetings - regular weekly meetings of various ICSI working teams,

including the team working on the ICSI Meeting Project, with an average of six participants per meeting

and a total of 53 unique speakers (13 females and 40 males) varying in fluency in English. Each partic-

ipant wore close-talking microphone. In addition, six tabletop microphones simultaneously recorder the

audio. Annotations include events like coughs, lip smacks, microphone noise and laughter. This database

was used in [118, 53].

AudioVisual Laughter Cycle database [119] is a collection of audiovisual recordings of 24 subjects

(9 females and 15 males) from different countries that were registered by one web-cam, seven infrared

cameras and a headset microphone while watching 10-minute-compilation of funny videos. Annotations

were added by one annotator using a hierarchical annotation protocol: a main class (laughter, breath,

verbal, clap, silence or trash) and its subclasses. The laughter subclasses include temporal structure

(number of bouts and syllables, as proposed by Trouvain [117]) and type of sound (e.g. voiced, breathy,

nasal, etc.). The number of laughter episodes is around 44 per participant (in average 23.5% of all

recordings) and 871 in total. This database was used in [69].

AVIC (Audiovisual Interest Corpus) [100] is a collection of audiovisual recordings where subjects

are interacting with an experimenter who plays a role of a product presenter and leads the subject through

2.2. RELEVANT ACOUSTIC FEATURES 11

a commercial presentation. The subject is asked to interact actively but naturally depending on his/her

interest in the proposed product. The presentations are held in English, but most of the 21 subjects (10

females and 11 males) are non-native speakers. Data is recorded by a camera and two microphones, one

headset and one far-field microphone. The total duration is 10 hours and 22 minutes with 324 laughter

episodes. Annotations are done by four independent annotators, mainly to describe a level of interest

(disinterest, indifference, neutrality, interest, curiosity), but some additional annotations for nonlinguistic

vocalizations (like laughter, consent or hesitation) are also available.

MAHNOB Laughter Database [75] is the most recent audiovisual corpus available on-line (after an

end user license agreement is signed) at http://mahnob-db.eu/laughter/. It consists of 22 subjects (12

males and 10 females) from 12 different countries. During sessions they were watching funny video

clips in order to elicit laughter, but also some posed laughter and smiles were recorded. In addition, they

were asked to give a short speech in English as well as in their native language to create a multilingual

speech corpus, since all previous works employ only utterances in English, which can bring some bias to

the discrimination models. All recordings were done by a camera with a microphone and an additional

lapel microphone. Annotations were performed by one human annotator using 9 labels, i.e. laughter,

speech, speech-laugh, posed smile, acted laughter, laughter + inhalation, speech-laugh + inhalation,

posed laughter + inhalation, and other. Such amount of labels resolves problems like whether an au-

dible inhalation that follows a laughter belongs to it or not [117]. In addition to that, a second level of

annotations were added which separates all the laughter episodes into voices and unvoiced. This step

was performed by a combination of two approaches, i.e. manual labeling by two human annotators and

automatic detection of unvoiced frames based on the pitch contour by PRAAT software [12].

This database was chosen for this master project because of its multilingual character and the fact that

it contains speech from the same subjects that produce laughter, which makes it appropriate for training

a system to distinguish laughter and speech characteristics. More about its structure and annotations can

be found in the chapter 3.1.

2.2 Relevant acoustic features

This section describes briefly the most common features and extraction methods used in the state-of-

the-art systems for automatic laughter recognition from speech signal. The purpose of feature extraction

step in speech processing is to reduce the quantity of informations passed to the classifier by adapting

its parameters so that their discrimination capabilities are adjusted to the classes that are modeled. The

most common extraction techniques use the models of human auditory system [40] and are based on a


Table 2.1: Overview of the existing databases containing laughter. Three type of interaction exists: dyadic - subjects

interact with agents which play a role; elicit - laughter is engendered by watching funny videos; spontaneous -

recordings from real-life meetings. H: headset, L: lapel, F: far-field, ?: no information available.

Name Interaction # subjects # episodes Camera res. Mic. type # raters

SEMAINE [60] Dyadic 150 959 780x580 H, F 28

AVIC [100] Dyadic 21 324 720x576 L, F 4

MAHNOB [75] Elicit 22 563 720x576 C, L 2

AVLC [119] Elicit 24 871 640x480 H 1

Belfast Nat. [29] Elicit 125 298 - ? 3

Belfast Ind. [109] Elicit 256 1300 1920x1080 F 1

AMI [19] Spontaneous 10 124 720x576 H, L, F ?

ICSI [62] Spontaneous 53 9181 - H, L, F ?

short-term spectral analysis [88]. They are successfully used in tasks like speech recognition, speaker

recognition and emotion recognition [101], but also to distinguish the speaker’s paralinguistic and id-

iosyncratic traits such as gender or age [61].

Since a speech signal is not stationary in long-term and it is quasi-stationary in short-term, before

the feature extraction phase, the signal is divided into overlapping segments using a sliding window.

Although there is no standard length of a segment, in speech recognition the duration of approximately

25 ms is used (which corresponds to the length of a phoneme), with a shift of 10 ms to ensure stationarity

between adjacent frames and to cover phonetic transitions, which contain important information about

the phones nearby [80]. Characteristics obtained in that way are also called the Low-Level Descriptors

(LLDs). Depending on the type of information extracted from the speech signal, acoustic features can be

further divided into spectral, prosodic, voicing-related and modulation features; we detail them in the next

section. They can be exploited in two approaches: dynamic or static [102, 94]. In the dynamic approach,

a classifier is optimized directly using the LLDs values. In the static approach, the LLDs first undergo a

set of statistical measure over a duration and then are passed to a classifier.

To reduce growing dimensionality of the features, a set of techniques can be used, e.g. Principal

Component Analysis (PCA) or Linear Discriminant Analysis (LDA) techniques; calculating a mean and

a standard deviation for a temporal window [78] or for whole laughter segment [49]; polynomial fitting

that approximately describes a curve of feature values using a p th order polynomial (the best results seem

2.2. RELEVANT ACOUSTIC FEATURES 13

to be produced by a quadratic polynomial [77]) or the Correlation Feature Selection (CFS).

2.2.1 Spectral-related features

The Perceptual Linear Predictive (PLP) analysis uses some concepts from the psychoacoustic and is

more consistent with human hearing than conventional linear predictive (LP) analysis. The representation

of an audio signal is low-dimensional. PLP analysis is computationally efficient and speaker-independent

[44]. In [118, 76], 13 PLP coefficients for a frame were used, while, in [77] and [49], only 7 PLP

coefficients were calculated, which was found to lead to a better performance in laughter recognition:

F-measure of 64% was obtained for 13 coefficients on AMI corpus using neural networks [76], while

F-measure of 68% for 7 coefficients on the same corpus also using neural networks [77]. In all above

cases, also their delta values were calculated.

The RelAtive SpecTrAl-Perceptual Linear Predictive (RASTA-PLP) technique suppresses the

spectral components that are outside the range of a typical rate of change of speech (the vocal tract

shape) [45]. Thus, it adds some filtering capabilities for channel distortions to PLP features. In [31], it

was shown that this features result in better performance in speech recognition tasks in noisy environ-

ments than PLP. RASTA-PLP features were tested in [89] in laughter detection task, comparing GMM

and HMM. GMM classifiers performed slightly better (mean AUC-ROC of 0.825) than HMM classifiers

(mean AUC-ROC of 0.822).

The Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound.

The speech amplitude spectrum is represented in compact form. Mel Frequency Cepstral Coefficients

(MFCCs) have been dominant features, widely used in speech, speaker and emotion recognition tasks.

Typically 13 MFCCs are used (e.g. in [53]). However it has been reported [49] that a use of only the first

6 MFCCs in laughter detection results in the same performance as when using 13 MFCCs. It can suggest

that characteristics of laughter are more discriminative in lower frequencies. 6 MFCCs were later used in

[79, 78].

2.2.2 Prosodic features

Prosody reflects several features of the sound like its intonation or stress, thus prosodic features are used

in speech, speaker and emotion recognition. The prosodic of a sound signal are variations in pitch (a

perceived fundamental frequency of a sound), loudness (a perceived physical strength of a sound), voice

quality (a perceived shape of acoustic formants) and duration patterns (rhythm). Those were therefore

used in laughter recognition for their ability to describe dynamic patterns, e.g. in [118, 78].


2.2.3 Voicing-related features

Bickley and Hunnicutt [11] found out that ratios of unvoiced to voiced duration in laughter signals are

decidedly greater than a typical ratio for spoken English. This characteristic was used in [118], by calcu-

lating the fraction of locally unvoiced frames and the degree of voice breaks. Voice breaks are determined

by dividing the total duration of breaks between voiced part of the signal by the total duration of the ana-

lyzed segment of signal.

2.2.4 Modulation spectrum-related features

The rhythm and the repetitive syllable sounds, e.g. vowel sounds which are characteristic for most laugh-

ter, can be extracted by calculating amplitude envelope, via a Hilbert transform which employs the dis-

crete Fourier transform (DFT), then applying a low-pass filter and down-sampling. The expectation is

to capture the repeated high-energy pulses which occur roughly every 200-250 ms in laughter [11]. In

[118], the first 16 coefficients of the DFT are used as features, whereas in [49] the system uses the first

20 coefficients of the DFT.

2.2.5 Acoustic-based verbal features

Although all the preceding features are non-verbal, there exists also an approach to extract verbal fea-

tures, i.e. by quantifying the presence and the sequencing of linguistic or pseudo-linguistic units, via the

Bag-of-Words and n-grams [17, 20], respectively. These types of units can be detected automatically, in

unsupervised way, i.e. data-driven approach, e.g. voiced/unvoiced segments or pseudo-vowels/pseudo-

consonants. It is also possible to recognize a word in supervised manner, however, it requires the use of a

classifier like the Hidden Markov Models [21]. The idea of combining the automatic detection of units in

a speech signal was proposed originally in [107]; it was successfully applied on the emotion recognition

task on the SEMAINE corpus. The obtained scores were higher than for the method using manually tran-

scribed words, i.e. unweighted accuracy for the arousal and the valence dimensions were, respectively,

6% and 5.1% higher than in the experiment using words. In this thesis, we reuse this method by applying

it on the laughter recognition, as in [69], however, with other types of acoustic and rhythmic units.

2.3. POPULAR CLASSIFICATION METHODS 15

2.3 Popular classification methods

This section briefly describes machine learning algorithms that are commonly used in automatic laughter

recognition.

2.3.1 Support Vector Machines (SVMs)

The SVM is a machine learning method for solving binary classification problems. The general idea is to

map input vectors that are non-linearly separable into a higher dimension feature space that will allow a

linear separation of the two classes [23]. The optimization of the SVMs was for a long time a bottleneck

of this method, since the training required solving very large quadratic programming (QP) problems [92].

However, in 1998 Platt [81] proposed the Sequential Minimal Optimization (SMO) algorithm, that breaks

a QP problem into several, smallest possible, QP problems. The SVMs show a good generalization

performance in many classification problems, e.g. handwritten digit recognition [56] or face detection

[67]. For the laughter recognition they were used, inter alia, in [89], and are very popular in many speech

related decision problems for their ability to deal efficiently with very large features vectors, e.g. 5-6k

features.

2.3.2 Gaussian Mixture Model (GMMs)

The Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted

sum of Gaussian component densities and is widely used to model continuous measurements or biometric

features [90]. It found applications in systems for speaker identification or hand geometry detection [98].

It was successfully used for automatic laughter recognition in different studies [47, 89, 118].

2.3.3 Artificial Neural Network (ANN)

Artificial Neural Network is a computational methodology of analysis inspired by the biological neuronal

networks [126]. The model, based on our knowledge of the structure and functions of neurons (i.e. neuro-

computing), is composed of layers of computing nodes interconnected by weighted lines [26]. The ANN

has been widely used in different areas, e.g. speech recognition, medicine [50] or forecasting [115]. It

was also used for automatic laughter recognition [53, 77, 76, 78, 79]. However, such systems are not easy

to use on both large datasets and large features vectors due to the time needed for the training phase, that

can take up to several weeks on very powerful computers.


2.4 Actual performance

A number of attempts has been done so far in building laughter recognition systems. This section presents

a summary of experiments done with audio-only, video-only and audiovisual features.

2.4.1 Audio-only

The majority of experiments that use only audio signal to recognize laughter are focused on spectral

features, especially PLP and MFCC. The best performance (F-measure of 77%) among audio-only exper-

iments was achieved using 7 PLP coefficients and their delta values (14 features in total) with a context

window of 320 ms [77]. The system used neural networks and was trained on AMI corpus. Adaptive

Boosting meta-algorithm selected the best features for classification. Slightly worse result (F-measure of

75%) was achieved in another experiment in [77], with the same set of 14 features and neural networks,

however using mean and standard deviation values over the context window. Without any context infor-

mation, this system achieved F-measure of 68%. A system [76] obtained F-measure of 62% while using

13 PLP coefficients and their delta values (26 features in total), using as well neural networks trained on

AMI corpus. Another research [118] that exploit PLP features was built using GMM classifier trained on

ICSI corpus. It achieved an EER of 13 % when using 13 PLP and their delta values as features and 19 %

when using global pitch and voice-related features.

High results were also achieved using MFCCs. In [53], neural networks with one hidden layer com-

posed of 200 neurons were trained on ICSI corpus using 13 MFCCs and the highest normalized cross

correlation value found to determine F0 (AC PEAK) along with a context window of 750 ms. The system

achieved an EER of 8.15 %, while without the AC PEAK the EER augmented to 11.35 %. In [74], 13

MFCCs with their delta values were computed every 10 ms over a window of 40 ms. Time delay neural

networks with one hidden layer were trained on the SEMAINE database and achieved F-measure of 95.9

% and 17.1 % for detection of speech and laughter, respectively.

An experiment using other than spectral features was performed in [69]. Their system used the n-gram

approach based on automatically acquired acoustic segments using Automatic Language Independent

Speech Processing (ALISP) models [22, 51], trained on SEMAINE and AVLC databases and evaluated

with MAHNOB database. 3-gram and 5-gram yielded in similar F-measure of about 75 %, however

a small difference in precision and recall were noted between them, i.e. 3-grams model showed better

recall, but smaller precision than 5-grams.

2.4. ACTUAL PERFORMANCE 17

2.4.2 Video-only

The best performance of systems using only video features was achieved in [77] with a F-measure of

83%. Their system used neural networks trained on AMI corpus with 20 facial points projected to 4

PCs (7 to 10) as feature vector. Worse, F-measure of 60%, was achieved in [47], using GMM classifiers

with 10 facial points, trained on their own dataset (composed of seven 4-8 minutes dialogs). In another

research [89] trained SVM on AMI corpus using 20 facial points and achieved an EER of 13%. In [74], a

3D tracker capturing facial expression over 113 facial points [66] was tested. Time delay neural networks

with one hidden layer were trained on the SEMAINE database and achieved F-measure of 89.4 % and

9.8 % for detection of speech and laughter, respectively.

2.4.3 Audio-visual

Among systems using both audio signal and video sequence for laughter recognition, the best perfor-

mance, i.e. F-measure of 88%, was achieved in [78]. The system, composed of neural networks trained

on AMI corpus, uses, as audio features, 6 MFCCs with their delta values, mean and standard deviation

values of pitch and energy calculated over a context window of 320 ms, and 20 facial points projected to 5

PCs (6 to 10) as video features. The fusion is done on feature-level. The same performance was achieved

in [79], but using detection algorithms instead of classification. The system, as well as previously men-

tioned one, is composed of neural networks but trained on SAL corpus and tested on AMI corpus. It uses

also 6 MFCCs but only 4 PCs and the classification is done by a prediction technique. The prediction is

made in three dimensions: (1) a value of the current frame is predicted on the past values, (2) the current

audio frame is predicted on the current video frame and (3) the current video frame is predicted on the

current audio frame. A slightly worse performance was achieved in [77]. The system, composed of neural

networks trained on AMI corpus, uses 7 PLP coefficients and their delta values as audio features and 4

PCs as video features. When the modalities were fused on decision-level, the system achieved F-measure

of 86% in comparison to 83% when fusion was done on feature-level. In another research [76], a sim-

ilar system was built, but instead of using 7 PLP coefficients, 13 and their delta values were used. In

this study the decision-level fusion also performed better than feature-level, i.e. F-measure of 82% and

81%, respectively, however the results were worse than when using 7 PLP + their delta values. In [74],

a decision-level fusion of 13 MFCCs with their delta values and 3D tracker capturing facial expression

over 113 facial points [66] was tested. Time delay neural networks with one hidden layer were trained

on the SEMAINE database and achieved F-measure of 97.5 % and 25.5 % for detection of speech and

laughter, respectively.


Chapter 3

Automatic laughter recognition based

on acoustic features

In this chapter, we present the first part of our experiment on the recognition of laughter in spontaneous

speech signals. We investigate the usability of acoustic features extracted from the speech data. All exper-

iments are performed on the MAHNOB corpus [75], which is the most recent database containing speech

and laughter episodes produced by the same subjects, which makes it relevant for a development of a

system that distinguishes laughter and speech characteristics. This corpus was already used in researches

on the laughter recognition, i.e. [69, 75]. A brief description of the MAHNOB database was presented in

the chapter 2.

In the first section of this chapter, we explain the structure of the chosen corpus and the types of annota-

tions it uses. Next, we describe the types and the extraction methods of features that we tested, i.e. the

feature sets proposed by the INTERSPEECH Challenges and new feature sets related to the voice quality

based on formants. In the subsequent section, we illustrate the classification and evaluation procedures.

In the last section of this chapter, we present and discuss the results obtained for the three tasks of dis-

crimination that we performed: (1) speech vs laughter, (2) speech vs speech-laughter and (3) speech vs

all the types of laughter. Moreover, each task is divided into three parts: (I) the first part presents the

results achieved for the feature sets from the INTERSPEECH Challenges, (II) the second part presents

the results obtained with our feature sets based on the formants and (III) the last one presents the results

achieved with the fusion of the best feature sets from the two previous parts.

19

20 CHAPTER 3. AUTOMATIC LAUGHTER RECOGNITION BASED ON ACOUSTIC FEATURES

3.1 MAHNOB database

3.1.1 Structure

The MAHNOB database contains 180 sessions recorded by 22 subjects from 12 different countries using

a camera with a built-in microphone (2 channels, 48 kHz, 16 bits) and a lapel microphone (1 channel,

44.1 kHz, 16 bits). Each session is named after the combination of subject ID and the number of session

and it represents the time of watching from 1 to 5 funny video clips, depending on their length. Subjects

were not aware (with the exception of the three authors participating in the study) neither of the content

of the clips, nor of the purpose of the research. Additionally, every subject performed two extra speech

sessions, where they spoke for about 90 seconds in English and in their native language for the same

amount of time. This makes the corpus multilingual and allows a research about the influence of language

in discrimination of laughter from speech. Moreover, each subject was asked to produce laughter without

any humorous stimuli (i.e. posed laughter), however, more than a half of them found it difficult to do, as

confirmed in [96]. In all, the corpus consists of 90 laughter sessions, 38 speech sessions and 23 posed

laughter sessions. These sessions contain 344 speech episodes, 149 laughter episodes, 52 speech-laugh

episodes and 5 posed-laughter episodes.

3.1.2 Annotations

Start and end points of a speech signal are quite easy to determine. However, it is much harder for

laughter episodes, since it is not clearly defined how laughter episode should be divided [117]. The

MAHNOB corpus follows the principle, proposed in [9], that laughter is any sound expression, that would

be characterized as laughter by an ordinary person in ordinary circumstances. Annotations were added by

one human annotator using audio channels, with a support of video channel, where the segmentation was

not obvious. They consist of 9 classes, however, we focused only on 4 of them (speech, laughter, speech-

laugh and posed laughter), since the rest of them is related to smile (e.g. posed smile) while we are

interested uniquely in laughter, or they are extensions of the 4 selected ones (e.g. laughter + inhalation).

Annotations are stored in three different formats per session: (1) the Comma Separated Value (CSV)

file with the start and end times in seconds (for audio processing) and the start and end frames numbers

(for video processing); (2) ELANAnnotation (i.e. the software that was used to perform annotation [18])

file and (3) FaceTrackingAnnotation file. In addition, each session is provided with an XML file, that

specifies the general information like session ID and recording time, but also some clues specific for the

subject like his/her ethnicity and whether or not he/she has glasses, beard or mustache.

3.2. NON-VERBAL ACOUSTIC FEATURES 21

3.2 Non-verbal acoustic features

Audio recordings were split into segments according to time stamps provided in the annotation files.

Non-verbal acoustic features were extracted using openSMILE and configuration files from the popular

Interspeech ComParE international challenges on paralinguistic. These features are detailed in the next

sections. The aim of this task is to obtain characteristics for each segment and prepare them for the

classification task. The next, optional, step is the correlation-based features selection (CFS). Its goal is

to reduce the set of features to a subset that is highly correlated with laughter, while having low intra-

correlation, which therefore reduce redundancy in features (i.e. features are not correlated with each

other) [42]. The CFS accelerates the classification, since it is done only once for every application, and

the size of the feature subset can be vary small, e.g. from 6k to less than 100 features. For this purpose,

we use WEKA data mining software [41] with the Best-First search to find the best subset of features.

We also use WEKA for the classification task, because this software allows comparisons of performance

with other results from the literature [103, 105, 104, 106, 48, 65, 123, 8]. WEKA uses specific file

format, Attribute-Relation File Format (ARFF), that is divided into two sections: the Header and the

Data. The Header section contains the name of the relation (which is a meta-data that describes the type

of information present in the file) and a list of attributes (features) with their names and types. The last

attribute always specifies the classes. Lines that begin with a % are comments. The listing 1 presents an

example, taken from [1], of the header of an ARFF file that describes iris plants.

The second section, the Data, contains the values of the specified attributes. Each line represents one

instance. Values are comma-separated and the last one must match one of the specified classes. Attributes

whose values are unknown must be written with a question mark (which corresponds to a NaN value in

programming). The listing 2 presents an example, taken from [1], of the data section of an ARFF file that

describes iris plants.

WEKA provides a functionality to concatenate the data from two files, if they specify the same at-

tributes. The concatenation of attributes is however not supported by WEKA, but can be achieved with a

custom script.

3.2.1 INTERSPEECH Challenge feature sets

INTERSPEECH (IS) is an Annual Conference of the International Speech Communication Association

(ISCA) [2]. Since 2009, each year the conference is dedicated to a different issue concerning speech

communication science and technology, both from theoretical and empirical point of view. In addition to

that, every year brings a new challenge for spoken language processing specialists. The topics like para-


Listing 1 A header of an ARFF file describing iris plants [1].% 1. Title: Iris Plants Database

%

% 2. Sources:

% (a) Creator: R.A. Fisher

% (b) Donor: Michael Marshall (MARSHALL%[email protected])

% (c) Date: July, 1988

%

@RELATION iris

@ATTRIBUTE sepallength NUMERIC

@ATTRIBUTE sepalwidth NUMERIC

@ATTRIBUTE petallength NUMERIC

@ATTRIBUTE petalwidth NUMERIC

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Listing 2 A data section of an ARFF file describing iris plants [1].@DATA

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa


linguistics, speaker’s state and speaker’s trait were the main subjects of the challenges. As result, after

every challenge, a set of features that obtained the best performance for a particular task is published.

The distribution of the types of features, for the last four years, is presented in the table 3.1. More details

on the features can be found in corresponding publications.

Year / reference Energy related Spectral related Voicing related Total

2010 / [103] 1 31 6 1 582

2011 / [105] 4 50 5 4 368

2012 / [104] 4 54 6 6 125

2013 / [106] 4 55 6 6 373

Table 3.1: Distribution of LLDs according to feature’s type from INTERSPEECH ComParE challenges’ feature sets

between 2010 and 2013; Total corresponds to: LLDs × functionals.

In order to better understand the purpose of the selection of features for particular challenges, we

present below short summaries of tasks for which they were designed:

INTERSPEECH 2010 addresses three sub-challenges: (1) Age Sub-Challenge aims to classify a speaker

into one of four age groups (children, youth, adults or seniors), the baseline result (unweighted

average recall, i.e. weighted accuracy (WA)) is 48.91%; (2) Gender Sub-Challenge aims to classify

a speaker as male, female or child; the baseline result is 81.21%; (3) Affect Sub-Challenge is a

regression task of detecting speaker’s state of interest in ordinal representation; the baseline result,

expressed as Person correlation coefficient, is 0.421.

INTERSPEECH 2011 addresses two sub-challenges: (1) Intoxication Sub-Challenge aims to classify a

speaker’s alcoholisation level as alcoholised (blood alcohol concentration (BAC) higher than 0.5

per mill) or non-alcoholised (BAC equal or below 0.5 per mill); the baseline result (WA) is 65.9%;

(2) Sleepiness Sub-Challenge aims to classify a level of speaker’s sleepiness as sleepiness (for

values above 7.5/10 of the Karolinska Sleepiness Scale (KSS)) or non-sleepiness (for values equal

or below 7.5 of the KSS); the baseline result (WA) is 70.3%.

INTERSPEECH 2012 addresses three sub-challenges: (1) Personality Sub-Challenge aims to classify a

speaker into one of five OCEAN personality dimensions [125], each mapped onto two classes; the

baseline result (mean of WA of all five classification tasks) by SVM is 68.0% and 68.3% by random

forests; (2) Likability Sub-Challenge aims to classify the likability of speaker’s voice into one of

two classes, although the annotation provides likability in multiple levels; the baseline result (WA)


by SVM is 55.9% and 59.0% by random forests; (3) Pathology Sub-Challenge aims to determine

the intelligibility of a speaker in a pathological condition; the baseline result (WA) by SVM is

68.0% and 68.9% by random forests;

INTERSPEECH 2013 addresses four sub-challenges: Social Signal Sub-Challenge aims to detect and

localize non-linguistic events of a speaker, such as laughter or sigh; the baseline result (WA)

83.3%; Conflict Sub-Challenge aims to detect conflicts in group discussions; the baseline result

(WA) 80.8%; Emotion Sub-Challenge aims to classify user’s emotion into one of 12 emotional

categories; the baseline result (WA) 40.9%; Autism Sub-Challenge aims to determine the type of

pathology of a speaker in two evaluation tasks: typically vs atypically developing children and

a “diagnosis” task classifying into one of four disorder categories. the baseline results (WA) are

90.7% and 67.1%, respectively.

Since paralinguistic is related to non-verbal elements of speech, we found that these feature sets may

be relevant to laughter recognition. We use the openSMILE toolkit, which is the official feature extractor

for the INTERSPEECH ComParE Challenges since 2009 [33], to extract characteristics from our data. It

outputs the results in different types of format, including CSV, HTK (for ASR with the HTK toolbox [4])

and ARFF (for WEKA data mining toolkit).

3.2.2 Formants and vocalization triangle

In 1948, Potter and Peterson [82], suggested that using quantitative analysis of the main trajectories of the

acoustic resonance may be used to distinguish vowels. Nowadays, we know that the first two formants

F1 and F2 carry clues about the place of articulation (e.g. front, central, back) and the degree of aperture

(e.g. close, mid, open) determined by the position of tongue [93]. These, among with the roundedness,

are characteristic for vowels. In 1952, Peterson and Barney [72] published a representation of formants

values of English vowels on a plan. The area filled by the vowels is called the vocalization triangle (c.f.

figure 3.1). Formants are widely used in speech recognition and emotion recognition, because they carry

information about the speaker’s effort in articulation [116, 120]. This information and the fact that a laugh

is a series of vowel-like notes, may induce the relevance of formants in laughter recognition.

Thus, we use the values of the two formants F1 and F2, values of their energy and the value of the

vocalization area. The purpose of calculating the vocalization area is to describe the degree of articulation

of an utterance, i.e. a small area of F1-F2 corresponds to an utterance hypo-articulated, while a large area

corresponds to an utterance hyper-articulated. The script calculating the vocalization area was elaborated

in Matlab [3] and uses values of the first and the second formant. These values are extracted over the


Figure 3.1: The vocalization triangle (in red). Adapted from [93].

voiced segments using The Snack Sound Toolkit 1. For each voiced segment, the script looks for the two

extreme (i.e. minimum and maximum) values of the F2 and remembers their coordinates. In the next step,

the script goes through all the values of F1 and looks for the one that maximizes the area of the triangle

composed of the two extreme F2 values and the selected F1 value. The area of triangle is computed

using the Heron’s formula: T =√s(s− a)(s− b)(s− c) where s is the semi-perimeter of the triangle:

s = a+b+c2 . We use 29 statistical measures (c.f. Table 3.2) applied on those values. In consequence, the

set is composed of 145 voice quality features. In addition to raw values (1), we investigated combinations

of them as well as perceptual scales modeling to try take into account some psychoacoustic phenomenon.

Given the logarithmic scale of perception, the second set of features (2) is composed of logarithmic

values of F1, F2 and their energies. In the third set of features (3), the values of F1 and F2 are normalized

by their respective energies. The last set (4) combines the two previous: values of F1, F2 and their

energies are logarithmized, then, in addition, the values of F1 and F2 are normalized by their respective

energies. The values of the formantic area were always computed with the raw values of formants and

logarithmized afterwards for the set 2 (L) and the set 4 (LE). In all, we created 4 sets of features:

Set 1 (R): F1, F2, EF1, EF2

, Area

Set 2 (L): log(F1), log(F2),log(EF1), log(EF2

), log(Area)1http://www.speech.kth.se/snack/


Symbol Description

1 : Maximum Value of the maximum

2 : Rposmax Relative position of the maximum

3 : Minimum Value of the minimum

4 : Rposmin Relative position of the minimum

5 : Rposdiff Difference between the relative positions of maximum and minimum

6: Range Maximum - minimum

7 : Rangenorm (Maximum - minimum) / Difference between their relative positions

8 : Mean Mean value

9 : σnorm Standard deviation, normalized by N-1

10 : Skewness Third statistical moment

11 : Kurtosis Fourth statistical moment

12 : Q1 Value of the first quartile (25/100)

13 : Q2 Median value (50/100)

14 : Q3 Value of the third quartile (75/100)

15 : IQR Interquartile range

16 : σIQR Standard deviation of the interquartile range

17 : Slope Slope from the regression line

18 : Onset Value of the onset (first value)

19 : Target Value of the target (middle value)

20 : Offset Value of the offset (last value)

21 : Target−Onset Difference between the target and the onset values

22 : Offset−Onset Difference between the offset and the onset values

23 : Offset− Target Difference between the offset and the target values

24 :∑↗ values /

∑↗ segs Average number of increasing values per segment

25 :∑↘ values /

∑↘ segs Average number of decreasing values per segment

26 : µ↗ Mean of increasing values

27 : σ ↗ Standard deviation of increasing values

28 : µ↘ Mean of decreasing values

29 : σ ↘ Standard deviation of decreasing values

Table 3.2: Set of statistical measures used for modeling the laughter.

3.3. CLASSIFICATION 27

Figure 3.2: Spectrograms of speech (above) and laughter (bottom) segments with values of formants marked in color:

red - F1, green - F2, blue - F3 and yellow - F4.

Set 3 (E): F1 × EF1, F2×EF2

, EF1, EF2

, Area

Set 4 (LE): log(F1)×log(EF1), log(F2)×log(EF2

), log(EF1), log(EF2

), log(Area)

Note that, to avoid confusions, in the rest of this work, we will refer to the Set 1 as the Set R (Raw

values), to the Set 2 as the Set L (Logarithmic values), to the Set 3 as the Set E (values normalized by

Energy) and to the Set 4 as the Set LE (Logarithmic values normalized by Energy).

3.3 Classification

As classifier we used the Support Vector Machines (SVM) with the Sequential Minimal Optimization

(SMO), given its small generalization error for large vectors of non-linearly separable features [93]. Our

models (complexity and kernel) were optimized on a development partition using the leave-one-subject-

out (LOSO) cross-validation as evaluation framework, to ensure that results are user independent and,


hence, language independent, since we used the MAHNOB database. This methodology is based on n-

folds, where n is the number of subjects in the corpus, i.e. n=15 in our study. For each fold one subject is

considered as testing set and no information nor optimization were thus used or performed on it, whereas

the remaining are divided equally into two partitions: training set and development set. In addition,

we applied permutations while selecting subjects for training and development sets, so that the data is

randomly distributed. The first part of each experiment consists of training the system using the training

set and optimizing performance on the development set to select the best setting of the SVM classifier, i.e.

the value of complexity (10−5, 5×10−5, 10−4, 5×10−4, 10−3, 5×10−3, 10−2, 5×10−2 or 1), the type

of kernel (polynomial or radial basis function - RBF - kernel) and the value of exponent for polynomial

kernel (1, 2 or 3) or gamma for RBF kernel (10−6, 5×10−6, 10−5, 5×10−5, 10−4, 5×10−4 or 10−3).

Because the distribution of classes (c.f. 3.1.1) is unbalanced in the data, we use weighted accuracy (WA,

i.e. unweighted average recall) as primary evaluation measure; even if we also give unweighted accuracy

(UA, i.e. weighted average recall) for informative purpose. These values are calculated with a help of

a confusion matrix. The table 3.3 shows an example of a confusion matrix for two classes: Speech and

Laughter.

Predicted class

Speech Laughter

Actual classSpeech True Speech (TS) False Laughter (FL)

Laughter False Speech (FS) True Laughter (TL)

Table 3.3: An example of a confusion matrix, used to calculate unweighted accuracy and weighted accuracy.

Once the matrix is filled with the actual values, we calculate out evaluation measures using the fol-

lowing equations:

Weighted accuracy (WA): Au=TS

TS+FL+ TLTL+FS

2

Unweighted accuracy (UA): Aw= TS+TLTS+FS+TL+FL

The setting that obtains the best score (i.e. the highest WA) is then tested once on the testing set of

each of the 15 folds and the mean value is considered as the final score. For each set of features we

perform three tasks: first and second tasks (speech vs laughter and speech vs speech-laugh, respectively)

are 2-class discrimination problems; the third one, speech vs all the types of laughter (including laughter,

speech-laugh and posed-laughter) is a one-vs-all discrimination problem. We expect worse performance

for the experiments speech vs speech-laughter and speech vs all than for speech vs laughter, since both

3.4. RESULTS 29

classes in the mentioned two experiments contain some segments of speech. Moreover, the class speech-

laugh may contain only one or two words expressed while laughing, whereas all others are uttered in a

non-laughing style, which complicates the discrimination of speech-laughter vs speech. An example of

waveform of such a case is presented in the figure 3.3: between 0.0 and 1.0 second the subject speaks in

a non-laughing way and approximately between 1.05 and 1.4 second he laughs while speaking.

Figure 3.3: A waveform (above) and the corresponding spectrogram (bottom) of a segment annotated as speech-

laughter; between 0.0 and 1.0 second the segment contain a normal speech and between approximately 1.05 and 1.4

second the speech is interfered by a laughter.

3.4 Results

In this section, we present the most important results achieved for each of the three performed experi-

ments. In addition to results for the sets of features presented in the section 3.2, we present also the results

obtained for a feature set composed of a combination of the two feature sets that achieved the best scores

in each of the two groups, i.e. IS feature sets and those based on formants. We expect that this fusion will

complement the set of features and, in consequence, improve the performance.

3.4.1 Speech vs Laughter

The results of the experiment of distinguishing speech from laughter are in line with expectations. We can

observe that the performance of all IS sets is very high - far above the score chance, which is 69.8%. Two


Speech vs Laughter % UA % WA

Feature set name Full set CFS set Full set CFS set

INTERSPEECH feature sets

IS10 97.8 97.0 98.0 97.4

IS11 98.1 97.5 98.4 97.6

IS12 98.1 95.1 98.4 95.5

IS13 97.6 94.6 98.0 95.3

Feature sets based on formants

Set R 83.5 85.9 83.4 86.2

Set E 83.3 87.0 83.4 88.0

Set L 82.5 89.6 82.2 90.5

Set LE 91.8 92.1 91.5 92.5

Fusion of both

IS11 + Set LE 98.4 97.7 98.6 97.6

IS12 + Set LE 98.0 96.9 98.0 97.6

Table 3.4: Classification Speech vs Laughter: unweighted accuracy (UA) and weighted accuracy (WA) for all tested

sets of features and the fusion of the best of each group.

sets obtained the same best score: IS11 and IS12 - feature sets tailored for detecting speaker’s state (such

as the level of intoxication or sleepiness) and speaker’s trait (such as personality or likability of speaker’s

voice), respectively. Those two were selected for a fusion with the feature set based on formants. The

application of the CFS did not enhance the performance in any case, but the scores were already very

high. In contrast, the scores for all feature sets based on formants are slightly better when the CFS is

applied. The reason is the complexity of redundancy removal, i.e. it is easier to delete the redundancy

in a small set of features that is already very efficient, than in a large ensemble where the redundancy

is much more manifest. Moreover, the performance of the set based on formants is remarkable due

a fact that, using as few as 145 features, it models only a small component of the prosody with few

parameters, in comparison to the IS feature sets which contain over 4k features representing 3 large

categories of prosodic information. However, the fusion of both IS11 and IS12 with the LE features set

slightly improved performance, showing the complementarity of our new features set with the IS-based

ones. These sets also obtained scores much better than the score chance, although, worse than any IS

3.4. RESULTS 31

feature set. We can observe that, among CFS sets, the score obtained by the Set E is higher than the score

of raw values (Set R). The same relationship can be noted between the Sets L and LE, which can suggest

that the normalization of formant values by the energy is pertinent for laughter recognition. Moreover,

both those sets (i.e. L and LE) obtained better results than sets without the logarithm applied (i.e. R and

E), which demonstrate the importance of perception in logarithmic scale for formants. The repartition of

the features (c.f. the figure 3.4) of the best set, i.e. Set LE, after applying the CFS on the whole corpus,

shows that more than a half of the new set is composed of features related to the 1st formant, what can

suggest its strong discrimination properties. However, it could also mean that 2nd formant is not extracted

as precisely as the 1st one. Moreover, it shows that only 8 formant-based features suffice to obtain a good

performance in laughter recognition, comparable to sets composed of more than 1k features. At last,

the fusion of the IS11 and the Set LE improved minimally the score, what confirms our expectations.

Similarly as among IS feature sets, the CFS did not change significantly the results for the fusion. The

results, unweighted accuracy (UA) and weighted accuracy (WA), for all IS feature sets, feature sets based

on formants and the fusion of the best ones, are presented in the table 4.1.

Set LE

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

5 (63%) 2 (25%) 1 (13%)

log(F1) x log(E(F1)) log(F2) x log(E(F2)) log(E(F1))

Figure 3.4: Redistibution of features after applying the CFS.

3.4.2 Speech vs Speech-Laugh

The results of the experiment of distinguishing speech from speech-laugh are not as high as for the

previous experiment. This is due to a more difficult classification task, since both classes contain speech.

In addition, because of a huge imbalance in number of instances per class (344 vs 52), the score chance

is very high: 86,9%. None of the tested sets of features has achieved better score than this chance score,

thus we do not go into details of the results obtained after applying the CFS. We can clearly observe, like

we also did on the previous task, i.e. speech vs laughter, that the modifications of the feature set based on

formants can improve the score: the set based on values normalized by their energies (Set E) had about

15% better score than the raw set (Set R); the score for the set with logarithmic values (Set L) was higher

than for the Set R and Set E and the set with logarithmic values normalized by their energies (Set LE)

achieved the best score. This shows the importance of each successive step that we have proposed, i.e.


Speech vs Speech-Laugh % UA % WA

Feature set name Full set CFS set Full set CFS set

INTERSPEECH feature sets

IS10 79.9 79.1 83.6 79.3

IS11 69.3 79.8 86.4 83.3

IS12 77.7 74.2 81.1 82.1

IS13 73.2 77.2 73.2 83.1


Set R 52.2 63.7 73.7 61.1

Set E 60.1 65.3 61.9 59.6

Set L 70.8 66.5 69.2 58.8

Set LE 75.1 70.0 75.3 72.0

Fusion of both

IS11 + Set LE 72.2 78.9 73.0 72.0

Table 3.5: Classification Speech vs Speech-Laugh: unweighted accuracy (UA) and weighted accuracy (WA) for all

tested sets of features and the fusion of the best of each group.

using logarithmic scale to represent formants and their energy, as well as weighting values of formants

by their respective values of energy to take into account phenomena of perception by measuring the voice

quality. That set was used in the fusion among with the IS10, however, it did not improve the score. The

results, unweighted accuracy (UA) and weighted accuracy (WA), for all IS feature sets, feature sets based

on formants and the fusion of the best ones, are presented in the table 4.2.

3.4.3 Speech vs All types of laughter

The last experiment in this section, the discrimination of speech from all the types of laughter, brought

us good results. The number of instances in this experiment was the most balanced among the three

experiments, and so the score chance is 62.5%. All scores for IS feature sets are approximately 90%.

The best one, 91%, was obtained by the IS12 - a feature set tailored for speaker’s trait classification,

like evaluating a speaker’s intelligibility level while reading a text in different pathological conditions.

However, the results obtained by all the IS feature sets are very close to each other. Feature sets based

on formants followed our expectations (except the Set E, which was the worst in this experiment), i.e.

3.4. RESULTS 33

Speech vs All Laughter % UA % WA

Feature set name Full set With CFS Full set With CFS

INTERSPEECH Challenge feature sets

IS10 89.7 90.3 91.5 91.1

IS11 90.6 89.4 91.6 89.3

IS12 90.9 88.9 91.8 89.5

IS13 90.2 89.5 91.5 89.6


Set R 75.6 79.3 75.5 80.0

Set E 73.2 79.4 72.9 79.1

Set L 77.3 80.6 77.6 81.1

Set LE 83.9 85.5 84.2 85.8

Fusion

IS12 + Set LE 90.8 89.5 91.6 89.8

Table 3.6: Classification Speech vs All: unweighted accuracy (UA) and weighted accuracy (WA) for all tested sets

of features and the fusion of the best of each group.

using both logarithmic values and weighting. The best result among these feature sets was achieved, as

in previous two experiments, by the Set LE. We can see that the CFS improved all the scores of feature

sets based on formants. The repartition of the features (c.f. the figure 3.5) of the best set, i.e. Set LE,

after applying the CFS on the whole corpus, shows that the new set of selected features consists of only

1/8 (23 features) of the original size (145 features) and that about a half of them are related to the 1st

formant, what could be also observed in a previous experiment, i.e. speech vs laughter. This suggests that

the position of tongue is very important for the laughter recognition. In the same time, we can imagine

that the variability of the F1 must be weak for the laughter compared to the speech, since the position of

tongue is almost stable during a laughing phase. This may be less pertinent for the F2, which corresponds

to the shape of mouth and can vary upon the intensity of laughter. However, the bigger amount of selected

features in comparison to the mentioned experiment above, and the fact that other types of features were

kept, i.e. features based on the formantic area, can suggest that, for more complicated tasks, such us the

discrimination of speech and different types of laughter, which can contain some speech episodes as well,

more voice quality information is needed. The fusion of the best feature sets, i.e. IS12 and the Set LE, did


not improve the performance - obtained result was similar to the result for IS12. The results, unweighted

accuracy (UA) and weighted accuracy (WA), for all IS feature sets, feature sets based on formants and

the fusion of the best ones, are presented in the table 4.2.

Set LE

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

11 (48%) 5 (22%) 3 (13%) 2 (9%) 2 (9%)

log(F1) x log(E(F1)) log(F2) x log(E(F2)) log(E(F1)) log(E(F2)) log(Area)

Figure 3.5: Redistribution of features after applying the CFS for the Set LE.

3.5 Conclusions

In this chapter we presented two approaches for automatic recognition of laughter based on acoustic

features. All tests were done on the MAHNOB laughter database, containing 15 different subjects, using

leave-one-subject-out cross-validation to investigate both language and speaker independent automatic

laughter recognition. We chose the SVM as classifier and the weighted accuracy as evaluation measure.

The first approach uses feature sets used in the INTERSPEECH Challenges, that were tailored for

paralinguistic classification tasks. Among four tested feature sets, three of them obtained the best scores

in different experiments. Moreover, all the results obtained by the IS feature sets are very close to each

other. The best score in distinguishing the speech from all types of laughter, which, in our opinion, is the

most complex task, achieved the score of 91% (whereas the score chance is 62,5%). The CFS did not

improve, already high enough, scores.

The second approach employed feature sets based on formants. We tested four sets of features, which,

in fact, are the combinations of the base feature set, i.e. composed of raw values of formants, formants’

energy values and the corresponding formantic area. A logarithmic scale of perception and normalization

of values are considered by applying logarithms and weighting formant values by their respective values

of energy, respectively. We observed that each successive step improved the performance, i.e. weighting

augmented scores compared to the feature set of raw formant values; and applying logarithms ameliorated

scores in relation to the feature sets based on non-logarithmic values. The set composed of logarithmic

values of formants and normalized by the energy (i.e. Set LE) obtained the highest scores among all

formant-based feature sets for all three experiments, i.e. 93%, 75% and 86% for the speech vs laughter,

speech vs speech-laugh and speech vs all types of laughter, respectively.

3.5. CONCLUSIONS 35

The analysis of the features selected by the classifier (CFS) showed that as few as 8 features are

enough to obtain a good performance in laughter recognition. About a half of selected features were based

on the 1st formant, which can prove its strong discrimination properties. However, for more complicated

classification, i.e. speech vs all types of laughter, more formant-based features were needed, but the

difference in the total number of features is still huge in relation to the IS feature sets.

On the end, we combined the best feature sets from both groups on the feature-level. Since the

feature sets from the INTERSPEECH Challenges set a bar very high, it was difficult to improve the

scores. However, the fusion enhanced the results for the speech-vs-laughter experiment, what shows that

voice quality plays an important role in laughter recognition.

In the next chapter we are going to present another approach in laughter recognition, based on verbal

features.


Chapter 4

Automatic laughter recognition based

on verbal features

In the previous chapter, we presented an approach in laughter recognition that was based on acoustic char-

acteristics of a speech signal. We have achieved very high score for set of features that were tailored for

the INTERSPEECH Challenges, good score for feature sets based on formants and a small amelioration

in score using their fusion. However, since the laughter was found to be a series of notes of similar length,

repeated in intervals of similar length [84], in this chapter we will investigate the relevance of distribution

and sequencing of units based on acoustic events. Acoustic events are detected with a data-driven ap-

proach, which is more robust and efficient compared to machine learning algorithms used for automatic

speech recognition. The feature extraction is done automatically in a supra-segmental way, which means

that a signal is divided into units of varying length, depending on detected events. In the first section

of this chapter, we introduce the notion of acoustic events, that are related to changes in production and

perception of speech [110], and present our motivations of using them in laughter recognition tasks. We

particularly focus on the events related to speech production, e.g. acoustic landmarks, which correspond

to abrupt changes in the articulation system, or pseudo-vowels/pseudo-consonants segments and speech

perception, e.g. voiced/unvoiced/silent segments and center of perception (p-center) of the speech. We

use p-centers for a direct fusion with all the three types of acoustic events mentioned before. The best

results are shown in the subsequent sections. The classification process is done in the same way as in

the previous chapter, i.e. LOSO on MANHOB database with SVM. We use the Bag of Words (BoW) to

quantify the distribution of units and the n-grams for their sequencing.

37

38 CHAPTER 4. AUTOMATIC LAUGHTER RECOGNITION BASED ON VERBAL FEATURES

4.1 Acoustic events

An acoustic event or a speech unit can be a time-stamp, that marks a moment of a change in the artic-

ulation system, or a whole segment, that represents a piece of signal with specific acoustic properties.

In this section we introduce four types of such units: (1) voiced/unvoiced/silence segments, (2) pseudo-

vowels/pseudo-consonants/silence segments, (3) phonetic based landmarks and (4) p-centers. Since an-

notations of p-centers are composed only of two alternating labels (p-center/silence), we found irrelevant

to classify them separately and we decided to use them in a direct fusion (c.f. 4.3) with the rest of units.

4.1.1 Voiced/unvoiced/silence segments

Laughter can be described as a sequence of alternating voiced and unvoiced segments [11], which are

related to the speech perception. Also in [117], it was described as an alternating voiced-unvoiced pattern.

A voiced segment is a piece of speech signal where vocal cords vibrate. Voiced and unvoiced segments

can be identified using pitch and loudness values of the signal. We use the openSmile tool to extract those

two features and a Matlab script to detect and label the segments. The figure 4.1 presents a Matlab plot

with a speech signal and detected voiced/unvoiced segments.

Figure 4.1: A speech signal (blue) with its energy (black), voiced (darkgreen) and unvoiced (red) segments. The

black dashed line is the silence threshold. Adapted from [107].

4.1. ACOUSTIC EVENTS 39

4.1.2 Pseudo-vowel/pseudo-consonant/silence segments

Pseudo-vowel and pseudo-consonant segments are related to the speech production. Their detection is

based on pseudo-phonemes - speech units introduced by the computational sciences that are established

on the stationarity of the speech signal. In consequence, the identification of the pseudo-phonemes is

relevant for vowels, since their acoustic wave forms were observed to be stationary for more than 30 ms.

Consonants happen to be shorter than this and their waveforms are often non-linear. More details on

the method of detection pseudo-vowels and pseudo-consonants segments can be found in [93]. We are

interested in this type of units, since it was reported that a laughter can be described as a series of short

vowel-like notes[84]. Another research [117] characterize a “typical” laugh as a syllable of a consonant-

vowel structure.

4.1.3 Phonetic based landmarks

Phonetic based landmarks are events that correspond to a change in the articulation system [111]. These

abrupt changes in the amplitude can be observed in an acoustic signal simultaneously across wide fre-

quency ranges. Thus, in order to automatically detect acoustic landmarks, the speech signal is divided

into several frequency bands and a voicing contour is computed. For each of the bands, an energy wave-

form is constructed and the time derivative is computed, which then is used to detect its peaks. The peaks

represent the times of abrupt spectral changes in the bands. A landmark is detected if the peaks appear

simultaneously in several bands in a specified pattern [111] and the amplitude value of the signal reach

a threshold, derived empirically, for abruptness [14]. This method was implemented in The SpeechMark

tool [15], which was used in our experiments. Following types of landmarks can be detected [14]: +g

(glottis): the onset of a voicing; -g: the offset of a voicing; +s (syllabicity) the onset/release (+s) of a

voiced sonorant consonant; -s the offset/closure of a voiced sonorant consonant; +b (burst) the onset of

a burst of air following stop, affricate consonant release or onset of frication noise for fricative conso-

nants; -b the point where aspiration or frication noise ends; V (vowel) the point of the harmonic power

maximum. In addition, energy changes correlated with frication patterns are also detected (+/-f and +/-v).

The SpeechMark tool generates annotation files containing the the time of occurrence and the type of

landmark, c.f. listing 3. The tool can also visualize the analysis of a signal. The figure 4.2 presents a

waveform and the corresponding spectrum of a two-syllable (“ha-ha”) segment of laugher.

The figure 4.3 shows three scenarios of changes in frequency and voicing bands: (a) energy increase

is detected in frequency bands just before the onset of voicing, however, the change is not large enough in

some bands - no landmark is identified; (b) a large energy increase is detected in all frequency bands just


Figure 4.2: A waveform (top) of a short two-syllable (“ha-ha”) segment of laughter produced by a subject with

landmarks indicated and the corresponding spectrum (bottom).

Figure 4.3: Energy waveforms of 5 frequency and one voicing (bottom) bands. Adapted from [14].

before the onset of voicing - a +b (burst) landmark is identified; (c) a large energy increase is detected in

all frequency bands during the voicing - a +s (syllabic) landmark is identified.

4.1. ACOUSTIC EVENTS 41

Listing 3 A landmarks annotation file for a short segment of laughter produced by the subject 1.0.011000,+b

0.044000,+g

0.175000,+s

0.245000,-s

0.260000,-g

Since different sounds produce different patterns of abrupt changes, the analysis of those patterns

can help to detect types of sounds. In addition, a stemming technique can be applied on those values by

treating onsets and offsets of the same group as the same words.

4.1.4 Perceptual centers

A perceptual center (P-center) is related to the perception of temporal patterns in speech, music, and other

temporally sensitive activities [121]. That includes isochrony, e.g. if a subject can identify an isochronous

pattern, i.e. rhythm, it means that successive p-centers appear in constant intervals. In other words, a

sequence of words sounds isochronously to a listener, if a locus of one word is in equal temporal distance

of loci in surrounding words [35]. Several methods of measuring the location of p-centers exist. The most

commonly used, rhythm adjustment method (first described in detail in [58]), involves a subject listening

to a repetition of a pattern composed of two short alternating sounds, say sound A and sound B, which is

not perceptually isochronous. His task is to adjust the onset of the sound B until he feels (subjectively)

the rhythm, while the distance between consecutive sounds A stays untouched. The result of a final

adjustment is an estimation of the interval of p-centers of the sound B with respect to the sound A [121].

Another method, worth mentioning, is the method of “tapping fingers” proposed in [7]. It was originally

designed to discover the locus of a stress beat. In that method, a subject is asked to tap his fingers while

he perceives a particular syllable in a sentence. The same sentence is repeated 50 times. The results

showed that subjects tended to tap before the onset of the vowel in a stressed syllable, what was defined

as a moment, where occurrence of the syllable is perceived [58]. An automatic method of extracting a

rhythmic envelope of a speech signal was proposed in [114], c.f. figure 4.4. The method uses a set of

numeric filters supposed to represent the process of perceiving the rhythm of the speech. A rhythmic

envelope permits to locate the p-center by defining a threshold on its amplitude, which corresponds to a

level of perception of the rhythmic prominence, c.f. figure 4.5. We use four values of the threshold, i.e.

1/3, 1/4, 1/6 and 1/8.


Figure 4.4: A rhythmic envelope extracted on a speech signal, adapted from [93].

Figure 4.5: A rhythmic envelope extracted on a speech signal (red) and perception levels of p-centers with threshold

1/3, 1/4, 1/6 of the amplitude (gray scale); adapted from [93].

4.2 Results: acoustic events

In this section, we present the results achieved for each of three performed experiments using the BoW,

1-, 2- and 3-grams on speech units as feature extraction methods. We have tested four types of fea-

4.2. RESULTS: ACOUSTIC EVENTS 43

tures: (1) phonetic based landmarks, (2) stemmed phonetic based landmarks, (3) pseudo-vowel/pseudo-

consonant/silence segments and (4) voiced/unvoiced/silence segments. One unit represent one feature,

e.g. 1-grams based on voiced/unvoiced/silence segments will constitue of three features. Moreover, for

the BoW, beside the value of frequency of occurence of a unit in a segment, we used three other com-

binations: a logarithm of the frequency, an inverse document frequency and a logarithm of the inverse

document frequency.

However, due to a large number of results, we present only the best scores for each of these four

groups. We expect good performance for voiced/unvoiced/silence segments, since, in [11], it was reported

that laughter can be described as a sequence of alternating voiced and unvoiced segments. We expect also

good performance for phonetic based landmarks as they are suitable for examining non-lexical attributes

of speech and sensitive to aspects of metrical structure [15].


The score chance in this experiment is 69,8%. Only feature sets based on phonetic based landmarks

obtained scores significantly higher than the chance score. We found that these feature sets performed

better with n-grams, than with BoW, which means that sequencing of phonetic based landmarks is more

pertinent than their distribution in distinguishing speech and laughter. BoW was better than n-grams

only for voiced/unvoiced segments. Another observation that we made is that the CFS generally did not

improve scores, beside for voiced/unvoiced segments, where the CFS augmented the score by almost

150%. We found that the unvoiced segments were selected more often than others in that process. The

CFS was also better for the BoW of vowels/consonants, which was the second best score for that type of

unit, though the difference was not so high. The table 4.1 presents the best scores for each of tested unit

types.

4.2.2 Speech vs Speech-Laugh

The score chance in this experiment is very high (86,9%) and none of tested feature sets achieved a higher

score. However, some observations can be made. The acoustic landmarks after stemming obtained the

best score. Contrary, the acoustic landmarks without stemming were the worst in this task, which is the

opposite of what we observed in speech-vs-laughter classification. The CFS improved scores, especially

for BoW and 1-grams. However, while investigating the features selected by the classifier, we observed an

interesting phenomenon: for 1-grams based on both voiced/unvoiced and vowels/consonants segments,

the “</s>” gram was selected as the only feature. Since this tag denotes the end of an instance, it always




Acoustic events

Landmarks (1-grams) 78.7 74.9 77.5 70.6

Landmarks-stemmed (2-grams) 76.6 75.0 76.9 73.6

Vowels/Cons./Sil. (1-grams) 71.1 65.9 70.8 68.4

Voiced/Unv./Sil. (BoW) 44.2 67.2 41.8 62.7

Table 4.1: Classification Speech vs Laughter: unweighted accuracy (UA) and weighted accuracy (WA) for the best

feature sets of each type of unit.

appears only once. In consequence, its value as 1-gram is always equal to the multiplicative inverse of

the sum of all detected units in an instance. This suggests, that simply a number reciprocal to the sum of

segments occurrences can be pertinent in distinguishing speech and speech-laugh. The table 4.2 presents

the best scores for each of tested unit types.



Acoustic events

Landmarks (2-grams) 53.4 64.3 73.0 62.1


Vowels/Cons./Sil. (BoW) 56.8 68.1 50.5 58.8

Voiced/Unv./Sil. (1-grams) 65.9 66.9 63.4 62.4

Table 4.2: Classification Speech vs Speech-Laugh: unweighted accuracy (UA) and weighted accuracy (WA) for the

best feature sets of each type of unit.


In that experiment, the best score was obtained by the acoustic landmarks - almost 75%, which is more

than 10% higher than the score chance. The stemming did not improve the result. Similar to the first ex-

periment, i.e. speech-vs-laughter, the n-grams of acoustic landmarks were better than BoW, which suggest

that their sequencing is more pertinent than the distribution. The same was observed for voiced/unvoiced

segments. Only vowels/consonants units are better with BoW. The CFS was observed to gain better scores

4.3. DIRECT FUSION OF ACOUSTIC EVENTS AND P-CENTERS 45



Acoustic events

Landmarks (2-grams) 74.3 67.6 74.4 66.9


Vowels/Cons. (BoW) 61.5 69.6 59.6 65.6

Voiced/Unv./Sil. (2-grams) 64.6 63.2 64.5 62.2

Table 4.3: Classification Speech vs All types of laughter: unweighted accuracy (UA) and weighted accuracy (WA)

for the best feature sets of each type of unit.

for BoW. We found that consonants were selected more often than vowels. We found no selection rule

for voiced/unvoiced segments. The table 4.3 presents the best scores for each of tested unit types.

4.3 Direct fusion of acoustic events and p-centers

In section 4.1, we extracted three types of acoustic events: voiced/unvoiced segments, pseudo-vowels/pseudo-

consonants and acoustic landmarks. In this section we explain the concept of direct fusion of these events

with rhythmic events, i.e. p-centers (c.f. 4.1.4).

We use a direct fusion, which means that the detected units are not concatenated, but they are merged

into new units with the same timings. In result, we obtain three new types of labels: (1) voiced-rhythmic,

unvoiced-rhythmic and silence-rhythmic; (2) vowel-rhythmic, consonant-rhythmic and silence-rhythmic;

(3) acoustic rhythmic-landmarks set. To consider a voiced/unvoiced segment or a pseudo-vowel/pseudo-

consonant as rhythmic, at least 10% of the p-center segment must be inside. E.g. if a p-center starts

before a voiced segment and finishes during it, at least 1/10 of the duration of p-center must occur during

the voiced segment. The procedure is easier for the acoustic landmarks, since the are not segments, but

abrupt events. We simply check if an acoustic landmark is situated during the p-center to consider it as

rhythmic.

4.4 Results: acoustic events & p-center

In this section we present results obtained in direct fusion of acoustic events extracted in previous section

and rhythmic events, i.e. p-centers. Due to a large number of tested configurations (e.g. four threshold


values of perception levels), we present only the best scores, one for each type of features.


The fusion with p-centers, in general, did not improve scores. The only improvement was observed for

stemmed acoustic landmarks, though very slight and not for the same type of n-grams. That feature set,

after CFS, obtained the best score in this experiment, i.e. 1-grams of stemmed acoustic landmarks &

p-centers with the perception level threshold of 1/3. However, the obtained score was very similar to

the score obtained by the feature set based only on acoustic landmarks (notice that both are based on

1-grams). We observed that, in most cases, stemming of acoustic landmarks marginally improved the

results. We also found that the bigger the threshold of perception level is, the better the results are, on

average of all units. This is especially visible for acoustic landmarks, both with and without stemming,

and it could suggest that the prominence of the p-center is more pertinent for distinguishing speech and

laughter using acoustic landmarks. The averages of scores among features sets based on vowel/consonant

units for particular thresholds are more leveled, while the averages of scores among features sets based

on voiced/unvoiced segments are in opposite to those of acoustic landmarks. The table 4.4 presents the

best scores for each of tested unit types.



Acoustic events & p-centers

Landmarks & p-center t=1/3 (1-grams) 74.9 77.5 76.9 74.6

Landmarks-stemmed & p-center t=1/3 (3-grams) 72.4 78.2 74.2 76.5

Vowels/Cons./Sil. & p-center t=1/4 (1-grams) 72.6 69.6 71.8 71.6

Voiced/Unv./Sil. & p-center t=1/8 (2-grams) 77.5 74.8 78.7 72.8

Table 4.4: Classification Speech vs Laughter: unweighted accuracy (UA) and weighted accuracy (WA) for the best

feature sets of each type of unit.

4.4.2 Speech vs Speech-laugh

The fusion with p-centers improved the score only for acoustic landmarks. We did not notice any im-

provement for other types of units. Moreover, the best result, i.e. 1-grams based on stemmed acoustic

landmarks with p-center of threshold 1/4 after CFS, was still worse than the best result of the experiment

4.5. CONCLUSIONS 47

without the fusion. Here again, we observed that bigger thresholds of perception level achieve better

scores. The CFS improved in some cases the performance, especially for BoW, though it was not a rule

and the improvement was insignificant. The table 4.5 presents the best scores for each of tested unit types.





Landmarks-stemmed & p-center t=1/3 (1-grams) 68.0 69.2 68.4 66.4

Vowels/Cons./Sil. & p-center t=1/4 (1-grams) 61.5 65.3 60.1 65.2


Table 4.5: Classification Speech vs All: unweighted accuracy (UA) and weighted accuracy (WA) for the best feature

sets of each type of unit.


In this experiment, the fusion with p-centers generally did not augment the scores. The best feature set,

unlike in other experiments, was based on 3-grams, but its score was still worse than the best one in

the experiment without p-centers. The feature sets based on acoustic landmarks achieved better scores

than other types of units. We also observed that, for those sets of features, the bigger the threshold of

perception level is, the better the results are. This suggests that the prominence of the p-center is more

pertinent for distinguishing speech and different laughter types using acoustic landmarks. We did not find

this relation for neither voiced/unvoiced nor for vowels/consonants units. The table 4.6 presents the best

scores for each of tested unit types.

4.5 Conclusions

In this chapter we presented a verbal approach for automatic recognition of laughter, where we used the

Bag of Words and n-grams as feature extraction methods. All tests were done in the same manner as in the

previous chapter, i.e. using leave-one-subject-out cross-validation on the MANHOB laughter database,

containing 15 different subjects, and the SVM as classifier. We computed the unweighted accuracy as

evaluation measure. As feature we use acoustic events related to speech production and perception. In the

first experiment we used three types of units. Acoustic landmarks, events corresponding to abrupt changes






Landmarks-stemmed & p-center t=1/8 (BoW) 71.0 67.8 67.5 63.3

Vowels/Cons. & p-center t=1/4 (1-grams) 72.2 68.1 71.6 69.6


Table 4.6: Classification Speech vs All: unweighted accuracy (UA) and weighted accuracy (WA) for the best feature

sets of each type of unit.

in articulation system, obtained the best scores for all performed tasks. The best score in distinguishing

the speech from all types of laughter, which, in our opinion, is the most complex task, achieved the score

of 75% (whereas the score chance is 62,5%). In the second experiment we used a direct fusion technique

to merge the acoustic events from the first experiment with p-centers. This fusion resulted in small or no

improvement of scores. The best score in this approach for the classification between the speech and all

the laughter types was at 72.6%. We also found that bigger thresholds of perception level in p-centers

tend to produce better scores, though it is not a rule.

Chapter 5

Conclusions

In this Master Thesis we investigated two main types of feature sets in laughter recognition from spon-

taneous speech data, i.e. (1) non-verbal and (2) verbal. So far, several researches have been conducted

in this domain. The great majority of experiments used non-verbal approach, focusing especially on

spectrum-related features (i.e. PLP and MFCCs), which lead to a good performance (up to F-measure

of 77 %). To our best knowledge, only one experiment has been done using verbal approach until now.

This experiment employed the ALISP-based n-gram models and showed a good precision in detecting

laughter, though a small recall.

Three different classification problems were considered in this Master Thesis: distinguishing speech

from (1) laughter, (2) speech-laugh and (3) all the types of laughter. All experiments were conducted on

the MAHNOB Laughter Database. We used the SVM as classifier, trained using the SMO algorithm with

the LOSO cross-validation. We evaluated the scores using unweighted accuracy.

Our first approach uses non-verbal feature extraction. We tested two groups of feature sets: (1) four

feature sets tailored for the INTERSPEECH Challenges (IS) between 2010 and 2013, respectively, and

(2) another four feature sets based on formants. Since the IS feature sets were tailored for paralinguistic

tasks, we obtained very high scores for all of them, i.e. we significantly improved the state-of-the-art

performance from 77% to 98%. Although in each experiment the best score was obtained by a different

feature set, i.e. IS10 for Speech vs Speech-Laugh, IS11 for Speech vs Laughter and IS12 for Speech

vs All types of laughter, the differences of results between each of IS feature sets were very small. In

the second group of features, the scores were not as high as in the first one, though the best score was

only slightly worse. This can indicate that formants have good distinguishing properties between speech

and laughter, what confirms the difference of their measured average formant values. In this experiment

49

50 CHAPTER 5. CONCLUSIONS

we observed also that simply applying logarithms on the formant values and normalizing them by their

energies can improve the performance. At the end we performed a feature-level fusion of the best feature

sets from each group, which improved the score for the speech-vs-laughter classification.

In our second approach we used verbal feature extraction. We tested the Bag-of-Words (BoW) and

n-grams methods using automatically detected acoustic events as units, i.e. voiced/unvoiced segments,

pseudo-vowels/pseudo-consonants and acoustic landmarks. Thus three groups of feature sets were pre-

pared, each composed of BoW, 1-grams, 2-grams and 3-grams. The group based on acoustic landmarks

obtained the best scores in all experiments. The n-grams, especially 1- and 2-grams, were often better

than BoW. In the second part of this approach, we employed p-centers to directly merge with the acoustic

events from the first part, i.e. if an acoustic event occurs in the same time as a p-center it generates a new

label (e.g. voiced-rhythmic) with the original timings of the acoustic event. This direct fusion resulted in

small or no improvement in scores. However, we could observe that bigger thresholds of perception level

in p-centers tend to produce better scores, though it is not a rule. Here again, n-grams were often better

than BoW.

Although we achieved very good scores in the first approach, our focus is directed to the second one,

since it has been little explored and it offers a great potential in the laughter recognition task. Several

perspectives can be considered to improve the performance e.g. using other types of units, taking into

account the duration and intervals of p-centers for rhythmic modeling a combination of BoW and n-grams

(i.e. Bag-of-n-grams). Another classification algorithms could be tested, e.g. resolving the non-linearity

with the ANN or taking into account the context/history with Long Short Term Memory Network. More-

over, another level of annotations in databases, like the intensity of laughter, could be used for regression

tasks. At last, all the experiments could be trained on one database and optimized on another one, i.e.

cross-learning.

Bibliography

[1] Attribute-Relation File Format Overview. http://weka.wikispaces.com/ARFF+

%28developer+version%29/. [Online; accessed 13-January-2014].

[2] International Speech Communication Association. http://www.isca-speech.org/. [On-

line; accessed 10-February-2014].

[3] MATLAB R©. http://www.mathworks.ch/products/matlab/, 2013. [Online; ac-

cessed 02-July-2013].

[4] The Hidden Markov Model Toolkit. http://htk.eng.cam.ac.uk/, 2013. [Online; ac-

cessed 09-July-2013].

[5] The Kennedy Centre Mark Twain Prize for Humor. http://www.kennedy-center.org/

programs/specialevents/marktwain/, 2013. [Online; accessed 27-October-2013].

[6] The AMI Meeting Corpus. https://www.idiap.ch/dataset/ami/documentation/

non-scenario-meetings/, 2014. [Online; accessed 08-February-2014].

[7] G. D. Allen. The location of rhythmic stress beats in english: An experimental study i. Language

and Speech, 15(1):72–100, 1972.

[8] J. Araki. Dialogue act recognition using cue phrases.

[9] J.-A. Bachorowski, M. J. Smoski, and M. J. Owren. The acoustic features of human laughter. The

Journal of the Acoustical Society of America, 110:1581, 2001.

[10] M. P. Bennett and C. Lengacher. Humor and laughter may influence health: Iii. laughter and health

outcomes. Evidence-Based Complementary and Alternative Medicine, 5(1):37–40, 2008.

[11] C. A. Bickley and S. Hunnicutt. Acoustic analysis of laughter. In ICSLP, 1992.

51

http://weka.wikispaces.com/ARFF+%28developer+version%29/

http://weka.wikispaces.com/ARFF+%28developer+version%29/

http://www.isca-speech.org/

http://www.mathworks.ch/products/matlab/

http://htk.eng.cam.ac.uk/

http://www.kennedy-center.org/programs/specialevents/marktwain/

http://www.kennedy-center.org/programs/specialevents/marktwain/

https://www.idiap.ch/dataset/ami/documentation/non-scenario-meetings/

https://www.idiap.ch/dataset/ami/documentation/non-scenario-meetings/

52 BIBLIOGRAPHY

[12] P. Boersma and D. Weenink. Praat: doing phonetics by computer (version 4.3. 01)[computer

program]. retrieved from h ttp, 2005.

[13] K. Bousmalis, M. Mehu, and M. Pantic. Spotting agreement and disagreement: A survey of nonver-

bal audiovisual cues and tools. In Affective Computing and Intelligent Interaction and Workshops,

2009. ACII 2009. 3rd International Conference on, pages 1–9. IEEE, 2009.

[14] S. Boyce, H. Fell, L. Wilde, and J. MacAuslan. Automated tools for identifying syllabic landmark

clusters that reflect changes in articulation. In Models and Analysis of Vocal Emissions for Biomed-

ical Applications 7thinternational Workshop. 2011, volume 77, page 63. Firenze University Press,

2011.

[15] S. Boyce, H. J. Fell, and J. MacAuslan. SpeechMark: Landmark detection tool for speech analysis.

In INTERSPEECH, 2012.

[16] M. Brain. How laughter works. Retrieved September, 29, 2008.

[17] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models

of natural language. Computational linguistics, 18(4):467–479, 1992.

[18] H. Brugman and A. Russel. Annotating multi-media/multi-modal resources with elan. In LREC,

2004.

[19] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos,

W. Kraaij, M. Kronenthal, et al. The ami meeting corpus: A pre-announcement. In Machine

learning for multimodal interaction, pages 28–39. Springer, 2006.

[20] W. B. Cavnar, J. M. Trenkle, et al. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–

175, 1994.

[21] M. Chetouani, A. Mahdhaoui, and F. Ringeval. Time-scale feature extractions for emotional speech

characterization. Cognitive Computation, 1(2):194–201, 2009.

[22] G. Chollet, J. Cernocky, A. Constantinescu, S. Deligne, and F. Bimbot. Toward alisp: A pro-

posal for automatic language independent speech processing. In Computational Models of Speech

Pattern Processing, pages 375–388. Springer, 1999.

[23] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[24] C. Darwin. The expression of the emotions in man and animals. Oxford University Press, 1998.

BIBLIOGRAPHY 53

[25] K. Davis, R. Biddulph, and S. Balashek. Automatic recognition of spoken digits. The Journal of

the Acoustical Society of America, 24:637, 1952.

[26] J. E. Dayhoff and J. M. DeLeo. Artificial neural networks. Cancer, 91(S8):1615–1635, 2001.

[27] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via

the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38,

1977.

[28] G. R. Doddington et al. Speaker recognition based on idiolectal differences between speakers. In

INTERSPEECH, pages 2521–2524, 2001.

[29] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach. Emotional speech: Towards a new

generation of databases. Speech communication, 40(1):33–60, 2003.

[30] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Martin, L. Dev-

illers, S. Abrilian, A. Batliner, et al. The humaine database: addressing the collection and annota-

tion of naturalistic and induced emotional data. In Affective computing and intelligent interaction,

pages 488–500. Springer, 2007.

[31] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. Mul-

timedia, IEEE Transactions on, 2(3):141–151, 2000.

[32] M. S. Edmonson. Notes on laughter. Anthropological linguistics, pages 23–34, 1987.

[33] F. Eyben, M. Wöllmer, and B. Schuller. Opensmile: the munich versatile and fast open-source

audio feature extractor. In Proceedings of the international conference on Multimedia, pages 1459–

1462. ACM, 2010.

[34] J. G. Fiscus, J. Ajot, and J. S. Garofolo. The rich transcription 2007 meeting recognition evaluation.

In Multimodal Technologies for Perception of Humans, pages 373–389. Springer, 2008.

[35] C. A. Fowler. “perceptual centers” in speech production and perception. Perception & Psy-

chophysics, 25(5):375–388, 1979.

[36] S. Furui. 50 years of progress in speech and speaker recognition. In 10th International Conference

on Speech and Computer-SPECOM, Patras, Greece, pages 1–9, 2005.

[37] S. Furui. Selected topics from 40 years of research on speech and speaker recognition. In INTER-

SPEECH, pages 1–8, 2009.

54 BIBLIOGRAPHY

[38] A. Ganapathiraju, J. E. Hamaker, and J. Picone. Applications of support vector machines to speech

recognition. Signal Processing, IEEE Transactions on, 52(8):2348–2355, 2004.

[39] D. Gatica-Perez. Automatic nonverbal analysis of social interaction in small groups: A review.

Image and Vision Computing, 27(12):1775–1787, 2009.

[40] O. Ghitza. Auditory models and human performance in tasks related to speech coding and speech

recognition. Speech and Audio Processing, IEEE Transactions on, 2(1):115–132, 1994.

[41] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data

mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[42] M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University

of Waikato, 1999.

[43] W. Han, C.-F. Chan, C.-S. Choy, and K.-P. Pun. An efficient mfcc extraction method in speech

recognition. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International

Symposium on, pages 4–pp. IEEE, 2006.

[44] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical

Society of America, 87:1738, 1990.

[45] H. Hermansky and N. Morgan. RASTA processing of speech. Speech and Audio Processing, IEEE

Transactions on, 2(4):578–589, 1994.

[46] A. Hirson. Human laughter: a forensic phonetic perspective. Beiträge zur Phonetik und Linguistik,

64:77–86, 1995.

[47] A. Ito, X. Wang, M. Suzuki, and S. Makino. Smile and laughter recognition using speech process-

ing and face recognition from conversation video. In Cyberworlds, 2005. International Conference

on, pages 8–pp. IEEE, 2005.

[48] H. Kaya, A. M. Erçetin, A. A. Salah, and F. Gürgen. Random forests for laughter detection. In Pro-

ceedings of Workshop on Affective Social Speech Signals-in conjunction with the INTERSPEECH,

2013.

[49] L. S. Kennedy and D. P. Ellis. Laughter detection in meetings. In NIST ICASSP 2004 Meeting

Recognition Workshop, Montreal, pages 118–121. National Institute of Standards and Technology,

2004.

BIBLIOGRAPHY 55

[50] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab,

C. R. Antonescu, C. Peterson, et al. Classification and diagnostic prediction of cancers using gene

expression profiling and artificial neural networks. Nature medicine, 7(6):673–679, 2001.

[51] H. Khemiri, G. Chollet, and D. Petrovska-Delacrétaz. Automatic detection of known advertise-

ments in radio broadcast with data-driven alisp transcriptions. Multimedia Tools and Applications,

62(1):35–49, 2013.

[52] S. Kipper and D. Todt. The role of rhythm and pitch in the evaluation of human laughter. Journal

of Nonverbal Behavior, 27(4):255–272, 2003.

[53] M. T. Knox and N. Mirghafori. Automatic laughter detection using neural networks. In INTER-

SPEECH, pages 2973–2976, 2007.

[54] M. T. Knox, N. Morgan, and N. Mirghafori. Getting the last laugh: automatic laughter segmenta-

tion in meetings. In INTERSPEECH, pages 797–800, 2008.

[55] S. Kori. Perceptual dimensions of laughter and their acoustic correlates. Proc. Intern. Confer.

Phonetic Sciences Tallinn (4), pages 255–258, 1989.

[56] Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller,

E. Sackinger, P. Simard, et al. Learning algorithms for classification: A comparison on hand-

written digit recognition. Neural networks: the statistical mechanics perspective, 261:276, 1995.

[57] R. P. Lippmann. Review of neural networks for speech recognition. Neural computation, 1(1):1–

38, 1989.

[58] S. M. Marcus. Acoustic determinants of perceptual center (p-center) location. Perception & Psy-

chophysics, 30(3):247–256, 1981.

[59] G. McKeown, R. Cowie, W. Curran, W. Ruch, and E. Douglas-Cowie. Ilhaire laughter database.

In 4th International Workshop on Emotion Sentiment & Social Signals (ES3 2012)-Corpora for

Research on Emotion, Sentiment & Social Signals, held in conjunction with LREC 2012, ELRA,

Istanbul, Turkey, pages 32–35, 2012.

[60] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. The SEMAINE database:

annotated multimodal records of emotionally colored conversations between a person and a limited

agent. Affective Computing, IEEE Transactions on, 3(1):5–17, 2012.

56 BIBLIOGRAPHY

[61] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R. Huber, B. An-

drassy, J. G. Bauer, et al. Comparison of four approaches to age and gender recognition for tele-

phone applications. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE Inter-

national Conference on, volume 4, pages IV–1089. IEEE, 2007.

[62] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stol-

cke. The meeting project at icsi. In Proceedings of the first international conference on Human

language technology research, pages 1–7. Association for Computational Linguistics, 2001.

[63] C. Niemitz. Visuelle zeichen, sprache und gehirn in der evolution des menschen—eine entgegnung

auf mcfarland. Z. Sem, 12:323–336, 1990.

[64] E. E. Nwokah, P. Davies, A. Islam, H.-C. Hsu, and A. Fogel. Vocal affect in three-year-olds: A

quantitative acoustic analysis of child laughter. The Journal of the Acoustical Society of America,

94:3076, 1993.

[65] J. Oh, E. Cho, and M. Slaney. Characteristic contours of syllabic-level units in laughter. In Proc.

Interspeech, 2013.

[66] F. J. Orozco, F. A. García, L. Arcos, and J. Gonzàlez. Spatio-temporal reasoning for reliable

facial expression interpretation. In International Conference on Computer Vision Systems (ICVS).

Bielefeld University, 2007.

[67] E. Osuna, R. Freund, and F. Girosit. Training support vector machines: an application to face

detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer

Society Conference on, pages 130–136. IEEE, 1997.

[68] M. J. Owren and J.-A. Bachorowski. The evolution of emotional experience: A "selfish-gene"

account of smiling and laughter in early hominids and humans. 2001.

[69] S. Pammi, H. Khemiri, and G. Chollet. Laughter detection using ALISP-based n-gram models.

[70] J. Panksepp and J. Burgdorf. “laughing” rats and the evolutionary antecedents of human joy?

Physiology & Behavior, 79(3):533–547, 2003.

[71] M. Pantic, A. Pentland, A. Nijholt, and T. S. Huang. Human computing and machine understand-

ing of human behavior: a survey. In Artifical Intelligence for Human Computing, pages 47–71.

Springer, 2007.

BIBLIOGRAPHY 57

[72] G. E. Peterson and H. L. Barney. Control methods used in a study of the vowels. The Journal of

the Acoustical Society of America, 24:175, 1952.

[73] S. Petridis, A. Asghar, and M. Pantic. Classifying laughter and speech using audio-visual fea-

ture prediction. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International

Conference on, pages 5254–5257. IEEE, 2010.

[74] S. Petridis, M. Leveque, and M. Pantic. Audiovisual detection of laughter in human-machine

interaction. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association

Conference on, pages 129–134. IEEE, 2013.

[75] S. Petridis, B. Martinez, and M. Pantic. The mahnob laughter database. Image and Vision Com-

puting Journal, 31(2):186–202, February 2013.

[76] S. Petridis and M. Pantic. Audiovisual discrimination between laughter and speech. In Acoustics,

Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages

5117–5120. IEEE, 2008.

[77] S. Petridis and M. Pantic. Audiovisual laughter detection based on temporal features. In Proceed-

ings of the 10th international conference on Multimodal interfaces, pages 37–44. ACM, 2008.

[78] S. Petridis and M. Pantic. Is this joke really funny? judging the mirth by audiovisual laughter

analysis. In Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pages

1444–1447. IEEE, 2009.

[79] S. Petridis, M. Pantic, and J. F. Cohn. Prediction-based classification for audiovisual discrimination

between laughter and speech. In Automatic Face & Gesture Recognition and Workshops (FG

2011), 2011 IEEE International Conference on, pages 619–626. IEEE, 2011.

[80] M. Phillips and J. Glass. Phonetic transition modeling for continuous speech recognition. The

Journal of the Acoustical Society of America, 95:2877, 1994.

[81] J. Platt et al. Sequential minimal optimization: A fast algorithm for training support vector ma-

chines. 1998.

[82] R. K. Potter and G. E. Peterson. The representation of vowels and their movements. The Journal

of the Acoustical Society of America, 20:528, 1948.

[83] R. R. Provine. Laughter punctuates speech: Linguistic, social and gender contexts of laughter.

Ethology, 95(4):291–298, 1993.

58 BIBLIOGRAPHY

[84] R. R. Provine. Laughter. American scientist, 84(1):38–45, 1996.

[85] R. R. Provine. Laughter: A scientific investigation. Penguin Press, 2001.

[86] R. R. Provine and K. R. Fischer. Laughing, smiling, and talking: Relation to sleeping and social

context in humans. Ethology, 83(4):295–305, 1989.

[87] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition.

Proceedings of the IEEE, 77(2):257–286, 1989.

[88] L. R. Rabiner and R. W. Schafer. Digital processing of speech signals, volume 19. IET, 1979.

[89] B. Reuderink, M. Poel, K. Truong, R. Poppe, and M. Pantic. Decision-level fusion for audio-visual

laughter detection. In Machine Learning for Multimodal Interaction, pages 137–148. Springer,

2008.

[90] D. A. Reynolds. Gaussian mixture models., 2009.

[91] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted gaussian

mixture models. Digital signal processing, 10(1):19–41, 2000.

[92] R. M. Rifkin. Everything old is new again: a fresh look at historical approaches in machine

learning. PhD thesis, MaSSachuSettS InStitute of Technology, 2002.

[93] F. Ringeval. Ancrages et modèles dynamiques de la prosodie: application à la reconnaissance des

émotions actées et spontanées. PhD thesis, Université Pierre et Marie Curie-Paris VI, 2011.

[94] F. Ringeval, J. Demouy, G. Szaszák, M. Chetouani, L. Robel, J. Xavier, D. Cohen, and M. Plaza.

Automatic intonation recognition for the prosodic assessment of language-impaired children. Au-

dio, Speech, and Language Processing, IEEE Transactions on, 19(5):1328–1342, 2011.

[95] M. K. Rothbart. Laughter in young children. Psychological bulletin, 80(3):247, 1973.

[96] W. Ruch and P. Ekman. The expressive pattern of laughter. Emotion, qualia, and consciousness,

pages 426–443, 2001.

[97] J. A. Russell, J.-A. Bachorowski, and J.-M. Fernández-Dols. Facial and vocal expressions of

emotion. Annual review of psychology, 54(1):329–349, 2003.

[98] R. Sanchez-Reillo. Hand geometry pattern recognition through gaussian mixture modelling. In

Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 2, pages 937–

940. IEEE, 2000.

BIBLIOGRAPHY 59

[99] K. R. Scherer, H. G. Wallbott, and A. B. Summerfield. Experiencing emotion: A cross-cultural

study. Cambridge University Press, 1986.

[100] B. Schuller, R. Müller, F. Eyben, J. Gast, B. Hörnler, M. Wöllmer, G. Rigoll, A. Höthker, and

H. Konosu. Being bored? recognising natural interest by extensive audiovisual integration for

real-life application. Image and Vision Computing, 27(12):1760–1774, 2009.

[101] B. Schuller, G. Rigoll, and M. Lang. Speech emotion recognition combining acoustic features and

linguistic information in a hybrid support vector machine-belief network architecture. In Acoustics,

Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International Conference

on, volume 1, pages I–577. IEEE, 2004.

[102] B. Schuller, S. Steidl, and A. Batliner. The interspeech 2009 emotion challenge. In INTER-

SPEECH, pages 312–315, 2009.

[103] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. A. Müller, and S. S. Narayanan.

The interspeech 2010 paralinguistic challenge. In INTERSPEECH, pages 2794–2797, 2010.

[104] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger,

F. Eyben, T. Bocklet, et al. The interspeech 2012 speaker trait challenge. In INTERSPEECH, 2012.

[105] B. Schuller, S. Steidl, A. Batliner, F. Schiel, and J. Krajewski. The interspeech 2011 speaker state

challenge. In INTERSPEECH, pages 3201–3204, 2011.

[106] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani,

F. Weninger, F. Eyben, E. Marchi, et al. The interspeech 2013 computational paralinguistics chal-

lenge: social signals, conflict, emotion, autism. In Proceedings INTERSPEECH 2013, 14th Annual

Conference of the International Speech Communication Association, 2013.

[107] H. Sierro. Automatically detected acoustic landmarks for assessing natural emotion from speech.

Master’s thesis, University of Fribourg, Switzerland, 2013.

[108] P. K. Smith and K. Lewis. Rough-and-tumble play, fighting, and chasing in nursery school children.

Ethology and Sociobiology, 6(3):175–181, 1985.

[109] I. Sneddon, M. McRorie, G. McKeown, and J. Hanratty. The belfast induced natural emotion

database. Affective Computing, IEEE Transactions on, 3(1):32–41, 2012.

[110] K. N. Stevens. Toward a model for lexical access based on acoustic landmarks and distinctive

features. The Journal of the Acoustical Society of America, 111:1872, 2002.

60 BIBLIOGRAPHY

[111] K. N. Stevens, S. Y. Manuel, S. Shattuck-Hufnagel, and S. Liu. Implementation of a model for lexi-

cal access based on features. In Second International Conference on Spoken Language Processing,

1992.

[112] D. Sutorius. The transforming force of laughter, with the focus on the laughing meditation. Patient

education and counseling, 26(1):367–371, 1995.

[113] V. C. Tartter. Happy talk: Perceptual and acoustic effects of smiling on speech. Perception &

psychophysics, 27(1):24–27, 1980.

[114] S. Tilsen and K. Johnson. Low-frequency fourier analysis of speech rhythm. The Journal of the

Acoustical Society of America, 124(2):EL34–EL39, 2008.

[115] A. S. Tokar and P. A. Johnson. Rainfall-runoff modeling using artificial neural networks. Journal

of Hydrologic Engineering, 4(3):232–239, 1999.

[116] F. J. Tolkmitt and K. R. Scherer. Effect of experimentally induced stress on vocal parameters.

Journal of Experimental Psychology: Human Perception and Performance, 12(3):302, 1986.

[117] J. Trouvain. Segmenting phonetic units in laughter. In Proceedings of the 15th International

Congress of Phonetic Sciences. Barcelona: Universitat Autònoma de Barcelona, pages 2793–

2796, 2003.

[118] K. P. Truong and D. A. Van Leeuwen. Automatic detection of laughter. In INTERSPEECH, pages

485–488, 2005.

[119] J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski, C. Pelachaud, B. Picart,

J. Tilmanne, and J. Wagner. AVLaughtercycle: An audiovisual laughing machine. In Proceedings

of the 5th International Summer Workshop on Multimodal Interfaces, pages 79–87. Citeseer, 2009.

[120] D. Ververidis and C. Kotropoulos. Emotional speech recognition: Resources, features, and meth-

ods. Speech communication, 48(9):1162–1181, 2006.

[121] R. C. Villing, B. H. Repp, T. E. Ward, and J. M. Timoney. Measuring perceptual centers using the

phase correction response. Attention, Perception, & Psychophysics, 73(5):1614–1629, 2011.

[122] A. Vinciarelli, N. Suditu, and M. Pantic. Implicit human-centered tagging. In Multimedia and

Expo, 2009. ICME 2009. IEEE International Conference on, pages 1428–1431. IEEE, 2009.

BIBLIOGRAPHY 61

[123] J. Wagner, F. Lingenfelser, and E. André. Using phonetic patterns for detecting social cues in

natural conversations. 2013.

[124] H. M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international

conference on Machine learning, pages 977–984. ACM, 2006.

[125] J. Wiggings. The five factor model of personality: Theoretical perspectives. Guilford Press, 1996.

[126] B. Yegnanarayana. Artificial neural networks. PHI Learning Pvt. Ltd., 2004.

Documents

A UTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON … · 2014-03-06 · A UTOMATIC RECOGNITION OF LAUGHTER USING VERBAL AND NON-VERBAL ACOUSTIC FEATURES Tomasz Jacykiewicz 1