Enhanced version of vocal expression evaluation system
Erik Marchi, Florian Eyben, Björn Schuller

DELIVERABLE D3.5

Grant Agreement no.: 289021
Project acronym: ASC-Inclusion
Project title: Integrated Internet-Based Environment for Social Inclusion of Children with Autism Spectrum Conditions
Contractual date of delivery: 30 April 2014
Actual date of delivery: 30 April 2014
Deliverable number: D3.5
Deliverable title: Enhanced version of vocal expression evaluation system (incl. report)
Type: Report, Public
Number of pages: 23
WP contributing to the deliverable: WP 3 (Voice Analysis)
Responsible for task: Björn Schuller (TUM) [email protected]
Author(s): Erik Marchi (TUM) [email protected], Florian Eyben (TUM) [email protected], Björn Schuller (TUM) [email protected]



Table of Contents
Executive summary
1. Introduction
2. Databases
   2.1 English
   2.2 Swedish
   2.3 Hebrew
3. emo-against-Rest classification
   3.1 English data set
   3.2 Swedish data set
   3.3 Hebrew data set
4. Confidence measure and semi-supervised learning
   4.1 Databases
   4.2 Method
   4.3 Evaluation
5. Zero-resource emotion recognition
   5.1 Motivation
   5.2 Features and Database
   5.3 Method
   5.4 Evaluation
6. Enhanced version of vocal expression evaluation system
   6.1 System refinement and evaluations
   6.2 System integration
   6.3 Roadmap for the upcoming 6 months
7. Conclusions and self-analysis
8. References


Executive summary
In order to suit the needs of children with ASC, the system has been adapted based on user studies and experiments. The new data collected at the clinical sites (KI, UCAM) have been used to improve the accuracy of the emotion recognition system. The representation of the tracked features and the display of the appropriateness of the child's expression have also been adjusted according to the evaluation conducted at the clinical sites. The voice analyser is now capable of handling adult and child models as well as language-dependent models for emotion recognition.

1. Introduction
The goal of WP3 is to implement and evaluate an on-line vocal expression analyser, showing children with Autism Spectrum Condition how they can improve their context-dependent vocal emotional expressiveness. Based on the child's speech input recorded by a standard microphone, a set of emotionally relevant low- and mid-level acoustic features is extracted and visualised in real time so that the child gets immediate feedback about how well the vocal expression of the recorded utterance fits a pre-recorded utterance or pre-defined parameters conveying the target emotion.
The first step towards implementing such an analyser was to identify speech features for emotional expression perception, deriving potential descriptors of the affective state from the recorded prototypical utterances. For this reason, as a basis for data-driven analysis of speech features that are modulated by emotion, a number of sentences spoken in different emotions by multiple children with ASC and typically developing children have been recorded [1].
A core component of the final speech analyser is the on-line feature extractor calculating the speech features specified in [2] in real time. Thus, the openSMILE [3] audio feature extraction engine has been integrated into the system and extended according to the relevant descriptors that are to be tracked for vocal expression analysis [4]. Besides relevant low- and mid-level descriptors such as pitch, energy and duration of voiced segments, higher-level information concerning the affective state of the speaker is also extracted in real time. Hence, the implemented system is able to recognise basic emotional expressions (happy, sad, angry, surprised, afraid, proud, ashamed, calm), as well as dimensional representations of emotion and mental state including arousal and valence. A visual output is generated for recognised emotions and for tracked speech parameters.
In order to assess a child's performance in expressing emotions via speech, the extracted audio parameters are compared to the respective parameters extracted from pre-recorded prototypical utterances. Suitable measures for determining the 'distance' between the child's expression and the corresponding prototypical utterance (based on relevant features) are defined and calibrated. This 'distance' or 'difference' is visualised in an easily understandable way, and the child's motivation for minimising the deviation between their vocal expression and the expression conveyed in the prototypical utterance is assured by framing the task as a game. The child's success in this game is tracked over a longer period of time in order to create an individual 'history' leading to a personal user


profile that reveals specific difficulties, changes, and examples of vocal emotional expression. Corrective feedback regarding the appropriateness of the child's vocal expression is provided based on contextual parameters provided by the central platform, the stored library of prototypical expressions, and the child's individual history. Feedback is provided either visually or acoustically, e.g., by replaying both the prototypical and the recorded patterns.
In order to suit the needs of children with ASC, the system has to be adapted based on user studies and experiments. The accuracy of the emotion recognition system will be optimised using annotated speech data collected from children with ASC. The representation of the tracked features and the display of the appropriateness of the child's expression also need to be validated and optimised so that the child has the feeling of controlling the visual output generated by the system with their speech. Finally, the feedback given by the system has to be evaluated to ensure that the system responses do not significantly differ from the feedback a human trainer would give.
This report gives a brief description of the new data sets collected in the UK and Sweden (Section 2). A study on emotion classification performance is presented in Section 3. Section 4 describes a confidence measure based on semi-supervised cross-corpus learning, and Section 5 presents a first study on zero-resource emotion recognition. Section 6 describes the enhanced version of the vocal expression evaluation system, including updates on system integration and system refinement. Conclusions are drawn in Section 7.

2. Databases
Two new data sets of prototypical emotional utterances have been created. The data sets contain sentences spoken in English and Swedish by children with ASC and by typically developing children in 8 different target emotions plus neutral. The two data sets, along with the Hebrew data set [1], form the ASC-Inclusion emotional children's speech database. The general method, the procedures used to compose focus and control groups, and the equipment used for recording were based on the instructions designed together with BIU and provided in [1]. The task focuses mainly on the six 'basic' emotions except disgust, plus three other mental states: ashamed, calm(1) and proud. In the case of the English data set, a total of 19 emotions plus neutral [5] and calm were collected; the entire set of emotions is shown in Table 1.
Concerning the recording set-up, the child and the examiner sat at a table in front of a laptop. The microphone stood next to the laptop, about 20 cm in front of the child. A Zoom H1 Handy Recorder was used as recording device. Recordings were taken in WAV format at a sampling rate of 96 kHz and a quantisation of 16 bits, and stored directly on the internal SD memory card. The examiner read to the child a sequence of short stories from a PowerPoint presentation. Each slide contained short sentences; the stories were simple and short. The child was asked to imagine being the main character in the story. The idea of emotional tone was explained to the child through an example: the examiner said the sentence "I love gum" in an emotionless, flat tone and then in a happy tone. He then asked the child about his own likes and asked him to say what he likes very happily. Every few sentences, the stories contained a quotation of a saying by the story's main character. Each of these quotations related to a specific emotion, which was explicitly stated. For example: [Danny said happily: "It was the best birthday I ever had!"] or [Jain was very surprised. She looked at the box and said: "What is that thing?"].
(1) Calm was excluded from the evaluation since it can be highly confused with Neutral and Sad.


When the examiner read the stories, he read the quotation in a flat, unnatural tone. He then asked the child to say the sentence as the child in the story would have said it.

#   Emotion        Category          Availability
1   Happy          Basic             English, Swedish, Hebrew
2   Sad            Basic             English, Swedish, Hebrew
3   Afraid         Basic             English, Swedish, Hebrew
4   Angry          Basic             English, Swedish, Hebrew
5   Surprised      Basic             English, Swedish, Hebrew
6   Ashamed        Self-conscious    English, Swedish, Hebrew
7   Proud          Self-conscious    English, Swedish, Hebrew
8   Calm           -                 English, Swedish, Hebrew
9   Neutral        -                 English, Swedish, Hebrew
10  Disgusted      Basic             Only in English
11  Excited        Individual        Only in English
12  Interested     Individual        Only in English
13  Bored          Individual        Only in English
14  Worried        Individual        Only in English
15  Disappointed   Individual        Only in English
16  Frustrated     Individual        Only in English
17  Hurt           Individual        Only in English
18  Kind           Social            Only in English
19  Jealous        Social            Only in English
20  Unfriendly     Social            Only in English
21  Joking         Social            Only in English

Table 1: Emotions and categories.

2.1 English
A total of 18 children took part in the recordings held in the UK by UCAM. The language of the recordings is English and all children are native speakers. The focus group consists of 8 children (5 male and 3 female pupils) aged 7 to 11. All children were diagnosed with an autism spectrum condition by trained clinicians, based on established criteria (DSM-IV/ICD-10). The control group consists of 10 children (5 male and 5 female pupils) aged 5 to 10. A detailed description of the gender and the age of the participants is given in Table 2 (left). The database comprises a total of 1507 utterances containing emotional speech of children with ASC and typically developing children. 660 utterances were collected from children with ASC, whereas the remaining 847 utterances contain speech from typically developing children. In this data set, 19 emotions plus neutral and calm were collected; however, in order to compare performance across the different data sets, we only selected those emotions that are shared among all data sets. Table 3 shows the number of utterances for each group (typically developing children (TD) and children with ASC (ASC)) and for each emotion.

2.2 Swedish
A total of 20 children took part in the recordings held in Sweden by KI. The language of the recordings is Swedish and all children are native speakers. The focus group consists of 9 children (9 male pupils) aged 7 to 11. All children were diagnosed with an autism spectrum condition by trained clinicians, based on established criteria (DSM-IV/ICD-10). The control group consists of 11 children (6 male and 5 female pupils) aged 5 to 9. A detailed description of the gender and the age of the participants is given in Table 2 (centre).


The database comprises a total of 692 utterances containing emotional speech of children with ASC and typically developing children. Of these, 332 utterances were collected from children with ASC, whereas the remaining 360 utterances contain speech from typically developing children. Table 3 shows the number of utterances for each group (typically developing children (TD) and children with ASC (ASC)) and for each emotion.

English
  Focus group:   1 as06 (8, m), 2 as13 (9, m), 3 as14 (10, m), 4 as15 (7, m), 5 as16 (8, f), 6 as17 (11, f), 7 as18 (10, f), 8 as19 (7, m)
  Control group: 9 td01 (10, f), 10 td03 (9, f), 11 td04 (9, f), 12 td05 (5, f), 13 td07 (9, f), 14 td21 (6, m), 15 td22 (7, m), 16 td23 (7, m), 17 td24 (9, m), 18 td25 (8, m)

Swedish
  Focus group:   1 as01 (10, m), 2 as02 (9, m), 3 as03 (9, m), 4 as04 (7, m), 5 as05 (11, m), 6 as06 (9, m), 7 as07 (9, m), 8 as08 (8, m), 9 as09 (10, m)
  Control group: 10 td01 (6, m), 11 td02 (9, f), 12 td03 (8, f), 13 td04 (5, m), 14 td05 (5, f), 15 td06 (7, m), 16 td07 (5, f), 17 td08 (9, f), 18 td09 (5, m), 19 td10 (7, m), 20 td11 (9, m)

Hebrew
  Focus group:   1 as01 (8, m), 2 as02 (6, m), 3 as03 (6, m), 4 as04 (8, m), 5 as05 (10, m), 6 as07 (10, m), 7 as09 (9, f)
  Control group: 8 td01 (9, f), 9 td03 (5, m), 10 td04 (9, m), 11 td05 (6, f), 12 td06 (5, f), 13 td07 (9, m), 14 td08 (9, m), 15 td09 (5, f), 16 td10 (6, f), 17 td11 (9, m)

Table 2: Detailed description (speaker ID, age, gender) of the English, Swedish and Hebrew data sets.

2.3 Hebrew
This data set was already introduced in [1]; a short description is given in the following. A total of 17 children took part in the recordings. The language of the recordings is Hebrew and all children are native speakers. The focus group consists of 7 children (6 male pupils and 1 female pupil) aged 6 to 12. The control group consists of 10 children (5 male and 5 female pupils) aged 5 to 9. A detailed description of the gender and the age of the participants is given in Table 2 (right). The database comprises a total of 529 utterances containing emotional speech of children with ASC and typically developing children. 178 utterances were collected from children with ASC, whereas the remaining 351 utterances contain speech from typically developing children. Table 3 shows the number of utterances for each group (typically developing children (TD) and children with ASC (ASC)) and for each emotion.


Language  Group  Gender (m/f)  Af  An  Ha  Sa  Su  As  Pr  Ca  Ne  Total  Per language
English   TD     5/5           40  40  50  40  40  40  50  28  40  368    656
English   ASC    5/3           31  30  39  31  31  31  39  24  32  288
Swedish   TD     6/5           40  40  49  48  39  28  48  30  38  360    692
Swedish   ASC    9/-           36  36  45  44  36  27  45  27  36  332
Hebrew    TD     5/5           38  38  49  38  38  37  46  27  40  351    529
Hebrew    ASC    6/1           18  20  30  21  21  17  22  13  16  178

(Af: Afraid, An: Angry, Ha: Happy, Sa: Sad, Su: Surprised (basic); As: Ashamed, Pr: Proud (self-conscious); Ca: Calm; Ne: Neutral.)

Table 3: Detailed description (number of utterances per emotion and gender distribution) for the English, Swedish and Hebrew data sets.

3. emo-against-Rest classification
In order to improve and optimise the voice analyser, we performed an evaluation on the three data sets described in Sections 2.1, 2.2 and 2.3. Considering the results obtained with the categorical approach described in [5], here we performed the classification of one emotion against the remaining emotions (including Neutral). We chose only the emotions that are present in all three data sets, except calm (cf. Table 1), which proved to be acoustically similar to neutral and hence easily confused with it.
The experiments were conducted using the INTERSPEECH 2013 ComParE Challenge [6] feature set. The feature set consists of voice quality features (jitter and shimmer), energy, spectral, cepstral (MFCC) and voicing-related low-level descriptors (LLDs), as well as a few further LLDs including the logarithmic harmonic-to-noise ratio (HNR), spectral harmonicity, and psychoacoustic spectral sharpness. Altogether, the 2013 ComParE feature set contains 6 373 features. The features were extracted using TUM's open-source openSMILE feature extraction toolkit [7].
Since the class distribution in the different tasks (emo-against-Rest) is unbalanced (i.e., the emo class is underrepresented in the data), the unweighted average recall (UAR) of the classes is used as scoring metric. Adopting the Weka toolkit [8], Support Vector Machines (SVMs) with linear kernel were trained with the Sequential Minimal Optimization (SMO) algorithm. The SVMs were trained at different complexity constant values C in {0.001, 0.002, 0.005, 0.01, 0.02}. In order to ensure speaker-independent evaluations, we performed Leave-One-Speaker-Out (LOSO) cross-validation. In each fold we upsampled the training material in order to obtain a balanced set. Furthermore, we adopted the speaker z-normalisation (SN) method, since it is known to improve the performance of speech-related recognition tasks, as described in [9]. With this method, the feature values are normalised to a mean of zero and a standard deviation of one for each speaker. The emo-against-Rest classification is performed separately on the focus and control group subsets of each database. In the following sections, each data set is analysed separately.
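To make the evaluation protocol concrete, the following minimal Python sketch reproduces the main ingredients described above: speaker z-normalisation, upsampling of the training folds, a linear SVM, and Leave-One-Speaker-Out cross-validation scored by UAR. It assumes the ComParE features have already been exported from openSMILE into a feature matrix X with binary emo-against-Rest labels y and per-utterance speaker IDs; scikit-learn's LinearSVC is used as a stand-in for Weka's SMO, so exact figures may differ slightly from the results reported below.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def speaker_z_norm(X, speakers):
    # normalise every feature to zero mean and unit variance per speaker
    Xn = np.empty_like(X, dtype=float)
    for spk in np.unique(speakers):
        idx = speakers == spk
        mu, sigma = X[idx].mean(axis=0), X[idx].std(axis=0) + 1e-12
        Xn[idx] = (X[idx] - mu) / sigma
    return Xn

def upsample(X, y, rng):
    # repeat minority-class instances until the training set is balanced
    classes, counts = np.unique(y, return_counts=True)
    idx_all = []
    for c in classes:
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=counts.max() - len(idx), replace=True)
        idx_all.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_all)
    return X[idx_all], y[idx_all]

def loso_uar(X, y, speakers, C=0.01, seed=0):
    # Leave-One-Speaker-Out evaluation of a linear SVM, scored by UAR
    rng = np.random.default_rng(seed)
    Xn = speaker_z_norm(np.asarray(X, dtype=float), np.asarray(speakers))
    y = np.asarray(y)
    y_true, y_pred = [], []
    for train, test in LeaveOneGroupOut().split(Xn, y, groups=speakers):
        X_tr, y_tr = upsample(Xn[train], y[train], rng)
        clf = LinearSVC(C=C).fit(X_tr, y_tr)
        y_true.extend(y[test])
        y_pred.extend(clf.predict(Xn[test]))
    return recall_score(y_true, y_pred, average="macro")  # UAR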


3.1 English data set
Figure 1 shows the best results obtained over the different complexities for the following emotions: Afraid, Angry, Happy, Sad, Surprised, Ashamed, Proud, and Neutral. The experiments were conducted on the control group (Figure 1, left) and focus group (Figure 1, right) subsets separately. We observe acceptable performance for all emotions on the control group subset. Relatively lower performance is observed for afraid (67.4% UAR), ashamed (68.7% UAR) and proud (69.0% UAR). On the focus group subset, performance is lower on average; in particular, angry shows a very low performance of 54.7% UAR.

Typically developing children

Children with ASC

Figure 1: Classification results for binary emo-against-Rest on the English data set. The Rest class includes Neutral; for example, for happy-against-Rest: Rest = {Angry, Afraid, Sad, Surprised, Ashamed, Proud, Neutral}. Results are given for typically developing children (left) and children with ASC (right).

3.2 Swedish data set
Figure 2 shows the best results obtained over the different complexities for the following emotions: Afraid, Angry, Happy, Sad, Surprised, Ashamed, Proud, and Neutral. The experiments were conducted on the control group (Figure 2, left) and focus group (Figure 2, right) subsets separately. We observe acceptable performance for all emotions on the control group subset. Relatively lower performance is observed for afraid (65.1% UAR), surprised (63.8% UAR), and proud (63.3% UAR). On the focus group subset, performance is lower on average; in particular – as in the English data set – angry shows a very low performance of 56.0% UAR.

[Figure 1 data values, UAR in %. Typically developing children: Afraid 67.4, Angry 78.9, Happy 74.3, Sad 75.3, Surprised 79.3, Ashamed 68.7, Proud 69.0, Neutral 83.8. Children with ASC: Afraid 71.3, Angry 54.7, Happy 72.9, Sad 65.0, Surprised 80.2, Ashamed 72.1, Proud 69.3, Neutral 65.0.]


Typically developing children

Children with ASC

Figure 2: Classification result for binary emo-against-Rest on the Swedish data set. Rest class includes Neutral. Results are given for typically developing children (left) and children with ASC (right).

Typically developing children

Children with ASC

Figure 3: Classification result for binary emo-against-Rest on the Hebrew data set. Rest class includes Neutral. Results are given for typically developing children (left) and children with ASC (right).

[Figure 2 data values, UAR in %. Typically developing children: Afraid 65.1, Angry 66.7, Happy 70.6, Sad 70.7, Surprised 63.8, Ashamed 67.4, Proud 63.3, Neutral 81.5. Children with ASC: Afraid 69.5, Angry 56.0, Happy 65.4, Sad 72.7, Surprised 62.8, Ashamed 69.5, Proud 60.6, Neutral 77.9.]

[Figure 3 data values, UAR in %. Typically developing children: Afraid 63.8, Angry 69.2, Happy 71.3, Sad 77.8, Surprised 77.3, Ashamed 75.4, Proud 65.6, Neutral 82.0. Children with ASC: Afraid 52.3, Angry 67.8, Happy 60.6, Sad 77.5, Surprised 72.8, Ashamed 67.4, Proud 65.9, Neutral 68.0.]


3.3 Hebrew data set
Figure 3 shows the best results obtained over the different complexities for the selected emotions. The experiments were conducted on the control group (Figure 3, left) and focus group (Figure 3, right) subsets separately. We observe acceptable performance for all emotions on the control group subset. Also in this scenario, relatively lower performance is observed for afraid (63.8% UAR) and proud (65.6% UAR). On the focus group subset, performance is lower on average; in particular, afraid now shows a very low performance of 52.3% UAR.

4. Confidence measure and semi-supervised learning
In order to optimise the accuracy of the emotion recognition system using annotated speech data collected from children with ASC, we propose a confidence measure for speech emotion recognition (SER) systems based on semi-supervised cross-corpus learning. During the semi-supervised learning procedure, five frequently used databases with created confidence labels are used to train classifiers. When the SER system predicts a test utterance, these classifiers serve as reliability estimators for the utterance and output a series of confidence ratios that are combined into a confidence measure (CM). The proposed confidence measure [10] is effective in indicating how much we can trust the predicted emotion.

4.1 Databases
We chose six of the most frequently used databases, ranging from acted over induced to spontaneous affect portrayals. For better comparability among corpora, we map the diverse emotion groups onto the most popular axis of dimensional emotion models: valence (i.e., negative ("-") vs. positive ("+")). In the following, each database is briefly introduced, including the mapping to binary valence by "+" and "-" per emotion and its number of instances.
The Airplane Behaviour Corpus (ABC) was crafted for the special target application of public transport surveillance. It is based on moods induced by pre-recorded announcements of a vacation (return) flight, consisting of 13 and 10 scenes, and contains aggressive (-, 95), cheerful (+, 105), intoxicated (-, 33), nervous (-, 93), neutral (+, 79), and tired (-, 25) speech. The Audiovisual Interest Corpus (AVIC) is made up of spontaneous speech and natural emotion. In its scenario setup, a product presenter leads subjects through an English commercial presentation. AVIC is annotated in "level of interest" (loi) 1-3, with loi1 (-, 553), loi2 (+, 2 279), and loi3 (+, 170). The Danish Emotional Speech (DES) database contains nine professionally acted Danish sentences, two words, and chunks located between two silent segments of two passages of fluent text. Emotions include angry (-, 85), happy (+, 86), neutral (+, 85), sadness (-, 84), and surprise (+, 79). The eNTERFACE (eNTER) corpus consists of recordings of naive subjects from 14 nations speaking pre-defined spoken content in English. Each subject listened to six successive short stories, each eliciting a particular emotion out of angry (-, 215), disgust (-, 215), fear (-, 215), happy (+, 207), sadness (-, 210), and surprise (+, 215). The Belfast Sensitive Artificial Listener (SAL) data is part of the final HUMAINE database. The set consists of natural human-SAL conversations with an average length of 20 minutes per speaker. The data has been labelled continuously in real time with respect to valence and activation using a system based on FEELtrace. The annotations were normalised to zero mean globally and


scaled so that 98% of all values lie in the range from -1 to +1. The 25 recordings have been split into turns using an energy-based voice activity detection. Labels for each obtained turn are computed by averaging over the whole turn. Per quadrant, the samples are: q1 (+, 459), q2 (+, 320), q3 (-, 564), and q4 (-, 349). The FAU Aibo emotion corpus comprises spontaneous, emotionally coloured German speech of children. For the two-class problem of the INTERSPEECH 2009 Emotion Challenge, the corpus consists of the cover classes NEGative (subsuming angry, touchy, reprimanding, and emphatic) and IDLe (consisting of all non-negative states). The set includes NEG of training (-, 3 358), IDL of training (+, 6 601), NEG of test (-, 2 465), and IDL of test (+, 5 792). More details on the corpora are summarised in Table 4 and can be found in [11]. Note that in the following, the training partition is balanced, and the emotion recognition task in the current work is evaluated on the FAU Aibo emotion corpus.
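For reference, the binary valence mapping described above can be written down as a simple lookup structure; the following Python sketch only restates the assignments given in the text (the corpus keys are informal short names, not identifiers taken from any codebase).

# Binary valence ("+" positive, "-" negative) per emotion class and corpus,
# as listed in Section 4.1.
VALENCE_MAP = {
    "ABC":      {"aggressive": "-", "cheerful": "+", "intoxicated": "-",
                 "nervous": "-", "neutral": "+", "tired": "-"},
    "AVIC":     {"loi1": "-", "loi2": "+", "loi3": "+"},
    "DES":      {"angry": "-", "happy": "+", "neutral": "+",
                 "sadness": "-", "surprise": "+"},
    "eNTER":    {"angry": "-", "disgust": "-", "fear": "-",
                 "happy": "+", "sadness": "-", "surprise": "+"},
    "SAL":      {"q1": "+", "q2": "+", "q3": "-", "q4": "-"},
    "FAU_Aibo": {"NEG": "-", "IDL": "+"},
}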

Table 4: Overview of the selected emotion corpora (Lab: labellers, Rec: recording environment, f/m: (fe-)male subjects).

4.1.1 Acoustic features
We decided on a typical state-of-the-art emotion recognition engine operating on the supra-segmental level, and use our open-source openEAR toolkit [12] to extract a set of systematically generated acoustic features. We employ the pre-defined openEAR configuration "emo_large" with 39 functionals of 56 acoustic low-level descriptors (LLDs), including their first- and second-order delta regression coefficients. The 39 statistical functionals (such as mean, standard deviation, min, max, quartiles, and percentiles) are applied to 56 LLDs such as zero-crossing rate, signal energy, F0 envelope, probability of voicing, energy in bands (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), Mel-spectrum bands (1-26) and Mel-frequency cepstral coefficients. This results in a collection of 6 552 acoustic features.

4.2 Method
A confusion matrix includes information about the actual and predicted classifications made by a classification system. Let us assume that a is the number of negative instances correctly predicted, b is the number of negative instances incorrectly predicted as positive, c is the number of positive instances incorrectly predicted as negative, and d is the number of positive instances correctly predicted. In this case, the reliability of the predicted decisions can be reflected by investigating four types of ratios in the confusion matrix, namely CN, ICN, CP, and ICP, defined as follows:

CN = a / (a + c),   ICN = c / (a + c),   CP = d / (b + d),   ICP = b / (b + d)    (1)


where the CN (Correct Negative) rate is the proportion of negative predictions that are correctly classified; the ICN (Incorrect Negative) rate is the proportion of negative predictions that are incorrect, i.e., whose instances should have been classified as positive; the CP (Correct Positive) rate is the proportion of positive predictions that are correctly classified; and the ICP (Incorrect Positive) rate is the proportion of positive predictions that are incorrect, i.e., whose instances should have been classified as negative. According to these ratios, we can determine the reliability of a prediction: if a prediction is negative, for example, the probability of the prediction being correct is CN and of the instance actually being positive is ICN. Therefore, these ratios can be regarded as indicators which roughly represent a CM of recognition systems. Based on these ratios, which carry reliability information about predictions, we develop a semi-supervised learning algorithm to train classifiers which estimate the reliability of recognition decisions made by an SER system. The N selected corpora (i.e., ABC, AVIC, DES, eNTER, and SAL) are first predicted by an emotion recognition classifier trained on the FAU Aibo training set. Next, with their predicted labels and the corresponding actual labels, four new types of labels, which map directly to the four kinds of ratios in the confusion matrix, are created for these corpora. The semi-supervised learning algorithm is then initialised with the N selected corpora together with their newly created labels and proceeds as shown in Table 5.

Inputs:
  N sets of initially labelled instances: L1, ..., LN
  N sets of label predictions: P1, ..., PN
  A set of most-confident instances: T = ∅
  A set of unlabelled instances: U
Process:
  Build N classifiers: hi(Li), i = 1, 2, ..., N
  While iteration < MAXITERATION:
    Label predictions: Pi = hi(U), i = 1, 2, ..., N
    Pick high-confidence instances from U by majority voting over the Pi: T = MV(P1, ..., PN)
    If T is empty:
      Break
    End
    Li = Li ∪ T, i = 1, 2, ..., N
    Ui = Ui − T, i = 1, 2, ..., N
    Rebuild the classifiers: hi(Li), i = 1, 2, ..., N
  End
Outputs:
  N trained classifiers: h1, ..., hN

Table 5: Training procedure of the proposed semi-supervised learning algorithm.

Let L1, L2, ..., LN be the N selected corpora with the newly created labels, and let the unlabelled set U be the training set of the FAU Aibo corpus. In each iteration, the unlabelled instances in U are recognised by the classifiers hi(Li), i = 1, 2, ..., N. In order to obtain the confidence set T, the majority voting (MV) rule is applied: if three or more label predictions for an instance are the same, the instance with that label is added to T. Afterwards, the confidence set T is added to the training sets L1, L2, ..., LN, the classifiers hi are re-trained, and the procedure is repeated until the confidence set is empty or the number of iterations exceeds the maximum set number. After the learning process, we compute the prior confidence ratios for these training sets in accordance with Equation (1); they are shown in Table 6.
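The following Python sketch illustrates the confusion-matrix ratios of Equation (1) and the self-labelling loop of Table 5. It is a simplified illustration only: it assumes that feature matrices and the newly created labels are already available in memory, and scikit-learn's LinearSVC stands in for the actual classifiers h_i.

import numpy as np
from collections import Counter
from sklearn.svm import LinearSVC

def confusion_ratios(y_true, y_pred):
    # Equation (1); labels: 0 = negative valence, 1 = positive valence
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    a = np.sum((y_true == 0) & (y_pred == 0))  # negatives correctly predicted
    b = np.sum((y_true == 0) & (y_pred == 1))  # negatives predicted positive
    c = np.sum((y_true == 1) & (y_pred == 0))  # positives predicted negative
    d = np.sum((y_true == 1) & (y_pred == 1))  # positives correctly predicted
    return {"CN": a / (a + c), "ICN": c / (a + c),
            "CP": d / (b + d), "ICP": b / (b + d)}

def self_train(labelled_sets, unlabelled_X, max_iterations=10):
    # Table 5: grow each labelled set L_i with majority-voted instances from U
    sets = [(np.asarray(X), np.asarray(y)) for X, y in labelled_sets]
    U = np.asarray(unlabelled_X)
    clfs = [LinearSVC().fit(X, y) for X, y in sets]
    for _ in range(max_iterations):
        if len(U) == 0:
            break
        preds = np.array([clf.predict(U) for clf in clfs])  # shape N x |U|
        keep, labels = [], []
        for j in range(U.shape[0]):
            label, votes = Counter(preds[:, j]).most_common(1)[0]
            if votes > len(clfs) // 2:        # majority of classifiers agree
                keep.append(j)
                labels.append(label)
        if not keep:                          # confidence set T is empty
            break
        T_X, T_y = U[keep], np.array(labels)
        sets = [(np.vstack([X, T_X]), np.concatenate([y, T_y]))
                for X, y in sets]             # L_i = L_i united with T
        U = np.delete(U, keep, axis=0)        # U = U minus T
        clfs = [LinearSVC().fit(X, y) for X, y in sets]
    return clfs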


Table 6: Prior confidence ratios of the outputs of the five trained classifiers after the proposed semi-supervised learning algorithm.

All predicted confidence ratios are combined to estimate a final CM while the test set is recognised by the output classifiers and the emotion recognition classifier. Note that, to guarantee that the final CM lies between 0 and 1, a normalisation is applied to the confidence ratios. Given a predicted emotion hypothesis e and the corresponding confidence ratios r_i ∈ {CN_i, ICN_i, CP_i, ICP_i} (i = 1, 2, ..., N), normalised confidence ratios \tilde{r}_i are computed from the r_i using the normalisation term Z and the binary weighting factor w_i (Equation (2); see [10] for the exact definition of the normalisation), where

Z = \sum_{i=1}^{N} ( \max(CN_i, ICN_i) + \max(CP_i, ICP_i) )    (3)

and w_i is a binary weighting factor: w_i = +1 if r_i has a strong correlation with e (i.e., CN or ICN with negative valence) and w_i = -1 otherwise (i.e., CP or ICP with negative valence). Finally, the confidence measure S_CM is calculated as

S_CM = \sum_{i=1}^{N} w_i \cdot \tilde{r}_i    (4)

4.3 Evaluation
As databases, we employ the FAU Aibo emotion corpus to address the two-class task of the INTERSPEECH 2009 Emotion Challenge [13] and the remaining selected corpora (i.e., ABC, AVIC, DES, eNTER, and SAL) to run the proposed semi-supervised learning algorithm. As classifier, we consider an SVM trained with sequential minimal optimisation at a complexity of 0.01. Following this setup, the unweighted and weighted accuracies of the SER system are 68.7% and 69.2%, respectively. When evaluating CM annotation, two types of errors are usually encountered, namely false alarm errors and false rejection errors. The receiver operating characteristic (ROC) curve gives a full picture of the verification performance at all operating points. Figure 4 shows the comparison of the proposed semi-supervised learning method at different iteration stages with random guessing in the ROC space. From Figure 4 it is easy to see that the proposed method converges quickly, within three iterations. Moreover, the ROC obtained by the proposed semi-supervised learning method for the CM is significantly higher than that of random guessing, and its area under the curve (AUC) benefits from the iterated semi-supervised learning, reaching a maximum of 66.43% at iteration 3.


Figure 4: Comparison of the proposed semi-supervised learning method at different iteration stages and random guessing in the ROC space.

In many cases it is convenient to use a single-number metric for CM assessment. The normalised cross entropy (NCE) is widely used; it is defined as

NCE = (H_max + H_conf) / H_max    (5)

where

H_conf = \sum_{i=1}^{N_t} \log( c_i \delta(z_i = 1) + (1 - c_i) \delta(z_i = 0) )    (6)

and

H_max = -n \log(n / N_t) - (N_t - n) \log(1 - n / N_t).    (7)

Here we are given a set of N_t confidence scores and associated correctness labels {(c_i, z_i) | c_i ∈ [0, 1], z_i ∈ {0, 1}, i = 1, ..., N_t}, where N_t is the number of test instances and z_i = 1 if the emotion recognition is correct and z_i = 0 otherwise. In Equation (6), \delta(x) = 1 if x is true and \delta(x) = 0 otherwise, and n is the number of samples for which z_i = 1. The higher the NCE, the better the confidence quality; a perfect CM always outputs one when the recognition is correct and zero otherwise. For the proposed method, the NCE is 0.80, which is close to the optimal value of 1 and shows that the method is effective in indicating the reliability of the decisions made by the SER system.
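A minimal sketch of how Equations (5)-(7) can be evaluated for a list of confidence scores and correctness labels is given below (natural logarithms are used; the log base cancels in the ratio, and the sketch assumes that at least one decision is correct and one is incorrect).

import numpy as np

def nce(confidences, correct):
    # Normalised cross entropy of confidence scores c_i in [0, 1] against
    # binary correctness labels z_i (1 = emotion recognition was correct).
    c = np.clip(np.asarray(confidences, dtype=float), 1e-12, 1 - 1e-12)
    z = np.asarray(correct, dtype=bool)
    n_t = len(c)          # number of test instances
    n = z.sum()           # number of correct recognitions
    p = n / n_t           # prior probability of a correct recognition
    # Equation (7): entropy of the prior correctness rate
    h_max = -n * np.log(p) - (n_t - n) * np.log(1 - p)
    # Equation (6): log-likelihood of the correctness labels under the scores
    h_conf = np.sum(np.where(z, np.log(c), np.log(1 - c)))
    # Equation (5): a perfect CM yields 1, a chance-level CM yields 0
    return (h_max + h_conf) / h_max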

Summing up, we propose a confidence measure method for speech emotion recognition (SER) systems based on semi-supervised cross-corpus learning. The novel CM is based on a self-training procedure that utilises the selected corpora with created confidence labels to train classifiers. These classifiers provide a reliable way of assessing the correctness of the decisions of the SER system. The experimental results on the INTERSPEECH 2009 Emotion Challenge two-class problem demonstrated that this method


obtained better performance compared to chance level. Moreover, it achieved a very good NCE value of 0.80, demonstrating its effectiveness.

5. Zero-resource emotion recognition
A further study conducted in the project investigates how speech emotion recognition performs and generalises under different data conditions. In this section we describe a first step in the direction of what we call "zero-resource" speech emotion recognition.

5.1 Motivation
A speech emotion recognition system should be robust and able to generalise across different scenarios and data conditions. The main motivation of this approach is to provide a more robust measure of emotional states that can be transferred and applied across corpora. The approach can essentially be considered an unsupervised method for affect recognition. The only assumption is the availability of speech material containing neutral expressions from the speaker.

5.2 Features and Database
In order to implement our unsupervised approach, we surveyed the literature and selected knowledge-inspired features based on empirical and theoretical evidence. We implemented the knowledge-inspired features (cf. Table 7) proposed by Scherer in our openSMILE toolkit [7], producing a compact feature set that contains prosodic, source, and vocal tract features. The prosodic features comprise F0 and loudness. F0 is computed with the sub-harmonic summation algorithm and temporal Viterbi post-smoothing. Loudness (intensity) is computed as the sum of energies in the auditory spectrum of the Perceptual Linear Predictive coding (PLP) front-end. From the F0 contour, the number of voiced segments is computed to estimate a pseudo syllable rate. Further, the mean length of voiced segments is used as a correlate of the mean syllable length. Additionally, the mean, standard deviation, 5-th percentile, the range from the 5-th to the 95-th percentile (a robust estimate of the F0 range), and the mean rising and falling slopes of F0 peaks are applied as functionals to summarise F0 over a segment of interest. From the loudness contour, the mean, standard deviation, and the mean of the rising and falling slopes of peaks are computed.
The source features, except for the harmonics-to-noise ratio (HNR), are based on the exact locations of pitch periods (epochs), which are found with a correlation-based waveform matching algorithm guided by F0 estimates obtained via the spectral sub-harmonic summation method. Jitter is the relative deviation of the period length of successive periods, shimmer the variation in peak amplitude of successive periods, and source quality refers to the Pearson correlation coefficient of two successive pitch periods. All source descriptors are averaged over all pitch periods in a low-level feature frame. For source quality, the range within a frame is also computed. To jitter, shimmer, and mean source quality, the arithmetic mean is applied as a segment-level functional; to the source quality range, both mean and standard deviation are applied. The source quality descriptors are used as measures of laryngealization (Table 7). HNR is computed from the short-time autocorrelation function as the ratio of the peak corresponding to F0 to the total frame energy.


As vocal tract features, the bandwidths of formants 1-5 (obtained from the roots of the Linear Predictor Coefficient (LPC) polynomial), the spectral energy in the 1-5 kHz region relative to the total energy from 0-5 kHz, and the slope of the logarithmic power spectrum in the 1-5 kHz region are implemented. The mean over a segment is applied to these three descriptors. Note that no delta coefficients are computed from the low-level descriptors, as is the case in the large-scale, brute-forced feature sets. Overall, a total of 26 features is used in the zero-resource (ZR) approach. For comparison with a supervised baseline, we also applied the INTERSPEECH 2013 ComParE Challenge [6] feature set (already described in Section 3), which contains 6 373 features.
As evaluation database we used the Berlin Emotional Speech Database (EMO-DB) [Burkhardt et al., 2005]. It covers anger, boredom, disgust, fear, joy, sadness and neutral as emotions. Ten actors (5 female and 5 male) simulated the emotions, producing 10 German utterances like "Der Lappen liegt auf dem Eisschrank" ("The cloth is lying on the fridge") in all seven emotional states. 494 phrases are commonly used: angry (127), boredom (79), disgust (38), fear (55), happy (64), neutral (78), and sadness (53).

5.3 Method
Our rule-based approach computes, for each emotion, a sum of ratings obtained from the rules defined in Table 7 in relation to a reference point (neutral). As shown in Figure 5, the knowledge-inspired features are first extracted from the input and then standardised according to the mean and variance estimated on a set of neutral utterances, which represents the reference with respect to which the rules are evaluated. Once the features are standardised, they are processed by the list of rules for each emotion (cf. Table 7). Each rule subtracts a bias of 0.5 from the standardised feature value and multiplies the result by -1 only if the rule is a "<". The sum of all values per emotion block is used as a score, and the highest score determines the winning class.

Figure 5: Block diagram of the zero-resource approach.
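The following Python sketch illustrates the scoring scheme of the block diagram for a small, illustrative subset of the rules in Table 7 (the ">=" and "<=" entries are simplified to ">" and "<", and the neutral reference statistics are assumed to be given); it is not the deployed implementation.

RULES = {
    # illustrative excerpt of Table 7: expected direction relative to neutral
    "happy": {"F0_mean": ">", "F0_range": ">",
              "Intensity_db_mean": ">", "Spectral_slope": "<"},
    "sad":   {"F0_mean": "<", "F0_range": "<",
              "Intensity_db_mean": "<", "Spectral_slope": ">"},
    "angry": {"F0_mean": ">", "Intensity_db_mean": ">",
              "Harmonics_noise_ratio": ">", "Spectral_slope": "<"},
}

def zero_resource_classify(features, neutral_mean, neutral_std):
    # Standardise each feature against the neutral reference, apply the rule
    # blocks (bias of 0.5, sign flip for "<"), and pick the highest total.
    scores = {}
    for emotion, rules in RULES.items():
        total = 0.0
        for name, direction in rules.items():
            z = (features[name] - neutral_mean[name]) / neutral_std[name]
            value = z - 0.5
            if direction == "<":
                value *= -1.0
            total += value
        scores[emotion] = total
    return max(scores, key=scores.get), scores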


5.4 Evaluation
We compared the ZR approach with a supervised, state-of-the-art approach using a high-dimensional feature space. Adopting the Weka toolkit [8], Support Vector Machines (SVMs) with linear kernel were trained with the Sequential Minimal Optimization (SMO) algorithm at a complexity of 0.001. In order to ensure speaker-independent evaluations, we performed Leave-One-Speaker-Out (LOSO) cross-validation. Concerning the ZR experimental set-up, we applied two different standardisations: the former computes mean and variance on the full set of neutral utterances and standardises the emotional utterances accordingly (global neutral standardisation); the latter uses per-speaker mean and variance computed on neutral material (speaker neutral standardisation).

Feature                                     Happy  Angry  Sad  Afraid  Bored
Number_of_syllables_per_second              >=     <>     <    >       <
Syllable_duration                           <=     <>     >    <       >
F0_mean                                     >      >      <    >       <=
F0_5th_percentile                           >      =      <=   >       <=
F0_deviation                                >      >      <    >       <
F0_range                                    >      >      <    <>      <=
Gradient_of_F0_rising_and_falling (2)       >      >      <    <>      <=
Intensity_db_mean                           >=     >      <=   -       <=
Intensity_db_deviation                      >      >      <    -       <
Gradient_of_intensity_rising_and_falling    >=     >      <    -       <=
Relative_spectral_energy_in_higher_bands    >      >      <    <>      <=
Spectral_slope                              <      <      >    <>      >
Laryngealization                            =      =      >    >       =
Jitter                                      >=     >=     -    >       =
Shimmer                                     >=     >=     -    >       =
Harmonics_noise_ratio                       >      >      <    <       <=
Formant_bandwidth (5)                       -      <      >    -       >=

Table 7: Empirical findings concerning the effect of emotion on vocal parameters.

The results of this first, preliminary evaluation (cf. Table 8) show that the proposed unsupervised approach achieves quite reasonable performance of up to 64.5% UAR using only 26 features. The state-of-the-art approach (SVM) still performs better, but this is a promising sign that the unsupervised approach can be adopted and investigated further. Future work will investigate cross-corpus evaluation and extend the evaluation to different tasks (e.g., arousal and valence).


Approach             #features  UAR [%]
ZR (global-n std)    26         61.6
ZR (spk-n std)       26         64.5
SVM                  6373       83.4

Table 8: Preliminary results in terms of UAR on EMO-DB (456 instances, 6 classes). Global-n std: global neutral standardisation. Spk-n std: per speaker neutral standardization.

6. Enhanced version of vocal expression evaluation system
This section describes the outcome of the cooperative loop between clinicians and technicians and the integration of the recent updates and adjustments of the voice analyser.

6.1 System refinement and evaluations
According to the feedback from the evaluation carried out at the clinical sites [14], we refined the system and introduced new features. The children liked the activity with the voice analyser software; in particular, they had fun playing with their voices and influencing the plots. For this reason, we decided to re-introduce the plots in the analyser. Additionally, we moved from the arousal and valence approach [15] to a one-emotion-against-rest approach. The classification result is now delivered along with a confidence measure that indicates how certain the system is about the recognised emotion against the remaining emotions. The confidence is represented by the probability estimate derived from the distance between the point and the hyperplane [16]. This parameter is used to track the performance of the child and can be used to measure how recognisable the child's expression is, given a target emotion. We addressed the issue of complicated labels (such as arousal and valence) by showing the recognised emotion with different background colours ranging from green (correct expression) to red (incorrect expression). In order to avoid the child becoming frustrated or annoyed by the analyser, we designed conditions under which the system asks for another emotion to be tried. For example, if the system provides negative feedback more than three times, it asks for a different emotion to be performed. Given this feedback, the voice analyser [15] was simplified as shown in Figure 6.


Figure 6: Enhanced version of vocal expression evaluation system.

In addition to the new feedback system, we have also introduced automatic age classification as described in [5]. The functionality can be enabled in the bottom part of the GUI by selecting "Auto". It is also possible to manually select which models to load for emotion recognition ("Adult", "Child"). Furthermore, given the three available databases, it is possible to select the language of the user and activate the language-dependent models accordingly.

6.2 System integration
We extended the existing voice module to handle automatic age recognition (Adult-Child module) [5] and language-dependent models. A new classification layer has been added in order to detect the age of the user, and the emotion recognition components have been updated in order to perform one-emotion-against-rest classification (cf. Figure 7).


Figure 7: Diagram of communications between the voice subsystem and the platform.

We implemented a new component in openSMILE that handles the new control messages received from the platform. The message system has been modified to include new options and commands such as:

Sender:asc MsgName:start EmoSubset:happy MsgOptions:live|player_adults

The optional MsgOptions parameter contains various options: the type of player (player_adult or player_children) and/or the type of stream to work on (live or recorded); each option is separated by the "|" character. To control the logging mechanism of each module, the game sends the following message:

Sender:asc MsgName:startlogging

To stop the logging system:

Sender:asc MsgName:stoplogging
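For illustration only, a hypothetical receiver-side sketch of how such key:value control messages could be parsed is given below; it is not the actual openSMILE component.

def parse_control_message(raw):
    # Split whitespace-separated Key:Value tokens into a dictionary;
    # MsgOptions values are further split on the "|" separator.
    msg = {}
    for token in raw.split():
        key, _, value = token.partition(":")
        msg[key] = value.split("|") if key == "MsgOptions" else value
    return msg

parsed = parse_control_message(
    "Sender:asc MsgName:start EmoSubset:happy MsgOptions:live|player_adults")
# -> {'Sender': 'asc', 'MsgName': 'start', 'EmoSubset': 'happy',
#     'MsgOptions': ['live', 'player_adults']}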

6.3 Roadmap for the upcoming 6 months
Over the next six months we will refine our subsystem according to the outcome of the RCT evaluation, which is going to start a few weeks from the time of writing. A final refinement will be made before the end of the project in order to finalise the voice analyser.


7. Conclusions and self-analysis
The corrective visual feedback given by the system in the game mode shows the children how they can manipulate low-level descriptors such as pitch, energy and duration. In order to assess a child's performance in expressing emotions via speech, the extracted audio parameters are compared to the respective parameters extracted from pre-recorded prototypical utterances. Suitable measures for determining the 'distance' between the child's expression and the corresponding prototypical utterance (based on relevant features) have been defined and calibrated. This 'distance' or 'difference' is visualised in an easily understandable way, and the child's motivation for minimising the deviation between their vocal expression and the expression conveyed in the prototypical utterance has been assured by framing the task as a game. The child's success in this game is tracked over a longer period of time in order to create an individual 'history' leading to a personal user profile that reveals specific difficulties, changes, and examples of vocal emotional expression. Corrective feedback regarding the appropriateness of the child's vocal expression is provided based on contextual parameters, the stored library of prototypical expressions, and the child's individual history.
Additionally, the voice analyser has been augmented in order to monitor both adults and children in the cooperative playing scenario. Emotion recognition models for adult (typically developing) speech have been trained and shipped with the final system; a detailed description is given in [5]. The system has been adapted based on user studies and experiments. The accuracy of the emotion recognition system has been optimised using annotated speech data collected from children with ASC; in particular, two new data sets have been incorporated into the voice analyser. The system will soon be used in the RCT evaluation in order to be evaluated under controlled conditions.


8. References
[1] E. Marchi, B. Schuller, S. Tal, S. Fridenzon, and O. Golan, "Database of prototypical emotional utterances," Public Deliverable D3.1, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Feb. 2012.
[2] E. Marchi, F. Eyben, A. Batliner, and B. Schuller, "Specification of suitable parameters for emotion perception and definition of API for later integration," Public Deliverable D3.2, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Apr. 2012.
[3] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor," in Proceedings of the 18th ACM International Conference on Multimedia, MM 2010, (Florence, Italy), pp. 1459–1462, ACM, October 2010.
[4] E. Marchi, F. Eyben, and B. Schuller, "Standalone on-line speech feature extractor and emotion recognizer with visual output," Public Deliverable D3.3, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Oct. 2012.
[5] E. Marchi, S. Piana, F. Eyben, A. Camurri, and B. Schuller, "Adjust technology for input analysis," Public Deliverable D10.2, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Apr. 2014.
[6] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, "The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism," in Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, (Lyon, France), pp. 148–152, ISCA, August 2013.
[7] F. Eyben, F. Weninger, F. Groß, and B. Schuller, "Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor," in Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, (Barcelona, Spain), pp. 835–838, ACM, October 2013.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations Newsletter, vol. 11, pp. 10–18, Nov. 2009.
[9] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, "Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies," IEEE Transactions on Affective Computing, vol. 1, pp. 119–131, July–December 2010.
[10] J. Deng and B. Schuller, "Confidence Measures in Speech Emotion Recognition Based on Semi-supervised Learning," in Proceedings INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, (Portland, OR), ISCA, September 2012.
[11] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, "Acoustic Emotion Recognition: A Benchmark Comparison of Performances," in Proceedings 11th Biannual IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2009, (Merano, Italy), pp. 552–557, IEEE, December 2009.


[12] F. Eyben, M. Wöllmer, and B. Schuller, "openEAR – Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit," in Proceedings 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, vol. I, (Amsterdam, The Netherlands), pp. 576–581, IEEE, September 2009.
[13] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," in Proceedings INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, (Brighton, UK), pp. 312–315, ISCA, September 2009.
[14] S. Tal, S. Fridenson, O. Golan, D. Lundqvist, S. Berggren, S. Boelte, A. Lassalle, D. Pigat, and S. Baron-Cohen, "Standalone platform and subsystems effectiveness report," Public Deliverable D7.3, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Apr. 2014.
[15] E. Marchi, F. Eyben, and B. Schuller, "Refined multi-modal vocal expression evaluation system providing corrective feedback," Public Deliverable D3.4, The ASC-Inclusion Project (EU-FP7/2007-2013 grant No. 289021), Oct. 2013.
[16] V. Vapnik, S. E. Golowich, and A. Smola, "Support vector method for function approximation, regression estimation, and signal processing," in Advances in Neural Information Processing Systems 9, pp. 281–287, MIT Press, 1996.