ARIA-VALUSPA
European Union's Horizon 2020 research and innovation programme, grant agreement No. 645378
August 2017

Deliverable D2.3: Implementation of long-term adaptation for audio-visual communication analysis

Artificial Retrieval of Information Assistants – Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects

Collaborative Project

Start date of project: 01/01/2015
Duration: 36 months
Due date of deliverable: Month 31
Actual submission date: 13/09/2017


Project co-funded by the European Commission

Dissemination Level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

STATUS: [DRAFT]

Deliverable Nature
R  Report
P  Prototype
D  Demonstrator  X
O  Other

Participant Number  Participant organization name  Participant org. short name  Country
Coordinator
1  University of Nottingham, Mixed Reality/Computer Vision Lab, School of Computer Science  UN  U.K.
Other Beneficiaries
2  Imperial College of Science, Technology and Medicine  IC  U.K.
3  Centre National de la Recherche Scientifique, Télécom ParisTech  CNRS-PT  France
4  Universität Augsburg  UA  Germany
5  Universiteit Twente  UT  The Netherlands
6  Cereproc LTD  CEREPROC  U.K.
7  La Cantoche Production SA  CANTOCHE  France


Table of Contents
1. Purpose of Document
2. Use Case
3. Methodology
   3.1 Implementation (IC)
   3.2 Databases
   3.3 Feature Set
   3.4 Evaluation on AVIC
4. Real-Time Implementation (UA)
5. Visual User Identification (UN)
6. ASR Updates (IC)
7. Audio-Visual Affect Recognition Updates (IC)
8. Plans for Next Period
9. Outputs
10. References


1. PURPOSE OF DOCUMENT
The purpose of this report is to describe the methods of long-term user adaptation, their implementation and integration into the Aria system, as well as the evaluation results for Deliverable D2.3. This work package is important with regard to the "Personalised Aspects" denoted in the project acronym. Like a new friend, Aria should get to know its user better and better over time in order to provide personal assistance based on accumulated background knowledge.

2. USE CASE
The goal of Task 2.3 is to endow the ARIA-VALUSPA Platform (AVP) with capabilities of learning and adapting to user characteristics with enhanced context awareness. To this end, user traits extracted from utterances in Task 2.2 (automatic speaker analysis) will allow preselection of the interest and emotion models that best fit the user profile (e.g. age group, gender). This metadata also serves as additional high-level features for dialogue management. Further, all models can be adapted automatically to the speaker's voice in the course of the dialogue. Novel methods for acoustic model adaptation beyond the state of the art, developed in the project, will be used to create specific models for a single user or a user group during long-term interaction with Aria. Finally, reliable audio-visual subject identification will help to automatically select a user-specific model, if a person can be identified and the corresponding user profile exists in the system. Otherwise, models suitable for the user's demographics are used.

3. METHODOLOGY
From an implementation point of view, long-term user adaptation can be realised using semi-supervised techniques (e.g. self-training). In our use case, the user interacts with the Aria system running with pre-trained models for speaker analysis, e.g. interest and emotion recognition. The adaptation problem can be tackled with a twofold approach. First, on the input side of the Aria architecture, SSI (Wagner et al., 2013) detects the user's profile and performs a preselection of the trained models suitable for the specific target user(-group). In a long-term view, the system will make use of the data collected from this user by automatically labelling it and using the new samples to refine the models, thereby adapting them to the specific person.
In our experiments, we partition the selected database into a speaker-disjoint training set and user-specific data. The training set contains labelled instances equal to 1/3 of the total amount of data. For each target user(-group), the remaining 2/3 of the data is further split into a 2/3 'unlabelled' adaptation set and a 1/3 labelled test set, which is used for performance evaluation (Figure 1). In particular, we considered the categories age group (young, adult, senior), gender (male, female), and the speaker ID itself. The great benefit of this approach is that the adaptation scheme can be extended to many other categories, e.g. different native languages.
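The partitioning scheme above can be sketched as follows; this is a minimal illustration with hypothetical names, and it ignores the speaker-disjointness constraint for brevity:

```python
import random

def partition(instances, seed=0):
    """Split the data as in the experiments: 1/3 labelled training set;
    the remaining 2/3 are split (per target user-group) into a 2/3
    'unlabelled' adaptation set and a 1/3 labelled test set."""
    rng = random.Random(seed)
    items = list(instances)
    rng.shuffle(items)
    n_train = len(items) // 3
    train, rest = items[:n_train], items[n_train:]
    n_adapt = (2 * len(rest)) // 3
    return train, rest[:n_adapt], rest[n_adapt:]

train, adapt, test = partition(range(90))
print(len(train), len(adapt), len(test))  # 30 40 20
```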


In the long-term adaptation process, we aim to exploit unlabelled data by using machine predictions without any human intervention, as it is unlikely that annotation of the user data can be done manually in the considered scenario. In this way, we can iteratively retrain the model, adding new machine-labelled instances to the original training set, thus enabling data aggregation and label enrichment.

3.1 IMPLEMENTATION (IC)
In the Aria project, we have developed a generic human-machine annotation framework based on Dynamic Cooperative Learning (DCL) techniques (Figure 2). It is able to fully automatically distribute the annotation workload between humans and machines by combining uncertainty sampling (active learning, AL) and self-training (semi-supervised learning, SSL). This algorithm has been integrated into NOVA (Automated Analysis of Nonverbal Signals in Social Interactions) (Baur, 2013) by our partners from UA for the annotation of the NOXI database at a minimum of human labelling effort. As the adaptation process should be unattended, we select the SSL path in the learning framework (marked green in Figure 2).
Self-training is a well-known SSL method that trains a classifier on a small labelled data set and re-trains the model iteratively with the most confident machine predictions from an unlabelled data pool (Rosenberg, 2005). To make the framework applicable to both classification and regression tasks, we use a deep-learning-based confidence measure to assess model uncertainty, which has the great advantage over state-of-the-art methods that it is applicable to both types of tasks (Gal, 2016). The implementation of the DAL method is written in Python. For DNN training and the computation of uncertainty measures, the choice fell on Keras, an open-source deep learning library built on top of Theano and TensorFlow.

Figure 1: Data partitioning for simulating long-term user adaptation using self-training techniques.


The deep rectifier network consists of four hidden layers with 1,000, 750, 500, and 150 units, as well as an output layer with a single neuron. The loss function is the mean squared error. Optimisation is done via stochastic gradient descent (SGD). The initial network is trained for 200 epochs on the labelled data. Re-training after each labelling step is done for 30 epochs (Zhang et al., 2017).

3.2 DATABASES
3.2.1 AVIC
For interest recognition, we selected the Audiovisual Interest Corpus recorded at the Technische Universität München ("TUM AVIC"), as described in (Schuller et al., 2009). In the scenario setup, an experimenter and a subject were sitting on opposite sides of a desk. The experimenter played the role of a product presenter and led the subject through a commercial presentation. The subject's role was to listen to the explanations and topic presentations of the experimenter, ask questions of her/his interest, and actively interact with the experimenter. The subject was explicitly asked not to worry about being polite to the experimenter, e.g. by always showing a certain level of 'polite' attention. Visual and voice data were recorded by a camera and two microphones, one headset and one far-field microphone, at 44.1 kHz, 16 bit. In total, 21 subjects took part in the recordings, three of them Asian, the remaining European.

Figure 2: Flowchart of the human-machine annotation framework with support for various learning paradigms including semi-supervised learning (SSL), active learning (AL), dynamic active learning (DAL), cooperative learning (CL), and dynamic cooperative learning (DCL); instances predicted with high model confidence (hc) are directly tagged with machine labels, whereas instances with relatively low model confidence (mc) are subject to human inspection. Taken from the manuscript (Zhang et al., 2017) submitted to IEEE Transactions on Cybernetics.


The language of the experiments was English, and all subjects were non-native, yet very experienced English speakers. More details on the subjects are summarised in Table 1. The level of interest (LOI) was annotated for every sub-speaker turn. The original LOI scale, ranging from LOI-2 to LOI+2, was mapped to [-1, 1] by division by 2. In the Challenge, the 21 speakers (and 3,880 sub-speaker turns) were partitioned into speaker-independent, gender-, age-, and ethnicity-balanced sets for Train (1,512 sub-speaker turns in 51:44 minutes of speech of 4 female, 4 male speakers), Develop (1,161 sub-speaker turns in 43:07 minutes of speech of 3 female, 3 male speakers), and Test (1,207 sub-speaker turns in 42:44 minutes of speech of 3 female, 4 male speakers).

3.2.2 GEMEP
For emotion recognition, we selected the "Geneva Multimodal Emotion Portrayals" (GEMEP) corpus (Bänziger, 2012). It contains 1.2k instances of emotional speech from ten professional actors (5 f, 5 m). Using the same heuristic approach as in the INTERSPEECH 2013 ComParE Emotion sub-challenge, the 18 emotion categories were mapped to the two dimensions 'arousal' and 'valence' (binary tasks) such as to obtain a balanced distribution of positive/negative arousal and valence classes.

3.3 FEATURE SET
For feature extraction, we retain the ComParE set of supra-segmental acoustic features, a well-evolved set for the automatic recognition of paralinguistic speech phenomena, as used for the baselines of the INTERSPEECH ComParE series. It contains 6,373 static features resulting from the computation of various functionals over

Table 2: GEMEP class distribution (pos(itive)/neg(ative)) for arousal (A) and valence (V); partitioning into training, development and test sets.

Table 1: Details on the subjects contained in the TUM AVIC database.


low-level descriptor (LLD) contours. The configuration file is included in the 2.1 public release of openSMILE (Eyben, 2010).
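The idea of mapping a variable-length LLD contour to static features via functionals can be illustrated with a toy example; the real ComParE set computes many more functionals over many LLDs via openSMILE:

```python
import statistics

def functionals(contour):
    """Map one variable-length LLD contour (e.g. a pitch track) to a
    fixed-length static feature vector via statistical functionals."""
    return {
        "mean": statistics.mean(contour),
        "stdev": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

feats = functionals([0.1, 0.4, 0.3, 0.2])
```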

3.4 EVALUATION ON AVIC
For our empirical evaluations, we used the AVIC database, as it contains the speakers' demographic data such as age group (20s, 30s, 40s) and gender. The evaluation metric is Spearman's rank correlation coefficient (CC). As can be seen in Figures 3-4, the model performance curves for the different target groups consistently improve as new machine-labelled instances are added to the training set. This substantiates our assumptions that, first, the most reliable predictions are selected and thus the applied confidence measure is effective, and second, data aggregation and label enrichment can lead to enhanced recognition accuracy on the test data of the specific user group, thereby achieving the goal of long-term user adaptation. The different lengths of the learning curves are due to the varying sizes of the adaptation data sets. Further, note that a different test set is used for each user group; thus, although the initial model is saved and reused for each learning curve, the same starting point is not possible. Finally, as can be seen in Figure 5, it is more challenging to adapt the model to a single user, although a slight rising trend can be observed for most of the users. One reason for the curves' instability and high variation among the users may be the ambiguous nature of interest recognition, which is a typically subjective task in paralinguistics. The task of user identification is a challenge in itself and will be addressed in Section 5.
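Spearman's rank CC can be computed as the Pearson correlation of the rank-transformed values; a minimal self-contained version, using average ranks for ties:

```python
def ranks(values):
    """Assign 1-based ranks; ties share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

For example, `spearman([1, 2, 3], [1, 4, 9])` is 1.0, since the metric only depends on rank order, not on linearity.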

Figure 3: Performance curve of the SSL-based adaptation technique for the different age groups on the AVIC database.


4. REAL-TIME IMPLEMENTATION (UA)

The real-time detection system of ARIA is implemented with the Social Signal Interpretation (SSI) framework (Wagner et al., 2013). For a detailed description of the system, please refer to D2.1. To enable user adaptation in real-time, the classification module of SSI had to be adjusted. So far, each classifier of a prediction pipeline was a single

Figure 4: Performance curve for adaptation to the gender groups on the AVIC database.

Figure 5: Performance curve for adaptation to the speaker ID on the AVIC database.


model, which was loaded at start-up. At run-time, it was possible to halt and resume prediction by turning a classifier off and on, but not to switch between models on the fly.

The new implementation allows a list of models to be assigned to a classifier, where each model is tagged with a unique identifier. By default, the first model in the list is used for prediction. However, at any time a message with the name of the desired model can be sent to the classifier to switch to another model in the pool. This can be done either manually or by using the output of another classifier. In the latter case, SSI tries to match the name of the predicted label to one of the loaded models. If a fitting model is found that does not match the current one, the models are switched. The following figure illustrates the data flow:

Audio chunks are first transformed into a compact feature representation. The feature set is shared by two types of classifiers: at the top, conventional classifiers use a single model to predict static personality traits (here the gender and the age of the user). The predictions of these classifiers are received by a second type of classifier (bottom), which operates on a pool of models. These models have been specialised for the gender (male, female) and age (child, youth, adult, senior) classes. If, for instance, a male voice is detected but the female model is currently loaded, the classifier switches to the model specialised for male speakers. If we expect that users will interact with the system for a longer period (e.g. several minutes), an optional smoothing step can be added to avoid switching models back and forth in case of false predictions.
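A minimal sketch of such a model pool with majority-vote smoothing; all names here are illustrative, as SSI's actual classifier is configured via XML rather than Python:

```python
from collections import Counter, deque

class AdaptiveClassifier:
    """Holds a pool of models tagged by identifier and switches to the
    model matching the (smoothed) output of a trigger classifier."""
    def __init__(self, models, smooth=1):
        self.models = models                 # e.g. {"female": ..., "male": ...}
        self.current = next(iter(models))    # default: first model in the pool
        self.history = deque(maxlen=smooth)  # recent trigger predictions

    def on_trigger(self, label):
        """Receive a trigger prediction (e.g. 'male') and maybe switch."""
        self.history.append(label)
        majority, _ = Counter(self.history).most_common(1)[0]
        if majority in self.models and majority != self.current:
            self.current = majority

    def predict(self, features):
        return self.models[self.current](features)

clf = AdaptiveClassifier({"female": lambda f: "arousal-f",
                          "male": lambda f: "arousal-m"}, smooth=3)
clf.on_trigger("male")
print(clf.predict(None))  # arousal-m
```

With `smooth=3`, a single outlier prediction cannot flip the majority in the buffer, which mirrors the optional smoothing step described above.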

Figure 6: Real-time implementation of user adaptation in SSI.


In an SSI pipeline, we implement the new behaviour by populating the path option of a classifier with a list of tagged models and setting it to receive the events of another classifier predicting the corresponding classes. The following XML snippet shows an excerpt of the pipeline, where a classifier predicting the gender of the speaker is used to guide the selection of an arousal detector specialised on gender groups:

...

<!-- feature extraction -->
<consumer create="TupleEventSender" address="feature@audio">
  <input pin="audio8k" address="vad@audio" state="nonzerodur">
    <transformer create="OSWrapper" configFile="IS13.conf"/>
  </input>
</consumer>

<!-- static classifier: outputs 'female' or 'male' -->
<object create="Classifier" path="gender" address="gender@audio">
  <listen address="feature@audio"/>
</object>

<!-- adaptive classifier: switches between the specialised models 'arousal-f' and 'arousal-m' -->
<object create="Classifier" path="female:arousal-f;male:arousal-m">
  <listen address="feature,gender@audio"/>
</object>

...

The same trigger mechanism we use to switch to a model specialised on a group of people can also be used to adapt to a particular user (see Section 3.4). Of course, this requires that specialised models are available for the detected user. However, in the future we could implement a system that dynamically adapts to the user during an interaction. To this end, we buffer the predictions (along with the feature vectors) and from time to time use them to retrain the models in the background. After retraining has finished, we can use the existing interface to switch to the adapted models on the fly. This way, the system can even adapt to users that are not yet in the database. Of course, we have to make sure to reset the models when a new user enters the floor. Also, we have to be careful when selecting the instances used for retraining. Ideally, only instances predicted with high confidence should be kept, as otherwise noise will be introduced into the training process.


5. VISUAL USER IDENTIFICATION (UN)
The visual system of ARIA incorporates a face tracking system (Sánchez-Lozano, 2016), described in ARIA D2.2, Section 3.2. For each frame taken from the video stream, the tracking system estimates a sparse set of 66 fiducial points, meant to meaningfully locate key parts of the face, such as the mouth, the contour, the eyes, or the nose. These points are used as input for the emotion classification system (described in ARIA D2.2, Section 3.2).
The user identification system builds on the work of (Déniz, 2011). It takes as input both the current image and the tracked points, and then extracts local information within local patches surrounding each of the points. The local information is a high-level representation of the image features, meant to be robust to changes in rotation, scale, and illumination; it is based on a Histogram of Oriented Gradients (HOG) (Dalal, 2005), described in ARIA D2.2, Section 3.2. The HOG features are concatenated to form a column vector, resulting in the representation of a face for a given frame. These features are extracted for each frame and stored in a sliding window of 50 frames. For each sliding window, a mean feature vector and a covariance matrix are stored, describing the temporal features for the current user.
The incremental CCR algorithm (iCCR) includes tracking loss/failure detection, which is activated each time the quality of the estimated landmarks falls below a certain threshold. The loss detection is activated either when the fitting (the landmark estimate) is poor due to a large movement, or when the user is no longer within the camera's range of visibility. In these cases, the tracker needs to be restarted by first applying a face detection step (i.e., locating whether there is a face in the current frame, and where). When a new face is detected and the facial points are estimated again, the feature extraction process is repeated. The new set of features, collected after a restart, is then compared to the stored mean and covariance features. The comparison is done via the Mahalanobis distance, which indicates the likelihood that these features were generated by the same Gaussian distribution as that defined by the mean vector and covariance matrix. If the distance is below a certain threshold, the features can be said to belong to the previous user. Otherwise, it is assumed that a new user is present. In such a case, the mean vector and covariance matrix belonging to the previous user are stored and referred to as the person-specific statistics for "user 1", and a new storage process starts. The process is depicted in Figure 7.


When the tracker needs to be restarted, the identification process is repeated, although the Mahalanobis distance is now computed not only with respect to the previous user, but with respect to all previous users. That is to say, the new features are compared to all previously stored users. The user with the lowest distance is considered the most likely to have taken over the tracking, provided the corresponding distance is below the threshold. If none of the distances are below the threshold, a new user is added to the current session.
For long-term user identification, the system will allow the user to store the person-specific data collected during a specific session. In addition to this option, it will incorporate an enrolment step, to allow the user to be fully identified in each session. In this scenario, the system will not only know that a set of specific features belongs to an existing user; rather, in each session, the system will try to identify the user from the stored database, and will assign a temporary intra-session ID to the user in case no existing user is found to be handling the system.
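The decision rule can be sketched as follows; for brevity this assumes a diagonal covariance (per-dimension variances), whereas the actual system uses the full covariance matrix, and all names are illustrative:

```python
def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (variances `var`)."""
    return sum((xi - mi) ** 2 / vi
               for xi, mi, vi in zip(x, mean, var)) ** 0.5

def identify(features, users, threshold):
    """Compare new features to all stored users; return the closest user
    if within the threshold, otherwise None (i.e. enrol a new user)."""
    best_id, best_d = None, float("inf")
    for uid, (mean, var) in users.items():
        d = mahalanobis_diag(features, mean, var)
        if d < best_d:
            best_id, best_d = uid, d
    return best_id if best_d < threshold else None

users = {"user1": ([0.0, 0.0], [1.0, 1.0]),
         "user2": ([5.0, 5.0], [1.0, 1.0])}
print(identify([0.5, 0.5], users, threshold=2.0))    # user1
print(identify([10.0, 10.0], users, threshold=2.0))  # None
```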

8. PLANS FOR NEXT PERIOD
In the next period, we aim for several achievements. First, after the adaptation approach has been validated on AVIC for regression and implemented in SSI, we will apply it to the classification task, i.e., emotion recognition on GEMEP. For ASR, we will add confidence

Figure 7: User re-identification procedure: while tracking is ongoing with no failure, the user-specific features are stored in a sliding window of 50 frames. When the tracker needs to be reinitialised, the new features are first compared to the statistics stored for the previous user (mean and covariance of the features). When the distance is higher than a threshold, the system detects a new user, and the data collection restarts.


measures for the transcriptions and integrate the French models. Finally, to further advance multi-modal audio-visual affect recognition, we will investigate Deep Neural Networks (DNNs) with pre-trained layers.

9. OUTPUTS
In what follows, we list the outputs pertinent to this deliverable (categorised by topic) that have been published (or are in press) in the first 31 months of the project.

Machine Learning

[ML1] Yue Zhang, Yifan Liu, Felix Weninger, and Björn Schuller. Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations. In Proc. of ICASSP, New Orleans, LA, March 2017. IEEE.

[ML2] Eduardo Coutinho and Björn Schuller. Shared acoustic codes underlie emotional communication in music and speech - evidence from deep transfer learning. PLOS ONE, 12(e0179289):1-24, June 2017.

Paralinguistics

[PL1] Maximilian Schmitt, Erik Marchi, Fabien Ringeval, and Björn Schuller. Towards cross-lingual automatic diagnosis of autism spectrum condition in children's voices. In Proc. of ITG Symposium on Speech Communication (ITG SC), pages 264-268, Paderborn, Germany, 2016. VDE, IEEE.

[PL2] Björn Schuller, Stefan Steidl, Anton Batliner, Elika Bergelson, Jarek Krajewski, Christoph Janott, Andrei Amatuni, Marisa Casillas, Amanda Seidl, Melanie Soderstrom, Anne Warlaumont, Guillermo Hidalgo, Sebastian Schnieder, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Maximilian Schmitt, Kun Qian, Yue Zhang, George Trigeorgis, Panagiotis Tzirakis, and Stefanos Zafeiriou. The INTERSPEECH 2017 Computational Paralinguistics Challenge: Addressee, Cold & Snoring. In Proc. of INTERSPEECH, Stockholm, Sweden, August 2017. ISCA.

[PL5] Yue Zhang, Felix Weninger, Boqing Liu, Maximilian Schmitt, Florian Eyben, and Björn Schuller. A paralinguistic approach to speaker diarisation. In Proc. of ACM International Conference on Multimedia, Mountain View, CA, October 2017. ACM. To appear.

[PL6] Yue Zhang, Felix Weninger, and Björn Schuller. Cross-domain classification of drowsiness in speech: The case of alcohol intoxication and sleep deprivation. In Proc. of INTERSPEECH, Stockholm, Sweden, August 2017. ISCA.

[PL7] Zixing Zhang, Felix Weninger, Martin Wöllmer, Jing Han, and Björn Schuller. Towards intoxicated speech recognition. In Proc. of International Joint Conference on Neural Networks (IJCNN), pages 1555-1559. IEEE, 2017.


References

Schuller, B., Müller, R., Eyben, F., Gast, J., Hörnler, B., Wöllmer, M., Rigoll, G., Höthker, A. and Konosu, H., 2009. Being Bored? Recognising Natural Interest by Extensive Audiovisual Integration for Real-Life Application. Image and Vision Computing Journal, Special Issue on Visual and Multimodal Analysis of Human Spontaneous Behavior, 1760-1774.

Bänziger, T., Mortillaro, M. and Scherer, K.R., 2012. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception. American Psychological Association, 1161-1179.

Baur, T., Damian, I., Lingenfelser, F., Wagner, J. and André, E., 2013. NOVA: Automated analysis of nonverbal signals in social interactions. International Workshop on Human Behavior Understanding (pp. 160-171). Springer.

Dalal, N. and Triggs, B., 2005. Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition (CVPR) (pp. 886-893). San Diego, CA, USA: IEEE.

Déniz, O., Bueno, G., Salido, J. and De la Torre, F., 2011. Face recognition using histograms of oriented gradients. Pattern Recognition Letters, 1598-1603.

Sánchez-Lozano, E., Martinez, B., Tzimiropoulos, G. and Valstar, M., 2016. Cascaded Continuous Regression for Real-time Incremental Face Tracking. European Conf. on Computer Vision (ECCV) (pp. 645-661). Amsterdam, The Netherlands: Springer.

Eyben, F., Wöllmer, M. and Schuller, B., 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International Conference on Multimedia (pp. 1459-1462). Florence, Italy: ACM.

Gal, Y., 2016. Uncertainty in deep learning. PhD thesis, University of Cambridge.

Wagner, J., Lingenfelser, F., Baur, T., Damian, I., Kistler, F. and André, E., 2013. The social signal interpretation (SSI) framework - multimodal signal processing and recognition in real-time. Proceedings of the 21st ACM International Conference on Multimedia (pp. 831-834). Barcelona, Spain: ACM.

Rosenberg, C., Hebert, M. and Schneiderman, H., 2005. Semi-supervised self-training of object detection models. Proceedings of IEEE Workshop on Motion and Video Computing (pp. 29-36). Breckenridge, CO: IEEE.

Zhang, Y., Weninger, F., Michi, A., Wagner, J., André, E. and Schuller, B., 2017. A Generic Human-Machine Annotation Framework Using Dynamic Cooperative Learning with a Deep Learning-based Confidence Measure. IEEE Transactions on Cybernetics, submitted.