
Research Collection

Doctoral Thesis

Transfer of activity recognition capabilities to untrained sensor systems

Author(s): Calatroni, Alberto

Publication Date: 2013

Permanent Link: https://doi.org/10.3929/ethz-a-010086015

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


Diss. ETH No. 21509

Transfer of activity recognition capabilities to untrained sensor systems

A dissertation submitted to

ETH ZURICH

for the degree of
Doctor of Sciences

presented by

Alberto Calatroni

Laurea Magistrale in Ingegneria Elettronica, Politecnico di Milano, Italy

Date of birth March 12, 1981
Citizen of Italy

accepted on the recommendation of

Prof. Dr. Gerhard Tröster, examiner
Prof. Dr. James L. Crowley, co-examiner

2013


Alberto Calatroni
Transfer of activity recognition capabilities to untrained sensor systems.
Diss. ETH No. 21509

First edition 2013
Published by ETH Zürich, Switzerland.

Printed by Reprozentrale ETH

Copyright © 2013 by Alberto Calatroni
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means without the prior permission of the author.


Acknowledgments

I would like to thank first of all Prof. Dr. Gerhard Tröster for enabling me to pursue my PhD in the Wearable Computing Lab. It was an important experience which put me in contact with many stimulating colleagues and where I learned how to solve many challenges. I am also very grateful to Prof. Dr. James Crowley for being co-advisor for my thesis.

A very special thank you goes to Dr. Daniel Roggen. My PhD was possible thanks to the many fruitful discussions with Daniel, some in the office and some in front of a good dinner. I learned a lot from him from the professional point of view and I was happy to have his friendship throughout the years.

A very warm thank you goes to Ruth Zähringer and Fredy Mettler, who are always there, with infinite patience, for all possible needs. Ruth has always been a reference point and it is an honor to have her and Gede as friends.

I am thankful to all the wonderful colleagues of the OPPORTUNITY project. I wish to thank especially Prof. Dr. Alois Ferscha. Thanks to Alois we always had an excellent time at JKU Linz, with many useful discussions and a lot of productive joint work: I am glad I had the chance to work with him. Thanks to Hesam, Marc, David and Gerold for the very interesting and intense collaboration. I am also grateful to Dr. Ricardo Chavarriaga for the inspiring discussions.

I wish to acknowledge the work of the students with whom I had the pleasure to carry out interesting projects: Lukas Fässler, Nicolas Widmer, Daniel Burgener, Sumit Kumar and Oliver Brand.

A big thank you goes to all my colleagues and ex-colleagues in the Wearlab, with whom I had a great time. I would like to thank especially Tobias, Tommy, Thomas and Sebastian, with whom I also had the pleasure to share a flat and many very nice moments. A special mention goes to Long-Van and Minh, Christoph and Melanie, Holger and Martin W., with whom I spent some very nice time.

Thanks to the countless friends whom I have the pleasure to know, starting with my flatmates Lucía, Anne and Leslie, who, along with Tobias, make me feel at home. Thanks to Pernilla, Stefania, Pier, Alex, Tullia, Sandra, Maja, Gede, Francesca, Laura, Enea, Meike and the countless others whose names would fill up pages and pages.

A big hug goes to Sonia, who has supported me since I was in high school and who is always there, with her care and attention.

Last but not least, I am very grateful to my parents, who always support and encourage me and show me their trust, whatever choices I make.

Zürich, October 2013
Alberto Calatroni


Contents

Abstract

Riassunto

1. Introduction
   1.1 State-of-the-art activity recognition systems and their limitations
   1.2 Shift towards opportunistic systems
   1.3 Benefits of training new resources automatically without user involvement
   1.4 Research questions
   1.5 Contributions
   1.6 Thesis outline
   1.7 Additional publications

2. State of the art
   2.1 Formalism
   2.2 Machine learning paradigm for activity recognition
   2.3 Approaches to cope with unknown sensor placement/orientation
   2.4 Transfer learning and co-training
   2.5 Feature selection under class noise
   2.6 Feature- and signal-level mapping
   2.7 Datasets

3. Reference dataset
   3.1 Introduction
   3.2 Dataset description
   3.3 Challenges and lessons learned
   3.4 Limitations of the dataset
   3.5 Recommendations
   3.6 Dissemination within the scientific community
   3.7 Conclusion

4. Classifier-level transfer learning
   4.1 Introduction
   4.2 Method
   4.3 Performance in the activity recognition scenario
   4.4 Robustness against class noise (why a learner can outperform a teacher)
   4.5 Enhancement with co-training step
   4.6 Conclusion

5. Exploiting ambient sensors and behavioral assumptions
   5.1 Introduction
   5.2 Method
   5.3 Simulations
   5.4 Results
   5.5 Discussion
   5.6 Conclusion

6. Learner candidate selection
   6.1 Introduction
   6.2 Method
   6.3 Datasets and evaluation procedure
   6.4 Results
   6.5 Discussion
   6.6 Conclusion

7. Signal-level transfer learning
   7.1 Introduction
   7.2 Method
   7.3 Dataset
   7.4 Simulations and performance metrics
   7.5 Results
   7.6 Discussion
   7.7 Conclusion

8. Conclusion and outlook
   8.1 Achievements
   8.2 Outlook

List of Abbreviations

List of Symbols

Bibliography

Curriculum Vitae


Abstract

Human activity recognition is useful for a range of applications including health care, human-computer interaction, and assistance of the elderly or industry workers. Activity recognition relies on the deployment of sensors worn by persons or installed in the environment, such as accelerometers, microphones and video cameras. The sensor signals are then processed with pattern recognition algorithms to recognize activities. Pattern recognition algorithms must be trained before system deployment in a data collection phase.

Obtaining a good performance when activity recognition systems are used in real life is quite challenging, though. The main reason is that if we want the systems to be unobtrusive for the user, we cannot control which sensors will be worn, where, and with which orientation. As an example, a smartphone can be worn in any pocket and with many orientations. The lack of a fixed placement and orientation can lead to a severe mismatch between the signal patterns used for training the systems and the ones arising after deployment, which can be detrimental for the classification accuracy.

In this thesis, we seek to change this paradigm, allowing the user to wear new sensor systems which then get trained on-the-fly by the existing infrastructure. We call the new sensor systems “learners” and the existing ones “teachers”. In this work, we investigate different aspects of this new paradigm.

In the first place, we introduce two mechanisms for the teachers to train the learners: classifier-level and signal-level transfer learning. The first mechanism works at the classifier level and makes no assumptions about possible physical relationships between teacher and learner which could introduce correlations between their signals. We applied classifier-level transfer to locomotion and posture recognition performed with accelerometers. In this setting, learners reached an accuracy on average 9.3 % below the one obtained by training with ground-truth labels. For the transfer to take place, teachers and learners need to operate simultaneously until enough activity instances take place.
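The mechanism can be sketched in a few lines. This is a toy illustration, not the thesis implementation: the nearest-centroid model, the synthetic feature spaces and the class separations are all assumed here. A pre-trained teacher classifies jointly observed instances, and the learner trains its own model, in its own feature space, on those imperfect labels.

```python
import numpy as np

class NearestCentroid:
    """Minimal stand-in for any trained activity classifier."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# Two activities observed simultaneously by two sensors through
# different (synthetic) feature spaces.
y_true = rng.integers(0, 2, 300)
X_teacher = y_true[:, None] * 2.0 + rng.normal(0, 0.5, (300, 3))
X_learner = y_true[:, None] * -1.5 + rng.normal(0, 0.5, (300, 2))

teacher = NearestCentroid().fit(X_teacher[:150], y_true[:150])  # pre-trained
noisy_labels = teacher.predict(X_teacher[150:])   # imperfect labels at runtime
learner = NearestCentroid().fit(X_learner[150:], noisy_labels)  # transfer

acc = (learner.predict(X_learner[150:]) == y_true[150:]).mean()
```

Because the teacher's mistakes act like class noise, the learner's accuracy tracks, but does not exactly match, what training with ground-truth labels would give.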

The signal-level transfer learning mechanism speeds up the training of the learner manifold, compared to classifier-level transfer.


Signal-level transfer learning makes the assumption that teacher and learner signals can be obtained from each other by means of some transformation. The transformation is learned automatically from the teacher and learner signals. Teachers and learners have to operate simultaneously only until the transformation between their signals is learned; thus, the duration is decoupled from the kind of activities that teachers and learners need to recognize. We tested the method on a gesture recognition scenario where accelerometers learned from a camera-based system. We obtained a successful training of the learners with teacher and learner operating simultaneously for three seconds. The learners reached an accuracy at most 4 % below the one obtained by training with ground-truth labels.
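A minimal numerical sketch of the idea, assuming a purely linear, static relationship between teacher and learner channels (the thesis uses system identification; the sampling rate, dimensions and noise level here are invented): the mapping is estimated from a brief window of simultaneous operation and then used to translate the teacher's signals into the learner's space.

```python
import numpy as np

rng = np.random.default_rng(1)
W_true = rng.normal(size=(3, 2))   # hidden teacher-to-learner mapping

# Short window in which teacher and learner operate simultaneously
# (about 3 s at an assumed 30 Hz).
S_teacher = rng.normal(size=(90, 3))
S_learner = S_teacher @ W_true + rng.normal(0, 0.01, (90, 2))

# Learn the mapping from the overlap by ordinary least squares.
W_hat, *_ = np.linalg.lstsq(S_teacher, S_learner, rcond=None)

# Translate the teacher's (labelled) historical signals into learner space,
# so the teacher's training set can be reused by the learner.
S_history = rng.normal(size=(500, 3))
S_translated = S_history @ W_hat
err = np.abs(S_translated - S_history @ W_true).max()
```

Once the mapping is learned, no further simultaneous operation is needed, which is why the overlap can be as short as a few seconds.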

Secondly, we introduce algorithms to decide which ones among many available candidate learners are most suitable for recognizing the activities at hand. We formulate the problem as a supervised feature selection problem in the presence of class noise. Class noise arises because teachers do not always recognize activities perfectly. We introduce five scores to rank candidate learners according to their accuracy in recognizing activities. We obtained ranking accuracies of up to 96 %. We showed that with teachers mislabeling up to 40 % of the activities, ranking accuracies drop by at most 10 %.
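The effect of class noise on ranking can be illustrated with a single Fisher-style separability score (an illustrative substitute; the five scores themselves are defined in Chapter 6, and all data here are synthetic): symmetric label noise shrinks every candidate's score but tends to preserve the ordering of well-separated candidates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
y = rng.integers(0, 2, n)                           # true activities
y_noisy = np.where(rng.random(n) < 0.4, 1 - y, y)   # teacher with 40 % errors

# Candidate learners with decreasing class separability.
seps = [2.0, 1.0, 0.5, 0.1]
candidates = [y * s + rng.normal(0, 1, n) for s in seps]

def fisher_score(x, labels):
    """Between-class over within-class scatter of one candidate's feature."""
    m0, m1 = x[labels == 0].mean(), x[labels == 1].mean()
    v0, v1 = x[labels == 0].var(), x[labels == 1].var()
    return (m0 - m1) ** 2 / (v0 + v1)

scores = [fisher_score(x, y_noisy) for x in candidates]
ranking = np.argsort(scores)[::-1]                  # best candidate first
```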

In the third place, we illustrate ideas to exploit in an opportunistic manner resources which are already deployed in the environment. One example is the magnetic switches mounted on windows and used by heating/air-conditioning systems. These switches can become teachers that provide training labels to learners. We show how magnetic switches can train body-worn accelerometers to recognize posture and locomotion. In our experiments, the learners achieved an accuracy between 2.8 % and 20 % below the one obtained by training with ground-truth labels. By using the infrastructure to provide labels, no training set needs to be collected for the body-worn sensors.
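A sketch of how such a switch could mint labels (the event times, the 50 Hz rate and the rule "the user stands while operating a window" are all assumptions for illustration): each switch event stamps a behavioral label onto the accelerometer samples recorded around it.

```python
import numpy as np

# Hypothetical event log from a window's magnetic switch (seconds).
switch_events = [(4.0, "open"), (9.5, "close")]

fs = 50                                        # assumed sampling rate, Hz
acc = np.random.default_rng(4).normal(0, 0.1, 12 * fs)  # 12 s of data
labels = np.full(acc.shape, "", dtype=object)

# Behavioral assumption: around each switch event the user is standing,
# so the concurrent accelerometer samples get the label "stand".
half_window = 1.0
for t, _ in switch_events:
    lo = max(0, int((t - half_window) * fs))
    hi = min(len(acc), int((t + half_window) * fs))
    labels[lo:hi] = "stand"

labeled = int((labels == "stand").sum())       # samples usable for training
```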

Finally, we present a multimodal dataset which we collected with the purpose of establishing a standard benchmark in the field of human activity recognition. The dataset consists of 8216 instances of posture/locomotion and 31336 gestures performed in a kitchen. We used the dataset to validate many of the approaches proposed in this work, and the dataset has been used to run a challenge among diverse research groups, so that different algorithms could be tested on common grounds. This should push research forward, similarly to what happened in computer vision through the use of standard datasets.


Riassunto

Human activity recognition is useful for several applications, including health care, human-machine interaction, and the assistance of the elderly or of factory workers. Activity recognition requires sensors worn by the users or installed in the environment, such as accelerometers, microphones or video cameras. The sensor signals are processed by pattern recognition algorithms to recognize the activities. Before recognition can take place, the algorithms must first go through a training phase in which examples of the activities to be recognized are collected.

Obtaining good performance once an activity recognition system is deployed and used in everyday life is difficult. The main reason is that, to make the system unobtrusive, the user must be left free to place the sensors in arbitrary positions and with various orientations. For example, a smartphone can be slipped into any pocket and turned in various ways. This freedom of placement can lead to a considerable mismatch between the signals measured during training and those actually measured when the user operates the system. This mismatch is detrimental to the accuracy of activity classification.

In this thesis, we aim to change this paradigm, allowing the user to wear new sensors whose training is carried out not through an ad hoc data collection, but through already-installed infrastructure and without intervention by the user. We call the new sensors “learners” and the pre-existing ones “teachers”. In this work, we analyze several aspects of the teacher-learner paradigm.

First, we introduce two mechanisms by which pre-existing sensors train new ones: transfer learning at the classifier level and at the signal level. The first approach does not assume any particular relationship between teacher and learner signals. We applied it to a posture and locomotion recognition scenario based on accelerometers. In this scenario, the new sensors reached an accuracy on average 9.3 % below that obtainable when training with exact (ground-truth) labels. For the transfer to succeed, teachers and learners must interact long enough for instances of all the activities to be recognized to occur.

Signal-level transfer learning makes the learning by the new sensors considerably faster. This approach assumes that a transformation exists which allows the learner's signals to be obtained from the teacher's. This transformation is learned directly by comparing the signals for a sufficient time; it is independent of the activities being recognized. We validated this method in a gesture recognition scenario in which accelerometers learn from a camera-based system. The transformation was learned in 3 s, and the learners reached accuracies at most 4 % below that obtainable by training with ground truth.

Next, we introduce algorithms for selecting which ones among numerous potential learners can achieve the highest recognition performance. We formulate the problem in terms of feature selection in the presence of label noise (class noise). The noise stems from the fact that teachers do not recognize activities perfectly. We introduce five metrics for ranking the potential learners from most to least accurate, achieving a ranking precision of up to 96 %. The ranking precision decreases by about 10 % when the teachers misclassify 40 % of the activities.

Subsequently, we illustrate how to opportunistically use resources already present in the environment in the role of teachers. One example is the magnetic switches often mounted on doors and windows and used by air-conditioning systems. These switches can produce labels for training other sensors. In particular, we show how magnetic switches can train body-worn accelerometers to recognize posture and locomotion. In our experiments, the learners reached accuracies between 2.8 % and 20 % below those obtainable by training with ground truth. Using sensors already installed in the environment avoids collecting a training set for the user altogether.

Finally, we present a multimodal dataset which we collected with the intention of making it a standard reference in the field of activity recognition. The dataset was recorded in a kitchen and comprises 8216 instances of postures or locomotion and 31336 gestures, as well as activities such as the preparation of a snack. We used the dataset to validate many of the algorithms developed in this thesis. Moreover, the dataset served as the basis for a challenge involving several research groups, with the goal of testing alternative algorithms on a common footing. The use of common data should drive research forward through competition among groups, as has happened in the field of computer vision.


1. Introduction

In this chapter, we introduce the main characteristics and shortcomings of present human activity recognition systems. We provide motivations for shifting towards opportunistic systems, whose aim is to exploit resources as they become available, as opposed to designing a fixed system. We then outline the research questions and the goals that we pursued to this end. Finally, we illustrate the research contributions.


1.1 State-of-the-art activity recognition systems and their limitations

Human activity recognition is useful for a range of applications, such as healthcare, human-computer interaction, and assistance of the elderly or industry workers. The recognition is made possible by equipping environments and people with various sensors and using machine learning techniques. Among the sensors, we find accelerometers, inertial measurement units, microphones, cameras, RFID tags, etc. Some approaches exploit multimodal sensing (e.g. sound and video) to recognize human behavior [1]. Sensor data are used to build classifier models via supervised learning. This implies collecting signal examples and assigning an activity label to each example. The classifiers are often trained before system deployment.
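The supervised pipeline just described can be sketched end to end (the window length, step and the two features below are illustrative choices, not settings from this thesis): the stream is cut into windows, per-window features are computed, and one label is attached to each window.

```python
import numpy as np

def sliding_windows(signal, width, step):
    """Cut a 1-D sensor stream into fixed-length, possibly overlapping windows."""
    return np.array([signal[i:i + width]
                     for i in range(0, len(signal) - width + 1, step)])

def features(windows):
    """Two simple per-window features: mean and standard deviation."""
    return np.column_stack([windows.mean(axis=1), windows.std(axis=1)])

rng = np.random.default_rng(3)
# Synthetic one-axis accelerometer streams: 'walk' is high-variance,
# 'stand' is low-variance.
walk = rng.normal(0, 1.0, 3000)
stand = rng.normal(0, 0.1, 3000)

X = np.vstack([features(sliding_windows(s, 100, 50)) for s in (walk, stand)])
y = np.array([0] * 59 + [1] * 59)   # one activity label per window
```

Any supervised classifier can then be fitted to (X, y); the point is only the windowing, feature extraction and labeling structure of the training set.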

Obtaining a good performance when an activity recognition system is used in real life is quite challenging, though. The main reason is that if we want the systems to be unobtrusive for the user, we cannot control which sensors will be worn, where, and with which orientation. For example, a smartphone can be worn in any pocket and with potentially many orientations. This can lead to a severe mismatch between the signal patterns used for training the systems and the ones arising after deployment, which can be detrimental for the classification accuracy.

In summary, in most recognition systems presented in the literature, the following assumptions are made:

• Predefined set of sensors: the number and type of the involved sensors are decided at design time and assumed to be constant;

• Fixed sensor positions: on-body sensors are assumed not to slide or rotate, since this has a major negative impact on recognition accuracy;

• Static mapping between sensor signal and activity: the collection of signal patterns to build classifier models is done before system deployment and assumed not to change with time;

• Little variability in signal patterns: multiple instances of the same activity should be as similar to each other as possible, to avoid a drop in classification accuracy. If different users perform the same activity, the signal patterns can differ significantly; thus, user-specific models are needed.


1.2 Shift towards opportunistic systems

Since the aforementioned assumptions conflict with an easy and unobtrusive deployment of sensor systems, we shift towards a new paradigm, which we call opportunistic activity recognition [2]. The term “opportunistic” refers to the capability of the recognition system to make the best use of the resources that happen to be available at a certain time. The investigation of opportunistic activity recognition was at the core of the OPPORTUNITY project¹, funded by the European Union Seventh Framework Programme. Some key ideas towards opportunistic activity recognition are the following:

1. Sensor self-description and self-characterization mechanisms. This allows a system to exploit a sensor according to its description; for example, a sensor containing the keyword acceleration in its self-description can be used to detect gestures or physical activity.

2. Integration and usage of new resources on-the-fly as they appear. Being able to exploit new resources is very important to increase the system performance or to be able to select what to use depending on the needs.

3. Interaction between ambient and body-worn sensors. Ambient sensors can provide very crisp labels for the user's actions (e.g. the opening of a drawer detected by a magnetic switch). This information can be used while the user acts in the environment to incrementally train the wearable sensors.

4. Interaction between different body-worn sensors. Existing, trained body-worn sensors can provide labels, although not always very accurate ones, to train other newly deployed sensors.

5. Discovery of relationships between sensor signals to facilitate training of new sensors. Different sensors measuring acceleration, position, orientation, etc. could be interchanged without needing to be trained again if all have a common representation of their measured quantities. For example, position data can be transformed to and compared with acceleration data measured on the same spot, allowing the training data for a position sensor to become the training data for an accelerometer.

¹ http://www.opportunity-project.eu


6. Finding similarity between different sensor outputs. If two sensors output data which show a high correlation, they can likely be used interchangeably by the system. It is therefore important to detect this condition.

7. Capability for sensors to self-determine their on-body placement. The training data and the usefulness of a sensor for recognizing a particular activity depend on where the sensor is placed. Being able to automatically detect the placement can determine whether a sensor should be used for the recognition of certain activities.

8. Fusion of different sensing modalities to bring additional robustness. Often it is not enough to measure one physical quantity to recognize complex activities. It is therefore useful to merge different cues, including sound, motion, time, etc.

9. Detection of anomalous sensors. If a sensor is suddenly delivering data which differ from the expected ones, the sensor can be removed from the activity recognition chain or, if possible, the classifier models associated with it can be updated.
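Idea 5 above can be made concrete with a small sketch: a limb-position trace is numerically differentiated twice to approximate what an accelerometer at the same spot would measure (gravity, sensor orientation and measurement noise are deliberately ignored), so a labelled position training set could in principle be translated for an accelerometer. The sampling rate and trajectory are invented.

```python
import numpy as np

def position_to_acceleration(pos, fs):
    """Approximate acceleration from a position trace by differentiating twice."""
    vel = np.gradient(pos, 1.0 / fs)
    return np.gradient(vel, 1.0 / fs)

fs = 100.0                            # assumed sampling rate in Hz
t = np.arange(0, 2, 1 / fs)
pos = np.sin(2 * np.pi * t)           # 1 Hz oscillating limb position
acc = position_to_acceleration(pos, fs)

# Analytically, the second derivative of sin(w * t) is -w**2 * sin(w * t),
# which the numerical estimate should match away from the edges.
expected = -(2 * np.pi) ** 2 * np.sin(2 * np.pi * t)
```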

This thesis focuses on the interactions between different sensors, wearable and ambient, to provide new means for training sensors after deployment. The thesis covers points 3, 4 and 5 in the above list. Throughout this work, we call “teacher” any sensor system able to provide labels, and “learner” any sensor system that is deployed and needs to be trained to recognize activities. Figure 1.1 schematically illustrates the different building blocks of the thesis and serves as a reference throughout this monograph.

1.3 Benefits of training new resources automatically without user involvement

The possibility of training new sensors by means of existing ones offers a set of advantages:

• User-specific models: by training classifiers on sensor systems as they are worn by a specific user, the learned models are specifically tailored for the activities carried out by that specific user and reflect the way he or she performs those activities.


[Figure: four schematic teacher-learner scenarios, with a smartphone, magnetic switches, an accelerometer and a Kinect acting as teachers, and accelerometers acting as learners or learner candidates.]

Figure 1.1: Examples of scenarios for the methods presented in this thesis. 1) A smartphone classifies activities and provides labels to train the learner; 2) magnetic switches in the environment are used to extract training labels for the body-worn learner; 3) learner candidates are ranked using scores to assess their suitability in classifying the activities; 4) a mapping between limb position and acceleration is automatically established; with this mapping, the learner can be trained by translating the teacher training set.


• Factoring in sensor position/orientation: by training new sensor systems in place with the specific end user, the models obtained match the way the user actually wears the sensor systems.

• Ready deployment of new resources: by exploiting existing sensor systems to train new ones, the user does not have to collect training examples on his or her own; instead, the systems get trained automatically over a certain period of time.

1.4 Research questions

In this work, we tackle the following research questions:

1. Can an existing teacher system providing imperfect labels train a learner system to recognize activities?

2. Can we rank a pool of learners to predict which ones will perform better in the recognition task, given only imperfect teachers?

3. Can a system automatically establish relationships between teacher and learner signals to speed up the training?

4. Can we leverage common knowledge or assumptions to transform simple ambient sensors into teachers?

1.5 Contributions

We show that the research questions have a positive answer. The main contributions of this thesis can be summarized as follows:

1. We establish a reference dataset for benchmarking transfer learning and opportunistic activity recognition. This dataset fulfills the requirements for becoming a reference for the activity recognition community. This is presented in Chapter 3.

2. We show the feasibility of training a newly appearing body-worn learner system for recognizing activities by leveraging other trained teacher systems present on the user's body. We denote this as classifier-level transfer learning (Chapter 4).

3. We explore the usage of sensors installed in the environment, enhanced by assumptions on user behavior, to extract labels to train body-worn sensor systems (Chapter 5).


4. We tackle the question of how to select the most suitable targets for learning how to recognize activities, given teachers providing only imperfect labels. We denote the approach as learner candidate selection (Chapter 6).

5. We demonstrate that system identification techniques can be used to find mappings between sensor readings taken from ambient and body-worn devices, which enables a quicker training of the learner sensor systems. We denote this as signal-level transfer learning (Chapter 7).

The approaches used in the thesis are validated on scenarios involving both wearable and ambient sensors.

1.6 Thesis outline

The thesis is organized as follows.

In Chapter 2, we introduce the related work in the field of activity recognition, pointing out the limitations of current approaches. We then illustrate work in the areas of transfer learning, co-training, feature selection and system identification related to the current thesis.

In Chapter 3, we explain the need for a rich multi-modal reference dataset for testing opportunistic activity recognition algorithms, and more specifically those discussed in the present work. We describe the data collection and curation, along with lessons learned and exploitation strategies to bring the data to the scientific community.

In Chapter 4, we introduce the method by which an existing sensor trains newly deployed ones by operating at the level of the classifiers. We show the effectiveness of this transfer learning approach among body-worn accelerometers for the recognition of locomotion/postures. We show how the approach is enhanced with a co-training step to improve the learner performance. We validate the latter approach on a scenario using smartphones.

In Chapter 5, we argue that ambient sensors like magnetic switches embedded in doors or windows can act as teachers for training newly deployed learners to recognize postures and locomotion. We illustrate a set of possible applications which could benefit from this approach. We then characterize the performance in terms of teacher accuracy and evaluate the performance of the learners trained in this way.


In Chapter 6, we tackle the problem of deciding, among a set of newly deployed sensor systems, which are the most promising ones for reaching a high accuracy when acting as learners in the transfer learning approach. We formulate the problem as a feature selection problem under class noise. We introduce new heuristics for ranking the candidate learners and compare them with standard feature selection approaches in the presence of class noise.

In Chapter 7, we propose a method operating at the signal level to perform transfer learning by directly translating the training set of a teacher to a learner. We show how the signals can be mapped automatically and we validate this approach with two datasets involving HCI gestures and a fitness scenario. We compare this method with the classifier-level transfer in terms of performance and amount of data needed for the transfer to take place.

In Chapter 8, we summarize the main achievements and results of this thesis. We then outline the next steps and the new research directions opened up by the present work.

Fig. 1.2 shows how the different chapters play a role in the concept of transferring activity recognition capabilities from existing to new sensor systems.

1.7 Additional publications

The following additional publications have been written:

• D. Roggen, K. Förster, A. Calatroni and G. Tröster. The adARC pattern analysis architecture for adaptive human activity recognition systems. In J. Ambient Intelligence and Humanized Computing, 4(2): 169-186, 2013.

• S. Mazilu, A. Calatroni, E. Gazit, D. Roggen, Jeffrey M. Hausdorff and G. Tröster. Feature Learning for Detection and Prediction of Freezing of Gait in Parkinson's Disease. In Proceedings of the 9th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), New York, NY, pages 144-158, 2013.

• Z. Zhu, U. Blanke, A. Calatroni and G. Tröster. Prior Knowledge of Human Activities from Social Data. In Proceedings of the 17th International Symposium on Wearable Computers (ISWC '13), Zurich, Switzerland, 2013.


[Figure 1.2 is a diagram arranging the contributions along two axes: class-level vs. signal-level transfer learning, and wearable-to-wearable vs. ambient-to-wearable transfer. It places "Using a trained system to train a newcomer" (Chapter 4), "Primitive sensors and behavioral assumptions" (Chapter 5), "Learner selection as a feature selection problem" (Chapter 6) and "System identification to transfer between related modalities" (Chapter 7) on these axes, all building on the reference dataset (Chapter 3).]

Figure 1.2: Organization of the contributions according to their topic.


Chapter 3: Collection and curation of a large reference dataset for activity recognition. A. Calatroni, D. Roggen and G. Tröster. IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 30-35, 2011.

Chapter 3: Collecting complex activity datasets in highly rich networked sensor environments. D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Hölzl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura and J. Millán. Seventh International Conference on Networked Sensing Systems (INSS), pages 233-240, 2010.

Chapter 4: Automatic Transfer of Activity Recognition Capabilities between Body-Worn Motion Sensors: Training Newcomers to Recognize Locomotion. A. Calatroni, Daniel Roggen and Gerhard Tröster. Eighth International Conference on Networked Sensing Systems (INSS'11), pages 16-23, 2011.

Chapter 5: A methodology to use unknown new sensors for activity recognition by leveraging sporadic interactions with primitive sensors and behavioral assumptions. A. Calatroni, D. Roggen and G. Tröster. Proceedings of the Opportunistic Ubiquitous Systems Workshop, part of the 12th ACM Int. Conf. on Ubiquitous Computing, 2010.

Chapter 7: Kinect=IMU? Learning MIMO Signal Mappings to Automatically Translate Activity Recognition Systems across Sensor Modalities. O. Banos, A. Calatroni, M. Damas, H. Pomares, I. Rojas, H. Sagha, J. Millán, G. Tröster, R. Chavarriaga and D. Roggen. Proceedings of the 16th International Symposium on Wearable Computers (ISWC), pages 92-99, 2012.

Table 1.1: List of publications directly related to the present thesis.


• H. Sagha, A. Calatroni, J. Millán, D. Roggen, G. Tröster and R. Chavarriaga. Robust Activity Recognition Combining Anomaly Detection and Classifier Retraining. In Proceedings of the 10th annual conference on Body Sensor Networks (BSN), 2013.

• R. Chavarriaga, H. Sagha, A. Calatroni, Sundara Tejaswi Digumarti, G. Tröster, José del R. Millán and D. Roggen. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. In Pattern Recognition Letters, 2013.

• A. Calatroni, D. Roggen and G. Tröster. Design of an Ecology of Activity-Aware Cells in Ambient Intelligence Environments. In 10th IFAC Symposium on Robot Control, Dubrovnik, Croatia, 2012.

• L. Nguyen-Dinh, D. Roggen, A. Calatroni and G. Tröster. Improving Online Gesture Recognition with Template Matching Methods in Accelerometer Data. In Proceedings of the 12th International Conference on Intelligent Systems Design and Applications (ISDA), 2012.

• D. Roggen, A. Calatroni, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, A. Ferscha, M. Kurz, G. Hölzl, H. Sagha, H. Bayati, J. Millán and R. Chavarriaga. Activity Recognition in Opportunistic Sensor Environments. In Procedia Computer Science 7, pages 173-174, 2011.

• H. Sagha, S. T. Digumarti, J. Millán, R. Chavarriaga, A. Calatroni, D. Roggen and G. Tröster. Benchmarking classification techniques using the Opportunity human activity dataset. In IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 36-40, 2011.

• M. Kurz, G. Hölzl, A. Ferscha, A. Calatroni, D. Roggen and G. Tröster. Real-Time Transfer and Evaluation of Activity Recognition Capabilities in an Opportunistic System. In Third International Conference on Adaptive and Self-Adaptive Systems and Applications (ADAPTIVE 2011), Rome, Italy, pages 73-78, 2011.

• M. Kurz, G. Hölzl, A. Ferscha, A. Calatroni, D. Roggen, G. Tröster, H. Sagha, R. Chavarriaga, J. Millán, D. Bannach, K. Kunze and P. Lukowicz. The OPPORTUNITY Framework and Data Processing Ecosystem for Opportunistic Activity and Context Recognition. In International Journal of Sensors, Wireless Communications and Control, Special Issue on Autonomic and Opportunistic Communications, pages 102-125, 2011.

• A. Manzoor, C. Villalonga, A. Calatroni, H. Truong, D. Roggen, S. Dustdar and G. Tröster. Identifying Important Action Primitives for High Level Activity Recognition. In 5th European conference on Smart sensing and context (EuroSSC 2010), Springer-Verlag, pages 149-162, 2010.

• K. Förster, Samuel Monteleone, A. Calatroni, D. Roggen and G. Tröster. Incremental kNN classifier exploiting correct-error teacher for activity recognition. In Proceedings of the 9th International Conference on Machine Learning and Applications (ICMLA), pages 445-450, 2010.

• D. Roggen, K. Förster, A. Calatroni, A. Bulling and G. Tröster. On the issue of variability in labels and sensor configurations in activity recognition systems. In Workshop at the 8th International Conference on Pervasive Computing (Pervasive 2010), 2010.

• P. Lukowicz, G. Pirkl, D. Bannach, F. Wagner, A. Calatroni, K. Förster, Thomas Holleczek, M. Rossi, D. Roggen, G. Tröster, J. Doppler, C. Holzmann, A. Riener, A. Ferscha and R. Chavarriaga. Recording a complex, multi modal activity data set for context recognition. In Workshop on Context-Systems Design, Evaluation and Optimisation at ARCS, Hannover, Germany, 2010.

• R. Chavarriaga, José del R. Millán, H. Sagha, Hamidreza Bayati, P. Lukowicz, D. Bannach, D. Roggen, K. Förster, A. Calatroni, G. Tröster, A. Ferscha, M. Kurz and G. Hölzl. Robust activity recognition for assistive technologies: Benchmarking ML techniques. In Workshop on Machine Learning for Assistive Technologies at the Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS-2010), Vancouver, Canada, 2010.

• M. Kurz, A. Ferscha, A. Calatroni, D. Roggen and G. Tröster. Towards a Framework for opportunistic Activity and Context Recognition. In 12th ACM International Conference on Ubiquitous Computing (Ubicomp 2010), Workshop on Context awareness and information processing in opportunistic ubiquitous systems, 2010.

• D. Roggen, A. Calatroni, Mirco Rossi, Thomas Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, Gerald Pirkl, Florian Wagner, A. Ferscha, Jakob Doppler, Clemens Holzmann, M. Kurz, Gerald Hölzl, R. Chavarriaga, M. Creatura and J. Millán. Walk-through of the OPPORTUNITY dataset for activity recognition in sensor rich environments. 2010.

• A. Calatroni, C. Villalonga, D. Roggen and G. Tröster. Context Cells: Towards Lifelong Learning in Activity Recognition Systems. In Proceedings of the 4th European Conference on Smart Sensing and Context (EuroSSC), pages 121-134, Springer, 2009.

• D. Roggen, K. Förster, A. Calatroni, A. Bulling, T. Holleczek, G. Tröster, P. Lukowicz, G. Pirkl, D. Bannach, A. Ferscha, A. Riener, C. Holzmann, R. Chavarriaga and J. Millán. OPPORTUNITY: Activity and context awareness in opportunistic open-ended sensor environments. In Proceedings of the 1st European Future and Emerging Technologies Conference (FET 2009), Prague, Czech Republic, European Commission, 2009.

• D. Roggen, K. Förster, A. Calatroni, T. Holleczek, Y. Fang, G. Tröster, P. Lukowicz, G. Pirkl, D. Bannach, K. Kunze, A. Ferscha, C. Holzmann, A. Riener, R. Chavarriaga and J. Millán. OPPORTUNITY: Towards opportunistic activity and context recognition systems. In Proceedings of the 3rd IEEE Workshop on Autonomic and Opportunistic Communications (WoWMoM), pages 1-6, 2009.


2 State of the art

In this chapter, we first briefly introduce the naming and symbol conventions adopted in this work. We then describe the state of the art in the field of activity recognition, specifically focusing on works which attempt to overcome the assumptions of a fixed and known sensor setup. We illustrate the limitations of current approaches, giving the motivation for a paradigm shift.


2.1 Formalism

In this thesis, we use a formalism adapted and extended from [3]. We define a domain D as a two-tuple D = (X, P(X)). X = {θ1, ..., θM} is the M-dimensional feature space of D and P(X) is the marginal distribution, where X = {x1, ..., xN} ⊆ X. We denote x1, ..., xN as instances. The instances are extracted from the corresponding raw signals s1, ..., sN via a feature extraction function F, i.e. xi = F(si).

We define a task T as a two-tuple T = (Y, f(·)) for a domain D. Y is the label space of D and f(·) is an objective predictive function for D, i.e. f: X → Y. We denote a labeled instance as a two-tuple (xi, yi). The function f(·) can be learned through the set of labeled instances Ω = {(x1, y1), ..., (xN, yN)}, which is the training set.

In an activity recognition task, a domain D can for example be a body-worn accelerometer, the label space Y is formed by the activities that need to be recognized, and the function f(·) can be a classifier operating on the feature space X.

Throughout the thesis, we denote as "teacher" any domain for which a function f(·) is known or can be obtained through a set of labeled instances; we use the superscript T to denote a teacher domain. We denote as "learner" any domain for which f(·) is unknown and no labeled instances exist yet; we use the superscript L to denote a learner domain. The superscript G is used throughout the thesis to denote ground-truth labels. We denote with ε the error rate of a set of labels yi compared to their ground-truth counterparts yiG:

ε = (1/N) ∑_{i=1}^{N} 1(yiG − yi),

where the indicator function 1(x) returns 1 if x ≠ 0 and 0 otherwise. We use the terms "class noise" or "label noise" to denote situations where ε > 0.

The accuracy α is defined as α = 1 − ε. The accuracy can also be expressed in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as α = (TP + TN) / (TP + TN + FP + FN). The accuracy is one of the standard metrics used in activity recognition [4].
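As a minimal illustration of these metrics (our own sketch, with labels encoded as integers; not code from the thesis), the error rate ε and accuracy α of a noisy label set can be computed as:

```python
def error_rate(y_true, y_pred):
    """Error rate eps: fraction of labels differing from their ground truth."""
    assert len(y_true) == len(y_pred)
    return sum(1 for yg, y in zip(y_true, y_pred) if yg != y) / len(y_true)

def accuracy(y_true, y_pred):
    """Accuracy alpha = 1 - eps."""
    return 1.0 - error_rate(y_true, y_pred)

# Example: a teacher that mislabels 1 instance out of 4 (class noise, eps > 0)
y_ground_truth = [0, 1, 1, 0]
y_teacher      = [0, 1, 0, 0]
print(error_rate(y_ground_truth, y_teacher))  # 0.25
print(accuracy(y_ground_truth, y_teacher))    # 0.75
```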

2.2 Machine learning paradigm for activity recognition

Activity recognition can be carried out using a machine learning paradigm. The approach consists in acquiring the labeled instances in a training phase. The set of labels yi is usually collected through annotations provided by the end user [5], possibly prompted via an active learning approach [6]. For this approach to work, the assumption has to hold that P(X) remains as constant as possible when switching from the training to the deployment phase. This in turn implies a constant set of sensors, always positioned and oriented in the same way on the user's body and in the environment. This assumption is quite restrictive: in real-world applications, the user will at times wear new sensor systems, positioned in previously unknown locations, which need to be trained and included in the activity recognition system.

2.3 Approaches to cope with unknown sensor placement/orientation

The first idea proposed to tackle this problem is to determine the sensor placement. Kunze et al. showed that it is possible to determine the orientation of an accelerometer in a pocket [7] or the position of an acceleration sensor on the body [8]. The surface on which a sensor is placed in the environment can be detected by using sound signatures [9]. Detecting the sensor placement loosens the assumption that the sensor positions need to be known at design time. Nevertheless, there is still the need to build classifier models for a predefined and possibly large set of positions/locations/sensor types. This involves a user-specific training phase or the collection of a dataset that covers potentially many combinations of sensor types, positions and orientations.

The second class of approaches uses position-invariant features or classifier models. In [10], the authors propose to calculate features that do not change with sensor position, to gain robustness and to allow the user more freedom in sensor placement. The invariance is achieved by finding features with a Genetic Programming approach: features are evolved by combining arithmetic and statistical operations until maximum position-invariance is found. In a complementary fashion, Lester et al. [11] build position-invariant classifier models. The authors proposed to train classifiers with feature vectors coming from various sensor positions, so that the classifiers automatically provide position independence. This approach has been mainly tested on the recognition of locomotion activities. Both for position-invariant features and classifiers, an offline training phase is required: the system designer needs to collect a rich training dataset, incorporating as many sensor positions and orientations as possible.

2.4 Transfer learning and co-training

The problem of training a new sensor system by leveraging existing ones can be framed as a transfer learning problem. Transfer learning is the process of using the "knowledge" acquired in one domain to gain knowledge in another domain [3]. Approaches have been proposed in a variety of fields.

TrAdaBoost [12], derived from AdaBoost [13], consists in using examples from an existing domain to build up the "difficult" examples for the new domain. The assumption is that the two systems operate with the same types of features. This is not always realistic in activity recognition settings, where sensor nodes can measure very different kinds of physical quantities and some sensors may deliver only binary states (e.g. a magnetic switch).

Van Kasteren [14] showed that a transfer learning approach can exploit the knowledge (classifier models) related to activity recognition in one smart home to train similar activity recognition systems in other homes, thus avoiding many costly training data collections. The need for a common feature space induced the authors to create "meta-features", which are the common ground on which the transfer of knowledge occurs. The method was applied only to binary sensors, i.e. sensors delivering two states (opened/closed). The meta-features need to be carefully selected to enable the transfer.

Translated learning was proposed in [15] to tackle the case of different feature spaces. The method assumes that a mapping function between the feature spaces can be obtained by estimating a probability density on the co-occurrence of features in both spaces, which may require a considerable amount of data.

The problem of a teacher training a learner can also be viewed as an instance of semi-supervised learning, specifically co-training. The main idea behind co-training, first proposed by Blum and Mitchell [16], is to train two classifiers simultaneously by leveraging a small set of labeled data and a much larger set of unlabeled data. Co-training has been used in activity recognition in [17] and [18] with the goal of reducing the labeling effort of the user. In these works, the approach is not to have a teacher and a learner, but rather different classifiers initialized with a set of labels obtained, for example, by active learning. In our teacher-learner case, the learner nevertheless starts with no labeled data. Variations of co-training have been applied to text classification [19], speech tagging [20] and object recognition [21]. Many co-training algorithms require the classifiers to output a measure of confidence for each classified example, along with the label. This cannot always be guaranteed, for example if the teacher system is a magnetic switch having just the two states "opened" and "closed". In this thesis, we use a version of co-training which is based on the agreement between the teacher and learner classifiers.
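To make the agreement-based teacher-learner setting concrete, the retraining loop can be sketched as follows. This is an illustrative toy only (one-dimensional features and a nearest-centroid classifier; all names are ours, and this is not the implementation evaluated in the thesis):

```python
# Toy sketch of agreement-based co-training between a teacher and a learner.
# The learner is first bootstrapped with teacher-provided (possibly noisy)
# labels; afterwards it is retrained only on instances where teacher and
# learner predictions agree, filtering out part of the label noise.

def nearest_centroid_fit(X, y):
    """Toy classifier: store the per-class mean of 1-D features."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: sum(v) / len(v) for c, v in groups.items()}

def nearest_centroid_predict(model, X):
    """Assign each instance to the class with the closest centroid."""
    return [min(model, key=lambda c: abs(x - model[c])) for x in X]

def cotrain_step(learner_X, teacher_labels, learner_model):
    """One co-training step: retrain the learner on agreed instances only."""
    learner_labels = nearest_centroid_predict(learner_model, learner_X)
    agreed = [(x, yt) for x, yt, yl in
              zip(learner_X, teacher_labels, learner_labels) if yt == yl]
    if not agreed:
        return learner_model  # no agreement: keep the previous model
    X_a, y_a = zip(*agreed)
    return nearest_centroid_fit(X_a, y_a)

# Bootstrap the learner directly with teacher labels, then refine by agreement.
learner_model = nearest_centroid_fit([0.0, 0.1, 1.0, 1.1], [0, 0, 1, 1])
learner_model = cotrain_step([0.05, 0.95], [0, 1], learner_model)
```

The agreement test plays the role that per-example confidence scores play in classical co-training, which is why it also works with teachers that output nothing but a label.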

2.5 Feature selection under class noise

When a user wears a number of new sensors (which could potentially become many, if we think of a future T-shirt with an integrated set of accelerometers), we must tackle the problem of selecting, among the various candidate learners, the ones most promising for participating in the activity recognition. We refer to this as the problem of candidate learner selection (see Figure 2.1). It can in turn be framed as a feature selection problem under class noise, since each learner's feature space can be seen as a subset of a larger set of features and the instances are labeled by a teacher which is imperfect (hence, under class noise).

Many feature selection algorithms have been proposed to select relevant features and to eliminate redundant ones (see [22] for a survey). Filter approaches provide a ranking independent of the classifier which is then used for the pattern recognition task. Common filter methods include Mutual Information [23, 24], which has also been combined with measures of redundancy [25]; Information Gain [26]; Chi-Square divergence [15, 27]; correlation-based approaches [28]; and RELIEF, originally proposed by Kira and Rendell [29] and then modified by Kononenko to obtain RELIEF-F [30]. All these methods provide a ranking of individual features, not of subsets. Mutual Information could in theory be applied directly to feature subsets, but this would imply collecting a very large amount of data for the estimation of the joint probabilities and is usually not feasible in practice [31].

Max-Relevance and Min-Redundancy-Max-Relevance are instead used for subsets [25]. Max-Relevance consists in averaging the Mutual Information of the single features within a subset. We use this score as a benchmark, as its robustness in the presence of class noise has not yet been investigated. The Fisher separability score [32, 33] has been


[Figure 2.1 shows a smartphone (teacher) providing imperfect labels over the label space Y = {x, o} to a smart shirt with accelerometers (many candidate learners), each learner contributing a two-dimensional feature space X^L_1, X^L_2 or X^L_3.]

Figure 2.1: Framing of the learner candidate selection for a two-class problem as feature selection under class noise. The three feature spaces X^L_1, X^L_2 and X^L_3, each associated with one accelerometer, can be seen as slices of dimensionality two of a feature space of dimensionality six. The smartphone provides imperfect labels; therefore the feature selection is under class noise.


used as a filter and is intrinsically suitable for multi-class problems as well as feature subsets [34]. We include Fisher separability in our benchmarks for comparison and test its robustness in the presence of class noise. The expected performance is not very high, since it measures linear separability; it is thus most suitable when coupled with classifiers which generate linear decision boundaries. Other measures of similarity between clusters representing different classes in the feature spaces [35] can also be used as filters. The robustness of Fisher separability and of other similarity measures against class noise has so far not been investigated.

Wrapper approaches [36, 37] can operate directly on feature subsets. These approaches treat the learning algorithm as a black box and use the classification accuracy to assess how suitable each subset is. These algorithms use the same classifier for selecting the relevant features and for carrying out the pattern recognition task. Using the same classifier is not suitable for our needs, since we want to decouple the ranking of the feature subsets from the classifier used. Furthermore, if the learning algorithm has a computationally expensive training phase, wrappers can become substantially more costly than filters.

Hybrid feature selection approaches have been investigated, where a classifier acts as a filter and is decoupled from the one used for the recognition. Cardie [38] uses a decision tree as a filter for selecting features to be used by a case-based learner (kNN) for the recognition phase. Support vector machines have also been proposed in a hybrid setting by He et al. [39]. However, SVMs have a costly training phase, on the order of O(N²), and the impact of class noise on the feature subset ranking capability has not been considered in previous work, neither for SVMs nor for Decision Trees used as filters.

Regardless of the class of feature selection approaches, the influence of class noise on the performance of the feature selection is crucial for the concrete problem of ranking candidate learners, since the teachers often provide imperfect labels. The influence of class noise has been investigated in supervised learning settings [40, 41, 42, 43]. The works by Sontrop et al. [44] and by He et al. [39] consider how feature selection is affected by noise in the data, but not in the classes. Altidor et al. [45] investigated the impact of class noise on the ranking of single features, but not of subsets. In that work, Mutual Information and RELIEF-F proved to be more robust than all other considered algorithms, at least for the ranking of single features; for this reason we choose them as benchmarks.
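To illustrate how the Max-Relevance filter scores a candidate learner's feature subset against noisy teacher labels (a sketch under the assumption of discretized features; the function and variable names are ours):

```python
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(feature)
    p_f, p_l = Counter(feature), Counter(labels)
    joint = Counter(zip(feature, labels))
    return sum((c / n) * log2((c / n) / ((p_f[f] / n) * (p_l[y] / n)))
               for (f, y), c in joint.items())

def max_relevance(feature_subset, labels):
    """Max-Relevance: mean mutual information of the subset's features with the labels."""
    return sum(mutual_information(f, labels) for f in feature_subset) / len(feature_subset)

# Two candidate learners, each contributing two discretized features,
# scored against (possibly imperfect) teacher labels.
teacher_labels = [0, 0, 1, 1]
learner_a = [[0, 0, 1, 1], [0, 1, 0, 1]]  # one informative feature, one not
learner_b = [[0, 1, 0, 1], [1, 0, 1, 0]]  # two uninformative features
score_a = max_relevance(learner_a, teacher_labels)  # 0.5 bits
score_b = max_relevance(learner_b, teacher_labels)  # 0.0 bits
```

Ranking the learners by this score selects learner_a, even though the labels used to compute it may themselves be noisy; the chapters cited above study exactly how much such noise degrades the ranking.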


2.6 Feature- and signal-level mapping

A different approach to incorporate a new sensor system is to find a representation at the signal or feature level which makes the new system compatible with the previously deployed ones. This approach allows reusing previously collected training data to accommodate the new sensor.

An example of this approach is the calculation of orientations or trajectories in 3D space starting from sensor data, as a preprocessing phase before the final feature extraction and classification. This has been done in [46], where inertial units provide orientation angles of body limbs, which are then converted to 3D trajectories. Similarly, in [47] trajectories and postures have been calculated through a body model. Sequences are also derived in [48], where the sensing is done with cameras instead of body-worn sensors. These are first steps toward comparing outputs from different sets of sensing modalities, for example inertial measurement units and camera-based tracking systems. Trajectories can also offer robustness whenever there is variability in the speed at which gestures are performed, which instead influences the unprocessed acceleration signals. The aforementioned approaches work only for the class of modalities for which it is possible to calculate a trajectory (this is not the case for a simple accelerometer). Furthermore, these methods require an explicit and ad-hoc engineering of the transformations from the original domains to the new shared domain. We seek to automate this process. We propose to perform system identification directly at the signal level to discover mappings between signals, which had not been done before in the field of activity recognition.

Among the various methods for performing system identification, neural networks offer rich sets of transfer functions, such as linear or non-linear time-invariant functions (e.g. with multi-layer perceptrons) and time-variant functions (e.g. with time-delay neural networks or recurrent neural networks), but at the expense of large amounts of training data and a slow training process [49]. A simpler class of system models is that of linear multiple-input multiple-output (MIMO) auto-regressive moving average (ARMA) models [50], which are suitable for a large set of the transformations needed for the signals we encounter in the domain of activity recognition and offer convenient training via least squares.
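As a minimal sketch of this idea (a drastic simplification of MIMO ARMA identification: a single-input, single-output, two-tap FIR model fitted by ordinary least squares via the normal equations; all names are ours), a learner signal y can be predicted from a teacher signal x as y[t] ≈ b0*x[t] + b1*x[t-1]:

```python
# Illustrative least-squares identification of a 2-tap FIR mapping between a
# teacher signal x and a learner signal y. This is a toy stand-in for the
# MIMO ARMA identification discussed in the text, not the thesis code.

def fit_fir2(x, y):
    """Fit y[t] ~ b0*x[t] + b1*x[t-1] by solving the 2x2 normal equations."""
    rows = [(x[t], x[t - 1]) for t in range(1, len(x))]  # regressor rows
    targets = y[1:]
    # Normal equations: (R^T R) b = R^T y, written out for the 2x2 case
    s00 = sum(r[0] * r[0] for r in rows)
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    t0 = sum(r[0] * yt for r, yt in zip(rows, targets))
    t1 = sum(r[1] * yt for r, yt in zip(rows, targets))
    det = s00 * s11 - s01 * s01
    b0 = (t0 * s11 - t1 * s01) / det
    b1 = (s00 * t1 - s01 * t0) / det
    return b0, b1

def apply_fir2(b, x):
    """Translate a teacher signal into the learner's signal domain."""
    b0, b1 = b
    return [b0 * x[t] + b1 * x[t - 1] for t in range(1, len(x))]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 4.5, 7.0, 9.5, 12.0]   # generated as y[t] = 2*x[t] + 0.5*x[t-1]
b = fit_fir2(x, y)               # recovers (2.0, 0.5)
```

Once the coefficients are identified from a short stretch of simultaneously recorded teacher and learner data, the teacher's labeled training signals can be translated into the learner's signal domain and reused there.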


2.7 Datasets

In the computer vision community, numerous reference datasets can be found. Some of these datasets have been used for community challenges in various areas, e.g. face recognition [51], iris recognition [52] and object recognition [53, 54, 55]. In the activity recognition community, this practice is not yet as well established. Nevertheless, attempts have been made to provide datasets which could be used as a reference. Prominent examples are the PlaceLab [56], the TUM [57] and the CMU [58] datasets. The first, collected by the MIT Media Lab, consists mainly of data coming from sensors mounted in the environment (a smart home) or in objects. The setting is realistic, but due to the high costs of annotating the data, only a small portion has been labeled. The second, recorded by TU Muenchen and set in a smart kitchen, contains videos acquired by cameras, which allow building a 3D body model of the subjects and tracking their movements. Furthermore, magnetic switches and RFID tags have been used to mark events like the opening or closing of items and the use of objects. This dataset lacks body-worn sensors; it thus does not provide a playground to investigate approaches involving wearable sensors. The third dataset was also recorded in a smart kitchen, at Carnegie Mellon University, where 55 subjects prepared different recipes. This dataset includes motion capture, videos and measurements from four inertial measurement units, but the setup did not include sensors embedded into objects. Van Kasteren's dataset [59] features particularly long recordings but with few sensors. The Darmstadt Routine dataset, which is used for unsupervised activity pattern discovery [60], is a long recording of body activities collected with just one sensor (the Porcupine system [61]).


3 Reference dataset*

In this chapter, we first illustrate the requirements on the dataset to validate algorithms for training new sensors. We then describe the OPPORTUNITY dataset and the effort for its collection and curation. We finally explain how the dataset has been used to validate the approaches presented in this manuscript and in other research works.

*based on “Collection and curation of a large reference dataset for activity recognition” [62] and “Collecting complex activity datasets in highly rich networked sensor environments” [63]


3.1 Introduction

The dataset collection was designed to fulfill the needs of the OPPORTUNITY project1 (see Section 1.2). The first crucial aspect in planning the dataset collection was the definition of a scenario, which is tightly linked to the set of sensors to be included. In the OPPORTUNITY dataset, the key goal was to be able to simulate many different sensor setups and activity recognition problems using the same data as a starting point. More specifically, for the scope of this thesis, the dataset needed to allow simulating the appearance of several sensors, in order to validate transfer learning approaches as well as learner candidate selection. The key design choice to allow this was to record a multimodal dataset, i.e. a recording covering different physical modalities, e.g. acceleration, rate of turn, sound, video, etc. One or more sensors belonging to each modality were deployed and their signals were recorded and stored as synchronized data streams. The usage of this kind of data to simulate the appearance of a new sensor in an activity recognition system is shown in Fig. 3.1. The complete dataset is available at contextdb.org. A subset containing four subjects is also available from the UCI machine learning repository2.
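The simulation idea described above can be reduced to masking a recorded track outside the interval in which the simulated sensor is "present", while the teacher track and the labels stay visible throughout. The function and toy signals below are an illustrative sketch, not the actual tooling used for the dataset.

```python
# Sketch: simulate a sensor that "appears" (and optionally "disappears")
# mid-recording by hiding its samples outside a visibility window.
# All names and numbers are illustrative.

def simulate_appearance(track, appear_at, vanish_at=None):
    """Replace samples outside [appear_at, vanish_at) with None."""
    end = len(track) if vanish_at is None else vanish_at
    return [v if appear_at <= i < end else None
            for i, v in enumerate(track)]

teacher = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # e.g. magnetic switch, always present
learner = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]  # e.g. accelerometer appearing later
visible = simulate_appearance(learner, appear_at=2, vanish_at=5)
print(visible)  # [None, None, 1.2, 1.3, 1.4, None]
```

An algorithm under test then only consumes the non-None portion of the learner track, exactly as if the sensor had joined the system at that moment.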

3.2 Dataset description

We chose as a scenario to have users in an instrumented kitchen. The kitchen and the users were equipped with 72 sensors in total, belonging to 10 different modalities. The choice of a kitchen, and therefore of kitchen activities, as the basis for our experiments was advantageous from many points of view. First of all, the scenario offered different levels of complexity in the performed activities, ranging from simple postures and modes of locomotion to complex activities like preparing and eating a sandwich, which involved a set of gestures. At the same time, choosing a kitchen scenario allowed restricting the set of activities and locations, so that the dataset did not reach a complexity which would be difficult to handle. Furthermore, the recordings were organized keeping in mind the tradeoff between having a smaller amount of realistic instances of certain activities and

1 http://www.opportunity-project.eu
2 http://archive.ics.uci.edu/ml/datasets/OPPORTUNITY+Activity+Recognition


Figure 3.1: Examples of using a multimodal dataset for simulating the appearance and training of various sensor systems: 1. A body-worn accelerometer is trained using labels coming from a magnetic switch (Chapter 5); 2. Three body-worn accelerometers are trained by a fourth one (Chapter 4); 3. Body-worn accelerometers are ranked according to the accuracy which they are expected to achieve (Chapter 6). Having different recorded tracks allows simulating various combinations of teachers and learners. The algorithms use only the data shaded in dark gray, thereby simulating that a sensor system is available only for a certain time.


Figure 3.2: View of the kitchen from the top. Dashed line: typical user trajectory in the drill run. In the free runs, subjects moved with more variability.

a higher number of more controlled repetitions, which is needed in order to generate enough training data for classifiers. We solved the tradeoff by recording five “free” runs, along with one so-called “drill run” for each subject. The latter kind of recording involved repeating a subset of the gestures 20 times in a sequence, generating many instances of the activities performed in a more stable way.

3.2.1 Acquisition Sensor Network

The kitchen contained a fridge, a dishwasher, three drawers, a table, a chair and a lazychair, and had two doors. Each of these items was instrumented with one or more sensors (see Fig. 3.2 for a picture with the placement).

There were also 12 objects (cup, glass, knives, etc.) equipped with accelerometers and gyroscopes to detect their usage. The test subjects were furthermore instrumented with 7 inertial measurement units (5 wired, 2 wireless), 12 wireless accelerometers, two microphones, an ECG measurement system, four indoor positioning tags (Ubisense) and a custom magnetic sensor measuring the position of the right hand relative to the shoulder [64]. Table 3.1 lists and briefly describes all the sensors used. Table 3.2 contains technical data for the sensors.

Figure 3.3: Location of the body-worn sensors on the subject.

Figure 3.4: Subset of the objects instrumented with accelerometer and gyroscope. On the right we can see the bread slicer and an XSens inertial measurement unit (IMU) measuring table vibration. Pressure sensors were installed under some of the tableware.

ID  Sensor system (#): Placement and purpose
B1  Commercial wireless microphones (2): Chest and dominant wrist. Sense user activity.
B2  Custom wireless Bluetooth acceleration sensors [65] (12): Body-worn. Sense limb movement.
B3  XSens IMUs in a custom motion jacket [46] (5): Body-worn. Sense limb and body movement.
B4  Custom magnetic relative positioning sensor [64] (1): Emitter on shoulder, receiver on dominant wrist. Senses relative position of hand to shoulder.
B5  Commercial Sun-SPOT acceleration sensors (2): Feet, right below the outer ankle. Sense modes of locomotion.
B6  Commercial InertiaCube3 systems with IMU (2): Feet, above the shoe. Sense modes of locomotion.
O1  Custom wireless Bluetooth accelerometers+gyroscopes (12): On objects. Sense object usage.
A1  Microphones within array (4): Close to one room corner. Sense ambient sound and direction of sound sources.
A2  Commercial localization system (Ubisense) (4): Corners of the room. Sense user location through four body-mounted tags.
A3  Network cameras (3): Cover the whole room. For localization, documentation and visual annotation.
A4  XSens IMUs [46]: On the table and chair. Sense vibration and use.
A5  USB acceleration sensors [66] (8): On doors, drawers, shelves and deckchair. Sense usage.
A6  Magnetic switches (13): On doors, drawers, shelves. Sense usage, help in annotation.
A7  Custom power sensors (2): Coffee machine and bread cutter. Sense usage.
A8  Custom force-resistive sensors (3): On the table below plate and cups/glasses. Sense usage of objects.

Table 3.1: Sensor systems deployed in the experimental setup.

The sensors were interconnected with several computers (see Table 3.3). In Fig. 3.5 (1), some of the body-worn sensors are depicted along with their interconnections. The set of twelve Bluetooth sensors was paired with three dongles to reduce the load on each Bluetooth piconet, and the dongles were placed in favorable locations to obtain the best possible communication. For the body-worn sensors, the dongles were attached outside the backpack containing the laptop. The five inertial measurement units were part of an XSens4 kit and were connected through cables to the same laptop. A second computer (see Fig. 3.5 (2)) recorded the signals from the sensors installed in the furniture (wired magnetic switches and accelerometers) and on the other twelve kitchen objects (cup, glass, etc.).

4 www.xsens.com


The recordings were documented by three wide-angle cameras. The videos were then used to label the dataset.

3.2.2 Statistics on collected data

We recorded six runs from 12 subjects (10 of which were fully annotated), giving rise to 8216 instances of postures/locomotion and 31336 gesture instances performed with the right and left hand, expressed as action-object pairs (e.g. move bread, open door). The whole data corpus, including videos, amounts to 130 GB.

ID  Sampling freq. [Hz]  Resolution  Other data
B1  44100  16 bit
B2  32  10 bit  Range: ±4 g
B3  30  0.05 deg  Acc range: ±5 g, gyro range: ±1200 deg/s
B4  87  -  Maximum 80 cm shoulder to hand
B5  Variable 10-35  10 bit  Acceleration range: ±6 g
B6  40
O1  32  10 bit  Accelerometer range: ±4 g, gyro range: ±2000 deg/s
A1  44100  16 bit
A2  Variable, max 10  20 cm
A3  10  VGA
A4  30  0.05 deg  Acc range: ±5 g, gyro range: ±1200 deg/s
A5  98  10 bit  Range: ±4 g
A6  100
A7  48
A8  48

Table 3.2: Technical characteristics of the sensor systems deployed in the experimental setup.

ID  Records sensor systems  Nature and location  Data acquisition
R1  B2, B3, B4  Laptop, on body in a backpack  CRN Toolbox [67]
R2  A2, A4, A7  Desktop PC  CRN Toolbox
R3  B1, A1  Laptop (static)  Audio acquisition software
R4  B5, B6  Laptop (carried by experimenter, following the subject)  Commercial proprietary software
R5  A3  Laptop (static)  Axis proprietary software
R6  A5  Laptop (static)  Dedicated software
R7  O1, A6  Laptop (static)  CRN Toolbox

Table 3.3: Data acquisition infrastructure and software.

              Stand       Walk      Sit          Lie
Count         3874        3824      430          88
Duration (s)  7.9 (11.3)  4.1 (9.1)  23.0 (39.8)  28.4 (13.8)

Table 3.4: Distribution of the 8216 instances of postures/locomotion and their durations, expressed as mean (std).

We annotated the data along four label tracks: Tk1 = high-level activity, Tk2 = posture/mode of locomotion, Tk3 = left hand gesture and Tk4 = right hand gesture. More tracks can be added in the future by interested parties. Most of the labels were defined before the experiment. They were selected to cover most of the gestures and activities that a person usually performs in a small flat with a kitchen. Further labels were added a posteriori by looking at what the subjects actually did. The label spaces for the four tracks are the following:

• YTk1 = relaxing, early morning, coffee time, sandwich time, cleanup

• YTk2 = stand, walk, sit, lie

• YTk3 = YTk4 = reach, release, open, close, lock, unlock, move, clean, stir, sip, bite, cut, spread × fridge, dishwasher, drawer 1, drawer 2, drawer 3, door 1, door 2, switch, table, cup, chair, glass, spoon, sugar, knife salami, knife cheese, salami, bottle, plate, cheese, bread, milk, deckchair

The distributions of the postures/locomotion and gestures among thedifferent classes are reported in Tables 3.4, 3.5 and 3.6.

3.3 Challenges and lessons learned

3.3.1 System setup

The setup for the collection of the OPPORTUNITY dataset was performed by 10-15 persons. It took approximately three days just to put together the different sensor systems until the results were satisfactory. A few days had already been spent before on the installation of the cameras and on the calibration of the Ubisense positioning system.

An important challenge was to develop a protocol to set up all the body-worn systems quickly before each test subject could start the experiment. The goal was to install the sensing infrastructure on each user as quickly and as accurately as possible. Having a garment with preinstalled sensors (Fig. 3.6) helped both in terms of time saving and repeatability of the sensor placement across multiple recordings.


Table 3.5: Distribution of the 10875 gesture instances performed with the left hand in terms of action-object pairs (actions: reach, release, open, close, lock, unlock, move, clean, stir, sip, bite, cut, spread; objects: fridge, dishwasher, drawers 1-3, doors 1-2, switch, table, cup, chair, glass, spoon, sugar, knife salami, knife cheese, salami, bottle, plate, cheese, bread, milk, deckchair).


Table 3.6: Distribution of the 20461 gesture instances performed with the right hand in terms of action-object pairs (same actions and objects as in Table 3.5).


During the setup phase, one also has to make sure that no sensors are forgotten when powering up the whole infrastructure. The sensors should be mounted in such a way that the power switch is easy to reach, or they should adopt a magnetic switch so that they can be turned on without being removed from their designated position. Care should be taken if sensors use a mechanical switch, since friction with the subject’s garments can easily cause an unwanted power-off.

3.3.2 Synchronous acquisition

Once the systems are up and running, the goal is to acquire multiple tracks of synchronized data. One way to enforce synchronicity of all the recorded signals would be to first perform synchronization at the level of the individual sensor nodes. This can be done for example using the simple algorithm described in [68], which involves exchanging packets over a wireless connection. Afterwards, the recording can be triggered. This approach becomes challenging as soon as the system becomes heterogeneous, since not all nodes share the same communication layer (e.g. Bluetooth vs. ZigBee, or even wired vs. wireless).

In our setup, each computer ran the CRN Toolbox [67], recording all the signal tracks in separate files. This software attaches timestamps upon reception of the data samples coming from the different sources, so that the signals recorded on a single computer are roughly synchronized. Different computers were synchronized through the Network Time Protocol (NTP), which nevertheless does not allow following the short-term drifts in the computer clocks. Furthermore, samples sometimes arrive in bursts, which generates groups of samples having similar timestamps. This happens particularly when data are sent over a radio channel. Thus, the timestamps assigned at the receiver side cannot simply be used as they are; rather, they have to be recalculated to represent the real sampling times, since the timestamps at the source and those at the receiver side should ideally just be delayed with respect to each other. In the OPPORTUNITY dataset, we carried out this recalculation by performing a linear regression, in order to form a uniform sampling comb.
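The recalculation step can be sketched as an ordinary least-squares line fit of receiver timestamps against sample index: the fitted line directly yields the uniform sampling comb (its slope is the estimated sampling period). Pure-Python sketch with illustrative numbers for a nominal 32 Hz stream; `uniform_comb` is an assumed name, not the thesis tooling.

```python
# Fit t = a*n + b through (sample index, receiver timestamp) pairs and
# replace the bursty timestamps with the fitted uniform comb.

def uniform_comb(timestamps):
    m = len(timestamps)
    xs = range(m)
    sx = sum(xs)
    sy = sum(timestamps)
    sxx = sum(n * n for n in xs)
    sxy = sum(n * t for n, t in zip(xs, timestamps))
    a = (m * sxy - sx * sy) / (m * sxx - sx * sx)  # slope = sampling period
    b = (sy - a * sx) / m                          # intercept = start time
    return [a * n + b for n in xs]

# Bursty arrival times for a nominal 32 Hz stream (period 31.25 ms);
# note the out-of-order pair caused by a burst.
raw = [0.0, 0.031, 0.066, 0.064, 0.125, 0.157, 0.186, 0.220]
print([round(t, 4) for t in uniform_comb(raw)])
```

The output is strictly uniformly spaced, which is what downstream resampling and alignment steps require.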

In order to help synchronize multiple systems, an appropriate gesture should be introduced that covers the largest possible number of sensor modalities. We asked the users to clap their hands five times and to beat five times on the ground with the right foot before and after the recordings. This led to distinctive signals collected by the body-worn sensors. At the same time, clapping can also be clearly seen in the video footage and heard in the audio tracks. The other ambient sensors were synchronized using the videos. For example, the opening of a drawer could be seen both in the videos and in the signal from the corresponding magnetic switch.
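One simple way to exploit such a synchronization gesture is to shift one stream against another and keep the lag that maximizes their cross-correlation around the shared event. The toy signals below stand in for clap spikes seen by two sensors; the function name and numbers are illustrative assumptions.

```python
# Align two streams by brute-force cross-correlation over a lag window.

def best_lag(a, b, max_lag):
    """Number of samples by which b lags a (positive = b is later)."""
    def corr(k):
        pairs = [(a[i], b[i + k]) for i in range(len(a))
                 if 0 <= i + k < len(b)]
        return sum(x * y for x, y in pairs)
    return max(range(-max_lag, max_lag + 1), key=corr)

clap = [0, 0, 5, 0, 5, 0, 5, 0, 0, 0]      # clap spikes seen by sensor A
delayed = [0, 0, 0, 0, 5, 0, 5, 0, 5, 0]   # same event, two samples later on B
print(best_lag(clap, delayed, max_lag=4))  # -> 2
```

In practice the correlation would be computed on a short window around the claps, after the per-stream timestamp recalculation, and the resulting lag applied as a constant offset.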

3.3.3 Data loss

In the presence of several wireless connections, interference and channel fading cause disconnections and loss of data packets. Since the timestamps at the receiver side are not particularly accurate, detecting short disconnections by just analyzing the timestamps can be difficult, if not impossible. Samples therefore have to be tagged with a sequence number (a counter), so that lost samples can be easily detected. The counter should be encoded with enough bits that even a long disconnection does not let it roll over multiple times. For example, if the sample counter is encoded with 8 bits and the sampling frequency is 64 Hz, the counter rolls over every 4 seconds, making it impossible to distinguish between disconnection lengths which differ by multiples of 4 seconds.
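A hedged sketch of such counter-based loss detection for an 8-bit sequence number; function and values are illustrative:

```python
# Count samples lost between two received sequence counters.
# With an 8-bit counter the answer is only known modulo 256 samples,
# which is exactly the 4 s ambiguity mentioned above for a 64 Hz stream.

ROLLOVER = 256  # 8-bit counter

def lost_between(prev, curr):
    """Samples lost between two consecutively received counters."""
    return (curr - prev - 1) % ROLLOVER

received = [250, 251, 252, 3, 4]  # counter rolled over 255 -> 0 during a gap
print(lost_between(252, 3))       # 6 samples lost across the rollover
```

A wider counter (e.g. 16 or 32 bits) pushes the ambiguity out to minutes or days, which is why the text recommends encoding it with enough bits.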

A way to reduce data loss is indeed to use the radio channels as little as possible. To this end, the sampling frequency used on the sensor nodes should be as close as possible to the minimum required (twice the maximum bandwidth of the recorded signals). In our recordings, given that the subjects were performing quite slow actions and gestures, we decided to limit the sampling rate to 32 Hz. With this setting, we obtained an average data loss of 4% among the 24 Bluetooth sensors. An alternative, in case custom sensor nodes are developed, is to use the radio link only for sporadic synchronization and to store the recorded data in an on-board flash memory. In this case, the radio channel can be used for quality control purposes, sending just beacons regarding the status of the system.

Despite the apparent reliability with which one can collect data with a wired system, failures can occur even where sensors are attached with cables. In a realistic and complex setup, where the subjects are allowed to perform gestures and activities as they wish, the mechanical stability of cables also deserves attention and periodic checks throughout the recordings.

The quality and amount of the data should be checked right after each recording. In the OPPORTUNITY recordings, after each session we checked that all the expected files were present. We also used scripts to load and visualize the sensor data at 10 times the natural speed through animated plots. With this procedure, we could spot abnormal data right after each recording. At the time of the recording, no tools were available for monitoring the data from the heterogeneous sensor setup while it was being collected. Such a tool was later developed as an outcome of the OPPORTUNITY project itself, as one of the lessons learned [69]. The open-source tool5 (called MASS, Monitoring Application for Sensor Systems) can be used by a person in charge of monitoring the experiment.

3.3.4 Labeling

After the data collection has taken place, time has to be dedicated to cleaning, aligning and labeling the data. In our case, the labeling was carried out by persons analyzing both the video footage and the sensor signals. Video and sensor data were analyzed with a custom tool [69]. With this labeling tool, users could navigate through the dataset in 100 ms steps.

Labeling each session (15-25 minutes long) took between 7 and 14 hours for the four label tracks Tk1-Tk4. Most of the time was needed for the two gesture tracks Tk3 and Tk4. To accelerate the labeling process, we outsourced the labeling effort to 15 undergraduate students from two universities. Outsourcing the labeling is a practice common also in other data collections, like that of the PlaceLab [56].

We offered the students an hourly salary to motivate them to devote a reasonable effort to labeling the data, which can become quite boring. We also made clear how important the labeling process was to us. The critical aspect when involving third parties is to obtain a labeling which is as uniform as possible. We achieved this by organizing an introductory meeting with the students and explaining how every special situation in the videos should be labeled. The students were also invited to mark down all the spots in the dataset where they were not sure about the correct label.

Despite the best efforts to explain how to carry out the labeling, errors were still made. For example, in our kitchen scenario, it happened that actions were properly labeled but with the wrong object (e.g. “open glass” instead of “open fridge”). It also happened that an instance of an activity was much longer than average, due to a mistake in the end time of the label. We went over the provided labels in order to check consistency and correct errors. To spot mistakes without going through the whole dataset again, we developed a tool showing a graphical overview of the labels, including histograms of the lengths of the corresponding data segments. A screenshot of our “dataset exploration” tool is shown in Fig. 3.7.

5 https://redmine.wearcom.org/projects/mass
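Such a histogram-based check can be reduced to flagging instances whose duration exceeds the class mean by a chosen number of standard deviations. The function name, the two-sigma threshold and the durations below are illustrative assumptions.

```python
# Flag label instances whose duration deviates strongly from the class
# mean (here: more than mean + 2 standard deviations, an illustrative
# threshold), mimicking the duration-histogram consistency check.

from statistics import mean, stdev

def flag_outliers(durations, k=2.0):
    m, s = mean(durations), stdev(durations)
    return [i for i, d in enumerate(durations) if d > m + k * s]

# Durations (s) of hypothetical "open fridge" instances;
# one label was closed far too late.
opens = [1.9, 2.1, 2.0, 2.2, 1.8, 2.1, 2.0, 19.5]
print(flag_outliers(opens))  # [7]
```

Flagged instances would then be re-inspected against the video footage rather than corrected automatically.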

A possible approach for further improving the quality of the ground-truth labels would be to outsource the effort to a crowd of workers, for example with Amazon Mechanical Turk6. It is debated whether the labels gathered with such a crowdsourcing platform achieve a better quality than those of one expert. Algorithms have been developed to combine the labels provided by different subjects into an estimate of the ground-truth labels [70].
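A minimal stand-in for such algorithms is per-instance majority voting over the annotators' labels; the algorithms cited above go further by also modeling each annotator's reliability. The names and example labels below are illustrative.

```python
# Majority-vote fusion of labels from several annotators.

from collections import Counter

def majority_label(votes):
    """Most frequent label among annotators (ties: first encountered)."""
    return Counter(votes).most_common(1)[0][0]

frame_votes = [
    ["open fridge", "open fridge", "open glass"],  # one annotator mislabeled
    ["reach cup", "reach cup", "reach cup"],
]
print([majority_label(v) for v in frame_votes])  # ['open fridge', 'reach cup']
```

With odd numbers of annotators, a single mislabeling as in the example above is always outvoted; reliability-weighted schemes additionally downweight annotators who disagree with the consensus systematically.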

It should be noted that it is impossible to achieve, or even to define, a “perfect” ground-truth labeling for gestures and activities. The first reason is that the persons labeling will perceive actions in a subjective way. For some, a “drinking” gesture starts when the subject grabs a glass of water; for others, only when she is actually taking a sip. Furthermore, given a certain action, different people will label it with slightly different start and end times.

If manipulative activities involve the use of objects, it often happens that multiple activities overlap. In our case this happened for example when subjects were carrying multiple objects in one hand while opening a drawer with the other. In these situations, it is difficult to understand, even from the video footage, which action is being performed with which object at a certain point in time. One crucial point that needs to be addressed is therefore the quality of the video material collected as documentation for the experiment. In our recordings, we used three VGA cameras which covered most of the room, but the resolution and illumination were often insufficient to label certain actions a posteriori.

The overall effort to bring the dataset to a level at which it could be used and distributed is difficult to quantify exactly. Two persons were involved part-time over roughly six months. Even after this period, further minor issues were found and corrected.

6 https://www.mturk.com/


3.4 Limitations of the dataset

The dataset has some limitations. Despite our best efforts to capture movements and activities which were as realistic as possible, the abundant number of sensors was in some circumstances obtrusive for the users. For example, some users had problems reaching for a cup on a shelf because the motion jacket and the sensors strapped to their arms reduced the mobility of their limbs. The movements were in many cases slower than the natural ones a subject would perform in real life.

Most of the subjects were recruited from within the university, and some were knowledgeable in the field of activity recognition. Furthermore, only one subject was above 30 years old. This has to be taken into account when drawing conclusions from the recognition of activities with these data.

The dataset was recorded in a kitchen and includes only activities related to this scenario. This limitation is partly due to the choice of instrumenting the environment with several sensor systems. Similar data collections in other relevant daily-life scenarios would be beneficial; these could include sports, commuting to work and other social activities. It is nevertheless difficult to achieve a comparable level of multimodality out in the wild.

3.5 Recommendations

Designing and recording a rich and realistic dataset for activity recognition is a complex task. The recording should be carefully planned in every detail. In particular, we would like to underline how a careful definition of the scenario and sensor setup can be beneficial for wide usability of the data, by different groups and for diverse needs. We list here some practical recommendations for an in-the-lab data collection, to avoid mistakes which can later compromise the usability of large portions of the data.

• Select a proper room which is available for enough weeks before the planned experiment. Use the time to properly install and test all the infrastructure needed.

• Select carefully the set of sensors used and, where possible, adopt the same sampling frequency for all of them: this will make the synchronization easier. The sampling frequency should be as low as the application requirements allow.

Page 57: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

40 Chapter 3: Reference dataset

• Select a consistent naming scheme for the sensors, so that each of them can be uniquely identified. The same names should appear on stickers attached to the sensors, in the dataset documentation, and in the data files that are collected.

• Prepare a set of scripts/software tools that can be run at the end of each recording fragment to check the quality of the recorded data. Monitor the data also during the recordings, to immediately identify sensor failures.

• If custom sensor nodes are developed for the recordings, they should be equipped with a standard radio link (e.g. ZigBee) and local storage (flash). The local storage minimizes the risk of data loss, while the wireless link can be used to synchronize the systems periodically and to monitor their status while recording.

• Plan recharging sessions between recordings, or provide extra sensors that can easily be swapped in without losing time.

• Prepare a precise protocol containing the sequence of operations to be carried out for each recording, to avoid changing even the slightest condition between successive recordings. This should include all the operations, from the mounting of the sensors to the scripts to run on the recording computers. In the OPPORTUNITY recording, for example, each experimenter was in charge of always placing the same group of sensors at the same body locations and with the same orientations. This ensured a constant setup over the recordings.

• Start and end each recording with one or more synchronization gestures spanning the highest possible number of channels. If necessary, include additional synchronization actions, so that they can be easily retrieved in the largest possible number of sensor streams.

• Record the data with a suitable toolbox (like the CRN Toolbox7) and distribute the load over different computers only when this is necessary to reduce the computational burden on a single machine, since using more computers introduces an offset between the assigned timestamps that then needs to be compensated for.

7 http://crnt.sf.net/


• Do not take wired connections for granted. Cables mounted in the clothes can be subject to disconnections which are usually difficult to detect, because they are entirely unexpected.

• Plan carefully a labeling protocol so that the labeling is as consistent as possible, especially if many people are involved in the process. For example, when a subject opened a door, we agreed on the following label sequence: reach door; unlock door; open door; release door. When closing a door, the sequence was: reach door; close door; lock door; release door.

3.6 Dissemination within the scientific community

The OPPORTUNITY dataset was used first by partners within the consortium. It was used for recognizing different activities, for example locomotion [71], gestures [72] and high-level activities [73]. Different kinds of approaches were tested, for instance: unsupervised adaptation to add robustness in case of sensor displacement and rotations [74]; anomaly detection and resilience [75]; transfer learning [71, 76].

To establish the dataset as a reference for the community, we launched a challenge at the Systems, Man and Cybernetics conference (SMC) in 2011, which led to seven contributions from six groups [77, 78]. We believe that this form of dissemination is beneficial since it encourages the community to compare algorithms on common grounds, pushing the boundaries of research and stimulating competition among research groups.

After the SMC challenge was over, we published the whole dataset8. The page hosting a subset of the data9 at the UCI machine learning repository has been viewed more than ten thousand times to date. To our knowledge, the dataset was used for validation in 10 publications by groups external to the consortium, excluding the participants in the challenge. The dataset was cited in 20 publications from external groups and 52 from groups involved in the OPPORTUNITY project.

3.7 Conclusion

We collected and annotated a multimodal data corpus, amounting to 8216 instances of postures/locomotion and 31336 gesture instances

8 http://www.contextdb.org
9 http://archive.ics.uci.edu/ml/datasets/OPPORTUNITY+Activity+Recognition


performed in an instrumented kitchen by 12 subjects simulating daily routines. This dataset was the basis for various research groups to test activity recognition algorithms, both within the OPPORTUNITY consortium and outside. The dataset was exploited to launch a challenge to compare different methods on common grounds, thereby providing a fair comparison. A new challenge with other tasks (like streaming activity recognition or robustness against label jitter) could be launched to further push the scientific community towards using common grounds to compare new algorithms. The dataset can also serve as a basis to test crowdsourcing and gamification schemes that provide incentives to masses of persons for collecting annotations, where one goal is to build online repositories of annotated activity data coming from multimodal sensor streams. In this specific case, crowdsourcing could also improve the quality of the annotations.


[Figure: components shown include a USB accelerometer, Bluetooth dongles, a USB acquisition board, a magnetic switch, Bluetooth accelerometers (one with a gyroscope), a USB XSens controller and XSens IMUs.]

Figure 3.5: Interconnections of a subset of the body-worn (1) and ambient (2) sensors. Bluetooth wireless links are represented by dotted lines. Solid lines indicate a wired connection. The Bluetooth connections were distributed evenly across different dongles to obtain stable connections.


Figure 3.6: A garment including placeholders for the sensors and cables drastically reduces setup time.


Figure 3.7: Screenshot of our dataset exploration tool. Along with an appropriate tool for labeling, this constitutes the basis for annotating the dataset and spotting labeling mistakes. This tool is linked to the labeling tool, so that labeling mistakes can be corrected immediately when recognized.


4 Classifier-level transfer learning*

In this chapter, we introduce classifier-level transfer learning to use an existing sensor to train newly deployed ones. We validate the approach in a locomotion recognition task and we investigate the robustness against class noise in the teacher domains. We extend the approach by including a co-training step.

*Based on “Automatic transfer of activity recognition capabilities between body-worn motion sensors: Training newcomers to recognize locomotion” [71]


4.1 Introduction

In this chapter, we introduce a transfer learning approach at the level of the classifier to train new sensors by means of existing ones. As practical application examples we consider:

• Accelerometers acting as teachers for other accelerometers to learn a recognition task involving locomotion and posture;

• Classifiers using sound acquired by smartphones training accelerometers in the smartphones to recognize commuting activities (e.g. means of transportation).

Formally, given a teacher domain D^T = (X^T, P(X^T)) associated with a task T^T defined by (Y^T, f^T()), and a learner domain D^L = (X^L, P(X^L)) associated with a task T^L defined by (Y^L, f^L()), we tackle the problem of finding a suitable function f^L(), given the following assumptions:

1. Y^T = Y^L = Y, i.e., the label space associated with the learner domain is the same as that of the teacher domain;

2. The sets of instances seen in the two domains, X^T = {x^T_1, . . . , x^T_N} and X^L = {x^L_1, . . . , x^L_N}, are acquired simultaneously; therefore each pair (x^T_i, x^L_i) constitutes a pair of views of the same event.

The practical translations of the aforementioned assumptions are:

1. The learner needs to learn the same activities as the ones recognized by the teacher;

2. Teacher and learner sensor systems acquire data at the same timeduring the transfer phase;

4.2 Method

We propose to tackle the problem outlined in Section 4.1 by using the teacher’s objective predictive function f^T() to label instances of the same events viewed within the learner domain. The algorithm can be formalized as follows:

1. For each instance x^T_i acquired by the teacher, a label is generated by y_i = f^T(x^T_i);

Page 66: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

4.3. Performance in the activity recognition scenario 49

2. In the learner domain, a labeled instance is formed by (x^L_i, y_i), ∀i ∈ [1, N];

3. After having collected N labeled instances, f^L() is learned from the training set (x^L_1, y_1), . . . , (x^L_N, y_N).
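The three steps above can be sketched in a few lines. The following is an illustrative implementation (not the thesis code), using a minimal nearest-centroid classifier and synthetic teacher and learner views of the same sequence of events:

```python
import numpy as np

class NCC:
    """Minimal nearest-centroid classifier: one centroid per class."""
    def fit(self, X, y):
        y = np.asarray(y)
        self.labels = np.unique(y)
        self.centroids = np.stack([X[y == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        # Squared distance of every instance to every class centroid.
        d = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=-1)
        return self.labels[d.argmin(axis=1)]

def classifier_level_transfer(teacher, X_teacher, X_learner, learner):
    # Step 1: the teacher labels every instance it observes: y_i = f^T(x^T_i).
    y = teacher.predict(X_teacher)
    # Steps 2-3: pair each label with the learner's simultaneous view
    # (x^L_i, y_i) and batch-train f^L() on the resulting labeled set.
    return learner.fit(X_learner, y)

# Synthetic example: the two domains see the same 120 events differently.
rng = np.random.default_rng(0)
y_true = np.array([0] * 60 + [1] * 60)
X_t = rng.normal(size=(120, 2)); X_t[60:] += 6.0   # teacher's views
X_l = rng.normal(size=(120, 2)); X_l[60:] -= 8.0   # learner's views
teacher = NCC().fit(X_t, y_true)                   # trained from ground truth
learner = classifier_level_transfer(teacher, X_t, X_l, NCC())
```

Note that the learner never sees a ground-truth annotation: its training labels are entirely the teacher's predictions, so the two feature spaces are free to differ.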

Put into a practical perspective, the algorithm can be implemented with two interconnected sensor nodes (Fig. 4.1). In order to fulfill the second assumption in Section 4.1, the time references of the teacher and learner sensor nodes need to be synchronized (Fig. 4.1, part 1). The time synchronization ensures that the data and extracted features of the two sensor nodes form two views of the same event. The synchronization takes place in the first phase of coexistence of the two nodes. If the sensor nodes are interconnected wirelessly, a standard wireless synchronization algorithm can be used [79]. During the transfer learning phase (Fig. 4.1, part 2) the teacher node sends a label y_i for each classified instance x^T_i. The learner node stores the labels and the corresponding measured instances x^L_i. Finally, the learner performs a batch training using the labeled instances (x^L_1, y_1), . . . , (x^L_N, y_N) (Fig. 4.1, part 3).

4.3 Performance in the activity recognition scenario

4.3.1 Dataset

The classifier-level transfer learning was tested on a subset of the OPPORTUNITY dataset presented in Chapter 3. As a task, we chose the recognition of locomotion/posture, for which the label space is Y = {stand, walk, sit, lie}. We selected the data streams collected by eight accelerometers (see Fig. 4.2) and from four users, each performing five free runs. It has to be noted that the recorded instances are on average just a few seconds long (e.g. 4.1 s on average for the class walk). The subjects often interleaved very short walking instances, sometimes just one step, with standing. This continuous interleaving made it difficult to label segments containing only one or the other class. The recognition problem is therefore more challenging than what is reported in other works (e.g. [80]). Examples of the signals are shown in Fig. 4.3.


Figure 4.1: Description of the practical operation of the classifier-level transfer with sensor nodes. In the example, the teacher node is mounted on the arm and the learner on the shoe. ML is the machine learning block in charge of building the objective predictive functions from the labeled training sets. First, time references are synchronized (1). Then, the teacher classifies its data into labels, as activities take place (2). The labels are used by the learner to build a labeled training set. The training set is then used to create the objective predictive function on the learner side (3).

4.3.2 Simulation procedure

Each of the eight body-worn accelerometers is in turn trained from ground-truth labels using two of the five recorded runs. This provides eight possible teachers. Two other runs are used for the transfer learning phase. Within these runs, for each teacher, seven learners get trained via classifier-level transfer, that is, their functions f^L() are learned. The fifth run serves the purpose of evaluating how well the learned f^L() can predict the classes on new data in the learner do-


[Figure: accelerometer placements, with abbreviations: Right Upper Arm (RUA), Left Upper Arm (LUA), Right Lower Arm (RLA), Left Lower Arm (LLA), Back, Hip, Knee, Shoe.]

Figure 4.2: Set of wearable accelerometers used for the simulations.

Figure 4.3: Example of the accelerometer signals. Each plot shows the signals from the three accelerometer axes.


mains. The whole process is cross-validated (five folds) and the mean and standard deviation of the accuracies are reported.

The instances are obtained as follows:

1. The continuous signal is sliced into non-overlapping windows of length 1 s. The window width is a tradeoff between information content and latency. In fact, 1 s is enough to contain a representative part of the signal for the examined postures and modes of locomotion, while it corresponds to a classification delay which is acceptable for most applications. This yields a total of 4357 signal slices for the five runs together. Each run contains on average 871 instances.

2. For each slice, an instance is formed by extracting M = 4 statistical features. These are the mean values of the x, y and z accelerometer axes and the standard deviation of the acceleration magnitude. The choice of these features was not optimized for any particular sensor placement and the same features were used for all placements.
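As an illustration, the windowing and feature extraction of steps 1-2 can be sketched as follows (the sampling rate fs is a free parameter here, not a value taken from the dataset):

```python
import numpy as np

def extract_instances(acc, fs):
    """Slice a (n_samples, 3) accelerometer stream into non-overlapping
    1 s windows and extract M = 4 features per window: the means of the
    x, y and z axes plus the standard deviation of the magnitude."""
    n_windows = acc.shape[0] // fs
    instances = []
    for w in range(n_windows):
        win = acc[w * fs:(w + 1) * fs]
        magnitude = np.linalg.norm(win, axis=1)
        instances.append([win[:, 0].mean(), win[:, 1].mean(),
                          win[:, 2].mean(), magnitude.std()])
    return np.array(instances)
```

Each returned row is one instance x_i in the 4-dimensional feature space used throughout this chapter.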

The transfer learning approach is performed for all possible teacher-learner combinations and with the following classifiers:

1. Nearest-Centroid Classifier (NCC);

2. k-Nearest Neighbors (kNN), with k = 11;

3. Support Vector Machine (SVM).

The classifier parameters were chosen empirically. The classifications were performed with an ad-hoc toolbox; the SVM classifications relied on LIBSVM [81].

4.3.3 Baselines

Besides the classifier-level transfer learning, the two following baselines are considered:

• Upper baseline: The best performance on the learner side is achieved by learning the objective predictive function f^L() through a training set (x^L_1, y^G_1), . . . , (x^L_N, y^G_N) obtained by using the ground-truth annotations as labels. The upper baseline also represents how well the systems perform as teachers, since teacher objective prediction functions are learnt from ground-truth labels.


        Knee   Shoe   Back   RUA    RLA    LUA    LLA    Hip
NCC     92(0)  76(9)  72(6)  78(2)  73(1)  83(3)  75(2)  73(4)
11-NN   93(1)  83(3)  78(5)  83(1)  79(1)  86(2)  83(1)  73(4)
SVM     93(0)  77(7)  73(5)  81(2)  76(1)  86(2)  82(1)  73(4)

Table 4.1: Teacher accuracy, obtained by training with ground-truth labels, for NCC, 11-NN and SVM classifiers (standard deviation across folds in brackets). The sensor positions are those reported in Figure 4.2. For comparison, random guessing would yield 25% accuracy.

Figure 4.4: Confusion matrices for the eight sensor nodes using the 11-NN classifier. Legend of the activities: 1 = stand; 2 = walk; 3 = sit; 4 = lie.

• Lower baseline: The lower baseline is achieved by a naive approach which just involves a transfer of the objective predictive function (in practice, of the classifier models), i.e.: f^L() ≡ f^T().

4.3.4 Results

Upper baseline

Table 4.1 shows the accuracies for the upper baseline in percentage (standard deviation in brackets) for the eight sensor nodes. The sensor node best suited to the recognition task is the accelerometer mounted on the upper part of the knee, reaching 93% accuracy. The nodes on the upper parts of the arms follow with 86% and 83%. The normalized confusion matrices for the different sensor nodes are shown in Figure 4.4 for the 11-NN classifier and are very similar for the other two classifiers.


Classifier-level transfer learning

Table 4.2 reports the accuracies obtained with the classifier-level transfer learning approach, for each teacher-learner combination and the three classifiers. The learners reach on average an accuracy which is 9.3% lower than the upper baseline. NCC obtains the best results, achieving on average an accuracy 5.1% below the upper baseline, whereas 11-NN is 9.7% below and SVM performs 13.1% below the upper baseline. The reason for the differences lies in the tolerance of the different classifiers to wrong labels (class noise), with NCC being the least influenced.

From a comparison of Tables 4.1 and 4.2, we also see how in some cases the learner outperforms the teacher in terms of achieved accuracy. This is for example the case with NCC when the Knee sensor acts as learner: its accuracy outperforms that of all the teachers by an amount ranging between 5% and 15% (on average 9.4%). It might seem unexpected that a learner beats its teacher; however, this is indeed possible and can be proved theoretically (see Section 4.4).

Lower baseline

The accuracies achieved by the three classifiers with the naive approach are summarized in Table 4.3, for each teacher-learner combination. The values obtained by the learners are on average (considering all classifiers) 26.6% lower than the baseline (Table 4.1). NCC performs slightly better, achieving on average an accuracy 23.7% lower than the baseline, compared to 29.4% and 26.8% for 11-NN and SVM respectively. In the specific case of the model transfer from the RLA to the LLA sensor node, we reach an accuracy 5% lower than the baseline with NCC and SVM.

4.3.5 Discussion

Lower baseline

The objective predictive functions are tightly related to the feature spaces in which they operate. Indeed, the naive method could only work under the condition that X^T ≃ X^L. Otherwise, if the learner feature space does not match that of the teacher, the transferred function is not suitable to provide classification. Thus, in the general case of transfer from any body position to any other body position, this approach does


Teacher  Classifier   Knee    Shoe    Back    RUA     RLA     LUA     LLA     Hip
Knee     NCC          -       58(10)  67(6)   75(1)   70(3)   76(8)   70(2)   66(5)
Knee     11-NN        -       67(8)   61(2)   72(5)   69(4)   71(5)   70(6)   59(4)
Knee     SVM          -       60(13)  55(8)   68(4)   68(5)   69(7)   71(3)   55(9)
Shoe     NCC          84(7)   -       71(8)   76(3)   72(2)   78(8)   68(12)  70(4)
Shoe     11-NN        74(8)   -       73(7)   78(7)   76(4)   81(7)   79(3)   61(2)
Shoe     SVM          62(15)  -       66(5)   65(8)   60(8)   66(5)   66(13)  52(7)
Back     NCC          82(3)   68(10)  -       79(2)   70(3)   71(6)   61(4)   71(6)
Back     11-NN        70(10)  78(2)   -       75(3)   73(2)   79(5)   76(2)   63(2)
Back     SVM          57(13)  56(7)   -       65(3)   60(4)   70(8)   66(12)  52(6)
RUA      NCC          88(1)   78(12)  72(8)   -       73(1)   82(4)   73(3)   70(6)
RUA      11-NN        72(6)   81(4)   76(6)   -       78(0)   85(3)   82(1)   63(4)
RUA      SVM          79(3)   72(11)  72(6)   -       75(2)   84(2)   80(2)   58(7)
RLA      NCC          88(2)   78(12)  71(8)   77(2)   -       80(5)   71(4)   67(8)
RLA      11-NN        72(5)   76(3)   72(5)   80(1)   -       82(2)   79(2)   62(4)
RLA      SVM          77(5)   68(13)  70(4)   79(3)   -       80(2)   77(4)   56(10)
LUA      NCC          88(2)   69(11)  72(8)   78(1)   72(1)   -       73(2)   71(6)
LUA      11-NN        73(6)   80(3)   78(6)   83(1)   79(2)   -       82(2)   64(1)
LUA      SVM          79(3)   72(10)  75(7)   80(2)   75(2)   -       81(2)   54(7)
LLA      NCC          83(5)   67(10)  71(11)  73(2)   67(5)   72(4)   -       61(11)
LLA      11-NN        72(7)   76(4)   73(5)   79(4)   76(3)   80(3)   -       60(2)
LLA      SVM          74(3)   66(12)  63(9)   74(4)   70(4)   80(2)   -       50(9)
Hip      NCC          83(3)   58(11)  70(7)   78(2)   68(5)   71(5)   53(3)   -
Hip      11-NN        66(4)   64(8)   60(6)   62(2)   62(3)   65(8)   65(6)   -
Hip      SVM          74(14)  56(16)  57(7)   60(9)   56(11)  63(9)   59(14)  -

Table 4.2: Accuracy table for the classifier-level transfer learning approach. Each row represents a teacher and each column a learner. Each element contains the accuracy obtained on a learner when its classifier model is obtained by training the learner’s classifier using its own sensor data, labeled with the output of the teacher classifier. The accuracies that are within 10% of the corresponding baseline are in bold. Random guessing would yield 25% accuracy.


Teacher  Classifier   Knee    Shoe    Back    RUA     RLA     LUA     LLA     Hip
Knee     NCC          -       61(11)  44(7)   44(5)   52(5)   67(6)   73(3)   56(4)
Knee     11-NN        -       58(7)   43(9)   44(5)   49(6)   65(9)   68(5)   58(5)
Knee     SVM          -       60(10)  43(9)   44(6)   50(4)   63(8)   72(3)   57(5)
Shoe     NCC          69(10)  -       43(6)   70(2)   54(3)   40(3)   47(8)   67(8)
Shoe     11-NN        51(6)   -       57(6)   61(3)   43(6)   43(6)   38(3)   60(8)
Shoe     SVM          43(16)  -       51(4)   69(4)   49(3)   40(2)   46(6)   65(8)
Back     NCC          31(5)   34(10)  -       36(1)   23(1)   51(7)   24(2)   41(8)
Back     11-NN        38(5)   53(12)  -       46(3)   35(2)   58(8)   38(5)   53(6)
Back     SVM          44(7)   58(11)  -       44(3)   31(3)   56(8)   36(8)   49(5)
RUA      NCC          29(7)   68(12)  51(4)   -       58(1)   40(3)   44(3)   70(5)
RUA      11-NN        14(9)   56(12)  57(4)   -       54(4)   31(6)   33(2)   60(9)
RUA      SVM          29(14)  60(15)  55(3)   -       44(2)   30(5)   29(2)   54(3)
RLA      NCC          58(7)   64(9)   47(6)   63(2)   -       52(5)   80(2)   69(8)
RLA      11-NN        41(6)   53(7)   49(5)   64(3)   -       50(8)   74(3)   54(9)
RLA      SVM          53(6)   60(11)  50(7)   67(4)   -       51(6)   77(3)   63(10)
LUA      NCC          85(2)   37(12)  56(6)   23(2)   40(2)   -       66(4)   51(6)
LUA      11-NN        84(2)   40(10)  53(8)   28(3)   42(3)   -       69(3)   58(5)
LUA      SVM          84(2)   40(8)   54(9)   27(3)   42(3)   -       68(4)   51(5)
LLA      NCC          80(14)  60(2)   46(6)   50(4)   67(3)   62(5)   -       66(6)
LLA      11-NN        70(11)  58(8)   48(6)   60(4)   68(2)   61(7)   -       64(11)
LLA      SVM          57(3)   60(6)   50(7)   63(4)   71(4)   59(7)   -       62(9)
Hip      NCC          45(14)  66(10)  46(6)   68(2)   67(1)   49(5)   78(3)   -
Hip      11-NN        52(8)   65(8)   57(9)   66(6)   58(8)   54(5)   55(8)   -
Hip      SVM          53(17)  67(9)   54(10)  70(6)   57(9)   48(5)   57(11)  -

Table 4.3: Accuracy table for the naive approach. Each row represents a teacher and each column a learner. Each element contains the accuracy obtained on the learner when its classifier model is obtained by streaming the teacher’s classifier model and using it without any modification on the learner. The accuracies that are within 10% of the corresponding baseline are in bold. Random guessing would yield 25% accuracy.


[Figure: eight 3-D scatter plots, one per sensor node (Knee, Shoe, Back, RUA, RLA, LUA, LLA, Hip); legend: Lying, Sitting, Walking, Standing.]

Figure 4.5: Feature spaces for the eight sensor nodes, projected into 3-D space through a Principal Components Analysis. The different locations of the instances in the feature spaces explain the sensitivity of the naive transfer to the chosen teacher and learner.

not perform well. This can be seen in Figure 4.5, where we show a three-dimensional projection (obtained with Principal Component Analysis) of the 4-dimensional feature spaces of the different sensor nodes. The mismatch in feature spaces arises for several reasons. Firstly, it is inherent to the on-body sensor placement. While performing the same activity (e.g. walking), different parts of the body move in different ways. This is reflected in the signals seen by the corresponding sensor nodes (see Figure 4.3) and ultimately in the feature spaces. A second reason is the orientation of the sensor nodes. Two sensor nodes placed very close to each other but with different orientations will see similar signals but on different axes, which induces a rotation of the feature spaces.

Despite the generally unsatisfactory performance of this method, it is very appealing because of its simplicity and reduced need for data exchange, both in terms of the amount of data and the time for which the sensor nodes need to run together. Furthermore, there are cases where the method reaches good accuracies (see Table 4.3). This suggests that a hybrid approach could be used. In a preliminary phase, the sensor nodes could use the algorithms presented in [8] to detect their on-body placement and, in case of correspondence, the naive classifier transfer


can be operated. This would be an alternative to downloading a pre-defined model retrieved from a database, and would only need a local connection between the sensor nodes instead of a connection to a remote repository. The location discovery phase can also be performed by a correlation analysis of the sensor node signals or features, if we allow the sensor nodes to be deployed on the body together for some time, to assess whether they are mounted in close vicinity. Once the transfer has been performed, the learner node can perform a self-calibration (as proposed in [82]) to try to adjust the feature space to its needs.

Classifier-level transfer learning

In contrast to the naive approach, classifier-level transfer learning performs well (on average 17.3% better than the naive approach), reaching accuracies which are in some cases at the upper baseline level. This is mostly due to the “masking” effect that we obtain by working at a different level of abstraction. In fact, by broadcasting labels, we remove the need for an identical distribution of the activity classes in the feature spaces; the teacher’s feature space is then “hidden” behind the class labels, which are the only entities shared by the teacher. The only factors influencing the performance reached with this method compared to the maximum achievable are the fraction ε of wrong labels (the teacher is not perfect) and the distribution of these errors among the different activity classes.

Furthermore, with this method the teacher and learner do not need to implement the same feature extraction algorithm or classifier, and they can be of different sensor modalities. This provides freedom to the sensor node designers and differs significantly from the state of the art, where there are constraints in the relationships between feature spaces (e.g. in the transfer learning approach proposed in [12]). By masking the feature spaces, the classifier-level transfer is also suitable if the labels are generated by simple sensors, like magnetic switches (see [83] and Chapter 5), a case in which there is no feature space on the teacher side.

The disadvantage of this approach is the need for the teacher and learner sensor nodes to be worn simultaneously for a time span that allows the activities of interest to occur at least a few times. For postures/modes of locomotion, a reasonable time span could be one day. In addition to that, the sensor nodes need to synchronize their time


references periodically, so that the correspondence between each label and the learner signal can be established properly. Given that typical activities or gestures last at least on the order of seconds, the constraints on the synchronization are not stringent (accuracy on the order of milliseconds is enough). The synchronization can be performed by using any of the existing protocols for wireless sensor networks (for example [79]), which allow synchronization accuracies down to a few microseconds.

Comparison of different classifiers

When comparing the performance of the different classifiers, we observe that NCC achieves lower baseline results than kNN or SVM. The reason lies in the more complex decision boundaries that can be achieved by kNN and SVM. These more complex decision boundaries better model the feature spaces, which present overlap, particularly between the instances of walk and stand. The overlap is due to the presence of many instances where the subject is performing only a step or is moving slightly. These instances have been labeled as stand or walk according to the activities in which they were embedded.

When training using imperfect teacher labels, NCC is the classifier reaching the performance closest to the corresponding upper baseline (using ground-truth labels). This is due to the smoothing effect that NCC introduces, since the centroids that represent the activity classes are the average of the training instances. Thus, if a teacher sensor node provides a wrong label, a learner operating with NCC experiences a shift of the corresponding activity centroid (and therefore of the decision boundaries), but only by an amount which is inversely proportional to the total number of instances.

With an instance-based classifier like kNN, a wrongly labeled instance remains in the learner classifier model. Mitigation can be obtained by increasing the parameter k, so that a few mislabeled instances do not affect the classification output too much. The sensitivity of SVM to errors in the training labels appears to be higher than with the other classifiers. This is likely due to the fact that wrongly assigned training instances can induce changes in the set of support vectors.
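The mitigating effect of a larger k can be seen in a one-dimensional toy example (illustrative values, not taken from the dataset): a single mislabeled training instance flips the k = 1 decision in its vicinity, while k = 11 out-votes it.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Plain kNN majority vote for a single 1-D query point."""
    nearest = np.argsort(np.abs(X_train - x))[:k]
    votes = np.bincount(y_train[nearest], minlength=2)
    return int(votes.argmax())

# Two well-separated classes, plus ONE instance mislabeled by the teacher.
X = np.concatenate([np.linspace(0, 10, 21), np.linspace(20, 30, 21)])
y = np.array([0] * 21 + [1] * 21)
y[10] = 1                   # the training point at 5.0 gets a wrong label

# Querying next to the mislabeled point: k = 1 copies the error, k = 11 does not.
assert knn_predict(X, y, 5.1, 1) == 1
assert knn_predict(X, y, 5.1, 11) == 0
```

With k = 11, the ten correctly labeled neighbours outweigh the single wrong one, which matches the mitigation argument above.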

In terms of computational resources needed by the sensor nodes, the NCC classifier is the cheapest, with a number of operations of the order of |Y|·M. kNN and SVM instead grow in complexity with the


number of instances N, but kNN can be implemented on sensor nodes by posing a limit on the number of stored instances, and SVM can be efficiently implemented on nodes in a distributed way [84]. Despite NCC being suitable for the present problem, its limitations should not be overlooked. NCC would in fact fail for complex feature spaces, where classes are for example represented by multiple clusters.

Discussion on selected examples

Table 4.3 shows that there are cases where the lower baseline also gives satisfactory results. For example, when the RLA sensor node acts as a teacher and the LLA sensor node as a learner, the latter reaches an accuracy which is at the level of the upper baseline. This reflects the similarity between the signals collected by the accelerometers mounted on the arms, for most of the activities. When using the LUA sensor node as a teacher, the accuracy drops a little, because of the different orientations of the sensors, since there is often an angle between the arm and the forearm. When using the classifier-level transfer, we can see that the accuracies reached by the LLA learner are much less sensitive to the chosen teacher. They are very close to the baseline for all teachers, except for Hip, Shoe and Knee. The Hip and Shoe teachers make more mistakes than the others when generating labels (see Table 4.1), and the Knee teacher, though making the fewest mistakes, is the one having the most bias in the label errors (see the confusion matrices in Figure 4.4). This generates bigger changes in the classifier decision boundaries compared to having evenly spread errors.

4.4 Robustness against class noise (why a learner can outperform a teacher)

The situation of a learner outperforming the teacher providing the labels was pointed out explicitly in Section 4.3.4. We here provide a short proof, in a simple setting, that this is indeed possible when the points in the feature space X are sufficiently separated for the different classes.

Consider a domain D = (X, P(X)) with an associated task T = (Y, f(·)). Let the label space consist of two labels, i.e. Y = {l_+, l_−}, and let the feature space have a single dimension θ. Consider a set of 2N labeled


Figure 4.6: Learner feature space with uniform class-conditional distribution for two classes.

instances (N for each class):

X = {(x_1, l^G_+), (x_2, l^G_+), . . . , (x_N, l^G_+), (x_{N+1}, l^G_−), (x_{N+2}, l^G_−), . . . , (x_{2N}, l^G_−)}, x_i ∈ X.    (4.1)

Let X_+ and X_− be the subsets of instances belonging to the two classes. Assume the instances to be uniformly distributed for each of the two classes: P(X_+) ~ U(−δ − 2a, −δ) and P(X_−) ~ U(δ, δ + 2b). The parameters 2δ, 2a and 2b represent the distance between the borders of the two distributions and the widths of the supports of the classes l_+ and l_−, respectively. The distributions are depicted in Figure 4.6. Assume without loss of generality b > a. Let y_1, . . . , y_{2N} be the set of teacher labels, of which a fraction ε differs from the ground truth.

In the following sections we prove that the maximum achievable accuracy for the learner, in the case of a Nearest Class Center (NCC) classifier trained on the instance set labeled by the teacher, is 1 even if ε > 0. NCC is trained by calculating a centroid for each class. If we train the classifier with ground-truth labels, the centroids for the given distributions are the following:

γ^G_+ = (1/N) Σ_{i=1}^{N} x_i = −δ − a    (4.2)

γ^G_− = (1/N) Σ_{i=N+1}^{2N} x_i = δ + b    (4.3)

The effect of teacher labels differing from the ground truth is to move the two class centers towards each other, eventually swapping them, which would cause the learner to misclassify all instances. The conditions that need to hold for the learner accuracy to be 1 are:


γ_+ < γ_−    (4.4)

(γ_+ + γ_−)/2 < δ    (4.5)

(γ_+ + γ_−)/2 > −δ    (4.6)

We treat two cases separately: unbiased and biased label noise.

4.4.1 Unbiased label noise

If the teacher randomly makes mistakes in the labels, but the mistakes do not depend on the real class, we can talk about unbiased noise. Formally, we have P(y_k = l_+ | y^G_k = l_−) = P(y_k = l_− | y^G_k = l_+) = ε, ∀k ∈ [1, 2N].

To simplify the calculations, let us consider the worst-case scenario, i.e., all the correctly labeled instances picked at the edges of the distributions close to each other and all the mislabeled ones taken at the far ends. In this configuration, the label noise has the highest impact. The centroids are then:

γ_+ = −δ(1 − ε) + (δ + 2b)ε    (4.8)

γ_− = δ(1 − ε) + (−δ − 2a)ε    (4.9)

Substituting the expressions 4.8 and 4.9 into the conditions 4.4–4.6, we obtain:

ε < δ / (2δ + a + b)    (4.11)

ε < δ / (b − a)    (4.12)

ε > −δ / (b − a)    (4.13)

The last condition is automatically fulfilled, since ε ≥ 0. Since 2δ + a + b > b − a, the most restrictive condition is ε < δ/(2δ + a + b). We can conclude that if δ > 0, the learner accuracy can indeed be 1 while tolerating an error fraction up to δ/(2δ + a + b).
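As a sanity check, the bound can be simulated numerically. The sketch below is an illustration added here, not part of the original experiments: it samples the two uniform classes defined above, flips labels at random with rate ε, trains an NCC on the noisy labels and evaluates it against the ground truth. Note that random flips correspond to the average rather than the worst-case placement used in the derivation, so the simulated accuracy can stay at 1 even somewhat above the worst-case bound δ/(2δ + a + b).

```python
import numpy as np

def ncc_accuracy(delta, a, b, eps, n=10000, seed=0):
    """Train NCC on two uniform 1-D classes with unbiased label noise
    at rate eps; return accuracy against the ground-truth labels."""
    rng = np.random.default_rng(seed)
    x_pos = rng.uniform(-delta - 2 * a, -delta, n)   # class l+
    x_neg = rng.uniform(delta, delta + 2 * b, n)     # class l-
    x = np.concatenate([x_pos, x_neg])
    y_true = np.concatenate([np.ones(n), -np.ones(n)])
    # unbiased label noise: each label flipped with probability eps
    flip = rng.random(2 * n) < eps
    y_teacher = np.where(flip, -y_true, y_true)
    # NCC training: one centroid per teacher-labeled class
    c_pos = x[y_teacher == 1].mean()
    c_neg = x[y_teacher == -1].mean()
    # classify each instance by its nearest centroid
    y_hat = np.where(np.abs(x - c_pos) < np.abs(x - c_neg), 1, -1)
    return (y_hat == y_true).mean()

delta, a, b = 1.0, 1.0, 2.0
bound = delta / (2 * delta + a + b)                # worst-case bound, 0.2 here
print(ncc_accuracy(delta, a, b, eps=0.5 * bound))  # below the bound: accuracy 1.0
print(ncc_accuracy(delta, a, b, eps=3 * bound))    # far above it: centroids swap
```

With ε well below the bound the decision boundary stays inside the gap (−δ, δ), so every instance is classified correctly; at ε = 0.6 the noisy centroids swap and the accuracy collapses.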


4.4.2 Biased label noise

Biased label noise occurs when a teacher makes certain mistakes more often than others. We take the extreme case, where P(y_k = l_+ | y^G_k = l_−) = 2ε, ∀k ∈ [N + 1, 2N], and P(y_k = l_− | y^G_k = l_+) = 0, ∀k ∈ [1, N]. The overall error rate thus remains ε. Again we consider the worst-case scenario, i.e., all the correctly labeled instances picked at the near edges of the distributions and all the mislabeled ones taken at the far ends.

The centroids are then:

γ_+ = −δ + 2ε(δ + 2b)    (4.14)

γ_− = δ(1 − 2ε)    (4.15)

The conditions 4.4–4.6 then become:

ε < δ / (2(δ + b))    (4.17)

ε < δ / (2b)    (4.18)

ε > −δ / (2b)    (4.19)

Once again, the last condition is superfluous. If δ > 0, the learner accuracy can indeed be 1 while tolerating an error fraction up to min(δ/(2(δ + b)), δ/(2b)) = δ/(2(δ + b)).
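Taking the worst-case centroid expressions 4.14 and 4.15 as given, the threshold for the biased case can be checked directly. The helper below is a hypothetical illustration written for this text, not thesis code; it evaluates the centroids and tests the three conditions for learner accuracy 1.

```python
def biased_centroids(delta, b, eps):
    """Worst-case NCC centroids under biased label noise (Eqs. 4.14-4.15)."""
    g_pos = -delta + 2 * eps * (delta + 2 * b)
    g_neg = delta * (1 - 2 * eps)
    return g_pos, g_neg

def conditions_hold(delta, b, eps):
    """Check the conditions (4.4)-(4.6) for a learner accuracy of 1."""
    g_pos, g_neg = biased_centroids(delta, b, eps)
    mid = (g_pos + g_neg) / 2
    # centroids not swapped, and decision boundary inside the gap (-delta, delta)
    return g_pos < g_neg and -delta < mid < delta

delta, b = 1.0, 2.0
threshold = min(delta / (2 * (delta + b)), delta / (2 * b))  # = 1/6 here
print(conditions_hold(delta, b, 0.9 * threshold))  # just below the threshold
print(conditions_hold(delta, b, 1.1 * threshold))  # just above the threshold
```

With δ = 1 and b = 2 the threshold evaluates to 1/6: the conditions hold just below it and fail just above it, because the centroids swap.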

4.4.3 Discussion

The learner model described in the previous section is admittedly simple and often not realistic; nevertheless, we can draw some conclusions from it. The sensitivity of the accuracy obtained by a learner in the presence of teacher label noise depends on the separation of the different clusters in its feature space. If the clusters are well separated, a learner can reach an accuracy of 1 despite noise in the training labels. If the clusters overlap (δ < 0), then the learner has a lower maximum achievable accuracy, and the achievable accuracy starts degrading already at small values of ε (see Figure 4.7).

If the number of classes C > 2, we can expect the accuracy to decrease. Nevertheless, the robustness against teacher label noise can


[Plot: learner accuracy (y-axis) versus teacher label error rate (x-axis), for three cases: well separated, minimal overlap, partial overlap.]

Figure 4.7: Learner accuracy with different levels of class overlap. The robustness against label noise is to some extent preserved also with some overlap in the data distributions, but the baseline accuracy is reduced.

increase, at least for NCC, since different mislabeled instances will pull the centroids in different directions, eventually also partially canceling out.

Classifiers capable of more complex decision boundaries than NCC will be less robust against label noise. If we consider for example a kNN with k = 1, as little as one wrong teacher label will create a local region in the feature space where misclassifications can occur. Higher values of k have a smoothing effect on the decision boundaries and at the same time increase the robustness against label noise, since more mistakes in the same region of the feature space are needed for the classifier to return a wrong classification output. Nevertheless, choosing too high a value for k is not advisable, because this would smooth the decision boundaries too much.
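The effect of k can be illustrated with a minimal 1-D kNN sketch (plain NumPy, with hypothetical class distributions and parameters of my own choosing): with k = 1 the classifier essentially memorizes the noisy training labels, while a larger k votes most of the noise away.

```python
import numpy as np

def knn_accuracy(k, eps, n=400, seed=1):
    """kNN trained on noisy labels over two separated 1-D uniform classes,
    evaluated on fresh test points against the ground truth."""
    rng = np.random.default_rng(seed)
    x_tr = np.concatenate([rng.uniform(-3, -1, n), rng.uniform(1, 3, n)])
    y_tr_true = np.concatenate([np.zeros(n), np.ones(n)]).astype(int)
    # flip each training label with probability eps (unbiased label noise)
    flip = rng.random(2 * n) < eps
    y_tr = np.where(flip, 1 - y_tr_true, y_tr_true)
    # fresh, noise-free test points from the same distributions
    x_te = np.concatenate([rng.uniform(-3, -1, n), rng.uniform(1, 3, n)])
    y_te = np.concatenate([np.zeros(n), np.ones(n)]).astype(int)
    # majority vote among the k nearest training neighbours of each test point
    d = np.abs(x_te[:, None] - x_tr[None, :])
    nn = np.argsort(d, axis=1)[:, :k]
    y_hat = (y_tr[nn].mean(axis=1) > 0.5).astype(int)
    return (y_hat == y_te).mean()

print(knn_accuracy(k=1, eps=0.15))   # roughly reproduces the 15 % label noise
print(knn_accuracy(k=11, eps=0.15))  # larger k votes most of the noise away
```

With ε = 0.15, 1-NN misclassifies roughly a noise-rate fraction of the test points, whereas 11-NN needs six or more flipped labels among the nearest neighbours to err, which is rare.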

4.5 Enhancement with co-training step

In Section 4.3 we showed the effectiveness of the classifier-level transfer learning. In most of the analyzed cases, the learner domains reached accuracies close to their upper baseline (obtained by training them


with ground-truth labels). The parameters determining the success of the method are the robustness against label noise, which is enabled by sufficient separation in the feature spaces of the learner domains (as shown in Section 4.4), and the accuracy of the teacher domains, which in the case of the activity recognition scenario was always above 73 %. We here present a scenario which serves as a counterexample, where the teacher domain has lower accuracy. This scenario motivates the introduction of a modification of the classifier-level transfer which includes a co-training step.

4.5.1 Mobile phone scenario

The benefit of using classifier-level transfer on a mobile phone stems from the following observations:

• Online repositories containing tagged sound examples are available, which make it possible to train teacher domains to provide context recognition [85].

• Sound-based context recognition requires a much higher computational complexity than acceleration-based recognition, since acceleration signals can be sampled at a fraction of the rate needed for sound (kHz range for sound versus Hz range for acceleration).

• Building similar online repositories for accelerometer signals is very challenging due to the many possible sensor placements and orientations.

• Sound and acceleration can be measured at the same time on any smartphone.

In this scenario it is therefore beneficial to operate a classifier-level transfer from sound, acting as the teacher domain, to acceleration, acting as the learner domain.

4.5.2 Class-level transfer with co-training step

We enhance the classifier-level transfer learning described in Section 4.2 by using a disagreement-based co-training procedure, which acts as a filter to exclude instances that are potentially mislabeled by the teacher [86, 87]. We partition the set of instances X^L into two sets


X^L_transfer = {x^L_1, . . . , x^L_Sw} and X^L_co-training = {x^L_{Sw+1}, . . . , x^L_N}, according to whether they are collected before or after a "switching point" that we call Sw. The choice of Sw should allow the learner to reach a certain accuracy before using it to filter teacher labels.

Algorithm 1: Classifier-level transfer learning with co-training step

    Ω^L ← {(x^L_i, y^T_i)} ∀ x^L_i ∈ X^L_transfer
    build f^L() by batch training with Ω^L
    for i = Sw + 1 → N do
        y^L_i ← f^L(x^L_i)
        if y^L_i == y^T_i then
            Ω^L ← Ω^L ∪ {(x^L_i, y^T_i)}
            rebuild f^L() by batch training with Ω^L
        end if
    end for

The algorithm works as follows. For all the instances classified by the teacher before the switching point Sw, the classifier-level transfer described in Section 4.2 is applied: the teacher labels are associated with the learner signals, forming a training set. For all the instances classified by the teacher after the switching point, disagreement-based co-training is applied. In this phase, teacher and learner both classify instances. If the labels disagree, either the teacher or the learner is certainly wrong. Since we do not know which label is wrong, the instance is simply discarded. If, on the contrary, the labels agree, we keep the instance. The detailed procedure is formalized in Algorithm 1.
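This procedure can be sketched compactly in Python. The function and callback names below are my own, not from the thesis: any batch classifier can be plugged in through the `fit`/`predict` callbacks, and minimal NCC callbacks are included for illustration.

```python
import numpy as np

def transfer_with_cotraining(X_L, y_T, sw, fit, predict):
    """Classifier-level transfer with a co-training step (cf. Algorithm 1).

    X_L : learner instances, in arrival order
    y_T : teacher labels for the same instances
    sw  : switching point; before it, all teacher labels are kept
    fit / predict : callbacks wrapping any batch classifier
    """
    # Phase 1: plain classifier-level transfer on the first sw instances
    kept_X = list(X_L[:sw])
    kept_y = list(y_T[:sw])
    model = fit(kept_X, kept_y)
    # Phase 2: disagreement-based co-training on the remaining instances
    for x, y in zip(X_L[sw:], y_T[sw:]):
        if predict(model, x) == y:        # learner and teacher agree: keep
            kept_X.append(x)
            kept_y.append(y)
            model = fit(kept_X, kept_y)   # rebuild by batch training
        # on disagreement the instance is simply discarded
    return model

# minimal NCC callbacks for illustration
def ncc_fit(X, y):
    X, y = np.asarray(X, float), np.asarray(y)
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def ncc_predict(model, x):
    return min(model, key=lambda c: np.linalg.norm(np.asarray(x) - model[c]))
```

In a toy run with two well-separated 1-D classes and a few wrong teacher labels arriving after the switching point, the disagreement filter discards exactly the mislabeled instances and the final centroids stay clean.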

4.5.3 Dataset

We collected a dataset from one user, including 29 recordings spread over ten days, performed with the smartphone. The user labeled the ongoing activities and locations, as well as the means of transportation used, through a graphical interface on the phone (see Fig. 4.9). The means of transportation constitute the subset that we use here to test the classifier-level transfer enhanced with co-training. The label space is Y = {walking, train, tram, car}. The smartphone recorded sound and acceleration at 16 kHz and 100 Hz, respectively.


4.5.4 Simulations

The dataset is split as follows: the first ten recordings are used as a training set for the teacher domain; the following ten constitute the "transfer set", where classifier-level transfer learning enhanced with co-training is actually applied; the last nine recordings constitute the test set. The switching point Sw is set to ten different points corresponding to the end of each of the ten recordings in the "transfer set". In the next section, Sw is expressed in terms of recording number instead of the corresponding instance number, for ease of understanding. A value Sw = 3 means, for example, that the first three recordings are used to perform a normal classifier-level transfer, whereas from the fourth recording on, the co-training step is introduced.

4.5.5 Results

We first swept the error rate ε of the teacher labels artificially between 0 and 0.9, using uniformly distributed label errors. To do so, the set of teacher labels is initialized with the ground-truth labels: y^T_i = y^G_i (ε = 0). Then, for each value of ε, a new set of teacher labels is generated as follows:

• Each label y^T_i is selected for being noisy with probability ε.

• If y^T_i is selected to be noisy, the value for the label is extracted randomly from the label space Y, excluding the ground-truth label y^G_i.

• If y^T_i is not selected to be noisy, then it is set to the ground truth (y^T_i = y^G_i).

Fig. 4.10 shows the results for each value of ε and for each choice of the switching point. We see clearly that, for a proper choice of Sw, the learner domain accuracy outperforms that obtained by the classifier-level transfer learning without the co-training step.

In Fig. 4.11 we use the real labels generated by the teacher domain as input to the transfer and co-training method. Again, the choice of the switching point affects the performance of the method, but the benefit of adding the co-training step is quite moderate compared to the pure classifier-level transfer (last bar in the graph).


4.5.6 Discussion

Real versus artificial labels

The benefit of using the extra co-training step seems to be quite moderate when using the teacher domain to provide the labels. In this setting, the labels do not contain unbiased label noise, but rather biased noise, depending on which classes introduce the most confusion in the teacher domains. A strong bias in the labels output by the teacher could be taken into account when performing the disagreement-based co-training by introducing a reliability parameter related to the single classes.

Tradeoff in the choice of Sw

From Figs. 4.10 and 4.11 it is clear that too small values of Sw should be avoided. This is obvious: by switching too soon to the co-training phase, the learner domain has seen few training examples and has therefore reached only a very poor accuracy. In this setting, if the teacher and learner labels disagree and the system discards the instance, the error is likely on the side of the learner, since it outputs basically noise; it would probably be safer to keep the teacher label anyway.

On the other hand, if we set too high a value of Sw, that is, if we wait very long before switching to co-training, we exploit only a small fraction of the instances to perform the co-training, meaning that many wrong labels are simply used instead of being discarded.

Automated choice of Sw

A method to choose the switching point automatically remains an open challenge. If the teacher accuracy is known to the system (and high enough), we can assess the performance of the learner (up to the imprecise knowledge that we can obtain about the ground truth). This would allow switching to co-training only when the learner eventually reaches an accuracy close to that of the teacher.

Another idea to assess the performance of the learner as new instances are added is to monitor its feature space, using the scores that will be discussed in Chapter 6. Experiments could be performed to assess whether an automatic choice of Sw works in practice.


4.6 Conclusion

In this chapter we introduced a classifier-level transfer learning approach that uses existing teacher domains to train learner domains to recognize the same set of classes. We showed that the transfer offers robustness under label noise. The key to this robustness is the choice of features which are sufficiently discriminative, i.e., such that the data are well separated in the learner feature space. This allows the learner to even exceed the accuracy of the teacher. The robustness can be exploited in other settings, for example for retraining a sensor system which has been displaced or rotated on the user's body by using the fusion of other sensors which are detecting the same activities (see [88]). We proposed an additional co-training step to further improve the performance of the transfer. A user wearing a new system capable of performing activity recognition can then automatically integrate the new system without the need for an explicit training phase.


[Diagram: teacher and learner processing chains (blocks FXS and ML) applied to instance streams X^T and X^L, with an "equal?" comparison deciding whether an instance is kept.]

Figure 4.8: Schematic representation of the class-level transfer learning enhanced by co-training. Before reaching the switching point, classifier-level transfer is used. After the switching point, only instances classified with the same label by teacher and learner are fed into the learner training set.


Figure 4.9: Graphical user interface used by the user to label her activities on the smartphone.


[Plot: accuracy (y-axis) versus teacher error rate (x-axis), one curve per switching point Sw = 1 . . . 9 and Sw = None.]

Figure 4.10: Learner accuracies versus different levels of error rate in the teacher. Each line represents the results obtained with a different switching point Sw. For Sw = 1 and Sw = 2, co-training is started too early. In this circumstance, disagreements in the labels are due to a poor learner model. This causes many instances to be discarded even if the teacher was correct. Higher values of Sw tend to perform like classifier-level transfer (Sw = None), since co-training is started only late and only few instances can be discarded. The optimal Sw lies between 3 and 4.


[Bar chart: accuracy (y-axis) for the switching points Sw = 1 . . . 9 and None (x-axis).]

Figure 4.11: Learner accuracy obtained with the real teacher domain labels for the different values of Sw. The horizontal line indicates the upper baseline achieved by training the learner with ground-truth labels. The improvement with co-training compared to simple classifier-level transfer learning is quite modest in this case. For Sw = 1 and Sw = 2, co-training is started too early. In this circumstance, disagreements in the labels are due to a poor learner model. This causes many instances to be discarded even if the teacher was correct.


5 Exploiting ambient sensors and behavioral assumptions*

In this chapter, we identify teacher domains derived from ambient sensors, using as an example magnetic switches embedded in doors or windows. We characterize their accuracy in labeling instances collected from body-worn systems, which act as learner domains. We then analyze the performance of locomotion recognition tasks learned by the learner domains.

*based on “A methodology to use unknown new sensors for activity recognition byleveraging sporadic interactions with primitive sensors and behavioral assumptions”[83]


5.1 Introduction

In the previous chapter, we introduced a method to transfer activity recognition capabilities from a teacher to a learner domain. We previously identified pre-trained body-worn sensing systems as possible teacher domains. In this chapter, we focus on finding other teacher domains that can be used as sources for a classifier-level transfer learning approach. Throughout the chapter, we denote by "behavioral assumption" (BA) any guess about the user's behavior that can be used to infer a label (BA label). We assume only one user is interacting with the environment.

5.2 Method

5.2.1 Identifying systems capable of acting as teacher domains

The first step towards enlarging the set of teacher domains is to investigate the availability of sensor systems that can become label sources. This should be done taking into account the large-scale availability of ambient intelligence environments. In typical real-world environments, we can find, among others, these sources of information:

• Ambient sensors: pre-existing sets of sensors, typically of modalities such as magnetic switches, RFID tags or infrared motion sensors, are often deployed in ambient intelligence environments. As the user interacts with the environment, these sensors can provide information about his or her actions.

• Video cameras: the widespread installation of video cameras for surveillance, in combination with several algorithms to detect human actions from multiple scenes, makes it possible to exploit this resource as a vast set of teacher domains for training body-worn motion sensors.

• Calendars: knowledge about the user's activities available from sources such as software calendars can provide labels for training other sensors.

We here focus on the usage of magnetic switches as teacher domains. These switches are of the kind usually deployed to monitor when windows or doors are opened, for example to automatically control air conditioning in buildings or to detect an intrusion.


5.2.2 Extracting and enhancing labels from a single magnetic switch

The activation of a single magnetic switch corresponds to an action related to the object where the switch is installed. For example, a switch on a door would signal the label open door. The information can nevertheless be enhanced by some common-sense assumptions given normal user behavior and some physical constraints:

• Just before opening the door, the user grasps the door handle, and he or she releases it afterward.

• Normally, a user stands in front of the door while opening or closing it.

• Some time before the user stands in front of the door, he or she needs to walk towards it.

A timeline with the described actions is depicted in Fig. 5.1. The last assumption, about the user walking before a switch is activated, can at times be quite weak, for example if the user interacts twice with the same door without walking in between. This problem can be overcome by combining pieces of information coming from different switches present in the infrastructure.

5.2.3 Combining more switches

In order to make more precise assumptions about when a user is really walking, signals from multiple switches can be combined. In fact, if two switches mounted on different objects are triggered, then the user needs to have walked between the two actions to move from one place to the other (see Fig. 5.2). The teacher domain uses the assumptions about the standing and walking actions to generate labels which are used to train body-worn sensor systems (learner domains). The assumptions are associated with three parameters which need to be set for the creation of the labels:

• δS: the duration of the stand label, centered on the change in the switch status;

• δM: the margin between a stand label and the preceding walk label; this margin is used to reduce the chance of having mislabeled segments at the interface between labels;


[Timelines: magnetic switch signal; label open door; BA labels walk, stand, grasp, let.]

Figure 5.1: Illustration of behavioral assumptions (BA) when the user opens a door, which is captured by a magnetic switch (top timeline). The switch signal can be used to generate a teacher label open door (middle timeline). Furthermore, the user is grasping the door handle just before opening the door and letting it go shortly after. The user is also likely standing while opening the door, and walking towards the door before the interaction (bottom timeline). The durations and starting points of all these assumed labels are parameters to be set.

• δW: the duration of the walk label occurring before the standing, every time two different switches are toggled.

We denote by ∆ = (δS, δM, δW) a parameter triplet. The parameters are shown in Fig. 5.3.
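The label-generation rules above can be sketched as follows, under simplifying assumptions (events are time-ordered, possible overlaps between consecutive segments are not checked, and all names are hypothetical, not from the thesis):

```python
def ba_labels(switch_events, d_s, d_m, d_w):
    """Generate (start, end, label) segments from switch toggle timestamps.

    switch_events : time-ordered list of (time, switch_id) tuples
    d_s : duration of each 'stand' label, centred on the toggle
    d_m : margin between a 'walk' label and the following 'stand'
    d_w : duration of the 'walk' label, assumed only when two
          consecutive toggles come from different switches
    """
    segments = []
    prev_id = None
    for t, sid in switch_events:
        if prev_id is not None and sid != prev_id:
            # the user must have walked between two different objects
            walk_end = t - d_s / 2 - d_m
            segments.append((walk_end - d_w, walk_end, "walk"))
        # the user stands in front of the object around the toggle
        segments.append((t - d_s / 2, t + d_s / 2, "stand"))
        prev_id = sid
    return segments

events = [(10.0, "fridge"), (25.0, "door"), (27.0, "door"), (40.0, "drawer")]
for seg in ba_labels(events, d_s=2.0, d_m=1.0, d_w=4.0):
    print(seg)
```

Note that the two consecutive toggles of the same door produce no walk label between them, reflecting the weakness of the single-switch assumption discussed above.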

5.2.4 Metrics used for assessment

The performance of the learners is assessed through their accuracy (see Section 2.1).

For the teacher labels, we use precision as a metric. Precision, also called positive predictive value (PPV), is defined as:

PPV = TP / (TP + FP)    (5.1)

The rationale behind using precision for the teacher labels is the following. The teacher labels are used to build models on the learners. Having a false positive in the teacher labels means introducing class noise, thus worsening the model learned on the learner. On the contrary, having a false negative (no label when a label should be produced) implies that



Figure 5.2: In this example, a fridge, a door and a drawer are instrumented with magnetic switches. If the three switches fire at times t1, t2 and t3, then the teacher domain relying on the three switches can make the assumption that the user was walking at some point for t1 ≤ t ≤ t2 and for t2 ≤ t ≤ t3.

an instance is not used by the learner. A false negative thus just slows down the learning. For these reasons, high precision means that the training set collected by the learner has low class noise. See Fig. 5.3 for an example.
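Computed per time slot, this precision metric can be sketched as follows (a hypothetical helper written for this text: slots where the teacher produces no label are treated as false negatives and deliberately ignored, since they only slow down the learning):

```python
def teacher_precision(generated, ground_truth):
    """Per-slot precision (PPV) of generated labels against ground truth.

    Both sequences hold one label per time slot; None marks slots where
    the teacher produced no label (false negatives are not penalised).
    """
    labelled = [(g, t) for g, t in zip(generated, ground_truth) if g is not None]
    if not labelled:
        return 0.0
    tp = sum(1 for g, t in labelled if g == t)  # correctly labelled slots
    return tp / len(labelled)

gen = ["stand", "stand", None, "walk", "walk", None, "stand"]
gt  = ["stand", "walk",  "walk", "walk", "stand", "sit", "stand"]
print(teacher_precision(gen, gt))  # 3 correct of 5 labelled slots -> 0.6
```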

5.3 Simulations

5.3.1 Dataset

The method is tested using again a subset of the OPPORTUNITY dataset presented in Chapter 3, covering four subjects, to recognize modes of locomotion. The total amount of data for each subject ranges from 2 h 40 min to 3 h 20 min, recorded over five runs. The label space of the teacher


[Timelines: magnetic switch signal, ground truth (GT) and assumed stand/walk labels, with segments a, b, c, d, e and the parameters δS, δM, δW.]

Figure 5.3: Parameters related to the assumptions used to generate labels for modes of locomotion from magnetic switches. GT stands for ground truth. δS, δM and δW denote respectively the width of the generated stand label, the margin between labels and the width of the walk label. The precision of the teacher labels would in this case be PPV = (a + c + e)/(a + b + c + d + e).

and learner domains is Y = {stand, walk, sit, lie}. The labels for stand and walk are obtained by fusing the information coming from the magnetic switches placed on the fridge, in the three drawers and on the dishwasher door, and by using the assumptions described in the previous sections. The labels for the class lie are obtained by thresholding the variance of the accelerometer mounted on the deckchair. The labels for the class sit are set from the ground truth, but they can easily be obtained in practice with a pressure mat on the chair [89].

To form the learner domains we choose again the data streams collected by eight accelerometers (see Figure 4.2).

5.3.2 Simulation procedure

The simulations are performed in a subject-dependent fashion. We denote the four subjects by S1–S4. For each subject, four out of the


five runs are used to generate the teacher labels y_1 . . . y_N for N time slots of duration 1 s, according to the method described above. For the same four runs, instances are extracted from the eight learner domains to build a training set for each learner domain D^L_i. The remaining run is used for testing, and the whole procedure is repeated in a five-fold cross-validation.

The instances are obtained from the raw signals in each time slot by extracting M = 4 statistical features. These are the mean values of the x, y and z accelerometer axes and the standard deviation of the acceleration magnitude.
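A sketch of this feature extraction; the window length and sampling rate below are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def acc_features(window):
    """Extract the four statistical features from one time slot of
    accelerometer samples, shape (n_samples, 3) for the x, y, z axes."""
    window = np.asarray(window, float)
    means = window.mean(axis=0)                 # mean of x, y and z
    magnitude = np.linalg.norm(window, axis=1)  # per-sample acceleration magnitude
    return np.array([*means, magnitude.std()])  # M = 4 features

# hypothetical 1 s window at 30 Hz: gravity on z plus small noise
rng = np.random.default_rng(0)
w = rng.normal([0.0, 0.0, 9.81], 0.1, size=(30, 3))
print(acc_features(w).shape)  # (4,)
```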

The transfer learning approach is performed for all learner domains with the following classifiers:

1. Nearest-Centroid Classifier (NCC);

2. k-Nearest Neighbors (kNN), with k = 11;

3. Decision Tree.

The classifier parameters were chosen empirically.

5.3.3 Baseline

Besides the classifier-level transfer learning, an upper baseline is obtained by training each of the eight body-worn accelerometers from ground-truth labels, using the same runs used to perform the transfer. The objective predictive function f^L() is learned through the training set (x^L_1, y^G_1), . . . , (x^L_N, y^G_N).

5.4 Results

5.4.1 Precision of teacher domains

The precision of the teacher domains obtained by combining the signals from the magnetic switches is shown in Fig. 5.4. The precision values range between 73 % and 93.7 %, depending on the parameter choice and the test subject. Although the precision varies, the parameter values for which the precision is maximum are quite insensitive to the subject: optimal parameters calculated for a specific subject can also be used for other subjects with, on average, less than 1 % loss in precision in the teacher domains compared to the optimal precision.


Figure 5.4: Average precision of the generated labels for four subjects (S1–S4) sweeping the parameters δS, δM and δW. The highest precision is obtained for subject S1. For all subjects, the values of the parameters that maximize the teacher precision are similar. For all subjects, the highest precisions are obtained with low values of δS. This is simply due to the fact that the dataset contained several short instances of the standing activity. Assuming longer durations for standing leads to lower precision.

From Fig. 5.4 we also notice that, for all subjects, the highest precisions are obtained with low values of δS. This is due to the fact that the instances of the standing activity were short. Therefore, assuming longer durations for the standing activity (higher values of δS) leads to lower precision. As we will see in Section 5.4.2, a higher precision in the training labels does not automatically imply a higher accuracy after the transfer is carried out.



Figure 5.5: Average difference between the accuracy achieved on the learner domains and their baselines. The five groups of bars correspond to the five parameter combinations ∆1 to ∆5. The eight bars in each group represent, from left to right, the learner domains: RKN, SHOE, BACK, RUA, RLA, LUA, LLA, HIP (for the acronyms, see Fig. 4.2).

5.4.2 Accuracy of the learner domains after transfer

We report the difference in accuracy between the baseline accuracies (achieved when training the learner domains with ground-truth labels) and the ones obtained with the transfer approach using the teacher domains. For each subject, we selected five parameter combinations (denoted ∆1 to ∆5) corresponding to teacher domain accuracies ranging from the maximum to the minimum obtained for that subject. We report the simulation results for each parameter combination.

In Fig. 5.5, we average the results over the three classifiers, showing the results for each learner domain and for the five parameter combinations, for all four subjects.

In Fig. 5.6, we focus on the difference in performance between the classifiers. The results are averaged over the subjects and presented for



Figure 5.6: Average difference between the accuracy achieved on the learner domains and their baselines. The three groups of bars correspond to the three classifiers (1=NCC, 2=kNN, 3=Decision Tree). The eight bars in each group represent, from left to right, the learner domains: KNEE, SHOE, BACK, RUA, RLA, LUA, LLA, HIP (for the acronyms, see Fig. 4.2). The differences within each group are due to the different accuracies of different sensors in recognizing the postures and modes of locomotion.

each learner domain and for the five parameter combinations. From the plots it emerges that a simple classifier like NCC is less penalized by using noisy labels, compared to kNN and Decision Trees. Nevertheless, NCC also offers a lower baseline, since it cannot implement complex decision boundaries and the performance is completely delegated to the discriminative power of the features.


5.5 Discussion

5.5.1 Interpretation of the results

The results presented in Section 5.4.2 can be counterintuitive at first sight. In fact, one would expect the accuracy to drop when moving from settings with higher teacher precision (parameter set ∆1) to ones with lower precision (parameter set ∆5). However, by analyzing Fig. 5.4, we see that the parameter sets which lead to lower teacher precision are coupled to higher values of δS, δM and δW. With higher values for δS and δW, the training sets contain more instances, since longer data segments are assumed to belong to the classes stand and walk. This can be seen by counting the effective average number of instances per recording run which are extracted with the different parameter sets. Thus, there exists a trade-off between the precision of the teachers and the amount of training data. This is shown in Fig. 5.7.

From the results, we can conclude that it is often better to gain more training data, at the expense of the quality of their labels, if there are constraints on the amount of data that can be gathered. If, on the contrary, the transfer learning process can proceed indefinitely, then the best performance will be obtained by selecting the best (most precise) teachers. Whether the transfer learning has to happen within a limited amount of time or not depends on the specific application. If, for example, the teacher domains are deployed in an environment that the user visits regularly, like his or her own home, then it is reasonable to assume that the learner systems can acquire data each day. If, instead, the user visits a special environment equipped with teacher sensors only once and the learners should make the most out of it, then the parameters should be chosen so that a sufficient amount of training data is collected (even if noisier).

5.5.2 Significance for real-world deployment

From the results, we see that the learner domains reach accuracies that in many cases are close to the upper baseline, obtained by training with ground-truth labels. From the user perspective, this is obtained without him or her needing to collect any training data explicitly. This has a positive impact on the deployment of activity recognition systems in real-world applications, since a user could just buy a new sensor-equipped device (watch, phone, etc.) and this could learn from sensors



Figure 5.7: Trade-off between the teacher precision and the average number of training instances available per recording.

located in the environment how to recognize activities, without user involvement and in a personalized way.

5.5.3 Finding more teacher domains

The set of teacher domains can be extended in terms of used sensors and label spaces by analyzing the typical infrastructure which is installed in users' homes and environments. Tab. 5.1 shows an example of sources and ways to enrich the label spaces that can be covered by teacher domains extracted from typical infrastructure.


Table 5.1: Illustration of other possible teacher domains in an ambient intelligence environment. The table illustrates some typical sensor systems, their foreseen usage, and the possible alternate usage to extract teacher labels. Activity kind: G = gestures, MOL = modes of locomotion. Method: BA = behavioral assumptions, DC = derived characteristics (information derived directly from the signal, e.g. speed from GPS coordinates).

Sensor (What): Magnetic or normal switches
Typical usage: Detection of opening and closing of items (e.g. door, drawer) or activation (e.g. light switch)
What can be inferred / Exploitation / Activity / Method:
• User performing an opening or closing gesture or switch toggling / 2-class problem: on-body sensors can learn to detect opening or closing / G / BA
• Opening or closing implies a reach gesture before and a release gesture afterward / 1-vs-null-class problem: on-body sensors can learn to spot reaching for items / G / BA
• Lock/unlock shortly before the open, but after the reach (reach-unlock-open-release) or (reach-close-lock-release) / 6-class problem: reach, release, lock, unlock, open, close
• While interacting, the user is very likely standing. Before and after interaction with the item, the user likely walked. / On-body sensors can learn to detect standing vs. not-standing (typically other modes of locomotion), or even standing vs. walking vs. other activities / MOL / BA

Sensor (What): Localization system
Typical usage: Position of the person in the room. Infer prior probabilities of activities (e.g. near sink -> higher priors on washing dishes), or directly activities (if very localized)
What can be inferred / Exploitation / Activity / Method:
• Rate of variation of position can be mapped to walking vs. static posture / On-body sensors can learn to recognize walking / MOL / DC
• With an additional assumption on typical speed, walking and running can be differentiated / On-body sensors can learn to recognize walking, running, static posture / MOL / DC

Sensor (What): Proximity infra-red
Typical usage: Movement sensing, automatic lighting
What can be inferred / Exploitation / Activity / Method:
• Movement detection likely indicates a walking person / On-body sensors can learn to detect walking / MOL / DC

Sensor (What): Camera
Typical usage: Person identification, surveillance, security
What can be inferred / Exploitation / Activity / Method:
• Person identification, tracking, speed of displacement, walking or running / On-body sensors can learn to recognize walking, running vs. static posture / MOL / DC
• Limb movement tracking [90] / On-body sensors can learn to detect the same activities / G / DC

Sensor (What): Objects instrumented with e.g. RFID, accelerometers
Typical usage: Indicate the use of the object (object use [91])
What can be inferred / Exploitation / Activity / Method:
• The user is moving or carrying or using the object in the hand / Training a hand gesture recognition system to recognize carrying/moving objects / G / BA
• Shortly before the object is moved, the user must perform a reach gesture. Shortly after the object is left static, the user must perform a release gesture / On-body sensors can learn to spot reaching for objects / G / BA


5.6 Conclusion

We presented a method to exploit simple magnetic sensors to extract teacher labels to train body-worn sensor systems to recognize activities. Just using signals from magnetic switches, enhanced with assumptions on the user movements, teacher labels reach precision values up to 93.7 % in recognizing modes of locomotion/postures. These teacher labels can train body-worn sensors to recognize the same classes, achieving accuracies on average 9.3 % below the level obtained by using ground-truth labels (the maximum and minimum gaps to the ground-truth accuracy are respectively 20 % and 2.8 %). This means that the sensor systems can be worn by the user and learn to recognize her activities, without ever involving the user in any labeling and regardless of where the user places the sensor systems. We suggested ways to extend the methodology to other sensor systems, in order to extract the most from existing pieces of installed infrastructure. An extension of the proposed method to the case where different users interact with the same smart environment can be envisioned. In such a case, magnetic switches may not be enough to detect who is interacting with an object. If doors and drawers also include an accelerometer, then a correlation-based method can detect who is using the items [92].


6 Learner candidate selection

In this chapter, we tackle the question of how to rank different candidate learner domains according to the accuracy that they are likely to obtain once trained. We formulate this as a feature ranking problem under class noise, for which we propose two new approaches. We finally compare new and state-of-the-art approaches on two activity recognition datasets.


6.1 Introduction

In the previous chapters, we presented an approach to train new learner domains by means of existing ones. We now tackle the problem of ranking different possible candidate learners in terms of the accuracy that they can achieve when trained by teacher domains. A practical application example would be the case of a user wearing a smart shirt with many integrated motion sensors that are trained by a smartphone to recognize activities (see Figure 2.1). Among the many motion sensors, there are likely a handful which are placed in more informative locations and can classify the activities of interest, while the others can be switched off to save power. The problem is then to rank the candidate learners, given that the teacher produces labels that can differ from ground-truth.

Formally, consider a teacher domain DT(XT, P(XT)) associated with a task TT defined by (Y, fT()), and K candidate learner domains DL1(XL1, P(XL1)) . . . DLK(XLK, P(XLK)). Let each domain DLi be associated with a task TLi(Y, fLi()). We propose to calculate a ranking score Ψi((xLi1, yT1) . . . (xLiN, yTN)) for each candidate learner domain DLi. Denoting by a1, a2, . . . , aK the accuracies obtained when classifying data within the K domains, the ranking scores Ψi should have the following properties:

1. Provide correct ranking: ∀i, j | Ψi > Ψj ⇒ ai > aj (i.e. higher rank corresponds to higher accuracy);

2. Have the lowest possible computational complexity. This allows saving battery power when implementing the ranking on wearable sensor nodes.

In the remainder of this chapter:

• We introduce two new ranking scores, "Bounding Box overlap" (ΨBB) and "Same Class Neighborhood" (ΨSCN), which allow performing the ranking with low complexity, in O(N) and O(N log N) respectively.

• We investigate the use of three new ranking scores based on the use of classifiers (NCC, Naive Bayes and kNN) as filters for ranking feature subsets.


• We benchmark the five proposed scores against state-of-the-art scores based on RELIEF-F, Mutual Information and Fisher separability, showing that the proposed scores often outperform the ones derived from the feature selection literature.

• We compare the robustness of the performance of the new ranking scores, as well as RELIEF-F, Mutual Information and Fisher separability, with respect to class noise, uniformly distributed among the classes.

6.2 Method

In the following section, we outline the three ranking scores ΨF, ΨMI and ΨRELF based on state-of-the-art feature selection algorithms, used as benchmarks. Each score is calculated for each candidate learner domain DLi; hence we drop the explicit notation Li for the sake of simplicity.

6.2.1 Review of state-of-the-art ranking scores

Ranking score based on Fisher separability [32]

Within each candidate learner domain D and for each li ∈ Y, let us define ξi as {xj | yTj = li}, i.e. ξ1 . . . ξC are the clusters of data vectors labeled by the teacher as l1 . . . lC. Let ni = |ξi|. The within-cluster (SW) and between-cluster (SB) scatter matrices are calculated as:

SW = Σ_{k=1}^{C} Σ_{xj ∈ ξk} (xj − µk)(xj − µk)^tr and (6.1)

SB = Σ_{k=1}^{C} nk (µk − µ)(µk − µ)^tr, (6.2)

where µk and µ are respectively the centroids of the data belonging to the k-th cluster and to the whole set of patterns. The degree of separability is then expressed as:

ΨF = trace(SB) / trace(SW), (6.3)

The complexity for the calculation of the score is O(N).
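As an illustration, the Fisher score of Eqs. (6.1)-(6.3) can be computed in a few lines. The following NumPy sketch (function name and interface are our own, not from the thesis) accumulates the scatter matrices class by class over the teacher-labelled data:

```python
import numpy as np

def fisher_score(X, y):
    """Psi_F = trace(S_B) / trace(S_W), cf. Eqs. (6.1)-(6.3).

    X: (N, M) feature matrix; y: (N,) teacher labels."""
    mu = X.mean(axis=0)                      # centroid of all patterns
    M = X.shape[1]
    S_W = np.zeros((M, M))                   # within-cluster scatter
    S_B = np.zeros((M, M))                   # between-cluster scatter
    for c in np.unique(y):
        Xc = X[y == c]                       # cluster of points labelled c
        mu_c = Xc.mean(axis=0)
        d = Xc - mu_c
        S_W += d.T @ d
        g = (mu_c - mu)[:, None]
        S_B += len(Xc) * (g @ g.T)
    return float(np.trace(S_B) / np.trace(S_W))
```

A single pass over the data suffices, consistent with the O(N) complexity stated above; well-separated, tight clusters yield a large score.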


Ranking Score Based on Mutual Information

The Mutual Information [23] between two variables x and y is defined as:

MI(x, y) = ∫∫ p(x, y) log [p(x, y) / (p(x) p(y))] dx dy (6.4)

For feature sets, the score ΨMI is calculated1 according to the "Max-Relevance" criterion [25] as the average MI between the features and the class labels:

ΨMI = (1/M) Σ_{n=1}^{M} MI(θn, y) (6.5)

where θn and y represent the random variables modeling respectively the n-th feature and the label stream.

The complexity for the calculation of the score is O(N).
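A minimal sketch of the Max-Relevance score, assuming continuous features are first discretized into histogram bins and labels are small non-negative integers (the thesis instead relies on the toolbox of Peng et al. [25]; names here are illustrative):

```python
import numpy as np

def mutual_information(f, y, bins=10):
    """MI (in nats) between one feature and integer class labels,
    estimated from a joint histogram after discretizing the feature."""
    f_disc = np.digitize(f, np.histogram_bin_edges(f, bins=bins)[1:-1])
    joint = np.zeros((int(f_disc.max()) + 1, int(y.max()) + 1))
    for a, b in zip(f_disc, y):
        joint[a, b] += 1
    joint /= joint.sum()                      # empirical p(x, y)
    pf = joint.sum(axis=1, keepdims=True)     # p(x)
    py = joint.sum(axis=0, keepdims=True)     # p(y)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pf @ py)[nz])).sum())

def psi_mi(X, y, bins=10):
    """Max-Relevance (Eq. 6.5): average MI between the M features and the labels."""
    return float(np.mean([mutual_information(X[:, n], y, bins)
                          for n in range(X.shape[1])]))
```

A feature strongly coupled to the labels obtains a higher MI than an independent one, which is the property the ranking exploits.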

Ranking Score Based on RELIEF-F

RELIEF-F [30] works as follows: for each feature θn, it builds a relevance score W(θn) which estimates the probability difference

W(θn) = P(different θn | different class) − P(similar θn | same class) (6.6)

We calculate the score for an entire feature set as the average across the single features:

ΨRELF = (1/M) Σ_{n=1}^{M} W(θn) (6.7)

The complexity for the calculation of the score is O(N).
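For illustration, a Relief-style weight can be estimated by comparing each sample's feature-wise distance to its nearest hit (same class) and nearest miss (different class). This brute-force sketch is O(N²) rather than the O(N) figure quoted above, assumes comparably scaled features, and is a simplified variant, not the exact RELIEF-F of [30]:

```python
import numpy as np

def relief_weights(X, y):
    """Simplified Relief weights: reward features on which each sample is
    far from its nearest miss and close to its nearest hit."""
    N, M = X.shape
    W = np.zeros(M)
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                   # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        W += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return W / N

def psi_relf(X, y):
    """Average relevance over the feature set, as in Eq. (6.7)."""
    return float(relief_weights(X, y).mean())
```

A discriminative feature accumulates a positive weight, while a noise feature hovers near zero, so averaging the weights ranks feature sets.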

6.2.2 Proposed ranking scores

In this section, we introduce the five new ranking scores. ΨNCC, ΨNB and ΨkNN are based on classifiers used as filters. ΨBB and ΨSCN try to capture the degree of overlap between distributions of points belonging to different classes in the feature spaces.

1 We use the toolbox provided by Peng et al. [25] for the calculations.


Classifiers used as filters: NCC, Naive Bayes and kNN

We propose to use a hybrid feature selection approach [38, 39], but relying on classifiers whose training has a smaller computational complexity. The scores ΨNCC, ΨNB and ΨKNN are set to the accuracies obtained when classifying the data corresponding to each feature subset. The accuracy is calculated through cross-validation on the data, and only the set of teacher labels yT1 . . . yTN is used (that is, this accuracy does not coincide with the accuracy measured against ground-truth). The computational complexities are the following:

• O(N) for ΨNCC;

• O(N) for ΨNB;

• O(N log N) for ΨKNN if the classifier relies on a kd-tree storage.
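As a sketch of the filter idea, the NCC-based score can be computed as a k-fold cross-validated Nearest-Class-Center accuracy evaluated purely on the teacher labels (illustrative pure-NumPy code; function name and interface are ours):

```python
import numpy as np

def psi_ncc(X, yT, folds=5, seed=0):
    """Filter score: k-fold cross-validated Nearest-Class-Center accuracy,
    trained and evaluated only on the teacher labels yT."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    correct = 0
    for part in np.array_split(idx, folds):
        train = np.setdiff1d(idx, part)                  # remaining folds
        classes = np.unique(yT[train])
        centroids = np.stack([X[train][yT[train] == c].mean(axis=0)
                              for c in classes])
        # distance of each held-out point to every class centroid
        d = np.linalg.norm(X[part][:, None, :] - centroids[None, :, :], axis=2)
        correct += np.sum(classes[np.argmin(d, axis=1)] == yT[part])
    return correct / len(X)
```

Because ground-truth never enters the computation, the score can be obtained on a deployed, untrained node; the same scheme applies with Naive Bayes or kNN in place of NCC.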

6.2.3 Bounding Box Overlap Score

The idea behind this score is to enhance an existing score used as a cluster similarity measure [35] with the intuition that stronger overlap between data in the feature spaces leads to lower classification accuracy. Let R be the C-by-C similarity matrix [35], whose elements r(h,k) represent the degree of similarity between clusters of class h and class k. Each element is a pairwise dispersion, defined as r(h,k) = (sh + sk) / m(h,k), where s and m are calculated as follows:

sh = [(1/nh) Σ_{xj ∈ ξh} |xj − µh|²]^{1/2} (6.8)

m(h,k) = [Σ_{n=1}^{M} |µh(n) − µk(n)|²]^{1/2}, (6.9)

where µh(n) is the n-th element of the centroid vector µh. We introduce the overlap matrix Γ = [γ(h,k)]. Each element γ(h,k) represents the degree of overlap of the axes-aligned bounding boxes enclosing the clusters containing the data points of class h and k. The choice of axes-aligned bounding boxes avoids using computationally expensive algorithms that calculate convex hulls. Each bounding box is calculated as follows:

1. Define B subsets Ωh,1, . . . , Ωh,B such that ∪b Ωh,b = ξh and ∩b Ωh,b = ∅.

Page 111: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

94 Chapter 6: Learner candidate selection


Figure 6.1: Illustration of the principle behind the "Bounding Box" score with two classes (label space: Y = {x, o}) and two candidate learner domains DL1 and DL2 in a three-dimensional feature space. In the figure on the left, the bounding boxes are enclosed one within the other; in this case γ(o,x) = 1. On the right, the overlap amounts to one quarter of the volume of each bounding box, giving γ(o,x) = 0.25.

2. Calculate the components of the vectors identifying two opposite corners of the bounding box:

lh(n) = (1/B) Σ_{b=1}^{B} min{xi(n) | xi ∈ Ωh,b} (6.10)

uh(n) = (1/B) Σ_{b=1}^{B} max{xi(n) | xi ∈ Ωh,b} (6.11)

where lh(n) and uh(n) represent the n-th component of the corners lh and uh respectively.

The parameter B allows trading sensitivity to outliers for precision in the calculation of the bounding boxes. The overlap between the bounding boxes is calculated and normalized as follows:

γ(h,k) = [Π_{n=1}^{M} max(0, min(uh(n), uk(n)) − max(lh(n), lk(n)))] / [Π_{n=1}^{M} (uh(n) − lh(n))] (6.12)

The numerator is the volume of the region of overlap between classes h and k. The denominator is the volume of the hyper-rectangle enclosing class h. Fig. 6.1 illustrates an example with two classes "x" and "o" where the bounding boxes and their corners lx, ux, lo and uo are shown.


We define the score ΨBB as:

ΨBB = 1 − (1/C) Σ_{h=1}^{C} max_k (r(h,k) · γ(h,k)) (6.13)

The multiplication r(h,k) · γ(h,k) modulates the similarity values. This accounts for the fact that if all cluster pairs are completely separated (γ(h,k) = 0), the classification will be perfect, regardless of the value of the cluster similarity measure; in this case, ΨBB achieves its maximum value of 1.

The complexity for the calculation of ΨBB is O(N).
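The two building blocks of ΨBB, the robust corner estimates of Eqs. (6.10)-(6.11) and the normalized overlap of Eq. (6.12), can be sketched as follows (illustrative NumPy code; it assumes each cluster holds at least B points, and the names are ours):

```python
import numpy as np

def bbox_corners(Xh, B=4, seed=0):
    """Robust axis-aligned bounding box of one cluster: per-dimension
    min/max averaged over B random subsets (Eqs. 6.10-6.11)."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(Xh)), B)
    l = np.mean([Xh[p].min(axis=0) for p in parts], axis=0)
    u = np.mean([Xh[p].max(axis=0) for p in parts], axis=0)
    return l, u

def bbox_overlap(l_h, u_h, l_k, u_k):
    """gamma(h,k): overlap volume normalized by the volume of box h (Eq. 6.12)."""
    inter = np.maximum(0.0, np.minimum(u_h, u_k) - np.maximum(l_h, l_k))
    return float(np.prod(inter) / np.prod(u_h - l_h))
```

Note the asymmetry of the normalization: a small box fully nested in a large one has γ = 1 with respect to its own volume, while the large box sees only a small γ, which matches the nested case shown in Fig. 6.1.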

6.2.4 “Same Class Neighborhood” Score

We assume that a classifier can achieve a better accuracy if, in the feature space, each data point has on average a high number of neighboring points that belong to the same class. This indicates that points of the same class cluster together, suggesting low class mixing. To calculate the level of mixing, the neighborhood of each data point is analyzed. For each point xi belonging to class h, the nearest neighbor xj belonging to a different class is located. Let d be the distance between xi and xj. We count how many points belonging to class h have a distance to xi smaller than d. The concept is illustrated in Fig. 6.2 and the detailed procedure is reported in Algorithm 2.

The computational complexity of this score is O(N log N) for a kd-tree implementation of the storage of the data points. It is instead O(N²) in the case of a standard implementation.

6.3 Datasets and evaluation procedure

In this section we describe the datasets and the procedures used to evaluate the ranking scores. We considered two datasets belonging to the specific problem domain of human activity recognition. The first is a subset of the OPPORTUNITY Activity Recognition Dataset [63] presented in this thesis (see Section 3). The second, the Skoda Mini-Checkpoint Dataset2 [93], includes instances of manipulative gestures performed in a car maintenance scenario. In both datasets, body-worn accelerometers collected the movements of the subjects. These

2http://www.wearable.ethz.ch/resources/Dataset/skoda_mini_checkpoint/SkodaMiniCP.zip



Figure 6.2: Illustration of the principle behind the "Same Class Neighborhood" score with two classes (label space: Y = {x, o}) and two candidate learner domains DL1 and DL2. The data points in feature space XL1 (left figure) are more strongly overlapping than the ones in XL2 (right). This is reflected in the number of points (count) of the same class, measured within the neighborhood of each data point. ΨSC is defined as the average of the counts obtained in the neighborhood of all data points.

Algorithm 2: "Same Class Neighborhood" Score

for k = 1 → C do
    countk ← 0
    for all xj ∈ ξk do
        d ← min(‖xj − xi‖L2) ∀xi ∉ ξk
        for all xi ∈ ξk do
            if ‖xj − xi‖L2 < d then
                countk ← countk + 1
            end if
        end for
    end for
    countk ← countk / nk²
end for
ΨSC ← (1/C) Σ_{k=1}^{C} countk


datasets are a natural playground in which to simulate the appearance of new untrained sensing systems (learners), in the presence of a set of (teacher) systems trained to recognize activities (see Section 6.1). In these datasets, ranking feature sets effectively corresponds to ranking physical sensor systems extracting certain features, which are candidate learners as described in Sec. 6.1.

6.3.1 Datasets

• From a subset of four subjects from the OPPORTUNITY dataset, we selected instances of C = 4 postures/modes of locomotion (stand, sit, lie, walk). We extracted four feature sets from acceleration data collected from eight body-worn sensors, giving K = 32 candidate learner domains. The positions of the sensors are shown in Fig. 4.2. The extracted features are:

1. Angle between x and y acceleration components; angle between y and z acceleration components; standard deviation of acceleration magnitude (M = 3).

2. Mean of x, y and z acceleration components; standard deviation of acceleration magnitude (M = 4).

3. Mean crossing rate of x, y and z acceleration components; standard deviation of acceleration magnitude (M = 4).

4. Mean and standard deviation of x, y and z acceleration components (M = 6).

• In the Skoda Mini-Checkpoint Dataset [93], the subject performed C = 10 manipulative gestures. We used acceleration data from 20 body-worn sensors (10 on each arm), from which we extracted three feature sets, giving K = 60 candidate learners. The extracted feature sets are:

1. Angle between x and y acceleration components; angle between y and z acceleration components; standard deviation of acceleration magnitude (M = 3).

2. Mean of x, y and z acceleration components in 2 subwindows within each signal instance; standard deviation of acceleration magnitude (M = 7).

3. Mean of x, y and z acceleration components in 6 subwindows within each signal instance; standard deviation of acceleration magnitude (M = 19).


Dataset            C    N     K    M
Opportunity        4    4200  32   3/4/6
Car Manufacturing  10   725   60   3/7/19

Table 6.1: Characteristics of the datasets used for the simulations. C = number of classes; N = number of instances; K = number of candidate feature sets; M = dimensionality of the feature spaces (number of features within each feature set).

In Table 6.1 we summarize the main characteristics of the datasets.

6.3.2 Ranking accuracy evaluation procedure

For each candidate feature set XLi, the classification accuracy ai reached by the candidate learner is calculated by performing a 5-fold cross-validation on the N feature vectors xj ∈ XLi using four classifiers: Nearest Class Center (NCC), kNN (k = 5), Naive Bayes and Decision Tree. These classifiers are commonly used in sensor networks performing activity recognition. They are quite diverse in the complexity of the decision boundaries that they can achieve.

We here describe how we carry out the evaluation of the ranking scores and their robustness with respect to class noise.

1. The first step is to generate the sets of teacher labels. Let ε be the fraction of teacher labels differing from ground-truth. First, the set of teacher labels is initialized with the ground-truth labels: yTi = yGi (ε = 0). We then sweep ε from 0 to 0.9. For each value of ε, a new set of teacher labels is generated as follows:

• Each label yTi is selected for being noisy with probability ε.

• If yTi is selected to be noisy, the value for the label is extracted randomly from the label space Y, excluding the ground-truth label yGi.

• If yTi is not selected to be noisy, then it is set to ground-truth (yTi = yGi).

2. For each value of ε, the candidate learners are trained on the training set using the set of teacher labels yTi. Test instances are


classified by the trained learners and the output labels are compared to ground-truth. The comparison gives the true classification accuracy a.

3. For each value of ε, the scores ΨF, ΨBB, ΨSC, ΨMI, ΨRELF, ΨNCC, ΨKNN and ΨNB are calculated on the feature vectors belonging to XLi.

4. For each value of ε and for each candidate learner XLi, the pairs {Ψi, ai} are used to assess the ranking performance as follows:

• Ranking accuracy of the scores: for each pair of candidates XLi and XLj, characterized by {Ψi, ai} and {Ψj, aj}, a successful ranking (SR) is counted every time that we have (Ψi > Ψj ∧ ai > aj) ∨ (Ψi < Ψj ∧ ai < aj). A wrong ranking (WR) is counted otherwise. The ranking accuracy is then calculated as α = SR / (SR + WR).

• For each wrong ranking, we compute the accuracy difference δ = |ai − aj|. This measures how severe the mistake was. Having a wrong ranking between candidate learners whose accuracies are nearly the same (low value of δ) is less severe than having a wrong ranking where the accuracy difference is big (high value of δ). The mean µδ and standard deviation σδ of the set of δ are calculated. See Fig. 6.3 for a graphical representation of the calculation of δ.
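Steps 1 and 4 of this procedure can be sketched compactly. The following illustrative NumPy code (function names are ours) injects uniform class noise into ground-truth labels and computes the pairwise ranking accuracy α together with the mean error µδ over the wrongly ranked pairs:

```python
import numpy as np

def noisy_teacher_labels(y_gt, eps, n_classes, seed=0):
    """Step 1: flip each ground-truth label with probability eps to a
    uniformly drawn *different* class."""
    rng = np.random.default_rng(seed)
    yT = y_gt.copy()
    flip = rng.random(len(y_gt)) < eps
    for i in np.flatnonzero(flip):
        yT[i] = rng.choice([c for c in range(n_classes) if c != y_gt[i]])
    return yT

def ranking_accuracy(scores, accs):
    """Step 4: alpha = SR / (SR + WR) over all candidate pairs, plus the
    mean accuracy gap mu_delta over the wrongly ranked pairs."""
    sr, wr, deltas = 0, 0, []
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if (scores[i] - scores[j]) * (accs[i] - accs[j]) > 0:
                sr += 1                      # successful ranking
            else:
                wr += 1                      # wrong ranking (ties count as wrong)
                deltas.append(abs(accs[i] - accs[j]))
    return sr / (sr + wr), (float(np.mean(deltas)) if deltas else 0.0)
```

A score that orders the candidates exactly as their true accuracies yields α = 1 and µδ = 0; a fully inverted ordering yields α = 0 with µδ equal to the mean pairwise accuracy gap.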

6.4 Results

In this section we present the results of the evaluation, illustrating the cases without and with class noise. We exclude from the performance evaluation the cases where the same classifier is used both as a filter for feature set ranking and as the classification algorithm, e.g. when ΨNCC is the ranking score and the actual classification is performed with NCC: in these cases, in the absence of class noise, the ranking accuracy is trivially α = 1.

Table 6.2 shows the ranking accuracies α and the ranking error means µδ and standard deviations σδ with ε = 0 for all the scores and classifiers, reporting the details for each dataset.

The proposed scores (ΨNCC, ΨKNN, ΨNB, ΨSC and ΨBB) nearly always outperform the state-of-the-art scores (ΨMI, ΨRELF and ΨF) in the


NCC kNN Naive Bayes Decision TreeOPPORTUNITY Activity Recognition Dataset

ΨF 0.66/0.10/0.06 0.74/0.08/0.06 0.69/0.07/0.05 0.75/0.07/0.06ΨBB 0.69/0.11/0.08 0.71/0.09/0.06 0.64/0.08/0.06 0.71/0.08/0.06ΨSC 0.80/0.07/0.05 0.74/0.07/0.04 0.70/0.07/0.05 0.73/0.06/0.04ΨMI 0.77/0.08/0.06 0.77/0.06/0.04 0.79/0.06/0.04 0.77/0.05/0.04ΨRELF 0.76/0.08/0.06 0.75/0.06/0.04 0.84/0.05/0.03 0.75/0.05/0.04ΨNCC -/-/- 0.81/0.06/0.05 0.78/0.06/0.04 0.80/0.06/0.05ΨKNN 0.81/0.08/0.07 -/-/- 0.82/0.05/0.05 0.96/0.02/0.01ΨNB 0.78/0.09/0.07 0.82/0.06/0.05 -/-/- 0.82/0.06/0.04

Skoda Mini-Checkpoint DatasetΨF 0.59/0.08/0.07 0.52/0.06/0.06 0.56/0.06/0.05 0.50/0.07/0.06ΨBB 0.81/0.03/0.06 0.82/0.01/0.03 0.82/0.02/0.04 0.80/0.02/0.03ΨSC 0.84/0.09/0.13 0.87/0.01/0.02 0.85/0.07/0.10 0.87/0.01/0.02ΨMI 0.58/0.08/0.07 0.51/0.06/0.06 0.55/0.06/0.05 0.50/0.07/0.06ΨRELF 0.70/0.06/0.07 0.66/0.05/0.05 0.70/0.04/0.04 0.64/0.05/0.05ΨNCC -/-/- 0.88/0.03/0.05 0.89/0.01/0.01 0.85/0.03/0.04ΨKNN 0.88/0.07/0.12 -/-/- 0.89/0.06/0.10 0.88/0.01/0.01ΨNB 0.89/0.01/0.02 0.89/0.03/0.05 -/-/- 0.86/0.03/0.04

Table 6.2: Ranking accuracy α / mean µδ / standard deviation σδ of the accuracy error in case of wrong ranking for the two datasets. For each dataset and classifier used, we marked in bold the best result in terms of ranking accuracy.


6.4. Results 101


Figure 6.3: Examples of computation of the ranking accuracy and accuracy error in case of wrong ranking for two scores ΨA and ΨB. The plots depict accuracy values (a) versus the calculated scores (ΨA and ΨB) for five candidates. Out of ten one-versus-one comparisons, we have two wrong rankings (WR) in both cases (ranking accuracy 80 % for both scores). The mean error would be µδ = (δ1 + δ2)/2, which is higher for the second score. In this example, ΨA should then be preferred since it makes as many mistakes as ΨB, but less severe ones.
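The pairwise procedure described in the caption can be sketched in a few lines. This is a minimal illustration (our own helper, not the thesis code) that computes the ranking accuracy α as the fraction of concordant one-versus-one pairs and µδ as the mean accuracy gap over the wrongly ranked pairs:

```python
def ranking_metrics(scores, accuracies):
    """Ranking accuracy (alpha) and mean accuracy error (mu_delta)
    from one-versus-one comparisons of candidate feature sets."""
    correct, errors, pairs = 0, [], 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            pairs += 1
            # A pair is correctly ranked when the score ordering and
            # the accuracy ordering agree (ties counted as correct).
            if (scores[i] - scores[j]) * (accuracies[i] - accuracies[j]) >= 0:
                correct += 1
            else:
                errors.append(abs(accuracies[i] - accuracies[j]))
    alpha = correct / pairs
    mu_delta = sum(errors) / len(errors) if errors else 0.0
    return alpha, mu_delta

# Five candidates with two wrongly ranked pairs, as in Fig. 6.3:
alpha, mu_delta = ranking_metrics([2, 1, 3, 5, 4], [0.5, 0.6, 0.7, 0.8, 0.9])
```

With these values, two of the ten pairs are discordant, giving α = 0.8 and µδ = 0.1.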

considered datasets. Specifically, the ranking accuracies of ΨNB and ΨNCC are nearly always higher than those of all the other scores when NCC, Naive Bayes and kNN are used as classifiers. ΨSC and ΨKNN usually outperform the other scores when the Decision Tree is used as classifier. ΨBB is the worst performing among the proposed scores, but it is still often better than ΨMI, ΨRELF and ΨF. Among the state-of-the-art scores, ΨRELF is nearly always better than ΨMI, which in turn performs better than ΨF.

The results in terms of the reachable ranking accuracies depend on the dataset, but similar trends can be observed. With the OPPORTUNITY Activity Recognition dataset, the best ranking scores have ranking accuracies between 81 % and 96 % and average accuracy error µδ between 2 % and 8 %. With the Skoda Mini-Checkpoint dataset, the average accuracy error µδ is in many cases as low as 1 %. This means that either the ranking is correct or the wrongly ranked feature spaces obtained classification accuracies differing by 1 % on average.

6.4.1 Sensitivity Analysis with respect to Class Noise

We hereafter analyze the influence of class noise on the feature set ranking. Class noise impacts both the ranking ability and the performance


that classifiers achieve on the feature sets. The results are shown in Fig. 6.4. The figure shows the ranking accuracy (a), the average accuracy error among wrongly ranked pairs (b) and the classification accuracy (c) as a function of the teacher error rate. In the bottom plots (c), the average and standard deviation of the classification accuracy are depicted.

The ranking accuracy of the scores decreases in nearly all cases by less than 10 % with noise levels up to 40 %. Beyond that, the scores based on using classifiers as filters (ΨNCC, ΨKNN and ΨNB) seem to outperform the others in terms of ranking accuracy and suffer from the smallest accuracy errors in case of wrong ranking, throughout the whole range of noise levels.

The state-of-the-art scores (based on Mutual Information, RELIEF-F and Fisher) perform worse than the proposed ones over the whole range of noise levels. In the OPPORTUNITY Activity Recognition dataset, the score based on RELIEF-F shows a steeper decrease in ranking accuracy with increasing class noise compared to the other scores, while in the Skoda Mini-Checkpoint dataset the steepness is comparable to that of the other scores.

6.4.2 Implication of the results on the teacher-learner problem

We showed that ranking scores can achieve high accuracies at ranking candidate learners. This holds also when the teacher labels are corrupted by class noise. These results have implications for the application at hand, that is, the concrete problem of learners building models using labels provided by trained teachers. In the application, as activities take place, a teacher classifies them into labels. These labels, which can differ from ground truth, are sent to the learners, who associate feature vectors with these labels. Each learner then calculates a ranking score using only the teacher labels. According to the results presented, ranking the learners by the output of their ranking scores also provides a good ranking with respect to how well the learners will recognize activities.

6.5 Discussion

The ability of ΨNCC and ΨNB to obtain a good ranking regardless of the classifier used on the learner might seem surprising. The explanation might be that, for our datasets, feature vectors belonging to different



Figure 6.4: Influence of class noise on the ranking accuracy (a) and accuracy error (b) for the OPPORTUNITY Activity Recognition and Skoda Mini-Checkpoint datasets. The plots at the bottom (c) show the decrease in classification accuracy when training the candidate learners in presence of class noise (increasing teacher error rate). The plots (c) represent average and standard deviation across the four classifiers used (NCC, kNN, Naive Bayes and Decision Tree).



Figure 6.5: Ranking accuracy and accuracy error on synthetic 2-class datasets constituted by mixtures of one to ten Gaussians. As the number of Gaussians representing each class increases, the performance of ΨNCC and ΨNB experiences a sudden drop.

classes are localized in different areas of the feature space. In this scenario, if for two feature sets XLi and XLj we have Ψi > Ψj, then the feature vectors belonging to different classes in XLi overlap less than those in XLj. Therefore, most classifiers will also perform better on XLi than on XLj. This will happen regardless of the ability of the classifier to produce arbitrarily complex decision boundaries. The picture changes if we increase the complexity of the class-conditional distributions in the feature spaces. To investigate this, we built two-dimensional, two-class synthetic datasets where each class is represented by points sampled from a mixture of one to ten Gaussians. We then assessed the performance of the ranking scores on these synthetic data. As can be seen in Fig. 6.5, already when going from one to three Gaussians per class, the performance degradation of ΨNCC and ΨNB is much more pronounced than for all other scores. This suggests that ΨNCC and ΨNB are suitable for feature sets where each class is represented by a quite compact set of points in the feature space. Otherwise, ΨKNN and ΨSC are advisable.

The simulations in presence of class noise show that wrong teacher labels do not necessarily prevent the scores from correctly ranking the feature sets. The accuracy error for wrong rankings δ increases up to 10 % compared to the case without noise, but the ranking accuracy



Figure 6.6: Example illustrating the impact of class noise with two classes (label space: Y = {x, o}) and two candidate learner domains DL1 and DL2. In the upper figures, teacher labels match the ground truth. Ranking scores calculated on these domains will give Ψ2 > Ψ1. Adding ε = 0.3 class noise modifies the feature spaces, but ranking scores will still perceive the candidate DL2 as better, since the wrong labels do not completely destroy the order in the feature space XL2.

experiences only a gentle degradation even for ε > 0.4. The noise resilience of the ranking scores is due to the fact that the scores are always used to make comparisons between candidates and not for absolute assessments. This is best explained with an example (see Fig. 6.6).
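The class-noise injection used in these simulations can be sketched as follows. This is a hypothetical helper (our own, not the thesis code), assuming each teacher label is flipped to a uniformly chosen different class with probability ε:

```python
import random

def inject_class_noise(labels, classes, eps, seed=0):
    """Replace each teacher label with a different class with
    probability eps (the teacher error rate)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < eps:
            # Flip to any class other than the true one.
            y = rng.choice([c for c in classes if c != y])
        noisy.append(y)
    return noisy
```

With ε = 0 the labels are untouched; with ε = 0.3, on average 30 % of the labels sent to the learners differ from ground truth.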

The results show that ΨBB systematically performs worse than ΨNCC and ΨNB. Its performance is often comparable to the Fisher score. Nevertheless, in the Car Manufacturing dataset, it outperforms the Fisher score by up to 30 %, which makes the score promising for further investigation. A thorough investigation of the outlier rejection policy for building the bounding boxes will be needed in future work.


The performance could improve if the bounding boxes were aligned to the main directions present in the data or by using more sophisticated shapes (e.g. by computing the convex hull around the data points). These solutions could nevertheless imply a computational complexity higher than what is desirable for an application where the resources are limited, like activity recognition performed by sensor nodes.

6.6 Conclusion

We introduced scores to rank different candidate learner domains with respect to their achievable accuracy when trained from existing teacher domains. We showed that the ranking is possible even when the teacher domains deliver up to 40 % of labels that differ from ground truth. By ranking candidate learner domains we can build systems that allow a user to wear many new sensors (like a smart shirt with many embedded accelerometers) and automatically detect the best sensors to recognize activities, while switching off all the others. This offers an advantage in terms of scalability, since it allows us to reduce the power consumption while still training the new sensors.


7 Signal-level transfer learning*

In this chapter, we present a transfer learning approach operating at the signal level. The first step is the estimation of a transformation which links teacher and learner signals. Using this transformation, training instances of the teacher domain are converted to the learner domain. As opposed to the classifier-level transfer, the signal-level transfer can be performed without needing the user to perform all the activities that need to be recognized. We validate the approach in an HCI scenario involving signal-level transfer between a vision-based 3D tracking system and accelerometers.

*based on “Kinect=IMU? Learning MIMO signal mappings to automatically translate activity recognition systems across sensor modalities” [76]


7.1 Introduction

In this chapter, we introduce a transfer learning approach which operates directly at the signal level. This approach is viable if a mapping can be established between the signals associated to the teacher and learner domains. The mapping is learned automatically by means of system identification techniques. The practical application example used to validate the approach is that of gesture recognition in an HCI scenario, where the transfer is operated between a camera-based (Kinect) system and body-worn accelerometers.

Formally, given a teacher domain DT = (XT, P(XT)) associated with a task TT defined by (YT, fT()) and a learner domain DL = (XL, P(XL)) associated with a task TL defined by (YL, fL()), we tackle the problem of finding a suitable function fL(), given the following assumptions:

1. YT = YL = Y, i.e., the label space associated with the learner domain is the same as that of the teacher domain;

2. The signals seen in the two domains, sT(n) and sL(n), are acquired simultaneously;

3. A mapping function between sT(n) and sL(n) exists at least in one direction. We call that function ΦT↔L, ΦT→L or ΦL→T, depending on whether the mapping can be found easily in both directions, only from teacher to learner, or vice versa.

The practical translations of the aforementioned assumptions are:

1. The learner needs to learn the same activities as the ones recognized by the teacher;

2. Teacher and learner sensor systems acquire data at the same time during the transfer phase;

3. The signals acquired by the sensor systems are linked by a transformation. An example of transformation is the double derivative, which links position to acceleration. Another example is scaling, which links the accelerations of sensors mounted on nearby points of the same limb. An example of signals that in general cannot be linked by a transformation are sound and acceleration. Consider a user raising her hand to activate the button of a coffee machine. The sound wave produced is completely unrelated to the


accelerometer signal. The same acceleration can in fact be measured when the user activates another device, which produces a different sound.

In general, we can state that a transformation exists if teacher and learner measure the motion of the same limb with the same or different modalities.

Throughout the chapter, we use the example of position and acceleration as the modalities used by the teacher and learner systems. Positions are expressed by 3D coordinates within a room. Accelerations are expressed by values along three axes.

7.2 Method

7.2.1 Finding the mapping

The first step of the proposed method is to learn the mapping between teacher and learner signals automatically. We express the mapping as a linear multiple-input-multiple-output (MIMO) system. The multiple inputs and multiple outputs are the different channels of the teacher and learner signals. In the case of mapping position to acceleration, the MIMO system is defined as follows:

• Three inputs: the three coordinates in space, referred to an arbitrary origin, e.g. one corner of a room;

• Three outputs: the three acceleration axes, which are referenced to the accelerometer's local frame of reference.

We here outline the procedure for building the mapping ΦT→L; the mapping ΦL→T is obtained by simply swapping the roles of the teacher and learner signals.

Let dT and dL be the number of channels of the teacher and learner, respectively. Let the multichannel signals be represented by the vectors sT(n) = (sT(1)(n), ..., sT(dT)(n))^tr and sL(n) = (sL(1)(n), ..., sL(dL)(n))^tr. Define z^-1 as the delay operator: applied to a signal, it generates a delayed version of the signal itself, z^-k s(n) = s(n - k). The MIMO system can be represented by a matrix of polynomials in z^-1:

    sL(n) = B(z^-1) sT(n).    (7.1)

Let q be the number of past input samples that influence the current output sample. We define τ_ik as the static delay from the k-th input


channel to the i-th output channel. The matrix B(z^-1) contains elements b_ik(z^-1) of the form:

    b_ik(z^-1) = b_ik^(0) z^-τ_ik + b_ik^(1) z^-(τ_ik+1) + ... + b_ik^(q) z^-(τ_ik+q)    (7.2)

Given the matrix B(z^-1), the mapping ΦT→L applied to sT(n) is:

    ΦT→L(sT(n)) = B(z^-1) sT(n)    (7.3)

The linear MIMO mapping is useful despite its simplicity. In real settings, signals from different settings might be linked by various mappings, including scaling, rotation and differentiation. In the following, we explain the meaning of the aforementioned transformations for practical signals and we illustrate how these are special cases of a linear MIMO mapping.

• Scaling between teacher and learner signals by a factor γ: sL(n) = γ sT(n). This happens for example when acceleration is measured at two different points on a limb. The MIMO mapping matrix is obtained by setting b_ik^(0) = γ and b_ik^(j) = 0 for all j > 0, for all i = k. Furthermore, all the coefficients b_ik^(j) with i ≠ k of the off-diagonal polynomials are zero, yielding a diagonal matrix.

• Rotation between the teacher and learner frames of reference. This happens for example if teacher and learner have the same modality and are mounted on the same limb, rotated with respect to each other. Let R be the rotation matrix. Then sL(n) = R sT(n). The corresponding MIMO mapping matrix is obtained by setting b_ik^(0) to the element at position (i, k) in the rotation matrix and by setting b_ik^(j) to zero for all j > 0.

• Teacher signal is the differential of order h of the learner signal. With h = 2, we obtain the second derivative. The second derivative is needed for example when mapping positions (seen by a vision system) to accelerations measured by an IMU. The corresponding MIMO mapping is obtained by setting b_ik^(j), for all j ≤ h and i = k, to the corresponding coefficients of the impulse response of the derivative. All the other coefficients are set to zero.
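The special cases above can be made concrete with a small sketch. This is our own illustration (assuming NumPy; `apply_mimo` is a hypothetical name, not from the thesis), in which each polynomial b_ik(z^-1) is stored as an array of FIR coefficients and a pure rotation collapses to the single memoryless coefficient b_ik^(0) = R[i, k]:

```python
import numpy as np

def apply_mimo(B, s_in):
    """Apply a causal MIMO FIR mapping. B[i][k] is a 1-D array with
    the coefficients of b_ik(z^-1); s_in has shape (d_in, n)."""
    d_out, n = len(B), s_in.shape[1]
    s_out = np.zeros((d_out, n))
    for i in range(d_out):
        for k in range(s_in.shape[0]):
            # Convolving with the coefficient sequence realizes the
            # polynomial in the delay operator z^-1; truncate to n samples.
            s_out[i] += np.convolve(s_in[k], B[i][k])[:n]
    return s_out

# Rotation as a special case: all higher-order terms are zero.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B_rot = [[np.array([R[i, k]]) for k in range(2)] for i in range(2)]
s = np.vstack([np.arange(5.0), np.zeros(5)])
rotated = apply_mimo(B_rot, s)  # the x-channel signal moves to the y channel
```

Scaling by γ is the same construction with a diagonal B whose entries are the single coefficient γ.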


7.2.2 Automatic learning of the mapping parameters

When teacher and learner deliver signals at the same time, we can estimate the coefficients in Eq. 7.1. The (q + 1) · dT · dL coefficients and the dT · dL static delays can be calculated from the teacher and learner signals with a least squares approach. For this purpose, expanding Eq. 7.1 for each channel i of the learner signal we obtain:

    sL(i)(n) = Σ_{k=1..dT} b_ik(z^-1) sT(k)(n)
             = b_i1^(0) sT(1)(n) + ... + b_i1^(q) sT(1)(n - q) + ...
             + b_idT^(0) sT(dT)(n) + ... + b_idT^(q) sT(dT)(n - q).    (7.4)

Let us consider n = 0 to be the last sample index available. Define H ≥ q as the index of the oldest available past sample. The sets of known samples are:

    {sT(k)(-H), sT(k)(-H + 1), ..., sT(k)(0)}, for all 1 ≤ k ≤ dT, and
    {sL(i)(-H), sL(i)(-H + 1), ..., sL(i)(0)}, for all 1 ≤ i ≤ dL.

Eq. 7.4 can then be written for each -H + q ≤ n ≤ 0, obtaining a system of H - q + 1 equations in the (q + 2) · dT · dL unknowns b_ik^(j) and τ_ik. The system of equations is overdetermined if H - q + 1 > (q + 2) · dT · dL, which can be easily satisfied in practice. The system can be written in matrix form for each channel i as:

    A b = c,    (7.5)

where A contains all the known samples of the teacher signals, b contains the unknowns, and c is a vector containing the known samples of the learner signals.

The system is solved via the Moore-Penrose pseudoinverse1 for each channel i. The solution is obtained as follows: A b = c ⇒ A^tr A b = A^tr c ⇒ b = (A^tr A)^-1 A^tr c = A+ c. Calculating the inverse (A^tr A)^-1 can create problems if A^tr A is close to singular. To stay on the safe side, we calculate that inverse using singular value decomposition (SVD) [94].

1http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse
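The truncated-SVD least-squares step can be sketched as follows; a minimal illustration (our own, not the thesis code), assuming NumPy and the 1/1000 relative threshold mentioned below:

```python
import numpy as np

def solve_truncated_svd(A, c, rel_tol=1e-3):
    """Least-squares solution of A b = c via a reduced-rank
    pseudoinverse: singular values below rel_tol times the largest
    one are discarded, together with their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s >= rel_tol * s[0]
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    # b = V diag(s_inv) U^tr c, i.e. the pseudoinverse applied to c.
    return Vt.T @ (s_inv * (U.T @ c))
```

On a well-conditioned overdetermined system this coincides with the ordinary least-squares solution; near-singular directions are simply dropped instead of being amplified.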


We use a reduced-rank approximation, discarding the singular values (and corresponding singular vectors) which are smaller than 1/1000 of the first singular value. The mapping function is then used to convert signals:

    ŝL(n) = ΦT→L(sT(n)) = B(z^-1) sT(n).    (7.6)

The hat ˆ denotes throughout the chapter the predicted signal. In the two following sections, we explain how to operate the transfer between teacher and learner in the two cases:

1. The mapping from the teacher to the learner signals ΦT→L is available.

2. The mapping from the learner to the teacher signals ΦL→T is available.

7.2.3 Training set translation

Once the mapping from the teacher to the learner signals ΦT→L has been learned from the signals, it is used to translate the teacher training set. Each labeled signal instance sT_i available for the teacher domain is translated into the learner domain:

    ŝL_i = ΦT→L(sT_i).    (7.7)

The learner training set is then obtained by extracting the features in the learner domain: xL_i = FL(ŝL_i). The learner training set is thus:

    {(FL(ΦT→L(sT_1)), yT_1), ..., (FL(ΦT→L(sT_N)), yT_N)}.

The learner training set is then used to create the objective predictive function fL(). The complete process (starting with the mapping learning) is shown in the diagram in Fig. 7.1. In the scheme, the position sensor acts as teacher and the accelerometer as learner.
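The translation step can be summarized in a compact sketch, where `apply_mapping` and `extract_features` are hypothetical stand-ins for ΦT→L and FL() (neither name comes from the thesis):

```python
def translate_training_set(teacher_instances, teacher_labels,
                           apply_mapping, extract_features):
    """Push each labeled teacher instance through the learned mapping,
    then extract learner-domain features, keeping the teacher labels."""
    return [(extract_features(apply_mapping(s)), y)
            for s, y in zip(teacher_instances, teacher_labels)]
```

The output pairs are exactly the learner training set of Eq. 7.7 followed by feature extraction.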

7.2.4 Continuous signal translation

If the mapping from the learner to the teacher signals ΦL→T is available, another method is used to operate the transfer from teacher to learner. The learner signals sL(n) are continuously translated to obtain signals in the teacher domain: ŝT(n) = ΦL→T(sL(n)). The learner then adopts


the same objective predictive function and the same feature extraction function as the teacher's:

    FL() ≡ FT(),    fL() ≡ fT().    (7.8)

The complete process (starting with the mapping learning) is shown in the diagram in Fig. 7.2. In the scheme, the accelerometer acts as teacher and the Kinect as learner.

7.3 Dataset

The test bench for the methods introduced in this chapter is a gesture recognition setup (Fig. 7.3). The setup includes five body-worn IMUs and a consumer vision-based skeleton tracking system (Microsoft Kinect). These sensors are commonly deployed for activity recognition. IMUs are now available in most smartphones. The Kinect allows activity-aware gaming on the XBox console2. It has been used for the recognition of activities of daily living [95] and for gait analysis [96]. The Kinect contains an 8-bit 640×480 RGB camera, an infrared (IR) LED projecting structured light and an IR camera. It computes on-the-fly an 11-bit 640×480 depth map in a range of 0.7-6 m from the reflected IR light. The driver fits a 15-joint skeleton to the depth map (proprietary algorithm similar to [97]) in real time and delivers 3D joint coordinates in millimeters measured from the Kinect center. Tracking is specified in a range of 1.2-3.5 m [98]. The Kinect is interfaced over USB to a PC. We use [99] to record the RGB and depth map videos and the joint coordinates at 30 Hz. Five IMUs (XSens [100]) wired to a PC sense the orientation of five upper body elements (see Fig. 7.3): right upper arm (RUA), right lower arm (RLA), left upper arm (LUA), left lower arm (LLA) and back (BACK). We use the CRN Toolbox [67] to acquire the raw sensor data and the orientation of the IMUs at 30 Hz.

The Kinect and XSens data are recorded and resampled offline to the regular Kinect sample comb to obtain a synchronized dataset comprising acceleration, position and labels. The subject performs five kinds of geometric gestures with the right hand, repeating the

2The Kinect sold 10 million units between its release on November 4th, 2010 and March 2011, earning it the Guinness World Record for the “fastest selling consumer electronic device”. Its low cost makes it affordable for many households.


sequence of five gestures 48 times. Each gesture lasts on average around 3 s. The label space for this experiment is therefore Y = {circle, infinity, slider, triangle, square}. These gestures were selected because similar ones can be recognized with wearable sensors [74] or the Kinect [98]. We also recorded a five-minute-long “idle” dataset, where the user performs infrequent low-amplitude arm movements and moves around, without any specific task. The user faces the Kinect within ±45° to ensure that no limb gets hidden, as seen from the Kinect.

7.4 Simulations and performance metrics

The dataset is partitioned at random into three separate sections: the first is used to build the mapping ΦT↔L, the second is used to build the teacher training set and to operate the transfer, and the third is the testing set on the learner domain. The random partitioning is repeated 20 times in an outer cross-validation loop.

7.4.1 Choice of teachers/learners and transfer method

The proposed approach is validated on a set of six combinations of teachers and learners from the dataset described in Sec. 7.3:

1. T = Coordinates Kinect right hand; L = Acceleration RLA

2. T = Coordinates Kinect right hand; L = Acceleration RUA

3. T = Coordinates Kinect right hand; L = Acceleration BACK

4. T = Acceleration RLA; L = Coordinates Kinect right hand

5. T = Acceleration RUA; L = Coordinates Kinect right hand

6. T = Acceleration BACK; L = Coordinates Kinect right hand

For all combinations, the mapping is built from the Kinect coordinates to the acceleration. The inverse mapping is not stable, since it would involve a noisy double numerical integration to reconstruct the position from acceleration. Therefore, for the first three combinations, the mapping ΦT→L is available and the method used is the training set translation (Section 7.2.3). For the last three combinations, the mapping ΦL→T is available and the method used is the continuous signal translation (Section 7.2.4).


7.4.2 Data used for mapping calculation

The mapping was learned on three different subsets of the data to investigate the characteristics of the signals that are needed to construct a working mapping. We give the following names to the learned mappings, depending on which data are used:

• Gesture-aspecific mapping (GAM): The signals sT(n) and sL(n) are built by stitching together one signal instance for each label in the set Y, totaling approximately 15 s of data. The signal obtained contains therefore the samples of one class, followed by the samples of another class, etc.

• Gesture-specific mapping (GSM): The signals sT(n) and sL(n) are built by selecting a single signal instance belonging to only one label contained in Y (amounting to approximately 3 s of data), to test whether a mapping learned on a single gesture can enable the transfer also for the other signals.

• Unrelated-dataset mapping (UDM): The signals sT(n) and sL(n) are extracted from the corresponding sensors in the “idle” dataset, meaning that the mapping is built on data which represent a null class for the gesture recognition problem at hand. For comparison with the GSM approach, again 3 s of data are selected. The UDM approach is used to assess whether the mapping is able to capture the physical transformations that link the signals, so that in practice a user would not need to perform a specific gesture to allow the mapping to be built.

Within the data partition selected for the mapping building in each of the outer cross-validation loops, the signals for building the mapping are chosen at random 100 times in an inner cross-validation loop.

7.4.3 Feature extraction and classification

We tested two feature sets. Feature set 1 (FS1) is built as follows:

• Each signal instance is split into four non-overlapping temporal windows;

• For each window, the mean value of the signal is calculated for each of the three axes.


FS1 thus amounts to 12 features (4 windows times 3 axes).

Feature set 2 (FS2) is built as follows:

• Each signal instance is split into four non-overlapping temporal windows;

• For each window, the maximum and minimum values of the signal are calculated for each of the three axes.

FS2 thus amounts to 24 features (4 windows times 3 axes, times 2 features).
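The two feature sets can be sketched in a few lines; a minimal illustration (our own helper, assuming NumPy and a per-instance array with one row per axis):

```python
import numpy as np

def extract_features(instance, with_minmax=False):
    """instance: array of shape (3, n) -- one gesture, three axes.
    Default: per-window, per-axis mean -> 12 features (FS1).
    with_minmax: per-window, per-axis min and max -> 24 features (FS2)."""
    feats = []
    # Four non-overlapping temporal windows along the time axis.
    for window in np.array_split(instance, 4, axis=1):
        if with_minmax:
            feats.extend(window.min(axis=1))
            feats.extend(window.max(axis=1))
        else:
            feats.extend(window.mean(axis=1))
    return np.array(feats)
```

For a (3, n) instance this yields a 12-dimensional vector for FS1 and a 24-dimensional vector for FS2, matching the counts above.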

The classification is performed with a k-Nearest-Neighbors classifier (k = 3).

7.4.4 Performance metrics

The performance of the mapping is evaluated by comparing the measured learner signal to the one obtained through the mapping (predicted). To perform the comparison, we use the following “Fit” metric. For the mapping ΦT→L, the Fit metric is calculated between the signal sL(n) and ŝL(n) = ΦT→L(sT(n)) for each channel i of sL(n):

    Fit_i = 1 - ( Σ_{n=-H..0} (sL(i)(n) - ŝL(i)(n))² )^(1/2) / ( Σ_{n=-H..0} (sL(i)(n) - s̄L(i))² )^(1/2)    (7.9)

    Fit = Σ_{i=1..dL} Fit_i    (7.10)

where s̄L(i) = (1/(H + 1)) Σ_{n=-H..0} sL(i)(n) indicates the temporal average of the signal sL(i). The metric is based on the ratio between the root-mean-square error of the measured and predicted signals and the standard deviation of the measured signal. This metric is a variation of the normalized root-mean-square deviation3.

If the predicted and actual signals are identical, the numerator is 0 and Fit = 1. If instead the predicted and measured signals differ, then Fit < 1. If the measured signal has a low standard deviation, the root-mean-square error needs to be comparatively low to achieve a

3https://en.wikipedia.org/wiki/Root_mean_square_deviation


value close to 1. This means that low-variance signals need to be approximated better than high-variance signals to give a good Fit value. Examples of Fit values are shown in Fig. 7.4.
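The per-channel Fit of Eq. 7.9 can be sketched as follows; a minimal illustration (our own, assuming NumPy and arrays with one row per channel):

```python
import numpy as np

def fit_metric(measured, predicted):
    """Per-channel Fit: one minus the ratio between the residual norm
    and the spread of the measured signal around its temporal mean.
    Rows are channels, columns are time samples."""
    err = np.sqrt(((measured - predicted) ** 2).sum(axis=1))
    centered = measured - measured.mean(axis=1, keepdims=True)
    spread = np.sqrt((centered ** 2).sum(axis=1))
    return 1.0 - err / spread
```

A perfect prediction gives Fit = 1 for that channel; any deviation pushes the value below 1, and the penalty is harsher for low-variance channels, as discussed above.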

The performance of the complete transfer process was evaluated by calculating the accuracy on the learner domain after the transfer has taken place. The teacher and learner domain baseline accuracies are calculated by cross-validation using only data of the teacher and learner, and represent the highest achievable accuracy in each domain. These are therefore upper bounds also for the transfer methods.

7.5 Results

The best fit is obtained with the GAM model (Fig. 7.5a). This was expected, as the MIMO models are learned on the dynamics of all gestures. Learning can also occur on one gesture of a given class (Fig. 7.5b). The best single gesture for learning a MIMO model is the circle. Thus, one gesture may be sufficient for the MIMO model to capture the dynamics of the physical system and extrapolate to a wider range of body movements. The UDM model does not achieve an adequate mapping (Fig. 7.5c). The idle dataset has only rare occurrences of larger-amplitude limb movements and is insufficient to represent the dynamics of the physical system. The fit worsens for mappings between less related body regions (hand to upper arm or hand to back). Nevertheless, the fit between hand and upper arm is close to that of the hand to lower arm. This suggests that this approach may be applicable to close-by and related limbs. The back acceleration is hardly predictable from the hand position.

Classification accuracy baselines in the teacher (BT) and learner domain (BL), and those after transfer to the target domain, are presented in Fig. 7.6. We present the results with feature set FS2. The features included in FS2 are sensitive to outliers (minimum and maximum operations). This set is therefore also expected to be more sensitive to an inaccurate signal mapping and should be treated as a worst case. The GSM models are learned on the "circle" gesture. The baselines indicate that the gestures can be classified with an accuracy of 98 % or more with the lower-arm and upper-arm accelerations in both source and target domains. The high accuracy obtained with the back acceleration (baseline of about 88 %) indicates that torso movements are correlated with the execution of the gestures. This is a particular characteristic of this scenario, which likely does not generalize to other scenarios. The results after transfer must be assessed according to the performance drop from the baselines. The drop from BT indicates how much worse the system becomes after transfer. The drop from BL indicates how much better a system devised specifically for the learner domain would be.

In the transfer between hand position and lower- or upper-arm acceleration, the GAM and GSM models tend to perform equally well. The best results are obtained when translating from hand position to lower-arm acceleration or vice versa, with less than a 4 % drop from BT. The drop in performance from BT is less than 8 % for the transfer from hand position to upper-arm acceleration and vice versa. The direction of the transfer does not affect the results much. The GSM models show that executing a single "circle" is sufficient to identify a mapping model that leads to a transfer with a performance drop between 1 % and 7 % from BT. The transfer between the hand position and the back acceleration shows a large drop from BL (10 % to 70 % depending on the case). The UDM models thus appear unsuitable for the transfer, which is consistent with the analysis of the Fit metric. The UDM models nevertheless improve when learned on more "idle" data (Fig. 7.7).

We tested the UDM model with up to 2000 samples, corresponding to 67 s of idle data. In this case, the performance is about 15 % and 30 % below the corresponding baselines for FS1 and FS2, respectively. This is an improvement compared to using less data. It suggests that, given enough data, a dataset from an unrelated domain could allow the MIMO models to capture the dynamics of the physical system. The difference between FS1 and FS2 highlights that an automatic selection of better features by the teacher or learner system may lead to improved results.

7.6 Discussion

7.6.1 Challenges and limitations

Accelerometers measure data in their own frame of reference, which moves with the sensor itself. The Kinect, on the contrary, uses a fixed (world) reference. This implies that mapping position to acceleration has to include not only a second derivative, but also a rotation which depends on the body posture. The linear MIMO model can only approximate the second derivative and a fixed rotation, which would be an average rotation. This may become an issue when the subject performs ample movements. In our dataset, the relative rotation of the frames of reference was limited for most gestures to ±(30°-40°). Only for the slider gesture does the lower arm rotate by almost 90° at the extreme of the movement, compared to the starting position.

The Kinect and other video-based tracking systems are affected by occlusions, i.e. some limbs might be hidden for a certain amount of time. This does not constitute a limiting factor. In fact, since a small amount of data is needed to learn the mapping, it is enough to use the data available between two successive occlusions. Data belonging to situations where an occlusion occurs will show a low value for Fit, so it is enough to let the system learn only when Fit is higher than a certain threshold, which needs to be determined.
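Such a gating rule could look as follows (a schematic sketch; `learn_mapping` and `fit` stand for any mapping-estimation and Fit-computation routines, and the threshold value is a placeholder to be tuned):

```python
def learn_between_occlusions(windows, learn_mapping, fit, threshold=0.5):
    """Keep the last mapping learned on a window whose Fit clears the threshold.

    windows: iterable of (teacher_window, learner_window) signal pairs.
    Windows recorded during occlusions yield a low Fit and are skipped.
    """
    mapping = None
    for teacher_win, learner_win in windows:
        candidate = learn_mapping(teacher_win, learner_win)
        if fit(learner_win, candidate(teacher_win)) > threshold:
            mapping = candidate  # accept only reliable updates
    return mapping

# Toy check: the second, "occluded" window (mismatched signals) is skipped,
# so the mapping learned on the first window survives.
identity = lambda t, l: (lambda x: x)
exact = lambda measured, predicted: 1.0 if measured == predicted else -1.0
m = learn_between_occlusions([("a", "a"), ("a", "b")], identity, exact)
print(m("z"))  # -> z
```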

Another limitation of the approach is that some movements may not be sensed by certain modalities. The Kinect cannot detect torsions of the hand and forearm (e.g. in gestures like turning a knob or tightening a screw), but these are easily sensed by gyroscopes and accelerometers, meaning that the expected performance is modality- and gesture-dependent.

An open challenge is the case of teacher and learner measuring the motion of different body limbs. Consider as an example a teacher and a learner measuring acceleration at the arm and forearm. The mapping in this case would be the rotation between the frames of reference of the two sensors. The mapping would depend on whether the user performs gestures with the arm stretched or not. This, in turn, could depend on which gesture the user performs (e.g. pointing towards an object, or eating while seated at a table). For the transfer learning approach to be successful in such a scenario, the mapping would have to depend on the activity.

7.6.2 Advantages and generality

The approach applies to all situations where a mapping function can be estimated from data. It can be applied to other sensing systems, or to systems of identical modality translated or rotated with respect to each other. The method should scale well with the number of classes, since a mapping model learned with one instance of a single class (GSM model) performed well in predicting the signals of other gestures. This indicates that the model approximated the physical relations between the sensing systems, independently of the gestures.

Page 137: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

120 Chapter 7: Signal-level transfer learning

We evaluated isolated activity recognition, but the approach is also applicable to continuous recognition (spotting).

Low-variance data unrelated to the activities of interest (UDM model) can be used to learn a mapping model, albeit with more data. This has practical benefits, since "unrelated" domain data can easily be acquired "in the background", whenever the user is in the sensing range of the teacher and learner sensor systems. As an example, consider a case where the learner needs to recognize interactions with doors. In this case, all other movements of the user represent a null class, i.e. something that needs to be discarded. Nevertheless, this "null class" can be used to learn the mapping between the teacher and learner systems. Every movement of the user is useful for training the mapping.

The signal-level transfer approach may be useful in crowd-sourcing scenarios [101] to translate classifier models to the specific sensor modalities that one user has. In a crowd-sourcing scenario, a database of signals corresponding to activities is labeled by users. When a user wears a learner system, she is queried to perform a specific activity. The signal-level transfer learning approach is carried out using as teachers all signals contained in the database that include the chosen activity. For every teacher-learner pair, the Fit metric is calculated. Once the maximum Fit is found, the corresponding teacher is chosen. The training set of that teacher is converted via the learned mapping to the learner domain.
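The selection step just described can be sketched as follows (hypothetical helper names, not the thesis code; we assume the candidate teacher signals are already time-aligned with the learner recording of the queried activity):

```python
import numpy as np

def summed_fit(measured, predicted):
    # Fit summed over channels, in the spirit of Eqs. (7.9)-(7.10).
    err = np.sqrt(np.sum((measured - predicted) ** 2, axis=0))
    spread = np.sqrt(np.sum((measured - measured.mean(axis=0)) ** 2, axis=0))
    return np.sum(1.0 - err / spread)

def choose_teacher(candidates, learner_sig):
    """candidates: dict mapping a teacher name to its signal matrix
    (rows: samples, columns: channels), aligned with learner_sig."""
    scored = {}
    for name, teacher_sig in candidates.items():
        # Least-squares linear MIMO mapping from teacher to learner signals.
        W, *_ = np.linalg.lstsq(teacher_sig, learner_sig, rcond=None)
        scored[name] = (summed_fit(learner_sig, teacher_sig @ W), W)
    best = max(scored, key=lambda n: scored[n][0])
    return best, scored[best][1]  # chosen teacher and its mapping

# Toy check: the learner is an exact linear function of teacher "A",
# so "A" should win against an unrelated teacher "B".
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 3))
B = rng.standard_normal((300, 3))
learner = A @ rng.standard_normal((3, 2))
name, W = choose_teacher({"A": A, "B": B}, learner)
print(name)  # -> A
```

The winning teacher's training set would then be translated through the returned mapping, as described above.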

7.6.3 Choice of the model

We chose a linear MIMO mapping model. If the desired transformation is known beforehand, the model can be tailored to the specific case. For example, if we know that the learner signals are just a scaled version of the teacher signals, then the model reduces to a set of scalars (one for each signal axis) that multiply the learner signals to get the teacher signals. Having only one scalar per signal channel makes it easier to estimate the values of the parameters.
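For instance, under the assumption that each learner channel is a scaled copy of the corresponding teacher channel, the per-channel scalars have a closed-form least-squares estimate (an illustrative sketch with names of our choosing):

```python
import numpy as np

def per_channel_scale(teacher, learner):
    """Least-squares scalar a_i per channel such that learner_i ~ a_i * teacher_i.

    teacher, learner: arrays of shape (samples, channels).
    """
    return np.sum(teacher * learner, axis=0) / np.sum(teacher ** 2, axis=0)

# Toy check: recover known scales exactly from noiseless data.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((100, 3))
true_scale = np.array([2.0, -0.5, 1.3])
learner = teacher * true_scale
print(per_channel_scale(teacher, learner))  # -> [ 2.  -0.5  1.3]
```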

Using a mapping model which is too tailored to a certain set of transformations can be detrimental in other cases. For example, some transformations or some modalities could be deployed without being foreseen by the system designer. The learning approach that we propose makes it possible to take advantage of additional sensors as they become available to build the transfer model, without expert intervention.

Page 138: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

7.6. Discussion 121

The approach may be improved by nonlinear or time-varying models, e.g. with time-delay neural networks [49] or nonlinear ARMA [102], or by modeling the transformations between multiple sensors (e.g. two joint coordinates and one acceleration). More complex transformations likely need a longer coexistence time between teacher and learner to estimate the model parameters.

7.6.4 Usefulness of the Fit metric

The Fit value provides an indication of the quality of the mapping and, to some extent, of the quality of the resulting transfer. It may thus be an indicator guiding the self-organization of an ecology of sensor systems for opportunistic activity recognition. The Fit value tends to be higher when sensors measure the movement of the same limb. Further investigation may evaluate whether this could be used to automatically localize on-body sensor placement by calculating the Fit value with data delivered by a skeleton tracking system after signal mapping.

7.6.5 Training set translation vs. continuous signal translation

The transfer architectures differ in their complexity and memory needs. The training set translation approach does not add computational load on the learner system after the activity models are translated, but it requires the teacher system to store activity templates. This, however, does not demand a large amount of space (47 kB in floating-point values in our example). The approach is well suited for an ambient teacher and a wearable learner system, since an ambient teacher might have more storage available.

In contrast, the continuous signal translation approach requires that the learner sensor signals are continuously transformed by the mapping function. This increases the computational load on the learner, which makes the approach more suitable when the learner is an ambient system. The storage requirement is lower, since only the classifier parameters need to be stored.

The architectures also differ in the direction in which the mapping is calculated: from teacher to learner signals or vice versa. If a mapping model exists in both directions, then the choice of the architecture is based on computational and memory requirements. If the mapping model is more accurate in one direction, then the architecture that uses that mapping is favored.
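The two architectures can be contrasted schematically as follows (a hypothetical sketch; `phi_*` stands for the learned mapping in either direction, and `train`/`teacher_classifier` for any classifier training and prediction routines):

```python
# Training set translation: the teacher's stored templates are translated
# once into the learner domain; afterwards the learner classifies natively.
def train_set_translation(teacher_templates, labels, phi_t_to_l, train):
    translated = [phi_t_to_l(x) for x in teacher_templates]
    return train(translated, labels)  # learner-domain classifier

# Continuous signal translation: each incoming learner sample is mapped
# into the teacher domain and fed to a copy of the teacher's classifier.
def continuous_translation(learner_stream, phi_l_to_t, teacher_classifier):
    for x in learner_stream:
        yield teacher_classifier(phi_l_to_t(x))

# Toy check with identity mappings and a nearest-template "classifier".
def nearest_template(xs, ys):
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

clf = train_set_translation([0.2, 0.9], ["low", "high"], lambda x: x,
                            nearest_template)
print(clf(0.8))                                                     # -> high
print(list(continuous_translation([0.1, 0.95], lambda x: x, clf)))  # -> ['low', 'high']
```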


7.7 Conclusion

In this chapter, we showed that a mapping between the signals of the teacher and learner domains can be learned at runtime. The mapping can be learned quickly, compared to the duration of most activities. In our scenario, a single gesture of one class, of duration 3 s, was enough to build the mapping and use it to translate the teacher training set to the learner domain. The user does not need to wear the teacher and learner sensors together for a long time.

In our experiments, the accuracy of the learner domains after the transfer was on average just 4 % below the maximum achievable accuracy, i.e. the accuracy obtained by training the learner domains directly from ground-truth labels.


Figure 7.1: Signal-level transfer with training set translation. Phase 1 starts with a teacher system (Kinect) with stored training signals s_T^(i). In phase 2, the learner (accelerometer) appears, producing its own signals, and the mapping Φ_{T→L} from teacher to learner signals is learned. Phase 3 consists in the translation of the teacher training set into the learner domain via the learned mapping. In phase 4, the learner can now classify incoming signals.

Figure 7.2: Transfer employing continuous signal translation. In phase 1, we start with the teacher system (accelerometer) with its objective predictive function f_T(). In phase 2, the learner (Kinect) appears and the mapping Φ_{L→T} from learner to teacher signals is learned. In phase 3, the learner gets a copy of the objective predictive function from the teacher. In phase 4, the learner can now classify incoming signals, which are first converted by the learned mapping function.

Figure 7.3: IMUs and a Kinect capture the user's movements (top). The Kinect delivers a depth map, a color image and a 15-joint skeleton of the user (middle). The right hand position and limb acceleration are synchronously recorded for five types of gestures: Y = {circle, infinity, slider, triangle, square} (bottom).


Figure 7.4: Acceleration at the lower arm predicted by a GAM model from the hand position sensed by the Kinect, and measured acceleration, for a slider (left) and a circle (right). As a rule of thumb, a good match between the predicted and measured signals is obtained for Fit > 0.


Figure 7.5: Logarithmic box plot of 1 − Fit between the acceleration measured at the RLA, RUA and BACK (first, second and third box within each gesture group) and the acceleration predicted at those locations via the mapping applied to the coordinates of the hand measured by the Kinect. a) Mapping trained on all gestures and Fit computed on the indicated gestures. b) Mapping trained on the indicated gesture and Fit computed on all of them. c) Mapping trained on the "idle" dataset and Fit computed on the indicated gestures.


Figure 7.6: Classification accuracy for the translation between an ambient and a wearable system with FS2, for various baselines and MIMO models. Left half: transfer from a system trained on the Kinect hand position to a system operating on the acceleration measured at the indicated positions. Right half: transfer from a system trained on the acceleration signals measured at the indicated positions to a system using the Kinect hand position. BT and BL indicate the baseline accuracies obtained with a system trained and tested on the teacher and learner domain, respectively.


Figure 7.7: Improvement of the classification accuracy after transfer when increasing the amount of "idle" data used to learn the UDM model, for a transfer from the Kinect hand position to the acceleration at the lower arm (left) and vice versa (right). BT and BL are the teacher and learner domain baselines.


8 Conclusion and outlook

In this chapter, we summarize the achievements of this work and we outline possible directions for future investigation.


8.1 Achievements

This thesis has explored a new direction in activity recognition which shifts the paradigm from that of having static and pre-defined systems to that of including new sensor systems as they appear on the user's body or in her surroundings. The main achievements of this work which have gone beyond the state of the art are:

• We established a new reference in the activity recognition community by collecting and annotating the OPPORTUNITY activity recognition dataset, a rich multi-modal dataset containing more than 30000 gestures measured by 72 sensors. The data collection was an effort by four universities participating in the OPPORTUNITY project and served as a reference both for the present thesis and for an activity recognition challenge which was launched in the community thanks to these data. A part of the data has been donated to the UCI Machine Learning Repository, which has high visibility and serves as a benchmark whenever new pattern recognition algorithms come to life.

• We introduced a method to use an existing sensor to train newly deployed ones by operating at the level of the classifiers. In most cases we could reach accuracy levels within 10 % of the upper baselines obtained by training the learner domains directly with ground-truth labels. We showed that learner domains can outperform the teacher domains providing the labels.

We showed the effectiveness of this transfer learning approach among body-worn accelerometers for the recognition of locomotion/postures and in a scenario involving smartphones. In the latter, we also showed an enhancement of the approach with a disagreement-based co-training step to improve the learner performance. This step is useful whenever teacher labels are too noisy, thereby undermining the ability of the learners to achieve sufficient accuracy.

• We introduced a method to exploit ambient sensors like magnetic switches embedded in doors or windows as teachers for training newly deployed learners to recognize postures and locomotion. These sensors are able to provide labels concerning the locomotion of a person with a precision of around 90 %. When using these labels for transfer learning, learner domains get within 10 % of their upper baselines. We surveyed possible sensors, like webcams, infrared detectors and GPS units, that, coupled with assumptions on people's behavior, allow labels to be extracted for training other sensor systems.

• We tackled the problem of how to decide, among a set of newly deployed sensor systems, which are the most promising ones for reaching a high accuracy when acting as learners in the transfer learning approach. The problem was formulated as a feature selection problem under class noise. We introduced two new heuristics and three classifier-based filter approaches for ranking the candidate learners. We compared the five ranking methods with standard feature selection approaches under class noise. We were able to rank candidate learners with up to 92 % accuracy according to the accuracies they achieve when effectively trained by the teacher domains. We found that in practical situations, NCC can be a good filter for ranking candidates. We could also show that up to 40 % class noise does not significantly impair the ranking capabilities of most algorithms.

• We proposed a method operating at the signal level to perform transfer learning by directly translating the training set of a teacher to a learner. This allowed us to shorten the time for which teacher and learner need to be active together. This time was 3 s in a gesture recognition scenario involving five gestures. For comparison, in the classifier-level transfer, the time needed for the transfer to take place is on the order of hours to days, since users need to perform all activities at least once. We showed how the signals from two modalities can be mapped automatically and how this transfer algorithm allows the learner to reach an accuracy on average just 4 % below its best achievable level.

8.2 Outlook

This thesis has opened new research directions which deserve further investigation. An important topic which was not covered in the present work is the choice of features. We always made the assumption that each sensor system extracts a set of pre-defined features. Throughout this work, we used statistical features which were not tuned for specific sensor positions. This can nevertheless limit the potential of the activity recognition algorithms. Research could be conducted in the area of unsupervised or semi-supervised feature selection algorithms.

One promising approach for extracting features automatically is deep learning [103]. Deep learning employs neural networks with many layers, hence the adjective "deep". The neural networks are fed directly with the raw signals and features are calculated automatically. The networks learn the structure in the data without using labels, by minimizing some cost functions [103]. This approach could be used in the scenarios of the present work as an automatic feature extraction block.

One drawback of deep learning approaches is that they require considerable amounts of data to be trained. Plenty of unlabeled data could nevertheless be crawled from online resources. Repositories of data could also be built through crowdsourcing.

We focused on methods which work either at the classifier or the signal level. This is meaningful when the activities of interest are quite simple (e.g. walking, running, drinking from a glass). The problem of transferring capabilities to new sensor systems to recognize composite activities (e.g. cooking, assembling a piece of furniture) is an open challenge. Research could be performed on incrementally building ontologies on the learner domains by using labels provided by the teacher domains. Via an ontology, composite activities can be recognized by reasoning over and composing the outputs of simple activity detectors. It has already been shown that ontologies can indeed be learned incrementally from web-crawled text [104].

In this work, we have investigated algorithms to transfer recognition capabilities at the signal level and at the classifier level, and we showed ways of selecting the sensors, out of a pool of candidates, which are more suitable than others to perform the recognition. In future work, the interplay between these algorithms needs to be made automatic. Ideally, a framework should be able to cope automatically with all changes in the sensing infrastructure, like the addition of new sensors, the displacement of old ones, or the failure of some sensors. This framework should be able to call the proper algorithms in the proper order (some in parallel, some in sequence) in order to perform the best on-the-fly optimization and achieve the best possible recognition chain. A software framework which blends together the algorithms described in this work and others developed in the OPPORTUNITY project has already been delivered [105], but much work still needs to be carried out to actually automate the algorithm and parameter selection depending on the situation.


List of Abbreviations

Notation: Description (pages)

ARMA: Autoregressive-Moving-Average (22)
FN: False Negative (16)
FP: False Positive (16, 78)
GAM: Gesture-Aspecific Mapping (115)
GSM: Gesture-Specific Mapping (115)
IMU: Inertial Measurement Unit (29, 30, 113)
kNN: k-Nearest Neighbors (21, 52, 84)
MIMO: Multiple-Input and Multiple-Output (22, 118, 128)
NCC: Nearest Centroid Classifier (52, 84, 133)
Piconet: Ad-hoc network linking a wireless user group of devices using Bluetooth technology protocols (30)
RFID: Radio-frequency identification (2, 23)
SVM: Support Vector Machine (21, 52, 54)
TN: True Negative (16)
TP: True Positive (16, 78)
UCI: University of California Irvine (26, 41)
UDM: Unrelated-Dataset Mapping (115)


List of Symbols

1(): Indicator function
α = 1 − ε: Accuracy of a set of labels y_i
C = |Y|: Number of labels/classes
D = D(X, P(X)): Domain
Δ = (δ_S, δ_M, δ_W): Parameter vector for behavioral assumptions
ε = (1/N) Σ_{i=1}^{N} 1(y_i^G ≠ y_i): Error rate of a set of labels y_i
f(): Objective predictive function
F(): Feature extraction function
Φ: Signal-level mapping function
(·)^G: Ground truth
Y = {l_1, …, l_C}: Label/class space
(·)^L: Learner
M: Dimensionality of the feature space
N: Number of instances
Ω = {(x_1, y_1), …, (x_N, y_N)}: Set of labeled instances / training set
P(): Probability distribution
Ψ: Candidate learner ranking score
s_i: Raw signal for the i-th instance
s(n): Multivariate time series
(·)^T: Teacher
ŝ(n): Predicted multivariate time series
(·)^tr: Transpose
T = T(Y, f()): Task associated to a domain
θ_i: i-th feature
U(a, b): Uniform distribution on the interval (a, b)
X = {θ_1, …, θ_M}: Feature space associated to D
X = {x_1, …, x_N} ⊂ X: Set of instances
x_i: Feature vector for the i-th instance
y_i: Label for the i-th instance


Bibliography

[1] O. Brdiczka, M. Langet, J. Maisonnasse, and J. Crowley, “Detect-ing human behavior models from multimodal observation in asmart home,” Automation Science and Engineering, IEEE Transac-tions on, vol. 6, no. 4, pp. 588–597, 2009.

[2] D. Roggen, K. Förster, A. Calatroni, T. Holleczek, Y. Fang,G. Tröster, P. Lukowicz, G. Pirkl, D. Bannach, K. Kunze, A. Fer-scha, C. Holzmann, A. Riener, R. Chavarriaga, and J. del R. Mil-lán, “Opportunity: Towards opportunistic activity and contextrecognition systems,” in Proc. 3rd IEEE WoWMoM Workshop onAutononomic and Opportunistic Communications, 2009.

[3] S. J. Pan and Q. Yang, “A survey on transfer learning,” Knowledgeand Data Engineering, IEEE Transactions on, vol. 22, no. 10, pp. 1345–1359, oct. 2010.

[4] J. A. Ward, P. Lukowicz, and H. W. Gellersen, “Performancemetrics for activity recognition,” ACM Trans. Intell. Syst.Technol., vol. 2, no. 1, pp. 6:1–6:23, Jan. 2011. [Online]. Available:http://doi.acm.org/10.1145/1889681.1889687

[5] L. Bao and S. S. Intille, “Activity recognition from user-annotatedacceleration data,” in Pervasive Computing: Proc. of the 2nd Int’lConference, Apr. 2004, pp. 1–17.

[6] M. Stikic, K. Van Laerhoven, and B. Schiele, “Exploringsemi-supervised and active learning for activity recognition,”in Proceedings of the 2008 12th IEEE International Symposium onWearable Computers, ser. ISWC ’08. Washington, DC, USA:IEEE Computer Society, 2008, pp. 81–88. [Online]. Available:http://dx.doi.org/10.1109/ISWC.2008.4911590

[7] K. Kunze, P. Lukowicz, K. Partridge, and B. Begole, “Whichway am i facing: Inferring horizontal device orientation froman accelerometer signal,” in Wearable Computers, 2009. ISWC ’09.International Symposium on, 4-7 2009, pp. 149 –150.

Page 159: Rights / License: Research Collection In Copyright - Non ... · Citizen of Italy accepted on the recommendation of Prof. Dr. Gerhard Tröster, examiner ... a smart phone can be worn

142 Bibliography

[8] K. Kunze, P. Lukowicz, H. Junker, and G. Tröster, “Where am I: Recognizing on-body positions of wearable sensors,” in LOCA’04: International Workshop on Location and Context- . . . , Jan. 2005.

[9] K. Kunze and P. Lukowicz, “Symbolic object localization through active sampling of acceleration and sound signatures,” in Proc. 9th Int. Conf. on Ubiquitous Computing - Ubicomp 2007, Innsbruck, Austria, 2007.

[10] K. Förster, P. Brem, D. Roggen, and G. Tröster, “Evolving discriminative features robust to sensor displacement for activity recognition in body area sensor networks,” in Proceedings of the Fifth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2009). IEEE Press, 2009.

[11] J. Lester, T. Choudhury, and G. Borriello, “A practical approach to recognizing physical activities,” in Pervasive, ser. Lecture Notes in Computer Science, K. P. Fishkin, B. Schiele, P. Nixon, and A. J. Quigley, Eds., vol. 3968. Springer, 2006, pp. 1–16.

[12] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in ICML ’07: Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 193–200.

[13] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[14] T. van Kasteren, G. Englebienne, and B. J. A. Kröse, “Transferring knowledge of activity recognition across sensor networks,” in Pervasive, ser. Lecture Notes in Computer Science, P. Floréen, A. Krüger, and M. Spasojevic, Eds., vol. 6030. Springer, 2010, pp. 283–300.

[15] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu, “Translated learning: Transfer learning across different feature spaces,” in Neural Information Processing Systems, 2008, pp. 353–360.

[16] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ser. COLT ’98. New York, NY, USA: ACM, 1998, pp. 92–100. [Online]. Available: http://doi.acm.org/10.1145/279943.279962

[17] Q. Ning, Y. Chen, J. Liu, and H. Zhang, “Heterogeneous multimodal sensors based activity recognition system,” in Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, ser. ICME ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1–4. [Online]. Available: http://dx.doi.org/10.1109/ICME.2011.6012091

[18] M. Stikic, D. Larlus, S. Ebert, and B. Schiele, “Weakly supervised recognition of daily life activities with wearable sensors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2521–2537, Dec. 2011. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2011.36

[19] K. Nigam and R. Ghani, “Analyzing the effectiveness and applicability of co-training,” in Proceedings of the Ninth International Conference on Information and Knowledge Management, ser. CIKM ’00. New York, NY, USA: ACM, 2000, pp. 86–93. [Online]. Available: http://doi.acm.org/10.1145/354756.354805

[20] W. Wang, Z. Huang, and M. Harper, “Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, Apr. 2007, pp. IV-137–IV-140.

[21] M. Chen, K. Q. Weinberger, and Y. Chen, “Automatic feature decomposition for single view co-training,” in ICML, L. Getoor and T. Scheffer, Eds. Omnipress, 2011, pp. 953–960.

[22] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944968

[23] N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, Dec. 2002.

[24] G. Kumar and K. Kumar, “A novel evaluation function for feature selection based upon information theory,” in 24th Canadian Conference on Electrical and Computer Engineering (CCECE), May 2011, pp. 000395–000399.

[25] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, Aug. 2005.

[26] C. Dhir, N. Iqbal, and S.-Y. Lee, “Efficient feature selection based on information gain criterion for face recognition,” in International Conference on Information Acquisition (ICIA ’07), Jul. 2007, pp. 523–527.

[27] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944974

[28] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in ICML, T. Fawcett and N. Mishra, Eds. AAAI Press, 2003, pp. 856–863.

[29] K. Kira and L. A. Rendell, “The feature selection problem: traditional methods and a new algorithm,” in Proceedings of the Tenth National Conference on Artificial Intelligence, ser. AAAI’92. AAAI Press, 1992, pp. 129–134. [Online]. Available: http://dl.acm.org/citation.cfm?id=1867135.1867155

[30] I. Kononenko, “Estimating attributes: Analysis and extensions of Relief,” in Machine Learning: ECML-94, ser. Lecture Notes in Computer Science, F. Bergadano and L. De Raedt, Eds. Springer Berlin / Heidelberg, 1994, vol. 784, pp. 171–182. [Online]. Available: http://dx.doi.org/10.1007/3-540-57868-4_57

[31] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” J. Mach. Learn. Res., vol. 3, pp. 1415–1438, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944981

[32] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 7, pp. 179–188, 1936.

[33] X. Wang and V. Syrmos, “Optimal cluster selection based on Fisher class separability measure,” in Proceedings of the 2005 American Control Conference, vol. 3, Jun. 2005, pp. 1929–1934.

[34] T. H. Dat and C. Guan, “Feature selection based on Fisher ratio and mutual information analyses for robust brain computer interface,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, Apr. 2007, pp. I-337–I-340.

[35] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, 1979.

[36] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, no. 1-2, pp. 273–324, 1997.

[37] J. Jelonek and J. Stefanowski, “Feature subset selection for classification of histological images,” Artif. Intell. Med., vol. 9, no. 3, pp. 227–239, 1997.

[38] C. Cardie, “Using decision trees to improve case-based learning,” in ICML, 1993, pp. 25–32.

[39] X. He, P. Beauseroy, and A. Smolarz, “Feature subspaces selection via one-class SVM: Application to textured image segmentation,” in 2nd International Conference on Image Processing Theory, Tools and Applications (IPTA), Jul. 2010, pp. 21–25.

[40] R. Amini and P. Gallinari, “Semi-supervised learning with an imperfect supervisor,” Knowl. Inf. Syst., vol. 8, pp. 385–413, Nov. 2005. [Online]. Available: http://portal.acm.org/citation.cfm?id=1101568.1101573

[41] N. Gayar, F. Schwenker, and G. Palm, “A study of the robustness of KNN classifiers trained using soft labels,” in Artificial Neural Networks in Pattern Recognition, ser. Lecture Notes in Computer Science, F. Schwenker and S. Marinai, Eds. Springer Berlin / Heidelberg, 2006, vol. 4087, pp. 67–80. [Online]. Available: http://dx.doi.org/10.1007/11829898_7

[42] D. Angluin and P. Laird, “Learning from noisy examples,” Mach. Learn., vol. 2, pp. 343–370, Apr. 1988. [Online]. Available: http://portal.acm.org/citation.cfm?id=639961.639996

[43] N. D. Lawrence and B. Schölkopf, “Estimating a kernel Fisher discriminant in the presence of label noise,” in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML ’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 306–313. [Online]. Available: http://portal.acm.org/citation.cfm?id=645530.655665

[44] H. Sontrop, R. van den Ham, P. Moerland, M. Reinders, and W. Verhaegh, “A sensitivity analysis of microarray feature selection and classification under measurement noise,” in IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), May 2009, pp. 1–4.

[45] W. Altidor, T. M. Khoshgoftaar, and J. V. Hulse, “Robustness of filter-based feature ranking: A case study,” in FLAIRS Conference, R. C. Murray and P. M. McCarthy, Eds. AAAI Press, 2011.

[46] T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, and G. Tröster, “Wearable activity tracking in car manufacturing,” IEEE Pervasive Computing, vol. 7, no. 2, pp. 42–50, 2008.

[47] U. Blanke, B. Schiele, M. Kreil, P. Lukowicz, B. Sick, and T. Gruber, “All for one or one for all? – combining heterogeneous features for activity spotting,” in 7th IEEE PerCom Workshop on Context Modeling and Reasoning (CoMoRea), Mannheim, Germany, 2010.

[48] M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, “A hidden Markov model-based continuous gesture recognition system for hand motion trajectory,” in ICPR. IEEE, 2008, pp. 1–4.

[49] A. Yazdizadeh and K. Khorasani, “Adaptive time delay neural network structures for nonlinear system identification,” Neurocomputing, vol. 47, no. 1–4, pp. 207–240, 2002.

[50] S. D. Fassois, “MIMO LMS-ARMAX identification of vibrating structures - part I: The method,” Mechanical Systems and Signal Processing, vol. 15, no. 4, pp. 737–758, 2001.

[51] P. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, Jun. 2005, pp. 947–954.

[52] P. Phillips, K. Bowyer, P. Flynn, X. Liu, and W. Scruggs, “The iris challenge evaluation 2005,” in 2nd IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), Sep.–Oct. 2008, pp. 1–8.

[53] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in Conference on Computer Vision and Pattern Recognition Workshop (CVPRW ’04), Jun. 2004, p. 178.

[54] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Institute of Technology, Tech. Rep. 7694, 2007. [Online]. Available: http://authors.library.caltech.edu/7694

[55] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: A database and web-based tool for image annotation,” Int. J. Comput. Vision, vol. 77, no. 1-3, pp. 157–173, May 2008. [Online]. Available: http://dx.doi.org/10.1007/s11263-007-0090-8

[56] S. Intille, K. Larson, E. Tapia, J. Beaudin, P. Kaushik, J. Nawyn, and R. Rockinson, “Using a live-in laboratory for ubiquitous computing research,” in Proc. Int. Conf. on Pervasive Computing, 2006, pp. 349–365.

[57] M. Tenorth, J. Bandouch, and M. Beetz, “The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition,” in IEEE Int. Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences (THEMIS) at ICCV, 2009.

[58] F. De la Torre Frade, J. K. Hodgins, A. W. Bargteil, X. Martin Artal, J. C. Macey, A. Collado I Castells, and J. Beltran, “Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database,” Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-08-22, Apr. 2008.

[59] T. van Kasteren, A. Noulas, G. Englebienne, and B. Kröse, “Accurate activity recognition in a home setting,” in Proceedings of the 10th International Conference on Ubiquitous Computing. ACM Press, 2008, pp. 1–9.

[60] T. Huynh, M. Fritz, and B. Schiele, “Discovery of activity patterns using topic models,” in Proceedings of the 10th International Conference on Ubiquitous Computing. ACM New York, NY, USA, 2008, pp. 10–19.

[61] K. Van Laerhoven, H. Gellersen, and Y. Malliaris, “Long-term activity monitoring with a wearable sensor node,” in BSN Workshop, 2006.

[62] A. Calatroni, D. Roggen, and G. Tröster, “Collection and curation of a large reference dataset for activity recognition,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oct. 2011, pp. 30–35.

[63] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. del R. Millán, “Collecting complex activity datasets in highly rich networked sensor environments,” in Seventh International Conference on Networked Sensing Systems (INSS), 2010, pp. 233–240.

[64] G. Pirkl, K. Stockinger, K. Kunze, and P. Lukowicz, “Adapting magnetic resonant coupling based relative positioning technology for wearable activity recognition,” in 12th IEEE International Symposium on Wearable Computers (ISWC), Sep.–Oct. 2008, pp. 47–54.

[65] M. Bächlin, D. Roggen, M. Plotnik, N. Inbar, I. Meidan, T. Herman, M. Brozgol, E. Shaviv, N. Giladi, J. M. Hausdorff, and G. Tröster, “Potentials of enhanced context awareness in wearable assistants for Parkinson’s disease patients with freezing of gait syndrome,” in Proceedings of the 13th International Symposium on Wearable Computers (ISWC), Sep. 2009, pp. 123–130.

[66] P. Zappi, C. Lombriser, E. Farella, D. Roggen, L. Benini, and G. Tröster, “Activity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection,” in 5th European Conf. on Wireless Sensor Networks (EWSN 2008), R. Verdone, Ed. Springer, 2008, pp. 17–33.

[67] D. Bannach, P. Lukowicz, and O. Amft, “Rapid prototyping of activity recognition applications,” IEEE Pervasive Computing, vol. 7, no. 2, pp. 22–31, Apr.–Jun. 2008.

[68] M. Sichitiu and C. Veerarittiphan, “Simple, accurate time synchronization for wireless sensor networks,” in IEEE Wireless Communications and Networking Conference (WCNC), vol. 2, Mar. 2003, pp. 1266–1273.

[69] D. Bannach, K. Kunze, J. Weppner, and P. Lukowicz, “Integrated tool chain for recording and handling large, multimodal context recognition data sets,” in Proceedings of the 12th ACM International Conference Adjunct Papers on Ubiquitous Computing - Adjunct, ser. Ubicomp ’10 Adjunct. New York, NY, USA: ACM, 2010, pp. 357–358. [Online]. Available: http://doi.acm.org/10.1145/1864431.1864434

[70] X. Liu, L. Li, and N. Memon, “A lightweight combinatorial approach for inferring the ground truth from multiple annotators,” in Machine Learning and Data Mining in Pattern Recognition, ser. Lecture Notes in Computer Science, P. Perner, Ed. Springer Berlin Heidelberg, 2013, vol. 7988, pp. 616–628. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-39712-7_47

[71] A. Calatroni, D. Roggen, and G. Tröster, “Automatic transfer of activity recognition capabilities between body-worn motion sensors: Training newcomers to recognize locomotion,” in Eighth International Conference on Networked Sensing Systems (INSS’11), Penghu, Taiwan, Jun. 2011.

[72] H. Sagha, J. del R. Millán, and R. Chavarriaga, “Detecting anomalies to improve classification performance in opportunistic sensor networks,” in IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), Mar. 2011, pp. 154–159.

[73] A. Manzoor, C. Villalonga, A. Calatroni, H.-L. Truong, D. Roggen, S. Dustdar, and G. Tröster, “Identifying important action primitives for high level activity recognition,” in Proceedings of the 5th European Conference on Smart Sensing and Context, ser. EuroSSC’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 149–162. [Online]. Available: http://portal.acm.org/citation.cfm?id=1940159.1940175

[74] R. Chavarriaga, H. Bayati, and J. d. R. Millán, “Unsupervised adaptation for acceleration-based activity recognition: Robustness to sensor displacement and rotation,” Personal and Ubiquitous Computing, 2012.

[75] H. Sagha, H. Bayati, R. Chavarriaga Lozano, and J. d. R. Millán, “On-line anomaly detection and resilience in classifier ensembles,” Pattern Recognition Letters, 2013.

[76] O. Banos, A. Calatroni, M. Damas, H. Pomares, I. Rojas, H. Sagha, J. del R. Millán, G. Tröster, R. Chavarriaga, and D. Roggen, “Kinect=IMU? Learning MIMO signal mappings to automatically translate activity recognition systems across sensor modalities,” in 16th International Symposium on Wearable Computers (ISWC), 2012, pp. 92–99.

[77] H. Sagha, S. T. Digumarti, J. d. R. Millán, R. Chavarriaga, A. Calatroni, D. Roggen, and G. Tröster, “Benchmarking classification techniques using the Opportunity human activity dataset,” in 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2011, pp. 36–40.

[78] R. Chavarriaga, H. Sagha, A. Calatroni, S. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, “The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, 2013.

[79] D. Cox, E. Jovanov, and A. Milenkovic, “Time synchronization for ZigBee networks,” in Proceedings of the Thirty-Seventh Southeastern Symposium on System Theory (SSST ’05), Mar. 2005, pp. 135–138.

[80] N. Ravi, N. Dandekar, P. Mysore, and M. L. Littman, “Activity recognition from accelerometer data,” in Proceedings of the 17th Conference on Innovative Applications of Artificial Intelligence - Volume 3, ser. IAAI’05. AAAI Press, 2005, pp. 1541–1546. [Online]. Available: http://dl.acm.org/citation.cfm?id=1620092.1620107

[81] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[82] K. Förster, D. Roggen, and G. Tröster, “Unsupervised classifier self-calibration through repeated context occurrences: is there robustness against sensor displacement to gain?” in Proceedings of the 13th International Symposium on Wearable Computing. IEEE Computer Society, 2009, pp. 77–84.

[83] A. Calatroni, D. Roggen, and G. Tröster, “A methodology to use unknown new sensors for activity recognition by leveraging sporadic interactions with primitive sensors and behavioral assumptions,” in Proc. of the Opportunistic Ubiquitous Systems Workshop, part of 12th ACM Int. Conf. on Ubiquitous Computing, 2010. [Online]. Available: http://www.wearable.ethz.ch/resources/UbicompWorkshop_OpportunisticUbiquitousSystems

[84] K. Flouri, B. Beferull-Lozano, and P. Tsakalides, “Training a SVM-based classifier in distributed sensor networks,” in Proc. of 14th European Signal Processing Conference, 2006, pp. 1–5.

[85] “Recognizing daily life context using web-collected audio data,” in Proceedings of the 16th IEEE International Symposium on Wearable Computers (ISWC 2012), Jun. 2012.

[86] S. Dasgupta, M. L. Littman, and D. A. McAllester, “PAC generalization bounds for co-training,” in NIPS, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. MIT Press, 2001, pp. 375–382.

[87] S. Clark, J. R. Curran, and M. Osborne, “Bootstrapping POS taggers using unlabelled data,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, ser. CONLL ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 49–55. [Online]. Available: http://dx.doi.org/10.3115/1119176.1119183

[88] H. Sagha, A. Calatroni, J. d. R. Millán, D. Roggen, G. Tröster, and R. Chavarriaga Lozano, “Robust activity recognition combining anomaly detection and classifier retraining,” in 10th Annual Body Sensor Networks, 2013.

[89] B. Tessendorf, B. Arnrich, J. Schumm, C. Kappeler-Setz, and G. Tröster, “Unsupervised monitoring of sitting behavior,” in Proceedings of EMBC09, 2009.

[90] B. Chakraborty, M. Pedersoli, and J. Gonzàlez, “View-invariant human action detection using component-wise HMM of body parts,” in AMDO ’08: Proceedings of the 5th International Conference on Articulated Motion and Deformable Objects. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 208–217.

[91] B. Logan, J. Healey, M. Philipose, E. M. Tapia, and S. S. Intille, “A long-term evaluation of sensing modalities for activity recognition,” in UbiComp 2007: Ubiquitous Computing, ser. Lecture Notes in Computer Science, vol. 4717. Springer, 2007, pp. 483–500. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-74853-3_28

[92] S. Neff, C. Villalonga, D. Roggen, and G. Tröster, “Do I look like my neighbors? A method to detect sensor data similarity for open-ended activity recognition systems,” in Eighth International Conference on Networked Sensing Systems (INSS’11), 2011.

[93] P. Zappi, T. Stiefmeier, E. Farella, D. Roggen, L. Benini, and G. Tröster, “Activity recognition from on-body sensors by classifier fusion: Sensor scalability and robustness,” in 3rd Int. Conf. on Intelligent Sensors, Sensor Networks, and Information Processing (ISSNIP), 2007, pp. 281–286. [Online]. Available: http://www2.ife.ee.ethz.ch/~droggen/publications/wear/EDAS_ISSNIP.pdf

[94] G. Strang, Introduction to Linear Algebra, 3rd ed. Wellesley-Cambridge Press, 2003.

[95] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Human activity detection from RGBD images,” in AAAI Workshop: Plan, Activity, and Intent Recognition, 2011.

[96] E. Stone and M. Skubic, “Evaluation of an inexpensive depth camera for passive in-home fall risk assessment,” in Proc. Pervasive Health Conference, 2011, pp. 71–77.

[97] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from a single depth image,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, 2011, pp. 1297–1304.

[98] “Prime Sensor NITE 1.3 algorithms notes, version 1.0,” PrimeSense Inc., 2010, http://www.primesense.com.

[99] “QtKinectWrapper,” http://code.google.com/p/qtkinectwrapper/.

[100] XM-B Technical Documentation, Xsens Technologies B.V., May 2009, http://www.xsens.com.

[101] M. Berchtold, M. Budde, D. Gordon, H. Schmidtke, and M. Beigl, “ActiServ: Activity recognition service for mobile phones,” in Proc. 14th Int. Symp. on Wearable Computers (ISWC), 2010.

[102] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky, “Nonlinear black-box modeling in system identification: a unified overview,” Automatica, vol. 31, no. 12, pp. 1691–1724, 1995.

[103] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[104] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, ser. BioNLP ’12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 11–19. [Online]. Available: http://dl.acm.org/citation.cfm?id=2391123.2391126

[105] M. Kurz, G. Hölzl, A. Ferscha, A. Calatroni, D. Roggen, G. Tröster, H. Sagha, R. Chavarriaga, J. del R. Millán, D. Bannach, K. Kunze, and P. Lukowicz, “The opportunity framework and data processing ecosystem for opportunistic activity and context recognition,” International Journal of Sensors, Wireless Communications and Control, Special Issue on Autonomic and Opportunistic Communications, Dec. 2011.

Curriculum Vitae

Personal information

Alberto Calatroni

Born on March 12, 1981 in Varese, Italy
Citizen of Italy

Education

2009 - 2013 Ph.D. studies (Dr. sc. ETH) in Information Technology and Electrical Engineering, ETH Zürich, Switzerland.

1999 - 2006 M.Sc. in Electrical Engineering, Politecnico di Milano, Italy.

1996 - 1999 Baccalaureate at European School of Varese, Varese, Italy.

Work experience

2006 - 2009 Research assistant, Image and Sound Processing Group (ISPG), Politecnico di Milano, Italy.
