INSTITUTO DE COMPUTAÇÃO
UNIVERSIDADE ESTADUAL DE CAMPINAS

Proceedings of the IX Workshop on Theses, Dissertations and Undergraduate Research Projects in Progress at IC-UNICAMP

Ariadne Carvalho, Divino César, Edelson Constantino, Elisa Rodrigues,
Ewerton Martins, Ewerton Silva, Fabio A. Faria, Flavio Nicastro,
Greice Mariano, Gustavo Alkmim, Heiko Hornung, Juliana Borin,
Leonardo Tampelini, Maxiwell Garcia, Priscila T. M. Saito, Roberto Pereira,
Samuel Buchdid, Vanessa Maike, Walisson Pereira

Technical Report IC-14-13

August 2014

The contents of this report are the sole responsibility of the authors.

Presentation

This technical report contains the abstracts of 33 works presented at the IX Workshop on Theses, Dissertations and Undergraduate Research Projects of the Institute of Computing, University of Campinas (WTD-IC-UNICAMP 2014).

The Workshop took place on August 11 and 12, 2014, and had more than 100 participants, including attendees, presenters, and organizers. There were 31 oral presentations and 12 posters. Students could choose the presentation format (oral, poster, or both). Publishing the abstracts as a technical report aims to publicize ongoing work and to record, succinctly, the state of the art of research at IC in 2014.

This year, the opening talk, entitled "Analysis of Large Volumes of Bibliographic Data", was given by Professor Luciano Digiampietri, of the University of São Paulo (USP). Two short courses were offered. The first, entitled "Fundamentals of Statistics for Computing", was taught by Professor Jorge Stolfi; the second, entitled "How to Ruin an Oral Presentation", was taught by Professor Claudia Bauzer Medeiros, both from IC-UNICAMP. The talk and the short courses were recorded and streamed online.

We thank the students who took part in the event, in particular those who presented their work, either orally or as posters, as well as the advisors who encouraged them to do so. We also thank the professors and doctoral students of IC who served on the evaluation committees. We thank Professor Ricardo da Silva Torres, director of IC, and Professor Paulo Lício de Geus, coordinator of the graduate program, for their encouragement, support, and sponsorship of the event. Finally, we are immensely grateful to the students of the IC graduate program who actually organized the event and who are co-editors of this report: Divino César, Edelson Constantino, Elisa Rodrigues, Ewerton Martins, Ewerton Silva, Fabio A. Faria, Flavio Nicastro, Greice Mariano, Gustavo Alkmim, Heiko Hornung, Leonardo Tampelini, Maxiwell Garcia, Priscila Tiemi Maeda Saito, Roberto Pereira, Samuel Buchdid, Vanessa Maike, and Walisson Pereira. To them I dedicate the IX Workshop on Theses, Dissertations and Undergraduate Research Projects of IC.

Ariadne Maria Brito Rizzoni Carvalho
Juliana Freitag Borin

Coordinators of the IX WTD

Professors at the Institute of Computing, UNICAMP

Contents

1 Extended abstracts

1.1 A Comparative Analysis between Object Atlas and Object Cloud Models for Medical Image Segmentation. Renzo Phellan, Alexandre Falcao

1.2 A Fault-Tolerant Software Product Line for Data Collection using Mobile Devices and Cloud Infrastructure. Cecília M. F. Rubira, Gustavo M. Waku, Edson R. Bollis, Ricardo S. Torres, Flavio Nicastro, Leonor P. Morellato

1.3 A Multiscale and Multi-Perturbation Blind Forensic Technique For Median Detecting. Anselmo Ferreira and Anderson Rocha

1.4 A Software Products Line Solution Based in Components and Aspects for the E-commerce Domain. Raphael Porreca Azzolini, Cecília Mary Fischer Rubira

1.5 An Architecture for Dynamic Self-Adaptation in Workflows. Sheila K. Venero Ferro, Cecilia Mary Fischer Rubira

1.6 Coloração de Arestas Semiforte de Grafos Split. Aloísio de Menezes Vilas-Boas, Celia Picinin de Mello

1.7 Complex pattern detection and specification from multiscale environmental variables for biodiversity applications. Jacqueline Midlej do Espírito Santo, Claudia Bauzer Medeiros

1.8 Diabetic Retinopathy Image Quality Assessment, Detection, Screening and Referral. Ramon Pires and Anderson Rocha

1.9 Ferramentas para simulação numérica na nuvem. Renan M. P. Neto, Juliana Freitag Borin, Edson Borin

1.10 Information and Emotion Extraction in Portuguese from Twitter for Stock Market Prediction. Fernando J. V. da Silva, Ariadne M. B. R. Carvalho

1.11 Modelo Ativo de Aparência Aplicado ao Reconhecimento de Emoções por Expressão Facial Parcialmente Oculta. Flavio Altinier Maximiano da Silva, Helio Pedrini

1.12 Multiple Parenting Phylogeny. Alberto Oliveira, Zanoni Dias, Siome Goldenstein, Anderson Rocha

1.13 O problema da partição em cliques dominantes. Henrique Vieira e Sousa, C. N. Campos

1.14 Optimizing Simulation in Multiprocessor Platforms using Dynamic-Compiled Simulation. Maxiwell Salvador Garcia, Rodolfo Azevedo and Sandro Rigo

1.15 Preempção de Tarefas MapReduce via Checkpointing. Augusto Souza, Islene Garcia

1.16 Processo Situado e Participativo para o Design de Aplicações de TVDi: Uma Abordagem Técnico-Social. Samuel B. Buchdid and M. Cecília C. Baranauskas

1.17 Processors Power Analysis. Jessica T. Heffernan, Rodolfo Azevedo

1.18 Uma abordagem caixa-branca para teste de transformação de modelos. Erika R. C. de Almeida, Eliane Martins

1.19 Uma Solução para Monitoração de Serviços Web Baseada em Linhas de Produtos de Software. Romulo J. Franco, Cecília M. Rubira

1.20 User Association and Load Balancing in HetNets. Alexandre T. Hirata, Eduardo C. Xavier, Juliana F. Borin

1.21 Reconstrução de Filogenias para Imagens e Vídeos. Filipe de O. Costa, Anderson Rocha (Advisor), Zanoni Dias (Co-advisor)

1.22 Superpixel-based interactive classification of very high resolution images. John E. Vargas, Priscila T. M. Saito, Alexandre X. Falcao, Pedro J. de Rezende, Jefersson A. dos Santos

1.23 Supporting the study of correlations between time series via semantic annotations. Lucas Oliveira Batista, Claudia Bauzer Medeiros

2 Poster abstracts

2.1 A Model-Driven Infrastructure for Developing Dynamic Software Product Line. Junior Cupe Casquina, Cecilia Mary Fischer Rubira

2.2 Aplicação do Critério Análise de Mutantes para Avaliação dos Casos de Testes Gerados a partir de Modelos de Estados. Wallace Felipe Francisco Cardoso, Eliane Martins

2.3 Combining active semi-supervised learning through optimum-path forest. Priscila T. M. Saito, Willian P. Amorim, Alexandre X. Falcao, Pedro J. de Rezende, Celso T. N. Suzuki, Jancarlo F. Gomes, Marcelo H. de Carvalho

2.4 Desenvolvimento de um ambiente de baixo custo para o uso de Interfaces Tangíveis no ensino de crianças da educação fundamental. Marleny Luque Carbajal, M. Cecília C. Baranauskas

2.5 Explorando os Benefícios das Interfaces Tangíveis no Tratamento de Crianças Autistas. Kim P. Braga, Maria Cecília C. Baranauskas

2.6 Identificação de Programas Maliciosos por meio da Monitoração de Atividades Suspeitas no Sistema Operacional. Marcus Felipe Botacin, Prof. Dr. Paulo Lício de Geus, Andre R. A. Gregio

2.7 Implementação de um simulador de plataforma ARMv7 utilizando ArchC. Gabriel K. Bertazi, Edson Borin

2.8 Oportunidades de economia de energia em dispositivos móveis. Joao H. S. Hoffmam, Sandro Rigo, Edson Borin

2.9 Video-Based Face Spoofing Detection through Visual Rhythm Analysis. Allan Pinto, Anderson Rocha

2.10 Wear-out Analysis of Error Correction Techniques in Phase-Change Memory. Caio Hoffman, Rodolfo Jardim de Azevedo, Guido Costa Souza de Araujo

3 WTD Program

1. Extended abstracts

"There are many hypotheses in science that are wrong. That is perfectly acceptable; they are the opening to finding the ones that are right."

Carl Sagan

A Comparative Analysis between Object Atlas and Object Cloud Models for Medical Image Segmentation

Renzo Phellan¹, Alexandre Falcao¹

¹LIV, Institute of Computing – University of Campinas (UNICAMP)
Campinas, SP, Brazil


Abstract. Medical image segmentation is crucial for quantitative organ analysis and surgical planning. However, it may be a time-consuming task for clinicians when it has to be done interactively. Automatic segmentation methods based on 3D object appearance models have been proposed to mitigate or circumvent the problem. Among them, approaches based on object atlas models (OAM) are the most actively investigated. Nevertheless, they require a time-costly image registration process. Object cloud models (OCM) have been introduced to avoid registration, considerably speeding up the whole process, but they have never been compared to OAM. The present paper fills this gap by presenting a comparative analysis of both approaches for nine organs of the human body.


1. Introduction

Quantitative organ analysis and surgical planning often require a precise segmentation of the organs contained in 3D medical images of the patient [Udupa et al. 2011]. This task can be done interactively or manually by medical experts. However, manual segmentation is time-consuming, tedious, error-prone, and subject to inter-observer variability [Vos et al. 2013]. In order to solve this problem, automatic segmentation methods based on object appearance models have been proposed. These approaches build an atlas or cloud model from a set of training images and their respective object masks, as obtained by interactive segmentation. The model is then used to detect and delineate the object in each new image.

Object atlas models (OAM) are the most popular ones ([Vos et al. 2013], [Lotjonen et al. 2011] and [Tamez-Pena et al. 2011]). The model appears as a probabilistic object, with a very narrow uncertainty region. By thresholding the probability values, one can obtain objects with the 3D shape of the organ or other anatomical structure. OAM construction requires the selection of a training image to be the reference coordinate system and the time-costly registration of the other training images into that system. The 3D anatomical structure in a new image is detected by also registering the image into the atlas coordinate system, which usually takes some minutes. Object delineation can then be simplified, for instance, to voxel classification [Vos et al. 2013].

Object cloud models (OCM) [Miranda et al. 2008] avoid registration by building the model simply from translations of the training masks to a common reference point (their object centers). As a result, the model presents a fuzzy appearance of the object's shape, with an uncertainty region larger than that of OAM due to the absence of registration. Object detection in a new image requires translating the model over the image, delineating a candidate object at each position within the uncertainty region of the model, and evaluating each delineation score, which should be maximum at the correct organ position. This simplification, combined with a clever implementation of the object search process, allows segmentation to be completed in some seconds per anatomical structure. Moreover, model construction also takes some seconds per object rather than hours, as in OAM.

2. Contributions

The advantages of OCM over OAM in computational time make it more attractive, but it is also important to compare their segmentation accuracies. The present work fills this gap. We detail our implementation of each model, OAM and OCM, and indicate the situations where one approach has advantages over the other.

3. Segmentation using Object Atlas Model

For a given anatomical structure of interest and a training set of medical images with their respective binary masks, we select the atlas reference system by computing the distances between each pair of training masks and selecting the one whose average distance to the others is minimum. We used the Average Symmetric Surface Distance (ASSD) [Langerak et al. 2011] for this task, because it can better detect the local differences between boundaries than global measures such as the Dice Similarity Coefficient (DSC).

Then, we register all training images into the reference system of the image whose mask has the minimum average ASSD, using an affine transformation followed by a locally deformable transformation. Image registration was computed by the software tool Elastix [Klein et al. 2010], which is publicly available at http://elastix.isi.uu.nl. The resulting deformation field for each image is then applied to transform its respective binary mask. In the atlas coordinate system, the training masks are averaged to compute a prior probability map (i.e., the atlas).

For a given new image containing an anatomical structure, we must first register it into the atlas coordinate system. The anatomical structure could be segmented by thresholding the prior probability map at 0.5, but this would disregard the local image properties of the structure in the new image [Gao and Tannembaum 2011]. We take these properties into account by estimating the conditional probability density, the joint probability density, and finally the posterior probability map [Vos et al. 2013].

The conditional probability density is a Gaussian distribution, whose mean and standard deviation are computed over the gray values of the voxels inside the anatomical structure, using all training images. It reflects the probability of a voxel belonging to the structure, given its gray value. The joint probability density is simply the normalized histogram of the new image, which is going to be segmented. By Bayes' formula, the posterior probability corresponds to the product of the prior probability value and the conditional probability density value, divided by the joint probability density value. The anatomical structure is then segmented by selecting voxels with posterior probability greater than or equal to 0.5 [Vos et al. 2013].
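The posterior computation described above can be sketched for a handful of voxels as follows. This is only an illustration under stated assumptions: the function names, the toy gray values and priors, and the clipping of the posterior to 1 are hypothetical choices made here, not details taken from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Gaussian density used as the conditional density of a gray value."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posterior(prior, gray, mean, std, hist):
    """Posterior = prior * conditional / joint.

    hist maps a gray value to its normalized frequency in the new image.
    Clipping to 1.0 is an assumption of this sketch.
    """
    joint = hist.get(gray, 0.0)
    if joint == 0.0:
        return 0.0
    return min(1.0, prior * gaussian_pdf(gray, mean, std) / joint)


def segment(priors, grays, mean, std):
    """Select voxels whose posterior is greater than or equal to 0.5."""
    n = len(grays)
    hist = {}
    for g in grays:
        hist[g] = hist.get(g, 0.0) + 1.0 / n  # normalized histogram of the new image
    return [posterior(p, g, mean, std, hist) >= 0.5 for p, g in zip(priors, grays)]


# Toy data (hypothetical): eight voxels, organ gray values around 100.
grays = [100, 100, 102, 30, 30, 30, 30, 101]
priors = [0.9, 0.9, 0.6, 0.9, 0.0, 0.0, 0.0, 0.1]
mask = segment(priors, grays, mean=100.0, std=2.0)
print(mask)
```

Note how the fourth voxel has a high prior (0.9) but is rejected because its gray value is very unlikely under the Gaussian, which is the effect that thresholding the prior alone would miss.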

4. Segmentation using Object Cloud Model

An object cloud model (OCM) for a given anatomical structure can be defined as a triple that consists of a fuzzy object, a delineation algorithm A, and a functional F [Miranda et al. 2008].

The fuzzy object is obtained by translating all training binary masks to the same reference point (their geometric center) and averaging their values. Due to the absence of registration, the uncertainty region of the fuzzy object tends to be much larger than that of the atlas. The uncertainty region is defined by voxels with values in the interval (0, 1). The interior and exterior of the model are defined by voxels that are inside all training masks (value 1) or outside all of them (value 0), respectively. Such a wider uncertainty region requires a more effective delineation algorithm than the one used in OAM to locally fine-tune the boundary of the anatomical structure. Figure 1 shows both the probabilistic atlas and the fuzzy object of the Left Pleural Sac, as obtained from CT images of the thorax.
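A minimal 2D sketch of the fuzzy object construction is given below, assuming small binary masks stored as lists of lists and a rounded centroid as the geometric center. The function names and the toy masks are hypothetical; the paper works with 3D masks.

```python
def centroid(mask):
    """Rounded geometric center (row, col) of the foreground pixels."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(pts)
    return (round(sum(r for r, _ in pts) / n), round(sum(c for _, c in pts) / n))


def fuzzy_object(masks, size):
    """Average the masks after moving each centroid to the canvas center."""
    center = size // 2
    acc = [[0.0] * size for _ in range(size)]
    for mask in masks:
        cr, cc = centroid(mask)
        for r, row in enumerate(mask):
            for c, v in enumerate(row):
                tr, tc = r - cr + center, c - cc + center  # translated position
                if v and 0 <= tr < size and 0 <= tc < size:
                    acc[tr][tc] += 1.0
    return [[v / len(masks) for v in row] for row in acc]


# Two toy training masks (hypothetical): a 3x3 block and a 1x3 bar.
a = [[1 if r <= 2 and c <= 2 else 0 for c in range(5)] for r in range(5)]
b = [[1 if r == 3 and 2 <= c <= 4 else 0 for c in range(5)] for r in range(5)]
model = fuzzy_object([a, b], size=5)

interior = sum(v == 1.0 for row in model for v in row)        # inside every mask
uncertain = sum(0.0 < v < 1.0 for row in model for v in row)  # values in (0, 1)
print(interior, uncertain)
```

In this toy case the bar is fully covered by the block after alignment, so the three bar pixels form the interior and the remaining six block pixels form the uncertainty region.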

We chose the Image Foresting Transform from Seeds Competition (IFT-SC) as the delineation algorithm A [Miranda and Falcao 2009]. It interprets an image as a graph by assigning each voxel to a node and connecting it to its 26 nearest neighbors. The weight of each arc between voxels is given by a combination of the magnitude of the 3D Sobel gradient of the original image and the magnitude of the 3D Sobel gradient of the conditional probability density map estimated for that image (the same one used for the atlas). The arc weights are meant to be higher on the border of the anatomical structure than inside and outside it. At any given position of the fuzzy object in the image, the interior and exterior voxels are used to define two different seed sets, which must compete for the most closely connected voxels in the uncertainty region. Then, the anatomical structure is defined by the union of the interior voxels and the voxels of the uncertainty region conquered by the interior seeds.
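The seed competition can be illustrated with a small sketch. It is a simplification under several assumptions that are not from the paper: a 2D image with 4-connected neighbors instead of the 3D 26-neighborhood, the absolute intensity difference as the arc weight instead of the combined Sobel gradients, and the maximum arc weight along a path as the path cost.

```python
import heapq


def ift_seed_competition(image, seeds):
    """Label every pixel with the seed label that reaches it at minimum cost.

    seeds maps (row, col) -> label (1 = interior seed, 0 = exterior seed).
    The cost of a path is the largest arc weight along it (assumption).
    """
    rows, cols = len(image), len(image[0])
    cost = {p: 0 for p in seeds}
    label = dict(seeds)
    heap = [(0, p) for p in seeds]
    heapq.heapify(heap)
    done = set()
    while heap:
        c, p = heapq.heappop(heap)
        if p in done:
            continue  # stale queue entry
        done.add(p)
        r, k = p
        for q in ((r + 1, k), (r - 1, k), (r, k + 1), (r, k - 1)):
            if 0 <= q[0] < rows and 0 <= q[1] < cols and q not in done:
                weight = abs(image[r][k] - image[q[0]][q[1]])  # stand-in arc weight
                new_cost = max(c, weight)
                if new_cost < cost.get(q, float("inf")):
                    cost[q] = new_cost
                    label[q] = label[p]  # q is conquered by the seed that reached p
                    heapq.heappush(heap, (new_cost, q))
    return label


# Toy image (hypothetical): a dark object on the left, bright background on the right.
image = [[10, 10, 10, 50, 50, 50]]
labels = ift_seed_competition(image, {(0, 0): 1, (0, 5): 0})
row_labels = [labels[(0, c)] for c in range(6)]
print(row_labels)
```

The two seed sets meet at the arc with the highest weight, so the object boundary falls on the intensity edge between the third and fourth pixels.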

For the functional F, which scores the candidate segmentations, we chose the mean arc weight along the IFT-SC graph cut [Miranda et al. 2008]. This way, for a given new image, the fuzzy object translates over the image and, for each position, the IFT-SC algorithm is executed inside the uncertainty region to obtain a candidate segmentation. The functional F is evaluated on this segmentation to obtain a matching score. The desired segmentation is expected to be the one with maximum score. The object search process can be optimized in several ways. We optimized it by estimating the search region (a small fraction of the image) from the translations of the training masks during construction of the model and by applying the Multi Scale Parameter Search (MSPS) optimization algorithm [Chiachia et al. 2011].
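The search for the best model position can be sketched in one dimension as follows. To keep the example short, it makes two substitutions that should be read as assumptions: each candidate segmentation is a fixed-width interval rather than an IFT-SC delineation, and the search region is scanned exhaustively rather than with MSPS. The names and the toy signal are hypothetical.

```python
def cut_score(signal, start, width):
    """Mean arc weight over the arcs that cross the candidate boundary."""
    arcs = []
    if start > 0:
        arcs.append(abs(signal[start] - signal[start - 1]))
    end = start + width
    if end < len(signal):
        arcs.append(abs(signal[end] - signal[end - 1]))
    return sum(arcs) / len(arcs) if arcs else 0.0


def search(signal, width, search_region):
    """Return the translation in search_region with the maximum score."""
    return max(search_region, key=lambda t: cut_score(signal, t, width))


# Toy signal (hypothetical): a bright object of width 4 starting at index 3.
signal = [0, 0, 0, 9, 9, 9, 9, 0, 0, 0]
best = search(signal, width=4, search_region=range(0, 7))
print(best, cut_score(signal, best, 4))
```

The score peaks only when both ends of the candidate coincide with the intensity edges, which mirrors the expectation that the matching score is maximum at the correct organ position.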

Figure 1. Coronal slices of the object (1) atlas and (2) cloud models of the Left Pleural Sac.

5. Results

We evaluated and compared both models, OAM and OCM, using three datasets: (a) 35 MR-T1 images of the brain with voxel sizes 0.98 x 0.98 x 0.98 mm and their corresponding binary masks for the Cerebellum (C), Left Hemisphere (LH) and Right Hemisphere (RH); (b) 35 CT images of the thorax with voxel sizes 2.5 x 2.5 x 2.5 mm and their corresponding binary masks for the Internal Mediastinum (IMS) and Pericardial Region (PC); in the case of PC, however, we had only 30 binary masks available; and (c) 40 low-dose CT images of the thorax with voxel sizes 1.25 x 1.25 x 1.25 mm and their corresponding binary masks for the following anatomical structures: Left Pleural Sac (LPS), Right Pleural Sac (RPS), Respiratory System (RS) and Thoracic Skin (TSkn). Datasets (a) and (b) were segmented by multiple experts using manual and interactive segmentation tools. Dataset (c) was obtained from the VIA/I-ELCAP Public Access Research Database, available at http://www.via.cornell.edu/lungdb.html. It was segmented using interactive segmentation tools, under the supervision of an expert.

Figure 2. Coronal slices of the resulting segmentations when using (1) OAM and (2) OCM to segment the Cerebellum.

The experiments used leave-one-out cross-validation. The results were evaluated using the Dice Similarity Coefficient (DSC), expressed as a percentage. We performed a two-tailed Z-test at the 95% confidence level. In this case, the critical value for rejecting the null hypothesis was +/-1.96.

Anat. Struct.   OCM Acc.          OAM Acc.          Z-test   Significant
C               92.34 +/- 1.42    87.53 +/- 4.72     5.76    Yes
LH              94.30 +/- 1.67    90.77 +/- 9.75     2.11    Yes
RH              94.60 +/- 1.77    91.53 +/- 8.30     2.14    Yes
TSkn            98.55 +/- 1.94    85.66 +/- 4.12    17.90    Yes
LPS             97.31 +/- 1.32    96.49 +/- 1.10     3.01    Yes
RPS             97.30 +/- 1.12    97.32 +/- 0.86    -0.08    No
RS              98.33 +/- 0.83    97.22 +/- 0.93     5.60    Yes
IMS             81.76 +/- 3.86    86.31 +/- 1.97    -6.22    Yes
PC              86.78 +/- 5.42    90.14 +/- 2.96    -2.98    Yes

Table 1. DSC accuracies in % (mean and standard deviation) for OCM and OAM.

Anat. Struct.   OCM time    OAM time
C                 30.38     13748.99
LH                59.31     14086.75
RH                64.81     14084.56
TSkn             653.64     28514.36
LPS              362.51     26791.41
RPS              414.70     26757.40
RS               658.36     26756.34
IMS               19.15     13856.59
PC                12.39     13820.99

Table 2. OCM and OAM segmentation times in seconds.

Table 1 shows the results for DSC. OCM produces significantly better accuracies for C, LH, RH, LPS, RS and TSkn. In the case of RPS, no significant difference was observed. Finally, for PC and IMS, OAM was significantly better than OCM. Figure 2 shows two coronal slices of the resulting segmentations when using OAM and OCM to segment the Cerebellum.
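The text does not spell out the Z statistic. As a sketch, assuming an unpaired two-sample Z-test in which both methods are evaluated on the same number of images n, the statistic is the difference of the means divided by the square root of the summed variances over n. Under that assumption, the values come out close to those reported in Table 1 (small differences are expected because the table entries are rounded).

```python
import math


def z_score(mean_a, std_a, mean_b, std_b, n):
    """Two-sample Z statistic for equal sample sizes (assumed form)."""
    return (mean_a - mean_b) / math.sqrt((std_a ** 2 + std_b ** 2) / n)


def significant(z, critical=1.96):
    """Two-tailed decision at the 95% confidence level."""
    return abs(z) > critical


# Cerebellum: 35 images in dataset (a); Right Pleural Sac: 40 images in dataset (c).
z_c = z_score(92.34, 1.42, 87.53, 4.72, n=35)
z_rps = z_score(97.30, 1.12, 97.32, 0.86, n=40)
print(round(z_c, 2), significant(z_c))
print(round(z_rps, 2), significant(z_rps))
```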

Table 2 shows the times in seconds required to use the models in order to segment a new image. They were measured on the same computer, with an Intel Core i5 processor at 2.4 GHz, 4 GB of RAM, and no graphics card. It can be noted that OCM provides a considerable speedup, being at least 40 times faster than OAM.
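The speedup claim can be checked directly from the times in Table 2. The short script below only restates the table values and divides the OAM time by the OCM time for each structure.

```python
# (OCM time, OAM time) in seconds, copied from Table 2.
times = {
    "C": (30.38, 13748.99),
    "LH": (59.31, 14086.75),
    "RH": (64.81, 14084.56),
    "TSkn": (653.64, 28514.36),
    "LPS": (362.51, 26791.41),
    "RPS": (414.70, 26757.40),
    "RS": (658.36, 26756.34),
    "IMS": (19.15, 13856.59),
    "PC": (12.39, 13820.99),
}

speedups = {name: oam / ocm for name, (ocm, oam) in times.items()}
slowest = min(speedups, key=speedups.get)  # structure with the smallest gain
for name, s in speedups.items():
    print(f"{name:5s} {s:8.1f}x")
print("minimum speedup:", slowest, round(speedups[slowest], 1))
```

The smallest ratio is the one for the Respiratory System, a little above 40, which is consistent with the "at least 40 times faster" statement.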

6. Conclusions

We implemented and compared two segmentation methods based on object appearance models. The first one is an OAM, which detects the object by registering the image with the atlas reference system and delineates it by thresholding the posterior probability map. The second one is an OCM, which replaces image registration by an object search based on translations of the model over the image. By avoiding the time-costly image registration for model construction and use, and by constraining the object search to a small region of the image, OCM provides a considerable speedup, being at least 40 times faster than OAM. OCM also exhibits significantly better accuracy for six of the nine evaluated anatomical structures, while OAM generates better results for two of them.

7. Acknowledgment

The authors thank CNPq (303673/2010-9, 479070/2013-0, 131835/2013-0) and FAPESP for the financial support. The authors also thank Dr. Jayaram K. Udupa for providing the medical images used to test the models, and their organ segmentations.

References

Chiachia, G., Falcao, A. X., and Rocha, A. (2011). Multiscale parameter search (MSPS): A deterministic approach for black-box global optimization. Tech. Rep. IC-11-15, Univ. of Campinas, Campinas, SP.

Gao, Y. and Tannembaum, A. (2011). Combining atlas and active contour for automatic 3D medical image segmentation. In The Eighth IEEE Intl. Symp. on Biomedical Imaging: From Nano to Macro (ISBI), pages 1401–1404.

Klein, S., Staring, M., Murphy, K., Viergever, M. A., and Pluim, J. P. W. (2010). Elastix: A toolbox for intensity based medical image registration. IEEE Transactions on Medical Imaging, 29(1):196–205.

Langerak, T. R., van der Heide, U. A., Kotte, A. N. T. J., Berendsen, F. F., and Pluim, J. P. W. (2011). Evaluating and improving label fusion in atlas-based segmentation using the surface distance. In SPIE on Medical Imaging, volume 7962, pages 1–7.

Lotjonen, J., Wolz, R., Koikkalainen, J., Thurfjell, L., Lundqvist, R., Waldemar, G., Soininen, H., and Rueckert, D. (2011). Improved generation of probabilistic atlases for the expectation maximization classification. In The Eighth IEEE Intl. Symp. on Biomedical Imaging: From Nano to Macro (ISBI), pages 1839–1842.

Miranda, P. A. and Falcao, A. X. (2009). Links between image segmentation based on optimum-path forest and minimum cut in graph. Journal of Mathematical Imaging and Vision, 35(2):128–142.

Miranda, P. A. V., Falcao, A. X., and Udupa, J. K. (2008). CLOUDS: A model for synergistic image segmentation. In The Fifth IEEE Intl. Symp. on Biomedical Imaging: From Nano to Macro (ISBI), pages 209–212.

Tamez-Pena, J., Gonzalez, P., Farber, J., Baum, K., Schreyer, E., and Totterman, S. (2011). Atlas based method for the automated segmentation and quantification of knee features: Data from the osteoarthritis initiative. In The Eighth IEEE Intl. Symp. on Biomedical Imaging: From Nano to Macro (ISBI), pages 1484–1487.

Udupa, J. K., Odhner, D., Falcao, A. X., Ciesielski, K. C., Miranda, P. A. V., Vaideeswaran, P., Mishra, S., Grevera, G. J., Saboury, B., and Torigian, D. A. (2011). Fuzzy object modeling. In SPIE on Medical Imaging, volume 7964, pages 1–10.

Vos, P. C., Isgum, I., Biesbroek, J. M., Velthuis, B. K., and Viergever, M. A. (2013). Combined pixel classification and atlas-based segmentation of the ventricular system in brain CT Images. In SPIE on Medical Imaging, volume 8669, pages 1–6.


A Fault-Tolerant Software Product Line for Data Collection using Mobile Devices and Cloud Infrastructure

Cecília M. F. Rubira¹, Gustavo M. Waku¹, Edson R. Bollis¹, Ricardo S. Torres¹, Flavio Nicastro¹, Leonor P. Morellato²

¹Institute of Computing – University of Campinas (UNICAMP)
Campinas, SP – Brazil

²Instituto de Biologia – Universidade Estadual Paulista (UNESP)
Rio Claro, SP – Brazil

Abstract. The constant evolution of mobile devices and their decreasing costs have made them suitable for data acquisition. In order to effectively support data acquisition procedures, some challenges related to data integrity and mobile availability, such as battery level and data loss, have to be addressed. We propose a modular and configurable fault-tolerant Software Product Line (SPL) through the use of aspects. As a proof of concept, a product generated from the SPL to support the acquisition of phenological data in the field using mobile devices is being developed.

1. Introduction

In the past few years, sales of mobile devices have increased year after year. Their increasing processing power and constant cost reduction have made them very popular. Today, these devices are equipped with Wi-Fi, 2G/3G/LTE antennas, cameras, accelerometers, GPS, Bluetooth, and high-speed processors, which make them suitable for collecting a variety of data. Some devices are already used to collect data in the field for surveying and for monitoring environmental changes [Pundt 2002], e.g., information about citizens (census) or about which areas are affected by an epidemic disease (dengue).

Android¹ is an open platform, developed by Google and the Open Handset Alliance, targeting mobile devices, in particular tablets and mobile phones. This ecosystem has grown considerably and has attracted many companies to develop applications for it. In this scenario, there is an opportunity to create data acquisition applications for this platform that target specific markets, for example, an application to collect demographic information or biological information about certain plants, while sharing common software parts. Instead of developing a single application at a time, which means duplication of effort, the use of a Software Product Line (SPL) is promising.

SPL is an approach to software development that allows the reuse of core assets and the management of software variability to obtain a family of similar products. Core assets are common software artifacts, including code and software models. Software variability is the ability to change, customize, and evolve software. Variation points represent places where decisions are made to create different products aimed at specific markets. Variants are the alternatives associated with these variation points. In the mobile phone industry, the GPU (Graphics Processing Unit) type can be a variation point, and each specific GPU is a variant. Recently, a study has shown that the use of aspects applied to SPL creates more stable architectures [Tizzei et al. 2010]. Aspects modularize concerns that are scattered across different places; classic examples include logging, exception handling, and authentication.

¹ http://developer.android.com/index.html (as of July 2014).

The SPL approach could be used to create applications that share common parts, using aspects to modularize scattered concerns and allowing customization for domains with different business rules. In other words, there may be an SPL with the same core assets but instantiated for different domains such as agriculture, geology, and phenology.
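To make the terms concrete, the following sketch models variation points, variants, and product derivation in a few lines, and uses a decorator as a rough stand-in for a logging aspect. Everything here is illustrative: the "storage" variation point, the function names, and the decorator-based logging are assumptions of this sketch and not the aspect-oriented technology used in the project.

```python
import functools

# Variation points and their variants. The domains come from the text;
# the "storage" variation point is a hypothetical addition.
VARIATION_POINTS = {
    "domain": {"agriculture", "geology", "phenology"},
    "storage": {"local", "cloud"},
}


def derive_product(**choices):
    """Derive a product by binding every variation point to one variant."""
    for point, variants in VARIATION_POINTS.items():
        if choices.get(point) not in variants:
            raise ValueError(f"variation point {point!r} needs one of {sorted(variants)}")
    return dict(choices)


call_log = []


def logged(func):
    """Logging kept apart from the business code, in the spirit of an aspect."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@logged
def collect(product, value):
    """Core asset shared by every product of the family."""
    return {"domain": product["domain"], "value": value}


product = derive_product(domain="phenology", storage="local")
record = collect(product, "flowering")
print(record, call_log)
```

The core function `collect` contains no logging code; the concern is attached from outside, so it can be changed or removed without touching the shared asset.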

Creating software for data collection helps to improve the quality of the collected data and to reduce the turnaround time of the analysis [Gravlee et al. 2006]. However, there are challenges related to data acquisition that need to be addressed: data integrity and mobile availability.

In Brazil, according to Oliveira and Nascimento [Oliveira and Nascimento ], there is a lack of investment in network infrastructure, which prevents remote and low-income areas from transmitting real-time data from the field. When the data cannot be transmitted, integrity and availability become a concern. Integrity is a property related to the non-occurrence of improper data change [Laprie 1995], that is, whether the system stores data correctly, making sure it is not lost or incompletely stored. Availability is the readiness for usage [Laprie 1995]: the system should be operable most of the time in the field.

Integrity and availability are the properties that will guide data collection, because problems related to battery level, data storage, and hardware or system crashes are very common in the field. These properties are attributes of dependability. Dependability is a property of a system such that one can justifiably rely on the service this system delivers [Laprie 1995]. Part of the concept of dependability is connected with impairments, which fall into three categories: fault (the root cause of the error), error (part of the system state, which occurs in the presence of a fault), and failure (deviation from the specified service) [Pullum 2001].

In order to control the error state and to compensate for the presence of faults, we propose to use fault tolerance techniques. Fault tolerance is the ability of an application to deliver correct service even in the presence of faults [Pullum 2001]. In order to tolerate faults, we have to describe the resulting behavior of the system and classify the expected faults, which generates a fault model [Gartner 1999].
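As an illustration of tolerating one expected fault class (an unreachable server), the sketch below buffers each record locally together with a checksum and retries the upload later. It shows a common store-and-forward pattern under stated assumptions and is not the design proposed in the project; the class and function names are hypothetical.

```python
import hashlib
import json


class ServerUnavailable(Exception):
    """Expected fault in this sketch: the server cannot be reached."""


def checksum(record):
    """SHA-256 digest of a canonical JSON encoding, used to detect corruption."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()


class Collector:
    def __init__(self, send):
        self.send = send      # callable that uploads one item or raises
        self.pending = []     # local buffer, so records are not lost

    def submit(self, record):
        self.pending.append({"record": record, "sha256": checksum(record)})
        self.flush()

    def flush(self):
        """Upload buffered items in order; stop quietly if the server is down."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ServerUnavailable:
                return False
            self.pending.pop(0)
        return True


# Simulated server that is offline at first (hypothetical test double).
received = []
state = {"online": False}


def send(item):
    if not state["online"]:
        raise ServerUnavailable()
    received.append(item)


collector = Collector(send)
collector.submit({"plant": "A", "phase": "flowering"})  # buffered, server offline
buffered_while_offline = len(collector.pending)
state["online"] = True
collector.submit({"plant": "B", "phase": "fruiting"})   # both records are delivered
print(buffered_while_offline, len(received), collector.pending)
```

In the fault, error, failure vocabulary used above, the unreachable server is the fault, the undelivered record is the error state held in the buffer, and the retry prevents it from becoming a failure (lost data).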

Our project intends to create a set of core assets that deal with the challenges of data collection in the field, in order to achieve dependability and to allow the creation of similar data collection applications. In order to assess our solution, as a proof of concept, we propose a case study in the phenology domain. In this case study, the SPL will be used to create a fault-tolerant application to support phenological data acquisition in the field.

The rest of this document is organized as follows. Section 1.1 states the problem. Section 2 shows our proposed solution and the methods used. Section 3 shows our expected results. Section 4 shows some related research. Section 5 discusses the case study. Section 6 shows how we intend to evaluate our results. Section 7 shows the current research status. Section 8 shows our acknowledgements.


1.1. Problem Statement

The cost reduction of mobile devices and their constant evolution have made them a suitable way to collect data in the field [Gravlee et al. 2006]. Data collection can be applied in a variety of sectors, including agriculture, biology, and geology, creating an opportunity to build software core assets that provide a high degree of customization. However, when collecting data, limited network availability introduces problems related to data integrity and mobile availability.

Some of those problems are associated with software faults (bugs in the software), low battery level, and failure to access the server. In the field, when the mobile device can neither send data to nor receive data from the server, the efficiency of the research can be reduced, data can be lost in accidents, or data collection can be interrupted by the occurrence of a failure.

The main research question to be addressed in this project is: how can we develop applications that can be customized to different domains considering different fault models and different levels of fault tolerance?

2. Proposed Solution and Methods

We propose to build an SPL that incorporates fault-tolerance techniques, with the aim of producing core assets that can be used to create data acquisition applications with a high degree of customization to suit different domains, handling data integrity and mobile availability on the Android platform. As a case study, we intend to build, as a proof of concept, a product based on an SPL for the phenology domain (see Section 5).

Collecting field data requires the use of mobile devices and also a cloud infrastructure to store the collected information. The SPL will create core assets for mobile devices and for the cloud infrastructure. To build the SPL, we intend to use methods such as FODA (Feature Oriented Domain Analysis) [Kang et al. 1990], AO-FArM (Aspect Oriented Feature Architecture Mapping) [Tizzei et al. 2012] and COSMOS*-VP (COmponent System MOdel for Software architectures) [Dias et al. 2010]. FODA is an approach to perform domain analysis; AO-FArM provides a step-by-step guideline to create a software architecture; and COSMOS*-VP is a component implementation model to realize the architecture in source code.

To deal with availability and integrity, it is necessary to develop a fault model, because it is impossible to prevent all the faults that can occur [Dubrova 2013]. Once the fault model is defined, it is possible to build a fault-tolerant solution for the data collection domain. In this domain, the main problems are related to battery level, data loss, and data integrity.

In order to mitigate failures due to low battery status, we will develop an abstract module to calculate and estimate battery usage. To prevent failures regarding data loss and data integrity, we will introduce redundancy in storage (e.g., synchronizing data among redundant mobile devices).
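
As a rough illustration of what a battery estimation module might compute, the sketch below extrapolates the remaining autonomy linearly from two battery readings. The function name `estimate_remaining_hours`, the `(hours_elapsed, battery_percent)` sample format, and the linear-drain assumption are all hypothetical and are not taken from the project; it is written in Python for brevity rather than for the Android platform.

```python
def estimate_remaining_hours(samples):
    """Estimate remaining autonomy, in hours, from battery readings.

    `samples` is a list of (hours_elapsed, battery_percent) pairs in
    chronological order. Assumption: the drain rate observed between the
    first and last reading stays constant (simple linear extrapolation).
    Returns None when no drain can be measured.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    drain_per_hour = (b0 - b1) / (t1 - t0)
    if drain_per_hour <= 0:
        # Battery level is stable or rising (e.g., the device is charging).
        return None
    return b1 / drain_per_hour


# Hypothetical readings: 100% at the start, 80% two hours later.
readings = [(0, 100), (2, 80)]
print(estimate_remaining_hours(readings))
```

An application could compare such an estimate with the planned duration of a field session and, for example, trigger synchronization to a redundant device before the battery runs out.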

3. Expected Results

This project has the following expected results:

• A fault model for representing the main problems on data collection.
• A dependable SPL for data collection in the cloud.
• A dependable SPL for data collection in mobile applications.
• An architecture for mobile data collection.
• An abstract module to calculate and estimate the battery usage and autonomy.
• An application and an infrastructure for data acquisition in the phenology domain (described in Section 5).

4. Related Work

4.1. Software Product Lines in mobile platform

Figueiredo et al. [Figueiredo et al. 2008] present a comparative study of an SPL called MobileMedia, developed in J2ME². Two SPLs were implemented using different programming paradigms: Object-Oriented Programming and Aspect-Oriented Programming (AOP). Later, different metrics were extracted to evaluate the change impact, cohesion, coupling and size of the program. The authors concluded that an Aspect-Oriented SPL tends to have a more stable design when implementing optional and alternative features.

Tizzei et al. [Tizzei et al. 2010] extend the work of Figueiredo et al. [Figueiredo et al. 2008] by adding an analysis of pure component-based SPLs and of a hybrid approach (with components and AOP). The combination of AOP and components created an SPL that is more resilient, more cohesive and less coupled.

The work of Alves et al. [Alves et al. 2005] shows a method that uses incremental and extractive approaches to game evolution in mobile environments. The authors built an SPL in J2ME that creates product-specific configurations (such as screen size, APIs and functions). Three game applications were refactored to create a product line that uses AOP to isolate functions and represent specific configurations.

4.2. Fault Tolerance in mobile platform

Acker et al. [Acker et al. 2009] propose a fault injection tool for mobile devices based on a fault model proposed by Cristian [Cristian 1991]. They use interference in message reception to cause faults in the communication between the server and mobile devices. They also propose to generate a fault model for mobile devices and to study device behavior based on implementations that ensure dependability. Their study is directly related to this project because fault injection is a suitable way to test and evaluate architectural patterns for fault tolerance.

The studies discussed above address SPLs and fault tolerance in mobile platforms, but none of them proposes an integrated SPL that deals with the challenges of data collection. We aim to provide this using Android technology to develop a product for the phenology domain in the context of the e-Phenology project, presented in Section 5. In addition, none of the SPL studies used AOP on the Android platform.

5. Case Study - E-Phenology

The e-Phenology project³ addressed theoretical and practical problems involving the use of digital cameras for near-surface remote leaf phenological observation. It set up the first tower-based near-surface digital monitoring system in the Brazilian cerrado savanna vegetation, coupled to climatic stations and associated with local on-the-ground phenology observations to validate our near-surface remote data. This project is led by a partnership among University of Campinas (UNICAMP), Universidade Estadual Paulista (UNESP), São Paulo Research Foundation (FAPESP) and Microsoft Research.

² J2ME - Java Platform Micro Edition
³ http://www.recod.ic.unicamp.br/ephenology/ (As of July 2014).

Phenology is the study of periodic animal and plant life-cycle events and how these are influenced by seasonal variations in climate. Phenology studies open a wide range of opportunities to address different problems because they can combine direct on-the-ground observation with data collected by different devices, from digital cameras to spectral sensors. On-the-ground phenological observations preclude large areas of study and are laborious and time consuming. In the case of the e-Phenology project, for example, on-the-ground data refer to phenological properties of more than 2,000 plant individuals observed on a monthly basis since 1994. These data were initially registered in paper spreadsheets and later curated in the lab to guarantee, for example, consistency among observations over time (e.g., fruits should be observed after flowers). Finally, curated data are inserted into electronic spreadsheets to be processed and analyzed. The whole data acquisition process is error-prone, which may negatively impact scientific discoveries. Besides, the study site is a remote location subject to limited network availability. In this scenario, there is a strong requirement for fault tolerance because the amount of data collected is very large, which makes it a real case study for developing core assets from which to derive a fault-tolerant application to collect phenological data.

6. Evaluation

The evaluation will use robustness testing based on fault injection techniques [Hsueh et al. 1997] to assess the effectiveness of the fault tolerance model used in the application.

We are also interested in evaluating the benefits that AOP brings to the Android platform in terms of the number of lines of code it avoids, the flexibility achieved through its integrated use with the SPL, and the amount of memory consumed. To measure this, we will use a set of metrics similar to the one already used by colleagues [Figueiredo et al. 2008, Tizzei et al. 2010].

7. Research Status

We have a working prototype that collects data for e-Phenology. This prototype has helped identify many requirements. The models necessary to obtain an architecture have been created, and the specification is in progress. We expect to reuse some of the prototype’s code in the final product.

8. Acknowledgments

Authors are grateful to CAPES, CNPq, and the FAPESP-Microsoft Virtual Institute.

References

Acker, E. V., Weber, T. S., and Cechin, S. L. (2009). Injeção de falhas para validar aplicações em ambientes móveis. XI Workshop de Testes e Tolerância a Falhas.

Alves, V., Matos Jr, P., Cole, L., Borba, P., and Ramalho, G. (2005). Extracting and evolving mobile games product lines. In Software Product Lines, pages 70–81. Springer.

Cristian, F. (1991). Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2).

Dias, M. O., Tizzei, L., Rubira, C. M., Garcia, A. F., and Lee, J. (2010). Leveraging aspect-connectors to improve stability of product-line variabilities. In VAMOS’10: Proceedings of the 4th International Workshop on Variability Modelling of Software-Intensive Systems, pages 21–28.

Dubrova, E. (2013). Fault-Tolerant Design. Springer.

Figueiredo, E., Cacho, N., Sant’Anna, C., Monteiro, M., Kulesza, U., Garcia, A., Soares, S., Ferrari, F., Khan, S., Filho, F. C., and Dantas, F. (2008). Evolving software product lines with aspects: An empirical study on design stability. In ICSE ’08. ACM/IEEE 30th International Conference on Software Engineering, pages 261–270.

Gartner, F. C. (1999). Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys (CSUR), 31(1):1–26.

Gravlee, C. C., Zenk, S. N., Woods, S., Rowe, Z., and Schulz, A. J. (2006). Handheld computers for direct observation of the social and physical environment. Field Methods, 18(4):382–397.

Hsueh, M.-C., Tsai, T. K., and Iyer, R. K. (1997). Fault injection techniques and tools. Computer, 30(4):75–82.

Kang, K. C., Cohen, S. G., Hess, J. A., Novak, W. E., and Peterson, A. S. (1990). Feature-oriented domain analysis (FODA) feasibility study. Technical report, DTIC Document.

Laprie, J.-C. (1995). Dependable computing: Concepts, limits, challenges. In FTCS-25, the 25th IEEE International Symposium on Fault-Tolerant Computing - Special Issue, pages 42–54.

Oliveira, M. M. d. S. and Nascimento, M. P. d. III Encontro ULEPICC-Br. Novas tecnologias e estratégias competitivas: a atuação das operadoras de telefonia móvel no Brasil, frente à convergência com os serviços de TV por assinatura e internet. GT1: Políticas de comunicação.

Pullum, L. L. (2001). Software Fault Tolerance Techniques and Implementation. Artech House.

Pundt, H. (2002). Field data collection with mobile GIS: Dependencies between semantics and data quality. GeoInformatica, 6(4):363–380.

Tizzei, L., Rubira, C., and Lee, J. (2012). An aspect-based feature model for architecting component product lines. In Software Engineering and Advanced Applications (SEAA), 2012 38th EUROMICRO Conference on, pages 85–92.

Tizzei, L. P., Dias, M., Rubira, C. M. F., Garcia, A., and Lee, J. (2010). Components meet aspects: Assessing design stability of a software product line. Information and Software Technology, 53(2):121–136.


A Multiscale and Multi-Perturbation Blind Forensic Technique For Median Detecting

Anselmo Ferreira¹ and Anderson Rocha¹

¹Institute of Computing, University of Campinas
Av. Albert Einstein, 1251, Cidade Universitária, Campinas/SP - Brasil, CEP 13083-852

Abstract. This paper aims at detecting traces of median filtering in digital images, a problem of paramount importance in digital forensics, given that filtering can be used to conceal traces of image tampering such as resampling and light direction in photomontages. To accomplish this objective, we present a novel approach based on multiple and multiscale progressive perturbations on images, able to capture different median filtering traces by using image quality metrics. Such measures are then used to build a discriminative feature space, suitable for classifying whether or not a given image contains signs of filtering.

Resumo. Esse artigo objetiva detectar traços de filtragem mediana em imagens digitais, um problema de fundamental importância em análise forense de imagens digitais, dado que essa filtragem pode ser usada para esconder traços de falsificações de imagens tais como redimensionamento e direção de luz em foto-montagens. Para alcançar esse objetivo, apresentamos uma nova abordagem baseada em perturbações múltiplas e progressivas em imagens de forma a capturar diferentes traços de filtragem mediana através do uso de métricas de qualidade de imagens. Essas medidas são utilizadas para construir um espaço de características discriminativo, capaz de classificar apropriadamente se uma imagem sofreu ou não borramento mediano.

1. Introduction

The ever-growing availability of digital images resulting from the massive use of cheap digital cameras and social networks has allowed people to easily share information everywhere. With such a massive flood of information, the question is how to tell the real from the fake. The old adage “one image is worth a thousand words” no longer holds intact. People are frequently misled and innocently believe what photographs depict. Therefore, the development of reliable forensic detection tools is paramount.

One way of detecting image tampering is to analyze the artifacts left by resampling operations. Popescu and Farid [Popescu and Farid 2004] noted that resampling operations use interpolation techniques and proposed an Expectation-Maximization technique for finding periodic samples of the image and detecting resampling operations. One particular problem of this technique is the assumption of a linear correlation among the pixels. As stated by Kirchner and Bohme [Kirchner and Bohme 2008], a non-linear filter such as the median filter can destroy these resampling artifacts by replacing each pixel with the median-valued pixel within a neighborhood. Given the use of median filtering for hiding traces of image doctoring, several researchers have tackled the problem of detecting it in recent years [Kirchner and Fridrich 2010, Yuan 2011, Chen et al. 2013]. Most of these methods assume that median filtering leaves streaking artifacts and use them as a proxy to detect filtering.


In this paper, we propose a median filtering detection algorithm based on the hypothesis that the median filtering streaking artifacts affect the image quality under multiscale filterings (filterings with different regions of interest) and over progressive perturbations (henceforth, perturbations are defined as cascade-wise successive image filterings). We evaluate Image Quality Metrics (IQMs) upon perturbed images, building a highly discriminative feature space for later classification. Experiments with compressed and uncompressed public datasets confirm the method’s competitiveness without assuming anything about the underlying filtering process of the input images.

2. Proposed Method

Most of the median filtering detection techniques proposed in the literature rely on artifacts known as streaking [Bovik 1987]. In median filtered images, the probability that two pixels at a given distance have the same value is high due to the nature of median filtering: when the sliding window moves, the probability that the changed pixel is the median value of the new pixel grid in the translated window is high [Kirchner and Fridrich 2010].

For detecting median filtered images, we observe that an image that was never median-filtered, when filtered for the first time, will exhibit a behavior different from that of already filtered images after the same operation. We can make an analogy to text file compression in this case. When a text file is compressed for the first time, the file size most often decreases because there are redundancies in the text. However, when a second compression is applied to an already compressed file, chances are the file size will increase, because far fewer redundant elements remain to be compressed.
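
The compression analogy can be tried directly. The sketch below is only an illustration, assuming Python's standard `zlib` module as the compressor and a made-up sample text; the helper name `compression_sizes` is hypothetical. The first pass shrinks the text, while the second pass typically gains little or nothing and may add a few bytes of overhead; the exact sizes depend on the input.

```python
import zlib


def compression_sizes(data):
    """Return the sizes of the raw, once-compressed and twice-compressed bytes."""
    once = zlib.compress(data)
    twice = zlib.compress(once)
    return len(data), len(once), len(twice)


# Made-up sample text with the kind of redundancy found in ordinary prose.
sample = (
    "Field teams record observations on paper and later type them into spreadsheets. "
    "Each record lists the site, the date, the observer and the state of every plant. "
    "Many words and phrases repeat across records, so a compressor finds redundancy. "
    "After compression the byte stream looks close to random and repeats very little."
).encode("utf-8")

raw_size, once_size, twice_size = compression_sizes(sample)
print(raw_size, once_size, twice_size)
```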

The opposite behavior happens in median-filtered images after filtering them again. In this case, the redundancy will increase because the streaking artifacts will be highlighted. This can be easily observed if one applies median filtering repeatedly until the result no longer changes (idempotency). To detect such behavior, we perform n progressive and multi-scale perturbations on the image. These perturbations consist of multi-scale filters applied on the image progressively. We consider filtering windows of size 3 × 3, 5 × 5, 7 × 7 and 9 × 9.

The rationale for using multi-scale perturbations is that when an already filtered image is filtered once more using a different median filtering window, the streaking artifacts will be emphasized. When applied in succession (progressively), the filtering tends to find groups of streaking pixels instead of just a few of them, as happens when considering a single scale as in previous approaches. By applying median filtering in this way, the image quality will degrade differently from that of pristine images, so IQMs [Eskicioglu and Fisher 1995] can capture the traces of median filtering after the progressive and multiscale perturbations. Given an input image, we compare it to its perturbed (filtered) versions by using eight bivariate IQMs per perturbation and scale: Peak Signal to Noise Ratio, Structural Content, Average Difference, Maximum Difference, Normalized Cross Correlation, Normalized Absolute Error, Structural Similarity, and Mean Squared Error.

The proposed method performs a set of perturbations by applying median filtering n times with w × w window sizes and calculating q IQMs on each filtered image version. Each filtered image is the input for the following filtering, and we can use s windows (scales) with different sizes to perform the filtering. By comparing each filtering result with the input image using q IQMs, we can build a discriminative feature space for proper learning of a classifier by concatenating the q quality measures calculated per filtered image. The feature vector therefore has s × n × q dimensions. The same is done for the testing images, and a classifier (such as a Support Vector Machine) trained with the data from the training images can discriminate between pristine and median filtered images.
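
The feature construction described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: images are grayscale 2-D lists, borders are handled by pixel replication, the progressive filtering restarts from the input image for each scale, Structural Similarity is computed globally over the whole image rather than over local windows, and PSNR is capped at 100 dB when two images are identical. The function names `median_filter`, `iqms` and `extract_features` are hypothetical.

```python
import math
from statistics import median


def median_filter(img, w):
    """Apply a w x w median filter to a 2-D list, replicating border pixels."""
    rows, cols = len(img), len(img[0])
    r = w // 2
    out = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            window = [
                img[min(max(y + dy, 0), rows - 1)][min(max(x + dx, 0), cols - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out_row.append(median(window))
        out.append(out_row)
    return out


def iqms(ref, test, peak=255.0):
    """Return eight full-reference quality measures of `test` against `ref`."""
    a = [float(p) for row in ref for p in row]
    b = [float(p) for row in test for p in row]
    n = len(a)
    diff = [x - y for x, y in zip(a, b)]
    mse = sum(d * d for d in diff) / n
    # Assumption: cap PSNR at 100 dB so identical images give a finite value.
    psnr = 100.0 if mse == 0 else min(100.0, 10 * math.log10(peak ** 2 / mse))
    sum_a2 = sum(x * x for x in a) or 1.0
    sum_b2 = sum(y * y for y in b) or 1.0
    sum_abs_a = sum(abs(x) for x in a) or 1.0
    # Assumption: global (single-window) SSIM instead of the local-window form.
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )
    return [
        psnr,                                        # Peak Signal to Noise Ratio
        sum_a2 / sum_b2,                             # Structural Content
        sum(diff) / n,                               # Average Difference
        max(abs(d) for d in diff),                   # Maximum Difference
        sum(x * y for x, y in zip(a, b)) / sum_a2,   # Normalized Cross Correlation
        sum(abs(d) for d in diff) / sum_abs_a,       # Normalized Absolute Error
        ssim,                                        # Structural Similarity
        mse,                                         # Mean Squared Error
    ]


def extract_features(img, windows=(3, 5, 7, 9), n=3):
    """Concatenate q = 8 IQMs for n progressive filterings at each of s scales."""
    features = []
    for w in windows:
        current = img
        for _ in range(n):
            current = median_filter(current, w)   # output feeds the next filtering
            features.extend(iqms(img, current))   # always compared with the input
    return features
```

With s = 4 windows, n = 3 perturbations and q = 8 measures, `extract_features` returns a 4 × 3 × 8 = 96-dimensional vector; with n = 4 it returns 128 dimensions.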


Figure 1 shows how the proposed method works.

Figure 1. An image is filtered n times in different scales (s = 4) and eight IQMs are calculated (q = 8). The feature space comprises the fusion of n × 8-d vectors created at each perturbation in four scales for each image. Any machine learning classifier can later be used to carve the decision boundary in the formed IQMs feature space.

3. Experiments

For validating the proposed method and comparing it to the ones in the literature, we have chosen a cross-dataset validation protocol in which the training samples contain images taken under similar conditions and the testing samples cover a variety of conditions, which can be seen as a real-world situation.

The training datasets comprise low-resolution images with nearly similar lighting conditions, and the median filtered samples were blurred using one median filtering window size (i.e., 3 × 3). The compressed images training set contains 3,996 JPEG images from the Chinese Academy Image Tampering Database (CASIA) [Cas 2010]. The compressed images testing dataset is more complex and comprises 800 JPEG images with very different resolutions, taken with different cameras and smartphones. Also, the blurred images in the testing dataset were blurred with different median filtering implementations (MATLAB, OpenCV and GIMP) and different median window sizes. We used a similar configuration for a situation where compressed and uncompressed images can occur. In this case, we used the previous compressed dataset along with 2,773 uncompressed images from CASIA [Cas 2010] for training and 1,338 uncompressed images from UCID [Schaefer and Stich 2004] for testing.

The experiments have two parts. First, we assess the choices for improving the proposed classifier, such as the number of scales and perturbations considered for building the feature vectors. Then, the best configuration found during this validation step is compared to existing methods in the literature (second part), considering compressed and uncompressed images. The minimal number n of perturbations of the proposed technique was found through statistical tests after randomly splitting the 3,996 images of the publicly available CASIA [Cas 2010] dataset 10 times into training and testing sets (also known as 5 × 2 cross-validation).

Since we are using this dataset for fine-tuning the parameters of our method, it will be used only as a training dataset in Sec. 3 when comparing our method to others in the literature. That is why we chose to perform a cross-dataset validation, which is fairer and closer to the actual forensic scenario one might face. Here, we characterize the images as described in Sec. 2 and train an SVM classifier with an RBF kernel, whose parameters are automatically learned during training according to the 5 × 2 cross-validation protocol used.
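
The splitting step of a 5 × 2 cross-validation can be sketched as below, assuming the standard scheme of five random repetitions of a two-fold split, which yields ten train/test pairs. The function name `five_by_two_splits` and the fixed random seed are illustrative assumptions; the classifier training itself is omitted.

```python
import random


def five_by_two_splits(n_samples, repetitions=5, seed=0):
    """Return 2 * repetitions (train_indices, test_indices) pairs.

    Each repetition shuffles the sample indices, cuts them into two halves,
    and uses each half once for training and once for testing.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repetitions):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        half = n_samples // 2
        first, second = indices[:half], indices[half:]
        splits.append((first, second))
        splits.append((second, first))
    return splits


# 3,996 is the number of CASIA images mentioned in the text.
splits = five_by_two_splits(3996)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

The mean and standard deviation of the accuracies obtained over the ten pairs give figures of the form reported in Table 1.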


For validation, we performed two rounds of experiments. In the first one, we used just one window w × w for blurring, where w ∈ {3, 5, 7, 9}, and varied the number of perturbations n in the interval 1 ≤ n ≤ 5. In the second round, we used a multiscale approach, in which we used all four window sizes for filtering the images and varied only the number of perturbations. Table 1 shows the three best experiment results.

Number of Perturbations    Windows                        Result
3                          3 × 3                          98.8% ± 0.22
4                          3 × 3, 5 × 5, 7 × 7, 9 × 9     98.7% ± 0.29
3                          3 × 3, 5 × 5, 7 × 7, 9 × 9     98.7% ± 0.28

Table 1. Results in mean classification accuracy after a 5 × 2 cross-validation on the CASIA dataset [Cas 2010].

The ANOVA statistical test results confirm that the effects of varying the number of windows and the number of perturbations are statistically significant (p-value < 0.05) and that these factors are correlated. Figure 2 shows the results of Tukey tests for pairwise comparisons.

[Figure 2: Tukey plots at the 95% family-wise confidence level. Panel (a): differences in mean levels of perturbation. Panel (b): differences in mean levels of window, where the window levels are 1 = 3 × 3 window, 2 = 5 × 5 window, 3 = 7 × 7 window, 4 = 9 × 9 window, 5 = all windows.]

Figure 2. (a) Tukey test pairwise comparison in factor perturbation. (b) Tukey test pairwise comparison in factor window.

As we can see in Fig. 2(a), there is no statistical difference among using three, four, or five perturbations. However, there is a significant difference when using more than one perturbation. In addition, varying the window sizes is statistically significant according to Fig. 2(b). The ANOVA test on the three best algorithms yielded a p-value of 0.79, which allows us to state that the accuracy difference among these techniques is not statistically significant. We chose to use the last two configurations in the second part of the experiments, when comparing to the methods in the literature, because the first one yielded slightly worse results.

We now turn our attention to comparing the classification results of two of the three best approaches of the proposed technique, with 128- and 96-dimensional feature vectors. We call these techniques, respectively, FPMW (Four Perturbations, Multiple Windows) and TPMW (Three Perturbations, Multiple Windows). We compare them to the following state-of-the-art methods: (1) Kirchner and Fridrich [Kirchner and Fridrich 2010] with T = 3 and second order Markov Chains as described in their work, yielding a 686-d feature vector (we call it SPAM); (2) Yuan [Yuan 2011] in 3 × 3 blocks and a 44-d feature vector (which we call MFF); and (3) Chen et al. [Chen et al. 2013] with parameters T = 10, B = 3 and K = 2 and 56-dimensional feature vectors as described in their work (which we call GLF).

We then used an SVM with RBF kernel from LIBSVM [Chang and Lin 2011], whose parameters were learned during training using LIBSVM’s built-in grid-search method. Table 2 shows the results for the compressed dataset and the McNemar’s statistical test results between the best ranked technique and the others. Table 3 shows tests for the scenario with compressed and uncompressed images present during training and testing.

According to Table 2, the proposed technique FPMW presents the best accuracy, correctly classifying 676 out of 800 test images, while SPAM and GLF correctly classified 561 and 521 out of 800 images, respectively. Note that both SPAM and GLF present low specificities (42% and 31%, respectively) on this dataset. In a forensic scenario, low specificity often means putting the blame on an innocent person and is undesirable. The second best technique was the proposed TPMW, which correctly classified 657 out of 800 images, followed by MFF (561 out of 800 images).

               TPMW     FPMW     SPAM [Kirchner and Fridrich 2010]   MFF [Yuan 2011]   GLF [Chen et al. 2013]
Accuracy       82.1%    84.5%    70.1%                               70.1%             65.1%
Sensitivity    92%      91%      98%                                 88%               99%
Specificity    72%      77%      42%                                 52%               31%
Precision      76%      80%      62%                                 64%               59%
Significant?   yes      -        yes                                 yes               yes

Table 2. Compressed dataset experiment results and McNemar’s statistical tests comparing the best technique (FPMW) with the others. TPMW and FPMW are variations of the proposed method discussed in Sec. 2.
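
The four rates reported in the tables follow from the counts of a binary confusion matrix. The sketch below assumes that "positive" means a median-filtered image; the function name `binary_metrics` and the example counts are hypothetical and are not the experiment's actual confusion matrices.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, specificity and precision.

    tp: filtered images classified as filtered
    fn: filtered images classified as pristine
    tn: pristine images classified as pristine
    fp: pristine images classified as filtered
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for 100 filtered and 100 pristine test images.
print(binary_metrics(tp=90, fn=10, tn=80, fp=20))
```

As a check against the text, 676 correctly classified images out of 800 give an accuracy of 676 / 800 = 84.5%, the value listed for FPMW.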

               TPMW     FPMW     SPAM [Kirchner and Fridrich 2010]   MFF [Yuan 2011]   GLF [Chen et al. 2013]
Accuracy       82.2%    80.8%    77.9%                               74.2%             79.9%
Sensitivity    78.2%    74.4%    68.3%                               76.9%             90.9%
Specificity    90.2%    91.5%    90.6%                               78.6%             76.8%
Precision      88.9%    89.8%    87.9%                               78.3%             79.7%
Significant?   -        yes      yes                                 yes               yes

Table 3. Cross-dataset experiment with compressed and uncompressed images and McNemar’s statistical tests comparing the best technique (TPMW) with the others. TPMW and FPMW are variations of the proposed method discussed in Sec. 2.

Table 3 shows the performance of the proposed method compared to the ones in the literature when dealing with compressed and uncompressed images simultaneously. In this case, TPMW presents the best accuracy, correctly classifying 1,801 out of 2,138 images (remember that the test dataset now contains 800 compressed and 1,338 uncompressed images). The next top performers are, in order, FPMW, GLF, and SPAM. MFF presented the worst results (1,663 images correctly classified). Although the proposed methods perform well in these scenarios, there are situations where they are not the top performers. This happens especially when the training and testing sets contain only images without any compression. Considering a cross-dataset scenario where only uncompressed images are used for training and testing, the proposed methods present lower accuracies (TPMW with 87.5%, FPMW with 87.8%) than their counterparts (GLF with 99.7%, MFF with 99.9% and SPAM with 99.7%). This is not a problem, since in practice it is unlikely that only uncompressed images would be available for training and testing.

4. Conclusion

In this paper, we presented a novel approach to forensically detect median blurring traces in digital images. Our technique differs from others in that it perturbs the image by blurring it multiple times with different window sizes (filtering intensities), building a discriminative feature vector for later decision making by using image quality metrics.

We showed the method’s reliability when tested with diverse images in a cross-dataset protocol, which we believe to be closer to a real-world situation. The experiments and results corroborate our initial hypothesis that multiple perturbations with different intensities capture the telltales left behind by median filtering algorithms in a way that is competitive with some existing methods. As future work, we will study additional image quality metrics that can be combined with the proposed feature vector, include more median blurring intensities in the dataset, study the proposed approach under different image compression settings, perform fusion of classifiers, and study the proposed method under the median filtering anti-forensic operation [Fontani and Barni 2012].

References

(2010). CASIA tampered image detection evaluation database. Database available at http://forensics.idealtest.org/.

Bovik, A. (1987). Streaking in median filtered images. IEEE Transactions on Acoustic Speech and Signal Processing, 35(4):493–503.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(27):27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, C., Ni, J., and Huang, J. (to appear in December, 2013). Blind detection of median filtering in digital images: A difference domain based approach. IEEE Transactions on Image Processing, 22(12):4699–4710.

Eskicioglu, A. and Fisher, P. (1995). Image quality measures and their performance. IEEE Transactions on Communications, 43(12):2959–2965.

Fontani, M. and Barni, M. (2012). Hiding traces of median filtering in digital images. In European Signal Processing Conference (EUSIPCO), pages 1239–1243.

Kirchner, M. and Bohme, R. (2008). Hiding traces of resampling in digital images. IEEE Transactions on Information Forensics and Security (TIFS), 3(4):582–592.

Kirchner, M. and Fridrich, J. (2010). On detection of median filtering in digital images. In SPIE Media Forensics and Security II, pages 754110–754110–12.

Popescu, A. and Farid, H. (2004). Statistical tools for digital forensics. In ACM Intl. Workshop on Information Hiding (IH), pages 128–147.

Schaefer, G. and Stich, M. (2004). UCID - an uncompressed colour image database. In Storage and Retrieval Methods and Applications for Multimedia, volume 5307 of Proceedings of SPIE, pages 472–480.

Yuan, H.-D. (2011). Blind forensics of median filtering in digital images. IEEE Transactions on Information Forensics and Security (TIFS), 6(4):1335–1345.


A Software Products Line Solution Based in Components and Aspects for the E-commerce Domain

Raphael Porreca Azzolini¹, Cecília Mary Fischer Rubira¹

1Institute of Computing - Campinas University (UNICAMP) - Campinas - SP - Brazil


Abstract. This work presents the implementation of a Software Products Line for the e-commerce domain, a Web domain in high and constant growth in the information technology market. The implementation follows COSMOS*-VP, a component and aspect model that minimizes architectural scattering and supports the modular modelling of crosscutting features. We expect to study the efficiency of COSMOS*-VP in a real, modern, and very complex case study. Furthermore, this work aims to provide a solution for the development of e-commerce systems, which require Web applications that are similar variants yet customizable according to the needs of each client.

Resumo. Este trabalho apresenta a implementação de uma Linha de Produtos de Software para o domínio de e-commerce, um domínio Web em alto e constante crescimento no mercado de tecnologia da informação. Esta implementação é feita de acordo com o modelo COSMOS*-VP, um modelo de componentes e aspectos responsável por minimizar o espalhamento da arquitetura e apoiar de forma modular a modelagem de características transversais. Espera-se estudar a eficiência do COSMOS*-VP em um caso de estudo real, moderno e bastante complexo. Além disso, este trabalho pretende oferecer uma solução para o desenvolvimento de sistemas de e-commerce, que requerem variantes similares mas customizáveis de acordo com as necessidades de cada cliente.

1. Introduction

This work presents the development of a Software Product Line (SPL) for the e-commerce domain. The domain was chosen because it is a modern case study, given its strong growth in the global market. In Brazil, the sector grew 1400% over the last ten years [Exame.com 2013], and sales in 2012 exceeded R$22 billion [IDGNow! 2013]. It is also a complex case study: it requires integration with other systems, such as shipping and payment services; it demands high security, because it handles customers' personal data and payments; and it demands an efficient user experience [E-Commerce News 2012].

Clements and Northrop [Clements and Northrop 2002] define an SPL as a set of software systems that share common features and have distinct features in order to satisfy a market segment. With such a set and software reuse [Sommerville 2010], it is possible to develop customized products at low cost [Pohl et al. 2005]. Despite its advantages, SPL evolution can be impaired by variability mechanisms that accommodate changes inefficiently, which can have several undesirable consequences for stability [Figueiredo et al. 2008].

The COSMOS*-VP model [Tizzei 2012], an extension of the COSMOS* model [Gayard et al. 2008] that integrates components and aspects, was created to support the evolution of SPL architectures in the presence of variability by modularizing both the architectural variability and the crosscutting concerns of product lines.

In his work, Tizzei [Tizzei 2012] shows that the COSMOS*-VP model improves on the COSMOS* model by increasing the ability of an architecture to evolve. However, his empirical study was based on a single, simple case study. COSMOS*-VP therefore still needs to be evaluated in a more complex case study that represents a real application, which is what the present work provides with the e-commerce domain.

Using the COSMOS*-VP model to implement an e-commerce SPL can improve the development of this family of software systems by reducing development costs and yielding more stable systems, since software reuse provides components that have already been tested, used, and fixed [Sommerville 2010].

This paper is organized in seven sections. Section 2 presents the related work; Section 3 discusses the e-commerce domain; Section 4 discusses the SPL development methodology; Section 5 presents the SPL implementation; Section 6 discusses the expected results; finally, Section 7 presents the conclusions of this paper.

2. Related Work

Laguna and Hernandez [Laguna and Hernandez 2010] present the development of an e-commerce SPL for the .NET platform. Unlike the work presented here, their SPL is not implemented with component and aspect techniques. Another difference is that the SPL of the present work is implemented in the Java programming language.

The evolution of SPLs implemented with aspect-oriented techniques is studied by Figueiredo et al. [Figueiredo et al. 2008], and the evolution of SPLs that integrate components and aspects is studied by Tizzei [Tizzei 2012]. Neither work presents the development of a real and complex application, as this work does with the e-commerce domain.

3. The E-commerce Domain

The e-commerce domain is a typical problem that can be solved with the SPL approach: it comprises a set of similar software systems that share common features, which can be reused, and that also have distinct features that differentiate them from each other. In many cases, the need to customize an e-commerce system leads to the development of a new product without software reuse, because reuse is hindered by inefficient variability mechanisms.

The buying process is the same or very similar in every kind of online store found on the Internet; this process is shown in Figure 1. It is important that all e-commerce systems provide a secure environment based on public and private keys [Diffie and Hellman 1976][Rivest et al. 1978], at least for the operations marked in Figure 1.


Figure 1. The e-commerce buying process flowchart

The payment process is mediated by a payment gateway, which is responsible for the communication between the bank that receives the customer's payment and the store's bank. The e-commerce system must communicate with the payment gateway through a web service. First, the e-commerce system sends the payment information to the payment gateway; second, the payment gateway acknowledges that it has received the information; finally, when the payment is made or cancelled, the payment gateway performs the necessary bank transactions and informs the e-commerce system of the confirmation or cancellation of the order's payment.
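The three-step exchange with the payment gateway can be sketched as follows. This is only an illustration in Python, not the Java code of the SPL: the names `FakePaymentGateway`, `Store`, `submit`, `settle`, and `on_payment_result` are hypothetical, and the gateway is an in-memory stand-in for what would really be a remote web service.

```python
from dataclasses import dataclass, field
from enum import Enum


class PaymentStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


@dataclass
class Order:
    order_id: str
    amount: float
    status: PaymentStatus = PaymentStatus.PENDING


@dataclass
class FakePaymentGateway:
    """Hypothetical in-memory stand-in for the remote payment gateway."""
    received: dict = field(default_factory=dict)

    def submit(self, order_id, amount):
        # Steps 1 and 2: receive the payment information and acknowledge it.
        self.received[order_id] = amount
        return True

    def settle(self, order_id, paid, notify):
        # Step 3: after the bank transactions, report the outcome to the store.
        if order_id not in self.received:
            raise KeyError(f"unknown order {order_id}")
        notify(order_id, PaymentStatus.CONFIRMED if paid else PaymentStatus.CANCELLED)


class Store:
    """Hypothetical e-commerce side of the exchange."""

    def __init__(self, gateway):
        self.gateway = gateway
        self.orders = {}

    def checkout(self, order):
        self.orders[order.order_id] = order
        return self.gateway.submit(order.order_id, order.amount)

    def on_payment_result(self, order_id, status):
        # Callback invoked by the gateway with the confirmation or cancellation.
        self.orders[order_id].status = status


gateway = FakePaymentGateway()
store = Store(gateway)
acknowledged = store.checkout(Order("A-1", 99.90))
gateway.settle("A-1", paid=True, notify=store.on_payment_result)
```

The order stays in the pending state between the acknowledgement and the final notification, which is why the store exposes a callback instead of waiting for a return value.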

4. The SPL Development Methodology

The first step of the SPL development was a survey of e-commerce features, carried out by analyzing four online stores: a Brazilian store that sells many kinds of products; a Brazilian store specialized in women's fashion; a U.S. store that sells many kinds of products; and a U.S. store specialized in supplements and health products. One of the goals of this work is to develop these four e-commerce systems using the SPL that results from this study.

This survey made it possible to identify the common and variable features of e-commerce systems according to the type of product sold. It also made it possible to identify the common and variable features according to the country of the store.

A feature model was built with mandatory, optional, alternative, and OR features [Czarnecki and Eisenecker 2000]. The main features defined were: product, shopping basket, order, products search, customer, dependability, and marketing. All of these features have subfeatures, and all are mandatory except marketing, which is optional. Figure 2 shows the initial SPL feature model.
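As a rough illustration of how such a feature model constrains the products of the line, the sketch below checks a product configuration against the top-level features listed above. It is a simplification under stated assumptions: only the mandatory/optional distinction is modelled (alternative and OR groups and all subfeatures are left out), and the function name `validate_configuration` is hypothetical.

```python
# Top-level features as listed in the text; subfeatures are omitted in this sketch.
MANDATORY = {"product", "shopping basket", "order", "products search",
             "customer", "dependability"}
OPTIONAL = {"marketing"}


def validate_configuration(selected):
    """Return a list of problems; an empty list means the product is valid."""
    selected = set(selected)
    problems = []
    for feature in sorted(MANDATORY - selected):
        problems.append(f"missing mandatory feature: {feature}")
    for feature in sorted(selected - MANDATORY - OPTIONAL):
        problems.append(f"unknown feature: {feature}")
    return problems


basic_store = MANDATORY
store_with_marketing = MANDATORY | {"marketing"}
# "auctions" is a made-up feature used only to trigger the second check.
broken_store = (MANDATORY - {"order"}) | {"auctions"}
```

A product is a valid member of the line only if it selects every mandatory feature and nothing outside the model; optional features such as marketing may be present or absent.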

The initial feature model was transformed according to the Aspect-oriented Feature-Architecture Mapping (AO-FArM) method [Tizzei 2012]. Four steps transform the feature model into an SPL architecture: (i) remove features not related to the architecture and resolve the non-functional features; (ii) transform based on the architectural requirements; (iii) transform based on the interaction relations among features; (iv) transform based on the hierarchical relationships. After these steps, the dependability feature was replaced by architectural features (risk analysis, password encryption, transport layer security, session control, load balance), and new features were added to the model, such as logging, persistence, and exception handling.

Figure 2. The initial SPL feature model

This feature model generated an initial architectural model, in which the features were transformed into components, classes, or methods. Five AO-FArM refinement steps lead to the SPL architecture: (i) identify the base and crosscutting interfaces; (ii) specify the operations of the base and crosscutting interfaces; (iii) assess legacy components; (iv) implement/refactor the base and crosscutting components; (v) specify and implement the base and crosscutting connectors. Steps (iii) and (iv) were not executed because there were no legacy components.
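The separation between base components, crosscutting components, and the connectors that bind them can be approximated in a few lines. The sketch below uses a Python decorator in place of an AspectJ aspect, so it is an analogy rather than the COSMOS*-VP mechanism itself; `OrderComponent`, `logged`, and `connect` are hypothetical names.

```python
import functools

log = []


def logged(operation):
    """Crosscutting behaviour: record every call without touching the base code."""
    @functools.wraps(operation)
    def wrapper(*args, **kwargs):
        log.append(f"enter {operation.__name__}")
        try:
            return operation(*args, **kwargs)
        finally:
            log.append(f"exit {operation.__name__}")
    return wrapper


class OrderComponent:
    """Base component: it knows nothing about logging."""

    def place_order(self, product, quantity):
        return {"product": product, "quantity": quantity}


def connect(component, aspect, operation_names):
    """Connector: binds the crosscutting behaviour to selected base operations."""
    for name in operation_names:
        setattr(component, name, aspect(getattr(component, name)))
    return component


orders = connect(OrderComponent(), logged, ["place_order"])
result = orders.place_order("book", 2)
```

Because the binding lives in the connector, the logging concern can be added to or removed from a product without editing the base component, which is the property the crosscutting interfaces are meant to preserve.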

After the execution of AO-FArM, the implementation of the architecture's components began, as described in Section 5.

5. The SPL Implementation

This work is currently in the implementation step. The base and crosscutting components defined by the COSMOS*-VP model are being developed.

The SPL is being developed with the Java Development Kit 7 and AspectJ in the Eclipse IDE, using Apache Maven as the build tool. Each component of the architecture is a Maven module, and each connector is developed in a package of the component whose required interface the connector provides.

Every component has a connection with the data type layer, which is composed of the data type interfaces and their factory interfaces. Most of these interfaces are implemented in the persistence component, but they can also be implemented by other components; for example, the PaymentMethod interface is implemented by the specific payment components, such as the credit card component or the debit component.
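The pairing of a data type interface with its factory interface can be pictured as below. Only the name `PaymentMethod` comes from the text; the factory interface, the concrete classes, and the `describe`, `create`, and `checkout` operations are hypothetical, and the sketch is written in Python rather than the Java used by the SPL.

```python
from abc import ABC, abstractmethod


class PaymentMethod(ABC):
    """Data type interface of the data type layer."""

    @abstractmethod
    def describe(self):
        """Return a human-readable description of the payment method."""


class PaymentMethodFactory(ABC):
    """Factory interface paired with the data type interface."""

    @abstractmethod
    def create(self, holder):
        """Build a PaymentMethod for the given holder."""


class CreditCard(PaymentMethod):
    def __init__(self, holder):
        self.holder = holder

    def describe(self):
        return f"credit card of {self.holder}"


class CreditCardFactory(PaymentMethodFactory):
    def create(self, holder):
        return CreditCard(holder)


class DebitCard(PaymentMethod):
    def __init__(self, holder):
        self.holder = holder

    def describe(self):
        return f"debit card of {self.holder}"


class DebitCardFactory(PaymentMethodFactory):
    def create(self, holder):
        return DebitCard(holder)


def checkout(factory, holder):
    # Client code depends only on the two interfaces, never on a concrete component.
    return factory.create(holder).describe()
```

Swapping the credit card component for the debit component then amounts to passing a different factory, which is what lets specific payment components implement the interface outside the persistence component.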

The interface methods of the components are being tested with the JUnit framework. The persistence tests use the framework with HSQLDB for database interaction, and the tests of the other components use Mockito to mock external methods.

The view layer has not been implemented yet, but it will be built with JavaServer Faces (JSF) and tested with Selenium.

6. Expected Results

Two analyses will be made: whether COSMOS*-VP evolves better than COSMOS* in the e-commerce domain, and whether the implemented SPL allows customizations that are not possible with the e-commerce frameworks known in the market.

The first set of results will be collected with the same analyses made by Figueiredo et al. [Figueiredo et al. 2008]: change impact [Yau and Collofello 1985], modularity [Sant’Anna et al. 2003], and feature dependency [Greenwood et al. 2007]. The COSMOS*-VP implementation is expected to yield better results than the COSMOS* implementation.

The second set of results will be collected by comparing the features of the SPL of this work with the e-commerce frameworks available in the market. The SPL is expected to prove more flexible than these frameworks in the customization of e-commerce products.

7. Conclusion

This paper presented a Master's degree work that intends to develop an SPL for e-commerce systems through the integration of components and aspects. This implementation is expected to validate the model proposed by Tizzei [Tizzei 2012].

The implemented code is being made available to the community, to continue supporting the development of e-commerce systems and to allow the continuous evolution of the SPL.

This work leaves other interesting studies open for future work. Web services for payment and shipping, security protocols for personal data storage and payment transactions, and techniques for better availability and stability are just some examples.

References

Clements, P. and Northrop, L. (2002). Software product lines: practices and patterns, volume 59. Addison-Wesley Reading.

Czarnecki, K. and Eisenecker, U. W. (2000). Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA.

Diffie, W. and Hellman, M. (1976). New directions in cryptography. Information Theory, IEEE Transactions on, 22(6):644–654.

E-Commerce News (2012). 38% abandonam uma loja online após 4 segundos de espera. Available at <http://ecommercenews.com.br/noticias/pesquisas-noticias/38-abandonam-uma-loja-online-apos-4-segundos-de-espera>. Accessed: 30 Sep. 2013.

Exame.com (2013). E-commerce pode chegar a R$ 45 bi em 2016. Available at <http://exame.abril.com.br/tecnologia/noticias/e-commerce-pode-chegar-a-r-45-bi-em-2016>. Accessed: 30 Sep. 2013.

Figueiredo, E., Cacho, N., Sant’Anna, C., Monteiro, M., Kulesza, U., Garcia, A., Soares, S., Ferrari, F., Khan, S., Castor Filho, F., and Dantas, F. (2008). Evolving software product lines with aspects: an empirical study on design stability. In Proceedings of the 30th International Conference on Software Engineering, ICSE ’08, pages 261–270, New York, NY, USA. ACM.

Gayard, L. A., Rubira, C. M. F., and de Castro Guerra, P. A. (2008). COSMOS*: a COmponent System MOdel for Software Architectures. Technical Report IC-08-04, Institute of Computing, University of Campinas.

Greenwood, P., Bartolomei, T., Figueiredo, E., Dosea, M., Garcia, A., Cacho, N., Sant’Anna, C., Soares, S., Borba, P., Kulesza, U., and Rashid, A. (2007). On the impact of aspectual decompositions on design stability: An empirical study. In Ernst, E., editor, ECOOP 2007 - Object-Oriented Programming, volume 4609 of Lecture Notes in Computer Science, pages 176–200. Springer Berlin Heidelberg.

IDGNow! (2013). E-commerce no Brasil cresce 20% e fatura R$ 22,5 bilhões em 2012. Available at <http://idgnow.uol.com.br/internet/2013/03/20/e-commerce-no-brasil-cresce-20-e-fatura-r-22-5-bilhoes-em-2012/>. Accessed: 30 Sep. 2013.

Laguna, M. A. and Hernandez, C. (2010). A software product line approach for e-commerce systems. 2012 IEEE Ninth International Conference on e-Business Engineering, 0:230–235.

Pohl, K., Böckle, G., and Linden, F. J. v. d. (2005). Software Product Line Engineering: Foundations, Principles and Techniques. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Rivest, R. L., Shamir, A., and Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21(2):120–126.

Sant’Anna, C., Garcia, A., Chavez, C., Lucena, C., and Von Staa, A. (2003). On the reuse and maintenance of aspect-oriented software: An assessment framework. In Proceedings of Brazilian Symposium on Software Engineering, pages 19–34.

Sommerville, I. (2010). Software Engineering. Addison-Wesley, Harlow, England, 9th edition.

Tizzei, L. P. (2012). Evolução de Arquiteturas de Linhas de Produtos baseadas em Componentes e Aspectos. PhD thesis, Instituto de Computação, Universidade Estadual de Campinas.

Yau, S. and Collofello, J. (1985). Design stability measures for software maintenance. Software Engineering, IEEE Transactions on, SE-11(9):849–856.


An Architecture for Dynamic Self-Adaptation in Workflows

Sheila K. Venero Ferro1, Cecilia Mary Fischer Rubira1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)
Av. Albert Einstein, 1251 Cidade Universitária – Campinas (SP) - Brasil

{ra144653,cmrubira}@ic.unicamp.br

Abstract. As technology reaches high levels of complexity, there is an inevitable need to expand the dynamic behavior of the computational solutions used in organizations. However, one of the drawbacks of workflow platforms is that they cannot dynamically adapt their processes to their environment or context. The aim of this research is to develop an architecture for workflow management systems that is flexible enough to adapt workflows dynamically at runtime. To validate our solution, a case study on nursing processes will be conducted to show the practical applicability of the proposed adaptive architecture.

Resumo. As technology reaches higher levels of complexity, there is a need to expand the dynamic behavior of the computational solutions in organizations. However, one of the biggest problems of current workflow systems is that they cannot dynamically adapt their processes according to their environment or context. The objective of this research is to develop an architecture for workflow management systems that provides enough flexibility to dynamically adapt workflows at execution time. In order to validate the solution, a case study on the nursing process will be carried out to show the practical applicability of the proposed architecture.

1. Introduction

Workflow Management Systems (WFMSs) provide tools to coordinate business processes, allowing organizations to model and control their processes [1]. In addition, some workflow management systems make it possible to measure and analyze process execution so that users can improve their processes. Most workflow systems also integrate other IT applications and tools [2].

These technologies partially or totally automate business processes. Computer programs perform tasks such as creating, processing, managing, and providing information, enforcing rules and policies that were previously implemented by humans [3]. Workflow systems have become the major technology for automating business processes in various domains.

There are several commercial workflow management systems that deal not only with business processes but also with scientific processes. In general, the systems designed to support business processes handle processes that consist of a number of activities, which can be executed automatically, manually, or by a combination of both. They also take care of the distribution and assignment of work items to employees.

On the other hand, the systems that support scientific applications often work with a large and complex set of computations and data analyses. These computations might comprise thousands of steps in distributed environments. In addition to automation, these systems provide information for reproducibility, result derivation, and result sharing among collaborators [4].

Traditional Workflow Management Systems usually work with well-structured processes, typically made of predictable and repetitive activities [5]. Modern processes, however, are often required to be flexible and adaptive in order to reflect foreseeable or even unforeseeable changes in the environment. WFMSs are limited in terms of adaptability: they cannot dynamically adapt business processes at runtime, a capability that is highly necessary in dynamic and continuously changing environments.

In particular, this research project proposes a solution that provides adaptability in workflow management systems, so that they can be changed more easily to suit new conditions imposed by changes in the environment. With this drawback solved, they would be able to adapt workflow execution to cope with unexpected situations and conditions.

2. Literature Review

In order to find consistent bibliographic information and understand the current state of the art of the topic of interest, we chose to carry out a systematic review, which is a method for conducting literature reviews.

Systematic reviews consist of a set of scientific methods that seek to limit systematic error in order to better identify, analyze, and synthesize all relevant studies on a given subject or particular issue [6].

The methodology used in this paper to perform the systematic review is based on the strategies of the PRISMA framework (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). The main objective of PRISMA is to help researchers improve the reporting of systematic reviews and meta-analyses. The process has four main stages: identification, screening, eligibility, and inclusion [7].

In this research, the main topic identified was "dynamic adaptation in workflows". In this paper, we define "dynamic adaptation" as the ability of workflow processes to change or react according to their environment or context. To find relevant information for a thorough understanding of the research topic, the keywords "workflows" or "business process" were combined, using Boolean operators, with the related concepts of "adaptation" and "dynamic", in addition to "workflow system".

These keywords were searched in major Computer Science databases: IEEE, ACM, and Science Direct. To refine the search, it was predetermined that the main concepts, such as "workflow" or "business process", should appear in the title field and that the concepts related to "dynamic adaptation" should appear in the abstract. As PRISMA suggests, Tables 1, 2 and 3 show the detailed queries for the IEEE database.


Table 1. Search query for “Workflow” as a main concept.

Main concept: Workflow

(("Document Title":workflow) AND ("Abstract":adapt* OR "Abstract":reschedul* OR "Abstract":reconfigur* OR "Abstract":"self-adaptive workflow" OR "Abstract":"dynamic adaptation" OR "Abstract":"context-aware" OR "Abstract":"workflow systems" OR "Abstract":"agent" OR "Abstract":"workflow architecture" OR "Abstract":"auto adaptation" OR "Abstract":"auto management" OR "Abstract":"readjusting workflows" OR "Abstract":"knowledge based process management" OR "Abstract":"workflow management software"))

Results: 712 papers

Table 2. Search query for “Business Process” as a main concept.

Main concept: Business Process

(("Document Title":"business process") AND ("Abstract":adapt* OR "Abstract":reschedul* OR "Abstract":reconfigur* OR "Abstract":"self-adaptive business process" OR "Abstract":"dynamic adaptation" OR "Abstract":"context-aware" OR "Abstract":"business process systems" OR "Abstract":"agent" OR "Abstract":"business process architecture" OR "Abstract":"auto adaptation" OR "Abstract":"auto management" OR "Abstract":"readjusting business process" OR "Abstract":"knowledge based process management" OR "Abstract":"business process management software"))

Results: 112 papers

Table 3. Complete query for the research.

Complete search

((("Document Title":workflow) OR ("Document Title":"business process")) AND ("Abstract":adapt* OR "Abstract":reschedul* OR "Abstract":reconfigur* OR "Abstract":"self-adaptive workflow" OR "Abstract":"self-adaptive business process" OR "Abstract":"dynamic adaptation" OR "Abstract":"context-aware" OR "Abstract":"workflow systems" OR "Abstract":"business process systems" OR "Abstract":"agent" OR "Abstract":"workflow architecture" OR "Abstract":"business process architecture" OR "Abstract":"auto adaptation" OR "Abstract":"auto management" OR "Abstract":"readjusting workflows" OR "Abstract":"readjusting business process" OR "Abstract":"knowledge based process management" OR "Abstract":"workflow management software" OR "Abstract":"business process management software"))

Results: 824 papers
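Queries of this shape are tedious to write by hand, so a small helper can assemble them from keyword lists. The sketch below is an assumption about how one might generate the field syntax shown in Tables 1 to 3; the functions `field_term` and `build_query` are hypothetical, and the quoting rule (quote phrases and hyphenated terms, leave single words and wildcard stems bare) is a simplification of what the tables do.

```python
def field_term(field, term):
    """Format one field:term pair; phrases and hyphenated terms are quoted."""
    needs_quotes = " " in term or "-" in term
    value = f'"{term}"' if needs_quotes else term
    return f'"{field}":{value}'


def build_query(title_terms, abstract_terms):
    """Combine the terms as (title terms) AND (abstract terms)."""
    titles = " OR ".join(f"({field_term('Document Title', t)})" for t in title_terms)
    abstracts = " OR ".join(field_term("Abstract", t) for t in abstract_terms)
    if len(title_terms) > 1:
        # Several title alternatives need their own group before the AND.
        titles = f"({titles})"
    return f"({titles} AND ({abstracts}))"


query = build_query(
    ["workflow", "business process"],
    ["adapt*", "dynamic adaptation", "context-aware"],
)
```

Generating the string also makes it easy to check that the parentheses are balanced before the query is pasted into a database search form.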


The complete query returned 1229 papers across the three databases. Restricting the publication date to the period from January 2003 to May 2014 left 1040 papers. After elimination by title, 349 papers remained. Then 138 articles were eliminated after reading their abstracts, leaving 211 papers. In the next phase, elimination by thematic focus, 171 papers were removed. Thus, applying the four phases of the PRISMA framework, the search resulted in 40 primary papers for this literature review. Figure 1 below summarizes the four phases of PRISMA [7].

Figure 1. PRISMA's Diagram
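The counts reported for the selection funnel can be cross-checked with simple arithmetic. The sketch below only restates the numbers given in the text; the variable names are hypothetical, and the two derived removal counts (by date and by title) are computed here rather than reported by the authors.

```python
identified = 1229                # complete query over IEEE, ACM and Science Direct
after_date_filter = 1040         # January 2003 to May 2014
after_title_screening = 349
removed_by_abstract = 138
removed_by_thematic_focus = 171

# Counts stated in the text, recomputed from the removals.
after_abstract_screening = after_title_screening - removed_by_abstract
included = after_abstract_screening - removed_by_thematic_focus

# Removal counts implied by the text but not stated there.
removed_by_date = identified - after_date_filter
removed_by_title = after_date_filter - after_title_screening
```

The recomputed values agree with the 211 papers left after reading the abstracts and with the 40 primary papers finally included.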

The literature review shows that the subject has gained interest over the years, but many gaps remain to be studied. At the beginning of the century, approaches dealt with process adaptation by building models to handle unexpected changes. Others also use models but add ECA (Event-Condition-Action) rules [8].
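An ECA rule pairs an event with a condition to test and an action to run when the condition holds. The following minimal sketch shows the idea; it is not taken from any of the surveyed approaches, and `EcaRule`, `RuleEngine`, and the retry example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EcaRule:
    event: str                          # Event: what happened
    condition: Callable[[dict], bool]   # Condition: should the rule fire?
    action: Callable[[dict], None]      # Action: how the process is adapted


class RuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def signal(self, event, context):
        """Fire every rule registered for the event whose condition holds."""
        fired = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(context):
                rule.action(context)
                fired += 1
        return fired


engine = RuleEngine()
engine.register(EcaRule(
    event="task_failed",
    condition=lambda ctx: ctx["attempts_left"] > 0,
    action=lambda ctx: ctx["pending"].append("retry " + ctx["task"]),
))

context = {"task": "send invoice", "attempts_left": 2, "pending": []}
fired = engine.signal("task_failed", context)
```

Rules of this kind cover changes that were anticipated when the rules were written; situations for which no rule exists are left unhandled, which is one of the gaps that motivates self-adaptation.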

In more recent years, workflow systems have gained a lot of attention in the scientific domain, and many studies treat adaptation as a performance issue in software applications, covering topics such as resource allocation and service discovery. This research treats adaptation not at the performance level but at the logical level.

By comparison, more approaches treat adaptation at the performance level than at the logical level. Although there are architectures that deal with dynamic adaptation, few of them treat self-adaptation according to the environment or context. Thus, this research will focus on dynamic self-adaptation of workflows as a logical problem.


3. Objectives

3.1. General Objective

The overall aim of this proposal is to develop an architecture for dynamic self-adaptation in workflows that provides sufficient flexibility for the dynamic adaptation of processes. This research will focus on the adaptive workflow support layer.

3.2. Specific Objectives

The following specific objectives are defined to meet the general aim:

• Understand the major approaches that deal with dynamic adaptation in business processes/workflows.

• Extend the characteristics of architectures with facilities for dynamic adaptation according to the context.

• Implement the solution using the most appropriate technologies.

• Validate the solution by means of a case study.

• Analyze the solution and compare with other existing solutions.

4. Case Study

A medical procedure can be seen as a workflow, since it is a set of activities arranged in a total or partial order according to their results. A medical treatment is a typical example of a flexible process in which unexpected changes can happen. For example, when a patient goes to a hospital, he must register and then have a first visit with a doctor. The doctor makes a preliminary diagnosis and sends him for some examinations. According to the results of those examinations, the doctor makes a final diagnosis and writes a prescription. During this process, factors such as active health problems, active medications, allergies, adverse reactions, recent laboratory results, and vital signs must be monitored throughout the whole process or, computationally speaking, at runtime. These situations are unforeseeable and unique to each patient.

Workflow technology can deliver the right information to the right person at the right time, reflecting all the changes. If this technology could also adapt processes dynamically according to the environment or context, it would help in the decision-making process and would improve the efficiency with which healthcare professionals respond to unexpected changes.

There is extensive research on medical processes in which sets of activities have been defined. Clinicians have developed and published a very large number of evidence-based medical procedures to disseminate best practices. To validate the proposed architecture, a case study will be carried out in the healthcare domain.

In principle, the architecture could be applied to nursing processes. Many guidelines define the activities that should be performed in a nursing workflow. The nursing process guides professionals in their activities and helps them in the decision-making process.

PROCENF is a system developed for the University of São Paulo (USP) that supports nursing processes. The system assists and guides nurses during their activities. It follows a systematic assistance process defined by the nurse Wanda Horta in the 1970s. PROCENF uses internationally recognized standards such as NANDA, NIC, and NOC. In the last year, it has been deployed in hospitals of both USP and UNICAMP.

5. Progress of the Project

This research is in the final stage of the literature review. The next step is to design the architecture that will provide sufficient flexibility to adapt workflow behavior according to changes in its operating environment.

The architecture will use an approach based on the MAPE-K model (Monitor, Analyze, Plan, Execute, and Knowledge). It will monitor the executing workflow and the environment, analyze and reason about the new conditions, plan a new workflow according to the adaptation rules or conditions, execute the plan by reconfiguring the workflow and, finally, generate or update knowledge in order to produce new rules.
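Since the architecture has not been designed yet, the following is only a sketch of what one pass of a MAPE-K loop over a workflow could look like. Every name in it (`MapeK`, `run_once`, the rule format, the allergy example) is an assumption made for illustration; in particular, the knowledge update merely records what was observed and does not derive new rules.

```python
class MapeK:
    """One pass of Monitor, Analyze, Plan, Execute over shared Knowledge."""

    def __init__(self, adaptation_rules):
        # Knowledge: condition -> (step to replace, replacement step).
        self.rules = dict(adaptation_rules)
        self.seen_conditions = []

    def monitor(self, environment):
        # Collect the conditions currently observed in the environment.
        return sorted(name for name, active in environment.items() if active)

    def analyze(self, observed):
        # Keep the conditions for which an adaptation rule is known.
        return [name for name in observed if name in self.rules]

    def plan(self, workflow, conditions):
        planned = list(workflow)
        for name in conditions:
            old_step, new_step = self.rules[name]
            planned = [new_step if step == old_step else step for step in planned]
        return planned

    def execute(self, workflow, planned):
        # Reconfigure the running workflow in place.
        workflow[:] = planned

    def run_once(self, workflow, environment):
        observed = self.monitor(environment)
        conditions = self.analyze(observed)
        self.execute(workflow, self.plan(workflow, conditions))
        # Knowledge update: remember what was observed (rule learning is out of scope).
        self.seen_conditions.extend(observed)
        return workflow


workflow = ["register", "first visit", "examinations", "final diagnosis", "prescription"]
loop = MapeK({"allergy reported": ("prescription", "prescription with allergy check")})
loop.run_once(workflow, {"allergy reported": True, "fever": False})
```

Observed conditions without a matching rule leave the workflow untouched but are still recorded, which is where a fuller implementation would hook in the generation of new rules.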

References

[1] A. Sheth and K. J. Kochut, “Workflow applications to research agenda: Scalable and dynamic work coordination and collaboration systems,” in NATO-ASI Advances in Workflow Management Systems and Interoperability, 1997.

[2] D. Hollingsworth, “Workflow Management Coalition - The Workflow Reference Model,” Technical report, Workflow Management Coalition, Jan. 1995.

[3] R. Medina-Mora, T. Winograd, and R. Flores, “Action workflow as the enterprise integration technology,” in Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, June 1993.

[4] Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox, D. Gannon, C. Goble, M. Livny, L. Moreau, and J. Myers, “Examining the challenges of scientific workflows,” Computer, vol. 40, no. 12, pp. 24–32, 2007.

[5] ShuiGuang Deng, Zhen Yu, ZhaoHui Wu, and LiCan Huang, “Enhancement of workflow flexibility by composing activities at run-time”, in Proceedings of the 2004 ACM symposium on Applied computing (SAC '04). ACM, New York, NY, USA, 2004.

[6] Petticrew, Mark and Roberts, Helen, “Systematic reviews in the social sciences: A practical guide”, Oxford, United Kingdom: Blackwell Publishing, 2006.

[7] Moher, David; Liberati, Alessandro; Tetzlaff, Jennifer et al., “Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement.”, in PLoS Medicine, v. 339, n. 7, p. 6, 2009.

[8] Klaus R. Dittrich, Stella Gatziu, “The Active Database Management System Manifesto: A Rule base of ADBMS Features”, in ACM SIGMOD 1996.


Adjacent Strong Edge Colouring of Split Graphs

Aloísio de Menezes Vilas-Boas1, Célia Picinin de Mello1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)
Caixa Postal 6176 – 13.084-971 – Campinas – SP – Brasil


Abstract. An adjacent strong edge colouring of a simple graph G is an edge colouring of G such that, for any two adjacent vertices u, v, the set of colours assigned to the edges incident to u differs from the set of colours assigned to the edges incident to v. The adjacent strong chromatic index of G is the smallest number of colours for which G admits an adjacent strong edge colouring. In this note, we determine the adjacent strong chromatic index of some split graphs.

Resumo. An adjacent strong edge colouring of a simple graph G is an edge colouring in which, for any adjacent vertices u, v of G, the set of colours of the edges incident to u is different from the set of colours of the edges incident to v. The adjacent strong chromatic index of G is the smallest number of colours for which G admits an adjacent strong edge colouring. In this work we determine the adjacent strong chromatic index of some families of split graphs.

1. Introduction

In this work, $G = (V(G), E(G))$ denotes a simple graph and $C$ a set of colours. An assignment $\pi : E(G) \to C$ is an edge colouring of $G$ if adjacent edges are assigned distinct colours. The label $L_\pi(v)$ of a vertex $v$ of $G$ is the set of colours assigned by $\pi$ to the edges incident to $v$. If $L_\pi(u) \neq L_\pi(v)$ for every edge $uv$ of $G$, then $\pi$ is an adjacent strong edge colouring of $G$. The minimum number of colours for which $G$ admits an edge colouring is called the chromatic index of $G$ and is denoted by $\chi'(G)$. When the colouring is required to be adjacent strong, this minimum is called the adjacent strong chromatic index and is denoted by $\chi'_a(G)$.

The adjacent strong edge colouring was introduced in 2002 by Zhang et al. [9] and subsequently investigated by several researchers [1, 4, 10]. Figure 1 shows a graph $G$ and an adjacent strong edge colouring $\pi$ of $G$. The set associated with each vertex $v$ of $G$ is $\overline{L_\pi}(v) = C \setminus L_\pi(v)$.

Figure 1. A graph $G$ with $\chi'_a(G) = 3$.
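The definitions above translate directly into a checker. The sketch below is an illustration with hypothetical function names; the 6-cycle and its colourings are chosen here as an example and are not claimed to be the graph of Figure 1.

```python
def labels(edges, colouring):
    """L_pi(v): the set of colours on the edges incident to each vertex v."""
    result = {}
    for edge in edges:
        for vertex in edge:
            result.setdefault(vertex, set()).add(colouring[edge])
    return result


def is_edge_colouring(edges, colouring):
    # Adjacent edges (edges sharing an endpoint) must receive distinct colours.
    seen = set()
    for (u, v) in edges:
        c = colouring[(u, v)]
        if (u, c) in seen or (v, c) in seen:
            return False
        seen.update({(u, c), (v, c)})
    return True


def is_adjacent_strong(edges, colouring):
    if not is_edge_colouring(edges, colouring):
        return False
    label = labels(edges, colouring)
    # Adjacent vertices must have different labels.
    return all(label[u] != label[v] for (u, v) in edges)


# A 6-cycle coloured 1, 2, 3, 1, 2, 3 around the cycle.
c6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
good = dict(zip(c6, [1, 2, 3, 1, 2, 3]))
# Alternating two colours is an edge colouring, but every label is {1, 2}.
bad = dict(zip(c6, [1, 2, 1, 2, 1, 2]))
```

The second colouring shows why the adjacent strong condition is strictly stronger than the usual one: it uses fewer colours and is a valid edge colouring, yet all vertices receive the same label.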

Since every adjacent strong edge colouring is an edge colouring, we have $\chi'_a(G) \geq \chi'(G)$. Hence, by Vizing [8], $\chi'_a(G) \geq \Delta(G)$. In the seminal paper, the authors stated that if the degrees of any two adjacent vertices of $G$ are distinct, then $\chi'_a(G) = \Delta(G)$, and proved that if $G$ has at least two adjacent vertices of maximum degree, then $\chi'_a(G) \geq \Delta(G) + 1$. In addition, they formulated the following conjecture.

Conjecture 1 If $G$ is a simple connected graph with at least three vertices and not isomorphic to $C_5$, then $\chi'_a(G) \leq \Delta(G) + 2$.

Note that $\chi'_a(C_5) = 5 = \Delta(C_5) + 3$. Conjecture 1 remains open, but it has been verified for several classes of graphs [1, 4, 9, 10].
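The value for $C_5$ can be confirmed by exhaustive search on such a small graph. The brute-force sketch below is an illustration only (the function names are hypothetical, and the search is exponential in the number of edges, so it is usable only for tiny graphs).

```python
from itertools import product


def adjacent_strong_index(edges, max_colours):
    """Smallest k up to max_colours admitting an adjacent strong edge colouring, else None."""
    for k in range(1, max_colours + 1):
        for assignment in product(range(k), repeat=len(edges)):
            incident = {}
            proper = True
            for (u, v), colour in zip(edges, assignment):
                for vertex in (u, v):
                    colours = incident.setdefault(vertex, set())
                    if colour in colours:
                        # Two edges at the same vertex share a colour.
                        proper = False
                    colours.add(colour)
            if proper and all(incident[u] != incident[v] for u, v in edges):
                return k
    return None


def cycle(n):
    return [(i, (i + 1) % n) for i in range(n)]


index_c5 = adjacent_strong_index(cycle(5), 5)
index_c6 = adjacent_strong_index(cycle(6), 5)
```

In a cycle the labels of two adjacent vertices differ exactly when the two edges on either side of their common edge have different colours; in $C_5$ every pair of edges is either adjacent or in that position, so all five edges need distinct colours.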

Em 2009, Zhang et al. [10] mostraram que o ındice cromatico semiforte e onumero cromatico total de grafos regulares estao fortemente relacionados.

Uma coloracao total de G, τ : E(G) ∪ V (G) → C, e uma atribuicao de coresaos elementos (vertices e arestas) de G, onde para cada par de elementos p e q adjacentesou incidentes, τ(p) 6= τ(q). O numero cromatico total de G, χT (G), e menor numerode cores para o qual G admite uma coloracao total. Note que para um grafo G com graumaximo ∆(G), χT (G) ≥ ∆(G) + 1. A Conjectura da Coloracao Total afirma que paratodo grafo simples G, χT (G) ≤ ∆(G) + 2 [2].

A graph G is regular if all vertices of G have the same degree. Observe that if G is a regular graph, then χ′a(G) ≥ ∆(G) + 1. Zhang et al. proved that if G is a connected regular graph with at least three vertices, then χ′a(G) = ∆(G) + 1 if and only if χT(G) = ∆(G) + 1. Hence, if G is a regular graph such that χT(G) ≥ ∆(G) + 2, then χ′a(G) ≥ ∆(G) + 2. Therefore, to prove that χ′a(G) = ∆(G) + 2 in this case, it suffices to exhibit an adjacent strong edge coloring with ∆(G) + 2 colors. In this way, the adjacent strong chromatic index has been determined for several classes of regular graphs, such as complete graphs, cycles, regular complete bipartite graphs, hypercubes, and powers of cycles.

In Section 2 we present the results we obtained on adjacent strong edge coloring of split graphs, a class that contains non-regular graphs.

2. Split Graphs

A graph is split if its vertex set admits a partition into a clique Q and an independent set S. Split graphs have been widely studied since they were defined in 1977 by Foldes and Hammer [6]. Several combinatorial problems known to be hard have polynomial-time solutions on this class [5, 6, 7]. However, the edge coloring and total coloring problems remain unsolved for split graphs.

Complete graphs are the only regular split graphs. Let G be a complete graph with at least three vertices. In this case, χ′a(G) = χT(G) = ∆(G) + 1 if |V(G)| is odd, and χ′a(G) = χT(G) = ∆(G) + 2 otherwise.

In this section, we denote a split graph G with partition (Q, S) by G = [Q, S]. Note that if G is not isomorphic to a complete graph, we may take Q to be maximal.

Observation 1 Let G = [Q, S] be a split graph not isomorphic to a complete graph, and let π be an edge coloring of G. To prove that π is an adjacent strong edge coloring, it suffices to show that Lπ(ui) ≠ Lπ(uj) for all distinct ui, uj ∈ Q.

Proof: Let G = [Q, S] be a split graph not isomorphic to a complete graph, and suppose Q is maximal. Let u ∈ Q and v ∈ S. Then dG(u) = |Q| − 1 + k, where k, 0 ≤ k ≤ |S|, is the number of vertices of S adjacent to u and dG(u) is the degree of u in G. Hence dG(u) ≥ |Q| − 1. Since Q is maximal, dG(v) ≤ |Q| − 1.

We have dG(u) = dG(v) only if both vertices have degree |Q| − 1. In this case, u is not adjacent to any vertex of S. Therefore, |Lπ(u)| ≠ |Lπ(v)| for adjacent u and v with u ∈ Q and v ∈ S. We conclude that π is an adjacent strong edge coloring if and only if the labels of the vertices of Q are pairwise distinct.

Theorem 1 Let G = [Q, S] be a split graph with dG(u) = ∆(G) for every u ∈ Q. If χT(G) = ∆(G) + 1, then χ′a(G) = ∆(G) + 1.

Proof: If G ≅ K|V(G)|, then χ′a(G) = χT(G). Let G ≇ K|V(G)| with Q maximal. Then |Q| ≥ 2 and G has at least two adjacent vertices of degree ∆(G). Therefore, χ′a(G) ≥ ∆(G) + 1.

Let τ be a total coloring of G with ∆(G) + 1 colors and let π be the restriction of τ to the edges of G. By hypothesis, dG(u) = ∆(G) for every u ∈ Q. Hence Lπ(u) = {τ(u)}. Since τ is a total coloring of G, we have Lπ(u) ≠ Lπ(v) for distinct u, v ∈ Q. By Observation 1, π is an adjacent strong edge coloring of G with ∆(G) + 1 colors, and therefore χ′a(G) = ∆(G) + 1.
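The key step of the proof, Lπ(u) = {τ(u)}, can be observed on a tiny example. The sketch below assumes that Lπ(u) denotes the set of colors missing at u (the colors of the palette not used on edges incident to u), which is what that step suggests. The graph, a path s1 - a - b - s2 viewed as a split graph with Q = {a, b} and S = {s1, s2}, and the particular coloring are chosen for illustration only.

```python
def missing_colors(vertices, edges, pi, palette):
    """Colors of the palette not used on the edges incident to each vertex."""
    used = {v: set() for v in vertices}
    for u, v in edges:
        c = pi[frozenset((u, v))]
        used[u].add(c)
        used[v].add(c)
    return {v: set(palette) - used[v] for v in vertices}

vertices = ["s1", "a", "b", "s2"]
edges = [("s1", "a"), ("a", "b"), ("b", "s2")]

# A total coloring with Delta + 1 = 3 colors: vertex part and edge part.
tau_vertex = {"s1": 2, "a": 0, "b": 1, "s2": 2}
pi = {frozenset(("s1", "a")): 1,
      frozenset(("a", "b")): 2,
      frozenset(("b", "s2")): 0}

labels = missing_colors(vertices, edges, pi, range(3))
for u in ("a", "b"):
    # Each clique vertex has degree Delta, so exactly one color is missing:
    # the color tau gave to the vertex itself.
    print(u, labels[u] == {tau_vertex[u]})
print(all(labels[u] != labels[v] for u, v in edges))
```

Both clique vertices have degree ∆(G) = 2, so their only missing color is their own vertex color, and the edge restriction distinguishes every pair of adjacent vertices.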

To conclude, we present two families of split graphs addressed in this work.

First, we associate with a split graph G = [Q, S] a bipartite graph BG = [Q, S], with bipartition [Q, S], where V(BG) = V(G) and E(BG) = E(G) \ {uv | u, v ∈ Q}. We write d(Q) = max{dB(u) | u ∈ Q} and d(S) = max{dB(v) | v ∈ S}. Thus, ∆(BG) = max{d(Q), d(S)} and ∆(G) = |Q| − 1 + d(Q).
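The quantities d(Q), d(S), ∆(BG) and ∆(G) are easy to compute from a given partition. The following sketch validates that (Q, S) is a split partition and then evaluates them; the function name, the dictionary keys and the example graph are assumptions of this illustration.

```python
def split_parameters(Q, S, edges):
    """Check that (Q, S) is a split partition and compute d(Q), d(S), Deltas."""
    Q, S = set(Q), set(S)
    adj = {v: set() for v in Q | S}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if any(v not in adj[u] for u in Q for v in Q if u != v):
        raise ValueError("Q is not a clique")
    if any(adj[v] & S for v in S):
        raise ValueError("S is not an independent set")
    # In B_G the clique edges are removed, so a vertex of Q keeps only its
    # neighbors in S, while a vertex of S keeps all of its neighbors.
    d_q = max((len(adj[u] & S) for u in Q), default=0)
    d_s = max((len(adj[v]) for v in S), default=0)
    return {
        "d(Q)": d_q,
        "d(S)": d_s,
        "Delta(B_G)": max(d_q, d_s),
        "Delta(G)": max(len(n) for n in adj.values()),
    }

Q, S = {"a", "b", "c"}, {"s", "t"}
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # clique edges
         ("a", "s"), ("a", "t"), ("b", "s")]   # edges of B_G
params = split_parameters(Q, S, edges)
print(params)
print(params["Delta(G)"] == len(Q) - 1 + params["d(Q)"])  # True
```

In this example d(Q) = d(S) = 2, so the graph satisfies the condition d(Q) ≥ d(S) used by Chen et al. [3].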

Chen et al. [3] show that if d(Q) ≥ d(S) for a split graph G = [Q, S], then χT(G) = ∆(G) + 1. Using Observation 1 and the technique of those authors, we show that this condition is also sufficient to guarantee χ′a(G) ≤ ∆(G) + 1.

Theorem 2 Let G = [Q, S] be a split graph. If d(Q) ≥ d(S), then χ′a(G) ≤ ∆(G) + 1.

Proof sketch: We construct an adjacent strong edge coloring π of G with ∆(G) + 1 colors. Since d(Q) ≥ d(S), we have ∆(G) = ∆(G[Q]) + d(Q). Let π1 be an edge coloring of BG with ∆(BG) = d(Q) colors. Let π2 be an adjacent strong edge coloring of G[Q] with ∆(G[Q]) + 1 new colors if |Q| is odd, and with ∆(G[Q]) + 2 new colors if |Q| is even and |Q| > 2. (The case |Q| ≤ 2 was handled separately.) When |Q| is even, |Lπ2(u)| = 2 for u ∈ Q. Let c be a color assigned by π1. In this case, for each edge uv ∈ BG such that π1(uv) = c, we conveniently choose a color x ∈ Lπ2(u) and set π1(uv) = x. Let π = π1 when restricted to BG and π = π2 when restricted to G[Q]. The element of Lπ(v) that belongs to Lπ2(v) distinguishes the label of v from the labels of the other vertices of Q. Hence π is an adjacent strong edge coloring.

The other class of graphs we address consists of the graphs that are both split and indifference graphs, the split-indifference graphs. A graph is an indifference graph if it is the intersection graph of unit intervals of the real line. A split-indifference graph G = [Q, S] has d(Q) ≤ 2 and d(S) ≥ d(Q). We verified Conjecture 1 for split-indifference graphs. We determined the adjacent strong chromatic index of split-indifference graphs G that have a universal vertex. If G has no universal vertex, we provide sufficient conditions for χ′a(G) to equal ∆(G) + 1, and we conjecture that χ′a(G) = ∆(G) + 2 otherwise.

To compare the values of χ′a(G) and χT(G), the second column of Table 1 shows the values of χ′a(G) we obtained when a split-indifference graph G is the union of two complete graphs G[A] and G[B] such that G[A] \ G[B] ≅ K1, where A and B are vertex sets, and the third column shows the known values of χT(G).

Restrictions                                  χ′a(G)      χT(G)
|A ∩ B| = 1                                   ∆(G)        ∆(G) + 1
2 ≤ |A ∩ B| ≤ (∆(G) + 1)/2                    ∆(G) + 1    ∆(G) + 1
(∆(G) + 1)/2 < |A ∩ B| ≤ (3∆(G) + 1)/4        ∆(G) + 1    ∆(G) + 2
|A ∩ B| > (3∆(G) + 1)/4                       ∆(G) + 2    ∆(G) + 2

Table 1. The values of χ′a(G) and χT(G) depend on the cardinality of A ∩ B.

References

[1] P. N. Balister, E. Gyori, J. Lehel, and R. H. Schelp. Adjacent vertex distinguishing edge-colorings. SIAM J. Discrete Math., 21(1):237–250, 2007.

[2] M. Behzad. Graphs and their chromatic numbers. PhD thesis, Michigan State University, 1965.

[3] B.-L. Chen, H.-L. Fu, and M. T. Ko. Total chromatic number and chromatic index of split graphs. Journal of Combinatorial Mathematics and Combinatorial Computing, 17:137–146, 1995.

[4] M. Chen and X. Guo. Adjacent vertex-distinguishing edge and total chromatic numbers of hypercubes. Inf. Process. Lett., 109(12):599–602, May 2009.

[5] M. Farber. Independent domination in chordal graphs. Operations Research Lett., 1(4):134–138, 1982.

[6] S. Foldes and P. L. Hammer. Split graphs. Congressus Numerantium, 19:311–315, 1977.

[7] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York, 1980.

[8] V. G. Vizing. On an estimate of the chromatic class of a p-graph (in Russian). Diskret. Analiz., 3:25–30, 1964.

[9] Z. Zhang, L. Liu, and J. Wang. Adjacent strong edge coloring of graphs. Appl. Math. Lett., 15(5):623–626, 2002.

[10] Z. Zhang, D. Woodall, B. Yao, J. Li, X. Chen, and L. Bian. Adjacent strong edge colorings and total colorings of regular graphs. Science in China Series A: Mathematics, 52(5):973–980, 2009.


Complex pattern detection and specification from multiscale environmental variables for biodiversity applications

Jacqueline Midlej do Espírito Santo1, Claudia Bauzer Medeiros1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)
13.083-852 – Campinas – SP – Brazil


Abstract. Biodiversity scientists often need to define and detect scenarios of interest from data streams generated by meteorological sensors. Such streams are characterized by their heterogeneity across spatial and temporal scales, which hampers the construction of scenarios. To help scientists in this task, this paper proposes the use of the theory of Complex Event Processing (CEP) to detect complex event patterns in this context. This paper was accepted for the VIII Brazilian e-Science Workshop (BreSci).


1. Introduction

Biodiversity broadly means the abundance, distribution and interactions across genotypes, species, communities, ecosystems and biomes. Countless problems in biodiversity studies require data collected and analyzed at multiple space and time scales, correlating environmental variables, living beings and their habitats [Hardisty and Roberts 2013]. An open problem in this context is how to specify and detect patterns from environmental variables at multiple scales, to help scientists analyze phenomena and correlate results with data collected in the field.

To help solve this problem, this work proposes to use Complex Event Processing (CEP), a technology that processes data streams from meteorological sensors via event detection. The main goal is to detect event patterns in near real time, in order to signal situations of interest [Sen et al. 2010]. The idea is to allow researchers to specify and combine events that characterize such situations, in the context of biodiversity applications. Currently, scenarios of interest are usually built case by case, and sensor events are sometimes captured by customized software. The paper extends the framework proposed by [Koga 2013] for this purpose.

2. Basic Concepts

In CEP, the word event means the programming entity that records an occurrence of something in a domain [Etzion and Niblett 2010]. Events are classified as primitive or complex. Primitive events represent an occurrence at a given place and time. Complex events are formed by combinations of primitive or complex events.

The main task of CEP is to detect complex events, in order to identify, within a set of events, those that are significant to an application domain. Such detection occurs by matching events against patterns. Patterns represent models of scenarios of interest, composed of specifications of events and their relationships. Patterns can be defined on a hierarchy of events in which the highest-level events are formed by inferences from lower-level events.

3. Related Work

Depending on the context, the structure and components of events can change. [Koga 2013] defines four attributes to specify events in environmental applications: measured-value, nature, spatial-variable, and timestamp. However, this representation only describes primitive events. The description of complex events must define relationships between events. For example, [Sen et al. 2010] represent complex events in business applications by a semantics-based model which, besides the basic attributes, references operators that connect events.
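One possible way to picture these two kinds of events is sketched below. The four fields of the primitive event follow the attribute names cited from [Koga 2013]; everything else (the class names, the `operator` strings, the coordinate pair used as spatial variable, and the sample values) is a hypothetical choice for illustration and not the model of either cited work.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple, Union

@dataclass(frozen=True)
class PrimitiveEvent:
    measured_value: float
    nature: str                            # e.g. "rainfall_mm"
    spatial_variable: Tuple[float, float]  # assumed here: (latitude, longitude)
    timestamp: datetime

@dataclass(frozen=True)
class ComplexEvent:
    operator: str                          # hypothetical operator name
    operands: Tuple[Union[PrimitiveEvent, "ComplexEvent"], ...]

def level(event):
    """Height of an event in the composition hierarchy (primitive = 0)."""
    if isinstance(event, PrimitiveEvent):
        return 0
    return 1 + max(level(e) for e in event.operands)

rain = PrimitiveEvent(80.0, "rainfall_mm", (-22.9, -47.06), datetime(2014, 8, 11, 14))
river = PrimitiveEvent(4.2, "river_level_m", (-22.9, -47.06), datetime(2014, 8, 11, 15))
flood_risk = ComplexEvent("AND", (rain, river))
alert = ComplexEvent("SEQ", (flood_risk, rain))
print(level(rain), level(flood_risk), level(alert))  # 0 1 2
```

Because a complex event may take other complex events as operands, the same structure covers arbitrarily deep hierarchies.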

Patterns are specified by event processing languages. These languages are mainly defined using approaches based on logic (logic-based) or automata (automata-based). Many research efforts aim at defining more powerful languages. For instance, [Barga and Caituiro-Monge 2006] describe the language Complex Event Detection and Response (CEDR) for expressing patterns that filter, generate and correlate complex events in business applications.

Logic-based patterns are defined as combinations of predicates on events. Examples of works using this approach are [Motakis and Zaniolo 1995] and [Obweger et al. 2010]. The first authors define a model for active databases in which pattern composition is described by Datalog1S rules. For biodiversity applications, our target, this model is limited because Datalog1S only supports one temporal operator. Scenarios that have more complex temporal relationships and/or spatial relationships cannot be represented. On the other hand, [Obweger et al. 2010] do not limit the predicates to the use of specific operators. In addition, their model allows users to compose hierarchical patterns using an interface that abstracts the definition of sub-patterns.

In automata-based approaches, regular expression operators are used to compose patterns. This approach limits the temporal relationships to the notion of precedence and does not support spatial operators. Examples of papers in this line are [Pietzuch et al. 2004] and [Agrawal et al. 2008]. The former performs event detection in distributed systems. The latter focuses on improving the runtime performance of pattern queries over event streams, for business applications.
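The limitation to precedence can be seen in a toy example. The sketch below is not taken from any of the cited systems: it encodes each event type as one letter (a hypothetical encoding) and uses Python's `re` module to match a sequence pattern over the resulting string.

```python
import re

# Hypothetical one-letter codes for event types.
SYMBOLS = {"rain": "R", "wind": "W", "temp_drop": "T"}

def encode(stream):
    """Turn a time-ordered list of event types into a string of symbols."""
    return "".join(SYMBOLS[event] for event in stream)

# "a wind event, then one or more rain events, then a temperature drop"
pattern = re.compile(r"WR+T")

stream = ["rain", "wind", "rain", "rain", "temp_drop", "wind"]
match = pattern.search(encode(stream))
print(encode(stream))   # RWRRTW
print(match.span())     # (1, 5)
```

The pattern can only say that one event comes before another; it has no way to express that two events overlap in time or that their regions touch, which is the limitation pointed out above.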

4. Partial Results

This work has two main parts. The first one aims at formalizing the specification of events in biodiversity, inspired by literature proposals applied to different domains (e.g., [Etzion and Niblett 2010, Barga and Caituiro-Monge 2006, Sen et al. 2010]). It must: allow hierarchical composition of events, as in [Sen et al. 2010]; combine heterogeneous data sources, as in [Koga 2013]; and consider the place where the event occurs, as in [Koga 2013]. It must also extend the semantics of operators to support spatial and temporal multiscale data. This specification can express biodiversity scenarios of different complexity, from excessive rain to situations combining river data with vegetation and relief data.

The second contribution of this work is the development of a mechanism that allows pattern composition and detection in order to assist biodiversity applications. This step extends the work of [Koga 2013], which allows integrating data from heterogeneous sources but is limited to the detection of primitive event patterns. Figure 1 illustrates the adapted architecture, horizontally drawn, of the extended framework. The architecture has two main aspects: the use of an Enterprise Service Bus (ESB) to process data streams uniformly and the use of CEP to detect patterns. Environmental data are pre-processed and translated into events, which pass through the ESB and are processed by CEP.

Figure 1. Architecture adapted from [Koga 2013]

From bottom to top, steps 1 and 2 filter data according to the goal of the application. At step 3, the data are encapsulated into messages that are standardized by channel adapters (ESB template). Steps 4 and 5 correspond to the translation of messages into events and their processing by CEP. If a pattern is detected, step 6 encapsulates the matched event into a new message. At steps 7 and 8, this message is standardized and sent to the interested user.

Our work complements the architecture by adding complex pattern composition and detection, illustrated by the red arrow from step 6 to step 4 in Figure 1. This adaptation provides more representative patterns. The detected composition of events is sent back to the ESB and forwarded back into the pipeline, creating a hierarchical structure. The output of a complex event may become part of more complex compositions, generating composite events at a higher level.

Using the architecture, biodiversity scientists can represent scenarios (such as deforestation and forest fires) by complex patterns and detect them. For instance, detecting climatic changes such as the arrival of a cold front in Campinas involves monitoring several environmental variables. A short logic-based pattern for this scenario can be:

∃Et1 | Et1.temp < 5 °C ∧ dist(Et1.space, Campinas) < 200 km ∧
∃Et2 | Et2.temp > 20 °C ∧ touch(Et1.space, Et2.space) ∧
∃Ew | Ew.windSpeed > 60 km/h ∧ overlap(Ew.space, Et1.space) ∧ movDir(Ew) = Campinas ∧
overlap(Et1.time, Et2.time, Ew.time)

This pattern searches for a composition of event Et1, signaling low temperature (cold air mass), "meeting" Et2, signaling high temperature in Campinas (hot air mass), and Ew, which shows the presence of strong wind carrying the cold front to Campinas. The detection process finds events Et1 and Et2, generating complex event CE1. This event is fed back to the bus. Next, CE1 and Ew are detected, generating the complex event CE2 that confirms the cold front. In the detection hierarchy, CE1 and CE2, once generated, form a higher hierarchical level.
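The two-stage detection described above can be sketched in a few lines. This is a toy model under strong assumptions that are not in the paper: regions are axis-aligned boxes (x_min, y_min, x_max, y_max) in planar kilometers with Campinas at the origin, times are (start, end) intervals in hours, events are plain dictionaries, and `touch`, `overlap`, `dist` and the field names are simplified stand-ins for the spatial and temporal operators of the pattern.

```python
from math import hypot

CAMPINAS = (0.0, 0.0)  # assumption: planar coordinates in km centred on Campinas

def dist(box, point):
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return hypot(cx - point[0], cy - point[1])

def overlap(a, b):
    """Interiors of the two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def touch(a, b):
    """Boxes share boundary points but their interiors do not intersect."""
    closed = a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
    return closed and not overlap(a, b)

def times_overlap(*intervals):
    return max(s for s, _ in intervals) <= min(e for _, e in intervals)

def detect_ce1(events):
    """Stage 1: cold air mass near Campinas touching a hot air mass."""
    temps = [e for e in events if e["nature"] == "temp"]
    for e1 in temps:
        for e2 in temps:
            if (e1["value"] < 5 and dist(e1["space"], CAMPINAS) < 200
                    and e2["value"] > 20 and touch(e1["space"], e2["space"])
                    and times_overlap(e1["time"], e2["time"])):
                return {"nature": "CE1", "parts": (e1, e2)}
    return None

def detect_ce2(ce1, events):
    """Stage 2: CE1 plus strong wind over the cold mass, heading to Campinas."""
    e1, e2 = ce1["parts"]
    for ew in events:
        if (ew["nature"] == "wind" and ew["value"] > 60
                and overlap(ew["space"], e1["space"])
                and ew["moving_towards"] == "Campinas"
                and times_overlap(e1["time"], e2["time"], ew["time"])):
            return {"nature": "CE2", "parts": (ce1, ew)}
    return None

et1 = {"nature": "temp", "value": 3.0, "space": (50, 0, 150, 100), "time": (0, 6)}
et2 = {"nature": "temp", "value": 24.0, "space": (-50, 0, 50, 100), "time": (2, 8)}
ew = {"nature": "wind", "value": 70.0, "space": (100, 20, 200, 80),
      "time": (4, 10), "moving_towards": "Campinas"}

bus = [et1, et2, ew]
ce1 = detect_ce1(bus)
bus.append(ce1)              # the detected complex event is fed back to the bus
ce2 = detect_ce2(ce1, bus)
print(ce1["nature"], ce2["nature"])  # CE1 CE2
```

Appending CE1 to the list of events mimics the feedback arrow from step 6 to step 4: the complex event becomes available as input for higher-level compositions.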

This framework will be validated over sensor data provided by Cooxupé, a cooperative of coffee farmers, from 14 weather stations in Minas Gerais and São Paulo; the same data were used to validate the work of [Koga 2013]. Each weather station continuously collects at least 26 types of measurements, e.g., temperature, humidity and barometric pressure.

5. Conclusions

This paper proposes a software framework to help biodiversity scientists quickly detect scenarios of interest. These scenarios are specified by event patterns. Their specification takes into account the expressiveness of patterns and events and the handling of multiscale data. Detection follows a hierarchical, logic-based approach. Future directions include defining the pattern language and a partial implementation.

Acknowledgements Work partially financed by FAPESP (2013/02269-7), FAPESP/Cepid in Computational Engineering and Sciences (2013/08293-7), the MSR-FAPESP Virtual Institute (NavScales project), CNPq (MuZOO Project), FAPESP-PRONEX (eScience project), INCT in Web Science, and grants from CNPq.

References

Agrawal, J., Diao, Y., Gyllstrom, D., and Immerman, N. (2008). Efficient pattern matching over event streams. In ACM SIGMOD, pages 147–160.

Barga, R. S. and Caituiro-Monge, H. (2006). Event correlation and pattern detection in CEDR. In EDBT, pages 919–930.

Etzion, O. and Niblett, P. (2010). Event Processing in Action. Manning Publications Co.

Hardisty, A. and Roberts, D. (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1).

Koga, I. K. (2013). An Event-Based Approach to Process Environmental Data. PhD thesis, Instituto de Computação - Unicamp. Supervisor: Claudia Bauzer Medeiros.

Motakis, I. and Zaniolo, C. (1995). Composite temporal events in active database rules: A logic-oriented approach. In DOOD, volume 1013 of LNCS, pages 19–37.

Obweger, H., Schiefer, J., Kepplinger, P., and Suntinger, M. (2010). Discovering hierarchical patterns in event-based systems. In SCC, pages 329–336.

Pietzuch, P., Shand, B., and Bacon, J. (2004). Composite event detection as a generic middleware extension. IEEE Network, 18(1):44–55.

Sen, S., Stojanovic, N., and Stojanovic, L. (2010). An approach for iterative event pattern recommendation. In DEBS, pages 196–205.


Diabetic Retinopathy Image Quality Assessment, Detection, Screening and Referral

Ramon Pires1 and Anderson Rocha1

1Institute of Computing – University of Campinas (UNICAMP)
13.083-852 – Campinas – SP – Brazil

{pires.ramon,anderson}@ic.unicamp.br

Abstract. Diabetic Retinopathy (DR), a common complication of diabetes, manifests through different lesions that can result in blindness if not discovered in time. In this work, we present a general framework whose objective is to automate eye-fundus image analysis. The work comprises four steps: image quality assessment, DR-related lesion detection, screening, and referral. In each step, we obtain results comparable to the state of the art and, in many cases, surpassing it, especially when dealing with hard-to-detect lesions. An important advance in our work is the cross-dataset validation protocol, which is closer to real situations. Furthermore, we propose a Bag-of-Visual-Words representation highly suitable for retinal image analysis.

1. Introduction

Diabetes mellitus (DM) is a chronic end-organ disease caused by a decrease in insulin sensitivity or a loss of pancreatic function, depending on the type of diabetes, both leading to an increase in the blood glucose level. An increased blood sugar level may damage blood vessels in all organ systems of the body. Currently, diabetes affects 366 million people worldwide, or 8.3% of adults. It is estimated that this number will increase to approximately 552 million people (one adult in 10 worldwide will have diabetes).

The growing prevalence of diabetes increases the prevalence of the complications related to the disease, including Diabetic Retinopathy (DR). DR occurs in approximately 2-4% of the population, and its prevalence is greater in indigenous populations. DR is the main cause of blindness in the 20 to 74 age group in developed countries, creating the need for systems that screen for diabetic retinopathy in its early stages, so as to allow an economically viable management of the disease.

In this context, we present herein a complete four-step solution for diabetic retinopathy image quality assessment, detection, screening and referral. In the first step, we apply characterization techniques to assess image quality by two criteria: field definition and blur detection. In the second step, we propose an approach for the detection of any lesion, in which we explore several alternatives for low-level (dense and sparse extraction) and mid-level (coding/pooling techniques of bags of visual words) representations, aiming at the development of an effective set of individual DR-related lesion detectors. The scores produced by the individual DR-related lesion detectors, computed for each image, form a high-level description that is fundamental to the third and fourth steps. Given a dataset described at this high level (scores from the individual detectors), we propose, in the third step of the work, the use of machine learning fusion techniques aiming at the development of a multi-lesion detection method. The high-level description is also explored in the fourth step for the development of an effective method for evaluating the need to refer a patient to an ophthalmologist within one year.

2. Datasets

Two different datasets tagged by medical specialists, DR1 and DR2, were used in the experiments. In the experiments with the cross-dataset protocol, DR1 is used for training while DR2 is the test set. The datasets were created by the Department of Ophthalmology, Federal University of São Paulo (UNIFESP).

DR1 has 1,077 retinal images with an average resolution of 640 × 480 pixels, of which 595 images are normal and 482 images have at least one disease. Each image was manually annotated for DR-related lesions (presence/absence) by three medical specialists, and only the images on which the three specialists agreed were kept in the final dataset.

DR2 comprises 520 images with 12.2 megapixels, cropped to 867 × 575 pixels to increase the processing speed. Among the 520 images, 300 are normal and 149 have at least one lesion. Ignoring the specific lesion that may be present, 337 images have been manually categorized by two independent specialists as not requiring referral and 98 images as requiring referral within one year. Both datasets are freely available through the FigShare repository, under URL http://dx.doi.org/10.6084/m9.figshare.953671.

3. Quality Assessment

Image quality is an important aspect of automated image analysis, and successful image analysis relies on it. Although quality assessment is a common task in lesion detection projects, performing it manually is expensive. Most works in the literature focus only on blur detection and disregard important factors such as field definition.

For this stage, a method was developed for analyzing image quality regarding motion blur and field definition [Pires et al. 2012]. Furthermore, alternative methods were also developed for blur detection [Jelinek et al. 2013].

3.1. Field Definition

In this problem, a good retinal image for DR analysis is an image centered on the macula.

The method we discuss herein is based on the methodology of full-reference comparison. In this methodology, a reference image with assured quality is assumed to be known, and quantitative measures of quality for any image are extracted by comparison with the reference. Given that the macular region has a distinguishable contrast, and that we are interested in the content of the center of retinal images, similarity metrics have proven highly suitable for this objective. To characterize the retinal images we use the method known as Structural Similarity (SSIM).
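For reference, the standard SSIM index between two signals x and y is ((2 μx μy + C1)(2 σxy + C2)) / ((μx² + μy² + C1)(σx² + σy² + C2)), with C1 = (0.01 L)² and C2 = (0.03 L)² for a dynamic range L. The sketch below evaluates this formula once over the whole signal, in pure Python; it is a simplification, since SSIM is normally computed over local windows and averaged, and it says nothing about how the authors derive their field-definition features from it. The sample pixel values are made up.

```python
def ssim_global(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of pixel intensities."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

reference = [10, 50, 200, 120]                 # toy "reference image"
inverted = [255 - p for p in reference]        # same structure, inverted contrast
print(ssim_global(reference, reference))       # 1.0 up to rounding
print(ssim_global(reference, inverted) < 0)    # True
```

A candidate image identical to the reference scores 1, and the score drops as luminance, contrast or structure diverge from the reference.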

3.2. Blur Detection

Although image quality analysis can have several ramifications before arbitrating on the quality of an image, we focus on two very common problems during image acquisition: blurring and out-of-focus capture. Our method involves a series of different blurring classifiers and classifier fusion to optimize the classification. Basically, we rely upon four descriptors: vessel area, visual dictionaries, progressive blurring and progressive sharpening. We also explore combinations of them. For a description of the methods, see the complete dissertation and the paper [Pires et al. 2012]. Fig. 1 presents the ROC curves achieved under cross-dataset validation.

[Figure 1: two ROC plots, Sensitivity versus 1 - Specificity.
Left panel (field definition): Grayscale without CLAHE (AUC = 84.7%); Grayscale with CLAHE (AUC = 83.2%); RGB without CLAHE (AUC = 75.5%); RGB with CLAHE (AUC = 75.6%).
Right panel (blur detection): Area descriptor (A) (AUC = 87.1%); 150 visual words (B) (AUC = 85.6%); Blurring descriptor (C) (AUC = 60.8%); Sharpening descriptor (D) (AUC = 83.9%); Blurring and Sharpening descriptor (E) (AUC = 69.0%); Fusion by concatenation (A, B, C, D and E) (AUC = 87.0%); Meta-SVM fusion (A, B, C, D and E) (AUC = 87.6%).]

Figure 1. Cross-dataset validation for field definition (left) and blur detection (right), using DR1 as the training set and DR2 as the test set.

4. DR-related Lesion Detectors

Because there are several lesions related to DR, with diverse characteristics, several works focus on the detection of individual lesions, exploiting particular pre- and post-processing methods for each one. In this stage, we developed a series of individual detectors for the most important DR-related lesions: hard exudates, superficial hemorrhages, deep hemorrhages, cotton wool spots, and drusen. An additional classifier able to detect both superficial and deep hemorrhages, referred to as red lesions, was also implemented.

This section briefly describes the experiments performed for the detection of individual DR-related lesions and presents the experimental results for each anomaly [Pires et al. 2014]. In the interest of reproducibility, the source code is freely available through GitHub: https://github.com/piresramon/pires.ramon.msc.git.

In our work, we employ a different strategy. We use a unified methodology, based on bag-of-visual-words (BoVW) representations, associated with maximum-margin support-vector machine (SVM) classifiers. This methodology has been widely explored for general-purpose image classification, and consists of the following steps: (i) extraction of low-level local features from the image; (ii) learning of a codebook using a training set of images; (iii) creation of the mid-level (BoVW) representations for the images based on that codebook; (iv) learning of a classification model for one particular lesion, using an annotated training set; (v) using the BoVW representation and the learned classification model to decide whether or not a particular image has a lesion.

The mid-level BoVW features are based upon the low-level features, whose choice has great impact on performance. Two approaches are usual: sparse features, based upon the detection of salient regions, or points of interest; and dense features, sampled over dense grids of different scales. Codebook learning, a challenging step, is usually performed by k-means clustering over features chosen at random from a training set.
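The codebook learning step can be illustrated with a plain Lloyd's k-means in a few lines. This is a generic sketch, not the authors' implementation: the 2-D toy features stand in for local descriptors, and the initial centroids are passed explicitly (in practice they would typically be drawn at random) so that the run is deterministic.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm; returns the final centroids (the codewords)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; an empty
        # cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

features = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
codebook = kmeans(features, centroids=[(0, 0), (10, 10)])
print(codebook)  # [(0.5, 0.5), (10.5, 10.5)]
```

k-means only reaches a local optimum that depends on the initial centroids, which is one reason why codebook learning is regarded as a delicate step.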


With the codebook in hand, the next steps are the BoVW operations: coding and pooling. For coding, besides the traditional hard assignment, we tested soft assignment and proposed a new semi-soft assignment especially conceived for the DR-related lesion detection application. Semi-soft coding tries to combine the advantages of both hard and soft assignments, i.e., avoiding the boundary effects of the former and the dense codes of the latter.

For the pooling step, we forgo the traditional sum-pooling and employ the more recent max-pooling. The pooling step is considered one of the most critical for the performance of BoVW representations, and max-pooling is considered an effective choice.
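The difference between the coding and pooling variants is easy to see on toy data. The sketch below implements hard assignment, one common form of soft assignment (a softmax over negative squared distances, with a hypothetical parameter `beta`), and sum/max pooling. The semi-soft scheme is not specified in this text and is therefore not reproduced; the codebook and features are made up.

```python
from math import exp

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_code(feature, codebook):
    """One-hot vector selecting the nearest codeword."""
    nearest = min(range(len(codebook)),
                  key=lambda i: sq_dist(feature, codebook[i]))
    return [1.0 if i == nearest else 0.0 for i in range(len(codebook))]

def soft_code(feature, codebook, beta=1.0):
    """Softmax over negative squared distances (one common soft assignment)."""
    dists = [sq_dist(feature, c) for c in codebook]
    d_min = min(dists)                       # shift for numerical stability
    weights = [exp(-beta * (d - d_min)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def pool(codes, mode="max"):
    """Combine the codes of all local features of an image, per codeword."""
    op = max if mode == "max" else sum
    return [op(column) for column in zip(*codes)]

codebook = [(0.0, 0.0), (10.0, 10.0)]
features = [(1, 1), (9, 9), (0, 1)]          # local features of one image

hard = [hard_code(f, codebook) for f in features]
print(pool(hard, "sum"))   # [2.0, 1.0]  histogram of codeword counts
print(pool(hard, "max"))   # [1.0, 1.0]  presence of each codeword
```

Sum-pooling counts how often each codeword occurs, while max-pooling only records its strongest activation, which makes the representation less sensitive to how many local features an image contains.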

The detailed results are presented in Table 1, which shows the AUCs obtained for each lesion in the cross-dataset protocol (training on DR1 and testing on DR2).

Table 1. AUCs (%) for training on DR1 and testing on DR2.

                          Sparse features             Dense features
Lesion                    Hard   Semi-soft   Soft     Hard   Semi-soft   Soft
Hard Exudates (HE)        93.1   97.8        95.5     94.5   95.6        95.6
Red Lesions (RL)          92.3   93.5        87.1     89.1   90.6        89.9
Cotton-wool Spots (CS)    82.1   90.8        84.9     84.5   90.4        90.3
Drusen (D)                66.5   82.8        62.6     84.1   82.5        75.5
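The AUC values reported in the table can be read as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. A minimal sketch of that rank-based computation follows; the scores and labels are invented, and this is not necessarily how the authors computed their curves.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.4, 0.3]   # detector outputs (made up)
labels = [1, 0, 1, 0]           # 1 = lesion present
print(auc(scores, labels))      # 0.75
```

The pairwise form is quadratic in the number of images, which is fine for a sketch; sorting-based formulations give the same value more efficiently.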

5. Detector Fusion

Given a set of detectors of individual DR-related lesions, developed with a method that provides satisfactory results in deciding the presence or absence of the most common lesions, this step combines them in order to point out whether an image is normal or has any lesion, including possible lesions not present during training.

Classifier fusion was explored to combine the individual DR-related lesion detectors [Jelinek et al. 2012]. Our main approach consists in investigating the fusion of different detectors to identify the presence of DR. The work contains a set of classifiers that act in cooperation to solve a pattern recognition problem, followed by several methods for classifier fusion. This approach is intuitive since it imitates our nature to seek several opinions before making a decision.

For this step, we have a set of detectors for six individual DR-related lesions. The assignment approach explored for the development of the detectors is the semi-soft one, explained in Section 4. Two fusion methods were evaluated: OR and meta-classification.

• OR: labels as positive the data classified as positive by at least one classifier.
• Meta-classification: employed in Section 3 for quality assessment, it can be loosely defined as learning from information generated by different learners.
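The OR rule is simple enough to state in code. In the sketch below each detector produces a score for an image and is compared against its own decision threshold; the thresholds and scores are hypothetical values chosen for illustration. Meta-classification, which trains a second-level classifier on the vector of detector scores, is not shown.

```python
def or_fusion(decisions):
    """Positive if at least one individual detector says positive."""
    return any(decisions)

def fuse_scores(scores, thresholds):
    """Threshold each detector score, then combine the decisions with OR."""
    return or_fusion(s >= t for s, t in zip(scores, thresholds))

thresholds = [0.5, 0.5, 0.5]                     # hypothetical, one per detector
print(fuse_scores([0.1, 0.7, 0.2], thresholds))  # True: second detector fires
print(fuse_scores([0.1, 0.3, 0.2], thresholds))  # False: image kept as normal
```

Because a single firing detector is enough, OR fusion favors sensitivity: adding detectors can only increase the number of images flagged as positive.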

We performed the fusion in three different testing steps:

(1) images from DR2 with at least one of the discussed lesions;
(2) images from DR2 with any DR-related lesion (including neovascularization, increased vascular tortuosity, foveal atrophy, chorioretinitis scar, etc.);
(3) images from DR2 which present signs of other anomalies (except the ones we trained for). This step was performed only for the OR fusion technique.


Figure 2 depicts the results achieved using the logic OR fusion and the meta-classification method. The results for meta-classification express the mean and the standard deviation of the AUCs obtained using the 5 × 2-fold cross-validation protocol.

[Figure 2: two ROC plots, Sensitivity versus 1 - Specificity.
Left panel (fusion by logic OR): Same lesions (AUC = 88.6%); All lesions (AUC = 81.6%); Other lesions (AUC = 66.8%).
Right panel (fusion by meta-classification): Same lesions (AUC = 89.3% ± 2.6%); All lesions (AUC = 82.5% ± 2.5%).]

Figure 2. Cross-dataset validation for fusion by logic OR (left) and 5 × 2-fold cross-validation for fusion by meta-classification (right).

6. Referral

In order to achieve early detection of DR, helping to stop or slow down its progress, international guidelines recommend annual eye screening for all diabetic patients. However, the increasing number of diabetic patients and the decreasing number of ophthalmologists make it difficult to perform this annual examination at the required scale. These factors tend to overwhelm specialists even more in the coming years.

Thus, aiming at referring to a specialist only the patients who really need a consultation, this work includes a stage for classifying retinal images as referable (to be referred to a specialist) or non-referable (not to be referred to a specialist) within one year [Pires et al. 2013].

Table 2 summarizes all the results obtained for referral. The experiments wereperformed without normalization, and normalizing with z-scores and term-frequency.

Table 2. AUCs for referral.

Technique                Hard-sum        Soft-max
Without normalization    90.8% ± 3.1%    93.4% ± 2.1%
Term-frequency           82.5% ± 4.6%    83.4% ± 4.6%
Z-score                  91.7% ± 2.1%    89.4% ± 3.0%
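The two normalizations compared in the table can be sketched as follows, under the assumption (not stated in this text) that they are applied to the feature vectors fed to the referral classifier: term-frequency divides each vector by the sum of its entries, and z-score standardizes each feature dimension across the dataset. Function names and data are illustrative.

```python
from statistics import mean, pstdev

def term_frequency(vector):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)

def z_score_columns(rows):
    """Standardize each column (feature) to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, mus, sigmas)]
            for row in rows]

print(term_frequency([2, 1, 1]))                    # [0.5, 0.25, 0.25]
print(z_score_columns([[1.0, 10.0], [3.0, 10.0]]))  # [[-1.0, 0.0], [1.0, 0.0]]
```

In a train/test setting the column means and deviations would be estimated on the training set only and then reused on the test set.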

7. Conclusion

In this research, we proposed original solutions to deal with diabetic retinopathy related problems. The major results were published in international conferences [Pires et al. 2012, Jelinek et al. 2012, Jelinek et al. 2013] and in a top-tier international journal [Pires et al. 2013]. Furthermore, part of the achievements was combined with new discoveries in the doctorate (under way), resulting in a paper recently accepted for publication in the top-tier journal PLOS ONE [Pires et al. 2014].


The main breakthrough for the quality analysis step was the use of classifier fusion to optimize the classification. This tactic gave us an interesting result: to ensure that a satisfactory percentage of poor-quality images will be discovered (≈ 90.0%), we can establish that only 10.0% of the adequate-quality images will be unnecessarily retaken. Quality assessment is a key step of a robust DR-related lesion screening system because it helps prevent misdiagnosis and subsequent retakes.

The detection of individual DR-related lesions is one of the most important topics of this work. The development of detectors aims at facilitating care in rural and remote communities. A considerable contribution of this step was the proposal of a new coding scheme called semi-soft, which outperforms the state of the art, mainly for hard-to-detect DR-related lesions, such as drusen and cotton-wool spots.

Based upon the scores associated with the detection of the most common DR-related lesions, we developed an accurate multi-lesion detector which proved effective for the detection of all the considered lesions. Among the strategies for fusing individual lesion detectors, the meta-classification method gave the most satisfactory results.

For assessing the need for referral, our proposed method can be used especially in remote and rural areas. The method captures retinal images, evaluates them in real time, and suggests whether or not the patient requires a review by an ophthalmic specialist within one year. We have achieved important results with this methodology. For example, at a sensitivity of 90.0% we obtain a specificity of 85.0%, which means that 85.0% of the patients who do not need a referral are screened out automatically, saving specialist time (only 15.0% of them are referred unnecessarily).
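The meaning of this operating point can be made concrete with a small worked example. The population sizes below are hypothetical and chosen only for illustration; the sensitivity and specificity are the ones reported in the text.

```python
def screening_outcome(referable, non_referable, sensitivity, specificity):
    """Split a screened population into the four outcomes of a referral test."""
    true_positives = round(referable * sensitivity)
    true_negatives = round(non_referable * specificity)
    return {
        "referred_correctly": true_positives,
        "missed": referable - true_positives,
        "screened_out": true_negatives,
        "referred_unnecessarily": non_referable - true_negatives,
    }

# Hypothetical population: 200 referable and 800 non-referable patients.
outcome = screening_outcome(200, 800, sensitivity=0.90, specificity=0.85)
```

Under these assumed numbers, the specialist sees 300 patients instead of 1000, and 120 of those 300 did not need the consultation.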

In closing this work, we would like to emphasize that there is still important research to be done in DR image analysis, for instance identifying the precise location of a lesion and determining the DR severity of a patient by further classifying the images into early, mild, moderate, and severe stages. We also intend to explore more sophisticated methodologies for machine learning and image representation.

References

Jelinek, H. F., Pires, R., Padilha, R., Goldenstein, S., Wainer, J., and Rocha, A. (2012). Data fusion for multi-lesion diabetic retinopathy detection. In IEEE Intl. Computer-Based Medical Systems, pages 1–4.

Jelinek, H. F., Pires, R., Padilha, R., Goldenstein, S., Wainer, J., and Rocha, A. (2013). Quality control and multi-lesion detection in automated retinopathy classification using a visual words dictionary. In Intl. Conference of the IEEE Engineering in Medicine and Biology Society.

Pires, R., Jelinek, H., Wainer, J., Valle, E., and Anderson, R. (2014). Advancing bag-of-visual-words representations for lesion classification in retinal images. PLoS ONE, 9(6):e96814.

Pires, R., Jelinek, H. F., Wainer, J., Goldenstein, S., Valle, E., and Rocha, A. (2013). Assessing the need for referral in automatic diabetic retinopathy detection. IEEE Transactions on Biomedical Engineering, 60(12):3391–3398.

Pires, R., Jelinek, H. F., Wainer, J., and Rocha, A. (2012). Retinal image quality analysis for automatic diabetic retinopathy detection. In Intl. Conference on Graphics, Patterns and Images, pages 229–236.


Tools for Numerical Simulation in the Cloud

Renan M. P. Neto1, Juliana Freitag Borin1, Edson Borin1

1Instituto de Computação – Universidade Estadual de Campinas (Unicamp)
Campinas – SP – Brazil

[email protected], {juliana,edson}@ic.unicamp.br

Abstract. Numerical simulations are widely used to predict the behavior of real systems. The high computational power required by most of these simulations has increased the interest in using cloud computing to perform them. This project aims at investigating tools to perform numerical simulation in the cloud. Additionally, we will develop a Web interface to allow users from all areas of knowledge to deploy their simulations in the cloud.

Resumo. Simulações numéricas são amplamente utilizadas para avaliar o comportamento de sistemas reais. O alto poder computacional exigido para a execução de muitos simuladores tem aumentado o interesse no uso de computação na nuvem para esse fim. Este projeto tem como objetivo investigar ferramentas para execução de simulação numérica na nuvem. Adicionalmente, pretende-se desenvolver uma interface Web simples e funcional para que usuários sem ampla experiência em computação possam executar suas simulações na nuvem.

1. Introduction

Computational simulations use abstract mathematical models to describe the behavior of real systems in several areas, such as engineering, oil extraction, and aerodynamics. Simulations are widely used to predict how real systems will behave even before they are built, which reduces the costs associated with running real tests [Liu et al. 2012, Pan et al. 2013, Johnson and Tolk 2013].

For the results of a simulation to be statistically representative, it must be run thousands of times with different inputs. When the simulated models are complex, high-performance computing (HPC) infrastructure is needed to complete all these runs in a timely manner [Pan et al. 2013, Johnson and Tolk 2013].

The growing supply of cloud computing services has made systems with large numbers of processors accessible. A cloud is a type of parallel and distributed system consisting of a collection of interconnected, virtualized computers that are dynamically provisioned on demand, based on service-level agreements established between consumers and the cloud provider [Buyya et al. 2009].

Numerical simulations can take advantage of the rapid advance of computational clouds. Because clouds are elastic and numerical simulations are scalable, thousands of simulations can run at the same time, which reduces the total execution time and frees the user from buying and maintaining physical machines.


1.1. Cloud computing

Cloud computing has been regarded as a new paradigm in computing. In this paradigm, the computing infrastructure is moved to a provider and accessed over the Internet [Vaquero et al. 2008]. The main philosophy of the paradigm is to deliver computing as a utility, like water, electricity, gas, and telephony. Some authors treat cloud computing as the fifth utility [Buyya et al. 2009].

Cloud computing has several benefits [Juve et al. 2009], among them:

• Elasticity - Clouds allow users to allocate and release computational resources on demand. In this way, their computational resources can always be proportional to their needs.

• Illusion of infinite resources - Cloud providers give the illusion of having infinite resources. Users can therefore request whatever amount of computational resources they want.

In the cloud computing paradigm, there are three types of service. The first is Infrastructure-as-a-Service (IaaS), in which the user is responsible for managing their applications, data, operating systems, and middleware. The second is Platform-as-a-Service (PaaS), in which the cloud abstracts away concerns such as middleware and operating systems, and the user is responsible only for their applications and data. The last is Software-as-a-Service (SaaS), in which the user needs to manage neither data nor applications [Hwang and Li 2010, Jha et al. 2011].

Clouds can be classified as private, public, or hybrid. Public clouds are those in which the cloud infrastructure is made available by a cloud provider [Amazon, Microsoft]. Private clouds belong to an organization, which is responsible for managing and maintaining them. A hybrid cloud is the union of two or more independent clouds that can work together [Peng et al. 2009, Jha et al. 2011].

Private clouds are usually created and managed with tools built for that purpose. Eucalyptus [Eucalyptus], OpenNebula [OpenNebula], and OpenStack [OpenStack] are examples of tools that administer clouds. These tools perform operations such as allocating and releasing virtual machines, managing users, and providing redundancy [Peng et al. 2009, von Laszewski et al. 2012, Khan et al. 2011, Sempolinski and Thain 2010].

1.2. Related Work

Jorissen, Vila, and Rehr [Jorissen et al. 2012] created a programming interface that simplifies basic operations on Amazon's cloud [Amazon], such as connecting to the service, creating new machine instances, terminating instances, accessing files, and starting tasks. The authors also created a graphical interface for some specific calculations, as well as an interface for configuring the cloud computation, in which the user can provide a user name, password, and configuration files. Finally, the authors show that the process of starting and configuring the instances can be parallelized, which makes the initialization stage approximately twice as fast as the serial version.


Angeli and Masala [Angeli and Masala 2012] also note the importance of parallelizing the initialization and configuration of the instances. In this case, however, the authors point out that the longer one instance waits for another, the more expensive the service becomes, since in the cloud one pays for the time each machine is allocated, regardless of whether it is performing computation. Another important point raised by the authors is that the cost of using the cloud should be weighed against the purchase of in-house servers. This comparison must take several variables into account, such as machine wear and deterioration, hardware upgrades, and cooling. The authors also stress the use of instances specialized for the problem at hand. For problems whose performance is limited by memory, for example, memory-optimized instances should be used.
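The effect of idle waiting on the bill can be sketched with a toy cost model. The sketch assumes that every started hour is billed in full and that, in the serial case, all instances are allocated up front and computation starts only after the last one is configured; the prices and durations are hypothetical and do not come from any provider's price list.

```python
import math

def allocation_hours(n_instances, setup_hours, compute_hours, parallel_setup):
    """Hours each instance stays allocated, including time spent waiting."""
    if parallel_setup:
        return [setup_hours + compute_hours] * n_instances
    # Serial setup: every instance waits until all of them are configured.
    return [n_instances * setup_hours + compute_hours] * n_instances

def billed_cost_cents(hours_per_instance, price_cents_per_hour):
    """Assumption: each started hour is billed as a full hour."""
    return sum(math.ceil(h) for h in hours_per_instance) * price_cents_per_hour

# Hypothetical job: 4 instances, 30 min setup each, 2 h of computation.
serial = billed_cost_cents(allocation_hours(4, 0.5, 2, False), 10)
parallel = billed_cost_cents(allocation_hours(4, 0.5, 2, True), 10)
```

With these assumed numbers the serial setup costs 160 cents and the parallel setup 120 cents, although both perform the same amount of useful computation.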

In the work of Jakovits, Srirama, and Kromonov [Jakovits et al. 2012], a tool was proposed for scientific simulations in the cloud. The tool is based on the BSP (Bulk Synchronous Parallel) model, a parallel and iterative model of computation. The model is divided into supersteps, and each superstep is divided into local computation, synchronization, and a global barrier. In this work, the authors were concerned with failures and tried to optimize the model, since at certain points of the execution some tasks may be left waiting for other, smaller tasks. The main drawback of the tool is that it uses the legacy Oxford BSPlib library [Oxford]. The library was last updated in 1998 and is difficult to install on modern computers, especially 64-bit systems [Jakovits et al. 2012].
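The superstep structure of BSP can be illustrated with a minimal sketch. It uses Python threads and `threading.Barrier` purely for illustration; it is not the Stratus tool or the Oxford BSPlib API, and the function name `bsp_sum` is hypothetical.

```python
import threading

def bsp_sum(chunks):
    """Each worker sums its chunk, then all workers combine the partial sums."""
    n = len(chunks)
    barrier = threading.Barrier(n)
    partials = [0] * n
    totals = [0] * n

    def worker(rank):
        # Superstep 1: local computation, then publish the result.
        partials[rank] = sum(chunks[rank])
        barrier.wait()  # global barrier: nobody proceeds until all are done
        # Superstep 2: every partial result is now visible to every worker.
        totals[rank] = sum(partials)
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

The barrier is also where the waiting problem mentioned above shows up: a worker with a short chunk reaches `barrier.wait()` early and stays idle until the slowest worker arrives.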

Liu, He, Qiu, and Huang [Liu et al. 2012] propose using the cloud to run simulations. The authors also propose improvements to the architecture of CSim (Cloud-based computer Simulation). Like the other authors, they note that making good use of computational resources is the key to using the cloud, and CSim as well.

2. Objectives

The goal of this work is to study and develop tools for running numerical simulators in the cloud. The different business models and programming interfaces for using the cloud will be studied, and a system capable of running simulators in the cloud in an automated way will be developed.

By studying the business models and performance of cloud providers, it will be possible to analyze which providers meet the requirements of a user's simulations. As noted earlier, in some models the user deals with infrastructure concerns, and in others the infrastructure is abstracted away. There are also several billing schemes in the cloud.

The resources made available in clouds, also called instances, differ from one another. Some instances aim at good processing performance and others at better main-memory access, for example. The proposed study is expected to identify which type of instance is best for the user's problem, since there are several types of simulation with different behaviors.

Another important contribution of this work will be a simple, user-friendly Web interface for running simulations in the cloud. With this interface, the user will be able to upload a simulation and its input parameters and choose the number of machines that will carry out the task, so as to optimize cost and/or execution time.

3. Materials and methods

In the current phase of the project, an in-depth study of work related to running simulations in the cloud is under way. The tools available for creating and managing private clouds are also being studied. In the coming months, a private cloud will be created on the cluster of LMCAD (Laboratório Multidisciplinar de Alto Desempenho), which will serve as a study platform for understanding how a cloud works and as a test platform in the later phases of the project.

There are several public cloud providers on the market. The two most popular are Amazon AWS [Amazon] and Windows Azure [Microsoft]. Amazon AWS provides Infrastructure-as-a-Service and is the largest and best-known cloud computing company, as shown in Figure 1. Amazon AWS offers many types of instances, both general-purpose and specialized. The specialized instances vary in the quality and quantity of processors, RAM, main memory, and graphics cards.

Windows Azure does not offer as many instance options as Amazon AWS, but it provides both Infrastructure-as-a-Service and Platform-as-a-Service. Like Amazon, it also has some instances focused on HPC. This project was awarded Microsoft Azure computational resources with an estimated value of US$40,000.00 for research and experiments.

Figure 1. Revenues of the largest cloud computing companies.1

Once the workings of private and public clouds are understood, development of the tools for running simulations in the cloud will begin. In this phase, it will be necessary to study which programming language best fits the nature of the problem.

1Source: Company data, Evercore Group L.L.C. Research. The Azure figure is based on MSFT's comments about its US$1 billion revenue in May 2013. Google's revenue was estimated by TBR.

Once the tools are developed, a Web interface will be created so that users from different fields of knowledge can run their simulations on public or private clouds in a simple way. This interface will abstract all the steps needed to deploy a simulation in the cloud.

Web sites follow well-defined standards, which allows them to be understood by people all over the world. They are usually built with HTML (HyperText Markup Language). HTML is usually complemented by JavaScript, to program the actions and behavior of the page, and by CSS (Cascading Style Sheets), to define the appearance of the page. With these tools it is possible to create a Web site that is accessible from anywhere in the world and follows well-established standards. The Web interface created in this project must comply with these standards.

4. Expected results

With the proposed tool, even users with basic computing knowledge are expected to be able to run their numerical simulations simply, efficiently, and from anywhere in the world. We also expect to provide scalability and elasticity to users. A user will thus be able to use as many instances as desired, balancing total execution time against the amount spent on the cloud.
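The balance between execution time and cost can be sketched for a batch of independent simulation runs. The sketch assumes per-hour billing with every started hour charged in full, identical run times, and no setup overhead; all numbers are hypothetical.

```python
import math

def plan(n_runs, minutes_per_run, n_instances, price_cents_per_hour):
    """Return (wall-clock minutes, total cost in cents) for a batch of runs."""
    runs_per_instance = math.ceil(n_runs / n_instances)
    wall_minutes = runs_per_instance * minutes_per_run
    billed_hours = math.ceil(wall_minutes / 60)  # assumption: full-hour billing
    return wall_minutes, n_instances * billed_hours * price_cents_per_hour

# Hypothetical batch: 1000 runs of 6 minutes each at 10 cents per hour.
options = {n: plan(1000, 6, n, 10) for n in (1, 10, 100, 1000)}
```

Under these assumptions, going from 1 to 100 instances cuts the wall-clock time from 6000 to 60 minutes at the same cost of 1000 cents, while 1000 instances finish in 6 minutes but cost ten times more, because each instance is billed for an hour it barely uses.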

The tool will be evaluated through a case study using a well-reservoir coupling simulator developed at the Laboratório de Mecânica Computacional (LabMeC) of the Departamento de Estruturas, Faculdade de Engenharia Civil, Arquitetura e Urbanismo, Unicamp.

References

Amazon. Amazon web services. http://aws.amazon.com/.

Angeli, D. and Masala, E. (2012). A cost-effective cloud computing framework for accelerating multimedia communication simulations. Journal of Parallel and Distributed Computing, 72(10):1373–1385.

Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., and Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616.

Eucalyptus. Eucalyptus - open source private cloud software. https://www.eucalyptus.com/.

Hwang, K. and Li, D. (2010). Trusted cloud computing with secure resources and data coloring. Internet Computing, IEEE, 14(5):14–22.

Jakovits, P., Srirama, S. N., and Kromonov, I. (2012). Stratus: A distributed computing framework for scientific simulations on the cloud. In High Performance Computing and Communication 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pages 1053–1059.

Jha, S., Katz, D. S., Luckow, A., Merzky, A., and Stamou, K. (2011). Understanding Scientific Applications for Cloud Environments, pages 345–371. John Wiley & Sons, Inc.

Johnson, H. E. and Tolk, A. (2013). Evaluating the applicability of cloud computing enterprises in support of the next generation of modeling and simulation architectures. In Proceedings of the Military Modeling & Simulation Symposium, MMS ’13, pages 4:1–4:8, San Diego, CA, USA. Society for Computer Simulation International.

Jorissen, K., Vila, F., and Rehr, J. (2012). A high performance scientific cloud computing environment for materials simulations. Computer Physics Communications, 183(9):1911–1919.

Juve, G., Deelman, E., Vahi, K., Mehta, G., Berriman, B., Berman, B., and Maechling, P. (2009). Scientific workflow applications on amazon ec2. In E-Science Workshops, 2009 5th IEEE International Conference on, pages 59–66.

Khan, I., Rehman, H., and Anwar, Z. (2011). Design and deployment of a trusted eucalyptus cloud. In Cloud Computing (CLOUD), 2011 IEEE International Conference on, pages 380–387.

Liu, X., He, Q., Qiu, X., Chen, B., and Huang, K. (2012). Cloud-based computer simulation: Towards planting existing simulation software into the cloud. Simulation Modelling Practice and Theory, 26(0):135–150.

Microsoft. Microsoft azure. http://azure.microsoft.com/.

OpenNebula. Opennebula - flexible enterprise cloud made simple. http://opennebula.org/.

OpenStack. Openstack - cloud software. https://www.openstack.org/.

Oxford. The oxford bsp toolset. http://www.bsp-worldwide.org/implmnts/oxtool/.

Pan, Q., Pan, J., and Wang, C. (2013). Simulation in cloud computing envrionment. Service Sciences, International Conference on, 0:107–112.

Peng, J., Zhang, X., Lei, Z., Zhang, B., Zhang, W., and Li, Q. (2009). Comparison of several cloud computing platforms. In Information Science and Engineering (ISISE), 2009 Second International Symposium on, pages 23–27.

Sempolinski, P. and Thain, D. (2010). A comparison and critique of eucalyptus, opennebula and nimbus. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 417–426.

Vaquero, L. M., Rodero-Merino, L., Caceres, J., and Lindner, M. (2008). A break in the clouds: Towards a cloud definition. SIGCOMM Comput. Commun. Rev., 39(1):50–55.

von Laszewski, G., Diaz, J., Wang, F., and Fox, G. C. (2012). Comparison of multiple cloud frameworks. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pages 734–741.


Information and Emotion Extraction in Portuguese from Twitter for Stock Market Prediction

Fernando J. V. da Silva1, Ariadne M. B. R. Carvalho1

1Institute of Computing – University of Campinas
Campinas – SP – Brazil


Abstract. Stock market prediction is a very challenging endeavor which has been explored in many different ways since the advent of modern computing. With the upsurge of microblogging in the last two decades, stock market trading became a popular topic on Twitter and, as a consequence, natural language processing techniques started to be used to deal with the problem. However, the prediction of individual stocks based on tweets is still little explored. We present a research proposal for investigating a new technique aiming at stock market prediction for IBOVESPA, based on information extraction and emotions detected from tweets in Portuguese.

Resumo. Prever o mercado de ações é uma tarefa bastante desafiadora, mas que vem sendo amplamente explorada. Com o aparecimento de microblogs nas últimas duas décadas, a operação no mercado de ações se tornou um tópico bastante popular no Twitter e, como consequência, técnicas baseadas em processamento de linguagem natural começaram a ser usadas para lidar com o problema. Todavia, a predição de ações individuais com base em tweets continua pouco explorada. Apresentamos uma proposta de pesquisa que visa investigar uma nova técnica para previsão de preços de ações da IBOVESPA, com base em extração de informação e detecção de emoções em tweets escritos em português.

1. Introduction

One of the main questions that comes to economists’ minds is: “Can stock prices really be predicted?”. A possible clue to such a controversial issue is presented in the Efficient Market Hypothesis (EMH) [Fama 1970], which states that market prices fully reflect all the available information. This information is indeed used by all sorts of investors, such as those who use techniques based on Fundamental Analysis [Abarbanell and Bushee 1997]. In particular, small investors and some stock market specialists take this analysis into account when deciding which stock to buy or sell, and many discuss their evaluation on Twitter1. They often post news about stocks and their own perspectives and expectations, frequently enriched with results obtained from Technical Analysis [Lo et al. 2000]. Some of these tweets can be seen in Figure 1.

In this work we want to verify whether it is possible to predict individual stock prices in the Brazilian stock market based only on investor comments made on Twitter. Furthermore, we want to determine whether this can be done automatically by extracting meaningful information and emotions from stock market messages in Portuguese, helping to decide whether or not to buy stocks.

1http://www.twitter.com.

#bbas3 depois acho meu post mas ainda aguardo 18,75
[#bbas3 I will find my post later, but I am still waiting for (the stock price to reach) 18.75]
Ai meu bolso.... Bbas3 caiu pra c****** hoje
[Ouch, my wallet... Bbas3 fell like (expletive) today]
Trade de venda no grafico semanal acaba de ser acionado em BBAS3
[A weekly sell trade signal has just been triggered in BBAS3]

Figure 1. Tweets mentioning stocks. Company codes in bold.

The remainder of this paper is organized as follows. In the next section the problem is stated, along with the main previous research. In Sections 3 and 4, respectively, our objectives and methods are presented, whilst in Section 5 we describe the evaluation techniques which will be used, as well as the expected results. Finally, in Section 6 we present our final remarks.

2. Context

The first works on stock market prediction were mostly conducted by financial researchers, who mainly used message boards as a source for their work [Gaskell et al. 2013]. However, the use of microblogs, such as Twitter, has been solidified since 2011, attracting computer scientists to the area.

Research on emotions was presented in [Bollen et al. 2011]. The authors measured collective moods and studied their correlation to the Dow Jones Industrial Average (DJIA) during the year 2008. The moods were extracted using the OpinionFinder tool2 and GPOMS, a tool which captures six different moods: calm, alert, sure, vital, kind and happy. The GPOMS moods were derived from an existing psychometric instrument called Profile of Mood States (POMS-bi) ([Norcross et al. 1984] apud [Bollen et al. 2011]). The findings indicate a strong linear correlation between the calm mood index and DJIA prices with a lag of 2 to 6 days, but the other mood indexes and OpinionFinder showed no correlation with stock market index values at all. Furthermore, a self-organizing Fuzzy Neural Network was used to predict the DJIA based on inputs from the past 3 days of DJIA values and a combination of values from the mood indexes, also from the past 3 days. This last experiment showed that the calm index also helps to increase performance when predicting DJIA values for the next 3 or 4 days.

Regarding sentiment analysis from message boards, several previous works found a correlation between sentiment in tweets and stock market indexes, such as DJIA, S&P 100 and S&P 500 [Zhang et al. 2011, Mao et al. 2011, Bollen et al. 2011, Gaskell et al. 2013]. However, few articles indicate a correlation to individual stocks. Price prediction works well for messages that suggest buying, but there are no good results for messages suggesting selling [Sprenger et al. 2013].

Though the prediction period tends to be no more than 1 to 4 days, studies usually suggest 3 days as the optimal prediction period [Gaskell et al. 2013]. Furthermore, messages with fewer emotions indicate DJIA index increases [Zhang et al. 2011], and calm moods are better for predicting DJIA values [Bollen et al. 2011]. There is a growing research field using Twitter as a message source, instead of the traditional message boards. There is already some research to identify sentiments in Portuguese tweets or message boards, and to predict the stock market based on news [Lopes et al. 2008, Nizer and Nievola 2012]. Nevertheless, with respect to Brazilian stock market prediction, neither message boards nor Twitter have previously been used, to the best of our knowledge.

2http://mpqa.cs.pitt.edu/opinionfinder/.

3. Objectives

The main goal of this research is to verify whether it is possible to predict stock price variation in the Brazilian stock market using only information and emotions from tweets in Portuguese. Our main hypotheses are:

• H0 - Individual stock prices can be predicted by Twitter alone: Twitter has been thoroughly explored to predict variation in stock market indexes, but there is a need for experiments to predict individual stock prices. The great challenge is that individual stocks tend to show up in a small number of tweets per day.

• H1 - Twitter can predict IBOVESPA stock prices: Although Twitter has mostly been used to predict American stock market indexes, we believe it can similarly be used for Brazilian stock market prediction.

• H2 - Tweets can only suggest buys: This hypothesis is based on the results from [Sprenger et al. 2013], but it will be further investigated in our research.

• H3 - Emotions in tweets, or the lack of them, may help to decide whether or not to buy a stock: This is inspired by the statement of Zhang and colleagues [Zhang et al. 2011], who say that fewer emotions indicate an increase in the DJIA. Furthermore, as presented in [Bollen et al. 2011], we may look for calm moods in the messages too.

• H4 - The number of readers can affect prices: We argue that if someone publishes information or expresses an emotion, this could influence other users’ future actions, either to sell or buy a stock, which could in turn affect its price.

4. Methods

We will analyze tweets that contain at least one stock symbol from the IBOVESPA index during the year 2014. Taking these tweets as input, the experiment will be conducted in two steps: 1) tweets will be tagged according to the emotions they carry; and 2) all tweets that mention a specific stock on a given day will be joined together, their term and emotion frequencies will be counted, and they will be classified according to the stock price variation, using Machine Learning techniques [Faceli et al. 2011].

4.1. Emotion Tagger

The emotion tagger will identify moods in tweets using Machine Learning algorithms. Tweets will be classified according to the emotions specified in Plutchik’s “wheel of emotions” theory ([Plutchik and Kellerman 1986] apud [Suttles and Ide 2013]), namely: joy, sadness, anger, fear, trust, disgust, surprise, anticipation or neutral.

Firstly, tweets that mention stocks of companies from the IBOVESPA index, registered from the 1st to the 31st of March 2014, will be manually annotated by two people, who will indicate the emotions present in each tweet, including the possibility of presenting no emotion at all (neutral emotion). The same techniques presented in [Suttles and Ide 2013] will be used in our work. Hence, there will be four binary classifiers: joy vs sadness, anger vs fear, trust vs disgust, surprise vs anticipation. An additional neutral class will be considered for each of these four classifiers. If the probability of a tweet belonging to a class is smaller than a certain cutoff value, then the tweet is considered neutral. All probabilities generated by the four classifiers will be examined as candidate cutoff values, and the one that best fits the real (annotated) data will be used. The results from the four classifiers will then be combined in order to tag a tweet with up to 4 emotions. The attributes used in the training set will be based on unigrams and bigrams that appear more than five times in the corpus (following [Wang et al. 2012]).
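The combination of the four binary classifiers with the neutral cutoff can be sketched as follows. The sketch assumes that each classifier outputs the probability of the first emotion of its pair; the cutoff value of 0.7 and the function name are hypothetical, since the text states that the actual cutoff will be chosen by fitting the annotated data.

```python
# Opposing emotion pairs, one binary classifier per pair.
PAIRS = [
    ("joy", "sadness"),
    ("anger", "fear"),
    ("trust", "disgust"),
    ("surprise", "anticipation"),
]

def tag_emotions(first_class_probabilities, cutoff=0.7):
    """Return up to four emotions, or ["neutral"] if no classifier is confident."""
    tags = []
    for (first, second), p in zip(PAIRS, first_class_probabilities):
        if p >= cutoff:
            tags.append(first)
        elif 1.0 - p >= cutoff:
            tags.append(second)
        # Otherwise this pair is neutral for the tweet and adds no tag.
    return tags or ["neutral"]
```

For instance, probabilities of 0.9, 0.5, 0.2 and 0.6 yield the tags joy and disgust, because only the first and third classifiers are confident about one side of their pair.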

The attributes will be binary [Faceli et al. 2011], indicating the existence of a different n-gram in a given tweet. n-gram features were chosen because they presented the best results in comparison to other types of attributes [Wang et al. 2012, Suttles and Ide 2013]. Furthermore, punctuation and emoticons will be considered as part of unigrams and bigrams, as in [Suttles and Ide 2013].

4.2. Preprocessing

The processing steps before training the price classifier were inspired by both [Sprenger et al. 2013] and [Nizer and Nievola 2012]. They are: 1) Converting to uppercase and removing punctuation; 2) Tokenizing repeating elements: stock codes will be replaced by “[ticker]”, and every hyperlink, currency value and percentage value will be replaced by a token for its category; 3) Removing stop words: the stop words listed at LINGUATECA’s website3 will be removed; 4) Stemming: words will be stemmed using the PTStemmer [Oliveira]; and 5) Word counting.

The remaining words will have their frequency counted to apply the bag of words technique. The counter will consider all the tweets that mention a stock in a day. For example, if the word “comprar” (buy) appears in four different tweets mentioning the same stock in a day, then it will be counted four times.
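The preprocessing and counting steps can be sketched with the standard library only. Several details are assumptions made for illustration: the regular expressions (tickers as four letters plus one or two digits, currency values prefixed by R$), the tiny stop-word set standing in for the LINGUATECA list, and the omission of stemming, since PTStemmer is not used here.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set; the paper uses the LINGUATECA list.
STOP_WORDS = {"A", "O", "DE", "E", "EM", "PARA"}

URL = re.compile(r"https?://\S+")
PERCENT = re.compile(r"\d+(?:[.,]\d+)?\s*%")
CURRENCY = re.compile(r"R\$\s*\d+(?:[.,]\d+)?")
# Assumption: tickers are four letters plus one or two digits (e.g. BBAS3).
TICKER = re.compile(r"#?\b[A-Za-z]{4}\d{1,2}\b")
PLACEHOLDER = re.compile(r"\[(?:TICKER|LINK|CURRENCY|PERCENT)\]")

def preprocess(tweet):
    """Replace repeating elements by tokens, uppercase, drop punctuation and stop words."""
    text = URL.sub(" [LINK] ", tweet)
    text = PERCENT.sub(" [PERCENT] ", text)
    text = CURRENCY.sub(" [CURRENCY] ", text)
    text = TICKER.sub(" [TICKER] ", text).upper()
    tokens = []
    for raw in text.split():
        if PLACEHOLDER.fullmatch(raw):
            tokens.append(raw)
            continue
        word = re.sub(r"[^\w]", "", raw)  # strip punctuation
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

def count_terms(tweets_for_stock_and_day):
    """Bag of words over all tweets that mention one stock on one day."""
    counts = Counter()
    for tweet in tweets_for_stock_and_day:
        counts.update(preprocess(tweet))
    return counts

tweets = [
    "Vou comprar #bbas3 hoje!",
    "Comprar BBAS3 a R$ 18,75? http://x.co/a",
    "comprar bbas3 +2,5%",
    "hora de comprar Bbas3",
]
counts = count_terms(tweets)
```

The replacement tokens are substituted before punctuation is stripped, so that hyperlinks and decimal values are not broken apart first.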

4.3. Price Variation Classifier

Once the tweets are properly tagged, two “Price Change” attributes will be appended: one indicating the price variation after 1 day, and another indicating the price variation after 3 days. The “after 1 day” period will measure the stock price variation on the day after the tweet has been posted, obtaining the difference between the price at the end of the day and the price at the beginning of the next day. The “after 3 days” period is the difference between the price at the end of the 3 following working days and the price at the beginning of the next day. The use of a 3-day lag is supported by the work presented in [Gaskell et al. 2013], which indicates that the optimal period for prediction is 3 days. After the price variations for 1 and 3 days have been obtained, they will be classified as price increase, decrease or hold.
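Turning a price variation into one of the three classes can be sketched as below. The text does not say how large a change must be before it stops counting as "hold", so the relative band of 0.5% is a hypothetical parameter, as is the function name.

```python
def label_variation(start_price, end_price, hold_band=0.005):
    """Classify a relative price change as increase, decrease or hold."""
    change = (end_price - start_price) / start_price
    if change > hold_band:
        return "increase"
    if change < -hold_band:
        return "decrease"
    return "hold"  # change too small to count as a move (assumed band)
```

The same function would be applied twice per stock and day, once with the prices of the 1-day period and once with those of the 3-day period.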

³ http://www.linguateca.pt/.

At the end, attribute selection algorithms, such as Information Gain [Azhagusundari and Thanamani 2013], will be used to choose the terms to be taken as attributes. In this case, each attribute will represent the existence of one different term in the tweets that mention a specific stock in a given day. We expect to have no more than 4500 terms after stemming.
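As an illustration of attribute selection by information gain, the sketch below scores one binary attribute (term present or absent) against the class labels. It is the textbook definition, entropy of the labels minus the weighted entropy after splitting on the attribute; the function names are chosen here for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(present, labels):
    """Information gain of a binary attribute.

    present: list of booleans (does the term occur in the example?).
    labels:  class of each example, e.g. "increase", "decrease", "hold".
    """
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(present, labels) if x == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```

Terms would then be ranked by this score and only the highest-ranked ones kept as attributes.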

Several supervised classifiers will be evaluated, such as Naive Bayes and Support Vector Machines (SVM) [Faceli et al. 2011], considering the price changes after both 1 and 3 days. Different experiments will be conducted for each of these periods.

5. Expected Results

With the results obtained from the experiments, a different evaluation approach will be used for each of the hypotheses raised earlier. The price variation classifier will be used to validate H0 and, as a consequence, H1 as well. The classification algorithm will also be evaluated by separating the tweets into training and test sets, and applying cross-validation techniques to avoid over-fitting. In this case, the training and test sets will use tweets from different periods. Tweets collected from April/2014 to September/2014 will be used as the training set, and the ones collected from October/2014 to December/2014 will be used as the test set.
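The period-based split can be sketched as follows. The exact boundary days and the `(day, features, label)` layout of each example are assumptions; only the months come from the text.

```python
from datetime import date

# Assumption: periods run from the first to the last day of the stated months.
TRAIN_START, TRAIN_END = date(2014, 4, 1), date(2014, 9, 30)
TEST_START, TEST_END = date(2014, 10, 1), date(2014, 12, 31)


def temporal_split(examples):
    """Split (day, features, label) examples into training and test sets by date."""
    train = [e for e in examples if TRAIN_START <= e[0] <= TRAIN_END]
    test = [e for e in examples if TEST_START <= e[0] <= TEST_END]
    return train, test
```

Splitting by period rather than at random keeps every test tweet strictly later than every training tweet.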

The H2 hypothesis will be evaluated by comparing the f-measure [Hripcsak and Rothschild 2005] to identify a stock price fall (which is a buy signal) with the identification of a price rise (which suggests a sell). For the H3 hypothesis, the evaluation will be done by applying attribute selection techniques in order to evaluate the contribution of emotion features in comparison to other features. Furthermore, experiments will be conducted by trying the classifier using emotion features, non-emotion features, and both. The results will then be compared.
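For the H2 comparison, the f-measure of a single class can be computed as in the sketch below. It assumes the balanced form (the harmonic mean of precision and recall); the text does not state which weighting will be used.

```python
def f_measure(true_labels, predicted_labels, target):
    """Balanced f-measure (F1) of one class, e.g. target="decrease"."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling it once with `target="decrease"` and once with `target="increase"` gives the two values to be compared.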

Finally, an additional experiment will be conducted to validate H4, in which the attributes will be weighted by the number of readers. This will be done by multiplying their values by the number of followers of the users who posted the messages. As an example, if a user publishes a tweet with the word “comprar” (buy), and this user has 500 followers, then the attribute indicating how many times this word appeared will count as 500 instead of 1 for this experiment. The obtained values will be compared against the ordinary experiments.
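The follower weighting for H4 amounts to adding the author's follower count instead of 1 for each word occurrence. A minimal sketch, assuming each tweet is available as a `(tokens, followers)` pair:

```python
from collections import Counter


def follower_weighted_counts(tweets):
    """Count tokens, weighting each occurrence by the author's follower count.

    tweets: iterable of (tokens, followers) pairs (hypothetical layout).
    """
    counts = Counter()
    for tokens, followers in tweets:
        for token in tokens:
            counts[token] += followers
    return counts
```

With this layout, a tweet containing "COMPRAR" from a user with 500 followers contributes 500 to that word's count, matching the example in the text.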

6. Conclusion

We have presented a proposal for investigating new stock market prediction techniques, based on information extraction and emotions detected from tweets written in Portuguese. We have discussed the main previous works and the main steps of the ongoing research, its objectives and the techniques for evaluating the forthcoming results.

References

Abarbanell, J. S. and Bushee, B. J. (1997). Fundamental analysis, future earnings, and stock prices. Journal of Accounting Research, 35(1):1–24.

Azhagusundari, B. and Thanamani, A. S. (2013). Feature selection based on information gain. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN 2278-3075.

Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

Faceli, K., Lorena, A. C., Gama, J., and de Carvalho, A. C. P. L. F. (2011). Inteligência Artificial: Uma Abordagem de Aprendizado de Máquina. LTC - Livros Técnicos e Científicos.

Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25:383–417.

Gaskell, P., McGroarty, F., and Tiropanis, T. (2013). An investigation into correlations between financial sentiment and prices in financial markets. In Proceedings of the 5th Annual ACM Web Science Conference, pages 99–108. ACM.

Hripcsak, G. and Rothschild, A. S. (2005). Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298.

Lo, A. W., Mamaysky, H., and Wang, J. (2000). Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. Journal of Finance, 40:1705–1765.

Lopes, T. J. P., Hiratani, G. K. L., Barth, F. J., Jr., O. R., and Pinto, J. M. (2008). Mineração de opiniões aplicada à análise de investimentos. In Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web, pages 117–120. ACM.

Mao, H., Counts, S., and Bollen, J. (2011). Predicting financial markets: Comparing survey, news, twitter and search engine data. CoRR, abs/1112.1051.

Nizer, P. S. and Nievola, J. C. (2012). Predicting published news effect in the brazilian stock market. Expert Systems with Applications, 39(12):10674–10680.

Norcross, J. C., Guadagnoli, E., and Prochaska, J. O. (1984). Factor structure of the profile of mood states (poms): two partial replications. Journal of Clinical Psychology, 40(5):1270–1277.

Oliveira, P. PTStemmer: A Stemming toolkit for the Portuguese language. <URL:http://code.google.com/p/ptstemmer>.

Plutchik, R. and Kellerman, H. (1986). Emotion: theory, research and experience. Acad. Press.

Sprenger, T. O., Tumasjan, A., Sandner, P. G., and Welpe, I. M. (2013). Tweets and trades: The information content of stock microblogs. European Financial Management.

Suttles, J. and Ide, N. (2013). Distant supervision for emotion classification with discrete binary values. In Computational Linguistics and Intelligent Text Processing, pages 121–136. Springer.

Wang, W., Chen, L., Thirunarayan, K., and Sheth, A. P. (2012). Harnessing twitter “big data” for automatic emotion identification. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 587–592. IEEE.

Zhang, X., Fuehres, H., and Gloor, P. A. (2011). Predicting stock market indicators through twitter “i hope it is not as bad as i fear”. Procedia - Social and Behavioral Sciences, 26(0):55–62. The 2nd Collaborative Innovation Networks Conference - COINs2010.


Active Appearance Model Applied to Emotion Recognition from Partially Occluded Facial Expressions

Flavio Altinier Maximiano da Silva1, Helio Pedrini1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)
Av. Albert Einstein 1251, Campinas-SP, Brazil, 13083-852


Abstract. Algorithms capable of recognizing facial expressions and associating them with emotions are useful in many applications, and machine learning techniques have proven effective at this task, which becomes even harder when regions of the face are occluded. This research project focuses on studying and implementing a robust Active Appearance Model to reconstruct occluded regions of the image and then verifying its effectiveness with an emotion classifier trained on non-occluded images. The method will be evaluated experimentally on several public databases.

1. Context

Since the early days of computer vision research, automatic emotion recognition by computer has been a major challenge. Facial expressions play a fundamental role here, as they are one of the main signals of human emotion. An accurate and fast system capable of identifying emotions through the analysis of facial expressions would be useful in several domains.

As research in robotics advances, especially on humanoid robots, the need for an automatic interaction system becomes evident. Robots are increasingly present in everyday life, which makes it desirable for them to understand human emotions and moods. In this sense, facial expression analysis is an important factor for understanding between humans and machines.

Moreover, an automatic emotion recognizer can be useful to those who have difficulty recognizing emotions themselves. In [Gay et al. 2013], researchers are developing a mobile application to help autistic children understand their emotions through the analysis of facial expressions.

This research project focuses on one challenge: correctly recognizing emotions when the image of the facial expression is partially occluded. Common situations are an individual wearing sunglasses or shadows falling on the face. In such cases, an Active Appearance Model (AAM) robust to occlusions [Gross et al. 2006] could be trained to reconstruct the occluded region of the face.

Once the face has been reconstructed, an approach similar to the one developed by [Bartlett et al. 2004] will be investigated. In that work, the images were enhanced with Gabor filters [Fogel and Sagi 1989] and then used to train and test a classifier. In this project, other methods for outlining the lines of the face will be used.

A similar path was taken in [Jiang and Jia 2011], with satisfactory results. This project differs in how features are extracted and in targeting a different and larger image database.

2. Objectives and Contributions

The main objective of this project is to develop systems capable of classifying basic emotions from static grayscale images of facial expressions in which regions of interest, such as the eyes and mouth, are occluded. The occluded region will be reconstructed with a robust Active Appearance Model (AAM), and different classification methods will then be tested.

By the end of the project, we expect to compare the classifier's results on partially occluded images with its results on images without any occlusion, and thus determine whether using the AAM is worthwhile.

The main expected contributions include advancing knowledge in the area and proposing a robust technique to reconstruct invisible regions of the face, as well as evaluating the method on several databases. This study may be useful for developing robust Affective Computing systems [Scherer et al. 2010] and robots capable of understanding human moods and emotions.

3. Methodology

This project investigates emotion recognition through the computational analysis of partially occluded facial expressions. Regions of interest of the face, such as the eyes, nose and mouth, will be occluded artificially; the main goal is to build a robust Active Appearance Model (AAM) capable of reconstructing the occluded region so that the image can be used by a multiclass classifier. After reconstruction, a texture or edge descriptor will be applied to the image, which will then be labeled by a classifier trained on images without any occlusion.

Evaluating the proposed method requires large databases of facial expression images labeled with the corresponding emotion. Initially, the Cohn-Kanade Database [Kanade et al. 2000, Lucey et al. 2010] and the Japanese Female Facial Expression Database (JAFFE) [Dailey et al. 2010, Lyons et al. 1997] will be used. Three example images from the Cohn-Kanade database are shown in Figure 1.


Figure 1. Images taken from the Cohn-Kanade database. The labeled emotions are, from left to right, surprise, sadness and happiness.

Both databases provide a catalog with the emotion corresponding to each image. The emotion labeling is based on the Facial Action Coding System (FACS) parameterization [Ekman and Friesen 1977, Ekman et al. 2002, Cohn et al. 2005]. This standard classifies expressions by the muscle groups involved in producing the individual's facial appearance. This work considers the images corresponding to the expressions of happiness, surprise, sadness, anger, disgust and fear. These six emotions are considered to have clear and distinct facial signals [Ekman and Davidson 1994], and both databases contain a relatively large number of images for these basic emotions.

All face images were cropped manually. Based on the results of other works, such as [Giorgana and Ploeger 2012], they were also resized to 96 × 96 pixels; this keeps the facial features visible while reducing the computational cost per image. In addition, all images must be in grayscale to ease the later application of filters.

Once the images are prepared, a filter is applied to enhance the features that the learning algorithm should identify as characteristic of each expression and emotion. At this stage of the project, the LBP (Local Binary Patterns) [Ojala et al. 2000] and HOG (Histogram of Oriented Gradients) [Ludwig Junior et al. 2009] descriptors are being used.
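To make the LBP descriptor concrete, the sketch below computes the basic 3 × 3 variant (eight neighbours at a distance of one pixel) on a grayscale image stored as a list of rows, and summarizes the codes in a normalized 256-bin histogram. This is a simplified illustration and an assumption about the variant: the circular, rotation-invariant formulation of [Ojala et al. 2000] and the project's actual implementation may differ.

```python
def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel.

    image: list of rows of gray levels. Each neighbour that is greater than
    or equal to the center pixel sets one bit of the code.
    """
    # Offsets of the 8 neighbours, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for y in range(1, len(image) - 1):
        row = []
        for x in range(1, len(image[0]) - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if image[y + dy][x + dx] >= image[y][x]:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes, usable as a feature vector."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist]
```

The histogram, possibly computed per image block and concatenated, is what would be passed to the classifier.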

The learning algorithm chosen as the emotion classifier is the SVM (Support Vector Machine). The LIBSVM library [Chang and Lin 2011] is being used in the implementation; its functions are comprehensive and, according to the literature, the SVM is one of the most effective algorithms for this purpose.

Some points of interest in the images of a test set will then be occluded artificially; an AAM robust to occlusions will be developed and applied to each image, and it is expected to reconstruct the occluded part. Classification can then proceed: the same filter used when training the classifier is applied to the image, and the result is used to test the classifier. A diagram of the proposed approach is shown in Figure 2.

To measure the performance of the algorithms and compare them impartially, the image database will be divided into training, validation and test partitions. Cross-validation will be used when partitioning the data, both for the AAM and for the final emotion classifier. Under this scheme, about 60% of the elements of the database are used as the training group, 20% as the validation group (to tune the classifier's parameters) and the remaining 20% as the test group.
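The 60/20/20 partition can be sketched as below. The random shuffling with a fixed seed is an assumption; the text does not say how elements are assigned to each group.

```python
import random


def split_dataset(items, train=0.6, validation=0.2, seed=0):
    """Shuffle items and split them into training, validation and test groups."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

The validation group is used only to tune parameters, so the accuracies on the test group are measured on images the classifier has never seen.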


[Figure 2: diagram with the stages "Reconstruction of the occluded part by the AAM", "Application of the feature descriptor" and "Classification".]

Figure 2. Diagram of the proposed approach. The partially occluded image is reconstructed by the AAM and the result, after the feature descriptor is applied, is classified by the SVM.

We expect to compare the classifier's results on partially occluded images reconstructed by the AAM with the classification results on images without any occlusion. We also intend to run the classifier on the partially occluded images without the AAM and compare those results with the ones obtained using the AAM; this will make it possible to verify whether the technique is valid for this purpose.

4. Preliminary Results

Although the project is still under development, some preliminary results are already interesting. Intermediate goals of the project have been reached and are discussed in this section.

First, the effectiveness of the feature descriptors and classifiers was evaluated to check that the theory holds in practice. At this stage, the LBP descriptor with a radius of 1 pixel and the HOG descriptor were tested. The accuracies achieved are shown in Table 1. The results are similar to those expected from theory, and the HOG descriptor shows a considerable advantage.

Table 1. Accuracy of the classifier with the LBP and HOG descriptors on the test groups of the Cohn-Kanade (59 elements) and JAFFE (35 elements) databases.

              LBP      HOG
Cohn-Kanade   71.2%    83.1%
JAFFE         80.0%    88.6%

In addition, another hypothesis that seemed relevant during the tests was to use classifiers trained on one database to classify the other. The Cohn-Kanade database consists predominantly of photos of people of Western origin; JAFFE, on the other hand, of people of Eastern origin. The aim was to observe whether cultural characteristics could change the way the subjects in the databases express their emotions.

This cross-database classification had quite low accuracy; we concluded that the differences in detail between the expressions of each culture are enough to confuse the classifier. This hypothesis is reinforced by the fact that a multicultural classifier, trained with images from both databases, had quite high accuracy, higher even than that of each database classified on itself, shown in Table 1. The accuracies of each cross-database classification and of the multicultural classifier are shown in Table 2. The descriptor used was HOG, and the numbers in parentheses indicate the total number of elements in each test group.

Table 2. Accuracy of the classifiers trained with elements of each database. The descriptor used was HOG, and the totals of elements in the test groups are given in parentheses. The test group formed by the union of the two databases has 34 elements from each.

                                     Test group
Classifier trained on    Cohn-Kanade    JAFFE          Cohn-Kanade + JAFFE
Cohn-Kanade              83.1% (59)     41.3% (183)    -
JAFFE                    45.9% (309)    88.5% (35)     -
Cohn-Kanade + JAFFE      89.8% (59)     94.2% (35)     86.7% (68)

At the current stage, the AAM is being investigated in more detail and preliminary tests to understand its applications are being carried out. It is interesting to note that even a simple AAM, trained only with images of expressions of happiness, is already able to reconstruct parts of the image so as to make it more understandable to the classifier. In the future, we expect to use the AAM to reconstruct images of any emotion effectively.

5. Conclusions

The results achieved so far are quite promising: they are similar to those reported in the literature and motivate further investigation of the AAM.

The HOG filter proved remarkably effective as a descriptor of facial features for this application. It is quite simple to use and produced results similar to those of more complex filters used in the literature, such as the Gabor filter. The performance of the multicultural classifier was also interesting: its accuracy on test elements from both cultures studied reinforces the idea that basic facial expressions should indeed be universal.

Future investigations include studying the impact on the classifier of occluding important regions of the face and analyzing the effectiveness of using the AAM to reconstruct those regions.

6. Acknowledgments

The authors are grateful for the financial support of the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant number 2014/04020-9. The opinions, hypotheses, conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of FAPESP.

References

Bartlett, M. S., Littlewort, G., Lainscsek, C., Fasel, I., and Movellan, J. (2004). Machine Learning Methods for Fully Automatic Recognition of Facial Expressions and Facial Actions. In IEEE International Conference on Systems, Man and Cybernetics, volume 1, pages 592–597.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohn, J., Ambadar, Z., and Ekman, P. (2005). Observer-Based Measurement of Facial Expression with the Facial Action Coding System. In Coan, J. and Allen, J., editors, The Handbook of Emotion Elicitation and Assessment. Oxford University Press Series in Affective Science.

Dailey, M. N., Joyce, C., Lyons, M. J., Kamachi, M., Ishi, H., Gyoba, J., and Cottrell, G. W. (2010). Evidence and a Computational Explanation of Cultural Differences in Facial Expression Recognition. Emotion, 10(6):874–893.

Ekman, P. and Friesen, W. (1977). Manual for the Facial Action Coding System. Consulting Psychologists Press.

Ekman, P., Friesen, W., and Hager, J. (2002). Facial Action Coding System Investigator’s Guide - A Human Face. Salt Lake City, UT, USA.

Ekman, P. E. and Davidson, R. J. (1994). The Nature of Emotion: Fundamental Questions. Oxford University Press.

Fogel, I. and Sagi, D. (1989). Gabor Filters as Texture Discriminator. Biological Cybernetics, 61(2):103–113.

Gay, V., Leijdekkers, P., Agcanas, J., Wong, F., and Wu, Q. (2013). CaptureMyEmotion: Helping Autistic Children Understand their Emotions Using Facial Expression Recognition and Mobile Technologies. Studies in Health Technology and Informatics, 189:71–76.

Giorgana, G. and Ploeger, P. (2012). Facial Expression Recognition for Domestic Service Robots. Lecture Notes in Computer Science, 7416:353–364.

Gross, R., Matthews, I., and Baker, S. (2006). Active Appearance Models with Occlusion. Image and Vision Computing, 24(6):593–604.

Jiang, B. and Jia, K.-b. (2011). Research of Robust Facial Expression Recognition under Facial Occlusion Condition. In Active Media Technology, pages 92–100. Springer.

Kanade, T., Cohn, J., and Tian, Y. (2000). Comprehensive Database for Facial Expression Analysis. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 46–53, Grenoble, France.

Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., and Matthews, I. (2010). The Extended Cohn-Kanade Dataset (CK+): A Complete Facial Expression Dataset for Action Unit and Emotion-Specified Expression. In Third IEEE Workshop on Computer Vision and Pattern Recognition for Human Communicative Behavior Analysis.

Ludwig Junior, O., David, Goncalves Delgado, U. V., and Nunes (2009). Trainable Classifier-Fusion Schemes: An Application to Pedestrian Detection. In Intelligent Transportation Systems, pages 1–6, St. Louis, MO, USA.

Lyons, M. J., Kamachi, M., and Gyoba, J. (1997). Japanese Female Facial Expressions (JAFFE).

Ojala, T., Pietikäinen, M., and Mäenpää, T. (2000). Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns. In European Conference on Computer Vision, pages 404–420. Springer.

Scherer, K., Bänziger, T., and Roesch, E. (2010). A Blueprint for Affective Computing: A Sourcebook. Oxford University Press.


Multiple Parenting Phylogeny

Alberto Oliveira1, Zanoni Dias1, Siome Goldenstein1, Anderson Rocha1

1Instituto de Computacao – Universidade Estadual de Campinas (UNICAMP)

[email protected], {anderson, zanoni, siome}@ic.unicamp.br

Abstract. Multiple Parenting Phylogeny is an extension of image phylogeny that deals with cases in which an image is created by combining content from multiple source images. This work proposes a method that combines an image phylogeny forest approach with object detection, aiming to identify the various phylogeny trees in a set of images and to find the multiple inheritance relationships that exist among them.

1. Introduction

Once a picture is shared on the internet, it may be downloaded, modified and re-shared as a new image, which is then prone to the same fate as the original. This results in many images with the same semantic content, differing only by small transformations, cohabiting the internet; such images are called near-duplicates. The detection and recovery of near-duplicates became a compelling problem, with applications such as spam filtering and content retrieval. Recently, this problem has been approached from a new viewpoint: finding how images in a set of near-duplicates generated each other, recovering the original picture used as the starting point to create the whole set. This problem was dubbed Phylogeny by [Dias et al. 2010, Dias et al. 2012], relating it to the Biology concept of how organisms evolve. A document can be viewed as an organism, which evolves as transformations are applied to it, while keeping its original semantic content as inheritance. In this context, images are of special interest, due to their high amount of semantic information and the ease with which they are obtained and shared. With Image Phylogeny, [Dias et al. 2010, Dias et al. 2012] intend to model the relationships between images in a set of near-duplicates as inheritance of content (parentage), resulting in a directed graph structure to represent the set.

The image phylogeny formulation by Dias et al. considers that an image inherits content from at most one parent, one of its near-duplicates, hence the resulting graph structure used to represent the near-duplicate set is a Phylogeny Tree. This model is sufficient when dealing only with near-duplicates, but it does not always hold in more general scenarios. An image can be a composition, created by combining parts of multiple source images. In this case, the composition inherits content from all its different source images. In this work, an extension to the image phylogeny formulation, named Multiple Parenting Phylogeny, is proposed to deal with this scenario.


In practical scenarios, discovering multiple parenting relationships can be applied to content tracking, forensics, copyright enforcement and forgery detection. As an extension to image phylogeny, multiple parenting phylogeny presents many new challenges, as it aims to find relationships not only between images of related semantic content, but also between those of seemingly unrelated content. This work presents a multiple parenting phylogeny framework, which simultaneously reconstructs the different phylogenies and recovers the multiple inheritance relationships existing in a set of images.

2. Related work

The pioneers in trying to model relationships among near-duplicate images were [Kennedy and Chang 2008]. The authors aimed to find a visual structure to display these relationships by analyzing the directionality of transformations between images to establish their generational order. Using this information among all pairs of images, redundancy elimination rules are applied to reconstruct the graph structure representing the set. Also trying to find this structure, the method of [De Rosa et al. 2010] employs a measure based on transformation estimation and noise comparison in pairs of images to represent their relationships, once again applying rules to reconstruct the final structure. The first authors to propose that the visual structure representing a set of near-duplicate images should be a tree were [Dias et al. 2010, Dias et al. 2012]. In their work, they propose an approach that employs both a dissimilarity measure based on transformation estimation and content comparison in pairs of images, and an automatic reconstruction algorithm that ensures the graph obtained is a tree. None of these approaches deals with scenarios of shared content among images that are not near-duplicates, as multiple parenting phylogeny does.

Phylogeny Forests were proposed by [Dias et al. 2013] as another extension of the original image phylogeny formulation. They deal with the scenario of multiple images of the same scene, not all of which belong to the same near-duplicate group, which results in a forest of phylogeny trees. The Automatic Oriented Kruskal algorithm proposed for the reconstruction of the forest uses an adaptive threshold derived from the dissimilarity measure used to quantify the relationship between the images in the set. This algorithm is used in one of the steps of our method.

3. Methodology

This work proposes a three-step approach to tackle multiple parenting phylogeny: (1) separation of near-duplicate groups; (2) classification of groups; (3) identification of the images inside the source image groups used to create the composition. Our main case study is splicing-type compositions, in which there is a pair of source images and a part of one is cut and pasted into the other, creating the composition. The source image from which the object is cut is called the alien, while the source image into which the object is pasted is called the host. The pipeline of the method is depicted in Figure 1.

3.1. Separation of near-duplicate groups

[Figure 1: pipeline diagram. Labels: uncategorized set of images; 1. find the groups of near duplicates; forest of phylogeny trees; 2. find composition relation between trees; trees with composition relationship; 3. multiple parenting edges localization; composition and nodes used to create it; graph reconstruction; reconstructed graph structure.]

Figure 1. Pipeline of the proposed multiple parenting phylogeny approach.

Image phylogeny as presented by [Dias et al. 2010, Dias et al. 2012] comprises two steps: building the dissimilarity matrix and reconstructing the tree. The root of the phylogeny tree is expected to be the original (or closest to original) image which spawned the set of near-duplicates. The dissimilarity measure $d$ between a pair of images $I_A$ and $I_B$ describes which of the two is more likely to be the parent of the other in the tree. Given a predefined family $\mathcal{T}$ of image transformations and $T_{\vec{\beta}}$, a subset of transformations, parameterized by $\vec{\beta}$, belonging to the family $\mathcal{T}$, the dissimilarity from image $I_A$ to image $I_B$ is defined as

$$d(I_A, I_B) = \min_{T_{\vec{\beta}}} \left| I_B - T_{\vec{\beta}}(I_A) \right|_{\text{pointwise comparison } L}, \tag{1}$$

considering all possible parameter values $\vec{\beta}$ of $\mathcal{T}$. Several transformation estimators are applied to find those values, which depend on the family $\mathcal{T}$. The family $\mathcal{T}$ used by [Dias et al. 2012] comprises resampling, cropping and affine warping as geometric transformations; brightness, contrast and gamma correction as color transformations; and compression. $T_{\vec{\beta}}(I_A)$ is the best approximation of $I_B$ that can be obtained from $I_A$, considering the family $\mathcal{T}$ of transformations, and $d(I_A, I_B)$ is obtained by comparing $T_{\vec{\beta}}(I_A)$ with $I_B$. Minimum Squared Error was the image comparison metric employed by [Dias et al. 2010, Dias et al. 2012], but any other metric can be used.
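The role of Equation (1) can be illustrated with a toy sketch. It replaces the transformation estimators with a brute-force search over a small grid of contrast (`gain`) and brightness (`bias`) values, and uses the mean squared error as the pointwise comparison; the function names, the grid and the restriction to color transformations are assumptions made for illustration, not the method of Dias et al.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def adjust(image, gain, bias):
    """Toy color transformation: contrast (gain) and brightness (bias), clipped to [0, 255]."""
    return [[min(255.0, max(0.0, gain * p + bias)) for p in row] for row in image]


def dissimilarity(image_a, image_b, gains, biases):
    """Approximate d(I_A, I_B): the lowest error over the parameter grid."""
    return min(
        mse(adjust(image_a, gain, bias), image_b)
        for gain in gains
        for bias in biases
    )
```

The measure is not symmetric. If image B was produced from A by brightening with clipping, A can be mapped exactly onto B, but B cannot be mapped back onto A because clipping destroyed information, so d(A, B) is lower than d(B, A) and A is the more likely parent.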

The dissimilarity matrix is obtained by computing the dissimilarities for every pair of images. Considering that this matrix represents a complete, directed graph with weights on its edges, [Dias et al. 2010, Dias et al. 2012] introduced a modification of Kruskal's minimum spanning tree algorithm for directed graphs to reconstruct the phylogeny tree of the set. This approach considers that all images in the set are near-duplicates. As detailed in Section 2, Dias et al. later introduced an extension to their image phylogeny algorithm, modifying it to deal with cases in which images might depict the same scene but have no relation whatsoever. In this work, the approach of [Dias et al. 2013] to reconstruct the forest of phylogeny trees is used to simultaneously separate the groups of near-duplicates existing in the set of images and reconstruct their phylogeny.
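The reconstruction step can be sketched as a Kruskal-style procedure over directed edges: edges are visited in order of increasing dissimilarity, and an edge from i to j is accepted only if j has no parent yet and i does not already descend from j. This is a simplified reading of the oriented Kruskal idea, not the published algorithm. In particular, the optional fixed `max_cost` cut-off used to obtain a forest is an assumption standing in for the adaptive threshold of [Dias et al. 2013].

```python
def oriented_kruskal(dissimilarity, max_cost=None):
    """Reconstruct a phylogeny tree (or forest) from a dissimilarity matrix.

    dissimilarity[i][j] is the cost of image i being the parent of image j.
    Returns a parent list; a root is its own parent.
    """
    n = len(dissimilarity)
    parent = list(range(n))

    def root(node):
        while parent[node] != node:
            node = parent[node]
        return node

    edges = sorted(
        (dissimilarity[i][j], i, j) for i in range(n) for j in range(n) if i != j
    )
    added = 0
    for cost, i, j in edges:
        if added == n - 1 or (max_cost is not None and cost > max_cost):
            break
        # j may receive a parent only if it is still a root and i is in another tree.
        if parent[j] == j and root(i) != j:
            parent[j] = i
            added += 1
    return parent
```

With `max_cost` left as `None` the result is a single tree; with a cut-off, images whose cheapest incoming edge is too expensive remain roots, which separates the near-duplicate groups.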

3.2. Classification of reconstructed trees

Once groups of near-duplicates are separated, the shared content relationships amongthem need to be established. Using those relationships and the already calculated dissim-ilarities, each group is classified by the contents it represent (alien, host or composition).

Finding shared content between a pair of images is done through clustering of local keypoints. First, the keypoints are detected and described in both images using algorithms such as SIFT [Lowe 2004], and subsequently matched. Then, the J-Linkage algorithm [Toldo and Fusiello 2008] is applied to the matches, clustering them using geometric transformations derived from random samples of matched keypoints. If sufficiently large clusters are found, the pair of images is assumed to have shared content. Classification starts by randomly selecting nodes of each tree as representatives. Among these nodes, all possible pairs are tested for shared content, and the occurrences are counted. This procedure is repeated a set number of times, and once it is over, the tree with the most occurrences of shared content is selected as the composition tree. The remaining trees are classified using the dissimilarities of their roots to the root of the newly found composition tree. The one with the lowest value is the host tree, while the one with the highest is the alien tree.
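The final decision rule of this classification can be sketched as follows. The sketch assumes the shared-content counts and root dissimilarities have already been computed; the function name and the dictionary layout (`root_dissim[t][c]` as the dissimilarity from the root of tree `t` to the root of tree `c`) are hypothetical and chosen only for illustration.

```python
def classify_trees(shared_counts, root_dissim):
    """Sketch of the tree classification rule.

    shared_counts: {tree_id: number of shared-content occurrences}
    root_dissim:   {tree_id: {other_tree_id: dissimilarity between roots}}
    """
    # The tree involved in the most shared-content pairs is the composition.
    composition = max(shared_counts, key=shared_counts.get)
    others = [t for t in shared_counts if t != composition]
    # Rank the remaining trees by root dissimilarity to the composition root.
    others.sort(key=lambda t: root_dissim[t][composition])
    return {"composition": composition, "host": others[0], "alien": others[-1]}
```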

3.3. Identification of parent source images

With each of the trees classified, we need to find the alien and host parents: nodes inside the respective trees used to create the root of the composition tree. In splicing-type compositions, the object extracted from the alien image is usually small in comparison to the host it is pasted on. Consequently, most of the original scene present in the host parent is kept intact in the composition root. It is then possible to use the already computed dissimilarity measure between the composition root and all the nodes in the host tree to identify which of its nodes is the host parent. We select as host parent the host node that has the lowest dissimilarity with the composition root.

The alien parent, on the other hand, has only a small region shared with the composition root. A local dissimilarity approach is proposed, which first identifies the shared region and then applies an image comparison measure only inside this region. Taking an alien image A_k and the composition root R_C, the shared region between this pair is found using the J-Linkage clustering approach described in section 3.2. A single cluster with more than a threshold number of matches and the lowest mean distance of matches is selected to represent the shared region. The matches inside the cluster are used to align A_k to R_C, and the convex hull of the matched keypoints is used to generate a comparison mask in both images. As those masks often differ in size and shape among alien images, a probabilistic approach is used to generate a single shared-region comparison mask, to be used for all aliens. To compute the local dissimilarity from A_k to R_C, the transformations applied to the cut region in the former when pasted into the latter are estimated and applied to A_k. Finally, the inverse of mutual information, considering only the pixels inside the shared-region comparison mask in both images, is used as the local dissimilarity value. The alien image with the lowest dissimilarity value is selected as the alien parent.
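As a rough illustration of the last step, the sketch below computes a masked local dissimilarity as the reciprocal of the mutual information between two already aligned images. Several things here are assumptions made for the example: images are flat lists of quantized intensities, the mask is a list of 0/1 flags, "inverse" is read as 1/MI, and zero mutual information maps to infinity.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long value lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def local_dissimilarity(img_a, img_b, mask):
    """Reciprocal of the mutual information restricted to masked pixels."""
    xs = [a for a, m in zip(img_a, mask) if m]
    ys = [b for b, m in zip(img_b, mask) if m]
    mi = mutual_information(xs, ys)
    return float("inf") if mi == 0 else 1.0 / mi


def pick_alien_parent(aliens, comp_root, mask):
    """Index of the alien image with the lowest local dissimilarity."""
    scores = [local_dissimilarity(a, comp_root, mask) for a in aliens]
    return scores.index(min(scores))
```

In practice the alignment and the transformation estimation described above would run before this comparison; they are omitted here.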

4. Validation

The protocol used for validation of the proposed method is presented in this section.

4.1. Dataset and test cases

A total of 600 splicing test cases were used, equally divided between two types of splicing operation: direct pasting and Poisson blending. In direct pasting, the object cut from the alien is pasted with no modifications into the host. In Poisson blending, the object is blended into the host using the gradient matching approach of [Perez et al. 2003]. Each test case comprises a forest with three phylogeny trees: alien, host, and composition. To generate the trees, the protocol defined by [Dias et al. 2012] is adopted. Each tree has 25 nodes, and the composition tree root is created using a random pair of alien and host nodes.

4.2. Metrics

The forest algorithm's effectiveness in separating near-duplicate groups is evaluated using the metrics defined by [Dias et al. 2010, Dias et al. 2012]: root, edges, leaves, and ancestry. A new subset metric, which measures group separation by computing the ratio of nodes with parents that belong to the same near-duplicate group (alien, host, or composition), was also used. Multiple parenting results comprise both classification of groups and composition/parent identification. The metrics composition root (CR), alien parent (AP), and host parent (HP) measure whether such nodes were correctly found. Additionally, the metrics composition node (CN), alien node (AN), and host node (HN) evaluate whether the composition root and the alien and host parents found are, respectively, composition, alien, and host nodes. This second set of metrics indirectly measures classification of groups.
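The subset metric can be made concrete with a small sketch. It assumes a reconstructed forest stored as a parent array (a node that is its own parent is a root) and a ground-truth group label per node; excluding roots from the ratio is an assumption of this example, since the text does not say how parentless nodes are counted.

```python
def subset_metric(parent, group):
    """Ratio of non-root nodes whose parent is in the same ground-truth group."""
    pairs = [(v, p) for v, p in enumerate(parent) if p != v]
    same = sum(group[v] == group[p] for v, p in pairs)
    return same / len(pairs)
```

For example, with `parent = [0, 0, 1, 3, 3, 0]` and groups `a, a, a, h, h, h`, three of the four non-root nodes have a parent in their own group, giving 0.75.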

5. Results

In this section, the experiments and results for the proposed multiple parenting phylogeny method are discussed.

5.1. Group separation results

Two variations of the phylogeny forest reconstruction algorithm are used in the group separation step. K3T is the same directed Kruskal algorithm proposed by [Dias et al. 2010, Dias et al. 2012], but modified to find exactly three trees in the set of images. This variation is considered assisted, as it depends on the number of trees in the set of images, which is not normally known. K3T is treated as an upper bound on the quality of the approach in separating the groups of near-duplicates. The other variation is Automatic Oriented Kruskal (AOK) [Dias et al. 2013], detailed in section 3.1. Table 1 displays the results for both approaches, by type of splicing operation.

Table 1. Results for near-duplicate group separation with the forest algorithm.

                    AOK                                      K3T
Splicing type       root   edges  leaves ancestry subset     root   edges  leaves ancestry subset
Direct pasting      81.6%  74.3%  81.4%  65.5%    99.9%      83.9%  74.4%  81.4%  66.6%    99.9%
Poisson blending    78.7%  74.5%  81.3%  63.2%    99.7%      82.2%  74.6%  81.4%  66.1%    99.9%

Results for K3T and AOK were close, even though AOK has no information about the number of trees in the forest. It was also observed that, in most cases, AOK correctly indicated that there were three trees in the set of images. This fact, coupled with the near-perfect group separation both algorithms presented (as shown by the subset results), demonstrates that AOK is a very effective choice for group separation in this scenario. The subset results, in particular, are very important, as bad separation of near-duplicate groups can lead to poor results in the subsequent steps of the multiple parenting approach.

5.2. Multiple parenting results

Identification of the alien and host parents and of the composition root, which together comprise the multiple parenting results, is shown in table 2. The results displayed include only forests reconstructed with the AOK algorithm. CN's results, in comparison to CR's, show that although the classification of the composition group might be correct, its root is not always the correct one. To improve these results, a forest algorithm with better root identification would be necessary. Classification of host and alien groups was also very good, as shown by AN's and HN's results. The ratios of HP being slightly higher than those of CR show that finding this node is still somewhat robust to a wrong composition root. The alien parent, on the other hand, relies heavily not only on the composition root but also on the local dissimilarity measure. Yet, the proposed method still had decent results, especially considering the CR ratio. Poisson blending, in many cases, introduces heavy changes to the composed object, making the localization and comparison of shared content harder and resulting in lower AP values for this splicing method.

Table 2. Multiple parenting results for AOK.

Splicing type       CR     CN     HP     HN     AP     AN
Direct pasting      74.0%  92.3%  76.3%  93.7%  66.3%  98.3%
Poisson blending    64.7%  85.0%  72.7%  88.0%  41.3%  98.0%

6. Conclusion

Multiple parenting phylogeny is an image phylogeny extension that goes beyond near-duplicates, aiming to establish relationships among images with varied amounts of shared semantic content. This work presents an approach to multiple parenting phylogeny that starts by separating the images into groups of near-duplicates and reconstructing their phylogeny trees. Once the trees are separated, object detection techniques are used to identify the relationships between different groups.

The method was validated using splicing-type test cases, with near-duplicates of alien, host, and composition images. It had close to perfect group separation using a forest algorithm, over 85% classification for the three groups studied, and good accuracy in identifying the alien and host parents. Improving the identification of the composition root and of the host and alien parents used to create it, as well as working with scenarios containing real images, will be the focus of our future work.

References

De Rosa, A., Uccheddu, F., Costanzo, A., Piva, A., and Barni, M. (2010). Exploring image dependencies: a new challenge in image forensics. In Media Forensics and Security II, pages X–X12.

Dias, Z., Rocha, A., and Goldenstein, S. (2010). First steps toward image phylogeny. In IEEE International Workshop on Information Forensics and Security, pages 1–6.

Dias, Z., Rocha, A., and Goldenstein, S. (2012). Image phylogeny by minimal spanning trees. IEEE Transactions on Information Forensics and Security, 7(2):774–788.

Dias, Z., Rocha, A., and Goldenstein, S. (2013). Toward image phylogeny forests: Automatically recovering semantically similar image relationships. Elsevier Forensic Science International, 231(1-3):178–189.

Kennedy, L. and Chang, S.-F. (2008). Internet image archaeology: Automatically tracing the manipulation history of photographs on the web. In ACM International Conference on Multimedia, pages 349–358.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Springer International Journal of Computer Vision, 60(2):91–110.

Perez, P., Gangnet, M., and Blake, A. (2003). Poisson image editing. In ACM Special Interest Group on GRAPHics and Interactive Techniques, pages 313–318.

Toldo, R. and Fusiello, A. (2008). Robust multiple structures estimation with J-Linkage. In Springer European Conference on Computer Vision, pages 537–547.


The dominating clique partition problem

Henrique Vieira e Sousa1, C. N. Campos1

1 Instituto de Computação, Universidade Estadual de Campinas (UNICAMP) – São Paulo, SP – Brazil


Abstract. Given a graph G, a dominating clique partition of G is a partition of its vertex set such that each of its parts is a dominating clique. The dominating clique partition problem consists of determining max{|P| : P is a dominating clique partition of G}. In this work, the problem is presented and approached for some classes of graphs. In particular, bipartite graphs, Cartesian products of graphs, and direct products of complete graphs are considered.


1. Introduction

Let G be a graph with vertex set V(G) and edge set E(G). Let S ⊆ V(G). The set S is a clique if its vertices are pairwise adjacent. On the other hand, the set S is independent if its vertices are pairwise nonadjacent.

The (open) neighborhood of a vertex v ∈ V(G) is defined as N_G(v) := {w ∈ V(G) : {v, w} ∈ E(G)}. The closed neighborhood of a vertex v ∈ V(G) is defined as N_G[v] := N_G(v) ∪ {v}. If vertex x belongs to the neighborhood of y, we say that x is a neighbor of y. Extending the concept of neighborhood to a set S ⊆ V(G), the (open) neighborhood of S is defined as N_G(S) := ⋃_{v∈S} N_G(v). Analogously, the closed neighborhood of S is defined as N_G[S] := N_G(S) ∪ S.

A set S ⊆ V(G) is called a dominating set of the graph G if, for every vertex v ∈ V(G), either v is an element of S or v is adjacent to an element of S. That is, N_G[S] = V(G). The domination number of G, γ(G), is the cardinality of a smallest dominating set of G. The minimum dominating set problem consists of determining γ(G) for an arbitrary graph G. This problem was shown to be NP-hard in 1979 by M. Garey and D. Johnson [Garey and Johnson 1979]. The concept of domination in graphs was formally defined in 1958 by C. Berge [Berge 1962], under the name coefficient of external stability. The current terminology was coined by Ø. Ore [Ore 1962].
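The definitions of closed neighborhood, dominating set, and domination number can be checked on small graphs with a brute-force sketch. The adjacency-dictionary representation and all function names are assumptions made for this illustration; the exhaustive search is exponential and only meant for tiny examples.

```python
from itertools import combinations


def closed_neighborhood(adj, s):
    """N_G[S]: the set S together with every neighbor of a vertex in S."""
    out = set(s)
    for v in s:
        out |= adj[v]
    return out


def is_dominating(adj, s):
    """S is dominating when N_G[S] = V(G)."""
    return closed_neighborhood(adj, s) == set(adj)


def domination_number(adj):
    """Smallest size of a dominating set, found by exhaustive search."""
    vertices = list(adj)
    for k in range(1, len(vertices) + 1):
        if any(is_dominating(adj, c) for c in combinations(vertices, k)):
            return k


# The cycle on five vertices, as an adjacency dictionary.
c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
```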

Many real applications can be modeled as dominating set problems. Among these, we can cite problems of communication in computer networks and of bus routing. Several applications have led to the definition of variants of the original minimum dominating set problem. A widespread class of variants consists of determining the cardinality of a minimum dominating set in an arbitrary graph subject to a set of extra constraints. One variant, especially important for this work, adds the constraint that the dominating set be a clique (a dominating clique), defining the dominating clique problem. The value γ_cl(G) is the cardinality of a smallest dominating clique in G. It is worth noting that γ_cl(G) is not defined for all graphs, since there are graphs that have no dominating clique. One such graph is the cycle on five vertices.

In 1977, E. J. Cockayne and S. T. Hedetniemi [Cockayne and Hedetniemi 1977] introduced a new variant of the dominating set problem. A domatic partition is a partition of V(G) such that each of its parts is a dominating set of G. The domatic number1, d(G), is defined as max{|P| : P is a domatic partition of G}. Note that d(G) is always well defined, since every graph has a domatic partition with a single part equal to V(G). This problem has attracted the attention of many researchers, and many results have appeared in the literature [Bonuccelli 1985, Bertossi 1988, Kaplan and Shamir 1994, Rautenbach and Volkmann 1998, Volkmann 2011].

A natural extension of this problem is to consider partitions of V(G) into dominating sets with additional constraints. Among the new problems so defined is the dominating clique partition problem, which adds the constraint that each part be a clique. Other dominating set partition problems, with other constraints, have also been investigated [Haynes et al. 1998, Zelinka 1992]. This work addresses the dominating clique partition problem for some classes of graphs.

2. Main Results

Given a graph G, the dominating clique partition problem consists of determining whether there is a partition of V(G) such that each of its parts is a dominating clique in G. If the vertex set of a graph G admits such a partition, we say that G has a dominating clique partition (PCD, the acronym of the Portuguese term). For graphs that have a PCD, we define the clique domatic number as d_cl(G) := max{|P| : P is a dominating clique partition of G}.

The complete graph on n vertices, K_n, has a PCD with n singleton parts. Hence, d_cl(K_n) = n. For a complete bipartite graph K_{m,n}, whose parts have cardinalities m and n, a dominating clique partition does not always exist. The following theorem characterizes the complete bipartite graphs that have a PCD.

Theorem 2.1. A bipartite graph G has a PCD if and only if G ≅ K_{n,n}. Moreover, d_cl(K_{n,n}) = n, for n ≥ 2.
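To make the definition of a PCD concrete, the sketch below checks whether a given family of vertex sets is a dominating clique partition, and applies it to K_{3,3} with one matching edge per part. The graph representation and all names are assumptions of this example; the code only verifies a candidate partition and does not prove that n parts is the maximum.

```python
from itertools import combinations


def is_clique(adj, s):
    return all(v in adj[u] for u, v in combinations(s, 2))


def is_dominating(adj, s):
    covered = set(s)
    for v in s:
        covered |= adj[v]
    return covered == set(adj)


def is_dominating_clique_partition(adj, parts):
    """True when parts is a partition of V(G) into dominating cliques."""
    flat = [v for part in parts for v in part]
    is_partition = len(flat) == len(set(flat)) and set(flat) == set(adj)
    return is_partition and all(
        len(part) > 0 and is_clique(adj, part) and is_dominating(adj, part)
        for part in parts
    )


def complete_bipartite(n):
    """K_{n,n} with sides labeled ("a", i) and ("b", i)."""
    left = [("a", i) for i in range(n)]
    right = [("b", i) for i in range(n)]
    adj = {v: set(right) for v in left}
    adj.update({v: set(left) for v in right})
    return adj


k33 = complete_bipartite(3)
matching_parts = [[("a", i), ("b", i)] for i in range(3)]
```

Each part is an edge of a perfect matching: an edge is a clique, and in K_{n,n} its two endpoints together are adjacent to every other vertex.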

Many results in the literature analyze how certain graph parameters vary under graph product operations [Nowakowski and Rall 1996, Valencia-Pabon 2010]. Dominating sets are no exception. A classic dominating set problem involving Cartesian products of graphs is Vizing's Conjecture [Vizing 1963], which states that, for any two graphs G and H, γ(G □ H) ≥ γ(G)γ(H).

1 Número dominativo in the original Portuguese; domatic number in English.

The Cartesian product of two simple graphs G_1 and G_2 is the graph G_1 □ G_2 with vertex set V(G_1) × V(G_2), in which two vertices (u, v) and (x, y) are adjacent if and only if either u = x and {v, y} ∈ E(G_2), or {u, x} ∈ E(G_1) and v = y. Figure 1(a) illustrates this product. The direct product of two simple graphs G_1 and G_2 is the graph G_1 × G_2, which also has vertex set V(G_1) × V(G_2), with two vertices (u, v) and (x, y) adjacent if and only if {u, x} ∈ E(G_1) and {v, y} ∈ E(G_2). Figure 1(b) illustrates this concept.
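Both product definitions translate directly into code. The sketch below builds them from adjacency dictionaries; the representation and function names are assumptions chosen for this illustration.

```python
from itertools import product


def complete_graph(n):
    """K_n on vertices 0..n-1 as an adjacency dictionary."""
    return {i: set(range(n)) - {i} for i in range(n)}


def cartesian_product(adj1, adj2):
    """(u, v) ~ (x, y) iff (u = x and v ~ y) or (u ~ x and v = y)."""
    return {
        (u, v): {(u, y) for y in adj2[v]} | {(x, v) for x in adj1[u]}
        for u, v in product(adj1, adj2)
    }


def direct_product(adj1, adj2):
    """(u, v) ~ (x, y) iff u ~ x and v ~ y."""
    return {
        (u, v): {(x, y) for x in adj1[u] for y in adj2[v]}
        for u, v in product(adj1, adj2)
    }
```

For K_3 and K_4, every vertex of the Cartesian product has degree 2 + 3 = 5, while every vertex of the direct product has degree 2 · 3 = 6.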

Figure 1. Examples of Cartesian product and direct product: (a) Cartesian product of K_3 by K_4; (b) direct product of K_3 by K_4.

Let G_1 and G_2 be two simple graphs with V(G_1) = {u_0, ..., u_(p−1)} and V(G_2) = {w_0, ..., w_(q−1)}. In this work, we represent the graph G obtained by the Cartesian or direct product of G_1 and G_2 as a grid of pq vertices defined as follows:

- Rows are numbered from 0 to p − 1;
- Columns are numbered from 0 to q − 1;
- The vertex (u_i, w_j) is at position (i, j) of the grid and is denoted by v_ij, 0 ≤ i < p and 0 ≤ j < q;
- An arbitrary vertex in row i is denoted by v_i•. The set of all vertices in row i is denoted by V_i•;
- An arbitrary vertex in column j is denoted by v_•j. The set of all vertices in column j is denoted by V_•j.

This representation of the product of G_1 and G_2 is called the matrix representation. Figure 2 illustrates this representation for the graphs shown in Figure 1.

The following properties are simple consequences of the definitions of these products and are used extensively in this work.

Proposition 2.2. Let G ≅ G_1 □ G_2 and let e ∈ E(G). Then, in the matrix representation of G, the endpoints of e are either in the same row or in the same column.


Figure 2. Matrix representation of K_3 □ K_4.

Proposition 2.3. Let G_1 and G_2 be two graphs with n_1 and n_2 vertices, respectively. Let G ≅ G_1 □ G_2 and let S be a clique of G. Then, in the matrix representation of G, either S ⊆ V_i• or S ⊆ V_•j, 0 ≤ i < n_1 and 0 ≤ j < n_2. Moreover, if S is a dominating clique and: (i) S ⊆ V_i• and n_1 ≥ 2, then S = V_i•; (ii) S ⊆ V_•j and n_2 ≥ 2, then S = V_•j.

Proposition 2.4. Let G_1 and G_2 be two simple graphs. Let G ≅ G_1 × G_2 and e ∈ E(G). Then, in the matrix representation of G, the two endpoints of e are in different rows and different columns.

The graphs obtained by the Cartesian product that have a dominating clique partition are characterized by Theorem 2.5, which also establishes the value of d_cl(G) for these graphs.

Theorem 2.5. Let G be the graph obtained by the Cartesian product of the graphs G_1 and G_2, with |V(G_1)| ≥ 2 and |V(G_2)| ≥ 2. The graph G has a dominating clique partition if and only if G_1 ≅ K_p and G_2 ≅ K_q. Moreover, d_cl(K_p □ K_q) = max{p, q}.
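As a sanity check on one small instance (not a proof), the sketch below enumerates every dominating clique of K_3 □ K_4 by brute force. The graph construction and names are assumptions of this example. On this instance the dominating cliques turn out to be exactly the three full rows and the four full columns of the matrix representation, in line with Proposition 2.3; since every row intersects every column, a partition must use rows only or columns only, which gives max{3, 4} = 4 parts.

```python
from itertools import combinations, product


def complete_graph(n):
    return {i: set(range(n)) - {i} for i in range(n)}


def cartesian_product(adj1, adj2):
    return {
        (u, v): {(u, y) for y in adj2[v]} | {(x, v) for x in adj1[u]}
        for u, v in product(adj1, adj2)
    }


def dominating_cliques(adj):
    """All dominating cliques of a small graph, by exhaustive search."""
    vertices = sorted(adj)
    everything = set(vertices)
    found = set()
    for k in range(1, len(vertices) + 1):
        for s in combinations(vertices, k):
            is_clique = all(v in adj[u] for u, v in combinations(s, 2))
            if is_clique and set(s).union(*(adj[v] for v in s)) == everything:
                found.add(frozenset(s))
    return found


p, q = 3, 4
grid = cartesian_product(complete_graph(p), complete_graph(q))
rows = {frozenset((i, j) for j in range(q)) for i in range(p)}
cols = {frozenset((i, j) for i in range(p)) for j in range(q)}
```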

Theorem 2.6 establishes a condition that two graphs must satisfy for their direct product to have a dominating clique. This condition, although necessary, is not sufficient to guarantee the existence of a dominating clique in a graph obtained by the direct product. This fact is demonstrated by Theorem 2.7, which proves that the graph K_{p,p} × K_q has no dominating clique, even though it is the direct product of two graphs that have dominating cliques.

Theorem 2.6. Let G and H be two graphs such that G has no dominating clique. Then the graph G × H has no dominating clique either.

Theorem 2.7. The graph K_{p,p} × K_q, p ≥ 2, has no dominating clique.

For direct products of complete graphs, Theorem 2.8 establishes that, in this class, the graphs that have a PCD are the products of graphs with at least three vertices. For these, it is shown that a minimum dominating clique has at least three vertices and that it is possible to construct a PCD of cardinality ⌊pq/3⌋, which allows us to conclude that d_cl(K_p × K_q) = ⌊pq/3⌋.

Theorem 2.8. Let G ≅ K_p × K_q, p ≥ 3, q ≥ 3. Then d_cl(G) = ⌊pq/3⌋.
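The smallest case, K_3 × K_3, can be checked by hand or with the short sketch below. The partition used here (the three cyclic shifts of the diagonal) is an assumption made for this example and is not claimed to be the construction used in the proof; the code only verifies that it yields ⌊9/3⌋ = 3 dominating cliques and that no clique with fewer than three vertices dominates this graph.

```python
from itertools import combinations, product


def complete_graph(n):
    return {i: set(range(n)) - {i} for i in range(n)}


def direct_product(adj1, adj2):
    return {
        (u, v): {(x, y) for x in adj1[u] for y in adj2[v]}
        for u, v in product(adj1, adj2)
    }


def is_dominating_clique(adj, s):
    s = set(s)
    clique = all(v in adj[u] for u, v in combinations(s, 2))
    return clique and s.union(*(adj[v] for v in s)) == set(adj)


p = q = 3
g = direct_product(complete_graph(p), complete_graph(q))
# Part number `shift` takes one vertex per row, moving `shift` columns right.
parts = [{(i, (i + shift) % q) for i in range(p)} for shift in range(q)]
```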


3. Conclusion

In this work we introduced the dominating clique partition problem and studied it for some classes of graphs. For the classes of bipartite graphs, Cartesian products, and direct products of complete graphs, it was possible to characterize the graphs that have a dominating clique partition and also to determine their clique domatic numbers. Another class of graphs we are studying is that of powers of cycles. For this class, we were able to characterize the graphs that have a PCD and to determine the value of d_cl(C_n^k). This last result is currently being written up.

References

Berge, C. (1962). Theory of Graphs and its Applications. Methuen, London.

Bertossi, A. A. (1988). On the domatic number of interval graphs. Information Processing Letters, 28(6):275–280.

Bonuccelli, M. A. (1985). Dominating sets and domatic number of circular arc graphs. Discrete Applied Mathematics, 12(3):203–213.

Cockayne, E. J. and Hedetniemi, S. T. (1977). Towards a theory of domination in graphs. Networks, 7:247–261.

Garey, M. R. and Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York.

Haynes, T. W., Hedetniemi, S., and Slater, P. (1998). Domination in Graphs: Volume 2: Advanced Topics (Chapman & Hall/CRC Pure and Applied Mathematics). CRC Press.

Kaplan, H. and Shamir, R. (1994). The domatic number problem on some perfect graph families. Information Processing Letters, 49(1):51–56.

Nowakowski, R. J. and Rall, D. F. (1996). Associative graph products and their independence, domination and coloring numbers. Discuss. Math. Graph Theory, 16(1):53–79.

Ore, O. (1962). Theory of Graphs. Amer. Math. Soc. Colloq. Publ, Providence, RI.

Rautenbach, D. and Volkmann, L. (1998). The domatic number of block-cactus graphs. Discrete Mathematics, 187(1-3):185–193.

Valencia-Pabon, M. (2010). Idomatic partitions of direct products of complete graphs. Discrete Mathematics, 310(5):1118–1122.

Vizing, V. G. (1963). The cartesian product of graphs. Vycisl. Sistemy No., 9:30–43.

Volkmann, L. (2011). Upper bounds on the signed total domatic number of graphs. Discrete Appl. Math., 159(8).

Zelinka, B. (1992). Domatic number of a graph and its variants (extended abstract). In Fourth Czechoslovakian Symposium on Combinatorics, Graphs and Complexity, volume 51 of Annals of Discrete Mathematics, pages 363–369. Elsevier.


Optimizing Simulation in Multiprocessor Platforms using Dynamic-Compiled Simulation

Maxiwell Salvador Garcia1, Rodolfo Azevedo1 and Sandro Rigo1

1Institute of Computing, UNICAMP, Campinas, Sao Paulo, Brazil

{maxiwell, sandro, rodolfo}@ic.unicamp.br

Abstract. Contemporary SoC design involves the proper selection of cores from a reference platform and fast design space exploration. Designers expect platform simulation with high performance and flexibility. Dynamic-compiled instruction-set simulation compiles code blocks of the application at runtime to accelerate the simulation with high efficiency. This paper presents a retargetable dynamic-compiled simulator to improve performance in multiprocessor platforms. Three architectures were modeled and tested on platforms with up to 8 cores. The performance was 3 times better than that of interpreted simulators.


1. Introduction

An instruction-set simulator (ISS) functionally mimics the behavior of a target processor on a host workstation. Due to its flexibility and reduced cost, simulation is one of the most used methods to evaluate system behavior and performance.

The classic ISS style is interpretation, which mimics the hardware behavior: fetch an instruction from memory, decode it, and execute it. Unfortunately, growing system complexity makes the traditional simulation approach inefficient for today's architectures. The increasing complexity of the architectures and time-to-market pressure make simulation performance critical in ISSs.
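The fetch, decode, and execute cycle of an interpreted ISS can be sketched with a toy example. The three-instruction "ISA" below (`addi`, `bnez`, `halt`), the textual instruction encoding, and the function name are all invented for illustration and have no relation to the architectures modeled in this paper. The point is that every visit to an instruction repeats the fetch and decode work, even inside a loop.

```python
def interpret(program, regs):
    """Toy interpreted simulator: re-fetches and re-decodes on every step."""
    pc = 0
    steps = 0
    while 0 <= pc < len(program):
        word = program[pc]           # fetch
        op, a, b = word.split()      # decode (repeated on every visit)
        a, b = int(a), int(b)
        if op == "addi":             # execute: regs[a] += immediate b
            regs[a] += b
            pc += 1
        elif op == "bnez":           # branch to address b if regs[a] != 0
            pc = b if regs[a] != 0 else pc + 1
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        steps += 1
    return regs, steps


# r0 counts down from 3 while r1 accumulates 2 per iteration.
countdown = ["addi 0 3", "addi 1 2", "addi 0 -1", "bnez 0 1", "halt 0 0"]
```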

To gain performance, it is possible to avoid the overhead of instruction fetching and decoding by performing binary translation prior to simulation or at runtime. The former technique is called static compiled simulation and the latter dynamic compiled simulation. The static approach creates a simulator optimized for the application that will be executed, using an additional pre-processing step before the simulation. A more flexible alternative is the dynamic approach, which performs that pre-processing at runtime, optimizing code blocks on demand.

An important feature of an ISS is retargetability, in order to support a wide spectrum of architectures. Retargetability was achieved with the introduction of Architecture Description Languages (ADLs) in the 1990s. In this work, we used an ADL called ArchC [Azevedo et al. 2005], which offers retargetable instruction-set simulator generation with dynamic compiled simulation techniques. This ADL also supports static compiled simulation [Garcia et al. 2010], which is not used here. The simulators generated by ArchC are wrapped in a SystemC module [SystemC ] and use TLM (Transaction-Level Modeling) [Cai and Gajski 2003] for external connections, such as memories, buses, and IPs (Intellectual Property). With these attributes, complete platforms with various configurations can be built easily [Araujo et al. 2008]. Simulating the entire SoC (System-on-Chip), including interconnections, multiple processors, and IPs, reduces errors and helps in meeting the requirements.

This paper presents ACDCSIM, a dynamic compiled simulator generator based on ArchC that uses the LLVM infrastructure [Lattner and Adve 2004] to compile code blocks at runtime. The generated simulators were used to speed up multiprocessor platforms that were previously simulated with interpreted simulators.

The remaining sections of this paper are organized as follows. Section 2 presents related work concerning fast retargetable ISSs. Section 3 describes our approach using the LLVM JIT (Just-in-Time compilation) infrastructure to generate dynamic compiled simulators. Section 4 presents some simulation results using the MIPS, PowerPC, and SPARC architectures, in platforms with 1, 2, 4, and 8 homogeneous cores. Finally, Section 5 summarizes our conclusions and describes future work.

2. Related Work

There are many fast simulators and ADLs in the current literature. Some fast simulators are available regardless of ADL, but they are exceptions. SyntSim [Burtscher and Ganusov 2004] is a static-compiled ISS that uses C as its intermediate representation. It translates a target program into a huge C function, which may sometimes overwhelm the C compiler. Selecting only the most frequently executed instructions for translation alleviates the problem. The FSCS technique, implemented in the ArchC ADL, is a static-compiled ISS [Bartholomeu et al. 2004] that translates the complete target program into C++. It does so by dividing the target code into blocks and translating each block into a C++ function. This approach is scalable and has been successfully applied to large benchmarks.

Static-compiled ISSs require prior knowledge of the target program at compile time; for every target program to simulate, an instance of the simulator needs to be built. Another drawback is the inability to execute self-modifying code. Therefore, more research has explored the construction of dynamic-compiled ISSs. Shade [Cmelik and Keppel 1994] was the first simulator in this class. However, it is neither retargetable nor portable.

LISA [Zivojnovic et al. 1996] is an ADL designed for automatically retargeting fast compiled simulators. Nohl et al. presented the Just-in-Time Cache Compiled Simulation (JIT-CCS) technique in LISA to combine the flexibility of interpreted simulators with the performance of the compiled principle. The compilation of an instruction takes place at runtime, just before the instruction is executed. The extracted information is stored in a simulation cache for direct reuse in repeated executions of the same program address.

In the EXPRESSION ADL [Halambi et al. 1999], Reshadi et al. [Reshadi et al. 2003] presented a technique called Instruction-Set Compiled Simulation (IS-CS). Decoding happens at compile time, as in static-compiled simulation, but with the flexibility to re-decode an instruction if it is changed. They called this approach hybrid-compiled simulation. IS-CS performs up to 40% better than JIT-CCS, as shown in [Halambi et al. 1999].

Figure 1: Dynamic-compiled flow: (a) simulator generation; (b) binary simulation.

Qin et al. [Qin et al. 2006] proposed a multiprocessing approach to accelerate dynamic-compiled simulation. The simulation is performed on one thread, while other threads perform binary translation. Their implementation is fully in C/C++, including the intermediate representation. The other tools mentioned (SyntSim, FSCS, JIT-CCS, and IS-CS) also use C/C++ in their implementation and are therefore highly portable. The approach presented in this paper uses C++ for the simulator code and LLVM-IR as the intermediate representation.

3. Dynamic-Compiled Simulation

ArchC [Azevedo et al. 2005] is an architecture description language based on C++ and SystemC [SystemC ] that was used in this work for experimenting with new fast simulation techniques. The ArchC toolchain is publicly available for download under an open-source GPL license at www.archc.org.

Figure 1a presents the dynamic-compiled simulator generation workflow in ArchC. The behaviors are transformed into LLVM Intermediate Representation (LLVM-IR) for inlining into the JIT function. Figure 1b shows the execution flow. The outputs of Figure 1a, the dyn-compiled simulator and the behaviors LLVM bitcode, correspond to the dotted box and the behaviors LLVM bitcode box in Figure 1b, respectively. This technique divides the target program into many blocks. Each block is an LLVM-IR switch structure containing 512 decoded instructions, indexed by the Program Counter (PC). These blocks are called JIT functions and are built on demand. The behavior calls in JIT functions are inlined in the optimization phase. Finally, the Execution Engine compiles and executes the block (the JIT stage). These compiled blocks are stored for future access, avoiding the build and compilation phases. At the end, the Execution Engine returns the next PC, indicating the next block to be executed. If the new PC points to a compiled block, the engine executes it. If not, the new block must be processed. When a compiled block is changed by self-modifying code, the simulator marks the block as not compiled and resumes the simulation.
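The block cache and dispatch loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not the ACDCSIM implementation: a Python closure stands in for the LLVM-compiled JIT function, the instruction set (`addi`, `bnez`, `halt`) is invented, the block size is 4 instead of 512, and all function names are hypothetical.

```python
BLOCK_SIZE = 4  # the paper uses 512 instructions per block


def compile_block(program, start):
    """Build one 'JIT function' (a closure standing in for compiled code)."""
    end = min(start + BLOCK_SIZE, len(program))
    ops = program[start:end]  # captured once, at build time

    def run(regs, pc):
        # Execute while the PC stays inside this block.
        while start <= pc < end:
            op, a, b = ops[pc - start]
            if op == "addi":
                regs[a] += b
                pc += 1
            elif op == "bnez":
                pc = b if regs[a] != 0 else pc + 1
            else:  # "halt"
                return -1
        return pc  # next PC, possibly in another block

    return run


def invalidate(cache, address):
    """Mark the block holding `address` as not compiled (self-modifying code)."""
    cache.pop(address // BLOCK_SIZE, None)


def simulate(program, regs):
    cache = {}      # block index -> compiled block
    builds = 0
    pc = 0
    while 0 <= pc < len(program):
        block = pc // BLOCK_SIZE
        if block not in cache:  # build on demand, then reuse
            cache[block] = compile_block(program, block * BLOCK_SIZE)
            builds += 1
        pc = cache[block](regs, pc)
    return regs, builds


loop = [("addi", 0, 5), ("addi", 1, 2), ("addi", 0, -1), ("bnez", 0, 1), ("halt", 0, 0)]
```

Running `simulate(loop, [0, 0])` executes the loop body five times but builds each of the two blocks only once, which is the reuse the stored compiled blocks are meant to provide.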

Since the JIT function is basically a switch, avoiding unnecessary breaks increases performance. That optimization requires identifying the instructions that can change the execution flow. The delay slot feature of some architectures has to be treated carefully in instruction-level simulation. The program flow may change not after the control instruction, but after the instruction in the delay slot. For this reason, the number of delay slots must also be given in the ISA description. ArchC descriptions have this information, which allowed the insertion of breaks only on those instructions. The remaining instructions execute sequentially, without breaks between them. Control flow instructions were also used as synchronization points for multicore ArchC platforms. The "wait()" statement always appears after a jump or a branch. This allows SystemC to switch control to another thread, in this case another processor simulator.
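One way to picture the break placement is the sketch below, which returns the positions in a block after which a break would be needed, given the set of control-flow opcodes and the number of delay slots. The instruction tuples, opcode names, and the clamping at the end of the block are assumptions made for this illustration; how the actual simulator handles a delay slot that crosses a block boundary is not described in the text.

```python
def break_points(instructions, control_ops, delay_slots):
    """Indices after which the switch needs a break.

    With delay slots, the flow changes only after the delay-slot
    instruction(s) that follow the control instruction.
    """
    points = []
    for i, (op, *_operands) in enumerate(instructions):
        if op in control_ops:
            # Clamp to the block end (an assumption of this sketch).
            points.append(min(i + delay_slots, len(instructions) - 1))
    return points


block = [("add",), ("beq",), ("nop",), ("add",), ("j",), ("nop",)]
```

With one delay slot, `break_points(block, {"beq", "j"}, 1)` places the breaks after the two `nop` instructions; with no delay slots they fall directly after `beq` and `j`.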

4. Experiments

In order to evaluate the proposed dynamic-compiled simulator, we selected three architecture models described in ArchC: Mips32, PowerPC, and Sparc. These simulators were inserted into virtual platforms with 1, 2, 4, and 8 cores. Communication was made through buses, shared memory, and lock devices, all implemented using the SystemC and TLM libraries. Although the simulated platform contains multiple processors and the host machine has multiple cores, the SystemC execution model is serial.

We used parallel programs from the ParMiBench [Iqbal et al. 2010] and Splash-2 [Woo et al. 1995] benchmarks, and adapted some programs from MiBench [Guthaus et al. 2001] for multicore. The pthread library, used by several programs, was rewritten to run on ArchC, but the application code was not changed. The experiments were performed on an Intel i7 860 at 2.8 GHz, equipped with 1 GB of RAM. We used GCC 4.4.1 with the -O3 flag to compile the ArchC 2.1 toolchain. The JIT functions were built using the LLVM 2.8 API (Application Programming Interface). The benchmarks were compiled with GCC 3.3.1 cross-compilers using the -O2 optimization flag and the -msoft-float flag, because the ArchC models do not have floating-point instructions.

For each architecture, we first ran all platforms with the interpreted simulator and then with the dynamic-compiled simulator. Each configuration was executed 3 times, and the results are the arithmetic mean of the runs. The small number of runs was due to the long simulation times. However, the workstation was dedicated exclusively to the experiments, and the standard deviation ranged from 2.1% to 6.8%. Figures 2a, 2b, and 2c show the speedup factors of dynamic-compiled simulation over interpretation for the Mips32, PowerPC, and Sparc architectures, respectively.

The first five applications in the figures execute quickly and have a low computational cost. The last five are computationally intensive and take up to 6 hours to simulate. The dynamic-compiled approach is not attractive for short programs, because the overhead of the LLVM framework makes it more costly than interpreted simulation and slows the simulation down for some configurations. Dynamic-compiled simulation pays off for large, slow programs, where speed really needs to be increased. If a program takes 4 seconds in interpreted simulation, doubling its performance has little impact. In contrast, the susan smoothing benchmark has a speedup of 2 on PowerPC, which saved three simulation hours for each configuration with the dynamic-compiled simulator.

The average speedups were 2.25, 1.94, and 2.11 for Mips32, PowerPC, and Sparc, respectively, with an overall average of 2.11. Considering only the last five programs, the average speedups are 3.16, 2.27, and 2.98, with an overall average of 2.98. Standalone experiments were also performed with the dynamic-compiled simulators, with a memory connected directly. The single-core MiBench benchmark was used; the fft program on the standalone Mips32 simulator, for example, reached 240 MIPS, against 20 MIPS on the virtual platform with 1 core. The simulators embedded in a platform had a performance drop of about 6 times on average. This decrease is due to the SystemC kernel, which is responsible for about 50% of the processing time.

Figure 2: Dynamic-Compiled vs Interpreter. (a) Mips32 Model; (b) PowerPC Model; (c) SPARCv8 Model.

5. Conclusion

In this paper, we implemented the dynamic-compiled simulation approach in the ArchC toolchain for evaluating virtual platforms. The speedup was about 3 times on average. All the benchmarks together ran in 97.8 hours with the dynamic-compiled simulators and in 486.9 hours with the interpreted simulators. The SystemC kernel has a negative impact on simulation performance: standalone simulators reached up to 6 times the simulation speed of the virtual platforms, which rely on the SystemC kernel. In future work, the ArchC team intends to reduce the overhead of the SystemC kernel. There is also interest in simulating multicore platforms with truly parallel simulation; currently, the simulation is sequential because of the SystemC kernel.

References

Araujo, G., Rigo, S., and Azevedo, R. (2008). Processor design with ArchC. In Mishra, P. and Dutt, N., editors, Processor Description Languages, chapter 11. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Azevedo, R., Rigo, S., Bartholomeu, M., Araujo, G., Araujo, C., and Barros, E. (2005). The ArchC architecture description language and tools. Int. J. Parallel Program., 33:453–484.

Bartholomeu, M., Azevedo, R., Rigo, S., and Araujo, G. (2004). Optimizations for compiled simulation using instruction type information. In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, SBAC-PAD ’04, pages 74–81, Foz do Iguacu, PR, Brasil.

Burtscher, M. and Ganusov, I. (2004). Automatic synthesis of high-speed processor simulators. In Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, pages 55–66, Portland, Oregon.

Cai, L. and Gajski, D. (2003). Transaction level modeling: an overview. In Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, CODES+ISSS ’03, pages 19–24, New York, NY, USA.

Cmelik, B. and Keppel, D. (1994). Shade: a fast instruction-set simulator for execution profiling. In Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems, SIGMETRICS ’94, pages 128–137, Nashville, Tennessee, United States.

Garcia, M. S., Azevedo, R., and Rigo, S. (2010). Optimizing a retargetable compiled simulator to achieve near-native performance. In Proceedings of the 11th Symposium on Computing Systems, WSCAD-SCC ’10, pages 33–39, Petropolis, RJ, Brasil.

Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. (2001). MiBench: A free, commercially representative embedded benchmark suite. In Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on, pages 3–14.

Halambi, A., Grun, P., Ganesh, V., Khare, A., Dutt, N., and Nicolau, A. (1999). Expression: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the conference on Design, automation and test in Europe, DATE ’99, pages 485–490, Munich, Germany. ACM.

Iqbal, S. M. Z., Liang, Y., and Grahn, H. (2010). ParMiBench - an open-source benchmark for embedded multiprocessor systems. IEEE Computer Architecture Letters, 9:45–48.

Lattner, C. and Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, CGO ’04, pages 0–75, Palo Alto, California.

Qin, W., D’Errico, J., and Zhu, X. (2006). A multiprocessing approach to accelerate retargetable and portable dynamic-compiled instruction-set simulation. In Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, CODES+ISSS ’06, pages 193–198, Seoul, Korea.

Reshadi, M., Mishra, P., and Dutt, N. (2003). Instruction set compiled simulation: a technique for fast and flexible instruction set simulation. In Proceedings of the 40th annual Design Automation Conference, DAC ’03, pages 758–763, Anaheim, CA, USA.

SystemC. http://www.systemc.org, 2011.

Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. (1995). The SPLASH-2 programs: characterization and methodological considerations. SIGARCH Comput. Archit. News, 23(2):24–36.

Zivojnovic, V., Pees, S., and Meyr, H. (1996). LISA - machine description language and generic machine model for HW/SW co-design. In Proceedings of the IEEE Workshop on VLSI Signal Processing, VLSISP ’96, pages 127–136, San Francisco, USA.


Preemption of MapReduce Tasks via Checkpointing

Augusto Souza1, Islene Garcia1

1 Instituto de Computação, Universidade Estadual de Campinas – Campinas, SP – Brasil


Abstract. Apache Hadoop is one of the main systems for big data storage and analysis. It is used to execute jobs with distinct priorities. The preemption used by Hadoop schedulers consists of simply stopping the execution of lower-priority tasks, so the computation performed up to that moment is lost when their execution is restarted. We have been working on a solution that stores the state of tasks in the Reduce and Shuffle phases in HDFS to improve resource usage. This paper discusses the need for this better preemption system in Hadoop and describes the methods that will be used to validate it.

Resumo. Apache Hadoop é um dos principais sistemas para armazenamento e análise de grandes quantidades de dados em clusters. Ele é utilizado para executar jobs de distintas prioridades. A preempção dos escalonadores do Hadoop consiste apenas em parar a execução das tarefas menos prioritárias, a computação realizada até aquele momento não é aproveitada quando suas execuções são retomadas. Estamos contribuindo para uma solução de armazenamento do estado de tarefas que estejam na fase de Reduce ou Shuffle no HDFS para melhorar o aproveitamento dos recursos. Este artigo pretende argumentar a respeito da necessidade dessa melhoria no mecanismo de preempção e descrever os métodos de validação que serão empregados para comprovar sua eficácia.

1. Introduction

A simple solution to the problem of storing and processing large amounts of data is to distribute the data across several disks; the content is then read from several drives at the same time, and overall access speed increases. However, this creates a number of difficulties, for example:

• how to develop applications that read their data from such a distributed file system;

• how to take advantage of the processing power of the clusters, since using only their storage capacity would be inefficient;

• how to detect and recover from hardware and network failures;

• how to build an architecture out of commodity machines in order to reduce costs.

It was in order to offer its developers a computational model in which these problems are handled, and in which the programming of distributed applications is consequently simplified, that Google published MapReduce in 2004 [Dean and Ghemawat 2004]. It is a framework that lets the programmer worry about implementing only two functions: Map and Reduce. Map is responsible for dividing the input into many small, independent groups, which are processed in parallel on the machines of the cluster. The outputs of the mappers (processes that execute the Map function) are the inputs of the reducers (processes that execute the Reduce function in parallel). The reducers, in turn, combine the data to reach the final answer. For MapReduce to work properly, it needs the support of a distributed file system; in Google's case, this is GFS (Google File System) [Ghemawat et al. 2003].

“To develop open-source software for reliable, scalable, distributed computing” is the goal of Hadoop [Apache Hadoop]. Maintained by Apache, Hadoop is a collection of projects, among them the homonymous MapReduce and HDFS (Hadoop Distributed File System), which are the counterparts of Google's MapReduce and GFS. Hadoop's popularity has led to many kinds of applications being developed for it, and many business decisions now depend on such applications. Some of these decisions are urgent, such as processing click logs to measure the effectiveness of advertisements on sites like Facebook and Yahoo (companies well known for using and contributing to Hadoop); others are less so, such as analyzing usage data to decide where to place advertisements on a page [Cho et al. 2013]. The criticality of some of these operations justifies the need for a robust preemption mechanism in Hadoop. With it, if a lower-priority application is running on a cluster when a higher-priority application is submitted, the latter can be given preference by temporarily suspending part of the execution of the former, that is, some of its tasks. With a reliable checkpointing mechanism, the processing already done by the interrupted tasks can be reused.

The remainder of this document is organized as follows: Section 2 describes the architecture and operation of HDFS and of versions 1 and 2 of MapReduce; Section 3 describes how task preemption is currently handled in Hadoop and the problems with that approach; Section 4 describes how our research has progressed toward improving Hadoop's task preemption mechanism; Section 5 lists our conclusions.

2. Hadoop Overview

Hadoop contains an implementation of MapReduce that is freely available under the Apache license [Apache License 2.0]. The Google papers [Dean and Ghemawat 2004, Ghemawat et al. 2003] served as the basis for the project. MapReduce consists of a programming model and a scalable execution environment, that is, one that can be used on a few machines or on clusters of thousands of commodity computers. The programming model requires programmers to specify Map and Reduce functions. Map processes an input file and generates key-value pairs. Reduce combines these pairs by key into a single key-values pair. Many real-world problems can be expressed in this model [Dean and Ghemawat 2004].
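To make the programming model concrete, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the in-memory grouping step merely stands in for what the framework does between the Map and Reduce phases on a real cluster. The example is the classic word count.

```python
from collections import defaultdict


def map_fn(line):
    # Map: turn one input record into (key, value) pairs.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce: combine all values that share a key into one result.
    return key, sum(values)


def run_job(lines):
    # Stand-in for the framework: group the mapper output by key,
    # then hand each key and its list of values to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))
```

For instance, `run_job(["a b a", "B c"])` counts two occurrences each of `a` and `b` and one of `c`. The programmer writes only `map_fn` and `reduce_fn`; distribution, grouping, and fault handling are left to the execution environment.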

An input file previously stored in HDFS is used together with Map and Reduce functions to form what Hadoop calls a job. As illustrated in Figure 1, the mappers first read the input data from HDFS and produce their output (key-value pairs). Each reducer then computes a range of keys and their respective values; to do so, the reducers must copy the mappers' output directly from the local disk of each machine that produced it. This copy phase is called shuffling. After the Reduce phase, the multiple outputs are concatenated into a file that is written to HDFS.

Figure 1. MapReduce execution (translated from [White 2012])

An architectural change became necessary as Hadoop came to be widely adopted on large clusters. As reported in [Vavilapalli et al. 2013], MapReduce proved not to scale on clusters with more than 4000 nodes. The main bottleneck was the JobTracker, which is why a group at Yahoo worked on a new architectural model named YARN (or MapReduce 2), which stands for “Yet Another Resource Negotiator”. The biggest change in YARN was splitting the responsibilities of the JobTracker between two new entities: the ResourceManager and the ApplicationMaster. The former manages the cluster's resources, while the latter manages the application life cycle. In addition, YARN introduced the concept of containers, a logical division of the memory and CPU of the nodes that the ResourceManager uses when allocating resources. Managing the containers is the responsibility of a new entity called the NodeManager.

Figure 2. YARN architecture (translated from [Vavilapalli et al. 2013])


The YARN architecture is illustrated in Figure 2, where two applications (represented in black and gray) are running on the same cluster. In YARN, the execution of a MapReduce application begins when a cluster client submits a job to the ResourceManager; the scheduler then allocates a container to run the ApplicationMaster, which obtains the input it needs to run. The ApplicationMaster then negotiates with the ResourceManager for more containers, passing along information about data locality. The ResourceManager also receives data on the amount of memory and CPU needed to execute the tasks, to be used in scheduling. Once the scheduler has mapped tasks to resources, the ApplicationMaster starts the containers through a request to the NodeManager, which fetches the data needed to actually execute the task.

3. Task Preemption

Typically, two kinds of jobs are run on Hadoop: production jobs and research jobs. These jobs have distinct characteristics: production jobs usually have higher priority, so they should be scheduled as soon as they are submitted and should have a large share of the cluster's resources dedicated to them so that they run faster. At Google, production jobs account for 30% of the total [Cheng et al. 2011]. This heterogeneity created the need for preemption support in the Hadoop schedulers.

Currently, in Hadoop version 2.4.0, preempting a task means terminating its execution immediately so that its resources can be taken over by a new task (from a higher-priority job). No state is saved, so all the work already done by the task is lost and must be redone when the task runs again.

Under typical conditions, the rework done when the tasks of a lower-priority job resume may not amount to a large share of the processing time. With long-running tasks, however, it can be a serious problem. To illustrate, imagine a cluster running a low-priority job composed of long-running tasks, and imagine that higher-priority jobs start to be submitted to this cluster frequently. A significant loss of resources will be observed every time a higher-priority job arrives, because none of the computation performed by the suspended tasks of the lower-priority job is reused. Worse, because these tasks are long, a great deal of time is wasted recomputing data that had already been processed.
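The cost of this rework can be estimated with a back-of-the-envelope model, sketched below in Python. The function name, the time units, and the fixed per-checkpoint cost are assumptions made for illustration; they are not measurements from Hadoop.

```python
def total_compute(work, runs_before_preemption, checkpoint_cost=None):
    """Units of computation spent until a task of size `work` finishes.

    runs_before_preemption: how long the task ran before each preemption.
    checkpoint_cost=None models kill-based preemption (progress is lost);
    a number models checkpointing with that overhead per preemption.
    """
    done = 0.0    # progress that survives preemptions
    spent = 0.0   # everything the cluster actually computed
    for ran in runs_before_preemption:
        remaining = work - done
        if ran >= remaining:
            # The task finishes before this preemption would happen.
            return spent + remaining
        spent += ran
        if checkpoint_cost is None:
            done = 0.0                 # killed: all progress is discarded
        else:
            done += ran                # checkpointed: progress is kept
            spent += checkpoint_cost
    return spent + (work - done)
```

Under these assumptions, a task that needs 100 units of work and is preempted three times after running 60 units each time costs 280 units with kill-based preemption, whereas with a checkpoint cost of 2 units it finishes after 102 units.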

In order to make better use of cluster resources and improve the performance of lower-priority jobs, our work aims to analyze the impact of using checkpointing in the preemption of MapReduce applications on Hadoop.

4. Method

Hadoop has a fully open change tracking system (called JIRA) that any developer can study and analyze. In this system, the entry [MAPREDUCE-5269] describes a proposed improvement to the preemption mechanisms. If the cluster supports it and the developer allows it, this change lets a higher-priority application submitted to the cluster take the place of a lower-priority application. Moreover, if the application being preempted is in the Shuffle or Reduce phase, the state of the tasks to be suspended can be written to HDFS (checkpointing) and restored as soon as they resume execution.

Currently, the change [MAPREDUCE-5269] is incomplete: there is a patch attached to it that needs some fixes before it can be merged into the Hadoop code base. We are being mentored by the researchers Carlo Curino and Chris Douglas, who are responsible for the change request, and we are working on this solution, although our research is still at a very early stage. A few adjustments remain before we submit a new patch. We will also compare average execution times before and after the improvement. For these measurements, we intend to use a cluster on Amazon EC2 and simulate the submission of jobs with different sizes and priorities.

As future work, we intend to work on an optimization of Hadoop's preemption. The Hadoop community has a series of proposed changes on the topic of preemption [MAPREDUCE-4584], which shows its importance. Working with the researchers Carlo and Chris has helped us focus on what they point to as the project's most pressing needs at the moment. Based on this information, we intend to change where the outputs of the Map phase are stored, persisting them in HDFS instead of on the local disk where the mapper ran. The main benefit is that the state to be written in the checkpoint will shrink significantly: to restore a reducer, for example, it would only be necessary to record the last key processed. Another possible extension of this work would be preemption with checkpointing for mappers as well, since after our change only Reduce and Shuffle tasks will be preemptible via checkpointing.
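The idea that a reducer could be restored from the last processed key alone can be sketched as follows. This is a hypothetical illustration, not the MAPREDUCE-5269 patch: `run_reducer` and `preempt_after` are invented names, and the sketch assumes that the sorted Map output and the partial reducer output both live in durable storage (a list and a dict play that role here), so that only the key needs to be checkpointed.

```python
def run_reducer(sorted_groups, reduce_fn, output, checkpoint=None, preempt_after=None):
    """Reduce (key, values) groups in key order, resuming after `checkpoint`.

    checkpoint: last key already reduced (None on a fresh start).
    preempt_after: hypothetical knob that stops the reducer after this
    many groups, imitating a preemption request.
    Returns (finished, last_key); last_key is the state to persist.
    """
    processed = 0
    for key, values in sorted_groups:
        if checkpoint is not None and key <= checkpoint:
            continue                      # reduced before the preemption
        if preempt_after is not None and processed == preempt_after:
            return False, checkpoint      # preempted: only the key is saved
        output[key] = reduce_fn(key, values)
        checkpoint = key
        processed += 1
    return True, checkpoint


# Hypothetical durable inputs and outputs for the example.
groups = [("a", [1, 1]), ("b", [1]), ("c", [1, 1, 1])]
out = {}
```

A first call with `preempt_after=2` reduces keys `a` and `b` and returns `(False, "b")`; a second call with `checkpoint="b"` skips those keys and reduces only `c`, so no work is repeated.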

5. Conclusion

In a world where large amounts of data are produced every day, the need for effective mechanisms to store and process these data culminated in Google's development of GFS and MapReduce. The need for similar systems gave rise to Apache Hadoop, which today can be considered the industry standard for working with Big Data. Clusters running Hadoop are already so large (on the order of thousands of machines) that its architecture had to be revised, evolving into what we now call YARN.

In parallel came the need to prioritize among the different kinds of jobs run on these clusters. Preemption policies emerged, and the Hadoop schedulers became able to give resources to higher-priority jobs even when that requires stopping tasks that belong to lower-priority jobs and resuming them later. The problem is that no state is saved, so resources are wasted recomputing the lost work of these tasks.

To address this problem, the Hadoop community requested the change [MAPREDUCE-5269]. It consists of adding checkpointing mechanisms to Reduce and Shuffle tasks. The state will be persisted in HDFS and can be restored along with the tasks that resume execution. An incomplete patch is already attached to the JIRA entry for this change, and we are working together with its developers on adjustments so that the solution can finally be incorporated into Hadoop. Afterwards, we intend to analyze the performance improvements brought by our change.

For future work, we can follow two possible approaches to further improve Hadoop's preemption mechanism. The first, and less likely, is to extend the solution applied to Reduce and Shuffle tasks to Map tasks. The second is to also write the mappers' intermediate data to HDFS, significantly reducing the checkpoint state.

References

Apache Hadoop. http://hadoop.apache.org/. Last accessed July 31, 2014.

Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0. Last accessed July 31, 2014.

Cheng, L., Zhang, Q., and Boutaba, R. (2011). Mitigating the negative impact of preemption on heterogeneous MapReduce workloads. In Proceedings of the 7th International Conference on Network and Services Management, CNSM ’11, pages 189–197, Laxenburg, Austria. International Federation for Information Processing.

Cho, B., Rahman, M., Chajed, T., Gupta, I., Abad, C., Roberts, N., and Lin, P. (2013). Natjam: Design and evaluation of eviction policies for supporting priorities and deadlines in MapReduce clusters. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pages 6:1–6:17, New York, NY, USA. ACM.

Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI ’04, pages 10–10, Berkeley, CA, USA. USENIX Association.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google File System. In Proceedings of the nineteenth ACM symposium on Operating systems principles, SOSP ’03, pages 29–43, New York, NY, USA. ACM.

MAPREDUCE-4584. Umbrella: Preemption and restart of MapReduce tasks - JIRA. https://issues.apache.org/jira/browse/MAPREDUCE-4584. Last accessed July 31, 2014.

MAPREDUCE-5269. Preemption of Reducer (and Shuffle) via checkpointing - JIRA. https://issues.apache.org/jira/browse/MAPREDUCE-5269. Last accessed July 31, 2014.

Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., and Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pages 5:1–5:16, New York, NY, USA. ACM.

White, T. (2012). Hadoop: The Definitive Guide. O’Reilly Media, Inc.


A Situated and Participatory Process for the Design of iDTV Applications: A Socio-Technical Approach

Samuel B. Buchdid e M. Cecília C. Baranauskas

1 Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)

Av. Albert Einstein – 1251, 13083-970 – Campinas – SP – Brazil

{buchdid, cecilia}@ic.unicamp.br

Abstract. Interactive Digital Television (iDTV) can be considered a technology with the potential to address economic and social difficulties related to access to knowledge and digital inclusion in Brazil. However, designing applications for it is a challenging task, partly because of its complex context and the lack of theoretical and methodological references. For TV companies, while there is a demand for producing interactive content, the development of an application is a new element in their production chain. In this sense, this study investigates the possibility of conducting design practices inside the situated context of an organization, in order to understand the organizational culture and to propose a design process for iDTV applications that fits the broadcasters' production chain and mitigates the challenge of designing applications.

Resumo. A Televisão Digital Interativa (TVDi) pode ser considerada uma tecnologia com potencial para enfrentar as dificuldades econômicas e sociais relacionadas com o acesso ao conhecimento e inclusão digital no Brasil. No entanto, projetar aplicativos para TVDi é uma tarefa desafiadora, parcialmente por causa de seu contexto complexo e a falta de referencial teórico e metodológico. Para as empresas de TV, enquanto que há uma demanda para a produção de conteúdo interativo, o desenvolvimento de um aplicativo é um novo elemento para a sua cadeia de produção. Neste sentido, este estudo investiga a possibilidade de realização de práticas de design dentro do contexto situado de uma organização, a fim de compreender a cultura organizacional e propor um processo de design para aplicações de TVDi que se ajuste à cadeia de produção das emissoras e torne mais fácil o desafio de projetar aplicações.

1. Introduction

The arrival of Digital Television opens up a variety of possibilities for television, such as serving as a portal to the Internet or running television applications, giving rise to Interactive Digital TV (iDTV). However, iDTV per se presents some challenges inherent to its technology that must be addressed in application design, for example: i) (limited) interaction through the remote control; ii) the physical distance between the user and the television set; iii) the diversity of users (cognitive, physical, socioeconomic, cultural); iv) viewers' lack of habit of interacting with television content [Kunert 2009].

Given the country's high levels of social and economic inequality, iDTV presents itself as a promising vehicle for spreading information and disseminating education [Waisman 2006]. It can contribute to social and digital inclusion by reducing the barriers that prevent Brazilian citizens' participatory and universal access to knowledge, one of the five grand research challenges in Computing in the Brazilian scenario [Baranauskas e Souza 2006].

However, in most Brazilian cities, a viewer who buys a digital-ready television cannot enjoy these benefits in full, because most broadcasters are still in the analog era. Moreover, the broadcasters that do transmit digital content only occasionally transmit interactive content. The standards that regulate the encoding specifications for interactive content were approved only recently. Receivers, with limited resources and processing power, are not yet able to handle television applications effectively and only recently incorporated the interactivity channel [Fórum 2014]. The applications that are transmitted are not attractive, complement the television content, and use the interactivity channel only in very specific situations (e.g., polls).

In the case of broadcasters, this scenario is due in part to the fact that their production chain is not prepared to develop iDTV applications [Veiga 2006], and investment is lacking to adapt that chain to a technology that is still undergoing technical and market experimentation. If we consider a broadcaster's production chain for creating a television program, we find teams working in synchrony with very well-defined roles (e.g., director, producer, scriptwriter) and tasks (e.g., production, set-up, and rehearsal) [Bonasio 2002]. The same can be seen in a software development process, with specific stages for requirements analysis, design, development, and testing, and with roles such as designer, developer, and analyst [Pressman 2001]. For Veiga (2006), creating an iDTV application differs from creating a traditional computing system, because there is a gap between the process of producing television content and the conventional software development process. This opens the opportunity to investigate and elaborate a process model specific to the development of iDTV applications, one that combines characteristics of the two process models and fits the broadcasters' production chain.

2. Objective

In light of the above, the problem this thesis addresses involves a conception process suited to the interaction design of iDTV applications in a situated context. The process should incorporate artifacts and dynamics that can be used in all design stages (problem clarification, requirements analysis, design, and testing), from socio-technical perspectives and from the viewpoint of different stakeholders, who influence and are influenced by the system to be created. The specific objectives are:

• Clarify the concept of iDTV in the Brazilian reality, its applications and characteristics, and understand the research challenges and opportunities in the context of this kind of system;

• Understand, in a situated context, the stages of the TV program production process, the dynamics among the broadcaster's various departments, and the different perspectives of the various parties interested in the iDTV application;

• Identify techniques and concepts that can be applied at various stages of the design process and that, articulated, make it possible to design and evaluate an iDTV application oriented toward technical and social issues;

• Propose and/or adapt artifacts that support the use of the approach to be conceived;

• Create and put into practice an approach that helps iDTV application designers involve and articulate socio-technical issues at different stages of the process (problem and requirements clarification, design, and testing);

• Once the needs of the situated context are understood, evaluate the suitability of the approach conceived in the previous item for supporting the design of new iDTV applications or of computing systems in other contexts;

• Code an iDTV application supported by the approach and verify the result of its broadcasting.

3. Methodology

To achieve this objective, we use Socially Aware Computing [Baranauskas 2009; Baranauskas 2014] as our theoretical and methodological basis. It is grounded in Organizational Semiotics [Liu 2000] and Participatory Design [Müller et al. 1997], and we use it to propose design solutions for iDTV applications and to understand the situated context of a television program at a real television broadcaster. The theories that will support this approach are: Design Patterns for iDTV [Kunert 2009], Persuasive Design [Fogg 2009], Conception Processes for iDTV [Veiga 2006], Design of Computing Systems [Pressman 2001], and the TV Program Production Process [Bonasio 2002].

3.1. Case Study: Application Design for the “Terra da Gente” Program

The case study of this research uses a television broadcaster as its practical context: EPTV (acronym for “Emissoras Pioneiras de Televisão”), and more specifically the program “Terra da Gente”. EPTV is one of the affiliates of Rede Globo de Televisão and covers an area of 300 municipalities with more than 10 million citizens [EPTV 2014]. “Terra da Gente”, on the air since June 1997, has its own team of editors, writers, producers, designers, technicians, and journalists, among others. It will be the broadcaster's first program to receive an iDTV application. Its target audience is people who enjoy sport fishing, the diversity of fauna and flora, cooking, and traditional country music (“música de raiz”). Besides the television program, the “Terra da Gente” team also produces a magazine and an information portal of the same name [Terra da Gente 2013].

The purpose of this partnership is a “Collaboration for the Design and Development of an iDTV Application” that enables the creation of the iDTV application to be aired together with the television content of “Terra da Gente”. To this end, 10 participants who are directly or indirectly involved in the problem domain and have different profiles (for example, director, designer, engineers, researchers, and interns) take part in the case study.


3.2. Overview: A Situated Approach to the Design of iDTV Applications

The methodology proposed for conceiving the design process is an instance of the process model from Socially Aware Computing and is shown in Figure 1. The preliminary activities, carried out before the practices in the situated context, are:

1. Identification of techniques and artifacts: identify in the literature concepts, artifacts, and techniques that can be used in the development process of iDTV applications.

2. Adaptation of techniques and artifacts: adapt, if necessary, the concepts, artifacts, and techniques found in the previous step so that they can be used, in an articulated way, in the situated context.

3. Conception of the process: ranges from defining the stages of the process (e.g., clarification of the problem and requirements, implementation, testing, and evaluation) to defining the concepts, artifacts, and techniques to be used in each stage.

The activities carried out jointly with EPTV are defined below:

• Semio-participatory workshops: designed to clarify the problem and the situated context, propose solutions, generate design ideas, and also evaluate the prototype created (details A, B, and D in Figure 1).

• Prototype construction: the prototype is created at an intermediate stage of the process to show the participants, in a concrete way, the result of the workshops (detail C in Figure 1).

• Prototype evaluation: carried out with end users, and also with the workshop participants, to identify possible human-computer interaction problems before the final application is coded (detail D in Figure 1).

• Implementation and laboratory tests: coding of the application from the prototype and from the improvement suggestions gathered in the previous stage. Once the application is coded, laboratory tests are needed to evaluate its technological side and also human-computer interaction problems that may appear only in the final application (detail E in Figure 1).

• Broadcast, field evaluation, and analysis of results: once broadcast, the application can be evaluated in the field, which later includes refining the collected data (detail F in Figure 1).

Figure 1. Proposed methodology


The process is to be evaluated through questionnaires answered by the workshop participants. Auxiliary evaluations may take the form of formative evaluations with end users of the application being created, at different stages of the design process (e.g., prototype testing and laboratory testing).

4. Partial Results

So far we have held the previously scheduled semio-participatory workshops for clarifying the problem and proposing solutions, analyzing and organizing the requirements, and generating design ideas. The ideas produced were materialized, using the Balsamiq® and CogTool® tools, as screens and an interactive prototype. The final prototype (Figure 2) was evaluated interactively by representatives of the target audience, by HCI specialists, and also in a participatory practice with the workshop participants. These evaluation practices were important for identifying problems and proposing solutions before producing the first version of the application. The materials and results produced in these activities have already been reported in [Buchdid et al. 2014a] and [Buchdid et al. 2014b].

Figure 2. Screenshots of the prototype

As partial results, the participatory evaluation indicated that the participants approved the interactive prototype they co-designed. In turn, the evaluation with representatives of the target audience reinforced the positive results, indicating that the prototype was understood and well accepted by users who had never used an iDTV application before.

5. Conclusions

The experience at EPTV indicated that a situated and participatory project contributes to the development of solutions aligned with the interests of the people directly involved in design practices, as well as of prospective end users. These results suggest that a situated and participatory design perspective favors the construction of solutions that make sense to people, which reflects an understanding of the problem domain and of the complex social context in which these solutions will be used.

As future steps, we are implementing the final application, which we will then test and release for use by the program's viewers. In addition, further studies are needed, within the perspective of Socially Aware Computing, to investigate the potential impact of the practices carried out with the TV team and with the end users of the iDTV application.


Acknowledgments. This research was partially funded by CNPq (# 165430/2013-3). The authors thank the EPTV team, who are collaborators in this research project.

References

Baranauskas, M.C.C.; Souza, C.S. Desafio no 4: Acesso Participativo e Universal do

Cidadão Brasileiro ao Conhecimento. Computação Brasil. SBC, ano VII, no 27, p. 7,

Set./Out./Nov (2006).

Baranauskas, M.C.C. Socially Aware Computing. In: ICECE’09, VI International Con-

ference on Engineering and Computer Education (2009).

Baranauskas, M.C.C. Social Awareness in HCI. ACM INTERACTIONS July–August

2014, pp. 66-69 (2014)

Bonasio, V. Televisão: Manual de Produção & Direção. Minas Gerais: Ed. Leitura (2002)

Buchdid, S.B., Pereira, R., Baranauskas, M.C.C. Creating an iDTV Application from

Inside a TV Company: a Situated and Participatory Approach. In: ICISO’14, 15th In-

ternational Conference on Informatics and Semiotics in Organisations. Springer, Ber-

lin, DEU, pp. 63-73 (2014a)

Buchdid, S.B., Pereira, R., Baranauskas, M.C.C.: Playing Cards and Drawing with Pat-

terns: Situated and Participatory Practices for Designing iDTV Applications. In:

ICEIS’14, 16th International Conference on Enterprise Information Systems. SciTe-

Press, Lisboa, pp. 14-27 (2014b)

EPTV, Portal de Notícias EPTV. http://www.viaeptv.com, March 2014.

Fogg, B.J. A behavior model for persuasive design. In: Persuasive’09, 4th International

Conference on Persuasive Technology. Art No. 40. ACM Press, NY, pp. 1-7 (2009)

Fórum, Fórum Brasileiro de TV Digital. http://www.forumsbtvd.org.br, July 2014.

Kunert, T. User-Centered Interaction Design Patterns for Interactive Digital Television

Applications, Springer, Berlin (2009)

Liu, K. Semiotics in Information Systems Engineering. 1st edition. Cambridge Universi-

ty Press. UK. (2000).

Müller, M. J., Haslwanter, J. H., Dayton, T. Participatory Practices in the Software

Lifecycle. In: Helander, M. G., Landauer, T. K., Prabhu, P. V. (ed.). Handbook of

Human-Computer Interaction, 2. ed. Amsterdam: Elsevier, pp. 255-297 (1997)

Pressman, R. S. Software Engineering - A Practitioner's Approach, McGraw-Hill International Edition, 5th Edition (2001)

Terra da Gente, Portal Terra da Gente. http://www.terradagente.com.br, March 2014.

Veiga, E. G. Modelo de Processo de Desenvolvimento de Programas para TV Digital e Interativa. 141 pp. Master's thesis - Computer Networks, Universidade Salvador, January (2006)

Waisman, T. Usabilidade em Serviços Educacionais em Ambiente de TV Digital. 201 pp. Doctoral thesis - Escola de Comunicação e Artes, Universidade de São Paulo, January (2006)


Processors Power Analysis

Jéssica T. Heffernan1, Rodolfo Azevedo1

1Instituto de Computação

Universidade Estadual de Campinas (UNICAMP) – Campinas, SP – Brazil


Abstract. This work aims to generate power tables for processor-platform pairs, which serve as inputs to benchmark characterization software. We used MIPS and SPARC processors synthesized for Xilinx and Altera FPGAs as well as for ASICs with a 45nm library, generating models on which each ISA instruction was simulated separately [Tiwari 1996], providing a power activity profile. The estimates were evaluated at different operating frequencies in order to develop a linear dependence model.

1. Introduction

With the increasing complexity of digital systems, power consumption has become an

important design metric. In this context, the acSynth framework [Guedes et al 2013]

was created for power analysis of MIPS-I and SPARCv8 processors. The development

flow is shown in Figures 1 and 2.

Figure 1. Overview of acSynth development flow.

Figure 2. Detailed characterization flow.

Analysis is done at the instruction level [Tiwari 1996]. For each instruction, acPowerGen creates a characterization program in assembly, containing repetitions of that instruction with randomly generated registers and immediate values. A testbench submits this program to a timing simulation, which produces a file with the switching activity of the signals. This file is used by a power analyzer to generate a power consumption report.
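The generation of a characterization program can be pictured with a minimal Python sketch. It is not acPowerGen itself: the function names, the register range and the MIPS-like operand syntax are assumptions made only for illustration.

```python
import random

def random_instruction(mnemonic, rng, immediate=False):
    """Build one MIPS-like instruction line with random operands (hypothetical format)."""
    # Register $0 is skipped because it is hard-wired to zero in MIPS.
    rd = rng.randint(1, 25)
    rs = rng.randint(1, 25)
    if immediate:
        imm = rng.randint(-32768, 32767)  # signed 16-bit immediate
        return f"{mnemonic} ${rd}, ${rs}, {imm}"
    rt = rng.randint(1, 25)
    return f"{mnemonic} ${rd}, ${rs}, ${rt}"

def make_characterization_program(mnemonic, repeats, immediate=False, seed=0):
    """Return a list of lines repeating a single instruction with random operands."""
    rng = random.Random(seed)  # a fixed seed keeps the program reproducible
    return [random_instruction(mnemonic, rng, immediate) for _ in range(repeats)]
```

Calling `make_characterization_program("addu", 35)` yields 35 lines of the same instruction with different operands; changing `seed` gives a distinct program for the same instruction.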

acSynth is responsible for iterating the testbench over all programs generated by acPowerGen and for organizing the data obtained at the end of the characterization step in a single database in CSV format. This database is included, along with the PowerSC library, in a simulator generated by the acSim tool with the acPower parameter enabled, providing an energy profile of the benchmarks executed on that processor core. The framework may also be extended to multicore platforms. MPSoCBench [Duenha et al 2014] is a benchmark composed of a scalable set of MPSoCs; one of its features is power consumption estimation for MIPS and SPARC processors, making use of the tables generated by acSynth.
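How a per-instruction power table can turn into an energy profile is sketched below in Python. The CSV layout (columns `instruction` and `power_mw`) and the one-cycle-per-instruction default are assumptions for illustration; the actual acSynth table format and the PowerSC accounting are not described here.

```python
import csv
import io

# Hypothetical table layout; the real acSynth CSV columns may differ.
EXAMPLE_TABLE = """instruction,power_mw
add,10.0
lw,20.0
"""

def load_power_table(csv_text):
    """Map each instruction name to its average power in mW."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["instruction"]: float(row["power_mw"]) for row in reader}

def trace_energy_pj(trace, table, freq_mhz, cycles_per_instruction=1):
    """Sum the energy of an executed instruction trace, in picojoules."""
    period_ns = 1000.0 / freq_mhz  # clock period in nanoseconds
    # power (mW) * time (ns) gives energy in pJ
    return sum(table[name] * period_ns * cycles_per_instruction for name in trace)
```

With the example table, the trace `["add", "lw", "add"]` at 100 MHz adds up to 400 pJ under these assumptions.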

In most cases, the generated data is concatenated in internal processes managed

by the framework itself. However, some tasks still rely on external tools, requiring

manual intervention. The user must provide two files needed to integrate acSynth with

other EDA tools:

• TCL script: must describe the steps for obtaining a feasible simulation netlist from the processor HDL description, as well as the target frequency for the synthesis. For FPGAs, it is also necessary to specify the family, package and speed grade of the platform. For ASICs, the desired standard cell library must be indicated.

• Testbench: responsible for executing the block of initialization instructions, storing the characterization program in an internal ROM or RAM and controlling the loop of this program. The clock must be set to the desired operating frequency and the processor signals exported to a VCD file.

The use of FPGAs introduces an energy overhead over ASICs as a result of the programmable logic. However, this overhead is not uniform: it depends on the size, number of components, voltage levels and operating frequencies supported by the platform. The classification into families indicates which development metric is prioritized, guiding the designer's choice. ASIC standard cells also vary among themselves according to the desired operating conditions.

2. Objectives

The present work aims to expand the database of power consumption tables for MIPS-I

and SPARCv8 architectures. The evaluated technologies are summarized in Table 1.

Regarding the frequency, we sought to design a model that provides linear

adjustment coefficients αk for each processor-platform pair (k) used for testing. With

this, once a pair is simulated, it isn’t necessary to make new simulations to obtain

estimates for different frequencies.
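The idea of reusing one characterization for other frequencies can be sketched as an ordinary least-squares line fit in Python. This is an illustrative sketch, not the authors' model: whether their model includes an intercept is not stated, so the intercept here is an assumption, and the function names are hypothetical.

```python
def fit_line(freqs_mhz, powers_mw):
    """Least-squares fit of power = slope * frequency + intercept."""
    n = len(freqs_mhz)
    mean_f = sum(freqs_mhz) / n
    mean_p = sum(powers_mw) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs_mhz)
    sxy = sum((f - mean_f) * (p - mean_p) for f, p in zip(freqs_mhz, powers_mw))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_f
    return slope, intercept

def estimate_power(slope, intercept, freq_mhz):
    """Estimate power at a frequency that was never simulated."""
    return slope * freq_mhz + intercept
```

Once `fit_line` has been applied to the simulated points of a processor-platform pair, `estimate_power` gives a value for any other frequency inside the fitted range without a new simulation.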

Table 1. Processor-platform pairs analyzed in this work.

Processor: Leon3 (SPARCv8)

FPGA
Manufacturer  Family      Model      Package  Speed  Technology  Simulation Frequency (MHz)
Xilinx        Spartan3    XC3S1000   FT256    4      90nm        40 / 50 / 150 / 300 / 320 / 350 / 450 / 500 / 550
Xilinx        Spartan6    XC6SLX75T  FGG484   2      45nm        40
Xilinx        Virtex5     XC5VLX50T  FF1136   3      65nm        40

ASIC
StdCell  Technology              Simulation Frequency (MHz)
FreePDK  45nm (RVT – TT - 1.1V)  10 / 50 / 100 / 125 / 150 / 160

Processor: Plasma (MIPS-I)

FPGA
Manufacturer  Family      Model      Package  Speed  Technology  Simulation Frequency (MHz)
Xilinx        Spartan3E   XC3S1200E  FT256    5      90nm        100
Xilinx        Spartan6    XC6SLX75T  FGG484   2      45nm        20 / 40 / 100 / 160
Xilinx        Virtex4     XC4VLX15   FF668    12     90nm        100
Xilinx        Virtex5     XC5VLX50T  FF1136   3      65nm        20 / 40 / 100 / 160
Altera        CycloneV    5CGXFC7C7  F23      8      28nm        25 / 50 / 80 / 100 / 125 / 160 / 200 / 250 / 300 / 400 / 600
Altera        StratixIII  EP3SL150   F1152    2      60nm        100

ASIC
StdCell  Technology              Simulation Frequency (MHz)
FreePDK  45nm (RVT – TT - 1.1V)  50 / 125 / 250 / 312.5 / 400 / 455


3. Methodology

As shown in Table 1, Plasma and Leon3 were used as representatives of the MIPS and SPARC architectures. Since memory and cache consumption are not measured in this project, Plasma's processor component was included directly in the testbench and, regardless of the address indicated, the program flow in the software implementation of the memory was kept sequential, preventing access to illegal areas even with random parameters. For Leon3, however, the dependence between the processor and other modules of the AMBA platform required the use of a higher hierarchy level, which made it necessary to filter the data related to the processor. Also, since the memory was physically implemented as a sub-module of the top-level component, there was no way to intervene in the program flow, so operands had to be modified manually to ensure valid addresses. In both cases, instructions with similar behavior were grouped and only one was simulated, with its value manually replicated in the corresponding gaps of the CSV table.

In a first stage, we focused on the analysis of FPGA platforms. Later, we extended the acSynth framework to include support for ASICs. The external tools chosen for the build and power analysis steps were Altera (Quartus and PowerPlay) and Xilinx (ISE and XPower) for FPGAs, and Cadence RTL Compiler for the ASIC flow. For simulation, Mentor Modelsim and Cadence NCSim were used.

Altera and Xilinx provide board-specific libraries along with their software. Cadence tools, however, require third-party cell libraries. We used FreePDK [Stine et al 2007], an open-source 45nm Process Design Kit, which contains standard cell and pad libraries.

For the frequency analysis, timing reports were used to obtain the maximum frequency for the target pair under consideration. This value was then set as a synthesis constraint. The resulting netlist was used for simulations at several lower frequencies.

3.1. FPGAs

First, the simulations done by Guedes et al (2013) were repeated to validate the configured workspace. Next, we evaluated the influence of process variables on the results: synthesis parameters, the random seed of placement tools, the operands and number of iterations in characterization programs, and the operating frequency. For this, three different sets of synthesis parameters were defined, with two syntheses executed for each of them. Then, one of the sets was chosen and, starting with 10, the number of iterations was increased in steps of 5, searching for a value that minimizes the influence of the initialization code on the estimates. The VCD size was used as the stopping criterion, since XPower has a 2GB limitation. Finally, with the testbench and the TCL script fixed, the results of three sets of characterization programs were compared.

While expanding the database, we observed that the Leon3 estimates for Spartan6 were much higher than those for Spartan3 and investigated the causes of this discrepancy, since the results for Plasma had already confirmed the hypothesis that newer generations of the same family consume less dynamic power. We verified that Spartan3 was indeed underestimated and Spartan6 overestimated, as a result of unexpected behavior of the EDA tools when recording the signals of an inner component in the VCD file. For Spartan3, the information about the clock circuit had been lost, while for Spartan6, even with the restriction, the statistics of the entire clock network were recorded in the VCD file.

The approach taken to solve this was to change the clock generator component, duplicating the DCM output and routing one of the signals to the processor and the other to the peripherals. With this, the power analyzer report contains, besides the values for logic and signals of the top-level component and of the processor, an estimate for the CPU's internal clock. With this value, the total processor power is calculated as the sum of the clock, the internal logic and the internal signals plus a fraction of the energy consumed by the DCM, since the DCM is a resource shared between the CPU and the other modules. For Spartan6, a simulation following the previous approach provides all the necessary information in the same report. For Spartan3, it was necessary to execute a simulation recording the signals of the complete circuit, obtain the clock value and add it to the data obtained from a processor-restricted simulation. For other platforms, it was necessary to run a restricted simulation and check whether the clock consumption was reported as zero. From there, we could adopt one of the above approaches.
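The clock correction can be summarized in a short Python sketch. The function names are hypothetical and the share of the DCM attributed to the processor (`dcm_share`) is an assumption, since the fraction used is not stated in the text.

```python
def pick_clock_mw(restricted_clock_mw, full_circuit_clock_mw):
    """Use the restricted report's clock value unless it came out as zero."""
    if restricted_clock_mw > 0.0:
        return restricted_clock_mw
    # Clock information was lost in the restricted run: fall back to the
    # value taken from a simulation of the complete circuit.
    return full_circuit_clock_mw

def processor_power_mw(logic_mw, signals_mw, clock_mw, dcm_mw, dcm_share):
    """Total processor power: clock + logic + signals + a share of the DCM."""
    if not 0.0 <= dcm_share <= 1.0:
        raise ValueError("dcm_share must be between 0 and 1")
    return logic_mw + signals_mw + clock_mw + dcm_share * dcm_mw
```

For example, `processor_power_mw(10.0, 5.0, 14.6, 8.0, 0.5)` adds half of an 8 mW DCM to the other three terms; all of these numbers except the 14.6 mW clock value are made up for the example.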

3.2. ASICs

Since our purpose was to show that the acSynth flow could be extended to ASIC targets, frequency was the only process variable analyzed in this case. The synthesis attribute values used were essentially the defaults of RTL Compiler.

4. Results

For the Plasma replications on Spartan3E, 91.5% of the instructions had a deviation below 20%, with only slt, sltu and sra above 30%. On the Altera platform, the deviation is below 20% for 81% of the instructions, and for 68% of them it is below 10%. The jal instruction presents the largest deviation (28.21%), followed by the branches. For Leon3, 89% of the instructions have a deviation below 10%; the largest one occurs for mulscc_reg (29.3%).
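Summary figures of this kind can be computed as in the following Python sketch. It assumes the deviation is the absolute relative difference, in percent, between a reference table and a replicated one; the exact definition used in the work is not given, and the function names are hypothetical.

```python
def deviations_percent(reference, replicated):
    """Absolute relative deviation per instruction, in percent."""
    return {
        name: abs(replicated[name] - reference[name]) / reference[name] * 100.0
        for name in reference
    }

def share_below(deviations, threshold):
    """Percentage of instructions whose deviation is below the threshold."""
    values = list(deviations)
    return 100.0 * sum(1 for d in values if d < threshold) / len(values)
```

A call such as `share_below(deviations_percent(ref, new).values(), 20.0)` returns the percentage of instructions under a 20% deviation, and `max(devs, key=devs.get)` names the instruction with the largest deviation.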

While different syntheses with the same parameters generated identical results, changing the synthesis parameters caused a variation of up to 6% in consumption. Concerning the number of iterations, the stopping criterion was reached at 35 iterations. For xor_reg this value was not enough for complete convergence, whereas wry_reg, for example, converged in 15 iterations. For distinct characterization programs, the mulscc_reg instruction presented the largest deviation, 16.7%, which is expected since multiplication instructions depend on operand values. We therefore conclude that, despite the random parameters, the optimization step of the routing tools converges to similar results, but the parametric decisions made by the designer can significantly impact system operation.

As expected, estimates for Plasma instructions on Spartan and Virtex platforms were lower for newer generations. Between families, the hypothesis that Spartan would have lower consumption than Virtex was confirmed for the 90nm platforms. However, the comparison of Virtex5 and Spartan6 running at 100 MHz showed some instructions with lower values for Virtex. As the Spartan family targets lower operating frequencies, we plotted Power versus Frequency curves to evaluate whether this behavior was maintained at lower frequencies. Instructions that already had lower values for Spartan maintained this behavior, with a larger difference. Most instructions that had lower consumption for Virtex at 100 MHz reversed this relation at 40 MHz. The 32-bit load and store instructions only reversed it around 15 MHz.

For Leon3, the clock adjustment methodology described in Section 3.1 was adopted. On Spartan3 and Virtex5, the clock value was the same for all instructions and independent of the number of iterations of the characterization program. Thus, it was sufficient to add 14.6 mW and 30.27 mW, respectively, to the partial results obtained with a restricted simulation. For Spartan6, however, the clock consumption varied among instructions, although it remained independent of the number of iterations.

The analysis of the dependence between dynamic power and operating frequency confirmed the hypothesis of a linear relation. The maximum frequency reported for the Plasma-CycloneV pair was 400 MHz. As can be seen in Figure 3, below this limit the linear regression coefficient is greater than 0.99. For simulations at higher frequencies, the framework cannot predict the system behavior. A point outside this range was included to illustrate this effect, but it was not used to calculate the regression model.

For Leon3-Spartan3, the maximum frequency indicated was 500 MHz. Higher values prevented circuit operation, but beyond 320 MHz the estimates already decrease slightly or start increasing slowly, affecting the linear model. So, we chose to use this transition value as the upper limit indicated in the CSV tables. The ASIC curves required a similar approach: for Plasma, synthesis indicated a 600 MHz limit, but 400 MHz was identified as a behavioral change point and then imposed as the limit; for Leon3, the limit indicated was 160 MHz, but beyond 150 MHz the power profile is affected.
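One possible way to locate such a transition automatically is sketched below in Python. It is not the procedure used in this work, where the change points appear to have been identified from the curves: the sketch assumes that the quality measure is the squared Pearson correlation and reuses 0.99 as the threshold; the function names are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant series carries no linear trend
    return (sxy * sxy) / (sxx * syy)

def linear_limit(freqs_mhz, powers_mw, min_r2=0.99, min_points=3):
    """Highest frequency up to which the points still fit a line well."""
    points = sorted(zip(freqs_mhz, powers_mw))
    limit = None
    for end in range(min_points, len(points) + 1):
        xs = [f for f, _ in points[:end]]
        ys = [p for _, p in points[:end]]
        if r_squared(xs, ys) < min_r2:
            break  # adding this point breaks the linear behavior
        limit = xs[-1]
    return limit
```

The function grows the set of points from the lowest frequency upward and stops at the first point that pulls the fit below the threshold; the returned frequency could then be recorded as the upper limit of the table.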

Figure 3. Power vs Frequency for MIPS instructions in CycloneV

Figure 4. Power (mW) vs Time (s) for a Quad-MIPS-Spartan3 running basicmath

Although the model is applicable, the slope is not the same among the analyzed instructions, ranging from (0.49 ± 0.01) to (2.10 ± 0.01) mW/Hz for Plasma-Cyclone, from (0.156 ± 0.004) to (0.169 ± 0.06) mW/Hz for Leon3-Spartan3, from (32.15 ± 0.02) to (48 ± 1) µW/Hz for Plasma-ASIC and from (543.1 ± 0.1) to (562.32625 ± 0.00005) µW/Hz for Leon3-ASIC. This difference also occurs for the other pairs analyzed.

Even with a larger die, the ASIC circuit had power estimates 17 to 43 times lower than the circuit implemented in CycloneV. This reinforces the overhead of programmable logic. This comparison also revealed that, for FPGAs, critical operating conditions occurred for memory access instructions, while for ASICs this behavior was associated with arithmetic instructions. This is probably due to the fact that we are modeling only the processor core, so memory instructions use signals that pass through the circuit's pads. This is relevant since FPGAs include extra control logic for IOs.

Figure 4 shows an example of MPSoCBench output using acSynth tables for energy profiling of a quad-core MIPS on a Spartan3 platform.

References

Duenha, L., Almeida, H., Guedes, M., Boy, M. and Azevedo, R. (2014) “MPSoCBench:

A Toolset for MPSoC System Level Evaluation”, In: Proceedings of the International

Conference on Embedded Computer Systems: Architectures, Modeling and

Simulation.

Guedes, M., Auler, R., Duenha, L., Borin, E. and Azevedo, R. (2013) “An automatic

energy consumption characterization of processors using ArchC”, In: JSA.

Tiwari, V., Malik, S., Wolfe, A. and Lee, M. (1996) “Instruction level power analysis

and optimization of software”, In: Journal of VLSI Signal Processing, vol.13, pages

1–18.

Stine, J.E., Castellanos, I., Wood, M., Henson, J., Love, F., Davis, W.R., Franzon, P.D., Bucher, M., Basavarajaiah, S., Oh, J. and Jenkal, R. (2007) “FreePDK: An Open-Source Variation-Aware Design Kit”, In: Proc. IEEE Int’l Conf. Microelectronic Systems Education, IEEE CS, pp. 173-174.


A White-Box Approach for Model Transformation Testing

Erika R. C. de Almeida1, Eliane Martins1

1Instituto de Computação – Universidade Estadual de Campinas (Unicamp)
Caixa Postal 6.176 – 13.083-970 – Campinas – SP – Brasil


Abstract. Model transformations are the core elements of Model Driven Engineering (MDE). They should be developed according to the same software engineering patterns that are applied to a software system. Testing a model transformation differs from common software testing because of the complexity of the test data and the heterogeneity of the transformation languages. There are few approaches that try to automate one of the steps of model transformation testing. Our proposal is to create a white-box approach for test input generation that is independent of the model transformation language. This proposal is based on domain-specific languages (DSLs). The advantage of our proposal, besides being language-independent, is that it can be used in any phase of model transformation development, even when the documentation is not complete.

Resumo. Transformações de modelo são os pilares de Engenharia Dirigida por Modelo (MDE). Elas devem ser desenvolvidas com os mesmos padrões de engenharia que são aplicados nos sistemas de software. Testar uma transformação de modelo é diferente de realizar o mesmo processo para um software comum devido à complexidade dos dados de teste e à heterogeneidade das linguagens de transformação. Existem abordagens que tentam automatizar algum dos passos do teste da transformação. Nossa proposta é criar uma abordagem caixa-branca para geração dos dados de entrada que é independente da linguagem de transformação. Esta proposta é baseada em linguagens específicas de domínio (DSLs). As vantagens da nossa abordagem são: independência da linguagem de transformação e que ela pode ser aplicada em qualquer fase de desenvolvimento da transformação, mesmo que a documentação não esteja completa.

1. Context

Model transformations are the pillars of Model-Driven Engineering (MDE) and, as such, must be developed by applying engineering principles to ensure their correctness [Guerra 2012].

A model transformation is the essential artifact in applying MDE, since it represents the rules that must be applied so that, given an input model, the corresponding output model is obtained. When a model transformation is designed, the format of the input model, called the input metamodel, and that of the output model (the output metamodel) are taken into account. In this way, the transformation can be reused whenever there is an instance of the input metamodel X and an instance of the output metamodel Y is desired.


In MDE terms, model transformations are used to refine platform-independent models¹ or to transform them into platform-specific models². Figure 1 illustrates the process of the MDE approach. The MDE approach is completed with the generation of the system source code from the PSM by a mapping, also called a model-to-text transformation [Dai 2004]. It is worth noting that the PIM and the PSM are instances of a metamodel that specifies the elements of the model and how these elements may be related.

Figure 1. Transformation process between models in the MDE approach

Since a model transformation is reused many times, and an entire software system may be built with the same model transformation, it is essential that it be correct [Baudry et al. 2010]. To ensure this, Verification and Validation (V&V) techniques must be applied.

[Selim et al. 2012] argue that testing is the simplest and most efficient way to check whether a model transformation is correct. Even though, unlike formal methods, testing cannot guarantee the absence of bugs, it has its advantages, such as the possibility of testing a model transformation in its production environment.

However, there are some difficulties in testing model transformations: the inputs are not simple data but models, which are complex data structures (the input data are models that must conform to the input metamodel); management tools are lacking; and the languages for writing model transformations are heterogeneous. In addition, obtaining the test oracle³ is another problem that is quite hard to solve in this area [Baudry et al. 2006].

There are not many proposals in the literature for the problem of testing model transformations, and the existing ones focus on generating the test data (test models) or on generating the test oracle (or a partial oracle). The approaches can be classified as black-box or white-box [Gonzalez and Cabot 2012], depending on whether the source code of the model transformation is taken into account.

¹ The Platform-Independent Model (PIM) represents part of the system logic and some technical considerations.

² The Platform-Specific Model (PSM) is a refinement of a PIM and is described in terms specific to a platform on which the system is implemented.

³ The test oracle is the procedure responsible for deciding whether or not a test succeeded [de Farias 2003].


2. Objectives

Our proposal is to automatically generate the input data for model transformation testing using the white-box testing technique, while remaining independent of the model transformation language. To make the approach independent of the model transformation language, we will use domain-specific languages (DSLs). Thus, besides being language-independent, our approach has the further advantage that it can be applied in any phase of the development of the transformation, even if the documentation is not complete.

3. Related Work

Most works on model transformation testing devote part of the text to the difficulties encountered. Besides the factors mentioned above, there is also the problem of transformation chains, in which model transformation languages may be mixed. Another issue is that, in many cases, the specification of the model transformation is not available to help in the elaboration of test cases [Tiso et al. 2012].

The model transformation testing process is usually divided into 4 steps: (i) generation of the test data, (ii) assessment of the quality of the test data, (iii) generation of the test oracle, and (iv) execution of the model transformation and comparison of the results obtained with the expected results [Selim et al. 2012].

Among the proposals for test data generation, Brottier et al. [Brottier et al. 2006] defined an algorithm that, from the input metamodel and model fragments⁴ (provided by the tester or derived from the input metamodel), produces a set of test models. They specify the parts of the metamodel that must be instantiated with specific values that are interesting for testing, as in equivalence partitioning. Then, given these inputs, the algorithm combines model fragments and completes them to build valid instances of the metamodel.

Another approach for test data generation, also black-box, is proposed by Guerra [Guerra 2012], in which a language is created for specifying the model transformation in order to derive both the test data and the test oracle. This language is used to specify the requirements on the input models of the transformation (preconditions), the expected properties (postconditions), and the properties that must hold before and after a transformation (invariants).

The test data are obtained from the specification using solvers⁵. This approach differs from the others in that it takes into account the properties of interest in the specification, whereas the other proposals usually rely on the input metamodel or on the transformation itself. Another interesting point is that the same artifact is used to generate both the input data and the oracle. According to the author, this is an advantage; it should be noted, however, that an error in the specification itself will go undetected.

⁴ A model fragment is composed of a set of object fragments. An object fragment, in turn, is a set of property constraints that specify the ranges that must be exercised.

⁵ A solver is software that tries to find an input model satisfying the conditions imposed on that model [Guerra 2012].

One of the few white-box approaches to test data generation is ATLTest [Gonzalez and Cabot 2012]. A traditional white-box testing process usually has two steps: obtaining a control-flow graph or a data-flow graph from the source code, and generating the test data set by traversing one of these graphs according to a test criterion (or coverage criterion). Although essentially the same, test generation with ATLTest differs in some respects, mainly because of the mix of declarative and imperative constructs in ATL and the complex nature of model transformations. The process consists of three steps:

• Obtaining a dependency graph (which plays the same role as the control-flow or data-flow graph in the traditional approach).

• Traversing the graph under a coverage criterion. This means assigning Boolean values to the different conditions; each traversal yields a set of conditions that characterizes a set of test data relevant to the transformation.

• Using solvers to generate the test models.
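The last two steps can be illustrated with a minimal Python sketch. It does not reproduce ATLTest: the two guard conditions are hypothetical, every combination of truth values is enumerated (one possible coverage criterion), and the "solver" is a toy stand-in that simply searches a list of candidate models for one that realizes each combination.

```python
from itertools import product

# Hypothetical guard conditions taken from the rules of a transformation.
CONDITIONS = {
    "is_abstract": lambda model: model["abstract"],
    "has_attributes": lambda model: len(model["attributes"]) > 0,
}


def truth_assignments(names):
    """Yield every combination of Boolean values for the named conditions."""
    for values in product([True, False], repeat=len(names)):
        yield dict(zip(names, values))


def find_model(assignment, candidates):
    """Toy stand-in for a solver: first candidate that satisfies the assignment."""
    for model in candidates:
        if all(CONDITIONS[name](model) == value
               for name, value in assignment.items()):
            return model
    return None  # no candidate realizes this combination


def generate_test_models(candidates):
    """Pair each truth assignment with a model that realizes it, if any."""
    names = sorted(CONDITIONS)
    return [(assignment, find_model(assignment, candidates))
            for assignment in truth_assignments(names)]


CANDIDATES = [
    {"abstract": True, "attributes": ["id"]},
    {"abstract": True, "attributes": []},
    {"abstract": False, "attributes": ["id", "name"]},
]

results = generate_test_models(CANDIDATES)
for assignment, model in results:
    print(assignment, "->", model)
```

A combination that maps to `None` signals a set of conditions for which no test model was found; a real constraint solver would either construct such a model or show that the combination is unsatisfiable.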

Although black-box techniques are more popular because they are independent of the language in which the model transformation is implemented, the authors of ATLTest argue that a formal specification is rarely at hand when the implementation is carried out, which makes black-box techniques harder to apply.

Unlike the proposals for test data generation, Fleurey et al. [Fleurey et al. 2009] created an approach for evaluating test data with respect to their coverage of the test requirements of the input metamodel. This approach also uses the concept of model fragments and checks whether the test data cover all fragments according to the chosen criterion (the tool developed by the authors offers two coverage criteria).
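As a rough illustration of this kind of evaluation, the Python sketch below measures which value ranges of the input metamodel are exercised by a set of test models. It is not the tool of Fleurey et al.: the representation of models as dictionaries, the property names, and the single "every range at least once" criterion are assumptions made for the example.

```python
def covered_ranges(test_models, partitions):
    """Return the (property, low, high) ranges hit by at least one test model."""
    covered = set()
    for model in test_models:
        for prop, ranges in partitions.items():
            value = model.get(prop)
            if value is None:
                continue  # this model does not set the property
            for low, high in ranges:
                if low <= value <= high:
                    covered.add((prop, low, high))
    return covered


def range_coverage(test_models, partitions):
    """Fraction of all ranges that the test models exercise."""
    total = sum(len(ranges) for ranges in partitions.values())
    return len(covered_ranges(test_models, partitions)) / total


# Hypothetical partitions and test models.
PARTITIONS = {
    "Class.attributes": [(0, 0), (1, 1), (2, 10)],
    "Class.isAbstract": [(0, 0), (1, 1)],
}
TEST_MODELS = [
    {"Class.attributes": 0, "Class.isAbstract": 1},
    {"Class.attributes": 5},
]

print(range_coverage(TEST_MODELS, PARTITIONS))
```

The ranges that remain uncovered tell the tester which additional models are worth writing.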

The oracle problem, in turn, is considered harder to solve. There are some proposals, such as an approach that creates partial oracles through contracts written in OCL⁶. [Baudry et al. 2006] present three kinds of contract that can be used as partial oracles: contracts on the transformation specification, on the output model, and on the transformation itself (divided into subtransformations).

4. Method and expected results

Since few approaches to model transformation testing exist and all white-box approaches to test data generation depend on the model transformation language, we propose a new white-box approach to test data generation that is independent of the transformation language.

Because it requires only the source code of the transformation, the approach can be used at any stage of the transformation's development: it does not require a complete specification.

⁶ The Object Constraint Language (OCL) is a declarative language created to specify properties of models in a formal yet understandable way [Richters and Gogolla 1998].

Our proposal has three steps similar to those of ATLTest [Gonzalez and Cabot 2012], but the input data and the intermediate artifacts differ. First, to make the approach independent of the transformation language, we will rely on the concept of a domain-specific language (DSL): we will define a DSL format into which a model transformation written in any transformation language can be converted, so that our testing approach can then be applied.

With the transformation source code in the DSL format, the three steps for generating test data begin. The last two are similar to those of ATLTest and differ only in the coverage criteria used. For the first step, instead of a dependency graph we propose to use a state diagram, since it is simple to use and analyze. The state diagram will also make it possible to check whether the transformation contains errors.
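To make the idea of a coverage criterion over a state diagram concrete, here is a minimal Python sketch of all-transitions coverage. The diagram, its state names, and its labels are hypothetical, and the criterion is only one of those that could be chosen; the proposal itself does not fix them yet. Each path starts at the initial state and would later be turned into the conditions a test model has to satisfy.

```python
from collections import deque


def shortest_path_to(transitions, start, target):
    """Breadth-first search: list of transitions leading from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for src, label, dst in transitions:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, label, dst)]))
    return None  # target is unreachable from start


def all_transitions_paths(transitions, start):
    """Return paths from start that together cover every reachable transition."""
    paths, covered = [], set()
    for transition in transitions:
        if transition in covered:
            continue
        prefix = shortest_path_to(transitions, start, transition[0])
        if prefix is None:
            continue  # unreachable: a hint that the transformation has dead code
        path = prefix + [transition]
        covered.update(path)
        paths.append(path)
    return paths


# Hypothetical state diagram of a single transformation rule.
DIAGRAM = [
    ("init", "match", "rule"),
    ("rule", "guard_true", "create_target"),
    ("rule", "guard_false", "skip"),
    ("orphan", "never", "skip"),
]

paths = all_transitions_paths(DIAGRAM, "init")
for path in paths:
    print(" -> ".join(label for _src, label, _dst in path))
```

The transition leaving `orphan` is never covered because no path reaches it, which is the kind of anomaly the diagram could expose in a transformation.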

To validate our proposal, the following steps will be carried out:

• Definition of the DSL format for representing the source code of model transformations.

• Definition of the steps for converting the DSL format into a state diagram.

• Definition of the state diagram coverage criteria that can be applied in our approach.

• Practical application to a real example:

- Creation of a converter from ATL (the language most used in MDE applications) to the DSL format.

- Conversion of a model transformation written in ATL to the DSL format.

- Generation of the state diagram and variation of the coverage criteria.

- Generation of the test data.

- Validation of the test data.

To ensure that good test data are generated (here, the input models of the model transformation), we will use the tool by Fleurey et al. [Fleurey et al. 2009], which measures the coverage of the model fragments obtained from the input metamodel.

Another option for assessing the quality of the test data is mutation analysis, as discussed in [Mottu et al. 2006].
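Mutation analysis can be sketched in a few lines of Python. The toy transformation and its two mutants below are hypothetical and far simpler than a real model transformation; the point is only to show how a mutation score separates a weak set of test models from a stronger one. A mutant counts as killed when at least one test model makes its output differ from that of the original.

```python
def transform(model):
    """Toy transformation: every non-abstract class becomes a table name."""
    return sorted(c["name"].lower() for c in model if not c["abstract"])


def mutant_ignores_abstract(model):
    """Mutant: the filter on abstract classes was removed."""
    return sorted(c["name"].lower() for c in model)


def mutant_keeps_case(model):
    """Mutant: the lower-casing of names was removed."""
    return sorted(c["name"] for c in model if not c["abstract"])


def mutation_score(original, mutants, test_models):
    """Fraction of mutants whose output differs on at least one test model."""
    killed = sum(
        1 for mutant in mutants
        if any(mutant(model) != original(model) for model in test_models)
    )
    return killed / len(mutants)


MUTANTS = [mutant_ignores_abstract, mutant_keeps_case]
WEAK_TESTS = [[{"name": "order", "abstract": False}]]
STRONG_TESTS = WEAK_TESTS + [[
    {"name": "Base", "abstract": True},
    {"name": "Item", "abstract": False},
]]

print(mutation_score(transform, MUTANTS, WEAK_TESTS))
print(mutation_score(transform, MUTANTS, STRONG_TESTS))
```

The weak set kills no mutant because its only model contains neither an abstract class nor an upper-case name, whereas the second model exercises both.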

5. Conclusion

This work discussed the need for model transformation testing and the main approaches to it, to which both black-box and white-box techniques can be applied. Moreover, many approaches aim to automatically generate some of the artifacts needed in the testing process, such as the input data or the test oracles.


We also presented our doctoral research proposal, which consists of a new approach to generating test data for model transformation testing. Our approach uses the white-box technique and has the advantages of being independent of the language in which the transformation is written and of being applicable at any stage of the transformation's development.

We also described the steps needed to complete this approach. It is worth noting that we will validate it on real examples, unlike many works in the literature, which only use simple examples.

References

Baudry, B., Dinh-Trong, T., Mottu, J.-M., Simmonds, D., France, R., Ghosh, S., Fleurey, F., Le Traon, Y., et al. (2006). Model transformation testing challenges. In ECMDA Workshop on Integration of Model Driven Development and Model Driven Testing.

Baudry, B., Ghosh, S., Fleurey, F., France, R., Le Traon, Y., and Mottu, J.-M. (2010). Barriers to systematic model transformation testing. Communications of the ACM, 53(6):139–143.

Brottier, E., Fleurey, F., Steel, J., Baudry, B., and Le Traon, Y. (2006). Metamodel-based test generation for model transformations: an algorithm and a tool. In Software Reliability Engineering, 2006. ISSRE'06. 17th International Symposium on, pages 85–94. IEEE.

Dai, Z. R. (2004). Model-driven testing with UML 2.0. Computer Science at Kent, page 179.

de Farias, C. M. (2003). Um método de teste funcional para verificação de componentes. PhD thesis, Universidade Federal de Campina Grande.

Fleurey, F., Baudry, B., Muller, P.-A., and Le Traon, Y. (2009). Qualifying input test data for model transformations. Software & Systems Modeling, 8(2):185–203.

Gonzalez, C. A. and Cabot, J. (2012). ATLTest: a white-box test generation approach for ATL transformations. Springer.

Guerra, E. (2012). Specification-driven test generation for model transformations. In Theory and Practice of Model Transformations, pages 40–55. Springer.

Mottu, J.-M., Baudry, B., and Le Traon, Y. (2006). Reusable MDA components: a testing-for-trust approach. In Model Driven Engineering Languages and Systems, pages 589–603. Springer.

Richters, M. and Gogolla, M. (1998). On formalizing the UML object constraint language OCL. In Conceptual Modeling–ER'98, pages 449–464. Springer.

Selim, G. M., Cordy, J. R., and Dingel, J. (2012). Model transformation testing: the state of the art. In Proceedings of the First Workshop on the Analysis of Model Transformations, pages 21–26. ACM.

Tiso, A., Reggio, G., and Leotta, M. (2012). Early experiences on model transformation testing. In Proceedings of the First Workshop on the Analysis of Model Transformations, pages 15–20. ACM.


A Solution for Web Service Monitoring Based on Software Product Lines

Rômulo J. Franco1, Cecília M. Rubira1

1Instituto de Computação – Universidade Estadual de Campinas

{romulo, cmrubira}@ic.unicamp.br

Abstract. We present a Web service monitoring solution that monitors the values of QoS attributes and supports the understanding of the factors behind the degradation of those attributes within the context of a monitoring process. The solution adopts methods based on Software Product Lines to exploit the software variability found in monitoring systems, generating a family of monitors that addresses different QoS attributes and different IT resources as targets, and to which different modes of operation can be applied. Through a case study we show the feasibility of the solution, which produced effective results both in monitoring the delivered values of QoS attributes and in understanding attribute degradation in terms of the context in which the service is deployed.

1. Context

Web service monitoring systems are currently used to support the strong expansion of models such as Service-Oriented Architecture (SOA). The adoption of methods such as Software Product Lines (SPL) is related to models such as SOA, since both support software reuse. Owing to inherent properties of SOA, such as dynamism, heterogeneity, distribution, service autonomy, and uncertainty of the execution environment, the values of quality of service (QoS) attributes (e.g., availability and performance) may fluctuate over time [1]. Consequently, services must be monitored over time to ensure that they deliver the expected level of quality, that is, the level defined by the service provider.

This work presents a solution that monitors a set of QoS attributes, such as availability, performance, and reliability, on different monitoring targets, such as Web services, servers, the application server, and the communication network. Different ways of obtaining data on these attributes are also applied, for example invoking the service or intercepting messages. Having identified this software variability, we adopted Software Product Line (SPL) techniques as the basis for solving the problem.

The original vision of SPL consists of a domain conceived from business needs, formulated as a family of products with an associated production process [6]. This domain supports the rapid construction and evolution of customized products as customer requirements change. Software variability is a fundamental concept in SPL and refers to the ability of a software system or artifact to be modified, customized, or configured for use in a specific context [7].

Software variability can be represented by a feature model, which classifies the features of systems into common and variable ones [5]. This classification can be refined further into optional, alternative, and mandatory features [3].

2. Proposed Solution

Typically, Web services can be monitored by intercepting messages and by usage tests that invoke the service and collect the responses for later analysis. When an anomaly is detected, the parties interested in the monitoring are notified.

2.1. Proposed taxonomy for Web service monitoring

A systematic literature review was carried out to support the identification of features. The review followed the Experimental Software Engineering guidelines proposed by Kitchenham (2004) [8]. It progressively supported the refinement of a taxonomy created for the solution and helped us understand the elements that make up the taxonomy. It also gave us a more accurate picture of the gaps left by existing solutions. Our solution attempts to fill these gaps and thereby improve Web service monitoring.

To perform this kind of monitoring, basic design questions must be considered: 1) Where will the monitoring take place (target)? 2) What is the monitoring meant to obtain, or what should be monitored (QoS attributes)? 3) How can this be done (mode of operation)? 4) How frequently should monitoring run? 5) By what means are the results or alerts generated by the monitoring delivered (notification)?

These questions can be mapped onto a classification for Web service monitoring, as described below.

2.2. Monitoring Targets

The monitoring taxonomy comprises a set of targets: one or more Services, which are hosted in an Application Server (e.g., Apache Tomcat¹), which in turn is installed on a Server with resources such as memory and hard disk; finally, communication between services takes place over the Network, via an IP address.

¹ Application Server - Apache Tomcat - http://tomcat.apache.org/


2.3. QoS Attributes

The primary goal of monitoring is to collect information about QoS attributes. The results of monitoring can later be used for specific purposes, for example to improve the quality of the service. The QoS attributes considered here are listed per target in Table 1.

Table 1. QoS attributes per monitoring target

QoS attribute    Service    Server    Application Server    Network
Performance • •
Availability • •
Reliability •
Accuracy •
Robustness •
Hardware condition •
Failures detected in logs •

The service-related QoS attributes are those of the model recommended by the W3C [9]. The other monitored resources, such as the server and the Application Server log files, help in understanding the factors behind the degradation of QoS attributes [10].

2.4. Mode of Operation

QoS attributes can be obtained from temporal events, to which the invocation and/or interception modes of operation apply: metrics are collected by computing invocation, processing, and response times. QoS attributes such as availability, performance, and accuracy are tied to temporal events. Attributes can also be obtained through the inspection mode of operation, which examines the state and condition of resources such as disks and memory or identifies failures in log files.
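As an illustration of how temporal events can be turned into attribute values, the following Python sketch derives availability and mean response time from a list of invocation records. The record layout and both formulas (successful invocations over total invocations, and the average round-trip time of successful invocations) are common definitions assumed for the example; the text does not state the exact formulas used by the solution.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Invocation:
    sent_at: float      # seconds: when the request left the monitor
    received_at: float  # seconds: when the response (or timeout) came back
    succeeded: bool     # whether a valid response was received


def availability(samples):
    """Fraction of invocations that succeeded (assumed definition)."""
    return sum(s.succeeded for s in samples) / len(samples)


def mean_response_time(samples):
    """Average round-trip time of successful invocations, or None if none."""
    times = [s.received_at - s.sent_at for s in samples if s.succeeded]
    return mean(times) if times else None


# Hypothetical samples collected once per minute.
SAMPLES = [
    Invocation(0.0, 0.5, True),
    Invocation(60.0, 60.25, True),
    Invocation(120.0, 125.0, False),  # timed out
    Invocation(180.0, 180.75, True),
]

print(availability(SAMPLES), mean_response_time(SAMPLES))
```

The same records could come from either mode of operation: the monitor issues the invocations itself, or it timestamps intercepted request and response messages.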

2.5. Execution Frequency

Monitoring may be continuous or intermittent. It may also be triggered by the use of the service by a third-party application, in which case the monitoring tool waits for a request. This kind of frequency amounts to passive monitoring: the monitoring is not proactive, and the values of QoS attributes may be out of date.

2.6. Notification Type

Finally, interested parties can be notified by sending a message or by saving the results to log files.
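The taxonomy of Sections 2.2 to 2.6 can be summarized as a small configuration type, sketched below in Python. The class, enumeration, and field names are hypothetical and do not come from the implementation; the single validation rule encodes only what Section 2.4 states, namely that attributes tied to temporal events are obtained by invocation or interception rather than by inspection.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    SERVICE = "service"
    SERVER = "server"
    APPLICATION_SERVER = "application server"
    NETWORK = "network"


class Mode(Enum):
    INVOCATION = "invocation"
    INTERCEPTION = "interception"
    INSPECTION = "inspection"


class Frequency(Enum):
    CONTINUOUS = "continuous"
    INTERMITTENT = "intermittent"
    ON_REQUEST = "on request"  # passive: waits for a third-party request


class Notification(Enum):
    MESSAGE = "message"
    LOG_FILE = "log file"


# Attributes that Section 2.4 ties to temporal events.
TEMPORAL_ATTRIBUTES = {"availability", "performance", "accuracy"}


@dataclass(frozen=True)
class MonitorConfig:
    target: Target
    attribute: str
    mode: Mode
    frequency: Frequency
    notification: Notification

    def problems(self):
        """Return a list of reasons why this configuration is inconsistent."""
        issues = []
        if self.attribute in TEMPORAL_ATTRIBUTES and self.mode is Mode.INSPECTION:
            issues.append("temporal attributes need invocation or interception")
        return issues


config = MonitorConfig(Target.SERVICE, "availability", Mode.INVOCATION,
                       Frequency.CONTINUOUS, Notification.LOG_FILE)
print(config.problems())
```

Each valid combination of the five dimensions corresponds to one member of the monitor family; a feature model would add further constraints between the dimensions.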

3. Case study

We now present a case study that allowed us to assess the feasibility of the proposed solution. Its main purpose is to demonstrate that the solution can identify the degradation of a QoS attribute in a service composition, where there are several providers with their services and several resources, such as servers, the application server, and the communication network, all of which are subject to failure. When a failure occurs, it degrades some QoS attribute. Because there are many resources, monitoring all of them effectively becomes difficult. In this case, a family of monitors is generated to address the scenario exemplified here. At the end of monitoring, the data can be cross-referenced to determine what is responsible for the degradation, at the level of both providers and resources.

As shown in Figure 1, we consider a controlled scenario that simulates a typical SOA setting. The scenario consists of simple service providers (e.g., Provider A and Provider B) and a composite service provider, comprising the Master Provider and Consumer A. Other resources also considered in the scenario are the Gateway object and a Public IP object (Google).

Figure 1 also identifies each monitor in the scenario by numbered circles. Each monitor was generated by the SPL for a specific purpose and placed strategically to collect QoS attribute values from the targets. The generated monitors cover 1) Availability; 2) Performance; 3) Network (including gateways, routers, and the Google Public IP); 4) Server; 5) Reliability; and 6) Application Server. In short, an SPL-based family of products was generated (e.g., monitorarDisponibilidade.jar, monitorarDesempenho.jar, etc.).

Figure 1. Controlled scenario composed of providers, consumers, and the monitors placed strategically on the targets to be monitored

Running the six monitors for the scenario of Figure 1 took 14 hours of monitoring. This time was needed to ensure the quality of the study data. After several tests with different intervals, we were able to fit four interventions in the controlled scenario into these 14 hours. For example, in the first hour the network interface of the Master Provider was switched off; the other interventions switched off services and resources. The interventions were carried out at strategically chosen times to produce anomalies in the scenario and to capture information about the quality of the data generated by the monitoring.

From the values delivered for each QoS attribute, it was possible to cross-reference data from different monitors and identify where the degradation of a QoS attribute actually originated. The chart in Figure 2 shows one of the cross-references obtained.

Figure 2. Chart showing the result of the intervention applied to the case study scenario

In the case shown in Figure 2, even if the Master Provider were unaware of the unavailability caused by the intervention in the first hour, Provider A's monitoring of the Master Provider and of the Google Public IP would ensure the accuracy of the diagnosis offered by the monitoring solution. Thus, even if the Master Provider tried to disclaim responsibility for the degradation, Provider A could point to its real cause in this case. The results shown come from the network monitor pointed at the Google Public IP (Monitor 3 in Figure 1) and from the monitor of the availability of the Master Provider's service (Monitor 2 in Figure 1).
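The reasoning behind this cross-reference can be sketched in Python. The decision rule below is an assumption made for illustration and is not taken from the implementation: if the monitored service is unreachable while a public IP address is still reachable from the same observer, the observer's own connectivity is working, so the fault is attributed to the remote provider or its link; if both are unreachable, the observer's own network is suspected.

```python
def diagnose(service_up, public_ip_up):
    """Classify one observation interval from the readings of two monitors."""
    if service_up:
        return "ok"
    if public_ip_up:
        # The observer can reach the Internet, so its own network is fine.
        return "remote provider or its link"
    return "observer's own network"


def cross_reference(service_series, network_series):
    """Diagnose each interval by pairing the two monitors' time series."""
    return [diagnose(service, network)
            for service, network in zip(service_series, network_series)]


# Hypothetical hourly readings taken by Provider A.
MASTER_SERVICE_UP = [False, True, True, False]
PUBLIC_IP_UP = [True, True, True, False]

print(cross_reference(MASTER_SERVICE_UP, PUBLIC_IP_UP))
```

In the first interval the service is down while the public IP answers, which corresponds to the situation described for the first hour of the case study.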

4. Conclusions

The adoption of SPL techniques made it possible to generate a family of monitors. In the case study presented in Section 3, several monitors of the family were generated for the scenario, which illustrates one of the possibilities the solution offers: monitoring different targets and obtaining the values of different QoS attributes from each. The monitoring results from each target could then be cross-referenced to identify where the degradation of a given QoS attribute actually lay. The solution opens up other possibilities. For example, it can be applied to detect violations of Service Level Agreements (SLAs). It can also be used by network administrators to identify problems in IT infrastructure resources.

These possible applications stem from the flexibility gained by adopting SPL techniques. The flexibility mentioned here lies in addressing a set of QoS attributes whose values are obtained from different targets. With respect to QoS attributes, a solution of this kind is most flexible when any QoS attribute can be added to or removed from it. However, not every attribute can be obtained through a single mode of operation, so our solution also supports different modes of operation. In our internal studies we evaluated a total of 7 QoS attributes, with different modes of operation, on different monitoring targets.

The case studies also showed that the solution has advantages over existing solutions, such as [1, 4]. Nevertheless, some limitations were identified, which can be addressed in future work: 1) monitoring REST Web services; 2) including the TCP/ICMP communication protocols in the feature model; 3) self-adaptive monitoring, in which the monitoring understands and acts on the environment to make decisions about the impact of the monitoring itself.

References

[1] F. Souza, D. Lopes, K. Gama, N. S. Rosa, and R. Lima (2011). Dynamic event-based monitoring in a SOA environment. In OTM Conferences.

[2] B. Wetzstein, P. Leitner, F. Rosenberg, I. Brandic, S. Dustdar, F. Leymann (2009). Monitoring and Analyzing Influential Factors of Business Process Performance. EDOC, 141-150.

[3] H. Gomaa (2004). Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA.

[4] L. Baresi, S. Guinea, M. Pistore, M. Trainotti (2009). Dynamo + Astro: An Integrated Approach for BPEL Monitoring. ICWS 2009: 230-237.

[5] P. Clements and L. Northrop (2001). Software product lines: practices and patterns. Addison Wesley Longman Publishing Co., Inc., Boston, MA, USA.

[6] G. H. Campbell (2008). Renewing the product line vision. In Proceedings of the 2008 12th International Software Product Line Conference, Washington, DC, USA. IEEE Computer Society.

[7] J. V. Gurp, J. Bosch, and M. Svahnberg (2001). On the notion of variability in software product lines. In Proceedings of the Working IEEE/IFIP Conference on Software Architecture, WICSA '01, Washington, DC, USA. IEEE Computer Society.

[8] B. Kitchenham (2004). Procedures for performing systematic reviews. Technical Report TR/SE-0401, Keele University and NICTA.

[9] W3C (2003). QoS for Web Services: Requirements and Possible Approaches, W3C Working Group Note, 25 November 2003.

[10] H. Ludwig (2003). Web services QoS: external SLAs and internal policies or: how do we deliver what we promise? Fourth International Conference on Web Information Systems Engineering Workshops, 4:115-120, Dec. 2003.


User Association and Load Balancing in HetNets

Alexandre T. Hirata1, Eduardo C. Xavier1, Juliana F. Borin1

¹Institute of Computing – University of Campinas (Unicamp)
Campinas – SP – Brazil

[email protected], {eduardo,juliana}@ic.unicamp.br

Abstract. HetNets are a clever approach to increasing the capacity and coverage of cellular networks, in which low-power base stations share the load of high-power ones. However, they also bring new problems: user association strategies used in homogeneous networks are no longer as efficient. In this work, the user association problem is modeled as an integer linear programming (ILP) problem, and LP GREEDY, a heuristic based on this model, is presented. The heuristic produces good solutions in terms of the number of accepted users and of load balancing among base stations, compared with the optimal solution and with some strategies from the literature.


1. Introduction

HetNets (Heterogeneous Networks) have been proposed to increase the capacity of wireless networks by introducing pico cells into the existing coverage area of macro cells, which significantly increases user data throughput [Andrews et al. 2014]. They are composed of macro cells, pico cells, and femto cells (tiers 1, 2, and 3, respectively). Tier 1 cells have more capacity, a larger coverage area, higher transmission power, and higher signal strength than tier 2 cells, and so on.

Macro cells, created by base stations (BSs) called eNodeBs, are usually deployed by the carriers and are therefore placed so as to maximize coverage. Pico cells are deployed in high-density areas to share the macro cell load or to extend its coverage area, and so are femto cells, but with less power. The deployment of pico and femto cells is not necessarily planned, so it is reasonable to model their positions as random.

The capacity of a cell can be measured in Physical Resource Blocks (PRBs), the minimum unit of resource that can be allocated to a user equipment (UE). A UE is any equipment that uses a service through the cellular network, and it may be in the coverage area of one or more cells. A service can be any kind of telecommunications service, such as a call, a text message, a video conference, or audio streaming.

In homogeneous networks, mobile phones connect to the base station with the best signal strength in order to provide a better user experience. In HetNets, however, this strategy would make all mobile phones connect to macro cells, while pico and femto cells would receive little or no load.

The telecommunications industry is also looking for alternatives to relieve the load of cellular networks. VoWifi (Voice over Wi-Fi) is one of the efforts toward this goal. The idea is to use the Wi-Fi connection as much as possible to place calls instead of using the conventional cellular network.

This paper presents an integer linear programming (ILP) formulation of the user association and load balancing problem in HetNets, together with a heuristic, LP GREEDY, based on a greedy strategy applied to the solution of a relaxed version of the ILP model.

2. Related Work

[Viering et al. 2009] introduced a mathematical framework for quantitative investigations of self-optimizing wireless networks, with a focus on LTE. It was applied to a simple algorithm that adjusts cell-specific handover thresholds in order to balance the load and reduce the number of unsatisfied users.

[Lobinger et al. 2010] presented the results of simulations of a self-optimizing load balancing algorithm in an LTE network. This algorithm distributes the load of overloaded cells to their neighboring cells. Similarly, [Siomina and Yuan 2012] proposed an optimization framework for load balancing in LTE HetNets by means of cell range assignment using cell-specific offsets. The two works analyze the achieved load balance through simulations and mathematically, respectively, but they do not compare their strategies with other existing ones.

[Fooladivanda 2011] formulated a joint association, channel allocation, and inter-cell interference management problem as a non-linear NP-hard problem and compared it against SINR-based strategies. Afterwards, [Fooladivanda and Rosenberg 2013] proposed an association rule that combines user and resource allocation, with significant performance gains in throughput. [Ye et al. 2013] presented a distributed algorithm that considers cell association and resource allocation jointly and whose complexity is linear in the number of UEs and the number of BSs. The main goal of these works is user association; they analyze load balancing as a side effect.

3. System Model

The total capacity of a given cell is estimated as the effective number of PRBs available to transport data in a timeframe; that is, the PRBs used to control the communication are not considered. Although downlink and uplink do not behave alike in HetNets, this work focuses on downlink.

Each UE can connect to a single cell only, and it requests the minimum number of PRBs that guarantees the minimum data rate of its service. There is no priority among services; that is, emergency services are not considered.

Let us consider that all UEs have the same nominal transmit power but request different services. Despite having the same transmit power, the UEs are not all subject to exactly the same interference; interference is therefore another factor that affects the minimum number of PRBs required.

Furthermore, let the interference, the position of the UE, and the minimum number of PRBs required by a UE be constant in the considered timeframe.

4. Problem Formulation

There are many association strategies that aim to associate UEs with BSs so as to maximize the number of attended UEs. However, they treat load balancing as one more quality parameter rather than as the final objective. An ILP model for the system described in Section 3 is presented in Section 4.1. LP GREEDY is then defined in Section 4.2.

4.1. Integer Linear Programming Model

Objective:

Associate UEs to the cells considering just the quantity of available PRBs on downlink.

Definitions:
$M$: set of UEs
$B$: set of cells
$M_b = \{m \in M \mid m \text{ is in the coverage area of } b\}$, where $b \in B$
$c_b$: resource capacity of $b \in B$, in PRBs
$l_{mb}$: minimum load that $m \in M_b$ demands from $b \in B$ (considering the losses caused by interference), i.e., the minimum quantity of needed PRBs
$\alpha$: minimum percentage of UEs that must be attended (QoS), expressed as a fraction of $|M|$

Decision variables:
$x_{mb} = 1$ if $m \in M_b$ is attended by $b \in B$, and $0$ otherwise
$L_{\max}$: percentage of the used capacity of the most overloaded cell, expressed as a fraction of $c_b$
$L_{\min}$: percentage of the used capacity of the least overloaded cell, expressed as a fraction of $c_b$

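Model:

\begin{align}
\text{(P.1)} \quad \min \quad & L_{max} - L_{min} \notag \\
\text{s.t.} \quad & c_b L_{max} - \sum_{m \in M_b} l_{mb} x_{mb} \ge 0 && \forall b \in B \tag{1} \\
& c_b L_{min} - \sum_{m \in M_b} l_{mb} x_{mb} \le 0 && \forall b \in B \tag{2} \\
& \sum_{b \in B} x_{mb} \le 1 && \forall m \in M \tag{3} \\
& \sum_{m \in M_b} l_{mb} x_{mb} \le c_b && \forall b \in B \tag{4} \\
& \sum_{b \in B} \sum_{m \in M_b} x_{mb} \ge \alpha |M| \tag{5} \\
& x_{mb} \in \{0, 1\} && \forall m \in M, \forall b \in B \tag{6}
\end{align}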


Objective Function: Minimize the difference between the load of the most loaded cell and that of the least loaded one (load balancing).

(1) $L_{max}$ must be greater than or equal to the load of every cell
(2) $L_{min}$ must be less than or equal to the load of every cell
(3) Each UE must be attended by at most one cell
(4) The sum of the loads of all attended UEs in a cell cannot exceed its capacity
(5) The number of attended UEs must be at least a fraction $\alpha$ of all UEs
(6) Each UE either is or is not attended by a cell (0 or 1), i.e., $x_{mb}$ is a binary variable
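To make the model concrete, the following minimal sketch enumerates every assignment of a very small instance and keeps the one with the smallest load gap. It is not the solver used in this work; the function name, the dictionary layout, and the toy instance are assumptions made only for illustration, and the enumeration is exponential in the number of UEs.

```python
from itertools import product

def solve_p1_brute_force(loads, capacity, alpha):
    """Enumerate every assignment of a tiny instance of (P.1).

    loads[m][b] is l_mb for each cell b that covers UE m; capacity[b] is c_b.
    Returns (best objective, assignment), where assignment[m] is a cell or None,
    or None when no assignment satisfies the constraints.
    """
    ues = sorted(loads)
    # Each UE is served by one covering cell or by none: constraint (3).
    options = [[None] + sorted(loads[m]) for m in ues]
    best = None
    for choice in product(*options):
        used = {b: 0 for b in capacity}
        for m, b in zip(ues, choice):
            if b is not None:
                used[b] += loads[m][b]
        if any(used[b] > capacity[b] for b in capacity):   # constraint (4)
            continue
        attended = sum(b is not None for b in choice)
        if attended < alpha * len(ues):                     # constraint (5)
            continue
        ratios = [used[b] / capacity[b] for b in capacity]
        objective = max(ratios) - min(ratios)               # L_max - L_min
        if best is None or objective < best[0]:
            best = (objective, dict(zip(ues, choice)))
    return best
```

Because the objective is minimized, constraints (1) and (2) are tight at the optimum, so the sketch evaluates $L_{max}$ and $L_{min}$ directly as the largest and smallest load ratios.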

4.2. Greedy Linear Programming Heuristic

The ILP problem described previously gives the optimal solution, but it is too expensive to solve. Thus, to obtain a possibly good solution in average polynomial time, the integrality constraints were relaxed and a greedy strategy was applied to the resulting solution, which is generally infeasible for the original problem.

In this greedy strategy, given the solution of the relaxed linear program, the values of all decision variables $x_{mb}$ are enqueued in descending order. The queue is then consumed, and each UE is associated with the corresponding cell if it has not already been associated and if the target cell has enough available capacity to attend it. If the cell lacks capacity, the UE is not associated in this iteration, but it may still be associated in a later one as the queue is consumed.
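The rounding step can be sketched as follows, assuming the relaxed solution is already available as a dictionary. The function name and the data layout are hypothetical, and the LP solve itself is left out.

```python
def lp_greedy_round(x_relaxed, loads, capacity):
    """Round a fractional LP solution greedily.

    x_relaxed[(m, b)] is the relaxed value of x_mb in [0, 1];
    loads[(m, b)] is l_mb; capacity[b] is c_b.
    Returns a dict mapping each attended UE to its cell.
    """
    remaining = dict(capacity)
    assignment = {}
    # Queue of (m, b) pairs in descending order of the relaxed value.
    queue = sorted(x_relaxed, key=lambda mb: x_relaxed[mb], reverse=True)
    for m, b in queue:
        if m in assignment:
            continue                  # UE already associated
        if loads[(m, b)] > remaining[b]:
            continue                  # cell cannot take it; a later pair may
        assignment[m] = b
        remaining[b] -= loads[(m, b)]
    return assignment
```

A UE skipped for lack of capacity stays unassigned only if none of its later, lower-valued pairs fits either.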

5. Simulation

To evaluate the performance of the proposed heuristic, the scenario described in [Lien et al. 2012] was used. They adopt simulation parameters for UEs from [3GPP 2011] that had been agreed by 3GPP. In this simulation, seven macro cells and three pico cells are deployed according to Figure 1 [Lien et al. 2012]. The numbers within the cells indicate the percentage of the total number of UEs that are covered.

Figure 1. Simulation scenario

To create a test scenario, the positions of the UEs were drawn at random while respecting the distribution illustrated in Figure 1. For each fixed total number of UEs, 30 different test scenarios were generated. In addition, the Best Downlink, Best SINR, and Pico Cell First strategies [Fooladivanda 2011] were used to evaluate the efficiency of the ILP model and the proposed heuristic LP GREEDY.

Since both the ILP model and LP GREEDY depend on $\alpha$ from constraint (5), a binary search over the finite set $\{1.00, 0.99, 0.98, \ldots, 0.00\}$ was performed to find the largest value of $\alpha$ for which a feasible solution exists. Hence, both strategies attend as many UEs as possible and balance their load as evenly as possible.
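The search over $\alpha$ can be sketched as below. The callback `is_feasible` is a hypothetical stand-in for solving the model with a given $\alpha$; the sketch relies on feasibility being monotone, since lowering $\alpha$ only relaxes constraint (5).

```python
def largest_feasible_alpha(is_feasible):
    """Binary search over alpha in {0.00, 0.01, ..., 1.00}.

    is_feasible(alpha) is a hypothetical callback that solves the model for
    the given alpha and reports whether a solution exists.
    Returns the largest feasible alpha, or None if even 0.00 fails.
    """
    lo, hi = 0, 100            # alpha expressed in whole percent
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_feasible(mid / 100):
            best = mid
            lo = mid + 1       # feasible: try to attend more UEs
        else:
            hi = mid - 1       # infeasible: demand fewer attended UEs
    return None if best is None else best / 100
```

With 101 candidate values, the search needs at most seven solves per instance instead of up to 101 for a linear scan.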

6. Results

A simulator was implemented to run the proposed simulations, and the solver was given a time limit of 5 minutes. Due to this time limit, the ILP model could not be solved for instances with more than 1,000 UEs.

Figure 2. Attended Users

Figure 3. Load balancing

Figure 2 shows that LP GREEDY behaves at least as well as the best of the other strategies, especially for instances with more than 900 UEs, i.e., it can attend more UEs than the other strategies and it is the closest one to the optimal solution. Note that all strategies perform equally well on all instances with fewer than 800 UEs because there are no overloaded cells in these scenarios.

Figure 3 shows that LP GREEDY behaves better than the other strategies for instances with more than 1,000 UEs (overloaded cell scenarios); however, it is still somewhat far from the optimal solution provided by the ILP model.

Therefore, LP GREEDY produces results as good as the best non-optimal strategy in underloaded cell scenarios and better results in overloaded cell scenarios. Moreover, as it is based on linear programming rather than integer linear programming, it is faster than the ILP model. Furthermore, by aiming for load balancing, LP GREEDY not only achieved better load sharing but also attended more users, even in comparison with strategies that aim for user attendance.

7. Conclusions

In this work, the problem was formulated as an integer linear program, and a heuristic based on a greedy strategy over a relaxed version of this model was proposed. In addition, a simulator was implemented to test the proposed strategy against some existing ones. According to the presented results, the proposed strategy performs better than the other non-optimal strategies, mainly in overloaded cell scenarios, by attending more user equipments and by sharing the load more evenly.

8. Future Work

As a next step, strategies considering both downlink and uplink will be investigated, as well as online strategies. Furthermore, considering Carrier Aggregation will bring this work closer to 5G networks. In addition, many other restrictions can be added to the problem to bring it closer to real scenarios, such as considering priority among the services, increasing the diversity of the services, investigating power efficiency schemes, and studying other distributions of cells and user equipments.

References

3GPP (2011). Study on RAN improvements for machine-type communications. TR 37.868, 3GPP.

Andrews, J., Singh, S., Ye, Q., Lin, X., and Dhillon, H. (2014). An overview of load balancing in HetNets: old myths and open problems. Wireless Communications, IEEE, 21(2):18–25.

Fooladivanda, D. (2011). Joint channel allocation and user association for heterogeneous wireless cellular networks. 2011 IEEE 22nd International Symposium on Personal, Indoor and Mobile Radio Communications, pages 384–390.

Fooladivanda, D. and Rosenberg, C. (2013). Joint resource allocation and user association for heterogeneous wireless cellular networks. IEEE Transactions on Wireless Communications, 12(1):248–257.

Lien, S.-Y., Liau, T.-H., Kao, C.-Y., and Chen, K.-C. (2012). Cooperative access class barring for machine-to-machine communications. Wireless Communications, IEEE Transactions on, 11(1):27–32.

Lobinger, A., Stefanski, S., Jansen, T., and Balan, I. (2010). Load Balancing in Downlink LTE Self-Optimizing Networks. In Vehicular Technology Conference (VTC 2010-Spring), 2010 IEEE 71st, pages 1–5.

Siomina, I. and Yuan, D. (2012). Load balancing in heterogeneous LTE: Range optimization via cell offset and load-coupling characterization. 2012 IEEE International Conference on Communications (ICC), pages 1357–1361.

Viering, I., Dottling, M., and Lobinger, A. (2009). A mathematical perspective of self-optimizing wireless networks. In Communications, 2009. ICC '09. IEEE International Conference on, pages 1–6.

Ye, Q., Rong, B., Chen, Y., Al-Shalash, M., Caramanis, C., and Andrews, J. (2013). User Association for Load Balancing in Heterogeneous Cellular Networks. Wireless Communications, IEEE Transactions on, 12(6):2706–2716.


Phylogeny Reconstruction for Images and Videos

Filipe de O. Costa1, Anderson Rocha (Advisor)1, Zanoni Dias (Co-advisor)1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)

Abstract. In this work, our main objective is to design and develop approachesfor identifying the structure of relationships of duplicates in a given set ofsemantically similar images or videos, and reconstructing a structure thatrepresents their past history and ancestry information (phylogeny). For this,we aim at designing approaches that calculate the dissimilarities among imagesand videos and group them in distinct trees of processing history automatically.

Resumo. Neste trabalho, objetivamos a proposição e o desenvolvimento de abordagens para, dado um conjunto de duplicatas próximas, identificar automaticamente a relação entre elas e reconstruir a estrutura que represente o histórico de geração das mesmas (denominada filogenia). Para isso, iremos desenvolver abordagens que calculem a dissimilaridade entre duplicatas e as separem corretamente em árvores que representem a estrutura de relação entre elas de forma automática.

1. Introduction

In recent decades, the number of Internet and social network users has grown considerably, and sharing multimedia content has become simple. However, the ease with which such content can be shared and distributed allows users to obtain copies of an image or video on the Internet and share modified versions of it. In some cases, this can infringe copyright or even spread illegal digital content across the network.

In this context, many approaches have been developed to identify modified versions of digital images or videos in a set. In the literature, these are known as Near-Duplicate Detection and Retrieval (NDDR) techniques [Valle 2008, Joly et al. 2007]. Their goal is to verify, given a pair of images or videos, whether one is a near copy of the other, and to find all near copies of an image or video in a set.

A more challenging extension of the NDDR problem is, given a set of images or videos that an analyst knows to be near duplicates, to identify which one is the original and what structure generated the duplicates. In this scenario, we are interested in identifying the generation history of the duplicates, taking into account the relationships among them. This identification is useful, for example, for tracing illegal content without the need for digital watermarks, or for finding the original image or video within a set of duplicates.

Some recent research has achieved interesting results in defining a tree that represents the generation structure of a set of images [Rosa et al. 2010, Dias et al. 2010, Dias et al. 2013a] or videos [Dias et al. 2011]. However, it is important to note that there are situations in which one wants to find not a single tree but a set of phylogeny trees (a forest) that represents the relationships among the duplicates. This is the appropriate setting when it is not certain that all duplicates were generated from a single common ancestor.

This work seeks possible solutions to the image and video phylogeny problem: given a set of near duplicates, determine the tree (when all duplicates share a common ancestor) or the forest (when the set contains several subsets of near duplicates) that represents the generation relationships among the duplicates.

1.1. Objective

The main objective of this work is to develop approaches that identify the generation structure of a set of image and video duplicates. The main goal of the project is to solve the problem of finding forests (that is, the case in which a set of duplicates may contain several distinct subsets).

1.2. Scientific contributions and impact

The approaches presented can be useful, for example, in identifying the source of illegal images or videos that were improperly distributed over the Internet (e.g., child pornography videos or images whose copyright has been violated), given some duplicates of such documents. We are interested in finding both a phylogeny tree and a phylogeny forest (that is, one or more trees for each set of duplicates).

Because this is a relatively new problem, we believe that the proposed types of approaches will yield important scientific contributions to the study of image and video phylogeny, such as:

• A detailed study of the image and video phylogeny problem considering trees and forests, since approaches to this problem are recent and few studies have addressed it;
• The definition of algorithms that, given a set of near duplicates, find the phylogeny that represents the relationships among the duplicates. These algorithms will directly help with problems such as illegal content distribution, copyright infringement, and content-based image and video retrieval;
• Validation of the approach developed in this project for image and video phylogeny in an environment where the provenance of the documents is unknown, since this type of validation has received little attention in the literature;
• Analysis of image and video properties that can support the development of new dissimilarity calculations, since this calculation directly influences the results of the phylogeny reconstruction algorithms.


2. Document Phylogeny

Two main, independent factors matter in reconstructing the phylogeny of a set of semantically similar images or videos: the dissimilarity function, which captures the differences between each pair of duplicates in the set, and the phylogeny reconstruction algorithm itself.

Formally, let $\mathcal{T}$ be a family of image transformations and let $T$ be a transformation such that $T \in \mathcal{T}$. The dissimilarity function between two images $I_A$ and $I_B$ can be defined as the smallest value of $d_{I_A,I_B}$ [Dias et al. 2010],

\[
d_{I_A,I_B} = \left| I_B - T_{\vec{\beta}}(I_A) \right|_{\text{point-wise comparison method } L} \tag{1}
\]

over all possible values of the parameter vector $\vec{\beta}$ that parameterizes $T$. Equation 1 measures the amount of residual data between $I_B$ and the best transformation of $I_A$ toward $I_B$ under the family of transformations $\mathcal{T}$. The images are then compared using some point-wise comparison method $L$. Given a set of image duplicates, the authors compute the dissimilarity between each pair of images $I_A$ and $I_B$ in the set in four steps:

1. Image registration, which estimates the transformations (rotation, scaling, cropping, etc.) that must be applied to image $I_B$ so that it has the same (or similar) geometric characteristics as image $I_A$;

2. Color matching, which transfers the color characteristics of image $I_A$ to image $I_B$;

3. JPEG compression matching, which ensures that image $I_B$ has the same JPEG compression quality factors as image $I_A$;

4. Dissimilarity calculation, computed as the sum of squared errors (taken as the technique $L$) between image $I_A$ and the image $I'_B$ that results from the three previous steps.
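The last step can be sketched as follows for images stored as flat lists of pixel values. The callable `align` is a hypothetical stand-in for steps 1 to 3, and the orientation of the matrix (row $i$ mapped toward column $j$, as in Equation 1) is an assumption of this sketch rather than a statement about the authors' implementation.

```python
def sse(a, b):
    """Sum of squared errors between two equally sized flat pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def dissimilarity_matrix(images, align):
    """d[i][j]: residual left after mapping image i toward image j.

    align(src, dst) is a hypothetical stand-in for steps 1-3 (registration,
    color matching, JPEG matching); it returns src transformed toward dst.
    """
    n = len(images)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d[i][j] = float(sse(images[j], align(images[i], images[j])))
    return d
```

The matrix is generally asymmetric, because a transformation that maps one image onto another cannot always be undone; this asymmetry is what later suggests the direction of each edge.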

To reconstruct the phylogeny tree, a directed graph is built from the dissimilarity matrix, and the authors compute the final tree using a variation of Kruskal's Minimum Spanning Tree algorithm adapted to directed graphs [Dias et al. 2012].
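One way to read the oriented variant is sketched below: edges are scanned in increasing order of dissimilarity, and an edge $i \to j$ is accepted only if $j$ has no parent yet and $i$ is not a descendant of $j$. This is an illustrative formulation under the assumption that `d[i][j]` is the cost of deriving $j$ from $i$; it is not taken verbatim from the cited paper.

```python
def oriented_kruskal(d):
    """Greedy reconstruction of a rooted tree from a dissimilarity matrix.

    d[i][j] is the cost of deriving node j from node i (an assumed
    orientation). Returns parent[], where parent[r] == r marks the root.
    """
    n = len(d)
    parent = list(range(n))            # every node starts as its own root

    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v

    edges = sorted((d[i][j], i, j) for i in range(n) for j in range(n) if i != j)
    added = 0
    for _, i, j in edges:
        if added == n - 1:
            break
        # j must still be parentless, and i must not lie in j's subtree.
        if parent[j] == j and root(i) != j:
            parent[j] = i
            added += 1
    return parent
```

On a complete directed graph the loop always ends with $n - 1$ edges, so exactly one node keeps itself as parent and is reported as the root.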

In [Dias et al. 2010], the authors adapted the dissimilarity calculation and the phylogeny reconstruction described above to handle videos. The dissimilarity between two videos is computed one frame of each video at a time, assuming that the videos are temporally coherent. Thus, if the videos have $q$ frames, $q$ dissimilarity matrices are built, one per frame. The authors then reconstruct $q$ phylogenies with the oriented Kruskal algorithm, and the final phylogeny is defined by the edges that occur most often in the reconstructed phylogenies.

Since the duplicates in a set may have been generated from two or more original documents (for example, two original images or videos with the same semantic content but captured at different times or by different sources), it becomes necessary to define how many trees a forest will have, which poses significant challenges. Possible solutions are (1) cutting the heaviest edges of the initially generated tree and re-running the reconstruction algorithm after the cut; and (2) clustering the dissimilarity matrix before running the reconstruction algorithm. Another challenge, in a setting where the number of trees is not known a priori, is deciding how to separate the images into different trees and what stopping criterion the algorithm should use (that is, when to decide whether an edge should be removed to split one tree into two, or added to join two trees).
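The cutting part of alternative (1) can be sketched as below. The number of trees `k` is taken as given here, which sidesteps the open question of how to choose it, and the re-execution of the reconstruction algorithm after the cut is omitted; the function name and the parent-list representation are assumptions of this sketch.

```python
def cut_heaviest_edges(parent, d, k):
    """Turn a tree into a forest of k trees by removing its k - 1 heaviest edges.

    parent[v] is v's parent (parent[r] == r for the root); d[i][j] is the
    weight of edge i -> j. Returns a new parent list with extra roots.
    """
    edges = [(d[parent[v]][v], v) for v in range(len(parent)) if parent[v] != v]
    edges.sort(reverse=True)
    forest = list(parent)
    for _, v in edges[:k - 1]:
        forest[v] = v                  # detach v: it becomes a new root
    return forest
```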

In the current image phylogeny literature, two methods for reconstructing phylogeny trees are being explored by the academic community: oriented Kruskal [Dias et al. 2010] and the optimum arborescence algorithm [Chu and Liu 1965, Bock 1971, Edmonds 1967]. These algorithms can be easily adapted to the forest problem, producing not just one but several phylogeny trees, as was done in [Dias et al. 2012] with the oriented Kruskal algorithm. We also adapted the minimum arborescence algorithm to handle forests. To increase the effectiveness of the methods in forest reconstruction, we compared the results of both techniques to observe how complementary they are and to verify how forest reconstruction can be improved by fusing them. Individual improvements to the techniques were also proposed, based on findings and properties that we analyzed. The results of these experiments were recently accepted for publication in a scientific journal [Costa et al. 2014].

One challenge of the project regarding video phylogeny is reconstructing the phylogeny when (1) the videos are not necessarily temporally coherent; and (2) a given duplicate may have been generated by composing distinct sequences from other videos, which often happens in practice.

3. Datasets

To address the image phylogeny problem, we will use two datasets.

• Image set A: the dataset used by Dias et al. [Dias et al. 2012]. This set contains three images of three different scenes, captured by three different cameras, where the original semantically similar images (the tree roots) may have been generated by the same camera or by different cameras. For each scene, duplicates were generated to represent forests of size $|F| \in \{1..5\}$ trees, considering 5 topologies for each forest size and 10 variations in duplicate generation for each topology. This set was split into two subsets: training (one topology per forest size, for a total of 2,700 forests) and test (four topologies per forest size, for a total of 10,800 forests).
• Image set B: contains 10 images of 20 different scenes, where each scene was captured with 10 different cameras and the original images may have been generated by the same camera or by different cameras. For each scene, duplicates were generated to represent forests of size $|F| \in \{1..10\}$ trees. For this set, we consider 10 topologies for each forest size and 10 random variations in duplicate generation. A total of 2,000 forests was generated.


The experiments will follow a cross-dataset protocol: the training subset of set A will be used to estimate any parameters, and the test subset of set A and set B will be used only for testing. To address the video phylogeny problem, we will use a video set built in partnership with researchers from Politecnico di Milano (Italy). The video duplicates will be generated with the following transformations: spatial cropping, temporal cropping, encoding, rotation, logo insertion, brightness and contrast adjustment, and blurring. The set initially contains 300 videos split into groups of 30 video duplicates, in which 10 duplicates represent one scene, 10 duplicates are of another scene, one duplicate was generated by composing the original videos of each scene, and the remaining nine duplicates were generated by applying transformations to the composed video.

To validate the proposed approaches on real data, we will collect images and videos from the Internet. The videos will be obtained from YouTube, initially trying to keep the same resolution and frame rate. For images, we will obtain duplicates from the Internet through Google Images and Flickr. We will also analyze real cases encountered during the research, such as the Situation Room case, which was addressed in [Dias et al. 2013b].

3.1. Conclusions and Future Work

In this work, we highlighted the main points currently discussed in the image and video phylogeny literature. Phylogeny is a more challenging problem than duplicate detection, since we want to find the kinship relationships among the duplicates. The work already has some important contributions, including a paper accepted for publication in an international scientific journal [Costa et al. 2014].

We are currently studying how the dissimilarity is calculated, in order to improve the final results of phylogeny reconstruction. By replacing the approaches used in some steps of the dissimilarity calculation between two duplicates (registration, color matching, and comparison metric) and checking how the phylogeny reconstruction algorithm behaves with the new matrix, we obtained promising initial results. In addition, we are evaluating a way to calculate the dissimilarity for each step individually, reconstruct the phylogeny for each step, and combine the results in some way.

We will also study new phylogeny reconstruction algorithms based on statistical measures such as kurtosis, a measure of extreme variation: very high values indicate a few extreme deviations in the data, whereas low values indicate small but more frequent deviations. Initial experiments show a relationship between the variation in the kurtosis of the edge weights of the final phylogeny found by the minimum arborescence algorithm and the choice of the number of trees a phylogeny should have. We also intend to analyze other existing algorithms from Computational Biology [Saitou and Nei 1987, Bachmaier et al. 2005].
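For reference, the statistic itself can be computed from a list of edge weights as in the sketch below, which uses the population form $m_4 / m_2^2$. How the kurtosis is turned into a decision about the number of trees is not specified here, so the sketch stops at the measure.

```python
def kurtosis(values):
    """Population kurtosis m4 / m2**2 (about 3 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / (m2 ** 2)
```

A list of weights with one unusually heavy edge, such as `[1, 1, 1, 1, 5]`, yields a larger value than an evenly alternating list such as `[-1, 1, -1, 1]`.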

Finally, we will address the video phylogeny problem and verify whether the approaches developed for image phylogeny can be adapted. This stage will be carried out during Filipe Costa's sandwich doctorate, in partnership with collaborators at Politecnico di Milano (Italy), over a period of nine months starting in September of this year.

References

Bachmaier, C., Brandes, U., and Schlieper, B. (2005). Drawing phylogenetic trees. Springer.

Bock, F. (1971). An algorithm to construct a minimum directed spanning tree in a directed network. Developments in Operations Research, pages 29–44.

Chu, Y. J. and Liu, T. H. (1965). On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Costa, F. O., Oikawa, M., Dias, Z., Goldenstein, S., and Rocha, A. (2014). Image phylogeny forest reconstruction. IEEE Transactions on Information Forensics and Security.

Dias, Z., Goldenstein, S., and Rocha, A. (2013a). Exploring heuristic and optimum branching algorithms for image phylogeny. Elsevier Journal of Visual Communication and Image Representation, 24(7):1124–1134.

Dias, Z., Goldenstein, S., and Rocha, A. (2013b). Toward image phylogeny forests: Automatically recovering semantically similar image relationships. Forensic Science International, 231:178–189.

Dias, Z., Rocha, A., and Goldenstein, S. (2010). First steps towards image phylogeny. In IEEE Intl. Workshop on Information Forensics and Security (WIFS), pages 1–6.

Dias, Z., Rocha, A., and Goldenstein, S. (2011). Video phylogeny: Recovering near-duplicate video relationships. In IEEE Intl. Workshop on Information Forensics and Security (WIFS), pages 1–6.

Dias, Z., Rocha, A., and Goldenstein, S. (2012). Image phylogeny by minimal spanning trees. IEEE Transactions on Information Forensics and Security (TIFS), 7(2):774–788.

Edmonds, J. (1967). Optimum branchings. Journal of Research of National Institute of Standards and Technology (NIST), 71B:48–50.

Joly, A., Buisson, O., and Frélicot, C. (2007). Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Trans. Multimedia, 9(2):293–306.

Rosa, A. D., Uccheddu, F., Costanzo, A., Piva, A., and Barni, M. (2010). Exploring image dependencies: a new challenge in image forensics. SPIE-IS&T Electronic Imaging, 7541:774–788.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.

Valle, E. (2008). Local descriptor matching for image identification systems. Ph.D. thesis, Université de Cergy-Pontoise.


Superpixel-based interactive classification of very high resolution images

John E. Vargas1, Priscila T. M. Saito1,2, Alexandre X. Falcão1, Pedro J. de Rezende1, Jefersson A. dos Santos3

1Institute of Computing, University of Campinas, Campinas, Brazil

2Department of Computer Engineering, Federal Technological University of Paraná, Cornélio Procópio, Brazil

3Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Abstract. Very high resolution (VHR) images are large datasets for pixel annotation, a process that has depended on the supervised training of an effective pixel classifier. Active learning techniques have mitigated this problem, but pixel descriptors are limited to local image information, and the large number of pixels makes the response time to the user's actions impractical during active learning. To circumvent the problem, we present an active learning strategy that relies on superpixel descriptors and a priori dataset reduction. Moreover, we subdivide the reduced dataset into a list of subsets with random sample rearrangement to gain both speed and sample diversity during the active learning process.

Resumo. Imagens de muito alta resolução são conjuntos de dados grandes para anotação de pixels, um processo que depende do treinamento supervisionado de um classificador de pixels eficaz. Os métodos de aprendizado ativo têm atenuado este problema, mas os descritores de pixels tornam o tempo de resposta às ações do usuário impraticável durante o processo de aprendizado ativo. Para contornar este problema, apresentamos uma estratégia de aprendizado ativo que se baseia em descritores de superpixels e uma redução do conjunto de dados a priori. Além disso, subdividimos o conjunto de dados reduzido em uma lista de subconjuntos com amostragem aleatória para ganhar velocidade e diversidade das amostras durante o processo de aprendizado ativo.

1. Introduction

Pixel annotation in hyperspectral and very high resolution (VHR) images has relied on supervised classifiers so far [Tuia et al. 2011, Demir et al. 2011]. However, manual selection and annotation of a sufficiently large number of training pixels is infeasible for very large datasets. To handle large datasets, active learning techniques have been proposed for selecting a small training set that represents well not only all classes under annotation but also discriminative samples on the boundary between classes [Demir et al. 2011, Tuia et al. 2011, Saito et al. 2012].

Such an active learning process starts off with a small training set for manual annotation. The labeled samples are then used to train a preliminary classifier that selects and labels new samples for user validation. The user may confirm or correct the labels of the selected samples, increasing the size of the training set to generate an improved classifier for the next iteration. The process may be halted by the user once the accuracy of the classifier reaches a certain level, which can be verified either by cross validation on the training set or, for images, by visualizing the classification of the entire image.

The effectiveness of active learning techniques has been shown in the literature to be superior to simple random selection of samples [Tuia et al. 2011, Demir et al. 2011], even when much larger numbers of randomly selected samples are considered. Nonetheless, the performance of an active learning technique depends on a clever strategy for selecting representative and discriminative samples from the large unlabeled dataset. The Multi-Class Level Uncertainty (MCLU) method is considered a state-of-the-art approach for pixel annotation in VHR and hyperspectral images [Demir et al. 2011]. However, works in this line evaluate their methods on only a subset of the image domain for which the pixel labels can be easily determined by the user. Moreover, the pixel datasets are labeled and the methods are assessed without considering the mean response time to the user's actions. Indeed, they are impractical from the user's point of view. In addition, pixel descriptors are limited to local image information and susceptible to noise.

We circumvent these problems by presenting an active learning strategy based on the MCLU method that relies on superpixel descriptors and a priori dataset reduction. The image is segmented into a superpixel representation by the Simple Linear Iterative Clustering (SLIC) algorithm described in [Achanta et al. 2012]. The superpixel representation provides a considerable initial dataset reduction and also allows the combination of higher-level descriptors within regions that respect the boundaries between classes.

Our contribution. First, we demonstrate the effective gain of the superpixel representation over the pixel representation using the MCLU active learning technique over an entire VHR image. Second, given that the method as such is still impractical for interactive response times, we propose an a priori superpixel dataset reduction using the Optimum-Path Forest (OPF) clustering technique [Rocha et al. 2009]. The choice of OPF clustering is justified by previous works on active learning [Saito et al. 2012, Saito et al. 2014] and by its ability to find natural clusters with no shape constraints and no need to specify the number of clusters. Our approach selects the roots and a few random samples from the boundary set to compose the first training set for manual annotation. The MCLU technique is based on Support Vector Machine (SVM) classification. Hence, the first instance of the SVM classifier is applied to select, from the entire dataset, a reduced set containing the most uncertain samples. Moreover, we subdivide the reduced dataset into $m$ subsets and include them in a queue to improve efficiency. The active learning process uses one subset per iteration until the queue is empty. Then, the unlabeled samples of these subsets are merged and randomly rearranged to compose a new queue of subsets, in order to attain sample diversity in the subsets for the next $m$ iterations. The result is a feasible and very effective active learning approach, as demonstrated here.


2. Methodology

Active learning techniques aim to iteratively select the most informative samples from a given unlabeled dataset $U$ for training a supervised classifier. In the first step, a few unlabeled samples are selected for manual annotation, forming an initial training set $T$. A classifier $C$ is trained with the labeled set $T$, and a query function $Q$ uses $C$ to classify the samples in $U$ and select a set $X$ of a few more informative unlabeled samples. Then the user confirms or corrects the labels of the selected samples. The labeled samples in $X$ are added to $T$, and the classifier $C$ is retrained with the new training set. This process is repeated until the user is satisfied with the accuracy of the classifier $C$.
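The loop described above can be sketched generically as follows. The callables `train`, `query`, and `oracle` are hypothetical placeholders for the classifier training, the query function $Q$, and the user, and a fixed number of rounds stands in for "until the user is satisfied".

```python
def active_learning(unlabeled, initial, train, query, oracle, rounds):
    """Generic loop: train C on T, let Q pick X from U, have the user label X.

    train(T) -> classifier; query(classifier, U) -> list of samples X;
    oracle(x) -> label confirmed or corrected by the user. All three are
    hypothetical callables supplied by the caller.
    """
    labeled = {x: oracle(x) for x in initial}        # initial training set T
    pool = [x for x in unlabeled if x not in labeled]
    classifier = train(labeled)
    for _ in range(rounds):                          # stands in for "until satisfied"
        if not pool:
            break
        picked = query(classifier, pool)             # set X
        for x in picked:
            labeled[x] = oracle(x)
        pool = [x for x in pool if x not in labeled]
        classifier = train(labeled)                  # retrain C on the new T
    return classifier, labeled
```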

2.1. Multi-Class Level Uncertainty

The Multi-Class Level Uncertainty (MCLU) technique is based on Support Vector Machine (SVM) classification, one of the most successful methods in pattern recognition, which has inspired many other active learning techniques. The MCLU technique selects the unlabeled samples with the highest classification uncertainty among all samples in the dataset $U$. The uncertainty criterion is defined by a function $c(x)$, $x \in U$, which is calculated from the signed distances $d_i(x)$, $i = 1, 2, \ldots, n$, from sample $x$ to each of the $n$ decision boundaries ($n$ classes) of the one-versus-all SVM classifier. The distance set $\{d_1(x), \ldots, d_n(x)\}$ is computed for every $x \in U$, and the confidence value $c(x)$ can be computed in two ways, as explained in [Demir et al. 2011]: 1) the smallest distance to the hyperplanes, or 2) the difference between the first and second largest distance values to the hyperplanes. In this work, we use MCLU with the second uncertainty criterion. The $|X|$ (cardinality of $X$) samples with the lowest $c(x)$ values are selected to be displayed to the user.
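The second criterion can be sketched as below, assuming the signed distances have already been produced by a one-versus-all SVM. The function names and the dictionary layout are illustrative only.

```python
def mclu_confidence(distances):
    """c(x): gap between the largest and second-largest signed SVM distances."""
    first, second = sorted(distances, reverse=True)[:2]
    return first - second

def select_most_uncertain(samples, k):
    """samples maps a sample id to its list [d_1(x), ..., d_n(x)].

    Returns the k ids with the lowest confidence, most uncertain first.
    """
    return sorted(samples, key=lambda s: mclu_confidence(samples[s]))[:k]
```

A small gap means that two classes compete for the sample, so it lies near a boundary between classes and is worth showing to the user.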

2.2. Dataset reduction method

Inspired by recent work [Saito et al. 2012], our approach also exploits the Optimum-Path Forest (OPF) clustering algorithm [Rocha et al. 2009] for dataset reduction and initial training sample selection. However, unlike [Saito et al. 2012], we select as initial training samples the representative cluster samples and randomly selected samples from the boundary between clusters, and then reduce the dataset by using the SVM classifier that results from the initial training set.

The initial classifier should be as effective as possible because the proposed reduction method is based on the classification results of the first classifier. Clustering is performed in order to find representative samples that are likely to cover all classes, as well as uncertain samples between classes, which are likely to lie on the boundary between clusters. The roots of the clusters (maxima of the pdf) and a small set of randomly selected boundary samples are then annotated by the user to form a small initial training set T.

Before the learning cycle, the samples of the labeled set T are used to train the classifier C. Then, the query function Q uses C to classify the entire set of unlabeled samples U. These unlabeled samples are sorted according to the uncertainty criterion explained in Section 2.1: the lower the confidence value c(x), the higher the uncertainty of sample x. The |U|/r most uncertain samples form the reduced set R. Note that the value of r controls the size of the reduced set.
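The reduction step can be sketched as follows, assuming the confidence values c(x) are already available as a list. Truncating |U|/r to an integer is an assumption made for this illustration, and the function name is hypothetical.

```python
def build_reduced_set(confidences, r):
    """Keep the |U|/r samples with the lowest confidence c(x)."""
    keep = int(len(confidences) / r)   # assumption: truncate to an integer
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:keep]


# Toy confidence values for six unlabeled samples (illustrative only).
confidences = [0.9, 0.05, 0.4, 0.1, 0.7, 0.2]
reduced = build_reduced_set(confidences, r=2)
print(reduced)
```

With r = 2 half of the samples are kept; a larger r yields a smaller reduced set and therefore less work per iteration for the query function.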


In order to further reduce the number of unlabeled samples that have to be classified by the query function Q at every iteration, we propose to split the reduced set R into m equal-sized subsets {R1, ..., Rm} of random samples, and to include these subsets in a queue L for processing, one per iteration, until L is empty. After the m-th iteration, the unlabeled samples in R are merged, randomly divided into m new subsets {R1, ..., Rm}, and included again in L for the next m iterations. This strategy also aims to attain sample diversity in these subsets along the iterations. Algorithm 1 describes the proposed method.

Algorithm 1 Active learning algorithm with dataset reduction.

1: Cluster the set of unlabeled samples U with OPF and select root and boundary samples, which are annotated by the user to form the initial training set T.
2: Train the classifier C with the labeled set T.
3: The query function Q uses C to classify and sort the samples in the unlabeled set U according to the uncertainty criterion.
4: Select the |U|/r most uncertain samples from U to form the reduced set R.
5: Divide the set R into m subsets {R1, ..., Rm} and put them in the queue of subsets L.
6: repeat
7:     The query function Q uses C to classify and select a set of samples X from the next subset of unlabeled samples in the queue of subsets L.
8:     The user confirms/corrects labels and T is increased by X.
9:     Retrain the classifier C with the new set T.
10:    If the m subsets are processed, then merge the subsets {R1, ..., Rm} and divide them again into m subsets for L.
11: until the user is satisfied with the accuracy of C.

In the learning cycle of the active learning process (Line 6), the query function Q uses C to classify and select a set of samples X, which are the most uncertain samples in the next subset of the queue of subsets L. In our implementation, the query function Q is MCLU. The user confirms or corrects the labels of the samples selected by Q and the set T is increased by X (Line 8). Then, the classifier C is retrained with the new training set T. If the m subsets in the queue L have been processed, the subsets are combined, randomly divided again into m subsets, and rearranged in a queue of subsets L (Line 10). These steps are repeated until the user is satisfied with the accuracy of the classifier C.
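The queue handling described above can be sketched as follows. This is only an illustration under stated assumptions: samples are represented by integer identifiers, the query function Q is replaced by a stand-in that takes the first sample of each subset, and all names are hypothetical.

```python
import random
from collections import deque


def split_into_subsets(samples, m, rng):
    """Shuffle the samples and deal them into m near-equal subsets."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]


def subset_stream(reduced_set, m, labeled, rng):
    """Yield one subset per iteration, rebuilding the queue when it empties."""
    queue = deque(split_into_subsets(reduced_set, m, rng))
    while True:
        if not queue:
            # Merge what is still unlabeled and redistribute it at random.
            remaining = [s for s in reduced_set if s not in labeled]
            if not remaining:
                return
            queue = deque(split_into_subsets(remaining, m, rng))
        yield queue.popleft()


rng = random.Random(0)
labeled = set()
reduced_set = list(range(12))          # toy sample identifiers
for _, subset in zip(range(8), subset_stream(reduced_set, 4, labeled, rng)):
    picked = subset[:1]                # stand-in for the query function Q
    labeled.update(picked)             # the user would label these samples
print(len(labeled))
```

Because the queue is rebuilt only from samples that are still unlabeled, each new round of m iterations works on a fresh random partition, which is the source of the sample diversity mentioned in the text.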

3. Experiments

Given that annotated VHR images are presently scarce, we opted for manually annotating a VHR image acquired over Vatican City in April 2004. It is a 2847×2817 natural color image obtained from the Mapmart QuickBird satellite imagery samples website. The image was labeled with seven classes of interest (Road, Shadow, Tree, Water, Building, Grass, and Bare soil). In all experiments, we selected a small initial set of 64 samples. For the proposed approach, the 64 samples include the roots of the clusters (in order to guarantee samples from all classes) and samples from the boundary between clusters. The rest of the samples form the learning set. In all experiments, each method was run ten times, so each graph reports the mean of these runs.


To compare the pixel- and superpixel-based approaches for the task of VHR image classification, we used the MCLU active learning method with the same number of initial training samples and an equal number of samples labeled per iteration. In all experiments, the active learning methods select 14 samples, twice the number of classes (as suggested in [Saito et al. 2014]), to be validated by the user in each iteration. The descriptor used for superpixels is a combination of four descriptors: mean color, color histogram, Local Binary Patterns (LBP), and Border/Interior Classification (BIC). The overall accuracy of the two approaches is presented in Figure 1. The superpixel-based classification is consistently superior to the pixel-based classification. In order to set the parameters of the SVM-RBF classifier of the MCLU technique used in the experiments, we performed a grid-search model selection in the first and fifth iterations for the pixel- and superpixel-based approaches.

Figure 1. Overall accuracy Superpixel vs Pixel classification.

The VHR image used in the experiments comprises |U| = 24968 superpixels obtained by the SLIC algorithm. The time taken by the selection process of the MCLU method is not suitable for an application where interactive response times are expected (we measured 87 seconds on average). As a baseline for the comparison of the superpixel-based methods, we used a random sampling method that selects samples from the learning set to be validated by the user. Thirty percent of the entire set of unlabeled samples was selected to form the reduced set (1/r = 0.3), which was divided into m = 4 subsets. In Figure 2 a), we present the overall accuracy of the proposed MCLU method with data reduction in comparison with the original MCLU method and a random sampling selection strategy. In Figure 2 b), we present the time spent per iteration for both MCLU selection strategies.
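To make the scale of the reduction concrete, the following sketch works through the numbers reported above. It assumes that |U| = 24968 is the size of the set being reduced and that fractional sizes are truncated; the text does not state how rounding was handled, so both points are assumptions.

```python
total_superpixels = 24968   # |U| reported in the text
fraction_kept = 0.3         # 1/r
m = 4                       # number of subsets in the queue

# Assumption: the reduced set size is truncated to an integer.
reduced_size = int(total_superpixels * fraction_kept)

# Spread the reduced set over m subsets as evenly as possible.
base, extra = divmod(reduced_size, m)
subset_sizes = [base + 1 if i < extra else base for i in range(m)]
print(reduced_size, subset_sizes)
```

Under these assumptions, the query function classifies fewer than two thousand superpixels per iteration instead of the full set of nearly twenty-five thousand.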

As illustrated in Figure 2 a), the accuracy of MCLU with data reduction is comparable to that achieved by the original MCLU method. However, the computing time is remarkably smaller, as can be observed in Figure 2 b).

Figure 2. a) Overall accuracy of MCLU and MCLU with data reduction; b) time per iteration of MCLU and MCLU with data reduction.

4. Conclusion

In this paper, we have shown that interactive classification of very high resolution images, through active learning sessions, benefits more from a superpixel-based approach than from a pixel-based one. We have also proposed a method that reduces and rearranges the dataset, lowering the computing time of the selection process (from 87 to 9 seconds on average) while keeping similar accuracy in comparison with the active learning method that processes the entire set of unlabeled samples. This reduction in time makes user interaction feasible in active learning.

Acknowledgment

We thank Mapmart for the QuickBird satellite imagery.

References

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Susstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282.

Demir, B., Persello, C., and Bruzzone, L. (2011). Batch-mode active-learning methods for the interactive classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 49(3):1014–1031.

Rocha, L. M., Cappabianco, F. A. M., and Falcao, A. X. (2009). Data clustering as an optimum-path forest problem with applications in image analysis. Int. J. Imaging Syst. Technol., 19(2):50–68.

Saito, P. T. M., de Rezende, P. J., Falcao, A. X., Suzuki, C. T. N., and Gomes, J. F. (2012). Improving active learning with sharp data reduction. In WSCG International Conference on Computer Graphics, Visualization and Computer Vision, pages 27–34.

Saito, P. T. M., de Rezende, P. J., Falcao, A. X., Suzuki, C. T. N., and Gomes, J. F. (2014). An active learning paradigm based on a priori data reduction and organization. Expert Systems with Applications (ESwA), 41(14):6086–6097.

Tuia, D., Volpi, M., Copa, L., Kanevski, M., and Munoz-Mari, J. (2011). A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing, 5(3):606–617.


Supporting the study of correlations between time series via semantic annotations

Lucas Oliveira Batista1, Claudia Bauzer Medeiros1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP), CEP 13083-852, Campinas - São Paulo - Brazil


Abstract. This paper presents work in progress on the design and development of a software framework that supports experts in correlating time series. The framework will allow searching for time series via semantic annotations, thereby fostering collaboration among experts and aggregating knowledge to content. This paper was accepted for the VIII Brazilian eScience Workshop - BreSci.

Resumo. This article presents work in progress to design and develop a software framework that helps experts correlate time series. The framework will allow searching for time series using semantic annotations. In addition, it supports collaboration among experts and allows knowledge to be aggregated to content. This article was accepted at the VIII Brazilian eScience Workshop - BreSci.

1. Introduction

Time series are used in several knowledge domains, such as economics (monthly unemployment rate), meteorology (daily temperature), and health (electrocardiogram). An efficient search for time series helps the task of analyzing these series, for instance to perform forecasts, identify patterns, or find the origins of certain phenomena. Experts often use series annotations to help this analysis.

An annotation establishes a relation between the annotated data and the annotation content [Oren et al. 2006]. Time series annotations are potentially made by different researchers or research groups, and they are generally created in text format using annotation-specific languages and stored in files. In many domains, because there are no standards, annotations are highly heterogeneous. This complicates the search, sharing, and integration of data. A semantic annotation is the description of a digital resource according to its semantics [Sousa 2010]. [Sousa 2010] defines a semantic annotation as a triple <s, m, o>, where s is the subject described, m is the label of a metadata field, and o is an ontology term that semantically describes this subject.
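The triple <s, m, o> can be pictured with a small sketch. The field names, the series identifiers, and the ontology terms below are hypothetical placeholders chosen for illustration; they are not taken from the framework or from any real ontology.

```python
from typing import List, NamedTuple


class SemanticAnnotation(NamedTuple):
    subject: str         # s: the annotated resource (e.g. a series segment)
    metadata_field: str  # m: label of the metadata field
    ontology_term: str   # o: ontology term describing the subject


def subjects_for_term(annotations: List[SemanticAnnotation], term: str) -> List[str]:
    """Return the subjects annotated with the given ontology term."""
    return [a.subject for a in annotations if a.ontology_term == term]


# All identifiers below are hypothetical placeholders.
annotations = [
    SemanticAnnotation("series:17[2010-01:2010-03]", "crop", "onto:Maize"),
    SemanticAnnotation("series:17[2010-04:2010-06]", "disease", "onto:LeafSpot"),
    SemanticAnnotation("series:52[2011-01:2011-12]", "crop", "onto:Maize"),
]
print(subjects_for_term(annotations, "onto:Maize"))
```

Because the object of the triple is an ontology term rather than free text, two experts who describe the same concept with different words can still be matched by the same query.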

In several situations, scientists need to correlate many kinds of series to study a problem. The search for series relevant to the problem is expensive and takes a long time. Given this scenario, this paper proposes a model of a framework to support the study of correlations between time series, where series are searched via semantic annotation. Besides allowing more sophisticated searches and aggregating knowledge to series, this framework will facilitate collaboration among experts.


2. Related Work

The goal of this work is to provide users with tools that make it easier to search for time series, taking advantage of semantic annotation. Therefore, the literature review focuses on research on time series annotation (2.1) and time series search (2.2).

2.1. Time series annotation

Annotations may be made in many formats, for example video, audio, or text. This work focuses on textual annotations of time series. There are many tools that support series annotation, for example [Pressly 2008, Silva 2013]. Both make it possible to associate multiple textual annotations with parts of a series. In the first, experts can only view annotations made by others. In the second, experts can collaborate by modifying annotations made by others.

These tools have a few disadvantages. First, both store annotations as free text, which hampers search, sharing, and integration of data, as well as their interpretation by machines. Second, they do not consider the ambiguity of annotation contents. These problems may be addressed using semantic annotations.

2.2. Search for time series

According to [Gao and Wang 2009], the search for time series refers to finding, from a set of time series, the series that satisfy a given search criterion. This search may be based on a model or statistics of the series, temporal dependencies, similarity among series, patterns, and so on.

There are tools that perform the search for time series in a limited way; for instance, they only allow the user to select a series category and a time period [Secretaria do Tesouro Nacional 2013]. Others perform searches based on similarity among series, using one time series as input, for instance [Ding et al. 2008, Negi and Bansal 2005].

Still other approaches to the search for time series are proposed by [Aßfalg et al. 2006, von Landesberger et al. 2009]. The former proposes a tool that searches time series using a range of values defined by the user as the search parameter. The latter uses natural language text on financial databases. The disadvantage of these approaches is that they do not consider extra information that may be indirectly associated with series.

3. Model and methodology proposed

This paper describes the specification of a software framework that supports users in correlating time series by searching via semantic annotation. The work of [Silva 2013] is used as the starting point to achieve this goal. This framework is validated with data provided by EMBRAPA.

Suppose that a researcher needs to analyze the impact of corn production on the local economy of a region. Several series are related to corn production, for instance temperature, humidity, harvested amount, and so on. Finding time series related to the problem is challenging: for instance, “corn” is cultivated in different parts of the world, and unless the series are georeferenced, a search by the keyword “corn” will return too many series. Furthermore, each region has a different name for “corn”, which makes obtaining results even more complicated.

The proposed approach makes it possible to solve these kinds of problems. In order to return other relevant results, the search for time series uses extra information that may be indirectly associated with series. Figure 1 gives an overview of the solution considered. In this framework, experts may add extra information to series by semantically annotating parts of one or more time series. Series are stored in relational databases and may have many semantic annotations (step 1). Semantic annotations are stored in an RDF (Resource Description Framework) database kept apart from the time series themselves, aiming to use less storage space, as done by [Silva 2013] (step 2). Moreover, using Linked Data concepts, annotations are associated with external ontologies, aggregating knowledge to content (step 3).

Returning to the example, and as shown in the figure, the researcher may use the string “milho” in the search (step 4). By navigating through the ontologies connected to annotations, new information associated with “milho” is inferred, for example “Zea Mays” (scientific name of corn), terms like “dentado” (which refers to the texture of the corn grain), “Cercospora zeae-maydis” (a fungus that affects corn), “mancha-branca” (a corn disease), “espiga” (a term related to corn), the time period in which corn is harvested, and the places where corn is planted. Defining ontologies is very hard, so we intend to use known ontologies such as AGROVOC (see footnote 1), which is an ontology covering several areas related to food and agriculture. This allows the subsequent search for time series with the new information. Thereby, the researcher will obtain as results (step 5) time series annotated with terms associated with “milho” (like “dentado”), temperature and humidity series, the amount of grain harvested in regions where “milho” is planted, and dollar price series for the period in which “milho” is harvested. More accurate results and additional information about the content allow new types of correlations.
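The expansion of a keyword through ontology links (steps 4 and 5) can be sketched as a breadth-first walk. The ontology fragment, the annotation index, and the series identifiers below are hypothetical toy data; a real deployment would query an RDF store and an ontology such as AGROVOC, which this sketch does not attempt.

```python
from collections import deque

# Hypothetical ontology fragment: term -> directly related terms.
ONTOLOGY = {
    "milho": ["Zea mays", "dentado", "mancha-branca", "espiga"],
    "Zea mays": ["Cercospora zeae-maydis"],
}

# Hypothetical annotation index: ontology term -> time series identifiers.
ANNOTATED_SERIES = {
    "milho": ["yield_region_a"],
    "dentado": ["grain_texture_survey"],
    "Cercospora zeae-maydis": ["disease_incidence"],
}


def expand_terms(start, ontology, max_depth=2):
    """Breadth-first walk collecting the terms reachable from start."""
    seen = {start}
    order = [start]
    frontier = deque([(start, 0)])
    while frontier:
        term, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbour in ontology.get(term, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return order


def search_series(keyword, ontology, index):
    """Return series annotated with the keyword or any related term."""
    results = []
    for term in expand_terms(keyword, ontology):
        for series_id in index.get(term, []):
            if series_id not in results:
                results.append(series_id)
    return results


print(search_series("milho", ONTOLOGY, ANNOTATED_SERIES))
```

The depth limit is a design choice of this sketch: it keeps the expansion from drifting toward terms that are only remotely related to the original keyword.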

Collaboration among experts is performed through comments on the annotations (meta-annotations). These meta-annotations are also structured; their value is free text and they are associated with time series annotations, which are versioned to store their evolution over time. This new level of annotations records experts' discussions over time, avoiding the repetition of time series annotations created just to register this collaboration (as done in [Silva 2013]). Therefore, new time series annotations will be created only when there is agreement about an actual change of content.
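One possible way to picture versioned annotations with attached comments is sketched below. The class and field names are hypothetical and the in-memory structure is an assumption made for illustration; the text does not describe the actual storage model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaAnnotation:
    author: str
    text: str            # free-text comment on an annotation


@dataclass
class AnnotationVersion:
    content: str
    comments: List[MetaAnnotation] = field(default_factory=list)


@dataclass
class SeriesAnnotation:
    versions: List[AnnotationVersion]

    def comment(self, author: str, text: str) -> None:
        """Record a discussion entry without creating a new version."""
        self.versions[-1].comments.append(MetaAnnotation(author, text))

    def revise(self, new_content: str) -> None:
        """Create a new version once the experts agree on a change."""
        self.versions.append(AnnotationVersion(new_content))


annotation = SeriesAnnotation([AnnotationVersion("possible sensor fault")])
annotation.comment("expert_a", "Could this be a real frost event?")
annotation.comment("expert_b", "Station logs confirm frost.")
annotation.revise("frost event")
print(len(annotation.versions))
```

In this sketch the discussion lives in the comments of a version, so the version history grows only when the content of the annotation actually changes.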

4. Conclusion and future work

This paper presents ongoing research that helps experts share information, collaborate in the production of data of common interest, and search for time series with more relevant results. The proposed framework uses semantic annotation as a new way to search time series, taking advantage of extra information attached to the series. Semantic annotations can be interpreted by machines, allowing the use of automatic techniques, such as inference, on these data. This expands and refines the search scope, allowing more refined analyses.

1 http://aims.fao.org/standards/agrovoc/about


Figure 1. Overview of the model proposed

Acknowledgments. Work partially financed by CAPES, FAPESP/Cepid in Computational Engineering and Sciences (2013/08293-7), the Microsoft Research FAPESP Virtual Institute (NavScales project), CNPq (MuZOO Project), FAPESP-PRONEX (eScience project), INCT in Web Science, and individual grants from CNPq. We thank EMBRAPA for the data.

References

Aßfalg, J., Kriegel, H.-P., Kröger, P., Kunath, P., Pryakhin, A., and Renz, M. (2006). Tquest: threshold query execution for large sets of time series. In Advances in Database Technology-EDBT 2006, pages 1147–1150. Springer.

Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., and Keogh, E. (2008). Querying and mining of time series data: Experimental comparison of representations and distance measures. Proc. VLDB Endow., 1(2):1542–1552.

Gao, L. and Wang, X. S. (2009). Time series query. In Liu, L. and Özsu, M. T., editors, Encyclopedia of Database Systems, pages 3114–3119. Springer US.

Negi, T. and Bansal, V. (2005). Time series: Similarity search and its applications. In Proceedings of the International Conference on Systemics, Cybernetics and Informatics: ICSCI-04, Hyderabad, India, pages 528–533.

Oren, E., Möller, K., Scerri, S., Handschuh, S., and Sintek, M. (2006). What are semantic annotations. Technical report. DERI Galway.

Pressly, Jr., W. B. S. (2008). Tspad: a tablet-pc based application for annotation and collaboration on time series data. In Proc. of the 46th Annual Southeast Regional Conference, ACM-SE 46, pages 166–171, New York, NY, USA. ACM.

Secretaria do Tesouro Nacional (2013). Séries temporais. http://www3.tesouro.fazenda.gov.br/series_temporais/principal.aspx. Accessed: 03-09-2013.

Silva, F. H. (2013). Serial annotator: Managing annotations of time series. Master’s thesis, Universidade Estadual de Campinas - UNICAMP. Supervisor Claudia Bauzer Medeiros.

Sousa, S. R. (2010). Gerenciamento de anotações semânticas de dados na web para aplicações agrícolas. Master’s thesis, Universidade Estadual de Campinas - UNICAMP. Supervisor Claudia Bauzer Medeiros.

von Landesberger, T., Voss, V., and Kohlhammer, J. (2009). Semantic search and visualization of time-series data. In Networked Knowledge-Networked Media, pages 205–216. Springer.


2. Poster Abstracts

“Science knows only one commandment: contribute to science.”

Bertolt Brecht

A Model-Driven Infrastructure for Developing Dynamic Software Product Line

Junior Cupe Casquina1, Cecılia M.F. Rubira1

1 Institute of Computing – University of Campinas (Unicamp), Campinas, SP, Brazil

jcupe.casgmail.com

Abstract. A fundamental principle of software product lines (SPLs) is variability management, which involves separating the product line into three parts (common, common to some but not all products, and individual) and managing these throughout development. At present, there is a growing need for systems to be able to self-adapt to dynamic variations at runtime. To address this need, researchers have investigated dynamic software product lines. To facilitate the development of this kind of system, infrastructures or tools for creating dynamic software product lines are needed. This article proposes an infrastructure to create dynamic software product lines.

1. Introduction

A product line is a set of products in a product portfolio that share similarities and that are, ideally, created from a set of reusable parts. Instead of offering a single, standardized product, manufacturers offer multiple products. The software product line (SPL) approach provides a form of mass customization by constructing individual solutions based on a portfolio of reusable software components. Instead of being developed from scratch, software systems should be constructed from reusable parts and tailored to the requirements of the customer, who can select from a large space of configuration options [Apel et al. 2013].

SPL engineering handles the development and maintenance of families of software products. For this purpose, core assets, also known as domain artifacts, are defined as parts that are built to be used by more than one product in the family, while product artifacts, or application artifacts, are specific parts of the software products [Reinhartz-Berger 2013]. A key distinction of software product line engineering from other reuse approaches is that the various assets themselves contain explicit variability. For example, a representation of the requirements may contain an explicit description of specific requirements that apply only to a certain subset of the products.

Dynamic software product lines (DSPLs) extend existing product line engineering approaches by moving their capabilities to runtime; in other words, they produce software capable of adapting to changes in user needs and resource constraints [Hallsteinsen et al. 2008]. There is a growing need for systems to be able to self-adapt to dynamic variations in user requirements and system environments. To address this need, researchers have investigated dynamic software product lines.

2. The Infrastructure

At present, there are a few proposals in support of runtime variability models, but none of them covers the full spectrum of a complete solution to be used by a DSPL model. One of these solutions relies on the Common Variability Language (CVL) as an attempt to model runtime variability transformations, as it allows different types of substitutions to reconfigure new versions of base models.

Our proposed infrastructure relies on ArCMAPE; therefore, it uses CVL to model the variability in the software product line and to model runtime variability transformations. Our infrastructure is generic, supporting the development of dynamic software product lines rather than only dynamic fault-tolerant software. CVL is a domain-independent language for specifying and resolving variability; it allows the specification of variability over models of any language defined using a MOF-based metamodel [Nascimento et al. 2014].

Figure 1. An overview of the new ArCMAPE

Figure 1 shows an overview of the new ArCMAPE. The old ArCMAPE can thus be seen as an instance of our new infrastructure.

3. Expected Results

We expect to deliver the new infrastructure as an Eclipse plug-in, thereby creating a tool that helps develop dynamic software product lines. We also expect to validate our model-driven infrastructure using two case studies.

References

Apel, S., Batory, D., Kastner, C., and Saake, G. (2013). Software product lines. In Feature-Oriented Software Product Lines, pages 3–15. Springer Berlin Heidelberg.

Hallsteinsen, S., Hinchey, M., Park, S., and Schmid, K. (2008). Dynamic software product lines. Computer, 41(4):93–95.

Nascimento, A., Rubira, C., and Castor, F. (2014). ArCMAPE: A software product line infrastructure to support fault-tolerant composite services. In 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering (HASE), pages 41–48.

Reinhartz-Berger, I. (2013). When aspect-orientation meets software product line engineering. In Reinhartz-Berger, I., Sturm, A., Clark, T., Cohen, S., and Bettin, J., editors, Domain Engineering, pages 83–111. Springer Berlin Heidelberg.


Applying the Mutation Analysis Criterion to Evaluate Test Cases Generated from State Models

Wallace Felipe Francisco Cardoso, Eliane Martins

Instituto de Computação – Universidade Estadual de Campinas (UNICAMP) – Caixa Postal 6176 – 13081-970 – Campinas – SP – Brazil


Abstract. Models of software systems are frequently used throughout the development process as a bridge between customers and developers. Mutation Analysis is regarded by researchers as an excellent criterion for evaluating test sets. In this work, we propose a testing tool that supports Mutation Analysis in the context of Extended Finite State Machines, and the use of this tool to evaluate the test sets generated by Multi-Objective Search-Based Testing.

1. Introduction and Concepts

The Mutation Analysis criterion is one of the fault-based testing criteria. Some researchers consider it a powerful criterion because it subsumes other testing criteria. The goal of Mutation Analysis is to help the tester create good test sets capable of revealing the presence of specific types of faults.

Since it is not possible to use every possible fault of a given language to derive test requirements, Mutation Analysis relies on two hypotheses to select a subset of possible faults: the competent programmer hypothesis and the coupling effect hypothesis. The competent programmer hypothesis states that competent programmers write programs that are either correct or very close to correct. The coupling effect hypothesis states that if a test case is able to reveal the presence of faults in M1 and M2, it is also able to reveal the presence of faults in M3, where M3 is the combination of M1 and M2 [Delamaro et al. 2007] [Fabbri 1997].

A test requirement in Mutation Analysis is known as a mutant, which is a faulty version of the artifact under test. How and where the mutation occurs is determined by the mutation operators. Given a test case T and a mutant M generated from an artifact A, T is said to have killed M if and only if the test output obtained on M differs from the output obtained on A. If the outputs are equal, two possibilities remain: either M is equivalent to A, or T is unable to expose the differences between them. Equivalent mutants must not be considered in the analysis of the results, so they must be identified and discarded. For a test set to be adequate with respect to Mutation Analysis, all mutants must be dead; that is, for each mutant there must be at least one test case that kills it [Jia and Harman 2011].
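The kill relation can be illustrated with a deliberately simplified sketch in which the artifact is a plain function instead of an Extended Finite State Machine. The artifact, the two mutants, and the test inputs are hypothetical examples chosen for illustration only.

```python
def artifact(x, y):
    """Artifact under test (toy example)."""
    return x + y


def mutant_minus(x, y):
    """Mutant: the arithmetic operator was replaced."""
    return x - y


def mutant_swapped(x, y):
    """Equivalent mutant: swapping operands does not change the sum."""
    return y + x


def kills(test_input, original, mutant):
    """A test case kills a mutant when the two outputs differ."""
    return original(*test_input) != mutant(*test_input)


def live_mutants(test_set, original, mutants):
    """Names of the mutants that no test case in the set kills."""
    return [name for name, mutant in mutants.items()
            if not any(kills(t, original, mutant) for t in test_set)]


mutants = {"minus": mutant_minus, "swapped": mutant_swapped}
weak_tests = [(0, 0)]
better_tests = [(0, 0), (1, 2)]
print(live_mutants(weak_tests, artifact, mutants))
print(live_mutants(better_tests, artifact, mutants))
```

The mutant that survives the stronger test set is the equivalent one, which is why equivalent mutants have to be identified and discarded before judging whether a test set is adequate.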

2. Problem Statement

Testing becomes very complex without a supporting tool. In this work, our goal is to create a test support tool with support for Mutation Analysis in the context of Extended Finite State Machines (EFSMs). Our work will help us evaluate automatic methods for generating test cases from EFSMs. We are interested in assessing the quality of the test sets generated by Multi-Objective Search-Based Testing (MOST), an automatic test case generation method proposed by Yano (2011), in the context of EFSMs.

3. Methods and Results

This work began with a theoretical survey of the Mutation Analysis criterion, its limitations, the mutation operators proposed for Extended Finite State Machines (EFSMs), and a study of how cost reduction techniques could be used in our research. In this first survey, we identified a total of 31 mutation operators and two cost reduction techniques: random mutation and constrained mutation.
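The two cost reduction techniques can be sketched under their usual interpretation in the mutation testing literature: random mutation executes only a random sample of the generated mutants, while constrained mutation generates mutants from a restricted subset of operators. This interpretation, the operator names, and the mutant representation are assumptions made for this illustration.

```python
import random


def random_mutation(mutants, fraction, rng):
    """Keep a random sample containing the given fraction of the mutants."""
    k = max(1, round(len(mutants) * fraction))
    return rng.sample(mutants, k)


def constrained_mutation(mutants, allowed_operators):
    """Keep only the mutants produced by a restricted set of operators."""
    return [m for m in mutants if m[0] in allowed_operators]


# Hypothetical mutants, each identified by (operator name, mutant number).
all_mutants = ([("op_a", i) for i in range(5)] +
               [("op_b", i) for i in range(5)])

sampled = random_mutation(all_mutants, 0.3, random.Random(1))
restricted = constrained_mutation(all_mutants, {"op_a"})
print(len(sampled), len(restricted))
```

Both techniques trade some fault-detection evidence for fewer mutant executions, which matters when each mutant requires generating and running an executable model.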

The second part of the work consists of developing the test support tool. An initial requirements survey was carried out, prioritizing the basic operations of Mutation Analysis: mutant generation, execution, and analysis. For the Extended Finite State Machine model, we use the Unified Modeling Language (UML) notation, with all code written in Tool Command Language (TCL) and files formatted in eXtensible Markup Language (XML). To generate the executable model we use the State Machine Compiler (SMC) library, and to execute the model we use the JACL library. Because of these libraries, we chose to add Java as a design constraint.

In the third part of this work, we intend to use the features of the developed tool to evaluate the test sets generated by MOST.

4. Final Remarks

Mutation testing proved to be very effective compared with other testing criteria. Applying tests to software models helps improve the whole software development process: costs decrease as the quality of the final product increases. Since this is an academic tool, it is expected to contribute to future research involving both Mutation Analysis and automatic test case generation methods.

References

DELAMARO, Márcio Eduardo; MALDONADO, José Carlos; JINO, Mario. “Introdução ao teste de software”. Elsevier Brasil, 2007.

FABBRI, Sandra Camargo Pinto Ferraz. “A análise de mutantes no contexto de sistemas reativos: uma contribuição para o estabelecimento de estratégias de teste e validação”. PhD thesis, USP, São Carlos, SP, 1996.

JIA, Yue; HARMAN, Mark. “An analysis and survey of the development of mutation testing”. Software Engineering, IEEE Transactions on, 2011, 37.5: 649-678.

YANO, Thaise. “Uma abordagem evolutiva multiobjetivo para geração automática de casos de teste a partir de máquinas de estados”. PhD thesis, IC/Unicamp, Campinas, SP, 2011.


Combining active semi-supervised learning through optimum-path forest

Priscila T.M. Saito1,2, Willian P. Amorim3, Alexandre X. Falcao1, Pedro J. de Rezende1, Celso T.N. Suzuki1, Jancarlo F. Gomes1, Marcelo H. de Carvalho3

1Institute of Computing, University of Campinas, SP, Brazil

2Dept. of Computer Engineering, Federal Technological University of Parana, PR, Brazil

3Institute of Computing, Federal University of Mato Grosso do Sul, MS, Brazil

Abstract. Active and semi-supervised learning have emerged as a promising way to perform automatic classification when labeled data are scarce and unlabeled data are abundant. In previously studied approaches, the active learning strategies require the classification and/or organization of samples at each learning iteration. For large datasets, these strategies become very inefficient or even infeasible. We propose an efficient and effective active semi-supervised approach, in which the learning set is substantially reduced and the organization of samples takes place only once. The experimental evaluation shows that our approach improves the selection of the labeled set and diminishes the propagated errors on the unlabeled set. Moreover, it is able to identify samples from all classes quickly while keeping expert interaction to a minimum.

Given the limited availability of labeled samples, in contrast to an unbounded number of unlabeled ones, semi-supervised learning (SSL) has become an increasingly popular learning approach. However, most research in this area assumes that the labeled set is given and fixed [Zhu 2008]. Recent applications involving very large amounts of unlabeled samples would certainly benefit from a combination of active learning and semi-supervised learning strategies [Goldberg et al. 2011, Yu et al. 2010], as this would allow a small number of more representative samples to be identified and labeled. Despite some efforts in active semi-supervised learning, success depends on an approach that can be applied to large real datasets.

We introduce an approach called Active Semi-Supervised Learning Based on Optimum-Path Forest (ASSL-OPF) [Saito et al. 2014a]. ASSL-OPF is a novel integration of semi-supervised learning [Amorim et al. 2014] with a priori reduction and organization criteria [Saito et al. 2014b] for active learning based on OPF classifiers [Papa et al. 2012]. It differs markedly from standard active learning, in which all samples in the database have to be classified and/or reorganized at each learning iteration. The active learning strategy reduces the chance of selecting an irrelevant sample from a large learning set, since a well-chosen size-reduction process and an a priori ordering retain essentially the most informative samples. In ASSL-OPF, the learning set is substantially reduced and the organization of samples takes place only once.
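The learning cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a farthest-first grouping stands in for the a priori reduction and organization criteria, a plain 1-nearest-neighbor rule stands in for the semi-supervised OPF classifier, and the names `organize_once`, `propagate` and `active_semi_supervised` are hypothetical. The point it shows is structural: samples are organized once, and each iteration only walks the precomputed ordering.

```python
import math

def farthest_first(samples, k):
    """Pick k spread-out representatives (indices); assumes k <= len(samples)."""
    centers = [0]
    while len(centers) < k:
        nxt = max(range(len(samples)),
                  key=lambda i: min(math.dist(samples[i], samples[c]) for c in centers))
        centers.append(nxt)
    return centers

def organize_once(samples, k):
    """A priori organization: group samples around k representatives and
    order each group from its representative outward. Runs only once."""
    centers = farthest_first(samples, k)
    groups = [[] for _ in centers]
    for idx, s in enumerate(samples):
        g = min(range(k), key=lambda j: math.dist(s, samples[centers[j]]))
        groups[g].append(idx)
    for g, c in zip(groups, centers):
        g.sort(key=lambda idx: math.dist(samples[idx], samples[c]))
    return groups

def propagate(samples, labeled):
    """Stand-in for the semi-supervised OPF step (assumption): every sample
    takes the label of its nearest expert-labeled sample."""
    items = list(labeled.items())
    return [min(items, key=lambda kv: math.dist(s, samples[kv[0]]))[1] for s in samples]

def active_semi_supervised(samples, oracle, k=2, per_iter=1, iterations=2):
    groups = organize_once(samples, k)        # organization happens once
    cursors = [0] * k
    labeled, predicted = {}, []
    for _ in range(iterations):
        for g, group in enumerate(groups):    # walk the precomputed ordering
            for idx in group[cursors[g]:cursors[g] + per_iter]:
                labeled[idx] = oracle(idx)    # the expert annotates this sample
            cursors[g] += per_iter
        predicted = propagate(samples, labeled)
    return labeled, predicted

# Toy data: two well-separated classes.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
labeled, predicted = active_semi_supervised(samples, oracle=lambda i: truth[i])
```

On this toy data the expert labels only four of the eight samples, and both classes are reached in the first iteration because each group contributes its own representative.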

We carried out an experimental evaluation on real datasets. We compared ASSL-OPF with RSSL-OPF, a semi-supervised learning approach [Amorim et al. 2014] in which the labeled and unlabeled samples were randomly selected. The results demonstrate that our ASSL-OPF approach reaches high accuracy quickly by selecting the most representative set to be annotated by an expert, while reducing the errors propagated to the unlabeled set throughout the semi-supervised learning iterations. In addition, ASSL-OPF identifies and selects samples from all classes more quickly while keeping expert interaction to a minimum. To convey the quality of the results obtained by ASSL-OPF on each dataset, we present the total size of the learning set Z, the total number of samples annotated/corrected by the expert, the mean accuracies with standard deviations, and the computational time (in minutes) for selecting the most representative samples at the end of the learning cycle (Table 1). These results highlight the efficacy and efficiency of ASSL-OPF in practical applications.

Table 1. Total size of the learning set Z, total number of annotated samples, mean accuracies ± standard deviations, and selection computational times (in minutes) for ASSL-OPF.

Dataset      |Z|      Annotated   Accuracy ± std    Time (min)
Statlog      1,761    8.74%       90.79% ± 0.91     0.14
Faces        1,469    9.12%       99.26% ± 0.25     0.13
Pendigits    8,791    4.72%       98.76% ± 0.48     25.22
Cowhide      1,351    2.27%       99.66% ± 0.10     0.56
Parasites    1,455    6.45%       97.82% ± 1.69     0.85
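To get a feel for the expert effort behind the percentages in Table 1, the short sketch below converts each annotated percentage into an approximate sample count. It assumes the percentage is taken relative to |Z| and rounds to the nearest integer; both are assumptions, so the counts are only indicative.

```python
# (|Z|, annotated percentage) per dataset, copied from Table 1.
TABLE_1 = {
    "Statlog": (1761, 8.74),
    "Faces": (1469, 9.12),
    "Pendigits": (8791, 4.72),
    "Cowhide": (1351, 2.27),
    "Parasites": (1455, 6.45),
}

def annotated_count(size, percent):
    """Approximate number of annotated samples, assuming percent is of |Z|."""
    return round(size * percent / 100)

for name, (size, percent) in TABLE_1.items():
    print(f"{name:10s} about {annotated_count(size, percent)} of {size} samples")
```

Under that assumption, the expert would handle on the order of tens to a few hundred samples per dataset.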

References

Amorim, W. P., Falcao, A. X., and de Carvalho, M. H. (2014). Semi-supervised pattern classification using optimum-path forest. In XXVII SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 1–7.

Goldberg, A. B., Zhu, X., Furger, A., and Xu, J.-M. (2011). Oasis: Online active semi-supervised learning. In AAAI Conference on Artificial Intelligence. AAAI Press.

Papa, J. P., Falcao, A. X., de Albuquerque, V. H. C., and Tavares, J. M. R. S. (2012). Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition, 45(1):512–520.

Saito, P. T. M., Amorim, W. P., Falcao, A. X., de Rezende, P. J., Suzuki, C. T. N., Gomes, J. F., and de Carvalho, M. H. (2014a). Active semi-supervised learning using optimum-path forest. In 22nd International Conference on Pattern Recognition, pages 1–6.

Saito, P. T. M., de Rezende, P. J., Falcao, A. X., Suzuki, C. T. N., and Gomes, J. F. (2014b). An active learning paradigm based on a priori data reduction and organization. Expert Systems with Applications, 41(14):6086–6097.

Yu, D., Varadarajan, B., Deng, L., and Acero, A. (2010). Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion. Computer Speech and Language, 24(3):433–444.

Zhu, X. (2008). Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.


Development of a low-cost environment for the use of Tangible Interfaces in teaching elementary school children

Marleny Luque Carbajal1, M. Cecília C. Baranauskas1

1Instituto de Computação – Universidade Estadual de Campinas (Unicamp)

Caixa Postal 6176 13084-971 – Campinas – SP – Brazil


Abstract. Tangible Interfaces introduce innovative forms of interaction that may be more natural to humans, and they have spread widely across various fields, including education. Tangible Interfaces have great potential to support children's interaction with technology, because they exploit the innate human ability to act in physical space and to interact with physical objects. Developing countries can hardly invest in this type of technology, for economic reasons among others. This research project aims to develop a low-cost environment based on Tangible Interfaces that complements teaching in elementary education.

1. Introduction

In recent years, the widespread use of computer systems has required interaction between users and computers to be as natural and accessible as possible. Users' need to carry out their work “naturally” has demanded new technologies. In this scenario, Tangible Interfaces emerge, allowing people to interact with digital information through a physical environment in a more natural way [Ullmer and Ishii 2000]. The tangible approach applied to education has the potential to stimulate and engage several senses (sight, hearing, touch), and to promote greater social inclusion.

2. Objective

To develop a low-cost environment of playful educational games based on Tangible Interfaces.


3. Method

Through a webcam, the information associated with the fiducial marker is sent to the open-source framework reacTIVision [Kaltenbrunner 2009]. We developed an application for the Linux platform that connects reacTIVision to the Scratch programming language. The response is shown as image and audio on the computer screen. The environment shown in Figure 1 includes a small low-cost computer, the Raspberry Pi, which accepts a mouse and keyboard and can be connected to a television set or a computer monitor. To evaluate the environment (hardware and software), case studies will be carried out with the children and other actors of the teaching environment.

Figure 1. Low-cost environment.
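The text does not detail how the bridge application talks to Scratch. One plausible route, assumed here, is the remote sensor connection offered by Scratch 1.4, which accepts messages framed as a 4-byte big-endian length followed by the message text (by default over TCP port 42001). The sketch below only builds such a message for a fiducial ID that has already been extracted from reacTIVision's TUIO output; the ID-to-broadcast mapping and the function names are hypothetical.

```python
import struct

# Hypothetical mapping from fiducial marker IDs to Scratch broadcast names.
FIDUCIAL_TO_BROADCAST = {0: "show_cat", 1: "show_dog", 2: "show_bird"}

def scratch_message(text):
    """Frame a message for a Scratch 1.4 remote sensor connection:
    4-byte big-endian payload length, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def message_for_fiducial(fiducial_id):
    """Return the framed broadcast for a known marker, or None otherwise."""
    name = FIDUCIAL_TO_BROADCAST.get(fiducial_id)
    if name is None:
        return None
    return scratch_message(f'broadcast "{name}"')

message = message_for_fiducial(0)
```

The resulting bytes could then be written to a socket connected to Scratch, where a script listening for the broadcast would show the image and play the audio.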

4. Example scenario

The child should use pieces with which he or she is familiar. Each piece will have an associated fiducial marker. The only input device to the computer will be a camera. The child interacts with the system by showing objects to the camera. The installed software must “recognize” the code associated with the object. The system must then give an appropriate response through image and audio.

5. Conclusion

Tangible Interfaces are an innovative approach that proposes enriching concrete materials with computational elements, creating useful teaching resources in the field of education. In this project we propose a low-cost environment, composed of hardware and software, that takes Tangible Interfaces (TUI) as input and produces a graphical interface (GUI) as output, complementing elementary-level teaching without requiring a large investment.

References

Ullmer, B. and Ishii, H. (2000) “Emerging frameworks for tangible user interfaces”, In: IBM Systems Journal, Vol. 39(3/4): 915-931

Kaltenbrunner, M. (2009) “reacTIVision and TUIO: A tangible tabletop toolkit”. In: Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces, pp. 9–16.


Exploring the Benefits of Tangible Interfaces in the Treatment of Autistic Children

Kim P. Braga1, Maria Cecília C. Baranauskas1

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP) Caixa Postal 6176 13084-971 – 13.083-852 – Campinas – SP – Brasil


Abstract. This paper presents a proposal for tangible interaction on a tablet, using its magnetometer to identify tokens. This interaction will be used in educational games for children with autism, in the hope of bringing the children closer to computing and making the activities more fun and interactive.

1. Context

Autism is a Pervasive Developmental Disorder (PDD) characterized by three main features: (1) impaired social interaction skills; (2) impaired language development; and (3) restrictive and repetitive behavior [1].

Given the growing recognition of people with this disorder [2] and the learning and communication difficulties that autistic children face in their first years of life, this work proposes to investigate the benefits that tangible interfaces can bring to these children.

2. Tangible Interfaces

Tangible interfaces are ways of interacting with digital content through physical objects [4]. Among the various ways of achieving this kind of interaction, this work provides tangibility by using a tablet as the platform for the tokens (interactive objects).

Tokens are identified by magnets placed inside them: each token's magnet configuration produces a distinct change in the magnetic field, which lets the application recognize which token is being placed on the tablet.
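A minimal sketch of how such an identification step might work is given below. It assumes that the application reads a three-axis magnetometer vector, subtracts a baseline taken with no token present, and picks the closest stored signature. The signature values, the threshold, and the name `identify_token` are hypothetical and not taken from the project.

```python
import math

# Hypothetical signatures: change in the magnetic field (microtesla, x/y/z)
# measured when each token rests on the tablet.
TOKEN_SIGNATURES = {
    "circle": (40.0, 0.0, -15.0),
    "square": (-35.0, 10.0, 20.0),
    "triangle": (5.0, -45.0, 30.0),
}

def identify_token(reading, baseline, max_distance=15.0):
    """Return the token whose signature is nearest to (reading - baseline),
    or None when nothing is close enough (for example, no token present)."""
    delta = tuple(r - b for r, b in zip(reading, baseline))
    name, signature = min(TOKEN_SIGNATURES.items(),
                          key=lambda kv: math.dist(delta, kv[1]))
    return name if math.dist(delta, signature) <= max_distance else None

baseline = (20.0, -5.0, 40.0)            # ambient field with no token
print(identify_token((58.0, -4.0, 27.0), baseline))
print(identify_token(baseline, baseline))
```

The distance threshold keeps the application from reporting a token when the field has barely changed; in practice it would need calibration per device.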

3. The Game

During the project's conception phase, we contacted the APAE of Poços de Caldas. There, the children carry out activities with concrete objects that help in the treatment of their condition and aim to teach basic skills such as recognizing shapes and colors, forming words, and basic mathematics, among others.

The game offers two types of activity, shape recognition and color recognition. Each activity has three difficulty levels, which differ in the total number of shapes or colors used in the challenges: three at the beginner level, six at the intermediate level, and nine at the advanced level. Each challenge asks the child to place on the tablet an object (token) that has the corresponding graphical representation. In each game cycle, that is, each time an activity type (shapes or colors) and a difficulty (beginner, intermediate, advanced) are selected, the game presents ten challenges in that configuration; this helps maintain the idea of beginning, middle, and end, which is an important factor in an autistic child's routine.
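The cycle structure described above can be sketched as follows. The item names and the rule of drawing challenges at random, with repetition, from the first n items of each list are assumptions made for illustration; only the level sizes (three, six, nine) and the ten challenges per cycle come from the text.

```python
import random

LEVELS = {"beginner": 3, "intermediate": 6, "advanced": 9}
CHALLENGES_PER_CYCLE = 10

# Hypothetical item lists; each level uses the first n entries.
ITEMS = {
    "shapes": ["circle", "square", "triangle", "star", "heart",
               "rectangle", "diamond", "oval", "pentagon"],
    "colors": ["red", "blue", "yellow", "green", "orange",
               "purple", "pink", "brown", "black"],
}

def build_cycle(activity, level, rng=None):
    """Return the ten challenges for one game cycle."""
    rng = rng or random.Random()
    pool = ITEMS[activity][:LEVELS[level]]
    return [rng.choice(pool) for _ in range(CHALLENGES_PER_CYCLE)]

def is_correct(expected, token_placed):
    """A challenge is solved when the placed token matches the request."""
    return expected == token_placed

cycle = build_cycle("colors", "beginner", random.Random(1))
```

A fixed number of challenges per cycle gives every session the same clear beginning, middle, and end.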

4. Final Remarks

We expect practical tests of the application to reveal positive indicators for the professional treatment of the condition (e.g., we expect to observe aspects such as interaction time, signs of socialization, affection, and enjoyment, where appropriate). We have overcome one of the project's initial challenges, namely how to achieve tangible interaction using a tablet as the platform, taking references [5] and [6] as a starting point. The next steps of the project are the use and evaluation of the application in a dedicated room where the autistic children carry out their routine activities.

References

[1] Dr. Drauzio Varella. Corpo Humano, Autismo. Available at: < http://drauziovarella.com.br/crianca-2/autismo/ >. Accessed: Aug. 2013.

[2] SCHWARTZMAN, José Salomão (September 16, 2010). Autismo e outros transtornos do espectro autista. Revista Autismo, September 2010 issue.

[3] JUNIOR, Paiva (March 20, 2013). Há 1 autista em cada 50 crianças nos EUA, segundo governo. Revista Autismo, March 2013 issue. Available at: < http://www.revistaautismo.com.br/noticias/ha-1-autista-em-cada-50-criancas-nos-eua>. Accessed: Aug. 18, 2013.

[4] Ishii, H. (2008). Tangible bits: beyond pixels. Interface, xv. doi:10.1145/1347390.1347392

[5] Ketabdar, H., Yüksel, K., & Roshandel, M. (2010). MagiTact: interaction with mobile devices based on compass (magnetic) sensor. Proceedings of the 15th International Conference on Intelligent User Interfaces. Retrieved from http://dl.acm.org/citation.cfm?id=1720048

[6] Bianchi, A., & Oakley, I. (2013). Designing tangible magnetic appcessories. Proceedings of the 7th International Conference on Tangible, Embedded and Embodied Interaction - TEI ’13, 255. doi:10.1145/2460625.2460667


Identification of Malicious Programs by Monitoring Suspicious Activities in the Operating System

Marcus Felipe Botacin1, Prof. Dr. Paulo Lício de Geus1, André R. A. Grégio1,2

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)

{marcus,paulo}@lasca.ic.unicamp.br

2Centro de Tecnologia da Informação Renato Archer (CTI)

[email protected]

Abstract. Malware is a persistent threat to system security and evolves constantly to hinder detection and dynamic analysis techniques. Currently, there is no publicly known dynamic analysis system that supports 64-bit malware. In this paper, we propose a novel malware dynamic analysis system for Windows 8 and present the results of testing it with 2,937 malware samples.

1. Introduction

Windows operating systems are the main targets of attacks by malware, malicious programs that subvert the legitimate operation of a computer system so as to violate its integrity, confidentiality, or availability. Although Windows 7 and 8 (NT 6.x) introduce several new mechanisms to increase security, they remain compatible with applications written for XP (NT 5) and can therefore run both 32-bit and 64-bit programs. Given the differences between Windows versions, it is necessary to understand how malware samples behave when executed on Windows 8. More than that, it is necessary to design and implement a new tool for monitoring program execution behavior that can operate on 64-bit systems. In this paper, we propose a malware dynamic analysis system based on Windows 8 and present the results obtained by executing more than 2,000 malware samples.

2. Technical Aspects of NT 6.x

Kernel patch protection, as defined in [Microsoft 2013], prevents drivers from extending or replacing kernel services through undocumented means. Another mechanism meant to keep arbitrary components from being loaded into the kernel is Microsoft's driver signing requirement. To isolate different families of applications, Windows began to implement the concept of sessions, in which the data and execution privileges of each family are restricted to that family. In addition to these mechanisms, new programming interfaces are available, and using them is essential to exploit the full capability of the Windows 8 kernel.


3. Design of the Proposed System

Rossow et al. [Rossow et al. 2012] argue that the monitoring method must operate at a more privileged level than the object under analysis. We therefore implemented the monitoring tool as a kernel driver that applies callback and filesystem filter techniques to intercept the actions performed by the monitored program on the Registry, process, and file system subsystems. On the network side, the traffic generated during execution is captured with tcpdump.

4. Tests and Results

We analyzed 2,937 malware samples collected between January and June 2014 (Table 1). The samples exhibited several potentially malicious actions, such as terminating antivirus and firewall mechanisms installed in the operating system, creating new files, and injecting code, as well as attempting persistence and removing evidence. The analysis of network traffic showed that roughly half of the samples used the HTTP protocol to fetch new components or send data to the attacker. Activity was also observed on other ports, including e-mail sending and IRC connections.

Table 1. Monitored activities and number of samples that exhibited them.

Activity                   Count    Percentage
Registry write             1073     36.53%
Registry key removal       772      26.29%
Process creation           602      20.50%
Process termination        1337     45.52%
File write                 1028     35.00%
File read                  1694     57.68%
File removal               551      18.76%
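The percentages in Table 1 follow directly from the counts and the 2,937 analyzed samples. The short check below recomputes them; the English activity labels are translations, and rounding to two decimals is an assumption about how the table was produced. Because one sample can exhibit several activities, the percentages do not add up to 100%.

```python
TOTAL_SAMPLES = 2937

# Counts copied from Table 1 (activity labels translated).
COUNTS = {
    "Registry write": 1073,
    "Registry key removal": 772,
    "Process creation": 602,
    "Process termination": 1337,
    "File write": 1028,
    "File read": 1694,
    "File removal": 551,
}

def percentage(count, total=TOTAL_SAMPLES):
    """Share of samples that showed an activity, in percent."""
    return round(100 * count / total, 2)

for activity, count in COUNTS.items():
    print(f"{activity:22s} {count:5d} {percentage(count):6.2f}%")
```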

5. Conclusion

To our knowledge, the proposed system is the only one that can both execute files in the PE+ (64-bit) format and provide a 64-bit environment (Windows 8) for malware analysis. The results show that, because of backward compatibility, even the security mechanisms introduced since NT 6 do not prevent samples compiled for Windows XP from infecting Windows 8.

References

Microsoft (2013). Kernel patch protection for x64-based operating systems. http://technet.microsoft.com/pt-br/library/cc759759(v=ws.10).aspx. Accessed June 2014.

Rossow, C., Dietrich, C. J., Kreibich, C., Grier, C., Paxson, V., Pohlmann, N., Bos, H., and van Steen, M. (2012). Prudent Practices for Designing Malware Experiments: Status Quo and Outlook. In Proceedings of the 33rd IEEE Symposium on Security and Privacy (S&P), San Francisco, CA.


Implementation of an ARMv7 platform simulator using ArchC

Gabriel K. Bertazi, Edson Borin

1Instituto de Computação – Universidade Estadual de Campinas (UNICAMP)

Av. Albert Einstein, 1251 – Campinas – SP – Brasil


Abstract. ArchC is an Architecture Description Language designed to generate single-process virtual machines. This paper presents the modifications we made to an ARM model generated with ArchC in order to extend it to support a complex operating system, such as GNU/Linux. Beyond providing us with a functional ARMv7 platform simulator for research and educational purposes, this project gave us important insight into how to extend ArchC to automatically generate system-level virtual machines.

Summary. ArchC is an architecture description language that allows an architecture to be modeled in a high-level simulated interface, in reduced time and without the cost of physically implementing hardware devices. However, its models could only run simple user-level applications. In this work, we developed a platform model of a device based on the ARMv7 architecture, capable of supporting a GNU/Linux operating system. Besides offering a powerful platform for research and development of ARMv7-based applications, the project made it possible to identify limitations of ArchC in describing complex systems and provided mechanisms for improving the tool.

1. Introduction

Simulating a complex computer system requires several modules besides the central processing unit, which executes the interpreted instructions. In fact, devices peripheral to the processor, such as the memory management unit, coprocessors, and hardware controllers, which can be as complex as the core itself, play a fundamental role in system simulators, that is, those capable of running an operating system and its many processes.

While the ArchC language [Rigo et al. 2004] lets us model the core easily through a high-level description of the simulated instructions, it offers no support for describing these peripherals, which prevents it from generating platform simulators. This work extends an ARM processor model [Auler and Centoducatte 2009] to make it capable of running the GNU/Linux operating system.

We modeled, and connected to the processing core written in ArchC, several models of hardware peripherals that are part of the iMX53 Quick Start Board development platform [Freescale 2012], which uses an ARMv7 processor. We were thus able to execute all stages of the device's boot, from the static ROM code through the entire bootloader to part of the loading of the Linux kernel.

Besides providing a functional platform virtual machine for an ARMv7 system, this project made it possible to identify several limitations of ArchC in describing complex architectures and to improve the tool with the addition of a decode cache that supports self-modifying code and of a completely new decoding unit.
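The idea behind a decode cache that tolerates self-modifying code can be illustrated with the sketch below. It is not ArchC's implementation: the class, the toy instruction encoding, and the invalidate-on-write policy are assumptions chosen to show the principle that a decoded instruction may be reused only until the memory word it came from is overwritten.

```python
# Toy encoding (not ARM): top byte is the opcode, low 24 bits the operand.
OPCODES = {0x01: "mov", 0x02: "add"}

def toy_decode(word):
    return OPCODES.get(word >> 24, "unknown"), word & 0xFFFFFF

class DecodeCache:
    """Caches decoded instructions per address; writes invalidate entries."""

    def __init__(self, memory, decode):
        self.memory = memory      # dict: address -> raw instruction word
        self.decode = decode
        self.cache = {}
        self.misses = 0

    def fetch(self, address):
        if address not in self.cache:
            self.misses += 1      # decode only on a miss
            self.cache[address] = self.decode(self.memory[address])
        return self.cache[address]

    def write(self, address, word):
        self.memory[address] = word
        self.cache.pop(address, None)   # drop the stale decoding, if any

cache = DecodeCache({0x0: 0x01000005}, toy_decode)
first = cache.fetch(0x0)          # decoded once
again = cache.fetch(0x0)          # served from the cache
cache.write(0x0, 0x02000007)      # the program rewrites its own code
after = cache.fetch(0x0)          # decoded again from the new word
```

Without the invalidation in `write`, the simulator would keep executing the old instruction after the code had been modified.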

2. ARMv7 platform model

Besides the processing core modeled in ArchC, we developed the following peripheral models to make it possible to run the operating system: Memory Management Unit, System Control Coprocessor, buses, ROM and RAM memories, SD controllers and cards, and several other simpler devices. All these devices were modeled according to the iMX53 specification and connected to the core through generic interfaces, with a view to automating the generation of peripheral devices with ArchC in the future.
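A generic interface of this kind is often a memory-mapped bus: every device exposes the same read and write operations, and the bus routes each access by address range. The sketch below shows that pattern under assumptions; the class names are hypothetical and the addresses are illustrative, not the iMX53 memory map.

```python
class Memory:
    """Byte-addressable memory; a ROM is simply a non-writable instance."""

    def __init__(self, size, writable=True, image=b""):
        self.data = bytearray(size)
        self.data[:len(image)] = image   # image is assumed to fit in size
        self.writable = writable

    def read(self, offset):
        return self.data[offset]

    def write(self, offset, value):
        if not self.writable:
            raise PermissionError("write to read-only device")
        self.data[offset] = value & 0xFF

class Bus:
    """Routes each access to the device mapped at that address."""

    def __init__(self):
        self.regions = []

    def attach(self, base, size, device):
        self.regions.append((base, size, device))

    def _find(self, address):
        for base, size, device in self.regions:
            if base <= address < base + size:
                return device, address - base
        raise ValueError(f"no device mapped at {address:#x}")

    def read(self, address):
        device, offset = self._find(address)
        return device.read(offset)

    def write(self, address, value):
        device, offset = self._find(address)
        device.write(offset, value)

bus = Bus()
bus.attach(0x0000, 16, Memory(16, writable=False, image=b"\xAA\xBB"))  # ROM
bus.attach(0x1000, 16, Memory(16))                                      # RAM
bus.write(0x1004, 0x7F)
```

Because the core only ever talks to the bus, a new peripheral can be added by implementing the same two methods and attaching it at a base address.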

We also developed a bootstrapping system in ARM assembly language, to be placed in the ROM. This is the first code executed in the system; it starts the boot process by loading and executing the bootloader stored in the memory of the modeled SD card.

With these models and systems, we were able to run an unmodified image of the U-Boot bootloader with the Linux kernel inside the simulation environment, up to an advanced stage of the Linux loading process, at which point we ran into limitations in ArchC's instruction decoder that prevented the modeling of certain system instructions vital to the continuation of the boot process.

3. Conclusion

This work presents the development of an ARMv7-based system virtual machine using the ArchC language. We demonstrated a partial boot of the system, passing through several stages until reaching the Linux kernel, where further progress was hindered by ArchC limitations related to instruction decoding.

Modeling a platform virtual machine is a complicated task that requires complex techniques for both development and debugging. The partial execution of the GNU/Linux boot process on the model gave us an overall view of the complexity involved, made it possible to identify limitations of ArchC, and led to new debugging techniques that simplify the generation of ArchC models.

Future work can focus on mechanisms to overcome ArchC's limitations in the decoding process, as well as on completing the boot process in the simulator, so that it can run any user-level application.

References

Auler, R. and Centoducatte, P. (2009). Dotando ArchC com infraestrutura para geração de montadores e simuladores ARM. In WSCAD-WIC ’09: X Simpósio em Sistemas Computacionais - Workshop de Iniciação Científica.

Freescale (2012). i.MX53 Multimedia Applications Processor Reference Manual. Freescale Semiconductors, Inc., Arizona - USA, imx53rm rev 2.1 edition.

Rigo, S., Araujo, G., Bartholomeu, M., and Azevedo, R. (2004). ArchC: a SystemC-based architecture description language. In 16th Symposium on Computer Architecture and High Performance Computing, 2004 - SBAC-PAD 2004.


Energy-saving opportunities in mobile devices

João H. S. Hoffmam, Sandro Rigo, Edson Borin

1Instituto de Computação – Universidade Estadual de Campinas (Unicamp)
Campinas – SP – Brasil

[email protected], [email protected], [email protected]

Abstract. This paper investigates the use of parallel code on mobile devices and its relation to the energy consumption of such equipment, based on experiments carried out with different frequencies and numbers of processing cores. The results show the advantages of parallelization, as well as the possibility of reducing energy consumption even further by carefully selecting the cores' operating frequencies in combination with application parallelization.

1. Introduction

Mobile devices represent a major revolution: equipment such as smartphones and tablets brought about a new way of thinking about and using technology, captured a large and profitable share of the market, and posed new challenges for science. Today such devices are expected to deliver ever better performance while battery life must remain sufficient for unrestricted everyday use.

In this work, we explore how running parallel code on mobile devices with multiple processing cores can affect the power, energy consumption, and performance of the system.

2. Methodology

To explore how running parallel code on mobile devices with multiple processing cores can affect the power, energy consumption, and performance of the system, we used the following materials:

• A Samsung Galaxy S3 smartphone (GT-I9300);
• A Samsung Galaxy S4 smartphone (GT-I9500);
• A Minipa high-precision power supply (model MPL-3305M);
• A 0.25 ohm resistor, built by connecting four high-precision 1 ohm resistors in parallel;
• A National Instruments NI USB-6212 data acquisition device;
• A workstation for collecting and processing the data produced by the acquisition device;
• The PARSEC 3.0 benchmark suite; and
• GCC version 4.8 for ARM, configured to use the processor's floating-point instructions (hard-float).
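The materials listed above suggest a shunt-resistor measurement, which can be sketched as follows. The wiring and the processing are assumptions, since the text does not describe them: the acquisition device is taken to sample the voltage across the 0.25 ohm resistor and the voltage at the phone, the current follows from Ohm's law, and energy is accumulated as power times the sampling period. The numbers in the example are invented for illustration.

```python
def parallel_resistance(resistors):
    """Equivalent resistance of resistors connected in parallel (ohms)."""
    return 1.0 / sum(1.0 / r for r in resistors)

def energy_joules(shunt_volts, device_volts, shunt_ohms, sample_period_s):
    """Integrate power over time with a simple rectangle rule.

    shunt_volts:  voltage samples across the shunt resistor
    device_volts: voltage samples at the device under test
    """
    energy = 0.0
    for v_shunt, v_device in zip(shunt_volts, device_volts):
        current = v_shunt / shunt_ohms              # I = V / R
        energy += v_device * current * sample_period_s   # E += P * dt
    return energy

shunt = parallel_resistance([1.0, 1.0, 1.0, 1.0])    # four 1 ohm resistors
# Illustrative trace: 100 samples at 10 ms, 50 mV across the shunt, 4 V supply.
total = energy_joules([0.05] * 100, [4.0] * 100, shunt, 0.01)
```

In this made-up trace the device draws 0.2 A at 4 V, that is 0.8 W, for one second. A small shunt value keeps the voltage drop it adds to the supply path low.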


3. Results

The results varied with the benchmark run. Only the most common case is shown here. The plot in Figure 1(a) shows energy consumption as a function of frequency on the Galaxy S3 for a video encoding application with a high degree of parallelism, while Figure 1(b) shows time and energy consumption as a function of power for the most efficient configuration.

[Figure 1(a): Energy (J) as a function of frequency (MHz) for 1, 2, 3 and 4 active cores]

[Figure 1(b): Energy (J) and time (s) as a function of power (W) for 4 active cores]

Figure 1. Experimental results for the “x264” application

4. Conclusions

Energy consumption in smartphones and tablets is one of the main challenges for the evolution of these devices. In this work, we explored how running parallel code on mobile devices with multiple processing cores can affect the power, energy consumption, and performance of the system.

The results allow us to conclude, first, that code parallelization can reduce energy consumption significantly. Moreover, the convexity proposed by Vogeleer [Vogeleer et al. 2014] can be observed in the experiments, with the point of lowest consumption at around 800 MHz, corresponding in terms of power to about 0.8 W. This result indicates that large energy savings can be achieved if the user is willing to spend slightly more time running an application. In addition, the dissipated power will be lower, consequently generating less heat and allowing more comfortable use of the device.
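Why energy can be convex in frequency is easy to see with a first-order model, sketched below. This is not the model of Vogeleer et al. and it is not fitted to the measurements: it simply assumes a constant static power plus a dynamic power that grows with the cube of the frequency, and an execution time inversely proportional to frequency. The constants are illustrative and were chosen so that the minimum lands near the values reported above.

```python
def energy(freq_ghz, work_gcycles=10.0, p_static_w=0.512, c_dyn=0.5):
    """Energy (J) to run a fixed amount of work at a given frequency.

    Assumed model: power = p_static + c_dyn * f^3, time = work / f.
    """
    power = p_static_w + c_dyn * freq_ghz ** 3    # watts
    time = work_gcycles / freq_ghz                # seconds
    return power * time

def best_frequency(freqs):
    """Frequency with the lowest energy among the candidates."""
    return min(freqs, key=energy)

freqs = [f / 10 for f in range(2, 15)]            # 0.2 GHz to 1.4 GHz
f_min = best_frequency(freqs)

# Setting dE/df = 0 in this model gives f^3 = p_static / (2 * c_dyn).
f_analytic = (0.512 / (2 * 0.5)) ** (1 / 3)
```

Below the minimum the run takes so long that static power dominates the energy; above it the cubic dynamic term dominates. With these constants the minimum falls at 0.8 GHz, where the modeled power is about 0.77 W.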

References

Vogeleer, K., Memmi, G., Jouvelot, P., and Coelho, F. (2014). The energy/frequency convexity rule: Modeling and experimental validation on mobile devices. In Parallel Processing and Applied Mathematics. Springer Berlin Heidelberg.


Video-Based Face Spoofing Detection through Visual Rhythm Analysis

Allan Pinto 1, Anderson Rocha 1

1Institute of Computing – University of Campinas (Unicamp)
Campinas-SP, Brazil, 13083-852

Abstract. Recent advances in biometrics, information forensics, and security have improved the accuracy of biometric systems, mainly those based on facial information. However, an ever-growing challenge is the vulnerability of such systems to impostor attacks, in which users without access privileges try to authenticate themselves as valid users. In our work, we present a solution to video-based face spoofing attacks on biometric systems. This type of attack is characterized by presenting a video of a real user to the biometric system. To evaluate the effectiveness of the proposed approach, we introduce the novel Unicamp Video-Attack Database (UVAD), which comprises 14,870 videos composed of real access and spoofing attack videos.

1. Conceptualization and Motivation

Biometric authentication is a technology concerned with recognizing humans automatically and uniquely based on behavioral, physical, and chemical traits. Examples of physical traits include fingerprint, hand geometry and veins, face, iris, and retina. Speech and handwriting are examples of behavioral traits, and skin odor and DNA (Deoxyribonucleic Acid) information are examples of chemical traits [Jain and Ross 2008]. In recent decades, biometrics has emerged as an important mechanism for access control and has been used in many applications in which traditional methods, including those based on knowledge (e.g., keywords) or on tokens (e.g., smart cards), might be ineffective since they are easily shared, lost, stolen, or manipulated. In contrast, biometric access control has proved to be a natural and reliable authentication method [Jain and Ross 2008].

However, while significant advances have been achieved in biometrics, several spoofing attack techniques have been developed to deceive biometric systems. A spoofing attack is a type of attack in which an impostor presents fake biometric data to the acquisition sensor with the goal of authenticating as a legitimate user. Depending on the biometric trait used by the system, this mode of attack can be easily accomplished because some biometric data can be synthetically reproduced without much effort. Face biometric systems are highly vulnerable to such attacks since facial traits are widely available on the Internet, on personal websites and social networks. In addition, facial samples of a person can easily be collected with a digital camera. In the context of face biometrics, a spoofing attack can be attempted by presenting to the acquisition sensor a photograph, a video, or a 3D face model of a legitimate user enrolled in the database. If an impostor succeeds in the attack using any of these approaches, the uniqueness premise of the biometric system is violated, making the system vulnerable [Jain and Ross 2008].


2. Contributions and Results

In our work [Pinto 2013], we present a method for detecting video-based face spoofing attacks under the hypothesis that fake and real biometric data contain different acquisition-related noise signatures. To the best of our knowledge, this was the first attempt to deal with video-based face spoofing through the analysis of global information that is invariant to the video content. Our solution exploits the artifacts added to the biometric samples while the videos are shown on the display devices, as well as the noise signatures added during the recapture process performed by the acquisition sensor of the biometric system. Through the spectral analysis of the noise signature and the use of visual rhythms, we designed a feature characterization process that incorporates temporal information about the behavior of the noise signal in the biometric samples. To contemplate a more realistic scenario, this dissertation introduces the Unicamp Video-based Attack Database (UVAD) [Pinto 2013]¹, specifically developed to evaluate video-based attacks in order to verify the following aspects: (i) the behavior of the method for attempted attacks with high-resolution videos; (ii) the influence of the display devices on our method; (iii) whether attacks with tablets are more difficult to detect; (iv) the influence of the biometric sensor on our method; (v) the best feature characterization to capture the video artifacts; and (vi) a comparison with one of the best anti-spoofing methods of note for photo-based spoofing attacks.

Such verifications are possible due to the diversity of the devices used to create the database, which comprises valid access and attempted attack videos of 304 different people. Each user was filmed in two sessions in different scenarios and lighting conditions. The attempted attack videos were produced using eight different display devices and three digital cameras from different manufacturers. The database has 608 valid access videos and 14,262 videos of video-based attempted spoofing attacks, all in full high definition quality. Results showed that both the biometric sensor and the display device used during the attack influence the proposed method. We also conclude that visual rhythms are best characterized when treated as texture maps and that tablet-based attacks are normally harder to detect than LCD-based attacks. Finally, using a Gaussian filter, a horizontal visual rhythm, the Gray Level Co-occurrence Matrix feature descriptor, and the Support Vector Machine classification technique, we obtained AUCs of 83.74%, 100.00%, and 100.00% on three different sensors.
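As a rough illustration of two of the ingredients named above, the sketch below builds a horizontal visual rhythm by stacking the central row of each frame and then computes a gray-level co-occurrence matrix (GLCM) and its contrast over the result. It is not the implementation from [Pinto 2013]: the Gaussian filtering, the spectral analysis, and the SVM are omitted, the frames are assumed to be small nested lists of already quantized gray levels, and all function names are hypothetical.

```python
def horizontal_visual_rhythm(frames):
    """Stack the central row of each frame into a (num_frames x width) image.

    Assumption: each frame is a list of rows, each row a list of integer
    gray levels. Real inputs would first be filtered and transformed.
    """
    rhythm = []
    for frame in frames:
        center = len(frame) // 2
        rhythm.append(list(frame[center]))
    return rhythm


def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels for pixel pairs at offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                matrix[image[y][x]][image[ny][nx]] += 1
    return matrix


def glcm_contrast(matrix):
    """Contrast: squared gray-level difference weighted by pair frequency."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total
```

In a full pipeline, texture statistics such as this contrast value would form the feature vector handed to a classifier; the choice of offset and number of gray levels here is arbitrary.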

Acknowledgment

We would like to thank Dr. Helio Pedrini and Dr. William Robson Schwartz, from IC/Unicamp and DCC/UFMG, respectively, for their collaboration during this work. We also thank Microsoft Research, FAPESP, CNPq, and CAPES for the financial support.

References

Jain, A. K. and Ross, A. (2008). Handbook of Biometrics, chapter Introduction to Biometrics, pages 1–22. Springer.

Pinto, A. (2013). A countermeasure method for video-based face spoofing attacks. Master's thesis, University of Campinas.

¹ This database will be made public and freely available. Users present in the database formally authorized the release of their data for scientific purposes.


Wear-out Analysis of Error Correction Techniques in Phase-Change Memory

Caio Hoffman∗, Rodolfo Jardim de Azevedo¹, Guido Costa Souza de Araujo¹

¹Advisors – Institute of Computing – University of Campinas (UNICAMP)
Campinas – SP – Brazil

{caio.hoffman, rodolfo, guido}@ic.unicamp.br

Abstract. Phase-Change Memory (PCM) is a new memory technology and a possible replacement for DRAM, whose scaling limitations require new lithography technologies. Despite being promising, PCM has limited endurance (its cells withstand roughly 10^8 bit-flips before failing), which prompted the adoption of Error Correction Techniques (ECTs). In this work, we shed new light on PCM lifetime analyses for the main techniques published in the literature. Our models enable not only an accurate analysis of PCM wear-out, but also a computation of the energy consumed by ECTs.

Phase-Change Memory (PCM) is a new memory technology considered a possible replacement for DRAM. PCM is byte-addressable, non-volatile, has multi-level cell (MLC) capability, and has demonstrated scalability beyond the 5 nm manufacturing process [International Technology Roadmap for Semiconductor 2011].

A concern about PCM that hinders its use as main memory is its low endurance. Currently a PCM cell withstands around 10^8 bit-flips (modifications of stored bit values) [Lee et al. 2009] before failing. In fact, considering the variability in the manufacturing process, PCM cells may withstand even fewer than 10^8 bit-flips, and a single cell failure may invalidate an entire PCM chip. Besides, PCM's endurance is far from that of DRAM (considered unlimited) [Schechter et al. 2010].

To extend the average lifetime of PCM chips, Error Correction Techniques (ECTs) have been used to ensure that a few cell failures will not cause a chip failure. Recently, some ECTs for PCM were proposed in the literature: DRM [Ipek et al. 2010], ECP [Schechter et al. 2010], SAFER [Seong et al. 2010], and FREE-p [Yoon et al. 2011]. To correct cell failures, ECP and SAFER leverage the features of resistive memories, whereas DRM and FREE-p use Error-Correcting Codes (ECCs).

Unfortunately, the works that propose those ECTs make simplifying assumptions that overlook important characteristics of memory writes. First, they assume a Bit-Flip Probability (BFP) of 50%, i.e., on a write to a PCM memory block, every bit flips with a 50% probability. Second, they assume that the bit-flip behavior in data bits and code bits is identical. As we show in this paper, those assumptions can mislead the evaluation of PCM's lifetime while not enabling an accurate analysis of energy consumption.
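To see why the assumed BFP matters, consider a back-of-the-envelope estimate: if a cell tolerates a fixed number of bit-flips and flips independently with probability p on each write, it survives about endurance / p writes on average. The sketch below is an illustrative simplification, not the model developed in this work. The function name and the independence assumption are made up for this example; only the 10^8 endurance figure comes from the text.

```python
CELL_ENDURANCE = 10**8  # bit-flips a PCM cell withstands (figure quoted in the text)


def expected_writes_until_failure(bit_flip_probability, endurance=CELL_ENDURANCE):
    """Average number of block writes a single cell survives.

    Assumption (illustration only): the cell flips independently with the
    same probability on every write to its block.
    """
    if not 0.0 < bit_flip_probability <= 1.0:
        raise ValueError("bit_flip_probability must be in (0, 1]")
    return endurance / bit_flip_probability


for bfp in (0.5, 0.25, 0.125):
    print(f"BFP={bfp}: ~{expected_writes_until_failure(bfp):.1e} writes")
```

Under this simplification, halving the BFP doubles the number of writes a cell survives, so a lifetime estimate made at a fixed 50% BFP can be far off whenever the actual flip rate differs.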

Moreover, Schechter et al. argue that ECCs speed up cell wear-out because any modification to data bits requires a rewrite of code bits (those used to correct errors in the data bits) [Schechter et al. 2010]. However, the authors do not confirm their hypothesis with experimental results or theoretical analyses.

∗ We thank CNPq, CAPES, and FAPESP (2011/05266-3).

In this work, we carefully evaluate those hypotheses by mathematically modeling the bit-flip behavior in DRM, SECDED, ECP, SAFER, and FREE-p. Specifically, our models consider that data and code bits have different bit-flip frequencies. This approach not only enables a more accurate evaluation of PCM's endurance for each ECT, but also enables computing PCM's write energy at a finer granularity. This kind of analysis and comparison had not been done in the original ECT research.
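One way to see why data and code bits can flip at different rates is to consider a single parity bit computed as the XOR of k data bits. If each data bit flips independently with probability p, the parity bit flips when an odd number of them flip, which happens with probability (1 - (1 - 2p)^k) / 2. The sketch below evaluates this standard identity. It is an illustrative assumption about one simple code bit, not one of the models proposed in this work, and the function name is hypothetical.

```python
def parity_flip_probability(data_bfp, covered_bits):
    """Probability that a parity bit over `covered_bits` data bits flips.

    Assumption (illustration only): each covered data bit flips
    independently with probability `data_bfp`; the parity bit flips
    exactly when an odd number of covered bits flip.
    """
    if not 0.0 <= data_bfp <= 1.0:
        raise ValueError("data_bfp must be in [0, 1]")
    if covered_bits < 1:
        raise ValueError("covered_bits must be at least 1")
    return (1.0 - (1.0 - 2.0 * data_bfp) ** covered_bits) / 2.0


for bfp in (0.5, 0.1, 0.01):
    print(f"data BFP={bfp}: parity BFP={parity_flip_probability(bfp, 32):.4f}")
```

At p = 0.5 the parity bit also flips with probability 0.5, so the two rates coincide. For 0 < p < 0.5 and k > 1, the parity bit flips more often than any single data bit it covers, which is consistent with the concern that code bits may wear out faster than data bits.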

As a result, we introduce a more accurate and fine-grained approach to analyze the impact of ECTs on PCM's lifetime. Our approach also enables a meaningful computation of PCM write energy and its trade-off with memory endurance. This kind of analysis was not possible with previous ECT models, yet it is critical for modern computer systems. We evaluated five state-of-the-art ECTs using our approach and compared our results to those in the literature. Our results extend and shed light on previous works that used simplifying assumptions, and they support the argument that ECC-based techniques speed up the wear-out of PCM.

References

International Technology Roadmap for Semiconductor (2011). Process integration, devices, and structures - 2011 Edition.

Ipek, E., Condit, J., Nightingale, E. B., Burger, D., and Moscibroda, T. (2010). Dynamically replicated memory: Building reliable systems from nanoscale resistive memories. SIGARCH Comput. Archit. News, 38(1):3–14.

Lee, B. C., Ipek, E., Mutlu, O., and Burger, D. (2009). Architecting phase change memory as a scalable dram alternative. SIGARCH Comput. Archit. News, 37:2–13.

Schechter, S., Loh, G. H., Straus, K., and Burger, D. (2010). Use ecp, not ecc, for hard failures in resistive memories. SIGARCH Comput. Archit. News, 38:141–152.

Seong, N. H., Woo, D. H., Srinivasan, V., Rivers, J. A., and Lee, H.-H. S. (2010). Safer: Stuck-at-fault error recovery for memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 115–124, Washington, DC, USA. IEEE Computer Society.

Yoon, D. H., Muralimanohar, N., Chang, J., Ranganathan, P., Jouppi, N., and Erez, M. (2011). Free-p: Protecting non-volatile memory against both hard and soft errors. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 466–477.


3. WTD Program

“The whole of science is nothing more than a refinement of everyday thinking.”

Albert Einstein


In this chapter, we present the program of the IX Workshop on Theses, Dissertations, and Undergraduate Research Projects (WTD) of the Institute of Computing (IC) at UNICAMP.

The opening talk, entitled “Análise de Grandes Volumes de Dados Bibliográficos” (Analysis of Large Volumes of Bibliographic Data), was given by Professor Luciano Antonio Digiampietri (USP). Professor Jorge Stolfi (IC-UNICAMP) taught the short course “Fundamentos de Estatística para Computação” (Fundamentals of Statistics for Computing). Professor Claudia Bauzer Medeiros (IC-UNICAMP) taught the short course “Como Arruinar uma Apresentação Oral” (How to Ruin an Oral Presentation).

The oral presentations were held in two rooms (351 and 352), and the posters were displayed in the lobby (Tables 3.1 to 3.4). The works were evaluated by several professors and by IC doctoral students, who also served as session chairs.

The IX WTD had more than 100 participants: 31 presented their work orally, 12 presented posters, more than 100 attended as registered audience members, and 19 served on the organizing committee.

The talk and the short courses were streamed online through the event website, which was redesigned this year. Certificates were produced in electronic format and sent directly to the participants' e-mail addresses. Another feature of the event was the participation of students as evaluators and session chairs.

Figure 3.1 shows the distribution of the works by research area: 58% were in Information Systems, 33% in Computer Systems, 7% in Theory of Computing, and 2% in Programming Systems. Of the works presented, 58% are doctoral, 30% are master's, and 12% are by undergraduate research students (Figure 3.2). Figure 3.3 presents a tag cloud of the keywords used in the works.


Oral Presentations – Room 353 (IC3) – August 11, 2014

Start | Title | Level | Student | Advisor | Area | Chair
10:00 | A Multiscale and Multi-Perturbation Blind Forensic Technique For Median Detecting | D | Anselmo Castelo Branco Ferreira | Prof. Anderson de Rezende Rocha | S.C. | Vanessa Maike
10:30 | Modelo Ativo de Aparência Aplicado ao Reconhecimento de Emoções por Expressão Facial Parcialmente Oculta | IC | Flavio Altinier Maximiano da Silva | Prof. Helio Pedrini | S.C. |
11:00 | Preempção de Tarefas MapReduce via Checkpointing | M | Augusto Rodrigues de Souza | Prof. Islene Garcia | S.C. |
11:30 | Ferramentas para simulação numérica na nuvem | M | Renan Monteiro Pinto Neto | Prof. Juliana Freitag Borin | S.C. |
12:00 | User Association and Load Balancing in HetNets | M | Alexandre Toshio Hirata | Prof. Juliana Freitag Borin | S.C. |
12:30 | Lunch
14:00 | Leitura Automatizada de Medidores de Consumo de Energia Utilizando Veículos Aéreos Não Tripulados | M | Jose Rodrigues Torres Neto | Prof. Leandro Aparecido Villas | S.C. | Fabio F.
14:30 | Decreasing Greenhouse Emissions Through an Intelligent Traffic Information System Based on Inter-Vehicle Communication | M | Allan Mariano de Souza | Prof. Leandro Aparecido Villas | S.C. |
15:00 | Processors Power Analysis in FPGA platforms | IC | Jessica Tavares Heffernan | Prof. Rodolfo Azevedo | S.C. |
15:30 | Coffee Break – Poster Presentations
16:00 | Optimizing Simulation in Multiprocessor Platforms using Dynamic-Compiled Simulation | D | Maxiwell Salvador Garcia | Prof. Sandro Rigo | S.C. | Atílio Gomes Luiz
16:30 | Uma Solução Ciente do Consumo de Energia para os Problemas de Localização 3D e Sincronização em RSSFs | M | Cristiano Borges Cardoso | Prof. Leandro Aparecido Villas | S.P. |
17:00 | Coloração de Arestas Semiforte de Grafos Split | M | Aloísio de Menezes Vilas-Boas | Prof. Celia Picinin de Mello | TE |
17:30 | O problema da partição em cliques dominantes | M | Henrique Vieira e Sousa | Prof. Christiane Neme Campos | TE |
18:00 | An Exact Algorithm for the Discrete Chromatic Art Gallery Problem | M | Maurício Jose de Oliveira Zambon | Prof. Pedro Jussieu de Rezende | TE |

Table 3.1: Program of the IX WTD, room 353, August 11, 2014


Oral Presentations - Room 352 (IC3) - August 11, 2014

Start | Title | Level | Student | Advisor | Area | Chair
14:00 | A Fault-Tolerant Software Product Line for Data Collection using Mobile Devices and Cloud Infrastructure | M | Gustavo Mitsuyuki Waku | Prof. Cecília Mary Fischer Rubira | S.I. | Ewerton Almeida Silva
14:30 | Uma Solução de Linha de Produtos de Software Baseada em Componentes e Aspectos para o Domínio de e-Commerce | M | Raphael Porreca Azzolini | Prof. Cecília Mary Fischer Rubira | S.I. |
15:00 | An Architecture for Dynamic Self-Adaptation in Workflows | M | Sheila Venero Ferro | Prof. Cecília Mary Fischer Rubira | S.I. |
15:20 | Coffee Break – Poster Presentations
16:00 | Uma Solução para Monitoração de Serviços Web Baseada em Linhas de Produtos de Software | M | Romulo Jose Franco | Prof. Cecília Mary Fischer Rubira | S.I. | Heiko Hornung
16:30 | Information and Emotion Extraction in Portuguese from Twitter for Stock Market Prediction | D | Fernando J. V. da Silva | Prof. Ariadne M. B. R. Carvalho | S.I. |
17:00 | Automatização da geração de testes em Behavior Driven Development | M | Thaís Harumi Ussami | Prof. Eliane Martins | S.I. |
17:30 | Uma abordagem caixa-branca para teste de transformação de modelos | D | Erika Regina Campos de Almeida | Prof. Eliane Martins | S.I. |
18:00 | Geração semiautomática de Máquinas Finitas de Estados Estendidas a partir de um documento padrão da área espacial | D | Juliana Galvani Greghi | Prof. Eliane Martins and Prof. Ariadne M. B. R. Carvalho | S.I. |



Oral Presentations - Room 353 (IC3) - August 12, 2014

Start | Title | Level | Student | Advisor | Area | Chair
13:15 | Superpixel-based interactive classification of very high resolution images | M | John Edgar Vargas Munoz | Prof. Alexandre Falcao | S.I. | Allan da Silva Pinto
13:45 | A Comparative Analysis between Object Atlas and Object Cloud Models for Medical Image Segmentation | M | Renzo Phellan | Prof. Alexandre Falcao | S.I. |
14:15 | Reconstrução de Filogenia para Imagens e Vídeos | D | Filipe de Oliveira Costa | Prof. Anderson de Rezende Rocha | S.I. |
14:45 | Multiple Parenting Phylogeny | M | Alberto Arruda de Oliveira | Prof. Anderson de Rezende Rocha | S.I. |
15:15 | Coffee Break
15:30 | Diabetic Retinopathy Image Quality Assessment, Detection, Screening and Referral | D | Jose Ramon Trindade Pires | Prof. Anderson de Rezende Rocha | S.I. | Juliana Galvani Greghi
16:00 | Facilitando a construção social de significado em sistemas colaborativos de aprendizagem | M | Fabrício Matheus Goncalves | Prof. M. Cecília C. Baranauskas | S.I. |
16:30 | Interpretation of Construction Patterns for Biodiversity Spreadsheets | D | Ivelize Rocha Bernardo | Prof. Andre Santanche | S.I. |
17:00 | Coffee Break
17:30 | Processo Situado e Participativo para o Design de Aplicações de TVDi: Uma Abordagem Técnico-Social | D | Samuel B. Buchdid | Prof. M. Cecília C. Baranauskas | S.I. | Ewerton Almeida Silva
18:00 | Complex pattern detection and specification from multiscale environmental variables for biodiversity applications | D | Jacqueline Midlej do Espírito Santo | Prof. Claudia Bauzer Medeiros | S.I. |
18:30 | Supporting the study of correlations between time series via semantic annotations | M | Lucas Oliveira Batista | Prof. Claudia Bauzer Medeiros | S.I. |



Poster Presentations - IC3 Hall - August 11, 2014

Start | Title | Level | Student | Advisor | Area
15:30 | Combining active semi-supervised learning through optimum-path forest | D | Priscila Tiemi Maeda Saito | Prof. Alexandre Xavier Falcao | S.I.
 | Video-Based Face Spoofing Detection through Visual Rhythm Analysis | D | Allan da Silva Pinto | Prof. Anderson de Rezende Rocha | S.I.
 | A Model-Driven Infrastructure for Developing Dynamic Software Product Line | M | Junior Cupe Casquina | Prof. Cecília Mary Fischer Rubira | S.I.
 | Implementação de um simulador de plataforma ARMv7 utilizando ArchC | IC | Gabriel Krisman Bertazi | Prof. Edson Borin | S.C.
 | Análise de Desempenho de Técnicas para Migração de Dados em Arquiteturas NUMA | IC | Gilvan dos Santos Vieira | |
 | Oportunidades para economia de energia em dispositivos móveis | IC | Joao Henrique Stange Hoffmam | |
 | Aplicação do Critério Análise de Mutantes para Avaliação dos Casos de Testes Gerados a partir de Modelos de Estados | M | Wallace Felipe Francisco Cardoso | Prof. Eliane Martins | S.I.
 | Wear-out Analysis of Error Correction Techniques in Phase-Change Memory | D | Caio Hoffman | Prof. Guido Costa Souza de Araujo and Prof. Mario Lucio Cortes | S.C.
 | Situated and Participatory Process for Designing iDTV Applications: A Socio-Technical Approach | D | Samuel B. Buchdid | Prof. M. Cecília C. Baranauskas | S.I.
 | Explorando as Interfaces Tangíveis no Tratamento de Crianças Autistas | M | Kim Pontes Braga | Prof. M. Cecília C. Baranauskas | S.I.
 | Desenvolvimento de um ambiente de baixo custo para o uso de interfaces tangíveis no ensino de crianças da educação fundamental | M | Marleny Luque Carbajal | Prof. M. Cecília C. Baranauskas | S.I.
 | Identificação de Programas Maliciosos por Meio da Monitoração de Atividades Suspeitas no Sistema Operacional | IC | Marcus Felipe Botacin | Prof. Paulo Lício de Geus | S.C.

Table 3.4: Program of the IX WTD – Posters


Figure 3.1: Distribution of the research areas of the presented works

Figure 3.2: Relative number of undergraduate research, master's, and doctoral students

Figure 3.3: Tag cloud of the keywords
