
The Role and Relevance of Experimentation in Informatics ∗

Carlos Andujar† Viola Schiaffonati‡ Fabio A. Schreiber§ Letizia Tanca¶

Matti Tedre‖ Kees van Hee∗∗ Jan van Leeuwen††

Summary

Informatics is a relatively young field within science and engineering. Its research and development methodologies build on the scientific and design methodologies of the classical areas, often with new elements added to them. We take an in-depth look at one of the less well-understood methodologies in Informatics, namely experimentation.

What does it mean to do experiments in Informatics? Does it make sense to ‘import’ traditional principles of experimentation from classical disciplines into the field of computing and information processing? How should experiments be documented? These are some of the questions that are treated.

The report argues for the key role of empirical research and experimentation in contemporary Informatics. Many IT systems, large and small, can only be designed sensibly with the help of experiments. We recommend that professionals and students alike are well-educated in the principles of sound experimentation in Informatics. We also recommend that experimentation protocols are used and standardized as part of the experimental method in Informatics.

∗ This paper is based on the contributions to the Workshop on the Role and Relevance of Experimentation in Informatics, held prior to the 8th European Computer Science Summit (ECSS 2012) of Informatics Europe, November 19th 2012, Barcelona.
† Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona (ES), email: [email protected].
‡ Corresponding author, Dip. Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano (I), email: [email protected].
§ Dip. Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano (I), email: [email protected].
¶ Dip. Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano (I), email: [email protected].
‖ Dept. of Computer and Systems Sciences, Stockholm University, Stockholm (S), email: [email protected].
∗∗ Dept. of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven (NL), email: [email protected].
†† Dept. of Information and Computing Sciences, Utrecht University, Utrecht (NL), email: [email protected].

Contents

1. Introduction

2. Role and Relevance of Experimentation in Informatics

3. Competing Viewpoints on Experimentation in Computing

4. Beyond Benchmarking: Statistically Sound Experimentation

5. Role of User Studies in Computer Graphics and Virtual Reality

6. Experiments in Computer Science: Are Traditional Experimental Principles Enough?

7. Conclusions

8. References

1. INTRODUCTION
(Letizia Tanca and Jan van Leeuwen)

The ability to design and support experiments is a vital but still little appreciated part of Computer science. (K. Keahey et al., 2011)

Experimentation in Informatics? Does it apply, as in other sciences? Do we understand how experimental methods work in the field? How should they be documented? Despite the wide use of empirical studies and experiments, computer scientists do not seem to agree on the role and relevance of experiments in their field.

In this paper we argue that experimentation should be better understood and appreciated as a key methodology in Informatics. The basic ideas of the ‘experimental method’ should be included in Informatics curricula, as in all other science curricula. We also argue that experimentation requires detailed protocolisation. The conclusions are based on the results of the Workshop on ‘Experimentation’ held prior to the ECSS’12 summit meeting of Informatics Europe in Barcelona (2012).

1.1 A View of Informatics

Informatics emerged in the 20th century as the field that underlies all uses of computing and information technology. Owing to its many origins, Informatics is often described both as a science and as an engineering discipline. Due to its far-reaching implications for the understanding of natural phenomena and systems, Informatics is sometimes even called a natural science. Due to its emphasis on the construction of computational artifacts (e.g. software), Informatics is sometimes called a technical design science as well. As a consequence, the scientific methodologies of Informatics have varied origins as well and are not always uniquely defined or understood.

Empirical methods and experimentation have been part of Informatics for many years. In fact, already in 1967, one of the founding fathers of the discipline, George Forsythe, pointed out that ‘Computer Science’ is to be valued both as a theoretical and as an experimental science. Using the analogy with physics, he wrote:

Whether one is primarily interested in understanding or in design, the conclusion is that [. . . ] good experimental work in computer science must be rated very high indeed, ([21], p. 4)

but, with great foresight, he also noted that:

[. . . ] computer science is in part a young deductive science, in part a young experimental science, and in part a new field of engineering design. Because of this broad sweep of activity, the nature of computer science is usually misunderstood and deprecated by those with an acquaintance, however close, to only one aspect of computing. ([21], p. 4)

In a nutshell this explains the complexity of the methodological landscape in the field of Informatics. Many different concerns developed simultaneously and, inasmuch as the development of the field has been influenced by sociological factors [47], the interest in experimentation has varied greatly over the years and per sub-discipline [30]. With experimentation currently gaining recognition in the field at large, the principles of experimentation should be well understood in the discipline.

In this paper we highlight both the philosophy and some concrete ramifications of the experimental method in Computer Science. The paper leads to several conclusions which will become clear in the subsequent sections.

Our first, main thesis is:

A: Experimentation is a key methodology in Informatics, and its principles should be fully recognized in the field’s philosophy. Informatics curricula should give an adequate background in the understanding and practical use of experimental methods.

In particular, Sections 2, 3, and 6 will provide further background to thesis A.

1.2 Experimentation

Experimentation is well understood in the philosophy of science. In his classic on The principles of science, W.S. Jevons [26] already noted in the nineteenth century that (observations and) experiments are the primary source of experience (and thus knowledge, in his view). He proceeded to explain how experiments are intended to give insight into the causal effects of different (and ideally independent) parameters on a phenomenon or system, when they are varied in a controlled setting. They reveal unknown properties, confirm or refute hypotheses, help in debugging theories or prototypes, and set the stage for alternate explanations or designs.

One can only guess why experimentation has developed so unevenly in the various subareas of Informatics (or: Computer Science). There is no shortage of experimental methods in the field. Some were developed to great sophistication, like visualisation, case studies, simulation and parameter tuning. Sometimes statistical methods are employed, as e.g. in the analysis of optimization methods [7]. In areas like Information Systems there is also a lot of experience with experimentation, e.g. on the interplay between human behaviour and design (see e.g. [29]). An often-confirmed experience is that good experimentation is hard.

It seems that the more formalized and theory-driven subareas of Informatics have long suppressed the possible role of experimentation. Milner [35] made the case that ‘theory’ and ‘experiment’ can well go together. For example, in a classical field like algorithm design it has long been common practice to assess algorithms by their asymptotic properties only. However, their actual behaviour in practice is now increasingly being studied as well, even though the models for the latter are typically still theoretical. With Informatics becoming ever more interactive, large-scale, and distributed, experimentation is increasingly called for in all areas, to understand the unknown behaviours of a design or system, in any realistic setting.

The essential role of experimentation in Informatics was explicated before, notably by Denning [12], and later by Tichy [48] when he wrote:

To understand the nature of information processes, computer scientists must observe phenomena, formulate explanations and theories, and test them. ([48], p. 33)

Interestingly, in those areas where experimentation did not come naturally, the approach to science through experimentation has led to a kind of rediscovery of the experimental methodology. It has occasionally led to separate fields like experimental algorithmics [33] and empirical software engineering or software engineering experimentation [8, 27, 52], which is not a simple way to get experimentation integrated in a field. In other cases, the ‘experimental methodology’ is explicitly brought to attention as a possible route to publishable results provided suitable standards are followed, as done e.g. by the IEEE journal Pervasive Computing [31].

1.3 Protocols

Despite this positive, albeit gradual, development we note that there seems to be hardly any effort in Informatics to standardize its experimental methodologies. Fletcher [19] observed already in 1995 that:

When we do have occasion to do experiments, we should adopt the same rigorous protocols as practiced in the natural and social sciences. ([19], p. 163)

Fletcher’s advice is commendable, but does not seem to have been implemented widely in the discipline (with notable exceptions again, e.g. in the area of Information Systems). Good experiments require sound designs, and their outcomes must be amenable to validation. This is why many areas require explicit experimentation protocols. We believe that Informatics should do the same.

Experimentation protocols depend on the theory, behaviour or properties one is trying to assess by experiment. A protocol typically documents the approach, the parameters and hypotheses involved, the ways of measurement and analysis (with the reasoning behind it), and the way outcomes are recorded and stored for later inspection. As experiments must be sound and verifiable, experimentation protocols are indispensable. Clearly, different subareas of Informatics may require different types of protocols.

Our second main thesis is therefore:

B: Experimentation in Informatics requires sound and verifiable experimentation protocols. In the case of multi-disciplinary research, these protocols should follow standard practices that are recognizable across discipline boundaries.

We will see examples of thesis B in Sections 4 and 5 of this paper. Section 6 reviews where we stand overall.

1.4 Workshop

The discussion of experimentation in Informatics requires a good understanding of the issues. With this in mind, we organized a small workshop on the Role and Relevance of Experimentation in Computer Science and a subsequent panel discussion on the same issue within the 8th European Computer Science Summit (ECSS’12). This paper contains the written text of the contributions.

Among the questions that were discussed were the following:

- What does it mean to do experiments in Informatics? Is it possible to ask this question in general terms, or does it have to be asked separately for each subfield?

- Is it possible to single out some common features that characterise ‘good experiments’ in Informatics?

- Does it make sense to ‘import’ traditional experimental principles from the classical scientific disciplines into Informatics?

We believe that the discussion on experiments in Informatics (or: computer science and engineering) can contribute to a better understanding of the scientific approach in the discipline. We also believe that it may contribute to the debate on the relationship between Informatics and science.

1.5 This Paper

The paper is organized as follows. In Section 2 (by V. Schiaffonati) the philosophical premises of experimentation in Informatics are presented. In Section 3 (by M. Tedre) the analysis is continued with an assessment of the different types of experiments that computer scientists have been using so far. Sections 4 (by K. van Hee) and 5 (by C. Andujar) show how experiments and experimentation protocols can be designed to great advantage in algorithmic and user studies, respectively. Finally, Section 6 (by F. Schreiber) assesses whether the methodological landscape, and experimentation in particular, is really understood sufficiently across the discipline. The section also includes a summary of the comments and suggestions from the discussions at the Workshop and in the panel session at ECSS’12. In Section 7 we offer some final conclusions.

2. ROLE AND RELEVANCE OF EXPERIMENTATION IN INFORMATICS

(Viola Schiaffonati)

Experiments play a fundamental role in empirical sciences, both in acquiring new knowledge and in testing it, and modern science has its roots in the very idea of experiment, as devised by Galileo Galilei during the so-called Scientific Revolution in the seventeenth century. One of the main achievements of this revolution has been to recognize that pure observations are not enough for the comprehension of the natural world, and that to properly ‘interrogate’ Nature, interference with the natural world itself by means of experiments is necessary. An experimental procedure thus aims at controlling some of the characteristics of a phenomenon under investigation with the purpose of testing the behavior of the same phenomenon under some controlled circumstances. Accordingly, an experiment can be defined as a controlled experience, namely as a set of observations and actions, performed in a controlled context, to support a given hypothesis.

A trend has recently emerged toward making the experimental scientific method take center stage in Informatics as well. Given that Informatics possesses an empirical component, even without considering here the debate about the prevalence of the empirical versus the theoretical one [46], it is natural to ask about the role and relevance of experimentation in it. Thus the experimental scientific method has recently received attention in computer science and engineering both at a general level [23, 36] and in specific disciplines (see for example the case of empirical software engineering [27]). Experiments are recognized as relevant in this field as they help build a reliable base of knowledge, lead to useful and unexpected insights, and accelerate progress [48].

Unfortunately computer scientists do not seem to agree on how experimental methods are supposed to impact their theory and practice. This section is a first attempt to start reflecting on the role of experiments in Informatics with the help of some tools developed in the philosophy of science and the philosophy of technology. The peculiar position of Informatics in between science and engineering should be taken as a starting point. Therefore, if on the one hand inspiration can be drawn from how experiments are conducted in pure science, on the other hand this perspective should be enlarged to include the analysis of how experiments are conducted in engineering disciplines, a topic so far almost completely neglected by the traditional philosophy of science.

2.1 Taking inspiration from science

It is commonly recognized that experimental methodologies in Informatics have not yet reached the level of maturity of other traditional scientific disciplines. A possible way to start investigating the role of experiments in this field is to analyze how experiments are performed in traditional science in order to evaluate whether some useful lessons can be learned. This is a way to obtain, at least, a better terminological and conceptual clarity in using terms that are historically and conceptually loaded. For example, in Informatics there still exists confusion between generic empirical methods (only requiring to be based on the aggregation of naturally occurring data) and experimental methods (that must adhere to strict rules as assessed in the course of the history of experimental science). Equally misleading is the idea that the replication of experiments is sufficient to guarantee the requirements of a serious experimental approach.

Generally speaking, an experiment is a controlled experience, a set of observations and actions, performed in a controlled context, to test a given hypothesis. Accordingly, the phenomenon under investigation must be treated as an isolated object; it is assumed that other factors, which are not under investigation, do not influence the investigated object. Despite the controversies about the scientific experimental method and its role, experiments possess some general features that are universally acknowledged and often are not even made explicit. These are: comparison, repeatability and reproducibility, justification and explanation.

• Comparison. Comparison means knowing what has already been done in the past, both to avoid repeating uninteresting experiments and to get suggestions on what the interesting hypotheses could be.

• Reproducibility and repeatability. These features are related to the very general idea that scientific results should be severely criticized in order to be confirmed. Reproducibility, in particular, is the possibility for independent researchers to verify the results of a given experiment by repeating it with the same initial conditions. Repeatability, instead, is the property of an experiment that yields the same outcome from a number of trials, performed at different times and in different places.

• Justification and explanation. These features deal with the possibility of drawing well-justified conclusions based on the information collected during an experiment. It is not sufficient to have as many precise data as possible; it is also necessary to look for an explanation of them, namely all the experimental data should be interpreted in order to derive the correct implications leading to the conclusions.

Can these general principles be applied in a computer science and engineering experimental context? And, if the answer is positive, is there any utility in doing so? Trying to answer these questions within specific areas of research offers some preliminary, but promising, answers. The case of autonomous mobile robotics, involving robots with the ability to maintain a sense of position and to navigate without human intervention, offers some useful insights. In response to a quite recent interest of this field in experimental methodologies, some research projects funded by the European Commission, journal special issues, series of workshops, and individual papers have been produced with the aim of assessing rigorous evaluation methods for empirical results. In particular, [1] and [2] have adapted the three aforementioned principles to the case of mobile autonomous robotics. They lead to the conclusion that, even if some works in this field are addressing these experimental principles in a more and more convincing way, thus bringing autonomous mobile robotics closer to the standards of rigor of more mature scientific disciplines, its engineering component cannot be put aside and needs to be included in the discussion. Thus, importing these principles from physics to autonomous robotics is only the first step of a serious investigation on the relevance and role of experiments in a computer engineering field.

2.2 From science to engineering

Engineering can be defined as an activity that produces technology, and technology is a practice focused on the creation of artifacts and artifact-based services [22]. Hence, to move from a purely scientific context to an engineering one means not only to address different objects (natural objects versus technical artifacts), but also to consider the different purposes for which experiments are performed. If in science the goal of experimentation is understanding a natural phenomenon (or a set of phenomena), in engineering the goal is testing an artifact.

Technical artifacts are material objects deliberately produced by humans in order to fulfill some practical functions. They can be defined in accordance with the three following questions.

• What is the technical artifact for? Namely, its technical function.

• What does it consist of? Namely, its physical composition.

• How must it be used? Namely, its instructions for use.

Informatics products are technical artifacts, as they are physical objects with a technical function and a use plan deliberately designed and made by human beings [50].

The notion of technical artifact plays an important role in analyzing the nature of experiments in engineering disciplines: experiments evaluate technical artifacts according to whether and to what extent the function for which they have been designed and built is fulfilled. This is the reason why normative claims are introduced in engineering experiments. An artifact, such as an airplane, can be ‘good’ or ‘bad’ (with respect to a given reference function), whereas a natural object, such as an electron, whether existing in nature or produced in a laboratory, can be neither ‘good’ nor ‘bad’; in fact it is analyzed without any reference to its function and use plan and so is free from any normative constraints regarding its functioning.¹

Is the reference to the notion of technical artifact enough to analyze the role and relevance of experimentation in Informatics? Of course not. As the inspiration from science represents just one facet of this analysis, the attention to the way technical artifacts are evaluated in engineering experiments is also just another partial facet. While discussing experiments and their role, the dual nature of Informatics at the intersection between science and engineering strongly emerges. Experiments are performed to test how well an artifact works with respect to a reference model and a metric (think for example of a robot or a program); but, at the same time, experiments are performed to understand how complex artifacts, whose behavior is hardly predictable, work and interact with the environment (to a different degree, depending on whether one thinks of the robot or of the program).

¹ It is worth noticing, however, that although an electron cannot be ‘good’ or ‘bad’ per se, a theory pertaining to electrons can be ‘good’ or ‘bad’.

Surely, a more rigorous experimental approach in computer science and engineering is an important first step in the methodological maturation of Informatics. Moreover, a deeper and wider analysis is required to understand the peculiar role of experimentation in this discipline. Informatics is composed of very heterogeneous subfields, and whether it is possible to single out some common features characterizing “good experiments” across these subfields is still under discussion.

An interesting side effect of this discussion is to promote a reflection on the disciplinary status of Informatics based on its methodological stances, and not only on the nature of its objects. This, again, evidences the peculiarity of Informatics also from a methodological point of view. As said, it makes sense to “import” traditional experimental principles (comparison, reproducibility and repeatability, justification and explanation) from traditional scientific disciplines into computer science and engineering, thus promoting a more rigorous approach to experimentation. This would avoid all those cases in which published results are considered as validated just by a single experiment that is impossible to reproduce and/or repeat because the experimental conditions are only vaguely described [6]. But it is worth remembering that these general principles are valid for disciplines (such as physics and biology) that aim at understanding and explaining natural phenomena, whereas computer science and engineering realize artifacts. This awareness can represent a first step in the direction of the development of a philosophy of engineering that should have, among its goals, the analysis of the features that characterize experiments in engineering disciplines.

3. COMPETING VIEWPOINTS ON EXPERIMENTATION IN COMPUTING

(Matti Tedre)

The discipline of computing was born at the conjunction of a number of research traditions that came together around the 1930s and the 1940s. The first research tradition came from mathematical logic and from the attempts to formalize human patterns of rational thought; its pioneers were mathematicians and logicians like George Boole, Gottlob Frege, and Alan Turing [11]. The second tradition was that of electrical engineering; its pioneers included engineers like Konrad Zuse, Claude Shannon, and John Atanasoff [51]. The third tradition came from the mechanization of numerical computation for the purposes of science and applied mathematics, with pioneers like Charles Babbage, Herman Hollerith, and Vannevar Bush [10].

The research traditions that formed modern computing continued to flourish within the discipline as three intertwined, yet partially autonomous lines of research, each with its own research agenda [14]. The theoretical tradition developed an identity autonomous of mathematics, and it established the groundwork for the development of theory and practice in computing. The engineering tradition drove the unprecedented development of machinery (and later software). The scientific tradition intertwined with other disciplines, giving birth to countless variations of computational science as well as new insights into computing itself.

The rapid growth of knowledge in computing fields was fueled by the interplay of the three intertwined traditions. However, the three very different intellectual traditions also caused continuous friction within the discipline of computing. The lack of rigor in some branches of engineering work gave rise to criticism from the science and theory camps of computing. The abstract orientation of theoretical computer science was accused of alienation from the real problems in computing. The empirical side of computing was criticized for improperly presenting what computing really is about.

3.1 Experimental Computer Science Movement

Science has always been a central part of computing as a discipline. Although its roots are in office machinery, the modern (fully electronic, Turing-complete, digital) computer was born in universities, and the first modern computers were used for applied sciences and numerical calculation. The term ‘computer science’ was adopted into computing parlance in the late 1950s. As the discipline matured, the discussions about the disciplinary nature of computing frequently saw glimpses of experiment-based empirical research. But it was only at the turn of the 1980s that the role of experimentation in computing became a popular topic in the field’s disciplinary discussions.

‘Experimental computer science’ was brought to the limelight at the turn of the 1980s by a strong campaign for ‘rejuvenating experimental computer science’. The campaign was initiated by a report to the National Science Foundation [18] and the ACM Executive Committee’s position paper on the “crisis in experimental computer science” [32]. The initial papers were followed by a number of position papers about what experimental computer science actually is (see e.g. [13, 38]).

However, despite the attempts to clarify experimental computer science terminology, it was never clear what exactly was meant by ‘experimental’ in computer science. The practitioners and pioneers from different traditions understood the term ‘experimental’ very differently. One sense of the word refers to exploratory work on new, untested techniques or ideas—the thesaurus gives words like ‘trial’, ‘test’, and ‘pilot’ as synonyms of ‘experimental’. Another, more specialized, sense of the word refers to the use of experiments to test hypotheses. The original ‘rejuvenating’ report [18] teetered between the two meanings of the word, never defining what exactly was meant by ‘experimental’ computer science. Rooted in the different disciplinary mindsets of the three traditions of computing, the ‘rejuvenating’ report was followed by several decades of polemics where discussants talked about experimental computer science but meant different things.

3.2 Five Views on Experimentation

After the initial papers on experimental computer science, the topic became a popular subject for workshops, conferences, and journal articles. In those arenas experimentation terminology was used in various, often conflicting ways. One can easily find in the debates numerous implicit and explicit meanings of terms like ‘experiment’, ‘experimenting,’ ‘experimental,’ and ‘experimentation’. Of the various meanings of ‘experiment,’ five are relatively common and easily distinguishable: the demonstration experiment, the trial experiment, the field experiment, the comparison experiment, and the controlled experiment.

3.2.1 The Demonstration “Experiment”

The first common use for the term ‘experiment’ can be found in reports on new tools and techniques. In those texts, it is not known if a task can be automated efficiently, reliably, feasibly, or by meeting some other simple criterion. A demonstration of ‘experimental’ technology shows that it can indeed be done (e.g., [25]). Although this view was criticized already in some of the first texts on experimental computer science, it is still a relatively common use of the term—often defended and often criticized.

3.2.2 The Trial Experiment

The second common use of the term ‘experiment’ can be found in reports that evaluate the performance, usability, or other aspects of a system against some previously defined specifications or variables. In those texts, it is not known how well a newly developed system meets its requirement specifications or how well it performs. A trial experiment is set up to evaluate the system. Varieties of trial experiments include, for instance, emulation, benchmarking, and simulation [24].

3.2.3 The Field Experiment

The third common use of the term ‘experiment’ can be found in reports that evaluate systems in their intended use environment. In those texts, it is not known how well the system works in the full richness of the live environment. In a field experiment [39], or ‘in-situ’ experiment [24], the system’s qualities are tested in its intended sociotechnical context of use, and evaluated against a pre-defined set of criteria.

3.2.4 The Comparison Experiment

The fourth common use of the term ‘experiment’ can be found in reports that compare two or more competing solutions for the same problem. In those texts, it is not known whether the author’s proposed solution performs better than the previous solutions with some data set, parameters, and set of criteria. An experiment is set up to compare the solutions, and to show that the author’s solution is in some ways better than the other candidates [19]. Oftentimes objectivity is improved by not involving the author’s own solutions, and in many research fields the test data and parameters are standardized.

3.2.5 The Controlled Experiment

The fifth common use of the term ‘experiment’ can be found in reports that test models or hypotheses under a controlled environment, where the effects of extraneous and confounding variables can be controlled [17]. The controlled experiment comes in various types for various purposes, but often controlled experiments are used in situations where it is not known whether two or more variables are associated, or if one thing causes another. The controlled experiment is the approach of choice if the results should be generalizable.

4. BEYOND BENCHMARKING: STATISTICALLY SOUND EXPERIMENTATION

(Kees van Hee)

As noticed already, computer science has two founding fathers: mathematics and electrical engineering. Mathematicians live in a world of models they create themselves; therefore mathematics is a purely theoretical discipline without an empirical component. The empirical law of large numbers seems to be the only empirical phenomenon that is mentioned in mathematical statistics. This law was the motivation for the derivation of the theoretical law of large numbers. Electrical engineering is based on empirical phenomena, such as the empirical laws of electricity. But experiments play only a small role in the research domain of electrical engineering. Of course systems are built as a proof-of-concept, which is in fact an empirical existence proof. So the founding fathers of computer science had no experience and probably little interest in experimentation, which most likely caused the lack of an experimental component in computer science.

Today model-driven engineering is a successful approach in almost all engineering disciplines and in particular in software engineering. There are several reasons to build a mathematical model of a system, such as: (1) to document or explain a system, (2) to specify or construct a system, (3) to optimize or control a system and (4) to analyze or predict the behavior of a system. In all these cases the model should have so much similarity with the system that properties derived for the model also hold for the system, and vice versa. Although this sounds obvious, it is far from easy to establish such a similarity relationship. Since a system is a physical object, the only way to do this is by experimentation. So one has to make a bijection between structural elements of the model and those of the system. And one has to observe the behavior of the system and compare it with the modeled behavior. In most cases the behavior is an infinite set of sequences of events or activities. Hence it is often impossible to establish the similarity completely by experimentation. Then statistical methods should offer a solution. If we have enough evidence that a model is a good description of a system then one may extrapolate, by claiming that properties verified for the model will also hold for the system. This is in particular important if we want to analyze the behavior of a system under extreme circumstances that are difficult or impossible to realize in an experiment, such as the behavior of an aircraft during a crash.

In software engineering it was long believed that we would reach a stage where we should be able to build a model from requirements and afterwards the program code from the model, and that we should be able to verify formally that the model satisfies the requirements and that the program code conforms to the model. Today model checkers can deal with models of systems of a realistic size. So part of the dream has come true. However, we know that this is only a partial solution, since: (1) the program code is executed on a physical device, (2) requirements are never complete and often informal and (3) the model of a system is always a simplification in which we abstract from certain details. So experimentation will be necessary to establish the similarity between a system and its model. This form of experimentation is called software testing. Although it is essential from an engineering point of view, the topic is far from mainstream research in computer science.

Software testing is only one form of experimentation in computer science. There are other questions that can only be answered by experimentation, e.g.:

• effectiveness and efficiency of algorithms;

• quality of software engineering methods;

• usability of human-machine interfaces;

• visualization of very large data sets.

In the rest of this section we will focus on the first topic, the quality of algorithms. Of course there are simple algorithms that can be analyzed formally: for example, for a sorting algorithm we can prove that it delivers the right answer and we can compute its complexity. Note that complexity gives an answer for the worst case, while we often want an answer for something like the ‘average case’. There are many complex problems in practice for which no best solutions are known and where we develop heuristic algorithms. For example, for combinatorial problems like the traveling salesman or timetabling for schools, we only have algorithms for which we do not know if they always produce the best possible answer (effectiveness), and the efficiency may vary strongly over the problem instances. In principle an algorithm is a function that can be applied to an argument and it delivers a result. The argument is a structured data set, which we will call a model. (Note that such a data set represents a part of the real world.) The set of all possible models is the model class. So a model is an instance of a model class. These concepts may also be called problem instance and problem type. The effectiveness of an algorithm is informally defined as the ‘fraction’ of models where the algorithm gives the right answer, and the efficiency as the ‘average’ computing time it takes to produce the result. However, in most cases, the model class has infinitely many instances. So the ‘fraction’ and the ‘average’ are not even defined! If we have two or more algorithms for the same model class then we would like to compare them, and then we run into the same problem. This is similar to the problem in software testing, where we may have to test an infinite set of behaviours.

In order to solve these problems in practice, we resort to benchmarking. The idea is that some group of experts has chosen a finite sample from the model class and that all algorithms are applied to the models of this benchmark set. Although it is a feasible solution, it does not really answer the original questions. In order to do this in a more sophisticated way, we have to consider the model class. If we could define a probability distribution over this class, then we could speak about the probability of producing the right result and the mean computing time. So the solution to the problem is to find probability distributions for model classes!

This is what we will consider in the next subsection. We will restrict ourselves to the case where models are represented by graphs and we will show how we can define probability distributions on the class of graphs. We also will discuss how we could obtain this distribution in an empirical way. Finally we describe a case study of this approach.

4.1 Probabilistic model classes

In most cases an algorithm is applicable to an infinite model class. The ‘effectiveness’ should be the ‘fraction’ of models for which the algorithm gives the right answer and the efficiency the ‘average’ computing time. However, on an infinite set the ‘fraction’ and the ‘average’ are not defined. If we have a probability distribution on the model class, then the effectiveness and efficiency are at least defined. The question is then how to compute them. So there are four questions to be answered:

1. How to define a probability distribution for a model class?

2. How to sample models from a model class?

3. How to identify the parameters of a probability distribution from empirical data?

4. How to compute the quality measures if the probability distribution is known?

We start with the last question, since in most cases we really have to apply the algorithm to the models in order to determine the quality. However, this would require infinitely many runs. Since we have a probability distribution, we can approximate effectiveness and efficiency by computing them for a finite subset of the model class, say a subset with probability q, e.g. q = 0.99. This subset is a sample from the model class. The effectiveness can be computed for the sample, say p̂, which means that the algorithm gives the right result for a fraction p̂ of the sample. Then we know that the effectiveness p over the whole model class satisfies q·p̂ ≤ p ≤ q·p̂ + (1 − q), i.e. 0.99·p̂ ≤ p ≤ 0.99·p̂ + 0.01. For the computing time t we can use the law of large numbers and the central limit theorem to derive a confidence interval for t. Actually, the sample can be considered as a benchmark! However, in this case we know how much of the whole model class the benchmark covers, and we can generate a different benchmark for each experiment.
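
As an illustration of these estimates (a minimal Python sketch, not code from the paper), the bounds on the effectiveness and a confidence interval for the mean computing time can be derived from a generated sample as follows; the interface of `algorithm` and the structure of `sample` are hypothetical placeholders.

    import math
    import statistics

    def evaluate(algorithm, sample, q=0.99):
        """Estimate effectiveness and efficiency of `algorithm` on a sample
        that covers probability mass q of the model class.  `sample` is
        assumed to be a list of (model, expected_answer) pairs, and
        `algorithm(model)` is assumed to return (answer, elapsed_seconds)."""
        correct, times = 0, []
        for model, expected in sample:
            answer, elapsed = algorithm(model)
            correct += (answer == expected)
            times.append(elapsed)

        p_hat = correct / len(sample)            # effectiveness on the sample
        # effectiveness p over the whole class: q*p_hat <= p <= q*p_hat + (1 - q)
        effectiveness_bounds = (q * p_hat, q * p_hat + (1.0 - q))

        # approximate 95% confidence interval for the mean computing time (CLT)
        mean_t = statistics.mean(times)
        sem = statistics.stdev(times) / math.sqrt(len(times))
        efficiency_interval = (mean_t - 1.96 * sem, mean_t + 1.96 * sem)
        return effectiveness_bounds, efficiency_interval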

The first and second question can be answered simultaneously. Note that an infinite model class, where each model has the same probability, does not exist (they would all have probability zero while their sum should be one). Observe that systems in nature and man-made systems do not come out of the blue, but are grown or constructed in a systematic way. For instance, living tissues are grown by cell division and cell differentiation. Software systems are normally built in a top-down or bottom-up process. In the first case a simple program is stepwise refined to a more complex one and in the second case several small components are glued together to construct a bigger one. Each intermediate step is also a system of the class. We will use this idea to construct model classes together with a probability distribution over them.

From now on we will assume that each model is represented as a graph. In computer science graphs are probably the most used data structure (see [5]). A graph can be defined as a quintuple (Node, Edge, σ, τ, λ) where Node is a set of nodes, Edge a set of edges (connecting the nodes), σ : Edge → Node is called the source function, τ : Edge → Node is called the target function and λ is a labeling function with Edge as domain. We will define graphs by graph transformations that can be applied to a given graph in the class in order to obtain a new one in the class. Graph transformations are defined by production rules. As in [5] we follow [15] in a simplified way. A production rule is a triple (L, K, R) where L, K and R are all graphs and K is a common subgraph of L and R. (Often K has only nodes, i.e. its set of edges is empty.) The first step in the application of a production rule is that in a graph G the graph L is detected, i.e. a subgraph of G is found that is isomorphic with L. The second step is that the graph G is transformed into G′ = G \ (L \ K) + (R \ K). The graph K is the interface between G and L and R, i.e. G \ L and R \ K are only connected via K. (Here we use \ for graph subtraction and + for graph addition, i.e. of nodes and edges.) There are some “sanity” requirements for these operations such that all intermediate results are proper graphs. A graph transformation system is a set of production rules, and a graph transformation system together with an initial graph is called a graph grammar. With a graph grammar we may define a class of graphs: all graphs that can be derived by applying the production rules in a finite number of steps. The production rules may enforce that we only produce graphs of a certain type.

We may distinguish transformation rules that expand a graph, called expansion rules, and transformation rules that reduce a graph, called reduction rules. In fact every expansion rule may be applied in the reversed order to become a reduction rule and vice versa. (If (L, K, R) is an expansion rule, the corresponding (R, K, L) is a reduction rule.) So if we have one or more initial graphs and a set of expansion rules, we have implicitly defined a possibly infinite set of graphs. To add a probability distribution we endow the production rules with a non-negative weight function. Given a graph G, a new graph G′ is generated by applying an expansion rule, with probability equal to the weight of the rule divided by the sum of the weights of all rules that could be applied. Note that the same rule may be applicable several times to graph G. Now we are almost done: we only need a stopping rule for this generation process. This can be done in several ways, e.g. by sampling the number of generations from a distribution before we start the generation process, or by ‘flipping a coin’ during the generation process to determine whether to stop or to continue. In both cases the number of expansion steps is stochastically independent of the generated graph. A more sophisticated stopping rule could be based on a Markov chain where the state is the last generated graph, but we will not consider this here. So, by repeating the generation process, we have a method to obtain an arbitrarily large sample from a graph class with a specified total probability.
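
The generation process just described can be summarized in a short Python sketch (illustrative only; the rule representation below is a hypothetical stand-in for a real graph transformation engine, not the implementation used in [49]).

    import random

    def generate_model(initial, rules, stop_prob=0.1, rng=random):
        """Generate one random model from a weighted set of expansion rules.

        `rules` is assumed to be a list of (weight, matches, expand) triples:
        `matches(G)` says whether the production rule applies to graph G, and
        `expand(G)` returns the rewritten graph G' = G \\ (L\\K) + (R\\K).
        Both callables stand in for a real graph rewriting engine."""
        G = initial
        # 'flip a coin' before every step: geometric stopping rule, independent of G
        while rng.random() >= stop_prob:
            applicable = [(w, expand) for (w, matches, expand) in rules if matches(G)]
            if not applicable:
                break
            chosen = rng.choices(applicable, weights=[w for w, _ in applicable])[0]
            G = chosen[1](G)    # apply the rule chosen proportionally to its weight
        return G

    def sample_models(initial, rules, n, stop_prob=0.1):
        """Repeat the generation process to obtain an arbitrarily large sample."""
        return [generate_model(initial, rules, stop_prob) for _ in range(n)]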

The third question is about the application of the approach in practice. For the comparison of the quality of algorithms, a team of experts could decide on the production rules, their weights and a stopping rule. Then every developer could evaluate an algorithm by creating a large sample and testing the algorithm on it. This is much better than the existing benchmarking practice, since we have arbitrarily large samples and we know what the evaluation means for the whole class.

However, we can apply the approach in an even better way, since we often already have an empirical sample of the class observed in practice, e.g. a sample of problem instances of the traveling salesman or timetabling problems. So it would be interesting to see if we can derive the stopping rule as well as the weights of the expansion rules from this empirical sample.

Sometimes this can be done in a quite straightforward way by applying the expansion rules as reduction rules to obtain one of the initial graphs, while counting the number of applications of each rule. So we obtain an estimate for the weights of the rules, and since we count the number of steps as well, we have a sample of the stopping time distribution, from which we may estimate the parameters of the stopping rule distribution. In case there are multiple reduction paths we have to be a little more careful, but a similar approach seems feasible. Given these parameters we can generate a large sample, much larger than the empirical sample, and use this sample to evaluate the algorithm. As shown in [49] this gives much better results than using the original sample as a benchmark. Although this sounds like ‘magic’, it is a phenomenon similar to the bootstrapping technique of statistics (see [44]). A simple explanation is that the graphs in the empirical sample contain more information than we use when we just use them as a benchmark. Even in a small empirical sample we may have a lot of information about the weights of the production rules.
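
A minimal sketch of this estimation step (assuming, for simplicity, a single reduction path per graph and a hypothetical helper `reduce_step` that applies one reduction rule and returns the rule's name together with the reduced graph, or None for an initial graph):

    from collections import Counter
    import statistics

    def estimate_parameters(empirical_sample, reduce_step):
        """Estimate rule weights and a geometric stopping-rule parameter from an
        empirical sample of graphs, by reducing each graph back to an initial one."""
        rule_counts = Counter()
        path_lengths = []
        for G in empirical_sample:
            steps = 0
            while (step := reduce_step(G)) is not None:
                rule_name, G = step
                rule_counts[rule_name] += 1
                steps += 1
            path_lengths.append(steps)

        total = sum(rule_counts.values())
        weights = {rule: count / total for rule, count in rule_counts.items()}
        # geometric stopping rule whose mean matches the observed number of steps
        stop_prob = 1.0 / (statistics.mean(path_lengths) + 1.0)
        return weights, stop_prob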

4.2 Case study: Petri nets

This case study is based on [49]. Here we will illustrate the approach for Petri nets, which are bipartite graphs with two kinds of nodes, called places and transitions, connected by directed arcs that always go between nodes of different kinds. Petri nets have markings, which are distributions of objects, called tokens, over the places. When a transition has a token in each of its input places, the marking may change by consuming these tokens and producing a new token for each output place of the transition. So a Petri net determines a transition system. We consider a special class of Petri nets called workflow nets. These nets have one input place and one output place (called the initial and final place respectively), and all other nodes are on a directed path from the input place to the output place. As initial marking they have only one token in the initial place. They are used frequently in practice to model business processes or procedures. We used 9 very simple expansion rules, see Figure 1. The initial graph consists of just one place. All 9 rules preserve the property that the generated nets are workflow nets. The first 6 rules also preserve a behavioral property, the weak termination property, which says that if a marking is reachable from the initial marking, then the marking with only one token in the final place is reachable from it.
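
For concreteness (an illustrative Python sketch, not code from [49]), the firing rule and the initial marking of a workflow net can be expressed as follows; the dictionary-based representation is a hypothetical choice.

    def enabled(transition, marking, inputs):
        """A transition is enabled if each of its input places holds a token."""
        return all(marking.get(place, 0) > 0 for place in inputs[transition])

    def fire(transition, marking, inputs, outputs):
        """Firing consumes one token from every input place and produces one
        token in every output place of the transition."""
        new_marking = dict(marking)
        for place in inputs[transition]:
            new_marking[place] -= 1
        for place in outputs[transition]:
            new_marking[place] = new_marking.get(place, 0) + 1
        return new_marking

    # A workflow net starts with a single token in its initial place:
    initial_marking = {"initial": 1}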

Figure 1: The generation rules for workflow nets

We applied the approach to a collection of 200 workflow nets representing business processes from a manufacturing company. We consider this set as the original model class, which is finite. In order to validate our approach, we took 10 samples of 5 models each from the given 200 models. With each of these samples we derived the parameters and we generated 200 workflow nets. Instead of testing algorithms with these generated sets, we computed characteristics of the graphs themselves. In particular we considered two characteristics: the length of the shortest path (LSP) from the initial place to the final place, which is a structural property of the nets, and whether the net has the weak termination property (WTP) or not.

The results are as follows. First we consider the LSP. In the original population the mean and standard deviation of the LSP are 7.14 and 0.85. In the first sample of five models they were 6.0 and 2.9 respectively, which falls outside a 95 percent confidence limit of the original mean. For the collection of 200 nets generated from the first sample the values are 7.34 and 0.95 respectively, which fits in the interval of the original mean; in fact the intervals are almost the same. This was the case with the first sample. We repeated this for the other 9 samples and we computed the average and standard deviation of the mean LSP values over these 10: 7.29 and 1.46 respectively.

For the weak termination property (WTP) we followed the same procedure. All models in the original collection had the WTP, and so did all the models in the 10 samples of five. The reduction process preferred the first 6 rules, which preserve soundness; however, the original models were not generated with our rules (as far as we know!), so we also had to apply rules that do not preserve the WTP. It turned out that in only 3 of the 10 generated collections there were models not satisfying the WTP. They had a probability of WTP of 0.96, 0.89 and 0.85. The average probability was 0.97 with a standard deviation of 0.05.

4.3 Summary

We have seen that for the comparison of algorithms empirical methods are essential and that there is a better approach than classical benchmarking. The key is that we need a probability distribution over infinite sets of models. We sketched a method to construct such a probability distribution and to generate a large sample:

1. observe an empirical sample of models;

2. define a suitable model class by production rules and an initial model;

3. define a stopping rule;

4. identify the parameters: estimate the weights for the production rules and the stopping rule;

5. generate a large sample using the production rules and the parameters;

6. run the algorithm for each model of the generated sample.

In a case study for a special class of Petri nets we showed how the method works and that it is much better to use a large number of generated models, using the parameters obtained from a small empirical sample, than to use this sample directly. This case study encourages us to apply the method to other model classes.

5. ROLE OF USER STUDIES IN COMPUTER GRAPHICS AND VIRTUAL REALITY

(Carlos Andujar)

We now take a closer look at a kind of experimentation which also plays a major role in several computer science fields: experimentation involving human subjects. Conducting experiments involving humans is a challenging task and poses a number of problems not found in other types of experiments. We will focus on user studies in three user-centric fields, namely Computer Graphics (CG), Virtual Reality (VR), and 3D User Interfaces (3DUI).

Figure 2: Visual equivalence problem

5.1 Sample problems

Let us introduce some problems that will serve to illustrate the need for user studies in CG-related areas and to exemplify some major challenges. Our first example refers to the visual equivalence problem. Consider for example the two images in Figure 2. These images look nearly identical even though the image on the right has been rendered with a level-of-detail algorithm which is much faster but less accurate. This represents a speed/quality tradeoff, and the most appropriate algorithm will depend on the different conditions (such as viewing distance, saliency of the image differences) that determine to which extent users will perceive the two images as the same. This kind of question requires a user study and probably a psychophysical experiment [20]. Although several image metrics have been proposed to compare pairs of images, and some of them take into account key features of human perception, there is some evidence that these metrics often fail to predict the human visual system response [41]. Indeed, there is an increasing interest in the CG community in gaining a deep understanding of the Human Visual System, as evidenced by some top journals (e.g. ACM Transactions on Applied Perception) aiming to broaden the synergy between computer science and psychology.
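
To make the notion of an image metric concrete (an illustrative Python baseline only, not one of the metrics referred to above), the simplest measures are purely numerical, such as a per-pixel root-mean-square error; their blindness to viewing conditions is exactly what motivates perceptual user studies.

    import numpy as np

    def rmse(image_a, image_b):
        """Per-pixel root-mean-square error between two images given as
        H x W x 3 arrays with values in [0, 1].  A purely numerical measure
        like this ignores viewing distance and the saliency of differences,
        which is why it can disagree with what observers actually perceive."""
        diff = np.asarray(image_a, dtype=float) - np.asarray(image_b, dtype=float)
        return float(np.sqrt(np.mean(diff ** 2)))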

The second example is the evaluation of presence in VR systems, that is, to which extent users feel and behave as if physically present in a virtual world. Users of immersive systems might forget about the real environment and the virtual environment can become the dominant reality. The evaluation of presence is very important in many VR applications, from phobia therapies to psychological experiments to pain relief for patients with serious injuries. Many presence evaluation studies report surprising findings when analysing human behaviour when users are presented with a stress situation in a virtual environment [34, 43]. Nowadays, the common practice is to evaluate presence by observing users’ behavior and measuring their physiological response (as captured by specialized devices such as heart rate monitors and galvanic skin sensors).

Figure 3: Different interfaces for a puzzle-solving problem

The last example concerns the comparison of 3DUIs in terms of their usability. Figure 3 shows three different user interfaces for solving a 3D puzzle [42], using either a physical puzzle, a classic keyboard-and-mouse interface, or a Wii controller. For this task, users must be able to select pieces, manipulate them, and explore the model from different viewpoints. The key problem is thus to determine which UI is better in terms of usability, and the only solution nowadays is to conduct a user study comparing these techniques. Although some models have been proposed to predict human performance for some tasks (the well-known Fitts' law is the most notable example), typical spatial tasks are just too complex to be predicted by such simple models.
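For reference, Fitts' law in its common Shannon formulation predicts the mean movement time MT needed to acquire a target of width W at distance D, where a and b are empirically fitted constants:

MT = a + b \log_2\!\left(\frac{D}{W} + 1\right)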

5.2 Major challenges in user studies for CG and VR

The following list (adapted from [45]) shows the typical steps involved in empirical methods involving human subjects.

1. Formulate a hypothesis;

2. Make the hypothesis testable;

3. Design an experiment;

4. Get approval from an ethics committee;

5. Recruit participants;

6. Conduct the experiment and collect data;

7. Pay participants;

8. Analyze the data;

9. Accept or refute the hypothesis;

10. Explain the results;

11. If worthy, communicate your findings.

The list above will serve to guide the discussion about some major challenges in user studies not found in other types of experiments.

5.3 Hypothesis formulation

A first step is to formulate a hypothesis and make it testable. Using the 3D puzzle problem as an example, a general hypothesis might be formulated as follows: using the Wii controller will make people more effective at manipulation tasks. A possible testable version of it could be formulated as: we measure the time it takes users to solve a particular 3D puzzle, using either the Wii or the mouse; we hypothesize that users will be faster using the Wii. Here we find a major problem: to make the hypothesis testable, we had to choose a particular task which we take as representative (besides fixing some other important variables). Unfortunately, many problems have a task space so large and heterogeneous that it can be really difficult to find a small set of representative tasks. Here we wrote task for the 3D puzzle example, but we could have written 3D model, image, movie, stimulus, or whatever other elements are fixed to make the hypothesis testable.

Complex tasks also depend on a number of variables. For example, the completion time for a 3D puzzle might depend on the interaction device (Wii, mouse), the viewing conditions (stereoscopic, mono), the mapping between user actions and application changes, and the quality of the graphics [9]. The more variables are controlled, the more general the findings will be, but also the more difficult the data collection. Indeed, data collection is a further issue in typical VR applications: compare the measurement of simple dependent variables such as task completion times and error counts with hard-to-collect data such as the user's physiological response, heart rate or galvanic skin response.

5.4 Experiment design

Independent variables can vary in two ways: within subjects (each participant sees all conditions) and between subjects (each participant sees only one condition). The decision on which design to use is often controversial. Within-subject designs are more time-consuming for the participants and require the experimenter to counterbalance for learning and fatigue effects. But between-subject designs are not free from limitations, as more participants need to be recruited and we are likely to lose statistical power (and thus have less chance of confirming our hypothesis).
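The design choice also determines the analysis. The following Python sketch, with hypothetical completion times for the 3D puzzle task, shows how the same comparison would be tested under each design (it is an illustration only, not a recommendation of a particular test).

```python
from scipy import stats

# Hypothetical completion times (seconds) for the 3D puzzle task.
wii_times   = [41.2, 38.5, 45.0, 39.8, 42.1, 40.6]
mouse_times = [47.3, 44.1, 50.2, 43.9, 48.0, 46.5]

# Within-subjects design: the same participants used both devices,
# so the observations are paired and a paired-samples test applies.
t_within, p_within = stats.ttest_rel(wii_times, mouse_times)

# Between-subjects design: each condition is seen by a different group,
# so an independent-samples test is used; reaching the same statistical
# power typically requires recruiting more participants.
t_between, p_between = stats.ttest_ind(wii_times, mouse_times)

print(f"within-subjects p = {p_within:.3f}, between-subjects p = {p_between:.3f}")
```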

5.5 Ethics

Many usability guides address in depth the ethical issues related to user studies [3, 16]. Most organizations require experimenters to get approval from an ethics committee before running the experiment. After approval, it is often a hard task to recruit participants and obtain their informed consent, in particular when participants must be chosen from a specific target user group (such as physicians).

Experiments involving immersive VR systems often need a detailed informed consent. Researchers should never deceive participants about aspects that would affect their willingness to participate, such as risks (VR users might bump into walls or trip over cables), discomfort (many 3D interaction techniques for spatial tasks are physically demanding) and unpleasant experiences (some VR systems cause motion sickness). Frustration handling is also important when measuring user performance: in case of failure to complete a task, experimenters should make it clear that the technology is to blame. Usability tests should not be perceived as tests of the participant's abilities [16].

5.6 Experimenter issues

Experimenters should be careful to avoid manipulating the experiment. Besides the well-known placebo effect, there are other experimenter issues that often hinder data collection. The Hawthorne effect occurs when increased attention from superiors or colleagues increases user performance: the performance of a participant might change if somebody else, e.g. the previous participant, is observing. The observer-expectancy effect occurs when the researcher unconsciously manipulates the experiment, for example through body language. Experiments should be double-blind, but researchers in non-life-critical fields often disregard these issues.

5.7 Data analysis

Proper data analysis is absolutely required to accept or refute the hypothesis and to provide statistical evidence for the findings. Unfortunately, part of the CG community seems to lack sufficient background in experimental design and statistical analysis to conduct the user studies required to evaluate their own research. This is evidenced by the large number of submitted, and even published, papers with serious evaluation errors related to the statistical analysis of the results. Not surprisingly, some leading CG groups around the world rely on contributions from psychologists.
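One small, illustrative habit that avoids a common reporting error is to accompany every p-value with an effect size; a minimal Python sketch (assuming two independent samples) is given below.

```python
import numpy as np

def cohens_d(a, b):
    """Effect size (Cohen's d) for two independent samples, using the
    pooled standard deviation. Reporting it alongside the p-value guards
    against over-interpreting 'significant' but negligible differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```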

5.8 Summary

In the last decades some computer science disciplines have been experiencing a shift of focus from implementing the technology to using the technology, and empirical validation through user studies is becoming critical. In this section we have discussed some major issues of such validation experiments: lack of background in experimentation, psychology and psychophysics; the time-consuming and resource-consuming nature of user studies; and the difficulties of fulfilling all requirements (double-blind experiments, informed consent, representative users, representative data sets/models/tasks).

User performance during typical computer-related tasks depends on a number of domain-specific as well as hardware-related factors. Considering all these factors simultaneously as independent variables in controlled experiments is clearly not practical. This limits the validity of the findings reported in the CG and VR literature to a specific domain and a particular setup. The lack of de facto standard data sets for testing purposes (more common in other scientific communities), along with the plethora of hardware setups, makes fair comparisons difficult. Furthermore, many techniques are still proposed and evaluated in isolation, whereas in the real world user tasks are mixed with other tasks. These are issues that must still be addressed.

6. EXPERIMENTS IN COMPUTER SCIENCE: ARE TRADITIONAL EXPERIMENTAL PRINCIPLES ENOUGH?

(Fabio A. Schreiber)

Francis Bacon, in his Novum Organum, observes that "... simple experience; which, if taken as it comes, is called accident, if sought for, experiment ..." [4]. This fundamental observation calls for the establishment of a methodology of experimentation, which cannot be based on the observation of casual events and their simplistic interpretation, but must rely on accurately designed and rigorously performed experiments. On the other hand, Louis Pasteur accepted some degree of chance in discovery, but stated that "In the field of observation, chance favors only the prepared mind ..." [40], thus admitting that only a precise framework can produce meaningful research results.

Figure 4 shows how observational studies are performed in the natural sciences: from the observation of natural phenomena some hypotheses are formulated, which lead to experiments in order to formulate theories which, in turn, suggest further experimentation until a well-established theoretical framework is reached (a); sometimes, however, observations are not casual, but are induced by autonomously formulated conjectures which must be experimentally proved (b).

Figure 4: Observational studies. (a) OBSERVATION → HYPOTHESIS → EXPERIMENT → THEORY; (b) CONJECTURE → OBSERVATION → EXPERIMENT/SIMULATION → THEORY.

On this basis, the final panel 2 of the ECSS 2012 Summit tried to answer some questions, the first of which is: Is this model applicable to Computer Science/Engineering?

So, is Informatics a "natural science"? One of those disciplines that are blessed with the legacy of Galileo? If not, would that be somehow desirable? Would that even make any sense? And, further: whether or not experiments are (or ought to be) strictly related to the way computer science advances, do they have any other role in it?

6.1 Experimentation Goals and Properties

Matti Tedre, in Section 2, mentioned many reasons for experimenting; here we recall four of them:

• to discover the unknown - in this case we want to answer questions like: "What happens if I mix Oxygen and Hydrogen?";

• to test a hypothesis - "From a stoichiometric computation it follows that if I mix two molecules of Hydrogen and one molecule of Oxygen I get some energy and two molecules of water; is this true?";

• to determine the value of some physical variable - "What is the speed of light in vacuum?";

• to compare a set of different "objects" to determine their relative merits (benchmarking) - "Which among a set of cars has the best energy performance?"

2 The panelists were Francesco Bruschi, Natalia Juristo, Antoine Petit, and Matti Tedre; the panel was moderated by Fabio A. Schreiber.

All of these classes must satisfy at least three methodological conditions for an experience to be considered an experiment:

• Repeatability at different times and in different places, to check the universality of results;

• Reproducibility by other scientists, to confirm that results are independent of the details of the specific experiment;

• Comparability of the results of different instances of the same experiment.

A second question then is: how do such goals and properties apply to Computer Science/Engineering theories and artifacts?

Francesco Bruschi focused on the fact that hard experimentation is already present in the day-to-day practice of applied informatics. Moreover, some of the requirements for an experience to be considered an experiment have both a positive and a normative role in the engineering fields related to computer science.

Considering the cited methodological conditions for an experience to be considered an experiment - repeatability, reproducibility, and comparability of the results - Francesco argued that there are two ways in which these requirements have a role in the practice of Informatics: they have positive value in describing how some tasks are carried out and accomplished, and they have normative value with respect to the definition of some tools and methodologies. Francesco proposed an example for each role.

Let us start with the positive role, considering the task of modifying a software system in order to correct a behavior that is not compliant with the specifications (a practice commonly referred to as debugging). The cornerstone of debugging is the definition of at least one test case able to highlight the aberration from the expected behavior. A test case is but the description of a setup (a set of boundary conditions) such that: i) it must be possible to impose it on a running instance of the system an arbitrary number of times (it must be repeatable); ii) it must be possible to impose it on any other instance of the computing architecture on which the software is intended to be executable (it must be reproducible); iii) the execution of the system under the given conditions must highlight, in an unambiguous way, the deviation from the expected behavior (the results of running the test after changing the code must be clearly comparable). The interesting thing is that, even if at various levels of awareness, these requirements permeate engineering practice at every level.
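As a hypothetical illustration of such a test case in code (the function and its specification are invented for the example, and the framework used is Python's standard unittest):

```python
import unittest

def parse_percentage(text):
    """Function under test; the (assumed) specification is that it
    should accept both '42' and '42%'."""
    return int(text.rstrip('%'))

class TestParsePercentage(unittest.TestCase):
    # The test fixes the boundary conditions: it can be re-run any number
    # of times (repeatable), on any machine running the same interpreter
    # (reproducible), and its pass/fail outcome before and after a code
    # change is directly comparable.
    def test_accepts_value_with_percent_sign(self):
        self.assertEqual(parse_percentage("42%"), 42)

if __name__ == "__main__":
    unittest.main()
```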

As far as the normative role of the experimental requirements is concerned, let us now consider the interest recently gained, in the software engineering field, by purely functional languages such as Haskell. It is advocated that these languages are superior for developing robust, maintainable, complex software systems. If we look more closely at the features that define functional purity, we find that the core ones can be stated as follows:

i) a function, whenever invoked with the same parameters, must always produce the same result (i.e., the invocation must be repeatable);

ii) the invocation of a function contains all the information on the boundary conditions for the execution; this means, for instance, that there is no hidden state, which makes it much easier to reproduce a given particular execution;

iii) all the possible side effects (e.g. I/O, state modification) produced by a function invocation are explicitly inferable from the function call (i.e., the sets of effects of two different function calls are comparable). The remarkable thing here is that the features which in the scientific domain are definitional for experiments somehow act as desiderata in the field of programming language design, a field notably central to computer science.
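The text refers to Haskell; purely as an illustration of the same contrast, and for consistency with the other sketches in this report, here is a minimal (hypothetical) Python example:

```python
# Impure: the result depends on hidden state, so two invocations with the
# same argument are not repeatable, and reproducing a particular run
# requires knowing the hidden counter value.
_counter = 0
def next_label(prefix):
    global _counter
    _counter += 1
    return f"{prefix}-{_counter}"

# Pure: the result depends only on the explicit arguments, so every
# invocation is repeatable and reproducible, and the (absent) side
# effects of any two calls are trivially comparable.
def label(prefix, index):
    return f"{prefix}-{index}"
```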

So, rigorous experimentation already has a role, widespread albeit implicit, in at least two domains of computer science, and it would be interesting to explore further the possible epistemological and didactic consequences of this fact.

However, Natalia Juristo pointed out that the experimental paradigm of software engineering (SE) is still immature: i) SE experiments are mostly exploratory and lack mechanisms to explain, by means of inference steps, the meaning of the observed results; ii) SE experiments have flaws, since they lack thoroughly thought-out designs to rule out extraneous variables in each experiment, and proper analysis techniques are not always used.

Natalia further observed that it is not enough just to apply experimental design and statistical data analysis to take experimental research in SE forward, since a discipline's specific experimental methodology cannot be imported directly from other disciplines.

6.2 Design and Simulation

A point on which all the speakers agreed is the need to adopt a well-defined language to give rigor and precision to experimental data, and to use rigorous measurement methods and tools to quantitatively describe the phenomena under investigation. The role of Statistics, both in the design of an experiment and in the interpretation of its results, emerged unanimously.

Antoine Petit also pointed out that software, besides being the object of some experimental activity, as shown in Section 6.1, is also a tool for making experiments, in the same way that a telescope is both the object of research and experimentation in the field of optics and a tool for astronomical research and discovery. Computer scientists need their own research infrastructures, just as physicists have their lasers or the Large Hadron Collider, astronomers their space telescopes, and biologists their animal houses. Some of our research needs similarly large infrastructures, in particular to produce results at the right scale to be transferred to industry. We can think, for instance, of immersive systems, cloud computing, smart dust, or robots, but these infrastructures can be useful only if there is enough technical staff besides the researchers.

A discussion followed about the use of simulation models and frameworks to make scientific experimentation cheaper and faster than in real life. While simulation has long been used successfully to predict the performance of computing systems, its application to disciplines other than Informatics (the natural sciences, physics, economics, etc.) requires careful interaction between people with different cultures and languages, in order to avoid misunderstandings leading to erroneous results which often are not the fault of "... that damn computer".

6.3 Other Issues

New experimental activities have emerged in recent years in Information Management; two of them were explicitly mentioned in the discussion: i) Data Mining for knowledge discovery in large amounts of operational data, and ii) Pervasive Systems support for sensing real-life physical data to be used as input to the application programs that compute the experiments' output. These applications are very far from each other, but how do they compare to the classical notion of "experiment"? Do we need a new vision?

Contrary to the position of Natalia Juristo, Antoine Petit argued that there is no specificity of Informatics with respect to other sciences and that its experimental dimension amounts to the experimental dimension of software. We use software as astronomers use telescopes, but studying and "constructing" software is also part of our job, whereas astronomers do not study or construct telescopes.

This implies that the work done by researchers to study and construct software has to be evaluated, as classical publications are. A difficulty comes from the small number of conferences and journals devoted to software; an organized Software Self-Assessment could be an interesting option for such an evaluation.

In the discussion which followed, a last important question emerged: "Are CS/CE curricula suitable for giving our students an experimental awareness?". At first sight, it seems that engineering curricula generally include some separate, and sometimes optional, courses on Measurement Theory and Statistics, while these are not always found in Computer Science curricula; in any case, there is no general treatment of the principles of experimentation. This could be the topic of a deeper survey of the state of the art and a subsequent proposal for the inclusion of an ad-hoc course in CS/CE curricula.

7. CONCLUSIONS

(Viola Schiaffonati and Jan van Leeuwen)

The experimental method is clearly gaining ground in computer science. It is now widely recognized that experimentation is needed as a key step in the design of all complex IT applications, from the design of the processes (Empirical Software Engineering) to the design of the systems (as in Robotics). The quest for experimentation manifests a renewed attention to rigorous methodologies. Although many experiments have been put to good use over the years, more recent efforts emphasize the systematic adoption of the experimental approach. However, it is less clear that the precise use of experimentation is always well understood. Can we simply adopt the experimental method from the classical disciplines or from engineering? If not, in what ways should it be modified to serve as a sound scientific methodology for the computing and information sciences?

In this paper we first focused on the origins and the potential uses of the experimental method in the field of Informatics (Section 1). We highlighted the specific context of the design of artefactual systems and their requirements (Section 2), and the different views on experimentation that can be encountered in the field (Section 3). We subsequently showed that experimentation in Informatics can be aimed at a variety of different properties not found in other sciences (Section 4), and might also involve, e.g., the use of human subjects (Section 5).

In a final appraisal, we observe that the experimental method is well recognized in various areas within (applied) Informatics, but that its rigorous application needs to be investigated further (Section 6). Design and simulation methods need to be carefully described and documented, so that it is clear to everyone when they count as instances of experimentation and when they do not.

We recommend that modern Informatics curricula offer an adequate background in both the philosophy and the potential uses of the experimental method. Furthermore, experiments in Informatics should follow sound, replicable, and verifiable protocols, as in all other sciences and in engineering (Section 1). Experimentation should be recognized in all branches of Informatics as a crucial methodology, with well-defined rules and practices.

This paper resulted from a collaborative effort of the authors after the Workshop on the Role and Relevance of Experimentation in Informatics, held in Barcelona in 2012 prior to the 8th European Computer Science Summit of Informatics Europe. The paper also includes the conclusions from the subsequent panel on experimentation at the Summit (Section 6).

Both the workshop and the panel raised many interesting questions about the experimental method as it is to be used in Informatics. In this paper we have only started to discuss some of the questions, as one of our aims was rather to show how many relevant issues are still open and need to be dealt with, from a variety of different perspectives. When properly applied, experimentation is a powerful component of the methodological basis of Informatics as a fully fledged science.

Acknowledgements

We thank Informatics Europe for the opportunity to organize our Workshop in conjunction with its ECSS 2012 summit meeting. We also thank the Department of LSI of the Universitat Politecnica de Catalunya (UPC) in Barcelona for the excellent local arrangements. Finally, we gratefully acknowledge the contribution of Francesco Bruschi and the comments of Antoine Petit to Section 6.

8. REFERENCES

[1] F. Amigoni, M. Reggiani, and V. Schiaffonati. An insightful comparison between experiments in mobile robotics and in science. Autonomous Robots, 27:4 (2009) 313–325.
[2] F. Amigoni, V. Schiaffonati. Good experimental methodologies and simulation in autonomous mobile robotics. In: L. Magnani, W. Carnielli, and C. Pizzi (Eds.), Model-Based Reasoning in Science and Technology, Springer, Berlin, 315–322, 2010.
[3] American Psychological Association. Ethical principles of psychologists and code of conduct. American Psychologist, 47 (1992) 1597–1611.
[4] F. Bacon. Novum Organum. 1620.
[5] J.C.M. Baeten, K.M. van Hee. The role of graphs in computer science. Report CSR07-20, Eindhoven University of Technology, 2007.
[6] M. Barni, F. Perez-Gonzalez. Pushing science into signal processing. IEEE Signal Processing Magazine, 120 (2005) 119–120.
[7] Th. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss (Eds.). Experimental Methods for the Analysis of Optimization Algorithms, Springer, Berlin, 2010.
[8] V.R. Basili. The role of experimentation in software engineering: past, current and future. In: Proc. 18th Int. Conf. on Software Engineering (ICSE'96), IEEE Press, 1996, 442–449.
[9] D. Bowman, J. Gabbard, and D. Hix. A survey of usability evaluation in virtual environments: classification and comparison of methods. Presence: Teleoperators and Virtual Environments, 11:4 (2002) 404–424.
[10] M. Campbell-Kelly, W. Aspray. Computer: A History of the Information Machine, Westview Press, Oxford (UK), 2nd edition, 2004.
[11] M. Davis. The Universal Computer: The Road from Leibniz to Turing, CRC Press, Boca Raton, FL, 2012.
[12] P.J. Denning. What is experimental computer science? Communications of the ACM, 23:10 (1980) 543–544.
[13] P.J. Denning. ACM president's letter: On folk theorems, and folk myths. Communications of the ACM, 23:9 (1980) 493–494.
[14] P.J. Denning, D.E. Comer, D. Gries, M.C. Mulder, A. Tucker, A.J. Turner, and P.R. Young. Computing as a discipline. Communications of the ACM, 32:1 (1989) 9–23.
[15] H. Ehrig, K. Ehrig, U. Prange and G. Taentzer. Fundamentals of Algebraic Graph Transformations. EATCS Monographs on Theoretical Computer Science, Springer, 2006.
[16] European Telecommunications Standards Institute (ETSI). Usability evaluation for the design of telecommunication systems, services and terminals. ETSI Guide EG 201472, Sophia Antipolis, 2000.
[17] D.G. Feitelson. Experimental computer science. Communications of the ACM, 50:11 (2007) 24–26.
[18] J.A. Feldman, W.R. Sutherland. Rejuvenating experimental computer science: A report to the National Science Foundation and others. Communications of the ACM, 22:9 (1979) 497–502.
[19] P. Fletcher. The role of experiments in computer science. Journal of Systems and Software, 30:1–2 (1995) 161–163.
[20] G. Gescheider. Psychophysics: The Fundamentals, 3rd edition. Lawrence Erlbaum Associates, Mahwah, NJ, 1997.
[21] G.E. Forsythe. A university's educational program in Computer Science. Communications of the ACM, 10 (1967) 3–11.
[22] M. Franssen, G. Lokhorst, and I. van de Poel. Philosophy of technology. In: E.N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2010 Edition), 2010.
[23] P. Freeman. Back to experimentation. Communications of the ACM, 51:1 (2008) 21–22.
[24] J. Gustedt, E. Jeannot, and M. Quinson. Experimental methodologies for large-scale systems: A survey. Parallel Processing Letters, 19:3 (2009) 399–418.
[25] J. Hartmanis, H. Lin. What is computer science and engineering? In: J. Hartmanis and H. Lin (Eds.), Computing the Future: A Broader Agenda for Computer Science and Engineering, 163–216, National Academy Press, Washington, D.C., 1992.
[26] W.S. Jevons. The Principles of Science: A Treatise on Logic and Scientific Method, MacMillan and Co., London, 1874.
[27] N. Juristo, A.M. Moreno. Basics of Software Engineering Experimentation, Kluwer Academic Publishers (Springer), Dordrecht, 2001.
[28] K. Keahey, F. Desprez (Eds.). Supporting experimental computer science, ANL MCS Technical Memo 326, Argonne National Laboratory, Argonne, IL, 2012.
[29] G. Leroy. Designing User Studies in Informatics, Springer, London, 2011.
[30] P. Lukowicz, E.A. Heinz, L. Prechelt, and W.F. Tichy. Experimental evaluation in computer science: A quantitative study. Journal of Systems and Software, 28:1 (1995) 9–18.
[31] P. Lukowicz, S. Intille. Experimental methodology in pervasive computing. IEEE Pervasive Computing, 10:2 (2011) 94–96.
[32] D.D. McCracken, P.J. Denning, and D.H. Brandin. An ACM executive committee position on the crisis in experimental computer science. Communications of the ACM, 22:9 (1979) 503–504.
[33] C.C. McGeoch. A Guide to Experimental Algorithmics, Cambridge University Press, Cambridge (UK), 2012.
[34] M. Meehan, B. Insko, M. Whitton, and F.P. Brooks. Physiological measures of presence in virtual environments. ACM Transactions on Graphics, 21:3 (2002) 645–652.
[35] R. Milner. Is computing an experimental science? Report ECS-LFCS-86-1, Laboratory for Foundations of Computer Science, University of Edinburgh, 1986.
[36] C. Morrison, R. Snodgrass. Computer science can use more science. Communications of the ACM, 54:6 (2011) 38–43.
[37] A. Newell, H.A. Simon. Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19:3 (1976) 113–126.
[38] J.K. Ousterhout. More on experimental computer science and funding. Communications of the ACM, 24:8 (1981) 546–549.
[39] P. Palvia, E. Mao, A.F. Salam, and K.S. Soliman. Management information systems research: What's there in a methodology? Communications of the Association for Information Systems, 11:16 (2003) 1–32.
[40] L. Pasteur. Lecture, Université de Lille, Dec. 7th, 1854.
[41] G. Ramanarayanan, J. Ferwerda, B. Walter, and K. Bala. Visual equivalence: towards a new standard for image fidelity. ACM Transactions on Graphics, 26:3 (2007) 1–11.
[42] Reality-based User Interface System (RUIS). IEEE 3DUI Contest entry. Department of Media Technology, Aalto University.
[43] M. Slater, A. Antley, A. Davison, D. Swapp, G. Guger, C. Barker, N. Pistrang and M. Sanchez-Vives. A virtual reprise of the Stanley Milgram obedience experiments. PLoS ONE, 1:1 (2006) e39.
[44] M.J. Schervish. Theory of Statistics, Springer-Verlag, New York, 1995.
[45] J. Swan, S. Ellis and B. Adelstein. Conducting human-subject experiments with virtual and augmented reality. In: Proc. of IEEE Virtual Reality 2007 Tutorials, Charlotte, NC, March 10–14, 2007.
[46] M. Tedre. Computing as a science: A survey of competing viewpoints. Minds & Machines, 21 (2011) 361–387.
[47] M. Tedre. The Development of Computer Science: A Sociocultural Perspective, Doctoral thesis, University of Joensuu, 2006.
[48] W. Tichy. Should computer scientists experiment more? IEEE Computer, 31:5 (1998) 32–40.
[49] K.M. van Hee, M. La Rosa, Z. Liu and N. Sidorova. Discovering characteristics of stochastic collections of process models. In: S. Rinderle-Ma, F. Toumani, K. Wolf (Eds.), Business Process Management 2011, LNCS vol. 6896, Springer, 298–312, 2011.
[50] P. Vermaas, P. Kroes, I. van de Poel, M. Franssen, and W. Houkes. A Philosophy of Technology: From Technical Artefacts to Sociotechnical Systems, Morgan and Claypool, 2011.
[51] M.R. Williams. A History of Computing Technology, IEEE Computer Society Press, Los Alamitos, CA, 2nd edition, 1997.
[52] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén. Experimentation in Software Engineering, Revised Edition, Springer, Heidelberg, 2012.