
A Tale of Two Studies

Judy Bowen 1, Steve Reeves 2, Andrea Schweer 3

(1, 2) Department of Computer Science, The University of Waikato, Hamilton, New Zealand
Email: [email protected], [email protected]

(3) ITS Information Systems, The University of Waikato, Hamilton, New Zealand
Email: [email protected]

Abstract

Running user evaluation studies is a useful way of getting feedback on partially or fully implemented software systems. Unlike hypothesis-based testing (where specific design decisions can be tested or comparisons made between design choices) the aim is to find as many problems (both usability and functional) as possible prior to implementation or release. It is particularly useful in small-scale development projects that may lack the resources and expertise for other types of usability testing. Developing a user-study that successfully and efficiently performs this task is not always straightforward, however. It may not be obvious how to decide what the participants should be asked to do in order to explore as many parts of the system's interface as possible. In addition, ad hoc approaches to such study development may mean the testing is not easily repeatable on subsequent implementations or updates, and also that particular areas of the software may not be evaluated at all. In this paper we describe two (very different) approaches to designing an evaluation study for the same piece of software and discuss both the approaches taken, the differing results found and our comments on both of these.

Keywords: Usability studies, evaluation, UI Design, formal models

1 Introduction

There have been many investigations into the effectiveness of different types of usability testing and evaluation techniques, see for example (Nielsen & Landauer 1993) and (Doubleday et al. 1997), as well as research into the most effective ways of running the various types of studies (numbers of participants, expertise of testers, time and cost considerations etc.) (Nielsen 1994), (Lewis 2006). Our interest, however, is in a particular type of usability study, that of user evaluations. We are interested in how such studies are developed, e.g. what is the basis for the activities performed by the participants? In particular, given an implementation (or partial implementation) to test, is there a difference between the sort of study the developer of the system under test might produce and that of an impartial person, and if so do they produce different results? It is well known by the software engineering community that functional and behavioural testing is best performed by someone other than the software's developer. Often this can be achieved because there is a structured mechanism in place for devising tests, for example using model-based testing (Utting & Legeard 2006) or by having initial specifications that can be understood by experienced test developers (Bowen & Hinchey 1995), or at least by writing the tests before any code is written (as in test-driven or test-first development (Beck 2003)).

Copyright © 2013, Australian Computer Society, Inc. This paper appeared at the Fourteenth Australasian User Interface Conference (AUIC 2013), Adelaide, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 139, Ross T. Smith and Burkhard Wuensche, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

For user evaluation of software, and in particular the user interface of software, we do not have the sort of structured mechanisms for developing evaluation studies (or models upon which to base them) as we do for functional testing. Developing such studies relies on having a good enough knowledge of the software to devise user tasks that will effectively test the software, which for smaller scale development often means the software's developer. Given that we know it is not a good idea for functional testing to be carried out by the software's developer, we suggest it may also be true that running, and more importantly, developing, user evaluations should not be done by the developer for the same reasons. This then presents the problem of how someone other than the software's developer can plan such a study without the (necessary) knowledge about the system they are testing.

In this paper we present an investigation into the differences between two evaluation studies developed using different approaches. The first was developed in the 'usual way' (which we discuss and define further in the body of the paper) by the developer of the software-under-test. The second was developed based on formal models of the software-under-test and its user interface (UI) by an independent practitioner with very little knowledge of the software-under-test prior to the modelling stage. We discuss the different outcomes of the two studies and share observations on differences and similarities between the studies and the results.

We begin by describing the software used as the basis for both evaluation studies. We then describe the process of deriving and running the first study along with the results. This is followed by a description of the basis and process of deriving the second study as well as the results of this. We then present a comparison of the two studies and their results, and finish with our conclusions.


Figure 1: Graph Version of the Digital Parrot

Figure 2: List Version of the Digital Parrot

2 The Digital Parrot Software

The Digital Parrot (Schweer & Hinze 2007), (Schweer et al. 2009) is a software system intended to augment its user's memory of events of their own life. It has been developed as a research prototype to study how people go about recalling memories from an augmented memory system.

The Digital Parrot is a repository of memories. Memories are encoded as subject–predicate–object triples and are displayed in the system's main view in one of two ways: a graph view and a list view, shown in Figures 1 and 2. Both views visualise the triple structure of the underlying memory information and let the user navigate along connections between memory items. The type of main view is chosen on program start-up and cannot be modified while the program is running.

The user interface includes four different navigators that can influence the information shown in the main view, either by highlighting certain information items or by hiding certain information items. These navigators are the timeline navigator (for temporal navigation; shown in Figure 4), the map navigator (for navigation based on geospatial location), textual search and the trail navigator (for navigation based on information items' types and connections; shown in Figure 3).

Figure 3: Trail Navigator

3 The First Study

At the time of the first study, the first development phase had ended and the Digital Parrot was feature-complete. Before using the software in a long-term user study (not described in this paper), we wanted to conduct a user evaluation of the software. Insights gained from the usability study would be used to form recommendations for changing parts of the Digital Parrot's user interface in a second development phase.

3.1 Goals

The first study had two main goals. The first was to detect any serious usability flaws in the Digital Parrot's user interface before using it in a long-term user study. We wanted to test how well the software could be used by novice users given minimal instructions. This mode of operation is not the typical mode for a system such as the Digital Parrot but was chosen to cut down on the time required by the participants, as it removed the need to include a training period.

The second goal was to find out whether the participants would understand the visualisations and the purpose of the four different navigators.

3.2 Planning the Study

The study was run as a between-groups design, with half the participants assigned to each main view type (graph vs list view). We designed the study as a task-based study so that it would be easier to compare findings between participants. We chose a set of four tasks that we thought would cover all of the Digital Parrot's essential functionality. These tasks are as follows:

1. To whom did [the researcher] talk about scuba diving? Write their name(s) into the space below.

2. Which conferences did [the researcher] attend in Auckland? Write the conference name(s) into the space below.

3. At which conference(s) did [the researcher] speak to someone about Python during the poster session? Write the conference name(s) into the space below.

4. In which place was the NZ CHI conference in 2007? Write the place name into the space below.

The tasks were chosen in such a way that most participants would not be able to guess an answer. We chose tasks that were not too straightforward to solve; we expected that a combination of at least two of the Digital Parrot's navigators would have to be used for each task. Since the Digital Parrot is intended to help users remember events of their own lives, all tasks were phrased as questions about the researcher's experiences recorded in the system. The questions mimic questions that one may plausibly find oneself trying to answer about one's own past.

To cut down on the time required by the participants, we split the participants into two groups of equal size. Each group's participants were exposed to only one of the two main view types. Tasks were the same for participants in both groups.

In addition to the tasks, the study included an established usability metric, the System Usability Scale (SUS) (Brooke 1996). We modified the questions according to the suggestions by Bangor et al. (Bangor et al. 2009). We further changed the questions by replacing "the system" with "the Digital Parrot". The intention behind including this metric was to get an indication of the severity of any discovered usability issues.
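For reference, SUS scores are computed from the ten item responses in the standard way (Brooke 1996): odd-numbered items contribute their rating minus one, even-numbered items contribute five minus their rating, and the sum is scaled by 2.5 to give a value between 0 and 100. The Python sketch below illustrates this standard calculation only; the example ratings are invented and are not data from the study.

def sus_score(responses):
    # `responses` is a list of ten ratings, each on a 1-5 scale,
    # in questionnaire order (item 1 first).
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten ratings between 1 and 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5  # final score in the range 0..100

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # 77.5 (invented ratings)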

3.3 Participants and Procedure

The study had ten participants. All participants were members of the Computer Science Department at the University of Waikato; six were PhD students and four were members of academic staff. Two participants were female, eight were male. The ages ranged from 24 to 53 years (median 38, IQR 15 years). Participants were recruited via e-mails sent to departmental mailing lists and via personal contacts. Participants were not paid or otherwise rewarded for taking part in the usability test.

In keeping with University regulations on performing studies with human participants, ethical consent to run the study was applied for, and gained. Each participant in the study would receive a copy of their rights as a participant (including their right to withdraw from the study) and sign a consent form.

After the researcher obtained the participant's consent, they were provided with a workbook. The workbook gave a quick introduction to the purpose of the usability test and a brief overview of the system's features. Once the participant had read the first page, the researcher started the Digital Parrot and briefly demonstrated the four navigators (see Section 2). The participant was then asked to use the Digital Parrot to perform the four tasks stated in the workbook.

Each participant was asked to think aloud while using the system. The researcher took notes. After the tasks were completed, the participant was asked to fill in the SUS questionnaire about their experience with the Digital Parrot and to answer some questions about their background. The researcher would then ask a few questions to follow up on observations made while the participant was working on the tasks.


Figure 4: Timeline Navigator

3.4 Expectations

We did not have any particular expectations related to the first goal of this study, that of detecting potential usability problems within the Digital Parrot.

We did, however, have some expectations related to the second goal. The study was designed and conducted by the main researcher of the Digital Parrot project, who is also the main software developer. Thus, the study was designed and conducted by someone intimately familiar with the Digital Parrot's user interface, with all underlying concepts and also with the test data. For each of the tasks, we were aware of at least one way to solve the task with one or more of the Digital Parrot's navigators. We expected that the participants in the study would discover at least one of these ways and use it to complete the task successfully.

3.5 Results

The median SUS score of the Digital Parrot as determined in the usability test is 65 (min = 30, max = 92.5, IQR = 35), below the cut-off point for an acceptable SUS score (which is 70). The overall score of 65 corresponds to a rating between "ok" and "good" on Bangor et al.'s adjective scale (Bangor et al. 2009). The median SUS score in the graph condition alone is 80 (min = 42.5, max = 92.5, IQR = 40), which indicates an acceptable user experience and corresponds to a rating between "good" and "excellent" on the adjective scale. The median SUS score in the list condition is 57.5 (min = 30, max = 77.5, IQR = 42.5). The difference in SUS score is not statistically significant but does reflect our observations that the users in the list condition found the system harder to use than those in the graph condition.
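Descriptive statistics of this kind, and a significance check for two small independent samples, can be reproduced with standard tooling, as in the sketch below. The score lists are placeholders rather than the study's raw data, and the Mann-Whitney test is shown only as one plausible choice of test for samples of this size.

import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder per-participant SUS scores -- substitute the real values.
graph_scores = [42.5, 70.0, 80.0, 87.5, 92.5]
list_scores = [30.0, 47.5, 57.5, 65.0, 77.5]

for label, scores in [("graph", graph_scores), ("list", list_scores)]:
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{label}: median={median}, IQR={q3 - q1}, "
          f"min={min(scores)}, max={max(scores)}")

# A non-parametric test is a reasonable choice for two small samples.
stat, p = mannwhitneyu(graph_scores, list_scores, alternative="two-sided")
print(f"Mann-Whitney U={stat}, p={p:.3f}")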

Based on our observations of the participants in this study, we identified nine major issues with the Digital Parrot's user interface as well as several smaller problems. In terms of our goals for the study we found that there were usability issues, some we would consider major, which would need to be addressed before conducting a long-term user study. In addition there were some issues with the use of navigators. The details of our findings are as follows:

1. Graph: The initial view is too cluttered.
2. Graph: The nature of relationships is invisible.
3. List: The statement structure does not become clear.
4. List: Text search does not appear to do anything.
5. Having navigators in separate windows is confusing.
6. The map is not very useful.
7. Users miss a list of search results.
8. The search options could be improved.
9. The trail navigator is too hard to use.

Some of these issues had obvious solutions based on our observations during the study and on participants' comments. Other issues, however, were less easily resolved.

We formulated seven recommendations to change the Digital Parrot's user interface:

1. Improve the trail navigator's user interface.
2. Improve the map navigator and switch to a different map provider.
3. De-clutter the initial graph view.
4. Enable edge labels on demand.


5. Highlight statement structure more strongly in the list.
6. Change window options for navigators.
7. Improve the search navigator.

As can be seen, these recommendations vary greatly in scope; some directly propose a solution while others require further work to be done to find a solution.

4 The Second Study

4.1 Goals

The goals of our second study were to emulate the intentions of the first, that is, to try to find any usability or functional problems with the current version of the Digital Parrot software. In addition, the intention was to develop the study with no prior knowledge of the software or of the first study (up to the point where we could no longer proceed without some of this information) by using abstract tests derived from formal models of the software and its user interface (UI) as the basis for planning the tasks of the study. We were interested to discover if such abstract tests could be used to derive an evaluation study in this way, and if so how the results would differ from those of the initial study (if at all).

4.2 Planning the Study

The first step was to obtain a copy of the Digital Parrot software and reverse-engineer it into UI models and a system specification. In general, our assumption is that such models are developed during the design phase and prior to implementation rather than by reverse-engineering existing systems. That is, a user-centred design (UCD) approach is taken to plan and develop the UI prior to implementation, and the resulting designs are used as the basis for the models. For the purposes of this work, however, the software had been implemented already and as such a specification and designs did not exist, hence the requirement to perform reverse-engineering. This was done manually; tools for reverse-engineering software in this manner do exist (for example GUI Ripper [9]), but not for the models we planned to use.

We spent several days interacting with the software and examining screen shots in order to begin to understand how it worked. We wanted to produce as much of the model as possible before consulting with the software designer to 'fill in the gaps' where our understanding was incomplete. Of course, such a detailed examination of the software was itself a form of evaluation, as by interacting with the software comprehensively enough to gain the understanding required for modelling, we formed our own opinions about how usable, or how complex, parts of the system were. However, in general where models and tests are derived prior to implementation this would not be the case. The first study had already taken place, but the only information we required about this study was the number and type of participants used (so that we could use participants from the same demographic group) and the fact that the study was conducted as a between-groups study with both graph and list versions of the software being tested. We had no preconceived ideas of how our study might be structured at this point, the idea being that once the models were completed and the abstract tests derived we would try and find some structured way of using these to guide the development of our study.

Figure 5: Find Dialogue

4.3 The Models

We began the modelling process by examining screenshots of the Digital Parrot. This enabled us to identify the widgets used in the various windows and dialogues of the system that provided the outline for the first set of models. We used presentation models and presentation interaction models (PIMs) from the work described in (Bowen & Reeves 2008) as they provide a way of formally describing UI designs and UIs with a defined process for generating abstract tests from the models (Bowen & Reeves 2009). Presentation models describe each dialogue or window of a software system in terms of its component widgets, and each widget is described as a tuple consisting of a name, a widget category and a set of the behaviours exhibited by that widget. Behaviours are separated into S-behaviours, which relate to system functionality (i.e. behaviours that change the state of the underlying system), and I-behaviours, which relate to interface functionality (i.e. behaviours relating to navigation or appearance of the UI).
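The tuple structure of presentation models is straightforward to capture in code. The following Python sketch (our own encoding, not part of the PIMed tool set) represents widgets as (name, category, behaviours) records and separates S- and I-behaviours by their prefix:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Widget:
    name: str
    category: str            # e.g. "Entry" or "ActionControl"
    behaviours: tuple = ()   # behaviour names, prefixed S_ or I_

@dataclass
class PresentationModel:
    name: str
    widgets: list = field(default_factory=list)

    def s_behaviours(self):
        # System behaviours: change the state of the underlying system.
        return {b for w in self.widgets for b in w.behaviours if b.startswith("S_")}

    def i_behaviours(self):
        # Interface behaviours: navigation or appearance of the UI.
        return {b for w in self.widgets for b in w.behaviours if b.startswith("I_")}

Encoded this way, a dialogue's model is simply a PresentationModel whose widgets mirror the tuples of the notation shown below.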

Once we had discovered the structure of the UI and created the initial model, we then spent time using the software and discovering what each of the identified widgets did in order to identify the behaviours to add to the model. For some parts of the system this was relatively easy, but occasionally we were unable to determine the behaviour by interaction alone. For example, the screenshot in Figure 5 shows the "Find" dialogue from the Digital Parrot, from which we developed the following presentation model:

FindWindow is
  (SStringEntry, Entry, ())
  (HighlightButton, ActionControl, (S_HighlightItem))
  (FMinIcon, ActionControl, (I_FMinToggle))
  (FMaxIcon, ActionControl, (I_FMaxToggle))
  (FXIcon, ActionControl, (I_Main))
  (HSCKey, ActionControl, (S_HighlightItem))
  (TSCKey, ActionControl, (?))

We were unable to determine what the behaviour of the shortcut key option Alt-T was and so marked the model with a "?" as a placeholder. Once the presentation models were complete we moved on to the second set of models, the PIMs, which describe the navigation of the interface. Each presentation model is represented by a state in the PIM and transitions between states are labelled with I-behaviours (the UI navigational behaviours) from those presentation models. PIMs are described using the µCharts language (Reeve 2005), which enables each part of the system to be modelled within a single, sequential µchart that can then be composed together or embedded in states of other models to build the complete model of the entire system. Figure 6 shows one of the PIMs representing part of the navigation of the "Find" dialogue and "Main" window.

Figure 6: PIM for Main and Find Navigation

In the simplest case, a system with five different windows would be described by a PIM with five states (each state representing the presentation model for one of the windows). However, this assumes that each of the windows is modal and does not interact with any of the other windows. In the Digital Parrot system none of the dialogues are modal; in addition, each of the windows can be minimised but continues to interact with other parts of the system while in its minimised state. This led to a complex PIM consisting of over 100 states. The complexity of the model and of the modelling process (which at times proved both challenging and confusing) gave us some indication of how users of the system might be similarly confused when interacting with the system in its many various states. Even before we derived the abstract tests, therefore, we began to consider areas of the system we would wish to include in our evaluation study (namely how the different windows interact).

The third stage of the modelling was to produce a formal specification of the functionality of the Digital Parrot. This was done using the Z specification language (ISO 2002) and again we completed as much of the specification as was possible but left some areas incomplete where we were not confident we completely understood all of the system's behaviour. The Z specification consists of a description of the state of the system (which describes the data for the memory items stored in the system as sets of observations on that data) and operations that change that state. For example the "SelectItems" operation is described in the specification as:

SelectItems
  ΔDPSystem
  i? : Item
  li? : Item

  AllItems′ = AllItems
  SelectedItems′ = SelectedItems ∪ {li?} ∪ {i?}
  li?.itemName = (i?.itemLink)
  VisibleItems′ = VisibleItems
  HiddenItems′ = HiddenItems
  TrailItems′ = TrailItems

The meaning of this is that the operation consists of observations on the DPSystem state before and after the operation takes place (denoted by ΔDPSystem) and there are two inputs to the operation, li? and i?, which are both of type Item. After the operation has occurred some observations are unchanged. Observations marked with ′ indicate they are after the operation, so, for example, AllItems′ = AllItems indicates this observation has not changed as a result of the operation. The SelectedItems observation does change however, and after the operation this set is increased to include the inputs li? and i?, which represent the new items selected.
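Read operationally, the schema says that only the SelectedItems observation changes. A rough Python rendering, with our own names for the state record and the equality constraint expressed as an assertion, is:

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Item:
    item_name: str
    item_link: str = ""

@dataclass(frozen=True)
class DPSystem:
    all_items: frozenset = frozenset()
    selected_items: frozenset = frozenset()
    visible_items: frozenset = frozenset()
    hidden_items: frozenset = frozenset()
    trail_items: frozenset = frozenset()

def select_items(state, i, li):
    # li?.itemName = (i?.itemLink): the linked item is the one named by i's link.
    assert li.item_name == i.item_link
    # Every other observation keeps its unprimed value; only SelectedItems grows.
    return replace(state, selected_items=state.selected_items | {li, i})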

Once we had completed as much of the modelling as was possible we met with the software's developer to firstly ensure that the behaviour we had described was correct, and secondly to fill in the gaps in the areas where the models were incomplete. With a complete set of models and a complete specification we were then able to relate the UI behaviour to the specified functionality by creating a relation between the S-behaviours of the presentation models (which relate to functionality of the system) and the operations of the specification. This gives a formal description of the S-behaviours by showing which operations of the specification they relate to, and the specification then gives the meaning. Similarly, the meaning of the I-behaviours is given by the PIM. The relation we derived, which we call the presentation model relation (PMR), is shown below:

S_HighlightItems ↦ SelectItems
S_PointerMode ↦ TogglePointerMode
S_CurrentTrail ↦ SelectCurrentTrailItems
S_SelectItemMenu ↦ MenuChoice
S_ZoomInTL ↦ TimelineZoomInSubset
S_ZoomOutTL ↦ TimelineZoomOutSuperset
S_SelectItemsByTime ↦ SelectItemsByTime
S_FitSelectionByTime ↦ FitItemsByTime
S_HighlightItem ↦ SelectByName
S_AddToTrail ↦ AddToTrail
S_ZoomInMap ↦ RestrictByLocation
S_ZoomOutMap ↦ RestrictByLocation
S_Histogram ↦ UpdateHistogram

This completed the modelling stage and we were now ready to move on to derivation of the abstract tests, which we describe next.

4.4 The Abstract Tests

Abstract tests are based on the conditions that are required to hold in order to bring about the behaviour given in the models. The tool which we use for creating the presentation models and PIMs, called PIMed (PIMed 2009), has the ability to automatically generate a set of abstract tests from the models, but for this work we derived them manually using the process described in (Bowen & Reeves 2009). Tests are given in first-order logic. The informal, intended meaning of the predicates can initially be deduced from their names, and is subsequently formalised when the tests are instantiated. For example, two of the tests that were derived from the presentation model and PIM of the "Find" dialogue and "MainandFind" are:

State(MainFind) ⇒ Visible(FXIcon) ∧ Active(FXIcon) ∧ hasBehaviour(FXIcon, I_Main)

State(MainFind) ⇒ Visible(HighlightButton) ∧ Active(HighlightButton) ∧ hasBehaviour(HighlightButton, S_HighlightItem)

The first defines the condition that when the system is in the MainFind state a widget called FXIcon should be visible and available for interaction (active) and when interacted with should generate the interaction behaviour called I_Main, whilst the second requires that in the same state a widget called HighlightButton is similarly visible and available for interaction and generates the system behaviour S_HighlightItem. When we come to instantiate the test we use the PIM to determine the meaning of the I-behaviours and the Z specification (via the PMR) to determine the meaning of the S-behaviours. The set of tests also includes conditions describing what it means for the UI to be in any named state so that this can similarly be tested. The full details of this are given in (Bowen & Reeves 2009) and are beyond the scope of this paper. Similar tests are described for each of the widgets of the models, i.e. for every widget in the model there is at least one corresponding test. The abstract tests consider both the required functionality within the UI (S-behaviour tests) as well as the navigational possibilities (I-behaviour tests). As there are two different versions of the Digital Parrot software, one that has a graph view for the data and the other a list view, there were two slightly different sets of models. However, there was very little difference between the two models (as both versions of the software have almost identical functionality); there were in fact three behaviours found only in the list version of the software (all of which relate to the UI rather than underlying functionality) and one UI behaviour found only in the graph version. This gave rise to four abstract tests that were unique to the respective versions.
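The per-widget pattern behind these tests ("in this state the widget is visible, active, and exhibits its declared behaviours") is mechanical, which is what allows PIMed to generate them automatically. A hand-rolled sketch of that generation step over the tuple form of a presentation model might look as follows; the textual rendering of the predicates is ours:

def abstract_tests(state_name, widgets):
    # `widgets` is an iterable of (name, category, behaviours) tuples,
    # mirroring the presentation model notation used above.
    tests = []
    for name, _category, behaviours in widgets:
        for behaviour in behaviours:
            tests.append(
                f"State({state_name}) => Visible({name}) & Active({name})"
                f" & hasBehaviour({name}, {behaviour})"
            )
    return tests

find_window = [
    ("HighlightButton", "ActionControl", ("S_HighlightItem",)),
    ("FXIcon", "ActionControl", ("I_Main",)),
]
for test in abstract_tests("MainFind", find_window):
    print(test)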

4.5 Deriving The Study

With the modelling complete and a full set of abstract tests, we now began to consider how we could use these to derive an evaluation study. To structure the study we first determined that all S-behaviours should be tested by way of a specific task (to ensure that the functional behaviour could be accessed successfully by users). The relation we showed at the end of Section 4.3 lists thirteen separate S-behaviours. For each of these we created an outline of a user task, for example from the test:

State(MainFind) ⇒ Visible(HighlightButton) ∧ Active(HighlightButton) ∧ hasBehaviour(HighlightButton, S_HighlightItem)

we decided that there would be a user task in the study that would require interaction with the "Find" dialogue to utilise the S_HighlightItem behaviour. To determine what this behaviour is we refer to the relation between specification and S-behaviours, so in this case we are interested in the Z operation SelectByName. We generalised this to the task:

Use the "Find" dialogue to highlight a specified item.

Once we had defined all of our tasks we made these more specific by specifying actual data items. In order to try and replicate the conditions of the first study we used (where possible) the same examples. The "Find" task, for example, became the following in our study:

Use the "Find" functionality to discover which items are connected to the item called "Scuba Diving"

In order to complete this task the user will interact with the "Find" dialogue and use the S_HighlightItem behaviour that relates to the functionality in the system, SelectByName, which subsequently highlights the item given in the "Find" text field as well as all items connected to it.
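The step from S-behaviour to task outline is, in effect, a lookup through the PMR followed by a templated instruction. A small sketch of how the outlines could be generated is shown below; the wording template and the behaviour-to-dialogue table are our own additions and are only partially filled in.

pmr = {
    "S_HighlightItem": "SelectByName",
    "S_AddToTrail": "AddToTrail",
    "S_ZoomInTL": "TimelineZoomInSubset",
    # ... the remaining entries from Section 4.3
}

# Which dialogue exposes each S-behaviour, read off the presentation
# models (partial, illustrative).
exposed_in = {
    "S_HighlightItem": "Find",
    "S_AddToTrail": "Trail",
    "S_ZoomInTL": "Timeline",
}

def task_outlines():
    # One outline per S-behaviour guarantees every piece of specified
    # functionality is exercised by at least one task.
    for behaviour, operation in pmr.items():
        dialogue = exposed_in.get(behaviour, "?")
        yield (f'Use the "{dialogue}" dialogue to exercise {behaviour} '
               f"(specified by the Z operation {operation})")

for outline in task_outlines():
    print(outline)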

Once tasks had been defined for all of the S-behaviours we turned our attention to the I-behaviours. It would not be possible to check all navigational possibilities due to the size of the state space for the models of this system (and this may be true for many systems under test) as this would make the study too long. What we aimed to do instead was maximise the coverage of the PIM by following as many of the I-behaviour transitions as possible, and so we ordered the S-behaviour tasks in such a way as to maximise this navigation. Having the PIM as a visualisation and formalisation of the navigational possibilities enables us to clearly identify exactly how much we are testing. We also wanted to ensure that the areas of concern we had identified as we built the models were tested by the users. As such, we added tasks that relied on the interaction of functions from multiple windows. For example we had a task in our study:

What are the items linked to the trail "NZCS Research Student Conferences" which took place in the North Island of New Zealand?

that required interaction with the Trail window and the Map window at the same time.
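Ordering the S-behaviour tasks to cross as many I-behaviour transitions as possible is essentially a walk over the PIM. One simple way to approximate such an ordering is a greedy choice, sketched below with a deliberately tiny PIM; the encoding and the scoring heuristic are ours, not part of the PIMed process.

def order_tasks(tasks, pim, start):
    # `tasks` maps a task name to the I-behaviour labels it follows in order;
    # `pim` maps (state, label) to the next state. Greedily pick the task
    # whose path from the current state covers the most new transitions.
    covered, state, remaining, order = set(), start, dict(tasks), []
    while remaining:
        def gain(item):
            _name, labels = item
            s, new = state, 0
            for lab in labels:
                nxt = pim.get((s, lab))
                if nxt is None:
                    break
                if (s, lab) not in covered:
                    new += 1
                s = nxt
            return new
        name, labels = max(remaining.items(), key=gain)
        for lab in labels:
            nxt = pim.get((state, lab))
            if nxt is None:
                break
            covered.add((state, lab))
            state = nxt
        order.append(name)
        del remaining[name]
    return order, covered

pim = {("Main", "I_Find"): "MainFind", ("MainFind", "I_Main"): "Main"}
tasks = {"highlight scuba diving": ["I_Find", "I_Main"], "reopen find": ["I_Find"]}
print(order_tasks(tasks, pim, "Main")[0])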

Our final study consisted of eighteen tasks for the graph version of the software and seventeen for the list version (the additional task relating to one of the functions specific to the graph version). After each task we asked the user to rate the ease with which they were able to complete the task and provided the opportunity for them to give any comments they had about the task if they wished. Finally we created a post-task questionnaire for the participants that was aimed at recording their subjective feelings about the tasks, software and study.

4.6 Participants and Procedure

As with the first study, ethical consent to run the second study was sought, and given. We recruited our participants from the same target group, and in the same way, as for the initial study, with the added criterion that no one who had participated in the initial study would be eligible to participate in the second.

We used a "between groups" methodology with the ten participants being randomly assigned either the Graph view version of the software or the List view. The study was run as an observational study with notes being taken of how participants completed, or attempted to complete, each task. We also recorded how easily we perceived they completed the tasks and subsequently compared this with the participants' perceptions. Upon completion of the tasks the participants were asked to complete the questionnaire and provide any other feedback they had regarding any aspect of the study. Each study took, on average, an hour to complete.

4.7 Results

The second study found five functionality bugs and twenty-seven usability issues. The bugs found were:

1. Continued use of "Zoom in" on the map beyond maximum ability causes a graphics error.

2. The shortcut keys 'i' and 'o' for "Zoom in" and "Zoom out" on the Map don't work.

3. The Timeline loses the capability to display 2009 data once it has been interacted with.

4. Data items visualised in the timeline move around between months.


5. Labels on the Map sometimes move around and sit on top of each other during zoom and move operations.

We had identified two of the functionality bugs during the modelling stage (the loss of 2009 data and the non-functional shortcut keys on the map) and in a traditional development/testing scenario we would expect that we would report (and fix) these prior to the evaluation. The usability issues ranged from minor items to major issues. An example of a minor issue was identifying that the widget used to toggle the mouse mode in the graph view is very small and in an unusual location (away from all other widgets) and is easily overlooked. We consider this minor because once a user knows the widget is there it is no longer a problem. An example of a major issue was the lack of feedback from the "Find" function that meant it was impossible to tell whether or not it had returned any results in many cases.

We made 27 recommendations for changes to the software. Some of these were specific changes that could be made to directly fix usability issues found, such as the recommendation:

“Inform users if “Find” returns no results.”

Whereas others were more general and would require further consideration, for example the recommendation:

"Reconsider the use of separate dialogues for features to reduce navigational and cognitive workload."

Other observations led us to comment on particular areas of the software and interface without directly making recommendations on how these might be addressed. For example, we observed that most participants had difficulty understanding conceptually how "Trails" worked and what their meaning was. This led to difficulties with tasks involving "Trails" but also meant that participants used "Trails" when not required to do so, as they found it had a particular effect on the data that they did not fully understand, but that enabled them to visualise information more easily. This is related to the fact that the amount of data and the way it is displayed and the use of highlighting were all problematic for some of the tasks.

We also determined from our comparison of the measure of ease with which tasks were completed against the participants' perception that for many participants, even when they successfully completed a task, they were not confident that they had done so. For example we would record that a participant completed a task fairly easily as they took the minimum number of steps required and were successful, but the participant would record that they completed the task with difficulty. This was also evident from the behaviour of participants as they would often double check their result to be certain it was correct. This is linked to the overall lack of feedback and lack of visibility that was reported as part of our findings.

From the results of the second study we were confident that the study produced from the formal models was able to find both specific problems (such as functionality bugs) and usability problems, as well as identify general overall problems such as lack of user confidence.

5 Comparing The Two Studies

There was an obvious difference in the tasks of the two studies. In the first study users were given four tasks requiring several steps and were able to try and achieve these in any way they saw fit, whereas in the second study they were given seventeen or eighteen tasks (depending on the version) which were defined very specifically. This meant that the way in which participants interacted with the software whilst carrying out these tasks was very different. In the first study users were encouraged to interact with the software in the way that seemed most natural to them in order to complete the tasks, whereas in the second study they were given much clearer constraints on how they should carry out a particular task (for example: "Use the "Find" function to....") to ensure they interacted with specific parts of the system. While this meant that coverage of functionality and navigation was more tightly controlled in the second study (which is beneficial in ensuring as many problems and issues as possible are found), it also meant that it did not provide a clear picture of how users would actually interact with the software outside of the study environment and as such led to the reporting of problems that were in fact non-issues (such as the difficulties users had in interpreting the amount of data for a time period from the histogram).

One of the problems with the tasks of the initial study, however, was that by allowing users to interact in any way they chose, particular parts of the system were hardly interacted with at all, which meant that several issues relating to the "Timeline" that were discovered during the second study were not evident in the first study due to the lack of interaction with this functionality.

The other effect of the way the tasks were structured was on the subjective satisfaction measurements of the participants. The participants of the first study were more positive about their experience using the software than those of the second study. We feel that this is partly due to their interactions and the fact that the first group had a better understanding of the software and how it might be used in a real setting than the second group did. However, there is also the possibility that the participants of the first study moderated their opinions because they knew that the researcher conducting the study was also the developer of the software (which is one of the concerns we were hoping to address with our work).

6 Reflections on Process and Outcomes

One of the things we have achieved by this experiment is an understanding of how formal models might be used to develop a framework for developing user evaluations. This work shows that a study produced in such a way is as good (and in some cases better) at discovering both usability and functional problems with software. It is also clear, however, that the type of study produced does not allow for analysis of utility and learnability from the perspective of a user encouraged to interact as they choose with software.

Some of the advantages of this approach are: the ability to clearly identify the scope of the study with respect to the navigational possibilities of the software-under-test (via the PIM); a framework to identify relevant user tasks (via the abstract tests); and a mechanism to support creation of oracles for inputs/outputs to tasks (via the specification). This supports our initial goal of supporting development of evaluation studies by someone other than the software developer as it provides structured information to support this. However, it also leads to an artificial approach to interacting with the software and does not take into account the ability of participants to learn through exploration, and as such may discover usability issues which are unlikely to occur in real-world use of the software as well as decrease subjective satisfaction of participants with the software.

7 Conclusion

It seems clear that there is no 'one size fits all' approach to developing evaluation studies, as the underlying goals and intentions must play a part in how the tasks are structured. However, it does appear that the use of formal models in the ways shown here can provide a structure for determining what those tasks should be and suggests ways of organising them to maximise interaction. Perhaps using both methods (traditional and formally based) is the best way forward. Certainly there are benefits to be found from taking the formal approach, and for developers with no expertise in developing evaluation studies this process may prove supportive and help them by providing a framework to work within. Similarly, for formal practitioners who might otherwise consider usability testing and evaluation as too informal to be useful, the formal structure might persuade them to reconsider and include this important step within their work. The benefit of a more traditional approach is the ability to tailor the study for discovery as well as evaluation, something the formally devised study in its current form was not at all good at. Blending the two would be a valuable way forward so that we can use the formal models as a framework to devise structured and repeatable evaluations, and then extend or develop the study with a more human-centred approach that allows for the other benefits of evaluation that would otherwise be lost.

8 Future Work

We would like to take the joint approach described to develop a larger study. This would enable us to see how effective combining the methods might be, how well the approach scales up to larger software systems and studies, and where the difficulties lie in working in this manner. We have also been looking at reverse-engineering techniques and tools which could assist when working with existing or legacy systems, and this work is ongoing.

9 Acknowledgments

Thanks to all participants of both studies.

References

Bangor, A., Kortum, P. & Miller, J. (2009), 'Determining what individual SUS scores mean: Adding an adjective rating scale', Journal of Usability Studies 4(3), 114–123.

Beck, K. (2003), Test-Driven Development: By Example, The Addison-Wesley Signature Series, Addison-Wesley.

Bowen, J. & Reeves, S. (2008), 'Formal models for user interface design artefacts', Innovations in Systems and Software Engineering 4(2), 125–141.

Bowen, J. & Reeves, S. (2009), UI-design driven model-based testing, in M. Harrison & M. Massink, eds, 'Proceedings of the 3rd International Workshop on Formal Methods for Interactive Systems (FMIS'09)', Electronic Communications of the EASST, 22.

Bowen, J. P. & Hinchey, M. G., eds (1995), Improving Software Tests Using Z Specifications, Vol. 967 of Lecture Notes in Computer Science, Springer.

Brooke, J. (1996), SUS: A "quick and dirty" usability scale, in P. W. Jordan, B. Thomas, I. L. McClelland & B. A. Weerdmeester, eds, 'Usability evaluation in industry', CRC Press, chapter 21, pp. 189–194.

Doubleday, A., Ryan, M., Springett, M. & Sutcliffe, A. (1997), A comparison of usability techniques for evaluating design, in 'DIS '97: Proceedings of the 2nd conference on Designing interactive systems', ACM, New York, NY, USA, pp. 101–110.

ISO (2002), ISO/IEC 13568 — Information Technology — Z Formal Specification Notation — Syntax, Type System and Semantics, Prentice-Hall International series in computer science, first edn, ISO/IEC.

Lewis, J. R. (2006), 'Sample sizes for usability tests: mostly math, not magic', interactions 13(6), 29–33.

Nielsen, J. (1994), Usability Engineering, Morgan Kaufmann Publishers, San Francisco, California.

Nielsen, J. & Landauer, T. K. (1993), A mathematical model of the finding of usability problems, in 'CHI '93: Proceedings of the INTERACT '93 and CHI '93 conference on Human factors in computing systems', ACM, New York, NY, USA, pp. 206–213.

PIMed (2009), PIMed: An editor for presentation models and presentation interaction models, http://sourceforge.net/projects/pims1/?source=directory.

Reeve, G. (2005), A Refinement Theory for µCharts, PhD thesis, The University of Waikato.

Schweer, A. & Hinze, A. (2007), The Digital Parrot: Combining context-awareness and semantics to augment memory, in 'Proceedings of the Workshop on Supporting Human Memory with Interactive Systems (MeMos 2007) at the 2007 British HCI International Conference'.

Schweer, A., Hinze, A. & Jones, S. (2009), Trails of experiences: Navigating personal memories, in 'CHINZ '09: Proceedings of the 10th International Conference NZ Chapter of the ACM's Special Interest Group on Human-Computer Interaction', ACM, New York, NY, USA, pp. 105–106.

Utting, M. & Legeard, B. (2006), Practical Model-Based Testing: A Tools Approach, Morgan Kaufmann.
