Creedo—Scalable and Repeatable Extrinsic Evaluation for Pattern Discovery Systems by Online User Studies

Mario Boley, Maike Krause-Traudes, Bo Kang and Björn Jacobs
University of Bonn and Fraunhofer IAIS
Schloss Birlinghoven, Sankt Augustin, Germany
[email protected]

ABSTRACT

We present Creedo—a conceptual framework along with an accompanying software implementation for empirically evaluating knowledge discovery systems. Creedo provides a language for the creation of study designs that test how well different test systems support real users in performing certain data analysis tasks. These designs are scalable and repeatable, i.e., after their creation, a study can be carried out any number of times and with an arbitrarily high number of participants without consuming valuable resources such as the time of supervision personnel. Building on the conceptual framework, the accompanying web application, which is freely available at Bitbucket, supports data mining researchers in all central tasks for conducting a user study: in embedding their ideas into an operational data analysis environment, in assigning and monitoring tasks for study participants, and in evaluating the results. The implementation as a web application enables large-scale and geographically distributed studies in which, nevertheless, all study participants essentially have an identical user experience.

1. INTRODUCTION

In this paper we present Creedo—a software-supported framework for conducting empirical user studies with the purpose of evaluating scientific contributions in pattern discovery [Hand, 2002] and related areas of knowledge discovery from data. Although the user has played a central role since the earliest scientific definitions of the knowledge discovery process [Fayyad et al., 1996], contributions to the field are traditionally supported by formal or empirical evaluations that replace the user by a set of abstract assumptions about her—most often by the single simple assumption that users like to see analysis results as fast as possible. Hence, the vast majority of research articles focus on computation speed as the evaluation metric, while the usefulness of the computation output remains a mere postulate.

There is, however, an increasing trend to investigate claims that directly involve the data mining user as a prime subject.


Figure 1: Reasons reported by ECMLPKDD authors for not conducting a user study in cases where such a study could have been beneficial (in percent of respondents). The reported reasons were: no added benefit of a user study over an automatized/formal evaluation; unclear how to recruit a suitable group of participants; high costs of conducting the study (from planning to result analysis); and insecurity of outcome and acceptance by peers.

Examples include the work of De Bie and Spyropoulou [2013], which, beyond just asking “how one can formalize a nugget of information”, also deals with the question of “how one can formalize how interesting such a nugget of information is to a particular user”, the work of Dzyuba et al. [2014], which aims at making pattern discovery algorithms usable by non-expert users, as well as other work on the same subject, including our own [Xin et al., 2006, Boley et al., 2013]. It is often hard to support direct claims about user satisfaction solely on the basis of a certain set of axioms about the user: there might be no widely accepted set of such axioms relevant to the specific claim, and, moreover, the very same assumptions used to evaluate a contribution were usually also used to develop it in the first place. Where such intrinsic evaluations reach their limits, empirical studies that involve real users can be a powerful alternative, because they constitute an evaluation that is extrinsic to the simplifying development assumptions. Nevertheless, such studies are conducted only rarely (see Ke et al. [2009], Shahaf and Guestrin [2010], Li et al. [2012] for some notable exceptions).

One might argue that this reluctance to engage in user studies stems from the fact that the vast majority of data mining authors do not share the feeling that such studies could help support the value of their contributions. However, in a recently conducted online poll among ECMLPKDD authors [Boley et al., 2015], 50% of 135 respondents reported that they had decided not to conduct a user study within the last two years although their work could have benefited from it.¹ Moreover, those respondents almost unanimously pointed to the high costs involved as at least one of the reasons for their decision (see Fig. 1).

¹This result held almost uniformly across different subfields: Machine Learning authors gave only slightly fewer “yes” answers (49%) than the (non-disjoint) group of authors who work in pattern mining and visual analytics (56%).

Indeed, even after the already challenging task of gathering a group of relevant study participants, several high-cost issues remain to be addressed. The work has to be embedded into an operational data analysis environment that (a) has a user interface that the participants can operate and (b) allows all measurements necessary for performance assessment to be performed. Then, the trials that form the basis for performance measurements have to be carried out, which includes splitting participants into several groups, e.g., corresponding to test and control, assigning the right participants to operate the right data analysis environment under the right instructions, and, most importantly, recording all variables relevant to the hypothesis that arise from these trials. Finally, study hypotheses can refer to humans not only in an acting capacity, i.e., as subjects operating a data analysis tool, but also in an evaluating capacity, i.e., as providing certain quality assessments, e.g., whether patterns are interesting or useful within the context of a specific task. These evaluations also have to be carried out.

Presently, if a researcher wants to tackle all of these steps, she finds herself only insufficiently supported by current software tools. The accessibility aspect of embedding could be achieved by creating a plug-in for a general-purpose data mining suite like RapidMiner [Mierswa, 2009] or WEKA [Hall et al., 2009]. However, the resulting software system would not be connected to a study infrastructure that can automatically extract relevant measurements. Some aspects of testing could be addressed by using a general-purpose online survey framework like SurveyMonkey [sur] or KwikSurveys [kwi]. However, these frameworks are oblivious to the semantics of data analysis and would only allow performance measurements based on indirect questionnaires rather than on directly tracking user actions. Ultimately, all current solutions lack a conceptual framework for defining user studies that ties together data analysis systems, analysis tasks, and performance measurements.

To the best of our knowledge, such a framework is provided for the first time by Creedo. Creedo provides a language for the specification of study designs that test how well certain data analysis systems support real users in performing certain data analysis tasks based on certain evaluation criteria. Studies based on these designs can be repeated any number of times as long as a suitable pool of study participants is available. Corresponding to the goal of extrinsic evaluation, Creedo studies can not only involve human test participants, but the evaluation criteria can also involve human evaluators. In order to enable scalable studies, Creedo allows users to act in the double role of participant and evaluator while providing assignment schemes that control potential biases which could otherwise arise from such double roles. Building on this conceptual framework, the Creedo web application supports data mining researchers in performing embedding, testing, and evaluation as defined above in a convenient and integrated way. It provides simple reusable UI elements that can easily be extended and combined to form interactive data analytics dashboards. Moreover, it allows researchers to rapidly design, deploy, and conduct Creedo studies involving those dashboards as test systems. The application automatically performs all task assignments to participants and evaluators, who can then carry out their tasks using any standard web browser.

2. FRAMEWORK OUTLINE

Creedo allows the specification of an interaction process during which a group of users jointly provides evidence in favor of or against a test hypothesis. In particular, the specific kind of hypothesis we are interested in can be paraphrased as follows:

“Users can solve a certain class of analysis tasks better with a specific target system than with other control systems.” (1)

In order to break down the rather complex interaction process evolving around the testing of such a hypothesis, let us begin by distinguishing between the different roles in which users act within this process and the different time periods in which certain actions are performed. Chronologically, the hypothesis validation starts with the study design time, during which the study designer translates the test hypothesis into an executable study design. This is followed by the study execution time, during which all the actual measurement data is accumulated. This happens, on the one hand, through study participants who engage in trial sessions with data analysis systems, in which they produce some tangible output that we refer to as session results. It also happens through evaluators who review those results and produce evaluations of them. Finally, there is the study conclusion time, during which the data is processed in order to provide evidence in favor of or against the hypothesis to the study designer. Note that it is a central feature of Creedo studies that users can act in more than one role within a given study. In fact, it can be crucial for a study to be scalable that all participants also act as evaluators.

As we can see, the central element of a Creedo study is the study design, which corresponds to a specific test hypothesis. More specifically, it can be considered an operationalization of the natural-language version of the hypothesis. That is, a study design gives a measurement procedure that can be practically carried out in order to assess all quantities necessary to evaluate the hypothesis. Once created, a study design can be used for an arbitrary number of actual studies, each of which corresponds to a measurement according to the design or, from an alternative point of view, to the realization of a random variable defined by the design (where the randomness would typically involve the selection of users). Thus, by “repeating a study” we actually refer to the process of running a new study with a study design identical to that of a previous study but with a new set of users. A typical motivation for this would be to increase the level of confidence in a result.

A Creedo study design translates a hypothesis following the blueprint given in (1) by specifying the following set of components (see also Fig. 2):

1. a set of system specifications, corresponding to the “target system” and the “control systems” referred to in the hypothesis,

2. a set of task specifications that represent the “certain class of analysis tasks”, what it means to “solve” them, as well as what the basic qualities of task results are,

3. an assignment logic that specifies how assignments are issued in a study to its users in order to generate all required measurements, and finally

4. system performance metrics that formally define the meaning of “better” in terms of the basic result qualities given by the task specifications.

Figure 2: Outline of elements and process of a Creedo study.

While the conceptual framework of Creedo is more abstract than its implementation in the Creedo web application, some of the concrete options that are available for the design components above are determined by the implementation. Hence, before we investigate the detailed composition of study designs (Sec. 4), we will first review the Creedo web application (Sec. 3). After that, in order to illustrate all previously introduced concepts, we will present a study design that was developed to evaluate the FORSIED framework (Sec. 5).

3. CREEDO WEB APPLICATION

The Creedo web application [cre] is a Java server application for defining and executing study designs. It can be installed on a central web server, from which study participants and evaluators receive their assignments in the form of interactive web pages. These pages are very light-weight: all actual computation is carried out on the server, and all content-relevant state, such as the input data, is also kept there. The motivation for this architecture is two-fold. Firstly, it enables large-scale and geographically distributed studies in which, nevertheless, all study participants essentially have an identical user experience: every computer with access to the server can potentially be used as an input terminal, as long as all used computers have the same input/output devices and their hardware is otherwise sufficient to fluently display the pages in one of the standard web browsers. Secondly, all study-related definitions and statistics are persisted in one centralized location, from which study execution can be controlled consistently.

The data analysis systems that are examined in a study executed by the web application correspond to Creedo analytics dashboards. These are visual user front-ends for an interactive and iterative knowledge discovery process that builds on our open-source Java library realKD [rea]. This library is designed to allow easy and straightforward implementations of pattern discovery algorithms by providing a wide range of pattern discovery primitives and an expressive data model. Thus, extending realKD provides a convenient way to embed a novel algorithmic idea into Creedo analytics dashboards. The server-side state of a dashboard is given by a realKD data workspace, which contains all the data artifacts available for analysis, and by the state of optional UI components that can be added to the dashboard. These components allow the user to perform certain data analysis activities such as data inspection, data mining, and post-processing. The state of all components can be accessed to define concrete study concepts (e.g., result definitions, performance metrics).

The components that are implemented at the time of writing support an interactive pattern discovery process, in which the user can run mining algorithms, inspect their results in the context of the given input data, potentially store some of the produced patterns in a result collection, and then re-iterate. In the following we introduce some of these components together with the part of their state that is relevant to study definitions (see Fig. 3 for a screenshot of an analytics dashboard containing all components presented here).

Figure 3: Creedo analytics dashboard with data view container (a), pattern mining launcher (b), and KD result container (c).

For the inspection of data artifacts, the data view container can aggregate a number of data view sub-components, each of which is able to represent a different kind of data or to provide an alternative view on the same data. These sub-components currently include: a data table view for rendering tabular data composed of numeric, ordinal, and categorical attributes, a point-cloud view, which renders a 2-dimensional embedding of the data, and a propositional logic view, which renders a collection of propositional statements defined on the rows of a data table (as required by pattern discovery algorithms that work on binary data).

The component that performs the actual data mining is referred to as the pattern mining launcher. It enables the user to run pattern discovery algorithms on all data artifacts present in the system and to display their results. The set of available algorithms can be chosen by the system designer from all realKD mining algorithms. In order to launch one of the algorithms, the user first has to click a mine button, then choose the specific algorithm she wants to use, and finally select valid settings for all parameters of that algorithm (the last two steps are only required when more than one algorithm is available and the selected algorithm exposes user parameters, both of which can be avoided by the system designer in order to provide a non-expert user experience). After the termination of the algorithm, the component displays all patterns found in a scrollable list below the mine button. This list is referred to as the candidate area because patterns can be taken over from here into the user's result set should she consider them suitable. For the purpose of defining study designs, the accessible state of the pattern mining launcher includes, in addition to the content of the candidate area, the algorithmic settings that the user has applied to find the patterns in it.

Finally, the knowledge discovery result container allows users to incrementally assemble result sets for their data analysis task. Results are received by means of drag-and-drop from the candidate area of the pattern mining launcher. They can also be deleted, re-arranged dynamically, e.g., in order to express a (linear) result prioritization, and annotated in order to express a result interpretation. The accessible state of this component is the ordered set of results along with all annotations as well as the time and number of mining rounds that passed until each pattern was found.
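To make the notion of accessible component state more concrete, the following minimal Java sketch models the state of the result container as described above. All type and field names are hypothetical illustrations, not the actual classes of the Creedo code base.

```java
// Illustrative only: hypothetical types (not Creedo's real classes) modeling the
// accessible state of the knowledge discovery result container described above.
import java.time.Duration;
import java.util.List;

record StoredResult(String pattern,
                    List<String> annotations,
                    Duration sessionTimeUntilStored,
                    int miningRoundsUntilFound) { }

record ResultContainerState(List<StoredResult> orderedResults) {
    public static void main(String[] args) {
        // A participant stored one annotated pattern after 8 minutes and 2 mining rounds.
        StoredResult r = new StoredResult("frequency(A,B) higher than expected",
                List.of("holds mainly in the north"), Duration.ofMinutes(8), 2);
        System.out.println(new ResultContainerState(List.of(r)));
    }
}
```

Result definitions and the elementary result metrics introduced in Sec. 4 can be read as functions over exactly this kind of state.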

In summary, the components of analytics dashboards are all designed towards simplicity and a strictly defined user purpose. This serves two goals: the resulting data analysis systems are relatively accessible also to inexperienced users, and the components allow for an unambiguous interpretation of user interaction—both of which are useful properties for test systems in user studies.

4. STUDY DESIGN COMPONENTS

Having gained an overview of the possible user experiences that can be provided by the Creedo web application, we are now ready to turn to the more abstract aspects of study definitions. As discussed earlier, the central notion of Creedo's study domain language is the study design. In this section we present in more detail the individual components of those designs: system and task specifications, performance metrics, and assignment schemes.

4.1 System and Task Specifications

A system specification within a study design is a specification of how to construct an analysis system given certain input parameters. When using the Creedo web application, this means how to construct an analytics dashboard from the components presented in Sec. 3. More generally, system specifications have the role of determining all aspects of the user experience that are relevant to the study hypothesis. Typically, this means that different specifications within the same study design are identical except for a specific test component, the effect of which is to be determined by an extrinsic evaluation. For example, the test system might contain a specific data visualization (e.g., a PCA-based point cloud) whereas the control system does not, or the test system replaces specific algorithm parameters (e.g., support/lift in association mining) by an automatic choice whereas the control system exposes those parameters to the user.

The input parameters of the system specifications are then used during study execution to inject task-specific content provided by the task specifications into the analysis systems. In the Creedo web application, the current default implementation of a system specification accepts a collection of data loaders that determine the data artifacts available in the dashboard and a collection of component builders that shape its visual appearance.
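The following sketch illustrates this reading of a system specification: parameters in, concrete dashboard out. The types are hypothetical and deliberately simplified; they are not the actual API of the Creedo web application.

```java
// Purely illustrative sketch (hypothetical types, not Creedo's real API): a system
// specification that, given task-provided data loaders, constructs a concrete dashboard.
import java.util.List;

interface DataLoader { Object loadDataArtifact(); }                  // e.g., reads a data table
interface ComponentBuilder { Object build(List<Object> artifacts); } // e.g., a data view, a launcher

record Dashboard(List<Object> dataArtifacts, List<Object> uiComponents) { }

record SystemSpecification(List<ComponentBuilder> componentBuilders) {
    /** Instantiates the specified system for one trial, injecting task-specific data. */
    Dashboard instantiate(List<DataLoader> dataLoaders) {
        List<Object> artifacts = dataLoaders.stream()
                .map(DataLoader::loadDataArtifact).toList();
        List<Object> components = componentBuilders.stream()
                .map(b -> b.build(artifacts)).toList();
        return new Dashboard(artifacts, components);
    }
}
```

Under this reading, the test and control specifications of a study would differ only in their component builders, while the same task-provided data loaders are injected into both.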

While system specifications define tools, task specifications describe the purpose and evaluation context in which these tools are supposed to be used within a study. Looking back at our hypothesis blueprint (1), we are interested in assessing how well human users can solve specific tasks with certain analysis systems. That is, we are interested in how well humans can operate a software tool with the purpose of producing results (a solution) for some task.

Naturally, this purpose has to be communicated to the user in a human-interpretable form. At the same time, the data analysis system expects machine-readable input. The evaluation of the hypothesis then requires actual solution entities for which certain qualities can be assessed. Hence, the state of the tool in which the participant considers the task as solved has to be interpreted as one or more of these result entities. Finally, the human evaluators, if they are involved, also need an understanding of the task in order to carry out their evaluations. Thus, another set of human-understandable instructions, from a different perspective, is required in studies that involve evaluators. In summary, task specifications in Creedo essentially have to translate an abstract concept of a task between several ontological realms and between several perspectives. For the Creedo web application they consist of

1. human-understandable instructions, i.e., a sequence of HTML documents that may refer to embedded media files,

2. input values for the parameters accepted by all of the system specifications in the study design (e.g., the input data for which the task has to be carried out and other auxiliary input that is task-specific),

3. a result definition, i.e., a rule that describes how to extract result entities from a state of the dashboard in which the participant has declared that she is done working (this also includes determining whether a given state of the dashboard can be declared as “done” in the first place), and

4. a set of evaluation schemes that are used by human evaluators to evaluate task results.

At the moment, all evaluation schemes in Creedo are rating schemes that consist of a name, a set of natural-language evaluation instructions, and a finite set V ⊆ Z referred to as the scale of the rating scheme. For a result r, a single evaluation according to such a rating scheme is then simply the value v ∈ V that is assigned to r by one evaluator.
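A minimal sketch of such a rating scheme, again with purely illustrative names rather than Creedo's actual classes, could look as follows.

```java
// Hypothetical sketch of a rating scheme as just described: a name, evaluation
// instructions, and a finite integer scale V ⊆ Z. Not Creedo's actual classes.
import java.util.Set;

record Evaluation(String resultId, String evaluatorId, String scheme, int value) { }

record RatingScheme(String name, String instructions, Set<Integer> scale) {
    /** A single evaluation simply assigns one value v of the scale to a result. */
    Evaluation rate(String resultId, String evaluatorId, int v) {
        if (!scale.contains(v))
            throw new IllegalArgumentException(v + " is not on the scale of '" + name + "'");
        return new Evaluation(resultId, evaluatorId, name, v);
    }

    public static void main(String[] args) {
        // The "joint interestingness" scheme of Sec. 5.3 uses the scale {0,...,4};
        // the result and evaluator identifiers here are invented for illustration.
        RatingScheme c = new RatingScheme("joint interestingness",
                "How interesting is this set of patterns as a whole?", Set.of(0, 1, 2, 3, 4));
        System.out.println(c.rate("result-7", "evaluator-3", 3));
    }
}
```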

4.2 System Performance Metrics

Creedo formalizes the concept of one analysis system being “better” than another by comparing them through a system performance metric. These metrics are supposed to capture intuitive concepts like the “average quality of results produced (with a specific system)” or the “average time needed by participants to produce a good result”. The formal definition of such metrics for systems can involve other metrics defined on the different types of artifacts that are produced during study execution. That is, system metrics can rely on session metrics, session metrics on result metrics, and result metrics on evaluation metrics. On each of these levels, Creedo allows the definition of metrics according to certain production rules, but also provides a set of useful predefined metrics. Before we introduce the general language for metric definitions, let us consider some of the predefined metrics in order to get a sense of their general flavor.

Let c be an evaluation scheme given in the task specification of the study, and let E_x denote the set of all c-ratings that have been performed for a result x ∈ R. A basic result metric is then the average c-value defined by c(x) = avg{c(e) : e ∈ E_x}. A slightly more complex variant of this metric arises from applying a z-score transformation to all rating values assigned to results by a specific evaluator. This gives rise to the average evaluator-normalized c-value defined by

c*(x) = avg{(c(e) − µ_{u(e)}) / σ_{u(e)} : e ∈ E_x} ,

where u(e) denotes the evaluator that has provided evaluation e, and µ_u and σ_u denote the sample mean and standard deviation of all c-ratings provided by a particular evaluator u. This metric can be useful when the study designer suspects that some evaluators might interpret the rating scale of c differently than others. As an example of a full system performance metric, consider the average maximal f-value, which is defined for any result metric f : R → ℝ by

a ↦ avg{max{f(x) : x ∈ R_s} : s ∈ S_a, |R_s| > 0} ,

where S_a denotes the set of all sessions that have been performed using system a, and R_s the set of all results that have been produced in session s. Another example is the median time until success, which is again defined with respect to some result metric f and a success threshold τ ∈ ℝ by

a ↦ med{min{t(x) : x ∈ R_s, f(x) > τ} : s ∈ S_a, |R_s| > 0} ,    (2)

where t(x) denotes the total session time until the result x was stored by the participant.
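As a worked illustration, here is a minimal, self-contained Java sketch of how Eq. (2) could be computed from session data. The Result and Session records and all example values are hypothetical; this is not Creedo's actual implementation.

```java
// Sketch of the "median time until success" metric from Eq. (2): for each session of a
// system, take the earliest storing time of a result whose rating exceeds tau, then take
// the median over sessions. Sessions without a successful result contribute +infinity.
import java.util.*;

record Result(double avgRating, double secondsUntilStored) { }
record Session(String systemId, List<Result> results) { }

final class MedianTimeUntilSuccess {

    static OptionalDouble compute(List<Session> sessions, String systemId, double tau) {
        double[] times = sessions.stream()
                .filter(s -> s.systemId().equals(systemId) && !s.results().isEmpty())
                .mapToDouble(s -> s.results().stream()
                        .filter(r -> r.avgRating() > tau)          // f(x) > tau
                        .mapToDouble(Result::secondsUntilStored)   // t(x)
                        .min()
                        .orElse(Double.POSITIVE_INFINITY))         // session never succeeded
                .sorted()
                .toArray();
        if (times.length == 0) return OptionalDouble.empty();
        double median = times.length % 2 == 1
                ? times[times.length / 2]
                : (times[times.length / 2 - 1] + times[times.length / 2]) / 2.0;
        return OptionalDouble.of(median);
    }

    public static void main(String[] args) {
        List<Session> sessions = List.of(
                new Session("a_test", List.of(new Result(2.5, 480), new Result(3.5, 560))),
                new Session("a_control", List.of(new Result(1.5, 700))));
        System.out.println(compute(sessions, "a_test", 2.0));     // OptionalDouble[480.0]
        System.out.println(compute(sessions, "a_control", 2.0));  // OptionalDouble[Infinity]
    }
}
```

Note that a session whose results all stay below the threshold yields an infinite time, which is why an “inf” entry can appear in a results table such as Table 1.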

Creedo's general language for metric definitions is based on a set of elementary metrics (which can vary depending on context and implementation) and a set of production rules from which more advanced metrics can be defined. Let us first review some elementary metrics, which are defined in most contexts. For rating evaluations, the elementary metrics contain:

• Value, i.e., the rating value chosen by the evaluator,

• EvaluationTime, i.e., the time taken by the evaluator to choose the value.

For results, some elementary metrics are:

• SessionTimeUntilStored, i.e., the total time that has passed since the start of the trial session at the moment a result is added to the result area.

• RoundsUntilFound, i.e., how often the participant had to start a mining algorithm until the result was shown in the candidate area.

• TimeBetweenFoundAndStored, i.e., how long it took the participant to find and store the result after it has been produced by a mining algorithm.

The set of elementary session metrics includes:

• TotalSessionTime, i.e., the total time between the moment the participant first sees the analysis system and the moment the system is closed,

• RoundsInSession, i.e., the total number of times the participant has pressed the mine button within the session,

• NumberOfResults, i.e., the number of results that have been submitted by the participant at the end of the session.

Now we can turn to the production rules for building more complex X-metrics, where X can be one of the sets of analysis systems, analysis sessions, results, or rating evaluations, i.e., X ∈ {A, S, R} ∪ {E_c : c ∈ C}. The rules also involve X-constraints, which are boolean functions q : X → {true, false}:

• If f and g are X-metrics, then x ↦ f(x) + g(x), x ↦ f(x) − g(x), x ↦ f(x)·g(x), x ↦ f(x)/g(x), and x ↦ f(x)^{g(x)} are also X-metrics.

• The trivial constraint x ↦ true is an X-constraint.

• If f is an X-metric, then for all thresholds τ ∈ ℝ the functions x ↦ δ(f(x) > τ), x ↦ δ(f(x) = τ), and x ↦ δ(f(x) < τ) are X-constraints, where δ evaluates to true if the condition given in parentheses holds and to false otherwise.

• If q and r are X-constraints, then x ↦ q(x) ∧ r(x) and x ↦ ¬q(x) are also X-constraints.

• If g is a Y-metric, q is a Y-constraint, and X ⊆ 2^Y, then

x ↦ aggr{g(y) : y ∈ x, q(y)}

is an X-metric for all aggregation functions aggr ∈ {max, min, avg, med, mode, count}.

Here, as a slight abuse of notation, we identify analysis systems with the set of analysis sessions that have been performed with them, i.e., A ⊆ 2^S, analysis sessions with the set of their results, S ⊆ 2^R, and results with the set of rating evaluations that they have received, R ⊆ 2^{E_c}, for all rating schemes c in the task specification.

4.3 Assignment Schemes

The system performance metrics of a study design define a large number of measurements that can be taken based on the actions of study users (participants and evaluators). What these metrics do not determine is how an actual pool of study users should be assigned to perform these actions. This is the role of assignment schemes. In particular, they have to solve the following problems:

• User workload has to be limited. Each user has a limited amount of time and attention. If the study is demanding more of these resources than the user is willing to provide, she will not respond, and the user is essentially lost for the study.

• Biases have to be controlled. For instance, a participant working on the same task with two different systems is likely to improve the second time independently of the system due to an increased understanding of the task itself. Another important source of bias arises if users act in the double role of participants and evaluators: the study designer has to take into account how the knowledge that a participant gains when working on a task will affect her judgment when she is assigned to evaluate results of that task afterwards.

• Dependencies between assignments have to be resolved. Trivially, evaluation tasks can only be assigned when the corresponding results have been produced in trials. On top of that, if a uniform distribution of evaluations is required for all results, then it makes sense to hold back evaluation assignments even further until the total number of results is known.

In summary, assignment schemes shape the behavior of a study at execution time by answering the following questions: which participant is supposed to work with which system on which task, which evaluator is supposed to evaluate which result, and when, i.e., in what sequence, are these actions to be performed?

Currently, two schemes are available: TwoPhaseAssignment, which first issues all trial assignments to all participants and then, when those are all completed, issues all evaluation assignments, and CombinedAssignment, which generates a combined trial/evaluation assignment for users that have both roles. Both schemes have the following parameters in common:

• Number of trial assignments per participant.

• Constraints for valid trial assignments, which can be a subset of:

– IdenticalSystems, i.e., all systems in all trial assignments for a given participant have to be identical,

– IdenticalTasks, i.e., all tasks in all trial assignments for a given participant have to be identical,

– NoIdenticalTasks, i.e., no two trial assignments of a given participant may involve the same task,

– NoIdenticalSystems, i.e., no two trial assignments of a given participant may involve the same system.

• Number of results per evaluator.

• Constraints for valid evaluation assignments, which can be a subset of:

– NoEvaluationOfOwnResults, i.e., no result given for evaluation has been produced by the evaluator herself in one of her own trial assignments,

– NoEvaluationOfOwnTasks, i.e., no result given for evaluation belongs to a task that the evaluator worked on herself,

– ResultsStratifiedBySystems, i.e., the results within an evaluation assignment have to come from all different systems in approximately equal amounts.

Figure 4: Association pattern stating that the joint frequency of three attribute/value combinations is “higher than expected”. Intuitively, this statement is less interesting for someone who already knows that “annual health spend” is correlated with “annual income”. FORSIED aims to quantify this effect.

Both schemes ensure that all system/task combinations will receive an approximately equal number of sessions and that all results will receive an approximately equal number of evaluations.
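The evaluation constraints above can be thought of as simple predicates over a candidate pairing of an evaluator with a result, as the following hypothetical sketch illustrates (the types, identifiers, and task names are invented for illustration and are not Creedo's actual scheme implementation).

```java
// Illustrative sketch of two evaluation-assignment constraints as predicates over a
// candidate (evaluator, result) pairing. Not Creedo's actual classes.
import java.util.function.BiPredicate;

/** Minimal view of a produced result, for assignment purposes. */
record ResultInfo(String producedBy, String taskId, String systemId) { }
/** Minimal view of an evaluator, with the task she worked on as a participant. */
record EvaluatorInfo(String userId, String workedOnTaskId) { }

final class EvaluationConstraints {
    static final BiPredicate<EvaluatorInfo, ResultInfo> NO_EVALUATION_OF_OWN_RESULTS =
            (e, r) -> !r.producedBy().equals(e.userId());
    static final BiPredicate<EvaluatorInfo, ResultInfo> NO_EVALUATION_OF_OWN_TASKS =
            (e, r) -> !r.taskId().equals(e.workedOnTaskId());

    public static void main(String[] args) {
        EvaluatorInfo alice = new EvaluatorInfo("alice", "lakeland");
        ResultInfo result = new ResultInfo("bob", "the-plain", "a_test");
        // Alice worked on a different task and did not produce this result herself.
        System.out.println(NO_EVALUATION_OF_OWN_RESULTS
                .and(NO_EVALUATION_OF_OWN_TASKS).test(alice, result)); // true
    }
}
```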

5. USE CASE: EVALUATING FORSIED

In order to illustrate the various concepts of the Creedo framework that were introduced in the previous sections, we now turn to an exemplary case study. Namely, we present the design of a study that we conducted to evaluate the framework for “Formalizing Subjective Interestingness in Exploratory Data Mining” (FORSIED) [De Bie, 2013, De Bie and Spyropoulou, 2013]—a recently proposed pattern mining technique with a particularly user-centric approach. Note that a detailed discussion of this study and its results is beyond the scope of this paper and will be published separately. Here we focus on the mapping of a real study hypothesis with theory-specific requirements to Creedo components.

5.1 Background and Hypothesis

FORSIED aims to quantify the interestingness of a pattern for a particular user depending on the prior knowledge that this user already has about the dataset in which the pattern was found. Prior knowledge can come, e.g., in the form of certain summary statistics of the data, like row and column averages, or from other patterns that have been previously discovered. Importantly, FORSIED is an attempt to capture a universal notion of interestingness that, in an ideal implementation (where all prior knowledge of a user can be assessed), coincides with the intuitive natural-language notion of interestingness. In order to make this very general concept more tangible, we focus here on an embodiment in which we use a FORSIED-based pattern ranker as a post-processing step in a round-based discovery of association patterns (see Webb [2010] and Fig. 4). This process works as follows: every round starts with the production of a random set of association patterns, which is subsequently ordered by a ranker before it is presented to the user. The user can then pick patterns from that ordered list according to her interest and start over into the next round until she is done.

The FORSIED-based ranker, in particular, orders patterns according to their subjective interestingness, considering as prior knowledge all the patterns that have already been discovered and stored by the user in previous rounds as well as the univariate distributions of all data attributes. Based on the design claim of FORSIED, in every round the FORSIED-based ranker should point the user directly to those new patterns that are still interesting to her, and consequently allow her to save time and attention while browsing the result set compared to when a traditional static measure is used for ranking. The longer the discovery process proceeds, the more pronounced this advantage should become. This gives rise to the following hypothesis:

“A FORSIED-based association discovery process allows users to discover a set of interesting patterns from a dataset faster than a conventional association discovery process (based on a static interestingness measure that is oblivious to prior and gained knowledge).”

Translating this hypothesis into a useful operational study design implies defining a sufficiently robust objective measurement of subjective interestingness. In the next section, we will see how this apparent conundrum can be attacked by using Creedo components to control the knowledge of participants and evaluators throughout study execution.

5.2 Design Requirements

In addition to developing an executable study design that captures our hypothesis as precisely as possible, we also aim for a design that meets the following general requirements:

• It should evaluate the claim about “user interestingness” extrinsically, i.e., this notion should not just be captured by simplifying formal assumptions. In particular, this means that we want to employ human evaluators for assessing whether a result is interesting or not, instead of relying, e.g., on the intra-theoretic assumption that FORSIED's formalization of interestingness does indeed correspond to the intuitive notion thereof.

• Moreover, the design should be robust, i.e., the study result should be affected as little as possible by the potentially outlying behavior of individual participants or evaluators. This means that we want to average over different trials and result evaluations.

• Finally, we aim for a design that leads to scalable studies, i.e., the amount of time and attention required of the study owner should not depend on the number of participants. This means that we want to assign to users the double role of trial participant and result evaluator.

These three requirements have two practical implications for our study design. First, since we want to be able to meaningfully average over results, it must be possible to evaluate all task results on an identical scale. That is, interestingness of a result should mean the same for all participants and evaluators. Thus, our task definitions (in particular the instructions and the dataset) have to control the prior knowledge among all users, because, as per our hypothesis, prior knowledge is what determines interestingness. Since we also want to put users into the double role of participants and evaluators, this creates the further difficulty that when a user is asked to evaluate a result, she must have the same prior knowledge at that moment as the participant had when she created the result. Hence, we have to define two task variants such that performing one task as a participant does not change the prior knowledge for the other variant (and hence still allows an unbiased evaluation of it).

System       τ = 1     τ = 2     τ = 3
a_test       563.63    563.63    505.5
a_control    694.25    803.96    inf

Table 1: Median time in seconds until success corresponding to the two systems and different success thresholds.

5.3 Study Design

The system variants for the study design are already more or less defined by the considerations in Section 5.1. In particular, we have a_test, corresponding to an analytics dashboard equipped with the FORSIED-based ranker, and a_control, corresponding to one with a conventional ranker based on the lift measure. As the actual association discovery algorithm we fix a controlled pattern sampling algorithm [Boley et al., 2012] where all parameters (such as the number of patterns per round and the pattern pruning strategy) are fixed to feasible values. This makes the study design accessible also to participants who are not pattern discovery experts.

As stated above, we need two task variants in order to meet our requirements. In order to control the prior knowledge, the input datasets consist of randomly generated population data of two fictitious lands called “Lakeland” and “The Plain”. Each dataset contains 1000 rows, each of which corresponds to a fictitious inhabitant that is sampled according to a joint probability distribution of the following variables:

• Race ∈ {Griffin, Diricawl, Werewolf},

• Region ∈ {east, west, north, south} (of residency),

• Income ∈ {0, 1, 2, . . . , 1000} (in gold coins),

• Health spending ∈ [0, 1] (as a fraction of annual income), and

• Happiness ∈ {happy, unhappy}.

The probability distributions for each of the two lands have different multivariate dependencies and model parameters in order to “inject” certain patterns into the datasets, such as, e.g., P[Race = Diricawl, Region = north] being much larger than the product P[Race = Diricawl]·P[Region = north]. In addition to the input dataset, the input values also contain the univariate statistics of all attributes for the FORSIED-based ranker. Corresponding to the two input data generators, the two variants of the task instructions are:

“You see a sample of the population of (Lakeland/The Plain). Get familiar with the summary statistics of each attribute by [...]. Then click the mine button in order to find patterns in the population. You can save patterns you consider interesting by [...] and delete patterns by [...]. Repeat these steps to refine your pattern collection until you think you have discovered the most interesting collection consisting of exactly 3 patterns.”

The result definition is then the collection of patterns present in the result container of the analytics dashboard, considered as a single result set (and a result can only be submitted if there are exactly three patterns in this area). This corresponds to our hypothesis, in which we conjecture that the FORSIED-based system should have an advantage when considering the time it takes to construct a whole set of result patterns. In contrast, since the FORSIED-based system and the conventional system behave essentially identically before the first pattern is stored, we would not expect to see a substantial advantage on the single-pattern level. Finally, this also has to be reflected in the evaluation scheme used for result evaluation. Here, we use a single elementary rating scheme called “joint interestingness” with the instructions:

“How interesting is this set of patterns as a whole (taking into account the elementary statistics of the data attributes)?”

and the scale {0 (not), 1 (almost not), 2 (a little), 3 (somewhat), 4 (very)}.
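As an aside, the following minimal sketch shows how population data with an injected dependency of the kind described above could be generated. All concrete probabilities and the model of the happiness attribute are invented for illustration; this is not the generator that was actually used to produce the study datasets.

```java
// Hypothetical sketch: sample a fictitious population so that Diricawls are more likely
// to live in the north, i.e., P[Race=Diricawl, Region=north] exceeds the product
// P[Race=Diricawl] * P[Region=north]. Parameter values are purely illustrative.
import java.util.List;
import java.util.Random;

final class LakelandGenerator {
    public static void main(String[] args) {
        Random rng = new Random(42);
        List<String> races = List.of("Griffin", "Diricawl", "Werewolf");
        List<String> regions = List.of("east", "west", "north", "south");
        System.out.println("race,region,income,healthSpending,happiness");
        for (int i = 0; i < 1000; i++) {
            String race = races.get(rng.nextInt(races.size()));
            // Injected dependency: a Diricawl lives in the north with probability 0.7,
            // all other inhabitants are spread uniformly over the four regions.
            String region = race.equals("Diricawl") && rng.nextDouble() < 0.7
                    ? "north" : regions.get(rng.nextInt(regions.size()));
            int income = rng.nextInt(1001);              // 0..1000 gold coins
            double healthSpending = rng.nextDouble();    // fraction of annual income
            String happiness = rng.nextDouble() < 0.5 ? "happy" : "unhappy";
            System.out.printf("%s,%s,%d,%.3f,%s%n", race, region, income, healthSpending, happiness);
        }
    }
}
```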

Corresponding to our initial considerations, we choose TwoPhaseAssignment as the assignment scheme with parameter values of 1 for the number of trial assignments per participant, 3 for the number of results per evaluator, and NoEvaluationOfOwnTasks as the constraint for the evaluation assignments. Using this constraint ensures that evaluators will only see results from the task variant they did not work on themselves. Thus, as required, they will have the same prior knowledge when evaluating the task as the participant had when producing it.

As system performance metrics we can then define the median time until success (see Eq. (2)) for different thresholds, e.g., τ ∈ {1, 2, 3}, corresponding to different requirements on the average rating of a result for it to be considered a success. As a relatively strict criterion, we can then say that our hypothesis is supported by a study if, for all τ ∈ {1, 2, 3}, we have f_τ(a_test) < f_τ(a_control), where f_τ denotes the system performance metric with respect to τ.

5.4 Results and Experiences

The authors conducted a study based on the above design with 16 participants/evaluators, who followed an open invitation among our department members and master students. While this population was recruited from a friendly environment, note that based on the study setup there was no reliable way for participants to tell which system variant they were working with or which system had been used to produce the result they were evaluating. Table 1 contains the aggregated results of this study for the different strictness levels of the system performance metric. As we can see, it was indeed the case that for all three levels the median time until success was smaller for the test system than for the control system.


Using the Creedo web application, the whole study process could be administered conveniently from one location, while participants and evaluators were able to fulfill their assignments from wherever they wanted within a previously fixed six-day period (the first half of which was reserved for trial participation, the second half for result evaluation). Moreover, the study design is readily available for re-use with a different population of users—either unmodified, in order to increase the confidence in our initial result, or in a modified form, in order to address specific doubts or to investigate refined follow-up questions.

6. SUMMARY AND OUTLOOK

As we have seen, Creedo can support researchers in defining and conducting repeatable studies with real users in order to extrinsically evaluate novel contributions in data mining. The focus of this support is currently on the easy definition of analysis systems, tasks, measurements, and assignment logic in order to control biases and to reduce the cost of the actual study organization and performance measurement. In particular, Creedo allows participants to be employed also in the role of result evaluators in order to provide scalable study designs.

The most important limitations of Creedo's current state are perhaps the following. While a relatively rich set of performance metrics can be expressed in the system, Creedo currently does not provide any support for the statistical interpretation of those metrics. That is, in case the study authors do know that their participants are a representative subset of a certain target demographic, there is no support for testing whether metric measurements are significant for that demographic. Moreover, reflecting current restrictions of the realKD library, the data sources that can be injected into the analytics dashboards of the web application are limited to rather small-scale tabular data. Finally, the available visual mining components are somewhat specific to pattern discovery. However, as our survey among ECMLPKDD authors revealed, there appears to be a high demand for user studies also in other subfields of data mining, especially, of course, in visual analytics. Hence, extending the available components towards these areas is another major direction of growth.

Among the many directions for potential improvement, the authors believe that the extension of Creedo and its implementation should be mainly driven by the requirements emerging from actual practical attempts at extrinsically evaluating novel data mining techniques. If a culture of performing studies with real users develops in data mining research, then the community as a whole will understand better than it does today what the central requirements for supporting this task are. The organization of the Creedo web application as an open-source project is a suitable basis for an organic growth that follows such a development. The authors are excited to engage in collaborations with other data mining research groups to jointly gain more experience in performing user studies and to foster their growth.

Acknowledgment

This work was supported by the German Science Foundation (DFG) under the reference number 'GA 1615/2-1'.

References

Creedo web application. https://bitbucket.org/realKD/creedo/.

Kwiksurveys. URL https://kwiksurveys.com/.

realKD Java library. https://bitbucket.org/realKD/realkd/.

Surveymonkey. https://www.surveymonkey.com.

M. Boley, S. Moens, and T. Gartner. Linear space direct pattern sampling using coupling from the past. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 69–77, 2012.

M. Boley, M. Mampaey, B. Kang, P. Tokmakov, and S. Wrobel. One click mining: Interactive local pattern discovery through implicit preference and performance learning. In Proceedings of the ACM SIGKDD Workshop IDEA, pages 27–35. ACM, 2013.

M. Boley, B. Kang, and T. Gaertner. Poll among ECML-PKDD authors on user studies. http://www.realkd.org/dm-userstudies/ecmlpkdd-authorpoll-march2015/, 2015. [Online; accessed 2-April-2015].

T. De Bie. Subjective interestingness in exploratory data mining. In Advances in Intelligent Data Analysis XII - 12th International Symposium, IDA 2013, London, UK, October 17-19, 2013. Proceedings, pages 19–31, 2013.

T. De Bie and E. Spyropoulou. A theoretical framework for exploratory data mining: Recent insights and challenges ahead. In Machine Learning and Knowledge Discovery in Databases, volume 8190 of LNCS, pages 612–616. Springer Berlin Heidelberg, 2013.

V. Dzyuba, M. v. Leeuwen, S. Nijssen, and L. De Raedt. Interactive learning of pattern rankings. International Journal on Artificial Intelligence Tools, 23(06), 2014.

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37, 1996.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

D. J. Hand. Pattern detection and discovery. In ESF Exploratory Workshop on Pattern Detection and Discovery, pages 1–12. Springer, 2002.

W. Ke, C. R. Sugimoto, and J. Mostafa. Dynamicity vs. effectiveness: Studying online clustering for scatter/gather. In J. Allan and J. Aslam, editors, Proc. of the 32nd Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–26. ACM, New York, 2009.

Y. Li, L. Chiticariu, H. Yang, F. R. Reiss, and A. Carreno-Fuentes. WizIE: A best practices guided development environment for information extraction. In Proc. of the 50th Ann. Meeting of the Ass. for Comp. Linguistics, pages 109–114. ACM, 2012.

I. Mierswa. Rapid Miner. KI, 23(2):62–63, 2009.

D. Shahaf and C. Guestrin. Connecting the dots between news articles. In Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–632. ACM, 2010.

G. I. Webb. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. TKDD, 4(1), 2010.

D. Xin, X. Shen, Q. Mei, and J. Han. Discovering interesting patterns through user's interactive feedback. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 773–778. ACM, 2006.
