activity mining Soft pros.pdf

Activity Mining forDiscovering Software Process Models

Ekkart Kindler, Vladimir Rubin, Wilhelm SchaferSoftware Engineering Group, University of Paderborn, Germany

[kindler, vroubine, wilhelm]@uni-paderborn.de

Abstract: Todays enterprises spend much effort in obtaining precise models of theirsoftware engineering processes in order to improve the process capability of their or-ganization and to advance one step further in the CMM. The manual design of processmodels, however, is complicated, time-consuming, error-prone, and the results arerapidly becoming obsolete; capabilities of human beings in detecting discrepanciesbetween the actual process and the process model are rather limited. Consequently,automatic techniques for deriving and updating these process models are becomingmore and more important.

In this paper, we present an approach that exploits the user interaction with a docu-ment version management system for the automatic derivation of a descriptive processmodel that faithfully reflects the real process.

1 Introduction

Enterprises can increase their process capability by obtaining faithful models of their ongo-ing software processes in a timely manner. Therefore, obtaining precise descriptive mod-els of the software engineering processes is an important task in many enterprises. Today,process engineers and managers are solving this problem manually. This is complicated,time-consuming, error-prone, and the results are rapidly becoming obsolete; capabilitiesof human beings in detecting discrepancies between the actual process and the processmodel are rather limited. In addition, one usually does not start from scratch when defin-ing an explicit process model; rather existing more or less informally executed processeshave to be taken into account.

As a consequence, tool support is needed for deriving the process models. Since a doc-ument version management system is an essential part of the software engineering en-vironment nowadays and is usually used even when no precise definition of the processexists, we suggest using it as a source of input information for an automatic approach. Inthis paper, we present the algorithms and models which constitute our approach to semi-automatically deriving a process model from versioning information, called incrementalworkflow mining.

This approach supports achieving the third level of the Capability Maturity Model (CMM)[Hum89, PCB+93] automatically once the second level is reached. The Key Process Areas(KPA) [PWG+93] of the second level (repeatable) of CMM include the use of a Software

Configuration Management System (SCM) and tracking the history reports of all the doc-uments. On the third level (defined), an organization must have a well-defined softwareengineering process. Incremental workflow mining can be used for automatically discov-ering, formalizing and documenting the organizations software process from the auditinformation of a software configuration management system.

The initial step for deriving the software process model is discovering the software pro-cess activities. These activities reflect the units of work carried out by employees of thesoftware company. Since we deal with a document version management system, we con-sider the essential part of an activity to be its input and output documents and the usersexecuting it; this is a document-oriented point of view on the activity. Our algorithm fordiscovering the activities is called activity mining.

After detecting the sequence and parallel relations on the set of activities, we can discoverthe structure of the process model, which is formalized as a Petri net. This model canbe analyzed and verified using the wide range of techniques and tools available for thisformalism. The algorithm for deriving the overall process model is called merging. Later,this Petri net model is transformed into a user-oriented format (UML Activity Diagrams),which is better understandable by software process engineers and managers of a company.As soon as new audit information is available, the algorithms can be executed again. Theset of activities and the process model are refined, transformed and shown to the user again.This capability of refining the process model as soon as new log information is availableallows us to have an up-to-date model all the time. Consequently, we call our approachincremental workflow mining.

The basic ideas of our approach were presented in our workshop paper [KRS05]; here,we do not deal with such steps as model analysis and model transformation to the user-oriented format, but present the details of the mining and merging algorithms that wereimplemented in our research prototype.

2 Related Work

The research in the area of software process mining started in the mid 90ties with newapproaches to the grammar inference problem proposed by Cook and Wolf in their first ar-ticles, which was improved later [CW98]. This research considers events to be tokens andevent streams to be sentences generated by some grammar, thus, deals with the grammarinference problem from event logs.

The other work from the software domain is in the area of mining from software repos-itories [MSR04, MSR05]. Like in our approach, they use SCM systems and especiallyCVS as a source of input information, but for measuring the project activity, detecting andpredicting changes in code, advising newcomers to an open-source project and detectingthe social dependencies between the developers [MLRW05, ZW04, CM03].

The first application of process mining to the workflow domain was presented by Agrawalet al. in 1998 [AGL98]. This approach models business processes as annotated activitygraphs and assumes the absence of cycles in the event log, it is restricted to sequential

2

patterns only. The approach of Herbst and Karagiannis [HK99] uses machine learningtechniques for acquisition and adaptation of workflow models. The seminal work in thearea of process mining was presented by van der Aalst et al. [WvdA01, WvdA02]. Withinthis approach, workflow logs and classes of sound workflow nets are defined. Formalcausality relations between events in logs, the -mining algorithm for discovering work-flow models, and its improvements are presented. An extension of Aalsts approach calledMulti-phase Process mining [vDvdA05] and the approach used in ARIS Process Per-formance Manager (ARIS PPM) [Sch02] discover the models of process instances fromevent logs.

In addition to the software process and business process domains, the research concerningdiscovering the sequential patterns treats similar problems in the area of data mining. Thework of Agrawal and Srikant deals with discovering sequential patterns in the databasesof customer transactions [AS95].

In comparison to the classical approaches, we do not use event logs and we can not assumethat activities are already defined; rather we work on the logs of SCM systems, thus, wehave to introduce activity mining. We make use of our document-oriented view on theactivities, i.e. the process model is derived from the inputs and outputs of the activities.We suggest coming up with the model very early and refining it as soon as additional loginformation is available. We call it incremental approach.

3 Incremental Workflow Mining

In this section, we describe our incremental workflow mining approach; moreover, weshow the role of the activity mining and the merging algorithms in it. Then, we describethe format of the versioning log and our assumptions about the log.

We start with briefly explaining a traditional software engineering environment schemainspired by the works in the area of process-centered software engineering environments(PSEE) and software processes in general [Ost87, CKO92, FH93, Gru02]. Figure 1 givesan overview: The environment consists of an SCM system where the software product(models, documents, source code, etc...) is maintained and practitioners the users whodevelop this product and interact with the SCM system. The software product is an in-stance of the software product structure, which is the informational model containing thesoftware product documents and their relationships. The organizational structure is theorganizational model containing the roles, organizational units and resources (practition-ers).

In this schema, the Process Engineer (project manager or process engineering department)designs the process model using his experience, existing approaches, like V-model [DW00],RUP [JBR99], etc. The model is instantiated and practitioners follow it during the devel-opment of the software product (see Fig. 1).

There are the following problems with this schema:

The designed model is prescriptive. It does not necessarily reflect the actual way of

3

Software Product

ModelsDocumentationUse CasesTest CasesSource CodeExecutable Code

SoftwareProcessModel

Software ProductStructure

OrganizationalStructure

Practitioner ProcessEngineer

instance

instance

Software ConfigurationManagement

SoftwareProcessInstance

instance

Figure 1: Process-Centered Software Engineering Environment

how work is done in the company. So, there is a problem of getting a documentedand formalized descriptive model of the actual processes for the companies that arealready in business.

Human possibilities in detecting discrepancies between the process model and theactual process are limited.

The practitioner is not involved in the design of the process model, in spite of thefact that he is the best specialist in the part of the process that he carries out.

3.1 Incremental Approach

In contrast to the traditional way of work, in the incremental workflow mining approach,we go the other direction, the steps are shown in Fig. 2:

First, we take the versioning log of the SCM system and do activity mining. Asa result, we get the set of activities and the models of the process instances. Thisalgorithm is defined in Sect. 4.

Second, we take the set of discovered activities and do merging - derive the over-all process model in a system internal formalism (Petri nets). This model can beanalyzed and verified. This algorithm is defined in Sect. 5.

Third, we make the transformation from the system internal model to the externalmodel (UML2.0 Activity Diagrams [OMG03]). This external process model can be

4

shown to the user. This algorithm is, however, beyond the scope of this paper.

Software Product

ModelsDocumentationUse CasesTest CasesSource CodeExecutable Code

SoftwareProcessModel

Software ProductStructure

OrganizationalStructure

Practitioner ProcessEngineer

instance

instance

Software ConfigurationManagement

SoftwareProcessInstance

1

23

Figure 2: Mining in a Process-Centered Software Engineering Environment

This approach works incrementally, i.e. as soon as new records are added to the versioninglog, we refine the set of activities derived in the first step, refine the overall model andmake the transformation again. Following this approach, after the process models arediscovered, they must be inserted to the Workflow Management System (WfMS), wherethey are maintained and executed.

Initially, not the whole functionality of a WfMS is used. First, it is utilized only for stor-ing the newly discovered models; after further refinements, when process models becomemore faithful, the WfMS starts advising the users and controlling their work in the com-pany. In this context, the incremental approach enables gradual process support.

3.2 Versioning Log

In this section, we discuss the input information needed for our approach in more detail.SCM and PDM systems provide the audit information in form of logs or reports. Thisinformation comprises data about checkouts and checkins (commits) of documents. In thispaper, we deal only with checkin information. The reason is that not all SCM systemsenforce or support checkouts, checkins to the system are done more accurately peoplecheck in the documents only after having done changes, and after the checkin they becomeresponsible for them to the colleagues. In spite of this fact, checkout information can berather valuable, especially for improving the models derived from the checkins. The sourceof such additional information could be also the tools, where the documents were changed

5

before the checkin. These kinds of additional information will be regarded in our futurework in this area.

An example of a checkin versioning log1 is shown in Table 1. In order to be independentfrom a particular SCM system, we define a common log format, which includes the typ-ically available information. The versioning log consists of records (rows). Records areproduced after checkins of the documents during different executions of different softwareprocesses. For example, in the log in Table 1, we have one process, lets call it DesignChange; during this process, some software module has to be designed, code has to begenerated and tested and the design has to be reviewed. We must distinguish between theprocess model and the process instances; our goal is discovering the process model fromthe versioning log, which contains data about the process instances. So, there are three ex-ecutions (process instances) of the process (process model) in the log: the first four recordsbelong to the first execution, the second four to the second execution, and the third four to the third one. The set of records that belong to one execution of one process is calledan execution log.

Table 1: Versioning LogDocument Date Author Commentdesign 01.01.05 14:30 de status: initialcode 01.01.05 15:00 de status: generatedreview 05.01.05 10:00 se status: pendingtestres 07.01.05 11:00 qa status: initial, type: manual

design 01.02.05 11:00 de status: initialcode 15.02.05 17:00 dev status: generatedtestres 20.02.05 09:00 dev status: initial, type: manualreview 28.02.05 18:45 se status: pending

design 01.03.05 11:00 de status: initialreview 15.03.05 17:00 se status: pendingcode 20.03.05 09:00 dev status: generatedtestres 28.03.05 18:45 dev status: initial, type: manual

The problem of detecting the log structure is application domain, SCM system and company-specific, we need interaction with the end user to solve it. For the rest of the paper, let usassume that we know the structure of the log, i.e. we know which records belong to whichexecution log and which execution logs belong to which process.

For the algorithms presented in this paper, we make the following assumptions on theversioning log (the assumptions will be relaxed in future):

Different execution logs have the same naming conventions. E.g. the documentcontaining the design is called design in all the execution logs.

1Later, simply called versioning log.

6

The document name and the comment together are the unique identifier of therecord. So, there is no documents with the same name and the same commentwithin one execution log.

There are no two records with the same timestamp in the same execution log.

For the rest of the paper, we can ignore the comments and deal only with document names;the timestamps can be converted to the values belonging to the set of natural numbers N,just to show the order of records. Let us assume, there are n execution logs and eachexecution log is given a unique identifier from the set C = {1, . . . , n}; for each log, thereis the set D of document names and the set R of users, who committed the documents.Hence, a log can be defined the following way:

L C D NR (1)

In our example, we have D = {design, code, review, testres} and R = {de, se, dev, qa}.Each record of the versioning log is represented as a tuple, e.g. the first row in Table 1 isrepresented as (1, design, 1, de). The whole versioning log can be represented as a set oftuples:

L = {(1, design, 1, de), (1, code, 2, de), (1, review, 3, se), (1, testres, 4, qa),(2, design, 1, de), (2, code, 2, dev), (2, testres, 3, dev), (2, review, 4, se),(3, design, 1, de), (3, review, 2, se), (3, code, 3, dev), (3, testres, 4, dev)}

(2)

For each tuple, we define the . notation, which gives the concrete field value by its name.E.g. for r = (1, design, 1, de), we have r.c = 1, r.d = design, r.t = 1 and r.u = de.

4 Activity Mining

The goal of activity mining within the incremental workflow mining approach is discov-ering the set of activities from the versioning log and generating the set of process modelsfor all the process executions. In this paper, we define only the algorithm for generatingthe set of activities. As mentioned earlier, an activity is a logical unit of work defined bythe sets of its input and its output documents and by the set of resources that can executeit.

In the activity mining algorithm, we go through the execution logs looking for the outputsof the activities. Every document committed to the system is the output of some activity.E.g. for the log given in Table 1, the set D (see Sect. 3.2) contains all the outputs of theactivities.

Further, for each activity, we also get the set of resources (subset of the set R) allowed toexecute it: this set contains all the resources who committed the output document in some

7

execution log. E.g. the document code was committed by de in the first executionlog and by dev in the second and the third one, thus the set of resources for the activityproducing code is {de, dev}.The most important problem here is finding the input of each activity, since this informa-tion is not given in the logs explicitly. For the time being, we assume that all the documentsthat were committed to the system always prior to the output of the activity belong to itsinput. For example, the document design was committed prior to the activity producingcode in all execution logs, thus it is the input of the activity; the document review wasalso committed earlier, but only in the third execution log, thus it does not really belong tothe input set.

Now, we define the algorithm more formally. For a fixed execution log with identifier cand a document k, the set Bc,k contains the documents of this execution log that werecommitted before k. The set Bk contains the documents that were committed beforedocument k in all the execution logs. This set can be defined as the intersection of all setsBc,k for all c. The set Uk contains all the users that ever committed the document k. Theset A is exactly the set of activities the output of the algorithm. Here, we present thealgorithm in a formal notation:

Algorithm 1 (Activity Mining)Let L be the versioning log, such that the following assumptions are met:

1.ri, rk L : (ri.c = rk.c) (ri.t = rk.t) ri = rk2.ri, rk L : (ri.c = rk.c) (ri.d = rk.d) ri = rkThen we define, for c C, k D:

1. Bc,k = {ri.d|r, ri L : (r.d = k) (r.c = c) (ri.c = c) (ri.t < r.t)}

2. Bk =n

c=1 Bc,k

3. Uk = {r.u|r L : r.d = k}

4. A = {(Bk, {k}, Uk)}

Each activity is a tuple (I,O, U), where O (output) is a one-element set containing a doc-ument k from the set D; I (input) and U (users) are the sets Bk and Uk for this documentrespectively. For our example in Table 1, for the document code, the sets B1,code andB2,code are equal to {design}, the set B3,code is {design, review}, thus the set Bcode is{design}; the set Ucode is {dev, de}; therefore the activity is

({design}, {code}, {dev, de})

In general, the set of the activities is a set:

A {(I,O, U)|I D,O D,U R} (3)

8

4.1 Example

Let us consider the example log from Table 1, which was formalized in (2), to be the inputagain. To define the beginning and the end of each execution log explicitly, we insert thefollowing records to the versioning log L:

(1, start, 0, {}), (1, end, 5, {}),(2, start, 0, {}), (2, end, 5, {}),(3, start, 0, {}), (3, end, 5, {})

Then, the activity mining algorithm will create the following sets:

1. B1,start = {}, B2,start = {}, B3,start = {},B1,design = {start}, B2,design = {start}, B3,design = {start},B1,code = {start, design},B2,code = {start, design},B3,code = {start, design, review},B1,review = {start, design, code},B2,review = {start, design, code, testres},B3,review = {start, design},B1,testres = {start, design, code, review},B2,testres = {start, design, code},B3,testres = {start, design, review, code},B1,end = {start, design, code, review, testres},B2,end = {start, design, code, review, testres},B3,end = {start, design, code, review, testres}

2. Bstart = {},Bdesign = {start},Bcode = {start, design},Breview = {start, design},Btestres = {start, design, code},Bend = {start, design, code, review, testres}

3. Ustart = {},Udesign = {de},Ucode = {de, dev},Ureview = {se},Utestres = {dev, qa},Uend = {}

9

4. A = {({}, {start}, {}),({start}, {design}, {de}),({start, design}, {review}, {se}),({start, design}, {code}, {dev, de}),({start, design, code}, {testres}, {dev, qa}),({start, design, code, testres, review}, {end}, {})}

The last set is the desired set of activities. E.g for the activity

({start, design}, {review}, {se})

we know that it is done by se, gives the document review as a result and needs the designas precondition.

For the rest of the paper, we enumerate the activities in this set and remove the first activ-ity, since it is abstract and is used only for introducing the document start, which showsthe beginning of the process. Thus, the result of the activity mining algorithm looks thefollowing way:

A = {a1 = ({start}, {design}, {de}),a2 = ({start, design}, {review}, {se}),a3 = ({start, design}, {code}, {dev, de}),a4 = ({start, design, code}, {testres}, {dev, qa}),a5 = ({start, design, code, testres, review}, {end}, {})}

(4)

5 Merging

The goal of merging within the incremental workflow mining approach is generating thesoftware process model using the set of activities. Since the set of activities obtained byactivity mining contains the essence of the versioning log needed for deriving the processmodel, we do not deal with the log anymore.

First, we define three relations on the set of activities, such as sequence, parallel split andparallel join. The sequence relation contains the pairs of activities that follow each other,i.e. the input of the second activity must contain the input and the output of the first one.E.g. activities a3 and a4 as presented on formula 4 are in the sequence relation, sincethe second one needs the set {start, design, code} as an input and the first produces thedocument code having the set {start, design} as an input. The parallel split relationcontains the activities which have the same input, but produce different output documents;these activities can be executed concurrently. E.g. a2 and a3 are in this relation. The lastrelation is parallel join, the activities, that are followed by the activity that needs the input

10

and the output of both of them as an input, are in this relation. E.g. a2 and a4 are in thisrelation, since the activity a5 needs review, testres and {start, design, code} as aninput.

We define the relations formally. Let ai, ak A.

Definition 1 (Sequence) ai l ak iff ai 6= ak and ai.I ai.O = ak.I

Definition 2 (Parallel split) ai ak iff ai.I = ak.I and ai.O 6= ak.O

Definition 3 (Parallel join) ai ak iff (ai.I ai.O) * (ak.I ak.O) and (ak.I ak.O) * (ai.I ai.O) and a A : ((ai.I ai.O) (ak.I ak.O) = a.I)

For the set of activities and the parallel split relation on this set, we define a set of maxi-mum cliques (maximum subsets) maxCliques() , where all the elements of each max-imum clique are in a parallel split relation with each other. For example, activities a2and a3 build a maximum clique, since a2 a3 and a3 a2; so, for our example,maxCliques() = {{a2, a3}, {a1}, {a4}, {a5}}.Similarly, for the set of activities and the parallel join relation, we define maxCliques().For our example, maxCliques() = {{a2, a4}, {a1}, {a3}, {a5}}. For the rest of thepaper, we exclude the subsets with only one element from maxCliques, since they arenot necessary for introducing the parallel join pattern to the process model.

Now, after we have got the set of activities and the sequence and parallel relations onthem, we present the algorithms for generating the Petri Net from these activities. Thesealgorithms generate the control, the informational and the organizational aspects of theprocess models.

Control Aspect The goal of the control aspect algorithm is generating the Petri Net thatrepresents the control aspect of the process model and, thus, shows the order of executionof the activities (see Fig. 3). This algorithm generates a sound workflow net according tothe definition of van der Aalst et al. [vdAvH02].

For each activity, we generate a transition. For example, t{design} for a1, t{code} for a3.For each activity, which is not followed by parallel activities, we also generate one outputplace and an arc between them. E.g for activity a3, we generate place p{code} and an arc(t{code}, p{code}).

For activity a1, however, we have to generate two output places, since it is followed byparallel activities a2 and a3. Generally, we introduce a parallel split pattern [AHKB03]here, which consists of a set of places one for each parallel branch; the transition has tobe connected to all the places. E.g. in our case, we know that activities a2 and a3 belong tothe maximum clique, therefore, we connect the transition of the activity a1, say t{design},to these places. Lets call these places p({a2,a3},a2) and p({a2,a3},a3), where {a2, a3} is theclique in the maxCliques() set and a2 and a3 are the activities in this clique, and the arcs(t{design}, p({a2,a3},a2)) and (t{design}, p({a2,a3},a3)) respectively. The places have to be

11

P{end}

Pde PendPse

a1

T {design}

P({a2, a3}, a3)a3

T {code} P{code}a4

T {testres} P{testres}

a2

T {review} P{review}a5

T{end}

Pstart

Pdev Pqa

Pdesign Pcode PreviewPtestres Pend Informational Aspect

Control Aspect

Organizational Aspect

P({a2, a3}, a2)

P{de} P{dev,de} P{dev,qa} P{se} P{end}

Pa{de} Pa{dev,de} Pa{dev,qa} Pa{se} Pa{end}

T(a3,de) T(a3,dev)T(a1,de) T(a4,dev) T(a4,qa) T(a2,se) T(a5,end)

Figure 3: The Process Model

connected to the transitions of appropriate activities, i.e. p({a2,a3},a2) has to be connectedto the transition of a2 i.e. t{review} and p({a2,a3},a3) to t{code} respectively.

Additionally, we have to regard the case, when parallel activities do not have any activityas a predecessor. This can happen only if the input of these activities is equal to start. Inthis case, we have to create an additional transition, connect it to the start place pstart andconsider this transition to belong to the predecessor activity. After it, the algorithm worksas described earlier.

Next, we have to connect all the activities that belong to the sequence relation, excludingthose belonging to the parallel split relation, with each other. E.g. a3 and a4 are sequential,therefore, we connect the output place p{code} with the transition t{testres}.

The last step is connecting the parallel join pattern. Here, we have to connect the outputplaces of the activities that belong to the parallel join clique with the transition of theactivity that needs the documents produced by all these activities as input. E.g. activitiesa2 and a4 belong to one set in maxCliques() relation and a5 needs the documents ofboth of them for input. Thus, we have to connect the output places p{review} and p{testres}with the transition t{end}.

Now, we present the algorithm in a formal notation:

Algorithm 2 (Merging algorithm) Let A be the set of activities with sequence, parallelsplit and parallel join relations on it.

12

1. Tact = {a.O|a A}

2. T ={{t} if (C maxCliques()) (a C) (a.I = {start}){} otherwise

3. Tcontrol = Tact T

4. Pact = {a.O|a A

@C maxCliques() : (b C) (a l b)}

5. P = {(C, a)|(C maxCliques()) (a C)}

6. Pcontrol = Pact P {start}

7. Estart ={{(start, a.O)} if (a.O Tact) (a.I = {start}) (T = {}){(start, t)} if T 6= {}

8. Eact = {(a, b)|(a Tact) (b Pact) (a = b)}

9. El = {(a.O, b.O)|a, b A : (a l b) (a.O Pact) (b.O Tact)}

10. Ein = {(a.O, (C, b))|a, b A : (a.O Tact) (C maxCliques()) (b C) (a l b)}

11. Einstart = {(t, (C, a))|a A : (C maxCliques()) (a C) (a.I ={start}}

12. Eout = {((C, a), a.O)|a A : (C maxCliques())(a C)(a.O Tact)}

13. E = {(a.O, b.O)|a, b A : (C maxCliques()) (a C) (a.O Pact) (b.O Tact) (

cC(c.I c.O) = b.I)}

14. Econtrol = Estart Eact El Ein Einstart Eout E

15. Ncontrol = (Pcontrol, Tcontrol, Econtrol)

Altogether, the Petri Net Ncontrol consists of the set of places Pcontrol, the set of transi-tions Tcontrol and the set of arcs Econtrol. The set of places consists of sets Pact and Pand an initial place. The first set contains the places, which are needed for connecting theactivities with each other. The second set contains the places which belong to the parallelsplit pattern.

The set of transitions consists of two sets: Tact and T. The first set represents the activitiesthemselves. The second set contains a transition that is needed to model the parallel splitpattern after the initial place.

The set of arcs consists of seven sets. The set Estart either contains an arc connectingthe initial place with the first transition or contains an arc connecting the initial place withthe transition, which belongs to the parallel split pattern if it starts from the initial place.The set Eact contains the arcs connecting the Tact transitions with their output places.The set El connects the activities which belong to the sequence relation, excluding thosebelonging to the parallel split. The sets Ein and Eout connect the transitions and placesbelonging to the parallel split pattern to the rest of the net. The set E represents theparallel join pattern and connects parallel branches.

13

Informational Aspect Next, we present the algorithm for generating the informationalaspect of the process model. In this algorithm, we create one place for each documentin our document set D. Since each activity produces a document as an output, we canconnect the corresponding transitions to the document places. E.g. activity a1 producesdocument design as an output, thus, we create a place pdesign and connect the transitionto this place. Therefore, after firing this transition, the place of the output document willget a token. In a formal notation, this algorithm looks following:

1. Pinf = D {end}

2. Einf = {(a, b)|(a Tact) (b Pinf ) (b a)}

3. Ncontrol inf = (Pcontrol Pinf , Tcontrol, Econtrol Einf )

Organizational Aspect In the organizational aspect algorithm, we create one place foreach user in our user set R and put a token to each of these places to show that the user isavailable. For example, for the user de, we create a place pde. For each activity, for eachuser that can execute the activity, we create a transition, which fires when the activity isassigned to the user. For example, since the user de can execute a1, we create a transitiont(a1,de). The user place pde is connected to the transition t(a1,de). For each activity, wecreate a place, which is marked initially and looses the token as soon as assignment isdone. For example, for activity a1, we create a place pa{de}. This place is connectedto the transition t(a1,de), and, thus, looses the token after firing the transition. For eachactivity, we create a place, which gets the token after firing the assignment transition andshows that the resources are available for the activity. For example, for activity a1, wecreate a place p{de} and connect the t(a1,de) transition to this place and this place to theactivitys transition t{design}. If an activity can be executed by one of several users, weuse the defined above elements to model the exclusive choice between these users. In aformal notation, this algorithm looks following:

1. Porg act = {a.U |a A}

2. Puser = R {end}

3. Passign = {a.U |a A}

4. Porg = Porg act Puser Passign

5. Tassign = {(a, u)|a A, u a.U}

6. Eorg act = {(a.U, a.O)|a A : (a.U Porg act) (a.O Tact)}

7. Eorg assign = {(u, (a, u)), ((a, u), u)|a A : (u Puser) ((a, u) Tassign)}

8. Eassign = {(a.U, (a, u))|a A : (a.U Passign) ((a, u) Tassign)}

9. Eassign org act = {((a, u), a.U)|a A : ((a, u) Tassign) (a.U Porg act)}

10. Eorg = Eorg act Eorg assign Eassign Eassign org act

14

11. Ncontrol org = (Pcontrol Porg, Tcontrol Tassign, Econtrol Eorg)

So, the Petri Net that contains all three aspects is defined the following way:

N = (Pcontrol Pinf Porg, Tcontrol Tassign, Econtrol Einf Eorg) (5)

Strictly speaking, by now we have defined only the structure of the Petri Net; to definethe Place/Transition system, we have to enrich it with an initial marking and a weightfunction. The Petri Net has the following initial marking: The initial place Pstart in thecontrol aspect has a token to show the start of the process. In the organizational aspect,places Puser have tokens to show the users availability; places Passign are marked toshow that user can be assigned some activity. In a formal notation, initial marking of thePetri Net is the following set (in general, a multiset):

M = Pstart Puser Passign (6)

As a result, we get a Place/Transition system

= (N,M,W ) (7)

where N is a net, M is an initial marking and W : (Econtrol Einf Eorg) {1} is aweight function.

Therefore, for our example in Table 1, after executing the algorithms, we get the Petri Netpresented in Fig. 3.

5.1 Example

In this section, we present an example of the merging algorithm. We take the set A fromsection 4 as an input.

According to the definitions in Sect. 5, we have the following relations on the set A:

a1 l a2, a1 l a3, a3 l a4,a2 a3, a2 a4

Additionally, we build the following sets using the relations:

maxCliques() = {{a2, a3}}maxCliques() = {{a2, a4}}

In the example, we add a prefix t to the elements of the transition sets and p to theelements of the place sets, because it makes easier to distinguish the elements of these sets.

The merging algorithm forms the following sets:

15

1. Tact = {t{design}, t{review}, t{code}, t{testres}, t{end}}

2. T = {}

3. Tcontrol = {t{design}, t{review}, t{code}, t{testres}, t{end}}

4. Pact = {p{review}, p{code}, p{testres}, p{end}}

5. P = {p({a2,a3},a2), p({a2,a3},a3)}

6. Pcontrol = {p{review}, p{code}, p{testres}, p{end},p({a2,a3},a2), p({a2,a3},a3), pstart}

7. Estart = {(pstart, t{design})}

8. Eact = {(t{review}, p{review}), (t{code}, p{code}),(t{testres}, p{testres}), (t{end}, p{end})}

9. El = {(p{code}, t{testres})}

10. Ein = {(t{design}, p({a2,a3},a2)), (t{design}, p({a2,a3},a3))}

11. Einstart = {}

12. Eout = {(p({a2,a3},a2), t{review}), (p({a2,a3},a3), t{code})}

13. E = {(p{review}, t{end}), (p{testres}, t{end})}

14. Econtrol = {(pstart, t{design}), (p{code}, t{testres})(t{design}, p({a2,a3},a2)), (t{design}, p({a2,a3},a3)),(p({a2,a3},a2), t{review}), (p({a2,a3},a3), t{code}),(t{code}, p{code}), (t{review}, p{review}),(p{code}, t{testres}), (t{testres}, p{testres}),(p{review}, t{end}), (p{testres}, p{end}),(t{end}, p{end})}

15. N = (Pcontrol, Tcontrol, Econtrol)

Thus, having the sets of places Pcontrol, transitions Tcontrol and arcs Econtrol, we canbuild the Petri Net, see Fig. 4. Since we have one input place pstart and one output placepend, such that pstart = {} and pend = {}, we can say that we are building the workflownet [Aal98].

The informational aspect algorithm forms the following sets:

1. Pinf = {pdesign, pcode, preview, ptestres, pend}

16

a1

T{design}

a3

T{code} P{code}a4

T{testres} P{testres}

a2

T{review} P{review}a5

T{end}

Pstart

P{end}

P({a2, a3}, a3)

P({a2, a3}, a2)

Figure 4: Control Aspect of the Process Model

2. Einf = {(t{design}, pdesign), (t{code}, pcode),(t{review}, preview), (t{testres}, ptestres),(t{end}, pend)}

3. Ncontrol inf = (Pcontrol Pinf , Tcontrol, Econtrol Einf )

Thus, control and informational aspects together form the following Petri Net Ncontrol inf ,see Fig. 5.

a1

T{design}

a3

T{code} P{code}a4

T{testres} P{testres}

a2

T{review} P{review}a5

T{end}

Pstart

P{end}

Pdesign Pcode Preview Ptestres Pend

P({a2, a3}, a3)

P({a2, a3}, a2)

Figure 5: Informational Aspect of the Process Model

The organizational aspect algorithm forms the following sets:

1. Porg act = {p{de}, p{se}, p{dev,de}, p{dev,qa}, p{end}}

2. Puser = {pde, pse, pdev, pqa, pend}

3. Passign = {pa{de}, pa{se}, pa{dev,de}, pa{dev,qa}, pa{end}}

4. Porg = {p{de}, p{se}, p{dev,de}, p{dev,qa}, p{end},pde, pse, pdev, pqa, pend,pa{de}, pa{se}, pa{dev,de}, pa{dev,qa}, pa{end}}

17

5. Tassign = {t(a1,de), t(a2,se), t(a3,dev), t(a3,de), t(a4,dev), t(a4,qa), t(a5,end)}

6. Eorg act = {(p{de}, t{design}), (p{se}, t{review}), (p{dev,de}, t{code})(p{dev,qa}, t{testres}), (p{end}, t{end})}

7. Eorg assign = {(pde, t(a1,de)), (t(a1,de), pde),(pse, t(a2,se)), (t(a2,se), pse),(pdev, t(a3,dev)), (t(a3,dev), pdev),(pde, t(a3,de)), (t(a3,de), pde),(pdev, t(a4,dev)), (t(a4,dev), pdev),(pqa, t(a4,qa)), (t(a4,qa), pqa),(pend, t(a5,end)), (t(a5,end), pend)}

8. Eassign = {(pa{de}, t(a1,de)),(pa{se}, t(a2,se)),(pa{dev,de}, t(a3,dev)), (pa{dev,de}, t(a3,de)),(pa{dev,qa}, t(a4,dev)), (pa{dev,qa}, t(a4,qa)),(pa{end}, t(a5,end))}

9. Eassign org act = {(t(a1,de), p{de}),(t(a2,se), p{se}),(t(a3,dev), p{dev,de}), (t(a3,de), p{dev,de}),(t(a4,dev), p{dev,qa}), (t(a4,qa), p{dev,qa}),(t(a5,end), p{end})}

10. Eorg = Eorg act Eorg assign Eassign Eassign org act

11. Ncontrol org = (Pcontrol Porg, Tcontrol Tassign, Econtrol Eorg)

In the example, we added prefix pa to the Passign places to distinguish them from thePorg act places, which are named with standard prefix p. Thus, control and organiza-tional aspects together form the following Petri Net Ncontrol org , see Fig. 6.

Thus, combining all the aspects, we get the Petri Net

=((Pcontrol Pinf Porg, Tcontrol Tassign, Econtrol Einf Eorg),M,W

)where M, the initial marking, is following:

M = {pstart, pde, pse, pdev, pqa, pend, pa{de}, pa{se}, pa{dev,de}, pa{dev,qa}, pa{end}}

and the weight function is W : (Econtrol Einf Eorg) {1}. This Petri Net wasshown in Fig. 3.

18

P{end}

Pde PendPse

a1

T {design}

P({a2, a3}, a3)a3

T {code} P{code}a4

T {testres} P{testres}

a2

T {review} P{review}a5

T{end}

Pstart

Pdev Pqa

P({a2, a3}, a2)

P{de} P{dev,de} P{dev,qa} P{se} P{end}

Pa{de} Pa{dev,de} Pa{dev,qa} Pa{se} Pa{end}

T(a3,de) T(a3,dev)T(a1,de) T(a4,dev) T(a4,qa) T(a2,se) T(a5,end)

Figure 6: Organizational Aspect of the Process Model

6 Conclusion and Future Work

The activity mining and merging algorithms were implemented in SWI Prolog [Wie03] asa part of the incremental workflow mining research prototype. The declarative semantics ofProlog was considered to be especially valuable for implementing such kind of algorithms,since the program could be formulated as a set of first-order logic clauses defining relationson the set of activities. A special clause generates the resulting Petri net in a file followingthe syntax of the DOT language, which belongs to the GraphViz set of graph visualizationtools [GN00]. From this dot file, the .jpeg file containing the image of the Petri Net isgenerated. An example of the automatically generated Petri net, which contains the controland the informational aspects, is given in Fig. 7.

Figure 7: An Automatically Generated Process Model

19

In our approach, we do mining from different perspectives, we use data on such aspects ofbusiness process modeling as informational, organizational and control. Since SoftwareConfiguration Management systems do not immediately support the concept of activities,we must derive this information from the logs of the SCM system. Then, we discover theoverall process model from this set of activities. Our approach has the following benefits:

It does not need information on the existing activities; rather, we can mine thisinformation from the versioning logs.

After deriving the set of activities from the versioning log, we do not deal with logsany more. This set contains already sufficient information for discovering the overallprocess model using our merging algorithm.

Our approach works incrementally and it is efficient, since all the information onthe dependencies is captured in the set of activities. Therefore, we do not need to goback to the logs when mining the process.

As a result, the algorithms produce a multi-perspective process model, which re-flects the real process carried out in a company. This model can be later visualized,analyzed, verified and optimized by process engineers and managers of the com-pany.

Consequently, the approach helps gradually improving the management of the softwareprocesses of a company along with incrementally improving the process models.

In this paper, we presented algorithms which work under rather restrictive assumptionsabout the versioning log (see Sect. 3.2). The algorithms were implemented and testedfollowing these assumptions. In the future, we will improve the algorithms by means ofrelaxing these assumptions: several documents can be committed at the same time, therecan be several commits of the same document within one execution log. We have to dealwith loops and choices. Additionally, we must identify the types of documents, the rolesof users and we must generate meaningful names for the activities. This can be achievedby exploiting checkout information of SCM systems or information of the additional toolsand by interacting with the user.

For discovering not only software processes, but system engineering processes in general,we shall deal with similar audit information that can be obtained from Product Data Man-agement (PDM) systems or other types of configuration and change management systems.

Altogether, our approach supports mining process models from ad-hoc executions withouthaving predefined information on them. This way, our approach supports increasing therepeatability of the processes; in terms of workflow terminology, the ad-hoc processesbecome administrative. In terms of the CMM, this means automatic support for increasingthe maturity level of some enterprise from repeatable to defined.

20

References

[Aal98] W.M.P. van der Aalst. The Application of Petri Nets to Workflow Management. TheJournal of Circuits, Systems and Computers, 8(1):2166, 1998.

[AGL98] Rakesh Agrawal, Dimitrios Gunopulos, and Frank Leymann. Mining Process Modelsfrom Workflow Logs. In Proceedings of the 6th International Conference on ExtendingDatabase Technology, pages 469483. Springer-Verlag, 1998.

[AHKB03] W. M. P. Van Der Aalst, A. H. M. Ter Hofstede, B. Kiepuszewski, and A. P. Barros.Workflow Patterns. Distrib. Parallel Databases, 14(1):551, 2003.

[AS95] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yuand Arbee S. P. Chen, editors, Eleventh International Conference on Data Engineering,pages 314, Taipei, Taiwan, 1995. IEEE Computer Society Press.

[CKO92] Bill Curtis, Marc I. Kellner, and Jim Over. Process Modeling. Communication of theACM, 35(9):7590, 1992.

[CM03] Davor Cubranic and Gail C. Murphy. Hipikat: recommending pertinent software de-velopment artifacts. In ICSE 03: Proceedings of the 25th International Conference onSoftware Engineering, pages 408418, Washington, DC, USA, 2003. IEEE ComputerSociety.

[CW98] Jonathan E. Cook and Alexander L. Wolf. Discovering Models of Software Processesfrom Event-Based Data. ACM Trans. Softw. Eng. Methodol., 7(3):215249, 1998.

[DW00] W. Droschel and H. Wiemers. Das V-Modell 97, Der Standard fur die Entwicklung vonIT-Systemen mit Anleitung fur den Praxiseinsatz. Oldenbourg, 2000.

[FH93] Peter H. Feiler and Watts S. Humphrey. Software Process Development and Enactment:Concepts and Definitions. Technical Report CMU/SEI-92-TR-004, SEI Carnegie Mel-lon, 1993.

[GN00] Emden R. Gansner and Stephen C. North. An open graph visualization system and itsapplications to software engineering. Software Practice and Experience, 30(11):12031233, 2000.

[Gru02] Volker Gruhn. Process-Centered Software Engineering Environments - A Brief Historyand Future Challenges, 2002.

[HK99] J. Herbst and D. Karagiannis. An Inductive approach to the Acquisition and Adaptationof Workflow Models. citeseer.ist.psu.edu/herbst99inductive.html, 1999.

[Hum89] Watts S. Humphrey. Managing the software process. Addison-Wesley Longman Pub-lishing Co., Inc., Boston, MA, USA, 1989.

[JBR99] Ivar Jacobson, Grady Booch, and James Rumbaugh. The unified software developmentprocess. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[KRS05] Ekkart Kindler, Vladimir Rubin, and Wilhelm Schafer. Incremental Workflow Miningbased on Document Versioning Information. In Mingshu Li, Barry Boehm, and Leon J.Osterweil, editors, Proc. of the Software Process Workshop 2005, Beijing, China, volume3840 of LNCS. Springer, May 2005.

21

[MLRW05] Keir Mierle, Kevin Laven, Sam Roweis, and Greg Wilson. Mining student CVS repos-itories for performance indicators. In MSR 05: Proceedings of the 2005 internationalworkshop on Mining software repositories, pages 15, New York, NY, USA, 2005. ACMPress.

[MSR04] MSR 2004: International Workshop on Mining Software Repositories, Washington, DC,USA, 2004. IEEE Computer Society.

[MSR05] MSR 2005 International Workshop on Mining Software Repositories, New York, NY,USA, 2005. ACM Press.

[OMG03] OMG. UML 2.0 Superstructure Specification. Version 2.0 ptc/03-08-02, Object Man-agement Group, August 2003. Final Adopted Specification.

[Ost87] L Osterweil. Software processes are software too. In Proceedings of the 9th InternationalConference on Software Engineering, pages 213, Los Alamitos, CA, USA, 1987. IEEEComputer Society Press.

[PCB+93] Mark C. Paulk, Bill Curtis, Mary Beth, Chrissis, and Charles V. Weber. Capability Matu-rity Model for Software (SW-CMM). Technical Report CMU/SEI-93-TR-024, CarnegieMellon University, Software Engineering Institute, February 1993.

[PWG+93] Mark C. Paulk, Charles V. Weber, Suzanne M. Garcia, Mary Beth Chrissis, and Mari-lyn Bush. Key Practices of the Capability Maturity Model, Version 1.1. Technical Re-port CMU/SEI-93-TR-025, Carnegie Mellon University, Software Engineering Institute,February 1993.

[Sch02] IDS Scheer. ARIS Process Performance Manager (ARIS PPM). http://www.ids-scheer.com, 2002.

[vdAvH02] W.M.P. van der Aalst and K. van Hee. Workflow Management: Models, Methods, andSystem. Cooperative Information Systems. The MIT Press, 2002.

[vDvdA05] B.F. van Dongen and W.M.P. van der Aalst. Multi-phase Process mining: Aggregat-ing Instance Graphs into EPCs and Petri Nets. In 2nd International Workshop on Ap-plications of Petri Nets to Coordination, Worklflow and Business Process Management(PNCWB) at the ICATPN 2005, 2005.

[Wie03] Jan Wielemaker. An overview of the SWI-Prolog Programming Environment. In FredMesnard and Alexander Serebenik, editors, Proceedings of the 13th International Work-shop on Logic Programming Environments, pages 116, Heverlee, Belgium, december2003. Katholieke Universiteit Leuven. CW 371.

[WvdA01] A.J.M.M. Weijters and W.M.P. van der Aalst. Process mining: discovering workflowmodels from event-based data. In Proceedings of the 13th Belgium-Netherlands Confer-ence on Artificial Intelligence (BNAIC 2001), pages 283290, 2001.

[WvdA02] A.J.M.M. Weijters and W.M.P. van der Aalst. Workflow Mining: Discovering WorkflowModels from Event-Based Data. In Dousson, C., Hoppner, F., and Quiniou, R., editors,Proceedings of the ECAI Workshop on Knowledge Discovery and Spatial Data, pages7884, 2002.

[ZW04] Thomas Zimmermann and Peter Weissgerber. Preprocessing CVS Data for Fine-GrainedAnalysis. In Proc. 1st International Workshop on Mining Software Repositories (MSR),may 2004.

22

Documents

activity mining Soft pros.pdf