Process Mining for ERP Systems

Presentation held at the 1st Workshop for Data- and Artifact-Centric Processes, co-located with BPM 2012, September 2012.

Erik Nooijen, Boudewijn van Dongen, Dirk Fahland

PAGE 1

Process Discovery

event log → process discovery algorithm → process model

example event log:
c1: A B C D E
c2: A C B D E
c3: A F D E

assumptions
• case = sequence of events of this case
• cases are isolated: event A in c1 happens only in c1 (and not in c2)
• all cases belong to the same process
• one unique case id
• each event is associated with exactly one case id
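To make the input of a discovery algorithm concrete, the sketch below counts the directly-follows pairs in the three example traces. Directly-follows counting is a common first step of many discovery algorithms; the slides do not say which algorithm is used, so this is only an illustration, and the function name `directly_follows` is an assumption.

```python
from collections import Counter

# Event log from the slide: case id -> sequence of activities.
log = {
    "c1": ["A", "B", "C", "D", "E"],
    "c2": ["A", "C", "B", "D", "E"],
    "c3": ["A", "F", "D", "E"],
}


def directly_follows(log):
    """Count how often activity x is directly followed by activity y."""
    counts = Counter()
    for trace in log.values():
        # Pair each event with its successor inside the same case only.
        for x, y in zip(trace, trace[1:]):
            counts[(x, y)] += 1
    return counts


dfg = directly_follows(log)
for (x, y), n in sorted(dfg.items()):
    print(f"{x} -> {y}: {n}")
```

The counting relies on the assumptions listed above: every event belongs to exactly one case, so pairs are never formed across cases.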

PAGE 2

Typical Process in an ERP System

Build to Order

[Figure: customers Alice and Bob order products X and Y from the Manufacturer; the products require Materials A, B, and C, which the Manufacturer orders from the suppliers ACME Inc. and Mega Corp.]

PAGE 3

n-to-m relations database

ProductOrder
poID   cust.   …   created       processed     built         shipped
po1    Alice   …   30-08 9:22    30-08 13:12   01-09 15:12   03-09 10:15
po2    Bob     …   30-08 10:15   30-08 13:14   01-09 16:13   03-09 17:18

OrderedMaterial
poID   moID   type   added
po1    mo3    B      30-08 13:13
po1    mo4    A      30-08 13:14
po2    mo3    B      30-08 13:15
po2    mo4    C      30-08 13:16

MaterialOrder
moID   suppl.   …   completed     sent          received
mo3    ACME     …   30-08 13:15   30-08 14:15   01-09 9:05
mo4    MEGA     …   30-08 13:17   30-08 16:12   01-09 10:13

Customer
cust.   address   …
Alice   …         …
Bob     …         …

column kinds marked in the figure: id attributes, time-stamp attributes, data attributes, relations

[Figure: process discovery algorithm → process model]

schema
MaterialOrder: moID, supplier, completed, sent, received
OrderedMat.: poID, moID, type, added
Customer: cust, …
ProductOrder: poID, cust, created, processed, built, shipped

PAGE 4

Process Discovery for ERP Systems

[Figure: process discovery algorithm → process model]

reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases

[Figure: schema diagram with multiplicities 1 to 0..*, 1 to 1..*, 1 to 1..*]
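A small sketch of why related tables break the "one case id per event" assumption: each row of OrderedMaterial carries both a poID and a moID, so the same "added" event can be grouped under either notion of case. The rows are taken from the example database; the helper `group_by` is a hypothetical name used only for this illustration.

```python
# Rows of the OrderedMaterial table from the slides: (poID, moID, type, added).
ordered_material = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]


def group_by(rows, column):
    """Group the 'added' time stamps by one id column (0 = poID, 1 = moID)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[3])
    return groups


by_po = group_by(ordered_material, 0)  # case = product order
by_mo = group_by(ordered_material, 1)  # case = material order
print(by_po)
print(by_mo)
```

Every time stamp shows up once under a product order and once under a material order, which is the "one event related to multiple cases" situation named above.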

PAGE 5

Outline

[Figure: outline of the approach: decompose by primary keys into a log for PO and a log for MO; discovery yields a model for PO and a model for MO; the models are related by primary/foreign-key relations into a process model]

PAGE 7

Find Artifact Schemas

documented schema vs. actual schema → identify
• column types (esp. time-stamped columns)
• primary keys
• foreign keys

various (non-trivial) techniques available

key discovery is NP-complete in the size of the table(s)

result:

PAGE 8

Step 0: discover database schema

= schema summarization
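Schema discovery can be illustrated with a brute-force sketch on the OrderedMaterial rows from the example database. It is not the technique used by the authors' tool: `is_timestamp` and `candidate_keys` are hypothetical helpers, the year 2012 is assumed because the slides omit it, and key candidates are found by plain enumeration of column sets, which only works for tiny tables.

```python
from datetime import datetime
from itertools import combinations

columns = ["poID", "moID", "type", "added"]
rows = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]


def is_timestamp(value, year="2012"):
    """True if the value parses as 'day-month hour:minute' (year is assumed)."""
    try:
        datetime.strptime(f"{year}-{value}", "%Y-%d-%m %H:%M")
        return True
    except ValueError:
        return False


def timestamp_columns(columns, rows):
    """Columns in which every value looks like a time stamp."""
    return [c for i, c in enumerate(columns) if all(is_timestamp(r[i]) for r in rows)]


def candidate_keys(columns, rows, max_size=2):
    """Minimal column sets (up to max_size) whose values are unique per row."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # Skip supersets of keys found earlier so that keys stay minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(rows):
                keys.append(combo)
    return [tuple(columns[i] for i in k) for k in keys]


print(timestamp_columns(columns, rows))
print(candidate_keys(columns, rows))
```

On these four rows the enumeration also reports `added` and `(poID, type)` as unique, next to the intended `(poID, moID)`. Uniqueness in the data alone does not identify the real key, which is one reason domain knowledge or a semi-automatic step is useful.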

PAGE 9

Step 1: decompose schema into processes

find (e.g. for ProductOrder and MaterialOrder):
1. sets of corresponding tables
2. links between those

Automatic Schema Summarization
= group similar tables through clustering

define a distance between any 2 tables
• by relations
• by information content

tables that are close to each other → same cluster

# of clusters: user input

PAGE 10

Automatic Schema Summarization

1. structural distance between tables

fanout ~ avg. # of child records related to the same parent record

PAGE 11

example: parent table with key column A and records 1, 2

child table (A, B)                  fanout
(1, X), (2, Y)                      1
(1, X), (1, Y), (2, Z), (2, U)      2
(1, X), (1, Y)                      1 = (2+0)/2

Automatic Schema Summarization

1. structural distance between tables

fanout ~ avg. # of child records related to the same parent record

matched fraction ~ 1 / (fraction of records in parent with matching child record)

PAGE 12

example: parent table with key column A and records 1, 2

child table (A, B)                  fanout   m.fr
(1, X), (2, Y)                      1        1
(1, X), (1, Y), (2, Z), (2, U)      2        1
(1, X), (1, Y)                      1        2 = 1 / (1/2)
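The two structural measures can be recomputed for the example tables with a short sketch. The function names are assumptions, and how the tool turns fanout and matched fraction into a distance is not stated on the slides, so only the two measures themselves are shown.

```python
parent = [1, 2]  # key column A of the parent table

children = {
    "child 1": [(1, "X"), (2, "Y")],
    "child 2": [(1, "X"), (1, "Y"), (2, "Z"), (2, "U")],
    "child 3": [(1, "X"), (1, "Y")],
}


def fanout(parent_keys, child_rows):
    """Average number of child records per parent record."""
    related = [row for row in child_rows if row[0] in parent_keys]
    return len(related) / len(parent_keys)


def matched_fraction(parent_keys, child_rows):
    """1 / (fraction of parent records that have a matching child record)."""
    matched = {row[0] for row in child_rows} & set(parent_keys)
    # Raises ZeroDivisionError if no parent record has a child at all.
    return len(parent_keys) / len(matched)


for name, rows in children.items():
    print(name, fanout(parent, rows), matched_fraction(parent, rows))
```

For the third child table only parent record 1 has children, so the fanout is (2+0)/2 = 1 and the matched fraction is 1/(1/2) = 2, as in the table above.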

Grouping by Clustering

1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance
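The entropy in step 2 can be sketched as follows, assuming it means the Shannon entropy of the empirical distribution of records; the slides do not give the exact definition, and `table_entropy` is a hypothetical name. With n records the value is at most log2(n), reached when all records differ.

```python
import math
from collections import Counter


def table_entropy(rows):
    """Shannon entropy (in bits) of the empirical distribution of the rows."""
    counts = Counter(rows)
    n = len(rows)
    # p * log2(1/p) with p = c / n, summed over the distinct rows.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# OrderedMaterial rows from the slides: all four records differ.
ordered_material = [
    ("po1", "mo3", "B"),
    ("po1", "mo4", "A"),
    ("po2", "mo3", "B"),
    ("po2", "mo4", "C"),
]
# Projection on the 'type' column: the value B occurs twice.
types = [row[2] for row in ordered_material]

print(table_entropy(ordered_material))  # maximal: log2(4)
print(table_entropy(types))
```

The full table reaches the maximum of 2 bits, while the `type` column alone has a lower entropy of 1.5 bits because one value repeats.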

PAGE 13

most important table of cluster = table with least distance to all other tables

→ key attribute of the cluster
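Picking the most important table can be sketched as choosing the table with the smallest summed distance to the others in its cluster. Reading "least distance to all" as a sum is an assumption, the distance values below are invented for illustration, and `most_important_table` is a hypothetical name.

```python
def most_important_table(cluster, distance):
    """Return the table with the smallest summed distance to the other tables."""
    def total(table):
        return sum(distance[frozenset((table, other))]
                   for other in cluster if other != table)
    return min(cluster, key=total)


# Invented pairwise distances; the slides give no concrete numbers.
distance = {
    frozenset(("ProductOrder", "OrderedMaterial")): 0.2,
    frozenset(("ProductOrder", "Customer")): 0.3,
    frozenset(("OrderedMaterial", "Customer")): 0.6,
}
cluster = ["ProductOrder", "OrderedMaterial", "Customer"]

center = most_important_table(cluster, distance)
print(center)
```

With these numbers ProductOrder is selected, so its primary key poID would serve as the key attribute of the cluster.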

PAGE 14

Artifact Schema → Artifact Log

OrderedMaterial
poID   moID   type   added
po1    mo3    B      30-08 13:13
po1    mo4    A      30-08 13:14
po2    mo3    B      30-08 13:15
po2    mo4    C      30-08 13:16

ProductOrder
poID   cust.   …   created       processed     built         shipped
po1    Alice   …   30-08 9:22    30-08 13:12   01-09 15:12   03-09 10:15
po2    Bob     …   30-08 10:15   30-08 13:14   01-09 16:13   03-09 17:18

PAGE 15

Log Extraction

log for PO

cluster = set of related tables + primary key of most important table
• primary key of most important table → case id

po1: (created, poID=po1, time=30-08 9:22, …)
po2:

PAGE 16

Log Extraction

log for PO

cluster = set of related tables + primary key of most important table
• primary key of most important table → case id
• time-stamped attribute → event

po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)

PAGE 17

Log Extraction

log for PO

cluster = set of related tables + primary key of most important table
• primary key of most important table → case id
• time-stamped attribute → event
• related attributes → event attributes

po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)

PAGE 18

Log Extraction

log for PO

cluster = set of related tables + primary key of most important table
• primary key of most important table → case id
• time-stamped attribute → event
• related attributes → event attributes

po1:
(created, poID=po1, time=30-08 9:22, cust.=Alice, …)
(processed, poID=po1, time=30-08 13:12, …)

PAGE 19

Log Extraction

log for PO

cluster = set of related tables + primary key of most important table
• primary key of most important table → case id
• time-stamped attribute → event
• related attributes → event attributes

po1:
(processed, poID=po1, time=30-08 13:12, …)
(added, poID=po1, time=30-08 13:13, moID=mo3, …)
   moID=mo3 refers to artifact “MaterialOrder”
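The extraction rules above can be sketched on the example tables. This is a minimal illustration under stated assumptions, not the prototype's code: tables are plain lists of dicts, the year 2012 is added because the slides omit it, and the names `extract_log`, `parse`, and `TIMESTAMP_COLUMNS` are hypothetical.

```python
from datetime import datetime

YEAR = "2012"  # assumption: the slides show only day-month and time

product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22",
     "processed": "30-08 13:12", "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15",
     "processed": "30-08 13:14", "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]
TIMESTAMP_COLUMNS = {"created", "processed", "built", "shipped", "added"}


def parse(ts):
    """Parse 'day-month hour:minute' using the assumed year."""
    return datetime.strptime(f"{YEAR}-{ts}", "%Y-%d-%m %H:%M")


def extract_log(tables, case_id):
    """Turn every time-stamped attribute into an event of the row's case."""
    log = {}
    for table in tables:
        for row in table:
            # Related (non-time-stamp) attributes become event attributes.
            attributes = {k: v for k, v in row.items() if k not in TIMESTAMP_COLUMNS}
            for column in row.keys() & TIMESTAMP_COLUMNS:
                event = {"activity": column, "time": parse(row[column]), **attributes}
                log.setdefault(row[case_id], []).append(event)
    for trace in log.values():
        trace.sort(key=lambda e: e["time"])
    return log


log = extract_log([product_order, ordered_material], case_id="poID")
for case, trace in log.items():
    print(case, [e["activity"] for e in trace])
```

Each "added" event keeps its `moID` attribute, which is the reference to the MaterialOrder artifact noted on the slide.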

PAGE 20

Outline

[Figure: outline of the approach: decompose by primary keys into a log for quote and a log for order; discovery yields a model for order and a model for quote; the models are composed by primary/foreign-key relations into a process model]

PAGE 21

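Resulting Model(s)

Product Order: created, processed, added, built, shipped
Material Order: added, completed, sent, received

[Figure: the two lifecycle models are linked with multiplicities 1..* and 1..* through the shared event (added, poID=po1, …, moID=mo3)]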

prototype tool
• input: relational database (via JDBC), .csv tables
• steps
  − discover database schema (types, keys, relations)
  − discover artifact schema
      − by k-means clustering
      − by user picking tables
  − extract logs → ProM

PAGE 22

Implementation & Evaluation

> 300 tables, > 40 GiB of data

schema extraction

clustering

log extraction

PAGE 23

Evaluation: SAP System of Sligro

time-stamp attributes: 15 hrs

primary keys: 4 hrs

foreign keys: 5 hrs (single col.) / 6 days (double col.)

entropies: 17 hrs

table distances: 5 hrs

clustering: a few seconds

~20 different artifacts found

largest: 47 tables, 869 columns

extract 1000 traces of > 246,000 events

query database: 1 hr

write log file: 32 hrs

PAGE 24

Sligro: Artikel lifecycle model

performance

• key discovery: NP-complete in R (# of records)

• foreign key discovery: NP-complete in R²

• problem is in the “hard part” of NP

• sampling of data, domain knowledge, semi-automatic
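The sampling idea can be sketched for single-column foreign keys: a child column is a foreign-key candidate if all of its values occur in the parent key column, and checking only a random sample of the child values trades certainty for speed. This is an illustration, not the tool's method; the function name, the `sample_size` parameter, and the fixed seed are assumptions.

```python
import random


def is_foreign_key_candidate(child_values, parent_values, sample_size=None, seed=0):
    """Check that (a sample of) the child values all occur in the parent column."""
    parent_set = set(parent_values)
    values = list(child_values)
    if sample_size is not None and sample_size < len(values):
        # A sample can miss violations, so a True result is only a candidate.
        values = random.Random(seed).sample(values, sample_size)
    return all(v in parent_set for v in values)


# Columns taken from the example database on the earlier slides.
material_order_ids = ["mo3", "mo4"]                  # MaterialOrder.moID
ordered_material_mo = ["mo3", "mo4", "mo3", "mo4"]   # OrderedMaterial.moID
ordered_material_po = ["po1", "po1", "po2", "po2"]   # OrderedMaterial.poID

print(is_foreign_key_candidate(ordered_material_mo, material_order_ids))
print(is_foreign_key_candidate(ordered_material_po, material_order_ids))
print(is_foreign_key_candidate(ordered_material_mo, material_order_ids, sample_size=2))
```

A sampled check never rejects a true foreign key, but it can accept a false one, so accepted candidates still need confirmation by a full check or by domain knowledge.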

requires good database structure

• proper relations, proper keys

• otherwise wrong clusters are formed

• events don’t get the right attributes

• semi-automatic approach

events shared by multiple cases… working on it…

PAGE 25

Open issues
