Model Comparison for Delta-Compression
Markus Scheidgen
@mscheidgen
BigMDE at STAF 2016, Vienna
Agenda
▶ Motivation for Delta-Compression
▶ Model Comparison: Approaches
▶ Experiments
▶ Conclusions
Motivation – Delta-Compression
▶ What it is: Only store the differences of similar models
▶ Where do we have a lot of similar models:
■ Model Versioning
■ Model-based Mining of Source Repositories with reverse engineering
▶ Why: Storage space and (indirectly) execution time for persistence operations (I/O, etc.)
Model Versioning – Approaches
(or versioning in general)
▶ state-based: every revision (r0, r1, r2, r3) is stored as a whole, e.g. models stored in regular version control systems (VCS)
▶ change-based (or operation-based): e.g. EMF-store; requires recording or inferring operations from the editing environment. Or compare?
▶ hybrid (persist changes, appear state-based): e.g. Git: you only see whole files, but internally it uses pack files with delta compression
1. Altmanninger, K., Seidl, M., Wimmer, M.: A survey on model versioning approaches. International Journal of Web Information Systems 5(3), 271–304 (2009)
Model Versioning – Architecture
[Diagram, built up over several slides: user environment, interface, state representation, compression, persistence; annotated with "show existing diff" and "create diff"]
Model Comparison – Tradeoffs
▶ Comparison Quality
▶ Comparison Time
▶ Difference Model Usability
▶ Extraction Time
▶ Storage Space
Delta-Compression – Tradeoffs
▶ same five criteria, weighted differently:
■ priority for showing diffs to users [model comparison]
■ priority for using diffs in persistence [compression]
Model Comparison
▶ We have long known how to compare lists (e.g. lines of code)
■ Myers' algorithm: O(N*D)
▶ Models aren't lists, but graphs (with spanning trees)
■ but each feature in each model element is a "list" of values
■ we can compare two model elements, feature by feature
■ but which pairs of elements should we compare?
■ we need a prior step to establish pairs of (supposedly) matching elements
Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(1–4), 251–266 (1986)
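To make the O(N*D) bound concrete, here is a minimal sketch of the greedy core of Myers' algorithm. It only computes D, the length of the shortest edit script, and does not recover the script itself. The function name is illustrative and not taken from any tool mentioned on these slides.

```python
def myers_edit_distance(a, b):
    """Return D, the minimal number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] holds the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # move down: take an element from b (insertion)
            else:
                x = v[k - 1] + 1    # move right: drop an element from a (deletion)
            y = x - k
            # Follow the diagonal while the sequences agree (free moves).
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

The outer loop runs D + 1 times and the inner loop visits at most D + 1 diagonals, so similar inputs (small D) are cheap to compare, which is exactly the situation with consecutive revisions.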
Model Matching
[Diagram: model 1 and model 2 are matched; the matches yield the differences]
▶ Matching determines the quality of the comparison
▶ Different strategies for matching model elements
■ signatures (cheap): just meta-class, [name, parameter types], parent
■ similarity (expensive): lots of different criteria and heuristics
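Signature-based matching can be sketched as follows, assuming a signature built from meta-class, name, parameter types, and the parent's signature. The `Element` class and both function names are hypothetical and are not the EMF-Compress API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a model element."""
    meta_class: str
    name: Optional[str] = None
    param_types: Tuple[str, ...] = ()
    parent: Optional["Element"] = None


def signature(e):
    """Signature = meta-class, name, parameter types, and the parent's signature."""
    parent_sig = signature(e.parent) if e.parent is not None else None
    return (e.meta_class, e.name, e.param_types, parent_sig)


def match_by_signature(old, new):
    """Pair elements with equal signatures; return (matches, unmatched_old, unmatched_new)."""
    index = {}
    for e in old:
        index.setdefault(signature(e), e)  # on collisions, keep the first element
    matches, unmatched_new = [], []
    for e in new:
        partner = index.pop(signature(e), None)
        if partner is None:
            unmatched_new.append(e)
        else:
            matches.append((partner, e))
    return matches, list(index.values()), unmatched_new
```

Each element costs one signature computation and one hash lookup, which is why this strategy sits at the cheap end; a similarity-based matcher would instead score many candidate pairs.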
Comparison Representation for Compression
model 1 + Δ(1,2) = model 2
model 2 + Δ(2,3) = model 3
...
model n + Δ(n,n+1) = model n+1
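The delta chain can be illustrated with a small sketch that stores only model 1 and the deltas, then restores any later model by patching. Models are simplified here to plain dictionaries from element id to value; `Delta`, `diff`, `patch`, `compress`, and `restore` are illustrative names, not the difference meta-model of any tool named on these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    changed: dict = field(default_factory=dict)  # added or modified entries
    removed: set = field(default_factory=set)    # keys that were deleted


def diff(old, new):
    """Compute a delta such that patch(old, delta) == new."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = {k for k in old if k not in new}
    return Delta(changed, removed)


def patch(model, delta):
    """Apply a delta to a model and return the resulting model."""
    result = {k: v for k, v in model.items() if k not in delta.removed}
    result.update(delta.changed)
    return result


def compress(revisions):
    """Keep the first model in full and only deltas for all later ones."""
    deltas = [diff(a, b) for a, b in zip(revisions, revisions[1:])]
    return revisions[0], deltas


def restore(base, deltas, n):
    """Rebuild model n (1-based) by applying the first n - 1 deltas to the base."""
    model = base
    for delta in deltas[:n - 1]:
        model = patch(model, delta)
    return model
```

Restoring model n requires applying n - 1 patches, so patch (extraction) time matters for persistence just as much as the stored size does.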
EMF-Compress
▶ We built a comparison framework for compression
■ signature-based matching
■ difference meta-model that allows patching
▶ http://github.com/markus1978/emf-compress
Experiments
▶ Reverse engineered Git repositories containing Java code, using MoDisco
▶ Eclipse Foundation sources, i.e. the Eclipse platform and plug-ins
▶ organized in different projects: JDT, CDT, EMF, ...
▶ available via GitHub
▶ Git repositories can be gathered automatically via GitHub's REST API
▶ We used the 200 largest Eclipse repositories that actually contained Java code: 6.6 GB of Git data, 400 MLOC, 250 GB of (binary) models with 4 billion objects.
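As a rough sketch of how such a selection could be automated: GitHub's REST API lists an organization's repositories page by page, and each repository object carries `size`, `language`, and `clone_url` fields. The organization name `eclipse` and the use of the primary `language` field as a proxy for "contains Java code" are assumptions here; the slides do not say how the selection was actually done. The sketch only builds the request URL and ranks already-fetched results, so it performs no network access.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def org_repos_url(org, page, per_page=100):
    """URL of one page of an organization's repository listing."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{API_ROOT}/orgs/{org}/repos?{query}"


def largest_java_repos(repos, limit):
    """Pick clone URLs of the largest repositories whose primary language is Java.

    `repos` is a list of repository objects (dicts) as returned by the listing.
    """
    java = [r for r in repos if r.get("language") == "Java"]
    java.sort(key=lambda r: r.get("size", 0), reverse=True)
    return [r["clone_url"] for r in java[:limit]]
```

For example, `org_repos_url("eclipse", 1)` yields the first page; the JSON pages would be concatenated and passed to `largest_java_repos(all_repos, 200)`.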
Experiment 1: EMF-Compare vs EMF-Compress
▶ Only the first 1000 revisions of the 100 largest (by Git size) repositories; only CUs (compilation units) with fewer than 20k elements: ~300k individual comparisons
[Figure: three bar charts. Panels: differences model size (%), number of matches (%), and avg. execution times (log scale; avg. time per compilation unit in ms). Bar labels include signature, similarity, lines, and parse.]
[Figure: execution time (s) plotted against size (#objects, 0 to 20000), with panels for similarity-based and signature-based matching; the time axes run to 30 s in one row of panels and to 2.0 s in the other.]
Experiment 2: Partial Comparison
▶ Problem: not all elements have a meaningful signature
▶ Two signature matching strategies:
■ Named-only: only match named elements, use equality for the contents of named elements
■ Meta-class: match named elements based on their signature, match the contents of named elements based on their parent and meta-class only
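The difference between the two strategies can be sketched as a single signature function with a strategy switch. Elements are simplified to dictionaries with `meta_class`, an optional `name`, and an optional `parent`; the function and the strategy strings are hypothetical, not the EMF-Compress implementation.

```python
def signature(element, strategy):
    """Return a hashable signature, or None if the element is not matched individually.

    strategy is "named-only" or "meta-class".
    """
    parent = element.get("parent")
    parent_sig = signature(parent, strategy) if parent else None
    if element.get("name") is not None:
        # Both strategies match named elements by their signature.
        return (element["meta_class"], element["name"], parent_sig)
    if strategy == "meta-class":
        # Contents of named elements: matched by parent and meta-class only.
        return (element["meta_class"], parent_sig)
    # named-only: unnamed contents get no signature and are compared by equality.
    return None
```

Under the meta-class strategy, two unnamed siblings of the same meta-class share a signature, so a real matcher would presumably need a tie-break such as position within the parent.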
[Figure: six bar charts over the repositories cdt, cdo, ...ompare, emf, and ...e.core, each with the series Delta, UC, and Full. Panels: Size - Named-only Matcher (GB); Size - Meta Class Matcher (GB); Lines (MLines); All vs Matched - Named-only Matcher (MObjects); All vs Matched - Meta Class Matcher (MObjects); All vs Matched Lines (MLines).]
Discussion
▶ Only reverse engineered Java models were used; results may differ for other meta-models
▶ Similarity-based matching is not all alike: more evaluation with similarity-based matchers of different quality is necessary
▶ Signatures are not necessarily meta-model independent
▶ We need a better understanding of the relationship between comparison runtime and comparison quality
Conclusions
▶ EMF Compare is tailored for difference model usability, not for model-compression
■ insufficient execution times
■ wrong representation of differences
▶ EMF-Compress is an alternative: http://github.com/markus1978/emf-compress
▶ Better analysis of matching strategies necessary:
■ evaluation of more matching strategies
■ evaluation with models in different languages
Model-based Mining of Software Repositories
▶ MSR tools are already “model-based”, but in a proprietary manner
▶ Idea: use an existing reverse engineering framework with its corresponding standard meta-models and modeling frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems, different languages)
■ reuse of existing meta-models, transformations, and languages
■ interoperability with existing analysis tools
■ retaining meaningful scalability
Model-based Mining of Software Repositories
▶ Scope
■ depends on the concrete MSR application and its goals
■ number of software projects: single repositories, large repositories, ultra-large repositories
■ sources as text and text-based metrics, e.g. LOC
■ declarations only: packages, classes, methods, but no statements, expressions, etc.
■ full AST with or without cross-references
Reverse Engineering with MoDisco
▶ Model Discovery
▶ reverse engineering for Java, based on EMF
▶ discovery, i.e. finding sources (so called compilation units) within projects, source folders, and packages
▶ uses Eclipse's workspace and Java Development Tools (JDT)
▶ provides
■ discoverers for many languages: Java, xText, JSP, XML
■ instances of a Java EMF meta-model that corresponds to the handwritten JDT AST model
■ transformations to language-independent artifacts, e.g. KDM
From Source Code- to Model-Repository
[Diagram: a version control system (A1 B1, with the changes A1-A2 and B1-B3) is turned into the snapshots A1 B1, A2 B1, and A2 B3 and the models M1, M2, M3 using the operations Checkout(r), d2CUs(r), Parse(d), Merge(r), Analysis(r), Save(r), and Load(r). Four strategies are contrasted by their cost terms over revisions R and compilation units CUs: Checkout + Parse + Analysis; Checkout + (Parse + Merge) + Analysis; (Load + Merge) + Analysis; Load + Analysis. The summation symbols were lost in extraction.]
Model-based MSR Strategies
[Diagram, built up over several slides: a version control system holding A1 B1 plus the changes A1-A2 and B1-B3 is checked out revision by revision with Checkout(r) into the snapshots A1 B1, A2 B1, and A2 B3. d2CUs(r) and Parse(d) turn each snapshot into a model (M1, M2, M3) for Analysis(r). Later builds add Merge(r) together with the labels f and B.f, and finally Load(r) and Save(r).]
Importing and Traversing Source Code Repositories
[Diagram, built up over several slides: for the revisions R1, R2, and R3, an import column (persistent/storage) and a traverse column (transient/runtime) show combinations of the compilation-unit versions A1, A2, B1, and B3 and the reference f. In later builds the imported data carries placeholders such as "B.f ?", and traversed combinations are marked ✓ or ✗.]