Reference Representation in Large Metamodel-based Datasets


This presentation was held at the BigMDE Workshop (at STAF) in Budapest, 2013


Markus Scheidgen

Model representations for large meta-model based data-sets

■ Introduction: Technological spaces and model representations
■ Comparison of representations
■ Implementation
■ Application

Introduction: Technological Spaces

[Figure: technological spaces surrounding software models, labeled with their typical uses]
■ Software models: model-transformation/-constraints/-queries
■ Code: reverse engineering, code generation, static analysis/compilation/refactoring
■ XML: persistence/exchange, processing (e.g. DOM/JAXB), exchange (e.g. in web services), XSLT/XSL/XQuery/XPath
■ Databases: persistence/versioning, processing (via ORMs, e.g. JPA), SQL
■ Objects (e.g. POJOs): debugging/profiling, reflection, runtime modeling, running programs
■ Other data

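Introduction: State of the Art

[Figure: technological spaces as pairs of describing and described artifacts]
■ Meta-Models / Models
■ Schemas / XML
■ Grammars / Code
■ Classes / Objects
■ ER-Schemas / Relational Data

Purposes named in the figure: visualization and editing by human users; processing in computer programs; exchange; large data-sets/persistence and querying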

Introduction: New Class of DBMS

[Figure: the pairs Meta-Models / Models, Schemas / XML, Grammars / Code, Classes / Objects, and ER-Schemas / Relational Data, extended by Big Data, Graphs, and ER-Schemas / Big Relational Data; one link is marked with an open "?"]

Representation: Strategies

References \ Objects | Object-by-object | Fragments
Part-of-source       | Morsa, (Java)    | XMI, EMF-Frag
Relations            | CDO              | ?

Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results)

[Figure: log-log plot of execution time t (in ms) over the number of loaded objects l (10^0 to 10^6), with curves for no fragmentation (f = m), optimal fragmentation, and total fragmentation (f = 1); a second axis shows the fragment size f (1e+00 to 1e+06)]
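The trade-off between the three curves can be illustrated with a toy cost model. This is only a sketch under assumptions that are not taken from the slides: every fragment access pays a fixed overhead (`access_overhead`), a fragment is always loaded as a whole, and the `l` objects of interest lie in consecutive fragments. All function names and constants are hypothetical.

```python
import math


def traversal_cost(l, f, access_overhead=1.0, per_object_cost=0.01):
    """Toy cost (arbitrary time units) of loading l consecutive objects
    from a model stored in fragments of f objects each.

    Assumptions (not from the slides): each fragment access pays a fixed
    overhead, and a fragment is always loaded as a whole.
    """
    fragments_touched = math.ceil(l / f)
    return fragments_touched * (access_overhead + per_object_cost * f)


def best_fragment_size(l, m):
    """Return the power-of-ten fragment size in [1, m] with the lowest toy cost."""
    candidates = []
    f = 1
    while f <= m:
        candidates.append(f)
        f *= 10
    return min(candidates, key=lambda size: traversal_cost(l, size))


m = 10**6  # total number of objects in the model
for l in (1, 10**3, 10**6):
    # columns: loaded objects, cost for f = 1, cost for f = m, best f
    print(l, traversal_cost(l, 1), traversal_cost(l, m), best_fragment_size(l, m))
```

In this model, total fragmentation (f = 1) is cheap when few objects are loaded, no fragmentation (f = m) is cheap when the whole model is loaded, and a fragment size close to the number of loaded objects sits between the two extremes.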

Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results vs. implementation)

[Figure, panel 1: execution time t (in ms) over the number of loaded objects l (10^0 to 10^6), with curves for no fragmentation (f = m), optimal fragmentation, and total fragmentation (f = 1); fragment size f from 1e+00 to 1e+06]
[Figure, panel 2: execution time t (in ms) over the number of loaded objects l (10^0 to 10^6) for fragment sizes f from 1e+01 to 1e+05, with the optimal fragmentation marked]

Representation: Object-by-object vs. Fragmentation (considering traversal, implementation with actual model)

■ Model traversal of GraBaTs models with four different sizes and different characteristics

[Figure, panel 1: objects per second (in 10^4) for set0 to set4, comparing XMI, CDO, Morsa, EMFFrag coarse, and EMFFrag fine; two bars are labeled as not measured but extrapolated]
[Figure, panel 2: number of fragments (10^3 to 10^7) for set0 to set4, comparing CDO/Morsa, EMFFrag coarse, and EMFFrag fine]

Representation: Object-by-object vs. Fragmentation (considering query, implementation with actual model)

■ Query of GraBaTs models with four different sizes and different characteristics

[Figure, panel 1: number of fragments (10^3 to 10^7) for set0 to set4, comparing CDO/Morsa, EMFFrag coarse, and EMFFrag fine]
[Figure, panel 2: execution time (in s, 0 to 350) for set0 to set4, comparing XMI, CDO w/o SQL, CDO, Morsa w/o index, Morsa, EMFFrag coarse, and EMFFrag fine; four bars are labeled as not measured but extrapolated]

Representation: Part-of-source vs. Relations (real implementation, artificial model)

[Figure, panel 1: part-of-source implementation; execution time in ms (10^1 to 10^4) over the number of outgoing references (10^0 to 10^6), with curves for access of one outgoing reference and traversal of all outgoing references]
[Figure, panel 2: relation implementation with individual access; same axes and curves]

Representation: Part-of-source vs. Relations (real implementation, artificial model)

[Figure, panel 1: part-of-source implementation; execution time in ms (10^1 to 10^4) over the number of outgoing references (10^0 to 10^6), with curves for access of one outgoing reference and traversal of all outgoing references]
[Figure, panel 2: relation implementation with scanning; same axes and curves]

Implementation: EMF-Fragments

[Figure: technology stack]
■ “Share Nothing” nodes (cluster, ad-hoc network)
■ DFS (HDFS)
■ key-value store (HBase)
■ map/reduce (Hadoop)
■ meta-model: structured data (data-sets), structured data model transformations (applications)

Implementation: Datastore mapping

[Figure: mapping of a metamodel onto the datastore, contrasting regular containment, part-of-source fragmentation, and relation-based fragmentation]
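The difference between storing references as part of their source and storing them as relations can be sketched with two small in-memory stores. This is an illustration only, not the EMF-Fragments implementation: the class names, the `(source, index)` key layout, and the use of a sorted key list as a stand-in for a sorted key-value store are all assumptions.

```python
from bisect import bisect_left, insort


class PartOfSourceStore:
    """All outgoing references are stored in one value, together with the source."""

    def __init__(self):
        self.entries = {}  # source id -> list of target ids

    def put(self, source, targets):
        self.entries[source] = list(targets)

    def one(self, source, index):
        # Reading a single reference still fetches the whole value.
        return self.entries[source][index]

    def all(self, source):
        return list(self.entries[source])


class RelationStore:
    """Each reference is its own entry under a (source, index) key."""

    def __init__(self):
        self.keys = []      # sorted keys, stand-in for a sorted key-value store
        self.entries = {}   # (source id, index) -> target id

    def put(self, source, targets):
        for index, target in enumerate(targets):
            key = (source, index)
            if key not in self.entries:
                insort(self.keys, key)
            self.entries[key] = target

    def one(self, source, index):
        return self.entries[(source, index)]  # one individual lookup

    def all_individual(self, source):
        # One lookup per reference ("individual access").
        targets, index = [], 0
        while (source, index) in self.entries:
            targets.append(self.entries[(source, index)])
            index += 1
        return targets

    def all_by_scan(self, source):
        # One range scan over the sorted keys ("scanning").
        position = bisect_left(self.keys, (source, 0))
        targets = []
        while position < len(self.keys) and self.keys[position][0] == source:
            targets.append(self.entries[self.keys[position]])
            position += 1
        return targets


part_of_source = PartOfSourceStore()
relations = RelationStore()
for store in (part_of_source, relations):
    store.put("a", ["b", "c", "d"])
    store.put("b", ["x"])
print(part_of_source.one("a", 1), relations.one("a", 1))
print(part_of_source.all("a"), relations.all_by_scan("a"))
```

With part-of-source storage, the cost of reading one reference grows with the number of references stored in the same value; with relations, a single reference is one lookup, and a full traversal is either many lookups or one scan.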

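Implementation: Meta-model-based declaration of representations

[Figure: example meta-model with the classes Project, Package, CompilationUnit, Class, Field, and Method, connected by *-references; three references are annotated «fragments», and the Call reference is annotated «relation»]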
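How such annotations could drive fragmentation can be sketched as follows. This is a minimal illustration, not EMF-Fragments code: models are plain dictionaries, the set `FRAGMENTING` stands in for the «fragments» annotations, and which references carry the annotation (here `Project.packages` and `Package.units`) is an assumption made for the example.

```python
# Hypothetical stand-in for «fragments» annotations: (class name, reference name).
FRAGMENTING = {("Project", "packages"), ("Package", "units")}


def fragment(obj, fragments, path="/"):
    """Copy obj, moving children of fragmenting references into own fragments.

    obj is a dict with "type", "name", and containment references as lists.
    """
    local = {"type": obj["type"], "name": obj["name"]}
    for feature, children in obj.items():
        if feature in ("type", "name"):
            continue
        entries = []
        for index, child in enumerate(children):
            child_path = f"{path}{feature}.{index}/"
            if (obj["type"], feature) in FRAGMENTING:
                # The child starts a new fragment; the parent keeps a proxy only.
                fragments[child_path] = fragment(child, fragments, child_path)
                entries.append({"proxy": child_path})
            else:
                # Regular containment: the child stays in the parent's fragment.
                entries.append(fragment(child, fragments, child_path))
        local[feature] = entries
    return local


def fragment_model(root):
    """Return a dict that maps fragment paths to fragment contents."""
    fragments = {}
    fragments["/"] = fragment(root, fragments)
    return fragments


model = {
    "type": "Project", "name": "p",
    "packages": [
        {"type": "Package", "name": "a", "units": [
            {"type": "CompilationUnit", "name": "A.java", "classes": [
                {"type": "Class", "name": "A", "methods": [
                    {"type": "Method", "name": "run"}]}]}]},
        {"type": "Package", "name": "b", "units": []},
    ],
}
fragments = fragment_model(model)
print(sorted(fragments))
```

Classes and methods stay inside the fragment of their compilation unit because their references are not annotated in this example.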

Implementation: Architecture

[Figure: class diagram relating regular EMF classes and EMF-Fragments classes]
■ Visible API: FragmentedModel extends Resource; FObject extends EObject «reflective feature delegation»
■ FStore extends EStore «singleton, stateless»
■ Fragment extends Resource; FInternalObject extends DynamicEObject
■ FValueSetList, alongside EList and EObjectEList
■ ResourceSet, URIHandler, DataStore (database)
■ Associations in the diagram are labeled «derived» and «delegates»

Applications: Mining and Analyzing Software Repositories

■ Software repositories contain more information than the current software code:
  ■ “developers who changed class/method/statement X also changed class/method/statement Y”
  ■ this information leads to knowledge about dependencies that cannot be determined through static or even dynamic analysis
  ■ this can be used to
    • predict/find bugs
    • understand/improve the code-base
  ■ dependency information should be stored as relational data

■ When a piece of software evolves, its metrics change. Such dynamic metrics describe software better than static code metrics. This could lead to a better assessment of methodologies or a better understanding of software engineering in general.
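The "who changed X also changed Y" idea can be sketched as a simple co-change count over revisions. This is a minimal illustration with made-up data; the function names and the `min_support` threshold are assumptions and not part of the presented tooling.

```python
from collections import Counter
from itertools import combinations


def co_changes(revisions):
    """Count how often each pair of elements was changed in the same revision."""
    counts = Counter()
    for changed in revisions:
        for a, b in combinations(sorted(set(changed)), 2):
            counts[(a, b)] += 1
    return counts


def also_changed(counts, element, min_support=2):
    """Elements that were changed together with `element` at least min_support times."""
    partners = []
    for (a, b), count in counts.items():
        if count >= min_support and element in (a, b):
            partners.append(b if a == element else a)
    return sorted(partners)


# Each revision lists the compilation units it changed (made-up example data).
revisions = [["A", "B", "C"], ["A"], ["A", "B"], ["A", "D"]]
counts = co_changes(revisions)
print(counts[("A", "B")], also_changed(counts, "A"))
```

The resulting pairs are naturally relational data: each pair with its count is one tuple, independent of the containment structure of the code model.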


Applications: Mining and Analyzing Software Repositories

■ JGit: Java implementation of the Git version control system
■ MoDisco: reverse engineering framework for Eclipse Java projects based on EMF
■ EMF-Compare: determines matches and differences between models
■ EMF-Fragments: my own framework for large models
■ over 300 Git repositories with Eclipse plug-ins that constitute the whole Eclipse Foundation source base as “example” data-set

Applications: Model of a Software Repository

[Figure: a software repository as a model and its metamodel. Model level: revisions B1.R1 to B1.R4 and B2.R1 to B2.R2, shown with sets of compilation units such as A B C, A, A B, and A D. Metamodel level: Repository, Revision (prev/next), Diff, CompilationUnit, Model, Package, Class, ...; JGit and MoDisco are named as the sources of the two parts. References are annotated «fragmentation», «relation», or «relation,fragmentation»; further labels: usageInPackageAccess, package, extends]

Summary

■ Choosing the right representation makes a difference
■ Meta-model-based declaration of representations works (might not be good enough)
■ There are applications that can benefit from different representations


Backup


Possible Approaches: Different Target Platforms

[Figure: mappings from models (*) to different target platforms]
■ Schemas / XML: XMI, XMI+Resources
■ ER-Schemas / Relational Data: ORM; ACID, structured data
■ ER-Schemas / Big Relational Data: ORM? [2]; BASE, structured data
■ Big Data, Graphs: BASE, structured data; CAP theorem [1]

[1] Eric A. Brewer: Towards robust distributed systems; 19th ACM Symposium on Principles of Distributed Computing, 2000
[2] K. Barmpis and D.S. Kolovos: Comparative Analysis of Data Persistence Technologies for Large-Scale Models; XM 2012

Possible Approaches: Different Types of Mapping

[Figure: types of mapping from models (*) to storage]
■ per-object mapping to ER-Schemas / Relational Data: fast query, slow traversal, slow entry, (fine transactions)
■ per-object mapping to Big Data [1]: fast query, slow traversal, slow entry, (fine transactions)
■ fragmentation to Big Data / ER-Schemas / Big Relational Data: slow query, fast traversal, fast entry, (coarse transactions)

[1] Javier Espinazo-Pagán, Jesús Sánchez Cuadrado, Jesús García Molina: Morsa, A Scalable Approach for Persisting and Accessing Large Models; MoDELS 2011

Fragmentation: Types of references

■ organizing large artifacts in different resources is already implemented in EMF
■ resources are loaded if necessary; objects in unloaded resources are represented by proxy objects
■ objects in different resources (like all related objects) are related through references; therefore, models are fragmented along references
■ EMF-Fragments automatically fragments large models based on annotations in the meta-model
■ resources are identified via URIs and can be serialized (e.g. as XMI); therefore, resources can be stored in a key-value store
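The combination of URI-identified resources, proxies, and a key-value store can be sketched as follows. This is a simplified illustration and not the EMF or EMF-Fragments API: the `FragmentStore` class, the `frag:/...` URIs, and the use of JSON in place of XMI are assumptions for the example.

```python
import json


class FragmentStore:
    """Loads serialized fragments on demand from a key-value store."""

    def __init__(self):
        self.kv = {}         # URI -> serialized fragment (stand-in for the store)
        self.loaded = {}     # URI -> deserialized fragment (already loaded)
        self.load_count = 0  # number of fragments actually fetched

    def save(self, uri, content):
        self.kv[uri] = json.dumps(content)

    def load(self, uri):
        if uri not in self.loaded:
            self.loaded[uri] = json.loads(self.kv[uri])
            self.load_count += 1
        return self.loaded[uri]

    def resolve(self, value):
        """Replace a proxy by the fragment it points to; pass other values through."""
        if isinstance(value, dict) and "proxy" in value:
            return self.load(value["proxy"])
        return value


store = FragmentStore()
store.save("frag:/0", {"name": "project", "packages": [{"proxy": "frag:/1"}]})
store.save("frag:/1", {"name": "package", "units": []})

root = store.load("frag:/0")                   # loads only the root fragment
package = store.resolve(root["packages"][0])   # following the reference loads frag:/1
print(package["name"], store.load_count)
```

Only the fragments that a traversal actually reaches are fetched and deserialized; everything else stays in the store behind its proxy.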


Fragmentation: Types of references

[Figure: three kinds of *-references: normal references, fragmenting references (annotated «fragments»), and large value sets]

Applications

■ HWL sensor and network operation data (or experiment data in general)
  ■ real-time persistence required ➜ fast data entry
  ■ hierarchically structured data (different sensors and other data sources) ➜ meta-modeling
  ■ queries for experiments, sensors, specific time periods ➜ only coarse, simple queries
  ■ traversal of larger sub-trees, mostly applications based on data aggregation
  ■ actual demand for big data depends on the size of the sensor network ➜ scalability

■ CityGML models (or geo-spatial data in general)
  ■ standardized as XML schemas ➜ XML-based data
  ■ special proprietary indexes (e.g. spatial indexes like R-trees) and corresponding queries
  ■ rather query-intensive applications
  ■ actual demand for big data depends on the LOD (level of detail) of the models ➜ scalability

■ Software Engineering
  ■ Code/Model Version Control
  ■ Mining Software Repositories (MSR)
  ■ revisions of ASTs and differences between ASTs ➜ existing meta-model-based frameworks (e.g. designed for reverse engineering purposes)
  ■ large number of revisions causes many large value sets
  ■ queries for revisions, compilation units ➜ rather coarse queries
  ■ aggregations and statistics ➜ can be expressed in an OCL-like language
  ■ immediate demand for processing in (at least smaller) clusters
  ■ has to be mixed with relational data for some applications

Applications: Scientific Data

[Figure: data sources of the wireless sensor network (WSN): XML documents and Click, turned into models (*) by xml-to-model and text-to-model transformations]

Applications: CityGML

■ XML-based standard ➜ meta-models can be generated (1-to-1 mapping)
■ different standards define XML schemas that extend each other: GML ⇽ CityGML ⇽ extensions
■ transparent use of spatial indexes
■ map onto existing platforms (e.g. SpatialHadoop)
■ use existing implementations and persist into the key-value store

■ extensions to CityGML can be used to reference CityGML models as spatial context for sensor data

backup


Research Overview

[Figure: overview of three connected research areas]
■ WIRELESS SENSOR NETWORKS: sensor data, heterogeneous networks, mesh networks, cellular networks
■ DATA ANALYSIS FRAMEWORK: distributed data stores, distributed analysis, data homogenisation, domain-specific analysis languages
■ GEO INFORMATION SYSTEMS: spatial data, regular databases, spatial databases

HWL: Commodity Hardware

‣ 120+ nodes
‣ indoor and outdoor
‣ dense and sparse
‣ short and long links
‣ stationary and mobile nodes

Markus Scheidgen: HWL – A High-Performance Wireless Sensor Research Network

[Figure: sketch of the test site with numbered node positions and a 10 m scale; directions: "Richtung Groß-Berliner Damm" (towards Groß-Berliner Damm) and "Richtung Institut" (towards the institute)]

Experiments: The Test Site

§ simplest case: two-lane, newly paved road
§ nodes spatially equally distributed on both sides of the road
§ 2x5 nodes
§ homogeneous test-bed: same nodes, equally calibrated, same stone ground
§ one camera to record control data

Experiments: Example Data

[Figure, panel "Amplitudes": time signal of all 3 channels (X, Y, Z); accelerator value (−2 to 2) over time sample (1/400 sec, 0 to 3000)]
[Figure, panel "Frequencies": single-sided amplitude spectrum of channels X, Y, Z; |Y(f)| (0 to 450) over frequency (Hz, 0 to 200)]


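Experiment: Algorithm

§ Similar to earthquake detection: comparison of short and long moving averages (S = 0.2 s, L = 4 s)

\begin{align}
s_x &= x\text{th acceleration value} \tag{1}\\
\operatorname{mavg}(s_x, W) &= \frac{\sum_{i=x-W}^{x} s_i}{W} \tag{2}\\
\bar{s}_x &= \left| s_x - \operatorname{mavg}(s_x, L) \right| \tag{3}\\
w^S_x &= \operatorname{mavg}(\bar{s}_x, S) \tag{4}\\
w^L_x &= \operatorname{mavg}(\bar{s}_x, L) \tag{5}\\
\Delta w &= w^S_x - w^L_x \tag{6}
\end{align}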
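Equations (1) to (6) can be sketched in a few lines of Python. This is one reading of the slide, not the original implementation: it assumes that the signal is first rectified around its long moving average and that both windows are then applied to the rectified signal, that windows are given in samples (at the 1/400 s sampling interval, S = 0.2 s would be 80 samples and L = 4 s would be 1600 samples), and that windows are cut off at the start of the signal. The function names are hypothetical, and a detection threshold on Δw is left to the caller because the slides do not state one.

```python
def mavg(s, window):
    """Moving average as in equation (2): sum of s[x-window .. x] divided by window.

    At the start of the signal, only the available samples are summed.
    """
    prefix = [0.0]
    for value in s:
        prefix.append(prefix[-1] + value)
    return [
        (prefix[x + 1] - prefix[max(0, x - window)]) / window
        for x in range(len(s))
    ]


def delta_w(s, short, long):
    """Difference of short and long moving averages, equations (3) to (6)."""
    baseline = mavg(s, long)
    rectified = [abs(value - base) for value, base in zip(s, baseline)]  # (3)
    w_short = mavg(rectified, short)                                     # (4)
    w_long = mavg(rectified, long)                                       # (5)
    return [a - b for a, b in zip(w_short, w_long)]                      # (6)


# Synthetic example: silence, a short burst of vibration, silence again.
signal = [0.0] * 200 + [1.0, -1.0] * 20 + [0.0] * 200
dw = delta_w(signal, short=8, long=80)
peak = max(range(len(dw)), key=lambda x: dw[x])
print(peak, round(dw[peak], 3))
```

The short average reacts quickly to the burst while the long average lags behind, so Δw rises shortly after the burst begins and turns negative once it has passed.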

Data Management



[Figure: sensor network infrastructure. Technological infrastructure: internet, cellular, wifi, and zigbee links. Logical infrastructure: sensors, information, visualization, actions]

[Figure: the network infrastructure (internet, cellular, wifi, zigbee) set against layers ranging from generic to domain-specific: hard drives, network protocols, radios, machine code, CPUs; programming languages, algorithms/processes, data representation, data bases, distributed programming models, information/knowledge; software engineering and DSL]

Complex Data Types

➡ complex data structures
➡ lots of links between data objects
➡ evolving structures
➡ requires a type-safe programming environment that promotes re-use

Large Amounts of Data

➡ a certain amount of data needs to be stored per second (HWL: 120 nodes)
~140×10^3 data objects per second
~7 MB/s serialized

➡ a certain amount of data needs to be stored altogether (24 h)
~12×10^9 data objects
~600 GB serialized

➡ Data analysis must complete in reasonable time; for live applications, in real time.
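The 24-hour totals follow directly from the per-second rates, which a short calculation confirms. The variable names are chosen for this example; the per-object size and the per-node rate are derived values that the slides do not state, and 1 GB is taken as 10^9 bytes.

```python
objects_per_second = 140 * 10**3   # ~140×10^3 data objects per second
bytes_per_second = 7 * 10**6       # ~7 MB/s serialized
nodes = 120                        # HWL: 120 nodes
seconds_per_day = 24 * 60 * 60     # 86,400 s

objects_per_day = objects_per_second * seconds_per_day
gigabytes_per_day = bytes_per_second * seconds_per_day / 10**9

# Derived values (not stated on the slides).
bytes_per_object = bytes_per_second / objects_per_second
objects_per_node_per_second = objects_per_second / nodes

print(objects_per_day)                      # about 12×10^9 data objects
print(gigabytes_per_day)                    # about 600 GB
print(bytes_per_object)                     # average serialized size per object
print(round(objects_per_node_per_second))   # objects per node and second
```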

From Click to ClickWatch

[Figure: Click API software with its concepts Network Interface, Element, Handler, and CompoundHandler]

Complex Data Types: Meta-Modeling

This [ ] happens all the time in software modeling: state charts, class diagrams, MSCs, OCL

context Foo inv: self.properties->forAll(a | a.x <> a.y)

Eclipse Modeling Framework (EMF)

➡ Distributed storage and links between different types of data are only a simple extension of existing technology: multi-resource persistence is already implemented

Large Amounts of Data: Problem Statement

[Figure: technology stack, ranging from generic to domain-specific]
■ “Share Nothing” nodes (cluster, ad-hoc network)
■ DFS (HDFS)
■ key-value store [1] (HBase)
■ map/reduce [2] (Hadoop)
■ hierarchical data (XML, OGC standards)
■ data series (sensor data)
■ signal analysis, statistics, sensor fusion

[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage system for structured data (awarded best paper!). In Brian N. Bershad and Jeffrey C. Mogul, editors, OSDI, pages 205–218. USENIX Association, 2006.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150. USENIX Association, 2004.

Large Amounts of Data: Approach

[Figure: technology stack with a meta-model layer]
■ “Share Nothing” nodes (cluster, ad-hoc network)
■ DFS (HDFS)
■ key-value store (HBase)
■ map/reduce (Hadoop)
■ meta-model: structured data, structured data model transformations
■ hierarchical data (XML, OGC standards)
■ data series (sensor data)
■ signal analysis, statistics, sensor fusion
