University of Namur, Faculté d'informatique

PReCISE Research Center - Database Engineering Group
www.info.fundp.ac.be/libd

- A (sort of) spatio-temporal view of DB reverse engineering -

Jean-Luc Hainaut

February 5, 2014 Stevens Award lecture WCRE-CSMR 2014

Data matters most

but where has all the semantics gone?

2

• Introduction

• Understanding data semantics

• Data models

• Tracing data semantics

• Recovering hidden data semantics

• Is data semantics recovery that important, actually?

• Summary and conclusions

3

Introduction

4

1. To study the concept of data semantics in business applications

2. To identify and evaluate the techniques used to represent data semantics

3. To observe how these techniques have evolved in time and in different cultures.

4. To discuss the methods used to recover the semantics lost when poor representation techniques have been used.

Objectives of the lecture

5

1. The database is a picture of the application domain

• Its schema is a model of the static structures of the domain

• Its data describe the current state (or a sequence of states) of the domain

The role of data in business applications

2. The database is designed independently of the application programs

The database is designed before the application programs

3. The evolution of the database schema reflects the evolution of the functional requirements

Axioms on databases

4. The database is described by (at least) two schemas:

• the conceptual schema: abstract, platform-independent

formalism: ER model, conceptual UML class diagrams

• the logical schema: concrete, platform-dependent

formalism: SQL2, Java classes

There exists a bidirectional mapping between both.

6

1. The axioms often are ignored by developers

- ignore = how interesting! I didn't know them

- ignore = I know them but they do not suit my way of working

The role of data in business applications

3. The biggest violation of the axioms concerns the existence and role of the conceptual schema

Meta-axioms on axioms on databases

7

Understanding data semantics

Experimental approach and first conclusions

8

Preliminary question

Same data, different structures

Table T

    C1      C2       C3
    C400    Darwen   London
    B512    Owens    NY
    S144    Garcia   Madrid

Table CUSTOMER

    CustID  Name     City
    C400    Darwen   London
    B512    Owens    NY
    S144    Garcia   Madrid

Table T

    C
    C400 Darwen London
    B512 Owens NY
    S144 Garcia Madrid

To what extent does each of these data sets express the semantics of the data?

9

Motivating example. 1. Reading data from a COBOL file (1970)

external file

    SELECT FILE1 ASSIGN TO "FILE1.DAT"
        ORGANIZATION IS INDEXED
        ACCESS MODE IS DYNAMIC
        RECORD KEY IS RKEY.

    FD FILE1.
    01 REC.
       02 RKEY  PIC X(12).
       02 RINFO PIC X(100).

    REC
    RKEY    RINFO
    C400    Darwen London
    B512    Owens NY
    S144    Garcia Madrid

application code (COBOL)

    WORKING-STORAGE SECTION.
    01 CUSTOMER.
       02 CustID PIC X(12).
       02 Name   PIC X(60).
       02 City   PIC X(40).

    READ FILE1 INTO CUSTOMER.

Result: REC (RKEY, RINFO) is read into CUSTOMER (CustID, Name, City), e.g. CustID = B512, Name = Owens, City = NY.
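The effect of READ ... INTO can be mimicked in a few lines of Python. This is only an illustrative sketch: the function name `decode_customer` and the sample record are assumptions, while the field widths (12, 60, 40) are those of the CUSTOMER record declared in the COBOL program. It shows that the file itself only knows RKEY and RINFO; the decomposition of RINFO into Name and City exists solely in the program.

```python
# Field layout taken from the CUSTOMER record of the COBOL program:
# CustID PIC X(12), Name PIC X(60), City PIC X(40).
CUSTOMER_LAYOUT = [("CustID", 12), ("Name", 60), ("City", 40)]


def decode_customer(rec: str) -> dict:
    """Split a 112-character REC (RKEY + RINFO) according to CUSTOMER_LAYOUT."""
    expected = sum(width for _, width in CUSTOMER_LAYOUT)
    if len(rec) != expected:
        raise ValueError(f"expected {expected} characters, got {len(rec)}")
    fields, offset = {}, 0
    for name, width in CUSTOMER_LAYOUT:
        # COBOL PIC X fields are padded with trailing spaces.
        fields[name] = rec[offset:offset + width].rstrip()
        offset += width
    return fields


# Hypothetical record as the file stores it: a 12-character key followed by
# an opaque 100-character RINFO string.
rkey = "B512".ljust(12)
rinfo = "Owens".ljust(60) + "NY".ljust(40)
customer = decode_customer(rkey + rinfo)
print(customer)
```

Nothing in the file description would prevent another program from splitting RINFO differently, which is why most of the semantics must be looked for in the application code.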

10

Motivating example: 1. Reading data from a COBOL file (1970)

REC (RKEY, RINFO) -> CUSTOMER (CustID, Name, City)

Where has data semantics been defined?

• In file description (10%) - [unique key, key data type]

• In application code (93%).

11

Motivating example. 2. Reading data from an RDB (1980+)

Relational DB

    create table CUSTOMER(
      CustID char(12) not null,
      Name   char(60) not null,
      City   char(40) not null,
      primary key (CustID));

    CUSTOMER
    CustID  Name     City
    C400    Darwen   London
    B512    Owens    NY
    S144    Garcia   Madrid

application code (C)

    string v1;
    string v2;
    string v3;

    select * into v1,v2,v3 from CUSTOMER where CustID = 'B512'

Result: CUSTOMER (CustID, Name, City) is read into v1, v2, v3, i.e. v1 = B512, v2 = Owens, v3 = NY.

12

Motivating example: 2. Reading data from an RDB (1980+)

CUSTOMER (CustID, Name, City) -> v1, v2, v3

Where has data semantics been defined?

• In DB schema (100%)

• In application code (3%) - [data type].

13

What does data semantics mean?

A tentative practical definition

Data semantics is the knowledge defined by all the non-technical, domain-dependent information that allows us to understand, to use and to manage the data.

14

Where can we find traces of data semantics?

[Figure: data, DB schema, application program]

in the application code (reading from file)

in the DB schema (reading from DB)

15

A first (trivial) observation

1. Expressiveness: DDL is the most appropriate language to declare data structures and constraints

2. Language independence: DDL is independent of application programming languages

3. Uniqueness: the schema is unique and centralized

4. Integration with data: the schema is a part of the database (no risk of losing it!)

5. Program independence: the schema is independent of application programs

6. Stability: the schema must be changed only when the application domain evolves.

It is best to express data semantics in the database schema

16

Only data structures are explicit in application programs:

• record name

• field name

• field data type

However, things are not always that simple (e.g., COBOL files)

Additional constraints generally are controlled by the application code:

• where?

• in which way?

• in all the modules processing the data?

Understanding data semantics by analyzing the program code can be much more complex than expected.

17

Only standard integrity constraints can be coded through the DDL (SQL2):

• not null

• uniqueness

• referential integrity

However, things are not always that simple (e.g., RDB)

Additional constraints must be coded through generic means:

• check predicates

• triggers

• stored procedures

Understanding data semantics by reading the database schema can be less easy than expected.
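The difference between standard constraints and constraints coded through generic means can be tried out with SQLite from Python. This is a minimal sketch under stated assumptions: the CUSTORDER table, the lower bound on Account and the "frozen identifier" rule are hypothetical business rules invented for the illustration, and SQLite stands in for any SQL DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
create table CUSTOMER(
  CustID  char(12) not null,
  Name    char(60) not null,
  City    char(40) not null,
  Account integer  not null,
  primary key (CustID),
  -- hypothetical business rule coded as a check predicate
  check (Account >= -1000));

create table CUSTORDER(
  OrdID  char(12) not null,
  CustID char(12) not null,
  primary key (OrdID),
  foreign key (CustID) references CUSTOMER);

-- hypothetical business rule coded as a trigger:
-- the identifier of a customer can never be changed
create trigger CUSTOMER_ID_FROZEN
before update of CustID on CUSTOMER
begin
  select raise(abort, 'CustID cannot be modified');
end;
""")


def rejected(statement: str) -> bool:
    """Return True when the DBMS refuses the statement."""
    try:
        conn.execute(statement)
        return False
    except sqlite3.DatabaseError:
        return True


conn.execute("insert into CUSTOMER values ('C400', 'Darwen', 'London', -124)")
print(rejected("insert into CUSTOMER values ('C400', 'X', 'Y', 0)"))         # uniqueness
print(rejected("insert into CUSTOMER values ('B512', 'Owens', 'NY', -5000)"))  # check predicate
print(rejected("insert into CUSTORDER values ('O1', 'ZZZ')"))                 # referential integrity
print(rejected("update CUSTOMER set CustID = 'C401' where CustID = 'C400'"))  # trigger
```

The first and third rules can be read directly from the declarations, whereas the meaning of the check predicate and of the trigger has to be reconstructed from their code, which is the difficulty the slide points to.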

18

Data models

19

Data models: abstraction hierarchy

Reminder on the database design process - The standard view

User requirements -> Information analysis -> Conceptual schema -> Logical design -> Logical (RDB) schema -> Physical design -> Physical (DB2) schema -> Coding -> SQL-DDL code

20

999. Data semantics and data models

Conceptual models

• ER (*)
• UML class diagrams

Logical models

• Record-oriented models:
  • files
  • legacy DBMS (IMS, CODASYL)
  • RDB (*)

• Key-Value models:
  • NoSQL (*)
  • CSV

• Structured object models:
  • OO
  • NoSQL
  • Json (*)
  • XML

The way data semantics is expressed in a database depends on its data model

21

ER conceptual model

Abstract, platform-independent information description

The world is perceived as:
- sets of entities,
- properties that characterize entities,
- relationships holding between entities

A conceptual schema can be translated into several logical, DBMS-dependent, schemas

Example schema: ORDER (OrdID, DateOrd, Account; id: OrdID) and CUSTOMER (CustID, Name, City; id: CustID), linked by the relationship type place (cardinalities 1-1 and 0-N).

22

Relational data model (schema-based, 1NF)

Examples: Oracle, DB2, SQL Server, MySQL, PostgreSQL, etc.

• Domain-dependent schema
• Schema and data are hierarchically distinct
• Values are aggregated into rows
• The semantics is explicit in the schema (part of!)
• The semantics is managed/controlled by the DBMS

    metadata:  CustID  Name    City    Account
    data:      C400    Darwen  London  -124
               B512    Owens   NY      5509
               S144    Garcia  Madrid  0

23

Key-Value data model (schema-less, triples, 1NF)

Examples: Oracle NoSQL, BerkeleyDB, Voldemort, Riak, Redis

• Domain-independent schema
• Metadata mixed with data
• Elementary Key-Value
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware

meta-metadata: ENTITY, ATTRIBUTE, VALUE; metadata: the attribute names; data: the values

    ENTITY  ATTRIBUTE  VALUE
    90317   CustID     C400
    90317   Name       Darwen
    90317   City       London
    90317   Account    -124
    59731   CustID     B512
    59731   Name       Owens
    59731   City       NY
    59731   Account    5509
    66830   CustID     S144
    66830   Name       Garcia
    66830   City       Madrid
    66830   Account    0
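Since the triples carry their own metadata, records and attribute names have to be rebuilt by a program. The following sketch, with the hypothetical helper names `rebuild_rows` and `infer_attributes`, groups the twelve triples of the slide by entity and lists the attribute names found in the data. It is an illustration of the idea only, not the API of any of the key-value stores listed above.

```python
from collections import defaultdict

# The twelve (ENTITY, ATTRIBUTE, VALUE) triples shown on the slide.
TRIPLES = [
    ("90317", "CustID", "C400"), ("90317", "Name", "Darwen"),
    ("90317", "City", "London"), ("90317", "Account", "-124"),
    ("59731", "CustID", "B512"), ("59731", "Name", "Owens"),
    ("59731", "City", "NY"), ("59731", "Account", "5509"),
    ("66830", "CustID", "S144"), ("66830", "Name", "Garcia"),
    ("66830", "City", "Madrid"), ("66830", "Account", "0"),
]


def rebuild_rows(triples):
    """Group the triples by entity to obtain one record per entity."""
    rows = defaultdict(dict)
    for entity, attribute, value in triples:
        rows[entity][attribute] = value
    return dict(rows)


def infer_attributes(rows):
    """Count, for each attribute name, how many entities carry it."""
    counts = defaultdict(int)
    for record in rows.values():
        for attribute in record:
            counts[attribute] += 1
    return dict(counts)


rows = rebuild_rows(TRIPLES)
attributes = infer_attributes(rows)
print(rows["59731"])
print(attributes)
```

Nothing in the store guarantees that every entity carries the same attributes or that Account holds a number; such rules, if any, live in the programs or the middleware.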

24

Structured object data models (schema-less, NF2)

Examples: CouchDB, MongoDB (BSON), SimpleDB

• Domain-independent schema
• Metadata mixed with data
• Aggregated Key-Value into objects (here in Json)
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware

meta-metadata: ENTITY, ATTRIBUTES; metadata and data: inside each object

    ENTITY  ATTRIBUTES
    90317   {"CustID": "C400", "Name": "Darwen", "City": "London", "Account": -124}
    59731   {"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}
    66830   {"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}

25

Tracing data semantics

26

In the real world, where is semantics expressed?

We have identified two places: DB schema and application code.

Are there other places?

27

Architectural framework

[Figure: data, DB schema, application program, O/R mapping, class schema, Doc]

• Documentation (text, structured, ontology)
• User interface (data structure, labels, help and error messages)
• Application code (data structures, procedural code)
• Class schema
• Object/Relational mapping
• DB logical schema (global schema, views)
• Data

28

Semantics in the documentation

Documentation (text, structured, ontology)

Functional documentation (should include the conceptual schema)

Technical documentation (should include the logical schema)

Drawback: the documentation often is
• obsolete
• incomplete
• inconsistent
• missing

29

8. Semantics in the DB schema

DB logical schema
- global logical schema
- views

The logical schema is DBMS-dependent.

It is a more or less faithful implementation of the conceptual schema.

Some views can be more detailed than the logical schema.

Drawbacks
• not a conceptual schema
• additional constraints not always trivial to identify and to understand

30

10. Semantics in the class schema

Class schema <-> DB logical schema (transformation T)

Bidirectional relation/object transformation.

Solving the impedance mismatch problem

The class schema is seen as the domain model.

It is implemented in a relational database, which ensures object persistence.

The DB schema itself is hidden and may bear little semantics.

Drawbacks
• inappropriate formalism
• poor change propagation mechanism (if any)
• semantics in the application and not in the DB
• data model not easily shared by several applications

31

11. Semantics in the application code

Application code
- data structures
- procedural code

Internal data structures may be more explicit than the DB schema.

Data integrity constraints are checked by the application code.

Understanding data semantics from the way programs process the data.

However, program analysis is far from trivial:
• size (millions of LOC)
• architectural complexity
• algorithmic complexity
• data flow complexity
• creative data processing

Drawbacks
• redundancies (a constraint may be checked in many places)
• distributed traces (potential inconsistencies)

32

12. Semantics in the GUI

User interface
- data structure
- labels
- help, error messages

The UI often is a view on a part of the database.

This view is intended for users, hence user-friendly.

Provides useful hints about the constraints and meaning of data:

• data structure (data types, aggregates)

• explicit labels

• sample data

• informative help and error messages

Drawbacks
• distributed control (potential inconsistencies)
• does not cover all the database objects

33

13. Semantics in the data (record-oriented models)

In standard models

Data analysis: finding relationships among data

• uniqueness

• data types

• inclusion properties (foreign keys)

• etc.

Main strategy
• validating hypotheses
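Validating hypotheses against the data can be sketched as follows. The two helper functions and the order records are assumptions made for the illustration; the customer records are those used throughout the lecture. One function tests a uniqueness hypothesis (candidate identifier), the other an inclusion hypothesis (candidate foreign key).

```python
def is_unique(rows, column):
    """Hypothesis: `column` is an identifier (no value appears twice)."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def is_included(child_rows, child_column, parent_rows, parent_column):
    """Hypothesis: every child value also appears in the parent column."""
    parent_values = {row[parent_column] for row in parent_rows}
    return all(row[child_column] in parent_values for row in child_rows)


customers = [
    {"CustID": "C400", "Name": "Darwen", "City": "London"},
    {"CustID": "B512", "Name": "Owens", "City": "NY"},
    {"CustID": "S144", "Name": "Garcia", "City": "Madrid"},
]
# Hypothetical order records, invented for the illustration.
orders = [
    {"OrdID": "O1", "CustID": "C400"},
    {"OrdID": "O2", "CustID": "C400"},
    {"OrdID": "O3", "CustID": "S144"},
]

print(is_unique(customers, "CustID"))                       # candidate identifier
print(is_unique(orders, "CustID"))                          # refuted by the data
print(is_included(orders, "CustID", customers, "CustID"))   # candidate foreign key
```

Data can refute a hypothesis but cannot prove it: a property that holds for the current contents may be accidental, so a confirmed hypothesis still has to be cross-checked against the other sources (schema, code, documentation).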

34

13. Semantics in the data (alternative models)

In alternative (schema-less) models

Metadata extraction

But also data analysis as in standard models

Experience
• none. Too new.

35

Recovering hidden data semantics: database reverse engineering

36

Definition

DB reverse engineering

Reverse engineering a piece of software consists, among other things, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs. Recovering these specifications is generally intended to redocument, convert, refactor, maintain or extend existing applications.

Database reverse engineering is that part of Information System Engineering that addresses the problems and techniques related to the recovery of the conceptual and logical schemas of files and databases of existing systems.

37

DB reverse engineering methodology

DB reverse engineering

Full project

Pilot

Conceptualization

Logical extraction

Physical extraction

Source management

Project planning

Conceptual schema

Logical (RDB) schema

38

DB reverse engineering methodology

DB reverse engineering

Full project

Pilot

Conceptualization

Logical extraction

Physical extraction

Source management

Project planning

Others

UI analysis

Class analysis

Prog. analysis

Data analysis

Sch. analysis

Normalization

Untranslation

De-optimization

39

Is data semantics recovery that important, actually?

40

Yes

Definitely!

41

Can you prove it? At least I can show you an example

Example: database application migration

Porting a complete existing application, or some of its components, to another, generally more modern, platform.

For a database: changing its DMS. A popular example: migrating the legacy set of files of a business application to an RDBMS.

Two main approaches:

• physical approach

• semantic approach

42

Physical database migration

Database migration

The physical, or one-to-one, migration strategy is the cheapest but also the worst approach, since it deeply degrades the final structure.

Requires no knowledge of data semantics. Very popular.

COBOL code -> Physical extraction -> Physical (file) schema -> Transform -> Physical (DB2) schema -> Coding -> SQL-DDL code

43

Physical database migration

physical (one-to-one) migration

Source:

    SELECT CUST-FILE ASSIGN TO "CUST.DAT"
        ORGANIZATION IS INDEXED
        RECORD KEY IS CUST-ID.
    FD CUST-FILE.
    01 CUSTOMER.
       02 CUST-ID   PIC X(12).
       02 CUST-INFO PIC X(80).
       02 CUST-HIST PIC X(1000).

    =

    CUSTOMER
      CUST-ID: char (12)
      CUST-INFO: char (80)
      CUST-HIST: char (1000)
      id: CUST-ID

Target (no added value):

    CUSTOMER
      CUST_ID: char (12)
      CUST_INFO: char (80)
      CUST_HIST: char (1000)
      id: CUST_ID

    =

    Create table CUSTOMER(
      CUST_ID   char(12) not null,
      CUST_INFO char(80) not null,
      CUST_HIST char(1000) not null,
      primary key (CUST_ID));

44

Semantic database migration

Database migration

Semantic approach: based on an in-depth understanding of the semantics of source data.

Provides a high quality result. Strong basis for the future.

Requires a complete, up-to-date knowledge of the DB.

Reverse Engineering: IDMS-DDL code or COBOL code -> Physical extraction -> Physical (IDMS) schema -> Logical extraction -> Logical (DBTG) schema -> Conceptualization -> Conceptual schema

Database design: Conceptual schema -> Logical design -> Logical (RDB) schema -> Physical design -> Physical (DB2) schema -> Coding -> SQL-DDL code

45

Semantic database migration (1)

semantic migration (refinement)

    SELECT CUST-FILE ASSIGN TO "CUST.DAT"
        ORGANIZATION IS INDEXED
        RECORD KEY IS CUST-ID.
    FD CUST-FILE.
    01 CUSTOMER.
       02 CUST-ID   PIC X(12).
       02 CUST-INFO PIC X(80).
       02 CUST-HIST PIC X(1000).

    +

Refined record schema:

    CUSTOMER
      CUST-ID: char (12)
      CUST-INFO: compound (70)
        NAME: char (20)
        ADDRESS: char (40)
        STATUS: char (10)
      CUST-HIST-PURCH[0-100] array: compound (10)
        ITEM: num (5)
        TOTAL: num (5)
      id: CUST-ID
      id(CUST-HIST-PURCH): ITEM

Equivalent schema with CUST-HIST-PURCH as a separate record type:

    CUSTOMER
      CUST-ID: char (12)
      CUST-INFO: compound (70)
        NAME: char (20)
        ADDRESS: char (40)
        STATUS: char (10)
      id: CUST-ID

    CUST-HIST-PURCH
      Index: index (4)
      ITEM: num (5)
      TOTAL: num (5)
      id: record.CUSTOMER, ITEM
      id': record.CUSTOMER, Index

    relationship type record: CUSTOMER 0-100, CUST-HIST-PURCH 1-1
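The refinement step amounts to discovering how programs slice the two opaque fields. The sketch below decodes them according to the refined schema above (NAME 20, ADDRESS 40, STATUS 10 characters; purchases as slots of ITEM num(5) followed by TOTAL num(5)). The function names, the sample record, the convention that an unused slot is blank and the numbering of slots from 1 are all assumptions made for the illustration.

```python
def decode_cust_info(cust_info: str) -> dict:
    """CUST-INFO: NAME char(20), ADDRESS char(40), STATUS char(10); rest unused."""
    return {
        "NAME": cust_info[0:20].rstrip(),
        "ADDRESS": cust_info[20:60].rstrip(),
        "STATUS": cust_info[60:70].rstrip(),
    }


def decode_cust_hist(cust_hist: str) -> list:
    """CUST-HIST: up to 100 slots of 10 characters, ITEM num(5) + TOTAL num(5)."""
    purchases = []
    for index in range(100):
        slot = cust_hist[index * 10:(index + 1) * 10]
        if not slot.strip():
            continue  # assumption: an unused slot is left blank
        purchases.append(
            {"Index": index + 1, "ITEM": int(slot[:5]), "TOTAL": int(slot[5:])}
        )
    return purchases


# Hypothetical physical record contents, invented for the illustration.
cust_info = "Owens".ljust(20) + "12 Main Street, NY".ljust(40) + "ACTIVE".ljust(10) + " " * 10
cust_hist = ("00042" "00003" "00107" "00012").ljust(1000)

info = decode_cust_info(cust_info)
purchases = decode_cust_hist(cust_hist)
print(info)
print(purchases)
```

In a real project this layout is not given: it is precisely what program, data and documentation analysis must recover.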

46

Semantic database migration (2)

semantic migration (SQL translation)

Source schema:

    CUSTOMER
      CUST-ID: char (12)
      CUST-INFO: compound (70)
        NAME: char (20)
        ADDRESS: char (40)
        STATUS: char (10)
      id: CUST-ID

    CUST-HIST-PURCH
      ITEM: num (5)
      Index: index (4)
      TOTAL: num (5)
      id: record.CUSTOMER, ITEM
      id': record.CUSTOMER, Index

    relationship type record: CUSTOMER 0-100, CUST-HIST-PURCH 1-1

Target schema:

    CUSTOMER
      CUST_ID
      CUST_NAME
      CUST_ADDRESS
      CUST_STATUS
      id: CUST_ID

    CUST_HIST_PURCH
      CUST_ID
      ITEM
      CINDEX
      TOTAL
      id: CUST_ID, ITEM
      id': CUST_ID, CINDEX
      ref: CUST_ID

No more than 100 CUST_HIST_PURCH per CUSTOMER

Normalized DB

    Create table CUSTOMER(
      CUST_ID      char(12) not null,
      CUST_NAME    char(28) not null,
      CUST_ADDRESS char(60) not null,
      CUST_STATUS  char(2) not null,
      primary key (CUST_ID));

    Create table CUST_HIST_PURCH(
      CUST_ID char(12) not null,
      ITEM    char(10) not null,
      CINDEX  smallint not null check(CINDEX <= 100),
      TOTAL   smallint not null,
      primary key (CUST_ID,ITEM),
      unique (CUST_ID,CINDEX),
      foreign key (CUST_ID) references CUSTOMER);

47

Database migration - Synthesis

physical migration

    Create table CUSTOMER(
      CUST_ID   char(12) not null,
      CUST_INFO char(80) not null,
      CUST_HIST char(1000) not null,
      primary key (CUST_ID));

semantic migration

    Create table CUSTOMER(
      CUST_ID      char(12) not null,
      CUST_NAME    char(28) not null,
      CUST_ADDRESS char(60) not null,
      CUST_STATUS  char(2) not null,
      primary key (CUST_ID));

    Create table CUST_HIST_PURCH(
      CUST_ID char(12) not null,
      ITEM    char(10) not null,
      CINDEX  smallint not null check(CINDEX <= 100),
      TOTAL   smallint not null,
      primary key (CUST_ID,ITEM),
      unique (CUST_ID,CINDEX),
      foreign key (CUST_ID) references CUSTOMER);

48

Evolution

new application: compute total sales per item

After physical migration:

    CUSTOMER
      CUST-ID: char (12)
      CUST-INFO: char (80)
      CUST-HIST: char (1000)
      id: CUST-ID

    ?

• where is the required information?
• how to extract it from the CUSTOMER table?
• who will develop the (C, Java, VB) program?
• … and when?

After semantic migration:

    CUSTOMER
      CUST_ID
      CUST_NAME
      CUST_ADDRESS
      CUST_STATUS
      id: CUST_ID

    CUST_HIST_PURCH
      CUST_ID
      ITEM
      CINDEX
      TOTAL
      id: CUST_ID, ITEM
      id': CUST_ID, CINDEX
      ref: CUST_ID

    Select ITEM, sum(TOTAL)
    from CUST_HIST_PURCH
    group by ITEM;

• clearly visible + documentation if needed
• just name the columns
• by any non-expert
• immediately, 2 minutes
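The query of the semantic scenario can be tried out on the normalized schema with SQLite from Python. This is a small sketch: the table definitions follow the SQL of the lecture, while the customers, items and amounts inserted are hypothetical sample data, and an `order by` clause is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table CUSTOMER(
  CUST_ID      char(12) not null,
  CUST_NAME    char(28) not null,
  CUST_ADDRESS char(60) not null,
  CUST_STATUS  char(2)  not null,
  primary key (CUST_ID));

create table CUST_HIST_PURCH(
  CUST_ID char(12) not null,
  ITEM    char(10) not null,
  CINDEX  smallint not null check(CINDEX <= 100),
  TOTAL   smallint not null,
  primary key (CUST_ID, ITEM),
  unique (CUST_ID, CINDEX),
  foreign key (CUST_ID) references CUSTOMER);
""")

# Hypothetical sample data, invented for the illustration.
conn.executemany(
    "insert into CUSTOMER values (?, ?, ?, ?)",
    [("C400", "Darwen", "London", "A"), ("B512", "Owens", "NY", "A")],
)
conn.executemany(
    "insert into CUST_HIST_PURCH values (?, ?, ?, ?)",
    [("C400", "PEN", 1, 3), ("C400", "INK", 2, 5), ("B512", "PEN", 1, 4)],
)

# Total sales per item: the columns only need to be named.
totals = conn.execute(
    "select ITEM, sum(TOTAL) from CUST_HIST_PURCH group by ITEM order by ITEM"
).fetchall()
print(totals)
```

With the physically migrated table, the same result would first require a program that decodes CUST_HIST character by character, which is the contrast the slide draws.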

49

Summary and conclusions

50

Some mundane observations

• Theories (e.g., textbooks) teach that the conceptual schema must be the unique expression of data semantics. In an ideal world, the conceptual schema exists, and all the other artefacts (DB schemas, UML diagrams, views, class schema, programs, UI) derive from it and each capture a part of this semantics.

• However, the real world doesn't learn from theories. Most often, the conceptual schema does not exist, so that only the other artefacts bear traces of the data semantics.

• Identifying, extracting, understanding and merging these traces to rebuild the conceptual schema are the very goals of database reverse engineering.

51

Cultural aspects of data semantics expression

1. Small personal application

Mainly non-professional developers. Intuitive, bottom-up, incremental development. Weak culture in DB.

Data semantics: in the UI, in application code

2. Database (record-oriented) data-intensive processing

Professional developers. Disciplined, top-down development. Strong culture in DB.

Data semantics: in the DB schema (including additional constraints).

3. OO data-intensive processing

Professional developers. OO minded. Disciplined, top-down development. Weak culture in DB.

Data semantics: in the class schema (through O/RM middleware).

4. Big data

(Semi-)Professional developers. Low complexity applications. RDB discarded as old-style (however, NewSQL DBMSs are lurking!)

Data semantics: simple, loose (few constraints); metadata in data

52

Evolution of data semantics expression

1950 - 1975: file-oriented processing
    Semantics in record schema and application code

1968 - 1990: hierarchical/network database processing
    Semantics in DB schema

1980 - ?: relational database processing
    Semantics in DB schema

1990 - 2000: object-oriented DB processing
    Semantics in DB schema and application code (methods)

2000 - ?: object-relational DB processing
    Semantics in DB schema

2000 - ?: O/RM processing
    Semantics in class schema

2011 - ?: NewSQL DB processing
    Semantics in DB schema

2005 - ?: NoSQL DB processing
    Semantics in data and in application code

[Figure: quality of DS representation, prog vs. DB]

53

Some conclusions

Quite often, developers see the database as a mere repository for the data used and created by programs:

• "the database offers persistence services for the business logic layer"

• "the database is an implementation of the program classes"

So, the database is directly dependent on the current state of the program architecture.

This view entails many problems where long-term maintenance and evolution are concerned. When the program changes, the database schema often must be modified accordingly, even if its semantics does not change.

It delights researchers in system evolution but leaves practitioners less enthusiastic.

The view of the database as a model of the application domain ensures great stability of business systems.

Is the database culture still alive among today's developers?

54

Thanks

57

Abstract of the lecture

The role of databases may sometimes appear controversial, since they are mere basic services for a significant part of the software engineering community (the transparent "persistence layer") while they are the central component of business applications for the database community. In this lecture, we examine the evolution of the database/program balance both in time (from the early sixties to a foreseeable future) and in space (technologies, communities) from the data semantics point of view. In particular, we analyze and compare how and where data semantics has been located and implemented in each of these contexts. Current development practices tend to migrate semantics from the database (as was usual in the eighties and nineties) to the application logic (e.g., O/RM, NoSQL DB managers), a trend that may be seen as a regression that reminds us of the infancy of business application development, when files were dedicated to one application. Finally, the lecture defines how data semantics can be recovered in these scenarios.
