Management of Heterogeneous Data in Traditional and non Traditional Databases Paolo Atzeni Based on recent work with F. Bugiotti, L. Rossi and on work done over many years with R. Torlone, L. Bellomarini, P.A. Bernstein, P. Cappellari, G. Gianforme Paris, June 2012

Management of Heterogeneous Data in Traditional and non Traditional Databases

Paolo Atzeni

Based on recent work with F. Bugiotti, L. Rossi and on work done over many years with R. Torlone, L. Bellomarini, P.A. Bernstein, P. Cappellari, G. Gianforme

Paris, June 2012

Heterogeneity

… of mountains

Glaciers …

…rocks …

… volcanos

On the shuttle bus from Catania to Taormina

•  We were discussing languages and dialects
•  A colleague said:
   –  Sanskrit is "more general than any other language, it has twelve cases …"

Outline

•  Heterogeneity in interoperability and migration settings
•  Management of multiple models with a metamodel approach
•  Heterogeneity in NoSQL systems
•  A common interface for NoSQL systems
•  Future work

Heterogeneity

•  Despite all standardization efforts, many data models exist
•  With different features and goals
   –  semantic models and logical models:
      •  E-R, functional, (conceptual) object
      •  relational, network, object
   –  general-purpose models as well as problem-oriented models (for specific contexts: DW, statistical, spatial, temporal)
   –  more models recently with the Web and XML
   –  yet more with NoSQL
•  Variations of models
   –  versions within a family:
      •  many versions of the ER model, of the OR model, many NoSQL models

When and how we handle heterogeneity

•  In the design of complex heterogeneous systems
   –  the results of independent design activities need to be integrated or exchanged
•  In transition, migration, consolidation, ETL and similar settings
   –  there is the need to transform data in an "off-line" way
•  In heterogeneous operational systems
   –  interoperability at run-time is needed

Outline

✓  Heterogeneity in interoperability and migration settings
•  MIDST: heterogeneity and translation with traditional models
•  Heterogeneity in NoSQL systems
•  A common interface for NoSQL systems
•  Future work

MIDST (Model Independent Schema and Data Translation)

•  Schema and data translation
   –  initially with an off-line approach
   –  later also with a run-time one
•  Model-generic:
   –  works for many models, in an extensible way
•  Model-aware:
   –  models are described

Off-line schema and data translation

•  Schema translation:
   –  given
      •  schema S1 in model M1 and
      •  model M2
   –  find a schema S2 in M2 that “corresponds” to S1
•  Schema and data translation:
   –  given also a database D1 for S1
   –  find also a database D2 for S2 that “contains the same data” as D1

Run-time support to schema and data translation

•  Given
   –  a database D1 for a schema S1 in model M1
   –  and model M2
•  let D1 be accessed as if it were in a schema S2 in model M2
   –  so, S2 is again the translation of S1 into M2

A metamodel approach

•  The constructs in the various models are rather similar:
   –  can be classified into a few categories (Hull & King 1986):
      •  Abstract (entity, class, …)
      •  Lexical: set of printable values (domain)
      •  Aggregation: a construction based on (subsets of) cartesian products (relationship, table)
      •  Function (attribute, property)
      •  Hierarchies
      •  …
•  We can fix a set of metaconstructs (each with variants):
   –  abstract, lexical, aggregation, function, ...
   –  the set can be extended if needed, but this will not be frequent
•  A model is defined in terms of the metaconstructs it uses

The metamodel approach, example

•  The ER model:
   –  Abstract (called Entity)
   –  Function from Abstract to Lexical (Attribute)
   –  Aggregation of abstracts (Relationship)
   –  …
•  The OR model:
   –  Abstract (Table with ID)
   –  Function from Abstract to Lexical (value-based Attribute)
   –  Function from Abstract to Abstract (reference Attribute)
   –  Aggregation of lexicals (value-based Table)
   –  Component of Aggregation of Lexicals (Column)
   –  …

The supermodel

•  A model that includes all the meta-constructs (in their most general forms)
   –  Each model is subsumed by the supermodel (modulo construct renaming)
   –  Each schema for any model is also a schema for the supermodel (modulo construct renaming)
•  …
•  The supermodel is … the Sanskrit of models
•  In the example, a model that generalizes ER, OR and relational

A lattice of models

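[Figure: a lattice of models with the Supermodel at the top. Nodes: Supermodel; OR w/ PK, gen, ref, FK; OR w/ PK, gen, ref; OR w/ PK, gen, FK; OR w/ PK, ref, FK; OR w/ gen, ref; OR w/ PK, FK; OR w/ PK, ref; OR w/ ref; Relational. PK = primary keys, gen = generalizations, ref = references, FK = foreign keys.]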
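One way to read the lattice is as set inclusion over the features each model admits. The sketch below encodes the object-relational (OR) variants named in the figure as feature sets; the encoding and the helpers `subsumes` and `join` are assumptions made for illustration. Relational and the Supermodel are left out because they also differ in their basic constructs (tables versus objects), which this simple encoding does not capture.

```python
# Hypothetical encoding of the OR variants as the sets of optional features
# they admit: PK = primary keys, gen = generalizations, ref = references,
# FK = foreign keys.
MODELS = {
    "OR w/ PK, gen, ref, FK": frozenset({"PK", "gen", "ref", "FK"}),
    "OR w/ PK, gen, ref": frozenset({"PK", "gen", "ref"}),
    "OR w/ PK, gen, FK": frozenset({"PK", "gen", "FK"}),
    "OR w/ PK, ref, FK": frozenset({"PK", "ref", "FK"}),
    "OR w/ gen, ref": frozenset({"gen", "ref"}),
    "OR w/ PK, FK": frozenset({"PK", "FK"}),
    "OR w/ PK, ref": frozenset({"PK", "ref"}),
    "OR w/ ref": frozenset({"ref"}),
}


def subsumes(general, specific):
    """True if every feature of `specific` is also admitted by `general`."""
    return MODELS[specific] <= MODELS[general]


def join(a, b):
    """A least general listed model that subsumes both `a` and `b`, if any."""
    needed = MODELS[a] | MODELS[b]
    candidates = [name for name, feats in MODELS.items() if needed <= feats]
    return min(candidates, key=lambda name: len(MODELS[name]), default=None)


print(subsumes("OR w/ PK, gen, ref, FK", "OR w/ PK, ref"))
print(join("OR w/ ref", "OR w/ PK, FK"))
```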

Translations with the supermodel

•  Each translation from the supermodel SM to a target model M is also a translation from any other model to M:
   –  given n models, we need n translations, not n²
•  We still have too many models:
   –  we have few constructs, but each has several independent features which give rise to variants
      •  for example, within simple OR model versions,
         –  Key may be specifiable or not
         –  Generalizations may be allowed or not
         –  Foreign keys may be used or not
         –  Nesting may be used or not
   –  Combining all these, we get hundreds of models!
   –  The management of a specific translation for each model would be hopeless
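The counting argument can be made concrete with a small sketch. The feature names come from the list above; the functions `variants`, `pairwise_translations` and `pivot_translations` are hypothetical helpers. Four independent yes/no features on one family of models already give 16 variants, and more constructs and features multiply that number further.

```python
from itertools import product

# The four independent features listed for simple OR model versions.
FEATURES = ["keys", "generalizations", "foreign keys", "nesting"]


def variants(features):
    """Every combination of the features being available or not."""
    return [
        dict(zip(features, flags))
        for flags in product([False, True], repeat=len(features))
    ]


def pairwise_translations(n):
    """Direct translations between every ordered pair of n models (order n^2)."""
    return n * (n - 1)


def pivot_translations(n):
    """Translations needed when every model is reached from the supermodel."""
    return n


n = len(variants(FEATURES))
print(n, pairwise_translations(n), pivot_translations(n))
```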

The metamodel approach, translations

•  As we saw, the constructs in the various models are similar:
   –  can be classified according to the metaconstructs
   –  translations can be defined on metaconstructs,
      •  there are standard, known ways to deal with translations of constructs (or variants thereof)
•  Elementary translation steps can be defined in this way
   –  Each translation step handles a supermodel construct (or a feature thereof) "to be eliminated" or "transformed"
•  Then, elementary translation steps are combined
•  A translation is the concatenation of elementary translation steps

An example

•  An object-relational database, to be translated into a relational one
   –  Source: an OR model
   –  Target: the relational model

An example, 2

[Figure: the example schema, with EMP, DEPT and ENG. Labels in the figure: Last Name, School, Dept, Name, Address, ID, Dept_ID, Emp_ID.]

Target: relational model
•  Eliminate generalizations
•  Add keys
•  Replace refs with FKs

An example, 3

[Figure: the translated schema, with EMP (Last Name, ID, Dept_ID), DEPT (Name, Address, ID) and ENG (School, ID, Emp_ID).]

Target: relational model
•  Eliminate generalizations
•  Add keys
•  Replace refs with FKs
•  Replace objects with tables

The steps for a translation

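[Figure: the lattice of models with a Source and a Target marked and the translation steps between them. Nodes: OR w/ PK, gen, ref, FK; OR w/ PK, gen, ref; OR w/ PK, gen, FK; OR w/ PK, ref, FK; OR w/ gen, ref; OR w/ PK, FK; OR w/ PK, ref; OR w/ ref; Relational.]

Steps:
•  Eliminate generalizations
•  Add keys
•  Replace refs with FKs
•  Replace objects with tables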

The MIDST solution

•  A metamodel (fixed but extendible)
•  A meta dictionary for specifying models within the metamodel
•  A dictionary for the description of schemas (and for handling data in the off-line version)
•  A library of elementary translations
•  An algorithm (and its implementation) for choosing the needed steps given source and target models (based on "signatures" for models and basic translations)
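The step-selection idea can be illustrated with a toy planner. This is a sketch under stated assumptions, not the MIDST algorithm: a model signature is reduced to a set of feature names, each elementary translation gets a hypothetical signature (what it requires, what must be absent, what it removes, what it adds), and a breadth-first search looks for a shortest sequence of steps. The source is assumed to be an OR model with generalizations and references but no keys.

```python
from collections import deque

# Hypothetical step signatures: name -> (requires, forbids, removes, adds).
STEPS = {
    "Eliminate generalizations": ({"gen"}, set(), {"gen"}, set()),
    "Add keys": (set(), {"PK"}, set(), {"PK"}),
    "Replace refs with FKs": ({"ref", "PK"}, set(), {"ref"}, {"FK"}),
    "Replace objects with tables": ({"objects"}, {"gen", "ref"}, {"objects"}, {"tables"}),
}

# Hypothetical model signatures: the features a schema of that model may use.
OR_SOURCE = {"objects", "gen", "ref"}
RELATIONAL = {"tables", "PK", "FK"}


def plan(source, target):
    """Breadth-first search for a shortest list of steps from source to target."""
    start = frozenset(source)
    allowed = frozenset(target)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state <= allowed:  # every remaining feature is admitted by the target
            return path
        for name, (requires, forbids, removes, adds) in STEPS.items():
            if requires <= state and not (forbids & state):
                nxt = frozenset((state - removes) | adds)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None  # no sequence of known steps reaches the target


steps = plan(OR_SOURCE, RELATIONAL)
print(steps)
```

With these assumed signatures the planner returns the same four steps, in the same order, as the example above.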

Off-line approach

[Figure: off-line architecture. Labels: Operational Systems (DB); MIDST (Importer, Translator, Exporter, DB); schemas and data exchanged between the two.]

The off-line approach: drawbacks

•  Highly inefficient in practice, because it requires databases to be moved back and forth
•  It does not allow data to be used directly
•  A "run-time" approach is needed

A run-time alternative: generating views

•  Main feature:
   –  generate, from the Datalog translation programs, executable statements defining views representing the target schema.
•  How:
   –  by means of an analysis of the Datalog schema rules under a new classification of constructs
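To give a feel for what "executable statements defining views" could look like, here is a deliberately simplified sketch. It does not analyse Datalog rules; it assumes that such an analysis has already produced, for one target table, a mapping from target columns to source expressions. The function `view_statement`, the view name `EMP_REL` and the column names are all hypothetical, and identifiers are not quoted or escaped.

```python
def view_statement(view_name, source, columns):
    """Render a CREATE VIEW statement.

    `columns` maps each target column name to the source expression
    that computes it (an assumed, simplified form of a translation rule).
    """
    select_list = ", ".join(f"{expr} AS {col}" for col, expr in columns.items())
    return f"CREATE VIEW {view_name} AS SELECT {select_list} FROM {source};"


stmt = view_statement(
    "EMP_REL",
    "EMP",
    {"ID": "EMP.ID", "LastName": "EMP.LastName", "Dept_ID": "EMP.Dept_ID"},
)
print(stmt)
```

Because the result is a view, the data stay in the operational system and are not moved, which is the point of the run-time approach.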

Runtime translation

[Figure: run-time translation architecture. Labels: MIDST (Schema Importer, Translator, View Generator); Operational System (DB, Schemas, Views); access via source schema Ss; access via target schema St.]

Run-time vs off-line

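[Figure: the off-line and run-time architectures side by side. Off-line: Operational Systems (DB) and MIDST (Importer, Translator, Exporter, DB), exchanging schemas and data. Run-time: MIDST (Schema Importer, Translator, View Generator) and Operational System (DB, Schemas, Views), with access via the source schema or via the target schema.]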

Outline

✓  Heterogeneity in interoperability and migration settings
✓  MIDST: heterogeneity and translation with traditional models
•  Heterogeneity in NoSQL systems
•  A common interface for NoSQL systems
•  Future work

"NoSQL" systems

•  New family of database systems
   –  High scalability (with respect to simple operations on many nodes)
   –  Replication and distribution (over many nodes)
   –  Flexibility in data structure
   –  New indexing techniques
   –  …
•  With some limitations
   –  Data model (and API) much simpler than SQL
   –  Less strict transaction management

There are many "NoSQL" systems

•  "NoSQL is about choice" (Jan Lehnardt on April 9, 2010, ages ago in this area, but still valid):
   –  no "one-size-fits-all"
   –  it could even make sense to use multiple systems within a single application
•  Various categories
   –  Key-value stores (Redis, Project Voldemort, …)
   –  Document stores (SimpleDB, CouchDB, MongoDB, …)
   –  Extensible record stores (BigTable, HBase, …)
•  Issues:
   –  there is no standard (not even an idea …)
   –  performance (and how to compare it across systems) is yet to be understood
   –  new systems appear, investments can be wasted (lock-in)

A problem we know

•  In the same way as with traditional databases, we have heterogeneity
   –  even more than we were used to (even more than with XML)

Which specific version of the problem?

•  Schema translation?
   –  There is no common notion of schema, so this is not really an issue
•  Offline data conversion?
   –  Possibly relevant, but the transition period would need to be managed, and it would not easily support multiple systems
•  Runtime support?
   –  Important, in order to be able to use the various systems interchangeably and multiple systems at the same time

A long term goal: Runtime translation

[Figure: run-time translation architecture for NoSQL systems. Labels: New MIDST (Translator, Adapter Generator, Description); Operational System (DB, Interface); access via native interface; access via alternative interface.]

Issues

•  The setting is not very similar to that of traditional databases
   –  Interfaces
      •  are usually much simpler
      •  have different "expressive power"
   –  The structure of data is represented only to a certain extent (there is no notion of schema, and structure is usually very flexible)
   –  Similarly, there is no notion of query language, nor a general pattern for queries

A supermodel based approach?

•  In traditional settings, our idea was to have the supermodel as the most general model, at the top of a lattice

•  Here, simplicity is a goal, even if objects could have some structure

•  Also, while in databases data are "exposed" in full (and so there are powerful query languages that can exploit the structure), here operations are more focussed

•  Therefore, while in our previous approach we used as a "pivot" a very rich model, the supermodel, here a much simpler one would be needed

•  …
•  … Sanskrit would not be very suitable here

A possible architecture

Outline

✓  Heterogeneity in interoperability and migration settings
✓  MIDST: heterogeneity and translation with traditional models
✓  Heterogeneity in NoSQL systems
•  A common interface for NoSQL systems
•  Future work

A possible architecture

A first step

•  A common interface, with a simple set of methods involving single objects or (for retrieval) sets thereof
   –  put
   –  get
   –  delete
•  Motivation
   –  the general, common goal of NoSQL systems is to support simple operations
•  First implementation in Java
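The shape of such an interface can be sketched as follows. This is written in Python for brevity (the implementation mentioned above is in Java) and is not the SOS API: the class names, the parameter lists of `put` and `delete`, and the in-memory adapter are assumptions. A real adapter would translate each call into the native API of one specific store.

```python
from abc import ABC, abstractmethod


class NonRelationalStore(ABC):
    """Hypothetical common interface: three simple operations on single objects."""

    @abstractmethod
    def put(self, collection, object_id, obj):
        """Store `obj` under `object_id` in `collection`."""

    @abstractmethod
    def get(self, collection, object_id):
        """Return the stored object, or None if it is absent."""

    @abstractmethod
    def delete(self, collection, object_id):
        """Remove the object if it is present."""


class InMemoryStore(NonRelationalStore):
    """Stand-in adapter backed by nested dictionaries."""

    def __init__(self):
        self._data = {}

    def put(self, collection, object_id, obj):
        self._data.setdefault(collection, {})[object_id] = obj

    def get(self, collection, object_id):
        return self._data.get(collection, {}).get(object_id)

    def delete(self, collection, object_id):
        self._data.get(collection, {}).pop(object_id, None)


store = InMemoryStore()
store.put("employees", "e1", {"lastName": "Smith"})
print(store.get("employees", "e1"))
```

Application code written against `NonRelationalStore` would not change when one adapter is swapped for another, which is how a common interface limits lock-in.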

SOS: Save Our Systems

SOS, concretely

Issues

•  Do objects have a structure? Should we handle it?
•  How sophisticated is the retrieval (get) operation?

Object structure

•  In general, to support a very basic interface, we could just treat objects as blobs, serializing them

•  However, objects often have a complex structure, which can be modeled in tree form, with sets and structures, possibly nested, as well as simple attributes

•  Our interface gets the native objects and the implementation serializes them into JSON
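As a small illustration of a tree-shaped object serialized into JSON, the sketch below uses Python's standard `json` module. The sample object (an employee with a nested department and a set-like list of skills) is invented for the example and does not come from the SOS implementation.

```python
import json

# Invented sample object: simple attributes, a nested structure, and a list.
employee = {
    "id": "e1",
    "lastName": "Smith",
    "dept": {"name": "R&D", "address": "Main St"},
    "skills": ["Java", "SQL"],
}

serialized = json.dumps(employee, sort_keys=True)  # what a store would receive
restored = json.loads(serialized)                  # what a get would hand back

print(serialized)
```

Serializing to JSON rather than to an opaque blob keeps the tree structure visible, so individual fields can still be addressed after storage.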

Implementation of the structure

The get operation

•  Various forms in mind
   1.  Object get (String collection, String ID)
   2.  Object get (String collection, Path p)
   3.  Set<Object> get (Query q)
•  Currently, the first two implemented
   1.  Straightforward
   2.  Currently retrieval of simple fields, in the future reconstruction of objects
   3.  Many interesting challenges, related to query processing and performance
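The first two forms of get can be sketched over JSON-serialized objects. The semantics of `Path` are not spelled out on the slide, so this sketch assumes a path is a list whose first element is the object ID and whose remaining elements are field names. The store layout and the function names `get_by_id` and `get_by_path` are likewise hypothetical.

```python
import json

# Hypothetical store: collection name -> object ID -> JSON string.
STORE = {
    "employees": {
        "e1": json.dumps(
            {"lastName": "Smith", "dept": {"name": "R&D", "address": "Main St"}}
        ),
    }
}


def get_by_id(collection, object_id):
    """Form 1: retrieve and deserialize a whole object."""
    return json.loads(STORE[collection][object_id])


def get_by_path(collection, path):
    """Form 2: retrieve a simple field by following `path` into the object tree."""
    object_id, *fields = path
    node = get_by_id(collection, object_id)
    for field in fields:
        node = node[field]
    return node


print(get_by_id("employees", "e1")["lastName"])
print(get_by_path("employees", ["e1", "dept", "name"]))
```

The third form, retrieval by query, is left out: as noted above, it raises query-processing questions that a lookup sketch like this one does not address.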

Outline

✓  Heterogeneity in interoperability and migration settings
✓  MIDST: heterogeneity and translation with traditional models
✓  Heterogeneity in NoSQL systems
✓  A common interface for NoSQL systems
•  Future work

The next steps

•  With respect to SOS:
   –  more implementation features, with flexibility
   –  custom mappings
   –  performance evaluation
•  An interoperability approach:
   –  Given the goals of NoSQL systems, a general "metamodel based" approach is not obvious
   –  The definition of a supermodel should balance expressivity with simplicity
   –  The simple interface we have developed could be a reasonable pivot, limiting the expressive power

Main references

•  MIDST, offline approach
   –  P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, G. Gianforme: Model-independent schema translation. VLDB J. 17(6): 1347-1370 (2008)
   –  P. Atzeni, P. Cappellari, P. A. Bernstein: Model-Independent Schema and Data Translation. EDBT 2006: 368-385
•  MIDST, runtime approach
   –  P. Atzeni, L. Bellomarini, F. Bugiotti, G. Gianforme: A runtime approach to model-independent schema and data translation. Information Systems 37(3): 269-287 (2012)
   –  P. Atzeni, L. Bellomarini, F. Bugiotti, G. Gianforme: A runtime approach to model-independent schema and data translation. EDBT 2009: 275-286
•  SOS
   –  P. Atzeni, F. Bugiotti, L. Rossi: Uniform access to non-relational database systems: the SOS platform. CAiSE 2012
   –  P. Atzeni, F. Bugiotti, L. Rossi: SOS (Save Our Systems): A uniform programming interface for non-relational systems. EDBT 2012 (demo section)

Thank you!