Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management
Beth Plale, Yogesh Simmhan
Computer Science Dept.


2005-03-07T18:00-05:00

Networks & Complex Systems Seminar Talk


The Data Deluge

Computational science is increasingly data intensive, and becoming more so. Why?

• More complex computations:
  – Nested model runs
  – Linked models
  – Finer resolution

• More sources of data products
  – Observational data products
    • Streaming continuously from hundreds of sensor and network sources, scaling to thousands

• Large archives
  – Annotations
  – Model configuration parameters
  – Output results
  – Model data
  – Statistical data (e.g., data mining)


Problem

Computational scientists are reaching the limit of their ability to manage the data products associated with their investigations
  – A scientist can touch hundreds to thousands of data products in a single investigation


Seeds of solution in Internet?

• The Internet has proven the utility of a user-oriented view towards information space management
  – Search, tag: browser, bookmarks
  – Publish: blogs, web page tools

• But the web is not completely appropriate. The web is
  – Single-writer, multiple-reader, and
  – Search-and-download

• Apply the concept of a user-oriented view to managing the data space

• Want the ability to work locally
  – myLEAD: tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology)


Personal metadata catalog requirements

Scientists have the following needs:

• Want to share products but retain control over what gets shared and with whom
  – Data not made public until results appear in a journal

• Want rich search criteria over a vast data space but don’t necessarily want to write SQL queries

• Need help managing products generated over an extended period of time (i.e., years)

• Want a high level of reliability: data must always be accessible


Distributed and replicated personal metadata catalogues

[Figure: LEAD testbed sites (IU, NCSA, UA Huntsville, Millersville, UCAR Unidata, Oklahoma Univ.), with one site hosting the master myLEAD catalog and the others hosting satellite myLEAD catalogs]

• Distribution: users partitioned over 6 sites in the LEAD testbed

• Replication: master is the replica site for all satellites
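The distribution and replication scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: the site names come from the slide, but the hash-based assignment of users to sites, the `Catalogs` class, and its `put`/`get` methods are hypothetical and are not taken from myLEAD.

```python
import hashlib

# Site names are from the slide; the hashing scheme is an assumption.
SITES = ["IU", "NCSA", "UAHuntsville", "Millersville", "Unidata", "OklaUniv"]


def home_site(user: str) -> str:
    """Assign a user to one satellite site (stable hash partitioning)."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return SITES[digest[0] % len(SITES)]


class Catalogs:
    """One master catalog plus one satellite catalog per site (hypothetical)."""

    def __init__(self):
        self.master = {}
        self.satellites = {site: {} for site in SITES}

    def put(self, user, key, metadata):
        """Write to the user's satellite and replicate the entry to the master."""
        site = home_site(user)
        self.satellites[site][(user, key)] = metadata
        self.master[(user, key)] = metadata
        return site

    def get(self, user, key):
        """Read from the user's satellite, falling back to the master replica."""
        site = home_site(user)
        entry = self.satellites[site].get((user, key))
        return entry if entry is not None else self.master.get((user, key))
```

Because every write is copied to the master, a lookup still succeeds in this sketch when a satellite loses its copy, which is one way to read the reliability requirement stated earlier.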


[Figure: user Bob’s workspace in 1998, in 2002, and in 2003. Each workspace is a hierarchy containing items such as "Hurricane Ivan", "SE quadrant", and "Voltice study 1998", with "Voltice study 2002" and "Voltice study 2003" appearing in the later workspaces; items are labeled as collection, workflow template, or input parameter. The workspaces are backed by a metadata catalog (table of collection, table of file, table of user), which points to physical data storage, e.g. ftp://fileserver.org/file1998o768]


Ontologies aid in querying

[Figure: data space drawn along three axes: structure, sharing, and preservation. Labels in the figure: "Depth 2: searchable"; "Depth 3: browsable"; "Flat structure"; "Does not know existence"; "Temporary data product"; "Non-published data products of other users"; "Non-preserved data product"; "Non-structured data products"]

Ontologies provide
  – transparent structure
  – controlled vocabulary


LEAD (http://lead.ou.edu)

• Each year, mesoscale weather – floods, tornadoes, hail, strong winds, lightning, and winter storms – causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses > $13B.


Conventional Numerical Weather Prediction

• OBSERVATIONS
  – Radar data
  – Mobile mesonets
  – Surface observations
  – Upper-air balloons
  – Commercial aircraft
  – Geostationary and polar orbiting satellite
  – Wind profilers
  – GPS satellites

• Analysis/Assimilation
  – Quality control
  – Retrieval of unobserved quantities
  – Creation of gridded fields

• Prediction
  – PCs to teraflop systems

• Product Generation, Display, Dissemination

• End Users
  – NWS
  – Private companies
  – Students

The process is entirely serial and pre-scheduled: no response to weather!


The LEAD Vision: No Longer Serial or Static

• OBSERVATIONS
  – Radar data
  – Mobile mesonets
  – Surface observations
  – Upper-air balloons
  – Commercial aircraft
  – Geostationary and polar orbiting satellite
  – Wind profilers
  – GPS satellites

• Analysis/Assimilation
  – Quality control
  – Retrieval of unobserved quantities
  – Creation of gridded fields

• Prediction
  – PCs to teraflop systems

• Product Generation, Display, Dissemination

• End Users
  – NWS
  – Private companies
  – Students


Objective discussed in this talk:

• Grow the value of the data holdings. One way to do so is through provenance.

[Figure: workflow and myLEAD shown along a time axis, annotated "Process, time, causality"]


Exploiting Provenance Metadata


Contents of Talk

• Importance of Provenance

• Techniques for Provenance Management

• Data Quality and Provenance

• Conclusion


Data Provenance

• Derivation History of Data starting from its original sources

• Data: Files, tables, tuples, virtual collections

• Derivation: Process that transforms data – Script, Web service, Queries, Commands

• Lineage, Pedigree, Genealogy, Filiation, Parentage, …


A Simple Provenance DAG

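[Figure: a provenance DAG whose nodes are the data products D0, D1, D2, D3, D4 and the processes P1, P2, P3, plus two further data nodes labeled D0’ and D2’]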
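A provenance DAG of this kind can be held in a small dictionary. The sketch below is a minimal illustration, not the authors' implementation: the edges are an assumption read off the quality equations on the later "Weighted DAG?" slide (D0 produced by P1 from D1, D2, and D4; D1 by P2 from D3; D2 by P3 from D4), and the name `DERIVED_BY` and the function `sources` are hypothetical.

```python
# Each derived data product maps to (generating process, input products).
# The edges are assumed from the "Weighted DAG?" slide, not from this figure.
DERIVED_BY = {
    "D0": ("P1", ["D1", "D2", "D4"]),
    "D1": ("P2", ["D3"]),
    "D2": ("P3", ["D4"]),
}


def sources(product):
    """Return the original (underived) products that `product` depends on."""
    if product not in DERIVED_BY:
        return {product}  # nothing generated it, so it is an original source
    _process, inputs = DERIVED_BY[product]
    found = set()
    for parent in inputs:
        found |= sources(parent)
    return found
```

Under these assumed edges, `sources("D0")` walks back through P1, P2, and P3 and ends at D3 and D4, which is the "derivation history starting from original sources" in the definition of data provenance.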


Importance of Provenance

• Scientific Domain
  – Publications are provenance!
  – Many scientific datasets available online
    • Biology, Astronomy (SDSS)
  – Standard metadata describes datasets in well-known repositories
  – Lineage information usually missing, but vital
  – GIS: fitness for use
  – Material engineering: pedigree, auditing
  – Biology: citation & copyright, trust
  – Astronomy: context information


Importance of Provenance

• Business Domain
  – Data warehousing: integrated view over historical data from multiple sources
  – Complex transformations to generate normalized view (ETL)
  – Business analytics and intelligence (OLAP queries)
  – Lineage allows “drill-down” from view to source table
  – Allows tracing back sources of errors
  – “View deletion” problem

[Figure: source tables T1 and T2 pass through extract, transform, and load steps (P1, Q2, Q3) to produce the view data V0, V1, and V2]


Application of Provenance

• Data Quality
  – Evaluate quality of data
  – Trust in the source of data
  – Use provenance and metadata information to estimate data quality for a user
  – Assertions and signatures for provenance guarantee

• Audit Trail
  – Error detection
  – Usage log


Application of Provenance

• Replication Recipe
  – Provenance can be a recipe for generating a dataset
  – Repeat to verify/compare
  – Recreate/replicate
  – Partial updates

• Attribution
  – Copyright, citation, check data users

• Informational
  – Discover datasets
  – Browse provenance
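The "replication recipe" use can be illustrated with a short sketch: if provenance records the ordered steps and parameters that produced a dataset, the steps can be replayed and the result compared against a recorded fingerprint. Everything named here (`PROCESSES`, `run`, `fingerprint`, and the two toy transformations) is hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical registry of named transformations.
PROCESSES = {
    "scale": lambda values, factor: [v * factor for v in values],
    "clip": lambda values, limit: [min(v, limit) for v in values],
}


def run(source, recipe):
    """Apply each (process name, parameters) step of the recipe in order."""
    data = list(source)
    for name, params in recipe:
        data = PROCESSES[name](data, **params)
    return data


def fingerprint(data):
    """Digest used to compare a re-created dataset with the recorded one."""
    return hashlib.sha256(repr(data).encode("utf-8")).hexdigest()


# The recorded provenance: the source plus the ordered steps and parameters.
source = [1, 2, 3, 4]
recipe = [("scale", {"factor": 10}), ("clip", {"limit": 25})]

original = run(source, recipe)
recorded = fingerprint(original)

# Replaying the recipe later re-creates the dataset and verifies it.
replayed = run(source, recipe)
verified = fingerprint(replayed) == recorded
```

Changing one parameter in the recipe and replaying it is a simple way to compare variants, which corresponds to the "repeat to verify/compare" bullet.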


Subject of Provenance

• What is provenance about?

• Granularity
  – Attributes, tables, files, data collections: fine-grained vs. coarse-grained
  – Trade-off with cost of collecting, storing, querying

• Data vs. Process Provenance
  – Provenance can be a graph of data & processes
  – Which of them is provenance focused upon?
  – Hybrid where all are grouped together


Process vs. Data Oriented

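[Figure: the provenance DAG (data products D0, D1, D2, D3, D4, D0’, D2’; processes P1, P2, P3) drawn twice, once for the process-oriented view and once for the data-oriented view]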


Data Processing Architectures

• Service Oriented Architecture
  – Grid & Web services
  – Workflow & service invocations
  – Data as parameters, references

• Databases
  – Update/view queries, stored procedure calls
  – Views, tables, tuples, attributes

• Scripting, command-line, etc.


Scheme for Representing Provenance

• Scheme for representing provenance
  – Annotations vs. inversion

• Annotation
  – Annotate data with ancestral data & the steps used to derive it, e.g. a DAG
  – Annotation requires more storage; “eager”
  – Annotation can be as rich as the user decides

• Inversion
  – Store the function (query) used to generate data and invert it
  – Not all functions are invertible; auxiliary data required; JIT computation; query optimization
  – Minimal information provided (“Where”, “Why”)
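The contrast between the two schemes can be made concrete with a toy aggregation. This is a sketch under stated assumptions, not a description of any particular system: the rainfall table, the function names, and the choice of a group-by sum as the derivation are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source table: (row id, station, rainfall).
SOURCE = [
    (1, "A", 2.0),
    (2, "B", 1.5),
    (3, "A", 0.5),
]


def total_by_station(rows):
    """The derivation itself: total rainfall per station."""
    totals = defaultdict(float)
    for _row_id, station, rain in rows:
        totals[station] += rain
    return dict(totals)


def derive_with_annotation(rows):
    """Eager scheme: record contributing row ids while the output is computed."""
    totals, lineage = defaultdict(float), defaultdict(list)
    for row_id, station, rain in rows:
        totals[station] += rain
        lineage[station].append(row_id)
    return dict(totals), dict(lineage)


def invert(rows, station):
    """Lazy scheme: recompute lineage on demand from the query's group key."""
    return [row_id for row_id, s, _rain in rows if s == station]
```

The annotated version pays for the extra `lineage` storage up front, while `invert` stores nothing but needs the source table to still be available when lineage is requested, which is the "auxiliary data required" caveat above.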


Syntactic vs. Semantic Representation of Provenance

• Syntactic Structure
  – XML for annotations
  – Implementation-specific for inversion

• Semantic Knowledge
  – Semantic language used to define lineage metadata
    • RDF, OWL
  – Advantages
    • Provides context
    • Enhances searches
    • Lineage proofs
  – Ontologies used as a framework for semantic knowledge
  – Community effort needed!
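How semantic knowledge can enhance a lineage search is sketched below using plain Python tuples in place of an RDF library. The triples, the predicate names (`generatedBy`, `type`, `subClassOf`), and the class names are all hypothetical; they only imitate the shape of RDF/OWL statements.

```python
# Triples of (subject, predicate, object); every name here is hypothetical.
TRIPLES = {
    ("forecast1", "generatedBy", "run42"),
    ("run42", "type", "ForecastRun"),
    ("ForecastRun", "subClassOf", "ModelRun"),
    ("ModelRun", "subClassOf", "Process"),
    ("plot7", "generatedBy", "render3"),
    ("render3", "type", "Visualization"),
    ("Visualization", "subClassOf", "Process"),
}


def superclasses(cls):
    """All classes reachable from `cls` via subClassOf, including itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(o for s, p, o in TRIPLES
                         if s == current and p == "subClassOf")
    return seen


def generated_by_kind(kind):
    """Data products whose generating process is an instance of `kind`."""
    results = set()
    for product, pred, process in TRIPLES:
        if pred != "generatedBy":
            continue
        for s, p, cls in TRIPLES:
            if s == process and p == "type" and kind in superclasses(cls):
                results.add(product)
    return results
```

A purely syntactic match on the string "ModelRun" would miss `forecast1`, whose process is typed only as `ForecastRun`; following the class hierarchy is what lets the broader query find it.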


Provenance Storage

• Stored with or separate from data?
  – Integrity, accessibility

• Maintenance
  – Mutability, versioning
  – Who is responsible: the data creator or a central authority?

• Scalability
  – # of datasets, depth of lineage, granularity, geographical distribution, # of users
  – Inversion vs. annotation; distributed vs. centralized

• Overhead
  – Collection & storage
  – Automation


Provenance Dissemination

• Browsing Provenance as a DAG
  – Go back and forward in lineage through GUI

• Query based on lineage
  – By source data, or generating process
  – Enhanced by semantic information
  – Drill down during data mining

• Verify how data was created by reenactment or present proof statements
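The browsing and query operations listed above reduce to simple graph walks. The sketch below is illustrative only: the edges are assumed from the quality equations on the "Weighted DAG?" slide, and `DERIVED_BY`, `ancestors`, `descendants`, and `produced_by` are hypothetical names.

```python
# Each derived data product maps to (generating process, input products).
# The edges are an assumption taken from the "Weighted DAG?" slide.
DERIVED_BY = {
    "D0": ("P1", ["D1", "D2", "D4"]),
    "D1": ("P2", ["D3"]),
    "D2": ("P3", ["D4"]),
}


def ancestors(product):
    """Go back in lineage: every product this one was derived from."""
    found = set()
    for parent in DERIVED_BY.get(product, (None, []))[1]:
        found.add(parent)
        found |= ancestors(parent)
    return found


def descendants(product):
    """Go forward in lineage: every product derived from this one."""
    return {d for d in DERIVED_BY if product in ancestors(d)}


def produced_by(process):
    """Query by generating process."""
    return sorted(d for d, (p, _inputs) in DERIVED_BY.items() if p == process)
```

A forward query such as `descendants("D4")` is the one an audit would use: if D4 turns out to be faulty, it lists every product that needs to be re-derived.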


Taxonomy in Brief

• Application of Provenance
  – Data quality
  – Audit trail
  – Attribution
  – Replication recipe
  – Informational

• Subject of Provenance
  – Data vs. process
  – Granularity

• Representation of Provenance
  – Annotation vs. inversion
  – Contents
  – Syntactic vs. semantic

• Provenance Storage
  – Scalability
  – Overhead

• Provenance Dissemination


Data Quality for Scientific Data

• Fitness for use

• Subjective & objective parameters
  – Believability, reputation, reliability
  – Precision, timeliness, accuracy

• Intrinsic quality of data vs. quality of data service
  – Correctness, consistency
  – Accessibility, throughput, availability

• Good quality for one application may not be good for another (user driven)


Estimating Data Quality from Provenance

• Hypothesis: For derived datasets, quality depends not just on the dataset but also on its provenance, i.e. its ancestral processes and data

• Quality of a dataset could be a function of:
  – Attributes of the dataset
  – Attributes of the generating process
  – Ancestral datasets used to derive this dataset
  – And so on recursively …


Weighted DAG?

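[Figure: the provenance DAG (data products D0, D1, D2, D3, D4, D0’, D2’; processes P1, P2, P3) with a quality value attached to each node]

D0_q = f(D0, P1_q)

P1_q = f(P1, D1_q, D2_q, D4_q)

D1_q = f(D1, P2_q)

D2_q = f(D2, P3_q)

P2_q = f(P2, D3_q)

P3_q = f(P3, D4_q)

D4_q = f(D4)

D3_q = f(D3)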
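The recursive equations above can be evaluated directly once a concrete f is chosen. The slides leave f open, so the sketch below makes loud assumptions: each node has its own score in [0, 1], a data product's quality is its own score times the quality of its generating process, and a process's quality is its own score times the weakest input quality. The scores themselves are invented for illustration.

```python
from functools import lru_cache

# Graph structure as given by the quality equations above.
GENERATED_BY = {"D0": "P1", "D1": "P2", "D2": "P3"}
INPUTS = {"P1": ["D1", "D2", "D4"], "P2": ["D3"], "P3": ["D4"]}

# Hypothetical per-node scores in [0, 1]; not taken from the slides.
OWN_SCORE = {
    "D0": 1.0, "D1": 1.0, "D2": 1.0, "D3": 0.9, "D4": 0.8,
    "P1": 0.9, "P2": 1.0, "P3": 0.5,
}


@lru_cache(maxsize=None)
def data_quality(d):
    """D_q = f(D, P_q): own score scaled by the generating process quality."""
    process = GENERATED_BY.get(d)
    if process is None:
        return OWN_SCORE[d]  # source data: D_q = f(D)
    return OWN_SCORE[d] * process_quality(process)


@lru_cache(maxsize=None)
def process_quality(p):
    """P_q = f(P, inputs): own score scaled by the weakest input (assumed f)."""
    return OWN_SCORE[p] * min(data_quality(d) for d in INPUTS[p])
```

With these scores the low-quality process P3 drags D2 down, and the weakest-link rule carries that through P1 to D0. A different f, such as a weighted mean, would only require changing the two return expressions.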


Challenges for Quality Metrics

• Some processes may produce data of better quality than their input datasets

• Subsetting, aggregation of data may change overall quality estimate

• Quality of transformation may be parameter dependent

• Multiple user profiles for different applications

• Missing lineage information can short-circuit measurement


Uses of Data Quality Measurement

• Compare and rank datasets uniformly
  – Google Personalized

• Reduce search space to datasets matching user quality requirement

• Build community-wide quality feedback mechanism
  – Leverage knowledge of domain experts
  – Promote publication of better quality data
  – Amazon reviews?
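Ranking and search-space reduction can be sketched as a weighted score with a cutoff. This is an assumption-laden illustration: the dataset names, the two quality attributes, the weighted-mean scoring, and the `search` function are all hypothetical rather than part of any system described in the talk.

```python
def search(datasets, weights, minimum):
    """Rank datasets by a user-weighted quality score, dropping low scorers."""
    total = sum(weights.values())

    def score(attrs):
        # Weighted mean of the quality attributes this user cares about.
        return sum(weights[k] * attrs[k] for k in weights) / total

    scored = [(score(attrs), name) for name, attrs in datasets.items()]
    return [name for s, name in sorted(scored, reverse=True) if s >= minimum]


# Hypothetical datasets with quality attributes in [0, 1].
DATASETS = {
    "radar_a": {"accuracy": 0.9, "timeliness": 0.4},
    "radar_b": {"accuracy": 0.6, "timeliness": 0.9},
    "radar_c": {"accuracy": 0.3, "timeliness": 0.3},
}

# Two hypothetical user profiles with different priorities.
forecaster = {"accuracy": 1, "timeliness": 3}
researcher = {"accuracy": 3, "timeliness": 1}

for_forecaster = search(DATASETS, forecaster, minimum=0.5)
for_researcher = search(DATASETS, researcher, minimum=0.5)
```

The two profiles rank the same datasets in opposite order, which is the "good quality for one application may not be good for another" point, while the cutoff removes `radar_c` from both result lists.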


Research Questions

• What are the metrics for estimating the quality of data using provenance?

• How do we optimize user-centric searches based on quality?

• How can we recover information from incomplete lineage?


Thank you!

Questions | Comments