An Architecture for Creating Collaborative Semantically Capable Scientific Data Sharing Infrastructures

Anuj R. Jaiswal, C. Lee Giles, Prasenjit Mitra, James Z. Wang

Presentation by Paulo Shakarian

Outline

• Problem
• Overall Goal
• Contributions
• Metadata
• Implementation
• Future Work
• Comparison to SIBDATA Concept

Problem

• Researchers often reference experimental results of their predecessors

• However, the raw data behind experimental results is often not readily available.
  – Hence, results often cannot easily be reused or combined with other experiments

Problem (cont.)

• Large repositories (e.g. NASA, NOAA) do collect experimental data
  – Often conform to a global schema (which may cause some data to be lost)
  – Or are stored as flat files (requiring custom-built query applications)
• Also, data labels in experiments may differ (e.g. Temp. vs. Temperature vs. Celsius)

Overall Goal

• Architecture for dissemination, sharing, querying, and searching of scientific data on the WWW

• Schema not known a priori
• Approach relies on sufficient metadata of two varieties:
  – Data about the experiment (conditions, source, when uploaded, etc.)
  – Semantics for columns/rows in experimental results (what they represent, what units, etc.)
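The two varieties of metadata can be pictured as two nested record types. The following is only a minimal sketch: the class and field names (`DatasetMetadata`, `AttributeMetadata`, `units_for`) and the sample values are assumptions made for illustration, not the schema used in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeMetadata:
    """Semantics for one column or row of an experimental result."""
    label: str       # name as it appears in the data file, e.g. "Temp."
    represents: str  # what the values mean, e.g. "temperature"
    unit: str        # unit of measurement, e.g. "Celsius"


@dataclass
class DatasetMetadata:
    """Data about the experiment as a whole."""
    title: str
    source: str
    uploaded: str
    conditions: str
    attributes: List[AttributeMetadata] = field(default_factory=list)

    def units_for(self, represents: str) -> List[str]:
        """Return the units of every attribute that carries the given meaning."""
        return [a.unit for a in self.attributes if a.represents == represents]


# Hypothetical example record; every value here is made up.
example = DatasetMetadata(
    title="Example kinetics run",
    source="http://example.org/run1.xls",
    uploaded="2007-01-01",
    conditions="constant pressure",
    attributes=[
        AttributeMetadata(label="Temp.", represents="temperature", unit="Celsius"),
        AttributeMetadata(label="Rate", represents="reaction rate", unit="mol/(L*s)"),
    ],
)
```

Because the schema is not known in advance, a search would go through the attribute-level records (what a column represents and in which unit) rather than through fixed column names.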

Overall Goal (cont.)

• Two-part approach:
  – Annotation application for semi-automatic creation of annotations
  – Web portal for searchable storage of annotated scientific data

Contributions of the Paper

• Propose architecture for semantically capable collaborative infrastructure for data collection and sharing

• System that utilizes two-level metadata scheme for document description and dataset attributes

• Description of current implementation

Dataset Metadata

• Paper states “uses Dublin Core 15 elements” but actually lists the following 17 (References and Is referenced by are not among the 15 Dublin Core elements):
  – Title
  – Creator
  – Subject
  – Description
  – Contributor
  – Publisher
  – Date
  – Type
  – Format
  – Identifier
  – Source
  – Relation
  – References
  – Is referenced by
  – Language
  – Rights
  – Coverage

Attribute Metadata

• Challenges:
  – Same attribute, different row/column name (e.g. Temp vs. Temperature)
  – Same row/column name, but different attribute (e.g. Temperature (in deg C) vs. Temperature (in deg K))
  – Row/column names may be ambiguous (e.g. Rate)

Attribute Metadata

• Metadata tags for attributes (right)

• Note that the following relation tags allow for generation of a dynamic collaboration ontology
  – Equivalent To
  – Different From
  – Superset Of
  – Subset Of
  – Type Of
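One way to picture how relation tags between attributes could accumulate into an ontology is as a set of (subject, relation, object) triples. This is a rough sketch only: the `CollaborationOntology` class is hypothetical, and treating Equivalent To as symmetric and transitive is an assumption, not something the slides state.

```python
from enum import Enum


class Relation(Enum):
    EQUIVALENT_TO = "Equivalent To"
    DIFFERENT_FROM = "Different From"
    SUPERSET_OF = "Superset Of"
    SUBSET_OF = "Subset Of"
    TYPE_OF = "Type Of"


class CollaborationOntology:
    """Hypothetical store of relation tags contributed by annotators."""

    def __init__(self):
        self.triples = set()

    def annotate(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def equivalents(self, label):
        """Labels reachable from `label` through Equivalent To links.

        Assumes the relation is symmetric and transitive.
        """
        seen, frontier = {label}, [label]
        while frontier:
            current = frontier.pop()
            for s, r, o in self.triples:
                if r is not Relation.EQUIVALENT_TO:
                    continue
                # Follow the link in both directions.
                for a, b in ((s, o), (o, s)):
                    if a == current and b not in seen:
                        seen.add(b)
                        frontier.append(b)
        return seen


onto = CollaborationOntology()
onto.annotate("Temp", Relation.EQUIVALENT_TO, "Temperature")
onto.annotate("Temperature (deg C)", Relation.DIFFERENT_FROM, "Temperature (deg K)")
```

With such a structure, a query for "Temperature" could also be matched against columns labelled "Temp", while the Different From tag keeps the Celsius and Kelvin columns apart.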

Submitting a Dataset

• Uses a “pull” technique
  – Author submits URL
  – System pulls annotated data
• Pull method allows the following
  – A moderator can check the URL from non-authorized submitters
  – Automatic tagging of provenance information for authorized users based on URL
  – Better protection from DoS attacks
    • Banning of malicious users
    • Implement a round-robin policy for fetching
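The round-robin fetching policy and the banning of malicious users can be sketched as a queue that serves one URL per submitter per turn, so that a submitter who floods the system with URLs cannot starve the others. The class and method names below are hypothetical, and the sketch performs no network access; it only decides which URL would be pulled next.

```python
from collections import OrderedDict, deque


class RoundRobinFetchQueue:
    """Hypothetical scheduler: one pending URL per submitter per turn."""

    def __init__(self):
        self.pending = OrderedDict()  # submitter -> deque of submitted URLs
        self.banned = set()

    def submit(self, submitter, url):
        """Queue a URL; returns False if the submitter is banned."""
        if submitter in self.banned:
            return False
        self.pending.setdefault(submitter, deque()).append(url)
        return True

    def ban(self, submitter):
        """Ban a submitter and drop everything they have queued."""
        self.banned.add(submitter)
        self.pending.pop(submitter, None)

    def next_fetch(self):
        """Return (submitter, url) for the next pull, or None when idle."""
        if not self.pending:
            return None
        submitter, urls = next(iter(self.pending.items()))
        url = urls.popleft()
        # Move the submitter to the back so that others get a turn first.
        del self.pending[submitter]
        if urls:
            self.pending[submitter] = urls
        return submitter, url
```

Since the system initiates every download itself, it controls the fetch rate regardless of how many URLs any single submitter sends.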

Implementation: Metadata

• Used for chemical kinetics experiments
• Experimental results in MS Excel
• Metadata added through an MS Excel add-in

Implementation: Web Portal

• Three components
  – Web portal front-end
  – Data downloader and parser
  – Data analysis toolkit

Implementation: Web Portal

• Web Portal Front-End
  – Content management system
  – Dataset viewer
  – Data submission system
• Uses Mambo Server (open source, PHP-based) content-management system
• Data submission system deployed using JSP on Apache Tomcat 5

Implementation: Web Portal

• Data downloader and parser
  – Scheduler
  – Downloader
  – Parser
• Parser
  – Creates metadata as XML files
  – Data in Excel files imported into MySQL database
  – Parser creates a dataset index, linking dataset with dataset metadata and attribute metadata with data tables
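The dataset index can be illustrated with two small lookup tables: one linking each dataset to its metadata file and data table, and one linking each attribute's metadata to a column in that data table. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, and every table name, column name, and value is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.executescript("""
CREATE TABLE dataset_index (
    dataset_id    INTEGER PRIMARY KEY,
    metadata_file TEXT NOT NULL,   -- XML file holding the dataset metadata
    data_table    TEXT NOT NULL    -- table holding the imported Excel data
);
CREATE TABLE attribute_index (
    dataset_id  INTEGER NOT NULL REFERENCES dataset_index(dataset_id),
    column_name TEXT NOT NULL,     -- column in the data table
    represents  TEXT NOT NULL,     -- meaning taken from the attribute metadata
    unit        TEXT NOT NULL
);
CREATE TABLE data_1 (temp_c REAL, rate REAL);
""")
conn.execute("INSERT INTO dataset_index VALUES (1, 'dataset_1.xml', 'data_1')")
conn.executemany(
    "INSERT INTO attribute_index VALUES (?, ?, ?, ?)",
    [(1, "temp_c", "temperature", "Celsius"),
     (1, "rate", "reaction rate", "mol/(L*s)")],
)


def find_columns(conn, represents):
    """List (data_table, column_name, unit) for columns with the given meaning."""
    rows = conn.execute(
        """SELECT d.data_table, a.column_name, a.unit
           FROM attribute_index AS a
           JOIN dataset_index AS d ON d.dataset_id = a.dataset_id
           WHERE a.represents = ?""",
        (represents,),
    )
    return rows.fetchall()
```

A call such as `find_columns(conn, "temperature")` then locates the relevant column without the caller knowing how the original spreadsheet labelled it.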

Implementation: Data Analysis Tools

• In addition to supporting queries, the web portal includes plotting and regression tools
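The slides do not say which regression methods the portal offers. As an assumed example of the simplest case, an ordinary least-squares straight-line fit over two data columns could look like this (the function name `linear_fit` is hypothetical):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The two input sequences would correspond to two columns of a stored dataset, chosen through their attribute metadata.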

Future Work

• Develop algorithms to derive dynamic collaboration ontologies

• Integrate query rewriting and semantic searching using attribute-level semantics

• Automatic metadata generation using a user’s previous experiments

• Group, trust, privacy mechanisms

Comparison to SIBDATA Concept

• Relies on central repository (as opposed to multiple repositories for SIBDATA)

• Only useful for Excel-formatted experimental results

• Annotations may be an interesting feature to include in a SIBDATA or CDATA.

Questions