An Architecture for Creating Collaborative Semantically
Capable Scientific Data Sharing Infrastructures
Anuj R. Jaiswal, C. Lee Giles, Prasenjit Mitra, James Z. Wang
Presentation by Paulo Shakarian
Outline
• Problem
• Overall Goal
• Contributions
• Metadata
• Implementation
• Future Work
• Comparison to SIBDATA Concept
Problem
• Researchers often reference experimental results of their predecessors
• However, the raw data of experimental results is often not readily available.
  – Hence, results often cannot easily be re-used or combined with other experiments
Problem (cont.)
• Large repositories (e.g. NASA, NOAA) do collect experimental data
  – Often conform to a global schema (which may cause some data to be lost)
  – Or stored as flat files (requiring custom-built query applications)
• Also, data labels in experiments may differ (e.g. Temp. vs. Temperature vs. Celsius)
Overall Goal
• Architecture for dissemination, sharing, querying, and searching of scientific data on the WWW
• Schema not known a priori
• Approach relies on sufficient metadata of two varieties:
  – Data about the experiment (conditions, source, when uploaded, etc.)
  – Semantics for columns/rows in experimental results (what they represent, what units, etc.)
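The two varieties of metadata can be pictured as a pair of nested records. The following is only a minimal sketch: the class names, field names, and sample values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMetadata:
    """Semantics of one column/row (hypothetical field names)."""
    label: str       # name as written in the data file, e.g. "Temp."
    represents: str  # what the values mean, e.g. "temperature"
    unit: str        # unit of measurement, e.g. "deg C"


@dataclass
class DatasetMetadata:
    """Data about the experiment as a whole (hypothetical field names)."""
    source: str
    conditions: str
    uploaded: str
    attributes: list = field(default_factory=list)  # AttributeMetadata items


# Made-up sample record, for illustration only.
record = DatasetMetadata(
    source="http://example.org/run42.xls",
    conditions="1 atm",
    uploaded="2009-01-15",
    attributes=[AttributeMetadata("Temp.", "temperature", "deg C")],
)
```

Keeping the unit and meaning next to each label is what lets a later query tell "Temp." in deg C apart from a column with the same label in another unit.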
Overall Goal (cont.)
• Two-part approach:
  – Annotation application for semi-automatic creation of annotations
  – Web portal for searchable storage of annotated scientific data
Contributions of the Paper
• Propose architecture for semantically capable collaborative infrastructure for data collection and sharing
• System that utilizes two-level metadata scheme for document description and dataset attributes
• Description of current implementation
Dataset Metadata
• Dublin Core (http://dublincore.org) is a set of 15 elements for minimal resource description to ensure minimal interoperability
  – OAI-PMH
  – IETF RFC 5013
  – ANSI/NISO Standard Z39.85-2007
  – ISO Standard 15836:2009
• Attributes listed on next 3 slides
Dataset Metadata
• Paper states “uses Dublin Core 15 elements” but actually uses the following 17 (the 15 Dublin Core elements plus References and Is referenced by):
  – Title
  – Creator
  – Subject
  – Description
  – Contributor
  – Publisher
  – Date
  – Type
  – Format
  – Identifier
  – Source
  – Relation
  – References
  – Is referenced by
  – Language
  – Rights
  – Coverage
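The mismatch with the 15 Dublin Core elements can be checked mechanically. The sketch below compares the field names listed on this slide against the 15 element names of the Dublin Core Metadata Element Set; the function name `split_fields` is made up for illustration.

```python
# The 15 elements of the Dublin Core Metadata Element Set, lower-cased.
DUBLIN_CORE_15 = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def split_fields(field_names):
    """Split field names into (core Dublin Core, everything else)."""
    core, extra = [], []
    for name in field_names:
        key = name.strip().lower()
        (core if key in DUBLIN_CORE_15 else extra).append(name)
    return core, extra


# Field names as listed on the slide.
slide_fields = [
    "Title", "Creator", "Subject", "Description", "Contributor",
    "Publisher", "Date", "Type", "Format", "Identifier", "Source",
    "Relation", "References", "Is referenced by", "Language",
    "Rights", "Coverage",
]
core, extra = split_fields(slide_fields)
```

Here `extra` holds the two names that are not among the 15 core elements.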
Attribute Metadata
• Challenges:
  – Same attribute, different row/column name (e.g. Temp vs. Temperature)
  – Same row/column name, but different attribute (e.g. Temperature (in deg C) vs. Temperature (in deg K))
  – Row/column names may be ambiguous (e.g. Rate)
Attribute Metadata
• Metadata tags for attributes (right)
• Note that they allow for generation of a dynamic collaboration ontology
  – Equivalent To
  – Different From
  – Superset Of
  – Subset Of
  – Type Of
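One way to picture how relation tags could accumulate into an ontology: treat each tag as a (label, relation, label) triple and merge labels joined by "Equivalent To". This is only a sketch under that assumption. The triple format, the function name, and the union-find approach are illustrative choices, not the paper's algorithm (deriving the ontology is listed as future work).

```python
from collections import defaultdict


def equivalence_groups(tags):
    """Group labels linked, directly or transitively, by 'Equivalent To'."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, relation, b in tags:
        root_a, root_b = find(a), find(b)  # registers both labels
        if relation == "Equivalent To":
            parent[root_a] = root_b        # merge the two groups

    groups = defaultdict(set)
    for label in list(parent):
        groups[find(label)].add(label)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical user-supplied tags.
tags = [
    ("Temp", "Equivalent To", "Temperature"),
    ("Temperature", "Equivalent To", "T"),
    ("Temperature (deg C)", "Different From", "Temperature (deg K)"),
]
groups = equivalence_groups(tags)
```

Labels tagged "Different From" stay in separate groups, so a search for "Temp" could be widened to "Temperature" and "T" without pulling in columns recorded in another unit.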
Submitting a Dataset
• Uses a "pull" technique
  – Author submits URL
  – System pulls annotated data
• Pull method allows the following
  – A moderator can check the URL from non-authorized submitters
  – Automatic tagging of provenance information for authorized users based on URL
  – Better protection from DoS attacks
    • Banning of malicious users
    • Implement a round-robin policy for fetching
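The round-robin fetching policy and the banning of submitters can be sketched as a queue that serves one URL per submitter in turn. The class and method names are hypothetical, and the sketch only orders the URLs; it does no downloading.

```python
from collections import OrderedDict, deque


class RoundRobinFetchQueue:
    """Serve submitted URLs one submitter at a time (illustrative only)."""

    def __init__(self):
        self._queues = OrderedDict()  # submitter -> deque of pending URLs
        self._banned = set()

    def submit(self, submitter, url):
        """Queue a URL; refuse it if the submitter is banned."""
        if submitter in self._banned:
            return False
        self._queues.setdefault(submitter, deque()).append(url)
        return True

    def ban(self, submitter):
        """Ban a submitter and drop their pending URLs."""
        self._banned.add(submitter)
        self._queues.pop(submitter, None)

    def next_url(self):
        """Return (submitter, url) for the next fetch, or None if idle."""
        if not self._queues:
            return None
        submitter, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        del self._queues[submitter]
        if queue:
            self._queues[submitter] = queue  # move submitter to the back
        return submitter, url


q = RoundRobinFetchQueue()
for url in ("a1", "a2", "a3"):
    q.submit("alice", url)
q.submit("bob", "b1")
order = [q.next_url() for _ in range(4)]
```

Because each submitter gets one fetch per round, a user who floods the system with URLs cannot starve the others, which is the DoS protection the slide refers to.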
Implementation: Metadata
• Used for chemical kinetics experiments
• Experimental results in MS Excel
• Metadata added through an MS Excel add-in
Implementation: Web Portal
• Three components
  – Web portal front-end
  – Data downloader and parser
  – Data analysis toolkit
Implementation: Web Portal
• Web Portal Front-End
  – Content management system
  – Dataset viewer
  – Data submission system
• Uses Mambo Server (open source, PHP-based) content-management system
• Data submission system deployed using JSP on Apache Tomcat 5
Implementation: Web Portal
• Data downloader and parser
  – Scheduler
  – Downloader
  – Parser
• Parser
  – Creates metadata as XML files
  – Data in Excel files imported into MySQL database
  – Parser creates a dataset index, linking dataset with dataset metadata and attribute metadata with data tables
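The parser's two outputs, metadata as XML and an index tying datasets to metadata and data tables, can be sketched as follows. This is an illustration under stated assumptions: the XML element names, the `dataset_index` table layout, and all sample values are invented, and Python's built-in sqlite3 stands in for the MySQL database so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET


def metadata_to_xml(dataset_id, dataset_meta, attribute_meta):
    """Serialize dataset-level and attribute-level metadata as XML text."""
    root = ET.Element("dataset", id=dataset_id)
    meta = ET.SubElement(root, "metadata")
    for name, value in dataset_meta.items():
        ET.SubElement(meta, name).text = value
    attrs = ET.SubElement(root, "attributes")
    for column, info in attribute_meta.items():
        node = ET.SubElement(attrs, "attribute", column=column, unit=info["unit"])
        node.text = info["represents"]
    return ET.tostring(root, encoding="unicode")


def build_index(conn, dataset_id, xml_path, table_name, attribute_meta):
    """Record which metadata file and data table belong to each column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset_index ("
        "dataset_id TEXT, metadata_file TEXT, data_table TEXT, column_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO dataset_index VALUES (?, ?, ?, ?)",
        [(dataset_id, xml_path, table_name, col) for col in attribute_meta],
    )
    conn.commit()


# Made-up sample metadata.
attribute_meta = {
    "T": {"unit": "K", "represents": "temperature"},
    "k": {"unit": "1/s", "represents": "rate constant"},
}
xml_text = metadata_to_xml("ds1", {"title": "Sample run"}, attribute_meta)

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
build_index(conn, "ds1", "ds1.xml", "ds1_data", attribute_meta)
rows = conn.execute(
    "SELECT column_name, data_table FROM dataset_index ORDER BY column_name"
).fetchall()
```

With such an index, a query on an attribute can be resolved to the table that holds its values and to the XML file that describes it.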
Implementation: Data Analysis Tools
• In addition to supporting queries, the web portal includes plotting and regression tools
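The slides do not say which regression methods the portal offers. As a minimal sketch of the kind of tool meant, the following fits a straight line by ordinary least squares; the function name and the sample data are made up.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
slope, intercept = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
```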
Future Work
• Develop algorithms to derive dynamic collaboration ontologies
• Integrate query rewriting and semantic searching using attribute-level semantics
• Automatic metadata generation using a user’s previous experiments
• Group, trust, privacy mechanisms
Comparison to SIBDATA Concept
• Relies on central repository (as opposed to multiple repositories for SIBDATA)
• Only useful for Excel-formatted experimental results
• Annotations may be an interesting feature to include in a SIBDATA or CDATA.
Questions