Brian Matthews, CRIS 2002, 31/08/02 1

Accessing the Outputs of Scientific Projects

Brian Matthews, Michael Wilson, Business & Information Technology Dept, CLRC

Kerstin Kleese-van Dam E-Science Centre, CLRC

[email protected]

Overview

• Science produces two outputs:
– Conventional publications
– Science data sets
• In traditional science, the first is used as a measure of success.
– The second is locked away.
• In this talk I shall discuss:
– A general-purpose science data portal for allowing access to data sets
– Potential links to publications.
• The aim: to make all the outputs of science available.

Who we are (CLRC)

• Central Laboratory of the Research Councils
• 1700 staff, supporting 12000 scientists and engineers from universities and industry
• Based at 3 sites:
– Daresbury Laboratory
– Rutherford Appleton Laboratory
– Chilbolton Observatory
• A Multidisciplinary Laboratory

A Multidisciplinary Laboratory

• Spallation Neutron and Muon Source (ISIS)
• Synchrotron Radiation Source (SRS)
• Lasers
• Microstructures
• Space Science and Technology
• Molecular Spectroscopy
• Earth Observation
• Atmospheric Science
• Computational Science
• Energy Research
• Information Technology
• Particle Physics
• Radio Communications
• Surfaces Transforms and Interfaces

The Problem

• Scientific institutions generate vast quantities of data
– CLRC: ISIS, SRS, Space Science, Particle Physics, Computational Science, ...
• More data coming on stream all the time:
– CERN-LHC, Diamond, CASIM, HGP, ...
• Very good at handling large amounts of data
• Diverse approaches to organising and distributing it.

Need a usable way of gaining access to the data

User Scenarios

• Lecturer:
– This published study would be a good example for teaching. Is the raw data publicly available?
• Researcher:
– This is an interesting paper. Can I check the data?
• Experiment Proposer:
– Have there been any neutron or X-ray studies of this molecule at 100 K? What reports and papers have been published on them?
• Instrument Scientist:
– The instrument seems a bit unstable recently. Fetch me the results of all calibration runs from the last 3 months. Is there a report on this instrument?

Need a usable way of gaining access to publications with data

The Data Portal Concept

• Single point of access to the CLRC data resources
• Encompasses a wide range of data holdings
– Describes what data is available from the facilities
– Links to the data held at the facility
– Different archiving methods
• Caters for a wide range of users
– from the general community to data curators
• Supports a wide range of queries
– employing data mining, thesauri, ...

Combine Diverse Users & Searches ...

[Figure labels. Searches: Discovery, Excavation. Users: General community, Wider science community, Experimenter, Specialist user, Data curator.]

… with Distributed Data Silos ...

[Figure: four separate data silos, labelled Facility 1, Facility 2, Facility 3 and Facility 4.]

… using a central common metadata index ...

[Figure labels: Client; http; CLRC Data Access Server; XML wrapper; Common metadata catalogue database; Facility 1 with Local data, Local metadata and XML wrapper.]
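The idea of a central index fed by per-facility XML wrappers can be sketched roughly as follows. This is a minimal illustration, not the CLRC implementation: the `Records`, `Record` and `Subject` element names, the record contents and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical output of two facility XML wrappers. The element names are
# illustrative only and are not taken from the CLRC metadata DTD.
FACILITY_XML = {
    "Facility 1": """<Records>
        <Record id="N000001">
            <Subject>Copper</Subject>
            <Subject>Crystal Structure</Subject>
        </Record>
    </Records>""",
    "Facility 2": """<Records>
        <Record id="N000002">
            <Subject>Copper</Subject>
        </Record>
    </Records>""",
}


def build_index(facility_xml):
    """Build a central index: subject keyword -> list of (facility, record id)."""
    index = defaultdict(list)
    for facility, xml_text in facility_xml.items():
        root = ET.fromstring(xml_text)
        for record in root.iter("Record"):
            for subject in record.iter("Subject"):
                key = subject.text.strip().lower()
                index[key].append((facility, record.get("id")))
    return index


def search(index, keyword):
    """Answer a query from the central index without contacting any facility."""
    return index.get(keyword.strip().lower(), [])


index = build_index(FACILITY_XML)
print(search(index, "Copper"))
```

The point of the design is that a query is answered from the common catalogue, and the facility is only contacted afterwards to fetch the data itself.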

… and a Web based interface

• Exploit the existing Web infrastructure.
– Use new technologies (XML/RDF);
– rapidly disseminated;
– widely accessible;
– database and user platform independent;
– can be developed now, but with the GRID in mind.

Every user who needs to can get to the information.

Metadata

[Figure: a Science Metadata Model shared by ISIS, SRS, HEP, Space Science, Social Science and Env. Science.]

A generic metadata model for all scientific applications, with specialisation for each domain:
• Can answer questions across domains
• Can answer questions about specific domains

Metadata Model

The Metadata Object has six components:
• Topic: keywords providing an index on what the study is about.
• Study Description: provenance about what the study is, who did it and when.
• Access Conditions: conditions of use providing information on who can access the data and how.
• Data Description: detailed description of the organisation of the data into datasets and files.
• Data Location: locations providing navigation to where the data on the study can be found.
• Related Material: references into the literature and community providing context about the study.
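A generic record with domain specialisation might be modelled along the following lines. This is a sketch under assumptions: the class names, field types and the `wavelength_angstrom` field are hypothetical, chosen only to show how one generic model can serve both cross-domain and domain-specific questions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataObject:
    """Generic record with the six components of the metadata model."""
    topic: List[str]              # keywords indexing what the study is about
    study_description: str        # provenance: what, who and when
    access_conditions: str        # who can access the data and how
    data_description: str         # organisation into data sets and files
    data_location: List[str]      # where the data can be found
    related_material: List[str] = field(default_factory=list)

    def matches(self, keyword: str) -> bool:
        """Cross-domain question: is this keyword one of the topics?"""
        return keyword.lower() in (t.lower() for t in self.topic)


@dataclass
class SynchrotronMetadata(MetadataObject):
    """Hypothetical domain specialisation adding one instrument parameter."""
    wavelength_angstrom: float = 0.0


records = [
    MetadataObject(["Copper"], "generic study", "public", "1 data set", ["site-a"]),
    SynchrotronMetadata(["Copper", "Palladium"], "diffraction study", "restricted",
                        "3 data sets", ["site-b"], wavelength_angstrom=0.689),
]

# A question across domains uses only the generic fields.
across = [r for r in records if r.matches("copper")]
# A question about one domain can use the specialised fields.
specific = [r for r in records
            if isinstance(r, SynchrotronMetadata) and r.wavelength_angstrom < 1.0]
print(len(across), len(specific))
```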

Study Description

• The Study is the basic unit for a scientific activity.
• Can be further divided into:
– Programmes: for connected studies.
– Investigations: for a single measurement, experiment or simulation.

[Figure: class diagram. Labels: STUDY, STUDY Name, STUDY Id, STUDY Info, Investigator, Data Manager, Programme, Investigation, Experiment, Measurement, Simulation; relations "contains" and "associated"; multiplicities 0..* and 0..1.]

Hierarchy of Data Holdings

• With investigations, there are associated data holdings.
• These are themselves arranged in a hierarchy: data sets and files, with links between them.
• Logical organisation: identity separated from location.

[Figure: an Investigation with a Data Holding containing Data-Set 1 (Raw), Data-Set 2 (Inter) and Data-Set 3 (Final), each holding files described by name and date.]
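Separating identity from location can be illustrated with a small sketch in which the hierarchy stores only logical names and a separate catalogue maps those names to physical copies. All class names, file names and URLs here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    logical_name: str   # identity of the file, independent of where it is stored


@dataclass
class DataSet:
    name: str           # for example "raw", "intermediate" or "final"
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataHolding:
    investigation: str
    data_sets: List[DataSet] = field(default_factory=list)


# Hypothetical location catalogue: logical name -> known physical copies.
LOCATIONS: Dict[str, List[str]] = {
    "run-001.raw": ["http://example.org/archive/run-001.raw",
                    "http://mirror.example.org/run-001.raw"],
}


def resolve(logical_name: str, locations: Dict[str, List[str]]) -> List[str]:
    """Return every known location of a logical file (empty list if unknown)."""
    return list(locations.get(logical_name, []))


holding = DataHolding(
    investigation="investigation-1",
    data_sets=[DataSet("raw", [DataFile("run-001.raw")]),
               DataSet("final", [DataFile("run-001.final")])],
)

for data_set in holding.data_sets:
    for data_file in data_set.files:
        print(data_set.name, data_file.logical_name,
              resolve(data_file.logical_name, LOCATIONS))
```

Because the hierarchy never stores a URL, a file can be moved or replicated by updating the catalogue alone.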

Metadata example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CLRCMetadata SYSTEM "clrcmetadata.dtd">
<CLRCMetadata>
<MetadataRecord metadataID="N000001">
<Topic>
  <Discipline>Chemistry</Discipline>
  <Subject>Crystal Structure</Subject>
  <Subject>Copper</Subject>
  ...
<Experiment>
  <StudyName>Crystal Structure: Copper : Palladium: :complex: 150K ...
  <Investigator><Name><Surname>Porter...
  <Institution>University of Peebles ...
  <Funding>EPSRC ...
  <TimePeriod><StartDate><Date>21/04/1999...
  <Purpose><Abstract>To study the structure of Copper and Palladium co-ordination complexes at a 150K.
  <DataManager><Name><Surname>Teat...
  <Instrument>SRS Station 9.8, BRUKER AXS SMART 1K...
  <Condition>...Wavelength...<Units>Angstrom...<ParamValue>0.6890...
  <Condition>...Crystal-to-detector distance<Units>cm...<ParamValue>5.00...
<AccessConditions>The user has to be one of: Prof. F. Porter...
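To show how a client might read such a record, here is a sketch that parses a simplified, well-formed fragment with Python's standard library. The fragment reuses a few values from the example above, but its nesting, the closing tags and the `ParamName` element are assumptions made for illustration; the real structure is defined by clrcmetadata.dtd, which is not reproduced in this talk.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record. The nesting and the ParamName element are
# assumed for this sketch and are not taken from clrcmetadata.dtd.
RECORD = """<CLRCMetadata>
  <MetadataRecord metadataID="N000001">
    <Topic>
      <Discipline>Chemistry</Discipline>
      <Subject>Crystal Structure</Subject>
      <Subject>Copper</Subject>
    </Topic>
    <Experiment>
      <Investigator><Name><Surname>Porter</Surname></Name></Investigator>
      <Condition>
        <ParamName>Wavelength</ParamName>
        <Units>Angstrom</Units>
        <ParamValue>0.6890</ParamValue>
      </Condition>
    </Experiment>
  </MetadataRecord>
</CLRCMetadata>"""


def summarise(xml_text):
    """Extract a flat summary (id, topic, investigator, conditions) from a record."""
    root = ET.fromstring(xml_text)
    record = root.find("MetadataRecord")
    conditions = {
        c.findtext("ParamName"): (float(c.findtext("ParamValue")), c.findtext("Units"))
        for c in record.iter("Condition")
    }
    return {
        "id": record.get("metadataID"),
        "discipline": record.findtext("Topic/Discipline"),
        "subjects": [s.text for s in record.findall("Topic/Subject")],
        "investigator": record.findtext("Experiment/Investigator/Name/Surname"),
        "conditions": conditions,
    }


summary = summarise(RECORD)
print(summary["id"], summary["subjects"], summary["conditions"])
```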

Metadata collection

• Metadata collection and maintenance is a big problem.
• But doing science is a process.

[Figure: process stages Submit proposal, Prepare experiment, Generate results, Analyse results, Write report. Metadata accumulated along the way: provenance metadata + access conditions + data description + data location + related material.]

Collecting the metadata can then become part of the experimental support environment
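One way to picture metadata collection as part of the process is a record that each stage fills in as it completes. The sketch below is illustrative only: the pairing of stages with metadata sections is an assumption read loosely from the figure, and the class and method names are hypothetical.

```python
# Assumed pairing of process stages with the metadata sections they supply.
# The slide does not state this pairing explicitly.
STAGE_SECTIONS = {
    "submit proposal": ("provenance", "access_conditions"),
    "prepare experiment": ("data_description",),
    "generate results": ("data_location",),
    "analyse results": (),
    "write report": ("related_material",),
}


class StudyRecord:
    """Metadata record filled in stage by stage as the study proceeds."""

    def __init__(self):
        self.sections = {}

    def complete_stage(self, stage, **values):
        """Store the sections a stage supplies; reject ones it should not supply."""
        unexpected = set(values) - set(STAGE_SECTIONS[stage])
        if unexpected:
            raise ValueError(f"{stage!r} does not supply {sorted(unexpected)}")
        self.sections.update(values)

    def missing(self):
        """Sections that no completed stage has supplied yet."""
        required = {name for names in STAGE_SECTIONS.values() for name in names}
        return sorted(required - set(self.sections))


record = StudyRecord()
record.complete_stage("submit proposal",
                      provenance="who, what, when",
                      access_conditions="named investigators only")
print(record.missing())
```

Checking `missing()` at the end of the study shows which parts of the record still need attention, instead of leaving all metadata entry to a separate step afterwards.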

Architecture

[Figure labels: Users; Other Data Portals; CLRC Data Portal; Common metadata catalogue database; XML wrapper; CLRC broker; Grid middleware; Facility 1, Facility 2, Facility 3 and Facility 4, each with Local data, Local metadata and XML wrapper.]

Server Architecture

[Figure labels: USER; User input interpreter; pre-set XSL Script; Query Generator; Central metadata repository; Local metadata repository; XML File; XML Parser; Response Generator; User output generator. Key: Internal module, External agent, http, Ascii file.]

Example

Result of searching: a search across facilities returns XML to the session and displays a summary.

Expand Results

- gives more details from the same XML

Going Deeper

- can browse the data sets

Select data

- pick the required data files and download them from a convenient location.

Current developments

• Pilot completed
• Consolidate and broaden existing system
– move towards a development system
– handle a greater diversity of data sources, e.g. Max Planck Institute for Meteorology
• Enhance the technology
– Web services (SOAP, WSDL, OGSA, XML Query)
• Provide links to other information sources:
– Library systems
– Thesauri

Interface with existing archives

• CLRC maintains existing data archives
– Atmospheric, earth observation, STP, astronomy.
– Existing access mechanisms (Web, Z39.50)
– Existing metadata catalogues and formats
• Can we use the Data Portal to access them?
– Use the metadata format as a framework to be specialised to express existing metadata frameworks
– XML Query as a query layer on the archive

Re-architect system

• Break up the portal middleware into components.

[Figure labels: DP; Results collation; Data source location; Query generation; Ontology service; Security service; Replication service; User service; Globus GIS - MDS; Globus GSI; Grid Enable with Web Services; RDF+DAML+OIL; XML Query.]

Access to Data and Publications

• The Data Portal offers the potential to integrate the outputs of scientific research: data and publications.
• Need to have a common search mechanism over library and data portals.
– Can abstract the science metadata to Dublin Core.
– Links to CERIF would further deepen connection.
– Access to common thesauri for classification.
• Common web service interface
– Data Portal provides this.
– XML Query as a communication mechanism

Mapping between Dublin Core and Science Metadata

• Title
– Study: Name
• Creator
– Study: Investigator: Name (Role is principal investigator)
• Subject
– Topic: Keyword
• Description
– Study: Study Information: Purpose
• Publisher
– Investigation: Data Manager
• Contributor
– Study: Investigator: Name; Investigation: Data Manager
• Date
– Study: Study Information: Time
• Resource Type
– Collection; or Dataset.
• Format
– Data Description: File Format
• Resource Identifier
– Study: Study Id (whole study)
– Data description: File: URI (for individual data files).
• Source
– Data description: Data sets: Related Data sets
– Related Material: Related work
• Language
– Not covered in the current metadata format, but a simple extension
• Relation
– Related Material: Related work
• Coverage
– Data description: Logical Description: Coverage
• Rights Management
– Access Conditions
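Part of this mapping can be sketched as a function from a science metadata record to a Dublin Core style dictionary. This is an illustration under assumptions: the nested dictionary layout, the key names and the sample values (including the study name, identifier and file format) are hypothetical, and only a subset of the elements listed above is covered.

```python
def to_dublin_core(record):
    """Map a (hypothetical) science metadata record to Dublin Core elements."""
    study = record["study"]
    investigators = study["investigators"]
    data_manager = record["investigation"]["data_manager"]
    return {
        "title": study["name"],
        # Creator: only investigators whose role is principal investigator.
        "creator": [i["name"] for i in investigators
                    if i.get("role") == "principal investigator"],
        "subject": list(record["topic"]["keywords"]),
        "description": study["info"]["purpose"],
        "publisher": data_manager,
        # Contributor: all investigators plus the data manager.
        "contributor": [i["name"] for i in investigators] + [data_manager],
        "date": study["info"]["time"],
        "type": "Dataset",
        "format": record["data_description"]["file_format"],
        "identifier": study["id"],
        "rights": record["access_conditions"],
    }


# Sample record: the layout and several values are invented for illustration.
sample = {
    "topic": {"keywords": ["Crystal Structure", "Copper"]},
    "study": {
        "id": "study-0001",
        "name": "Example crystal structure study",
        "investigators": [
            {"name": "Porter", "role": "principal investigator"},
            {"name": "Smith", "role": "co-investigator"},
        ],
        "info": {"purpose": "To study co-ordination complexes.", "time": "21/04/1999"},
    },
    "investigation": {"data_manager": "Teat"},
    "data_description": {"file_format": "text/plain"},
    "access_conditions": "Named investigators only",
}

dc = to_dublin_core(sample)
print(dc["creator"], dc["contributor"])
```

A library search layer that understands Dublin Core could then query data portal records through this reduced view, at the cost of losing the science-specific detail.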

Where are we?

• Data Portal up and running
– Being developed in the E-Science Centre in CLRC
– http://esc.dl.ac.uk:9000/index.html
– Science metadata proving very robust
– Trying to extend its use into other areas of science: materials science, environmental science.
• Beginning to approach the problem of integrating with electronic library resources.

[email protected]