1
Shifting the Burden from the User to the Data Provider
Peter Fox, High Altitude Observatory, NCAR (***)
With thanks to eGY and various NSF, DoE and NASA projects
2
Outline
• Background, definitions
• Informatics -> e-Science
• Data has lots of uses
– Virtual Observatories: use cases
– Data Framework: Examples
– Data ingest, integration, mining and …
• Discussion
Fox HDF: Semantic Data Burden Shift Oct 15, 2008
3
Background
Scientists should be able to access a global, distributed knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple instruments, using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) meta-data. It may be inconsistent, incomplete, evolving, and distributed
And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…
4
But data has Lots of Audiences
From “Why EPO?”, a NASA internal report on science education, 2005
More Strategic
Less Strategic
InformationInformation products have
SCIENTISTS TOO
5
The Information Era: Interoperability
Modern information and communications technologies are creating an “interoperable” information era in which ready access to data and information can be truly universal. Open access to data and services enables us to meet the new challenges of understanding the Earth and its space environment as a complex system:
• managing and accessing large data sets
• higher space/time resolution capabilities
• rapid response requirements
• data assimilation into models
• crossing disciplinary boundaries
6
Shifting the Burden from the User to the Provider
7
Modern capabilities
8
Mind the Gap!
• As a result of finding out who is doing what, sharing experience/expertise, and substantial coordination:
• There is/was still a gap between science and the underlying infrastructure and technology that is available
• Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.
Informatics - information science includes the
science of (data and) information, the practice
of information processing, and the engineering
of information systems. Informatics studies the
structure, behavior, and interactions of natural
and artificial systems that store, process and
communicate (data and) information. It also
develops its own conceptual and theoretical
foundations. Since computers, individuals and
organizations all process information,
informatics has computational, cognitive and
social aspects, including study of the social
impact of information technologies. Wikipedia.
9
Progression after progression
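[Diagram: progression of layers]
• IT Cyber Infrastructure
• Cyber Informatics
• Core Informatics
• Science Informatics, aka Xinformatics
• Science, SBAs
(Cyber, Core and Science Informatics are together labeled “Informatics”.)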
10
Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
– Related measurements
• Remote sensing: Virtual, integrative measurements
– Data integration
• Managing virtual data products/sets
11
Virtual Observatories
Make data and tools quickly and easily accessible to a wide audience.
Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on their local computer in their preferred language, i.e. appear to be local and integrated.
Likely to provide controlled vocabularies that may be used for interoperation in appropriate domains, along with database interfaces for access and storage and “smart” tools for evolution and maintenance.
12
Early days of discipline specific VOs
[Diagram: virtual observatories VO1, VO2, VO3, …; databases DB1, DB2, DB3, … DBn; “?”]
13
The Astronomy approach; data-types as a service
[Diagram: VO App1, VO App2, VO App3, …; databases DB1, DB2, DB3, … DBn; VO layer]
VO layer: VOTable, Simple Image Access Protocol, Simple Spectrum Access Protocol, Simple Time Access Protocol
• Limited interoperability
• Lightweight semantics
• Limited meaning, hard coded
• Limited extensibility
• Under review
Open Geospatial Consortium: Web {Feature, Coverage, Mapping} Service and Sensor Web Enablement: Sensor {Observation, Planning, Analysis} Service use the same approach
14
[Diagram: VO Portal, Web Serv. and VO API; semantic mediation layers; databases DB1, DB2, DB3, … DBn]
Layers:
• Semantic mediation layer - VSTO - low level
• Semantic mediation layer - mid-upper-level
• Education, clearinghouses, other services, disciplines, etc.
Capabilities, each marked “Added value”:
• Metadata, schema, data
• Query, access and use of data
• Semantic query, hypothesis and inference
• Semantic interoperability
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.
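The mapping from a concept-level query to access requests can be illustrated with a minimal Python sketch. It is not the VSTO implementation: the instrument and parameter names, the `MEASURES` table, the `CATALOG` entries and the `example://` locations are all hypothetical stand-ins for what an ontology and a metadata catalog would supply.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical mini-"ontology": which instrument class measures which parameter.
MEASURES = {
    "FabryPerotInterferometer": {"NeutralWind", "NeutralTemperature"},
    "IncoherentScatterRadar": {"ElectronDensity", "IonTemperature"},
}


@dataclass(frozen=True)
class Dataset:
    instrument: str
    start: date
    end: date
    url: str  # placeholder location, not a real endpoint


# Hypothetical holdings of the underlying data providers.
CATALOG = [
    Dataset("FabryPerotInterferometer", date(2004, 1, 1), date(2004, 12, 31), "example://fpi/2004"),
    Dataset("IncoherentScatterRadar", date(2005, 1, 1), date(2005, 12, 31), "example://isr/2005"),
    Dataset("FabryPerotInterferometer", date(2005, 1, 1), date(2005, 12, 31), "example://fpi/2005"),
]


def mediate(parameter, start, end):
    """Map a query on (parameter, date range) to access requests for data."""
    # Step 1: use the concept relations to find instruments measuring the parameter.
    instruments = {name for name, params in MEASURES.items() if parameter in params}
    # Step 2: keep datasets from those instruments that overlap the date range.
    return [
        f"{d.url}?parameter={parameter}"
        for d in CATALOG
        if d.instrument in instruments and d.start <= end and d.end >= start
    ]


requests = mediate("NeutralTemperature", date(2005, 3, 1), date(2005, 3, 31))
print(requests)
```

The user names only a parameter and a date range; the choice of instrument and dataset is made by the mediation step, which is the sense in which the burden moves from the user to the provider.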
15
Content: Coupling Energetics and Dynamics of Atmospheric Regions WEB
Community data archive for observations and models of Earth's upper atmosphere and geophysical indices and parameters needed to interpret them. Includes browsing capabilities by periods, > 310 instruments, models, > 820 parameters…
16
Content: Mauna Loa Solar Observatory
Near real-time data products from Hawaii from a variety of solar instruments.
Source for space weather, solar variability, and basic solar physics
Other content used too - Center for Integrated Space Weather Modeling
17
Semantic Web Methodology and Technology Development Process
• Establish and improve a well-defined methodology vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Diagram: iterative development cycle with the stages Use Case; Small Team, mixed skills; Analysis; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Science/Expert Review & Iteration; Develop model/ontology]
18
Science and technical use cases
Find data which represents the state of the neutral atmosphere anywhere above 100 km and toward the Arctic Circle (above 45N) at any time of high geomagnetic activity.
– Extract information from the use case - encode knowledge
– Translate this into a complete query for data - inference and integration of data from instruments, indices and models
Provide semantically-enabled, smart data query services via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere Observatory that retrieve data, filtered by constraints on Instrument, Date-Time, and Parameter in any order and with constraints included in any combination.
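A query interface that accepts constraints on Instrument, Date-Time and Parameter in any order and any combination could look roughly like the Python sketch below. It only illustrates the calling convention, not the SOAP service itself; the record fields, the instrument names and the `query` function are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical metadata records; a real service would query a catalog instead.
RECORDS = [
    {"instrument": "ISR", "parameter": "ElectronDensity", "time": datetime(2005, 1, 26, 21, 0)},
    {"instrument": "FPI", "parameter": "NeutralTemperature", "time": datetime(2005, 1, 26, 22, 0)},
    {"instrument": "FPI", "parameter": "NeutralWind", "time": datetime(2005, 3, 1, 3, 0)},
]


def query(records, **constraints):
    """Return records satisfying every supplied constraint.

    Constraints are keyword arguments, so their order is irrelevant, and any
    constraint that is left out is simply not applied.
    """
    def satisfies(record):
        for key, wanted in constraints.items():
            if key == "time_range":
                earliest, latest = wanted
                if not (earliest <= record["time"] <= latest):
                    return False
            elif record[key] != wanted:
                return False
        return True

    return [r for r in records if satisfies(r)]


fpi_only = query(RECORDS, instrument="FPI")
jan_26 = query(RECORDS, time_range=(datetime(2005, 1, 26), datetime(2005, 1, 27)))
fpi_wind = query(RECORDS, parameter="NeutralWind", instrument="FPI")
print(len(fpi_only), len(jan_26), len(fpi_wind))
```

Passing the constraints as keywords keeps the three dimensions symmetric: no dimension has to be chosen first, which mirrors the "in any order" requirement of the use case.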
Fox RPI: Semantic Data Frameworks May 14, 2008
19
VSTO - semantics and ontologies in an operational environment: vsto.hao.ucar.edu, www.vsto.org
Web Service
20
Partial exposure of Instrument class hierarchy - users seem to LIKE THIS
Semantic filtering by domain or instrument hierarchy
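Filtering by instrument hierarchy amounts to asking whether an instrument's class is a subclass of the class the user selected. The Python sketch below shows that check over a small, hypothetical fragment of a class hierarchy; the class and instrument names are illustrative and are not taken from the VSTO ontology.

```python
# Hypothetical fragment of an instrument class hierarchy (child -> parent).
SUBCLASS_OF = {
    "OpticalInstrument": "Instrument",
    "Radar": "Instrument",
    "Interferometer": "OpticalInstrument",
    "Photometer": "OpticalInstrument",
    "FabryPerotInterferometer": "Interferometer",
    "IncoherentScatterRadar": "Radar",
}

# Hypothetical instrument instances and the class each belongs to.
INSTRUMENTS = {
    "fpi_a": "FabryPerotInterferometer",
    "isr_a": "IncoherentScatterRadar",
    "phot_a": "Photometer",
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)  # None once the root has been passed
    return False


def filter_by_class(instruments, ancestor):
    """Names of all instruments whose class falls under the chosen class."""
    return sorted(name for name, cls in instruments.items() if is_a(cls, ancestor))


optical = filter_by_class(INSTRUMENTS, "OpticalInstrument")
radars = filter_by_class(INSTRUMENTS, "Radar")
print(optical, radars)
```

Selecting a broad class returns every instrument beneath it, so a user can start from a domain-level category without knowing individual instrument names.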
21
22
Inferred plot type and return formats for data products
23
Inferred plot type and return required axes data
24
Semantic Web Benefits
• Unified/abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of selections from eight to three
• Generates only syntactically correct queries: which could not always be ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to expose only coherent queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, now exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis, transformations, etc.
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)
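One way to picture "exposing only coherent queries" is a helper that, given whatever the user has selected so far, returns only the remaining choices that still lead to a valid query. The Python sketch below assumes a hypothetical table of valid instrument/parameter pairs; in the system described here such knowledge would come from the ontology and a reasoner rather than a hard-coded set.

```python
# Hypothetical facts: (instrument, parameter) pairs known to be measurable.
VALID = {
    ("FPI", "NeutralWind"),
    ("FPI", "NeutralTemperature"),
    ("ISR", "ElectronDensity"),
    ("ISR", "IonTemperature"),
}


def coherent_options(instrument=None, parameter=None):
    """Choices still consistent with the current (possibly partial) selection."""
    pairs = [
        (i, p)
        for i, p in VALID
        if instrument in (None, i) and parameter in (None, p)
    ]
    return {
        "instruments": sorted({i for i, _ in pairs}),
        "parameters": sorted({p for _, p in pairs}),
    }


after_parameter = coherent_options(parameter="NeutralWind")
after_instrument = coherent_options(instrument="ISR")
incoherent = coherent_options(instrument="ISR", parameter="NeutralWind")
print(after_parameter, after_instrument, incoherent)
```

Because the interface never offers a combination outside the valid set, an incoherent query such as the last call cannot be assembled through the portal in the first place.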
25
What is a Non-Specialist Use Case?
A teacher accesses the internet, goes to an educational virtual observatory and enters a search for “Aurora”.
Someone should be able to query a virtual observatory without having specialist knowledge
26
What should the User Receive?
Teacher receives four groupings of search results:
1) Educational materials: http://www.meted.ucar.edu/topics_spacewx.php and http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via VSTO, VSPO and VITMO, knows to search for brightness, or green/red line emission
3) Did you know?: Aurora is a phenomenon of the upper terrestrial atmosphere (ionosphere), also known as Northern Lights
4) Did you mean?: Aurora Borealis or Aurora Australis, etc.
27
Semantic Information Integration: Concept map for educational use of
science data in a lesson plan
28
29
Issues for Virtual Observatories
• Scaling to large numbers of data providers and redefining the role(s)/relations with them
• Crossing discipline boundaries
• Security, access to resources, policies
• Branding and attribution (where did this data come from and who gets the credit, is it the correct version, is this an authoritative source?)
• Provenance/derivation (propagating key information as it passes through a variety of services, copies of processing algorithms, …)
• Data quality, preservation, stewardship
These are currently burden areas for users
30
Problem definition
• Data is coming in faster, in greater volumes and outstripping our ability to perform adequate quality control
• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision
• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects
• The task of event determination and feature classification is onerous and we don't do it until after we get the data
31
Use cases
• Determine which flat field calibration was applied to the image taken on January 26, 2005 around 2100 UT by the ACOS Mark IV polarimeter.
• Which flat-field algorithm was applied to the set of images taken during the period November 1, 2004 to February 28, 2005?
• How many different data product types can be generated from the ACOS CHIP instrument?
• What images comprised the flat field calibration image used on January 26, 2007 for all ACOS CHIP images?
• What processing steps were completed to obtain the ACOS PICS limb image of the day for January 26, 2005?
• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005 1849:09 UT taken by the ACOS Mark IV polarimeter?
• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21, 2008, 1900 UT missing?
• Why does this image look bad?
32
Provenance
• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
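Questions such as "which flat field calibration was applied?" or "who added the comments?" become answerable once each processing step is recorded alongside the data product. The Python sketch below is one minimal way to hold such a record; the `Step` and `Product` classes, the step names, the agent labels and the file identifiers are all hypothetical and do not describe the actual MLSO pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Step:
    name: str       # what was done
    agent: str      # who did it: a person or a program
    inputs: list    # identifiers of the files or products that were used
    when: datetime  # when it was done


@dataclass
class Product:
    product_id: str
    steps: list = field(default_factory=list)

    def record(self, name, agent, inputs, when):
        """Append one processing step to this product's history."""
        self.steps.append(Step(name, agent, list(inputs), when))

    def steps_named(self, name):
        """All recorded steps of a given kind, in the order they happened."""
        return [s for s in self.steps if s.name == name]


# Hypothetical history for one image.
image = Product("hypothetical_image_001")
image.record("dark_subtraction", "program:reduce_v1", ["dark_0001"], datetime(2005, 1, 26, 21, 0))
image.record("flat_field", "program:reduce_v1", ["flat_0007"], datetime(2005, 1, 26, 21, 1))
image.record("add_comment", "person:observer", [], datetime(2005, 1, 26, 21, 30))

flat_inputs = image.steps_named("flat_field")[0].inputs
commenters = [s.agent for s in image.steps_named("add_comment")]
print(flat_inputs, commenters)
```

Recording the agent and inputs at the time each step runs is what lets the provider, rather than the user, answer questions about how a product was derived.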
33
34
35
36
Visual browse
37
38
39
Discussion (1)
• Taken together, an emerging set of collected experience manifests an emerging informatics core capability that is starting to take data intensive science into a new realm of realizability and, potentially, sustainability
– Use cases (i.e. real users)
– X-informatics
– Core Informatics
– Cyber Informatics
• There are implications for data models
40
Progression after progression
[Diagram: progression of layers]
• IT Cyber Infrastructure
• Cyber Informatics
• Core Informatics
• Science Informatics
• Science, SBAs
(Cyber, Core and Science Informatics are together labeled “Informatics”.)
Example:
• CI = OPeNDAP server running over HTTP/HTTPS
• Cyberinformatics = Data (product) and service ontologies, triple store
• Core informatics = Reasoning engine (Pellet), OWL
• Science (X) informatics = Use cases, science domain terms, concepts in an ontology
41
Discussion (2)
• Data and information science is becoming the ‘fourth’ column (along with theory, experiment and computation)
• Semantics (of the data) are a key ingredient -> may imply richer data models
42
Summary
• Informatics is playing a key role in filling the gap between science (and the spectrum of non-expert) use and generation and the underlying cyberinfrastructure, i.e. in shifting the burden
– This is evident due to the emergence of Xinformatics (worldwide)
• Our experience is implementing informatics as semantics in Virtual Observatories (as a working paradigm) and Grid environments
– VSTO is only one example of success
– Data mining, data integration, smart search, provenance are close behind
• Informatics is a profession and a community activity and requires efforts in all 3 sub-areas (science, core, cyber) and must be synergistic
43
More Information
• Virtual Solar Terrestrial Observatory (VSTO): http://vsto.hao.ucar.edu, http://www.vsto.org
• Semantically-Enabled Science Data Integration (SESDI): http://sesdi.hao.ucar.edu
• Semantic Provenance Capture in Data Ingest Systems (SPCDIS): http://spcdis.hao.ucar.edu
• Semantic Knowledge Integration Framework (SKIF/SAM): http://skif.hao.ucar.edu
• Semantic Web for Earth and Environmental Terminology (SWEET): http://sweet.jpl.nasa.gov
• Conferences: AGU 2008, EGU 2009, ISWC 2008, CIKM 2008, …
• Peter Fox [email protected]