Dan Crichton April 2009

Developing a Reference Architecture for Scientific Data Systems

Page 1: Developing a Reference Architecture for Scientific Data Systems

Dan Crichton

April 2009

Page 2: Developing a Reference Architecture for Scientific Data Systems

Background

• Employed by Jet Propulsion Laboratory since 1995; prior software engineering positions at Hughes Aircraft Company and in private industry
• MS in Computer Science, USC, 1996
• Program Manager for
– Planetary Data System Engineering in Solar System Exploration Directorate
– Data Systems and Technology in Earth and Technology Directorate
• Principal Investigator for
– Informatics Center, Early Detection Research Network, National Cancer Institute
– Facilitating Integration of NASA and Earth System Grid, NASA
– Object Oriented Data Technology

Page 3: Developing a Reference Architecture for Scientific Data Systems

Science data systems

• Cover a wide variety of science disciplines
– Solar system exploration
– Astrophysics
– Earth science
– Biomedicine
– etc.
• Each has its own communities, standards and systems
• How do you define a reference architecture vs a point solution?

Page 4: Developing a Reference Architecture for Scientific Data Systems

[Diagram: end-to-end science data system. Labeled elements: Spacecraft and Scientific Instruments; Spacecraft / lander; Relay Satellite; Data Acquisition and Command; Mission Operations; Instrument/Sensor Operations; Science Data Processing; Science Data Archive; Data Analysis and Modeling; Science Team; External Science Community. Labeled exchanges: Primitive Information Object; Simple Information Object; Telemetry Information Package; Science Information Package; Instrument Planning Information Object; Planning Information Object; Science Products - Information Objects.]

• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end

Page 5: Developing a Reference Architecture for Scientific Data Systems


• Increasing data volumes
• Increased emphasis on usability and analysis of the data across the end-to-end system
– Mining/discovery
• Increasing diversity of data sets and complexity for integrating across missions/experiments (e.g., common information model for describing the data)
• Increasing distribution of coordinated processing and operations (e.g., federation)
• Increased pressure to reduce cost of supporting new missions
• Increasing desire for PIs to have integrated tool sets to work with data products with their own environments (e.g., perform their own generation and distribution)

[Chart: Archive Volume Growth, Planetary Science Archive. Accumulated volume in TBytes (axis 0 to 90) by year, 1990 to 2008.]

Page 6: Developing a Reference Architecture for Scientific Data Systems

Architecture: why do I care?

• Data system costs per mission, project, investigation, etc are high
• Technology infusion is limited
• Need to capture and leverage domain knowledge and experience across projects

Page 7: Developing a Reference Architecture for Scientific Data Systems

Architecture: what is it?

The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)

Page 8: Developing a Reference Architecture for Scientific Data Systems

Architects: what are they?

Effective Architects have…

• Years of experience
• Holistic view of domain
– Look at both aesthetics and practical details
– Variable technical depth
• Lifecycle roles
– Strong involvement up-front
– May oversee development
– Chooses stable steps in development

Effective Architects are not…

• Lone inventors or scientists
– The architect is a good communicator and politician -- architectures must be sold and explained and their integrity maintained
– Architecting is not a science, but depends on science
• Purely technologists
– Architecture is a strategy
• “Top level only” designers
– Details are often critical
• Collaborators
– A coherent vision is critical; they drive it

Page 9: Developing a Reference Architecture for Scientific Data Systems


• A viewpoint is a template for constructing a view

• A view is a description of the entire system from the perspective of a set of related concerns. A view is composed of one or more models.

• A model is an abstraction or representation of some aspect of a thing

• Examples: RM-ODP, FEAF, TOGAF, etc

The viewpoint is where you look from

The view is what you see

(Project Managers, Engineers, Scientists, Business Analysts, …)
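The relationship among viewpoint, view, and model can be made concrete with a small data-structure sketch. This is only an illustration: the class layout, the `concerns` field, the `construct_view` method, and the sample values are assumptions, not part of any of the frameworks named above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Model:
    """An abstraction or representation of some aspect of a thing."""
    name: str
    aspect: str


@dataclass
class View:
    """What you see: a description of the system for one set of concerns."""
    viewpoint_name: str
    models: List[Model]


@dataclass
class Viewpoint:
    """Where you look from: a template for constructing a view."""
    name: str
    concerns: List[str] = field(default_factory=list)

    def construct_view(self, models: List[Model]) -> View:
        # A view is composed of one or more models.
        if not models:
            raise ValueError("a view needs at least one model")
        return View(viewpoint_name=self.name, models=list(models))


# Hypothetical example: an information viewpoint for a science data system.
information = Viewpoint("Information/Data", concerns=["metadata", "formats"])
view = information.construct_view([Model("data dictionary", "element definitions")])
```

The same `Viewpoint` can be reused across systems while the models inside each resulting view differ.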

Page 10: Developing a Reference Architecture for Scientific Data Systems

Planetary science example

Page 11: Developing a Reference Architecture for Scientific Data Systems

Defining the reference architecture

• In science data systems, construction of multiple architecture views of a system is critical
– Process
– Information/Data
– Technology
• We find the “views” are similar, but models can be domain specific
– This is the opportunity to develop a reusable reference architecture if the “patterns” can be extracted

Page 12: Developing a Reference Architecture for Scientific Data Systems

Domain Specific Software Architectures*

• Domain model
– Leverage experts who have the “holistic” view and can drive the need for product lines
– An unambiguous view is critical (in fact, this has been a problem in science arenas)
• Reference requirements
– Drives the reference architecture
– However, it is critical to map domain models to reference requirements in order to understand the solution space
• Reference architecture
– Satisfies an abstracted set of functions from the reference requirements
– It’s engineered for the “ilities”: reusability, extensibility and configurability
– It demonstrates the separation of functional elements of the architecture

* Tracz, Will, Domain-Specific Software Architecture, ACM SIGSOFT, 1995

Page 13: Developing a Reference Architecture for Scientific Data Systems

[Diagram: PDS-2010 System Architecture, divided into Process Architecture, Data Architecture, and Technology Architecture. Labeled elements: Ingest (Receive, Validate, Accept); Archive (APG, PAG); Archive; Peer Review; Archive Tools; Preservation Planning; Administration; Information Model; Data Formats; Data Dictionary; Grammar; Data Standards; Archive Organization; Catalog/Data Mgmt; Storage; Portal; Search; Query/Access; Data Distribution; Data Movement; Data Node Integration; Distributed Infrastructure; Deep Archive; User Tools/Services; Technology Standards.]

Page 14: Developing a Reference Architecture for Scientific Data Systems

• Data/Information Architecture

• Components, middleware, and communication

• NOTE: Process is implicit here

[Diagram: two peer systems, each showing Comm Layer, Middleware and Messaging, Information Components, Metamodel, Domain Model, and Information Object, grouped under Communications, Software/Application, and Data Architecture/Content. What the peers share:]

• Common Protocols - TCP/IP, ...
• Common Messaging - SOAP, JMS, ...
• Common Functions - Registry, Repository, ...
• Common or Mediated Metamodel - DEDSL, ISO 11179, UML
• Common or Mediated Domain Models - Planetary Data Systems, EOSDIS, ...
• Information Exchange - Science, Mission, etc, Data Products, Observations, SLE Objects, ...

Page 15: Developing a Reference Architecture for Scientific Data Systems

Software product lines

• This is about strategy more than technology
• Goal is a software product line that
– Implements our reference architecture
– Allows for construction of core software components that can be reused across projects and science disciplines
– Can demonstrate sufficient cost and schedule benefits without sacrificing flexibility in meeting requirements and adapting to technology change

Page 16: Developing a Reference Architecture for Scientific Data Systems

Object Oriented Data Technology

• Represents both a reference architecture AND a software product line for science data systems
• Exploits common patterns
• Delivers reusable software components as building blocks for construction of higher order data systems
• Applied to multiple science disciplines
• Funded originally back in 1998; runner up for NASA Software of the Year in 2003
• Heavily used by NASA and NIH projects

[Diagram: Object Oriented Data Technology Framework. Labeled elements: OODT/Science Web Tools; Archive Client; Navigation Service; Query Service; Product Service; Profile Service; Archive Service; Other Service 1; Other Service 2; Bridge to External Services; Profile XML Data; Data System 1; Data System 2.]

Page 17: Developing a Reference Architecture for Scientific Data Systems

Architectural principles*

• Separate the technology and the information architecture
• Encapsulate the messaging layer to support different messaging implementations
• Encapsulate individual data systems to hide uniqueness
• Provide data system location independence
• Require that communication between distributed systems use metadata
• Define a model for describing systems and their resources
• Provide scalability in linking both number of nodes and size of data sets
• Allow systems using different data dictionaries and metadata implementations to be integrated
• Leverage existing software, where possible (e.g., open source, etc)


* Crichton, D, Hughes, J. S, Hyon, J, Kelly, S. “Science Search and Retrieval using XML”,Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington DC, 2000.

Page 18: Developing a Reference Architecture for Scientific Data Systems

Architectural focus

• Consistent distributed capabilities
– Resource discovery (data, metadata, services, etc), “grid-ing” loosely coupled science system, workflow management
– On-demand, shared services (e.g. processing, translation, etc)
– Deploy high throughput data movement mechanisms
• End-to-end capabilities across the science environment
– Reduce local software solutions that do not scale
– Increasing importance in developing an “enterprise” approach with common services
• Build value-added services and capabilities on top of the infrastructure


Page 19: Developing a Reference Architecture for Scientific Data Systems

Exploiting common patterns

• How data is managed (registry/repository, information objects themselves)…
• How data is generated, captured, etc (e.g., workflow and data processing)…
• How data is accessed (metadata, data)…
• How information is discovered…
• How data is distributed (e.g., transformed)…
• How data is visualized…

Page 20: Developing a Reference Architecture for Scientific Data Systems

What does OODT do?

• Tie together loosely coupled distributed heterogeneous data systems into a virtual data grid
• Support critical functions
– Data Production and workflow
– Data Distribution
– Data Discovery (including query optimization across highly distributed systems)
– Data Access
• An architectural approach first, an implementation second
– Adapt to different distributed computing deployments
– Promotes a REST-style architectural pattern for search and retrieval
• Scalability in linking together large, distributed data sets
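As a rough illustration of what a REST-style pattern for search means in practice, the sketch below encodes a query as a plain GET URL that any HTTP client could issue. The endpoint `https://data.example.org/search` and the parameter names `q` and `max` are made up for this example; they are not OODT's actual interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, chosen only for illustration.
BASE_URL = "https://data.example.org/search"


def search_url(keywords: str, max_results: int = 10) -> str:
    """Encode a search as a GET URL; the whole query lives in the address."""
    return f"{BASE_URL}?{urlencode({'q': keywords, 'max': max_results})}"


url = search_url("TARGET_NAME = MARS", max_results=5)

# A server (or a test) can recover the query from the URL alone.
params = parse_qs(urlsplit(url).query)
```

Because the request is just a URL, it can be bookmarked, cached, or issued from any language without a client library.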

Page 21: Developing a Reference Architecture for Scientific Data Systems

OODT data architecture focus

• On types of and relationships among a software system’s data
• Decomposition of data within a software system to its logical components and interactions
– Components: Data Elements, Data Dictionary, Data Models of individual data sources
– Interactions: Mappings between Data Dictionary and Data Models, Data Element structural comparison
• Some standards currently exist for data architecture
– ISO: ISO-11179 Standardization and Specification of Data Elements
– Dublin Core Metadata Initiative: Dublin Core Data Elements to describe any electronic resource
• Specifications for the Data Architecture
– Common XML schema for managing information about data resources
– Common XML schema for messaging between distributed services
– Methods for integrating existing domain models within architecture
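The mapping between a common data dictionary and one data source's own model can be sketched as a simple translation table. Everything named here (the element names, the local field names, the `to_common` helper) is a hypothetical example, not taken from the PDS or OODT dictionaries.

```python
from typing import Dict

# Hypothetical common data dictionary: element name -> expected Python type.
DATA_DICTIONARY = {"TARGET_NAME": str, "START_TIME": str, "EXPOSURE_SECONDS": float}

# Hypothetical mapping from one source's local field names to dictionary elements.
LOCAL_TO_COMMON = {"tgt": "TARGET_NAME", "t0": "START_TIME", "exp_s": "EXPOSURE_SECONDS"}


def to_common(record: Dict[str, object], mapping: Dict[str, str]) -> Dict[str, object]:
    """Translate a local record into common data dictionary terms."""
    common = {}
    for local_name, value in record.items():
        element = mapping.get(local_name)
        if element is None:
            continue  # field has no counterpart in the common dictionary
        expected = DATA_DICTIONARY[element]
        common[element] = expected(value)  # coerce to the dictionary's type
    return common


local_record = {"tgt": "MARS", "t0": "2002-02-19T00:00:00", "exp_s": "2.5", "note": "x"}
common_record = to_common(local_record, LOCAL_TO_COMMON)
```

Each data source keeps its own model; only the mapping table has to be written to join the shared vocabulary.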

Page 22: Developing a Reference Architecture for Scientific Data Systems

[UML class diagrams: Resource Metadata Model and Request/Response Model. Diagram annotations: “Based on ISO/IEC 11179”; “Based on Dublin Core”; “nasa.pds.xmlquery”.]

Resource Metadata Model

• Profile: associations profileAttributes (1), resourceAttributes (1), and elements (*, held in a Map; keys are Strings, equal to elements’ names)
• ProfileAttributes: id: String; version: String; statusID: String; securityType: String; parent: String; children: List; regAuthority: String; revisionNotes: List; dataDictID: String
• ResourceAttributes: identifier: String; title: String; formats: List; description: String; creators: List; subjects: List; publishers: List; contributors: List; dates: List; sources: List; languages: List; coverages: List; rights: List; contexts: List; aggregation: String; clazz: String; locations: List
• ProfileElement: name: String; id: String; desc: String; type: String; unit: String; synonyms: List; obligation: boolean; maxOccurrence: int; comments: String
• EnumeratedProfileElement: values: List
• RangedProfileElement: min: double; max: double
• UnspecifiedProfileElement

Request/Response Model

• XMLQuery: resultModeId: String; propogationType: String; propogationLevels: String; maxResults: int; kwqString: String; numResults: int; mimeAccept: List; associations queryHeader, fromSet, selectSet, whereSet, result
• QueryHeader: id: String; title: String; description: String; type: String; statusID: String; securityType: String; revisionNote: String; dataDictID: String
• QueryResult: list: List
• QueryElement: role: String; value: String
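To show how the profile classes in the diagram could fit together, here is a minimal sketch using a subset of the listed attributes. The `accepts`, `add`, and `matches` methods are additions for illustration, the two resource fields are flattened into `Profile` for brevity, and the sample identifiers are invented; none of this is the actual OODT implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProfileElement:
    # Subset of the attributes listed in the diagram.
    name: str
    type: str = "string"
    unit: str = ""

    def accepts(self, value) -> bool:
        return True  # an unconstrained element places no limit on values


@dataclass
class EnumeratedProfileElement(ProfileElement):
    values: List[str] = field(default_factory=list)

    def accepts(self, value) -> bool:
        return value in self.values


@dataclass
class RangedProfileElement(ProfileElement):
    min: float = 0.0
    max: float = 0.0

    def accepts(self, value) -> bool:
        return self.min <= float(value) <= self.max


@dataclass
class Profile:
    identifier: str  # stands in for ResourceAttributes.identifier
    title: str       # stands in for ResourceAttributes.title
    elements: Dict[str, ProfileElement] = field(default_factory=dict)

    def add(self, element: ProfileElement) -> None:
        # Keys are strings equal to the elements' names, as the diagram notes.
        self.elements[element.name] = element

    def matches(self, name: str, value) -> bool:
        element = self.elements.get(name)
        return element is not None and element.accepts(value)


profile = Profile(identifier="urn:example:image-collection", title="Example image collection")
profile.add(EnumeratedProfileElement(name="TARGET_NAME", values=["MARS", "PHOBOS"]))
profile.add(RangedProfileElement(name="LATITUDE", type="float", unit="deg", min=-90.0, max=90.0))
```

A registry holding such profiles can answer "could this resource contain data with TARGET_NAME = MARS?" without touching the data itself.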

Page 23: Developing a Reference Architecture for Scientific Data Systems

OODT software components

• Profile Service – A server-based registry that is able to either serve local XML profiles or plug into an existing catalog. This component provides resource discovery.
• Product Service – A server component that plugs into existing repositories and serves products. This includes translation services, etc
• Catalog and Archive Service – Transaction-based server that catalogs and archives products, providing profile and product servers for discovery and distribution
• Query Service – Provides query management across distributed services to enable discovery.
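The division of labor between a query service and the profile servers it fronts can be sketched as a fan-out: one query goes to every registered node, and only nodes with matches are reported. The classes, node names, and resource ids below are hypothetical stand-ins, with each profile server reduced to an in-memory lookup.

```python
from typing import Callable, Dict, List

# A profile server is modeled as a function from a keyword to matching resource ids.
ProfileServer = Callable[[str], List[str]]


def make_server(catalog: Dict[str, List[str]]) -> ProfileServer:
    """Wrap a local catalog (keyword -> resource ids) as a profile server."""
    def search(keyword: str) -> List[str]:
        return list(catalog.get(keyword, []))
    return search


class QueryService:
    """Sends one query to every registered profile server and merges the answers."""

    def __init__(self) -> None:
        self.servers: Dict[str, ProfileServer] = {}

    def register(self, node: str, server: ProfileServer) -> None:
        self.servers[node] = server

    def query(self, keyword: str) -> Dict[str, List[str]]:
        results = {}
        for node, server in self.servers.items():
            hits = server(keyword)
            if hits:  # only report nodes that hold matching resources
                results[node] = hits
        return results


service = QueryService()
service.register("imaging", make_server({"MARS": ["img-001", "img-002"]}))
service.register("geosciences", make_server({"MARS": ["geo-007"], "MOON": ["geo-101"]}))
service.register("rings", make_server({"SATURN": ["ring-042"]}))
found = service.query("MARS")
```

The caller learns which nodes hold relevant resources and can then ask each node's product server for the data.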

Page 24: Developing a Reference Architecture for Scientific Data Systems

[Diagram: OODT Reusable Data Grid Framework. Visualization Tools, Analysis Tools, and Web Search Tools connect through the OODT API to Mission Data Repositories, Biomedical Data Repositories, and Engineering Data Repositories.]

1. Science data tools and applications use “APIs” to connect to a virtual data repository
2. Middleware creates the data grid infrastructure connecting distributed heterogeneous systems and data
3. Repositories for storing and retrieving many types of data

Page 25: Developing a Reference Architecture for Scientific Data Systems

• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end

[Diagram: Query Integration. Labeled elements: Web I/F; Desktop I/F; Name Server (Service Registry); Node 1 Profile Server (Registry Server); Repository Product Server (Repository/Archive Server); Product Catalogs; Science Products. Labeled exchanges: XML Request; Information Object; WSDL.]

Page 26: Developing a Reference Architecture for Scientific Data Systems

OODT software implementation

• OODT is Open Source
• Developed using open source software (i.e. Java/J2EE and XML)
• Implemented reusable, extensible Java-based software components
– Core software for building and connecting data management systems
• Provided messaging as a “plug-in” component that can be replaced independent of the other core components
– Messaging components include: CORBA, Java RMI, JXTA, Web Services, etc
– REST seems to have prevailed
• Provided client APIs in Java, C++, HTTP, Python, IDL
• Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc)
• Used international data architecture standards
– ISO/IEC 11179: Specification and Standardization of Data Elements
– Dublin Core Metadata Initiative
– W3C’s Resource Description Framework (RDF) from Semantic Web Community
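The idea of messaging as a replaceable plug-in can be illustrated with a small interface and two interchangeable implementations. OODT itself is Java-based; this Python sketch, including the `Messenger` interface, both messenger classes, and the handler, is a hypothetical illustration of the pattern rather than OODT code.

```python
import json
from abc import ABC, abstractmethod


class Messenger(ABC):
    """Hypothetical messaging plug-in interface: carry a request, return a reply."""

    @abstractmethod
    def send(self, request: dict) -> dict: ...


class InProcessMessenger(Messenger):
    """Calls the handler directly, with no serialization."""

    def __init__(self, handler):
        self.handler = handler

    def send(self, request: dict) -> dict:
        return self.handler(request)


class JsonMessenger(Messenger):
    """Round-trips through JSON text, standing in for a wire protocol."""

    def __init__(self, handler):
        self.handler = handler

    def send(self, request: dict) -> dict:
        wire_request = json.dumps(request)
        wire_reply = json.dumps(self.handler(json.loads(wire_request)))
        return json.loads(wire_reply)


def product_handler(request: dict) -> dict:
    # Core component logic: unaware of which messenger delivered the request.
    return {"product": request["id"], "status": "found"}


class Client:
    def __init__(self, messenger: Messenger):
        self.messenger = messenger

    def fetch(self, product_id: str) -> dict:
        return self.messenger.send({"id": product_id})


direct = Client(InProcessMessenger(product_handler)).fetch("img-001")
serialized = Client(JsonMessenger(product_handler)).fetch("img-001")
```

Neither `Client` nor `product_handler` changes when the messenger is swapped, which is the property that lets a transport be replaced as technology shifts.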


Page 27: Developing a Reference Architecture for Scientific Data Systems

EDRN Knowledge Environment

• EDRN has been a pioneer in the use of informatics technologies to support biomarker research
• EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers
– Twelve institutions are sharing data
– Same architectural framework as planetary science
• It supports capture and access to a diverse set of information and results
– Biomarkers
– Proteomics
– Biospecimens
– Various technologies and data products (image, micro-satellite, …)
– Study Management


[Diagram: data and computers interconnected to form a virtual database. Integrated Cancer Resources: Specimens, Images, Assays, Biomarkers, etc]

Page 28: Developing a Reference Architecture for Scientific Data Systems


• Often unique, one of a kind missions
– Can drive technological changes
• Instruments are competed and developed by academic, industry and industrial partners
– Highly distributed acquisition and processing across partner organizations
– Highly diverse data sets given heterogeneity of the instruments and the targets (i.e. solar system)
• Missions are required to share science data results with the research community requiring:
– Common domain information model used to drive system implementations
– Expert scientific help to the user community on using the data
– Peer-review of data results to ensure quality
– Distribution of data to the community
• Planetary science data from NASA (and some international) missions is deposited into the Planetary Data System

Page 29: Developing a Reference Architecture for Scientific Data Systems

Source: A. Hooke, NASA/JPL

[Diagram: elements of a space mission system. Labeled elements: One or More Spacecraft; One or More Instruments; A Ground Tracking Network; A Space Tracking Network; A Spacecraft Control Center; An Instrument Control Center; A Science Facility; Commodity Space Communications Systems; Commodity Space Navigation Systems.]

Page 30: Developing a Reference Architecture for Scientific Data Systems

Planetary Data System

• NASA’s official archive for research results from solar system exploration
• Distributed across the United States at “PDS Nodes”
– 8 nodes including both science nodes and support nodes
• Data and Services reside at each node
• Unified by a common data architecture and broad technical architecture

Page 31: Developing a Reference Architecture for Scientific Data Systems

[Map: PDS nodes and data nodes]

• NAIF/JPL
• Small Bodies/UMD
• Atmospheres/New Mexico State
• Geosciences/Washington University
• Planetary Plasma/UCLA
• Rings/SETI
• Radio Science/Stanford
• Engineering/JPL
• Imaging/USGS
• Imaging/JPL
• Mars Odyssey THEMIS/ASU Data Node
• MRO-HiRISE/UofA Data Node

Page 32: Developing a Reference Architecture for Scientific Data Systems

The data architecture is key

• The planetary community has developed a diverse model that is enforced and used in data management
– NASA-led, but ESA, ISRO, JAXA, etc are leveraging planetary science data standards
• Core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments
– ~4000 different types of data
• Unique to planetary, but the concept of models and how they apply to science data is not


[Diagram labels: PDS Image Label (ODL); PDS Image Class (Object-Oriented); Describes; An Image]

Page 33: Developing a Reference Architecture for Scientific Data Systems


• Pre-Oct 2002, no unified view across distributed operational planetary science data repositories
– Science data distributed across the country
– Science data distributed on physical media
• Planetary data archive increasing from 4 TB in 2001 to 100 TB in 2009
– Traditional distribution infeasible due to cost and system constraints
– Mars Odyssey could not be distributed using traditional method
• Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to
– Support online distribution of science data to planetary scientists
– Enable interoperability between nine institutions
– Support real-time access to data products
– Provide uniform software interfaces to all Mars Odyssey data allowing scientists and developers to link in their own tools
– Operational October 1, 2002
• Moving to multi-terabyte online data movement in 2009

2001 Mars Odyssey
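The archive growth quoted on this slide (4 TB in 2001 to 100 TB in 2009) can be turned into an average yearly rate with a short calculation. The sketch assumes smooth compound growth between the two endpoints, which is a simplification; the function names are chosen for this example.

```python
import math


def annual_growth_rate(start_tb: float, end_tb: float, years: int) -> float:
    """Compound annual growth rate implied by two archive sizes."""
    return (end_tb / start_tb) ** (1.0 / years) - 1.0


def doubling_time_years(rate: float) -> float:
    """Years needed to double at a constant compound rate."""
    return math.log(2.0) / math.log(1.0 + rate)


rate = annual_growth_rate(4.0, 100.0, 2009 - 2001)
doubling = doubling_time_years(rate)
print(f"{rate:.1%} per year, doubling every {doubling:.1f} years")
```

A 25-fold increase over eight years works out to roughly 50 percent growth per year, which helps explain why distribution on physical media stopped being practical.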

Page 34: Developing a Reference Architecture for Scientific Data Systems

The architecture reuse opportunity

• While planetary has unique constraints and requirements, the broader architecture patterns are exhibited in other science areas
– Planetary can be very unforgiving when it comes to system failures
• Biology and Earth, for example,
– Are distributed
– Have similar pipelines and processes
– Focus on instruments that perform observations and then analysis of those instruments
– Work with data in similar ways
– Are PI and science-driven

Page 35: Developing a Reference Architecture for Scientific Data Systems


• “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature Magazine, September 2008

• The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences
– NASA routinely releases “data analysis programs” to analyze and process existing data


EDRN Data Repositories

Page 36: Developing a Reference Architecture for Scientific Data Systems


• Initiated in 2000, renewed in 2005
• 100+ Researchers (both members and associated members)
• ~40+ Research Institutions
• Mission of EDRN
– Discover, develop and validate biomarkers for cancer detection, diagnosis and risk assessment
– Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints
– Develop quality assurance programs for biomarker testing and evaluation
– Forge public-private partnerships
• Leverage building distributed planetary science data systems for biomedicine

Page 37: Developing a Reference Architecture for Scientific Data Systems


[Diagram: EDRN end-to-end data flow. Labeled elements: Instrument; Instrument Operations; Local Laboratory Science Data System; Laboratory Biorepository; Science Data Processing; eCAS - EDRN Biorepository; Data Distribution (EDRN Public Portal); EDRN Bioinformatics Tools; Analysis Team; EDRN Researchers; External Science Community; Publish Data Sets.]

Page 38: Developing a Reference Architecture for Scientific Data Systems

EDRN’s Ontology Model

• EDRN has developed a high-level ontology model for biomarker research which provides standards for the capture of biomarker information across the enterprise
• Specific models are derived from this high-level model
– Model of biospecimens
– Model for each class of science data
• EDRN is specifically focusing on a granular model for annotating biomarkers, studies and scientific results
• EDRN has a set of EDRN Common Data Elements which is used to provide standard data elements and values for the capture and exchange of data

[Figures: EDRN Biomarker Ontology Model; EDRN CDE Tools]

Page 39: Developing a Reference Architecture for Scientific Data Systems


The EDRN Knowledge Environment

• ESIS -- EDRN Study Information System
• eCAS -- EDRN Catalog and Archive System
• ERNE -- EDRN Resource Network Exchange
• BMDB -- NCI Biomarker DB

Page 40: Developing a Reference Architecture for Scientific Data Systems


Page 41: Developing a Reference Architecture for Scientific Data Systems


Moving to an integrated semantic architecture

• Semantic science portal driven by the EDRN ontology
• Schema loaded into the ontology via RDFS (and Protégé)
• Metadata from distributed applications dumped into the portal via RDF
• Moving EDRN towards a “pure” model-driven environment
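As a rough picture of what passing metadata to a portal via RDF involves, the sketch below renders one resource's metadata as N-Triples lines using only the standard library. The namespace `http://example.org/edrn#`, the property names, and the sample resource are invented for illustration, and the literal escaping covers only backslashes and double quotes; a real deployment would more likely rely on an RDF library.

```python
from typing import Dict, List

# Hypothetical namespace; the real EDRN ontology URIs are not given in the slides.
NS = "http://example.org/edrn#"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(subject_id: str, properties: Dict[str, str]) -> List[str]:
    """Render one resource's metadata as subject-predicate-object lines."""
    subject = f"<{NS}{subject_id}>"
    lines = []
    for name, value in sorted(properties.items()):
        lines.append(f'{subject} <{NS}{name}> "{escape_literal(value)}" .')
    return lines


triples = to_ntriples("biomarker-42", {"title": "Example marker", "organ": "Lung"})
```

Each application only has to emit statements in the shared vocabulary; the portal can merge statements from many sources because they all reduce to the same triple form.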

Page 42: Developing a Reference Architecture for Scientific Data Systems
Page 43: Developing a Reference Architecture for Scientific Data Systems

Other science areas

• Earth Science
– Leveraged OODT software framework for constructing ground data systems for earth science missions
– Used OODT Catalog and Archive Service software
– Constructed “workflows”
– Execution of “processors” based on a set of rules
• Medical Research
– Support for distributed analysis of pediatric intensive care units
• Climate Research
– Support for distributed modeling


SeaWinds on ADEOS II (Launched Dec 2002)
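Executing “processors” based on a set of rules, as mentioned for the earth science workflows, can be sketched as a loop that runs the first processor whose rule matches the current product metadata. The processing levels, the `complete` flag, and the processor names are illustrative assumptions and do not describe the actual ground data system.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, object]
Rule = Callable[[Metadata], bool]
Processor = Callable[[Metadata], Metadata]


def calibrate(meta: Metadata) -> Metadata:
    return {**meta, "level": 1}


def grid(meta: Metadata) -> Metadata:
    return {**meta, "level": 2}


# Each entry pairs a rule with the processor to run when the rule holds.
WORKFLOW: List[Tuple[Rule, Processor]] = [
    (lambda m: m["level"] == 0 and m["complete"], calibrate),
    (lambda m: m["level"] == 1, grid),
]


def run(meta: Metadata) -> Metadata:
    """Keep applying the first processor whose rule matches until none does."""
    while True:
        for rule, processor in WORKFLOW:
            if rule(meta):
                meta = processor(meta)
                break
        else:
            return meta  # no rule fired, so the workflow is finished


done = run({"granule": "g-001", "level": 0, "complete": True})
held = run({"granule": "g-002", "level": 0, "complete": False})
```

New processing steps are added by appending a rule and a processor to the table, without changing the loop that drives them.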

Page 44: Developing a Reference Architecture for Scientific Data Systems

Related work…

• The plethora of middleware, e-science and grid efforts…
• Major agency efforts in physical and life sciences…
• Standards efforts…
• All the technology support (but see my message on next slide as an architect!)

Page 45: Developing a Reference Architecture for Scientific Data Systems

My message…

• Distributed service architectures
– Not anything new (my experience with them goes back to the early 1990s)
– But, often, newer technologies and approaches are seen as a panacea
• Technology is not a replacement for a conceptual architecture
– My experience is that definition of the architecture independent of technology is critical
– The goal should be stability in the architecture model; the selection of appropriate technology will change over time
– This is why an architect is much more of a strategist than a technologist

Page 46: Developing a Reference Architecture for Scientific Data Systems

More preaching…

• Think about the entire system and identify the abstractions
– You need the holistic view
– What are the patterns?
• Will an architecture framework help (separation of process, data, technology, etc views)?
– Can these evolve independently?

Page 47: Developing a Reference Architecture for Scientific Data Systems

Resources

(1) Tracz, Will. Domain-Specific Software Architecture. ACM SIGSOFT, 1995.

(2) D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, pp. 44, Amsterdam, the Netherlands, December 4th-6th, 2006.

(3) C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.
