DESCRIPTION
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories The 2nd Research Data Access and Preservation (RDAP) Summit An ASIS&T Summit March 31-April 1, 2011 Denver, CO In cooperation with the Coalition for Networked Information http://asist.org/Conferences/RDAP11/index.html
National Aeronautics and Space Administration
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery
Research Data Access & Preservation
Denver, Colorado
March 31 - April 1, 2011
Steve Hughes
Dan Crichton
Chris Mattmann
Sean Kelly
Topics
• E-Science Trends
• Software Architectures
• Open Source
• Object-Oriented Data Technology
• Use Case
• Data Driven
“eScience” Trends
• Highly distributed, multi-organizational systems
  – Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments
• Sharing of data and services which allow for the discovery, access, and transformation of data
  – Systems are moving towards publishing of services and data in order to address data- and computationally intensive problems
  – Infrastructures are being built to handle future demand
  – Use of commodity services to address elasticity
• Address complex modeling, inter-disciplinary science and decision support needs
  – Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions
  – Need to ensure the information architecture supports the varying science needs
• Changing the way in which data analysis is performed
  – Moving towards analysis of distributed data to increase the study power
  – Enabling greater collaboration across centers
  – Systematizing, where possible
Highly Distributed Science Environments
Planetary Data System: Distributed Planetary Science Archive
• Small Bodies Node: University of Maryland, College Park, MD
• Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA
• Geosciences Node: Washington University, St. Louis, MO
• Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
• THEMIS Data Node: Arizona State University, Tempe, AZ
• Central Node: Jet Propulsion Laboratory, Pasadena, CA
• Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA
• Rings Node: Ames Research Center, Moffett Field, CA
• Atmospheres Node: New Mexico State University, Las Cruces, NM

National Data Sharing Infrastructure Supporting Collaboration in Biomedical Research for EDRN
• University of Michigan (CEC)
• Moffitt Cancer Center, Tampa (BDL)
• Creighton University (CEC)
• UT Health Science Center, San Antonio (CEC)
• University of Colorado (CEC)
• Fred Hutchinson Cancer Research Center, Seattle (DMCC)
• University of Pittsburgh (CEC)

• Highly distributed/federated
• Collaborative
• Information-centric
• Discipline-specific
• Growing/evolving
• Heterogeneous (implementations)
Why Software Architecture?
• Software Architecture: The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
• Architecture is about strategy to address key architectural concerns…
  – How can we exploit common patterns to improve reuse?
  – Can we develop software product lines?
  – Can we improve interoperability?
  – Can we reduce dependencies?
• What are the architectural principles? Loosely coupled, data-driven, highly distributed, commodity services, service-oriented, collaborative/multi-institutional
Notional Service Architectures Concept
• The service architecture concept exploits many of the architectural concepts discussed
  – Loosely coupled
  – Elasticity (e.g., commodity-based)
  – Multi-organizational
  – etc.
• At an enterprise scale, architectures don’t need to prescribe what’s inside services, just their interfaces, function, behavior, etc.
• Services might include…
  – Data discovery
  – Data access
  – Security
  – Transformation
[Figure: Client A, Client B, Service C, and Service Interface, drawn in the C2 architectural style]
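The point that an architecture only needs to fix a service's interface, not its internals, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `DataDiscovery` interface, the two implementing classes, and the product identifiers are hypothetical and are not taken from OODT or from the slides.

```python
from __future__ import annotations

from typing import Protocol


class DataDiscovery(Protocol):
    """Hypothetical service interface: the only thing clients depend on."""

    def find(self, keyword: str) -> list[str]:
        ...


class ArchiveNodeDiscovery:
    """One possible implementation; its internals are invisible to clients."""

    def __init__(self, holdings: dict[str, list[str]]) -> None:
        self._holdings = holdings

    def find(self, keyword: str) -> list[str]:
        return sorted(
            product_id
            for product_id, keywords in self._holdings.items()
            if keyword in keywords
        )


class FederatedDiscovery:
    """Another implementation: fans a query out to several member services."""

    def __init__(self, members: list[DataDiscovery]) -> None:
        self._members = members

    def find(self, keyword: str) -> list[str]:
        results: set[str] = set()
        for member in self._members:
            results.update(member.find(keyword))
        return sorted(results)


def client_search(service: DataDiscovery, keyword: str) -> list[str]:
    """A client written against the interface, not a concrete service."""
    return service.find(keyword)


# Invented holdings for two nodes of a small federation.
node_a = ArchiveNodeDiscovery({"img-001": ["mars", "image"], "img-002": ["saturn"]})
node_b = ArchiveNodeDiscovery({"ring-007": ["saturn", "rings"]})
federation = FederatedDiscovery([node_a, node_b])

mars_hits = client_search(node_a, "mars")
saturn_hits = client_search(federation, "saturn")
```

The same `client_search` call works against a single node or the whole federation, which is the loose coupling the slide describes.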
What does this have to do with open source?
• Core software product lines and tools that can be reused are excellent opportunities to create open source projects
  – Across a federation of organizations, systems and users, what can be developed and shared?
  – How can software components be developed in generic ways, but allow for extensions?
• Open source itself is a strategy
  – Can improve collaborations
  – Can drive a robust set of reusable software components and tools
  – Can push standards development
  – Can encourage use of common architectural patterns
Open Source Models
• Software sharing with an open source license (e.g., BSD-style license)
• Software distribution through open source organizations (e.g., SourceForge)
• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
• Ad hoc open source project communities with their own governance
Open Source Models: Our Opinion
• Software sharing with an open source license (e.g., BSD-style license)
  – It’s a great start
  – Limited community involvement
• Software distribution through open source organizations (e.g., SourceForge)
  – Provides good software distribution support
• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
  – This moves from just distribution support to collaboration and governance over the development
• Ad hoc open source project communities with their own governance
  – This can make a lot of sense for larger federations…
The Apache Software Foundation
• Largest open source software development entity in the world
  – Over 2,300 committers
  – Over 3,500 contributors
• 84 Top Level Projects
  – 36 Incubating
  – 30 Lab Projects
• 8 retired projects in the “Attic”
• Over 1.2 million revisions
- Over 10M successful requests served a day across the world
- HTTPD web server used on 100+ million web sites (52+% of the market)
OODT: An Open Source Framework for Building Distributed Science Data Mgmt Environments
• Focus on
  – distributed environments
  – science data generation
  – data capture, end-to-end
  – access to science data by the community
• A set of building blocks/services to exploit common system patterns for reuse
• 04-FEB-2011 - Apache OODT v0.2 Released
• Used for a number of science data system activities
http://oodt.apache.org/
e-Science Examples and OODT
[Figure: the Planetary Data System node map and the EDRN data sharing infrastructure map, repeated from the “Highly Distributed Science Environments” slide]
Planetary Science Data System
• Highly diverse (40 years of science data from NASA and Int’l missions)
• Geographically distributed; moving int’l
• New centers plugging in (i.e. data nodes)
• Multi-center data system infrastructure
• Heterogeneous nodes with common interfaces
• Integrated based on enterprise-wide data standards
• Sits on top of COTS-based middleware

EDRN Cancer Research
• Highly diverse (30+ centers performing parallel studies using different instruments)
• Geographically distributed
• New centers plugging in (i.e. data nodes)
• Multi-center data system infrastructure
• Heterogeneous sites with common interfaces allowing access to distributed portals
• Integrated based on common data standards
• Secure (e.g. encryption, authentication, authorization)
Mission Pipelines – Data Generation and Archive
• Leveraged OODT software framework for constructing ground data systems for earth science missions
  – Used OODT Catalog and Archive Service software
  – Focus is on “process management”
• Constructed “workflows”
  – Execution of “processors” based on a set of rules
  – Explicit separation of workflow management from management of computational resources
• Provided “lights out” operations
• Multiple Missions
  – SeaWinds
  – QuikSCAT
  – Orbiting Carbon Observatory (OCO), OCO-2…
  – NP Sounder PEATE
  – SMAP
[Figure: ground data system block diagram. Labeled elements: Spacecraft & Ancillary Files; File Transfer (FX); Pre-Processors (PP); Science Level Processors (LP); Science Analysis and Quality Reporting (SA); Engineering Analysis (EA); Instrument Commands; Product Delivery (PM); Science Products Released to PO.DAAC; User Interface (Process Monitoring & Control, Instrument Commanding, Data Verification); Data Management and Automatic Process Control (PM) using OODT]
SeaWinds on ADEOS II (Launched Dec 2002)
Credit: D. Freeborn, C. Mattmann, D. Woollard
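The two workflow ideas on this slide, running “processors” according to rules and keeping workflow management separate from resource management, can be sketched in a few lines of Python. This is not the OODT workflow API. The `Processor`, `Workflow`, and `run` names are hypothetical, and the rule used here (a processor may run once all of its required inputs exist) is an assumed, simplified stand-in for whatever rules a real mission pipeline uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Processor:
    """A hypothetical processing step; its rule is the set of inputs it requires."""
    name: str
    requires: set[str]
    produces: str


@dataclass
class Workflow:
    """Decides what may run next; knows nothing about where it runs."""
    processors: list[Processor]
    available: set[str] = field(default_factory=set)

    def ready(self) -> list[Processor]:
        # A processor is ready when its inputs exist and its output does not.
        return [
            p for p in self.processors
            if p.requires <= self.available and p.produces not in self.available
        ]


def run(workflow: Workflow, execute: Callable[[Processor], None]) -> list[str]:
    """Drive the workflow until no rule fires; `execute` stands in for the resource manager."""
    order: list[str] = []
    while True:
        batch = workflow.ready()
        if not batch:
            return order
        for processor in batch:
            execute(processor)  # where and how the job runs is decided elsewhere
            workflow.available.add(processor.produces)
            order.append(processor.name)


executed_on: list[str] = []


def local_executor(processor: Processor) -> None:
    """Toy resource manager that only records what it was asked to run."""
    executed_on.append(f"local:{processor.name}")


# Processor names loosely follow the diagram labels; the data names are invented.
pipeline = Workflow(
    processors=[
        Processor("science_analysis", {"level_product"}, "quality_report"),
        Processor("level_processor", {"preprocessed"}, "level_product"),
        Processor("pre_processor", {"raw_file"}, "preprocessed"),
    ],
    available={"raw_file"},
)
order = run(pipeline, local_executor)
```

Swapping `local_executor` for a function that submits jobs to a cluster would change nothing in `Workflow`, which is the separation the slide calls out.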
Conceptual Capabilities
• OODT Apache Suite (oodt.apache.org)
  – File Management
  – Workflow Management (for jobs/processing)
  – Data Transformation
  – Data Access
  – Metadata Query
• Registry (future addition to OODT)
  – Metadata Management based on ebXML registry specification
  – Used to manage different types of “extrinsic” objects (metadata descriptions of data, services, etc.)
    • “targets”, “science data products”, “documents”, “services”, etc.
  – Product identification, versioning, tracking, and subscription/notification
  – Indexing, Classification, and Associations
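The registry capabilities listed above (identification, versioning, and subscription/notification for “extrinsic” objects) can be pictured with a toy in-memory registry. This is an assumed sketch only: it does not implement the ebXML registry specification, and the `Registry` and `ExtrinsicObject` classes, their methods, and the sample identifiers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExtrinsicObject:
    """A hypothetical metadata description of something held outside the registry."""
    logical_id: str
    object_type: str  # e.g. "target", "science data product", "document", "service"
    version: int
    metadata: tuple[tuple[str, str], ...]


class Registry:
    """Toy registry: keeps every version and notifies subscribers by object type."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ExtrinsicObject]] = {}
        self._subscribers: dict[str, list[Callable[[ExtrinsicObject], None]]] = {}

    def subscribe(self, object_type: str, callback: Callable[[ExtrinsicObject], None]) -> None:
        self._subscribers.setdefault(object_type, []).append(callback)

    def register(self, logical_id: str, object_type: str, metadata: dict[str, str]) -> ExtrinsicObject:
        history = self._versions.setdefault(logical_id, [])
        obj = ExtrinsicObject(
            logical_id, object_type, len(history) + 1, tuple(sorted(metadata.items()))
        )
        history.append(obj)  # earlier versions stay available for tracking
        for callback in self._subscribers.get(object_type, []):
            callback(obj)
        return obj

    def latest(self, logical_id: str) -> ExtrinsicObject:
        return self._versions[logical_id][-1]

    def find_by_type(self, object_type: str) -> list[str]:
        return sorted(
            logical_id
            for logical_id, history in self._versions.items()
            if history[-1].object_type == object_type
        )


notices: list[tuple[str, int]] = []
registry = Registry()
registry.subscribe(
    "science data product", lambda obj: notices.append((obj.logical_id, obj.version))
)
registry.register("mars-target", "target", {"name": "Mars"})
registry.register("image-42", "science data product", {"target": "Mars"})
registry.register("image-42", "science data product", {"target": "Mars", "calibrated": "yes"})
```

Registering `image-42` a second time creates version 2 rather than overwriting version 1, and only subscribers to that object type are notified.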
Information Architecture
• OODT + Registry contains two different types of “models”
  – Core Infrastructure model
  – Discipline model
• Core infrastructure model is intrinsic (integrated with the software)
  – It is built in and used by the software; this never changes and you don’t need to worry about it
  – Services are part of the core infrastructure (“intrinsic”) but all other metadata objects are “extrinsic”
• Discipline model is extrinsic (defined outside the software)
  – It is dynamically configured
  – For example, the registry can be configured to use whatever “extrinsic” metadata objects are important to manage
  – This allows for the registry to be used for tracking artifacts, managing services, etc.
  – This is what projects need to define
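One way to picture the split between the two models is a validator whose core checks are fixed in code while the discipline model arrives as configuration. The sketch below is an assumption for illustration: the attribute names, the two example discipline models, and the `validate` function are invented and do not reflect how OODT or the registry actually represent either model.

```python
from __future__ import annotations

# Core infrastructure model: fixed in the software (intrinsic).
CORE_ATTRIBUTES = ("identifier", "object_type")

# Discipline models: supplied by each project as configuration (extrinsic).
# These object types and attributes are invented for illustration.
PLANETARY_MODEL = {
    "target": ["name"],
    "science data product": ["target", "instrument"],
}
BIOMARKER_MODEL = {
    "specimen": ["site"],
}


def validate(record: dict[str, str], discipline_model: dict[str, list[str]]) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [
        f"missing core attribute: {name}"
        for name in CORE_ATTRIBUTES
        if name not in record
    ]
    if problems:
        return problems
    object_type = record["object_type"]
    if object_type not in discipline_model:
        return [f"unknown object type: {object_type}"]
    return [
        f"missing {object_type} attribute: {name}"
        for name in discipline_model[object_type]
        if name not in record
    ]


mars = {"identifier": "mars", "object_type": "target", "name": "Mars"}
image = {"identifier": "img-1", "object_type": "science data product", "target": "mars"}

ok = validate(mars, PLANETARY_MODEL)
incomplete = validate(image, PLANETARY_MODEL)
wrong_discipline = validate(mars, BIOMARKER_MODEL)
```

The same `validate` code serves both disciplines; only the configured model changes, which mirrors the “this is what projects need to define” point.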
Observational Product – Concept Map
PDS4 High Level Concept Map
Defining Extrinsic Objects and their Context (Ontology)
External Data Standards
• Open Archival Information System (OAIS) Reference Model - Defines the “Information Object,” a key component of the model.
• ISO/IEC 11179-3: Registry Metamodel and Basic Attributes - Provides the schema for the data dictionary. Defines the concepts of registration authority and steward for governance.
• Object-Oriented Data Modeling – Used as a standard modeling methodology.
• XML/XML Schema – Provides the label syntax and validation mechanism.
• OASIS/ebXML Registry Information Model - Provides attributes for object registration within a federated registry/repository.
• ISO 15836:2009 The Dublin Core Metadata Element Set – Provides standard web resource identification attributes.
• Semantics - RDF, RDFS, OWL - Provides W3C standards for knowledge representation.
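As a small illustration of how two of these standards combine, the sketch below writes an XML “label” whose fields are Dublin Core elements and then reads one back. It uses Python's standard `xml.etree.ElementTree` module and the Dublin Core element set namespace. The `label` root element, the `build_label` helper, and the field values are hypothetical; this is not the PDS4 label format, and it does not perform XML Schema validation.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_label(fields: dict[str, str]) -> ET.Element:
    """Build a hypothetical <label> element holding Dublin Core elements."""
    label = ET.Element("label")
    for name, value in fields.items():
        child = ET.SubElement(label, f"{{{DC_NS}}}{name}")
        child.text = value
    return label


# Invented field values for an example product.
label = build_label({
    "title": "Example observational product",
    "identifier": "example:product-001",
    "creator": "Example Instrument Team",
})
xml_text = ET.tostring(label, encoding="unicode")

# Parse the serialized label back and read one field by its qualified name.
parsed = ET.fromstring(xml_text)
title = parsed.find(f"{{{DC_NS}}}title").text
```

Using a shared namespace for the identification fields is what lets a generic tool pick out the title or identifier without knowing the discipline-specific parts of a label.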
A perspective to leave you with…
• Agency science federations, based on an open source/collaborative model, are very attractive for the following reasons:
– Science benefits: can drive a growing enterprise of shared science services and software infrastructure support
– Technology benefits: can drive innovation through its peer review and collaboration process
– Infusion benefits: creates a defined process for contributing new ideas and capabilities
– Architecture benefits: helps you build towards a common architectural vision and drive community standards
– Cost benefits: can enable better leveraging and reuse of skills and capabilities across institutions
– Tech transfer benefits: may benefit other science (and non-science) disciplines
Questions?
Thank You!!!
Steve Hughes
Chris Mattmann
Note…we have several papers, book chapters on data intensive systems, etc that we’d be happy to share! A few key ones…
D. Crichton, C. Mattmann, J. S. Hughes, S. Kelly, and A. Hart. “A Multi-Disciplinary, Model-Driven, Distributed Science Data System Architecture.” Guide to e-Science: Next Generation Scientific Research and Discovery. X. Yang, L. L. Wang, W. Jie, eds. Springer Verlag, 2010, To appear.
D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. “A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer”. Accepted for publication at the 2nd IEEE International Conference on e-Science and Grid Computing, Amsterdam, the Netherlands, December 4th-6th, 2006.
C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. “A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications”. In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.
Sean Kelly
Backup