caGrid Executive Introduction
caGrid 1.3
Justin Permar
caGrid Knowledge Center
https://cabig-kc.nci.nih.gov/CaGrid/KC
Agenda
• Vision and Use Cases
• caGrid Introduction
• Building and Using caBIG Applications
• Component / Service Survey
• Grid Interactions
• Grid Service Deployment
Vision
• “Imagine, if you will, a resource that would give individual scientists the capacity to easily view aggregate information on thousands of patients; a system that would also allow both patients and physicians to have complete medical records - including the patient's personal genome, tests performed over time, and medications taken - available at the click of a mouse. Rather than recruiting patients into clinical trials by who walks into the clinic or by individual referral, clinician-scientists could scan a database for patients precisely matched to their study, even if the study is looking for patients with specific genomic alterations, mutations, or translocations.”
• “In efforts to increase both the efficacy and efficiency of cancer care, managers of healthcare systems would have patient outcome data from hospitals across the country to utilize in benchmarking their own outcomes in key areas and managing cost. These brief examples are just a glimpse of the power that could come from such an interconnected national biomedical resource.”
Source: John Niederhuber, Director, NCI
About caBIG®
• caBIG® stands for the cancer Biomedical Informatics Grid®. caBIG® is an information network enabling all constituencies in the cancer community – researchers, physicians, and patients – to share data and knowledge. The components of caBIG® are widely applicable beyond cancer as well.
• The mission of caBIG® is to develop a truly collaborative information network that accelerates the discovery of new approaches for the detection, diagnosis, treatment, and prevention of cancer, ultimately improving patient outcomes.
• The goals of caBIG® are to:
  • Connect scientists and practitioners through a shareable and interoperable infrastructure
  • Develop standard rules and a common language to more easily share information
  • Build or adapt tools for collecting, analyzing, integrating, and disseminating information associated with cancer research and care.
Source: https://cabig.nci.nih.gov/overview/
Driving needs: cancer Biomedical Informatics Grid
• A multitude of “legacy” information systems, most of which cannot be readily shared between institutions
• An absence of tools to connect different databases
• An absence of common data formats
• A huge and growing volume of data that must be collected, analyzed, and made accessible
• Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results
• Difficulty in identifying and accessing available resources
• An absence of information infrastructure to share data within an institution, or among multiple institutions
• Redundant effort from re-building applications at multiple institutions
What is the Grid?
• “Controlled and coordinated resource sharing and problem solving in dynamic, scalable virtual organizations.” [1]
• Securely sharing (with policies!):
  • Computers
  • Software
  • Data
  • Other Resources
[1] The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.
What is caBIG?
• Common, widely distributed infrastructure that addresses common caBIG needs and permits the cancer research community to focus on innovation
• Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange
• Collection of interoperable applications developed to common standards
• Cancer research data available for mining and integration
Why Grid for caBIG?
Informatics Requirements    Advantages of Grid
Controlled                  Secure, role-based, locally-controlled access
Comprehensive               Data from multiple types of sources
Connected                   Syntactic and Semantic Interoperability
Convenient                  Simplified and customizable interfaces
Cost                        Cost effective – builds on existing technologies
Compliant                   Implements policy & technical standards
Credible                    Built on experience & best practices
Adapted from Muzna Mirza, MD, MSHI’s presentation on Global Public Health Grid: http://cdc.confex.com/cdc/phin2009/webprogram/Paper21091.html
What is caGrid to caBIG?
• The “G” in caBIG
  • Cancer Biomedical Informatics Grid
• Provides the software infrastructure that underlies the tools and applications of caBIG
• Analogous to the “power grid”
  • A multitude of applications with differing requirements can seamlessly be plugged in to a common infrastructure
What is caGrid? (2)
• Biomedical applications that share data all have common needs for syntactic and semantic interoperability
  • caGrid aims to be a platform for interoperability
• caGrid is a Grid software toolkit aimed at software developers creating Grid applications
• caGrid provides
  • the GAARDS toolkit, a standard security platform
  • metadata services that add semantic information to all Grid services
  • Introduce, a toolkit to develop Grid services
• The Grid is a trusted network that supports collaborative biomedical research.
  • “Getting on the Grid” involves joining the trusted network by applying for and utilizing Grid credentials
Compatibility and Interoperability
caBIG® provides standards-based compatibility guidelines for creating software systems that are syntactically and semantically interoperable.
The Grid Allows Users to Find and Utilize Data and Analytical resources
Grid service information is advertised to a Grid service directory called the Index service. This service is used to locate Grid services relevant to your research objectives.
Figure: Data or Analytical Resource, caBIO Grid Service, Grid Service, Grid (Client Apps, Users), and Grid Service Directory (Index Service), connected by “advertise” and “discover” arrows
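The advertise and discover interaction described above can be sketched with a small in-memory directory. This is only an illustration of the pattern: `ServiceEntry`, `ServiceDirectory`, the keyword matching, and the example URLs are all hypothetical and are not the Index Service API or its metadata schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ServiceEntry:
    """Hypothetical directory record (not the real Index Service schema)."""
    url: str
    kind: str              # "data" or "analytical"
    keywords: frozenset    # free-text terms describing the service


@dataclass
class ServiceDirectory:
    """In-memory stand-in for a Grid service directory."""
    entries: dict = field(default_factory=dict)

    def advertise(self, entry: ServiceEntry) -> None:
        # A service registers (or refreshes) its own description.
        self.entries[entry.url] = entry

    def discover(self, keyword: str, kind: Optional[str] = None) -> list:
        # Clients search by keyword, optionally restricted to one kind.
        wanted = keyword.lower()
        hits = [
            e for e in self.entries.values()
            if wanted in {k.lower() for k in e.keywords}
            and (kind is None or e.kind == kind)
        ]
        return sorted(hits, key=lambda e: e.url)


# Example URLs are placeholders, not real Grid endpoints.
directory = ServiceDirectory()
directory.advertise(ServiceEntry(
    "https://example.org/services/GeneData", "data",
    frozenset({"gene", "annotation"})))
directory.advertise(ServiceEntry(
    "https://example.org/services/ImageAnalysis", "analytical",
    frozenset({"image"})))

matches = directory.discover("Gene", kind="data")
```

A client that only knows a research topic can find candidate services this way, without knowing any service address in advance.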
caGrid: High Level View
Once a caBIG® tool is adopted or adapted by members of the research community, the tool is connected to the Grid to securely share data and analysis routines with collaborating researchers.
Infrastructure Focus Areas
• Leveraging Grid technologies and standards as an interoperability platform
• Metadata Infrastructure
  • Surfacing wealth of existing caBIG data-oriented metadata on the grid
  • Providing new service-oriented metadata
• Security
  • Integrating existing systems and applications with Grid security
  • Lowering burden of implementation of grid-wide and local policy
• Tooling for Service Developers
  • Powerful platform for bringing applications and data to the grid
• Facilitating Grid-wide operations
  • Federated query, workflow execution, resource discovery
• Making the Grid more accessible
  • Graphical installation and configuration, higher-level object-oriented APIs, web portals, graphical administrative applications
• Quality
  • Comprehensive testing infrastructure, automated builds and test execution on multiple platforms, dashboard with historical archive
More About Security
• Comprehensive security is critical for collaboration scenarios involving biomedical data sharing. The caGrid security components, collectively known as GAARDS, include the following services:
  • Dorian – Allows users to log in to the Grid
  • Authentication Service – Integrates existing institutional login capabilities with the Grid
  • Grid Grouper – Allows institutions to implement group-based security policies
  • Grid Trust Service – Provides capabilities for Grid entities to trust each other
  • Credential Delegation Service – Provides the ability to securely transfer Grid credentials to others
  • Web Single Sign-On – Allows a single login to provide access to multiple web applications that utilize Grid services
caGrid Integration with Existing Information Systems
• caGrid is an informatics platform that integrates and augments existing informatics infrastructure
• Examples include the following:
  • caGrid integrates existing repositories of semantic information such as ontology servers
  • caGrid integrates with existing institutional login systems (e.g., LDAP)
  • caGrid shares data from existing databases and files
• In summary, caGrid integrates with existing systems to share and analyze data for multi-institutional clinical and research scenarios
Getting Started with caGrid
• To get started developing Grid applications, first install caGrid
• Use the caGrid installer to load caGrid onto your development machine
  • Using the installer is the easiest way to install caGrid
• Features include:
  • Guided, wizard-like interface for easy installation
  • The installer can be used to re-configure existing installations
• The only requirement to run the installer is the Sun® Java™ 5 Development Kit.
caGrid Community Involvement: Building Grid Applications
• caGrid itself provides no real “data” or “analysis” to caBIG; caGrid enables the community to build services that share and analyze data
• The real “value” of the grid comes from bringing this information to the “end user”
• Community members develop end user applications which consume the resources provided by the grid
  • A Grid data service shares data securely with collaborators
  • A Grid analytical service analyzes data
  • A Grid application utilizes multiple Grid services to aid clinical and research workflows
caCORE Development Process
• caCORE is a robust set of tools and resources to support the development of caBIG®-compatible systems
• NCI offers comprehensive training for caCORE tools
Figure: caCORE development process (each step with its label in the figure)
• Create an Information Model using a modeling tool: Information Models
• Perform Semantic Integration using the SIW: Vocabularies
• Generate Code and Interfaces using the caCORE SDK Code Generator: APIs
• Transform the Model into Metadata using the UML Loader: CDEs
• Generate a Grid Service using caGrid: Grid
Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
UML Model Creation Process
• Enterprise Vocabulary Services (EVS)
  • Stores controlled terminologies used during semantic annotation
  • The SIW pulls concepts from EVS and attaches them to model components
• cancer Data Standards Repository (caDSR)
  • Common Data Elements (CDEs)
  • UML model elements that are semantically annotated are added to the caDSR as CDEs
Figure: UML model creation process (each step with its label in the figure)
• Create a Logical Model (UML class diagram) using Enterprise Architect: Logical Model
• Create a Data Model (database schema) using Enterprise Architect: Data Model
• Semantically Annotate the UML Model using the SIW: Semantics
• Map the Logical Model to the Data Model using caAdapter: Mapping
• Model is complete and ready for compatibility review and load into caDSR: Load Model
caBIG® Compatibility Guidelines: Areas of Interoperability
• Semantic Interoperability (VCDE)
  • Information Models
  • Vocabularies and Ontologies
  • Common Data Elements (CDEs)
• Syntactic Interoperability (Architecture)
  • Programming and Messaging Interfaces
An application must meet the criteria specified in all four areas to be "caBIG® Compatible"
Figure: Vocabularies, Information Models, APIs, CDEs
Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
caBIG® Compatibility Guidelines: Levels of Maturity
• Legacy: Implies no interoperability with an external system or resource
• Bronze: Minimum requirements to achieve basic interoperability
• Silver: Rigorous requirements to significantly reduce the barrier of use for parties not involved with development of that resource
• Gold: Extensions to silver that add standardization and harmonization practices to enable full syntactic and semantic interoperability
Figure: Vocabularies, Information Models, APIs, CDEs
Source: https://cabig.nci.nih.gov/guidelines_documentation
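The compatibility guidelines combine four areas of interoperability with four maturity levels. The sketch below shows one way to reason about that combination. It assumes that an application's overall level is the lowest level it reaches in any of the four areas, which is one reading of "must meet the criteria specified in all four areas" and is not taken from the guidelines themselves. All names (`Maturity`, `AREAS`, `overall_maturity`) are hypothetical.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Maturity levels from the guidelines, ordered lowest to highest."""
    LEGACY = 0
    BRONZE = 1
    SILVER = 2
    GOLD = 3


# The four areas of interoperability named in the guidelines.
AREAS = ("information_models", "vocabularies", "cdes", "apis")


def overall_maturity(levels: dict) -> Maturity:
    """Return the lowest level across the four areas.

    Assumption: an application is only as compatible as its weakest area.
    """
    missing = [area for area in AREAS if area not in levels]
    if missing:
        raise ValueError(f"no assessment for areas: {missing}")
    return min(levels[area] for area in AREAS)


assessment = {
    "information_models": Maturity.SILVER,
    "vocabularies": Maturity.SILVER,
    "cdes": Maturity.BRONZE,
    "apis": Maturity.GOLD,
}
level = overall_maturity(assessment)
```

Under this assumption, the example application is rated Bronze until its CDE work reaches Silver.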
Using caBIG Applications
caGrid 1.3 Core Services
All caGrid Core Services were redeployed on all caBIG® Grids (OSU Training, QA, Stage, and Production) for this release.
The (12) caGrid 1.3 Core Services are:
* New for 1.3
** Significantly Rewritten or Enhanced for 1.3

Metadata Services                  Security Services                      Business Activity Services
Global Model Exchange Service**    Authentication Service**               Federated Query Processor Service**
Index Service**                    Credential Delegation Service          BPEL Workflow Service
Metadata Model Service*            Dorian Service**                       Taverna Workflow Service*
                                   Grid Grouper Service
                                   Grid Trust Service (Master & Slave)
What’s the use of metadata?
• Service metadata is critical for finding Grid resources relevant to particular research and clinical scenarios
• Metadata describes the service functionality and meaning of data that are shared by a Grid service
• Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios
  • Solution: Grid services register with a Grid service directory
• Scenario: Users want to view the structure and relationships of data on the Grid
  • Solution: The UML model defines the content of Grid data types and relationships between these types
• Scenario: Users need to know the format of the data described in a UML model
  • Solution: XML schemas, stored in a Grid repository, define the data format to act as the foundation for syntactic interoperability
• Scenario: Scientists want to identify the meaning of the data described in a UML model
  • Solution: Grid data is annotated with semantic information, such as use of community-approved vocabulary and concept definitions
What caGrid services provide this functionality?
• Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios
  • The Index Service included in caGrid is a Grid-wide service directory that serves as the “white” and “yellow” pages of the Grid
• Scenario: Users want to view the structure and relationships of data on the Grid
  • Every data service provides a data model that represents the information in the UML model
• Scenario: Users need to know the format of the data described in a UML model
  • The Global Model Exchange (GME) Service is a Grid-wide repository for XML schemas
• Scenario: Scientists want to identify the meaning of the data described in a UML model
  • The Metadata Model Service (MMS) is used to add semantic information to caGrid services
  • The MMS also is used to generate a Grid representation of the data in your UML model, including semantic information
How does caGrid use the caBIG semantic repositories?
• All caGrid Services are expected to publish a set of standard metadata which draws heavily from the metadata registered in caDSR and EVS
  • Common Metadata describes generic information about the Cancer Center providing the service, points of contact, etc.
  • The Service’s operations are defined and their inputs and outputs described using CDEs in caDSR and vocabulary from EVS
  • Data Services additionally describe the domain Model they are exposing
    • Classes, attributes, and associations from the UML model
    • Semantics of the UML model
What security problems exist for multi-institutional data sharing scenarios?
• Inter-institutional “trust”
  • What institutions participate in the Grid? How can you verify that an identity is issued by the institution that it claims to be from?
• User authentication
  • How does a user prove their identity? How can we check that the identity is legitimate?
• User authorization
  • How can institutions that share Grid services grant privileges to their collaborators?
  • How can institutions that share data ensure their collaborators can only access data that the institutions intend to share?
• Data Integrity
  • How can institutions be sure that data they are sharing is transmitted properly?
• Data Security
  • How can institutions be sure that they share data only with whom they intend to share data?
• Allowing services to retrieve and analyze data on your behalf
What caGrid Services Address these Security Scenarios?
• Inter-institutional “trust”
  • The Grid Trust Service (GTS) is used to establish a trust fabric, which is a collection of authoritative certificate authorities
• User authentication
  • Dorian has a CA that is an essential part of the trust fabric
  • Dorian issues both host certificates and user credentials that are trusted by others in the Grid because they have synchronized with the trust fabric
  • The Authentication Service allows institutions to integrate their local user management systems with the Grid
• User authorization
  • Grid Grouper provides group management, which in turn, allows service developers to add group-based authorization policies
  • The Common Security Module (CSM) can be used to protect individual data elements shared by a Grid data service
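Group-based authorization, as described for Grid Grouper above, can be sketched as a membership lookup combined with a per-operation policy. This is an illustration of the idea only: `GroupRegistry`, `authorize`, the group names, and the identity strings are hypothetical and do not reflect the Grid Grouper or CSM APIs.

```python
class GroupRegistry:
    """Hypothetical stand-in for a group management service."""

    def __init__(self):
        self._members = {}  # group name -> set of user identities

    def add_member(self, group: str, identity: str) -> None:
        self._members.setdefault(group, set()).add(identity)

    def is_member(self, group: str, identity: str) -> bool:
        return identity in self._members.get(group, set())


def authorize(registry: GroupRegistry, policy: dict,
              operation: str, identity: str) -> bool:
    """Allow the call if the caller belongs to any group the policy lists.

    Operations with no policy entry are denied (deny by default).
    """
    allowed_groups = policy.get(operation, ())
    return any(registry.is_member(g, identity) for g in allowed_groups)


# Identities and group names below are made up for the example.
registry = GroupRegistry()
registry.add_member("trial-investigators", "/O=Example/CN=alice")

policy = {"queryPatients": ["trial-investigators"]}

alice_ok = authorize(registry, policy, "queryPatients", "/O=Example/CN=alice")
bob_ok = authorize(registry, policy, "queryPatients", "/O=Example/CN=bob")
unlisted_ok = authorize(registry, policy, "deletePatients", "/O=Example/CN=alice")
```

Keeping membership in a shared registry means a service's policy can name a group once, and collaborators gain or lose access as the group changes.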
What caGrid Services Address these Security Scenarios? (2)
• Data Integrity
  • caGrid supports checksums to ensure that data has not been altered during transmissions
• Data Security
  • caGrid supports encryption to ensure that data cannot be read by others during transmission
• Allowing services to work for you
  • The credential delegation service (CDS) allows you to hand your credential to a third party for a specified period of time
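The data integrity point above can be illustrated with a generic checksum comparison. This sketch uses SHA-256 from the Python standard library. It is not a description of how caGrid computes or transports its checksums, and the function names and payload are hypothetical.

```python
import hashlib
import hmac


def checksum(payload: bytes) -> str:
    """Return a hex digest the sender transmits alongside the payload."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    """Recompute the digest on receipt and compare it with the sent value."""
    return hmac.compare_digest(checksum(payload), expected)


sent = b"subject_id,qt_ms\nS1,400\n"
digest = checksum(sent)

intact = verify(sent, digest)                # payload unchanged in transit
altered = verify(sent + b"S2,999\n", digest) # payload modified in transit
```

A bare digest like this detects alteration only if the digest itself arrives untampered, which is why integrity checks are normally combined with the authentication and encryption measures listed above.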
How do Grid applications use core caGrid services?
• The user community adds data services and analytical services to the Grid
  • These services share data and analytical resources with others
• Multi-institutional collaborations will require the use of multiple Grid services
• caGrid provides “higher-level” services that utilize the aforementioned Grid services
  • The Federated Query Processor (FQP) provides applications with capabilities to aggregate data from multiple (equivalent) data services and to join data from multiple data services
  • The workflow services allow users to specify interactions between services to achieve a desired result
    • For example, retrieve all ECG data for subjects in our clinical trial and calculate the mean QT value, storing the data in a results data service
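The aggregate and join operations, and the ECG example above, can be sketched with plain Python functions. Data services are modeled here as callables that return lists of records. This is a conceptual stand-in, not the Federated Query Processor or workflow service API, and all names, fields, and values are invented for the example.

```python
from statistics import mean


def aggregate(services, query):
    """Send the same query to equivalent data services and pool the results."""
    results = []
    for service in services:
        results.extend(service(query))
    return results


def join(left, right, key):
    """Join two result sets on a shared key (inner join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Two hypothetical ECG data services at different institutions.
def site_a(query):
    return [{"subject_id": "S1", "qt_ms": 400},
            {"subject_id": "S2", "qt_ms": 420}]


def site_b(query):
    return [{"subject_id": "S3", "qt_ms": 440}]


# Hypothetical trial enrollment records from another data service.
enrolled = [{"subject_id": "S1", "trial": "T-001"},
            {"subject_id": "S3", "trial": "T-001"}]

ecg = aggregate([site_a, site_b], "all ECG records")
trial_ecg = join(enrolled, ecg, "subject_id")
mean_qt = mean(row["qt_ms"] for row in trial_ecg)
```

Here the aggregation pools ECG records from both sites, the join keeps only enrolled subjects, and the mean QT value is computed over the joined rows. Storing the result in a results data service is left out of the sketch.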
Other caGrid Utilities and APIs
• CQL and DCQL
  • CQL is the “caGrid Query Language” that is used to retrieve data from caGrid data services
  • DCQL is the distributed query language that is used for federated query processing
• Web Single Sign On
  • The Web Single Sign On component allows users to sign in once and use multiple secure web applications
• Introduce
  • Grid application developers use the Introduce toolkit to create data and analytical services
  • The Introduce toolkit can be extended to add project-specific functionality
An example Introduce development process (0 lines of developer code!)
Figure: Getting Connected: Deploying to caGrid™
Stage labels: Create Semantically Harmonized Data Model; Generate Data Resource; Grid-ify
• Create an Information Model in a modeling tool
• Perform Semantic Integration using the Semantic Integration Workbench (SIW)
• Generate Code and Messaging Interfaces using the caCORE SDK Code Generator
• Transform the Information Model into Metadata using the UML Loader
• Generate a caGrid Interface using “Introduce”
Grid Workflows utilize core Grid Services
• The Grid services that are included in caGrid provide a core set of features available for Grid usage scenarios
• Grid workflows are software implementations of real-life clinical and research workflows
Figure: Example Data Analysis Workflow
Example Image Analysis Scenario
Each image processing step is a Grid service
Each step in background correction is an operation
Source: Joel H. Saltz, Scott Oster, Shannon L. Hastings, Stephen Langella, Renato A. Ferreira, Justin D. Permar, Ashish Sharma, David W. Ervin, Tony C. Pan, Umit V. Catalyurek, Tahsin M. Kurc, "Translational research design templates, Grid computing, and HPC", IEEE International Symposium on Parallel and Distributed Processing, pp. 1-15, June 2008. http://bmi.osu.edu/publications_more.php?ID=1113
Joining the Grid
• During Grid service creation, the service creator specifies the authentication and authorization requirements for the service
  • For example, a service can require that users must authenticate with the service in order to communicate
  • Specify authorization options (CSM/Grid Grouper) that are needed to support data retrieval and analysis operations that the service offers. A service can require authorization at the service level, operation level, and data level (give the user permission to retrieve only what they are allowed to view)
• Configure a container to host the service
  • Two types of containers: secure and non-secure
  • A non-secure container can only host non-secure services and does not support authentication or authorization
  • A secure container can host secure and non-secure services and will support authentication and authorization as specified by the service
    • A secure container has its own identity that it uses to communicate with the rest of the Grid
• Deploy the service to the container and start the container
• The service advertises itself to the Grid service directory
  • The service directory, in turn, asks your service for information about its operations and data
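The container rules above (a non-secure container hosts only non-secure services, while a secure container hosts both) can be expressed as a simple deployment check. The `Service` and `Container` classes are hypothetical and only model the rule as stated on this slide. They are not the API of the actual service container.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Service:
    name: str
    secure: bool  # True if the service requires authentication/authorization


@dataclass
class Container:
    secure: bool
    services: list = field(default_factory=list)

    def deploy(self, service: Service) -> None:
        # A non-secure container cannot enforce a secure service's requirements.
        if service.secure and not self.secure:
            raise ValueError(
                f"{service.name} is a secure service and needs a secure container")
        self.services.append(service)


public_data = Service("PublicReferenceData", secure=False)
patient_data = Service("PatientData", secure=True)

secure_container = Container(secure=True)
secure_container.deploy(public_data)   # allowed
secure_container.deploy(patient_data)  # allowed

open_container = Container(secure=False)
open_container.deploy(public_data)     # allowed
try:
    open_container.deploy(patient_data)
    rejected = False
except ValueError:
    rejected = True
```

Checking the rule at deployment time surfaces a misconfiguration before the service advertises itself to the Grid service directory.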
The Role of Grid Policy
• The virtual organizations that join a Grid collectively establish (and enforce) policies that govern the use of the Grid
  • Security policies
    • How long can a user Grid session last?
  • Data sharing policies
    • Sharing de-identified data? Limited data sets? PHI?
  • Service level agreements
    • What requirements are imposed on service providers?
  • Other domain-specific policies
Project Resources and Communication
• cagrid.org
  • Software Downloads
  • Documentation
  • Tutorials
  • Technical Paper and Presentations
  • FAQs
• caBIG® caGrid Knowledge Center
  • Knowledge Base
  • Forums
  • Enterprise Support
  • Community engagement
  • https://cabig-kc.nci.nih.gov/CaGrid/KC/index.php/Main_Page
• caGrid GForge Home (project website)
  • Feature Requests
  • Bug Reports
  • http://gforge.nci.nih.gov/projects/cagrid-1-0/
• caGrid Portal (web portal)
  • http://cagrid-portal.nci.nih.gov/
Acknowledgments
• THANK YOU
  • caGrid Development team
  • caBIG® Documentation and Training team