Status report from the TWG/CCIT to the CEWG
2009-08-12
Dave Vieglais and Ryan Scherle
TWG Overview
• Activity
• Two Meetings
• Weekly (or so) telecon
• Time contributions by Duane and Mark
• Significant outcomes thus far
• Project infrastructure (plone sites, svn)
• Use cases, interactions, requirements
• Discussions, especially Identifiers and Identity
• Student projects
Architecture
• Process (Meta-architecture > Conceptual A. > Logical A.)
• Use Cases
• Functional requirements
• Interfaces and interactions
• Prototyping
• Core pieces
• Fluff
• Iterative process (somewhat)
• Identify and resolve issues at all stages
• Limited resources - so important to get design right
Use Cases
• Identified major categories and obvious use cases early
• Subsequently expanded to 34 or so
• Diagrams developed to illustrate interactions for each use case
• Capture desired system functional requirements
• APIs identified, getting more stable
[Figure: API categories: Network, Preservation, Federated Identity, Object Management, Discovery and Use. APIs shown: Heartbeat, GUID, Replication, Identity Provider, Authentication, Create, Read, Query, Authorization, Data use policy, Update, Delete, Logging, Notification, Health, Capacity, Validation, Migration, Workflow, Ontology, Provenance]
Use Case Issues
• UC2 "Get list of GUIDs from metadata search"
• Can queries be done at MN with equivalent results?
• Where is result filtering based on access privileges performed?
• Authentication issue - if a search spans many nodes, where is identity resolved?
• UC3 "Registration of a new member node"
• Should new nodes be registered with specified trust levels?
Use Case Issues (2)
• UC4,5 "Create/Update/Delete metadata record in Member Node."
• What is the policy on archival copies of data and metadata? (Can data packages be deleted? Published packages modified?)
• UC12 "User Authentication - Person via client software authenticates against Identity Provider to establish session token."
• Where is identity stored? MN? CN? Combination of all?
Use Case Issues (3)
• UC24 "Transactions - CNs and MNs should support transaction sets where operations all complete successfully or get rolled back (e.g., upload both data and metadata records)."
• Do transactions span multiple MNs, CNs?
• UC27 "CN should support forward migration of metadata documents from one version to another within a standard and to other standards."
• 20+ metadata standards
• How to handle lossy conversions?
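The all-or-nothing behaviour described in UC24 can be illustrated with a minimal sketch. Everything here is hypothetical: `InMemoryNode` stands in for a Member Node object store, and `upload_transaction` is an illustrative name, not a DataONE API. It assumes a single node and new identifiers, so it says nothing about whether transactions should span multiple MNs or CNs.

```python
class InMemoryNode:
    """Hypothetical stand-in for a Member Node object store."""

    def __init__(self, fail_on=None):
        self.objects = {}
        self.fail_on = fail_on  # identifier that triggers a simulated failure

    def put(self, guid, content):
        if guid == self.fail_on:
            raise IOError(f"simulated failure storing {guid}")
        self.objects[guid] = content

    def remove(self, guid):
        self.objects.pop(guid, None)


def upload_transaction(node, items):
    """Store every (guid, content) pair, or none of them.

    Assumes every guid is new to the node; rolling back an update of an
    existing object would need the previous content to be saved first.
    """
    stored = []
    try:
        for guid, content in items:
            node.put(guid, content)
            stored.append(guid)
    except Exception:
        # Undo in reverse order so the node returns to its prior state.
        for guid in reversed(stored):
            node.remove(guid)
        return False
    return True
```

If the metadata upload fails after the data upload succeeded, the data object is removed again, so the node never holds a data record without its metadata.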
Use Case Issues (4)
• UC28 "Relationships/Versioning - Derived products should be linked to source objects so that notifications can be made to users of derived products when source products change."
• Who asserts these relationships? How are relationships managed?
• UC31 "Manage Access Policies - Client can specify access restrictions for their data and metadata objects. Also supports release time embargoes."
• Group management has an important, perhaps unusual temporal component.
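One way to picture the access restrictions and release-time embargoes of UC31 is a small policy check. This is only a sketch under stated assumptions: the `AccessPolicy` fields and the `can_read` function are hypothetical, and the rule that the owner can always read during an embargo is an assumption rather than an agreed DataONE policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical policy record; field names are illustrative only."""
    owner: str
    readers: Set[str] = field(default_factory=set)
    public: bool = False
    embargo_until: Optional[datetime] = None  # None means no embargo


def can_read(policy, user, now=None):
    """Return True if `user` may read the object at time `now`."""
    now = now or datetime.now(timezone.utc)
    # Assumption: the owner is never locked out by an embargo.
    if user == policy.owner:
        return True
    # Until the release time passes, nobody else may read.
    if policy.embargo_until is not None and now < policy.embargo_until:
        return False
    return policy.public or user in policy.readers
```

The temporal component shows up in the `now` argument: the same user and policy can give different answers before and after the release time.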
Coordinating Node Requirements
• CNs provide a central role in infrastructure ∴ critical to identify functional and non-functional requirements early
• Non-exhaustive list of 21 requirements (so far)
• e.g.:
• “Coordinating Node services should be designed to be independently scalable.”
• “Data packages are not discoverable through any public interface until all Coordinating Nodes have confirmed that they have a copy of the corresponding metadata document.”
• “Metadata searches should return in a maximum of “xxx” seconds.”
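The second example requirement (no public discovery until every Coordinating Node confirms a metadata copy) could be tracked roughly as follows. The class and method names are hypothetical, and the sketch assumes the set of Coordinating Nodes is fixed and known in advance.

```python
class MetadataReplicationTracker:
    """Hypothetical tracker of which CNs hold a copy of each metadata document."""

    def __init__(self, coordinating_nodes):
        self.coordinating_nodes = set(coordinating_nodes)
        if not self.coordinating_nodes:
            raise ValueError("at least one Coordinating Node is required")
        self.confirmed = {}  # guid -> set of CN names that confirmed a copy

    def confirm(self, guid, cn_name):
        """Record that `cn_name` has confirmed its copy of `guid`."""
        if cn_name not in self.coordinating_nodes:
            raise ValueError(f"unknown Coordinating Node: {cn_name}")
        self.confirmed.setdefault(guid, set()).add(cn_name)

    def is_discoverable(self, guid):
        """A package becomes discoverable only once every CN has confirmed."""
        return self.confirmed.get(guid, set()) == self.coordinating_nodes
```

A public search interface would consult `is_discoverable` before returning a package, so partially replicated metadata stays hidden.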
General conclusions
The member nodes come with a diverse set of technologies and practices. The coordinating nodes will need to be very permissive while providing quality services.
History/versioning:
• Keep all versions of metadata, so we can see where it came from (and metadata doesn't take much storage)
• The original data package should always be stored. Transformed versions may be needed for some operations of the coordinating nodes.
• It may be too much of a burden to store all versions of a data file.
Identity, Authentication, Authorization
• MN & CN security services necessary to
• preserve and verify integrity of data packages (in D1)
• prevent malicious intent or inappropriate access
• Six identity / security models in industry:
• Centralized (LDAP)
• Distributed directories (LDAP + referrals)
• Distributed management and replication (LDAP + replication)
• Grid Security Infrastructure proxy certificates
• OpenID
• Shibboleth + InCommon
Identities
Types of users:
• non-authenticated user
• registered user (at member node)
• registered user (DataONE central)
• group member
• site manager (for harvests, system operations, etc.)
• change request approval workflow
• owner of intellectual property rights
Privileges:
• access/modify both data and metadata
• Member Node Write
• create/execute system functions
• access logged information
Metadata Standards
• 20 or so relevant standards
DC, DwC, EML, CSDGM, GCMD-DIF, ISO 19137:2007, NeXML, WaterML, Genbank-FFF, ISO 19115, GML, CDF, DDI, GEML, ESML, CSR, ESG, ECHO, ...
• Conversion between standards is a lossy process
• Issues of compatibility in metadata storage across MNs
• Original metadata will be stored unchanged
• Need to define metadata standard that will be used to support search and discovery operations (CN)
Search Terms
Identifiers
• Fundamental component of entire architecture
• Many schemes (handle, LSID, PURL, ...), each with advantages and faults
• Not practical for DataONE to dictate single identifier scheme across all Member Nodes
• Feasible to require that identifiers are unique across all participating MNs
• However, not feasible to assume that all MNs will support all identifier schemes
• Key question: Must an identifier always resolve to the same sequence of bits? Or should it be more abstract?
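The uniqueness requirement above can be sketched without committing to any identifier scheme by treating identifiers as opaque strings. The `IdentifierRegistry` class, the node names, and the sample identifiers used with it are all hypothetical, and the sketch leaves open whether an identifier must resolve to the same sequence of bits.

```python
class IdentifierRegistry:
    """Hypothetical registry enforcing identifier uniqueness across Member Nodes."""

    def __init__(self):
        self._locations = {}  # identifier -> list of member node names

    def register(self, identifier, member_node):
        """Claim a new identifier; reject it if any node already uses it."""
        if identifier in self._locations:
            raise ValueError(f"identifier already in use: {identifier}")
        self._locations[identifier] = [member_node]

    def add_replica(self, identifier, member_node):
        """Record an additional node that holds a copy of the object."""
        if identifier not in self._locations:
            raise KeyError(f"unknown identifier: {identifier}")
        self._locations[identifier].append(member_node)

    def resolve(self, identifier):
        """Return every node known to hold the object, original first."""
        return list(self._locations[identifier])
```

Because the registry never parses the identifier, handles, LSIDs, and PURLs can coexist as long as no two objects share the same string.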
Prototypes
By November 2009 meeting (hmm...):
• Member Node contributes metadata to Coordinating Node using GUID
• CN initiates replication of data object from MN to MN
• Logging for instrumentation and usage
• Update data object (revision) by Member Node
Other targets, in order of importance:
• Replication of metadata and system information between CNs
• Failover and load balancing between CNs
• Formalize all service API specs using a language-agnostic IDL
• Comparison and evaluation of existing systems/standards/protocols used by prototype implementations
• Authentication and authorization using LDAP (initial impl.)
• Search portal user interface using Coordinating Node metadata content
• Heartbeat/state of health services
• Registry services using, perhaps, a simple list as an initial method
• Stress and load testing
Current activities
• Wrapping up this year’s student internships.
• Addressing the general questions arising out of the use case diagrams (some of these questions will be discussed at the coordination meeting)
• Developing a report on identifier usage.
• Creating APIs to be used in prototypes.
Hurdles
• Resources & Contributors
• Identity, authentication, authorization
• Identifiers
• Rules for data handling and archive (what is data?)
• Metadata extraction
• CN replication
Feedback from CEWG
What is the vision for access management to DataONE, and how much of that will be left up to member nodes?
• Answer: Data providers must "establish trust" to publish/modify content.
• What does "establish trust" entail? Is there a technical component?
• Who are “data providers”? The member nodes or the end users?
Open Questions
• What policies should we have for managing DataONE documents?
• What properties should we enforce regarding identifiers?
• What are the minimum requirements for a member node to join the DataONE community? Or, how accommodating should we be?
• Can we identify some member nodes that will implement all best practices and serve as models for the other member nodes?
• How much data should we expect to handle? It is unclear what the uptake curve will be, but this has major implications for our architectural planning.
• Do we want/need a registry of name spaces for identifiers?
• Is it reasonable to store replicas using the ID scheme of the secondary member node, as long as the coordinating nodes are capable of resolving the original identifier to the correct location?
• What types of access control should be allowed?
• What time constraints are we under?
Open Questions (2)
• Can the CEWG produce some science-oriented use cases that augment our current technical cases?
• Will member nodes be willing to use central DataONE services and/or create adapters that allow their services to communicate with DataONE?
• Are there technologies that are widely used across the member node community? If so, these would be promising targets, as we could create a small number of adapters that could be used for a large number of member nodes.
• What are the high-value member nodes, for which we must provide custom adapters?