Open Source Technologies at the
National Agricultural Library
Ursula Pieper
IT Specialist – Web Team Lead
National Agricultural Library
Agricultural Research Service
United States Department of Agriculture
Feb 17, 2016
Ursula [email protected]
301-504-7379
Acknowledgements:
Knowledge Services Division (Susan McCarthy)
• Monica Poelchau and Chris Childers (i5K Workspace)
• Peter Arbuckle and Ezra Kahn (LCA Commons)
• Jeffrey Campbell (LTAR)
• Cynthia Parr (Ag Data Commons)
Information Services Division (Vernon Chapman)
• Chuck Schoppet, NAL (Fedora Commons/Islandora)
Why Open Source?
• Benefit from community contributions and support
• Security managed by community
• Cost – Vendor lock-in
• Can get customized locally
• Interoperability
• Re-use of skills
Open Source based Projects (Selection)
Drupal
Python
Grails
Java
Solr
Django
• Ag Data Commons
– Scientific data catalog/repository
• LCA Commons
– Life Cycle Assessment repo and tools
• PubAg
– Catalog of agricultural scientific literature
• I5K@NAL Workspace
– Repository and workspace for Arthropod Genomes
• Long Term Agro-ecosystem Research
– Historical and future agricultural research data
• National Nutrient Database
• Dr. Duke's Phytochemical and Ethnobotanical Databases
Open Source based Projects (Selection)
Drupal
Grails
Java Based
• Ag Data Commons: http://data.nal.usda.gov
• i5K@NAL Workspace: http://i5k.nal.usda.gov
• LCA Commons: http://lcacommons.gov
• PubAg – Data Management System: http://pubag.nal.usda.gov
• LCA Commons: http://lcacommons.gov
• National Nutrient Database: http://ndb.nal.usda.gov/ndb/
• Phytochem Database (Duke): http://phytochem.nal.usda.gov
• Long-term Agro-ecosystem Research: http://ltar.nal.usda.gov
Ag Data Commons
Requirements
• Public Access to USDA funded research results
• Support scientific research and evidence-based policy
• Re-use / re-analysis
• REE Action Plan: 2012 goals
• Journal submission requirements
Mandates
• America COMPETES Act
• OSTP Memorandum
• M-13-13, Open Data Policy
Summary of Required Capabilities
• Comprehensive catalog of research results
– Support for compliance reporting
– Feeds Data.gov
– Enhanced dataset description for discovery and reuse
• Flexibility to support distributed data repositories
– Some disciplines already have repositories (e.g. GenBank)
• Preservation of valuable data for long-term research
• Supportive infrastructure for small agencies & labs
• Link scholarly literature to its supporting data
• Sustainable business model
Ag Data Commons Pilot: Standard DKAN Features
• Drupal 7 Installation Profile
• Fulfills Project Open Data requirements
– Dataset content type: POD 1.1 metadata schema
– Unlimited number of resources can get uploaded
– data.json and rdf available
• Additional Features
– Social media links
– Some data analysis tools (map, graph through recline library)
– License display
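The data.json feed mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming a catalog in the Project Open Data 1.1 layout (a top-level "dataset" list whose entries carry "title", "identifier" and "distribution"). The sample catalog, its identifiers and its URLs are invented for illustration and are not taken from Ag Data Commons; the function name summarize_catalog is likewise hypothetical.

```python
import json

# Hypothetical, heavily trimmed catalog in the Project Open Data 1.1 layout.
# Titles, identifiers and URLs are made up for this example.
SAMPLE_CATALOG = """
{
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "dataset": [
    {
      "title": "Example soil moisture observations",
      "identifier": "example-0001",
      "accessLevel": "public",
      "keyword": ["soil", "moisture"],
      "distribution": [
        {"downloadURL": "https://example.org/soil.csv", "mediaType": "text/csv"}
      ]
    },
    {
      "title": "Example insect genome assembly",
      "identifier": "example-0002",
      "accessLevel": "public",
      "keyword": ["genome"],
      "distribution": []
    }
  ]
}
"""


def summarize_catalog(raw_json):
    """Return (identifier, title, number of distributions) for each dataset."""
    catalog = json.loads(raw_json)
    summary = []
    for entry in catalog.get("dataset", []):
        resources = entry.get("distribution", [])
        summary.append((entry["identifier"], entry["title"], len(resources)))
    return summary


for identifier, title, n_resources in summarize_catalog(SAMPLE_CATALOG):
    print(f"{identifier}: {title} ({n_resources} resource(s))")
```

A harvester such as Data.gov would fetch the same document over HTTP; the sketch uses an inline string so that it runs without network access.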
Ag Data Commons Pilot: What’s missing from DKAN?
• DKAN’s main use case: Government and organizational documents and datasets
• General improvements
– Large File upload, virus checking, file size display
– Harvest Dashboard – for harvesting external POD datasets or data using other standards
– Solr search
– Versioning
– Data curation workflow
• Scientific data require additional functionality
– DOI assignments to datasets
– Identity management for authors (ORCID, etc.)
– Citation information (Primary citation, Methods citation, Related publications)
– Collection of additional metadata
– Long-term archiving capabilities
– Funding source reference
– Embargo period
– Specialized taxonomies
Ag Data Commons Pilot: Lessons learned
• Keeping codebase compliant with standard DKAN
– All configuration changes need to get committed to code
– Codebase cannot clash with standard DKAN (which requires discipline when under time pressure)
– Significant pain merging NAL customizations with new DKAN releases
– Local programming and systems support is necessary (our model)
• Contributing back to DKAN and Drupal
– Many of NAL’s customizations are adopted (and then maintained) by standard DKAN
– General Drupal functionality:
  • Open data schema mapper
  • NALT Thesaurus
• Taking advantage of customizations by other organizations
– Workflow, Stories, Visualizations
I5k Workspace@NAL
• Provides tools and resources for scientists working on insect genomes.
• Goal:
– to store insect genome sequences
– visualize them,
– enable their curation
– make them accessible to scientists.
• Designed specifically to handle and support genomic data.
• Website: https://i5k.nal.usda.gov
Key open-source software used by the i5k Workspace
1. Main portal/website
– built with Drupal/Tripal
2. Key web application for genome visualization and feature annotation
– JBrowse/Apollo
I5K Workspace @ NAL: 1. Drupal + Tripal
• Chado is a database schema for biological data
• Tripal allows Drupal to access data stored in the Chado database to populate web pages using Drupal functionality.
• Community: small and academic
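As an illustration of the kind of data Tripal reads from Chado, the sketch below models a gene and its transcripts with three drastically simplified tables that borrow Chado's names (feature, feature_relationship, cvterm). It is not Tripal code and assumes several things: real Chado runs on PostgreSQL and has many more columns and tables, SQLite is used here only so the example is self-contained, and the feature names are invented.

```python
import sqlite3

# In-memory stand-in for a Chado database; the real schema is far larger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT,
    type_id INTEGER REFERENCES cvterm(cvterm_id)
);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),  -- the child feature
    object_id INTEGER REFERENCES feature(feature_id),   -- the parent feature
    type_id INTEGER REFERENCES cvterm(cvterm_id)
);
INSERT INTO cvterm VALUES (1, 'gene'), (2, 'mRNA'), (3, 'part_of');
INSERT INTO feature VALUES
    (10, 'GENE0001', 1),
    (11, 'GENE0001-RA', 2),
    (12, 'GENE0001-RB', 2);
INSERT INTO feature_relationship VALUES (11, 10, 3), (12, 10, 3);
""")


def transcripts_of(connection, gene_uniquename):
    """List the features recorded as 'part_of' the given gene."""
    rows = connection.execute(
        """
        SELECT child.uniquename
        FROM feature AS parent
        JOIN feature_relationship AS rel ON rel.object_id = parent.feature_id
        JOIN feature AS child ON child.feature_id = rel.subject_id
        JOIN cvterm AS rel_type ON rel_type.cvterm_id = rel.type_id
        WHERE parent.uniquename = ? AND rel_type.name = 'part_of'
        ORDER BY child.uniquename
        """,
        (gene_uniquename,),
    ).fetchall()
    return [row[0] for row in rows]


print(transcripts_of(conn, "GENE0001"))
```

A gene page with a hierarchical gene/transcript view could be populated from a query of this shape.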
I5K Workspace @ NAL: 2. Apollo
• Apollo is a web application that allows interactive, instantaneous editing of genome features
• It is one of the key features of the i5k Workspace
• Community: small and academic
I5K Workspace @ NAL: Customized Resources
• Registration module for Apollo application
– Completely built in house
– Integrates notifications, account creation, and captcha
• Visualizing custom data types: gene pages
– Hierarchical view to display gene/transcript relationships
• Search website (many thousands of nodes)
– Apache Solr search
I5K Workspace @ NAL: Tripal: Lessons learned
• Customization requires one full-time developer at the NAL
• Because our customizations are forked off the main repository, any updates in the main branch require more updates on our part
• Customizations are too specific to our website to be able to fully contribute back to/integrate with the main project
I5K Workspace @ NAL: Apollo: Customized resources
• Instead of building customized resources, we contributed financially to the salary of the lead developer.
• Improvements were not specific to the NAL’s goals, but were aimed at improving the stability of the application
• Even without a financial contribution, bug reports and feature requests from the entire user community are usually addressed very quickly due to an active development team, and a lead developer solely focused on this project.
I5K Workspace @ NAL: Apollo: Lessons learned
• How you interact with the development community of an OSS project depends on
– 1) the community itself
– 2) the specificity of the customization required
Life Cycle Assessment (LCA) Commons
• LCA Commons is a repository that provides access to data and tools that support life cycle assessment of agricultural products.
• We collect, curate, and provide access to data edited and formatted explicitly for use in LCA
• The LCA Commons is designed specifically to handle and support unit process data for LCA.
• Website: www.lcacommons.gov
LCA Commons: Technology Stack
• Three separate applications accessed through Drupal web content management system.
– Discovery and Editorial Applications
  • Groovy/Grails web implementation of domain specific openLCA data model/modeling tool
– LCA Collection on Ag Data Commons
  • DKAN catalog and datastore
[Diagram: LCA Commons technology stack at lcacommons.gov. Applications: Discovery Application, Editorial Application, LCA Collection on Ag Data Commons. Technology: Drupal with custom user management, Groovy/Grails framework, Solr index, openLCA API, Activiti BPM, DKAN. Database: openLCA MySQL (Discovery), openLCA MySQL (Editorial), DKAN datastore, DKAN catalog.]
LCA Commons: Customized Resources
• openLCA datastore not designed explicitly for data management beyond what is necessary for desktop modeling.
– has required developing custom “work-arounds” for data management
• Activiti BPM has required significant customization for editorial workflow for LCA data
• Will need to develop customized search capabilities that enable search across all three applications through Drupal
LCA Commons: Lessons learned
• Technology selection based on clearly defined functional requirements is critical
– Using openLCA for an application for which it was not exactly designed has required custom development
– AND innovation in the field
  • Spurred openLCA developer to build functionality that more closely meets our needs and pushed the domain forward in terms of data sharing and management
PubAg Data Management System
• PubAg is the National Agricultural Library's search system for agricultural information.
• Content:
– Full-text articles relevant to the agricultural sciences
– Citations to peer-reviewed journal articles.
• Repository (Data Management):
– Fedora Commons/Islandora/Drupal
• Public Interface:
– Apache Solr and Java application layer
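A search interface backed by Apache Solr typically sends requests to a core's select handler. The sketch below builds such a request URL using Solr's standard query parameters (q, fq, rows, wt). The host, the core name "articles" and the field name "subject" are hypothetical and do not describe PubAg's actual configuration; the function build_solr_query is likewise invented for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(base_url, core, text, subject=None, rows=10):
    """Build a Solr select URL; core and field names are illustrative only."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if subject:
        # fq narrows the result set without affecting relevance scoring.
        params.append(("fq", f'subject:"{subject}"'))
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


url = build_solr_query(
    "http://localhost:8983/solr", "articles", "honey bee", subject="Entomology"
)
print(url)

# Decode the query string again to show which parameters were sent.
print(parse_qs(urlsplit(url).query))
```

Building the query string with urlencode keeps spaces and quotes correctly escaped, which matters once user-entered search terms are involved.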
PubAg Data Management System: Lessons learned
• Customization needed to accommodate NAL Quality Assurance and workflow
• Performance tuning is necessary and non-trivial for large repositories
Long-Term Agroecosystem Research Network
• Historical and future agricultural research data https://ltar.nal.usda.gov
• Aims to ensure sustained crop and livestock production and ecosystem services from agroecosystems.
• Aims to forecast and verify the effects of environmental trends, public policies, and emerging technologies.
• 18 sites across the country
• Aim: 30 to 100+ years of data
Long-Term Agroecosystem Research Network: Lessons learned
• The project is still in the initial stages
• Lesson learned so far: we still have a lot to learn