Shortstop to First to Third: Collaborating to Access Digital Collections for Biomedical Natural
Language Processing (BNLP) Research
Lynne M. Fox & Leslie Williams, Health Sciences Library
Dr. Larry Hunter & Christophe Roeder, Center for Computational Pharmacology, School of Medicine
University of Colorado Anschutz Medical Campus
“Forget goals. Value the process.” Jim Bouton (Author of Ball Four)
Presentation Overview
• The Teams
• The Goal
• The Game Plans
• The Players
• The Pitch
• Instant Replay
The Teams
• University of Colorado Anschutz Medical Campus
– Center for Computational Pharmacology
– Health Sciences Library
• Content Producers & Holders
– PubMed Central
– Major STM Vendor
Center for Computational Pharmacology
Biomedical Natural Language Processing (BNLP) uses techniques from computer science, artificial intelligence (including machine learning), biology and linguistics to extract meaningful information from natural language (in this case, biomedical language).
http://hanalyzer.sourceforge.net/
The Goal
To obtain a digital collection of full-text biomedical journal articles, in XML format, for the Center’s BNLP research
NIH
Four Game Plans
– Secure open access content from BMC & PLoS
– Download content from PubMed Central
– Leverage researcher networks to obtain content from major STM vendors
– License content from Major STM vendor
The Players
From University of Colorado:
• Dr. Larry Hunter, Principal Investigator
• Chris Roeder, Programmer
• Helen Johnson, Computational Linguist
• Leslie Williams, Acquisitions Librarian
• Lynne Fox, Education Librarian and Center Librarian
• Annalissa Philbin, University Counsel
From Major STM journal vendor:
• Senior Vice-President of Academic Affairs
• Vice-President for Science & Technology Strategy
• Strategy Analyst
• Senior Account Manager
• Associate Account Manager
• Corporate Counsel
Price Structure
• New product, new model
– Annual subscription
– Per Article Price
– Volume Discount
– Subscribed Content Included
– Library’s Continued Subscription to Vendor’s Product Required
• Grant budgets
– Span multiple fiscal years
– Payment process requires additional levels of approvals
Key Elements of License
– Definitions
• Users
– Subscription
• How Dataset Can Be Used
– Obligations
• Agreement Contingent Upon
– Use of Names
• How Each Party May or May Not Use the Other’s Name
– Other
• Financial, Term, Etc.
Dataset & Delivery
• Test, Test, Test
• Refine Definitions
– What is an article?
– In what format will the article be delivered?
• Refine Delivery and Tracking Mechanisms
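The testing step above could be supported by a small script that inspects each delivered file. The following is a minimal sketch only, not the Center's actual tooling: the sample document, the JATS-like element names (article, article-title, body, p) and the function name check_delivery are all assumptions made for illustration, and a real vendor delivery may use a different schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample delivery. The JATS-like element names are an assumption,
# not the vendor's confirmed format.
SAMPLE = """<article article-type="research-article">
  <front><article-meta><title-group>
    <article-title>Example title</article-title>
  </title-group></article-meta></front>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>"""


def check_delivery(xml_text):
    """Report whether one delivered file parses and looks like a full-text article."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Not well-formed XML: record the parser message for the vendor.
        return {"well_formed": False, "error": str(exc)}

    title = root.find(".//article-title")
    body = root.find("body")
    return {
        "well_formed": True,
        # Helps answer "what is an article?" (research article, editorial, ...).
        "article_type": root.get("article-type"),
        "has_title": title is not None and bool((title.text or "").strip()),
        # A missing body suggests a metadata-only or abstract-only record.
        "has_full_text": body is not None,
        "paragraphs": len(body.findall(".//p")) if body is not None else 0,
    }


if __name__ == "__main__":
    print(check_delivery(SAMPLE))
```

Running a check like this over a test batch gives both sides concrete counts (files that fail to parse, records without full text, article types received) to discuss when refining the definitions.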
Instant Replay: What did we learn so we can improve the process for next time?
• Communication is key
• Get the right people involved
• Understand the rights and limitations of the library’s existing licenses
• Research domain finance is different
• Be clear about rights to collections and discoveries
• Be clear about obligations on both sides
There’s always next year . . .
An expected grant wasn’t received, so signing the license is on hold until additional funds can be found.
If we continue to secure content, we’d like to work with vendors during renewal negotiations for regular content licenses so that additional XML access rights are included in the license agreement. This would be more efficient and would expand the digital collections content available for research.
References
Hoekman, Anne. “Journal Publishing Technologies: XML.” Accessed May 14, 2012. URL: https://www.msu.edu/~hoekmana/WRA%20420/ISMTE%20article.pdf
Howard J. “Major STEM journal vendor Experiments With Allowing 'Text Mining' of Its Journals.” The Chronicle of Higher Education. May 6, 2012. Accessed May 8, 2012. URL: http://chronicle.com/article/Hot-Type-Major STEM journal vendor-Experiments/131789/?sid=wc&utm_source=wc&utm_medium=en
“Natural language processing.” Wikipedia. Accessed April 27, 2012. URL: http://en.wikipedia.org/wiki/Natural_language_processing#cite_note-1
SM Leach, H Tipney, W Feng, WA Baumgartner Jr, P Kasliwal, RP Schuyler, T Williams, RA Spritz, and L Hunter. “Biomedical Discovery Acceleration, with Applications to Craniofacial Development.” PLoS Comput Biol 2009, 5(3):e1000215. doi:10.1371/journal.pcbi.1000215. PMID: 19325874
University of Colorado, School of Medicine, Center for Computational Pharmacology. “Hanalyzer: A 3R system for genome-scale discovery: Let knowledge drive your data exploration.” Accessed April 27, 2012. URL: http://hanalyzer.sourceforge.net/