Ontologies: What Librarians Need to Know
Barry Smith, Department of Philosophy, University at Buffalo
Presented at the conference on Research Data: Management, Access and Control, University at Buffalo, November 14, 2011
http://libweb1.lib.buffalo.edu/blog/scholarly/?p=85


2. (no text)

3. (no text)

4. a short movement of one lower leg crossing the other leg with the foot pointing outward

5. same movement, different terms
  • part of a mannequin's step on the catwalk
  • an epileptic jerk
  • the kicking of a ball by a soccer player
  • a signal ("Get out!") issued in heated conversation
  • a half cut in Irish Sean-nós dancing

6. (no text)

7. Some questions
  • How to find data?
  • How to understand data when you find it?
  • How to use data when you find it, for example in hypothesis-checking and reasoning?
  • How to integrate with other data?
  • How to label the data you are collecting?
  • How to build a set of labels for a new domain that will integrate well with labels used in neighboring domains?

8. Network effects of the Web
  • You build a site. Others discover the site and they link to it
  • The more they link to it, the more important and well known the page becomes (this is what Google exploits)
  • Your page becomes important, and others begin to rely on it
  • The same network effect works on the raw data
  • Many people link to the data, use it
  • Many more (and diverse) applications will be created than the authors would even dream of
  • New secondary uses are discovered
  Ivan Herman

9. The problem: doing it this way, we end up with data in many, many silos because links are formed in overlapping and redundant ways
  Photo credit: nepatterson, Flickr

10. To avoid silos:
  1. The raw data must be available in a standard way on the Web.
  2. There must be links among the datasets to create a web of data
  Use ontologies to capture common meanings with definitions that are understandable to both humans and computers
  The roots of Semantic Technology

11. Ontologies as controlled vocabularies for the tagging of data
  • Hardware changes rapidly
  • Organizations rapidly forming and disbanding collaborations
  • Data is exploding
  • Meanings of common words change slowly
  • Use web architecture to annotate exploding data stores using ontologies exploiting these common meanings

12.
Mandates for Data Reuse
  • Organizations such as the NIH now require use of common standards in a way that will ensure that the results obtained through funded research are more easily accessible to external groups
  • http://grants.nih.gov/grants/policy/data_sharing/
  • http://www.nsf.gov/bfa/dias/policy/dmp.jsp
  • Data Ontologies for Biomedical Research (R01): http://grants.nih.gov/grants/guide/pa-files/par-07-425.html

13. NCBO: National Center for Biomedical Ontology
  • Stanford Biomedical Research Informatics
  • Mayo Clinic Department of Bioinformatics
  • University at Buffalo, Department of Philosophy
  http://bioportal.bioontology.org/

14. NCBO Bioportal

15. Goals of Semantic Technology
  • To support data reuse
  • To enable data registries
  • Metadata management
  • Support for Natural Language Understanding
  • Semantic Wikis
  via ontologies formulated for example in the Web Ontology Language (OWL)

16. Where we stand today
  • HTML demonstrated the power of the Web to allow sharing of information
  • increasing availability of semantically enhanced data
  • increasing power of semantic software to allow automatic reasoning with online information
  • increasing use of OWL in attempts to break down silos, and create useful integration of on-line data and information

17. Ontology success stories, and some reasons for failure
  A fragment of the Linked Open Data in the biomedical domain

18. as of September 2010

19. The result: the more Semantic Technology is successful, the more it fails to achieve its goals
  • As we break down silos via controlled vocabularies for the tagging of data, the very success of the approach leads to the creation of ever new controlled vocabularies (semantic silos), because multiple ontologies are being created in ad hoc ways
  • The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization
  • Creates data cemeteries

20. (no text)

21. (no text)

22.
Reasons for this effect
  • Shrink-wrapped software mentality: you will not get paid for reusing old and good ontologies ("Let a million lite ontologies bloom")
  • Belief that there are no "good" ontologies (just arbitrary choices of terms and relations)
  • No licensing regime (database inspection tax)
  • Information technology (hardware) changes constantly, not worth the effort of getting things right
  • "We have done it this way for 30 years, we are not going to change now"

23. Ontology success stories, and some reasons for failure
  Can we solve the problem by means of mappings?

24. Unified Medical Language System of the National Library of Medicine
  • let a million ontologies bloom, each one close to the terminological habits of its authors
  • in concordance with the "not invented here" syndrome
  • then map these ontologies, and use these mappings to integrate your different pots of data

25. What you get with mappings
  • all phenotypes (excess hair loss, duck feet)
  • all organisms
  • allose (a form of sugar)
  • Acute Lymphoblastic Leukemia (A.L.L.)

26. Mappings are hard
  • They are fragile, and expensive to maintain
  • Need a new authority to maintain, yielding new risk of forking
  • The goal should be to minimize the need for mappings
  • Invest resources in disjoint ontology modules which work well together

27. Why should you care?
  • you need to create systems for data mining and text processing which will yield useful output for library users
  • if the codes you use are constantly in need of ad hoc repair, huge resources will be wasted and manual effort will be needed on each occasion of use
  • DoD alone spends $6 billion per annum on this problem

28. And there are other problems
  • Weak expressivity of OWL
  • Poor quality coding, poor quality ontologies, poor quality ontology management
  • Strategy often serves only retrieval, not reasoning
  • Confusion as to the meaning of "linked"

29. Uncontrolled proliferation of links

30. (no text)

31. How to do it right?
  • how to create an incremental, evolutionary process, where what is good survives, and what is bad fails
  • create a scenario in which people will find it profitable to reuse ontologies, terminologies and coding systems which have been tried and tested
  • silo effects will be avoided and results of investment in Semantic Technology will cumulate effectively

32. [Chart: "Ontology in PubMed", by year, 2000 to 2010; vertical axis 0 to 1200]

33. Uses of ontology in PubMed abstracts

34. By far the most successful: GO (Gene Ontology)

35. GO provides a controlled vocabulary of terms for use in annotating (describing, tagging) data
  • multi-species, multi-disciplinary, open source
  • contributing to the cumulativity of scientific results obtained by distinct research communities
  • compare use of kilograms, meters, seconds in formulating experimental results
  • natural language and logical definitions for all terms to support consistent human application and computational exploitation

36. You're interested in which genes control heart muscle development
  17,536 results

37. [Figure: clustered microarray heat map, gene list "all genes (14010)". Cluster labels: Puparial adhesion, Molting cycle, hemocyanin, Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism]
  Microarray data shows changed expression of thousands of genes.
  How will you spot the patterns?

38. You're interested in which of your hospital's patient data is relevant to understanding how genes control heart muscle development

39.
  • Lab / pathology data
  • EHR data
  • Clinical trial data
  • Family history data
  • Medical imaging
  • Microarray data
  • Model organism data
  • Flow cytometry
  • Mass spec
  • Genotype / SNP data
  How will you spot the patterns?
  How will you find the data you need?

40. One strategy for bringing order into this huge conglomeration of data is through the use of Common Data Elements
  • Discipline-specific (cancer, NIAID, ...)
  • Do not solve the problems of balkanization (data silos)
  • Do not evolve gracefully as knowledge advances
  • Support data cumulation, but do not readily support data integration and computation

41. How does the Gene Ontology work?
  with thanks to Jane Lomax, Gene Ontology Consortium

42. GO provides a controlled system of representations for use in annotating data
  • multi-species, multi-disciplinary, open source
  • contributing to the cumulativity of scientific results obtained by distinct research communities
  • compare use of kilograms, meters, seconds in formulating experimental results

43. (no text)

44. Definitions

45. Gene products involved in cardiac muscle development in humans

46. GO provides answers to three types of questions
  for each gene product
  • in what parts of the cell has it been identified?
  • exercising what types of molecular functions?
  • with what types of biological processes?
  when is a particular gene product involved
  • in the course of normal development?
  • in the process leading to abnormality?
  with what functions is the gene product associated in other biological processes?

47. Some pain-related terms in GO
  GO:0048265 response to pain
  GO:0019233 sensory perception of pain
  GO:0048266 behavioral response to pain
  GO:0019234 sensory perception of fast pain
  GO:0019235 sensory perception of slow pain
  GO:0051930 regulation of sensory perception of pain
  GO:0050967 detection of electrical stimulus during sensory perception of pain
  GO:0050968 detection of chemical stimulus involved in sensory perception of pain
  GO:0050966 detection of mechanical stimulus involved in sensory perception of pain

48.
Hierarchical view representing relations between represented types

49. A new kind of biological research
  • based on analysis and comparison of the massive quantities of annotations linking ontology terms to raw data, including genomic data, clinical data, public health data
  • What 10 years ago took multiple groups of researchers months of data comparison effort can now be performed in milliseconds

50. One standard method
  Sjöblom T, et al. analyzed 13,023 genes in 11 breast and 11 colorectal cancers. Using functional information captured by GO for given gene product types, they identified 189 as being mutated at significant frequency and thus as providing targets for diagnostic and therapeutic intervention.
  Science. 2006 Oct 13;314(5797):268-74.

51. What is the key to GO's success?
  • GO is developed and maintained by experts who adhere to ontology best practices
  • over 11 million annotations relating gene products described in the UniProt, Ensembl and other databases to terms in the GO
  • experimental results reported in 52,000 scientific journal articles manually annotated by expert biologists using GO
  • ontology building and ontology QA are two sides of the same coin

52. If controlled vocabularies are to serve to remove silos
  • they have to be updated by respected experts who are trained in best practices of ontology maintenance
  • they have to be respected by many owners of data as resources that ensure accurate description of their data
  • GO maintained not by computer scientists but by biologists
  • they have to be willingly used in annotations by many owners of data

53. The new profession of biocurator

54. (no text)

55. How to do it right?
  • how to create an incremental, evolutionary process, where what is good survives, and what is bad fails
  • where the number of ontologies needing to be linked is small
  • where links are stable
  • create a scenario in which people will find it profitable to reuse ontologies, terminologies and coding systems which have been tried and tested
  • and in which ontologies will evolve on the basis of feedback from users

56. Reasons why GO has been successful
  • It is a system for prospective standardization built with a coherent top level but with content contributed and monitored by domain specialists
  • Based on community consensus
  • Clear versioning principles ensure backwards compatibility; prior annotations do not lose their value
  • Initially low-tech to encourage adoption by new communities of users
  • Tracker for user input with rapid turnaround and help desk

57. But GO is limited in its scope
  it covers only generic biological entities of three sorts:
  • cellular components
  • molecular functions
  • biological processes
  no diseases, symptoms, disease biomarkers, protein interactions, experimental processes

58. Extending the GO methodology to other domains of biology and medicine

59. OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow)

  GRANULARITY / RELATION TO TIME | CONTINUANT: INDEPENDENT | CONTINUANT: DEPENDENT | OCCURRENT
  ORGAN AND ORGANISM | Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO); Phenotypic Quality (PaTO) | Biological Process (GO)
  CELL AND CELLULAR COMPONENT | Cell (CL); Cellular Component (FMA, GO) | Cellular Function (GO) |
  MOLECULE | Molecule (ChEBI, SO, RnaO, PrO) | Molecular Function (GO) | Molecular Process (GO)

60. The strategy of orthogonal modules
  (same table as slide 59)

61.
  Ontology | Scope | URL | Custodians
  Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
  Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
  Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
  Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
  Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
  Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
  Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
  Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
  Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
  RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
  Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck

62. OBO Foundry
  • recognized by NIH as framework to address mandates for re-usability of data collected through Federally funded research
  • see NIH PAR-07-425: Data Ontologies for Biomedical Research (R01)

63. OBO Foundry provides
  • tested guidelines enabling new groups to develop the ontologies they need in ways which counteract forking and dispersion of effort
  • an incremental bottom-up approach to evidence-based terminology practices in medicine that is rooted in basic biology
  • automatic web-based linkage between biological knowledge resources (massive integration of databases across species and biological system)

64.
The Open Biomedical Ontologies (OBO) Foundry

  GRANULARITY / RELATION TO TIME | CONTINUANT: INDEPENDENT | CONTINUANT: DEPENDENT | OCCURRENT
  ORGAN AND ORGANISM | Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO); Phenotypic Quality (PaTO) | Biological Process (GO)
  CELL AND CELLULAR COMPONENT | Cell (CL); Cellular Component (FMA, GO) | Cellular Function (GO) |
  MOLECULE | Molecule (ChEBI, SO, RnaO, PrO) | Molecular Function (GO) | Molecular Process (GO)

65. OBO Foundry Modular Organization: Governance
  • Basic Formal Ontology (BFO)
  • Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Ontology of General Medical Science (OGMS)
  • Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)

66. OBO Foundry Modular Organization: Training
  (same ontology diagram as slide 65)

67. Extension Strategy + Modular Organization
  • top level: Basic Formal Ontology (BFO)
  • mid-level: Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Spatial Ontology (BSPO)
  • domain level: Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)

68.
How to build an ontology
  1. due diligence: identify the existing ontology content that is most relevant to your needs
  2. work with domain experts to identify parts of the domain not covered by this ontology
  3. find ~50 most commonly used terms corresponding to types of entities in this domain
  4. arrange these terms into a taxonomical hierarchy using the strategy of downward population
  5. work with domain experts to populate the lower levels of the hierarchy

69. Example: The Cell Ontology

70. Ontology and Library Science
  • Nanopublishing
  • FaBiO
  • Semantically enhanced publishing
  • eagle-i and VIVO resource registry initiatives

71. Nanopublishing
  • Definition: An online publishing model that uses a scaled-down, inexpensive operation to reach a targeted audience, especially by using blogging techniques
  • Applied to ontologies: gives credit to authors of fragments of ontologies, including single ontology terms and definitions
  • Applied to annotations: gives credit to curators for use of ontology terms in literature tagging

72. Functional Requirements for Bibliographic Records (FRBR)
  Group 1 entities: user interests in intellectual or artistic products
  • Work: a distinct intellectual or artistic creation
  • Expression: its intellectual or artistic realization
  • Manifestation: the physical embodiment of an expression of a work
  • Item: a single exemplar of a manifestation
  Group 2 entities: are responsible for content, production, ..., of group 1 entities
  • Person: an individual
  • Corporate body: an organization or group of individuals or organizations
  Group 3 entities: serve as the subjects of works
  • Concept: an abstract notion or idea
  • Object: a material thing
  • Event: an action or occurrence
  • Place: a location

73. FaBiO
  FRBR (Functional Requirements for Bibliographic Records) model from to OWL format.
  FaBiO (FRBR-aligned Bibliographic Ontology).
  Paolo Ciccarese
  http://www.paolociccarese.info/
  http://www.hcklab.org/

74. http://code.google.com/p/information-artifact-ontology/

75. Semantically enhanced publishing

76.
With highlighting on

77. (no text)

78. eagle-i and VIVO resource registry initiatives
  • eagle-i: Ontology for indexing and querying biomedical research resources
    http://code.google.com/p/eagle-i/
  • VIVO: An interdisciplinary national network enabling collaboration and discovery among scientists across all disciplines
    http://vivoweb.org/
  Shared ontology resources in OBO Foundry

79.
  BFO: Basic Formal Ontology
  Biometrics: Biometrics Ontology
  CL: Cell Ontology
  CUMBO: Common Upper Mammalian Brain Ontology
  CTO: Counterterrorism Ontology
  ENVO: Environment Ontology
  FMA: Foundational Model of Anatomy
  GO: Gene Ontology
  IAO: Information Artifact Ontology
  IDO: Infectious Disease Ontology
  MFO: Mental Functioning Ontology
  MFO-MD: Mental Disease Ontology
  MFO-EM: Emotion Ontology
  ND: Neurological Disease Ontology
  OBI: Ontology for Biomedical Investigations
  OGMS: Ontology for General Medical Science
  PO: Plant Ontology
  PRO: Protein Ontology
  RNAO: RNA Ontology
  VSO: Vital Sign Ontology

80. New role for librarians as stewards of local digital data repositories

81. Librarians can take over the world
  shared VIVO and eagle-i ontologies inventorying
  • laboratories
  • services
  • instruments
  • reagents
  • organisms
  • images and videos
  • persons
  • protocols
  • patents
  • human studies
  • tissue samples
  • DNA samples
  • sample repositories
  • training opportunities
  • databases
  • papers
  • journals
  • genomes (plants, cars ...)
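
As a closing illustration of what tagging repository items with a controlled vocabulary might look like in practice, here is a minimal Python sketch. It is not taken from the talk. The term IDs and labels are the ones listed on the pain-related GO terms slide, but the `Record` class, the helper functions, the sample records, and in particular the `IS_A` parent links are assumptions made up for this example; the real Gene Ontology should be consulted for authoritative relations.

```python
from dataclasses import dataclass, field

# Term IDs and labels as listed on the "pain-related terms in GO" slide.
GO_TERMS = {
    "GO:0048265": "response to pain",
    "GO:0048266": "behavioral response to pain",
    "GO:0019233": "sensory perception of pain",
    "GO:0019234": "sensory perception of fast pain",
    "GO:0019235": "sensory perception of slow pain",
}

# ASSUMPTION: illustrative child -> parent (is_a) links, not copied from GO.
IS_A = {
    "GO:0048266": "GO:0048265",
    "GO:0019234": "GO:0019233",
    "GO:0019235": "GO:0019233",
}


@dataclass
class Record:
    """A hypothetical item in a local repository (dataset, video, paper, ...)."""
    title: str
    kind: str
    term_ids: set = field(default_factory=set)


def annotate(record, term_id):
    """Tag a record, rejecting IDs that are not in the controlled vocabulary."""
    if term_id not in GO_TERMS:
        raise KeyError(f"unknown term: {term_id}")
    record.term_ids.add(term_id)


def with_ancestors(term_id):
    """Return the term plus every term reachable by following is_a links."""
    found = {term_id}
    while term_id in IS_A:
        term_id = IS_A[term_id]
        found.add(term_id)
    return found


def find(records, term_id):
    """Return records tagged with the term itself or with a more specific term."""
    return [r for r in records
            if any(term_id in with_ancestors(t) for t in r.term_ids)]


if __name__ == "__main__":
    dataset = Record("Withdrawal-reflex recordings", "databases")
    video = Record("Fast pain pathway imaging", "images and videos")
    annotate(dataset, "GO:0048266")
    annotate(video, "GO:0019234")
    # A query on the broader term also retrieves items tagged more specifically.
    for hit in find([dataset, video], "GO:0019233"):
        print(hit.title)
```

Two design choices echo points from the slides: `annotate` refuses labels outside the shared vocabulary, so ad hoc tags cannot creep in, and `find` walks the hierarchy, so a search on a general term picks up items tagged with its more specific descendants, which free-text keywords would not do reliably.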