Presented at Web Semantics in Action: Web 3.0 in e-Science (http://www.escience2009.org/)
11:25 – 11:50 Sabina Leonelli: An HPSSB Approach to Gene Ontology
An HPSSB (History, Philosophy and Social Studies of Biology) Approach to Biomedical Ontologies
Sabina Leonelli
ESRC Centre for Genomics in Society
Department of Sociology and Philosophy
University of Exeter
[email protected]
An HPSSB Perspective on the epistemic role of e-Science
Characterisation of experimental science as encompassing a variety of ways of knowing and communicating, beyond what can be formalised
e.g. modelling, experimental practices, tacit familiarity with instruments and materials
This awareness needs to carry over to e-Science: the aim is not to replace laboratory activities but to complement them (note: pointing to new directions is not the same as guiding research; experimentation retains an exploratory quality)
• History of biology: ‘big science’ infrastructure since WWII; history of model organism research in biology, and of relations between biological and medical research
• Philosophy of biology: the role of data, theories, different types of models, instruments and materials in experimental practices; epistemic functions of classification
• Social studies of biology: social organisation of science; forms of and conditions for cooperation and communication; power relations among actors; institutional and economic context
Case Study: The Gene Ontology
• Arguably the most successful bio-ontology to date
• Developed for use by community databases as a standard for the annotation of gene products; its history is steeped in model organism research
• Good tool for data sharing:
– Choice of terms is based on research interests of users
– Dynamic system: can be updated to reflect scientific developments
• Flexibility comes from appropriate curation:
– Manual and labour-intensive (impossible to automate)
– Research interests vary across epistemic cultures:
• How to choose relevant and intelligible labels?
• How and when to update labels?
The Classification Problem
stability of classificatory categories
versus
dynamism and diversity of research practices
Can classification through standard categories enable collaborative research without at the same time stifling its development and pluralism?
GO as a Classification System
Making data travel across different epistemic communities, to facilitate cross-species, integrative research: classification of both biological phenomena and data
• Data are associated with biological phenomena via machine-readable labels
• Users can automatically assess the relevance of data as evidence for claims about those phenomena
• To re-use data towards new discoveries, users need to assess their reliability within their own research context: meta-data enable users to ‘situate’ information through their own expertise and tacit knowledge
= data are de-contextualised for travel and re-contextualised for appropriation by a new context
= access is differential: users can choose parameters for their queries depending on their interests and expertise
= data vary in evidential scope depending on context
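The idea of differential access can be sketched in code: annotation records carry machine-readable labels plus meta-data (such as an evidence code), and each user sets query parameters that match their own standards of reliability. This is a minimal illustrative sketch, not the actual GO database schema; all record contents and field names below are made up for the example.

```python
# Illustrative GO-style annotation records: each links a gene product to
# a term via a machine-readable label and carries provenance meta-data.
# All products, sources and term assignments here are invented examples.
annotations = [
    {"product": "geneA", "term": "GO:0000001", "evidence": "IDA", "source": "LabX"},
    {"product": "geneB", "term": "GO:0000001", "evidence": "IEA", "source": "pipelineY"},
    {"product": "geneC", "term": "GO:0000002", "evidence": "TAS", "source": "paperZ"},
]

def query(records, term=None, accepted_evidence=None):
    """Filter annotations by term and by the evidence codes a user is
    willing to trust -- the parameters that make access differential."""
    return [r for r in records
            if (term is None or r["term"] == term)
            and (accepted_evidence is None or r["evidence"] in accepted_evidence)]

# A cautious experimentalist might exclude purely electronic annotations (IEA):
trusted = query(annotations, term="GO:0000001",
                accepted_evidence={"IDA", "IMP", "TAS"})
```

The same records thus yield different evidential scope for different users: an unrestricted query returns everything, while the restricted one above keeps only the directly assayed annotation.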
EVIDENCE CODES: classification of ‘mined’ data and of data provenance

Experimental evidence
– Inferred from Mutant Phenotype
– Inferred from Direct Assay
– Inferred from Genetic Interaction
– Inferred from Physical Interaction
– Inferred from Expression Pattern
Computational analysis
– IEA: Inferred from Electronic Annotation
– RCA: Reviewed Computational Analysis
– ISS: Inferred from Sequence Similarity
Author statement
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
Curatorial statement
– IC: Inferred by Curator
– ND: No biological Data available
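The evidence codes above amount to a small machine-readable classification of data provenance, which can be written down directly as a lookup table. The abbreviations for the experimental codes (IMP, IDA, IGI, IPI, IEP) are the standard GO ones and are an addition here, since the slide spells out only their full names.

```python
# GO evidence codes from the slide, grouped by provenance category.
EVIDENCE_CODES = {
    "IMP": ("experimental", "Inferred from Mutant Phenotype"),
    "IDA": ("experimental", "Inferred from Direct Assay"),
    "IGI": ("experimental", "Inferred from Genetic Interaction"),
    "IPI": ("experimental", "Inferred from Physical Interaction"),
    "IEP": ("experimental", "Inferred from Expression Pattern"),
    "IEA": ("computational", "Inferred from Electronic Annotation"),
    "RCA": ("computational", "Reviewed Computational Analysis"),
    "ISS": ("computational", "Inferred from Sequence Similarity"),
    "TAS": ("author statement", "Traceable Author Statement"),
    "NAS": ("author statement", "Non-traceable Author Statement"),
    "IC":  ("curatorial", "Inferred by Curator"),
    "ND":  ("curatorial", "No biological Data available"),
}

def codes_in(category):
    """Return, alphabetically, the evidence codes in one provenance category."""
    return sorted(code for code, (cat, _) in EVIDENCE_CODES.items()
                  if cat == category)
```

Such a table is what lets users filter data automatically by how it was produced, rather than reading each annotation's history by hand.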
GO as an Expert Community
The threat of imperialism vs. GO as ‘service to biology’: whoever chooses the labels and decides what counts as meta-data determines the nomenclature and protocols used as standard across biology (and thus the interpretation of data as well as experimental set-ups)
1. De-contextualisation: separating data from information about ‘local’ features of data production
2. Abstraction: simplifying, eliminating or modifying characteristics of data to be standardised
3. Knowledge-stabilisation: defining terms and relations to mirror (what curators see as) the consensus
4. Situating: associating each dataset with a specific term (and thus a specific phenomenon)
Solution: Curator as mediator between requirements of e-Science (consistency, computability, ease of use and wide intelligibility) and the diverse practices characterising experimental biology
• GO curators develop specific expertise to tackle the threat:
– Cross-disciplinary training > awareness of diverse epistemic cultures
– Experience ‘at the bench’ > awareness of what users need and look for
• Community involvement (content meetings, feedback, crowdsourcing, user training workshops and online material)
GO as a Scientific Institution
However: the emergence of separate expertise is itself an obstacle to dialogue with users. Curators face two severe problems:
• Impossible to serve users without consultation, yet users do not provide feedback: lack of interest, time, expertise
• Need to minimise duplication/proliferation of labels, yet each curator/ontology has a different perception/function of/in the field
Solution: Consortia as regulatory centres -- standardisation as a tool to serve diversity in epistemic practices and interests of users:
• Centralising expertise
• Centralising procedures
• Centralising objectives (e.g. open access, re-use of data as a primary goal)
The Gene Ontology Consortium
• Michael Ashburner, 1998: the terms used for data classification should be the ones used to describe research interests
• July 1998: first meeting of the consortium, with members from the Saccharomyces Genome Database, Mouse Genome Informatics, FlyBase and the Berkeley Drosophila Genome Project
• October 1999: funding application to NIH and AstraZeneca
• 2000–1: rapid expansion, including the Zebrafish Information Network, the Rat Genome Database, The Arabidopsis Information Resource and Gramene
• 2002: central office in Cambridge
• Grants from the National Human Genome Research Institute (NHGRI), NIH, the EU, AstraZeneca, Incyte Genomics, the United States Department of Agriculture Research and Education Service, and the UK Medical Research Council
• De facto standard for the classification, annotation and dissemination of genomic data in model organism biology
• In parallel: birth of the Open Biomedical Ontologies Consortium
The Institutional Role of Consortia: Enforcing Collaboration
• Encourage feedback loops among curators:
– Rules for bio-ontology development
– Organisation of curator meetings and communication
– Enhancing accountability and a clear division of labour
• Encourage dialogue with users:
– ‘Content meetings’
– Experiments with peer review procedures (e.g. Reactome)
– Liaise with industry to align their data sharing practices
• Co-operate with journals (linking data disclosure with publication), e.g. Plant Physiology and TAIR: enforcing feedback on GO
• Train users and curators:
– Workshops at conferences and elsewhere
– Enforce institutionalisation within universities (e.g. Stanford Biomedical Informatics; graduate training in UK systems biology)
The Multiple Identities of GO
• GO plays several epistemic roles in biology at once:
– Classification system
– Expert community
– Regulatory institution
• Exemplifies and regulates epistemic and social relations between virtual (in silico) and material (wet) practices in biology
• Despite its institutionalisation within biology, GO is still far from having resolved the tensions between curators’ vision of what technology can do for science and users’ needs and practices:
– Handling dissent on terms or definitions
– Providing sufficient meta-data to assess data provenance
– Non-overlapping datasets and checking data quality
– Long-term maintenance, strategies for revision and updating (how has GO actually been revised?)
Thanks to ESRC for funding, and to several bio-ontology curators (including the GO team at EBI) for their patience and availability for interviews.

Related publications:
• (in preparation) On the Role of Theory in Data-Driven Research: The Case of Bio-Ontologies.
• (2010) Documenting the Emergence of Bio-Ontologies: Or, Why Researching Bioinformatics Requires HPSSB. History and Philosophy of the Life Sciences.
• (2010) Packaging Data for Re-Use: Databases in Model Organism Biology. In Howlett, P. and Morgan, M.S. (eds) How Well Do ‘Facts’ Travel? CUP.
• (2009) On the Locality of Data and Claims About Phenomena. Philosophy of Science 76, 5.
• (2009) Centralising Labels to Distribute Data: The Regulatory Role of Genomic Consortia. In Atkinson et al. (eds) Handbook for Genetics and Society: Mapping the New Genomic Era. Routledge, pp. 469–485.
• (2008) Bio-Ontologies as Tools for Integration in Biology. Biological Theory 3, 1: 8–11.
Abstract
This paper reflects on the analytic challenges emerging from the
study of bioinformatic tools recently created to store and disseminate biological data, such as databases, repositories and bio-ontologies. I focus my discussion on the Gene Ontology, a term that defines three entities at once: a classification system facilitating the distribution and use of genomic data as evidence towards new insights; an expert community specialised in the curation of those data; and a scientific institution promoting the use of this tool among experimental biologists. These three dimensions of the Gene Ontology can be clearly distinguished analytically, but are tightly intertwined in practice. I suggest that this is true of all bioinformatic tools: they need to be understood simultaneously as epistemic, social and institutional entities, since they shape the knowledge extracted from data and at the same time regulate the organisation, development and communication of research. This viewpoint has one important implication for the methodologies used to study these tools, that is the need to integrate historical, philosophical and sociological approaches. I illustrate this claim through examples of misunderstandings that may result from a narrowly disciplinary study of the Gene Ontology, as I experienced them in my own research.