Presented at Web Semantics in Action: Web 3.0 in e-Science (http://www.escience2009.org/)
11:25 – 11:50 Sabina Leonelli: An HPSSB Approach to Gene Ontology
An HPSSB (History, Philosophy and Social Studies of Biology) Approach to Biomedical Ontologies
Sabina Leonelli
ESRC Centre for Genomics in Society
Department of Sociology and Philosophy
University of Exeter
[email protected]
An HPSSB Perspective on the epistemic role of e-Science
Characterisation of experimental science as encompassing a variety of ways of knowing and communicating, beyond what can be formalised
e.g. modelling, experimental practices, tacit familiarity with instruments and materials
This awareness needs to carry over to e-Science: the aim is not to replace laboratory activities but to complement them (note: pointing to new directions is not the same as guiding research; experimentation retains an exploratory quality)
• History of biology: ‘big science’ infrastructure since WWII; history of model organism research in biology, and of relations between biological and medical research
• Philosophy of biology: the role of data, theories, different types of models, instruments and materials in experimental practices; epistemic functions of classification
• Social studies of biology: social organisation of science; forms of and conditions for cooperation and communication; power relations among actors; institutional and economic context
Case Study: The Gene Ontology
• Arguably the most successful bio-ontology to date
• Developed for use by community databases as a standard for the annotation of gene products; its history is steeped in model organism research
• Good tool for data sharing:
– Choice of terms is based on research interests of users
– Dynamic system: can be updated to reflect scientific developments
• Flexibility comes from appropriate curation:
– Manual and labour-intensive (impossible to automate)
– Research interests vary across epistemic cultures:
• How to choose relevant and intelligible labels?
• How and when to update labels?
The Classification Problem
stability of classificatory categories
versus
dynamism and diversity of research practices
Can classification through standard categories enable collaborative research without at the same time stifling its development and pluralism?
GO as a Classification System
Making data travel across different epistemic communities, to facilitate cross-species, integrative research: classification of both biological phenomena and data
• Data are associated with biological phenomena via machine-readable labels
• Users can automatically assess the relevance of data as evidence for claims about those phenomena
• To re-use data towards new discoveries, users need to assess their reliability within their own research context: meta-data enable users to ‘situate’ information through their own expertise and tacit knowledge
= data are de-contextualised for travel and re-contextualised for appropriation by a new context
= access is differential: users can choose parameters for their queries depending on their interests and expertise
= data vary in evidential scope depending on context
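The idea of differential access can be sketched in code: annotation records carry machine-readable labels plus meta-data (such as an evidence code), and each user sets query parameters that match their own standards of reliability. This is a minimal illustrative sketch, not the actual GO database schema; all record contents and field names below are made up for the example.

```python
# Illustrative GO-style annotation records: each links a gene product to
# a term via a machine-readable label and carries provenance meta-data.
# All products, sources and term assignments here are invented examples.
annotations = [
    {"product": "geneA", "term": "GO:0000001", "evidence": "IDA", "source": "LabX"},
    {"product": "geneB", "term": "GO:0000001", "evidence": "IEA", "source": "pipelineY"},
    {"product": "geneC", "term": "GO:0000002", "evidence": "TAS", "source": "paperZ"},
]

def query(records, term=None, accepted_evidence=None):
    """Filter annotations by term and by the evidence codes a user is
    willing to trust -- the parameters that make access differential."""
    return [r for r in records
            if (term is None or r["term"] == term)
            and (accepted_evidence is None or r["evidence"] in accepted_evidence)]

# A cautious experimentalist might exclude purely electronic annotations (IEA):
trusted = query(annotations, term="GO:0000001",
                accepted_evidence={"IDA", "IMP", "TAS"})
```

The same records thus yield different evidential scope for different users: an unrestricted query returns everything, while the restricted one above keeps only the directly assayed annotation.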
EVIDENCE CODES: classification of ‘mined’ data and of data provenance

Experimental evidence
– Inferred from Mutant Phenotype
– Inferred from Direct Assay
– Inferred from Genetic Interaction
– Inferred from Physical Interaction
– Inferred from Expression Pattern
Computational analysis
– IEA: Inferred from Electronic Annotation
– RCA: Reviewed Computational Analysis
– ISS: Inferred from Sequence Similarity
Author statement
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
Curatorial statement
– IC: Inferred by Curator
– ND: No biological Data available
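The evidence codes above amount to a small machine-readable classification of data provenance, which can be written down directly as a lookup table. The abbreviations for the experimental codes (IMP, IDA, IGI, IPI, IEP) are the standard GO ones and are an addition here, since the slide spells out only their full names.

```python
# GO evidence codes from the slide, grouped by provenance category.
EVIDENCE_CODES = {
    "IMP": ("experimental", "Inferred from Mutant Phenotype"),
    "IDA": ("experimental", "Inferred from Direct Assay"),
    "IGI": ("experimental", "Inferred from Genetic Interaction"),
    "IPI": ("experimental", "Inferred from Physical Interaction"),
    "IEP": ("experimental", "Inferred from Expression Pattern"),
    "IEA": ("computational", "Inferred from Electronic Annotation"),
    "RCA": ("computational", "Reviewed Computational Analysis"),
    "ISS": ("computational", "Inferred from Sequence Similarity"),
    "TAS": ("author statement", "Traceable Author Statement"),
    "NAS": ("author statement", "Non-traceable Author Statement"),
    "IC":  ("curatorial", "Inferred by Curator"),
    "ND":  ("curatorial", "No biological Data available"),
}

def codes_in(category):
    """Return, alphabetically, the evidence codes in one provenance category."""
    return sorted(code for code, (cat, _) in EVIDENCE_CODES.items()
                  if cat == category)
```

Such a table is what lets users filter data automatically by how it was produced, rather than reading each annotation's history by hand.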
GO as an Expert Community
The threat of imperialism vs. GO as ‘service to biology’: whoever chooses the labels and decides what counts as meta-data determines the nomenclature and protocols used as standard across biology (and thus the interpretation of data as well as experimental set-ups)
1. De-contextualisation: separating data from information about ‘local’ features of data production
2. Abstraction: simplifying, eliminating or modifying characteristics of data to be standardised
3. Knowledge-stabilisation: defining terms and relations to mirror (what curators see as) the consensus
4. Situating: associating each dataset with a specific term (and thus a specific phenomenon)
Solution: Curator as mediator between requirements of e-Science (consistency, computability, ease of use and wide intelligibility) and the diverse practices characterising experimental biology
• GO curators develop specific expertise to tackle the threat:
– Cross-disciplinary training > awareness of diverse epistemic cultures
– Experience ‘at the bench’ > awareness of what users need and look for
• Community involvement (content meetings, feedback, crowdsourcing, user training workshops and online material)
GO as a Scientific Institution
However: the emergence of separate expertise is itself an obstacle to dialogue with users. Curators face two severe problems:
• Impossible to serve users without consultation, yet users do not provide feedback: lack of interest, time, expertise
• Need to minimise duplication/proliferation of labels, yet each curator/ontology has a different perception/function of/in the field
Solution: Consortia as regulatory centres -- standardisation as a tool to serve diversity in epistemic practices and interests of users:
• Centralising expertise
• Centralising procedures
• Centralising objectives (e.g. open access, re-use of data as a primary goal)
The Gene Ontology Consortium
• Michael Ashburner, 1998: the terms used for data classification should be the ones used to describe research interests
• July 1998: first meeting of the consortium, with members from the Saccharomyces Genome Database, Mouse Genome Informatics, FlyBase and the Berkeley Drosophila Genome Project
• October 1999: funding application to NIH and AstraZeneca
• 2000–1: rapid expansion, including the Zebrafish Information Network, the Rat Genome Database, The Arabidopsis Information Resource and Gramene
• 2002: central office in Cambridge
• Grants from the National Human Genome Research Institute (NHGRI), NIH, the EU, AstraZeneca, Incyte Genomics, the United States Department of Agriculture Research and Education Service, and the UK Medical Research Council
• De facto standard for the classification, annotation and dissemination of genomic data in model organism biology
• In parallel: birth of the Open Biomedical Ontologies Consortium
The Institutional Role of Consortia: Enforcing Collaboration
• Encourage feedback loops among curators:
– Rules for bio-ontology development
– Organisation of curator meetings and communication
– Enhancing accountability and a clear division of labour
• Encourage dialogue with users:
– ‘Content meetings’
– Experiments with peer review procedures (e.g. Reactome)
– Liaise with industry to align their data sharing practices
• Co-operate with journals (linking data disclosure with publication), e.g. Plant Physiology and TAIR: enforcing feedback on GO
• Train users and curators:
– Workshops at conferences and elsewhere
– Enforce institutionalisation within universities (e.g. Stanford Biomedical Informatics; graduate training in UK systems biology)
The Multiple Identities of GO
• GO plays several epistemic roles in biology at once:
– Classification system
– Expert community
– Regulatory institution
• Exemplifies and regulates epistemic and social relations between virtual (in silico) and material (wet) practices in biology
• Despite its institutionalisation within biology, GO is still far from having resolved the tensions between curators’ vision of what technology can do for science and users’ needs and practices:
– Handling dissent on terms or definitions
– Providing sufficient meta-data to assess data provenance
– Non-overlapping datasets and checking data quality
– Long-term maintenance, strategies for revision and updating (how has GO actually been revised?)
Thanks to ESRC for funding, and to several bio-ontology curators (including the GO team at EBI) for their patience and availability for interviews.

Related publications:
• (in preparation) On the Role of Theory in Data-Driven Research: The Case of Bio-Ontologies.
• (2010) Documenting the Emergence of Bio-Ontologies: Or, Why Researching Bioinformatics Requires HPSSB. History and Philosophy of the Life Sciences.
• (2010) Packaging Data for Re-Use: Databases in Model Organism Biology. In Howlett, P. and Morgan, M.S. (eds) How Well Do ‘Facts’ Travel? CUP.
• (2009) On the Locality of Data and Claims About Phenomena. Philosophy of Science 76, 5.
• (2009) Centralising Labels to Distribute Data: The Regulatory Role of Genomic Consortia. In Atkinson et al. (eds) Handbook for Genetics and Society: Mapping the New Genomic Era. Routledge, pp. 469–485.
• (2008) Bio-Ontologies as Tools for Integration in Biology. Biological Theory 3, 1: 8–11.
Abstract
This paper reflects on the analytic challenges emerging from the
study of bioinformatic tools recently created to store and disseminate biological data, such as databases, repositories and bio-ontologies. I focus my discussion on the Gene Ontology, a term that defines three entities at once: a classification system facilitating the distribution and use of genomic data as evidence towards new insights; an expert community specialised in the curation of those data; and a scientific institution promoting the use of this tool among experimental biologists. These three dimensions of the Gene Ontology can be clearly distinguished analytically, but are tightly intertwined in practice. I suggest that this is true of all bioinformatic tools: they need to be understood simultaneously as epistemic, social and institutional entities, since they shape the knowledge extracted from data and at the same time regulate the organisation, development and communication of research. This viewpoint has one important implication for the methodologies used to study these tools, that is the need to integrate historical, philosophical and sociological approaches. I illustrate this claim through examples of misunderstandings that may result from a narrowly disciplinary study of the Gene Ontology, as I experienced them in my own research.