Facilitating the development of controlled vocabularies for metabolomics with text mining
I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3
1 MCISB http://www.mcisb.org
2 EBI http://www.ebi.ac.uk
3 MSI http://msi-workgroups.sf.net
Motivation
• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics
• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources
• the pressing need for vocabularies and ontologies for metabolomics
Metabolomics Society
• http://www.metabolomicssociety.org
• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments
• five working groups:
– biological sample context
– chemical analysis
– data analysis
– ontology
– data exchange
MSI OWG
• Metabolomics Standardisation Initiative Ontology WG
• http://msi-ontology.sourceforge.net
• coordinated by Dr Susanna-Assunta Sansone
• develop a common semantic framework for metabolomics studies by means of
– controlled vocabularies
– ontologies
so as to be able to:
– describe the experimental process consistently
– ensure meaningful and unambiguous data exchange
Scope
• the coverage of the domain reflects the typical structure of metabolomics investigations:
– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)
– technology-specific components (sample preparation; instrumental analysis; data pre-processing)
• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…
Terms
• terms:
– linguistic representations of domain-specific concepts
– means of conveying scientific and technical information
• CV terms:
– used to tag units of information so that they can be more easily retrieved by a search
– improve technical communication by ensuring that everyone is using the same term to mean the same thing
Term acquisition
• CV terms are chosen and organised by trained professionals who possess expertise in the subject area
• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms
• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone
• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature
Strategy
• each CV is compiled in an iterative process consisting of the following steps:
1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions
2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications
3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness
A text mining workflow
1. information retrieval: gather a technology-specific corpus of documents
search terms: MeSH terms & CV terms
documents: abstracts & full papers
resources: Entrez, i.e. MEDLINE & PubMed Central (PMC)
2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus
method: C-value provided by NaCTeM
3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.
resources: UMLS — MetaThesaurus & Semantic Network
Information retrieval using MeSH terms
• MeSH = Medical Subject Headings
• http://www.nlm.nih.gov/mesh/
• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts
IR using MeSH terms
• finding the relevant MeSH terms using the MeSH browser
• http://www.nlm.nih.gov/mesh/MBrowser.html
• look up: NMR
• resulting MeSH term(s): Magnetic Resonance Spectroscopy
• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
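A minimal sketch of how the PubMed query above could be issued programmatically through the NCBI E-utilities ESearch service. The helper name `build_esearch_url` and the `retmax` default are illustrative assumptions, not part of the original workflow; the code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term, db="pubmed", retmax=100):
    """Return an ESearch URL that restricts the search to a MeSH heading.

    build_esearch_url is a hypothetical helper; db, term and retmax are
    standard ESearch parameters.
    """
    query = f"{mesh_term} [MeSH Terms]"
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Switching `db` to `"pmc"` would target PubMed Central instead of MEDLINE/PubMed, which matters for the full-text retrieval discussed next.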
Beyond MeSH terms
• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered and not the experimental conditions leading to them
• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only
• as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
[Figure: occurrences of NMR in the biomedical literature: MEDLINE (abstracts) and PubMed Central (full papers)]
Selecting documents
[Figure: a document (doc ID) is added to the local corpus when its number of matching terms > threshold; plot of the number of documents (0 to 30,000) against the number of matching search terms (1 to 31), annotated "= 3"]
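The selection step on this slide (keep a document when its number of matching search terms exceeds a threshold) can be sketched as follows. This is only an illustration: the function names, the toy documents and the case-insensitive substring matching are assumptions, and the default threshold of 3 is one reading of the "= 3" annotation in the figure.

```python
def count_matching_terms(text, search_terms):
    """Count the distinct search terms that occur in the text.

    Matching is a crude case-insensitive substring test (an assumption).
    """
    lowered = text.lower()
    return sum(1 for term in search_terms if term.lower() in lowered)


def select_documents(documents, search_terms, threshold=3):
    """Return the IDs of documents matching more than `threshold` terms."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if count_matching_terms(text, search_terms) > threshold
    ]


# Toy data invented for illustration only.
search_terms = ["NMR", "chemical shift", "pulse sequence",
                "spectrometer", "free induction decay"]
documents = {
    "doc1": "1H NMR spectra were acquired on a 600 MHz spectrometer using "
            "a standard pulse sequence; chemical shift values were "
            "referenced to TSP.",
    "doc2": "Urine samples were collected and analysed by NMR.",
}

local_corpus = select_documents(documents, search_terms)
print(local_corpus)
```

Here `doc1` matches four of the five search terms and is kept, while `doc2` matches only one and is discarded.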
C-value
• syntactic pattern matching used to select term candidates:
(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N
• termhood of each candidate term t is calculated using:
– |t| its length as the number of words
– f(t) its frequency of occurrence
– S(t) the set of other candidate terms containing it as a subphrase
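\[
C(t) =
\begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\[4pt]
\ln|t| \cdot \left( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
\]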
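A minimal sketch of the termhood calculation, assuming the formula reads as follows: ln|t| times f(t) when S(t) is empty, and otherwise ln|t| times f(t) minus the mean frequency of the terms in S(t). Candidate terms are represented as tuples of words; the function name `c_value` and the toy frequencies are illustrative assumptions, and this is not the NaCTeM implementation.

```python
import math


def c_value(term, freq):
    """C-value of `term`, given a dict mapping candidate terms
    (tuples of words) to their frequencies of occurrence."""
    n = len(term)
    # S(t): other candidates that contain t as a contiguous subphrase.
    containing = [
        s for s in freq
        if s != term
        and any(s[i:i + n] == term for i in range(len(s) - n + 1))
    ]
    adjusted = freq[term]
    if containing:
        # Subtract the mean frequency of the longer terms containing t.
        adjusted -= sum(freq[s] for s in containing) / len(containing)
    return math.log(n) * adjusted


# Toy frequencies invented for illustration only.
freq = {
    ("magnetic", "resonance"): 10,
    ("nuclear", "magnetic", "resonance"): 6,
    ("magnetic", "resonance", "spectroscopy"): 2,
}

for term in freq:
    print(" ".join(term), round(c_value(term, freq), 3))
```

With the natural logarithm, a single-word candidate scores zero because ln 1 = 0, so the measure as written favours multi-word terms.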
Unified Medical Language System (UMLS)
• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies
• http://umlsks.nlm.nih.gov
• UMLS contains the following semantic classes relevant to our problem:
– Organism A.1.1
– Anatomical Structure A.1.2
– Substance A.1.4
– Biological Function B.2.2.1
– Injury or Poisoning B.2.3
• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
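The filtering step could look roughly like the sketch below, which drops any candidate term whose semantic class falls under one of the classes listed above. The lookup table `semantic_types`, its class codes for individual terms, and the function names are hypothetical stand-ins for a real UMLS lookup; treating codes below a listed class (for example A.1.1.3 under A.1.1) as excluded is also an assumption.

```python
# Semantic classes to filter out, keyed by the codes given on the slide.
EXCLUDED_CLASSES = {
    "A.1.1": "Organism",
    "A.1.2": "Anatomical Structure",
    "A.1.4": "Substance",
    "B.2.2.1": "Biological Function",
    "B.2.3": "Injury or Poisoning",
}


def is_excluded(codes):
    """True if any code equals, or lies below, an excluded class."""
    return any(
        code == prefix or code.startswith(prefix + ".")
        for code in codes
        for prefix in EXCLUDED_CLASSES
    )


def filter_terms(terms, semantic_types):
    """Keep terms that are not mapped to an excluded class.

    Terms missing from the lookup are kept.
    """
    return [t for t in terms if not is_excluded(semantic_types.get(t, []))]


# Hypothetical lookup invented for illustration; not real UMLS output.
semantic_types = {
    "rat": ["A.1.1.3"],
    "liver": ["A.1.2.3"],
    "glucose": ["A.1.4.1"],
    "pulse sequence": ["B.1.3"],
}
terms = ["rat", "liver", "glucose", "pulse sequence", "magic angle spinning"]

print(filter_terms(terms, semantic_types))
```

Keeping unmapped terms is a deliberate choice in this sketch: technology-specific terms are the ones least likely to appear in a biomedical thesaurus.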
Results
• input: 243 NMR terms & 152 GC terms
• output: 5,699 NMR terms & 2,612 GC terms
[Figure: values shown without recoverable labels: 2%, 16.25, 0.13]