kerry-preston
Relevance Detection Approach to Gene Annotation
• Aid to automatic annotation of databases
• Annotation flow
– Extraction of molecular function of a gene from literature
– Then annotation of this function with a term in a controlled vocabulary
• Premise
– If the document sets retrieved by a GeneRIF and a GO concept are similar then a link can be made between them
Data
• GeneRIF/GO term pairs
– Paired if they reference the same MEDLINE article
– Manually filtered for obvious errors
– 550 pairs from 335 distinct genes
• GO concept = GO term + definition
• GeneRIFs and GO concepts too short for simple keyword matching
• Treated as an IR problem
– Similar to TREC novelty track
– Compute relevance and similarity of 2 sentences
• Document set - TREC Genomics 2003 docs
• Each sentence within GeneRIF/GO concept pair treated as IR query
• Similarity between the 2 computed based on top 200 docs retrieved by each query
• Best Recall = 78.2% (prec = 22.1%)
• Best Precision = 66.2% (rec = 46.9%)
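The premise above (two short texts are linked if they retrieve similar document sets) can be sketched as follows. The slides do not say which similarity measure was used, so the Jaccard overlap, the function names and the `threshold` value are all assumptions made for illustration. The ranked lists of document IDs are assumed to come from whatever retrieval engine runs each sentence as a query.

```python
def top_k(ranked_doc_ids, k=200):
    """Return the set of the first k document IDs from a ranked result list."""
    return set(ranked_doc_ids[:k])


def retrieval_overlap(ranked_a, ranked_b, k=200):
    """Jaccard overlap of the top-k documents retrieved by two queries.

    Assumption: Jaccard is used here only as one plausible measure;
    the original work may have used a different one.
    """
    a, b = top_k(ranked_a, k), top_k(ranked_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_linked(ranked_a, ranked_b, k=200, threshold=0.3):
    """Link a GeneRIF sentence and a GO concept sentence if overlap is high enough.

    The threshold is a hypothetical tuning knob, not a value from the slides.
    """
    return retrieval_overlap(ranked_a, ranked_b, k) >= threshold


if __name__ == "__main__":
    generif_hits = ["d1", "d2", "d3", "d4"]   # made-up document IDs
    go_hits = ["d3", "d4", "d5", "d6"]
    print(retrieval_overlap(generif_hits, go_hits, k=4))
```

In a setup like this, lowering the threshold would be expected to trade precision for recall, which is one way figures such as the two operating points above could arise.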
GO Dependence Relations
• Previous work (PSB)
– Using substring matching between GO codes
– Derived from annotation databases, using vector space models, co-occurrence, association rule-mining.
• ChEBI: www.ebi.ac.uk/chebi/
– Chemical Entities of Biological Interest
– Preferred names + synonyms
– IS_A (poly)hierarchy
Methods
• String matching
• If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship
– First order relationship
– ChEBI term must be whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity
• Also, in a dependence relationship with the ancestors
– Second order relationship
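A minimal sketch of the first order matching rule, assuming GO term names and ChEBI entity names are available as plain strings. The function names and the sample inputs are illustrative, and the regular expression is only one way to express "whole word or surrounded by punctuation". Second order relationships through ChEBI ancestors are not covered here.

```python
import re
from collections import defaultdict
from itertools import combinations


def contains_entity(go_term, entity):
    """True if `entity` occurs in `go_term` as a whole word.

    A match must not be directly preceded or followed by a letter or digit,
    so "carbon" matches "carbon-oxygen" but not "carbonic".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Pair up GO terms that mention the same ChEBI entity."""
    by_entity = defaultdict(list)
    for term in go_terms:
        for entity in entities:
            if contains_entity(term, entity):
                by_entity[entity].append(term)

    pairs = set()
    for terms in by_entity.values():
        for a, b in combinations(sorted(terms), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    go_terms = [
        "carbonic anhydrase activity",
        "carbon-oxygen lyase activity",
        "carbon fixation",
    ]
    print(first_order_pairs(go_terms, ["carbon"]))
```

With these inputs, "carbonic anhydrase activity" is left out because "carbon" is only a substring of "carbonic", which mirrors the example on the slide.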
Results
• 55% of GO terms contain a ChEBI entity
• 56% of dependent pairs with a ChEBI term found in PSB study were identified in this study
• Less than 1% of GO term pairs found in this study were identified by the PSB study
• Issues
– How to validate potential relationships?
– Usual naming/synonym ambiguity!
– Substrings not used: imidazolonepropionase
Disease Text Classification
• Task: Classification of text into one of 26 disease classes
• Used full text and weighted sections according to information distribution published by other groups
Data Preparation
• HTML full text documents, semi-automatic section division
• Tokenisation, Stemming, Stop word filtering, Part of speech tagging
• Dataset: 21*25 positive full text articles, 33 negative full text articles
• 10-fold cross-validation
• Nearest centroid classifier
Results
• Baseline: 56% F-score
• Additional preprocessing: 67%
– 10,000 stopword filter
– Only nouns
• Section weighting: 74%
– Abstract and Introduction weighted highest
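The combination of section weighting and a nearest centroid classifier might look roughly like the sketch below. The weight values, section names, class labels and toy tokens are all hypothetical. The slides only state that Abstract and Introduction were weighted highest, and cosine similarity is assumed here as the distance measure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: only the ordering (abstract/introduction highest)
# comes from the slides, the numbers themselves are made up.
SECTION_WEIGHTS = {"abstract": 2.0, "introduction": 2.0,
                   "methods": 1.0, "results": 1.0}


def weighted_vector(sections, weights=SECTION_WEIGHTS):
    """Build a term vector where each token counts by its section's weight."""
    vec = Counter()
    for name, tokens in sections.items():
        w = weights.get(name, 1.0)
        for tok in tokens:
            vec[tok] += w
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train_centroids(docs):
    """Average the weighted vectors of each class; docs is [(label, sections)]."""
    sums = defaultdict(Counter)
    counts = Counter()
    for label, sections in docs:
        sums[label].update(weighted_vector(sections))
        counts[label] += 1
    return {label: Counter({t: v / counts[label] for t, v in vec.items()})
            for label, vec in sums.items()}


def classify(sections, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = weighted_vector(sections)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    training = [
        ("diabetes", {"abstract": ["insulin", "glucose"], "methods": ["assay"]}),
        ("asthma", {"abstract": ["airway", "inflammation"], "methods": ["assay"]}),
    ]
    centroids = train_centroids(training)
    print(classify({"abstract": ["glucose"], "methods": ["assay"]}, centroids))
```

Tokens are assumed to be already stemmed, stopword-filtered and restricted to nouns, as in the preprocessing steps listed earlier.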
From Nonsense to Sense in Healthcare Questions
• Diagnosis, Prognosis, Therapy, Prevention
• Medicine finds disease mechanisms by first finding cures
– Currently by trial and error
• Try drug then test
– Future: test then try drug
• Biomarkers
– Normality -> dysfunction -> disease
– There are prognostic markers before any diagnostic markers
Integrative Genomics
• Looking for hidden connections over a wide field, e.g.
– Immune system works too hard = rheumatoid arthritis
– Immune system doesn’t work hard enough = infectious diseases
Term Disambiguation
• 40% of genes have a homonym problem
• For 300 genes = 1 million MEDLINE articles
• After disambiguation = 60,000 articles
• 93% accuracy in assigning correct ID to ambiguous genes
• Use contextual fingerprints:
– Experts choose 5 abstracts about a concept
– Fingerprint then created for that concept