Relevance Detection Approach to Gene Annotation

• Aid to automatic annotation of databases

• Annotation flow

– Extraction of molecular function of a gene from literature

– Then annotation of this function with a term in a controlled vocabulary

• Premise

– If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them

Data

• GeneRIF/GO term pairs

– Paired if they reference the same MEDLINE article

– Manually filtered for obvious errors

– 550 pairs from 335 distinct genes

• GO concept = GO term + definition

• GeneRIFs and GO concepts too short for simple keyword matching

• Treated as an IR problem

– Similar to TREC novelty track

– Compute relevance and similarity of 2 sentences

• Document set - TREC Genomics 2003 docs

• Each sentence within GeneRIF/GO concept pair treated as IR query

• Similarity between the 2 computed based on top 200 docs retrieved by each query

• Best Recall = 78.2% (prec = 22.1%)

• Best Precision = 66.2% (rec = 46.9%)
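
The comparison of two queries through their retrieved documents can be sketched as follows. This is a minimal illustration only: the slides do not say which similarity measure was used, so the Jaccard overlap, the function names and the `threshold` value are all assumptions, and the ranked lists of document IDs are assumed to come from whatever IR engine runs the GeneRIF and GO concept queries.

```python
from typing import Sequence


def top_k_overlap(ranked_a: Sequence[str], ranked_b: Sequence[str], k: int = 200) -> float:
    """Jaccard overlap between the top-k document IDs of two ranked result lists.

    Assumption: Jaccard is used here for illustration; the original measure is not stated.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def should_link(ranked_generif: Sequence[str], ranked_go: Sequence[str],
                k: int = 200, threshold: float = 0.1) -> bool:
    """Propose a GeneRIF/GO link when the retrieved sets are similar enough.

    The threshold is a hypothetical tuning parameter, not a value from the slides.
    """
    return top_k_overlap(ranked_generif, ranked_go, k) >= threshold
```

Raising the threshold would be expected to trade recall for precision, which is the kind of trade-off the two best-result figures above reflect.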

GO Dependence Relations

• Previous work (PSB)

– Using substring matching between GO codes

– Derived from annotation databases, using vector space models, co-occurrence, association rule-mining

• ChEBI: www.ebi.ac.uk/chebi/

– Chemical Entities of Biological Interest

– Preferred names + synonyms

– IS_A (poly)hierarchy

Methods

• String matching

• If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship

– First order relationship

– ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity

• Also in a dependence relationship with the ancestors

– Second order relationship
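
The matching rules above can be sketched roughly as below. This is an illustrative reading of the slides, not the authors' implementation: the function names, the `parents` mapping (standing in for ChEBI IS_A links) and the toy inputs in the usage note are all assumptions.

```python
import re
from itertools import combinations


def contains_entity(go_term: str, entity: str) -> bool:
    """True if `entity` occurs in `go_term` as a whole word or bounded by punctuation."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Pairs of GO terms that mention the same entity."""
    pairs = set()
    for entity in entities:
        hits = sorted(t for t in go_terms if contains_entity(t, entity))
        pairs.update(combinations(hits, 2))
    return pairs


def ancestors(entity, parents):
    """All ancestors of `entity` in a (poly)hierarchy given as child -> set of parents."""
    seen, stack = set(), list(parents.get(entity, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def second_order_pairs(go_terms, entities, parents):
    """Pairs where one GO term mentions an entity and the other mentions one of its ancestors."""
    mentions = {e: {t for t in go_terms if contains_entity(t, e)} for e in entities}
    pairs = set()
    for entity in entities:
        for anc in ancestors(entity, parents):
            for a in mentions[entity]:
                for b in mentions.get(anc, ()):
                    if a != b:
                        pairs.add(tuple(sorted((a, b))))
    return pairs
```

With the toy entity "carbon", `contains_entity("carbonic anhydrase activity", "carbon")` is false while `contains_entity("carbon-oxygen lyase activity", "carbon")` is true, which reproduces the example given on the slide.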

Results

• 55% of GO terms contain a ChEBI entity

• 56% of dependent pairs with a ChEBI term found in the PSB study were identified in this study

• Less than 1% of GO term pairs found in this study were identified by the PSB study

• Issues

– How to validate potential relationships?

– Usual naming/synonym ambiguity!

– Substrings not used: imidazolonepropionase

Disease Text Classification

• Task: Classification of text into one of 26 disease classes

• Used full text and weighted sections according to information distribution published by other groups

Data Preparation

• HTML full text documents, semi automatic section division

• Tokenisation, Stemming, Stop word filtering, Part of speech tagging

• Dataset: 21*25 positive full text articles, 33 negative full text articles

• 10-fold cross-validation

• Nearest centroid classifier

Results

• Baseline: 56% F-score

• Additional preprocessing: 67%

– 10,000 stopword filter

– Only nouns

• Section weighting: 74%

– Abstract and Introduction weighted highest
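
A nearest centroid classifier with section weighting might look like the sketch below. The weight values, section names, cosine similarity and the class labels in the usage note are assumptions made for illustration; the slides state only that Abstract and Introduction were weighted highest.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: only the ordering (Abstract and Introduction highest) comes from the slides.
SECTION_WEIGHTS = {"abstract": 3.0, "introduction": 3.0, "methods": 1.0,
                   "results": 1.0, "discussion": 1.0}


def weighted_vector(doc):
    """Turn {section: [tokens]} into a term vector, scaling each token by its section weight."""
    vec = Counter()
    for section, tokens in doc.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for token in tokens:
            vec[token] += weight
    return vec


def centroid(vectors):
    """Mean term vector of a list of term vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {term: value / len(vectors) for term, value in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_docs):
    """Compute one centroid per class from (label, doc) pairs."""
    by_class = defaultdict(list)
    for label, doc in labelled_docs:
        by_class[label].append(weighted_vector(doc))
    return {label: centroid(vectors) for label, vectors in by_class.items()}


def classify(doc, centroids):
    """Assign the class whose centroid is most similar to the document vector."""
    vec = weighted_vector(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

For example, after `train([("asthma", {...}), ("diabetes", {...})])` with tokenised, stemmed and stopword-filtered sections, `classify(new_doc, centroids)` returns the label of the closest centroid.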

From Nonsense to Sense in Healthcare Questions

• Diagnosis, Prognosis, Therapy, Prevention

• Medicine finds disease mechanisms by first finding cures

– Currently by trial and error

• Try drug then test

– Future: test then try drug

• Biomarkers

– Normality -> dysfunction -> disease

– There are prognostic markers before any diagnostic markers

Integrative Genomics

• Looking for hidden connections over a wide field, e.g.

– Immune system works too hard = rheumatoid arthritis

– Immune system doesn't work hard enough = infectious diseases

Term Disambiguation

• 40% of genes have a homonym problem

• For 300 genes = 1mil MEDLINE articles

• After disambiguation = 60,000 articles

• 93% accuracy in assigning correct ID to ambiguous genes

• Use contextual fingerprints:

– Experts choose 5 abstracts about a concept

– Fingerprint then created for that concept
