Why
• goal: Named Entity Recognition
• method: supervised learning
• feature extraction
• (text) internal features: word shape, n-grams, ...
protein-indicative features:
- of shape a0a0a0a…
- followed by /bind/
- shorter than 5 characters
• generalisations on training data might be incomplete
• acquired evidence might be absent in test instance
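The internal features listed above can be sketched as a small extraction function. This is only an illustration: the shape mapping (lowercase letter to "a", uppercase to "A", digit to "0") and the names `word_shape` and `internal_features` are assumptions, not taken from the slides.

```python
def word_shape(token: str) -> str:
    """Map lowercase letters to 'a', uppercase to 'A', digits to '0'.

    Other characters are kept unchanged. This mapping is an assumed
    convention for the "word shape" feature.
    """
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("0")
        elif ch.isalpha():
            shape.append("A" if ch.isupper() else "a")
        else:
            shape.append(ch)
    return "".join(shape)


def internal_features(tokens, i):
    """Return text-internal features for the token at position i."""
    token = tokens[i]
    next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        "shape": word_shape(token),
        # /bind/ is read here as "next token starts with 'bind'" (assumption)
        "followed_by_bind": next_token.startswith("bind"),
        "shorter_than_5": len(token) < 5,
    }


features = internal_features(["p53", "binds", "DNA"], 0)
```

For the token "p53" in this toy sentence, all three protein-indicative cues fire; a token absent from the training data with none of these cues is exactly the case where additional evidence is needed.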
Getting Additional Evidence
internal features might be insufficient, but good evidence might be somewhere else...
Note: some systems (MaxEnt for instance) can easily and successfully integrate a huge number of features
• small and accurate lists of proteins (gazetteers)
• use as rules
• use as features
• other texts might contain indicative n-grams
• how to use other texts
• which texts to use
How
patterns
“X gene/protein/DNA”
“X sequence/motif”
A. Create patterns (aim, method, input)
B. Search corpus for patterns and obtain counts
C. Use counts as appropriate
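Steps A and B can be sketched with an in-memory corpus standing in for a search engine. The corpus sentences, the template syntax `{x}`, and the function name `count_patterns` are hypothetical and used only for illustration; a real system would obtain the counts from a search service instead.

```python
import re
from collections import Counter

# Step A: patterns with a slot for the candidate X (taken from the slide).
PATTERNS = ["{x} gene", "{x} protein", "{x} DNA", "{x} sequence", "{x} motif"]


def count_patterns(candidate, corpus, patterns=PATTERNS):
    """Step B: count how often each instantiated pattern occurs in the corpus."""
    counts = Counter()
    for template in patterns:
        phrase = template.format(x=candidate)
        # Match the phrase as whole words, ignoring case.
        regex = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)
        counts[template] = sum(len(regex.findall(doc)) for doc in corpus)
    return counts


# Toy corpus (made up for this example).
corpus = [
    "The p53 gene is mutated in many tumours.",
    "Expression of the p53 protein was measured.",
    "The p53 gene was sequenced.",
]
counts = count_patterns("p53", corpus)
```

Step C then decides what to do with `counts`, for example turning them into classifier features.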
Create Patterns (I)
1. AIM (granularity)
• distinguish entities from non-entities
  “X gene OR DNA OR protein”
  + bypass ambiguities and data sparseness
  – less information
• distinguish between entities
  “X DNA”
  “X gene”
  + more information
  – ambiguities, data sparseness
“X binds”
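The granularity trade-off can be made concrete with per-class counts. The numbers below are invented for illustration, and `coarse_evidence` and `fine_evidence` are hypothetical helper names.

```python
def coarse_evidence(class_counts):
    """Entity vs non-entity: pool all classes, as in "X gene OR DNA OR protein"."""
    return sum(class_counts.values())


def fine_evidence(class_counts):
    """Distinguish between entities: keep one relative score per class."""
    total = sum(class_counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in class_counts.items()}


# Made-up counts for a candidate X in "X gene", "X protein", "X DNA".
class_counts = {"gene": 120, "protein": 45, "DNA": 3}
pooled = coarse_evidence(class_counts)
per_class = fine_evidence(class_counts)
```

The pooled count is less affected by sparse individual patterns but no longer says which class the candidate belongs to; the per-class scores carry that information but become unreliable when the individual counts are small.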
1. AIM  2. METHOD  3. INPUT
Create Patterns (II)
2. METHOD
• by hand (experts)
  + high precision, exact target
  – time consuming, experts needed
• automatically (collocations, clustering)
  + no human intervention
  – lower precision, not necessarily interesting
patterns
Create Patterns (III)
3. INPUT (“X gene”)
low frequency words (as estimated from a non-specific corpus)
first output of classifier
NP chunks
words not found in standard dictionary
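Two of the input choices above (low-frequency words and words not found in a standard dictionary) can be sketched as a candidate filter. Combining both filters in one function, the threshold `max_freq`, and the toy dictionary and frequency table are assumptions made for this example.

```python
def select_candidates(tokens, dictionary, general_freq, max_freq=5):
    """Keep tokens that are out-of-dictionary and rare in a non-specific corpus.

    dictionary:   set of lowercase words from a standard dictionary (toy data here)
    general_freq: lowercase word -> frequency in a non-specific corpus (toy data here)
    """
    candidates = []
    for tok in tokens:
        key = tok.lower()
        if key in dictionary:
            continue  # ordinary word, unlikely to fill the X slot
        if general_freq.get(key, 0) > max_freq:
            continue  # too frequent in general text
        candidates.append(tok)
    return candidates


tokens = ["the", "p53", "gene", "binds", "Mdm2"]
dictionary = {"the", "gene", "binds"}
general_freq = {"the": 1000, "gene": 50, "binds": 20}
candidates = select_candidates(tokens, dictionary, general_freq)
```

Only the surviving candidates are substituted for X in the patterns, which keeps the number of search queries manageable.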
increase precision but lower recall
              prec   rec    f-score
all features  .813   .861   .836
– web         .807   .864   .835
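The f-score values in the table are consistent with the balanced harmonic mean of precision and recall (F1); that this is the measure used is an assumption, checked here against the reported numbers.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f_all = round(f_score(0.813, 0.861), 3)     # row "all features"
f_no_web = round(f_score(0.807, 0.864), 3)  # row "– web"
```

Both rounded values reproduce the f-score column, so the web features change precision and recall in opposite directions while the combined score moves by only .001.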
What? Google vs PubMed
• PubMed: searchable collection of over 12M biomedical abstracts, more sophisticated search options
• Everything: Google searches over 8 billion pages, raw search, API
Query: “p53 gene”
PubMed: 5,843 documents
Google: ~165,000 pages
Google + PubMed
“anything you want” site:<specific_site>
“p53 gene” site:www.ncbi.nlm.nih.gov
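A site-restricted phrase query of this form can be assembled with a small helper. The function name `site_query` is hypothetical; the sketch only builds the query string and does not call any search API.

```python
def site_query(phrase, site=None):
    """Build an exact-phrase query, optionally restricted to one site."""
    query = f'"{phrase}"'
    if site:
        query += f" site:{site}"
    return query


restricted = site_query("p53 gene", "www.ncbi.nlm.nih.gov")
unrestricted = site_query("p53 gene")
```

Keeping query construction separate from the search call makes it easy to issue the same pattern against the whole web and against a single site and compare the counts.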
Rob Futrelle has this function available on this webpage:
http://www.ccs.neu.edu/home/futrelle/bionlp/search.html
• comment: sometimes PubMed reports “Quoted phrase not found” even when Google finds the phrase.
PubMed provides phrase search only on pre-indexed phrases
PubMed > Google
• query expansion
PubMed uses the MeSH headings to match synonyms (it will expand “Pol II” to search for “DNA Polymerase II”)
Google will only try to correct misspellings
• field specific search
PubMed allows field-specific searches (eg year)
Google cannot refine its search in this respect
• timeliness
PubMed is updated daily
Google is slow in updating
PubMed > Google (cont’d)
• ranking
Google does a ‘vote’-based ranking: not necessarily good
PubMed does not do any ranking (possibly bad too...)
• truncation and flexibility
PubMed accepts truncated entries and will look for all possible variations. It will try to break up phrases if no matches are found.
Google has a rigid search
• manual indexing
PubMed’s MeSH headings contain keywords not necessarily contained in the abstract
Google cannot find something that is not mentioned in the abstract
What to Use? (or How to Use the Evidence)
• as a rule
  + sure identification of entities
  – too powerful -> high risk of false positives
• as a feature
  + fewer false positives
  + some systems (MaxEnt) can integrate a huge number of features
  – might still not get used or provide enough evidence
might be OK to use Google: more info but not necessarily precise
might be better to use PubMed: less info but precise
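The two ways of using the evidence can be contrasted in code. The threshold of 100 hits, the order-of-magnitude bucketing, and the feature name `web_count_bucket` are illustrative assumptions, not values from the slides.

```python
import math


def apply_rule(count, threshold=100):
    """As a rule: the count alone decides, overriding the classifier."""
    return count >= threshold


def as_feature(count):
    """As a feature: pass a bucketed count to the classifier, which weighs it."""
    bucket = 0 if count == 0 else int(math.log10(count)) + 1
    return {f"web_count_bucket={bucket}": 1.0}


pubmed_hits = 5843  # the “p53 gene” PubMed count quoted earlier
decision = apply_rule(pubmed_hits)
feature = as_feature(pubmed_hits)
```

The rule commits to a label as soon as the threshold is reached, which is why a noisy count source is risky there; the feature leaves the final decision to the learner, which may also end up ignoring it.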
iHOP (Information Hyperlinked Over Proteins): a gene network for navigating the literature
http://www.pdg.cnb.uam.es/UniPub/iHOP
• uses genes and proteins as hyperlinks between sentences and abstracts
• each step through the network produces information about one single gene and its interactions
• information retrieved by connecting similar concepts
• precision of gene name and synonym identification: 87-99%
• readers can still check correctness of sentences when they are presented to them
• shortest path between any 2 genes is on average 4 steps only
Nature Genetics, Vol. 36(7), July 2004
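The "steps" between two genes can be read as path length in a graph whose nodes are genes and whose edges are co-mentions. The breadth-first search below is a generic sketch on a made-up graph with placeholder node names; it is not iHOP's implementation or data.

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of steps, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical network: placeholder names, not real interactions.
graph = {
    "geneA": ["geneB"],
    "geneB": ["geneA", "geneC"],
    "geneC": ["geneB", "geneD"],
    "geneD": ["geneC"],
    "geneE": [],
}
steps = shortest_path_length(graph, "geneA", "geneD")
```

Averaging this length over all reachable gene pairs would give the kind of figure quoted in the last bullet.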