Aiding Biomedical Researchers
with Tools to Assist Discovery
Neil R. Smalheiser
May 18, 2006
Don Swanson: Undiscovered Public Knowledge
"A affects B" and, separately, "B affects C"
Does A affect C?
The pieces are all public, but need to be put together to see a pattern
One Node (Open) Search
Formulate a problem (literature A)
Find a different literature C containing complementary information
Focus on implicit links between A and C
But… most scientists already have more hypotheses and leads than they can handle!
The Two Node (Closed) Search
Link between A and C is either known (often newly discovered) or hypothesized
Examine title terms B in common between A and C as possibly pointing to meaningful links
A and C don’t have to be disjoint!
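The title-term comparison described above can be sketched in a few lines. This is a minimal illustration only, not the Arrowsmith implementation: the function names, the tiny stoplist, and the toy titles are all invented for the example, and terms are limited to single words.

```python
import re

# Hypothetical, tiny stoplist for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "with", "for", "on", "to"}

def title_terms(titles):
    """Return the set of lowercased single-word terms found in a list of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def b_terms(a_titles, c_titles):
    """Terms that occur in at least one A title and at least one C title."""
    return sorted(title_terms(a_titles) & title_terms(c_titles))

# Toy titles made up for this sketch (not real PubMed records).
a_titles = ["Magnesium and vascular reactivity",
            "Magnesium deficiency in platelets"]
c_titles = ["Platelets and serotonin in migraine",
            "Vascular reactivity in migraine patients"]

print(b_terms(a_titles, c_titles))  # ['platelets', 'reactivity', 'vascular']
```

A set intersection does not require A and C to be disjoint: an article that belongs to both literatures simply contributes its title terms to both sets.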
The Arrowsmith Project
Human Brain Project, NLM, NIMH
Public web interfaces for one and two node searches
Develop the system further in collaboration with neuroscience field testers
http://arrowsmith.psych.uic.edu
Lessons from Field Testers
Used Arrowsmith two node search for many daily information needs:
Finding, assessing or prioritizing hypotheses
Items studied in common to two literatures
Browsing unfamiliar lit C for the subset that is likely to be most relevant to familiar lit A
Arrowsmith as an extension of PubMed searches
Lessons for the “Back End”
Two node searches need to be fast (seconds, not minutes)
B-list needs to be assessed quickly (seconds or minutes, not hours)
No need to be comprehensive
No need to find only “novel” links
Filtering and Ranking B-terms
Features permitting users to filter and rank B-terms:
Semantic categories
Frequency
Recency
MeSH
Characteristic-ness
Coherence
Stoplist
A Quantitative Model for Filtering and Ranking B-terms
Even though each search is different, and each person has their own idea of “relevance”, we can identify features that are associated with chosen B-terms
Chose 5 gold standards, with user-chosen positive and negative B-terms
Combined all 7 features into a single logistic regression model (optimal weighting of each feature; one score for each B-term; the score varies for each two node search)
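A logistic regression over the seven features can be sketched as follows. The feature names, weights, and intercept here are hypothetical placeholders chosen for illustration; the slides do not give the fitted values, and the score reported in the talk may be scaled differently from the raw probability shown.

```python
import math

# Hypothetical weights, one per feature; NOT the fitted values from the talk.
WEIGHTS = {
    "semantic_category": 0.8,
    "frequency": 0.5,
    "recency": 0.4,
    "mesh": 0.6,
    "characteristic": 0.9,
    "coherence": 1.5,
    "stoplist": -1.0,
}
BIAS = -2.0  # hypothetical intercept

def relevance_probability(features):
    """Logistic model: sigmoid of the weighted feature sum plus the intercept."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_b_terms(candidates):
    """Sort B-terms by decreasing modelled probability of relevance."""
    return sorted(candidates,
                  key=lambda term: relevance_probability(candidates[term]),
                  reverse=True)

# Toy feature vectors for two made-up B-terms.
candidates = {
    "term_x": {"coherence": 1.0, "characteristic": 1.0, "mesh": 1.0},
    "term_y": {"frequency": 1.0, "stoplist": 1.0},
}
print(rank_b_terms(candidates))  # ['term_x', 'term_y']
```

Because the feature values are recomputed for every A and C pair, the same B-term can receive a different score in each two node search.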
Gold standard searches (for each ID: A-literature query, C-literature query, raw B-terms, relevant B-terms sought)

ID 1
A-literature query: retinal detachment[ti] (n = 5122)
C-literature query: aortic aneurysm[ti] (n = 5687)
Raw B-terms: n = 2294
Relevant B-terms sought:
a) diseases or syndromes in which both features have been described (n = 30)
b) surgical procedures used for diagnosis or treatment of both (n = 26)

ID 2
A-literature query: mglur5[ti] OR (metabotropic glutamate receptor[ti] OR metabotropic glutamate receptors[ti]) (n = 2032)
C-literature query: Lewy body[ti] OR Lewy bodies[ti] (n = 1141)
Raw B-terms: n = 820
Relevant B-terms sought:
a) signaling molecules that directly or indirectly modulate or are modulated by mGluR5 and that either modulate Lewy bodies or are altered in diseases that have Lewy bodies (n = 19)
b) specific brain regions studied in both (n = 42)

ID 3
A-literature query: "magnesium"[MeSH Terms] AND magnesium[ti] AND ("1900"[PDAT] : "1987/12/31"[PDAT]) (n = 6238)
C-literature query: ("migraine disorders"[MeSH] AND migraine[ti]) AND ("1900"[PDAT] : "1987/12/31"[PDAT]) (n = 3205)
Raw B-terms: n = 1879
Relevant B-terms sought: terms described as relevant in the JASIST paper (ref. 23, in Appendix) excluding two judged too general to be useful (reactivity and spreading) (n = 41)

ID 4
A-literature query: beta-amyloid precursor protein[ti] OR amyloid precursor protein[ti] OR APP[ti] AND ("amyloid"[MeSH Terms] OR amyloid[Text Word]) (n = 2118)
C-literature query: reelin[All Fields] (n = 493)
Raw B-terms: n = 1003
Relevant B-terms sought: genes or proteins shared in Reelin and APP (amyloid precursor protein) signal transduction pathways (n = 54)

ID 5
A-literature query: ("nitric oxide"[MeSH Terms] OR nitric oxide[ti]) AND (("mitochondria"[MeSH Terms] OR mitochondria[ti]) OR mitochondrial[ti]) (n = 786)
C-literature query: (psd[ti] OR psd93[ti] OR psd95[ti] OR psds[ti]) OR "postsynaptic density"[ti] OR "postsynaptic densities"[ti] (n = 545)
Raw B-terms: n = 584
Relevant B-terms sought: physiological or pathological responses that link the action of nitric oxide on mitochondria and the normal function of post-synaptic densities (n = 51)
Some Findings of the Model
Coherence was most important in identifying relevant B-terms.
Characteristic value, semantic category mapping, frequency and recency all contributed significantly as well.
> 5% of the marked relevant B-terms in the gold standard searches were terms found on the 1400 word stoplist (e.g., Down Syndrome)
[Figure: number of B-terms by B-term score (-0.2 to 0.2), retinal detachment vs aortic aneurysm; predicted non-relevant and predicted relevant]
[Figure: predicted precision and predicted recall (both 0 to 1), retinal detachment vs aortic aneurysm]
[Figure: number of B-terms by B-term score (-0.2 to 0.2), four panels: two randomly selected literatures; mesothelioma vs machiavellianism (predicted non-relevant and predicted relevant); retinal detachment vs aortic aneurysm; mesothelioma/etiology vs mesothelioma/physiology]
Implications
We can now rank all B-terms rigorously and automatically, in order of the probability that they will be found relevant by SOME user
We can now predict the NUMBER of relevant B-terms in any given search
Can apply to B-terms arising within abstracts
We now have a global measure of OVERALL implicit information linking two (topical, disjoint) literatures
Can apply to one node searches too!
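One plausible way to turn per-term probabilities into a predicted count, and into predicted precision and recall at a rank cutoff, is to sum the probabilities. This is an assumption made for illustration; the slides do not state how the predictions were computed, and the function names and toy probabilities below are invented.

```python
def expected_relevant(probabilities):
    """Expected number of relevant B-terms: the sum of per-term probabilities."""
    return sum(probabilities)

def predicted_precision_recall(probabilities, k):
    """Predicted precision and recall if the user reads only the top k B-terms."""
    ranked = sorted(probabilities, reverse=True)
    expected_in_top_k = sum(ranked[:k])
    expected_total = sum(ranked)
    precision = expected_in_top_k / k if k else 0.0
    recall = expected_in_top_k / expected_total if expected_total else 0.0
    return precision, recall

# Toy probabilities for four B-terms (made up for this sketch).
probs = [0.75, 0.5, 0.5, 0.25]
print(expected_relevant(probs))              # 2.0
print(predicted_precision_recall(probs, 2))  # (0.625, 0.625)
```

Under this assumption the summed probability also serves as a single number describing how much implicit information links the two literatures: a pair with few high-probability B-terms yields a small expected count.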
Conclusion
The two node search can now be conducted and analyzed in a matter of minutes, not hours or days
Can be utilized by the general scientific public for a variety of information needs, including but NOT restricted to searching for and assessing hypotheses
Thanks to…
Vetle Torvik, Don Swanson, Wei Zhou, Maryann Martone & Guy Perkins, Ramin Homayouni, Bob Bilder & Don Kalar