Aiding Biomedical Researchers with Tools to Assist Discovery

Neil R. Smalheiser
May 18, 2006

Don Swanson: Undiscovered Public Knowledge

“A affects B”; (separately) “B affects C”. Does A affect C?
The pieces are all public, but need to be put together to see a pattern
One node (open) search:
  formulate a problem (literature A)
  find a different literature C containing complementary information
  focus on implicit links between A and C
But… most scientists already have more hypotheses and leads than they can handle!

The Two Node (Closed) Search

Link between A and C is either known (often newly discovered) or hypothesized

Examine title terms B in common between A and C as possibly pointing to meaningful links

A and C don’t have to be disjoint!
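As a rough illustration of the two node idea, the sketch below builds a B-list as the set of title words shared by two literatures. It is a simplification under stated assumptions: the titles are invented for the example, terms are single words only, and `STOPWORDS`, `title_terms`, and `b_list` are hypothetical names, not part of Arrowsmith.

```python
import re

# Hypothetical mini stoplist for the example; not the Arrowsmith stoplist.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "with", "for", "on"}


def title_terms(titles):
    """Return the set of lowercase single-word terms found in a list of titles."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms


def b_list(a_titles, c_titles):
    """Terms that occur in at least one A title and at least one C title."""
    return sorted(title_terms(a_titles) & title_terms(c_titles))


# Invented titles standing in for the A and C literatures.
a_titles = ["Magnesium and vascular tone",
            "Dietary magnesium in platelet aggregation"]
c_titles = ["Platelet aggregation in migraine",
            "Vascular reactivity during migraine attacks"]

print(b_list(a_titles, c_titles))  # ['aggregation', 'platelet', 'vascular']
```

Because the B-list is a plain set intersection, nothing here requires A and C to be disjoint; overlapping papers simply contribute terms to both sides.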

The Arrowsmith Project

Human Brain Project, NLM, NIMH
Public web interfaces for one and two node searches
Develop the system further in collaboration with neuroscience field testers
http://arrowsmith.psych.uic.edu

Lessons from Field Testers

Used Arrowsmith two node search for many daily information needs:
  finding, assessing or prioritizing hypotheses
  Items studied in common to two literatures
  Browsing unfamiliar lit C for the subset that is likely to be most relevant to familiar lit A
Arrowsmith as an extension of PubMed searches

Lessons for the “Back End”

Two node searches need to be fast (seconds, not minutes)
B-list needs to be assessed quickly (seconds or minutes, not hours)
No need to be comprehensive
No need to find only “novel” links

Filtering and Ranking B-terms

Features permitting users to filter and rank B-terms:
  Semantic categories
  Frequency
  Recency
  MeSH
  Characteristic-ness
  Coherence
  Stoplist
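A minimal sketch of how user-facing filters of this kind might be applied to a B-list. Everything here is an assumption made for illustration: the `BTerm` record, its field names, the example values, and the choice to rank by the smaller of the two literature counts are hypothetical and do not describe the actual Arrowsmith implementation.

```python
from dataclasses import dataclass


@dataclass
class BTerm:
    term: str
    category: str      # semantic category label (hypothetical)
    a_count: int       # number of A titles containing the term
    c_count: int       # number of C titles containing the term
    first_year: int    # earliest year the term appears (assumed recency measure)
    on_stoplist: bool


def filter_and_rank(terms, categories=None, min_count=1, since_year=None):
    """Drop stoplisted terms, apply optional filters, rank by shared frequency."""
    kept = []
    for t in terms:
        if t.on_stoplist:
            continue
        if categories is not None and t.category not in categories:
            continue
        if min(t.a_count, t.c_count) < min_count:
            continue
        if since_year is not None and t.first_year < since_year:
            continue
        kept.append(t)
    # Rank by the smaller of the two counts, most frequent first.
    return sorted(kept, key=lambda t: min(t.a_count, t.c_count), reverse=True)


# Invented example values.
terms = [
    BTerm("calcium", "chemical", 40, 12, 1975, False),
    BTerm("study", "concept", 300, 250, 1960, True),
    BTerm("hypertension", "disorder", 8, 20, 1980, False),
    BTerm("serotonin", "chemical", 3, 30, 1982, False),
]

chemicals = filter_and_rank(terms, categories={"chemical"}, min_count=2)
print([t.term for t in chemicals])  # ['calcium', 'serotonin']
```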

A quantitative model for filtering and ranking B-terms

Even though each search is different, and each person has their own idea of “relevance”, we can identify features that are associated with chosen B-terms
Chose 5 gold standards, with user-chosen positive and negative B-terms
Combined all 7 features into a single logistic regression model (optimal weighting of each feature; one score for each B-term; the score varies for each two node search)
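The weighted combination of features can be sketched as a standard logistic regression score. The weights and intercept below are placeholders chosen only so the example runs; they are not the fitted values from the model described here, and the feature names and their numeric encodings are assumptions.

```python
import math

# Seven feature names mirroring the list of filtering features (encodings assumed).
FEATURES = ["semantic_category", "frequency", "recency", "mesh",
            "characteristic", "coherence", "not_on_stoplist"]

# Placeholder weights, NOT the fitted values from the gold standard searches.
WEIGHTS = {
    "semantic_category": 0.8,
    "frequency": 0.5,
    "recency": 0.4,
    "mesh": 0.3,
    "characteristic": 0.7,
    "coherence": 1.5,
    "not_on_stoplist": 0.2,
}
INTERCEPT = -2.0  # placeholder


def relevance_probability(x):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_i w_i * x_i)))."""
    z = INTERCEPT + sum(WEIGHTS[name] * x[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


# Two invented B-terms that differ only in coherence.
base = {name: 0.5 for name in FEATURES}
coherent = dict(base, coherence=1.0)

ranked = sorted([("term_x", base), ("term_y", coherent)],
                key=lambda item: relevance_probability(item[1]),
                reverse=True)
print([name for name, _ in ranked])  # ['term_y', 'term_x']
```

Each B-term gets a single score, and because the feature values depend on the A and C literatures, the same term can score differently in different two node searches.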

ID 1
A-literature query: retinal detachment[ti] (n = 5122)
C-literature query: aortic aneurysm[ti] (n = 5687)
Raw B-terms: n = 2294
Relevant B-terms sought:
  a) diseases or syndromes in which both features have been described (n = 30)
  b) surgical procedures used for diagnosis or treatment of both (n = 26)

ID 2
A-literature query: mglur5[ti] OR (metabotropic glutamate receptor[ti] OR metabotropic glutamate receptors[ti]) (n = 2032)
C-literature query: Lewy body[ti] OR Lewy bodies[ti] (n = 1141)
Raw B-terms: n = 820
Relevant B-terms sought:
  a) signaling molecules that directly or indirectly modulate or are modulated by mGluR5 and that either modulate Lewy bodies or are altered in diseases that have Lewy bodies (n = 19)
  b) specific brain regions studied in both (n = 42)

ID 3
A-literature query: "magnesium"[MeSH Terms] AND magnesium[ti] AND ("1900"[PDAT] : "1987/12/31"[PDAT]) (n = 6238)
C-literature query: ("migraine disorders"[MeSH] AND migraine[ti]) AND ("1900"[PDAT] : "1987/12/31"[PDAT]) (n = 3205)
Raw B-terms: n = 1879
Relevant B-terms sought: terms described as relevant in the JASIST paper (ref. 23, in Appendix), excluding two judged too general to be useful (reactivity and spreading) (n = 41)

ID 4
A-literature query: beta-amyloid precursor protein[ti] OR amyloid precursor protein[ti] OR APP[ti] AND ("amyloid"[MeSH Terms] OR amyloid[Text Word]) (n = 2118)
C-literature query: reelin[All Fields] (n = 493)
Raw B-terms: n = 1003
Relevant B-terms sought: genes or proteins shared in Reelin and APP (amyloid precursor protein) signal transduction pathways (n = 54)

ID 5
A-literature query: ("nitric oxide"[MeSH Terms] OR nitric oxide[ti]) AND (("mitochondria"[MeSH Terms] OR mitochondria[ti]) OR mitochondrial[ti]) (n = 786)
C-literature query: (psd[ti] OR psd93[ti] OR psd95[ti] OR psds[ti]) OR "postsynaptic density"[ti] OR "postsynaptic densities"[ti] (n = 545)
Raw B-terms: n = 584
Relevant B-terms sought: physiological or pathological responses that link the action of nitric oxide on mitochondria and the normal function of post-synaptic densities (n = 51)
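The A and C queries above are written in PubMed search syntax. As a hedged sketch of how such a query could be turned into a count request, the code below builds a URL for the NCBI E-utilities ESearch endpoint. It only constructs the URL and does not contact the server; the helper name `esearch_count_url` is hypothetical, and counts returned today would differ from the 2006 figures listed above.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(query):
    """Build an ESearch URL asking PubMed only for the number of matching records."""
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


url = esearch_count_url("retinal detachment[ti]")
print(url)
```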

Some Findings of the Model

Coherence was most important in identifying relevant B-terms.
Characteristic value, semantic category mapping, frequency and recency all contributed significantly as well.
More than 5% of the B-terms marked relevant in the gold standard searches were terms found on the 1400-word stoplist (e.g., Down Syndrome)

[Figure: histograms of the number of B-terms by B-term score (score axis from -0.2 to 0.2), with a legend distinguishing predicted non-relevant from predicted relevant B-terms. Surviving panel titles: retinal detachment vs aortic aneurysm; two randomly selected literatures; mesothelioma vs machiavellianism; mesothelioma/etiology vs mesothelioma/physiology.]

[Figure: predicted recall and predicted precision, both axes from 0 to 1.]

Implications

We can now rank all B-terms rigorously and automatically, in order of the probability that they will be found relevant by SOME user
We can now predict the NUMBER of relevant B-terms in any given search
Can apply to B-terms arising within abstracts
We now have a global measure of OVERALL implicit information linking two (topical, disjoint) literatures
Can apply to one node searches too!
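One plausible way to get from per-term probabilities to a predicted number of relevant B-terms is to sum the probabilities, and to derive predicted precision and recall at a cutoff the same way. The talk does not state how these quantities were computed, so the sketch below is an assumption that treats the model's probabilities as well calibrated. The function names and example values are hypothetical.

```python
def expected_relevant(probabilities):
    """Expected number of relevant B-terms, assuming calibrated probabilities."""
    return sum(probabilities)


def predicted_precision_recall(probabilities, k):
    """Predicted precision and recall if the user reads only the top-k ranked terms."""
    ranked = sorted(probabilities, reverse=True)
    top = sum(ranked[:k])      # expected relevant terms within the top k
    total = sum(ranked)        # expected relevant terms overall
    precision = top / k if k else 0.0
    recall = top / total if total else 0.0
    return precision, recall


# Invented probabilities for four B-terms.
probs = [0.9, 0.6, 0.3, 0.2]
print(expected_relevant(probs))
print(predicted_precision_recall(probs, 2))
```

Under the same assumption, the summed probability also gives a single number for the overall implicit information linking two literatures, which is how a pair of unrelated literatures could be told apart from a closely linked pair.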

Conclusion

The two node search can now be conducted and analyzed in a matter of minutes, not hours or days
Can be utilized by the general scientific public for a variety of information needs, including but NOT restricted to searching for and assessing hypotheses

Thanks to…

Vetle Torvik
Don Swanson
Wei Zhou
Maryann Martone & Guy Perkins
Ramin Homayouni
Bob Bilder & Don Kalar
