Re-Conceptualizing Literature-Based Discovery Neil R. Smalheiser March 29, 2008


What is LBD? A strategy for uncovering novel hypotheses

• advocated by Don Swanson

• Magnesium-migraine, Fish oil-Raynaud’s

• The key idea is: putting together explicit assertions from different papers to form new implicit assertions

• Regardless of how this is done, how the implicit assertions are assessed, or whether the implicit assertions are correct!

What is LBD? A routine way of life for scientists

greatly under-recognized! Not just background reading, not just identifying anomalies or “critical incidents” that appear (explicitly) in a paper

Since 1996: 8 papers with Swanson, 40 without (i.e. non-one node search), 24 are biological (i.e. non-informatics modeling): 9/24 = 3/8 > 1/3

• Proteins in unexpected locations (Molec. Biol. Cell, 1996)

• Expression of reelin in the blood (PNAS, 2000)

• Reelin and schizophrenia (PNAS, 2000)

• Fluoxetine and neurogenesis (Eur. J. Pharmacol., 2001)

• RNAi and memory (Trends in Neurosci., 2001)

• Bath toys (New Engl. J. Med., 2003)

• Dicer and calpain (J. Neurochem., 2005)

• Exosomal transfer of proteins & RNAs at synapses (Biol Direct, 2007)

• microRNA machinery and regulation by phosphorylation (BBA, 2008)

What is LBD? A body of research articles, software and websites

• mostly by information scientists and computer scientists

• Mostly concerned with “open discovery” or “one node searches”, begin with a set of articles A that represents a problem

• Mostly use “B-terms” present in A to expand the search, find disparate lits Ci that share B-terms with A

• Try to find the Ci that is disparate yet “most similar” to A

What is LBD? Other researchers employ implicit information too

• Bioinformatics

– gene-gene interactions

– protein-protein interactions

• web search

• author disambiguation

• text mining

Yet these are not viewed as examples of LBD for some reason!

Has the LBD field stagnated and not fulfilled its promise?

• Kostoff critique(s)

– “what is a discovery” vs. an “innovation”

– argues against frequency based ranking

– Uses very high recall, hundreds of “discoveries” claimed per question

• “Swanson’s legacy”: references to Swanson ended 2001!

• Bork review: references to Swanson ended 1996!

• Few gold standards are available (Mg, fish oil worn out)

• Combinatorial explosion in the A – B – C search method

• Impossible standards for what counts as a LBD prediction (never considered, never tested, must shatter a paradigm but must be proven experimentally??)

• Excluding active approaches other than “one node search” as being LBD

Well, what DO we know about progress in LBD?

• The two-node search

• http://arrowsmith.psych.uic.edu

• Begin with two lits A and C that represent a known finding or a hypothesis (estrogen-AD)

• look for meaningful links

• (whether or not A and C are disparate)

• We use B-terms extracted from titles

• Could use abstracts, MeSH, triples…
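The title-based step above can be sketched roughly as follows. This is only an illustration of the idea (terms shared by the titles of A and C), not the Arrowsmith implementation: the stop list, the single-word tokenization, the function names `title_terms` and `b_terms`, and the titles themselves are all invented for the example.

```python
import re

# Hypothetical, tiny stop list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "with", "to", "by", "is"}

def title_terms(title):
    """Return the set of lowercase single-word terms in a title, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS}

def b_terms(titles_a, titles_c):
    """Return, sorted, the terms found in at least one A title and one C title."""
    terms_a = set().union(*(title_terms(t) for t in titles_a))
    terms_c = set().union(*(title_terms(t) for t in titles_c))
    return sorted(terms_a & terms_c)

# Invented toy titles, not real articles.
titles_a = [
    "Estrogen and calcium signaling in neurons",
    "Estrogen effects on lipid transport",
]
titles_c = [
    "Calcium dysregulation in Alzheimer disease",
    "Lipid transport and amyloid in Alzheimer disease",
]

print(b_terms(titles_a, titles_c))
```

Swapping titles for abstracts, MeSH headings or triples would only change what `title_terms` extracts; the intersection step stays the same.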


Modeling the Two-Node Search-1

• Field testers, free-form use of the tool

• Chose 6 two-node searches as gold standards: not too big or small, disparate, topically coherent, clean questions

• E.g. for A = retinal detachment, C = aortic aneurysm, a) find diseases in which both features appear [not necessarily in same person] or b) find surgical procedures that have been applied to both conditions.

• Manually marked relevant B-terms for a given query (sometimes several queries for the same two node search)

• Details in Bioinformatics (2007) paper


• Used 8 complementary features to score each B-term (e.g. recency, frequency, semantic categories)

• created a single combined and weighted score for each B-term

• Used logistic regression model to optimally give weights to each feature so as to separate marked relevant B-terms from all others (mixed set)
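A minimal sketch of this kind of weighting, assuming only two toy features per B-term instead of the 8 used in the study. The feature values, labels, learning rate and function names are all invented for illustration; the plain gradient-descent fit below stands in for whatever fitting procedure was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit feature weights and a bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def score(x, w, b):
    """Combined, weighted score: estimated probability that a B-term is relevant."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy data: each row is (recency, frequency) scaled to [0, 1]; values are invented.
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = manually marked relevant, 0 = all others

w, b = fit_logistic(X, y)
print([round(score(x, w, b), 2) for x in X])
```

The fitted weights play the role of the per-feature weights in the combined score; with real data the "all others" class is a mixed set, so the separation would be far less clean than in this toy case.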


[Figure: panels I, II and III; x-axis: B-term score; y-axis: Number of B-terms]

Two End-Points of this Research

• For any two-node search, we can now rank the list of B-terms in order of estimated probability that they will be marked as relevant (meaningful) by SOME user for SOME query.

• For any pair of lits A and C, we can now estimate the OVERALL shared implicit information between A and C (= % of B-terms that are predicted to be relevant)
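Both end-points can be sketched in a few lines, given a probability for each B-term. The terms and probabilities below are invented, and the 0.5 cut-off for "predicted to be relevant" is an assumption made for the example, not a value taken from the study.

```python
def rank_b_terms(scored):
    """Sort (term, probability) pairs from most to least probably relevant."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def percent_predicted_relevant(scored, threshold=0.5):
    """Percentage of B-terms whose estimated probability reaches the threshold."""
    if not scored:
        return 0.0
    n_relevant = sum(1 for _, p in scored if p >= threshold)
    return 100.0 * n_relevant / len(scored)

# Invented toy scores for the B-terms of one A-C pair.
scored = [("calcium", 0.82), ("transport", 0.35), ("lipid", 0.64), ("effects", 0.05)]

print(rank_b_terms(scored))
print(percent_predicted_relevant(scored))
```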

Relevance to the One Node Search

We can re-conceptualize the one-node search as a series of two-node searches:

• Choose A, then choose category C

• Divide category C densely into many small coherent Ci

• For each Ci, score multi-dimensional features, including, but not limited to, features that relate A to Ci (e.g. number of B-terms in common or % predicted relevant B-terms)

• Rank the Ci to identify the most promising lits (which are presumed to point to novel hyps or implicit information helpful when applied to A)

A is evaluated pairwise against each of C = C1, C2, C3, C4, …

might involve B-terms, might not!

e.g. A = Huntington Disease; C = lifestyle factors, autophagy, or therapeutic agents
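The loop over the Ci can be sketched as below. This is an illustration under stated assumptions: each literature is reduced to a set of terms, the per-term relevance probabilities are given, the 0.5 cut-off is arbitrary, and sorting by % predicted relevant (then by number of shared B-terms) is just one possible ranking rule. All names and values are hypothetical.

```python
def two_node_features(terms_a, terms_ci, relevance_prob, threshold=0.5):
    """Features relating A to one Ci: shared B-terms and % predicted relevant."""
    shared = terms_a & terms_ci
    n_relevant = sum(1 for t in shared if relevance_prob.get(t, 0.0) >= threshold)
    pct = 100.0 * n_relevant / len(shared) if shared else 0.0
    return {"n_shared": len(shared), "pct_relevant": pct}

def one_node_search(terms_a, candidates, relevance_prob):
    """Run one two-node search per Ci and rank the Ci by the resulting features."""
    rows = [
        (name, two_node_features(terms_a, terms, relevance_prob))
        for name, terms in candidates.items()
    ]
    return sorted(
        rows,
        key=lambda row: (row[1]["pct_relevant"], row[1]["n_shared"]),
        reverse=True,
    )

# Invented toy inputs: placeholder terms rather than real vocabulary.
terms_a = {"t1", "t2", "t3", "t4"}
candidates = {"C1": {"t1", "t2", "x"}, "C2": {"t3", "y"}, "C3": {"z"}}
relevance_prob = {"t1": 0.9, "t2": 0.2, "t3": 0.7, "t4": 0.1}

for name, features in one_node_search(terms_a, candidates, relevance_prob):
    print(name, features)
```

Because each Ci is compared with A on its own, the cost grows with the number of Ci rather than with the number of A – B – C paths. Features that do not involve B-terms at all could be added to the dictionary returned by `two_node_features`.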

“Interestingness” Measures

• Field of data mining.

• This allows us to encode real-life priorities and strategies of working scientists:

• Existing one node search looks for novelty, relevance, non-trivial, likelihood of being true …. [get low hanging fruit]

• What about actionability, feasibility of follow-up, surprisingness, cross-discipline, presence of high experimental support, generalizability to other problems, or high potential impact?

• A candidate Ci could be interesting because it is recently discovered and rapidly growing (e.g. microRNAs), well characterized, [for a disease] has an animal model, [for a protein] is connected to many other proteins, [for a drug] has FDA approval.

• This approach not only re-conceptualizes the one node search (e.g., no combinatorial explosion) but also generalizes the ranking methods.
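One simple way to combine such measures with a B-term measure is a weighted sum, sketched below. The feature names, weights and candidate values are invented for illustration; nothing here is taken from an existing tool, and a weighted sum is only one of many possible combination rules.

```python
def interestingness(candidate, weights):
    """Weighted sum of a candidate's features; missing features count as 0."""
    return sum(weight * candidate.get(name, 0.0) for name, weight in weights.items())

def rank_candidates(candidates, weights):
    """Return candidate names ordered from most to least interesting."""
    return sorted(
        candidates,
        key=lambda name: interestingness(candidates[name], weights),
        reverse=True,
    )

# Hypothetical weights encoding one scientist's priorities.
weights = {
    "b_term_score": 0.5,      # B-term based measure relating A to Ci
    "rapidly_growing": 0.2,   # recently discovered and rapidly growing lit
    "has_animal_model": 0.2,  # [for a disease]
    "fda_approved": 0.1,      # [for a drug]
}

# Invented toy candidates.
candidates = {
    "C1": {"b_term_score": 0.6, "rapidly_growing": 1.0},
    "C2": {"b_term_score": 0.8},
    "C3": {"b_term_score": 0.3, "has_animal_model": 1.0, "fda_approved": 1.0},
}

print(rank_candidates(candidates, weights))
```

Note that C2 has the highest B-term score but ranks last once the other priorities are counted, which is the kind of re-ordering these measures are meant to allow.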

Gold Standards for One-Node Searches

• Also, we can now envision preparing a series of gold standard searches, even automatically (cf. TREC 2006, 2007).

• Use implicit assertions to reconstruct explicit knowledge.

• Use review articles;

• lists (e.g. in virus study, gold standard was a list of viruses that were thought to be at risk of being exploited for biological warfare).

• time slices;

• Avoids the paradox that one node searches must predict things that have no experimental support!
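Once a gold standard list exists (from a review article, a curated list, or a later time slice), a ranked one-node search output can be scored against it. The sketch below uses precision and recall at a cut-off k, which is a choice made for this example rather than a measure named in the talk; the item names are placeholders.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked items against a gold standard set."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Invented toy data: a ranked list of candidate Ci and a gold standard set.
ranked = ["v1", "v7", "v3", "v9", "v2"]
gold = {"v1", "v2", "v3"}

print(precision_recall_at_k(ranked, gold, k=3))
print(precision_recall_at_k(ranked, gold, k=5))
```

For a time-slice gold standard, `ranked` would be computed from the literature before a cut-off date and `gold` from what was explicitly reported afterwards.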

Conclusions

• LBD is (can be, will be) alive and well!

• Need to incorporate the types of real-life priorities and strategies of working scientists

• Re-conceptualize the one node search as a series of two-node searches

• Use “interestingness” measures to supplement B-term measures.

Journal of Biomedical Discovery and Collaboration

• Unique multi-disciplinary audience

– People who engage in scientific discovery and collaboration

– People who make tools that enhance scientific discovery and collaboration

– People who study scientific discovery and collaboration

• Hosted by Biomed Central

• Fully peer-reviewed

• RAPID review (<3 weeks is routine)

• Open-access, indexed in PubMed Central et al

• Readership goes up 10-100-fold

• Impact goes up too…

• Article fee reduced or zeroed depending on institution

Acknowledgements

• Don Swanson

• Vetle Torvik

• Wei Zhou (Clement Yu)

• Marc Weeber

Ruminations

• Should LBD analyses be user-friendly? Popular??

• Don’t they overlook true divergent discoveries?

• Should LBD be run automatically as a program in the background, with alerts of possible discoveries?

• Does LBD bypass, or reinforce, good old fashioned hypothesis driven science?
