Re-Conceptualizing Literature-Based Discovery
Neil R. Smalheiser
March 29, 2008
What is LBD? A strategy for uncovering novel hypotheses
• advocated by Don Swanson
• Magnesium-migraine, Fish oil-Raynaud’s
• The key idea is: putting together explicit assertions from different papers to form new implicit assertions
• This holds regardless of how it is done, how the implicit assertions are assessed, or whether the implicit assertions are correct!
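The key idea can be sketched in a few lines of Python. This is a toy illustration only: the assertion pairs are hand-written, the intermediate terms are illustrative placeholders that do not come from the slides, and the function name `implicit_assertions` is hypothetical.

```python
from collections import defaultdict

# Hypothetical explicit assertions, each a (subject, object) pair as if
# stated in one paper. The intermediate terms are placeholders.
explicit = {
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's"),
    ("platelet aggregation", "Raynaud's"),
    ("magnesium", "vascular tone"),
    ("vascular tone", "migraine"),
}

def implicit_assertions(assertions):
    """Chain A-B and B-C into candidate A-C links that no single pair states."""
    by_subject = defaultdict(set)
    for subj, obj in assertions:
        by_subject[subj].add(obj)
    found = defaultdict(set)
    for a, b_terms in by_subject.items():
        for b in b_terms:
            # .get avoids adding new keys to the dict while iterating over it
            for c in by_subject.get(b, ()):
                if c != a and (a, c) not in assertions:
                    found[(a, c)].add(b)
    return dict(found)

links = implicit_assertions(explicit)
for (a, c), bs in sorted(links.items()):
    print(f"{a} -> {c} via {sorted(bs)}")
```

Nothing here judges whether an implicit assertion is correct; it only assembles the candidates, which is the point made on the slide.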
What is LBD? A routine way of life for scientists
greatly under-recognized! Not just background reading, not just identifying anomalies or “critical incidents” that appear (explicitly) in a paper
Since 1996: 8 papers with Swanson, 40 without (i.e. non-one node search), 24 are biological (i.e. non-informatics modeling): 9/24 = 3/8 > 1/3
• Proteins in unexpected locations (Molec. Biol. Cell, 1996)
• Expression of reelin in the blood (PNAS, 2000)
• Reelin and schizophrenia (PNAS, 2000)
• Fluoxetine and neurogenesis (Eur. J. Pharmacol., 2001)
• RNAi and memory (Trends in Neurosci., 2001)
• Bath toys (New Engl. J. Med., 2003)
• Dicer and calpain (J. Neurochem., 2005)
• Exosomal transfer of proteins & RNAs at synapses (Biol Direct, 2007)
• microRNA machinery and regulation by phosphorylation (BBA, 2008)
What is LBD? A body of research articles, software and websites
• mostly by information scientists and computer scientists
• Mostly concerned with “open discovery” or “one node searches”, which begin with a set of articles A that represents a problem
• Mostly use “B-terms” present in A to expand the search and find disparate literatures (lits) Ci that share B-terms with A
• Try to find the Ci that is disparate yet “most similar” to A
What is LBD? other researchers employ implicit information too
• Bioinformatics
– gene-gene interactions
– protein-protein interactions
• web search
• author disambiguation
• text mining
Yet these are not viewed as examples of LBD for some reason!
Has the LBD field stagnated and not fulfilled its promise?
• Kostoff critique(s)
– “what is a discovery” vs. an “innovation”
– argues against frequency based ranking
– uses very high recall, hundreds of “discoveries” claimed per question
• “Swanson’s legacy”: references to Swanson ended in 2001!
• Bork review: references to Swanson ended in 1996!
• Few gold standards are available (Mg, fish oil worn out)
• Combinatorial explosion in the A – B – C search method
• Impossible standards for what counts as an LBD prediction (never considered, never tested, must shatter a paradigm but must be proven experimentally??)
• Excluding active approaches other than “one node search” from counting as LBD
Well, what DO we know about progress in LBD?
• The two-node search
• http://arrowsmith.psych.uic.edu
• Begin with two lits A and C that represent a known finding or a hypothesis (estrogen-AD)
• Look for meaningful links (whether or not A and C are disparate)
• We use B-terms extracted from titles
• Could use abstracts, MeSH, triples…
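A minimal sketch of extracting B-terms from titles, assuming single-word terms and a tiny hand-picked stopword list. The titles are made up for illustration, and `title_terms` and `b_terms` are hypothetical helpers; this is not the Arrowsmith implementation.

```python
import re

# Small illustrative stopword list (an assumption, not the tool's list).
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}

def title_terms(titles):
    """Collect lower-cased single-word terms from titles, minus stopwords."""
    terms = set()
    for title in titles:
        for word in re.findall(r"[a-z][a-z0-9-]*", title.lower()):
            if word not in STOPWORDS:
                terms.add(word)
    return terms

def b_terms(titles_a, titles_c):
    """B-terms are the terms that occur in titles of both literatures."""
    return title_terms(titles_a) & title_terms(titles_c)

# Made-up titles standing in for literature A (estrogen) and C (AD).
lit_a = ["Estrogen regulates antioxidant enzymes in neurons",
         "Effects of estrogen on cholinergic signaling"]
lit_c = ["Antioxidant therapy in Alzheimer disease",
         "Cholinergic deficits and memory loss in Alzheimer disease"]

shared = sorted(b_terms(lit_a, lit_c))
print(shared)
```

The same intersection could be taken over abstracts, MeSH headings, or triples by swapping out `title_terms`.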
Modeling the Two-Node Search-1
• Field testers, free-form use of the tool
• Chose 6 two-node searches as gold standards: not too big or small, disparate, topically coherent, clean questions
• E.g. for A = retinal detachment, C = aortic aneurysm, a) find diseases in which both features appear [not necessarily in same person] or b) find surgical procedures that have been applied to both conditions.
• Manually marked relevant B-terms for a given query (sometimes several queries for the same two node search)
• Details in Bioinformatics (2007) paper
Modeling the Two-Node Search-2
• Used 8 complementary features to score each B-term (e.g. recency, frequency, semantic categories)
• Created a single combined and weighted score for each B-term
• Used a logistic regression model to weight each feature optimally, so as to separate the marked relevant B-terms from all others (mixed set)
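The weighting step can be sketched with a small logistic regression written in plain Python. This is an illustration under stated assumptions: it uses two invented features (recency and frequency) rather than the eight on the slide, the training rows are made up, and the fitting procedure (batch gradient descent) is a stand-in for whatever was used in the Bioinformatics (2007) paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def score(weights, x):
    """Combined score: estimated probability that a B-term is relevant."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Toy rows: (recency, frequency), both scaled to [0, 1]; values are invented.
features = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.2, 0.3), (0.1, 0.5), (0.3, 0.1)]
relevant = [1, 1, 1, 0, 0, 0]   # 1 = manually marked relevant

w = fit_logistic(features, relevant)
print([round(score(w, x), 2) for x in features])
```

The fitted weights turn the separate features into one combined score per B-term, which is what the ranking uses.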
Modeling the Two-Node Search-3
[Figure: three panels (I, II, III) plotting number of B-terms (y-axis) against B-term score (x-axis)]
Two End-Points of this Research
• For any two-node search, we can now rank the list of B-terms in order of estimated probability that they will be marked as relevant (meaningful) by SOME user for SOME query.
• For any pair of lits A and C, we can now estimate the OVERALL shared implicit information between A and C (= % of B-terms that are predicted to be relevant)
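Both end-points reduce to simple operations once each B-term has an estimated probability. The sketch below assumes a cutoff of 0.5 for "predicted to be relevant" (the slides do not state one), and the probabilities are invented.

```python
def rank_b_terms(probabilities):
    """Sort B-terms from most to least likely to be marked relevant."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

def shared_implicit_information(probabilities, threshold=0.5):
    """Percentage of B-terms whose estimated probability reaches the threshold."""
    if not probabilities:
        return 0.0
    predicted = sum(1 for p in probabilities.values() if p >= threshold)
    return 100.0 * predicted / len(probabilities)

# Invented model outputs for one A-C pair (hypothetical values).
estimated = {"calpain": 0.91, "dicer": 0.74, "patients": 0.05,
             "synapse": 0.62, "study": 0.02}

ranked = rank_b_terms(estimated)
overall = shared_implicit_information(estimated)
print(ranked)
print(overall)
```

Here three of the five B-terms clear the assumed threshold, so the overall estimate for this pair is 60%.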
Relevance to the One Node Search
We can re-conceptualize the one-node search as a series of two-node searches:
• Choose A, then choose category C
• Divide category C densely into many small coherent Ci
• For each Ci, score multi-dimensional features, including, but not limited to, features that relate A to Ci (e.g. number of B-terms in common or % predicted relevant B-terms)
• Rank the Ci to identify the most promising lits (which are presumed to point to novel hypotheses or implicit information helpful when applied to A)
A is evaluated pairwise against each of C = C1, C2, C3, C4, … (the evaluation might involve B-terms, or might not!)
e.g. A = Huntington Disease; C = lifestyle factors, autophagy, or therapeutic agents
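The re-conceptualized one-node search can be sketched as a loop of two-node comparisons. Everything concrete here is an assumption made for illustration: each literature is reduced to a set of terms, the per-term relevance estimates are invented, the 0.5 cutoff is arbitrary, and the ranking key is one simple choice among many.

```python
def two_node_features(terms_a, terms_ci, relevance):
    """Features relating A to one sub-literature Ci."""
    shared = terms_a & terms_ci
    predicted = [t for t in shared if relevance.get(t, 0.0) >= 0.5]
    pct = 100.0 * len(predicted) / len(shared) if shared else 0.0
    return {"n_b_terms": len(shared), "pct_relevant": pct}

def one_node_search(terms_a, sub_literatures, relevance):
    """Run one two-node search per Ci and rank the Ci by the resulting features."""
    scored = {name: two_node_features(terms_a, terms, relevance)
              for name, terms in sub_literatures.items()}
    return sorted(scored.items(),
                  key=lambda kv: (kv[1]["pct_relevant"], kv[1]["n_b_terms"]),
                  reverse=True)

# Toy inputs (all values hypothetical).
terms_a = {"autophagy", "aggregation", "mitochondria", "patients", "study"}
sub_literatures = {
    "C1": {"autophagy", "aggregation", "patients"},
    "C2": {"patients", "study"},
    "C3": {"mitochondria", "kinase"},
}
relevance = {"autophagy": 0.8, "aggregation": 0.7, "mitochondria": 0.6,
             "patients": 0.1, "study": 0.05}

ranking = one_node_search(terms_a, sub_literatures, relevance)
for name, feats in ranking:
    print(name, feats)
```

Because A is compared with each Ci separately, the work grows linearly with the number of Ci rather than with all A – B – C paths. The toy ranking also shows a weakness of ranking on percentage alone: C3 comes first on the strength of a single shared term.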
“Interestingness” Measures
• A concept from the field of data mining.
• This allows us to encode real-life priorities and strategies of working scientists:
• Existing one node search looks for novelty, relevance, non-triviality, likelihood of being true …. [get low hanging fruit]
• What about actionability, feasibility of follow-up, surprisingness, cross-discipline, presence of high experimental support, generalizability to other problems, or high potential impact?
• A candidate Ci could be interesting because it is recently discovered and rapidly growing (e.g. microRNAs), well characterized, [for a disease] has an animal model, [for a protein] is connected to many other proteins, [for a drug] has FDA approval.
• This approach not only re-conceptualizes the one node search (e.g., no combinatorial explosion) but also generalizes the ranking methods.
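One way to picture this is a weighted combination of a B-term measure with a few interestingness features. The sketch below is hypothetical throughout: the `Candidate` fields, the weights, and the candidate values are invented, and a simple weighted sum is only one possible way to combine such features.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    pct_relevant_b_terms: float  # B-term measure, 0-100
    growth_rate: float           # recent growth of the literature, 0-1
    has_animal_model: bool       # relevant when Ci is a disease
    fda_approved: bool           # relevant when Ci is a drug

# Hypothetical weights; in practice they would reflect a scientist's priorities.
WEIGHTS = {"b_terms": 0.4, "growth": 0.3, "animal_model": 0.15, "fda": 0.15}

def interestingness(c, weights=WEIGHTS):
    """Weighted sum of the B-term measure and the interestingness features."""
    return (weights["b_terms"] * c.pct_relevant_b_terms / 100.0
            + weights["growth"] * c.growth_rate
            + weights["animal_model"] * float(c.has_animal_model)
            + weights["fda"] * float(c.fda_approved))

candidates = [
    Candidate("C1", 60.0, 0.9, False, False),
    Candidate("C2", 80.0, 0.1, False, False),
    Candidate("C3", 50.0, 0.2, True, True),
]

by_b_terms = [c.name for c in sorted(candidates,
                                     key=lambda c: c.pct_relevant_b_terms,
                                     reverse=True)]
by_interest = [c.name for c in sorted(candidates, key=interestingness, reverse=True)]
print(by_b_terms)
print(by_interest)
```

With these toy numbers the two orderings differ: C2 leads on B-terms alone, while C3 leads once the additional features are counted.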
Gold Standards for One-Node Searches
• Also, we can now envision preparing a series of gold standard searches, even automatically (cf. TREC 2006, 2007).
• Use implicit assertions to reconstruct explicit knowledge.
• Use review articles;
• lists (e.g. in virus study, gold standard was a list of viruses that were thought to be at risk of being exploited for biological warfare).
• time slices;
• Avoids the paradox that one node searches must predict things that have no experimental support!
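A time-slice gold standard can be sketched as follows: predictions are made from the literature up to a cutoff year, then checked against links that were first reported explicitly after it. The pair names, years, and the function `time_slice_eval` are all hypothetical.

```python
def time_slice_eval(first_seen, predictions, cutoff):
    """Score predictions against links first reported after the cutoff year."""
    known = {pair for pair, year in first_seen.items() if year <= cutoff}
    future = {pair for pair, year in first_seen.items() if year > cutoff}
    # Only predictions not already explicit at the cutoff count as novel.
    novel = [p for p in predictions if p not in known]
    hits = [p for p in novel if p in future]
    precision = len(hits) / len(novel) if novel else 0.0
    recall = len(hits) / len(future) if future else 0.0
    return precision, recall

# Year in which each A-Ci link first appeared explicitly (invented values).
first_seen = {("A", "C1"): 1995, ("A", "C2"): 2003,
              ("A", "C3"): 2005, ("A", "C4"): 2007}
# Links proposed by a one-node search run on pre-2001 literature.
predictions = [("A", "C1"), ("A", "C2"), ("A", "C5"), ("A", "C4")]

precision, recall = time_slice_eval(first_seen, predictions, cutoff=2000)
print(round(precision, 3), round(recall, 3))
```

Scoring against later explicit findings is what sidesteps the paradox: the gold standard has experimental support, yet it was still unknown at the time of the prediction.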
Conclusions
• LBD is (can be, will be) alive and well!
• Need to incorporate the types of real-life priorities and strategies of working scientists
• Re-conceptualize the one node search as a series of two-node searches
• Use “interestingness” measures to supplement B-term measures.
Journal of Biomedical Discovery and Collaboration
• Unique multi-disciplinary audience
– People who engage in scientific discovery and collaboration
– People who make tools that enhance scientific discovery and collaboration
– People who study scientific discovery and collaboration
• Hosted by BioMed Central
• Fully peer-reviewed
• RAPID review (<3 weeks is routine)
• Open-access, indexed in PubMed Central et al.
• Readership goes up 10-100-fold
• Impact goes up too…
• Article fee reduced or zeroed depending on institution
Acknowledgements
• Don Swanson
• Vetle Torvik
• Wei Zhou (Clement Yu)
• Marc Weeber
Ruminations
• Should LBD analyses be user-friendly? Popular??
• Don’t they overlook true divergent discoveries?
• Should LBD be run automatically as a program in the background, with alerts of possible discoveries?
• Does LBD bypass, or reinforce, good old fashioned hypothesis driven science?