Finding and Accessing Diagrams in Biomedical Publications
Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer
Krauthammer Lab, Department of Pathology, Yale University School of Medicine
AMIA 2012 Annual Symposium, 6 November 2012
Chicago
Introduction
The inclusion of figure images is a recent trend in the area of literature mining.
The increasing amount of open access publications makes such images available for automated analysis.
Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
T. Kuhn, T. Luong, and M. Krauthammer, Yale University Finding and Accessing Diagrams in Biomedical Publications 2 / 20
Answer Queries with Images
Often, a query is best answered by an image.
For example, WolframAlpha for “growth age 6”:
Idea: Use existing diagrams of scientific articles to answer queries.
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
Detection and Analysis of Specific Image Types
For the next version of the Yale Image Finder, we are working on the detection and analysis of specific image types:
• Axis Diagrams
• Gel Images
• Network Diagrams (work in progress)
Axis Diagrams: Examples
Axis Diagrams
Axis diagrams are important for several reasons:
• They are abundant in biomedical literature: about 38% of all subfigures are axis diagrams
• They follow simple common patterns based on axes
• They are complex in the sense that they combine several dimensions
• They summarize data for human readers
Axis Diagram Detection Steps
Basic idea: Large segments are detected as center segments of axis diagrams if they are surrounded by a number of small label segments.
1. Original
2. Segments
3. Center candidates
4. Label candidates
5. Result
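The basic idea above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation behind the Yale Image Finder: the `Segment` type, the area thresholds, the margin, and the minimum number of labels are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Bounding box of an image segment, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def area(self) -> int:
        return self.w * self.h


def is_near(center: Segment, other: Segment, margin: int) -> bool:
    """True if `other` intersects `center` grown by `margin` on every side."""
    return (other.x < center.x + center.w + margin
            and other.x + other.w > center.x - margin
            and other.y < center.y + center.h + margin
            and other.y + other.h > center.y - margin)


def detect_axis_diagrams(segments, min_center_area=10_000, max_label_area=1_000,
                         margin=40, min_labels=3):
    """Keep large segments that have enough small segments close to them.

    All four parameters are assumed values, not taken from the slides.
    """
    centers = [s for s in segments if s.area >= min_center_area]   # center candidates
    labels = [s for s in segments if s.area <= max_label_area]     # label candidates
    return [c for c in centers
            if sum(is_near(c, lab, margin) for lab in labels) >= min_labels]


# Toy input: a plot area with three tick labels, and a photo with no labels nearby.
plot = Segment(50, 10, 200, 150)
photo = Segment(400, 300, 200, 150)
tick_labels = [Segment(20, 60, 20, 10), Segment(100, 170, 30, 10), Segment(200, 170, 30, 10)]
stray_text = Segment(600, 600, 20, 10)

detected = detect_axis_diagrams([plot, photo, stray_text] + tick_labels)
```

In this toy input only `plot` is returned, because the photo has no small segments within the margin.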
Additional Classifiers
To compare and improve our approach, we apply SVM classifiers with the following two types of features:
• Image: texture and histogram features of the bitmap image
• Caption: word vector of the tokenized caption text
These classifiers act only on the complete figure and cannot locate the axis diagrams within it.
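The caption feature can be illustrated with a small bag-of-words sketch. It covers only the feature extraction step, not the SVM itself, and it rests on assumptions: lowercase tokenization on alphanumeric runs and raw counts are choices made for this example (the slides do not specify the tokenizer or the weighting), and the two captions are made up.

```python
import re
from collections import Counter


def tokenize(caption: str) -> list[str]:
    """Lowercase the caption and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocabulary(captions) -> list[str]:
    """Sorted list of all distinct tokens; fixes the order of vector components."""
    return sorted({tok for caption in captions for tok in tokenize(caption)})


def to_word_vector(caption: str, vocabulary) -> list[int]:
    """Count how often each vocabulary word occurs in the caption."""
    counts = Counter(tokenize(caption))
    return [counts[word] for word in vocabulary]


# Made-up captions, for illustration only.
captions = ["Growth curve of cells over time", "Western blot of cells"]
vocabulary = build_vocabulary(captions)
vectors = [to_word_vector(c, vocabulary) for c in captions]
```

Vectors of this kind, one per figure, would then serve as input to a classifier.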
Results
Evaluation on a random sample of 100 articles from PubMed Central with at least one figure. Their 404 figures were manually annotated and contained 508 axis diagrams.

Task: detection of figures with axis diagrams
Method                        Precision  Recall  F-score
segments                      0.87       0.66    0.75
image                         0.66       0.90    0.76
caption                       0.84       0.77    0.80
image + segments              0.80       0.73    0.76
caption + segments            0.90       0.85    0.88
image + caption               0.85       0.84    0.84
image + caption + segments    0.90       0.89    0.89

Task: extraction of axis diagram locations
Method                        Precision  Recall  F-score
segments                      0.85       0.40    0.54
image + segments              0.84       0.39    0.54
caption + segments            0.88       0.39    0.54
image + caption + segments    0.89       0.39    0.55
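As a reading aid, the F-score column can be related to the other two columns. The sketch below assumes that the F-score is the balanced F1 measure, that is, the harmonic mean of precision and recall; the slides do not state this explicitly, but the reported values are consistent with it.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the first three rows of the results.
reported = {
    "segments": (0.87, 0.66),
    "image": (0.66, 0.90),
    "caption": (0.84, 0.77),
}
recomputed = {method: round(f_score(p, r), 2) for method, (p, r) in reported.items()}
```

Because the reported precision and recall are themselves rounded to two decimals, a recomputed F-score can differ from the reported one in the last digit for some rows.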
Gel Images
Gel diagrams are another important type of image:
• They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
• They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
Relations from Gel Images
Condition     Measurement  Result
MDA-MB-231    14-3-3σ      high expression
NHEM          14-3-3σ      no expression
C8161.9       14-3-3σ      high expression
LOX           14-3-3σ      low expression
MDA-MB-231    β-actin      high expression
NHEM          β-actin      high expression
C8161.9       β-actin      high expression
LOX           β-actin      high expression

Condition                      Measurement  Result
IL-1β (–) DEX (–) RU486 (–)    p-p38        low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38        high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38        no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38        low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38        no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38        high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38        low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38        high expression
...                            ...          ...
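One possible way to hold such relations in code is a simple record with the three columns as fields. The `GelRelation` type and the `conditions_with` helper are hypothetical names introduced for this sketch; the rows are copied from the first table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    """One row of a relation table read off a gel image (hypothetical type)."""
    condition: str
    measurement: str
    result: str


relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]


def conditions_with(relations, measurement: str, result: str) -> list[str]:
    """Conditions under which `measurement` shows the given `result`."""
    return [r.condition for r in relations
            if r.measurement == measurement and r.result == result]


high = conditions_with(relations, "14-3-3σ", "high expression")
```

Records of this shape would make it possible to query, for example, all conditions in which a given protein is highly expressed.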
Procedure
[Pipeline diagram: articles, figures, segments, text (steps 1 to 3); gels (step 4); gel panels (step 5); named entities (step 6); relations (step 7)]
We focus here on steps 4, 5, and 6. Steps 1, 2, and 3 have been addressed in prior work. Step 7 is future work.
Gel Segment Detection
[Procedure step 4: gels]
Random forest classifiers on a number of features of image segments (position, size, grayscale histogram, color, texture, and number of recognized characters).
Results on 1000 manually annotated, random figures:
                Threshold  Precision  Recall  F-score
high recall     0.15       0.439      0.909   0.592
                0.30       0.765      0.739   0.752
high precision  0.60       0.926      0.301   0.455

AUC: 0.980
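The trade-off between the high-recall and the high-precision setting can be illustrated with a toy example. The sketch assumes that the threshold is applied to a per-segment classifier score, with a segment predicted to be a gel when its score reaches the threshold; the scores and labels below are invented and are not the evaluation data.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting 'gel' for every score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    # Precision is undefined without positive predictions; 0.0 is used here by convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented classifier scores and true labels (True = gel segment).
scores = [0.05, 0.20, 0.25, 0.40, 0.55, 0.70, 0.90]
labels = [False, False, True, False, True, True, True]

low_threshold = precision_recall(scores, labels, 0.15)   # high-recall setting
high_threshold = precision_recall(scores, labels, 0.60)  # high-precision setting
```

On this toy data the low threshold finds every gel segment at the cost of false positives, while the high threshold makes no false predictions but misses half of the gel segments.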
Gel Panel Detection
[Procedure step 5: gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected gel region
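The three steps above can be sketched as a simple region-growing loop. This is an illustration under assumptions rather than the actual implementation: the `Seg` type, the adjacency test with a fixed pixel gap, and the example segments are hypothetical, and the two thresholds are the high-precision (0.60) and high-recall (0.15) values from the gel segment results.

```python
from dataclasses import dataclass

HIGH_PRECISION = 0.60  # threshold for seed segments
HIGH_RECALL = 0.15     # threshold for segments merged into a seed


@dataclass(frozen=True)
class Seg:
    """Image segment with bounding box, gel score, and OCR text (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    gel_prob: float = 0.0  # classifier score, assumed to lie in [0, 1]
    text: str = ""         # recognized text; empty for non-text segments


def adjacent(a: Seg, b: Seg, gap: int = 10) -> bool:
    """True if the two boxes overlap or lie within `gap` pixels of each other."""
    return (a.x - gap <= b.x + b.w and b.x <= a.x + a.w + gap
            and a.y - gap <= b.y + b.h and b.y <= a.y + a.h + gap)


def detect_gel_panels(segments, gap: int = 10):
    """Return a list of (member indices, label texts), one entry per panel."""
    panels, used = [], set()
    for i, seed in enumerate(segments):
        if i in used or seed.gel_prob < HIGH_PRECISION:
            continue
        members = {i}
        grew = True
        while grew:  # repeat until no further segment can be merged
            grew = False
            for j, cand in enumerate(segments):
                if j in members or j in used or cand.gel_prob < HIGH_RECALL:
                    continue
                if any(adjacent(segments[m], cand, gap) for m in members):
                    members.add(j)
                    grew = True
        used |= members
        # Text segments next to the merged region become the panel's labels.
        labels = [s.text for k, s in enumerate(segments)
                  if s.text and k not in members
                  and any(adjacent(segments[m], s, gap) for m in members)]
        panels.append((sorted(members), labels))
    return panels


segments = [
    Seg(100, 100, 200, 40, gel_prob=0.85),   # seed
    Seg(100, 145, 200, 40, gel_prob=0.25),   # directly below the seed
    Seg(100, 190, 200, 40, gel_prob=0.20),   # below the second segment
    Seg(40, 110, 50, 15, text="p-p38"),      # label to the left of the seed
    Seg(500, 500, 200, 40, gel_prob=0.30),   # isolated, below the seed threshold
    Seg(520, 560, 60, 15, text="unrelated"), # text far away from the panel
]
panels = detect_gel_panels(segments)
```

Starting from a confident seed and growing with a permissive threshold favors precision: a region is only reported if at least one of its segments is very likely a gel, which matches the high precision and low recall reported for this step.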
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
Named Entity Recognition
[Procedure step 6: named entities]
Detection of gene and protein names in gel labels from a sample of 2000 random figures (tokenization; case-sensitive Entrez Gene lookup; exclude very short and very common words):
                                               absolute  relative
Total                                          156       100.0%
Incorrect                                      54        34.6%
– Not mentioned (OCR errors)                   28        17.9%
– Not references to genes or proteins          26        16.7%
Correct                                        102       65.3%
– Partially correct (could be more specific)   14        9.0%
– Fully correct                                88        56.4%
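The lookup procedure named above (tokenization, case-sensitive lookup, exclusion of very short and very common words) can be sketched as follows. Everything concrete here is an assumption made for illustration: the tiny `GENE_SYMBOLS` set stands in for the Entrez Gene data, and the tokenizer, the minimum length, and the common-word list are hypothetical choices.

```python
import re

# Hypothetical stand-ins; a real run would load gene symbols from Entrez Gene.
GENE_SYMBOLS = {"TP53", "ACTB", "MYC", "WAS"}
COMMON_WORDS = {"was", "and", "the"}  # compared in lowercase
MIN_LENGTH = 3                        # assumed cutoff for "very short"


def tokenize(label: str) -> list[str]:
    """Split a gel label into alphanumeric tokens, keeping the original case."""
    return re.findall(r"[A-Za-z0-9]+", label)


def find_gene_tokens(label: str) -> list[str]:
    """Tokens of the label that pass the filters and match a gene symbol exactly."""
    hits = []
    for token in tokenize(label):
        if len(token) < MIN_LENGTH:
            continue  # very short tokens are too ambiguous
        if token.lower() in COMMON_WORDS:
            continue  # very common words are excluded even if they are symbols
        if token in GENE_SYMBOLS:  # case-sensitive lookup
            hits.append(token)
    return hits


found = find_gene_tokens("MYC and ACTB")
```

The case-sensitive comparison means that a lowercase token such as "myc" is not matched, and the common-word filter drops "WAS" even though it is in the symbol set.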
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed Central:

Total articles                           410 950
Processed articles                       386 428
Total figures from processed articles    1 110 643
Processed figures                        884 152
Detected gel panels                      85 942
Detected gel panels per figure           0.097
Detected gene tokens                     1 854 609
Detected gene tokens in gel labels       75 610
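The ratios implied by these counts can be checked with a few lines of arithmetic. The sketch assumes that "detected gel panels per figure" is the number of detected gel panels divided by the number of processed figures, which reproduces the reported 0.097; the other three ratios are derived here and do not appear on the slide.

```python
# Counts copied from the table above.
total_articles = 410_950
processed_articles = 386_428
total_figures = 1_110_643
processed_figures = 884_152
gel_panels = 85_942
gene_tokens = 1_854_609
gene_tokens_in_gel_labels = 75_610

# Assumed definition: panels divided by processed figures.
panels_per_figure = gel_panels / processed_figures

# Derived ratios (not stated on the slide).
article_coverage = processed_articles / total_articles
figure_coverage = processed_figures / total_figures
gel_label_share = gene_tokens_in_gel_labels / gene_tokens

for name, value in [("panels per figure", panels_per_figure),
                    ("article coverage", article_coverage),
                    ("figure coverage", figure_coverage),
                    ("gene tokens in gel labels", gel_label_share)]:
    print(f"{name}: {value:.3f}")
```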
Conclusions and Future Work
Conclusions
• The locations of certain diagram types, such as axis and gel diagrams, can be extracted at a high precision of about 90%, with an F-score of around 55%
Future Work
• Relation extraction
• Include other image types like network diagrams
• Combination with classical text mining techniques
• Detection of other named entity types: cell lines, drugs, ...
• Sophisticated diagram search interface
• Standard for biomedical diagrams?
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
Do we need a standard for biomedical diagrams? A Unified Modeling Language (UML) for biology and medicine?
Thank you for your Attention!
Questions?