Whole Slide Imagery as an Enabling Technology for Content-Based Image Retrieval:
A review of current capabilities, opportunities and challenges
Ulysses J. Balis, M.D.
Director, Division of Pathology Informatics
Department of Pathology
University of Michigan Health System
Disclosures*
• Aperio:
  – Technical Advisory Board and Shareholder
• Living Microsystems/Artemis Health, Inc.:
  – Founder and Shareholder
• Cellpoint Diagnostics:
  – Founder and Shareholder
*These are listed for completeness only; this presentation does not contain proprietary or commercial content from any of the above entities.
Overview of Topics
• Thesis statement
• Definitions
• A quick history of content-based image retrieval (CBIR)
• Prior work
• The challenge that is Pathology CBIR
• Current technology and recent developments
• Demonstrations
• Opportunities:
  – Upcoming Web-enabled tool suites
  – Intended use-cases
Thesis Statement
• The availability of digital whole slide data sets represents an enormous opportunity to carry out new forms of numerical and data-driven query, in modes not based on textual, ontological or lexical matching.
  – Search image repositories with whole images or image regions of interest
  – Carry out search in real-time via use of scalable computational architectures
[Figure: extraction from image repositories based upon spatial information; analysis of data in the digital domain; resultant surface map or gallery of matching images]
Definitions
• Content-Based Image Retrieval (CBIR):
  – Within the context of an image-based repository, searching for matching predicates with image-based operators in lieu of text matching
• Reverse Metadata Lookup (RML):
  – Using the cohort of returned images from a CBIR query to generate a list of associated metadata concept terms
    • Anatomic frame of reference
    • Prior diagnoses
    • Differential diagnosis
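Reverse Metadata Lookup can be sketched in a few lines. This is a minimal illustration, not a description of any deployed system: the `hits` records, their field names and the function `reverse_metadata_lookup` are all hypothetical, and the sketch assumes each image returned by a CBIR query carries the metadata of its source case.

```python
from collections import Counter

# Hypothetical CBIR result set: every returned image keeps its case metadata.
hits = [
    {"image_id": "case-001", "site": "colon", "diagnoses": ["adenocarcinoma"]},
    {"image_id": "case-002", "site": "colon",
     "diagnoses": ["adenocarcinoma", "tubular adenoma"]},
    {"image_id": "case-003", "site": "rectum", "diagnoses": ["adenocarcinoma"]},
    {"image_id": "case-004", "site": "colon", "diagnoses": ["hyperplastic polyp"]},
]


def reverse_metadata_lookup(hits, field):
    """Rank the terms in `field` by how many returned images carry them."""
    counts = Counter()
    for hit in hits:
        value = hit.get(field, [])
        terms = value if isinstance(value, list) else [value]
        counts.update(set(terms))  # count each term once per image
    return counts.most_common()


# Prior diagnoses ranked by frequency form a crude differential diagnosis.
differential = reverse_metadata_lookup(hits, "diagnoses")
# Anatomic frame of reference for the query region.
sites = reverse_metadata_lookup(hits, "site")
```

Ranking by simple frequency is only one possible design; weighting each term by the match score of its image would be an equally plausible choice.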
A Quick History of CBIR
• 1970s: Corona Satellite Remote Sensing Initiative
  – Film-based
  – Resultant analog content, when digitized, represented gigabytes of data (consider the computational burden for 1972…)
  – Several numerical approaches devised to quickly crunch data
    • Many approaches based on conventional image analysis: one or more specific algorithms developed for each feature to be extracted / identified
      – Technically challenging
      – Time consuming
      – Computationally expensive
• The term CBIR was first coined in 1992 by T. Kato to describe automatic retrieval of images from a database.
• One promising approach also explored was Vector Quantization (VQ)
• Many-log increase in computational throughput required for routine use
[Figure: computational throughput comparison, labeled 35 OPS, 478.2 TFLOPS and 1.026 PFLOPS]
CBIR Operational Modes
• Query by Example
  – Find pictures that contain this snippet / ROI
• Semantic Retrieval
  – Find pictures like adenocarcinoma
    • Like this adenocarcinoma
• Multimodal Retrieval
  – Search for matches based on imagery data combined with other search metrics
    • High-throughput “omics” data, etc.
    • Patient clinical outcomes and therapeutic response data
    • Other imaging modalities
CBIR Techniques (conventional)
• Color operators
• Texture operators
• Shape
• Spectral information
• Frequency and phase domain information
There are at least several thousand major classes of conventional image analysis operations, with most exhibiting the common trait of requiring some degree of application tuning for the intended use-case. Hence, this class of approaches should not be generally viewed as turnkey solutions.
CBIR Techniques (innovative)
• “Genetic” Image Exploration
  – Originally designed to analyze multispectral satellite data
  – Semi-autonomous systems that employ a decision-tree to search a known repertoire of conventional image analysis algorithms for the most sensitive and specific combination of algorithms that fits the query predicate
  – A system from Los Alamos National Labs is representative
  – Autonomous operation comes at a price: the need for significant computational throughput in training mode (i.e. slow…)
Prior Work
• Conventional image analysis
• Conventional Vector Quantization
Conventional Image Analysis
• At present, confined to specific use-cases:
  – Quantitative IHC
  – FDA validation an ongoing challenge
• Not reduced to practice as an integral tool of the “pathologist’s workstation”
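Conventional Vector Quantization
[Diagram, labels in source order:]
• Original image
• Division of image into local domains
• Extraction of local domain composite vectors
• Individual assessment of each vector dimension
• Vectorization of each local kernel:
  $V_K = \Sigma\{[L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}}\}$
• Example vector serial number: 38857448643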
Conventional Vector Quantization
$V_K = \Sigma\{[L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}}\}$
[Diagram, labels in source order:]
• Query against library (vocabulary) of established vectors
  – Previously identified vector: its established serial number is used (e.g. 38857448643)
  – Novel vector: assignment of a unique serial number and inclusion into global vocabulary
• Assembly of compressed dataset from the sequence of serial numbers
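The encoding loop in the diagram above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind these slides: non-overlapping square blocks, squared Euclidean distance, a fixed `tolerance` for deciding that a vector is novel, and serial numbers that are simply list indices are all choices made for the example, and the function names are hypothetical.

```python
def tile(image, k):
    """Split a 2D list of pixel values into non-overlapping k x k blocks.

    Returns rows of blocks, each block flattened into a tuple (a vector).
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            tuple(image[r + dr][c + dc] for dr in range(k) for dc in range(k))
            for c in range(0, cols - k + 1, k)
        ]
        for r in range(0, rows - k + 1, k)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(image, k, vocabulary, tolerance):
    """Replace each block with the serial number of its nearest known vector.

    A block farther than `tolerance` from every vocabulary entry is novel:
    it is appended to the vocabulary and receives the next serial number.
    """
    encoded = []
    for block_row in tile(image, k):
        codes = []
        for vec in block_row:
            best = min(range(len(vocabulary)),
                       key=lambda i: sq_dist(vec, vocabulary[i]),
                       default=None)
            if best is None or sq_dist(vec, vocabulary[best]) > tolerance:
                vocabulary.append(vec)  # novel vector joins the vocabulary
                best = len(vocabulary) - 1
            codes.append(best)
        encoded.append(codes)
    return encoded


image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [11, 10, 10, 10],
    [10, 10, 10, 10],
]
vocabulary = []
codes = encode(image, 2, vocabulary, tolerance=4)
# codes is a 2 x 2 grid of serial numbers; three of the four blocks share one.
```

The linear scan over the vocabulary is the simplest possible lookup; a large vocabulary would need an index structure to keep the query fast.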
VQ-Based Image Compression as the Original Predicate for Carrying Out Image-Based Search
[Figure: raw data → compressed data → restored data. The compressed data is shown as a grid of vector serial numbers: 38857448643, 553246564, 53887, 554323267, 865438676, 354554343, 55565435, 446854, 446854456, 66963658, 776956468, 8865433]
The spatially-preserved organization of the encoded data represents a many-fold decrease in overall search dataset size, thus providing a significant computational opportunity for accelerated search.
Additionally, the vectors identified as contributing to a match may be visually interrogated for confirmation of their predictive morphologic content.
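Because the encoded data keeps its spatial layout, a search can run on the serial numbers alone. The sketch below is illustrative only: the grid arrangement is made up (the serial numbers are borrowed from the figure), and `find_code` and `surface_map` are hypothetical helper names.

```python
def find_code(encoded, code):
    """Return the (row, col) block positions whose serial number equals `code`."""
    return [
        (r, c)
        for r, row in enumerate(encoded)
        for c, value in enumerate(row)
        if value == code
    ]


def surface_map(encoded, matching_codes):
    """Binary map: 1 where a block's serial number is one of `matching_codes`."""
    return [[1 if value in matching_codes else 0 for value in row]
            for row in encoded]


# Hypothetical 2 x 3 grid of vector serial numbers for one encoded image.
grid = [
    [38857448643, 553246564, 53887],
    [554323267, 38857448643, 354554343],
]

positions = find_code(grid, 38857448643)
matches = surface_map(grid, {38857448643, 53887})
```

Each matching serial number can then be mapped back to its vocabulary vector for visual confirmation, as the slide notes.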
The Challenge That Is Pathology CBIR
• Start with some conservative initial assumptions concerning a prototypic image repository, in terms of search potential:
  – Ability to search 10 years of data
  – 1000 slides/day, or 200,000 slides/year
  – 500 MB of compressed whole slide data/slide
  – Operational goal of being able to:
    • Search in real-time
    • Re-index the database every evening, such that searches carried out the next day are current
The Challenge That Is Pathology CBIR
• Net storage required for ten years' worth of data:
  – 1 billion megabytes
    • 10^6 Gigabytes
    • 10^3 Terabytes
    • 10^0 Petabytes = 1 Petabyte
  – Current conservative enterprise storage is $2000/Terabyte
  – The full Petabyte would cost $2M
  – A single Genetic-type search across all images, assuming 5 seconds of computation / slide, would be:
    • 200,000 slides/year x 10 years x 5 seconds = 10 million seconds
      – This is 6 log too slow
      – 16.5 weeks, or about 3 searches per year
        » (original Apple 2e: 78 years)
      – So we would need to save our queries for those “really important” image searches….
  – Conventional VQ, which is ~100 times faster, is still not fast enough: 27.8 hours per feature search
    • Yet another 4 log of performance is required…
      – Two ways to address this:
        » 10,000 parallel processors or
        » better algorithms
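The estimates on this slide can be recomputed directly from the stated assumptions. The sketch below uses decimal units (1 TB = 10^6 MB) and hypothetical constant names. Note that 2,000,000 slides at 5 seconds each comes to 10 million seconds.

```python
# Assumptions stated on the slides.
SLIDES_PER_YEAR = 200_000
YEARS = 10
MB_PER_SLIDE = 500
USD_PER_TB = 2_000
SECONDS_PER_SLIDE = 5   # "Genetic"-type search cost per slide
VQ_SPEEDUP = 100        # conventional VQ is roughly 100 times faster

total_slides = SLIDES_PER_YEAR * YEARS            # 2,000,000 slides
storage_mb = total_slides * MB_PER_SLIDE          # 1 billion MB
storage_tb = storage_mb / 10**6                   # 1,000 TB = 1 PB
storage_cost_usd = storage_tb * USD_PER_TB        # $2M

search_seconds = total_slides * SECONDS_PER_SLIDE     # one full search
search_weeks = search_seconds / (7 * 24 * 3600)
vq_hours = search_seconds / VQ_SPEEDUP / 3600

print(f"storage: {storage_tb:,.0f} TB, cost ${storage_cost_usd:,.0f}")
print(f"genetic search: {search_seconds:,} s = {search_weeks:.1f} weeks")
print(f"conventional VQ search: {vq_hours:.1f} hours")
```

Bringing the 10^7-second search down to roughly 10 seconds is the six-log gap; the 100-fold VQ speedup covers two of those logs and leaves four.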
On Current Technology…
• Modern computational throughput continues to increase, with this capability representing an opportunity for perhaps 1-2 log performance increase in the next decade
• With a one-log increase, we are still left with a five-log gap that needs to be made up by improved algorithmic performance.
Recent Developments
• A number of promising algorithms being developed
  – Support Vector Machines (SVM)
  – Principal Component Analysis
  – High-dimensional reduction approaches
  – Spatially-invariant VQ (SiVQ)
VQ Revisited and SiVQ
Q: What is conventional VQ’s greatest weakness?
A: Too many required vectors to represent a single atomic morphologic feature
  – (promiscuity of vector set growth with continued training)
Conventional VQ Vector Growth during training
Organ System   Number of Initial Training Cases   Average Vectors / Case   Convergence
Liver          20                                 340554                   8%
Colon          25                                 127824                   24%
Lymph Node     12                                 76443                    75%
Skin           12                                 1675439                  1%
Ovary          14                                 107437                   22%
A Matter of Degrees of Freedom…
[Figure: a candidate feature. How many ways can this be sampled?]
How Many Ways Can A Candidate Feature Be Matched During Training?
[Figure: a candidate feature with X translational freedom, Y translational freedom and rotational freedom]
In VQ, the same feature can be sampled in an excessive number of ways
• Typical Feature Vector:
  – 25 x 25 pixels (x by y) or larger: 625 translational degrees of freedom
  – Effective radius of 12.5 pixels
  – After Nyquist rotational sampling (2x spatial frequency)
    • 2 x 12.5 x π ≈ 79 separate rotations
  – 3 color planes
  – 2 mirror symmetries
  – At least 20 possible semi-discrete length-scale Nyquist samples
  – All together, there are at least 625 x 79 x 3 x 2 x 20 = 5,925,000 possible ways to represent one possible vector (assuming twenty fixed magnifications in use)
  – This explains the non-asymptotic (unbounded) vector growth observed for some histology patterns.
  – Multispectral data (e.g. 28 vs. 3 bands) will further multiply the diagnostic power of SiVQ vectors (55,300,000 degrees of freedom / vector)
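The degrees-of-freedom product can be checked with a short calculation. The sketch assumes the rotation count is the circumference, in pixels, of a circle of radius 12.5 rounded up to a whole number, which reproduces the figure of 79 used in the product. The variable names are illustrative.

```python
import math

SIDE = 25                                       # feature vector is 25 x 25 pixels
translations = SIDE * SIDE                      # 625 x/y offsets
radius = SIDE / 2                               # effective radius of 12.5 pixels
rotations = math.ceil(2 * math.pi * radius)     # circumference of 78.54 -> 79
color_planes = 3
mirror_symmetries = 2
length_scales = 20                              # twenty fixed magnifications

ways_rgb = (translations * rotations * color_planes
            * mirror_symmetries * length_scales)

# Same product with 28 spectral bands in place of 3 color planes.
ways_multispectral = (translations * rotations * 28
                      * mirror_symmetries * length_scales)

print(f"{ways_rgb:,} ways with 3 color planes")
print(f"{ways_multispectral:,} ways with 28 bands")
```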
Consequences of SiVQ
• Use one spatially-invariant vector to do the work of 5,925,000 spatially-constrained vectors
  – 5,925,000x faster
  – 5,925,000x fewer vectors to store per feature archetype
  – 6 log+ increase in algorithmic performance (we only needed 4 log, so we have CPU to burn)
  – Implies an operational solution to the real-time requirement for large datasets
• CBIR is essentially reduced to practice for a sizable contingent of textural-based whole slide image-retrieval use-cases
• Emergent property: SiVQ works equally well on all structurally-repetitive data sets (e.g. remote sensing, Google-like image searches of the Web)
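These slides do not describe how SiVQ achieves its invariance, so the following is not the SiVQ algorithm. It is a generic sketch of how a single stored vector can match a feature regardless of rotation or mirroring. It assumes the feature is sampled at evenly spaced points around a ring, so that a rotation becomes a cyclic shift and a mirror becomes a reversal. The function name and sample values are hypothetical.

```python
def ring_distance(a, b):
    """Smallest squared distance between two ring samples.

    Every cyclic shift of `b` (rotation) is tried in both traversal
    directions (mirror), and the best alignment with `a` is kept.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("ring samples must have the same length")
    best = None
    for candidate in (b, b[::-1]):
        for shift in range(n):
            rotated = candidate[shift:] + candidate[:shift]
            d = sum((x - y) ** 2 for x, y in zip(a, rotated))
            if best is None or d < best:
                best = d
    return best


feature = [0, 0, 9, 9, 0, 3, 0, 0]           # intensities sampled around a ring
rotated_copy = feature[3:] + feature[:3]     # the same feature, rotated
mirrored_copy = feature[::-1]                # the same feature, mirrored
unrelated = [5] * 8

d_rotated = ring_distance(feature, rotated_copy)
d_mirrored = ring_distance(feature, mirrored_copy)
d_unrelated = ring_distance(feature, unrelated)
```

One stored ring stands in for every rotated and mirrored copy, which is the kind of vocabulary reduction the slide describes; translation and length scale would still have to be handled separately.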
Interactive Demonstration
Opportunities and Future Work
• CBIR development will continue
  – Many groups already demonstrating feasibility of real-time query capability
  – Activity at Rutgers, U. of Pittsburgh and Cal Tech
• For the UofM Group:
  – Rapid dissemination of the algorithm and libraries via peer-reviewed publications and/or e-pubs
  – Extension of the discovery tool suite to support multiple-vector classification, similar to the approaches taken for prior VQ systems, with rapid follow-on publications
  – “Ground-Truth Engine” for integrative multimodality studies
  – Activation of an open-architecture website that will provide a downloadable tool suite and a Web-based, real-time decision support environment for submitted images, operating in two general use-cases:
    • Surface classification with rare event detection (anything not classified as normal)
    • Differential diagnosis generation with return of matching images and associated metadata
  – Generation of a classification library of extensive “normal SiVQ vectors” for each organ system
  – Active pursuit of collaboration to form a core team to adjudicate needed normal and abnormal vector classes
Closing Remarks
• CBIR is not vaporware or an elusive computational goal
• Contemporary computation speed is, actually, quite adequate for many CBIR tasks
• Much work remains to realize its full potential
• SiVQ will likely be one of a plurality of compelling solutions in the Image Query / Decision-support armamentarium
Acknowledgements
• Jerome Cheng, U. of Michigan
• Anastasios Markas, Insilica Corporation
• Mehmet Toner and Ronald Tompkins, Harvard Medical School
• Mike Feldman, U. of Pennsylvania