Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Whole Slide Imagery as an Enabling Technology for Content-Based Image Retrieval:

A review of current capabilities, opportunities and challenges

Ulysses J. Balis, M.D.
Director, Division of Pathology Informatics
Department of Pathology
University of Michigan Health System


http://www.med.umich.edu/

Disclosures*

• Aperio:
  – Technical Advisory Board and Shareholder
• Living Microsystems/Artemis Health, Inc.:
  – Founder and Shareholder
• Cellpoint Diagnostics:
  – Founder and Shareholder

*These are listed for completeness only; this presentation does not contain proprietary or commercial content from any of the above entities.

Overview of Topics

• Thesis statement
• Definitions
• A quick history of content-based image retrieval (CBIR)
• Prior work
• The challenge that is Pathology CBIR
• Current technology and recent developments
• Demonstrations
• Opportunities:
  – Upcoming Web-enabled tool suites
  – Intended use-cases


Topics





Thesis Statement
• The availability of digital whole slide data sets represents an enormous opportunity to carry out new forms of numerical and data-driven query, in modes not based on textual, ontological or lexical matching.
  – Search image repositories with whole images or image regions of interest
  – Carry out search in real-time via use of scalable computational architectures

[Figure: extraction from image repositories based upon spatial information; analysis of data in the digital domain; resultant surface map or gallery of matching images]





Definition
• Content-Based Image Retrieval (CBIR):
  – Within the context of an image-based repository, searching for matching predicates with image-based operators in lieu of text matching
• Reverse Metadata Lookup (RML):
  – Using the cohort of returned images from a CBIR query to generate a list of associated metadata concept terms
    • Anatomic frame of reference
    • Prior diagnoses
    • Differential diagnosis
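The Reverse Metadata Lookup idea can be illustrated with a minimal sketch. It assumes each returned image carries a list of metadata concept terms; the record layout, the function name `reverse_metadata_lookup` and the sample values are hypothetical and not taken from the presentation.

```python
from collections import Counter

def reverse_metadata_lookup(matches, top_n=3):
    """Rank metadata concept terms by how many returned images carry them."""
    counts = Counter()
    for record in matches:
        # Count each term once per image so one record cannot dominate.
        counts.update(set(record["terms"]))
    return counts.most_common(top_n)

# Hypothetical cohort returned by a CBIR query (illustrative values only).
matches = [
    {"image_id": "A", "terms": ["colon", "adenocarcinoma"]},
    {"image_id": "B", "terms": ["colon", "adenoma"]},
    {"image_id": "C", "terms": ["colon", "adenocarcinoma"]},
]
ranked = reverse_metadata_lookup(matches)
print(ranked)
```

The ranked terms would then serve as candidate anatomic frames of reference, prior diagnoses or differential diagnoses for the query image.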





A Quick History of CBIR
• 1970s: Corona Satellite Remote Sensing Initiative
  – Film-based
  – Resultant analog content, when digitized, represented gigabytes of data (consider the computational burden for 1972…)
  – Several numerical approaches devised to quickly crunch data
• Many approaches based on conventional image analysis: one or more specific algorithms developed for each feature to be extracted / identified
  – Technically challenging
  – Time consuming
  – Computationally expensive
• The term CBIR was first coined in 1992 by T. Kato to describe automatic retrieval of images from a database.
• One promising approach also explored was Vector Quantization (VQ)
• Many-log increase in computational throughput required for routine use

[Figure: computational throughput figures: 35 OPS; 478.2 TFLOPS; 1.026 PFLOPS]

CBIR Operational Modes
• Query by Example
  – Find pictures that contain this snippet / ROI
• Semantic Retrieval
  – Find pictures like adenocarcinoma
    • Like this adenocarcinoma
• Multimodal Retrieval
  – Search for matches based on imagery data combined with other search metrics
    • High-throughput “omics” data, etc.
    • Patient clinical outcomes and therapeutic response data
    • Other imaging modalities

CBIR Techniques (conventional)

• Color operators
• Texture operators
• Shape
• Spectral information
• Frequency and phase domain information

There are at least several thousand major classes of conventional image analysis operations, with most exhibiting the common trait of requiring some degree of application tuning for the intended use-case. Hence, this class of approaches should not be generally viewed as turnkey solutions.

CBIR Techniques (innovative)
• “Genetic” Image Exploration
  – Originally designed to analyze multispectral satellite data
  – Semi-autonomous systems that employ a decision tree to search a known repertoire of conventional image analysis algorithms for the most sensitive and specific combination of algorithms that fits the query predicate
  – is representative
    • (Los Alamos National Labs)
  – Autonomous operation comes at a price: the need for significant computational throughput in training mode (i.e. slow…)





Prior Work

• Conventional Image Analysis
• Conventional Vector Quantization

Conventional Image Analysis

• At present, confined to specific use-cases:
  – Quantitative IHC
  – FDA validation an ongoing challenge
• Not reduced to practice as an integral tool of the “pathologist’s workstation”

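Conventional Vector Quantization

[Figure: vector quantization pipeline]
• Original image
• Division of image into local domains
• Extraction of local domain composite vectors
• Individual assessment of each vector dimension
• Vectorization of each local kernel:

$$V_K = \sum \{ [L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}} \}$$

• Resulting kernel vector (example serial number: 38857448643)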
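The division and vectorization steps can be sketched as follows. This is only an illustration under stated assumptions: the image is a small grayscale grid stored as a list of rows, local domains are non-overlapping k x k tiles, and "vectorization" is read as flattening each tile's pixel values in a fixed row-major order. The function name `extract_kernel_vectors` is hypothetical, and the slide's V_K expression may encode more than this.

```python
def extract_kernel_vectors(image, k):
    """Split a 2-D grayscale image (list of rows) into non-overlapping
    k x k local domains and flatten each one, row by row, into a tuple."""
    height, width = len(image), len(image[0])
    vectors = {}
    for top in range(0, height - k + 1, k):
        for left in range(0, width - k + 1, k):
            # Key each vector by its tile coordinates to keep spatial order.
            vectors[(top // k, left // k)] = tuple(
                image[top + dy][left + dx]
                for dy in range(k)
                for dx in range(k)
            )
    return vectors

# Toy 4 x 4 image (illustrative values only).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
vectors = extract_kernel_vectors(image, 2)
print(vectors)
```

Keying each vector by tile position preserves the spatial layout that later stages rely on.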

Conventional Vector Quantization

$$V_K = \sum \{ [L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}} \}$$

• Query against library (vocabulary) of established vectors
  – Previously identified vector: use its existing serial number (e.g. 38857448643)
  – Novel vector: assignment of a unique serial number and inclusion into global vocabulary
• Assembly of compressed dataset from the resulting serial numbers
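The vocabulary lookup and serial-number assignment can be sketched like this. The sketch assumes exact matching of vectors against the vocabulary and sequential integer serial numbers; a practical VQ system would more likely match within a tolerance. The names `encode` and `decode` are hypothetical.

```python
def encode(vectors, vocabulary):
    """Replace each kernel vector with its serial number, adding novel
    vectors to the vocabulary as they are encountered."""
    encoded = []
    for vec in vectors:
        if vec not in vocabulary:
            # Novel vector: assign the next unused serial number.
            vocabulary[vec] = len(vocabulary)
        encoded.append(vocabulary[vec])
    return encoded

def decode(encoded, vocabulary):
    """Restore the kernel vectors from their serial numbers."""
    lookup = {serial: vec for vec, serial in vocabulary.items()}
    return [lookup[serial] for serial in encoded]

# Toy kernel vectors (illustrative values only).
kernel_vectors = [(0, 0, 0, 0), (9, 9, 9, 9), (0, 0, 0, 0), (0, 0, 0, 0)]
vocabulary = {}
encoded = encode(kernel_vectors, vocabulary)
restored = decode(encoded, vocabulary)
print(encoded, len(vocabulary))
```

Four kernel vectors collapse to two vocabulary entries here, which is the compression effect the slide describes.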

VQ-Based Image Compression as the Original Predicate for Carrying Out Image-Based Search

[Figure: raw data; compressed data; restored data]

The spatially-preserved organization of the encoded data represents a many-fold decrease in overall search dataset size, thus providing a significant computational opportunity for accelerated search.

Additionally, the vectors identified as contributing to a match may be visually interrogated for confirmation of their predictive morphologic content.
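Searching the compressed representation rather than raw pixels can be sketched as a scan over a grid of serial numbers. The sketch assumes the encoded slide is a 2-D grid with one serial number per local domain; the function name `find_code_matches` and the sample grid are hypothetical.

```python
def find_code_matches(code_grid, query_codes):
    """Return (row, col) positions whose serial number is in query_codes."""
    wanted = set(query_codes)
    return [
        (r, c)
        for r, row in enumerate(code_grid)
        for c, code in enumerate(row)
        if code in wanted
    ]

# Hypothetical encoded slide: each entry is the serial number of one local domain.
code_grid = [
    [0, 0, 1],
    [0, 2, 1],
    [0, 0, 0],
]
hits = find_code_matches(code_grid, {1, 2})
print(hits)
```

Because the grid keeps the spatial arrangement, each hit maps back to an image region that can be inspected visually.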





The Challenge That Is Pathology CBIR

• Start with some conservative initial assumptions concerning a prototypic image repository, in terms of search potential:
  – Ability to search 10 years of data
  – 1000 slides/day → 200,000 slides/year
  – 500 MB of compressed whole slide data/slide
  – Operational goal of being able to:
    • Search in real-time
    • Re-index the database every evening, such that searches carried out the next day are current

The Challenge That Is Pathology CBIR

• Net storage required for ten years’ worth of data:
  – 1 billion megabytes
    • 10^6 gigabytes
    • 10^3 terabytes
    • 10^0 petabytes = 1 petabyte
  – Current conservative enterprise storage is $2000/terabyte
  – The full petabyte would cost $2M
  – A single Genetic-type search across all images, assuming 5 seconds of computation / slide, would be:
    • 200,000 slides x 10 x 5 seconds → 5 million seconds
      – This is 6 log too slow
      – 8.27 weeks or about 6 searches per year
        » (original Apple IIe: 78 years)
      – So we would need to save our queries for those “really important” image searches…
  – Conventional VQ, which is ~100 times faster, is still not fast enough: 13.8 hours per feature search
    • Yet another 4 log of performance is required…
      – Two ways to address this:
        » 10,000 parallel processors or
        » better algorithms
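The storage and search-time estimates can be reproduced with a short calculation. It uses decimal units (1 TB = 10^6 MB) and the figures stated on the slides. One caveat: the literal product of the stated inputs (200,000 slides/year x 10 years x 5 s) is 10 million seconds, whereas the downstream figures (8.27 weeks, 13.8 hours) follow from the slide's 5 million seconds, so the sketch carries that value separately. Variable names are illustrative.

```python
slides_per_year = 200_000
years = 10
mb_per_slide = 500
dollars_per_tb = 2000

# Storage: decimal units, 1 TB = 1,000,000 MB.
total_mb = slides_per_year * years * mb_per_slide
total_tb = total_mb // 1_000_000
storage_cost = total_tb * dollars_per_tb

# Search time: literal product of the stated inputs vs. the slide's figure.
product_seconds = slides_per_year * years * 5
slide_seconds = 5_000_000  # value the slide carries forward

weeks = slide_seconds / (7 * 24 * 3600)
searches_per_year = 52 / weeks
vq_hours = slide_seconds / 100 / 3600      # conventional VQ, ~100x faster
after_four_logs = slide_seconds / 100 / 10**4  # seconds after a further 4-log gain

print(total_mb, total_tb, storage_cost)
print(product_seconds, round(weeks, 2), round(searches_per_year, 1))
print(round(vq_hours, 2), after_four_logs)
```

With the slide's 5 million seconds, a further four-log gain over conventional VQ brings a feature search to about 5 seconds.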





On Current Technology…

• Modern computational throughput continues to increase, with this capability representing an opportunity for perhaps 1-2 log performance increase in the next decade

• With a one-log increase, we are still left with a five-log gap that needs to be made up by improved algorithmic performance.

Recent Developments

• A number of promising algorithms being developed
  – Support Vector Machines (SVM)
  – Principal Component Analysis
  – High-dimensional reduction approaches
  – Spatially-invariant VQ (SiVQ)

VQ Revisited and SiVQ

Q: What is conventional VQ’s greatest weakness?

A: Too many required vectors to represent a single atomic morphologic feature
  – (promiscuity of vector set growth with continued training)

Conventional VQ Vector Growth During Training

Organ System   Number of Initial Training Cases   Average Vectors / Case   Convergence
Liver          20                                 340554                   8%
Colon          25                                 127824                   24%
Lymph Node     12                                 76443                    75%
Skin           12                                 1675439                  1%
Ovary          14                                 107437                   22%

A Matter of Degrees of Freedom…

[Figure: candidate feature. How many ways can this be sampled?]

How Many Ways Can A Candidate Feature Be Matched During Training?

[Figure: X translational freedom; Y translational freedom; rotational freedom]

In VQ: it may be the same feature, but there are excessively numerous ways to sample it
• Typical Feature Vector:
  – 25 x 25 pixels (x by y) or larger → 625 translational degrees of freedom
  – Effective radius of 12.5 pixels
  – After Nyquist rotational sampling (2x spatial frequency)
    • 2 x (2 x 12.5 x π) → 79 separate rotations
  – 3 color planes
  – 2 mirror symmetries
  – At least 20 possible semi-discrete length-scale Nyquist samples
  – All together, there are at least 625 x 79 x 3 x 2 x 20 = 5,925,000 possible ways to represent one possible vector (assuming twenty fixed magnifications in use)
  – This explains the non-asymptotic (unbounded) vector growth observed for some histology patterns.
  – Multispectral data (e.g. 28 vs. 3 bands) will further multiply the diagnostic power of SiVQ vectors (55,300,000 degrees of freedom / vector)
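The degrees-of-freedom count can be checked with a few lines of arithmetic. The sketch takes the slide's 79 rotations as given. Note that 2 x 12.5 x π (the circumference at the effective radius) is about 78.5, which rounds up to 79, while the expression as written with the extra factor of 2 evaluates to about 157; the totals on the slide are consistent with 79. Variable names are illustrative.

```python
import math

translations = 25 * 25      # 625 translational positions
radius = 12.5               # effective radius in pixels
rotations = 79              # value used on the slide
color_planes = 3
mirror_symmetries = 2
length_scales = 20

circumference = 2 * radius * math.pi       # about 78.5
literal_expression = 2 * circumference     # about 157, as the expression is written

total_rgb = (translations * rotations * color_planes
             * mirror_symmetries * length_scales)
total_multispectral = (translations * rotations * 28
                       * mirror_symmetries * length_scales)

print(round(circumference, 1), round(literal_expression, 1))
print(total_rgb, total_multispectral)
```

Both totals reproduce the slide's figures: 5,925,000 for three color planes and 55,300,000 for 28 spectral bands.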

Consequences of SiVQ
• Use one spatially-invariant vector to do the work of 5,925,000 spatially-constrained vectors
  – 5,925,000x faster
  – 5,925,000x fewer vectors to store per feature archetype
  – 6 log+ increase in algorithmic performance (we only needed 4 log, so we have CPU to burn)
  – Implies an operational solution to the real-time requirement for large datasets
• CBIR is essentially reduced to practice for a sizable contingent of textural-based whole slide image-retrieval use-cases
• Emergent property: SiVQ works equally well on all structurally-repetitive data sets (e.g. remote sensing, Google-like image searches of the Web)
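The "6 log+" claim can be related to the earlier timing estimate with a short calculation. This is an idealized upper bound: it assumes the full 5,925,000-fold reduction in vectors converts directly into a wall-clock speed-up over the 13.8-hour conventional VQ search, which the slides imply but do not demonstrate. Variable names are illustrative.

```python
import math

vectors_replaced = 5_925_000       # spatially-constrained vectors per SiVQ vector
vq_search_seconds = 13.8 * 3600    # slide's conventional VQ time per feature search

# Speed-up expressed in orders of magnitude ("logs").
gain_in_logs = math.log10(vectors_replaced)

# Idealized search time if the whole factor became a speed-up.
idealized_seconds = vq_search_seconds / vectors_replaced

print(round(gain_in_logs, 2), round(idealized_seconds, 4))
```

Under that assumption the gain is a little under seven orders of magnitude, comfortably more than the four the earlier estimate called for.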





Interactive Demonstration





Opportunities and Future Work
• CBIR development will continue
  – Many groups already demonstrating feasibility of real-time query capability
  – Activity at Rutgers, U. of Pittsburgh and Caltech
• For the UofM Group:
  – Rapid dissemination of the algorithm and libraries via peer-reviewed publications and/or e-pubs
  – Extension of the discovery tool suite to support multiple-vector classification, similar to the approaches taken for prior VQ systems, with rapid follow-on publications
  – “Ground-Truth Engine” for integrative multimodality studies
  – Activation of an open-architecture website that will provide a downloadable tool suite and a Web-based, real-time decision support environment for submitted images, operating in two general use-cases:
    • Surface classification with rare event detection (anything not classified as normal)
    • Differential diagnosis generation with return of matching images and associated metadata
  – Generation of a classification library of extensive “normal SiVQ vectors” for each organ system
  – Actively pursue collaboration to form a core team to adjudicate needed normal and abnormal vector classes

Closing Remarks

• CBIR is not vaporware or an elusive computational goal
• Contemporary computation speed is, actually, quite adequate for many CBIR tasks
• Much work remains to realize its full potential
• SiVQ will likely be one of a plurality of compelling solutions in the Image Query / Decision-support armamentarium

Acknowledgements

• Jerome Cheng, U. of Michigan
• Anastasios Markas, Insilica Corporation
• Mehmet Toner and Ronald Tompkins, Harvard Medical School
• Mike Feldman, U. of Pennsylvania
