Whole Slide Imagery as an Enabling Technology for Content-Based Image Retrieval: A review of current capabilities, opportunities and challenges Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology University of Michigan Health System [email protected]

Page 1: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Whole Slide Imagery as an Enabling Technology for Content-Based Image Retrieval:

A review of current capabilities, opportunities and challenges

Ulysses J. Balis, M.D.
Director, Division of Pathology Informatics

Department of Pathology
University of Michigan Health System

[email protected]

Page 2: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Disclosures*

• Aperio:
  – Technical Advisory Board and Shareholder

• Living Microsystems/Artemis Health, Inc.:
  – Founder and Shareholder

• Cellpoint Diagnostics:
  – Founder and Shareholder

*These are listed for completeness only; this presentation does not contain proprietary or commercial content from any of the above entities.

Page 3: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Overview of Topics

• Thesis statement
• Definitions
• A quick history of content-based image retrieval (CBIR)
• Prior work
• The challenge that is Pathology CBIR
• Current technology and recent developments
• Demonstrations
• Opportunities:
  – Upcoming Web-enabled tool suites
  – Intended use-cases

Page 5: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Thesis Statement

• The availability of digital whole slide data sets represents an enormous opportunity to carry out new forms of numerical and data-driven query, in modes not based on textual, ontological or lexical matching.
  – Search image repositories with whole images or image regions of interest
  – Carry out search in real time via use of scalable computational architectures

[Figure: extraction from image repositories based upon spatial information → analysis of data in the digital domain (…001011010111010111..) → resultant surface map or gallery of matching images]

Page 7: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Definition

• Content-Based Image Retrieval (CBIR):
  – Within the context of an image-based repository, searching for matching predicates with image-based operators in lieu of text matching
• Reverse Metadata Lookup (RML):
  – Using the cohort of returned images from a CBIR query to generate a list of associated metadata concept terms
    • Anatomic frame of reference
    • Prior diagnoses
    • Differential diagnosis
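
As a rough illustration of Reverse Metadata Lookup, the sketch below tallies metadata concept terms across the images returned by a query. The field names (`anatomic_site`, `prior_diagnoses`) and the function `reverse_metadata_lookup` are hypothetical, chosen only for this example; the slides do not specify a data model.

```python
from collections import Counter


def reverse_metadata_lookup(matches, top_n=3):
    """Tally metadata concept terms across the images returned by a CBIR query.

    `matches` is a list of dicts, one per returned image; each maps a
    metadata field (hypothetical names) to a list of concept terms.
    Returns, per field, the most common terms with their counts.
    """
    tallies = {}
    for image in matches:
        for field, terms in image.items():
            tallies.setdefault(field, Counter()).update(terms)
    return {field: counts.most_common(top_n) for field, counts in tallies.items()}


# Hypothetical cohort of three images returned by a query.
cohort = [
    {"anatomic_site": ["colon"], "prior_diagnoses": ["adenocarcinoma"]},
    {"anatomic_site": ["colon"], "prior_diagnoses": ["adenocarcinoma"]},
    {"anatomic_site": ["rectum"], "prior_diagnoses": ["tubular adenoma"]},
]
summary = reverse_metadata_lookup(cohort)
print(summary["anatomic_site"])
print(summary["prior_diagnoses"])
```

The ranked term lists are what would be presented back to the user, for example as a candidate anatomic frame of reference or a differential diagnosis.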

Page 9: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

A Quick History of CBIR

• 1970s: Corona Satellite Remote Sensing Initiative
  – Film-based
  – Resultant analog content, when digitized, represented gigabytes of data (consider the computational burden for 1972…)
  – Several numerical approaches devised to quickly crunch data
    • Many approaches based on conventional image analysis: one or more specific algorithms developed for each feature to be extracted / identified
      – Technically challenging
      – Time consuming
      – Computationally expensive
• The term CBIR was first coined in 1992 by T. Kato to describe automatic retrieval of images from a database.
• One promising approach also explored was Vector Quantization (VQ)
• Many-log increase in computational throughput required for routine use

Page 10: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

[Figure: computational throughput comparison; labels: 35 OPS; 478.2 TFLOPS; 1.026 PFLOPS]

Page 11: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

CBIR Operational Modes

• Query by Example
  – Find pictures that contain this snippet / ROI
• Semantic Retrieval
  – Find pictures like adenocarcinoma
    • Like this adenocarcinoma
• Multimodal Retrieval
  – Search for matches based on imagery data combined with other search metrics
    • High-throughput “omics” data, etc.
    • Patient clinical outcomes and therapeutic response data
    • Other imaging modalities

Page 12: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

CBIR Techniques (conventional)

• Color operators
• Texture operators
• Shape
• Spectral information
• Frequency and phase domain information

There are at least several thousand major classes of conventional image analysis operations, with most exhibiting the common trait of requiring some degree of application tuning for the intended use-case. Hence, this class of approaches should not be generally viewed as turnkey solutions.
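
To make the "application tuning" point concrete, here is a minimal sketch of one conventional color operator: a coarse joint RGB histogram compared by histogram intersection. The function names, the toy pixel lists and the choice of 4 bins per channel are assumptions for illustration; the bin count is exactly the kind of parameter that would need tuning per use-case.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of an iterable of (r, g, b) tuples (0-255)."""
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    count = 0
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical toy "images": flat lists of RGB pixels.
pinkish = [(230, 150, 180)] * 90 + [(90, 40, 130)] * 10    # mostly pink, some purple
purplish = [(230, 150, 180)] * 20 + [(90, 40, 130)] * 80   # mostly purple
query = [(228, 152, 178)] * 85 + [(92, 38, 128)] * 15      # query region of interest

q = color_histogram(query)
scores = {
    "pinkish": histogram_intersection(q, color_histogram(pinkish)),
    "purplish": histogram_intersection(q, color_histogram(purplish)),
}
best = max(scores, key=scores.get)
print(best, scores)
```

A histogram of this kind discards all spatial arrangement, which is one reason a single conventional operator is rarely sufficient on its own.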

Page 13: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

CBIR Techniques (innovative)

• “Genetic” Image Exploration
  – Originally designed to analyze multispectral satellite data
  – Semi-autonomous systems that employ a decision-tree to search a known repertoire of conventional image analysis algorithms for the most sensitive and specific combination of algorithms that fits the query predicate
  – […] is representative (Los Alamos National Labs)
  – Autonomous operation comes at a price: the need for significant computational throughput in training mode (i.e., slow…)

Page 15: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Prior Work

• Conventional image analysis
• Conventional Vector Quantization

Page 16: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Conventional Image Analysis

• At present, confined to specific use-cases:
  – Quantitative IHC
  – FDA validation an ongoing challenge
• Not reduced to practice as an integral tool of the “pathologist’s workstation”

Page 17: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Conventional Vector Quantization

[Figure: original image → division of image into local domains → extraction of local-domain composite vectors → individual assessment of each vector dimension → vectorization of each local kernel → lookup (?) → serial number, e.g. 38857448643]

V_K = Σ{[L·x_0 y_0]_Order, …, [L·x_n y_m]_Order}

Page 18: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Conventional Vector Quantization

V_K = Σ{[L·x_0 y_0]_Order, …, [L·x_n y_m]_Order}

[Figure: each vector is queried against a library (vocabulary) of established vectors. A previously identified vector returns its existing serial number (e.g. 38857448643); a novel vector is assigned a unique serial number and included in the global vocabulary. The resulting serial numbers (38857448643, 553246564, 53887, 554323267, …) are assembled into the compressed dataset.]
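
The encoding loop pictured on these two slides can be sketched roughly as follows. This is a simplified illustration, not the author's implementation: it assumes a small grayscale image held as a list of rows, non-overlapping square tiles, and exact-match vocabulary lookup (a practical VQ system would match to the nearest established vector within a tolerance). The names `encode_vq` and `decode_vq` are hypothetical.

```python
def encode_vq(image, block=2, vocabulary=None):
    """Encode a 2-D grayscale image (list of rows) as a grid of vocabulary indices.

    Each non-overlapping block x block tile is flattened into a tuple (the
    "vector"). A previously seen vector reuses its serial number; a novel
    vector is assigned the next serial number and added to the vocabulary.
    """
    vocabulary = {} if vocabulary is None else vocabulary
    encoded = []
    for top in range(0, len(image) - block + 1, block):
        row_codes = []
        for left in range(0, len(image[0]) - block + 1, block):
            vector = tuple(
                image[top + dy][left + dx]
                for dy in range(block)
                for dx in range(block)
            )
            if vector not in vocabulary:          # novel vector
                vocabulary[vector] = len(vocabulary)
            row_codes.append(vocabulary[vector])  # existing or newly assigned index
        encoded.append(row_codes)
    return encoded, vocabulary


def decode_vq(encoded, vocabulary, block=2):
    """Rebuild the image from the index grid (lossless here: exact-match lookup)."""
    by_index = {index: vector for vector, index in vocabulary.items()}
    image = []
    for row_codes in encoded:
        rows = [[] for _ in range(block)]
        for code in row_codes:
            vector = by_index[code]
            for dy in range(block):
                rows[dy].extend(vector[dy * block:(dy + 1) * block])
        image.extend(rows)
    return image


image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
]
codes, vocab = encode_vq(image)
print(codes)       # grid of serial numbers, spatial layout preserved
print(len(vocab))  # number of distinct vectors in the vocabulary
```

Because the index grid keeps the spatial layout of the tiles, a search can run over the much smaller grid of serial numbers rather than over raw pixels.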

Page 19: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

VQ-Based Image Compression as the Original Predicate for Carrying Out Image-Based Search

[Figure: raw data → compressed data (grid of vector serial numbers such as 38857448643, 553246564, 53887, …) → restored data]

The spatially-preserved organization of the encoded data represents a many-fold decrease in overall search dataset size, thus providing a significant computational opportunity for accelerated search.

Additionally, the vectors identified as contributing to a match may be visually interrogated for confirmation of their predictive morphologic content.

Page 21: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

The Challenge That Is Pathology CBIR

• Start with some conservative initial assumptions concerning a prototypic image repository, in terms of search potential:
  – Ability to search 10 years of data
  – 1000 slides/day → 200,000 slides/year
  – 500 MB of compressed whole slide data per slide
  – Operational goal of being able to:
    • Search in real time
    • Re-index the database every evening, such that searches carried out the next day are current

Page 22: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

The Challenge That Is Pathology CBIR

• Net storage required for ten years’ worth of data:
  – 1 billion megabytes
    • 10^6 gigabytes
    • 10^3 terabytes
    • 10^0 petabytes → 1 petabyte
  – Current conservative enterprise storage is $2000/terabyte
  – The full petabyte would cost $2M
  – A single Genetic-type search across all images, assuming 5 seconds of computation per slide, would be:
    • 200,000 slides × 10 × 5 seconds → 10 million seconds
      – This is 6 log too slow
      – 16.5 weeks, or about 3 searches per year
        » (original Apple IIe: 78 years)
      – So we would need to save our queries for those “really important” image searches…
  – Conventional VQ, which is ~100 times faster, is still not fast enough: 27.8 hours per feature search
    • Yet another 4 log of performance is required…
      – Two ways to address this:
        » 10,000 parallel processors or
        » better algorithms
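
The back-of-envelope figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the slide's stated inputs and decimal storage units (10^6 MB per TB); the variable names are illustrative. Multiplying the stated inputs directly (2 million slides at 5 seconds each) gives about 10 million seconds for a full search, and the order-of-magnitude gaps quoted on the slide are unaffected by a factor of two either way.

```python
SLIDES_PER_YEAR = 200_000
YEARS = 10
MB_PER_SLIDE = 500
COST_PER_TB_USD = 2_000
SECONDS_PER_SLIDE = 5        # "Genetic"-type search, per the slide
VQ_SPEEDUP = 100             # conventional VQ is ~100x faster, per the slide

total_slides = SLIDES_PER_YEAR * YEARS
storage_tb = total_slides * MB_PER_SLIDE / 1_000_000   # decimal units: 10^6 MB per TB
storage_cost_usd = storage_tb * COST_PER_TB_USD

genetic_seconds = total_slides * SECONDS_PER_SLIDE
genetic_weeks = genetic_seconds / (7 * 24 * 3600)
vq_hours = genetic_seconds / VQ_SPEEDUP / 3600

print(f"slides: {total_slides:,}")
print(f"storage: {storage_tb:,.0f} TB, cost ${storage_cost_usd:,.0f}")
print(f"genetic search: {genetic_seconds:,} s = {genetic_weeks:.1f} weeks")
print(f"conventional VQ search: {vq_hours:.1f} hours")
```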

Page 24: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

On Current Technology…

• Modern computational throughput continues to increase, with this capability representing an opportunity for perhaps 1-2 log performance increase in the next decade

• With a one-log increase, we are still left with a five-log gap that needs to be made up by improved algorithmic performance.

Page 25: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Recent Developments

• A number of promising algorithms are being developed
  – Support Vector Machines (SVM)
  – Principal Component Analysis (PCA)
  – High-dimensional reduction approaches
  – Spatially-invariant VQ (SiVQ)

Page 26: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

VQ Revisited and SiVQ

Q: What is conventional VQ’s greatest weakness?

A: Too many required vectors to represent a single atomic morphologic feature
  – (promiscuity of vector set growth with continued training)

Page 27: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Conventional VQ Vector Growth during training

Organ System   Number of Initial Training Cases   Average Vectors / Case   Convergence
Liver          20                                 340554                   8%
Colon          25                                 127824                   24%
Lymph Node     12                                 76443                    75%
Skin           12                                 1675439                  1%
Ovary          14                                 107437                   22%

Page 28: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

A Matter of Degrees of Freedom…

[Figure: candidate feature. How many ways can this be sampled?]

Page 29: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

How Many Ways Can A Candidate Feature Be Matched During Training?

[Figure: candidate feature with X translational freedom, Y translational freedom and rotational freedom]

Page 30: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

In VQ: it may be the same feature, but there is an excessive number of ways to sample it

• Typical feature vector:
  – 25 × 25 pixels (x by y) or larger → 625 translational degrees of freedom
  – Effective radius of 12.5 pixels
  – After Nyquist rotational sampling (2× spatial frequency)
    • 2 × 12.5 × π → 79 separate rotations
  – 3 color planes
  – 2 mirror symmetries
  – At least 20 possible semi-discrete length-scale Nyquist samples
  – All together, there are at least 625 × 79 × 3 × 2 × 20 → 5,925,000 possible ways to represent one possible vector (assuming twenty fixed magnifications in use)
  – This explains the non-asymptotic (unbounded) vector growth observed for some histology patterns.
  – Multispectral data (e.g. 28 vs. 3 bands) will further multiply the diagnostic power of SiVQ vectors (55,300,000 degrees of freedom / vector)
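
The product on this slide is easy to check. The sketch below multiplies the slide's factors; the one assumption is that the 79 rotations correspond to the circumference of the feature in pixels (2π × 12.5 ≈ 78.5, rounded up). The function name and parameters are illustrative only.

```python
import math


def sampling_degrees_of_freedom(side_px=25, color_planes=3,
                                mirror_symmetries=2, length_scales=20):
    """Count the ways one feature can be sampled, using the slide's factors."""
    translations = side_px * side_px                 # 25 x 25 = 625
    radius = side_px / 2                             # 12.5 pixels
    rotations = math.ceil(2 * math.pi * radius)      # assumed: circumference in pixels, 79
    return translations * rotations * color_planes * mirror_symmetries * length_scales


rgb = sampling_degrees_of_freedom()
multispectral = sampling_degrees_of_freedom(color_planes=28)
print(f"{rgb:,}")
print(f"{multispectral:,}")
```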

Page 31: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Consequences of SiVQ

• Use one spatially-invariant vector to do the work of 5,925,000 spatially-constrained vectors
  – 5,925,000× faster
  – 5,925,000× fewer vectors to store per feature archetype
  – 6 log+ increase in algorithmic performance (we only needed 4 log, so we have CPU to burn)
  – Implies an operational solution to the real-time requirement for large datasets
• CBIR is essentially reduced to practice for a sizable contingent of textural-based whole slide image-retrieval use-cases
• Emergent property: SiVQ works equally well on all structurally-repetitive data sets (e.g. remote sensing, Google-like image searches of the Web)
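
The slides do not spell out how SiVQ achieves invariance, so the following is only a toy illustration of the underlying idea: a single stored kernel is compared against every orientation of a candidate patch, so that rotated or mirrored copies of a feature do not each need their own vocabulary entry. It is restricted to 90-degree rotations and mirror flips of small square patches, and all function names are hypothetical.

```python
def rotate90(patch):
    """Rotate a square patch (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def mirror(patch):
    """Flip a patch left-to-right."""
    return [row[::-1] for row in patch]


def orientations(patch):
    """Yield the 8 rotations/reflections of a square patch."""
    current = patch
    for _ in range(4):
        yield current
        yield mirror(current)
        current = rotate90(current)


def invariant_distance(kernel, patch):
    """Smallest sum of squared differences between kernel and any orientation of patch."""
    def ssd(a, b):
        return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return min(ssd(kernel, candidate) for candidate in orientations(patch))


kernel = [[1, 2], [3, 4]]
print(invariant_distance(kernel, rotate90(kernel)))   # rotated copy of the same feature
print(invariant_distance(kernel, mirror(kernel)))     # mirrored copy of the same feature
print(invariant_distance(kernel, [[4, 4], [4, 4]]))   # a different patch
```

With conventional VQ, each of these eight orientations would be stored as a separate vector; here one kernel covers all of them at the cost of extra comparisons per match.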

Page 33: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Interactive Demonstration

Page 35: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Opportunities and Future Work

• CBIR development will continue
  – Many groups already demonstrating feasibility of real-time query capability
  – Activity at Rutgers, U. of Pittsburgh and Caltech
• For the UofM Group:
  – Rapid dissemination of the algorithm and libraries via peer-reviewed publications and/or e-pubs
  – Extension of the discovery tool suite to support multiple-vector classification, similar to the approaches taken for prior VQ systems, with rapid follow-on publications
  – “Ground-Truth Engine” for integrative multimodality studies
  – Activation of an open-architecture website that will provide a downloadable tool suite and a Web-based, real-time decision support environment for submitted images, operating in two general use-cases:
    • Surface classification with rare event detection (anything not classified as normal)
    • Differential diagnosis generation with return of matching images and associated metadata
  – Generation of a classification library of extensive “normal SiVQ vectors” for each organ system
  – Actively pursue collaboration to form a core team to adjudicate needed normal and abnormal vector classes

Page 36: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Closing Remarks

• CBIR is not vaporware or an elusive computational goal
• Contemporary computation speed is, actually, quite adequate for many CBIR tasks
• Much work remains to realize its full potential
• SiVQ will likely be one of a plurality of compelling solutions in the Image Query / Decision-support armamentarium

Page 37: Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Acknowledgements

• Jerome Cheng, U. of Michigan
• Anastasios Markas, Insilica Corporation
• Mehmet Toner and Ronald Tompkins, Harvard Medical School
• Mike Feldman, U. of Pennsylvania