Digital Archives for Molecular Microscopy

A community database for biological research

Christoph Best, European Bioinformatics Institute, Cambridge, UK

Matthew T. Dougherty, NCMI - Baylor College of Medicine, Houston, Texas

Bioimage Informatics

Informatics in support of biological imaging

Why?

Image data rapidly increasing:
  (Confocal) fluorescence microscopy (Cellular Biology)
  EMDB: electron microscopy (Structural Biology)
  High-throughput methods (Genome Biology)

Enabling science by making data accessible, reliable, and understandable

Standards & Conventions

Public Databases

Quality assessment

Open Microscopy Environment

S. Haertel, U. Chile

J. Swedlow, U. Dundee

EMDB, EBI

Structural Databases at EBI

Protein Data Bank (PDB)
  Atomic structures (positions of atoms)
  PDB file format, mmCIF
  Derived from X-ray crystallography
  Long tradition, curated database
  Huge: 65,000+ entries, 3 wwPDB sites

Electron Microscopy Data Bank (EMDB)
  Part of PDB at EBI and Rutgers
  600 density maps of macromolecular structures and subcellular complexes
  Started 2002
  Curated, but limited metadata and experiment info
  XML-based

SCIENTIFIC BACKGROUND

Electron microscope

From Schweikert, 2004

Biocenter, U Helsinki

Single-particle method

Tripeptidyl-peptidase II (TPP II)

courtesy of B. Rockel, Martinsried

Molecular structure
Many images computationally combined
3D from 2D
Resolution increase by averaging
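
The gain from averaging can be illustrated numerically. The sketch below is not the single-particle reconstruction itself: it only averages N noisy copies of a one-dimensional test signal (a sine wave chosen for brevity, with an assumed noise level of 1.0) and shows that the residual noise drops roughly as 1/sqrt(N). All names are illustrative.

```python
import math
import random
import statistics


def noisy_copy(signal, sigma, rng):
    """Return the signal with independent Gaussian noise added to every sample."""
    return [s + rng.gauss(0.0, sigma) for s in signal]


def average_copies(signal, n_copies, sigma, rng):
    """Average n_copies independently noisy versions of the same signal."""
    total = [0.0] * len(signal)
    for _ in range(n_copies):
        for i, value in enumerate(noisy_copy(signal, sigma, rng)):
            total[i] += value
    return [t / n_copies for t in total]


def residual_noise(signal, estimate):
    """Standard deviation of the difference between estimate and true signal."""
    return statistics.pstdev([e - s for e, s in zip(estimate, signal)])


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
signal = [math.sin(2 * math.pi * i / 64) for i in range(1024)]

single = residual_noise(signal, average_copies(signal, 1, 1.0, rng))
averaged = residual_noise(signal, average_copies(signal, 100, 1.0, rng))

print(f"residual noise, 1 image:    {single:.3f}")
print(f"residual noise, 100 images: {averaged:.3f}")
print(f"ratio: {single / averaged:.1f} (about sqrt(100) = 10 expected)")
```

Real particle images must also be aligned and classified before they can be averaged; that step is left out here.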

Single-particle analysis: GroEL to 4 Å

Ludtke et al., Structure, 2008

Data Management Issues

Initial EM images: O(1000), 4k x 4k -> O(10 GPixel)
Particle stacks: O(100,000), 256x256 -> O(10 GPixel)
Final data set: 1 MVoxel (small)
Processing power: O(100) cores, some weeks, lab-owned clusters
Software: 1970s FORTRAN codes, 1990s C codes
  fragmented communities, lack of standards
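
The order-of-magnitude figures above follow from simple multiplication. The sketch below redoes the arithmetic, reading "4k" as 4096 pixels per side. The storage estimate additionally assumes 2 bytes per pixel, a value that is not given in the slides.

```python
def total_pixels(count, *dims):
    """Total number of pixels (or voxels) in `count` arrays of shape `dims`."""
    size = 1
    for d in dims:
        size *= d
    return count * size


K = 1024             # assumption: "4k" means 4 * 1024 = 4096 pixels per side
BYTES_PER_PIXEL = 2  # assumption: 16-bit pixels; not stated in the slides

micrographs = total_pixels(1000, 4 * K, 4 * K)  # initial EM images
particles = total_pixels(100_000, 256, 256)     # particle stacks

for name, pixels in (("micrographs", micrographs), ("particles", particles)):
    gigapixel = pixels / 1e9
    gigabyte = pixels * BYTES_PER_PIXEL / 1e9
    print(f"{name}: {gigapixel:.1f} GPixel, about {gigabyte:.0f} GB")
```

Both totals come out within a factor of two of 10 GPixel, consistent with the O(10 GPixel) estimates.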

Electron tomography

3D reconstruction by taking a series of images from different angles
Difficulty: nanometer accuracy
Problems:
  Limited tilt range ↔ missing wedge ⇒ distortion
  Imperfections of the tilt ↔ alignment ⇒ limited resolution
Computational reconstruction algorithms
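
The size of the missing wedge follows directly from the tilt range. As a minimal sketch, assuming a single-axis tilt series that covers -max_tilt to +max_tilt, the unsampled fraction of Fourier space is (90 - max_tilt) / 90. The tilt values in the loop are illustrative and not taken from the slides.

```python
def missing_wedge_fraction(max_tilt_deg):
    """Fraction of Fourier space left unsampled by a single-axis tilt series
    covering -max_tilt_deg .. +max_tilt_deg.

    Each projection fills one central plane of Fourier space (projection-slice
    theorem), so a tilt range of 2 * max_tilt out of 180 degrees leaves a
    wedge of (180 - 2 * max_tilt) degrees empty.
    """
    if not 0 <= max_tilt_deg <= 90:
        raise ValueError("max_tilt_deg must lie between 0 and 90")
    return (90.0 - max_tilt_deg) / 90.0


for tilt in (45, 60, 70):
    fraction = missing_wedge_fraction(tilt)
    print(f"+/-{tilt} deg tilt range: {fraction:.0%} of Fourier space missing")
```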

Tomography of eukaryotic cells

Projection / Slice

Dictyostelium discoideum

O. Medalia et al., Science, 2002

Image enhancement

Before

Cytoskeleton of Spiroplasma melliferum

J. Kürner et al., Science, 2005

Image enhancement

After

yellow: geodetic line

J. Kürner et al., Science, 2005

Automated image analysis

Manual / Automatic

A. Linaroudis, Ph.D. Thesis, 2006

Automatic segmentation to identify points/lines/surfaces

Data Management Issues

Original data: 60 images, 8k x 8k -> O(4 GPixel)
Reconstruction: 8k x 8k x 256 -> O(16 GPixel) ?
Software: 1970s algorithm in 1990s software
Visualization: “let's buy more memory”
Future: web-based applications (Google Maps) ?
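
One reading of the "Google Maps" remark is a tiled image pyramid, where a viewer loads only the tiles it needs at the current zoom level. The sketch below works out the pyramid for a single 8k x 8k slice under that assumption. The 256-pixel tile size and the halving scheme are illustrative choices, not something the slides specify.

```python
import math


def tile_pyramid(width, height, tile=256):
    """List (level, columns, rows) for a tiled image pyramid.

    Level 0 is the full-resolution image; each following level halves the
    width and height until the whole image fits into a single tile.
    """
    levels = []
    level = 0
    while True:
        cols = math.ceil(width / tile)
        rows = math.ceil(height / tile)
        levels.append((level, cols, rows))
        if cols == 1 and rows == 1:
            break
        width = math.ceil(width / 2)
        height = math.ceil(height / 2)
        level += 1
    return levels


pyramid = tile_pyramid(8192, 8192)
total_tiles = sum(cols * rows for _, cols, rows in pyramid)

for level, cols, rows in pyramid:
    print(f"level {level}: {cols} x {rows} tiles")
print("tiles per slice:", total_tiles)
```

A client showing a 1024 x 1024 viewport then needs only a few dozen tiles at a time, rather than the whole reconstruction in memory.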

The Electron Microscopy Data Bank

Contains EM-derived density maps
Complementary to coordinate sets in PDB
Established 2002 at EBI (Kim Henrick)
Web-based submission and retrieval
Hand-curated (R. Newman)

A bit like eBay, and you won't make any money, either

THE ELECTRON MICROSCOPY DATA BANK

A Unified Data Resource for EM

NIH-funded joint project

Baylor College of Medicine, Houston (W. Chiu, M. Baker)

Rutgers University, New Jersey (H. Berman, C. Lawson)

PDBe, EBI, Cambridge, UK (K. Henrick, C. Best, R. Newman)

Baylor College of Medicine, Houston, TX

Rutgers University, Piscataway, NJ

European Bioinformatics Institute, Cambridge, UK

Characteristics

Curated community archive: PDB and EMDB
NIH, EU (in the past), and BBSRC funding (+ EMBL)
Worldwide cooperation
Advisory boards and task forces from the community
Open deposition and retrieval
  → Alternative access systems by other institutions
760 entries, 26 GB data
ca. 100 entries/year
Curation both in Europe and US

Growth of EMDB

EMDep deposition system
750 entries, current rate approx. 15-20/month
Contents of an entry:
  Metadata (XML header) → experimental metadata
  Map (any format, converted to CCP4/MRC)
  Additional files

Java/Tomcat/XML
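
To make the entry structure concrete, the sketch below reads a small XML header with Python's standard library. The header is hypothetical: element names, attribute names, and the identifier are invented for illustration and do not reproduce the actual EMDB schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical header: element and attribute names are invented for this
# sketch and do not reproduce the real EMDB schema.
HEADER = """\
<entry id="EXAMPLE-0001">
  <title>Example density map</title>
  <method>single particle</method>
  <resolution units="angstrom">4.0</resolution>
  <map file="example_0001.map" format="CCP4"/>
</entry>
"""


def read_header(xml_text):
    """Extract a flat dictionary of metadata from the hypothetical header."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "method": root.findtext("method"),
        "resolution": float(root.findtext("resolution")),
        "map_file": root.find("map").get("file"),
    }


print(read_header(HEADER))
```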

Unified data resource plan

Joint deposition system

EMDB search system (Java/Tomcat)

EMDB Atlas pages (XSLT)

ISSUES

Metadata management

Difficult: many rounds of consulting the community
Still, most fields remain empty
Data harvesting:
  LIMS, PIMS -> rarely used
  Processing pipelines, image processing software -> lack of standards, idiosyncrasies
Image formats: appalling lack of standards
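
How empty the metadata are can be tracked with a simple completeness measure. The sketch below computes the fraction of filled fields in one record. The field names and values are hypothetical and serve only to show the calculation.

```python
def completeness(record):
    """Fraction of metadata fields that carry a non-empty value."""
    if not record:
        return 0.0
    filled = sum(1 for value in record.values() if value not in (None, "", []))
    return filled / len(record)


# Hypothetical deposition record; the field names are illustrative only.
record = {
    "title": "Example density map",
    "microscope": "",
    "voltage_kv": None,
    "detector": "",
    "software": "unknown package",
}

print(f"{completeness(record):.0%} of fields filled")
```

Averaged over all entries and reported per field, such a number would show which fields depositors routinely skip.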

Data issues

Current: deposit final result of experiment and computation

How much of the original/intermediate data should be deposited?

Issues:
  Cost / practicability
  Reproducibility of experiment
  Intellectual property (unexploited results?)
  Usefulness

Non-data issues

Embargo:
  Image data can be withheld up to two years
  Allows original researcher to further exploit them
  Journals and funders must define:
    what data must be deposited
    when they are to be released

Quality standards:
  Require community acceptance
  Technically difficult
  Data Bank does enrich/annotate, but does not do science → quality standards must be set by scientists

Image data formats

Current:
  Variety of historical ad hoc formats
  Unclear definitions, variations in different software

Need:
  Interoperability
  Standards
  Technical level? Acceptance? → Question for the community

HDF5:
  Common container format to deal with numerical data
  Heavyweight library, but widely available (but Java?)
  Would at least solve low-level format problems
  Metadata format still needs to be specified
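
Byte order is one example of the low-level problems a container format would remove. The sketch below reads the first four 32-bit words (NX, NY, NZ, MODE) of an MRC/CCP4-style 1024-byte header and shows what happens when the reader guesses the byte order wrongly. The header is synthetic, built in memory for the demonstration, and the function name is illustrative.

```python
import struct

HEADER_SIZE = 1024  # fixed header length of the MRC/CCP4 map format


def read_map_dimensions(header, byte_order="<"):
    """Read NX, NY, NZ and MODE from the first four 32-bit header words."""
    if len(header) < HEADER_SIZE:
        raise ValueError("header is shorter than 1024 bytes")
    nx, ny, nz, mode = struct.unpack(byte_order + "4i", header[:16])
    return {"nx": nx, "ny": ny, "nz": nz, "mode": mode}


# Synthetic little-endian header: 64 x 64 x 64 map of 32-bit floats (mode 2).
fake_header = struct.pack("<4i", 64, 64, 64, 2).ljust(HEADER_SIZE, b"\x00")

print(read_map_dimensions(fake_header))       # matching byte order
print(read_map_dimensions(fake_header, ">"))  # wrong byte order: nonsense sizes
```

A self-describing container records the byte order and data type itself, so readers do not have to guess. The meaning of the metadata would still need a separate specification.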

Ontologies

Systematic way to define:
  classes of objects
  attributes of these objects
  relationships between objects

Provides framework for metadata models
Advantage: powerful formal method
Disadvantage: not yet widely used
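
The three ingredients (classes, attributes, relationships) can be shown in a few lines. The sketch below is a toy model, not an ontology language such as OWL, and every term in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OntologyClass:
    """A class of objects with named attributes and an optional parent class."""
    name: str
    attributes: List[str] = field(default_factory=list)
    parent: Optional["OntologyClass"] = None

    def all_attributes(self):
        """Attributes defined here plus those inherited from parent classes."""
        inherited = self.parent.all_attributes() if self.parent else []
        return inherited + self.attributes


# Hypothetical terms, chosen only to illustrate the three ingredients.
image = OntologyClass("Image", ["pixel_size", "dimensions"])
micrograph = OntologyClass("Micrograph", ["voltage", "defocus"], parent=image)
density_map = OntologyClass("DensityMap", ["resolution"], parent=image)

# Relationships between objects as (subject class, relation, object class).
relationships = [(density_map.name, "reconstructedFrom", micrograph.name)]

print(micrograph.all_attributes())
print(relationships)
```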

TECHNICAL DEVELOPMENTS

Rich data sets

Submissions consist of
  maps (increasingly more than one)
  relations between data sets → unexpressed

XML-based standards for representing relationships between data:
  Subject-predicate-object relationships (RDF framework)

Harvesting interface to EM processing software

Web-based visualization for submission and retrieval; complex submissions assembled interactively (AJAX)
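
Subject-predicate-object relationships can be sketched without any RDF library. The example below stores triples as plain tuples and follows them to recover what a data set was derived from. The identifiers and predicates are hypothetical, and the sketch assumes every predicate points from a derived item to its source.

```python
# Hypothetical identifiers and predicates; a real system would use URIs
# from an agreed vocabulary.
triples = [
    ("map:1", "derivedFrom", "stack:1"),
    ("stack:1", "extractedFrom", "micrographs:1"),
    ("map:2", "derivedFrom", "stack:1"),
    ("model:1", "fittedInto", "map:1"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def provenance(triples, item):
    """Follow outgoing links from item and collect everything it depends on."""
    seen = []
    stack = [item]
    while stack:
        current = stack.pop()
        for _, _, target in match(triples, subject=current):
            if target not in seen:
                seen.append(target)
                stack.append(target)
    return seen


print(match(triples, predicate="derivedFrom"))
print(provenance(triples, "model:1"))
```

With relations stored this way, a submission holding several maps can state how they relate, which is what the current entries leave unexpressed.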

Rich data submissions

Possible XML representation

Bioimage informatics tools

Current EMDB interface: simple and efficient, but must be extended to accommodate more complex experiments

OMERO interface: geared at labs, not public databases
All the beauty of AJAX
High-performance visualization
Multichannel images
Lab notebook
Tagging
Image markup

Bioimage informatics tools

BISQUE/BISQUIK (UCSB)

No Standards

Experiment? Image? Analytics? Annotations?

Current Imaging Workflow Paradigm

Jason Swedlow (U. Dundee)

Towards Image Informatics

OMERO in 2007/8/9

Jason Swedlow (Univ. Dundee)

CONCLUSIONS

A Virtual Research Community

Diagram labels:
  Users
  Imaging centers
  Databases
  Software
  Grid/cloud computing/storage, in-house storage
  Storage and computing engines
  Acquisition, storage, and management of images
  Storage, distribution, quality assessment
  Data submission
  Data harvesting

CONCLUSIONS

Community databases are a central part of the scientific data infrastructure

Image databases rapidly growing
Technical challenges: data formats, size
Standards and interoperability
Improve metadata collection
Keep the community engaged
