Data Management Support for Life Sciences or What can we do for the Life Sciences? Mourad Ouzzani [email protected]


The Big Picture

[Figure: Purdue BioScience Pipeline. Labeled components: Internet; Web Service; Biological Databases; Experiments (Proteomics, Metabolomics, & Cytomics); Samples; Storage System (Raw Data, Processed Data); Data Streaming System; Online Analysis; Data Mining; Databases Integration System.]

[Figure inset: raw MS file processing. Labeled stages: Noise filtering; Peak deconvolution; Isotope fitting; Charge fitting; Peak group selection; Raw peak group; Single-scan clusters; Multi-scan clusters; Chromatographic refinement; Priority list; Doublet detection; Mixed doublet rescue; Calculation of differentials; Protein identification.]
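The "Doublet detection" stage named in the pipeline figure can be illustrated with a minimal sketch. It assumes a doublet is a pair of peaks whose m/z values differ by a known label mass difference divided by the charge state. The function name, the `delta` value, and the tolerance are hypothetical and are not taken from the pipeline itself.

```python
def find_doublets(mzs, delta, charge=1, tol=0.01):
    """Return (light, heavy) m/z pairs spaced by delta / charge, within tol.

    delta is the assumed mass difference between the light and heavy labels;
    the value to use depends on the labeling chemistry and is not given here.
    """
    mzs = sorted(mzs)
    spacing = delta / charge
    pairs = []
    for i, light in enumerate(mzs):
        for heavy in mzs[i + 1:]:
            diff = heavy - light
            if diff > spacing + tol:
                break  # the list is sorted, so later peaks are even further away
            if abs(diff - spacing) <= tol:
                pairs.append((light, heavy))
    return pairs
```

For singly charged ions and a hypothetical label difference of 3.0, `find_doublets([500.0, 503.0, 504.0, 510.0], delta=3.0)` pairs 500.0 with 503.0 and ignores the other peaks.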

Sample Data

Intelligent Instrument Control

Protein Identification

Massive Data Storage

Non Diseased Sample

Samples Preparation & Transformation (LC, isotoping, etc.)

Data Analysis

Data Mining

Mass Spectra

Mass Spectrometer

Diseased Sample

Step 1 Step 2 Step 3

Issues

Time-sensitive data
Limited sample quantities
Experiments repetition
Massive data


Instrument Vendor PC (2)

Instrument Intelligence (3)

Mass Spectrometer (1)

Databases (4) Raw Data Archives (5)

Samples

Network

Benefits

The outcome of IIC will be biological knowledge instead of raw mass spectra.

The biological knowledge is backed up by data acquired by IIC.

Scientists do not need to review the raw mass spectra.

Data Flow in IIC

Nile Support and others

IIC Issues

IIC system development
Non-proprietary API for both data collection and control of the instrument
Optimized storage for massive data (instrument output and sequences)
etc.

Data Stream Issues

Data filters that identify interesting data and reduce chemical noise
Algorithms for rapid identification of the base peaks and the number of peaks in the spectrum
Algorithms for prediction of upcoming peaks
Online statistical analysis over the streams
Data summaries on different granularities
etc.
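As an illustration of the stream issues listed above, the following minimal sketch combines three of them: online statistics over a stream of intensities, a noise filter driven by those statistics, and base peak identification. It uses Welford's one-pass algorithm, which is a standard technique chosen here for illustration and not necessarily the one intended by the author. The class and function names, and the threshold factor `k`, are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; 0.0 until two values have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def filter_scan(peaks, stats, k=3.0):
    """Keep (mz, intensity) peaks above mean + k * std of earlier intensities.

    The threshold is taken from the statistics of previous scans only; the
    current scan is folded into the statistics after filtering. Nothing is
    kept until at least two intensities have been observed.
    """
    threshold = stats.mean + k * stats.std
    kept = [(mz, inten) for mz, inten in peaks
            if stats.n > 1 and inten > threshold]
    for _, inten in peaks:
        stats.update(inten)
    return kept


def base_peak(peaks):
    """Return the most intense (mz, intensity) pair, or None for an empty scan."""
    return max(peaks, key=lambda p: p[1], default=None)
```

Because the statistics are updated incrementally, memory use stays constant regardless of how many scans have been streamed.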

Data Integration

[Figure: non-glycosylated peptide identification. Labels: MS/MS Spectra; preprocessing; database search; de novo sequencing; stats auto-validation; protein validation; Protein List; example sequence APLIXYXCLIKWDYR.]

Data Integration and Informatics

[Figure: system architecture diagram. Labels: Web Browser; WebGlycoManager; Request Handler; Metadata Repository; Database Discovery; Database Locator; Mapping Agent; Query Optimizer; Execution Engine; Wrappers; Queries; Web Service Access; Web Service Consumer; Web Service Description (WSDL); Web Service Invocation (SOAP); Informatics Toolbox (Non-glycosylated Peptide Identification, Glycosylated Peptide Identification); Biological Databases (Glycoprotein Databases, Other Protein Databases).]

Data Integration Issues

Databases description and organization
Schema mediation
Annotation and provenance
Use of model management techniques
Query processing and optimization
Web-service access
Implementation and deployment
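To make the schema mediation and wrapper ideas concrete, here is a minimal sketch of a mediator that queries several sources through wrappers and returns records in one shared schema. Everything in it is hypothetical: the mediated field names, the `Wrapper` class, and the in-memory sources stand in for real databases, and the slides do not prescribe this design.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]

# Hypothetical mediated (global) schema; the slides do not define one.
MEDIATED_FIELDS = ("protein_id", "sequence", "glycosylation_site")


class Wrapper:
    """Translates one source's native records into the mediated schema."""

    def __init__(self, name: str, fetch: Callable[[], Iterable[Record]],
                 mapping: Dict[str, str]) -> None:
        self.name = name
        self._fetch = fetch      # returns the source's native records
        self._mapping = mapping  # mediated field -> native field

    def query(self, predicate: Callable[[Record], bool]) -> Iterator[Record]:
        for native in self._fetch():
            record: Record = {}
            for field in MEDIATED_FIELDS:
                native_field = self._mapping.get(field)
                # Fields the source does not provide are left as None.
                record[field] = native.get(native_field) if native_field else None
            record["source"] = self.name  # simple provenance annotation
            if predicate(record):
                yield record


def mediated_query(wrappers: Iterable[Wrapper],
                   predicate: Callable[[Record], bool]) -> List[Record]:
    """Run the same predicate against every wrapped source and merge results."""
    results: List[Record] = []
    for wrapper in wrappers:
        results.extend(wrapper.query(predicate))
    return results
```

A query such as `mediated_query(wrappers, lambda r: r["glycosylation_site"] is not None)` then runs unchanged against sources whose native field names differ, and each returned record states which source it came from.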

Requirements

Data types diversity: sequences, graphs, 3D structures, etc.
Unconventional queries: similarity, pattern matching, etc.
Uncertainty (probability)
Data curation: cleaning and annotation
Data provenance (pedigree)
Large scale: 100s of DBs
Terminology management (semantics)
etc.
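One example of an unconventional, pattern matching query is a motif search over protein sequences. The sketch below assumes the commonly cited N-linked glycosylation sequon N-X-S/T, where X is any residue except proline. That motif is brought in for illustration and does not appear in the slides, and the function name is hypothetical.

```python
import re

# Lookahead so that overlapping motif occurrences are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return (1-based position, motif) for each N-X-S/T match, X != P."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]


print(find_sequons("MNGSANPTNNTS"))
# [(2, 'NGS'), (9, 'NNT'), (10, 'NTS')]
```

The candidate at position 6 (`NPT`) is rejected because of the proline, and the matches at positions 9 and 10 overlap, which a plain non-overlapping search would miss.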

Data Correlation

Non-overlapping Schemas (different instruments or scales of resolution)

Contradictory information (experiments with different assumptions)

Comparing data only after matching their context (constraints)
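The rule of comparing data only after matching their context can be sketched as a grouping step followed by a consistency check. The record layout (`entity`, `context`, `value`), the context keys, and the tolerance are assumptions made for this example.

```python
from collections import defaultdict


def correlate(measurements, context_keys, tolerance):
    """Compare values only within groups that share entity and context.

    Returns {(entity, context): "consistent" | "contradictory"} for every
    group holding at least two measurements. Measurements taken under
    different contexts are never compared with each other.
    """
    groups = defaultdict(list)
    for m in measurements:
        context = tuple(m["context"].get(k) for k in context_keys)
        groups[(m["entity"], context)].append(m["value"])

    report = {}
    for key, values in groups.items():
        if len(values) > 1:
            spread = max(values) - min(values)
            report[key] = "contradictory" if spread > tolerance else "consistent"
    return report
```

Two measurements of the same protein from different instruments therefore never produce a contradiction report, because their contexts do not match.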

Other Issues

?

IIC Information Flow

[Flowchart: a sample passes through Step 1, Step 2, and Step 3. Decision nodes (Y/N): Interesting ions?; Empty priority list?; QA/QC?. Other nodes: Priority list of interesting ions; Peptide identification; Protein identification; External Databases query.]
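The information flow can be read as a feedback loop driven by a priority list. The sketch below is one possible reading under stated assumptions: ions are ranked by intensity, the instrument is represented by a `fragment` callback, and the QA/QC and identification steps are passed in as functions. All names are hypothetical, and the exact branching of the original flowchart is not recoverable from the slide text.

```python
import heapq


def run_iic(survey_scans, is_interesting, fragment, passes_qc, identify):
    """Process survey scans, following up on interesting ions in priority order.

    survey_scans: iterable of scans, each a list of (mz, intensity) pairs.
    is_interesting, fragment, passes_qc, identify: caller-supplied callbacks.
    """
    identifications = []
    for scan in survey_scans:
        # Build the priority list; heapq is a min-heap, so negate intensity
        # to pop the most intense ion first.
        priority = [(-inten, mz) for mz, inten in scan
                    if is_interesting(mz, inten)]
        heapq.heapify(priority)

        # Work through the list until it is empty.
        while priority:
            _, mz = heapq.heappop(priority)
            spectrum = fragment(mz)        # ask the instrument for MS/MS data
            if not passes_qc(spectrum):    # QA/QC gate
                continue
            result = identify(spectrum)    # peptide or protein identification
            if result is not None:
                identifications.append((mz, result))
    return identifications
```

Passing the instrument and analysis steps as callbacks keeps the loop independent of any vendor interface, in line with the non-proprietary API concern raised earlier in the deck.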


Algorithms design
Spectra deconvolution
Online analysis (protein/peptide identification)
Online peaks identification for feedback
Data filters and noise removal
Prediction of upcoming peaks

Experimental simulation
In silico generation of spectrum
Algorithms simulation


Experimental settings
Selection of a biology system, e.g., yeast
Two types of experiments: target analysis and global analysis

Integration with the instrument
Data collection
Control of the instrument
API
Actual implementation (algorithms)


Online data mining

Other issues:
Optimized storage of massive data
Data representation (streams, database)

Integrated Access to Glycoprotein Databases

Informatics tools
Glycosylated peptide identification
Non-glycosylated peptide identification

Enabling uniform access to different glycoprotein databases

Databases description and organization

Schema mediation

Integrated Access to Glycoprotein Databases

Query processing
Data correlation
Non-overlapping schemas
Contradictory information
Sequence alignment
Web service enabled access
Target databases selection (focus)
