Data Management Support for Life Sciences or What can we do for the Life Sciences? Mourad Ouzzani mourad@cs.purdue.edu

Page 1: Data Management Support for Life Sciences or What can we do for the Life Sciences? Mourad Ouzzani mourad@cs.purdue.edu

Page 2: The Big Picture

[Figure: overview of the Purdue BioScience Pipeline. Labels recovered from the diagram (their layout and connections are not recoverable):
- Experiments (Proteomics, Metabolomics, & Cytomics), Samples, Sample Data
- Storage System: Raw Data, Processed Data
- Data Streaming System, Online Analysis, Data Mining, Databases Integration System
- Internet, Web Service, Biological Databases
- Processing steps: Raw MS file, Noise filtering, Peak group selection, Raw peak group, Peak Deconvolution, Isotope fitting, Charge fitting, Single scan clusters, Multi-scan clusters, Chromatographic refinement, Priority list, Doublet detection, Mixed doublet rescue, Calculation of Differentials, Protein identification]

Page 3: Intelligent Instrument Control (IIC)

Page 4

[Figure: experiment workflow, annotated Step 1, Step 2, Step 3. Labels: Diseased Sample, Non Diseased Sample, Samples Preparation & Transformation (LC, isotoping, etc.), Mass Spectrometer, Mass Spectra, Data Analysis, Data Mining, Protein Identification, Massive Data Storage.]

Page 5: Issues

- Time-sensitive data
- Limited sample quantities
- Repetition of experiments
- Massive data

Page 6: Intelligent Instrument Control

[Figure: IIC components connected over a network. Labels: Samples, Mass Spectrometer (1), Instrument Vendor PC (2), Instrument Intelligence (3), Databases (4), Raw Data Archives (5), Network.]

Page 7: Benefits

The outcome of IIC will be biological knowledge instead of raw mass spectra.

The biological knowledge is backed up by data acquired by IIC.

Scientists do not need to review the raw mass spectra.

Page 8: Data Flow in IIC

Page 9: Nile Support and others

Page 10: IIC Issues

- IIC system development
- Non-proprietary API for both data collection and control of the instrument
- Optimized storage for massive data (instrument output and sequences)
- etc.
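
To make the idea of a non-proprietary API covering both data collection and instrument control more concrete, here is a minimal Python sketch. Every name in it (`Scan`, `InstrumentAPI`, `read_scans`, `request_fragmentation`, `SimulatedInstrument`) is hypothetical and chosen for illustration only; it does not describe any vendor's interface or the IIC system's actual design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Scan:
    """One mass spectrum: a scan number and its (m/z, intensity) peaks."""
    scan_id: int
    peaks: List[Tuple[float, float]]


class InstrumentAPI(ABC):
    """Hypothetical vendor-neutral interface: collection plus control."""

    @abstractmethod
    def read_scans(self) -> Iterator[Scan]:
        """Data collection: yield scans as the instrument produces them."""

    @abstractmethod
    def request_fragmentation(self, mz: float) -> None:
        """Control: ask the instrument to fragment the ion at this m/z."""


class SimulatedInstrument(InstrumentAPI):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self, scans: List[Scan]) -> None:
        self._scans = scans
        self.requested: List[float] = []

    def read_scans(self) -> Iterator[Scan]:
        yield from self._scans

    def request_fragmentation(self, mz: float) -> None:
        self.requested.append(mz)


def fragment_base_peaks(instrument: InstrumentAPI) -> None:
    """Toy control loop: request fragmentation of each scan's most intense peak."""
    for scan in instrument.read_scans():
        if scan.peaks:
            base_mz, _ = max(scan.peaks, key=lambda p: p[1])
            instrument.request_fragmentation(base_mz)
```

Because `fragment_base_peaks` depends only on the abstract interface, the same control logic could in principle run against a simulator or a real instrument adapter.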

Page 11: Data Stream Issues

- Data filters that identify interesting data and reduce chemical noise
- Algorithms for rapid identification of the base peaks and the number of peaks in the spectrum
- Algorithms for prediction of upcoming peaks
- Online statistical analysis over the streams
- Data summaries on different granularities
- etc.
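
A rough Python sketch of three of the stream operations listed above: a noise filter, base-peak identification, and an online statistic. It assumes a spectrum is a list of (m/z, intensity) pairs and that noise can be removed with a fixed intensity cutoff; both are simplifications made for illustration. The running mean and variance use Welford's online algorithm, picked here as one standard single-pass method rather than anything the slides prescribe.

```python
from typing import Iterable, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def filter_noise(peaks: Iterable[Peak], threshold: float) -> List[Peak]:
    """Keep peaks whose intensity is at least `threshold` (an assumed cutoff)."""
    return [p for p in peaks if p[1] >= threshold]


def base_peak(peaks: List[Peak]) -> Peak:
    """Return the most intense peak of a non-empty spectrum."""
    return max(peaks, key=lambda p: p[1])


class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize_stream(scans: Iterable[List[Peak]], threshold: float) -> RunningStats:
    """Track base-peak intensity statistics across a stream of scans."""
    stats = RunningStats()
    for scan in scans:
        kept = filter_noise(scan, threshold)
        if kept:  # scans with no peak above the cutoff are skipped
            stats.update(base_peak(kept)[1])
    return stats
```

The number of peaks in a filtered spectrum is simply `len(filter_noise(scan, threshold))`; nothing here stores past scans, which is the point of processing the data as a stream.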

Page 12: Data Integration

Page 13: Non-glycosylated peptide identification

[Figure: non-glycosylated peptide identification pipeline. Labels: MS/MS Spectra, preprocessing, de novo sequencing, database search, stats auto-validation, protein validation, Protein List, and the example sequence APLIXYXCLIKWDYR.]

Page 14: Data Integration and Informatics

[Figure: architecture of the data integration system. Labels recovered from the diagram (connections are not recoverable):
- Clients: Web Browser, Web Service Consumer
- Web service layer: Web Service Description (WSDL), Web Service Invocation (SOAP), Web Service Access, Queries
- Core components: Request Handler, WebGlycoManager, Metadata Repository, Database Discovery, Database Locator, Mapping Agent, Query Optimizer, Execution Engine, Wrappers
- Data sources: Glycoprotein Databases, Other Protein Databases, Biological Databases
- Informatics Toolbox: Non-glycosylated peptide identification, Glycosylated peptide identification]

Page 15: Data Integration Issues

- Databases description and organization
- Schema mediation
- Annotation and provenance
- Use of model management techniques
- Query processing and optimization
- Web-service access
- Implementation and deployment

Page 16: Requirements

- Diversity of data types: sequences, graphs, 3D structures, etc.
- Unconventional queries: similarity, pattern matching, etc.
- Uncertainty (probability)
- Data curation: cleaning and annotation
- Data provenance (pedigree)
- Large scale: 100s of DBs
- Terminology management (semantics)
- etc.
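
As an illustration of what an "unconventional" similarity query over sequences could look like, the following Python sketch ranks stored sequences by the Jaccard similarity of their k-mer sets. The k-mer/Jaccard measure, the function names, and the toy sequences are all assumptions made for this example; real systems often rely on alignment-based search instead.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All length-k substrings of `seq` (empty set if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_query(
    query: str, db: Dict[str, str], k: int = 3, min_score: float = 0.3
) -> List[Tuple[str, float]]:
    """Return (name, score) for stored sequences scoring at least `min_score`, best first."""
    qk = kmers(query, k)
    hits = [(name, jaccard(qk, kmers(seq, k))) for name, seq in db.items()]
    hits = [h for h in hits if h[1] >= min_score]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Unlike an exact-match lookup, the result is a ranked list with scores, which is why such queries also tie into the uncertainty requirement.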

Page 17: Data Correlation

Non-overlapping Schemas (different instruments or scales of resolution)

Contradictory information (experiments with different assumptions)

Comparing data only after matching their context (constraints)
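
One way to read "comparing data only after matching their context" is sketched below in Python: two measurements are compared only when a chosen set of context attributes agrees. The `Measurement` record, its fields, and the context keys are hypothetical and serve only to illustrate the constraint check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Measurement:
    protein: str
    abundance: float
    context: Dict[str, str]  # e.g. organism, instrument, sample preparation


def contexts_match(a: Measurement, b: Measurement, keys: List[str]) -> bool:
    """True only if both measurements define every key and agree on its value."""
    return all(
        k in a.context and k in b.context and a.context[k] == b.context[k]
        for k in keys
    )


def compare(a: Measurement, b: Measurement, keys: List[str]) -> Optional[float]:
    """Return the abundance ratio a/b, or None when the two are not comparable."""
    if a.protein != b.protein or not contexts_match(a, b, keys):
        return None
    if b.abundance == 0:
        return None
    return a.abundance / b.abundance
```

Returning `None` instead of a number makes "not comparable" explicit, so apparently contradictory values from experiments run under different assumptions are never silently combined.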

Page 18: Other Issues

?

Page 19: IIC Information Flow

[Figure: flowchart of the IIC information flow, annotated Step 1, Step 2, Step 3. Decision nodes with Y/N branches: Interesting ions?, Empty priority list?, QA/QC?. Other nodes: sample, Priority list of interesting ions, Peptide identification, Protein identification, External Databases query. The branch targets are not recoverable from the extracted text.]
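
The flowchart mentions a priority list of interesting ions and a check for whether that list is empty. A minimal Python sketch of such a structure follows. Ranking by intensity and the fixed intensity cutoff are assumptions made for the example; the slides do not say how ions are judged interesting or how they are ordered.

```python
import heapq
from typing import Iterable, List, Optional, Tuple


class IonPriorityList:
    """Priority list of ions; the most intense ion is served first (assumed ranking)."""

    def __init__(self) -> None:
        # heapq is a min-heap, so intensities are stored negated.
        self._heap: List[Tuple[float, float]] = []  # (-intensity, m/z)

    def add(self, mz: float, intensity: float) -> None:
        heapq.heappush(self._heap, (-intensity, mz))

    def is_empty(self) -> bool:
        return not self._heap

    def pop(self) -> Optional[float]:
        """Return the m/z of the highest-priority ion, or None when the list is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]


def select_interesting(
    peaks: Iterable[Tuple[float, float]], min_intensity: float
) -> IonPriorityList:
    """Build a priority list from (m/z, intensity) peaks above an assumed cutoff."""
    plist = IonPriorityList()
    for mz, intensity in peaks:
        if intensity >= min_intensity:
            plist.add(mz, intensity)
    return plist
```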

Page 20: Intelligent Instrument Control

- Algorithms design
  - Spectra deconvolution
  - Online analysis (protein/peptide identification)
  - Online peak identification for feedback
  - Data filters and noise removal
  - Prediction of upcoming peaks
- Experimental simulation
  - In silico generation of spectrum
  - Algorithms simulation
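
As one possible reading of "in silico generation of spectrum", the Python sketch below computes the singly charged b- and y-ion m/z values of a peptide, a common simplified model of a tandem mass spectrum. The choice of this fragment-ion model is an assumption, not something the slides specify. The monoisotopic residue masses are standard values rounded to five decimals and should be checked against a reference table before any real use; intensities, modifications, and higher charge states are ignored.

```python
from typing import Dict, List

# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide: str) -> List[float]:
    """Singly charged b-ion m/z values b1..b(n-1), built from the N-terminus."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide: str) -> List[float]:
    """Singly charged y-ion m/z values y1..y(n-1), built from the C-terminus."""
    masses, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses


def theoretical_spectrum(peptide: str) -> List[float]:
    """Sorted list of all b- and y-ion m/z values for the peptide."""
    return sorted(b_ions(peptide) + y_ions(peptide))
```

A simulated spectrum of this kind gives the identification and feedback algorithms a known ground truth to be tested against before they are run on a real instrument.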

Page 21: Intelligent Instrument Control

- Experimental settings
  - Selection of a biological system, e.g., yeast
  - Two types of experiments
    - Target analysis
    - Global analysis
- Integration with the instrument
  - Data collection
  - Control of the instrument
  - API
  - Actual implementation (algorithms)

Page 22: Intelligent Instrument Control

- Online data mining
- Other issues:
  - Optimized storage of massive data
  - Data representation (streams, database)

Page 23: Integrated Access to Glycoprotein Databases

- Informatics tools
  - Glycosylated peptide identification
  - Non-glycosylated peptide identification
- Enabling uniform access to different glycoprotein databases
- Databases description and organization
- Schema mediation
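
Schema mediation can be pictured as translating each source's field names into one shared (mediated) schema, so that a single query runs uniformly over all sources. The Python sketch below illustrates this with two made-up source schemas; the database names, field names, and the `MAPPINGS` table are hypothetical and do not describe any real glycoprotein database.

```python
from typing import Dict, List

# Hypothetical mappings: source field name -> mediated field name.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "db_a": {"prot_id": "protein_id", "glyco_site": "site", "org": "organism"},
    "db_b": {"accession": "protein_id", "position": "site", "species": "organism"},
}


def to_mediated(source: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename a source record's fields to the mediated schema; unmapped fields are dropped."""
    mapping = MAPPINGS[source]
    return {mapping[f]: v for f, v in record.items() if f in mapping}


def query_all(
    sources: Dict[str, List[Dict[str, object]]], **conditions: object
) -> List[Dict[str, object]]:
    """Run one equality query, phrased in mediated field names, over every source."""
    results = []
    for name, records in sources.items():
        for rec in records:
            mediated = to_mediated(name, rec)
            if all(mediated.get(k) == v for k, v in conditions.items()):
                # Keep the originating source with each result as simple provenance.
                results.append({**mediated, "source": name})
    return results
```

For example, `query_all(sources, organism="yeast")` would return matching records from both sources without the caller needing to know that one calls the field `org` and the other `species`.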

Page 24: Integrated Access to Glycoprotein Databases

- Query processing
- Data correlation
  - Non-overlapping schemas
  - Contradictory information
  - Sequence alignment
- Web service enabled access
- Target databases selection (focus)
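
Sequence alignment is listed above as one of the data correlation tasks. As a small illustration, here is a Python sketch of the Needleman-Wunsch global alignment score computed by dynamic programming. The scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for the example; practical protein alignment would normally use a substitution matrix and tuned gap penalties.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Needleman-Wunsch: best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)

    return score[-1][-1]
```

The table has (len(a)+1) x (len(b)+1) cells, so the cost grows with the product of the two sequence lengths.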