Data Management Support for Life Sciences
or: What can we do for the Life Sciences?
Mourad [email protected]
The Big Picture
[Diagram: the Purdue BioScience Pipeline. Components: samples and experiments (proteomics, metabolomics, and cytomics); data streaming system (online analysis, data mining); storage system (raw data, processed data); databases integration system; biological databases reached over the Internet through a web service.
Raw MS file processing labels: noise filtering, peak group selection, raw peak group, peak deconvolution, isotope fitting, charge fitting, single-scan clusters, multi-scan clusters, chromatographic refinement, priority list, doublet detection, mixed doublet rescue, calculation of differentials, protein identification.]
[Diagram: workflow in three steps (Step 1, Step 2, Step 3). Labels: diseased sample and non-diseased sample; samples preparation and transformation (LC, isotoping, etc.); mass spectrometer; mass spectra; sample data; intelligent instrument control; protein identification; massive data storage; data analysis; data mining.]
Issues
Time-sensitive data
Limited sample quantities
Experiment repetition
Massive data
Intelligent Instrument Control (IIC)
[Diagram: numbered components: (1) mass spectrometer, (2) instrument vendor PC, (3) instrument intelligence, (4) databases, (5) raw data archives. Other labels: samples, network.]
Benefits
The outcome of IIC will be biological knowledge instead of raw mass spectra.
The biological knowledge is backed up by data acquired by IIC.
Scientists do not need to review the raw mass spectra.
Data Flow in IIC
Nile Support and others
IIC Issues
IIC system development
Non-proprietary API for both data collection and control of the instrument
Optimized storage for massive data (instrument output and sequences)
etc.
Data Stream Issues
Data filters that identify interesting data and reduce chemical noise
Algorithms for rapid identification of the base peaks and the number of peaks in the spectrum
Algorithms for prediction of upcoming peaks
Online statistical analysis over the streams
Data summaries at different granularities
etc.
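The filtering and base-peak items can be illustrated with a minimal sketch. It assumes a scan is a list of (m/z, intensity) peaks and uses a simple relative-intensity cutoff as the noise filter. The `Peak` type, the function names, the 5% default threshold, and the sample values are all hypothetical and not taken from the slides; a real chemical-noise filter would likely be more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # mass-to-charge ratio
    intensity: float  # detector response, arbitrary units


def filter_noise(peaks, min_relative_intensity=0.05):
    """Keep peaks at or above a fraction of the most intense peak.

    The relative cutoff is an assumed, simplistic noise model.
    """
    if not peaks:
        return []
    top = max(p.intensity for p in peaks)
    return [p for p in peaks if p.intensity >= min_relative_intensity * top]


def summarize_scan(peaks, min_relative_intensity=0.05):
    """Return (base_peak, peak_count) for one scan after noise filtering."""
    kept = filter_noise(peaks, min_relative_intensity)
    if not kept:
        return None, 0
    base_peak = max(kept, key=lambda p: p.intensity)  # most intense peak
    return base_peak, len(kept)


# Made-up scan: two weak peaks fall below 5% of the strongest one.
scan = [Peak(400.2, 12.0), Peak(512.3, 900.0), Peak(513.3, 450.0), Peak(700.1, 30.0)]
base_peak, peak_count = summarize_scan(scan)
```

Both functions make a single pass over the scan, which is the property that matters when the result has to feed back to the instrument between scans.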
Data Integration
[Diagram: non-glycosylated peptide identification. Labels: MS/MS spectra, preprocessing, de novo sequencing (example sequence APLIXYXCLIKWDYR), database search, stats auto-validation, protein validation, protein list.]
Data Integration and Informatics
[Architecture diagram. Labels: web browser; WebGlycoManager; request handler; metadata repository; informatics toolbox (non-glycosylated peptide identification, glycosylated peptide identification); database discovery; database locator; mapping agent; query optimizer; execution engine; wrappers; web service consumer; web service description (WSDL); web service invocation (SOAP); queries; web service access; biological databases (glycoprotein databases, other protein databases).]
Data Integration Issues
Databases description and organization
Schema mediation
Annotation and provenance
Use of model management techniques
Query processing and optimization
Web-service access
Implementation and deployment
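As an illustration of schema mediation through wrappers, the sketch below maps two sources with different field names onto one mediated schema and answers a query over both. Everything in it is an assumption made for illustration: the mediated field names, the `Wrapper` class, the in-memory records standing in for remote databases, and the accession "P1". It is not the architecture's actual API.

```python
# Hypothetical mediated schema: every wrapper returns dicts with these keys.
MEDIATED_FIELDS = ("accession", "protein_name", "glycosylation_sites")


class Wrapper:
    """Translates one source's records into the mediated schema."""

    def __init__(self, records, field_map):
        self.records = records      # in-memory stand-in for a remote database
        self.field_map = field_map  # mediated field name -> source field name

    def query(self, **criteria):
        """Return mediated records whose fields equal the given criteria."""
        results = []
        for record in self.records:
            mapped = {f: record.get(self.field_map[f]) for f in MEDIATED_FIELDS}
            if all(mapped.get(k) == v for k, v in criteria.items()):
                results.append(mapped)
        return results


def mediated_query(wrappers, **criteria):
    """Send the same mediated query to every wrapper and merge the answers."""
    merged = []
    for wrapper in wrappers:
        merged.extend(wrapper.query(**criteria))
    return merged


# Two made-up sources describing the same protein with different schemas.
source_a = Wrapper(
    [{"acc": "P1", "name": "example protein", "sites": [12, 40]}],
    {"accession": "acc", "protein_name": "name", "glycosylation_sites": "sites"},
)
source_b = Wrapper(
    [{"id": "P1", "title": "example protein", "glyco_pos": [12]}],
    {"accession": "id", "protein_name": "title", "glycosylation_sites": "glyco_pos"},
)
hits = mediated_query([source_a, source_b], accession="P1")
```

Note that the two sources disagree on the glycosylation sites. The sketch returns both answers unchanged; reconciling them is a separate correlation problem.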
Requirements
Data types diversity: sequences, graphs, 3D structures, etc.
Unconventional queries: similarity, pattern matching, etc.
Uncertainty (probability)
Data curation: cleaning and annotation
Data provenance (pedigree)
Large scale: 100s of DBs
Terminology management (semantics)
etc.
Data Correlation
Non-overlapping Schemas (different instruments or scales of resolution)
Contradictory information (experiments with different assumptions)
Comparing data only after matching their context (constraints)
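The last point, comparing data only after matching their context, can be sketched as a guard in front of the comparison. The `Measurement` type, the choice of context keys (instrument, organism), and the sample values are hypothetical; the slides do not say how context is represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    protein: str
    value: float
    context: tuple  # assumed representation: sorted (key, value) pairs


def make_measurement(protein, value, **context):
    """Build a measurement whose context is order-independent."""
    return Measurement(protein, value, tuple(sorted(context.items())))


def compare(a, b):
    """Return a.value - b.value, or None when the two are not comparable.

    Two measurements are treated as comparable only if they refer to the
    same protein and were taken under exactly the same context.
    """
    if a.protein != b.protein or a.context != b.context:
        return None
    return a.value - b.value


# Made-up values: m1 and m2 share a context, m3 comes from another instrument.
m1 = make_measurement("X", 2.0, instrument="A", organism="yeast")
m2 = make_measurement("X", 1.5, organism="yeast", instrument="A")
m3 = make_measurement("X", 1.5, instrument="B", organism="yeast")
```

Exact equality of contexts is the strictest possible rule. A real system would probably need a notion of compatible contexts, which this sketch does not attempt.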
Other Issues
?
IIC Information Flow
[Flowchart across Step 1, Step 2, and Step 3, starting from the sample. Decision nodes with Y/N branches: Interesting ions? Empty priority list? QA/QC? Other nodes: priority list of interesting ions, peptide identification, protein identification, external databases query.]
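The branch structure of the flowchart is not recoverable from this transcript, so the following is only one plausible reading of the "Interesting ions?" and "Empty priority list?" decisions: candidate ions that pass a score threshold enter a priority list, and the list is drained highest score first. The class, the scoring scale, and the threshold are assumptions.

```python
import heapq


class IonPriorityList:
    """Priority list of interesting ions, highest score first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def add(self, mz, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, mz))
        self._counter += 1

    def is_empty(self):
        return not self._heap

    def pop_next(self):
        neg_score, _, mz = heapq.heappop(self._heap)
        return mz, -neg_score


def run_targeted_scans(candidates, min_score):
    """Return m/z values in the order they would be targeted.

    candidates: iterable of (mz, score) pairs from a survey scan.
    """
    plist = IonPriorityList()
    for mz, score in candidates:
        if score >= min_score:   # "Interesting ions?" decision
            plist.add(mz, score)
    order = []
    while not plist.is_empty():  # "Empty priority list?" decision
        mz, _ = plist.pop_next()
        order.append(mz)         # a real system would trigger MS/MS here
    return order
```

For example, `run_targeted_scans([(500.0, 0.2), (612.4, 0.9), (733.1, 0.6)], 0.5)` skips the first ion and targets 612.4 before 733.1.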
Intelligent Instrument Control
Algorithms design
  Spectra deconvolution
  Online analysis (protein/peptide identification)
  Online peak identification for feedback
  Data filters and noise removal
  Prediction of upcoming peaks
Experimental simulation
  In silico generation of spectra
  Algorithm simulation
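One common form of in silico spectrum generation is computing the theoretical fragment masses of a peptide. The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses (rounded to five decimals). It ignores modifications, higher charge states, and intensities, and it is not claimed to be the simulation the slides refer to; the function name is hypothetical.

```python
# Monoisotopic residue masses in daltons, rounded to five decimals.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # mass of the charge carrier


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] covers the first i+1 residues; y_ions[i] covers the last i+1.
    Raises KeyError for letters that are not one of the 20 standard residues.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, running = [], 0.0
    for m in masses[:-1]:             # N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):    # C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("GAS")
```

A sequence containing a placeholder residue such as X has no defined mass here and raises `KeyError`; how unknown residues should be handled is left open.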
Intelligent Instrument Control
Experimental settings
  Selection of a biological system, e.g., yeast
  Two types of experiments
    Target analysis
    Global analysis
Integration with the instrument
  Data collection
  Control of the instrument
  API
  Actual implementation (algorithms)
Intelligent Instrument Control
Online data mining
Other issues:
  Optimized storage of massive data
  Data representation (streams, database)
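Online analysis over a stream generally calls for statistics that update one value at a time without storing the stream. As a hedged example, the sketch below keeps a running mean and sample variance with Welford's algorithm. Which statistics the project actually needs is not stated in the slides, and the class name and sample values are made up.

```python
class RunningStats:
    """Running mean and sample variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the summary in O(1) time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Made-up intensities arriving one at a time.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

The summary occupies three numbers regardless of stream length, so one instance per granularity (per scan, per run, per experiment) is cheap to maintain.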
Integrated Access to Glycoprotein Databases
Informatics tools
  Glycosylated peptide identification
  Non-glycosylated peptide identification
Enabling uniform access to different glycoprotein databases
  Databases description and organization
  Schema mediation
Integrated Access to Glycoprotein Databases
Query processing
Data correlation
  Non-overlapping schemas
  Contradictory information
  Sequence alignment
Web-service-enabled access
Target databases selection (focus)
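For the sequence alignment item, a minimal sketch of a global alignment score (the Needleman-Wunsch recurrence with a linear gap penalty) is given here. The scoring values are placeholders; real protein alignment would normally use a substitution matrix, and the slides do not say which alignment method is intended.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Uses two rows of the dynamic-programming table, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

With the default scores, two identical sequences of length n score n, and dropping one residue from one of them costs one match and one gap.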