Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System

Xianfeng Jeff Chen, Ph.D., Research Investigator/Project Manager

Page 1

Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System

Xianfeng Jeff Chen, Ph.D.

Research Investigator/Project Manager

Page 2

Agenda Today

(1) Introduction

(2) Database Development

(3) Strategy on Knowledgebase Development

• VBI responsibility in Admin Center

• PRC data types and organisms

• Proteomics data submission and storage workflow

• VBI computing system architecture (CPU and storage)

• VBI database system prototype and functionality

• VBI existing database schema and status

• Example Y2H schema for design logic and case study

• Proposed data integration and knowledgebase construction

Page 3

Introduction

Page 4

Proteomics Data Management

Tasks of Proteomics Data Management

Data stages shown: raw data, processed data

• Data Storage & Visualization Tools (VBI)

• Analysis, Annotation, & Curation (GU)

• Data QA/QC, Interoperability (VBI/GU)

• SOP, LIMS, & Adm DB (SSS)

Page 5

PRCs Major Data Type

Organization                          Major Data Type
University of Michigan                Microarray and mass spectrometry
Caprion                               Mass spectrometry
Harvard Proteomics Institute          Genomics and protein expression array
Albert Einstein College of Medicine   Mass spectrometry
PNNL                                  Mass spectrometry
Scripps                               NMR structural and X-ray crystal diffraction data
Myriad Genetics                       Yeast two-hybrid system

Page 6

PRCs Organisms

Einstein   Toxoplasma gondii, Cryptosporidium parvum
Caprion    Brucella abortus
Harvard    Bacillus anthracis (protein array), Vibrio cholerae
Myriad     Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
PNNL       Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
Scripps    SARS CoV
Michigan   Bacillus anthracis (TXP, MS) + host (human)

Page 7

Proteomics Data Flow

PRCs -> VBI -> Public

Data sources and data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray cryoelectron microscopy, X-ray diffraction, etc.

Processing stages, in slide order:

• QA & QC (Quality Assurance & Quality Control)

• Converting to standard format (a standard format for each data type)

• QA & QC (Quality Assurance & Quality Control)

• Data modeling with decomposition

• Relational database

MIAME and MIAPE-like Standards/SOP for Data Submission
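
The stages on this slide (a QA/QC check, conversion to a standard format, a second QA/QC check, then loading into the database) can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the `Submission` fields, the lower-casing used as a stand-in for format conversion, the required `id` field, and the in-memory dictionary standing in for the relational database are all hypothetical and are not taken from the VBI system.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One data set submitted by a PRC (all fields are illustrative)."""
    prc: str
    data_type: str
    records: list
    log: list = field(default_factory=list)


def check_raw(sub):
    """First QA/QC pass: the submission must contain well-formed records."""
    if not sub.records or not all(isinstance(r, dict) for r in sub.records):
        raise ValueError(f"QA/QC 1 failed for {sub.prc}")
    sub.log.append("QA/QC 1 passed")
    return sub


def to_standard_format(sub):
    """Stand-in for format conversion: normalize field names to lower case."""
    sub.records = [{k.lower(): v for k, v in r.items()} for r in sub.records]
    sub.log.append("converted")
    return sub


def check_standard(sub, required=("id",)):
    """Second QA/QC pass: every standardized record has the required fields."""
    for r in sub.records:
        missing = [k for k in required if k not in r]
        if missing:
            raise ValueError(f"QA/QC 2 failed, missing {missing}")
    sub.log.append("QA/QC 2 passed")
    return sub


def load(sub, database):
    """Stand-in for the relational load: group records by data type."""
    database.setdefault(sub.data_type, []).extend(sub.records)
    sub.log.append("loaded")
    return sub


def run_pipeline(sub, database):
    """Run the stages in the order listed on the slide."""
    for step in (check_raw, to_standard_format, check_standard):
        sub = step(sub)
    return load(sub, database)


database = {}
result = run_pipeline(
    Submission("Myriad", "Y2H", [{"ID": "rec1", "Bait": "protA"}]), database
)
print(result.log)
```

Keeping each stage as a separate function makes it easy to plug in a different converter or validator per data type, which matches the slide's note that each data type has its own standard format.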

Page 8

Database Development

Page 9

VBI Computing System

Diagram labels:

• Binary Software, Project, Proteomics, Genomics, Data Storage

• PC users: Jeff, Wei, Chaitanya, Chengdong, Ranjan, Oswald, Bruno

• Linux, Sun (Solaris)

• Gimli, Elenwe

• 7 PRCs

• Networked File Server

• TUOR Relational Database Server

• Proteomics: Chendong, Jeff, Wei, Ranjan, Chaitanya

• Web Server

• Application Server

Page 10

System Development in Q3 of 2005

Environments: Development, Test/Stage, Production

Components: Web Interface, Database

Page 11

Proteomics Database Project Websites

Production: http://proteinbank.vbi.vt.edu/bprc

Test: http://proteinbankdev.gepasi.org/bprc/

Development: http://txue.bioinformatics.vt.edu:8080/bprc and http://wsun.vbi.vt.edu:8080/bprc/

Page 12

Production Website Instance

Dynamically generated webpage

Functionalities:

(1) Account management

(2) File and doc management

(3) News group and news update

(4) Textual data display

(5) 2D gel image data display

(6) Table and record query

(7) Data uploading and simple submission

(8) HTTP data downloading

(9) SFTP file transfer

Page 13

Database Query

Search By Experiment

• Select experiment

• Retrieve list of bait protein and nucleotide, prey protein and nucleotide

• Links to details of bait and prey; example: Drosophila melanogaster

Search By Organism

• Escherichia coli

• Saccharomyces cerevisiae

• Homo sapiens

• Drosophila melanogaster

• Helicobacter pylori

• Caenorhabditis elegans

Search By Data Type

• Proteomics

• Genomics

• Microarray
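
The "Search By Experiment" query on this slide (select an experiment, then retrieve its bait and prey proteins) might look roughly like the sketch below. The table and column names (`experiment`, `protein`, `y2h_interaction`) and the sample rows are hypothetical, and SQLite stands in for the production database; the actual Y2H schema is not reproduced in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    organism      TEXT NOT NULL
);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
-- One row per observed bait/prey pair (hypothetical layout).
CREATE TABLE y2h_interaction (
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    bait_id       INTEGER NOT NULL REFERENCES protein(protein_id),
    prey_id       INTEGER NOT NULL REFERENCES protein(protein_id)
);
""")

# Placeholder rows, not real data.
conn.execute("INSERT INTO experiment VALUES (1, 'demo screen', 'Drosophila melanogaster')")
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "protA"), (2, "protB"), (3, "protC")])
conn.executemany("INSERT INTO y2h_interaction VALUES (?, ?, ?)",
                 [(1, 1, 2), (1, 1, 3)])


def baits_and_preys(connection, experiment_name):
    """Return (bait, prey) name pairs recorded for one experiment."""
    return connection.execute(
        """
        SELECT b.name, p.name
        FROM y2h_interaction AS i
        JOIN experiment AS e ON e.experiment_id = i.experiment_id
        JOIN protein AS b ON b.protein_id = i.bait_id
        JOIN protein AS p ON p.protein_id = i.prey_id
        WHERE e.name = ?
        ORDER BY b.name, p.name
        """,
        (experiment_name,),
    ).fetchall()


pairs = baits_and_preys(conn, "demo screen")
print(pairs)
```

The experiment name is passed as a bound parameter rather than formatted into the SQL string, which is the safer pattern for a query driven by a web form.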

Page 14

Query for Scripps Sample Data

Search By Project/Experiment

• Scripps MS testing project

• Available peptide hit list

• Retrieve peak information and m/z & intensity list

Page 15

Query for 2D Gel Data

Search By Experiment/Sample

Page 16

Proteomics Database Architecture

Three Phases of Database Design

Process-Oriented Production Design

Data types: 2D Gel, Y2H, MS, NMR, Protein Array, LC, X-Ray Cryoelectron Microscopy, Immunoaffinity Purification, X-Ray Diffraction

• Multiple Schemas of Disparate Data

• Consolidate to One Schema to Remove Redundancy (Normalized with Key-value Pair)

• Stored Procedure for Analysis Pipeline

Physical Layer

Logical Layer: Views (materialized views), Final Views

Application Layer
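
To make the "normalized with key-value pair" idea concrete, here is a minimal sketch of a key-value physical layer with a view in the logical layer that pivots the rows back into columns. Everything in it is an assumption for illustration: the table names, the `pI` and `mw_kda` attributes, and the sample values are hypothetical. SQLite is used so the example is self-contained; it only supports ordinary views, whereas a materialized view as named on the slide would be created in a database such as Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical layer: one generic entity table plus key-value attribute rows.
CREATE TABLE entity (
    entity_id INTEGER PRIMARY KEY,
    data_type TEXT NOT NULL          -- e.g. '2D Gel', 'MS', 'Y2H'
);
CREATE TABLE attribute (
    entity_id INTEGER NOT NULL REFERENCES entity(entity_id),
    name      TEXT NOT NULL,
    value     TEXT,
    PRIMARY KEY (entity_id, name)
);
-- Logical layer: a view that pivots key-value rows into named columns.
CREATE VIEW gel_spot AS
SELECT e.entity_id,
       MAX(CASE WHEN a.name = 'pI' THEN a.value END)     AS pi,
       MAX(CASE WHEN a.name = 'mw_kda' THEN a.value END) AS mw_kda
FROM entity AS e
JOIN attribute AS a ON a.entity_id = e.entity_id
WHERE e.data_type = '2D Gel'
GROUP BY e.entity_id;
""")

# Placeholder values, not real measurements.
conn.execute("INSERT INTO entity VALUES (1, '2D Gel')")
conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                 [(1, "pI", "5.4"), (1, "mw_kda", "42.0")])

rows = conn.execute("SELECT entity_id, pi, mw_kda FROM gel_spot").fetchall()
print(rows)
```

The trade-off is the usual one for key-value designs: new attributes need no schema change, but every user-facing query goes through a pivot, which is why the slide pairs the approach with views.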

Page 17

Proteomics Database Architecture: Three Database Instances

Phase 1 (Version 1, 0.5-1 year)

• Disparate data with multiple schemas

• Individual dataset modeling

Phase 2 (Version 2, 1-1.5 years)

• Consolidation into a few schemas

• A normalized data model implemented as key-value pairs, highly decomposed

Phase 3 (Version 3, 2 years)

• Analysis pipeline procedures

• Logical layer with views for the user

• Physical layer

1. Partially Processed Data

2. Data Enhanced with Knowledge

3. Interface Less Changeable

4. Curated/Annotated Data

Database instances: Development, Test/Stage, Production

Page 18

Status of VBI Database Development

Schema            Development (Maturity)   Test/Stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +

Default Tablespace: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.

Page 19

Generic Experiment Data Components: Example of Database Design Logic

Who (People)

Where (Organization)

Project (Goal)

Materials and Methods (Metadata)

Results (Raw Data)

Conclusion and Hypothesis (Processed and Analyzed Data)

Page 20

Y2H Data Component Modeling

People

Experiment

Project

Sample

Results

Conclusion, Hypothesis

DNA/Protein Detail

Page 21

Experiment Component Object Model

Experiment

Experiment Design

Experiment Factor

Factor Value

Design Description

Ontology Entry

Ontology entries handle two annotation cases: (1) the choices are diverse and existing ontologies can better capture the information; (2) the terms are essentially controlled vocabularies, limited in number of choices, that might grow in the future or vary by technology type.
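
One way to read the object model on this slide is as a set of nested classes in which ontology entries annotate designs and factors. The sketch below is an assumption about how the boxes relate, since the diagram itself is not reproduced here: the class names follow the slide labels, but every field, relationship, and sample value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    """A term drawn from an ontology or a controlled vocabulary."""
    category: str
    value: str


@dataclass
class FactorValue:
    """One level of an experimental factor, annotated by an ontology entry."""
    value: OntologyEntry


@dataclass
class ExperimentFactor:
    name: str
    category: OntologyEntry
    values: List[FactorValue] = field(default_factory=list)


@dataclass
class ExperimentDesign:
    description: str  # corresponds to "Design Description" on the slide
    design_types: List[OntologyEntry] = field(default_factory=list)
    factors: List[ExperimentFactor] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    design: ExperimentDesign


def ontology_terms(experiment):
    """Collect every ontology entry value used to annotate an experiment."""
    terms = [t.value for t in experiment.design.design_types]
    for factor in experiment.design.factors:
        terms.append(factor.category.value)
        terms.extend(fv.value.value for fv in factor.values)
    return terms


# Placeholder annotation terms, not taken from a real ontology.
factor = ExperimentFactor(
    name="bait construct",
    category=OntologyEntry("FactorCategory", "example_category"),
    values=[FactorValue(OntologyEntry("example_category", "construct 1"))],
)
design = ExperimentDesign(
    description="Illustrative two-hybrid screen",
    design_types=[OntologyEntry("DesignType", "example_design_type")],
    factors=[factor],
)
experiment = Experiment("demo experiment", design)
print(ontology_terms(experiment))
```

Routing all annotation through `OntologyEntry` means a vocabulary can grow or vary by technology type without changing the classes, which is the motivation given on the slide.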

Page 22

Y2H Partial Database Schema

Page 23

Proteomics DB System Architecture

Public File Server

Private File Server

Oracle Relational Database

Database connectivity: JDBC, Perl DBI/DBD, ODBC

Batch Processing

(1) Data uploading;

(2) Data validation;

(3) Data analysis;

(4) Data processing

Languages and technologies: JSP, CGI, Java; Perl, Java

Page 24

System Architecture of Putative VBI Proteomics Knowledgebase

Data, Tool, Project, and Team Interoperability

Diagram labels:

• Web Display and Data Visualization

• Application Layer

• Service-Oriented Middleware with Process Control

• Virtual Database/Warehouse

• Temporary data

• Security (shown four times)

Data sets: Array Express, Mass Spectrometry, Two Component System, 2D Gel, Structure Data, Genomics Data

Page 25

Strategy on Data Integration and Construction of Knowledge Warehouse

Page 26

Biological Information Workflow

Information Storage, Queries & DB Management

Cleaning, Processing Algorithms

Curation and Annotation of Data

Knowledge Generation

Biological Research

Target Discovery

Diagnostics, Therapeutics & Vaccines

Data Management; Knowledge Management

Page 27

VBI PDC Project Phases

Bio-IT Scope: Data integration, Knowledge generation, Knowledge management, Knowledge presentation

Phase I (first 2 years)

• Raw data management

• Schema development

• Data visualization

• Data standardization

Phase II (3rd-4th years)

• Integration at interface level

• Integration of data at DB level

• Interoperability of datasets

• Normalization and warehousing

Phase III (5th year)

• Predefined query

• Materialized view

• Comparative analysis

• Statistical analysis

Page 28

Mapping the Proteome

(1) Yeast two-hybrid system: Measures association between two proteins. Allows very high throughput.

(2) Mass spectrometry: Allows identification of proteins within large complexes (2-100 proteins). Lower throughput.

Page 29

Infer Complex Interaction Topology

Diagram labels:

• Y2H analysis: binary interactions

• MS analysis: N-ary interactions

• Proteins, PO4

• Complex interaction model

• Knowledgebase
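
The idea on this slide, combining binary interactions with n-ary complex membership to infer how a complex is wired, can be sketched as follows. This is a simplified illustration and not the method used in the project: it assumes that binary pairs come from Y2H, that each MS result is a set of co-purified proteins, and that a pair inside a complex counts as a direct contact only when it also has binary support. Protein names are placeholders.

```python
from itertools import combinations


def build_network(binary_pairs, complexes):
    """Map each protein pair to the set of evidence types supporting it."""
    edges = {}
    for a, b in binary_pairs:
        edges.setdefault(frozenset((a, b)), set()).add("Y2H")
    for members in complexes:
        # A complex only says its members co-occur, so add every pair as MS evidence.
        for a, b in combinations(sorted(members), 2):
            edges.setdefault(frozenset((a, b)), set()).add("MS")
    return edges


def direct_contacts(edges, members):
    """Pairs within one complex that also have binary (Y2H) support."""
    member_set = set(members)
    return sorted(
        tuple(sorted(pair))
        for pair, evidence in edges.items()
        if pair <= member_set and "Y2H" in evidence
    )


binary_pairs = [("P1", "P2"), ("P2", "P3")]
complexes = [{"P1", "P2", "P3"}]

edges = build_network(binary_pairs, complexes)
topology = direct_contacts(edges, complexes[0])
print(topology)
```

In this toy case the complex contains three proteins but only two pairs have binary support, which suggests a chain with P2 in the middle rather than a fully connected triangle.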

Page 30

Bacillus anthracis

Data Organization

(1) Completed genome (Ames, Ames Ancestor, A2012): NCBI, TIGR

(2) Yeast two-hybrid interaction data: Myriad Genetics

(3) Mass spectrometry: Scripps and Caprion

(4) Microarray expression profiling: Univ. of Michigan

(5) Interspecies and intraspecies clustering: NCBI (COG) and TIGR

(6) Functional category assignment: GU (PIR)

Page 31

Strategy for Knowledgebase Construction

(1) Annotation improvement

    (1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.

    (2) Comparative genomics with two reference genomes: E. coli and yeast

(2) Identifying anchor points for data integration

    (1) Known metabolic pathways (E. coli and yeast);

    (2) Known signal transduction pathways;

    (3) Known gene regulation machinery;

    (4) Known protein-protein interaction map.
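
Of the non-homology-based methods named on this slide, phylogenetic profiling is the simplest to illustrate: genes that are present or absent in the same set of genomes are candidates for a functional link. The sketch below is a minimal version under stated assumptions. The Jaccard score and the threshold are one possible choice among several, and the gene names and profiles are made-up placeholders.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def linked_genes(profiles, threshold=0.8):
    """Return (gene1, gene2, score) for pairs with similar profiles."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            score = jaccard(profiles[g1], profiles[g2])
            if score >= threshold:
                pairs.append((g1, g2, score))
    return pairs


# Each position is one reference genome: 1 = homolog present, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],
    "geneC": [0, 0, 1, 0, 1],
    "geneD": [1, 1, 0, 0, 0],
}
candidates = linked_genes(profiles, threshold=0.6)
for g1, g2, score in candidates:
    print(g1, g2, round(score, 2))
```

With only a handful of genomes the profiles carry little information, so in practice the approach depends on having many reference genomes in each profile.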

Page 32

Data Integration

Genomics Data

Improved annotation

Comparative Genomics

Anchor on knowledge network of reference genomes (E. coli and yeast)

Lay down Y2H interaction data and expand the network

Lay down MS multiple interaction data to expand the network

Lay down microarray data to add co-expression pattern to gene network

Putative Knowledgebase: http://www.Bacillus_anthracis.org

No thing

Page 33

Data Mining and Knowledge Augmentation

Key: Multi-Protein Complex; Curated; In-House Y2H; Both Curated + Y2H

Literature, Y2H analysis, MS analysis, Microarray

Page 34

Acknowledgement

Name                  Role                              Organization
Dr. Jeff Chen         Project Manager/Investigator      VBI
Dr. Chendong Zhang    Senior Software Engineer          VBI
Dr. Steve Cammer      Bioinformatics Scientist          VBI
Dr. Oswald Crasta     Scientist and CI-Co-director      VBI
Susan Baker           DBA                               VBI
Jiang Lu              DBA                               VBI
Ranjan Jha            Software Engineer                 VBI
Qiang Yu              Software Engineer                 VBI
Jian Li               Software Engineer                 VBI
Wei Sun               Software Engineer                 VBI
Chaitanya Kommidi     Software Engineer                 VBI
Dr. Bruno Sobral      Co-PI                             VBI
Dr. Peter MacGarvey   Senior Bioinformatics Scientist   GU
Dr. Cathy Wu          Co-PI                             GU
Paula Yadvish         Web Coordinator                   SSS
Margaret Moore        PI                                SSS