mzTab: Proposal for a Simple Data Format for Proteomics Results
Johannes Griss, PSI Meeting, Heidelberg, April 2011. EBI is an Outstation of the European Molecular Biology Laboratory.

Current Situation

• The necessity of standard data formats has become generally accepted

• Proteomics techniques are constantly evolving
  • Proposed standard formats had to become very complex to adequately capture proteomics data
    • mzIdentML for identification data
    • mzQuantML for quantitative data

• Effective use of these data formats requires sophisticated bioinformatics knowledge

• Many researchers still rely on MS Excel to “look” at their data

Communication of Proteomics Results

• Proteomics resources require a mechanism to exchange basic proteomics results simply and efficiently

• Collaboration with colleagues from other scientific fields is increasingly important
  • Necessity to share proteomics results with researchers outside of proteomics

• Need to make proteomics data easily accessible

Potential Current Problems

• Currently proposed standard formats are difficult to use without the Java APIs

• “Complete” standard formats are too complex and too big to quickly share the essential results

• Quick scripts (e.g. in Perl) for specific research questions are not easily possible
  • A large amount of potential innovation could be lost

• Reading files requires special software
  • Further processing of the data (e.g. with statistical tools) is not easily possible
  • No standard tools to read / write mz*ML files are available
  • Custom-built software is required for many use cases otherwise fulfilled by “Excel & friends”

mzTab – Aim

• To provide a simple and efficient way of exchanging proteomics data
  • Which protein / peptide was identified in a given experimental setting

• Easy to update and maintain
• Easy to use by the proteomics community, systems biologists, and providers of knowledge bases

mzTab – Target Audience

• Proteomics repositories (e.g. PRIDE, PeptideAtlas)
• Knowledge base resources (e.g. UniProt, HPRD)
• Researchers outside of proteomics
• Researchers analyzing proteomics data with limited bioinformatics knowledge / support

mzTab – Proposed Concept

• A tab-delimited file format
• Goals
  • Content should be “readable” using MS Excel
  • Should contain the minimal information needed for proteomics repositories / knowledge bases to exchange data
  • Data should be easily accessible using e.g. scripting languages
  • One file should be able to contain multiple experiments / proteins from different resources
    • Aim: to represent the result of a query to e.g. PRIDE using this format
  • Provide a simplistic summary of proteomics results

• Every entry contains a reference to the source data (in mzIdentML / mzQuantML format)

mzTab – Proposed Concept (continued)

• What the format does NOT aim at:
  • Replace mzIdentML or mzQuantML
  • Contain the complete data of a proteomics experiment
  • Provide detailed evidence for the data
  • Allow a researcher to recreate the process which led to the results
  • Conform to reporting requirements (MIAPE, journal guidelines, etc.)
  • In short: be complete in any way

mzTab – Possible Format Specification

• Three sections
  • (Optional) Metadata section
  • (Required) Protein section
  • (Optional) Peptide section

• Can report proteomics data at different levels
  • Single experiments
  • Multiple (possibly linked) experiments
  • Data generated as the result of a query (possibly to multiple resources)
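
As a rough illustration of how a script might consume such a file, the sketch below groups lines by section. It assumes that each section opens with a `----<name>` line and closes with `----END`, as in the examples on the slides; the function name `split_sections` and the sample text are hypothetical, and the final specification may use different delimiters.

```python
from typing import Dict, List, Optional


def split_sections(text: str) -> Dict[str, List[str]]:
    """Group the lines of a draft mzTab document by section name.

    Assumption: a section starts with '----<name>' and ends with '----END'.
    Lines outside any section and blank lines are ignored.
    """
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "----END":
            current = None                    # close the open section
        elif stripped.startswith("----"):
            current = stripped[4:]            # e.g. "metadata", "proteins"
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(line)
    return sections


# Hypothetical sample, loosely modelled on the slide examples.
EXAMPLE = (
    "----metadata\n"
    "PRIDE_16649-species: [NEWT, 10090, Mouse,]\n"
    "----END\n"
    "----proteins\n"
    "accession\treliability\n"
    "P12345\t4\n"
    "----END\n"
)

sections = split_sections(EXAMPLE)
print(sorted(sections))
```

A check for the required protein section could then be as simple as testing `"proteins" in sections`.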

mzTab – Metadata Section

----metadata
PRIDE_16649-title: The Synaptic Proteome during Development and Plasticity of the Mouse Visual Cortex
PRIDE_16649-species: [NEWT, 10090, Mouse,]
PRIDE_16649-tissue: [EFO, EFO:0000916, visual cortex,]
PRIDE_16649-instrument[1]-type: [MS, MS:1000287, TOF-MS,]
PRIDE_16649-search_engine: [MS, MS:1001207, Mascot, ]
PRIDE_16649-contact[1]-name: August B Smit
PRIDE_16649-contact[1]-email: [email protected]
PRIDE_16649-url: http://www.ebi.ac.uk/pride/q.do?accession=16649
----END
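
The metadata lines above appear to follow a `<unit>-<field>: <value>` pattern, where a value is either free text or a bracketed four-part parameter (`[CV, accession, name, value]`). The following sketch parses lines under that assumption. The names `parse_value` and `parse_metadata` are hypothetical, and the regular expression assumes that unit identifiers contain no hyphen, which holds for the slide example but is not stated in the proposal.

```python
import re
from typing import Dict, Iterable, Tuple, Union

Value = Union[str, Tuple[str, ...]]

# Assumed layout: "<unit>-<field>: <value>", with no hyphen inside <unit>.
_LINE = re.compile(r"^(?P<unit>[^-\s]+)-(?P<field>[^:]+):\s*(?P<value>.*)$")


def parse_value(raw: str) -> Value:
    """Return a 4-tuple for '[cv, accession, name, value]', else the text."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        parts = tuple(p.strip() for p in raw[1:-1].split(","))
        if len(parts) == 4:
            return parts
    return raw


def parse_metadata(lines: Iterable[str]) -> Dict[str, Dict[str, Value]]:
    """Collect metadata fields per unit (e.g. per PRIDE experiment)."""
    meta: Dict[str, Dict[str, Value]] = {}
    for line in lines:
        match = _LINE.match(line.strip())
        if match is None:
            continue  # skip section markers and anything unrecognised
        unit = match.group("unit")
        field = match.group("field").strip()
        meta.setdefault(unit, {})[field] = parse_value(match.group("value"))
    return meta


# Lines copied from the slide example.
meta = parse_metadata([
    "----metadata",
    "PRIDE_16649-species: [NEWT, 10090, Mouse,]",
    "PRIDE_16649-search_engine: [MS, MS:1001207, Mascot, ]",
    "PRIDE_16649-url: http://www.ebi.ac.uk/pride/q.do?accession=16649",
    "----END",
])
print(meta["PRIDE_16649"]["species"])
```

Keeping the unit identifier as the outer key means that one file holding several experiments can be parsed without any extra bookkeeping.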

mzTab – Protein Section

----proteins
Accession   …   reliability   peptides   …   ambiguity_members
P12345          4             2              P12346,P123457…
----END

• A table holding the basic identification information
• Suggestions on how to include
  • quantitative data
  • multiple search engine scores
  • ambiguous modification positions

mzTab – Peptide Table

----peptides
sequence   accession   unit         unique   …   reliability   …
DIIL       O00160      PRIDE_3381   false        5             …
VESVDL     O00160      PRIDE_3381   true         4             …
----END

• A table holding the basic peptide information

• Suggestions on how to include
  • quantitative data
  • multiple search engine scores
  • ambiguous modification positions
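
Because the peptide table is plain tab-delimited text, a few lines of script are enough to work with it. The sketch below is only an illustration: it uses the two rows from the slide, keeps only the columns that are actually shown (the columns elided with "…" are left out), and assumes that `unique` is written as `true` / `false` and `reliability` as an integer. The helper name `read_peptides` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Rows from the slide, restricted to the columns that are shown there.
PEPTIDE_ROWS = (
    "sequence\taccession\tunit\tunique\treliability\n"
    "DIIL\tO00160\tPRIDE_3381\tfalse\t5\n"
    "VESVDL\tO00160\tPRIDE_3381\ttrue\t4\n"
)


def read_peptides(text: str) -> List[dict]:
    """Read a tab-delimited peptide table and convert the typed columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    peptides = []
    for row in reader:
        row["unique"] = row["unique"].strip().lower() == "true"
        row["reliability"] = int(row["reliability"])
        peptides.append(row)
    return peptides


peptides = read_peptides(PEPTIDE_ROWS)

# Example question: which unique peptides support each protein accession?
unique_by_protein: Dict[str, List[str]] = defaultdict(list)
for peptide in peptides:
    if peptide["unique"]:
        unique_by_protein[peptide["accession"]].append(peptide["sequence"])

print(dict(unique_by_protein))
```

The same table can also be opened directly in a spreadsheet application, which is the kind of low-barrier access the proposal aims for.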