Big Data & the CPTAC Data Portal
Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum
Georgetown University & ESAC
Nathan Edwards, Georgetown University Medical Center

NCI: CPTAC
Clinical Proteomic Tumor Analysis Consortium (CPTAC)
  Comprehensive study of genomically characterized (TCGA) cancer biospecimens by bottom-up mass-spectrometry-based proteomics workflows
  Follows Clinical Proteomics Technology Assessment Consortium (CPTAC Phase I)

CPTAC Data Portal
All data is publicly released...
  ...subject to responsible use guidelines
  Consortium has 15 months to publish first global analysis
  Data available in the meantime.
http://grg.tn/cptac

Proteomics Workflows
Modern Instrumentation:
  Orbitrap, Q-Exactive, AB 5600
Protein Enrichment:
  Phosphoproteins, Glycoproteins
Quantitation:
  Label-free, precursor area or spectral count; or iTRAQ
Peptide Fractionation:
  Deep sampling of less abundant peptides

Available Data
Mass Spectrometry Data
  Raw and mzML formats
Experimental Design Meta-Data
  Link to TCGA, clinical context
Analytical Protocol Documents
  Sample prep, chromatography, MS
Peptide-Spectrum-Match Data
  CPTAC Common analysis pipeline (NIST)
  MS-GF+ based, TSV and mzIdentML formats
  Gene inference and quantitation
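As an illustration of working with the TSV peptide-spectrum-match files, the following is a minimal sketch of filtering PSMs by q-value. The column names ("SpecID", "Peptide", "QValue") and the example rows are assumptions made up for this sketch, not the documented layout of the CPTAC files; check the header of an actual file before using it.

```python
import csv
import io

def filter_psms(tsv_text, qvalue_column="QValue", max_qvalue=0.01):
    """Return the rows whose q-value is at or below max_qvalue.

    The column name "QValue" is an assumption for illustration;
    check the header line of the actual TSV file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[qvalue_column]) <= max_qvalue]

# Tiny made-up example, not real CPTAC data.
example = (
    "SpecID\tPeptide\tQValue\n"
    "scan=1\tPEPTIDEK\t0.001\n"
    "scan=2\tSAMPLER\t0.05\n"
    "scan=3\tANOTHERK\t0.01\n"
)
kept = filter_psms(example)
print([row["SpecID"] for row in kept])
```

For files too large to hold in memory, the same filter could be applied row by row to an open file handle instead of a string.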

CPTAC/TCGA Colorectal Cancer (Proteome)
Vanderbilt PCC (PI: Liebler), Embargo: 12/2014
95 TCGA samples, 15 fractions / sample
Label-free spectral count / precursor XIC quant.
Orbitrap Velos; high-accuracy precursor
1425 spectra files, ~ 600 GB / ~ 129 GB (mzML.gz)
Spectra: ~ 18M; ~ 13M MS/MS
4,644,354 PSMs at 1% MS-GF+ q-value
10,258 genes at 0.01% gene FDR, 9047 groups

CPTAC/TCGA Breast Cancer (Proteome)
Broad PCC (PI: Carr), Embargo: 5/2015
108 TCGA samples, 25 fractions / sample-mixture
Proteome; iTRAQ quantitation; 3 samples vs POOL
Q-Exactive; high-accuracy precursor
900 spectra files, ~ 1 TB / ~ 280 GB (mzML.gz)
Spectra: ~ 41M; ~ 32M MS/MS
13,764,193 PSMs at 1% MS-GF+ q-value
13,716 genes at 0.01% gene FDR, 10,007 groups

CPTAC/TCGA Breast Cancer (Phosphoproteome)
Broad PCC (PI: Carr), Embargo: 5/2015
108 TCGA samples, 13 fractions / sample-mixture
IMAC enriched; iTRAQ quant.; 3 samples vs POOL
Q-Exactive; high-accuracy precursor
468 spectra files, ~ 600 GB / ~ 130 GB (mzML.gz)
Spectra: ~ 16M; ~ 10M MS/MS
3,355,721 PSMs at 1% MS-GF+ q-value
10,352 genes at 0.01% gene FDR, 8875 groups
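To put the three dataset summaries side by side, here is a rough back-of-the-envelope sketch using only the approximate figures quoted on the slides. Treating the sizes as gigabytes, reading "1 Tb" as 1000 GB, and interpreting PSMs divided by MS/MS spectra as a rough identification rate are all assumptions of this sketch, not statements from the slides.

```python
# Approximate figures copied from the three dataset slides.
datasets = {
    "Colorectal (Proteome)": {
        "listed_gb": 600, "mzml_gz_gb": 129,
        "msms": 13e6, "psms": 4_644_354, "genes": 10_258,
    },
    "Breast (Proteome)": {
        "listed_gb": 1000, "mzml_gz_gb": 280,  # "1 Tb" read as 1000 GB (assumption)
        "msms": 32e6, "psms": 13_764_193, "genes": 13_716,
    },
    "Breast (Phosphoproteome)": {
        "listed_gb": 600, "mzml_gz_gb": 130,
        "msms": 10e6, "psms": 3_355_721, "genes": 10_352,
    },
}

def summarize(d):
    """Derive a few ratios from one dataset's slide figures."""
    return {
        "size_ratio": d["listed_gb"] / d["mzml_gz_gb"],
        "psms_per_msms": d["psms"] / d["msms"],
        "psms_per_gene": d["psms"] / d["genes"],
    }

summaries = {name: summarize(d) for name, d in datasets.items()}
for name, s in summaries.items():
    print(f"{name}: size ratio {s['size_ratio']:.1f}x, "
          f"PSMs/MS2 {s['psms_per_msms']:.2f}, "
          f"PSMs/gene {s['psms_per_gene']:.0f}")
```

On these figures, each dataset has hundreds to about a thousand confident PSMs per inferred gene, and the gzipped mzML files are several times smaller than the first size listed.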

CPTAC Data Center Lessons
Files on disk are "easy"
  Meta-data, experimental design, semantics HARD
  File naming conventions seem trivial but do it
  Backup, access, redundancy is IT and costs $$
Advanced network transfer tools really work!
  Aspera provides order of magnitude improvement
  Scriptable upload/download/navigation matters!
(Spectra) file integrity is really important
  Platform agnostic chain of custody from lab
  mzML conversion verifies RAW file semantics
  mzML embeds checksums, platform agnostic
  mzML semantic compression (peaks only)
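The file-integrity point can be illustrated with a generic checksum check. This is a minimal sketch, assuming a whole-file SHA-1 digest is recorded when a file leaves the lab and compared again after transfer. It does not parse the checksum embedded inside mzML; the function names and the choice of SHA-1 are assumptions for illustration.

```python
import hashlib
import os
import tempfile

def file_sha1(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """Return True if the file's digest matches the recorded value."""
    return file_sha1(path) == expected_hex.lower()

# Demonstration on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example spectra bytes")
    demo_path = tmp.name

recorded = file_sha1(demo_path)           # digest recorded at the source
ok_before = verify(demo_path, recorded)   # unchanged file matches

with open(demo_path, "ab") as handle:     # simulate corruption in transit
    handle.write(b"!")
ok_after = verify(demo_path, recorded)    # modified file no longer matches

os.remove(demo_path)
print(ok_before, ok_after)
```

Reading in chunks keeps memory use flat for multi-gigabyte spectra files, and the digest is the same on any platform, which is what a platform agnostic chain of custody needs.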

CPTAC TCGA Data Lessons
Monolithic computation no longer sufficient!
  Many datafiles, distributed computation, out-of-core
PSMs are the new RAW data? (~ NGS reads)
  Many PSMs / gene; # Spectra >> # Sequences!
"Poor" acquisitions are not uncommon
  Need fast, easy QC to permit re-analysis
Other issues:
  Is identifiability information leaking (germline mutations)?
  Protein inference for human/mouse xenograft spectra?
  How to really handle isoforms?
  Proteome coverage: how to estimate?
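One way to picture "fast, easy QC" for poor acquisitions is a per-file identification-rate check. The sketch below is a hypothetical heuristic, not the consortium's QC procedure: it flags any file whose PSMs per MS/MS spectrum fall below half the median across files. The threshold, file names, and counts are all made up for illustration.

```python
from statistics import median

def flag_poor_acquisitions(counts, fraction_of_median=0.5):
    """Flag files with an unusually low identification rate.

    counts maps file name -> (identified PSMs, MS/MS spectra).
    A file is flagged when its rate is below fraction_of_median
    times the median rate; this cutoff is an arbitrary assumption.
    """
    rates = {
        name: (psms / msms if msms else 0.0)
        for name, (psms, msms) in counts.items()
    }
    cutoff = fraction_of_median * median(rates.values())
    return sorted(name for name, rate in rates.items() if rate < cutoff)

# Made-up counts for four fractions of one sample.
example_counts = {
    "sample01_f01": (3500, 10000),
    "sample01_f02": (3300, 10000),
    "sample01_f03": (400, 10000),
    "sample01_f04": (3600, 10000),
}
poor = flag_poor_acquisitions(example_counts)
print(poor)
```

A check this cheap can run as soon as PSM counts are available, so that a flagged file can be re-acquired or re-analyzed before downstream gene inference.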

Heresy: PSMs as NGS reads
Need O(n) spectra → good PSMs
  We work too hard to identify all spectra, too stringent?
  Progressive, pareto, PTAS identification?
  Output as genome alignments, BAM files?
Volume dominates noise and loss of detail:
  e.g. Twitter; indirect observation of splicing, PTMs?
Models of distributed computation
  Distributed data and/or computation
  Failure, interruption tolerant computing
  Heterogeneous computing resources
  PSM search engine API for mining (social, reward?)